Advanced Options
Tokenize
By default, Lexos splits strings of text into tokens every time it encounters a space character. For Western languages, this means that each token generally corresponds to a word. Click the by Characters radio button to treat every character as a separate token. If you wish to use n-grams (by Tokens), increase the 1-gram incrementer to 2, 3, 4, etc. For example, "the dog ran" would produce the 1-gram tokens the, dog, ran; the 2-grams the dog, dog ran; and so on. Alternately, 2-grams (by Characters) would create the tokens th, he, e  (the letter e followed by a space), and so on.
Note that increasing the n-gram size may produce a larger DTM, and the table will thus take longer to build.
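The two tokenization modes described above can be sketched in a few lines of Python. This is a minimal illustration of the behavior, not Lexos's actual implementation; the function names are our own.

```python
def word_ngrams(text, n=1):
    # Lexos's default "by Tokens" behavior: split on space characters.
    tokens = text.split(" ")
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(text, n=1):
    # "by Characters": every character, including spaces, is a unit.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(word_ngrams("the dog ran", 1))  # ['the', 'dog', 'ran']
print(word_ngrams("the dog ran", 2))  # ['the dog', 'dog ran']
print(char_ngrams("the dog ran", 2))  # ['th', 'he', 'e ', ' d', ...]
```

Note that a text of length L yields L - n + 1 character n-grams, so larger n values only modestly reduce the token count while greatly increasing the number of distinct terms.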
Culling Options
"Culling Options" is a generic term we use for methods of decreasing the number of terms used to generate the DTM based on statistical criteria (as opposed to something like applying a stopword list in Scrubber). Lexos offers three different methods:
1. Most Frequent Words: This method takes a slice of the DTM containing only the top N most frequently occurring terms. The default setting is 100, meaning that only the top 100 most frequently occurring terms are used.
2. Culling: This method builds the DTM using only terms that occur in at least N documents. The default setting is 1. To cull the DTM so that it includes only terms that appear at least once in every active document, set the value to the total number of active documents. (Note: you can quickly determine the number of active documents by hovering over the folder icon in the upper right-hand corner. For example, to use only those terms that appear at least once in each of your ten active documents, set the option to Must be in 10 documents.)
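The two culling methods above can be sketched as operations on per-document term counts. This is a simplified illustration under our own assumptions (documents represented as term-count dictionaries), not Lexos's internal code.

```python
from collections import Counter

def most_frequent_words(doc_counts, top_n=100):
    # Keep only the top-N terms by total frequency across all documents.
    totals = Counter()
    for counts in doc_counts:
        totals.update(counts)
    keep = {term for term, _ in totals.most_common(top_n)}
    return [{t: c for t, c in counts.items() if t in keep} for counts in doc_counts]

def cull(doc_counts, min_docs=1):
    # Keep only terms that occur in at least min_docs documents.
    doc_freq = Counter()
    for counts in doc_counts:
        doc_freq.update(set(counts))  # count each term once per document
    keep = {t for t, df in doc_freq.items() if df >= min_docs}
    return [{t: c for t, c in counts.items() if t in keep} for counts in doc_counts]

docs = [Counter("the dog ran".split()), Counter("the cat sat".split())]
print(cull(docs, min_docs=2))  # only "the" appears in both documents
```

Setting min_docs to the number of active documents, as described above, keeps only terms shared by every document, which is useful for comparisons across a whole corpus.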