Advanced Options
Tokenize
By default, Lexos splits strings of text into tokens at every space character. For Western languages, this means that each token generally corresponds to a word. Click the by Characters radio button to treat every character as a separate token. If you wish to use n-grams, increase the 1-gram incrementer to 2, 3, 4, etc. For example, "the dog ran" would produce the 1-gram tokens the, dog, ran; the 2-grams the dog, dog ran; and so on. 2-grams tokenized by characters would begin th, he, e , and so on. Note that increasing the n-gram size may produce a larger DTM, and the table will thus take longer to build.
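The tokenization behavior described above can be sketched as a short Python function. This is an illustrative sketch, not the actual Lexos implementation; the function name and parameters are assumptions chosen to mirror the options described.

```python
def ngrams(text, n=1, by_characters=False):
    """Split text into n-gram tokens.

    By default, units are whitespace-separated words; with
    by_characters=True, every character (including spaces) is a unit,
    matching the "by Characters" option described above.
    """
    units = list(text) if by_characters else text.split()
    sep = "" if by_characters else " "
    # Slide a window of size n over the units to build the n-grams.
    return [sep.join(units[i:i + n]) for i in range(len(units) - n + 1)]

# "the dog ran" as word 2-grams: ["the dog", "dog ran"]
# "the dog ran" as character 2-grams begins: ["th", "he", "e ", ...]
```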
Culling Options
"Culling Options" is a generic term we use for methods of decreasing the number of terms used to generate the DTM based on statistical criteria (as opposed to something like applying a stopword list in Scrubber). Lexos offer three different methods:1. Most Frequent Words: This method takes a slice of the DTM containing only the top N most frequently occurring terms. The default setting is 100.
2. Culling: This method builds the DTM using only terms that occur in at least N documents. The default setting is 1.
3. Greywords: This method removes from the DTM terms that occur at particularly low frequencies. Lexos calculates the cut-off point based on the average length of your documents.
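The three culling methods above can be sketched on a toy DTM, represented here as a dictionary mapping each term to its per-document counts. This is a hedged illustration, not the Lexos source: the function names are assumptions, and Greywords is shown with a fixed cutoff rather than the document-length-based cutoff Lexos actually computes.

```python
def most_frequent_words(dtm, top_n=100):
    """Keep only the top_n terms ranked by total frequency (method 1)."""
    ranked = sorted(dtm, key=lambda term: sum(dtm[term]), reverse=True)
    return {term: dtm[term] for term in ranked[:top_n]}

def cull(dtm, min_docs=1):
    """Keep only terms occurring in at least min_docs documents (method 2)."""
    return {term: counts for term, counts in dtm.items()
            if sum(count > 0 for count in counts) >= min_docs}

def greywords(dtm, cutoff):
    """Remove terms whose total frequency is at or below cutoff (method 3).

    Lexos derives the cutoff from average document length; a fixed
    cutoff parameter is used here purely for illustration.
    """
    return {term: counts for term, counts in dtm.items()
            if sum(counts) > cutoff}

# Toy DTM: two documents, three terms.
dtm = {"the": [3, 2], "dog": [1, 0], "ran": [1, 1]}
```

For example, `cull(dtm, min_docs=2)` would drop "dog", since it appears in only one of the two documents.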