Advanced Options
Tokenize
By default, Lexos splits strings of text into tokens at every space character. For Western languages, this means that each token generally corresponds to a word. Click the by Characters radio button to treat every character as a separate token. If you wish to use n-grams, increase the 1-gram incrementer to 2, 3, 4, etc. For example, "the dog ran" would produce the 1-gram tokens the, dog, ran; the 2-grams the dog, dog ran; and so on. 2-grams tokenized by characters would begin th, he, e , and so on. Note that increasing the n-gram size may produce a larger DTM, and the table will thus take longer to build.
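The tokenization behavior described above can be sketched as a short Python function. This is an illustrative sketch, not the actual Lexos implementation; the function name and parameters are assumptions chosen to mirror the options described.

```python
def ngrams(text, n=1, by_characters=False):
    """Split text into n-gram tokens.

    By default, units are whitespace-separated words; with
    by_characters=True, every character (including spaces) is a unit,
    matching the "by Characters" option described above.
    """
    units = list(text) if by_characters else text.split()
    sep = "" if by_characters else " "
    # Slide a window of size n over the units to build the n-grams.
    return [sep.join(units[i:i + n]) for i in range(len(units) - n + 1)]

# "the dog ran" as word 2-grams: ["the dog", "dog ran"]
# "the dog ran" as character 2-grams begins: ["th", "he", "e ", ...]
```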
Culling Options
"Culling Options" is a generic term we use for methods of decreasing the number of terms used to generate the DTM based on statistical criteria (as opposed to something like applying a stopword list in Scrubber). Lexos offer three different methods:1. Most Frequent Words: This method takes a slice of the DTM containing only the top N most frequently occurring terms. The default setting is 100.
2. Culling: This method builds the DTM using only terms that occur in at least N documents. The default setting is 1.
3. Greywords: This method removes from the DTM terms that occur at particularly low frequencies. Lexos calculates the cut-off point based on the average length of your documents.
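The three culling methods above can be sketched on a toy DTM, represented here as a dictionary mapping each term to its per-document counts. This is a hedged illustration, not the Lexos source: the function names are assumptions, and Greywords is shown with a fixed cutoff rather than the document-length-based cutoff Lexos actually computes.

```python
def most_frequent_words(dtm, top_n=100):
    """Keep only the top_n terms ranked by total frequency (method 1)."""
    ranked = sorted(dtm, key=lambda term: sum(dtm[term]), reverse=True)
    return {term: dtm[term] for term in ranked[:top_n]}

def cull(dtm, min_docs=1):
    """Keep only terms occurring in at least min_docs documents (method 2)."""
    return {term: counts for term, counts in dtm.items()
            if sum(count > 0 for count in counts) >= min_docs}

def greywords(dtm, cutoff):
    """Remove terms whose total frequency is at or below cutoff (method 3).

    Lexos derives the cutoff from average document length; a fixed
    cutoff parameter is used here purely for illustration.
    """
    return {term: counts for term, counts in dtm.items()
            if sum(counts) > cutoff}

# Toy DTM: two documents, three terms.
dtm = {"the": [3, 2], "dog": [1, 0], "ran": [1, 1]}
```

For example, `cull(dtm, min_docs=2)` would drop "dog", since it appears in only one of the two documents.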