Advanced Options
Tokenize
By default, Lexos splits strings of text into tokens every time it encounters a space character. For Western languages, this means that each token generally corresponds to a word. Alternatively, select the tokenize by Characters radio button to treat every character (or group of consecutive characters) as a separate token.
If you wish to use n-grams (by Tokens), increase the N-gram incrementer from 1-gram to 2-, 3-, 4-gram, and so on. For example, the text "the dog ran" produces the 1-gram tokens the, dog, and ran, whereas counting 2-grams (n = 2) counts the instances of bi-grams, or pairs of words: the dog, dog ran, and so on. Alternatively, tokenizing as 2-grams (by Characters) creates tokens of two characters each: th, he, e (with its trailing space), and so on.
Note: Counting by character n-grams is needed for tokenizing non-Western languages that do not place whitespace between tokens (e.g., words). For example, when working on the classical Chinese text Dream of the Red Chamber, we sometimes count by 2-gram Character tokens, thus counting all instances of two Chinese characters that appear together; a text such as "连忙赶至寺" would produce the tokens 连忙, 忙赶, 赶至, and so on. Even with Western languages, we encourage experimentation with tokenizing by character n-grams!
Note that increasing the n-gram size may produce a larger DTM, and the table will thus take longer to build.
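The sketch below illustrates the mechanics in a few lines of Python. It is an illustration only, not the Lexos source code, and it reproduces the word and character n-gram examples above.

# A minimal sketch (not the Lexos source code) of tokenizing by Tokens and by Characters.

def ngrams(units, n):
    """Return all n-grams over a sequence of units (word tokens or characters)."""
    return [units[i:i + n] for i in range(len(units) - n + 1)]

text = "the dog ran"

# By Tokens: split on whitespace, then join each n-gram back together with a space.
tokens = text.split()
print([" ".join(g) for g in ngrams(tokens, 1)])    # ['the', 'dog', 'ran']
print([" ".join(g) for g in ngrams(tokens, 2)])    # ['the dog', 'dog ran']

# By Characters: every character, including spaces, is a unit.
print(["".join(g) for g in ngrams(list(text), 2)]) # ['th', 'he', 'e ', ' d', 'do', ...]

# The same character 2-grams handle text with no whitespace between words.
print(["".join(g) for g in ngrams(list("连忙赶至寺"), 2)])  # ['连忙', '忙赶', '赶至', '至寺']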
Culling Options
"Culling Options" is a generic term we use for methods of decreasing the number of terms used to generate the DTM based on statistical criteria (as opposed to something like applying a stopword list in Scrubber). Lexos offers three different methods:
1. Most Frequent Words: This method takes a slice of the DTM containing only the top N most frequently occurring terms in the present set of active documents. The default setting is 100, meaning "use" only the top 100 most frequently occurring terms.
2. Culling: This method builds the DTM using only terms that occur in at least N documents. The default setting is 1. To cull the DTM so that it includes only terms that appear at least once in every active document, set the value to the total number of active documents. (Note: you can quickly determine the number of active documents by hovering over the folder icon in the upper right-hand corner; for example, to use only those terms that appear at least once in your ten active documents, you would set the option to Must be in 10 documents.)
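Both culling methods amount to simple column selections on the document-term matrix. The following Python sketch uses a toy DTM stored in a pandas DataFrame; it illustrates the idea and is not the Lexos implementation.

# A minimal sketch (not the Lexos implementation) of both culling methods on a toy
# document-term matrix: rows are documents, columns are terms, values are raw counts.
import pandas as pd

dtm = pd.DataFrame(
    {"the": [4, 2, 0], "dog": [1, 0, 3], "ran": [1, 1, 1]},
    index=["doc1", "doc2", "doc3"],
)

# 1. Most Frequent Words: keep only the top-N terms by total count across all documents.
top_n = 2
most_frequent = dtm[dtm.sum(axis=0).nlargest(top_n).index]
print(most_frequent.columns.tolist())  # ['the', 'dog']

# 2. Culling: keep only terms that occur in at least N of the active documents.
min_docs = 3
culled = dtm.loc[:, (dtm > 0).sum(axis=0) >= min_docs]
print(culled.columns.tolist())  # ['ran'] -- the only term appearing in all 3 documents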
Normalize
By default, Lexos displays the frequency of each term in your documents as a proportion of the entire text. If you wish to see the actual number of occurrences, click the Raw Counts radio button. You may also attempt to take into account differences in the lengths of your documents by calculating their Term Frequency-Inverse Document Frequency (TF-IDF). Lexos offers three methods of calculating TF-IDF: based on Euclidean Distance, based on Manhattan Distance, or without using a distance metric (Norm: None).
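The sketch below shows what these options correspond to on a small count matrix. It assumes raw counts stored in a pandas DataFrame and uses scikit-learn's TfidfTransformer, with its l2 (Euclidean), l1 (Manhattan), and no-norm settings, as a stand-in; it is not necessarily how Lexos performs the calculation.

# A minimal sketch of the Normalize options, assuming raw counts in a pandas DataFrame.
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer

counts = pd.DataFrame(
    {"the": [4, 2], "dog": [1, 0], "ran": [1, 1]},
    index=["doc1", "doc2"],
)

# Proportions (the default display): each count divided by its document's total.
proportions = counts.div(counts.sum(axis=1), axis=0)
print(proportions.round(3))

# TF-IDF with a Euclidean (l2) norm, a Manhattan (l1) norm, or no norm at all.
for norm in ("l2", "l1", None):
    tfidf = TfidfTransformer(norm=norm).fit_transform(counts.values)
    print(norm, tfidf.toarray().round(3))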
Assign Temporary Labels
Lexos automatically uses the label in the "Document Name" column in the Manage tool as the document label. However, you may change the label used in your table by entering a new value for it in the forms displayed in Assign Temporary Labels. This is particularly useful if you want to save different labels when you download your DTM. Keep in mind that whatever labels you set will be applied in all other Lexos tools that use the Advanced Options. However, the original document name in Manage will not be affected. After assigning temporary labels in Tokenizer, click the Show matrix button to rebuild the table with the new labels.
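If you ever need to apply the same kind of relabeling to a DTM you have already downloaded, the sketch below shows the equivalent operation in pandas. The file and label names are hypothetical, and this is not part of the Lexos interface.

# A minimal sketch (hypothetical file and label names) of relabeling rows of a
# downloaded DTM with pandas, outside of Lexos.
import pandas as pd

dtm = pd.DataFrame(
    {"the": [4, 2], "dog": [1, 0]},
    index=["chapter_01.txt", "chapter_02.txt"],
)

temporary_labels = {"chapter_01.txt": "Chapter 1", "chapter_02.txt": "Chapter 2"}
dtm = dtm.rename(index=temporary_labels)

dtm.to_csv("dtm_with_labels.csv")  # the saved table carries the new labels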