In the Margins

The Tokenize/Count Tool

Tokenization is the process of dividing a string of text into countable units called “tokens”. Tokens are typically individual characters or words, but they can also be “n-grams”, sequences of n consecutive characters or words. By default, Lexos divides text into tokens using spaces as token delimiters. However, it can be set to treat every character as a token or to treat n-gram sequences as tokens.
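The difference between word tokens and character tokens can be illustrated with a minimal Python sketch. This is not Lexos's own code; the function names are hypothetical, and the choice to skip whitespace when tokenizing by characters is an assumption made for the example.

```python
def tokenize_by_words(text):
    """Split text into tokens wherever whitespace occurs."""
    return text.split()


def tokenize_by_characters(text):
    """Treat every character as a token.

    Assumption for this sketch: whitespace characters are not counted.
    """
    return [ch for ch in text if not ch.isspace()]


sample = "the cat sat on the mat"
word_tokens = tokenize_by_words(sample)
char_tokens = tokenize_by_characters(sample)

print(word_tokens)
print(len(char_tokens), "character tokens")
```

Splitting on spaces yields six word tokens for this sample, while tokenizing by characters yields one token per letter.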

Once the text is divided into tokens, Lexos assembles a Document-Term Matrix (DTM). This is a table of “terms” (also called “types”)—unique token forms—that occur in the active documents. Lexos calculates the number of times each document contains each term to produce the DTM. It displays the DTM as a table where you can explore important statistical information about your texts. Note that text corpora containing a large number of documents or types can take a while to process, so please be patient. If the table is too big, it may cause your browser to hang, and you may be forced to download the DTM to a spreadsheet program and work there. Lexos attempts to warn you when it is likely that you will need to download your data. Even if Lexos is able to display your DTM quickly, you may wish to download the data for use in other
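As a rough illustration of what a DTM contains, the following Python sketch counts space-delimited tokens in two short documents and arranges the counts as one row per document. The function name `build_dtm` and the sample documents are hypothetical and do not reflect how Lexos builds its matrix internally.

```python
from collections import Counter


def build_dtm(documents):
    """Build a simple document-term matrix.

    documents: dict mapping a document label to its text.
    Returns (terms, rows), where terms is the sorted list of unique
    token forms and rows maps each label to a list of counts aligned
    with terms.
    """
    counts = {label: Counter(text.split()) for label, text in documents.items()}
    # The terms (types) are the unique tokens found in any document.
    terms = sorted(set().union(*counts.values()))
    rows = {label: [c[term] for term in terms] for label, c in counts.items()}
    return terms, rows


docs = {
    "doc1": "the cat sat",
    "doc2": "the cat saw the dog",
}
terms, rows = build_dtm(docs)
print(terms)
for label, row in rows.items():
    print(label, row)
```

Each term gets one position in the matrix, and a document that does not contain a term records a count of 0 for it.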

Using the DTM Table

By default, Lexos displays the DTM with documents listed in columns and terms listed in rows. You may choose to transpose the table by selecting the Documents as Rows, Terms as Columns option. However, it is most likely that you will have relatively few documents and a relatively large number of terms. Transposing the matrix will produce a table with potentially hundreds or thousands of columns, requiring you to scroll horizontally to view them. Lexos will warn you when this is likely and give you the option to download the transposed table to a spreadsheet program, where you may find it easier to work. You may also click the eye icon to toggle the visibility of individual columns. If you change the setting between Documents as Columns, Terms as Rows and Documents as Rows, Terms as Columns, click the Regenerate Table button to apply the change of setting.
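If you download the table and want to transpose it yourself, the operation is simple to sketch in Python. The table below is invented sample data, and `transpose` is a hypothetical helper rather than a Lexos function.

```python
def transpose(table):
    """Swap the rows and columns of a rectangular table (a list of rows)."""
    return [list(column) for column in zip(*table)]


# Default layout: terms as rows, documents as columns.
terms_as_rows = [
    ["term", "doc1", "doc2"],
    ["cat", 1, 1],
    ["dog", 0, 1],
    ["the", 1, 2],
]

# Transposed layout: documents as rows, terms as columns.
documents_as_rows = transpose(terms_as_rows)
for row in documents_as_rows:
    print(row)
```

With only a few documents and many terms, the transposed table has few rows but one column per term, which is why it can become very wide.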

By default Lexos displays 10 table rows per page, but you can change this using the Display dropdown menu. You can also filter the rows by entering keywords in the Search form. To sort the table, click on a column header. A small icon next to the arrow in the header label will indicate both which column is being used for sorting and whether the sort direction is ascending or descending. Lexos calculates totals and averages for both rows and columns.

To download the DTM, click the Download CSV or the Download TSV button. “CSV” is short for comma-separated values, whereas “TSV” is short for tab-separated values. In your downloaded file, a comma or a tab will serve as the column delimiter. Spreadsheet programs can usually open both formats, but you may find one or the other easier to use for your purposes.
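The only difference between the two formats is the delimiter, as this short sketch using Python's standard csv module shows. The helper name `dtm_to_delimited` and the sample rows are hypothetical; the exact layout of the file Lexos produces may differ.

```python
import csv
import io


def dtm_to_delimited(header, rows, delimiter=","):
    """Serialize a table as delimited text (comma for CSV, tab for TSV)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


header = ["term", "doc1", "doc2"]
rows = [["cat", 1, 1], ["the", 1, 2]]

csv_text = dtm_to_delimited(header, rows, delimiter=",")
tsv_text = dtm_to_delimited(header, rows, delimiter="\t")
print(csv_text)
print(tsv_text)
```

TSV can be the more convenient choice when your terms may themselves contain commas, since a CSV writer then has to quote those fields.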

Using the Advanced Options

The configuration options in the top right inset section of the Tokenizer tool allow you to change how the DTM is built. An important feature of these options is that they are saved to your session and will apply to all the other Lexos tools that make use of the DTM. For instance, if you restrict your DTM to only the 10 most frequent terms in your corpus, this slice of your DTM will also be used to generate word clouds, cluster analyses, and so on. The same configuration options occur in the other Lexos tools, so it is possible to change the settings there. In Tokenizer, you should click the Regenerate Table button each time you change the settings to re-build the DTM with the new configuration.

Here is an overview of the Advanced Options:

Tokenize

By default Lexos splits strings of text into tokens every time it encounters a space character. For Western languages, this means that each token generally corresponds to a word. Click the by Characters radio button to treat every character as a separate token. If you wish to use n-grams, increase the 1-gram incrementer to 2, 3, 4, etc. Note that increasing the n-gram size will produce a larger DTM, and the table will thus take longer to build.
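To see why larger n-gram sizes tend to enlarge the DTM, consider this minimal Python sketch. The `ngrams` function is a hypothetical helper written for illustration, not the Lexos implementation.

```python
def ngrams(tokens, n, joiner=" "):
    """Return every run of n consecutive tokens, joined into one string."""
    return [joiner.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the cat and the dog and the bird".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print(len(set(unigrams)), "distinct 1-grams")
print(len(set(bigrams)), "distinct 2-grams")

# Character n-grams work the same way on a list of characters.
char_trigrams = ngrams(list("cats"), 3, joiner="")
print(char_trigrams)
```

Even in this short sample the 2-grams yield more distinct terms than the 1-grams, and the gap usually widens considerably in real corpora.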

Culling Options

“Culling Options” is a generic term we use for methods of decreasing the number of terms used to generate the DTM based on statistical criteria (as opposed to something like applying a stopword list in Scrubber). Lexos offers three different methods:

  1. Most Frequent Words: This method takes a slice of the DTM containing only the top N most frequently occurring terms. The default setting is 100.
  2. Culling: This method builds the DTM using only terms that occur in at least N documents. The default setting is 1.
  3. Greywords: This method removes from the DTM those terms occurring in particularly low frequencies. Lexos calculates the cut-off point based on the average length of your documents.
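The first two methods can be sketched in Python as follows. The function names and sample counts are hypothetical, and the code is an illustration rather than the Lexos implementation. Greywords is left out because the exact cut-off calculation is not described here.

```python
from collections import Counter


def most_frequent(dtm, n):
    """Keep only the n terms with the highest total count in the corpus."""
    totals = Counter()
    for counts in dtm.values():
        totals.update(counts)  # adds each document's counts to the totals
    keep = {term for term, _ in totals.most_common(n)}
    return {label: {t: c for t, c in counts.items() if t in keep}
            for label, counts in dtm.items()}


def cull(dtm, min_docs):
    """Keep only the terms that occur in at least min_docs documents."""
    doc_freq = Counter()
    for counts in dtm.values():
        doc_freq.update(t for t, c in counts.items() if c > 0)
    keep = {term for term, df in doc_freq.items() if df >= min_docs}
    return {label: {t: c for t, c in counts.items() if t in keep}
            for label, counts in dtm.items()}


# Sample DTM: document label -> {term: raw count}
dtm = {
    "doc1": {"the": 2, "cat": 1, "sat": 1},
    "doc2": {"the": 3, "cat": 1, "dog": 4},
}

top_two = most_frequent(dtm, 2)
in_both = cull(dtm, 2)
print(top_two)
print(in_both)
```

The two methods can select different terms: here "dog" is among the two most frequent terms overall but appears in only one document, so culling with N set to 2 removes it while keeping "cat".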

Normalize

By default, Lexos displays the frequency of the occurrence of terms in your documents as a proportion of the entire text. If you wish to see the actual number of occurrences, click the Raw Counts radio button, followed by the Regenerate Table button. You may also attempt to take into account differences in the lengths of your documents by calculating their Term Frequency-Inverse Document Frequency (TF-IDF). Lexos offers three different methods of calculating TF-IDF based on Euclidean Distance, Manhattan Distance, or without using a distance metric (Norm: None). For further discussion of these options, see the topics article on TF-IDF.
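The following Python sketch shows one way these normalizations can be computed. It is an illustration under stated assumptions, not the Lexos implementation: the function names are hypothetical, the inverse document frequency is taken to be ln(N / df) + 1, and the Euclidean and Manhattan options are interpreted as dividing each document's weights by their L2 or L1 length. The formula Lexos actually uses may differ.

```python
import math


def proportions(counts):
    """Convert raw counts for one document into proportions of its tokens."""
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}


def tfidf(dtm, norm=None):
    """Weight raw counts by inverse document frequency.

    Assumption: idf = ln(N / df) + 1, where N is the number of documents
    and df is the number of documents containing the term.
    norm: "l2" (Euclidean), "l1" (Manhattan), or None.
    """
    n_docs = len(dtm)
    doc_freq = {}
    for counts in dtm.values():
        for term, count in counts.items():
            if count > 0:
                doc_freq[term] = doc_freq.get(term, 0) + 1

    weighted = {}
    for label, counts in dtm.items():
        row = {t: c * (math.log(n_docs / doc_freq[t]) + 1.0)
               for t, c in counts.items() if c > 0}
        if norm == "l2":
            length = math.sqrt(sum(v * v for v in row.values()))
        elif norm == "l1":
            length = sum(abs(v) for v in row.values())
        else:
            length = 1.0
        weighted[label] = {t: v / length for t, v in row.items()} if length else row
    return weighted


dtm = {
    "doc1": {"the": 2, "cat": 1, "sat": 1},
    "doc2": {"the": 3, "cat": 1, "dog": 4},
}

props = proportions(dtm["doc1"])
raw = tfidf(dtm)
l1 = tfidf(dtm, norm="l1")
l2 = tfidf(dtm, norm="l2")
print(props)
print(l2["doc2"])
```

Terms that occur in fewer documents, such as "dog" here, receive a larger weight per occurrence than terms found in every document, and the norm setting rescales each document so that documents of different lengths become comparable.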

Assign Temporary Labels

Lexos automatically uses the label in the “Document Name” column in the Manage tool as the document label in Tokenizer. However, you may change the label used in your table by entering a new value for it in the forms displayed in Assign Temporary Labels. This is particularly useful if you want to save different labels when you download your DTM. Keep in mind that whatever labels you set will be applied in all other Lexos tools that use the Advanced Options. However, the original document name in Manage will not be affected. After assigning temporary labels, click the Regenerate Table button to rebuild the table with the new labels.
