In the Margins

The Tokenize/Count Tool

The Tokenize/Count tool, also known as Tokenizer, is the backbone of many functions in Lexos. Tokenization is the process of dividing a string of text into countable units called “tokens”. Tokens are typically individual characters or words, but they can also be “n-grams”: sequences of n consecutive characters or words. By default, Lexos divides text into tokens using spaces as token delimiters. However, it can be set to treat every character as a token or to treat n-gram sequences as tokens.
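The three tokenization modes described above can be sketched in a few lines of Python. This is a simplified illustration of the concept, not the code Lexos actually uses; the function name and its parameters are invented for this example.

```python
def tokenize(text, mode="word", n=1):
    """Split text into tokens: whitespace-delimited words,
    individual characters, or n-grams built from either unit."""
    if mode == "word":
        units = text.split()                     # spaces delimit tokens
    elif mode == "char":
        units = [c for c in text if not c.isspace()]
    else:
        raise ValueError("mode must be 'word' or 'char'")
    if n == 1:
        return units
    # slide a window of size n over the units to build n-grams
    joiner = " " if mode == "word" else ""
    return [joiner.join(units[i:i + n]) for i in range(len(units) - n + 1)]

print(tokenize("the cat sat on the mat"))        # word tokens
print(tokenize("the cat sat on the mat", n=2))   # word 2-grams
```

Note that word 2-grams overlap: each token shares one word with its neighbor, so a six-word sentence yields five 2-grams.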

Once the text is divided into tokens, Lexos assembles a Document-Term Matrix (DTM). This is a table of “terms” (also called “types”), the unique token forms that occur in the active documents. Lexos counts the number of times each term occurs in each document to produce the DTM. It displays the DTM as a table where you can explore important statistical information about your texts. Note that text corpora containing a large number of documents or types can take a while to process, so please be patient. If the table is too big, it may cause your browser to hang, and you may be forced to download the DTM to a spreadsheet program and work there. Lexos attempts to warn you when it is likely that you will need to download your data. Even if Lexos is able to display your DTM quickly, you may wish to download the data for use in other tools.
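Conceptually, building a DTM amounts to counting each term per document and collecting the union of all terms. A minimal sketch using Python's standard library (the function and the sample documents are hypothetical, for illustration only):

```python
from collections import Counter

def build_dtm(documents):
    """documents: dict mapping document name -> list of tokens.
    Returns (terms, counts) where counts[doc][term] is a raw count."""
    counts = {name: Counter(tokens) for name, tokens in documents.items()}
    # the terms are the union of all unique token forms ("types")
    terms = sorted(set().union(*(c.keys() for c in counts.values())))
    return terms, counts

docs = {
    "doc1": ["the", "cat", "sat"],
    "doc2": ["the", "the", "dog"],
}
terms, counts = build_dtm(docs)
for term in terms:
    # Counter returns 0 for terms absent from a document
    print(term, [counts[d][term] for d in docs])
```

A `Counter` conveniently reports 0 for a term a document does not contain, which is why the matrix has no missing cells.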

Using the Document-Term Matrix (DTM) Table

By default, Lexos displays the DTM with documents listed in columns and terms listed in rows. You may choose to transpose the table by selecting the Documents as Rows, Terms as Columns option. However, it is most likely that you will have relatively few documents and a relatively large number of terms. Transposing the matrix will produce a table with potentially hundreds or thousands of columns, requiring you to scroll horizontally to view them. Lexos will warn you when this is likely and give you the option to download the transposed table to a spreadsheet program, where you may find it easier to work. You may also click the Column visibility button to toggle the visibility of individual columns. If you switch between Documents as Columns, Terms as Rows and Documents as Rows, Terms as Columns, click the Show Matrix button to apply the new setting.
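Switching between the two layouts is simply a matrix transpose: the outer and inner keys of the table swap roles. A sketch with hypothetical nested dictionaries (the sample data is invented):

```python
# hypothetical DTM stored with documents as the outer keys (columns)
dtm = {
    "doc1": {"the": 1, "cat": 1},
    "doc2": {"the": 2, "dog": 1},
}

# transpose: terms become the outer keys, documents the inner keys
transposed = {}
for doc, row in dtm.items():
    for term, count in row.items():
        transposed.setdefault(term, {})[doc] = count

print(transposed["the"])  # counts of "the" in each document
```

The transpose preserves every count; only the orientation changes, which is why a corpus with thousands of terms produces thousands of columns in the transposed view.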

By default, Lexos displays 10 table rows per page, but you can change this using the Show N entries menu. You can also filter the rows by entering keywords in the Search form.

To sort the table, click on a column header. A small arrow icon in the header label will indicate both which column is being used for sorting and whether the sort direction is ascending or descending. Lexos also calculates totals and averages for both rows and columns.
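The row and column totals are straightforward sums over the counts. A minimal sketch, assuming the hypothetical nested-`Counter` representation used above (the data is invented for illustration):

```python
from collections import Counter

# hypothetical raw counts: document name -> term counts
counts = {
    "doc1": Counter({"the": 2, "cat": 1}),
    "doc2": Counter({"the": 1, "dog": 3}),
}
terms = sorted(set().union(*counts.values()))
docs = list(counts)

# row totals/averages: one per term, summed across documents
row_totals = {t: sum(counts[d][t] for d in docs) for t in terms}
row_averages = {t: row_totals[t] / len(docs) for t in terms}
# column totals: one per document, summed across terms
col_totals = {d: sum(counts[d].values()) for d in docs}

print(row_totals)
print(col_totals)
```

Each row total tells you how often a term occurs in the whole corpus, while each column total is simply the token count of that document.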

To download the DTM, click the Excel, CSV, or PDF button, depending on your desired format. “CSV” is short for Comma-Separated Values: in the downloaded file, commas serve as the column delimiters. CSV files can later be opened by other programs, such as Excel.
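To see what the CSV format looks like, here is a sketch that writes a small hypothetical DTM (terms as rows, documents as columns) with Python's standard `csv` module; the data is invented and this is not Lexos's export code:

```python
import csv
import io

# hypothetical DTM: term -> per-document counts
dtm = {
    "cat": {"doc1": 1, "doc2": 0},
    "the": {"doc1": 1, "doc2": 2},
}
doc_names = ["doc1", "doc2"]

buf = io.StringIO()
writer = csv.writer(buf)           # comma is the default delimiter
writer.writerow([""] + doc_names)  # header row: blank corner cell, then documents
for term in sorted(dtm):
    writer.writerow([term] + [dtm[term][d] for d in doc_names])

print(buf.getvalue())
```

Because each value is separated by a comma and each row ends with a line break, any spreadsheet program can reconstruct the table from this plain-text file.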

Using the Advanced Options

The Tokenize/Normalize configuration options in the top right inset section of the Tokenizer tool allow you to change how the DTM is built. An important feature of these options is that they are saved to your session and will apply to all the other Lexos tools that make use of the DTM. For instance, if you restrict your DTM to only the 10 most frequent terms in your corpus, this slice of your DTM will also be used to generate word clouds, cluster analyses, and so on. The same configuration options appear in the other Lexos tools, so it is possible to change the settings there as well. In Tokenizer, you should click the Show Matrix button each time you change the settings to rebuild the DTM with the new configuration.
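Restricting the DTM to the most frequent terms amounts to ranking terms by their corpus-wide totals and keeping the top N. A sketch of the idea with hypothetical data (not Lexos's implementation):

```python
from collections import Counter

# hypothetical per-document token counts
counts = {
    "doc1": Counter({"the": 4, "cat": 2, "sat": 1}),
    "doc2": Counter({"the": 3, "dog": 2, "ran": 1}),
}

# total frequency of each term across the whole corpus
corpus = Counter()
for c in counts.values():
    corpus.update(c)

# keep only the N most frequent terms (here N = 2)
top_terms = [t for t, _ in corpus.most_common(2)]
# slice each document's counts down to those terms
sliced = {d: {t: counts[d][t] for t in top_terms} for d in counts}
print(sliced)
```

Because the slice is computed once from corpus-wide totals, every downstream view (word clouds, clustering, and so on) sees the same reduced term set, which mirrors how the setting propagates across Lexos tools.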

Tokenizer provides several methods of manipulating the DTM in the panel at the top right of the screen. Instructions for using these methods can be found in Advanced Options.
