In the Margins

The Topwords Tool

The Lexos Topwords tool helps you find terms that are more prominent in a particular document or class of documents than in other documents or classes. We call these highly prominent terms "topwords" (even when the terms may not, strictly speaking, be "words"). The Topwords tool uses a Z-test to determine which terms are outliers beyond the normal range of distribution in a document or a group of documents. Our experience shows that the prominent terms you identify in Topwords make good candidates for further analysis using the Lexos Rolling Windows tool.

Topwords uses the power of class labels. In the Manage tool, you can right-click and set a class label on an individual document or a group of documents. For example, you might label a group of documents by the same author with the author’s name, or label documents by genre. If you do not assign class labels to your active documents, Topwords will only allow you to compare each document to the corpus (all active documents).

Note that Topwords assumes the distribution of term frequencies is a normal distribution, which is not the case for most data, so the results should be used with caution. In addition, documents and/or document classes should have at least 100 tokens each.
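To make the idea of a Z-test on term proportions concrete, here is a minimal sketch in Python. It uses one common textbook form, the two-proportion z-test with a pooled standard error. This is an assumption for illustration, and the exact formula Lexos uses may differ. The function name z_score and its parameters are hypothetical.

```python
import math

def z_score(count_a, total_a, count_b, total_b):
    """Two-proportion z statistic with a pooled standard error.

    count_a / total_a: occurrences of the term and total tokens in "A".
    count_b / total_b: occurrences of the term and total tokens in "B".
    Assumption: this textbook formula stands in for whatever Lexos computes.
    """
    p_a = count_a / total_a
    p_b = count_b / total_b
    # Proportion of the term when both samples are pooled together.
    pooled = (count_a + count_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        # The term is absent from (or makes up all of) both samples.
        return 0.0
    return (p_a - p_b) / se

# A term used 30 times in 1,000 tokens of "A" and 10 times in 1,000 tokens of "B".
z = z_score(30, 1000, 10, 1000)
```

With very small samples the standard error becomes unstable, which is one reason to keep documents and classes above the 100-token minimum.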

Topwords Settings

If you have not set any class labels on the Manage page, Topwords will, by default, compare each document to all the other active documents in your workspace and display a message prompting you to assign class labels.

If your documents have class labels, you have two additional options: Compare each document to other class(es) and Compare each class to other class(es). The first allows you to compare the proportion of each term in a document within one class to its proportion in another class as a whole. The second allows you to compare the proportion of each term in one class to its proportion in another class. Lexos performs the analysis on all terms that appear at least once in the corpus.
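The difference between the two class-based options comes down to which token counts are pooled before proportions are computed. The sketch below illustrates this with a toy corpus. The documents, the labels "A" and "B", and the helper names class_counts and proportion are all hypothetical, and the pooling shown is an assumption about how a class "as a whole" is formed, not a description of Lexos internals.

```python
from collections import Counter

# Hypothetical toy corpus: document name -> list of tokens.
docs = {
    "doc1": ["whale", "sea", "whale", "ship"],
    "doc2": ["sea", "ship", "ship", "storm"],
    "doc3": ["garden", "tea", "garden", "letter"],
}
# Hypothetical class labels, as one might assign them in the Manage tool.
labels = {"doc1": "A", "doc2": "A", "doc3": "B"}

def class_counts(docs, labels):
    """Pool token counts for all documents that share a class label."""
    pooled = {}
    for name, tokens in docs.items():
        pooled.setdefault(labels[name], Counter()).update(tokens)
    return pooled

def proportion(counts, term):
    """Share of all tokens in `counts` accounted for by `term`."""
    total = sum(counts.values())
    return counts[term] / total if total else 0.0

by_class = class_counts(docs, labels)

# Document compared to another class: doc1 (class A) vs. class B as a whole.
doc_vs_class = (proportion(Counter(docs["doc1"]), "whale"),
                proportion(by_class["B"], "whale"))

# Class compared to another class: class A as a whole vs. class B as a whole.
class_vs_class = (proportion(by_class["A"], "ship"),
                  proportion(by_class["B"], "ship"))
```

Each pair of proportions is what a significance test would then compare, term by term.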

How to Read the Results

Topwords produces a series of tables, each showing a different comparison (labeled at the top of the table). For example, one table might show how Document "A" compares to Document "B".

Within each table, the terms are ranked according to their Z-score. Only the top 20 statistically significant terms are shown (those for which the Z-score has an absolute value larger than 1.96). If a term has a large positive Z-score, the term is used more frequently in document or class “A” relative to document or class “B”. If a term has a large negative Z-score, the term is used more frequently in document or class “B” relative to “A”.
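The filtering and ranking described above can be sketched as follows. The function top_terms and the sample scores are hypothetical, and the sketch assumes terms are ranked by the absolute value of the Z-score, which the description above does not state explicitly.

```python
def top_terms(z_scores, limit=20, threshold=1.96):
    """Keep terms with |z| > threshold, ranked by |z|, truncated to `limit`.

    Assumption: ranking is by absolute value, so strongly negative
    scores (terms favoring "B") can appear alongside positive ones.
    """
    significant = [(term, z) for term, z in z_scores.items()
                   if abs(z) > threshold]
    significant.sort(key=lambda item: abs(item[1]), reverse=True)
    return significant[:limit]

# Hypothetical Z-scores for a comparison of "A" against "B".
scores = {"the": 0.5, "whale": 4.2, "ship": -3.1, "sea": 1.96}
table = top_terms(scores)
```

Here "whale" favors "A", "ship" favors "B", and "the" and "sea" are dropped because their scores do not exceed 1.96 in absolute value.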

Because Topwords displays only the top 20 statistically significant terms on the web page, you can get the complete list of significant terms by downloading the results with the Download Topwords button.
