In the Margins

The Topwords Tool

The Lexos Topwords tool helps you find terms that are more prominent in one document or class of documents than in others. We call these highly prominent terms "topwords" (even when the terms may not, strictly speaking, be "words"). The Topwords tool uses a Z-test to determine which terms are outliers, falling beyond the normal range of the distribution, in a document or a group of documents. In our experience, the prominent terms you identify in Topwords make good candidates for further analysis using the Lexos Rolling Windows tool.

Topwords relies on the class labels you assign to documents in the Manage tool. There, you can right-click to set a class label on a group of active documents, or set each document's class individually. For example, you might label all documents with an author name, or label documents by author gender. If you do not assign class labels to your active documents, Topwords will only allow you to compare the proportion of each term in a single document to its proportion in the overall set of active documents in your collection.

Note that in Topwords, documents and/or document classes should have at least 100 tokens each.

Topwords Settings

If you have not set any class labels in the Manage tool, Topwords will by default compare a single document to all the other active documents in your workspace. If your documents already have class labels, you have two additional options: "Compare each document against other class(es)" and "Compare classes one to another". The first compares each document to one or more classes. The second compares the classes themselves.

Lexos performs the analysis on all terms that appear in both groups (whether individual documents or groups of documents with class labels).
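The restriction to shared terms can be illustrated with a short Python sketch. It is not the Lexos implementation: the function name shared_term_counts and the sample tokens are hypothetical, and the sketch assumes each group has already been split into a list of tokens.

```python
from collections import Counter


def shared_term_counts(tokens_a, tokens_b):
    """Return {term: (count_in_a, count_in_b)} for terms found in both groups.

    Hypothetical helper for illustration only; not part of Lexos.
    """
    counts_a = Counter(tokens_a)
    counts_b = Counter(tokens_b)
    # Keep only the terms that occur at least once in each group.
    shared = counts_a.keys() & counts_b.keys()
    return {term: (counts_a[term], counts_b[term]) for term in sorted(shared)}


# Made-up sample data: each list stands for one document or one class.
group_a = "the whale the sea the ship".split()
group_b = "the garden the house a ship".split()

print(shared_term_counts(group_a, group_b))
# {'ship': (1, 1), 'the': (3, 2)}
```

Terms such as "whale" or "garden", which occur in only one of the two groups, drop out of the comparison under this reading.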

How to Read the Results

Class Divisions
If documents have been assigned class labels, Lexos will show the document names within each class. If no classes have been assigned, a message will prompt you to set class labels in order to enable document-to-class and class-to-class comparisons.

Topwords produces a series of tables, each showing a different comparison (labeled at the top of the table), for example: Document "A" compared to "B". Within each table, the terms are ranked according to their Z-score. Only the top 20 statistically significant terms (those for which the Z-score has an absolute value larger than 1.96) are shown. A larger positive Z-score indicates a term that is used more frequently in the document or class (A) than in the comparison group (B). A larger negative Z-score indicates a term that is used less frequently in (A) than in (B).
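The following Python sketch shows how such a table could be built. It rests on assumptions that this page does not confirm: that the Z-score is a pooled two-proportion Z-test, and that terms are ranked by the absolute value of their Z-score. The function names, constants, and counts are all hypothetical.

```python
import math

Z_CRITICAL = 1.96  # significance threshold stated in the text
MAX_TERMS = 20     # number of terms shown per table, as stated in the text


def z_score(count_a, total_a, count_b, total_b):
    """Pooled two-proportion Z-score (an assumed formula, not confirmed by Lexos)."""
    p_a = count_a / total_a
    p_b = count_b / total_b
    pooled = (count_a + count_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    return (p_a - p_b) / std_err if std_err > 0 else 0.0


def top_terms(counts, total_a, total_b):
    """Return up to MAX_TERMS (term, z) pairs with |z| > Z_CRITICAL.

    counts maps each term to (count_in_a, count_in_b). Sorting by |z|,
    largest first, is an assumption about how the ranking works.
    """
    scored = [(term, z_score(a, total_a, b, total_b))
              for term, (a, b) in counts.items()]
    significant = [(term, z) for term, z in scored if abs(z) > Z_CRITICAL]
    significant.sort(key=lambda pair: abs(pair[1]), reverse=True)
    return significant[:MAX_TERMS]


# Made-up counts for two groups of 1000 tokens each.
counts = {"whale": (30, 5), "the": (60, 58), "garden": (2, 25)}
for term, z in top_terms(counts, total_a=1000, total_b=1000):
    print(f"{term}: {z:+.2f}")
```

With these invented counts, "whale" receives a positive score (more frequent in A), "garden" a negative one (more frequent in B), and "the" is filtered out because its proportions in the two groups are nearly equal.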



Note that Topwords assumes the distribution of term frequencies is a normal distribution, which is not the case for most data, so the results should be used with caution.
