In the Margins

The Topwords Tool

The Lexos Topwords tool lets you ask which terms are more prominent in a certain document or class of documents than in other classes of documents or in the collection as a whole. We call these highly prominent terms "topwords" (even when the terms are not, strictly speaking, "words"). The Topwords tool uses a Z-test to determine which terms are outliers beyond the normal range of distribution in a document or a group of documents. The prominent terms you identify in Topwords make good candidates for further analysis using the Lexos Rolling Windows tool.
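To make the Z-test concrete, here is a minimal sketch of a pooled two-proportion Z-test for a single term. It assumes the test compares the term's proportion in one group against its proportion in another; the function name z_score and its parameters are illustrative and are not taken from the Lexos source code, whose exact formula may differ.

```python
import math


def z_score(count_a: int, total_a: int, count_b: int, total_b: int) -> float:
    """Pooled two-proportion Z-score for one term in group A versus group B.

    count_* is the number of times the term occurs in the group;
    total_* is the total number of tokens in the group.
    """
    p_a = count_a / total_a
    p_b = count_b / total_b
    # Pooled proportion under the null hypothesis that both groups
    # use the term at the same rate.
    pooled = (count_a + count_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        # The term is absent from (or makes up all of) both groups.
        return 0.0
    return (p_a - p_b) / std_err


# Hypothetical counts: the term occurs 30 times in 1000 tokens of group A
# and 10 times in 1000 tokens of group B.
z = z_score(30, 1000, 10, 1000)
```

A positive result means the term is proportionally more frequent in group A, and a negative result means it is more frequent in group B.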

Topwords allows you to configure the criteria for determining the bounds of statistical prominence. Options for limiting the proportions to use in an analysis include commonly used metrics such as Standard Deviation and Interquartile Range (IQR), as well as the ability to use customizable bounds. Topwords makes use of the class labels assigned to documents in the Manage tool. There, you can right-click and set the class on all active documents or set each document's class individually. If you do not assign class labels to your active documents, Topwords will only allow you to compare the proportions of each term in a single document to their proportions in the overall set of active documents in your collection.

Note that in Topwords, documents and/or document classes should have at least 100 tokens each.

Topwords Settings

If you have not set any class labels in the Manage tool, Topwords will by default compare a single document to all the other active documents in your workspace. If your documents already contain class labels, you have two additional options: Compare each document against other class(es) and Compare classes one to another. The first allows you to compare each document to one or more classes. The second allows you to compare the classes themselves.
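Comparing classes requires treating every document with the same label as one group. The sketch below shows one way to pool term counts by class label. The function name pool_by_class, the file names, and the labels are hypothetical, and skipping unlabeled documents is an assumption rather than documented Lexos behavior.

```python
from collections import Counter


def pool_by_class(
    doc_counts: dict[str, Counter], labels: dict[str, str]
) -> dict[str, Counter]:
    """Sum the term counts of all documents that share a class label."""
    pooled: dict[str, Counter] = {}
    for doc_name, counts in doc_counts.items():
        label = labels.get(doc_name)
        if label is None:
            # Assumption: documents without a class label are left out.
            continue
        pooled.setdefault(label, Counter()).update(counts)
    return pooled


# Hypothetical documents and class labels.
docs = {
    "a.txt": Counter({"whale": 3, "sea": 1}),
    "b.txt": Counter({"whale": 2}),
    "c.txt": Counter({"garden": 4}),
    "d.txt": Counter({"whale": 9}),  # no class label assigned
}
labels = {"a.txt": "melville", "b.txt": "melville", "c.txt": "austen"}
pooled = pool_by_class(docs, labels)
```

Each pooled Counter can then be compared with another class, or with a single document, in the same way that two documents are compared.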

By default, Lexos will perform the analysis on all terms that appear in both groups (whether individual documents or groups of documents with class labels). However, you may also choose from two Built-in Options. These are Standard Deviation and Interquartile Range. Standard Deviation calculates the mean frequency for all terms and compares the frequency of individual terms to this baseline. You may choose to limit the analysis to the Top Outliers Only (the 5% most prominent terms above the mean), the Bottom Outliers Only (the 5% least prominent terms below the mean), or the Non-Outliers Only (the terms that fall within one standard deviation above or below the mean). Options are similar for Interquartile Range, except the distribution of terms is divided into first, second, and third quartiles with the second quartile occupying the middle range. This essentially sets a slightly different set of cut-off boundaries from standard deviation.

Lexos also allows you to set your own boundaries by selecting Customize Options. You may base your boundaries on the Proportional Counts (that is, the proportion of a document's tokens accounted for by each term) or the Raw Counts of each term. Setting the Upper Boundary and Lower Boundary is equivalent to basing the analysis only on the terms with proportions/counts in between these settings.
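The boundary options can be pictured as filters applied to the term list before the Z-test is run. The following is a rough sketch under stated assumptions: the function names are illustrative, the bounds are treated as inclusive, and the population standard deviation is used. Lexos may make different choices on each of these points.

```python
from statistics import mean, pstdev


def proportions(raw_counts: dict[str, int]) -> dict[str, float]:
    """Convert raw counts to each term's share of all tokens."""
    total = sum(raw_counts.values())
    return {term: count / total for term, count in raw_counts.items()}


def within_bounds(
    values: dict[str, float], lower: float, upper: float
) -> dict[str, float]:
    """Keep terms whose value lies between the bounds (inclusive, by assumption)."""
    return {term: v for term, v in values.items() if lower <= v <= upper}


def non_outliers(values: dict[str, float]) -> dict[str, float]:
    """Keep terms within one standard deviation above or below the mean."""
    data = list(values.values())
    mu = mean(data)
    sigma = pstdev(data)  # population standard deviation (an assumption)
    return within_bounds(values, mu - sigma, mu + sigma)


# Hypothetical raw counts for one document.
counts = {"the": 120, "whale": 40, "ship": 35, "sea": 30, "harpoon": 5}
kept_by_sd = non_outliers(counts)               # drops "the" and "harpoon"
kept_by_custom = within_bounds(counts, 5, 35)   # custom raw-count bounds
shares = proportions(counts)                    # basis for proportional bounds
```

Passing shares instead of counts to within_bounds corresponds to choosing Proportional Counts rather than Raw Counts.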

How to Read the Results

Topwords produces a series of tables, each showing a different comparison (labeled at the top of the table). Within each table, the terms are ranked according to their Z-score. Only the top 20 statistically significant terms (those for which the Z-score has an absolute value larger than 1.96, the two-tailed threshold for significance at the 5% level) are shown. A larger positive Z-score indicates a term that is used more frequently in this document or class than in the comparison group. A larger negative Z-score indicates a term that is used less frequently than in the comparison group. Note that Topwords assumes the distribution of term frequencies is a normal distribution, which is not the case for most data, so the results should be used with caution.
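The selection rule described above can be sketched as a filter followed by a sort. The function name top_significant is illustrative, and ranking by the absolute value of the Z-score is an assumption, since the text does not say whether negative scores are ranked by magnitude or placed last.

```python
def top_significant(
    z_scores: dict[str, float], limit: int = 20, critical: float = 1.96
) -> list[tuple[str, float]]:
    """Return up to `limit` terms with |Z| > critical, largest magnitude first."""
    significant = [
        (term, z) for term, z in z_scores.items() if abs(z) > critical
    ]
    # Assumption: rank by magnitude, so strongly negative scores rank high too.
    significant.sort(key=lambda item: abs(item[1]), reverse=True)
    return significant[:limit]


# Hypothetical Z-scores for one comparison.
scores = {"whale": 4.2, "the": -0.3, "ship": 2.5, "of": -3.1, "sea": 1.9}
table = top_significant(scores)
```

In this example "the" and "sea" are dropped because their scores fall inside the range from -1.96 to 1.96.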
