The Lexos Statistics tool provides a basic statistical overview of your collection, supplementing the specific term counts and proportions available in the Document-Term Matrix (DTM) provided in Tokenizer.
Statistics for the Entire Corpus
Lexos calculates the average, median, and interquartile range (IQR) of your documents' sizes (based on term counts). This information is used to determine whether any of the document sizes are anomalously large or small, that is, whether any of them are outliers. Outliers are those document sizes that fall below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR), where Q1 and Q3 are the first and third quartiles of the document sizes and IQR = Q3 - Q1.
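The outlier rule above can be sketched in a few lines of Python. This is only an illustration, not Lexos's own code: the function name size_outliers and the sample sizes are hypothetical, and the "inclusive" quartile method is an assumption (tools compute quartiles in slightly different ways, so the fences Lexos reports may differ a little).

```python
from statistics import mean, median, quantiles


def size_outliers(doc_sizes):
    """Summarize document sizes and flag outliers with the 1.5 * IQR rule.

    doc_sizes maps a document name to its total term count.
    """
    sizes = list(doc_sizes.values())
    # Assumption: "inclusive" quartiles; other conventions shift Q1/Q3 slightly.
    q1, _, q3 = quantiles(sizes, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    summary = {
        "mean": mean(sizes),
        "median": median(sizes),
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "low_fence": low_fence,
        "high_fence": high_fence,
    }
    # A document is an outlier if its size falls outside either fence.
    outliers = {
        name: size
        for name, size in doc_sizes.items()
        if size < low_fence or size > high_fence
    }
    return summary, outliers


# Hypothetical term counts for five documents.
sample_sizes = {"doc_a": 100, "doc_b": 110, "doc_c": 120, "doc_d": 130, "doc_e": 1000}
summary, outliers = size_outliers(sample_sizes)
print(summary)
print(outliers)
```

In this made-up sample, only the document that is far larger than the rest falls outside the fences, which is the kind of document the anomaly warning is meant to surface.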
Lexos first shows a box plot visualization that highlights potential outliers of document size in the collection of active documents. Lexos also provides an anomaly warning for any document with a size that is particularly large or small compared to the rest of your corpus. You should consider removing outlier documents from subsequent analyses and/or consider additional cutting of some documents to make term counts more uniform.
In addition to checking for documents in the active set that are small or large relative to the entire collection, Statistics generates a table containing the number of distinct terms, the number of terms occurring once (hapax legomena), the total term count, and the average term frequency in each document. You may generate statistics on all of your active files or you may select a subset of your active documents by using the Select Document(s) checkboxes. All of the Advanced Options for manipulating the Document-Term Matrix (DTM) are available. When you have chosen your settings, click the Generate Statistics button.
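To make the table columns concrete, here is a minimal Python sketch of the four per-document values. It is not how Lexos is implemented: the function name document_stats is hypothetical, the whitespace split stands in for whatever tokenization and Advanced Options you have chosen, and "average term frequency" is assumed here to mean the total term count divided by the number of distinct terms.

```python
from collections import Counter


def document_stats(tokens):
    """Compute the four statistics-table values for one document's tokens."""
    counts = Counter(tokens)
    total_terms = sum(counts.values())
    distinct_terms = len(counts)
    # Hapax legomena: terms that occur exactly once in the document.
    hapax_count = sum(1 for count in counts.values() if count == 1)
    # Assumption: average term frequency = total terms / distinct terms.
    average_frequency = total_terms / distinct_terms if distinct_terms else 0.0
    return {
        "distinct_terms": distinct_terms,
        "hapax_legomena": hapax_count,
        "total_terms": total_terms,
        "average_term_frequency": average_frequency,
    }


# Hypothetical document, tokenized by whitespace for illustration only.
tokens = "the cat sat on the mat the end".split()
print(document_stats(tokens))
```

For this short sample, "the" occurs three times and every other term occurs once, so most of the distinct terms are hapax legomena, which is typical of very small documents.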
Using the Statistics Table
The statistics table may be sorted by column by clicking on the column headers. An icon will indicate which column is being used for sorting and whether the sort direction is ascending or descending. (Note: the first click will sort that column in ascending order; click again to sort in descending order.) Use the Display dropdown menu to display more than the default 10 rows per page. The statistics table may be copied to your computer's clipboard by clicking the Copy button. It may also be downloaded as an Excel spreadsheet, a Comma-Separated Values (CSV) file, or a PDF.