In the Margins

Glossary

This page is intended to provide definitions for the terms used within the Lexos suite, as well as to disambiguate terms drawn from natural language, programming languages, and linguistic analysis. New entries are being added on an ongoing basis.

Agglomerative Hierarchical Clustering

Character

A character is any individual symbol. The letters that make up the Roman alphabet are characters, as are non-alphabetic symbols such as the Hanzi used in Chinese writing. In Lexos, the term character generally refers to countable symbols.

Community Detection

Cosine Similarity

Cutting

Dendrogram

Dimensionality Reduction

Distance Metric

Document

In Lexos, a document is any collection of words (known in Lexos as terms) or characters that together form a single item within the Lexos tool. A document is distinct from a file: document refers specifically to the items manipulated within the Lexos software suite, whereas file refers to the items that are uploaded from or downloaded to a user’s device.

Edit Distance

Euclidean Distance

Exclusive Cluster Analysis

Feature Selection

File

File refers to items that can be manipulated through the file manager on a user’s computer (e.g. Windows Explorer, an archive manager, etc.). Within the Lexos suite, file is used only when referring to functions that involve the user’s file system, such as uploading or downloading.

Flat Cluster Analysis

Hapax Legomena

A term occurring only once in a document or corpus.
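
For illustration, a minimal Python sketch (not Lexos source code) that finds the hapax legomena in an already tokenized document:

```python
from collections import Counter

def hapax_legomena(tokens):
    """Return the terms that occur exactly once in a list of tokens."""
    counts = Counter(tokens)
    return [term for term, count in counts.items() if count == 1]

print(hapax_legomena("the cat sat on the mat".split()))
# ['cat', 'sat', 'on', 'mat']
```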

Hierarchical Cluster Analysis

K-Means Clustering

Lemma

The dictionary headword form of a word. For instance, “cat” is the lemma for “cat”, “cats”, “cat’s”, and “cats’”. Lemmas are generally used to consolidate grammatical variations of the same word as a single term, but they may also be used for spelling variants.
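
A minimal Python sketch of lemma consolidation, assuming a small hand-built lookup table (a real lemmatizer would rely on a morphological dictionary or rule set):

```python
# Hypothetical lookup table for illustration only.
LEMMAS = {"cats": "cat", "cat's": "cat", "cats'": "cat"}

def lemmatize(token):
    """Map a token to its lemma, falling back to the token itself."""
    return LEMMAS.get(token.lower(), token.lower())

print([lemmatize(t) for t in ["cat", "cats", "cat's", "cats'"]])
# ['cat', 'cat', 'cat', 'cat']
```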

Lexomics

The term “lexomics” was originally used to describe the computer-assisted detection of “words” (short sequences of bases) in genomes,* but we have extended it to apply to literature, where lexomics is the analysis of the frequency, distribution, and arrangement of words in large-scale patterns. Using statistical methods and computer-based tools to analyze data retrieved from electronic corpora, lexomic analysis allows us to identify patterns of vocabulary use that are too subtle or diffuse to be perceived easily. We then use the results derived from statistical and computer-based analysis to augment traditional literary approaches including close reading, philological analysis, and source study. Lexomics thus combines information processing and analysis with methods developed by medievalists over the past two centuries. We can use traditional methods to identify problems that can be addressed in new ways by lexomics, and we also use the results of lexomic analysis to help us zero in on textual relationships or portions of texts that might not previously have received much attention.

N-gram

An n-gram is a contiguous sequence of n tokens drawn from a text. The tokens may be characters or larger units (e.g. space-bounded strings typically equivalent to words in Western languages). An n-gram of length one is called a 1-gram or unigram; there are likewise 2-grams (bigrams), 3-grams (trigrams), 4-grams, and 5-grams, while larger n-grams are rarely used. Using n-grams to create a sliding window of characters across a text is one method of counting terms in non-Western languages (or DNA sequences) where spaces or other markers do not delimit token boundaries.
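
A minimal sketch of the sliding-window approach in Python (illustrative, not the Lexos implementation):

```python
def char_ngrams(text, n):
    """Slide a window of n characters across a text, returning every n-gram."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("lexos", 2))  # ['le', 'ex', 'xo', 'os']
```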

Normalization

Overlapping Cluster Analysis

Partitioning Cluster Analysis

Rolling Window Analysis

Scrubbing

Segment

After cutting a text in Lexos, the separated pieces of the text are referred to as segments. However, Lexos treats segments as documents, and they may be referred to as documents when the focus is not on their being part of the entire text.

Similarity

Sparse Matrix

Standard Deviation

Standard Error Test

Stopword

Supervised Learning

Term

A term is the unique form of a token. If the token “cat” occurs two times in a document, the document contains two tokens but only one term, and the term count for “cat” is 2. In computational linguistics, terms are sometimes called “types”, but we avoid this usage for consistency.
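
The distinction between tokens and terms is easy to see in a short Python sketch (illustrative only):

```python
from collections import Counter

tokens = ["the", "cat", "saw", "the", "cat"]  # five tokens
terms = Counter(tokens)                       # three distinct terms

print(len(tokens))   # 5
print(len(terms))    # 3
print(terms["cat"])  # 2 -- the term count for "cat"
```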

Text

Text is a general term used to refer to the objects studied in lexomics, irrespective of their form. It may thus refer to either a file or a document, but it is typically used to refer to the whole work, rather than to smaller segments.

Token

A token is an individual string of characters that may occur any number of times in a document. Tokens can be characters, words, or n-grams (strings of one or more characters or words).

Tokenization

The process of dividing a text into tokens.
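
As a simple illustration (not Lexos’s actual tokenizer), a whitespace-and-punctuation tokenizer in Python:

```python
import re

def tokenize(text):
    """Split a text into word tokens on non-word characters.

    A deliberate simplification: a real tokenizer must decide how to
    handle apostrophes, hyphens, and scripts without word boundaries.
    """
    return re.findall(r"\w+", text.lower())

print(tokenize("The cat's mat."))  # ['the', 'cat', 's', 'mat']
```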

Type

See term.

Unicode

Unsupervised Learning

Word

A word is, in many Western languages, a set of characters bounded by whitespace or punctuation marks, where whitespace refers to one or more spaces, tabs, or newline characters. However, to avoid ambiguity when dealing with many non-Western languages such as Chinese, where a single Hanzi character can be the equivalent of an entire Western word, term is used throughout the Lexos interface and documentation in place of word. There are a few exceptions where “word” is used because it is part of an established phrase, it is less awkward, or because the context refers to the semantic category of words.
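
A two-line Python sketch shows why whitespace-bounded splitting breaks down for languages such as Chinese (the Chinese sentence, meaning roughly “the cat is sitting”, is illustrative):

```python
print("the cat sat".split())  # ['the', 'cat', 'sat'] -- three words
print("猫坐着".split())        # ['猫坐着'] -- one string, not three terms
```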