Glossary

This page is intended to provide definitions for the terms used within the Lexos suite, as well as to disambiguate terms drawn from natural language, programming languages, and linguistic analysis. New entries are being added on an ongoing basis.


Agglomerative Hierarchical Clustering
Agglomerative Hierarchical Clustering is a method of bottom-up analysis wherein each document begins as its own cluster, after which the closest clusters are repeatedly merged until a single cluster contains all documents.
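
A minimal sketch of this process in Python, using SciPy's clustering routines (the term counts below are invented for illustration):

```python
# A minimal sketch of agglomerative hierarchical clustering with SciPy.
# The term-count vectors below are invented for illustration.
import numpy as np
from scipy.cluster.hierarchy import linkage

# Rows are documents, columns are term counts.
counts = np.array([
    [4, 0, 1, 3],
    [3, 1, 0, 2],
    [0, 5, 2, 0],
])

# Each document starts as its own cluster; linkage() merges the two
# closest clusters at every step until one cluster remains.
Z = linkage(counts, method="average", metric="euclidean")
print(Z)  # each row records one merge and its branch height
```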

Average Linkage
Also called UPGMA (Unweighted Pair Group Method with Arithmetic Mean), this linkage defines the distance between two clusters as the average of all pairwise distances between their members, making it an unweighted compromise between the complete and single linkages.
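
A small sketch illustrating the idea with invented points: the average-linkage distance between two clusters is simply the mean of all pairwise distances between their members.

```python
# Hedged sketch: under average linkage (UPGMA), the distance between two
# clusters is the mean of all pairwise distances between their members.
import numpy as np
from scipy.spatial.distance import euclidean

cluster_a = [np.array([0.0, 0.0]), np.array([1.0, 0.0])]
cluster_b = [np.array([4.0, 3.0]), np.array([5.0, 3.0])]

pairwise = [euclidean(a, b) for a in cluster_a for b in cluster_b]
print(sum(pairwise) / len(pairwise))  # the average-linkage distance
```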

Bray-Curtis Distance
The Bray-Curtis dissimilarity is a standardized form of Manhattan distance. It is not a true metric; rather, it is the proportion of values not shared between two points, or, equivalently, the sum of absolute differences divided by the sum of all values. Points farther from the origin have a greater effect on the proportion.
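
A minimal sketch with invented count vectors, checking the formula against SciPy's implementation:

```python
# Hedged sketch: Bray-Curtis dissimilarity as the sum of absolute
# differences divided by the sum of all values.
import numpy as np
from scipy.spatial.distance import braycurtis

u = np.array([1, 5, 3, 3, 1])
v = np.array([0, 2, 3, 1, 0])

manual = np.abs(u - v).sum() / (u + v).sum()
print(manual, braycurtis(u, v))  # both print the same proportion
```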

Canberra Distance
Canberra distance is a weighted version of the Manhattan distance. Instead of summing raw differences, it sums the ratio of each absolute difference to the combined magnitude of the two components. This weighting makes the metric very sensitive to differences between components whose values are near zero.
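
A minimal sketch with invented vectors, comparing the formula with SciPy's implementation:

```python
# Hedged sketch: Canberra distance sums the ratio of each absolute
# difference to the combined magnitude of the two components.
import numpy as np
from scipy.spatial.distance import canberra

u = np.array([1.0, 5.0, 3.0])
v = np.array([2.0, 5.0, 0.0])

manual = sum(abs(a - b) / (abs(a) + abs(b)) for a, b in zip(u, v) if a or b)
print(manual, canberra(u, v))  # components that are both zero contribute 0
```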

Character
A character is any individual symbol. The letters that make up the Roman alphabet are characters, as are non-alphabetic symbols such as the Hanzi used in Chinese writing. In Lexos, the term character generally refers to countable symbols.

Chebyshev Distance
The Chebyshev distance ignores all but the greatest component difference between two vectors. It is related to the Euclidean and Manhattan distances: where Euclidean distance permits continuous movement in any direction and Manhattan distance permits only orthogonal (grid-wise) movement, Chebyshev distance corresponds to movement in eight directions (orthogonal and diagonal), like a king on a chessboard. It is reliable only in very niche circumstances.
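
A minimal sketch with invented vectors, checking the by-hand maximum against SciPy:

```python
# Hedged sketch: Chebyshev distance keeps only the largest
# component-wise difference between two vectors.
import numpy as np
from scipy.spatial.distance import chebyshev

u = np.array([1, 5, 3, 3, 1])
v = np.array([0, 2, 3, 1, 0])

print(np.abs(u - v).max())   # 3, computed by hand
print(chebyshev(u, v))       # 3, computed by SciPy
```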

Community Detection



Complete Linkage
Also called the ‘farthest neighbor’ algorithm, this linkage defines the distance between two clusters as the greatest distance between any pair of their members. It tends to produce compact, roughly spherical clusters of similar diameter.

Correlation Distance
Correlation distance is equivalent to Cosine distance computed after the vectors have been shifted by (centered on) their means. Correlation distance performs well in very high dimensions with few null values.
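
A small sketch with invented vectors showing the equivalence described above:

```python
# Hedged sketch: correlation distance is cosine distance applied to
# mean-centered vectors.
import numpy as np
from scipy.spatial.distance import correlation, cosine

u = np.array([1.0, 5.0, 3.0, 3.0, 1.0])
v = np.array([0.0, 2.0, 3.0, 1.0, 0.0])

print(correlation(u, v))                   # SciPy's correlation distance
print(cosine(u - u.mean(), v - v.mean()))  # same value via mean-centering
```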

Cosine Distance
Cosine distance measures the angle formed by two vectors drawn from the origin; it judges only the orientation of points in space, not their magnitudes. It is related to Euclidean distance: for vectors normalized to unit length, the squared Euclidean distance equals twice the cosine distance. Cosine distance is best for working in very high dimensions, especially if there are many null values in the vectors.
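
A minimal sketch with an invented vector, showing that magnitude does not matter:

```python
# Hedged sketch: cosine distance is 1 minus the cosine of the angle
# between the two vectors, ignoring their magnitudes.
import numpy as np
from scipy.spatial.distance import cosine

u = np.array([1.0, 5.0, 3.0, 3.0, 1.0])
v = 2 * u  # same direction, twice the magnitude

manual = 1 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(manual, cosine(u, v))  # both ~0: the orientation is identical
```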

Cosine Similarity
Cosine similarity is the complement of cosine distance: the cosine of the angle between two vectors, equal to 1 minus the cosine distance. Identically oriented vectors have a similarity of 1, and orthogonal vectors have a similarity of 0.

Cutting
The process of creating multiple documents/segments from a source file.

Dendrogram
The dendrogram is a method of visualizing how closely related documents are via hierarchical clustering analysis. The name derives from ‘dendron’, Greek for ‘tree’, and dendrograms are indeed rooted trees (a type of mathematical object). Each document, or leaf, of the tree is connected to every other by a series of branches. The length of each branch is the distance from the center of the leaves of that branch to the next closest branch. One popular use for dendrograms is the so-called ‘tree of life’, which shows how various species, genera, families, etc. are related.
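
A minimal sketch of how such a tree might be drawn with SciPy and matplotlib (the term counts and document labels are invented):

```python
# Hedged sketch: drawing a dendrogram from invented term-count vectors.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

counts = np.array([
    [4, 0, 1, 3],
    [3, 1, 0, 2],
    [0, 5, 2, 0],
    [1, 4, 2, 1],
])

Z = linkage(counts, method="average")
dendrogram(Z, labels=["doc1", "doc2", "doc3", "doc4"])
plt.show()  # leaves are documents; branch heights are cluster distances
```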

Dimensionality Reduction



Distance Metric
The Distance Metric is the method used to compare two documents. Document data is stored in vectors, where each index contains the number of times a specific term appears. For example, the sentence, “The buffalo from Buffalo who buffalo buffalo from Buffalo buffalo buffalo from Buffalo” would have the vector <The:1, buffalo:5, from:3, Buffalo:3, who:1>. Comparing this against a ‘sentence’ with no terms is equivalent to finding the distance between the vectors <0,0,0,0,0> and <1,5,3,3,1>. The way the comparison is measured (distance between two points, number of words different, etc.) is the distance metric.
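
A minimal sketch of how such a vector might be built and compared, using the glossary's own example sentence:

```python
# Hedged sketch: building the term-count vector from the glossary's
# example sentence, then comparing it against an empty 'sentence'.
from collections import Counter
import math

sentence = ("The buffalo from Buffalo who buffalo buffalo "
            "from Buffalo buffalo buffalo from Buffalo")
vector = Counter(sentence.split())
print(vector)  # {'buffalo': 5, 'from': 3, 'Buffalo': 3, 'The': 1, 'who': 1}

# Euclidean distance to the all-zero vector <0,0,0,0,0>:
print(math.sqrt(sum(count ** 2 for count in vector.values())))
```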

Divisive Clustering
This top-down clustering method assumes all documents are in one cluster, then uses an algorithm to divide them until each document is in its own cluster.

Document
In Lexos, a document is any collection of words (known as terms in Lexos) or characters collected together to form a single item within the Lexos tool. A document is distinct from a file in that the term document refers specifically to the items manipulated within the Lexos software suite, as opposed to file, which refers to the items that are either uploaded from or downloaded to a user’s device.

Edit Distance



Euclidean Distance
The Euclidean distance is the 'intuitive' way of measuring distance: the length of a straight line between two points. This metric is one of the most widely used due to its reliability and simplicity.
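
A minimal sketch using a familiar 3-4-5 triangle:

```python
# Hedged sketch: Euclidean distance as the straight-line length
# between two points.
import math
from scipy.spatial.distance import euclidean

u = [0, 0]
v = [3, 4]

print(math.sqrt((3 - 0) ** 2 + (4 - 0) ** 2))  # 5.0, by the formula
print(euclidean(u, v))                          # 5.0, via SciPy
```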

Exclusive Cluster Analysis

Feature Selection



File
File refers to items that can be manipulated through the file manager on a user’s computer (e.g. Windows Explorer or an archive manager). File is only used in the Lexos suite when referring to functions that involve the user’s file system, such as uploading or downloading.

Flat Cluster Analysis



Hapax Legomena
A hapax legomenon (plural: hapax legomena) is a term that occurs only once within a document or corpus.
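
A minimal sketch of how hapax legomena might be found, using an invented text:

```python
# Hedged sketch: finding hapax legomena (terms with a count of 1)
# in a small invented text.
from collections import Counter

tokens = "the cat sat on the mat with another cat".split()
counts = Counter(tokens)

hapax = [term for term, count in counts.items() if count == 1]
print(hapax)  # ['sat', 'on', 'mat', 'with', 'another']
```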

Hamming Distance
Hamming distance is similar to Jaccard distance, but it also counts shared absences, and it considers only whether each component is present, ignoring abundance entirely. Its primary function is to count the number of components (whether null or valued) at which two vectors differ. This makes it the best test for determining the similarity between vocabularies. In Lexos, the Hamming distance is reported as a proportion rather than a raw count.
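
A minimal sketch with invented vectors, showing the raw count alongside the proportion:

```python
# Hedged sketch: Hamming distance counts the components at which two
# vectors differ; SciPy reports it as a proportion, as Lexos does.
import numpy as np
from scipy.spatial.distance import hamming

u = np.array([1, 5, 3, 0, 1])
v = np.array([1, 2, 3, 1, 0])

print(np.sum(u != v))   # 3 differing components (raw count)
print(hamming(u, v))    # 0.6, the same count as a proportion of 5
```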

Hierarchical Cluster Analysis
Hierarchical Clustering is a method of bottom-up analysis wherein the distance between each pair of documents is calculated and stored in a matrix, which is then reduced by iterating a linkage algorithm. This reduction yields the branch heights and divisions that are represented in a dendrogram. Because it involves no random initialization, this method of analysis produces consistent results from run to run.
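
A minimal sketch of the matrix being reduced, using SciPy and invented term counts:

```python
# Hedged sketch: the pairwise distance matrix that hierarchical
# clustering iteratively reduces, and the merge table it yields.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

counts = np.array([
    [4, 0, 1],
    [3, 1, 0],
    [0, 5, 2],
])

distances = pdist(counts, metric="euclidean")
print(squareform(distances))  # the full document-by-document matrix

Z = linkage(distances, method="average")
print(Z)  # each row: the two clusters merged and the branch height
```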

Jaccard Distance
Derived from Bray-Curtis, the Jaccard distance is the ratio of the size of the symmetric difference to the size of the union of the points’ vector components. Unlike the Bray-Curtis dissimilarity, Jaccard is a true metric. Its primary use is to measure the proportion of dimensions that are unique to one vector or the other.
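
A minimal sketch on invented presence/absence vectors:

```python
# Hedged sketch: Jaccard distance on presence/absence vectors is the
# size of the symmetric difference over the size of the union.
import numpy as np
from scipy.spatial.distance import jaccard

u = np.array([1, 1, 0, 1, 0], dtype=bool)  # terms present in document 1
v = np.array([1, 0, 0, 1, 1], dtype=bool)  # terms present in document 2

manual = np.sum(u ^ v) / np.sum(u | v)
print(manual, jaccard(u, v))  # both 0.5: two of four union terms differ
```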

K-Means Clustering


Keepwords
Keepwords are the opposite of stopwords. When scrubbing with the keepwords option, all terms except keepwords will be deleted. See stopwords.
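
A minimal sketch of both options on an invented token list (a sketch of the idea, not the Lexos implementation itself):

```python
# Hedged sketch: scrubbing with stopwords deletes the listed terms,
# while the keepwords option deletes everything *except* the listed terms.
tokens = "the cat sat on the mat".split()

stopwords = {"the", "on"}
print([t for t in tokens if t not in stopwords])  # ['cat', 'sat', 'mat']

keepwords = {"cat", "mat"}
print([t for t in tokens if t in keepwords])      # ['cat', 'mat']
```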

Lemma
The dictionary headword form of a word. For instance, “cat” is the lemma for “cat”, “cats”, “cat’s”, and “cats’”. Lemmas are generally used to consolidate grammatical variations of the same word as a single term, but they may also be used for spelling variants.

Lexomics
The term “lexomics” was originally used to describe the computer-assisted detection of “words” (short sequences of bases) in genomes,* but we have extended it to apply to literature, where lexomics is the analysis of the frequency, distribution, and arrangement of words in large-scale patterns. Using statistical methods and computer-based tools to analyze data retrieved from electronic corpora, lexomic analysis allows us to identify patterns of vocabulary use that are too subtle or diffuse to be perceived easily. We then use the results derived from statistical and computer-based analysis to augment traditional literary approaches including close reading, philological analysis, and source study. Lexomics thus combines information processing and analysis with methods developed by medievalists over the past two centuries. We can use traditional methods to identify problems that can be addressed in new ways by lexomics, and we also use the results of lexomic analysis to help us zero in on textual relationships or portions of texts that might not previously have received much attention.

Manhattan Distance
The Manhattan, or taxicab, distance is so named because, unlike Euclidean distance, which 'goes as the crow flies', it defines length as the shortest path between two points along a grid, more comparable to a taxicab's route through Manhattan. This metric is equivalent to measuring the area between two distribution curves (cf. Riemann sums). Manhattan distance is well suited to points with fewer dimensions.
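
A minimal sketch with invented vectors:

```python
# Hedged sketch: Manhattan (taxicab) distance sums the absolute
# component-wise differences, like block-by-block travel on a grid.
import numpy as np
from scipy.spatial.distance import cityblock

u = np.array([1, 5, 3, 3, 1])
v = np.array([0, 2, 3, 1, 0])

print(np.abs(u - v).sum())  # 7, by the formula
print(cityblock(u, v))      # 7, via SciPy
```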

N-gram
An n-gram is a sequence of n consecutive tokens. The tokens can be characters or larger units (e.g. space-bounded strings typically equivalent to words in Western languages). An n-gram of length one is called a 1-gram or unigram; there are also 2-grams (bigrams), 3-grams (trigrams), 4-grams, and 5-grams. Larger n-grams are rarely used. Using n-grams to create a sliding window of characters over a text is one method of counting terms in non-Western languages (or DNA sequences) where spaces or other markers are not used to delimit token boundaries.
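
A minimal sketch of such a sliding window over a DNA-like string:

```python
# Hedged sketch: a sliding window of character n-grams, useful when
# whitespace does not delimit token boundaries.
def char_ngrams(text: str, n: int) -> list[str]:
    """Return every substring of length n, sliding one character at a time."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("GATTACA", 3))
# ['GAT', 'ATT', 'TTA', 'TAC', 'ACA']
```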

Normalization

Overlapping Cluster Analysis

Partitioning Cluster Analysis

Rolling Window Analysis

Scrubbing
Scrubbing is the process of cleaning a text before analysis, for example by removing punctuation or by deleting unwanted terms such as stopwords. See stopwords and keepwords.

Segment
After cutting a text in Lexos, the separated pieces of the text are referred to as segments. However, segments are treated by Lexos as documents and they may be referred to as documents when the focus is not on their being a part of the entire text.

Similarity

Single Linkage
Also called the ‘nearest neighbor’ algorithm, this linkage defines the distance between two clusters as the smallest distance between any pair of their members. It tends to produce ellipsoidal, chain-like clusters.

Sparse Matrix
A sparse matrix is a matrix that contains many null values.

Squared Euclidean Distance
The Squared Euclidean distance is, as the name suggests, the Euclidean distance multiplied by itself. Squaring places progressively greater weight on points that are farther apart, and is thus very useful for sets of points that are extremely close together. The Squared Euclidean distance is also called the 'quadrance' in geometry.
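
A minimal sketch contrasting the two values:

```python
# Hedged sketch: squared Euclidean distance is the Euclidean distance
# squared, so widely separated points are weighted progressively more.
from scipy.spatial.distance import euclidean, sqeuclidean

u = [0, 0]
v = [3, 4]

print(euclidean(u, v))     # 5.0
print(sqeuclidean(u, v))   # 25.0
```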

Standard Deviation
The standard deviation is a measure of a dataset’s diversity: it describes how widely the values in the dataset are spread around their mean.

Standard Error Test



Standardized Euclidean Distance
The Standardized Euclidean distance weights each component of the Euclidean distance by 1/si, the reciprocal of the standard deviation of the i-th component across all vectors. This adjustment scales the components so that each contributes with a variance of 1, which is most useful for controlling factors that would otherwise affect the data disproportionately. If si instead represented the average abundance proportion of a component, this would be the chi-squared distance.
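
A minimal sketch with invented data in which one component's scale would otherwise dominate:

```python
# Hedged sketch: standardized Euclidean distance divides each squared
# component difference by that component's variance across all vectors.
import numpy as np
from scipy.spatial.distance import seuclidean

data = np.array([
    [1.0, 100.0],
    [2.0, 250.0],
    [3.0, 400.0],
])
variances = data.var(axis=0, ddof=1)  # per-component variance

u, v = data[0], data[1]
manual = np.sqrt(np.sum((u - v) ** 2 / variances))
print(manual, seuclidean(u, v, variances))  # large-scale column no longer dominates
```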

Stopwords
A stopword is a term that is deleted during scrubbing. The stopword feature can be used to remove names, common terms, etc. from active files. See keepwords.

Term
A term is the unique form of a token. If a token “cat” occurs two times in a document, the term count for “cat” is 2. In computational linguistics, terms are sometimes called “types”, but we avoid this usage for consistency.

Text
Text is a general term used to refer to the objects studied in lexomics, irrespective of their form. It may thus refer to either files or documents, but it is typically used for a whole work rather than for its smaller segments.

Token
A token is an individual string of characters that may occur any number of times in a document. Tokens can be characters, words, or n-grams (strings of one or more characters or words).

Tokenization
The process of dividing a text into tokens.
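
A minimal sketch of whitespace tokenization, also showing the token/term distinction defined above:

```python
# Hedged sketch: whitespace tokenization, and the token/term distinction.
from collections import Counter

text = "the cat saw the other cat"
tokens = text.split()     # 6 tokens
terms = Counter(tokens)   # 4 unique terms with their counts

print(len(tokens))  # 6
print(terms)        # Counter({'the': 2, 'cat': 2, 'saw': 1, 'other': 1})
```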

Type
See term.

Unicode

Unsupervised Learning



Weighted Linkage
Also called WPGMA (Weighted Pair Group Method with Arithmetic Mean), this linkage is a weighted hybrid of the single and complete linkages.

Word
A word is, in many Western languages, a set of characters bounded by whitespace or punctuation marks, where whitespace refers to one or more spaces, tabs, or new-line inserts. However, to avoid ambiguity when dealing with many non-Western languages such as Chinese, where a single Hanzi character can refer to the equivalent of an entire Western word, term is used throughout the Lexos interface and documentation in place of word. There are a few exceptions where “word” is used because it is part of an established phrase, it is less awkward, or because the context refers to the semantic category of words.