In the Margins

Glossary

This page defines the terms used within the Lexos suite and disambiguates terms drawn from natural language, programming languages, and linguistic analysis.


Supervised Learning


Unsupervised Learning



Character
In Lexos, a character is any individual symbol. The letters that make up the Roman alphabet are known as characters in Lexos, as are the Hanzi used in Chinese writing. 

Document
In Lexos, a document is any collection of words (known as terms in Lexos) or characters gathered to form a single item within the Lexos tool. A document is distinct from a file: document refers specifically to the items manipulated within the Lexos software suite, whereas file refers to the items that are uploaded from or downloaded to a user's device.

Text
Text is a general term used to refer to any collection of terms or characters; it encompasses both files and documents. 

File
File refers to items that can be manipulated through the file manager on a user's computer, e.g., Windows Explorer, Archive Manager, etc. File is used in the Lexos suite only when referring to functions that involve the user's file system, such as uploading or downloading.

n-gram
An n-gram is a sequence of n contiguous units, counted in one of two ways: word n-grams or character n-grams. A 1-word n-gram (a 1-gram term) was, and perhaps still is, the standard default when counting English words, whereas 2-grams count the occurrences of pairs of words: "the dog", "dog ran", "ran away", etc. Character n-grams are runs of contiguous characters within a window of fixed size, shown here for English 4-grams: "the ", "he d", "e do", " dog", etc. Sliding a window of characters through the text provides one method for counting terms in non-Western languages, or for counting n-grams in DNA or protein sequences: e.g., the 4-grams of "ATCTTGCC" are [ATCT, TCTT, CTTG, TTGC, ...].
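
The two kinds of n-gram described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not the code Lexos itself uses: the function names word_ngrams and char_ngrams are hypothetical, and words are assumed to be separated by whitespace.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return every run of n consecutive whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_ngrams(text, n):
    """Return every n-character window, sliding one character at a time."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Word 2-grams and character 4-grams for the examples in the definition.
pairs = word_ngrams("the dog ran away", 2)
windows = char_ngrams("the dog", 4)

# Counting is then a matter of tallying the resulting list.
dna_counts = Counter(char_ngrams("ATCTTGCC", 4))
```

Here pairs holds the three word pairs listed in the definition, windows holds the four character windows, and dna_counts maps each DNA 4-gram to the number of times it occurs.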

Word
A word is, in many Western languages, a set of characters bounded by whitespace, where whitespace refers to one or more spaces, tabs, or newlines. However, to avoid ambiguity when dealing with many non-Western languages such as Chinese, where a single Hanzi character can be the equivalent of an entire Western word, term is used throughout Lexos in place of word.
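
The whitespace rule, and the reason it breaks down for Chinese, can be seen in a short Python sketch. The function names below are hypothetical and the Chinese sample sentence is only an illustration; treating every character as a term is one possible approach, not necessarily the one Lexos takes.

```python
def split_on_whitespace(text):
    """Split on any run of spaces, tabs, or newlines."""
    return text.split()


def split_into_characters(text):
    """Treat every non-whitespace character as its own term."""
    return [ch for ch in text if not ch.isspace()]


# Spaces, tabs, and newlines all act as word boundaries.
english = split_on_whitespace("the dog\tran\n\naway")

# A Chinese sentence has no whitespace, so the same rule yields one "word".
chinese_as_words = split_on_whitespace("我喜欢狗")

# Splitting by character instead yields one term per Hanzi.
chinese_as_chars = split_into_characters("我喜欢狗")
```

The English string yields four words, while the Chinese sentence comes back as a single item under the whitespace rule and as four separate terms when split by character.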

Segment
After cutting a text in Lexos, the separated pieces of said text are referred to as segments. 

RollingWindow Analysis

Lexomics
The term “lexomics” was originally used to describe the computer-assisted detection of “words” (short sequences of bases) in genomes, but we have extended it to apply to literature, where lexomics is the analysis of the frequency, distribution, and arrangement of words in large-scale patterns. Using statistical methods and computer-based tools to analyze data retrieved from electronic corpora, lexomic analysis allows us to identify patterns of vocabulary use that are too subtle or diffuse to be perceived easily. We then use the results derived from statistical and computer-based analysis to augment traditional literary approaches including close reading, philological analysis, and source study. Lexomics thus combines information processing and analysis with methods developed by medievalists over the past two centuries. We can use traditional methods to identify problems that can be addressed in new ways by lexomics, and we also use the results of lexomic analysis to help us zero in on textual relationships or portions of texts that might not previously have received much attention.

Token
A token is the individual unit being counted or analyzed. Token can refer to a term, a character, or any other specific language form that can be picked out of a text.

Term
A term is a sequence of one or more characters that carries some meaning.

Type

Lemma

Stopword

Unicode

Dendrogram