DHRI@A-State

Distant Reading Glossary

In semantic rather than alphabetical order, because you've got a powerful distant reading tool built into your browser: CTRL/CMD + F.

Metadata:
Data about other data. In text mining (see below), metadata often includes citational information (titles, authors, dates, etc.); in sentiment analysis (see below), metadata about individual words includes part(s) of speech and affective qualities.

Algorithms:
In software, a set of rules to follow when solving a particular kind of problem. 

Corpus:
A collection of written texts. (Plural: "corpora")

Machine learning:
Algorithms by which a computer uses data and examples to "learn" to solve a particular kind of problem more quickly or effectively.

Data mining:
Using machine learning and statistical methods to look for patterns in large datasets. Text mining is a subset of data mining.

Text mining (also called "text analytics"):
Using software to find patterns in a corpus, treating the text as data. 

Big data:
A field of methods for mining datasets so large that traditional data mining software and methods can't handle them.

Distant reading:
Using text mining algorithms and metadata analysis to make sense of large quantities of literary text. The term was coined by Franco Moretti as a counterpoint to the method of literary analysis called "close reading," in a series of essays for the New Left Review, later collected in Graphs, Maps, Trees.

Sentiment analysis:
Using text mining, machine learning, and metadata analysis to determine affective qualities of a text. One of the most common applications of sentiment analysis is companies monitoring off-site reviews of their products.
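The simplest form of the idea is lexicon-based: count words the text shares with lists of "positive" and "negative" terms. A minimal sketch in Python, with tiny hand-picked word lists that are purely illustrative (real tools use large curated lexicons):

```python
# Toy lexicon-based sentiment scorer. The word lists are illustrative
# assumptions, not a real sentiment lexicon.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}

def sentiment_score(text):
    """Return (positive word count) - (negative word count)."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment_score("I love this great product"))   # 2
print(sentiment_score("terrible battery and bad screen"))  # -2
```

A positive score suggests positive sentiment; production systems add negation handling, weighting, and machine-learned classifiers on top of this basic intuition.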

Collocation:
A method of text mining that looks for words that frequently occur in close proximity to one another.
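The most basic version counts adjacent word pairs (bigrams). A short sketch in Python using only the standard library:

```python
from collections import Counter

def bigrams(tokens):
    """Return adjacent word pairs -- the simplest window for collocation."""
    return zip(tokens, tokens[1:])

text = "distant reading uses distant reading tools for distant reading"
tokens = text.split()
counts = Counter(bigrams(tokens))
print(counts.most_common(1))  # [(('distant', 'reading'), 3)]
```

Fuller collocation measures widen the window beyond adjacent pairs and score pairs statistically (e.g. against how often each word occurs on its own), rather than by raw counts.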

Topic modeling:
A series of text mining methods for discovering more abstract topics from a text, primarily by examining collocation. One of the most common methods is Latent Dirichlet Allocation (LDA).

Stemming:
The process of reducing a word to its most basic "stem" or root form, without inflections or affixes. 
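Real stemmers (such as the Porter algorithm) use careful rule sets; as a toy stand-in, here is a naive suffix-stripping sketch in Python:

```python
def naive_stem(word):
    """Strip a few common English suffixes -- a toy stand-in for a real
    stemmer such as the Porter algorithm."""
    for suffix in ("ing", "ers", "er", "ed", "es", "s"):
        # Keep at least three characters so short words survive intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["reading", "readers", "reads", "read"]])
# ['read', 'read', 'read', 'read']
```

All four forms collapse to the stem "read," which is the point of stemming: variant forms of a word get counted together.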

Lemmatizing:
In text analytics, reducing each word to its dictionary form, or lemma, and grouping words that share one. Thus "analyze," "analyzes," "analyzed," and "analyzing" would all be categorized under the lemma "analyze."
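Unlike stemming, lemmatization relies on a dictionary (and often the word's part of speech) rather than suffix rules. A minimal sketch in Python with a hand-built lookup table, which is purely illustrative:

```python
# Toy lemma table -- real lemmatizers consult a full dictionary and the
# word's part of speech.
LEMMAS = {
    "analyzes": "analyze",
    "analyzed": "analyze",
    "analyzing": "analyze",
}

def lemmatize(word):
    """Return the dictionary form of a word, or the word itself if unknown."""
    return LEMMAS.get(word, word)

print([lemmatize(w) for w in ["analyzing", "analyzes", "analyze"]])
# ['analyze', 'analyze', 'analyze']
```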

Stopwords:
In text analytics, a list of words excluded from analysis. Much text analytics software includes a basic list of English stopwords, including the most common words (the, a, this) and sometimes pronouns (he, she, I). Often researchers add to or edit their stopwords list to get more useful data.
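Filtering a stopwords list out of a token stream is a one-liner in most languages. A sketch in Python, with a deliberately tiny list standing in for the hundreds of words a real list contains:

```python
# A deliberately tiny stopwords list for illustration; real lists are
# much longer and often customized per project.
STOPWORDS = {"the", "a", "this", "he", "she", "i"}

def remove_stopwords(tokens):
    """Drop stopwords (case-insensitively) from a list of tokens."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords("this is the text she analyzed".split()))
# ['is', 'text', 'analyzed']
```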

Natural Language Processing:
A subfield of computer science that considers how to process and analyze large amounts of human (i.e. natural) language. Many natural language processing methods and advances overlap with text analytics and sentiment analysis.

Term Frequency - Inverse Document Frequency (TF-IDF):
"Term frequency" measures how often a word appears in a single document (number of times the word appears divided by the document's word count). "Inverse document frequency" measures how rare that word is across a larger corpus (the total number of documents divided by the number of documents that contain the word, usually taken as a logarithm). TF-IDF multiplies the two, so a word scores highly when it is frequent in its own document but rare in the corpus overall.
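The definition above translates almost directly into code. A minimal sketch in Python (one common variant of the formula; libraries often add smoothing terms):

```python
import math


def tf_idf(word, doc, corpus):
    """TF-IDF for `word` in `doc`, relative to `corpus` -- a list of
    tokenized documents, `doc` among them."""
    tf = doc.count(word) / len(doc)
    containing = sum(1 for d in corpus if word in d)
    idf = math.log(len(corpus) / containing)  # assumes word occurs somewhere
    return tf * idf


docs = [
    "the whale surfaced near the ship".split(),
    "the ship sailed on".split(),
    "call me Ishmael".split(),
]
# "whale" appears in only 1 of 3 documents, so it outscores "ship",
# which appears in 2 of 3.
print(tf_idf("whale", docs[0], docs) > tf_idf("ship", docs[0], docs))  # True
```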

Bag-of-Words:
A model of text analytics that disregards grammar and syntax, looking at a text as a collection of words instead of considering connections between them or sentences/phrases more broadly. The bag-of-words model favors analytics such as term frequency/inverse document frequency; it makes collocation, natural language processing, and sentiment analysis difficult or impossible. 
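In practice a bag of words is just a table of word counts. A sketch in Python showing both the model and its blind spot:

```python
from collections import Counter

def bag_of_words(text):
    """Reduce a text to word counts, discarding order and syntax."""
    return Counter(text.lower().split())

# Word order is gone, so these two very different sentences look identical:
print(bag_of_words("The dog bit the man") == bag_of_words("The man bit the dog"))
# True
```

That equality is exactly why the model suits frequency-based analytics but loses the syntactic information that collocation and much natural language processing depend on.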

Inter-document Similarity:
A measure of how similar two documents are. The most common metric (and the one used on SameDiff) is called cosine similarity, which typically uses a bag-of-words model, comparing the two documents by quantity and frequency of words.
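Cosine similarity treats each document's word counts as a vector and measures the angle between them: 1.0 means identical word distributions, 0.0 means no words in common. A minimal bag-of-words sketch in Python:

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between two texts under a bag-of-words model."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

print(cosine_similarity("distant reading of texts", "close reading of texts"))
# 0.75
```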

Stylometrics:
The application of linguistic stylistic analysis, often used to determine authorship (for instance, by comparing the 50 most common words in a known author's writing to those in a text of unknown authorship) and sometimes used to make broader categorizations of texts (such as Hacker Factor's Gender Guesser, which uses stylometrics to distinguish "masculine" from "feminine" writing).
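A crude version of the most-common-words comparison can be sketched in Python; real stylometric work compares relative frequencies statistically, but the overlap of top-word lists conveys the idea (the sample texts here are artificial):

```python
from collections import Counter

def top_words(text, n=50):
    """The n most common words in a text -- a crude stylometric fingerprint."""
    return [w for w, _ in Counter(text.lower().split()).most_common(n)]

def profile_overlap(text_a, text_b, n=50):
    """Share of the two texts' top-n words that match (0.0 to 1.0)."""
    a, b = set(top_words(text_a, n)), set(top_words(text_b, n))
    return len(a & b) / n

known   = "the the the sea sea whale"
unknown = "the the the sea sea storm"
# Two of the three top words match, suggesting (weakly!) a similar style.
print(profile_overlap(known, unknown, n=3))
```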

R:
One of the most common statistical programming languages; a great deal of text mining is conducted in R.
