Distant Reading
We’re overwhelmed with text in this digital age. There’s simply too much of it: too much to read, too much to skim, certainly too much to make sense of. Google processes 20 petabytes of data per day, which in text terms is equivalent to 40% of all human-written works, ever (Baez 2015). The average Internet user encounters so much text, even just in the form of privacy policies, that reading every agreement accepted with a click in a single year would take 76 full work days (Madrigal 2012).
Given the universality and reach of this problem, perhaps it is not surprising that researchers from all over the university, from biology to linguistics to history to business to computer science, have been attracted to computerized text analysis tools.
Text analysis drives natural language processing advances that make tools such as Siri and Alexa possible. Businesses mine the web using sentiment analysis tools to understand how their products are being received. Social media data has been mined for trends in influenza (Corley, Cook, Mikler, and Singh 2010), adverse drug reactions (Sarker et al. 2015), and e-cigarettes’ role in smoking cessation (Aphinyanaphongs 2016).
Outside of academia, social media mining has proven more insidious. Police forces worldwide have mined social media to monitor and quash protests, including during the 2011 revolution in Egypt (Salem 2014), a terminated “Jasmine Revolution” protest in China (Poell 2014), and the 2012 Quebec student strike (Thorburn 2014). The ACLU maintains thousands of pages of documentation of police departments deploying forces based on monitoring of hashtags including #BlackLivesMatter, #DontShoot, and #PoliceBrutality.
For humanists, of course, the massive influx of digital text, and our access to it, presents particular opportunities. The most visible and popular of these is distant reading: treating text as data and, together with the metadata of text production, finding trends that help us understand large quantities of fiction.
Agenda of This Workshop
This workshop will introduce you to some of those researchers’ approaches, with a particular focus on how digital humanists have approached text analysis. We’ll discuss how we might use — and analyze and interpret and futurize and hypothesize, because hermeneutics are our strongest domain — the methods of text analysis available to us.
Ultimately, you will select a corpus of words to analyze using one of the tools we discuss. You will begin with a data-exploration tool and come up with research questions that you might be able to answer with a more complex or bespoke tool. You'll pick a visualization you've created to discuss with the rest of the group. Finally, we'll take a humanistic turn and save the last half hour or so of the workshop to trouble the tools, methodologies, and results we’ve discussed.
Recent Traces of Computerized Text Analysis
You may have seen digital humanist text analysis in the news recently: when an unnamed Trump administration official wrote an anonymous New York Times op-ed, the Internet was atwitter with text analysts comparing officials' public writing style with the op-ed's. At least two of these analyses identified Mike Pompeo as the likely author. Analyst David Robinson came to his conclusion not because of the author’s favoring of the word "lodestar," an unusual word pundits had identified as stylistically noteworthy, but because of the use of the word "malign" among other language features. Other analyses pointed to Elaine Chao or to somebody else in the State Department.
This area of study, known as stylometry (or stylometrics), is a growing subfield of computational linguistics. Similar tools were used to identify J.K. Rowling as the author of The Cuckoo’s Calling, which she published under the pseudonym Robert Galbraith.
Linguistic corpus analysis tools are also behind the Washington Post's data essay looking at State of the Union addresses from 1993-2019, identifying words in each that hadn't previously been spoken in States of the Union. New words in 2019 include “bloodthirsty,” “fentanyl,” “screeched,” and “venomous” (interpret away!).
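The "words never spoken before" idea comes down to comparing each text's vocabulary with everything that came earlier. Here is a minimal sketch of that logic. It is not the Washington Post's actual method, and the three miniature "speeches" and the function names are made up for illustration.

```python
import re

def words_in(text):
    """Return the set of lowercased words (letters and apostrophes) in a text."""
    return set(re.findall(r"[a-z']+", text.lower()))

def first_time_words(speeches):
    """Map each label to the words that appear in no earlier speech.

    `speeches` is a list of (label, text) pairs in chronological order.
    """
    seen = set()
    novel = {}
    for label, text in speeches:
        vocabulary = words_in(text)
        novel[label] = sorted(vocabulary - seen)  # words new to this speech
        seen |= vocabulary                        # remember them for later speeches
    return novel

# Hypothetical stand-ins for real speeches (not actual quotations).
speeches = [
    ("year 1", "The state of our union is strong."),
    ("year 2", "Our union is strong and our economy is growing."),
    ("year 3", "The economy is growing and the future is bright."),
]

novel = first_time_words(speeches)
for label, words in novel.items():
    print(label, words)
```

Note that this sketch treats "screech" and "screeched" as different words. Whether that is what you want is an interpretive decision, not just a technical one.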
Collocation, Lemmatizing, and WordSmith Tools
Such text analyses are grounded in the most basic features of literary text analysis: Counting word frequencies in a given text, finding unusual two- and three-word sequences (bigrams and trigrams), and presenting those findings for human readers. Software intended for this kind of analysis long predates the Internet and even predates graphical user interfaces.
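To make the counting concrete, here is a minimal Python sketch of word, bigram, and trigram frequencies using only the standard library. The sample sentence and the simple tokenizer are assumptions for illustration. Finding genuinely "unusual" sequences usually also means comparing these counts against a reference corpus, which this sketch does not attempt.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample text; replace with the contents of your own corpus.
sample = "The cat sat on the mat. The cat saw the dog on the mat."

tokens = tokenize(sample)
word_counts = Counter(tokens)
bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

print(word_counts.most_common(3))
print(bigram_counts.most_common(3))
print(trigram_counts.most_common(3))
```

Notice that the most frequent word is "the." Deciding what to do with such function words (ignore them, or treat them as stylistic evidence) is one of the first choices any text analyst has to make.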
At one point, in fact, literary scholars produced concordances by hand: Geoffrey Rockwell and Stéfan Sinclair (2016), creators of the Voyant Tools suite we’ll be using, cite Parrish’s 1959 A Concordance to the Poems of Matthew Arnold, which described a time-honored process of “cutting out lines of printed text and pasting them on 3-by-5-inch slips of paper, each bearing an index word written in the upper left-hand corner [...] Sixty-seven people (three of whom died during the enterprise) took part in the truly heroic labor of cutting and pasting, alphabetizing the 211,000 slips, and proofreading” (50). So get out your index cards, ready your scissors... kidding. Instead, thank your Microsoft and Google overlords; this kind of concordance is trivially easy to create at this point in computational linguistic history.
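How easy? A keyword-in-context concordance, the digital descendant of those index slips, fits in about a dozen lines of Python. This is a rough sketch rather than how Voyant or any other tool implements it. The function name, the `width` parameter, and the sample text (adapted from the opening of Arnold's "Dover Beach") are choices made for illustration.

```python
import re

def concordance(text, keyword, width=30):
    """Return one keyword-in-context line for each occurrence of `keyword`."""
    text = " ".join(text.split())  # collapse line breaks and repeated spaces
    pattern = r"\b" + re.escape(keyword) + r"\b"  # match whole words only
    lines = []
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}}[{match.group()}]{right}")
    return lines

# Sample text adapted from the opening of "Dover Beach".
sample = ("The sea is calm tonight. The tide is full, the moon lies fair "
          "upon the straits; on the French coast the light gleams and is gone.")

lines = concordance(sample, "the")
for line in lines:
    print(line)
```

Each printed line is one "slip": the index word in brackets, with the surrounding context on either side.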
The first concordancing software debuted in the late 1960s at the University of Toronto (Rockwell and Sinclair 2016, 53). Beginning in 1996, concordancing software that “lemmatizes” text (i.e. reduces inflected words to their dictionary base forms, or lemmas, so that “ran” and “running” are both counted as “run”) has been available for researchers and the general public to download online. Lexically’s WordSmith Tools has been sold for the same 50-pound price since 1996.
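Lemmatizing is often confused with stemming, so a toy comparison may help. Stemming chops off affixes by rule, while lemmatizing looks up a word's dictionary form. The suffix list and the tiny lemma table below are invented for illustration. Real tools rely on full dictionaries and part-of-speech information, and nothing here reflects how WordSmith Tools works internally.

```python
# Toy suffix list, longest first (an assumption for illustration).
SUFFIXES = ("ing", "ies", "es", "ed", "s")

# Toy lemma table (hypothetical; real lemmatizers use full dictionaries).
LEMMAS = {
    "ran": "run",
    "running": "run",
    "runs": "run",
    "mice": "mouse",
    "studies": "study",
}

def crude_stem(word):
    """Strip the first matching suffix by rule; no dictionary is consulted."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def lemmatize(word):
    """Look the word up in the lemma table; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)

for word in ["running", "studies", "mice", "ran"]:
    print(word, "| stem:", crude_stem(word), "| lemma:", lemmatize(word))
```

The stemmer produces non-words such as "runn" and "stud" and cannot connect "mice" to "mouse." The lemmatizer can, but only for words its dictionary already knows.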
Corpus analysis has enjoyed a long boom among genre and language-learning scholars (Flowerdew 1998, 2005; Hyland 2002, 2005, 2012; Tribble 2006, 2014). The same tools have been employed to analyze public text, including 175,000 news articles’ representations of refugees (Gabrielatos and Baker 2008), 200,000 articles’ representation of Islam (Baker, Gabrielatos, and McEnery 2012), and 4,000 articles’ representation of feminism (Jaworska and Krishnamurthy 2012).
To do:
* History of "distant reading" as a term, the Stanford Literary Lab, Matt Jockers, Geoffrey Rockwell and Stéfan Sinclair
* Search by article, topic modeling for search, Elsevier text mining, using topic modeling to help researchers find and synthesize sources, automatic systematic reviews of a kind (not great yet as far as I can tell) -> library science, could maybe show the Elsevier & JSTOR videos
* Social media and sentiment mining