Querying Digitized Book Collections with the Google Books Ngram Viewer
By Shalin Hai-Jew, Kansas State University
At first glance, the Google Books Ngram Viewer looks like a tool for making interesting word-art visualizations. Type in a word, a phrase, a symbol, or a formula (any string), and the Ngram Viewer returns a line graph showing how frequently that text string occurs, year by year, in its corpus of scanned books.
At a high level, the time-series data may be seen as conveying something about word use trends over time. These data may also reveal changes in word senses (meanings and usages that tend to "drift" over time and use). The time spans may be quite long, covering hundreds of years, from the beginning of the age of mass-produced books.
However, after some more exploration, it turns out that such word use trends may surface other insights. For example, when did a particular technology catch on and enter wide popular usage? What does the Ngram Viewer show about the usages of terms considered offensive today? What does the Ngram Viewer show about competing synonyms? What does it show about in-world phenomena like government censorship?
What is an N-Gram?
An n-gram refers to a contiguous sequence of “n” items (whether characters, syllables, phonemes, or words). As used in the Ngram Viewer, a “gram” refers to a contiguous sequence of alphanumeric text without a (white)space. A “unigram” consists of one (1) gram; a “bigram” or “digram” consists of two (2) contiguous grams; a “trigram” consists of three (3) contiguous grams, and then the count continues as “four-gram,” “five-gram,” “six-gram,” and so on.
Searches may generally be run across the years 1500 to 2000. Once a search is conducted, clickable links at the bottom of the Ngram Viewer page lead to specific text resources from highlighted years (in which there were anomalous occurrences of the selected text). The research work may be deepened with access to some of the original source texts.
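The definition above can be made concrete with a short Python sketch. It assumes a "gram" is any run of characters between whitespace, as described here; the Ngram Viewer's own tokenizer may treat punctuation and hyphens differently, and the function name word_ngrams is purely illustrative.

```python
def word_ngrams(text, n):
    """Return every contiguous run of n whitespace-separated grams in text."""
    # Assumption: a "gram" is any run of characters without whitespace.
    grams = text.split()
    # Slide a window of width n across the list of grams.
    return [" ".join(grams[i:i + n]) for i in range(len(grams) - n + 1)]


sentence = "the age of mass-produced books"
unigrams = word_ngrams(sentence, 1)   # five single grams
bigrams = word_ngrams(sentence, 2)    # four overlapping pairs
trigrams = word_ngrams(sentence, 3)   # three overlapping triples
```

A five-gram sentence therefore yields five unigrams, four bigrams, and three trigrams, since each window overlaps its neighbor by all but one gram.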
More Sophisticated Queries
Enter multiple strings, separated by commas, and compare their occurrences. Use “wildcard” indicators, which stand in for any word in a phrase, to extract yet other insights. Add part-of-speech tags to disambiguate searches (for example, to separate a word's use as a noun from its use as a verb). Find terms at the beginnings or ends of sentences. The About Ngram Viewer page offers some additional ways to query the textual data.
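As a rough illustration of these query types, the following Python sketch assembles several query strings into a single address for the viewer. The base address and the parameter names (content, year_start, year_end, smoothing) are assumptions drawn from how the viewer's links commonly appear, and the operator spellings (*, _VERB, _START_) should be checked against the About Ngram Viewer page before use. The helper build_ngram_url is hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed base address; verify against the live site before relying on it.
NGRAM_GRAPH_URL = "https://books.google.com/ngrams/graph"


def build_ngram_url(queries, year_start=1800, year_end=2000, smoothing=3):
    """Join several query strings with commas and encode them into a URL."""
    params = {
        "content": ",".join(queries),  # assumed parameter name
        "year_start": year_start,      # assumed parameter name
        "year_end": year_end,          # assumed parameter name
        "smoothing": smoothing,        # assumed parameter name
    }
    return NGRAM_GRAPH_URL + "?" + urlencode(params)


queries = [
    "telephone",          # a plain string
    "University of *",    # wildcard standing in for any word
    "tackle_VERB",        # part-of-speech tag to disambiguate
    "_START_ However",    # term at the beginning of a sentence
]
url = build_ngram_url(queries)

# Decode the query string again to confirm nothing was lost in encoding.
decoded = parse_qs(urlsplit(url).query)
```

Building the address with urlencode, rather than by hand, keeps spaces, commas, and asterisks safely escaped.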
Creative Research Applications
The Ngram Viewer, as a word search database, gives a broad sense of how texts may be queried for meaningful insights in a “big data” context (with an underlying collection of tens of millions of texts). A number of publications in the research literature have used it to illuminate historical events, social movements, the effects of government censorship on language, and other topics.
A Shadow Dataset
The Ngram Viewer draws its data from a shadow dataset extracted from the digitized books in the Google Books collection. This dataset consists of tables with one row for each n-gram in each year, giving the count of occurrences of that n-gram for that year. The shadow dataset enables access to the “ngram” counts without potentially enabling reverse engineering of a whole text based on unique strings of text (Aiden & Michel, 2013).
Google has made parts of its dataset downloadable for researcher use beyond the confines of its Ngram Viewer.
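For researchers working with such tables directly, a minimal Python sketch of reading them might look like the following. It assumes tab-separated rows whose first three fields are the n-gram, the year, and the occurrence count; actual download files may carry additional columns. The sample rows are invented purely for illustration and are not real counts, and the function counts_by_year is hypothetical.

```python
from collections import defaultdict

# Invented sample rows for illustration only; these are NOT real counts.
# Assumed column order: n-gram, year, occurrence count (tab-separated).
SAMPLE_ROWS = [
    "steam engine\t1850\t120",
    "steam engine\t1851\t135",
    "steam locomotive\t1850\t40",
]


def counts_by_year(lines, target):
    """Collect the yearly occurrence counts for one n-gram."""
    totals = defaultdict(int)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Use only the first three fields; ignore any extra columns.
        ngram, year, count = fields[:3]
        if ngram == target:
            totals[int(year)] += int(count)
    return dict(totals)


series = counts_by_year(SAMPLE_ROWS, "steam engine")
```

The resulting dictionary maps each year to a count, which is the raw material for the kind of time-series line the viewer draws.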
Various Language Text Corpuses
The Ngram Viewer enables querying across a range of text corpuses in a number of languages (Italian, Russian, Spanish, Hebrew, German, French, American English, and British English). Queries may also be run across multiple text corpuses at once.
Other Resources
The Google Books Ngram Viewer can be a powerful tool for research in the digital humanities and other areas.
References
Aiden, E. & Michel, J-B. (2013) Uncharted: Big Data as a Lens on Human Culture. New York: Riverhead Books.
About the Author
Shalin Hai-Jew works as an instructional designer at Kansas State University. She has an edited text, Design Strategies and Innovations in Multimedia Presentations, forthcoming this summer. She may be reached at shalin@k-state.edu.