This site requires Javascript to be turned on. Please enable Javascript and reload the page.

Archaeology of a Book: An experimental approach to reading rare books in archival contexts

Digital Projects

Scanned facsimiles are valued by libraries and scholars because they increase the accessibility and discoverability of historical documents. More people can see this documents, at a lower cost, than ever before. But this is only the beginning of the new forms of scholarship enabled by digital reproductions. This page describes two new tools that seek to expand the utility of these resources for scholarly research.

Cobre

Cobre is a comparative book reader for the Primeros Libros developed by Texas A&M University. It enables readers to explore multiple options for engaging with digital facsimiles, from a reader interface that mimics a printed book to a comparative "filmstrip" that allows for the side-by-side comparison of multiple copies. In the example below, the Cobre reader makes visible a difference in the frontmatter of two copies of the Advertencias (volume one). One copy begins the book immediately after the dedication, while the other has significantly more material, including indulgences dated 1603.

In the proceedings of the 2012 45th Hawaii International Conference on System Science (HICSS), the developers of Cobre describe in detail their process and goals for the Cobre project. After spending several years collection "user stories" in collaboration with the Asociación Mexicana de Bibliotecas e Instituciones con Fondos Antiguos, they set out to develop a tool that would embrace the "Frankenbook": the wide array of variations between multiple volumes of a single text.

Ocular

Ocular is a tool for the automatic transcription of early modern printed books. The tool, which was developed by Taylor Berg-Kirkpatrick and others at U. C. Berkeley in 2013, is an "Optical Character Recognition" or OCR tool designed specifically to handle the challenges of hand printed, worm-eaten books. Unlike OCR systems designed for new books, Ocular statistically models variations in fonts, alignment, and inking of letters that are byproducts of early printing presses. They also draw on a language model, or statistical model of what language should look like, in order to fill in information about distorted letters.

As of 2015, Ocular was the state-of-the-art for automatic transcription of historical documents. When we tried to use it on the Primeros Libros texts, however, it failed because its language model expected monolingual English documents - something like the Wall Street Journal. In 2014, we collaborated with computer scientists at U.T. Austin and at Berkeley to expand Ocular for Primeros Libros by adding multilingual capabilities, as well as an interface for handling the irregular orthography (spelling, punctuation, shorthand, etc.) characteristic of early modern printing. The result is a tool that can automatically transcribe books in multiple languages, like the Advertencias, while simultaneously identifying patterns in language switching embedded in the books. The image to the right shows a fragment of a page from the Advertencias, a processed facsimile, and an automatic transcription.

Automatic transcription for early modern texts like the Advertencias is still an arduous process: in many cases, accuracy is less than fifty percent. Being able to rapidly transcribe large bodies of text like that of the Primeros Libros project, however, will enable new ways of "discovering" the texts. We will be able to analyze linguistic patterns in indigenous languages, for example, or identify places where authors borrow from one another. When paired with close analyses of the original books, we hope that this will reveal new aspects of the books' role in history, and their position as parts of history.

In August 2015, LLILAS Benson received an NEH Digital Humanities Implementation Grant to continue to develop Ocular and to use it to transcribe the Primeros Libros collection. You can follow this project on the Reading the First Books website.

This page has paths:

Contents of this path:

This page references:

Cobre Reader comparing frontmatter of the Advertencias
Automatic Transcription Output with Ocular Three versions of the same fragment from the Advertencias: one is a facsimile image of the original page; one is a facsimile image in black in white with language changes marked in red; the final is an automatic transcription of the original page, with language shifts automatically generated. Transcription labels languages correctly on all but one word, and shows few character errors, though it does insert extra spaces in the Nahuatl text. Text reads: “Ay proprio vocablo de logro, que es, tetech-tlaixtlapanaliztli, tetechtlamieccaquixtiliztli, ypara dezir diste alogro? Cuix tetech otitlaix”.
Challenges for Automatic Transcription