To determine who may have authored the anonymous Cheap Repository Tracts, we took two separate approaches: cluster analysis and topic modeling. Through these two forms of computational analysis, our goal was to seek out hidden patterns within the texts that would reveal similarities between the unsigned texts and texts by known authors, thereby suggesting possible authorship.
In choosing our corpus, we were influenced by Maciej Eder's "Does Size Matter? Authorship Attribution, Small Samples, Big Problem," in which he argues that a larger pool of texts leads to a more successful experiment. In another article, "Mind your Corpus: Systematic Errors in Authorship Attribution," Eder discusses the potential for a "dirty corpus" (one that contains errors in transcription, for example) to negatively influence the results of an experiment like ours when dealing with a small body of work. This was a concern for us: the Cheap Repository Tracts are a finite group of texts, and the limited time we had to create this project meant that we could work with only so much material. Bearing this in mind, we aimed to incorporate as many texts as we could in our corpus, and to make sure that those texts were as "clean" as possible.
Before we could conduct our experiments, we first needed digitized copies of all the texts in our corpus. Fortunately, as they were written in the eighteenth century, these texts are not under copyright, and we were able to find the Hannah More texts we used already digitized on readbookonline.net, the Oxford Text Archive, and Project Gutenberg. We were able to access the other texts via Eighteenth Century Collections Online. However, many of these texts were not available as machine-readable text; they existed only as PDF page images, which cannot be run through the algorithms used for this project. Thus, we took it upon ourselves to transcribe as many of the tracts as we could. Initially, we attempted to use Optical Character Recognition (OCR) software to aid our transcription, but because of the state of the PDF images and the difficulty of using OCR on eighteenth-century typefaces, we eventually found it more productive to type out the transcriptions by hand (see note 1). The resulting digital texts can be found on this website, free for the public to read and use.
Once we had our digitized corpus, we were ready to begin our topic modeling and cluster analysis. To read about the specific methodologies and results of these two experiments, please click on the links below to follow their respective "paths."
1 Ted Underwood recently addressed the difficulty of using OCR on 18th-century texts in his blog post, "A half-decent OCR normalizer for English texts after 1700." Underwood has developed a beta version of a Python script that will hopefully address some of the issues we struggled with.