Topic Modeling

The Hannah More Project

Computational Analysis, Author Attribution, and the Cheap Repository Tracts of the 18th Century

What is topic modeling?

Topic modeling is a type of statistical modeling that discovers abstract topics in a collection of documents. Each topic generally takes the form of a list of words drawn from the collection, which is treated as a “bag of words” (that is, word order is ignored). Words are first assigned to topics at random and are then repeatedly reassigned through an algorithmic process (the most common being Latent Dirichlet Allocation, or LDA) until words that tend to appear in the same documents end up in the same topic. Topic modeling is an essential tool of distant reading in the digital humanities and is generally used to find thematic similarities in a large corpus of texts. The results of a topic model can be presented as a visualization for easier comprehension.
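To make the "random assignment, then reassignment" idea concrete, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling, written in plain Python. It is an illustration only and not the software or settings used in this project. The function names, the hyperparameter values (alpha, beta) and the toy word lists are all assumptions invented for the example.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit a tiny LDA model by collapsed Gibbs sampling (illustrative only).

    docs is a list of token lists. Returns (topic_word, doc_topic), where
    topic_word[k] counts the words assigned to topic k and doc_topic[d][k]
    counts the tokens of document d assigned to topic k.
    """
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})

    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    doc_topic = [[0] * num_topics for _ in docs]
    assignments = []

    # Step 1: assign every token to a topic at random.
    for d, doc in enumerate(docs):
        doc_assign = []
        for word in doc:
            k = rng.randrange(num_topics)
            doc_assign.append(k)
            topic_word[k][word] += 1
            topic_total[k] += 1
            doc_topic[d][k] += 1
        assignments.append(doc_assign)

    # Step 2: repeatedly revisit each token and resample its topic, favouring
    # topics already common in this document and already using this word.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                old = assignments[d][i]
                topic_word[old][word] -= 1
                topic_total[old] -= 1
                doc_topic[d][old] -= 1

                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                new = rng.choices(range(num_topics), weights=weights)[0]

                assignments[d][i] = new
                topic_word[new][word] += 1
                topic_total[new] += 1
                doc_topic[d][new] += 1

    return topic_word, doc_topic


def top_words(topic_word, n=3):
    """Return the n most frequent words of each topic (ignoring zero counts)."""
    return [
        [word for word, count in counts.most_common(n) if count > 0]
        for counts in topic_word
    ]


if __name__ == "__main__":
    # Invented toy "documents"; not text from the tracts.
    toy_docs = [
        "church prayer sunday church prayer".split(),
        "bread market wages bread market".split(),
        "prayer sunday church bread".split(),
    ]
    topics, _ = lda_gibbs(toy_docs, num_topics=2, seed=1)
    print(top_words(topics))
```

Because the starting assignments are random, changing the seed can change which words end up grouped together, especially on a corpus this small.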

Why did we choose this method?

Since the overarching goal of this project is to find authorial attribution for the anonymous Cheap Repository Tracts, we intended to use topic modeling to further that goal. We had hoped to run topic models that included the anonymous tracts, texts by Hannah More, and texts by the other possible authors, particularly those laid out in Blanch’s thesis. If topics or frequently used words that clustered within a particular author’s works were also prevalent in a particular anonymous tract, this might have given us some clues about who wrote it.

However, we were limited by the lack of pre-existing transcriptions of the tracts, and we lacked the resources and time to transcribe all the tracts necessary to run such a model. Once we determined that our goal of matching one of these possible authors to an unsigned tract was untenable, we hoped to use topic modeling to see whether there was a correlation between the frequency of the topics within More’s texts and within the unsigned texts, which might suggest that these works could have been written by More.
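One simple way to picture such a comparison is to average the topic proportions for each group of texts and compute a Pearson correlation between the two averages. The sketch below is an assumption about how this could be done, not a record of the calculation used in this project. The helper names and the proportions in the demo are hypothetical.

```python
from math import sqrt


def mean_topic_proportions(doc_topic_rows):
    """Average per-document topic proportions over a group of documents."""
    num_docs = len(doc_topic_rows)
    num_topics = len(doc_topic_rows[0])
    return [
        sum(row[k] for row in doc_topic_rows) / num_docs
        for k in range(num_topics)
    ]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


if __name__ == "__main__":
    # Hypothetical proportions for three topics; not results from the project.
    signed = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]]
    unsigned = [[0.5, 0.3, 0.2], [0.7, 0.2, 0.1]]
    print(pearson(mean_topic_proportions(signed), mean_topic_proportions(unsigned)))
```

A high value here would only show that the two groups emphasize similar topics. On its own it would not establish who wrote the unsigned texts.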

While we ultimately weren’t able to use topic modeling towards authorial attribution, we were able to use it for some interesting analysis of the tracts and their relationship with Hannah More’s texts, which we hope to use in our further study of More and the Cheap Repository Tracts. This analysis is further addressed in our Topic Modeling Results section.

Possible limitations and how we addressed them

A risk of topic modeling is that it can create false insights. Topics are not always meaningful in and of themselves: the meaning applied to them is human-generated, and the operator must have some working knowledge of the texts in order to interpret the topics appropriately and within a reasonable context. For this project, we approached our texts with a historical and textual understanding of the tracts and of Hannah More. Additionally, because the topics are partly randomized, topic models of the same corpus can vary considerably from one run to the next. We addressed this by running our models multiple times to try to arrive at a relatively average model.
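The run-to-run variation can also be checked mechanically. The sketch below compares the top words of each topic across runs using Jaccard overlap and picks the run that agrees most with the others. This is one possible reading of a "relatively average" model and is an assumption for illustration, not a description of how our runs were compared. The function names and the word lists in the demo are invented.

```python
def jaccard(words_a, words_b):
    """Share of words two topics have in common (0.0 to 1.0)."""
    a, b = set(words_a), set(words_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_matches(run_a, run_b):
    """For each topic in run_a, find its most similar topic in run_b.

    Each run is a list of topics; each topic is a list of its top words.
    Returns (index_in_a, index_in_b, similarity) triples.
    """
    matches = []
    for i, topic in enumerate(run_a):
        scores = [jaccard(topic, other) for other in run_b]
        j = max(range(len(run_b)), key=scores.__getitem__)
        matches.append((i, j, scores[j]))
    return matches


def run_stability(run_a, run_b):
    """Mean best-match similarity of run_a's topics against run_b."""
    matches = best_matches(run_a, run_b)
    return sum(score for _, _, score in matches) / len(matches)


def most_representative(runs):
    """Index of the run that agrees most, on average, with the other runs.

    Needs at least two runs.
    """
    def average_agreement(i):
        others = [run for j, run in enumerate(runs) if j != i]
        return sum(run_stability(runs[i], other) for other in others) / len(others)

    return max(range(len(runs)), key=average_agreement)


if __name__ == "__main__":
    # Invented top-word lists standing in for three runs of the same model.
    runs = [
        [["church", "prayer", "sunday"], ["bread", "market", "wages"]],
        [["bread", "market", "price"], ["church", "prayer", "sunday"]],
        [["church", "bread", "sunday"], ["prayer", "market", "wages"]],
    ]
    print(most_representative(runs))
```

Topic numbering is arbitrary from run to run, which is why topics are matched by their words rather than by their position in the list.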
