Topic Modeling Process

The Hannah More Project

Computational Analysis, Author Attribution, and the Cheap Repository Tracts of the 18th Century

Tools Used

Mallet GUI: Used to model topics and find key words in the unsigned tracts and the Hannah More texts.
Lexos: Used to scrub texts.

Alternative tool:

Tropes (ultimately not used)

Phase 1: Selecting the tools

Lily installed Mallet as well as Tropes; the goal was to see which topic modeling software would serve this group better. The major factors she wanted to look at were:

● Usability of the software: can everyone on the team install/run it? Is it too complicated to master in the amount of time we have?
● Usefulness of the topics: do the topics make sense for the tracts, or are they gibberish? Do the pre-tags in Tropes make sense?
● Functional output: is the output of the software usable, or are we going to have to do a lot of processing to make it work?

The Topic Modeling Committee decided to use the Mallet GUI because the Mallet command line is more than we need for our data set and won’t run on everyone’s computer. We decided against Tropes because we were uncertain which algorithm it uses and because its output is not necessarily the kind of topics our project needs.

Once we started with the Mallet GUI, the TM Committee scrubbed the texts in Lexos (removing punctuation, applying Matt Jockers’ stopword list, and normalizing capitalization) and then ran the Tracts (22), the Hannah More texts (31), and both sets combined. The first run (not included in the rounds) produced some gibberish topics (“xx, xy, zzzz,” etc.) because of a faulty file among the tracts, but once this tract was removed the topics began making sense.

Settings for Round 1 were:

# of topics: 10

# of iterations: 500

No of topic words printed: 10

Topic proportion threshold: 0.05

The website wiki recommended that we increase the topic proportion threshold because of our small data set.

Settings for Round 2:

# of topics: 10

# of iterations: 500

No of topic words printed: 10

Topic proportion threshold: 0.1
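
To make the effect of the threshold concrete, here is a minimal sketch. It assumes the GUI setting behaves like a simple cutoff: for each document, topics whose proportion falls below the threshold are left out of the output. The function name and the proportions are hypothetical and are not taken from the project’s results.

```python
def topics_above_threshold(proportions, threshold):
    """Return (topic_id, proportion) pairs at or above the threshold, largest first."""
    kept = [(i, p) for i, p in enumerate(proportions) if p >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical topic proportions for one document across 10 topics (they sum to 1.0).
doc = [0.40, 0.22, 0.12, 0.08, 0.06, 0.04, 0.03, 0.02, 0.02, 0.01]

# Compare the three threshold values used across the rounds.
for threshold in (0.05, 0.1, 0.5):
    print(threshold, topics_above_threshold(doc, threshold))
```

With these made-up numbers, five topics pass a threshold of 0.05, three pass 0.1, and none pass 0.5, so under this assumption a higher threshold reports fewer topics per document.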

Phase 2: Running the models

Lily conducted more topic modeling experiments. Here are detailed notes from her work:

Step one was downloading the updated tract list (which grew from 22 to 30 tracts), converting the files to .txt, and scrubbing them in Lexos using Matt Jockers’ list. Some tracts have titles and some do not, so for consistency Lily also deleted all the titles, leaving only the text of the tracts. She also removed “the end” and “finis.” After scrubbing and cleaning up, Lily started running topic models again and experimenting with the settings.
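
The project did this scrubbing in Lexos and by hand. For readers who want to reproduce similar steps in a script, the following is a rough sketch only, not the Lexos implementation. The small `STOPWORDS` set is a hypothetical stand-in for Matt Jockers’ list, which is not reproduced here, and the function name `scrub` is our own.

```python
import re
import string

# Hypothetical stand-in for a stopword list (NOT Matt Jockers' actual list).
STOPWORDS = {"the", "and", "of", "a", "to"}

# Map every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def scrub(text, stopwords=STOPWORDS):
    """Lowercase, strip ASCII punctuation, drop a closing marker, remove stopwords."""
    text = text.lower().translate(PUNCT_TO_SPACE)
    # Collapse runs of whitespace left behind by the removed punctuation.
    text = re.sub(r"\s+", " ", text).strip()
    # Remove a trailing "the end" or "finis" if the text closes with one.
    text = re.sub(r"\b(?:the end|finis)$", "", text).strip()
    return " ".join(word for word in text.split() if word not in stopwords)


print(scrub("The Shepherd, and his Dog. FINIS."))
```

Note that `string.punctuation` covers only ASCII characters, so curly quotation marks in a transcription would need to be handled separately. Titles are not detected here; in the project they were deleted manually.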

Settings for Round 3:

# of topics: 10

# of iterations: 700

No of topic words printed: 10

Topic proportion threshold: 0.1

Settings for Round 4:

# of topics: 10

# of iterations: 700

No of topic words printed: 10

Topic proportion threshold: 0.5

Settings for Round 5:

# of topics: 10

# of iterations: 700

No of topic words printed: 10

Topic proportion threshold: 0.5

Settings for Round 6:

# of topics: 10

# of iterations: 700

No of topic words printed: 10

Topic proportion threshold: 0.5

The settings for Rounds 4, 5, and 6 are the same because the topics from Rounds 3 and 4 were so different that I wonder whether the cause is the randomization factor or the settings themselves. Repeating the same settings should help me figure it out.
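
One way to put a number on how different two runs are is to compare their topic word lists directly. The sketch below is an illustration we are adding, not part of the project’s workflow: it uses Jaccard overlap (shared words divided by total distinct words) and, for each topic in one run, finds its closest counterpart in the other. The word lists are invented for the example.

```python
def jaccard(words_a, words_b):
    """Share of distinct words that two topic word lists have in common."""
    a, b = set(words_a), set(words_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_matches(run_a, run_b):
    """For each topic in run_a, the highest Jaccard overlap with any topic in run_b."""
    return [max((jaccard(t, u) for u in run_b), default=0.0) for t in run_a]


# Invented topic word lists standing in for two runs with identical settings.
run_a = [["god", "lord", "pray", "church"], ["money", "work", "farmer", "wages"]]
run_b = [["work", "farmer", "wages", "bread"], ["god", "lord", "sin", "church"]]

print(best_matches(run_a, run_b))
```

If runs with identical settings already show low overlap, the variation is more likely due to randomization than to the settings; if identical settings agree closely while changed settings do not, the settings are the likelier cause.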

Settings for Round 7:

# of topics: 10

# of iterations: 500

No of topic words printed: 10

Topic proportion threshold: 0.5

At this point Lily found that the topics varied considerably regardless of whether she changed or kept the settings. Round 5 had the most interesting mix of topics and seemed fairly representative of the spread of topics across all the other rounds.
