Using Digital Media to Analyze the Evolution of Feminist Discourse

Optical Character Recognition


I used the online OCR software found at www.onlineocr.net to convert the New York Times articles, which had been downloaded as pdfs of "image-text", into text-based data that would be readable by Voyant. While the software is technically free, it only allows the user to upload single-page pdfs. In order to upload more than one page, one must create an account. The account allows for the conversion of up to 25 pages in total, after which one must purchase pages, with costs ranging from $4.95 for 50 pages to $399.95 for 50,000 pages. Since I needed to use the software for 40 articles, I ended up creating multiple accounts using different e-mail addresses to keep the service free.

While the OCR was surprisingly effective, its output required a large amount of data cleaning: some words or characters were not read properly, and any ink marks on the pdfs were interpreted by the software as numbers, letters or symbols. I found that older texts with messier ink marks or more ornate typefaces were much harder for the OCR to interpret than the articles closer to the present day. These issues required me to compare each OCR file with the original article pdf to ensure an exact duplicate of the text. This made the process very time-consuming and contributed to the small sample size of articles used in the project.
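A first pass over the OCR output could be partly automated before the manual comparison. The sketch below is not the procedure used in this project, which relied on reading each file against its pdf. It is a minimal illustration, assuming that stray ink marks tend to surface as unusual symbols or as digits embedded in words. The function name `flag_suspicious_tokens`, the set of "noise" characters and the sample text are all hypothetical choices made for the example.

```python
import re

# Characters that rarely occur in newspaper prose. Treating them as signs of
# ink-mark noise is an assumption of this sketch, not a property of the OCR tool.
NOISE_CHARS = re.compile(r"[|~^_\\<>{}\[\]*#=+]")

# A token that contains at least one letter and at least one digit, e.g. "wom3n".
MIXED = re.compile(r"(?=.*[A-Za-z])(?=.*\d)")

# Legitimate letter-digit mixes such as "1970s" or "19th", which are not flagged.
ALLOWED_MIXED = re.compile(r"^\d+(st|nd|rd|th|s)$")


def flag_suspicious_tokens(text):
    """Return (line_number, token) pairs that deserve a manual look."""
    flagged = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for token in line.split():
            # Ignore ordinary punctuation attached to the edges of a word.
            core = token.strip(".,;:!?\"'()")
            if not core or ALLOWED_MIXED.match(core):
                continue
            if NOISE_CHARS.search(core) or MIXED.match(core):
                flagged.append((line_no, token))
    return flagged
```

Calling `flag_suspicious_tokens` on the text of one OCR file returns a list of line numbers and tokens to check against the original pdf. A list like this only narrows down where to look. It cannot catch a misread word that happens to be made of ordinary letters, so it would not replace the side-by-side comparison described above.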
 
 
