By Jasmine Drudge-Willson, 2016-12-15 (revised 2016-12-19)

I used the online OCR software at www.onlineocr.net to convert the New York Times articles, which had been downloaded as "image-text" PDFs, into text-based data that Voyant could read. While the software is technically free, it only allows the user to upload single-page PDFs. To upload more than one page, one must create an account. An account allows the conversion of up to 25 pages in total, after which one must purchase pages, with costs ranging from $4.95 for 50 pages to $399.95 for 50,000 pages. Since I needed to use the software for 40 articles, I ended up creating multiple accounts under different e-mail addresses to keep the service free.
While the OCR was surprisingly effective, its output still required a large amount of data cleaning: some words or characters were not read properly, and any ink marks on the PDFs were interpreted by the software as numbers, letters, or symbols. I found that older texts with messier ink marks or more ornate typefaces were much harder for the OCR to interpret than articles closer to the present day. To ensure an exact duplicate of the text, I had to compare each OCR file with the original article PDF, which made the process very time-consuming and contributed to the small sample size of articles used in the project.
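As an illustration of how this kind of cleaning might be narrowed down, here is a minimal Python sketch that flags tokens likely to be OCR misreads so they can be checked against the original PDF first. It is not the method used in this project, which relied on manual comparison. The function name `flag_suspicious_tokens`, the two patterns, and the sample sentence are all invented for the example, and the heuristics will produce false positives (for instance, "1st" or curly quotation marks).

```python
import re

# Assumed heuristics (not taken from the project): tokens that mix letters
# and digits, or that contain unusual symbols, are often OCR misreads of
# ink marks or ornate type.
MIXED = re.compile(r"(?=.*[A-Za-z])(?=.*\d)")
STRAY = re.compile(r"[^A-Za-z0-9\s.,;:!?'\"()\-$%&]")


def flag_suspicious_tokens(text):
    """Return (line_number, token) pairs worth checking against the PDF."""
    flagged = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for token in line.split():
            if MIXED.search(token) or STRAY.search(token):
                flagged.append((line_no, token))
    return flagged


# Made-up sample text, not a quotation from any article.
sample = "The wom4n wore a flannel su|t\nto the office in 1956."
for line_no, token in flag_suspicious_tokens(sample):
    print(line_no, token)
```

A list like this only prioritizes where to look. A misread that happens to produce a real word would still need to be caught by reading the OCR file against the PDF.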
Figure: OCR example. Fitz-Gibbon, Bernice. “Woman in the Gay Flannel Suit.” New York Times; Jan 29, 1956. (media/Screen Shot 2016-11-21 at 1.00.19 PM.png)