How to do things with Simple Text or XML Files
In addition to being human-readable, plain-text files have the added bonus of remaining usable across software platforms and over time. For example, Apple's .pages files only work on other Apple products running the word-processing software Pages, and older formats such as WordPerfect's .wp files have fallen out of favour and are consequently very difficult to open now.
Some of the rationale behind this decision is discussed below:
The SimpleText format is therefore different from the highly structured XML format in which the TCP texts are widely available, and it is better suited to human-readable analysis. However, there is a real payoff to creating huge amounts of metadata about these materials, in that you can put it in a spreadsheet and use it to guide your analysis. (One invigorating point of discussion is imagining the pros and cons of having a hugely annotated and a completely unannotated text available as complementary editions of the same resource!)
The idea of SimpleText is that we convert documents to a highly simplified format in order to make certain kinds of processing more standard and easier. That is, we intentionally remove much of the information that richer file formats contain, leaving files that can be processed in a simple fashion. Giving up the richer information that a more complicated format, such as a structured markup language, could contain has its costs: there are kinds of analyses we cannot perform with the data in the SimpleText format. However, this move enables us to perform the statistical analyses we are most interested in. (Read more at http://graphics.cs.wisc.edu/WP/vep/%20SimpleText/)
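To make the trade-off concrete, here is a minimal sketch of what "removing the information that richer formats contain" can look like in practice. It is not the procedure used to produce the SimpleText files; the sample markup is an invented, TEI-like fragment, and the function name xml_to_plain_text is hypothetical. It simply shows that once the tags and attributes are dropped, only the words remain.

```python
import re
import xml.etree.ElementTree as ET


def xml_to_plain_text(xml_string):
    """Return only the text content of an XML fragment (hypothetical helper)."""
    root = ET.fromstring(xml_string)
    # itertext() yields every piece of text in document order,
    # ignoring element names and attributes entirely.
    text = " ".join(root.itertext())
    # Collapse runs of whitespace so the result is one clean line.
    return re.sub(r"\s+", " ", text).strip()


# Invented sample markup, loosely modelled on TEI-style drama encoding.
sample = (
    '<sp who="Hamlet"><speaker>Ham.</speaker>'
    '<l n="1">To be, or not to be,</l>'
    '<l n="2">that is the question:</l></sp>'
)

plain = xml_to_plain_text(sample)
print(plain)
```

Notice what is lost: the plain-text result no longer records who is speaking (the who attribute), where one verse line ends and the next begins, or that "Ham." is a speaker label rather than dialogue. That is exactly the kind of information a structured format keeps and a simplified format gives up.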
The page at http://graphics.cs.wisc.edu/WP/vep/vep-early-modern-drama-collection/ offers several ways of interacting with a larger world of Early Modern dramatic writing. Download the first three files associated with the Core Drama 1660 corpus. (You can download all of them if you want; nobody will stop you!) Use the provided metadata worksheet to isolate, and save to your computer, a corpus of 1) Shakespeare’s plays and 2) the plays of another author (e.g. Fletcher, Middleton). These should take the form of one or more folders containing files provided by the Core 1660 corpus. How you choose to structure them will have implications for the kinds of results you can get: for example, a corpus divided by genre will tell you something fundamentally different than a corpus divided by authorship or date of publication.
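If you would rather not sort the files by hand, the selection step can be scripted. The sketch below assumes you have exported the metadata worksheet as a CSV file, and that it has one column naming the author and one naming the text file for each play. The column names "author" and "filename", the function name build_author_corpus, and all paths are placeholders rather than the worksheet's actual headers, so check your copy and adjust them.

```python
import csv
import shutil
from pathlib import Path


def build_author_corpus(metadata_csv, source_dir, dest_dir, author,
                        author_col="author", file_col="filename"):
    """Copy every play attributed to `author` into its own folder.

    The column names are assumptions; change author_col and file_col
    to match the headers in your metadata worksheet.
    """
    source_dir = Path(source_dir)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    copied = []
    with open(metadata_csv, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Case-insensitive substring match, so "Shakespeare" also
            # catches entries such as "Shakespeare, William".
            if author.lower() in row[author_col].lower():
                src = source_dir / row[file_col]
                if src.exists():  # skip rows whose file was not downloaded
                    shutil.copy(src, dest_dir / src.name)
                    copied.append(src.name)
    return copied


# Example call (placeholder paths):
# build_author_corpus("metadata.csv", "core_1660", "corpus/shakespeare", "Shakespeare")
```

Running the function once per author gives you one folder per author. Note that a substring match will also pick up co-authored plays that list the name alongside others, which is itself a decision about how your corpus is structured. Dividing by genre or date instead would mean filtering on a different column.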
In the next section we will discuss several ways you can analyze a text using digital software.
[POTENTIAL CLASSROOM/HW ACTIVITY: annotated biblio of tools pro/cons, etc]