
How to do things with Simple Text or XML Files

The Visualising English Print (VEP) project developed its own specific format for plain-text files, which you can read about in detail on this page; it is presented here in contrast to the XML files released by the TCP project. For the unfamiliar, XML is a hierarchical markup language used to describe and label individual items in a text: the encoding produces metadata about the data contained in the text itself. This markup represents a very explicit interpretation of a text, but it also enables a huge range of analyses: dates of publication, author(s), all the words said by a specific character, all available nouns, and so on. While this is often hugely helpful, it is also useful to keep a plain-text (i.e. totally unannotated) version of the documents available: adding lots of markup and encoding tends to make corpora machine-readable but not human-readable.
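To make the contrast concrete, here is a minimal sketch in Python. The XML fragment is invented for illustration (it loosely imitates TEI-style drama markup rather than reproducing an actual TCP file); the point is that the tags carry metadata a machine can query directly, while stripping them recovers the plain text a human reader actually wants.

```python
import xml.etree.ElementTree as ET

# An invented TEI-style fragment, for illustration only
# (not an actual TCP file).
fragment = """
<sp who="HAMLET">
  <speaker>Ham.</speaker>
  <l>To be, or not to be, that is the question:</l>
</sp>
"""

root = ET.fromstring(fragment)

# The markup lets a machine answer structured questions...
print(root.get("who"))  # -> HAMLET

# ...while stripping the tags leaves the human-readable plain text.
plain = " ".join(" ".join(root.itertext()).split())
print(plain)  # -> Ham. To be, or not to be, that is the question:
```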

In addition to human-readability, plain-text files have the added bonus of being more widely usable, both across software platforms and over time. For example, Apple's .pages files only open in Pages, the word-processing software on Apple products; older formats such as WordPerfect's .wp files have fallen out of favour and are now very difficult to open.

Some of the rationale behind this decision is laid out in the VEP project's own description of SimpleText, quoted below:

The idea of SimpleText is that we convert documents to a highly simplified format in order to make certain kinds of processing more standard and easier. That is, we intentionally remove much of the information that richer file formats contain, leaving files that can be processed in a simple fashion. Giving up the potential for the richer information that could be contained in a more complicated format, such as a structured markup language, has its costs. It means there are kinds of analyses we cannot perform with the data in the SimpleText format. However, this move enables us to perform the statistical analyses we are most interested in. (read more at http://graphics.cs.wisc.edu/WP/vep/SimpleText/)
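As a sketch of what "processed in a simple fashion" can mean in practice: once a document is plain text, a basic statistical analysis such as a word-frequency count takes only a few lines of standard Python. The filename below is a placeholder, not a real file from the corpus.

```python
import re
from collections import Counter

# Placeholder path; substitute any plain-text/SimpleText file.
with open("play.txt", encoding="utf-8") as f:
    text = f.read().lower()

# Tokenize naively on runs of letters; a real project might want
# more careful handling of early modern spelling variants.
words = re.findall(r"[a-z]+", text)

# Print the ten most frequent words in the file.
for word, count in Counter(words).most_common(10):
    print(f"{word}\t{count}")
```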

The SimpleText format therefore differs from the highly structured XML format in which the TCP texts are widely available, and is better suited to human-readable analyses. However, there is a real payoff to creating large amounts of metadata about these materials: you can put it in a spreadsheet and use it to guide your analysis. (One invigorating point of discussion is imagining the pros and cons of having a hugely annotated and a completely unannotated text available as complementary editions of the same resource!)

The page at http://graphics.cs.wisc.edu/WP/vep/vep-early-modern-drama-collection/ offers several ways of interacting with a larger world of Early Modern dramatic writing. Download the first three files associated with the Core Drama 1660 corpus. (You can download all of them if you want – nobody will stop you!) Use the provided metadata worksheet to isolate and save to your computer 1) a corpus of Shakespeare's plays and 2) a corpus of plays by another author (e.g. Fletcher, Middleton). These should take the form of one or more folders containing files from the Core 1660 corpus; one way of sorting them programmatically is sketched below. How you choose to structure them will have implications for the kinds of results you can get: for example, a corpus divided by genre will tell you something fundamentally different from a corpus divided by authorship or date of publication.
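Here is a hedged sketch of that sorting step using pandas. The column names ("author", "filename"), the file paths, and the assumption that the metadata worksheet has been saved as a CSV are all placeholders: check them against the actual spreadsheet headers and your own folder layout before running.

```python
import shutil
from pathlib import Path

import pandas as pd

# Assumed names: adjust to match the actual metadata worksheet
# and the folder where you unpacked the Core 1660 text files.
meta = pd.read_csv("core_drama_1660_metadata.csv")
source = Path("core_drama_1660_texts")

for author in ["Shakespeare", "Fletcher"]:
    target = Path(f"corpus_{author.lower()}")
    target.mkdir(exist_ok=True)
    # Select rows whose author field mentions this author,
    # then copy each matching file into that author's folder.
    rows = meta[meta["author"].str.contains(author, na=False)]
    for filename in rows["filename"]:
        shutil.copy(source / filename, target / filename)
```

Filtering on a different column (genre, say, or date of publication) instead of "author" would produce a differently structured corpus, which is exactly the choice described above.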

In the next section we will discuss several ways you can analyse a text using digital software.
[POTENTIAL CLASSROOM/HW ACTIVITY: annotated biblio of tools pro/cons, etc]
