Cluster Analysis
Cluster Analysis Methodology:
Using Stylo in R: The theory behind our method
Maciej Eder is one of the many Digital Humanities scholars who use algorithmic methods for authorship attribution, and he has published several stylometric studies. He explains, “In non-traditional authorship attribution, the general goal is to link a disputed/anonymous sample with the most probable ‘candidate’. This is what state-of-the-art attribution methods do with ever-growing precision” ("Bootstrapping Delta: a safety net in open-set authorship attribution" 1). Stylo in R applies stylometry to group text samples and to identify the nearest neighbors among them. Blanch, in her thesis, describes the approach simply: “Stylometry is the use of statistics, calculated by computer, to undertake an internal stylistic analysis of texts in order to determine the identity of the author” (40). For our project, this is the type of analysis we used to attribute authorship to some of the unsigned texts in our corpus.
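As an illustration of how such a grouping can be produced, the following sketch shows one way to run a cluster analysis with the stylo package in R. The corpus directory, file naming, and most-frequent-word settings are assumptions chosen for the example rather than a record of our exact configuration.

    library(stylo)   # install.packages("stylo") if the package is not yet installed

    # Assumed layout: one plain-text file per sample in ./corpus,
    # named "author_title.txt" so samples by the same author share a label.
    stylo(gui = FALSE,
          corpus.dir = "corpus",
          analysis.type = "CA",            # cluster analysis: dendrogram of nearest neighbors
          mfw.min = 100, mfw.max = 100,    # build the analysis on the 100 most frequent words
          distance.measure = "dist.delta") # Burrows's Classic Delta (the package default)

Running the call above produces the dendrogram directly, and keeping the same settings for later runs makes the grouping reproducible.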
In a study conducted by Mosteller and Wallace on the Federalist Papers, discussed in the article “Measuring the Usefulness of Function Words for Authorship Attribution”, the two found “[t]hat a small number of the most frequent words in a language (‘function words’) could usefully serve as indicators of authorial style” (Argamon and Levitan 1). Argamon and Levitan reason that this is possible because function words are the words an author usually chooses unconsciously; the author therefore leaves a pattern that can be traced and followed. For our project we decided that this observation would be instrumental in helping us attribute authorship.
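To make the role of these frequent words concrete, the short sketch below shows one way to inspect them with helper functions from the stylo package; the directory name and the cutoff of 100 words are illustrative assumptions.

    library(stylo)

    # Load and tokenize the corpus (assumed to sit as plain text in ./corpus).
    tokenized <- load.corpus.and.parse(files = "all", corpus.dir = "corpus")

    # List the most frequent words across the corpus. The top of this list is
    # typically dominated by function words ("the", "of", "and", ...), which are
    # the stylistic features discussed above.
    mfw.words <- make.frequency.list(tokenized, head = 100)
    mfw.words[1:20]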
Possible limitations and how we addressed them
However, when using such tools there are limitations to consider. One problem is that “[t]he techniques of classification used are not resistant to a common mis-classification error: any two nearest samples are claimed to be similar no matter how distant they are” (Eder, "Bootstrapping Delta" 1). In other words, if two samples are grouped together, there is a chance that the pairing is an error and the texts should actually be placed apart. That is one of the reasons why the texts must also be examined without the use of technology. As Eder explains, “[To start, a person must have a] text of uncertain or anonymous authorship and a comparison corpus of texts by known authors, then one can perform a series of similarity tests between each sample and the disputed text. This allows us to establish a ranking list of possible authors, assuming that the sample nearest to the disputed text is stylistically similar, and thus probably written by the same author” (1). This type of analysis yields a useful image to show others how we determined the likelihood that an unsigned text belongs to a given author.
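A similarity-and-ranking test of this kind can also be run directly, for instance with stylo's classify() function. In the sketch below the folder names and the Delta settings are assumptions: texts by the known candidate authors sit in one directory and the disputed text in another.

    library(stylo)

    # Assumed layout: known-author samples in ./primary_set (the comparison corpus),
    # the disputed/unsigned text(s) in ./secondary_set.
    classify(gui = FALSE,
             training.corpus.dir = "primary_set",
             test.corpus.dir = "secondary_set",
             classification.method = "delta",   # nearest-neighbor attribution with Burrows's Delta
             mfw.min = 100, mfw.max = 100)

The output of classify() reports, for each disputed text, the stylistically nearest candidate author, following the ranking logic Eder describes.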
Additionally, one detail that must be considered when conducting this type of research is word count, which does matter. Eder notes, “Most attribution techniques, including Delta, allow the measurement of one vocabulary range (i.e. one vector of MFW’s) at once to obtain a ranking of candidates . . . one needs to analyze at least 2,700 words to see” ("Bootstrapping Delta" 2). This was a concern for our project because some of our texts fell below that word count. To counteract the issue, researchers have developed a Delta bootstrap tree. This variation helps reduce error by performing the attribution test in “1000 iterations, where the number of MFW’s to be analyzed is chosen randomly (e.g. 334, 638, 72, 201, 904, 145, 134, 762, . . .); in each iteration, the nearest neighbor classification is performed” (3). We created these bootstrap trees to show another way in which our texts group with each other.
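A closely related procedure is available in stylo as the bootstrap consensus tree, which repeats the analysis over a whole range of most-frequent-word settings and keeps only the groupings that recur across runs. The sketch below shows one way to request it; the MFW range and consensus strength are plausible example values, not our exact settings.

    library(stylo)

    # Bootstrap consensus tree: the cluster analysis is repeated for every MFW
    # setting from 100 to 1000 in steps of 100, and only links that appear in at
    # least half of the runs (consensus.strength = 0.5) are kept in the final tree.
    stylo(gui = FALSE,
          corpus.dir = "corpus",
          analysis.type = "BCT",
          mfw.min = 100, mfw.max = 1000, mfw.incr = 100,
          consensus.strength = 0.5)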
Lastly, one common concern of our group was the “cherry picking” effect, which occurs when a test is run again and again until it produces the results the researcher is looking for. We found this could happen with certain tools (e.g., Lexos word clouds), where adjusting the settings can change the results. To make sure this would not be an issue, our group decided that all tests and scrubbing would be performed with consistent settings no matter what the results were.
What makes our research different is that we made sure to use newer tools such as Principal Component Analysis (PCA), Cluster Analysis (CA), and Multidimensional Scaling (MDS), which, according to Eder and Rybicki in their article “Stylometry with R”, are still rarely used in this field. Blanch, in her thesis, discusses the benefits of using stylometry when performing a non-traditional attribution study. For this project, our group also performed a traditional attribution study, that is, an actual text-by-text comparison with no technological assistance. Ultimately, many researchers have asserted “[t]hat a sound (and successful) bibliographical scholarship will employ a combination of traditional and non-traditional attribution methods” (Blanch 40), a belief our team took into consideration and implemented.
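For reference, the PCA and MDS views mentioned above can be requested from the same stylo call simply by changing the analysis.type option; the values below are the package's documented options, while the MFW setting is again an assumed example.

    library(stylo)

    # Principal Component Analysis on a correlation matrix ("PCV" would use covariance).
    stylo(gui = FALSE, corpus.dir = "corpus", analysis.type = "PCR",
          mfw.min = 100, mfw.max = 100)

    # Multidimensional Scaling of the same word-frequency data.
    stylo(gui = FALSE, corpus.dir = "corpus", analysis.type = "MDS",
          mfw.min = 100, mfw.max = 100)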
Tutorial
For the specific process we used to analyze our corpus with Stylo in R, please see our tutorial here.