Establishing Robust Clusters
One of the most vexing questions in the use of cluster analysis for computational stylistics is how to distinguish "good" clusters from clusters that are mere "noise", whether generated by our data or by our choice of implementation. Ideally, we want to generate "robust" clusters, by which we mean that they stand up to some measure of scrutiny. We can define this in many ways. If we cut several documents into segments and the individual segments of each document are clustered together in opposition to segments of other documents, we can assume that the clustering process has captured something meaningful, if only the distinctiveness of our original documents. When less predictable effects occur—say one segment clusters with the "wrong" document—we have to conclude either that there is something sub-optimal about our clustering procedure or that we have found something really interesting. Thus our intuitive sense of "surprise" at our results may be a measure of a weak clustering, but this "surprise" is also the goal of our analysis—within reason. Below we discuss some methods for striking that balance when interpreting unexpected clusterings, and we examine how we can be relatively sure that our clusters—and thus the conclusions we base on them—are robust.
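The segment-based intuition above can be sketched in code. The following is a minimal pure-Python illustration (not Lexos's actual implementation, and the two "documents" are invented toy data): cut two texts into segments, represent each segment by its relative word frequencies, and check whether each segment's nearest neighbour by cosine distance comes from the same document.

```python
# Toy sketch: do segments of the same document cluster together?
# Hypothetical data and helper names; not drawn from Lexos itself.
from collections import Counter
from math import sqrt

def segments(text, n=2):
    """Split a text into n roughly equal word-count segments."""
    words = text.split()
    size = max(1, len(words) // n)
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)][:n]

def rel_freqs(seg):
    """Relative word frequencies of one segment."""
    counts = Counter(seg.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def cosine_dist(a, b):
    """Cosine distance between two frequency dictionaries."""
    vocab = set(a) | set(b)
    dot = sum(a.get(w, 0) * b.get(w, 0) for w in vocab)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return 1 - dot / (na * nb)

# Two stylistically distinct toy "documents".
doc_a = "the sword the shield the hall the king " * 20
doc_b = "ship sea wave sail ship sea storm sail " * 20

labeled = [("A", s) for s in segments(doc_a)] + [("B", s) for s in segments(doc_b)]
vecs = [(lab, rel_freqs(s)) for lab, s in labeled]

# A crude robustness check: every segment's nearest neighbour
# should belong to the same source document.
robust = True
for i, (lab_i, v_i) in enumerate(vecs):
    nearest = min((j for j in range(len(vecs)) if j != i),
                  key=lambda j: cosine_dist(v_i, vecs[j][1]))
    if vecs[nearest][0] != lab_i:
        robust = False

print("segments cluster by document:", robust)
```

When a segment's nearest neighbour instead belongs to the other document, that is the "surprise" discussed above: either a weakness in the procedure or a genuinely interesting finding.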
The Holy Grail for some would be a statistical measure with which to assess the "validity" of our clusters. A number of such measures exist, but their usefulness for a wide variety of data, and for the types of questions humanists typically ask of their data, is an open question. At present (2018), we use Lexos to prepare texts and then move to Eder's bootstrap consensus tree (BCT) tool in the Stylo package for R.
We recommend that you integrate non-statistical approaches into your workflow. Running a number of cluster analyses with slightly different settings and seeing how well the clusters hold up to these "tweaks" is probably the most reliable way to establish confidence in them. Drout et al. have outlined a variety of such procedures in Beowulf Unlocked: New Evidence from Lexomic Analysis (2016).
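One way to make the "tweak" strategy concrete is to cluster the same data under two different settings and measure how much the resulting partitions agree. The sketch below (hypothetical data and helper names, not Lexos or Stylo code) runs a minimal single-linkage clustering under two distance metrics and compares the partitions with the Rand index, the fraction of item pairs on which the two partitions make the same together/apart decision:

```python
# Sketch of a stability check: cluster the same frequency profiles
# under two metrics and measure partition agreement (Rand index).
from itertools import combinations
from math import sqrt

def euclidean(a, b):
    keys = set(a) | set(b)
    return sqrt(sum((a.get(k, 0) - b.get(k, 0)) ** 2 for k in keys))

def manhattan(a, b):
    keys = set(a) | set(b)
    return sum(abs(a.get(k, 0) - b.get(k, 0)) for k in keys)

def single_linkage(items, dist, k=2):
    """Merge the closest clusters until k remain; return a label per item."""
    clusters = [[i] for i in range(len(items))]
    while len(clusters) > k:
        x, y = min(combinations(range(len(clusters)), 2),
                   key=lambda p: min(dist(items[i], items[j])
                                     for i in clusters[p[0]]
                                     for j in clusters[p[1]]))
        clusters[x] += clusters[y]
        del clusters[y]
    labels = [0] * len(items)
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels

def rand_index(l1, l2):
    """Fraction of item pairs on which two partitions agree."""
    pairs = list(combinations(range(len(l1)), 2))
    agree = sum((l1[i] == l1[j]) == (l2[i] == l2[j]) for i, j in pairs)
    return agree / len(pairs)

# Toy relative-frequency profiles (invented for illustration).
profiles = [
    {"the": 0.5, "king": 0.3, "hall": 0.2},
    {"the": 0.48, "king": 0.32, "hall": 0.2},
    {"sea": 0.5, "ship": 0.3, "sail": 0.2},
    {"sea": 0.52, "ship": 0.28, "sail": 0.2},
]

labels_a = single_linkage(profiles, euclidean)
labels_b = single_linkage(profiles, manhattan)
print("partition agreement:", rand_index(labels_a, labels_b))
```

An agreement near 1.0 across several such tweaks (different metrics, linkage criteria, segment sizes, culling settings) suggests the clusters are robust; partitions that reshuffle under small changes deserve more suspicion.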