Cluster analysis may be formally defined as an unsupervised learning technique for finding “natural” groupings of given instances in unlabeled data. For the purposes of text analysis, a clustering method is a procedure that starts with a group of documents and attempts to organize them into relatively homogeneous groups called “clusters”. Clustering methods differ from classification methods (supervised learning) in that clustering attempts to form these groups entirely from the data, rather than by assigning documents to predefined groups with designated class labels.
Cluster analysis works by counting the frequency of terms occurring in documents and then grouping the documents based on similarities in these frequencies. When we observe that an author or text uses a term or a group of terms more than other authors or other texts do, we are innately using this technique. In making such claims, we rely on our memory, as well as sometimes unstated selection processes. It may be true, for instance, that Text A uses more terms relating to violence than Text B, but in fact, the difference between the two may be proportionally much less than the difference in the frequency of other terms such as “the” and “is” on which we do not traditionally base our interpretation. Cluster analysis leverages the ability of the computer to compare many more terms than is possible (or at least practical) for the human mind. It may therefore reveal patterns that can be missed by traditional forms of reading. Cluster analysis can be very useful for exploring similarities between texts or segments of texts and can also be used as a test for hypotheses you may have about your texts. But it is important to remember that the type of clustering discussed here relies on the frequency of terms, not their semantic qualities. As a result, it can only provide a kind of proxy for meaning. In order to use cluster analysis successfully to interpret literary texts, it is important to have a good understanding of how the process works.
Here, we acknowledge the many levels of expertise it requires to fully appreciate cluster analysis in general and specifically when choosing (i) metrics that define a distance between documents, (ii) metrics that manage the clustering of like documents, and (iii) clustering based on different features of the text (e.g. all or only the most frequent terms) (Eder, "Computational Stylistics and Biblical Translation"). But in all the details, we encourage the reader to seek the benefits of exploratory analysis like cluster analysis early and often, even as you are learning more about the statistical roots of different metrics. It always helps if you have local statistican to consult..
In traditional literary criticism, concepts like “genre” are frequently used to group texts. There may be some taxonomic criteria for assigning texts to these groups, but recent work in Digital Humanities (Moretti, Jockers, Underwood) have highlighted that such categories can be usefully re-examined using quantitative methods. In cluster analysis, the basis for dividing or assigning documents into clusters is a statistical calculation of their dis(similarity). Similar documents are ones in which there is considerable homogeneity in the frequency with which terms are observed to occur therein. Documents are dissimilar when term frequencies are more heterogeneous.
A clearer picture emerges of this definition of similarity if we examine how it is measured. Imagine three documents as points within a coordinate space.
Document A can be imagined to be more similar to Document B than to Document C by using the distance between them as a metric. Simply draw a straight line between the points, and the documents with the shortest line between them are the most similar. When using cluster analysis for the study of texts, we take as our premise the idea that this notion of similarity measured as proximity may correlate to a range of historical and stylistic relationships amongst the documents in our corpus.
The graph above represents the end of the process. In order to plot our documents in coordinate space, we must first determine the coordinates. This is done by counting the terms in our documents to produce a document-term matrix. For instance a part of the document-term matrix that produced the graph above might be:
The list of term counts for each document is called a document vector. Representing the text as a vector of term counts allows us to calculate the distance or dissimilarity between documents. We can easily convert the document-term matrix into a "distance matrix" (also called a "dissimilarity matrix") by taking the difference between the term counts for each document vector.
|Man||Document A||Document B||Document C|
The distance from A to B is 1 while the distance from B to C is 3. Documents A and B form a cluster because the distance between them is shorter than between either and Document C.
Notice that the table above reproduces only the portion of the document vectors representing the frequency of the word “man”. Adding the “woman” portion creates considerable difficulties for us if we are trying to represent the data in rows and columns. That is because each term in the document vector represents a separate dimension of that vector. The full text of a document may be represented by a vector with thousands of dimensions. Imagine a spreadsheet with thousands of individual sheets, one for each term in the document, and you get the idea. In order for the human mind to interpret this data, we need to produce a flattening, or dimensionality reduction, of the whole distance matrix. The computer does this by algorithmically going through the distance matrix and adjusting the distance between each document vector on the observed distances between each of the terms. There are, in fact, different algorithms for doing this, and part of a successful use of cluster analysis involves choosing the algorithm best suited for the materials being examined.
Many discussions of this process begin with the notion of feature selection. In text analysis, this equates to determining what features of the text make up the document vector. The procedure for feature selection is essentially the processes of scrubbing, cutting, and tokenization. You may also perform certain normalization measures that modify the term frequencies in order to account for differences in document length. Depending on how you perform these tasks, the results of cluster analysis can be very different.
One of the factors that will influence your results is your choice of distance metric. The distance metric is essentially how you define the difference between your documents. A simple example using a distance metric might be the words cat and can (think of these words as documents composed of vectors of three letters each). A distance metric called edit distance can be defined as the number of character changes required to transform cat into can (here, just one change = 1). The difference between cat and call would be 2. Edit distances are very good for measuring the distance between short strings of characters like individual words, but they are unwieldy for longer document vectors. So for whole texts we need some other way of defining distance.
The Euclidean distance metric measures the magnitude of the difference in distance between two document vectors. The Euclidean distance is essentially the length of a line between a point representing a term on one vector and a point representing the same term on the other. You might then decide to take the average Euclidean distance for all points and treat that as the measure of document distance. Statisticians have developed a number of metrics based on modifications of Euclidean distance that can serve as alternative ways of defining document distance. Another approach is to measure the similarity of the two vectors. It may help the reader to visualize two documents with N unique words as represented by two vectors sticking out into N-space. Given two documents, a common method of computing distance is to calculate the cosine similarity, which is the angle between the two document vectors. Cosine similarity is not truly a measure of distance, but it can be converted into a distance measure by subtracting the cosine similarity value (which varies between 0 and 1) from 1.
The difference between Euclidean and non-Euclidean metrics for measuring document distance is largely one of perspective. Euclidean distances tend to measure the distance between documents at some point along the vector where the documents are already quite distinct from one another. In longer documents, this distinction may correlate to a lot of terms which are found in one document but not the other. In the document-term matrix, counts for these terms in the documents where they do not occur are recorded as 0. A distance matrix in which most of the elements are zero is called a sparse matrix. The more dimensions there are in the document vectors, the more likely it is that the distance matrix will be sparse. Many Hapax legomena (terms occurring only once) in your documents, for instance, will very likely produce a sparse document-term matrix.
Measuring Euclidean distance in a sparse matrix can affect the way clustering takes place. We would certainly expect this to be the case if one document was much longer than the other. Using cosine similarity is one way to address this problem since the angle between document vectors does not change depending on the location of points along the vector. There are a large variety of variants on these basic Euclidean and non-Euclidean approaches to defining the distance metric for clustering. For further discussion, see Choosing a Distance Metric.
Other factors influencing your results will depend on the type of cluster analysis you use, your choice of linkage method, the choice of token-type, and the size of the sample of terms considered (e.g., the whole text, as opposed to the top 100 most frequent terms).
Types of Cluster Analysis
There are two main approaches to clustering: hierarchical and partitioning. Hierarchical clustering attempts to divide documents into a branching tree of groups and sub-groups, whereas partitioning methods attempt to assign documents to a pre-designated number of clusters. Clustering methods may also be described as exclusive, generating clusters in which no document can belong to more than one cluster, or overlapping, in which documents may belong belong to multiple clusters. Hierarchical methods allow clusters to belong to other clusters, whereas flat methods do not allow for this possibility. Lexos implements two types of cluster analysis: a form of hierarchical clustering called agglomerative hierarchical clustering and a form of flat partitioning called K-Means.
While a full discussion of the many types of cluster analysis is beyond the scope of this work, we may note two other methods that are commonly used in literary text analysis. One such method is topic modeling, which generates clusters of words that appear in close proximity to each other within the corpus. The most popular tool for topic modeling is Mallet, and the Lexos Multicloud tools allows you to generate topic clouds of the resulting clusters from Mallet data. Another type of clustering is often referred to as community detection, where algorithms are used to identify clusters of related nodes in a network. More information about community detection in network data can be found here [needs a link].
Strengths and Limitations of Cluster Analysis
Cluster analysis has been put to good use for a variety of purposes. Some of the most successful work using cluster analysis has been in the area of authorship attribution, detection of collaboration, source study, and translation (REFs). Above all, it can be a useful provocation, helping to focus enquiry on texts or sections of texts that we might otherwise ignore.
It is important to be aware that there is not a decisive body of evidence supporting the superiority of one clustering method over another. Different methods often produce very different results. There is also a fundamental circularity to cluster analysis in that it seeks to discover structure in data by imposing structure on it. While these considerations may urge caution, it is equally important to remember that they have analogues in traditional forms of an analysis and interpretation.
Regardless of which method is chosen, we should be wary of putting too much stock in any single result. Cluster analysis is most useful when it is repeated many times with slight variations to determine which results are most useful. One of the central concerns will always be the validity of algorithmically produced clusters. In part, this is a question of statistical validity based on the nature of texts used and the particular implementation chosen for clustering. There is ongoing research on what statistical criteria makes a cluster a "good cluster"--and how to learn that is for a given data set--but there is very little consensus that is of practical use for textual analysis. In Lexos, we include a statistical measure called the Silhouette Score, which gives a general indication of how well documents lie within their clusters. Silhouette scores do not reply on knowing class labels beforehand. However, a high or low Silhouette Score should not be taken to mean that the clustering is better or worse. It is merely one of many possible measures we could use. [See http://blog.data-miners.com/2011/03/cluster-silhouettes.html for further information in Silhouette Scores.] For further discussion, see Establishing Robust Clusters.
There is also the more fundamental question of whether similarity as measured by distance metrics corresponds to similarity as apprehended by the human psyche or similarity in terms of the historical circumstances that produced the texts under examination. One of the frequent complaints about cluster analysis is that, in reducing the dimensionality of the documents within the cluster, it occludes access to the content—especially the semantics of the content—responsible for the grouping of documents. Point taken. Treating documents as vectors of term frequencies ignores information, but the success of distance measures on n-dimensional vectors of term counts is clear: cluster analysis continues to support exploration that helps define next steps, for example, new ways of segmenting a set of old texts.