Hierarchical clustering does not require you to choose the number of clusters in advance. A dendrogram, a visual representation of the clusters, can be built by two methods. Divisive hierarchical clustering begins with only one cluster (consisting of all documents) and proceeds to cut it into separate “sub-clusters”, repeating the process until the criterion for dividing them has been exhausted. Alternatively, agglomerative hierarchical clustering begins with every document as its own cluster and then proceeds to merge these items into “super-clusters” based on the selected distance metric and linkage criterion (see below). Lexos offers a tool for performing agglomerative hierarchical clustering.
The clusters that result from hierarchical clustering are typically visualized with a two-dimensional tree diagram called a dendrogram. For more information about the construction and interpretation of dendrograms in this method, see the video below:
Since the resulting tree technically contains clusters at multiple levels, the result of the cluster analysis is obtained by “cutting” the tree at the desired level. Each connected component then forms a cluster for interpretation.
The results of hierarchical clustering and the topology of the resulting dendrogram may vary depending on the distance metric, the linkage criterion used to form the clusters, and other factors such as tokenization and the number of most frequent words used. The distance metric is the measure used to define document similarity, that is, how "far" one document is from another.
Hierarchical clustering presents the user with three main challenges:
- Which distance metric to use.
- What type of linkage criterion to select.
- Where to cut the tree.
Each of these challenges will be considered in turn.
Selecting a Distance Metric
One of the most important (and least well-documented) aspects of the hierarchical clustering method is the distance metric. Since we are representing texts as document vectors, it makes sense to define document similarity by comparing the vectors. One way to do this is to measure the distance between each pair of vectors. For example, if two vectors are visualized as two sides of a triangle drawn from a common origin, the length of the third side, which joins their endpoints, can be used as a measure of the distance between the two documents. This method of measuring how far apart two documents are is known as Euclidean distance. Other measures, such as cosine similarity, which bases the distance between the two documents on the angle between their two vectors, are also options in the distance metric drop-down menu. We have had good success with Euclidean distance (measured along a straight line between two points) for medium-sized texts and with Bray-Curtis distance (roughly, the proportion of the two texts' combined term counts that they do not share) for texts of all sizes. While we recommend using these two distances as a generic starting point due to their ubiquity and scalability (respectively), below are suggestions that are more specialized.
Generally, the following are appropriate starting points:
- Large number of terms per segment, vast vocabulary (such as comparing entire corpora): cosine, Manhattan, or Canberra.
- Large number of terms per segment, smaller vocabulary (chapters of books): Euclidean, Chebyshev, or Standardized Euclidean.
- Small number of terms per segment, diverse vocabulary (e.g. non-epic poetry): Correlation, Jaccard, or Squared Euclidean.
- Small number of terms per segment, rather diminutive lexicon (character dialogue): Bray-Curtis or Hamming.
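To make the differences between metrics concrete, the following minimal sketch computes three of the distances named above for hypothetical term counts. The vectors, the function names, and the use of raw counts rather than proportions are assumptions made for illustration; they do not describe how Lexos itself is implemented.

```python
import math


def euclidean(u, v):
    """Straight-line distance between two term-count vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """One minus the cosine of the angle between the two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def bray_curtis(u, v):
    """Sum of absolute count differences divided by the sum of all counts."""
    differences = sum(abs(a - b) for a, b in zip(u, v))
    total = sum(a + b for a, b in zip(u, v))
    return differences / total


# Hypothetical counts of three terms in three documents.
doc_a = [3, 4, 0]
doc_b = [6, 8, 0]  # same proportions as doc_a, but twice as long
doc_c = [0, 4, 3]  # same length as doc_a, but a different vocabulary

metrics = {
    "euclidean": euclidean,
    "cosine": cosine_distance,
    "bray-curtis": bray_curtis,
}
for name, metric in metrics.items():
    a_to_b = round(metric(doc_a, doc_b), 3)
    a_to_c = round(metric(doc_a, doc_c), 3)
    print(f"{name}: a-b = {a_to_b}, a-c = {a_to_c}")
```

In this toy example Euclidean distance rates doc_a as closer to doc_c than to doc_b, even though doc_b uses the same terms in the same proportions and is merely twice as long, whereas cosine distance treats doc_a and doc_b as identical. This is one reason the choice of metric interacts with document length.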
Choosing a Linkage Method
The second choice that must be made before running a clustering algorithm is the linkage method. At each stage of the clustering process a choice must be made about which two clusters should be joined (and recall that a single document itself forms a cluster at the lowest level of the hierarchy). An intuitive means for doing this is to join the current cluster to the cluster containing the point (e.g., a document vector) closest to any of its own points. This is known as single linkage, which joins clusters based on only a single pair of points. Single linkage does not take into account the rest of the points in either cluster, and the resulting dendrograms tend to have spread-out clusters. This effect is called "chaining".
Complete linkage uses the opposite approach. It measures the distance between the current cluster and each of the others by the two points that are furthest apart. The cluster for which this maximum distance is shortest is joined to the current cluster. Because no other pair of points can be further apart than that maximum, complete linkage in effect takes all the points in both clusters into account. It tends to produce compact, evenly distributed clusters in the resulting dendrograms.
Average linkage is a compromise between single and complete linkage. It takes the average of the distances between every point in one cluster and every point in the other and uses the shortest average distance for deciding which cluster should be joined to the current one. We have had good success with average linkage. The weighted average linkage performs a similar calculation, but when two clusters are merged it gives each of them equal weight in later averages, regardless of how many documents each contains. It therefore may be a good option when there is significant variation in the size of the clusters.
Another commonly used form of linkage (not currently available in Lexos) is Ward's criterion, which at each step joins the pair of clusters whose merger produces the smallest increase in within-cluster variance and, as a result, tends to produce clusters of similar size. It may not be appropriate for use with documents of variable size (cf. http://academic.reed.edu/psychology/stata/analyses/advanced/agglomerative.html).
Visualizations of the differences between the linkage criteria can be seen here. Which linkage criterion you choose depends greatly on the variability of your data and your expectations of its likely cluster structure. The fact that it is very difficult to predict this in advance may explain why the "compromise" of average linkage has proved successful for us.
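The difference between single, complete, and average linkage can be sketched in a few lines. The clusters below are hypothetical one-dimensional "documents" chosen only so that the arithmetic is easy to follow, and the helper names are illustrative rather than part of Lexos or any library.

```python
from itertools import product


def pairwise(cluster_a, cluster_b, dist):
    """Distances between every point in one cluster and every point in the other."""
    return [dist(p, q) for p, q in product(cluster_a, cluster_b)]


def single_linkage(cluster_a, cluster_b, dist):
    """Distance between the two closest points."""
    return min(pairwise(cluster_a, cluster_b, dist))


def complete_linkage(cluster_a, cluster_b, dist):
    """Distance between the two points that are furthest apart."""
    return max(pairwise(cluster_a, cluster_b, dist))


def average_linkage(cluster_a, cluster_b, dist):
    """Mean of all point-to-point distances."""
    distances = pairwise(cluster_a, cluster_b, dist)
    return sum(distances) / len(distances)


def abs_diff(p, q):
    """Distance between two one-dimensional points."""
    return abs(p - q)


# Hypothetical clusters: candidate_x has one near member and one far member,
# candidate_y is compact but slightly further from the current cluster.
current = [0.0, 1.0]
candidate_x = [3.0, 9.0]
candidate_y = [4.0, 5.0]

criteria = {
    "single": single_linkage,
    "complete": complete_linkage,
    "average": average_linkage,
}
choices = {}
for name, criterion in criteria.items():
    to_x = criterion(current, candidate_x, abs_diff)
    to_y = criterion(current, candidate_y, abs_diff)
    choices[name] = "x" if to_x < to_y else "y"
    print(f"{name}: to x = {to_x}, to y = {to_y}, join candidate {choices[name]}")
```

Here single linkage joins candidate_x because of its one nearby member, ignoring the distant one, which is the behaviour that leads to chaining. Complete and average linkage both prefer the compact candidate_y.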
Cutting the Dendrogram
Once the dendrogram has been generated, every document leaf will form its own cluster and all documents will belong to a single cluster at the root. In between, there may be any number of clusters formed at differing levels of the hierarchy. Not all of these clusters will necessarily be meaningful. For example, if you are trying to test the authorship of Shakespearean plays, it may not be significant that Macbeth and A Midsummer Night's Dream fall within the same cluster. It will be more interesting if a Renaissance play we do not know to be by Shakespeare falls within a cluster containing the above plays and not into clusters containing plays by other authors. On the other hand, if we are interested in the question of genre, we might be very interested to know whether Richard II, normally considered a history play, clusters with the tragedy of Macbeth or the comedy of A Midsummer Night's Dream. In practice, these sorts of considerations will cause us to draw a line on the dendrogram (often at a particular branch height) below which we will not consider clusters significant. This is known as cutting the dendrogram. Where to draw the line can be an impressionistic exercise. Like our choice of linkage, it will depend a great deal on our expectations of our data.
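Cutting at a branch height can be sketched with SciPy's linkage and fcluster functions. The five two-dimensional "documents" and the two cut heights below are hypothetical values chosen for illustration, and Lexos's own implementation may differ.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical document vectors: two tight pairs and one outlier.
docs = np.array([
    [0.0, 0.0],
    [0.0, 1.0],
    [5.0, 0.0],
    [5.0, 1.0],
    [20.0, 0.0],
])

# Build the tree with the Euclidean metric and average linkage.
tree = linkage(docs, method="average", metric="euclidean")

# Cut the tree at two different heights. Documents that are joined
# below the cut height receive the same cluster label.
labels_low = fcluster(tree, t=2.0, criterion="distance")
labels_high = fcluster(tree, t=10.0, criterion="distance")

print("cut at height 2.0: ", labels_low)
print("cut at height 10.0:", labels_high)
```

The lower cut keeps the two pairs and the outlier as three separate clusters, while the higher cut merges the two pairs into one cluster and leaves only the outlier apart. Neither cut is correct in itself; which one is useful depends on the question being asked of the texts.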
It should be clear from the above that interpreting dendrograms requires both an understanding of the choice of implementation and an understanding of the content of the materials being clustered. Furthermore, the structure of the dendrogram and its interpretation are highly dependent on our expectations about the text we are studying. This epistemological loop is well known in the Humanities, where it is taken for granted that one's perspective and biases influence interpretation. In hierarchical cluster analysis, the decision-making required for implementation builds these limitations into the method, but hopefully calls attention to them as well.
We end with some miscellaneous issues which you should be aware of in choosing hierarchical clustering as a method. First, it does not scale well. If you have a large number of documents, or large documents, the number of computations can theoretically be a strain on a computer's processing power. We have not yet established a threshold where this becomes problematic (especially since it will vary on different machines), but, if you appear to be encountering this problem, trying a simpler distance metric like Squared Euclidean may help. If you do manage to produce a dendrogram with large numbers of leaves, you may have trouble reading it because the leaf labels overlap.
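A rough sense of why the method scales poorly comes from counting document pairs: every document must be compared with every other, so the number of distances grows with the square of the number of documents. The sketch below is only back-of-the-envelope arithmetic; the storage figure assumes one 64-bit float per distance, which is an assumption and not a measurement of Lexos.

```python
def pairwise_count(n_docs):
    """Number of distinct document pairs whose distance must be computed."""
    return n_docs * (n_docs - 1) // 2


BYTES_PER_DISTANCE = 8  # assumption: one 64-bit float per stored distance

for n_docs in (10, 100, 1_000, 10_000):
    pairs = pairwise_count(n_docs)
    megabytes = pairs * BYTES_PER_DISTANCE / 1_000_000
    print(f"{n_docs:>6} documents: {pairs:>10,} distances, about {megabytes:,.2f} MB")
```

Each individual distance also takes time roughly proportional to the number of terms in the vectors, so large vocabularies add to the cost of every pair.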
These are largely practical issues, but there are also some conceptual ones. In hierarchical clustering, all items (documents and the terms they contain) are forced into clusters, a scenario that may not accurately reflect the relationships of the original texts. Another issue is that hierarchical clustering assigns documents to clusters early in the process and has no method for undoing that partitioning based on data it encounters later. If that appears to be a problem, we suggest trying K-Means clustering, which adjusts cluster membership at each step.
Statisticians have identified many strengths and shortcomings of hierarchical clustering as a method, and there is ongoing research on the most appropriate distance measures and linkage criteria (much of it using data unlike that employed in literary text analysis). In our test cases, we have typically found that the Euclidean metric with average linkage provides good results. However, Lexos allows you, even encourages you, to apply a number of algorithms and compare the results. This may be one method of establishing whether a particular clustering is valuable. See further Establishing Robust Clusters.
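One way to act on this advice is to run several combinations of metric and linkage, cut each tree into the same number of clusters, and count how often the same grouping recurs. The sketch below does this with SciPy's linkage and fcluster functions; the term proportions, the choice of two clusters, and the helper name partition are assumptions for illustration and do not describe Lexos's internals.

```python
from collections import Counter
from itertools import product

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical term proportions for six documents in two obvious groups.
docs = np.array([
    [0.90, 0.10, 0.00],
    [0.80, 0.20, 0.00],
    [0.85, 0.10, 0.05],
    [0.10, 0.10, 0.80],
    [0.00, 0.20, 0.80],
    [0.05, 0.15, 0.80],
])


def partition(data, metric, method, n_clusters):
    """Cluster the data and return the grouping as a set of sets of row indices."""
    tree = linkage(data, method=method, metric=metric)
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    groups = {}
    for doc_index, label in enumerate(labels):
        groups.setdefault(int(label), set()).add(doc_index)
    # Frozen sets make the grouping comparable regardless of label numbering.
    return frozenset(frozenset(members) for members in groups.values())


metrics = ["euclidean", "braycurtis", "cosine"]
methods = ["single", "complete", "average", "weighted"]

# Count how many metric/linkage combinations produce each grouping.
tally = Counter(
    partition(docs, metric, method, n_clusters=2)
    for metric, method in product(metrics, methods)
)
for grouping, count in tally.items():
    print(count, "combinations produced", sorted(sorted(group) for group in grouping))
```

With data this clearly separated, every combination yields the same two groups. On real texts the tally will usually be split across several groupings, and a grouping that recurs under many settings is a stronger candidate for interpretation than one that appears under only a single setting.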