In the Margins

Choosing a Distance Metric

In hierarchical clustering, a distance metric must be chosen before running the algorithm that merges documents into clusters. K-means clustering uses standard Euclidean distance to measure how far each document lies from a cluster centroid, but Euclidean or other distance measures can be used to evaluate cluster quality. (In Lexos, this is done through the Silhouette Score, which can be calculated using multiple distance metrics.)
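To make concrete the idea that cluster quality can be evaluated under more than one distance metric, here is a minimal sketch in Python of a silhouette score that accepts any distance function. It is an illustration only, not the Lexos implementation. The function names, the toy term counts, and the cluster labels are all hypothetical, and the sketch assumes at least two clusters and no document vector consisting entirely of zeros.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two term-count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    """Cosine distance: 1 minus the cosine of the angle between the vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

def silhouette_score(vectors, labels, distance):
    """Mean silhouette coefficient over all documents for a given distance function."""
    def mean_distance(i, cluster):
        # Average distance from document i to the other members of `cluster`.
        others = [distance(vectors[i], vectors[j])
                  for j, label in enumerate(labels) if label == cluster and j != i]
        return sum(others) / len(others) if others else None

    scores = []
    for i, own in enumerate(labels):
        a = mean_distance(i, own)          # cohesion within the document's own cluster
        if a is None:                      # a singleton cluster scores 0 by convention
            scores.append(0.0)
            continue
        # Separation: the nearest other cluster, by average distance.
        b = min(mean_distance(i, c) for c in set(labels) if c != own)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical term counts for four documents over a three-word vocabulary.
documents = [[5, 1, 0], [6, 2, 0], [0, 1, 5], [1, 0, 6]]
labels = [0, 0, 1, 1]  # assumed cluster assignments

euclid_score = silhouette_score(documents, labels, euclidean)
cosine_score = silhouette_score(documents, labels, cosine)
print(f"Euclidean silhouette: {euclid_score:.3f}")
print(f"Cosine silhouette:    {cosine_score:.3f}")
```

Scores close to 1 suggest compact, well-separated clusters under the chosen metric, while scores near 0 or below suggest overlapping or poorly assigned clusters. The same clustering can score differently depending on the metric used to evaluate it.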

A few general observations have already been made under Cluster Analysis. The distance metric is essentially how you define the difference between your documents. Euclidean distance measures the magnitude of the difference between two document vectors (vectors of counts for each word in both documents). Non-Euclidean measures like cosine similarity, which is based on the angle between the vectors, can also be converted into measures of distance between documents. Since document-term matrices are often sparse (they contain a lot of term counts of 0), cosine similarity may be a better option for clustering larger documents, particularly if the documents are of uneven lengths. But the emphasis must be placed on "may". There are no hard and fast rules, although there is renewed attention to providing more nuanced help with the choice of metrics (Jannidis et al. 2015, Eder 201?).
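The contrast between magnitude and angle can be seen in a small sketch. The Python below is an illustration under stated assumptions, not Lexos code: the vocabulary, the term counts, and the function names are hypothetical. It compares a short document with a much longer one that uses words in the same proportions, and with a document of similar length that uses different words.

```python
import math

def euclidean_distance(a, b):
    """Magnitude of the difference between two term-count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 minus cosine similarity, so that 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical counts for the vocabulary ["whale", "sea", "love", "letter"].
short_doc = [2, 1, 0, 0]
long_doc = [20, 10, 0, 0]   # same word proportions, ten times longer
other_doc = [0, 1, 2, 1]    # similar length to short_doc, different vocabulary

euclid_short_long = euclidean_distance(short_doc, long_doc)
euclid_short_other = euclidean_distance(short_doc, other_doc)
cos_short_long = cosine_distance(short_doc, long_doc)
cos_short_other = cosine_distance(short_doc, other_doc)

print(f"Euclidean: short-long {euclid_short_long:.2f}, short-other {euclid_short_other:.2f}")
print(f"Cosine:    short-long {cos_short_long:.2f}, short-other {cos_short_other:.2f}")
```

With these toy counts, Euclidean distance places the short document closer to the document with different vocabulary, because the two are similar in length. Cosine distance places it closest to the longer document, because the two share the same word proportions. Normalizing counts to proportions before measuring Euclidean distance is another way to reduce the effect of length, which is one reason the choice remains a matter of experiment.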

The circumstances under which certain distance metrics perform best, and even how machine learning might aid in the selection of such metrics, are the subject of ongoing research. However, much of that research uses data very different from the type of material used in literary text analysis. Currently, our best advice is to be aware of how you are measuring distance and to experiment with different distance metrics, trying to explain how they operate on your texts. We provide a case study here which introduces some of the most common metrics (all available in Lexos) and shows how they affect the results for a single data set.

Eder, M. (201?). Visualization in stylometry: some problems and solutions. To be published in Digital Scholarship in the Humanities.

Jannidis, F., Pielström, S., Schöch, C., Vitt, T. (2015). Improving Burrows' Delta -- An empirical evaluation of text distance measures. Presented at DH 2015 Global Digital Humanities, Sydney, Australia, July 3, 2015.
