Choosing a Distance Metric
In hierarchical clustering, a distance metric must be chosen before running the algorithm that merges documents into clusters. The distance metric is essentially how you define the "difference" between your documents. The Euclidean distance metric measures the straight-line distance between two document vectors (the counts of each unique term in each document). Other metrics, such as cosine distance, which measures the angle between the vectors, can also be used to measure the distance between clusters. Below are basic explanations of each distance metric in Lexos, followed by a discussion of cut size. Mathematical definitions may be found in the glossary.
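To make this concrete, here is a minimal sketch comparing Euclidean and cosine distance on two small term-count vectors. The vocabulary and counts are invented, and SciPy's distance functions are used for illustration; this is not code taken from Lexos itself.

```python
import numpy as np
from scipy.spatial.distance import euclidean, cosine

# Invented term counts for two short documents over the vocabulary ("the", "whale", "sea")
doc_a = np.array([4, 0, 1])
doc_b = np.array([2, 3, 1])

print(euclidean(doc_a, doc_b))  # straight-line distance between the count vectors
print(cosine(doc_a, doc_b))     # 1 - cos(angle): 0 when the vectors point the same way
```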
The Norms
Most distance metrics are variants of one of three "norms", generic measures of distance. For the sake of simplicity, the following examples are in two dimensions.
Manhattan distance is the default measure for a 1-norm, and can be thought of as movement in four directions: up, down, left, and right. Diagonals do not exist in Manhattan distance, so the point (4,3) is actually seven units away from the point (0,0): four units right and three units up. Manhattan distance is also known as taxicab distance; both names come from the idea of traveling along a city grid. This distance is also equivalent to measuring the area between two distribution curves (cf. Riemann sums).
Chebyshev distance is the default measure for an ∞-norm, and can be thought of as movement in eight directions: the four directions of Manhattan plus the four diagonals. The distance between (0,0) and (4,3) in the ∞-norm is four units: three moves diagonally up and to the right, then one move right. Chebyshev distance is often used when programming enemy AI in video games.
Euclidean distance is the default measure for a 2-norm, and allows omni-directional movement. The distance between (0,0) and (4,3) is simply the length of the straight line between them: five units. This is the most commonly used definition of distance.
Those interested in mathematics may or may not be surprised to learn that a single formula governs these three norms, and indeed all p-norms: Minkowski distance. However, further discussion of this is quite beyond the scope of Lexos.
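As a quick numerical check of the examples above, the sketch below computes all three norms for the points (0,0) and (4,3) using SciPy (assumed here for illustration), and shows how the first two fall out of the Minkowski formula by varying p.

```python
from scipy.spatial.distance import cityblock, euclidean, chebyshev, minkowski

a, b = (0, 0), (4, 3)

print(cityblock(a, b))   # 7   (1-norm / Manhattan)
print(euclidean(a, b))   # 5.0 (2-norm / Euclidean)
print(chebyshev(a, b))   # 4   (infinity-norm / Chebyshev)

# The first two are Minkowski distance with p = 1 and p = 2;
# Chebyshev is the limit as p grows without bound.
print(minkowski(a, b, p=1), minkowski(a, b, p=2), minkowski(a, b, p=100))
```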
The Euclideans, or 2-norms
The following are variants of Euclidean distance.
Standardized Euclidean
This version of Euclidean distance standardizes all vector components (i.e. each unique term) to have a variance of 1, essentially forcing all data to be measured along the same scale.
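The sketch below illustrates that standardization using SciPy's `seuclidean` (an assumption for illustration, with a made-up document-term matrix): each term's squared difference is divided by that term's variance across the collection before summing.

```python
import numpy as np
from scipy.spatial.distance import seuclidean

# Made-up document-term matrix: one row per document, one column per unique term
dtm = np.array([[4.0, 0.0, 1.0],
                [2.0, 3.0, 1.0],
                [5.0, 1.0, 2.0]])

# Variance of each term across the collection: the scale each component is measured on
variances = dtm.var(axis=0, ddof=1)

# Each squared difference is divided by that term's variance before summing
print(seuclidean(dtm[0], dtm[1], variances))
```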
Squared Euclidean
Here, progressively greater weight is placed on vectors that are farther apart: doubling a difference quadruples its contribution to the distance. Squared Euclidean distance is also known as quadrance.
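The relationship to ordinary Euclidean distance is simply the square, as this small sketch shows (invented counts, SciPy assumed):

```python
from scipy.spatial.distance import euclidean, sqeuclidean

u, v = [4, 0, 1], [2, 3, 1]
# Squared Euclidean is the Euclidean distance squared,
# so larger differences are weighted disproportionately more.
print(sqeuclidean(u, v), euclidean(u, v) ** 2)  # both 13 (up to floating-point rounding)
```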
Cosine
This variant of Euclidean distance uses the angle between two vectors rather than their lengths, which makes it very sensitive to less frequently used vocabulary.
Correlation
As one might expect, the correlation coefficient of the vector components can be used as a measure of distance. Correlation distance is equivalent to Cosine distance after vectors have been shifted by their means.
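That equivalence is easy to verify with a quick sketch (invented vectors; SciPy assumed for illustration):

```python
import numpy as np
from scipy.spatial.distance import correlation, cosine

u = np.array([4.0, 0.0, 1.0, 2.0])
v = np.array([2.0, 3.0, 1.0, 0.0])

# Correlation distance equals cosine distance once each vector is shifted by its mean
print(correlation(u, v))
print(cosine(u - u.mean(), v - v.mean()))  # same value
```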
The Taxicabs, or 1-norms
The following are all variants of Manhattan distance.
Bray-Curtis
This measure is essentially a standardized Manhattan distance; colloquially, it is the percentage of one text that differs from another.
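That "standardized Manhattan" description can be checked directly for non-negative count vectors, as in the sketch below (invented counts; SciPy assumed):

```python
import numpy as np
from scipy.spatial.distance import braycurtis, cityblock

u = np.array([4.0, 0.0, 1.0])
v = np.array([2.0, 3.0, 1.0])

# Bray-Curtis is Manhattan distance scaled by the combined size of the two texts
print(braycurtis(u, v))
print(cityblock(u, v) / (u + v).sum())  # same value
```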
Canberra
Canberra is a weighted Manhattan distance; here, "weighted" is used in the same sense as in a weighted average. Instead of summing the raw differences (as Manhattan does), it sums the ratios of the differences to the combined term counts. This weighting makes the metric sensitive to differences in less frequently used terms.
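The sketch below spells out that per-term ratio (invented counts; SciPy assumed):

```python
import numpy as np
from scipy.spatial.distance import canberra

u = np.array([4.0, 0.0, 1.0])
v = np.array([2.0, 3.0, 1.0])

# Each term contributes |u_i - v_i| / (|u_i| + |v_i|): 2/6 + 3/3 + 0/2 = 1.333...
# (SciPy skips terms where both counts are zero.)
print(canberra(u, v))
```

Because each term's difference is divided by that term's own combined count, a one-count change in a rare word counts as much as a large swing in a common one, which is where the sensitivity to infrequent vocabulary comes from.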
Jaccard
Derived from Bray-Curtis, the Jaccard distance is the ratio of the number of terms that appear in only one of the two texts to the total number of terms that appear in either. Where Bray-Curtis works with the number of occurrences of each term, Jaccard considers only whether a term is present. As such, the primary use of Jaccard distance is to measure the percentage of shared vocabulary.
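A minimal sketch of that presence-only comparison (invented counts; SciPy assumed) reduces the term counts to presence/absence before measuring the unshared portion of the vocabulary:

```python
import numpy as np
from scipy.spatial.distance import jaccard

u = np.array([4, 0, 1, 2])
v = np.array([2, 3, 1, 0])

# Reduce counts to presence/absence before comparing vocabularies
present_u = u > 0
present_v = v > 0

# Fraction of the combined vocabulary that the two texts do NOT share
print(jaccard(present_u, present_v))  # 0.5: two of the four terms appear in only one text
```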
Hamming
Hamming distance is the number of terms that need to be changed to make one segment of text identical to another.
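One caveat if you reproduce this with SciPy (assumed here for illustration): its `hamming` function reports the proportion of differing components rather than the raw count.

```python
import numpy as np
from scipy.spatial.distance import hamming

u = np.array([4, 0, 1, 2])
v = np.array([2, 3, 1, 0])

print(hamming(u, v))            # 0.75: three of the four term counts differ
print(hamming(u, v) * len(u))   # 3.0: the number of differing components
```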
The Selective Listeners, or ∞-norms
Chebyshev distance is the only ∞-norm metric used in Lexos.
On Minimum Segment Size
All of the above distance metrics share the same property: they become more precise as segment size increases. Determining the accuracy of clustering as segment size decreases is more challenging, and there has been no mathematically well-defined lower bound below which the "normal" use of language has more impact than stylistic differences, rendering this form of analysis unreliable at best.
Toward this end, the Lexos researchers have explored this question for each distance/linkage pair. Their findings are summarized in the table below:
| Distance | Single | Complete | Average | Weighted |
|---|---|---|---|---|
| Euclidean | 1000 | 1200 | 900 | 1100 |
| Squared Euclidean | 1700 | 2000 | 1400 | 1700 |
| Standardized Euclidean | 1500 | 1700 | 1200 | 1500 |
| Cosine | 2700 | 3100 | 2200 | 2700 |
| Correlation | 2400 | 2800 | 2000 | 2500 |
| Manhattan | 900 | 1000 | 700 | 900 |
| Canberra | 2100 | 2400 | 1700 | 2100 |
| Bray-Curtis | 800 | 900 | 700 | 800 |
| Jaccard | 2100 | 2400 | 1700 | 2100 |
| Hamming | 900 | 1000 | 700 | 900 |
| Chebyshev | 900 | 1000 | 700 | 900 |
Table: Minimum recommended cut size (cut by token) for analysis via distance/linkage pair.
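For convenience, the sketch below transcribes the table into a small lookup so a script can check a planned cut size against the recommendation. This helper is hypothetical, not part of Lexos, and the SciPy-style metric names used as keys are our own labels.

```python
# Hypothetical helper: recommended minimum cut sizes (in tokens) from the table above,
# keyed by a SciPy-style metric name, with one value per linkage method.
MIN_CUT_SIZE = {
    #               single, complete, average, weighted
    "euclidean":   (1000, 1200,  900, 1100),
    "sqeuclidean": (1700, 2000, 1400, 1700),
    "seuclidean":  (1500, 1700, 1200, 1500),
    "cosine":      (2700, 3100, 2200, 2700),
    "correlation": (2400, 2800, 2000, 2500),
    "cityblock":   ( 900, 1000,  700,  900),  # Manhattan
    "canberra":    (2100, 2400, 1700, 2100),
    "braycurtis":  ( 800,  900,  700,  800),
    "jaccard":     (2100, 2400, 1700, 2100),
    "hamming":     ( 900, 1000,  700,  900),
    "chebyshev":   ( 900, 1000,  700,  900),
}
LINKAGES = ("single", "complete", "average", "weighted")

def meets_recommendation(metric: str, linkage: str, segment_tokens: int) -> bool:
    """True if a planned segment size (in tokens) meets the recommended minimum."""
    return segment_tokens >= MIN_CUT_SIZE[metric][LINKAGES.index(linkage)]

print(meets_recommendation("cosine", "average", 2500))  # True: 2500 >= 2200
```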
Below is a walkthrough of the logical process; each of the forty-four values in the table was found via the following method.
1. The distance formula was restated in terms of segment size, using an ideal distribution according to Zipf's Law. One word of this ideal distribution was then changed in the manner that had the greatest impact on the vector; the new vector was calculated and the distance it moved was measured. This gave a generalized form of the largest gap between adjacent sets, akin to the radius of a Hamming sphere. (A rough illustration of this step appears after this list.)
2. A generalized linkage formula was used to track the movement of the 'center' of a cluster as new leaves were included. This formula required the use of an average distance. Because of a direct relation between the root mean squared error and the number of differing words in a segment, an average distance was found by representing a dummy vector with a variable RMSE and rewriting the equation in terms of segment size. Essentially, what emerged was the expected distance of cluster movement, together with a standard deviation to go with it.
3. These two functions, the rewritten distance function and the generalized linkage function, were set against a variable representing the number of changes as given by the RMSE equation. This inequality was solved for segment size, giving a number below which there is a significant chance the linkage process will sort non-optimally (e.g. unwanted chaining for single linkage).
4. Then, for good measure, an error margin of half a standard deviation of the cluster movement was factored in, and the resulting numbers were rounded to the nearest multiple of one hundred. This ensures that users who follow the recommendations will consistently achieve mathematically significant results.[1]
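As referenced in step 1, the sketch below is a rough, illustrative version of the idea only: it is not the Lexos researchers' derivation, and the Zipf vector, vocabulary size, and the particular one-word change are all our own assumptions. It simply shows that a single-token change moves the document vector less as the segment grows.

```python
import numpy as np
from scipy.spatial.distance import cosine

def zipf_counts(n_tokens: int, vocab_size: int = 1000) -> np.ndarray:
    """An 'ideal' term-count vector for a segment of n_tokens, following Zipf's Law."""
    ranks = np.arange(1, vocab_size + 1)
    weights = 1.0 / ranks
    return n_tokens * weights / weights.sum()

for n in (500, 1000, 2000, 4000):
    ideal = zipf_counts(n)
    perturbed = ideal.copy()
    # Change one token: here, by moving it from the most common term to the rarest
    perturbed[0] -= 1
    perturbed[-1] += 1
    # The displacement caused by a one-token change shrinks as the segment grows
    print(n, cosine(ideal, perturbed))
```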
Clustering with these sizes does not make the choice of linkage or distance metric obsolete, nor does it guarantee that the clustering process will be free of "undesirable effects". These numbers were derived from a model that actively tries to mitigate the impact of any single clustering event.
[1] The result for one pair, Euclidean/Average (the previous default settings in Lexos), was checked by finding a curve of best fit for a given segment size while scaling the RMSE. The function was rotated by π/4 and the local minimum found, corresponding to the segment size previously deduced. (735.556… for those curious.)