In the Margins

Silhouette Scores

A Silhouette Score, or Silhouette Coefficient, is a measure of fit for your clusters. It gives a general indication of how well individual objects lie within their cluster. Scores range from -1 to 1. A score near 1 indicates tight, distinct clusters. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.

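Silhouette scores are calculated from two quantities for each object: a, the mean distance between the object and all other objects in its own cluster, and b, the mean distance between the object and the objects in the nearest neighboring cluster. The score for the object is (b - a) / max(a, b).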

Applied to documents, a single silhouette score is calculated for each document. The silhouette score for a collection of documents is the mean of the silhouette scores for each individual document.
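To make the per-document and collection-level scores concrete, here is a minimal sketch in plain Python. It uses Rousseeuw's formula, (b - a) / max(a, b), where a is a document's mean distance to the other members of its own cluster and b is its mean distance to the members of the nearest other cluster. The four points and their labels are invented for illustration, and the function name silhouette_samples is our own choice, not part of Lexos.

```python
from math import dist  # Euclidean distance, Python 3.8+


def silhouette_samples(points, labels):
    """Return one silhouette score per point (hypothetical helper)."""
    scores = []
    for i, p in enumerate(points):
        # Collect distances from p to every other point, grouped by cluster.
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.pop(labels[i], [])
        if not own or not by_cluster:
            # A cluster of one (or a single cluster overall) scores 0 by convention.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                                # mean distance within own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # mean distance to nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) else 0.0)
    return scores


# Invented example: four "documents" reduced to two features each.
points = [(0, 0), (0, 1), (5, 5), (5, 6)]
labels = [0, 0, 1, 1]

scores = silhouette_samples(points, labels)
collection_score = sum(scores) / len(scores)  # mean of the per-document scores
print([round(s, 3) for s in scores], round(collection_score, 3))
```

Because the two groups are far apart relative to their internal spread, every document scores well above 0, and so does the collection as a whole.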

Silhouette scores are most easily calculated from flat clusterings like those produced by k-means. In a flat clustering, every object is represented as a point in a multidimensional space and assigned to exactly one cluster, so simple Euclidean distance (or any distance metric substituted for it) can be used to measure the difference between every pair of points. The Lexos K-Means Tool offers all the distance metrics available in its Hierarchical Clustering Tool for calculating silhouette scores.
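The silhouette calculation only needs some function that returns a distance between two documents, which is why metrics can be swapped freely. The sketch below defines three common metrics with the same signature and applies them to invented term-count vectors. The document names and counts are hypothetical, and nothing here is meant to reproduce the Lexos implementation.

```python
from math import dist, sqrt


def euclidean(u, v):
    return dist(u, v)


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine(u, v):
    # Cosine distance: 1 minus the cosine of the angle between the vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return 1 - dot / norm


# Invented term counts: long_doc has the same proportions as short_doc but is
# ten times longer; other_doc is short and uses different vocabulary.
short_doc = (10, 0, 1)
long_doc = (100, 0, 10)
other_doc = (0, 10, 1)

for metric in (euclidean, manhattan, cosine):
    print(metric.__name__,
          round(metric(short_doc, long_doc), 3),
          round(metric(short_doc, other_doc), 3))
```

Euclidean and Manhattan distance place short_doc closer to other_doc simply because both are short, while cosine distance places it closer to long_doc because the two share the same proportions. The choice of metric can therefore change which cluster looks "nearest", and with it the silhouette score.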

Generating a silhouette score for hierarchical clusterings is slightly more complicated. Since in a hierarchical cluster analysis clusters can also form parts of other clusters, the hierarchy must first be "flattened" so that each document is assigned to exactly one cluster. There are many ways of doing this. Perhaps the easiest to understand is the use of a maximum number of clusters as a method for assigning documents. In hierarchical clustering, each document is technically a cluster of 1, so the maximum number of clusters possible is the same as the number of leaves in the dendrogram. Since a dendrogram needs to have at least one branching structure, the minimum number of possible clusters is 2. In other words, we can set the maximum number of clusters allowed in our flattened hierarchy to any value from 2 up to the number of leaves in our dendrogram.

One way to think about this is to imagine a line cutting across a vertically oriented dendrogram at a random point on the x-axis. How confident can we be that all the documents on the left belong together (and the same for the documents on the right)? If we draw lines between every document, we are naturally very confident that each document belongs with itself. If we draw a smaller number of lines, we will lose confidence, but we might still find the groupings acceptable. Manipulating the maximum cluster threshold, then, is one way in which we could examine how robust our groupings are.
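The idea of varying the maximum cluster threshold can be sketched in plain Python. The example below is an illustration under stated assumptions, not the Lexos code. It builds the hierarchy by average-linkage merging and simply stops when the requested number of clusters remains. For a simple case like this one, with no tied distances, that has the same effect as cutting a finished dendrogram. The six points and the function names flatten and mean_silhouette are hypothetical.

```python
from itertools import combinations
from math import dist


def flatten(points, max_clusters):
    """Merge the closest clusters (average linkage) until max_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # every document starts as a cluster of 1

    def average_distance(pair):
        a, b = pair
        return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > max_clusters:
        a, b = min(combinations(clusters, 2), key=average_distance)
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)

    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def mean_silhouette(points, labels):
    """Mean of the per-document silhouette scores (Euclidean distance)."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.pop(labels[i], [])
        if not own or not by_cluster:
            scores.append(0.0)  # a cluster of 1 scores 0 by convention
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in by_cluster.values())
        scores.append((b - a) / max(a, b) if max(a, b) else 0.0)
    return sum(scores) / len(scores)


# Invented data: two loose groups of three "documents" each.
points = [(0, 0), (0, 1), (2, 0), (8, 8), (8, 9.5), (11, 8)]

# Try every threshold from 2 up to the number of leaves.
results = {k: mean_silhouette(points, flatten(points, k))
           for k in range(2, len(points) + 1)}
best_k = max(results, key=results.get)
for k, score in results.items():
    print(k, round(score, 3))
```

With this data the score is highest at a threshold of 2 and falls as the threshold rises. Note that when every document is its own cluster the silhouette score is 0 by convention, so a silhouette score rewards groupings that are both compact and well separated rather than simply rewarding more clusters.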

Another method we could adopt to flatten a hierarchical cluster is to set the threshold according to some branch height in the dendrogram. We can then compare the height of each link in the tree with the average height of the adjacent links that are less than, say, two levels below it. This comparison is known as the inconsistency coefficient, the theory being that a clade whose branch height is very different from the heights of the adjacent clades is "inconsistent" with them.
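As a rough illustration of the inconsistency coefficient, the sketch below computes, for each link, its height minus the mean height of the links considered, divided by their standard deviation. Several details are assumptions made for the example: the link itself is included in the mean, the sample standard deviation is used, a link with nothing to compare against gets a coefficient of 0, and the tree is stored as (left, right, height) rows in which merge i creates cluster number n_leaves + i. The heights are invented. Libraries such as SciPy provide their own implementation (scipy.cluster.hierarchy.inconsistent), which should be preferred in practice.

```python
from statistics import mean, stdev


def inconsistency(links, n_leaves, depth=2):
    """Inconsistency coefficient for each link (hypothetical helper).

    links[i] = (left_id, right_id, height); ids below n_leaves are leaves,
    and merge i creates the cluster with id n_leaves + i.
    """
    def heights_below(cluster_id, levels):
        # Leaves carry no link; stop once the depth budget is spent.
        if cluster_id < n_leaves or levels == 0:
            return []
        left, right, height = links[cluster_id - n_leaves]
        return ([height]
                + heights_below(left, levels - 1)
                + heights_below(right, levels - 1))

    coefficients = []
    for i, (_, _, height) in enumerate(links):
        heights = heights_below(n_leaves + i, depth)
        spread = stdev(heights) if len(heights) > 1 else 0.0
        coefficients.append((height - mean(heights)) / spread if spread else 0.0)
    return coefficients


# Invented dendrogram with six leaves (ids 0-5) and five links (ids 6-10).
links = [
    (0, 1, 1.0),   # cluster 6
    (3, 4, 1.5),   # cluster 7
    (2, 6, 2.1),   # cluster 8
    (5, 7, 3.2),   # cluster 9
    (8, 9, 11.8),  # cluster 10: joins the two main groups
]
coefficients = inconsistency(links, n_leaves=6)
for link, value in zip(links, coefficients):
    print(link, round(value, 3))
```

The top link, which joins the two main groups at a much greater height than the links beneath it, receives the largest coefficient. A flattening based on this criterion would cut the tree at links whose coefficient exceeds a chosen threshold.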

At present, Lexos provides only the maximum cluster criterion as a method of flattening the dendrogram, although the inconsistency criterion and others will be added in the future.

If you are wondering why silhouette scores have that name, imagine the silhouette score for each document in a cluster plotted as a horizontal line (the length of the line being proportional to the score). With the lines for each document piled on top of each other, they will form the appearance of a solid shape—a silhouette—which will be different for each cluster. Peter Rousseeuw's original description of silhouettes is quite accessible, even for the non-mathematician reader.

See Peter J. Rousseeuw (1987). "Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis". Journal of Computational and Applied Mathematics. 20: 53–65.