K-Means Clustering

K-Means clustering partitions a set of documents into a number of groups, or clusters, in a way that minimizes the variation within each cluster. The "K" refers to the number of partitions: if you wish to see how your documents might cluster into three groups, you would set K=3.

K-Means thus forms clusters by minimizing the variation of terms within each cluster rather than by building a hierarchy of pairwise distances between documents. Unlike hierarchical clustering, K-Means requires us to choose the number of clusters (K) in advance; however, we do not need to choose a distance metric. As a result, K-Means can be a good alternative to hierarchical clustering for large data sets, since it is less computationally intensive.

When thinking of K-Means clustering, we recommend that you imagine each of your documents as a single (x, y) point on a two-dimensional coordinate plane. In this view, a cluster is a collection of documents (points) that lie close to one another and together form a group. Assigning a document to a specific cluster amounts to determining which cluster "center" is closest to that document.

[Image: points on a plane enclosed by circles, each circle forming a cluster]

The algorithm (general procedure or "recipe") for applying K-Means to your collection of documents is described next, with a short code sketch after the steps. Again, the overall goal is to partition your documents into K non-empty subsets.

  1. Decide on the number of clusters you wish to form; that is, you must pick a value for K a priori.
  2. The algorithm computes a "center," or centroid, for each cluster: the mean point of the documents in that cluster. The procedure for placing the initial centroids can vary and is discussed below.
  3. Assign each of your documents to the cluster with the nearest centroid.
  4. Repeat steps 2 and 3, re-calculating the centroid of each cluster and reassigning each document to the cluster with the closest centroid. The algorithm stops when no documents are reassigned to different clusters.
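To make this recipe concrete, here is a minimal sketch of the four steps using scikit-learn, the library the Lexos implementation draws on. The sample documents and the choice of K=2 are invented for illustration.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.cluster import KMeans

    # Turn a small (invented) collection of documents into a document-term
    # matrix, so that each document becomes a point in term space.
    docs = ["the cat sat on the mat",
            "a dog chased the cat",
            "stocks fell sharply today",
            "markets rallied after the report"]
    X = CountVectorizer().fit_transform(docs)

    # Steps 1-4: pick K, place the initial centroids (K-Means++ by default),
    # assign each document to its nearest centroid, and iterate until the
    # assignments stop changing.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
    kmeans.fit(X)
    print(kmeans.labels_)  # the cluster (0 or 1) assigned to each document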



Required Settings:

K value: There is no obvious way to choose the number of clusters. It can be helpful to perform hierarchical clustering before K-Means clustering, as the resulting dendrogram may suggest a number of clusters likely to produce meaningful results. Note also that the K-Means procedure is very sensitive to the position of the initial seeds, although the K-Means++ setting (see below) can help to constrain their placement.
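If a dendrogram does not suggest an obvious K, one common approach is to try several values of K and compare their Silhouette Scores (discussed at the end of this page). The sketch below does this with scikit-learn on an invented set of documents:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    docs = ["the cat sat on the mat", "a dog chased the cat",
            "stocks fell sharply today", "markets rallied after the report",
            "rain is expected tomorrow", "sunny skies all weekend"]
    X = CountVectorizer().fit_transform(docs)

    # Try a few candidate values of K; a higher Silhouette Score suggests
    # better-separated clusters.
    for k in range(2, 5):
        labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
        print(k, round(silhouette_score(X, labels), 3))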

Method of Visualization:
As mentioned earlier, K-Means clustering is generally visualized on a two-dimensional plane, with the distance between cluster members (documents) indicated by their coordinates. Convex polygons known as Voronoi cells may be drawn around the cluster centroids to indicate which documents fall in which clusters. Another way of visualizing the results of K-Means clustering is with Principal Component Analysis (PCA), where the dots on the plane are colored to mark their cluster membership. Both visualization approaches can help you judge the distances between clusters.
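As a sketch of the PCA view (continuing from the first example above, so X and kmeans are assumed from there), the snippet below projects the document-term matrix onto its first two principal components and colors each point by cluster:

    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    # Project the high-dimensional document-term matrix onto a plane.
    points = PCA(n_components=2).fit_transform(X.toarray())

    # Color each document (point) by the cluster K-Means assigned it to.
    plt.scatter(points[:, 0], points[:, 1], c=kmeans.labels_)
    plt.title("K-Means clusters in PCA space")
    plt.show()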


Advanced Settings:
Since cluster membership is adjusted at each stage of the process by the re-location of the centroids, the number of iterations and other factors can be adjusted to set a cutoff point for the algorithm or a desired threshold for convergence. As with the initial choice of the number of clusters, there are no hard and fast rules for how these factors should be set. For most users, we strongly recommend the default settings: you need not enter or change any of these values, and Lexos will apply the defaults.

Maximum number of iterations:
As noted above, the K-Means algorithm continues to re-compute centroids for each cluster until all documents settle into their final clusters. It is possible, however, for a document to toggle back and forth between two clusters indefinitely. This setting caps the number of iterations, avoiding an endless run, or at least an unnecessary number of iterations that produce little change.
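In scikit-learn terms (a sketch of the underlying library, not the Lexos interface itself), this cutoff is the max_iter parameter, and the fitted model reports how many iterations it actually needed:

    from sklearn.cluster import KMeans

    # X: the document-term matrix built in the first sketch above.
    # Cap each run at 100 rounds of "recompute centroids, reassign documents".
    kmeans = KMeans(n_clusters=2, max_iter=100, n_init=10, random_state=42).fit(X)
    print(kmeans.n_iter_)  # iterations actually used before convergence or the cap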

Method of Initialization:
The results of applying K-Means to a collection of documents can vary significantly depending on the initial choice of centroids. Lexos offers two choices: K-Means++ and Random. With K-Means++, the default setting in Lexos, the center of the first of the K clusters is chosen at random (typically by picking one of the documents in the starting set as the representative center of a future cluster). The remaining (K-1) cluster centers are then chosen from the remaining documents with probability proportional to their distances from the centers already chosen. Once all initial centroids are chosen, normal K-Means clustering takes place. With Random, the locations of all initial centroids are generated randomly; if you use this option, it is best to experiment multiple times with different random seeds.
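These two choices correspond to scikit-learn's init parameter. The sketch below fits the same (assumed) matrix X with each method and two different random seeds, so you can see how much the final within-cluster variation (inertia) moves:

    from sklearn.cluster import KMeans

    # X: the document-term matrix built in the first sketch above.
    # n_init=1 isolates the effect of a single starting configuration.
    for init in ("k-means++", "random"):
        for seed in (0, 1):
            km = KMeans(n_clusters=2, init=init, n_init=1, random_state=seed).fit(X)
            print(init, seed, km.inertia_)  # lower inertia = tighter clusters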


Number of Iterations with Different Centroids:
Given the sensitivity of the final clusters to the choice of initial centroids, Lexos runs multiple trials, each using different centroid starting locations (seeds), and keeps the best result. (For reference, scikit-learn's default is 10 such trials; its default of 300 refers to the maximum number of iterations within a single trial.)
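In scikit-learn this is the n_init parameter, as in the minimal sketch below:

    from sklearn.cluster import KMeans

    # X: the document-term matrix built in the first sketch above.
    # Run 10 complete trials from different starting centroids; scikit-learn
    # keeps the run with the lowest within-cluster variation.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
    print(kmeans.inertia_)  # inertia of the best of the 10 trials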

Relative Tolerance:
This setting allows an expert user to vary the rate of convergence of the algorithm; most users can safely leave it at the default value.
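For completeness, a sketch of the corresponding scikit-learn parameter: tol sets how small the shift in the centroids between consecutive iterations must be before a run is declared converged.

    from sklearn.cluster import KMeans

    # X: the document-term matrix built in the first sketch above.
    # A larger tolerance stops each run sooner; scikit-learn's default is 1e-4.
    kmeans = KMeans(n_clusters=2, tol=1e-4, n_init=10).fit(X)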



The reliability of K-Means clustering can be evaluated by many statistical procedures. Lexos provides one criterion, the Silhouette Score, which is also used to evaluate the reliability of results following hierarchical clustering. See Cluster Analysis for further discussion.



 
