In the Margins

The K-Means Clustering Tool

The Lexos K-Means Clustering tool partitions your active documents into "flat" clusters in a way that minimizes the variation within the clusters. The "K" in "K-Means" refers to the number of partitions (e.g., K=3 means "make three (3) clusters"). 

Lexos provides three methods of visualizing the results of a K-means cluster analyses. In each case, Lexos first applies PCA (Principal Component Analysis) to reduce the dimensions of the data so it can be viewed in a 2-D or 3-D graph. The default method of visualization, Voronoi, identifies a centroid (central point) in each cluster and draws a trapezoidal polygon around it. This may be helpful in allowing you to see which points fall into which cluster. More traditional visualizations are also available as 2-D and 3-D Scatter Plots.

Generating and Reading a K-Means Cluster Analysis

Enter the Number of Clusters (K), select your visualization method, and then click the Get K-Means Result button to perform a K-means cluster analysis. If you wish, you can modify the default settings using the Advanced K-Means Options menus described below.

K-Means cluster analyses may contain many points that appear very close together, making the graph difficult to read. In order to aid the process, you can hover over the points on the graph to see the associated document names. Select a region on the graph to zoom-in; double-click to return to the starting view. In addition, Lexos provides a table to the left of the graph, numbering the clusters from 1 to K, thereby indicating the cluster number for each of your documents.

The default is one-half of the number of active documents. There is no obvious way to choose the number of clusters. It can be helpful to perform hierarchical clustering before performing K-Means clustering, as the resulting dendrogram may suggest a certain number of clusters that is likely to produce meaningful results. 

Advanced K-Means Options

Since cluster membership is adjusted at each stage of the process by the re-location of the centroids, the number of iterations required and other factors can be adjusted to select a cutoff point for the algorithm or a desired threshold for convergence of different clusters. These adjustments are handled in the Advanced K-Means Options section: Maximum Number of Iterations, Method of Initialization, Number of Iterations with Different Centroids, and Relative Tolerance. As with the initial choice of cluster numbers, there are no hard and fast rules for how these factors should be applied. The default settings should serve most users' purposes. However, here are some brief descriptions of the purposes of each option:

Maximum number of iterations: The K-means algorithm will continue to re-compute centroids for each cluster until all documents settle down into "final" clusters. It is possible that a situation occurs where a document continues to toggle back and forth between two clusters. Setting this value avoids an endless, or at least an unnecessary number of repetitions of the algorithm with little change.

Method of Initialization: Your results of using K-means on a collection of documents can vary significantly depending on the initial choice of centroids. In Lexos the user is offered two choices: K-Means++ and Random. When using the default K-Means++ setting, Lexos chooses the first of the K clusters at random (typically by picking any one of the documents in the starting set as representative of a center of a future cluster). The remaining (K-1) cluster centers are then chosen from the remaining documents by computing a probability proportional to the distances of the centers already chosen. Once all centroids are chosen, normal K-Means clustering takes place. The K-means procedure is very sensitive to the position of the initial seeds, although employing the K-means++ setting can help to constrain this placement.

The Random setting employs a "random seed" approach in which the locations of all centroids at the initial stage are generated randomly. It is best to experiment multiple times with different random seeds.

Number of Iterations with Different Centroids: The number of times the k-means algorithm will be run with different centroid seeds. The final results will be the best output of the last n_init (number of iterations), the default being ten (10).

Relative Tolerance: This indicates the relative tolerance (as a decimal) with respect to inertia to declare convergence. It is unlikely you will need to alter the default value.

Downloading K-Means Graphs

As you pan over a graph in any of the three visualizations, you should see a menu at the top margin of the graph. Clicking on the camera-icon will enable you to download the graph as a .png file.

This page has paths: