The Bootstrap Consensus Tree Tool (Beta)
Note: This tool is still in beta.
Bootstrap Consensus Trees are a type of hierarchical cluster analysis that takes into account the sensitivity of cluster assignments to changes in the data or the parameters used to perform the clustering. With the Lexos Hierarchical Clustering tool, it is easy to observe different results when you scrub or cut your data differently, or when you choose a different distance metric or linkage method. This can lead to uncertainty about how much meaning to attribute to the results of an individual experiment. Bootstrap consensus trees attempt to ascertain the stability of individual cluster assignments by analysing many different variations of a cluster analysis and displaying a "consensus" dendrogram with the assignments that occur most consistently. This dendrogram then represents a reliable indicator of the clusters that are sufficiently marked to recur regardless of minor changes in the data or the clustering parameters. This reliability provides additional confidence that the cluster assignment is meaningful, rather than an artefact of the statistical procedure.
The Bootstrap Consensus Tree tool works by selecting random samples (portions) of your active documents and performing a hierarchical cluster analysis of them. In each iteration, 80% of the tokens in your data are sampled. The algorithm then performs another iteration with a different random sample and calculates the consensus. This procedure is repeated over many more iterations (you can designate how many), and the result is a dendrogram showing the consensus cluster analysis. This dendrogram may be compared to the results of experiments produced by the Lexos Hierarchical Clustering tool.
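The resampling idea described above can be illustrated with a small, simplified sketch. It is not the Lexos implementation: the toy token counts, the document names, and the helper functions are all hypothetical, and instead of building a full dendrogram it only records which pair of documents would be merged first (the closest pair by Euclidean distance) in each iteration.

```python
import math
import random
from collections import Counter
from itertools import combinations

# Hypothetical toy data: token counts per document (one column per token).
counts = {
    "doc_a": [5, 0, 3, 1, 0, 2, 4, 0, 1, 3],
    "doc_b": [4, 1, 3, 0, 0, 2, 5, 0, 1, 2],
    "doc_c": [0, 6, 0, 4, 5, 0, 0, 3, 4, 0],
}
N_TOKENS = 10
FRAC = 0.8        # share of tokens sampled in each iteration
ITERATIONS = 100  # number of bootstrap iterations


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def closest_pair(columns):
    """Return the pair of documents that are nearest on the sampled columns.

    This is the pair an agglomerative clustering would merge first.
    """
    def dist(pair):
        a, b = pair
        return euclidean(
            [counts[a][i] for i in columns],
            [counts[b][i] for i in columns],
        )
    return min(combinations(sorted(counts), 2), key=dist)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
tally = Counter()
for _ in range(ITERATIONS):
    # Sample 80% of the token columns for this iteration.
    columns = rng.sample(range(N_TOKENS), round(FRAC * N_TOKENS))
    tally[closest_pair(columns)] += 1

# Share of iterations in which each pair was merged first.
support = {pair: n / ITERATIONS for pair, n in tally.items()}
```

A pair that is merged first in nearly every iteration is stable under resampling. A pair whose support is spread across several alternatives is the kind of assignment a consensus tree is meant to flag as uncertain.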
Bootstrap Consensus Tree Options
- Distance Metric: The method for calculating the distance between each pair of vectors.
- Linkage Method: The method for determining how clusters should be linked.
- Majority rule consensus cutoff: The percentage of times a document must appear in a clade to be considered in the consensus calculation. The default is 50%.
- Number of bootstrap iterations: The number of times you wish the algorithm to perform sampling and clustering before producing the consensus clusters. The default is 100 iterations.
- Sample each iteration without replacement: By default, each data sample is removed from the original data set before the next iteration. If you wish to replace the data, select "with" replacement from the dropdown menu. Information about choosing the best option is given below.
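To make the majority rule cutoff more concrete, here is a minimal sketch of how clade frequencies could be counted across iterations. The function name `majority_rule_clades` and the sample trees are hypothetical, and the sketch assumes a clade must appear in strictly more than the cutoff share of trees; Lexos may treat the boundary differently.

```python
from collections import Counter


def majority_rule_clades(trees, cutoff=0.5):
    """Return clades that appear in more than `cutoff` of the trees.

    Each tree is a collection of clades, and each clade is a frozenset
    of document names. The result maps each kept clade to its frequency.
    """
    # Count each clade at most once per tree.
    clade_counts = Counter(clade for tree in trees for clade in set(tree))
    return {
        clade: n / len(trees)
        for clade, n in clade_counts.items()
        if n / len(trees) > cutoff  # assumption: strict majority
    }


# Hypothetical clades found in four bootstrap iterations.
trees = [
    [frozenset({"a", "b"}), frozenset({"a", "b", "c"})],
    [frozenset({"a", "b"}), frozenset({"c", "d"})],
    [frozenset({"a", "c"}), frozenset({"a", "b", "c"})],
    [frozenset({"a", "b"}), frozenset({"c", "d"})],
]
consensus = majority_rule_clades(trees, cutoff=0.5)
```

In this example only the clade containing "a" and "b" survives, since it appears in three of the four trees. Raising the cutoff makes the consensus stricter, and increasing the number of iterations gives the frequencies more samples to settle on.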
Best Practices
In our individual experiments with hierarchical clustering, we have found the Euclidean distance metric and average linkage to produce the most consistently meaningful clusterings. Therefore, these are the default choices. For further information about choosing a distance metric and linkage method, see the article on Hierarchical Clustering.
We have not established best practices for the majority rule cutoff or the number of iterations. The choice to perform sampling with or without replacement is likely to relate to the size of your data set. For an explanation of how replacement works, see Mary Parker, "Sampling With Replacement and Sampling Without Replacement".
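As a quick illustration of the general statistical distinction, the sketch below draws a sample of the same size both ways using Python's standard `random` module. The token list is hypothetical, and the example shows only the difference within a single draw, not how Lexos applies replacement across iterations.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable
tokens = ["the", "cat", "sat", "on", "a", "mat", "by", "my", "red", "door"]
k = round(0.8 * len(tokens))  # sample size: 80% of the tokens

# Without replacement: each token can be drawn at most once.
without_replacement = rng.sample(tokens, k)

# With replacement: a token goes back into the pool after each draw,
# so the same token may be drawn more than once.
with_replacement = rng.choices(tokens, k=k)
```

With a small data set, sampling without replacement uses up the available tokens quickly, whereas sampling with replacement always draws from the full pool at the cost of possible repeats.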
Finally, the choice to sample 80% of tokens in each iteration is somewhat arbitrary. It is designed to make samples sufficiently large and distinct from each other for effective clustering without removing too much of the data available for subsequent iterations in experiments run without replacement. Currently, this percentage cannot be configured directly in the Lexos application. If you install Lexos on your local machine, you can open the file lexos/models/bct_model.py in a text editor and change frac=0.8 in the _get_bootstrap_trees() function to some other decimal (for example, 0.7 to sample 70% of tokens).