Establishing Robust Clusters
One of the most vexing questions in the use of cluster analysis for computational stylistics is how to distinguish "good" clusters from clusters that are mere "noise", whether generated by our data or by our choice of implementation. Ideally, we want to generate "robust" clusters, by which we mean that they stand up to some measure of scrutiny. We can define this in many ways. If we cut several documents into segments and the individual segments of each document are clustered together in opposition to segments of other documents, we can assume that the clustering process has captured something meaningful, if only the distinctiveness of our original documents. When less predictable effects occur—say one segment clusters with the "wrong" document—we have to conclude either that there is something sub-optimal about our clustering procedure or that we have found something really interesting. Thus our intuitive sense of "surprise" at our results may be a measure of a weak clustering, but this "surprise" is also the goal of our analysis—within reason. Below we discuss some methods for striking that balance when interpreting unexpected clusterings, and we examine how we can be relatively sure that our clusters—and thus the conclusions we base on them—are robust.
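The segment-based intuition above can be sketched in code. The following is a minimal pure-Python illustration (not Lexos's actual implementation, and the two "documents" are invented toy data): cut two texts into segments, represent each segment by its relative word frequencies, and check whether each segment's nearest neighbour by cosine distance comes from the same document.

```python
# Toy sketch: do segments of the same document cluster together?
# Hypothetical data and helper names; not drawn from Lexos itself.
from collections import Counter
from math import sqrt

def segments(text, n=2):
    """Split a text into n roughly equal word-count segments."""
    words = text.split()
    size = max(1, len(words) // n)
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)][:n]

def rel_freqs(seg):
    """Relative word frequencies of one segment."""
    counts = Counter(seg.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def cosine_dist(a, b):
    """Cosine distance between two frequency dictionaries."""
    vocab = set(a) | set(b)
    dot = sum(a.get(w, 0) * b.get(w, 0) for w in vocab)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return 1 - dot / (na * nb)

# Two stylistically distinct toy "documents".
doc_a = "the sword the shield the hall the king " * 20
doc_b = "ship sea wave sail ship sea storm sail " * 20

labeled = [("A", s) for s in segments(doc_a)] + [("B", s) for s in segments(doc_b)]
vecs = [(lab, rel_freqs(s)) for lab, s in labeled]

# A crude robustness check: every segment's nearest neighbour
# should belong to the same source document.
robust = True
for i, (lab_i, v_i) in enumerate(vecs):
    nearest = min((j for j in range(len(vecs)) if j != i),
                  key=lambda j: cosine_dist(v_i, vecs[j][1]))
    if vecs[nearest][0] != lab_i:
        robust = False

print("segments cluster by document:", robust)
```

When a segment's nearest neighbour instead belongs to the other document, that is the "surprise" discussed above: either a weakness in the procedure or a genuinely interesting finding.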
The Holy Grail for some would be a statistical measure with which to assess the "validity" of our clusters. A number of such measures exist, but their usefulness for a wide variety of data, and for the types of questions humanists typically ask of their data, is an open question. At present (2018), we use Lexos to prepare texts and then move to Eder's bootstrap consensus tree (BCT) tool in the Stylo package for R.
We recommend that you integrate non-statistical approaches into your workflow. Running a number of cluster analyses with slightly different settings and seeing how well the clusters hold up to these "tweaks" is probably the most reliable way to establish confidence in them. Drout et al. have outlined a variety of such procedures in Beowulf Unlocked: New Evidence from Lexomic Analysis (2016).
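One way to make the "tweak" strategy concrete is to cluster the same data under two different settings and measure how much the resulting partitions agree. The sketch below (hypothetical data and helper names, not Lexos or Stylo code) runs a minimal single-linkage clustering under two distance metrics and compares the partitions with the Rand index, the fraction of item pairs on which the two partitions make the same together/apart decision:

```python
# Sketch of a stability check: cluster the same frequency profiles
# under two metrics and measure partition agreement (Rand index).
from itertools import combinations
from math import sqrt

def euclidean(a, b):
    keys = set(a) | set(b)
    return sqrt(sum((a.get(k, 0) - b.get(k, 0)) ** 2 for k in keys))

def manhattan(a, b):
    keys = set(a) | set(b)
    return sum(abs(a.get(k, 0) - b.get(k, 0)) for k in keys)

def single_linkage(items, dist, k=2):
    """Merge the closest clusters until k remain; return a label per item."""
    clusters = [[i] for i in range(len(items))]
    while len(clusters) > k:
        x, y = min(combinations(range(len(clusters)), 2),
                   key=lambda p: min(dist(items[i], items[j])
                                     for i in clusters[p[0]]
                                     for j in clusters[p[1]]))
        clusters[x] += clusters[y]
        del clusters[y]
    labels = [0] * len(items)
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels

def rand_index(l1, l2):
    """Fraction of item pairs on which two partitions agree."""
    pairs = list(combinations(range(len(l1)), 2))
    agree = sum((l1[i] == l1[j]) == (l2[i] == l2[j]) for i, j in pairs)
    return agree / len(pairs)

# Toy relative-frequency profiles (invented for illustration).
profiles = [
    {"the": 0.5, "king": 0.3, "hall": 0.2},
    {"the": 0.48, "king": 0.32, "hall": 0.2},
    {"sea": 0.5, "ship": 0.3, "sail": 0.2},
    {"sea": 0.52, "ship": 0.28, "sail": 0.2},
]

labels_a = single_linkage(profiles, euclidean)
labels_b = single_linkage(profiles, manhattan)
print("partition agreement:", rand_index(labels_a, labels_b))
```

An agreement near 1.0 across several such tweaks (different metrics, linkage criteria, segment sizes, culling settings) suggests the clusters are robust; partitions that reshuffle under small changes deserve more suspicion.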