
Similarity Query

Similarity Query, as implemented here, is a good first test for ranking the "closeness" between a single document and all other documents in your active set. We often apply it early in exploratory analyses of problems in new domains and/or languages. Note: it is easy to use, and its results are easy to over-interpret.

In Similarity Query, Lexos compares one file to all other files; the user selects the target file and the tokenization options. More specifically, Lexos calls sklearn.metrics.pairwise.cosine_similarity to compute similarity as the normalized dot product of two vectors of counts from the Document Term Matrix (DTM). For example, between documents X and Y:

CosineSimilarity(X, Y) = <X, Y> / (||X|| * ||Y||)

with the distance between documents being defined as:

distance = 1 - CosineSimilarity
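
The two formulas above can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not the code Lexos runs (Lexos delegates to scikit-learn). The function names, the sample vectors, and the choice to return 0 for an all-zero vector are assumptions made for this example.

```python
import math


def cosine_similarity(x, y):
    """Normalized dot product of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        # Assumption for this sketch: an empty document (all-zero
        # vector) is treated as having similarity 0 to everything.
        return 0.0
    return dot / (norm_x * norm_y)


def cosine_distance(x, y):
    """distance = 1 - CosineSimilarity"""
    return 1.0 - cosine_similarity(x, y)


# Hypothetical rows of a Document Term Matrix (one count per term).
doc_x = [2, 1, 0, 3]
doc_y = [4, 2, 0, 6]   # same proportions as doc_x, twice as long
doc_z = [0, 0, 5, 0]   # shares no terms with doc_x

print(cosine_similarity(doc_x, doc_y), cosine_distance(doc_x, doc_y))
print(cosine_similarity(doc_x, doc_z), cosine_distance(doc_x, doc_z))
```

Because the dot product is divided by both vector lengths, a document and a copy of it that is twice as long point in the same direction and are treated as essentially identical, while documents with no terms in common get a similarity of 0 and a distance of 1.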

Because Lexos builds its vectors from token counts ("counts" as defined by the chosen method of tokenization and normalization), every vector component is non-negative (the vectors lie in the first quadrant), and thus the angle between any two vectors is in [0, pi/2]. Said differently, the value of Cosine Similarity will always be in the range [0, 1], where a higher value indicates that the two document vectors are closer to each other; values approaching zero indicate that the two vectors are further apart.
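
To make the idea of ranking one document against all the others concrete, here is a minimal, self-contained sketch. It assumes a naive lowercase, whitespace-based tokenizer and raw counts; Lexos' own tokenization and normalization options are richer, and the file names, texts, and helper functions below are hypothetical.

```python
import math
from collections import Counter


def count_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def cosine_similarity(x, y):
    """Normalized dot product of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    # Assumption for this sketch: all-zero vectors get similarity 0.
    return dot / norm if norm else 0.0


def rank_by_similarity(target_name, documents):
    """Return (name, similarity) pairs for all other documents, closest first."""
    vocabulary = sorted(
        {token for text in documents.values() for token in text.lower().split()}
    )
    vectors = {
        name: count_vector(text, vocabulary) for name, text in documents.items()
    }
    target = vectors[target_name]
    scores = [
        (name, cosine_similarity(target, vector))
        for name, vector in vectors.items()
        if name != target_name
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical active set of documents.
documents = {
    "a.txt": "the cat sat on the mat",
    "b.txt": "the cat sat on the hat",
    "c.txt": "dogs chase birds",
}

ranking = rank_by_similarity("a.txt", documents)
for name, score in ranking:
    print(f"{name}\t{score:.3f}")
```

In this toy set, b.txt differs from the target by a single word and ranks first with a score close to 1, while c.txt shares no tokens with the target and scores 0. Since every count is non-negative, no score can fall below 0.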

Tutorial:

After using the tools to scrub and cut your files, you can use Similarity Query to compare them.

1. First, select the one file that you want to compare with all the others. By default, Lexos selects the first file in the list.