In the Margins

Similarity Query

Similarity Query, as implemented here, is a good first test for ranking the "closeness" between a single document and all other documents in your active set. We often apply it early in exploratory analyses of problems in new domains and/or languages. Note: It is easy to use, and its results are easy to over-interpret.

In Similarity Query, Lexos compares one file to all other files; the user selects the target file name and the tokenization options. More specifically, Lexos calls sklearn.metrics.pairwise.cosine_similarity to compute similarity as the normalized dot product of two vectors of counts from the Document Term Matrix (DTM), for example between documents X and Y:

>    CosineSimilarity(X, Y) = <X, Y> / (||X||*||Y||)

with the distance between documents being defined as:

    distance = 1 - CosineSimilarity
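
As a minimal sketch of the calculation above (not Lexos' actual code), the following calls sklearn.metrics.pairwise.cosine_similarity on a small, made-up DTM and ranks the other documents by distance from the target. The document labels and counts are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy DTM: one row per document, one column per term count.
labels = ["target", "doc_a", "doc_b", "doc_c"]
dtm = np.array([
    [2, 1, 0, 3],  # target
    [4, 2, 0, 6],  # doc_a: same term proportions as the target
    [0, 0, 5, 0],  # doc_b: shares no terms with the target
    [1, 1, 1, 1],  # doc_c: partial overlap with the target
])

# Similarity of the target (row 0) to every other row, then the distance.
similarities = cosine_similarity(dtm[:1], dtm[1:])[0]
distances = 1 - similarities

# Rank the other documents from closest to furthest.
ranking = sorted(zip(labels[1:], distances), key=lambda pair: pair[1])
for name, dist in ranking:
    print(f"{name}: distance = {dist:.4f}")
```

Note that doc_a is twice as long as the target but has the same proportions of terms, so it ranks as the closest document: cosine similarity compares the direction of the count vectors, not their length.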

Because Lexos builds each document vector from token counts ("counts" as defined by the chosen method of tokenization and normalization), every component of a vector is non-negative (the vectors lie in the first quadrant), and thus the angle between any two vectors is in [0,pi/2]. Said differently, the value for Cosine Similarity will always be in the range [0,1], where a higher value indicates that the two document vectors are closer to each other; values approaching zero indicate that the two vectors are further apart.
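
To make the end points of that range concrete, here is a small worked example in plain Python. The helper function and the three-term count vectors are hypothetical, written only to illustrate the formula; the sketch assumes neither vector is all zeros (an empty document would cause a division by zero).

```python
import math

def cosine_similarity(x, y):
    """Normalized dot product of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Same term proportions: the vectors point the same way, similarity is ~1.
same_direction = cosine_similarity([1, 2, 0], [2, 4, 0])

# No shared terms: the vectors are perpendicular, similarity is 0.
no_overlap = cosine_similarity([3, 0, 0], [0, 0, 7])

# Partial overlap: similarity falls strictly between 0 and 1.
partial = cosine_similarity([1, 1, 0], [1, 0, 0])

# Recover the angle for the partial case; clamp to guard against rounding.
angle_degrees = math.degrees(math.acos(min(1.0, max(0.0, partial))))

print(same_direction, no_overlap, partial, angle_degrees)
```

With non-negative counts the dot product can never be negative, which is why no pair of documents can score below zero.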

Tutorial:
 
After using the tools to scrub and cut your files, you can use Similarity Query to compare them.
 
1. First, select the one file that you want to compare with all the others. By default, Lexos selects the first file in the list.