Similarity Query
In Similarity Query, Lexos compares one file against all other files; the user selects the target file and the tokenization options. More specifically, Lexos calls sklearn.metrics.pairwise.cosine_similarity to compute similarity as the normalized dot product of two vectors of counts from the Document Term Matrix (DTM). For example, for documents X and Y:
CosineSimilarity(X, Y) = <X, Y> / (||X|| * ||Y||)
with the distance between documents being defined as:
distance = 1 - CosineSimilarity
Because Lexos builds each document vector from token counts ("counts" being defined by the method of tokenization and normalization), every component is non-negative (the vectors lie in the first quadrant) and thus the angle between any two vectors lies in [0, pi/2]. Said differently, the value of Cosine Similarity will always be in the range [0, 1], where a higher value indicates that the two document vectors are closer to each other; values approaching zero indicate that the two vectors are further apart.
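The formulas above can be illustrated with a minimal sketch in plain Python. It is not the Lexos implementation (Lexos calls sklearn.metrics.pairwise.cosine_similarity); the function names, the file names, and the toy count vectors are hypothetical, and returning 0.0 for an all-zero vector is an assumption made only for this example.

```python
import math


def cosine_similarity(x, y):
    """Normalized dot product <X, Y> / (||X|| * ||Y||) of two count vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        # Assumption for this sketch: an empty document shares nothing.
        return 0.0
    return dot / (norm_x * norm_y)


def rank_by_distance(dtm, target):
    """Compare one target document to all others, closest first.

    dtm maps a (hypothetical) file name to its row of token counts.
    Returns a list of (file name, distance) pairs, where
    distance = 1 - CosineSimilarity.
    """
    target_counts = dtm[target]
    results = [
        (name, 1 - cosine_similarity(target_counts, counts))
        for name, counts in dtm.items()
        if name != target
    ]
    return sorted(results, key=lambda pair: pair[1])


# Toy Document Term Matrix: one row of counts per document (made-up data).
dtm = {
    "a.txt": [2, 1, 0, 1],
    "b.txt": [4, 2, 0, 2],  # same proportions as a.txt
    "c.txt": [0, 0, 3, 0],  # shares no terms with a.txt
    "d.txt": [1, 0, 1, 1],
}

for name, distance in rank_by_distance(dtm, "a.txt"):
    print(f"{name}: distance {distance:.4f}")
```

With these toy counts, b.txt ranks closest to a.txt because its counts are proportional to those of a.txt, and c.txt ranks furthest because the two documents share no terms. This also shows that cosine similarity depends on the direction of the count vector, not on document length.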
Tutorial:
After using the tools to scrub and cut files, users can use Similarity Query to compare them.
1. First, select the one file that you want to compare with all the others. By default, Lexos selects the first file in the list.