Similarity Query
Similarity Query compares one file to all other files; the user selects the target file name and the tokenization options. More specifically, Lexos calls scikit-learn's cosine similarity utility to compute the similarity between the target document and each of the other documents. Because Lexos builds vectors of token counts ("counts" being defined by the chosen method of tokenization and normalization), no vector has a negative component (all vectors lie in the first quadrant), and thus the angle between any two vectors is in the range [0, pi/2]. Said differently, the value for Cosine Similarity (the cosine of the angle between the two vectors) will always be in the range [0, 1], where a higher value indicates that the two document vectors point in more nearly the same direction; values approaching zero indicate that the two vectors are further apart. Note that this metric measures the orientation of the vectors, not their magnitude.
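The one-against-all comparison described above can be sketched in a few lines of Python. This is only an illustration, not the Lexos implementation: it uses the standard library instead of scikit-learn, it assumes a simple lowercase, whitespace-based tokenization, and the file names, contents, and the function names count_vector and similarity_query are all hypothetical.

```python
import math
from collections import Counter


def count_vector(text):
    """Count tokens, assuming a simplified lowercase/whitespace tokenization."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (Counters)."""
    # A Counter returns 0 for tokens it has not seen, so only shared
    # tokens contribute to the dot product.
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch: empty documents score 0
    return dot / (norm_a * norm_b)


def similarity_query(target_name, documents):
    """Rank every other document by its cosine similarity to the target."""
    target = count_vector(documents[target_name])
    scores = [
        (name, cosine_similarity(target, count_vector(text)))
        for name, text in documents.items()
        if name != target_name
    ]
    # Most similar document first.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical file names and contents, for illustration only.
documents = {
    "a.txt": "the cat sat on the mat",
    "b.txt": "the cat sat on the hat",
    "c.txt": "stock prices fell sharply today",
}
ranking = similarity_query("a.txt", documents)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Here "b.txt" shares most of its tokens with the target and ranks first, while "c.txt" shares none and receives a similarity of zero.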
Ok, if you've read this far, the cosine similarity is the normalized dot product of two vectors of counts from the Document Term Matrix (DTM), for example between documents X and Y:
CosineSimilarity(X, Y) = <X, Y> / (||X|| * ||Y||)
The distance between documents is then defined as:
distance = 1 - CosineSimilarity
that is, 0 <= distance <= 1, where 0 corresponds to parallel vectors (LIKE) and 1 corresponds to orthogonal vectors (UNLIKE).
Thus, two documents with "similar" counts of two words, <1,0> and <2,0>, would have a distance of 0 (they are LIKE), whereas two documents with completely unlike word counts, <0,1> and <2,0>, would have a cosine similarity of 0 (the cosine of the 90-degree angle between two orthogonal vectors) and thus a distance of 1.
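The two-word examples above can be checked with a small Python sketch of the formula. It is a plain standard-library rendering of the definition rather than the scikit-learn call that Lexos uses, and the function names are chosen here for illustration.

```python
import math


def cosine_similarity(x, y):
    """Normalized dot product of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    # Assumes neither vector is all zeros; a zero vector would
    # cause a division by zero here.
    return dot / (norm_x * norm_y)


def cosine_distance(x, y):
    """distance = 1 - CosineSimilarity"""
    return 1.0 - cosine_similarity(x, y)


# Parallel vectors: same orientation, different magnitude.
like = cosine_distance([1, 0], [2, 0])

# Orthogonal vectors: no words in common.
unlike = cosine_distance([0, 1], [2, 0])

# Vectors 45 degrees apart, so the distance is 1 - cos(pi/4).
between = cosine_distance([1, 1], [2, 0])

print(like, unlike, between)
```

The first pair differs only in magnitude and yields a distance of 0, which is what it means for the metric to measure orientation only; the second pair is orthogonal and yields a distance of 1; the third falls in between.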
Tutorial:
TBD
More reading:
Cosine Similarity: Wikipedia
Perone, Christian S. (2013). Machine Learning::Cosine Similarity for Vector Space Models (Part III). Terra Incognita.
Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.