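* It has been a long time since I wrote the TF-IDF tutorial (Part I and Part II), and as I promised, here is the continuation of the tutorial. Unfortunately I had no time to fix the previous tutorials for the newer versions of the scikit-learn (sklearn) package nor to answer all the questions, but I hope to do that in the near future.
So, in the previous tutorials we learned how a document can be modeled in the Vector Space, how the TF-IDF transformation works and how the TF-IDF is calculated. What we are going to learn now is how to use a well-known similarity measure (Cosine Similarity) to calculate the similarity between different documents.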
The Dot Product
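Let's begin with the definition of the dot product for two vectors, $\vec{a} = (a_1, a_2, a_3, \ldots)$ and $\vec{b} = (b_1, b_2, b_3, \ldots)$, where $a_n$ and $b_n$ are the components of the vectors (features of the document, or TF-IDF values for each word of the document in our example) and $n$ is the dimension of the vectors:
$$\vec{a} \cdot \vec{b} = \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n$$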
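As you can see, the dot product is simply the multiplication of each pair of matching components from the two vectors, with the products added together. For two vectors with two dimensions each (2D), the sum has just two terms:
$$\vec{a} \cdot \vec{b} = a_1 b_1 + a_2 b_2$$
The first thing you probably noticed is that the result of a dot product between two vectors isn't another vector but a single value, a scalar.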
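The definition above can be sketched in a few lines of plain Python. The helper name `dot_product` and the example vectors are illustrative assumptions of mine, not values taken from the tutorial:

```python
def dot_product(a, b):
    """Sum of the component-wise products of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


# Hypothetical 2D vectors, chosen only for illustration.
a = (1.0, 2.0)
b = (3.0, 4.0)

result = dot_product(a, b)  # 1*3 + 2*4
print(result)  # 11.0
```

Note that `result` is a single number (a scalar), not a vector.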
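This is all very simple and easy to understand, but what is a dot product? What is the intuitive idea behind it? What does it mean to have a dot product of zero? To understand it, we need to look at the geometric definition of the dot product:
$$\vec{a} \cdot \vec{b} = \|\vec{a}\| \|\vec{b}\| \cos\theta$$
Rearranging the equation using the commutative property to understand it better:
$$\vec{a} \cdot \vec{b} = \|\vec{b}\| \|\vec{a}\| \cos\theta$$
So, what is the term $\|\vec{a}\| \cos\theta$? This term is the projection of the vector $\vec{a}$ into the vector $\vec{b}$, as shown on the image below: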

Now, what happens when the vector $\vec{a}$ is orthogonal (with an angle of 90 degrees) to the vector $\vec{b}$, like on the image below?

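There will be no adjacent side on the triangle: its length will be zero, so the term $\|\vec{a}\| \cos\theta$ will be zero and the resulting multiplication with the magnitude of the vector $\vec{b}$ will also be zero. Now you know that when the dot product between two different (non-zero) vectors is zero, they are orthogonal to each other (they have an angle of 90 degrees), which is a very neat way to check the orthogonality of different vectors. It is also important to note that we are using 2D examples, but the most amazing fact about it is that we can also calculate angles and similarity between vectors in higher dimensional spaces. That is why math lets us see far beyond the obvious, even when we can't visualize or imagine what the angle between two vectors with twelve dimensions is, for instance.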
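As a minimal sketch of the two ideas above (the projection term and the orthogonality check), the following plain-Python helpers work for vectors of any dimension. The function names, the tolerance, and the example vectors (including the twelve-dimensional ones) are assumptions made for illustration:

```python
import math


def dot_product(a, b):
    """Sum of the component-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Magnitude (Euclidean length) of a vector."""
    return math.sqrt(dot_product(a, a))


def scalar_projection(a, b):
    """Length of the projection of a onto b, i.e. ||a|| * cos(theta)."""
    return dot_product(a, b) / norm(b)


def is_orthogonal(a, b, tol=1e-12):
    """True when the dot product is zero (within a small tolerance)."""
    return abs(dot_product(a, b)) <= tol


# Hypothetical 2D vectors lying on perpendicular axes.
u = (0.0, 3.0)
v = (4.0, 0.0)

# Hypothetical 12-dimensional vectors with no non-zero component in common.
p = (1.0,) * 6 + (0.0,) * 6
q = (0.0,) * 6 + (2.0,) * 6

print(is_orthogonal(u, v))  # True
print(is_orthogonal(p, q))  # True
print(scalar_projection((3.0, 4.0), (1.0, 0.0)))  # 3.0
```

The same check works in twelve dimensions exactly as it does in two, even though we cannot draw the vectors.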
The Cosine Similarity
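The cosine similarity between two vectors (or two documents on the Vector Space) is a measure that calculates the cosine of the angle between them. This metric is a measurement of orientation and not magnitude. It can be seen as a comparison between documents on a normalized space, because we're not taking into consideration only the magnitude of each word count (tf-idf) of each document, but the angle between the documents. What we have to do to build the cosine similarity equation is to solve the equation of the dot product for $\cos\theta$:
$$\vec{a} \cdot \vec{b} = \|\vec{a}\| \|\vec{b}\| \cos\theta$$
$$\cos\theta = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \|\vec{b}\|}$$
And that is it, this is the cosine similarity formula. Cosine Similarity will generate a metric that says how related two documents are by looking at the angle instead of the magnitude, like in the examples below: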

Note that even if we had a vector pointing to a point far from another vector, they could still have a small angle between them, and that is the central point of using Cosine Similarity: the measurement tends to ignore the higher term counts on documents. Suppose we have a document with the word "sky" appearing 200 times and another document with the word "sky" appearing 50 times. The Euclidean distance between them will be higher, but the angle will still be small because they are pointing in the same direction, which is what matters when we are comparing documents.
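The "sky" example can be sketched directly from the cosine similarity formula. This is only an illustration: the helper names, the two-word vocabulary ("sky", "blue") and all the term counts below are assumptions, not data from the tutorial:

```python
import math


def cosine_sim(a, b):
    """Cosine of the angle between a and b: (a . b) / (||a|| * ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between the end points of a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical term counts over the vocabulary ("sky", "blue").
doc_long = (200.0, 40.0)   # "sky" appears 200 times
doc_short = (50.0, 10.0)   # same proportions, four times shorter
doc_other = (10.0, 50.0)   # mostly about "blue"

# Same direction, very different magnitude.
print(round(cosine_sim(doc_long, doc_short), 6))
print(round(euclidean_distance(doc_long, doc_short), 2))

# Closer in Euclidean terms, but pointing in a different direction.
print(round(cosine_sim(doc_short, doc_other), 6))
print(round(euclidean_distance(doc_short, doc_other), 2))
```

Under these assumed counts, the long and the short document are far apart by Euclidean distance yet have a cosine similarity of essentially 1, while the short document is closer to `doc_other` in distance but much less similar in direction.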
Now that we have a Vector Space Model of documents (like on the image below) modeled as vectors (with TF-IDF counts) and also have a formula to calculate the similarity between different documents in this space, let's see how we do it in practice using scikit-learn (sklearn).

Practice Using Scikit-learn (sklearn)
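* In this tutorial I'm using Python 2.7.5 and scikit-learn 0.14.1.
The first thing we need to do is to define our set of example documents:
    documents = (
        "The sky is blue",
        "The sun is bright",
        "The sun in the sky is bright",
        "We can see the shining sun, the bright sun",
    )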
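Then we instantiate the sklearn TF-IDF vectorizer and transform our documents into the TF-IDF matrix:
    from sklearn.feature_extraction.text import TfidfVectorizer

    tfidf_vectorizer = TfidfVectorizer()
    # Learn the vocabulary and build the (documents x terms) TF-IDF matrix.
    tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
    print(tfidf_matrix.shape)
    # Output: (4, 11)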
Now that we have the TF-IDF matrix (tfidf_matrix), with one row for each document (4 rows) and 11 tf-idf terms (the number of columns of the matrix), we can calculate the Cosine Similarity between the first document ("The sky is blue") and each of the documents of the set:
    from sklearn.metrics.pairwise import cosine_similarity

    # Compare the first row (document) against every row of the matrix.
    cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)
    # Output: array([[ 1.        ,  0.36651513,  0.52305744,  0.13448867]])
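The tfidf_matrix[0:1] is the SciPy operation to get the first row of the sparse matrix, and the resulting array is the Cosine Similarity between the first document and all documents in the set. Note that the first value of the array is 1.0 because it is the Cosine Similarity between the first document and itself. Also note that, due to the presence of similar words on the third document ("The sun in the sky is bright"), it achieved a better score.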
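As a hedged extension of the step above, the sketch below computes the similarity between every pair of documents at once. It also relies on an assumption worth stating: with its default settings, `TfidfVectorizer` applies L2 normalization to each row, so a plain dot product (`linear_kernel`) should give the same numbers as `cosine_similarity`. The variable names `all_pairs` and `first_row_dot` are mine, and the sketch is written for a recent scikit-learn rather than the 0.14.1 release used in this tutorial:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

documents = (
    "The sky is blue",
    "The sun is bright",
    "The sun in the sky is bright",
    "We can see the shining sun, the bright sun",
)

tfidf_matrix = TfidfVectorizer().fit_transform(documents)

# Passing a single matrix compares every document with every other one,
# giving a symmetric 4x4 matrix with ones on the diagonal.
all_pairs = cosine_similarity(tfidf_matrix)

# Assumption: rows are already unit length (default norm="l2"), so the
# dot product alone equals the cosine similarity.
first_row_dot = linear_kernel(tfidf_matrix[0:1], tfidf_matrix)

print(all_pairs.shape)
print(all_pairs[0])
print(first_row_dot[0])
```

The first row of `all_pairs` corresponds to the array shown earlier for the first document.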
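If you want, you can also solve the Cosine Similarity for the angle between the vectors:
$$\cos\theta = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \|\vec{b}\|}$$
We only need to isolate the angle ($\theta$) and move the $\cos$ to the right-hand side of the equation:
$$\theta = \arccos\left(\frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \|\vec{b}\|}\right)$$
The $\arccos$ is the same as the inverse of the cosine ($\cos^{-1}$).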
    import math

    # This was already calculated in the previous step, so we just use the value.
    cos_sim = 0.52305744
    angle_in_radians = math.acos(cos_sim)
    print(math.degrees(angle_in_radians))
    # Output: 58.462437107432784
And that angle of ~58.5 degrees is the angle between the first and the third document of our document set.
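The same conversion can be applied to every similarity of the first document in one go. The sketch below reuses the similarity values reported in this tutorial. The helper `angle_in_degrees` is a name I made up, and the clamping step is an added precaution (an assumption about floating-point rounding, which can push a computed cosine slightly outside the range that `math.acos` accepts):

```python
import math

# Similarities between the first document and each document of the set,
# as reported earlier in the tutorial.
similarities = [1.0, 0.36651513, 0.52305744, 0.13448867]


def angle_in_degrees(cos_sim):
    """Convert a cosine similarity into an angle, guarding against rounding."""
    clamped = max(-1.0, min(1.0, cos_sim))
    return math.degrees(math.acos(clamped))


angles = [angle_in_degrees(s) for s in similarities]
for index, angle in enumerate(angles, start=1):
    print("document 1 vs document %d: %.1f degrees" % (index, angle))
```

A similarity of 1.0 maps to an angle of 0 degrees, and lower similarities map to angles closer to 90 degrees.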
Related Material
A video about Dot Product on The Khan Academy
scikit-learn (sklearn) - The de facto Machine Learning package for Python