* It has been a long time since I wrote the TF-IDF tutorial (Part I and Part II) and, as I promised, here is the continuation of the tutorial. Unfortunately, I had no time to fix the previous tutorials for the newer versions of the scikit-learn (sklearn) package nor to answer all the questions, but I hope to do that in the near future.
So, in the previous tutorials we learned how a document can be modeled in the Vector Space, how the TF-IDF transformation works and how the TF-IDF is calculated. Now we are going to learn how to use a well-known similarity measure (Cosine Similarity) to calculate the similarity between different documents.
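The Dot Product
Let's begin with the definition of the dot product for two vectors $\vec{a} = (a_1, a_2, \ldots, a_n)$ and $\vec{b} = (b_1, b_2, \ldots, b_n)$, where $a_i$ and $b_i$ are the components of the vectors (the features of the document, or in our example the TF-IDF values for each word of the document) and $n$ is the dimension of the vectors:
$$\vec{a} \cdot \vec{b} = \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n$$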
As you can see, the dot product is simply the sum of the products of the corresponding components of both vectors. See an example of a dot product for two vectors with 2 dimensions each (2D):
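As a minimal sketch of this definition in Python (the function name `dot` and the vector values are illustrative assumptions, not taken from the original example), the dot product of two 2D vectors can be computed like this:

```python
def dot(a, b):
    """Dot product of two equal-length sequences of numbers."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    # Multiply component by component, then add everything together
    return sum(x * y for x, y in zip(a, b))


# Illustrative 2D vectors (values assumed for this sketch)
a = (1, 2)
b = (3, 4)
print(dot(a, b))  # 1*3 + 2*4 = 11
```

Note that the function returns a plain number, whatever the dimension of the input vectors.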
The first thing you probably noticed is that the result of a dot product between two vectors isn't another vector but a single value, a scalar.
This is all very simple and easy to understand, but what is a dot product? What is the intuitive idea behind it? What does it mean to have a dot product of zero? To answer these questions, we need the geometric definition of the dot product:
$$\vec{a} \cdot \vec{b} = \|\vec{a}\| \|\vec{b}\| \cos\theta$$
Rearranging the equation to understand it better using the commutative property, we have:
$$\vec{a} \cdot \vec{b} = \|\vec{b}\| \|\vec{a}\| \cos\theta$$
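So, what is the term $\|\vec{a}\| \cos\theta$? This term is the projection of the vector $\vec{a}$ onto the vector $\vec{b}$, as shown in the image below: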

Now, what happens when the vector $\vec{a}$ is orthogonal (at an angle of 90 degrees) to the vector $\vec{b}$, as in the image below?

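There will be no adjacent side in the triangle: its length is zero, so the term $\|\vec{a}\| \cos\theta$ will be zero and the resulting multiplication with the magnitude of the vector $\vec{b}$ will also be zero. Now you know that when the dot product of two nonzero vectors is zero, they are orthogonal to each other (they form an angle of 90 degrees), which is a very neat way to check the orthogonality of different vectors. It is also important to note that we are using 2D examples, but the most amazing fact is that we can also calculate angles and similarity between vectors in higher-dimensional spaces. That is why math lets us see far beyond the obvious, even when we can't visualize or imagine what the angle between two vectors with twelve dimensions is, for instance.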
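The projection and the orthogonality check described above can be sketched in a few lines of plain Python. The helper names (`scalar_projection`, `is_orthogonal`), the tolerance and the vectors are assumptions made for this illustration:

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Magnitude (Euclidean length) of a vector."""
    return math.sqrt(dot(a, a))


def scalar_projection(a, b):
    """Length of the projection of a onto b, i.e. |a| * cos(theta)."""
    return dot(a, b) / norm(b)


def is_orthogonal(a, b, tol=1e-12):
    """True when the dot product is zero, up to an assumed absolute tolerance."""
    return abs(dot(a, b)) <= tol


print(scalar_projection((3.0, 4.0), (2.0, 0.0)))  # 3.0
print(is_orthogonal((0.0, 3.0), (4.0, 0.0)))      # True

# The same check works unchanged for vectors with twelve dimensions
u = (1.0, 0.0) * 6
v = (0.0, 1.0) * 6
print(is_orthogonal(u, v))                        # True
```

The absolute tolerance is only a convenience for floating-point input; for vectors with very large or very small components a relative tolerance may be a better choice.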
The Cosine Similarity
The cosine similarity between two vectors (or two documents on the Vector Space) is a measure that calculates the cosine of the angle between them. This metric measures orientation, not magnitude. It can be seen as a comparison between documents in a normalized space, because we are not considering only the magnitude of each word count (TF-IDF) of each document, but the angle between the documents. To build the cosine similarity equation, we solve the equation of the dot product for $\cos\theta$:
$$\cos\theta = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \|\vec{b}\|}$$
And that is it, this is the cosine similarity formula. Cosine Similarity generates a metric that says how related two documents are by looking at the angle instead of the magnitude, as in the examples below:

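Note that even if one vector points to a location far from another vector, they can still have a small angle between them, and that is the central point of using Cosine Similarity: the measure tends to ignore how much higher the term counts of one document are. Suppose we have a document in which the word "sky" appears 200 times and another document in which the word "sky" appears 50 times. The Euclidean distance between them will be large, but the angle will still be small because they point in the same direction, and that is what matters most when we are comparing documents.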
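To make the "sky" example concrete, here is a small sketch in plain Python. The function names and the two-term vocabulary ("sky", "blue") with its counts are hypothetical values chosen for illustration; only the 200 versus 50 occurrences of "sky" come from the text above:

```python
import math


def cosine_sim(a, b):
    """Cosine of the angle between a and b: (a . b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical term counts for the vocabulary ("sky", "blue")
doc_a = (200, 40)
doc_b = (50, 10)

# Same direction, so the cosine similarity is approximately 1.0
print(cosine_sim(doc_a, doc_b))
# Very different magnitudes, so the Euclidean distance is large (about 153)
print(euclidean_distance(doc_a, doc_b))
```

Both documents use the two words in the same proportion, so they point in the same direction even though one is four times "longer" than the other.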
Now that we have a Vector Space Model of documents (as in the image below) modeled as vectors (with TF-IDF counts), and also a formula to calculate the similarity between different documents in this space, let's see how to do it in practice using scikit-learn (sklearn).

Practice Using Scikit-learn (sklearn)
* In this tutorial I'm using Python 2.7.5 and Scikit-learn 0.14.1.
The first thing we need to do is to define our set of example documents:
    documents = (
        "The sky is blue",
        "The sun is bright",
        "The sun in the sky is bright",
        "We can see the shining sun, the bright sun",
    )
And then we instantiate the Sklearn TF-IDF Vectorizer and transform our documents into the TF-IDF matrix:
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Learn the vocabulary and compute the TF-IDF weights in one step
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
    print(tfidf_matrix.shape)
    # (4, 11)
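Now that we have the TF-IDF matrix (tfidf_matrix) with one row per document (the number of rows of the matrix) and 11 TF-IDF terms (the number of columns of the matrix), we can calculate the Cosine Similarity between the first document ("The sky is blue") and each of the other documents in the set: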
    from sklearn.metrics.pairwise import cosine_similarity

    # Compare the first document (row 0) with every document in the set
    cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)
    # array([[ 1. , 0.36651513, 0.52305744, 0.13448867]])
The tfidf_matrix[0:1] is the Scipy operation to get the first row of the sparse matrix, and the resulting array is the Cosine Similarity between the first document and all documents in the set. Note that the first value of the array is 1.0 because it is the Cosine Similarity between the first document and itself. Also note that, due to the presence of similar words in the third document ("The sun in the sky is bright"), it achieved a higher score than the second and fourth documents.
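If you want to turn these scores into a ranking, one possible approach is sketched below in plain Python. The scores are copied from the output above, while the variable names and the choice to skip the query document itself are assumptions of this sketch:

```python
# Similarity of document 0 to documents 0..3, copied from the output above
scores = [1.0, 0.36651513, 0.52305744, 0.13448867]
query_index = 0

# Rank the other documents from most to least similar,
# leaving out the query document (its similarity to itself is always 1.0)
ranked = sorted(
    (i for i in range(len(scores)) if i != query_index),
    key=lambda i: scores[i],
    reverse=True,
)
print(ranked)  # [2, 1, 3]
```

The indices are zero-based, so index 2 is the third document of the set.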
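If you want, you can also solve the cosine similarity for the angle between the vectors:
$$\cos\theta = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \|\vec{b}\|}$$
We only need to isolate the angle ($\theta$) and move the $\cos$ to the right-hand side of the equation:
$$\theta = \arccos\left(\frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \|\vec{b}\|}\right)$$
The $\arccos$ is the same as the inverse of the cosine ($\cos^{-1}$).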
    import math

    # This was already calculated in the previous step, so we just use the value
    cos_sim = 0.52305744
    angle_in_radians = math.acos(cos_sim)
    print(math.degrees(angle_in_radians))
    # 58.462437107432784
And that angle of ~58.5 degrees is the angle between the first and the third documents of our document set.
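The same conversion can be applied to every score at once. The sketch below is one way to do it; the helper name `angle_in_degrees` and the clamping step are assumptions added here, since floating-point rounding can produce a cosine slightly above 1.0, which `math.acos` rejects:

```python
import math


def angle_in_degrees(cos_sim):
    """Convert a cosine similarity into an angle in degrees."""
    # Clamp to [-1, 1]: rounding can yield values such as 1.0000000000000002,
    # and math.acos raises ValueError outside that interval
    clamped = max(-1.0, min(1.0, cos_sim))
    return math.degrees(math.acos(clamped))


# Similarity scores of the first document against the whole set
scores = [1.0, 0.36651513, 0.52305744, 0.13448867]
for index, score in enumerate(scores):
    print(index, angle_in_degrees(score))
```

A score of 1.0 maps to an angle of 0 degrees, and the lower the score, the wider the angle between the documents.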
Related Material
Scikit-learn (sklearn) - The de facto Machine Learning package for Python