
* It's been a long time since I wrote the TF-IDF tutorial (Part I and Part II) and, as I promised, here is the continuation of the tutorial. Unfortunately I had no time to fix the previous tutorials for the newer versions of the Scikit-Learn (sklearn) package, nor to answer all the questions, but I hope to do that in the near future.

So, in the previous tutorials we learned how to model documents in a vector space, how the TF-IDF transformation works and how TF-IDF is calculated; now what we are going to learn is how to use a well-known similarity measure (cosine similarity) to calculate the similarity between different documents.

The Dot Product

Let's begin with the definition of the dot product for two vectors: \vec{a} = (a_1, a_2, a_3, \ldots) and \vec{b} = (b_1, b_2, b_3, \ldots), where a_n and b_n are the components of the vectors (features of the document, or TF-IDF values for each word of the document in our example) and \mathit{n} is the dimension of the vectors:

\vec{a} \cdot \vec{b} = \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n

As you can see, the definition of the dot product is a simple multiplication of each component from both vectors added together. See an example of a dot product for two vectors with 2 dimensions each (2D):

\vec{a} = (0, 3) \\ \vec{b} = (4, 0) \\ \vec{a} \cdot \vec{b} = 0 \times 4 + 3 \times 0 = 0

The first thing you probably noticed is that the result of a dot product between two vectors isn't another vector but a single value, a scalar.
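If you want to check this numerically, here is a minimal sketch (assuming NumPy is installed) that reproduces the 2D example above:

import numpy as np

a = np.array([0, 3])
b = np.array([4, 0])

# The dot product of these two vectors is a single scalar, zero in this case
print np.dot(a, b)
# 0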

This is all very simple and easy to understand, but what is a dot product? What is the intuitive idea behind it? What does it mean when the dot product between two vectors is zero? To understand it, we need to understand what the geometric definition of the dot product is:

\vec{a} \cdot \vec{b} = \|\vec{a}\|\|\vec{b}\|\cos{\theta}

Rearranging the equation using the commutative property, to understand it better, we have:

\vec{a} \cdot \vec{b} = \|\vec{b}\|\|\vec{a}\|\cos{\theta}

So, what is the term \displaystyle \|\vec{a}\|\cos{\theta}? This term is the projection of the vector \vec{a} onto the vector \vec{b}, as shown on the image below:

The vector \vec{a} projected onto the vector \vec{b}. By Wikipedia.

Now, what happens when the vector \vec{a} is orthogonal (with an angle of 90 degrees) to the vector \vec{b}, like on the image below?

Two orthogonal vectors (with 90 degrees angle).

There will be no adjacent side on the triangle, it will be equivalent to zero, the term \displaystyle \|\vec{a}\|\cos{\theta} will be zero and the resulting multiplication with the magnitude of the vector \vec{b} will also be zero. Now you know that when the dot product between two different vectors is zero, they are orthogonal to each other (they have an angle of 90 degrees); this is a very neat way to check the orthogonality of different vectors. It is also important to note that we are using 2D examples, but the most amazing fact about it is that we can also calculate angles and similarity between vectors in higher-dimensional spaces, and that is why math lets us see far beyond the obvious: we can't visualize or imagine the angle between two vectors with twelve dimensions, for instance, but we can still calculate it.
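Just to illustrate that last point, here is a small sketch (again assuming NumPy) that computes the angle between two 5-dimensional vectors, something we can't draw but can still calculate with the same formula:

import numpy as np

a = np.array([1.0, 2.0, 0.0, 3.0, 1.0])
b = np.array([2.0, 1.0, 1.0, 0.0, 4.0])

# cos(theta) = (a . b) / (||a|| * ||b||), exactly as in the geometric definition
cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print np.degrees(np.arccos(cos_theta))
# about 63.9 degrees for these example vectors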

The Cosine Similarity

The cosine similarity between two vectors (or two documents in the vector space) is a measure that calculates the cosine of the angle between them. This metric is a measurement of orientation and not magnitude; it can be seen as a comparison between documents on a normalized space because we're not taking into consideration only the magnitude of each word count (tf-idf) of each document, but the angle between the documents. What we have to do to build the cosine similarity equation is to solve the equation of the dot product for the \cos{\theta}:

\displaystyle \vec{a} \cdot \vec{b} = \|\vec{a}\|\|\vec{b}\|\cos{\theta} \\ \\ \cos{\theta} = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\|\|\vec{b}\|}

And that is the cosine similarity formula. Cosine similarity will generate a metric that says how related two documents are by looking at the angle instead of the magnitude, like on the examples below:

Cosine similarity values for different documents: 1 (same direction), 0 (90 deg.), -1 (opposite directions).

Note that even if we had a vector pointing to a point far from another vector, they could still have a small angle, and that is the central point of the use of cosine similarity: the measurement tends to ignore the higher term counts in documents. Suppose we have a document with the word "sky" appearing 200 times and another document with the word "sky" appearing 50 times; the Euclidean distance between them will be large, but the angle will still be small because they are pointing in the same direction, which is what matters when we are comparing documents.
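To make that intuition concrete, here is a rough sketch (using NumPy and made-up 2-dimensional count vectors, not the documents of our example) comparing the Euclidean distance and the cosine similarity for that "sky" example:

import numpy as np

def cosine_sim(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical count vectors: [count of "sky", count of some other word]
doc_a = np.array([200.0, 10.0])
doc_b = np.array([50.0, 3.0])

print np.linalg.norm(doc_a - doc_b)  # large Euclidean distance (~150.2)
print cosine_sim(doc_a, doc_b)       # cosine close to 1.0, so a very small angle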

Now that we have a Vector Space Model of documents (like on the image below) modeled as vectors (with TF-IDF counts) and also have a formula to calculate the similarity between different documents in this space, let's see how we do it in practice using Scikit-Learn (sklearn).

Vector Space Model

Practice Using Scikit-learn (sklearn)

* In this tutorial I'm using Python 2.7.5 and scikit-learn 0.14.1.

The first thing we need to do is to define our example document set:

documents = (
    "The sky is blue",
    "The sun is bright",
    "The sun in the sky is bright",
    "We can see the shining sun, the bright sun"
)

Then we instantiate the Sklearn TF-IDF vectorizer and transform our documents into the TF-IDF matrix:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
print tfidf_matrix.shape
# (4, 11)

Now that we have the TF-IDF matrix (tfidf_matrix) for each document (the number of rows of the matrix) with 11 tf-idf terms (the number of columns of the matrix), we can calculate the Cosine Similarity between the first document ("The sky is blue") and each of the other documents of the set:

from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)
# array([[ 1.        ,  0.36651513,  0.52305744,  0.13448867]])

The tfidf_matrix[0:1] is the Scipy operation to get the first row of the sparse matrix, and the resulting array is the cosine similarity between the first document and all the documents in the set. Note that the first value of the array is 1.0 because it is the cosine similarity between the first document and itself. Also note that the third document ("The sun in the sky is bright") got a higher score because of the similar words it shares with the first document.
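If you want the similarity between every pair of documents at once, and not only between the first document and the rest, you can also pass the whole matrix twice (a small sketch; the cosine_similarity signature should be the same in the version used here):

from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarity between all documents: a 4x4 symmetric
# matrix with 1.0 on the diagonal (each document compared with itself).
similarity_matrix = cosine_similarity(tfidf_matrix, tfidf_matrix)
print similarity_matrix.shape
# (4, 4)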

If you want, you can also solve the Cosine Similarity for the angle between vectors:

\cos{\theta} = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\|\|\vec{b}\|}

We only need to isolate the angle (\theta) and move the \cos to the right hand side of the equation:

\theta = \arccos{\frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\|\|\vec{b}\|}}

The \arccos is the same as the inverse of the cosine (\cos^{-1}).

Let's, for instance, check the angle between the first and third documents:
import math

# This was already calculated on the previous step, so we just use the value
cos_sim = 0.52305744
angle_in_radians = math.acos(cos_sim)

print math.degrees(angle_in_radians)
# 58.462437107432784

And that angle of ~58.5 degrees is the angle between the first and the third documents of our document set.

And that is it, I hope you liked this third tutorial!
Cite this article as: Christian S. Perone, "Machine Learning :: Cosine Similarity for Vector Space Models (Part III)," in Terra Incognita, 12/09/2013.

Related Material

A video about Dot Product on The Khan Academy

Wikipedia: Dot product

Wikipedia: Cosine Similarity

Scikit-Learn (sklearn) – The de facto machine learning package for Python

Machine Learning :: Text feature extraction (tf-idf) – Part II

Read the first part of this tutorial: Text feature extraction (tf-idf) – Part I

This post is a continuation of the first part, where we started to learn the theory and practice about text feature extraction and vector space model representation. I really recommend you to read the first part of the post series in order to follow this second post.

Since a lot of people liked the first part of this tutorial, this second part is a bit longer than the first.

Introduction

In the first post, we learned how to use the term-frequency to represent textual information in the vector space. However, the main problem with the term-frequency approach is that it scales up frequent terms and scales down rare terms which are empirically more informative than the high frequency terms. The basic intuition is that a term that occurs frequently in many documents is not a good discriminator, and that really makes sense (at least in many experimental tests); the important question here is: why would you, in a classification problem for instance, emphasize a term which is present in almost the entire corpus of your documents?

The tf-idf weight comes to solve this problem. What tf-idf gives is how important a word is to a document in a collection, and that's why tf-idf incorporates local and global parameters: it takes into consideration not only the isolated term but also the term within the document collection. What tf-idf then does to solve that problem is to scale down the frequent terms while scaling up the rare terms; a term that occurs 10 times more than another isn't 10 times more important than it, and that's why tf-idf uses the logarithmic scale to do that.

But let's go back to our definition of \mathrm{tf}(t,d), which is actually the term count of the term t in the document d. The use of this simple term frequency could lead us to problems like keyword spamming, which is when we have a repeated term in a document with the purpose of improving its ranking on an IR (Information Retrieval) system, or even create a bias towards long documents, making them look more important than they are just because of the high frequency of the term in the document.

To overcome this problem, the term frequency \mathrm{tf}(t,d) of a document on the vector space is usually also normalized. Let's see how we normalize this vector.

Vector normalization

Suppose we are going to normalize the term-frequency vector \vec{v_{d_4}} that we calculated in the first part of this tutorial. The document d4 from the first part of this tutorial had this textual representation:

d4: We can see the shining sun, the bright sun.

And the vector space representation using the non-normalized term-frequency of that document was:

\vec{v_{d_4}} = (0,2,1,0)

To normalize the vector is the same as calculating the unit vector of the vector, and unit vectors are denoted using the "hat" notation: \hat{v}. The definition of the unit vector \hat{v} of a vector \vec{v} is:

\displaystyle \hat{v} = \frac{\vec{v}}{\|\vec{v}\|_p}

Where the \hat{v} is the unit vector, or the normalized vector, the \vec{v} is the vector going to be normalized and the \|\vec{v}\|_p is the norm (magnitude, length) of the vector \vec{v} in the L^p space (don't worry, I'm going to explain it all).

The unit vector is actually nothing more than a normalized version of the vector, a vector whose length is 1.

The normalization process. (Source: http://processing.org/learning/pvector/)

But the important question here is how the length of the vector is calculated, and to understand this, you must understand the motivation of the L^p spaces, also called Lebesgue spaces.

Lebesgue spaces

How long is this vector? (Source: http://processing.org/learning/pvector/)

Usually, the length of a vector \vec{u} = (u_1, u_2, u_3, \ldots, u_n) is calculated using the Euclidean norm (a norm is a function that assigns a strictly positive length or size to all vectors in a vector space), which is defined as:


\|\vec{u}\| = \sqrt{u^2_1 + u^2_2 + u^2_3 + \ldots + u^2_n}

But this isn't the only way to define length, and that's why you see (sometimes) a number p together with the norm notation, like in \|\vec{u}\|_p. That's because it could be generalized as:

\displaystyle \|\vec{u}\|_p = ( \left|u_1\right|^p + \left|u_2\right|^p + \left|u_3\right|^p + \ldots + \left|u_n\right|^p )^\frac{1}{p}

And simplified as:

\displaystyle \|\vec{u}\|_p = (\sum\limits_{i=1}^{n}\left|\vec{u}_i\right|^p)^\frac{1}{p}

So when you read about an L2-norm, you're reading about the Euclidean norm, a norm with p=2, the most common norm used to measure the length of a vector, typically called "magnitude"; actually, when you have an unqualified length measure (without the p number), you have the L2-norm (Euclidean norm).

When you read about an L1-norm, you're reading about the norm with p=1, defined as:

\displaystyle \|\vec{u}\|_1 = ( \left|u_1\right| + \left|u_2\right| + \left|u_3\right| + \ldots + \left|u_n\right|)

Which is nothing more than a simple sum of the absolute values of the components of the vector, also known as the Taxicab distance, or Manhattan distance.

Taxicab geometry versus Euclidean distance: In taxicab geometry all three pictured lines have the same length (12) for the same route. In Euclidean geometry, the green line has length 6 \times \sqrt{2} \approx 8.48, and is the unique shortest path.
Source: Wikipedia :: Taxicab Geometry
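As a quick numerical illustration of the difference between the two norms (a minimal sketch assuming NumPy), here are the L1-norm and the L2-norm of the same vector:

import numpy as np

u = np.array([0.0, 2.0, 1.0, 0.0])

print np.linalg.norm(u, ord=1)  # L1-norm (Taxicab/Manhattan): 3.0
print np.linalg.norm(u, ord=2)  # L2-norm (Euclidean): sqrt(5), about 2.2360679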

Note that you can also use any norm to normalize the vector, but we're going to use the most common norm, the L2-norm, which is also the default in the 0.9 release of scikits.learn. You can also find papers comparing the performance of the two approaches among other methods to normalize the document vector; actually you can use any other method, but you have to be consistent: once you've used a norm, you have to use it for the whole process directly involving the norm (a unit vector built with the L1-norm won't have length 1 if you later take its L2-norm).

Back to vector normalization

Now that you know what the vector normalization process is, we can try a concrete example, the process of using the L2-norm (we'll use the right terms now) to normalize our vector \vec{v_{d_4}} = (0,2,1,0) in order to get its unit vector \hat{v_{d_4}}. To do that, we'll simply plug it into the definition of the unit vector to evaluate it:

\hat{v} = \frac{\vec{v}}{\|\vec{v}\|_p} \\ \\ \hat{v_{d_4}} = \frac{\vec{v_{d_4}}}{\|\vec{v_{d_4}}\|_2} \\ \\ \hat{v_{d_4}} = \frac{(0,2,1,0)}{\sqrt{0^2 + 2^2 + 1^2 + 0^2}} \\ \\ \hat{v_{d_4}} = \frac{(0,2,1,0)}{\sqrt{5}} \\ \\ \small \hat{v_{d_4}} = (0.0, 0.89442719, 0.4472136, 0.0)

And that is it! Our normalized vector \hat{v_{d_4}} now has an L2-norm of \|\hat{v_{d_4}}\|_2 = 1.0.
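You can verify this numerically with a short sketch (assuming NumPy):

import numpy as np

v_d4 = np.array([0.0, 2.0, 1.0, 0.0])
v_d4_hat = v_d4 / np.linalg.norm(v_d4)  # divide the vector by its L2-norm

print v_d4_hat
# [ 0.          0.89442719  0.4472136   0.        ]
print np.linalg.norm(v_d4_hat)
# 1.0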

Note that here we have normalized our term frequency document vector, but later we’re going to do that after the calculation of the tf-idf.

The term frequency – inverse document frequency (tf-idf) weight

Now that you have understood how vector normalization works in theory and practice, let's continue with our tutorial. Suppose you have the following documents in your collection (taken from the first part of the tutorial):

Train Document Set:
d1: The sky is blue.
d2: The sun is bright.

Test Document Set:
d3: The sun in the sky is bright.
d4: We can see the shining sun, the bright sun.

Your document space can be defined then as D = \{d_1, d_2, \ldots, d_n\} where n is the number of documents in your corpus, and in our case D_{train} = \{d_1, d_2\} and D_{test} = \{d_3, d_4\}. The cardinality of our document spaces is defined by \left|{D_{train}}\right| = 2 and \left|{D_{test}}\right| = 2, since we have only 2 documents for training and 2 for testing, but they obviously don't need to have the same cardinality.

Let's see now how the idf (inverse document frequency) is defined:

\displaystyle \mathrm{idf}(t) = \log{\frac{\left|D\right|}{1+\left|\{d : t \in d\}\right|}}

Where \left|\{d : t \in d\}\right| is the number of documents where the term t appears; when the term-frequency function satisfies \mathrm{tf}(t,d) \neq 0, we're only adding 1 into the formula to avoid zero-division.
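Just to make the definition concrete, here is a minimal sketch of this exact idf variant (with the 1 added to the denominator) in plain Python; note that this is not the scikit-learn implementation, which may use a different smoothing depending on the version:

import math

def idf(term, documents):
    # documents is a list of token lists; the sum below is |{d : t in d}|
    df = sum(1 for d in documents if term in d)
    return math.log(float(len(documents)) / (1 + df))

docs = [["the", "sky", "is", "blue"], ["the", "sun", "is", "bright"]]
print idf("blue", docs)   # log(2/2) = 0.0
print idf("cloud", docs)  # log(2/1) = 0.69314718...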

The formula for the tf-idf is then:

\mathrm{tf\mbox{-}idf}(t) = \mathrm{tf}(t, d) \times \mathrm{idf}(t)

And this formula has an important consequence: a high weight of the tf-idf calculation is reached when you have a high term frequency (tf) in the given document (local parameter) and a low document frequency of the term in the whole collection (global parameter).

Now, let's calculate the idf for each feature present in the feature matrix with the term frequency we have calculated in the first tutorial:

M_{train} = \begin{bmatrix} 0 & 1 & 1 & 1 \\ 0 & 2 & 1 & 0 \end{bmatrix}

Since we have 4 features, we have to calculate \mathrm{idf}(t_1), \mathrm{idf}(t_2), \mathrm{idf}(t_3) and \mathrm{idf}(t_4):

\mathrm{idf}(t_1) = \log{\frac{\left|D\right|}{1+\left|\{d : t_1 \in d\}\right|}} = \log{\frac{2}{1}} = 0.69314718

\mathrm{idf}(t_2) = \log{\frac{\left|D\right|}{1+\left|\{d : t_2 \in d\}\right|}} = \log{\frac{2}{3}} = -0.40546511

\mathrm{idf}(t_3) = \log{\frac{\left|D\right|}{1+\left|\{d : t_3 \in d\}\right|}} = \log{\frac{2}{3}} = -0.40546511

\mathrm{idf}(t_4) = \log{\frac{\left|D\right|}{1+\left|\{d : t_4 \in d\}\right|}} = \log{\frac{2}{2}} = 0.0

These idf weights can be represented by a vector as:

\vec{idf_{train}} = (0.69314718, -0.40546511, -0.40546511, 0.0)

Now that we have our matrix with the term frequency (M_{train}) and the vector representing the idf for each feature of our matrix (\vec{idf_{train}}), we can calculate our tf-idf weights. What we have to do is a simple multiplication of each column of the matrix M_{train} with the respective \vec{idf_{train}} vector dimension. To do that, we can create a square diagonal matrix M_{idf} with both the vertical and horizontal dimensions equal to the vector \vec{idf_{train}} dimension:

M_{idf} = \begin{bmatrix} 0.69314718 & 0 & 0 & 0 \\ 0 & -0.40546511 & 0 & 0 \\ 0 & 0 & -0.40546511 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}

And then multiply it by the term frequency matrix, so the final result can then be defined as:

M_{tf\mbox{-}idf} = M_{train} \times M_{idf}

Please note that the matrix multiplication isn't commutative: the result of A \times B will be different from the result of B \times A, and that's why the M_{idf} is on the right side of the multiplication, to accomplish the desired effect of multiplying each idf value by its corresponding feature:

\begin{bmatrix}   \mathrm{tf}(t_1, d_1) & \mathrm{tf}(t_2, d_1) & \mathrm{tf}(t_3, d_1) & \mathrm{tf}(t_4, d_1)\\   \mathrm{tf}(t_1, d_2) & \mathrm{tf}(t_2, d_2) & \mathrm{tf}(t_3, d_2) & \mathrm{tf}(t_4, d_2)   \end{bmatrix}   \times   \begin{bmatrix}   \mathrm{idf}(t_1) & 0 & 0 & 0\\   0 & \mathrm{idf}(t_2) & 0 & 0\\   0 & 0 & \mathrm{idf}(t_3) & 0\\   0 & 0 & 0 & \mathrm{idf}(t_4)   \end{bmatrix}   \\ =   \begin{bmatrix}   \mathrm{tf}(t_1, d_1) \times \mathrm{idf}(t_1) & \mathrm{tf}(t_2, d_1) \times \mathrm{idf}(t_2) & \mathrm{tf}(t_3, d_1) \times \mathrm{idf}(t_3) & \mathrm{tf}(t_4, d_1) \times \mathrm{idf}(t_4)\\   \mathrm{tf}(t_1, d_2) \times \mathrm{idf}(t_1) & \mathrm{tf}(t_2, d_2) \times \mathrm{idf}(t_2) & \mathrm{tf}(t_3, d_2) \times \mathrm{idf}(t_3) & \mathrm{tf}(t_4, d_2) \times \mathrm{idf}(t_4)   \end{bmatrix}

Let's see now a concrete example of this multiplication:

M_{tf\mbox{-}idf} = M_{train} \times M_{idf} = \\ \begin{bmatrix} 0 & 1 & 1 & 1 \\ 0 & 2 & 1 & 0 \end{bmatrix} \times \begin{bmatrix} 0.69314718 & 0 & 0 & 0 \\ 0 & -0.40546511 & 0 & 0 \\ 0 & 0 & -0.40546511 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} \\ = \begin{bmatrix} 0 & -0.40546511 & -0.40546511 & 0 \\ 0 & -0.81093022 & -0.40546511 & 0 \end{bmatrix}

And finally, we can apply our L2 normalization process to the M_{tf\mbox{-}idf} matrix. Please note that this normalization is "row-wise", because we're going to handle each row of the matrix as a separate vector to be normalized, and not the matrix as a whole:

M_{tf\mbox{-}idf} = \frac{M_{tf\mbox{-}idf}}{\|M_{tf\mbox{-}idf}\|_2} = \begin{bmatrix} 0 & -0.70710678 & -0.70710678 & 0 \\ 0 & -0.89442719 & -0.4472136 & 0 \end{bmatrix}

And that is our normalized tf-idf weight of our test document set, which is actually a collection of unit vectors. If you take the L2-norm of each row of the matrix, you'll see that they all have an L2-norm of 1.
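Before moving to scikit-learn, here is a minimal NumPy sketch (assuming NumPy is installed) reproducing the manual calculation above: building the diagonal idf matrix, multiplying it by the term-frequency matrix and then applying the row-wise L2 normalization:

import numpy as np

M_train = np.array([[0., 1., 1., 1.],
                    [0., 2., 1., 0.]])
idf_train = np.array([0.69314718, -0.40546511, -0.40546511, 0.0])

# Square diagonal matrix with the idf weights on its diagonal
M_idf = np.diag(idf_train)

# tf-idf matrix: term-frequency matrix times the diagonal idf matrix
M_tfidf = np.dot(M_train, M_idf)

# Row-wise L2 normalization: divide each row by its own L2-norm
row_norms = np.sqrt((M_tfidf ** 2).sum(axis=1))
print M_tfidf / row_norms[:, np.newaxis]
# [[ 0.         -0.70710678 -0.70710678  0.        ]
#  [ 0.         -0.89442719 -0.4472136   0.        ]]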

Python practice

Environment used: Python v.2.7.2, Numpy 1.6.1, Scipy v.0.9.0, Sklearn (Scikits.learn) v.0.9.

Now the section you were waiting for! In this section I'll use Python to show each step of the tf-idf calculation using the Scikit.learn feature extraction module.

The first step is to create our training and testing document sets and compute the term frequency matrix:

from sklearn.feature_extraction.text import CountVectorizer

train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
            "We can see the shining sun, the bright sun.")

count_vectorizer = CountVectorizer()
count_vectorizer.fit_transform(train_set)
print "Vocabulary:", count_vectorizer.vocabulary
# Vocabulary: {'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3}

freq_term_matrix = count_vectorizer.transform(test_set)
print freq_term_matrix.todense()
# [[0 1 1 1]
#  [0 2 1 0]]

Now that we have the frequency term matrix (called freq_term_matrix), we can instantiate the TfidfTransformer, which is going to be responsible for calculating the tf-idf weights for our term frequency matrix:

from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(norm="l2")
tfidf.fit(freq_term_matrix)

print "IDF:", tfidf.idf_
# IDF: [ 0.69314718 -0.40546511 -0.40546511  0.        ]

Note that I've specified the norm as L2; this is optional (actually the default is the L2-norm), but I've added the parameter to make it explicit that it's going to use the L2-norm. Also note that you can see the calculated idf weights by accessing the internal attribute called idf_. Now that the fit() method has calculated the idf for the matrix, let's transform the freq_term_matrix to the tf-idf weight matrix:

tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
# [[ 0.         -0.70710678 -0.70710678  0.        ]
#  [ 0.         -0.89442719 -0.4472136   0.        ]]

And that is it, the tf_idf_matrix is actually our previous M_{tf\mbox{-}idf} matrix. You can achieve the same effect by using the Vectorizer class of Scikit.learn, which is a vectorizer that automatically combines the CountVectorizer and the TfidfTransformer for you. See this example to know how to use it for the text classification process.
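In newer scikit-learn releases the TfidfVectorizer plays that same combined role (as far as I know the old Vectorizer class was removed); a rough sketch of the equivalent code follows, noting that the defaults changed between releases, so the exact weights may differ from the ones computed above:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectorizer.fit(train_set)                    # learns the vocabulary and the idf from the train set
tfidf_test = vectorizer.transform(test_set)  # tf-idf weights for the test documents
print tfidf_test.todense()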

I really hope you liked the post. I tried to make it as simple as possible, even for people without the required mathematical background in linear algebra, etc. In the next Machine Learning post I'm expecting to show how you can use the tf-idf to calculate the cosine similarity.

If you liked it, feel free to comment and make suggestions, corrections, etc.

Cite this article as: Christian S. Perone, "Machine Learning :: Text feature extraction (tf-idf) – Part II," in Terra Incognita, 03/10/2011, //www.cpetem.com/2011/10/machine-learning-text-feature-extraction-tf-idf-part-ii/

References

Understanding Inverse Document Frequency: on theoretical arguments for IDF

Wikipedia :: tf-idf

The classic Vector Space Model

Sklearn text feature extraction code

Updates

13 Mar 2015 – Formatting, fixed image issues.
03 Oct 2011 – Added information about the environment used for the Python examples.