Machine Learning :: Cosine Similarity for Vector Space Models (Part III)

* It has been a long time since I wrote the TF-IDF tutorial (Part I and Part II) and, as I promised, here is the continuation of the tutorial. Unfortunately I had no time to fix the previous tutorials for the newer versions of the scikit-learn (sklearn) package, nor to answer all the questions, but I hope to do that in the near future.

So, in the previous tutorials we learned how a document can be modeled in the Vector Space, how the TF-IDF transformation works and how the TF-IDF is calculated; now what we are going to learn is how to use a well-known similarity measure (Cosine Similarity) to calculate the similarity between different documents.

The Dot Product

Let's begin with the definition of the dot product for two vectors: \vec{a} = (a_1, a_2, a_3, \ldots) and \vec{b} = (b_1, b_2, b_3, \ldots), where a_n and b_n are the components of the vector (the features of the document, or the tf-idf values of each word of the document in our example) and \mathit{n} is the dimension of the vectors:

\vec{a} \cdot \vec{b} = \sum_{i=1}^n a_ib_i = a_1b_1 + a_2b_2 + \cdots + a_nb_n

As you can see, the definition of the dot product is a simple multiplication of each component from both vectors added together. See an example of a dot product for two vectors with 2 dimensions each (2D):

\vec{a} = (0, 3) \\   \vec{b} = (4, 0) \\   \vec{a} \cdot \vec{b} = 0*4 + 3*0 = 0

The first thing you probably noticed is that the result of a dot product between two vectors isn't another vector but a single value, a scalar.

This is all very simple and easy to understand, but what is a dot product? What is the intuitive idea behind it? What does it mean to have a dot product of zero? To understand it, we need to understand the geometric definition of the dot product:

\vec{a} \cdot \vec{b} = \|\vec{a}\|\|\vec{b}\|\cos{\theta}

Rearranging the equation to understand it better using the commutative property, we have:

\vec{a} \cdot \vec{b} = \|\vec{b}\|\|\vec{a}\|\cos{\theta}

So, what is the term \displaystyle \|\vec{a}\|\cos{\theta}? This term is the projection of the vector \vec{a} onto the vector \vec{b}, as shown in the image below:

The projection of the vector A onto the vector B. By Wikipedia.

Now, what happens when the vector \vec{a} is orthogonal (at an angle of 90 degrees) to the vector \vec{b}, like in the image below?

Two orthogonal vectors (with a 90 degrees angle).

There will be no adjacent side on the triangle, it will be equivalent to zero, so the term \displaystyle \|\vec{a}\|\cos{\theta} will be zero and the resulting multiplication with the magnitude of the vector \vec{b} will also be zero. Now you know that when the dot product between two different vectors is zero, they are orthogonal to each other (they have an angle of 90 degrees); this is a very neat way to check the orthogonality of different vectors. It is also important to note that we are using 2D examples, but the most amazing fact about it is that we can also calculate angles and similarity between vectors in higher dimensional spaces, and that is why math lets us see farther than the obvious, even when we can't visualize or imagine what is the angle between two vectors with twelve dimensions, for instance.
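To make this concrete, here is a minimal NumPy sketch (an illustration I added, not code from the original post) that computes the dot product of the two example vectors above and checks their orthogonality:

import numpy as np

a = np.array([0, 3])
b = np.array([4, 0])

# dot product: 0*4 + 3*0 = 0, so the vectors are orthogonal (90 degrees angle)
print np.dot(a, b)   # 0
# a non-zero vector is never orthogonal to itself
print np.dot(a, a)   # 9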

Cosine Similarity

The cosine similarity between two vectors (or two documents on the Vector Space) is a measure that calculates the cosine of the angle between them. This metric is a measurement of orientation and not magnitude; it can be seen as a comparison between documents on a normalized space, because we're not taking into consideration only the magnitude of each word count (tf-idf) of each document, but the angle between the documents. What we have to do to build the cosine similarity equation is to solve the equation of the dot product for the \cos{\theta}:

\displaystyle  \vec{a} \cdot \vec{b} = \|\vec{a}\|\|\vec{b}\|\cos{\theta} \\ \\  \cos{\theta} = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\|\|\vec{b}\|}

And that is it, this is the cosine similarity formula. Cosine Similarity will generate a metric that says how related two documents are by looking at the angle instead of the magnitude, like in the examples below:

The Cosine Similarity values for different documents: 1 (same direction), 0 (90 degrees), -1 (opposite directions).

Note that even if we have a vector pointing to a point far from another vector, they still can have a small angle, and that is the central point of the use of Cosine Similarity: the measurement tends to ignore the higher term counts in documents. Suppose we have a document with the word "sky" appearing 200 times and another document with the word "sky" appearing 50 times; the Euclidean distance between them will be higher, but the angle will still be small because they are pointing in the same direction, which is what matters when we are comparing documents.
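To illustrate the "sky" example above, here is a short NumPy sketch (an added illustration with made-up term counts, not from the original post) comparing the Euclidean distance and the cosine similarity of two vectors that point in the same direction but have very different magnitudes:

import numpy as np

# hypothetical tf vectors: "sky" appears 200 times in one document and 50 in the other
a = np.array([200.0, 10.0])
b = np.array([50.0, 2.5])

euclidean = np.linalg.norm(a - b)
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print euclidean  # ~150.19, a large distance
print cosine     # 1.0, the angle between the vectors is 0 degrees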

Now that we have a Vector Space Model of documents (like on the image below) modeled as vectors (with tf-idf counts) and also a formula to calculate the similarity between different documents in this space, let's see now how we do it in practice using scikit-learn (sklearn).

The Vector Space Model

Practice using Scikit-learn (sklearn)

* In this tutorial I'm using Python 2.7.5 and Scikit-learn 0.14.1.

The first thing we need to do is to define our set of example documents:

documents = (
    "The sky is blue",
    "The sun is bright",
    "The sun in the sky is bright",
    "We can see the shining sun, the bright sun",
)

And then we instantiate the Sklearn TF-IDF Vectorizer and transform our documents into the TF-IDF matrix:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
print tfidf_matrix.shape
# (4, 11)

Now we have the TF-IDF matrix (tfidf_matrix) for each document (the number of rows of the matrix) with 11 tf-idf terms (the number of columns of the matrix), and we can calculate the Cosine Similarity between the first document ("The sky is blue") and each of the other documents of the set:

from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)
array([[ 1.        ,  0.36651513,  0.52305744,  0.13448867]])

The tfidf_matrix[0:1] is the Scipy operation to get the first row of the sparse matrix, and the resulting array is the Cosine Similarity between the first document and all documents in the set. Note that the first value of the array is 1.0 because it is the Cosine Similarity between the first document and itself. Also note that due to the presence of similar words on the third document ("The sun in the sky is bright"), it achieved a better score.
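A side note: since TfidfVectorizer L2-normalizes each row by default, the cosine similarity above reduces to a plain dot product between the rows of the matrix. A minimal sketch (added here, reusing the tfidf_matrix created above) to verify this:

# rows of tfidf_matrix are already L2-normalized, so the cosine similarity between
# the first document and every document is just a dot product of the rows
sims = tfidf_matrix[0:1].dot(tfidf_matrix.T).toarray()
print sims
# [[ 1.          0.36651513  0.52305744  0.13448867]]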

If you want, you can also solve the Cosine Similarity for the angle between the vectors:

\cos{\theta} = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\|\|\vec{b}\|}

We only need to isolate the angle (\theta) and move the \cos to the right hand side of the equation:

\theta = \arccos{\frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\|\|\vec{b}\|}}

The \arccos is the same as the inverse of the cosine (\cos^{-1}).

Let's, for instance, check the angle between the first and the third documents:
import math

# This was already calculated on the previous step, so we just use the value
cos_sim = 0.52305744
angle_in_radians = math.acos(cos_sim)
print math.degrees(angle_in_radians)
# 58.462437107432784

And that angle of ~58.5 degrees is the angle between the first and the third documents of our document set.

And that is it, I hope you liked this third tutorial!
Cite this article as: Christian S. Perone, "Machine Learning :: Cosine Similarity for Vector Space Models (Part III)," in Terra Incognita, 09/12/2013.

Related Material

A video about the Dot Product on The Khan Academy

Wikipedia: Dot Product

Wikipedia: Cosine Similarity

Scikit-learn (sklearn) – the de facto Machine Learning package for Python

Machine Learning :: Text feature extraction (tf-idf) – Part II

Read the first part of this tutorial: Text feature extraction (tf-idf) – Part I

This post is a continuation of the first part, where we started to learn the theory and practice of text feature extraction and the vector space model representation. I really recommend you to read the first part of the post series in order to follow this second one.

Since a lot of people liked the first part of this tutorial, this second part is a little longer than the first.

Introduction

In the first post, we learned how to use the term frequency to represent textual information in the vector space. However, the main problem with the term-frequency approach is that it scales up frequent terms and scales down rare terms which are empirically more informative than the high frequency terms. The basic intuition is that a term that occurs frequently in many documents is not a good discriminator, and this really makes sense (at least in many experimental tests); the important question here is: why would you, in a classification problem for instance, emphasize a term which is present in almost the entire corpus of your documents?

The tf-idf weight comes to solve this problem. What tf-idf gives is how important a word is to a document in a collection, and that's why tf-idf incorporates local and global parameters: it takes into consideration not only the isolated term but also the term within the document collection. What tf-idf then does to solve the problem is to scale down the frequent terms while scaling up the rare terms; a term that occurs 10 times more than another isn't 10 times more important than it, and that's why tf-idf uses the logarithmic scale to do that.

But let's go back to our definition of the \mathrm{tf}(t,d), which is actually the term count of the term t in the document d. The use of this simple term frequency could lead us to problems like keyword spamming, which is when we have a repeated term in a document with the purpose of improving its ranking on an IR (Information Retrieval) system, or it could even create a bias towards long documents, making them look more important than they are just because of the high frequency of the term in the document.

To overcome this problem, the term frequency \mathrm{tf}(t,d) of a document on a vector space is usually also normalized. Let's see how we normalize this vector.

Vector normalization

Suppose we are going to normalize the term-frequency vector \vec{v_{d_4}} that we have calculated in the first part of this tutorial. The document d4 from the first part of this tutorial had this textual representation:

d4: We can see the shining sun, the bright sun.

And the vector space representation using the non-normalized term-frequency of that document was:

\vec{v_{d_4}} = (0,2,1,0)

To normalize the vector is the same as calculating the Unit Vector of the vector, which is denoted using the "hat" notation: \hat{v}. The definition of the unit vector \hat{v} of a vector \vec{v} is:

\displaystyle \hat{v} = \frac{\vec{v}}{\|\vec{v}\|_p}

Where the \hat{v} is the unit vector, or the normalized vector, the \vec{v} is the vector going to be normalized and the \|\vec{v}\|_p is the norm (magnitude, length) of the vector \vec{v} in the L^p space (don't worry, I'm going to explain it all).

The unit vector is actually nothing more than a normalized version of the vector; it is a vector whose length is 1.

The normalization process (Source: http://processing.org/learning/pvector/)

But the important question here is how the length of the vector is calculated, and to understand this, you must understand the motivation of the L^p spaces, also called Lebesgue spaces.

Lebesgue spaces

How long is this vector? (Source: http://processing.org/learning/pvector/)

Usually, the length of a vector \vec{u} = (u_1, u_2, u_3, \ldots, u_n) is calculated using the Euclidean norm – a norm is a function that assigns a strictly positive length or size to all vectors in a vector space –, which is defined by:

(Source: http://processing.org/learning/pvector/)

\|\vec{u}\| = \sqrt{u^2_1 + u^2_2 + u^2_3 + \ldots + u^2_n}

But this isn't the only way to define length, and that's why you see (sometimes) a number p together with the norm notation, like in \|\vec{u}\|_p. That's because it can be generalized as:

\displaystyle \|\vec{u}\|_p = (\left|u_1\right|^p + \left|u_2\right|^p + \left|u_3\right|^p + \ldots + \left|u_n\right|^p)^\frac{1}{p}

and simplified as:

\displaystyle \|\vec{u}\|_p = (\sum\limits_{i=1}^{n}\left|\vec{u}_i\right|^p)^\frac{1}{p}

So when you read about an L2-norm, you're reading about the Euclidean norm, a norm with p=2, the most common norm used to measure the length of a vector, typically called "magnitude"; actually, when you have an unqualified length measure (without the p number), you have the L2-norm (Euclidean norm).

When you read about an L1-norm, you're reading about the norm with p=1, defined as:

\displaystyle \|\vec{u}\|_1 = ( \left|u_1\right| + \left|u_2\right| + \left|u_3\right| + \ldots + \left|u_n\right|)

which is nothing more than a simple sum of the components of the vector, also known as the Taxicab distance, or the Manhattan distance.

Taxicab geometry versus Euclidean distance: in taxicab geometry all three depicted lines have the same length (12) for the same route. In Euclidean geometry, the green line has length 6 \times \sqrt{2} \approx 8.48, and is the unique shortest path.
Source: Wikipedia :: Taxicab Geometry

Note that you can also use any norm to normalize the vector, but we're going to use the most common norm, the L2-norm, which is also the default in the 0.9 release of scikits.learn. You can also find papers comparing the performance of the two approaches among other methods to normalize the document vector; actually, you can use any other method, but you have to be consistent: once you've used a norm, you have to use it for the whole process directly involving the norm (a unit vector that used an L1-norm isn't going to have length 1 if you're going to take its L2-norm later).
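Here is a small NumPy sketch (an added illustration, not part of the original post) computing the L1 and L2 norms of the same vector, and showing that a vector normalized with the L1-norm does not have unit length under the L2-norm, as noted above:

import numpy as np

v = np.array([0.0, 2.0, 1.0, 0.0])

l1 = np.sum(np.abs(v))        # L1-norm (Taxicab / Manhattan): 3.0
l2 = np.sqrt(np.sum(v ** 2))  # L2-norm (Euclidean): ~2.236
print l1, l2

v_l1 = v / l1                     # L1-normalized vector
print np.sqrt(np.sum(v_l1 ** 2))  # ~0.745, not 1.0 under the L2-norm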

Back to vector normalization

Now that you know what the vector normalization process is, we can try a concrete example, the process of using the L2-norm (we'll use the right terms now) to normalize our vector \vec{v_{d_4}} = (0,2,1,0) in order to get its unit vector \hat{v_{d_4}}. To do that, we'll simply plug it into the definition of the unit vector to evaluate it:

\hat{v} = \frac{\vec{v}}{\|\vec{v}\|_p} \\ \\  \hat{v_{d_4}} = \frac{\vec{v_{d_4}}}{||\vec{v_{d_4}}||_2} \\ \\ \\  \hat{v_{d_4}} = \frac{(0,2,1,0)}{\sqrt{0^2 + 2^2 + 1^2 + 0^2}} \\ \\  \hat{v_{d_4}} = \frac{(0,2,1,0)}{\sqrt{5}} \\ \\  \small \hat{v_{d_4}} = (0.0, 0.89442719, 0.4472136, 0.0)

And that is it! Our normalized vector \hat{v_{d_4}} now has an L2-norm \|\hat{v_{d_4}}\|_2 = 1.0.

Note that here we normalized our term frequency document vector, but later we're going to do that after the calculation of the tf-idf.
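The same computation in a minimal NumPy sketch (added here only as a check of the result above):

import numpy as np

v_d4 = np.array([0.0, 2.0, 1.0, 0.0])
unit_v_d4 = v_d4 / np.linalg.norm(v_d4)   # divide by the L2-norm, sqrt(5)

print unit_v_d4                  # [ 0.          0.89442719  0.4472136   0.        ]
print np.linalg.norm(unit_v_d4)  # 1.0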

The term frequency – inverse document frequency (tf-idf) weight

Now that you have understood how the vector normalization works in theory and practice, let's continue our tutorial. Suppose you have the following documents in your collection (taken from the first part of the tutorial):

Train Document Set:
d1: The sky is blue.
d2: The sun is bright.

Test Document Set:
d3: The sun in the sky is bright.
d4: We can see the shining sun, the bright sun.

Your document space can then be defined as D = \{d_1, d_2, \ldots, d_n\} where n is the number of documents in your corpus, and in our case as D_{train} = \{d_1, d_2\} and D_{test} = \{d_3, d_4\}. The cardinality of our document space is defined by \left|{D_{train}}\right| = 2 and \left|{D_{test}}\right| = 2, since we have only 2 documents for training and testing, but they obviously don't need to have the same cardinality.

Let's see now how the idf (inverse document frequency) is defined:

\displaystyle \mathrm{idf}(t) = \log{\frac{\left|D\right|}{1+\left|\{d : t \in d\}\right|}}

where \left|\{d : t \in d\}\right| is the number of documents where the term t appears; when the term-frequency function satisfies \mathrm{tf}(t,d) \neq 0, we're only adding 1 into the formula to avoid zero-division.

该formula for the tf-idf is then:

\mathrm{tf\mbox{-}idf}(t) = \mathrm{tf}(t, d) \times \mathrm{idf}(t)

and this formula has an important consequence: a high weight of the tf-idf calculation is reached when you have a high term frequency (tf) in the given document (local parameter) and a low document frequency of the term in the whole collection (global parameter).

Now let’s calculate the idf for each feature present in the feature matrix with the term frequency we have calculated in the first tutorial:

M_{train} =  \begin{bmatrix}  0 & 1 & 1 & 1\\  0 & 2 & 1 & 0  \end{bmatrix}

Since we have 4 features, we have to calculate \mathrm{idf}(t_1), \mathrm{idf}(t_2), \mathrm{idf}(t_3) and \mathrm{idf}(t_4):

\mathrm{idf}(t_1) = \log{\frac{\left|D\right|}{1+\left|\{d : t_1 \in d\}\right|}} = \log{\frac{2}{1}} = 0.69314718

\mathrm{idf}(t_2) = \log{\frac{\left|D\right|}{1+\left|\{d : t_2 \in d\}\right|}} = \log{\frac{2}{3}} = -0.40546511

\mathrm{idf}(t_3) = \log{\frac{\left|D\right|}{1+\left|\{d : t_3 \in d\}\right|}} = \log{\frac{2}{3}} = -0.40546511

\mathrm{idf}(t_4) = \log{\frac{\left|D\right|}{1+\left|\{d : t_4 \in d\}\right|}} = \log{\frac{2}{2}} = 0.0

These idf weights can be represented by a vector as:

\vec{idf_{train}} = (0.69314718, -0.40546511, -0.40546511, 0.0)
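As a quick sanity check of these hand-calculated values, here is a small Python sketch (added, not from the original post) that evaluates the idf formula directly with the natural logarithm:

import math

# document frequency of each feature in the two documents of the term-frequency
# matrix above: t1 appears in 0 of them, t2 in 2, t3 in 2 and t4 in 1
doc_freqs = [0, 2, 2, 1]
num_docs = 2

idf = [math.log(float(num_docs) / (1 + df)) for df in doc_freqs]
print idf
# [0.6931471805599453, -0.4054651081081645, -0.4054651081081645, 0.0]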

Now that we have our matrix with the term frequency (M_{train}) and the vector representing the idf for each feature of our matrix (\vec{idf_{train}}), we can calculate our tf-idf weights. What we have to do is a simple multiplication of each column of the matrix M_{train} with the respective \vec{idf_{train}} vector dimension. To do that, we can create a square diagonal matrix M_{idf} with both the vertical and horizontal dimensions equal to the vector \vec{idf_{train}} dimension:

M_{idf} =   \begin{bmatrix}   0.69314718 & 0 & 0 & 0\\   0 & -0.40546511 & 0 & 0\\   0 & 0 & -0.40546511 & 0\\   0 & 0 & 0 & 0   \end{bmatrix}

and then multiply it by the term frequency matrix, so the final result can then be defined as:

M_{tf\mbox{-}idf} = M_{train} \times M_{idf}

Please note that the matrix multiplication isn't commutative, the result of A \times B will be different than the result of B \times A, and this is why the M_{idf} is on the right side of the multiplication, to accomplish the desired effect of multiplying each idf value by its corresponding feature:

\begin{bmatrix}   \mathrm{tf}(t_1, d_1) & \mathrm{tf}(t_2, d_1) & \mathrm{tf}(t_3, d_1) & \mathrm{tf}(t_4, d_1)\\   \mathrm{tf}(t_1, d_2) & \mathrm{tf}(t_2, d_2) & \mathrm{tf}(t_3, d_2) & \mathrm{tf}(t_4, d_2)   \end{bmatrix}   \times   \begin{bmatrix}   \mathrm{idf}(t_1) & 0 & 0 & 0\\   0 & \mathrm{idf}(t_2) & 0 & 0\\   0 & 0 & \mathrm{idf}(t_3) & 0\\   0 & 0 & 0 & \mathrm{idf}(t_4)   \end{bmatrix}   \\ =   \begin{bmatrix}   \mathrm{tf}(t_1, d_1) \times \mathrm{idf}(t_1) & \mathrm{tf}(t_2, d_1) \times \mathrm{idf}(t_2) & \mathrm{tf}(t_3, d_1) \times \mathrm{idf}(t_3) & \mathrm{tf}(t_4, d_1) \times \mathrm{idf}(t_4)\\   \mathrm{tf}(t_1, d_2) \times \mathrm{idf}(t_1) & \mathrm{tf}(t_2, d_2) \times \mathrm{idf}(t_2) & \mathrm{tf}(t_3, d_2) \times \mathrm{idf}(t_3) & \mathrm{tf}(t_4, d_2) \times \mathrm{idf}(t_4)   \end{bmatrix}

Let’s see now a concrete example of this multiplication:

M_{tf\mbox{-}idf} = M_{train} \times M_{idf} = \\   \begin{bmatrix}   0 & 1 & 1 & 1\\   0 & 2 & 1 & 0   \end{bmatrix}   \times   \begin{bmatrix}   0.69314718 & 0 & 0 & 0\\   0 & -0.40546511 & 0 & 0\\   0 & 0 & -0.40546511 & 0\\   0 & 0 & 0 & 0   \end{bmatrix} \\   =   \begin{bmatrix}   0 & -0.40546511 & -0.40546511 & 0\\   0 & -0.81093022 & -0.40546511 & 0   \end{bmatrix}

Finally, we can apply our L2 normalization process to the M_{tf\mbox{-}idf} matrix. Please note that this normalization is "row-wise" because we're going to handle each row of the matrix as a separate vector to be normalized, and not the matrix as a whole:

M_{tf\mbox{-}idf} = \frac{M_{tf\mbox{-}idf}}{\|M_{tf\mbox{-}idf}\|_2} = \begin{bmatrix}   0 & -0.70710678 & -0.70710678 & 0\\   0 & -0.89442719 & -0.4472136 & 0   \end{bmatrix}

And that is our nicely normalized tf-idf weight of our testing document set, which is actually a collection of unit vectors. If you take the L2-norm of each row of the matrix, you'll see that they all have an L2-norm of 1.
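The whole calculation can be reproduced with a few lines of NumPy; this is a sketch I added to mirror the matrices above, not the code used in the original post:

import numpy as np

M_train = np.array([[0., 1., 1., 1.],
                    [0., 2., 1., 0.]])
idf = np.array([0.69314718, -0.40546511, -0.40546511, 0.0])

# multiply each column by its idf value using a diagonal matrix
M_tfidf = np.dot(M_train, np.diag(idf))

# row-wise L2 normalization: divide each row by its own L2-norm
row_norms = np.sqrt((M_tfidf ** 2).sum(axis=1)).reshape(-1, 1)
print M_tfidf / row_norms
# [[ 0.         -0.70710678 -0.70710678  0.        ]
#  [ 0.         -0.89442719 -0.4472136   0.        ]]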

Python practice

Environment used: Python v.2.7.2, Numpy 1.6.1, Scipy v.0.9.0, Sklearn (Scikits.learn) v.0.9.

Now, the section you were waiting for! In this section I'll use Python to show each step of the tf-idf calculation using the Scikit.learn feature extraction module.

The first step is to create our training and testing document sets and to compute the term frequency matrix:

from sklearn.feature_extraction.text import CountVectorizer

train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
            "We can see the shining sun, the bright sun.")

count_vectorizer = CountVectorizer()
count_vectorizer.fit_transform(train_set)
print "Vocabulary:", count_vectorizer.vocabulary
# Vocabulary: {'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3}

freq_term_matrix = count_vectorizer.transform(test_set)
print freq_term_matrix.todense()
# [[0 1 1 1]
#  [0 2 1 0]]

Now that we have the frequency term matrix (called freq_term_matrix), we can instantiate the TfidfTransformer, which is going to be responsible for calculating the tf-idf weights for our term frequency matrix:

from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(norm="l2")
tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
# IDF: [ 0.69314718 -0.40546511 -0.40546511  0.        ]

Note that I've specified the norm as L2, this is optional (actually the default is the L2-norm), but I've added the parameter to make it explicit to you that it'll use the L2-norm. Also note that you can see the calculated idf weights by accessing the internal attribute called idf_. Now that the fit() method has calculated the idf for the matrix, let's transform the freq_term_matrix to the tf-idf weight matrix:

tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
# [[ 0.         -0.70710678 -0.70710678  0.        ]
#  [ 0.         -0.89442719 -0.4472136   0.        ]]

And that is it, the tf_idf_matrix is actually our previous M_{tf\mbox{-}idf} matrix. You can accomplish the same effect by using the Vectorizer class of Scikit.learn, which is a vectorizer that automatically combines the CountVectorizer and the TfidfTransformer for you. See this example to know how to use it for the text classification process.
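For reference, in recent scikit-learn versions the class that combines both steps is called TfidfVectorizer; here is a minimal sketch of the equivalent pipeline (note that the defaults, such as stop-word handling and idf smoothing, have changed across versions, so the exact vocabulary and weights may not match the output above):

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(norm="l2")
tfidf_vectorizer.fit(train_set)                       # builds the vocabulary and idf from the train set
tf_idf_matrix = tfidf_vectorizer.transform(test_set)  # tf-idf weights for the test documents
print(tf_idf_matrix.todense())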

I really hope you liked the post, I tried to make it simple as possible even for people without the required mathematical background of linear algebra, etc. In the next Machine Learning post I’m expecting to show how you can use the tf-idf to calculate the cosine similarity.

If you liked it, feel free to comment and make suggestions, corrections, etc.

Cite this article as: Christian S. Perone, "Machine Learning :: Text feature extraction (tf-idf) – Part II," in Terra Incognita, 03/10/2011, //www.cpetem.com/2011/10/machine-learning-text-feature-extraction-tf-idf-part-ii/

References

Understanding Inverse Document Frequency: on theoretical arguments for IDF

Wikipedia :: tf-idf

The classic Vector Space Model

Sklearn text feature extraction code

Updates

13 Mar 2015 – Formatting, fixed image issues.
03 Oct 2011 – Added the information about the environment used for the Python examples

Machine Learning :: Text feature extraction (tf-idf) – Part I

Short introduction to Vector Space Model (VSM)

In information retrieval and text mining, the term frequency – inverse document frequency (also called tf-idf) is a well-known method to evaluate how important a word is in a document. tf-idf is a very interesting way to convert the textual representation of information into a Vector Space Model (VSM), or into sparse features; we'll discuss more about it later, but first, let's try to understand what tf-idf and the VSM are.

The VSM has a very confusing past; see for example the paper The most influential paper Gerard Salton never wrote, which explains the history behind the ghost-cited paper that in fact never existed. In sum, the VSM is an algebraic model representing textual information as a vector, where the components of this vector could represent the importance of a term (tf-idf) or even its absence or presence (Bag of Words) in a document; it is important to note that the classical VSM proposed by Salton incorporates local and global parameters/information (in the sense that it uses both the isolated term being analyzed as well as the entire collection of documents). The VSM, interpreted in a lato sensu, is a space where text is represented as a vector of numbers instead of its original string textual representation; the VSM represents the features extracted from the document.

Let's try to mathematically define the VSM and tf-idf together with concrete examples; for the concrete examples I'll be using Python (as well as the amazing scikits.learn Python module).

Going to the vector space

The first step in modeling the document into a vector space is to create a dictionary of the terms present in the documents. To do that, you can simply select all terms from the document and convert each one into a dimension in the vector space, but we know that there are some kinds of words (stop words) that are present in almost all documents, and what we're doing is extracting important features from documents, features that do identify them among other similar documents, so using terms like "the, is, at, on", etc. isn't going to help us; therefore, in the information extraction, we'll just ignore them.

Let's take the documents below to define our (stupid) document space:

Train Document Set:
d1: The sky is blue.
d2: The sun is bright.

Test Document Set:
d3: The sun in the sky is bright.
d4: We can see the shining sun, the bright sun.

Now, what we have to do is to create an index vocabulary (dictionary) of the words of the train document set; using the documents d1 and d2 from the document set, we'll have the following index vocabulary, denoted as \mathrm{E}(t), where t is the term:

\mathrm{E}(t) =   \begin{cases}   1, & \mbox{if } t\mbox{ is ``blue''} \\   2, & \mbox{if } t\mbox{ is ``sun''} \\   3, & \mbox{if } t\mbox{ is ``bright''} \\   4, & \mbox{if } t\mbox{ is ``sky''} \\   \end{cases}

Note that terms like "is" and "the" were ignored as cited before. Now that we have an index vocabulary, we can convert the test document set into a vector space where each term of the vector is indexed as in our index vocabulary, so the first term of the vector represents the "blue" term of our vocabulary, the second represents "sun" and so on. Now, we're going to use the term-frequency to represent each term in our vector space; the term-frequency is nothing more than a measure of how many times the terms present in our vocabulary \mathrm{E}(t) are present in the documents d3 or d4. We define the term-frequency as a counting function:

\mathrm{tf}(t,d) = \sum\limits_{x\in d} \mathrm{fr}(x, t)

where \mathrm{fr}(x, t) is a simple function defined as:

\mathrm{fr}(x,t) =   \begin{cases}   1, & \mbox{if } x = t \\   0, & \mbox{otherwise} \\   \end{cases}

What the tf(t,d) returns is how many times the term t is present in the document d. An example of this is tf(``sun'', d4) = 2, since we have only two occurrences of the term "sun" in the document d4. Now that you understood how the term-frequency works, we can go on to the creation of the document vector, which is represented by:
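A direct translation of these two functions into Python (a small illustrative sketch with a naive tokenizer, added by me, not code from the original post) could look like this:

def fr(x, t):
    # simple matching function: 1 if the word x is the term t, 0 otherwise
    return 1 if x == t else 0

def tf(t, d):
    # term frequency: how many times the term t appears in the document d
    words = d.lower().replace(",", "").replace(".", "").split()
    return sum(fr(x, t) for x in words)

d4 = "We can see the shining sun, the bright sun."
print tf("sun", d4)  # 2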

\displaystyle \vec{v_{d_n}} =(\mathrm{tf}(t_1,d_n), \mathrm{tf}(t_2,d_n), \mathrm{tf}(t_3,d_n), \ldots, \mathrm{tf}(t_n,d_n))

Each dimension of the document vector is represented by a term of the vocabulary; for example, the \mathrm{tf}(t_1, d_2) represents the frequency of term 1 or t_1 (which is our "blue" vocabulary term) in the document d_2.

Let's now show a concrete example of how the documents d_3 and d_4 are represented as vectors:

\vec{v_{d_3}} = (\mathrm{tf}(t_1,d_3), \mathrm{tf}(t_2,d_3), \mathrm{tf}(t_3,d_3), \ldots, \mathrm{tf}(t_n,d_3)) \\   \vec{v_{d_4}} = (\mathrm{tf}(t_1,d_4), \mathrm{tf}(t_2,d_4), \mathrm{tf}(t_3,d_4), \ldots, \mathrm{tf}(t_n,d_4))

which evaluates to:

\vec{v_{d_3}} = (0, 1, 1, 1) \\   \vec{v_{d_4}} = (0, 2, 1, 0)

As you can see, since the documents d_3 and d_4 are:

d3: The sun in the sky is bright.
d4: We can see the shining sun, the bright sun.

The resulting vector \vec{v_{d_3}} shows that we have, in order, 0 occurrences of the term "blue", 1 occurrence of the term "sun", and so on. In the \vec{v_{d_4}}, we have 0 occurrences of the term "blue", 2 occurrences of the term "sun", etc.

But wait, since we have a collection of documents, now represented by vectors, we can represent them as a matrix with |D| \times F shape, where |D| is the cardinality of the document space (how many documents we have) and F is the number of features, in our case represented by the vocabulary size. An example of the matrix representation of the vectors described above is:

M_{|D| \times F} =   \begin{bmatrix}   0 & 1 & 1 & 1\\   0 & 2 & 1 & 0   \end{bmatrix}

As you may have noted, these matrices representing the term frequencies tend to be very sparse (with the majority of terms zeroed), and that's why you'll commonly see these matrices represented as sparse matrices.

Python practice

Environment used: Python v.2.7.2, Numpy 1.6.1, Scipy v.0.9.0, Sklearn (Scikits.learn) v.0.9.

Since we know the theory behind the term frequency and the vector space conversion, let's show how easy it is to do that using the amazing scikit.learn Python module.

Scikit.learn comes with lots of examples as well as real-life interesting datasets you can use, and also some helper functions to download 18k newsgroups posts, for instance.

Since we already defined our small train/test dataset before, let’s use them to define the dataset in a way that scikit.learn can use:

train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
            "We can see the shining sun, the bright sun.")

In scikit.learn, what we have presented as the term-frequency is called CountVectorizer, so we need to import it and create a new instance:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

CountVectorizer已经使用默认的“分析”之称WordNGramAnalyzer,which is responsible to convert the text to lowercase, accents removal, token extraction, filter stop words, etc… you can see more information by printing the class information:

print vectorizer
CountVectorizer(analyzer__min_n=1,
                analyzer__stop_words=set(['all', 'six', 'less', 'being', 'indeed', 'over',
                                          'move', 'anyway', 'four', 'not', 'own', 'through',
                                          'yourselves', (...)

Let’s create now the vocabulary index:

vectorizer.fit_transform(train_set)
print vectorizer.vocabulary
{'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3}

See that the vocabulary created is the same as E(t) (except that it is zero-indexed).

Let's use the same vectorizer now to create the sparse matrix of our test_set documents:

smatrix = vectorizer.transform(test_set)
print smatrix
(0, 1)  1
(0, 2)  1
(0, 3)  1
(1, 1)  2
(1, 2)  1

Note that the sparse matrix created, called smatrix, is a Scipy sparse matrix with elements stored in a Coordinate format. But you can convert it into a dense format:

smatrix.todense()
matrix([[0, 1, 1, 1],
        [0, 2, 1, 0]], dtype=int64)

Note that the sparse matrix created is the same matrix M_{|D| \times F} we cited earlier in this post, which represents the two document vectors \vec{v_{d_3}} and \vec{v_{d_4}}.

We'll see in the next post how we define the idf (inverse document frequency) instead of the simple term-frequency, how the logarithmic scale is used to adjust the term frequency measurement according to its importance, and how we can use it to discriminate between documents using some well-known machine learning approaches.

I hope you liked this post, and if you really liked it, leave a comment so I'll be able to know whether there are enough people interested in this series of posts on Machine Learning topics.

As promised, here is the second part of this tutorial series.

Cite this article as: Christian S. Perone, "Machine Learning :: Text feature extraction (tf-idf) – Part I," in Terra Incognita, 18/09/2011, //www.cpetem.com/2011/09/machine-learning-text-feature-extraction-tf-idf-part-i/

References

The classic Vector Space Model

The most influential paper Gerard Salton never wrote

Wikipedia: tf-idf

Wikipedia: Vector space model

Scikits.learn Examples

Updates

21 Sep 11 – fixed some typos and the vector notation
22 Sep 11 – fixed the import of sklearn according to the new 0.9 release and added the environment section
02 Oct 11 – fixed Latex math typos
18 Oct 11 – added link to the second part of the tutorial series
04 Mar 11 – fixed formatting issues