Machine Learning :: Text feature extraction (tf-idf) – Part II

Read the first part of this tutorial: Text feature extraction (tf-idf) – Part I

This post is a continuation of the first part, where we started to learn the theory and practice of text feature extraction and vector space model representation. I really recommend you read the first part of the series in order to follow this second one.

Since a lot of people liked the first part of this tutorial, this second part is a little longer than the first.

Introduction

In the first post, we learned how to use the term frequency to represent textual information in the vector space. However, the main problem with the term-frequency approach is that it scales up frequent terms and scales down rare terms which are empirically more informative than the high-frequency terms. The basic intuition is that a term which occurs frequently in many documents is not a good discriminator, and this really makes sense (at least in many experimental tests); the important question here is: why would you, in a classification problem for instance, emphasize a term which is present in almost the entire corpus of your documents?

The tf-idf weight comes to solve this problem. What tf-idf gives is a measure of how important a word is to a document in a collection, and that's why tf-idf incorporates local and global parameters: it takes into consideration not only the isolated term but also the term within the document collection. What tf-idf then does to solve that problem is to scale down the frequent terms while scaling up the rare terms; a term that occurs 10 times more than another isn't 10 times more important than it, which is why tf-idf uses a logarithmic scale to do that.

But let's go back to our definition of \mathrm{tf}(t, d), which is actually the term count of the term t in the document d. The use of this simple term frequency could lead us to problems like keyword spamming, which is when we have a repeated term in a document with the purpose of improving its ranking on an IR (Information Retrieval) system, or even create a bias towards long documents, making them look more important than they are just because of the high frequency of the term in the document.

To overcome this problem, the term frequency \mathrm{tf}(t, d) of a document in a vector space is usually also normalized. Let's see how we normalize this vector.

Vector normalization

Suppose we want to normalize the term-frequency vector \vec{v_{d_4}} that we have calculated in the first part of this tutorial. The document d4 from the first part of this tutorial had this textual representation:

d4: We can see the shining sun, the bright sun.

And the non-normalized term-frequency vector space representation of this document is:

\vec{v_{d_4}} = (0, 2, 1, 0)

To normalize the vector is the same as calculating the unit vector of the vector, and unit vectors are denoted using the "hat" notation: \hat{v}. The definition of the unit vector \hat{v} of a vector \vec{v} is:

\displaystyle \hat{v} = \frac{\vec{v}}{\|\vec{v}\|_p}

Where \hat{v} is the unit vector, or the normalized vector, \vec{v} is the vector going to be normalized, and \|\vec{v}\|_p is the norm (magnitude, length) of the vector \vec{v} in the L^p space (don't worry, I'm going to explain it all).

The unit vector is actually nothing more than a normalized version of the vector: it is a vector whose length is 1.
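As a quick worked example (not in the original text): take the vector (3, 4); dividing it by its length gives a unit vector:

\hat{v} = \frac{(3, 4)}{\sqrt{3^2 + 4^2}} = \frac{(3, 4)}{5} = (0.6, 0.8), \qquad \sqrt{0.6^2 + 0.8^2} = 1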

The normalization process (Source: http://processing.org/learning/pvector/)

But the important question here is how the length of the vector is calculated, and to understand this, you must understand the motivation of the L^p spaces, also called Lebesgue spaces.

Lebesgue spaces

How long is this vector? (Source: http://processing.org/learning/pvector/)

Usually, the length of a vector \vec{u} = (u_1, u_2, u_3, \ldots, u_n) is calculated using the Euclidean norm (a norm is a function that assigns a strictly positive length or size to all vectors in a vector space), which is defined by:


\|\vec{u}\| = \sqrt{u^2_1 + u^2_2 + u^2_3 + \ldots + u^2_n}

But this isn't the only way to define length, and that's why you sometimes see a number p together with the norm notation, like in \|\vec{u}\|_p. This is because it can be generalized as:

\displaystyle \|\vec{u}\|_p = (\left|u_1\right|^p + \left|u_2\right|^p + \left|u_3\right|^p + \ldots + \left|u_n\right|^p)^\frac{1}{p}

which simplifies to:

\displaystyle \|\vec{u}\|_p = (\sum\limits_{i=1}^{n}\left|\vec{u}_i\right|^p)^\frac{1}{p}

So when you read about an L2-norm, you're reading about the Euclidean norm, the norm with p = 2, the most common norm used to measure the length of a vector, typically called "magnitude"; actually, when you have an unqualified length measure (without the p number), you have the L2-norm (Euclidean norm).

When you read about an L1-norm, you're reading about the norm with p = 1, defined as:

\displaystyle \|\vec{u}\|_1 = ( \left|u_1\right| + \left|u_2\right| + \left|u_3\right| + \ldots + \left|u_n\right|)

which is nothing more than a simple sum of the components of the vector, also known as the Taxicab distance, or Manhattan distance.
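If you want to play with these definitions, here is a minimal numpy sketch (the vector used is just an arbitrary example, not part of the original post) that evaluates the generalized formula above for a few values of p and checks it against numpy's built-in np.linalg.norm:

import numpy as np

u = np.array([0.0, 2.0, 1.0, 0.0])

for p in (1, 2, 3):
    manual = (np.abs(u) ** p).sum() ** (1.0 / p)   # (|u_1|^p + ... + |u_n|^p)^(1/p)
    builtin = np.linalg.norm(u, ord=p)
    print(p, manual, builtin)
# p=1 -> 3.0 (taxicab), p=2 -> 2.2360679... (Euclidean, sqrt(5)), p=3 -> 2.0800838...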

Taxicab geometry versus Euclidean distance: in taxicab geometry all three depicted lines have the same length (12) for the same path. In Euclidean geometry, the green line has length 6 \times \sqrt{2} \approx 8.48 and is the unique shortest path.
Source: Wikipedia :: Taxicab Geometry

Note that you can also use any norm to normalize the vector, but we're going to use the most common norm, the L2-norm, which is also the default in the 0.9 release of scikits.learn. You can also find papers comparing the performance of the two approaches among other methods to normalize the document vector; actually, you can use any other method, but you have to be consistent: once you've used a norm, you have to use it for the whole process directly involving the norm (a unit vector that was built with the L1-norm isn't going to have length 1 if you later take its L2-norm).

Back to vector normalization

Now that you know what the vector normalization process is, we can try a concrete example, the process of using the L2-norm (we'll now use the correct terminology) to normalize our vector \vec{v_{d_4}} = (0, 2, 1, 0) in order to get its unit vector \hat{v_{d_4}}. To do that, we'll simply plug it into the definition of the unit vector to evaluate it:

\hat{v} = \frac{\vec{v}}{\|\vec{v}\|_p} \\ \\  \hat{v_{d_4}} = \frac{\vec{v_{d_4}}}{||\vec{v_{d_4}}||_2} \\ \\ \\  \hat{v_{d_4}} = \frac{(0,2,1,0)}{\sqrt{0^2 + 2^2 + 1^2 + 0^2}} \\ \\  \hat{v_{d_4}} = \frac{(0,2,1,0)}{\sqrt{5}} \\ \\  \small \hat{v_{d_4}} = (0.0, 0.89442719, 0.4472136, 0.0)

And that's it! Our normalized vector \hat{v_{d_4}} now has an L2-norm \|\hat{v_{d_4}}\|_2 = 1.0.
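A minimal numpy sketch (variable names are just illustrative, not from the original post) that reproduces this normalization and confirms the unit length:

import numpy as np

v_d4 = np.array([0.0, 2.0, 1.0, 0.0])
v_d4_hat = v_d4 / np.linalg.norm(v_d4, ord=2)   # divide by the L2 norm, sqrt(5)

print(v_d4_hat)                         # [ 0.          0.89442719  0.4472136   0.        ]
print(np.linalg.norm(v_d4_hat, ord=2))  # 1.0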

Note that here we normalized our term-frequency document vector, but later we're going to do that after the calculation of the tf-idf.

The term frequency – inverse document frequency (tf-idf) weight

Now that you have understood how vector normalization works in theory and practice, let's continue our tutorial. Suppose you have the following documents in your collection (taken from the first part of the tutorial):

Train Document Set:
d1: The sky is blue.
d2: The sun is bright.

Test Document Set:
d3: The sun in the sky is bright.
d4: We can see the shining sun, the bright sun.

Your document space can then be defined as D = \{d_1, d_2, \ldots, d_n\}, where n is the number of documents in your corpus, and in our case as D_{train} = \{d_1, d_2\} and D_{test} = \{d_3, d_4\}. The cardinality of our document spaces is defined by \left|{D_{train}}\right| = 2 and \left|{D_{test}}\right| = 2, since we have only 2 documents for training and testing, but they obviously don't need to have the same cardinality.

Let's see now how the idf (inverse document frequency) is defined:

\displaystyle \mathrm{idf}(t) = \log{\frac{\left|D\right|}{1+\left|\{d : t \in d\}\right|}}

where \left|\{d : t \in d\}\right| is the number of documents where the term t appears; when the term-frequency function satisfies \mathrm{tf}(t, d) \neq 0, we're only adding 1 into the formula to avoid zero-division.

The formula for the tf-idf is then:

\mathrm{tf\mbox{-}idf}(t) = \mathrm{tf}(t, d) \times \mathrm{idf}(t)

and this formula has an important consequence: a high weight of the tf-idf calculation is reached when you have a high term frequency (tf) in the given document (local parameter) and a low document frequency of the term in the whole collection (global parameter).

Now let's calculate the idf for each feature present in the feature matrix with the term frequencies we have calculated in the first tutorial:

M_{train} =  \begin{bmatrix}  0 & 1 & 1 & 1\\  0 & 2 & 1 & 0  \end{bmatrix}

Since we have 4 features, we have to calculate \mathrm{idf}(t_1), \mathrm{idf}(t_2), \mathrm{idf}(t_3) and \mathrm{idf}(t_4):

\mathrm{idf}(t_1) = \log{\frac{\left|D\right|}{1+\left|\{d : t_1 \in d\}\right|}} = \log{\frac{2}{1}} = 0.69314718

\mathrm{idf}(t_2) = \log{\frac{\left|D\right|}{1+\left|\{d : t_2 \in d\}\right|}} = \log{\frac{2}{3}} = -0.40546511

\mathrm{idf}(t_3) = \log{\frac{\left|D\right|}{1+\left|\{d : t_3 \in d\}\right|}} = \log{\frac{2}{3}} = -0.40546511

\mathrm{idf}(t_4) = \log{\frac{\left|D\right|}{1+\left|\{d : t_4 \in d\}\right|}} = \log{\frac{2}{2}} = 0.0

These idf weights can be represented by a vector as:

\vec{idf_{train}} = (0.69314718, -0.40546511, -0.40546511, 0.0)
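To make the idf definition concrete, here is a minimal pure-Python sketch (the function name and the tokenized documents are illustrative, not part of the original post) that reproduces these four values; as in the calculation above, the document counts are taken over the two documents whose term frequencies appear in M_{train}:

import math

def idf(term, documents):
    # number of documents in which the term appears
    n_containing = sum(1 for d in documents if term in d)
    # the +1 in the denominator avoids zero-division for unseen terms
    return math.log(float(len(documents)) / (1 + n_containing))

docs = [["the", "sun", "in", "the", "sky", "is", "bright"],
        ["we", "can", "see", "the", "shining", "sun", "the", "bright", "sun"]]

print(idf("blue", docs))    # log(2/1) =  0.69314718  (t_1)
print(idf("sun", docs))     # log(2/3) = -0.40546511  (t_2)
print(idf("bright", docs))  # log(2/3) = -0.40546511  (t_3)
print(idf("sky", docs))     # log(2/2) =  0.0         (t_4)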

Now that we have our matrix with the term frequencies (M_{train}) and the vector representing the idf for each feature of our matrix (\vec{idf_{train}}), we can calculate our tf-idf weights. What we have to do is a simple multiplication of each column of the matrix M_{train} with the respective \vec{idf_{train}} vector dimension. To do that, we can create a square diagonal matrix called M_{idf} with both the vertical and horizontal dimensions equal to the vector \vec{idf_{train}} dimension:

M_{idf} =   \begin{bmatrix}   0.69314718 & 0 & 0 & 0\\   0 & -0.40546511 & 0 & 0\\   0 & 0 & -0.40546511 & 0\\   0 & 0 & 0 & 0   \end{bmatrix}

and then multiply it by the term frequency matrix, so the final result can then be defined as:

M_{tf\mbox{-}idf} = M_{train} \times M_{idf}

Please note that matrix multiplication isn't commutative: the result of A \times B will be different from the result of B \times A, and that's why M_{idf} is on the right side of the multiplication, to accomplish the desired effect of multiplying each idf value by its corresponding feature:

\begin{bmatrix}   \mathrm{tf}(t_1, d_1) & \mathrm{tf}(t_2, d_1) & \mathrm{tf}(t_3, d_1) & \mathrm{tf}(t_4, d_1)\\   \mathrm{tf}(t_1, d_2) & \mathrm{tf}(t_2, d_2) & \mathrm{tf}(t_3, d_2) & \mathrm{tf}(t_4, d_2)   \end{bmatrix}   \times   \begin{bmatrix}   \mathrm{idf}(t_1) & 0 & 0 & 0\\   0 & \mathrm{idf}(t_2) & 0 & 0\\   0 & 0 & \mathrm{idf}(t_3) & 0\\   0 & 0 & 0 & \mathrm{idf}(t_4)   \end{bmatrix}   \\ =   \begin{bmatrix}   \mathrm{tf}(t_1, d_1) \times \mathrm{idf}(t_1) & \mathrm{tf}(t_2, d_1) \times \mathrm{idf}(t_2) & \mathrm{tf}(t_3, d_1) \times \mathrm{idf}(t_3) & \mathrm{tf}(t_4, d_1) \times \mathrm{idf}(t_4)\\   \mathrm{tf}(t_1, d_2) \times \mathrm{idf}(t_1) & \mathrm{tf}(t_2, d_2) \times \mathrm{idf}(t_2) & \mathrm{tf}(t_3, d_2) \times \mathrm{idf}(t_3) & \mathrm{tf}(t_4, d_2) \times \mathrm{idf}(t_4)   \end{bmatrix}

Let’s see now a concrete example of this multiplication:

M_{tf\mbox{-}idf} = M_{train} \times M_{idf} = \\   \begin{bmatrix}   0 & 1 & 1 & 1\\   0 & 2 & 1 & 0   \end{bmatrix}   \times   \begin{bmatrix}   0.69314718 & 0 & 0 & 0\\   0 & -0.40546511 & 0 & 0\\   0 & 0 & -0.40546511 & 0\\   0 & 0 & 0 & 0   \end{bmatrix} \\   =   \begin{bmatrix}   0 & -0.40546511 & -0.40546511 & 0\\   0 & -0.81093022 & -0.40546511 & 0   \end{bmatrix}

And finally, we can apply our L2 normalization process to the M_{tf\mbox{-}idf} matrix. Please note that this normalization is "row-wise", because we're going to handle each row of the matrix as a separate vector to be normalized, and not the matrix as a whole:

M_{tf\mbox{-}idf} = \frac{M_{tf\mbox{-}idf}}{\|M_{tf\mbox{-}idf}\|_2} = \begin{bmatrix}   0 & -0.70710678 & -0.70710678 & 0\\   0 & -0.89442719 & -0.4472136 & 0   \end{bmatrix}

And that is our pretty normalized tf-idf weight of our testing document set, which is actually a collection of unit vectors. If you take the L2-norm of each row of the matrix, you'll see that they all have an L2-norm of 1.
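For readers who want to check these numbers, here is a minimal numpy sketch (not the scikit-learn implementation, and the variable names are just illustrative) that reproduces the multiplication by the diagonal M_{idf} and the row-wise L2 normalization described above:

import numpy as np

M_train = np.array([[0., 1., 1., 1.],
                    [0., 2., 1., 0.]])
idf_train = np.array([0.69314718, -0.40546511, -0.40546511, 0.0])

# np.diag builds the square diagonal M_idf; the dot product multiplies each
# column of M_train by its corresponding idf value
M_tfidf = M_train.dot(np.diag(idf_train))

# row-wise L2 normalization: each row is treated as a separate vector
row_norms = np.sqrt((M_tfidf ** 2).sum(axis=1))[:, np.newaxis]
M_tfidf_normalized = M_tfidf / row_norms

print(M_tfidf_normalized)
# [[ 0.         -0.70710678 -0.70710678  0.        ]
#  [ 0.         -0.89442719 -0.4472136   0.        ]]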

Python practice

Environment used: Python v.2.7.2, Numpy 1.6.1, Scipy v.0.9.0, Sklearn (Scikits.learn) v.0.9

Now the section you were waiting for! In this section I'll use Python to show each step of the tf-idf calculation using the Scikit.learn feature extraction module.

The first step is to create our training and testing document sets and compute the term frequency matrix:

from sklearn.feature_extraction.text import CountVectorizer

train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
            "We can see the shining sun, the bright sun.")

count_vectorizer = CountVectorizer()
count_vectorizer.fit_transform(train_set)
print "Vocabulary:", count_vectorizer.vocabulary
# Vocabulary: {'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3}

freq_term_matrix = count_vectorizer.transform(test_set)
print freq_term_matrix.todense()
# [[0 1 1 1]
#  [0 2 1 0]]

Now that we have the frequency term matrix (called freq_term_matrix), we can instantiate the TfidfTransformer, which is going to be responsible for calculating the tf-idf weights for our term frequency matrix:

from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(norm="l2")
tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
# IDF: [ 0.69314718 -0.40546511 -0.40546511  0.        ]

Note that I've specified the norm as L2; this is optional (actually the default is the L2-norm), but I've added the parameter to make it explicit that it will use the L2-norm. Also note that you can see the calculated idf weights by accessing the internal attribute called idf_. Now that the fit() method has calculated the idf for the matrix, let's transform the freq_term_matrix to the tf-idf weight matrix:

tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
# [[ 0.         -0.70710678 -0.70710678  0.        ]
#  [ 0.         -0.89442719 -0.4472136   0.        ]]

And that's it: the tf_idf_matrix is actually our former M_{tf\mbox{-}idf} matrix. You can accomplish the same effect by using the Vectorizer class of Scikit.learn, which is a vectorizer that automatically combines the CountVectorizer and the TfidfTransformer for you. See this example to learn how to use it for the text classification process.
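As a side note, in recent scikit-learn releases the combined class is called TfidfVectorizer (the Vectorizer class mentioned above belongs to the old scikits.learn 0.9 API); a minimal sketch of the equivalent one-step pipeline is shown below. Keep in mind that modern releases use a slightly different idf formula and different defaults (e.g. smoothing), so the exact numbers will differ from the ones computed in this post.

from sklearn.feature_extraction.text import TfidfVectorizer

train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
            "We can see the shining sun, the bright sun.")

# learns the vocabulary and the idf from the train set, then builds the
# L2-normalized tf-idf matrix for the test documents in a single object
tfidf_vectorizer = TfidfVectorizer(norm="l2")
tfidf_vectorizer.fit(train_set)
tf_idf_matrix = tfidf_vectorizer.transform(test_set)
print(tf_idf_matrix.todense())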

I really hope you liked the post; I tried to make it as simple as possible, even for people without the required mathematical background in linear algebra, etc. In the next Machine Learning post I'm expecting to show how you can use the tf-idf to calculate the cosine similarity.

If you liked it, feel free to leave comments, suggestions, corrections, etc.

Cite this article as: Christian S. Perone, "Machine Learning :: Text feature extraction (tf-idf) – Part II," in Terra Incognita, 03/10/2011, //www.cpetem.com/2011/10/machine-learning-text-feature-extraction-tf-idf-part-ii/

References

Understanding Inverse Document Frequency: on theoretical arguments for IDF

Wikipedia :: tf-idf

The Classic Vector Space Model

Sklearn text feature extraction code

Updates

13 Mar 2015 – Formatting, fixed image issues.
03 Oct 2011 – Added the information about the environment used for the Python examples

Machine Learning :: Text feature extraction (tf-idf) – Part I

A short introduction to the Vector Space Model (VSM)

In information retrieval and text mining, the term frequency – inverse document frequency (also called tf-idf) is a well-known method to evaluate how important a word is in a document. tf-idf is also a very interesting way to convert the textual representation of information into a Vector Space Model (VSM), or into sparse features; we'll discuss more about that later, but first, let's try to understand what tf-idf and the VSM are.

The VSM has a very confusing past; see for example the paper The most influential paper Gerard Salton Never Wrote, which explains the history behind the ghost-cited paper that in fact never existed; in sum, the VSM is an algebraic model representing textual information as a vector, where the components of this vector could represent the importance of a term (tf-idf) or even the absence or presence (Bag of Words) of it in a document; it is important to note that the classical VSM proposed by Salton incorporates local and global parameters/information (in the sense that it uses both the isolated term being analyzed as well as the entire collection of documents). The VSM, interpreted in a broad sense, is a space where text is represented as a vector of numbers instead of its original string representation; the VSM represents the features extracted from the document.

Let's try to mathematically define the VSM and tf-idf together with concrete examples; for the concrete examples I'll be using Python (as well as the amazing scikits.learn Python module).

Going to the vector space

The first step in modeling the document into a vector space is to create a dictionary of terms present in the documents. To do that, you can simply select all terms from the document and convert them to dimensions in the vector space, but we know that there are some kinds of words (stop words) that are present in almost all documents, and what we're doing is extracting important features from documents, features that do identify them among other similar documents, so using terms like "the, is, at, on", etc. isn't going to help us; so in the information extraction, we'll just ignore them.

Let's take the documents below to define our (stupid) document space:

Train Document Set:
d1: The sky is blue.
d2: The sun is bright.

Test Document Set:
d3: The sun in the sky is bright.
d4: We can see the shining sun, the bright sun.

Now, what we have to do is create an index vocabulary (dictionary) of the words of the train document set; using the documents d1 and d2 from the document set, we'll have the following index vocabulary denoted as \mathrm{E}(t), where t is the term:

\mathrm{E}(t) =   \begin{cases}   1, & \mbox{if } t\mbox{ is ``blue''} \\   2, & \mbox{if } t\mbox{ is ``sun''} \\   3, & \mbox{if } t\mbox{ is ``bright''} \\   4, & \mbox{if } t\mbox{ is ``sky''} \\   \end{cases}

Note that terms like "is" and "the" were ignored, as cited before. Now that we have an index vocabulary, we can convert the test document set into a vector space where each term of the vector is indexed as in our index vocabulary, so the first term of the vector represents the "blue" term of our vocabulary, the second represents "sun", and so on. Now, we're going to use the term frequency to represent each term in our vector space; the term frequency is nothing more than a measure of how many times the terms present in our vocabulary \mathrm{E}(t) are present in the documents d3 or d4; we define the term frequency as a counting function:

\mathrm{tf}(t, d) = \sum\limits_{x \in d} \mathrm{fr}(x, t)

where \mathrm{fr}(x, t) is a simple function defined as:

\mathrm{fr}(x,t) =   \begin{cases}   1, & \mbox{if } x = t \\   0, & \mbox{otherwise} \\   \end{cases}

So, what \mathrm{tf}(t, d) returns is how many times the term t is present in the document d. An example of this could be \mathrm{tf}(``sun'', d_4) = 2, since we have only two occurrences of the term "sun" in the document d4. Now that you understand how the term frequency works, we can go on to the creation of the document vector, which is represented by:

\displaystyle \vec{v_{d_n}} = (\mathrm{tf}(t_1, d_n), \mathrm{tf}(t_2, d_n), \mathrm{tf}(t_3, d_n), \ldots, \mathrm{tf}(t_n, d_n))

Each dimension of the document vector is represented by a term of the vocabulary; for example, \mathrm{tf}(t_1, d_2) represents the frequency of the term 1, or t_1 (which is our "blue" term of the vocabulary), in the document d_2.

Let's now show a concrete example of how the documents d_3 and d_4 are represented as vectors:

\vec{v_{d_3}} = (\mathrm{tf}(t_1,d_3), \mathrm{tf}(t_2,d_3), \mathrm{tf}(t_3,d_3), \ldots, \mathrm{tf}(t_n,d_3)) \\   \vec{v_{d_4}} = (\mathrm{tf}(t_1,d_4), \mathrm{tf}(t_2,d_4), \mathrm{tf}(t_3,d_4), \ldots, \mathrm{tf}(t_n,d_4))

which evaluates to:

\vec{v_{d_3}} = (0, 1, 1, 1) \\   \vec{v_{d_4}} = (0, 2, 1, 0)

As you can see, since the documents d_3 and d_4 are:

d3: The sun in the sky is bright.
d4: We can see the shining sun, the bright sun.

The resulting vector \vec{v_{d_3}} shows that we have, in order, 0 occurrences of the term "blue", 1 occurrence of the term "sun", and so on. In \vec{v_{d_4}}, we have 0 occurrences of the term "blue", 2 occurrences of the term "sun", etc.
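A minimal pure-Python sketch (function and variable names are illustrative, not part of the original post) of the counting functions \mathrm{fr}(x, t) and \mathrm{tf}(t, d) defined above, reproducing the two vectors:

def fr(x, t):
    # 1 when the token x equals the term t, 0 otherwise
    return 1 if x == t else 0

def tf(t, d):
    # term frequency: number of occurrences of the term t in the document d (a token list)
    return sum(fr(x, t) for x in d)

vocabulary = ["blue", "sun", "bright", "sky"]
d3 = ["the", "sun", "in", "the", "sky", "is", "bright"]
d4 = ["we", "can", "see", "the", "shining", "sun", "the", "bright", "sun"]

print([tf(t, d3) for t in vocabulary])   # [0, 1, 1, 1]
print([tf(t, d4) for t in vocabulary])   # [0, 2, 1, 0]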

But wait, since we have a collection of documents, now represented by vectors, we can represent them as a matrix with |D| \times F shape, where |D| is the cardinality of the document space (how many documents we have) and F is the number of features, in our case represented by the vocabulary size. An example of the matrix representation of the vectors described above is:

M_{|D| \times F} =   \begin{bmatrix}   0 & 1 & 1 & 1\\   0 & 2 & 1 & 0   \end{bmatrix}

As you may have noticed, these matrices representing the term frequencies tend to be very sparse (with the majority of terms zeroed), and that's why you'll see a common representation of these matrices as sparse matrices.

Python practice

Environment used: Python v.2.7.2, Numpy 1.6.1, Scipy v.0.9.0, Sklearn (Scikits.learn) v.0.9

Since we know the theory behind the term frequency and the vector space conversion, let's show how easy it is to do that using the amazing scikit.learn Python module.

Scikit.learn comes with lots of examples as well as real-life interesting datasets you can use, and also some helper functions to download, for example, 18k newsgroup posts.

Since we have already defined our small train/test dataset before, let's use them to define the dataset in a way that scikit.learn can use:

train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
            "We can see the shining sun, the bright sun.")

In scikit.learn, what we have presented as the term frequency is called CountVectorizer, so we need to import it and create a new instance:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

The CountVectorizer already uses as default an "analyzer" called WordNGramAnalyzer, which is responsible for converting the text to lowercase, removing accents, extracting tokens, filtering stop words, etc. You can see more information by printing the class information:

print vectorizer

CountVectorizer(analyzer__min_n=1,
    analyzer__stop_words=set(['all', 'six', 'less', 'being', 'indeed', 'over',
    'move', 'anyway', 'four', 'not', 'own', 'through', 'yourselves', (...)

Let's create now the vocabulary index:

vectorizer.fit_transform(train_set)
print vectorizer.vocabulary
{'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3}

See that the vocabulary created is the same as E(t) (except that it is zero-indexed).

Let's use the same vectorizer now to create the sparse matrix of our test_set documents:

smatrix = vectorizer.transform(test_set)
print smatrix
(0, 1)    1
(0, 2)    1
(0, 3)    1
(1, 1)    2
(1, 2)    1

Note that the sparse matrix created, called smatrix, is a SciPy sparse matrix with elements stored in coordinate format. But you can convert it into a dense format:

smatrix.todense()
matrix([[0, 1, 1, 1],
        [0, 2, 1, 0]], dtype=int64)

Note that the sparse matrix created is the same matrix M_{|D| \times F} that we cited earlier in this post, and it represents the two vectors \vec{v_{d_3}} and \vec{v_{d_4}}.

We'll see in the next post how we define the idf (inverse document frequency) instead of the simple term frequency, how the logarithmic scale is used to adjust the term-frequency measurement according to its importance, and how we can use it to classify documents using some of the well-known machine learning approaches.

I hope you liked this post, and if you really liked, leave a comment so I’ll able to know if there are enough people interested in these series of posts in Machine Learning topics.

As promised, here is the second part of this tutorial series.

Cite this article as: Christian S. Perone, "Machine Learning :: Text feature extraction (tf-idf) – Part I," in Terra Incognita, 18/09/2011, //www.cpetem.com/2011/09/machine-learning-text-feature-extraction-tf-idf-part-i/

References

The Classic Vector Space Model

The most influential paper Gerard Salton never wrote

Wikipedia: tf-idf

Wikipedia: Vector space model

Scikits.learn Examples

Updates

21 Sep 11 – fixed some typos and the vector notation
22 Sep 11 – fixed the import of sklearn according to the new 0.9 release and added the environment section
02 Oct 11 – fixed LaTeX math typos
18 Oct 11 – added link to the second part of the tutorial series
04 Mar 11 – fixed formatting issues