## Machine Learning :: Text feature extraction (tf-idf) – Part II

Since a lot of people liked the first part of this tutorial, this second part is a little longer than the first.

### Vector normalization

Suppose we want to normalize the term-frequency vector of the document $d_4$, which we computed in the first part of this tutorial:

d4: We can see the shining sun, the bright sun.

$\vec{v_{d_4}} = (0,2,1,0)$

$\displaystyle \hat{v} = \frac{\vec{v}}{\|\vec{v}\|_p}$

where $\hat{v}$ is the unit vector, or normalized vector, $\vec{v}$ is the vector going to be normalized, and $\|\vec{v}\|_p$ is the norm (magnitude, length) of the vector $\vec{v}$ in the $L^p$ space (don't worry, I'm going to explain it all).

### Lebesgue spaces

Usually, the length of a vector $\vec{u} = (u_1, u_2, u_3, \ldots, u_n)$ is calculated using the Euclidean norm (a norm is a function that assigns a strictly positive length or size to all vectors in a vector space), which is defined by:

$\|\vec{u}\| = \sqrt{u^2_1 + u^2_2 + u^2_3 + \ldots + u^2_n}$

But this isn't the only way to define length; the Euclidean norm is just the special case $p = 2$ of the more general $p$-norm, which is defined as:

$\displaystyle \|\vec{u}\|_p = (\left|u_1\right|^p + \left|u_2\right|^p + \left|u_3\right|^p + \ldots + \left|u_n\right|^p)^\frac{1}{p}$

$\displaystyle \|\vec{u}\|_p = (\sum\limits_{i=1}^{n}\left|\vec{u}_i\right|^p)^\frac{1}{p}$

When $p = 1$, we have the $L^1$ norm (also called the taxicab or Manhattan norm), which is simply the sum of the absolute values of the components:

$\displaystyle \|\vec{u}\|_1 = \left|u_1\right| + \left|u_2\right| + \left|u_3\right| + \ldots + \left|u_n\right|$
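As a quick sanity check, the $p$-norm translates into a few lines of plain Python (my own sketch, not from the original post; the name `p_norm` is mine):

```python
def p_norm(u, p):
    """Return the L^p norm of a vector given as a sequence of numbers."""
    return sum(abs(x) ** p for x in u) ** (1.0 / p)

v = [0, 2, 1, 0]
print(p_norm(v, 1))  # L1 (taxicab) norm: 3.0
print(p_norm(v, 2))  # L2 (Euclidean) norm: sqrt(5), about 2.236
```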

### Back to vector normalization

Now that you know how the norm is computed, we can normalize our vector $\vec{v_{d_4}} = (0,2,1,0)$ using the $L^2$ norm:

$\hat{v} = \frac{\vec{v}}{\|\vec{v}\|_p} \\ \\ \hat{v_{d_4}} = \frac{\vec{v_{d_4}}}{||\vec{v_{d_4}}||_2} \\ \\ \\ \hat{v_{d_4}} = \frac{(0,2,1,0)}{\sqrt{0^2 + 2^2 + 1^2 + 0^2}} \\ \\ \hat{v_{d_4}} = \frac{(0,2,1,0)}{\sqrt{5}} \\ \\ \small \hat{v_{d_4}} = (0.0, 0.89442719, 0.4472136, 0.0)$
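We can verify this arithmetic with a short plain-Python sketch (the function name `l2_normalize` is mine, not from the post):

```python
def l2_normalize(v):
    """Divide each component by the vector's Euclidean (L2) norm."""
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]

v_d4 = [0, 2, 1, 0]
print(l2_normalize(v_d4))
# [0.0, 0.8944271909999159, 0.4472135954999579, 0.0]
```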

### The term frequency – inverse document frequency (tf-idf) weight

Now that you have understood how vector normalization works in theory and practice, let's continue our tutorial. Suppose you have the following documents in your collection (taken from the first part of the tutorial):

Train document set:
d1: The sky is blue.
d2: The sun is bright.

Test document set:
d3: The sun in the sky is bright.
d4: We can see the shining sun, the bright sun.

The inverse document frequency (idf) is defined as:

$\displaystyle \mathrm{idf}(t) = \log{\frac{\left|D\right|}{1+\left|\{d : t \in d\}\right|}}$

where $\left|\{d : t \in d\}\right|$ is the number of documents where the term $t$ appears; the 1 is added to the denominator to avoid zero division.

The tf-idf weight of a term is then the product of both measures:

$\mathrm{tf\mbox{-}idf}(t) = \mathrm{tf}(t, d) \times \mathrm{idf}(t)$

$M_{train} = \begin{bmatrix} 0 & 1 & 1 & 1\\ 0 & 2 & 1 & 0 \end{bmatrix}$

We then calculate the idf of each of the four vocabulary terms:

$\mathrm{idf}(t_1) = \log{\frac{\left|D\right|}{1+\left|\{d : t_1 \in d\}\right|}} = \log{\frac{2}{1}} = 0.69314718$

$\mathrm{idf}(t_2) = \log{\frac{\left|D\right|}{1+\left|\{d : t_2 \in d\}\right|}} = \log{\frac{2}{3}} = -0.40546511$

$\mathrm{idf}(t_3) = \log{\frac{\left|D\right|}{1+\left|\{d : t_3 \in d\}\right|}} = \log{\frac{2}{3}} = -0.40546511$

$\mathrm{idf}(t_4) = \log{\frac{\left|D\right|}{1+\left|\{d : t_4 \in d\}\right|}} = \log{\frac{2}{2}} = 0.0$

$\vec{idf_{train}} = (0.69314718, -0.40546511, -0.40546511, 0.0)$
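These idf values are easy to reproduce in plain Python; note that the formula uses the natural logarithm. A small sketch (the names `idf` and `doc_counts` are mine), counting term occurrences over the two test documents:

```python
import math

def idf(n_docs, docs_with_term):
    """idf(t) = log(|D| / (1 + |{d : t in d}|)), using the natural log."""
    return math.log(n_docs / (1.0 + docs_with_term))

# How many of the two test documents (d3, d4) contain each vocabulary term:
doc_counts = {"blue": 0, "sun": 2, "bright": 2, "sky": 1}
idf_vector = [idf(2, doc_counts[t]) for t in ("blue", "sun", "bright", "sky")]
print(idf_vector)
# ≈ [0.69314718, -0.40546511, -0.40546511, 0.0]
```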

Now that we have our idf vector, we can place it on the diagonal of a square matrix:

$M_{idf} = \begin{bmatrix} 0.69314718 & 0 & 0 & 0\\ 0 & -0.40546511 & 0 & 0\\ 0 & 0 & -0.40546511 & 0\\ 0 & 0 & 0 & 0 \end{bmatrix}$

$M_{tf\mbox{-}idf} = M_{train} \times M_{idf}$

$\begin{bmatrix} \mathrm{tf}(t_1, d_1) & \mathrm{tf}(t_2, d_1) & \mathrm{tf}(t_3, d_1) & \mathrm{tf}(t_4, d_1)\\ \mathrm{tf}(t_1, d_2) & \mathrm{tf}(t_2, d_2) & \mathrm{tf}(t_3, d_2) & \mathrm{tf}(t_4, d_2) \end{bmatrix} \times \begin{bmatrix} \mathrm{idf}(t_1) & 0 & 0 & 0\\ 0 & \mathrm{idf}(t_2) & 0 & 0\\ 0 & 0 & \mathrm{idf}(t_3) & 0\\ 0 & 0 & 0 & \mathrm{idf}(t_4) \end{bmatrix} \\ = \begin{bmatrix} \mathrm{tf}(t_1, d_1) \times \mathrm{idf}(t_1) & \mathrm{tf}(t_2, d_1) \times \mathrm{idf}(t_2) & \mathrm{tf}(t_3, d_1) \times \mathrm{idf}(t_3) & \mathrm{tf}(t_4, d_1) \times \mathrm{idf}(t_4)\\ \mathrm{tf}(t_1, d_2) \times \mathrm{idf}(t_1) & \mathrm{tf}(t_2, d_2) \times \mathrm{idf}(t_2) & \mathrm{tf}(t_3, d_2) \times \mathrm{idf}(t_3) & \mathrm{tf}(t_4, d_2) \times \mathrm{idf}(t_4) \end{bmatrix}$

Let’s see now a concrete example of this multiplication:

$M_{tf\mbox{-}idf} = M_{train} \times M_{idf} = \\ \begin{bmatrix} 0 & 1 & 1 & 1\\ 0 & 2 & 1 & 0 \end{bmatrix} \times \begin{bmatrix} 0.69314718 & 0 & 0 & 0\\ 0 & -0.40546511 & 0 & 0\\ 0 & 0 & -0.40546511 & 0\\ 0 & 0 & 0 & 0 \end{bmatrix} \\ = \begin{bmatrix} 0 & -0.40546511 & -0.40546511 & 0\\ 0 & -0.81093022 & -0.40546511 & 0 \end{bmatrix}$
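Since $M_{idf}$ is diagonal, this multiplication simply scales each column of $M_{train}$ by the corresponding idf weight; a plain-Python sketch of that check (function and variable names are mine):

```python
def scale_columns(m, diag):
    """Multiply matrix m by a diagonal matrix given as its list of diagonal entries."""
    return [[value * diag[j] for j, value in enumerate(row)] for row in m]

m_train = [[0, 1, 1, 1],
           [0, 2, 1, 0]]
idf_diag = [0.69314718, -0.40546511, -0.40546511, 0.0]
m_tfidf = scale_columns(m_train, idf_diag)
print(m_tfidf)
# [[0.0, -0.40546511, -0.40546511, 0.0], [0.0, -0.81093022, -0.40546511, 0.0]]
```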

And finally, we can apply our $L^2$ normalization process to the $M_{tf\mbox{-}idf}$ matrix. Note that this normalization is applied row-wise, because we're going to handle each row of the matrix as a separate vector to be normalized, and not the matrix as a whole:

$M_{tf\mbox{-}idf} = \frac{M_{tf\mbox{-}idf}}{\|M_{tf\mbox{-}idf}\|_2} = \begin{bmatrix} 0 & -0.70710678 & -0.70710678 & 0\\ 0 & -0.89442719 & -0.4472136 & 0 \end{bmatrix}$

And that is our pretty normalized tf-idf weight of our testing document set, which is actually a collection of unit vectors. If you take the $L^2$-norm of each row of the matrix, you'll see that they all have an $L^2$-norm of 1.
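A short plain-Python sketch (my own, not from the post) of the row-wise $L^2$ normalization, confirming that each row indeed becomes a unit vector:

```python
def l2_normalize_rows(m):
    """Normalize each row of a matrix by that row's own L2 norm."""
    result = []
    for row in m:
        norm = sum(x * x for x in row) ** 0.5
        result.append([x / norm for x in row])
    return result

m_tfidf = [[0.0, -0.40546511, -0.40546511, 0.0],
           [0.0, -0.81093022, -0.40546511, 0.0]]
normalized = l2_normalize_rows(m_tfidf)
for row in normalized:
    print(row)
# first row  ≈ [0.0, -0.70710678, -0.70710678, 0.0]
# second row ≈ [0.0, -0.89442719, -0.4472136, 0.0]
```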

### Python practice

Now the section you were waiting for! In this section I'll use Python to show each step of the tf-idf calculation using the Scikit.learn feature extraction module.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
            "We can see the shining sun, the bright sun.")

count_vectorizer = CountVectorizer()
count_vectorizer.fit_transform(train_set)
print "Vocabulary:", count_vectorizer.vocabulary
# Vocabulary: {'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3}

freq_term_matrix = count_vectorizer.transform(test_set)
print freq_term_matrix.todense()
# [[0 1 1 1]
#  [0 2 1 0]]
```

Now that we have the frequency term matrix (`freq_term_matrix`), we can instantiate the `TfidfTransformer`, which will be responsible for calculating the tf-idf weights for our term frequency matrix:

```python
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(norm="l2")
tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
# IDF: [ 0.69314718 -0.40546511 -0.40546511  0. ]
```

```python
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
# [[ 0.         -0.70710678 -0.70710678  0.        ]
#  [ 0.         -0.89442719 -0.4472136   0.        ]]
```

I really hope you liked the post. I tried to make it as simple as possible, even for people without the required mathematical background in linear algebra. In the next Machine Learning post I'm expecting to show how you can use tf-idf to calculate the cosine similarity.

Sklearn text feature extraction code

### Updates

13 Mar 2015 – Formatting, fixed image issues.
03 Oct 2011 – Added the info about the environment used for the Python examples

## Machine Learning :: Text feature extraction (tf-idf) – Part I

### A short introduction to the Vector Space Model (VSM)

The VSM has a very confusing past; see for example the paper "The most influential paper Gerard Salton never wrote", which explains the history behind the ghost-cited paper that in fact never existed. In sum, the VSM is an algebraic model representing textual information as a vector, where the components of this vector can represent the importance of a term (tf-idf), or even its absence or presence (Bag of Words), in a document. It is important to note that the classical VSM proposed by Salton incorporates local and global parameters/information (in the sense that it uses both the isolated term being analyzed as well as the entire collection of documents). The VSM, interpreted in a broad sense, is a space where text is represented as a vector of numbers instead of its original raw string representation; the VSM represents the features extracted from the document.

### Going to the vector space

Train document set:
d1: The sky is blue.
d2: The sun is bright.

Test document set:
d3: The sun in the sky is bright.
d4: We can see the shining sun, the bright sun.

Now, what we have to do is create an index vocabulary (dictionary) of the words of the train document set. Using the documents $d_1$ and $d_2$ from the document set, we'll have the following index vocabulary, denoted as $\mathrm{E}(t)$, where $t$ is the term:

$\mathrm{E}(t) = \begin{cases} 1, & \mbox{if } t \mbox{ is ``blue''} \\ 2, & \mbox{if } t \mbox{ is ``sun''} \\ 3, & \mbox{if } t \mbox{ is ``bright''} \\ 4, & \mbox{if } t \mbox{ is ``sky''} \\ \end{cases}$

We're going to use the term frequency to represent each term in our vector space; the term frequency is nothing more than a measure of how many times a term from our vocabulary is present in documents $d_3$ or $d_4$, and we define it as a counting function:

$\mathrm{tf}(t,d) = \sum\limits_{x \in d} \mathrm{fr}(x, t)$

where $\mathrm{fr}(x, t)$ is a simple function defined as:

$\mathrm{fr}(x,t) = \begin{cases} 1, & \mbox{if } x = t \\ 0, & \mbox{otherwise} \\ \end{cases}$
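These two counting functions translate directly into Python. A minimal sketch (names like `term_freq` are mine, and I assume the documents have already been tokenized and lowercased):

```python
def term_freq(term, document):
    """tf(t, d): how many times term t occurs in document d (a list of tokens)."""
    return sum(1 for token in document if token == term)

d4 = ["we", "can", "see", "the", "shining", "sun", "the", "bright", "sun"]
print(term_freq("sun", d4))   # 2
print(term_freq("blue", d4))  # 0
```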

$\displaystyle \vec{v_{d_n}} = (\mathrm{tf}(t_1,d_n), \mathrm{tf}(t_2,d_n), \mathrm{tf}(t_3,d_n), \ldots, \mathrm{tf}(t_n,d_n))$

Each dimension of the document vector is represented by a term of the vocabulary; for example, $\mathrm{tf}(t_1, d_2)$ represents the frequency of term 1 or $t_1$ (which is our "blue" term of the vocabulary) in the document $d_2$.

$\vec{v_{d_3}} = (\mathrm{tf}(t_1,d_3), \mathrm{tf}(t_2,d_3), \mathrm{tf}(t_3,d_3), \ldots, \mathrm{tf}(t_n,d_3)) \\ \vec{v_{d_4}} = (\mathrm{tf}(t_1,d_4), \mathrm{tf}(t_2,d_4), \mathrm{tf}(t_3,d_4), \ldots, \mathrm{tf}(t_n,d_4))$

$\vec{v_{d_3}} = (0, 1, 1, 1) \\ \vec{v_{d_4}} = (0, 2, 1, 0)$

since the documents $d_3$ and $d_4$ are:

d3: The sun in the sky is bright.
d4: We can see the shining sun, the bright sun.

But wait, since we have a collection of documents, now represented by vectors, we can represent them as a matrix with $|D| \times F$ shape, where $|D|$ is the cardinality of the document space (how many documents we have) and $F$ is the number of features, in our case represented by the vocabulary size. An example of the matrix representation of the vectors described above is:

$M_{|D| \times F} = \begin{bmatrix} 0 & 1 & 1 & 1\\ 0 & 2 & 1 & 0 \end{bmatrix}$
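Putting the pieces together, this matrix can be built with a few lines of plain Python (a sketch of mine; the vocabulary order follows the index $\mathrm{E}(t)$ defined earlier, and I assume pre-tokenized, lowercased documents):

```python
vocabulary = ["blue", "sun", "bright", "sky"]  # index order from E(t)

def doc_vector(document):
    """Term-frequency vector of a tokenized document over the vocabulary."""
    return [document.count(term) for term in vocabulary]

d3 = ["the", "sun", "in", "the", "sky", "is", "bright"]
d4 = ["we", "can", "see", "the", "shining", "sun", "the", "bright", "sun"]
matrix = [doc_vector(d3), doc_vector(d4)]
print(matrix)  # [[0, 1, 1, 1], [0, 2, 1, 0]]
```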

### Python practice

Since we know the theory behind term frequency and the vector space conversion, let's show how easy it is to do that using the amazing scikit.learn Python module.

Scikit.learn comes with lots of examples, as well as interesting real-life datasets you can use, and also some helper functions to download 18k newsgroups posts, for example.

First we define our train and test document sets:

```python
train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
            "We can see the shining sun, the bright sun.")
```

```python
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
```

The `CountVectorizer` already uses by default an analyzer called `WordNGramAnalyzer`, which is responsible for converting the text to lowercase, removing accents, extracting tokens, filtering stop words, etc. You can see more information by printing the class:

```python
print vectorizer
# CountVectorizer(analyzer__min_n=1,
#     analyzer__stop_words=set(['all', 'six', 'less', 'being', 'indeed',
#         'over', 'move', 'anyway', 'four', 'not', 'own', 'through',
#         'yourselves', (...)
```

```python
vectorizer.fit_transform(train_set)
print vectorizer.vocabulary
# {'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3}
```

```python
smatrix = vectorizer.transform(test_set)
print smatrix
# (0, 1)    1
# (0, 2)    1
# (0, 3)    1
# (1, 1)    2
# (1, 2)    1
```

```python
print smatrix.todense()
# matrix([[0, 1, 1, 1],
#         [0, 2, 1, 0]], dtype=int64)
```

I hope you liked this post, and if you really liked it, leave a comment so I'll be able to know if there are enough people interested in this series of posts on Machine Learning topics.

Cite this article as: Christian S. Perone, "Machine Learning :: Text feature extraction (tf-idf) – Part I," in Terra Incognita, 18/09/2011, //www.cpetem.com/2011/09/machine-learning-text-feature-extraction-tf-idf-part-i/

### References

Wikipedia: Vector space model

Scikits.learn Examples

### Updates

21 Sep 11 – Fixed some typos and the vector notation
22 Sep 11 – Fixed import of sklearn according to the new 0.9 release and added the environment section
02 Oct 11 – Fixed Latex math typos
18 Oct 11 – Added link to the second part of the tutorial series
04 Mar 11 – Fixed formatting issues