## Machine Learning :: Text feature extraction (tf-idf) – Part II

### Introduction

To overcome this problem, the term frequency $\mathrm{tf}(t, d)$ of a document in the vector space is usually also normalized. Let's see how we normalize this vector.

### Vector normalization

d4: We can see the shining sun, the bright sun.

$\vec{v_{d_4}} = (0,2,1,0)$

$\displaystyle \hat{v} = \frac{\vec{v}}{\|\vec{v}\|_p}$

where $\hat{v}$ is the unit vector, or the normalized vector, $\vec{v}$ is the vector going to be normalized and $\|\vec{v}\|_p$ is the norm (magnitude, length) of the vector $\vec{v}$ in the $L^p$ space (don't worry, I'm going to explain it all).

The unit vector is actually nothing more than a normalized version of the vector; it is a vector whose length is 1.

But the important question here is how the length of the vector is calculated, and to understand this, you must understand the motivation of the $L^p$ spaces, also called Lebesgue spaces.

### Lebesgue spaces

Usually, the length of a vector $\vec{u} = (u_1, u_2, u_3, \ldots, u_n)$ is calculated using the Euclidean norm:

$\|\vec{u}\| = \sqrt{u^2_1 + u^2_2 + u^2_3 + \ldots + u^2_n}$

But this isn’t the only way to define length, and that’s why you see (sometimes) a number $p$ together with the norm notation, like in $\|\vec{u}\|_p$. That’s because it can be generalized as:

$\displaystyle \|\vec{u}\|_p = (\left|u_1\right|^p + \left|u_2\right|^p + \left|u_3\right|^p + \ldots + \left|u_n\right|^p)^\frac{1}{p}$

$\displaystyle \|\vec{u}\|_p = (\sum\limits_{i=1}^{n}\left|\vec{u}_i\right|^p)^\frac{1}{p}$

So when you read about an L2-norm, you’re reading about the Euclidean norm, a norm with $p = 2$, the most common norm used to measure the length of a vector, typically called “magnitude”; actually, when you have an unqualified length measure (without the $p$ number), you have the L2-norm (Euclidean norm).

When you read about an L1-norm, you’re reading about a norm with $p = 1$, defined as:

$\displaystyle \|\vec{u}\|_1 = ( \left|u_1\right| + \left|u_2\right| + \left|u_3\right| + \ldots + \left|u_n\right|)$

Which is nothing more than a simple sum of the components of the vector, also known as the Taxicab distance, also called the Manhattan distance.
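As a quick numeric sanity check, here is a small NumPy sketch (not part of the original post) computing both norms of the vector $(0,2,1,0)$ used throughout this article:

```python
import numpy as np

v = np.array([0.0, 2.0, 1.0, 0.0])

# L1-norm: simple sum of the absolute components (Taxicab/Manhattan length)
l1 = np.linalg.norm(v, ord=1)   # |0| + |2| + |1| + |0| = 3.0

# L2-norm: the Euclidean length, the default when no p is given
l2 = np.linalg.norm(v)          # sqrt(0 + 4 + 1 + 0) = sqrt(5)

print(l1)  # 3.0
print(l2)  # 2.23606797749979
```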

Source: Wikipedia :: Taxicab Geometry

### Back to vector normalization

Now that you know what the vector normalization process is, we can try a concrete example: the process of using the L2-norm (we’ll use the right terms now) to normalize our vector $\vec{v_{d_4}} = (0,2,1,0)$ in order to get its unit vector $\hat{v_{d_4}}$. To do that, we’ll simply plug it into the definition of the unit vector and evaluate it:

$\hat{v} = \frac{\vec{v}}{\|\vec{v}\|_p} \\ \\ \hat{v_{d_4}} = \frac{\vec{v_{d_4}}}{||\vec{v_{d_4}}||_2} \\ \\ \\ \hat{v_{d_4}} = \frac{(0,2,1,0)}{\sqrt{0^2 + 2^2 + 1^2 + 0^2}} \\ \\ \hat{v_{d_4}} = \frac{(0,2,1,0)}{\sqrt{5}} \\ \\ \small \hat{v_{d_4}} = (0.0, 0.89442719, 0.4472136, 0.0)$

And that is it! Our normalized vector $\hat{v_{d_4}}$ now has an L2-norm $\|\hat{v_{d_4}}\|_2 = 1.0$.
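The same evaluation can be checked numerically; this is a NumPy sketch (not from the original post) that divides the vector by its L2-norm:

```python
import numpy as np

v_d4 = np.array([0.0, 2.0, 1.0, 0.0])

# Unit vector: divide the vector by its own L2-norm
v_hat = v_d4 / np.linalg.norm(v_d4)

print(v_hat)                   # [0.  0.89442719  0.4472136  0.]
# The resulting vector has L2-norm 1.0 (up to floating point)
print(np.linalg.norm(v_hat))
```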

### The term frequency – inverse document frequency (tf-idf) weight

Train Document Set:

d1: The sky is blue.
d2: The sun is bright.

Test Document Set:

d3: The sun in the sky is bright.
d4: We can see the shining sun, the bright sun.

Let’s now see how the idf (inverse document frequency) is defined:

$\displaystyle \mathrm{idf}(t) = \log{\frac{\left|D\right|}{1+\left|\{d : t \in d\}\right|}}$

The formula for the tf-idf is then:

$\mathrm{tf\mbox{-}idf}(t) = \mathrm{tf}(t, d) \times \mathrm{idf}(t)$

$M_{train} = \begin{bmatrix} 0 & 1 & 1 & 1\\ 0 & 2 & 1 & 0 \end{bmatrix}$

$\mathrm{idf}(t_1) = \log{\frac{\left|D\right|}{1+\left|\{d : t_1 \in d\}\right|}} = \log{\frac{2}{1}} = 0.69314718$

$\mathrm{idf}(t_2) = \log{\frac{\left|D\right|}{1+\left|\{d : t_2 \in d\}\right|}} = \log{\frac{2}{3}} = -0.40546511$

$\mathrm{idf}(t_3) = \log{\frac{\left|D\right|}{1+\left|\{d : t_3 \in d\}\right|}} = \log{\frac{2}{3}} = -0.40546511$

$\mathrm{idf}(t_4) = \log{\frac{\left|D\right|}{1+\left|\{d : t_4 \in d\}\right|}} = \log{\frac{2}{2}} = 0.0$

$\vec{idf_{train}} = (0.69314718, -0.40546511, -0.40546511, 0.0)$
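These idf values can be reproduced with a few lines of plain Python (a sketch, not part of the original post), computing each term's document frequency from the rows of the matrix and applying the formula above:

```python
from math import log

# Term-frequency rows (documents d3, d4); columns: blue, sun, bright, sky
M_train = [[0, 1, 1, 1],
           [0, 2, 1, 0]]

n_docs = len(M_train)  # |D| = 2

idf = []
for term in range(4):
    # Document frequency: number of documents containing the term
    df = sum(1 for doc in M_train if doc[term] > 0)
    # idf(t) = log(|D| / (1 + df)), as defined above
    idf.append(log(float(n_docs) / (1 + df)))

print(idf)  # [0.6931..., -0.4054..., -0.4054..., 0.0]
```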

Now that we have our matrix with the term frequencies ($M_{train}$) and the vector representing the idf for each feature of our matrix ($\vec{idf_{train}}$), we can calculate our tf-idf weights. What we have to do is a simple multiplication of each column of the matrix $M_{train}$ by the respective $\vec{idf_{train}}$ vector dimension. To do that, we can create a square diagonal matrix $M_{idf}$ with both the vertical and horizontal dimensions equal to the dimension of the vector $\vec{idf_{train}}$:

$M_{idf} = \begin{bmatrix} 0.69314718 & 0 & 0 & 0\\ 0 & -0.40546511 & 0 & 0\\ 0 & 0 & -0.40546511 & 0\\ 0 & 0 & 0 & 0 \end{bmatrix}$

$M_{tf\mbox{-}idf} = M_{train} \times M_{idf}$

$\begin{bmatrix} \mathrm{tf}(t_1, d_1) & \mathrm{tf}(t_2, d_1) & \mathrm{tf}(t_3, d_1) & \mathrm{tf}(t_4, d_1)\\ \mathrm{tf}(t_1, d_2) & \mathrm{tf}(t_2, d_2) & \mathrm{tf}(t_3, d_2) & \mathrm{tf}(t_4, d_2) \end{bmatrix} \times \begin{bmatrix} \mathrm{idf}(t_1) & 0 & 0 & 0\\ 0 & \mathrm{idf}(t_2) & 0 & 0\\ 0 & 0 & \mathrm{idf}(t_3) & 0\\ 0 & 0 & 0 & \mathrm{idf}(t_4) \end{bmatrix} \\ = \begin{bmatrix} \mathrm{tf}(t_1, d_1) \times \mathrm{idf}(t_1) & \mathrm{tf}(t_2, d_1) \times \mathrm{idf}(t_2) & \mathrm{tf}(t_3, d_1) \times \mathrm{idf}(t_3) & \mathrm{tf}(t_4, d_1) \times \mathrm{idf}(t_4)\\ \mathrm{tf}(t_1, d_2) \times \mathrm{idf}(t_1) & \mathrm{tf}(t_2, d_2) \times \mathrm{idf}(t_2) & \mathrm{tf}(t_3, d_2) \times \mathrm{idf}(t_3) & \mathrm{tf}(t_4, d_2) \times \mathrm{idf}(t_4) \end{bmatrix}$

$M_{tf\mbox{-}idf} = M_{train} \times M_{idf} = \\ \begin{bmatrix} 0 & 1 & 1 & 1\\ 0 & 2 & 1 & 0 \end{bmatrix} \times \begin{bmatrix} 0.69314718 & 0 & 0 & 0\\ 0 & -0.40546511 & 0 & 0\\ 0 & 0 & -0.40546511 & 0\\ 0 & 0 & 0 & 0 \end{bmatrix} \\ = \begin{bmatrix} 0 & -0.40546511 & -0.40546511 & 0\\ 0 & -0.81093022 & -0.40546511 & 0 \end{bmatrix}$

And finally, we can apply our L2 normalization process to the $M_{tf\mbox{-}idf}$ matrix. Note that this normalization is applied row-wise, because we’re going to handle each row of the matrix as a separate vector to be normalized, not the matrix as a whole:

$M_{tf\mbox{-}idf} = \frac{M_{tf\mbox{-}idf}}{\|M_{tf\mbox{-}idf}\|_2} = \begin{bmatrix} 0 & -0.70710678 & -0.70710678 & 0\\ 0 & -0.89442719 & -0.4472136 & 0 \end{bmatrix}$
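The whole derivation above can be checked with NumPy (a sketch, not part of the original post): multiply the term-frequency matrix by the diagonal idf matrix, then divide each row by its own L2-norm:

```python
import numpy as np

M_train = np.array([[0, 1, 1, 1],
                    [0, 2, 1, 0]], dtype=float)
idf = np.array([0.69314718, -0.40546511, -0.40546511, 0.0])

# tf-idf weights: each column of M_train scaled by the matching idf entry
M_tfidf = M_train.dot(np.diag(idf))

# Row-wise L2 normalization: each row treated as a separate vector
norms = np.linalg.norm(M_tfidf, axis=1, keepdims=True)
M_tfidf_normalized = M_tfidf / norms

print(M_tfidf_normalized)
# [[ 0.         -0.70710678 -0.70710678  0.        ]
#  [ 0.         -0.89442719 -0.4472136   0.        ]]
```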

### Python practice

Environment used: Python v.2.7.2, Numpy 1.6.1, Scipy v.0.9.0, Sklearn (Scikits.learn) v.0.9

The first step is to create our training and testing document set and computing the term frequency matrix:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
            "We can see the shining sun, the bright sun.")

count_vectorizer = CountVectorizer()
count_vectorizer.fit_transform(train_set)
print "Vocabulary:", count_vectorizer.vocabulary
# Vocabulary: {'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3}

freq_term_matrix = count_vectorizer.transform(test_set)
print freq_term_matrix.todense()
# [[0 1 1 1]
#  [0 2 1 0]]
```

Now that we have the frequency term matrix (called freq_term_matrix), we can instantiate the TfidfTransformer, which is going to be responsible for calculating the tf-idf weights for our term frequency matrix:
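A sketch of that step (written for a current scikit-learn; the original post used Sklearn v0.9, and modern versions use a smoothed idf formula by default, so the printed idf_ values will not match the hand-computed values above):

```python
from sklearn.feature_extraction.text import TfidfTransformer

# freq_term_matrix as computed above (rows: d3, d4)
freq_term_matrix = [[0, 1, 1, 1],
                    [0, 2, 1, 0]]

# norm="l2" is the default, but passing it makes the choice explicit
tfidf = TfidfTransformer(norm="l2")
tfidf.fit(freq_term_matrix)

# The idf weights computed by fit() are exposed as the idf_ attribute
print("IDF:", tfidf.idf_)
```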

Note that I’ve specified the norm as L2. This is optional (the default is actually the L2-norm), but I’ve added the parameter to make it explicit that it’s going to use the L2-norm. Also note that you can see the calculated idf weights by accessing the internal attribute called idf_. Now that the fit() method has computed the idf for the matrix, let’s transform the freq_term_matrix to the tf-idf weight matrix:

```python
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
# [[ 0.         -0.70710678 -0.70710678  0.        ]
#  [ 0.         -0.89442719 -0.4472136   0.        ]]
```

I really hope you liked the post. I tried to make it as simple as possible, even for people without the required mathematical background in linear algebra, etc. In the next Machine Learning post I’m expecting to show how you can use tf-idf to calculate the cosine similarity.

### References

Understanding Inverse Document Frequency: on theoretical arguments for IDF

Wikipedia :: tf-idf

The classic Vector Space Model

Sklearn text feature extraction source code