*It has been a long time since I wrote the TF-IDF tutorial (Part I and Part II), and as I promised, here is the continuation of the tutorial. Unfortunately, I had no time to fix the previous tutorials for the newer versions of the scikit-learn (sklearn) package nor to answer all the questions, but I hope to do that in the near future.

## The Dot Product

$\vec{a} \cdot \vec{b} = \sum_{i=1}^n a_ib_i = a_1b_1 + a_2b_2 + \cdots + a_nb_n$

$\vec{a} = (0, 3) \\ \vec{b} = (4, 0) \\ \vec{a} \cdot \vec{b} = 0*4 + 3*0 = 0$

$\vec{a} \cdot \vec{b} = \|\vec{a}\|\|\vec{b}\|\cos{\theta}$

$\vec{a} \cdot \vec{b} = \|\vec{b}\|\|\vec{a}\|\cos{\theta}$

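So, what is the term $\displaystyle \|\vec{a}\|\cos{\theta}$? This term is the projection of the vector $\vec{a}$ onto the vector $\vec{b}$, as shown in the image below:

Now, what happens when the vector $\vec{a}$ is orthogonal (at an angle of 90 degrees) to the vector $\vec{b}$, as in the image below? In that case $\cos{90^\circ} = 0$, so the projection, and with it the dot product, is zero, exactly as in the example with $\vec{a} = (0, 3)$ and $\vec{b} = (4, 0)$.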
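To make the two definitions of the dot product concrete, here is a minimal sketch in plain Python. The helper names `dot` and `norm` are hypothetical, chosen only for this illustration; the vectors are the orthogonal pair from the example above.

```python
import math

def dot(a, b):
    """Algebraic dot product: sum of the element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))

a = (0, 3)
b = (4, 0)

# Algebraic definition: 0*4 + 3*0
algebraic = dot(a, b)

# Geometric definition: |a| |b| cos(theta), with theta = 90 degrees here
theta = math.radians(90)
geometric = norm(a) * norm(b) * math.cos(theta)

print(algebraic)
print(round(geometric, 10))
```

Both definitions agree up to floating-point rounding: orthogonal vectors have a zero dot product.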

## The Cosine Similarity

$\displaystyle \vec{a} \cdot \vec{b} = \|\vec{a}\|\|\vec{b}\|\cos{\theta} \\ \\ \cos{\theta} = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\|\|\vec{b}\|}$

Note that even if we had a vector pointing to a point far from another vector, the two could still have a small angle between them, and that is the central point of using cosine similarity: the measurement tends to ignore the higher term counts in documents. Suppose we have a document with the word “sky” appearing 200 times and another document with the word “sky” appearing 50 times; the Euclidean distance between them will be larger, but the angle will still be small because they are pointing in the same direction, which is what matters when we are comparing documents.
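The contrast between the two measures can be sketched with a pair of toy term-count vectors. The counts below are hypothetical (only the 200 and 50 for “sky” come from the text; the second term and its counts are made up for illustration), and the helper functions are written from the formulas above rather than taken from any library.

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| |b|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between the two vector end points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical term counts for the terms ("sky", "blue")
doc_a = (200, 30)
doc_b = (50, 10)

similarity = cosine_similarity(doc_a, doc_b)
distance = euclidean_distance(doc_a, doc_b)

print("cosine similarity: %.4f" % similarity)
print("euclidean distance: %.2f" % distance)
```

The distance is large because one document is much longer, while the cosine similarity stays close to 1 because both vectors point in nearly the same direction.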

## Practice Using Scikit-learn (sklearn)

*In this tutorial I’m using Python 2.7.5 and scikit-learn 0.14.1.

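    from sklearn.feature_extraction.text import TfidfVectorizer

    # Assumed document set: the four example sentences used in this series
    # (their original definition is not shown here).
    documents = (
        "The sky is blue",
        "The sun is bright",
        "The sun in the sky is bright",
        "We can see the shining sun, the bright sun",
    )

    tfidf_vectorizer = TfidfVectorizer()
    tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
    print(tfidf_matrix.shape)
    # (4, 11)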

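The `tfidf_matrix[0:1]` is the SciPy operation to get the first row of the sparse matrix, and the resulting array is the cosine similarity between the first document and all documents in the set. Note that the first value of the array is 1.0, because it is the cosine similarity of the first document with itself. Also note that, due to the presence of similar words in the third document (“The sun in the sky is bright”), it achieved a better score.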
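The computation this paragraph describes can be sketched as follows. It assumes the `cosine_similarity` helper from `sklearn.metrics.pairwise` and, as the document set, the four example sentences of this series; exact values may vary slightly between scikit-learn versions, so none are quoted here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Assumed document set: the four example sentences of this series
documents = (
    "The sky is blue",
    "The sun is bright",
    "The sun in the sky is bright",
    "We can see the shining sun, the bright sun",
)

tfidf_matrix = TfidfVectorizer().fit_transform(documents)

# Compare the first document (row 0) against every document in the set
similarities = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)
print(similarities)
```

The result has one row and four columns: entry `[0, 0]` is the first document compared with itself, and entry `[0, 2]` is its similarity to the third document.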

If you want, you can also solve the Cosine Similarity for the angle between vectors:

$\cos{\theta} = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\|\|\vec{b}\|}$

$\theta = \arccos{\frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\|\|\vec{b}\|}}$

The $\arccos$ is the same as the inverse of the cosine ($\cos^{-1}$).

Let’s, for instance, check the angle between the first and third documents:
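A minimal sketch of that check, under the same assumptions as before (the four example sentences as the document set, and `cosine_similarity` from `sklearn.metrics.pairwise`), could look like this. The clamping step is an addition of this sketch, guarding `math.acos` against tiny floating-point overshoots.

```python
import math

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Assumed document set: the four example sentences of this series
documents = (
    "The sky is blue",
    "The sun is bright",
    "The sun in the sky is bright",
    "We can see the shining sun, the bright sun",
)

tfidf_matrix = TfidfVectorizer().fit_transform(documents)

# Cosine similarity between the first (row 0) and third (row 2) documents
cos_sim = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[2:3])[0, 0]

# Clamp to [-1, 1] so that rounding errors cannot push acos out of its domain
cos_sim = max(-1.0, min(1.0, cos_sim))

angle_in_radians = math.acos(cos_sim)
angle_in_degrees = math.degrees(angle_in_radians)
print(angle_in_degrees)
```

Because tf-idf weights are non-negative here, the angle always falls between 0 and 90 degrees.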

## Related Material

Wikipedia: Cosine Similarity

scikit-learn (sklearn) – The de facto machine learning package for Python

## Machine Learning :: Text feature extraction (tf-idf) – Part II

Read the first part of this tutorial: Text feature extraction (tf-idf) – Part I

### Introduction

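In the first post, we learned how to use the term-frequency to represent textual information in the vector space. However, the main problem with the term-frequency approach is that it scales up frequent terms and scales down rare terms, which are empirically more informative than the high-frequency terms. The basic intuition is that a term that occurs frequently in many documents is not a good discriminator, and this really makes sense (at least in many experimental tests); the important question here is: why would you, in a classification problem for instance, emphasize a term that is present in almost the entire corpus of your documents?

The tf-idf weight comes to solve this problem. What tf-idf gives is how important a word is to a document in a collection, which is why tf-idf incorporates local and global parameters: it takes into consideration not only the isolated term but also the term within the document collection. To solve that problem, tf-idf scales down the frequent terms while scaling up the rare terms; a term that occurs 10 times more than another isn’t 10 times more important than it, which is why tf-idf uses the logarithmic scale.

To overcome this problem, the term frequency $\mathrm{tf}(t, d)$ of a document on a vector space is usually also normalized. Let’s see how we normalize this vector.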

### Vector normalization

d4: We can see the shining sun, the bright sun.

$\vec{v_{d_4}} = (0,2,1,0)$

$\displaystyle \hat{v} = \frac{\vec{v}}{\|\vec{v}\|_p}$

Where the $\hat{v}$ is the unit vector, or the normalized vector, the $\vec{v}$ is the vector going to be normalized and the $\|\vec{v}\|_p$ is the norm (magnitude, length) of the vector $\vec{v}$ in the $L^p$ space (don’t worry, I’m going to explain it all).

### Lebesgue spaces

$\|\vec{u}\| = \sqrt{u^2_1 + u^2_2 + u^2_3 + \ldots + u^2_n}$

But this isn’t the only way to define length, and that’s why you see (sometimes) a number $p$ together with the norm notation, like in $\|\vec{u}\|_p$. That’s because it could be generalized as:

$\displaystyle \|\vec{u}\|_p = ( \left|u_1\right|^p + \left|u_2\right|^p + \left|u_3\right|^p + \ldots + \left|u_n\right|^p )^\frac{1}{p}$

$\displaystyle \|\vec{u}\|_p = (\sum\limits_{i=1}^{n}\left|\vec{u}_i\right|^p)^\frac{1}{p}$

So when you read about an L2-norm, you’re reading about the Euclidean norm, a norm with $p = 2$, the most common norm used to measure the length of a vector, typically called “magnitude”; actually, when you have an unqualified length measure (without the $p$ number), you have the L2-norm (Euclidean norm).

$\displaystyle \|\vec{u}\|_1 = ( \left|u_1\right| + \left|u_2\right| + \left|u_3\right| + \ldots + \left|u_n\right|)$

Taxicab geometry versus Euclidean distance: In taxicab geometry all three pictured lines have the same length (12) for the same route. In Euclidean geometry, the green line has length $6 \times \sqrt{2} \approx 8.48$, and is the unique shortest path.
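The general $p$-norm formula above translates almost directly into code. The sketch below is a plain-Python illustration (the function name `p_norm` is hypothetical), applied to a displacement of 6 units in each direction, which is an assumption chosen to be consistent with the lengths quoted in the caption.

```python
def p_norm(u, p):
    """L^p norm of the sequence u: (sum of |u_i|^p) ^ (1/p), for p >= 1."""
    return sum(abs(x) ** p for x in u) ** (1.0 / p)

# Assumed displacement: 6 units along each axis
u = (6, 6)

l1 = p_norm(u, 1)  # taxicab (L1) length: |6| + |6|
l2 = p_norm(u, 2)  # Euclidean (L2) length: sqrt(6^2 + 6^2)

print(l1)
print(l2)
```

The L1 result corresponds to the taxicab length of the route, and the L2 result to the length of the straight diagonal line.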

### Back to vector normalization

$\hat{v} = \frac{\vec{v}}{\|\vec{v}\|_p} \\ \\ \hat{v_{d_4}} = \frac{\vec{v_{d_4}}}{\|\vec{v_{d_4}}\|_2} \\ \\ \hat{v_{d_4}} = \frac{(0,2,1,0)}{\sqrt{0^2 + 2^2 + 1^2 + 0^2}} \\ \\ \hat{v_{d_4}} = \frac{(0,2,1,0)}{\sqrt{5}} \\ \\ \small \hat{v_{d_4}} = (0.0, 0.89442719, 0.4472136, 0.0)$

Note that here we have normalized our term frequency document vector, but later we’re going to do that after the calculation of the tf-idf.
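The normalization of $\vec{v_{d_4}}$ can be reproduced with a few lines of plain Python. This is only a sketch; the helper name `l2_normalize` is hypothetical, and it assumes the vector is not all zeros (otherwise the division would fail).

```python
import math

def l2_normalize(v):
    """Divide every component of v by the Euclidean length of v."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

v_d4 = [0, 2, 1, 0]
v_d4_hat = l2_normalize(v_d4)

print(v_d4_hat)

# A normalized vector has unit length
print(math.sqrt(sum(x * x for x in v_d4_hat)))
```

The direction of the vector is unchanged; only its length is rescaled to 1.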

### The term frequency – inverse document frequency (tf-idf) weight

    Train Document Set:

    d1: The sky is blue.
    d2: The sun is bright.

    Test Document Set:

    d3: The sun in the sky is bright.
    d4: We can see the shining sun, the bright sun.

Your document space can be defined then as $D = \{ d_1, d_2, \ldots, d_n \}$ where $n$ is the number of documents in your corpus, and in our case as $D_{train} = \{d_1, d_2\}$ and $D_{test} = \{d_3, d_4\}$. The cardinality of our document space is defined by $\left|D_{train}\right| = 2$ and $\left|D_{test}\right| = 2$, since we have only 2 documents for training and testing, but they obviously don’t need to have the same cardinality.

$\displaystyle \mathrm{idf}(t) = \log{\frac{\left|D\right|}{1+\left|\{d : t \in d\}\right|}}$

The formula for the tf-idf is then:

$\mathrm{tf\mbox{-}idf}(t) = \mathrm{tf}(t, d) \times \mathrm{idf}(t)$

$M_{train} = \begin{bmatrix} 0 & 1 & 1 & 1\\ 0 & 2 & 1 & 0 \end{bmatrix}$

Since we have 4 features, we have to calculate $\mathrm{idf}(t_1)$, $\mathrm{idf}(t_2)$, $\mathrm{idf}(t_3)$, $\mathrm{idf}(t_4)$:

$\mathrm{idf}(t_1) = \log{\frac{\left|D\right|}{1+\left|\{d : t_1 \in d\}\right|}} = \log{\frac{2}{1}} = 0.69314718$

$\mathrm{idf}(t_2) = \log{\frac{\left|D\right|}{1+\left|\{d : t_2 \in d\}\right|}} = \log{\frac{2}{3}} = -0.40546511$

$\mathrm{idf}(t_3) = \log{\frac{\left|D\right|}{1+\left|\{d : t_3 \in d\}\right|}} = \log{\frac{2}{3}} = -0.40546511$

$\mathrm{idf}(t_4) = \log{\frac{\left|D\right|}{1+\left|\{d : t_4 \in d\}\right|}} = \log{\frac{2}{2}} = 0.0$

$\vec{idf_{train}} = (0.69314718, -0.40546511, -0.40546511, 0.0)$
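The four idf values can be checked with a short plain-Python sketch that applies the idf formula of this post to the matrix $M_{train}$. The function name `idf` and its arguments are hypothetical; `math.log` is the natural logarithm, which is what the values above correspond to.

```python
import math

# Term frequency matrix from the text: one row per document, one column per term
M_train = [
    [0, 1, 1, 1],
    [0, 2, 1, 0],
]

def idf(term_index, matrix):
    """idf(t) = log(|D| / (1 + number of documents that contain t))."""
    n_docs = len(matrix)
    doc_freq = sum(1 for row in matrix if row[term_index] > 0)
    return math.log(n_docs / (1.0 + doc_freq))

idf_vector = [idf(t, M_train) for t in range(len(M_train[0]))]
print(idf_vector)
```

Note that with this formula a term present in every document gets a negative weight, which is why two of the values are below zero.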

$M_{idf} = \begin{bmatrix} 0.69314718 & 0 & 0 & 0\\ 0 & -0.40546511 & 0 & 0\\ 0 & 0 & -0.40546511 & 0\\ 0 & 0 & 0 & 0 \end{bmatrix}$

$M_{tf\mbox{-}idf} = M_{train} \times M_{idf}$

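Please note that matrix multiplication isn’t commutative: the result of $A \times B$ will be different from the result of $B \times A$, and this is why the $M_{idf}$ is on the right side of the multiplication, to accomplish the desired effect of multiplying each idf value by its corresponding feature: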

$\begin{bmatrix} \mathrm{tf}(t_1, d_1) & \mathrm{tf}(t_2, d_1) & \mathrm{tf}(t_3, d_1) & \mathrm{tf}(t_4, d_1)\\ \mathrm{tf}(t_1, d_2) & \mathrm{tf}(t_2, d_2) & \mathrm{tf}(t_3, d_2) & \mathrm{tf}(t_4, d_2) \end{bmatrix} \times \begin{bmatrix} \mathrm{idf}(t_1) & 0 & 0 & 0\\ 0 & \mathrm{idf}(t_2) & 0 & 0\\ 0 & 0 & \mathrm{idf}(t_3) & 0\\ 0 & 0 & 0 & \mathrm{idf}(t_4) \end{bmatrix} \\ = \begin{bmatrix} \mathrm{tf}(t_1, d_1) \times \mathrm{idf}(t_1) & \mathrm{tf}(t_2, d_1) \times \mathrm{idf}(t_2) & \mathrm{tf}(t_3, d_1) \times \mathrm{idf}(t_3) & \mathrm{tf}(t_4, d_1) \times \mathrm{idf}(t_4)\\ \mathrm{tf}(t_1, d_2) \times \mathrm{idf}(t_1) & \mathrm{tf}(t_2, d_2) \times \mathrm{idf}(t_2) & \mathrm{tf}(t_3, d_2) \times \mathrm{idf}(t_3) & \mathrm{tf}(t_4, d_2) \times \mathrm{idf}(t_4) \end{bmatrix}$

$M_{tf\mbox{-}idf} = M_{train} \times M_{idf} = \\ \begin{bmatrix} 0 & 1 & 1 & 1\\ 0 & 2 & 1 & 0 \end{bmatrix} \times \begin{bmatrix} 0.69314718 & 0 & 0 & 0\\ 0 & -0.40546511 & 0 & 0\\ 0 & 0 & -0.40546511 & 0\\ 0 & 0 & 0 & 0 \end{bmatrix} \\ = \begin{bmatrix} 0 & -0.40546511 & -0.40546511 & 0\\ 0 & -0.81093022 & -0.40546511 & 0 \end{bmatrix}$

And finally, we can apply our L2 normalization process to the $M_{tf\mbox{-}idf}$ matrix. Please note that this normalization is “row-wise” because we’re going to handle each row of the matrix as a separate vector to be normalized, and not the matrix as a whole:

$M_{tf\mbox{-}idf} = \frac{M_{tf\mbox{-}idf}}{\|M_{tf\mbox{-}idf}\|_2}$ $= \begin{bmatrix} 0 & -0.70710678 & -0.70710678 & 0\\ 0 & -0.89442719 & -0.4472136 & 0 \end{bmatrix}$
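The whole hand calculation above (idf vector, diagonal matrix, multiplication and row-wise normalization) can be sketched with NumPy. This is an illustration of the formulas in this post, not of what scikit-learn does internally; the variable names simply mirror the symbols used in the text.

```python
import numpy as np

# Term frequency matrix from the text: one row per document, one column per term
M_train = np.array([[0, 1, 1, 1],
                    [0, 2, 1, 0]], dtype=float)

# idf(t) = log(|D| / (1 + number of documents containing t))
n_docs = M_train.shape[0]
doc_freq = (M_train > 0).sum(axis=0)
idf_vector = np.log(n_docs / (1.0 + doc_freq))

# Diagonal matrix with the idf weights on the main diagonal
M_idf = np.diag(idf_vector)

# tf-idf weights: each column of M_train is scaled by its idf value
M_tfidf = M_train.dot(M_idf)

# Row-wise L2 normalization: each document vector is divided by its own length
row_norms = np.linalg.norm(M_tfidf, axis=1, keepdims=True)
M_tfidf_normalized = M_tfidf / row_norms

print(M_tfidf)
print(M_tfidf_normalized)
```

Placing `M_idf` on the right scales the columns (features); multiplying in the other order would not even be defined here, since the shapes (4×4 and 2×4) do not match.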

### Python practice

The first step is to create our training and testing document sets and compute the term frequency matrix:

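    from sklearn.feature_extraction.text import CountVectorizer

    train_set = ("The sky is blue.", "The sun is bright.")
    test_set = ("The sun in the sky is bright.",
                "We can see the shining sun, the bright sun.")

    # stop_words="english" is set explicitly: without it, recent scikit-learn
    # releases keep words such as "the" and "is" in the vocabulary.
    count_vectorizer = CountVectorizer(stop_words="english")
    count_vectorizer.fit_transform(train_set)

    # Recent releases expose the fitted mapping as vocabulary_ and index the
    # terms alphabetically, so the column order may differ from the output
    # recorded below.
    print("Vocabulary: %s" % count_vectorizer.vocabulary_)
    # Vocabulary: {'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3}

    freq_term_matrix = count_vectorizer.transform(test_set)
    print(freq_term_matrix.todense())
    # [[0 1 1 1]
    #  [0 2 1 0]]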
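The term frequency matrix can then be handed to scikit-learn’s `TfidfTransformer` to learn the idf weights. The sketch below assumes a recent scikit-learn release and, to stay self-contained, rebuilds the term frequency matrix printed above as a NumPy array instead of reusing the vectorizer output. Recent releases use a smoothed idf with 1 added to it by default, so the learned weights will not match the hand-computed values of this post (in particular, they are never negative).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Term frequency matrix of the test set, as printed above
freq_term_matrix = np.array([[0, 1, 1, 1],
                             [0, 2, 1, 0]])

# norm="l2" is the default; it is spelled out to make the choice explicit
tfidf = TfidfTransformer(norm="l2")
tfidf.fit(freq_term_matrix)

# idf_ holds one learned idf weight per feature (column)
print("IDF: %s" % tfidf.idf_)
```

The rarest term (first column, present in no document) receives the largest weight, and the terms present in both documents receive the smallest.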

Note that I’ve specified the norm as L2. This is optional (actually the default is the L2-norm), but I’ve added the parameter to make it explicit that it’s going to use the L2-norm. Also note that you can see the calculated idf weights by accessing the internal attribute called `idf_`. Now that the `fit()` method has calculated the idf for the matrix, let’s transform the `freq_term_matrix` to the tf-idf weight matrix:

    tf_idf_matrix = tfidf.transform(freq_term_matrix)
    print(tf_idf_matrix.todense())
    # [[ 0.         -0.70710678 -0.70710678  0.        ]
    #  [ 0.         -0.89442719 -0.4472136   0.        ]]

I really hope you liked the post. I tried to make it as simple as possible, even for people without the required mathematical background in linear algebra, etc. In the next Machine Learning post I’m expecting to show how you can use the tf-idf to calculate the cosine similarity.

If you liked it, feel free to comment and make suggestions, corrections, etc.

Cite this article as: Christian S. Perone, "Machine Learning :: Text feature extraction (tf-idf) – Part II," in Terra Incognita, 03/10/2011, //www.cpetem.com/2011/10/machine-learning-text-feature-extraction-tf-idf-part-ii/

### References

Understanding Inverse Document Frequency: on theoretical arguments for IDF

The classic Vector Space Model

Sklearn text feature extraction code

### Updates

13 Mar 2015 – Formatted, fixed image issues.
03 Oct 2011 – Added info about the environment used for the Python examples.