## Machine Learning :: Cosine Similarity for Vector Space Models (Part III)

* It has been a long time since I wrote the TF-IDF tutorial (Part I and Part II) and as I promised, here is the continuation of the tutorial. Unfortunately I had no time to fix the previous tutorials for the newer versions of the scikit-learn (sklearn) package nor to answer all the questions, but I hope to do that in the near future.

So, in the previous tutorials we learned how a document can be modeled in the Vector Space, how the TF-IDF transformation works and how the TF-IDF is calculated; now what we are going to learn is how to use a well-known similarity measure (Cosine Similarity) to calculate the similarity between different documents.

## Dot product

Let’s begin with the definition of the dot product for two vectors: $\vec{a} = (a_1, a_2, a_3, \ldots)$ and $\vec{b} = (b_1, b_2, b_3, \ldots)$, where $a_n$ and $b_n$ are the components of the vector (the tf-idf values for each word of the document, in our example) and $\mathit{n}$ is the dimension of the vectors:

$\vec{a} \cdot \vec{b} = \sum_{i=1}^n a_ib_i = a_1b_1 + a_2b_2 + \cdots + a_nb_n$

As you can see, the definition of the dot product is a simple multiplication of each pair of components from both vectors, added together. See an example of a dot product for two vectors with 2 dimensions each (2D):

$\vec{a} = (0, 3) \\ \vec{b} = (4, 0) \\ \vec{a} \cdot \vec{b} = 0*4 + 3*0 = 0$
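That arithmetic can be checked with a few lines of plain Python (a hypothetical `dot` helper, not part of the original tutorial's code):

```python
def dot(a, b):
    # Sum of the pairwise products of the two vectors' components
    return sum(x * y for x, y in zip(a, b))

a = (0, 3)
b = (4, 0)
print(dot(a, b))  # 0 -- the vectors share no non-zero component
```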

This is all very simple and easy to understand, but what is a dot product? What is the intuitive idea behind it? What does it mean to have a dot product of zero? To understand it, we need to understand the geometric definition of the dot product:

$\vec{a} \cdot \vec{b} = \|\vec{a}\|\|\vec{b}\|\cos{\theta}$

Rearranging the equation to understand it better using the commutative property, we have:

$\vec{a} \cdot \vec{b} = \|\vec{b}\|\|\vec{a}\|\cos{\theta}$

So, what is the term $\displaystyle \|\vec{a}\|\cos{\theta}$? This term is the projection of the vector $\vec{a}$ onto the vector $\vec{b}$, as shown in the image below:

Now, what happens when the vector $\vec{a}$ is orthogonal (at an angle of 90 degrees) to the vector $\vec{b}$, like in the image below? There will be no adjacent component of the projection, and that is why the dot product is zero.

## Cosine similarity

$\displaystyle \vec{a} \cdot \vec{b} = \|\vec{a}\|\|\vec{b}\|\cos{\theta} \\ \\ \cos{\theta} = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\|\|\vec{b}\|}$

And that is it, this is the cosine similarity formula. Cosine Similarity will generate a metric that says how related two documents are by looking at the angle instead of the magnitude, like in the examples below:
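As a sanity check, the formula can be sketched in a few lines of plain Python (hypothetical helper names; the sklearn implementation is used later in this tutorial):

```python
import math

def dot(a, b):
    # Sum of the pairwise products of the components
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    # Euclidean (L2) magnitude of the vector
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return dot(a, b) / (norm(a) * norm(b))

print(cosine_similarity((0, 3), (4, 0)))  # 0.0 -- orthogonal vectors
print(cosine_similarity((1, 2), (2, 4)))  # ~1.0 -- same direction
```

Note how the magnitudes of the vectors cancel out: scaling a document vector does not change its cosine similarity to another vector.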

## Practice using scikit-learn (sklearn)

* In this tutorial I used Python 2.7.5 and scikit-learn 0.14.1.

```python
documents = (
    "The sky is blue",
    "The sun is bright",
    "The sun in the sky is bright",
    "We can see the shining sun, the bright sun"
)
```

And then we instantiate the Sklearn TF-IDF Vectorizer and transform our documents into the TF-IDF matrix:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
print tfidf_matrix.shape
# (4, 11)
```

Now we have the TF-IDF matrix (tfidf_matrix) for each document (the number of rows of the matrix) with 11 tf-idf terms (the number of columns of the matrix), and we can calculate the Cosine Similarity between the first document ("The sky is blue") and each of the other documents of the set:

```python
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)
# array([[ 1.        ,  0.36651513,  0.52305744,  0.13448867]])
```

The tfidf_matrix[0:1] is the Scipy operation to get the first row of the sparse matrix, and the resulting array is the Cosine Similarity between the first document and all documents in the set. Note that the first value of the array is 1.0 because it is the Cosine Similarity between the first document and itself. Also note that due to the presence of similar words in the third document (“The sun in the sky is bright”), it achieved a better score.

$\cos{\theta} = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\|\|\vec{b}\|}$

We only need to isolate the angle ($\theta$) and move the $\cos$ to the right side of the equation:

$\theta = \arccos{\frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\|\|\vec{b}\|}}$

The $\arccos$ is the same as the inverse of the cosine ($\cos^{-1}$).

```python
import math

# This was already calculated on the previous step, so we just use the value
cos_sim = 0.52305744
angle_in_radians = math.acos(cos_sim)
print math.degrees(angle_in_radians)
# 58.462437107432784
```

## Related material

Wikipedia: Dot Product

Scikit-learn (sklearn) – the de facto Machine Learning package for Python

## Machine Learning :: Text feature extraction (tf-idf) – Part II

### Introduction

But let’s go back to our definition of $\mathrm{tf}(t,d)$, which is actually the term count of the term $t$ in the document $d$. The use of this simple term frequency could lead us to problems like keyword spamming, which is when we have a repeated term in a document with the purpose of improving its ranking on an IR (Information Retrieval) system, or even create a bias towards long documents, making them look more important than they are just because of the high frequency of the term in the document.

### Vector normalization

Suppose we are going to normalize the term-frequency vector $\vec{v_{d_4}}$ that we calculated in the first part of this tutorial. The document $d_4$ from the first part of this tutorial had this textual representation:

d4: We can see the shining sun, the bright sun.

And the vector space representation using the non-normalized term-frequency of that document was:

$\vec{v_{d_4}} = (0,2,1,0)$

To normalize the vector is the same as calculating the Unit Vector of the vector, and they are denoted using the “hat” notation: $\hat{v}$. The definition of the unit vector $\hat{v}$ of a vector $\vec{v}$ is:

$\displaystyle \hat{v} = \frac{\vec{v}}{\|\vec{v}\|_p}$

Where the $\hat{v}$ is the unit vector, or the normalized vector, $\vec{v}$ is the vector going to be normalized, and $\|\vec{v}\|_p$ is the norm (magnitude, length) of the vector $\vec{v}$ in the $L^p$ space (don’t worry, I’m going to explain it all).

### Lebesgue spaces

$\|\vec{u}\| = \sqrt{u^2_1 + u^2_2 + u^2_3 + \ldots + u^2_n}$

$\displaystyle \|\vec{u}\|_p = ( \left|u_1\right|^p + \left|u_2\right|^p + \left|u_3\right|^p + \ldots + \left|u_n\right|^p)^\frac{1}{p}$

$\displaystyle \|\vec{u}\|_p = (\sum\limits_{i=1}^{n} \left|\vec{u}_i\right|^p)^\frac{1}{p}$

$\displaystyle \|\vec{u}\|_1 = ( \left|u_1\right| + \left|u_2\right| + \left|u_3\right| + \ldots + \left|u_n\right|)$
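The norm definitions above map directly to code; a minimal sketch with a hypothetical `p_norm` helper:

```python
def p_norm(u, p=2):
    # ||u||_p = (|u_1|^p + |u_2|^p + ... + |u_n|^p)^(1/p)
    return sum(abs(x) ** p for x in u) ** (1.0 / p)

u = (0, 2, 1, 0)
print(p_norm(u, p=2))  # L2 (Euclidean) norm: sqrt(5), ~2.236
print(p_norm(u, p=1))  # L1 (Manhattan) norm: 3.0
```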

### Back to vector normalization

$\hat{v} = \frac{\vec{v}}{\|\vec{v}\|_p} \\ \\ \hat{v_{d_4}} = \frac{\vec{v_{d_4}}}{||\vec{v_{d_4}}||_2} \\ \\ \hat{v_{d_4}} = \frac{(0,2,1,0)}{\sqrt{0^2 + 2^2 + 1^2 + 0^2}} \\ \\ \hat{v_{d_4}} = \frac{(0,2,1,0)}{\sqrt{5}} \\ \\ \small \hat{v_{d_4}} = (0.0, 0.89442719, 0.4472136, 0.0)$
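The same normalization can be reproduced numerically; a sketch in plain Python (hypothetical helper, not part of the original post):

```python
import math

def normalize(v):
    # Divide each component by the vector's L2 norm to obtain the unit vector
    length = math.sqrt(sum(x * x for x in v))
    return tuple(x / length for x in v)

v_d4 = (0, 2, 1, 0)
print(normalize(v_d4))  # ~ (0.0, 0.89442719, 0.4472136, 0.0)
```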

### The term frequency – inverse document frequency (tf-idf) weight

Train Document Set:
d1: The sky is blue.
d2: The sun is bright.

Test Document Set:
d3: The sun in the sky is bright.
d4: We can see the shining sun, the bright sun.

Your document space can be defined then as $D = \{d_1, d_2, \ldots, d_n\}$ where $n$ is the number of documents in your corpus, and in our case as $D_{train} = \{d_1, d_2\}$ and $D_{test} = \{d_3, d_4\}$. The cardinality of our document spaces is defined by $\left|{D_{train}}\right| = 2$ and $\left|{D_{test}}\right| = 2$, since we have only 2 documents for training and testing, but they obviously don’t need to have the same cardinality.

$\displaystyle \mathrm{idf}(t) = \log{\frac{\left|D\right|}{1+\left|\{d : t \in d\}\right|}}$

$\mathrm{tf\mbox{-}idf}(t) = \mathrm{tf}(t, d) \times \mathrm{idf}(t)$

Now let’s calculate the idf for each feature present in the feature matrix with the term frequency we have calculated in the first tutorial:

$M_{train} = \begin{bmatrix} 0 & 1 & 1 & 1\\ 0 & 2 & 1 & 0 \end{bmatrix}$

$\mathrm{idf}(t_1) = \log{\frac{\left|D\right|}{1+\left|\{d : t_1 \in d\}\right|}} = \log{\frac{2}{1}} = 0.69314718$

$\mathrm{idf}(t_2) = \log{\frac{\left|D\right|}{1+\left|\{d : t_2 \in d\}\right|}} = \log{\frac{2}{3}} = -0.40546511$

$\mathrm{idf}(t_3) = \log{\frac{\left|D\right|}{1+\left|\{d : t_3 \in d\}\right|}} = \log{\frac{2}{3}} = -0.40546511$

$\mathrm{idf}(t_4) = \log{\frac{\left|D\right|}{1+\left|\{d : t_4 \in d\}\right|}} = \log{\frac{2}{2}} = 0.0$

$\vec{idf_{train}} = (0.69314718, -0.40546511, -0.40546511, 0.0)$
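These values can be double-checked with a short script that mirrors the idf formula, including the `1 +` term in the denominator (a hypothetical helper; the document counts for each term are taken over the two test documents, as in the calculations above):

```python
import math

def idf(n_docs, n_docs_with_term):
    # idf(t) = log(|D| / (1 + |{d : t in d}|))
    return math.log(float(n_docs) / (1 + n_docs_with_term))

# How many of the 2 test documents contain t1 ("blue"), t2 ("sun"),
# t3 ("bright") and t4 ("sky")
doc_freqs = [0, 2, 2, 1]
print([round(idf(2, df), 8) for df in doc_freqs])
# [0.69314718, -0.40546511, -0.40546511, 0.0]
```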

$M_{idf} = \begin{bmatrix} 0.69314718 & 0 & 0 & 0\\ 0 & -0.40546511 & 0 & 0\\ 0 & 0 & -0.40546511 & 0\\ 0 & 0 & 0 & 0 \end{bmatrix}$

$M_{tf\mbox{-}idf} = M_{train} \times M_{idf}$

$\begin{bmatrix} \mathrm{tf}(t_1, d_1) & \mathrm{tf}(t_2, d_1) & \mathrm{tf}(t_3, d_1) & \mathrm{tf}(t_4, d_1)\\ \mathrm{tf}(t_1, d_2) & \mathrm{tf}(t_2, d_2) & \mathrm{tf}(t_3, d_2) & \mathrm{tf}(t_4, d_2) \end{bmatrix} \times \begin{bmatrix} \mathrm{idf}(t_1) & 0 & 0 & 0\\ 0 & \mathrm{idf}(t_2) & 0 & 0\\ 0 & 0 & \mathrm{idf}(t_3) & 0\\ 0 & 0 & 0 & \mathrm{idf}(t_4) \end{bmatrix} \\ = \begin{bmatrix} \mathrm{tf}(t_1, d_1) \times \mathrm{idf}(t_1) & \mathrm{tf}(t_2, d_1) \times \mathrm{idf}(t_2) & \mathrm{tf}(t_3, d_1) \times \mathrm{idf}(t_3) & \mathrm{tf}(t_4, d_1) \times \mathrm{idf}(t_4)\\ \mathrm{tf}(t_1, d_2) \times \mathrm{idf}(t_1) & \mathrm{tf}(t_2, d_2) \times \mathrm{idf}(t_2) & \mathrm{tf}(t_3, d_2) \times \mathrm{idf}(t_3) & \mathrm{tf}(t_4, d_2) \times \mathrm{idf}(t_4) \end{bmatrix}$

Let’s see now a concrete example of this multiplication:

$M_{tf\mbox{-}idf} = M_{train} \times M_{idf} = \\ \begin{bmatrix} 0 & 1 & 1 & 1\\ 0 & 2 & 1 & 0 \end{bmatrix} \times \begin{bmatrix} 0.69314718 & 0 & 0 & 0\\ 0 & -0.40546511 & 0 & 0\\ 0 & 0 & -0.40546511 & 0\\ 0 & 0 & 0 & 0 \end{bmatrix} \\ = \begin{bmatrix} 0 & -0.40546511 & -0.40546511 & 0\\ 0 & -0.81093022 & -0.40546511 & 0 \end{bmatrix}$
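Since $M_{idf}$ is diagonal, the product just scales column $j$ of the term-frequency matrix by the $j$-th idf value; a quick plain-Python check (a sketch, not the tutorial's sklearn code):

```python
tf_matrix = [[0, 1, 1, 1],
             [0, 2, 1, 0]]
idf_vec = [0.69314718, -0.40546511, -0.40546511, 0.0]

# Multiplying by a diagonal matrix scales column j by idf_vec[j]
tfidf = [[tf * w for tf, w in zip(row, idf_vec)] for row in tf_matrix]
print(tfidf)
# [[0.0, -0.40546511, -0.40546511, 0.0], [0.0, -0.81093022, -0.40546511, 0.0]]
```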

$M_{tf\mbox{-}idf} = \frac{M_{tf\mbox{-}idf}}{\|M_{tf\mbox{-}idf}\|_2}$ $= \begin{bmatrix} 0 & -0.70710678 & -0.70710678 & 0\\ 0 & -0.89442719 & -0.4472136 & 0 \end{bmatrix}$

And that is our pretty normalized tf-idf weight of our testing document set, which is actually a collection of unit vectors. If you take the L2 norm of each row of the matrix, you’ll see that they all have an L2 norm of 1.
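That claim is easy to verify; a small sketch (plain Python, hypothetical variable names) that computes the L2 norm of each row of the normalized matrix:

```python
import math

m_tfidf = [[0.0, -0.70710678, -0.70710678, 0.0],
           [0.0, -0.89442719, -0.4472136, 0.0]]

for row in m_tfidf:
    l2 = math.sqrt(sum(x * x for x in row))
    print(round(l2, 6))  # 1.0 for every row: each one is a unit vector
```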

### Python practice

```python
from sklearn.feature_extraction.text import CountVectorizer

train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
            "We can see the shining sun, the bright sun.")

count_vectorizer = CountVectorizer()
count_vectorizer.fit_transform(train_set)
print "Vocabulary:", count_vectorizer.vocabulary
# Vocabulary: {'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3}

freq_term_matrix = count_vectorizer.transform(test_set)
print freq_term_matrix.todense()
# [[0 1 1 1]
#  [0 2 1 0]]
```

```python
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(norm="l2")
tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
# IDF: [ 0.69314718 -0.40546511 -0.40546511  0.        ]
```

```python
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
# [[ 0.         -0.70710678 -0.70710678  0.        ]
#  [ 0.         -0.89442719 -0.4472136   0.        ]]
```

And that is it, the tf_idf_matrix is actually our previous $M_{tf\mbox{-}idf}$ matrix. You can accomplish the same effect by using the Vectorizer class of Scikit.learn, which is a vectorizer that automatically combines the CountVectorizer and the TfidfTransformer for you. See this example to learn how to use it for the text classification process.

I really hope you liked the post. I tried to make it as simple as possible, even for people without the required mathematical background in linear algebra, etc. In the next Machine Learning post I’m expecting to show how you can use the tf-idf to calculate the cosine similarity.

If you liked it, feel free to comment and make suggestions, corrections, etc.

### References

Wikipedia :: tf-idf

Sklearn text feature extraction code

13 Mar 2015 – Formatting, fixed image issues.
03 Oct 2011 – Added information about the environment used in the Python examples.

## Machine Learning :: Text feature extraction (tf-idf) – Part I

### Short introduction to Vector Space Model (VSM)

The VSM has a very confusing past; see for example the paper The most influential paper Gerard Salton never wrote, which explains the history behind the ghost-cited paper that in fact never existed. In sum, the VSM is an algebraic model representing textual information as a vector, where the components of this vector could represent the importance of a term (tf-idf) or even the absence or presence (Bag of Words) of it in a document. It is important to note that the classic VSM proposed by Salton incorporates local and global parameters/information (in the sense that it uses both the isolated term being analyzed as well as the entire collection of documents). The VSM, interpreted in a lato sensu, is a space where text is represented as a vector of numbers instead of its original string textual representation; the VSM represents the features extracted from the document.

### Going to the vector space

Train Document Set:
d1: The sky is blue.
d2: The sun is bright.

Test Document Set:
d3: The sun in the sky is bright.
d4: We can see the shining sun, the bright sun.

$\mathrm{E}(t) = \begin{cases} 1, & \mbox{if } t\mbox{ is ``blue''} \\ 2, & \mbox{if } t\mbox{ is ``sun''} \\ 3, & \mbox{if } t\mbox{ is ``bright''} \\ 4, & \mbox{if } t\mbox{ is ``sky''} \\ \end{cases}$

Note that the terms like “is” and “the” were ignored as cited before. Now that we have an index vocabulary, we can convert the test document set into a vector space where each term of the vector is indexed as our index vocabulary, so the first term of the vector represents the “blue” term of our vocabulary, the second represents “sun” and so on. Now, we’re going to use the term frequency to represent each term in our vector space; the term frequency is nothing more than a measure of how many times each term of our vocabulary $\mathrm{E}(t)$ is present in the documents $d_3$ or $d_4$; we define the term frequency as a counting function:

$\mathrm{tf}(t,d) = \sum\limits_{x \in d} \mathrm{fr}(x, t)$

$\mathrm{fr}(x,t) = \begin{cases} 1, & \mbox{if } x = t \\ 0, & \mbox{otherwise} \\ \end{cases}$

$\displaystyle \vec{v_{d_n}} =(\mathrm{tf}(t_1,d_n), \mathrm{tf}(t_2,d_n), \mathrm{tf}(t_3,d_n), \ldots, \mathrm{tf}(t_n,d_n))$

Each dimension of the document vector is represented by a term of the vocabulary; for example, the $\mathrm{tf}(t_1, d_2)$ represents the frequency of the term 1 or $t_1$ (which is our “blue” term of the vocabulary) in the document $d_2$.

$\vec{v_{d_3}} = (\mathrm{tf}(t_1,d_3), \mathrm{tf}(t_2,d_3), \mathrm{tf}(t_3,d_3), \ldots, \mathrm{tf}(t_n,d_3)) \\ \vec{v_{d_4}} = (\mathrm{tf}(t_1,d_4), \mathrm{tf}(t_2,d_4), \mathrm{tf}(t_3,d_4), \ldots, \mathrm{tf}(t_n,d_4))$

which evaluates to:

$\vec{v_{d_3}} = (0, 1, 1, 1) \\ \vec{v_{d_4}} = (0, 2, 1, 0)$

d3: The sun in the sky is bright.
d4: We can see the shining sun, the bright sun.
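The counting function $\mathrm{fr}$ and the vector construction above can be sketched in plain Python (with a hypothetical, naive tokenizer; sklearn's real analyzer is shown in the next section):

```python
# Index of each vocabulary term in the vector (stop words already removed)
vocabulary = {"blue": 0, "sun": 1, "bright": 2, "sky": 3}

def tf_vector(document, vocabulary):
    # tf(t, d): count how many times each vocabulary term occurs in d
    tokens = document.lower().replace(",", " ").replace(".", " ").split()
    vec = [0] * len(vocabulary)
    for token in tokens:
        if token in vocabulary:
            vec[vocabulary[token]] += 1
    return vec

print(tf_vector("The sun in the sky is bright.", vocabulary))  # [0, 1, 1, 1]
print(tf_vector("We can see the shining sun, the bright sun.", vocabulary))  # [0, 2, 1, 0]
```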

But wait, since we have a collection of documents, now represented by vectors, we can represent them as a matrix with $|D| \times F$ shape, where $|D|$ is the cardinality of the document space, or how many documents we have, and $F$ is the number of features, in our case represented by the vocabulary size. An example of the matrix representation of the vectors described above is:

$M_{|D| \times F} = \begin{bmatrix} 0 & 1 & 1 & 1\\ 0 & 2 & 1 & 0 \end{bmatrix}$

As you may have noted, these matrices representing the term frequencies tend to be very sparse (with the majority of their terms zeroed), and that’s why you’ll commonly see these matrices represented as sparse matrices.

### Python practice

Since we know the theory behind the term frequency and the vector space conversion, let’s show how easy it is to do that using the amazing scikit.learn Python module.

Scikit.learn comes with lots of examples as well as real-life interesting datasets you can use, and also some helper functions to download 18k newsgroups posts, for instance.

Since we already defined our small train/test dataset before, let’s use them to define the dataset in a way that scikit.learn can use:

```python
train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
            "We can see the shining sun, the bright sun.")
```

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
```

The CountVectorizer already uses by default an “analyzer” called WordNGramAnalyzer, which is responsible for converting the text to lowercase, removing accents, extracting tokens, filtering stop words, etc. You can see more information by printing the class information:

```python
print vectorizer
# CountVectorizer(analyzer__min_n=1,
#     analyzer__stop_words=set(['all', 'six', 'less', 'being', 'indeed',
#     'over', 'move', 'anyway', 'four', 'not', 'own', 'through',
#     'yourselves', (...)
```

Let’s create now the vocabulary index:

```python
vectorizer.fit_transform(train_set)
print vectorizer.vocabulary
# {'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3}
```

Let’s use the same vectorizer now to create the sparse matrix of our test_set documents:

```python
smatrix = vectorizer.transform(test_set)
print smatrix
# (0, 1)    1
# (0, 2)    1
# (0, 3)    1
# (1, 1)    2
# (1, 2)    1
```

Note that the sparse matrix created, called smatrix, is a SciPy sparse matrix with its elements stored in coordinate format. But you can convert it into a dense format:

```python
smatrix.todense()
# matrix([[0, 1, 1, 1],
#         [0, 2, 1, 0]], dtype=int64)
```

### References

Wikipedia: tf-idf

Scikits.learn examples