Read the first part of this tutorial: Text feature extraction (tf-idf) - Part I.

This post is a **continuation** of the first part, where we started to learn the theory and practice of text feature extraction and the vector space model representation. I really recommend that you **read the first part** of the post series in order to follow this second post.

Since a lot of people liked the first part of this tutorial, this second part is a little longer than the first.

### Introduction

In the first post, we learned how to use the **term-frequency** to represent textual information in the vector space. However, the main problem with the term-frequency approach is that it scales up frequent terms and scales down rare terms, which are empirically more informative than the high frequency terms. The basic intuition is that a term that occurs frequently in many documents is not a good discriminator, and that really makes sense (at least in many experimental tests); the important question here is: why would you, in a classification problem for instance, emphasize a term which is present in almost the entire corpus of your documents?

The tf-idf weight comes to solve this problem. What tf-idf gives is a measure of how important a word is to a document in a collection, and that’s why tf-idf incorporates local and global parameters: it takes into consideration not only the isolated term but also the term within the document collection. To solve the problem, tf-idf scales down the frequent terms while scaling up the rare terms; a term that occurs 10 times more than another isn’t 10 times more important than it, which is why tf-idf uses a logarithmic scale.

But let’s go back to our definition of $\mathrm{tf}(t,d)$, which is actually the count of the term $t$ in the document $d$. The use of this simple term frequency could lead us to problems like *keyword spamming*, which is when a term is repeated in a document with the purpose of improving its ranking on an IR (*Information Retrieval*) system. It could even create a bias towards long documents, making them look more important than they are just because of the high frequency of the term in the document.

To overcome this problem, the term frequency $\mathrm{tf}(t,d)$ of a document in a vector space is usually also normalized. Let’s see how we normalize this vector.

### Vector normalization

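Suppose we are going to normalize the term-frequency vector $\vec{v_{d_4}}$ that we calculated in the first part of this tutorial. The document $d_4$ from the first part of this tutorial had this textual representation:

d4: We can see the shining sun, the bright sun.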

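And the vector space representation using the non-normalized term-frequency of that document was:

$$\vec{v_{d_4}} = (0, 2, 1, 0)$$

To normalize the vector is the same as calculating the unit vector of the vector, which is denoted using the “hat” notation: $\hat{v}$. The definition of the unit vector $\hat{v}$ of a vector $\vec{v}$ is:

$$\hat{v} = \frac{\vec{v}}{\|\vec{v}\|_p}$$

where $\hat{v}$ is the unit vector, or the normalized vector, $\vec{v}$ is the vector going to be normalized, and $\|\vec{v}\|_p$ is the norm (magnitude, length) of the vector $\vec{v}$ in the $L^p$ space (don’t worry, I’m going to explain it all).

The unit vector is actually nothing more than a normalized version of the vector: a vector whose length is 1.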

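But the important question here is how the length of the vector is calculated, and to understand this, you must understand the motivation of the $L^p$ spaces, also called Lebesgue spaces.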

### Lebesgue spaces

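Usually, the length of a vector $\vec{u} = (u_1, u_2, u_3, \ldots, u_n)$ is calculated using the Euclidean norm (*a norm is a function that assigns a strictly positive length or size to every non-zero vector in a vector space*), which is defined by:

$$\|\vec{u}\| = \sqrt{u_1^2 + u_2^2 + u_3^2 + \ldots + u_n^2}$$

But this isn’t the only way to define length, and that’s why you (sometimes) see a number $p$ together with the norm notation, as in $\|\vec{u}\|_p$. That’s because the norm can be generalized as:

$$\|\vec{u}\|_p = \left( |u_1|^p + |u_2|^p + |u_3|^p + \ldots + |u_n|^p \right)^{\frac{1}{p}}$$

and simplified as:

$$\|\vec{u}\|_p = \left( \sum_{i=1}^{n} |u_i|^p \right)^{\frac{1}{p}}$$

So when you read about an **L2-norm**, you’re reading about the **Euclidean norm**, a norm with $p=2$, the most common norm used to measure the length of a vector, typically called “magnitude”; actually, when you have an unqualified length measure (without the $p$ number), you have the **L2-norm** (Euclidean norm).

When you read about an **L1-norm**, you’re reading about the norm with $p=1$, defined as:

$$\|\vec{u}\|_1 = |u_1| + |u_2| + |u_3| + \ldots + |u_n|$$

which is nothing more than a simple sum of the absolute values of the components of the vector, also known as the taxicab distance or Manhattan distance.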

*Taxicab geometry versus Euclidean distance: In taxicab geometry all three pictured lines have the same length (12) for the same route. In Euclidean geometry, the green line has length $6 \times \sqrt{2} \approx 8.48$, and is the unique shortest path. Source: Wikipedia :: Taxicab Geometry*
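The norm definitions above can be sketched in a few lines of plain Python. This is only an illustration: the helper name `p_norm` is my own (it doesn’t come from any library), and the displacement `(6, 6)` is an assumption chosen to be consistent with the route length of 12 mentioned in the caption.

```python
def p_norm(vector, p):
    """Return the L^p norm: (sum of |u_i|^p) raised to the power 1/p."""
    return sum(abs(component) ** p for component in vector) ** (1.0 / p)


# Hypothetical displacement: 6 blocks in one direction, 6 in the other.
route = (6, 6)

l1_length = p_norm(route, 1)  # taxicab (Manhattan) length
l2_length = p_norm(route, 2)  # Euclidean length

print("L1-norm:", l1_length)            # L1-norm: 12.0
print("L2-norm:", round(l2_length, 2))  # L2-norm: 8.49
```

The same function covers both cases; only the value of `p` changes.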

Note that you can also use any norm to normalize the vector, but we’re going to use the most common norm, the L2-norm, which is also the default in the 0.9 release of scikits.learn. You can also find papers comparing the performance of the two approaches, among other methods to normalize the document vector. Actually, you can use any other method, but you have to be consistent: once you’ve used a norm, you have to use it for the whole process directly involving the norm (*a unit vector that used an L1-norm isn’t going to have length 1 if you take its L2-norm later*).
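To see why mixing norms is a problem, here is a small sketch that normalizes the term-frequency vector of d4 with the L1-norm and then measures the result with both norms. The helper names `p_norm` and `normalize` are hypothetical and exist only for this illustration.

```python
def p_norm(vector, p):
    """L^p norm of a vector given as a sequence of numbers."""
    return sum(abs(c) ** p for c in vector) ** (1.0 / p)


def normalize(vector, p):
    """Divide every component by the vector's L^p norm."""
    length = p_norm(vector, p)
    return [c / length for c in vector]


v_d4 = [0, 2, 1, 0]
l1_unit = normalize(v_d4, 1)  # unit vector with respect to the L1-norm

l1_of_l1_unit = p_norm(l1_unit, 1)
l2_of_l1_unit = p_norm(l1_unit, 2)

print(round(l1_of_l1_unit, 3))  # 1.0
print(round(l2_of_l1_unit, 3))  # 0.745
```

The vector has length 1 only under the norm that was used to normalize it.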

### Back to vector normalization

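Now that you know what the vector normalization process is, we can try a concrete example: using the L2-norm (we’ll use the right terms now) to normalize our vector $\vec{v_{d_4}} = (0, 2, 1, 0)$ in order to get its unit vector $\hat{v_{d_4}}$. To do that, we’ll simply plug it into the definition of the unit vector and evaluate it:

$$
\begin{aligned}
\hat{v} &= \frac{\vec{v}}{\|\vec{v}\|_p} \\
\hat{v_{d_4}} &= \frac{\vec{v_{d_4}}}{\|\vec{v_{d_4}}\|_2} \\
&= \frac{(0, 2, 1, 0)}{\sqrt{0^2 + 2^2 + 1^2 + 0^2}} \\
&= \frac{(0, 2, 1, 0)}{\sqrt{5}} \\
&= (0.0, 0.89442719, 0.4472136, 0.0)
\end{aligned}
$$

And that is it! Our normalized vector $\hat{v_{d_4}}$ now has an L2-norm $\|\hat{v_{d_4}}\|_2 = 1.0$.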
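If you want to check this calculation yourself, the steps can be reproduced with nothing more than the standard `math` module. This is just a sketch; the variable names are my own.

```python
import math

v_d4 = [0, 2, 1, 0]

# L2-norm: square root of the sum of the squared components, here sqrt(5).
l2_norm = math.sqrt(sum(c ** 2 for c in v_d4))

# Unit vector: every component divided by the norm.
unit_v_d4 = [c / l2_norm for c in v_d4]

# The L2-norm of the unit vector should be 1 (up to floating point error).
check = math.sqrt(sum(c ** 2 for c in unit_v_d4))

print([round(c, 8) for c in unit_v_d4])  # [0.0, 0.89442719, 0.4472136, 0.0]
print(round(check, 8))                   # 1.0
```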

**Note that here we have normalized our term frequency document vector, but later we’re going to do that after the calculation of the tf-idf.**

### The term frequency - inverse document frequency (tf-idf) weight

Now that you have understood how vector normalization works in theory and practice, let’s continue our tutorial. Suppose you have the following documents in your collection (taken from the first part of the tutorial):

Train Document Set:

d1: The sky is blue.

d2: The sun is bright.

Test Document Set:

d3: The sun in the sky is bright.

d4: We can see the shining sun, the bright sun.

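Your document space can then be defined as $D = \{ d_1, d_2, \ldots, d_n \}$, where $n$ is the number of documents in your corpus, and in our case as $D_{train} = \{d_1, d_2\}$ and $D_{test} = \{d_3, d_4\}$. The cardinality of our document space is defined by $|D_{train}| = 2$ and $|D_{test}| = 2$, since we have only two documents each for training and testing, but they obviously don’t need to have the same cardinality.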

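Let’s see now how idf (inverse document frequency) is defined:

$$\mathrm{idf}(t) = \log{\frac{|D|}{1 + |\{d : t \in d\}|}}$$

where $|\{d : t \in d\}|$ is the **number of documents** where the term $t$ appears, that is, where the term-frequency function satisfies $\mathrm{tf}(t,d) \neq 0$, and $\log$ is the natural logarithm. We’re only adding 1 to the denominator to avoid division by zero.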

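The formula for the tf-idf is then:

$$\mathrm{tf\mbox{-}idf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t)$$

and this formula has an important consequence: a high tf-idf weight is reached when you have a high term frequency (tf) in the given document (*local parameter*) and a low document frequency of the term in the whole collection (*global parameter*).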

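Now let’s calculate the idf for each feature present in the feature matrix with the term frequency we have calculated in the first tutorial:

$$M_{tf} = \begin{bmatrix} 0 & 1 & 1 & 1 \\ 0 & 2 & 1 & 0 \end{bmatrix}$$

Since we have 4 features, we have to calculate $\mathrm{idf}(t_1)$, $\mathrm{idf}(t_2)$, $\mathrm{idf}(t_3)$, $\mathrm{idf}(t_4)$:

$$\mathrm{idf}(t_1) = \log{\frac{|D|}{1 + |\{d : t_1 \in d\}|}} = \log{\frac{2}{1}} = 0.69314718$$

$$\mathrm{idf}(t_2) = \log{\frac{|D|}{1 + |\{d : t_2 \in d\}|}} = \log{\frac{2}{3}} = -0.40546511$$

$$\mathrm{idf}(t_3) = \log{\frac{|D|}{1 + |\{d : t_3 \in d\}|}} = \log{\frac{2}{3}} = -0.40546511$$

$$\mathrm{idf}(t_4) = \log{\frac{|D|}{1 + |\{d : t_4 \in d\}|}} = \log{\frac{2}{2}} = 0.0$$

These idf weights can be represented by a vector as:

$$\vec{idf} = (0.69314718, -0.40546511, -0.40546511, 0.0)$$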
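As a quick check, the idf weights can be computed from the term-frequency matrix in plain Python. This is a minimal sketch, assuming the idf definition with the `1 +` in the denominator and the natural logarithm; the function name `idf` and the variable names are mine.

```python
import math

# Term-frequency matrix: one row per document, one column per feature
# (feature order from the vocabulary: blue, sun, bright, sky).
tf_matrix = [
    [0, 1, 1, 1],
    [0, 2, 1, 0],
]


def idf(term_index, matrix):
    """idf(t) = log(|D| / (1 + number of documents containing t))."""
    n_documents = len(matrix)
    document_frequency = sum(1 for row in matrix if row[term_index] != 0)
    return math.log(n_documents / (1.0 + document_frequency))


idf_vector = [idf(j, tf_matrix) for j in range(len(tf_matrix[0]))]

print([round(w, 8) for w in idf_vector])
# [0.69314718, -0.40546511, -0.40546511, 0.0]
```

Note that a term appearing in every document gets a negative weight with this definition, because the denominator becomes larger than the number of documents.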

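Now that we have our matrix with the term frequency ($M_{tf}$) and the vector representing the idf for each feature of our matrix ($\vec{idf}$), we can calculate our tf-idf weights. What we have to do is a simple multiplication of each column of the matrix $M_{tf}$ with the respective $\vec{idf}$ vector dimension. To do that, we can create a square diagonal matrix called $M_{idf}$ with both the vertical and horizontal dimensions equal to the vector $\vec{idf}$ dimension:

$$M_{idf} = \begin{bmatrix} 0.69314718 & 0 & 0 & 0 \\ 0 & -0.40546511 & 0 & 0 \\ 0 & 0 & -0.40546511 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}$$

and then multiply the term frequency matrix by it, so the final result can be defined as:

$$M_{tf\mbox{-}idf} = M_{tf} \times M_{idf}$$

Please note that matrix multiplication isn’t commutative: the result of $A \times B$ will in general be different from the result of $B \times A$, and this is why $M_{idf}$ is on the right side of the multiplication, to accomplish the desired effect of multiplying each idf value by its corresponding feature:

$$
\begin{bmatrix}
\mathrm{tf}(t_1, d_1) & \mathrm{tf}(t_2, d_1) & \mathrm{tf}(t_3, d_1) & \mathrm{tf}(t_4, d_1) \\
\mathrm{tf}(t_1, d_2) & \mathrm{tf}(t_2, d_2) & \mathrm{tf}(t_3, d_2) & \mathrm{tf}(t_4, d_2)
\end{bmatrix}
\times
\begin{bmatrix}
\mathrm{idf}(t_1) & 0 & 0 & 0 \\
0 & \mathrm{idf}(t_2) & 0 & 0 \\
0 & 0 & \mathrm{idf}(t_3) & 0 \\
0 & 0 & 0 & \mathrm{idf}(t_4)
\end{bmatrix}
=
\begin{bmatrix}
\mathrm{tf}(t_1, d_1) \times \mathrm{idf}(t_1) & \mathrm{tf}(t_2, d_1) \times \mathrm{idf}(t_2) & \mathrm{tf}(t_3, d_1) \times \mathrm{idf}(t_3) & \mathrm{tf}(t_4, d_1) \times \mathrm{idf}(t_4) \\
\mathrm{tf}(t_1, d_2) \times \mathrm{idf}(t_1) & \mathrm{tf}(t_2, d_2) \times \mathrm{idf}(t_2) & \mathrm{tf}(t_3, d_2) \times \mathrm{idf}(t_3) & \mathrm{tf}(t_4, d_2) \times \mathrm{idf}(t_4)
\end{bmatrix}
$$

Let’s see now a concrete example of this multiplication:

$$
M_{tf\mbox{-}idf} = M_{tf} \times M_{idf} =
\begin{bmatrix} 0 & 1 & 1 & 1 \\ 0 & 2 & 1 & 0 \end{bmatrix}
\times
\begin{bmatrix} 0.69314718 & 0 & 0 & 0 \\ 0 & -0.40546511 & 0 & 0 \\ 0 & 0 & -0.40546511 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}
=
\begin{bmatrix} 0 & -0.40546511 & -0.40546511 & 0 \\ 0 & -0.81093022 & -0.40546511 & 0 \end{bmatrix}
$$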
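The multiplication by a diagonal matrix can be sketched in plain Python as follows. The helpers `matmul` and `diagonal` are hypothetical names written just for this example, and the idf values are the rounded ones from the idf calculation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def diagonal(values):
    """Build a square matrix with `values` on the main diagonal."""
    size = len(values)
    return [[values[i] if i == j else 0.0 for j in range(size)]
            for i in range(size)]


tf_matrix = [[0, 1, 1, 1],
             [0, 2, 1, 0]]
idf_vector = [0.69314718, -0.40546511, -0.40546511, 0.0]

# Term-frequency matrix on the left, diagonal idf matrix on the right.
tf_idf = matmul(tf_matrix, diagonal(idf_vector))

# Equivalent shortcut: scale column j of every row by idf_vector[j].
column_scaled = [[tf * w for tf, w in zip(row, idf_vector)]
                 for row in tf_matrix]

for row in tf_idf:
    print([round(x, 8) for x in row])
```

With a 2x4 term-frequency matrix and a 4x4 diagonal matrix, the reversed product isn’t even defined, which is another way to see why the order matters here.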

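And finally, we can apply our L2 normalization process to the $M_{tf\mbox{-}idf}$ matrix. Please note that this normalization is **“row-wise”** because we’re going to handle each row of the matrix as a separate vector to be normalized, and not the matrix as a whole:

$$
M_{tf\mbox{-}idf} = \frac{M_{tf\mbox{-}idf}}{\|M_{tf\mbox{-}idf}\|_2} =
\begin{bmatrix} 0 & -0.70710678 & -0.70710678 & 0 \\ 0 & -0.89442719 & -0.4472136 & 0 \end{bmatrix}
$$

And that is our pretty normalized tf-idf weight of our testing document set, which is actually a collection of unit vectors. If you take the L2-norm of each row of the matrix, you’ll see that they all have an L2-norm of 1.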
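The row-wise normalization can be checked with a short sketch like the one below. The function name `l2_normalize_rows` is my own, and the input values are the rounded tf-idf weights (the tf values multiplied by the idf weights).

```python
import math

tf_idf = [[0.0, -0.40546511, -0.40546511, 0.0],
          [0.0, -0.81093022, -0.40546511, 0.0]]


def l2_normalize_rows(matrix):
    """Divide each row by its own L2-norm, treating rows as separate vectors."""
    normalized = []
    for row in matrix:
        length = math.sqrt(sum(x ** 2 for x in row))
        # Leave an all-zero row unchanged to avoid dividing by zero.
        normalized.append([x / length for x in row] if length else list(row))
    return normalized


normalized = l2_normalize_rows(tf_idf)
row_norms = [math.sqrt(sum(x ** 2 for x in row)) for row in normalized]

for row in normalized:
    print([round(x, 8) for x in row])
print([round(n, 8) for n in row_norms])  # [1.0, 1.0]
```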

### Python practice

**Environment used**: Python v.2.7.2, Numpy 1.6.1, Scipy v.0.9.0, Sklearn (Scikits.learn) v.0.9.

Now the section you were waiting for! In this section I’ll use Python to show each step of the tf-idf calculation using the Scikit.learn feature extraction module.

The first step is to create our training and testing document sets and compute the term frequency matrix:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
            "We can see the shining sun, the bright sun.")

# Learn the vocabulary from the training documents.
count_vectorizer = CountVectorizer()
count_vectorizer.fit_transform(train_set)
print "Vocabulary:", count_vectorizer.vocabulary
# Vocabulary: {'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3}

# Count the vocabulary terms in the test documents.
freq_term_matrix = count_vectorizer.transform(test_set)
print freq_term_matrix.todense()
# [[0 1 1 1]
#  [0 2 1 0]]
```

Now that we have the term frequency matrix (called **freq_term_matrix**), we can instantiate the **TfidfTransformer**, which is going to be responsible for calculating the tf-idf weights for our term frequency matrix:

```python
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(norm="l2")
tfidf.fit(freq_term_matrix)

print "IDF:", tfidf.idf_
# IDF: [ 0.69314718 -0.40546511 -0.40546511  0.        ]
```

Note that I’ve specified the norm as L2. This is optional (the default is actually the L2-norm), but I’ve added the parameter to make it explicit that it’s going to use the L2-norm. Also note that you can see the calculated idf weights by accessing the internal attribute called **idf_**. Now that the **fit()** method has calculated the idf for the matrix, let’s transform the **freq_term_matrix** to the tf-idf weight matrix:

```python
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
# [[ 0.         -0.70710678 -0.70710678  0.        ]
#  [ 0.         -0.89442719 -0.4472136   0.        ]]
```

And that is it, the **tf_idf_matrix** is actually our previous $M_{tf\mbox{-}idf}$ matrix. You can accomplish the same effect by using the **Vectorizer** class of Scikit.learn, which is a vectorizer that automatically combines the **CountVectorizer** and the **TfidfTransformer** for you. See this example to know how to use it for the text classification process.

I really hope you liked the post. I tried to make it as simple as possible, even for people without the required mathematical background in linear algebra, etc. In the next Machine Learning post I’m expecting to show how you can use the tf-idf to calculate the cosine similarity.

If you liked it, feel free to comment and make suggestions, corrections, etc.

*Terra Incognita*, 03/10/2011, //www.cpetem.com/2011/10/machine-learning-text-feature-extraction-tf-idf-part-ii/.

### References

Understanding Inverse Document Frequency: on theoretical arguments for IDF

Sklearn text feature extraction code

### Updates

**13 Mar 2015** – *Formatting, fixed image issues.*

**03 Oct 2011** – *Added the info about the environment used for Python examples.*