Tf–idf
In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling.
The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. tf–idf is one of the most popular term-weighting schemes today. A survey conducted in 2015 showed that 83% of text-based recommender systems in digital libraries use tf–idf.
Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf–idf can be successfully used for stop-words filtering in various subject fields, including text summarization and classification.
One of the simplest ranking functions is computed by summing the tf–idf for each query term; many more sophisticated ranking functions are variants of this simple model.
Motivations
Term frequency
Suppose we have a set of English text documents and wish to rank which document is most relevant to the query, "the brown cow". A simple way to start out is by eliminating documents that do not contain all three words "the", "brown", and "cow", but this still leaves many documents. To further distinguish them, we might count the number of times each term occurs in each document; the number of times a term occurs in a document is called its term frequency. However, in the case where the length of documents varies greatly, adjustments are often made. The first form of term weighting is due to Hans Peter Luhn (1957), which may be summarized as:

The weight of a term that occurs in a document is simply proportional to the term frequency.

Inverse document frequency
Because the term "the" is so common, term frequency will tend to incorrectly emphasize documents which happen to use the word "the" more frequently, without giving enough weight to the more meaningful terms "brown" and "cow". The term "the" is not a good keyword to distinguish relevant and non-relevant documents and terms, unlike the less-common words "brown" and "cow". Hence an inverse document frequency factor is incorporated which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.Karen Spärck Jones conceived a statistical interpretation of term-specificity called Inverse Document Frequency, which became a cornerstone of term weighting:
Definition
The tf–idf is the product of two statistics, term frequency and inverse document frequency; there are various ways of determining the exact values of both statistics. In effect, it is a formula that aims to define the importance of a keyword or phrase within a document or a web page.
Term frequency
- Boolean "frequencies": if occurs in and 0 otherwise;
- term frequency adjusted for document length : ÷
- logarithmically scaled frequency: ;
- augmented frequency, to prevent a bias towards longer documents, e.g. raw frequency divided by the raw frequency of the most occurring term in the document:
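As an illustration, here is a minimal Python sketch of these term-frequency variants; the function names (`tf_raw`, `tf_boolean`, `tf_length_adjusted`, `tf_log`, `tf_augmented`) are hypothetical and chosen only for this example.

```python
from collections import Counter
from math import log

def tf_raw(term, doc_tokens):
    # Raw count f_{t,d}: number of times the term occurs in the document.
    return Counter(doc_tokens)[term]

def tf_boolean(term, doc_tokens):
    # 1 if the term occurs in the document, 0 otherwise.
    return 1 if term in doc_tokens else 0

def tf_length_adjusted(term, doc_tokens):
    # Raw count divided by the number of words in the document.
    return Counter(doc_tokens)[term] / len(doc_tokens)

def tf_log(term, doc_tokens):
    # Logarithmically scaled frequency log(1 + f_{t,d}); natural log here, the base is a free choice.
    return log(1 + Counter(doc_tokens)[term])

def tf_augmented(term, doc_tokens):
    # 0.5 + 0.5 * f_{t,d} / (raw count of the most frequent term in the document).
    counts = Counter(doc_tokens)
    return 0.5 + 0.5 * counts[term] / max(counts.values())

doc = "this is a sample this is".split()
print(tf_length_adjusted("this", doc))  # 2/6 ≈ 0.333
print(tf_augmented("this", doc))        # 0.5 + 0.5 * 2/2 = 1.0
```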
Inverse document frequency
| weighting scheme | idf weight ($n_t$ = number of documents containing the term $t$; $N$ = total number of documents) |
| --- | --- |
| unary | 1 |
| inverse document frequency | $\log \frac{N}{n_t}$ |
| inverse document frequency smooth | $\log \left( \frac{N}{1 + n_t} \right) + 1$ |
| inverse document frequency max | $\log \left( \frac{\max_{t' \in d} n_{t'}}{1 + n_t} \right)$ |
| probabilistic inverse document frequency | $\log \frac{N - n_t}{n_t}$ |
The inverse document frequency is a measure of how much information the word provides, i.e., how common or rare it is across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word (obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient):

$$\mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}$$

with
- $N$: total number of documents in the corpus, $N = |D|$
- $|\{d \in D : t \in d\}|$: number of documents where the term $t$ appears. If the term is not in the corpus, this will lead to a division-by-zero. It is therefore common to adjust the denominator to $1 + |\{d \in D : t \in d\}|$ (a brief code sketch of idf follows this list).
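For illustration, here is a minimal Python sketch of the plain and smoothed idf under these definitions; the helper names (`doc_frequency`, `idf`, `idf_smooth`) are hypothetical, and the natural logarithm is used.

```python
from math import log

def doc_frequency(term, corpus):
    # n_t: number of documents in which the term appears.
    return sum(1 for doc in corpus if term in doc)

def idf(term, corpus):
    # Plain inverse document frequency: log(N / n_t).
    return log(len(corpus) / doc_frequency(term, corpus))

def idf_smooth(term, corpus):
    # Smoothed variant, log(N / (1 + n_t)) + 1, which avoids division by
    # zero for terms that appear in no document.
    return log(len(corpus) / (1 + doc_frequency(term, corpus))) + 1

corpus = [
    "this is a a sample".split(),
    "this is another another example example example".split(),
]
print(idf("this", corpus))     # log(2/2) = 0.0
print(idf("example", corpus))  # log(2/1) ≈ 0.693 (natural logarithm)
```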
Term frequency–Inverse document frequency
Then tf–idf is calculated as

$$\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t,D)$$

A high weight in tf–idf is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. Since the ratio inside the idf's log function is always greater than or equal to 1, the value of idf (and tf–idf) is greater than or equal to 0. As a term appears in more documents, the ratio inside the logarithm approaches 1, bringing the idf and tf–idf closer to 0. (A brief code sketch of this product follows the table below.)
Recommended tf–idf weighting schemes:
| weighting scheme | document term weight | query term weight |
| --- | --- | --- |
| 1 | $f_{t,d} \cdot \log \frac{N}{n_t}$ | $\left( 0.5 + 0.5 \, \frac{f_{t,q}}{\max_{t'} f_{t',q}} \right) \cdot \log \frac{N}{n_t}$ |
| 2 | $1 + \log f_{t,d}$ | $\log \left( 1 + \frac{N}{n_t} \right)$ |
| 3 | $(1 + \log f_{t,d}) \cdot \log \frac{N}{n_t}$ | $(1 + \log f_{t,q}) \cdot \log \frac{N}{n_t}$ |
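Putting the two factors together, the following is a minimal Python sketch of the basic product $\mathrm{tf}(t,d) \cdot \mathrm{idf}(t,D)$ and of the simple ranking function mentioned in the introduction (summing tf–idf over the query terms). The helper names `tfidf` and `rank` are hypothetical, and the relative-frequency tf with plain idf is used for concreteness.

```python
from collections import Counter
from math import log

def tfidf(term, doc, corpus):
    # tf: relative frequency of the term within the document.
    tf = Counter(doc)[term] / len(doc)
    # idf: log(N / n_t); a term absent from the corpus gets weight 0 here
    # simply to keep the sketch short (instead of smoothing the denominator).
    n_t = sum(1 for d in corpus if term in d)
    if n_t == 0:
        return 0.0
    return tf * log(len(corpus) / n_t)

def rank(query, corpus):
    # Score each document by summing tf-idf over the query terms
    # (the simple ranking function described above), highest first.
    scores = [(sum(tfidf(t, doc, corpus) for t in query.split()), i)
              for i, doc in enumerate(corpus)]
    return sorted(scores, reverse=True)

corpus = [
    "the brown cow jumped over the fence".split(),
    "the quick brown fox".split(),
    "the cow says moo".split(),
]
print(rank("the brown cow", corpus))
# Document 0, which contains both "brown" and "cow", ranks highest;
# the common word "the" contributes nothing because its idf is 0.
```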
Justification of idf
Idf was introduced, as "term specificity", by Karen Spärck Jones in a 1972 paper. Although it has worked well as a heuristic, its theoretical foundations have been troublesome for at least three decades afterward, with many researchers trying to find information-theoretic justifications for it. Spärck Jones's own explanation did not propose much theory, aside from a connection to Zipf's law. Attempts have been made to put idf on a probabilistic footing, by estimating the probability that a given document $d$ contains a term $t$ as the relative document frequency,

$$P(t \mid D) = \frac{|\{d \in D : t \in d\}|}{N},$$

so that we can define idf as

$$\mathrm{idf}(t) = -\log P(t \mid D) = \log \frac{1}{P(t \mid D)} = \log \frac{N}{|\{d \in D : t \in d\}|}.$$

Namely, the inverse document frequency is the logarithm of the "inverse" relative document frequency.

This probabilistic interpretation in turn takes the same form as that of self-information. However, applying such information-theoretic notions to problems in information retrieval leads to problems when trying to define the appropriate event spaces for the required probability distributions: not only do documents need to be taken into account, but also queries and terms.
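For a tiny worked instance of this interpretation (borrowing the two-document corpus from the example section below and using the base-10 logarithm): the term "example" appears in one of the two documents, so

$$P(\text{"example"} \mid D) = \frac{1}{2}, \qquad \mathrm{idf}(\text{"example"}) = -\log_{10} \frac{1}{2} = \log_{10} 2 \approx 0.301.$$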
Link with Information Theory
Both term frequency and inverse document frequency can be formulated in terms of information theory; this helps to understand why their product has a meaning in terms of the joint informational content of a document. A characteristic assumption about the distribution $p(d,t)$ is that

$$p(d \mid t) = \frac{1}{|\{d \in D : t \in d\}|}.$$

This assumption and its implications, according to Aizawa, "represent the heuristic that tf-idf employs."

Recall the expression of the conditional entropy of a "randomly chosen" document in the corpus $D$, conditional to the fact it contains a specific term $t$ (and assume that all documents have an equal probability of being chosen):

$$H(\mathcal{D} \mid \mathcal{T} = t) = -\sum_d p(d \mid t)\,\log p(d \mid t) = -\log \frac{1}{|\{d \in D : t \in d\}|} = \log \frac{|\{d \in D : t \in d\}|}{|D|} + \log |D| = -\mathrm{idf}(t) + \log |D|$$

In terms of notation, $\mathcal{D}$ and $\mathcal{T}$ are "random variables" corresponding to respectively drawing a document or a term.

Now recall the definition of the mutual information and note that it can be expressed as

$$M(\mathcal{T}; \mathcal{D}) = H(\mathcal{D}) - H(\mathcal{D} \mid \mathcal{T}) = \sum_t p_t \cdot \big( H(\mathcal{D}) - H(\mathcal{D} \mid \mathcal{T} = t) \big) = \sum_t p_t \cdot \mathrm{idf}(t)$$

The last step is to expand $p_t$, the unconditional probability to draw a term, with respect to the (random) choice of a document, to obtain:

$$M(\mathcal{T}; \mathcal{D}) = \sum_{t,d} p_{t \mid d} \cdot p_d \cdot \mathrm{idf}(t) = \sum_{t,d} \mathrm{tf}(t,d) \cdot \frac{1}{|D|} \cdot \mathrm{idf}(t) = \frac{1}{|D|} \sum_{t,d} \mathrm{tf}(t,d) \cdot \mathrm{idf}(t).$$

This expression shows that summing the tf–idf of all possible terms and documents recovers the mutual information between documents and terms, taking into account all the specificities of their joint distribution (for details, see Aizawa). Each tf–idf hence carries the "bit of information" attached to a term–document pair.
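As a sanity check, the following sketch numerically verifies the last expansion step on a toy corpus, under the stated assumptions that documents are drawn uniformly ($p_d = 1/|D|$) and that $p_{t \mid d} = \mathrm{tf}(t,d)$ is the relative frequency of the term within the document. The variable names are hypothetical and the natural logarithm is used.

```python
from collections import Counter
from math import log, isclose

corpus = [
    "this is a a sample".split(),
    "this is another another example example example".split(),
]
N = len(corpus)

# Relative term frequencies p_{t|d} = tf(t, d) for each document.
tf = [{t: c / len(doc) for t, c in Counter(doc).items()} for doc in corpus]
terms = {t for doc in corpus for t in doc}

# idf(t) = log(N / n_t), with n_t the number of documents containing t.
idf = {t: log(N / sum(1 for doc in corpus if t in doc)) for t in terms}

# Unconditional term probability p_t = sum_d p_{t|d} * p_d, with p_d = 1/N.
p_t = {t: sum(d.get(t, 0.0) for d in tf) / N for t in terms}

lhs = sum(p_t[t] * idf[t] for t in terms)            # sum_t p_t * idf(t)
rhs = sum(d[t] * idf[t] for d in tf for t in d) / N  # (1/N) * sum_{t,d} tf * idf
assert isclose(lhs, rhs)
print(lhs, rhs)
```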
Example of tf–idf
Suppose that we have term count tables of a corpus consisting of only two documents, as listed below.

Document 2
| Term | Term Count |
| --- | --- |
| this | 1 |
| is | 1 |
| another | 2 |
| example | 3 |

Document 1
| Term | Term Count |
| --- | --- |
| this | 1 |
| is | 1 |
| a | 2 |
| sample | 1 |
The calculation of tf–idf for the term "this" is performed as follows:

In its raw frequency form, tf is just the frequency of "this" for each document. In each document, the word "this" appears once; but as document 2 has more words, its relative frequency is smaller:

$$\mathrm{tf}(\text{"this"}, d_1) = \frac{1}{5} = 0.2, \qquad \mathrm{tf}(\text{"this"}, d_2) = \frac{1}{7} \approx 0.14$$

An idf is constant per corpus, and accounts for the ratio of documents that include the word "this". In this case, we have a corpus of two documents and all of them include the word "this":

$$\mathrm{idf}(\text{"this"}, D) = \log\left(\frac{2}{2}\right) = 0$$

So tf–idf is zero for the word "this", which implies that the word is not very informative as it appears in all documents:

$$\mathrm{tfidf}(\text{"this"}, d_1, D) = 0.2 \times 0 = 0, \qquad \mathrm{tfidf}(\text{"this"}, d_2, D) = 0.14 \times 0 = 0$$

The word "example" is more interesting: it occurs three times, but only in the second document:

$$\mathrm{tf}(\text{"example"}, d_1) = \frac{0}{5} = 0, \qquad \mathrm{tf}(\text{"example"}, d_2) = \frac{3}{7} \approx 0.429$$

$$\mathrm{idf}(\text{"example"}, D) = \log\left(\frac{2}{1}\right) = 0.301$$

Finally,

$$\mathrm{tfidf}(\text{"example"}, d_1, D) = \mathrm{tf}(\text{"example"}, d_1) \times \mathrm{idf}(\text{"example"}, D) = 0 \times 0.301 = 0$$

$$\mathrm{tfidf}(\text{"example"}, d_2, D) = \mathrm{tf}(\text{"example"}, d_2) \times \mathrm{idf}(\text{"example"}, D) = 0.429 \times 0.301 \approx 0.129$$

(using the base-10 logarithm).
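The same numbers can be reproduced with a short Python sketch (relative-frequency tf, plain idf, base-10 logarithm); the function names and the reconstructed document word lists are hypothetical but match the term counts in the tables above.

```python
from collections import Counter
from math import log10

corpus = {
    "d1": "this is a a sample".split(),
    "d2": "this is another another example example example".split(),
}

def tf(term, doc):
    # Relative frequency of the term within the document.
    return Counter(doc)[term] / len(doc)

def idf(term, corpus):
    # Base-10 logarithm of (number of documents / documents containing the term).
    n_t = sum(1 for doc in corpus.values() if term in doc)
    return log10(len(corpus) / n_t)

for term in ("this", "example"):
    for name, doc in corpus.items():
        print(term, name, round(tf(term, doc) * idf(term, corpus), 3))
# "this"    -> 0.0 for both documents (idf = log10(2/2) = 0)
# "example" -> 0.0 in d1 and ≈ 0.129 in d2
```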