Weighting words using Tf-Idf

Updates

  • 29-Apr-2018 – Added a string instance check for Python 2.7 / Python 3.6 compatibility (Thanks Greg)

If I ask you “Do you remember the article about electrons in the NY Times?”, there’s a better chance you will remember it than if I asked you “Do you remember the article about electrons in the physics books?”. Here’s why: an article about electrons is far less common in the NY Times than in a collection of physics books. You are less likely to stumble upon the “electron” concept in the NY Times than in a physics book.

Let’s now consider the scenario of a single article. Suppose you read an article and you’re asked to rank the concepts found in it by importance. Chances are you’ll basically order the concepts by frequency, simply because the important ones get mentioned repeatedly: the narrative gravitates around them.

Combining the two insights, given a term, a document and a collection of documents, we can loosely say that:
importance ~ appearances(term, document) / count(documents containing term in collection)

This technique is called Tf-Idf – Term Frequency – Inverse Document Frequency. Here’s how the measure is defined:

  • tf = count(word, document) / len(document) – term frequency
  • idf = log( len(collection) / count(documents_containing_term, collection) ) – inverse document frequency
  • tf-idf = tf * idf – term frequency – inverse document frequency
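As a small worked example (using the natural log): if a word appears 3 times in a 100-word document, tf = 3/100 = 0.03; if 10 documents in a 1000-document collection contain it, idf = log(1000/10) ≈ 4.6; so tf-idf ≈ 0.03 * 4.6 ≈ 0.14.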

Let’s test this theory on some data. We’re going to use the Reuters dataset bundled with NLTK.

Let’s build a tokenizer that ignores punctuation and stopwords:
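Here’s a minimal sketch of such a tokenizer, assuming NLTK’s word_tokenize and its English stopword list (both require the corresponding nltk.download resources):

```python
import string

from nltk import word_tokenize
from nltk.corpus import stopwords

# Cache the stopword list once; calling stopwords.words() per token is slow.
STOPWORDS = set(stopwords.words('english'))

def tokenize(text):
    words = word_tokenize(text)               # split into word and punctuation tokens
    words = [w.lower() for w in words]        # normalize case
    return [w for w in words
            if w not in STOPWORDS             # drop stopwords
            and w not in string.punctuation   # drop punctuation tokens
            and not w.isdigit()]              # drop pure numbers
```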

We now need to know all the words inside the collection:
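One way to do that is to tokenize every document in the corpus and keep a global vocabulary. The names vocabulary, word_index, VOCABULARY_SIZE and DOCUMENTS_COUNT below are my own, and this pass over the Reuters corpus takes a while:

```python
from nltk.corpus import reuters   # requires nltk.download('reuters')

vocabulary = set()
for file_id in reuters.fileids():
    vocabulary.update(tokenize(reuters.raw(file_id)))

vocabulary = sorted(vocabulary)                               # fix an ordering
word_index = {word: idx for idx, word in enumerate(vocabulary)}

VOCABULARY_SIZE = len(vocabulary)
DOCUMENTS_COUNT = len(reuters.fileids())
```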

Let’s compute the Idf for every word in the vocabulary:
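A straightforward (if slow) sketch: count, for every word, how many documents contain it, then take the log of the inverse ratio. Adding 1 to the document count is a simple smoothing assumption on my part, to avoid division by zero:

```python
import math
from collections import defaultdict

word_doc_count = defaultdict(int)
for file_id in reuters.fileids():
    # use a set so a word is counted at most once per document
    for word in set(tokenize(reuters.raw(file_id))):
        word_doc_count[word] += 1

word_idf = {word: math.log(DOCUMENTS_COUNT / float(1 + word_doc_count[word]))
            for word in vocabulary}
```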

Let’s write, as an exercise, the numpy parallelized version of the Idf computation:
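One possible numpy version keeps the document counts in a vector indexed through word_index, updates a whole batch of entries at once, and applies the log to the entire array in a single vectorized call. Note that it replaces the word_idf dict from the previous step with a numpy array:

```python
import numpy as np

word_idf = np.zeros(VOCABULARY_SIZE)

for file_id in reuters.fileids():
    indices = [word_index[word] for word in set(tokenize(reuters.raw(file_id)))]
    word_idf[indices] += 1.0            # bump the count for every word in the document at once

# same smoothing assumption as above: add 1 to avoid division by zero
word_idf = np.log(DOCUMENTS_COUNT / (1.0 + word_idf))
```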

Since Idf doesn’t depend on the current document but only on the collection we can preprocess the results as we did above. Here’s the code for the final computation:
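A sketch of the final functions. The isinstance check is the kind of thing the update note above refers to: it lets us pass either a raw string or an already tokenized document (on Python 2 you would check against basestring instead of str):

```python
def word_tf(word, document):
    # term frequency: relative count of the word inside the document
    if isinstance(document, str):       # accept raw strings as well as token lists
        document = tokenize(document)
    return float(document.count(word)) / len(document)

def tf_idf(word, document):
    if isinstance(document, str):
        document = tokenize(document)
    if word not in word_index:          # word never seen in the collection
        return 0.0
    # word_idf here is the numpy vector from the vectorized step
    # (with the dict version, use word_idf[word] instead)
    return word_tf(word, document) * word_idf[word_index[word]]
```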

Let’s run a few computations:
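For example, we can pick any document from the corpus and rank its words; the exact scores will of course depend on the tokenizer and the smoothing choices above:

```python
doc_id = reuters.fileids()[0]                 # any document id from the corpus works here
document = tokenize(reuters.raw(doc_id))

scores = {word: tf_idf(word, document) for word in set(document)}

# print the words sorted by decreasing Tf-Idf score
for word, score in sorted(scores.items(), key=lambda item: -item[1])[:10]:
    print(word, score)
```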

Notice how I sneakily listed the words in decreasing order of their Tf-Idf score.

That’s how we compute Tf-Idf ourselves. Let’s also use some libraries to make the job a bit easier. Note that the scores might differ, but the ordering should be the same; the difference is probably due to different smoothing and normalization strategies.

Here’s the code for computing the Tf-Idf score using scikit-learn:
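A sketch using scikit-learn’s TfidfVectorizer, reusing the tokenizer defined earlier (scikit-learn applies its own smoothing and L2 normalization by default, which is why its absolute scores differ from ours):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [reuters.raw(file_id) for file_id in reuters.fileids()]

vectorizer = TfidfVectorizer(tokenizer=tokenize)
X = vectorizer.fit_transform(documents)        # one row per document

# scores for the first document, ranked highest first
terms = vectorizer.get_feature_names_out()     # get_feature_names() on older scikit-learn
row = X[0].toarray().ravel()
top = sorted(zip(terms, row), key=lambda pair: -pair[1])[:10]
print(top)
```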

Here’s the code for computing the Tf-Idf score using gensim:
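And a sketch using gensim’s Dictionary and TfidfModel (gensim also normalizes the per-document vectors by default, so again expect different numbers but the same ranking):

```python
from gensim.corpora import Dictionary
from gensim.models import TfidfModel

tokenized_documents = [tokenize(reuters.raw(file_id)) for file_id in reuters.fileids()]

dictionary = Dictionary(tokenized_documents)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_documents]

tfidf_model = TfidfModel(bow_corpus)

# scores for the first document, ranked highest first
doc_scores = tfidf_model[bow_corpus[0]]
for term_id, score in sorted(doc_scores, key=lambda pair: -pair[1])[:10]:
    print(dictionary[term_id], score)
```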

Conclusions

  • Tf-Idf is a really popular technique for weighting the importance of the terms inside a collection of documents
  • It is used in Information Retrieval to rank results
  • It is used for extracting keywords from web pages