Building a simple inverted index using NLTK

In this example I want to show how to use some of the tools packed in NLTK to build something pretty awesome. An inverted index maps each word to the documents that contain it. Inverted indexes are a very powerful tool and one of the building blocks of modern search engines.

While building the inverted index, you’ll learn to:
1. Use a stemmer from NLTK
2. Filter words using a stopwords list
3. Tokenize text

The stopwords list is used so that the index doesn’t create an entry for every word in the English language. Ideally, the words in such lists carry no meaning on their own (so, that, the, …).
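To make the filtering step concrete, here is a minimal sketch. NLTK ships a list as `nltk.corpus.stopwords.words("english")`, which needs the `stopwords` corpus to be downloaded first; the small hand-written set and the `content_words` helper below are stand-ins I made up so the sketch runs without that download.

```python
import re

# Hand-written stand-in for a real stopword list (an assumption for this sketch).
# With NLTK and its "stopwords" corpus installed, this could instead be:
#   from nltk.corpus import stopwords
#   STOPWORDS = set(stopwords.words("english"))
STOPWORDS = {"so", "that", "the", "a", "an", "of", "and", "is", "in", "to"}


def content_words(text):
    """Lowercase the text, split it into words, and drop the stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOPWORDS]


print(content_words("The ghost is in the attic, so that is the end of the tour."))
# ['ghost', 'attic', 'end', 'tour']
```

Keeping the list in a `set` makes each membership check cheap, which matters once the index is fed more than a few sentences.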

The stemmer is used to get a common form for different inflections of the base word (watching -> watch, ghostly -> ghost, etc.). The stem of a word is not necessarily a dictionary word. Stemmers use heuristics to find the base form of a word quickly.
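A quick way to see this is to run a few inflections through one of NLTK's stemmers. The sketch below assumes the Snowball English stemmer (`EnglishStemmer`); other NLTK stemmers, such as `PorterStemmer`, expose the same `stem()` method but may produce different stems for some words.

```python
from nltk.stem.snowball import EnglishStemmer

# Assumption: Snowball English stemmer; it needs no extra NLTK data downloads.
stemmer = EnglishStemmer()

for word in ["watching", "watched", "watches", "ghostly"]:
    # stem() applies suffix-stripping rules; there is no dictionary lookup.
    print(word, "->", stemmer.stem(word))
```

Since every inflection of a word collapses to the same stem, the index only needs one entry for all of them.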

If you want the exact dictionary form, I suggest using a lemmatizer such as WordNetLemmatizer (though it is much slower).

Let’s insert some data and do some queries:
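A minimal sketch of such an index, with a few inserts and queries, might look like this. The `InvertedIndex` class, its `add` and `lookup` methods, the sample sentences, and the small stopword set are all assumptions made for illustration; `wordpunct_tokenize` and `EnglishStemmer` are used because they work without downloading extra NLTK data.

```python
from collections import defaultdict

from nltk.stem.snowball import EnglishStemmer
from nltk.tokenize import wordpunct_tokenize

# Stand-in stopword set (assumption); nltk.corpus.stopwords offers a full list.
STOPWORDS = {"the", "a", "an", "is", "are", "was", "so", "that", "of", "in", "by"}


class InvertedIndex:
    """Maps each stemmed, non-stopword token to the ids of documents containing it."""

    def __init__(self, tokenizer, stemmer, stopwords):
        self.tokenizer = tokenizer
        self.stemmer = stemmer
        self.stopwords = set(stopwords)
        self.postings = defaultdict(set)  # stem -> set of document ids
        self.documents = {}               # document id -> original text

    def _terms(self, text):
        """Tokenize, lowercase, drop stopwords and punctuation, then stem."""
        for token in self.tokenizer(text):
            token = token.lower()
            if token.isalpha() and token not in self.stopwords:
                yield self.stemmer.stem(token)

    def add(self, text):
        """Store a document and index its terms; returns the new document id."""
        doc_id = len(self.documents)
        self.documents[doc_id] = text
        for term in self._terms(text):
            self.postings[term].add(doc_id)
        return doc_id

    def lookup(self, word):
        """Return the documents that contain any inflection of `word`."""
        stem = self.stemmer.stem(word.lower())
        return [self.documents[i] for i in sorted(self.postings.get(stem, ()))]


index = InvertedIndex(wordpunct_tokenize, EnglishStemmer(), STOPWORDS)
index.add("The ghost was watching the house.")
index.add("She watched a film about a haunted house.")
index.add("Industrial production is rising.")

print(index.lookup("watches"))  # documents containing "watching" or "watched"
print(index.lookup("houses"))   # documents containing "house"
print(index.lookup("the"))      # stopwords are never indexed, so nothing matches
```

The query word goes through the same stemmer as the indexed text, which is what lets different inflections meet at the same entry.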

As you can see, I can pass inflected forms to the index and still get the correct results.