Building a simple inverted index using NLTK
In this example I want to show how to use some of the tools packed in NLTK to build something pretty awesome. Inverted indexes are a very powerful tool and is one of the building blocks of modern day search engines.
While building the inverted index, you’ll learn to:
1. Use a stemmer from NLTK
2. Filter words using a stopwords list
3. Tokenize text
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 | import nltk from collections import defaultdict from nltk.stem.snowball import EnglishStemmer # Assuming we're working with English class Index: """ Inverted index datastructure """ def __init__(self, tokenizer, stemmer=None, stopwords=None): """ tokenizer -- NLTK compatible tokenizer function stemmer -- NLTK compatible stemmer stopwords -- list of ignored words """ self.tokenizer = tokenizer self.stemmer = stemmer self.index = defaultdict(list) self.documents = {} self.__unique_id = 0 if not stopwords: self.stopwords = set() else: self.stopwords = set(stopwords) def lookup(self, word): """ Lookup a word in the index """ word = word.lower() if self.stemmer: word = self.stemmer.stem(word) return [self.documents.get(id, None) for id in self.index.get(word)] def add(self, document): """ Add a document string to the index """ for token in [t.lower() for t in nltk.word_tokenize(document)]: if token in self.stopwords: continue if self.stemmer: token = self.stemmer.stem(token) if self.__unique_id not in self.index[token]: self.index[token].append(self.__unique_id) self.documents[self.__unique_id] = document self.__unique_id += 1 index = Index(nltk.word_tokenize, EnglishStemmer(), nltk.corpus.stopwords.words('english')) |
The stopwords list is used so that the index doesn’t create an entry for every word in the English language. The words contained in such lists have ideally no semantics by their own(so, that, the,…).
The stemmer is used to get a common form for different inflections of the base word (watching -> watch, ghostly -> ghost, etc…). The stem of the word is not necessarily a dictionary word. Stemmers use heuristic approaches for determining the base form of the word fast.
If you want the exact dictionary form, I suggest using a Lemmatizer like WordnetLemmatizer
(though, it is much slower).
Let’s insert some data and do some queries:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 | # TOP10 Dire straits index.add('Industrial Disease') index.add('Private Investigations') index.add('So Far Away') index.add('Twisting by the Pool') index.add('Skateaway') index.add('Walk of Life') index.add('Romeo and Juliet') index.add('Tunnel of Love') index.add('Money for Nothing') index.add('Sultans of Swing') # TOP10 Led Zeppelin index.add('Stairway To Heaven') index.add('Kashmir') index.add('Achilles Last Stand') index.add('Whole Lotta Love') index.add('Immigrant Song') index.add('Black Dog') index.add('When The Levee Breaks') index.add('Since I\'ve Been Lovin\' You') index.add('Since I\'ve Been Loving You') index.add('Over the Hills and Far Away') index.add('Dazed and Confused') # Let's make some queries: print index.lookup('loves') # ['Tunnel of Love', 'Whole Lotta Love', "Since I've Been Loving You"] print index.lookup('loved') # ['Tunnel of Love', 'Whole Lotta Love', "Since I've Been Loving You"] print index.lookup('daze') # ['Dazed and Confused'] print index.lookup('confusion') # ['Dazed and Confused'] |
As you can see, I can pass inflected forms to the index, and still get the correct results.
Thank u for ur hardworking but can you include also a tutorial for nltk summarizer
Hi Kevin,
It’s definitely on my list. Now I’m working on an extended article on sentiment analysis. Hope you’ll find that useful too 😀
Bogdan.
[…] You can see a stemmer in action in this article about Building an inverted index […]
HI sir it is very much usefull for me. is it possible to give the index list as a database.
Hi Jenish,
This is basically how a database builds its index. If you want to read about such indexes (in the context of PostgreSQL) here’s a starting point:
https://www.postgresql.org/docs/9.6/static/textsearch-indexes.html
Is this what you are looking for?
Bogdan.