Building an NLP pipeline in NLTK

If you have been working with NLTK for some time now, you probably find the task of preprocessing text a bit cumbersome. In this post, I will walk you through a simple and fun approach for performing repetitive tasks using coroutines. Coroutines are a fairly obscure concept, but a very useful one. You can check out this awesome presentation by David Beazley to grasp everything you need to get through this (plus much, much more).
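
To give a feel for the idea, here is a minimal sketch of a coroutine-based preprocessing pipeline. It is not the code from the full post: the stage names (`lowercase`, `tokenize`, `collect`) and the regex tokenizer are assumptions made purely for illustration, and it uses only the standard library.

```python
import re


def coroutine(func):
    """Advance a generator to its first `yield` so it is ready to receive values."""
    def start(*args, **kwargs):
        gen = func(*args, **kwargs)
        next(gen)
        return gen
    return start


@coroutine
def lowercase(target):
    # Hypothetical stage: normalise case, then pass the text on.
    while True:
        text = yield
        target.send(text.lower())


@coroutine
def tokenize(target):
    # Hypothetical stage: naive regex tokenizer standing in for a real one.
    while True:
        text = yield
        target.send(re.findall(r"\w+", text))


@coroutine
def collect(results):
    # Final stage (sink): store whatever reaches the end of the pipeline.
    while True:
        tokens = yield
        results.append(tokens)


results = []
pipeline = lowercase(tokenize(collect(results)))
for sentence in ["Dogs are awesome.", "Dolphins are swimming mammals."]:
    pipeline.send(sentence)

print(results)
```

Each stage only knows about the next one, so adding or reordering a preprocessing step means changing the single line that wires the pipeline together.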

Consider this really simple scenario (although things usually get much more intricate):

Text Chunking with NLTK

What is chunking

Text chunking, also referred to as shallow parsing, is a task that follows Part-Of-Speech Tagging and that adds more structure to the sentence. The result is a grouping of the words in “chunks”. Here’s a quick example:

In other words, in a shallow parse tree, there’s at most one level between the root and the leaves. A deep parse tree looks like this:

There are advantages and drawbacks to using one over the other. The most obvious advantage of shallow parsing is that it’s an easier task, so a shallow parser can be more accurate. Also, working with chunks is much easier than working with full-blown parse trees.
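
As a small sketch of what a shallow parse looks like in practice, the snippet below chunks noun phrases with NLTK's `RegexpParser`. The sentence, its hand-written POS tags and the single `NP` rule are assumptions chosen for illustration; tagging by hand means no tagger model has to be downloaded.

```python
from nltk import RegexpParser, Tree

# Hand-tagged sentence (Penn Treebank tags), so no tagger model is needed.
tagged = [("the", "DT"), ("little", "JJ"), ("dog", "NN"),
          ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

# One chunk rule: an optional determiner, any number of adjectives, then a noun.
parser = RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = parser.parse(tagged)

# Chunks are the subtrees sitting directly under the root; everything else is a plain token.
noun_phrases = [" ".join(word for word, tag in subtree.leaves())
                for subtree in tree if isinstance(subtree, Tree)]

print(tree)
print(noun_phrases)
```

The resulting tree has only one layer of chunks between the root and the words, which is exactly what makes it "shallow".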

Named Entity Recognition with NLTK

Complete guide to build your own Named Entity Recognizer with Python

What is “NER”

NER, short for Named Entity Recognition, is probably the first step towards information extraction from unstructured text. It basically means finding the parts of the text that refer to real-world entities (Person, Organization, Event, etc.).

Why do you need this information? You might want to map it against a knowledge base to understand what the sentence is about, or you might want to extract relationships between different named entities (like who works where, when an event takes place, etc.).

NLTK NER Chunker

NLTK has a standard NE annotator so that we can get started pretty quickly.
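
Here is a rough sketch of how the standard annotator can be called and how its output can be read. `nltk.word_tokenize`, `nltk.pos_tag` and `nltk.ne_chunk` are the usual NLTK entry points, but they need their models downloaded first, so `named_entities` is only defined here, not called. The helper `extract_entities` and the hand-built `sample` tree are assumptions for illustration; the tree mimics the shape of `ne_chunk` output and is not real model output.

```python
import nltk
from nltk import Tree


def named_entities(sentence):
    """Tokenize, POS-tag and NE-chunk a sentence (requires the NLTK models to be downloaded)."""
    tokens = nltk.word_tokenize(sentence)
    return nltk.ne_chunk(nltk.pos_tag(tokens))


def extract_entities(tree):
    """Collect (entity text, label) pairs from a tree shaped like ne_chunk's output."""
    return [(" ".join(word for word, tag in subtree.leaves()), subtree.label())
            for subtree in tree if isinstance(subtree, Tree)]


# Hand-built tree in the shape ne_chunk returns, so this part runs without any models.
sample = Tree("S", [Tree("PERSON", [("Mark", "NNP")]),
                    ("works", "VBZ"), ("at", "IN"),
                    Tree("ORGANIZATION", [("Google", "NNP")])])

print(extract_entities(sample))
```

With the models installed, `extract_entities(named_entities("Mark works at Google."))` would run the full chain; what the annotator actually labels depends on the model version.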

Stemmers vs. Lemmatizers

Stack Overflow is full of questions about why stemmers and lemmatizers don’t work as expected. The root cause of the confusion is that their role is often misunderstood. Here’s a comparison:

  • Both stemmers and lemmatizers try to bring inflected words to the same form
  • Stemmers use an algorithmic approach of removing prefixes and suffixes. The result might not be an actual dictionary word.
  • Lemmatizers look words up in a corpus (such as Wordnet). The result is a dictionary word.
  • Lemmatizers need extra info about the part of speech they are processing. “Calling” can be either a verb or a noun (the calling).
  • Stemmers are faster than lemmatizers

When to use stemmers and when to use lemmatizers

There’s no definite right answer here, but here are a few guidelines:

  • If speed is important, use stemmers (lemmatizers have to search through a corpus while stemmers do simple operations on a string)
  • If you just want to make sure that the system you are building is tolerant of inflections, use stemmers (if you query for “best bar in New York”, you’d accept an article on “Best bars in New York 2016”)
  • If you need the actual dictionary word, use a lemmatizer. (for example, if you are building a natural language generation system)

How do stemmers work

Stemmers are extremely simple to use and very fast. They are usually the preferred choice. They work by applying different transformation rules to the word until no other transformation can be applied.
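
A quick sketch with NLTK's `PorterStemmer`, which needs no extra data downloads. The word list and the `same_stem` helper are assumptions made for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# The stem is not guaranteed to be a dictionary word (see "ponies").
for word in ["bars", "calling", "ponies"]:
    print(word, "->", stemmer.stem(word))


def same_stem(a, b):
    """Hypothetical helper: do two words reduce to the same stem?"""
    return stemmer.stem(a.lower()) == stemmer.stem(b.lower())


print(same_stem("bar", "Bars"))
```

This is the property a search system relies on: a query for "bar" and a document mentioning "Bars" end up with the same term.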

You can see a stemmer in action in this article about Building an inverted index

How do lemmatizers work

As previously mentioned, lemmatizers need to know about the part of speech. This is a substantial disadvantage, since the task of Part-Of-Speech tagging is prone to errors. Here’s how to properly use a lemmatizer:
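
The original example is not included in this excerpt. As a stand-in, here is a minimal sketch using NLTK's `WordNetLemmatizer`. The helpers `wordnet_pos` and `lemmatize` are assumptions for illustration: the first maps Penn Treebank tags (what `nltk.pos_tag` produces) to the single-letter codes the lemmatizer accepts, and the second needs the Wordnet corpus to be downloaded before it can be called.

```python
from nltk.stem import WordNetLemmatizer


def wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to the single-letter POS codes Wordnet uses."""
    if treebank_tag.startswith("J"):
        return "a"  # adjective
    if treebank_tag.startswith("V"):
        return "v"  # verb
    if treebank_tag.startswith("R"):
        return "r"  # adverb
    return "n"      # fall back to noun, which is also the lemmatizer's default


def lemmatize(word, treebank_tag):
    """Lemmatize one word given its POS tag (needs the Wordnet corpus to be downloaded)."""
    return WordNetLemmatizer().lemmatize(word.lower(), pos=wordnet_pos(treebank_tag))


print(wordnet_pos("VBG"), wordnet_pos("NNS"))
```

With the corpus installed, `lemmatize("calling", "VBG")` treats the word as a verb and reduces it to "call", while `lemmatize("calling", "NN")` treats it as a noun and leaves it alone. That difference is why a wrong POS tag leads to a wrong lemma.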

Introduction to Wordnet

Wordnet, getting your hands dirty

Wordnet is a lexical database created at Princeton University. Its size and several properties it holds make Wordnet one of the most useful tools you can have in your NLP arsenal.

Here are a few properties that make Wordnet so useful:

  • Synonyms are grouped together in something called Synset
  • A synset contains lemmas, which are the base form of a word
  • There are hierarchical links between synsets (ISA relations or hypernym/hyponym relations)
  • Several other properties such as antonyms or related words are included for each lemma in the synset
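
The properties above can be explored with a short sketch like the one below. The helper name `synset_summary` is an assumption; the calls it makes (`synsets`, `lemma_names`, `hypernyms`, `antonyms`) are part of NLTK's Wordnet interface, and the Wordnet corpus has to be downloaded before the function can be called.

```python
from nltk.corpus import wordnet as wn


def synset_summary(word):
    """Summarise the first synset of a word (needs the Wordnet corpus to be downloaded)."""
    synset = wn.synsets(word)[0]
    return {
        "name": synset.name(),
        "definition": synset.definition(),
        # Lemmas: the base forms grouped in this synset.
        "lemmas": synset.lemma_names(),
        # Hypernyms: the more general synsets this one "is a" kind of.
        "hypernyms": [h.name() for h in synset.hypernyms()],
        # Antonyms are attached to lemmas, not to the synset itself.
        "antonyms": [a.name() for lemma in synset.lemmas() for a in lemma.antonyms()],
    }
```

Calling `synset_summary("dog")` returns a plain dictionary, which is handy for printing or inspecting in a REPL.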

Compute sentence similarity using Wordnet

It’s common in the world of Natural Language Processing to need to compute sentence similarity. Wordnet is an awesome tool and you should always keep it in mind when working with text. It’s of great help for the task we’re trying to tackle.

Suppose we have these sentences:

  • “Dogs are awesome.”
  • “Some gorgeous creatures are felines.” (OK, maybe not the most common sentence structure, but bear with me)
  • “Dolphins are swimming mammals.”

Say we want to know which sentence is closest to “Cats are beautiful animals.”

Recipe: Text classification using NLTK and scikit-learn

Text classification is probably the most frequently encountered Natural Language Processing task. It can be described as assigning texts to an appropriate bucket. A sports article should go in SPORT_NEWS, and a medical prescription should go in MEDICAL_PRESCRIPTIONS.

To train a text classifier, we need some annotated data. This training data can be obtained through several methods. Suppose you want to build a spam classifier. You would export the contents of your mailbox, then label the emails in the inbox folder as NOT_SPAM and the contents of your spam folder as SPAM.
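
As a rough sketch of that labelling step followed by training, the snippet below uses scikit-learn. The six toy emails are made up for illustration, and the choice of `CountVectorizer` with `MultinomialNB` is an assumption, not necessarily what the full recipe uses.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for an exported mailbox (hypothetical data).
inbox = ["meeting at noon tomorrow", "see you at lunch", "project report attached"]
spam_folder = ["win money now", "free prize claim now", "cheap pills offer"]

# Label every message according to the folder it came from.
texts = inbox + spam_folder
labels = ["NOT_SPAM"] * len(inbox) + ["SPAM"] * len(spam_folder)

# Bag-of-words counts feeding a Naive Bayes classifier.
classifier = make_pipeline(CountVectorizer(), MultinomialNB())
classifier.fit(texts, labels)

print(classifier.predict(["free money offer"]))
```

A real mailbox would of course need far more data and a held-out set to measure accuracy; the point here is only how folder membership turns into labels.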

Building a simple inverted index using NLTK

In this example I want to show how to use some of the tools packed in NLTK to build something pretty awesome. Inverted indexes are a very powerful tool and are one of the building blocks of modern-day search engines.

While building the inverted index, you’ll learn to:
1. Use a stemmer from NLTK
2. Filter words using a stopwords list
3. Tokenize text
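
The three steps above can be sketched together as follows. This is not the code from the full post: the function names (`terms`, `build_index`, `search`), the sample documents and the tiny hard-coded stopword set are assumptions. The stopword set stands in for NLTK's stopwords corpus, and `wordpunct_tokenize` is used because it needs no model download.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

# Tiny stand-in for a real stopwords list (assumption, to avoid a corpus download).
STOPWORDS = {"the", "in", "a", "an", "of", "are", "is"}
stemmer = PorterStemmer()


def terms(text):
    """Tokenize, drop punctuation and stopwords, then stem what is left."""
    tokens = wordpunct_tokenize(text.lower())
    return [stemmer.stem(t) for t in tokens if t.isalnum() and t not in STOPWORDS]


def build_index(documents):
    """Map each stemmed term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in terms(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every term of the query."""
    postings = [index.get(term, set()) for term in terms(query)]
    return set.intersection(*postings) if postings else set()


documents = {
    1: "Best bars in New York 2016",
    2: "The best bar in town",
    3: "Dogs are awesome.",
}
index = build_index(documents)

print(search(index, "bars"))
print(search(index, "best bar in New York"))
```

Because queries and documents go through the same `terms` function, "bar" and "bars" land on the same index entry.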