Building an NLP pipeline in NLTK

If you have been working with NLTK for some time, you probably find preprocessing text a bit cumbersome. In this post, I will walk you through a simple and fun approach to these repetitive tasks using coroutines. Coroutines are a fairly obscure concept, but a very useful one. You can check out this awesome presentation by David Beazley to grasp everything needed to get you through this post (plus much, much more).

Consider this really simple scenario (although things usually get much more intricate):
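The original snippet did not survive here, but a minimal reconstruction of such a flat flow might look like the following. The two helper functions are naive stand-ins for `nltk.sent_tokenize` and `nltk.word_tokenize`, so the sketch runs without NLTK's data files:

```python
def sent_tokenize(text):
    # Stand-in for nltk.sent_tokenize: naive '.'-based splitting.
    return [s.strip() for s in text.split('.') if s.strip()]

def word_tokenize(sentence):
    # Stand-in for nltk.word_tokenize: whitespace splitting.
    return sentence.split()

text = "Coroutines consume data. Generators produce data."
for sentence in sent_tokenize(text):
    print(word_tokenize(sentence))
```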

This is the most common way of using NLTK’s functions. Wouldn’t it be nice to make this a bit more reusable? The things we usually do with NLTK fit perfectly into the pipeline model. In fact, some NLP frameworks (CoreNLP, GATE) are built around this model.

Let’s make use of coroutines and develop such a pipeline system for NLTK (Python style, of course). This can indeed be really powerful and, even more importantly, much more enjoyable to work with.

We’ll be using the coroutine decorator on the pipeline components (it’s copied from the presentation, so please check that out). Coroutines are just generators that consume data instead of generating it. The decorator advances a freshly created coroutine to its first yield, so we can immediately send data to it.
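Here is a sketch of that decorator, essentially the version from the presentation, updated for Python 3:

```python
from functools import wraps

def coroutine(func):
    """Prime a generator by advancing it to its first yield,
    so callers can .send() values into it immediately."""
    @wraps(func)
    def start(*args, **kwargs):
        cr = func(*args, **kwargs)
        next(cr)
        return cr
    return start
```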

A pipeline system has three types of components:

  • Sources: the initial producers of the data (these are not coroutines)
  • Pipelines: coroutines that actually process the data (they transform, filter, or compose it)
  • Sinks: coroutines that don’t pass data along (usually they display or store it)

Let’s define a simple source that just iterates through a list of texts and passes them to some targets (pipelines or sinks):
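A minimal sketch of such a source (the name and signature are my own choices). It is a plain function, not a coroutine, because it produces data rather than consuming it:

```python
def source(texts, *targets):
    # Drive the pipeline: push every text into each downstream target.
    for text in texts:
        for target in targets:
            target.send(text)
```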

And now for the really fun part, the actual processing pipelines:
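A sketch of two such pipeline components, with the coroutine decorator repeated so the snippet stands alone. In the original these would wrap `nltk.sent_tokenize` and `nltk.word_tokenize`; naive stand-ins are used here so the sketch needs no NLTK data files:

```python
from functools import wraps

def coroutine(func):
    # Prime a generator so it can accept .send() immediately.
    @wraps(func)
    def start(*args, **kwargs):
        cr = func(*args, **kwargs)
        next(cr)
        return cr
    return start

@coroutine
def sent_tokenize_pipeline(*targets):
    # Would wrap nltk.sent_tokenize; a naive '.'-splitter stands in.
    while True:
        text = (yield)
        for sentence in (s.strip() for s in text.split('.') if s.strip()):
            for target in targets:
                target.send(sentence)

@coroutine
def word_tokenize_pipeline(*targets):
    # Would wrap nltk.word_tokenize; str.split stands in.
    while True:
        sentence = (yield)
        for target in targets:
            target.send(sentence.split())
```

Note that each component accepts any number of targets, which is what later lets us tap intermediate stages.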

These are some of the most commonly used NLTK functions, wrapped in coroutines. The final preparation step is creating a basic sink: the component that sits at the end of a pipe and performs a final operation on the processed data (storing, printing, etc.).
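A sketch of the simplest possible sink, again with the decorator inlined so it runs standalone:

```python
from functools import wraps

def coroutine(func):
    @wraps(func)
    def start(*args, **kwargs):
        cr = func(*args, **kwargs)
        next(cr)
        return cr
    return start

@coroutine
def printer():
    # End of the line: print whatever arrives.
    while True:
        print((yield))
```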

As you may have guessed, this just prints whatever we send it (not the smartest component). Let’s put everything together and create a workflow equivalent to the first one:
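A sketch of the wiring, with everything redefined inline so it runs standalone (the NLTK calls are again replaced by stand-ins):

```python
from functools import wraps

def coroutine(func):
    @wraps(func)
    def start(*args, **kwargs):
        cr = func(*args, **kwargs)
        next(cr)
        return cr
    return start

def source(texts, *targets):
    for text in texts:
        for target in targets:
            target.send(text)

@coroutine
def sent_tokenize_pipeline(*targets):
    # nltk.sent_tokenize in the original; '.'-split stand-in here.
    while True:
        text = (yield)
        for s in (p.strip() for p in text.split('.') if p.strip()):
            for target in targets:
                target.send(s)

@coroutine
def word_tokenize_pipeline(*targets):
    # nltk.word_tokenize in the original; str.split stand-in here.
    while True:
        sentence = (yield)
        for target in targets:
            target.send(sentence.split())

@coroutine
def printer():
    while True:
        print((yield))

# Equivalent of the flat version: source -> sentences -> words -> print.
source(["Coroutines consume data. Generators produce data."],
       sent_tokenize_pipeline(word_tokenize_pipeline(printer())))
```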

I know what you’re thinking: this isn’t really much shorter. True, but the advantages become more obvious as you enrich the workflow. Let’s print the intermediary results by attaching more sinks:
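Because every component accepts multiple targets, tapping an intermediate stage is just one more argument. A standalone sketch (stand-ins again for the NLTK calls, and a `prefix` parameter on the printer is my own addition to tell the taps apart):

```python
from functools import wraps

def coroutine(func):
    @wraps(func)
    def start(*args, **kwargs):
        cr = func(*args, **kwargs)
        next(cr)
        return cr
    return start

def source(texts, *targets):
    for text in texts:
        for target in targets:
            target.send(text)

@coroutine
def sent_tokenize_pipeline(*targets):
    # nltk.sent_tokenize in the original; '.'-split stand-in here.
    while True:
        text = (yield)
        for s in (p.strip() for p in text.split('.') if p.strip()):
            for target in targets:
                target.send(s)

@coroutine
def word_tokenize_pipeline(*targets):
    # nltk.word_tokenize in the original; str.split stand-in here.
    while True:
        sentence = (yield)
        for target in targets:
            target.send(sentence.split())

@coroutine
def printer(prefix=''):
    while True:
        print(prefix + str((yield)))

# The extra printer taps each sentence before it is word-tokenized.
source(["Hello there. Bye now."],
       sent_tokenize_pipeline(
           printer('sentence: '),
           word_tokenize_pipeline(printer('words: '))))
```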

Suppose we want to filter out short sentences (fewer than 10 words, for example). Let’s create a filtering component:
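A sketch of such a filter (the name and `min_words` parameter are my own choices). It slots in anywhere between the sentence tokenizer and its downstream consumers:

```python
from functools import wraps

def coroutine(func):
    @wraps(func)
    def start(*args, **kwargs):
        cr = func(*args, **kwargs)
        next(cr)
        return cr
    return start

@coroutine
def sentence_length_filter(min_words, *targets):
    # Only sentences with at least min_words words get through;
    # shorter ones are silently dropped.
    while True:
        sentence = (yield)
        if len(sentence.split()) >= min_words:
            for target in targets:
                target.send(sentence)
```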

Stuff just got cooler! Happy piping!