Building an NLP pipeline in NLTK
If you have been working with NLTK for some time now, you probably find the task of preprocessing text a bit cumbersome. In this post, I will walk you through a simple and fun approach for performing repetitive tasks using coroutines. Coroutines are a fairly obscure concept, but a very useful one indeed. You can check out this awesome presentation by David Beazley to grasp everything you need to get through this post (plus much, much more).
Consider this really simple scenario (although things usually get much more intricate):
```python
import nltk

# `texts` is assumed to be an iterable of raw strings
for text in texts:
    sentences = nltk.sent_tokenize(text)
    for sentence in sentences:
        words = nltk.word_tokenize(sentence)
        tagged_words = nltk.pos_tag(words)
        ne_tagged_words = nltk.ne_chunk(tagged_words)
        print(ne_tagged_words)
```
This is the most common way of using NLTK’s functions. Wouldn’t it be nice to make this a bit more reusable? The things we usually do with NLTK fit perfectly into the pipeline model. In fact, some NLP frameworks are built around this model (CoreNLP, GATE).
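One practical note before going further: sent_tokenize, pos_tag, and ne_chunk rely on NLTK data packages that don’t ship with the library itself. If the snippet above raises a LookupError, downloading the resources first should fix it; the package names below are the ones recent NLTK versions ask for, so treat them as an assumption and adjust if your version complains about different names:

```python
import nltk

# One-time downloads of the models/corpora used in this post.
# Exact package names may differ slightly across NLTK versions.
nltk.download('punkt')                        # sentence & word tokenizers
nltk.download('averaged_perceptron_tagger')   # POS tagger model
nltk.download('maxent_ne_chunker')            # named-entity chunker
nltk.download('words')                        # word list used by the NE chunker
```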
Let’s make use of coroutines and develop such a pipeline system for NLTK (Python style, of course). This can indeed be really powerful and, even more importantly, much more enjoyable to work with.
We’ll be using the coroutine decorator on the pipeline components (it’s essentially the one from the presentation, so please check that out). Coroutines are just generators that consume data instead of producing it. The decorator advances the coroutine to its first yield, so we can immediately start sending data to it.
```python
def coroutine(func):
    def start(*args, **kwargs):
        cr = func(*args, **kwargs)
        next(cr)  # advance to the first `yield` so the coroutine is ready to receive data
        return cr
    return start
```
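To see what the decorator buys us, here’s a tiny toy coroutine (the shouter name is made up purely for illustration). Without @coroutine you would have to call next() on it once yourself before the first send():

```python
@coroutine
def shouter():
    # Toy coroutine: consumes strings and prints them upper-cased.
    while True:
        text = (yield)
        print(text.upper())

s = shouter()          # already primed by the decorator
s.send("hello there")  # prints: HELLO THERE
```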
A pipeline system has 3 types of components:
- Source: the initial source of the data (this is not a coroutine)
- Pipelines: what actually processes the data (operate, filter, compose)
- Sinks: coroutines that don’t pass data any further (usually they display or store it)
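Before wiring NLTK into this, here is a minimal end-to-end sketch of the three roles working together (toy_source, strip_pipeline, and collector are made-up names, just for illustration):

```python
def toy_source(items, targets):
    # Source: a plain function that pushes raw items into the pipeline.
    for item in items:
        for t in targets:
            t.send(item)

@coroutine
def strip_pipeline(targets):
    # Pipeline: receives data, transforms it, forwards it.
    while True:
        item = (yield)
        for t in targets:
            t.send(item.strip())

@coroutine
def collector(results):
    # Sink: end of the line, just accumulates whatever arrives.
    while True:
        results.append((yield))

results = []
toy_source(["  hello ", " world  "],
           targets=[strip_pipeline(targets=[collector(results)])])
print(results)  # ['hello', 'world']
```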
Let’s define a simple source that just iterates through a list of texts and passes them to some targets (pipelines or sinks):
```python
texts = [
    """
    Babylon was a significant city in ancient Mesopotamia, in the fertile plain
    between the Tigris and Euphrates rivers. The city was built upon the Euphrates,
    and divided in equal parts along its left and right banks, with steep
    embankments to contain the river's seasonal floods.
    """,
    """
    Hammurabi was the sixth Amorite king of Babylon from 1792 BC to 1750 BC middle
    chronology. He became the first king of the Babylonian Empire following the
    abdication of his father, Sin-Muballit, who had become very ill and died,
    extending Babylon's control over Mesopotamia by winning a series of wars
    against neighboring kingdoms.
    """,
]


def source(texts, targets):
    for text in texts:
        for t in targets:
            t.send(text)
```
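The source doesn’t have to be a hard-coded list, of course. If your texts live in .txt files on disk, a source along these lines would work just as well (file_source, the glob pattern, and the UTF-8 encoding are assumptions; adjust them to your data):

```python
import glob

def file_source(pattern, targets):
    # Read every matching text file and push its contents downstream.
    for path in glob.glob(pattern):
        with open(path, encoding="utf-8") as f:
            text = f.read()
        for t in targets:
            t.send(text)
```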
And now for the really fun part, the actual processing pipelines:
```python
@coroutine
def sent_tokenize_pipeline(targets):
    while True:
        text = (yield)
        sentences = nltk.sent_tokenize(text)
        for sentence in sentences:
            for target in targets:
                target.send(sentence)

@coroutine
def word_tokenize_pipeline(targets):
    while True:
        sentence = (yield)
        words = nltk.word_tokenize(sentence)
        for target in targets:
            target.send(words)

@coroutine
def pos_tag_pipeline(targets):
    while True:
        words = (yield)
        tagged_words = nltk.pos_tag(words)
        for target in targets:
            target.send(tagged_words)

@coroutine
def ne_chunk_pipeline(targets):
    while True:
        tagged_words = (yield)
        ner_tagged = nltk.ne_chunk(tagged_words)
        for target in targets:
            target.send(ner_tagged)
```
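Nothing limits the stages to functions NLTK ships wrappers for; any transformation that takes data in and pushes data out fits the same pattern. For instance, a purely illustrative stemming stage (stem_pipeline is a made-up name) that you could hang off the word tokenizer in a separate branch might look like this:

```python
from nltk.stem import PorterStemmer

@coroutine
def stem_pipeline(targets):
    stemmer = PorterStemmer()
    while True:
        words = (yield)
        stemmed = [stemmer.stem(word) for word in words]
        for target in targets:
            target.send(stemmed)
```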
These are the most commonly used NLTK functions, wrapped in coroutines. The final preparation step: creating a basic sink, the component that sits at the end of a pipe and performs a final operation on the processed data (storing, printing, etc.).
```python
@coroutine
def printer():
    while True:
        line = (yield)
        print(line)
```
As you probably guessed, this just prints whatever we send it (not the smartest component). Let’s put everything together and create a workflow equivalent to the first one:
```python
source(texts, targets=[
    sent_tokenize_pipeline(targets=[
        word_tokenize_pipeline(targets=[
            pos_tag_pipeline(targets=[
                ne_chunk_pipeline(targets=[printer()]),
            ])
        ])
    ])
])
```
I know what you’re thinking: this is not really much shorter. True, but the advantages become more obvious as you further enrich the workflow. Let’s print the intermediate results by placing more sinks:
```python
source(texts, targets=[
    sent_tokenize_pipeline(targets=[
        printer(),  # print the raw sentences
        word_tokenize_pipeline(targets=[
            printer(),  # print the tokenized sentences
            pos_tag_pipeline(targets=[
                printer(),  # print the tagged sentences
                ne_chunk_pipeline(targets=[printer()]),
            ])
        ])
    ])
])
```
Suppose we want to filter out short sentences (fewer than 10 words, for example). Let’s create a filtering component:
```python
@coroutine
def filter_short(min_len, targets):
    while True:
        words = (yield)
        if len(words) < min_len:
            continue
        for target in targets:
            target.send(words)


source(texts, targets=[
    sent_tokenize_pipeline(targets=[
        printer(),
        word_tokenize_pipeline(targets=[
            printer(),
            filter_short(10, targets=[  # Filter
                pos_tag_pipeline(targets=[
                    printer(),
                    ne_chunk_pipeline(targets=[printer()]),
                ])
            ])
        ])
    ])
])
```
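Sinks are just as easy to swap out. If you would rather store the results than print them, a sink that appends each incoming item to a file could look roughly like this (file_writer, the file name, and the str() formatting are assumptions for the sketch):

```python
@coroutine
def file_writer(path):
    # Append every incoming result to a file, one item per line.
    with open(path, "a", encoding="utf-8") as f:
        while True:
            item = (yield)
            f.write(str(item) + "\n")
            f.flush()
```

Plug it in wherever a printer() sits in the workflows above, e.g. ne_chunk_pipeline(targets=[file_writer("entities.txt")]).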
Stuff just got cooler! Happy piping!
Hello,
I would like to replicate your work using your code. I am very new to Python. If I use a file.txt as the input file, how can I tell Python to write the output (a file object) to a new file after it has been tokenized, POS-tagged, etc.?
Many thanks,
Sarita
Hey Sarita,
This is more of a beginner Python question than an NLP question. Doing NLP with Python implies you have a good knowledge of basic Python programming. If you give me a specific snippet I can try tweaking it, but it’s best to start with the basics 🙂