Introduction to NLTK

NLTK (Natural Language Toolkit) is one of the most popular Python frameworks for working with human language. There’s some controversy over whether NLTK is appropriate for production environments. Here’s my take on the matter:

* NLTK doesn’t come with super powerful trained models, unlike frameworks such as Stanford CoreNLP
* NLTK is perfect for getting started with NLP because it’s packed with examples and has a clear, concise, easy-to-understand API
* NLTK comes with a lot of corpora conveniently organised and easily accessible for experiments
* NLTK provides a simple, Pythonic foundation that’s easy to extend
* NLTK contains some beautiful abstractions, like nltk.Tree, which is used in tasks such as chunking and syntactic parsing
* NLTK contains useful functions for quick analysis (taking a quick look at the data)
* NLTK is certainly the place for getting started with NLP

You might not use the models in NLTK, but you can extend the excellent base classes and use your own trained models, built using other libraries like scikit-learn or TensorFlow. Here are some examples of training your own NLP models: “Training a POS Tagger with NLTK and scikit-learn” and “Train a NER System”. NLTK also provides some interfaces to external tools like the StanfordPOSTagger or SennaTagger.
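To make the idea of wrapping your own model concrete, here’s a minimal sketch that plugs a custom predictor into NLTK’s TaggerI interface. The class name SuffixRuleTagger and the toy_predict function are made up for illustration; in a real project the predictor would be a model trained with scikit-learn or a similar library.

```python
from nltk.tag.api import TaggerI


class SuffixRuleTagger(TaggerI):
    """Hypothetical wrapper: exposes any token -> tag callable as an NLTK tagger."""

    def __init__(self, predict):
        self._predict = predict

    def tag(self, tokens):
        # TaggerI only requires tag(); tag_sents() is inherited from the interface.
        return [(token, self._predict(token)) for token in tokens]


def toy_predict(token):
    # Stand-in for a trained model (assumption: three crude suffix rules).
    if token.endswith("ing"):
        return "VBG"
    if token.endswith("ly"):
        return "RB"
    return "NN"


tagger = SuffixRuleTagger(toy_predict)
tagged = tagger.tag(["running", "quickly", "dog"])
print(tagged)
```

Because the class implements TaggerI, anything in NLTK that expects a tagger should be able to work with it, whatever library produced the underlying model.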

So, long story short, NLTK is awesome because it provides a great starting point. You shouldn’t ignore it because it may be the foundation for your next application.

Natural Language Processing with NLTK

NLTK is well organized and plays well with the NLP Pyramid. Its various functions, shortcuts and interfaces can be organized in a hierarchy as follows:

| Module | Shortcuts | Data Structures | Interfaces | NLP Pyramid |
| --- | --- | --- | --- | --- |
| nltk.stem, nltk.text, nltk.tokenize | word_tokenize, sent_tokenize | str, nltk.Text => [str] | StemmerI, TokenizerI | Morphology |
| nltk.tag, nltk.chunk | pos_tag | [str] => [(str, tag)], nltk.Tree | TaggerI, ParserI, ChunkParserI | Syntax |
| nltk.chunk, nltk.sem | ne_chunk | nltk.Tree, nltk.DependencyGraph | ParserI, ChunkParserI | Semantics |
| nltk.sem.drt | | Expression | | Pragmatics |

Here’s an example of quickly passing through the first 3 levels of the NLP Pyramid:
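The sketch below walks through the three levels using components that need no downloaded models: a regex tokenizer, a toy RegexpTagger and a RegexpParser chunk grammar. They stand in for the word_tokenize, pos_tag and ne_chunk shortcuts, which would be the usual choice once their models have been fetched with nltk.download(). The tag patterns and the NE grammar are assumptions written for this one sentence, not general-purpose rules.

```python
from nltk.chunk import RegexpParser
from nltk.tag import RegexpTagger
from nltk.tokenize import wordpunct_tokenize

sentence = "John works at Google in London"

# Level 1 (morphology): split the raw string into tokens.
tokens = wordpunct_tokenize(sentence)

# Level 2 (syntax): tag each token; the first matching pattern wins.
tagger = RegexpTagger([
    (r"^[A-Z].*$", "NNP"),   # capitalised word -> proper noun
    (r"^(at|in)$", "IN"),    # two hard-coded prepositions
    (r".*s$", "VBZ"),        # crude guess: ends in "s" -> verb
    (r".*", "NN"),           # fallback
])
tagged = tagger.tag(tokens)

# Level 3 (semantics): group runs of proper nouns into "NE" chunks.
chunker = RegexpParser("NE: {<NNP>+}")
tree = chunker.parse(tagged)

entities = [
    " ".join(word for word, _tag in subtree.leaves())
    for subtree in tree.subtrees()
    if subtree.label() == "NE"
]
print(entities)
```

The result of the last step is an nltk.Tree, the same data structure ne_chunk returns.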

Quick peek with nltk.Text

When doing information extraction or text transformation, nltk.Text might come in handy. Here are a few shortcuts to getting some quick insights from the text:
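As a rough illustration, the following sketch wraps a short made-up passage in nltk.Text and tries a few of its helpers. The sample sentences are invented for the example; count(), index(), vocab() and concordance() are regular nltk.Text methods.

```python
from nltk.text import Text
from nltk.tokenize import wordpunct_tokenize

raw = "The cat sat on the mat. The dog sat on the log. The cat saw the dog."
text = Text(wordpunct_tokenize(raw.lower()))

cat_count = text.count("cat")   # how many times a token occurs
first_dog = text.index("dog")   # position of the first occurrence
vocab = text.vocab()            # a FreqDist over all tokens

# Print every occurrence of "sat" with a little context on both sides.
text.concordance("sat", width=40)
```

Other helpers such as similar() and common_contexts() follow the same pattern; collocations() additionally relies on the stopwords corpus being downloaded.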

Basic text stats with NLTK

This is an example of computing some basic text statistics using NLTK’s FreqDist class:
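A minimal sketch, assuming a single invented sentence as the data: build a FreqDist from a list of tokens and read counts from it the same way you would from a dict.

```python
from nltk import FreqDist
from nltk.tokenize import wordpunct_tokenize

raw = "the quick brown fox jumps over the lazy dog and the cat"
tokens = wordpunct_tokenize(raw)

fdist = FreqDist(tokens)

the_count = fdist["the"]        # count of a single token
missing = fdist["unicorn"]      # unseen tokens count as 0, no KeyError
distinct = len(fdist)           # number of distinct tokens
has_fox = "fox" in fdist        # membership test, as with any dict

print(the_count, missing, distinct, has_fox)
```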

FreqDist extends Python’s dict class, so much of the behaviour is inherited from there. It does provide some neat additional functionality:
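Here is a short sketch of a few of those extras on the same kind of toy input (the sentence is invented for the example): N() for the total number of tokens, B() for the number of distinct ones, freq() for relative frequency, max() for the most frequent token and hapaxes() for tokens seen exactly once.

```python
from nltk import FreqDist

tokens = "the quick brown fox jumps over the lazy dog and the cat".split()
fdist = FreqDist(tokens)

total = fdist.N()               # total number of tokens counted
distinct = fdist.B()            # number of distinct tokens ("bins")
the_ratio = fdist.freq("the")   # relative frequency: count / total
top_token = fdist.max()         # token with the highest count
once_only = fdist.hapaxes()     # tokens that occur exactly once

print(total, distinct, the_ratio, top_token, once_only)
```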

Word count for the most frequent words
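One way to get these counts is most_common(), which FreqDist inherits from collections.Counter. The token list below is invented for the example.

```python
from nltk import FreqDist

tokens = "the cat and the dog and the bird".split()
fdist = FreqDist(tokens)

# The n most frequent tokens with their counts, highest first.
top_two = fdist.most_common(2)

for word, count in top_two:
    print(f"{word}: {count}")
```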

Bigrams, Trigrams, Collocations

In natural language processing, you’ll often work with bigrams and trigrams. If you don’t know what they are yet, fear not, because the idea is really simple. Bigrams are pairs of consecutive words and trigrams are triplets of consecutive words. Bigrams and trigrams are used to create an approximate model of the language. Here are some shortcuts for computing bigrams and trigrams:
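A minimal sketch on a four-word token list: bigrams(), trigrams() and the more general ngrams() all live in nltk.util and return generators, so they are wrapped in list() here to make the results visible.

```python
from nltk.util import bigrams, ngrams, trigrams

tokens = ["the", "quick", "brown", "fox"]

pairs = list(bigrams(tokens))      # consecutive pairs
triples = list(trigrams(tokens))   # consecutive triplets
fours = list(ngrams(tokens, 4))    # general form: any window size

print(pairs)
print(triples)
print(fours)
```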

One interesting statistic one can extract from bigrams and trigrams is the list of collocations. Collocations are pairs or triplets of words that appear together more frequently than expected, judging by the frequency of the individual words.
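As a sketch of how this can be computed, the example below uses BigramCollocationFinder with pointwise mutual information (PMI), which compares how often a pair occurs with what the individual word counts would predict. The token list is invented, and the frequency threshold of 2 is an arbitrary choice for such a tiny sample.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = (
    "new york is big . new york is busy . i like new york . "
    "the city is big and the city is busy"
).split()

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)   # ignore bigrams seen fewer than 2 times

measures = BigramAssocMeasures()
top_pairs = finder.nbest(measures.pmi, 2)   # two highest-scoring bigrams
new_york_score = finder.score_ngram(measures.pmi, "new", "york")

print(top_pairs, new_york_score)
```

TrigramCollocationFinder and TrigramAssocMeasures work along the same lines for triplets. On real text, other measures such as likelihood_ratio are worth trying too, since PMI tends to favour rare pairs.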

Bundled corpora

One of the cool things about NLTK is that it comes with bundled corpora. OK, you need to use nltk.download() to get them the first time you install NLTK, but after that you can use the corpora in any of your projects. Having corpora handy is good, because you might want to run quick experiments, train models on properly formatted data or compute some quick text stats. You’ve already stumbled upon some examples of using the corpora in this tutorial. Here are some of the most representative ones:

  • WordNet (nltk.corpus.wordnet) – English Lexical Database
  • Gutenberg (nltk.corpus.gutenberg) – Books from Project Gutenberg
  • WebText (nltk.corpus.webtext) – User generated content on the web
  • Brown (nltk.corpus.brown) – Text categorized by genre
  • Reuters (nltk.corpus.reuters) – News Corpus
  • Words (nltk.corpus.words) – English Vocabulary
  • SentiWordNet (nltk.corpus.sentiwordnet) – Sentiment polarities mapped over WordNet structure

WordNet and SentiWordNet are special cases and have a different structure. Most of the other corpora can be accessed like this:
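The bundled corpora are exposed through corpus reader objects. So that it runs without downloading anything, the sketch below builds a tiny throwaway corpus in a temporary directory and reads it with CategorizedPlaintextCorpusReader. The file names and contents are invented. The calls shown (categories(), fileids(), words()) are also available on nltk.corpus.brown once it has been downloaded.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Invented sample files: the folder name doubles as the category.
samples = {
    "news/a.txt": "Markets rose sharply today.",
    "fiction/b.txt": "The dragon slept under the hill.",
}

with tempfile.TemporaryDirectory() as root:
    for fileid, content in samples.items():
        path = os.path.join(root, *fileid.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", encoding="utf8") as handle:
            handle.write(content)

    # cat_pattern pulls the category out of each file id.
    reader = CategorizedPlaintextCorpusReader(
        root, r".*\.txt", cat_pattern=r"(\w+)/.*"
    )

    categories = reader.categories()
    news_files = reader.fileids(categories="news")
    news_words = list(reader.words(categories="news"))

print(categories, news_files, news_words)
```

With the real Brown Corpus the equivalent would be along the lines of brown.categories() and brown.words(categories="news").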

Depending on the peculiarities of the corpus, it may have more or fewer methods than the ones presented here in the Brown Corpus. Check the available methods using dir(corpus_object).

Here’s a quick and dirty way of extracting commonly misspelled words or non-dictionary words:
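A rough sketch of the idea: count every token that is missing from a reference vocabulary. To keep it runnable without downloads, a small hand-written set (english_vocab, an assumption for this example) stands in for the word list from nltk.corpus.words, and the input text is invented.

```python
from nltk import FreqDist
from nltk.tokenize import wordpunct_tokenize

# Stand-in for set(nltk.corpus.words.words()), which needs a download.
english_vocab = {
    "i", "definitely", "want", "to", "go", "tomorrow",
    "we", "will", "see", "you",
}

raw = (
    "I definately want to go tommorow. "
    "We will definately see you tommorow, definately."
)

# Lowercase and keep alphabetic tokens only, dropping punctuation.
tokens = [t.lower() for t in wordpunct_tokenize(raw) if t.isalpha()]

# Count only the tokens the vocabulary does not know about.
unknown = FreqDist(t for t in tokens if t not in english_vocab)
most_frequent_unknown = unknown.most_common(2)

print(most_frequent_unknown)
```

On real data the same filter will also surface names, slang and abbreviations, which is why this is a quick and dirty check rather than a spell checker.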

Converting between different formats

Since NLTK has its roots in the academic environment, it also plays nice with academic data formats. It knows how to read data from specific formats and convert to whatever is needed. Here are some of the most popular ones:
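One conversion that comes up a lot is between CoNLL-style IOB triples and nltk.Tree. The sketch below, using an invented sentence and entity labels, goes from triples to a tree and back with conlltags2tree and tree2conlltags.

```python
from nltk.chunk import conlltags2tree, tree2conlltags

# (word, POS tag, IOB chunk tag) triples, as found in CoNLL-style files.
iob = [
    ("John", "NNP", "B-PERSON"),
    ("lives", "VBZ", "O"),
    ("in", "IN", "O"),
    ("New", "NNP", "B-GPE"),
    ("York", "NNP", "I-GPE"),
]

tree = conlltags2tree(iob)         # IOB triples -> nltk.Tree
round_trip = tree2conlltags(tree)  # nltk.Tree -> IOB triples

print(tree)              # bracketed string form of the tree
print(tree.leaves())     # the (word, tag) pairs, chunk structure dropped
```

nltk.chunk also offers conllstr2tree for parsing the raw CoNLL text format directly, and nltk.Tree.fromstring reads bracketed trees back in.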

Conclusions

  • If you’re getting started with Natural Language Processing, NLTK is for you
  • It may not provide the most performant models, but it has other tools that may come in handy
  • It offers a suite of interfaces and data structures that provide a good framework for building NLP applications
  • It comes with a collection of corpora that’s easy to read and use
  • You can build your own models using other machine learning tools and wrap them inside NLTK’s interfaces so that they play nicely within the framework.