Introduction to NLTK
NLTK (Natural Language Toolkit) is the most popular Python framework for working with human language. There’s a bit of controversy around whether NLTK is appropriate for production environments. Here’s my take on the matter:
* NLTK doesn’t come with super powerful trained models, the way frameworks like Stanford CoreNLP do
* NLTK is perfect for getting started with NLP because it’s packed with examples and has a clear, concise, easy-to-understand API
* NLTK comes with a lot of corpora, conveniently organised and easily accessible for experiments
* NLTK provides a simple and Pythonic foundation that’s easy to extend
* NLTK contains some beautiful abstractions, like nltk.Tree, used in tasks such as chunking and syntactic parsing
* NLTK contains useful functions for doing quick analysis (taking a quick look at the data)
* NLTK is certainly the place for getting started with NLP
You might not use the models in NLTK, but you can extend the excellent base classes and use your own trained models, built using other libraries like scikit-learn or TensorFlow. Here are some examples of training your own NLP models: Training a POS Tagger with NLTK and scikit-learn and Train a NER System. NLTK also provides some interfaces to external tools like the StanfordPOSTagger or the SennaTagger.
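To give a taste of what that looks like, here’s a minimal sketch (my own illustration, not code from those tutorials) of wrapping an externally trained model behind NLTK’s TaggerI interface. The `model` and `featurize` arguments are hypothetical placeholders, e.g. a scikit-learn Pipeline of a DictVectorizer plus a classifier:

```python
from nltk.tag.api import TaggerI


class ExternalModelTagger(TaggerI):
    """Hypothetical wrapper: delegate tagging to any model with a predict() method."""

    def __init__(self, model, featurize):
        self.model = model          # e.g. a scikit-learn Pipeline (DictVectorizer + classifier)
        self.featurize = featurize  # your function: (tokens, index) -> feature dict

    def tag(self, tokens):
        # Build one feature dict per token, predict a tag for each, zip them back together
        features = [self.featurize(tokens, i) for i in range(len(tokens))]
        labels = self.model.predict(features)
        return list(zip(tokens, labels))
```

Because the class implements TaggerI, anything in NLTK that expects a tagger (evaluation helpers, chunkers built on top of taggers) can use it as a drop-in.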
So, long story short: NLTK is awesome because it provides a great starting point. Don’t ignore it; it may well be the foundation of your next application.
Natural Language Processing with NLTK
NLTK is well organized and plays well with the NLP Pyramid. The various functions, shortcuts, and interfaces can be organized in a hierarchy as follows:
| Module | Shortcuts | Data Structures | Interfaces | NLP Pyramid |
|---|---|---|---|---|
| nltk.stem, nltk.text, nltk.tokenize | word_tokenize, sent_tokenize | str, nltk.Text => [str] | StemmerI, TokenizerI | Morphology |
| nltk.tag, nltk.chunk | pos_tag | [str] => [(str, tag)], nltk.Tree | TaggerI, ParserI, ChunkParserI | Syntax |
| nltk.chunk, nltk.sem | ne_chunk | nltk.Tree, nltk.DependencyGraph | ParserI, ChunkParserI | Semantics |
| nltk.sem.drt | – | Expression | – | Pragmatics |
Here’s an example of quickly passing through the first 3 levels of the NLP Pyramid:
```python
from nltk import word_tokenize, pos_tag, ne_chunk

text = "John works at Intel."  # str

# Morphology Level
tokens = word_tokenize(text)
print(tokens)  # [str]
# ['John', 'works', 'at', 'Intel', '.']

# Syntax Level
tagged_tokens = pos_tag(tokens)
print(tagged_tokens)  # [(str, tag)]
# [('John', 'NNP'), ('works', 'VBZ'), ('at', 'IN'), ('Intel', 'NNP'), ('.', '.')]

# Semantics Level
ner_tree = ne_chunk(tagged_tokens)
print(ner_tree)  # nltk.Tree
# (S (PERSON John/NNP) works/VBZ at/IN (ORGANIZATION Intel/NNP) ./.)
```
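The table’s Morphology row also mentions nltk.stem, which the pyramid example above doesn’t touch. Here’s a quick, hedged sketch of stemming and lemmatization on the same sentence (outputs are indicative):

```python
from nltk import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

tokens = word_tokenize("John works at Intel.")

# Stemming chops words down to a crude root form
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])
# e.g. ['john', 'work', 'at', 'intel', '.']

# Lemmatization maps words to dictionary forms (here treating everything as a verb)
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t, pos='v') for t in tokens])
# e.g. ['John', 'work', 'at', 'Intel', '.']
```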
Quick peek with nltk.Text
When doing information extraction or text transformation, nltk.Text might come in handy. Here are a few shortcuts for getting some quick insights from the text:
```python
from nltk import Text
from nltk.corpus import reuters

text = Text(reuters.words())

# Get the collocations that don't contain stop-words
text.collocations()
# United States; New York; per cent; Rhode Island; years ago; Los Angeles; White House; ...

# Get words that appear in similar contexts
text.similar('Monday', 5)
# april march friday february january

# Get common contexts for a list of words
text.common_contexts(['August', 'June'])
# since_a in_because last_when between_and last_that and_at ...

# Get contexts for a word
text.concordance('Monday')
# said . Trade Minister Saleh said on Monday that Indonesia , as the world ' s s
# Reuters to clarify his statement on Monday in which he said the pact should be
# the 11 - member CPA which began on Monday . They said producers agreed that c
# ief Burkhard Junger was arrested on Monday on suspicion of embezzlement and of
# ween one and 1 . 25 billion dlrs on Monday and Tuesday . The spokesman said Mo
# ay and Tuesday . The spokesman said Monday ' s float included 500 mln dlrs in
```
Basic text stats with NLTK
This is an example of computing some basic text statistics using NLTK’s FreqDist class:
```python
from nltk.corpus import webtext
from nltk import word_tokenize
from nltk import FreqDist, Text

# Build a large text
text = ""
for wt in webtext.fileids()[:100]:
    text += "\n\n" + webtext.raw(wt)

fdist = FreqDist(word_tokenize(text))

# Get the text's vocabulary
print(list(fdist.keys())[:100])  # First 100 words

print(fdist['dinosaurs'])
# 7
```
FreqDist extends Python’s dict class, so much of its behaviour is inherited from there. It does provide some neat additional functionality:
```python
# Get a word's frequency
print(fdist.freq('dinosaurs'))
# 1.84242041402e-05

# Total number of samples
print(fdist.N())
# 379935

# Words that appear exactly once
print(fdist.hapaxes())
# ['sepcially', 'mutinied', 'Nudists', 'Restrained', ... ]

# Most common samples
print(fdist.most_common(n=5))
# [('.', 16500), (':', 14327), (',', 12427), ('I', 7786), ('the', 7313)]

# Draw a bar chart with the count of the most common 50 words
import matplotlib.pyplot as plt

x, y = zip(*fdist.most_common(n=50))
plt.bar(range(len(x)), y)
plt.xticks(range(len(x)), x)
plt.show()
```
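Because FreqDist is a dict underneath, plain dict operations work on it too. A small sketch, reusing the fdist built above:

```python
# Plain dict-style operations on the `fdist` from the previous snippet
print(len(fdist))                # size of the vocabulary (number of distinct tokens)
print('dinosaurs' in fdist)      # membership test: True
print(fdist.get('unicorns', 0))  # 0 for tokens that never appeared

# Iterating yields the keys, just like a regular dict
for word in sorted(fdist, key=fdist.get, reverse=True)[:5]:
    print(word, fdist[word])     # same information as fdist.most_common(5)
```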
Bigrams, Trigrams, Collocations
In natural language processing, you’ll often work with bigrams and trigrams. If you don’t know what they are yet, fear not, because the matter is really simple: bigrams are pairs of consecutive words and trigrams are triplets of consecutive words. Bigrams and trigrams are used to create an approximate model of the language. Here are some shortcuts for computing bigrams and trigrams:
```python
from nltk import word_tokenize, bigrams, trigrams

text = "John works at Intel."
tokens = word_tokenize(text)

# the `bigrams` function returns a generator, so we must unwind it
print(list(bigrams(tokens)))
# [('John', 'works'), ('works', 'at'), ('at', 'Intel'), ('Intel', '.')]

# the `trigrams` function returns a generator, so we must unwind it
print(list(trigrams(tokens)))
# [('John', 'works', 'at'), ('works', 'at', 'Intel'), ('at', 'Intel', '.')]
```
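To make the “approximate model of the language” part concrete, here’s a hedged sketch (my own illustration, not part of the original tutorial) of a tiny next-word model: a ConditionalFreqDist built over the Reuters bigrams, answering “which words tend to follow this one?”:

```python
from nltk import bigrams, ConditionalFreqDist
from nltk.corpus import reuters

# Each bigram (w1, w2) is treated as a (condition, sample) pair:
# count what follows every word in the corpus
cfd = ConditionalFreqDist(bigrams(reuters.words()))

# The most frequent continuations after "interest"
print(cfd['interest'].most_common(5))
# indicative output: [('rates', ...), ('rate', ...), ...]
```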
One interesting statistic you can extract from bigrams and trigrams is the list of collocations. Collocations are pairs/triplets of words that appear together more frequently than you would expect judging by the frequencies of the individual words.
```python
import nltk
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()

## Bigrams
finder = BigramCollocationFinder.from_words(nltk.corpus.reuters.words())

# only bigrams that appear 5+ times
finder.apply_freq_filter(5)

# return the 50 bigrams with the highest PMI
print(finder.nbest(bigram_measures.pmi, 50))
# among the collocations we can find stuff like: ('Corpus', 'Christi'), ('mechanically', 'separated'), ('Kuala', 'Lumpur'), ('Mathematical', 'Applications')

## Trigrams
finder = TrigramCollocationFinder.from_words(nltk.corpus.reuters.words())

# only trigrams that appear 5+ times
finder.apply_freq_filter(5)

# return the 50 trigrams with the highest PMI
print(finder.nbest(trigram_measures.pmi, 50))
# among the collocations we can find stuff like: ('GHANA', 'COCOA', 'PURCHASES'), ('Punta', 'del', 'Este'), ('Special', 'Drawing', 'Rights')
```
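If you want the actual PMI scores and not just the ranking, the finders also expose score_ngrams. A quick sketch, reusing the trigram finder and measure from the snippet above:

```python
# (ngram, score) pairs, sorted by descending PMI, reusing `finder` and `trigram_measures`
scored = finder.score_ngrams(trigram_measures.pmi)
print(scored[:3])
```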
Bundled corpora
One of the cool things about NLTK is that it comes with bundled corpora. You do need to run nltk.download() to fetch them the first time you install NLTK, but after that you can use the corpora in any of your projects. Having corpora handy is good because you might want to run quick experiments, train models on properly formatted data, or compute some quick text stats. You’ve already stumbled into some examples of using the corpora in this tutorial. Here are some of the most representative ones:
- WordNet – nltk.corpus.wordnet – English Lexical Database
- Gutenberg – nltk.corpus.gutenberg – Books from Project Gutenberg
- WebText – nltk.corpus.webtext – User generated content on the web
- Brown – nltk.corpus.brown – Text categorized by genre
- Reuters – nltk.corpus.reuters – News Corpus
- Words – nltk.corpus.words – English Vocabulary
- SentiWordNet – nltk.corpus.sentiwordnet – Sentiment polarities mapped over WordNet structure
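As mentioned above, each corpus has to be fetched once with nltk.download(). A minimal sketch:

```python
import nltk

# Fetch individual corpora by their identifiers...
nltk.download('brown')
nltk.download('reuters')

# ...or call nltk.download() with no arguments to open the interactive downloader
```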
WordNet and SentiWordNet are special cases and have a different structure. Most of the other corpora can be accessed like this:
```python
from nltk.corpus import brown

print(dir(brown))

# Get the IDs of the files inside the corpus
print(brown.fileids())
# ['ca01', 'ca02', 'ca03', 'ca04', 'ca05', 'ca06', 'ca07', ...

# Get the raw contents of a file
print(brown.raw('ca01'))
# The/at Fulton/np-tl County/nn-tl Grand/jj-tl ...

# Get the tokenized text
print(brown.words())
# ['The', 'Fulton', 'County', 'Grand', 'Jury', ...]

# In case the corpus is part-of-speech tagged
print(brown.tagged_sents()[0])
# [('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ...
```
Depending on the peculiarities of the corpus, it may have more or fewer methods than the ones presented here for the Brown Corpus. Check the available methods using dir(corpus_object).
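Since WordNet and SentiWordNet were called out as special cases, here’s a hedged sketch of what their different structure looks like (outputs are indicative):

```python
from nltk.corpus import wordnet as wn
from nltk.corpus import sentiwordnet as swn

# WordNet is organised around synsets rather than files of running text
print(wn.synsets('bank')[:3])
# e.g. [Synset('bank.n.01'), Synset('depository_financial_institution.n.01'), ...]
print(wn.synsets('bank')[0].definition())
# e.g. sloping land (especially the slope beside a body of water)

# SentiWordNet attaches sentiment scores to WordNet synsets
good = swn.senti_synset('good.a.01')
print(good.pos_score(), good.neg_score(), good.obj_score())
# e.g. 0.75 0.0 0.25
```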
Here’s a quick and dirty way of extracting commonly misspelled words or non-dictionary words:
```python
from nltk.corpus import brown, webtext

brown_words = set([w.lower() for w in brown.words()])
web_words = set([w.lower() for w in webtext.words()])

# Get the words that are found on the web, but not in the Brown vocabulary
print(web_words.difference(brown_words))
# {'nuttyness', 'colonoscopy', 'psm_co_tag', 'prefix', 'woody', 'typeerror', 'scoll', ...
```
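A variation on the same idea (my own sketch): use the Words corpus listed above as the dictionary instead of Brown, which gets you closer to a real spell-check word list:

```python
from nltk.corpus import words, webtext

# nltk.corpus.words is a plain English word list
dictionary = set(w.lower() for w in words.words())
web_words = set(w.lower() for w in webtext.words())

# Web tokens that don't appear in the English word list
print(sorted(web_words - dictionary)[:10])
```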
Converting between different formats
Since NLTK has its roots in the academic environment, it also plays nicely with academic data formats. It knows how to read data from specific formats and convert it to whatever is needed. Here are some of the most popular ones:
```python
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.tag import untag, str2tuple, tuple2str
from nltk.chunk import tree2conllstr, conllstr2tree, conlltags2tree, tree2conlltags

text = "John works at Intel."
tokens = word_tokenize(text)
print(tokens)
# ['John', 'works', 'at', 'Intel', '.']

tagged_tokens = pos_tag(tokens)
print(tagged_tokens)
# [('John', 'NNP'), ('works', 'VBZ'), ('at', 'IN'), ('Intel', 'NNP'), ('.', '.')]

print(untag(tagged_tokens))
# Back to: ['John', 'works', 'at', 'Intel', '.']

tagged_tokens = [tuple2str(t) for t in tagged_tokens]
print(tagged_tokens)
# ['John/NNP', 'works/VBZ', 'at/IN', 'Intel/NNP', './.']

tagged_tokens = [str2tuple(t) for t in tagged_tokens]
print(tagged_tokens)
# Back to: [('John', 'NNP'), ('works', 'VBZ'), ('at', 'IN'), ('Intel', 'NNP'), ('.', '.')]

ner_tree = ne_chunk(tagged_tokens)
print(ner_tree)
# (S (PERSON John/NNP) works/VBZ at/IN (ORGANIZATION Intel/NNP) ./.)

iob_tagged = tree2conlltags(ner_tree)
print(iob_tagged)
# [('John', 'NNP', 'B-PERSON'), ('works', 'VBZ', 'O'), ('at', 'IN', 'O'), ('Intel', 'NNP', 'B-ORGANIZATION'), ('.', '.', 'O')]

ner_tree = conlltags2tree(iob_tagged)
print(ner_tree)
# Back to: (S (PERSON John/NNP) works/VBZ at/IN (ORGANIZATION Intel/NNP) ./.)

tree_str = tree2conllstr(ner_tree)
print(tree_str)
# John NNP B-PERSON
# works VBZ O
# at IN O
# Intel NNP B-ORGANIZATION
# . . O

ner_tree = conllstr2tree(tree_str, chunk_types=('PERSON', 'ORGANIZATION'))
print(ner_tree)
# Back to: (S (PERSON John/NNP) works/VBZ at/IN (ORGANIZATION Intel/NNP) ./.)
```
Conclusions
- If you’re getting started with Natural Language Processing, NLTK is for you
- It may not provide the most performant models, but it has other tools that may come in handy
- It offers a suite of interfaces and data structures that provide a good framework for building NLP applications
- It comes with a collection of corpora that’s easy to read and use
- You can build your own models using other Machine Learning tools and wrap them inside NLTK’s interfaces so that it plays nicely within the framework.
Good job!
I’ve also found spaCy to be satisfactory and easy to use out of the box. It’s also in Python, but built for production and easily extensible.
Absolutely, spaCy rocks! I’m planning to create a tutorial on it as well. I’m just accumulating more practical experience with it 😀
Cool! As far as I know, NLTK is more instructional than competitive these days — it doesn’t have dependency parsing, word embeddings, or coreference resolution and is orders of magnitude slower.
Agree on this as well, Tom. NLTK is indeed best for instructional purposes, and that’s a reason why it’s used throughout this blog. The default models that come with NLTK are indeed slow and outdated, but we can always train better models and wrap them in the good abstractions that NLTK provides.
I could use a tutorial on spaCy 🙂
There you go:
https://nlpforhackers.io/complete-guide-to-spacy/
Hi,
Great tutorials, I’ve been following them for a while now. I am getting started with a project in NLP and I could use some help with it.
My project is to use AI in customer service. Basically, there are a bunch of emails, and then I have some Excel files with customer queries and answers. Right now, I am thinking of training the model on the Excel files, so if I get a query similar to old ones (already addressed), the answer is found and provided to the customer. If the answer is not available, then a person looks for the answer in the PDF documentation (which runs into the thousands) and then updates the Excel file. And the model learns from it again.
This is just my approach. I believe there might be a more intelligent approach to this problem. Could there be a way to intelligently search the PDFs for the answer using AI & NLP, since there are two possibilities: either the answer is in an Excel file or in a PDF?
Please let me know if you happen to have more questions. Since you are the expert in the field, any advice would be great. It would also be nice to know some limitations, since this field is pretty new to me. Cheers!
Hi Irfan,
Indeed, your project makes sense. From a business point of view, I believe it’s still best to have the human search for the answer in the PDF. If not, you can treat that issue as an Information Retrieval problem. In terms of selecting the best answer from the Excel file, you can start by using a simple similarity measure and pick the question that is most similar to the one you are trying to answer.
Bogdan.
Thank you so much! It was very informative!
Thanks Victoria, glad you liked it 🙂