Text Chunking with NLTK
What is chunking?
Text chunking, also referred to as shallow parsing, is a task that follows part-of-speech tagging and adds more structure to the sentence. The result is a grouping of the words into “chunks”. Here’s a quick example:
(S
  (NP Every/DT day/NN)
  ,/,
  (NP I/PRP)
  (VP buy/VBP)
  (NP something/NN)
  (PP from/IN)
  (NP the/DT corner/NN shop/NN))
In other words, in a shallow parse tree there is at most one level of nesting between the root and the leaves. A deep parse tree looks like this:
(S
  (NP Every/DT day/NN)
  ,/,
  (NP I/PRP)
  (VP buy/VBP
    (NP something/NN)
    (PP from/IN
      (NP the/DT corner/NN shop/NN))))
Each approach has its advantages and drawbacks. The most obvious advantage of shallow parsing is that it’s an easier task, so a shallow parser can be more accurate. Working with chunks is also much easier than working with full-blown parse trees.
Chunking is a very similar task to Named Entity Recognition. In fact, the same format, IOB tagging, is used. You can read about it in the post about Named Entity Recognition.
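As a quick illustration of the IOB format (hand-written here, not produced by any of the code below): each token gets its word, its POS tag, and a chunk tag, where B- marks the first token of a chunk, I- a token inside a chunk, and O a token outside any chunk.

# IOB encoding of the NP chunk "the corner shop" from the example above
iob_example = [
    ('the', 'DT', 'B-NP'),      # first token of a noun-phrase chunk
    ('corner', 'NN', 'I-NP'),   # continuation of the same chunk
    ('shop', 'NN', 'I-NP'),
]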
Corpus for Chunking
Good news: NLTK has a handy corpus for training a chunker. Chunking was part of the CoNLL-2000 shared task. You can read a paper about the task here: Introduction to the CoNLL-2000 Shared Task: Chunking
Let’s have a look at the corpus:
from nltk.corpus import conll2000

chunked_sentence = conll2000.chunked_sents()[0]
print chunked_sentence
Here’s the first annotated sentence in the corpus:
(S
  (NP Confidence/NN)
  (PP in/IN)
  (NP the/DT pound/NN)
  (VP is/VBZ widely/RB expected/VBN to/TO take/VB)
  (NP another/DT sharp/JJ dive/NN)
  if/IN
  (NP trade/NN figures/NNS)
  (PP for/IN)
  (NP September/NNP)
  ,/,
  due/JJ
  (PP for/IN)
  (NP release/NN)
  (NP tomorrow/NN)
  ,/,
  (VP fail/VB to/TO show/VB)
  (NP a/DT substantial/JJ improvement/NN)
  (PP from/IN)
  (NP July/NNP and/CC August/NNP)
  (NP 's/POS near-record/JJ deficits/NNS)
  ./.)
We already approached a very similar problem to chunking on the blog: Named Entity Recognition. The approach we’re going to take here is almost identical; only the feature selection and, of course, the corpus are different. We’re going to use the CoNLL-2000 corpus in this case. Let’s remind ourselves how to transform between the nltk.Tree and IOB format:
from nltk.chunk import conlltags2tree, tree2conlltags

iob_tagged = tree2conlltags(chunked_sentence)
print iob_tagged
# [(u'Confidence', u'NN', u'B-NP'), (u'in', u'IN', u'B-PP'), (u'the', u'DT', u'B-NP'), (u'pound', u'NN', u'I-NP'),
#  (u'is', u'VBZ', u'B-VP'), (u'widely', u'RB', u'I-VP'), (u'expected', u'VBN', u'I-VP'), (u'to', u'TO', u'I-VP'),
#  (u'take', u'VB', u'I-VP'), (u'another', u'DT', u'B-NP'), (u'sharp', u'JJ', u'I-NP'), (u'dive', u'NN', u'I-NP'),
#  (u'if', u'IN', u'O'), (u'trade', u'NN', u'B-NP'), (u'figures', u'NNS', u'I-NP'), (u'for', u'IN', u'B-PP'),
#  (u'September', u'NNP', u'B-NP'), (u',', u',', u'O'), (u'due', u'JJ', u'O'), (u'for', u'IN', u'B-PP'),
#  (u'release', u'NN', u'B-NP'), (u'tomorrow', u'NN', u'B-NP'), (u',', u',', u'O'), (u'fail', u'VB', u'B-VP'),
#  (u'to', u'TO', u'I-VP'), (u'show', u'VB', u'I-VP'), (u'a', u'DT', u'B-NP'), (u'substantial', u'JJ', u'I-NP'),
#  (u'improvement', u'NN', u'I-NP'), (u'from', u'IN', u'B-PP'), (u'July', u'NNP', u'B-NP'), (u'and', u'CC', u'I-NP'),
#  (u'August', u'NNP', u'I-NP'), (u"'s", u'POS', u'B-NP'), (u'near-record', u'JJ', u'I-NP'),
#  (u'deficits', u'NNS', u'I-NP'), (u'.', u'.', u'O')]

chunk_tree = conlltags2tree(iob_tagged)
print chunk_tree
"""
(S
  (NP Confidence/NN)
  (PP in/IN)
  (NP the/DT pound/NN)
  (VP is/VBZ widely/RB expected/VBN to/TO take/VB)
  (NP another/DT sharp/JJ dive/NN)
  if/IN
  (NP trade/NN figures/NNS)
  (PP for/IN)
  (NP September/NNP)
  ,/,
  due/JJ
  (PP for/IN)
  (NP release/NN)
  (NP tomorrow/NN)
  ,/,
  (VP fail/VB to/TO show/VB)
  (NP a/DT substantial/JJ improvement/NN)
  (PP from/IN)
  (NP July/NNP and/CC August/NNP)
  (NP 's/POS near-record/JJ deficits/NNS)
  ./.)
"""
Let’s get an idea of how large the corpus is:
from nltk.corpus import conll2000

print len(conll2000.chunked_sents())    # 10948
print len(conll2000.chunked_words())    # 166433
That’s a decent amount of data for training a well-behaved chunker.
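If you’re curious which chunk types the corpus actually contains, a quick tally over the IOB tags will tell you. This is just an exploratory sketch, not part of the original recipe:

from collections import Counter

from nltk.corpus import conll2000
from nltk.chunk import tree2conlltags

# Count how many chunks of each type (NP, VP, PP, ...) the corpus contains,
# by counting the B- tags that start each chunk
chunk_counts = Counter(
    iob_tag.split('-')[1]
    for sent in conll2000.chunked_sents()
    for word, pos, iob_tag in tree2conlltags(sent)
    if iob_tag.startswith('B-'))
print chunk_counts.most_common()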
Training a chunker
We’re going to train two chunkers, just for the fun of it, and then compare them.
Preparing the training and test datasets
import random

from nltk.corpus import conll2000
from nltk.chunk import conlltags2tree, tree2conlltags

shuffled_conll_sents = list(conll2000.chunked_sents())
random.shuffle(shuffled_conll_sents)
train_sents = shuffled_conll_sents[:int(len(shuffled_conll_sents) * 0.9)]
test_sents = shuffled_conll_sents[int(len(shuffled_conll_sents) * 0.9):]
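Note that the corpus also ships with the official CoNLL-2000 train/test split. If you want results that are comparable with the shared task, you can use that split instead of a random shuffle:

from nltk.corpus import conll2000

# The official shared-task split that comes with the corpus
train_sents = conll2000.chunked_sents('train.txt')
test_sents = conll2000.chunked_sents('test.txt')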
NLTK TrigramTagger as a chunker
We’re going to train a chunker using only the part-of-speech tags as information.
from nltk import ChunkParserI, TrigramTagger
from nltk.chunk import conlltags2tree, tree2conlltags


class TrigramChunkParser(ChunkParserI):
    def __init__(self, train_sents):
        # Extract only the (POS-TAG, IOB-CHUNK-TAG) pairs
        train_data = [[(pos_tag, chunk_tag) for word, pos_tag, chunk_tag in tree2conlltags(sent)]
                      for sent in train_sents]

        # Train a TrigramTagger on the POS-tag -> chunk-tag sequences
        self.tagger = TrigramTagger(train_data)

    def parse(self, sentence):
        pos_tags = [pos for word, pos in sentence]

        # Get the chunk tags
        tagged_pos_tags = self.tagger.tag(pos_tags)

        # Assemble the (word, pos, chunk) triplets
        conlltags = [(word, pos, chunk_tag)
                     for (word, pos), (_, chunk_tag) in zip(sentence, tagged_pos_tags)]

        # Transform to tree
        return conlltags2tree(conlltags)


trigram_chunker = TrigramChunkParser(train_sents)
print trigram_chunker.evaluate(test_sents)

# ChunkParse score:
#     IOB Accuracy:  86.8%
#     Precision:     79.1%
#     Recall:        83.1%
#     F-Measure:     81.0%
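One easy improvement worth experimenting with (not part of the recipe above) is giving the trigram tagger a backoff chain, so POS contexts it has never seen still receive a chunk tag instead of None. Here is a sketch, using a hypothetical helper you would call from __init__ instead of constructing the TrigramTagger directly:

from nltk import UnigramTagger, BigramTagger, TrigramTagger

def build_backoff_chunk_tagger(train_data):
    # train_data is the same list of [(pos_tag, chunk_tag), ...] sequences built in __init__;
    # unseen trigram contexts fall back to bigram, then unigram statistics
    unigram_tagger = UnigramTagger(train_data)
    bigram_tagger = BigramTagger(train_data, backoff=unigram_tagger)
    return TrigramTagger(train_data, backoff=bigram_tagger)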
Classifier-based tagger as a chunker
We’re now going to do something very similar to the code we implemented in the NER article.
import pickle
from collections import Iterable

from nltk import ChunkParserI, ClassifierBasedTagger
from nltk.chunk import conlltags2tree, tree2conlltags
from nltk.stem.snowball import SnowballStemmer


def features(tokens, index, history):
    """
    `tokens`  = a POS-tagged sentence [(w1, t1), ...]
    `index`   = the index of the token we want to extract features for
    `history` = the previous predicted IOB tags
    """
    # init the stemmer
    stemmer = SnowballStemmer('english')

    # Pad the sequence with placeholders
    tokens = [('__START2__', '__START2__'), ('__START1__', '__START1__')] + \
             list(tokens) + \
             [('__END1__', '__END1__'), ('__END2__', '__END2__')]
    history = ['__START2__', '__START1__'] + list(history)

    # shift the index with 2, to accommodate the padding
    index += 2

    word, pos = tokens[index]
    prevword, prevpos = tokens[index - 1]
    prevprevword, prevprevpos = tokens[index - 2]
    nextword, nextpos = tokens[index + 1]
    nextnextword, nextnextpos = tokens[index + 2]

    return {
        'word': word,
        'lemma': stemmer.stem(word),
        'pos': pos,

        'next-word': nextword,
        'next-pos': nextpos,
        'next-next-word': nextnextword,
        'next-next-pos': nextnextpos,

        'prev-word': prevword,
        'prev-pos': prevpos,
        'prev-prev-word': prevprevword,
        'prev-prev-pos': prevprevpos,
    }


class ClassifierChunkParser(ChunkParserI):
    def __init__(self, chunked_sents, **kwargs):
        assert isinstance(chunked_sents, Iterable)

        # Transform the trees into IOB-annotated sentences [(word, pos, chunk), ...]
        chunked_sents = [tree2conlltags(sent) for sent in chunked_sents]

        # Transform the triplets into pairs, to make them compatible with
        # the tagger interface: [((word, pos), chunk), ...]
        def triplets2tagged_pairs(iob_sent):
            return [((word, pos), chunk) for word, pos, chunk in iob_sent]

        chunked_sents = [triplets2tagged_pairs(sent) for sent in chunked_sents]

        self.feature_detector = features
        self.tagger = ClassifierBasedTagger(
            train=chunked_sents,
            feature_detector=features,
            **kwargs)

    def parse(self, tagged_sent):
        chunks = self.tagger.tag(tagged_sent)

        # Transform the result from [((w1, t1), iob1), ...]
        # to the preferred list-of-triplets format [(w1, t1, iob1), ...]
        iob_triplets = [(w, t, c) for ((w, t), c) in chunks]

        # Transform the list of triplets to nltk.Tree format
        return conlltags2tree(iob_triplets)


classifier_chunker = ClassifierChunkParser(train_sents)
print classifier_chunker.evaluate(test_sents)

# ChunkParse score:
#     IOB Accuracy:  92.1%
#     Precision:     85.9%
#     Recall:        89.3%
#     F-Measure:     87.6%
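The snippet above imports pickle but never actually uses it. If you want to keep the trained chunker around instead of retraining it on every run, you can serialize it, roughly like this (the module defining the features function has to be importable when you load the model back):

import pickle

# Save the trained chunker to disk...
with open('chunker.pickle', 'wb') as f:
    pickle.dump(classifier_chunker, f)

# ...and load it back later, without retraining
with open('chunker.pickle', 'rb') as f:
    classifier_chunker = pickle.load(f)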
We can see that the difference in performance between the trigram-model approach and the classifier approach is significant. I’ve picked only the features that worked best in this case; be sure to play with them a little, since a different feature set might give you better performance.
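One thing worth noticing is that the history parameter (the chunk tags already predicted for the previous tokens) isn’t used at all in the feature set above. Exposing it is an easy first experiment. Inside features(), after the padding and index shift, you could add an entry like this to the returned dictionary (a hypothetical extra feature, not part of the original set):

# history is padded with two placeholders and index is shifted by 2,
# so history[index - 1] is the chunk tag predicted for the previous token
extra_feature = {
    'prev-iob': history[index - 1],
}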
Let’s take our new chunker for a spin:
from nltk import word_tokenize, pos_tag

# Something from today's NYTimes paper:
print classifier_chunker.parse(pos_tag(word_tokenize(
    "The acts of defiance directed at Beijing, with some people calling for outright independence for Hong Kong, seemed to augur an especially stormy legislative term.")))

"""
(S
  (NP The/DT acts/NNS)
  (PP of/IN)
  (NP defiance/NN)
  (VP directed/VBN)
  (PP at/IN)
  (NP Beijing/NNP)
  ,/,
  (PP with/IN)
  (NP some/DT people/NNS)
  (VP calling/VBG)
  (PP for/IN)
  (NP outright/JJ independence/NN)
  (PP for/IN)
  (NP Hong/NNP Kong/NNP)
  ,/,
  (VP seemed/VBD to/TO augur/VB)
  (NP an/DT especially/RB)
  (PP stormy/JJ)
  (NP legislative/JJ term/NN)
  ./.)
"""
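Once you have an nltk.Tree, pulling out just the chunks you’re interested in is straightforward. Here’s a small sketch that collects the noun phrases from a parsed sentence:

from nltk import word_tokenize, pos_tag

tree = classifier_chunker.parse(pos_tag(word_tokenize(
    "The acts of defiance directed at Beijing seemed to augur an especially stormy legislative term.")))

# Keep only the NP subtrees and join their words back into strings
noun_phrases = [' '.join(word for word, pos in subtree.leaves())
                for subtree in tree.subtrees()
                if subtree.label() == 'NP']
print noun_phrases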
Conclusions
- Text chunking can be reduced to a tagging problem
- Chunking and Named Entity Recognition are very similar tasks
- Chunking is also called shallow parsing
- Deep parsing creates the full parse tree; shallow parsing adds only a single level of structure to the tree
That’s all. Happy chunking!
Hi! Good job. I need to chunk a corpus from the 20newsgroups dataset. Please, how can I do this? Thanks
Hi Mohammed,
I believe all the pieces are there. Take the chunker you trained here and chunk the text in the 20newsgroups corpus. You can access the data inside the corpus using the method presented here: http://nlpforhackers.io/text-classification/
Wow! You are good at this. I really appreciate the help. I will contact you for more help concerning corpus processing. Thanks once again
Glad to help 🙂
I am a doctoral candidate in the field of natural language processing. My topic is focused on the detection of semantic text anomalies in corpora using Python. Glad to meet you. Do you have any forum I can join? I would love to follow all your works and articles. Thanks
No forum at the moment, only a mailing list: http://nlpforhackers.io/newsletter/
Hi Bogdani!
How can we make use of the 20newsgroups dataset instead of conll2000? I still find it difficult to chunk.
Hmmm… Not sure what you are trying to do. The 20newsgroups corpus is not a chunk-annotated dataset, meaning you can’t train a chunker on it. You can, however, train your chunker on the conll2000 corpus (which is chunk annotated) and use the resulting model to chunk the 20newsgroups corpus. Hope this helps.
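For what it’s worth, here is a rough sketch of that workflow, assuming you load the raw 20newsgroups text with scikit-learn’s fetch_20newsgroups (any other way of getting the raw text works just as well) and reuse the classifier_chunker trained above:

from sklearn.datasets import fetch_20newsgroups
from nltk import sent_tokenize, word_tokenize, pos_tag

# Raw newsgroup posts -- no chunk annotations, so we only chunk them, we don't train on them
newsgroups = fetch_20newsgroups(subset='train')

# Chunk the first post, sentence by sentence, with the chunker trained on conll2000
for sentence in sent_tokenize(newsgroups.data[0]):
    print classifier_chunker.parse(pos_tag(word_tokenize(sentence)))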
Hi Bogdani,
Could you explain how to use the resulting model generated from conll2000 to train a new corpus? I am confused about this and have some questions: does my new corpus need to be annotated in IOB format in advance? Also, the code above is about evaluating the test set (precision and recall), so how can I get the resulting model?
Thanks in advance!
P.S. You are so kind and this article is really helpful. Lots of thanks!
Hi Feeoly,
Indeed, you are getting some things mixed up.
1. You don’t train a corpus. You use a corpus to train a model.
2. If you want to train a model, the corpus needs to be annotated. This is supervised learning which means that the data has to be labelled.
About getting the precision and recall for multiclass models (they are originally defined only for binary classification), read this: https://nlpforhackers.io/classification-performance-metrics/
Best,
Bogdan.
Hi,
I’m a PhD student working on improving recommender systems using sentiment analysis. I want to extract adjectives and nouns from user reviews as item features. How can I do that using tagging or chunking?
Hi Ahmed,
That’s more a task for part-of-speech tagging (POS tagging for short). I’ve written a complete tutorial here: http://nlpforhackers.io/training-pos-tagger/
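As a rough, hypothetical sketch of that idea (the review string and the tag filter are just examples):

from nltk import word_tokenize, pos_tag

review = "The battery life is amazing but the screen is quite dim."

# Keep nouns (NN*) and adjectives (JJ*) as candidate item features
candidates = [word for word, tag in pos_tag(word_tokenize(review))
              if tag.startswith('NN') or tag.startswith('JJ')]
print candidates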
Bogdan.