Quick Recipe: Build a POS tagger using a Conditional Random Field
A while back I wrote a complete guide for training your own Part-Of-Speech Tagger. If you are new to Part-Of-Speech tagging (POS tagging), make sure you follow that tutorial first. This article builds on the work done there.
What is a CRF?
A Conditional Random Field (CRF for short) is a discriminative sequence labelling model. It’s a fairly easy model to explain (compared to Hidden Markov Models). Basically, given:
- some feature extractors (feature extractors need to output real numbers)
- weights associated with the features (which are learned)
- previous labels
predict the current label.
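To make that concrete, here is a toy sketch of how a linear-chain CRF scores one candidate labelling. The feature names, weights, and transition table below are invented for illustration; in a real CRF the weights are learned and the scores are normalized over all possible labellings.

```python
# Toy linear-chain CRF scoring sketch. All feature names, weights and
# transition values here are made up for illustration only.
def crf_score(words, labels, weights, transitions):
    """Sum of per-position feature weights plus label-transition weights."""
    score = 0.0
    prev = '<START>'
    for word, label in zip(words, labels):
        # Emission-style feature: weight of the (feature, label) pair.
        score += weights.get(('word=' + word, label), 0.0)
        # Transition feature: weight of moving from the previous label.
        score += transitions.get((prev, label), 0.0)
        prev = label
    return score

weights = {('word=can', 'MD'): 2.0, ('word=can', 'NN'): 0.5}
transitions = {('<START>', 'MD'): 0.5, ('<START>', 'NN'): 0.25}

print(crf_score(['can'], ['MD'], weights, transitions))  # 2.5
print(crf_score(['can'], ['NN'], weights, transitions))  # 0.75
```

With these made-up weights the modal reading of "can" (MD) outscores the noun reading; decoding a whole sentence means finding the label sequence with the highest total score.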
You probably just realized that CRFs seem totally appropriate for doing POS tagging. That’s true, and they are also appropriate for other NLP tasks like named-entity extraction and chunking.
Building the tagger
In the previous tutorial, we used the nltk.corpus.treebank corpus. Let’s do the same here so we can compare results. I’ll also remind you that we didn’t use any historical features in the previous tutorial: our previous classifier knew nothing about the previous decisions.
Let’s check the data:
```python
import nltk

tagged_sentences = nltk.corpus.treebank.tagged_sents()

print(tagged_sentences[0])
print("Tagged sentences: ", len(tagged_sentences))
print("Tagged words:", len(nltk.corpus.treebank.tagged_words()))
```
Let’s also use the exact same feature extraction function:
```python
def features(sentence, index):
    """ sentence: [w1, w2, ...], index: the index of the word """
    return {
        'word': sentence[index],
        'is_first': index == 0,
        'is_last': index == len(sentence) - 1,
        'is_capitalized': sentence[index][0].upper() == sentence[index][0],
        'is_all_caps': sentence[index].upper() == sentence[index],
        'is_all_lower': sentence[index].lower() == sentence[index],
        'prefix-1': sentence[index][0],
        'prefix-2': sentence[index][:2],
        'prefix-3': sentence[index][:3],
        'suffix-1': sentence[index][-1],
        'suffix-2': sentence[index][-2:],
        'suffix-3': sentence[index][-3:],
        'prev_word': '' if index == 0 else sentence[index - 1],
        'next_word': '' if index == len(sentence) - 1 else sentence[index + 1],
        'has_hyphen': '-' in sentence[index],
        'is_numeric': sentence[index].isdigit(),
        'capitals_inside': sentence[index][1:].lower() != sentence[index][1:]
    }
```
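A side note on the extractor: the index-based features (like 'is_capitalized', 'prefix-1' and 'suffix-1') assume every token is non-empty, while the slice-based features never raise. This small self-contained check (not part of the tutorial code) shows the difference, which is one way to hit an "IndexError: string index out of range" with unusual corpus data:

```python
# Indexing an empty string raises; slicing it never does. The prefix/suffix
# slice features above are therefore safe, but an empty token would crash
# on expressions like sentence[index][0].
word = ""
assert word[:3] == ""   # 'prefix-3'-style slice: returns '' safely
assert word[-3:] == ""  # 'suffix-3'-style slice: returns '' safely
try:
    word[0]             # 'is_capitalized'-style index: raises
    raised = False
except IndexError:
    raised = True
print(raised)  # True
```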
Let’s build the dataset:
```python
from nltk.tag.util import untag

# Split the dataset for training and testing
cutoff = int(.75 * len(tagged_sentences))
training_sentences = tagged_sentences[:cutoff]
test_sentences = tagged_sentences[cutoff:]

def transform_to_dataset(tagged_sentences):
    X, y = [], []
    for tagged in tagged_sentences:
        X.append([features(untag(tagged), index) for index in range(len(tagged))])
        y.append([tag for _, tag in tagged])
    return X, y

X_train, y_train = transform_to_dataset(training_sentences)
X_test, y_test = transform_to_dataset(test_sentences)

print(len(X_train))
print(len(X_test))
print(X_train[0])
print(y_train[0])

# 2935
# 979
# [{'word': 'Pierre' ...
# ['NNP', 'NNP', ',', 'CD', 'NNS', 'JJ', ',', 'MD', 'VB', 'DT', 'NN', 'IN', 'DT', 'JJ', 'NN', 'NNP', 'CD', '.']
```
Notice how each row in the dataset is a sequence, not a single word. CRFs learn sequences.
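To contrast the two layouts, here is a hypothetical two-sentence mini-corpus (made up for illustration, not from the treebank):

```python
# Hypothetical mini-corpus: two tagged sentences.
tagged = [[('I', 'PRP'), ('run', 'VBP')], [('Birds', 'NNS'), ('fly', 'VBP')]]

# Word-level layout (what a per-token classifier sees): one row per word.
X_flat = [word for sent in tagged for word, _ in sent]

# Sequence-level layout (what a CRF sees): one row per sentence.
X_seq = [[word for word, _ in sent] for sent in tagged]

print(len(X_flat))  # 4 -- one row per word
print(len(X_seq))   # 2 -- one row per sentence
```

The CRF gets whole sentences, so at training time it can learn how the tag of one word constrains the tag of the next.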
Let’s now install the CRF library we’ll be using:
```
pip install sklearn-crfsuite
```
The sklearn-crfsuite package is a wrapper over the python-crfsuite library and provides a sklearn-compatible API for it.
```python
from sklearn_crfsuite import CRF

model = CRF()
model.fit(X_train, y_train)
```
Here’s how to make predictions using our model:
```python
sentence = ['I', 'am', 'Bob', '!']

def pos_tag(sentence):
    sentence_features = [features(sentence, index) for index in range(len(sentence))]
    return list(zip(sentence, model.predict([sentence_features])[0]))

print(pos_tag(sentence))
# [('I', 'PRP'), ('am', 'VBP'), ('Bob', 'NNP'), ('!', '.')]
```
Let’s compute the performance of our model:
```python
from sklearn_crfsuite import metrics

y_pred = model.predict(X_test)
print(metrics.flat_accuracy_score(y_test, y_pred))

# 0.9602683593122289
```
We achieved a whopping 0.96 accuracy on the POS tagging task, compared to only 0.90 with a DecisionTreeClassifier in the previous tutorial.
Hello, Bogdani. Why do I get the error ValueError: too many values to unpack when I try to untag? Must it be a list of tuples or just a tuple? Mine was trying to untag a tuple.
list of tuples
Thank you for your response, Bogdani. So, in y.append([tag for _, tag in tagged]) it uses a list of tuples too, right? But now I get another error: “ValueError: The numbers of items and labels differ: |x| = 2, |y| = 9390”. Maybe it’s because I am not using the same NLTK corpus or their TaggedCorpusReader.
Okay, this one is fixed. But now I am running into another error in the feature extraction part: “IndexError: string index out of range”.
I can’t seem to understand, how can it become out of range…?
Irfan,
I am getting the same error. How did you fix it?
What error are you referring to?
How did you solve the “ValueError: The numbers of items and labels differ: |x| = 2, |y| = 9390” problem?
you’re probably not vectorizing correctly
Hello Bogdani.
This blog is really helpful for me.
However, from these two lines below:
I’m curious why you are using the training dataset for training and testing.
Thank you.
Best wishes,
Cheetah
You are right! Thanks for the catch, will update the code shortly 🙂
Hello bogdani,
I have a question about prediction: in your code you compute the model’s performance on X_test and y_test.
Can we compute the performance or the accuracy for the sentence “I am Bob !” here?
Thanks!
Of course you can. You just need to have the correct tags. You can’t compute the accuracy for something you don’t already have the solution to.
Ah yeah, I just figured out yesterday that it compares the tagged treebank with the predicted one.
Thanks for the answer!
Cheers.