Quick Recipe: Build a POS tagger using a Conditional Random Field

A while back I wrote a Complete Guide for Training Your Own Part-Of-Speech Tagger. If you are new to Part-Of-Speech tagging (POS tagging), make sure you follow that tutorial first. This article is an enhancement of the work done there.

What is a CRF?

A Conditional Random Field (CRF for short) is a discriminative sequence labelling model. It’s a fairly easy model to explain (compared to Hidden Markov Models). Basically, given:

  1. some feature extractors (feature extractors need to output real numbers)
  2. weights associated with the features (which are learned)
  3. previous labels

predict the current label.
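Schematically, a linear-chain CRF scores an entire label sequence y for an input sentence x. With feature functions f_k and learned weights w_k, the conditional probability takes the standard form (this formula is a textbook sketch, not something spelled out in the original tutorial):

```latex
P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{t=1}^{T} \sum_{k} w_k \, f_k(y_{t-1}, y_t, x, t) \right)
```

where Z(x) normalizes over all possible label sequences, and the dependence on y_{t-1} is exactly the "previous labels" ingredient above.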

You probably just realized that this makes them a natural fit for POS tagging. That’s true, and CRFs are also appropriate for other NLP tools like Named Entity extractors and chunkers.

Building the tagger

In the previous tutorial, we used the nltk.corpus.treebank corpus. Let’s do the same here so we can compare. Keep in mind that we didn’t use any historical features in the previous tutorial: our classifier knew nothing about its previous decisions.

Let’s check the data:

Let’s also use the exact same feature extraction function:
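For reference, the word-level feature extractor from that tutorial looks roughly like this (a sketch reproduced from the earlier article; double-check it against the original if you're following along):

```python
def features(sentence, index):
    """sentence: [w1, w2, ...], index: the index of the word"""
    return {
        'word': sentence[index],
        'is_first': index == 0,
        'is_last': index == len(sentence) - 1,
        'is_capitalized': sentence[index][0].upper() == sentence[index][0],
        'is_all_caps': sentence[index].upper() == sentence[index],
        'is_all_lower': sentence[index].lower() == sentence[index],
        'prefix-1': sentence[index][0],
        'prefix-2': sentence[index][:2],
        'prefix-3': sentence[index][:3],
        'suffix-1': sentence[index][-1],
        'suffix-2': sentence[index][-2:],
        'suffix-3': sentence[index][-3:],
        'prev_word': '' if index == 0 else sentence[index - 1],
        'next_word': '' if index == len(sentence) - 1 else sentence[index + 1],
        'has_hyphen': '-' in sentence[index],
        'is_numeric': sentence[index].isdigit(),
        'capitals_inside': sentence[index][1:].lower() != sentence[index][1:],
    }
```

Note that the feature values here are strings, booleans, and the like; crfsuite handles the conversion to the real-valued indicator features the model actually uses.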

Let’s build the dataset:

Notice how each row in the dataset is a sequence, not a single word. CRFs learn sequences.

Let’s now install the CRF library we’ll be using:

sklearn-crfsuite is a wrapper over the python-crfsuite library that provides a scikit-learn compatible API.

Here’s how to make predictions using our model:

Let’s compute the performance of our model:

We achieved a whopping 0.98 accuracy on the POS tagging task. In our previous tutorial, we only achieved 0.90 using a DecisionTreeClassifier.