Build a POS tagger with an LSTM using Keras

In this tutorial, we’re going to implement a POS Tagger with Keras. On this blog, we’ve already covered the theory behind POS taggers: POS Tagger with Decision Trees and POS Tagger with Conditional Random Field.

Recently we also started looking at deep learning with Keras, a popular Python library. You can get started with Keras in the Sentiment Analysis with Keras tutorial. This tutorial combines the two subjects: we’ll build a POS tagger using Keras and a Bidirectional LSTM layer. Let’s use a corpus that’s included in NLTK:
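
Something along these lines works, using the Penn Treebank sample bundled with NLTK (the nltk.download call is only needed the first time):

```python
import nltk
from nltk.corpus import treebank

nltk.download('treebank')                   # fetch the Penn Treebank sample if you don't have it yet
tagged_sentences = treebank.tagged_sents()  # a list of sentences, each a list of (word, tag) pairs

print(tagged_sentences[0])
print("Tagged sentences:", len(tagged_sentences))
```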

Let’s restructure the data a bit. Let’s separate the words from the tags.
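
One way to do it is to keep the words and the tags in two parallel lists (the names sentences and sentence_tags are this sketch’s choices):

```python
sentences, sentence_tags = [], []
for tagged_sentence in tagged_sentences:
    words, tags = zip(*tagged_sentence)   # unzip the (word, tag) pairs
    sentences.append(list(words))
    sentence_tags.append(list(tags))
```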

Before training a model, we need to split the data into training and testing sets. As usual, let’s use the train_test_split function from Scikit-Learn:
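
For example, with an 80/20 split:

```python
from sklearn.model_selection import train_test_split

(train_sentences, test_sentences,
 train_tags, test_tags) = train_test_split(sentences, sentence_tags, test_size=0.2)
```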

Keras needs to work with numbers, not with words (or tags). Let’s assign each word (and tag) a unique integer. We compute the set of unique words (and tags), turn it into a list and index it in a dictionary. These dictionaries are the word vocabulary and the tag vocabulary. We’ll also add a special value for padding the sequences (more on that later) and another one for unknown words (OOV, out of vocabulary).
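
A possible sketch: the word vocabulary is built from the lowercased training sentences only, so the OOV entry has something to do, while the small, closed tag set is collected from all sentences. The names word2index and tag2index are this sketch’s, not fixed by anything above:

```python
words, tags = set(), set()
for sentence in train_sentences:
    words.update(word.lower() for word in sentence)
for tag_sequence in sentence_tags:
    tags.update(tag_sequence)

# Index 0 is reserved for padding, index 1 for out-of-vocabulary words.
word2index = {word: i + 2 for i, word in enumerate(sorted(words))}
word2index['-PAD-'] = 0
word2index['-OOV-'] = 1

# The tag set is small and closed, so only a padding entry is needed here.
tag2index = {tag: i + 1 for i, tag in enumerate(sorted(tags))}
tag2index['-PAD-'] = 0
```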

Let’s now convert the datasets to integers, both the words and the tags.
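
Something like this, reusing the dictionaries from the previous step:

```python
def to_word_ids(sentences):
    # Unknown words fall back to the -OOV- index.
    return [[word2index.get(word.lower(), word2index['-OOV-']) for word in sentence]
            for sentence in sentences]

def to_tag_ids(tag_sequences):
    return [[tag2index[tag] for tag in tag_sequence] for tag_sequence in tag_sequences]

train_sentences_X = to_word_ids(train_sentences)
test_sentences_X = to_word_ids(test_sentences)
train_tags_y = to_tag_ids(train_tags)
test_tags_y = to_tag_ids(test_tags)
```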

Keras expects fixed-size sequences here. We will right-pad all the sequences with a special value (0 as the index and “-PAD-” as the corresponding word/tag) to the length of the longest sequence in the dataset. Let’s compute that maximum length.
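
For example:

```python
MAX_LENGTH = max(len(sentence) for sentence in train_sentences_X + test_sentences_X)
print(MAX_LENGTH)
```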

Now we can use Keras’s convenient pad_sequences utility function:
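
A sketch assuming the tensorflow.keras namespace (adjust the import if you use standalone Keras); padding='post' pads on the right with 0, which matches our -PAD- index:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

train_sentences_X = pad_sequences(train_sentences_X, maxlen=MAX_LENGTH, padding='post')
test_sentences_X = pad_sequences(test_sentences_X, maxlen=MAX_LENGTH, padding='post')
train_tags_y = pad_sequences(train_tags_y, maxlen=MAX_LENGTH, padding='post')
test_tags_y = pad_sequences(test_tags_y, maxlen=MAX_LENGTH, padding='post')
```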

Network architecture

Let’s now define the model. Here’s what we need to have in mind:

  • We’ll need an embedding layer that learns a word vector for each word in our vocabulary. Remember that in the Word Embeddings Guide we mentioned that this is one of the methods of computing a word embeddings model.
  • We’ll need an LSTM layer wrapped in the Bidirectional modifier. The Bidirectional wrapper also feeds the LSTM the values that come later in the sequence, not just the previous ones.
  • We need to set the return_sequences=True parameter so that the LSTM outputs a sequence, not only the final value.
  • After the LSTM layer we need a Dense layer (a fully-connected layer) that picks the appropriate POS tag at each position. Since this dense layer needs to run on each element of the sequence, we wrap it in the TimeDistributed modifier. (The full model is sketched right after this list.)
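
Putting the points above together, here is a minimal sketch of the model, again assuming tensorflow.keras; the embedding size (128) and LSTM size (256) are arbitrary choices you should feel free to tune:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (InputLayer, Embedding, Bidirectional, LSTM,
                                     TimeDistributed, Dense, Activation)

model = Sequential()
model.add(InputLayer(input_shape=(MAX_LENGTH,)))            # fixed-length padded word-id sequences
model.add(Embedding(len(word2index), 128))                  # learn a 128-dim vector per word
model.add(Bidirectional(LSTM(256, return_sequences=True)))  # one output per timestep, both directions
model.add(TimeDistributed(Dense(len(tag2index))))           # a tag score vector for every word
model.add(Activation('softmax'))

model.summary()
```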

There’s one more thing to do before training. We need to transform the sequences of tags into sequences of one-hot encoded tags, because that is the format the Dense layer outputs. Here’s a function that does that:
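
One possible implementation (Keras also ships keras.utils.to_categorical, which does essentially the same thing for the last axis):

```python
import numpy as np

def to_categorical(sequences, categories):
    """One-hot encode every tag id in every padded tag sequence."""
    cat_sequences = []
    for sequence in sequences:
        cats = []
        for item in sequence:
            one_hot = np.zeros(categories)
            one_hot[item] = 1.0
            cats.append(one_hot)
        cat_sequences.append(cats)
    return np.array(cat_sequences)
```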

Here’s what the one-hot encoded tags look like:
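
For example, printing the encoding of the first training sentence (the exact rows depend on your tag vocabulary):

```python
cat_train_tags_y = to_categorical(train_tags_y, len(tag2index))
print(cat_train_tags_y[0])   # one row per word position, with a single 1 marking the tag
```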

The moment we’ve all been waiting for, training the model:
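
A sketch of the training step; the optimizer, batch size and number of epochs are illustrative choices, not the only reasonable ones:

```python
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(train_sentences_X,
          to_categorical(train_tags_y, len(tag2index)),
          batch_size=128,
          epochs=10,
          validation_split=0.2)
```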

Let’s evaluate our model on the data we’ve kept aside for testing:
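
For example (scores[1] is the plain accuracy, padding positions included):

```python
scores = model.evaluate(test_sentences_X, to_categorical(test_tags_y, len(tag2index)))
print(f"Accuracy: {scores[1] * 100}")
```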

In case you got numbers similar to mine, don’t get overexcited. There’s a catch: a lot of our success comes from the padding, and padding is really easy to get right. Let’s set this issue aside for now.

Let’s take two test sentences:
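
Any two tokenized sentences will do, for instance:

```python
test_samples = [
    "running is very important for me .".split(),
    "I was running every day for a month .".split(),
]
```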

Let’s transform them into padded sequences of word ids:
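
Using the same word2index lookup and padding as before:

```python
test_samples_X = pad_sequences(
    [[word2index.get(word.lower(), word2index['-OOV-']) for word in sentence]
     for sentence in test_samples],
    maxlen=MAX_LENGTH, padding='post')
```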

Let’s make our first predictions:
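
The raw output is a probability distribution over tags for every position of every sentence:

```python
predictions = model.predict(test_samples_X)
print(predictions, predictions.shape)   # shape: (2, MAX_LENGTH, number of tags)
```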

Pretty hard to read, right? We need to do the “reverse” operation of to_categorical:
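
One way to do it is to take the argmax at every position and map it back through an inverted tag dictionary (logits_to_tokens and index2tag are names chosen for this sketch):

```python
def logits_to_tokens(sequences, index2tag):
    """Pick the highest-scoring tag at every position and map it back to its name."""
    return [[index2tag[np.argmax(scores)] for scores in sequence]
            for sequence in sequences]

index2tag = {i: t for t, i in tag2index.items()}
print(logits_to_tokens(predictions, index2tag))
```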

Here’s how the predictions look:

You are probably fairly acquainted with the Penn Treebank tagset by now, and you’re probably disappointed with the result. What’s wrong?

For most of the sentences, the largest part consists of padding tokens. These are really easy to guess, hence the suspiciously high performance. Let’s write a custom accuracy metric that ignores the padding:
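
A possible implementation using the tf.keras backend (with Keras 3 you would express the same thing with keras.ops); the padding class is assumed to sit at index 0:

```python
from tensorflow.keras import backend as K

def ignore_class_accuracy(to_ignore=0):
    """Accuracy computed only on positions whose true tag is not the padding class."""
    def ignore_accuracy(y_true, y_pred):
        y_true_class = K.argmax(y_true, axis=-1)
        y_pred_class = K.argmax(y_pred, axis=-1)

        # 1.0 where the true tag is a real tag, 0.0 where it is padding.
        mask = K.cast(K.not_equal(y_true_class, to_ignore), K.floatx())
        matches = K.cast(K.equal(y_true_class, y_pred_class), K.floatx()) * mask
        return K.sum(matches) / K.maximum(K.sum(mask), 1.0)
    return ignore_accuracy
```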

Let’s now retrain, adding the ignore_class_accuracy metric at the compile stage:
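
For example:

```python
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy', ignore_class_accuracy(0)])
```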

Let’s now fit the model again:
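
The fit call itself is the same as before:

```python
model.fit(train_sentences_X,
          to_categorical(train_tags_y, len(tag2index)),
          batch_size=128,
          epochs=10,
          validation_split=0.2)
```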

I’ve stopped the training when the ignore_accuracy reached 0.96. Let’s see how the model performs now:
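
Evaluating again, this time reporting every metric by name:

```python
scores = model.evaluate(test_sentences_X, to_categorical(test_tags_y, len(tag2index)))
print(dict(zip(model.metrics_names, scores)))   # loss, raw accuracy, padding-ignoring accuracy
```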

As you can observe, the results are better, and they can be improved further. Some strategies you can try are:

  • Use pretrained vectors – Transfer Learning
  • Use custom features like in classic POS Tagging combined with embeddings
  • Try different architectures