Build a POS tagger with an LSTM using Keras
In this tutorial, we’re going to implement a POS Tagger with Keras. On this blog, we’ve already covered the theory behind POS taggers: POS Tagger with Decision Trees and POS Tagger with Conditional Random Field.
Recently we also started looking at Deep Learning, using Keras, a popular Python Library. You can get started with Keras in this Sentiment Analysis with Keras Tutorial. This tutorial will combine the two subjects. We’ll be building a POS tagger using Keras and a Bidirectional LSTM Layer. Let’s use a corpus that’s included in NLTK:
import nltk

tagged_sentences = nltk.corpus.treebank.tagged_sents()

print(tagged_sentences[0])
print("Tagged sentences: ", len(tagged_sentences))
print("Tagged words:", len(nltk.corpus.treebank.tagged_words()))

# [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]
# Tagged sentences: 3914
# Tagged words: 100676
Let’s restructure the data a bit. Let’s separate the words from the tags.
import numpy as np

sentences, sentence_tags = [], []
for tagged_sentence in tagged_sentences:
    sentence, tags = zip(*tagged_sentence)
    sentences.append(np.array(sentence))
    sentence_tags.append(np.array(tags))

# Let's see how a sequence looks
print(sentences[5])
print(sentence_tags[5])

# ['Lorillard' 'Inc.' ',' 'the' 'unit' 'of' 'New' 'York-based' 'Loews'
#  'Corp.' 'that' '*T*-2' 'makes' 'Kent' 'cigarettes' ',' 'stopped' 'using'
#  'crocidolite' 'in' 'its' 'Micronite' 'cigarette' 'filters' 'in' '1956'
#  '.']
# ['NNP' 'NNP' ',' 'DT' 'NN' 'IN' 'JJ' 'JJ' 'NNP' 'NNP' 'WDT' '-NONE-' 'VBZ'
#  'NNP' 'NNS' ',' 'VBD' 'VBG' 'NN' 'IN' 'PRP$' 'NN' 'NN' 'NNS' 'IN' 'CD'
#  '.']
As always, before training a model we need to split the data into training and testing sets. Let’s use the train_test_split function from Scikit-Learn:
from sklearn.model_selection import train_test_split

(train_sentences,
 test_sentences,
 train_tags,
 test_tags) = train_test_split(sentences, sentence_tags, test_size=0.2)
Keras needs to work with numbers, not with words (or tags). Let’s assign each word (and tag) a unique integer: we compute the set of unique words (and tags), turn it into a list, and index the items in dictionaries. These dictionaries are the word vocabulary and the tag vocabulary. We’ll also add a special value for padding the sequences (more on that later), and another one for unknown words (OOV – Out Of Vocabulary).
words, tags = set([]), set([])

for s in train_sentences:
    for w in s:
        words.add(w.lower())

for ts in train_tags:
    for t in ts:
        tags.add(t)

word2index = {w: i + 2 for i, w in enumerate(list(words))}
word2index['-PAD-'] = 0  # The special value used for padding
word2index['-OOV-'] = 1  # The special value used for OOVs

tag2index = {t: i + 1 for i, t in enumerate(list(tags))}
tag2index['-PAD-'] = 0  # The special value used for padding
Let’s now convert the datasets to integers, both the words and the tags.
train_sentences_X, test_sentences_X, train_tags_y, test_tags_y = [], [], [], []

for s in train_sentences:
    s_int = []
    for w in s:
        try:
            s_int.append(word2index[w.lower()])
        except KeyError:
            s_int.append(word2index['-OOV-'])
    train_sentences_X.append(s_int)

for s in test_sentences:
    s_int = []
    for w in s:
        try:
            s_int.append(word2index[w.lower()])
        except KeyError:
            s_int.append(word2index['-OOV-'])
    test_sentences_X.append(s_int)

for s in train_tags:
    train_tags_y.append([tag2index[t] for t in s])

for s in test_tags:
    test_tags_y.append([tag2index[t] for t in s])

print(train_sentences_X[0])
print(test_sentences_X[0])
print(train_tags_y[0])
print(test_tags_y[0])

# [2385, 9167, 860, 4989, 6805, 6349, 9078, 3938, 862, 1092, 4799, 860, 1198, 1131, 879, 5014, 7870, 704, 4415, 8049, 9444, 8175, 8172, 10058, 10034, 9890, 1516, 8311, 7870, 1489, 7967, 6458, 8859, 9720, 6754, 5402, 9254, 2663]
# [3829, 3347, 1, 8311, 6240, 982, 7936, 1, 3552, 4558, 1, 9007, 8175, 8172, 637, 4517, 7392, 3124, 860, 5416, 920, 3301, 6240, 1205, 5282, 6683, 9890, 758, 4415, 1, 6240, 3386, 9072, 3219, 6240, 9157, 5611, 6240, 6969, 4517, 2956, 175, 2663]
# [11, 35, 39, 3, 7, 9, 20, 42, 42, 3, 35, 39, 35, 35, 22, 7, 10, 16, 32, 35, 31, 17, 3, 11, 42, 7, 9, 3, 10, 16, 6, 25, 12, 11, 42, 17, 6, 44]
# [2, 35, 16, 3, 20, 35, 42, 42, 16, 25, 7, 31, 17, 3, 35, 15, 42, 7, 39, 35, 35, 16, 20, 42, 40, 16, 7, 6, 32, 30, 20, 42, 42, 37, 20, 42, 3, 20, 42, 15, 11, 42, 44]
Keras can only deal with fixed-size sequences. We will right-pad all the sequences with a special value (0 as the index and "-PAD-" as the corresponding word/tag) up to the length of the longest sequence in the dataset. Let’s compute the maximum length over all the sequences.
MAX_LENGTH = len(max(train_sentences_X, key=len))
print(MAX_LENGTH)  # 271
Now we can use Keras’s convenient pad_sequences utility function:
from keras.preprocessing.sequence import pad_sequences

train_sentences_X = pad_sequences(train_sentences_X, maxlen=MAX_LENGTH, padding='post')
test_sentences_X = pad_sequences(test_sentences_X, maxlen=MAX_LENGTH, padding='post')
train_tags_y = pad_sequences(train_tags_y, maxlen=MAX_LENGTH, padding='post')
test_tags_y = pad_sequences(test_tags_y, maxlen=MAX_LENGTH, padding='post')

print(train_sentences_X[0])
print(test_sentences_X[0])
print(train_tags_y[0])
print(test_tags_y[0])
Network architecture
Let’s now define the model. Here’s what we need to have in mind:
- We’ll need an embedding layer that computes a word vector model for our words. Remember that in the Word Embeddings Guide we’ve mentioned that this is one of the methods of computing a word embeddings model.
- We’ll need an LSTM layer with a Bidirectional modifier. The Bidirectional modifier also feeds the LSTM the next values in the sequence, not just the previous ones.
- We need to set the return_sequences=True parameter so that the LSTM outputs a sequence, not only the final value.
- After the LSTM layer we need a Dense layer (or fully-connected layer) that picks the appropriate POS tag. Since this dense layer needs to run on each element of the sequence, we need to add the TimeDistributed modifier.
from keras.models import Sequential
from keras.layers import Dense, LSTM, InputLayer, Bidirectional, TimeDistributed, Embedding, Activation
from keras.optimizers import Adam

model = Sequential()
model.add(InputLayer(input_shape=(MAX_LENGTH, )))
model.add(Embedding(len(word2index), 128))
model.add(Bidirectional(LSTM(256, return_sequences=True)))
model.add(TimeDistributed(Dense(len(tag2index))))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer=Adam(0.001),
              metrics=['accuracy'])

model.summary()

# _________________________________________________________________
# Layer (type)                 Output Shape              Param #
# =================================================================
# embedding_1 (Embedding)      (None, 271, 128)          1302400
# _________________________________________________________________
# bidirectional_1 (Bidirection (None, 271, 512)          788480
# _________________________________________________________________
# time_distributed_1 (TimeDist (None, 271, 47)           24111
# _________________________________________________________________
# activation_1 (Activation)    (None, 271, 47)           0
# =================================================================
# Total params: 2,114,991
# Trainable params: 2,114,991
# Non-trainable params: 0
# _________________________________________________________________
There’s one more thing to do before training. We need to transform the sequences of tags into sequences of one-hot encoded tags, since that’s the shape the Dense layer outputs and the categorical cross-entropy loss expects. Here’s a function that does that:
def to_categorical(sequences, categories):
    cat_sequences = []
    for s in sequences:
        cats = []
        for item in s:
            cats.append(np.zeros(categories))
            cats[-1][item] = 1.0
        cat_sequences.append(cats)
    return np.array(cat_sequences)
Here’s what the one-hot encoded tags look like:
cat_train_tags_y = to_categorical(train_tags_y, len(tag2index))
print(cat_train_tags_y[0])
The moment we’ve all been waiting for, training the model:
model.fit(train_sentences_X, to_categorical(train_tags_y, len(tag2index)),
          batch_size=128, epochs=40, validation_split=0.2)
Let’s evaluate our model on the data we’ve kept aside for testing:
scores = model.evaluate(test_sentences_X, to_categorical(test_tags_y, len(tag2index)))
print(f"{model.metrics_names[1]}: {scores[1] * 100}")  # acc: 99.09751977804825
In case you’ve got the same numbers as me, don’t get overexcited. There’s a catch: a lot of our success is because there’s a lot of padding and padding is really easy to get right. Let’s set aside this issue for now.
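Before moving on, it’s easy to quantify the issue. Here’s a quick check (a minimal sketch using only the padded train_sentences_X array and the 0 index reserved for '-PAD-' above) of how much of the training data is actually padding:

padding_fraction = (train_sentences_X == 0).mean()  # fraction of positions that are '-PAD-'
print(f"Fraction of padding positions: {padding_fraction:.2%}")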
Let’s take two test sentences:
test_samples = [
    "running is very important for me .".split(),
    "I was running every day for a month .".split()
]
print(test_samples)

# [['running', 'is', 'very', 'important', 'for', 'me', '.'], ['I', 'was', 'running', 'every', 'day', 'for', 'a', 'month', '.']]
Let’s transform them into padded sequences of word ids:
test_samples_X = []
for s in test_samples:
    s_int = []
    for w in s:
        try:
            s_int.append(word2index[w.lower()])
        except KeyError:
            s_int.append(word2index['-OOV-'])
    test_samples_X.append(s_int)

test_samples_X = pad_sequences(test_samples_X, maxlen=MAX_LENGTH, padding='post')
print(test_samples_X)
Let’s make our first predictions:
predictions = model.predict(test_samples_X)
print(predictions, predictions.shape)
Pretty hard to read, right? We need to do the “reverse” operation of to_categorical:
def logits_to_tokens(sequences, index):
    token_sequences = []
    for categorical_sequence in sequences:
        token_sequence = []
        for categorical in categorical_sequence:
            token_sequence.append(index[np.argmax(categorical)])
        token_sequences.append(token_sequence)
    return token_sequences
Here’s how the predictions look:
print(logits_to_tokens(predictions, {i: t for t, i in tag2index.items()}))

# ['JJ', 'NNS', 'NN', 'NNP', 'NNP', 'NNS', '-NONE-', '-PAD-', ...
# ['VBP', 'CD', 'JJ', 'CD', 'NNS', 'NNP', 'POS', 'NN', '-NONE-', '-PAD-', ...
You’re probably fairly acquainted with the Penn Treebank tagset by now, and you’re probably disappointed with the result. What went wrong?
For most of the sentences, the largest part consists of padding tokens. These are really easy to guess, hence the super high performance. Let’s write a custom accuracy metric that ignores the padding:
from keras import backend as K

def ignore_class_accuracy(to_ignore=0):
    def ignore_accuracy(y_true, y_pred):
        y_true_class = K.argmax(y_true, axis=-1)
        y_pred_class = K.argmax(y_pred, axis=-1)

        ignore_mask = K.cast(K.not_equal(y_pred_class, to_ignore), 'int32')
        matches = K.cast(K.equal(y_true_class, y_pred_class), 'int32') * ignore_mask
        accuracy = K.sum(matches) / K.maximum(K.sum(ignore_mask), 1)
        return accuracy
    return ignore_accuracy
Let’s now retrain, adding the ignore_class_accuracy metric at the compile stage:
from keras.models import Sequential
from keras.layers import Dense, LSTM, InputLayer, Bidirectional, TimeDistributed, Embedding, Activation
from keras.optimizers import Adam

model = Sequential()
model.add(InputLayer(input_shape=(MAX_LENGTH, )))
model.add(Embedding(len(word2index), 128))
model.add(Bidirectional(LSTM(256, return_sequences=True)))
model.add(TimeDistributed(Dense(len(tag2index))))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer=Adam(0.001),
              metrics=['accuracy', ignore_class_accuracy(0)])

model.summary()

# _________________________________________________________________
# Layer (type)                 Output Shape              Param #
# =================================================================
# embedding_2 (Embedding)      (None, 271, 128)          1292544
# _________________________________________________________________
# bidirectional_2 (Bidirection (None, 271, 512)          788480
# _________________________________________________________________
# time_distributed_2 (TimeDist (None, 271, 47)           24111
# _________________________________________________________________
# activation_2 (Activation)    (None, 271, 47)           0
# =================================================================
# Total params: 2,105,135
# Trainable params: 2,105,135
# Non-trainable params: 0
# _________________________________________________________________
Let’s now retrain:
model.fit(train_sentences_X, to_categorical(train_tags_y, len(tag2index)),
          batch_size=128, epochs=40, validation_split=0.2)
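If you don’t want to stop the training by hand, one option is a small custom callback that halts training once a monitored metric crosses a threshold. This is only a sketch (not necessarily what was used for the run below), and it assumes the custom metric shows up in the training logs under the inner function’s name, ignore_accuracy:

from keras.callbacks import Callback

class StopAtAccuracy(Callback):
    """Stop training once a monitored metric reaches a target value."""

    def __init__(self, monitor='ignore_accuracy', target=0.96):
        super(StopAtAccuracy, self).__init__()
        self.monitor = monitor
        self.target = target

    def on_epoch_end(self, epoch, logs=None):
        value = (logs or {}).get(self.monitor)
        if value is not None and value >= self.target:
            print(f"\nReached {self.monitor} = {value:.4f}, stopping training.")
            self.model.stop_training = True

# model.fit(..., callbacks=[StopAtAccuracy()])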
I stopped the training when the ignore_accuracy reached 0.96. Let’s see how the model performs now:
predictions = model.predict(test_samples_X)
print(logits_to_tokens(predictions, {i: t for t, i in tag2index.items()}))

# ['NN', 'VBZ', 'RB', 'JJ', 'IN', 'DT', '.', '-PAD-', ...
# ['PRP', 'VBD', 'VBG', 'DT', 'NN', 'IN', 'DT', 'NN', '.', '-PAD-', ...
As you can observe, the results are better, and they can be improved further. Some strategies you can try are:
- Use pretrained word vectors (transfer learning); a sketch follows this list
- Use custom features, like in classic POS tagging, combined with embeddings
- Try different architectures
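Here’s a rough sketch of the first idea: initialising the Embedding layer with pretrained GloVe vectors instead of learning the embeddings from scratch. The file name glove.6B.100d.txt and the 100-dimensional vectors are just one possible choice (you’d need to download them from the Stanford GloVe page first), and the snippet reuses the word2index mapping built earlier:

import numpy as np

EMBEDDING_DIM = 100

# Load the pretrained GloVe vectors into a {word: vector} dictionary.
embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

# Build a matrix aligned with word2index; words without a pretrained
# vector (including '-PAD-' and '-OOV-') simply stay all-zero.
embedding_matrix = np.zeros((len(word2index), EMBEDDING_DIM))
for word, i in word2index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

# Then swap the Embedding layer in the model above for:
# model.add(Embedding(len(word2index), EMBEDDING_DIM,
#                     weights=[embedding_matrix], trainable=False))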
Comments
Very good tutorial. One note: I don’t think the change in the last part actually affects performance. The Keras documentation states (https://keras.io/metrics/):
“A metric function is similar to a loss function, except that the results from evaluating a metric are not used when training the model.”
Thus the observed improvement is likely due to chance. From what I understand, it appears that you have to set the ‘mask_zero’ parameter on the embedding to True (see https://keras.io/layers/embeddings/), which leads to the input value 0 being treated as padding and thus excluded from the vocabulary. However, there’s a SO post with an open bounty asking for a canonical response, indicating that it’s not all that clear how this actually works: https://stackoverflow.com/questions/47485216/keras-ebmedding-layer-mask-zero-how-does-mask-zero-work
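For reference, here’s what that suggestion would look like on the model from this post: the only change is mask_zero=True on the Embedding layer, so Keras treats index 0 (our '-PAD-' value) as padding in the downstream layers. This is just a sketch reusing the definitions above, and, as the comment notes, the exact effect is worth verifying yourself:

from keras.models import Sequential
from keras.layers import Dense, LSTM, InputLayer, Bidirectional, TimeDistributed, Embedding, Activation
from keras.optimizers import Adam

model = Sequential()
model.add(InputLayer(input_shape=(MAX_LENGTH, )))
model.add(Embedding(len(word2index), 128, mask_zero=True))  # mask index 0 ('-PAD-')
model.add(Bidirectional(LSTM(256, return_sequences=True)))
model.add(TimeDistributed(Dense(len(tag2index))))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer=Adam(0.001),
              metrics=['accuracy', ignore_class_accuracy(0)])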
I have encountered an error:

Traceback (most recent call last):
  File "nlp_lstm.py", line 73, in <module>
    test_tags_y.append([tag2index[t] for t in s])
  File "nlp_lstm.py", line 73, in <listcomp>
    test_tags_y.append([tag2index[t] for t in s])
KeyError: 'C'
This is because the tag ‘C’ is not found in tag2index.
You can use sentence_tags rather than only train_tags to create the tag2index dictionary:
for s in sentences:
    for w in s:
        words.add(w.lower())

for ts in sentence_tags:
    for t in ts:
        tags.add(t)
I would like to understand more about the code logic.
Hi,
Thank you for the great tutorial!
In my understanding, when training, the model tries to minimize the loss, which in our case is skewed because it includes a lot of padding. What if we used ignore_accuracy as the objective for training to optimize?
model.compile(loss=ignore_class_accuracy(0),
              optimizer=Adam(0.001),
              metrics=['accuracy', ignore_class_accuracy(0)])
Does it make sense?
It does make sense, thanks for posting!
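One way to act on this idea (a sketch only, assuming the padding tag keeps index 0 as in this post, and not something tested here) is to mask the padded positions out of the categorical cross-entropy loss itself, so the padding no longer dominates the objective:

from keras import backend as K

def masked_categorical_crossentropy(y_true, y_pred):
    # Per-timestep cross-entropy, with positions whose true tag is the
    # padding class (index 0) zeroed out before averaging.
    mask = K.cast(K.not_equal(K.argmax(y_true, axis=-1), 0), K.floatx())
    loss = K.categorical_crossentropy(y_true, y_pred)
    return K.sum(loss * mask) / K.maximum(K.sum(mask), 1.0)

# model.compile(loss=masked_categorical_crossentropy,
#               optimizer=Adam(0.001),
#               metrics=['accuracy', ignore_class_accuracy(0)])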