Getting started with Keras for NLP
In the previous tutorial on Deep Learning, we built a super simple network with numpy. I figured that the best next step is to jump right in and build some deep learning models for text. The best way to do this, at the time of writing, is by using Keras.
What is Keras?
Keras is a deep learning framework that, under the hood, uses other deep learning frameworks in order to expose a beautiful, simple to use and fun to work with high-level API. Keras can use any of these backends:
- TensorFlow – Google’s deep learning library
- Theano – may not be developed further
- CNTK – Microsoft’s deep learning library
- MXNet – deep learning library from Apache.org (currently under development)
Keras uses these frameworks to deliver powerful computation while exposing a beautiful, intuitive API that feels a lot like scikit-learn’s.
Here’s what Keras brings to the table:
- The integration with the various backends is seamless
- Training runs on either the CPU or the GPU
- It comes in two flavours: sequential and functional. These are just two ways of thinking about building models, and the resulting models are perfectly equivalent (see the short sketch after this list). We’re going to use the sequential one.
- Fast prototyping – with all these good abstractions in place, you can focus more on the problem and on hyperparameter tuning.
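To make the two flavours concrete, here’s a tiny sketch of the same hypothetical two-layer model written with both APIs. This is my own illustrative example, not part of the tutorial, and the input size of 100 is made up:

from keras.models import Sequential, Model
from keras.layers import Dense, Input

# Sequential flavour: stack layers one after another
seq_model = Sequential()
seq_model.add(Dense(units=64, activation='relu', input_dim=100))
seq_model.add(Dense(units=1, activation='sigmoid'))

# Functional flavour: wire layers explicitly; the resulting model is equivalent
inputs = Input(shape=(100,))
hidden = Dense(units=64, activation='relu')(inputs)
outputs = Dense(units=1, activation='sigmoid')(hidden)
func_model = Model(inputs=inputs, outputs=outputs)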
Let’s now start using Keras to develop various types of models for Natural Language Processing. Here’s what we’ll be building:
- (Dense) Deep Neural Network – the classic NN model – uses the BOW (bag-of-words) model
- Convolutional Network – a network built with 1D convolutional layers – uses word vectors
- Recurrent Network – an LSTM (Long Short-Term Memory) network – uses word vectors
- Transfer learning for NLP – learn how to load spaCy’s vectors or GloVe vectors – uses word vectors
Before getting started, you might want to do a refresher on Word Embeddings.
Deep Neural Network
We’re going to use the same dataset we used in the Introduction to Deep Learning tutorial. Let’s just quickly cover the data cleaning part:
import re
import pandas as pd
from sklearn.model_selection import train_test_split

def clean_review(text):
    # Strip HTML tags
    text = re.sub('<[^<]+?>', ' ', text)
    # Strip escaped quotes
    text = text.replace('\\"', '')
    # Strip quotes
    text = text.replace('"', '')
    return text

df = pd.read_csv('labeledTrainData.tsv', sep='\t', quoting=3)
df['cleaned_review'] = df['review'].apply(clean_review)

X_train, X_test, y_train, y_test = train_test_split(
    df['cleaned_review'], df['sentiment'], test_size=0.2)
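As a quick sanity check (the sample text below is made up, not from the dataset), here’s what the cleaning function does; note the extra spaces left where the HTML tags were:

# Hypothetical sample review, just to illustrate clean_review
sample = 'I <b>really</b> liked it. \\"A must see\\"'
print(clean_review(sample))
# I  really  liked it. A must see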
Let’s now build a CountVectorizer the way we usually do:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

vectorizer = CountVectorizer(binary=True,
                             stop_words=stopwords.words('english'),
                             lowercase=True,
                             min_df=3, max_df=0.9, max_features=5000)
X_train_onehot = vectorizer.fit_transform(X_train)
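Just to get a feel for the result (this check isn’t in the original post), the vectorizer produces a sparse document-term matrix with one row per review and one column per vocabulary word:

# One row per training review, at most 5000 columns (max_features)
print(X_train_onehot.shape)                  # e.g. (20000, 5000) for this split
print(len(vectorizer.get_feature_names()))   # 5000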
Here’s how to create a simple, two-layer network. The first layer (which actually comes after an input layer) is called the hidden layer, and the second one is called the output layer. Notice how we had to specify the input dimension (input_dim) and how we only have one unit in the output layer, because we’re dealing with a binary classification problem. For the same reason, we chose the sigmoid as the output layer’s activation function and binary_crossentropy as the loss function:
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(units=500, activation='relu',
                input_dim=len(vectorizer.get_feature_names())))
model.add(Dense(units=1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.summary()

# _________________________________________________________________
# Layer (type)                 Output Shape              Param #
# =================================================================
# dense_22 (Dense)             (None, 500)               2500500
# _________________________________________________________________
# dense_23 (Dense)             (None, 1)                 501
# =================================================================
# Total params: 2,501,001
# Trainable params: 2,501,001
# Non-trainable params: 0
# _________________________________________________________________
Here’s how the training is done:
model.fit(X_train_onehot[:-100], y_train[:-100],
          epochs=2, batch_size=128, verbose=1,
          validation_data=(X_train_onehot[-100:], y_train[-100:]))
Notice how we set aside some samples for validation during training. We still need to evaluate on the test data:
scores = model.evaluate(vectorizer.transform(X_test), y_test, verbose=1)
print("Accuracy:", scores[1])  # Accuracy: 0.875
We got an accuracy of 87.5%, which is pretty good. Let’s check out the other models.
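As a quick illustration of using the trained bag-of-words model on new text (the reviews below are made up, and this assumes sparse input works for predict the same way it did for evaluate above):

# Hypothetical new reviews; transform them with the SAME fitted vectorizer
new_reviews = ["A wonderful film, I loved every minute of it.",
               "Dull, predictable and far too long."]
new_onehot = vectorizer.transform(new_reviews)
print(model.predict(new_onehot))  # probabilities of the positive class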
Convolutional Network
To work with conv nets and recurrent nets, we need to transform the texts into sequences of word ids. We will train an embedding layer, and the word ids are used to look up the corresponding word vectors in it.
word2idx = {word: idx for idx, word in enumerate(vectorizer.get_feature_names())}
tokenize = vectorizer.build_tokenizer()
preprocess = vectorizer.build_preprocessor()

def to_sequence(tokenizer, preprocessor, index, text):
    words = tokenizer(preprocessor(text))
    indexes = [index[word] for word in words if word in index]
    return indexes

print(to_sequence(tokenize, preprocess, word2idx, "This is an important test!"))
# [2269, 4453]

X_train_sequences = [to_sequence(tokenize, preprocess, word2idx, x) for x in X_train]
print(X_train_sequences[0])
We have a problem though: the sequences have different lengths. We solve this by padding the sequences on the left with a special value – the vocabulary size (5000), which is not assigned to any word.
# Compute the max length of a text
MAX_SEQ_LENGHT = len(max(X_train_sequences, key=len))
print("MAX_SEQ_LENGHT=", MAX_SEQ_LENGHT)

from keras.preprocessing.sequence import pad_sequences

N_FEATURES = len(vectorizer.get_feature_names())
X_train_sequences = pad_sequences(X_train_sequences, maxlen=MAX_SEQ_LENGHT, value=N_FEATURES)
print(X_train_sequences[0])
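To see what the padding does, here’s a tiny toy example with made-up sequences: shorter sequences get padded on the left (the default 'pre' padding) with the out-of-vocabulary value.

# Toy example: two made-up sequences padded to length 4 with value 5000
print(pad_sequences([[1, 2, 3], [4, 5]], maxlen=4, value=5000))
# [[5000    1    2    3]
#  [5000 5000    4    5]]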
Let’s now define a simple CNN for text classification:
from keras.models import Sequential
from keras.layers import Dense, Conv1D, MaxPooling1D, Flatten, Embedding

model = Sequential()
model.add(Embedding(len(vectorizer.get_feature_names()) + 1,
                    64,  # Embedding size
                    input_length=MAX_SEQ_LENGHT))
model.add(Conv1D(64, 5, activation='relu'))
model.add(MaxPooling1D(5))
model.add(Flatten())
model.add(Dense(units=64, activation='relu'))
model.add(Dense(units=1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
print(model.summary())
Training the model looks the same as before:
model.fit(X_train_sequences[:-100], y_train[:-100],
          epochs=3, batch_size=512, verbose=1,
          validation_data=(X_train_sequences[-100:], y_train[-100:]))
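As a side note, instead of slicing off the last 100 examples by hand you can let Keras hold out a validation fraction for you. A minimal sketch (not from the original post), assuming the same model and data:

import numpy as np

# Let Keras hold out 10% of the training data for validation
model.fit(X_train_sequences, np.asarray(y_train),
          epochs=3, batch_size=512, verbose=1,
          validation_split=0.1)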
Let’s now transform the test data to sequences and pad them:
X_test_sequences = [to_sequence(tokenize, preprocess, word2idx, x) for x in X_test]
X_test_sequences = pad_sequences(X_test_sequences, maxlen=MAX_SEQ_LENGHT, value=N_FEATURES)
Here’s how to evaluate the model:
scores = model.evaluate(X_test_sequences, y_test, verbose=1)
print("Accuracy:", scores[1])  # 0.8766
LSTM Network
Let’s build what’s probably the most popular type of model in NLP at the moment: the Long Short-Term Memory (LSTM) network. This architecture is specially designed to work on sequence data, so it fits perfectly for many NLP tasks like tagging and text classification. It treats the text as a sequence rather than as a bag of words or a collection of n-grams.
Here’s a possible model definition:
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding

model = Sequential()
model.add(Embedding(len(vectorizer.get_feature_names()) + 1,
                    64,  # Embedding size
                    input_length=MAX_SEQ_LENGHT))
model.add(LSTM(64))
model.add(Dense(units=1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
print(model.summary())
Training is similar:
model.fit(X_train_sequences[:-100], y_train[:-100],
          epochs=2, batch_size=128, verbose=1,
          validation_data=(X_train_sequences[-100:], y_train[-100:]))
Here’s the evaluation phase and results:
scores = model.evaluate(X_test_sequences, y_test, verbose=1)
print("Accuracy:", scores[1])  # 0.875
In the next two sections, we’re going to explore transfer learning, a method for reducing the number of parameters we need to train for a network.
Transfer Learning with spaCy embeddings
Notice how in the previous two examples we used an Embedding layer. That layer had to be trained, adding to the number of parameters that need to be learned. What if we used precomputed embeddings instead? We can certainly do this. Say we trained a Word2Vec model on our corpus and then used those embeddings for the various other models we need to train. In this tutorial, we’ll first use the spaCy embeddings. Here’s how to do that:
import spacy
import numpy as np

nlp = spacy.load('en_core_web_md')

EMBEDDINGS_LEN = len(nlp.vocab['apple'].vector)
print("EMBEDDINGS_LEN=", EMBEDDINGS_LEN)  # 300

embeddings_index = np.zeros((len(vectorizer.get_feature_names()) + 1, EMBEDDINGS_LEN))
for word, idx in word2idx.items():
    try:
        embedding = nlp.vocab[word].vector
        embeddings_index[idx] = embedding
    except:
        pass
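As an optional sanity check (not in the original post), you can count how many of our vocabulary words actually have a pretrained spaCy vector; the ones that don’t keep the all-zeros row:

# Words missing from spaCy's vocabulary keep an all-zeros embedding row
covered = sum(1 for word in word2idx if nlp.vocab[word].has_vector)
print("Words with a pretrained vector:", covered, "/", len(word2idx))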
Next, we’ll define the same network as before, but using a pretrained Embedding layer:
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding

model = Sequential()
model.add(Embedding(len(vectorizer.get_feature_names()) + 1,
                    EMBEDDINGS_LEN,  # Embedding size
                    weights=[embeddings_index],
                    input_length=MAX_SEQ_LENGHT,
                    trainable=False))
model.add(LSTM(300))
model.add(Dense(units=1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
print(model.summary())
Here’s how that performs:
model.fit(X_train_sequences[:-100], y_train[:-100],
          epochs=1, batch_size=128, verbose=1,
          validation_data=(X_train_sequences[-100:], y_train[-100:]))

scores = model.evaluate(X_test_sequences, y_test, verbose=1)
print("Accuracy:", scores[1])  # 0.8508
Transfer learning with GloVe embeddings
In this section we’re going to do the same, but with the smaller GloVe embeddings (50 dimensions instead of spaCy’s 300).
import numpy as np

GLOVE_PATH = './glove.6B/glove.6B.50d.txt'
GLOVE_VECTOR_LENGHT = 50

def read_glove_vectors(path, lenght):
    embeddings = {}
    with open(path) as glove_f:
        for line in glove_f:
            chunks = line.split()
            assert len(chunks) == lenght + 1
            embeddings[chunks[0]] = np.array(chunks[1:], dtype='float32')
    return embeddings

GLOVE_INDEX = read_glove_vectors(GLOVE_PATH, GLOVE_VECTOR_LENGHT)

# Init the embeddings layer with GloVe embeddings
embeddings_index = np.zeros((len(vectorizer.get_feature_names()) + 1, GLOVE_VECTOR_LENGHT))
for word, idx in word2idx.items():
    try:
        embedding = GLOVE_INDEX[word]
        embeddings_index[idx] = embedding
    except:
        pass
Let’s try this model out:
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding

model = Sequential()
model.add(Embedding(len(vectorizer.get_feature_names()) + 1,
                    GLOVE_VECTOR_LENGHT,  # Embedding size
                    weights=[embeddings_index],
                    input_length=MAX_SEQ_LENGHT,
                    trainable=False))
model.add(LSTM(128))
model.add(Dense(units=1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
print(model.summary())

model.fit(X_train_sequences[:-100], y_train[:-100],
          epochs=3, batch_size=128, verbose=1,
          validation_data=(X_train_sequences[-100:], y_train[-100:]))

scores = model.evaluate(X_test_sequences, y_test, verbose=1)
print("Accuracy:", scores[1])  # 0.8296
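Once you’re happy with one of these models, you’ll probably want to reuse it without retraining. Here’s a hedged sketch (not from the original post; the file name is made up) of saving the trained model and loading it back for evaluation:

from keras.models import load_model

model.save('sentiment_glove_lstm.h5')             # architecture + weights + optimizer state
restored = load_model('sentiment_glove_lstm.h5')

scores = restored.evaluate(X_test_sequences, y_test, verbose=1)
print("Accuracy:", scores[1])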
Comments

When running the first example (DNN) I receive a “ValueError: setting an array element with a sequence.” Do we need to reshape X_train_onehot?
Hi Terry,
Hmmm, that’s weird. I’ll check it out in the next few days and get back to you.
Thanks,
Bogdan
Hi,
Thanks for a great post. I was wondering if you had any advice on deploying this model for prediction. I saved the GloVe model as an h5 file and then attempted to load it in another file. I keep getting an error. Do I need to copy over the embedding length in order to run predictions?
Think it should be good to go and you should be able to load it from another file. What error are you getting?
This is amazing. Thank you so much! 🙂
Hi,
valuable tutorial.
Could you please post a tutorial on word2vec, doc2vec and fastText implementations in NLP?
Will do at some point.
Hi,
word2idx = {word: idx for idx, word in enumerate(vectorizer.get_feature_names())}
should be
word2idx = {word: idx + 1 for idx, word in enumerate(vectorizer.get_feature_names())}
as zero is reserved to pad_sequences and should not be used for any word in your texts.
Great post.
Thanks.