Getting started with Keras for NLP

In the previous tutorial on Deep Learning, we built a super simple network with numpy. I figured that the best next step is to jump right in and build some deep learning models for text. The best way to do this at the time of writing is by using Keras.

What is Keras?

Keras is a deep learning framework that runs on top of other deep learning frameworks and exposes a simple, high-level API that is fun to work with. Keras can use any of these backends:

  • TensorFlow – Google’s deep learning library
  • Theano – may not be further developed
  • CNTK – Microsoft’s deep learning library
  • MXNet – deep learning library (currently under development)

Keras uses these frameworks to deliver powerful computation while exposing an intuitive API that looks a lot like scikit-learn’s.

Here’s what Keras brings to the table:

  • The integration with the various backends is seamless
  • Run training on either CPU/GPU
  • Comes in two flavours: sequential or functional. These are just two ways of thinking about building models. The resulting models are perfectly equivalent. We’re going to use the sequential one.
  • Fast prototyping – With all these good abstractions in place, you can focus on the problem and on hyperparameter tuning.
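
To make the "two flavours" point concrete, here is a minimal sketch that builds the same tiny network both ways. The layer sizes and the `INPUT_DIM` value are made up for illustration and are not taken from the tutorial's dataset.

```python
from keras.models import Sequential, Model
from keras.layers import Dense, Input

INPUT_DIM = 20  # hypothetical number of input features

# Sequential flavour: a linear stack of layers.
sequential_model = Sequential([
    Input(shape=(INPUT_DIM,)),
    Dense(8, activation="relu"),
    Dense(1, activation="sigmoid"),
])

# Functional flavour: layers are called on tensors explicitly.
inputs = Input(shape=(INPUT_DIM,))
hidden = Dense(8, activation="relu")(inputs)
outputs = Dense(1, activation="sigmoid")(hidden)
functional_model = Model(inputs=inputs, outputs=outputs)

# Both models have the same layers, so the parameter counts match.
print(sequential_model.count_params(), functional_model.count_params())
```

The functional style pays off once a model needs several inputs, several outputs or shared layers. For a plain stack like this one, the sequential style is shorter.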

Let’s now start using Keras to develop various types of models for Natural Language Processing. Here’s what we’ll be building:

  1. (Dense) Deep Neural Network – The classic NN model – uses the BOW model
  2. Convolutional Network – build a network using 1D Conv Layers – uses word vectors
  3. Recurrent Networks – LSTM Network – Long Short-Term Memory – uses word vectors
  4. Transfer learning for NLP – Learn how to load spaCy’s vectors or GloVe vectors – uses word vectors

Before getting started, you might want to do a refresher on Word Embeddings.

Deep Neural Network

We’re going to use the same dataset we’ve used in the Introduction to DeepLearning Tutorial. Let’s just quickly cover the data cleaning part:

Let’s now build a CountVectorizer the way we usually do:
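
A minimal sketch of this step with scikit-learn, using a few toy sentences as a stand-in for the tutorial's dataset. The `max_features=5000` cap is an assumption, not a value confirmed by the text.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy reviews standing in for the real training texts (hypothetical data).
train_texts = [
    "a great movie with a great cast",
    "a boring and predictable plot",
    "great plot and a brilliant cast",
]

# max_features caps the vocabulary size; 5000 is an assumed value.
vectorizer = CountVectorizer(max_features=5000, lowercase=True)
X_train = vectorizer.fit_transform(train_texts)

# Each row is one document, each column counts one vocabulary word.
print(X_train.shape)
print(sorted(vectorizer.vocabulary_))
```

Note that the default tokenizer ignores single-character tokens such as "a", so they never make it into the vocabulary.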

Here’s how to create a simple, 2-layer network. The first layer (which actually comes after an input layer) is called the hidden layer, and the second one is called the output layer. Notice how we had to specify the input dimension (input_dim) and how we only have 1 unit in the output layer. Because we’re dealing with a binary classification problem, a single output unit with a sigmoid activation is enough, and for the same reason we chose binary_crossentropy as the loss function:
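
One possible definition matching the description above. The hidden layer size (64), the `relu` activation, the `adam` optimizer and the `N_FEATURES` value are assumptions chosen for illustration. Only `input_dim`, the single sigmoid output unit and `binary_crossentropy` come from the text.

```python
from keras.models import Sequential
from keras.layers import Dense

N_FEATURES = 5000  # assumed size of the bag-of-words vocabulary

model = Sequential()
# Hidden layer: input_dim tells Keras how many features each sample has.
model.add(Dense(64, activation="relu", input_dim=N_FEATURES))
# Output layer: one unit with a sigmoid gives a probability for the positive class.
model.add(Dense(1, activation="sigmoid"))

model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()
```

Recent Keras versions may warn that an explicit `Input` layer is preferred over `input_dim`; the resulting model is the same.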

Here’s how the training is done:

Notice how we set aside some samples for doing validation while training. We still need to do the evaluation on test data:
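
A minimal sketch of both steps, training with a validation split and then evaluating on held-out data. It uses random synthetic features instead of the real dataset, so the numbers it prints say nothing about the tutorial's results. The sizes, the epoch count and the 10% split are assumptions.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

rng = np.random.default_rng(0)
# Synthetic stand-in for bag-of-words counts (hypothetical data).
X = rng.integers(0, 3, size=(200, 50)).astype("float32")
y = (X[:, 0] > X[:, 1]).astype("float32")  # labels follow a simple made-up rule
X_train, y_train = X[:160], y[:160]
X_test, y_test = X[160:], y[160:]

model = Sequential()
model.add(Dense(16, activation="relu", input_dim=50))
model.add(Dense(1, activation="sigmoid"))
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# validation_split holds out the last 10% of the training samples;
# they are used for monitoring only, never for weight updates.
history = model.fit(X_train, y_train, validation_split=0.1,
                    epochs=2, batch_size=32, verbose=0)
print(sorted(history.history.keys()))

# The test set was never seen during training or validation.
loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print("test loss:", loss, "test accuracy:", accuracy)
```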

We got 87.5% accuracy, which is pretty good. Let’s check out the other models.

Convolutional Network

To work with conv nets and recurrent nets, we need to transform the texts into sequences of word ids. We will train an embedding layer, and using the word ids we can fetch the corresponding word vectors.
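
The text-to-ids transformation can be sketched in plain Python. The helper names (`build_word_index`, `to_sequence`) are hypothetical, and dropping out-of-vocabulary words is one possible policy, not necessarily the one used in the original code.

```python
import re

def build_word_index(texts, max_words):
    """Map the max_words most frequent words to ids 0..max_words-1."""
    counts = {}
    for text in texts:
        for token in re.findall(r"\w+", text.lower()):
            counts[token] = counts.get(token, 0) + 1
    # Most frequent first; ties are broken alphabetically for determinism.
    most_common = sorted(counts, key=lambda w: (-counts[w], w))[:max_words]
    return {word: idx for idx, word in enumerate(most_common)}

def to_sequence(text, word_index):
    """Replace each known word with its id; unknown words are skipped."""
    tokens = re.findall(r"\w+", text.lower())
    return [word_index[t] for t in tokens if t in word_index]

texts = ["the movie was great", "the plot was thin", "great great cast"]
word_index = build_word_index(texts, max_words=5000)
sequences = [to_sequence(t, word_index) for t in texts]
print(sequences)  # [[1, 4, 2, 0], [1, 5, 2, 6], [0, 0, 3]]
```

An embedding layer then acts as a lookup table: row `i` of its weight matrix is the vector for word id `i`.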

We have a problem though: the sequences have different lengths. We solve this problem by padding each sequence on the left with the value 5000.
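
Here is a small NumPy sketch of left-padding, assuming 5000 is a filler id that sits just past the real word ids (an interpretation, not something the text spells out). The `pad_left` helper is hypothetical. In practice, Keras' `pad_sequences` with `padding="pre"` and a `value` argument does the same job.

```python
import numpy as np

def pad_left(sequences, max_len, pad_value):
    """Return a (n_sequences, max_len) array, filling from the left with pad_value."""
    padded = np.full((len(sequences), max_len), pad_value, dtype="int64")
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # an empty sequence stays all padding
        kept = seq[-max_len:]  # too-long sequences keep their last max_len ids
        padded[row, -len(kept):] = kept
    return padded

PAD_ID = 5000  # assumed filler id, one past the largest word id
sequences = [[1, 4, 2, 0], [0, 0, 3], [7]]
padded = pad_left(sequences, max_len=4, pad_value=PAD_ID)
print(padded)
# [[   1    4    2    0]
#  [5000    0    0    3]
#  [5000 5000 5000    7]]
```

With a dedicated padding id, the embedding layer needs one extra row, so its vocabulary size becomes the number of real words plus one.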

Let’s now define a simple CNN for text classification:
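
A possible sketch of such a network: an embedding layer, one 1D convolution, global max pooling and a sigmoid output. All sizes (`VOCAB_SIZE`, `MAX_SEQ_LEN`, `EMBEDDING_DIM`, the number of filters and the kernel size) are assumptions picked for illustration.

```python
from keras.models import Sequential
from keras.layers import Input, Embedding, Conv1D, GlobalMaxPooling1D, Dense

VOCAB_SIZE = 5001    # assumed: 5000 word ids plus one padding id
MAX_SEQ_LEN = 100    # assumed length of the padded sequences
EMBEDDING_DIM = 64   # assumed size of each word vector

model = Sequential([
    Input(shape=(MAX_SEQ_LEN,), dtype="int32"),
    # Turns each word id into a trainable EMBEDDING_DIM-sized vector.
    Embedding(VOCAB_SIZE, EMBEDDING_DIM),
    # Slides 64 filters over windows of 5 consecutive word vectors.
    Conv1D(filters=64, kernel_size=5, activation="relu"),
    # Keeps the strongest response of each filter over the whole text.
    GlobalMaxPooling1D(),
    Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()
```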

Training the model looks the same as before:

Let’s now transform the test data to sequences and pad them:

Here’s how to evaluate the model:

LSTM Network

Let’s build what’s probably the most popular type of model in NLP at the moment: the Long Short-Term Memory (LSTM) network. This architecture is specifically designed to work on sequence data. It is a good fit for many NLP tasks like tagging and text classification. It treats the text as a sequence rather than as a bag of words or as n-grams.

Here’s a possible model definition:
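
One way such a definition might look, as a sketch under assumed sizes (`VOCAB_SIZE`, `MAX_SEQ_LEN`, `EMBEDDING_DIM` and the 64 LSTM units are illustrative choices):

```python
from keras.models import Sequential
from keras.layers import Input, Embedding, LSTM, Dense

VOCAB_SIZE = 5001    # assumed: 5000 word ids plus one padding id
MAX_SEQ_LEN = 100    # assumed length of the padded sequences
EMBEDDING_DIM = 64   # assumed size of each word vector

model = Sequential([
    Input(shape=(MAX_SEQ_LEN,), dtype="int32"),
    Embedding(VOCAB_SIZE, EMBEDDING_DIM),
    # Reads the word vectors in order and returns its final hidden state.
    LSTM(64),
    Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()
```

Compared with the convolutional model, only the middle of the network changes: the `Conv1D` and pooling pair is replaced by a single `LSTM` layer.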

Training is similar:

Here’s the evaluation phase and results:

In the next 2 sections, we’re going to explore transfer learning, a method for reducing the number of parameters we need to train for a network.

Transfer Learning with spaCy embeddings

Notice how in the previous two examples, we used an Embedding layer. That layer had to be trained, adding to the number of parameters that need to be trained. What if we used some precomputed embeddings? We can certainly do this. Say we trained a Word2Vec model on our corpus and then reused those embeddings for the various other models we need to train. In this tutorial, we’ll first use the spaCy embeddings. Here’s how to do that:
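
The core of this step is copying precomputed vectors into a matrix whose row `i` holds the vector for word id `i`. The sketch below is library-agnostic: `vector_lookup` is a hypothetical dict with made-up values. With spaCy, it would be backed by the loaded model's vectors (for example `nlp.vocab[word].vector`), which assumes a model that ships with word vectors.

```python
import numpy as np

def build_embedding_matrix(word_index, vector_lookup, dim):
    """Row i holds the pretrained vector of the word with id i (zeros if unknown)."""
    matrix = np.zeros((len(word_index), dim), dtype="float32")
    for word, idx in word_index.items():
        vector = vector_lookup.get(word)
        if vector is not None:
            matrix[idx] = vector
    return matrix

# Hypothetical word ids and pretrained vectors, for illustration only.
word_index = {"great": 0, "plot": 1, "zzyzx": 2}
vector_lookup = {
    "great": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "plot": np.array([-0.4, 0.5, 0.6], dtype="float32"),
}

embedding_matrix = build_embedding_matrix(word_index, vector_lookup, dim=3)
print(embedding_matrix)
```

Words without a pretrained vector keep an all-zero row here; other policies, such as small random vectors, are equally possible.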

Next, we’ll define the same network, just like before, but using a pretrained Embedding layer:
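
A sketch of what a frozen, pretrained Embedding layer could look like. The matrix here is random and only stands in for real pretrained vectors, and all sizes are assumptions. The weights are injected through a `Constant` initializer, and `trainable=False` keeps them fixed during training.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Input, Embedding, LSTM, Dense
from keras.initializers import Constant

VOCAB_SIZE = 5001    # assumed: 5000 word ids plus one padding id
EMBEDDING_DIM = 50   # assumed dimensionality of the pretrained vectors
MAX_SEQ_LEN = 100    # assumed length of the padded sequences

# Random stand-in for a real pretrained embedding matrix (hypothetical data).
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(VOCAB_SIZE, EMBEDDING_DIM)).astype("float32")

embedding_layer = Embedding(
    VOCAB_SIZE,
    EMBEDDING_DIM,
    embeddings_initializer=Constant(embedding_matrix),
    trainable=False,  # keep the pretrained vectors fixed
)

model = Sequential([
    Input(shape=(MAX_SEQ_LEN,), dtype="int32"),
    embedding_layer,
    LSTM(64),
    Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

trainable = sum(int(np.prod(tuple(w.shape))) for w in model.trainable_weights)
print("trainable parameters:", trainable, "of", model.count_params())
```

Freezing the layer removes the whole embedding matrix from the set of trainable parameters, which is where the reduction in parameters to train comes from.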

Here’s how that performs:

Transfer learning with GloVe embeddings

In this section we’re going to do the same, but with smaller GloVe embeddings.
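
GloVe vectors are distributed as plain text, one word per line followed by its space-separated vector components. A minimal sketch of a loader follows. The `load_glove` helper is hypothetical, and the two sample lines are made-up 3-dimensional vectors rather than real GloVe values.

```python
import io
import numpy as np

def load_glove(lines):
    """Parse GloVe-formatted lines into a {word: vector} dict."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors

# In-memory sample in the GloVe text format (made-up values).
sample = io.StringIO("great 0.1 0.2 0.3\nplot -0.4 0.5 0.6\n")
glove = load_glove(sample)
print(sorted(glove), glove["great"].shape)

# With a downloaded file, the call would look like this (the path is an assumption):
# with open("glove.6B.50d.txt", encoding="utf8") as f:
#     glove = load_glove(f)
```

The resulting dict can be used to fill an embedding matrix, one row per word id.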

Let’s try this model out: