Introduction to Deep Learning – Sentiment Analysis

Deep Learning is one of those hyper-hyped subjects that everybody talks about and everybody claims they’re doing. In some cases, startups just need to mention they use Deep Learning to instantly get appreciation. Deep Learning is indeed a powerful technology, but it’s not an answer to every problem. It’s also not the magic many people make it out to be.

In this post, we’ll give a gentle introduction to the subject. You’ll learn what a Neural Network is, how to train it and how to represent text features (in two ways). For this purpose, we’ll be using the IMDB dataset, which contains around 25,000 sentiment-annotated reviews. Deep Learning models usually require a lot of data to train properly. If you have little data, maybe Deep Learning is not the solution to your problem. In this case, the amount of data is a good compromise: it’s enough to train some toy models, and we don’t need to spend days waiting for training to finish or use a GPU.

You can get the dataset from here: Kaggle IMDB Movie Reviews Dataset

Let’s quickly explore the IMDB dataset:
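
Here’s a minimal exploration sketch, assuming the Kaggle file labeledTrainData.tsv with id, sentiment and review columns (adjust the file name and column names to match your download):

```python
import pandas as pd

# Assumed file name and columns from the Kaggle IMDB download
df = pd.read_csv("labeledTrainData.tsv", sep="\t")

print(df.shape)                         # around 25,000 labelled reviews
print(df["sentiment"].value_counts())   # distribution of the 0/1 sentiment labels
print(df["review"].iloc[0][:300])       # peek at the first review

# The reviews contain HTML leftovers such as <br /> tags; strip them out
df["review"] = df["review"].str.replace("<br />", " ")
```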

Notice that the reviews had some <br /> tags, which we removed.

Representing the features as a BOW

Now, you might remember the Bag-Of-Words (BOW) model of representing features from previous posts on this blog, where you can have a quick read about it.

Basically, with BOW, we need to compute the vocabulary (all possible words); a text is then represented by a vector that has a 1 (or the number of appearances) at the indices of the words present in the text and a 0 at all the other indices.

Quick Example:
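
A possible toy implementation (the function names here are just for illustration):

```python
# A deliberately simple, unoptimized bag-of-words transformer
def build_vocabulary(texts):
    words = {word for text in texts for word in text.lower().split()}
    return {word: index for index, word in enumerate(sorted(words))}

def bow_vector(text, vocabulary):
    vector = [0] * len(vocabulary)
    for word in text.lower().split():
        if word in vocabulary:
            vector[vocabulary[word]] += 1   # count the appearances
    return vector

texts = ["the movie was great", "the movie was awful"]
vocabulary = build_vocabulary(texts)
print(vocabulary)
print(bow_vector("the movie was great great", vocabulary))
```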

This is a very simplified and unoptimized BOW transformer, but this is essentially the algorithm. Throughout this blog we’ve used Scikit-Learn, so you might be familiar with its vectorizers, which do exactly this: transform a text to its BOW representation. Here’s how that goes:
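
Here’s a quick sketch with CountVectorizer:

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(["the movie was great", "the movie was awful"])

print(vectorizer.get_feature_names_out())   # the computed vocabulary
print(X.toarray())                          # one BOW vector per text
```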

Build a Logistic Regression model

On this blog, we also touched on LogisticRegression in the Classification Performance Metrics post. Logistic Regression is a classification algorithm that is really simple, yet very useful and performant. I use it as a baseline in almost every project I do. Logistic Regression is also the simplest Neural Network you can build.

Here’s a really quick explanation of how Logistic Regression works:

  • For each (numeric) feature, the LogisticRegression model has an associated weight
  • When classifying a feature vector, we multiply the features by their weights (w * x). We apply a non-linear function to this value that maps the result into the [0, 1] interval.
  • If the resulting value is in the [0, 0.5) range then the data point belongs to the first class; if it is in the [0.5, 1] range then it belongs to the second class (see the sketch after this list).
  • The tricky part is figuring out the weights of the model. We do this using the Gradient Descent method. This is called training the model. We optimize the weights so that the number of misclassified points is as small as possible.
  • Gradient Descent is an iterative algorithm, not a formula we apply once. It is an exploratory process that tries to find the best-suited values. There are other training methods, but Gradient Descent and its variants are what’s used in practice.
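
Here’s a rough sketch of the prediction step described above (the weights and features are made-up numbers, just to illustrate the mechanics):

```python
import numpy as np

def sigmoid(z):
    # Non-linear function that maps any value into the (0, 1) interval
    return 1.0 / (1.0 + np.exp(-z))

weights = np.array([0.8, -1.2, 0.3])    # one weight per feature
features = np.array([1.0, 0.5, 2.0])    # a single data point

score = sigmoid(np.dot(weights, features))
predicted_class = 0 if score < 0.5 else 1
print(score, predicted_class)
```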

Let’s train a LogisticRegression model for our sentiment dataset:
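
A sketch of how this could look, assuming the labeledTrainData.tsv file from above (the max_features cap is just to keep the vocabulary manageable):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("labeledTrainData.tsv", sep="\t")
df["review"] = df["review"].str.replace("<br />", " ")

vectorizer = CountVectorizer(max_features=5000)   # cap the vocabulary size
X = vectorizer.fit_transform(df["review"])
y = df["sentiment"].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = LogisticRegression()
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```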

You will get slightly different scores, and that’s normal. That’s due to the fact that the train_test_split function also shuffles the data. This means you’ll be training your model on different data than mine.

Build a Scikit-Learn NeuralNetwork model

Going from training a LogisticRegression model to training a NeuralNetwork is easy peasy with Scikit-Learn. Here’s how to do it:
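
Something along these lines, reusing the features from the LogisticRegression example (the hidden layer size matches the 100 units discussed below):

```python
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

model = MLPClassifier(hidden_layer_sizes=(100,))
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```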

Notice the changes made: we used the MLPClassifier instead of LogisticRegression.

Let’s talk about the hidden_layer_sizes parameter. A neural network consists of layers. Every neural network has an input layer (size equal to the number of features) and an output layer (size equal to the number of classes). Between these two layers, there can be a number of hidden layers. The sizes of the hidden layers are a parameter. In this case we’ve only used a single hidden layer. Layers are composed of hidden units (or neurons). Each hidden unit is basically a LogisticRegression unit (with some notable differences, but close enough). This means that there are 100 LogisticRegression units doing their own thing.

The training of a neural network is done via BackPropagation which is a form of propagating the errors from the output layer all the way to the input layer and adjusting the weights incrementally.

Building a NeuralNetwork from scratch with NumPy

In this section, we’ll code a neural network from the ground up. This will be a toy implementation. We just want to understand what’s happening inside. Understanding these model details is pretty crucial for deep learning. When training a NaiveBayes or a RandomForest you might not need to adjust any parameters. This is not the case for neural networks. You’ll need to tweak the parameters for every problem you’re trying to solve. Here’s what a Neural Network looks like:

Neural Network model

This is how a neural network is described most of the time. A Neural Network operates in 2 modes:

  • Forward – Prediction mode – Get the inputs, multiply by weights, apply activation functions and produce outputs
  • Backward – Learning mode – For a known data point, get the network output (via Forward mode), compare to the expected output and propagate the error back, from the output layer towards the input layer. This process is called Backpropagation.

I find it pretty hard to understand how Neural Networks make predictions using this representation, because it makes you focus more on the links between the neurons than on the neurons themselves. Here’s a simpler way to look at it. Each layer processes its input and computes an output according to this formula:

output = f(weights * input)

f is a non-linear function called the activation function. For this function, we typically choose between the sigmoid, the hyperbolic tangent and the rectified linear unit (ReLU). We’ll touch on these a bit later. Using the formula above, we can write the formula of the network shown above like this:
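
Y = f_3(W_3 * f_2(W_2 * f_1(W_1 * X + b_1) + b_2) + b_3)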

Training this neural network simply means optimizing W_1, W_2, W_3 (the weights) and b_1, b_2, b_3 (the biases) such that Y is as close to the expected output as possible. Let’s note that:

  • the W_1, b_1 and f_1 correspond to the first hidden layer
  • the W_2, b_2 and f_2 correspond to the second hidden layer
  • the W_3, b_3 and f_3 correspond to the output layer

Getting back to the activation function: the purpose of this activation function is to introduce non-linearities in the mix. LogisticRegression only knows how to discriminate between linearly-separable classes. This means it can only draw a straight line between the points of 2 classes, like this:

Linearly Separable Classes

By using non-linearities we can make this boundary bendy so that it can accommodate cases like this:

Non Linearly Separable Classes

One of the most popular activation functions is the sigmoid. The sigmoid function squeezes its input into the (0, 1) interval. Here’s how the sigmoid function can be implemented:
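
For example, with NumPy:

```python
import numpy as np

def sigmoid(z):
    # Squeeze any real value into the (0, 1) interval
    return 1.0 / (1.0 + np.exp(-z))
```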

Let’s write a SimpleNeuralNetwork class in the Scikit-Learn style, since we’re very used to it. I named the class SimpleNeuralNetwork since we’re only going to work with a single hidden layer for now. The main reason behind this choice is simplicity: the goal is an implementation that is easy to understand and easy to follow. At first, let’s also skip the training process: we’re going to initialize the weights and biases with random numbers and write the prediction method, to make sure we understand this step. With neural networks, the prediction stage is way simpler than the training stage.
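
Here’s a minimal sketch of what such a class could look like (the constructor parameters and helper names are illustrative, not the exact original code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class SimpleNeuralNetwork:
    """Toy single-hidden-layer network, written in the Scikit-Learn style."""

    def __init__(self, input_size, hidden_size, learning_rate=0.1, iterations=1000):
        self.learning_rate = learning_rate
        self.iterations = iterations
        # Random initialization: small random weights, zero biases
        self.W1 = np.random.randn(input_size, hidden_size) * 0.1
        self.b1 = np.zeros(hidden_size)
        self.W2 = np.random.randn(hidden_size, 1) * 0.1
        self.b2 = np.zeros(1)

    def _forward(self, X):
        # output = f(weights * input), applied layer by layer
        hidden = sigmoid(X.dot(self.W1) + self.b1)
        output = sigmoid(hidden.dot(self.W2) + self.b2)
        return hidden, output

    def predict_proba(self, X):
        return self._forward(X)[1].ravel()

    def predict(self, X):
        # Outputs of 0.5 or more belong to the positive class
        return (self.predict_proba(X) >= 0.5).astype(int)
```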

Let’s see how our neural network performs on our sentiment analysis task:
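
Something like this, reusing the BOW features from the LogisticRegression example:

```python
from sklearn.metrics import accuracy_score

network = SimpleNeuralNetwork(input_size=X_train.shape[1], hidden_size=100)
print(accuracy_score(y_test, network.predict(X_test)))   # around 0.5, coin-flip level
```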

As you might expect, the performance is rather poor and that is because we haven’t trained anything. We initialized the matrices, we are able to make predictions, but we haven’t actually wrangled the matrices so that we maximize the classifier’s performance. In fact, the performance of the classifier is as good as flipping a coin.

Let’s now talk about training. If you’re familiar with how LogisticRegression works, then you know what Gradient Descent is. The LogisticRegression classifier tries to minimize a cost function by adjusting the weights. The weights are iteratively adjusted bit by bit, going towards a point of minimum. Gradient Descent does this by going in the direction of the steepest slope. There are a lot of tutorials about GD out there. Make sure you understand it because it is one of the most fundamental algorithms in Data Science and probably the most used Machine Learning algorithm.

Training a Neural Network is pretty much the same in concept. We apply GD at the output layer and then we propagate the error backwards towards the input layer. This process is called Backpropagation.
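
Building on the sketch above, here’s a possible fit method using full-batch gradient descent and backpropagation (a toy version, not an optimized implementation; it assumes the SimpleNeuralNetwork class and the sigmoid function defined earlier):

```python
import numpy as np

def fit(self, X, y):
    y = np.asarray(y).reshape(-1, 1)
    n_samples = X.shape[0]
    for _ in range(self.iterations):
        # Forward pass
        hidden = sigmoid(X.dot(self.W1) + self.b1)
        output = sigmoid(hidden.dot(self.W2) + self.b2)

        # Backward pass: propagate the error from the output layer towards
        # the input layer (the derivative of the sigmoid is s * (1 - s))
        output_error = (output - y) * output * (1 - output)
        hidden_error = output_error.dot(self.W2.T) * hidden * (1 - hidden)

        # Gradient descent step on every weight and bias
        self.W2 -= self.learning_rate * hidden.T.dot(output_error) / n_samples
        self.b2 -= self.learning_rate * output_error.mean(axis=0)
        self.W1 -= self.learning_rate * X.T.dot(hidden_error) / n_samples
        self.b1 -= self.learning_rate * hidden_error.mean(axis=0)
    return self

SimpleNeuralNetwork.fit = fit
```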

Let’s test our neural network again:
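
For instance, with a deliberately (too) large learning rate; the exact values in the original experiment may differ:

```python
network = SimpleNeuralNetwork(input_size=X_train.shape[1], hidden_size=100,
                              learning_rate=100.0, iterations=100)
network.fit(X_train, y_train)
print(accuracy_score(y_test, network.predict(X_test)))
```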

Well, something isn’t right. We get a performance as bad as the untrained model’s. This is an important lesson: neural networks are very sensitive to their parameters. The main culprit here is the learning_rate parameter. It is set to a value that is way too large, so the model is unable to slide towards the minimum of the objective function. Let’s try it once again, this time with a more appropriate value:
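
For example (again, the exact values are illustrative):

```python
network = SimpleNeuralNetwork(input_size=X_train.shape[1], hidden_size=100,
                              learning_rate=0.1, iterations=500)
network.fit(X_train, y_train)
print(accuracy_score(y_test, network.predict(X_test)))
```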

Now that’s much better. Notice how smooth the training process was. Let’s take it for a spin on some reviews:
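
Something like this, reusing the vectorizer fitted earlier:

```python
reviews = ["This movie was absolutely wonderful, I loved every minute of it",
           "What a waste of time, the plot was dull and the acting even worse"]
features = vectorizer.transform(reviews)
print(network.predict(features))   # ideally 1 for the positive review, 0 for the negative one
```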

Let’s quickly mention some other elements of Deep Learning.

  • Notice how we’re feeding an entire dataset to our neural network for prediction, rather than sample by sample. This is of course way more efficient.
  • We’re training our network using the entire dataset. This is not ideal, since a typical Deep Learning dataset can get really huge. There is a solution to this, called Stochastic Gradient Descent: it implies splitting the training set into chunks and training the NN chunk by chunk. More on this in a future post.
  • In this case, since our output is binary (+/-), we only needed a single output neuron. Obviously, NNs are useful for multiclass classification as well. To achieve this, we need one output neuron for each class; the output neuron with the highest signal is the classification result. This type of label encoding is called One-Hot Encoding (see the sketch after this list for how to transform a set of labels to the one-hot format).
  • In order for the NN to output probabilities in the multiclass case, we need a function that transforms the output activations into probabilities. This function is called softmax; an implementation is also included in the sketch after this list.
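
Here are possible sketches for both (LabelBinarizer is one convenient way to one-hot encode labels):

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# One-hot encoding: one column (output neuron) per class
labels = ["positive", "negative", "neutral", "positive"]
print(LabelBinarizer().fit_transform(labels))

def softmax(z):
    # Turn a vector of output activations into a probability distribution
    exps = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return exps / exps.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))
```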

Using word embeddings with averaging

You might remember word embeddings from the spaCy Tutorial. We can use them to learn another simple yet neat trick for text classification: we transform all the words from a text into their vectors and compute their mean. Hopefully, this mean will give us enough information about the sentiment of the text. For this, we just need to write a different vectorizer. We’ll be using the same NN we’ve already coded:
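
A possible vectorizer, assuming a spaCy model that ships with word vectors (for example en_core_web_md) is installed:

```python
import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")

def embedding_vectorizer(texts):
    # Each text becomes the mean of its token vectors
    return np.array([np.mean([token.vector for token in nlp(text)], axis=0)
                     for text in texts])
```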

Here’s how to train and test the network:
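
Roughly like this (the hidden layer size, learning rate and iteration count are illustrative, and running all reviews through spaCy can take a while):

```python
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X = embedding_vectorizer(df["review"])
y = df["sentiment"].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

network = SimpleNeuralNetwork(input_size=X_train.shape[1], hidden_size=50,
                              learning_rate=0.1, iterations=2000)
network.fit(X_train, y_train)
print(accuracy_score(y_test, network.predict(X_test)))
```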

Notice the parameter adjustments we’ve made. Our network works rather well on embeddings. We’ll be using embeddings more in future tutorials. Keep this trick in mind; it might come in handy.

Conclusions

  1. In this tutorial, we’ve started from LogisticRegression and made our way towards Deep Learning by building our own simple neural network
  2. We learned, without going into much detail, how LogisticRegression and Neural Networks are trained
  3. We’ve coded our own neural network and put it to work in 2 scenarios: using the bag of words model and using word embeddings
  4. We mentioned the next steps needed in our journey towards learning about Deep Learning:
    • Use One Hot Encoding for multiclass classification
    • Use Stochastic Gradient Descent for large datasets
  5. We witnessed how important parameter adjustment is for training a Neural Network