# Introduction to Deep Learning – Sentiment Analysis

Deep Learning is one of those hyper-hyped subjects that everybody is talking about and everybody claims they’re doing. In certain cases, startups just need to mention they use Deep Learning and they instantly get appreciation. Deep Learning is indeed a powerful technology, but it’s not an answer to every problem. It’s also not *magic*, despite how many people make it look.

In this post, we’ll be doing a gentle introduction to the subject. You’ll learn what a Neural Network is, how to train it and how to represent text features (in 2 ways). For this purpose, we’ll be using the IMDB dataset. It contains around 25,000 sentiment-annotated reviews. Deep Learning models usually require a lot of data to train properly. If you have little data, maybe Deep Learning is not the solution to your problem. In this case, the amount of data is a good compromise: it’s enough to train some toy models and we don’t need to spend days waiting for the training to finish or use a GPU.

You can get the dataset from here: Kaggle IMDB Movie Reviews Dataset

Let’s quickly explore the IMDB dataset:

```python
import re
import pandas as pd


def clean_review(text):
    # Strip HTML tags
    text = re.sub('<[^<]+?>', ' ', text)

    # Strip escaped quotes
    text = text.replace('\\"', '')

    # Strip quotes
    text = text.replace('"', '')

    return text


df = pd.read_csv('labeledTrainData.tsv', sep='\t', quoting=3)

# Create a cleaned_review column
df['cleaned_review'] = df['review'].apply(clean_review)

# Check out how the cleaned review compares to the original one
print(df['review'][0])
print("\n\n")
print(df['cleaned_review'][0])
```

Notice that the reviews had some `<br />` tags, which we removed.

## Representing the features as a BOW

Now, you might remember the Bag-Of-Words (BOW) model of representing features from earlier posts on this blog.

Basically, with BOW, we need to compute the vocabulary (all possible words) and then a text is represented by a vector having `1` (or the number of appearances) for the present words in the text and `0` for all the other indices.

Quick Example:

```python
VOCABULARY = ['dog', 'cheese', 'cat', 'mouse']
TEXT = 'the mouse ate the cheese'


def to_bow(text):
    words = text.split(" ")
    return [1 if w in words else 0 for w in VOCABULARY]


print(to_bow(TEXT))  # [0, 1, 0, 1]
```

This is a very simplified and not optimized BOW transformer, but this is essentially the algorithm. Throughout this blog we’ve used Scikit Learn and you might be familiar with the vectorizers, which do exactly this: transform a text to its BOW representation. Here’s how that goes:

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(vocabulary=VOCABULARY)

vectorizer.transform([TEXT]).todense()  # matrix([[0, 1, 0, 1]])
```
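The BOW description above also mentions using the number of appearances instead of a plain `1`. Here’s a minimal sketch of that counting variant in plain Python. The helper name `to_bow_counts` and the longer example sentence are made up for illustration; they are not part of the dataset or of Scikit-Learn.

```python
from collections import Counter

# Hypothetical toy vocabulary and text, chosen only for illustration
VOCABULARY = ['dog', 'cheese', 'cat', 'mouse']
TEXT = 'the mouse ate the cheese and the other mouse watched'


def to_bow_counts(text, vocabulary):
    """Return how many times each vocabulary word appears in the text."""
    counts = Counter(text.lower().split())
    # A Counter returns 0 for words it has never seen
    return [counts[word] for word in vocabulary]


print(to_bow_counts(TEXT, VOCABULARY))  # [0, 1, 0, 2]
```

The word `mouse` appears twice, so its slot holds `2` rather than `1`.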

## Build a Logistic Regression model

On this blog, we also touched on `LogisticRegression` in the Classification Performance Metrics post. Logistic Regression is a classification algorithm that is really simple yet very useful and performant. I use it as a baseline in almost every project I do. Logistic Regression is also the simplest Neural Network you can build.

Here’s a **really** quick explanation of how Logistic Regression works:

- For each feature (numeric feature) the `LogisticRegression` has an associated weight.
- When classifying a feature vector, we multiply the features with their weights (`w * x`) and sum the results. We apply a non-linear function to this value that maps the result into the `[0, 1]` space.
- If the resulting value is in the `[0, 0.5)` range then the data point belongs to the first class. If it is in the `(0.5, 1]` range then the data point belongs to the second class.
- The tricky part is figuring out the weights of the model. We do this using the **Gradient Descent** method. This is called training the model. We optimize the weights so that the model misclassifies as few training points as possible. **Gradient Descent** is an iterative algorithm. It’s not a *formula we need to apply*. It is an exploratory process that tries to find the best-suited values. There are other training methods, but Gradient Descent and its variants are what is used in practice.
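The steps above can be sketched in a few lines of NumPy. This is only a toy illustration under my own assumptions: the function names, the log-loss gradient, the learning rate and the four data points are all made up for the example, and Scikit-Learn’s `LogisticRegression` uses more sophisticated solvers than this plain loop.

```python
import numpy as np


def sigmoid(z):
    """Map any real number into the (0, 1) interval."""
    return 1.0 / (1.0 + np.exp(-z))


def train_logistic_regression(X, y, learning_rate=0.5, n_steps=200):
    """Plain batch gradient descent on the log-loss (illustrative only)."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)  # one weight per feature
    b = 0.0                   # bias term
    for _ in range(n_steps):
        p = sigmoid(X @ w + b)              # predicted probabilities
        grad_w = X.T @ (p - y) / n_samples  # gradient w.r.t. the weights
        grad_b = np.mean(p - y)             # gradient w.r.t. the bias
        w -= learning_rate * grad_w         # step against the gradient
        b -= learning_rate * grad_b
    return w, b


def predict(X, w, b):
    """Class 1 if the probability is at least 0.5, else class 0."""
    return (sigmoid(X @ w + b) >= 0.5).astype(int)


# Hypothetical toy data: two features, two linearly separable classes
X_toy = np.array([[-2.0, -1.0], [-1.0, -1.5], [1.0, 1.5], [2.0, 1.0]])
y_toy = np.array([0, 0, 1, 1])

w, b = train_logistic_regression(X_toy, y_toy)
print(predict(X_toy, w, b))  # [0 0 1 1]
```

Each pass nudges the weights a little in the direction that reduces the error, which is the "iterative, exploratory" behaviour described in the last bullet.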

Let’s train a `LogisticRegression` model for our sentiment dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Shuffle the data and then split it, keeping 20% aside for testing
X_train, X_test, y_train, y_test = train_test_split(df['cleaned_review'], df['sentiment'], test_size=0.2)

vectorizer = CountVectorizer(lowercase=True)
vectorizer.fit(X_train)

classifier = LogisticRegression()
classifier.fit(vectorizer.transform(X_train), y_train)

print("Score:", classifier.score(vectorizer.transform(X_test), y_test))  # Score: 0.8778
```

You will get slightly different scores, and that’s normal. That’s due to the fact that the `train_test_split` function also shuffles the data. This means you’ll be training your model on different data than mine.

## Build a Scikit-Learn NeuralNetwork model

Going from training a `LogisticRegression` model to training a `NeuralNetwork` is easy peasy with Scikit-Learn. Here’s how to do it:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# Shuffle the data and then split it, keeping 20% aside for testing
X_train, X_test, y_train, y_test = train_test_split(df['cleaned_review'], df['sentiment'], test_size=0.2)

vectorizer = CountVectorizer(lowercase=True)
vectorizer.fit(X_train)

classifier = MLPClassifier(hidden_layer_sizes=(100,))
classifier.fit(vectorizer.transform(X_train), y_train)

print("Score:", classifier.score(vectorizer.transform(X_test), y_test))  # Score: 0.8816
```

Notice the changes made: we used the `MLPClassifier` instead of `LogisticRegression`.

Let’s talk about the `hidden_layer_sizes` parameter. A neural network consists of layers. Every neural network has an input layer (size equal to the number of features) and an output layer (size equal to the number of classes). Between these two layers, there can be a number of hidden layers. The sizes of the hidden layers are a parameter. In this case we’ve only used a single hidden layer. Layers are composed of hidden units (or neurons). Each hidden unit is basically a LogisticRegression unit (with some notable differences, but close enough). This means that there are `100` LogisticRegression units *doing their own thing*.
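To get a feel for what the layer sizes mean in practice, here’s a small sketch that works out the shape of each weight matrix and the total number of trainable values. The helper names and the 5,000-word vocabulary are assumptions for illustration, not numbers from the IMDB run, and I’m assuming a single output unit for the binary decision.

```python
def layer_shapes(n_features, hidden_layer_sizes, n_outputs):
    """Return the (inputs, outputs) shape of each weight matrix."""
    sizes = [n_features, *hidden_layer_sizes, n_outputs]
    return [(sizes[i], sizes[i + 1]) for i in range(len(sizes) - 1)]


def count_parameters(shapes):
    """Weights plus one bias per unit in every non-input layer."""
    return sum(n_in * n_out + n_out for n_in, n_out in shapes)


# Hypothetical sizes: 5,000 vocabulary words, one hidden layer of 100 units,
# and a single output unit for a binary decision.
shapes = layer_shapes(5000, (100,), 1)
print(shapes)                    # [(5000, 100), (100, 1)]
print(count_parameters(shapes))  # 500201
```

Almost all of the parameters sit between the input and the hidden layer, which is why the vocabulary size matters so much for BOW models.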

The training of a neural network is done via **BackPropagation** which is a form of propagating the errors from the output layer all the way to the input layer and adjusting the weights incrementally.

## Building a NeuralNetwork from scratch with NumPy

In this section, we’ll code a neural network from the ground up. This will be a toy implementation. We just want to understand what’s happening inside. Understanding these model details is pretty crucial for deep learning. When training a *NaiveBayes* or a *RandomForest* you might not need to adjust any parameters. This is not the case for neural networks. You’ll need to tweak the parameters for every problem you’re trying to solve. Here’s what a Neural Network looks like:

This is how a neural network is described most of the time. A Neural Network functions in 2 ways:

- **Forward** – Prediction mode – Get the inputs, multiply by weights, apply activation functions and produce outputs
- **Backward** – Learning mode – For a known data point, get the network output (via Forward mode), compare to the expected output and propagate the error back, from the output layer towards the input layer. This process is called **Backpropagation**.

I find it pretty hard to understand how Neural Networks make predictions using this representation. This representation makes you focus more on the links between the neurons rather than the neurons themselves. Here’s a simpler way to look at it. Each layer processes its input and computes an output according to this formula:

`output = f(weights * input)`

`f` is a non-linear function called *the activation* function. For this function, we conveniently choose between the sigmoid, hyperbolic tangent or rectified linear unit. We’ll touch on these a bit later on. Using the formula above, we can write the formula of the network shown above like this:

```
Y = f_3(W_3 * f_2(W_2 * f_1(W_1 * x + b_1) + b_2) + b_3)
```

Training this neural network simply means optimizing `W_1, W_2, W_3` (the weights) and `b_1, b_2, b_3` (the biases) such that `Y` is as close to the expected output as possible. Let’s note that:

- the `W_1, b_1 and f_1` correspond to the first hidden layer
- the `W_2, b_2 and f_2` correspond to the second hidden layer
- the `W_3, b_3 and f_3` correspond to the output layer
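The nested formula is easier to digest when you evaluate it one layer at a time. Below is a minimal NumPy sketch of that forward pass. The layer sizes, the random weights and the choice of sigmoid for all three activations are assumptions made just for this example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical layer sizes: 4 inputs -> 5 hidden -> 3 hidden -> 1 output
n_in, n_h1, n_h2, n_out = 4, 5, 3, 1

W_1, b_1 = rng.normal(size=(n_h1, n_in)), np.zeros(n_h1)
W_2, b_2 = rng.normal(size=(n_h2, n_h1)), np.zeros(n_h2)
W_3, b_3 = rng.normal(size=(n_out, n_h2)), np.zeros(n_out)

# All three activations are sigmoids here, purely for simplicity
f_1 = f_2 = f_3 = sigmoid

x = rng.normal(size=n_in)  # a single input vector

# Y = f_3(W_3 * f_2(W_2 * f_1(W_1 * x + b_1) + b_2) + b_3)
h_1 = f_1(W_1 @ x + b_1)    # first hidden layer, shape (5,)
h_2 = f_2(W_2 @ h_1 + b_2)  # second hidden layer, shape (3,)
Y = f_3(W_3 @ h_2 + b_3)    # output layer, shape (1,)

print(h_1.shape, h_2.shape, Y.shape)
```

Each line feeds the previous layer’s output into the next layer, which is all the big formula is saying.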

Getting back to the activation function: the purpose of this activation function is to introduce non-linearities in the mix. LogisticRegression only knows how to discriminate between linearly-separable classes. This means it can only draw a straight line between the points of 2 classes, like this:

By using non-linearities we can make this boundary bendy so that it can accommodate cases like this:

One of the most popular activation functions is the sigmoid. The sigmoid function *squeezes* the input in the `[0, 1]` interval. Here’s how the sigmoid function can be implemented:

```python
import numpy as np


def sigmoid(X):
    return 1 / (1 + np.exp(-X))


print(sigmoid(np.array([1, 2, 1.1, 4])))
```
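For comparison, here’s a sketch of the other two activation functions mentioned earlier, the hyperbolic tangent and the rectified linear unit, next to the sigmoid. The function names `tanh` and `relu` are just my own labels for this example.

```python
import numpy as np


def sigmoid(x):
    """Squeezes values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def tanh(x):
    """Squeezes values into (-1, 1)."""
    return np.tanh(x)


def relu(x):
    """Keeps positive values, replaces negative ones with 0."""
    return np.maximum(0.0, x)


values = np.array([-2.0, 0.0, 2.0])
print(sigmoid(values))
print(tanh(values))
print(relu(values))  # [0. 0. 2.]
```

All three are non-linear, which is the property we actually need. They differ in output range and in how they treat negative inputs.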

Let’s write a `SimpleNeuralNetwork` class in the Scikit Learn style since we’re very used to it. I named the class `SimpleNeuralNetwork` since we’re only going to work with a single hidden layer for now. The main reason behind this choice is the simplicity and clarity of the implementation: the goal is code that’s simple to understand and simple to follow. At first, let’s also skip the training process. We’re going to init the weights with random numbers (we leave the biases out to keep things short) and write the prediction method to make sure we understand this step. With Neural Networks, the prediction stage is way simpler than training.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y, check_array
from sklearn.preprocessing import LabelBinarizer


class SimpleNeuralNetwork(BaseEstimator, ClassifierMixin):
    def __init__(self, hidden_layer_size=100, learning_rate=.1, epochs=1000, debug_print_epoch=10):
        assert hidden_layer_size > 0

        self.hidden_layer_size_ = hidden_layer_size
        self.learning_rate_ = learning_rate
        self.epochs_ = epochs
        self.debug_print_epoch_ = debug_print_epoch

    def fit(self, X, y):
        X, y = check_X_y(X, y, accept_sparse=True)  # Makes sure the X and y play nice
        self.classes_ = np.unique(y)
        n_classes = len(self.classes_)

        # In this particular case, we'll make sure the number of classes is 2
        assert n_classes == 2

        n_samples, n_features = X.shape

        self.binarizer_ = LabelBinarizer().fit(y)
        Y_binary = self.binarizer_.transform(y)

        # Compute the weight matrices sizes and init with small random values

        # Hidden Layer
        self.A1_ = np.random.randn(n_features, self.hidden_layer_size_)

        # Output Layer
        self.A2_ = np.random.randn(self.hidden_layer_size_, 1)

        # ~~ SKIP TRAINING FOR NOW ~~

    def predict_proba(self, X):
        """ Output probabilities for each sample"""
        # make sure X is of an accepted type
        X = check_array(X, accept_sparse='csr')

        # Apply linear function at the hidden layer
        Y_hidden = X.dot(self.A1_)

        # Apply sigmoid at the output layer
        Y_output = sigmoid(Y_hidden.dot(self.A2_))

        return np.hstack((1 - Y_output, Y_output))

    def predict(self, X):
        """ Output only the most likely class for each sample """
        scores = self.predict_proba(X)
        indices = scores.argmax(axis=1)
        return self.binarizer_.inverse_transform(indices)
```

Let’s see how our neural network performs on our sentiment analysis task:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Shuffle the data and then split it, keeping 20% aside for testing
X_train, X_test, y_train, y_test = train_test_split(df['cleaned_review'], df['sentiment'], test_size=0.2)

vectorizer = CountVectorizer(lowercase=True, binary=True)
vectorizer.fit(X_train)

classifier = SimpleNeuralNetwork(hidden_layer_size=100, epochs=500, learning_rate=0.1)
classifier.fit(vectorizer.transform(X_train), list(y_train.values))

print("Score:", classifier.score(vectorizer.transform(X_test), y_test))  # 0.5056
```

As you might expect, the performance is rather poor and that is because we haven’t trained anything. We initialized the matrices, we are able to make predictions, but we haven’t actually wrangled the matrices so that we maximize the classifier’s performance. In fact, the performance of the classifier is as good as flipping a coin.

Let’s now talk about training. If you’re familiar with how LogisticRegression works, then you know what Gradient Descent is. The LogisticRegression classifier tries to minimize a cost function by adjusting the weights. The weights are iteratively adjusted bit by bit, going towards a point of minimum. Gradient Descent does this by going in the direction of the steepest slope. There are a lot of tutorials about GD out there. Make sure you understand it because it is one of the most fundamental algorithms in Data Science and probably the most used Machine Learning algorithm.

Training a Neural Network is pretty much the same in concept. We apply GD at the output layer and then we propagate the error backwards towards the input layer. This process is called **Backpropagation**.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y, check_array
from sklearn.preprocessing import LabelBinarizer


class SimpleNeuralNetwork(BaseEstimator, ClassifierMixin):
    def __init__(self, hidden_layer_size=100, learning_rate=.1, epochs=1000, debug_print_epoch=10):
        assert hidden_layer_size > 0

        self.hidden_layer_size_ = hidden_layer_size
        self.learning_rate_ = learning_rate
        self.epochs_ = epochs
        self.debug_print_epoch_ = debug_print_epoch

    def fit(self, X, y):
        X, y = check_X_y(X, y, accept_sparse=True)  # Makes sure the X and y play nice
        self.classes_ = np.unique(y)
        n_classes = len(self.classes_)

        # In this particular case, we'll make sure the number of classes is 2
        assert n_classes == 2

        n_samples, n_features = X.shape

        self.binarizer_ = LabelBinarizer().fit(y)
        Y_binary = self.binarizer_.transform(y)

        # Compute the weight matrices sizes and init with small random values

        # Hidden Layer
        self.A1_ = np.random.randn(n_features, self.hidden_layer_size_)

        # Output Layer
        self.A2_ = np.random.randn(self.hidden_layer_size_, 1)

        # Training Process
        for epoch in range(self.epochs_):
            # Forward pass
            Y_hidden = X.dot(self.A1_)
            Y_output = sigmoid(Y_hidden.dot(self.A2_))

            # Error at the output layer, scaled by the sigmoid derivative
            error = Y_output - Y_binary
            d_A2 = error * Y_output * (1 - Y_output)

            # Error propagated back to the (linear) hidden layer
            hidden_error = d_A2.dot(self.A2_.T)
            d_A1 = hidden_error

            # Gradient Descent updates
            self.A1_ -= self.learning_rate_ * X.T.dot(d_A1)
            self.A2_ -= self.learning_rate_ * Y_hidden.T.dot(d_A2)

            if not epoch % self.debug_print_epoch_:
                score = self.score(X, y)
                print(f"Epoch={epoch} \t Score={score}")

    def predict_proba(self, X):
        """ Output probabilities for each sample"""
        # make sure X is of an accepted type
        X = check_array(X, accept_sparse='csr')

        # Apply linear function at the hidden layer
        Y_hidden = X.dot(self.A1_)

        # Apply sigmoid at the output layer
        Y_output = sigmoid(Y_hidden.dot(self.A2_))

        return np.hstack((1 - Y_output, Y_output))

    def predict(self, X):
        """ Output only the most likely class for each sample """
        scores = self.predict_proba(X)
        indices = scores.argmax(axis=1)
        return self.binarizer_.inverse_transform(indices)
```
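One way to read the updates in the training loop above is as plain gradient descent on a squared-error cost. The cost function isn’t spelled out anywhere in the text, so take this as one reading of the code rather than a definitive statement. Writing $H = X A_1$ for the (linear) hidden layer, $\hat{Y} = \sigma(H A_2)$ for the output and $\eta$ for the learning rate:

$$
\begin{aligned}
E &= \tfrac{1}{2} \sum_i (\hat{y}_i - y_i)^2 \\
\delta_2 &= (\hat{Y} - Y) \odot \hat{Y} \odot (1 - \hat{Y}) \\
\frac{\partial E}{\partial A_2} &= H^\top \delta_2 \\
\delta_1 &= \delta_2 A_2^\top \\
\frac{\partial E}{\partial A_1} &= X^\top \delta_1 \\
A_k &\leftarrow A_k - \eta \, \frac{\partial E}{\partial A_k}, \quad k \in \{1, 2\}
\end{aligned}
$$

Here $\odot$ is element-wise multiplication and $\hat{Y} \odot (1 - \hat{Y})$ is the derivative of the sigmoid. In the code, `d_A2` plays the role of $\delta_2$ and `hidden_error` the role of $\delta_1$. There is no activation derivative in $\delta_1$ because the hidden layer is linear.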

Let’s test our neural network again:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Reuse the X_train / X_test split from the previous run
vectorizer = CountVectorizer(lowercase=True, binary=True)
vectorizer.fit(X_train)

classifier = SimpleNeuralNetwork(hidden_layer_size=100, epochs=500, learning_rate=0.1)
classifier.fit(vectorizer.transform(X_train), list(y_train.values))

print("Score:", classifier.score(vectorizer.transform(X_test), y_test))  # Score: 0.502
```

Well, something isn’t right. We get a performance as bad as the untrained model. This is an important lesson. Neural networks are very sensitive to their parameters. The main culprit here is the `learning_rate` parameter. It is set to way too large a value, so Gradient Descent is unable to *slide* towards the minimum of the objective function. Let’s try it once again, this time with a more appropriate value:

```python
classifier = SimpleNeuralNetwork(hidden_layer_size=100, epochs=500, learning_rate=0.001)
classifier.fit(vectorizer.transform(X_train), list(y_train.values))

print("Score:", classifier.score(vectorizer.transform(X_test), y_test))  # Score: 0.795
```

Now that’s much better. Notice how smooth the training process was. Let’s take it for a spin on some reviews:

```python
classifier.predict(vectorizer.transform([
    "This was such a crappy movie. Hated it!",
    "Pure awesomeness. Best movie ever!!"
]))
# array([0, 1])
```
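Coming back to the learning rate for a moment: the effect of a value that is too large is easy to reproduce on a much smaller problem. The sketch below runs Gradient Descent on the toy function `f(x) = x**2` with two learning rates. The function, the starting point and both rates are made up for illustration and have nothing to do with the values used for the network.

```python
def gradient_descent(learning_rate, start=5.0, n_steps=50):
    """Minimize f(x) = x**2, whose gradient is 2 * x."""
    x = start
    for _ in range(n_steps):
        x -= learning_rate * 2 * x
    return x


small_step = gradient_descent(learning_rate=0.1)
large_step = gradient_descent(learning_rate=1.1)

print(f"learning_rate=0.1 -> x = {small_step:.6f}")  # slides towards 0
print(f"learning_rate=1.1 -> x = {large_step:.1f}")  # overshoots and blows up
```

With the small rate every step shrinks `x` towards the minimum at 0. With the large rate every step jumps over the minimum and lands further away than before.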

Let’s quickly mention some other elements of Deep Learning.

- Notice how we’re feeding an **entire dataset to our neural network to predict** rather than sample by sample. This is of course way more efficient.
- We’re training our network using the entire dataset. This is not ideal since a typical Deep Learning dataset can get really huge. There is a solution to this and it is called **Stochastic Gradient Descent**. This implies splitting the training set into chunks and training the NN on a chunk by chunk basis. More on this in a future post.
- In this case, since our output is binary (+/-) we needed a single output neuron. Obviously, NNs are useful for multiclass classification as well. To achieve this, we need to have 1 output neuron for each class. The output neuron with the highest signal is the classification result. This type of label encoding is called **One Hot Encoding**. Here’s how to transform a set of labels to the onehot format:

```python
import numpy as np

# Our 3 classes
CLASSES = [3, 1, 2]

# The dataset labels
LABELS = np.array([1, 2, 3, 1, 2, 1, 1, 2, 3])

ONEHOT = np.zeros((len(LABELS), len(CLASSES)))

for idx, value in enumerate(LABELS):
    # Put a 1 in the column that corresponds to this label's class
    ONEHOT[idx, CLASSES.index(value)] = 1

print(ONEHOT)
# [[0. 1. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]]
```

- In order for the NN to output probabilities in the multiclass case we need a function that transforms the output activations into probabilities. This function is called softmax, here’s how to implement it:

```python
import numpy as np


def softmax(X):
    """
    X = (N, D) Matrix
    N = Number of samples
    D = The number of features
    """
    exp_X = np.exp(X)                        # shape=(N, D)
    exp_X_sum = exp_X.sum(axis=1)            # shape=(N,)
    exp_X_sum = exp_X_sum.reshape((-1, 1))   # shape=(N, 1)
    return exp_X / exp_X_sum                 # shape=(N, D)


print(softmax([[0.3, 0.5, 0.1, 1.2], [1.4, 5.3, 1.5, 1.4], [1, 2, 4, 8]]))

# Notice how every row adds up to 1.0, like probabilities should
print(softmax([[0.3, 0.5, 0.1, 1.2], [1.4, 5.3, 1.5, 1.4], [1, 2, 4, 8]]).sum(axis=1))
```
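Going back to the Stochastic Gradient Descent remark in the list above, here’s a rough sketch of the "chunk by chunk" idea. It only shows how the data could be shuffled and split into chunks; the name `iterate_minibatches`, the batch size and the toy arrays are assumptions for illustration, and the actual weight update is left as a comment.

```python
import numpy as np


def iterate_minibatches(X, y, batch_size, rng):
    """Yield shuffled (X_chunk, y_chunk) pairs that cover the data once."""
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        chunk = order[start:start + batch_size]
        yield X[chunk], y[chunk]


rng = np.random.default_rng(42)

# Hypothetical toy data: 10 samples with 3 features each
X_toy = np.arange(30, dtype=float).reshape(10, 3)
y_toy = np.arange(10) % 2

batch_sizes = []
for X_chunk, y_chunk in iterate_minibatches(X_toy, y_toy, batch_size=4, rng=rng):
    # A real training loop would compute gradients on this chunk only
    # and update the weights here.
    batch_sizes.append(len(X_chunk))

print(batch_sizes)  # [4, 4, 2]
```

Each chunk is small enough to process quickly, and one full pass over all the chunks is usually called an epoch.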

## Using word embeddings with averaging

You might remember word embeddings from the spaCy Tutorial. We can use them in order to learn another simple yet neat trick for text classification. We can transform all the words from a text into their vectors and compute their mean. Hopefully, this mean will give us enough information about the sentiment of the text. For this, we just need to write a different vectorizer. We’ll be using the same NN we’ve already coded:

```python
import numpy as np
import spacy
from nltk import word_tokenize
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MinMaxScaler

nlp = spacy.load('en_core_web_lg')


def text_vector(text, tokenizer):
    # Average the 300-dimensional vectors of all the known words
    vector = np.zeros(300)

    words = tokenizer(text)
    total_words = 0
    for word in words:
        if word in nlp.vocab:
            vector += nlp.vocab[word].vector
            total_words += 1

    if total_words > 0:
        vector /= total_words

    return vector


class EmbeddingsVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def fit(self, X, y=None):
        # Nothing to learn, the embeddings are pre-trained
        return self

    def transform(self, raw_documents):
        return np.vstack([text_vector(text, self.tokenizer) for text in raw_documents])
```
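If you want to see what the averaging does without downloading a spaCy model, here’s a toy version of the same idea. The 3-dimensional vectors below are made up for illustration (real embeddings have hundreds of dimensions), and `mean_vector` is a hypothetical helper, not part of spaCy.

```python
import numpy as np

# Made-up 3-dimensional "embeddings", for illustration only
TOY_VECTORS = {
    'good': np.array([1.0, 0.0, 0.0]),
    'movie': np.array([0.0, 1.0, 0.0]),
    'bad': np.array([-1.0, 0.0, 0.0]),
}


def mean_vector(text, vectors, dim=3):
    """Average the vectors of the known words; unknown words are skipped."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


print(mean_vector('good movie', TOY_VECTORS).tolist())          # [0.5, 0.5, 0.0]
print(mean_vector('a really bad movie', TOY_VECTORS).tolist())  # [-0.5, 0.5, 0.0]
```

The two texts share the `movie` component but end up on opposite sides of the first axis, which is the kind of signal the classifier can pick up on.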

Here’s how to train and test the network:

```python
scaler = MinMaxScaler()
embeddings_vectorizer = EmbeddingsVectorizer(tokenizer=word_tokenize)

X_train_embedded = embeddings_vectorizer.fit_transform(X_train)
X_train_scaled = scaler.fit_transform(X_train_embedded)

classifier = SimpleNeuralNetwork(hidden_layer_size=100, epochs=5000, learning_rate=0.00001)
classifier.fit(X_train_scaled, y_train)

# Reuse the vectorizer and scaler fitted on the training data
X_test_embedded = embeddings_vectorizer.transform(X_test)
X_test_scaled = scaler.transform(X_test_embedded)

print("Score:", classifier.score(X_test_scaled, y_test))  # Score: 0.8106
```

Notice the parameter adjustments we’ve made. Our network working on embeddings works rather well. We’ll be using embeddings more in future tutorials. Keep this trick in mind, it might come in handy.

## Conclusions

- In this tutorial, we’ve started from LogisticRegression and made our way towards Deep Learning by building our own simple neural network
- We learned, without going much into details, how **LogisticRegression** and **Neural Networks** are trained
- We’ve coded our own neural network and put it to work in 2 scenarios: using the **bag of words model** and using **word embeddings**
- We mentioned the next steps needed in our journey towards learning about Deep Learning:
  - Use **One Hot Encoding** for multiclass classification
  - Use **Stochastic Gradient Descent** for large datasets
- We witnessed how important **parameter adjustment** is for training a Neural Network

Hey,

A nice one. This will give me a few days of trying to wrap my head around this subject and try to experiment with my own amateur models. You mentioned that you will be using word embeddings in the upcoming content. I wonder whether we could use word vectors in order to do some NER with DBpedia Spotlight?

You mean train a model (using word vectors as features) from data annotated with DBPedia Spotlight? Don’t see why not, we might explore that 🙂

Sure, something like that would definitely be interesting! Looking forward to some DBpedia-related action! 🙂

Hi,

What is the used cost function for back-propagation (GD) and what is its derivative ?