Getting Started with Sentiment Analysis

The most direct definition of the task is: “Does a text express a positive or negative sentiment?” Usually, we assign a polarity value to a text, typically in the [-1, 1] interval, where 1 is very positive and -1 very negative.

Why sentiment analysis is useful

Sentiment analysis can have a multitude of uses, some of the most prominent being:

  • Monitoring a brand’s or product’s presence online
  • Summarizing the reviews for a product
  • Triaging customer support messages

Why sentiment analysis is hard

There are a few problems that make sentiment analysis specifically hard:

1. Negations

Negations are a classic argument for why a bag-of-words model doesn’t work well for sentiment analysis. “I like the product” and “I do not like the product” should have opposite polarities, yet a classic machine learning approach would score these sentences almost identically, since they share nearly all of their words.
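One simple countermeasure is NLTK’s `mark_negation` heuristic, which appends `_NEG` to every token between a negation word and the next clause-ending punctuation, turning “like” and “like_NEG” into distinct features:

```python
from nltk.sentiment.util import mark_negation

# Tokens inside a negation's scope get a "_NEG" suffix
print(mark_negation("I do not like the product".split()))
# ['I', 'do', 'not', 'like_NEG', 'the_NEG', 'product_NEG']
```

We’ll run into `mark_negation` again when building the scikit-learn classifiers at the end.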

2. Metaphors, Irony, Jokes

Computers have a hard time understanding figurative language. Consider: “The best I can say about this product is that it was definitely interesting …”. Here, the word “interesting” plays a different role than its usual, positive one.

3. Multiple sentiments in the same text

A complex text can be segmented into different sections. Some sections can be positive, others negative. How do we aggregate the polarities?

“The phone’s design is the best I’ve seen so far, but the battery can definitely use some improvements.”

Here we can see the presence of two sentiments. Is the review a positive one or a negative one? Is having a not-so-great battery a deal breaker?

These are indeed complex problems, and the solutions aren’t simple at all. In fact, all of these issues are open problems in the field of Natural Language Processing.

For now, the best approach is to tune your algorithm to your problem as well as possible. If you are analyzing tweets, take emoticons into account. If you are studying political reviews, correlate the polarity with current events. In the case of the phone review, you could weigh the different properties of the phone according to a set of rules, perhaps combined with some domain-specific knowledge.

Available Corpora

There are a few resources that can come in handy when doing sentiment analysis.

In this tutorial, we’ll use the IMDB movie reviews corpus. It has enough samples to do some interesting analysis. Download it from here: IMDB movie reviews on Kaggle. The archive contains several files, including unlabeled and test data; we’re only interested in the labeled training data file. Unzip the archive somewhere convenient and let’s start.

Reading the data

In this corpus, the sentiment label is 0 for negative and 1 for positive. The reviews also contain some HTML tags, so remember to clean those up later. Let’s shuffle the data and split it into training and test sets.
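A minimal sketch of the loading step. The file name and the `review`/`sentiment` column names are assumptions about the Kaggle archive, so adjust them to match what you unzipped; the snippet uses a tiny in-memory stand-in frame so it runs on its own:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# In the real tutorial you'd load the unzipped corpus, roughly:
#   df = pd.read_csv("labeledTrainData.tsv", sep="\t", quoting=3)
# (file and column names are assumptions -- check against your download)
# Tiny stand-in so the sketch is self-contained:
df = pd.DataFrame({
    "review": ["Great movie!", "Terrible plot.", "Loved it.", "Boring..."],
    "sentiment": [1, 0, 1, 0],
})

# Shuffle and hold out 25% for testing
train_reviews, test_reviews, train_y, test_y = train_test_split(
    df["review"], df["sentiment"], test_size=0.25, random_state=42, shuffle=True)

print(len(train_reviews), len(test_reviews))  # 3 1
```

`train_test_split` shuffles by default; `random_state` just makes the split reproducible.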

Using SentiWordNet

One of the most straightforward approaches is to use SentiWordNet to compute a polarity for each word and average those values over the text. We’ll use this model as a baseline for the approaches that follow. It’s also a good excuse to get to know SentiWordNet and how to use it.

Let’s compute the accuracy of the SWN method

The SentiWordNet approach achieved only 0.6518 accuracy. If that figure sounds decent, keep in mind that for binary classification, 0.5 is the chance accuracy: if the test examples are evenly distributed between the two classes, flipping a coin yields about 0.5.
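To make the chance-level claim concrete, score a fixed “coin-flip” guess sequence against a balanced set of labels:

```python
from sklearn.metrics import accuracy_score

y_true = [0, 1] * 50          # perfectly balanced labels
y_guess = [0, 1, 1, 0] * 25   # arbitrary guesses, right half the time
print(accuracy_score(y_true, y_guess))  # 0.5
```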

NLTK SentimentAnalyzer

NLTK has some neat built-in utilities for doing sentiment analysis. I wouldn’t call them “industry ready”, but they are definitely useful for didactic purposes. Let’s check out SentimentAnalyzer.
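A sketch of the SentimentAnalyzer workflow on a toy dataset; in the real run, the `(word_list, label)` pairs come from the tokenized movie reviews and the feature cutoff is much higher:

```python
from nltk.classify import NaiveBayesClassifier
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import extract_unigram_feats

# Toy stand-in for the tokenized reviews: (word_list, label) pairs
train_docs = [(["great", "movie"], 1), (["awful", "plot"], 0),
              (["great", "acting"], 1), (["awful", "pacing"], 0)]
test_docs = [(["great", "film"], 1), (["awful", "film"], 0)]

analyzer = SentimentAnalyzer()
# Keep unigrams seen more than once as features (min_freq=1)
unigram_feats = analyzer.unigram_word_feats(analyzer.all_words(train_docs), min_freq=1)
analyzer.add_feat_extractor(extract_unigram_feats, unigrams=unigram_feats)

training_set = analyzer.apply_features(train_docs)
test_set = analyzer.apply_features(test_docs)
analyzer.train(NaiveBayesClassifier.train, training_set)
results = analyzer.evaluate(test_set)
print(results["Accuracy"])
```

`evaluate` returns a dict of metrics (accuracy, per-label precision and recall), which is where the figure below comes from.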

I’ve obtained a 0.8064 accuracy using this method (using only the first 5000 training samples; training an NLTK NaiveBayesClassifier takes a while). Not quite happy yet.

NLTK VADER Sentiment Intensity Analyzer

NLTK also contains the VADER (Valence Aware Dictionary and sEntiment Reasoner) Sentiment Analyzer.

It is a lexicon and rule-based sentiment analysis tool specifically created for working with messy social media texts. Let’s see how well it works for our movie reviews.

Pretty disappointing: 0.6892. I know for a fact that VADER works well on other types of text; it’s just not a good fit for our problem. Keep this tool in mind for your projects. Let’s tie things up and build a proper classifier with scikit-learn.

Building a binary classifier with scikit-learn

For our last experiment, we’re going to play with an SVM model from scikit-learn. We’ve already played with text classification in the Text Classification Recipe, so make sure you brush up on that task.

One important addition is the use of bigrams. Bigrams are pairs of consecutive words; in general, N-grams are tuples of N consecutive words. Here’s what I mean:
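NLTK’s `bigrams` helper makes the idea concrete:

```python
from nltk import bigrams

words = "the movie was great".split()
print(list(bigrams(words)))
# [('the', 'movie'), ('movie', 'was'), ('was', 'great')]
```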

Using bigrams instead of unigrams (i.e., single words) is a common trick for improving performance in text classification.

By using bigrams, we preserve “more context” for the words in the text.

We’re going to build:

  • Unigram classifier (with mark_negation and without)
  • Bigram classifier (with mark_negation and without)
  • Unigram and bigram classifier (with mark_negation and without)

1. Unigram classifier
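A sketch of the unigram pipeline; the toy corpus just shows the mechanics, and in the tutorial you’d fit on the training split and score on the test split. For the `mark_negation` variant, pass `tokenizer=lambda text: mark_negation(text.split())` to the vectorizer:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# ngram_range=(1, 1): unigram (single-word) features only
unigram_clf = Pipeline([
    ("vectorizer", CountVectorizer(ngram_range=(1, 1))),
    ("classifier", LinearSVC()),
])

# Tiny toy corpus instead of the real training split, to keep the sketch runnable
unigram_clf.fit(["great movie", "awful plot", "great acting", "awful pacing"],
                [1, 0, 1, 0])
print(unigram_clf.predict(["great film"]))  # [1]
```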

2. Bigram classifier

3. Unigram and Bigram classifier

Just by changing the ngram_range to (1, 2), we obtain the combined unigram-and-bigram model.

Strangely enough, using mark_negation actually lowered the accuracy of our classifiers in all cases. The classifier using unigrams and bigrams won the contest with a 0.883 accuracy.

If you are curious which features are extracted, you can use this code:

That’s it. This was an extensive introduction to sentiment analysis. Hopefully, you now have an understanding of what the task implies, what the most important problems are, and how to work around them.