Recipe: Text classification using NLTK and scikit-learn

Text classification is probably the most common Natural Language Processing task. It can be described as assigning texts to an appropriate bucket: a sports article should go in SPORT_NEWS, and a medical prescription in MEDICAL_PRESCRIPTIONS.

To train a text classifier, we need some annotated data. This training data can be obtained in several ways. Suppose you want to build a spam classifier: you would export the contents of your mailbox, label the emails in your inbox folder as NOT_SPAM, and label the contents of your spam folder as SPAM.

For the sake of simplicity, we will use a news corpus already available in scikit-learn: the 20 Newsgroups dataset. Let’s have a look at it:
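Here’s a minimal sketch of loading and inspecting it with scikit-learn’s fetch_20newsgroups loader (the inspection code itself is an assumption, not necessarily the original):

    from sklearn.datasets import fetch_20newsgroups

    # Fetch the full corpus (both the train and the test splits)
    news = fetch_20newsgroups(subset='all')

    print(len(news.data))     # number of documents
    print(news.target_names)  # the 20 class labels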

Analyzing the dataset tells us we’re dealing with the task of classifying 18,846 documents into 20 classes.

Training a model usually requires some trial and error. Let’s build a simple way of training and evaluating a classifier against a test set:
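Something along these lines should do (the helper’s name and the 80/20 split are illustrative choices):

    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    def train_and_evaluate(classifier, X, y):
        # Hold out a fixed test set so successive models are compared fairly
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42)
        classifier.fit(X_train, y_train)
        return accuracy_score(y_test, classifier.predict(X_test))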

We’ll be playing with the Multinomial Naive Bayes classifier, for which text classification is the most common use case.

To transform the text into feature vectors, we’ll use the feature extractors from the sklearn.feature_extraction.text module. TfidfVectorizer has the advantage of emphasizing the words that matter most for a given document.

Let’s start building the classifier.
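A baseline might look like this: a TfidfVectorizer with default settings feeding a MultinomialNB, wrapped in a Pipeline and scored with the helper above (a sketch; the original setup may differ):

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB

    model = Pipeline([
        ('vectorizer', TfidfVectorizer()),
        ('classifier', MultinomialNB()),
    ])
    print(train_and_evaluate(model, news.data, news.target))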

Pretty good result for a first try. Let’s see how we can improve that. The first thing that comes to mind is to ignore insignificant words. We can use NLTK’s stopwords list.
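One way to wire the stopword list into the vectorizer (assumes nltk.download('stopwords') has been run):

    from nltk.corpus import stopwords

    model = Pipeline([
        # Ignore NLTK's English stopwords when building the vocabulary
        ('vectorizer', TfidfVectorizer(stop_words=stopwords.words('english'))),
        ('classifier', MultinomialNB()),
    ])
    print(train_and_evaluate(model, news.data, news.target))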

Good boost. We can now try playing with the alpha parameter of the Naive Bayes classifier. Let’s set it to a low value:
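For example (the value 0.05 is an illustrative choice, not necessarily the original one):

    model = Pipeline([
        ('vectorizer', TfidfVectorizer(stop_words=stopwords.words('english'))),
        # A low alpha means less smoothing: more weight on the observed counts
        ('classifier', MultinomialNB(alpha=0.05)),
    ])
    print(train_and_evaluate(model, news.data, news.target))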

Great progress. Let’s ignore words that appear fewer than 5 times in the document collection:
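That’s just the min_df parameter of the vectorizer:

    model = Pipeline([
        # Drop words that appear in fewer than 5 documents
        ('vectorizer', TfidfVectorizer(stop_words=stopwords.words('english'),
                                       min_df=5)),
        ('classifier', MultinomialNB(alpha=0.05)),
    ])
    print(train_and_evaluate(model, news.data, news.target))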

Oops. That didn’t do much harm, but it definitely doesn’t help us either. Let’s try something more radical.

We’ll use NLTK’s tokenizer to better split the text into words, and then bring the words to a base form with a stemmer. We’ll also ignore punctuation, since word_tokenize doesn’t filter it out.
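A sketch of how that could look, using word_tokenize and a PorterStemmer (assumes nltk.download('punkt'); min_df is dropped again since it didn’t help):

    import string
    from nltk import word_tokenize
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()

    def stemming_tokenizer(text):
        # Tokenize, drop punctuation-only tokens, then stem each word
        tokens = word_tokenize(text)
        tokens = [t for t in tokens if t not in string.punctuation]
        return [stemmer.stem(t) for t in tokens]

    model = Pipeline([
        ('vectorizer', TfidfVectorizer(tokenizer=stemming_tokenizer,
                                       stop_words=stopwords.words('english'))),
        ('classifier', MultinomialNB(alpha=0.05)),
    ])
    print(train_and_evaluate(model, news.data, news.target))

Note that the stemmer now runs over every document at both training and prediction time, which is where the slowdown comes from.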

A small improvement, with a great decrease in speed.


Conclusions

  • Experiment with the tools you have available. Feel free to try different vectorizers and/or different classifiers
  • Building a model is an iterative process of trial and error
  • Sometimes accuracy comes at the cost of speed