Recipe: Text classification using NLTK and scikit-learn
Text classification is probably the most commonly encountered Natural Language Processing task. It can be described as assigning each text to an appropriate bucket: a sports article should go in SPORT_NEWS, and a medical prescription should go in MEDICAL_PRESCRIPTIONS.
To train a text classifier, we need some annotated data. This training data can be obtained through several methods. Suppose you want to build a spam classifier: you could export the contents of your mailbox, labelling the emails in your inbox folder as NOT_SPAM and the contents of your spam folder as SPAM.
For the sake of simplicity, we will use a news corpus already available in scikit-learn. Let’s have a look at it:
from sklearn.datasets import fetch_20newsgroups

news = fetch_20newsgroups(subset='all')

print(len(news.data))          # 18846
print(len(news.target_names))  # 20
print(news.target_names)
# ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
#  'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x',
#  'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball',
#  'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med',
#  'sci.space', 'soc.religion.christian', 'talk.politics.guns',
#  'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']

# Peek at the first few documents and their labels
for text, num_label in zip(news.data[:10], news.target[:10]):
    print('[%s]:\t\t "%s ..."' % (news.target_names[num_label], text[:100].split('\n')[0]))
# [rec.sport.hockey]:          "From: Mamatha Devineni Ratnam <[email protected]> ..."
# [comp.sys.ibm.pc.hardware]:  "From: [email protected] (Matthew B Lawson) ..."
# [talk.politics.mideast]:     "From: [email protected] (Hilmi Eren) ..."
# [comp.sys.ibm.pc.hardware]:  "From: [email protected] (Guy Dawson) ..."
# [comp.sys.mac.hardware]:     "From: Alexander Samuel McDiarmid <[email protected]> ..."
# [sci.electronics]:           "From: [email protected] (Stephen Tell) ..."
# [comp.sys.mac.hardware]:     "From: [email protected] (Louis Paul Adams) ..."
# [rec.sport.hockey]:          "From: [email protected] (Deepak Chhabra) ..."
# [rec.sport.hockey]:          "From: [email protected] (Deepak Chhabra) ..."
# [talk.religion.misc]:        "From: [email protected] (Ken Arromdee) ..."
Analyzing the dataset tells us we’re dealing with the task of classifying 18,846 documents into 20 classes.
Training a model usually requires some trial and error. Let’s build a simple way of training and evaluating a classifier against a test set:
from sklearn.model_selection import train_test_split  # was sklearn.cross_validation in older scikit-learn

def train(classifier, X, y):
    # Hold out 25% of the data for evaluation
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)
    classifier.fit(X_train, y_train)
    print("Accuracy: %s" % classifier.score(X_test, y_test))
    return classifier
We’ll be playing with the Multinomial Naive Bayes classifier, for which text classification is the most common use case. To transform the text into a feature vector, we’ll use the feature extractors from the sklearn.feature_extraction.text module. TfidfVectorizer has the advantage of emphasizing the words that are characteristic of a given document, while downweighting words that appear in most documents.
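To get a feel for what that means, here’s a minimal sketch on a toy corpus (the sentences below are made up for illustration and are not part of the news dataset):

from sklearn.feature_extraction.text import TfidfVectorizer

# "game" appears in two of the three documents, "referee" in only one,
# so "referee" receives a higher idf weight.
corpus = [
    "the game was a great game",
    "the referee stopped the game",
    "the election results are in",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

# Inspect the weights assigned to the words of the second document
feature_names = vectorizer.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
for idx in tfidf[1].nonzero()[1]:
    print("%s: %.3f" % (feature_names[idx], tfidf[1, idx]))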
Let’s start building the classifier.
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

trial1 = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', MultinomialNB()),
])

train(trial1, news.data, news.target)
# Accuracy: 0.846349745331
Pretty good result for a first try. Let’s see how we can improve that. The first thing that comes to mind is to ignore insignificant words. We can use NLTK’s stopwords list.
from nltk.corpus import stopwords

trial2 = Pipeline([
    ('vectorizer', TfidfVectorizer(stop_words=stopwords.words('english'))),
    ('classifier', MultinomialNB()),
])

train(trial2, news.data, news.target)
# Accuracy: 0.877546689304
A good boost. We can now try playing with the alpha parameter of the Naive Bayes classifier. alpha controls additive (Laplace/Lidstone) smoothing; a lower value makes the model rely more heavily on the observed word counts. Let’s set it to a low value:
trial3 = Pipeline([
    ('vectorizer', TfidfVectorizer(stop_words=stopwords.words('english'))),
    ('classifier', MultinomialNB(alpha=0.05)),
])

train(trial3, news.data, news.target)
# Accuracy: 0.909592529711
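Rather than guessing, we could also search for a good alpha value. Here’s a minimal sketch using scikit-learn’s GridSearchCV; the grid of values below is an arbitrary choice:

from sklearn.model_selection import GridSearchCV

# Parameters of pipeline steps are addressed as '<step name>__<parameter>'
search = GridSearchCV(
    trial3,
    param_grid={'classifier__alpha': [0.01, 0.05, 0.1, 0.5, 1.0]},
    cv=3,
)
search.fit(news.data, news.target)
print(search.best_params_, search.best_score_)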
Great progress. Let’s ignore words that appear in fewer than 5 documents of the collection (min_df=5):
trial4 = Pipeline([
    ('vectorizer', TfidfVectorizer(stop_words=stopwords.words('english'), min_df=5)),
    ('classifier', MultinomialNB(alpha=0.05)),
])

train(trial4, news.data, news.target)
# Accuracy: 0.903013582343
Oops. That didn’t do much harm, but it definitely doesn’t help us either. Let’s try something more radical.
We’ll use NLTK’s tokenizer to better split the text into words, and then bring the words to a base form using a stemmer. We’ll also ignore punctuation, since word_tokenize doesn’t filter it out.
import string
from nltk.stem import PorterStemmer
from nltk import word_tokenize

def stemming_tokenizer(text):
    stemmer = PorterStemmer()
    return [stemmer.stem(w) for w in word_tokenize(text)]

trial5 = Pipeline([
    ('vectorizer', TfidfVectorizer(tokenizer=stemming_tokenizer,
                                   stop_words=stopwords.words('english') + list(string.punctuation))),
    ('classifier', MultinomialNB(alpha=0.05)),
])

train(trial5, news.data, news.target)
# Accuracy: 0.910653650255
A small improvement, with a great decrease in speed.
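Part of that slowdown is avoidable: the tokenizer above builds a fresh PorterStemmer for every document and re-stems words it has already seen. Here’s a minimal sketch of a faster variant; the function name and cache size are our own choices:

from functools import lru_cache

from nltk import word_tokenize
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()  # create the stemmer once instead of once per document

@lru_cache(maxsize=100000)
def cached_stem(word):
    # Memoize stems, since the same words occur over and over in a corpus
    return stemmer.stem(word)

def fast_stemming_tokenizer(text):
    return [cached_stem(w) for w in word_tokenize(text)]

Swapping this in for stemming_tokenizer should leave the accuracy unchanged while cutting down the vectorization time.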
Conclusions
- Experiment with the tools you have available. Feel free to try different vectorizers and/or different classifiers (see the sketch after this list)
- Building a model is an iterative process of trial and error
- Sometimes accuracy comes at the cost of speed
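As an example of the first point, here’s a quick sketch that swaps the Naive Bayes classifier for a linear SVM, reusing the same train helper; LinearSVC is just one plausible alternative:

from sklearn.svm import LinearSVC

trial6 = Pipeline([
    ('vectorizer', TfidfVectorizer(stop_words=stopwords.words('english'))),
    ('classifier', LinearSVC()),
])

train(trial6, news.data, news.target)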
Comments

How can I use feature selection (chi2) in this example?
Hi Yousif,
Not sure what you are trying to accomplish. Maybe give this a read: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html
Bogdan.
Thank you very much for your reply, but I want to know how I can use chi2.
Hi Yousif,
You can use chi2 to do feature selection. Feature selection means you discard the features (in the case of text classification, words) that contribute the least to the performance of the classifier. This way you get a lighter model, and sometimes it helps performance-wise by clearing out the noise.
You can check out an example here: http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html
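To make that concrete, here’s a minimal sketch of how chi2 could be plugged into the pipeline from this recipe; the value k=10000 is an arbitrary choice:

from sklearn.feature_selection import SelectKBest, chi2

trial_chi2 = Pipeline([
    ('vectorizer', TfidfVectorizer(stop_words=stopwords.words('english'))),
    ('selector', SelectKBest(chi2, k=10000)),  # keep the 10,000 best-scoring features
    ('classifier', MultinomialNB(alpha=0.05)),
])

train(trial_chi2, news.data, news.target)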
Cheers
Is it possible to check which words from news.data belong to which category, or which category each word in the raw text belongs to?
There are ways of computing probabilities, but a word doesn’t simply belong to a category or vice-versa. The job of the classifier is to build these “soft” boundaries given a set of words (i.e., a document).
Hi Bogdani,
Thanks for writing this article. It is really well written and explained.
I am currently exploring spaCy for NER and need to extract relevant information from job descriptions posted on LinkedIn. Can you help me with some leads or a process to follow?
Thanks,
Check https://spacy.io/usage/examples#training-ner
Thanks a lot, Bogdani. Will look at the link and really glad you replied.
It would be really great if you could cover something like matching a resume to a job description in one of your posts.
Thanks,
Hello, is it possible to get a list of suggested categories ordered by confidence? For example:
politics: 91.22%
economy: 87.32%
…
Hi Ardit,
I’m currently away from my keyboard and unable to give a more elaborate answer. Please read the post on “Performance Metrics”; it might answer your question.
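In the meantime, here’s a minimal sketch of the idea: MultinomialNB exposes predict_proba, so you can rank the categories by predicted probability (this assumes one of the trained pipelines from the recipe, e.g. trial3):

import numpy as np

doc = news.data[0]  # any raw text works here
probs = trial3.predict_proba([doc])[0]

# Print the top 3 categories, ordered by predicted probability
for idx in np.argsort(probs)[::-1][:3]:
    print("%s: %.2f%%" % (news.target_names[idx], probs[idx] * 100))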