A simple recipe for text clustering. Clustering text can be tricky in scikit-learn because text is represented as high-dimensional sparse feature vectors, which not every clustering algorithm handles well.
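A minimal sketch of such a recipe, assuming TF-IDF features and KMeans (which works directly on sparse matrices); the sample documents and the cluster count are invented for illustration:

```python
# Sketch: cluster short texts with TF-IDF + KMeans in scikit-learn.
# The documents and n_clusters below are made up for illustration.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the football game ended with a goal",
    "the striker scored a late goal in the game",
    "take one tablet of the medicine daily",
    "the medicine tablet should be taken daily",
]

# TfidfVectorizer outputs a sparse matrix; KMeans accepts it as-is.
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # the two sports texts and the two medical texts group together
```

Algorithms that need dense input or pairwise distances (e.g. some agglomerative setups) are where the sparse representation starts to hurt; KMeans on TF-IDF avoids that.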
Text classification is probably the most commonly encountered Natural Language Processing task. It can be described as assigning texts to an appropriate bucket. A sports article should go in
SPORT_NEWS, and a medical prescription should go in a medical category of its own.
To train a text classifier, we need some annotated data. This training data can be obtained through several methods. Suppose you want to build a spam classifier. You would export the contents of your mailbox, label the emails in your inbox folder as
NOT_SPAM, and the contents of your spam folder as SPAM.
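The idea above can be sketched with a tiny scikit-learn pipeline; the example messages and labels are invented, and Multinomial Naive Bayes is just one reasonable choice of model:

```python
# Sketch: train a spam classifier on labelled emails.
# Messages and labels below are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "win a free prize now, click here",        # from the spam folder
    "cheap meds, limited offer, click now",    # from the spam folder
    "meeting moved to 3pm tomorrow",           # from the inbox
    "here are the notes from today's meeting", # from the inbox
]
labels = ["SPAM", "SPAM", "NOT_SPAM", "NOT_SPAM"]

# Bag-of-words counts feeding a Multinomial Naive Bayes classifier.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(emails, labels)

print(clf.predict(["claim your free prize now"])[0])  # prints SPAM
```

In practice you would export far more than four emails, but the folder-as-label trick is exactly this: the folder name becomes the target class.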
In this example I want to show how to use some of the tools bundled with NLTK to build something pretty awesome. Inverted indexes are a very powerful tool and are one of the building blocks of modern-day search engines.
While building the inverted index, you’ll learn to:
1. Use a stemmer from NLTK
2. Filter words using a stopwords list
3. Tokenize text
You might have stumbled upon situations in your NLP application development when you needed to get the “closest” adjective to a noun, or maybe you needed to “nounify” a verb. After poking around WordNet, I found a simple and pretty effective way to do this. Keep in mind that it is not error-proof, but for most of my needs I found it to perform pretty well. We’ll be using the NLTK WordNet wrapper for this. Let’s have a look at the code: