If you have been working with NLTK for some time now, you probably find the task of preprocessing the text a bit cumbersome. In this post, I will walk you through a simple and fun approach for performing repetitive tasks using coroutines. The coroutines concept is a pretty obscure one but very useful indeed. You […]
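To give a feel for the idea, here is a minimal sketch of a coroutine-based preprocessing pipeline in plain Python. It does not use NLTK; the stage names (`lowercase`, `tokenize`, `collect`) and the whitespace tokenizer are assumptions made for illustration and are not taken from the post itself.

```python
from functools import wraps


def coroutine(func):
    """Prime a generator-based coroutine so it is ready to receive values."""
    @wraps(func)
    def start(*args, **kwargs):
        gen = func(*args, **kwargs)
        next(gen)  # advance to the first `yield`
        return gen
    return start


@coroutine
def lowercase(target):
    # Hypothetical stage: normalise case, then pass the text on.
    while True:
        text = yield
        target.send(text.lower())


@coroutine
def tokenize(target):
    # Hypothetical stage: naive whitespace tokenizer standing in for a real one.
    while True:
        text = yield
        target.send(text.split())


@coroutine
def collect(results):
    # Sink: store whatever reaches the end of the pipeline.
    while True:
        tokens = yield
        results.append(tokens)


results = []
pipeline = lowercase(tokenize(collect(results)))
for sentence in ["Dogs are AWESOME", "NLTK is fun"]:
    pipeline.send(sentence)
```

Each stage only knows about the next one, so in a sketch like this you can reorder or swap stages without touching the rest of the pipeline.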
What is chunking? Text chunking, also referred to as shallow parsing, is a task that follows Part-Of-Speech Tagging and adds more structure to the sentence. The result is a grouping of the words into “chunks”. Here’s a quick example:
(NP Every/DT day/NN)
(NP the/DT corner/NN shop/NN))
In other words, in a shallow parse tree, there’s one maximum level between the […]
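As a rough illustration of what a chunker does, the sketch below groups POS-tagged tokens into noun-phrase chunks using one hand-written rule (optional determiner, any adjectives, one or more nouns). The rule, the `np_chunk` helper and the example sentence are assumptions for illustration; a real chunker, such as the ones NLTK provides, is more capable.

```python
def np_chunk(tagged):
    """Group runs matching DT? JJ* NN+ into NP chunks; keep other tokens flat."""
    chunks = []
    i = 0
    while i < len(tagged):
        j = i
        # Optional determiner
        if tagged[j][1] == "DT":
            j += 1
        # Any number of adjectives
        while j < len(tagged) and tagged[j][1].startswith("JJ"):
            j += 1
        # One or more nouns
        k = j
        while k < len(tagged) and tagged[k][1].startswith("NN"):
            k += 1
        if k > j:
            # At least one noun was found: emit a chunk
            chunks.append(("NP", tagged[i:k]))
            i = k
        else:
            # No noun phrase starts here: emit the token unchanged
            chunks.append(tagged[i])
            i += 1
    return chunks


# Hypothetical tagged sentence, chosen to contain the two NPs shown above.
sentence = [
    ("Every", "DT"), ("day", "NN"), ("I", "PRP"), ("visit", "VBP"),
    ("the", "DT"), ("corner", "NN"), ("shop", "NN"),
]
chunked = np_chunk(sentence)
```

The result has a single level of grouping: each NP chunk sits directly under the sentence, and the words sit directly inside the chunk.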
What is “NER”? NER, short for Named Entity Recognition, is probably the first step towards information extraction from unstructured text. It basically means extracting real-world entities from the text (Person, Organization, Event, etc.). Why do you need this information? You might want to map it against a knowledge base to […]
Stack Overflow is full of questions about why stemmers and lemmatizers don’t work as expected. The root cause of the confusion is that their role is often misunderstood. Here’s a comparison: Both stemmers and lemmatizers try to bring inflected words to the same form. Stemmers use an algorithmic approach of removing prefixes and suffixes. The result […]
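The difference is easier to see with a toy example. The sketch below is not NLTK: `toy_stem` strips a few suffixes by rule (it ignores prefixes), and `toy_lemmatize` looks words up in a tiny hand-made dictionary. Both functions and the word lists are assumptions for illustration only.

```python
# Hypothetical suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word):
    """Rule-based: chop off the first matching suffix, keeping at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lookup table standing in for a real lexical database.
TOY_LEMMAS = {"studies": "study", "better": "good", "ran": "run"}


def toy_lemmatize(word):
    """Dictionary-based: return the known base form, or the word unchanged."""
    return TOY_LEMMAS.get(word, word)


stemmed = [toy_stem(w) for w in ("studies", "better", "ran")]
lemmatized = [toy_lemmatize(w) for w in ("studies", "better", "ran")]
```

Here the stemmer turns "studies" into "studi", which is not a real word, and leaves "better" and "ran" alone, while the dictionary lookup maps all three to real base forms. The lookup only works for words it knows about, which is the usual trade-off between the two approaches.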
Wordnet is a lexical database created at Princeton University. Its size and several of its properties make Wordnet one of the most useful tools you can have in your NLP arsenal. Here are a few properties that make Wordnet so useful: Synonyms are grouped together in something called a synset. A synset contains lemmas, which are […]
It’s common in the world of Natural Language Processing to need to compute sentence similarity. Wordnet is an awesome tool and you should always keep it in mind when working with text. It’s of great help for the task we’re trying to tackle. Suppose we have these sentences: “Dogs are awesome.” “Some gorgeous creatures are […]
Part-Of-Speech tagging (or POS tagging, for short) is one of the main components of almost any NLP analysis. The task of POS-tagging simply implies labelling words with their appropriate Part-Of-Speech (Noun, Verb, Adjective, Adverb, Pronoun, …).
Simple recipe for text clustering. This sometimes creates issues in scikit-learn because text has sparse features.
Text classification is probably the most frequently encountered Natural Language Processing task. It can be described as assigning texts to an appropriate bucket. A sports article should go in SPORT_NEWS, and a medical prescription should go in MEDICAL_PRESCRIPTIONS. To train a text classifier, we need some annotated data. This training data can be obtained through […]
In this example I want to show how to use some of the tools packed in NLTK to build something pretty awesome. Inverted indexes are a very powerful tool and are one of the building blocks of modern-day search engines. While building the inverted index, you’ll learn to: 1. Use a stemmer from NLTK […]
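For orientation, here is a minimal sketch of what an inverted index looks like in plain Python. It maps each token to the set of documents containing it. It deliberately skips the NLTK stemming step mentioned above and uses a naive lowercase/whitespace tokenizer; `build_index`, `search` and the sample documents are assumptions for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Map every token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # .get() avoids creating empty entries for unknown terms
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical mini-corpus
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick dog",
}
index = build_index(docs)
hits = search(index, "quick dog")
```

Looking up a term is a single dictionary access, no matter how many documents were indexed, which is why this structure sits at the heart of search engines.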