What are Word Clouds? Word Clouds are a popular way of displaying how important words are in a collection of texts. Basically, the more frequent the word is, the greater space it occupies in the image. One of the uses of Word Clouds is to help us get an intuition about what the collection of […]
If you come from a statistical background or a machine learning one then probably you don’t need any reasons for why it’s useful to build language models. If not, here’s what language models are and why they are useful.
NLTK (Natural Language ToolKit) is the most popular Python framework for working with human language. There’s a bit of controversy around the question whether NLTK is appropriate or not for production environments. Here’s my take on the matter:
Few people realise how tricky splitting text into sentences can be. Most of the NLP frameworks out there already have English models created for this task. You might encounter issues with the pretrained models if:
If you have been working with NLTK for some time now, you probably find the task of preprocessing the text a bit cumbersome. In this post, I will walk you through a simple and fun approach for performing repetitive tasks using coroutines. The coroutines concept is a pretty obscure one but very useful indeed. You […]
What is chunking Text chunking, also referred to as shallow parsing, is a task that follows Part-Of-Speech Tagging and that adds more structure to the sentence. The result is a grouping of the words in “chunks”. Here’s a quick example:
Updates 29-Apr-2018 – Added Gist for the entire code NER, short for Named Entity Recognition is probably the first step towards information extraction from unstructured text. It basically means extracting what is a real world entity from the text (Person, Organization, Event etc …).
Stackoverflow is full of questions about why stemmers and lemmatizers don’t work as expected. The root cause of the confusion is that their role is often misunderstood. Here’s a comparison:
Part-Of-Speech tagging (or POS tagging, for short) is one of the main components of almost any NLP analysis. The task of POS-tagging simply implies labelling words with their appropriate Part-Of-Speech (Noun, Verb, Adjective, Adverb, Pronoun, …).
In this example I want to show how to use some of the tools packed in NLTK to build something pretty awesome. Inverted indexes are a very powerful tool and is one of the building blocks of modern day search engines. While building the inverted index, you’ll learn to: 1. Use a stemmer from NLTK […]