Language models

If you come from a statistics or machine learning background, you probably don’t need convincing that language models are worth building. If not, here’s what language models are and why they are useful.

What is a model?

Generally speaking, a model (in the statistical sense, of course) is a mathematical representation of a process. Models are almost always an approximation of the process. There are several reasons for this, but the 2 most important are:
1. We usually only observe the process a limited number of times
2. The process can be exceptionally complex, so we simplify it

The statistician George Box once said: All models are wrong, but some are useful.

Here’s what a model usually does: it describes how the modelled process creates data. In our case, the modelled phenomenon is human language. A language model gives us a way of generating human language. These models are usually made of probability distributions.

A model is built by observing some samples generated by the phenomenon to be modelled. In the same way, a language model is built by observing some text.

Let’s start building some models.
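To make the idea concrete, here is a minimal sketch of a language model built by observing some text. It assumes a bigram model (each word depends only on the previous one), which is just one simple choice and not necessarily the one the full post uses. The function names and the toy sentence are made up for illustration.

```python
import random
from collections import Counter, defaultdict


def train_bigram_model(tokens):
    """Count, for every word, which words follow it in the observed text."""
    followers = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        followers[current][nxt] += 1
    return followers


def next_word_probabilities(model, word):
    """Turn the raw follower counts for `word` into a probability distribution."""
    counts = model[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def generate(model, start, length, seed=0):
    """Generate text by repeatedly sampling the next word from the model."""
    rng = random.Random(seed)
    words = [start]
    for _ in range(length - 1):
        counts = model.get(words[-1])
        if not counts:  # the last word was never followed by anything
            break
        candidates, weights = zip(*counts.items())
        words.append(rng.choices(candidates, weights=weights)[0])
    return words


# Toy "observed text" (an assumption, any tokenized corpus would do)
text = "the cat sat on the mat and the cat slept on the sofa"
model = train_bigram_model(text.split())
probabilities = next_word_probabilities(model, "the")
generated = generate(model, "the", 8)
print(probabilities)
print(" ".join(generated))
```

The model is nothing more than a set of probability distributions, one per word, and generating language means sampling from them. With so little text the output is repetitive, which is the "limited number of observations" problem in miniature.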

Natural Language Processing Corpora

One of the reasons why it’s so hard to learn, practice and experiment with Natural Language Processing is the lack of available corpora. Building a gold standard corpus is seriously hard work. That’s why resources are so scarce or cost a lot of money. In this post, I’m going to aggregate some cool resources, some very well known, some a bit under the radar.


  • Brown – Categorized and part-of-speech tagged corpus – available in NLTK: nltk.corpus.brown
  • Reuters – Categorized corpus – available in NLTK: nltk.corpus.reuters
  • CoNLL2000 – part of speech and chunk annotated corpus – available in NLTK: nltk.corpus.conll2000
  • CoNLL2002 – NER and part of speech and chunk annotated corpus – available in NLTK: nltk.corpus.conll2002
  • Information Extraction and Entity Recognition Corpus – NER annotated corpus – available in NLTK: nltk.corpus.ieer
  • Wordnet – large lexical database of English – available in NLTK: nltk.corpus.wordnet
  • 20 Newsgroups data set – Categorized corpus – available in Scikit-learn: sklearn.datasets.fetch_20newsgroups
  • Groningen Meaning Bank (GMB) – NER and part of speech annotated corpus
  • text8 – Cleaned up Wikipedia articles by Matt Mahoney
  • webtext – User generated content on the web – available in NLTK: nltk.corpus.webtext
  • gutenberg – Text from the Gutenberg Project – available in NLTK: nltk.corpus.gutenberg
  • inaugural – US Presidential Inaugural Addresses – available in NLTK: nltk.corpus.inaugural
  • genesis – Bible text – available in NLTK: nltk.corpus.genesis
  • abc – Australian Broadcasting Commission 2006* – available in NLTK:
Introduction to Python NLTK

Introduction to NLTK

NLTK (Natural Language ToolKit) is the most popular Python framework for working with human language. There’s a bit of controversy about whether NLTK is appropriate for production environments. Here’s my take on the matter:

  • NLTK doesn’t come with super powerful trained models (unlike other frameworks, such as Stanford CoreNLP)
  • NLTK is perfect for getting started with NLP because it’s packed with examples and has a clear, concise, easy-to-understand API
  • NLTK comes with a lot of corpora conveniently organised and easily accessible for experiments
  • NLTK provides a simple and pythonic foundation that’s easy to extend.
  • NLTK contains some beautiful abstractions like the nltk.Tree used in stuff like chunking or syntactic parsing.
  • NLTK contains useful functions for doing a quick analysis (have a quick look at the data)
  • NLTK is certainly the place for getting started with NLP

You might not use the models in NLTK, but you can extend the excellent base classes and use your own trained models, built using other libraries like scikit-learn or TensorFlow. Here are some examples of training your own NLP models: Training a POS Tagger with NLTK and scikit-learn and Train a NER System. NLTK also provides some interfaces to external tools like the StanfordPOSTagger or SennaTagger.

So, long story short, NLTK is awesome because it provides a great starting point. You shouldn’t ignore it because it may be the foundation for your next application.

Natural Language Processing with NLTK

NLTK is well organized and plays well with the NLP Pyramid. The various functions, shortcuts and interfaces can be organized in a hierarchy as follows:


Weighting words using Tf-Idf

If I ask you “Do you remember the article about electrons in the NY Times?” there’s a better chance you will remember it than if I asked you “Do you remember the article about electrons in the Physics books?”. Here’s why: an article about electrons is far less common in the NY Times than in a collection of physics books. You are less likely to stumble upon the “electron” concept in the NY Times than in a physics book.

Let’s now consider the scenario of a single article. Suppose you read an article and you’re asked to rank the concepts found in it by importance. The chances are you’ll basically order the concepts by frequency. The reason is simply that important concepts are mentioned repeatedly because the narrative gravitates around them.

Combining the 2 insights, given a term, a document and a collection of documents we can loosely say that:
importance ~ appearances(term, document) / count(documents containing term in collection)

This technique is called Tf-Idf: Term Frequency – Inverse Document Frequency. Here’s how the measure is defined:

  • tf = count(word, document) / len(document) – term frequency
  • idf = log( len(collection) / count(document_containing_term, collection) ) – inverse document frequency
  • tf-idf = tf * idf – term frequency – inverse document frequency
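The three definitions above translate almost word for word into code. This is a minimal sketch in plain Python, assuming documents are already tokenized into lists of words. The toy collection and the choice to give an unseen term a weight of zero are assumptions made for the example.

```python
import math


def tf(term, document):
    """Term frequency: share of the document's tokens equal to `term`."""
    return document.count(term) / len(document)


def idf(term, collection):
    """Inverse document frequency: log(N / number of documents containing term)."""
    containing = sum(1 for document in collection if term in document)
    if containing == 0:
        return 0.0  # assumption: a term seen nowhere gets zero weight
    return math.log(len(collection) / containing)


def tf_idf(term, document, collection):
    """Weight of `term` in `document`, relative to the whole collection."""
    return tf(term, document) * idf(term, collection)


# Toy collection of tokenized documents (made up for illustration)
collection = [
    "the electron has a negative charge".split(),
    "the election results are in".split(),
    "the weather is nice today".split(),
]
doc = collection[0]
scores = {term: tf_idf(term, doc, collection) for term in set(doc)}
for term, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{term:10s} {score:.4f}")
```

A word like "the" appears in every document, so its idf is log(1) = 0 and it gets no weight at all, while "electron" appears in only one document and scores well above it.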

Classification Performance Metrics

Throughout this blog, we seek to obtain good performance on our classification tasks. Classification is one of the most popular tasks in Machine Learning. Be sure you understand what classification is before going through this tutorial. You can check this Introduction to Machine Learning, specially created for hackers.

Since we’re always concerned with how well our systems are performing, we should have a clear way of measuring how performant a system is.

Binary Classification

We often have to deal with the simple task of Binary Classification. Some examples are: Sentiment Analysis (positive/negative), Spam Detection (spam/not-spam), Fraud Detection (fraud/not-fraud).
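As a rough illustration of what "measuring" means for a binary task, here is a sketch that computes a few common metrics for a spam detector. The excerpt above doesn't name specific metrics, so accuracy, precision, recall and F1 are assumed here as the usual choices. The labels and predictions are invented.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count true/false positives and negatives for the chosen positive label."""
    tp = fp = fn = tn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def binary_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall and F1 for a binary classification task."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Invented gold labels and predictions for a spam / not-spam task
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham"]
metrics = binary_metrics(y_true, y_pred, positive="spam")
print(metrics)
```

Precision answers "of everything flagged as spam, how much really was spam?", while recall answers "of all the real spam, how much did we catch?".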

Is it a boy or a girl? An introduction to Machine Learning

Have you ever noticed what happens when you hear a name you haven’t heard before? You automatically put it in a bucket: the girl names bucket or the boy names bucket. In this tutorial, we’re getting started with machine learning. We’ll be building a classifier able to distinguish between boy and girl names. If this sounds interesting, read along. If you expect a tonne of intricate math, read along anyway. It’s easier and more fun than you think.

Splitting text into sentences

Few people realise how tricky splitting text into sentences can be. Most of the NLP frameworks out there already have English models created for this task.

You might encounter issues with the pretrained models if:

  1. You are working with a specific genre of text (usually technical) that contains strange abbreviations.
  2. You are working with a language that doesn’t have a pretrained model (Example: Romanian)

Here’s an example of the first scenario:

Under the hood, NLTK’s sent_tokenize function uses an instance of a PunktSentenceTokenizer.

The PunktSentenceTokenizer is an unsupervised trainable model. This means it can be trained on unlabeled data, aka text that is not split into sentences.

Behind the scenes, PunktSentenceTokenizer learns the abbreviations in the text. This is the mechanism that the tokenizer uses to decide where to “cut”.

We’re going to study how to train such a tokenizer and how to manually add abbreviations to fine-tune it.
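Before training anything, it helps to see why abbreviations matter so much. The sketch below is a deliberately naive splitter in plain Python, not Punkt's actual algorithm and not NLTK code. It cuts after sentence-ending punctuation unless the word before the period is in a known abbreviation list. The function name, the sample text and the abbreviation list are all made up for the example.

```python
import re


def split_sentences(text, abbreviations=frozenset()):
    """Cut after '.', '!' or '?' followed by whitespace, unless the word
    ending in '.' is a known abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+", text):
        preceding = text[start:match.start()].split()
        last_token = preceding[-1].lower() if preceding else ""
        if text[match.start()] == "." and last_token in abbreviations:
            continue  # the period belongs to an abbreviation, do not cut
        sentences.append(text[start:match.start() + 1].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


text = "The sample was prepared acc. to the protocol. Results are shown in fig. 2 below."

naive = split_sentences(text)
tuned = split_sentences(text, abbreviations={"acc", "fig"})

print(naive)  # cuts after "acc." and "fig." as well
print(tuned)  # two sentences, as a human would read them
```

Without the abbreviation list, the technical text is chopped into four fragments. Punkt's advantage is that it learns such a list from raw text instead of requiring you to write it by hand, while still letting you add entries manually.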

Natural Language Processing - Introduction

What is Natural Language Processing?

This is probably the first post I should have written on the blog. The thing is, I did machine learning and natural language processing for a long time before putting the concepts in order inside my own mind.

I’ve learned techniques and hacks to boost precision of classifiers before fully understanding how a classifier computes its weights or whatever. So I guess it makes sense to publish a general introductory post after some real hands-on posts.

Here’s a popular diagram used to describe what data science usually implies:

[Image: data science Venn diagram]

You’ve probably figured out by now that Natural Language Processing has something to do with data science. Indeed, that’s true. NLP employs many of the techniques used in data science, plus adds a few of its own or puts a new spin on some techniques.

I would say that Natural Language Processing also implies a good understanding of grammar. If your native language is as hard as mine, you probably hated it in school.

I like to think about grammar as the “science” of turning plain language into mathematical objects. Transform bits and pieces of text into formal objects that you can use programmatically. Some examples of grammar related tasks that you’ll probably use very often in NLP are:

  • Split text into sentences.
  • Find the part-of-speech for words inside a sentence.
  • Determine the different types of subclauses.
  • Determine the subject and the direct object of a sentence.

Although these might seem trivial for humans, they prove to be difficult tasks for machines mostly because of the ambiguity of natural language. We humans don’t have such a hard time untangling the ambiguities because we have something called common knowledge and prior experience.

Introduction to Sentiment Analysis

Getting Started with Sentiment Analysis

What is sentiment analysis?

The most direct definition of the task is: “Does a text express a positive or negative sentiment?”. Usually, we assign a polarity value to a text. This value typically lies in the [-1, 1] interval, with 1 being very positive and -1 very negative.
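To show what a polarity value in [-1, 1] can look like, here is a toy word-counting scorer. It is only a sketch: the two word lists are tiny and invented, and real systems typically rely on much larger lexicons or trained classifiers.

```python
# Tiny, hand-made word lists (assumptions for the example only)
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "awful", "hate", "terrible"}


def polarity(text):
    """Return a score in [-1, 1]: +1 if only positive words were found,
    -1 if only negative ones, 0.0 if none or an equal number of both."""
    tokens = [token.strip(".,!?").lower() for token in text.split()]
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    if positives + negatives == 0:
        return 0.0
    return (positives - negatives) / (positives + negatives)


print(polarity("I love this phone, great battery!"))
print(polarity("Awful screen but good sound."))
print(polarity("Terrible."))
```

Dividing the difference by the number of sentiment words found is what keeps the score inside the [-1, 1] interval, whatever the length of the text.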

Why is sentiment analysis useful?

Sentiment analysis can have a multitude of uses, some of the most prominent being:

  • Discover a brand’s / product’s presence online
  • Check the reviews for a product
  • Customer support

Why sentiment analysis is hard

There are a few problems that make sentiment analysis particularly hard:

Training a NER System Using a Large Dataset

In a previous article, we studied training a NER (Named-Entity-Recognition) system from the ground up, using the Groningen Meaning Bank Corpus. This article is a continuation of that tutorial. The main purpose of this extension is to:

  1. Replace the classifier with a Scikit-Learn Classifier
  2. Train a NER on a larger subset of the training data
  3. Increase accuracy
  4. Understand Out Of Core Learning

What was wrong with the initial system, you might ask? There wasn’t anything fundamentally wrong with the process. In fact, it’s a great didactic example, and we can build upon it. This is where it was lacking:

  1. If you did the training yourself, you probably realized we can’t train the system on the whole dataset (I chose to train it on the first 2000 sentences).
  2. The dataset is so huge – it can’t be loaded all in memory.
  3. We achieved around 93% accuracy. That might sound like a good accuracy, but we might be deceived. Named entities are probably around 10% of the tags. If we predict that all words have the O tag (remember, O stands for outside any entity), we achieve 90% accuracy. We can probably do better.
  4. We can come up with a better feature set that better describes the data and is more relevant to our task.
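The accuracy trap from point 3 is easy to reproduce with made-up numbers. The sketch below builds 100 invented gold tags, 10% of which are entities, and scores a "classifier" that always answers O. The tag names and the token-level recall measure are assumptions chosen for illustration.

```python
def accuracy(gold, predicted):
    """Share of tokens whose predicted tag matches the gold tag."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def entity_recall(gold, predicted):
    """Share of entity tokens (gold tag other than O) that were tagged correctly."""
    entity_positions = [i for i, tag in enumerate(gold) if tag != "O"]
    found = sum(predicted[i] == gold[i] for i in entity_positions)
    return found / len(entity_positions)


# Invented tag sequence: 90 tokens outside any entity, 10 entity tokens
gold = ["O"] * 90 + ["B-PER"] * 6 + ["B-LOC"] * 4

# A "classifier" that never finds anything
baseline = ["O"] * len(gold)

print(accuracy(gold, baseline))       # 0.9
print(entity_recall(gold, baseline))  # 0.0
```

The baseline scores 90% accuracy while finding none of the entities, which is why a 93% accuracy says little on its own.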

Out-Of-Core Learning

We are used to showing all the data we have at once to our classifier. This means that we have to keep all the data in memory. This can get in our way if we want to train on a larger dataset. Keeping the dataset out of RAM is called Out-Of-Core Learning.
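Here is a minimal sketch of the idea in plain Python: examples are pulled lazily from a stream, a small batch at a time, and the model is updated after each batch. The perceptron, the feature names and the synthetic stream are all assumptions made for the example. The method name partial_fit mirrors the one that scikit-learn's incremental estimators expose, but this class is not scikit-learn code.

```python
from collections import defaultdict
from itertools import islice


def batches(stream, batch_size):
    """Yield lists of at most `batch_size` items, pulling lazily from `stream`."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


class OnlinePerceptron:
    """Binary perceptron over sparse feature dicts, updated one batch at a time."""

    def __init__(self):
        self.weights = defaultdict(float)

    def predict(self, features):
        score = sum(self.weights[name] * value for name, value in features.items())
        return 1 if score > 0 else 0

    def partial_fit(self, batch):
        for features, label in batch:
            error = label - self.predict(features)  # -1, 0 or +1
            if error:
                for name, value in features.items():
                    self.weights[name] += error * value


def example_stream(n):
    """Stand-in for a file reader: produces examples one by one, never a full list."""
    for i in range(n):
        if i % 2:
            yield {"bias": 1.0, "is_capitalized": 1.0}, 1  # entity-like token
        else:
            yield {"bias": 1.0, "is_capitalized": 0.0}, 0  # ordinary token


model = OnlinePerceptron()
for batch in batches(example_stream(1000), batch_size=50):
    model.partial_fit(batch)  # only one batch of 50 examples is in memory here

print(model.predict({"bias": 1.0, "is_capitalized": 1.0}))
print(model.predict({"bias": 1.0, "is_capitalized": 0.0}))
```

The important part is the loop at the bottom: the full dataset never exists as a list, so memory use depends on the batch size rather than on the size of the corpus.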
