Language models

If you come from a statistics or machine learning background, you probably don’t need convincing that language models are worth building. If not, here’s what language models are and why they are useful.

What is a model?

Generally speaking, a model (in the statistical sense, of course) is a mathematical representation of a process. Models are almost always an approximation of the process. There are several reasons for this, but the two most important are:
1. We usually observe the process only a limited number of times
2. An exact model can be exceptionally complex, so we simplify it

As the statistician George Box once said: “All models are wrong, but some are useful.”

Here’s what a model usually does: it describes how the modelled process creates data. In our case, the modelled phenomenon is human language. A language model provides us with a way of generating human language. These models are usually built from probability distributions over words.

A model is built by observing some samples generated by the phenomenon to be modelled. In the same way, a language model is built by observing some text.

Let’s start building some models.

Bag Of Words

This is by far the simplest way of modelling human language. That doesn’t mean it’s useless or unpopular. Quite the opposite. In fact, chances are, being an avid reader of this blog, that you have already created a Bag-Of-Words (or BOW) model. Here’s what you need to know about this model:

  1. It has an oversimplified view of the language
  2. It takes into account only the frequency of the words in the language, not their order or position

In a way, you created a Bag-Of-Words model when you tried text classification or sentiment analysis. It basically means you take the available words in a text and keep count of how many times they appear. Here’s how to build such a model with NLTK:
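As a minimal sketch of the idea, here is a version that uses only Python’s standard library and a tiny made-up corpus rather than NLTK and a real corpus. The corpus, the variable names and the `generate_text` helper are assumptions made for illustration:

```python
import random
from collections import Counter

# Tiny made-up corpus (an assumption for illustration; a real model
# would be built from a much larger body of text).
corpus = "the cat sat on the mat and the dog sat on the rug"
words = corpus.split()

# Bag of words: count how often each word occurs, ignoring order.
counts = Counter(words)
total = sum(counts.values())

# Turn the counts into a probability distribution over words.
probabilities = {word: count / total for word, count in counts.items()}


def generate_text(probabilities, length, seed=None):
    """Sample `length` words independently, weighted by their frequency."""
    rng = random.Random(seed)
    vocabulary = list(probabilities)
    weights = [probabilities[word] for word in vocabulary]
    return " ".join(rng.choices(vocabulary, weights=weights, k=length))


print(counts.most_common(3))
print(generate_text(probabilities, length=10, seed=0))
```

Each word is drawn independently of the others, so the generated text respects word frequencies and nothing else.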

As you can see, it’s not the most expressive piece of content out there. The produced text follows only the frequency rules of the language and nothing more.

Now that we know the probability of every word, we can compute the probability of a text. Because the words are generated independently, we just need to multiply their probabilities together:
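In other words, P(w1 w2 ... wn) = P(w1) × P(w2) × ... × P(wn). A minimal sketch of that computation, again on a made-up toy corpus (the `text_probability` helper is a hypothetical name, and giving unseen words a probability of 0 is a simplifying assumption, since no smoothing is applied):

```python
import math
from collections import Counter

# Same kind of toy corpus as before (an assumption for illustration).
words = "the cat sat on the mat and the dog sat on the rug".split()
counts = Counter(words)
total = sum(counts.values())
probabilities = {word: count / total for word, count in counts.items()}


def text_probability(text, probabilities):
    """Multiply the individual probabilities of every word in `text`."""
    # Words never seen in the corpus get probability 0 (no smoothing).
    return math.prod(probabilities.get(word, 0.0) for word in text.split())


print(text_probability("the cat sat", probabilities))
print(text_probability("sat cat the", probabilities))  # same value: order is ignored
```

Note that reordering the words doesn’t change the result, which is exactly the limitation of this model.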

Bigrams and Trigrams

One idea that can help us generate better text is to make sure the new word we’re adding to the sequence goes well with the words already in the sequence. Checking if a word fits well after 10 words might be a bit overkill. We can simplify things to keep the problem reasonable. Let’s make sure the new word goes well after the last word in the sequence (bigram model) or the last two words (trigram model).

“Bigram” is a fancy name for two consecutive words, while a trigram is (you guessed it) a triplet of consecutive words. Here is some quick NLTK magic for extracting bigrams/trigrams:
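To show what the extraction amounts to, here is a plain-Python sketch. The `ngrams` function below is a hypothetical stand-in written for this example; NLTK ships its own `bigrams` and `trigrams` helpers that do the same job:

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip over the token list shifted by 0, 1, ..., n-1 positions.
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = "the cat sat on the mat".split()

print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

A sentence of 6 words gives 5 bigrams and 4 trigrams, since each n-gram needs n consecutive words.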

We’re going to build a trigram model from the Reuters corpus. Building a bigram model is completely analogous and easier.
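Here is a minimal sketch of the construction. To keep it self-contained, it uses three made-up sentences in place of the Reuters corpus, and `None` as a padding marker for the start and end of a sentence; both are assumptions for illustration, as is the `build_trigram_model` name:

```python
from collections import defaultdict

# Made-up sentences standing in for a real corpus such as Reuters.
sentences = [
    "the cat sat on the mat".split(),
    "the cat sat on the rug".split(),
    "the dog sat on the mat".split(),
]


def build_trigram_model(sentences):
    """Map each pair of words to a probability distribution over the next word."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        # None marks the sentence boundaries (padding on both sides).
        padded = [None, None] + sentence + [None, None]
        for w1, w2, w3 in zip(padded, padded[1:], padded[2:]):
            counts[(w1, w2)][w3] += 1

    # Normalise the counts of each context into probabilities.
    model = {}
    for context, followers in counts.items():
        total = sum(followers.values())
        model[context] = {word: count / total for word, count in followers.items()}
    return model


model = build_trigram_model(sentences)
print(model[("on", "the")])   # distribution over words seen after "on the"
print(model[(None, None)])    # distribution over sentence-opening words
```

The model is indexed as `model[(w1, w2)][w3]`, the probability of seeing `w3` right after the pair `w1 w2`.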

That was easy. Now we have a trigram language model. Let’s generate some text:
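A minimal sketch of the generation loop, assuming a model of the form `model[(w1, w2)][w3]` with `None` as the sentence boundary marker. The toy sentences, the function names and the `max_words` safety cap are all assumptions for illustration:

```python
import random
from collections import defaultdict

# Made-up sentences standing in for a real corpus such as Reuters.
sentences = [
    "the cat sat on the mat".split(),
    "the cat sat on the rug".split(),
    "the dog sat on the mat".split(),
]


def build_trigram_model(sentences):
    """Map each pair of words to a probability distribution over the next word."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        padded = [None, None] + sentence + [None, None]
        for w1, w2, w3 in zip(padded, padded[1:], padded[2:]):
            counts[(w1, w2)][w3] += 1
    return {
        context: {w: c / sum(followers.values()) for w, c in followers.items()}
        for context, followers in counts.items()
    }


def generate_sentence(model, seed=None, max_words=50):
    """Sample one word at a time, conditioned on the two previous words."""
    rng = random.Random(seed)
    text = [None, None]  # start from the sentence-start padding
    while len(text) - 2 < max_words:
        candidates = model[tuple(text[-2:])]
        words = list(candidates)
        weights = [candidates[w] for w in words]
        next_word = rng.choices(words, weights=weights)[0]
        if next_word is None:  # the end-of-sentence marker was drawn
            break
        text.append(next_word)
    return " ".join(text[2:])


model = build_trigram_model(sentences)
print(generate_sentence(model, seed=1))
```

On a corpus this small the model can only recombine the training sentences; on a large corpus the same loop produces much more varied text.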

The output text is actually really readable and I had a lot of fun reading some of the stuff.

Here are a few of them:

The quality of the results is way better than the bag of words ones. What do you think?

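The probability of a sequence is computed using conditional probabilities: each word is conditioned on the two words before it, and the probability of the whole sequence is the product of these terms. The probability of word[i] given word[i-2] and word[i-1] is P(word[i] | word[i-2], word[i-1]), which in our case is equal to: model[(word[i-2], word[i-1])][word[i]]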

Let’s add the probability computation in the generation script:
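A minimal sketch of how that could look, under the same assumptions as before: a toy corpus instead of Reuters, `None` as the boundary marker, and hypothetical function names. The running product also includes the probability of drawing the end-of-sentence marker:

```python
import random
from collections import defaultdict

# Made-up sentences standing in for a real corpus such as Reuters.
sentences = [
    "the cat sat on the mat".split(),
    "the cat sat on the rug".split(),
    "the dog sat on the mat".split(),
]


def build_trigram_model(sentences):
    """Map each pair of words to a probability distribution over the next word."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        padded = [None, None] + sentence + [None, None]
        for w1, w2, w3 in zip(padded, padded[1:], padded[2:]):
            counts[(w1, w2)][w3] += 1
    return {
        context: {w: c / sum(followers.values()) for w, c in followers.items()}
        for context, followers in counts.items()
    }


def generate_with_probability(model, seed=None, max_words=50):
    """Generate a sentence and return it with the probability of that sequence."""
    rng = random.Random(seed)
    text = [None, None]
    probability = 1.0
    while len(text) - 2 < max_words:
        context = tuple(text[-2:])
        candidates = model[context]
        words = list(candidates)
        weights = [candidates[w] for w in words]
        next_word = rng.choices(words, weights=weights)[0]
        # Multiply in P(next_word | previous two words).
        probability *= model[context][next_word]
        if next_word is None:  # the end-of-sentence marker was drawn
            break
        text.append(next_word)
    return " ".join(text[2:]), probability


model = build_trigram_model(sentences)
sentence, probability = generate_with_probability(model, seed=1)
print(sentence)
print(probability)
```

On real text these products get very small very quickly, so a common design choice is to sum log-probabilities instead of multiplying raw probabilities.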


Conclusions

  • We’ve learned to build generative language models
  • NLTK has some cool utils that come in handy
  • In theory, the larger the n-grams (the generalisation of bigrams and trigrams to n consecutive words), the better the language we generate
  • The larger the n-grams we use, the larger our models get