Language models
If you come from a statistical or a machine learning background, you probably don't need any convincing that building language models is useful. If not, here's what language models are and why they are useful.
What is a model?
Generally speaking, a model (in the statistical sense, of course) is a mathematical representation of a process. Models are almost always an approximation of the process they describe. There are several reasons for this, but the two most important are:
1. We usually only observe the process a limited number of times
2. A fully faithful model can be exceptionally complex, so we simplify it
The statistician George Box famously said: "All models are wrong, but some are useful."
Here's what a model usually does: it describes how the modelled process creates data. In our case, the modelled phenomenon is human language. A language model gives us a way of generating human language, and such models are usually built out of probability distributions.
A model is built by observing some samples generated by the phenomenon to be modelled. In the same way, a language model is built by observing some text.
Let’s start building some models.
Bag Of Words
This is by far the simplest way of modelling human language. That doesn't mean it's useless or unpopular; quite the opposite. In fact, chances are that, being an avid reader of this blog, you have already created a Bag-of-Words (or BOW) model. Here's what you need to know about this model:
- It has an oversimplified view of the language
- It takes into account only the frequency of the words in the language, not their order or position
In a way, you created a Bag-of-Words model whenever you tried text classification or sentiment analysis: you take the words available in a text and keep count of how many times each one appears. Here's how to build such a model with NLTK:
from nltk.corpus import reuters
from collections import Counter

counts = Counter(reuters.words())
total_count = len(reuters.words())

# The most common 20 words are ...
print counts.most_common(n=20)
# [(u'.', 94687), (u',', 72360), (u'the', 58251), (u'of', 35979), (u'to', 34035), (u'in', 26478), (u'said', 25224), (u'and', 25043), (u'a', 23492), (u'mln', 18037), (u'vs', 14120), (u'-', 13705), (u'for', 12785), (u'dlrs', 11730), (u"'", 11272), (u'The', 10968), (u'000', 10277), (u'1', 9977), (u's', 9298), (u'pct', 9093)]

# Compute the frequencies
for word in counts:
    counts[word] /= float(total_count)

# The frequencies should add up to 1
print sum(counts.values())  # 1.0

import random

# Generate 100 words of language
text = []

for _ in range(100):
    r = random.random()
    accumulator = .0

    for word, freq in counts.iteritems():
        accumulator += freq

        if accumulator >= r:
            text.append(word)
            break

print ' '.join(text)
# tax been its and industrial and vote " decision rates elimination and 2 . base Ltd one merger half three division trading it to company before CES mln may to . . , and U is - exclusive affiliate - biggest its Association sides above two nearby NOTES 4TH prepared term areas growth said to each gold policy 0 PLOUGH kind economy director currencies requiring . ' loan growth , 83 . new The target Refining 114 STAKE the it on . to ; measure deposit Corp Emergency on 63 the reported the TREASURY state EC to Grosso as basius
As you can see, it's not the most expressive piece of content out there. The generated text follows only the word frequencies of the language and nothing more.
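To make the point about word order concrete, here's a quick sanity check on two made-up sentences of mine (they're not from the corpus): a Bag-of-Words model literally cannot tell them apart.

from collections import Counter

# Two "sentences" built from the same words in a different order
bow1 = Counter("the cat sat on the mat".split())
bow2 = Counter("the mat sat on the cat".split())

# Same counts, so the Bag-of-Words representations are identical
print bow1 == bow2  # True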
Now that we know the probabilities of all the words, we can compute the probability of a text. Because the words were generated independently, we just multiply their probabilities together: P(text) = P(word[1]) * P(word[2]) * ... * P(word[n]).
# The probability of a text
from operator import mul
print reduce(mul, [counts[w] for w in text], 1.0)
# 3.0290546883e-32
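A small practical aside that isn't in the snippet above: multiplying a hundred tiny probabilities already gives a number around 1e-32, and for longer texts the product will eventually underflow to 0.0. The usual trick is to work with log-probabilities instead; here's a minimal sketch:

import math

# Sum log-probabilities instead of multiplying raw probabilities
log_prob = sum(math.log(counts[w]) for w in text)
print log_prob  # The probability itself would be exp(log_prob)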
Bigrams and Trigrams
One idea that can help us generate better text is to make sure the new word we’re adding to the sequence goes well with the words already in the sequence. Checking if a word fits well after 10 words might be a bit overkill. We can simplify things to keep the problem reasonable. Let’s make sure the new word goes well after the last word in the sequence (bigram model) or the last two words (trigram model).
"Bigram" is a fancy name for two consecutive words, while a trigram is (you guessed it) a triplet of consecutive words. Here's some quick NLTK magic for extracting bigrams and trigrams:
from nltk.corpus import reuters
from nltk import bigrams, trigrams
from collections import Counter, defaultdict

first_sentence = reuters.sents()[0]
print first_sentence
# [u'ASIAN', u'EXPORTERS', u'FEAR', u'DAMAGE', u'FROM' ...

# Get the bigrams
print list(bigrams(first_sentence))
# [(u'ASIAN', u'EXPORTERS'), (u'EXPORTERS', u'FEAR'), (u'FEAR', u'DAMAGE'), (u'DAMAGE', u'FROM'), ...

# Get the padded bigrams
print list(bigrams(first_sentence, pad_left=True, pad_right=True))
# [(None, u'ASIAN'), (u'ASIAN', u'EXPORTERS'), (u'EXPORTERS', u'FEAR'), (u'FEAR', u'DAMAGE'), (u'DAMAGE', u'FROM'), ...

# Get the trigrams
print list(trigrams(first_sentence))
# [(u'ASIAN', u'EXPORTERS', u'FEAR'), (u'EXPORTERS', u'FEAR', u'DAMAGE'), (u'FEAR', u'DAMAGE', u'FROM'), ...

# Get the padded trigrams
print list(trigrams(first_sentence, pad_left=True, pad_right=True))
# [(None, None, u'ASIAN'), (None, u'ASIAN', u'EXPORTERS'), (u'ASIAN', u'EXPORTERS', u'FEAR'), (u'EXPORTERS', u'FEAR', u'DAMAGE'), (u'FEAR', u'DAMAGE', u'FROM') ...
We’re going to build a trigram model from the Reuters corpus. Building a bigram model is completely analogous and easier.
model = defaultdict(lambda: defaultdict(lambda: 0))

for sentence in reuters.sents():
    for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True):
        model[(w1, w2)][w3] += 1

print model["what", "the"]["economists"]  # "economists" follows "what the" 2 times
print model["what", "the"]["nonexistingword"]  # 0 times
print model[None, None]["The"]  # 8839 sentences start with "The"

# Let's transform the counts to probabilities
for w1_w2 in model:
    total_count = float(sum(model[w1_w2].values()))
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= total_count

print model["what", "the"]["economists"]  # 0.0434782608696
print model["what", "the"]["nonexistingword"]  # 0.0
print model[None, None]["The"]  # 0.161543241465
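Before generating anything, it's worth peeking at what the model has learned. Here's a quick query of mine (the exact output depends on your copy of the corpus, so I won't hard-code it) that lists the most likely continuations of the context "what the":

# The five most probable words to follow "what the", with their probabilities
print sorted(model["what", "the"].items(), key=lambda item: item[1], reverse=True)[:5]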
How easy was that? Now we have a trigram language model. Let's generate some text:
import random

text = [None, None]
sentence_finished = False

while not sentence_finished:
    r = random.random()
    accumulator = .0

    for word in model[tuple(text[-2:])].keys():
        accumulator += model[tuple(text[-2:])][word]

        if accumulator >= r:
            text.append(word)
            break

    if text[-2:] == [None, None]:
        sentence_finished = True

print ' '.join([t for t in text if t])
The output text is actually quite readable, and I had a lot of fun reading some of it. Here are a few samples:
# It has been approached by a group formed by Prime Minister Yasuhiro Nakasone that last year ' s spokeswoman said , noting the sharp rise in production to recover higher crude oil stocks dropped to post a long time since mid - 1960s ," the company reported a 448 mln dlr restructuring charge of 14 . 8 Soybeans 14 , 257 , 000 - 10 members .
# United Grain Corp of New York investment partnership that deals mainly in the International Court in Manhattan to increase West German growth is put at 423 , 000 vs profit 454 , 000 barrels per day mill located in Qinghai , Inner Mongolia and other major economies continue into the hands of another Conservative government agreed to buy from the previous year and next year from April 1 , 833 , 000 tons of lead .
# Net international reserves at the Wall Street that the proposal .
# 16 - MAR - 1987 17 : 17 : 02 . 76
# Diaz said the action affects 401 mln dlrs .
# Net is after deductions for mandatory preferred stock with a 6 . 4 mln vs 17 . 8 mln dlrs in disbursements this year , the Coffee Board of Trade .
# IRAN WARNS U . S . Treasury that ended on Saturday to close them since December 31 , 1987 , and & lt ; DIA > RAISES PRIME RATE RISE UNDER GREENSPAN
# Atlanta , Ga ., is aimed at stretching out repayments of mark bonds on the likely duration of firm world prices .
# The intervention took place in May , Sheikh Ali also delivered " a range of common stock for each colonial share , Tektronix said .
# The dividend will be manufactured in Greenville , Tenn ., and Vic Ferrara of Dallas , for the United States and a strong earthquake
The quality of the results is way better than with the bag-of-words model. What do you think?
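If you want to churn out many samples like the ones above in one go, the sampling loop can be wrapped in a small helper. This is just a sketch of mine; the generate_sentence name is not part of the original script:

import random

def generate_sentence(model):
    # Same sampling loop as before, packaged for reuse
    text = [None, None]
    sentence_finished = False

    while not sentence_finished:
        r = random.random()
        accumulator = .0

        for word, prob in model[tuple(text[-2:])].items():
            accumulator += prob

            if accumulator >= r:
                text.append(word)
                break

        if text[-2:] == [None, None]:
            sentence_finished = True

    return ' '.join([t for t in text if t])

# Print three fresh samples
for _ in range(3):
    print generate_sentence(model)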
The probability of a sequence is computed using conditional probabilities. The probability of word[i] given word[i-1] and word[i-2] is P(word[i] | word[i-1], word[i-2]), which in our case is simply model[(word[i-2], word[i-1])][word[i]]. The probability of the whole sequence is the product of these conditional probabilities.
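The same lookup lets us score any tokenized sentence, not just text we generated ourselves. Here's a small helper of my own (sentence_prob isn't part of the original code) that chains the conditional probabilities over a whole sentence:

from nltk import trigrams

def sentence_prob(model, sentence):
    # Multiply the probability of each word given the two words before it
    prob = 1.0
    for w1, w2, w3 in trigrams(sentence, pad_left=True, pad_right=True):
        prob *= model[(w1, w2)][w3]
    return prob

# A sentence from the training data gets a non-zero probability;
# any sentence containing an unseen trigram scores exactly 0.0
print sentence_prob(model, reuters.sents()[0])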
Let's add the probability computation to the generation script:
import random

text = [None, None]
prob = 1.0  # <- Init probability

sentence_finished = False

while not sentence_finished:
    r = random.random()
    accumulator = .0

    for word in model[tuple(text[-2:])].keys():
        accumulator += model[tuple(text[-2:])][word]

        if accumulator >= r:
            prob *= model[tuple(text[-2:])][word]  # <- Update the probability with the conditional probability of the new word
            text.append(word)
            break

    if text[-2:] == [None, None]:
        sentence_finished = True

print "Probability of text=", prob  # <- Print the probability of the text
print ' '.join([t for t in text if t])

# Probability of text= 4.69753034878e-48
# DOW CHEMICAL & lt ; SFE > IN ACQUISITION TALKS Comdata Network Inc said it sold the unit , leading to the group and this would not resist a half mln barrels to 247 . 0 pct , Ivory Coast is the lowest growth rate , he said .
Conclusions
- We’ve learned to build generative language models
- NLTK has some cool utils that come in handy
- Theoretically, the bigger the n-grams (the generalisation of bigrams and trigrams to n consecutive words), the more natural the generated language will be
- The bigger the n-grams we use, the bigger our models get (see the quick check after this list)
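To get a feel for that last point, you can count how many distinct contexts each model has to store a probability distribution for. This is a rough sketch of mine; I'm deliberately not quoting the numbers, so run it against your copy of the Reuters corpus:

from nltk.corpus import reuters
from nltk import bigrams, trigrams

# Count the distinct contexts each model keeps a distribution for
bigram_contexts = set()
trigram_contexts = set()

for sentence in reuters.sents():
    for w1, w2 in bigrams(sentence, pad_left=True, pad_right=True):
        bigram_contexts.add((w1,))
    for w1, w2, w3 in trigrams(sentence, pad_left=True, pad_right=True):
        trigram_contexts.add((w1, w2))

print "Distinct bigram contexts: ", len(bigram_contexts)
print "Distinct trigram contexts:", len(trigram_contexts)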