Compute sentence similarity using WordNet
It’s common in the world of Natural Language Processing to need to compute sentence similarity. WordNet is a great tool to keep in mind whenever you work with text, and it’s of real help for the task we’re tackling here.
Suppose we have these sentences:
* “Dogs are awesome.”
* “Some gorgeous creatures are felines.” (OK, maybe not the most common sentence structure, but bear with me)
* “Dolphins are swimming mammals.”
Say we want to know which of these sentences is closest to “Cats are beautiful animals.”
Properties of a similarity measure
Let’s think of a few qualities we’d expect from this similarity measure:
- Similarity(S1, S2) == Similarity(S2, S1). Symmetry is a must-have for any similarity measure.
- Given 3 sentences that are identical except for one word, the pair whose differing words are the most similar should be the most similar sentences. For example, “Cats are beautiful animals” should come out closer to “Dogs are beautiful animals” than to “Rocks are beautiful animals”.
- Similarity(S, S) == 1. A sentence is maximally similar to itself.
Observation number 2 raises the question: how do we know whether 2 words are similar? Fortunately, that’s an easy one, since WordNet has it covered.
from nltk.corpus import wordnet as wn

print wn.synsets('cat', 'n')
# [Synset('cat.n.01'), Synset('guy.n.01'), Synset('cat.n.03'), Synset('kat.n.01'), Synset("cat-o'-nine-tails.n.01"), Synset('caterpillar.n.02'), Synset('big_cat.n.01'), Synset('computerized_tomography.n.01')]

print wn.synsets('dog', 'n')
# [Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01')]

print wn.synsets('feline', 'n')
# [Synset('feline.n.01')]

print wn.synsets('mammal', 'n')
# [Synset('mammal.n.01')]

# It's important to note that the `Synsets` from such a query are ordered by
# how frequently that sense appears in the corpus.
# You can check how frequent a lemma is by doing:
cat = wn.synsets('cat', 'n')[0]  # Get the most common synset
print cat.lemmas()[0].count()    # Get the count of the first lemma => 18

dog = wn.synsets('dog', 'n')[0]        # Get the most common synset
feline = wn.synsets('feline', 'n')[0]  # Get the most common synset
mammal = wn.synsets('mammal', 'n')[0]  # Get the most common synset

# You can read more about the different types of WordNet similarity measures here:
# http://www.nltk.org/howto/wordnet.html
for synset in [dog, feline, mammal]:
    print "Similarity(%s, %s) = %s" % (cat, synset, cat.wup_similarity(synset))

# Similarity(Synset('cat.n.01'), Synset('dog.n.01')) = 0.2
# Similarity(Synset('cat.n.01'), Synset('feline.n.01')) = 0.5
# Similarity(Synset('cat.n.01'), Synset('mammal.n.01')) = 0.2
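The Wu-Palmer scores above are just one option; the sentence measure below uses path_similarity instead. A quick comparison of the two on the same synsets (the exact values may differ slightly across WordNet versions, so treat this as a sketch rather than canonical output):

# Compare two of WordNet's word similarity measures on the same pairs
for synset in [dog, feline, mammal]:
    print "path = %s, wup = %s" % (cat.path_similarity(synset), cat.wup_similarity(synset))

# path_similarity is 1 / (1 + length of the shortest hypernym path between the synsets),
# while wup_similarity also takes the depth of their common ancestor into account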
Implementing the similarity measure
Let’s now work towards implementing an algorithm that works for sentences. Some considerations:
- We should POS-tag the sentence, because we need to tell WordNet which part of speech we’re looking for (see the quick check below)
- Since WordNet only contains info on nouns, verbs, adjectives and adverbs, we’ll be ignoring everything else (a possible problem!)
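To make the first point concrete, here’s roughly what NLTK’s tagger gives us for the focus sentence (exact tags can vary with the tagger model):

from nltk import word_tokenize, pos_tag

print pos_tag(word_tokenize("Cats are beautiful animals."))
# Something like: [('Cats', 'NNS'), ('are', 'VBP'), ('beautiful', 'JJ'), ('animals', 'NNS'), ('.', '.')]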
This algorithm is proposed by Mihalcea et al. in the paper “Corpus-based and Knowledge-based Measures of Text Semantic Similarity” (https://www.aaai.org/Papers/AAAI/2006/AAAI06-123.pdf)
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet as wn

def penn_to_wn(tag):
    """ Convert a Penn Treebank tag to a simplified WordNet tag """
    if tag.startswith('N'):
        return 'n'
    if tag.startswith('V'):
        return 'v'
    if tag.startswith('J'):
        return 'a'
    if tag.startswith('R'):
        return 'r'
    return None

def tagged_to_synset(word, tag):
    """ Get the most common synset for a (word, Penn tag) pair, or None """
    wn_tag = penn_to_wn(tag)
    if wn_tag is None:
        return None
    try:
        return wn.synsets(word, wn_tag)[0]
    except IndexError:
        return None

def sentence_similarity(sentence1, sentence2):
    """ Compute the sentence similarity using WordNet """
    # Tokenize and tag
    sentence1 = pos_tag(word_tokenize(sentence1))
    sentence2 = pos_tag(word_tokenize(sentence2))

    # Get the synsets for the tagged words
    synsets1 = [tagged_to_synset(*tagged_word) for tagged_word in sentence1]
    synsets2 = [tagged_to_synset(*tagged_word) for tagged_word in sentence2]

    # Filter out the Nones
    synsets1 = [ss for ss in synsets1 if ss]
    synsets2 = [ss for ss in synsets2 if ss]

    score, count = 0.0, 0

    # For each word in the first sentence
    for synset in synsets1:
        # Get the similarity value of the most similar word in the other sentence.
        # Note: path_similarity can return None; Python 2's max() tolerates that,
        # but Python 3 raises a TypeError here (see the comments below for a fix)
        best_score = max([synset.path_similarity(ss) for ss in synsets2])

        # Check that the similarity could have been computed
        if best_score is not None:
            score += best_score
            count += 1

    # Average the values
    score /= count
    return score

sentences = [
    "Dogs are awesome.",
    "Some gorgeous creatures are felines.",
    "Dolphins are swimming mammals.",
    "Cats are beautiful animals.",
]

focus_sentence = "Cats are beautiful animals."

for sentence in sentences:
    print "Similarity(\"%s\", \"%s\") = %s" % (focus_sentence, sentence, sentence_similarity(focus_sentence, sentence))
    print "Similarity(\"%s\", \"%s\") = %s" % (sentence, focus_sentence, sentence_similarity(sentence, focus_sentence))
    print

# Similarity("Cats are beautiful animals.", "Dogs are awesome.") = 0.511111111111
# Similarity("Dogs are awesome.", "Cats are beautiful animals.") = 0.666666666667
# Similarity("Cats are beautiful animals.", "Some gorgeous creatures are felines.") = 0.833333333333
# Similarity("Some gorgeous creatures are felines.", "Cats are beautiful animals.") = 0.833333333333
# Similarity("Cats are beautiful animals.", "Dolphins are swimming mammals.") = 0.483333333333
# Similarity("Dolphins are swimming mammals.", "Cats are beautiful animals.") = 0.4
# Similarity("Cats are beautiful animals.", "Cats are beautiful animals.") = 1.0
# Similarity("Cats are beautiful animals.", "Cats are beautiful animals.") = 1.0
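If this is your first time using NLTK, you’ll probably need to download the models and corpora it relies on (these are the resource names in current NLTK releases):

import nltk

nltk.download('punkt')                       # tokenizer models
nltk.download('averaged_perceptron_tagger')  # POS tagger model
nltk.download('wordnet')                     # the WordNet corpus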
Building a symmetric similarity function
The sentence similarity measure behaves pretty well, but we have a problem: it’s not a symmetric function. There’s a simple trick to fix that, though:
def symmetric_sentence_similarity(sentence1, sentence2):
    """ Compute the symmetric sentence similarity using WordNet """
    return (sentence_similarity(sentence1, sentence2) +
            sentence_similarity(sentence2, sentence1)) / 2

for sentence in sentences:
    print "SymmetricSimilarity(\"%s\", \"%s\") = %s" % (
        focus_sentence, sentence, symmetric_sentence_similarity(focus_sentence, sentence))
    print "SymmetricSimilarity(\"%s\", \"%s\") = %s" % (
        sentence, focus_sentence, symmetric_sentence_similarity(sentence, focus_sentence))
    print

# SymmetricSimilarity("Cats are beautiful animals.", "Dogs are awesome.") = 0.588888888889
# SymmetricSimilarity("Dogs are awesome.", "Cats are beautiful animals.") = 0.588888888889
# SymmetricSimilarity("Cats are beautiful animals.", "Some gorgeous creatures are felines.") = 0.833333333333
# SymmetricSimilarity("Some gorgeous creatures are felines.", "Cats are beautiful animals.") = 0.833333333333
# SymmetricSimilarity("Cats are beautiful animals.", "Dolphins are swimming mammals.") = 0.441666666667
# SymmetricSimilarity("Dolphins are swimming mammals.", "Cats are beautiful animals.") = 0.441666666667
# SymmetricSimilarity("Cats are beautiful animals.", "Cats are beautiful animals.") = 1.0
# SymmetricSimilarity("Cats are beautiful animals.", "Cats are beautiful animals.") = 1.0
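With the symmetric wrapper in place, we can sanity-check the properties we listed at the start. A minimal check (Python 2, as in the rest of the post), using the sentences and functions defined above:

EPS = 1e-9  # tolerance for floating-point comparison

for s1 in sentences:
    for s2 in sentences:
        # Property 1: the measure is symmetric
        assert abs(symmetric_sentence_similarity(s1, s2) -
                   symmetric_sentence_similarity(s2, s1)) < EPS
    # Property 3: a sentence is maximally similar to itself
    assert abs(symmetric_sentence_similarity(s1, s1) - 1.0) < EPS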
There are a few things wrong with this approach, like:
- It’s not the same as in the original paper, since the best-match similarity is not weighted by inverse document frequency (a sketch of that weighting appears after the example below)
- WordNet has some issues computing the similarity between adjectives and adverbs
- Some WordNet similarity measures misbehave
Here’s a taste of the misbehaving scores mentioned in the last two points:

print wn.synset('be.v.01').wup_similarity(wn.synset('be.v.01'))
# 0.5 (!!!) - a synset should be maximally similar to itself

print wn.synset('gorgeous.a.01').wup_similarity(wn.synset('amazing.a.01'))
# None (!!!)

All in all, in practice, this method yields acceptable results.
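For completeness, here’s a minimal sketch of the idf weighting from the first point above. The idf dict is hypothetical: you’d precompute inverse-document-frequency values for lowercased words from a corpus of your own, and as in the post you’d average the two directions to keep the measure symmetric:

def weighted_sentence_similarity(sentence1, sentence2, idf, default_idf=1.0):
    """ Sketch of one direction of Mihalcea et al.'s equation (1) """
    # idf: hypothetical precomputed dict of {lowercased word: idf value}
    tagged1 = pos_tag(word_tokenize(sentence1))
    tagged2 = pos_tag(word_tokenize(sentence2))

    # Keep (word, synset) pairs so each word's idf weight can be looked up
    pairs1 = [(w, tagged_to_synset(w, t)) for w, t in tagged1]
    pairs2 = [(w, tagged_to_synset(w, t)) for w, t in tagged2]
    pairs1 = [(w, ss) for w, ss in pairs1 if ss]
    pairs2 = [(w, ss) for w, ss in pairs2 if ss]

    score, norm = 0.0, 0.0
    for word, synset in pairs1:
        # Best match in the other sentence, ignoring pairs with no path
        sims = [synset.path_similarity(ss) for _, ss in pairs2]
        sims = [s for s in sims if s is not None]
        if sims:
            weight = idf.get(word.lower(), default_idf)
            score += max(sims) * weight
            norm += weight
    return score / norm if norm else 0.0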
Conclusions
- We’ve built a symmetric sentence similarity measure.
- There are several issues with how WordNet computes word similarity.
- Although the method has a lot of drawbacks, it performs fairly well.
I got this error:
best_score = max([synset.path_similarity(ss) for ss in synsets2])
TypeError: '>' not supported between instances of 'NoneType' and 'float'
How can I resolve this?
path_similarity sometimes returns None because there’s no path between the 2 synsets. Either filter out those values or try a different similarity measure: http://www.nltk.org/howto/wordnet.html

Thanks for the reply. But when I try to improve the above code, I don’t seem to get the same results as the author:
arr_simi_score = []
for syn1 in synsets1:
    for syn2 in synsets2:
        simi_score = syn1.wup_similarity(syn2)
        if simi_score is not None:
            arr_simi_score.append(simi_score)
    best = max(arr_simi_score)
    score += best
    count += 1
The result is:

Similarity("Cats are beautiful animals", "Dogs are awesome") = 0.8616071428571428
Similarity("Cats are beautiful animals", "Some gorgeous creatures are felines") = 0.9722222222222222
Similarity("Cats are beautiful animals", "Dolphins are swimming mammals") = 0.8333333333333334
Similarity("Cats are beautiful animals", "Cats are beautiful animals") = 1.0
We’re not computing quite the same thing: your snippet uses wup_similarity while the post uses path_similarity, and arr_simi_score is never reset between words, so earlier scores leak into later max() calls. A different WordNet version could also shift the numbers. Your results make sense though.
How to do this in Java?
I have fixed the NoneType error:
for syn1 in synsets1:
    arr_simi_score = []
    print('=========================================')
    print(syn1)
    print('----------------')
    for syn2 in synsets2:
        print(syn2)
        simi_score = syn1.path_similarity(syn2)
        print(simi_score)
        if simi_score is not None:
            arr_simi_score.append(simi_score)
    print('----------------')
    print(arr_simi_score)
    if len(arr_simi_score) > 0:
        best = max(arr_simi_score)
        print(best)
        score += best
        count += 1

# Average the values
print('score: ', score)
print('count: ', count)
score /= count
score = 0.0
count = 0
best_score = [0.0]

for ss1 in synsets1:
    for ss2 in synsets2:
        best1_score = ss1.path_similarity(ss2)
        if best1_score is not None:
            best_score.append(best1_score)
    max1 = max(best_score)
    score += max1
    if max1 != 0.0:  # 'is not 0.0' compares identity, not value
        count += 1
    best_score = [0.0]  # reset for the next word

print(score / count)
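Both fixes above boil down to the same thing. Here’s a condensed version of the inner loop from the post, with the None filtering and an empty-list guard (Python 3, matching the snippets in this thread):

score, count = 0.0, 0

for synset in synsets1:
    # Keep only the similarities that could actually be computed
    sims = [synset.path_similarity(ss) for ss in synsets2]
    sims = [s for s in sims if s is not None]
    if sims:
        score += max(sims)
        count += 1

if count:
    score /= count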
How do we create a matrix after calculating the similarities between all the sentences we have? I’m looking to pass it to a clustering algorithm.
Should be pretty straightforward (assuming A is an N x N matrix allocated up front):

for i in range(len(sentences)):
    for j in range(len(sentences)):
        A[i][j] = similarity(sentences[i], sentences[j])
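A slightly fuller sketch using numpy and the symmetric_sentence_similarity function from the post; since the measure is symmetric, each pair only needs to be computed once, and many clustering libraries want distances rather than similarities:

import numpy as np

n = len(sentences)
A = np.zeros((n, n))

for i in range(n):
    for j in range(i, n):
        # Symmetric measure, so fill both halves at once
        A[i][j] = A[j][i] = symmetric_sentence_similarity(sentences[i], sentences[j])

# Convert similarities to distances for clustering algorithms that expect them
D = 1.0 - A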
Hi there, you mentioned you used the paper for the similarity metric, but the paper describes five different types. Which one did you base your code on? I’m looking for the mathematical equation 🙂 thanks.
It’s the first equation in the paper, equation (1).
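For reference, equation (1) of Mihalcea et al. (2006) combines the best word-to-word scores in both directions, weighted by idf. In LaTeX form:

\mathrm{sim}(T_1, T_2) = \frac{1}{2} \left( \frac{\sum_{w \in T_1} \mathrm{maxSim}(w, T_2) \cdot \mathrm{idf}(w)}{\sum_{w \in T_1} \mathrm{idf}(w)} + \frac{\sum_{w \in T_2} \mathrm{maxSim}(w, T_1) \cdot \mathrm{idf}(w)}{\sum_{w \in T_2} \mathrm{idf}(w)} \right)

The code in the post drops the idf weighting (effectively idf(w) = 1 for every word), which is the first caveat listed in the article above.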