Compute sentence similarity using Wordnet

It’s common in the world of Natural Language Processing to need to compute sentence similarity. Wordnet is an awesome tool and you should always keep it in mind when working with text. It’s of great help for the task we’re trying to tackle.

Suppose we have these sentences:

* “Dogs are awesome.”
* “Some gorgeous creatures are felines.” (Ok, maybe not the most common sentence structure, but bear with me)
* “Dolphins are swimming mammals.”

Say we want to know which of these sentences is closest to “Cats are beautiful animals.”

Properties of a similarity measure

Let’s think of a few qualities we’d expect from this similarity measure:

  1. Similarity(S1, S2) == Similarity(S2, S1). Symmetry is a must-have for any similarity measure.
  2. Given three sentences that are identical except for one word, the two sentences whose differing words are most similar should be the most similar pair.
  3. Similarity(S, S) == 1

Observation number 2 raises the question: how do we know which two words are more similar? Fortunately, that’s an easy question to answer, since Wordnet has that issue covered.
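As a rough illustration of word-level similarity, here is a minimal sketch built on NLTK's Wordnet interface. The function name `wordnet_word_sim` is hypothetical, and two choices in it are assumptions rather than anything Wordnet prescribes: only words with the same part of speech are compared, and only the first listed sense of each word is used. It needs the `nltk` package and its `wordnet` corpus to be installed.

```python
def wordnet_word_sim(word1, pos1, word2, pos2):
    """Path similarity of the first senses of two words, or None.

    pos1 and pos2 are WordNet POS letters: 'n', 'v', 'a' or 'r'.
    """
    # Assumption: only compare words that share a part of speech.
    if pos1 != pos2:
        return None

    # Imported lazily: requires nltk plus its 'wordnet' corpus
    # (for example installed with nltk.download('wordnet')).
    from nltk.corpus import wordnet as wn

    synsets1 = wn.synsets(word1, pos1)
    synsets2 = wn.synsets(word2, pos2)
    if not synsets1 or not synsets2:
        return None  # at least one word is unknown to WordNet

    # Assumption: use the first listed sense of each word.
    # path_similarity itself may return None when no path exists.
    return synsets1[0].path_similarity(synsets2[0])
```

Calling `wordnet_word_sim("dog", "n", "cat", "n")` returns a score between 0 and 1, where higher means more similar, or `None` when no score can be computed.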

Implementing the similarity measure

Let’s now work towards implementing an algorithm that works for sentences. Some considerations:

  1. We should POS-tag the sentence, because we need to tell Wordnet which part of speech (POS) we’re looking for.
  2. Since Wordnet only covers nouns, verbs, adjectives and adverbs, we’ll ignore every other word (possible problem!)
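The two considerations above could be handled roughly as follows. This is a sketch that assumes the sentence has already been tagged with Penn Treebank tags (for instance by `nltk.pos_tag(nltk.word_tokenize(sentence))`); the helper names `penn_to_wn` and `content_words` are made up for this example.

```python
def penn_to_wn(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if unsupported."""
    if tag.startswith("NN"):
        return "n"  # noun
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("RB"):
        return "r"  # adverb
    return None     # determiners, prepositions, punctuation, ...


def content_words(tagged_sentence):
    """Keep only (lowercased word, WordNet POS) pairs that WordNet can handle."""
    result = []
    for word, tag in tagged_sentence:
        wn_pos = penn_to_wn(tag)
        if wn_pos is not None:
            result.append((word.lower(), wn_pos))
    return result
```

For a tagged input such as `[("Cats", "NNS"), ("are", "VBP"), ("beautiful", "JJ"), ("animals", "NNS"), (".", ".")]`, the final period is dropped and the four remaining words are kept with their Wordnet POS letters.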

This algorithm was proposed by Mihalcea et al. in the paper “Corpus-based and Knowledge-based Measures of Text Semantic Similarity” (https://www.aaai.org/Papers/AAAI/2006/AAAI06-123.pdf).
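A simplified, unweighted version of the idea could look like the sketch below: for every word in the first sentence, take its best match in the second sentence, then average those best scores. The word-level scorer is passed in as `word_sim`, a hypothetical callable that returns a number or `None`; in practice a Wordnet-backed measure would be plugged in. The toy scores here are invented purely so the example runs without any external data.

```python
def sentence_similarity(words1, words2, word_sim):
    """Directional score: average best match in words2 for each word of words1."""
    total, count = 0.0, 0
    for w1 in words1:
        # Score w1 against every word of the other sentence, skipping unknowns.
        scores = [word_sim(w1, w2) for w2 in words2]
        scores = [s for s in scores if s is not None]
        if scores:
            total += max(scores)
            count += 1
    # Words with no usable score are left out of the average.
    return total / count if count else 0.0


# Invented scores, for illustration only (not taken from WordNet).
TOY_SCORES = {("cat", "animal"): 0.25, ("cat", "feline"): 0.5}


def toy_word_sim(a, b):
    """Toy scorer: 1.0 for identical words, a table lookup otherwise, else None."""
    if a == b:
        return 1.0
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a)))
```

With these toy scores, `sentence_similarity(["cat", "animal"], ["cat"], toy_word_sim)` gives 0.625, while swapping the two arguments gives 1.0, so the score depends on the direction of the comparison.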

Building a symmetric similarity function

The sentence similarity measure behaves pretty well, but we have a problem: it’s not a symmetrical function. We can work around this with a simple trick, which is to average the scores computed in both directions.
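One way to make the measure symmetric is to compute the directional score both ways and take the mean. The sketch below is self-contained and uses hypothetical names (`directional_similarity`, `symmetric_similarity`) and an invented toy scorer; a Wordnet-backed scorer would replace `toy_word_sim` in real use.

```python
def directional_similarity(words1, words2, word_sim):
    """Average, over words1, of each word's best score against words2."""
    best_scores = []
    for w1 in words1:
        scores = [s for s in (word_sim(w1, w2) for w2 in words2) if s is not None]
        if scores:
            best_scores.append(max(scores))
    return sum(best_scores) / len(best_scores) if best_scores else 0.0


def symmetric_similarity(words1, words2, word_sim):
    """Mean of the two directional scores, so argument order no longer matters."""
    forward = directional_similarity(words1, words2, word_sim)
    backward = directional_similarity(words2, words1, word_sim)
    return (forward + backward) / 2


# Invented score, for illustration only (not taken from WordNet).
TOY_SCORES = {frozenset(("cat", "animal")): 0.25}


def toy_word_sim(a, b):
    """Toy scorer: 1.0 for identical words, a table lookup otherwise, else None."""
    if a == b:
        return 1.0
    return TOY_SCORES.get(frozenset((a, b)))
```

With the toy scorer, `symmetric_similarity(["cat", "animal"], ["cat"], toy_word_sim)` and the call with the arguments swapped both return 0.8125, and comparing a sentence with itself returns 1.0.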

This approach has several shortcomings:

  • It’s not the same as in the original paper, since the maximum similarity of each word is not weighted by its inverse document frequency (IDF)
  • Wordnet has some issues with computing the similarity between adjectives and adverbs
  • Some Wordnet similarity measures misbehave

All in all, in practice, this method yields acceptable results.

Conclusions

  • We’ve built a symmetric sentence similarity measure.
  • There are several issues with how Wordnet computes word similarity.
  • Although the method has a lot of drawbacks, it performs fairly well.