WordNet, getting your hands dirty

WordNet is a lexical database of English created at Princeton University. Its size and the way it is organized make it one of the most useful tools you can have in your NLP arsenal.

Here are a few properties that make WordNet so useful:

* Synonyms are grouped together into sets called synsets
* A synset contains lemmas, which are the base forms of words
* There are hierarchical links between synsets (is-a relations, also called hypernym/hyponym relations)
* Several other relations, such as antonyms or related words, are recorded for each lemma in the synset

Operations on synsets

Here are the most common operations on synsets:

Synsets, of course, have an associated part of speech, and you can filter WordNet queries by it:

You can also query for a very specific synset:

You can also compute how similar two synsets are:

Operations on lemmas

Lemmas in synsets are sorted by how often they appear (in a certain corpus used to create WordNet):

For a given lemma, you can query its antonyms:

Another cool feature allows you to find derivationally related forms for a lemma:

Lemmatization

A very useful feature of WordNet is the ability to lemmatize a word form, that is, to reduce it to its base, dictionary form.

Conclusions

  • We scratched the surface of how useful WordNet is
  • We have a method for finding synonyms, antonyms and related forms
  • We learned a method for lemmatizing a word, that is, reducing it to its base form
  • We know a way of computing how similar two words are