Wordnet, getting your hands dirty
Wordnet is a lexical database created at Princeton University. Its size and several properties it holds make Wordnet one of the most useful tools you can have in your NLP arsenal.
Here are a few properties that make Wordnet so useful:
* Synonyms are grouped together in a structure called a synset
* A synset contains lemmas, which are the base forms of words
* There are hierarchical links between synsets (is-a relations, also called hypernym/hyponym relations)
* Several other properties such as antonyms or related words are included for each lemma in the synset
Operations on Synsets
Here are the most common operations on Synsets:
```python
from nltk.corpus import wordnet as wn

car_synsets = wn.synsets('car')
print(car_synsets)
# [Synset('car.n.01'), Synset('car.n.02'), Synset('car.n.03'), Synset('car.n.04'), Synset('cable_car.n.01')]

for car in car_synsets:
    print("lemmas: ", car.lemmas())
    print("definition: ", car.definition())
    print("hypernyms:", car.hypernyms())
    print("hyponyms:", car.hyponyms())
    print('-' * 40, '\n\n')

# lemmas: [Lemma('car.n.01.car'), Lemma('car.n.01.auto'), Lemma('car.n.01.automobile'), Lemma('car.n.01.machine'), Lemma('car.n.01.motorcar')]
# definition: a motor vehicle with four wheels; usually propelled by an internal combustion engine
# hypernyms: [Synset('motor_vehicle.n.01')]
# hyponyms: [Synset('ambulance.n.01'), Synset('beach_wagon.n.01'), Synset('bus.n.04'), Synset('cab.n.03'), Synset('compact.n.03'), Synset('convertible.n.01'), Synset('coupe.n.01'), Synset('cruiser.n.01'), Synset('electric.n.01'), Synset('gas_guzzler.n.01'), Synset('hardtop.n.01'), Synset('hatchback.n.01'), Synset('horseless_carriage.n.01'), Synset('hot_rod.n.01'), Synset('jeep.n.01'), Synset('limousine.n.01'), Synset('loaner.n.02'), Synset('minicar.n.01'), Synset('minivan.n.01'), Synset('model_t.n.01'), Synset('pace_car.n.01'), Synset('racer.n.02'), Synset('roadster.n.01'), Synset('sedan.n.01'), Synset('sport_utility.n.01'), Synset('sports_car.n.01'), Synset('stanley_steamer.n.01'), Synset('stock_car.n.01'), Synset('subcompact.n.01'), Synset('touring_car.n.01'), Synset('used-car.n.01')]
# ----------------------------------------

# lemmas: [Lemma('car.n.02.car'), Lemma('car.n.02.railcar'), Lemma('car.n.02.railway_car'), Lemma('car.n.02.railroad_car')]
# definition: a wheeled vehicle adapted to the rails of railroad
# hypernyms: [Synset('wheeled_vehicle.n.01')]
# hyponyms: [Synset('baggage_car.n.01'), Synset('cabin_car.n.01'), Synset('club_car.n.01'), Synset('freight_car.n.01'), Synset('guard's_van.n.01'), Synset('handcar.n.01'), Synset('mail_car.n.01'), Synset('passenger_car.n.01'), Synset('slip_coach.n.01'), Synset('tender.n.04'), Synset('van.n.03')]
# ----------------------------------------

# lemmas: [Lemma('car.n.03.car'), Lemma('car.n.03.gondola')]
# definition: the compartment that is suspended from an airship and that carries personnel and the cargo and the power plant
# hypernyms: [Synset('compartment.n.02')]
# hyponyms: []
# ----------------------------------------

# lemmas: [Lemma('car.n.04.car'), Lemma('car.n.04.elevator_car')]
# definition: where passengers ride up and down
# hypernyms: [Synset('compartment.n.02')]
# hyponyms: []
# ----------------------------------------

# lemmas: [Lemma('cable_car.n.01.cable_car'), Lemma('cable_car.n.01.car')]
# definition: a conveyance for passengers or freight on a cable railway
# hypernyms: [Synset('compartment.n.02')]
# hyponyms: []
# ----------------------------------------
```
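To make the hypernym/hyponym links more concrete, here is a minimal sketch that does not use NLTK at all. It models a tiny is-a hierarchy as a plain dictionary and walks it upwards. The `HYPERNYM` table and the two helper functions are hypothetical and hand-written for illustration; the names imitate Wordnet's naming style, but the links are simplified and are not real Wordnet data.

```python
# Hypothetical, simplified is-a links: each name maps to its direct hypernym.
# None marks the top of this toy hierarchy. This is NOT real Wordnet data.
HYPERNYM = {
    "car.n.01": "motor_vehicle.n.01",
    "motor_vehicle.n.01": "wheeled_vehicle.n.01",
    "car.n.02": "wheeled_vehicle.n.01",
    "wheeled_vehicle.n.01": "vehicle.n.01",
    "vehicle.n.01": None,
}


def hypernym_chain(synset):
    """Follow the hypernym links from `synset` up to the top of the toy hierarchy."""
    chain = [synset]
    while HYPERNYM[chain[-1]] is not None:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def hyponyms(synset):
    """Return the direct hyponyms (children) of `synset`, sorted by name."""
    return sorted(child for child, parent in HYPERNYM.items() if parent == synset)


print(hypernym_chain("car.n.01"))
print(hyponyms("wheeled_vehicle.n.01"))
```

In this toy setup both senses of "car" end up under the same ancestor, which is the kind of structure the `hypernyms()` and `hyponyms()` calls above let you explore in the real database.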
Synsets, of course, have an associated part of speech, and you can filter your Wordnet queries by it:
```python
fight_all = wn.synsets('fight')
print(fight_all)
# [Synset('battle.n.01'), Synset('fight.n.02'), Synset('competitiveness.n.01'), Synset('fight.n.04'), Synset('fight.n.05'),
# Synset('contend.v.06'), Synset('fight.v.02'), Synset('fight.v.03'), Synset('crusade.v.01')]

fight_verb = wn.synsets('fight', 'v')
print(fight_verb)
# [Synset('contend.v.06'), Synset('fight.v.02'), Synset('fight.v.03'), Synset('crusade.v.01')]

fight_noun = wn.synsets('fight', 'n')
print(fight_noun)
# [Synset('battle.n.01'), Synset('fight.n.02'), Synset('competitiveness.n.01'), Synset('fight.n.04'), Synset('fight.n.05')]

print(fight_noun[0].pos())
# n
```
You can also query for a very specific synset:
```python
walk = wn.synset('walk.v.01')
print(walk)
# Synset('walk.v.01')
```
You can also compute how similar two synsets are:
```python
walk = wn.synset('walk.v.01')
run = wn.synset('run.v.01')
stand = wn.synset('stand.v.01')

print(run.path_similarity(walk))   # 0.25
print(run.path_similarity(stand))  # 0.14285714285714285
```
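Path similarity is based on the shortest path between two synsets in the is-a hierarchy: the score is 1 / (path length + 1), so identical synsets score 1.0 and distant ones approach 0. The sketch below reimplements that idea on a toy graph using only the standard library. The `IS_A` table, the node names and the helper functions are all hypothetical; the links were arranged by hand so that the distances give the same scores as above, and they do not reflect Wordnet's actual verb hierarchy.

```python
from collections import deque

# Hypothetical, hand-written is-a links (child -> parent). NOT real Wordnet data.
IS_A = {
    "walk": "travel",
    "run": "travel_rapidly",
    "travel_rapidly": "travel",
    "travel": "root",
    "stand": "rest",
    "rest": "be",
    "be": "root",
}


def neighbours(node):
    """Nodes one is-a link away from `node`, in either direction."""
    linked = {child for child, parent in IS_A.items() if parent == node}
    if node in IS_A:
        linked.add(IS_A[node])
    return linked


def shortest_path_length(start, goal):
    """Breadth-first search for the number of links between two nodes."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbours(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two nodes are not connected


def path_similarity(a, b):
    """1 / (shortest path length + 1), or None if no path exists."""
    dist = shortest_path_length(a, b)
    return None if dist is None else 1.0 / (dist + 1)


print(path_similarity("run", "walk"))   # 3 links apart
print(path_similarity("run", "stand"))  # 6 links apart
```

The takeaway is that the score depends only on how many links separate the two synsets, which is why `run` comes out closer to `walk` than to `stand`.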
Operations on lemmas
Lemmas in synsets are sorted by how often they appear (in a certain corpus used to create Wordnet):
```python
talk = wn.synset('talk.v.01')
print(talk.lemmas())
# [Lemma('talk.v.01.talk'), Lemma('talk.v.01.speak')]

# Lemmas in the synset are sorted by count. The most common lemmas are first
print([lemma.count() for lemma in talk.lemmas()])
# [108, 53]
```
For a certain lemma, you can query for the antonyms:
```python
able = wn.synset('able.a.01')
print(able)
# Synset('able.a.01')

print(able.lemmas())
# [Lemma('able.a.01.able')]

print(able.lemmas()[0].antonyms())
# [Lemma('unable.a.01.unable')]
```
Another cool feature allows you to find derivationally related forms for a lemma:
```python
print(able.lemmas()[0].derivationally_related_forms())
# [Lemma('ability.n.02.ability'), Lemma('ability.n.01.ability')]
```
Lemmatization
A very useful feature of Wordnet is the ability to lemmatize a word form, that is, to reduce it to its base (dictionary) form.
```python
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
print(wnl.lemmatize('running', wn.VERB))
# run

# You need to specify the POS tag. The default is 'noun'. This might confuse you:
print(wnl.lemmatize('running'))
# running

# A few more examples
print(wnl.lemmatize('better', wn.ADJ))  # good
print(wnl.lemmatize('oxen', wn.NOUN))   # ox
print(wnl.lemmatize('geese', wn.NOUN))  # goose
```
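Because the lemmatizer needs a part of speech, a common pattern is to take the tags produced by a POS tagger and translate them into Wordnet's single-letter codes first. Here is a minimal sketch of such a translation, assuming the tagger emits Penn Treebank tags (such as `VBG` or `NNS`). The function name `penn_to_wordnet` is a hypothetical helper and not part of NLTK; the four codes are the values NLTK exposes as `wn.NOUN`, `wn.VERB`, `wn.ADJ` and `wn.ADV`.

```python
# Single-letter POS codes, matching wn.NOUN, wn.VERB, wn.ADJ and wn.ADV in NLTK.
NOUN, VERB, ADJ, ADV = "n", "v", "a", "r"


def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a Wordnet POS code.

    Hypothetical helper for illustration; not part of NLTK.
    """
    if tag.startswith("JJ"):   # JJ, JJR, JJS
        return ADJ
    if tag.startswith("VB"):   # VB, VBD, VBG, VBN, VBP, VBZ
        return VERB
    if tag.startswith("RB"):   # RB, RBR, RBS
        return ADV
    # Nouns and every other tag fall back to noun, the lemmatizer's default.
    return NOUN


print(penn_to_wordnet("VBG"))
print(penn_to_wordnet("NNS"))
```

With this in place, a call such as `wnl.lemmatize('running', penn_to_wordnet('VBG'))` should behave like the `wn.VERB` call above. Falling back to noun for unknown tags is a design choice that mirrors the lemmatizer's own default.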
Conclusions
- We scratched the surface of how useful Wordnet is
- We have a method for finding synonyms, antonyms and related forms
- We learned a method for lemmatizing a word, meaning bringing it to its base form
- We know a way of computing how similar two synsets are