Complete guide to build your own Named Entity Recognizer with Python
Updates
- 29-Apr-2018 – Added Gist for the entire code
NER, short for Named Entity Recognition, is probably the first step towards information extraction from unstructured text. It means extracting the real-world entities mentioned in the text (Person, Organization, Event, etc.).
Why do you need this information? You might want to map it against a knowledge base to understand what the sentence is about, or you might want to extract relationships between different named entities (like who works where, when an event takes place, etc.).
NLTK NER Chunker
NLTK has a standard NE annotator so that we can get started pretty quickly.
```python
from nltk import word_tokenize, pos_tag, ne_chunk

sentence = "Mark and John are working at Google."
print ne_chunk(pos_tag(word_tokenize(sentence)))

"""
(S
  (PERSON Mark/NNP)
  and/CC
  (PERSON John/NNP)
  are/VBP
  working/VBG
  at/IN
  (ORGANIZATION Google/NNP)
  ./.)
"""
```
`ne_chunk` needs part-of-speech annotations to add NE labels to the sentence. The output of `ne_chunk` is an `nltk.Tree` object.

The `ne_chunk` function acts as a chunker, meaning it produces 2-level trees:
- Nodes on Level-1: Outside any chunk
- Nodes on Level-2: Inside a chunk – The label of the chunk is denoted by the label of the subtree
In this example, `Mark/NNP` is a level-2 leaf, part of a `PERSON` chunk, while `and/CC` is a level-1 leaf, meaning it's not part of any chunk.
IOB tagging
`nltk.Tree` is great for processing such information in Python, but it's not the standard way of annotating chunks. This could probably be an article of its own, but we'll cover it here really quickly.
The IOB Tagging system contains tags of the form:
- B-{CHUNK_TYPE} – for the word at the Beginning of a chunk
- I-{CHUNK_TYPE} – for words Inside the chunk
- O – Outside any chunk
A sometimes-used variation of IOB tagging simply merges the B and I tags:
- {CHUNK_TYPE} – for words inside the chunk
- O – Outside any chunk
We usually want to work with the proper IOB format.
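As a quick illustration, here's a minimal, pure-Python sketch (no NLTK needed; the function name is ours) of turning the merged variant back into proper IOB tags: a chunk tag gets a B- prefix if it starts a chunk and an I- prefix if it continues one.

```python
def merged_to_iob(tags):
    """Convert merged chunk tags to proper IOB notation."""
    iob = []
    for i, tag in enumerate(tags):
        if tag == 'O':
            iob.append('O')                 # outside any chunk
        elif i > 0 and tags[i - 1] == tag:
            iob.append('I-' + tag)          # continues the previous chunk
        else:
            iob.append('B-' + tag)          # starts a new chunk
    return iob

print(merged_to_iob(['PERSON', 'O', 'PERSON', 'O', 'O', 'O', 'ORGANIZATION', 'O']))
# ['B-PERSON', 'O', 'B-PERSON', 'O', 'O', 'O', 'B-ORGANIZATION', 'O']
```

Note that two adjacent chunks of the same type can't be distinguished in the merged variant, which is exactly why the proper IOB format is preferred.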
Here's how to convert between `nltk.Tree` and the IOB format:
```python
from nltk.chunk import conlltags2tree, tree2conlltags

sentence = "Mark and John are working at Google."
ne_tree = ne_chunk(pos_tag(word_tokenize(sentence)))

iob_tagged = tree2conlltags(ne_tree)
print iob_tagged
"""
[('Mark', 'NNP', u'B-PERSON'), ('and', 'CC', u'O'), ('John', 'NNP', u'B-PERSON'), ('are', 'VBP', u'O'), ('working', 'VBG', u'O'), ('at', 'IN', u'O'), ('Google', 'NNP', u'B-ORGANIZATION'), ('.', '.', u'O')]
"""

ne_tree = conlltags2tree(iob_tagged)
print ne_tree
"""
(S
  (PERSON Mark/NNP)
  and/CC
  (PERSON John/NNP)
  are/VBP
  working/VBG
  at/IN
  (ORGANIZATION Google/NNP)
  ./.)
"""
```
GMB corpus
NLTK doesn't have a proper English corpus for NER. It does ship the CoNLL 2002 Named Entity corpus, but that one covers only Spanish and Dutch. You can definitely try the method presented here on that corpus; in fact, doing so would be easier because NLTK provides a good corpus reader. We are going with the Groningen Meaning Bank (GMB) though.
GMB is a fairly large corpus with a lot of annotations. Unfortunately, GMB is not perfect. It is not a gold-standard corpus, meaning that it's not fully human-annotated and it's not considered 100% correct. The corpus was created by running existing annotators and then correcting the output by hand where needed.
Let’s start playing with the corpus. Download the 2.2.0 version of the corpus here: Groningen Meaning Bank Download
Essentially, GMB is composed of many files, but we only care about the .tags files. Here's what one looks like:
```
Russia      NNP  russia      geo-nam  1  []       O  Place         N                      lam(v1,b1:drs([],[b1:[1001]:named(v1,russia,geo,nam)]))
's          POS  's          O        0  []       of O             (NP/N)\NP              lam(v1,lam(v2,lam(v3,app(v1,lam(v4,alfa(def,merge(b1:drs([b1:[]:x1],[b1:[1002]:rel(x1,v4,of,0)]),app(v2,x1)),app(v3,x1)))))))
giant       JJ   giant       O        1  [Topic]  O  O             N/N                    lam(v1,lam(v2,merge(b1:drs([b1:[]:s1],[b1:[]:role(v2,s1,'Topic',-1),b1:[1003]:pred(s1,giant,a,'1')]),app(v1,v2))))
Yukos       NNP  yukos       org-nam  0  []       with Non-concrete N/N                   lam(v1,lam(v2,merge(b1:drs([b1:[]:x1],[b1:[1004]:named(x1,yukos,org,nam),b1:[]:rel(v2,x1,f(name,[yukos,v1],with),0)]),app(v1,v2))))
oil         NN   oil         O        1  []       in Non-concrete  N/N                    lam(v1,lam(v2,merge(b1:drs([b1:[]:x1],[b1:[1005]:pred(x1,oil,n,'1'),b1:[]:rel(v2,x1,f(noun,[oil,v1],in),0)]),app(v1,v2))))
company     NN   company     O        1  []       O  Non-concrete  N                      lam(v1,b1:drs([],[b1:[1006]:pred(v1,company,n,'1')]))
has         VBZ  have        O        0  []       O  O             (S[dcl]\NP)/(S[pt]\NP) lam(v1,lam(v2,lam(v3,app(app(v1,v2),lam(v4,merge(b2:drs([b1:[]:t1,b2:[1007]:x1,b2:[]:e1],[b1:[]:pred(t1,now,a,1),b2:[]:eq(x1,t1),b2:[]:rel(e1,x1,temp_includes,1),b2:[]:rel(v4,e1,temp_abut,1)]),app(v3,v4)))))))
filed       VBN  file        O        1  [Theme]  O  O             (S[pt]\NP)/PP          lam(v1,lam(v2,lam(v3,app(v2,lam(v4,merge(b1:drs([b1:[]:e1],[b1:[1008]:pred(e1,file,v,'1'),b1:[]:role(e1,v4,'Theme',1)]),merge(app(v1,e1),app(v3,e1))))))))
for         IN   for         O        0  []       O  O             PP/NP                  lam(v1,lam(v2,app(v1,lam(v3,b1:drs([],[b1:[1009]:rel(v2,v3,for,0)])))))
bankruptcy  NN   bankruptcy  O        1  []       O  Non-concrete  N                      lam(v1,b1:drs([],[b1:[1010]:pred(v1,bankruptcy,n,'1')]))
...
```
That looks rather messy, but in fact it's pretty structured. A file contains multiple sentences, which are separated by two newline characters. Within a sentence, every word is on its own line (separated by one newline character), and the annotations for a word are separated by tab characters.
```python
import os
import collections

ner_tags = collections.Counter()

corpus_root = "gmb-2.2.0"   # Make sure you set the proper path to the unzipped corpus

for root, dirs, files in os.walk(corpus_root):
    for filename in files:
        if filename.endswith(".tags"):
            with open(os.path.join(root, filename), 'rb') as file_handle:
                file_content = file_handle.read().decode('utf-8').strip()
                annotated_sentences = file_content.split('\n\n')   # Split sentences
                for annotated_sentence in annotated_sentences:
                    annotated_tokens = [seq for seq in annotated_sentence.split('\n') if seq]  # Split words

                    standard_form_tokens = []

                    for idx, annotated_token in enumerate(annotated_tokens):
                        annotations = annotated_token.split('\t')   # Split annotations
                        word, tag, ner = annotations[0], annotations[1], annotations[3]

                        ner_tags[ner] += 1

print ner_tags
"""
Counter({u'O': 1146068, u'geo-nam': 58388, u'org-nam': 48034, u'per-nam': 23790, u'gpe-nam': 20680, u'tim-dat': 12786, u'tim-dow': 11404, u'per-tit': 9800, u'per-fam': 8152, u'tim-yoc': 5290, u'tim-moy': 4262, u'per-giv': 2413, u'tim-clo': 891, u'art-nam': 866, u'eve-nam': 602, u'nat-nam': 300, u'tim-nam': 146, u'eve-ord': 107, u'per-ini': 60, u'org-leg': 60, u'per-ord': 38, u'tim-dom': 10, u'per-mid': 1, u'art-add': 1})
"""
```
Let's interpret the tags a bit. We can observe that, except for O of course, the tags have the composite form {TAG}-{SUBTAG}. Here's what the top-level categories mean:
- geo = Geographical Entity
- org = Organization
- per = Person
- gpe = Geopolitical Entity
- tim = Time indicator
- art = Artifact
- eve = Event
- nat = Natural Phenomenon
The subcategories are pretty unnecessary and pretty polluted. per-ini, for example, tags the initial of a person's name; that tag kind of makes sense. On the other hand, it's unclear what the difference is between per-nam (person name), per-giv (given name), per-fam (family name) and per-mid (middle name).
I decided to just remove the subcategories and focus only on the main ones. Let’s modify the code a bit:
```python
ner_tags = collections.Counter()

for root, dirs, files in os.walk(corpus_root):
    for filename in files:
        if filename.endswith(".tags"):
            with open(os.path.join(root, filename), 'rb') as file_handle:
                file_content = file_handle.read().decode('utf-8').strip()
                annotated_sentences = file_content.split('\n\n')   # Split sentences
                for annotated_sentence in annotated_sentences:
                    annotated_tokens = [seq for seq in annotated_sentence.split('\n') if seq]  # Split words

                    standard_form_tokens = []

                    for idx, annotated_token in enumerate(annotated_tokens):
                        annotations = annotated_token.split('\t')   # Split annotations
                        word, tag, ner = annotations[0], annotations[1], annotations[3]

                        # Get only the primary category
                        if ner != 'O':
                            ner = ner.split('-')[0]

                        ner_tags[ner] += 1

print ner_tags
# Counter({u'O': 1146068, u'geo': 58388, u'org': 48094, u'per': 44254, u'tim': 34789, u'gpe': 20680, u'art': 867, u'eve': 709, u'nat': 300})
print "Words=", sum(ner_tags.values())  # Words= 1354149
```
This looks much better. You might decide to drop the last few tags because they are not well represented in the corpus. We’ll keep them … for now.
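If you did decide to drop the underrepresented categories, a hypothetical filter could look like this (the counts are the ones printed above; the MIN_COUNT threshold is arbitrary, chosen just for illustration):

```python
import collections

# Tag counts from the GMB corpus, as computed above
ner_tags = collections.Counter({'O': 1146068, 'geo': 58388, 'org': 48094,
                                'per': 44254, 'tim': 34789, 'gpe': 20680,
                                'art': 867, 'eve': 709, 'nat': 300})

MIN_COUNT = 1000  # arbitrary threshold for this illustration

# Keep only the categories that appear often enough to learn from
kept = {tag for tag, count in ner_tags.items() if count >= MIN_COUNT}
print(sorted(kept))
# ['O', 'geo', 'gpe', 'org', 'per', 'tim']
```

With this threshold, art, eve and nat would be dropped; during corpus reading you would then map any dropped tag to O.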
Training your own system
In an earlier post, we trained a part-of-speech tagger. You can read it here: Training a Part-Of-Speech Tagger. We're taking a similar approach to train our NE-Chunker.
```python
import string
from nltk.stem.snowball import SnowballStemmer


def features(tokens, index, history):
    """
    `tokens`  = a POS-tagged sentence [(w1, t1), ...]
    `index`   = the index of the token we want to extract features for
    `history` = the previous predicted IOB tags
    """

    # init the stemmer
    stemmer = SnowballStemmer('english')

    # Pad the sequence with placeholders
    tokens = [('[START2]', '[START2]'), ('[START1]', '[START1]')] + list(tokens) + [('[END1]', '[END1]'), ('[END2]', '[END2]')]
    history = ['[START2]', '[START1]'] + list(history)

    # shift the index with 2, to accommodate the padding
    index += 2

    word, pos = tokens[index]
    prevword, prevpos = tokens[index - 1]
    prevprevword, prevprevpos = tokens[index - 2]
    nextword, nextpos = tokens[index + 1]
    nextnextword, nextnextpos = tokens[index + 2]
    previob = history[index - 1]

    contains_dash = '-' in word
    contains_dot = '.' in word
    allascii = all(c in string.ascii_lowercase for c in word)

    allcaps = word == word.upper()                   # fixed: compare against the fully upper-cased word
    capitalized = word[0] in string.ascii_uppercase

    prevallcaps = prevword == prevword.upper()
    prevcapitalized = prevword[0] in string.ascii_uppercase

    nextallcaps = nextword == nextword.upper()       # fixed: the original mistakenly used prevword here
    nextcapitalized = nextword[0] in string.ascii_uppercase

    return {
        'word': word,
        'lemma': stemmer.stem(word),
        'pos': pos,
        'all-ascii': allascii,

        'next-word': nextword,
        'next-lemma': stemmer.stem(nextword),
        'next-pos': nextpos,

        'next-next-word': nextnextword,
        'next-next-pos': nextnextpos,

        'prev-word': prevword,
        'prev-lemma': stemmer.stem(prevword),
        'prev-pos': prevpos,

        'prev-prev-word': prevprevword,
        'prev-prev-pos': prevprevpos,

        'prev-iob': previob,

        'contains-dash': contains_dash,
        'contains-dot': contains_dot,

        'all-caps': allcaps,
        'capitalized': capitalized,

        'prev-all-caps': prevallcaps,
        'prev-capitalized': prevcapitalized,

        'next-all-caps': nextallcaps,
        'next-capitalized': nextcapitalized,
    }
```
The feature extraction works almost identically to the one implemented in Training a Part-Of-Speech Tagger, except that we added the history mechanism. Since the previous IOB tag is a very good indicator of what the current IOB tag is going to be, we have included the previous IOB tag as a feature.
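To make the history mechanism concrete, here's a toy sketch (ours, not NLTK's internals) of greedy left-to-right tagging: the tag just predicted is appended to the history, so it is available as a feature when predicting the next token. The dummy classifier below is invented purely to show the flow.

```python
def tag_greedily(tokens, predict):
    """Tag tokens left to right, feeding previous predictions back in."""
    history = []
    for index in range(len(tokens)):
        tag = predict(tokens, index, history)
        history.append(tag)   # this prediction becomes a feature for the next step
    return history


def dummy_predict(tokens, index, history):
    """A stand-in classifier: capitalized, non-initial words are entities;
    they continue a person chunk if the previous prediction was one."""
    word = tokens[index]
    if index == 0 or not word[0].isupper():
        return 'O'
    prev = history[index - 1]
    return 'I-per' if prev in ('B-per', 'I-per') else 'B-per'


print(tag_greedily(['pm', 'Tony', 'Blair', 'spoke'], dummy_predict))
# ['O', 'B-per', 'I-per', 'O']
```

Because each prediction depends on the previous one, the tagging has to be done strictly in order; this is exactly what `ClassifierBasedTagger` handles for us.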
Let's create a few utility functions to help us with the training, and move the corpus-reading code into a function, `read_gmb`:
```python
def to_conll_iob(annotated_sentence):
    """
    `annotated_sentence` = list of triplets [(w1, t1, iob1), ...]
    Transform a pseudo-IOB notation: O, PERSON, PERSON, O, O, LOCATION, O
    to proper IOB notation: O, B-PERSON, I-PERSON, O, O, B-LOCATION, O
    """
    proper_iob_tokens = []
    for idx, annotated_token in enumerate(annotated_sentence):
        word, tag, ner = annotated_token

        if ner != 'O':
            if idx == 0:
                ner = "B-" + ner
            elif annotated_sentence[idx - 1][2] == ner:
                ner = "I-" + ner
            else:
                ner = "B-" + ner
        proper_iob_tokens.append((word, tag, ner))
    return proper_iob_tokens


def read_gmb(corpus_root):
    for root, dirs, files in os.walk(corpus_root):
        for filename in files:
            if filename.endswith(".tags"):
                with open(os.path.join(root, filename), 'rb') as file_handle:
                    file_content = file_handle.read().decode('utf-8').strip()
                    annotated_sentences = file_content.split('\n\n')
                    for annotated_sentence in annotated_sentences:
                        annotated_tokens = [seq for seq in annotated_sentence.split('\n') if seq]

                        standard_form_tokens = []

                        for idx, annotated_token in enumerate(annotated_tokens):
                            annotations = annotated_token.split('\t')
                            word, tag, ner = annotations[0], annotations[1], annotations[3]

                            if ner != 'O':
                                ner = ner.split('-')[0]

                            if tag in ('LQU', 'RQU'):   # Make it NLTK compatible
                                tag = "``"

                            standard_form_tokens.append((word, tag, ner))

                        conll_tokens = to_conll_iob(standard_form_tokens)

                        # Make it NLTK Classifier compatible - [(w1, t1, iob1), ...] to [((w1, t1), iob1), ...]
                        # Because the classifier expects a tuple as input: first item input, second the class
                        yield [((w, t), iob) for w, t, iob in conll_tokens]


reader = read_gmb(corpus_root)
```
Check the output:
```python
print reader.next()
print '------------'
"""
[((u'Thousands', u'NNS'), u'O'), ((u'of', u'IN'), u'O'), ((u'demonstrators', u'NNS'), u'O'), ((u'have', u'VBP'), u'O'), ((u'marched', u'VBN'), u'O'), ((u'through', u'IN'), u'O'), ((u'London', u'NNP'), u'B-geo'), ((u'to', u'TO'), u'O'), ((u'protest', u'VB'), u'O'), ((u'the', u'DT'), u'O'), ((u'war', u'NN'), u'O'), ((u'in', u'IN'), u'O'), ((u'Iraq', u'NNP'), u'B-geo'), ((u'and', u'CC'), u'O'), ((u'demand', u'VB'), u'O'), ((u'the', u'DT'), u'O'), ((u'withdrawal', u'NN'), u'O'), ((u'of', u'IN'), u'O'), ((u'British', u'JJ'), u'B-gpe'), ((u'troops', u'NNS'), u'O'), ((u'from', u'IN'), u'O'), ((u'that', u'DT'), u'O'), ((u'country', u'NN'), u'O'), ((u'.', u'.'), u'O')]
------------
"""

print reader.next()
print '------------'
"""
[((u'Families', u'NNS'), u'O'), ((u'of', u'IN'), u'O'), ((u'soldiers', u'NNS'), u'O'), ((u'killed', u'VBN'), u'O'), ((u'in', u'IN'), u'O'), ((u'the', u'DT'), u'O'), ((u'conflict', u'NN'), u'O'), ((u'joined', u'VBD'), u'O'), ((u'the', u'DT'), u'O'), ((u'protesters', u'NNS'), u'O'), ((u'who', u'WP'), u'O'), ((u'carried', u'VBD'), u'O'), ((u'banners', u'NNS'), u'O'), ((u'with', u'IN'), u'O'), ((u'such', u'JJ'), u'O'), ((u'slogans', u'NNS'), u'O'), ((u'as', u'IN'), u'O'), ((u'"', '``'), u'O'), ((u'Bush', u'NNP'), u'B-per'), ((u'Number', u'NN'), u'O'), ((u'One', u'CD'), u'O'), ((u'Terrorist', u'NN'), u'O'), ((u'"', '``'), u'O'), ((u'and', u'CC'), u'O'), ((u'"', '``'), u'O'), ((u'Stop', u'VB'), u'O'), ((u'the', u'DT'), u'O'), ((u'Bombings', u'NNS'), u'O'), ((u'.', u'.'), u'O'), ((u'"', '``'), u'O')]
------------
"""

print reader.next()
print '------------'
"""
[((u'They', u'PRP'), u'O'), ((u'marched', u'VBD'), u'O'), ((u'from', u'IN'), u'O'), ((u'the', u'DT'), u'O'), ((u'Houses', u'NNS'), u'O'), ((u'of', u'IN'), u'O'), ((u'Parliament', u'NN'), u'O'), ((u'to', u'TO'), u'O'), ((u'a', u'DT'), u'O'), ((u'rally', u'NN'), u'O'), ((u'in', u'IN'), u'O'), ((u'Hyde', u'NNP'), u'B-geo'), ((u'Park', u'NNP'), u'I-geo'), ((u'.', u'.'), u'O')]
------------
"""
```
We managed to read sentences from the corpus in a proper format. We can now start to actually train a system. NLTK offers a few helpful classes to accomplish the task. `nltk.chunk.ChunkParserI` is a base class for building chunkers/parsers. Another useful asset we are going to use is `nltk.tag.ClassifierBasedTagger`; under the hood, it uses a Naive Bayes classifier to predict sequences.
```python
import pickle
from collections import Iterable
from nltk.tag import ClassifierBasedTagger
from nltk.chunk import ChunkParserI


class NamedEntityChunker(ChunkParserI):
    def __init__(self, train_sents, **kwargs):
        assert isinstance(train_sents, Iterable)

        self.feature_detector = features
        self.tagger = ClassifierBasedTagger(
            train=train_sents,
            feature_detector=features,
            **kwargs)

    def parse(self, tagged_sent):
        chunks = self.tagger.tag(tagged_sent)

        # Transform the result from [((w1, t1), iob1), ...]
        # to the preferred list of triplets format [(w1, t1, iob1), ...]
        iob_triplets = [(w, t, c) for ((w, t), c) in chunks]

        # Transform the list of triplets to nltk.Tree format
        return conlltags2tree(iob_triplets)
```
Let’s build the datasets:
```python
reader = read_gmb(corpus_root)
data = list(reader)
training_samples = data[:int(len(data) * 0.9)]
test_samples = data[int(len(data) * 0.9):]

print "#training samples = %s" % len(training_samples)    # training samples = 55809
print "#test samples = %s" % len(test_samples)            # test samples = 6201
```
We built everything up to this point so that the training can be expressed as simply as:
```python
chunker = NamedEntityChunker(training_samples[:2000])
```
It probably took a while. Let’s take it for a spin:
```python
from nltk import pos_tag, word_tokenize

print chunker.parse(pos_tag(word_tokenize("I'm going to Germany this Monday.")))
"""
(S
  I/PRP
  'm/VBP
  going/VBG
  to/TO
  (geo Germany/NNP)
  this/DT
  (tim Monday/NNP)
  ./.)
"""
```
The system you just trained did a great job at recognizing named entities:
- Named Entity “Germany” – Geographical Entity
- Named Entity “Monday” – Time Entity
Testing the system
Let's see how the system measures up. Because we followed the good patterns in NLTK, we can test our NE-Chunker as simply as this:
```python
score = chunker.evaluate([conlltags2tree([(w, t, iob) for (w, t), iob in iobs]) for iobs in test_samples[:500]])
print score.accuracy()  # 0.931132334092 - Awesome :D
```
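Keep in mind that accuracy is inflated here by the overwhelming number of O tags; entity-level precision and recall are stricter measures. Here's a small, pure-Python sketch (the helper name is ours) of extracting entity spans from IOB sequences and scoring predictions against gold annotations:

```python
def extract_entities(iob_tags):
    """Return a set of (start, end, type) entity spans from an IOB sequence."""
    entities = set()
    start = None
    for i, tag in enumerate(iob_tags + ['O']):   # pad so the last span is closed
        # close the currently open span if this tag ends or interrupts it
        if start is not None and (tag == 'O' or tag.startswith('B-')
                                  or tag[2:] != iob_tags[start][2:]):
            entities.add((start, i, iob_tags[start][2:]))
            start = None
        if tag.startswith('B-'):
            start = i
    return entities


gold = ['O', 'B-per', 'I-per', 'O', 'B-geo']
pred = ['O', 'B-per', 'O', 'O', 'B-geo']

correct = extract_entities(gold) & extract_entities(pred)
precision = len(correct) / float(len(extract_entities(pred)))
recall = len(correct) / float(len(extract_entities(gold)))
print(precision)  # 0.5 - the truncated B-per span does not count as a match
print(recall)     # 0.5
```

An entity only counts as correct when both its boundaries and its type match exactly, which is why the partially-found person entity scores zero here.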
Conclusions
- Chunking can be reduced to a tagging problem.
- Named Entity Recognition is a form of chunking.
- We explored a freely available corpus that can be used for real-world applications.
- The NLTK classifier can be replaced with any classifier you can think of. Try replacing it with a scikit-learn classifier.
If you loved this tutorial, you should definitely check out the sequel: Training a NER system on a large dataset. It builds upon what you already learned, uses a scikit-learn classifier and pushes the accuracy to 97%.
Notes
- I’ve used NLTK version 3.2.1
- You can find the entire code here: Python NER Gist
Good NER tutorial. It would be even better if trained on top of state-of-the-art approaches like CRFs or hybrid techniques, or semi-supervised or unsupervised techniques as well.
Thanks, man.
I plan to go into more advanced topics at some point
Nice article Bogdan. Google brought me here 🙂
Do you know any good Romanian Corpora for NER?
Hi Mihai!
Really glad to hear from you! Unfortunately, I’m not aware of any Romanian NER Corpus whatsoever. I am working on something you might find useful, though. Talk to you on Facebook 🙂
Lets catch up,
Bogdan.
Thanks for the good work. I have few comments and questions.
1) Why did you not use scikit learn to train the classifier for NER task?
2) Is the order ‘word, tag, iob’ correct in line 9 and 18 in def to_conll_iob(annotated_sentence) ?
Hello,
Thank you and thanks for the questions:
1) I did not use scikit-learn in this tutorial to be able to focus on the task rather than the intricacies of training a model. We’re not focusing on performance but rather on the concepts. I do have a NER tutorial that uses scikit-learn here: http://nlpforhackers.io/training-ner-large-dataset/
2) Yes, that should be the case. Are you encountering any errors on that part?
hey Bogdani,
Do you think training an NER for tagging prices would work?
Let’s say if we have a document that contains text from an AIRLINE ticket.
In the whole text there would be the fare of the flight somewhere. Do you think any NER (NLTK/CRF/RNN) can tag that, considering there could be a ticket ID, flight number, and additional info in the same document?
Absolutely, especially because usually price has a currency symbol in proximity.
How do you train the model one time and re-use the model again during testing?
Hi Meng Hui,
Not sure if I got your question right. Here are a few thoughts:
```python
# Here you are training the NER
chunker = NamedEntityChunker(training_samples[:2000])

# Here you are using it on unseen data (basically, what it's intended for)
chunker.parse(pos_tag(word_tokenize("I'm going to Germany this Monday.")))

# Here you evaluate it
score = chunker.evaluate([conlltags2tree([(w, t, iob) for (w, t), iob in iobs]) for iobs in test_samples[:500]])
print score.accuracy()  # 0.931132334092 - Awesome :D
```
Hi bogdani,
what I mean is how to save and load the model the next time you want to use it on a new document.
You don’t want to go through the process of training the model again and again every time you have a new documents to test.
Check this out: http://scikit-learn.org/stable/modules/model_persistence.html
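For instance, a minimal pickle round-trip looks like this (the file name and the stand-in model here are just for illustration; the same two calls apply to the trained chunker object):

```python
import pickle

model = {'weights': [0.1, 0.2]}   # stand-in for a trained chunker

# Save the trained model to disk once...
with open('ner_model.pickle', 'wb') as out:
    pickle.dump(model, out)

# ...then load it in any later script without retraining
with open('ner_model.pickle', 'rb') as inp:
    restored = pickle.load(inp)

print(restored == model)  # True
```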
Hey Bogdani !
I have already tried out this tutorial and the more advanced version of this, but I am not completely satisfied with the results. Are there any other good corpora that can be used to train the system to get better results?
Thanks
Andy
Sorry Andy,
To my knowledge, there aren’t any better or larger freely available NER corpora 🙁
Regards,
Bogdan.
Hey Bogdan,
Great article!!
I'm looking for an NER solution for medical literature documents. I want to extract entities like patient description, disease, adverse event of a drug etc. from paragraphs that can be anywhere in a document (and I have many PDF docs like that). So my focus is first locating those paragraphs and then NER. If you can give some pointers on how to approach this task, I will highly appreciate that.
Hi Bat,
The most important part is to have the data annotated. Is that the case? If you have the paragraphs and entities annotated, you can first build a text classifier that works on paragraphs to identify the desired paragraphs. Next, on those paragraphs, train the NER. Again, this is true if the data is annotated. Otherwise, you have to think of an unsupervised method to train the system.
Bogdan.
Hi Bogdani,
In the above comment you mentioned that if no annotated dataset is available, one should use an unsupervised method. Could you please tell me what unsupervised method, and what other steps are required to get the final result?
Thanks.
Hi Achyuta,
“Unsupervised” NER is definitely outside the scope of this blog. I haven’t experimented with it myself. There are a few published papers on the matter. You can check them out.
Off the top of my head, I would consider something like this: Start with a set of known entities. Find web pages that contain those entities and consider the found entities labelled. Find similar sentences to the ones you found but with different entities.
Algorithm:
1. Search for entities,
2. Extract template,
3. Search for the template,
4. Extract new entities
5. Go back to 1. with the new entities found
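As a toy, self-contained sketch of that loop (all sentences and seed entities below are invented, and the "templates" are simplified down to single-word left/right contexts):

```python
sentences = [
    "Mark works at Google .",
    "John works at Microsoft .",
    "Jane lives in Paris .",
]
seeds = {'Google'}   # 1. start from a set of known entities

for _ in range(2):   # a couple of bootstrap rounds
    # 2. extract templates: the (previous word, next word) contexts of known entities
    contexts = set()
    for s in sentences:
        words = s.split()
        for i, w in enumerate(words):
            if w in seeds and 0 < i < len(words) - 1:
                contexts.add((words[i - 1], words[i + 1]))

    # 3./4. search for the templates and extract new entities
    for s in sentences:
        words = s.split()
        for i in range(1, len(words) - 1):
            if (words[i - 1], words[i + 1]) in contexts:
                seeds.add(words[i])

print(sorted(seeds))  # ['Google', 'Microsoft']
```

Here "works at ... ." generalizes from Google to Microsoft, while "lives in Paris" is not picked up because its context never matched a seed; real bootstrapping systems use much richer templates and confidence scoring to avoid drift.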
Hello,
It was a good tutorial; I have gone through it and made it work. One thing I want to know: do I need to train the model every time I want to check the NER of a sentence?
Hi Kanishka,
After the model is trained you can use it on as many sentences you want. If you want to use it in another script, you need to save the model to disk. Until I cover this aspect, you can read about it here: http://scikit-learn.org/stable/modules/model_persistence.html
Great tutorial!
I’ve been working through this and I’m a little confused where the features function is called. Is it called in the training or when you apply it to a new sentence?
If you wouldn’t mind writing where and how it is called, that would be great!
Thanks!
Hi Jenny,
Thanks! You can call the NER as many times as you need, like this: chunker.parse(pos_tag(word_tokenize("Here goes a sentence")))
What you probably need, is how to persist the NER on disk and use it again, right?
Here’s a handy link for joblib: http://scikit-learn.org/stable/modules/model_persistence.html
I’m planning to write a short post on persisting models myself, till then, hope the URL helps
Bogdan.
Hi Jenny,
Maybe my answer wasn’t really to the point. The feature detector is used to extract characteristics about the data we’re classifying. In this example, the feature detection function is used somewhere inside the nltk’s ClassifierBasedTagger.
It is used both at the training phase and the tagging phase.
Hope this helps more,
Bogdan.
Thanks for your explanation. This is a really good tutorial. I have a few questions to better understand what you did, as I am new to the domain of NER. This is a supervised machine learning task, right? In that sense, are the entities (chunks) the features, and which ones are the classes?
As you said: "# Because the classifier expects a tuple as input, first item input, second the class
yield [((w, t), iob) for w, t, iob in conll_tokens]"
Hi Alfredo,
Yes, Supervised Learning, as we have a training set. The classes are the "O" (outside), "B-PER" (Beginning of a PERson entity), "I-PER" (Inside a PERson entity) etc...
The features are the ones defined in the `features` function: the word, the stem, the part-of-speech, etc.

Have a good one,
Bogdan.
Thanks for sharing. It was very interesting. I have a question about the background of this system. To my understanding, NLTK learns from the features that you created and takes the label from the train set. For example, your input is ((w, t), iob): it takes iob as the label for training and creates a feature set for each token via the features function. Now my question is: during prediction, does it also create a feature set for the sample? If yes, does it leave the history empty during prediction, since we do not have any labels?
Other question is that when I try to pickle it:
pickle.dump(chunker, open("enr.pickle", "wb"))
when I try to load it in another module, it takes time and it seems that it pickled the whole module and tries to train from scratch. My assumption was that pickle only keeps the classifier. What is wrong with this method?
I understood my mistake with pickle, never mind 🙂
Great to hear that 🙂
Hi Reza,
During the prediction phase, the history contains the tags that have just been predicted. The NLTK ClassifierBasedTagger knows how and when to feed the already-predicted labels as the `history` parameter to the feature_detector function. That being said, the tagging has to be done in order. Hope this helps.

Bogdan.
Thanks Bogdani 🙂
How big should the training data be, dear Bogdani? I manually annotated around 40 sentences with my entities and applied the model to some unseen data. Unfortunately, most of the time the prediction is wrong. My assumption is that the training data is too small.
I found a free corpus that is annotated (Open American National Corpus); however, it is in a complicated XML format and no reader is provided. It seems that they used the GrAF method for creating their corpus. I tried some open-source GrAF readers but I did not find out how to access the words, POS tags and entities in this corpus. Have you had any experience with such corpora?
Hi Reza,
The training data should definitely be waaaay bigger. Think tens of thousands.
Can you provide a link to the corpus please 😀
Bogdan.
Here you are: http://www.anc.org/data/oanc/download/
Actually I used this one: http://www.anc.org/data/masc/
It seems that this corpus is annotated by hand and it has various Name Entities
You don’t need any specialized reader. The files are in XML format. Use notepad++ or sublime text to view them. Use any XML processing library to work with them. Here’s where you can read about the format: http://www.xces.org/ns/GrAF/1.0/
Hello
How can I use this to extract French named entities, please?
Absolutely, as long as you have a French NER corpus 🙂
Hi, it would be really good if I could read this without much prior knowledge (or each article as a standalone, independent one). As a newbie I came across this and it looks very helpful, but reading it I first saw "pos_tag" and had no idea what it means. And then I read "IOB tagging" and had no idea what it means. Any chance to have the articles as standalone pieces that someone can read on their own?
Hi Tomer,
Maybe go through some articles in the order described here: https://nlpforhackers.io/start/
Thanks, it's more introductory indeed. However, for example at http://nlpforhackers.io/introduction-nltk/ was the first time I encountered "# [('John', 'NNP'), ('works', 'VBZ'), ('at', 'IN'), ('Intel', 'NNP'), ('.', '.')]", and there is no reference at that point (or before, as far as I could tell) to what NNP, VBZ, etc. mean. I had to search and find that, which stops the fluency of my reading.
Indeed, that makes sense. Will add a note on that shortly
Hi Bogdani,
Thanks for the great article. Well written and explained. I recently explored spaCy for NER and I am trying to extract relevant information from job descriptions on LinkedIn. Example: relevant skills, programming languages required, education etc.
Can you give any leads how to proceed?
Thanks,
Did you check out the tutorial on training your own spaCy NER? https://spacy.io/usage/examples#training-ner
Hi!
Thanks for the helpful article!
I have a one question,
I think the role of history in the article is not well described.
What I understand so far is like, suppose we have to (NER)tags the word ‘Apple’, we can look for history of how the word Apple has been tagged, since those Entities are very history dependent. However, I think the exact mechanism of history is not clear in this article could you help me understand?
Thank you very much!
Hi Sir ,
I'm completely new to this field and also new to Python, so I'm not able to understand exactly what you explained. If possible, could you explain what you did here?
Where are you having problems understanding? This is hardly the place to start learning Python 🙂
Hey. I am using Python 3.5.0 and I am getting the following error.
line 22, in features
word, pos = tokens[index]
ValueError: too many values to unpack (expected 2)
Somehow it's only receiving the words and not the tags, and so the error is there. But I have used the same code as given. So I feel there is something with the NLTK built-in function in Python 3. I am using the same training dataset. Please help me resolve this issue.
Yep, code is written in Python2.7. Please do the necessary patches to work on 3.5
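The main patches are mechanical. A small sketch of the Python 2 vs 3 differences involved here (illustrative only): the print statement becomes a function, `generator.next()` becomes `next(generator)`, and `Iterable` moves to `collections.abc`.

```python
import sys

# Iterable moved to collections.abc in Python 3.3+
if sys.version_info[0] >= 3:
    from collections.abc import Iterable
else:
    from collections import Iterable

def take_first(generator):
    # Python 2's generator.next() is spelled next(generator) in Python 3
    return next(generator)

assert isinstance([], Iterable)
print(take_first(iter(['a', 'b'])))  # a
```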
hello air
Can you please tell me how to use CSV data with sentences and entity tags to train the models? Can you please show the code? I am getting errors.
I sincerely don’t know what you are talking about 🙂
What CSVs are you talking about? I don’t use any CSVs. If you are using CSVs, it is up to you to customize the code, this is a tutorial.
I am showing a lot of code, look, the post is full of code 🙂
Bogdan (not air).
Hello, really great tutorial! I was wondering, is it possible to use the same/similar approach if I need to create my own entity type?
Do you maybe have a tutorial about that?
Don’t have a tutorial for that exact case. This approach can be applied to any properly labelled corpus. If you can annotate enough data, you can train the model 🙂
It does not like this line and I have tried a lot of variations with no luck. Please help…
Traceback (most recent call last):
File "namedEntityRecognizer.py", line 97, in <module>
chunkerP = NamedEntityChunker(training_samples[:2000])
File "namedEntityRecognizer.py", line 26, in __init__
**kwargs)
File "/usr/local/lib/python3.5/dist-packages/nltk/tag/sequential.py", line 628, in __init__
self._train(train, classifier_builder, verbose)
File "/usr/local/lib/python3.5/dist-packages/nltk/tag/sequential.py", line 659, in _train
index, history)
File "/usr/local/lib/python3.5/dist-packages/nltk/tag/sequential.py", line 680, in feature_detector
return self._feature_detector(tokens, index, history)
TypeError: 'list' object is not callable
Can you post your entire script somewhere in a Gist or something?
Did you see the gist? Sorry for the multiple replies, the form was acting weird on me and I didn't see the text tab on the right here.
https://gist.github.com/cparello/1fc4f100543b9e5f097d4d7642e5b9cf
I appreciate any help you can give me..
All parts work individually until that last line complains about “TypeError: ‘list’ object is not callable”
Hi Bogdani, my error is at
It says,
Is there another way to check the previous NER tag in enumerate?
Think that's a Python 2.7 vs 3.6 issue. I'm away from my computer for several weeks to come.
Bogdan.
I don't think people normally use "accuracy" for NER tasks (the default NLTK evaluate function also does a poor job on this). Precision, recall and F1, which are calculated only on entities and exclude the Os, are used instead. The accuracy will naturally be very high, since the vast majority of the words are non-entities (i.e. labeled O) anyway. It would be nice if you could update these articles with those measures.
Hi Bogdan,
I am receiving the following error:
File “/usr/local/lib/python2.7/dist-packages/nltk/tag/api.py”, line 77, in _check_params
‘Must specify either training data or trained model.’)
ValueError: Must specify either training data or trained model.
Any suggestions for the above? I am using Python 2.7 for this. I believe the model is not defined, which is why it shows this error, and I am not able to understand which model should be defined here. Please help!!
Can you create a GitHub Gist with your code please and place the link in a comment?
Hi,
I'm getting the same error. I checked the size of the data after the read method and it is empty.
I checked the corpus I downloaded itself, and it seems to have a size of 803 MB or something like that, but I was unable to unzip the file. I tried many times.
I think the data is the problem. Do you have any suggestion about alternative annotated corpora?
Thanks a lot.
Is the problem persisting?
Hi, awesome tutorial. Can I do this with my own language, for example Quechua?
Do you have an annotated corpus for the Quechua language?
Yes, I do.
[…] are also two relatively recent guides (1 2) online detailing the process of using NLTK to train the GMB […]
I am trying to build an NLP model to predict medicine names from medical documents. I have a directory containing files of medical documents in unstructured format. The documents might be email conversations, billing, approval certificates from the FDA, etc. Each document contains one medicine name somewhere inside it. I also have an Excel file where the filenames and the medicine names present inside those files are separate columns. I have data for around 1000 docs and that will be part of my training set. My understanding is that I need to give custom tags to the medicine names in my training set with a label, for example: ("WRO Meeting for Myozyme IND 010780", [(52, 58, 'MEDICINE')]). This is how the spaCy library accepts custom tags for training a NER model. The expected output is that, when I upload a new document, the model should be able to identify the medicine name in it. How do I tag my dataset or build my training data for this purpose, and how do I get the necessary output?
Hi Mukh,
To me, it sounds like you have it figured out. What exactly are you missing? You don’t need POS tags or anything else. The example you provided should be enough for the spaCy NER.
Hope this helps,
Bogdan.
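Following up for anyone building the same kind of training set: the (start, end, label) offsets in the spaCy-style tuple above can be generated from the document text plus the medicine name from the spreadsheet. A hedged sketch (`make_example` is an invented helper name, not a spaCy function):

```python
def make_example(text, medicine, label="MEDICINE"):
    """Locate the medicine name in the text and emit a spaCy-style
    (text, [(start, end, label)]) training tuple, or None if absent."""
    start = text.find(medicine)
    if start == -1:
        return None  # flag for manual review instead of silently skipping
    return (text, [(start, start + len(medicine), label)])
```

For instance, `make_example("WRO Meeting for Myozyme IND 010780", "Myozyme")` yields the text paired with `[(16, 23, "MEDICINE")]`. In practice you would likely want case-insensitive matching and a check for multiple occurrences before feeding the tuples to spaCy.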