Training a NER System Using a Large Dataset
In a previous article, we studied training a NER (Named-Entity-Recognition) system from the ground up, using the Groningen Meaning Bank Corpus. This article is a continuation of that tutorial. The main purpose of this extension to training a NER is to:
- Replace the classifier with a Scikit-Learn Classifier
- Train a NER on a larger subset of the training data
- Increase accuracy
- Understand Out Of Core Learning
What was wrong with the initial system, you might ask? There wasn’t anything fundamentally wrong with the process. In fact, it’s a great didactic example, and we can build upon it. This is where it was lacking:
1. If you did the training yourself, you probably realized we can’t train the system on the whole dataset (I chose to train it on the first 2000 sentences).
2. The dataset is so huge – it can’t be loaded all in memory.
3. We achieved around 93% accuracy. That might sound like a good accuracy, but we might be deceived. Named entities are probably around 10% of the tags. If we predict that all words have O tag (remember, O stands for outside any entity), we’re achieving a 90% accuracy. We can probably do better.
4. We can come up with a better feature set that better describes the data and is more relevant to our task.
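To make point 3 concrete, here is a minimal sketch of the "always predict O" baseline. The tag list is hypothetical toy data, not taken from the GMB corpus; it only illustrates why a high accuracy can be misleading when most tokens are outside any entity.

```python
from collections import Counter

# Toy IOB tags (hypothetical data, not taken from the GMB corpus):
# 9 of the 10 tokens are outside any entity.
gold_tags = ['O', 'O', 'B-per', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

# A "classifier" that always predicts the most frequent tag.
majority_tag, majority_count = Counter(gold_tags).most_common(1)[0]
predictions = [majority_tag] * len(gold_tags)

correct = sum(1 for gold, pred in zip(gold_tags, predictions) if gold == pred)
baseline_accuracy = correct / float(len(gold_tags))

print(majority_tag)       # O
print(baseline_accuracy)  # 0.9
```

Any real model should be compared against this kind of baseline rather than against 0%.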
Out-Of-Core Learning
We are used to showing all the data we have at once to our classifier. This means that we have to keep all the data in memory. This can get in our way if we want to train on a larger dataset. Keeping the dataset out of RAM is called Out-Of-Core Learning.
There are certain types of classifiers that accept the data to be presented in batches. Scikit-Learn includes a few such classifiers. Here’s the list: Scikit-Learn Incremental Classifiers. The process of learning from batches is called Incremental Learning.
The classifiers that support Incremental Learning implement the partial_fit method.
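As a rough illustration of what calling partial_fit looks like, here is a minimal sketch on hypothetical toy data (the arrays and the class names 'a', 'b', 'c' are made up for this example). Note that the first call receives the full list of classes, including one that has not been seen yet.

```python
import numpy as np
from sklearn.linear_model import Perceptron

# Hypothetical toy data, split into two batches.
X_batch1 = np.array([[0.0, 1.0], [1.0, 0.0]])
y_batch1 = np.array(['a', 'b'])
X_batch2 = np.array([[0.0, 2.0], [2.0, 0.0]])
y_batch2 = np.array(['a', 'b'])

clf = Perceptron()

# The first call must list every class that can ever occur,
# even classes that are absent from this particular batch ('c' here).
clf.partial_fit(X_batch1, y_batch1, classes=np.array(['a', 'b', 'c']))

# Later calls only need the new batch.
clf.partial_fit(X_batch2, y_batch2)

print(clf.classes_)
print(clf.coef_.shape)  # one weight vector per class
```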
Using generators
In the previous tutorial, we created a method of reading from the corpus that didn’t keep the whole dataset in memory. It did so by using a Python generator.
Unfortunately, because we had to present the whole data, we were transforming the generator into a list, thus losing the advantage of working with generators. Since we don’t need all the data this time, we’ll be slicing batches from the generator every time we call the partial_fit method. Let’s include the corpus reading routine, from the previous article here:
```python
import os

from nltk import conlltags2tree


def to_conll_iob(annotated_sentence):
    """
    `annotated_sentence` = list of triplets [(w1, t1, iob1), ...]
    Transform a pseudo-IOB notation: O, PERSON, PERSON, O, O, LOCATION, O
    to proper IOB notation: O, B-PERSON, I-PERSON, O, O, B-LOCATION, O
    """
    proper_iob_tokens = []
    for idx, annotated_token in enumerate(annotated_sentence):
        word, tag, ner = annotated_token

        if ner != 'O':
            if idx == 0:
                ner = "B-" + ner
            elif annotated_sentence[idx - 1][2] == ner:
                ner = "I-" + ner
            else:
                ner = "B-" + ner
        proper_iob_tokens.append((word, tag, ner))
    return proper_iob_tokens


def read_gmb_ner(corpus_root):
    # Walk the corpus directory and lazily yield one parsed sentence at a time
    for root, dirs, files in os.walk(corpus_root):
        for filename in files:
            if filename.endswith(".tags"):
                with open(os.path.join(root, filename), 'rb') as file_handle:
                    file_content = file_handle.read().decode('utf-8').strip()
                    annotated_sentences = file_content.split('\n\n')
                    for annotated_sentence in annotated_sentences:
                        annotated_tokens = [seq for seq in annotated_sentence.split('\n') if seq]

                        standard_form_tokens = []

                        for idx, annotated_token in enumerate(annotated_tokens):
                            annotations = annotated_token.split('\t')
                            word, tag, ner = annotations[0], annotations[1], annotations[3]

                            # Keep only the primary entity category (e.g. "geo" from "geo-nam")
                            if ner != 'O':
                                ner = ner.split('-')[0]

                            standard_form_tokens.append((word, tag, ner))

                        conll_tokens = to_conll_iob(standard_form_tokens)
                        yield conlltags2tree(conll_tokens)
```
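Slicing batches from a generator can be sketched with itertools.islice. In the snippet below, sentence_stream is a hypothetical stand-in for read_gmb_ner and iter_batches is a helper name invented for this illustration; neither is part of the tutorial's code.

```python
import itertools


def sentence_stream(n):
    """Hypothetical stand-in for read_gmb_ner: lazily yields n items."""
    for i in range(n):
        yield "sentence-%d" % i


def iter_batches(stream, batch_size):
    """Yield lists of at most batch_size items until the stream is exhausted."""
    while True:
        batch = list(itertools.islice(stream, batch_size))
        if not batch:
            break
        yield batch


batch_sizes = [len(batch) for batch in iter_batches(sentence_stream(7), 3)]
print(batch_sizes)  # [3, 3, 1]
```

Only one batch is held in memory at a time; each call to islice picks up where the previous one stopped.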
Better features
The feature detector created in the previous article wasn’t at all bad. In fact, it includes the most popular features and they have been adapted to achieve better performance. We’re going to make a few adjustments. One of the most important features in the task of Named-Entity-Recognition is the shape of the word. We’re going to create a function that describes particular word forms. You should experiment with this function and see if you get better results. Here’s my function:
```python
import re


def shape(word):
    word_shape = 'other'
    if re.match(r'[0-9]+(\.[0-9]*)?|[0-9]*\.[0-9]+$', word):
        word_shape = 'number'
    elif re.match(r'\W+$', word):
        word_shape = 'punct'
    elif re.match(r'[A-Z][a-z]+$', word):
        word_shape = 'capitalized'
    elif re.match(r'[A-Z]+$', word):
        word_shape = 'uppercase'
    elif re.match(r'[a-z]+$', word):
        word_shape = 'lowercase'
    elif re.match(r'[A-Z][a-z]+[A-Z][a-z]+[A-Za-z]*$', word):
        word_shape = 'camelcase'
    elif re.match(r'[A-Za-z]+$', word):
        word_shape = 'mixedcase'
    elif re.match(r'__.+__$', word):
        word_shape = 'wildcard'
    elif re.match(r'[A-Za-z0-9]+\.$', word):
        word_shape = 'ending-dot'
    elif re.match(r'[A-Za-z0-9]+\.[A-Za-z0-9\.]+\.$', word):
        word_shape = 'abbreviation'
    elif re.match(r'[A-Za-z0-9]+\-[A-Za-z0-9\-]+.*$', word):
        word_shape = 'contains-hyphen'

    return word_shape
```
Here’s the final feature extraction function (I also added one more IOB tag from history):
```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')


def ner_features(tokens, index, history):
    """
    `tokens`  = a POS-tagged sentence [(w1, t1), ...]
    `index`   = the index of the token we want to extract features for
    `history` = the previous predicted IOB tags
    """

    # Pad the sequence with placeholders
    tokens = [('__START2__', '__START2__'), ('__START1__', '__START1__')] + list(tokens) + \
             [('__END1__', '__END1__'), ('__END2__', '__END2__')]
    history = ['__START2__', '__START1__'] + list(history)

    # shift the index with 2, to accommodate the padding
    index += 2

    word, pos = tokens[index]
    prevword, prevpos = tokens[index - 1]
    prevprevword, prevprevpos = tokens[index - 2]
    nextword, nextpos = tokens[index + 1]
    nextnextword, nextnextpos = tokens[index + 2]
    previob = history[-1]
    prevpreviob = history[-2]

    feat_dict = {
        'word': word,
        'lemma': stemmer.stem(word),
        'pos': pos,
        'shape': shape(word),

        'next-word': nextword,
        'next-pos': nextpos,
        'next-lemma': stemmer.stem(nextword),
        'next-shape': shape(nextword),

        'next-next-word': nextnextword,
        'next-next-pos': nextnextpos,
        'next-next-lemma': stemmer.stem(nextnextword),
        'next-next-shape': shape(nextnextword),

        'prev-word': prevword,
        'prev-pos': prevpos,
        'prev-lemma': stemmer.stem(prevword),
        'prev-iob': previob,
        'prev-shape': shape(prevword),

        'prev-prev-word': prevprevword,
        'prev-prev-pos': prevprevpos,
        'prev-prev-lemma': stemmer.stem(prevprevword),
        'prev-prev-iob': prevpreviob,
        'prev-prev-shape': shape(prevprevword),
    }

    return feat_dict
```
Learning in batches
After getting the corpus reading and the feature extraction out of the way, we can focus on the cool stuff: training the NE-chunker. The code is fairly simple, but let’s first state what we want to achieve:
- The training method should receive a generator. It should only slice batches from the generator, not load the whole data into memory.
- We’re going to train a Perceptron. It trains fast and gives good results in this case.
- Keep in mind that we will use the partial_fit method.
- Because we don’t show all the data at once, we have to give a list of all the classes up front.
Let’s build our NE-chunker:
```python
import itertools

from nltk import conlltags2tree, tree2conlltags
from nltk.chunk import ChunkParserI
from sklearn.linear_model import Perceptron
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline


class ScikitLearnChunker(ChunkParserI):

    @classmethod
    def to_dataset(cls, parsed_sentences, feature_detector):
        """
        Transform a list of tagged sentences into a scikit-learn compatible IOB dataset
        :param parsed_sentences:
        :param feature_detector:
        :return:
        """
        X, y = [], []
        for parsed in parsed_sentences:
            iob_tagged = tree2conlltags(parsed)
            words, tags, iob_tags = zip(*iob_tagged)

            # Materialize the pairs: on Python 3, zip returns a one-shot iterator
            # that would be empty after the first call to the feature detector
            tagged = list(zip(words, tags))

            for index in range(len(iob_tagged)):
                X.append(feature_detector(tagged, index, history=iob_tags[:index]))
                y.append(iob_tags[index])

        return X, y

    @classmethod
    def get_minibatch(cls, parsed_sentences, feature_detector, batch_size=500):
        # Slice the next batch_size sentences off the generator
        batch = list(itertools.islice(parsed_sentences, batch_size))
        X, y = cls.to_dataset(batch, feature_detector)
        return X, y

    @classmethod
    def train(cls, parsed_sentences, feature_detector, all_classes, **kwargs):
        X, y = cls.get_minibatch(parsed_sentences, feature_detector, kwargs.get('batch_size', 500))
        # The vectorizer's vocabulary is learned from the first batch only
        vectorizer = DictVectorizer(sparse=False)
        vectorizer.fit(X)

        # partial_fit always performs a single pass over each batch, so no
        # iteration count is passed here; an n_iter keyword is accepted but unused
        clf = Perceptron(verbose=10, n_jobs=-1)

        while len(X):
            X = vectorizer.transform(X)
            clf.partial_fit(X, y, all_classes)
            X, y = cls.get_minibatch(parsed_sentences, feature_detector, kwargs.get('batch_size', 500))

        clf = Pipeline([
            ('vectorizer', vectorizer),
            ('classifier', clf)
        ])

        return cls(clf, feature_detector)

    def __init__(self, classifier, feature_detector):
        self._classifier = classifier
        self._feature_detector = feature_detector

    def parse(self, tokens):
        """
        Chunk a tagged sentence
        :param tokens: List of words [(w1, t1), (w2, t2), ...]
        :return: chunked sentence: nltk.Tree
        """
        history = []
        iob_tagged_tokens = []
        for index, (word, tag) in enumerate(tokens):
            iob_tag = self._classifier.predict([self._feature_detector(tokens, index, history)])[0]
            history.append(iob_tag)
            iob_tagged_tokens.append((word, tag, iob_tag))

        return conlltags2tree(iob_tagged_tokens)

    def score(self, parsed_sentences):
        """
        Compute the accuracy of the tagger for a list of test sentences
        :param parsed_sentences: List of parsed sentences: nltk.Tree
        :return: float 0.0 - 1.0
        """
        X_test, y_test = self.__class__.to_dataset(parsed_sentences, self._feature_detector)
        return self._classifier.score(X_test, y_test)
```
This is how we train it:
```python
def train_perceptron():
    reader = read_gmb_ner("gmb-2.2.0")

    all_classes = ['O', 'B-per', 'I-per', 'B-gpe', 'I-gpe',
                   'B-geo', 'I-geo', 'B-org', 'I-org', 'B-tim', 'I-tim',
                   'B-art', 'I-art', 'B-eve', 'I-eve', 'B-nat', 'I-nat']

    # Train on the first 50000 sentences, then evaluate on the next 5000
    pa_ner = ScikitLearnChunker.train(itertools.islice(reader, 50000), feature_detector=ner_features,
                                      all_classes=all_classes, batch_size=500, n_iter=5)
    accuracy = pa_ner.score(itertools.islice(reader, 5000))
    print("Accuracy: %s" % accuracy)  # 0.970327096314 in the author's run
```
We’ve achieved a boost of about 4 percentage points, from roughly 93% to 97% accuracy. That’s huge at this level, where every percentage point counts. Congrats, you just trained a NE-Chunker with 97% accuracy.
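Since most tokens carry the O tag, it can also be useful to look at precision and recall for the entity tags only. The sketch below is not part of the original pipeline: it uses scikit-learn's metrics on hypothetical toy predictions to show how the overall accuracy and the per-class scores can differ.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical gold and predicted IOB tags for 8 tokens
y_true = ['O', 'B-per', 'I-per', 'O', 'B-geo', 'O', 'B-geo', 'O']
y_pred = ['O', 'B-per', 'I-per', 'O', 'B-geo', 'O', 'O', 'B-geo']

# Score only the entity tags, leaving out the dominant 'O' class
entity_labels = ['B-per', 'I-per', 'B-geo']
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=entity_labels, average=None)

print(accuracy_score(y_true, y_pred))  # 0.75
for label, p, r in zip(entity_labels, precision, recall):
    print("%s precision=%.2f recall=%.2f" % (label, p, r))
```

On real data, X_test and y_test from to_dataset, together with the pipeline's predictions, could be fed to the same functions.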
Takeaways
- The more data you use, the better.
- Keeping things on disk, rather than in RAM, helps us train on larger datasets.
- Scikit-Learn includes models that can be incrementally trained.
- Using a fancier classifier isn’t always the best solution.
If this was too abrupt for you, check out the Complete guide to training a NER System (Named-Entity-Recognition).
Hello,
Thanks for this nice tutorial. I have been trying out some examples with your code. I run into problems when running the train method. How do you call the method to train on the corpus and test it against new data?
In your previous tutorial you call it like that:
print chunker.parse(pos_tag(word_tokenize("I'm going to Germany this Monday.")))
This is not working any more with sklearn. I keep getting the error:
sklearn.exceptions.NotFittedError: This Perceptron instance is not fitted yet
any advice and suggestions will be greatly appreciated.
Hi Rod,
The error you are getting indeed means that the training has not been performed. Mind that there are some changes (even the corpus read method) and I recommend starting from scratch.
Does running the train_perceptron() function work? If not, what’s the issue?
Thanks,
Bogdan.
Hi Bogdan,
Thanks for the reply.
I rewrote the code from scratch and it seems to work fine until I run train_perceptron().
Then I get the error:
nextword, nextpos = tokens[index + 1]
IndexError: list index out of range
But it is just the same code as in the post and the corpus is the same (“gmb-2.2.0”). I checked “def ner_features” and I don’t really understand why the index is out of range.
Is there something wrong with “read_corpus_ner()”? It seems it is not passing the data in the proper format.
This is what I get when I convert the corpus to a list just before passing it to the classifier:
(‘through’, ‘IN’)
(‘the’, ‘DT’)
(‘of’, ‘IN’)
(‘marchers’, ‘NNS’)
(‘eve’, ‘NN’)
(gpe Britain/NNP)
(‘of’, ‘IN’)
(‘second’, ‘JJ’)
(‘of’, ‘IN’)
(‘to’, ‘TO’)
(‘facility’, ‘NN’)
(‘backing’, ‘NN’)
(tim Tuesday/NNP)
Thanks
Rod
Any advice about this issue? Should I use another format for the input data?
Thanks
Hi Rod,
I don’t know what to tell you 🙂 Seems to be fine to me. Maybe some good ol’ debugging is in order. Let me know if you figure things out.
Bogdan.
Hi Bogdan,
Thanks for your reply. I think the problem is how the corpus data is being passed to the classifier, or how the classifier is interpreting the data.
I see you have a parse function in the ScikitLearnChunker class. Does the data need to be parsed before the classification?
How would you call the train_perceptron() function? I think this is a good clue, since I am using the same corpus data as you.
Thanks,
Rod
Hi, although this might be a bit old: if you use Python 3, the zip behaviour has changed. You need to change line 17 in the ScikitLearnChunker (the tagged = zip(words, tags) line in to_dataset) to tagged = list(zip(words, tags)). Then, in def ner_features, you don’t pass an empty list on the different iterations.
Hi,
Nice tutorial I enjoyed it! Only question I have is with regards to the dictionary vectorizer. This is only being fitted on the first iteration so your vector space is surely defined by the first batch of data. What if you have words, features, word shapes etc that haven’t been seen in this first iteration? Looking at the sklearn docs it seems their feature values will always be 0? Therefore I’m struggling to see how this can ever outperform the in-core approach to fitting the classifier. Do you know of a way to account for this?
Many thanks in advance
Dom
Hi Dom,
Indeed that’s the case. The vectorizer, as well as the classifier, are fitted only at training phase. Here’s a scenario:
– We fit the vectorizer and classifier on some labelled data
– We get new unlabelled data. We “re-fit” the vectorizer with the new data. What should the classifier do with the new “words” provided by the vectorizer since it doesn’t have labels for the new samples?
Bogdan.
Hi Bogdan,
Thanks for the response! But I mean even during training the vectorizer is only fitted on the first batch so any new words, parts of speech, stems etc in the subsequent partial-fits will all be given a value of 0. Whereas with the in-core approach they will not as they will all have gone through the fitting and transforming of the vectorizer.
Many thanks
Dom
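The behaviour Dom describes can be reproduced in a few lines. The sketch below uses hypothetical toy feature dicts. It also shows FeatureHasher, a stateless scikit-learn transformer, as one possible alternative; this is a swapped-in technique and an assumption on my part, not what the tutorial uses, and it trades the fixed vocabulary for possible hash collisions.

```python
from sklearn.feature_extraction import DictVectorizer, FeatureHasher

# Hypothetical feature dicts: the second batch contains a word unseen in the first
first_batch = [{'word': 'London', 'pos': 'NNP'}]
later_batch = [{'word': 'Paris', 'pos': 'NNP'}]

# DictVectorizer: the vocabulary is frozen after fit()
vectorizer = DictVectorizer(sparse=False)
vectorizer.fit(first_batch)
vocabulary = sorted(vectorizer.vocabulary_)
later_dense = vectorizer.transform(later_batch)

print(vocabulary)   # ['pos=NNP', 'word=London']
print(later_dense)  # 'word=Paris' is silently dropped: [[1. 0.]]

# FeatureHasher: no fit step, every feature is mapped to a column by hashing
hasher = FeatureHasher(n_features=2 ** 18, input_type='dict')
later_hashed = hasher.transform(later_batch)
print(later_hashed.shape)  # (1, 262144)
```

Because the hasher has no vocabulary, features that first appear in a later batch still get a column, at the cost of not being able to map columns back to feature names.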
Hey Bogdani
I used the exact same code, but I get a list index out of range error.
the full traceback is below:
Traceback (most recent call last):
  File "learn3.py", line 240, in <module>
    train_perceptron()
  File "learn3.py", line 237, in train_perceptron
    n_iter=5)
  File "learn3.py", line 180, in train
    X, y = cls.get_minibatch(parsed_sentences, feature_detector, kwargs.get('batch_size', 500))
  File "learn3.py", line 175, in get_minibatch
    X, y = cls.to_dataset(batch, feature_detector)
  File "learn3.py", line 167, in to_dataset
    X.append(feature_detector(tagged, index, history=iob_tags[:index]))
  File "learn3.py", line 112, in ner_features
    nextnextword, nextnextpos = tokens[index + 2]
IndexError: list index out of range
I have been trying to fix this, but no luck.
Any advice ?
I’ll have to look into this issue. Had a few other complaints on the matter 🙁
Hey
So I tried the same code with Python 2 and Python 3. When run on Python 2 it does not produce any error, but when run on Python 3 it produces the error. I think the cause is the difference in the behavior of zip() between Python 3 and Python 2.
Thanks
Andy
That has to be it. Thanks for looking into this!
I had the same problem. In Python 3, zip returns an iterator, so you have to wrap the zip call in to_dataset in a list, as in list(zip(words, tags)), and then everything works.
Thanks for the great tutorial
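For readers hitting the same error, here is a minimal sketch of the zip behaviour discussed above, assuming Python 3. The words and tags are hypothetical.

```python
words = ('London', 'calling')
tags = ('NNP', 'VBG')

# In Python 3, zip returns a one-shot iterator
tagged = zip(words, tags)
first_pass = list(tagged)   # consumes the iterator
second_pass = list(tagged)  # nothing is left the second time

print(first_pass)   # [('London', 'NNP'), ('calling', 'VBG')]
print(second_pass)  # []

# Materializing the pairs once makes them reusable across calls
tagged_list = list(zip(words, tags))
print(list(tagged_list) == list(tagged_list))  # True
```

Since ner_features calls list(tokens) on every invocation, the second token of a sentence already sees an empty sequence, which is what triggers the IndexError.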
Hey Bogdani
Even though the GMB corpus has tokens with the ‘art’ ner tag, the model doesn’t seem to be getting trained on it. Is there anything that has to be changed in order to get it to recognize artifacts ?
Thanks
Andy
Hi Andy,
Indeed, I’ve noticed that myself and I started inspecting the corpus. You’ll notice that very few “artefacts” are tagged and even the annotated ones are very noisy. That’s the reason why the NER doesn’t pick up ART tags. Sorry to deliver the bad news 😛
Bogdan.
Hello Bogdani,
I am a beginner in NLP and NER, so how should I proceed? Could you please help me out?
Hey Kanishka,
Sure, here you go: http://nlpforhackers.io/getting-started/ and for a more formal introduction, check out this Stanford course: https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1162/
Best,
Bogdan.
Hi
Great to see such a great way of implementing machine learning. Quite impressive!
I have a confusion regarding the way you have extracted features. Why haven’t you used next-iob and next-next-iob as features ?
How have you done Feature Selection for this problem?
Thanks
Hi Abi,
I haven’t used next-iob and next-next-iob simply because when you tag an unknown sentence (not from the corpus), there’s no way of knowing what those are. You only know what has already been tagged (prev-iob, prev-prev-iob).
Bogdan.
Hi,
Very nice tutorial. I tried your code in Python 3 and I get an error: “line 297, in transform Xa = np.zeros((len(X), len(vocab)), dtype=dtype) MemoryError”
My guess is that my computer doesn’t have enough memory. Do you have any idea what the maximum memory consumption of this program is?
Thanks a lot.
Hi Iraklis,
Interesting, never had this issue. Does the error go away if you adjust the number of training samples here: itertools.islice(reader, 50000)?
Bogdan
Hi,
Great Tutorial.
I tried your code in python 2.7 and when I tried to use more than 20000 training samples, it always gives me this error:
“Xa = np.zeros((len(X), len(vocab)), dtype=dtype) MemoryError”
Have you come across any solution?
Thanks a lot.
Hi Bogdan,
Thank you for the tutorial. I had the same issue.
297 vocab = self.vocabulary_
298 X = _tosequence(X)
–> 299 Xa = np.zeros((len(X), len(vocab)), dtype=dtype)
300
301 for i, x in enumerate(X):
MemoryError:
right here, it shows memory error. I adjusted the number to 5000 already, still showing memory error. Can you help?
Thanks,
Susie
Can you go even lower to make sure that is the issue?
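The traceback points at the dense array allocated by DictVectorizer(sparse=False). One option that may help, sketched below on hypothetical toy data and not tested on the GMB corpus, is to let the vectorizer produce a SciPy sparse matrix instead; Perceptron.partial_fit accepts sparse input.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import Perceptron

# Hypothetical samples: 1000 distinct words, all with the same POS tag
samples = [{'word': 'w%d' % i, 'pos': 'NN'} for i in range(1000)]
labels = ['O'] * 999 + ['B-geo']

dense = DictVectorizer(sparse=False).fit_transform(samples)
sparse = DictVectorizer(sparse=True).fit_transform(samples)

print(dense.shape)   # (1000, 1001)
print(dense.nbytes)  # 8008000 bytes, almost all of them zeros
print(sparse.nnz)    # 2000 stored values

# The sparse matrix can be passed to partial_fit directly
clf = Perceptron()
clf.partial_fit(sparse, labels, classes=['B-geo', 'O'])
```

The dense array grows with batch size times vocabulary size, while the sparse matrix only stores the non-zero entries, roughly one per feature in each sample.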
Hi ! Thanks for the nice tutorial guide!
It might not be directly related to your post, but I have a question about NER on a specific kind of text. I would really appreciate it if you could help me!
I’m trying to extract only the companies’ names from the firm’s disclosure data.
So, basically, I want to classify every word in a sentence into ‘Non-company name word’ and ‘company name word’.
For example, if my desired NER tagger receives the sentence below:
“In fiscal 2003, Garmin accounted for 25% of display segment net sales, while GE Medical accounted for approximately 15% of display segment net sales.”
it should label only the ‘Garmin’ and ‘GE Medical’ as company name.
At first, I used Stanford NER tagger as it has ‘ORGANIZATION’ tag. However, it quite often tags some common noun as ORGANIZATION.
For example,
‘At December;31, 2004 and 2003, – 52 – Table of Contents NOTES TO CONSOLIDATED FINANCIAL STATEMENTS accounts receivable from automotive industry customers totaled $15,248,000 and $15,166,000, respectively.’
Sometimes it tags words like CONSOLIDATED or STATEMENTS as ‘ORGANIZATION’.
I think this is because the disclosure data is somewhat systematically different from the general text data on which the Stanford NER tagger is trained.
So, my question is: is there any way that I can modify the NER tagger, for example by raising the confidence threshold for tagging a word as ORGANIZATION, or by training the tagger to better fit disclosure data?
Or is there any way that I could train my own tagger (one that only needs to classify whether a word is a company name or not) following your tutorial?
I would really appreciate any suggestions!
Thanks very much!!
Is the full working example somewhere in github? (in case there were changes since then and just to see a complete working example) thanks.
Sorry, not yet.
The code is not working, there is no output. Can you please help me?
Please provide more info. Please note that this works with Python 2.7. What Python version are you using?
Hello, thanks a lot for such a great series of tutorials.
I am trying to train on my own CSV data consisting of words, POS and TAG. Can you please help me with that, i.e. how to import the CSV data here using pandas or any other library?
thanks
I am using python3
how do I use a whole document like movie reviews file to test the ner system?
Something like this: chunker.parse(pos_tag(word_tokenize("I'm going to Germany this Monday.")))
Thanks for the tutorial. Is it possible to use a chunk-tag scheme other than IOB, such as BILOU? And can we use another corpus?
Of course, you can. As long as you pass an iterable to the NER Chunker, all should go smoothly. What corpus do you have in mind?
I am in the process of building a NER for the Indonesian language, so I need an Indonesian corpus, but there is no corpus in a GMB-like format (POS tags and NER tags in one file). The best I can find are these:
https://aclweb.org/aclwiki/Resources_for_Indonesian
https://github.com/UniversalDependencies/UD_Indonesian
https://github.com/famrashel/idn-tagged-corpus
https://github.com/yohanesgultom/nlp-experiments/blob/master/data/ner/training_data.txt
Hi Irfan,
I wouldn’t worry too much about the fact that the corpus is not POS/NER annotated. Build a POS tagger using one corpus, then use the predicted POS tags as features for the NER.
Bogdan.
Hi, bogdani,
I have already built a POS tagger, but how can you predict NER with POS tags if there is no training data annotated with both POS and NER tags? By using a rule-based method? Can we still use a machine learning method? I am confused.
Hi Irfan:
You can annotate the NER dataset with POS tags. Then, when training the NER, take those annotations as features. Example:
“Going to NY” -> Going/VBG to/TO NY/NNP -> Going/VBG/O to/TO/O NY/NNP/I-LOC
Makes sense?
Okay, thank you. I understand now. I have another question: what happens when we use the previous-word feature at the start of a sentence? Does the value become null, or something else? Does it affect the model?
Hi, Bogdani
Thank you for answering my earlier question. I am trying to implement your tutorial with the Multinomial Naive Bayes method, but it gives me an error.
I am following every step with minor changes, replacing clf = Perceptron with clf = MultinomialNB(). Do you know what the error is about? Can you give me an example of how to make it run with MultinomialNB? Thank you.
Read the text too:
“There are certain types of classifiers that accept the data to be presented in batches. Scikit-Learn includes a few such classifiers. Here’s the list: Scikit-Learn Incremental Classifiers. The process of learning from batches is called Incremental Learning.”
I managed to fix it. Thanks for the answer.
Hi Bogdani, how to count precision, recall, and F1 score of individual class from the classification above?
Hi,
Check this out: https://nlpforhackers.io/classification-performance-metrics/
Hello Bogdani, I am curious, why don’t you use Multinomial Naive Bayes or just Naive Bayes like the first NER tutorial?
Hi Irfan,
From the article:
There are certain types of classifiers that accept the data to be presented in batches. Scikit-Learn includes a few such classifiers. Here’s the list: Scikit-Learn Incremental Classifiers. The process of learning from batches is called Incremental Learning.
Try to replace the current classifier with a NaiveBayes 🙂
Bogdan
Hi Bogdani,
In your tutorial, we use 100% of our training data as test data and compute the accuracy with the score function. What if I want to use another dataset as test data?
Hi Irfan,
We are not using the entire dataset for training and testing.
pa_ner = ScikitLearnChunker.train(itertools.islice(reader, 50000), ...
accuracy = pa_ner.score(itertools.islice(reader, 5000))
We train on the first 50000 and test on the next 5000
can i get this working code
The code is working, don’t worry. It is written in Python2.7
what is the input to it
Maybe read this first? https://nlpforhackers.io/named-entity-extraction/
thanks
can i get this code done in spacy
Hey, my accuracy is not as good as yours; it’s around 86 percent. What is going wrong?
Hello, thank you for your awesome blog! I am learning a lot. I work in the Spanish language, and I have been adapting your tutorials successfully. Here is an example I did yesterday
The downside to working in Spanish is the scarcity of annotated data. NLTK’s conll2002 Spanish corpus has just 5,000 sentences.
Since a POS tagger is the first step for building a NER tagger, I need to find a good dataset with POS annotations. Do you happen to know where to find a large Spanish dataset?
Thank you!
YES! I have some experience with the IULA corpus: https://repositori.upf.edu/handle/10230/20049 (hope this is the good download link)