Splitting text into sentences
Few people realise how tricky splitting text into sentences can be. Most of the NLP frameworks out there already have English models created for this task.
You might encounter issues with the pretrained models if:
1. You are working with a specific genre of text (usually technical) that contains strange abbreviations.
2. You are working with a language that doesn’t have a pretrained model (for example, Romanian).
Here’s an example of the first scenario:
from nltk import sent_tokenize

sentence = "My friend holds a Msc. in Computer Science."
print(sent_tokenize(sentence))

# ['My friend holds a Msc.', 'in Computer Science.']
Under the hood, NLTK’s sent_tokenize function uses an instance of PunktSentenceTokenizer.

PunktSentenceTokenizer is an unsupervised, trainable model. This means it can be trained on unlabeled data, i.e. text that is not split into sentences.

Behind the scenes, PunktSentenceTokenizer learns the abbreviations in the text. This is the mechanism the tokenizer uses to decide where to “cut”.
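To see this mechanism at work, you can peek at the abbreviations stored inside the pretrained English model that sent_tokenize relies on. This is just a quick sketch: it assumes you have downloaded the model with nltk.download('punkt') and that your NLTK version still ships the pickled Punkt models, and it pokes at the private _params attribute, which is not a stable public API:

import nltk

# Load the pretrained English Punkt model that ships with NLTK
# (requires a prior nltk.download('punkt'))
english_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

# Print a few of the abbreviations the model learned during training
print(sorted(english_tokenizer._params.abbrev_types)[:20])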
We’re going to study how to train such a tokenizer and how to manually add abbreviations to fine-tune it.
Training a Punkt Sentence Tokenizer
Let’s first build a corpus to train our tokenizer on. We’ll use the Gutenberg corpus that ships with NLTK:
from nltk.corpus import gutenberg

print(dir(gutenberg))
print(gutenberg.fileids())

text = ""
for file_id in gutenberg.fileids():
    text += gutenberg.raw(file_id)

print(len(text))
# 11793318
The NLTK API for training a PunktSentenceTokenizer is a bit counter-intuitive. Here’s a snippet that works:
from pprint import pprint
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True
trainer.train(text)

tokenizer = PunktSentenceTokenizer(trainer.get_params())

# Test the tokenizer on a piece of text
sentences = "Mr. James told me Dr. Brown is not available today. I will try tomorrow."

print(tokenizer.tokenize(sentences))
# ['Mr. James told me Dr.', 'Brown is not available today.', 'I will try tomorrow.']

# View the learned abbreviations
print(tokenizer._params.abbrev_types)
# {...}

# Here's how to debug every split decision
for decision in tokenizer.debug_decisions(sentences):
    pprint(decision)
    print('=' * 30)
As you can see, the tokenizer correctly detected the abbreviation “Mr.” but not “Dr.”. Let’s fine-tune the tokenizer by adding our own abbreviations.
Adding more abbreviations
Let’s add the “Dr.” abbreviation to the tokenizer. The operation is extremely simple. Remember to add the abbreviations without the trailing punctuation and in lowercase.
tokenizer._params.abbrev_types.add('dr')

print(tokenizer.tokenize(sentences))
# ['Mr. James told me Dr. Brown is not available today.', 'I will try tomorrow.']

for decision in tokenizer.debug_decisions(sentences):
    pprint(decision)
    print('=' * 30)
And that’s a wrap. Using what you’ve learned here, you can now train or adjust a sentence splitter for any language.
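For the second scenario mentioned at the start (a language without a pretrained model, such as Romanian), the procedure is exactly the same; only the training corpus changes. Here’s a minimal sketch, assuming you have a large plain-text corpus of the target language saved as romanian_corpus.txt (a placeholder file name):

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Read a raw, unlabeled corpus in the target language.
# 'romanian_corpus.txt' is a placeholder - substitute any large plain-text file.
with open('romanian_corpus.txt', encoding='utf-8') as f:
    romanian_text = f.read()

trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True
trainer.train(romanian_text)

romanian_tokenizer = PunktSentenceTokenizer(trainer.get_params())
print(romanian_tokenizer.tokenize("Dl. Popescu nu este disponibil azi. Voi reveni mâine."))

How well abbreviations like “Dl.” (the Romanian equivalent of “Mr.”) are handled will, of course, depend on the corpus you train on.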
Hi,
I’d like to ask whether tf-idf with noun phrases can be implemented, and how to build a model on training data using one of the classifiers for the 20 Newsgroups data? If you have some explanation, please share it.
Hi Alaa,
Not sure if this is what you are asking, but here it goes:
– You can use the conll2000 corpus to build your own NP-Chunker: http://nlpforhackers.io/text-chunking/
– You can feed the NPs to a scikit-learn TfIdfVectorizer (or create a custom vectorizer)
Let me know if you have a practical example
Thank you for your comments. I do not have a practical example. I am trying to enhance the performance of tf-idf by moving from weighting terms to weighting related terms like noun phrases.
Can I read about this somewhere?
Thanks
Sure. I am asking about the conll2000 training data with chunking: which classifier is used?
You can check this article on chunking: http://nlpforhackers.io/text-chunking/
It uses NLTK’s implementation of the Naive Bayes classifier under the hood.
Thank you very much. Do you have an example, please?
Hi ALAA,
Yes, the example is right within the article. This is the line where you instantiate a classifier:
self.tagger = ClassifierBasedTagger(
    train=chunked_sents,
    feature_detector=features,
    **kwargs)
Cheers
Thank you very much for your explanation. Did you find anything new about tf-idf with noun phrases?
Did you check this post on TfIdf: http://nlpforhackers.io/tf-idf/
The easiest way to do what you’re looking for is to grab the code for doing Tf-Idf with Scikit-learn and replace the tokenizer with an NP-extractor from the other article I’ve mentioned. All the pieces are there 😀
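Something along these lines, as a rough sketch. Note that extract_noun_phrases below is a hypothetical stand-in for the NP-chunker from the chunking article; you’d plug your own implementation in there:

from sklearn.feature_extraction.text import TfidfVectorizer

def extract_noun_phrases(document):
    # Hypothetical stand-in: return the noun phrases your NP-chunker
    # (from the text-chunking article) produces for this document.
    return document.split()  # placeholder so the sketch runs end to end

# Use noun phrases as the "tokens" that receive tf-idf weights
vectorizer = TfidfVectorizer(tokenizer=extract_noun_phrases)

documents = [
    "The quick brown fox jumps over the lazy dog.",
    "A lazy dog sleeps all day.",
]
vectorizer.fit_transform(documents)
print(vectorizer.get_feature_names())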
Bogdan.
Hi. Thanks for your tutorial. I need code for a frequency cut in preprocessing. Can you help me, please?
Can you please be a bit more specific?
Yes. I am working on a new text categorization method using ensemble classification, and I use Python and NLTK for my implementation. In preprocessing I have 3 steps: 1. stemming with the Porter method, 2. stop word removal, 3. frequency cut.
I need to define the frequency cut and implement it in Python. Is my information enough? Can you help me with frequency cut code in Python?
Hi there,
That’s a totally separate problem, nothing to do with sentence splitting. Not sure there’s a standard method though; it’s usually a parameter that needs tuning.
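If by frequency cut you mean dropping terms that occur too rarely (or too often), and you’re building your features with scikit-learn, one common option is the min_df / max_df parameters of the vectorizer. A minimal sketch; the thresholds are arbitrary examples, not recommended values:

from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# min_df=2 drops terms that appear in fewer than 2 documents;
# max_df=0.9 drops terms that appear in more than 90% of documents.
# Both thresholds need tuning for your data.
vectorizer = CountVectorizer(min_df=2, max_df=0.9)
vectorizer.fit_transform(documents)
print(vectorizer.get_feature_names())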
Bogdan.
Hi. Do you mean that you cannot help me with my project? Is sentence splitting similar to frequency cut?
I am going to design a learning machine. How do I define sentences for my learning machine?
For classification I have used the Random Forest algorithm. I need Random Forest code. Can you help me, please?
Does the code at the top of this page run correctly?
Hello Bogdani. Thanks for this great tutorial. I’m facing a problem with splitting text scraped from the internet, like user comments. There are some cases where there’s no space after a full stop or other punctuation, and the data is not in English. Here’s an example:
Hello! “Trup” the president of US.
How can I solve this kind of problem? I want them to be two different sentences.
Thank you
Hi Ardit. Not sure if this solves your problem, but you can try the NLTK TweetTokenizer: https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.casual.TweetTokenizer
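Here’s a quick sketch of how you’d run it on your example; whether the word-level output fits your use case is something you’d have to judge:

from nltk.tokenize import TweetTokenizer

text = 'Hello! "Trup" the president of US.'

# TweetTokenizer is a word-level tokenizer designed for noisy,
# user-generated text such as tweets and comments
tokenizer = TweetTokenizer()
print(tokenizer.tokenize(text))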
Give that a try 🙂
Thanks for your quick reply. I’ll give it a try
Thank you very much for the valuable post.
How do I use CountVectorizer in scikit-learn to find the BoW of non-English text? It’s not working for non-English. How can I call a language-specific word tokenization function inside the CountVectorizer?
This seems to be a bit off topic. Anyway, just pass a function to the tokenizer parameter: CountVectorizer(tokenizer=your_custom_tokenizer)
xx = ['Hai how r u', 'welcome dera hj']

import nltk

def token(x):
    w = nltk.word_tokenize(x)
    return w

token(xx)

vectorizer = CountVectorizer(lowercase=True, tokenizer=token(), ngram_range=(1,1), stop_words='english')
vectorizer.fit(X_train)
vectorizer.transform(X_train)
print(vectorizer.get_feature_names())
But I got an error like “expected string or bytes-like object”.
How can I resolve it?
Try passing the function as the parameter, not applying it:
CountVectorizer(lowercase=True, tokenizer=token, ngram_range=(1,1), stop_words='english')
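For completeness, here’s a minimal corrected sketch of the whole snippet. I’m fitting on your xx list, since X_train isn’t shown in your comment, and you’ll also need the punkt model (nltk.download('punkt')) for word_tokenize to work:

import nltk
from sklearn.feature_extraction.text import CountVectorizer

xx = ['Hai how r u', 'welcome dera hj']

def token(x):
    # x is a single document (a string), not the whole list
    return nltk.word_tokenize(x)

# Pass the function itself (token), do not call it (token())
vectorizer = CountVectorizer(lowercase=True, tokenizer=token,
                             ngram_range=(1, 1), stop_words='english')
vectorizer.fit(xx)
print(vectorizer.transform(xx))
print(vectorizer.get_feature_names())  # newer scikit-learn versions: get_feature_names_out()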