Splitting text into sentences

Few people realise how tricky splitting text into sentences can be. Most NLP frameworks already ship pretrained English models for this task.

You might encounter issues with the pretrained models if:

1. You are working with a specific genre of text (usually technical) that contains strange abbreviations.
2. You are working with a language that doesn't have a pretrained model (for example, Romanian).

Here’s an example of the first scenario:
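The original snippet isn't preserved here, so the following is a minimal sketch of the idea: feed NLTK's pretrained tokenizer some text containing a domain abbreviation it may not know. The sample sentence and the abbreviation "approx." are my own; the exact split you get can vary with your NLTK version.

```python
from nltk.tokenize import sent_tokenize  # requires: nltk.download('punkt')

# "approx." is a technical-style abbreviation the pretrained English model
# may not recognise, which can trigger a spurious sentence break after it.
text = "The tumor measured approx. 3 cm in diameter. No metastasis was found."
for sentence in sent_tokenize(text):
    print(sentence)
```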

Under the hood, NLTK's sent_tokenize function uses an instance of PunktSentenceTokenizer.
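You can load that underlying Punkt model yourself from NLTK's data package and get the same behaviour as sent_tokenize for English:

```python
import nltk

# Load the pretrained English Punkt model that sent_tokenize uses internally.
tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")
print(tokenizer.tokenize("Mr. Smith is here. He arrived at 5 p.m. yesterday."))
```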

The PunktSentenceTokenizer is an unsupervised trainable model: it can be trained on unlabeled data, i.e., raw text that has not been split into sentences.

Behind the scenes, PunktSentenceTokenizer learns the abbreviations present in the text. This is the mechanism the tokenizer uses to decide where to “cut”.
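One way to see this, assuming the `tokenizer` loaded above, is to peek at the model's learned abbreviation set. It lives on an internal attribute (`_params.abbrev_types`), so treat this as an inspection trick rather than public API:

```python
# A sample of the abbreviations the pretrained English model has learned.
print(sorted(tokenizer._params.abbrev_types)[:20])
```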

We’re going to study how to train such a tokenizer and how to manually add abbreviations to fine-tune it.

Training a Punkt Sentence Tokenizer

Let's first build a corpus to train our tokenizer on. We'll use text that ships with NLTK:
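The original corpus-building code isn't shown, so here is one possible sketch: concatenating the raw text of NLTK's Gutenberg corpus (run `nltk.download('gutenberg')` first if needed). Any reasonably large body of plain text would do.

```python
from nltk.corpus import gutenberg

# Concatenate the raw text of every Gutenberg file into one training string.
text = "".join(gutenberg.raw(fileid) for fileid in gutenberg.fileids())
print(len(text))  # total characters in the training corpus
```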

The NLTK API for training a PunktSentenceTokenizer is a bit counter-intuitive. Here’s a snippet that works:
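The working snippet isn't reproduced here, but the standard pattern is to drive a `PunktTrainer` yourself and then build the tokenizer from its learned parameters. A sketch, assuming the `text` corpus from above (the test sentence is my own):

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True  # also learn collocations, not just abbreviations
trainer.train(text)

# Build a tokenizer directly from the trained parameters.
tokenizer = PunktSentenceTokenizer(trainer.get_params())

print(tokenizer.tokenize("Mr. James told Dr. Brown about the experiment."))
```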

As you can see, the tokenizer correctly detected the abbreviation “Mr.” but not “Dr.”. Let’s fine-tune the tokenizer by adding our own abbreviations.

Adding more abbreviations

Let’s add the “Dr.” abbreviation to the tokenizer. The operation is extremely simple. Remember to add the abbreviations without the trailing punctuation and in lowercase.
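Assuming the trained `tokenizer` from the previous snippet, the addition looks like this:

```python
# Abbreviations are stored lowercase and without the trailing period.
tokenizer._params.abbrev_types.add("dr")

print(tokenizer.tokenize("Mr. James told Dr. Brown about the experiment."))
# The text is now kept as a single sentence.
```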

And that's a wrap. With what you've learned here, you can now train or adjust a sentence splitter for any language.