This is probably the first post I should have written on the blog. The thing is, I did machine learning and natural language processing for a long time before putting the concepts in order inside my own mind.
I’ve learned techniques and hacks to boost the precision of classifiers before fully understanding how a classifier computes its weights. So I guess it makes sense to publish a general introductory post after some real hands-on posts.
Here’s a popular diagram used to describe what data science usually implies:
You probably figured out by now that Natural Language Processing has something to do with data science. Indeed it does. NLP employs many of the techniques used in data science, adds a few of its own, and puts a new spin on others.
I would say that Natural Language Processing also implies a good understanding of grammar. If your native language is as hard as mine, you probably hated it in school.
I like to think of grammar as the “science” of turning plain language into mathematical objects: transforming bits and pieces of text into formal objects that you can use programmatically. Some grammar-related tasks that you’ll probably use very often in NLP are:
- Splitting text into sentences.
- Finding the part of speech for each word in a sentence.
- Determining the different types of subordinate clauses.
- Determining the subject and the direct object of a sentence.
Although these might seem trivial for humans, they prove to be difficult tasks for machines, mostly because of the ambiguity of natural language. We humans don’t have such a hard time untangling the ambiguities because we have common knowledge and prior experience.
A very popular example of an ambiguous sentence is this (credits: byrdseed.com):
I saw a man on a hill with a telescope.
Here are some of the meanings:
- There’s a man on a hill, and I’m watching him with my telescope.
- There’s a man on a hill, who I’m seeing, and he has a telescope.
- There’s a man, and he’s on a hill that also has a telescope on it.
- I’m on a hill, and I saw a man using a telescope.
- There’s a man on a hill, and I’m sawing him with a telescope.
You probably never even thought of meaning 5, because your common sense tells you that you can’t saw somebody with a telescope. For a machine, though, it is a perfectly valid reading. Figurative language is another tricky category: what would a metaphor or an idiom mean to a machine?
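To get a feel for even the simplest of the tasks listed above — splitting text into sentences — here’s a minimal regex-based splitter. It’s only a toy sketch: real NLP libraries handle abbreviations (“Dr.”), quotes and ellipses far more carefully.

```python
import re

def split_sentences(text):
    # Naive rule: a sentence ends at ".", "!" or "?" followed by
    # whitespace and an uppercase letter. Abbreviations like "Dr."
    # will fool this; real splitters use much richer rules or models.
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())
    return [p for p in parts if p]

print(split_sentences("NLP is fun. It is also hard! Do you agree?"))
# → ['NLP is fun.', 'It is also hard!', 'Do you agree?']
```

Even this tiny example shows the ambiguity problem: the period in “Dr. Smith” looks exactly like a sentence boundary to the regex.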
Names and disciplines
Natural Language Processing can take different names. There may be some differences between them but the general direction of deriving meaning or understanding natural language is the same. Here are some alternative names:
- Computational Linguistics (nowadays used usually by people coming from a traditional linguistics background)
- Text Mining (usually when a lot of math/statistics is involved)
- Natural Language Understanding
Here are some disciplines related to modern natural language processing:
- Machine Learning
- Deep Learning
- Formal Languages, Regular Expressions
The NLP pyramid
Natural Language Processing can also be viewed as a pyramid: the most common NLP tasks build on one another. Here are the different levels:
In traditional linguistics, morphology analyses how words are formed, what their origin is, and how their form changes depending on context. In NLP, you’ll mostly deal with:
- gender detection
- word inflection
- lemmatization (finding the base form of a word).
In morphology, most operations happen at the word level, where a word is viewed as a sequence of characters.
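To illustrate that character-level view, here’s a toy suffix-stripping lemmatizer. It is nothing like a real one — proper lemmatizers rely on dictionaries and morphological rules, and handle irregular forms like “went” → “go” that this sketch cannot.

```python
def toy_lemma(word):
    # Extremely naive suffix stripping: peel off a common English
    # suffix if enough of a stem remains. A real lemmatizer uses a
    # lexicon plus rules; this only demonstrates the word-as-
    # character-sequence view of morphology.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print([toy_lemma(w) for w in ["walking", "walked", "walks", "cat"]])
# → ['walk', 'walk', 'walk', 'cat']
```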
Syntax is concerned with what counts as a proper word construction. Determining the underlying structure of a sentence, or building valid sentences, is what syntax is all about. In a way, syntax is what we usually refer to as grammar, and it is probably the most researched branch of computational linguistics. Here are just a few of the tasks:
- Part-of-speech tagging (assigning tags to words: Noun/Verb/Adjective/Adverb/Pronoun/Preposition/Conjunction etc.)
- Building Syntax Trees
- Building Dependency Trees
Syntax usually works on sentences, where a sentence is a sequence of words.
Semantics derives meaning from text. This branch deals with the actual understanding of natural language. Here are some known problems:
- Named Entity Extraction
- Relation Extraction
- Semantic Role Labelling
- Word Sense Disambiguation
Semantics usually works on sentences, where a sentence is a sequence of words with some semantic information (like sense or role) attached to them.
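To give a flavour of named entity extraction, here’s a capitalization heuristic. It is deliberately simplistic — real NER systems are statistical sequence models — but it shows the shape of the task: going from raw words to semantically labelled spans.

```python
def toy_ner(sentence):
    # Flag capitalized tokens that are not sentence-initial as
    # candidate named entities. A heuristic sketch only: it misses
    # sentence-initial names and wrongly flags capitalized common
    # nouns; real NER uses trained sequence models.
    tokens = sentence.split()
    return [t.strip(".,") for i, t in enumerate(tokens)
            if i > 0 and t[0].isupper()]

print(toy_ner("Yesterday John visited Paris with Mary."))
# → ['John', 'Paris', 'Mary']
```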
Pragmatics analyses the text as a whole: it’s about determining underlying narrative threads, topics and references. Some discourse-level tasks are:
- Coreference / anaphora resolution (finding out which word refers to which entity. Example: John is fine. He[John]‘s in no danger.)
- Topic segmentation
- Lexical chains
Pragmatics usually works on a text represented as a sequence of sentences.
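To make the coreference example concrete, here’s a toy resolver that links “he”/“she” back to the most recently seen capitalized name — roughly the kind of heuristic early anaphora-resolution systems started from. Real resolvers also use syntax, gender/number agreement and semantics.

```python
def toy_resolve(text):
    # Replace "he"/"she" with the last capitalized name seen so far.
    # A sketch of the idea only; real coreference resolution is far
    # more involved.
    out, last_name = [], None
    for tok in text.split():
        bare = tok.strip(".,").lower()
        if tok[0].isupper() and bare not in ("he", "she"):
            last_name = tok.strip(".,")
        if bare in ("he", "she") and last_name:
            out.append(tok.replace(tok.strip(".,"), last_name))
        else:
            out.append(tok)
    return " ".join(out)

print(toy_resolve("John is fine. He is in no danger."))
# → John is fine. John is in no danger.
```

Note that the resolver has to track state *across* sentences — which is exactly why pragmatics sits at the top of the pyramid, working on whole texts rather than single sentences.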
In this blog, I’ll be covering most of these tasks. Although I’ve been dealing with most of them for a while now, writing about them makes me look at them from a different perspective. As they say, you truly understand something when you can explain it to others.