Classification Performance Metrics

Throughout this blog, we seek to obtain good performance on our classification tasks. Classification is one of the most popular tasks in Machine Learning. Be sure you understand what classification is before going through this tutorial. You can check this Introduction to Machine Learning, specially created for hackers.

Since we’re always concerned with how well our systems are performing, we should have a clear way of measuring how performant a system is.

Binary Classification

We often have to deal with the simple task of Binary Classification. Some examples are: Sentiment Analysis (positive/negative), Spam Detection (spam/not-spam), Fraud Detection (fraud/not-fraud).

Let’s build a simple dataset to support us throughout this tutorial. It won’t be an NLP-related task, since we’re keeping things simple, but everything here applies to any binary classification task in NLP.
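Here’s a minimal sketch of how a dataset made of two Gaussian clouds could be generated with numpy. The cloud centres, the spread, the sample count and the names `X` and `y` are assumptions chosen for illustration, not necessarily the values behind the figure.

```python
import numpy as np

# Assumed parameters: centres, spread and sample count are illustrative.
rng = np.random.default_rng(0)
n_per_class = 500

# Negative class centred at (-2, -2), positive class at (2, 2), unit spread.
cloud_neg = rng.normal(loc=-2.0, scale=1.0, size=(n_per_class, 2))
cloud_pos = rng.normal(loc=2.0, scale=1.0, size=(n_per_class, 2))

# Stack the points and build the matching label vector (0 = negative, 1 = positive).
X = np.vstack([cloud_neg, cloud_pos])
y = np.concatenate([np.zeros(n_per_class, dtype=int),
                    np.ones(n_per_class, dtype=int)])
```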

Two gaussian clouds

Now that we have the data, we must split it in two: one part we’re going to use for training (80%) and another we’re going to use for testing (the remaining 20%). We test the classifier on a different set because we want to see how well it generalises (how well it performs on data it hasn’t seen already).
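An 80/20 split can be sketched with plain numpy as below. The data here is a random stand-in and the variable names are assumptions; scikit-learn’s `train_test_split` does the same job in one call.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data (assumed): 1000 two-dimensional points with binary labels.
X = rng.normal(size=(1000, 2))
y = rng.integers(0, 2, size=1000)

# Shuffle the indices so the split does not depend on the order of the data.
indices = rng.permutation(len(X))
split_at = int(0.8 * len(X))
train_idx, test_idx = indices[:split_at], indices[split_at:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
```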

The only thing left to do is to train a classifier. My pick for this one is a LogisticRegression classifier. If you are new to logistic regression, note that despite its name it is a classifier and not a regressor. It is, in fact, a linear classifier: it fits a linear decision boundary (a line, when there are two features) that discriminates between the classes by minimising the cross-entropy error function with an iterative optimiser such as gradient descent.
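A minimal training sketch with scikit-learn, assuming two well-separated Gaussian clouds as stand-in data. The centres, sizes and random seeds are illustrative, and the classifier is left at its default settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data (assumed): two Gaussian clouds centred at (-2, -2) and (2, 2).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(500, 2)),
               rng.normal(2.0, 1.0, size=(500, 2))])
y = np.concatenate([np.zeros(500, dtype=int), np.ones(500, dtype=int)])

# Hold out 20% of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit on the training part only, then predict labels for the unseen part.
clf = LogisticRegression()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
```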

I couldn’t help myself, and right after training, I wanted to see how well my classifier is performing. I got a 0.98 (98%) score. But what score is this and what does it mean?


Accuracy is the most popular performance measure, and for good reason: it’s extremely helpful and simple to compute and understand. It is the ratio of correctly classified samples to all samples.

Here’s how to compute accuracy in general, without using the score method on a classifier:
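A minimal sketch, assuming a small hand-made pair of label arrays (`y_true` and `y_pred` are illustrative names, with 1 standing for positive and 0 for negative):

```python
import numpy as np

# Illustrative labels (assumed): 1 = positive, 0 = negative.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

# Accuracy = number of correct predictions / number of samples.
accuracy = np.sum(y_pred == y_true) / len(y_true)
print(accuracy)  # 0.7
```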

There are other ways to measure different aspects of performance. In classic machine learning nomenclature, when we’re dealing with binary classification, the classes are: positive and negative. Think of these classes in the context of disease detection:

  • positive – we predict the disease is present
  • negative – we predict the disease is not present.

Let’s now define some notations:

  • TP – True Positives (Samples the classifier has correctly classified as positives)
  • TN – True Negatives (Samples the classifier has correctly classified as negatives)
  • FP – False Positives (Samples the classifier has incorrectly classified as positives)
  • FN – False Negatives (Samples the classifier has incorrectly classified as negatives)

A bit confused? Let’s confuse you a bit more. Samples in the FP set are actually negatives and samples in FN are actually positives.
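To make the four sets concrete, here’s a minimal sketch that counts them with numpy. The label arrays are made up for illustration, with 1 standing for positive and 0 for negative.

```python
import numpy as np

# Illustrative labels (assumed): 1 = positive, 0 = negative.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # actually positive, predicted positive
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # actually negative, predicted negative
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # actually negative, predicted positive
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # actually positive, predicted negative

print(tp, tn, fp, fn)  # 3 4 2 1
```

Every sample lands in exactly one of the four sets, so the counts always add up to the number of samples.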

One of the questions we might ask ourselves is: out of our positive predictions, how many are indeed positive? Putting it another way: given that the classifier predicted a sample as positive, what’s the probability of the sample being indeed positive?

Let’s suppose we have a system that predicts a disease. What’s the probability of actually having the disease, if we predicted that the sample has the disease?

This measure is called precision, and the formula for computing it is: TP / (TP + FP)
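As a quick sketch on made-up labels (1 = positive, 0 = negative; the arrays are assumptions for illustration), precision only looks at the samples the classifier predicted as positive:

```python
import numpy as np

# Illustrative labels (assumed): 1 = positive, 0 = negative.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

predicted_positive = y_pred == 1
tp = np.sum(predicted_positive & (y_true == 1))
fp = np.sum(predicted_positive & (y_true == 0))

# Out of 5 positive predictions, 3 are actually positive.
precision = tp / (tp + fp)
print(precision)  # 0.6
```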

Another question to ask ourselves is: out of all the truly positive samples, how many is our classifier able to detect? Putting it another way: given a positive sample, what is the probability that our system will properly identify it as positive?

Going back to our disease predicting system: if a sample is positive for the disease, what’s the probability that the system will pick it up?

This measure is called recall, and this is the formula for computing it: TP / (TP + FN)
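The same kind of sketch for recall, again on made-up labels (1 = positive, 0 = negative). This time only the truly positive samples matter:

```python
import numpy as np

# Illustrative labels (assumed): 1 = positive, 0 = negative.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

actually_positive = y_true == 1
tp = np.sum(actually_positive & (y_pred == 1))
fn = np.sum(actually_positive & (y_pred == 0))

# Out of 4 truly positive samples, 3 are detected.
recall = tp / (tp + fn)
print(recall)  # 0.75
```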

A measure that combines precision and recall is the F1-score. This score is, in fact, the harmonic mean of the precision and the recall. Here’s the formula for it:
2 / (1 / Precision + 1 / Recall)
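Here’s a small worked sketch of the formula, using assumed counts (TP = 3, FP = 2, FN = 1) picked purely for illustration. It also checks the equivalent form 2·TP / (2·TP + FP + FN), which follows from substituting the precision and recall formulas into the harmonic mean.

```python
# Assumed counts, for illustration only.
tp, fp, fn = 3, 2, 1

precision = tp / (tp + fp)  # 0.6
recall = tp / (tp + fn)     # 0.75

# Harmonic mean of precision and recall.
f1 = 2 / (1 / precision + 1 / recall)

# Same quantity written directly in terms of the counts.
f1_from_counts = 2 * tp / (2 * tp + fp + fn)

print(round(f1, 4), round(f1_from_counts, 4))  # 0.6667 0.6667
```

Because it is a harmonic mean, the F1-score is pulled towards the smaller of the two values: here it sits closer to the precision (0.6) than the arithmetic mean (0.675) would.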

Here’s a way of remembering precision and recall:

precision recall graphic

Getting back to the classic accuracy metric, here’s the formula for it, using our new notations:
(TP + TN) / (TP + TN + FP + FN)

A convenient shortcut in scikit-learn for obtaining a readable digest of all the metrics is metrics.classification_report
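A minimal usage sketch, assuming made-up labels where 1 is positive and 0 is negative. The `target_names` values are just readable names chosen for this example.

```python
from sklearn.metrics import classification_report

# Illustrative labels (assumed): 1 = positive, 0 = negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# target_names are matched to the sorted labels, so 0 -> "negative", 1 -> "positive".
report = classification_report(
    y_true, y_pred, target_names=["negative", "positive"])
print(report)
```

The report lists precision, recall, F1-score and support (the number of true samples) for each class, followed by averaged rows.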

Multiclass Classification

Let’s now make the leap towards multiclass. Examples of multiclass problems we might encounter in NLP include Part-of-Speech Tagging and Named Entity Extraction. Let’s repeat the process for creating a dataset, this time with 3 classes:
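A three-class dataset can be sketched the same way with numpy. The centres, spread and sample count below are assumptions for illustration, not necessarily the values behind the figure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed cloud centres, one per class.
centres = np.array([[-3.0, 0.0], [0.0, 3.0], [3.0, 0.0]])
n_per_class = 300

# One unit-spread Gaussian cloud around each centre, stacked in class order.
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(n_per_class, 2))
               for c in centres])

# Labels 0, 1, 2, each repeated n_per_class times to match the stacking order.
y = np.repeat(np.arange(len(centres)), n_per_class)
```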

Three Gaussian Clouds

Same as before, let’s create the training and test set:
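A minimal sketch of the split-and-train step for three classes. It assumes stand-in data from scikit-learn’s `make_blobs` with illustrative parameters, so the score it prints will not necessarily match the figure quoted in this post.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data (assumed): three Gaussian blobs, 1500 samples in total.
X, y = make_blobs(n_samples=1500, centers=3, cluster_std=2.0, random_state=0)

# 80% for training, 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# LogisticRegression handles more than two classes out of the box.
clf = LogisticRegression()
clf.fit(X_train, y_train)

# score() returns the mean accuracy on the held-out samples.
print(clf.score(X_test, y_test))
```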

We got a 0.953 prediction accuracy. Extending accuracy to multiclass is pretty straightforward, since accuracy can be thought of as #correctly_classified_samples/#all_samples. Here are some equivalent approaches:
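A sketch of three equivalent computations, on a small made-up set of three-class labels (the arrays are assumptions for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Illustrative three-class labels (assumed).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2, 2])

# 1. Mean of the element-wise "is correct" flags.
acc_mean = np.mean(y_pred == y_true)

# 2. Count of correct predictions divided by the number of samples.
acc_ratio = np.sum(y_pred == y_true) / len(y_true)

# 3. scikit-learn's helper.
acc_sklearn = accuracy_score(y_true, y_pred)

# All three agree: 7 of the 10 samples are classified correctly.
print(acc_mean, acc_ratio, acc_sklearn)
```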

The other metrics are a bit trickier to use in the context of multiclass, since they are defined explicitly in terms of the binary quantities TP, FP and FN.

The solution is to reduce a multiclass classification problem to many binary classification problems. If we have K classes, we deal with K binary classification problems. In turn, we consider each class to be the positive one and all the remaining classes together to be the negative one. Let’s build a utility function that computes TP, TN, FP and FN since we’ll need these values later on.
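One possible shape for such a utility, as a sketch: `binary_counts` is a hypothetical helper name, and the label arrays are made up for illustration.

```python
import numpy as np

def binary_counts(y_true, y_pred, positive_class):
    """Return (TP, TN, FP, FN), treating `positive_class` as the positive
    label and every other class as negative."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    actual_pos = y_true == positive_class
    predicted_pos = y_pred == positive_class
    tp = int(np.sum(actual_pos & predicted_pos))
    tn = int(np.sum(~actual_pos & ~predicted_pos))
    fp = int(np.sum(~actual_pos & predicted_pos))
    fn = int(np.sum(actual_pos & ~predicted_pos))
    return tp, tn, fp, fn

# Illustrative three-class labels (assumed).
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 2, 2, 2, 2]

for c in (0, 1, 2):
    print(c, binary_counts(y_true, y_pred, c))
# 0 (2, 6, 0, 2)
# 1 (2, 5, 2, 1)
# 2 (3, 6, 1, 0)
```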

One more step before generalizing the metrics. Let’s compute some useful numbers beforehand.

Computing a global precision usually implies some averaging, and scikit-learn provides a few ways of doing this. I’ll be presenting here 3 types of averaging:

  • micro: the precision computed globally, from the TP and FP counts pooled over all classes
  • macro: the unweighted mean of the per-class precisions
  • weighted: similar to macro, but each class’s precision is weighted by the number of true samples of that class

Here are the ways to compute them using handy scikit-learn functions or using equivalent numpy operations:
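A sketch of averaged precision on made-up three-class labels, computed once with numpy and once with scikit-learn’s `precision_score`. Besides macro and weighted, the `average` parameter also accepts "micro", which pools the counts over all classes. The arrays and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import precision_score

# Illustrative three-class labels (assumed).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
classes = np.unique(y_true)

# Per-class precision: among samples predicted as c, the fraction that really are c.
per_class = np.array([np.mean(y_true[y_pred == c] == c) for c in classes])
# Support: number of true samples of each class.
support = np.array([np.sum(y_true == c) for c in classes])

macro = per_class.mean()
weighted = np.average(per_class, weights=support)
# Pooled TP / (TP + FP); with one label per sample this equals accuracy.
micro = np.mean(y_pred == y_true)

sk_macro = precision_score(y_true, y_pred, average="macro")
sk_weighted = precision_score(y_true, y_pred, average="weighted")
sk_micro = precision_score(y_true, y_pred, average="micro")

print(per_class)            # per-class precisions: 1.0, 0.5, 0.75
print(macro, sk_macro)
print(weighted, sk_weighted)
print(micro, sk_micro)
```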

Although it implies repeating the same process, here’s the code for recall:
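A sketch for recall using scikit-learn’s `recall_score` on made-up three-class labels (the arrays are assumptions for illustration). Passing `average=None` returns the per-class values instead of a single number.

```python
import numpy as np
from sklearn.metrics import recall_score

# Illustrative three-class labels (assumed).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2, 2])

# Per-class recall: among samples that really are c, the fraction predicted as c.
per_class = recall_score(y_true, y_pred, average=None)

macro = recall_score(y_true, y_pred, average="macro")
weighted = recall_score(y_true, y_pred, average="weighted")
micro = recall_score(y_true, y_pred, average="micro")

print(per_class)
print(macro, weighted, micro)
```

Weighting each class’s recall by its number of true samples gives back the overall fraction of correct predictions, so weighted recall coincides with accuracy.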

Since the F1-score is just the harmonic mean of precision and recall, we’re just going to list the values for the different types of averaging:
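A sketch using scikit-learn’s `f1_score` on made-up three-class labels (the arrays are assumptions for illustration). Note that the macro value is the mean of the per-class F1-scores, not the harmonic mean of the macro precision and macro recall.

```python
import numpy as np
from sklearn.metrics import f1_score

# Illustrative three-class labels (assumed).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2, 2])

per_class = f1_score(y_true, y_pred, average=None)

f1_macro = f1_score(y_true, y_pred, average="macro")
f1_weighted = f1_score(y_true, y_pred, average="weighted")
f1_micro = f1_score(y_true, y_pred, average="micro")

print(per_class)
print(f1_macro, f1_weighted, f1_micro)
```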

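Which type of averaging you use is up to you and depends on the type of problem you are solving. Macro treats every class as equally important regardless of its size, while weighted takes class imbalance into account, which makes it a good choice for most of the problems out there.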

Here’s how the classification_report behaves for the multiclass case:
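A minimal sketch on made-up three-class labels (the arrays are assumptions for illustration). Besides the printable text report, `output_dict=True` returns the same numbers as a nested dictionary, which is handy when the values are needed in code.

```python
from sklearn.metrics import classification_report

# Illustrative three-class labels (assumed).
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 2, 2, 2, 2]

# Text version: one row per class, then the averaged rows.
print(classification_report(y_true, y_pred, digits=3))

# Dictionary version of the same report; class labels become string keys.
report_dict = classification_report(y_true, y_pred, output_dict=True)
print(report_dict["macro avg"]["precision"])
print(report_dict["weighted avg"]["recall"])
```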

Confusion Matrix

There’s one more important tool you should know about: the confusion matrix. Using the matrix, you can tell if there’s a class that’s consistently mistaken for some other class. It’s also really simple to use:
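A minimal sketch with scikit-learn’s `confusion_matrix`, on made-up three-class labels (the arrays are assumptions for illustration):

```python
from sklearn.metrics import confusion_matrix

# Illustrative three-class labels (assumed).
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 2, 2, 2, 2]

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[2 2 0]
#  [0 2 1]
#  [0 0 3]]
```

In this toy example, half of the class 0 samples are mistaken for class 1, which shows up as the 2 in the first row, second column.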

Here’s how to read the confusion matrix. C[i][j] is the number of samples that belong to class i but are classified as j. Obviously, if i equals j, C[i][j] is the number of correctly classified samples belonging to class i.

If the classifier is perfect, you’ll obtain non-zero values only on the main diagonal.

A common practice is to normalise the confusion matrix, working with proportions rather than sample counts.

Here’s how to do that:
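One way to normalise, as a sketch: divide each row by its sum, so that every row shows how the samples of one true class are distributed over the predicted classes. The matrix below is a made-up example. Recent scikit-learn versions can also do this directly through the `normalize` parameter of `confusion_matrix`.

```python
import numpy as np

# Illustrative confusion matrix (assumed): rows = true class, columns = predicted class.
cm = np.array([[2, 2, 0],
               [0, 2, 1],
               [0, 0, 3]])

# Divide each row by the number of true samples of that class.
cm_normalised = cm / cm.sum(axis=1, keepdims=True)
print(np.round(cm_normalised, 3))
```

With this row-wise normalisation, the main diagonal holds the per-class recall.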


I hope you’ve wrapped your head around the different metrics used for measuring how well a classifier is doing. The F1-score is often considered the most “complete” score, since it combines precision and recall. It is also the least intuitive one.

Which metric matters most depends on the problem you are solving. For example, if your system is predicting cancer, you might want to optimise recall rather than precision: you want to detect as many positive cases as possible, even if that means accepting a considerable number of false positives.