Classification Performance Metrics
Throughout this blog, we seek to obtain good performance on our classification tasks. Classification is one of the most popular tasks in Machine Learning. Be sure you understand what classification is before going through this tutorial; you can check out this Introduction to Machine Learning, specially created for hackers.
Since we’re always concerned with how well our systems are performing, we need a clear way of measuring how well a system actually performs.
Binary Classification
We often have to deal with the simple task of Binary Classification. Some examples are: Sentiment Analysis (positive/negative), Spam Detection (spam/not-spam), Fraud Detection (fraud/not-fraud).
Let’s build a simple dataset to support us throughout this tutorial. It won’t be an NLP-related task; we’re keeping things simple, but everything here applies just as well to any binary classification task in NLP.
```python
import numpy as np
import matplotlib.pyplot as plt

TRAINING_SIZE = 1000

# Two Gaussian blobs, one around each centre
center1 = np.array([0, 0])
center2 = np.array([3, 3])

X = np.zeros((TRAINING_SIZE, 2))
X[:TRAINING_SIZE // 2, :] = np.random.randn(TRAINING_SIZE // 2, 2) + center1
X[TRAINING_SIZE // 2:, :] = np.random.randn(TRAINING_SIZE // 2, 2) + center2

plt.scatter(X[:TRAINING_SIZE // 2, 0], X[:TRAINING_SIZE // 2, 1], color='red')
plt.scatter(X[TRAINING_SIZE // 2:, 0], X[TRAINING_SIZE // 2:, 1], color='blue')
plt.show()

# Label the first blob 0 and the second blob 1
y = np.append(np.zeros(TRAINING_SIZE // 2), np.ones(TRAINING_SIZE // 2))

print(X.shape, y.shape)     # (1000, 2) (1000,)
```
Now that we have the data, we must split it in two: one part for training (80%) and another for testing (20%). We test the classifier on a separate set because we want to see how well it generalises (how well it performs on data it hasn’t seen before).
```python
from sklearn.model_selection import train_test_split

# Shuffle the data and then split it, keeping 20% aside for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
# (800, 2) (200, 2) (800,) (200,)
```
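Because the split is shuffled randomly, the exact scores you get below may differ slightly from the ones listed here. If you want a reproducible split, a small optional tweak is to pass a fixed random_state (and, optionally, stratify on the labels so both classes keep their proportions); the seed value 42 below is arbitrary:

```python
# Optional: a reproducible, stratified split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
```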
The only thing left to do is to train a classifier. My pick for this one is a LogisticRegression classifier. If you are new to logistic regression, you should know that, despite its name, it’s indeed a classifier and not a regressor. It is a linear classifier: it learns a linear decision boundary between the classes by optimising the cross-entropy (log loss) error function.
```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

# Let's see how well we're doing ...
print(model.score(X_test, y_test))     # 0.98
```
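To make the “linear classifier” claim concrete, here’s a quick sketch (assuming the model, X and y defined above) that draws the learned decision boundary on top of the data; coef_ and intercept_ are the standard attributes of a fitted scikit-learn LogisticRegression:

```python
# The decision boundary is the line where w0*x + w1*y + b = 0,
# i.e. y = -(w0*x + b) / w1
w = model.coef_[0]
b = model.intercept_[0]

xs = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
ys = -(w[0] * xs + b) / w[1]

plt.scatter(X[:, 0], X[:, 1], c=y, cmap='bwr', alpha=0.3)
plt.plot(xs, ys, color='black')
plt.show()
```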
I couldn’t help myself, and right after training I wanted to see how well my classifier was performing. I got a score of 0.98 (98%). But what score is this, and what does it mean?
Accuracy
Accuracy is the most popular performance measure, and for good reason: it’s extremely helpful, simple to compute and easy to understand. It is the proportion of correctly classified samples out of all samples.
Here’s how to compute accuracy in general, without using the score method on a classifier:
```python
from sklearn.metrics import accuracy_score

print(accuracy_score(y_test, model.predict(X_test)))     # 0.98
```
There are other ways to measure different aspects of performance. In classic machine learning nomenclature, when we’re dealing with binary classification, the classes are: positive and negative. Think of these classes in the context of disease detection:
- positive – we predict the disease is present
- negative – we predict the disease is not present.
Let’s now define some notations:
- TP – True Positives (Samples the classifier has correctly classified as positives)
- TN – True Negatives (Samples the classifier has correctly classified as negatives)
- FP – False Positives (Samples the classifier has incorrectly classified as positives)
- FN – False Negatives (Samples the classifier has incorrectly classified as negatives)
A bit confused? Let’s confuse you a bit more. Samples in the FP set are actually negatives and samples in FN are actually positives.
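Here’s a minimal sketch of how you could count these four quantities yourself with numpy boolean masks, using the y_test and model from above and treating class 1 as the positive class:

```python
# A manual tally, treating class 1 as the "positive" class
y_pred = model.predict(X_test)
positive = 1

TP = np.sum((y_pred == positive) & (y_test == positive))
TN = np.sum((y_pred != positive) & (y_test != positive))
FP = np.sum((y_pred == positive) & (y_test != positive))
FN = np.sum((y_pred != positive) & (y_test == positive))

print(TP, TN, FP, FN)   # the four counts add up to len(y_test)
```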
One of the questions we might ask ourselves is: out of our positive predictions, how many are indeed positive? Putting it another way: given that the classifier predicted a sample as positive, what’s the probability of the sample being indeed positive?
Let’s suppose we have a system that predicts a disease. What’s the probability of actually having the disease, if we predicted that the sample has the disease?
This measure is called precision, and the formula for computing it is: TP / (TP + FP)
```python
from sklearn.metrics import precision_score

# Take turns considering the positive class either 0 or 1
print(precision_score(y_test, model.predict(X_test), pos_label=0))     # 0.971153846154
print(precision_score(y_test, model.predict(X_test), pos_label=1))     # 0.989583333333
```
Another question to ask ourselves is: out of all the truly positive samples, how many is our classifier able to detect? Putting it another way: given a positive sample, what is the probability that our system will properly identify it as positive?
Going back to our disease predicting system: if a sample is positive for the disease, what’s the probability that the system will pick it up?
This measure is called recall, and this is the formula for computing it: TP / (TP + FN)
```python
from sklearn.metrics import recall_score

# Take turns considering the positive class either 0 or 1
print(recall_score(y_test, model.predict(X_test), pos_label=0))     # 0.990196078431
print(recall_score(y_test, model.predict(X_test), pos_label=1))     # 0.969387755102
```
A measure that combines the precision and recall metrics is called F1-score. This score is, in fact, the harmonic mean of the precision and the recall. Here’s the formula for it:
2 / (1 / Precision + 1 / Recall)
```python
from sklearn.metrics import f1_score

# Take turns considering the positive class either 0 or 1
print(f1_score(y_test, model.predict(X_test), pos_label=0))     # 0.980582524272
print(f1_score(y_test, model.predict(X_test), pos_label=1))     # 0.979381443299
```
Here’s a way of remembering precision and recall: precision is about how much you can trust your positive predictions, while recall is about how many of the actual positives you manage to find.
Getting back to the classic accuracy metric, here’s its formula using our new notations:
(TP + TN) / (TP + TN + FP + FN)
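As a sanity check, here’s a short sketch that reproduces the scikit-learn scores from the raw counts, reusing the TP, TN, FP and FN variables from the counting snippet above (with class 1 as positive):

```python
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 / (1 / precision + 1 / recall)
accuracy = (TP + TN) / (TP + TN + FP + FN)

# These should match precision_score, recall_score and f1_score with
# pos_label=1, and accuracy_score, from the snippets above
print(precision, recall, f1, accuracy)
```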
A convenient shortcut in scikit-learn for obtaining a readable digest of all these metrics is metrics.classification_report:
```python
from sklearn.metrics import classification_report

print(classification_report(y_test, model.predict(X_test), target_names=['RED', 'BLUE']))

#              precision    recall  f1-score   support
#
#         RED       0.97      0.99      0.98       102
#        BLUE       0.99      0.97      0.98        98
#
# avg / total       0.98      0.98      0.98       200
```
Multiclass Classification
Let’s now make the leap towards multiclass classification. Examples of multiclass problems we might encounter in NLP include Part-of-Speech Tagging and Named Entity Extraction. Let’s repeat the process of creating a dataset, this time with 3 classes:
```python
import numpy as np
import matplotlib.pyplot as plt

TRAINING_SIZE = 1500

# Three Gaussian blobs, one around each centre
center1 = np.array([0, 0])
center2 = np.array([3, 3])
center3 = np.array([0, 4])

X = np.zeros((TRAINING_SIZE, 2))
X[:TRAINING_SIZE // 3, :] = np.random.randn(TRAINING_SIZE // 3, 2) + center1
X[TRAINING_SIZE // 3:2 * TRAINING_SIZE // 3, :] = np.random.randn(TRAINING_SIZE // 3, 2) + center2
X[2 * TRAINING_SIZE // 3:, :] = np.random.randn(TRAINING_SIZE // 3, 2) + center3

plt.scatter(X[:TRAINING_SIZE // 3, 0], X[:TRAINING_SIZE // 3, 1], color='red')
plt.scatter(X[TRAINING_SIZE // 3:2 * TRAINING_SIZE // 3, 0], X[TRAINING_SIZE // 3:2 * TRAINING_SIZE // 3, 1], color='blue')
plt.scatter(X[2 * TRAINING_SIZE // 3:, 0], X[2 * TRAINING_SIZE // 3:, 1], color='green')
plt.show()

# Label the three blobs 0, 1 and 2
y = np.append(np.zeros(TRAINING_SIZE // 3), np.zeros(TRAINING_SIZE // 3) + 1)
y = np.append(y, np.zeros(TRAINING_SIZE // 3) + 2)

print(X.shape, y.shape)     # (1500, 2) (1500,)
```
Same as before, let’s create the training and test set:
```python
from sklearn.model_selection import train_test_split

# Shuffle the data and then split it, keeping 20% aside for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
# (1200, 2) (300, 2) (1200,) (300,)
```
And train the same LogisticRegression classifier on the new data:

```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

# Let's see how well we're doing ...
print(model.score(X_test, y_test))     # 0.953333333333
```
We got a prediction accuracy of 0.953. Extending accuracy to multiclass is pretty straightforward, since accuracy can be thought of as #correctly_classified_samples / #all_samples. Here are some equivalent ways of computing it:
```python
from sklearn.metrics import accuracy_score

print(accuracy_score(y_test, model.predict(X_test)))       # 0.953333333333
print(model.score(X_test, y_test))                         # 0.953333333333
print(np.mean(y_test == model.predict(X_test)))            # 0.953333333333
```
The other metrics are a bit trickier to use in the multiclass setting, since they are defined explicitly in terms of binary classification.
The solution is to reduce the multiclass classification problem to many binary classification problems: if we have K classes, we deal with K binary problems, taking each class in turn as the positive class and all the remaining classes together as the negative class (the one-vs-rest approach). Let’s build a utility function that computes TP, TN, FP and FN, since we’ll need these values later on.
```python
def classification_stats(y_test, y_pred, pos_class):
    """Count TP, FP, TN, FN, treating `pos_class` as the positive class."""
    TP, FP, TN, FN = 0.0, 0.0, 0.0, 0.0

    for idx in range(len(y_test)):
        if y_test[idx] == pos_class:
            if y_pred[idx] == pos_class:
                TP += 1
            else:
                FN += 1
        else:
            if y_pred[idx] != pos_class:
                TN += 1
            else:
                FP += 1

    return TP, FP, TN, FN
```
One more step before generalizing the metrics. Let’s compute some useful numbers beforehand.
```python
# How many samples of each class there are in the test set
support = np.array([np.sum(y_test == 0), np.sum(y_test == 1), np.sum(y_test == 2)])
print(support)      # [ 96  98 106]

# Compute TP, FP, TN, FN stats for each class
stats0 = np.array(classification_stats(y_test, model.predict(X_test), pos_class=0))
stats1 = np.array(classification_stats(y_test, model.predict(X_test), pos_class=1))
stats2 = np.array(classification_stats(y_test, model.predict(X_test), pos_class=2))

# Compute the global TP, FP, TN, FN stats by summing the stats for each class
global_stats = stats0 + stats1 + stats2
print(global_stats)     # [ 286.   14.  586.   14.]
```
Computing a global precision usually implies some averaging, and scikit-learn provides a few ways of doing this. I’ll present three types of averaging here:
- macro: the unweighted mean of the per-class precisions
- weighted: like macro, but each class’s precision is weighted by the number of samples in that class (its support)
- micro: computed from the global counts: GLOBAL_TP / (GLOBAL_TP + GLOBAL_FP)
Here are the ways to compute them using handy scikit-learn functions or using equivalent numpy operations:
```python
from sklearn.metrics import precision_score

# Compute a per-class precision, taking turns considering the positive class either 0, 1 or 2
per_class_precision = precision_score(y_test, model.predict(X_test), average=None)
print(per_class_precision)      # [ 0.95876289  0.9223301   0.98      ]

print('precision_score - average=macro')
print(precision_score(y_test, model.predict(X_test), average='macro'))     # 0.953697661228
print(np.mean(per_class_precision))                                        # 0.953697661228

print('precision_score - average=weighted')
print(precision_score(y_test, model.predict(X_test), average='weighted'))  # 0.95436528876
print(support.dot(per_class_precision) / np.sum(support))                  # 0.95436528876

print('precision_score - average=micro')
print(precision_score(y_test, model.predict(X_test), average='micro'))     # 0.953333333333
print(global_stats[0] / (global_stats[0] + global_stats[1]))               # 0.953333333333
```
Although it implies repeating the same process, here’s the code for recall:
```python
from sklearn.metrics import recall_score

# Compute a per-class recall, taking turns considering the positive class either 0, 1 or 2
per_class_recall = recall_score(y_test, model.predict(X_test), average=None)
print(per_class_recall)     # [ 0.96875     0.96938776  0.9245283 ]

print('recall_score - average=macro')
print(recall_score(y_test, model.predict(X_test), average='macro'))        # 0.954222018996
print(np.mean(per_class_recall))                                           # 0.954222018996

print('recall_score - average=weighted')
print(recall_score(y_test, model.predict(X_test), average='weighted'))     # 0.953333333333
print(support.dot(per_class_recall) / np.sum(support))                     # 0.953333333333

print('recall_score - average=micro')
print(recall_score(y_test, model.predict(X_test), average='micro'))        # 0.953333333333
print(global_stats[0] / (global_stats[0] + global_stats[3]))               # 0.953333333333
```
Since the F1-score is just the harmonic mean of precision and recall, we’ll simply list its values under the different averaging schemes (note that for macro and weighted averaging, scikit-learn averages the per-class F1 scores rather than combining the already-averaged precision and recall):
```python
from sklearn.metrics import f1_score

print(f1_score(y_test, model.predict(X_test), average='macro'))        # 0.95348683749
print(f1_score(y_test, model.predict(X_test), average='weighted'))     # 0.953364398558
print(f1_score(y_test, model.predict(X_test), average='micro'))        # 0.953333333333
```
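A quick sketch to make that last point concrete, reusing the per_class_precision and per_class_recall arrays from the snippets above: averaging the per-class F1 scores reproduces the macro F1 above, while taking the harmonic mean of the macro precision and macro recall gives a slightly different number:

```python
# Per-class F1 from the per-class precision and recall ...
per_class_f1 = 2 / (1 / per_class_precision + 1 / per_class_recall)

# ... averaged -> this matches f1_score(average='macro')
print(np.mean(per_class_f1))

# Harmonic mean of macro precision and macro recall -> close, but not the same
macro_p = np.mean(per_class_precision)
macro_r = np.mean(per_class_recall)
print(2 / (1 / macro_p + 1 / macro_r))
```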
Which type of averaging you use is up to you, and it depends on the type of problem you are solving. For most of the problems out there, weighted is a good choice. Note that for a single-label multiclass problem like this one, the micro-averaged precision, recall and F1 all coincide with plain accuracy, which is why they all come out to 0.953333333333 above.
Here’s how the classification_report behaves for the multiclass case:
```python
from sklearn.metrics import classification_report

print(classification_report(y_test, model.predict(X_test), target_names=['RED', 'BLUE', 'GREEN']))

#              precision    recall  f1-score   support
#
#         RED       0.96      0.97      0.96        96
#        BLUE       0.92      0.97      0.95        98
#       GREEN       0.98      0.92      0.95       106
#
# avg / total       0.95      0.95      0.95       300
```
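In newer scikit-learn versions (0.20 and later, to the best of my knowledge), classification_report can also return the same numbers as a dictionary via the output_dict parameter, which is handy if you want to use them programmatically rather than just print them:

```python
# Same report, but as a nested dict keyed by the class names
report = classification_report(y_test, model.predict(X_test),
                               target_names=['RED', 'BLUE', 'GREEN'],
                               output_dict=True)
print(report['RED']['precision'], report['RED']['recall'])
```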
Confusion Matrix
There’s one more important tool you should know about, called the confusion matrix. Using it, you can tell whether one class is consistently mistaken for another. It’s also really simple to use:
```python
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test, model.predict(X_test)))

# [[93  3  0]
#  [ 1 95  2]
#  [ 3  5 98]]
```
Here’s how to read the confusion matrix: C[i][j] is the number of samples that belong to class i but were classified as j. Obviously, if i equals j, then C[i][j] is the number of correctly classified samples belonging to class i.
If the classifier is perfect, you’ll obtain non-zero values only on the main diagonal.
A common practice is to normalise the confusion matrix row by row, working with proportions rather than sample counts; each row then shows how the samples of one true class are distributed over the predicted classes.
Here’s how to do that:
```python
from sklearn.metrics import confusion_matrix

C = confusion_matrix(y_test, model.predict(X_test))

# Divide each row by its sum so that every row adds up to 1
NC = C.astype(float) / C.sum(axis=1, keepdims=True)
print(NC)

# [[ 0.96875     0.03125     0.        ]
#  [ 0.01020408  0.96938776  0.02040816]
#  [ 0.02830189  0.04716981  0.9245283 ]]
```
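Since we already have matplotlib imported, here’s a small optional sketch for visualising the normalised confusion matrix as a heatmap; the colour map and the class labels are just illustrative choices:

```python
# A simple heatmap of the normalised confusion matrix
plt.imshow(NC, interpolation='nearest', cmap='Blues')
plt.colorbar()
plt.xticks([0, 1, 2], ['RED', 'BLUE', 'GREEN'])
plt.yticks([0, 1, 2], ['RED', 'BLUE', 'GREEN'])
plt.xlabel('Predicted class')
plt.ylabel('True class')
plt.show()
```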
Conclusions
Hopefully you’ve wrapped your head around the different metrics used for measuring how well a classifier is doing. The F1-score is considered the most “complete” score, since it combines precision and recall, but it is also the least intuitive one.
Which metric matters most depends on the problem you are solving. For example, if your system is predicting cancer, you might want to optimise recall rather than precision: you want to detect as many positive cases as possible, even if that means accepting a considerable number of false positives.
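One common way to trade precision for recall with a probabilistic classifier like LogisticRegression is to lower the decision threshold instead of using the default 0.5. Here’s a minimal sketch against the binary model and test set from the first part of the tutorial (the 0.3 threshold is just an illustrative value):

```python
from sklearn.metrics import precision_score, recall_score

# Probability of the positive class (column 1 of predict_proba)
probs = model.predict_proba(X_test)[:, 1]

# Lowering the threshold flags more samples as positive:
# recall goes up, usually at the cost of precision
y_pred_low = (probs > 0.3).astype(int)

print(recall_score(y_test, y_pred_low, pos_label=1))
print(precision_score(y_test, y_pred_low, pos_label=1))
```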