Model Evaluation Measures

ROC Curve

A receiver operating characteristic curve, or ROC curve, is a graphical plot that shows the behavior of a classifier. It originated in the 1950s in radar signal detection analysis and was made popular by a 1978 paper by Charles Metz called “Basic Principles of ROC Analysis.” An ROC curve visually explains the trade-off between the False Positive Rate (FPR) and the True Positive Rate (TPR). As a reminder, the TPR answers the question, “When the actual classification is positive, how often does the classifier predict positive?” The FPR answers the question, “When the actual classification is negative, how often does the classifier incorrectly predict positive?” Both the True Positive Rate and the False Positive Rate range from 0 to 1.
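As a minimal sketch of how these two rates are computed, the code below builds a confusion matrix from small, made-up vectors of actual and predicted labels (the vectors are purely illustrative) and reads the TPR and FPR off it.

# A minimal sketch, using made-up label vectors, of how TPR and FPR are computed
actual    <- factor(c("P", "P", "P", "N", "N", "N", "P", "N"), levels = c("N", "P"))
predicted <- factor(c("P", "N", "P", "N", "P", "N", "P", "N"), levels = c("N", "P"))

cm  <- table(Actual = actual, Predicted = predicted)   # 2 x 2 confusion matrix
TPR <- cm["P", "P"] / sum(cm["P", ])   # of all actual P, the fraction predicted P
FPR <- cm["N", "P"] / sum(cm["N", ])   # of all actual N, the fraction predicted P
c(TPR = TPR, FPR = FPR)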

Let’s consider a Model A, e.g. a logistic regression, which outputs the probability of class P (positive) or N (negative) based on certain independent variables. Assume we get 2000 predictions for a validation set that contains 1000 P and 1000 N examples. One way to visualize the model’s predictions for a given set of validation examples is shown below, where the predicted probability of class P is given on the x-axis and the count of examples is given on the y-axis. The actual class of P examples is color coded as green and that of N examples as red. As you can see, there are several regions where the plot shows an overlap between P and N examples. This is where our classifier fails to separate the classes.

Data

# Simulate Model A's predicted probabilities for 1000 N and 1000 P examples.
# N examples are centered at 0.4 and P examples at 0.6, so the classes overlap.
set.seed(11)
N <- 1000
data <- data.frame(
  Class = c( rep("N", N), rep("P", N) ),
  Probability = c( rnorm(N, mean=0.4, sd=0.1), rnorm(N, mean=0.6, sd=0.1) )
)
# plot_multi_histogram() is a custom helper used throughout this document; it is
# assumed to overlay per-class histograms of Probability and mark xintercept.
plot_multi_histogram(data, 'Probability', 'Class', xintercept=0.5)

This plot tells you, for example, that there are:

  • 50 examples for which the classifier predicted a probability of 0.25 of belonging to class P; the actual class of those examples is N.
  • 38 examples for which the classifier predicted a probability of 0.75 of belonging to class P; the actual class of those examples is P.
  • 75 examples of actual class P and 72 examples of actual class N for which the classifier predicted a probability of 0.50 (these bin counts can be checked directly, as sketched below).
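The sketch below counts, for each class, the examples whose predicted probability falls near a given value; the assumed bin width is a rough match for the 30 bins used by the plot, so the exact counts may differ slightly from those listed above.

# Count examples of each class whose predicted probability falls near a value.
# bin_width is an assumption, roughly matching 30 bins over the plotted range.
bin_width <- 0.03
near <- function(p) abs(data$Probability - p) <= bin_width / 2
table(Class = data$Class[near(0.50)])   # N and P examples near probability 0.5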

To turn these probabilities into actual predictions, you might set a decision threshold at 0.50 and classify every example with a probability greater than or equal to 0.5 as class P, and every example with a probability below 0.5 as class N.
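As a sketch, applying that 0.5 threshold to the simulated Model A data takes one line; the resulting confusion matrix is what the TPR and FPR figures in the next section are read from.

# Apply a 0.5 decision threshold to Model A's predicted probabilities
threshold <- 0.5
data$Predicted <- ifelse(data$Probability >= threshold, "P", "N")
table(Actual = data$Class, Predicted = data$Predicted)   # confusion matrix at 0.5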

TPR, FPR and decision threshold

If you keep 0.5 as the threshold, the TPR is the number of P examples to the right of the 0.5 threshold divided by the total number of P examples, or 832 divided by 1000, which is 0.832. The FPR is the number of N examples to the right of the 0.5 threshold divided by 1000. There are 158 such examples, so the FPR is 158/1000 = 0.158.

If you raise the threshold to 0.60, there are 481 P examples to the right of 0.60 (correctly classified P examples), so the TPR is 481/1000 = 0.481. The number of incorrectly classified examples (class N examples classified as P) drops to 27, making the FPR 27/1000 = 0.027. You can see that increasing the threshold improved the FPR but reduced the TPR.

If the threshold is instead lowered to 0.40, more P examples are correctly classified (974 P examples lie to the right of 0.40), making the TPR 0.974. However, the number of N examples incorrectly classified as P also increases, to 482, making the FPR 0.482.
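The three calculations above can be reproduced with a small helper function; this is a sketch over the simulated data, and the exact counts depend on the draw produced by set.seed(11).

# TPR and FPR at the three thresholds discussed above
tpr_fpr <- function(threshold, probs, actual) {
  predicted <- ifelse(probs >= threshold, "P", "N")
  c(threshold = threshold,
    TPR = sum(predicted == "P" & actual == "P") / sum(actual == "P"),
    FPR = sum(predicted == "P" & actual == "N") / sum(actual == "N"))
}
t(sapply(c(0.4, 0.5, 0.6), tpr_fpr, probs = data$Probability, actual = data$Class))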

Table 1 summarizes how changing the threshold affects the TPR and FPR for these three threshold values. Seeing the relationship between TPR and FPR across the full range of thresholds, say from 0.1 to 1, requires calculating these values for every threshold. This is where the ROC curve comes into the picture: it summarizes the relationship between TPR and FPR in a single plot, as shown in Figure 2. You can identify the three decision thresholds on this plot (marked with green, blue, and red dotted lines for the 0.6, 0.5, and 0.4 thresholds). Please note that the dotted lines are not part of a standard ROC curve; they are used here for illustration only. The ROC curve shows the increase in correct positive classifications (TPR) as you allow more and more false positives (FPR). A perfect classifier that makes no mistakes would reach a true positive rate of 100% immediately, without incurring any false positives; this almost never happens in practice.
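A plot like Figure 2 can be produced with the ROCR package (the same package used in the model-comparison code below); this sketch omits the illustrative dotted threshold lines, and label.ordering is passed explicitly to mark N as the negative class and P as the positive class.

library(ROCR)

# Sketch of an ROC curve for Model A on the simulated data
pred <- prediction(data$Probability, data$Class, label.ordering = c("N", "P"))
plot(performance(pred, "tpr", "fpr"), col = "red", lwd = 2, main = "ROC Curve: Model A")
abline(a = 0, b = 1, lty = 2)   # reference line for a random classifier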


Comparing Models using ROC

Another effective use of an ROC curve is to compare the performance of multiple classifiers. Let us assume we have another classifier, Model B, whose predicted class probabilities are plotted below.
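The code that generates Model B’s predictions is not shown in the text. The following is a hypothetical stand-in that creates a data2 frame with the same structure as data; the means and standard deviations are assumptions chosen so that Model B’s classes overlap more (i.e. separate less cleanly) than Model A’s.

# Hypothetical Model B predictions: same class structure as `data`, but with
# more overlap between the two probability distributions (parameters assumed)
set.seed(12)
data2 <- data.frame(
  Class = c( rep("N", N), rep("P", N) ),
  Probability = c( rnorm(N, mean=0.45, sd=0.12), rnorm(N, mean=0.55, sd=0.12) )
)
plot_multi_histogram(data2, 'Probability', 'Class', xintercept=0.5)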


You can compare the performance of both models using their ROC curves. Sometimes it is hard to identify the better model just by looking at the curves. To quantify performance, the Area Under the Curve (AUC) is computed. The AUC summarizes the ROC curve in a single number, so that models can be compared easily and automatically.

library(ROCR)   # provides prediction() and performance() for ROC/AUC

# Build prediction objects for both models and compute their AUCs
pred1 <- prediction(data$Probability,  data$Class)
pred2 <- prediction(data2$Probability, data2$Class)
auc1  <- performance(pred1, "auc")@y.values[[1]]
auc2  <- performance(pred2, "auc")@y.values[[1]]

title <- paste0('ROC: Comparing Models \n Model A AUC: ', round(auc1, 2),
                ', Model B AUC: ', round(auc2, 2))

# Overlay both ROC curves on one plot
plot(performance(pred1, "tpr", "fpr"), col = 'red',  lwd = 2, main = title)
plot(performance(pred2, "tpr", "fpr"), col = 'blue', lwd = 2, add = TRUE)
legend(x = .751, y = .81, legend = c("Model A", "Model B"),
       col = c("red", "blue"), lty = 1)

Outside of the machine learning and data science community, there are many popular variations of the idea behind ROC curves. The marketing analytics community uses lift and gain charts. The medical modeling community often looks at odds ratios. The statistics community examines sensitivity and specificity (sensitivity is the same quantity as TPR, and specificity equals 1 − FPR).
