This section contains the setup and the libraries used throughout.
Libraries used:
library(data.table)
library(ggplot2)
library(ggrepel)
library(ROCit)
library(ROCR)
Receiver Operating Characteristic (ROC) curves were conceived as a way to describe the quality of a binary classifier. Commonly, the input is a continuous random variable, and we declare that a signal is present if the r.v. exceeds a given threshold value. For example, a clinical test might be indicative of infection if the result is above a value of 70. Alternatively, we might decide that a signal is present in some noise if the readings move above a given value for a period of time.
An ROC curve gives you insight into the relationship between the true positive and false positive rates:
\[ \begin{aligned} \mathsf{TPR} &= \mathsf{Pr}(\mathsf{signal\ detected} \mid \mathsf{signal\ present}) \\ \mathsf{FPR} &= \mathsf{Pr}(\mathsf{signal\ detected} \mid \mathsf{signal\ absent}) \end{aligned} \]
Create some artificial data corresponding to a feed that we are listening to for signals.
set.seed(99)
# means and standard deviations for the two signal periods
sig_mu <- c(1.1, 2)
sig_sd <- c(1.1, 1.4)
# feed: standard normal background noise with two embedded signal periods
d <- data.table(
  y = c(rnorm(500),
        rnorm(150, sig_mu[1], sig_sd[1]),
        rnorm(500),
        rnorm(150, sig_mu[2], sig_sd[2]),
        rnorm(500)),
  signal = c(rep(0, 500),
             rep(1, 150),
             rep(0, 500),
             rep(2, 150),
             rep(0, 500))
)
# time index
d[, id := 1:.N]
How much deviation from the background noise would you need before you declared that you observed a signal?
ggplot(d, aes(x = id, y = y)) +
  geom_line() +
  # vertical lines mark the boundaries of the two signal periods
  geom_vline(xintercept = 500, col = 2, lwd = 0.4) +
  geom_vline(xintercept = 650, col = 2, lwd = 0.4) +
  geom_vline(xintercept = 1150, col = 2, lwd = 0.4) +
  geom_vline(xintercept = 1300, col = 2, lwd = 0.4) +
  theme_bw() +
  scale_x_continuous("Time") +
  scale_y_continuous("Signal") +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())
Figure 1: Monitoring a feed for signals
In the above plot, the stretches between each pair of vertical lines indicate where the feed comes from a different random variable. If you saw this data (without knowing the underlying truth), you might be less comfortable declaring that the first blip is a signal than declaring that the second blip is.
Below are kernel densities for the distribution of the feed from a noise period and the first signal period. The vertical lines show two possible values at which a decision threshold \(\tau\) could be set.
# restrict to the noise and first signal periods
dsub <- d[signal %in% c(0, 1), ]
lab <- c("No", "Yes", "Yes")
ggplot(dsub, aes(x = y,
                 col = lab[signal + 1],
                 group = lab[signal + 1])) +
  geom_density() +
  # candidate decision thresholds
  geom_vline(xintercept = c(0, 1), lwd = 0.2) +
  geom_text_repel(data = data.table(xcoor = c(0, 1),
                                    ycoor = 0.3),
                  aes(x = xcoor, y = ycoor,
                      label = paste0("Tau = ", xcoor)),
                  box.padding = 1,
                  max.overlaps = Inf,
                  size = 4, inherit.aes = F) +
  theme_bw() +
  scale_x_continuous("y") +
  scale_y_continuous("Density") +
  scale_colour_discrete("Signal") +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        legend.position = "bottom")
Figure 2: Kernel density from feed during intervals believed to contain background noise and the first signal respectively
For this scenario, if \(\tau = 0\), then about 80% of true positives would be flagged, but the false positive rate would be about 50%. If \(\tau = 1\), the true positive rate would be around 50%, but the false positive rate drops to about 15%.
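These rates can be read straight off the simulated data. A rough empirical check (the exact values wobble with the seed):

# empirical Pr(y > tau) by class for tau = 0 and tau = 1
dsub[, .(rate_tau0 = mean(y > 0), rate_tau1 = mean(y > 1)), keyby = signal]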
Next, the kernel density for the noise and the second signal.
# restrict to the noise and second signal periods
dsub <- d[signal %in% c(0, 2), ]
lab <- c("No", "Yes", "Yes")
ggplot(dsub, aes(x = y,
                 col = lab[signal + 1],
                 group = lab[signal + 1])) +
  geom_density() +
  # candidate decision thresholds
  geom_vline(xintercept = c(0, 1), lwd = 0.2) +
  geom_text_repel(data = data.table(xcoor = c(0, 1),
                                    ycoor = 0.3),
                  aes(x = xcoor, y = ycoor,
                      label = paste0("Tau = ", xcoor)),
                  box.padding = 1,
                  max.overlaps = Inf,
                  size = 4, inherit.aes = F) +
  theme_bw() +
  scale_x_continuous("y") +
  scale_y_continuous("Density") +
  scale_colour_discrete("Signal") +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        legend.position = "bottom")
Figure 3: Kernel density from feed during intervals believed to contain background noise and the second signal respectively
For this scenario, if \(\tau = 0\), the true positive rate would be around 90% while the false positive rate remains about 50%. And if \(\tau = 1\), the true positive rate is still over 70% but the false positive rate drops to about 15%. So, in this instance, the noise and the signal are more readily separated. You can imagine a scenario where the distribution of the signal sat far above the distribution of the noise. Under such a setting, it might be possible to perfectly separate the two classes such that the true positive rate was 100% and the false positive rate was 0.
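The same empirical check as before, now with dsub holding the noise and second signal periods:

# empirical Pr(y > tau) by class for tau = 0 and tau = 1
dsub[, .(rate_tau0 = mean(y > 0), rate_tau1 = mean(y > 1)), keyby = signal]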
The take home messages are:

- lowering the threshold \(\tau\) catches more of the signal (higher true positive rate) but also raises more false alarms (higher false positive rate)
- the further apart the noise and signal distributions sit, the better the achievable trade-off between the two rates
Compute the sample means and standard deviations
(stats <- dsub[, .(mu = mean(y), sd = sd(y)), keyby = signal])
## signal mu sd
## 1: 0 -0.003336282 0.9886502
## 2: 2 1.979934311 1.4985990
Assume that the underlying signal is normal, or perhaps you have done a bunch of calibration testing and so you know that a truly plausible signal has a mean of \(2\) and a variance of \(1.5^2\). Either way, use this knowledge of what constitutes a signal to compute the true and false positive rates. In this instance, use the descriptive statistics above to compute the true and false positive rates; this is the information required for constructing an ROC curve.
# grid of candidate thresholds
x <- seq(-5, 5, len = 1000)
# Pr(Y > x) under the signal distribution
true_pos <- 1 - pnorm(x,
                      mean = stats[signal == 2, mu],
                      sd = stats[signal == 2, sd])
# Pr(Y > x) under the noise distribution
false_pos <- 1 - pnorm(x,
                       mean = stats[signal == 0, mu],
                       sd = stats[signal == 0, sd])
droc <- data.table(
  false_pos = false_pos,
  true_pos = true_pos
)
The ROC curve is shown below with the red dots showing the true positive and false positive rates for threshold values of 0 and 1. The 45 degree line is what you would get if the classifier were no better than random guessing, in which case the area under the curve would be 0.5. The closer the ROC curve hugs the top left corner, the better the classifier.
dthresh <- data.table(tau = c(0, 1))
# false positive rate at each threshold (the noise is standard normal)
dthresh[, x := 1 - pnorm(c(0, 1))]
# true positive rate at each threshold
dthresh[, y := 1 - pnorm(c(0, 1),
                         stats[signal == 2, mu],
                         stats[signal == 2, sd])]
dthresh[, xcoor := 0.5]
dthresh[, ycoor := 0.5]
ggplot(droc, aes(x = false_pos,
                 y = true_pos)) +
  geom_line() +
  geom_abline(slope = 1, lty = 2) +
  geom_point(data = dthresh,
             aes(x = x, y = y),
             col = 2, size = 3) +
  geom_text_repel(data = dthresh,
                  aes(x = x, y = y,
                      label = paste0("Tau = ", tau)),
                  box.padding = 4,
                  max.overlaps = Inf,
                  size = 4, inherit.aes = F) +
  theme_bw() +
  scale_x_continuous("Pr(False positive)", limits = c(0, 1)) +
  scale_y_continuous("Pr(True positive)", limits = c(0, 1))
Figure 4: ROC curve
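As a sanity check on the area claim, the AUC of this theoretical curve can be approximated with the trapezoidal rule. A minimal sketch using the droc table built above:

# order by ascending false positive rate, then apply the trapezoidal rule
setorder(droc, false_pos)
droc[, sum(diff(false_pos) * (head(true_pos, -1) + tail(true_pos, -1)) / 2)]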
Choosing which \(\tau\) to use in practice is context dependent. For example, if the cost of a false negative is enormously high (say, missing a catastrophic structural defect in a skyscraper) then you are going to want the threshold that gives you the highest possible true positive rate, regardless of the fact that you are going to be detecting a lot of false positives.
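When neither error dominates, one common heuristic (of several) is Youden's J statistic, which picks the threshold maximising the gap between the true and false positive rates. A minimal sketch using the x grid and rates computed earlier:

# Youden's J: the tau on the grid maximising TPR - FPR
x[which.max(true_pos - false_pos)]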
Commonly, models are built to predict an outcome of interest (loan defaults, HIV infection, manufacturing defects, etc.) and will have varying predictive performance. In this setting, the conditional probabilities for the true and false positive rates are derived empirically.
Anyway, simulate some loan defaults that we will then try to predict with a model.
set.seed(2)
n <- 3000
d <- data.table(
  age = rnorm(n, 0, 1),           # standardised age
  dependents = rpois(n, 1),       # number of dependents
  sex = rbinom(n, 1, prob = 0.5),
  balance = rnorm(n, 0, 1)        # standardised avg bank balance
)
# linear predictor: intercept plus a coefficient for each covariate
b <- c(-2.0, -0.4, 0.6, 0, -0.2)
d[, eta := cbind(1, as.matrix(d)) %*% matrix(b, ncol = 1)]
# simulate defaults from the logistic model
d[, defaulted := rbinom(.N, 1, 1 / (1 + exp(-eta)))]
d[, dependents := factor(dependents)]
The defaulted column shows whether the person actually defaulted or not; that is what we want to predict. Model it with logistic regression, but also add a random-guess prediction for comparison.
# logistic regression model
lm1 <- glm(defaulted ~ age + dependents + sex + balance,
           data = d,
           family = binomial)
# model-based predicted probabilities
d[, p1 := predict(lm1, type = "response")]
# random guesses for comparison
d[, p2 := runif(nrow(d))]
Note that if you were serious about this, you would partition the data into train and test sets, build a bunch of models, and then evaluate them on the test data. For now, I am just going to use the GLM fit on the full data.
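For what it's worth, here is a minimal sketch of what that might look like with a simple 70/30 split (using rocit and ciAUC from the ROCit package, introduced below; it glosses over details such as factor levels of dependents appearing only in the test set):

idx <- sample(seq_len(n), size = floor(0.7 * n))
m_train <- glm(defaulted ~ age + dependents + sex + balance,
               data = d[idx], family = binomial)
# evaluate on the held-out 30%
p_test <- predict(m_train, newdata = d[-idx], type = "response")
ciAUC(rocit(score = p_test, class = d[-idx, defaulted]))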
The predictions for the probability of defaulting can now be used to compute an ROC curve via the ROCit package (or others like ROCR or pROC).
The summary method gives the area under the curve (AUC); ciAUC adds a confidence interval for it.
More information is available in the ROCit vignette https://cran.r-project.org/web/packages/ROCit/vignettes/my-vignette.html, and this discussion of the probabilistic interpretation of the AUC https://stats.stackexchange.com/questions/180638/how-to-derive-the-probabilistic-interpretation-of-the-auc is also quite informative.
r1 <- rocit(score=d$p1, class=d$defaulted)
summary(r1)
ciAUC(r1)
##
## Method used: empirical
## Number of positive(s): 683
## Number of negative(s): 2317
## Area under curve: 0.7124
##
## estimated AUC : 0.712447496415507
## AUC estimation method : empirical
##
## CI of AUC
## confidence level = 95%
## lower = 0.688880208215758 upper = 0.736014784615255
By way of comparison, here is the AUC for random guessing
r2 <- rocit(score=d$p2, class=d$defaulted)
ciAUC(r2)
##
## estimated AUC : 0.51962356027857
## AUC estimation method : empirical
##
## CI of AUC
## confidence level = 95%
## lower = 0.494833284118916 upper = 0.544413836438224
The AUC is commonly interpreted as accuracy, but it is actually the probability that a randomly drawn case with the signal present (here, a defaulter) receives a higher score than a randomly drawn case with the signal absent.
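A quick empirical demonstration of that interpretation: over all (defaulter, non-defaulter) pairs, compute the proportion in which the defaulter receives the higher predicted probability; it should match the AUC reported above.

pos <- d[defaulted == 1, p1]
neg <- d[defaulted == 0, p1]
# proportion of pairs where the positive case scores higher
mean(outer(pos, neg, `>`))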
plot(r1)
Figure 5: ROC curve for GLM prediction model
Alternatively, use ROCR, for which further instruction can be found at https://ipa-tys.github.io/ROCR/.
pred <- ROCR::prediction(d$p1, d$defaulted)
pred
perf <- ROCR::performance(pred,"tpr","fpr")
perf
## A prediction instance
##   with 3000 data points
## A performance instance
## 'False positive rate' vs. 'True positive rate' (alpha: 'Cutoff')
## with 3001 data points
The rainbow colour scale on the right-hand side gives the threshold value used at each point on the curve.
plot(perf,colorize=TRUE)
Figure 6: ROC curve for GLM prediction model
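If you need the actual cutoff values rather than eyeballing the colour bar, ROCR stores them in the alpha.values slot of the performance object:

# cutoffs corresponding to successive points on the curve
head(perf@alpha.values[[1]])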
Binary classification involves predicting one of two levels and the ROC curve can be used to quantify a model’s predictive performance via the AUC.
The AUC is defined as the probability that a randomly drawn case with the signal present receives a higher score than a randomly drawn case that contains only noise. In other words, it tells you the extent to which noise and signal can be separated, or how well the model can distinguish between the classes.
Other aspects that were not covered here include the various terminology, performance metrics and diagnostics. Examples of some of the relevant terms include: sensitivity, specificity, accuracy, positive predictive value, negative predictive value, mis-classification rate, etc. Those can be for another day.