ROC class notes

Le Kang

2024-01-22

Two-by-two classification table

Four joint probabilities,

  • \(p(s>t,P)\),
  • \(p(s>t,N)\),
  • \(p(s \leq t,P)\),
  • \(p(s \leq t,N)\).

One very common measure is the misclassification rate \(p(s>t,N)+p(s \leq t,P)\), but it is far from perfect because it weights the two kinds of misclassification as equally important.

Conditional and marginal probabilities

  • \(p(s>t|N)\), false positive rate, fp;
  • \(p(s>t|P)\), true positive rate, tp, sensitivity;
  • \(p(s \leq t|N)\), true negative rate, tn=1-fp, specificity;
  • \(p(s \leq t|P)\), false negative rate, fn=1-tp;
  • \(p(P)\) is the prevalence, and \(p(N)=1-p(P)\).

Converse conditional probabilities

  • \(p(P|s>t)\), positive predictive value, ppv;
  • \(p(N|s\leq t)\), negative predictive value, npv;
  • misclassification rate as a weighted sum of fn and fp, with the class probabilities as weights: \((1-tp)\times p(P)+fp\times p(N)\)
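
The weighted-sum form follows by factoring each joint probability into a conditional probability times a class probability. A short derivation, using only the quantities defined above:

\[
\begin{aligned}
p(s>t,N)+p(s \leq t,P) &= p(s>t|N)\,p(N)+p(s \leq t|P)\,p(P) \\
&= fp \times p(N) + (1-tp)\times p(P).
\end{aligned}
\]

So, unlike tp and fp themselves, the misclassification rate depends on the prevalence.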

The conditional probabilities

\(p(s>t|N)\), \(p(s>t|P)\), \(p(s \leq t|N)\), \(p(s \leq t|P)\) are independent of prevalence. This gives them power to generalize over populations which have different proportions belonging to the classes, and is a property we will make use of below when describing the ROC curve.

What about ppv and npv?

Ex: \(tp=tn=0.99\), \(p(P)=0.001\)
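
The example can be worked through with Bayes' theorem. The sketch below is only an illustration: `predictive_values()` is a hypothetical helper introduced here, not something defined in these notes.

```r
# Predictive values from tp, tn and prevalence via Bayes' theorem.
# predictive_values() is a hypothetical helper, not part of the notes.
predictive_values <- function(tp, tn, prev) {
  fp <- 1 - tn                      # false positive rate
  fn <- 1 - tp                      # false negative rate
  ppv <- tp * prev / (tp * prev + fp * (1 - prev))        # p(P | s > t)
  npv <- tn * (1 - prev) / (tn * (1 - prev) + fn * prev)  # p(N | s <= t)
  c(ppv = ppv, npv = npv)
}

pv <- predictive_values(tp = 0.99, tn = 0.99, prev = 0.001)
print(round(pv, 4))
```

With these inputs the ppv is only about 0.09, even though tp and tn are both 0.99: unlike tp and fp, the predictive values depend strongly on prevalence.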

The true positive and true negative rates of a classification rule are usually used together as joint measures of performance.

  • in general, decreasing \(t\) so that the true positive rate increases will lead to the true negative rate decreasing.
  • could choose \(t\) such that misclassification rate is minimized
  • or choose \(t\) such that \(tp-fp\), equivalently \(tp+tn-1\), is maximized; the maximum of this quantity is the Youden index
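
As a sketch of the last criterion, suppose (purely as an assumption for illustration) that scores are Normal(0, 1) in population N and Normal(1, 1) in population P. The threshold maximizing \(tp-fp\) can then be found numerically; `tp_rate`, `fp_rate` and `youden` are hypothetical names.

```r
# Assumed score model (not from the notes): N ~ Normal(0, 1), P ~ Normal(1, 1).
tp_rate <- function(t) 1 - pnorm(t, mean = 1)   # p(s > t | P)
fp_rate <- function(t) 1 - pnorm(t, mean = 0)   # p(s > t | N)
youden  <- function(t) tp_rate(t) - fp_rate(t)  # tp - fp = tp + tn - 1

# One-dimensional search for the maximizing threshold
opt <- optimize(youden, interval = c(-5, 5), maximum = TRUE)
best_t       <- opt$maximum     # threshold that maximizes tp - fp
youden_index <- opt$objective   # the maximum itself
print(c(best_t = best_t, youden_index = youden_index))
```

Under this assumed model the maximizing threshold sits midway between the two means, where the two densities cross.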

Implication

In essence, assessing a classification rule amounts to comparing the distributions of the scores for the positive and negative populations.

  • a good rule tends to produce high scores for the positive population, and low scores for the negative population, and the classifier is better the larger the extent to which these distributions differ.

ROC

The ROC curve is a way of jointly displaying these two distributions.

  • a graph showing true positive rate on the vertical axis and false positive rate on the horizontal axis, as the classification threshold \(t\) varies.

  • why are \(tp\) and \(fp\) enough?

rm(list=ls())
library(mvtnorm)
library(ggplot2)
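# Simulate scores: 100 positives from N(1,1) and 100 negatives from N(0,1).
# No seed is set, so the values printed below change from run to run.
Psub=rnorm(100,mean=1)
Nsub=rnorm(100,mean=0)

# One row per subject: score x and class label (1 = P, 0 = N)
df=data.frame(x=c(Psub,Nsub),class=rep(1:0,each=100))
head(df)
# Overlaid score densities of the two classes
ggplot(df,aes(x=x,fill=factor(class)))+geom_density(alpha=0.4)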

          x class
1 1.8083866     1
2 0.1019016     1
3 0.7259634     1
4 1.4447029     1
5 1.9019546     1
6 2.5012733     1

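# Use every observed score as a candidate threshold
all_t=sort(df$x)
nP=sum(df$class==1)   # number of positives
nN=sum(df$class==0)   # number of negatives

tp=fp=numeric(length(all_t))
for (i in seq_along(all_t))
{
  # Classify as positive when the score exceeds the threshold
  pred=factor(ifelse(df$x>all_t[i],1,0),levels = c(0,1))
  tmptbl=table(pred,df$class)   # rows: predicted class, columns: true class
  tp[i] = tmptbl[2,2]/nP        # p(s>t|P)
  fp[i] = tmptbl[2,1]/nN        # p(s>t|N)
}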

roc=data.frame(tp,fp)
ggplot(roc,aes(x=fp,y=tp))+geom_point()

# Draw the ROC points one at a time to show how the curve is traced as t increases
plot(fp,tp,ylim=c(0,1),xlim=c(0,1),type="n")
for (i in seq_along(all_t)) {
  points(fp[i],tp[i])
  Sys.sleep(0.1)
}

AUC

The Area Under the Curve or AUC, based on the ROC curve, is a global measure of separability between the distributions of scores for the positive and negative populations.

It does not require one to choose a threshold value, but summarizes the results over all possible choices.

It turns out to be equivalent to the popular Mann-Whitney U-statistic: a measure of the similarity of the two score distributions.
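
The equivalence can be checked numerically. The sketch below simulates scores in the same way as the earlier code (the seed is arbitrary and the variable names are illustrative), computes the area under the empirical ROC curve by the trapezoidal rule, and compares it with the proportion of (P, N) pairs in which the P score is larger, which is the Mann-Whitney U-statistic divided by the number of pairs.

```r
set.seed(1)                         # arbitrary seed, for reproducibility only
Psub <- rnorm(100, mean = 1)
Nsub <- rnorm(100, mean = 0)

# Empirical ROC points; the -Inf threshold adds the point (1, 1),
# and the largest observed score gives the point (0, 0)
thr <- c(-Inf, sort(c(Psub, Nsub)))
tp  <- sapply(thr, function(t) mean(Psub > t))
fp  <- sapply(thr, function(t) mean(Nsub > t))

# Trapezoidal area; reverse so that fp runs from 0 up to 1
x <- rev(fp)
y <- rev(tp)
auc_trap <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

# Mann-Whitney: proportion of (P, N) pairs with the P score larger
auc_mw <- mean(outer(Psub, Nsub, ">"))

# The same quantity from wilcox.test(): its statistic counts those pairs
auc_w <- unname(wilcox.test(Psub, Nsub)$statistic) /
  (length(Psub) * length(Nsub))

print(c(trapezoid = auc_trap, mann_whitney = auc_mw, wilcox = auc_w))
```

With continuous scores there are no ties, so the three numbers agree up to rounding error.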

Population ROC curves

Again, we assume the classification rule to be some continuous function \(S(\boldsymbol{X})\) of the random vector \(\boldsymbol{X}\) of variables measured on each individual, conventionally arranged so that large values of the function are more indicative of population P and small ones more indicative of population N.

If \(\boldsymbol{x}\) is the observed value of \(\boldsymbol{X}\), then the individual is allocated to population P or N according as \(s(\boldsymbol{x})\) exceeds or does not exceed some threshold \(t\).

Some ROC examples

  • The classifier is least successful when the score distributions of the two populations are identical: then \(tp=fp\) for every \(t\), and the ROC curve is the diagonal from (0,0) to (1,1).

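  • At the other (usually unattainable) extreme there is complete separation: at least one \(t\) gives \(tp=1, fp=0\). For all smaller values of \(t\), \(tp = 1\) while \(fp\) increases from 0 to 1; for all larger values of \(t\), \(fp=0\) while \(tp\) decreases from 1 to 0. The ROC curve then runs along the left and top edges of the unit square.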

  • In practice, the ROC curve will be a continuous curve lying between these two extremes, so it will lie in the upper triangle of the graph.

Properties of the ROC

  1. Writing \(x(t)=fp\) and \(y(t)=tp\), the ROC curve \(y=h(x)\) is a monotone increasing function in the positive quadrant, running from (0,0) to (1,1) as \(t\) decreases. (Why?)

  2. The ROC curve is unaltered if the classification scores undergo a strictly increasing transformation.

  3. The slope of the ROC at the point with \(t\) is \[\frac{dy}{dx}=\frac{p(t|P)}{p(t|N)}\]
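
Property 2 can be illustrated with a small simulation. This is a sketch under assumed normal scores; `roc_points()` is a hypothetical helper, and `exp()` is used only as one example of a strictly increasing transformation.

```r
set.seed(2)                          # arbitrary seed, for reproducibility only
score <- c(rnorm(50, mean = 1), rnorm(50, mean = 0))
class <- rep(1:0, each = 50)         # 1 = P, 0 = N

# roc_points() is a hypothetical helper: (fp, tp) at every observed threshold
roc_points <- function(s, cl) {
  thr <- sort(s)
  data.frame(
    fp = sapply(thr, function(t) mean(s[cl == 0] > t)),
    tp = sapply(thr, function(t) mean(s[cl == 1] > t))
  )
}

roc_raw <- roc_points(score, class)
roc_exp <- roc_points(exp(score), class)   # scores and thresholds both transformed

same <- isTRUE(all.equal(roc_raw, roc_exp))
print(same)
```

The transformation changes the thresholds but not the ordering of the scores, so every threshold still splits the sample in the same way and the (fp, tp) pairs are unchanged.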

Continuous scores

For population N, assume pdf and cdf to be \(f\) and \(F\), respectively.

For population P, assume pdf and cdf to be \(g\) and \(G\), respectively.

\(x(t)=1-F(t)\), \(y(t)=1-G(t)\), so \[y=1-G[F^{-1}(1-x)], ~~0\leq x \leq 1\]
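
As a sketch of this formula, take \(F\) to be the Normal(0, 1) cdf and \(G\) the Normal(1, 1) cdf. This is an assumption made here for illustration, chosen to match the means used in the simulation earlier in these notes; `roc_curve` is a hypothetical name.

```r
# Assumed example: F = cdf of Normal(0, 1) for N, G = cdf of Normal(1, 1) for P
roc_curve <- function(x) {
  t <- qnorm(1 - x, mean = 0)   # t = F^{-1}(1 - x)
  1 - pnorm(t, mean = 1)        # y = 1 - G(t)
}

x <- seq(0, 1, by = 0.01)
plot(x, roc_curve(x), type = "l", xlab = "fp", ylab = "tp")
abline(0, 1, lty = 2)           # chance line tp = fp

# Area under the population ROC curve by numerical integration
auc <- integrate(roc_curve, lower = 0, upper = 1)$value
print(auc)
```

For two normal populations with unit variance and means one unit apart, the area has the closed form \(\Phi(1/\sqrt{2}) \approx 0.76\), which the numerical integral should reproduce.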