ROC class notes

Le Kang

2024-02-05

Breast Imaging Study Responses

library(pROC)
library(readxl)
BIS_Responses <- read_excel("./Breast Imaging Study Responses.xlsx")
head(BIS_Responses,10)
# A tibble: 10 × 6
   CaseID    hawilar    he  liul  shok Status
   <chr>       <dbl> <dbl> <dbl> <dbl> <chr> 
 1 1-007.jpg       3     4     1     2 B     
 2 1-014.jpg       6     2     1     1 B     
 3 1-016.jpg       4     1     3     5 M     
 4 1-030.jpg       3     2     1     1 B     
 5 1-031.jpg       8     9    10    10 M     
 6 1-039.jpg       6     4     7     5 M     
 7 1-043.jpg       9    10    10    10 B     
 8 1-048.jpg       4     6     1     3 B     
 9 1-050.jpg       7     9     7     9 B     
10 1-051.jpg       4     2     1     2 B     

ROC curves

R Code

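# ROC curve for the first reader; pROC puts specificity on the x-axis, running from 1 to 0
plot(roc(BIS_Responses$Status, BIS_Responses$hawilar),
     xlim = c(1, 0), col = "cyan", asp = 1)

# Overlay the remaining readers on the same axes
plot(roc(BIS_Responses$Status, BIS_Responses$he), add = TRUE, col = "red")
plot(roc(BIS_Responses$Status, BIS_Responses$liul), add = TRUE, col = "blue")
plot(roc(BIS_Responses$Status, BIS_Responses$shok), add = TRUE, col = "green")

# Key linking each colour to a reader
legend("bottomright", legend = c("hawilar", "he", "liul", "shok"),
       col = c("cyan", "red", "blue", "green"), lwd = 2)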

The ROC binormal model

Parameters: \(\mu_P, \mu_N, \sigma_P, \sigma_N\)

Assuming that \(\mu_P>\mu_N\) in accord with the convention that large values of \(S\) are indicative of population P and small ones indicative of population N.

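Standardizing \(S\) within each population gives the false positive fraction \(x(t)\) and the true positive fraction \(y(t)\) at threshold \(t\):

\[x(t)=p(S>t|N)=p\left(Z>\dfrac{t-\mu_N}{\sigma_N}\right)=\Phi\left(\dfrac{\mu_N-t}{\sigma_N}\right)\] \[y(t)=p(S>t|P)=\Phi\left(\dfrac{\mu_P-t}{\sigma_P}\right)\]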

ROC curve under binormal model

\[\Phi^{-1}(y)=a+b\Phi^{-1}(x),\] where \(a=(\mu_P-\mu_N)/\sigma_P, b=\sigma_N/\sigma_P.\)

It follows from the earlier assumptions that \(a>0\), while \(b>0\) because it is a ratio of two standard deviations.

\[AUC=p(S_P>S_N)=\Phi\left(\dfrac{\mu_P-\mu_N}{\sqrt{\sigma^2_P+\sigma^2_N}}\right)=\Phi\left(\dfrac{a}{\sqrt{1+b^2}}\right)\]
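As a quick numerical check of the formulas above, the following sketch computes \(a\), \(b\), the binormal ROC curve, and the AUC in R. The parameter values are arbitrary illustrations, not estimates from the breast imaging data, and the helper name `binormal_roc` is made up for this example.

```r
# Hypothetical parameter values, chosen only for illustration
mu_N <- 0;   sigma_N <- 1
mu_P <- 1.5; sigma_P <- 1.2

# Binormal intercept and slope
a <- (mu_P - mu_N) / sigma_P
b <- sigma_N / sigma_P

# ROC curve: tp as a function of fp, from Phi^{-1}(y) = a + b * Phi^{-1}(x)
binormal_roc <- function(x) pnorm(a + b * qnorm(x))

# Closed-form AUC
auc_formula <- pnorm(a / sqrt(1 + b^2))

# Numerical area under the curve, for comparison
auc_numeric <- integrate(binormal_roc, lower = 0, upper = 1)$value

c(a = a, b = b, auc_formula = auc_formula, auc_numeric = auc_numeric)
```

The two AUC values should agree up to integration error. Calling `curve(binormal_roc, from = 0, to = 1, xlab = "fp", ylab = "tp")` draws the smooth curve.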

The binormal model is appropriate for any ROC curve pertaining to populations that can both be transformed to normality by the same monotone transformation, because the ROC curve is unchanged by such a transformation of the scores.

ROC Estimation

  • Empirical counting process (jagged)
  • Parametric (binormal, bigamma)
  • Nonparametric
  • Semi-parametric

ROC curve in mathematical form

\[y=1-G[F^{-1}(1-x)], ~~0\leq x \leq 1,\]

where \(f\) and \(F\) are the pdf and cdf for population N, and \(g\) and \(G\) are the pdf and cdf for population P.
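The general formula applies to any pair of continuous distributions, not only normal ones. A minimal sketch, assuming gamma-distributed scores (the bigamma case listed under parametric estimation) with made-up shape and rate parameters; `roc_gamma` is a hypothetical helper name:

```r
# Hypothetical gamma parameters, for illustration only
shape_N <- 2; rate_N <- 1    # population N: cdf F
shape_P <- 4; rate_P <- 1    # population P: cdf G (tends to larger scores)

# y = 1 - G[F^{-1}(1 - x)]
roc_gamma <- function(x) {
  t <- qgamma(1 - x, shape = shape_N, rate = rate_N)  # threshold with fp = x
  1 - pgamma(t, shape = shape_P, rate = rate_P)       # tp at that threshold
}

x <- seq(0, 1, by = 0.1)
y <- roc_gamma(x)
round(cbind(fp = x, tp = y), 3)
```

Swapping `qgamma`/`pgamma` for another quantile and cdf pair gives the ROC curve for that model.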

Empirical: \[y=1-\hat{G}[\hat{F}^{-1}(1-x)], ~~0\leq x \leq 1,\]

Empirical CDFs are step functions, so the empirical ROC curve changes only at the observed test scores.

\[\hat{tp}(t)=\dfrac{\sum\mathbf{1}(S_P>t)}{ n_P}=1-\hat{G}(t)\] \[\hat{fp}(t)=\dfrac{\sum\mathbf{1}(S_N>t)}{ n_N}=1-\hat{F}(t)\]

Although technically all possible values of \(t\) need to be considered, in practice \(\hat{fp}\) changes only when \(t\) crosses the score of one of the \(n_N\) individuals and \(\hat{tp}\) changes only when \(t\) crosses the score of one of the \(n_P\) individuals, so there will be at most \(n_N+n_P+1\) discrete points on the plot.

The connected lines are either horizontal or vertical if just one of \((fp, tp)\) changes at that value of \(t\), and they are sloped if both estimates change.
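The counting estimators above can be sketched directly in R. The scores here are simulated (an assumption made only so the example runs on its own), and the last two lines check that the trapezoidal area under the empirical curve matches the Mann-Whitney estimate of \(p(S_P>S_N)\) with ties counted as one half.

```r
set.seed(1)  # arbitrary seed so the example is reproducible
S_P <- round(rnorm(10, mean = 1) * 10, 1)  # simulated scores, population P
S_N <- round(rnorm(10, mean = 0) * 10, 1)  # simulated scores, population N

# Candidate thresholds: -Inf plus every distinct observed score
thr <- c(-Inf, sort(unique(c(S_P, S_N))))

# Counting estimators of tp(t) and fp(t)
tp <- sapply(thr, function(t) mean(S_P > t))
fp <- sapply(thr, function(t) mean(S_N > t))

roc_pts <- data.frame(threshold = thr, fp = fp, tp = tp)
nrow(roc_pts)  # number of plotted points, bounded by n_N + n_P + 1

# Trapezoidal area under the empirical curve (points run from (1,1) to (0,0))
auc_trap <- sum(-diff(fp) * (head(tp, -1) + tail(tp, -1)) / 2)

# Mann-Whitney estimate of p(S_P > S_N), counting ties as 1/2
auc_mw <- mean(outer(S_P, S_N, ">") + 0.5 * outer(S_P, S_N, "=="))
```

`plot(fp, tp, type = "l")` joins the points; a sloped segment appears only where a P score and an N score are tied.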

Empirical ROC curves depend only on the ranks of the combined set of test scores.

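set.seed(1)  # added so the simulated scores are reproducible
P <- round(rnorm(10, mean = 1) * 10, 1)  # scores from population P
N <- round(rnorm(10, mean = 0) * 10, 1)  # scores from population N
G <- ecdf(P)     # empirical cdf of P
Fhat <- ecdf(N)  # empirical cdf of N (not named F, which would mask FALSE)
# Inverse of the empirical cdf of N; type = 1 gives the step-function inverse
Finv <- function(x) quantile(N, x, type = 1, names = FALSE)

# Empirical ROC: tp as a function of fp
y <- function(x) 1 - G(Finv(1 - x))

xx <- seq(0, 1, by = 0.05)
yy <- sapply(xx, y)

plot(xx, yy, xlim = c(0, 1), ylim = c(0, 1), xlab = "fp", ylab = "tp")
abline(a = 0, b = 1)  # chance diagonal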
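The rank dependence of the empirical ROC curve can be checked by applying a strictly increasing transformation to the scores and recomputing the curve. This is a small sketch on simulated data; `emp_roc` and `h` are hypothetical helper names introduced for the illustration.

```r
# Empirical ROC points (fp, tp) at every distinct observed score
emp_roc <- function(s_p, s_n) {
  thr <- c(-Inf, sort(unique(c(s_p, s_n))))
  data.frame(fp = sapply(thr, function(t) mean(s_n > t)),
             tp = sapply(thr, function(t) mean(s_p > t)))
}

set.seed(2)  # arbitrary seed
s_p <- round(rnorm(10, mean = 1) * 10, 1)
s_n <- round(rnorm(10, mean = 0) * 10, 1)

# A strictly increasing transformation changes the scores but not their ranks
h <- function(s) exp(s / 10)

roc_raw   <- emp_roc(s_p, s_n)
roc_trans <- emp_roc(h(s_p), h(s_n))

# Compare the (fp, tp) points on the two scales
identical(roc_raw, roc_trans)
```

Because only the ordering of the combined scores enters `emp_roc`, the comparison should report that the two sets of points coincide.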

Parametric estimation

Sometimes the irregular appearance of the empirical ROC curve is not deemed adequate as an estimate of the underlying “true” smooth curve.

Ex: The binormal model - estimating \(a\) and \(b\)

The Dorfman and Alf method

With ordered categorical data, Dorfman and Alf proposed a maximum likelihood method.

Assume the score \(S\) can take only one of a finite set of ranked categories \(C_1, C_2, \ldots, C_k\). The model posits a latent random variable \(W\) and a set of unknown thresholds \(-\infty=w_0<w_1<w_2<\cdots<w_k=\infty\) such that \(S\) falls in category \(C_i\) if and only if \(w_{i-1}<W\leq w_i\). The category probabilities are then \(p_{i|N}=p(w_{i-1}<W\leq w_i|N)\) and \(p_{i|P}=p(w_{i-1}<W\leq w_i|P)\).

The log-likelihood function is \[\mathcal{L}=\sum_{i=1}^k(n_{i|N}\log p_{i|N}+n_{i|P}\log p_{i|P})\] where \(n_{i|N}\) and \(n_{i|P}\) are the observed numbers of individuals from populations N and P, respectively, falling in category \(C_i\).
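To make the likelihood concrete, here is a minimal sketch that maximizes it under the binormal model. It is not a reproduction of Dorfman and Alf's own algorithm: it uses R's general-purpose optimizer `optim`, the rating counts are hypothetical, and the latent \(W\) is scaled so that \(W\sim N(0,1)\) in population N, which gives \(p(W\leq w|P)=\Phi(bw-a)\). The helper names `cat_probs` and `neg_loglik` are made up for the example.

```r
# Hypothetical rating counts for k = 5 ordered categories (not real data)
n_N <- c(30, 12, 8, 6, 4)   # population N
n_P <- c(4, 6, 8, 12, 30)   # population P

# Category probabilities under the binormal model (W ~ N(0, 1) in population N)
cat_probs <- function(a, b, w) {
  cuts <- c(-Inf, w, Inf)                       # w_0, w_1, ..., w_k
  list(N = diff(pnorm(cuts)),
       P = diff(pnorm(b * cuts - a)))
}

# Negative log-likelihood; theta = (a, log b, w_1, log increments of w)
neg_loglik <- function(theta) {
  a <- theta[1]
  b <- exp(theta[2])                                # keeps b > 0
  w <- theta[3] + c(0, cumsum(exp(theta[-(1:3)])))  # keeps w increasing
  p <- cat_probs(a, b, w)
  # pmax guards against log(0) during the search
  -sum(n_N * log(pmax(p$N, 1e-12)) + n_P * log(pmax(p$P, 1e-12)))
}

start <- c(1, 0, -1, rep(log(0.5), 3))  # arbitrary starting values
fit <- optim(start, neg_loglik, method = "BFGS")

a_hat <- fit$par[1]
b_hat <- exp(fit$par[2])
auc_hat <- pnorm(a_hat / sqrt(1 + b_hat^2))
c(a = a_hat, b = b_hat, AUC = auc_hat)
```

Working with \(\log b\) and log increments of the thresholds is one convenient way to respect the constraints \(b>0\) and \(w_1<\cdots<w_{k-1}\) without a constrained optimizer.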

The Metz method

With continuous data, Metz et al. proposed using truth-state runs in the rank-ordered data as a natural categorization of continuously distributed test results, so that the ROC curve can be estimated by maximum likelihood (ML).
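The idea of a truth-state run can be illustrated with `rle`. The sketch below only identifies the runs in a small hypothetical data set; how the runs are grouped into categories and how tied scores are handled follow Metz et al. and are not reproduced here.

```r
# Hypothetical continuous test results with known truth state
score <- c(3.1, 1.2, 5.0, 2.5, 6.3, 3.8, 5.9, 4.4)
truth <- c("P", "N", "P", "N", "P", "N", "N", "P")

# Rank-order the cases by score and read off the truth states in that order
ord <- order(score)
truth_sorted <- truth[ord]

# Truth-state runs: maximal stretches of consecutive cases with the same truth state
runs <- rle(truth_sorted)
data.frame(run = seq_along(runs$lengths),
           truth = runs$values,
           n_cases = runs$lengths)
```

Each row of the resulting table is one run, so continuous data with many cases typically collapse to far fewer ordered groups than there are observations.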

Semiparametric estimation

The kernel density methods:

\[\hat{f}(x)=\dfrac{1}{n_N h_N}\sum_{i=1}^{n_N}k\left(\dfrac{x-s_{N_i}}{h_N}\right)\] \[\hat{g}(x)=\dfrac{1}{n_P h_P}\sum_{i=1}^{n_P}k\left(\dfrac{x-s_{P_i}}{h_P}\right)\]

where \(k(\cdot)\) is the kernel function and \(h_N, h_P\) are the bandwidths for populations N and P.

Choosing between the many available kernel functions is relatively unimportant as all give comparable results, but more care needs to be taken over the selection of bandwidth.
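A kernel-smoothed ROC curve can be sketched as follows, assuming a Gaussian kernel, so that integrating the density estimate gives a cdf estimate that is an average of normal cdfs. The data are simulated, and the rule-of-thumb bandwidth `bw.nrd0` is only one possible choice, used here as an assumption.

```r
set.seed(3)  # arbitrary seed
s_N <- rnorm(50, mean = 0)   # simulated scores, population N
s_P <- rnorm(50, mean = 1)   # simulated scores, population P

# Rule-of-thumb bandwidths (an assumption; other selectors are possible)
h_N <- bw.nrd0(s_N)
h_P <- bw.nrd0(s_P)

# Integrating a Gaussian-kernel density estimate gives a smooth cdf estimate
F_hat <- function(t) sapply(t, function(u) mean(pnorm((u - s_N) / h_N)))
G_hat <- function(t) sapply(t, function(u) mean(pnorm((u - s_P) / h_P)))

# Sweep the threshold t over a grid wider than the data
t_grid <- seq(min(s_N, s_P) - 4 * max(h_N, h_P),
              max(s_N, s_P) + 4 * max(h_N, h_P), length.out = 200)
fp <- 1 - F_hat(t_grid)
tp <- 1 - G_hat(t_grid)
```

`plot(fp, tp, type = "l")` then draws a smooth curve; rerunning with larger or smaller `h_N` and `h_P` shows how strongly the result depends on the bandwidths.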

Since \(F\) and \(G\) are estimated separately, the final ROC curve estimator is not invariant under a monotone transformation of the data.

Spline smoothing is also a popular approach to density estimation.