ROC class notes

Le Kang

2024-02-14

Semiparametric estimation

Kernel density methods obtain smooth estimates of the densities \(f\) and \(g\), and hence of the distribution functions \(F\) and \(G\), directly from the data, without imposing any distributional constraints.

\[\hat{f}(x)=\dfrac{1}{n_N h_N}\sum_{i=1}^{n_N}k\left(\dfrac{x-s_{N_i}}{h_N}\right)\] \[\hat{g}(x)=\dfrac{1}{n_P h_P}\sum_{i=1}^{n_P}k\left(\dfrac{x-s_{P_i}}{h_P}\right)\]

where \(k(\cdot)\) is the kernel function and \(h_N\), \(h_P\) are the bandwidths for populations N and P, respectively.

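# Simulate 500 standard normal scores
x <- rnorm(500)
# Histogram on the density scale so the kernel estimate can be overlaid
hist(x, freq = FALSE)
# Gaussian-kernel density estimate with the default bandwidth
dens <- density(x)
lines(dens, col = "red")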

dens


Call:
    density.default(x = x)

Data: x (500 obs.); Bandwidth 'bw' = 0.266

       x                 y            
 Min.   :-4.1683   Min.   :0.0000337  
 1st Qu.:-2.2261   1st Qu.:0.0057572  
 Median :-0.2838   Median :0.0715865  
 Mean   :-0.2838   Mean   :0.1285882  
 3rd Qu.: 1.6585   3rd Qu.:0.2440612  
 Max.   : 3.6008   Max.   :0.3553230
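To connect the formula for \(\hat{f}\) with the output of density(), the estimator can be written out by hand. The following is a minimal sketch, assuming a Gaussian kernel; the helper name `kde_manual`, the simulated scores `s_N`, and the seed are illustrative choices, not part of the notes.

```r
# Hand-written Gaussian kernel density estimate (illustrative helper)
kde_manual <- function(x0, s, h) {
  # For each evaluation point u, average the scaled kernel over the sample s
  sapply(x0, function(u) mean(dnorm((u - s) / h)) / h)
}

set.seed(1)                      # arbitrary seed, only for reproducibility
s_N <- rnorm(200)                # hypothetical scores from population N
h_N <- bw.nrd0(s_N)              # rule-of-thumb bandwidth used by density()
grid <- seq(min(s_N) - 4 * h_N, max(s_N) + 4 * h_N, length.out = 400)
f_hat <- kde_manual(grid, s_N, h_N)

# density() works on a binned grid, so agreement is close but not exact
dens_builtin <- density(s_N, bw = h_N)
builtin_on_grid <- approx(dens_builtin$x, dens_builtin$y, xout = grid)$y
max_diff <- max(abs(builtin_on_grid - f_hat), na.rm = TRUE)
max_diff
```

The same function applied to the scores from population P with bandwidth \(h_P\) gives \(\hat{g}\).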

The choice among the many available kernel functions is relatively unimportant, as all give comparable results, but more care is needed in selecting the bandwidth. A general-purpose bandwidth may be used, such as the rule of thumb \(h=0.9\min(\hat{\sigma}, IQR/1.34)\,n^{-1/5}\), which is the default in density().
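As an illustration of general-purpose bandwidths, the sketch below compares a few of the selectors shipped with base R. The rule of thumb is also computed by hand; the simulated data and the seed are assumptions made for the example only.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
x <- rnorm(500)
n <- length(x)

# Rule of thumb by hand: 0.9 * min(sd, IQR / 1.34) * n^(-1/5)
h_by_hand <- 0.9 * min(sd(x), IQR(x) / 1.34) * n^(-1 / 5)

h_nrd0 <- bw.nrd0(x)             # default in density()
h_nrd  <- bw.nrd(x)              # same form with factor 1.06
h_sj   <- bw.SJ(x)               # Sheather-Jones plug-in selector

c(by_hand = h_by_hand, nrd0 = h_nrd0, nrd = h_nrd, SJ = h_sj)
```

Any of these values can be passed to density() through its `bw` argument.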

Because \(F\) and \(G\) are estimated separately, the final ROC curve estimator is not invariant under a monotone transformation of the data. With a Gaussian kernel, the corresponding AUC estimate has the closed form

\[\widehat{AUC}=\dfrac{1}{n_N n_P}\sum_{i=1}^{n_N}\sum_{j=1}^{n_P} \Phi\left(\dfrac{s_{P_j}-s_{N_i}}{\sqrt{h_N^2+h_P^2}}\right)\]
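The double sum above translates almost directly into R. This is a minimal sketch, assuming Gaussian kernels and rule-of-thumb bandwidths as defaults; the function names `auc_kernel` and `auc_empirical` and the simulated scores are hypothetical.

```r
# Kernel-smoothed AUC: average of Phi((s_Pj - s_Ni) / sqrt(h_N^2 + h_P^2))
auc_kernel <- function(s_N, s_P, h_N = bw.nrd0(s_N), h_P = bw.nrd0(s_P)) {
  d <- outer(s_P, s_N, "-")      # d[j, i] = s_Pj - s_Ni
  mean(pnorm(d / sqrt(h_N^2 + h_P^2)))
}

# Empirical (Mann-Whitney) AUC for comparison, assuming no ties
auc_empirical <- function(s_N, s_P) mean(outer(s_P, s_N, ">"))

set.seed(1)                      # arbitrary seed, only for reproducibility
s_N <- rnorm(100)                # hypothetical scores, population N
s_P <- rnorm(100, mean = 1)      # hypothetical scores, population P

auc_smooth <- auc_kernel(s_N, s_P)
auc_mw     <- auc_empirical(s_N, s_P)
c(kernel = auc_smooth, empirical = auc_mw)
```

As both bandwidths shrink towards zero, \(\Phi(\cdot)\) approaches an indicator function and the kernel estimate approaches the empirical AUC.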

Spline smoothing is also a popular approach to density estimation.

Variance estimation for AUC

For parametric and semiparametric methods, maximum likelihood theory yields asymptotic expressions for the variances and covariances of the parameter estimates, and the delta method then yields the required variance of \(\widehat{AUC}\).

\[\mathcal{I}(\theta)=-E\left[\dfrac{\partial^2 \mathcal{L}(\theta)}{\partial \theta^2}\right]_{\theta=\hat{\theta}}\] \[var(\hat{\theta})=[\mathcal{I}(\hat{\theta})]^{-1}\]
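As a worked illustration of the delta method, the sketch below assumes a fully parametric model in which both populations are normal, so that \(AUC=\Phi(\delta)\) with \(\delta=(\mu_P-\mu_N)/\sqrt{\sigma_N^2+\sigma_P^2}\). Under that assumption the inverse Fisher information for \((\mu_N,\mu_P,\sigma_N^2,\sigma_P^2)\) is diagonal with entries \(\sigma_N^2/n_N\), \(\sigma_P^2/n_P\), \(2\sigma_N^4/n_N\), \(2\sigma_P^4/n_P\). The function name `auc_binormal_delta` and the simulated data are hypothetical.

```r
# Delta-method variance of AUC = pnorm(delta), assuming normal scores in both groups
auc_binormal_delta <- function(s_N, s_P) {
  n_N <- length(s_N); n_P <- length(s_P)
  mu_N <- mean(s_N);  mu_P <- mean(s_P)
  v_N <- mean((s_N - mu_N)^2)    # maximum likelihood variance estimates
  v_P <- mean((s_P - mu_P)^2)
  s2 <- v_N + v_P
  delta <- (mu_P - mu_N) / sqrt(s2)

  # Inverse Fisher information for (mu_N, mu_P, v_N, v_P)
  V <- diag(c(v_N / n_N, v_P / n_P, 2 * v_N^2 / n_N, 2 * v_P^2 / n_P))

  # Gradient of pnorm(delta) with respect to (mu_N, mu_P, v_N, v_P)
  grad <- dnorm(delta) *
    c(-1 / sqrt(s2), 1 / sqrt(s2), -delta / (2 * s2), -delta / (2 * s2))

  var_auc <- drop(t(grad) %*% V %*% grad)
  list(auc = pnorm(delta), var = var_auc, se = sqrt(var_auc))
}

set.seed(1)                      # arbitrary seed, only for reproducibility
res <- auc_binormal_delta(rnorm(100), rnorm(100, mean = 1))
res
```

For models without a closed-form information matrix, a numerically evaluated Hessian of the negative log-likelihood could play the role of \(\mathcal{I}(\hat{\theta})\) in the same calculation.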

For the nonparametric method, a variance estimator based on U-statistics¹ is available.

\[var(\widehat{AUC})=\dfrac{1}{n_N n_P}\left(AUC(1-AUC)+[n_P-1][Q_1-AUC^2]+[n_N-1][Q_2-AUC^2]\right)\]

where \(Q_1\) is the probability that the classification scores of two randomly chosen individuals from population P both exceed the score of a randomly chosen individual from population N, and \(Q_2\) is the probability that the classification score of one randomly chosen individual from population P exceeds the scores of both of two randomly chosen individuals from population N.
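The variance formula can be evaluated by replacing \(AUC\), \(Q_1\) and \(Q_2\) with sample proportions. The sketch below uses simple plug-in estimates, in which the two individuals from the same population are drawn with replacement. This is an assumption made for brevity rather than the only possible estimator, and the function name `auc_var_u` is hypothetical.

```r
# Plug-in version of the U-statistic variance formula for the empirical AUC
auc_var_u <- function(s_N, s_P) {
  n_N <- length(s_N); n_P <- length(s_P)
  gt <- outer(s_P, s_N, ">")     # gt[j, i] = I(s_Pj > s_Ni)
  auc <- mean(gt)

  # Q1: for each N score, squared proportion of P scores above it
  Q1 <- mean(colMeans(gt)^2)
  # Q2: for each P score, squared proportion of N scores below it
  Q2 <- mean(rowMeans(gt)^2)

  v <- (auc * (1 - auc) +
          (n_P - 1) * (Q1 - auc^2) +
          (n_N - 1) * (Q2 - auc^2)) / (n_N * n_P)
  list(auc = auc, Q1 = Q1, Q2 = Q2, var = v)
}

# Tiny worked example: three of the four (P, N) pairs are correctly ordered
out <- auc_var_u(s_N = c(1, 3), s_P = c(2, 4))
out
```

In this toy example \(\widehat{AUC}=0.75\) and \(\hat{Q}_1=\hat{Q}_2=0.625\), so the estimated variance is \((0.1875+0.0625+0.0625)/4=0.078125\).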

PAUC estimation

\[PAUC(f_1,f_2)=\int_{f_1}^{f_2} y(x)dx\]

Under the binormal model, with \(z_x=\Phi^{-1}(x)\),

\[\widehat{PAUC}(f_1,f_2)=\int_{f_1}^{f_2} \Phi(\hat{a}+\hat{b}z_x)dx\]
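The binormal integral has no simple closed form, but it is straightforward to evaluate numerically. The following is a minimal sketch using integrate(); the function name `pauc_binormal` and the parameter values are hypothetical and stand in for estimates \(\hat{a}\) and \(\hat{b}\).

```r
# Partial AUC under the binormal ROC curve y(x) = pnorm(a + b * qnorm(x))
pauc_binormal <- function(a, b, f1, f2) {
  roc <- function(x) pnorm(a + b * qnorm(x))
  integrate(roc, lower = f1, upper = f2)$value
}

# Hypothetical estimates a = 1, b = 1; area over false positive rates in [0, 0.2]
pauc_low <- pauc_binormal(a = 1, b = 1, f1 = 0, f2 = 0.2)
pauc_low
```

Setting \(f_1=0\) and \(f_2=1\) recovers the full binormal AUC, \(\Phi(a/\sqrt{1+b^2})\), which gives a quick check of the numerical integration.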

For the nonparametric method,

\[PAUC(f_1,f_2)=P(S_P>S_N, f_1\leq 1-F(S_N)\leq f_2)\]

\[\widehat{PAUC}(f_1,f_2)=\dfrac{1}{n_N n_P}\sum_{i=1}^{n_N}\sum_{j=1}^{n_P} I(S_{P_j}>S_{N_i})\, I\left(f_1\leq \dfrac{\sum_{k=1}^{n_N} I(S_{N_k}> S_{N_i})}{n_N}\leq f_2\right)\]
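The nonparametric estimator counts correctly ordered (P, N) pairs, but only for those N scores whose empirical false positive rate falls in \([f_1,f_2]\). The following is a minimal sketch, assuming no ties; the function name `pauc_empirical` is hypothetical.

```r
# Empirical partial AUC over false positive rates in [f1, f2]
pauc_empirical <- function(s_N, s_P, f1, f2) {
  # Empirical false positive rate at each N score: share of N scores above it
  fpr <- sapply(s_N, function(s) mean(s_N > s))
  keep <- fpr >= f1 & fpr <= f2

  gt <- outer(s_P, s_N, ">")     # gt[j, i] = I(s_Pj > s_Ni)
  sum(gt[, keep, drop = FALSE]) / (length(s_N) * length(s_P))
}

# Tiny worked example
pauc_full <- pauc_empirical(s_N = c(1, 3), s_P = c(2, 4), f1 = 0, f2 = 1)
pauc_zero <- pauc_empirical(s_N = c(1, 3), s_P = c(2, 4), f1 = 0, f2 = 0)
c(full = pauc_full, fpr_zero_only = pauc_zero)
```

With \(f_1=0\) and \(f_2=1\) every N score is retained and the estimate reduces to the empirical AUC (0.75 in this example). With \(f_1=f_2=0\) only the largest N score is retained, giving 0.25.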