ROC class notes

Le Kang

2024-02-26

Bias correction

Assume \(\sigma_P^2=\sigma_N^2=\sigma^2\) and \(\sigma_{\varepsilon}^2=\sigma_{\eta}^2=\sigma_E^2\). Then

\(E(\widehat{AUC}^{\prime})\approx AUC-\frac{1}{2}E[(\varepsilon-\eta)^2]E[G^{\prime\prime}(S_N)]\)

What is the variance for \(S^{\prime}_{N_i}\) or \(S^{\prime}_{P_j}\)?

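\[AUC^{\prime}=\Phi\left(\dfrac{\mu_P-\mu_N}{\sqrt{2\sigma^2(1+\theta^2)}}\right),~~~AUC=\Phi\left(\dfrac{\mu_P-\mu_N}{\sqrt{2\sigma^2}}\right),\] where \(\theta=\sigma_E/\sigma\), so that the observed scores have variance \(\sigma^2+\sigma_E^2=\sigma^2(1+\theta^2)\) when the errors are independent of the true scores.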
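The two expressions imply \(\Phi^{-1}(AUC)=\sqrt{1+\theta^2}\,\Phi^{-1}(AUC^{\prime})\), so a known \(\theta\) gives a direct correction. A minimal R sketch of that relationship follows; the function names `auc_observed` and `auc_corrected` and the parameter values are illustrative assumptions, not part of the notes.

```r
# Attenuated AUC under measurement error (function names are hypothetical)
auc_observed <- function(mu_P, mu_N, sigma, theta) {
  pnorm((mu_P - mu_N) / sqrt(2 * sigma^2 * (1 + theta^2)))
}

# Undo the attenuation when theta = sigma_E / sigma is known
auc_corrected <- function(auc_obs, theta) {
  pnorm(qnorm(auc_obs) * sqrt(1 + theta^2))
}

# Illustrative values: mu_P - mu_N = 1, sigma = 1, theta = 0.5
auc_true <- pnorm((1 - 0) / sqrt(2 * 1^2))             # error-free AUC
auc_obs  <- auc_observed(1, 0, sigma = 1, theta = 0.5)  # AUC of the noisy scores
auc_fix  <- auc_corrected(auc_obs, theta = 0.5)         # recovers auc_true
```

The correction only rescales the probit of the observed AUC, so it needs no knowledge of \(\mu_P\), \(\mu_N\), or \(\sigma\) individually.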

Bias correction with unknown \(\theta\)

Consider the analogy between the measurement error model and the random-effects ANOVA model, assuming normality.

Intraclass correlation coefficient (ICC)
Variance components
“MSTR” terms

\(S^{\prime}_{N_i}=S_{N_i}+\varepsilon_i\)
\(S^{\prime}_{P_j}=S_{P_j}+\eta_j\)

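# True scores for 100 subjects
x=rnorm(100,sd=1)

# Each subject is measured N_rep times
N_rep=5
ID=rep(1:100,each=N_rep)

# Add measurement error (sd=0.5) to every replicate, then average within subject
xx=rep(x,each=N_rep)+rnorm(100*N_rep,mean=0,sd=0.5)
x_obs=as.vector(tapply(xx,ID,mean))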

sd(x)

[1] 0.9106813

sd(x_obs)

[1] 0.9124346

# Repeat with a smaller measurement error (sd=0.25)
xx=rep(x,each=N_rep)+rnorm(100*N_rep,mean=0,sd=0.25)
x_obs=as.vector(tapply(xx,ID,mean))

sd(x_obs)

[1] 0.9219266
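When \(\theta\) is unknown, the random-effects ANOVA analogy suggests estimating the variance components from replicated measurements. The sketch below is one way this could be done for a balanced design, using the usual moment estimators based on the mean squares; the seed, sample sizes, and variable names are assumptions chosen for illustration.

```r
set.seed(2024)                      # arbitrary seed, for reproducibility only
n_sub <- 500                        # number of subjects (illustrative)
n_rep <- 5                          # replicates per subject
id <- rep(seq_len(n_sub), each = n_rep)

s_true <- rnorm(n_sub, sd = 1)                                        # sigma = 1
s_obs  <- rep(s_true, each = n_rep) + rnorm(n_sub * n_rep, sd = 0.5)  # sigma_E = 0.5

# Mean squares of the balanced one-way layout
grp_mean <- tapply(s_obs, id, mean)
mstr <- n_rep * var(grp_mean)            # between-subject mean square (MSTR)
mse  <- mean(tapply(s_obs, id, var))     # within-subject mean square (MSE)

# Moment estimators: E(MSE) = sigma_E^2, E(MSTR) = sigma_E^2 + n_rep * sigma^2
sigma2_E_hat <- mse
sigma2_hat   <- (mstr - mse) / n_rep
icc_hat      <- sigma2_hat / (sigma2_hat + sigma2_E_hat)   # intraclass correlation
theta_hat    <- sqrt(sigma2_E_hat / sigma2_hat)            # estimate of sigma_E / sigma
```

Here `theta_hat` refers to a single measurement. If the working score is the mean of the `n_rep` replicates, as in the simulation above, its error variance is \(\sigma_E^2/n_{rep}\), so the relevant \(\theta\) is smaller by a factor of \(\sqrt{n_{rep}}\).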

ROC curves and covariates

Once a classifier \(S(\boldsymbol{X})\) has been constructed from the vector \(\boldsymbol{X}\) of primary variables and is in use for allocating individuals to one or other of the populations N and P, it frequently transpires that a further variable or set of variables will provide useful classificatory information which will modify the behavior of the classifier in one or both of the populations.

Indirect adjustment: the effect of the covariates on the distributions of \(S\) is first modeled in the two populations, and the ROC curve is then derived from the modified distributions.
Direct adjustment: the effect of the covariates is modeled on the ROC curve itself.

Indirect adjustment

Define \(\boldsymbol{Z}_P\) and \(\boldsymbol{Z}_N\) as the covariates for P and N. The means of \(S_P\) and \(S_N\) for given values of the covariates can then be modeled as

\(\mu_P(\boldsymbol{Z}_P)=\alpha_P+\beta^T_P\boldsymbol{Z}_P\)

\(\mu_N(\boldsymbol{Z}_N)=\alpha_N+\beta^T_N\boldsymbol{Z}_N\)

In most practical applications many if not all of the covariates will be common to both populations, but there is no necessity for the two sets to be identical.

Under the normality assumption, this model is essentially the same as the one underlying the binormal model, the only difference being in the specification of the population means.

The covariate-specific ROC curve is given by

\[y=\Phi\left(\frac{\mu_P(\boldsymbol{Z}_P)-\mu_N(\boldsymbol{Z}_N)+\sigma_N \times \Phi^{-1}(x)}{\sigma_P}\right),\] \(0\leq x \leq 1.\)

Ordinary least-squares regression can be used to obtain point estimates for parameters \(\alpha_i, \beta_i, \sigma_i, i=P, N\) and substitution of these estimates into the formula at given values of \(\boldsymbol{z}_P\) and \(\boldsymbol{z}_N\) will yield the covariate-specific ROC curves.
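As a rough illustration of the indirect approach, the sketch below fits separate least-squares regressions in each population and plugs the estimates into the ROC formula. The simulated data, the single covariate, and the helper name `roc_at` are assumptions made for the example.

```r
set.seed(1)                               # arbitrary seed
n <- 200
z_N <- runif(n)                           # covariate in population N
z_P <- runif(n)                           # covariate in population P
s_N <- 0 + 0.5 * z_N + rnorm(n, sd = 1)   # simulated scores, population N
s_P <- 1 + 1.0 * z_P + rnorm(n, sd = 1.2) # simulated scores, population P

# Separate OLS fits give alpha, beta, and sigma for each population
fit_N <- lm(s_N ~ z_N)
fit_P <- lm(s_P ~ z_P)
sigma_N <- summary(fit_N)$sigma           # residual standard error, population N
sigma_P <- summary(fit_P)$sigma           # residual standard error, population P

# Covariate-specific ROC curve at covariate values zP and zN (hypothetical helper)
roc_at <- function(x, zP, zN) {
  mu_P <- sum(coef(fit_P) * c(1, zP))
  mu_N <- sum(coef(fit_N) * c(1, zN))
  pnorm((mu_P - mu_N + sigma_N * qnorm(x)) / sigma_P)
}

x <- seq(0.01, 0.99, by = 0.01)           # false positive rates
y <- roc_at(x, zP = 0.5, zN = 0.5)        # true positive rates at z = 0.5
```

Calling `plot(x, y, type = "l")` would draw the curve for that covariate value; repeating with other values of `zP` and `zN` shows how the curve shifts.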

Direct adjustment

Previously, we separately modeled the effects of the covariates on the two classification score distributions and then derived the induced covariate-specific ROC curve from the modified distributions.

Alternatively, we could model the effects of the covariates directly on the ROC curve itself. With this approach, any parameters associated with the covariates have a direct interpretation in terms of the curve.

A natural choice is the use of generalized linear model methodology for direct modeling of ROC curves.

ROC-GLM model

Note that both the domain and the range of the ROC curve are the interval (0,1), and that the curve is monotonically increasing over this interval.

\[h(y)=b(x)+\beta^T\boldsymbol{Z}\]

\(b(\cdot)\) is an unknown baseline function monotonic on (0, 1) and \(h(\cdot)\) is the link function, specified as part of the model and also monotonic on (0, 1), such as inverse normal CDF, logit, or logarithmic.

\(\boldsymbol{Z}\) is as usual the vector of covariates.
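To make the model concrete, the sketch below evaluates a ROC-GLM with the inverse normal CDF as link. The baseline \(b(x)=a_0+a_1\Phi^{-1}(x)\) is an assumed parametric (binormal) form, since the notes leave \(b(\cdot)\) unspecified, and the values of `a0`, `a1`, and `beta` are illustrative rather than estimated.

```r
# Probit-link ROC-GLM with an assumed baseline b(x) = a0 + a1 * qnorm(x).
# a0, a1 and beta are illustrative values, not estimates.
roc_glm <- function(x, z, a0 = 1, a1 = 0.8, beta = 0.5) {
  pnorm(a0 + a1 * qnorm(x) + beta * z)   # y = h^{-1}(b(x) + beta * z)
}

x <- seq(0.01, 0.99, by = 0.01)          # false positive rates
roc_z0 <- roc_glm(x, z = 0)              # curve at covariate value z = 0
roc_z1 <- roc_glm(x, z = 1)              # curve at covariate value z = 1
```

With this link, a positive `beta` raises the curve at every \(x\), which is the direct interpretation of the covariate effect on the ROC curve.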

Covariate adjustment of ROC curve

Define the placement value \[U=1-F(S),\] where \(F\) is the CDF of the score in the reference population. Using N as the reference distribution, \(U_N\) is uniform on (0,1), and the distribution of \(U_P\) quantifies the separation between the two populations.

For well separated populations the \(S_P\) scores are all higher than the \(S_N\) scores, so that the \(U_P\) values will all be very small, but as the populations progressively overlap then the \(U_P\) and \(U_N\) values will intermingle to a greater and greater extent.

\[ROC(t)=P(U_P\leq t)\] which is just the CDF of \(U_P\).

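Consider \(U_{it}=I(S^N_{P_i}\leq t)\), where \(S^N_{P_i}\) denotes the placement value of the \(i\)th P score computed with N as the reference. Then \(E(U_{it})=ROC(t)\).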

With covariates \(\boldsymbol{Z}\), the covariate-specific ROC curve is \[ROC(t|\boldsymbol{Z})=P(U_P\leq t|\boldsymbol{Z}).\]
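The placement-value view can be checked empirically without covariates. The sketch below uses simulated normal scores (an assumption for illustration) and the empirical CDF of N as the reference distribution; `roc_hat` is a hypothetical helper name.

```r
set.seed(1)                         # arbitrary seed
s_N <- rnorm(300, mean = 0)         # simulated scores, population N
s_P <- rnorm(300, mean = 1.5)       # simulated scores, population P

F_N <- ecdf(s_N)                    # empirical CDF of the reference population
u_P <- 1 - F_N(s_P)                 # placement values of the P scores

# Empirical ROC(t): proportion of placement values not exceeding t
roc_hat <- function(t) sapply(t, function(tt) mean(u_P <= tt))

t_grid   <- seq(0, 1, by = 0.05)
roc_vals <- roc_hat(t_grid)

# For continuous scores E(U_P) = 1 - AUC, so this estimates the AUC
auc_hat <- 1 - mean(u_P)
```

Because `roc_hat` is just the empirical CDF of `u_P`, regressing the indicators \(I(U_P\leq t)\) on covariates is what connects placement values to the ROC-GLM.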

Covariate adjustment of AUC

For indirect adjustment, plugging the covariate-adjusted means of P and N into the AUC formula gives

\[AUC(\boldsymbol{z}_P,\boldsymbol{z}_N)\]
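Under the normality assumption of the indirect adjustment model, the standard binormal AUC expression with the covariate-dependent means substituted would read as follows. This is a sketch that assumes independent scores with covariate-free standard deviations \(\sigma_P\) and \(\sigma_N\):

\[AUC(\boldsymbol{z}_P,\boldsymbol{z}_N)=\Phi\left(\frac{\mu_P(\boldsymbol{z}_P)-\mu_N(\boldsymbol{z}_N)}{\sqrt{\sigma_P^2+\sigma_N^2}}\right)\]

When \(\sigma_P=\sigma_N=\sigma\), this reduces to the form \(\Phi\left((\mu_P-\mu_N)/\sqrt{2\sigma^2}\right)\) used in the bias correction section.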

For direct adjustment, we consider regression modeling of AUC.