ROC class notes

Le Kang

2024-02-28

Bias correction with unknown \(\theta\)

Consider the analogy between the measurement error model and the random-effects ANOVA model under normality.

Intraclass correlation coefficient (ICC)
Variance components
“MSTR” terms

\(S^{\prime}_{N_i}=S_{N_i}+\varepsilon_i\)
\(S^{\prime}_{P_j}=S_{P_j}+\eta_j\)

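# Simulate true scores for 100 subjects, each to be measured N_rep times
N_rep=5
ID=rep(1:100,each=N_rep)   # subject ID for every replicate
x=rnorm(100,sd=1)          # true (error-free) scores
sd(x)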

[1] 1.005558

# Add measurement error (SD 1) to each replicate, then average within subject
xx=rep(x,each=N_rep)+rnorm(100*N_rep,mean=0,sd=1)
x_obs=as.vector(tapply(xx,ID,mean))
sd(x_obs)

[1] 1.06514

# Same as above, with smaller measurement error (SD 0.5)
xx=rep(x,each=N_rep)+rnorm(100*N_rep,mean=0,sd=0.5)
x_obs=as.vector(tapply(xx,ID,mean))
sd(x_obs)

[1] 0.9994661
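The two simulations above can be compared with theory: if the true score has variance \(\sigma^2\) and each replicate carries independent error with variance \(\sigma^2_\varepsilon\), the per-subject mean of \(N_{rep}\) replicates has variance \(\sigma^2+\sigma^2_\varepsilon/N_{rep}\). The following is a minimal sketch of that calculation; the helper name `sd_of_means` is hypothetical and not part of the original notes. The simulated SDs above differ from these theoretical values because of sampling variability.

```r
# Hypothetical helper (an illustration, not from the notes): theoretical SD of
# the per-subject mean of n_rep replicates, when the true score has SD
# sigma_true and each replicate has independent error with SD sigma_err.
sd_of_means <- function(sigma_true, sigma_err, n_rep) {
  sqrt(sigma_true^2 + sigma_err^2 / n_rep)
}

sd_large_err <- sd_of_means(1, 1, 5)    # error SD 1, as in the first simulation
sd_small_err <- sd_of_means(1, 0.5, 5)  # error SD 0.5, as in the second
print(c(sd_large_err, sd_small_err))
```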

Some useful quantities

To estimate the \(\mu\)’s and \(\sigma\)’s, consider the following quantities:

\(\bar{S}^{\prime}_{N_i}\), \(\bar{S}^{\prime}_{P_j}\)

\(\bar{S}^{\prime}_{N}\), \(\bar{S}^{\prime}_{P}\)

\(SSTR_{N}, SSE_N\)

\(SSTR_{P}, SSE_P\)
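The quantities listed above can be computed for one population as in the sketch below. It assumes a balanced design (the same number of replicates per subject) and uses the standard method-of-moments estimators from one-way random-effects ANOVA; the simulated data, the seed, and all variable names are illustrative assumptions.

```r
set.seed(1)  # arbitrary seed, only to make this sketch reproducible
n_subj <- 100
n_rep <- 5
id <- rep(seq_len(n_subj), each = n_rep)
s_true <- rnorm(n_subj, mean = 0, sd = 1)                         # error-free scores
s_obs <- rep(s_true, each = n_rep) + rnorm(n_subj * n_rep, sd = 0.5)

subj_mean <- as.vector(tapply(s_obs, id, mean))  # per-subject means
grand_mean <- mean(s_obs)                        # overall mean

# Between-subject (treatment) and within-subject (error) sums of squares
sstr <- n_rep * sum((subj_mean - grand_mean)^2)
sse <- sum((s_obs - rep(subj_mean, each = n_rep))^2)
mstr <- sstr / (n_subj - 1)
mse <- sse / (n_subj * (n_rep - 1))

# Method-of-moments variance components and intraclass correlation
sigma2_err_hat <- mse
sigma2_true_hat <- (mstr - mse) / n_rep
icc_hat <- sigma2_true_hat / (sigma2_true_hat + sigma2_err_hat)
print(c(sigma2_true_hat, sigma2_err_hat, icc_hat))
```

Here `sigma2_true_hat` estimates the variance of the error-free scores, which is the quantity needed to correct the binormal ROC parameters for measurement error.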

ROC curves and covariates

Indirect adjustment: the effect of the covariates on the distributions of \(S\) is first modeled in the two populations, and the ROC curve is then derived from the modified distributions.
Direct adjustment: the effect of the covariates is modeled on the ROC curve itself.

Indirect adjustment

Define \(\boldsymbol{Z}_P\) and \(\boldsymbol{Z}_N\) as the covariate vectors for P and N. For given values of the covariates, \(E(S_P)\) and \(E(S_N)\) can be modeled as

\(\mu_P(\boldsymbol{Z}_P)=\alpha_P+\beta^T_P\boldsymbol{Z}_P\)

\(\mu_N(\boldsymbol{Z}_N)=\alpha_N+\beta^T_N\boldsymbol{Z}_N\)

In most practical applications many if not all of the covariates will be common to both populations, but there is no necessity for the two sets to be identical.

Under the normality assumption, this model is essentially the same as the one underlying the binormal model, the only difference being in the specification of the population means.

The ROC curve is given by

\[y=\Phi\left(\frac{\mu_P(\boldsymbol{Z}_P)-\mu_N(\boldsymbol{Z}_N)+\sigma_N \times \Phi^{-1}(x)}{\sigma_P}\right),\] \(0\leq x \leq 1.\)

Ordinary least-squares regression can be used to obtain point estimates of the parameters \(\alpha_i, \beta_i, \sigma_i, i=P, N\). Substituting these estimates into the formula at given values of \(\boldsymbol{z}_P\) and \(\boldsymbol{z}_N\) yields the covariate-specific ROC curves.
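A minimal sketch of indirect adjustment with a single covariate in each population follows. The simulated data, the coefficient values, and the function name `roc_indirect` are illustrative assumptions; the residual standard error from `lm` is used as the estimate of each \(\sigma\).

```r
set.seed(2)  # arbitrary seed for this sketch
n <- 200
z_n <- rnorm(n)
z_p <- rnorm(n)
s_n <- 0 + 0.5 * z_n + rnorm(n, sd = 1)    # scores in population N
s_p <- 1 + 1.0 * z_p + rnorm(n, sd = 1.5)  # scores in population P

# Separate OLS fits give alpha, beta, and sigma for each population
fit_n <- lm(s_n ~ z_n)
fit_p <- lm(s_p ~ z_p)
sigma_n <- summary(fit_n)$sigma
sigma_p <- summary(fit_p)$sigma

# Covariate-specific binormal ROC curve at covariate values zp and zn
roc_indirect <- function(x, zp, zn) {
  mu_p <- sum(coef(fit_p) * c(1, zp))
  mu_n <- sum(coef(fit_n) * c(1, zn))
  pnorm((mu_p - mu_n + sigma_n * qnorm(x)) / sigma_p)
}

x_grid <- seq(0.01, 0.99, by = 0.01)
y_grid <- roc_indirect(x_grid, zp = 1, zn = 1)
```

Calling `roc_indirect` with different `zp` and `zn` values shows how the curve shifts with the covariates.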

Direct adjustment

Previously, we separately modeled the effects of the covariates on the two classification score distributions and then derived the induced covariate-specific ROC curve from the modified distributions.

Alternatively, we could model the effects of the covariates directly on the ROC curve itself. With this approach, any parameters associated with the covariates have a direct interpretation in terms of the curve.

A natural choice is the use of generalized linear model methodology for direct modeling of ROC curves.

ROC-GLM model

Note that both the domain and the range of the ROC curve are the interval (0,1), and that the curve is monotonically increasing over this interval.

\[h(y)=b(x)+\beta^T\boldsymbol{Z}\]

\(b(\cdot)\) is an unknown baseline function, monotonic on (0, 1). \(h(\cdot)\) is the link function, specified as part of the model and also monotonic on (0, 1), such as the inverse normal CDF (probit), the logit, or the logarithm.

\(\boldsymbol{Z}\) is as usual the vector of covariates.

Covariate adjustment of ROC curve

Define the placement value \[U=1-F(S),\] where \(F\) is the distribution function of the scores in population N, used as the reference distribution. Then \(U_N\) is uniform on (0,1), and the distribution of \(U_P\) quantifies the separation between the two populations.

For well-separated populations the \(S_P\) scores are all higher than the \(S_N\) scores, so the \(U_P\) values will all be very small. As the populations overlap more, the \(U_P\) and \(U_N\) values intermingle to a greater and greater extent.

Recall that \(ROC(t)=1-G[F^{-1}(1-t)], 0\leq t\leq 1\),

\[P(U_P\leq t)=P(1-F(S_P)\leq t)=P(S_P\geq F^{-1}(1-t))=1-G[F^{-1}(1-t)],\]

i.e., \[ROC(t)=P(U_P\leq t).\]

Consider \(U_{it}=I(S^N_{P_i}\leq t)\), where \(S^N_{P_i}\) is the placement value of \(S_{P_i}\). Then \(E(U_{it})=ROC(t)\).
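The identity \(ROC(t)=P(U_P\leq t)\) can be illustrated with empirical placement values. In the sketch below, the empirical distribution function of the N scores stands in for \(F\); the simulated data and the names `u_p` and `roc_hat` are illustrative assumptions.

```r
set.seed(3)  # arbitrary seed for this sketch
s_n <- rnorm(300, mean = 0, sd = 1)    # scores in population N
s_p <- rnorm(200, mean = 1.5, sd = 1)  # scores in population P

F_hat <- ecdf(s_n)       # empirical distribution function of the N scores
u_p <- 1 - F_hat(s_p)    # placement values: fraction of N scores above each P score

# Empirical ROC curve at false positive fraction t: proportion of U_P <= t
roc_hat <- function(t) mean(u_p <= t)

print(sapply(c(0.05, 0.10, 0.20, 0.50), roc_hat))
```

With continuous scores, one minus the mean placement value equals the proportion of (P, N) pairs in which the P score exceeds the N score, which is the empirical AUC.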

Baseline function \(b(\cdot)\)

A popular choice is \(b(\cdot)=\sum_k \alpha_k b_k(\cdot)\) for some specified functions \(b_k\).

Example: with \(h(\cdot)=\Phi^{-1}(\cdot)\), \(b_1(\cdot)=1\), and \(b_2(\cdot)=\Phi^{-1}(\cdot)\), what is this model?

Estimation process without covariates

Choose a set of values of \(t\) over which the model is to be fitted.
For each \(t\), obtain \(u_{it}=I(s^N_{P_i}\leq t)\) for \(i=1, 2, \ldots,n_P\).
Binary regression with link function \(h(\cdot)\) and covariates \(b_k(t)\) will provide estimates of the model parameters \(\alpha_k\).

Note: \(t\) values are false positive fractions for the purpose of ROC analysis, so we could choose up to \(n_N-1\) values in general.
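The three steps above can be sketched as follows, assuming a probit link with \(b_1(t)=1\) and \(b_2(t)=\Phi^{-1}(t)\), empirical placement values, and an arbitrary grid of \(t\) values. The simulated data and variable names are illustrative. Because each subject contributes one indicator per \(t\), the rows are correlated, so the standard errors reported by `glm` should not be taken at face value; only the point estimates are used here.

```r
set.seed(4)  # arbitrary seed for this sketch
s_n <- rnorm(300, mean = 0, sd = 1)
s_p <- rnorm(200, mean = 1, sd = 1)
u_p <- 1 - ecdf(s_n)(s_p)            # empirical placement values of the P scores

# Step 1: choose the set of t values (false positive fractions)
t_grid <- seq(0.05, 0.95, by = 0.05)

# Step 2: one indicator u_it = I(placement value <= t) per subject i and value t
dat <- expand.grid(i = seq_along(u_p), t = t_grid)
dat$u <- as.integer(u_p[dat$i] <= dat$t)
dat$b2 <- qnorm(dat$t)               # covariate b_2(t); b_1(t) = 1 is the intercept

# Step 3: binary regression with the probit link
fit <- glm(u ~ b2, family = binomial(link = "probit"), data = dat)
print(coef(fit))
```

In this simulation the scores are normal with equal variances and means one unit apart, so both coefficients should be close to 1.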

Estimation process with covariates

The placement value of \(S\) is defined by \[U=1-F(S|\boldsymbol{Z})\]

With covariate \(\boldsymbol{Z}\), \[ROC(t|\boldsymbol{Z})=P(U_P\leq t|\boldsymbol{Z})\]

Assume \(U_{it}=I(S^N_{P_i}\leq t|\boldsymbol{Z})\) continues to follow a GLM.

Previously, the placement values could be obtained directly using empirical proportions of the sample data.

Now the dependence of the placement values on the values of the covariates \(\boldsymbol{Z}\) means that placement values have to be estimated by first estimating the conditional distribution function \(F\) of scores given \(\boldsymbol{Z}=\boldsymbol{z}\) in population N.
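One way to carry this out is sketched below. It assumes, purely for illustration, that the conditional distribution of the N scores given \(\boldsymbol{Z}\) is normal with a linear mean and constant variance, so that \(F(\cdot|\boldsymbol{z})\) can be estimated from an `lm` fit; other estimators of the conditional distribution could be used instead. The simulated data, the binary covariate, and all variable names are assumptions.

```r
set.seed(5)  # arbitrary seed for this sketch
n_n <- 300
n_p <- 200
z_n <- rbinom(n_n, 1, 0.5)            # binary covariate in population N
z_p <- rbinom(n_p, 1, 0.5)            # same covariate in population P
s_n <- 0.5 * z_n + rnorm(n_n)
s_p <- 1 + 1.5 * z_p + rnorm(n_p)

# Estimate F(s | z) in population N with a normal linear model (an assumption)
fit_n <- lm(s_n ~ z_n)
mu_n_at_p <- predict(fit_n, newdata = data.frame(z_n = z_p))
u_p <- 1 - pnorm((s_p - mu_n_at_p) / summary(fit_n)$sigma)  # conditional placement values

# ROC-GLM with probit link, baseline terms 1 and qnorm(t), and covariate z
t_grid <- seq(0.05, 0.95, by = 0.05)
dat <- expand.grid(i = seq_len(n_p), t = t_grid)
dat$u <- as.integer(u_p[dat$i] <= dat$t)
dat$b2 <- qnorm(dat$t)
dat$z <- z_p[dat$i]
fit <- glm(u ~ b2 + z, family = binomial(link = "probit"), data = dat)
print(coef(fit))
```

The coefficient of `z` describes how the covariate shifts the ROC curve on the probit scale; a positive value means better discrimination when `z = 1`.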

Covariate adjustment of AUC

For indirect adjustment, plugging in the adjusted means of P and N gives \(AUC(\boldsymbol{z}_P,\boldsymbol{z}_N)\), where

\(\hat{\mu}_P(\boldsymbol{z}_P)=\hat{\alpha}_P+\hat{\beta}^T_P\boldsymbol{z}_P\)

\(\hat{\mu}_N(\boldsymbol{z}_N)=\hat{\alpha}_N+\hat{\beta}^T_N\boldsymbol{z}_N\)
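Under the binormal model, the AUC has the closed form \(\Phi\left((\mu_P-\mu_N)/\sqrt{\sigma^2_P+\sigma^2_N}\right)\), so the adjusted means above can be plugged in directly. The sketch below uses a hypothetical helper `auc_binormal` and made-up parameter values for illustration only.

```r
# Hypothetical helper: binormal AUC from (covariate-adjusted) means and SDs
auc_binormal <- function(mu_p, mu_n, sigma_p, sigma_n) {
  pnorm((mu_p - mu_n) / sqrt(sigma_p^2 + sigma_n^2))
}

# Illustrative values only: adjusted means at some z_P and z_N, with unit SDs
auc_example <- auc_binormal(mu_p = 1, mu_n = 0, sigma_p = 1, sigma_n = 1)
print(auc_example)
```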

For direct adjustment, we consider regression modeling of AUC.

\[E(h[\widehat{AUC}])=\alpha+\boldsymbol{\beta}^T\boldsymbol{z}\] Here, we consider \(U_{ij}=I(S_{P_j}>S_{N_i})\). What about \(E(U_{ij}|\boldsymbol{Z})\)?

Is there any issue with the data type of \(\boldsymbol{Z}\)?
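Without covariates, the average of the pairwise indicators \(U_{ij}\) is the empirical AUC, which coincides with the Mann-Whitney statistic divided by the number of pairs. The sketch below checks this on simulated continuous scores (so there are no ties); the data and variable names are illustrative assumptions.

```r
set.seed(6)  # arbitrary seed for this sketch
s_n <- rnorm(150, mean = 0, sd = 1)
s_p <- rnorm(100, mean = 1, sd = 1)

# U_ij = I(S_Pj > S_Ni) for every (P, N) pair; rows index P, columns index N
u <- outer(s_p, s_n, ">")
auc_hat <- mean(u)

# Mann-Whitney statistic from wilcox.test: number of pairs with s_p > s_n
w <- unname(wilcox.test(s_p, s_n)$statistic)
print(c(auc_hat, w / (length(s_p) * length(s_n))))
```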