ROC class notes

Le Kang

2024-03-27

Comparing entire ROC curves

Parametric approach
Nonparametric approach

The binormal model

To compare two binormal ROC curves with parameters \((a_1,b_1)\) and \((a_2,b_2)\), we test

\(H_0: a_{12}=a_1-a_2=0 ~\text{and}~b_{12}=b_1-b_2=0\)

\(H_1: a_{12}=a_1-a_2\neq 0 ~\text{or}~b_{12}=b_1-b_2\neq 0\)

Recall that for a single ROC curve, the test statistic for \(H_0: a=a_0 ~\text{and}~ b=b_0\) is \[\chi^2_{(2)}=(\hat{a}-a_0,\hat{b}-b_0)[\boldsymbol{S}]^{-1}\begin{pmatrix} \hat{a}-a_0 \\ \hat{b}-b_0 \end{pmatrix},\] where \(\boldsymbol{S}\) is the covariance matrix of \(\hat{a}\) and \(\hat{b}\).

Typically, \(a_0=0, b_0=1\), the values at which the binormal ROC curve coincides with the chance diagonal.

Two correlated sets of \(a\)’s and \(b\)’s

\[\chi^2_{(2)}=(\hat{a}_1-\hat{a}_2,\hat{b}_1-\hat{b}_2)[\boldsymbol{S}]^{-1}\begin{pmatrix} \hat{a}_1-\hat{a}_2 \\ \hat{b}_1-\hat{b}_2 \end{pmatrix},\] where \(\hat{a}_{12}=\hat{a}_1-\hat{a}_2\), \(\hat{b}_{12}=\hat{b}_1-\hat{b}_2\), and \(\boldsymbol{S}\) is now the covariance matrix of \(\hat{a}_{12}\) and \(\hat{b}_{12}\), i.e.,

\[\chi^2_{(2)}=\dfrac{\hat{a}^2_{12}\text{var}(\hat{b}_{12})+\hat{b}^2_{12}\text{var}(\hat{a}_{12})-2\hat{a}_{12}\hat{b}_{12}\text{cov}(\hat{a}_{12},\hat{b}_{12})}{\text{var}(\hat{a}_{12})\text{var}(\hat{b}_{12})-\text{cov}^2(\hat{a}_{12},\hat{b}_{12})}\]
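As a rough illustration, the expanded statistic above can be evaluated directly once the differences and their variances and covariance are available. The sketch below is not tied to any particular fitting routine: the function name `binormal_difference_test` and the input values are hypothetical, and the p-value uses the fact that the upper tail of a chi-square distribution with 2 degrees of freedom is \(\exp(-x/2)\).

```python
import math


def binormal_difference_test(a12, b12, var_a12, var_b12, cov_a12_b12):
    """Chi-square statistic (2 df) for H0: a12 = 0 and b12 = 0.

    Inputs are the estimated differences a12 = a1 - a2, b12 = b1 - b2,
    their variances, and their covariance (all assumed already estimated).
    """
    det = var_a12 * var_b12 - cov_a12_b12 ** 2
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    stat = (a12 ** 2 * var_b12 + b12 ** 2 * var_a12
            - 2 * a12 * b12 * cov_a12_b12) / det
    # Upper tail of a chi-square distribution with 2 df is exp(-x / 2).
    p_value = math.exp(-stat / 2)
    return stat, p_value


# Made-up numbers, for illustration only.
stat, p_value = binormal_difference_test(0.5, 0.1, 0.04, 0.01, 0.005)
print(stat, p_value)
```

When the two curves come from the same subjects, the inputs must reflect the correlation, for example \(\text{var}(\hat{a}_{12})=\text{var}(\hat{a}_1)+\text{var}(\hat{a}_2)-2\,\text{cov}(\hat{a}_1,\hat{a}_2)\).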

Nonparametric approach

Once again, assume \(f_i\) and \(g_i\) are the PDFs, and \(F_i\) and \(G_i\) are the CDFs of class N and P scores respectively, for classifier \(i\). Let \(x_{i\pi}\) be the \(\pi\)th quantile for classifier \(i\), so that

\[p(N)F_1(x_{1\pi})+[1-p(N)]G_1(x_{1\pi})=\pi\] \[p(N)F_2(x_{2\pi})+[1-p(N)]G_2(x_{2\pi})=\pi\]

The ROC curves are identical if and only if the misclassification rates \(e_1(\pi)\) and \(e_2(\pi)\) of the two classifiers, evaluated at the thresholds \(x_{1\pi}\) and \(x_{2\pi}\), are the same for all \(\pi\).

Identity of ROC curves

We test whether the integrated unsigned difference between the misclassification rates is zero,

\[\int |e_1 ({\pi})-e_2 ({\pi}) | d\pi=0,\] using the sample estimate \(\int |\hat{e}_1 ({\pi})-\hat{e}_2 ({\pi}) | d\pi\) as the test statistic.
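The notes do not spell out how \(\hat{e}_i(\pi)\) is computed, so the following is only a sketch under stated assumptions: both classifiers score the same labeled subjects, a subject is classified as N when its score is among the \(k\) lowest (so \(\pi = k/n\)), and the integral is approximated by a Riemann sum over \(k=1,\dots,n-1\). The function names are hypothetical. The code computes the statistic only; obtaining its null distribution (for example by a permutation scheme) is not shown.

```python
def error_rate_at_rank(scores, labels, k):
    """Misclassification rate when the k lowest scores are called N.

    Assumption: low scores indicate class N, high scores class P.
    """
    order = sorted(range(len(scores)), key=lambda j: scores[j])
    errors = 0
    for rank, j in enumerate(order):
        predicted = "N" if rank < k else "P"
        if predicted != labels[j]:
            errors += 1
    return errors / len(scores)


def integrated_abs_difference(scores1, scores2, labels):
    """Riemann-sum estimate of the integral of |e1(pi) - e2(pi)| over pi."""
    n = len(labels)
    total = 0.0
    for k in range(1, n):
        e1 = error_rate_at_rank(scores1, labels, k)
        e2 = error_rate_at_rank(scores2, labels, k)
        total += abs(e1 - e2)
    return total / n  # grid step is 1 / n


# Toy data: classifier 1 separates the classes, classifier 2 swaps one pair.
labels = ["N", "N", "P", "P"]
print(integrated_abs_difference([1, 2, 3, 4], [1, 3, 2, 4], labels))
```

Because only the ranks of the scores enter the calculation, the statistic is unchanged by monotone transformations of either classifier's scores.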

Regression approaches

Indirect approach
Direct approach

ROC-GLM model

We model the true positive rate \(y\) in terms of the false positive rate \(x\) by a generalized linear model \[h(y)=b(x)+\boldsymbol{\beta}^T \boldsymbol{Z},\] where \(h(\cdot)\) is the link function, \(b(\cdot)\) is a baseline model, both being monotonic on (0,1), and \(\boldsymbol{Z}\) is a vector of covariates.

To compare ROC curves, let \(\boldsymbol{Z}\) contain dummy variables indicating the classifier and test \(H_0: \boldsymbol{\beta}=\boldsymbol{0}\).
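One concrete instance of the model above, given here purely as an illustration, takes a probit link \(h=\Phi^{-1}\) and baseline \(b(x)=a+b\,\Phi^{-1}(x)\), which reduces to the binormal curve when \(\beta=0\). The sketch below assumes a single dummy covariate \(z\) (0 for classifier 1, 1 for classifier 2) that shifts the intercept only; the function name `roc_glm_probit` and the parameter values are hypothetical, and no fitting is performed.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def roc_glm_probit(x, a, b, beta, z):
    """True positive rate at false positive rate x, for 0 < x < 1.

    Model: Phi^{-1}(y) = a + b * Phi^{-1}(x) + beta * z,
    with z a 0/1 dummy variable identifying the classifier.
    """
    linear = a + b * _STD_NORMAL.inv_cdf(x) + beta * z
    return _STD_NORMAL.cdf(linear)


# Illustrative parameter values only.
for fpr in (0.1, 0.3, 0.5):
    tpr_1 = roc_glm_probit(fpr, a=1.0, b=0.8, beta=0.4, z=0)
    tpr_2 = roc_glm_probit(fpr, a=1.0, b=0.8, beta=0.4, z=1)
    print(fpr, round(tpr_1, 3), round(tpr_2, 3))
```

With \(\beta=0\) the two curves coincide at every \(x\), which is what \(H_0\) asserts. Allowing the slope to differ as well would need an additional term, such as an interaction between \(z\) and \(\Phi^{-1}(x)\).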

Uncertain or unknown gold standard

An essential requirement for conducting any of the ROC analyses described earlier is that the classification score \(S\) has been obtained for samples of individuals, each of which has been labeled, and labeled correctly, as either N or P.

Very often, to achieve correct labeling, the sample members may have to be:

tested to destruction (if they are inanimate, e.g., materials subjected to pressure until they break),
subjected to some invasive procedure (if they are animate, e.g., biopsy for breast tumor detection), or
examined only in situations that are impossible in practice, such as post mortem.

Methods of ROC analysis should therefore allow for the possibility of mislabeled samples and be able to cope with missing labels.

General framework

In addition to the classification score \(S\), we denote the “group label” variable as \(L\).

Deterministic: \(L\) = P or N without error
Probabilistic: \(L\) is a binary random variable with some specified (but unknown) probabilities of assigning values P, N respectively to each sample member.

Assume that the “true” group labels are given by the latent (i.e., unobservable) binary variable \(Z\).

\[L|Z \sim \text{Bernoulli}(\pi)\]

where \(\log\left(\dfrac{\pi}{1-\pi}\right)=\beta_0\) when \(Z=\) N, and \(\log\left(\dfrac{\pi}{1-\pi}\right)=\beta_0+\beta_1\) when \(Z=\) P.

In addition, \(Z \sim \text{Bernoulli}(\zeta)\) where \(\zeta \sim \text{Beta}(a,b)\).
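To make the hierarchy concrete, the sketch below simulates from it: one draw of \(\zeta\) from the Beta prior, then a true label \(Z\) and an observed label \(L\) for each subject. Several choices here are assumptions rather than part of the notes: classes are coded P = 1 and N = 0, \(\pi\) is read as the probability that \(L\) = P, and the function name `simulate_labels` and all parameter values are hypothetical.

```python
import math
import random


def simulate_labels(n, shape_a, shape_b, beta0, beta1, seed=0):
    """Simulate true labels Z and observed labels L (coding: P = 1, N = 0)."""
    rng = random.Random(seed)
    zeta = rng.betavariate(shape_a, shape_b)  # zeta ~ Beta(a, b)
    # Inverse logit gives P(L = P | Z) for each value of Z.
    pi_given_n = 1 / (1 + math.exp(-beta0))
    pi_given_p = 1 / (1 + math.exp(-(beta0 + beta1)))
    true_z, observed_l = [], []
    for _ in range(n):
        z = 1 if rng.random() < zeta else 0    # Z ~ Bernoulli(zeta)
        pi = pi_given_p if z == 1 else pi_given_n
        label = 1 if rng.random() < pi else 0  # L | Z ~ Bernoulli(pi)
        true_z.append(z)
        observed_l.append(label)
    return zeta, true_z, observed_l


# Illustrative values: labels are wrong with probability about 0.12.
zeta, z, l = simulate_labels(1000, 2, 2, beta0=-2.0, beta1=4.0, seed=1)
mislabeled = sum(zi != li for zi, li in zip(z, l))
print(round(zeta, 3), mislabeled / len(z))
```

Under this reading, the deterministic case is approached as \(\beta_0 \to -\infty\) and \(\beta_0+\beta_1 \to \infty\), when \(L\) reproduces \(Z\) without error.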