ROC class notes

ROC Estimation

Empirical counting process (jagged)
Parametric (binormal, bigamma)
Nonparametric
Semiparametric

The nonparametric empirical method

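\[y=1-\hat{G}[\hat{F}^{-1}(1-x)], ~~0\leq x \leq 1,\]

where \(\hat{F}\) and \(\hat{G}\) are the empirical distribution functions of the scores in samples N and P respectively. At threshold \(t\),

\[\hat{tp}(t)=\dfrac{\sum_{i=1}^{n_P}\mathbf{1}(S_i^P>t)}{n_P}=1-\hat{G}(t)\]

\[\hat{fp}(t)=\dfrac{\sum_{j=1}^{n_N}\mathbf{1}(S_j^N>t)}{n_N}=1-\hat{F}(t)\]

\[\widehat{AUC}=\hat{p}(S_P>S_N)=\dfrac{\sum_{i=1}^{n_P}\sum_{j=1}^{n_N}\mathbf{1}(S_i^P>S_j^N)}{n_P n_N}\]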
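The counting formulas above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the data are the small N and P samples from the truth-state example in these notes, and ties between scores are assumed not to occur.

```python
# Minimal sketch of the empirical ROC curve and AUC (assumes no tied scores).

def rate_above(scores, t):
    """Proportion of scores strictly greater than the threshold t."""
    return sum(s > t for s in scores) / len(scores)

def empirical_roc(neg, pos):
    """ROC points (fp(t), tp(t)) as t sweeps from the largest score down to -infinity."""
    thresholds = sorted(set(neg) | set(pos), reverse=True) + [float("-inf")]
    return [(rate_above(neg, t), rate_above(pos, t)) for t in thresholds]

def empirical_auc(neg, pos):
    """Proportion of (P, N) pairs in which the P score exceeds the N score."""
    wins = sum(sp > sn for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))

neg = [6.24, 1.77, 4.61, 8.29]            # N sample from the notes
pos = [12.87, 10.22, 15.90, 5.01, 13.35]  # P sample from the notes

roc_points = empirical_roc(neg, pos)      # starts at (0, 0), ends at (1, 1)
auc = empirical_auc(neg, pos)             # 18 of the 20 pairs are ordered correctly
print(roc_points)
print(auc)
```

The curve is a step function (the "jagged" estimate), and without ties the area under the steps equals the pairwise proportion computed by `empirical_auc`.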

Parametric estimation

Ex: The binormal model - estimating \(a\) and \(b\) \[\Phi^{-1}(y)=a+b\Phi^{-1}(x),\] where \(a=(\mu_P-\mu_N)/\sigma_P, b=\sigma_N/\sigma_P.\)

It follows from the earlier assumptions that \(a>0\), while \(b\) is positive by definition, being a ratio of standard deviations.

\[\widehat{AUC}=\Phi\left(\dfrac{\hat{\mu}_P-\hat{\mu}_N}{\sqrt{\hat{\sigma}^2_P+\hat{\sigma}^2_N}}\right)=\Phi\left(\dfrac{\hat{a}}{\sqrt{1+\hat{b}^2}}\right)\]
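As a rough numerical check of the two forms of the binormal AUC, the sketch below plugs sample means and standard deviations into the formulas. The choice of the sample standard deviation (divisor \(n-1\)) is an assumption made here for illustration, as are the variable names; the data are the N and P samples from the truth-state example in these notes.

```python
# Minimal sketch: plug-in estimates of a, b and the binormal AUC.
from math import sqrt
from statistics import NormalDist, mean, stdev

neg = [6.24, 1.77, 4.61, 8.29]            # N sample from the notes
pos = [12.87, 10.22, 15.90, 5.01, 13.35]  # P sample from the notes

std_normal = NormalDist()                 # standard normal: cdf is Phi, inv_cdf is Phi^{-1}

mu_N, mu_P = mean(neg), mean(pos)
sd_N, sd_P = stdev(neg), stdev(pos)       # assumption: sample SD with divisor n - 1

a_hat = (mu_P - mu_N) / sd_P
b_hat = sd_N / sd_P

# The two expressions for the AUC should agree up to rounding error.
auc_from_moments = std_normal.cdf((mu_P - mu_N) / sqrt(sd_P**2 + sd_N**2))
auc_from_ab = std_normal.cdf(a_hat / sqrt(1 + b_hat**2))

def binormal_roc(x):
    """Fitted ROC curve y = Phi(a + b * Phi^{-1}(x)) for 0 < x < 1."""
    return std_normal.cdf(a_hat + b_hat * std_normal.inv_cdf(x))

print(a_hat, b_hat, auc_from_ab)
```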

The Dorfman and Alf method

For ordered categorical data, Dorfman and Alf¹ proposed a maximum likelihood method.

Assume the score \(S\) can take only one of a finite set of ranked values or categories \(C_1, C_2, \ldots, C_k\). Suppose there is a latent random variable \(W\) and a set of unknown thresholds \(-\infty=w_0<w_1<w_2<\cdots<w_k=\infty\) such that \(S\) falls in category \(C_i\) if and only if \(w_{i-1}<W\leq w_i\). We can then define \(p_{i|N}\) and \(p_{i|P}\) as the probabilities that an individual from population N or P, respectively, falls in category \(C_i\).

The log-likelihood function is \[\mathcal{L}=\sum_{i=1}^k(n_{i|N}\log p_{i|N}+n_{i|P}\log p_{i|P}),\] where \(n_{i|N}\) and \(n_{i|P}\) are the observed numbers of individuals from populations N and P respectively falling in category \(C_i\).

How can the log-likelihood function be written in terms of \(w_i\), \(a\) and \(b\)? How can it be maximized?
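One possible way to approach the first question, offered as a sketch rather than the method of the cited paper: assume the latent \(W\) is scaled so that it is standard normal in population N. Under the binormal model this gives \(p_{i|N}=\Phi(w_i)-\Phi(w_{i-1})\) and \(p_{i|P}=\Phi(bw_i-a)-\Phi(bw_{i-1}-a)\). The function names and the category counts below are hypothetical and serve only to show how the log-likelihood could be evaluated.

```python
# Minimal sketch: binormal log-likelihood for ordered categories.
# Assumption: W ~ N(0, 1) in population N, so P(W <= w | P) = Phi(b*w - a).
from math import log
from statistics import NormalDist

PHI = NormalDist().cdf

def category_probs(w, a, b):
    """w holds the interior thresholds w_1 < ... < w_{k-1}; returns (p_N, p_P)."""
    cdf_N = [0.0] + [PHI(c) for c in w] + [1.0]          # cdf at w_0, ..., w_k in N
    cdf_P = [0.0] + [PHI(b * c - a) for c in w] + [1.0]  # cdf at w_0, ..., w_k in P
    p_N = [hi - lo for lo, hi in zip(cdf_N, cdf_N[1:])]
    p_P = [hi - lo for lo, hi in zip(cdf_P, cdf_P[1:])]
    return p_N, p_P

def log_likelihood(w, a, b, n_N, n_P):
    """Sum over categories of n_{i|N} log p_{i|N} + n_{i|P} log p_{i|P}."""
    p_N, p_P = category_probs(w, a, b)
    return sum(cn * log(pn) + cp * log(pp)
               for cn, pn, cp, pp in zip(n_N, p_N, n_P, p_P))

# Hypothetical counts for k = 5 categories (made up for illustration only).
n_N = [30, 19, 8, 2, 1]
n_P = [5, 6, 5, 12, 22]

ll = log_likelihood(w=[-0.5, 0.3, 1.0, 1.6], a=1.5, b=0.8, n_N=n_N, n_P=n_P)
print(ll)
```

For the second question, the function would be handed to a numerical optimizer over \((w_1,\ldots,w_{k-1},a,b)\), keeping the thresholds ordered and \(b>0\); that step is not shown here.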

The Metz method

For continuous data, Metz et al.¹ used truth-state runs in the rank-ordered data as a natural categorization of continuously distributed test results, which allows maximum likelihood (ML) estimation of the ROC curve.

Truth-state runs in rank-ordered data

An example:

N sample: (6.24, 1.77, 4.61, 8.29)

P sample: (12.87, 10.22, 15.90, 5.01, 13.35)

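Sorting the pooled scores in increasing order gives the truth-state sequence \(\{n, n, p, n, n, p, p, p, p\}\), which preserves all the information needed for the ROC curve. Its runs are \(\{2n, 1p, 2n, 4p\}\), so treating each run as a category gives the counts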

N sample: \(\{2, 0, 2, 0\}\)

P sample: \(\{0, 1, 0, 4\}\)

The runs of truth states in rank-ordered test result outcomes provide a natural categorization of inherently continuous data that retains any information relevant to ROC curve fitting.
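The categorization in the example can be reproduced with a short Python sketch. The function name is hypothetical, and the code assumes there are no ties between N and P scores, as would be expected for continuous data.

```python
# Minimal sketch: truth-state runs in rank-ordered data (assumes no tied scores).
from itertools import groupby

def truth_state_runs(neg, pos):
    """Return the runs and the per-run category counts for the N and P samples."""
    # Pool the scores, label each with its truth state, and sort by score.
    labelled = sorted([(s, "n") for s in neg] + [(s, "p") for s in pos])
    # Collapse consecutive equal labels into (label, run length) pairs.
    runs = [(label, len(list(group)))
            for label, group in groupby(labelled, key=lambda pair: pair[1])]
    # Each run is one category; a sample contributes 0 to the other sample's runs.
    n_counts = [length if label == "n" else 0 for label, length in runs]
    p_counts = [length if label == "p" else 0 for label, length in runs]
    return runs, n_counts, p_counts

neg = [6.24, 1.77, 4.61, 8.29]
pos = [12.87, 10.22, 15.90, 5.01, 13.35]

runs, n_counts, p_counts = truth_state_runs(neg, pos)
print(runs)      # [('n', 2), ('p', 1), ('n', 2), ('p', 4)]
print(n_counts)  # [2, 0, 2, 0]
print(p_counts)  # [0, 1, 0, 4]
```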

Semiparametric estimation

The kernel density methods:

\[\hat{f}(x)=\dfrac{1}{n_N h_N}\sum_{i=1}^{n_N}k\left(\dfrac{x-s_{N_i}}{h_N}\right)\] \[\hat{g}(x)=\dfrac{1}{n_P h_P}\sum_{i=1}^{n_P}k\left(\dfrac{x-s_{P_i}}{h_P}\right)\]

where \(k(\cdot)\) is the kernel function and \(h_N, h_P\) are the bandwidths for the N and P samples respectively.

Choosing between the many available kernel functions is relatively unimportant as all give comparable results, but more care needs to be taken over the selection of bandwidth.
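A minimal sketch of the kernel approach follows. Two choices here are assumptions made for illustration and are not taken from the notes: a Gaussian kernel, and the normal reference bandwidth \(h=1.06\,\hat{\sigma}\,n^{-1/5}\). With a Gaussian kernel the integrated estimates \(\hat{F}\) and \(\hat{G}\) are averages of normal distribution functions, which gives a smooth ROC curve. Function names are hypothetical.

```python
# Minimal sketch: Gaussian-kernel density estimates and the resulting smooth ROC.
from math import exp, pi, sqrt
from statistics import NormalDist, stdev

PHI = NormalDist().cdf

def normal_reference_bandwidth(sample):
    """Assumed bandwidth rule: h = 1.06 * sd * n^(-1/5)."""
    return 1.06 * stdev(sample) * len(sample) ** (-1 / 5)

def kde_pdf(sample, h, x):
    """Kernel density estimate at x with a Gaussian kernel and bandwidth h."""
    total = sum(exp(-0.5 * ((x - s) / h) ** 2) for s in sample)
    return total / (len(sample) * h * sqrt(2 * pi))

def kde_cdf(sample, h, t):
    """Integral of the kernel density estimate up to t."""
    return sum(PHI((t - s) / h) for s in sample) / len(sample)

def smooth_roc(neg, pos, thresholds):
    """ROC points (1 - F_hat(t), 1 - G_hat(t)) over the given thresholds."""
    h_N = normal_reference_bandwidth(neg)
    h_P = normal_reference_bandwidth(pos)
    return [(1 - kde_cdf(neg, h_N, t), 1 - kde_cdf(pos, h_P, t)) for t in thresholds]

neg = [6.24, 1.77, 4.61, 8.29]            # N sample from the notes
pos = [12.87, 10.22, 15.90, 5.01, 13.35]  # P sample from the notes

thresholds = [25 - 0.5 * i for i in range(71)]  # 25 down to -10
roc = smooth_roc(neg, pos, thresholds)
print(roc[:3], roc[-3:])
```

Changing the bandwidth function is the easiest way to see how sensitive the fitted curve is to that choice.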

Since \(F\) and \(G\) are estimated separately, the final ROC curve estimator is not invariant under a monotone transformation of the data.

Spline smoothing is also a popular approach to density estimation.