For ordered categorical data, Dorfman and Alf¹ proposed a maximum likelihood method.
Assume the score \(S\) can take only one of a finite set of ranked categories \(C_1, C_2, \ldots, C_k\). Suppose there is a latent random variable \(W\) and a set of unknown thresholds \(-\infty=w_0<w_1<w_2<\cdots<w_{k-1}<w_k=\infty\) such that \(S\) falls in category \(C_i\) if and only if \(w_{i-1}<W\leq w_i\). We can then define the category probabilities in each population, \(p_{i|N}=\Pr(w_{i-1}<W\leq w_i\mid N)\) and \(p_{i|P}=\Pr(w_{i-1}<W\leq w_i\mid P)\).
The log-likelihood function is \[\log\mathcal{L}=\sum_{i=1}^k(n_{i|N}\log p_{i|N}+n_{i|P}\log p_{i|P})\] where \(n_{i|N}\) and \(n_{i|P}\) are the observed numbers of individuals from populations N and P, respectively, falling in category \(C_i\).
The second derivatives of \(\log\mathcal{L}\) form the Hessian matrix; inverting its negative at the maximum gives the asymptotic variance-covariance matrix of the estimates.
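As a rough sketch of how such a fit might be carried out numerically, the R code below maximizes the log-likelihood with `optim()` and requests a numerical Hessian. It assumes a binormal latent model that is not stated above (\(W\) standard normal in N, and normal with mean `mu` and standard deviation `sigma` in P). The counts and the names `unpack`, `neg_loglik` and `theta0` are hypothetical and chosen only for illustration.

```r
# Hypothetical counts per category C_1..C_5 (illustration only)
n_N <- c(30, 12, 8, 6, 4)
n_P <- c(4, 6, 10, 14, 26)
k <- length(n_N)

# Map an unconstrained vector to (mu, sigma, ordered thresholds).
# theta = (mu, log sigma, w_1, log increments between later thresholds)
unpack <- function(theta) {
  mu <- theta[1]
  sigma <- exp(theta[2])
  w <- cumsum(c(theta[3], exp(theta[-(1:3)])))  # guarantees w_1 < ... < w_{k-1}
  list(mu = mu, sigma = sigma, w = c(-Inf, w, Inf))
}

# Negative log-likelihood under the ASSUMED binormal model
neg_loglik <- function(theta) {
  par <- unpack(theta)
  p_N <- diff(pnorm(par$w))                         # p_{i|N}
  p_P <- diff(pnorm((par$w - par$mu) / par$sigma))  # p_{i|P}
  eps <- 1e-12                                      # guard against log(0)
  -sum(n_N * log(pmax(p_N, eps)) + n_P * log(pmax(p_P, eps)))
}

theta0 <- c(1, 0, -0.5, rep(log(0.5), k - 2))       # arbitrary starting values
fit <- optim(theta0, neg_loglik, method = "BFGS", hessian = TRUE)

est <- unpack(fit$par)
# fit$hessian is the Hessian of the NEGATIVE log-likelihood, so its inverse
# approximates the asymptotic covariance of theta (on the transformed scale).
vcov_theta <- solve(fit$hessian)
```

Because the thresholds and `sigma` are reparameterized to keep the optimization unconstrained, `vcov_theta` refers to the transformed parameters; standard errors on the original scale would need a further step such as the delta method.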
The Metz method
For continuous data, Metz et al.¹ used truth-state runs in rank-ordered data as a natural categorization of continuously distributed test results, which allows maximum likelihood (ML) estimation of ROC curves.
Truth-state runs in rank-ordered data
An example:
N sample: (6.24, 1.77, 4.61, 8.29)
P sample: (12.87, 10.22, 15.90, 5.01, 13.35)
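Sorting the pooled scores and recording each observation's truth state gives the sequence \(\{n, n, p, n, n, p, p, p, p\}\), or equivalently the runs \(\{2n, 1p, 2n, 4p\}\); this preserves all the information needed for the ROC curve. Treating each run as a category gives the counts:
N sample: \(\{2, 0, 2, 0\}\)
P sample: \(\{0, 1, 0, 4\}\)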
The runs of truth states in rank-ordered test result outcomes provide a natural categorization of inherently continuous data that retains any information relevant to ROC curve fitting.
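The categorization in the example can be reproduced with a few lines of R. This is only a sketch: it uses base R's `rle()` to find the runs and assumes there are no ties between N and P scores; the variable names are illustrative.

```r
x_N <- c(6.24, 1.77, 4.61, 8.29)
x_P <- c(12.87, 10.22, 15.90, 5.01, 13.35)

score <- c(x_N, x_P)
truth <- rep(c("n", "p"), c(length(x_N), length(x_P)))

ord  <- order(score)        # rank-order the pooled scores
runs <- rle(truth[ord])     # runs of identical truth states

# Each run is one category; a run of n's contributes 0 to the P counts
# and vice versa.
counts_N <- ifelse(runs$values == "n", runs$lengths, 0)
counts_P <- ifelse(runs$values == "p", runs$lengths, 0)

counts_N  # 2 0 2 0
counts_P  # 0 1 0 4
```

These two count vectors play the role of \(n_{i|N}\) and \(n_{i|P}\) in the categorical log-likelihood.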
- What if there are too many categories?
The runs can be binned into 20 categories to improve computational efficiency.
Semiparametric estimation
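Kernel density methods obtain smooth estimates of the functions \(F\) and \(G\) directly from the data, without imposing any distributional constraints. Writing \(s_{N1},\ldots,s_{Nn_N}\) and \(s_{P1},\ldots,s_{Pn_P}\) for the observed scores in the two samples, the kernel density estimates are \[\hat f(x)=\frac{1}{n_Nh_N}\sum_{i=1}^{n_N}k\left(\frac{x-s_{Ni}}{h_N}\right),\qquad \hat g(x)=\frac{1}{n_Ph_P}\sum_{j=1}^{n_P}k\left(\frac{x-s_{Pj}}{h_P}\right),\]
where \(k(\cdot)\) is the kernel function and \(h_N, h_P\) are the bandwidths in each population; \(\hat F\) and \(\hat G\) follow by integrating \(\hat f\) and \(\hat g\).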
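x <- rnorm(500)           # simulate 500 standard normal observations
hist(x, freq = FALSE)     # histogram on the density scale
dens <- density(x)        # Gaussian kernel density estimate, default bandwidth
lines(dens, col = "red")  # overlay the estimate on the histogram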
dens
Call:
density.default(x = x)
Data: x (500 obs.); Bandwidth 'bw' = 0.2767
x y
Min. :-4.1913 Min. :0.0000351
1st Qu.:-2.1852 1st Qu.:0.0131271
Median :-0.1791 Median :0.0634206
Mean :-0.1791 Mean :0.1244967
3rd Qu.: 1.8270 3rd Qu.:0.2349642
Max. : 3.8331 Max. :0.3759868
The choice among the many available kernel functions is relatively unimportant, as all give comparable results, but more care is needed in selecting the bandwidth. A general-purpose bandwidth can be used, for example Silverman's rule of thumb, which is the default in R's density() (bw = "nrd0").
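To make the construction concrete, here is a minimal sketch of a kernel-smoothed ROC curve in R. It assumes a Gaussian kernel, the rule-of-thumb bandwidth from `bw.nrd0()`, and that larger scores indicate population P. The data are simulated, and `smooth_cdf` and `t_grid` are hypothetical names.

```r
set.seed(1)                               # simulated data, illustration only
x_N <- rnorm(200, mean = 0,   sd = 1)
x_P <- rnorm(200, mean = 1.5, sd = 1.2)

h_N <- bw.nrd0(x_N)                       # rule-of-thumb bandwidth, N sample
h_P <- bw.nrd0(x_P)                       # rule-of-thumb bandwidth, P sample

# Integrating a Gaussian kernel density estimate gives a smooth cdf:
# the average of normal cdfs centred at the observations.
smooth_cdf <- function(t, x, h) {
  vapply(t, function(ti) mean(pnorm((ti - x) / h)), numeric(1))
}

pad    <- 3 * max(h_N, h_P)
t_grid <- seq(min(x_N, x_P) - pad, max(x_N, x_P) + pad, length.out = 400)

fpr <- 1 - smooth_cdf(t_grid, x_N, h_N)   # 1 - F_hat(t)
tpr <- 1 - smooth_cdf(t_grid, x_P, h_P)   # 1 - G_hat(t)

plot(fpr, tpr, type = "l",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)                     # chance line for reference
```

Each sample gets its own bandwidth here, matching the separate \(h_N\) and \(h_P\) in the estimator.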
Since \(F\) and \(G\) are estimated separately, the final ROC curve estimator is not invariant under a monotone transformation of the data.
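The lack of invariance can be checked numerically. The sketch below compares the area under the kernel-smoothed ROC curve before and after a log transformation, again assuming a Gaussian kernel with `bw.nrd0()` bandwidths; for that kernel the area has the closed form used in `kernel_auc`. The simulated data and function names are illustrative only.

```r
set.seed(2)                               # simulated data, illustration only
x_N <- rlnorm(100, meanlog = 0,   sdlog = 0.5)
x_P <- rlnorm(100, meanlog = 0.8, sdlog = 0.5)

# Area under the Gaussian-kernel-smoothed ROC curve: the average over all
# (P, N) pairs of Phi((x_P - x_N) / sqrt(h_N^2 + h_P^2)).
kernel_auc <- function(x_N, x_P) {
  h_N <- bw.nrd0(x_N)
  h_P <- bw.nrd0(x_P)
  mean(pnorm(outer(x_P, x_N, "-") / sqrt(h_N^2 + h_P^2)))
}

# Rank-based (empirical) area for comparison; it depends only on ordering.
empirical_auc <- function(x_N, x_P) mean(outer(x_P, x_N, ">"))

k_raw <- kernel_auc(x_N, x_P)
k_log <- kernel_auc(log(x_N), log(x_P))
e_raw <- empirical_auc(x_N, x_P)
e_log <- empirical_auc(log(x_N), log(x_P))

c(kernel_raw = k_raw, kernel_log = k_log)        # these differ
c(empirical_raw = e_raw, empirical_log = e_log)  # these agree
```

The empirical area is unchanged because the log transformation preserves the ordering of the scores, whereas the kernel estimate changes because the bandwidths and smoothing operate on the measurement scale.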