2024-01-22
Four joint probabilities: \(p(s>t,P)\), \(p(s>t,N)\), \(p(s \leq t,P)\), \(p(s \leq t,N)\).
One very common measure is the misclassification rate \(p(s>t,N)+p(s \leq t,P)\), but it is far from perfect because it weights the two kinds of misclassification as equally important.
The conditional probabilities
\(p(s>t|N)\), \(p(s>t|P)\), \(p(s \leq t|N)\), \(p(s \leq t|P)\) are independent of prevalence. This gives them power to generalize over populations which have different proportions belonging to the classes, and is a property we will make use of below when describing the ROC curve.
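What about the positive and negative predictive values, PPV \(p(P|s>t)\) and NPV \(p(N|s \leq t)\)? Unlike the conditional probabilities above, they depend on prevalence.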
Ex: \(tp=tn=0.99\), \(p(P)=0.001\)
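The example can be worked through with Bayes' rule. The sketch below assumes PPV means \(p(P|s>t)\) and NPV means \(p(N|s \leq t)\); the variable names `prev`, `ppv` and `npv` are illustrative and not part of the original notes.

```r
# Sensitivity (tp), specificity (tn) and prevalence p(P) from the example
tp   <- 0.99
tn   <- 0.99
prev <- 0.001

# Bayes' rule: PPV = p(P | s > t), NPV = p(N | s <= t)
ppv <- tp * prev / (tp * prev + (1 - tn) * (1 - prev))
npv <- tn * (1 - prev) / (tn * (1 - prev) + (1 - tp) * prev)

round(c(ppv = ppv, npv = npv), 4)
```

Under these assumptions the PPV is about 0.09: even with 99% true positive and true negative rates, roughly nine out of ten positive calls are false, because the positive class is so rare.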
The true positive and true negative rates of a classification rule are usually used together as joint measures of performance.
In essence, assessing performance amounts to comparing the distributions of the scores for the positive and negative populations.
The ROC curve is a way of jointly displaying these two distributions.
It is a graph showing the true positive rate on the vertical axis and the false positive rate on the horizontal axis, as the classification threshold \(t\) varies.
Why would \(tp\) and \(fp\) be enough? (The other two rates follow from them: \(fn=1-tp\) and \(tn=1-fp\).)
x class
1 1.8083866 1
2 0.1019016 1
3 0.7259634 1
4 1.4447029 1
5 1.9019546 1
6 2.5012733 1
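library(ggplot2)
# Use every observed score as a candidate threshold t
all_t=sort(df$x)
tp=fp=numeric(length(all_t))
for (i in seq_along(all_t))
{
  # Predict class 1 when the score exceeds the threshold
  pred=factor(ifelse(df$x>all_t[i],1,0),levels = c(0,1))
  # Rows: predicted class; columns: true class
  tmptbl=table(pred,df$class)
  # Rates: divide by the number of cases in each true class (100 each here)
  tp[i] = tmptbl[2,2]/sum(tmptbl[,2])
  fp[i] = tmptbl[2,1]/sum(tmptbl[,1])
}
roc=data.frame(tp,fp)
ggplot(roc,aes(x=fp,y=tp))+geom_point()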
# Animate the curve: draw one ROC point per threshold
plot(fp,tp,ylim=c(0,1),xlim=c(0,1),type="n")
for (i in seq_along(all_t)) {
  points(fp[i],tp[i])
  Sys.sleep(0.1)
}
The Area Under the Curve, or AUC, based on the ROC curve, is a global measure of separability between the distributions of scores for the positive and negative populations.
It does not require one to choose a threshold value, but summarizes the results over all possible choices.
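It turns out to be equivalent to the popular Mann-Whitney U-statistic, rescaled by the number of positive-negative pairs: the probability that a randomly chosen individual from P scores higher than a randomly chosen individual from N.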
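The equivalence can be checked numerically. The following is a minimal sketch on simulated data, not the data set used above: the two normal score distributions, the sample sizes and all variable names are assumptions made for illustration. Thresholds are applied with `>=` so that each observed score adds exactly one step to the empirical curve.

```r
set.seed(1)
# Hypothetical scores: 100 negatives ~ N(0, 1), 100 positives ~ N(1.5, 1)
neg <- rnorm(100, mean = 0)
pos <- rnorm(100, mean = 1.5)

# Mann-Whitney U: number of (positive, negative) pairs with pos > neg
u <- sum(outer(pos, neg, ">"))
auc_u <- u / (length(pos) * length(neg))

# The same count as reported by wilcox.test (no ties in continuous data)
w <- unname(wilcox.test(pos, neg)$statistic)

# Empirical ROC: one point per threshold, from the largest score downwards,
# starting at (0, 0)
thr <- sort(c(pos, neg), decreasing = TRUE)
tp <- c(0, sapply(thr, function(t) mean(pos >= t)))
fp <- c(0, sapply(thr, function(t) mean(neg >= t)))

# Trapezoidal area under the empirical ROC curve
auc_roc <- sum(diff(fp) * (head(tp, -1) + tail(tp, -1)) / 2)

c(auc_u = auc_u, auc_roc = auc_roc)
```

With continuous scores there are no ties, so the two numbers agree: each step to the right on the empirical curve adds an area proportional to the number of positives scoring above that negative.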
Again, we assume the classification rule to be some continuous function \(S(\boldsymbol{X})\) of the random vector \(\boldsymbol{X}\) of variables measured on each individual, conventionally arranged so that large values of the function are more indicative of population P and small ones more indicative of population N.
If \(\boldsymbol{x}\) is the observed value of \(\boldsymbol{X}\), then the individual is allocated to population P if the score \(s(\boldsymbol{x})\) exceeds some threshold \(t\), and to population N otherwise.
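The classifier will be least successful when the score distributions of the two populations are exactly the same. Then \(tp=fp\) for every threshold, and the ROC curve is the diagonal from (0,0) to (1,1).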
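At the other (usually unattainable) extreme there is complete separation: there is at least one \(t\) such that \(tp=1, fp=0\). For all smaller values of \(t\), \(tp = 1\) while \(fp\) varies from 0 to 1; for all larger values of \(t\), \(fp = 0\) while \(tp\) varies from 1 to 0. The ROC curve then runs along the left and top edges of the unit square.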
In practice, the ROC curve will be a continuous curve lying between these two extremes, so it will lie in the upper triangle of the graph.
Writing \(x=fp\) and \(y=tp\), the ROC curve \(y=h(x)\), traced out by the points \((x(t),y(t))\) as \(t\) varies, is a monotone increasing function in the positive quadrant, running from (0,0) to (1,1). (Why?)
The ROC curve is unaltered if the classification scores undergo a strictly increasing transformation.
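This invariance can be illustrated with a small simulation. The sketch below is an assumption-laden example rather than part of the original analysis: the simulated scores, the helper `roc_points()` and the choice of `exp()` as the strictly increasing transformation are all hypothetical.

```r
set.seed(2)
# Hypothetical labels (0 = N, 1 = P) and scores shifted upwards for class P
class <- rep(c(0, 1), each = 50)
s <- rnorm(100, mean = 1.5 * class)

# Empirical (fp, tp) pairs, using every observed score as a threshold
roc_points <- function(score, class) {
  thr <- sort(score)
  data.frame(
    fp = sapply(thr, function(t) mean(score[class == 0] > t)),
    tp = sapply(thr, function(t) mean(score[class == 1] > t))
  )
}

roc_raw <- roc_points(s, class)
roc_exp <- roc_points(exp(s), class)  # exp() is strictly increasing

# TRUE when both sets of ROC points coincide
all.equal(roc_raw, roc_exp)
```

The ROC points depend on the scores only through their ordering, and a strictly increasing transformation leaves that ordering unchanged.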
The slope of the ROC curve at the point with threshold \(t\) is the ratio of the two score densities at \(t\): \[\frac{dy}{dx}=\frac{p(t|P)}{p(t|N)}\]
For population N, assume pdf and cdf to be \(f\) and \(F\), respectively.
For population P, assume pdf and cdf to be \(g\) and \(G\), respectively.
\(x(t)=1-F(t)\), \(y(t)=1-G(t)\), so \[y=1-G[F^{-1}(1-x)], ~~0\leq x \leq 1\]
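As a concrete instance of this formula, the sketch below assumes normally distributed scores, N ~ Normal(0, 1) and P ~ Normal(mu, 1) with a hypothetical separation `mu = 1`. The function name `roc_h` and the check point `fp = 0.2` are illustrative choices, not taken from the notes.

```r
mu <- 1  # hypothetical separation between the two class means

# F: cdf of scores in N ~ Normal(0, 1); G: cdf of scores in P ~ Normal(mu, 1)
# ROC curve: y = 1 - G(F^{-1}(1 - x))
roc_h <- function(x) 1 - pnorm(qnorm(1 - x), mean = mu)

x <- seq(0, 1, by = 0.01)
y <- roc_h(x)

# Slope check at the threshold t0 that gives fp = 0.2:
# density ratio g(t0) / f(t0) against a central finite difference
t0 <- qnorm(1 - 0.2)
slope_formula <- dnorm(t0, mean = mu) / dnorm(t0)
eps <- 1e-6
slope_numeric <- (roc_h(0.2 + eps) - roc_h(0.2 - eps)) / (2 * eps)

c(slope_formula = slope_formula, slope_numeric = slope_numeric)
```

Calling `plot(x, y, type = "l")` draws the resulting curve from (0,0) to (1,1); the two slope values should agree up to numerical error, which is consistent with the density-ratio expression for \(dy/dx\).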