This is a draft vignette for the confusR package. It serves as an introduction to confusR.

Binary models have well-established measures of performance (e.g., precision, accuracy). With multinomial classification methods, it is not so simple.
What measures exist for assessing and describing performance of multinomial classification models, and how can we express those measures in a way that is interpretable by non-data-scientists?
— Brendan Langfield
As organisations seek to develop and deploy more complex classification systems, there is a growing need for understanding and transparency in model development, as well as a requirement to explain how the models are operating. Similar to assessment in educational settings, we can think about performance assessment of classification systems with two ends in mind: assessment that guides the development and improvement of a model (analogous to formative assessment), and assessment that describes and explains the performance of a finished model to its stakeholders (analogous to summative assessment).
We are interested in the latter.
In particular, we want to separate the test set performance of the classifier into two parts: the prior distribution of the classes it is applied to, and the intrinsic ability of the classifier to discriminate between those classes.
Grant Sanderson suggests that we can better understand the discriminative performance of a model by expressing Bayes’ rule in terms of prior odds and Bayes factors (Sanderson 2020).
Suppose we have a hypothesis \(\mathrm{D}\) that a person actually has a disease, and some evidence \(\mathrm{T}\) about that in the form of a positive test result for that disease. Often we want to know “if I have a positive test result, what’s the chance that I actually have the disease?”. Usually this is written symbolically in terms of probabilities as: \[ \mathrm{P(D|T)} = \frac{\mathrm{P(T|D)P(D)}}{\mathrm{P(T|D)P(D)}+\mathrm{P(T|\overline{D})P(\overline{D})}} \] …but Sanderson extols the merits of writing this using odds: \[ \begin{align} \mathrm{O(D|T)} &= \mathrm{O(D)} \frac{\mathrm{P(T|D)}}{\mathrm{P(T|\overline{D})}}\\ &= \mathrm{O(D)} \frac{\text{True positive rate}}{\text{False positive rate}} \end{align} \] where the ratio is the Bayes factor of the test for a positive result. This factor represents how our prior odds of having the disease are updated as a result of the test outcome. It is also known as the likelihood ratio of a positive outcome (or LR+ for short).
In the language of statistics, the posterior odds equal the prior odds multiplied by the Bayes factor, or, more succinctly:
\[ \begin{align} \text{posterior odds} &= \text{prior odds} \times \text{Bayes factor for a positive outcome}\\[2ex] &= \text{prior odds} \times \frac{\text{True positive rate}}{\text{False positive rate}} \end{align} \]
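To make the arithmetic concrete, here is a minimal sketch in base R (these helper functions are illustrative only and are not part of confusR):

# Convert a probability to odds, apply a Bayes factor, and convert back
prob_to_odds <- function(p) p / (1 - p)
odds_to_prob <- function(o) o / (1 + o)

posterior_prob <- function(prior_prob, bayes_factor) {
  odds_to_prob(prob_to_odds(prior_prob) * bayes_factor)
}

# Illustrative numbers: a 10% prior and a test with a true positive rate
# of 0.9 and a false positive rate of 0.1, i.e. a Bayes factor of 9
posterior_prob(prior_prob = 0.1, bayes_factor = 0.9 / 0.1)
#> [1] 0.5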
If we visualise these terms on a logarithmic scale, we can exploit the fact that
\[ \begin{align} \log(\text{posterior odds}) &= \log(\text{prior odds}) + \log(\text{Bayes factor for a positive outcome}) \end{align} \]
…let’s see how, starting with a well-known example.
David M. Eddy (1982) gives an example of a diagnostic test for breast cancer where the test had a true positive rate of 79.2% and a false positive rate of 9.6%, and the prior probability of breast cancer was assumed to be 1%.
Eddy asked physicians to estimate the probability of actually having cancer given a prediction of cancer from the test.
If you have not seen this example, I recommend you watch Luana Micallef’s (2012) telling of it right now before I spoil the surprising answer.
Here’s the confusion matrix we would expect if we applied this test to 1000 patients:
Eddy
actual
predicted cancer benign
cancer 8 95
benign 2 895
and here is how we refer to the cells in a binary confusion matrix, i.e., true positive (TP), false positive (FP), false negative (FN), true negative (TN):
actual
predicted positive negative
positive TP FP
negative FN TN
So, with this confusion matrix, the probability of actually having cancer given a prediction of cancer by the test is \[ \begin{align} \mathrm{P}(\text{actually having cancer }|\text{ test predicts cancer}) &= \frac{\text{TP}}{\text{TP}+\text{FP}}\\ &= \frac{8}{8+95}\\[1.5ex] &= 7.77\% \end{align} \]
The surprise (for most people) is that a test with a true positive rate of 79.2% and false positive rate of 9.6% has such a low probability of actually being right. The reason for this counter-intuitive outcome is that cancer is rare in this population (the prior probability is 1%), so when the test is positive, it is usually a false positive.
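The separation into prior odds and a Bayes factor can be checked directly from the confusion matrix. As a sketch in base R (the Eddy object in confusR may be represented differently; here we just use the cell counts above):

# Cell counts from the confusion matrix above (1% prior population)
TP <- 8; FP <- 95; FN <- 2; TN <- 895

# Directly: P(actually cancer | test predicts cancer)
TP / (TP + FP)                                        # about 0.0777

# Via the odds formulation: prior odds x LR+, converted back to a probability
prior_odds     <- (TP + FN) / (FP + TN)               # 10 / 990
LR_pos         <- (TP / (TP + FN)) / (FP / (FP + TN))
posterior_odds <- prior_odds * LR_pos
posterior_odds / (1 + posterior_odds)                 # about 0.0777 again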
Let’s change the prior probability of cancer and see how the same test performs. Here is the confusion matrix of the same diagnostic test applied to 1000 patients who have a 50% prior probability of breast cancer:
Eddy.equal
actual
predicted cancer benign
cancer 396 48
benign 104 452
With this set of patients, the probability of actually having cancer given a prediction of cancer by the test is \[ \begin{align} \mathrm{P}(\text{actually having cancer }|\text{ test predicts cancer}) &= \frac{\text{TP}}{\text{TP}+\text{FP}}\\ &= \frac{396}{396+48}\\[1.5ex] &= 89.19\% \end{align} \]
Same test, but the new confusion matrix shows that the probability that the test’s prediction of cancer is correct has increased dramatically (from 7.77% to 89.19%).
For many people, this is confusing. How can the same test have such a different probability of being right in these two scenarios? To make sense of this, we suggest using the odds formulation of Bayes’ rule, i.e.,
\[ \begin{align} \text{posterior odds of cancer} &= \text{prior odds of cancer} \times \text{Bayes factor of test for cancer}\\[2ex] &= \text{prior odds of cancer} \times \frac{\text{True positive rate}}{\text{False positive rate}} \end{align} \]
to split apart the prior odds of cancer from the discriminative power of the test (its Bayes factor).
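Numerically, the two scenarios share (approximately) the same Bayes factor and differ only in their prior odds. A quick base-R check, using the published test characteristics and the two priors:

LR_pos <- 0.792 / 0.096      # true positive rate / false positive rate, = 8.25

# 1% prior: prior odds of 1/99
odds_low  <- (1 / 99) * LR_pos
odds_low / (1 + odds_low)    # about 0.077

# 50% prior: prior odds of 1
odds_high <- 1 * LR_pos
odds_high / (1 + odds_high)  # about 0.89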
Here is a plot which shows that split
plot.BayesFactors.pos(Eddy, show.arrows=TRUE, legend.pos=c(0,0))
In the above plot (which is on a log scale), the arrow shows how the test’s Bayes factor is added to the low prior odds of cancer (in log terms) to give the posterior odds. The length of the arrow depends only on the test, while its starting point depends only on how prevalent cancer is in the population.
In other words, the test provides a strong update, but because the prior odds of cancer are so low, the posterior odds remain modest.
Now let’s use this same plotting technique on the confusion matrix of the same diagnostic test applied to 1000 patients who have a 50% prior probability of breast cancer:
Eddy.equal
actual
predicted cancer benign
cancer 396 48
benign 104 452
plot.BayesFactors.pos(Eddy.equal, show.arrows=TRUE, legend.pos=c(0,0), log.range=7)
This approach separates the characteristics of the classifier from the prior distribution of the classes it is applied to. Now let’s look at how this approach can be applied to multinomial classification systems.
We’ve just looked at a binary classification system in which a test can make one of two possible predictions, cancer or benign. Multinomial classification systems make predictions of 1-out-of-\(C\) classes, where \(C > 2\).
Here’s an example confusion matrix, CUP, from a multinomial classifier with 17 classes; the classes are types of cancer of unknown primary:
actual
predicted Lung Brea Colo Panc Skin Ovar Rena Pros Head Esop Thyr Blad Germ Endo Live Adre Cerv
Lung 180 11 0 3 6 2 3 4 3 3 2 3 0 0 2 0 3
Brea 10 194 1 1 3 4 1 0 0 1 2 0 0 1 1 3 3
Colo 3 4 164 1 1 0 0 0 0 2 0 1 0 1 3 1 0
Panc 21 6 6 114 1 2 0 1 2 6 0 2 1 0 0 1 1
Skin 1 4 0 0 90 0 0 1 1 2 0 0 0 0 0 0 0
Ovar 13 5 0 0 2 92 0 0 0 2 0 0 0 5 0 0 0
Rena 0 1 0 0 0 0 70 0 0 0 1 0 2 0 0 0 0
Pros 3 0 0 0 3 0 0 54 0 2 0 0 0 0 0 0 0
Head 1 0 0 0 1 0 2 1 47 0 0 0 0 0 0 0 0
Esop 0 3 2 1 1 0 1 1 3 32 0 1 2 0 0 0 0
Thyr 0 0 0 0 0 0 0 0 0 0 38 0 0 0 0 0 0
Blad 3 2 0 0 3 1 1 2 0 0 0 35 1 0 0 0 0
Germ 0 0 0 0 0 0 0 0 0 1 0 0 26 0 0 0 0
Endo 1 1 0 1 0 1 0 0 0 0 0 0 0 14 7 0 0
Live 0 0 0 1 0 0 1 0 1 0 0 0 0 0 5 0 0
Adre 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 0
Cerv 0 0 2 0 0 0 0 0 0 1 0 0 0 0 0 0 4
This summarises the predictions that a classification system made about examples whose actual classes were known. This 17 \(\times\) 17 matrix can be summarised further by forming the 17 binary confusion matrices of each class versus all other classes, e.g.,
actual
predicted Lung n.Lung
Lung 180 45
n.Lung 56 1127
actual
predicted Brea n.Brea
Brea 194 31
n.Brea 37 1146
…through to…
actual
predicted Cerv n.Cerv
Cerv 4 3
n.Cerv 7 1394
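Collapsing the full matrix into these one-versus-all matrices is mechanical. As a minimal sketch in base R (ova_matrix is an illustrative name, not a confusR function; confusR’s own OVA(), used later in this vignette, returns per-class statistics rather than matrices), assuming the full confusion matrix is stored as a matrix or table with predicted classes as rows and actual classes as columns in the same order:

# Collapse a C x C confusion matrix into the 2 x 2 one-versus-all
# confusion matrix for a single class
ova_matrix <- function(cm, class) {
  i <- rownames(cm) == class
  matrix(c(cm[i, i],       sum(cm[i, !i]),     # TP, FP
           sum(cm[!i, i]), sum(cm[!i, !i])),   # FN, TN
         nrow = 2, byrow = TRUE,
         dimnames = list(predicted = c(class, paste0("n.", class)),
                         actual    = c(class, paste0("n.", class))))
}

# e.g. ova_matrix(CUP, "Lung") reproduces the Lung-versus-rest matrix above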
Once we have summarised a \(C\times C\) confusion matrix as \(C\) binary one-versus-all confusion matrices, we can visualise the prior, posterior and Bayes factor for each class versus all others like so:
plot.BayesFactors.pos(CUP, show.arrows = TRUE, arrow.length = 2.5, arrow.size = 1)
This can be sorted by prior class probability
plot.BayesFactors.pos(CUP, sort.by="prior")
…or by posterior class probability
plot.BayesFactors.pos(CUP, sort.by="post")
Note the estimated Bayes factors for the classes Thyr and Adre are infinite; the posterior probability of actually being Thyr given a prediction of Thyr is 1 and this value is off the right of the scale in the left panel.
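For example, using the ova_matrix() sketch from earlier (and assuming the CUP matrix is available with named rows and columns), the Thyr counts show why:

ova_matrix(CUP, "Thyr")
# Every one of the 38 Thyr predictions was correct (no false positives),
# so the estimated false positive rate is 0, LR+ = TPR / 0 = Inf, and the
# posterior probability TP / (TP + FP) = 38 / 38 = 1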
Here’s another example of a different (23 \(\times\) 23) confusion matrix in which we see a class (LIS) that is never predicted correctly:
actual
predicted ARN BCN BER DEN DOH FAI HKG ICN IEV ILR KUL LCY LIS NYC OFF SAO SCL SDJ SFO SGP TPE TYO ZRH
ARN 34 0 0 2 0 0 0 0 2 2 0 0 0 0 0 1 0 0 1 1 1 1 0
BCN 0 37 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
BER 0 0 24 1 3 2 0 0 0 0 0 4 0 2 0 1 1 0 0 1 1 1 0
DEN 2 0 0 30 0 2 0 0 0 1 0 1 0 0 0 0 0 0 2 0 0 1 1
DOH 0 0 1 2 36 2 3 0 0 6 0 1 1 3 0 0 5 1 0 5 2 0 0
FAI 0 0 4 4 3 22 0 0 0 1 0 3 1 0 0 1 4 1 0 1 0 3 0
HKG 1 0 0 0 1 0 13 4 7 2 3 1 0 0 0 3 0 0 0 3 2 4 0
ICN 2 0 0 0 0 0 4 33 0 1 0 2 0 1 0 2 0 0 2 1 6 1 0
IEV 1 0 0 1 1 0 2 1 7 5 0 0 0 1 0 1 0 0 1 6 4 4 2
ILR 0 0 0 0 4 4 5 0 5 44 2 1 9 3 15 2 1 1 1 4 4 2 1
KUL 0 1 0 0 0 0 3 0 3 3 23 0 0 1 0 0 0 0 0 1 1 1 2
LCY 0 0 2 0 0 0 0 1 1 0 0 15 0 1 0 0 0 0 0 0 0 0 0
LIS 0 0 1 0 0 1 0 0 0 5 0 0 0 2 0 0 0 0 0 0 0 0 0
NYC 2 0 5 1 3 9 2 0 2 1 0 3 7 77 1 1 0 0 0 2 2 2 2
OFF 0 0 0 1 0 0 0 0 0 13 0 0 1 2 10 0 0 0 0 0 0 0 0
SAO 0 0 0 1 0 0 1 4 2 2 0 0 0 0 0 8 0 0 0 1 0 4 2
SCL 0 0 1 0 4 1 0 0 0 1 0 2 0 0 0 0 15 0 0 0 0 0 0
SDJ 0 0 0 1 0 0 3 0 5 2 0 0 0 0 0 1 0 28 0 4 0 0 1
SFO 1 0 0 0 5 0 1 0 1 2 0 2 0 1 0 0 0 0 19 2 3 2 1
SGP 1 0 0 0 1 0 2 3 5 1 1 1 0 1 0 2 0 0 0 8 1 7 6
TPE 2 0 0 1 3 0 3 1 3 1 1 1 0 1 0 1 0 1 1 4 16 2 5
TYO 4 0 3 0 1 4 2 2 2 2 0 0 0 3 0 5 0 0 2 3 3 39 0
ZRH 0 0 0 0 0 0 5 1 4 2 0 0 0 0 0 0 0 0 0 1 4 1 10
In this situation, the posterior probability and Bayes factor for LIS versus all other classes are off the left of the scale: LIS has no true positives but some false positives, so the estimated LR+ is zero and its logarithm is \(-\infty\).
So far, we have been working with the odds formulation of Bayes rule for a positive outcome, i.e., the odds of actually having a disease given that the test predicts that you do \(\mathrm{O(D|T)}\) : \[ \begin{align} \mathrm{O(D|T)} &= \mathrm{O(D)} \frac{\mathrm{P(T|D)}}{\mathrm{P(T|\overline{D})}}\\ &= \mathrm{O(D)} \frac{\text{True positive rate}}{\text{False positive rate}} \end{align} \] where the ratio is the Bayes factor of the test for a positive result, i.e., the likelihood ratio of a positive outcome ( LR+ ).
We can also work with the odds formulation for a negative outcome, i.e., the odds that you do not have the disease given that the test predicts that you do not: \(\mathrm{O(\overline{D}|\overline{T})}\).
In that case
\[ \begin{align} \mathrm{O(\overline{D}|\overline{T})} &= \mathrm{O(\overline{D})} \frac{\mathrm{P(\overline{T}|\overline{D})}}{\mathrm{P(\overline{T}|D)}}\\ &= \mathrm{O(\overline{D})} \frac{\text{True negative rate}}{\text{False negative rate}} \end{align} \] where the ratio \[ \frac{\text{True negative rate}}{\text{False negative rate}} = \frac{1}{\text{Bayes factor for a negative outcome}}. \] The Bayes factor of the test for a negative result is also known as the likelihood ratio of a negative outcome (or LR- for short).
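Applying this to the Eddy example (a base-R sketch using the cell counts of the 1% prior confusion matrix above):

TP <- 8; FP <- 95; FN <- 2; TN <- 895

TNR <- TN / (FP + TN)   # true negative rate, about 0.904
FNR <- FN / (TP + FN)   # false negative rate, 0.2

prior_odds_no_disease     <- (FP + TN) / (TP + FN)            # 990 / 10 = 99
posterior_odds_no_disease <- prior_odds_no_disease * TNR / FNR
posterior_odds_no_disease / (1 + posterior_odds_no_disease)   # about 0.998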
Here is how the odds of negative outcomes look for each type of cancer of unknown primary
plot.BayesFactors.neg(CUP, show.arrows = TRUE, arrow.length = 2.5, arrow.size = 1, legend.pos = c(0,0))
Note that because there are 17 classes in this scenario, the prior probability of an example not being from a specific class is high.
Here are the Bayes factor plots for the Eddy data, first, for a negative outcome…
plot.BayesFactors.neg(Eddy, show.arrows = TRUE, legend.pos = c(0,0))
…then for the positive outcome.
plot.BayesFactors.pos(Eddy, show.arrows = TRUE, legend.pos = c(0,0))
The diagnostic odds ratio (DOR) (Glas et al. 2003) combines the Bayes factors (likelihood ratios) for the positive result (LR+) and the negative result (LR-) of a binary test: \[ \begin{align} \mathrm{DOR} &= \frac{\mathrm{LR+}}{\mathrm{LR-}}\\ &= \frac{\text{True positive rate}}{\text{False positive rate}}\cdot\frac{\text{True negative rate}}{\text{False negative rate}} \end{align} \]
so taking logs we can write \[ \log(\mathrm{DOR}) = \log(\mathrm{LR+}) + \log(1/\mathrm{LR-}) \]
which we can visualise as
plot.DOR(CUP, show.arrows = TRUE, arrow.length = 2.5, arrow.size = 1, legend.pos = c(1,0))
Here the arrows emphasise that \(\log(1/\mathrm{LR-})=-\log(\mathrm{LR-})\) is added to \(\log(\mathrm{LR+})\) to get \(\log(\mathrm{DOR})\).
We can sort the classes by LR+ or LR-:
plot.DOR(CUP, show.arrows = TRUE, arrow.length = 2.5, arrow.size = 1, legend.pos = c(1,0), sort.by = "LR+")
plot.DOR(CUP, show.arrows = TRUE, arrow.length = 2.5, arrow.size = 1, legend.pos = c(1,0), sort.by = "LR-")
For binary classifiers, the diagnostic odds ratio is the same for both the positive and the negative outcome, emphasising that the confusion matrix has four numbers and three degrees of freedom: neither LR+ nor LR- completely summarises the confusion matrix; only the DOR does that (by, somewhat arbitrarily, dividing LR+ by LR-).
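We can check this with a small base-R sketch (illustrative only): the DOR computed from the Eddy matrix is unchanged when the roles of the two classes are swapped.

dor <- function(TP, FP, FN, TN) {
  LR_pos <- (TP / (TP + FN)) / (FP / (FP + TN))   # likelihood ratio of a positive outcome
  LR_neg <- (FN / (TP + FN)) / (TN / (FP + TN))   # likelihood ratio of a negative outcome
  LR_pos / LR_neg
}

dor(TP = 8,   FP = 95, FN = 2,  TN = 895)   # "cancer" as the positive class
dor(TP = 895, FP = 2,  FN = 95, TN = 8)     # "benign" as the positive class: same DOR (about 37.7)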
HASY.train and HASY.test are 369 \(\times\) 369 confusion matrices derived from a classification system evaluated on training and testing data sets provided by Martin Thoma (2017). These data are about HAndwritten SYmbols, including “the Latin uppercase and lowercase characters (A-Z, a-z), the Arabic numerals (0-9), 32 different types of arrows, fractal and calligraphic Latin characters, brackets and more”, collected from Detexify and write-math.com.
Here is a section of the HASY.train confusion matrix. Note the confusion between \\phi (\(\phi\)) and \\Phi (\(\Phi\)):
actual
predicted \\nu \\xi \\Xi \\Pi \\rho \\varrho \\tau \\phi \\Phi \\varphi
\\nu 344 0 0 0 0 0 0 0 0 0
\\xi 0 2309 0 0 0 0 0 0 0 0
\\Xi 0 0 350 0 0 0 0 0 0 0
\\Pi 0 0 0 451 0 0 0 0 0 0
\\rho 0 0 0 0 622 1 0 0 0 0
\\varrho 0 0 0 0 0 198 0 0 0 0
\\tau 0 0 0 0 0 0 369 0 0 0
\\phi 0 0 0 0 0 0 0 561 13 2
\\Phi 0 0 0 0 0 0 0 25 532 0
\\varphi 0 0 0 0 0 0 0 2 0 1366
Viewing, and making sense of, large confusion matrices in matrix form is a challenge. Typical approaches represent the confusion matrix as a heatmap or image like this:
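A heatmap of this kind might be produced, for example, with ggplot2 (a sketch only, not necessarily how the figure in this vignette was drawn, and assuming HASY.train is a matrix of counts with predicted classes as rows and actual classes as columns):

library(ggplot2)

# Long format: one row per (predicted, actual) cell of the confusion matrix
HASY.train.long <- as.data.frame(as.table(HASY.train))
names(HASY.train.long) <- c("predicted", "actual", "count")

ggplot(HASY.train.long, aes(x = actual, y = predicted, fill = log1p(count))) +
  geom_tile() +
  scale_fill_viridis_c(name = "log(count + 1)") +   # log scale keeps rare confusions visible
  theme(axis.text = element_blank(), axis.ticks = element_blank())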
…but it is hard to get a sense of how well the classifier is performing on this data.
We can get more insight from plotting the priors, posteriors and likelihood ratios for each class versus all others:
plot.BayesFactors.pos(HASY.train)
We can also compare multiple confusion matrices using this approach. To do this, we calculate data frames (actually tibbles) containing the one-versus-all binary confusion matrix statistics for HASY.train and HASY.test
OVA(HASY.train, ID="training") -> HASY.train.OVA
OVA(HASY.test, ID="testing") -> HASY.test.OVA
then bind their rows together, and sort the class factor by the \(\mathrm{LR+}\) observed on the training data:
rbind(HASY.train.OVA, HASY.test.OVA) %>%
  select(ID, class, TP, FP, FN, TN, prior.O, PLR, post.O) %>%
  # Sort the `class` factor by the `PLR` values of the `train` data
  # See https://stackoverflow.com/questions/54430898/
  group_by(ID) %>%
  mutate(
    class = fct_reorder(
      class,
      filter(., ID == "training") %>% pull(PLR)
    )
  ) %>%
  ungroup() -> HASY.both.OVA
so we can produce the following bespoke plot
PLR.breaks <- 10^(0:5)  # Choose some nice axis breaks

ggplot(HASY.both.OVA, aes(x = PLR, y = class)) +
  geom_vline(xintercept = 1, lty = 2) +
  geom_point(aes(colour = ID)) +
  scale_color_discrete(name = "confusion matrix") +
  scale_x_log10(limits = range(PLR.breaks), breaks = PLR.breaks,
                labels = as.character(PLR.breaks), name = "LR+") +
  theme(legend.justification = c(0, 1), legend.position = c(0.1, 1),
        axis.ticks.y = element_blank())
Warning: Transformation introduced infinite values in continuous x-axis
Warning: Removed 54 rows containing missing values (geom_point).
This clearly shows that the classifier performs better on the training data than the testing data, as we would expect. Furthermore, by plotting the \(\mathrm{LR+}\) values, we see the intrinsic ability of the classifier to discriminate classes; we have removed the prior abundance of classes from this comparison.
Note that in the previous example, warnings were issued in the plotting process, reminding us that there are some classes in the confusion matrix that warrant further inspection. We can flag the rows of the data that have 0, +Inf and NaN (not a number) values of \(\mathrm{LR+}\) and tabulate them:
HASY.both.OVA %>%
  mutate(
    `LR+` = factor(
      case_when(
        PLR == 0    ~ "zero",
        PLR == +Inf ~ "+Inf",
        is.nan(PLR) ~ "NaN.",  # Note the ".": without it, table() excludes this level
        TRUE        ~ "+ve"
      ),
      levels = c("zero", "+ve", "+Inf", "NaN.")
    )
  ) %$% table(ID, `LR+`)
LR+
ID zero +ve +Inf NaN.
training 0 247 122 0
testing 12 320 34 3
\(\mathrm{LR+} = 0\) in classes that had no true positive but some false positive predictions from the classifier:
HASY.both.OVA %>% filter(PLR==0)
# A tibble: 12 x 9
ID class TP FP FN TN prior.O PLR post.O
<fct> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 testing "c" 0 3 7 16982 0.000412 0 0
2 testing "g" 0 1 6 16985 0.000353 0 0
3 testing "o" 0 2 7 16983 0.000412 0 0
4 testing "s" 0 1 6 16985 0.000353 0 0
5 testing "v" 0 1 7 16984 0.000412 0 0
6 testing "x" 0 2 7 16983 0.000412 0 0
7 testing "z" 0 3 6 16983 0.000353 0 0
8 testing "\\amalg" 0 6 19 16967 0.00112 0 0
9 testing "\\with" 0 2 21 16969 0.00124 0 0
10 testing "\\shortrightarrow" 0 3 28 16961 0.00165 0 0
11 testing "\\triangledown" 0 1 20 16971 0.00118 0 0
12 testing "\\fullmoon" 0 5 15 16972 0.000884 0 0
\(\mathrm{LR+} = \infty\) in classes that had some true positive but no false positive predictions from the classifier:
HASY.both.OVA %>% filter(PLR==+Inf)
# A tibble: 156 x 9
ID class TP FP FN TN prior.O PLR post.O
<fct> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 training A 142 0 1 151098 0.000946 Inf Inf
2 training B 54 0 0 151187 0.000357 Inf Inf
3 training E 47 0 1 151193 0.000317 Inf Inf
4 training G 106 0 0 151135 0.000701 Inf Inf
5 training H 57 0 0 151184 0.000377 Inf Inf
6 training J 93 0 0 151148 0.000615 Inf Inf
7 training Q 60 0 0 151181 0.000397 Inf Inf
8 training U 31 0 22 151188 0.000351 Inf Inf
9 training W 55 0 0 151186 0.000364 Inf Inf
10 training 2 111 0 0 151130 0.000734 Inf Inf
# ... with 146 more rows
\(\mathrm{LR+} = \mathrm{NaN}\) in classes that had no true positive and no false positive predictions from the classifier:
HASY.both.OVA %>% filter(is.nan(PLR))
# A tibble: 3 x 9
ID class TP FP FN TN prior.O PLR post.O
<fct> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 testing "l" 0 0 6 16986 0.000353 NaN NaN
2 testing "\\mathbb{R}" 0 0 6 16986 0.000353 NaN NaN
3 testing "\\vartriangle" 0 0 12 16980 0.000707 NaN NaN