About this document

This is a draft vignette for the confusR package. It serves as an introduction to the package's approach to assessing and visualising the performance of multinomial classifiers.

Motivation: understanding multinomial classifier performance

Binary models have well-established measures of performance (e.g., precision, accuracy). With multinomial classification methods, it is not so simple.
What measures exist for assessing and describing performance of multinomial classification models, and how can we express those measures in a way that is interpretable by non-data-scientists?
— Brendan Langfield

As organisations seek to develop and deploy more complex classification systems, there is a growing need for understanding and transparency in model development, as well as a requirement to explain how the models are operating. Similar to assessment in educational settings, we can think about performance assessment of classification systems with two ends in mind:

  1. Summative, in which we simply wish to have a measure of performance that we can use to compare and rank different classifiers, e.g., in a “performance shoot-out”
  2. Formative, in which we wish to gain insight into the strengths and limitations of a classification system so we can improve its performance.

We are interested in the latter.

In particular, we want to separate the test set performance of the classifier into two parts:

  1. the effect of the abundance of different classes in the test set, i.e., the prior prevalence of classes
  2. the effect of the classifier, i.e., its innate ability to discriminate different classes

Approach: use Bayes rule with odds

Grant Sanderson suggests that we can better understand the discriminative performance of a model by expressing Bayes’ rule in terms of prior odds and Bayes factors (3Blue1Brown 2020).

Suppose we have a hypothesis \(\mathrm{D}\) that a person actually has a disease, and some evidence \(\mathrm{T}\) about that, in the form of a positive test result for that disease. Often we want to know “if I have a positive test result, what’s the chance that I actually have the disease?” Usually this is written symbolically in terms of probabilities as: \[ \mathrm{P(D|T)} = \frac{\mathrm{P(T|D)P(D)}}{\mathrm{P(T|D)P(D)}+\mathrm{P(T|\overline{D})P(\overline{D})}} \] …but Sanderson extols the merits of writing this using odds: \[ \begin{align} \mathrm{O(D|T)} &= \mathrm{O(D)} \frac{\mathrm{P(T|D)}}{\mathrm{P(T|\overline{D})}}\\ &= \mathrm{O(D)} \frac{\text{True positive rate}}{\text{False positive rate}} \end{align} \] where the ratio is the Bayes factor of the test for a positive result. This factor represents how our prior odds of having the disease are updated as a result of the test outcome. It is also known as the likelihood ratio of a positive outcome (or LR+ for short).
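To see where the odds form comes from, divide Bayes’ rule for \(\mathrm{D}\) by the same rule applied to \(\mathrm{\overline{D}}\); the shared denominator \(\mathrm{P(T)}\) cancels, leaving

\[ \frac{\mathrm{P(D|T)}}{\mathrm{P(\overline{D}|T)}} = \frac{\mathrm{P(T|D)}\,\mathrm{P(D)}}{\mathrm{P(T|\overline{D})}\,\mathrm{P(\overline{D})}} \quad\Longleftrightarrow\quad \mathrm{O(D|T)} = \mathrm{O(D)}\,\frac{\mathrm{P(T|D)}}{\mathrm{P(T|\overline{D})}}. \]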

In the language of statistics, the posterior odds of the hypothesis given the evidence equals the prior odds multiplied by the Bayes factor; or, more succinctly:

\[ \begin{align} \text{posterior odds} &= \text{prior odds} \times \text{Bayes factor for a positive outcome}\\[2ex] &= \text{prior odds} \times \frac{\text{True positive rate}}{\text{False positive rate}} \end{align} \]

If we visualise these terms on a logarithmic scale, we can exploit the fact that

\[ \begin{align} \log(\text{posterior odds}) &= \log(\text{prior odds}) + \log(\text{Bayes factor for a positive outcome}) \end{align} \]

…let’s see how, starting with a well-known example.
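Before turning to that example, here is a minimal plain-R sketch of the probability/odds conversions and the additive log-odds update that the plots in this vignette rely on. The helper functions prob_to_odds() and odds_to_prob() are illustrative only, not confusR functions:

prob_to_odds <- function(p) p / (1 - p)   # illustrative helpers, not confusR functions
odds_to_prob <- function(o) o / (1 + o)

prior.prob   <- 0.10   # e.g., a 10% prior probability
bayes.factor <- 6      # e.g., the LR+ of some test
post.odds    <- prob_to_odds(prior.prob) * bayes.factor
odds_to_prob(post.odds)                               # posterior probability: 0.4

# On a log scale, the Bayes factor is simply added to the prior odds:
log10(post.odds)                                      # -0.176
log10(prob_to_odds(prior.prob)) + log10(bayes.factor) # the same value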

Example: Visualising binary confusion matrices

David M. Eddy (1982) gives an example of a diagnostic test for breast cancer where the test had a true positive rate of 79.2% and a false positive rate of 9.6%, and the prior probability of breast cancer was assumed to be 1%.

Eddy asked physicians to estimate the probability of actually having cancer given a prediction of cancer from the test.

If you have not seen this example, I recommend you watch Luana Micallef’s (2012) telling of it right now before I spoil the surprising answer.

Here’s the confusion matrix we would expect if we applied this test to 1000 patients:

Eddy
         actual
predicted cancer benign
   cancer      8     95
   benign      2    895

and here is how we refer to the cells in a binary confusion matrix, i.e., true positive (TP), false positive (FP), false negative (FN), true negative (TN):

          actual
predicted  positive negative
  positive TP       FP      
  negative FN       TN      

So, with this confusion matrix, the probability of actually having cancer given a prediction of cancer by the test is \[ \begin{align} \mathrm{P}(\text{actually having cancer }|\text{ test predicts cancer}) &= \frac{\text{TP}}{\text{TP}+\text{FP}}\\ &= \frac{8}{8+95}\\[1.5ex] &= 7.77\% \end{align} \]

The surprise (for most people) is that a test with a true positive rate of 79.2% and false positive rate of 9.6% has such a low probability of actually being right. The reason for this counter-intuitive outcome is that the prior probability of cancer is rare (1%), so when the test is positive, it is usually a false positive.

Let’s change the prior probability of cancer and see how the same test performs. Here is the confusion matrix of the same diagnostic test applied to 1000 patients who have a 50% prior probability of breast cancer:

Eddy.equal
         actual
predicted cancer benign
   cancer    396     48
   benign    104    452

With this set of patients, the probability of actually having cancer given a prediction of cancer by the test is \[ \begin{align} \mathrm{P}(\text{actually having cancer }|\text{ test predicts cancer}) &= \frac{\text{TP}}{\text{TP}+\text{FP}}\\ &= \frac{396}{396+48}\\[1.5ex] &= 89.19\% \end{align} \]

Same test, but the new confusion matrix shows the probability the test’s prediction of cancer is correct has increased dramatically.

For many people, this is confusing. How can the same test have such a different probability of being right in these two scenarios? To make sense of this, we suggest using the odds formulation of Bayes rule, i.e.,

\[ \begin{align} \text{posterior odds of cancer} &= \text{prior odds of cancer} \times \text{Bayes factor of test for cancer}\\[2ex] &= \text{prior odds of cancer} \times \frac{\text{True positive rate}}{\text{False positive rate}} \end{align} \]

to split apart

  1. the effect of the abundance of different classes in the test set, i.e., the prior prevalence of cancer
  2. the effect of the classifier, i.e., its innate ability to discriminate cancer from benign cases.
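To see that split numerically for the two scenarios above, here is a plain-R sketch (ordinary arithmetic with the test’s quoted rates, not a confusR function):

TPR <- 0.792; FPR <- 0.096          # the Eddy test's true and false positive rates
LR.pos <- TPR / FPR                 # Bayes factor for a positive result: 8.25

# 1% prior probability of cancer
prior.odds <- 0.01 / 0.99
post.odds  <- prior.odds * LR.pos
post.odds / (1 + post.odds)         # ~0.077, i.e., ~7.7%

# 50% prior probability of cancer
prior.odds <- 0.50 / 0.50
post.odds  <- prior.odds * LR.pos
post.odds / (1 + post.odds)         # ~0.892, i.e., ~89.2%

(The small difference from the 7.77% computed earlier arises because the confusion matrix counts were rounded to whole patients.)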

Here is a plot which shows that split

plot.BayesFactors.pos(Eddy, show.arrows=TRUE, legend.pos=c(0,0))

In the above plot (which is on a log scale), the arrows show how the prior odds of cancer are updated by the classifier’s Bayes factor (LR+) to give the posterior odds.

In other words,

  1. the left panel shows the prior odds of cancer and the effect that the classifier has on updating those odds
  2. the right panel shows the effect of classifier alone.

Now let’s use this same plotting technique on the confusion matrix of the same diagnostic test applied to 1000 patients who have a 50% prior probability of breast cancer:

Eddy.equal
         actual
predicted cancer benign
   cancer    396     48
   benign    104    452
plot.BayesFactors.pos(Eddy.equal, show.arrows=TRUE, legend.pos=c(0,0), log.range=7)

This approach separates the characteristics of the classifier from the prior distribution of the classes it is applied to. Now let’s look at how this approach can be applied to multinomial classification systems.

Approach: use one-versus-all to partition multinomial confusion matrices

We’ve just looked at a binary classification system in which a test can make one of two possible predictions, cancer or benign. Multinomial classification systems make predictions of 1-out-of-\(C\) classes, where \(C > 2\).

Here’s an example confusion matrix from a multinomial classifier with 17 classes (the CUP data used in the plots below, from a classifier of cancers of unknown primary):

         actual
predicted Lung Brea Colo Panc Skin Ovar Rena Pros Head Esop Thyr Blad Germ Endo Live Adre Cerv
     Lung  180   11    0    3    6    2    3    4    3    3    2    3    0    0    2    0    3
     Brea   10  194    1    1    3    4    1    0    0    1    2    0    0    1    1    3    3
     Colo    3    4  164    1    1    0    0    0    0    2    0    1    0    1    3    1    0
     Panc   21    6    6  114    1    2    0    1    2    6    0    2    1    0    0    1    1
     Skin    1    4    0    0   90    0    0    1    1    2    0    0    0    0    0    0    0
     Ovar   13    5    0    0    2   92    0    0    0    2    0    0    0    5    0    0    0
     Rena    0    1    0    0    0    0   70    0    0    0    1    0    2    0    0    0    0
     Pros    3    0    0    0    3    0    0   54    0    2    0    0    0    0    0    0    0
     Head    1    0    0    0    1    0    2    1   47    0    0    0    0    0    0    0    0
     Esop    0    3    2    1    1    0    1    1    3   32    0    1    2    0    0    0    0
     Thyr    0    0    0    0    0    0    0    0    0    0   38    0    0    0    0    0    0
     Blad    3    2    0    0    3    1    1    2    0    0    0   35    1    0    0    0    0
     Germ    0    0    0    0    0    0    0    0    0    1    0    0   26    0    0    0    0
     Endo    1    1    0    1    0    1    0    0    0    0    0    0    0   14    7    0    0
     Live    0    0    0    1    0    0    1    0    1    0    0    0    0    0    5    0    0
     Adre    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    7    0
     Cerv    0    0    2    0    0    0    0    0    0    1    0    0    0    0    0    0    4

This summarises the predictions that a classification system made about examples whose actual classes were known. This 17 \(\times\) 17 matrix can be summarised further by forming the 17 binary confusion matrices of each class versus all other classes, e.g.,

         actual
predicted Lung n.Lung
   Lung    180     45
   n.Lung   56   1127
         actual
predicted Brea n.Brea
   Brea    194     31
   n.Brea   37   1146

…through to…

         actual
predicted Cerv n.Cerv
   Cerv      4      3
   n.Cerv    7   1394
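These one-versus-all matrices can be computed from the full confusion matrix with ordinary matrix arithmetic. The package’s OVA() function (used later in this vignette) computes the corresponding statistics for all classes at once; purely for illustration, here is a minimal base-R sketch for a single class, using a hypothetical helper one_vs_all():

# Collapse a C x C confusion matrix (predicted classes in rows, actual classes
# in columns) to the 2 x 2 one-versus-all matrix for class k.
one_vs_all <- function(cm, k) {
  TP <- cm[k, k]
  FP <- sum(cm[k, ]) - TP   # predicted as k, actually some other class
  FN <- sum(cm[, k]) - TP   # actually k, predicted as some other class
  TN <- sum(cm) - TP - FP - FN
  matrix(c(TP, FN, FP, TN), nrow = 2,
         dimnames = list(predicted = c(k, paste0("n.", k)),
                         actual    = c(k, paste0("n.", k))))
}

one_vs_all(CUP, "Lung")   # reproduces the first 2 x 2 table above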

Example: visualising multinomial confusion matrices

Once we have summarised a \(C\times C\) confusion matrix as \(C\) binary one-versus-all confusion matrices, we can visualise the prior, posterior and Bayes factor for each class versus all others like so:

plot.BayesFactors.pos(CUP, show.arrows = TRUE, arrow.length = 2.5, arrow.size = 1)

This can be sorted by prior class probability

plot.BayesFactors.pos(CUP, sort.by="prior")

…or by posterior class probability

plot.BayesFactors.pos(CUP, sort.by="post")

Note the estimated Bayes factors for the classes Thyr and Adre are infinite; the posterior probability of actually being Thyr given a prediction of Thyr is 1 and this value is off the right of the scale in the left panel.

Here’s another example of a different (23 \(\times\) 23) confusion matrix in which we see a class (LIS) that is never predicted correctly:

         actual
predicted ARN BCN BER DEN DOH FAI HKG ICN IEV ILR KUL LCY LIS NYC OFF SAO SCL SDJ SFO SGP TPE TYO ZRH
      ARN  34   0   0   2   0   0   0   0   2   2   0   0   0   0   0   1   0   0   1   1   1   1   0
      BCN   0  37   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
      BER   0   0  24   1   3   2   0   0   0   0   0   4   0   2   0   1   1   0   0   1   1   1   0
      DEN   2   0   0  30   0   2   0   0   0   1   0   1   0   0   0   0   0   0   2   0   0   1   1
      DOH   0   0   1   2  36   2   3   0   0   6   0   1   1   3   0   0   5   1   0   5   2   0   0
      FAI   0   0   4   4   3  22   0   0   0   1   0   3   1   0   0   1   4   1   0   1   0   3   0
      HKG   1   0   0   0   1   0  13   4   7   2   3   1   0   0   0   3   0   0   0   3   2   4   0
      ICN   2   0   0   0   0   0   4  33   0   1   0   2   0   1   0   2   0   0   2   1   6   1   0
      IEV   1   0   0   1   1   0   2   1   7   5   0   0   0   1   0   1   0   0   1   6   4   4   2
      ILR   0   0   0   0   4   4   5   0   5  44   2   1   9   3  15   2   1   1   1   4   4   2   1
      KUL   0   1   0   0   0   0   3   0   3   3  23   0   0   1   0   0   0   0   0   1   1   1   2
      LCY   0   0   2   0   0   0   0   1   1   0   0  15   0   1   0   0   0   0   0   0   0   0   0
      LIS   0   0   1   0   0   1   0   0   0   5   0   0   0   2   0   0   0   0   0   0   0   0   0
      NYC   2   0   5   1   3   9   2   0   2   1   0   3   7  77   1   1   0   0   0   2   2   2   2
      OFF   0   0   0   1   0   0   0   0   0  13   0   0   1   2  10   0   0   0   0   0   0   0   0
      SAO   0   0   0   1   0   0   1   4   2   2   0   0   0   0   0   8   0   0   0   1   0   4   2
      SCL   0   0   1   0   4   1   0   0   0   1   0   2   0   0   0   0  15   0   0   0   0   0   0
      SDJ   0   0   0   1   0   0   3   0   5   2   0   0   0   0   0   1   0  28   0   4   0   0   1
      SFO   1   0   0   0   5   0   1   0   1   2   0   2   0   1   0   0   0   0  19   2   3   2   1
      SGP   1   0   0   0   1   0   2   3   5   1   1   1   0   1   0   2   0   0   0   8   1   7   6
      TPE   2   0   0   1   3   0   3   1   3   1   1   1   0   1   0   1   0   1   1   4  16   2   5
      TYO   4   0   3   0   1   4   2   2   2   2   0   0   0   3   0   5   0   0   2   3   3  39   0
      ZRH   0   0   0   0   0   0   5   1   4   2   0   0   0   0   0   0   0   0   0   1   4   1  10

and in this situation, the posterior probability and Bayes factor for LIS versus all other classes are off the left of the scale.

Example: visualising the likelihood of a negative outcome

So far, we have been working with the odds formulation of Bayes rule for a positive outcome, i.e., the odds of actually having a disease given that the test predicts that you do \(\mathrm{O(D|T)}\) : \[ \begin{align} \mathrm{O(D|T)} &= \mathrm{O(D)} \frac{\mathrm{P(T|D)}}{\mathrm{P(T|\overline{D})}}\\ &= \mathrm{O(D)} \frac{\text{True positive rate}}{\text{False positive rate}} \end{align} \] where the ratio is the Bayes factor of the test for a positive result, i.e., the likelihood ratio of a positive outcome ( LR+ ).

We can also work with the odds formulation for a negative outcome, i.e., the odds that you do not have the disease given that the test predicts that you do not: \(\mathrm{O(\overline{D}|\overline{T})}\).

In that case

\[ \begin{align} \mathrm{O(\overline{D}|\overline{T})} &= \mathrm{O(\overline{D})} \frac{\mathrm{P(\overline{T}|\overline{D})}}{\mathrm{P(\overline{T}|D)}}\\ &= \mathrm{O(\overline{D})} \frac{\text{True negative rate}}{\text{False negative rate}} \end{align} \] where the ratio \[ \frac{\text{True negative rate}}{\text{False negative rate}} = \frac{1}{\text{Bayes factor for a negative outcome}}. \] The Bayes factor of the test for a negative result is also known as the likelihood ratio of a negative outcome (or LR- for short).
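As a quick numerical check, take the Eddy test again (true positive rate 79.2%, false positive rate 9.6%) with a 1% prevalence: the true negative rate is 90.4%, the false negative rate is 20.8%, and so

\[ \begin{align} \mathrm{O(\overline{D}|\overline{T})} &= \frac{0.99}{0.01} \times \frac{0.904}{0.208} \approx 99 \times 4.35 \approx 430\\[1ex] \mathrm{P(\overline{D}|\overline{T})} &\approx \frac{430}{430+1} \approx 99.8\%, \end{align} \]

which agrees with \(\mathrm{TN}/(\mathrm{TN}+\mathrm{FN}) = 895/897 \approx 99.8\%\) from the first Eddy confusion matrix.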

Here is how the odds of negative outcomes look for each type of cancer of unknown primary

plot.BayesFactors.neg(CUP, show.arrows = TRUE, arrow.length = 2.5, arrow.size = 1, legend.pos = c(0,0))

Note that because there are 17 classes in this scenario, the prior probability of an example not being from a specific class is high.

Here are the Bayes factor plots for the Eddy data, first, for a negative outcome…

plot.BayesFactors.neg(Eddy, show.arrows = TRUE, legend.pos = c(0,0))

…then for the positive outcome.

plot.BayesFactors.pos(Eddy, show.arrows = TRUE, legend.pos = c(0,0))

Example: visualising the diagnostic odds ratio

The diagnostic odds ratio (DOR) (Glas et al. 2003) combines the Bayes factors (likelihood ratios) for the positive result (LR+) and the negative result (LR-) of a binary test: \[ \begin{align} \mathrm{DOR} &= \frac{\mathrm{LR+}}{\mathrm{LR-}}\\ &= \frac{\text{True positive rate}}{\text{False positive rate}}\cdot\frac{\text{True negative rate}}{\text{False negative rate}} \end{align} \]

so taking logs we can write \[ \log(\mathrm{DOR}) = \log(\mathrm{LR+}) + \log(1/\mathrm{LR-}) \]

which we can visualise as

plot.DOR(CUP, show.arrows = TRUE, arrow.length = 2.5, arrow.size = 1, legend.pos = c(1,0))

Here the arrows emphasise that \(\log(1/\mathrm{LR-})=-\log(\mathrm{LR-})\) is added to \(\log(\mathrm{LR+})\) to get \(\log(\mathrm{DOR})\).
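As a worked numerical check, for the Eddy test

\[ \mathrm{LR+} = \frac{0.792}{0.096} = 8.25, \qquad \mathrm{LR-} = \frac{0.208}{0.904} \approx 0.230, \qquad \mathrm{DOR} = \frac{8.25}{0.230} \approx 35.9, \]

so on the \(\log_{10}\) scale \(\log(\mathrm{DOR}) \approx 0.92 + 0.64 \approx 1.55\).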

We can sort the classes by LR+ or LR-:

plot.DOR(CUP, show.arrows = TRUE, arrow.length = 2.5, arrow.size = 1, legend.pos = c(1,0), sort.by = "LR+")

plot.DOR(CUP, show.arrows = TRUE, arrow.length = 2.5, arrow.size = 1, legend.pos = c(1,0), sort.by = "LR-")

For binary classifiers, the diagnostic odds ratio is the same for both outcomes, emphasising that the confusion matrix has four numbers and three degrees of freedom: neither LR+ nor LR- completely summarises the confusion matrix. Only DOR does that (by arbitrarily dividing LR+ by LR-).

Example: larger confusion matrices

HASY.train and HASY.test are 369 \(\times\) 369 confusion matrices derived from a classification system evaluated on training and testing data sets provided by Martin Thoma (2017). These data are about HAndwritten SYmbols including “the Latin uppercase and lowercase characters (A-Z, a-z), the Arabic numerals (0-9), 32 different types of arrows, fractal and calligraphic Latin characters, brackets and more”, collected from Detexify and write-math.com.

Here is a section of the HASY.train confusion matrix. Note the confusion between \\phi (\(\phi\)) and \\Phi (\(\Phi\)):

          actual
predicted  \\nu \\xi \\Xi \\Pi \\rho \\varrho \\tau \\phi \\Phi \\varphi
  \\nu      344    0    0    0     0        0     0     0     0        0
  \\xi        0 2309    0    0     0        0     0     0     0        0
  \\Xi        0    0  350    0     0        0     0     0     0        0
  \\Pi        0    0    0  451     0        0     0     0     0        0
  \\rho       0    0    0    0   622        1     0     0     0        0
  \\varrho    0    0    0    0     0      198     0     0     0        0
  \\tau       0    0    0    0     0        0   369     0     0        0
  \\phi       0    0    0    0     0        0     0   561    13        2
  \\Phi       0    0    0    0     0        0     0    25   532        0
  \\varphi    0    0    0    0     0        0     0     2     0     1366

Viewing and making sense of large confusion matrices in matrix form is a challenge. Typical approaches represent the confusion matrix as a heatmap or image like this:
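Here is a minimal ggplot2 sketch that produces such a heatmap (illustration only, not a confusR function; it assumes HASY.train is a matrix or table with predicted classes in rows and actual classes in columns, as in the printouts above):

library(ggplot2)

cm <- as.data.frame(as.table(as.matrix(HASY.train)), responseName = "count")
names(cm)[1:2] <- c("predicted", "actual")     # label the two class dimensions

ggplot(cm, aes(x = actual, y = predicted, fill = count)) +
  geom_tile() +
  scale_fill_continuous(trans = "log1p") +     # compress the scale so rare confusions remain visible
  theme(axis.text = element_blank(), axis.ticks = element_blank())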

…but it is hard to get a sense of how well the classifier is performing on this data.

We can get more insight from plotting the priors, posteriors and likelihood ratios for each class versus all others:

plot.BayesFactors.pos(HASY.train)

We can also compare multiple confusion matrices using this approach. To do this, we calculate data frames (actually tibbles) containing the one-versus-all binary confusion matrix statistics for HASY.train and HASY.test

OVA(HASY.train, ID="training") -> HASY.train.OVA
OVA(HASY.test,  ID="testing")  -> HASY.test.OVA

then bind their rows together, and sort the class factor by the \(\mathrm{LR+}\) observed on the training data:

rbind(HASY.train.OVA, HASY.test.OVA) %>% 
  select(ID, class, TP, FP, FN, TN, prior.O, PLR, post.O) %>%
  # Sort the `class` factor by the `PLR` values of the `train` data
  # See https://stackoverflow.com/questions/54430898/
  group_by(ID) %>%
  mutate(
    class = fct_reorder(
      class, 
      filter(., ID=="training") %>% pull(PLR)
      )
    ) %>%
  ungroup() -> HASY.both.OVA

so we can produce the following bespoke plot

PLR.breaks <- 10^(0:5) # Choose some nice axis breaks

ggplot(HASY.both.OVA, aes(x=PLR, y=class)) +
  geom_vline(xintercept = 1, lty=2) +
  geom_point(aes(colour=ID)) + 
  scale_color_discrete(name="confusion matrix") +
  scale_x_log10(limits=range(PLR.breaks), breaks=PLR.breaks, label=as.character(PLR.breaks), name="LR+") +
  theme(legend.justification=c(0,1), legend.position = c(0.1,1), axis.ticks.y = element_blank())
Warning: Transformation introduced infinite values in continuous x-axis
Warning: Removed 54 rows containing missing values (geom_point).

This clearly shows that the classifier performs better on the training data than the testing data, as we would expect. Furthermore, by plotting the \(\mathrm{LR+}\) values, we see the intrinsic ability of the classifier to discriminate classes; we have removed the prior abundance of classes from this comparison.

Zero, infinity and NaN values of likelihood ratios

Note that in the previous example, warnings were issued in the plotting process, reminding us that there are some classes in the confusion matrix that warrant further inspection. We can flag the rows of the data that have 0, +Inf and NaN (not a number) values of \(\mathrm{LR+}\) and tabulate them:

HASY.both.OVA %>% 
  mutate(
    `LR+`=factor(
      case_when(
      PLR==0      ~ "zero",
      PLR==+Inf   ~ "+Inf",
      is.nan(PLR) ~ "NaN.",           # Note the ".". Without it, table() exclude these values
      TRUE ~ "+ve"),
      levels=c("zero", "+ve", "+Inf", "NaN")
    )
  )  %$% table(ID, `LR+`)
          LR+
ID         zero +ve +Inf NaN
  training    0 247  122   0
  testing    12 320   34   0
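These special values follow directly from the definition \(\mathrm{LR+} = \dfrac{\mathrm{TP}/(\mathrm{TP}+\mathrm{FN})}{\mathrm{FP}/(\mathrm{FP}+\mathrm{TN})}\): a zero numerator (no true positives) gives 0, a zero denominator (no false positives) gives \(+\infty\), and a zero numerator and denominator together give \(0/0 = \mathrm{NaN}\).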

\(\mathrm{LR+} = 0\) occurs in classes that had no true positive but some false positive predictions from the classifier:

HASY.both.OVA %>% filter(PLR==0)
# A tibble: 12 x 9
   ID      class                  TP    FP    FN    TN  prior.O   PLR post.O
   <fct>   <fct>               <int> <dbl> <dbl> <dbl>    <dbl> <dbl>  <dbl>
 1 testing "c"                     0     3     7 16982 0.000412     0      0
 2 testing "g"                     0     1     6 16985 0.000353     0      0
 3 testing "o"                     0     2     7 16983 0.000412     0      0
 4 testing "s"                     0     1     6 16985 0.000353     0      0
 5 testing "v"                     0     1     7 16984 0.000412     0      0
 6 testing "x"                     0     2     7 16983 0.000412     0      0
 7 testing "z"                     0     3     6 16983 0.000353     0      0
 8 testing "\\amalg"               0     6    19 16967 0.00112      0      0
 9 testing "\\with"                0     2    21 16969 0.00124      0      0
10 testing "\\shortrightarrow"     0     3    28 16961 0.00165      0      0
11 testing "\\triangledown"        0     1    20 16971 0.00118      0      0
12 testing "\\fullmoon"            0     5    15 16972 0.000884     0      0

\(\mathrm{LR+} = \infty\) occurs in classes that had some true positive but no false positive predictions from the classifier:

HASY.both.OVA %>% filter(PLR==+Inf)
# A tibble: 156 x 9
   ID       class    TP    FP    FN     TN  prior.O   PLR post.O
   <fct>    <fct> <int> <dbl> <dbl>  <dbl>    <dbl> <dbl>  <dbl>
 1 training A       142     0     1 151098 0.000946   Inf    Inf
 2 training B        54     0     0 151187 0.000357   Inf    Inf
 3 training E        47     0     1 151193 0.000317   Inf    Inf
 4 training G       106     0     0 151135 0.000701   Inf    Inf
 5 training H        57     0     0 151184 0.000377   Inf    Inf
 6 training J        93     0     0 151148 0.000615   Inf    Inf
 7 training Q        60     0     0 151181 0.000397   Inf    Inf
 8 training U        31     0    22 151188 0.000351   Inf    Inf
 9 training W        55     0     0 151186 0.000364   Inf    Inf
10 training 2       111     0     0 151130 0.000734   Inf    Inf
# ... with 146 more rows

\(\mathrm{LR+} = \mathrm{NaN}\) occurs in classes that had no true positive and no false positive predictions from the classifier:

HASY.both.OVA %>% filter(is.nan(PLR))
# A tibble: 3 x 9
  ID      class              TP    FP    FN    TN  prior.O   PLR post.O
  <fct>   <fct>           <int> <dbl> <dbl> <dbl>    <dbl> <dbl>  <dbl>
1 testing "l"                 0     0     6 16986 0.000353   NaN    NaN
2 testing "\\mathbb{R}"       0     0     6 16986 0.000353   NaN    NaN
3 testing "\\vartriangle"     0     0    12 16980 0.000707   NaN    NaN

Bibliography

Eddy, David M. 1982. “Probabilistic Reasoning in Clinical Medicine: Problems and Opportunities.” In Judgment Under Uncertainty: Heuristics and Biases, edited by Amos Tversky, Daniel Kahneman, and Paul Slovic, 249–67. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511809477.019.
Glas, Afina S., Jeroen G. Lijmer, Martin H. Prins, Gouke J. Bonsel, and Patrick M. M. Bossuyt. 2003. “The Diagnostic Odds Ratio: A Single Indicator of Test Performance.” Journal of Clinical Epidemiology 56 (11): 1129–35. https://doi.org/10.1016/S0895-4356(03)00177-X.
Micallef, Luana. 2012. Explaining Bayesian Problems Using Visualizations. https://www.youtube.com/watch?v=D8VZqxcu0I0.
Thoma, Martin. 2017. “The HASYv2 Dataset.” arXiv:1701.08380 [Cs], January. http://arxiv.org/abs/1701.08380.