Motivation: understanding multinomial classifier performance

Motivating question from Brendan:

Binary models have well-established measures of performance, such as precision and accuracy. When using multi-class classification methods, it is not this simple. What measures exist for assessing and describing the performance of multi-class classification models, and how can we express those measures, both visually and semantically, in a way that is interpretable by our (non-data scientist) business partners?

As organisations seek to develop and deploy more complex classification systems, there is a growing need for understanding and transparency in model development, as well as a requirement to explain how the models are operating.

Similar to assessment in educational settings, we can think about performance assessment of classification systems with two ends in mind:

  1. Summative, in which we simply wish to have a measure of performance that we can use to compare and rank different classifiers, e.g., in a “performance shoot-out”.
  2. Formative, in which we wish to gain insight into the strengths and limitations of a classification system so we can improve its performance.

We’ve decided to begin by breaking this high-level challenge into its constituent problems.

Some of the challenges in understanding multinomial classifier performance

Here are the main challenges we see at this stage:

Classifiers that do not provide scores or probability estimates

Harrell (2017) emphasises the distinction between prediction and classification:

Classification is in effect a decision. Optimum decisions require making full use of available data, developing predictions, and applying a loss/utility/cost function to make a decision that, for example, minimizes expected loss or maximizes expected utility. Different end users have different utility functions. In risk assessment this leads to their having different risk thresholds for action. Classification assumes that every user has the same utility function and that the utility function implied by the classification system is that utility function.

Some prediction systems produce estimates of class-probabilities \(\hat{p}(A|x)\), \(\hat{p}(B|x), \dots\) that an input \(x\) belongs to class \(A, B, \dots\). Others may simply produce “a score \(s = s(x)\), an unspecified monotonic increasing transformation of an estimate \(\hat{p}(A|x)\)” (Hand 2009).
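To make this concrete, here is a minimal sketch (scikit-learn on simulated data; the model choice is illustrative, not Hand’s) showing that a binary classifier’s decision score is a monotone increasing transform of its probability estimate, so both rank cases identically:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Simulated binary problem; data and model are illustrative only
X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression().fit(X, y)

s = model.decision_function(X)    # scores: log-odds, not probabilities
p = model.predict_proba(X)[:, 1]  # probability estimates p-hat(A|x)

# The score is a monotone increasing transform of the probability estimate
# (here, the logit), so both induce exactly the same ranking of cases
assert np.allclose(p, 1 / (1 + np.exp(-s)))
```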

Some classifiers (e.g., support vector machines, decision trees, k-nearest neighbours) do not directly produce class probability estimates. However, there are procedures for transforming (torturing?) their outputs into calibrated probabilities; e.g., the distribution of training-set classes found at the terminal nodes of a decision tree can be regarded as estimates of the class probabilities, and Zadrozny and Elkan (2001) show strategies to improve those estimates.
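For instance, scikit-learn wraps this calibration step in CalibratedClassifierCV. A minimal sketch on simulated data, using sigmoid (Platt) scaling; isotonic regression is the other built-in method:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

# Simulated three-class problem; numbers are arbitrary
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svm = LinearSVC()  # an SVM: produces scores, not probability estimates
calibrated = CalibratedClassifierCV(svm, method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)

probs = calibrated.predict_proba(X_test)  # each row sums to 1 over the classes
print(probs[:5])
```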

Combining the outputs of several class-probability models

There are different ways to achieve multinomial classification from systems of binary models: one-vs-all, one-vs-one, error correcting output codes, etc. If we want to understand the performance of a multinomial classification system of class-probability models, we have to understand how the outputs of these models are combined.

Voting is one approach; however, it only provides a class label and we want probabilities. Hastie and Tibshirani (1998) describe “a strategy for polychotomous classification that involves estimating class probabilities for each pair of classes, and then coupling the estimates together”; see Wu, Lin, and Weng (2004) for an improved approach.
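As a deliberately naive sketch of combining binary class-probability models (simpler than the pairwise coupling above; scikit-learn’s OneVsRestClassifier automates this pattern), here is one-vs-rest combination with simple renormalisation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Simulated three-class problem; details are illustrative only
X, y = make_classification(n_samples=500, n_classes=3, n_informative=6,
                           random_state=1)
classes = np.unique(y)

# One binary class-probability model per class: P(class k | x) vs rest
binary_probs = np.column_stack([
    LogisticRegression(max_iter=1000)
    .fit(X, (y == k).astype(int))
    .predict_proba(X)[:, 1]
    for k in classes
])

# Naive coupling: renormalise so each row is a probability distribution
multiclass_probs = binary_probs / binary_probs.sum(axis=1, keepdims=True)
print(multiclass_probs[:3], multiclass_probs[:3].sum(axis=1))
```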

Decomposing the outputs of a multiclass prediction system

Just as multinomial classification can be achieved by composing the outputs of several class-probability models, we could also decompose the outputs of a multiclass prediction system into various binary models. Here, I am thinking about neural network systems that impose a softmax to ensure that their outputs satisfy the requirements of a probability estimate.
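A small numeric illustration of that decomposition, assuming nothing beyond NumPy: a softmax over \(K\) logits yields a probability vector whose pairwise log-odds are just differences of logits, i.e., the multiclass output implicitly contains one-vs-one binary models:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])  # hypothetical network outputs for A, B, C
p = softmax(logits)
print(p, p.sum())  # a valid probability estimate: non-negative, sums to 1

# Pairwise log-odds recover differences of logits:
# log(p_i / p_j) = z_i - z_j, an implicit binary model for class i vs class j
print(np.log(p[0] / p[1]), logits[0] - logits[1])  # both equal 1.5
```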

Calibrating the probability estimates of prediction systems

See Zadrozny and Elkan (2002).
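One practical check of whether calibration is needed is a reliability diagram: bin the predicted probabilities and compare them with observed class frequencies. A sketch using scikit-learn’s calibration_curve on simulated data, treating one class of a multiclass problem one-vs-rest:

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Simulated three-class problem; numbers are arbitrary
X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6,
                           random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)

# Reliability curve for class 0 vs rest: well-calibrated estimates lie near
# the diagonal (observed frequency equals predicted probability)
frac_pos, mean_pred = calibration_curve((y_te == 0).astype(int), probs[:, 0],
                                        n_bins=10)
for mp, fp in zip(mean_pred, frac_pos):
    print(f"predicted {mp:.2f}  observed {fp:.2f}")
```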

Factoring apart prior probability of classes from model discriminative performance

Grant Sanderson suggests that we can better understand the discriminative performance of a model by expressing Bayes’ rule in terms of prior odds and Bayes factors (3Blue1Brown 2020).

Suppose we have a hypothesis \(\mathrm{H}\) that a person actually has a disease, and some evidence \(\mathrm{E}\) about that in the form of a positive test result for that disease. One key question is “if I have a positive test result, what’s the chance that I actually have the disease?”, written symbolically in probabilities as \[ \mathrm{P(H|E)} = \frac{\mathrm{P(E|H)P(H)}}{\mathrm{P(E|H)P(H)}+\mathrm{P(E|\overline{H})P(\overline{H})}} \] or in terms of odds as \[ \begin{align} \mathrm{O(H|E)} &= \mathrm{O(H)} \frac{\mathrm{P(E|H)}}{\mathrm{P(E|\overline{H})}}\\ &= \mathrm{O(H)} \frac{\text{True positive rate}}{\text{False positive rate}} \end{align} \] where the ratio is the Bayes factor of the test for a positive result. This factor represents how your prior odds of having the disease are updated as a result of the test outcome.

For a negative test result \(\mathrm{\overline{E}}\), the corresponding update is \[ \begin{align} \mathrm{O(H|\overline{E})} &= \mathrm{O(H)} \frac{\mathrm{P(\overline{E}|H)}}{\mathrm{P(\overline{E}|\overline{H})}}\\ &= \mathrm{O(H)} \frac{\text{False negative rate}}{\text{True negative rate}}. \end{align} \]

We still have to estimate the test’s Bayes factor using a set of data whose outcomes are known.
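A minimal sketch of that estimation, with made-up counts: compute the test’s Bayes factors from an evaluation confusion matrix, then update prior odds for a rare condition. (The low posterior also previews the counterintuitive effect noted in the bullets below.)

```python
import numpy as np

# Hypothetical evaluation counts: rows are truth (disease, no disease),
# columns are test results (positive, negative)
confusion = np.array([[  90,   10],   # diseased: 90 TP, 10 FN
                      [ 450, 8450]])  # healthy: 450 FP, 8450 TN

tpr = confusion[0, 0] / confusion[0].sum()  # sensitivity: 0.90
fpr = confusion[1, 0] / confusion[1].sum()  # 1 - specificity: about 0.05
fnr = confusion[0, 1] / confusion[0].sum()
tnr = confusion[1, 1] / confusion[1].sum()

bf_positive = tpr / fpr  # Bayes factor for a positive result, about 17.8
bf_negative = fnr / tnr  # Bayes factor for a negative result, about 0.11

prior_odds = 0.01 / 0.99  # assume 1% prevalence
posterior_odds = prior_odds * bf_positive
posterior_prob = posterior_odds / (1 + posterior_odds)
print(f"P(disease | positive) is about {posterior_prob:.2f}")  # about 0.15
```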

  • Is there a way that we can use this idea to present how a classification system updates our belief about the odds of different classes?
    • Check out Olivetti, Greiner, and Avesani (2014)
  • When an outcome \(\mathrm{H}\) is rare, positive test results tend to be dominated by false positives and \(\mathrm{P(H|E)}\) is (counterintuitively) low. This makes me wonder about the performance of multinomial classifiers where the number of classes is large and, necessarily, the prior probability of each class is low. How do systems, say in natural language processing or image recognition, where the number of classes can be high, achieve acceptable performance? What are the Bayes factors of such systems?

Choosing loss functions

See Buja, Stuetzle, and Shen (2005).
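To ground this, here is a small sketch (plain NumPy, hypothetical predictions) of two proper scoring rules for class-probability estimates of the kind Buja, Stuetzle, and Shen analyse: log loss and the multiclass Brier score.

```python
import numpy as np

def log_loss(y_onehot, p):
    """Mean negative log-likelihood of the true class."""
    return -np.mean(np.sum(y_onehot * np.log(p), axis=1))

def brier_score(y_onehot, p):
    """Mean squared distance between predicted distribution and one-hot truth."""
    return np.mean(np.sum((p - y_onehot) ** 2, axis=1))

# Hypothetical probability estimates for three cases over classes A, B, C
p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])
y = np.eye(3)  # true classes: A, B, C respectively

print(log_loss(y, p))     # about 0.50
print(brier_score(y, p))  # about 0.25
```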

Next steps?

We agreed to try a combination of simulation and visualisation to explore how class-probability estimates can be presented and understood, e.g., by simulating observations \(\mathbf{y}_i\) whose marginal probability is a mixture over classes \(A\), \(B\) and \(C\):

\[ \mathrm{P}(\mathbf{y}_i) = \mathrm{P}(\mathbf{y}_i|A)\,\mathrm{P}(A) + \mathrm{P}(\mathbf{y}_i|B)\,\mathrm{P}(B) + \mathrm{P}(\mathbf{y}_i|C)\,\mathrm{P}(C) \]
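A hedged sketch of that simulation, assuming a one-dimensional Gaussian mixture (made-up priors and means) so that exact class probabilities can be computed from the known generative model via Bayes’ rule:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

priors = np.array([0.5, 0.3, 0.2])  # P(A), P(B), P(C); made-up values
means = np.array([-2.0, 0.0, 2.0])  # class-conditional means, unit variance

classes = rng.choice(3, size=1000, p=priors)
y = rng.normal(means[classes], 1.0)  # y_i | class k ~ N(mean_k, 1)

# Likelihoods P(y_i | k) for each class, shape (n, 3)
lik = norm.pdf(y[:, None], loc=means[None, :], scale=1.0)

# P(y_i) = sum_k P(y_i | k) P(k); posterior P(k | y_i) by Bayes' rule
evidence = lik @ priors
posterior = lik * priors / evidence[:, None]
print(posterior[:3])  # each row sums to 1
```

Plotting these posteriors as a function of \(y\) could be a first step towards the kind of interpretable displays the motivating question asks for.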

References

3Blue1Brown. 2020. The Medical Test Paradox: Can Redesigning Bayes Rule Help? https://www.youtube.com/watch?v=lG4VkPoG3ko.

Buja, Andreas, Werner Stuetzle, and Yi Shen. 2005. “Loss Functions for Binary Class Probability Estimation and Classification: Structure and Applications.” Working paper, University of Pennsylvania. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.184.5203.

Hand, David J. 2009. “Measuring Classifier Performance: A Coherent Alternative to the Area Under the ROC Curve.” Machine Learning 77 (1): 103–23. https://doi.org/10.1007/s10994-009-5119-5.

Harrell, Frank. 2017. “Classification Vs. Prediction.” Statistical Thinking. January 15, 2017. https://www.fharrell.com/post/classification/.

Hastie, Trevor, and Robert Tibshirani. 1998. “Classification by Pairwise Coupling.” Annals of Statistics 26 (2): 451–71. https://doi.org/10.1214/aos/1028144844.

Lovell, David R, Xin-Yi Chua, and Annette McGrath. 2020. “Counts: An Outstanding Challenge for Log-Ratio Analysis of Compositional Data in the Molecular Biosciences.” NAR Genomics and Bioinformatics 2 (2): lqaa040. https://doi.org/10.1093/nargab/lqaa040.

Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. Adaptive Computation and Machine Learning Series. Cambridge, MA: MIT Press.

Olivetti, Emanuele, Susanne Greiner, and Paolo Avesani. 2014. “Statistical Independence for the Evaluation of Classifier-Based Diagnosis.” Brain Informatics 2 (1): 13–19. https://doi.org/10.1007/s40708-014-0007-6.

Wu, Ting-Fan, Chih-Jen Lin, and Ruby C. Weng. 2004. “Probability Estimates for Multi-Class Classification by Pairwise Coupling.” Journal of Machine Learning Research 5 (Aug): 975–1005. https://www.jmlr.org/papers/v5/wu04a.html.

Zadrozny, Bianca, and Charles Elkan. 2001. “Obtaining Calibrated Probability Estimates from Decision Trees and Naive Bayesian Classifiers.” In Proceedings of the Eighteenth International Conference on Machine Learning, 609–16. Morgan Kaufmann.

———. 2002. “Transforming Classifier Scores into Accurate Multiclass Probability Estimates.” In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 694–99. KDD ’02. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/775047.775151.