Motivating question from Brendan:
Binary models have well-established measures of performance, such as precision and accuracy. When using multi-class classification methods, it is not this simple. What measures exist for assessing and describing the performance of multi-class classification models, and how can we express those measures, both visually and semantically, in a way that is interpretable by our (non-data scientist) business partners?
As organisations seek to develop and deploy more complex classification systems, there is a growing need for understanding and transparency in model development, as well as a requirement to explain how the models are operating.
Similar to assessment in educational settings, we can think about performance assessment of classification systems with two ends in mind: formative assessment, to understand and improve a system while it is being developed, and summative assessment, to judge and communicate how well the finished system performs.
We have decided to begin by trying to break down the high-level challenge of understanding multinomial classifier performance into its constituent problems. Here are the main challenges we see at this stage:
Harrell (2017) emphasises the distinction between prediction and classification:
Classification is in effect a decision. Optimum decisions require making full use of available data, developing predictions, and applying a loss/utility/cost function to make a decision that, for example, minimizes expected loss or maximizes expected utility. Different end users have different utility functions. In risk assessment this leads to their having different risk thresholds for action. Classification assumes that every user has the same utility function and that the utility function implied by the classification system is that utility function.
Some prediction systems produce estimates of class-probabilities \(\hat{p}(A|x)\), \(\hat{p}(B|x), \dots\) that an input \(x\) belongs to class \(A, B, \dots\). Others may simply produce “a score \(s = s(x)\), an unspecified monotonic increasing transformation of an estimate \(\hat{p}(A|x)\)” (Hand 2009).
Some classifiers (e.g., support vector machines, decision trees, k-nearest neighbours) do not immediately produce class probability estimates. However, there are procedures for transforming (torturing?) their outputs into calibrated probabilities, e.g., the distribution of training set classes found at the terminal nodes of a decision tree could be regarded as estimates of the class probabilities; Zadrozny and Elkan (2001) show strategies to improve those estimates.
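To make this concrete, here is a minimal sketch, assuming scikit-learn and an entirely illustrative synthetic dataset and tree depth, of comparing a decision tree’s raw leaf-frequency probabilities with calibrated versions of them:

```python
# Sketch: calibrating a decision tree's class-probability estimates.
# The dataset and hyperparameters are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import brier_score_loss

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Raw tree: P(class | x) is just the class frequency in the terminal node
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

# Calibrated tree: isotonic regression re-maps those leaf frequencies
calibrated = CalibratedClassifierCV(
    DecisionTreeClassifier(max_depth=4, random_state=0), method="isotonic", cv=5
).fit(X_train, y_train)

for name, model in [("raw tree", tree), ("calibrated tree", calibrated)]:
    p = model.predict_proba(X_test)[:, 1]
    print(f"{name}: Brier score = {brier_score_loss(y_test, p):.3f}")
```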
There are different ways to achieve multinomial classification from systems of binary models: one-vs-all, one-vs-one, error-correcting output codes, etc. If we want to understand the performance of a multinomial classification system of class-probability models, we have to understand how the outputs of these models are combined.
Voting is one approach; however, voting provides only a class label, and we want probabilities, e.g., “a strategy for polychotomous classification that involves estimating class probabilities for each pair of classes, and then coupling the estimates together” (Hastie and Tibshirani 1998). See Wu, Lin, and Weng (2004) for an improved approach.
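scikit-learn’s meta-estimators give a feel for how these compositions behave. In this sketch (the iris data and logistic regression base model are purely illustrative choices), one-vs-rest produces per-class probability estimates that are simply renormalised, while one-vs-one produces votes that would need a coupling step, such as that of Hastie and Tibshirani (1998) or Wu, Lin, and Weng (2004), to become probabilities:

```python
# Sketch: composing a 3-class classifier from binary models.
# The dataset and base model are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_iris(return_X_y=True)

# One-vs-rest: one binary model per class; each class's binary probability
# is renormalised across classes -- a crude, uncoupled way to get "probabilities"
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(ovr.predict_proba(X[:3]).round(3))

# One-vs-one: one binary model per pair of classes; prediction is by voting,
# so we get labels (or decision values) but no probabilities without a coupling step
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(ovo.predict(X[:3]))
```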
Just as multinomial classification can be achieved by composing the outputs of several class-probability models, we could also decompose the outputs of multiclass prediction systems into various binary models. Here, I am thinking about neural network systems that impose a softmax to ensure that their outputs satisfy the requirements of a probability estimate.
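For reference, the softmax maps a vector of real-valued scores \(\mathbf{z} = (z_1, \dots, z_K)\) over \(K\) classes to non-negative values that sum to one, \[ \hat{p}(k|x) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \] so the outputs have the form of class probabilities whether or not they are well calibrated.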
See Zadrozny and Elkan (2002).
Grant Sanderson suggests that we can better understand the discriminative performance of a model by expressing Bayes’ rule in terms of prior odds and Bayes factors (3Blue1Brown 2020).
Suppose we have a hypothesis \(\mathrm{H}\) that a person actually has a disease, and some evidence \(\mathrm{E}\) about that in the form of a positive test result for that disease. One key question is “if I have a positive test result, what’s the chance that I actually have the disease?”, written symbolically in probabilities as: \[ \mathrm{P(H|E)} = \frac{\mathrm{P(E|H)P(H)}}{\mathrm{P(E|H)P(H)}+\mathrm{P(E|\overline{H})P(\overline{H})}} \] or in terms of odds as \[ \begin{align} \mathrm{O(H|E)} &= \mathrm{O(H)} \frac{\mathrm{P(E|H)}}{\mathrm{P(E|\overline{H})}}\\ &= \mathrm{O(H)} \frac{\text{True positive rate}}{\text{False positive rate}} \end{align} \] where the ratio is the Bayes factor of the test for a positive result. This factor represents how your prior odds of having the disease are updated as a result of the test outcome.
For a negative test result \(\mathrm{\overline{E}}\), the corresponding formulation is \[ \begin{align} \mathrm{O(H|\overline{E})} &= \mathrm{O(H)} \frac{\mathrm{P(\overline{E}|H)}}{\mathrm{P(\overline{E}|\overline{H})}}\\ &= \mathrm{O(H)} \frac{\text{False negative rate}}{\text{True negative rate}}, \end{align} \] where the ratio is the Bayes factor of the test for a negative result.
We still have to estimate the test’s Bayes factor using a set of data whose outcomes are known.
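Here is a minimal sketch of that estimation, assuming a small labelled evaluation set and a prevalence chosen purely for illustration:

```python
# Sketch: estimating a test's Bayes factors from labelled outcomes, then
# applying them to a chosen prior. All numbers are illustrative assumptions.
import numpy as np

has_disease = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])    # known outcomes
test_positive = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])  # test results

tpr = test_positive[has_disease == 1].mean()  # sensitivity (true positive rate)
fpr = test_positive[has_disease == 0].mean()  # 1 - specificity (false positive rate)
bf_positive = tpr / fpr                       # Bayes factor for a positive result
bf_negative = (1 - tpr) / (1 - fpr)           # Bayes factor for a negative result

prior_odds = 0.01 / 0.99                      # e.g., an assumed 1% prevalence
posterior_odds = prior_odds * bf_positive
posterior_prob = posterior_odds / (1 + posterior_odds)
print(f"BF+ = {bf_positive:.2f}, BF- = {bf_negative:.2f}, "
      f"P(disease | positive test) = {posterior_prob:.3f}")
```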
See Buja, Stuetzle, and Shen (2005).
We agreed to try a combination of simulation and visualisation to explore how class-probability estimates can be visualised and understood.
For example, with three classes \(A\), \(B\) and \(C\) whose class-conditional distributions and prior probabilities we choose, the marginal distribution of a point \(\mathbf{y}_i\) is the mixture \[ \mathrm{P}(\mathbf{y}_i) = \mathrm{P}(\mathbf{y}_i|A)\,\mathrm{P}(A) + \mathrm{P}(\mathbf{y}_i|B)\,\mathrm{P}(B) + \mathrm{P}(\mathbf{y}_i|C)\,\mathrm{P}(C). \]
This allows us to manipulate the distributions of class-probability estimates, which could then be displayed in \(\mathbb{R}^2\) by way of an isometric log-ratio projection (see the Supplementary Information of Lovell, Chua, and McGrath (2020)).
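A minimal sketch of that projection for three-class probability vectors, using the standard isometric log-ratio basis for three-part compositions (an assumption; not necessarily the exact basis used in that paper’s Supplementary Information):

```python
# Sketch: projecting 3-class probability vectors (p_A, p_B, p_C) into R^2
# with an isometric log-ratio (ilr) transform. The basis is one standard
# choice for 3-part compositions, assumed here for illustration.
import numpy as np

def ilr_3(p, eps=1e-12):
    """Map rows of 3-part compositions (each summing to 1) into R^2."""
    p = np.clip(np.asarray(p, dtype=float), eps, None)
    z1 = np.sqrt(1 / 2) * np.log(p[:, 0] / p[:, 1])
    z2 = np.sqrt(2 / 3) * np.log(np.sqrt(p[:, 0] * p[:, 1]) / p[:, 2])
    return np.column_stack([z1, z2])

probs = np.array([[0.80, 0.15, 0.05],
                  [0.10, 0.60, 0.30],
                  [1/3,  1/3,  1/3]])
print(ilr_3(probs))  # the equal-probability point maps to the origin
```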
Specifically, we plan to:

- Plot \(\mathrm{P}(\mathbf{y}_i)\) and the optimal decision boundaries between \(\mathrm{P}(\mathbf{y}_i|A)\), \(\mathrm{P}(\mathbf{y}_i|B)\) and \(\mathrm{P}(\mathbf{y}_i|C)\) (see Murphy (2012), Chapter 4: Gaussian models).
- Perform a quadratic discriminant analysis on samples from \(\mathrm{P}(\mathbf{y}_i)\) and compare the decision boundaries (see the sketch after this list).
- Factor apart the prior probability of classes from model discriminative performance using prior odds and Bayes factors.
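Here is a minimal sketch of the first two of those steps, with means, covariances and class priors that are purely illustrative assumptions:

```python
# Sketch: simulate a 3-class Gaussian mixture, compute the Bayes-optimal
# decision rule from the true densities, and compare it with QDA fitted to
# samples. Means, covariances and priors are illustrative assumptions.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(0)
means = [np.array([0.0, 0.0]), np.array([3.0, 0.0]), np.array([1.5, 2.5])]
covs = [np.eye(2), np.diag([2.0, 0.5]), np.array([[1.0, 0.6], [0.6, 1.0]])]
priors = np.array([0.5, 0.3, 0.2])

# Sample labelled data from the mixture P(y_i) = sum_k P(y_i|k) P(k)
n = 3000
labels = rng.choice(3, size=n, p=priors)
X = np.vstack([rng.multivariate_normal(means[k], covs[k]) for k in labels])

def bayes_optimal(points):
    """Optimal rule: argmax_k P(x|k) P(k), using the *true* densities."""
    joint = np.column_stack(
        [multivariate_normal(means[k], covs[k]).pdf(points) * priors[k] for k in range(3)]
    )
    return joint.argmax(axis=1)

# QDA estimates the same quadratic decision boundaries from the samples
qda = QuadraticDiscriminantAnalysis().fit(X, labels)

grid = rng.uniform(-4, 7, size=(5000, 2))  # points at which to compare the two rules
agreement = (bayes_optimal(grid) == qda.predict(grid)).mean()
print(f"QDA agrees with the optimal rule on {agreement:.1%} of test points")
```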
3Blue1Brown. 2020. “The Medical Test Paradox: Can Redesigning Bayes Rule Help?” YouTube video. https://www.youtube.com/watch?v=lG4VkPoG3ko.
Buja, Andreas, Werner Stuetzle, and Yi Shen. 2005. “Loss Functions for Binary Class Probability Estimation and Classification: Structure and Applications.” Manuscript, available at http://www-stat.wharton.upenn.edu/~buja. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.184.5203.
Hand, David J. 2009. “Measuring Classifier Performance: A Coherent Alternative to the Area Under the ROC Curve.” Machine Learning 77 (1): 103–23. https://doi.org/10.1007/s10994-009-5119-5.
Harrell, Frank. 2017. “Classification Vs. Prediction.” Statistical Thinking. January 15, 2017. https://www.fharrell.com/post/classification/.
Hastie, Trevor, and Robert Tibshirani. 1998. “Classification by Pairwise Coupling.” Annals of Statistics 26 (2): 451–71. https://doi.org/10.1214/aos/1028144844.
Lovell, David R, Xin-Yi Chua, and Annette McGrath. 2020. “Counts: An Outstanding Challenge for Log-Ratio Analysis of Compositional Data in the Molecular Biosciences.” NAR Genomics and Bioinformatics 2 (2): lqaa040. https://doi.org/10.1093/nargab/lqaa040.
Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. Adaptive Computation and Machine Learning Series. Cambridge, MA: MIT Press.
Olivetti, Emanuele, Susanne Greiner, and Paolo Avesani. 2014. “Statistical Independence for the Evaluation of Classifier-Based Diagnosis.” Brain Informatics 2 (1): 13–19. https://doi.org/10.1007/s40708-014-0007-6.
Wu, Ting-Fan, Chih-Jen Lin, and Ruby C. Weng. 2004. “Probability Estimates for Multi-Class Classification by Pairwise Coupling.” Journal of Machine Learning Research 5 (Aug): 975–1005. https://www.jmlr.org/papers/v5/wu04a.html?907d3908.
Zadrozny, Bianca, and Charles Elkan. 2001. “Obtaining Calibrated Probability Estimates from Decision Trees and Naive Bayesian Classifiers.” In Proceedings of the Eighteenth International Conference on Machine Learning, 609–16. Morgan Kaufmann.
———. 2002. “Transforming Classifier Scores into Accurate Multiclass Probability Estimates.” In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 694–99. KDD ’02. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/775047.775151.