Motivation: summarizing multinomial classifier performance

Motivating question from Brendan:

Binary models have well-established measures of performance, such as precision and accuracy. With multi-class classification methods, it is not this simple. What measures exist for assessing and describing the performance of multi-class classification models, and how can we express those measures, both visually and semantically, in a way that is interpretable by our (non-data-scientist) business partners?

We’ve decided to begin by

restricting our attention to multinomial classifiers (i.e., classifiers that assign each observation to exactly one of \(C\) classes).
working with simulated confusion matrices, just to get a handle on the problem
assuming that the classifiers work by thresholding a continuous value that relates to the probability of class membership

Simulating confusion matrices

Here’s a function that simulates a confusion matrix for a multinomial classifier where

\(C\) is the number of classes
\(\mathbf{N}\) is a \(C \times C\) confusion matrix whose elements are \(n_{ij}\)
\(n_{ij}\) is the number of observations predicted to be from class \(i\) that are actually from class \(j\)
- The rows of \(\mathbf{N}\) are the predicted classes
- The columns of \(\mathbf{N}\) are the actual classes
\(n_{\cdot j} = \sum_{i} n_{ij}\) is the number of observations actually from class \(j\)
\(n_{i \cdot} = \sum_{j} n_{ij}\) is the number of observations predicted to be from class \(i\)
\(n = \sum_{i,j} n_{ij}\) is the total number of observations that have been classified

The function takes as its arguments

A vector Nj specifying the number of observations in each class \(1, ..., C\)
A \(C \times C\) matrix Li.given.j specifying the likelihood of the classifier predicting that an observation is from class \(i\) when it is actually from class \(j\).
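A minimal sketch of such a function, in Python using only the standard library. It assumes that each column of the likelihood matrix sums to 1, that classes are indexed from 0 rather than 1, and it writes Li.given.j as `Li_given_j` because Python identifiers cannot contain dots. The function name, the `seed` argument and the example numbers are illustrative, not from the meeting.

```python
import random

def simulate_confusion_matrix(Nj, Li_given_j, seed=None):
    """Simulate a C x C confusion matrix (rows: predicted, columns: actual).

    Nj[j] is the number of observations actually from class j.
    Li_given_j[i][j] is the assumed probability of predicting class i
    when the observation is actually from class j.
    """
    rng = random.Random(seed)
    C = len(Nj)
    N = [[0] * C for _ in range(C)]
    for j in range(C):
        # Column j holds the distribution over predicted classes given actual class j.
        weights = [Li_given_j[i][j] for i in range(C)]
        for i in rng.choices(range(C), weights=weights, k=Nj[j]):
            N[i][j] += 1
    return N

# Illustrative inputs (made up for this sketch): three classes, each column sums to 1.
Nj = [50, 30, 20]
Li_given_j = [
    [0.8, 0.1, 0.2],
    [0.1, 0.7, 0.2],
    [0.1, 0.2, 0.6],
]
N = simulate_confusion_matrix(Nj, Li_given_j, seed=1)
col_totals = [sum(N[i][j] for i in range(3)) for j in range(3)]  # actual-class counts
row_totals = [sum(row) for row in N]                             # predicted-class counts
```

By construction the column totals reproduce Nj, while the row totals vary from run to run. The sketch draws predicted classes directly; it does not model any continuous score behind each prediction.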

Hmmm… this is not quite going to work because we are not simulating the underlying continuous values… dang.

And to do that, we have to decide how to implement a multinomial classifier from multiple binary classifiers.

Implementing multinomial classification

See

One versus rest (OvR)

…creates a binary classifier for each of the \(C\) classes. Each of these classifiers learns to detect its specific class and reject all others.

Points to note:

We still have to compare and combine the outputs of all \(C\) classifiers
- Do we combine the thresholded class decisions, or do we somehow use the continuous values prior to thresholding?
Contention between two or more classifiers could arise
- This could be avoided if a sequence of binary classifications is followed
How do we treat situations where no classifier detects an observation?
- Note that these situations are masked by using softmax, because normalized scores always yield a most probable class
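To make these cases concrete, here is a rough Python sketch of one possible way to combine the \(C\) one-versus-rest outputs for a single observation. It assumes each binary classifier produces a continuous score and shares a single threshold; the function name `combine_ovr`, the threshold of 0.5 and the status labels are all hypothetical choices for illustration.

```python
def combine_ovr(scores, threshold=0.5):
    """Combine C one-vs-rest scores for one observation.

    Returns (predicted_class, status), where status is
    'ok' (exactly one classifier fires), 'contention' (several fire),
    or 'no_detection' (none fire).
    """
    # Classes whose binary classifier exceeds the threshold.
    detected = [c for c, s in enumerate(scores) if s >= threshold]
    # Fallback based on the continuous values: the highest-scoring class.
    best = max(range(len(scores)), key=lambda c: scores[c])
    if len(detected) == 1:
        return detected[0], "ok"
    if len(detected) > 1:
        return best, "contention"
    return best, "no_detection"

clear_case = combine_ovr([0.9, 0.2, 0.1])      # only class 0 fires
contended_case = combine_ovr([0.7, 0.8, 0.1])  # classes 0 and 1 both fire
missed_case = combine_ovr([0.3, 0.2, 0.4])     # nothing reaches the threshold
```

Returning the highest-scoring class when nothing is detected is only one option; reporting "no class" instead would keep those situations visible in any performance summary.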

One versus one (OvO)

…creates \(C(C-1)/2\) classifiers, one for each pair of classes. The final decision is based on a majority vote.
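A small Python sketch of the majority vote, assuming each pairwise classifier simply reports which of its two classes wins. The dictionary format, the function name `ovo_vote` and the tie-breaking rule (lowest class index) are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def ovo_vote(pairwise_winner, C):
    """Majority vote over the C(C-1)/2 pairwise classifiers.

    pairwise_winner maps each pair (a, b) with a < b to the winning class.
    Returns (predicted_class, votes, tied); ties go to the lowest class index.
    """
    votes = Counter(pairwise_winner[pair] for pair in combinations(range(C), 2))
    top = max(votes.values())
    tied = sorted(c for c, v in votes.items() if v == top)
    return tied[0], dict(votes), len(tied) > 1

# Class 2 wins both of its pairings, so it takes the vote outright.
clear_vote = ovo_vote({(0, 1): 0, (0, 2): 2, (1, 2): 2}, C=3)

# A cyclic outcome (0 beats 1, 1 beats 2, 2 beats 0) gives every class one vote.
cyclic_vote = ovo_vote({(0, 1): 0, (0, 2): 2, (1, 2): 1}, C=3)

n_pairs_for_4_classes = len(list(combinations(range(4), 2)))
```

The cyclic case shows that a majority vote can still tie, so some extra rule is needed to settle it.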

Points to note:

Outputs still have to be combined, but the

Error-correcting output codes (ECOC)

Hierarchical classification (sequence of binary classifiers)

…creates a set of binary classifiers that partition the

No contention

How we combine binary classifiers has a big impact on how we interpret the performance of a multinomial classifier.

Multinomial classification

Notes from Meeting 001

David Lovell, Bridget McCarron, Brendan Langfield