Classification Metrics

Author

Arqam Patel

Binary Classification

Accuracy

No. of correct predictions / total no. of predictions

\[ \frac{TP + TN}{TP+ TN + FP + FN} \]

Precision

How precise the positive predictions are: of all examples predicted positive, the fraction that are actually positive. It decreases as the number of false positives grows.

\[ \frac{TP}{TP + FP} \]

Recall (a.k.a Sensitivity)

How good the model is at catching positive examples: the fraction of actual positives that are predicted positive. It decreases as the number of false negatives grows.

\[ \frac{TP}{TP + FN} \]

Specificity

How good the model is at identifying negative examples: the fraction of actual negatives that are predicted negative. It decreases as the number of false positives grows.

\[ \frac{TN}{TN + FP} \]

F1 score

Harmonic mean of precision and recall.

\[ F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \]
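As a minimal sketch (using NumPy and made-up labels, not data from this post), all five metrics can be computed directly from the four confusion-matrix counts:

```python
import numpy as np

# Toy labels purely for illustration
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Confusion-matrix counts
TP = np.sum((y_true == 1) & (y_pred == 1))
TN = np.sum((y_true == 0) & (y_pred == 0))
FP = np.sum((y_true == 0) & (y_pred == 1))
FN = np.sum((y_true == 1) & (y_pred == 0))

accuracy    = (TP + TN) / (TP + TN + FP + FN)
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)        # a.k.a. sensitivity
specificity = TN / (TN + FP)
f1          = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, specificity, f1)
```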

Curve based metrics

Each of the previous metrics requires only the final predicted labels. Curve-based metrics instead plot how such metrics trade off against each other across a range of probability thresholds, and therefore require the predicted probabilities. The area under the resulting curve is then used as the performance metric.

Curve-based metrics are generally restricted to binary classification.

ROC Curve

FPR

1 − Specificity; the lower the better

\[ \frac{FP}{TN + FP} \]

TPR

= Recall or Sensitivity; the higher the better

\[ \frac{TP}{FN + TP} \]

The ROC (receiver operating characteristic) curve plots TPR against FPR, with one point for each value of the probability cutoff.

AUC

The area under the ROC curve (AUC) is its main summary metric: the closer the curve gets to the top-left corner (AUC approaching 1), the better the classifier.
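A small sketch of how the ROC curve and AUC might be computed; scikit-learn and the toy scores below are assumptions, not part of the original post:

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Toy ground truth and predicted probabilities for the positive class
y_true  = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.55]

# One (FPR, TPR) point per probability cutoff
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the ROC curve
print(roc_auc_score(y_true, y_score))
```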

Precision Recall Curve

Another, less common, metric is obtained by plotting precision against recall for various thresholds and computing the area under the resulting curve.
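A corresponding sketch for the precision-recall curve (again assuming scikit-learn and toy scores); average precision is one common step-wise estimate of the area under this curve:

```python
from sklearn.metrics import precision_recall_curve, average_precision_score

y_true  = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.55]

# One (recall, precision) point per probability cutoff
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# Step-wise estimate of the area under the precision-recall curve
print(average_precision_score(y_true, y_score))
```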

For Multiclass Classification

Confusion matrix

The matrix that shows how objects are distributed across their true labels (rows) and predicted labels (columns).

The (1,1) cell shows how many objects with true label Ideal were classified as Ideal, the (1,2) cell shows how many objects with true label Ideal were classified as Premium, and so on.
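A sketch of building such a matrix with scikit-learn (an assumption; the diamond labels below are toy examples). Note that the code indexes from 0, while the description above counts cells from 1:

```python
from sklearn.metrics import confusion_matrix

labels = ["Ideal", "Premium", "Good", "Fair"]
y_true = ["Ideal", "Premium", "Good", "Fair", "Ideal", "Good"]
y_pred = ["Ideal", "Ideal",   "Good", "Good", "Premium", "Good"]

# Rows = true labels, columns = predicted labels, in the order given by `labels`
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)  # cm[0, 0]: true Ideal predicted Ideal; cm[0, 1]: true Ideal predicted Premium
```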

Convert to binary

Grouping classes together

One way to approach the problem is to convert it to a binary one by defining two superclasses which together comprise all the classes. For example, take Ideal and Premium as one class (W) and Good and Fair as the other (L). Now use the binary classification metrics on these two classes. But that's kind of a cop-out, since it discards the distinctions within each superclass.
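A quick sketch of this grouping (scikit-learn assumed, toy labels): map each class to its superclass and then apply any binary metric:

```python
from sklearn.metrics import f1_score

# Map the four classes onto the two superclasses described above
superclass = {"Ideal": "W", "Premium": "W", "Good": "L", "Fair": "L"}

y_true = ["Ideal", "Premium", "Good", "Fair", "Ideal", "Good"]
y_pred = ["Ideal", "Good",    "Good", "Good", "Premium", "Fair"]

y_true_bin = [superclass[c] for c in y_true]
y_pred_bin = [superclass[c] for c in y_pred]

# Any binary metric now applies; W is treated as the positive class
print(f1_score(y_true_bin, y_pred_bin, pos_label="W"))
```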

Extending binary metrics by averaging

F1 (suitably averaged across classes) is often the preferred metric for multiclass classification.

The majority of classification metrics are defined for binary cases by default. In extending these binary metrics to multiclass, several averaging techniques are used.

First, a multiclass problem is broken down into a series of binary problems using either One-vs-One (OVO) or One-vs-Rest (OVR, also called One-vs-All) approaches.

We’ll focus on OVR, since OVO is computationally more expensive.

In OVR, we take each class in turn as the positive class and group all the remaining classes into a single negative superclass.

Essentially, the One-vs-Rest strategy converts a multiclass problem into a series of binary tasks, one per class in the target. For example, classifying 4 types of diamonds can be binarized into 4 tasks with OVR:

  • Task 1: ideal vs. [premium, good, fair] — i.e., ideal vs. not ideal

  • Task 2: premium vs. [ideal, good, fair] — i.e., premium vs. not premium

  • Task 3: good vs. [ideal, premium, fair] — i.e., good vs. not good

  • Task 4: fair vs. [ideal, premium, good] — i.e., fair vs. not fair

Now, for each task, we evaluate the metric of interest (precision, recall, F1, and so on), which gives us 4 values. However, to compare the performance of different models effectively, we need a single number, not a vector.
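For instance, a minimal sketch of the 4 per-task values (scikit-learn assumed, toy diamond labels): passing average=None returns one score per class, i.e. one per OVR task:

```python
from sklearn.metrics import f1_score

labels = ["Ideal", "Premium", "Good", "Fair"]
y_true = ["Ideal", "Premium", "Good", "Fair", "Ideal", "Good", "Fair", "Premium"]
y_pred = ["Ideal", "Ideal",   "Good", "Good", "Ideal", "Good", "Fair", "Premium"]

# One F1 value per class, i.e. per One-vs-Rest task
per_class_f1 = f1_score(y_true, y_pred, labels=labels, average=None)
print(dict(zip(labels, per_class_f1)))
```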

To collapse these per-class values into a single number, we use one of 3 averaging techniques:

  1. Macro: simple arithmetic mean of the per-class values (i.e. all classes weighted equally). For example, for overall precision, we calculate the precision for each class-specific task and then average the 4 values.
  2. Weighted: similar to macro, except that we compute a weighted mean. Most commonly each class is weighted by its support (the number of true samples belonging to it), but the weights can also be chosen to give more important classes a larger say.
  3. Micro: we weigh all samples equally, irrespective of class. We sum the counts across classes and compute the metric from these totals. For example, for precision we use TP = sum of TPs across all classes and FP = sum of FPs across all classes, and then compute precision = TP / (TP + FP).
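The three averaging modes above, sketched with scikit-learn's average= argument (the library choice and toy labels are assumptions):

```python
from sklearn.metrics import precision_score

y_true = ["Ideal", "Premium", "Good", "Fair", "Ideal", "Good", "Fair", "Premium"]
y_pred = ["Ideal", "Ideal",   "Good", "Good", "Ideal", "Good", "Fair", "Premium"]

# Macro, weighted and micro averages of the per-class precisions
for avg in ["macro", "weighted", "micro"]:
    print(avg, precision_score(y_true, y_pred, average=avg))
```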

Cohen Kappa score

Rarely used in practice.

In multiclass classification, it can be used to evaluate the agreement between the predicted and actual class labels. The Cohen Kappa score ranges from -1 to 1, where a score of -1 indicates perfect disagreement, 0 indicates agreement no better than chance, and 1 indicates perfect agreement.

\[ \kappa = \frac{p_o - p_e}{1 - p_e} = 1 - \frac{1 - p_o}{1 - p_e} \]

\(p_o\) is the observed agreement, i.e. the overall accuracy (the number of samples whose predicted label matches their true label, divided by the total number of samples).

\(p_e\) is the probability of agreement occurring by chance. For each class, calculate P(randomly agreeing on class A); \(p_e\) is the sum of these probabilities over all classes.

P(randomly agreeing on class A) = P(a random sample truly belongs to class A) × P(a random sample is classified as A), with both probabilities estimated from the label proportions.
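A sketch of the kappa calculation from the confusion-matrix marginals, checked against scikit-learn's cohen_kappa_score (the library and toy labels are assumptions):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, cohen_kappa_score

y_true = ["Ideal", "Premium", "Good", "Fair", "Ideal", "Good", "Fair", "Premium"]
y_pred = ["Ideal", "Ideal",   "Good", "Good", "Ideal", "Good", "Fair", "Premium"]

cm = confusion_matrix(y_true, y_pred)
n = cm.sum()

p_o = np.trace(cm) / n                                # observed agreement (overall accuracy)
p_e = np.sum(cm.sum(axis=1) * cm.sum(axis=0)) / n**2  # chance agreement from the marginals
kappa = (p_o - p_e) / (1 - p_e)

print(kappa, cohen_kappa_score(y_true, y_pred))       # the two values should match
```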
