Classification Metrics
Binary Classification
Accuracy
No. of correct predictions / total no. of predictions
\[ \frac{TP + TN}{TP + TN + FP + FN} \]
Precision
How precise the positive predictions are: of all samples predicted positive, the fraction that are truly positive. Decreases as the number of false positives grows.
\[ \frac{TP}{TP + FP} \]
Recall (a.k.a Sensitivity)
How good the model is at catching positive examples: of all truly positive samples, the fraction the model correctly identifies. Decreases as the number of false negatives grows.
\[ \frac{TP}{TP + FN} \]
Specificity
How good the model is at identifying negative examples: of all truly negative samples, the fraction the model correctly identifies. Decreases as the number of false positives grows.
\[ \frac{TN}{TN + FP} \]
F1 score
Harmonic mean of precision and recall.
\[ F_1 = \frac{2 \cdot precision \cdot recall}{precision + recall} \]
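A minimal sketch of these binary metrics, computed directly from the confusion-matrix counts (the y_true/y_pred arrays below are made-up toy data):

```python
import numpy as np

# toy labels for illustration (1 = positive, 0 = negative)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)            # a.k.a. sensitivity / TPR
specificity = tn / (tn + fp)
f1          = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, specificity, f1)
```

The same numbers come out of sklearn.metrics (accuracy_score, precision_score, recall_score, f1_score); specificity has no dedicated helper there, but it is simply recall with the negative class treated as positive.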
Curve based metrics
Each of the previous metrics requires only the final predicted labels. Curve-based metrics instead plot such metrics against each other across a range of probability thresholds, and therefore require the predicted probabilities. The area under the resulting curve is then used as the performance metric.
Curve-based metrics are generally restricted to binary classification.
ROC Curve
FPR
= 1 - Specificity; the lower the better
\[ \frac{FP}{TN + FP} \]
TPR
= Recall or Sensitivity; the higher the better
\[ \frac{TP}{FN + TP} \]
The ROC (receiver operating characteristic) curve plots TPR against FPR, with one point for each value of the probability cutoff.
AUC
The area under the curve (AUC) is the main summary metric of the ROC curve: the closer the curve is to the top left corner, the better (an AUC of 1 is perfect, while 0.5 corresponds to random guessing).
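A minimal sketch with sklearn.metrics, assuming we have the predicted probabilities of the positive class (the toy arrays below are made up):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# toy ground truth and predicted probabilities for the positive class
y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.55])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_true, y_score)               # area under that curve
print(auc)
```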
Precision Recall Curve
Another, less common metric is obtained by plotting the precision and recall for various thresholds and computing the area under the curve.
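A similar sketch for the precision-recall curve; average_precision_score is one common way to summarize the area under it (same toy arrays as above):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.55])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
ap = average_precision_score(y_true, y_score)  # summary of the area under the PR curve
print(ap)
```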
For Multi Class Classification
Confusion matrix
The matrix that shows the distribution of objects across their true labels and predicted labels.
The [1,1] cell shows how many objects with true label Ideal were classified as Ideal, the [1,2] cell how many objects with true label Ideal were classified as Premium, and so on.
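A minimal sketch with sklearn.metrics.confusion_matrix and the diamond labels used here (the y_true/y_pred lists are made-up toy data; the labels argument fixes the row/column order):

```python
from sklearn.metrics import confusion_matrix

labels = ["Ideal", "Premium", "Good", "Fair"]
y_true = ["Ideal", "Premium", "Good", "Fair", "Ideal", "Good", "Premium", "Ideal"]
y_pred = ["Ideal", "Good",    "Good", "Fair", "Premium", "Good", "Premium", "Ideal"]

# rows = true labels, columns = predicted labels, in the order given by `labels`
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
```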
Convert to binary
Grouping classes together
One way to approach the problem is to convert it to a binary one by defining two superclasses which together comprise all the classes. For example, take Ideal and Premium as one class (W) and Good and Fair as another (L). Now use the binary classification metrics on these two classes. But that’s kind of copping out.
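A minimal sketch of this grouping, assuming string labels (the W/L superclass names are just the ones used above, and the toy labels are made up):

```python
from sklearn.metrics import f1_score

group = {"Ideal": "W", "Premium": "W", "Good": "L", "Fair": "L"}

y_true = ["Ideal", "Premium", "Good", "Fair", "Ideal", "Good"]
y_pred = ["Ideal", "Good",    "Good", "Fair", "Premium", "Fair"]

# collapse the four classes into the two superclasses
y_true_bin = [group[label] for label in y_true]
y_pred_bin = [group[label] for label in y_pred]

# treat "W" as the positive class and use the usual binary metrics
print(f1_score(y_true_bin, y_pred_bin, pos_label="W"))
```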
Extending binary metrics by averaging
F1, with an appropriate averaging scheme (discussed below), is often the preferred metric for multiclass classification.
The majority of classification metrics are defined for binary cases by default. In extending these binary metrics to multiclass, several averaging techniques are used.
First, a multiclass problem is broken down into a series of binary problems using either One-vs-One (OVO) or One-vs-Rest (OVR, also called One-vs-All) approaches.
We’ll focus on OVR, since OVO is computationally more expensive.
In OVR, we take each class in turn as the positive class and group all the remaining classes into a single negative superclass.
Essentially, the One-vs-Rest strategy converts a multiclass problem into a series of binary tasks, one per class in the target. For example, classifying 4 types of diamonds can be binarized into 4 tasks with OVR (a small binarization sketch follows the list):
Task 1: ideal vs. [premium, good, fair] — i.e., ideal vs. not ideal
Task 2: premium vs. [ideal, good, fair] — i.e., premium vs. not premium
Task 3: good vs. [ideal, premium, fair] — i.e., good vs. not good
Task 4: fair vs. [ideal, premium, good] — i.e., fair vs. not fair
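A minimal sketch of this binarization, assuming string labels (label_binarize from sklearn produces one 0/1 column per class, i.e. one binary target per task; the toy labels are made up):

```python
from sklearn.preprocessing import label_binarize

classes = ["Ideal", "Premium", "Good", "Fair"]
y_true  = ["Ideal", "Premium", "Good", "Fair", "Ideal", "Good"]

# column k is the binary target of task k (class k vs. rest)
Y = label_binarize(y_true, classes=classes)
print(Y)
```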
Now, for each task, we evaluate the metric of interest (precision, recall, F1, etc.), giving us 4 values. However, we need a single number, not a vector, to be able to compare the performance of different models effectively.
Thus, we use three kinds of averaging techniques (a code sketch follows the list):
- Macro: simple arithmetic mean of the per-class values, i.e. all classes weighted equally. For example, for overall precision, we calculate precision for each class-specific task and then take the average.
- Weighted: similar to macro, except that we compute a weighted mean, most commonly weighting each class by its support (the number of true samples in that class). It can also be used to weigh more important classes higher.
- Micro: we weigh all samples equally, irrespective of class. We sum the counts across classes and compute the metric from these totals. E.g. for precision, we use TP = sum of all TPs across all classes and FP = sum of all FPs across all classes, and then compute Precision = TP/(TP+FP).
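A minimal sketch of the three averaging schemes with sklearn.metrics, using precision as the example metric (the toy labels are made up):

```python
from sklearn.metrics import precision_score

y_true = ["Ideal", "Premium", "Good", "Fair", "Ideal", "Good", "Premium", "Ideal"]
y_pred = ["Ideal", "Good",    "Good", "Fair", "Premium", "Good", "Premium", "Ideal"]

print(precision_score(y_true, y_pred, average="macro",    zero_division=0))  # unweighted mean over classes
print(precision_score(y_true, y_pred, average="weighted", zero_division=0))  # weighted by class support
print(precision_score(y_true, y_pred, average="micro",    zero_division=0))  # pooled TP / (TP + FP)
```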
Cohen Kappa score
Rarely used in practice.
In multiclass classification, it can be used to evaluate the agreement between the predicted and actual class labels. The Cohen Kappa score ranges from -1 to 1: 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate agreement worse than chance (with -1 being complete disagreement).
\[ \kappa = \frac{p_o - p_e}{1 - p_e} = 1 - \frac{1 - p_o}{1 - p_e} \]
\(p_o\) is the observed agreement, i.e. the plain accuracy: the number of samples whose predicted label matches the true label (the diagonal of the confusion matrix) divided by the total number of samples.
\(p_e\) is the probability of random agreement. For this, calculate P(randomly getting true positive for class A) for each class. \(p_e\) is the sum over all classes.
P(randomly getting true positive for class A) = P(random sample belongs to class A) × P(random sample classified as A)
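A minimal sketch, computing \(\kappa\) both from the \(p_o\)/\(p_e\) definition above and via sklearn's cohen_kappa_score (the toy labels are made up):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

y_true = np.array(["Ideal", "Premium", "Good", "Fair", "Ideal", "Good", "Premium", "Ideal"])
y_pred = np.array(["Ideal", "Good",    "Good", "Fair", "Premium", "Good", "Premium", "Ideal"])

# observed agreement: fraction of samples where prediction matches truth
p_o = np.mean(y_true == y_pred)

# chance agreement: sum over classes of P(true label is c) * P(predicted label is c)
classes = np.unique(np.concatenate([y_true, y_pred]))
p_e = sum(np.mean(y_true == c) * np.mean(y_pred == c) for c in classes)

kappa_manual  = (p_o - p_e) / (1 - p_e)
kappa_sklearn = cohen_kappa_score(y_true, y_pred)
print(kappa_manual, kappa_sklearn)
```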