Classification metrics

When predicting classes, there are two types of performance metrics. One is based on “hard” class predictions, for example whether a cell is cancerous or not. The other is based on the predicted class probabilities of the observations. We focus on the two-class problem.

Hard class predictions

  1. Accuracy:
    • Accuracy measures the proportion of all cases, both positive and negative, that are correctly identified by a diagnostic test or classifier.
    • Formula: \[ Accuracy = \frac{TP + TN}{TP + FP + TN + FN}\]
  2. Sensitivity:
    • Also known as True Positive Rate (TPR) or Recall.
    • The True Positive Rate measures the proportion of actual positive cases correctly identified by a diagnostic test or classifier.
    • Formula: \[ TPR = \frac{TP}{TP + FN}\]
    • High TPR indicates a low rate of false negatives.
  3. Specificity:
    • Also known as True Negative Rate.
    • Specificity measures the proportion of actual negative cases that are correctly identified by a diagnostic test or classifier.
    • Formula: \[ Specificity = \frac{TN}{TN + FP}\]
    • High specificity indicates a low rate of false positives.
  4. Positive Predictive Value:
    • Also known as Precision.
    • Positive Predictive Value is the proportion of true positive predictions among all positive predictions made by a diagnostic test or classifier.
    • Formula: \[ PPV = \frac{TP}{TP + FP}\]
    • High PPV indicates a low rate of false positive predictions.
  5. False Positive Rate:
    • The False Positive Rate measures the proportion of actual negative cases that are incorrectly identified as positive by a diagnostic test or classifier.
    • Formula: \[ FPR = \frac{FP}{FP + TN}\]
    • Low FPR indicates a low rate of false positives.
    • Note that FPR = 1 - specificity.

This is not an exhaustive list; we cover only some of the more commonly used metrics. These are often tabulated in a confusion matrix. See the Definition section of the Wikipedia page https://en.wikipedia.org/wiki/Precision_and_recall. The upper left of the table there corresponds to the confusion matrix; the rest of the table contains many other metrics. See also Geron Chapter 3, the section on Performance Measures.
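
To make the definitions above concrete, here is a minimal sketch that computes these metrics directly from the confusion-matrix counts. The arrays y and ypred are made-up true labels and hard predictions, not taken from any particular dataset.

```python
import numpy as np

# Made-up true labels (y) and hard class predictions (ypred); 1 = positive class.
y     = np.array([1, 0, 1, 1, 0, 0, 1, 0])
ypred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Confusion-matrix counts.
TP = np.sum((ypred == 1) & (y == 1))
TN = np.sum((ypred == 0) & (y == 0))
FP = np.sum((ypred == 1) & (y == 0))
FN = np.sum((ypred == 0) & (y == 1))

accuracy    = (TP + TN) / (TP + FP + TN + FN)
sensitivity = TP / (TP + FN)   # true positive rate / recall
specificity = TN / (TN + FP)   # true negative rate
ppv         = TP / (TP + FP)   # precision
fpr         = FP / (FP + TN)   # 1 - specificity

print(accuracy, sensitivity, specificity, ppv, fpr)
```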

Metrics as conditional probabilities

Hard class prediction metrics can be formulated as conditional probabilities. In a test for a disease, suppose we use “+” to represent the event that the test is positive and “D” to represent presence of disease. We can express sensitivity, specificity, and positive predictive value mathematically as conditional probabilities.

Sensitivity:

Sensitivity measures the probability that the test correctly identifies individuals who have the disease.

\[P(+ | D) = \frac{P(+ \text{ and } D)}{P(D)}\]

Specificity:

Specificity measures the probability that the test correctly identifies individuals who do not have the disease.

\[P(- | \overline{D}) = \frac{P(- \text{ and } \overline{D})}{P(\overline{D})}\]

Positive Predictive Value:

PPV measures the probability that an individual who tests positive actually has the disease.

\[P(D | +) = \frac{P(D \text{ and } +)}{P(+)}\]
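
These conditional probabilities can be estimated from data as conditional relative frequencies. Below is a minimal sketch using made-up 0/1 indicator arrays for disease status and test result; the numbers are purely illustrative.

```python
import numpy as np

# Made-up indicators: disease[i] = 1 if individual i has the disease,
# test[i] = 1 if the test comes back positive.
disease = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])
test    = np.array([1, 1, 0, 0, 1, 0, 0, 0, 1, 0])

# P(+ | D): among those with the disease, the fraction who test positive.
sensitivity = np.mean(test[disease == 1] == 1)

# P(- | not D): among those without the disease, the fraction who test negative.
specificity = np.mean(test[disease == 0] == 0)

# P(D | +): among those who test positive, the fraction who have the disease.
ppv = np.mean(disease[test == 1] == 1)

print(sensitivity, specificity, ppv)
```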

Accuracy

Note that accuracy is not a conditional probability. It is \(P(ypred = y)\), where \(ypred\) is the predicted class and \(y\) is the true class. We can also express it as

\[P(+ \text{ and } D) + P(- \text{ and } \overline{D})\]

We can compute this in Python using numpy.mean(ypred == y), or with functions such as a scikit-learn estimator's score method or sklearn.metrics.accuracy_score.
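
For example, a short sketch with made-up labels, assuming scikit-learn is available for accuracy_score:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true labels and hard predictions, for illustration only.
y     = np.array([1, 0, 1, 1, 0, 0, 1, 0])
ypred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

print(np.mean(ypred == y))       # fraction of correct predictions
print(accuracy_score(y, ypred))  # the same value via scikit-learn
```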

Soft class predictions and AUC ROC

One performance metric based on soft class predictions is AUC ROC, sometimes referred to simply as AUC or ROC. These are short for “area under the curve” and “receiver operating characteristic”, the latter a term originating in electrical engineering.

Imagine you have fit a model and obtained probabilities of success for each observation. You can decide on whatever cutoff you want in order to make a hard class prediction. For example, you may want to flag borrowers with a risk of credit default greater than 10%. That threshold determines a certain true positive rate (higher is better) and a certain false positive rate (lower is better). Now imagine plotting the true positive rate versus the false positive rate for every possible cutoff probability. The resulting curve is the ROC curve, and the area under it is a performance metric: the higher, the better. Values close to 1 are very good; values near 0.5 are no better than a coin flip. AUC ROC gives us a way to compare models without tying them to a fixed cutoff.
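
As a concrete sketch (assuming scikit-learn, with made-up labels and predicted probabilities), roc_curve traces out the curve and roc_auc_score computes the area under it:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Made-up true labels and predicted probabilities of success, for illustration.
y     = np.array([0, 0, 1, 1, 0, 1, 0, 1])
p_hat = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.65, 0.50, 0.90])

# One (FPR, TPR) point for each candidate cutoff.
fpr, tpr, thresholds = roc_curve(y, p_hat)

# Area under that curve: 1 is perfect, 0.5 is a coin flip.
print(roc_auc_score(y, p_hat))
```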

ISLR has a nice description, as does the Wikipedia page, which also details the above: https://en.wikipedia.org/wiki/Receiver_operating_characteristic