We will explore a slew of classification metrics by writing custom functions, then comparing the results to various R packages.
First let’s load the data we’ll use for this exercise:
# load
df <- read.csv("./data/classification-output-data.csv") %>% clean_names()
# preview
head(df)
## pregnant glucose diastolic skinfold insulin bmi pedigree age class
## 1 7 124 70 33 215 25.5 0.161 37 0
## 2 2 122 76 27 200 35.9 0.483 26 0
## 3 3 107 62 13 48 22.9 0.678 23 1
## 4 1 91 64 24 0 29.2 0.192 21 0
## 5 4 83 86 19 0 29.3 0.317 34 0
## 6 1 100 74 12 46 19.5 0.149 28 0
## scored_class scored_probability
## 1 0 0.32845226
## 2 0 0.27319044
## 3 0 0.10966039
## 4 0 0.05599835
## 5 0 0.10049072
## 6 0 0.05515460
We are only using the following three columns for this exercise:
- class
- scored_class
- scored_probability
Let's use the table() function to get the raw confusion matrix for the scored dataset.
(conf_mat_raw <- table(actual = df$class, predicted = df$scored_class))
## predicted
## actual 0 1
## 0 119 5
## 1 30 27
The raw confusion matrix has the actual classes on the rows and the predicted classes on the columns. Reading it with class 1 as the positive class, there are 119 true negatives, 27 true positives, 5 false positives, and 30 false negatives.
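If we want those four counts as standalone values, a minimal sketch (treating class 1 as the positive class) pulls them out of conf_mat_raw by name:
# extract the four cells from the raw confusion matrix (rows = actual, columns = predicted)
tn <- conf_mat_raw["0", "0"] # 119 true negatives
fp <- conf_mat_raw["0", "1"] # 5 false positives
fn <- conf_mat_raw["1", "0"] # 30 false negatives
tp <- conf_mat_raw["1", "1"] # 27 true positives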
Accuracy tells us how correct our predictions are overall. However, if the dataset is extremely unbalanced, it wouldn't be too difficult to get a good score by only predicting the majority class. So, let's get some perspective and examine the dataset's class balance.
(df_class_bal <- df %>%
  group_by(class) %>%
  summarise(count = n()) %>%
  mutate(perc = count / sum(count)))
## # A tibble: 2 × 3
## class count perc
## <int> <int> <dbl>
## 1 0 124 0.685
## 2 1 57 0.315
We have:
- 69% class 0
- 31% class 1
This is an unbalanced dataset; we can visualize it as well:
# set colors
class_colors <- c("#0C6291","#A63446")
# plot class imbalance
df_class_bal %>%
  ggplot() +
  geom_bar(aes(y = reorder(as.factor(class), perc), x = perc,
               fill = as.factor(class)),
           stat = "identity", alpha = 0.9) +
  scale_x_continuous(labels = percent_format(suffix = "%")) +
  scale_fill_manual(values = c("1" = class_colors[1], "0" = class_colors[2]),
                    name = "") +
  labs(x = "", y = "", title = "Percent by Class (Actual)") +
  theme_minimal() +
  theme(plot.title.position = "plot",
        plot.title = element_text(hjust = 0.5))
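Tying back to the point above: a trivial model that always predicts the majority class (0) would already score roughly 69% accuracy, so that is the bar any useful classifier needs to clear. A quick check:
# accuracy of a naive model that always predicts the majority class (0)
mean(df$class == 0)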
Now let’s compute accuracy. The formula is as follows:
\[Accuracy = \frac{TP+TN}{TP+FP+TN+FN}\]
Set the predictions and true_labels variables:
# convert to factors
predictions <- as.factor(df$scored_class)
true_labels <- as.factor(df$class)
# accuracy function
acc_fun <- function(predictions, true_labels) {
  # define variables
  tp <- sum(predictions == 1 & true_labels == 1)
  fn <- sum(predictions == 0 & true_labels == 1)
  tn <- sum(predictions == 0 & true_labels == 0)
  fp <- sum(predictions == 1 & true_labels == 0)
  # compute
  sum(tp, tn) / sum(tp, tn, fp, fn)
}
# compute accuracy
(acc <- acc_fun(predictions, true_labels))
## [1] 0.8066298
Looks like we have roughly 81% accuracy, which is not especially impressive considering the class imbalance: always predicting class 0 would already score about 69%.
Now let’s look at classification error rate.
This should be the complement of Accuracy; the formula is as follows:
\[Classification \space Error \space Rate=\frac{FP+FN}{TP+FP+TN+FN}\]
# error rate function
error_rate_func <- function(predictions, true_labels) {
  # define variables
  tp <- sum(predictions == 1 & true_labels == 1)
  fn <- sum(predictions == 0 & true_labels == 1)
  tn <- sum(predictions == 0 & true_labels == 0)
  fp <- sum(predictions == 1 & true_labels == 0)
  # compute
  sum(fp, fn) / sum(tp, tn, fp, fn)
}
# compute
(error_rate <- error_rate_func(predictions, true_labels))
## [1] 0.1933702
As expected, we get the complement of Accuracy: a Classification Error Rate of ~19%.
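As a one-line sanity check that the two custom functions agree:
# error rate should equal one minus accuracy
all.equal(error_rate, 1 - acc)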
Precision measures the accuracy of the positive predictions by quantifying how well the model correctly identifies the relevant instances among everything it predicts as positive.
Precision is particularly important in cases where false positives are costly or have significant consequences. For example, in medical diagnosis, high precision is essential because misdiagnosing a healthy person as diseased can lead to unnecessary treatments with potential side effects.
The formula is as follows:
\[Precision=\frac{TP}{TP+FP}\]
# precision function
# note: class 0 (the first factor level) is treated as the positive class here,
# matching the default that caret::confusionMatrix() uses later on
precision_func <- function(predictions, true_labels) {
  # define variables
  tp <- sum(predictions == 0 & true_labels == 0)
  fn <- sum(predictions == 1 & true_labels == 0)
  tn <- sum(predictions == 1 & true_labels == 1)
  fp <- sum(predictions == 0 & true_labels == 1)
  # compute
  tp / sum(tp, fp)
}
# compute
(precision_score <- precision_func(predictions, true_labels))
## [1] 0.7986577
We have a Precision of roughly 80%: of the items predicted as positive, about 80% were correct.
Next up, Sensitivity or Recall.
This metric measures how well we identify all the relevant instances of a given class. Recall is often in tension with Precision: lowering the classification threshold catches more positives (raising Recall) but also produces more false positives (lowering Precision), while raising the threshold to boost Precision tends to produce more false negatives (lowering Recall).
Recall is often the preferred metric when dealing with imbalanced datasets like this one.
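To make the trade-off concrete, here is a small sketch (assuming scored_probability is the predicted probability of class 1, and treating class 1 as the positive class purely for this illustration) that recomputes Precision and Recall at a few thresholds; in general, lowering the threshold pushes Recall up at the expense of Precision:
# precision and recall at a few candidate thresholds (class 1 as positive here)
sapply(c(0.3, 0.5, 0.7), function(thresh) {
  pred <- ifelse(df$scored_probability >= thresh, 1, 0)
  tp <- sum(pred == 1 & df$class == 1)
  fp <- sum(pred == 1 & df$class == 0)
  fn <- sum(pred == 0 & df$class == 1)
  c(threshold = thresh, precision = tp / (tp + fp), recall = tp / (tp + fn))
})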
The formula is as follows:
\[Sensitivity(Recall)=\frac{TP}{TP+FN}\]
# recall function
recall_func <- function(predictions, true_labels) {
  # define variables (class 0 treated as the positive class, as above)
  tp <- sum(predictions == 0 & true_labels == 0)
  fn <- sum(predictions == 1 & true_labels == 0)
  tn <- sum(predictions == 1 & true_labels == 1)
  fp <- sum(predictions == 0 & true_labels == 1)
  # compute
  tp / sum(tp, fn)
}
# compute
(recall_score <- recall_func(predictions,true_labels))
## [1] 0.9596774
Sensitivity comes out to roughly 96% for the positive class (class 0, per the convention noted above). This is a relatively high score, and since this is an imbalanced dataset, it gives us a more informative view of the model's performance than Accuracy alone.
The F1 score combines Precision and Recall by taking their harmonic mean, which gives us a balance between the two. It is useful when we want to balance the trade-off between Precision and Recall and get a single overall assessment of the model's performance.
The formula is as follows:
\[F1=\frac{2*(Precision*Recall)}{Precision+Recall}\]
# f1 function
f1_func <- function(predictions, true_labels) {
  # define variables (class 0 treated as the positive class, as above)
  tp <- sum(predictions == 0 & true_labels == 0)
  fn <- sum(predictions == 1 & true_labels == 0)
  tn <- sum(predictions == 1 & true_labels == 1)
  fp <- sum(predictions == 0 & true_labels == 1)
  # define precision
  precision_score <- tp / sum(tp, fp)
  # define recall
  recall_score <- tp / sum(tp, fn)
  # compute f1 score
  2 * (precision_score * recall_score) / (precision_score + recall_score)
}
# compute
(f1_score <- f1_func(predictions,true_labels))
## [1] 0.8717949
We have an 87% F1 score, which makes sense as it’s a balance between the ~80% Precision and ~96% Recall.
What are the bounds of the F1 score? It lies between 0 and 1; let's show why this is true.
We know that:
\[0 \le Precision \le 1\]
\[0 \le Recall \le 1\]
And we also know that the product of Precision and Recall (call it PR) satisfies:
\[0 \le PR \le 1\]
Furthermore, because Precision and Recall are each at most 1, the product PR is at most Precision and also at most Recall. Adding those two inequalities gives:
\[2 \cdot PR \le Precision + Recall\]
Therefore the F1 score, which is twice the product divided by the sum, is at most 1; and since both the product and the sum are non-negative, it is at least 0:
\[0 \le F1 \le 1\]
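As a quick numerical sanity check with a few illustrative (made-up) Precision/Recall pairs:
# the harmonic mean stays within [0, 1] for inputs in (0, 1]
f1 <- function(p, r) 2 * (p * r) / (p + r)
f1(1, 1)     # 1
f1(0.01, 1)  # ~0.02
f1(0.5, 0.5) # 0.5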
The ROC curve is a graphical representation used to evaluate binary classification models. It depicts the trade-off between the model's true positive rate (Sensitivity or Recall) and its false positive rate (1 - Specificity).
Specificity is:
\[Specificity=\frac{TN}{TN+FP}\]
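Since we wrote custom functions for the other metrics, here is a sketch of one for Specificity as well (treating class 1 as the positive class, to match the ROC convention used next):
# specificity function (class 1 treated as the positive class)
specificity_func <- function(predictions, true_labels) {
  # define variables
  tn <- sum(predictions == 0 & true_labels == 0)
  fp <- sum(predictions == 1 & true_labels == 0)
  # compute
  tn / sum(tn, fp)
}
# on our data this is 119 / (119 + 5) from the confusion matrix above, ~0.96
specificity_func(predictions, true_labels)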
Let’s write a function to plot the ROC Curve for our dataset.
# roc curve function
# note: for the ROC we treat class 1 as the positive class and sweep the
# threshold over each observed predicted probability
plot_roc_curve <- function(true_labels, predicted_probs) {
  # create df of true and predicted
  sorted_data <- data.frame(true_labels = true_labels,
                            predicted_probs = predicted_probs)
  # sort descending
  sorted_data <- sorted_data %>% arrange(desc(predicted_probs))
  # initialize variables
  tpr <- numeric(length(sorted_data$true_labels))
  fpr <- numeric(length(sorted_data$true_labels))
  auc_value <- 0
  # loop through different threshold values
  for (i in 1:length(sorted_data$true_labels)) {
    threshold <- sorted_data$predicted_probs[i]
    predictions <- ifelse(predicted_probs >= threshold, 1, 0)
    # calculate tpr and fpr
    tp <- sum(predictions == 1 & true_labels == 1)
    fn <- sum(predictions == 0 & true_labels == 1)
    tn <- sum(predictions == 0 & true_labels == 0)
    fp <- sum(predictions == 1 & true_labels == 0)
    tpr[i] <- tp / (tp + fn)
    fpr[i] <- fp / (tn + fp)
    # accumulate AUC using the trapezoidal rule
    if (i > 1) {
      auc_value <- auc_value + 0.5 *
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1])
    }
  }
  # create df for roc curve
  roc_data <- data.frame(fpr = fpr, tpr = tpr)
  # plot
  roc_plot <- roc_data %>%
    ggplot(aes(fpr, tpr)) +
    geom_line() +
    geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray") +
    labs(title = "ROC Curve", x = "False Positive Rate (FPR)", y = "True Positive Rate (TPR)") +
    theme_minimal() +
    theme(plot.title.position = "plot",
          plot.title = element_text(hjust = 0.5))
  # print plot
  print(roc_plot)
  # return auc value
  return(auc_value)
}
To calculate AUC, we used the trapezoidal rule. This calculates the area of each trapezoid formed by two adjacent data points on the ROC curve. The formula is as such:
\[Area = 0.5*(x_2 - x_1)*(y_1 + y_2)\]
After calculating the area of each trapezoid, we sum these areas along the ROC curve. The total is an estimate of the Area Under the Curve (AUC), which represents the probability that the model will rank a randomly chosen positive instance higher than a randomly chosen negative instance. AUC values range from 0 to 1, with higher values indicating better discrimination by the model.
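In vectorized form, the same trapezoidal sum can be written in one line; here is a toy sketch on made-up ROC points (not our data), just to show the mechanics:
# trapezoidal AUC over a toy ROC polyline (fpr must be non-decreasing)
fpr_toy <- c(0, 0.1, 0.4, 1)
tpr_toy <- c(0, 0.6, 0.9, 1)
sum(diff(fpr_toy) * (head(tpr_toy, -1) + tail(tpr_toy, -1)) / 2) # 0.825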
# use on dataset
auc_result <- plot_roc_curve(df$class,df$scored_probability)
cat("AUC score is:",round(auc_result, digits = 2))
## AUC score is: 0.85
Here we have an AUC of 0.85, which is a strong indicator of good model performance. It means the model is quite effective at distinguishing between positive and negative instances: it has a high probability of correctly ranking a randomly chosen positive instance above a randomly chosen negative instance, and the curve reflects a reasonable trade-off between sensitivity (TPR) and specificity (1 - FPR).
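Because AUC can also be read as the probability that a randomly chosen positive scores above a randomly chosen negative (with ties counted as one half), we can sanity-check the value directly from the probabilities:
# rank-based check of the AUC interpretation above
pos <- df$scored_probability[df$class == 1]
neg <- df$scored_probability[df$class == 0]
mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
This should land very close to the 0.85 from the trapezoidal rule.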
Now let’s look at all these metrics together:
cat("Classification metrics from custom functions for our dataset are as follows:","\n",
"Accuracy: ",round(acc,digits = 2),"\n",
"Classification Error Rate: ",round(error_rate,digits = 2),"\n",
"Precision: ",round(precision_score,digits = 2),"\n",
"Recall: ",round(recall_score,digits = 2),"\n",
"F1 Score: ",round(f1_score,digits = 2))
## Classification metrics from custom functions for our dataset are as follows:
## Accuracy: 0.81
## Classification Error Rate: 0.19
## Precision: 0.8
## Recall: 0.96
## F1 Score: 0.87
Let's cross-reference with some established R packages.
First, let's see how the caret package scores our dataset's metrics:
# compute confusion matrix
# (caret treats the first factor level, "0", as the positive class by default,
# matching the convention in our custom precision/recall functions)
confusion_matrix <- caret::confusionMatrix(predictions, true_labels)
# extract classification metrics
accuracy <- confusion_matrix$overall['Accuracy']
error_rate <- 1 - accuracy
precision <- confusion_matrix$byClass['Pos Pred Value']
recall <- confusion_matrix$byClass['Sensitivity']
f1_score <- confusion_matrix$byClass['F1']
# print metrics
cat("Classification metrics for our dataset are as follows:\n",
"Accuracy:", round(accuracy, digits = 2), "\n",
"Classification Error Rate:", round(error_rate, digits = 2), "\n",
"Precision:", round(precision, digits = 2), "\n",
"Recall:", round(recall, digits = 2), "\n",
"F1 Score:", round(f1_score, digits = 2), "\n")
## Classification metrics for our dataset are as follows:
## Accuracy: 0.81
## Classification Error Rate: 0.19
## Precision: 0.8
## Recall: 0.96
## F1 Score: 0.87
Now let's calculate AUC with the pROC package.
# build a df of true labels and predicted probabilities
sorted_data <- data.frame(true_labels = df$class,
                          predicted_probs = df$scored_probability)
# create an ROC curve
roc_curve <- roc(as.numeric(sorted_data$true_labels), as.numeric(sorted_data$predicted_probs))
# calc AUC
auc_result <- auc(roc_curve)
# results
cat("AUC:", round(auc_result, digits = 2))
## AUC: 0.85
The results are identical to those from our custom functions.