We will explore a slew of classification metrics by writing custom functions, then comparing the results to various R packages.
First let’s load the data we’ll use for this exercise:
# load
df <- read.csv("./data/classification-output-data.csv") %>% clean_names()
# preview
head(df)
## pregnant glucose diastolic skinfold insulin bmi pedigree age class
## 1 7 124 70 33 215 25.5 0.161 37 0
## 2 2 122 76 27 200 35.9 0.483 26 0
## 3 3 107 62 13 48 22.9 0.678 23 1
## 4 1 91 64 24 0 29.2 0.192 21 0
## 5 4 83 86 19 0 29.3 0.317 34 0
## 6 1 100 74 12 46 19.5 0.149 28 0
## scored_class scored_probability
## 1 0 0.32845226
## 2 0 0.27319044
## 3 0 0.10966039
## 4 0 0.05599835
## 5 0 0.10049072
## 6 0 0.05515460
We are only using the following three columns for this exercise:
- class
- scored_class
- scored_probability
Let's use the table() function to get the raw confusion matrix for the scored dataset.
(conf_mat_raw <- table(actual = df$class, predicted = df$scored_class))
## predicted
## actual 0 1
## 0 119 5
## 1 30 27
The raw confusion matrix has the actual classes on the rows and the predicted classes on the columns. Reading it with class 1 as the positive class, there are 119 true negatives, 27 true positives, 5 false positives, and 30 false negatives.
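If we want those four counts as standalone values, a minimal sketch (treating class 1 as the positive class) pulls them out of conf_mat_raw by name:
# extract the four cells from the raw confusion matrix (rows = actual, columns = predicted)
tn <- conf_mat_raw["0", "0"] # 119 true negatives
fp <- conf_mat_raw["0", "1"] # 5 false positives
fn <- conf_mat_raw["1", "0"] # 30 false negatives
tp <- conf_mat_raw["1", "1"] # 27 true positives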
Accuracy tells us how correct our predictions are overall. However, if the dataset is extremely unbalanced, it wouldn't be too difficult to get a good score by only predicting the majority class. So, let's get some perspective and examine the dataset's class balance.
(df_class_bal <- df %>%
  group_by(class) %>%
  summarise(count = n()) %>%
  mutate(perc = count / sum(count)))
## # A tibble: 2 × 3
## class count perc
## <int> <int> <dbl>
## 1 0 124 0.685
## 2 1 57 0.315
We have:
- 69% class 0
- 31% class 1
This is an unbalanced dataset; we can visualize it as well:
# set colors
class_colors <- c("#0C6291","#A63446")
# plot class imbalance
df_class_bal %>%
  ggplot() +
  geom_bar(aes(y = reorder(as.factor(class), perc), x = perc,
               fill = as.factor(class)),
           stat = "identity", alpha = 0.9) +
  scale_x_continuous(labels = percent_format(suffix = "%")) +
  scale_fill_manual(values = c("1" = class_colors[1], "0" = class_colors[2]),
                    name = "") +
  labs(x = "", y = "", title = "Percent by Class (Actual)") +
  theme_minimal() +
  theme(plot.title.position = "plot",
        plot.title = element_text(hjust = 0.5))
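Tying back to the point above: a trivial model that always predicts the majority class (0) would already score roughly 69% accuracy, so that is the bar any useful classifier needs to clear. A quick check:
# accuracy of a naive model that always predicts the majority class (0)
mean(df$class == 0)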
Now let’s compute accuracy. The formula is as follows:
\[Accuracy = \frac{TP+TN}{TP+FP+TN+FN}\]
Set the predictions and true_labels variables:
# convert to factors
predictions <- as.factor(df$scored_class)
true_labels <- as.factor(df$class)
# accuracy function
acc_fun <- function(predictions, true_labels) {
  # define variables
  tp <- sum(predictions == 1 & true_labels == 1)
  fn <- sum(predictions == 0 & true_labels == 1)
  tn <- sum(predictions == 0 & true_labels == 0)
  fp <- sum(predictions == 1 & true_labels == 0)
  # compute
  sum(tp, tn) / sum(tp, tn, fp, fn)
}
# compute accuracy
(acc <- acc_fun(predictions, true_labels))
## [1] 0.8066298
Looks like we have roughly 81% accuracy, which is not especially impressive considering the class imbalance: always predicting class 0 would already score about 69%.
Now let’s look at classification error rate.
This should be the complement of Accuracy; the formula is as follows:
\[Classification \space Error \space Rate=\frac{FP+FN}{TP+FP+TN+FN}\]
# error rate function
error_rate_func <- function(predictions, true_labels) {
  # define variables
  tp <- sum(predictions == 1 & true_labels == 1)
  fn <- sum(predictions == 0 & true_labels == 1)
  tn <- sum(predictions == 0 & true_labels == 0)
  fp <- sum(predictions == 1 & true_labels == 0)
  # compute
  sum(fp, fn) / sum(tp, tn, fp, fn)
}
# compute
(error_rate <- error_rate_func(predictions, true_labels))
## [1] 0.1933702
As expected, we get the complement of Accuracy: a Classification Error Rate of ~19%.
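As a one-line sanity check that the two custom functions agree:
# error rate should equal one minus accuracy
all.equal(error_rate, 1 - acc)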
Precision measures the accuracy of the positive predictions by quantifying how well the model correctly identifies the relevant instances among everything it predicts as positive.
Precision is particularly important in cases where false positives are costly or have significant consequences. For example, in medical diagnosis, high precision is essential because misdiagnosing a healthy person as diseased can lead to unnecessary treatments with potential side effects.
The formula is as follows:
\[Precision=\frac{TP}{TP+FP}\]
# precision function
# note: class 0 (the first factor level) is treated as the positive class here,
# matching the default that caret::confusionMatrix() uses later on
precision_func <- function(predictions, true_labels) {
  # define variables
  tp <- sum(predictions == 0 & true_labels == 0)
  fn <- sum(predictions == 1 & true_labels == 0)
  tn <- sum(predictions == 1 & true_labels == 1)
  fp <- sum(predictions == 0 & true_labels == 1)
  # compute
  tp / sum(tp, fp)
}
# compute
(precision_score <- precision_func(predictions, true_labels))
## [1] 0.7986577
We have a Precision of roughly 80%: of the items predicted as positive, about 80% were correct.
Next up, Sensitivity or Recall.
This metric measures how well we identify all the relevant instances of a given class. Recall is often in tension with Precision: lowering the classification threshold catches more positives (raising Recall) but also produces more false positives (lowering Precision), while raising the threshold to boost Precision tends to produce more false negatives (lowering Recall).
Recall is often the preferred metric when dealing with imbalanced datasets like this one.
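To make the trade-off concrete, here is a small sketch (assuming scored_probability is the predicted probability of class 1, and treating class 1 as the positive class purely for this illustration) that recomputes Precision and Recall at a few thresholds; in general, lowering the threshold pushes Recall up at the expense of Precision:
# precision and recall at a few candidate thresholds (class 1 as positive here)
sapply(c(0.3, 0.5, 0.7), function(thresh) {
  pred <- ifelse(df$scored_probability >= thresh, 1, 0)
  tp <- sum(pred == 1 & df$class == 1)
  fp <- sum(pred == 1 & df$class == 0)
  fn <- sum(pred == 0 & df$class == 1)
  c(threshold = thresh, precision = tp / (tp + fp), recall = tp / (tp + fn))
})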
The formula is as follows:
\[Sensitivity(Recall)=\frac{TP}{TP+FN}\]
# recall function
recall_func <- function(predictions, true_labels) {
  # define variables (class 0 treated as the positive class, as above)
  tp <- sum(predictions == 0 & true_labels == 0)
  fn <- sum(predictions == 1 & true_labels == 0)
  tn <- sum(predictions == 1 & true_labels == 1)
  fp <- sum(predictions == 0 & true_labels == 1)
  # compute
  tp / sum(tp, fn)
}
# compute
(recall_score <- recall_func(predictions,true_labels))
## [1] 0.9596774
Sensitivity comes out to roughly 96% for the positive class (class 0, per the convention noted above). This is a relatively high score, and since this is an imbalanced dataset, it gives us a more informative view of the model's performance than Accuracy alone.
The F1 score combines Precision and Recall by taking their harmonic mean, which gives us a balance between the two. It is useful when we want to balance the trade-off between Precision and Recall and get a single overall assessment of the model's performance.
The formula is as follows:
\[F1=\frac{2*(Precision*Recall)}{Precision+Recall}\]
# f1 function
f1_func <- function(predictions, true_labels) {
  # define variables (class 0 treated as the positive class, as above)
  tp <- sum(predictions == 0 & true_labels == 0)
  fn <- sum(predictions == 1 & true_labels == 0)
  tn <- sum(predictions == 1 & true_labels == 1)
  fp <- sum(predictions == 0 & true_labels == 1)
  # define precision
  precision_score <- tp / sum(tp, fp)
  # define recall
  recall_score <- tp / sum(tp, fn)
  # compute f1 score
  2 * (precision_score * recall_score) / (precision_score + recall_score)
}
# compute
(f1_score <- f1_func(predictions,true_labels))
## [1] 0.8717949
We have an 87% F1 score, which makes sense as it’s a balance between the ~80% Precision and ~96% Recall.
What are the bounds of the F1 score? It lies between 0 and 1; let's show why this is true.
We know that:
\[0 \le Precision \le 1\]
\[0 \le Recall \le 1\]
And we also know that the product of Precision and Recall (call it PR) satisfies:
\[0 \le PR \le 1\]
Furthermore, because Precision and Recall are each at most 1, the product PR is at most Precision and also at most Recall. Adding those two inequalities gives:
\[2 \cdot PR \le Precision + Recall\]
Therefore the F1 score, which is twice the product divided by the sum, is at most 1; and since both the product and the sum are non-negative, it is at least 0:
\[0 \le F1 \le 1\]
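As a quick numerical sanity check with a few illustrative (made-up) Precision/Recall pairs:
# the harmonic mean stays within [0, 1] for inputs in (0, 1]
f1 <- function(p, r) 2 * (p * r) / (p + r)
f1(1, 1)     # 1
f1(0.01, 1)  # ~0.02
f1(0.5, 0.5) # 0.5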
The ROC curve is a graphical representation used to evaluate binary classification models. It depicts the trade-off between the model's true positive rate (Sensitivity or Recall) and its false positive rate (1 - Specificity).
Specificity is:
\[Specificity=\frac{TN}{TN+FP}\]
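Since we wrote custom functions for the other metrics, here is a sketch of one for Specificity as well (treating class 1 as the positive class, to match the ROC convention used next):
# specificity function (class 1 treated as the positive class)
specificity_func <- function(predictions, true_labels) {
  # define variables
  tn <- sum(predictions == 0 & true_labels == 0)
  fp <- sum(predictions == 1 & true_labels == 0)
  # compute
  tn / sum(tn, fp)
}
# on our data this is 119 / (119 + 5) from the confusion matrix above, ~0.96
specificity_func(predictions, true_labels)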
Let’s write a function to plot the ROC Curve for our dataset.
# roc curve function
# note: for the ROC we treat class 1 as the positive class and sweep the
# threshold over each observed predicted probability
plot_roc_curve <- function(true_labels, predicted_probs) {
  # create df of true and predicted
  sorted_data <- data.frame(true_labels = true_labels,
                            predicted_probs = predicted_probs)
  # sort descending
  sorted_data <- sorted_data %>% arrange(desc(predicted_probs))
  # initialize variables
  tpr <- numeric(length(sorted_data$true_labels))
  fpr <- numeric(length(sorted_data$true_labels))
  auc_value <- 0
  # loop through different threshold values
  for (i in 1:length(sorted_data$true_labels)) {
    threshold <- sorted_data$predicted_probs[i]
    predictions <- ifelse(predicted_probs >= threshold, 1, 0)
    # calculate tpr and fpr
    tp <- sum(predictions == 1 & true_labels == 1)
    fn <- sum(predictions == 0 & true_labels == 1)
    tn <- sum(predictions == 0 & true_labels == 0)
    fp <- sum(predictions == 1 & true_labels == 0)
    tpr[i] <- tp / (tp + fn)
    fpr[i] <- fp / (tn + fp)
    # accumulate AUC using the trapezoidal rule
    if (i > 1) {
      auc_value <- auc_value + 0.5 *
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1])
    }
  }
  # create df for roc curve
  roc_data <- data.frame(fpr = fpr, tpr = tpr)
  # plot
  roc_plot <- roc_data %>%
    ggplot(aes(fpr, tpr)) +
    geom_line() +
    geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray") +
    labs(title = "ROC Curve", x = "False Positive Rate (FPR)", y = "True Positive Rate (TPR)") +
    theme_minimal() +
    theme(plot.title.position = "plot",
          plot.title = element_text(hjust = 0.5))
  # print plot
  print(roc_plot)
  # return auc value
  return(auc_value)
}
To calculate AUC, we used the trapezoidal rule. This calculates the area of each trapezoid formed by two adjacent data points on the ROC curve. The formula is as such:
\[Area = 0.5*(x_2 - x_1)*(y_1 + y_2)\]
After calculating the area of each trapezoid, we sum these areas along the ROC curve. The total is an estimate of the Area Under the Curve (AUC), which represents the probability that the model will rank a randomly chosen positive instance higher than a randomly chosen negative instance. AUC values range from 0 to 1, with higher values indicating better discrimination by the model.
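In vectorized form, the same trapezoidal sum can be written in one line; here is a toy sketch on made-up ROC points (not our data), just to show the mechanics:
# trapezoidal AUC over a toy ROC polyline (fpr must be non-decreasing)
fpr_toy <- c(0, 0.1, 0.4, 1)
tpr_toy <- c(0, 0.6, 0.9, 1)
sum(diff(fpr_toy) * (head(tpr_toy, -1) + tail(tpr_toy, -1)) / 2) # 0.825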
# use on dataset
auc_result <- plot_roc_curve(df$class,df$scored_probability)
cat("AUC score is:",round(auc_result, digits = 2))
## AUC score is: 0.85
Here we have an AUC of 0.85, which is a strong indicator of good model performance. It means the model is quite effective at distinguishing between positive and negative instances: it has a high probability of correctly ranking a randomly chosen positive instance above a randomly chosen negative instance, and the curve reflects a reasonable trade-off between sensitivity (TPR) and specificity (1 - FPR).
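Because AUC can also be read as the probability that a randomly chosen positive scores above a randomly chosen negative (with ties counted as one half), we can sanity-check the value directly from the probabilities:
# rank-based check of the AUC interpretation above
pos <- df$scored_probability[df$class == 1]
neg <- df$scored_probability[df$class == 0]
mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
This should land very close to the 0.85 from the trapezoidal rule.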
Now let’s look at all these metrics together:
cat("Classification metrics from custom functions for our dataset are as follows:","\n",
"Accuracy: ",round(acc,digits = 2),"\n",
"Classification Error Rate: ",round(error_rate,digits = 2),"\n",
"Precision: ",round(precision_score,digits = 2),"\n",
"Recall: ",round(recall_score,digits = 2),"\n",
"F1 Score: ",round(f1_score,digits = 2))
## Classification metrics from custom functions for our dataset are as follows:
## Accuracy: 0.81
## Classification Error Rate: 0.19
## Precision: 0.8
## Recall: 0.96
## F1 Score: 0.87
Let's cross-reference with some established R packages.
First, let's see how the caret package scores our dataset's metrics:
# compute confusion matrix
# (caret treats the first factor level, "0", as the positive class by default,
# matching the convention in our custom precision/recall functions)
confusion_matrix <- caret::confusionMatrix(predictions, true_labels)
# extract classification metrics
accuracy <- confusion_matrix$overall['Accuracy']
error_rate <- 1 - accuracy
precision <- confusion_matrix$byClass['Pos Pred Value']
recall <- confusion_matrix$byClass['Sensitivity']
f1_score <- confusion_matrix$byClass['F1']
# print metrics
cat("Classification metrics for our dataset are as follows:\n",
"Accuracy:", round(accuracy, digits = 2), "\n",
"Classification Error Rate:", round(error_rate, digits = 2), "\n",
"Precision:", round(precision, digits = 2), "\n",
"Recall:", round(recall, digits = 2), "\n",
"F1 Score:", round(f1_score, digits = 2), "\n")
## Classification metrics for our dataset are as follows:
## Accuracy: 0.81
## Classification Error Rate: 0.19
## Precision: 0.8
## Recall: 0.96
## F1 Score: 0.87
Now let's calculate AUC with the pROC package.
# build a df of true labels and predicted probabilities
sorted_data <- data.frame(true_labels = df$class,
                          predicted_probs = df$scored_probability)
# create an ROC curve
roc_curve <- roc(as.numeric(sorted_data$true_labels), as.numeric(sorted_data$predicted_probs))
# calc AUC
auc_result <- auc(roc_curve)
# results
cat("AUC:", round(auc_result, digits = 2))
## AUC: 0.85
The results are identical to those from our custom functions.