For this extra credit, we will go over what's required to analyze binary classification model performance.
Binary classification results are typically represented in a confusion matrix, where one axis is the actual class and the other is the predicted class. The four quadrants inside then represent TP, TN, FP, and FN:
| | Actual Positive | Actual Negative |
|---|---|---|
| Predicted Positive | TP | FP |
| Predicted Negative | FN | TN |
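To make that layout concrete, here is a tiny hypothetical example in R (the label vectors are invented purely for illustration):

```r
# Six made-up observations with "pos"/"neg" labels
predicted <- c("pos", "pos", "neg", "neg", "pos", "neg")
actual <- c("pos", "neg", "pos", "neg", "pos", "neg")

# Rows are predictions, columns are actuals (levels sorted alphabetically)
table(predicted, actual)
```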
Before we get into tasks, let’s first read in the prediction dataset:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(stringr)
data_url <- "https://raw.githubusercontent.com/riverar9/cuny-msds/main/data607/extra_credit/penguin_predictions.csv"
penguin_predictions <- read.csv(data_url)
penguin_predictions |>
  group_by(.pred_class, sex) |>
  count()
## # A tibble: 4 × 3
## # Groups: .pred_class, sex [4]
## .pred_class sex n
## <chr> <chr> <int>
## 1 female female 36
## 2 female male 3
## 3 male female 3
## 4 male male 51
With this table, we must define being categorized as female as the positive class. So a TP will be when .pred_class and sex are both female.
With that defined, I found that R's built-in table() function conveniently creates a confusion matrix.
pred_cm <- table(penguin_predictions$.pred_class, penguin_predictions$sex)
pred_cm
##
## female male
## female 36 3
## male 3 51
It is important to note that in this orientation, the actual class varies along the columns and the predicted class varies along the rows. With the tibble and the table above, we can collect the TP, TN, FP, and FN. Let's encode these into integer variables for later reference, taking the values directly from the confusion matrix (R indexes a matrix column by column, so pred_cm[1] through pred_cm[4] walk down the first column and then the second):
tp <- pred_cm[1] # predicted female, actually female
fn <- pred_cm[2] # predicted male, actually female
fp <- pred_cm[3] # predicted female, actually male
tn <- pred_cm[4] # predicted male, actually male
print(str_c(tp, "\t", fp))
## [1] "36\t3"
print(str_c(fn, "\t", tn))
## [1] "3\t51"
The Null Error Rate (NER) measures how often we would be wrong if we always predicted the majority class:
\[ n_{min} = \text{observations in the minority class} \] \[ n = \text{total observations} \] \[ NER = \dfrac{n_{min}}{n} \]
To calculate the NER, we first need to know which class is the majority:
colors <- c('plum', 'blue')
ggplot(penguin_predictions, aes(x = sex)) +
  geom_bar(fill = colors) +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -1)
From that graph, we can see that the majority of observations are of male penguins (54 males vs 39 females), so the male class is the majority class. A null model that always predicts male is wrong only on the female observations, so the NER is the share of actual females in the sample:
ner <- (tp + fn) / (tp + fn + fp + tn) # actual females / total observations
ner
## [1] 0.4193548
or, because I like using LaTeX:
\[ NER = \dfrac{n_{min}}{n} = \dfrac{36 + 3}{93} \approx \text{41.9%} \]
The NER is a useful baseline for judging a classifier against simple chance: the model's error rate should come in below the NER, or equivalently, its accuracy should beat the accuracy of always guessing the majority class (here 54/93 ≈ 58.1%). For example, if 99% of all penguins were male, a model with 94% accuracy would actually do worse than always guessing male.
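To make that comparison concrete, here is a quick sketch that checks the model against the null baseline directly; every number comes from the confusion matrix above:

```r
# Model accuracy vs. the null model that always predicts the majority class
model_acc <- (tp + tn) / (tp + fp + fn + tn) # (36 + 51) / 93, about 0.935
null_acc <- 1 - ner                          # 54 / 93, about 0.581
model_acc > null_acc                         # TRUE: the model beats the baseline
```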
First, we'll modify the data frame to add a column for each prediction threshold:
penguin_predictions <- penguin_predictions |>
  mutate(.pred_class.2 = ifelse(.pred_female < 0.2, "male", "female")) |>
  mutate(.pred_class.5 = ifelse(.pred_female < 0.5, "male", "female")) |>
  mutate(.pred_class.8 = ifelse(.pred_female < 0.8, "male", "female"))
head(penguin_predictions)
## .pred_female .pred_class sex .pred_class.2 .pred_class.5 .pred_class.8
## 1 0.9921746 female female female female female
## 2 0.9542394 female female female female female
## 3 0.9847350 female female female female female
## 4 0.1870206 male female male male male
## 5 0.9947012 female female female female female
## 6 0.9999891 female female female female female
Now we'll create a confusion matrix for each threshold using table():
pred_cm2 <- table(penguin_predictions$.pred_class.2, penguin_predictions$sex)
pred_cm5 <- table(penguin_predictions$.pred_class.5, penguin_predictions$sex)
pred_cm8 <- table(penguin_predictions$.pred_class.8, penguin_predictions$sex)
pred_cm2
##
## female male
## female 37 6
## male 2 48
pred_cm5
##
## female male
## female 36 3
## male 3 51
pred_cm8
##
## female male
## female 36 2
## male 3 52
When the threshold is lowered to .2, the number of true positives increases by one, but the number of false positives increases by three, so the overall accuracy actually drops.
In contrast, increasing the threshold to .5 did not change the result at all. This would mean that there are no .pred_female values between the model's inherent threshold and .5. To verify this finding, I'll put the predictions into a graph:
ggplot(penguin_predictions, aes(x = .pred_female, y = .pred_female)) +
  geom_point()
Looking at the scatter plot, .pred_female has no values from roughly .35 to about .75, so any threshold between these two numbers will yield the same result. Finally, looking at the threshold of .8, we observe that one entry has moved from the FP bucket to TN, which raises the accuracy. This exercise demonstrates the importance of setting thresholds, and how much even a small modification can change the metrics and results.
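We can also check that gap numerically instead of eyeballing the plot. The endpoints .35 and .75 below are simply the rough values read off the scatter plot:

```r
# Count predictions inside the apparent gap; zero means every threshold
# in (.35, .75) produces identical classifications
penguin_predictions |>
  filter(.pred_female > 0.35, .pred_female < 0.75) |>
  nrow()
```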
But it leaves a question: how can we most robustly determine a threshold to use? Online, I found a resource from Google which suggested a few methods for determining this, including plotting a ROC curve. Doing so is too time-consuming at the moment, but I am looking forward to it!
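For the curious, here is a minimal sketch of what that computation could look like by hand (this is my own illustration, not part of the analysis above; packages like pROC or yardstick handle ROC curves more robustly):

```r
# Sweep thresholds from 0 to 1 and record the true- and false-positive rates
thresholds_roc <- seq(0, 1, by = 0.01)
roc_points <- sapply(thresholds_roc, function(t) {
  pred <- ifelse(penguin_predictions$.pred_female >= t, "female", "male")
  tp_t <- sum(pred == "female" & penguin_predictions$sex == "female")
  fp_t <- sum(pred == "female" & penguin_predictions$sex == "male")
  c(fpr = fp_t / sum(penguin_predictions$sex == "male"),
    tpr = tp_t / sum(penguin_predictions$sex == "female"))
})

# Plotting FPR against TPR traces out the ROC curve
plot(roc_points["fpr", ], roc_points["tpr", ], type = "l",
     xlab = "False positive rate", ylab = "True positive rate")
```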
Reviewing a few definitions: \[ acc = \dfrac{TP + TN}{TP + FP + FN + TN} \] \[ pre = \dfrac{TP}{TP + FP} \] \[ rec = \dfrac{TP}{TP + FN} \] \[ F_1 = \dfrac{2 \cdot pre \cdot rec}{pre + rec} \]
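As a quick sanity check with the original confusion matrix (TP = 36, FP = 3, FN = 3), precision and recall come out equal, so the F1 score collapses to the same value:
\[ pre = rec = \dfrac{36}{36 + 3} \approx 0.923, \qquad F_1 = \dfrac{2 \cdot 0.923 \cdot 0.923}{0.923 + 0.923} \approx 0.923 \]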
With these definitions, let's create a table organizing all of the calculations:
# Metrics for the predictions as provided (the model's default threshold)
acc <- (tp + tn) / (tp + fp + fn + tn)
pre <- tp / (tp + fp)
rec <- tp / (tp + fn)
f_1 <- (2 * pre * rec) / (pre + rec)
thresholds <- c(.2, .5, .8)
collection <- c(acc, pre, rec, f_1)
for (threshold in thresholds) {
  # Reclassify every observation at this threshold
  temp_df <- penguin_predictions |>
    mutate(t_pred_class = ifelse(.pred_female < threshold, "male", "female"))
  temp_cm <- table(temp_df$t_pred_class, temp_df$sex)
  # Same positional indexing as before: the first column is actual female
  t_tp <- temp_cm[1]
  t_fn <- temp_cm[2]
  t_fp <- temp_cm[3]
  t_tn <- temp_cm[4]
  t_acc <- (t_tp + t_tn) / (t_tp + t_fp + t_fn + t_tn)
  t_pre <- t_tp / (t_tp + t_fp)
  t_rec <- t_tp / (t_tp + t_fn)
  t_f_1 <- (2 * t_pre * t_rec) / (t_pre + t_rec)
  collection <- c(collection, t_acc, t_pre, t_rec, t_f_1)
}
scores_matrix <- matrix(collection, ncol = 4, byrow = TRUE)
colnames(scores_matrix) <- c("accuracy", "precision", "recall", "f1 score")
rownames(scores_matrix) <- c("provided", thresholds)
scores_table <- as.table(scores_matrix)
scores_table
## accuracy precision recall f1 score
## provided 0.9354839 0.9230769 0.9230769 0.9230769
## 0.2 0.9139785 0.8604651 0.9487179 0.9024390
## 0.5 0.9354839 0.9230769 0.9230769 0.9230769
## 0.8 0.9462366 0.9473684 0.9230769 0.9350649
With the code above, you could theoretically add more thresholds and recalculate all of the scores.
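For example, one way to make that extension cleaner is to wrap the loop body in a helper function and apply it over any vector of thresholds (score_threshold is a hypothetical name of my own, not part of the original code):

```r
# Compute accuracy, precision, recall, and F1 for a given threshold
score_threshold <- function(threshold) {
  pred <- ifelse(penguin_predictions$.pred_female < threshold, "male", "female")
  tp <- sum(pred == "female" & penguin_predictions$sex == "female")
  fp <- sum(pred == "female" & penguin_predictions$sex == "male")
  fn <- sum(pred == "male" & penguin_predictions$sex == "female")
  tn <- sum(pred == "male" & penguin_predictions$sex == "male")
  pre <- tp / (tp + fp)
  rec <- tp / (tp + fn)
  c(accuracy = (tp + tn) / (tp + fp + fn + tn),
    precision = pre,
    recall = rec,
    f1 = (2 * pre * rec) / (pre + rec))
}

# Each column of the result holds one threshold's scores
sapply(c(.1, .2, .5, .8, .9), score_threshold)
```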
Now looking at the table, we can verify our observation in Task 3: the threshold of 0.8 gives the best accuracy. In fact, accuracy, precision, and the F1 score are all best at 0.8; only recall is higher, at the 0.2 threshold.
Using the penguin sex prediction model output to evaluate these scores has been a great introduction to assessing a classification model's performance. With this exercise, I am now intrigued by how classification problems can be optimized given real-world constraints.