Week 2B: Classification Metrics

Author

Madina Kudanova

Introduction

This assignment focuses on evaluating the performance of a binary classification model. Using predicted probabilities and known class labels, the goal is to understand how different decision thresholds influence model errors and performance metrics.

Classification metrics

Approach

To evaluate the performance of a binary classification model, I will work with a provided dataset containing model-predicted probabilities and true class labels. The plan is to first examine the distribution of the actual class labels and establish a baseline with the null error rate, then convert the predicted probabilities into class labels at several decision thresholds. From those results I will build confusion matrices and derive performance metrics to see how the choice of threshold shifts the balance between different types of classification errors.

Base Code

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
url <- "https://raw.githubusercontent.com/acatlin/data/refs/heads/master/penguin_predictions.csv"
penguins <- read_csv(url)
Rows: 93 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): .pred_class, sex
dbl (1): .pred_female

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# 1) Convert actual labels to 0/1
penguins <- penguins %>%
  mutate(actual = ifelse(sex == "female", 1, 0))

# 2) Null error rate
class_counts <- penguins %>% count(actual)

majority_n <- max(class_counts$n)
total_n <- sum(class_counts$n)

null_error_rate <- 1 - (majority_n / total_n)
null_error_rate
[1] 0.4193548
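As a quick sanity check, the same null error rate can be computed in one line from the class proportions (a minimal sketch using base R on the `actual` column created in step 1):

# Cross-check: the null error rate is 1 minus the majority-class proportion;
# this should reproduce the 0.4193548 value printed above.
1 - max(prop.table(table(penguins$actual)))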
# Plot actual class distribution
penguins %>%
  ggplot(aes(x = sex)) +
  geom_bar() +
  labs(title = "Actual class distribution", x = "sex", y = "count")

# 3) Helper: counts for confusion matrix at a threshold
confusion_counts <- function(df, threshold) {
  pred <- ifelse(df$.pred_female > threshold, 1, 0)
  actual <- df$actual

  TP <- sum(pred == 1 & actual == 1)
  FP <- sum(pred == 1 & actual == 0)
  TN <- sum(pred == 0 & actual == 0)
  FN <- sum(pred == 0 & actual == 1)

  tibble(threshold = threshold, TP = TP, FP = FP, TN = TN, FN = FN)
}

# 4) Helper: turn counts into a 2x2 confusion matrix
make_cm <- function(row) {
  matrix(
    c(row$TN, row$FP,
      row$FN, row$TP),
    nrow = 2, byrow = TRUE,
    dimnames = list(Actual = c("0","1"), Predicted = c("0","1"))
  )
}

# 5) Compute confusion matrices for 0.2 / 0.5 / 0.8
thresholds <- c(0.2, 0.5, 0.8)
counts_all <- map_dfr(thresholds, ~ confusion_counts(penguins, .x))

counts_all
# A tibble: 3 × 5
  threshold    TP    FP    TN    FN
      <dbl> <int> <int> <int> <int>
1       0.2    37     6    48     2
2       0.5    36     3    51     3
3       0.8    36     2    52     3
for (t in thresholds) {
  cat("\n--- Threshold =", t, "---\n")
  row <- counts_all %>% filter(threshold == t)
  print(make_cm(row))
}

--- Threshold = 0.2 ---
      Predicted
Actual  0  1
     0 48  6
     1  2 37

--- Threshold = 0.5 ---
      Predicted
Actual  0  1
     0 51  3
     1  3 36

--- Threshold = 0.8 ---
      Predicted
Actual  0  1
     0 52  2
     1  3 36
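As an optional cross-check on the 0.5 matrix above, base R's `table()` tabulates the same 2x2 layout directly (a sketch that reuses the `penguins` data frame and the `actual` column from step 1):

# Cross-check: confusion matrix at threshold 0.5 via base table();
# it should match the "Threshold = 0.5" matrix printed above.
table(
  Actual    = penguins$actual,
  Predicted = as.integer(penguins$.pred_female > 0.5)
)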
# 6) Metrics from counts
metrics_from_counts <- function(row) {
  TP <- row$TP; FP <- row$FP; TN <- row$TN; FN <- row$FN
  total <- TP + FP + TN + FN

  accuracy <- (TP + TN) / total
  precision <- TP / (TP + FP)
  recall <- TP / (TP + FN)
  f1 <- 2 * precision * recall / (precision + recall)

  tibble(threshold = row$threshold,
         accuracy = accuracy, precision = precision, recall = recall, f1 = f1)
}

# Compute metrics for each threshold's row of counts
# (group_split() keeps the threshold column, so the helper can return it)
metrics_table <- counts_all %>%
  group_split(threshold) %>%
  map_dfr(metrics_from_counts) %>%
  mutate(across(c(accuracy, precision, recall, f1), ~ round(.x, 3)))
metrics_table
# A tibble: 3 × 5
  threshold accuracy precision recall    f1
      <dbl>    <dbl>     <dbl>  <dbl> <dbl>
1       0.2    0.914     0.86   0.949 0.902
2       0.5    0.935     0.923  0.923 0.923
3       0.8    0.946     0.947  0.923 0.935
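As an independent check on these numbers, the same metrics can be computed with the yardstick package. This is only a sketch, not part of the analysis above: it assumes yardstick is installed, and it orders the factor levels so that "1" (female) is the first level, which yardstick treats as the positive class by default.

# Hypothetical cross-check with yardstick at the 0.5 threshold;
# the results should match the 0.5 row of metrics_table above.
library(yardstick)

check <- penguins %>%
  mutate(
    truth    = factor(actual, levels = c(1, 0)),
    estimate = factor(as.integer(.pred_female > 0.5), levels = c(1, 0))
  )

bind_rows(
  accuracy(check, truth, estimate),
  precision(check, truth, estimate),
  recall(check, truth, estimate),
  f_meas(check, truth, estimate)
)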

Conclusion

This analysis demonstrates that classification model performance depends strongly on the chosen decision threshold. By examining the null error rate, confusion matrices, and performance metrics, it becomes clear that accuracy alone is not sufficient to evaluate a model. Lower thresholds increase recall by identifying more positive cases, while higher thresholds improve precision by reducing false positives. As shown in this assignment, there is no single “best” threshold; instead, the appropriate threshold should be selected based on the specific costs and priorities of different types of classification errors.
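To make this tradeoff visible across the full range of cutoffs, a short sweep like the following can be added. It is a sketch that reuses the confusion_counts() helper from the Base Code section; the 0.01 to 0.99 grid is an arbitrary choice.

# Sweep a grid of thresholds and plot precision and recall.
# Precision is undefined (NaN) at any cutoff where nothing is predicted positive.
threshold_grid <- seq(0.01, 0.99, by = 0.01)

sweep <- map_dfr(threshold_grid, ~ confusion_counts(penguins, .x)) %>%
  mutate(precision = TP / (TP + FP),
         recall    = TP / (TP + FN))

sweep %>%
  pivot_longer(c(precision, recall), names_to = "metric", values_to = "value") %>%
  ggplot(aes(x = threshold, y = value, color = metric)) +
  geom_line() +
  labs(title = "Precision and recall across decision thresholds",
       x = "threshold", y = "metric value")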