week2b_classification_metrics.qmd

Author

Sinem K Moschos

Week 2B Classification Metrics

Step 1: Load the data

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
url <- "https://raw.githubusercontent.com/acatlin/data/refs/heads/master/penguin_predictions.csv"
df <- read_csv(url)
Rows: 93 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): .pred_class, sex
dbl (1): .pred_female

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(df)
Rows: 93
Columns: 3
$ .pred_female <dbl> 0.99217462, 0.95423945, 0.98473504, 0.18702056, 0.9947012…
$ .pred_class  <chr> "female", "female", "female", "male", "female", "female",…
$ sex          <chr> "female", "female", "female", "female", "female", "female…
head(df)
# A tibble: 6 × 3
  .pred_female .pred_class sex   
         <dbl> <chr>       <chr> 
1        0.992 female      female
2        0.954 female      female
3        0.985 female      female
4        0.187 male        female
5        0.995 female      female
6        1.000 female      female

Step 2: Null error rate

# count the actual classes
class_counts <- table(df$sex)
class_counts

female   male 
    39     54 
# majority class (most common label)
majority_class <- names(which.max(class_counts))
majority_class
[1] "male"
# null error rate = error if we always predict the majority class
null_error_rate <- mean(df$sex != majority_class)
null_error_rate
[1] 0.4193548
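
To check that the model actually beats this baseline, a quick comparison of the model's error rate against the null error rate might look like the sketch below, which uses the .pred_class column (the model's default 0.5-cutoff predictions):

# error rate of the model's default predictions vs. the null baseline
model_error <- mean(df$.pred_class != df$sex)
model_error
null_error_rate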

Step 3: Class distribution plot

# ggplot2 is already attached via library(tidyverse); reloading it is harmless
library(ggplot2)

ggplot(df, aes(x = sex)) +
  geom_bar() +
  labs(
    title = "Class Distribution of Sex",
    x = "Sex",
    y = "Count"
  )
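
For a numeric companion to the bar chart, a quick count-and-proportion table (a sketch using dplyr, which is already attached via the tidyverse) could be:

# class counts and proportions behind the bar chart
df %>%
  count(sex) %>%
  mutate(prop = n / sum(n))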

Step 4: Confusion matrices at thresholds 0.2, 0.5, and 0.8

# Build a one-row confusion matrix for a given probability threshold,
# treating "female" as the positive class.
make_confusion <- function(threshold) {
  pred <- ifelse(df$.pred_female > threshold, "female", "male")
  
  TP <- sum(pred == "female" & df$sex == "female")
  FP <- sum(pred == "female" & df$sex == "male")
  TN <- sum(pred == "male" & df$sex == "male")
  FN <- sum(pred == "male" & df$sex == "female")
  
  tibble(
    threshold = threshold,
    TP = TP,
    FP = FP,
    TN = TN,
    FN = FN
  )
}

conf_02 <- make_confusion(0.2)
conf_05 <- make_confusion(0.5)
conf_08 <- make_confusion(0.8)

conf_02
# A tibble: 1 × 5
  threshold    TP    FP    TN    FN
      <dbl> <int> <int> <int> <int>
1       0.2    37     6    48     2
conf_05
# A tibble: 1 × 5
  threshold    TP    FP    TN    FN
      <dbl> <int> <int> <int> <int>
1       0.5    36     3    51     3
conf_08
# A tibble: 1 × 5
  threshold    TP    FP    TN    FN
      <dbl> <int> <int> <int> <int>
1       0.8    36     2    52     3
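
As a sanity check, the counts at a single threshold can be cross-tabulated with base R's table(); this sketch assumes the same "predict female when .pred_female > threshold" rule used in make_confusion():

# two-way table of predicted vs. actual labels at the 0.5 threshold
pred_05 <- ifelse(df$.pred_female > 0.5, "female", "male")
table(predicted = pred_05, actual = df$sex)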

Step 5: Metrics table

# Compute accuracy, precision, recall, and F1 (female = positive class)
# from a one-row confusion matrix produced by make_confusion().
calc_metrics <- function(conf) {
  with(conf, {
    accuracy  <- (TP + TN) / (TP + FP + TN + FN)
    precision <- TP / (TP + FP)
    recall    <- TP / (TP + FN)
    f1        <- 2 * precision * recall / (precision + recall)
    
    tibble(
      threshold = threshold,
      accuracy = accuracy,
      precision = precision,
      recall = recall,
      f1 = f1
    )
  })
}

metrics_02 <- calc_metrics(conf_02)
metrics_05 <- calc_metrics(conf_05)
metrics_08 <- calc_metrics(conf_08)

metrics_table <- bind_rows(metrics_02, metrics_05, metrics_08)

metrics_table
# A tibble: 3 × 5
  threshold accuracy precision recall    f1
      <dbl>    <dbl>     <dbl>  <dbl> <dbl>
1       0.2    0.914     0.860  0.949 0.902
2       0.5    0.935     0.923  0.923 0.923
3       0.8    0.946     0.947  0.923 0.935
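
The three calls above could also be collapsed into a single pipeline with purrr (attached with the tidyverse). This is just a sketch of an alternative that reuses the functions defined earlier:

# build the same metrics table in one step
map_dfr(c(0.2, 0.5, 0.8), ~ calc_metrics(make_confusion(.x)))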

Step 6: Threshold use cases

  • Why the null error rate matters: The null error rate is the error we would make if we always predicted the most common class. If a model cannot beat this baseline, it is not useful.

  • When a 0.2 threshold is better: A lower threshold is better when missing a true positive is expensive, so we want to catch more positives even if we raise more false alarms.

Example: medical screening, where it is safer to flag more people for follow-up testing.

  • When a 0.8 threshold is better: A higher threshold is better when a false positive is expensive, so we only predict positive when we are very confident.

Example: approving a large loan, where a false approval can cost a lot of money. The threshold sweep sketched below shows how precision and recall trade off between these two extremes.
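
To visualize the trade-off described in these bullets, one option is to sweep a grid of thresholds and plot precision and recall side by side. The sketch below reuses make_confusion() and calc_metrics() from above; the grid spacing is an arbitrary choice:

# sweep thresholds and plot how precision and recall trade off
thresholds <- seq(0.05, 0.95, by = 0.05)
sweep_metrics <- map_dfr(thresholds, ~ calc_metrics(make_confusion(.x)))

ggplot(sweep_metrics, aes(x = threshold)) +
  geom_line(aes(y = precision, color = "precision")) +
  geom_line(aes(y = recall, color = "recall")) +
  labs(
    title = "Precision and Recall Across Thresholds",
    x = "Threshold",
    y = "Value",
    color = "Metric"
  )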