week_2b

Approach

Loading tidyverse

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Viewing the data from github

df_raw <- read.csv(
  'https://raw.githubusercontent.com/acatlin/data/refs/heads/master/penguin_predictions.csv'
)

glimpse(df_raw)
Rows: 93
Columns: 3
$ .pred_female <dbl> 0.99217462, 0.95423945, 0.98473504, 0.18702056, 0.9947012…
$ .pred_class  <chr> "female", "female", "female", "male", "female", "female",…
$ sex          <chr> "female", "female", "female", "female", "female", "female…

Fields

  • pred_female
  • pred_class
  • sex

Tasks

  • Null Error Rate
    • ‘[M]ajority-class error rate.’
      • I had help from ChatGPT and it told me that the error rate = majority class % - total %.
        • So, if sex = 100, 2 uniques (men/women), men = 55 & women = 45, the null error rate would be 45%.
      • Seems like a unique count for the sex field, then a division into percentage.
temp <- df_raw |>
  count(sex) |>
  mutate(percent = round(((n / sum(n))*100), 0))
temp
     sex  n percent
1 female 39      42
2   male 54      58
  • Create a plot showing distribution of sex
    • simple ggplot geom_bar
  • Explain why it’s important to know null error rate
    • It defines the accuracy rate if the model just defaults to predicting the majority value.
    • Why is that important? I thought randomness would be 50% across the board, but that’s not true unless the data is % 50 equal. So, this method takes the higher % of the data and says, if the prediction model can’t achieve above that %, we can’t determine if it’s actually predicting.
  • Confusion Matrices
    • Using probability thresholds of 0.2, 0.5, and 0.8, compute:
      • True Positives (TP)
        • Measures when the class is true and predicted is true.
      • True Negatives (TN)
        • Measures when the class is false and predicted is false.
      • False Positives (FP)
        • Measures when the class is false and predicted is true.
      • False Negatives (FN)
        • Measures when the class is true and predicted is false.
    • Present results as three confusion matrices.
      • So, this would take the pred_female vs the actual.
        • We then use the thresholds to calculate new columns with different n values.
        • Then we use those new fields vs the actual for the confusion matrix.

You may compute these manually or use existing functions.

  • Performance Metrics
    • We can just use the values from the confusion matrices (?)
  • Threshold Use Cases
    • I guess we can analyze the data from the thresholds, determine the model accuracy, and think of reasons why we would use a lower threshold vs a higher threshold.

Code-base

glimpse(df_raw)
Rows: 93
Columns: 3
$ .pred_female <dbl> 0.99217462, 0.95423945, 0.98473504, 0.18702056, 0.9947012…
$ .pred_class  <chr> "female", "female", "female", "male", "female", "female",…
$ sex          <chr> "female", "female", "female", "female", "female", "female…

Null Error Rate

df <- df_raw 
df <- df |>
  count(sex) |>
  mutate(percent = round(((n / sum(n))*100), 0))
df
     sex  n percent
1 female 39      42
2   male 54      58

Calculate Null Error Rate

null_error_rate <- min(df$n) / sum(df$n)
sprintf("The null error rate is %.02f",(null_error_rate))
[1] "The null error rate is 0.42"

Plot distribution of class

ggplot(df, aes(x = sex, y = n, fill = sex)) +
  geom_col() +
  labs(
    title = "Distribution of Sex",
    subtitle = "The count of each sex for the study.",
    x = "Sex",
    y = "Count"
  )

Why is null error rate important?

Null error rate is the chance the model has at being wrong if it only chooses the majority class. It’s the baseline metric that the model needs to outperform in order to say that it’s actually learning meaningful patterns to inform it’s predictions rather than just choosing the majority class.

Confusion Matrices & Performance Metrics

df <- df_raw |>
  rename(
    pred_female = .pred_female,
    pred_class = .pred_class,
    truth = sex
    ) |>
  mutate(truth_bool = truth == "female")
glimpse(df)
Rows: 93
Columns: 4
$ pred_female <dbl> 0.99217462, 0.95423945, 0.98473504, 0.18702056, 0.99470123…
$ pred_class  <chr> "female", "female", "female", "male", "female", "female", …
$ truth       <chr> "female", "female", "female", "female", "female", "female"…
$ truth_bool  <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…

The calculation for the prediction should be:

threshold_prediction <- function(df_pred, threshold) {
  
  pred  <- df_pred$pred_female >= threshold
  reality <- df_pred$truth_bool
  
  tp <-   sum(pred & reality)
  fp <-   sum(pred & !reality)
  tn <-   sum(!pred & !reality)
  fn <-   sum(!pred & reality)
  
  tibble(
    threshold = threshold,
    true_positive= tp,
    false_positive = fp,
    true_negative = tn,
    false_negative = fn
  )
}
threshold = 1
# run threshold_prediction
p <- threshold_prediction(df, threshold)
p
# A tibble: 1 × 5
  threshold true_positive false_positive true_negative false_negative
      <dbl>         <int>          <int>         <int>          <int>
1         1             0              0            54             39

The calculations for the performance metrics:

Accuracy

  • Proportion of correct predictions of all cases.
  • Correct predictions / Total predictions
    • (true_positive + true_negative) / sum(true_positive, false_positive, true_negative, false_negative)
    • 100% Accuracy is also 1 - null error rate.
accuracy <- function(tp, fp, tn, fn){(tp + tn) / (tp + fp + tn + fn)}

Precision

  • Proportion of correct positive prediction targets of all predicted positive targets.
  • True positive / True positive + False positive
precision <- function(tp, fp){tp / (tp + fp)}

Recall

  • Proportion of correct positive prediction targets of all actual positive targets.
  • True positive / True positive + False negative
recall <- function(tp, fn){tp / (tp + fn)}

F1 Score

  • Harmonic mean of precision and recall
  • It’s really an anti-cheat.
    • You can be 100% correct in your predictions, but miss 90% of the actual targets.
    • You can get 100% of the actual targets, but be 90% wrong.
f1  <- function(tp, fp, fn){
  p <- precision(tp, fp)
  r <- recall(tp, fn)
  2 * (p * r) / (p + r)
}

We can put those metrics into our original function

threshold_prediction <- function(df_pred, threshold) {
  
  pred  <- df_pred$pred_female >= threshold
  reality <- df_pred$truth_bool
  
  tp <-   sum(pred & reality)
  fp <-   sum(pred & !reality)
  tn <-   sum(!pred & !reality)
  fn <-   sum(!pred & reality)
  
  tib <- tibble(
    threshold,
    true_positive= tp,
    false_positive = fp,
    true_negative = tn,
    false_negative = fn,
    accuracy = accuracy(tp, fp, tn, fn),
    precision = precision(tp, fp),
    recall = recall(tp, fn),
    f1 = f1(tp, fp, fn)
  )
  tib |> 
    mutate(across(c(accuracy, precision, recall, f1), ~ round(.x, 2)))
}
p <- threshold_prediction(df, threshold)
p
# A tibble: 1 × 9
  threshold true_positive false_positive true_negative false_negative accuracy
      <dbl>         <int>          <int>         <int>          <int>    <dbl>
1         1             0              0            54             39     0.58
# ℹ 3 more variables: precision <dbl>, recall <dbl>, f1 <dbl>

I asked ChatGPT to create a ggplot from the tibble because it seemed complicated.

confusion_plot <- function(x) {

  cm <- tibble(
    Actual = c("Positive","Positive","Negative","Negative"),
    Pred   = c("Positive","Negative","Positive","Negative"),
    n = c(
      x$true_positive,
      x$false_negative,
      x$false_positive,
      x$true_negative
    )
  )

  ggplot(cm, aes(x = Pred, y = Actual, fill = n)) +
    geom_tile() +
    geom_text(aes(label = n), color = "white", size = 6) +
    scale_fill_gradient(low = "grey70", high = "steelblue") +
    labs(
      title = "Confusion Matrix",
      subtitle = sprintf("Threshold %.1f", threshold),
      x = "Predicted",
      y = "Actual"
    ) +
    theme_minimal() +
    theme(
      plot.title = element_text(hjust = 0.5),
      plot.subtitle = element_text(hjust = 0.5)
      )
}

To make it cleaner:

evaluate_threshold <- function(df, threshold) {
  m <- threshold_prediction(df, threshold)
  print(m[,1:5])
  print(m[, 6:ncol(m)])
  confusion_plot(m)
}

Threshold @ 0.2

threshold <- 0.2
evaluate_threshold(df, threshold)
# A tibble: 1 × 5
  threshold true_positive false_positive true_negative false_negative
      <dbl>         <int>          <int>         <int>          <int>
1       0.2            37              6            48              2
# A tibble: 1 × 4
  accuracy precision recall    f1
     <dbl>     <dbl>  <dbl> <dbl>
1     0.91      0.86   0.95   0.9

Threshold @ 0.5

threshold <- 0.5
evaluate_threshold(df, threshold)
# A tibble: 1 × 5
  threshold true_positive false_positive true_negative false_negative
      <dbl>         <int>          <int>         <int>          <int>
1       0.5            36              3            51              3
# A tibble: 1 × 4
  accuracy precision recall    f1
     <dbl>     <dbl>  <dbl> <dbl>
1     0.94      0.92   0.92  0.92

Threshold @ 0.8

threshold <- 0.8
evaluate_threshold(df, threshold)
# A tibble: 1 × 5
  threshold true_positive false_positive true_negative false_negative
      <dbl>         <int>          <int>         <int>          <int>
1       0.8            36              2            52              3
# A tibble: 1 × 4
  accuracy precision recall    f1
     <dbl>     <dbl>  <dbl> <dbl>
1     0.95      0.95   0.92  0.94

Threshold Use Cases

0.2 Threshold usecase

When your prediction models cares less about precision and more about recall. This would be good for cancer screening, where you want to capture as many potential cases because missing even one patient would be costly.

0.8 Threshold usecase

When your prediction models cares more about precision than recall. Using the cancer example, this would be good for experimental cancer treatments where the patient’s must meet strict eligibility criteria and false positives would be costly.

Conceptual Roadmap

Binary Classification Basics

Yeah, the threshold determines whether the prediction probability is rejected or not.

Baseline Thinking (Null Error Rate)

It was kinda difficult to understand this, but it makes sense. If the model just chose the class with the majority, then it would have an error rate equal to the proportion that was not chosen. So, 70/100 are male, and the model only chose male, it would be 30% error rate. The model would have to outperform that % to be considered meaningful, even by a percentage point.

It also wouldn’t be the inverse, because in that instance it would be a 70% error rate from picking the minority. However, that goes against the idea of baseline prediction performance, which is determined by a naive classifier, such as always predicting the most common class. I agree with that idea, because I’d do the same without any useful signals to weigh my predictions.

Also, if the minority class is the target for predictions, the error rate matters less because the metrics make up for it.

Confusion Matrix as Core Object

I see that, the fundamentals inform metrics.

Threshold Tradeoffs

Threshold depends on context, I agree. It’s similar to the usecase of medical patients discussed in the Threshold Use Cases subsection.

Practical Guidance

Explore the Data

knitr::include_graphics("class_imbalance.png")

plot_df <- df |>
  count(truth, name = "n") |>
  mutate(
    prop = n / sum(n),
    label = scales::percent(prop, accuracy = 0.1)
  )

plot_df |>
  ggplot(aes(x = truth, y = prop, fill = truth)) +
  geom_col(width = 0.7) +
  geom_text(aes(label = label), vjust = -0.5) +
  scale_y_continuous(
    labels = scales::percent,
    expand = expansion(mult = c(0, 0.2))
  ) +
  labs(
    title = "Class Distribution for Sex",
    y = "Proportion",
    x = NULL
  ) +
  guides(fill = "none") +
  theme(
    panel.background = element_blank(),
    axis.text = element_text(face = "bold"),
    axis.title.y = element_text(margin = margin(r = 25)),
    plot.title = element_text(hjust = 0.5, face = "bold", margin = margin(b = 35)),
    plot.margin = margin(t = 20, r = 40, b = 20, l = 40),
    panel.grid.major.y = element_line(color = "grey"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor = element_blank()
  )

The graph from the example seems fine. The only thing would be to convert the data labels to .2f. I just made a clean chart. It’s simple.

Understanding Probability vs Class

I actually did this earlier in the codebase where I put them into a singular function, but I can do it in a summary table!

threshold = 0.8

df_extra <- df |>
  select(pred_female, truth, truth_bool) |>
  mutate(target = ifelse(truth_bool == TRUE, 1,0)) |>
  mutate(pred_class_2 = ifelse(pred_female > threshold, 1, 0))

glimpse(df_extra)
Rows: 93
Columns: 5
$ pred_female  <dbl> 0.99217462, 0.95423945, 0.98473504, 0.18702056, 0.9947012…
$ truth        <chr> "female", "female", "female", "female", "female", "female…
$ truth_bool   <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU…
$ target       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ pred_class_2 <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, …

Build Confusion Matrices

df_confusion <- df_extra |>
  summarize(
    tp = sum(pred_class_2 == 1 & target == 1),
    fp = sum(pred_class_2 == 1 & target == 0),
    tn = sum(pred_class_2 == 0 & target == 0),
    fn = sum(pred_class_2 == 0 & target == 1)
  )
glimpse(df_confusion)
Rows: 1
Columns: 4
$ tp <int> 36
$ fp <int> 2
$ tn <int> 52
$ fn <int> 3

Drive Metrics

df_metrics <- df_confusion |>
  summarize(
    accuracy = (tp + tn) / (tp + fp + tn + fn),
    precision = tp / (tp + fp),
    recall = tp / (tp + fn),
    f1 = (tp * 2) / (2 * tp + fp + fn)
  )
df_metrics
   accuracy precision    recall        f1
1 0.9462366 0.9473684 0.9230769 0.9350649