Assignment #2B

Author

Fraz Aslam

Approach

My main approach to this assignment will be to properly understand Null Error Rate, Confusion Matrices and thresholds. Loading the data into R from a .csv should not prove to be a problem but being able to apply the proper functions to acheive the task will be my greatest hurdle in this assignment. I will leverage Claude to help me understand the best way to apprach datasets with classification problems and machine learning. Im sure it will prove to be very useful to truly understand what it is im trying to do.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.2.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(gt)

url <- "https://raw.githubusercontent.com/acatlin/data/refs/heads/master/penguin_predictions.csv"

df <- read_csv(
  file = url,
  show_col_types = FALSE,
  progress = FALSE
)

glimpse(df)

Rows: 93
Columns: 3
$ .pred_female <dbl> 0.99217462, 0.95423945, 0.98473504, 0.18702056, 0.9947012…
$ .pred_class  <chr> "female", "female", "female", "male", "female", "female",…
$ sex          <chr> "female", "female", "female", "female", "female", "female…

Null Error Rate

Null Error Rate is crucial to know because its the baseline that a real model must beat. if the model is lower than it is not useful.

table(df$sex)


female   male 
    39     54

null_error_rate <- min(table(df$sex)) / nrow(df)
null_error_rate

[1] 0.4193548

ggplot(df, aes(x = factor(sex))) +
  geom_bar(fill = "steelblue") +
  labs(
    title = "Class Distribution of Sex",
    x = "Sex",
    y = "Count"
  )

threshold <- 0.5 

df <- df %>%
  mutate(pred_05 = ifelse(.pred_female > threshold, "female", "male"))

df

# A tibble: 93 × 4
   .pred_female .pred_class sex    pred_05
          <dbl> <chr>       <chr>  <chr>  
 1        0.992 female      female female 
 2        0.954 female      female female 
 3        0.985 female      female female 
 4        0.187 male        female male   
 5        0.995 female      female female 
 6        1.000 female      female female 
 7        0.959 female      female female 
 8        1.000 female      female female 
 9        1.000 female      female female 
10        0.339 male        female male   
# ℹ 83 more rows

Building the Confusion Matrix

TP <- sum(df$pred_05 == "female" & df$sex == "female") 
FP <- sum(df$pred_05 == "female" & df$sex == "male")   
TN <- sum(df$pred_05 == "male"   & df$sex == "male")   
FN <- sum(df$pred_05 == "male"   & df$sex == "female")  


conf_mat_05 <- matrix(
  c(TP, FP, FN, TN),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(
    "Predicted" = c("Female", "Male"),
    "Actual" = c("Female", "Male")
  )
)

conf_mat_05

         Actual
Predicted Female Male
   Female     36    3
   Male        3   51

threshold <- 0.2
df <- df %>%
  mutate(pred_02 = ifelse(.pred_female > threshold, "female", "male"))

TP02 <- sum(df$pred_02 == "female" & df$sex == "female")
FP02 <- sum(df$pred_02 == "female" & df$sex == "male")
TN02 <- sum(df$pred_02 == "male"   & df$sex == "male")
FN02 <- sum(df$pred_02 == "male"   & df$sex == "female")

conf_mat_02 <- matrix(
  c(TP02, FP02, FN02, TN02),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(
    "Predicted" = c("Female", "Male"),
    "Actual" = c("Female", "Male")
  )
)

conf_mat_02

         Actual
Predicted Female Male
   Female     37    6
   Male        2   48

threshold <- 0.8
df <- df %>%
  mutate(pred_08 = ifelse(.pred_female > threshold, "female", "male"))

TP08 <- sum(df$pred_08 == "female" & df$sex == "female")
FP08 <- sum(df$pred_08 == "female" & df$sex == "male")
TN08 <- sum(df$pred_08 == "male"   & df$sex == "male")
FN08 <- sum(df$pred_08 == "male"   & df$sex == "female")

conf_mat_08 <- matrix(
  c(TP08, FP08, FN08, TN08),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(
    "Predicted" = c("Female", "Male"),
    "Actual" = c("Female", "Male")
  )
)

conf_mat_08

         Actual
Predicted Female Male
   Female     36    2
   Male        3   52

# Metrics for threshold 0.2
accuracy_02  <- (TP02 + TN02) / (TP02 + FP02 + TN02 + FN02)
precision_02 <- TP02 / (TP02 + FP02)
recall_02    <- TP02 / (TP02 + FN02)
f1_02        <- 2 * precision_02 * recall_02 / (precision_02 + recall_02)

# Metrics for threshold 0.5
accuracy_05  <- (TP + TN) / (TP + FP + TN + FN)
precision_05 <- TP / (TP + FP)
recall_05    <- TP / (TP + FN)
f1_05        <- 2 * precision_05 * recall_05 / (precision_05 + recall_05)

# Metrics for threshold 0.8
accuracy_08  <- (TP08 + TN08) / (TP08 + FP08 + TN08 + FN08)
precision_08 <- TP08 / (TP08 + FP08)
recall_08    <- TP08 / (TP08 + FN08)
f1_08        <- 2 * precision_08 * recall_08 / (precision_08 + recall_08)

# Combine into final table
final_metrics <- tibble(
  Threshold = c(0.2, 0.5, 0.8),
  Accuracy  = c(accuracy_02, accuracy_05, accuracy_08),
  Precision = c(precision_02, precision_05, precision_08),
  Recall    = c(recall_02, recall_05, recall_08),
  F1        = c(f1_02, f1_05, f1_08)
)

final_metrics

# A tibble: 3 × 5
  Threshold Accuracy Precision Recall    F1
      <dbl>    <dbl>     <dbl>  <dbl> <dbl>
1       0.2    0.914     0.860  0.949 0.902
2       0.5    0.935     0.923  0.923 0.923
3       0.8    0.946     0.947  0.923 0.935

Threshold Use Cases

0.2 Threshold - This is where the threshold is low, you will catch more positives or “false alarms” this is useful when screening for diseases. False positives are better than false negatives or not catching positives

0.8 Threshold - This is where the threshold is high. Similar to medical field we can take the example of surgery. You should only do surgery if the patient needs it or displays a high confidence for it.