DATA 607 2B

Author

Desiree Thomas

Approach

When I was a teen, I volunteered at the NY Aquarium as a Docent. One of my animal specialties was the African Penguins (along with Sharks and Sea Lions). Fun times!

I also was a Surrender Prevention Coordinator for an animal shelter.

Potential Business/Non-Profit Problem

National Geographic Style: A conservation team received a grant to study the migration patterns of female African Penguins. They received a limited number of satellite trackers to attach to female penguins so the team can track their egg-laying migrations. These penguins are sexually monomorphic (males and females look alike), so sex has to be predicted. Falsely predicting that a penguin is female would cost them an entire satellite tracker, which is wasted grant money. I suppose you could also use this for a breeding program?

https://www.cbsnews.com/pittsburgh/news/2-african-penguin-chicks-national-aviary/

Outline: To complete this assignment, I completed the four tasks: Null Error Rate, Confusion Matrices at Multiple Thresholds, Performance Metrics, and Threshold Use Cases. Some challenges were understanding the math behind the metrics and what certain terms meant. There was a lot of self-study, researching, printing of the assignment (to physically write things down), and YouTube University watching.

Load Libraries and Data

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.2.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

df <- read_csv("https://raw.githubusercontent.com/desithomas/DATA607-2B/refs/heads/main/penguin_predictions.csv")

Rows: 93 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): .pred_class, sex
dbl (1): .pred_female

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(df)

Rows: 93
Columns: 3
$ .pred_female <dbl> 0.99217462, 0.95423945, 0.98473504, 0.18702056, 0.9947012…
$ .pred_class  <chr> "female", "female", "female", "male", "female", "female",…
$ sex          <chr> "female", "female", "female", "female", "female", "female…

1. Null Error Rate

Calculating the null error rate (majority-class error rate)

# Male penguins vs. Female penguins
sex_count <- table(df$sex)
print(sex_count)


female   male 
    39     54

majority_count <- max(sex_count)

total_count <- sum(sex_count)

null_accuracy <- majority_count / total_count

null_error_rate <- 1 - null_accuracy 

print(paste("Null Accuracy (Majority %):", round(null_accuracy, 4)))

[1] "Null Accuracy (Majority %): 0.5806"

print(paste("Null Error Rate (Minority %):", round(null_error_rate, 4)))

[1] "Null Error Rate (Minority %): 0.4194"

ggplot(df, aes(x = sex, fill = sex)) + geom_bar() + labs(title = "Distribution of Actual Class (sex)", y = "Count", x = "Sex") + theme_minimal()

Importance of Understanding the Null Error Rate and Accuracy

The null error rate is the percentage of the time you would be wrong if you always guessed the majority class. It is the baseline that your model's performance is measured against, not a measure of the model itself. A model that just predicts the majority class every time is not a model that is worth paying for or backing. High accuracy on its own is not the sole standard either: if the model's error rate is about the same as the null error rate, the model has learned nothing beyond the class balance. A model needs to significantly outperform (“out-guess”) this baseline.

In terms of our penguins and their researchers, a model that falsely predicts that a penguin is female costs an entire satellite tracker. If our null error rate is high (here it is about 42%, because 39 of the 93 birds are female) and our model isn’t significantly better than the baseline, then we are risking expensive equipment on a guess. For example, in our data there are 54 males and 39 females. Every eligible female that the model predicts as male is a penguin that could have been tracked but is not, which means a loss of potential data. As African penguins are endangered, this could also have negative consequences for conservation efforts.
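The comparison against the baseline can be sketched with the class counts printed above (39 females, 54 males). The six misclassifications come from the 0.5-threshold confusion matrix further down in this report; the variable names are only illustrative.

```r
# Class counts taken from the table() output above
sex_count <- c(female = 39, male = 54)

# Baseline: always guess the majority class (male)
null_error_rate <- 1 - max(sex_count) / sum(sex_count)

# Model at the 0.5 threshold: 3 false positives + 3 false negatives
model_error_rate <- (3 + 3) / sum(sex_count)

# Side-by-side comparison of the two error rates
round(c(null = null_error_rate, model = model_error_rate), 4)
```

At this threshold the model is wrong on roughly 6.5% of the birds, compared with about 41.9% for the always-male guess, which is the kind of gap the baseline is meant to reveal.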

Confusion Matrices at Multiple Thresholds

Probability thresholds of 0.2, 0.5, and 0.8 were used.

# 0.2 

df_0.2 <- df %>% 
  mutate(Predicted_threshold = ifelse(.pred_female > 0.2, "female", "male"))

confusion_0.2 <- table(Predicted = factor(df_0.2$Predicted_threshold, levels = c("female", "male")), Actual = factor(df_0.2$sex, levels = c("female", "male")))

print("Confusion Matrix at the 0.2 Threshold")

[1] "Confusion Matrix at the 0.2 Threshold"

print(confusion_0.2)

         Actual
Predicted female male
   female     37    6
   male        2   48

#0.5

df_0.5 <- df %>% 
  mutate(Predicted_threshold = ifelse(.pred_female > 0.5, "female", "male"))

confusion_0.5 <- table(Predicted = factor(df_0.5$Predicted_threshold, levels = c("female", "male")), Actual = factor(df_0.5$sex, levels = c("female", "male")))

print("Confusion Matrix at the 0.5 Threshold")

[1] "Confusion Matrix at the 0.5 Threshold"

print(confusion_0.5)

         Actual
Predicted female male
   female     36    3
   male        3   51

# 0.8

df_0.8 <- df %>% 
  mutate(Predicted_threshold = ifelse(.pred_female > 0.8, "female", "male"))

confusion_0.8 <- table(Predicted = factor(df_0.8$Predicted_threshold, levels = c("female", "male")), Actual = factor(df_0.8$sex, levels = c("female", "male")))

print("Confusion Matrix at the 0.8 Threshold")

[1] "Confusion Matrix at the 0.8 Threshold"

print(confusion_0.8)

         Actual
Predicted female male
   female     36    2
   male        3   52
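
Since the three blocks above differ only in the cutoff, the same steps could be wrapped in one helper. This is a sketch, not the code used for this report: `make_confusion()` and the four-row `toy` data frame are hypothetical names and made-up values, and the column names assume the layout shown by glimpse(df).

```r
# Hypothetical helper: build a Predicted x Actual table for any cutoff
make_confusion <- function(data, threshold) {
  lv <- c("female", "male")
  predicted <- ifelse(data$.pred_female > threshold, "female", "male")
  table(
    Predicted = factor(predicted, levels = lv),
    Actual    = factor(data$sex, levels = lv)
  )
}

# Toy data (made up) with the same columns as df
toy <- data.frame(
  .pred_female = c(0.95, 0.60, 0.30, 0.10),
  sex          = c("female", "female", "female", "male")
)

make_confusion(toy, 0.5)
```

With the real data, a call such as lapply(c(0.2, 0.5, 0.8), function(t) make_confusion(df, t)) would be expected to reproduce the three tables above, assuming df is loaded as shown earlier.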

Performance metrics

#Table

calculate_perform_metrics <- function(cm, threshold) {
  TP <- cm[1,1] # Predicted Female, Actual Female
  FP <- cm[1,2] # Predicted Female, Actual Male
  FN <- cm[2,1] # Predicted Male, Actual Female
  TN <- cm[2,2] # Predicted Male, Actual Male
  
  #Formulas and F1 Score 
  accuracy <- (TP + TN) / sum(cm)
  precision <- TP / (TP + FP)
  recall <- TP / (TP + FN)
  f1 <- 2 * (precision * recall) / (precision + recall)
  
  return(data.frame(
    Threshold = threshold,
    Accuracy = round(accuracy, 4),
    Precision = round(precision, 4),
    Recall = round(recall, 4),
    F1_Score = round(f1, 4)
  ))
}


metrics_summary <- rbind(
  calculate_perform_metrics(confusion_0.2, 0.2),
  calculate_perform_metrics(confusion_0.5, 0.5),
  calculate_perform_metrics(confusion_0.8, 0.8)
)

knitr::kable(metrics_summary, caption = "Model Performance Metrics Across Thresholds")

Model Performance Metrics Across Thresholds
Threshold	Accuracy	Precision	Recall	F1_Score
0.2	0.9140	0.8605	0.9487	0.9024
0.5	0.9355	0.9231	0.9231	0.9231
0.8	0.9462	0.9474	0.9231	0.9351

So, what do these have to do with our penguins and researchers? Well, when our model predicts that a penguin is female, how often is it right? This is the TP / (TP + FP) formula, aka our model’s precision. If our model has low precision, the researchers would place many trackers on male birds instead of female birds. If our model has high precision, our researchers are placing trackers on female birds!

Now, do you remember the count of female birds that we found in the beginning? Recall, TP / (TP + FN), measures how many of the actual female penguins our model found. How this affects tracking and the effort: if our recall is too low, the researchers would miss many of the females. The model would declare a female penguin a “male”, and that could leave our study too small. If our model has a high recall, it means we found almost every female penguin in the colony.

Accuracy tells our researchers how often the model was correct overall. Important! Imagine if you told National Geographic that your model’s accuracy was 0.2 and you wanted top dollar for your model! For an endangered species! The formula for accuracy is (TP + TN) / total. Accuracy alone can mislead, though. About 58% of the penguins in this colony are male, so if the model guessed male for every bird, it would have 58% accuracy. However, the recall would be 0% because we would have found 0 female birds. The study would not be able to continue because there would be “no female birds to place trackers on.”
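To make the "always guess male" case concrete, here is a small sketch built from the class counts in this data set (39 females, 54 males). The object name `cm_all_male` is only illustrative.

```r
# Confusion matrix if every penguin is predicted male
# (rows = Predicted, columns = Actual; matrix() fills column by column)
cm_all_male <- matrix(
  c(0, 39,   # Actual female: 0 predicted female, 39 predicted male
    0, 54),  # Actual male:   0 predicted female, 54 predicted male
  nrow = 2,
  dimnames = list(Predicted = c("female", "male"),
                  Actual    = c("female", "male"))
)

# Accuracy looks respectable because males are the majority
accuracy <- sum(diag(cm_all_male)) / sum(cm_all_male)

# Recall for females is zero: none of the 39 females were found
recall <- cm_all_male["female", "female"] / sum(cm_all_male[, "female"])

# Precision is 0 / 0, which R reports as NaN (no bird was predicted female)
precision <- cm_all_male["female", "female"] / sum(cm_all_male["female", ])

round(accuracy, 4)
recall
precision
```

The accuracy equals the null accuracy from section 1, while recall is 0 and precision is undefined, so accuracy by itself would hide the fact that no tracker could ever be placed.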

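This leads us to the F1 score, the harmonic mean of precision and recall: 2 * (precision * recall) / (precision + recall). Because the harmonic mean is pulled toward the smaller of the two values, F1 is high only when precision and recall are both high. Essentially, it is a balanced summary of the two. In our table, F1 is highest at the 0.8 threshold (0.9351).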
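A short worked example, using the counts from the 0.2-threshold confusion matrix, shows how the harmonic mean behaves. The second pair of values is a made-up extreme case, not a result from this data, and `f1_score()` is just an illustrative helper name.

```r
# Harmonic mean of precision and recall
f1_score <- function(precision, recall) {
  2 * precision * recall / (precision + recall)
}

# 0.2 threshold from the confusion matrix above: TP = 37, FP = 6, FN = 2
precision_0.2 <- 37 / (37 + 6)
recall_0.2    <- 37 / (37 + 2)
f1_0.2        <- f1_score(precision_0.2, recall_0.2)

# Hypothetical lopsided model: perfect precision, very poor recall
f1_lopsided   <- f1_score(1.0, 0.1)
mean_lopsided <- mean(c(1.0, 0.1))

round(c(f1_0.2 = f1_0.2, f1_lopsided = f1_lopsided, mean_lopsided = mean_lopsided), 4)
```

For the lopsided case the plain average is 0.55 while F1 is about 0.18, which is why F1 does not let one strong metric cover for a weak one.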

Threshold Use Cases (CIRDC - Canine Infectious Respiratory Disease Complex)

0.2 Threshold: This favors Recall. A high-population animal shelter currently has CIRDC making the rounds through its dog population. In this case, a False Negative leaves a sick dog in the General Population, where it can infect the other dogs. Ideally, sick dogs go into ISO (isolation) so they do not spread CIRDC to healthy dogs. A single False Negative could lead to a shelter-wide outbreak. A False Positive is not ideal, but it is still a bit better than having another ~70 healthy dogs infected. This is a big reason why shelters and rescues ask you to quarantine your adopted pets before introducing them to your current pets at home.

0.8 Threshold: This favors Precision. The suspected dog is only moved to ISO if the model gives at least an 80% probability that the dog has CIRDC. For a real-life example, the model may look for something like coughing AND nasal discharge, rather than just coughing. This may leave some false negatives in their kennels within the General Population. On the other hand, ISO might now have enough space for severe cases, rather than filling up with suspected CIRDC alone!
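One way to make the trade-off between the two thresholds concrete is to attach a cost to each kind of error and total it per threshold. The sketch below reuses the false positive and false negative counts from the penguin confusion matrices above, because this report has no CIRDC data. The cost weights (10 and 1) and the helper name `total_cost()` are hypothetical, chosen only to illustrate the idea.

```r
# FP and FN counts read off the three confusion matrices above
errors <- data.frame(
  Threshold = c(0.2, 0.5, 0.8),
  FP = c(6, 3, 2),
  FN = c(2, 3, 3)
)

# Hypothetical cost model: weighted sum of the two error types
total_cost <- function(errors, cost_fp, cost_fn) {
  cost_fp * errors$FP + cost_fn * errors$FN
}

# Outbreak scenario: a missed case (FN) is assumed 10x as costly as a false alarm
errors$cost_fn_heavy <- total_cost(errors, cost_fp = 1, cost_fn = 10)

# Limited-ISO scenario: a false alarm (FP) is assumed 10x as costly as a miss
errors$cost_fp_heavy <- total_cost(errors, cost_fp = 10, cost_fn = 1)

# Threshold with the lowest total cost in each scenario
best_fn_heavy <- errors$Threshold[which.min(errors$cost_fn_heavy)]
best_fp_heavy <- errors$Threshold[which.min(errors$cost_fp_heavy)]
errors
```

Under these assumed weights the FN-heavy scenario is cheapest at 0.2 and the FP-heavy scenario is cheapest at 0.8, which matches the reasoning in the two use cases; different weights could shift the answer.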

LLM Transcript

User: How do I render the data for the performance metrics? Do a guide and walkthrough along with the syntax. (Attached screenshot of relevant code)

LLM: To render your performance metrics professionally, we will use the knitr::kable() function. This is the industry-standard “engine” for converting raw R data frames into clean, formatted HTML tables for your report.

Here is the complete guide, including the code to generate the data and the specific syntax to render it.

knitr::kable(
  metrics_summary,
  digits = 4,                                    # 1. Rounds all numbers to 4 decimal places
  caption = "Performance Metrics by Threshold",  # 2. Adds a professional title
  align = "c",                                   # 3. Centers the text in columns
  col.names = c("Threshold", "Accuracy", "Precision", "Recall", "F1 Score")  # 4. Renames columns
)

Step 2: The Syntax Walkthrough (The “Deep Dive”) You asked for the syntax guide. Here is exactly what the arguments inside knitr::kable() are doing:

metrics_summary (The Data)

What it is: This is the data frame object you created in Step B.

Why: You must pass the raw data first. kable cannot work without an object to format.

digits = 4

What it does: It automatically rounds every numeric column in your table to 4 decimal places.

Why use it? Without this, your table will show 0.4193548387…, which looks messy and unprofessional. This cleans it up instantly without you needing to run round() on every single variable manually.

caption = "…"

What it does: Adds a numbered table title (e.g., Table 1: Performance Metrics…) above the grid.

Why use it? In academic and professional reports, every table must have a title so the reader knows what they are looking at before they read the numbers.

align = "c"

What it does: Sets the alignment of the columns. "c" = Center, "l" = Left, "r" = Right.

Why use it? Numbers are easier to compare when they are centered or right-aligned. Default left-alignment often makes tables look jagged.

col.names = c(…)

What it does: Replaces the variable names (like F1_Score) with pretty headers (like F1 Score).

Why use it? Variable names often have underscores (e.g., pred_female). This argument lets you fix the labels for the human reader without changing the underlying code.