For this assignment, I plan to analyze the performance of a binary classification model predicting whether a penguin is female. The goal is to understand how probability thresholds affect evaluation metrics, such as accuracy, precision, recall, and F1 score. The dataset contains model-predicted probabilities (.pred_female), predicted classes (.pred_class), and the actual class labels (sex).
Planned Approach
Data Exploration:
Load penguin_predictions.csv into R using read_csv().
Inspect the distribution of the actual class (sex) using count(sex) and visualize it with ggplot2.
Compute the null error rate by identifying the majority class and calculating the proportion of misclassified observations if the model always predicted that class.
Understanding the null error rate provides a baseline to evaluate model performance.
Threshold-Based Classification:
Ignore .pred_class initially.
For thresholds of 0.2, 0.5, and 0.8, convert predicted probabilities into predicted classes by labeling a penguin "female" whenever .pred_female meets or exceeds the threshold, and "male" otherwise (see the sketch after this list).
This allows us to explore how changing the threshold affects model errors and metrics.
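A minimal sketch of that conversion, assuming the predictions have been read into a data frame named penguin_predictions (as in the codebase deliverable below) and treating "female" as the positive class:
library(dplyr)

# Label a penguin "female" whenever .pred_female meets or exceeds the threshold
preds_by_threshold <- penguin_predictions %>%
  mutate(
    pred_0.2 = if_else(.pred_female >= 0.2, "female", "male"),
    pred_0.5 = if_else(.pred_female >= 0.5, "female", "male"),
    pred_0.8 = if_else(.pred_female >= 0.8, "female", "male")
  )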
Confusion Matrices:
For each threshold, manually compute:
True Positives (TP)
False Positives (FP)
True Negatives (TN)
False Negatives (FN)
Present results as three confusion matrices corresponding to the three thresholds.
Understanding TP, FP, TN, FN is crucial, as all performance metrics are derived from these values.
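A minimal sketch of those counts for a single threshold (0.5 shown here), again assuming the penguin_predictions data frame and "female" as the positive class:
# Confusion-matrix counts at one threshold, with "female" as the positive class
pred   <- ifelse(penguin_predictions$.pred_female >= 0.5, "female", "male")
actual <- penguin_predictions$sex

TP <- sum(pred == "female" & actual == "female")  # predicted female, actually female
FP <- sum(pred == "female" & actual == "male")    # predicted female, actually male
TN <- sum(pred == "male"   & actual == "male")    # predicted male, actually male
FN <- sum(pred == "male"   & actual == "female")  # predicted male, actually female

table(Predicted = pred, Actual = actual)          # the full confusion matrix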
Performance Metrics Calculation:
Calculate Accuracy, Precision, Recall, and F1 score for each threshold using the formulas:
Accuracy = (TP + TN) / total
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = 2 * (Precision * Recall) / (Precision + Recall), the harmonic mean of Precision and Recall
Present metrics in a clear table to compare threshold effects.
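As a minimal sketch, the four metrics follow directly from the confusion-matrix counts computed in the previous sketch (TP, FP, TN, FN for one threshold):
# Metrics from the confusion-matrix counts above
accuracy  <- (TP + TN) / (TP + TN + FP + FN)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)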
Threshold Use Cases:
Discuss scenarios where a low threshold (0.2) is preferable, e.g., when missing a positive instance is costly.
Discuss scenarios where a high threshold (0.8) is preferable, e.g., when false positives are more harmful than false negatives.
Data Challenges and Considerations
Class Imbalance: If one class dominates, metrics like accuracy may be misleading; null error rate provides context.
Threshold Trade-offs: Different thresholds produce different balances of precision and recall; selecting an appropriate threshold depends on the real-world consequences of errors.
Reproducibility: All calculations will be done in R with code provided, so others can replicate the results.
Visualization: Plots of predicted probability distributions or threshold effects will help intuitively understand model behavior.
Codebase Deliverable
# Load packages
library(readr)
# Read the CSV
penguins <- read_csv("https://raw.githubusercontent.com/acatlin/data/refs/heads/master/penguin_predictions.csv")
Rows: 93 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): .pred_class, sex
dbl (1): .pred_female
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(ggplot2)
# Count how many of each class
table(penguins$sex)
female male
39 54
# Or with dplyr
penguins %>% count(sex)
# A tibble: 2 × 2
sex n
<chr> <int>
1 female 39
2 male 54
# I want this in a bar chart, so I will be using ggplot here
ggplot(penguins, aes(x = sex)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Distribution of Actual Sex", x = "Sex", y = "Count")
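The plan also mentions visualizing the predicted probability distribution; a minimal sketch of one way to do this, using the same penguins data frame and marking the three candidate thresholds:
# Histogram of predicted probabilities by actual sex, with the three candidate thresholds marked
ggplot(penguins, aes(x = .pred_female, fill = sex)) +
  geom_histogram(bins = 20, alpha = 0.7, position = "identity") +
  geom_vline(xintercept = c(0.2, 0.5, 0.8), linetype = "dashed") +
  labs(title = "Predicted Probability of Female by Actual Sex",
       x = "Predicted P(female)", y = "Count")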
# Count of each class
class_counts <- table(penguins$sex)
# Majority class - extract name of majority class
majority_class <- names(which.max(class_counts))
# Null error rate: 1 - majority-class observations / total observations
null_error_rate <- 1 - max(class_counts) / sum(class_counts)
null_error_rate
[1] 0.4193548
# Load tidyverse
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ lubridate 1.9.4 ✔ tibble 3.3.1
✔ purrr 1.2.1 ✔ tidyr 1.3.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load the CSV file
penguin_predictions <- read_csv(
  "https://raw.githubusercontent.com/acatlin/data/refs/heads/master/penguin_predictions.csv",
  show_col_types = FALSE
)
# Glimpse at the first 6 rows for ease of reading
head(penguin_predictions)
Classification Metrics by Threshold

Probability Threshold   Accuracy   Precision   Recall   F1 Score
0.2                     0.914      0.860       0.949    0.902
0.5                     0.935      0.923       0.923    0.923
0.8                     0.946      0.947       0.923    0.935
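A minimal sketch of how a table like this could be produced by looping over the three thresholds; the original table-building code is not shown above, so this is an assumed reconstruction rather than the exact code used:
library(dplyr)
library(purrr)

thresholds <- c(0.2, 0.5, 0.8)

metrics_by_threshold <- map_dfr(thresholds, function(t) {
  pred   <- ifelse(penguin_predictions$.pred_female >= t, "female", "male")
  actual <- penguin_predictions$sex
  TP <- sum(pred == "female" & actual == "female")
  FP <- sum(pred == "female" & actual == "male")
  TN <- sum(pred == "male"   & actual == "male")
  FN <- sum(pred == "male"   & actual == "female")
  precision <- TP / (TP + FP)
  recall    <- TP / (TP + FN)
  tibble(
    threshold = t,
    accuracy  = (TP + TN) / (TP + TN + FP + FN),
    precision = precision,
    recall    = recall,
    f1        = 2 * precision * recall / (precision + recall)
  )
})

metrics_by_threshold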
A very low threshold like 0.2 makes the model extremely sensitive: even a 20% predicted chance that an observation is positive is enough for the model to call it positive. In the real world, such a sensitive model would be most useful in something like cancer screening, where the disease is time-sensitive and dangerous; you would rather accept more false positives than risk failing to detect a patient with cancer.

A very high threshold like 0.8, on the other hand, means the model only predicts the positive class when it is at least 80% confident. This could be useful in cases like deciding when to email offers to high-value wealth clients at a financial institution, where only the clients most likely to respond positively should be targeted.