week2b_classification_metrics.qmd

Author

Sinem K Moschos

Week 2B Classification Metrics

## Approach

Introduction

In this assignment, I am working on understanding how a classification model is evaluated. I am not building or training a model in Week 2B. Instead, I am given model predictions, and my task is to analyze them and understand how different evaluation methods work. Since I am new to machine learning concepts, my main goal is to understand what each step means.

What the Dataset Is

The dataset already contains predictions from a model. One column shows the predicted probability that an observation belongs to the female class. Another column shows a predicted class label, and the last column shows the actual class label. I learned that models first predict probabilities, and then those probabilities are turned into class labels using a threshold. Because of this, I will focus on the probability column and not rely on the given predicted class.

First Step: Looking at the Baseline

Before evaluating the model, I plan to calculate the null error rate. The null error rate shows how wrong we would be if we always guessed the most common class. This step is important because it gives me a baseline to compare against. If a model performs worse than this, then it is not useful.

Understanding Thresholds

Next, I will work with probability thresholds. A threshold is the rule that decides when a probability becomes a positive or negative prediction. I plan to test different thresholds, specifically 0.2, 0.5, and 0.8. This will help me see how predictions change when I am more or less strict about labeling something as positive.
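As a quick illustration before I touch the real data, here is a tiny sketch with made-up probabilities (not values from the dataset) showing how the same probabilities turn into different labels depending on the threshold:

# toy probabilities, not from the assignment dataset
probs <- c(0.95, 0.60, 0.35, 0.10)

# a lower threshold labels more cases as "female" (the positive class)
ifelse(probs >= 0.2, "female", "male")
ifelse(probs >= 0.5, "female", "male")
ifelse(probs >= 0.8, "female", "male")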

Confusion Matrix as the Main Tool

For each threshold, I will manually calculate the confusion matrix. This means counting how many predictions are true positives, false positives, true negatives, and false negatives. I am doing this step carefully because I learned that all evaluation metrics are built from these four numbers.
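Continuing the toy sketch above (again with made-up values, not the assignment data), the four counts come from comparing the thresholded predictions against the true labels:

# toy predictions and true labels
pred  <- c("female", "female", "male", "male")
truth <- c("female", "male",   "male", "female")

TP <- sum(pred == "female" & truth == "female")  # predicted female, actually female
FP <- sum(pred == "female" & truth == "male")    # predicted female, actually male
TN <- sum(pred == "male"   & truth == "male")    # predicted male, actually male
FN <- sum(pred == "male"   & truth == "female")  # predicted male, actually female

c(TP = TP, FP = FP, TN = TN, FN = FN)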

Calculating Performance Metrics

After building the confusion matrices, I will calculate accuracy, precision, recall, and F1 score. Instead of just using built-in functions, I will compute these metrics myself using the formulas. This helps me understand what each metric actually measures and how it relates back to the confusion matrix.
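For reference, these are the standard formulas I plan to apply, written in terms of the four confusion matrix counts:

$$
\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}, \qquad
\text{Precision} = \frac{TP}{TP + FP}
$$

$$
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$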

Comparing Thresholds

Once I have metrics for each threshold, I will compare them to see how performance changes. I expect that lower thresholds will catch more positive cases but also create more false positives, while higher thresholds will be more conservative. This comparison helps explain why there is no single best threshold for every situation.

Explaining Real World Use Cases

Finally, I will explain when different thresholds might be useful in real life. For example, a lower threshold may be better when missing a positive case is risky, while a higher threshold may be better when false positives are costly. This step connects the technical work to real-world decision making.

Expected Outcome

By following these steps, I expect to gain a better understanding of how classification models are evaluated, how thresholds affect results, and why different metrics are used. This assignment helps me build intuition rather than just memorizing formulas.

Step 1 Load the data

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
url <- "https://raw.githubusercontent.com/acatlin/data/refs/heads/master/penguin_predictions.csv"
df <- read_csv(url)
Rows: 93 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): .pred_class, sex
dbl (1): .pred_female

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(df)
Rows: 93
Columns: 3
$ .pred_female <dbl> 0.99217462, 0.95423945, 0.98473504, 0.18702056, 0.9947012…
$ .pred_class  <chr> "female", "female", "female", "male", "female", "female",…
$ sex          <chr> "female", "female", "female", "female", "female", "female…
head(df)
# A tibble: 6 × 3
  .pred_female .pred_class sex   
         <dbl> <chr>       <chr> 
1        0.992 female      female
2        0.954 female      female
3        0.985 female      female
4        0.187 male        female
5        0.995 female      female
6        1.000 female      female

Step 2 Null error rate

# count the actual classes
class_counts <- table(df$sex)
class_counts

female   male 
    39     54 
# majority class (most common label)
majority_class <- names(which.max(class_counts))
majority_class
[1] "male"
# null error rate = error if we always predict the majority class
null_error_rate <- mean(df$sex != majority_class)
null_error_rate
[1] 0.4193548
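As a quick sanity check, the same value can be reached from the class proportions: the null error rate is one minus the share of the majority class, so the line below should also give 0.4193548.

# sanity check: 1 - proportion of the majority class
1 - max(prop.table(class_counts))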

Step 3 Class distribution plot

library(ggplot2)

ggplot(df, aes(x = sex)) +
  geom_bar() +
  labs(
    title = "Class Distribution of Sex",
    x = "Sex",
    y = "Count"
  )
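An optional variant (my own sketch, not required by the assignment) shows the same imbalance as proportions instead of counts, which connects directly to the null error rate:

# same plot, but with bar heights as proportions of all penguins
ggplot(df, aes(x = sex)) +
  geom_bar(aes(y = after_stat(count / sum(count)))) +
  labs(
    title = "Class Proportions of Sex",
    x = "Sex",
    y = "Proportion"
  )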

Step 4 Confusion matrices at thresholds 0.2, 0.5, and 0.8

# Ignore .pred_class and recompute predictions from probabilities
make_confusion <- function(threshold) {
  pred <- ifelse(df$.pred_female > threshold, "female", "male")

  TP <- sum(pred == "female" & df$sex == "female")
  FP <- sum(pred == "female" & df$sex == "male")
  TN <- sum(pred == "male" & df$sex == "male")
  FN <- sum(pred == "male" & df$sex == "female")

  tibble(
    threshold = threshold,
    TP = TP,
    FP = FP,
    TN = TN,
    FN = FN
  )
}

conf_02 <- make_confusion(0.2)
conf_05 <- make_confusion(0.5)
conf_08 <- make_confusion(0.8)

conf_02
# A tibble: 1 × 5
  threshold    TP    FP    TN    FN
      <dbl> <int> <int> <int> <int>
1       0.2    37     6    48     2
conf_05
# A tibble: 1 × 5
  threshold    TP    FP    TN    FN
      <dbl> <int> <int> <int> <int>
1       0.5    36     3    51     3
conf_08
# A tibble: 1 × 5
  threshold    TP    FP    TN    FN
      <dbl> <int> <int> <int> <int>
1       0.8    36     2    52     3
cm_02 <- matrix(
  c(conf_02$TP, conf_02$FP, conf_02$FN, conf_02$TN),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(
    Predicted = c("Female", "Male"),
    Actual = c("Female", "Male")
  )
)

cm_05 <- matrix(
  c(conf_05$TP, conf_05$FP, conf_05$FN, conf_05$TN),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(
    Predicted = c("Female", "Male"),
    Actual = c("Female", "Male")
  )
)

cm_08 <- matrix(
  c(conf_08$TP, conf_08$FP, conf_08$FN, conf_08$TN),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(
    Predicted = c("Female", "Male"),
    Actual = c("Female", "Male")
  )
)

cm_02
         Actual
Predicted Female Male
   Female     37    6
   Male        2   48
cm_05
         Actual
Predicted Female Male
   Female     36    3
   Male        3   51
cm_08
         Actual
Predicted Female Male
   Female     36    2
   Male        3   52
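As a sanity check (my own addition, not one of the assignment steps), the four counts at every threshold should add up to the 93 rows in the dataset:

# each confusion matrix should account for all 93 observations
bind_rows(conf_02, conf_05, conf_08) |>
  mutate(total = TP + FP + TN + FN)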

Step 5 Metrics table

calc_metrics <- function(conf) {
  accuracy  <- (conf$TP + conf$TN) / (conf$TP + conf$FP + conf$TN + conf$FN)
  precision <- conf$TP / (conf$TP + conf$FP)
  recall    <- conf$TP / (conf$TP + conf$FN)
  f1        <- 2 * precision * recall / (precision + recall)

  tibble(
    threshold = conf$threshold,
    accuracy = accuracy,
    precision = precision,
    recall = recall,
    f1 = f1
  )
}

metrics_table <- bind_rows(
  calc_metrics(conf_02),
  calc_metrics(conf_05),
  calc_metrics(conf_08)
)

metrics_table
# A tibble: 3 × 5
  threshold accuracy precision recall    f1
      <dbl>    <dbl>     <dbl>  <dbl> <dbl>
1       0.2    0.914     0.860  0.949 0.902
2       0.5    0.935     0.923  0.923 0.923
3       0.8    0.946     0.947  0.923 0.935
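To go one step beyond the three required thresholds, the same helper functions can be reused to sweep many thresholds and see how the metrics trade off. This is a sketch of my own, not a required part of the assignment; note that precision would become NaN at any threshold that produces no positive predictions.

# sweep thresholds from 0.05 to 0.95 and compute metrics at each one
thresholds <- seq(0.05, 0.95, by = 0.05)

sweep_metrics <- map_dfr(thresholds, function(t) calc_metrics(make_confusion(t)))

# plot precision and recall against the threshold
sweep_metrics |>
  pivot_longer(cols = c(precision, recall), names_to = "metric", values_to = "value") |>
  ggplot(aes(x = threshold, y = value, color = metric)) +
  geom_line() +
  labs(
    title = "Precision and Recall Across Thresholds",
    x = "Threshold",
    y = "Value"
  )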

Step 6 Threshold use cases

Why null error rate matters

The null error rate is like a starting point. It shows what would happen if we did not use the model at all and just guessed the same answer every time. This helps me understand whether the model is doing anything useful. If the model cannot beat this simple guess, then there is no point in using it.

When a 0.2 threshold makes more sense

A 0.2 threshold makes sense when it is more important to catch as many positive cases as possible. With a lower threshold, the model says positive more often. This means we catch more real positives, but we also make more mistakes by flagging some negatives as positives.

For example, in medical checks, it is usually better to be extra careful and send more people for follow-up tests instead of missing someone who actually has a problem.

When a 0.8 threshold makes more sense

A 0.8 threshold makes sense when a false positive is a big problem. With a higher threshold, the model only says positive when it is very sure. This means fewer false alarms, but we might miss some true positives.

For example, when approving a big loan, it is risky to approve someone by mistake, so it makes sense to only approve cases with very high confidence.

Overall, lower thresholds mean more positives and higher thresholds mean fewer positives. There is no one correct threshold. It depends on what kind of mistake matters more.