week2b

Author

Zihao Yu

1. How will I tackle the problem?

I will import the data, check the values in the .pred_female, .pred_class, and sex columns, plot the distribution of sex, and calculate the null error rate (NER). Following the Machine Learning PDF, I will then count TP, FP, TN, and FN at each probability threshold (0.2, 0.5, and 0.8); compute accuracy, precision, recall, and F1; summarize the results in a comparison table; and explain which scenarios suit a threshold of 0.2 or 0.8.

2. What data challenges do I anticipate?

The first pass will probably take extra time. The assignment also repeats a multi-step calculation for several thresholds, which makes the data processing error-prone. To minimize mistakes, I will take extra care when transcribing values and double-check each computation.

Source: https://raw.githubusercontent.com/XxY-coder/data607-week2b/refs/heads/main/penguin_predictions.csv

3. Explore the Data

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.1     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
df <- read.csv("https://raw.githubusercontent.com/XxY-coder/data607-week2b/refs/heads/main/penguin_predictions.csv")

glimpse(df)
Rows: 93
Columns: 3
$ .pred_female <dbl> 0.99217462, 0.95423945, 0.98473504, 0.18702056, 0.9947012…
$ .pred_class  <chr> "female", "female", "female", "male", "female", "female",…
$ sex          <chr> "female", "female", "female", "female", "female", "female…
names(df)
[1] ".pred_female" ".pred_class"  "sex"         
sum(is.na(df))
[1] 0
df |>
  ggplot(
    aes(x = sex)
  )+
  geom_bar()+
  labs(
    title = "Count of sex",
    x = "sex",
    y = "count"
  )

df |>
  count(sex, sort = TRUE)
     sex  n
1   male 54
2 female 39

4. Compute the null error rate (NER)

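The data contain 39 female and 54 male penguins. The null error rate is the error made by always predicting the majority class (male), so it equals the number of minority-class samples divided by the total number of samples.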

error_rate <- 39/(39 +54)
error_rate
[1] 0.4193548
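
The same number can be derived from the class counts instead of being typed in by hand. The following is a minimal base R sketch, assuming a vector `sex` rebuilt from the counts reported above; the names `class_counts`, `majority_class`, and `null_error_rate` are illustrative and not part of the original analysis.

```r
# Rebuild the observed labels from the counts reported above (54 male, 39 female).
sex <- c(rep("male", 54), rep("female", 39))

# Tabulate the classes; the majority class is what a "null" model always predicts.
class_counts <- table(sex)
majority_class <- names(class_counts)[which.max(class_counts)]

# Null error rate = share of observations that are NOT in the majority class.
null_error_rate <- 1 - max(class_counts) / sum(class_counts)

majority_class
null_error_rate
```

With the imported data, `sex` could be replaced by `df$sex`, which avoids transcribing the counts.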

5. Understand Probability vs Class

df2 <-
  df |>
  mutate(
    pred_02 = ifelse(.pred_female > 0.2, 1, 0),
    pred_05 = ifelse(.pred_female > 0.5, 1, 0),
    pred_08 = ifelse(.pred_female > 0.8, 1, 0)
  )

glimpse(df2)
Rows: 93
Columns: 6
$ .pred_female <dbl> 0.99217462, 0.95423945, 0.98473504, 0.18702056, 0.9947012…
$ .pred_class  <chr> "female", "female", "female", "male", "female", "female",…
$ sex          <chr> "female", "female", "female", "female", "female", "female…
$ pred_02      <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ pred_05      <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ pred_08      <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
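
To make the thresholding step concrete, here is a small sketch of how one probability turns into different class labels at different cutoffs. The helper `classify_at()` and the probabilities in `p` are hypothetical and chosen only for illustration; they are not rows from the dataset.

```r
# Hypothetical helper: label an observation 1 ("female") when the predicted
# probability of female is strictly greater than the threshold, else 0.
classify_at <- function(prob_female, threshold) {
  as.integer(prob_female > threshold)
}

# Made-up probabilities for illustration only (not rows from the dataset).
p <- c(0.95, 0.60, 0.30, 0.10)

classify_at(p, 0.2)  # 1 1 1 0
classify_at(p, 0.5)  # 1 1 0 0
classify_at(p, 0.8)  # 1 0 0 0
```

Raising the threshold can only switch predictions from 1 to 0, so the number of predicted females never increases as the cutoff goes up.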

6. Work with a probability threshold of 0.2

cm_02 <- 
  df2 |>
  summarise(
    TP = sum(pred_02 == 1 & sex == "female", na.rm = TRUE),
    FP = sum(pred_02 == 1 & sex == "male",   na.rm = TRUE),
    TN = sum(pred_02 == 0   & sex == "male",   na.rm = TRUE),
    FN = sum(pred_02 == 0   & sex == "female", na.rm = TRUE)
  )
cm_02
  TP FP TN FN
1 37  6 48  2
accuracy_02 <- (37+48)/93
accuracy_02
[1] 0.9139785
Prec_02 <- (37)/(37+6)
Prec_02
[1] 0.8604651
recall_02 <- (37)/(37+2)
recall_02
[1] 0.9487179
F1_02 <- (2*(Prec_02)*(recall_02))/((Prec_02)+(recall_02))
F1_02
[1] 0.902439
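
Because the same four formulas are repeated for every threshold, they could be wrapped in one function so the counts are entered only once. This is a sketch under that assumption; `classification_metrics()` is a hypothetical helper, not a function from tidyverse, and it treats "female" as the positive class, as the counts above do.

```r
# Hypothetical helper: compute the four metrics from confusion-matrix counts,
# treating "female" as the positive class (as in the counts above).
classification_metrics <- function(TP, FP, TN, FN) {
  precision <- TP / (TP + FP)
  recall    <- TP / (TP + FN)
  c(
    accuracy  = (TP + TN) / (TP + FP + TN + FN),
    precision = precision,
    recall    = recall,
    f1        = 2 * precision * recall / (precision + recall)
  )
}

# Counts reported above for threshold 0.2.
m_02 <- classification_metrics(TP = 37, FP = 6, TN = 48, FN = 2)
round(m_02, 4)
```

The rounded values should agree with the accuracy, precision, recall, and F1 computed by hand above.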

7. Work with a probability threshold of 0.5

cm_05 <- 
  df2 |>
  summarise(
    TP = sum(pred_05 == 1 & sex == "female", na.rm = TRUE),
    FP = sum(pred_05 == 1 & sex == "male",   na.rm = TRUE),
    TN = sum(pred_05 == 0   & sex == "male",   na.rm = TRUE),
    FN = sum(pred_05 == 0   & sex == "female", na.rm = TRUE)
  )
cm_05
  TP FP TN FN
1 36  3 51  3
accuracy_05 <- (36+51)/93
accuracy_05
[1] 0.9354839
Prec_05 <- (36)/(36+3)
Prec_05
[1] 0.9230769
recall_05 <- (36)/(36+3)
recall_05
[1] 0.9230769
F1_05 <- (2*(Prec_05)*(recall_05))/((Prec_05)+(recall_05))
F1_05
[1] 0.9230769

8. Work with a probability threshold of 0.8

cm_08 <- 
  df2 |>
  summarise(
    TP = sum(pred_08 == 1 & sex == "female", na.rm = TRUE),
    FP = sum(pred_08 == 1 & sex == "male",   na.rm = TRUE),
    TN = sum(pred_08 == 0   & sex == "male",   na.rm = TRUE),
    FN = sum(pred_08 == 0   & sex == "female", na.rm = TRUE)
  )
cm_08
  TP FP TN FN
1 36  2 52  3
accuracy_08 <- (36+52)/93
accuracy_08
[1] 0.9462366
Prec_08 <- (36)/(36+2)
Prec_08
[1] 0.9473684
recall_08 <- (36)/(36+3)
recall_08
[1] 0.9230769
F1_08 <- (2*(Prec_08)*(recall_08))/((Prec_08)+(recall_08))
F1_08
[1] 0.9350649
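
The three sets of results can be collected into one comparison table. The sketch below uses base R only and re-enters the confusion-matrix counts reported above; the object names `cm_all` and `comparison` are illustrative.

```r
# Confusion-matrix counts reported above, one row per threshold.
cm_all <- data.frame(
  threshold = c(0.2, 0.5, 0.8),
  TP = c(37, 36, 36),
  FP = c(6, 3, 2),
  TN = c(48, 51, 52),
  FN = c(2, 3, 3)
)

# Apply the metric formulas to every row at once (R arithmetic is vectorised).
comparison <- cm_all
comparison$accuracy  <- with(cm_all, (TP + TN) / (TP + FP + TN + FN))
comparison$precision <- with(cm_all, TP / (TP + FP))
comparison$recall    <- with(cm_all, TP / (TP + FN))
comparison$f1        <- with(comparison,
                             2 * precision * recall / (precision + recall))

# Print the side-by-side table, rounded for readability.
print(round(comparison, 3))
```

Reading across the rows shows the trade-off directly: precision rises as the threshold increases, while recall is highest at 0.2.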

Explain why knowing the null error rate is important when evaluating models.

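The null error rate is the error rate of a model that always predicts the majority class, so it sets the baseline that any useful model must beat. Here the NER is 0.419, which means that always predicting male would already be about 58% accurate. When the data are skewed toward the majority class, accuracy alone can look inflated while the minority class is still misclassified or missed, so accuracy should always be read against this baseline.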
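
One way to use the NER in practice is to compare each model accuracy with the accuracy of always predicting the majority class. The sketch below does this with the counts reported earlier; the names `baseline_accuracy`, `model_accuracy`, and `lift` are illustrative.

```r
# Baseline: always predict the majority class ("male", 54 of 93 penguins).
baseline_accuracy <- 54 / 93          # equals 1 - null error rate

# Accuracies reported above for the three thresholds (correct predictions / 93).
model_accuracy <- c(`0.2` = 85, `0.5` = 87, `0.8` = 88) / 93

# Improvement over the baseline, in accuracy points.
lift <- model_accuracy - baseline_accuracy
round(lift, 3)
```

A positive difference at every threshold indicates that the model adds information beyond the class proportions.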

Conclusion

Comparing the metrics at thresholds 0.2, 0.5, and 0.8:

- Threshold 0.5 gives balanced performance: precision, recall, and F1 are all 0.923. It is a reasonable default but not necessarily the optimal choice; the decision depends on whether precision or recall matters more.

- At threshold 0.2, recall is highest (0.949) but precision is lowest (0.860). Accuracy remains high (0.914), but with imbalanced classes accuracy alone is less informative and should be read alongside the other metrics. A threshold of 0.2 is better suited to tasks such as medical screening, where the priority is to catch as many true cases as possible and reduce missed diagnoses; even so, precision should not fall so low that positive results become unreliable.

- At threshold 0.8, precision and accuracy both reach about 0.95. Recall (0.923) is slightly lower than at 0.2 (0.949), while F1 (0.935) is about 0.03 higher than at 0.2 (0.902). Overall, the model performs best at threshold 0.8 on this dataset. Its higher precision makes it more suitable for tasks requiring correct classification, such as estimating gender ratios in a college, where it minimizes misclassifications and yields more stable statistical results.