Introduction

We will be looking at the penguin_predictions.csv dataset. This dataset contains three data columns:

  • .pred_female: the model’s degree of certainty in its prediction, expressed as the predicted probability that the penguin is female
  • .pred_class: the predicted sex of the penguin
  • sex: the actual sex of the penguin

We will use the following libraries:

  • The tidyverse library
  • The dplyr library
  • The knitr library
  • The kableExtra library (the source of kable_styling() and column_spec(), used for the final table)
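
For convenience, here is one way to load them all up front:

library(tidyverse)  # read_csv(), ggplot2, dplyr, and the pipe
library(dplyr)      # attached with tidyverse, but listed explicitly above
library(knitr)      # kable()
library(kableExtra) # kable_styling() and column_spec()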

Reading in Our Data

To start, let us read in our data and look at its summary.

url <-("https://raw.githubusercontent.com/WendyR20/DATA-607-Assignment-2B/refs/heads/main/penguin_predictions.csv")
data <- read_csv(url)
## Rows: 93 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): .pred_class, sex
## dbl (1): .pred_female
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(data)
##   .pred_female       .pred_class            sex           
##  Min.   :0.0000000   Length:93          Length:93         
##  1st Qu.:0.0003508   Class :character   Class :character  
##  Median :0.1098907   Mode  :character   Mode  :character  
##  Mean   :0.4351396                                        
##  3rd Qu.:0.9921746                                        
##  Max.   :1.0000000

Null Error Rate

Let us calculate the null error rate for the dataset. First, we need to find out how many null values exist in the dataset.

# Count Null values for each column
sapply(data, function(x) sum(is.na(x)))
## .pred_female  .pred_class          sex 
##            0            0            0
# Summary table of missing values per column
null_summary <- data.frame(
  null_count = sapply(data, function(x) sum(is.na(x))),
  null_percent = sapply(data, function(x) mean(is.na(x)) * 100)
)

null_summary
##              null_count null_percent
## .pred_female          0            0
## .pred_class           0            0
## sex                   0            0

We see that there are no null values in the dataset, so 0% of our rows are missing data and we avoided having to do any extra cleanup (phew)!
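
One caveat: in classification, the term null error rate conventionally refers to how often we would be wrong if we always predicted the majority class, not to the number of NA values. A minimal sketch of that baseline (the class counts come from the bar chart below):

# Majority-class baseline: error rate if we always predicted the most common sex
null_error_rate <- 1 - max(table(data$sex)) / nrow(data)
null_error_rate # with 54 males out of 93 penguins, this is 39/93, about 0.42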

Importance of the Null Error Rate

It is important to know the null error rate, and to check for null values in general: null values can affect our understanding of a dataset, we may not be able to use certain models if we have them, and they can lead to inaccurate calculations of the median and mean of our data.

Explanatory Variable Distribution

Now let’s plot the distribution of the explanatory variable for this dataset: the actual sex of the penguins.

ggplot(data, aes(x = sex)) +
  geom_bar(fill = "deepskyblue") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = 2) +
  labs(title = "Distribution of Actual Sex",
       x = "Actual Sex",
       y = "Count") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5, vjust = 0.5),
        axis.title.x = element_text(hjust = 0.5, vjust = 0.5),
        axis.title.y = element_text(hjust = 0.5, vjust = 0.5))

(Bar chart: 39 female and 54 male penguins.)

Confusion Matrix

Let’s practice making a confusion matrix with our data as it is. Here the rows are labeled by the actual sex (TP = actual female, TN = actual male) and the columns by the predicted sex (FP = predicted female, FN = predicted male), so the (TP, FP) cell counts the true positives, (TN, FN) the true negatives, (TN, FP) the false positives, and (TP, FN) the false negatives.

# Confusion matrix
df1 <- data
df1 %>%
  select(sex, .pred_class) %>%
  mutate(sex = recode(sex,
                      'female' = 'TP',
                      'male' = 'TN'),
         .pred_class = recode(.pred_class,
                              'female' = 'FP',
                              'male' = 'FN')) %>%
  table()
##     .pred_class
## sex  FN FP
##   TN 51  3
##   TP  3 36
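
For comparison, a plain two-way table gives the same counts with the original labels (rows are the actual sex, columns the predicted sex), which can be easier to read:

# Same counts with the original labels; the diagonal holds the
# correct predictions (36 females, 51 males)
table(actual = data$sex, predicted = data$.pred_class)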

Probability Threshold 0.2

We are going to analyze the data and determine the true positive, false positive, true negative, and false negative values based on the .pred_female threshold being 0.2.

For that we will first create a new column and assign the value ‘male’ or ‘female’ to it based on the 0.2 threshold.

df2 <- data
df2$pred_class_threshold <- NA # Initialize the new column
print(df2)
## # A tibble: 93 × 4
##    .pred_female .pred_class sex    pred_class_threshold
##           <dbl> <chr>       <chr>  <lgl>               
##  1        0.992 female      female NA                  
##  2        0.954 female      female NA                  
##  3        0.985 female      female NA                  
##  4        0.187 male        female NA                  
##  5        0.995 female      female NA                  
##  6        1.000 female      female NA                  
##  7        0.959 female      female NA                  
##  8        1.000 female      female NA                  
##  9        1.000 female      female NA                  
## 10        0.339 male        female NA                  
## # ℹ 83 more rows
df2$pred_class_threshold <- ifelse(df2$.pred_female >= 0.2, "female", "male")
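
For reference, an equivalent one-step tidyverse version of the same assignment (a stylistic alternative only, with no change to the result):

# Equivalent one-step version using dplyr
df2 <- data %>%
  mutate(pred_class_threshold = if_else(.pred_female >= 0.2, "female", "male"))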

Calculations for the 0.2 Probability Threshold

Later on we will be making a table of the accuracy, precision, recall, and F1 scores for all three probability thresholds, so let’s get the calculations for the 0.2 probability threshold done now.

calculations <- function(df) {

  # Count true positives (TP), true negatives (TN),
  # false positives (FP), and false negatives (FN)
  TP <- sum(df$sex == 'female' & df$pred_class_threshold == 'female')
  TN <- sum(df$sex == 'male' & df$pred_class_threshold == 'male')
  FP <- sum(df$sex == 'male' & df$pred_class_threshold == 'female')
  FN <- sum(df$sex == 'female' & df$pred_class_threshold == 'male')

  # Accuracy: share of all predictions that are correct
  accuracy <- (TP + TN) / (TP + FP + TN + FN)

  # Precision: share of predicted females that are actually female
  precision <- TP / (TP + FP)

  # Recall: share of actual females that the model predicts as female
  recall <- TP / (TP + FN)

  # F1 score: harmonic mean of precision and recall
  f1 <- (2 * precision * recall) / (precision + recall)

  c(accuracy, precision, recall, f1)
}
scores_02 <- calculations(df2)
print(scores_02)
## [1] 0.9139785 0.8604651 0.9487179 0.9024390
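
Since we will repeat these calculations at the 0.5 and 0.8 thresholds, a helper that takes the threshold as an argument could do the thresholding and scoring in one call. A minimal sketch (metrics_at is a hypothetical name introduced here, not part of the analysis above):

# Hypothetical helper: score the predictions at any probability threshold
metrics_at <- function(df, threshold) {
  pred <- ifelse(df$.pred_female >= threshold, "female", "male")
  TP <- sum(df$sex == "female" & pred == "female")
  TN <- sum(df$sex == "male" & pred == "male")
  FP <- sum(df$sex == "male" & pred == "female")
  FN <- sum(df$sex == "female" & pred == "male")
  precision <- TP / (TP + FP)
  recall <- TP / (TP + FN)
  c(accuracy = (TP + TN) / nrow(df),
    precision = precision,
    recall = recall,
    f1 = 2 * precision * recall / (precision + recall))
}
metrics_at(data, 0.2) # should reproduce scores_02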

Confusion Matrix for 0.2 Threshold

Let’s make a confusion matrix based on the 0.2 threshold.

df2 %>%
  select(sex, pred_class_threshold) %>%
  mutate(sex = recode(sex,
                      'female' = 'TP',
                      'male' = 'TN'),
         pred_class_threshold = recode(pred_class_threshold,
                                       'female' = 'FP',
                                       'male' = 'FN')) %>%
  table()
##     pred_class_threshold
## sex  FN FP
##   TN 48  6
##   TP  2 37

Probability Threshold 0.5

Next, let’s reassign the predicted classes using a 0.5 threshold.

df2$pred_class_threshold <- ifelse(df2$.pred_female >= 0.5, "female", "male")
print(df2)
## # A tibble: 93 × 4
##    .pred_female .pred_class sex    pred_class_threshold
##           <dbl> <chr>       <chr>  <chr>               
##  1        0.992 female      female female              
##  2        0.954 female      female female              
##  3        0.985 female      female female              
##  4        0.187 male        female male                
##  5        0.995 female      female female              
##  6        1.000 female      female female              
##  7        0.959 female      female female              
##  8        1.000 female      female female              
##  9        1.000 female      female female              
## 10        0.339 male        female male                
## # ℹ 83 more rows

Calculations for the 0.5 Probability Threshold

Now let’s get the calculations for the 0.5 probability threshold. Since the calculations() function is already defined above, we can simply reuse it.

scores_05 <- calculations(df2)
print(scores_05)
## [1] 0.9354839 0.9230769 0.9230769 0.9230769

Confusion Matrix for 0.5 Threshold

Let’s make a confusion matrix based on the 0.5 threshold.

df2 %>% 
  select(sex, pred_class_threshold) %>%
  mutate(sex = recode(sex,
                      'female' = 'TP', 
                      'male' = 'TN'),
         pred_class_threshold = recode(pred_class_threshold,
                                        'female' = 'FP', 
                                        'male' = 'FN')) %>%
  table()
##     pred_class_threshold
## sex  FN FP
##   TN 51  3
##   TP  3 36

Probability Threshold 0.8

Finally, let’s reassign the predicted classes using a 0.8 threshold.

df2$pred_class_threshold <- ifelse(df2$.pred_female >= 0.8, "female", "male")
print(df2)
## # A tibble: 93 × 4
##    .pred_female .pred_class sex    pred_class_threshold
##           <dbl> <chr>       <chr>  <chr>               
##  1        0.992 female      female female              
##  2        0.954 female      female female              
##  3        0.985 female      female female              
##  4        0.187 male        female male                
##  5        0.995 female      female female              
##  6        1.000 female      female female              
##  7        0.959 female      female female              
##  8        1.000 female      female female              
##  9        1.000 female      female female              
## 10        0.339 male        female male                
## # ℹ 83 more rows

Calculations for the 0.8 Probability Threshold

Now let’s get the calculations for the 0.8 probability threshold, again reusing calculations().

scores_08 <- calculations(df2)
print(scores_08)
## [1] 0.9462366 0.9473684 0.9230769 0.9350649

Confusion Matrix for 0.8 Threshold

Let’s make a confusion matrix based on the 0.8 threshold.

df2 %>% 
  select(sex, pred_class_threshold) %>%
  mutate(sex = recode(sex,
                      'female' = 'TP', 
                      'male' = 'TN'),
         pred_class_threshold = recode(pred_class_threshold,
                                        'female' = 'FP', 
                                        'male' = 'FN')) %>%
  table()
##     pred_class_threshold
## sex  FN FP
##   TN 52  2
##   TP  3 36

Table for Accuracy, Precision, Recall, and F1 scores

As mentioned earlier, here is a table of the accuracy, precision, recall, and F1 scores for all three probability thresholds.

labels <- c("Accuracy", "Precision", "Recall", "F1 Score")

score_df <- data.frame(
  Performance_Metrics = labels,
  Threshold_02 = scores_02,
  Threshold_05 = scores_05,
  Threshold_08 = scores_08
)
score_df %>%
  kable(format = "html", caption = "Performance Metrics Table") %>%
  kable_styling() %>%
  column_spec(1, width = "18em")
Performance Metrics Table

Performance_Metrics  Threshold_02  Threshold_05  Threshold_08
Accuracy                0.9139785     0.9354839     0.9462366
Precision               0.8604651     0.9230769     0.9473684
Recall                  0.9487179     0.9230769     0.9230769
F1 Score                0.9024390     0.9230769     0.9350649
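
Reading across the table, raising the threshold trades recall for precision: precision climbs from about 0.86 to 0.95 while recall slips from about 0.95 to 0.92. For reference, the whole table could also be built in one pass with the hypothetical metrics_at() helper sketched earlier:

# One-pass version of the table, using the hypothetical metrics_at() helper
thresholds <- c(0.2, 0.5, 0.8)
score_df2 <- data.frame(
  Performance_Metrics = c("Accuracy", "Precision", "Recall", "F1 Score"),
  sapply(thresholds, function(t) metrics_at(data, t))
)
names(score_df2)[-1] <- paste0("Threshold_0", thresholds * 10)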

Probability Threshold Use Case

  1. We might want a low probability threshold like 0.2 if it is very important that we do not miss any female penguins, i.e. it is better for us to have too many false positives than any false negatives. If, say, there were a disease that primarily affects female penguins, we would want to lower our threshold (see the sketch after this list).

  2. On the other hand, we may want a very high probability threshold if we wanted to avoid as many false positives as possible. Say we were trying to create a new space for the female penguins away from the male penguins: we would want fewer false positives, and a model that captures the male penguins well.
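
To make the first scenario concrete, here is a hypothetical sketch of how we could search for the highest threshold that still misses no actual females (candidates and fn_at are names introduced here for illustration):

# For each candidate threshold, count the actual females we would miss
candidates <- seq(0, 1, by = 0.01)
fn_at <- sapply(candidates, function(t)
  sum(data$sex == "female" & data$.pred_female < t))
max(candidates[fn_at == 0]) # highest threshold with zero false negatives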