We will be looking at the penguin_predictions.csv dataset. This dataset contains three data columns. They are:
We will use the following libraries
To start let us read in our data and look at it’s summary.
url <-("https://raw.githubusercontent.com/WendyR20/DATA-607-Assignment-2B/refs/heads/main/penguin_predictions.csv")
data <- read_csv(url)
## Rows: 93 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): .pred_class, sex
## dbl (1): .pred_female
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(data)
## .pred_female .pred_class sex
## Min. :0.0000000 Length:93 Length:93
## 1st Qu.:0.0003508 Class :character Class :character
## Median :0.1098907 Mode :character Mode :character
## Mean :0.4351396
## 3rd Qu.:0.9921746
## Max. :1.0000000
Let us calculate the null error rate for the dataset. First, we need to find out how many null values exist in the dataset.
# Count Null values for each column
sapply(data, function(x) sum(is.na(x)))
## .pred_female .pred_class sex
## 0 0 0
# Summary table of missing values per column
null_summary <- data.frame(
null_count = sapply(data, function(x) sum(is.na(x))),
null_percent = sapply(data, function(x) mean(is.na(x)) * 100)
)
null_summary
## null_count null_percent
## .pred_female 0 0
## .pred_class 0 0
## sex 0 0
We see that there are no null values in the dataset, thus the null error rate is 0%. We avoided having to do any extra calculation (phew)!
Now let’s plot the distribution of the explanatory variable for this dataset: the actual sex of the penguins.
Its is important to know the null error rate, as the null values in a dataset can affect our understanding of a dataset, we may not be able to use certain models if we have null values. Null values can also lead to inaccurate calculations of the median and mean of our data.
ggplot(data, aes(x = sex)) +
geom_bar(fill = "deepskyblue") +
geom_text(stat="count", aes(label=after_stat(count)),vjust=2)
labs(title = "Distribution of Actual Sex",
x = "Actual Sex",
y = "Count") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5, vjust = 0.5),
axis.title.x = element_text(hjust = 0.5, vjust = 0.5),
axis.title.y = element_text(hjust = 0.5, vjust = 0.5))
## NULL
Let’s practice making a confusion matrix with our data as it is.
#confusion matrix
df1 <- data
df1 %>%
select(sex, .pred_class) %>%
mutate(sex = recode(sex,
'female' = 'TP',
'male' = 'TN'),
.pred_class = recode(.pred_class,
'female' = 'FP',
'male' = 'TN')) %>%
table()
## .pred_class
## sex FP TN
## TN 3 51
## TP 36 3
We are going to analyze the data and determine false positive, true negative, and false negative values based on the .pred_female threshold being 0.2 .pred_female threshold being 0.2
For that we will first create a new column and assign the values male’ or ‘female’ to new column based on 0.2 threshold.
df2 <- data
df2$pred_class_threshold <- NA # Initialize the new column
print(df2)
## # A tibble: 93 × 4
## .pred_female .pred_class sex pred_class_threshold
## <dbl> <chr> <chr> <lgl>
## 1 0.992 female female NA
## 2 0.954 female female NA
## 3 0.985 female female NA
## 4 0.187 male female NA
## 5 0.995 female female NA
## 6 1.000 female female NA
## 7 0.959 female female NA
## 8 1.000 female female NA
## 9 1.000 female female NA
## 10 0.339 male female NA
## # ℹ 83 more rows
df2$pred_class_threshold <- ifelse(df2$.pred_female >= 0.2, "female", "male")
Later on we will be making a table of the accuracy, precision, recall and F1 scores for the probability thresholds, let’s get the calculations for the 0.2 probability threshold done now.
calculations <- function(df2) {
#Calculating True Positives (TP), True Negatives (TN), False Positives (FP), False Negatives (FN)
TP <- sum(df2$sex == 'female' & df2$pred_class_threshold == 'female')
TN <- sum(df2$sex == 'male' & df2$pred_class_threshold == 'male')
FP <- sum(df2$sex == 'male' & df2$pred_class_threshold == 'female')
FN <- sum(df2$sex == 'female' & df2$pred_class_threshold == 'male')
#Calculating accuracy
accuracy_02 <- (TP + TN) / (TP + FP + TN + FN)
#Calculating precision
precision_02 <- (TP) / (TP + FP)
#Calculating recall
recall_02 <- (TP) / (TP + FN)
#Calculating F1 score
f1_02 <- (2 * precision_02 * recall_02) / (precision_02 + recall_02)
result_02 <- c(accuracy_02, precision_02, recall_02, f1_02)
result_02
}
scores_02 <-calculations(df2)
print(scores_02)
## [1] 0.9139785 0.8604651 0.9487179 0.9024390
Let’s make a confusion matrix based on the 0.2 Threshold
df2 %>%
select(sex, pred_class_threshold) %>%
mutate(sex = recode(sex,
'female' = 'TP',
'male' = 'TN'),
pred_class_threshold = recode(pred_class_threshold,
'female' = 'FP',
'male' = 'FN')) %>%
table()
## pred_class_threshold
## sex FN FP
## TN 48 6
## TP 2 37
df2$pred_class_threshold <- ifelse(df2$.pred_female >= 0.5, "female", "male")
print(df2)
## # A tibble: 93 × 4
## .pred_female .pred_class sex pred_class_threshold
## <dbl> <chr> <chr> <chr>
## 1 0.992 female female female
## 2 0.954 female female female
## 3 0.985 female female female
## 4 0.187 male female male
## 5 0.995 female female female
## 6 1.000 female female female
## 7 0.959 female female female
## 8 1.000 female female female
## 9 1.000 female female female
## 10 0.339 male female male
## # ℹ 83 more rows
Later on we will be making a table of the accuracy, precision, recall and F1 scores for the probability thresholds, let’s get the calculations for the 0.5 probability threshold done now.
calculations <- function(df2) {
#Calculating True Positives (TP), True Negatives (TN), False Positives (FP), False Negatives (FN)
TP <- sum(df2$sex == 'female' & df2$pred_class_threshold == 'female')
TN <- sum(df2$sex == 'male' & df2$pred_class_threshold == 'male')
FP <- sum(df2$sex == 'male' & df2$pred_class_threshold == 'female')
FN <- sum(df2$sex == 'female' & df2$pred_class_threshold == 'male')
#Calculating accuracy
accuracy_05 <- (TP + TN) / (TP + FP + TN + FN)
#Calculating precision
precision_05 <- (TP) / (TP + FP)
#Calculating recall
recall_05 <- (TP) / (TP + FN)
#Calculating F1 score
f1_05 <- (2 * precision_05 * recall_05) / (precision_05 + recall_05)
result_05 <- c(accuracy_05, precision_05, recall_05, f1_05)
result_05
}
scores_05 <-calculations(df2)
print(scores_05)
## [1] 0.9354839 0.9230769 0.9230769 0.9230769
Let’s make a confusion matrix based on the 0.5 Threshold
df2 %>%
select(sex, pred_class_threshold) %>%
mutate(sex = recode(sex,
'female' = 'TP',
'male' = 'TN'),
pred_class_threshold = recode(pred_class_threshold,
'female' = 'FP',
'male' = 'FN')) %>%
table()
## pred_class_threshold
## sex FN FP
## TN 51 3
## TP 3 36
df2$pred_class_threshold <- ifelse(df2$.pred_female >= 0.8, "female", "male")
print(df2)
## # A tibble: 93 × 4
## .pred_female .pred_class sex pred_class_threshold
## <dbl> <chr> <chr> <chr>
## 1 0.992 female female female
## 2 0.954 female female female
## 3 0.985 female female female
## 4 0.187 male female male
## 5 0.995 female female female
## 6 1.000 female female female
## 7 0.959 female female female
## 8 1.000 female female female
## 9 1.000 female female female
## 10 0.339 male female male
## # ℹ 83 more rows
Later on we will be making a table of the accuracy, precision, recall and F1 scores for the probability thresholds, let’s get the calculations for the 0.8 probability threshold done now.
calculations <- function(df2) {
#Calculating True Positives (TP), True Negatives (TN), False Positives (FP), False Negatives (FN)
TP <- sum(df2$sex == 'female' & df2$pred_class_threshold == 'female')
TN <- sum(df2$sex == 'male' & df2$pred_class_threshold == 'male')
FP <- sum(df2$sex == 'male' & df2$pred_class_threshold == 'female')
FN <- sum(df2$sex == 'female' & df2$pred_class_threshold == 'male')
#Calculating accuracy
accuracy_08 <- (TP + TN) / (TP + FP + TN + FN)
#Calculating precision
precision_08 <- (TP) / (TP + FP)
#Calculating recall
recall_08 <- (TP) / (TP + FN)
#Calculating F1 score
f1_08 <- (2 * precision_08 * recall_08) / (precision_08 + recall_08)
result_08 <- c(accuracy_08, precision_08, recall_08, f1_08)
result_08 # The last evaluated expression is returned
#return(list(accuracy_08, precision_08,
# recall_08, f1_08))
}
scores_08 <-calculations(df2)
print(scores_08)
## [1] 0.9462366 0.9473684 0.9230769 0.9350649
Let’s make a confusion matrix based on the 0.8 Threshold
df2 %>%
select(sex, pred_class_threshold) %>%
mutate(sex = recode(sex,
'female' = 'TP',
'male' = 'TN'),
pred_class_threshold = recode(pred_class_threshold,
'female' = 'FP',
'male' = 'FN')) %>%
table()
## pred_class_threshold
## sex FN FP
## TN 52 2
## TP 3 36
As mentioned earlier we will be making a table of the accuracy, precision, recall and F1 scores for all three probability thresholds.
labels <- c("Accuracy", "Precision", "Recall", "F1 Score")
score_df <- data.frame(
Performace_Metrics = labels,
Threshold_02 = scores_02,
Threshold_05 = scores_05,
Threshold_08 = scores_08
)
score_df %>%
kable(format = "html", caption = "Performance Metrics Table") %>%
kable_styling() %>%
column_spec(1, width = "18em")
Performace_Metrics | Threshold_02 | Threshold_05 | Threshold_08 |
---|---|---|---|
Accuracy | 0.9139785 | 0.9354839 | 0.9462366 |
Precision | 0.8604651 | 0.9230769 | 0.9473684 |
Recall | 0.9487179 | 0.9230769 | 0.9230769 |
F1 Score | 0.9024390 | 0.9230769 | 0.9350649 |
We might want a low probability threshold like 0.2 if it’s very important that do not miss any female penguins, i.e. it is better for us to have too many false positives than any false negatives. If say there were a disease that primarily affects female penguins, we would want to lower our threshold.
On the other hand we may want a very high probability threshold if we wanted to avoid as many false positives as possible, say should we be trying to create a new space for the female penguins away from the male penguins, we would want less false positives and want to make sure we have a model that captures the male penguins well.