Load Packages

library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)

Load Data

url <- "https://raw.githubusercontent.com/acatlin/data/master/classification_model_performance.csv"

performance <- read_csv(url, show_col_types = FALSE)

head(performance)
## # A tibble: 6 × 3
##   class scored.class scored.probability
##   <dbl>        <dbl>              <dbl>
## 1     0            0             0.328 
## 2     0            0             0.273 
## 3     1            0             0.110 
## 4     0            0             0.0560
## 5     0            0             0.100 
## 6     0            0             0.0552

Calculate and state the null error rate for the provided classification_model_performance.csv dataset. Create a plot showing the data distribution of the actual explanatory variable.

# assign each case to TP, TN, FP, or FN based on actual vs. predicted class
performance <- performance %>%
  mutate(result = case_when(
    class == 0 & scored.class == 0 ~ "TN",
    class == 1 & scored.class == 1 ~ "TP",
    class == 1 & scored.class == 0 ~ "FN",
    class == 0 & scored.class == 1 ~ "FP"
  ))

# calculate null error rate: how often we would be wrong if we always
# predicted the majority class (class 0), i.e. the proportion of actual positives
null.error.rate <- (sum(performance$result == "TP") + sum(performance$result == "FN")) / nrow(performance)

# Print the null error rate
cat("The null error rate is:", null.error.rate, "\n")
## The null error rate is: 0.3149171
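
As a quick cross-check, the same value can be obtained directly from the class proportions, without going through the result labels. This is a minimal sketch using only base R and the performance data frame loaded above.

# cross-check: null error rate = 1 - proportion of the majority class
1 - max(prop.table(table(performance$class)))   # should also be ~0.315
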
# percentage of actual positives and negatives, used to label the bars below
performance_percents <- performance %>%
  summarise(
    positive_actual = sum(class == 1),
    negative_actual = sum(class == 0),
    positive_percent = round(positive_actual / n() * 100, 2),
    negative_percent = round(negative_actual / n() * 100, 2))

performance %>%
  ggplot(aes(x = factor(class))) +
  geom_bar(aes(fill = factor(class)), show.legend = FALSE) +
  scale_fill_manual(values = c("0" = "lightblue", "1" = "lightpink")) +
  geom_text(
    aes(label = paste0(ifelse(class == 0,
                              performance_percents$negative_percent,
                              performance_percents$positive_percent), "%")),
    y = 0,
    vjust = -0.5
  ) +
  xlab("Class") +
  ylab("Count") +
  ggtitle("Target",
          "Distribution of Actual Explanatory Variable") +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5)
  )

Analyze the data to determine the true positive, false positive, true negative, and false negative values for the dataset, using scored.probability thresholds of 0.2, 0.5, and 0.8. Display your results in a table, with the probability thresholds in columns, and the TP, FP, TN, and FN values in rows.

confusion_matrix <- performance %>%
  mutate(threshold_02 = ifelse(scored.probability >= 0.2, 1, 0),
         threshold_05 = ifelse(scored.probability >= 0.5, 1, 0),
         threshold_08 = ifelse(scored.probability >= 0.8, 1, 0)) %>%
  # stack the three threshold columns into long form; the predictions are named
  # "predicted" so they do not collide with the existing scored.class column
  pivot_longer(cols = starts_with("threshold_"),
               names_to = "threshold", values_to = "predicted") %>%
  group_by(threshold) %>%
  summarise(
    TP = sum(class == 1 & predicted == 1),
    FP = sum(class == 0 & predicted == 1),
    TN = sum(class == 0 & predicted == 0),
    FN = sum(class == 1 & predicted == 0)
  ) %>%
  pivot_longer(cols = c(TP, FP, TN, FN), names_to = "Values", values_to = "count") %>%
  pivot_wider(names_from = threshold, values_from = count) %>%
  as.data.frame()

colnames(confusion_matrix) <- c("Values", "0.2 Threshold", "0.5 Threshold", "0.8 Threshold")

confusion_matrix
##   Values 0.2 Threshold 0.5 Threshold 0.8 Threshold
## 1     TP            49            27             9
## 2     FP            51             5             1
## 3     TN            73           119           123
## 4     FN             8            30            48
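
As a spot check, the 0.5 column can be reproduced with a base R cross-tabulation of actual against predicted class. This is a minimal sketch that assumes only the performance data frame from above.

# cross-tabulate actuals vs. predictions at the 0.5 threshold;
# the four cells should match the 0.5 Threshold column above
table(Actual = performance$class,
      Predicted = as.integer(performance$scored.probability >= 0.5))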

Create a table showing — for each of the three thresholds — the accuracy, precision, recall, and F1 scores.

# transpose matrix
rotated_confusion_matrix <- as.data.frame(t(confusion_matrix[,-1]))
colnames(rotated_confusion_matrix) <- confusion_matrix$Values

rotated_confusion_matrix %>%
  mutate(
    accuracy = (TP + TN) / (TP+FP+TN+FN),
    precision = TP / (TP + FP),
    recall = TP / (TP + FN),
    f.score = (2 * recall * precision) / (recall + precision)
  ) %>%
  select(-TP, -FP, -TN, -FN)
##                accuracy precision    recall   f.score
## 0.2 Threshold 0.6740331   0.49000 0.8596491 0.6242038
## 0.5 Threshold 0.8066298   0.84375 0.4736842 0.6067416
## 0.8 Threshold 0.7292818   0.90000 0.1578947 0.2686567
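
As a sanity check on the formulas, the 0.5-threshold row can be reproduced by hand from the confusion-matrix counts reported earlier.

# hand check of the 0.5 threshold row, using counts from the confusion matrix table
tp <- 27; fp <- 5; tn <- 119; fn <- 30
c(accuracy  = (tp + tn) / (tp + fp + tn + fn),  # 146/181 ≈ 0.8066
  precision = tp / (tp + fp),                   # 27/32   = 0.84375
  recall    = tp / (tp + fn),                   # 27/57   ≈ 0.4737
  f.score   = 2 * tp / (2 * tp + fp + fn))      # 54/89   ≈ 0.6067, same as 2PR/(P+R)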

Provide at least one example use case where (a) a 0.2 scored probability threshold would be preferable, and (b) a 0.8 scored probability threshold would be preferable.

0.2 Scored Probability Threshold

Scenario: Early screening for diseases.

In a model screening for a rare disease, it may be preferable to use a 0.2 scored probability threshold, since this increases the sensitivity of the model and captures more potential cases, minimizing the chance of missing true positives. The consequences of a false positive are less severe than those of missing a true positive.

0.8 Scored Probability Threshold

Scenario: Fraud Detection in Financial Transactions

In a model that flags fraudulent financial transactions (e.g. credit card transactions), it may be preferable to use a 0.8 scored probability threshold, since the consequences of false positives are high. If a transaction is falsely flagged as fraud, it could inconvenience a user by blocking their credit card. Using a higher probability threshold means the model only flags transactions where there is high confidence that the transaction is fraudulent.
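
To make the trade-off behind both scenarios concrete, the sketch below sweeps a range of thresholds and reports recall and precision at each. It assumes only the performance data frame loaded earlier; recall rises as the threshold falls, while precision generally rises as the threshold climbs.

# threshold sweep: recall vs. precision at each cutoff
thresholds <- seq(0.1, 0.9, by = 0.1)
tradeoff <- sapply(thresholds, function(t) {
  pred <- as.integer(performance$scored.probability >= t)
  tp   <- sum(performance$class == 1 & pred == 1)
  c(recall    = tp / sum(performance$class == 1),
    precision = ifelse(sum(pred) == 0, NA, tp / sum(pred)))
})
colnames(tradeoff) <- thresholds
round(tradeoff, 3)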

Sources

Confusion Matrix