Assignment 2

DATA 621

Daniel DeBonis

Overview

In this homework assignment, you will work through various classification metrics. You will be asked to create functions in R to carry out the various calculations. You will also investigate some functions in packages that will let you obtain the equivalent results. Finally, you will create graphical output that also can be used to evaluate the output of classification models, such as binary logistic regression.

Step 1 - Download Classification Data Set

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data <- read.csv('https://raw.githubusercontent.com/ddebonis47/classwork/refs/heads/main/classification-output-data.csv')

Step 2 - Raw Confusion Matrix

table(data$class, data$scored.class)
##    
##       0   1
##   0 119   5
##   1  30  27

We can determine a few things from the confusion matrix of our data. In this table, the rows are the actual classes and the columns are the predicted classes. The majority of the predictions were correct: of the 149 observations predicted to be in class zero, 119 actually belonged to that class, and of the 32 predicted to be in class one, 27 did. Of the 181 total observations, 35 were misclassified (a type I or type II error).
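For readability, the same confusion matrix can be produced with explicit dimension labels, which makes the actual/predicted orientation unambiguous (an optional variant, not part of the required output):

# Label the dimensions so rows are clearly actual values and columns are predictions
table(Actual = data$class, Predicted = data$scored.class)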

Step 3 - Accuracy Function

The formula for accuracy is the number of correct predictions divided by the total number of predictions made: (TP + TN) / (TP + FP + TN + FN).

get_accuracy <- function(data, actual_col = "class", predicted_col = "scored.class") {
  actual <- data[[actual_col]]
  predicted <- data[[predicted_col]]
  mean(actual == predicted)
}
accuracy <- get_accuracy(data)
print(accuracy)
## [1] 0.8066298

The accuracy here is approximately 0.81.
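As a quick check, the same value can be recovered by hand from the confusion matrix counts (119 + 27 correct predictions out of 181 observations):

# Hand calculation from the confusion matrix; should match get_accuracy(data)
(119 + 27) / 181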

Step 4 - Classification Error Rate

The classification error rate is the complement of accuracy: the ratio of incorrect predictions to total predictions made, i.e. 1 - accuracy.

df <- data

get_classification_error <- function(df, actual_col = "class", predicted_col = "scored.class") {
  actual <- df[[actual_col]]
  predicted <- df[[predicted_col]]
  
  error_rate <- mean(actual != predicted)
  return(error_rate)
}
error_rate <- get_classification_error(df)
print(error_rate)
## [1] 0.1933702

As expected, the resulting value equals 1 - 0.8066, where 0.8066 is the accuracy computed above.
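A one-line check confirms the relationship directly; the two quantities should sum to exactly 1:

# Accuracy and classification error rate are complements for the same predictions
accuracy + error_rate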

Step 5 - Precision

Precision tells us what proportion of the observations predicted to be in the positive class actually belong to it: TP / (TP + FP).

get_precision <- function(df, actual_col = "class", predicted_col = "scored.class", positive_class = 1) {
  actual <- df[[actual_col]]
  predicted <- df[[predicted_col]]
  
  # True Positives
  TP <- sum(predicted == positive_class & actual == positive_class)
  
  # False Positives 
  FP <- sum(predicted == positive_class & actual != positive_class)
  
  # Avoid division by zero
  precision <- ifelse((TP + FP) == 0, NA, TP / (TP + FP))
  
  return(precision)
}
get_precision(df)
## [1] 0.84375

Step 6 - Sensitivity / Recall

Sensitivity (also called recall) measures the proportion of actual positives that were correctly predicted to be positive: TP / (TP + FN).

get_sensitivity <- function(df, actual_col = "class", predicted_col = "scored.class", positive_class = 1) {
  actual <- df[[actual_col]]
  predicted <- df[[predicted_col]]
  
  # True Positives (TP): actual = positive and predicted = positive
  TP <- sum(actual == positive_class & predicted == positive_class)
  
  # False Negatives (FN): actual = positive but predicted = negative
  FN <- sum(actual == positive_class & predicted != positive_class)
  
  # Avoid division by zero
  sensitivity <- ifelse((TP + FN) == 0, NA, TP / (TP + FN))
  
  return(sensitivity)
}
get_sensitivity(df)
## [1] 0.4736842

Or as a fraction, this would be 27/(27+30)

Step 7 - Specificity

Specificity is the counterpart to sensitivity: it measures the proportion of actual negatives that were correctly predicted to be negative, TN / (TN + FP).

get_specificity <- function(df, actual_col = "class", predicted_col = "scored.class", positive_class = 1) {
  actual <- df[[actual_col]]
  predicted <- df[[predicted_col]]
  
  # Identify negatives (anything not the positive_class)
  negative_class <- setdiff(unique(actual), positive_class)
  if (length(negative_class) == 0) {
    stop("No negative class found. Check your data or positive_class setting.")
  }
  negative_class <- negative_class[1]  # In case multiple
  
  # True Negatives (TN): actual = negative and predicted = negative
  TN <- sum(actual == negative_class & predicted == negative_class)
  
  # False Positives (FP): actual = negative but predicted = positive
  FP <- sum(actual == negative_class & predicted == positive_class)
  
  # Avoid division by zero
  specificity <- ifelse((TN + FP) == 0, NA, TN / (TN + FP))
  
  return(specificity)
}
get_specificity(df)
## [1] 0.9596774

Or in fraction form, it would be equal to 119/124

Step 8 - F1 Score

The F1 score is the harmonic mean of precision and recall: F1 = 2 * (precision * recall) / (precision + recall). It is a useful single metric when both quantities matter.

get_f1 <- function(df, actual_col = "class", predicted_col = "scored.class", positive_class = 1) {
  actual <- df[[actual_col]]
  predicted <- df[[predicted_col]]
  
  # True Positives (TP)
  TP <- sum(predicted == positive_class & actual == positive_class)
  
  # False Positives (FP)
  FP <- sum(predicted == positive_class & actual != positive_class)
  
  # False Negatives (FN)
  FN <- sum(predicted != positive_class & actual == positive_class)
  
  # Precision and Recall
  precision <- ifelse((TP + FP) == 0, NA, TP / (TP + FP))
  recall    <- ifelse((TP + FN) == 0, NA, TP / (TP + FN))
  
  # F1 Score
  if (is.na(precision) || is.na(recall) || (precision + recall) == 0) {
    f1 <- NA
  } else {
    f1 <- 2 * (precision * recall) / (precision + recall)
  }
  
  return(f1)
}
get_f1(df)
## [1] 0.6067416

Step 9 - More on F1 Score

The F1 score is always between 0 and 1. If every observation is predicted correctly, precision and recall both equal 1, and so does the F1 score. More generally, precision and recall each take the form TP / (TP + X) with X >= 0, so both lie between 0 and 1. Writing precision as p and recall as r, if 0 < p < 1 and 0 < r < 1, then pr < p and pr < r, so 2pr < p + r; dividing gives F1 = 2pr / (p + r) < 1, and the score is positive because its numerator is. The harmonic mean of two proportions in this range therefore stays in the same range.
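A small simulation illustrates the bound (a sketch for illustration only, not required output): for precision and recall values drawn strictly between zero and one, the harmonic mean never leaves that interval.

# Draw random precision/recall pairs in (0, 1) and confirm their harmonic mean stays in (0, 1)
set.seed(621)
p <- runif(10000, min = 0.001, max = 0.999)
r <- runif(10000, min = 0.001, max = 0.999)
f1 <- 2 * p * r / (p + r)
range(f1)  # both endpoints lie strictly between 0 and 1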

Step 10 - ROC Curve and AUC

get_roc <- function(df, actual_col = "class", prob_col = "scored.probability", positive_class = 1) {
  actual <- df[[actual_col]]
  prob <- df[[prob_col]]
  
  thresholds <- seq(0, 1, by = 0.01)
  
  tpr <- numeric(length(thresholds))  # True Positive Rate (Sensitivity)
  fpr <- numeric(length(thresholds))  # False Positive Rate (1 - Specificity)
  
  for (i in seq_along(thresholds)) {
    t <- thresholds[i]
    predicted <- ifelse(prob >= t, positive_class, 1 - positive_class)
    
    TP <- sum(predicted == positive_class & actual == positive_class)
    FP <- sum(predicted == positive_class & actual != positive_class)
    TN <- sum(predicted != positive_class & actual != positive_class)
    FN <- sum(predicted != positive_class & actual == positive_class)
    
    tpr[i] <- ifelse((TP + FN) == 0, 0, TP / (TP + FN))
    fpr[i] <- ifelse((FP + TN) == 0, 0, FP / (FP + TN))
  }
  
  # Sort points by FPR for plotting and AUC calculation
  order_idx <- order(fpr)
  fpr <- fpr[order_idx]
  tpr <- tpr[order_idx]
  
  # Compute AUC via trapezoidal rule
  auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
  
  # Plot ROC
  library(ggplot2)
  roc_plot <- ggplot(data.frame(FPR = fpr, TPR = tpr), aes(x = FPR, y = TPR)) +
    geom_line(color = "blue", linewidth = 1) +
    geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
    labs(title = paste("ROC Curve (AUC =", round(auc, 3), ")"),
         x = "False Positive Rate (1 - Specificity)",
         y = "True Positive Rate (Sensitivity)") +
    theme_minimal()
  
  return(list(plot = roc_plot, auc = auc))
}

roc_result <- get_roc(df)

roc_result$plot    # shows the ROC curve

roc_result$auc 
## [1] 0.8484012
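As a cross-check on the trapezoidal calculation, the AUC can also be computed from the ranks of the scored probabilities via the Mann-Whitney relationship (a sketch; auc_rank is a helper written here for illustration, not part of the assignment):

# AUC as the probability that a randomly chosen positive outranks a randomly chosen negative
auc_rank <- function(actual, prob, positive_class = 1) {
  pos <- actual == positive_class
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(prob)  # average ranks handle tied probabilities
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
auc_rank(df$class, df$scored.probability)

This should land very close to the trapezoidal estimate above; the small gap comes from get_roc() evaluating the curve on a fixed grid of thresholds rather than at every observed probability.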

Step 11 - Producing Classification Metrics

I have been generating these values as I worked through the assignment, but I will collect them all in one place here; a data-frame version of the scalar metrics follows at the end of this step:

# Collect all of the metrics for df using the functions defined above
metrics <- list(
  accuracy = get_accuracy(df),
  error_rate = get_classification_error(df),
  precision = get_precision(df),
  sensitivity = get_sensitivity(df),
  specificity = get_specificity(df),
  f1_score = get_f1(df),
  roc = get_roc(df)
)

# View the numeric metrics
metrics[c("accuracy", "error_rate", "precision", "sensitivity", "specificity", "f1_score")]
## $accuracy
## [1] 0.8066298
## 
## $error_rate
## [1] 0.1933702
## 
## $precision
## [1] 0.84375
## 
## $sensitivity
## [1] 0.4736842
## 
## $specificity
## [1] 0.9596774
## 
## $f1_score
## [1] 0.6067416
# View the ROC curve
metrics$roc$plot

# View the AUC
metrics$roc$auc
## [1] 0.8484012
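To present the scalar metrics as an actual table rather than a printed list, they can be gathered into a small data frame (a formatting convenience only; the values are unchanged):

# Collect the scalar metrics into a single data frame for a cleaner table
metrics_table <- data.frame(
  metric = c("accuracy", "error_rate", "precision", "sensitivity", "specificity", "f1_score"),
  value  = c(metrics$accuracy, metrics$error_rate, metrics$precision,
             metrics$sensitivity, metrics$specificity, metrics$f1_score)
)
metrics_table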

Step 12 - Replicating with caret package

library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift

The caret function confusionMatrix() expects the predicted and reference class columns to be factors with the same levels:

# Make sure the columns are factors
df$class <- factor(df$class, levels = c(0,1))
df$scored.class <- factor(df$scored.class, levels = c(0,1))

# Compute confusion matrix
cm <- confusionMatrix(data = df$scored.class, reference = df$class, positive = "1")
cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 119  30
##          1   5  27
##                                           
##                Accuracy : 0.8066          
##                  95% CI : (0.7415, 0.8615)
##     No Information Rate : 0.6851          
##     P-Value [Acc > NIR] : 0.0001712       
##                                           
##                   Kappa : 0.4916          
##                                           
##  Mcnemar's Test P-Value : 4.976e-05       
##                                           
##             Sensitivity : 0.4737          
##             Specificity : 0.9597          
##          Pos Pred Value : 0.8438          
##          Neg Pred Value : 0.7987          
##              Prevalence : 0.3149          
##          Detection Rate : 0.1492          
##    Detection Prevalence : 0.1768          
##       Balanced Accuracy : 0.7167          
##                                           
##        'Positive' Class : 1               
## 

The values I calculated earlier match the corresponding values reported by caret exactly: accuracy, sensitivity, specificity, and precision (listed as Pos Pred Value). The summary also includes additional statistics, such as Kappa, the no-information rate, and balanced accuracy.
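Individual statistics can also be pulled out of the caret object programmatically, which makes a side-by-side comparison with the hand-written functions straightforward (a brief optional check; Pos Pred Value is caret's name for precision):

# Extract selected statistics from the confusionMatrix object for direct comparison
cm$overall["Accuracy"]
cm$byClass[c("Sensitivity", "Specificity", "Pos Pred Value")]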

Step 13 - Replicating with pROC package

library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
# class was converted to a factor in Step 12, so as.numeric() recodes 0/1 as 1/2;
# the ordering is preserved, so the ROC and AUC are unaffected (hence pROC's control = 1, case = 2 message)
df$class <- as.numeric(df$class)

# Generate ROC object
roc_obj <- roc(response = df$class, predictor = df$scored.probability)
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
# Plot ROC curve
plot(roc_obj, col = "blue", lwd = 2, main = "ROC Curve (pROC)")
abline(a = 0, b = 1, lty = 2, col = "red")  # random guess line

# Compute AUC
auc_value <- auc(roc_obj)
auc_value
## Area under the curve: 0.8503

There are some differences between this graph and the one produced in Step 10. By default, pROC plots specificity on the x-axis running from 1 down to 0 rather than the false positive rate running from 0 to 1, so the plot is oriented differently even though the same information is being presented. The AUC values also differ very slightly (0.848 vs. 0.850): get_roc() evaluates the curve on a fixed grid of thresholds in steps of 0.01, while pROC uses every observed probability as a threshold, so the manual trapezoidal estimate is a bit coarser.
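If an x-axis that matches the Step 10 plot is preferred, pROC's plot method accepts a legacy.axes argument that switches the x-axis to 1 - specificity running from 0 to 1 (an optional variant of the plot above):

# Re-plot with 1 - specificity on the x-axis so the orientation matches the Step 10 curve
plot(roc_obj, legacy.axes = TRUE, col = "blue", lwd = 2,
     main = "ROC Curve (pROC, legacy axes)")
abline(a = 0, b = 1, lty = 2, col = "red")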