Background

In this homework assignment, you will work through various classification metrics. You will be asked to create functions in R to carry out the various calculations. You will also investigate some functions in packages that will let you obtain the equivalent results. Finally, you will create graphical output that also can be used to evaluate the output of classification models, such as binary logistic regression.

The data set has three key columns we will use:

class: the actual class for the observation

scored.class: the predicted class for the observation (based on a threshold of 0.5)

scored.probability: the predicted probability of success for the observation

#Import Libraries
library(readr)        # read_csv()
library(tidyverse)    # dplyr, ggplot2, and other tidyverse packages

library(e1071)        # skewness()
library(corrplot)     # correlation matrix plots

library(caret)        # confusionMatrix(), sensitivity(), specificity()
library(ROSE)         # resampling tools for imbalanced classes
library(smotefamily)  # SMOTE oversampling

library(dplyr)        # also attached by tidyverse
library(MASS)
library(lmtest)
library(car)
library(GGally)
library(pROC)         # roc(), auc()

1. Data Ingestion

output_data <- read.csv("https://raw.githubusercontent.com/gillianmcgovern0/cuny-data-621/refs/heads/main/classification-output-data.csv")

head(output_data)
##   pregnant glucose diastolic skinfold insulin  bmi pedigree age class
## 1        7     124        70       33     215 25.5    0.161  37     0
## 2        2     122        76       27     200 35.9    0.483  26     0
## 3        3     107        62       13      48 22.9    0.678  23     1
## 4        1      91        64       24       0 29.2    0.192  21     0
## 5        4      83        86       19       0 29.3    0.317  34     0
## 6        1     100        74       12      46 19.5    0.149  28     0
##   scored.class scored.probability
## 1            0         0.32845226
## 2            0         0.27319044
## 3            0         0.10966039
## 4            0         0.05599835
## 5            0         0.10049072
## 6            0         0.05515460
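
Only the three columns described in the Background (class, scored.class, and scored.probability) are needed for the metrics that follow. As an optional, minimal sketch (base R subsetting; output omitted), they can be viewed on their own:

# The three key columns used throughout this assignment
key_cols <- output_data[, c("class", "scored.class", "scored.probability")]
head(key_cols)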

2. Raw Confusion Matrix

Use the table() function to get the raw confusion matrix for this scored dataset. Make sure you understand the output. In particular, do the rows represent the actual or predicted class? The columns?

# Raw confusion matrix
raw_confusion_matrix <- table(output_data$scored.class, output_data$class)
names(dimnames(raw_confusion_matrix)) <- c("Prediction", "Actual")
print(raw_confusion_matrix)
##           Actual
## Prediction   0   1
##          0 119  30
##          1   5  27

Each row represents the predicted class values, and each column represents the actual values. Treating class 1 as the positive class, this confusion matrix contains 119 true negatives, 27 true positives, 5 false positives, and 30 false negatives.
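
As a quick sanity check (a base R sketch using the 0/1 coding of the two columns; the four totals should match the matrix above), each cell can also be counted directly:

# Count each cell of the confusion matrix directly (positive class = 1)
with(output_data, c(
  TN = sum(scored.class == 0 & class == 0),
  FP = sum(scored.class == 1 & class == 0),
  FN = sum(scored.class == 0 & class == 1),
  TP = sum(scored.class == 1 & class == 1)
))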

3. Accuracy of the Predictions

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the accuracy of the predictions.

\[ Accuracy = \frac{TP + TN}{TP + FP + TN + FN} \]

# Main function to calculate confusion matrix once
calculate_confusion_matrix <- function(df, actual_col = "class", predicted_col = "scored.class") {
  cm <- table(Actual = df[[actual_col]], Predicted = df[[predicted_col]])
  tn <- cm[1, 1]
  fp <- cm[1, 2]
  fn <- cm[2, 1]
  tp <- cm[2, 2]
  return(list(tn=tn,fp=fp,fn=fn,tp=tp))
}

# Function to calculate accuracy
calculate_accuracy <- function(df, actual_col = "class", predicted_col = "scored.class") {
  cm <- calculate_confusion_matrix(df, actual_col, predicted_col)
  total <- cm$tp + cm$tn + cm$fp + cm$fn
  accuracy <- (cm$tp + cm$tn) / total
  
  return(accuracy)
}

#print accuracy
class_accuracy <- calculate_accuracy(output_data)
 print(paste("Accuracy:", class_accuracy))
## [1] "Accuracy: 0.806629834254144"

Accuracy score is 0.8066298.
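
As a one-line sanity check (assuming the class columns are still coded 0/1), accuracy is simply the proportion of observations whose predicted class equals the actual class; it should reproduce the value above:

# Proportion of correctly classified observations (should match the accuracy above)
mean(output_data$scored.class == output_data$class)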

4. Classification Error Rate of the Predictions

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the classification error rate of the predictions. Verify that you get an accuracy and an error rate that sums to one.

\[ Classification\ Error\ Rate = \frac{FP + FN}{TP + FP + TN + FN} \]

# Classification error rate
# % of total observations that the model inaccurately predicts
# Takes in the original dataset and returns classification error rate

calculate_error_rate <- function(df, actual_col = "class", predicted_col = "scored.class") {
  cm <- calculate_confusion_matrix(df, actual_col, predicted_col)
  total <- cm$tp + cm$tn + cm$fp + cm$fn
  error_rate <- (cm$fp + cm$fn) / total
  accuracy <- (cm$tp + cm$tn) / total

  return(list(error_rate = error_rate,
    accuracy = accuracy,
    sum_check = error_rate + accuracy))
}

#print error rate
class_error_rate<- calculate_error_rate(output_data)
print(paste("Error Rate:", class_error_rate$error_rate))
## [1] "Error Rate: 0.193370165745856"

Classification Error Rate is 0.1933702.

Let’s make sure Classification Error Rate + Accuracy = 1:

# Validate error rate and accuracy functions
# Sum should be 1

print(paste("Sum of accuracy and error rate:", round(class_error_rate$sum_check, 4)))
## [1] "Sum of accuracy and error rate: 1"

5. Precision of the Predictions

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the precision of the predictions.

\[ Precision = \frac{TP}{TP + FP} \]

# Precision
# Accuracy of the model's positive predictions
# Takes in the original dataset and returns precision

calculate_precision <- function(df, actual_col = "class", predicted_col = "scored.class") {
  cm <- calculate_confusion_matrix(df, actual_col, predicted_col)

  precision <- (cm$tp) / (cm$tp + cm$fp)
  
  return(precision)
}

#print precision
class_precision<- calculate_precision(output_data)
print(paste("Precision:", class_precision))
## [1] "Precision: 0.84375"

Precision is 0.84375.

6. Sensitivity/Recall of the Predictions

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the sensitivity of the predictions. Sensitivity is also known as recall.

\[ Sensitivity = \frac{TP}{TP + FN} \]

# Sensitivity
# True positive rate
# Takes in the original dataset and returns sensitivity

calculate_sensitivity<- function(df, actual_col = "class", predicted_col = "scored.class") {
  cm <- calculate_confusion_matrix(df, actual_col, predicted_col)

  sensitivity <- (cm$tp) / (cm$tp + cm$fn)
  
  return(sensitivity)
}

#print sensitivity
class_sensitivity<- calculate_sensitivity(output_data)
print(paste("Sensitivity:", class_sensitivity))
## [1] "Sensitivity: 0.473684210526316"

Sensitivity/Recall is 0.4736842.

7. Specificity of the Predictions

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the specificity of the predictions.

\[ Specificity = \frac{TN}{TN + FP} \]

# Specificity
# True negative rate
# Takes in the original dataset and returns specificity

calculate_specificity<- function(df, actual_col = "class", predicted_col = "scored.class") {
  cm <- calculate_confusion_matrix(df, actual_col, predicted_col)

  specificity <- (cm$tn) / (cm$tn + cm$fp)
  
  return(specificity)
}

#print specificity
class_specificity<- calculate_specificity(output_data)
print(paste("Specificity:", class_specificity))
## [1] "Specificity: 0.959677419354839"

Specificity is 0.9596774.
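
Both rates can be double-checked from direct counts (a small base R sketch using the 0/1 coding; the two values should match the sensitivity and specificity reported above):

# Sensitivity: true positives / all actual positives
sum(output_data$scored.class == 1 & output_data$class == 1) / sum(output_data$class == 1)
# Specificity: true negatives / all actual negatives
sum(output_data$scored.class == 0 & output_data$class == 0) / sum(output_data$class == 0)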

8. F1 Score of the Predictions

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the F1 score of the predictions.

\[ F1\ Score = \frac{2 \times Precision \times Sensitivity}{Precision + Sensitivity} \]

# F1 Score
# Takes in the original dataset and returns f1 score

calculate_f1_score <- function(df, actual_col = "class", predicted_col = "scored.class") {
  # Reuse the precision and sensitivity functions on the same data frame
  prec <- calculate_precision(df, actual_col, predicted_col)
  sens <- calculate_sensitivity(df, actual_col, predicted_col)
  f1_score <- (2 * prec * sens) / (prec + sens)
  
  return(f1_score)
}

#print f1-score
class_f1_score<- calculate_f1_score(output_data)
print(paste("F1_score:", class_f1_score))
## [1] "F1_score: 0.606741573033708"

F1 Score is 0.6067416.
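
The F1 score can also be written purely in terms of confusion matrix counts, F1 = 2TP / (2TP + FP + FN). As a quick arithmetic check using the cells from the matrix above (TP = 27, FP = 5, FN = 30):

# Count form of F1; should match the value above (54 / 89)
(2 * 27) / (2 * 27 + 5 + 30)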

9. F1 Score of the Predictions: Bounds

Before we move on, let’s consider a question that was asked: What are the bounds on the F1 score? Show that the F1 score will always be between 0 and 1. (Hint: If 0 < a < 1 and 0 < b < 1, then ab < a.)

# Confirm a * b < a
calculate_precision(output_data) * calculate_sensitivity(output_data) # a * b
## [1] 0.3996711
calculate_precision(output_data) # a
## [1] 0.84375
calculate_sensitivity(output_data) # b
## [1] 0.4736842

The F1 score is bounded between 0 and 1. Precision and recall are both proportions, so each lies between 0 and 1, and by the hint their product is no larger than either one: here recall is 0.47 and precision is 0.84, so recall × precision is 0.40, which is less than either value. Because ab ≤ a and ab ≤ b, adding the two inequalities gives 2ab ≤ a + b, so F1 = 2ab / (a + b) can never exceed 1; and since a and b are non-negative, F1 can never be negative.

(2 * 1 * 1) / (1 + 1)
## [1] 1
(2 * 1 * 0) / (1 + 0)
## [1] 0

If both recall and precision are 100%, the F1 score is 1. If either score were zero (with the other nonzero), the F1 score would be 0. The F1 score is therefore bounded between 0 and 1.
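
As a quick numerical spot-check (not a proof, just a grid evaluation of the formula over precision and recall values in (0, 1]), the computed F1 values never leave the interval [0, 1]:

# Evaluate F1 over a grid of precision (p) and recall (r) values in (0, 1]
grid <- expand.grid(p = seq(0.01, 1, by = 0.01), r = seq(0.01, 1, by = 0.01))
f1_grid <- 2 * grid$p * grid$r / (grid$p + grid$r)
range(f1_grid)  # both endpoints should fall within [0, 1]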

10. ROC curve

Write a function that generates an ROC curve from a data set with a true classification column (class in our example) and a probability column (scored.probability in our example). Your function should return a list that includes the plot of the ROC curve and a vector that contains the calculated area under the curve (AUC). Note that I recommend using a sequence of thresholds ranging from 0 to 1 at 0.01 intervals.

# Function to generate ROC curve and calculate AUC
calculate_roc_curve <- function(df, actual_col = "class", probability_col = "scored.probability") {
  
  # Create a sequence of thresholds from 0 to 1 at 0.01 intervals
  thresholds <- seq(0, 1, by = 0.01)
  
  # Initialize vectors to store TPR and FPR
  tpr_values <- numeric(length(thresholds))
  fpr_values <- numeric(length(thresholds))
  
  # Get actual values
  actual <- df[[actual_col]]
  probabilities <- df[[probability_col]]
  
  # Calculate TPR and FPR for each threshold
  for (i in seq_along(thresholds)) {
    threshold <- thresholds[i]
    
    # Predicted class based on threshold
    predicted <- as.integer(probabilities >= threshold)
    
    # Create confusion matrix with explicit factor levels to ensure 2x2 structure
    cm <- table(factor(actual, levels = c(0, 1)), 
                factor(predicted, levels = c(0, 1)))
    
    # Extract values from 2x2 matrix
    tn <- cm[1, 1]
    fp <- cm[1, 2]
    fn <- cm[2, 1]
    tp <- cm[2, 2]
    
    # Calculate True Positive Rate(sensitivity) and False Positive Rate(1 - Specificity)
    tpr_values[i] <- ifelse(tp + fn == 0, 0, tp / (tp + fn))
    fpr_values[i] <- ifelse(fp + tn == 0, 0, fp / (fp + tn))
  }
  
  # Calculate AUC using trapezoidal rule
  auc <- 0
  for (i in 2:length(fpr_values)) {
    auc <- auc + (fpr_values[i-1] - fpr_values[i]) * (tpr_values[i-1] + tpr_values[i]) / 2
  }
  
  # Create ROC plot (drawn as a side effect of plot())
  plot(fpr_values, tpr_values, 
       type = "l", 
       xlim = c(0, 1), 
       ylim = c(0, 1),
       xlab = "False Positive Rate (1 - Specificity)",
       ylab = "True Positive Rate (Sensitivity)",
       main = paste("ROC Curve (AUC =", round(auc, 4), ")"),
       col = "blue",
       lwd = 2)
  
  # Add diagonal reference line
  abline(0, 1, col = "red", lty = 2, lwd = 1.5)
  
  # Capture the finished plot so it can be returned and redrawn later
  # (plot() itself returns NULL, so it cannot be stored directly)
  plot_roc <- recordPlot()
  
  # Return list with plot and AUC
  result <- list(
    plot = plot_roc,
    auc = auc,
    fpr = fpr_values,
    tpr = tpr_values,
    thresholds = thresholds
  )
  
  return(result)
}


roc_results <- calculate_roc_curve(output_data)

print(paste("AUC:", round(roc_results$auc, 4)))
## [1] "AUC: 0.8489"

11. All Classification Metrics

Use your created R functions and the provided classification output data set to produce all of the classification metrics discussed above.

#Call all created functions
cm <- calculate_confusion_matrix(output_data)
class_accuracy <- calculate_accuracy(output_data)
class_error_rate<- calculate_error_rate(output_data)
class_precision<- calculate_precision(output_data)
class_sensitivity<- calculate_sensitivity(output_data)
class_specificity<- calculate_specificity(output_data)
class_f1_score<- calculate_f1_score(output_data)

# Generate single consolidated metrics 
output_text <- paste(
"================================================================================\n",
"COMPREHENSIVE CLASSIFICATION METRICS REPORT\n",
"================================================================================\n\n",
  
"CONFUSION MATRIX:\n",
sprintf("            Predicted 0  Predicted 1\n"),
sprintf("Actual 0    %12d %12d\n", cm$tn, cm$fp),
sprintf("Actual 1    %12d %12d\n\n", cm$fn, cm$tp),
  
"Confusion Matrix Components:\n",
sprintf("True Negatives (TN):   %d\n", cm$tn),
sprintf("False Positives (FP):  %d\n", cm$fp),
sprintf("False Negatives (FN):  %d\n", cm$fn),
sprintf("True Positives (TP):   %d\n\n", cm$tp),
  
"CLASSIFICATION METRICS:\n",
"----------------\n",
sprintf("Accuracy:              %.4f\n", class_accuracy),
sprintf("Error Rate:            %.4f\n",class_error_rate$error_rate),
sprintf("Accuracy + Error Rate: %.4f (should equal 1.0)\n\n", class_error_rate$sum_check),
  
"THRESHOLD-DEPENDENT METRICS:\n",
"----------------\n",
sprintf("Precision:             %.4f\n",class_precision),
sprintf("Sensitivity:           %.4f (True Positive Rate)\n", class_sensitivity),
sprintf("Specificity:           %.4f (True Negative Rate)\n\n", class_specificity),
  
"HARMONIC MEAN METRIC:\n",
"----------------\n",
sprintf("F1 Score:              %.4f\n", class_f1_score),
sprintf("F1 Score range:        [0, 1] βœ“\n\n"),
  
"ROC CURVE AND AUC:\n",
"----------------\n",
sprintf("Area Under Curve (AUC): %.4f\n", roc_results$auc),
sprintf("AUC = 0.5  (Random classifier)\n"),
sprintf("AUC = 1.0  (Perfect classifier)\n"),
sprintf("Current AUC = %.4f\n\n", roc_results$auc),

"================================================================================\n"
)

cat(output_text)
## ================================================================================
##  COMPREHENSIVE CLASSIFICATION METRICS REPORT
##  ================================================================================
## 
##  CONFUSION MATRIX:
##              Predicted 0  Predicted 1
##  Actual 0             119            5
##  Actual 1              30           27
## 
##  Confusion Matrix Components:
##  True Negatives (TN):   119
##  False Positives (FP):  5
##  False Negatives (FN):  30
##  True Positives (TP):   27
## 
##  CLASSIFICATION METRICS:
##  ----------------
##  Accuracy:              0.8066
##  Error Rate:            0.1934
##  Accuracy + Error Rate: 1.0000 (should equal 1.0)
## 
##  THRESHOLD-DEPENDENT METRICS:
##  ----------------
##  Precision:             0.8438
##  Sensitivity:           0.4737 (True Positive Rate)
##  Specificity:           0.9597 (True Negative Rate)
## 
##  HARMONIC MEAN METRIC:
##  ----------------
##  F1 Score:              0.6067
##  F1 Score range:        [0, 1] ✓
## 
##  ROC CURVE AND AUC:
##  ----------------
##  Area Under Curve (AUC): 0.8489
##  AUC = 0.5  (Random classifier)
##  AUC = 1.0  (Perfect classifier)
##  Current AUC = 0.8489
## 
##  ================================================================================
summary_df <- data.frame(
  Metric = c("Accuracy", "Error Rate", "Precision", "Sensitivity", "Specificity", "F1 Score", "AUC"),
  Value = c(class_accuracy, class_error_rate$error_rate, class_precision, class_sensitivity, class_specificity, class_f1_score, roc_results$auc),
  stringsAsFactors = FALSE
)

print(summary_df)
##        Metric     Value
## 1    Accuracy 0.8066298
## 2  Error Rate 0.1933702
## 3   Precision 0.8437500
## 4 Sensitivity 0.4736842
## 5 Specificity 0.9596774
## 6    F1 Score 0.6067416
## 7         AUC 0.8488964

12. Classification Metrics via caret Package

Investigate the caret package. In particular, consider the functions confusionMatrix, sensitivity, and specificity. Apply the functions to the data set. How do the results compare with your own functions?

The results produced by the caret::confusionMatrix() function matched exactly with those from our custom-built metric functions. Specifically, caret reports an accuracy of 0.8066, sensitivity of 0.4737, specificity of 0.9597, and a positive predictive value (precision) of 0.8438, all identical to our results above.

This confirms the correctness and consistency of our custom functions. The caret package uses the same definitions for the confusion matrix-based metrics, validating our results. Additionally, caret reports further statistics (e.g., balanced accuracy, Kappa) that can support model evaluation.

output_data$class <- as.factor(output_data$class)
output_data$scored.class <- as.factor(output_data$scored.class)

confusionMatrix(output_data$scored.class, output_data$class, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 119  30
##          1   5  27
##                                           
##                Accuracy : 0.8066          
##                  95% CI : (0.7415, 0.8615)
##     No Information Rate : 0.6851          
##     P-Value [Acc > NIR] : 0.0001712       
##                                           
##                   Kappa : 0.4916          
##                                           
##  Mcnemar's Test P-Value : 4.976e-05       
##                                           
##             Sensitivity : 0.4737          
##             Specificity : 0.9597          
##          Pos Pred Value : 0.8438          
##          Neg Pred Value : 0.7987          
##              Prevalence : 0.3149          
##          Detection Rate : 0.1492          
##    Detection Prevalence : 0.1768          
##       Balanced Accuracy : 0.7167          
##                                           
##        'Positive' Class : 1               
## 
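
The prompt also mentions caret's sensitivity() and specificity() helpers. Called directly on the factor columns created above (a brief sketch; positive class "1" and negative class "0" are passed explicitly rather than relying on caret's level-based defaults), they should reproduce the same two values:

# caret's individual metric helpers (should match confusionMatrix() and our functions)
sensitivity(output_data$scored.class, output_data$class, positive = "1")
specificity(output_data$scored.class, output_data$class, negative = "0")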

13. ROC Curve via pROC Package

Investigate the pROC package. Use it to generate an ROC curve for the data set. How do the results compare with your own functions?

This AUC value of 0.8503 indicates that the model has a strong ability to distinguish between the two classes (0 = control, 1 = case). Specifically, an AUC of 0.8503 means that, on average, there is roughly an 85% chance that the model will assign a higher predicted probability to a randomly chosen positive case than to a randomly chosen negative case. Our custom AUC function produced a value of 0.8489, which is extremely close to pROC’s result of 0.8503. This minor difference is expected, since the manual method evaluates the curve at fixed 0.01 threshold steps and integrates with the trapezoidal rule, giving an approximation rather than an exact value.

roc <- roc(output_data$class, output_data$scored.probability)
plot(roc, main = "ROC Curve using pROC (AUC = 0.8503)")

auc(roc)
## Area under the curve: 0.8503
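
For a side-by-side comparison (a small sketch; auc() returns an object that coerces to numeric), the two AUC estimates can be printed together. The small gap reflects the fixed 0.01 threshold grid and trapezoidal approximation used in the manual calculation:

# Compare the manually computed AUC with pROC's estimate
c(manual = roc_results$auc, pROC = as.numeric(auc(roc)))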