In this homework assignment, you will work through various classification metrics. You will be asked to create functions in R to carry out the various calculations. You will also investigate some functions in packages that will let you obtain the equivalent results. Finally, you will create graphical output that also can be used to evaluate the output of classification models, such as binary logistic regression.
# Import Libraries
library(readr) # to use the read_csv function
library(tidyverse)
library(e1071) # For skewness function
library(corrplot)
library(caret)
library(ROSE)
library(smotefamily)
library(dplyr)
library(MASS)
library(lmtest)
library(car)
library(GGally)
library(pROC)
output_data <- read.csv("https://raw.githubusercontent.com/gillianmcgovern0/cuny-data-621/refs/heads/main/classification-output-data.csv")
head(output_data)
## pregnant glucose diastolic skinfold insulin bmi pedigree age class
## 1 7 124 70 33 215 25.5 0.161 37 0
## 2 2 122 76 27 200 35.9 0.483 26 0
## 3 3 107 62 13 48 22.9 0.678 23 1
## 4 1 91 64 24 0 29.2 0.192 21 0
## 5 4 83 86 19 0 29.3 0.317 34 0
## 6 1 100 74 12 46 19.5 0.149 28 0
## scored.class scored.probability
## 1 0 0.32845226
## 2 0 0.27319044
## 3 0 0.10966039
## 4 0 0.05599835
## 5 0 0.10049072
## 6 0 0.05515460
Use the table() function to get the raw confusion matrix for this scored dataset. Make sure you understand the output. In particular, do the rows represent the actual or predicted class? The columns?
# Raw confusion matrix
raw_confusion_matrix <- table(output_data$scored.class, output_data$class)
names(dimnames(raw_confusion_matrix)) <- c("Prediction", "Actual")
print(raw_confusion_matrix)
## Actual
## Prediction 0 1
## 0 119 30
## 1 5 27
Each row represents the predicted class values, and each column represents the actual class values. For this confusion matrix, there are 119 true negatives (predicted 0, actual 0), 5 false positives (predicted 1, actual 0), 30 false negatives (predicted 0, actual 1), and 27 true positives (predicted 1, actual 1).
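As a quick sanity check, the four cells can also be pulled straight from the named table created above. This is a minimal sketch; the character indices rely on the 0/1 class labels in this particular data set.
# Rows are predictions, columns are actual values, so index as [prediction, actual]
TN <- raw_confusion_matrix["0", "0"] # predicted 0, actual 0
FN <- raw_confusion_matrix["0", "1"] # predicted 0, actual 1
FP <- raw_confusion_matrix["1", "0"] # predicted 1, actual 0
TP <- raw_confusion_matrix["1", "1"] # predicted 1, actual 1
c(TN = TN, FN = FN, FP = FP, TP = TP)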
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the accuracy of the predictions.
\[ \text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \]
# Returns TP, TN, FP, FN
calculate_confusion_matrix_values <- function(output_data) {
  raw_confusion_matrix <- table(output_data$scored.class, output_data$class)
  confusion_dataframe <- as.data.frame(raw_confusion_matrix)
  # as.data.frame() lists the cells in the order
  # (pred 0/actual 0, pred 1/actual 0, pred 0/actual 1, pred 1/actual 1) = TN, FP, FN, TP
  confusion_dataframe$confusion_label <- c("TN", "FP", "FN", "TP")
  TP <- as.integer(confusion_dataframe[4, "Freq"])
  TN <- as.integer(confusion_dataframe[1, "Freq"])
  FP <- as.integer(confusion_dataframe[2, "Freq"])
  FN <- as.integer(confusion_dataframe[3, "Freq"])
  return(list(TP = TP, TN = TN, FP = FP, FN = FN))
}
# Accuracy function
# % of total observations that the model accurately predicts
# Takes in the original dataset and returns accuracy score
calculate_accuracy <- function(output_data) {
  results <- calculate_confusion_matrix_values(output_data)
  accuracy <- (results$TP + results$TN) / (results$TP + results$FP + results$TN + results$FN)
  return(accuracy)
}
print("Accuracy score:")
## [1] "Accuracy score:"
accuracy <- calculate_accuracy(output_data)
print(accuracy)
## [1] 0.8066298
### Dhanya
# Main function to calculate confusion matrix once
calculate_confusion_matrix <- function(df, actual_col = "class", predicted_col = "scored.class") {
cm <- table(Actual = df[[actual_col]], Predicted = df[[predicted_col]])
tn <- cm[1, 1]
fp <- cm[1, 2]
fn <- cm[2, 1]
tp <- cm[2, 2]
return(list(tn=tn,fp=fp,fn=fn,tp=tp))
}
calculate_accuracy <- function(df, actual_col = "class", predicted_col = "scored.class") {
cm <- calculate_confusion_matrix(df, actual_col, predicted_col)
total <- cm$tp + cm$tn + cm$fp + cm$fn
accuracy <- (cm$tp + cm$tn) / total
return(accuracy)
}
class_accuracy <- calculate_accuracy(output_data)
print(paste("Accuracy:", class_accuracy))
## [1] "Accuracy: 0.806629834254144"
Accuracy score is 0.8066298.
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the classification error rate of the predictions. Verify that you get an accuracy and an error rate that sums to one.
\[ \text{Classification Error Rate} = \frac{FP + FN}{TP + FP + TN + FN} \]
# Classification error rate
# % of total observations that the model inaccurately predicts
# Takes in the original dataset and returns classification error rate
calculate_class_error_rate <- function(output_data) {
  results <- calculate_confusion_matrix_values(output_data)
  class_error_rate <- (results$FP + results$FN) / (results$TP + results$FP + results$TN + results$FN)
  return(class_error_rate)
}
print("Classification Error Rate:")
## [1] "Classification Error Rate:"
class_error_rate <- calculate_class_error_rate(output_data)
print(class_error_rate)
## [1] 0.1933702
## Dhanya
calculate_error_rate <- function(df, actual_col = "class", predicted_col = "scored.class") {
cm <- calculate_confusion_matrix(df, actual_col, predicted_col)
total <- cm$tp + cm$tn + cm$fp + cm$fn
error_rate <- (cm$fp + cm$fn) / total
accuracy <- (cm$tp + cm$tn) / total
return(list(error_rate = error_rate,
accuracy = accuracy,
sum_check = error_rate + accuracy))
}
class_error_rate <- calculate_error_rate(output_data)
print(paste("Error Rate:", class_error_rate$error_rate))
## [1] "Error Rate: 0.193370165745856"
Classification Error Rate is 0.1933702.
Let's make sure Classification Error Rate + Accuracy = 1:
# Validate error rate and accuracy functions
# Sum should be 1
print(calculate_accuracy(output_data) + calculate_class_error_rate(output_data))
## [1] 1
#Dhanya
print(paste("Sum of accuracy and error rate:", round(class_error_rate$sum_check, 4)))
## [1] "Sum of accuracy and error rate: 1"
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the precision of the predictions.
\[ \text{Precision} = \frac{TP}{TP + FP} \]
# Precision
# Accuracy of the model's positive predictions
# Takes in the original dataset and returns precision
calculate_precision <- function(output_data) {
  results <- calculate_confusion_matrix_values(output_data)
  precision <- (results$TP) / (results$TP + results$FP)
  return(precision)
}
print("Precision:")
## [1] "Precision:"
precision <- calculate_precision(output_data)
print(precision)
## [1] 0.84375
#Dhanya
calculate_precision <- function(df, actual_col = "class", predicted_col = "scored.class") {
cm <- calculate_confusion_matrix(df, actual_col, predicted_col)
precision <- (cm$tp) / (cm$tp + cm$fp)
return(precision)
}
class_precision <- calculate_precision(output_data)
print(paste("Precision:", class_precision))
## [1] "Precision: 0.84375"
Precision is 0.84375.
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the sensitivity of the predictions. Sensitivity is also known as recall.
\[ \text{Sensitivity} = \frac{TP}{TP + FN} \]
# Sensitivity
# True positive rate
# Takes in the original dataset and returns sensitivity
calculate_sensitivity <- function(output_data) {
  results <- calculate_confusion_matrix_values(output_data)
  sensitivity <- (results$TP) / (results$TP + results$FN)
  return(sensitivity)
}
print("Sensitivity:")
## [1] "Sensitivity:"
sensitivity <- calculate_sensitivity(output_data)
print(sensitivity)
## [1] 0.4736842
#Dhanya
calculate_sensitivity <- function(df, actual_col = "class", predicted_col = "scored.class") {
cm <- calculate_confusion_matrix(df, actual_col, predicted_col)
sensitivity <- (cm$tp) / (cm$tp + cm$fn)
return(sensitivity)
}
class_sensitivity <- calculate_sensitivity(output_data)
print(paste("Sensitivity:", class_sensitivity))
## [1] "Sensitivity: 0.473684210526316"
Sensitivity/Recall is 0.4736842.
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the specificity of the predictions.
\[ \text{Specificity} = \frac{TN}{TN + FP} \]
# Specificity
# True negative rate
# Takes in the original dataset and returns specificity
calculate_specificity <- function(output_data) {
  results <- calculate_confusion_matrix_values(output_data)
  specificity <- (results$TN) / (results$TN + results$FP)
  return(specificity)
}
print("Specificity:")
## [1] "Specificity:"
specificity <- calculate_specificity(output_data)
print(specificity)
## [1] 0.9596774
#Dhanya
calculate_specificity <- function(df, actual_col = "class", predicted_col = "scored.class") {
cm <- calculate_confusion_matrix(df, actual_col, predicted_col)
specificity <- (cm$tn) / (cm$tn + cm$fp)
return(specificity)
}
class_specificity <- calculate_specificity(output_data)
print(paste("Specificity:", class_specificity))
## [1] "Specificity: 0.959677419354839"
Specificity is 0.9596774.
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the F1 score of the predictions.
\[ \text{F1 Score} = \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} \]
# F1 Score
# Takes in the original dataset and returns f1 score
calculate_f1_score <- function(output_data) {
prec <- calculate_precision(output_data)
sens <- calculate_sensitivity(output_data)
f1_score <- (2 * prec * sens) / (prec + sens)
return(f1_score)
}
print("F1 Score:")
## [1] "F1 Score:"
f1score <- calculate_f1_score(output_data)
print(f1score)
## [1] 0.6067416
#Dhanya
calculate_f1_score <- function(df, actual_col = "class", predicted_col = "scored.class") {
  # Use the passed-in data frame (not the global output_data) for both components
  prec <- calculate_precision(df, actual_col, predicted_col)
  sens <- calculate_sensitivity(df, actual_col, predicted_col)
  f1_score <- (2 * prec * sens) / (prec + sens)
  return(f1_score)
}
class_f1_score <- calculate_f1_score(output_data)
print(paste("F1_score:", class_f1_score))
## [1] "F1_score: 0.606741573033708"
F1 Score is 0.6067416.
Before we move on, let's consider a question that was asked: What are the bounds on the F1 score? Show that the F1 score will always be between 0 and 1. (Hint: If 0 < a < 1 and 0 < b < 1 then ab < a.)
# Confirm a * b < a
calculate_precision(output_data) * calculate_sensitivity(output_data) # a * b
## [1] 0.3996711
calculate_precision(output_data) # a
## [1] 0.84375
calculate_sensitivity(output_data) # b
## [1] 0.4736842
The F1 score is bounded between 0 and 1. For example, if recall, which is a proportion, is 0.47 and precision, another proportion, is 0.84, then recall * precision is 0.40, which is less than 0.84.
(2 * 1 * 1) / (1 + 1)
## [1] 1
(2 * 1 * 0) / (1 + 0)
## [1] 0
If recall and precision are both 1 (100%), the F1 score is 1; if one of the scores is 0, the F1 score is 0. More generally, because precision and sensitivity are both proportions between 0 and 1, their product is no larger than either one, so 2 × precision × sensitivity never exceeds precision + sensitivity and the F1 score always stays within the bounds of 0 and 1.
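Written out symbolically, with p standing for precision and s for sensitivity (notation introduced here only for this derivation), the bound follows directly from the hint:
\[
0 < p \le 1,\ 0 < s \le 1 \;\Rightarrow\; ps \le p \ \text{and}\ ps \le s \;\Rightarrow\; 2ps \le p + s \;\Rightarrow\; 0 < \frac{2ps}{p + s} \le 1
\]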
Write a function that generates an ROC curve from a data set with a true classification column (class in our example) and a probability column (scored.probability in our example). Your function should return a list that includes the plot of the ROC curve and a vector that contains the calculated area under the curve (AUC). Note that I recommend using a sequence of thresholds ranging from 0 to 1 at 0.01 intervals.
# Function to generate ROC curve and calculate AUC
calculate_roc_curve <- function(df, actual_col = "class", probability_col = "scored.probability") {
# Create a sequence of thresholds from 0 to 1 at 0.01 intervals
thresholds <- seq(0, 1, by = 0.01)
# Initialize vectors to store TPR and FPR
tpr_values <- numeric(length(thresholds))
fpr_values <- numeric(length(thresholds))
# Get actual values
actual <- df[[actual_col]]
probabilities <- df[[probability_col]]
# Calculate TPR and FPR for each threshold
for (i in seq_along(thresholds)) {
threshold <- thresholds[i]
# Predicted class based on threshold
predicted <- as.integer(probabilities >= threshold)
# Create confusion matrix with explicit factor levels to ensure 2x2 structure
cm <- table(factor(actual, levels = c(0, 1)),
factor(predicted, levels = c(0, 1)))
# Extract values from 2x2 matrix
tn <- cm[1, 1]
fp <- cm[1, 2]
fn <- cm[2, 1]
tp <- cm[2, 2]
# Calculate True Positive Rate(sensitivity) and False Positive Rate(1 - Specificity)
tpr_values[i] <- ifelse(tp + fn == 0, 0, tp / (tp + fn))
fpr_values[i] <- ifelse(fp + tn == 0, 0, fp / (fp + tn))
}
# Calculate AUC using trapezoidal rule
auc <- 0
for (i in 2:length(fpr_values)) {
auc <- auc + (fpr_values[i-1] - fpr_values[i]) * (tpr_values[i-1] + tpr_values[i]) / 2
}
  # Create ROC plot; base plot() draws to the device and returns NULL,
  # so record the finished plot so it can be returned in the list
  plot(fpr_values, tpr_values,
       type = "l",
       xlim = c(0, 1),
       ylim = c(0, 1),
       xlab = "False Positive Rate (1 - Specificity)",
       ylab = "True Positive Rate (Sensitivity)",
       main = paste("ROC Curve (AUC =", round(auc, 4), ")"),
       col = "blue",
       lwd = 2)
  # Add diagonal reference line
  abline(0, 1, col = "red", lty = 2, lwd = 1.5)
  plot_roc <- recordPlot()
# Return list with plot and AUC
result <- list(
plot = plot_roc,
auc = auc,
fpr = fpr_values,
tpr = tpr_values,
thresholds = thresholds
)
return(result)
}
roc_results <- calculate_roc_curve(output_data)
print(paste("AUC:", round(roc_results$auc, 4)))
## [1] "AUC: 0.8489"
Use your created R functions and the provided classification output data set to produce all of the classification metrics discussed above.
# Call all of the created functions
cm <- calculate_confusion_matrix(output_data)
class_accuracy <- calculate_accuracy(output_data)
class_error_rate <- calculate_error_rate(output_data)
class_precision <- calculate_precision(output_data)
class_sensitivity <- calculate_sensitivity(output_data)
class_specificity <- calculate_specificity(output_data)
class_f1_score <- calculate_f1_score(output_data)
roc_results <- calculate_roc_curve(output_data)
# Generate single consolidated metrics
output_text <- paste(
"================================================================================\n",
"COMPREHENSIVE CLASSIFICATION METRICS REPORT\n",
"================================================================================\n\n",
"CONFUSION MATRIX:\n",
sprintf(" Predicted 0 Predicted 1\n"),
sprintf("Actual 0 %12d %12d\n", cm$tn, cm$fp),
sprintf("Actual 1 %12d %12d\n\n", cm$fn, cm$tp),
"Confusion Matrix Components:\n",
sprintf("True Negatives (TN): %d\n", cm$tn),
sprintf("False Positives (FP): %d\n", cm$fp),
sprintf("False Negatives (FN): %d\n", cm$fn),
sprintf("True Positives (TP): %d\n\n", cm$tp),
"CLASSIFICATION METRICS:\n",
"----------------\n",
sprintf("Accuracy: %.4f\n", class_accuracy),
sprintf("Error Rate: %.4f\n",class_error_rate$error_rate),
sprintf("Accuracy + Error Rate: %.4f (should equal 1.0)\n\n", class_error_rate$sum_check),
"THRESHOLD-DEPENDENT METRICS:\n",
"----------------\n",
sprintf("Precision: %.4f\n", class_precision),
sprintf("Sensitivity: %.4f (True Positive Rate)\n", class_sensitivity),
sprintf("Specificity: %.4f (True Negative Rate)\n\n", class_specificity),
"HARMONIC MEAN METRIC:\n",
"----------------\n",
sprintf("F1 Score: %.4f\n", class_f1_score),
sprintf("F1 Score range: [0, 1] ✓\n\n"),
"ROC CURVE AND AUC:\n",
"----------------\n",
sprintf("Area Under Curve (AUC): %.4f\n", roc_results$auc),
sprintf("AUC = 0.5 (Random classifier)\n"),
sprintf("AUC = 1.0 (Perfect classifier)\n"),
sprintf("Current AUC = %.4f\n\n", roc_results$auc),
"================================================================================\n"
)
cat(output_text)
## ================================================================================
## COMPREHENSIVE CLASSIFICATION METRICS REPORT
## ================================================================================
##
## CONFUSION MATRIX:
## Predicted 0 Predicted 1
## Actual 0 119 5
## Actual 1 30 27
##
## Confusion Matrix Components:
## True Negatives (TN): 119
## False Positives (FP): 5
## False Negatives (FN): 30
## True Positives (TP): 27
##
## CLASSIFICATION METRICS:
## ----------------
## Accuracy: 0.8066
## Error Rate: 0.1934
## Accuracy + Error Rate: 1.0000 (should equal 1.0)
##
## THRESHOLD-DEPENDENT METRICS:
## ----------------
## Precision: 0.8438
## Sensitivity: 0.4737 (True Positive Rate)
## Specificity: 0.9597 (True Negative Rate)
##
## HARMONIC MEAN METRIC:
## ----------------
## F1 Score: 0.6067
## F1 Score range: [0, 1] ✓
##
## ROC CURVE AND AUC:
## ----------------
## Area Under Curve (AUC): 0.8489
## AUC = 0.5 (Random classifier)
## AUC = 1.0 (Perfect classifier)
## Current AUC = 0.8489
##
## ================================================================================
summary_df <- data.frame(
Metric = c("Accuracy", "Error Rate", "Precision", "Sensitivity", "Specificity", "F1 Score", "AUC"),
Value = c(class_accuracy, class_error_rate$error_rate, class_precision, class_sensitivity, class_specificity, class_f1_score, roc_results$auc),
stringsAsFactors = FALSE
)
print(summary_df)
## Metric Value
## 1 Accuracy 0.8066298
## 2 Error Rate 0.1933702
## 3 Precision 0.8437500
## 4 Sensitivity 0.4736842
## 5 Specificity 0.9596774
## 6 F1 Score 0.6067416
## 7 AUC 0.8488964
Investigate the caret package. In particular, consider the functions confusionMatrix, sensitivity, and specificity. Apply the functions to the data set. How do the results compare with your own functions?
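One way these caret functions might be applied here is sketched below. The explicit factor conversion and the positive = "1" / negative = "0" arguments are assumptions about how the 0/1 labels should be treated; without them, caret would treat "0" as the positive class by default.
# Convert the 0/1 columns to factors with matching levels
actual_factor <- factor(output_data$class, levels = c(0, 1))
predicted_factor <- factor(output_data$scored.class, levels = c(0, 1))
# Full confusion matrix report (accuracy, sensitivity, specificity, etc.)
caret_cm <- caret::confusionMatrix(data = predicted_factor,
                                   reference = actual_factor,
                                   positive = "1")
print(caret_cm)
# Individual metrics for a direct comparison with the functions above
caret::sensitivity(predicted_factor, actual_factor, positive = "1")
caret::specificity(predicted_factor, actual_factor, negative = "0")
With the positive class set to "1", the Sensitivity and Specificity reported by caret should line up with the hand-written results above (about 0.4737 and 0.9597).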
Investigate the pROC package. Use it to generate an ROC curve for the data set. How do the results compare with your own functions?
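A minimal sketch using pROC (already loaded in the setup chunk), assuming class is passed as the response and scored.probability as the predictor:
# Build the ROC object from the true classes and predicted probabilities
roc_obj <- pROC::roc(response = output_data$class,
                     predictor = output_data$scored.probability)
# Plot the curve and report the area under it
plot(roc_obj, main = "ROC Curve (pROC)")
pROC::auc(roc_obj)
Because pROC evaluates every observed probability as a threshold rather than a fixed 0.01 grid, its AUC may differ slightly from the trapezoidal estimate of about 0.8489 computed by the hand-written function.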