In this homework assignment, you will work through various classification metrics. You will be asked to create functions in R to carry out the various calculations. You will also investigate some functions in packages that will let you obtain the equivalent results. Finally, you will create graphical output that also can be used to evaluate the output of classification models, such as binary logistic regression.
The data set has three key columns we will use:
class: the actual class for the observation
scored.class: the predicted class for the observation (based on a threshold of 0.5)
scored.probability: the predicted probability of success for the observation
#Import Libraries
library(readr) # to use the read_csv function
library(tidyverse)
library(e1071) # For skewness function
library(corrplot)
library(caret)
library(ROSE)
library(smotefamily)
library(dplyr)
library(MASS)
library(lmtest)
library(car)
library(GGally)
library(pROC)
output_data <- read.csv("https://raw.githubusercontent.com/gillianmcgovern0/cuny-data-621/refs/heads/main/classification-output-data.csv")
head(output_data)
## pregnant glucose diastolic skinfold insulin bmi pedigree age class
## 1 7 124 70 33 215 25.5 0.161 37 0
## 2 2 122 76 27 200 35.9 0.483 26 0
## 3 3 107 62 13 48 22.9 0.678 23 1
## 4 1 91 64 24 0 29.2 0.192 21 0
## 5 4 83 86 19 0 29.3 0.317 34 0
## 6 1 100 74 12 46 19.5 0.149 28 0
## scored.class scored.probability
## 1 0 0.32845226
## 2 0 0.27319044
## 3 0 0.10966039
## 4 0 0.05599835
## 5 0 0.10049072
## 6 0 0.05515460
Use the table() function to get the raw confusion matrix for this scored dataset. Make sure you understand the output. In particular, do the rows represent the actual or predicted class? The columns?
# Raw confusion matrix
raw_confusion_matrix <- table(output_data$scored.class, output_data$class)
names(dimnames(raw_confusion_matrix)) <- c("Prediction", "Actual")
print(raw_confusion_matrix)
## Actual
## Prediction 0 1
## 0 119 30
## 1 5 27
Each row represents a predicted class and each column represents an actual class. For this confusion matrix, there are 119 true negatives (predicted 0, actual 0), 30 false negatives (predicted 0, actual 1), 5 false positives (predicted 1, actual 0), and 27 true positives (predicted 1, actual 1). These counts can be verified by tallying the data frame directly, as shown below.
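This quick check is optional and not required by the assignment; the four cell counts are recomputed straight from the data and should reproduce the 119, 30, 5, and 27 in the table above.
# Count each confusion matrix cell directly from the data
# (rows of the table above are predictions, columns are actuals)
with(output_data, c(
TN = sum(scored.class == 0 & class == 0),
FN = sum(scored.class == 0 & class == 1),
FP = sum(scored.class == 1 & class == 0),
TP = sum(scored.class == 1 & class == 1)
))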
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the accuracy of the predictions.
\[ Accuracy = \frac{TP + TN}{TP + FP + TN + FN} \]
# Main function to calculate confusion matrix once
calculate_confusion_matrix <- function(df, actual_col = "class", predicted_col = "scored.class") {
cm <- table(Actual = df[[actual_col]], Predicted = df[[predicted_col]])
tn <- cm[1, 1]
fp <- cm[1, 2]
fn <- cm[2, 1]
tp <- cm[2, 2]
return(list(tn=tn,fp=fp,fn=fn,tp=tp))
}
# Function to calculate accuracy
calculate_accuracy <- function(df, actual_col = "class", predicted_col = "scored.class") {
cm <- calculate_confusion_matrix(df, actual_col, predicted_col)
total <- cm$tp + cm$tn + cm$fp + cm$fn
accuracy <- (cm$tp + cm$tn) / total
return(accuracy)
}
#print accuracy
class_accuracy <- calculate_accuracy(output_data)
print(paste("Accuracy:", class_accuracy))
## [1] "Accuracy: 0.806629834254144"
Accuracy score is 0.8066298.
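As an optional cross-check (not part of the required function), accuracy is simply the share of rows where the predicted class equals the actual class, which should reproduce the value above:
# Accuracy as the proportion of matching predictions (optional cross-check)
mean(output_data$class == output_data$scored.class)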
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the classification error rate of the predictions. Verify that you get an accuracy and an error rate that sum to one.
\[ Classification\ Error\ Rate = \frac{FP + FN}{TP + FP + TN + FN} \]
# Classification error rate
# % of total observations that the model inaccurately predicts
# Takes in the original dataset and returns classification error rate
calculate_error_rate <- function(df, actual_col = "class", predicted_col = "scored.class") {
cm <- calculate_confusion_matrix(df, actual_col, predicted_col)
total <- cm$tp + cm$tn + cm$fp + cm$fn
error_rate <- (cm$fp + cm$fn) / total
accuracy <- (cm$tp + cm$tn) / total
return(list(error_rate = error_rate,
accuracy = accuracy,
sum_check = error_rate + accuracy))
}
#print error rate
class_error_rate<- calculate_error_rate(output_data)
print(paste("Error Rate:", class_error_rate$error_rate))
## [1] "Error Rate: 0.193370165745856"
Classification Error Rate is 0.1933702.
Let's make sure Classification Error Rate + Accuracy = 1:
# Validate error rate and accuracy functions
# Sum should be 1
print(paste("Sum of accuracy and error rate:", round(class_error_rate$sum_check, 4)))
## [1] "Sum of accuracy and error rate: 1"
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the precision of the predictions.
\[ Precision = \frac{TP}{TP + FP} \]
# Precision
# Accuracy of the model's positive predictions
# Takes in the original dataset and returns precision
calculate_precision <- function(df, actual_col = "class", predicted_col = "scored.class") {
cm <- calculate_confusion_matrix(df, actual_col, predicted_col)
precision <- (cm$tp) / (cm$tp + cm$fp)
return(precision)
}
#print precision
class_precision<- calculate_precision(output_data)
print(paste("Precision:", class_precision))
## [1] "Precision: 0.84375"
Precision is 0.84375.
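As an optional cross-check, precision can also be computed by restricting the data to the predicted positives; this should agree with calculate_precision():
# Among predicted positives, the share that are actual positives
with(output_data, sum(scored.class == 1 & class == 1) / sum(scored.class == 1))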
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the sensitivity of the predictions. Sensitivity is also known as recall.
\[ Sensitivity = \frac{TP}{TP + FN} \]
# Sensitivity
# True positive rate
# Takes in the original dataset and returns sensitivity
calculate_sensitivity<- function(df, actual_col = "class", predicted_col = "scored.class") {
cm <- calculate_confusion_matrix(df, actual_col, predicted_col)
sensitivity <- (cm$tp) / (cm$tp + cm$fn)
return(sensitivity)
}
#print sensitivity
class_sensitivity<- calculate_sensitivity(output_data)
print(paste("Sensitivity:", class_sensitivity))
## [1] "Sensitivity: 0.473684210526316"
Sensitivity/Recall is 0.4736842.
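As an optional cross-check, sensitivity can be computed by restricting the data to the actual positives; this should agree with calculate_sensitivity():
# Among actual positives, the share predicted positive
with(output_data, sum(scored.class == 1 & class == 1) / sum(class == 1))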
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the specificity of the predictions.
\[ Specificity = \frac{TN}{TN + FP} \]
# Specificity
# True negative rate
# Takes in the original dataset and returns specificity
calculate_specificity<- function(df, actual_col = "class", predicted_col = "scored.class") {
cm <- calculate_confusion_matrix(df, actual_col, predicted_col)
specificity <- (cm$tn) / (cm$tn + cm$fp)
return(specificity)
}
#print specificity
class_specificity<- calculate_specificity(output_data)
print(paste("Specificity:", class_specificity))
## [1] "Specificity: 0.959677419354839"
Specificity is 0.9596774.
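As an optional cross-check, specificity can be computed by restricting the data to the actual negatives; this should agree with calculate_specificity():
# Among actual negatives, the share predicted negative
with(output_data, sum(scored.class == 0 & class == 0) / sum(class == 0))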
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the F1 score of the predictions.
\[ F1\ Score = \frac{2 \times Precision \times Sensitivity}{Precision + Sensitivity} \]
# F1 Score
# Takes in the original dataset and returns f1 score
calculate_f1_score<- function(df, actual_col = "class", predicted_col = "scored.class") {
# Use the passed-in data frame (df), not the global output_data, so the function generalizes
prec <- calculate_precision(df, actual_col, predicted_col)
sens <- calculate_sensitivity(df, actual_col, predicted_col)
f1_score <- (2 * prec * sens) / (prec + sens)
return(f1_score)
}
#print f1-score
class_f1_score<- calculate_f1_score(output_data)
print(paste("F1_score:", class_f1_score))
## [1] "F1_score: 0.606741573033708"
F1 Score is 0.6067416.
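Since the F1 score is the harmonic mean of precision and sensitivity, it can also be cross-checked (optionally) from the values already computed above:
# F1 as the harmonic mean of precision and sensitivity; should match calculate_f1_score
2 / (1 / class_precision + 1 / class_sensitivity)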
Before we move on, let's consider a question that was asked: What are the bounds on the F1 score? Show that the F1 score will always be between 0 and 1. (Hint: If 0 < a < 1 and 0 < b < 1 then ab < a.)
# Confirm a * b < a
calculate_precision(output_data) * calculate_sensitivity(output_data) # a * b
## [1] 0.3996711
calculate_precision(output_data) # a
## [1] 0.84375
calculate_sensitivity(output_data) # b
## [1] 0.4736842
The bounds for the F1 score are 0 and 1. Since precision (a) and sensitivity/recall (b) are both proportions strictly between 0 and 1, the hint gives ab < a and ab < b, so 2ab < a + b and therefore F1 = 2ab / (a + b) < 1; and because a and b are positive, F1 > 0. For example, with precision 0.84 and recall 0.47, their product is 0.40, which is less than either factor.
(2 * 1 * 1) / (1 + 1)
## [1] 1
(2 * 1 * 0) / (1 + 0)
## [1] 0
At the extremes, if precision and recall are both 1, the F1 score is 1; if either is 0, the F1 score is 0. So the F1 score is always bounded by 0 and 1, and is strictly between them whenever precision and recall are strictly between 0 and 1; a numeric sweep over that interior range is sketched below.
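As a small numeric sketch (not a formal proof), the F1 formula can be evaluated over a grid of precision and sensitivity values strictly inside (0, 1); every resulting value stays strictly between 0 and 1. The objects grid and f1_grid below are illustrative helper names, not part of the assignment.
# Sweep precision (p) and sensitivity (s) over a grid strictly inside (0, 1)
grid <- expand.grid(p = seq(0.01, 0.99, by = 0.01),
s = seq(0.01, 0.99, by = 0.01))
f1_grid <- with(grid, (2 * p * s) / (p + s))
range(f1_grid) # both endpoints fall strictly between 0 and 1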
Write a function that generates an ROC curve from a data set with a true classification column (class in our example) and a probability column (scored.probability in our example). Your function should return a list that includes the plot of the ROC curve and a vector that contains the calculated area under the curve (AUC). Note that I recommend using a sequence of thresholds ranging from 0 to 1 at 0.01 intervals.
# Function to generate ROC curve and calculate AUC
calculate_roc_curve <- function(df, actual_col = "class", probability_col = "scored.probability") {
# Create a sequence of thresholds from 0 to 1 at 0.01 intervals
thresholds <- seq(0, 1, by = 0.01)
# Initialize vectors to store TPR and FPR
tpr_values <- numeric(length(thresholds))
fpr_values <- numeric(length(thresholds))
# Get actual values
actual <- df[[actual_col]]
probabilities <- df[[probability_col]]
# Calculate TPR and FPR for each threshold
for (i in seq_along(thresholds)) {
threshold <- thresholds[i]
# Predicted class based on threshold
predicted <- as.integer(probabilities >= threshold)
# Create confusion matrix with explicit factor levels to ensure 2x2 structure
cm <- table(factor(actual, levels = c(0, 1)),
factor(predicted, levels = c(0, 1)))
# Extract values from 2x2 matrix
tn <- cm[1, 1]
fp <- cm[1, 2]
fn <- cm[2, 1]
tp <- cm[2, 2]
# Calculate True Positive Rate(sensitivity) and False Positive Rate(1 - Specificity)
tpr_values[i] <- ifelse(tp + fn == 0, 0, tp / (tp + fn))
fpr_values[i] <- ifelse(fp + tn == 0, 0, fp / (fp + tn))
}
# Calculate AUC using trapezoidal rule
auc <- 0
for (i in 2:length(fpr_values)) {
auc <- auc + (fpr_values[i-1] - fpr_values[i]) * (tpr_values[i-1] + tpr_values[i]) / 2
}
# Create ROC plot
plot(fpr_values, tpr_values,
type = "l",
xlim = c(0, 1),
ylim = c(0, 1),
xlab = "False Positive Rate (1 - Specificity)",
ylab = "True Positive Rate (Sensitivity)",
main = paste("ROC Curve (AUC =", round(auc, 4), ")"),
col = "blue",
lwd = 2)
# Add diagonal reference line
abline(0, 1, col = "red", lty = 2, lwd = 1.5)
# Capture the rendered plot (base plot() returns NULL, so record it to include in the result)
plot_roc <- recordPlot()
# Return list with plot and AUC
result <- list(
plot = plot_roc,
auc = auc,
fpr = fpr_values,
tpr = tpr_values,
thresholds = thresholds
)
return(result)
}
roc_results <- calculate_roc_curve(output_data)
print(paste("AUC:", round(roc_results$auc, 4)))
## [1] "AUC: 0.8489"
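As an optional cross-check on the trapezoidal estimate, the AUC can also be approximated through its probabilistic interpretation: the proportion of (positive, negative) pairs in which the positive case receives the higher predicted probability, with ties counted as one half. The pos and neg vectors below are illustrative helpers; the result should land close to the 0.8489 estimate above.
# AUC via pairwise comparison of positive vs. negative scored probabilities
pos <- output_data$scored.probability[output_data$class == 1]
neg <- output_data$scored.probability[output_data$class == 0]
mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))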
Use your created R functions and the provided classification output data set to produce all of the classification metrics discussed above.
#Call all created functions
cm <- calculate_confusion_matrix(output_data)
class_accuracy <- calculate_accuracy(output_data)
class_error_rate<- calculate_error_rate(output_data)
class_precision<- calculate_precision(output_data)
class_sensitivity<- calculate_sensitivity(output_data)
class_specificity<- calculate_specificity(output_data)
class_f1_score<- calculate_f1_score(output_data)
# Generate single consolidated metrics
output_text <- paste(
"================================================================================\n",
"COMPREHENSIVE CLASSIFICATION METRICS REPORT\n",
"================================================================================\n\n",
"CONFUSION MATRIX:\n",
sprintf(" Predicted 0 Predicted 1\n"),
sprintf("Actual 0 %12d %12d\n", cm$tn, cm$fp),
sprintf("Actual 1 %12d %12d\n\n", cm$fn, cm$tp),
"Confusion Matrix Components:\n",
sprintf("True Negatives (TN): %d\n", cm$tn),
sprintf("False Positives (FP): %d\n", cm$fp),
sprintf("False Negatives (FN): %d\n", cm$fn),
sprintf("True Positives (TP): %d\n\n", cm$tp),
"CLASSIFICATION METRICS:\n",
"----------------\n",
sprintf("Accuracy: %.4f\n", class_accuracy),
sprintf("Error Rate: %.4f\n",class_error_rate$error_rate),
sprintf("Accuracy + Error Rate: %.4f (should equal 1.0)\n\n", class_error_rate$sum_check),
"THRESHOLD-DEPENDENT METRICS:\n",
"----------------\n",
sprintf("Precision: %.4f\n",class_precision),
sprintf("Sensitivity: %.4f (True Positive Rate)\n", class_sensitivity),
sprintf("Specificity: %.4f (True Negative Rate)\n\n", class_specificity),
"HARMONIC MEAN METRIC:\n",
"----------------\n",
sprintf("F1 Score: %.4f\n", class_f1_score),
sprintf("F1 Score range: [0, 1] ✓\n\n"),
"ROC CURVE AND AUC:\n",
"----------------\n",
sprintf("Area Under Curve (AUC): %.4f\n", roc_results$auc),
sprintf("AUC = 0.5 (Random classifier)\n"),
sprintf("AUC = 1.0 (Perfect classifier)\n"),
sprintf("Current AUC = %.4f\n\n", roc_results$auc),
"================================================================================\n"
)
cat(output_text)
## ================================================================================
## COMPREHENSIVE CLASSIFICATION METRICS REPORT
## ================================================================================
##
## CONFUSION MATRIX:
## Predicted 0 Predicted 1
## Actual 0 119 5
## Actual 1 30 27
##
## Confusion Matrix Components:
## True Negatives (TN): 119
## False Positives (FP): 5
## False Negatives (FN): 30
## True Positives (TP): 27
##
## CLASSIFICATION METRICS:
## ----------------
## Accuracy: 0.8066
## Error Rate: 0.1934
## Accuracy + Error Rate: 1.0000 (should equal 1.0)
##
## THRESHOLD-DEPENDENT METRICS:
## ----------------
## Precision: 0.8438
## Sensitivity: 0.4737 (True Positive Rate)
## Specificity: 0.9597 (True Negative Rate)
##
## HARMONIC MEAN METRIC:
## ----------------
## F1 Score: 0.6067
## F1 Score range: [0, 1] ✓
##
## ROC CURVE AND AUC:
## ----------------
## Area Under Curve (AUC): 0.8489
## AUC = 0.5 (Random classifier)
## AUC = 1.0 (Perfect classifier)
## Current AUC = 0.8489
##
## ================================================================================
summary_df <- data.frame(
Metric = c("Accuracy", "Error Rate", "Precision", "Sensitivity", "Specificity", "F1 Score", "AUC"),
Value = c(class_accuracy, class_error_rate$error_rate, class_precision, class_sensitivity, class_specificity, class_f1_score, roc_results$auc),
stringsAsFactors = FALSE
)
print(summary_df)
## Metric Value
## 1 Accuracy 0.8066298
## 2 Error Rate 0.1933702
## 3 Precision 0.8437500
## 4 Sensitivity 0.4736842
## 5 Specificity 0.9596774
## 6 F1 Score 0.6067416
## 7 AUC 0.8488964
Investigate the caret package. In particular, consider the functions confusionMatrix, sensitivity, and specificity. Apply the functions to the data set. How do the results compare with your own functions?
The results produced by the caret::confusionMatrix() function match those from our custom-built metric functions exactly: accuracy (0.8066), sensitivity (0.4737), specificity (0.9597), and positive predictive value, i.e. precision (0.8438), are identical to the values computed above.
This confirms the correctness and consistency of our custom functions, since caret uses the same confusion-matrix-based definitions. Additionally, caret reports further statistics (e.g., balanced accuracy, Kappa, McNemar's test) that can support model evaluation; its standalone sensitivity() and specificity() helpers are demonstrated after the confusionMatrix() output below.
output_data$class <- as.factor(output_data$class)
output_data$scored.class <- as.factor(output_data$scored.class)
confusionMatrix(output_data$scored.class, output_data$class, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 119 30
## 1 5 27
##
## Accuracy : 0.8066
## 95% CI : (0.7415, 0.8615)
## No Information Rate : 0.6851
## P-Value [Acc > NIR] : 0.0001712
##
## Kappa : 0.4916
##
## Mcnemar's Test P-Value : 4.976e-05
##
## Sensitivity : 0.4737
## Specificity : 0.9597
## Pos Pred Value : 0.8438
## Neg Pred Value : 0.7987
## Prevalence : 0.3149
## Detection Rate : 0.1492
## Detection Prevalence : 0.1768
## Balanced Accuracy : 0.7167
##
## 'Positive' Class : 1
##
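The assignment also points to caret's standalone sensitivity() and specificity() helpers. As a short sketch (using the factor columns created above, with "1" treated as the positive class and "0" as the negative class), they should reproduce the values reported by confusionMatrix() and by our custom functions:
# caret's standalone helpers; these should match calculate_sensitivity() and calculate_specificity()
sensitivity(output_data$scored.class, output_data$class, positive = "1")
specificity(output_data$scored.class, output_data$class, negative = "0")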
Investigate the pROC package. Use it to generate an ROC curve for the data set. How do the results compare with your own functions?
This AUC value of 0.8503 indicates that the model has a strong ability to distinguish between the two classes (0 = control, 1 = case). Specifically, an AUC of 0.8503 means that, on average, there is roughly an 85% chance that the model will assign a higher predicted probability to a randomly chosen positive case than to a randomly chosen negative case. Our custom AUC function produced a value of 0.8489, which is very close to pROC's result of 0.8503. This minor difference is expected: the manual method uses fixed threshold steps and a basic trapezoidal rule, so it yields only an approximation. An overlay of the two curves is sketched after the pROC output below.
roc <- roc(output_data$class, output_data$scored.probability)
plot(roc, main = "ROC Curve using pROC (AUC= 0.8503)")
auc(roc)
## Area under the curve: 0.8503
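As a final optional visual comparison, the manually computed ROC points can be overlaid on the pROC curve; with legacy.axes = TRUE, pROC puts 1 - specificity on the x-axis so both curves share the same orientation:
# Overlay the manually computed ROC points on the pROC curve
plot(roc, legacy.axes = TRUE, main = "pROC curve with manual ROC points overlaid")
lines(roc_results$fpr, roc_results$tpr, col = "blue", lty = 2, lwd = 2)
legend("bottomright", legend = c("pROC", "Manual (0.01 threshold grid)"),
col = c("black", "blue"), lty = c(1, 2), lwd = 2)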