In this homework assignment, you will work through various classification metrics. You will be asked to create functions in R to carry out the various calculations. You will also investigate some functions in packages that will let you obtain the equivalent results. Finally, you will create graphical output that also can be used to evaluate the output of classification models, such as binary logistic regression.
# Import Libraries
library(readr) # to use the read_csv function
library(tidyverse)
library(e1071) # For skewness function
library(corrplot)
library(caret)
library(ROSE)
library(smotefamily)
library(dplyr)
library(MASS)
library(lmtest)
library(car)
library(GGally)
library(pROC)
output_data <- read.csv("https://raw.githubusercontent.com/gillianmcgovern0/cuny-data-621/refs/heads/main/classification-output-data.csv")
head(output_data)
## pregnant glucose diastolic skinfold insulin bmi pedigree age class
## 1 7 124 70 33 215 25.5 0.161 37 0
## 2 2 122 76 27 200 35.9 0.483 26 0
## 3 3 107 62 13 48 22.9 0.678 23 1
## 4 1 91 64 24 0 29.2 0.192 21 0
## 5 4 83 86 19 0 29.3 0.317 34 0
## 6 1 100 74 12 46 19.5 0.149 28 0
## scored.class scored.probability
## 1 0 0.32845226
## 2 0 0.27319044
## 3 0 0.10966039
## 4 0 0.05599835
## 5 0 0.10049072
## 6 0 0.05515460
Use the table() function to get the raw confusion matrix for this scored dataset. Make sure you understand the output. In particular, do the rows represent the actual or predicted class? The columns?
# Raw confusion matrix
raw_confusion_matrix <- table(output_data$scored.class, output_data$class)
names(dimnames(raw_confusion_matrix)) <- c("Prediction", "Actual")
print(raw_confusion_matrix)
## Actual
## Prediction 0 1
## 0 119 30
## 1 5 27
Each row represents the predicted class values, and each column represents the actual class values. For this confusion matrix, there are 119 true negatives (predicted 0, actual 0), 5 false positives (predicted 1, actual 0), 30 false negatives (predicted 0, actual 1), and 27 true positives (predicted 1, actual 1).
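As a quick sanity check, the four cells can also be pulled straight from the named table created above. This is a minimal sketch; the character indices rely on the 0/1 class labels in this particular data set.
# Rows are predictions, columns are actual values, so index as [prediction, actual]
TN <- raw_confusion_matrix["0", "0"] # predicted 0, actual 0
FN <- raw_confusion_matrix["0", "1"] # predicted 0, actual 1
FP <- raw_confusion_matrix["1", "0"] # predicted 1, actual 0
TP <- raw_confusion_matrix["1", "1"] # predicted 1, actual 1
c(TN = TN, FN = FN, FP = FP, TP = TP)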
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the accuracy of the predictions.
\[ \text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \]
# Returns TP, TN, FP, FN
calculate_confusion_matrix_values <- function(output_data) {
  raw_confusion_matrix <- table(output_data$scored.class, output_data$class)
  confusion_dataframe <- as.data.frame(raw_confusion_matrix)
  # as.data.frame() lists the cells in the order
  # (pred 0/actual 0, pred 1/actual 0, pred 0/actual 1, pred 1/actual 1) = TN, FP, FN, TP
  confusion_dataframe$confusion_label <- c("TN", "FP", "FN", "TP")
  TP <- as.integer(confusion_dataframe[4, "Freq"])
  TN <- as.integer(confusion_dataframe[1, "Freq"])
  FP <- as.integer(confusion_dataframe[2, "Freq"])
  FN <- as.integer(confusion_dataframe[3, "Freq"])
  return(list(TP = TP, TN = TN, FP = FP, FN = FN))
}
# Accuracy function
# % of total observations that the model accurately predicts
# Takes in the original dataset and returns accuracy score
calculate_accuracy <- function(output_data) {
  results <- calculate_confusion_matrix_values(output_data)
  accuracy <- (results$TP + results$TN) / (results$TP + results$FP + results$TN + results$FN)
  return(accuracy)
}
print("Accuracy score:")
## [1] "Accuracy score:"
accuracy <- calculate_accuracy(output_data)
print(accuracy)
## [1] 0.8066298
### Dhanya
# Main function to calculate confusion matrix once
calculate_confusion_matrix <- function(df, actual_col = "class", predicted_col = "scored.class") {
cm <- table(Actual = df[[actual_col]], Predicted = df[[predicted_col]])
tn <- cm[1, 1]
fp <- cm[1, 2]
fn <- cm[2, 1]
tp <- cm[2, 2]
return(list(tn=tn,fp=fp,fn=fn,tp=tp))
}
calculate_accuracy <- function(df, actual_col = "class", predicted_col = "scored.class") {
cm <- calculate_confusion_matrix(df, actual_col, predicted_col)
total <- cm$tp + cm$tn + cm$fp + cm$fn
accuracy <- (cm$tp + cm$tn) / total
return(accuracy)
}
class_accuracy <- calculate_accuracy(output_data)
print(paste("Accuracy:", class_accuracy))
## [1] "Accuracy: 0.806629834254144"
Accuracy score is 0.8066298.
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the classification error rate of the predictions. Verify that you get an accuracy and an error rate that sums to one.
\[ \text{Classification Error Rate} = \frac{FP + FN}{TP + FP + TN + FN} \]
# Classification error rate
# % of total observations that the model inaccurately predicts
# Takes in the original dataset and returns classification error rate
calculate_class_error_rate <- function(output_data) {
  results <- calculate_confusion_matrix_values(output_data)
  class_error_rate <- (results$FP + results$FN) / (results$TP + results$FP + results$TN + results$FN)
  return(class_error_rate)
}
print("Classification Error Rate:")
## [1] "Classification Error Rate:"
class_error_rate <- calculate_class_error_rate(output_data)
print(class_error_rate)
## [1] 0.1933702
## Dhanya
calculate_error_rate <- function(df, actual_col = "class", predicted_col = "scored.class") {
cm <- calculate_confusion_matrix(df, actual_col, predicted_col)
total <- cm$tp + cm$tn + cm$fp + cm$fn
error_rate <- (cm$fp + cm$fn) / total
accuracy <- (cm$tp + cm$tn) / total
return(list(error_rate = error_rate,
accuracy = accuracy,
sum_check = error_rate + accuracy))
}
class_error_rate <- calculate_error_rate(output_data)
print(paste("Error Rate:", class_error_rate$error_rate))
## [1] "Error Rate: 0.193370165745856"
Classification Error Rate is 0.1933702.
Let's make sure Classification Error Rate + Accuracy = 1:
# Validate error rate and accuracy functions
# Sum should be 1
print(calculate_accuracy(output_data) + calculate_class_error_rate(output_data))
## [1] 1
#Dhanya
print(paste("Sum of accuracy and error rate:", round(class_error_rate$sum_check, 4)))
## [1] "Sum of accuracy and error rate: 1"
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the precision of the predictions.
\[ \text{Precision} = \frac{TP}{TP + FP} \]
# Precision
# Accuracy of the model's positive predictions
# Takes in the original dataset and returns precision
calculate_precision <- function(output_data) {
  results <- calculate_confusion_matrix_values(output_data)
  precision <- (results$TP) / (results$TP + results$FP)
  return(precision)
}
print("Precision:")
## [1] "Precision:"
precision <- calculate_precision(output_data)
print(precision)
## [1] 0.84375
#Dhanya
calculate_precision <- function(df, actual_col = "class", predicted_col = "scored.class") {
cm <- calculate_confusion_matrix(df, actual_col, predicted_col)
precision <- (cm$tp) / (cm$tp + cm$fp)
return(precision)
}
class_precision <- calculate_precision(output_data)
print(paste("Precision:", class_precision))
## [1] "Precision: 0.84375"
Precision is 0.84375.
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the sensitivity of the predictions. Sensitivity is also known as recall.
\[ \text{Sensitivity} = \frac{TP}{TP + FN} \]
# Sensitivity
# True positive rate
# Takes in the original dataset and returns sensitivity
calculate_sensitivity <- function(output_data) {
  results <- calculate_confusion_matrix_values(output_data)
  sensitivity <- (results$TP) / (results$TP + results$FN)
  return(sensitivity)
}
print("Sensitivity:")
## [1] "Sensitivity:"
sensitivity <- calculate_sensitivity(output_data)
print(sensitivity)
## [1] 0.4736842
#Dhanya
calculate_sensitivity <- function(df, actual_col = "class", predicted_col = "scored.class") {
cm <- calculate_confusion_matrix(df, actual_col, predicted_col)
sensitivity <- (cm$tp) / (cm$tp + cm$fn)
return(sensitivity)
}
class_sensitivity <- calculate_sensitivity(output_data)
print(paste("Sensitivity:", class_sensitivity))
## [1] "Sensitivity: 0.473684210526316"
Sensitivity/Recall is 0.4736842.
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the specificity of the predictions.
\[ \text{Specificity} = \frac{TN}{TN + FP} \]
# Specificity
# True negative rate
# Takes in the original dataset and returns specificity
calculate_specificity <- function(output_data) {
  results <- calculate_confusion_matrix_values(output_data)
  specificity <- (results$TN) / (results$TN + results$FP)
  return(specificity)
}
print("Specificity:")
## [1] "Specificity:"
specificity <- calculate_specificity(output_data)
print(specificity)
## [1] 0.9596774
#Dhanya
calculate_specificity <- function(df, actual_col = "class", predicted_col = "scored.class") {
cm <- calculate_confusion_matrix(df, actual_col, predicted_col)
specificity <- (cm$tn) / (cm$tn + cm$fp)
return(specificity)
}
class_specificity <- calculate_specificity(output_data)
print(paste("Specificity:", class_specificity))
## [1] "Specificity: 0.959677419354839"
Specificity is 0.9596774.
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the F1 score of the predictions.
\[ \text{F1 Score} = \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} \]
# F1 Score
# Takes in the original dataset and returns f1 score
calculate_f1_score <- function(output_data) {
prec <- calculate_precision(output_data)
sens <- calculate_sensitivity(output_data)
f1_score <- (2 * prec * sens) / (prec + sens)
return(f1_score)
}
print("F1 Score:")
## [1] "F1 Score:"
f1score <- calculate_f1_score(output_data)
print(f1score)
## [1] 0.6067416
#Dhanya
calculate_f1_score <- function(df, actual_col = "class", predicted_col = "scored.class") {
  # Use the passed-in data frame (not the global output_data) for both components
  prec <- calculate_precision(df, actual_col, predicted_col)
  sens <- calculate_sensitivity(df, actual_col, predicted_col)
  f1_score <- (2 * prec * sens) / (prec + sens)
  return(f1_score)
}
class_f1_score <- calculate_f1_score(output_data)
print(paste("F1_score:", class_f1_score))
## [1] "F1_score: 0.606741573033708"
F1 Score is 0.6067416.
Before we move on, let's consider a question that was asked: What are the bounds on the F1 score? Show that the F1 score will always be between 0 and 1. (Hint: If 0 < a < 1 and 0 < b < 1 then ab < a.)
# Confirm a * b < a
calculate_precision(output_data) * calculate_sensitivity(output_data) # a * b
## [1] 0.3996711
calculate_precision(output_data) # a
## [1] 0.84375
calculate_sensitivity(output_data) # b
## [1] 0.4736842
The F1 score is bounded between 0 and 1. For example, if recall, which is a proportion, is 0.47 and precision, another proportion, is 0.84, then recall * precision is 0.40, which is less than 0.84.
(2 * 1 * 1) / (1 + 1)
## [1] 1
(2 * 1 * 0) / (1 + 0)
## [1] 0
If recall and precision are both 1 (100%), the F1 score is 1; if one of the scores is 0, the F1 score is 0. More generally, because precision and sensitivity are both proportions between 0 and 1, their product is no larger than either one, so 2 × precision × sensitivity never exceeds precision + sensitivity and the F1 score always stays within the bounds of 0 and 1.
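Written out symbolically, with p standing for precision and s for sensitivity (notation introduced here only for this derivation), the bound follows directly from the hint:
\[
0 < p \le 1,\ 0 < s \le 1 \;\Rightarrow\; ps \le p \ \text{and}\ ps \le s \;\Rightarrow\; 2ps \le p + s \;\Rightarrow\; 0 < \frac{2ps}{p + s} \le 1
\]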
Write a function that generates an ROC curve from a data set with a true classification column (class in our example) and a probability column (scored.probability in our example). Your function should return a list that includes the plot of the ROC curve and a vector that contains the calculated area under the curve (AUC). Note that I recommend using a sequence of thresholds ranging from 0 to 1 at 0.01 intervals.
# Function to generate ROC curve and calculate AUC
calculate_roc_curve <- function(df, actual_col = "class", probability_col = "scored.probability") {
# Create a sequence of thresholds from 0 to 1 at 0.01 intervals
thresholds <- seq(0, 1, by = 0.01)
# Initialize vectors to store TPR and FPR
tpr_values <- numeric(length(thresholds))
fpr_values <- numeric(length(thresholds))
# Get actual values
actual <- df[[actual_col]]
probabilities <- df[[probability_col]]
# Calculate TPR and FPR for each threshold
for (i in seq_along(thresholds)) {
threshold <- thresholds[i]
# Predicted class based on threshold
predicted <- as.integer(probabilities >= threshold)
# Create confusion matrix with explicit factor levels to ensure 2x2 structure
cm <- table(factor(actual, levels = c(0, 1)),
factor(predicted, levels = c(0, 1)))
# Extract values from 2x2 matrix
tn <- cm[1, 1]
fp <- cm[1, 2]
fn <- cm[2, 1]
tp <- cm[2, 2]
# Calculate True Positive Rate(sensitivity) and False Positive Rate(1 - Specificity)
tpr_values[i] <- ifelse(tp + fn == 0, 0, tp / (tp + fn))
fpr_values[i] <- ifelse(fp + tn == 0, 0, fp / (fp + tn))
}
# Calculate AUC using trapezoidal rule
auc <- 0
for (i in 2:length(fpr_values)) {
auc <- auc + (fpr_values[i-1] - fpr_values[i]) * (tpr_values[i-1] + tpr_values[i]) / 2
}
  # Create ROC plot; base plot() draws to the device and returns NULL,
  # so record the finished plot so it can be returned in the list
  plot(fpr_values, tpr_values,
       type = "l",
       xlim = c(0, 1),
       ylim = c(0, 1),
       xlab = "False Positive Rate (1 - Specificity)",
       ylab = "True Positive Rate (Sensitivity)",
       main = paste("ROC Curve (AUC =", round(auc, 4), ")"),
       col = "blue",
       lwd = 2)
  # Add diagonal reference line
  abline(0, 1, col = "red", lty = 2, lwd = 1.5)
  plot_roc <- recordPlot()
# Return list with plot and AUC
result <- list(
plot = plot_roc,
auc = auc,
fpr = fpr_values,
tpr = tpr_values,
thresholds = thresholds
)
return(result)
}
roc_results <- calculate_roc_curve(output_data)
print(paste("AUC:", round(roc_results$auc, 4)))
## [1] "AUC: 0.8489"
Use your created R functions and the provided classification output data set to produce all of the classification metrics discussed above.
# Call all of the created functions
cm <- calculate_confusion_matrix(output_data)
class_accuracy <- calculate_accuracy(output_data)
class_error_rate <- calculate_error_rate(output_data)
class_precision <- calculate_precision(output_data)
class_sensitivity <- calculate_sensitivity(output_data)
class_specificity <- calculate_specificity(output_data)
class_f1_score <- calculate_f1_score(output_data)
roc_results <- calculate_roc_curve(output_data)
# Generate single consolidated metrics
output_text <- paste(
"================================================================================\n",
"COMPREHENSIVE CLASSIFICATION METRICS REPORT\n",
"================================================================================\n\n",
"CONFUSION MATRIX:\n",
sprintf(" Predicted 0 Predicted 1\n"),
sprintf("Actual 0 %12d %12d\n", cm$tn, cm$fp),
sprintf("Actual 1 %12d %12d\n\n", cm$fn, cm$tp),
"Confusion Matrix Components:\n",
sprintf("True Negatives (TN): %d\n", cm$tn),
sprintf("False Positives (FP): %d\n", cm$fp),
sprintf("False Negatives (FN): %d\n", cm$fn),
sprintf("True Positives (TP): %d\n\n", cm$tp),
"CLASSIFICATION METRICS:\n",
"----------------\n",
sprintf("Accuracy: %.4f\n", class_accuracy),
sprintf("Error Rate: %.4f\n",class_error_rate$error_rate),
sprintf("Accuracy + Error Rate: %.4f (should equal 1.0)\n\n", class_error_rate$sum_check),
"THRESHOLD-DEPENDENT METRICS:\n",
"----------------\n",
sprintf("Precision: %.4f\n", class_precision),
sprintf("Sensitivity: %.4f (True Positive Rate)\n", class_sensitivity),
sprintf("Specificity: %.4f (True Negative Rate)\n\n", class_specificity),
"HARMONIC MEAN METRIC:\n",
"----------------\n",
sprintf("F1 Score: %.4f\n", class_f1_score),
sprintf("F1 Score range: [0, 1] ✓\n\n"),
"ROC CURVE AND AUC:\n",
"----------------\n",
sprintf("Area Under Curve (AUC): %.4f\n", roc_results$auc),
sprintf("AUC = 0.5 (Random classifier)\n"),
sprintf("AUC = 1.0 (Perfect classifier)\n"),
sprintf("Current AUC = %.4f\n\n", roc_results$auc),
"================================================================================\n"
)
cat(output_text)
## ================================================================================
## COMPREHENSIVE CLASSIFICATION METRICS REPORT
## ================================================================================
##
## CONFUSION MATRIX:
## Predicted 0 Predicted 1
## Actual 0 119 5
## Actual 1 30 27
##
## Confusion Matrix Components:
## True Negatives (TN): 119
## False Positives (FP): 5
## False Negatives (FN): 30
## True Positives (TP): 27
##
## CLASSIFICATION METRICS:
## ----------------
## Accuracy: 0.8066
## Error Rate: 0.1934
## Accuracy + Error Rate: 1.0000 (should equal 1.0)
##
## THRESHOLD-DEPENDENT METRICS:
## ----------------
## Precision: 0.8438
## Sensitivity: 0.4737 (True Positive Rate)
## Specificity: 0.9597 (True Negative Rate)
##
## HARMONIC MEAN METRIC:
## ----------------
## F1 Score: 0.6067
## F1 Score range: [0, 1] ✓
##
## ROC CURVE AND AUC:
## ----------------
## Area Under Curve (AUC): 0.8489
## AUC = 0.5 (Random classifier)
## AUC = 1.0 (Perfect classifier)
## Current AUC = 0.8489
##
## ================================================================================
summary_df <- data.frame(
Metric = c("Accuracy", "Error Rate", "Precision", "Sensitivity", "Specificity", "F1 Score", "AUC"),
Value = c(class_accuracy, class_error_rate$error_rate, class_precision, class_sensitivity, class_specificity, class_f1_score, roc_results$auc),
stringsAsFactors = FALSE
)
print(summary_df)
## Metric Value
## 1 Accuracy 0.8066298
## 2 Error Rate 0.1933702
## 3 Precision 0.8437500
## 4 Sensitivity 0.4736842
## 5 Specificity 0.9596774
## 6 F1 Score 0.6067416
## 7 AUC 0.8488964
Investigate the caret package. In particular, consider the functions confusionMatrix, sensitivity, and specificity. Apply the functions to the data set. How do the results compare with your own functions?
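One way these caret functions might be applied here is sketched below. The explicit factor conversion and the positive = "1" / negative = "0" arguments are assumptions about how the 0/1 labels should be treated; without them, caret would treat "0" as the positive class by default.
# Convert the 0/1 columns to factors with matching levels
actual_factor <- factor(output_data$class, levels = c(0, 1))
predicted_factor <- factor(output_data$scored.class, levels = c(0, 1))
# Full confusion matrix report (accuracy, sensitivity, specificity, etc.)
caret_cm <- caret::confusionMatrix(data = predicted_factor,
                                   reference = actual_factor,
                                   positive = "1")
print(caret_cm)
# Individual metrics for a direct comparison with the functions above
caret::sensitivity(predicted_factor, actual_factor, positive = "1")
caret::specificity(predicted_factor, actual_factor, negative = "0")
With the positive class set to "1", the Sensitivity and Specificity reported by caret should line up with the hand-written results above (about 0.4737 and 0.9597).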
Investigate the pROC package. Use it to generate an ROC curve for the data set. How do the results compare with your own functions?
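A minimal sketch using pROC (already loaded in the setup chunk), assuming class is passed as the response and scored.probability as the predictor:
# Build the ROC object from the true classes and predicted probabilities
roc_obj <- pROC::roc(response = output_data$class,
                     predictor = output_data$scored.probability)
# Plot the curve and report the area under it
plot(roc_obj, main = "ROC Curve (pROC)")
pROC::auc(roc_obj)
Because pROC evaluates every observed probability as a threshold rather than a fixed 0.01 grid, its AUC may differ slightly from the trapezoidal estimate of about 0.8489 computed by the hand-written function.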