Instructions
In this homework assignment, you will work through various
classification metrics. You will be asked to create functions in R to
carry out the various calculations. You will also investigate some
functions in packages that will let you obtain the equivalent results.
Finally, you will create graphical output that also can be used to
evaluate the output of classification models, such as binary logistic
regression.
1. Download the classification output data set (attached in Blackboard to the assignment).
library(readr)
library(dplyr)
library(tidyverse)
url <- "https://raw.githubusercontent.com/greggmaloy/621/main/classification-output-data.csv"
data <- read_csv(url)
print(data)
## # A tibble: 181 × 11
## pregnant glucose diastolic skinfold insulin bmi pedigree age class
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 124 70 33 215 25.5 0.161 37 0
## 2 2 122 76 27 200 35.9 0.483 26 0
## 3 3 107 62 13 48 22.9 0.678 23 1
## 4 1 91 64 24 0 29.2 0.192 21 0
## 5 4 83 86 19 0 29.3 0.317 34 0
## 6 1 100 74 12 46 19.5 0.149 28 0
## 7 9 89 62 0 0 22.5 0.142 33 0
## 8 8 120 78 0 0 25 0.409 64 0
## 9 1 79 60 42 48 43.5 0.678 23 0
## 10 2 123 48 32 165 42.1 0.52 26 0
## # ℹ 171 more rows
## # ℹ 2 more variables: scored.class <dbl>, scored.probability <dbl>
2. The data set has three key columns we will use: class (the actual class for the observation), scored.class (the predicted class from the model), and scored.probability (the model's predicted probability of the positive class).
Use the table() function to get the raw confusion matrix for this scored data set. Make sure you understand the output. In particular, do the rows represent the actual or predicted class? The columns?
Question 2 Answer:
Rows represent the predicted class. The rows are essentially what is predicted by the model, with 0 representing no predicted condition/disease and 1 representing those individuals predicted by the model to have a condition/disease.
Columns represent the actual class. The columns represent the true condition/disease state, with 0 representing the individual not having the condition/disease and 1 representing those individuals who actually have the condition/disease.
By arranging the actual and predicted classes in a confusion matrix (a cross tabulation), metrics such as sensitivity and specificity can be derived, which describe the model's predictive power.
Below is a breakdown of the four cells of the confusion matrix:
Row 0, Column 0 (119): This represents True Negatives—119 cases where the predicted class was 0 and the actual class was also 0. In other words, there were 119 cases that were predicted to be negative that were actually negative (predicted negative and actually negative).
Row 0, Column 1 (30): This represents False Negatives—30 cases where the predicted class was 0 but the actual class was 1. In other words, there were 30 cases where the model did not predict the condition/disease, but the individual actually has the condition/disease (predicted negative and actually positive).
Row 1, Column 0 (5): This represents False Positives—5 cases where the predicted class was 1 but the actual class was 0. In other words, there were 5 cases where the model predicted the condition/state/disease, but that individual/case actually did not have the condition/state/disease. (predicted positive and actually negative)
Row 1, Column 1 (27): This represents True Positives—27 cases where the predicted class was 1 and the actual class was 1. In other words, there were 27 cases where both the prediction and the actual state/condition/disease were positive (predicted positive and actually positive).
confusion_matrix <- table(data$scored.class,data$class)
confusion_matrix
##
## 0 1
## 0 119 30
## 1 5 27
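If helpful, the same cross tabulation can be produced with labeled dimensions so there is no ambiguity about which margin is the prediction and which is the truth; a minimal sketch using the columns loaded above:
#label the dimensions explicitly (rows = predicted class, columns = actual class)
table(Predicted = data$scored.class, Actual = data$class)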
3. Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the accuracy of the predictions.
In general, accuracy represents the percentage of cases in which the model correctly identifies the actual class. The accuracy of our model was determined to be 0.8066298, which indicates that the model is correct 80.66% of the time and incorrect the other ~19% of the time. The formula is the sum of true positives and true negatives divided by the total number of observations. \[ \text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \]
#function
calculate_accuracy <- function(data, actual_col, predicted_col) {
  #build the confusion matrix from the supplied columns (rows = predicted, columns = actual)
  confusion_matrix <- table(data[[predicted_col]], data[[actual_col]])
  #TP, TN, FP, FN
  TP <- confusion_matrix[2, 2] # Predicted 1, Actual 1
  TN <- confusion_matrix[1, 1] # Predicted 0, Actual 0
  FP <- confusion_matrix[2, 1] # Predicted 1, Actual 0
  FN <- confusion_matrix[1, 2] # Predicted 0, Actual 1
  #calculate accuracy
  accuracy <- (TP + TN) / (TP + FP + TN + FN)
  return(accuracy)
}
#call function
accuracy <- calculate_accuracy(data, actual_col = "class", predicted_col = "scored.class")
print(accuracy)
## [1] 0.8066298
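As a quick sanity check (a sketch, not part of the required function), accuracy can also be computed directly as the proportion of rows where the predicted class equals the actual class; it should reproduce the 0.8066 value above:
#proportion of observations where the prediction matches the actual class
mean(data$scored.class == data$class)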
4. Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the classification error rate of the predictions. Verify that you get an accuracy and an error rate that sums to one.
In general, the classification error rate represents the percentage of cases in which the model incorrectly identifies the actual class. A classification error rate of 0.1933702 means that the model makes incorrect predictions approximately 19.34% of the time. The formula is the sum of false positives and false negatives divided by the total number of observations.
Additionally, the classification error rate and the accuracy will always sum to 1, because every observation is either classified correctly (counted toward the accuracy) or incorrectly (counted toward the error rate), so together they account for every observation in the dataset.
\[ \text{Classification Error Rate} = \frac{FP + FN}{TP + FP + TN + FN} \]
calculate_classification_error <- function(data, actual_col, predicted_col) {
  #build the confusion matrix from the supplied columns (rows = predicted, columns = actual)
  confusion_matrix <- table(data[[predicted_col]], data[[actual_col]])
  TP <- confusion_matrix[2, 2] # Predicted 1, Actual 1
  TN <- confusion_matrix[1, 1] # Predicted 0, Actual 0
  FP <- confusion_matrix[2, 1] # Predicted 1, Actual 0
  FN <- confusion_matrix[1, 2] # Predicted 0, Actual 1
  classification_error <- (FP + FN) / (TP + FP + TN + FN)
  return(classification_error)
}
#call function
classification_error <- calculate_classification_error(data, actual_col = "class", predicted_col = "scored.class")
#result
print(classification_error)
## [1] 0.1933702
validate <- classification_error + accuracy
print(paste("Classification error + accuracy =", validate))
## [1] "Classification error + accuracy = 1"
5. Write a function that takes the data set as a dataframe, with actual and predicted classifications identified and returns the precision of the predictions.
Precision measures the proportion of true positives among all positive predictions (true positives plus false positives). Our model's precision of 0.84375 means that 84.38% of the instances predicted as positive by the model were in fact positive. In other words, precision is the number of cases correctly predicted to be positive divided by all cases the model predicts to be positive, including the false positives. \[ \text{Precision} = \frac{TP}{TP + FP} \]
calculate_precision <- function(data, actual_col, predicted_col) {
  confusion_matrix <- table(data[[predicted_col]], data[[actual_col]])
  TP <- confusion_matrix[2, 2] # Predicted 1, Actual 1
  FP <- confusion_matrix[2, 1] # Predicted 1, Actual 0
  precision <- TP / (TP + FP)
  return(precision)
}
precision <- calculate_precision(data, actual_col = "class", predicted_col = "scored.class")
print(paste("Precision: ", precision))
## [1] "Precision: 0.84375"
6. Write a function that takes the data set as a dataframe,
with actual and predicted classifications identified, and returns the
sensitivity of the predictions. Sensitivity is also known as
recall.
Sensitivity measures the proportion of positive instances that are correctly identified by the model among the total number of actual positive instances (true positives plus false negatives). The sensitivity for our model was determined to be 0.4737, which means the model correctly identifies only ~47% of the actual positive cases. In other words, the model missed more than half of the actually positive cases, indicating a high number of false negatives.
\[ \text{Sensitivity} = \frac{TP}{TP + FN} \]
calculate_sensitivity <- function(data, actual_col, predicted_col) {
  confusion_matrix <- table(data[[predicted_col]], data[[actual_col]])
  TP <- confusion_matrix[2, 2] # Predicted 1, Actual 1
  FN <- confusion_matrix[1, 2] # Predicted 0, Actual 1
  sensitivity <- TP / (TP + FN)
  return(sensitivity)
}
sensitivity <- calculate_sensitivity(data, actual_col = "class", predicted_col = "scored.class")
print(paste("Sensitivity:", sensitivity))
## [1] "Sensitivity: 0.473684210526316"
7. Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the specificity of the predictions.
Specificity measures the proportion of negative instances that are correctly identified by the model (true negatives) among the total number of actual negative instances (true negatives plus false positives). The specificity of our model was determined to be 0.9597, meaning the model correctly identifies ~96% of the actual negative cases. This high specificity, coupled with a sensitivity of only about 47%, highlights the trade-off between the model's ability to minimize false positives (high specificity) and its weaker ability to capture all positive cases (low sensitivity). \[ \text{Specificity} = \frac{TN}{TN + FP} \]
calculate_specificity <- function(data, actual_col, predicted_col) {
  confusion_matrix <- table(data[[predicted_col]], data[[actual_col]])
  TN <- confusion_matrix[1, 1] # Predicted 0, Actual 0
  FP <- confusion_matrix[2, 1] # Predicted 1, Actual 0
  specificity <- TN / (TN + FP)
  return(specificity)
}
specificity <- calculate_specificity(data, actual_col = "class", predicted_col = "scored.class")
print(paste("Specificity:", specificity))
## [1] "Specificity: 0.959677419354839"
8. Write a function that takes the data set as a dataframe,
with actual and predicted classifications identified,and returns the F1
score of the predictions.
The F1 score is a metric that balances precision and sensitivity, since in many problems precision and sensitivity trade off against one another, with one being more desirable than the other depending on what is being measured. Our model was determined to have an F1 score of 0.6067. An F1 score of about 61% is moderate, with room for improvement. \[
\text{F1 Score} = \frac{2 × Precision × Sensitivity}{Precision + Sensitivity}
\]
calculate_f1_score <- function(data, actual_col, predicted_col) {
  precision <- calculate_precision(data, actual_col, predicted_col)
  sensitivity <- calculate_sensitivity(data, actual_col, predicted_col)
  f1_score <- 2 * (precision * sensitivity) / (precision + sensitivity)
  return(f1_score)
}
f1_score <- calculate_f1_score(data, actual_col = "class", predicted_col = "scored.class")
print(paste("F1 Score:", f1_score))
## [1] "F1 Score: 0.606741573033708"
9. Before we move on, let’s consider a question that was asked: What are the bounds on the F1 score? Show that the F1 score will always be between 0 and 1. (Hint: If 0 < 𝑎 < 1 and 0 < 𝑏 < 1 then 𝑎𝑏 < 𝑎.)
The bounds on the F1 score are 0 and 1.
Below is the formula for the F1 score: \[ \text{F1 Score} = \frac{2 × Precision × Sensitivity}{Precision + Sensitivity} \] Both precision and sensitivity are ratios, so each lies between 0 and 1. Using the hint, if 0 < precision < 1 and 0 < sensitivity < 1, then precision × sensitivity < precision and precision × sensitivity < sensitivity. Adding these two inequalities gives 2 × precision × sensitivity < precision + sensitivity, so the numerator is always less than the denominator and the F1 score is less than 1 (it reaches 1 only when precision = sensitivity = 1). Because precision and sensitivity are never negative, the numerator is never negative either, so the F1 score is at least 0 (it equals 0 when there are no true positives). Therefore the F1 score always lies between 0 and 1.
We can illustrate this by plugging near-extreme values into the F1 equation. With precision = sensitivity = 0.99:
\[
\text{F1 Score} = \frac{2 × 0.99 × 0.99}{0.99 + 0.99} = \frac{1.9602}{1.98} = 0.99
\]
Likewise, with precision = sensitivity = 0.01:
\[
\text{F1 Score} = \frac{2 × 0.01 × 0.01}{0.01 + 0.01} = \frac{0.0002}{0.02} = 0.01
\]
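As an additional numerical illustration (a sketch, not required by the question), we can sweep precision and sensitivity over a grid strictly inside (0, 1) and confirm that the resulting F1 scores never leave that interval:
#grid of precision/sensitivity values strictly between 0 and 1
p <- seq(0.01, 0.99, by = 0.01)
grid <- expand.grid(precision = p, sensitivity = p)
f1 <- with(grid, 2 * precision * sensitivity / (precision + sensitivity))
range(f1)  #both endpoints stay strictly inside (0, 1)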
10. Write a function that generates an ROC curve from a data set with a true classification column (class in our example) and a probability column (scored.probability in our example). Your function should return a list that includes the plot of the ROC curve and a vector that contains the calculated area under the curve (AUC). Note that I recommend using a sequence of thresholds ranging from 0 to 1 at 0.01 intervals.
roc_func <- function(x, y) {
  #x = actual class (0/1), y = predicted probability
  #use each unique predicted probability as a threshold, from highest to lowest
  thresholds <- sort(unique(y), decreasing = TRUE)
  TP <- numeric(length(thresholds))
  FP <- numeric(length(thresholds))
  for (i in seq_along(thresholds)) {
    TP[i] <- sum(x[y >= thresholds[i]]) / sum(x)    #true positive rate (sensitivity)
    FP[i] <- sum(!x[y >= thresholds[i]]) / sum(!x)  #false positive rate (1 - specificity)
  }
  #area under the curve via the trapezoid rule
  auc <- sum(diff(FP) * (TP[-1] + TP[-length(TP)]) / 2)
  df <- data.frame(TP = TP, FP = FP)
  return(list(df = df, auc = auc))
}
roc_data <- roc_func(data$class, data$scored.probability)
plot(roc_data$df$FP,
     roc_data$df$TP,
     type = "l",
     col = "blue",
     lwd = 2,
     xlab = "1 - Specificity",
     ylab = "Sensitivity",
     main = paste("ROC Curve (AUC =", round(roc_data$auc, 3), ")"))
abline(a = 0, b = 1, col = "red", lty = 2) #diagonal reference line (random classifier)
library(knitr)
kable(roc_data$auc)
| x |
|---|
| 0.8503113 |
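For comparison, here is a minimal sketch that follows the prompt's recommendation of a fixed threshold sequence from 0 to 1 in 0.01 steps (the helper name roc_func_grid is my own, not part of the assignment); it should produce essentially the same curve and a very similar AUC to roc_func() above:
roc_func_grid <- function(actual, prob) {
  thresholds <- seq(0, 1, by = 0.01)
  TPR <- numeric(length(thresholds))
  FPR <- numeric(length(thresholds))
  for (i in seq_along(thresholds)) {
    pred <- as.numeric(prob >= thresholds[i])
    TPR[i] <- sum(pred == 1 & actual == 1) / sum(actual == 1)  #sensitivity
    FPR[i] <- sum(pred == 1 & actual == 0) / sum(actual == 0)  #1 - specificity
  }
  #sort by increasing false positive rate before applying the trapezoid rule
  ord <- order(FPR, TPR)
  TPR <- TPR[ord]
  FPR <- FPR[ord]
  auc <- sum(diff(FPR) * (TPR[-1] + TPR[-length(TPR)]) / 2)
  list(df = data.frame(FPR = FPR, TPR = TPR), auc = auc)
}
roc_grid <- roc_func_grid(data$class, data$scored.probability)
round(roc_grid$auc, 3)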
11. Use your created R functions and the provided classification output data set to produce all of the classification metrics discussed above.
As stated before, the model has an accuracy of 80.66%, meaning it correctly predicts the disease/condition state 80.66% of the time, with an error rate of 19.34%. The precision of 84.38% indicates that when the model predicts a positive disease/condition, it is correct 84.38% of the time. The sensitivity of 47.37% shows that the model failed to identify more than half of the actual positive cases. The specificity of 95.97% shows that the model effectively identifies negatives, while the F1 score of 60.67% indicates there is room for improvement in balancing precision and sensitivity.
Taken together, these metrics suggest the model is better suited to research questions where false negatives are more acceptable than false positives, since the model rarely raises a false alarm but misses many true cases. The sensitivity of about 47% means that more than half of the actual positive cases (about 53%) would not be flagged by the model.
df <- c(accuracy, classification_error, f1_score, precision, sensitivity, specificity)
names(df) <- c("Accuracy", "Error", "F1", "Percision", "Sensitivity", "Spec")
kable(df)
| x | |
|---|---|
| Accuracy | 0.8066298 |
| Error | 0.1933702 |
| F1 | 0.6067416 |
| Precision | 0.8437500 |
| Sensitivity | 0.4736842 |
| Spec | 0.9596774 |
12. Investigate the caret package. In particular, consider the functions confusionMatrix(), sensitivity(), and specificity(). Apply the functions to the data set. How do the results compare with your own functions?
The caret results are identical to the results of our functions. Sensitivity is identical (caret = 0.4737 vs. our function = 0.4737), specificity is identical (caret = 0.9597 vs. our function = 0.9597), accuracy is identical (caret = 0.8066 vs. our function = 0.8066), and precision (Pos Pred Value) is identical (caret = 0.8438 vs. our function = 0.8438). The default confusionMatrix() output below does not display the F1 score, but it is built from the same precision and sensitivity values, so it would match as well (see the sketch after the caret output).
#install.packages("caret")
#install.packages("proc")
library(caret)
data$class <- factor(data$class, levels = c(1, 0))
data$scored.class <- factor(data$scored.class, levels = c(1, 0))
conf_matrix <- confusionMatrix(data$scored.class, data$class)
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 0
## 1 27 5
## 0 30 119
##
## Accuracy : 0.8066
## 95% CI : (0.7415, 0.8615)
## No Information Rate : 0.6851
## P-Value [Acc > NIR] : 0.0001712
##
## Kappa : 0.4916
##
## Mcnemar's Test P-Value : 4.976e-05
##
## Sensitivity : 0.4737
## Specificity : 0.9597
## Pos Pred Value : 0.8438
## Neg Pred Value : 0.7987
## Prevalence : 0.3149
## Detection Rate : 0.1492
## Detection Prevalence : 0.1768
## Balanced Accuracy : 0.7167
##
## 'Positive' Class : 1
##
#sensitivity
sens <- sensitivity(data$scored.class, data$class)
print(paste("Sensitivity (Recall):", sens))
## [1] "Sensitivity (Recall): 0.473684210526316"
#specificity
spec <- specificity(data$scored.class, data$class)
print(paste("Specificity:", spec))
## [1] "Specificity: 0.959677419354839"
13. Investigate the pROC package. Use it to generate an ROC curve for the data set. How do the results compare with your own functions?
Both graphs are visually comparable, and both the pROC package and my custom function yielded an identical AUC of about 0.85. This AUC can be interpreted as an 85% chance that the model will rank a randomly chosen positive instance higher than a randomly chosen negative instance. This AUC value indicates a strong model fit, demonstrating that the model effectively identifies true positives while keeping false positives low.
library(pROC)
roc_obj <- roc(data$class, data$scored.probability)
plot(roc_obj, col = "blue", lwd = 2, main = "ROC Curve using pROC", legacy.axes = TRUE)
auc_value <- auc(roc_obj)
print(paste("AUC from pROC:", auc_value))
## [1] "AUC from pROC: 0.850311262026033"