Instructions
In this homework assignment, you will work through various
classification metrics. You will be asked to create functions in R to
carry out the various calculations. You will also investigate some
functions in packages that will let you obtain the equivalent results.
Finally, you will create graphical output that also can be used to
evaluate the output of classification models, such as binary logistic
regression.
1. Download the classification output data set (attached in Blackboard to the assignment).
library(readr)
library(dplyr)
library(tidyverse)
url <- "https://raw.githubusercontent.com/greggmaloy/621/main/classification-output-data.csv"
data <- read_csv(url)
print(data)
## # A tibble: 181 × 11
## pregnant glucose diastolic skinfold insulin bmi pedigree age class
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 124 70 33 215 25.5 0.161 37 0
## 2 2 122 76 27 200 35.9 0.483 26 0
## 3 3 107 62 13 48 22.9 0.678 23 1
## 4 1 91 64 24 0 29.2 0.192 21 0
## 5 4 83 86 19 0 29.3 0.317 34 0
## 6 1 100 74 12 46 19.5 0.149 28 0
## 7 9 89 62 0 0 22.5 0.142 33 0
## 8 8 120 78 0 0 25 0.409 64 0
## 9 1 79 60 42 48 43.5 0.678 23 0
## 10 2 123 48 32 165 42.1 0.52 26 0
## # ℹ 171 more rows
## # ℹ 2 more variables: scored.class <dbl>, scored.probability <dbl>
2. The data set has three key columns we will use: class (the actual class for the observation), scored.class (the predicted class from the model), and scored.probability (the model's predicted probability of the positive class).
Use the table() function to get the raw confusion matrix for this scored data set. Make sure you understand the output. In particular, do the rows represent the actual or predicted class? The columns?
Question 2 Answer:
Rows represent the predicted class. The rows are essentially what is predicted by the model, with 0 representing no predicted condition/disease and 1 representing those individuals predicted by the model to have a condition/disease.
Columns represent the actual class. The columns represent the true condition/disease state, with 0 representing the individual not having the condition/disease and 1 representing those individuals who actually have the condition/disease.
By arranging the actual and predicted classes in a confusion matrix (a cross tabulation), metrics such as sensitivity and specificity can be derived, which describe the model's predictive power.
Below is a breakdown of the four cells of the confusion matrix:
Row 0, Column 0 (119): This represents True Negatives—119 cases where the predicted class was 0 and the actual class was also 0. In other words, there were 119 cases that were predicted to be negative that were actually negative (predicted negative and actually negative).
Row 0, Column 1 (30): This represents False Negatives—30 cases where the predicted class was 0 but the actual class was 1. In other words, there were 30 cases where the model did not predict the condition/disease, but the individual actually has the condition/disease (predicted negative and actually positive).
Row 1, Column 0 (5): This represents False Positives—5 cases where the predicted class was 1 but the actual class was 0. In other words, there were 5 cases where the model predicted the condition/state/disease, but that individual/case actually did not have the condition/state/disease. (predicted positive and actually negative)
Row 1, Column 1 (27): This represents True Positives—27 cases where the predicted class was 1 and the actual class was 1. In other words, there were 27 cases where both the prediction and the actual state/condition/disease were positive (predicted positive and actually positive).
confusion_matrix <- table(data$scored.class,data$class)
confusion_matrix
##
## 0 1
## 0 119 30
## 1 5 27
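If helpful, the same cross tabulation can be produced with labeled dimensions so there is no ambiguity about which margin is the prediction and which is the truth; a minimal sketch using the columns loaded above:
#label the dimensions explicitly (rows = predicted class, columns = actual class)
table(Predicted = data$scored.class, Actual = data$class)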
3. Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the accuracy of the predictions.
In general, accuracy represents the percentage of cases in which the model correctly identifies the actual class. The accuracy of our model was determined to be 0.8066298, which indicates that the model is correct 80.66% of the time and incorrect the other ~19% of the time. The formula is the sum of true positives and true negatives divided by the total number of observations. \[ \text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \]
#function
calculate_accuracy <- function(data, actual_col, predicted_col) {
  #build the confusion matrix from the supplied columns (rows = predicted, columns = actual)
  confusion_matrix <- table(data[[predicted_col]], data[[actual_col]])
  #TP, TN, FP, FN
  TP <- confusion_matrix[2, 2] # Predicted 1, Actual 1
  TN <- confusion_matrix[1, 1] # Predicted 0, Actual 0
  FP <- confusion_matrix[2, 1] # Predicted 1, Actual 0
  FN <- confusion_matrix[1, 2] # Predicted 0, Actual 1
  #calculate accuracy
  accuracy <- (TP + TN) / (TP + FP + TN + FN)
  return(accuracy)
}
#call function
accuracy <- calculate_accuracy(data, actual_col = "class", predicted_col = "scored.class")
print(accuracy)
## [1] 0.8066298
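As a quick sanity check (a sketch, not part of the required function), accuracy can also be computed directly as the proportion of rows where the predicted class equals the actual class; it should reproduce the 0.8066 value above:
#proportion of observations where the prediction matches the actual class
mean(data$scored.class == data$class)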
4. Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the classification error rate of the predictions. Verify that you get an accuracy and an error rate that sums to one.
In general, the classification error rate represents the percentage of cases in which the model incorrectly identifies the actual class. A classification error rate of 0.1933702 means that the model makes incorrect predictions approximately 19.34% of the time. The formula is the sum of false positives and false negatives divided by the total number of observations.
Additionally, the classification error rate and the accuracy will always sum to 1, because every observation is either classified correctly (counted toward the accuracy) or incorrectly (counted toward the error rate), so together they account for every observation in the dataset.
\[ \text{Classification Error Rate} = \frac{FP + FN}{TP + FP + TN + FN} \]
calculate_classification_error <- function(data, actual_col, predicted_col) {
  #build the confusion matrix from the supplied columns (rows = predicted, columns = actual)
  confusion_matrix <- table(data[[predicted_col]], data[[actual_col]])
  TP <- confusion_matrix[2, 2] # Predicted 1, Actual 1
  TN <- confusion_matrix[1, 1] # Predicted 0, Actual 0
  FP <- confusion_matrix[2, 1] # Predicted 1, Actual 0
  FN <- confusion_matrix[1, 2] # Predicted 0, Actual 1
  classification_error <- (FP + FN) / (TP + FP + TN + FN)
  return(classification_error)
}
#call function
classification_error <- calculate_classification_error(data, actual_col = "class", predicted_col = "scored.class")
#result
print(classification_error)
## [1] 0.1933702
validate <- classification_error + accuracy
print(paste("Classification error + accuracy =", validate))
## [1] "Classification error + accuracy = 1"
5. Write a function that takes the data set as a dataframe, with actual and predicted classifications identified and returns the precision of the predictions.
Precision measures the proportion of true positives among all positive predictions (true positives plus false positives). Our model's precision of 0.84375 means that 84.38% of the instances predicted as positive by the model were in fact positive. In other words, precision is the number of cases correctly predicted to be positive divided by all cases the model predicts to be positive, including the false positives. \[ \text{Precision} = \frac{TP}{TP + FP} \]
calculate_precision <- function(data, actual_col, predicted_col) {
  confusion_matrix <- table(data[[predicted_col]], data[[actual_col]])
  TP <- confusion_matrix[2, 2] # Predicted 1, Actual 1
  FP <- confusion_matrix[2, 1] # Predicted 1, Actual 0
  precision <- TP / (TP + FP)
  return(precision)
}
precision <- calculate_precision(data, actual_col = "class", predicted_col = "scored.class")
print(paste("Precision: ", precision))
## [1] "Precision: 0.84375"
6. Write a function that takes the data set as a dataframe,
with actual and predicted classifications identified, and returns the
sensitivity of the predictions. Sensitivity is also known as
recall.
Sensitivity measures the proportion of positive instances that are correctly identified by the model among the total number of actual positive instances (true positives plus false negatives). The sensitivity for our model was determined to be 0.4737, which means the model correctly identifies only ~47% of the actual positive cases. In other words, the model missed more than half of the actually positive cases, indicating a high number of false negatives.
\[ \text{Sensitivity} = \frac{TP}{TP + FN} \]
calculate_sensitivity <- function(data, actual_col, predicted_col) {
  confusion_matrix <- table(data[[predicted_col]], data[[actual_col]])
  TP <- confusion_matrix[2, 2] # Predicted 1, Actual 1
  FN <- confusion_matrix[1, 2] # Predicted 0, Actual 1
  sensitivity <- TP / (TP + FN)
  return(sensitivity)
}
sensitivity <- calculate_sensitivity(data, actual_col = "class", predicted_col = "scored.class")
print(paste("Sensitivity:", sensitivity))
## [1] "Sensitivity: 0.473684210526316"
7. Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the specificity of the predictions.
Specificity measures the proportion of negative instances that are correctly identified by the model (true negatives) among the total number of actual negative instances (true negatives plus false positives). The specificity of our model was determined to be 0.9597, meaning the model correctly identifies ~96% of the actual negative cases. This high specificity, coupled with a sensitivity of only about 47%, highlights the trade-off between the model's ability to minimize false positives (high specificity) and its weaker ability to capture all positive cases (low sensitivity). \[ \text{Specificity} = \frac{TN}{TN + FP} \]
calculate_specificity <- function(data, actual_col, predicted_col) {
  confusion_matrix <- table(data[[predicted_col]], data[[actual_col]])
  TN <- confusion_matrix[1, 1] # Predicted 0, Actual 0
  FP <- confusion_matrix[2, 1] # Predicted 1, Actual 0
  specificity <- TN / (TN + FP)
  return(specificity)
}
specificity <- calculate_specificity(data, actual_col = "class", predicted_col = "scored.class")
print(paste("Specificity:", specificity))
## [1] "Specificity: 0.959677419354839"
8. Write a function that takes the data set as a dataframe,
with actual and predicted classifications identified,and returns the F1
score of the predictions.
The F1 score is a metric that balances precision and sensitivity, since in many problems precision and sensitivity trade off against one another, with one being more desirable than the other depending on what is being measured. Our model was determined to have an F1 score of 0.6067. An F1 score of about 61% is moderate, with room for improvement. \[
\text{F1 Score} = \frac{2 × Precision × Sensitivity}{Precision + Sensitivity}
\]
calculate_f1_score <- function(data, actual_col, predicted_col) {
  precision <- calculate_precision(data, actual_col, predicted_col)
  sensitivity <- calculate_sensitivity(data, actual_col, predicted_col)
  f1_score <- 2 * (precision * sensitivity) / (precision + sensitivity)
  return(f1_score)
}
f1_score <- calculate_f1_score(data, actual_col = "class", predicted_col = "scored.class")
print(paste("F1 Score:", f1_score))
## [1] "F1 Score: 0.606741573033708"
9. Before we move on, let’s consider a question that was asked: What are the bounds on the F1 score? Show that the F1 score will always be between 0 and 1. (Hint: If 0 < 𝑎 < 1 and 0 < 𝑏 < 1 then 𝑎𝑏 < 𝑎.)
The bounds on the F1 score are 0 and 1.
Below is the formula for the F1 score: \[ \text{F1 Score} = \frac{2 × Precision × Sensitivity}{Precision + Sensitivity} \] Both precision and sensitivity are ratios, so each lies between 0 and 1. Using the hint, if 0 < precision < 1 and 0 < sensitivity < 1, then precision × sensitivity < precision and precision × sensitivity < sensitivity. Adding these two inequalities gives 2 × precision × sensitivity < precision + sensitivity, so the numerator is always less than the denominator and the F1 score is less than 1 (it reaches 1 only when precision = sensitivity = 1). Because precision and sensitivity are never negative, the numerator is never negative either, so the F1 score is at least 0 (it equals 0 when there are no true positives). Therefore the F1 score always lies between 0 and 1.
We can illustrate this by plugging near-extreme values into the F1 equation. With precision = sensitivity = 0.99:
\[
\text{F1 Score} = \frac{2 × 0.99 × 0.99}{0.99 + 0.99} = \frac{1.9602}{1.98} = 0.99
\]
Likewise, with precision = sensitivity = 0.01:
\[
\text{F1 Score} = \frac{2 × 0.01 × 0.01}{0.01 + 0.01} = \frac{0.0002}{0.02} = 0.01
\]
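As an additional numerical illustration (a sketch, not required by the question), we can sweep precision and sensitivity over a grid strictly inside (0, 1) and confirm that the resulting F1 scores never leave that interval:
#grid of precision/sensitivity values strictly between 0 and 1
p <- seq(0.01, 0.99, by = 0.01)
grid <- expand.grid(precision = p, sensitivity = p)
f1 <- with(grid, 2 * precision * sensitivity / (precision + sensitivity))
range(f1)  #both endpoints stay strictly inside (0, 1)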
10. Write a function that generates an ROC curve from a data set with a true classification column (class in our example) and a probability column (scored.probability in our example). Your function should return a list that includes the plot of the ROC curve and a vector that contains the calculated area under the curve (AUC). Note that I recommend using a sequence of thresholds ranging from 0 to 1 at 0.01 intervals.
roc_func <- function(x, y) {
  #x = actual class (0/1), y = predicted probability
  #use each unique predicted probability as a threshold, from highest to lowest
  thresholds <- sort(unique(y), decreasing = TRUE)
  TP <- numeric(length(thresholds))
  FP <- numeric(length(thresholds))
  for (i in seq_along(thresholds)) {
    TP[i] <- sum(x[y >= thresholds[i]]) / sum(x)    #true positive rate (sensitivity)
    FP[i] <- sum(!x[y >= thresholds[i]]) / sum(!x)  #false positive rate (1 - specificity)
  }
  #area under the curve via the trapezoid rule
  auc <- sum(diff(FP) * (TP[-1] + TP[-length(TP)]) / 2)
  df <- data.frame(TP = TP, FP = FP)
  return(list(df = df, auc = auc))
}
roc_data <- roc_func(data$class, data$scored.probability)
plot(roc_data$df$FP,
     roc_data$df$TP,
     type = "l",
     col = "blue",
     lwd = 2,
     xlab = "1 - Specificity",
     ylab = "Sensitivity",
     main = paste("ROC Curve (AUC =", round(roc_data$auc, 3), ")"))
abline(a = 0, b = 1, col = "red", lty = 2) #diagonal reference line (random classifier)
library(knitr)
kable(roc_data$auc)
| x |
|---|
| 0.8503113 |
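For comparison, here is a minimal sketch that follows the prompt's recommendation of a fixed threshold sequence from 0 to 1 in 0.01 steps (the helper name roc_func_grid is my own, not part of the assignment); it should produce essentially the same curve and a very similar AUC to roc_func() above:
roc_func_grid <- function(actual, prob) {
  thresholds <- seq(0, 1, by = 0.01)
  TPR <- numeric(length(thresholds))
  FPR <- numeric(length(thresholds))
  for (i in seq_along(thresholds)) {
    pred <- as.numeric(prob >= thresholds[i])
    TPR[i] <- sum(pred == 1 & actual == 1) / sum(actual == 1)  #sensitivity
    FPR[i] <- sum(pred == 1 & actual == 0) / sum(actual == 0)  #1 - specificity
  }
  #sort by increasing false positive rate before applying the trapezoid rule
  ord <- order(FPR, TPR)
  TPR <- TPR[ord]
  FPR <- FPR[ord]
  auc <- sum(diff(FPR) * (TPR[-1] + TPR[-length(TPR)]) / 2)
  list(df = data.frame(FPR = FPR, TPR = TPR), auc = auc)
}
roc_grid <- roc_func_grid(data$class, data$scored.probability)
round(roc_grid$auc, 3)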
11. Use your created R functions and the provided classification output data set to produce all of the classification metrics discussed above.
As stated before, the model has an accuracy of 80.66%, meaning it correctly predicts the disease/condition state 80.66% of the time, with an error rate of 19.34%. The precision of 84.38% indicates that when the model predicts a positive disease/condition, it is correct 84.38% of the time. The sensitivity of 47.37% shows that the model failed to identify more than half of the actual positive cases. The specificity of 95.97% shows that the model effectively identifies negatives, while the F1 score of 60.67% indicates there is room for improvement in balancing precision and sensitivity.
Taken together, these metrics suggest the model is better suited to research questions where false negatives are more acceptable than false positives, since the model rarely raises a false alarm but misses many true cases. The sensitivity of about 47% means that more than half of the actual positive cases (about 53%) would not be flagged by the model.
df <- c(accuracy, classification_error, f1_score, precision, sensitivity, specificity)
names(df) <- c("Accuracy", "Error", "F1", "Percision", "Sensitivity", "Spec")
kable(df)
| x | |
|---|---|
| Accuracy | 0.8066298 |
| Error | 0.1933702 |
| F1 | 0.6067416 |
| Precision | 0.8437500 |
| Sensitivity | 0.4736842 |
| Spec | 0.9596774 |
12. Investigate the caret package. In particular, consider the functions confusionMatrix(), sensitivity(), and specificity(). Apply the functions to the data set. How do the results compare with your own functions?
The caret results are identical to the results of our functions. Sensitivity is identical (caret = 0.4737 vs. our function = 0.4737), specificity is identical (caret = 0.9597 vs. our function = 0.9597), accuracy is identical (caret = 0.8066 vs. our function = 0.8066), and precision (Pos Pred Value) is identical (caret = 0.8438 vs. our function = 0.8438). The default confusionMatrix() output below does not display the F1 score, but it is built from the same precision and sensitivity values, so it would match as well (see the sketch after the caret output).
#install.packages("caret")
#install.packages("proc")
library(caret)
data$class <- factor(data$class, levels = c(1, 0))
data$scored.class <- factor(data$scored.class, levels = c(1, 0))
conf_matrix <- confusionMatrix(data$scored.class, data$class)
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 0
## 1 27 5
## 0 30 119
##
## Accuracy : 0.8066
## 95% CI : (0.7415, 0.8615)
## No Information Rate : 0.6851
## P-Value [Acc > NIR] : 0.0001712
##
## Kappa : 0.4916
##
## Mcnemar's Test P-Value : 4.976e-05
##
## Sensitivity : 0.4737
## Specificity : 0.9597
## Pos Pred Value : 0.8438
## Neg Pred Value : 0.7987
## Prevalence : 0.3149
## Detection Rate : 0.1492
## Detection Prevalence : 0.1768
## Balanced Accuracy : 0.7167
##
## 'Positive' Class : 1
##
#sensitivity
sens <- sensitivity(data$scored.class, data$class)
print(paste("Sensitivity (Recall):", sens))
## [1] "Sensitivity (Recall): 0.473684210526316"
#specificity
spec <- specificity(data$scored.class, data$class)
print(paste("Specificity:", spec))
## [1] "Specificity: 0.959677419354839"
13. Investigate the pROC package. Use it to generate an ROC curve for the data set. How do the results compare with your own functions?
Both graphs are visually comparable, and both the pROC package and my custom function yielded an identical AUC of about 0.85. This AUC can be interpreted as an 85% chance that the model will rank a randomly chosen positive instance higher than a randomly chosen negative instance. This AUC value indicates a strong model fit, demonstrating that the model effectively identifies true positives while keeping false positives low.
library(pROC)
roc_obj <- roc(data$class, data$scored.probability)
plot(roc_obj, col = "blue", lwd = 2, main = "ROC Curve using pROC", legacy.axes = TRUE)
auc_value <- auc(roc_obj)
print(paste("AUC from pROC:", auc_value))
## [1] "AUC from pROC: 0.850311262026033"