Data621 - Homework 2

Authors

Anthony Josue Roman, Rupendra Shrestha, Bikash Bhowmik, Jerald Melukkaran

Overview

This assignment evaluates the performance of a binary classification model using a provided dataset that contains actual outcomes, predicted class labels, and predicted class probabilities.

First, a confusion matrix is built to show the relationship between the actual classes and the model’s predictions. From this matrix, custom R functions compute accuracy, error rate, precision, sensitivity, specificity, and the F1 score, each of which captures a different aspect of the model’s performance.

Next, the model is evaluated using the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC). The ROC curve plots the model’s sensitivity against its false positive rate across a range of classification thresholds.

Lastly, the metrics computed with the custom functions are compared with those produced by functions from the caret and pROC packages.

Objectives

The primary goals and objectives of this assignment are as follows:

  • To build a confusion matrix based on the actual and predicted class labels.
  • To use custom R functions to compute the classification performance metrics.
  • To verify that the accuracy and the error rate sum to 1.
  • To plot the ROC curve and compute the AUC.
  • To compare the results using the caret and pROC packages.
  • To interpret the results.

Data Explanation

The dataset contains the output of a binary classification model. Three variables in each row describe the actual outcome and the model’s prediction:

  • class: The actual class label of the instance. The positive class is coded as 1 and the negative class as 0.
  • scored.class: The class label predicted by the model after applying a classification threshold to the predicted probability.
  • scored.probability: The model’s predicted probability that the instance belongs to the positive class.

The class and scored.class variables are the basis for the confusion matrix and the metrics derived from it. The scored.probability variable is needed for the ROC curve and the AUC, which require re-classifying the instances at many different thresholds. The remaining eight columns (pregnant through age) are not used in this evaluation.
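As a quick sanity check on how the three variables relate, the sketch below rebuilds the first six rows printed by head(df) in this report and tests whether scored.class matches scored.probability cut at 0.5. The 0.5 cutoff is an assumption: the dataset does not state which threshold was used, and six rows cannot confirm it for the full data.

```r
# First six rows of the dataset, copied from the head(df) output
first_rows <- data.frame(
  class              = c(0, 0, 1, 0, 0, 0),
  scored.class       = c(0, 0, 0, 0, 0, 0),
  scored.probability = c(0.32845226, 0.27319044, 0.10966039,
                         0.05599835, 0.10049072, 0.05515460)
)

# Assumed cutoff (hypothetical; not stated in the source data)
assumed_threshold <- 0.5

# Re-derive the predicted label from the probability
rederived <- as.integer(first_rows$scored.probability >= assumed_threshold)

# TRUE when every re-derived label equals the stored scored.class
labels_match <- all(rederived == first_rows$scored.class)
labels_match
```

With the full data, the same comparison could be run on df directly to see whether any rows disagree with the assumed cutoff.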

Loading and Inspecting the Data

Code
library(caret)
library(pROC)
Code
df <- read.csv("classification-output-data (2).csv")

head(df)
  pregnant glucose diastolic skinfold insulin  bmi pedigree age class
1        7     124        70       33     215 25.5    0.161  37     0
2        2     122        76       27     200 35.9    0.483  26     0
3        3     107        62       13      48 22.9    0.678  23     1
4        1      91        64       24       0 29.2    0.192  21     0
5        4      83        86       19       0 29.3    0.317  34     0
6        1     100        74       12      46 19.5    0.149  28     0
  scored.class scored.probability
1            0         0.32845226
2            0         0.27319044
3            0         0.10966039
4            0         0.05599835
5            0         0.10049072
6            0         0.05515460
Code
str(df)
'data.frame':   181 obs. of  11 variables:
 $ pregnant          : int  7 2 3 1 4 1 9 8 1 2 ...
 $ glucose           : int  124 122 107 91 83 100 89 120 79 123 ...
 $ diastolic         : int  70 76 62 64 86 74 62 78 60 48 ...
 $ skinfold          : int  33 27 13 24 19 12 0 0 42 32 ...
 $ insulin           : int  215 200 48 0 0 46 0 0 48 165 ...
 $ bmi               : num  25.5 35.9 22.9 29.2 29.3 19.5 22.5 25 43.5 42.1 ...
 $ pedigree          : num  0.161 0.483 0.678 0.192 0.317 0.149 0.142 0.409 0.678 0.52 ...
 $ age               : int  37 26 23 21 34 28 33 64 23 26 ...
 $ class             : int  0 0 1 0 0 0 0 0 0 0 ...
 $ scored.class      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ scored.probability: num  0.328 0.273 0.11 0.056 0.1 ...

Confusion Matrix

The confusion matrix is a summary of the classification model’s performance based on a comparison between the actual and predicted class labels. In the confusion matrix, the rows represent the actual values, and the columns represent the predicted values.

Code
cm <- table(Actual = df$class, Predicted = df$scored.class)
cm
      Predicted
Actual   0   1
     0 119   5
     1  30  27

The structure of the Confusion Matrix is as follows:

  • True Positives (TP): The actual class is 1, and the predicted class is also 1
  • True Negatives (TN): The actual class is 0, and the predicted class is also 0
  • False Positives (FP): The actual class is 0, but the predicted class is 1
  • False Negatives (FN): The actual class is 1, but the predicted class is 0

From the results obtained:

  • TP = 27
  • TN = 119
  • FP = 5
  • FN = 30

The confusion matrix shows that the model classifies negative cases well: 119 of the 124 actual negatives are predicted correctly. It performs poorly on positive cases, missing 30 of the 57 actual positives while catching only 27. This large number of false negatives directly lowers the sensitivity reported in the following sections.
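The mapping between table cells and the four counts can be made explicit by indexing the table with its row and column labels rather than by position. The sketch below rebuilds the table from the counts printed above so that it runs on its own; with the real data, the cm object from the table() call would be used directly.

```r
# Rebuild the confusion matrix from the printed counts
# (rows = actual class, columns = predicted class)
cm <- as.table(matrix(
  c(119, 30, 5, 27),
  nrow = 2,
  dimnames = list(Actual = c("0", "1"), Predicted = c("0", "1"))
))

# Index by label: first index is the actual class, second is the prediction
TP <- cm["1", "1"]  # actual 1, predicted 1
TN <- cm["0", "0"]  # actual 0, predicted 0
FP <- cm["0", "1"]  # actual 0, predicted 1
FN <- cm["1", "0"]  # actual 1, predicted 0

c(TP = TP, TN = TN, FP = FP, FN = FN)
```

Indexing by label avoids mistakes if the factor levels are ever reordered or the table is transposed, as it is in the caret output.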

Accuracy and Classification Error Rate

Accuracy and classification error rate are basic evaluation metrics used to assess the overall performance of a classification model.

  • Accuracy calculates the percentage of correct classifications made by a model: \[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

  • Classification Error Rate calculates the percentage of incorrect classifications made by a model: \[ \text{Error Rate} = \frac{FP + FN}{TP + TN + FP + FN} \]

Code
get_counts <- function(data) {
  actual <- data$class
  predicted <- data$scored.class
  
  TP <- sum(actual == 1 & predicted == 1)
  TN <- sum(actual == 0 & predicted == 0)
  FP <- sum(actual == 0 & predicted == 1)
  FN <- sum(actual == 1 & predicted == 0)
  
  list(TP = TP, TN = TN, FP = FP, FN = FN)
}

accuracy_fn <- function(data) {
  cts <- get_counts(data)
  (cts$TP + cts$TN) / (cts$TP + cts$TN + cts$FP + cts$FN)
}

error_rate_fn <- function(data) {
  cts <- get_counts(data)
  (cts$FP + cts$FN) / (cts$TP + cts$TN + cts$FP + cts$FN)
}

acc <- accuracy_fn(df)
err <- error_rate_fn(df)
sumaccerr <- acc + err

cat("Accuracy:", acc, "\n")
Accuracy: 0.8066298 
Code
cat("Error Rate:", err, "\n")
Error Rate: 0.1933702 
Code
cat("Sum of Rate:", sumaccerr, "\n")
Sum of Rate: 1 

The accuracy of the model is approximately 0.8066, meaning that about 80.66% of the observations are classified correctly. The classification error rate is approximately 0.1934, meaning that about 19.34% of the observations are classified incorrectly.

As expected, the accuracy and error rate sum up to 1.
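The sum is 1 by construction: every observation falls into exactly one of the four cells, so the two numerators together account for the whole sample. Written out with the definitions above and the counts from the confusion matrix:

```latex
\[
\text{Accuracy} + \text{Error Rate}
  = \frac{TP + TN}{TP + TN + FP + FN} + \frac{FP + FN}{TP + TN + FP + FN}
  = \frac{TP + TN + FP + FN}{TP + TN + FP + FN} = 1
\]
\[
\frac{27 + 119}{181} + \frac{5 + 30}{181}
  = \frac{146}{181} + \frac{35}{181} = 1
\]
```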

Precision, Sensitivity, Specificity, and F1 Score

Apart from accuracy, there are various other metrics to get a better understanding of the performance of a classification algorithm, particularly when it comes to evaluating how well a model is performing on positive and negative classes.

  • Precision is defined as: \[ \text{Precision} = \frac{TP}{TP + FP} \]

  • Sensitivity or Recall is defined as: \[ \text{Sensitivity} = \frac{TP}{TP + FN} \]

  • Specificity is defined as: \[ \text{Specificity} = \frac{TN}{TN + FP} \]

  • F1 Score is defined as: \[ F1 = \frac{2 \cdot \text{Precision} \cdot \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} \]

Code
precision_fn <- function(data) {
  cts <- get_counts(data)
  cts$TP / (cts$TP + cts$FP)
}

sensitivity_fn <- function(data) {
  cts <- get_counts(data)
  cts$TP / (cts$TP + cts$FN)
}

specificity_fn <- function(data) {
  cts <- get_counts(data)
  cts$TN / (cts$TN + cts$FP)
}

f1_fn <- function(data) {
  precision <- precision_fn(data)
  sensitivity <- sensitivity_fn(data)
  2 * precision * sensitivity / (precision + sensitivity)
}

prec <- precision_fn(df)
sens <- sensitivity_fn(df)
spec <- specificity_fn(df)
f1   <- f1_fn(df)

cat("Precision:", prec, "\n")
Precision: 0.84375 
Code
cat("Sensitivity:", sens, "\n")
Sensitivity: 0.4736842 
Code
cat("Specificity:", spec, "\n")
Specificity: 0.9596774 
Code
cat("F1:", f1, "\n")
F1: 0.6067416 

The precision of the model is approximately 0.8438: when the model predicts the positive class, it is correct 84.38% of the time. The sensitivity is approximately 0.4737, meaning that the model identifies only 47.37% of the actual positive cases.

The specificity is approximately 0.9597, meaning that the model correctly identifies 95.97% of the actual negative cases. The F1 score, the harmonic mean of precision and sensitivity, is approximately 0.6067; it is pulled down by the low sensitivity.

Overall, the model is very reliable on the negative class, as the high specificity shows, and its positive predictions are usually correct. Its low sensitivity, however, means that it misses more than half of the actual positives, which could be a serious weakness in applications where false negatives are costly.
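As a cross-check on these values, the F1 score can also be written directly in terms of the counts, F1 = 2TP / (2TP + FP + FN), which follows from substituting the precision and sensitivity formulas into the F1 definition. The sketch below uses the counts from the confusion matrix; the variable names are illustrative and are not part of the report’s functions.

```r
# Counts taken from the confusion matrix in this report
TP <- 27; TN <- 119; FP <- 5; FN <- 30

precision   <- TP / (TP + FP)   # 27 / 32
sensitivity <- TP / (TP + FN)   # 27 / 57
specificity <- TN / (TN + FP)   # 119 / 124

# F1 from the two rates (the definition used in this report)
f1_from_rates <- 2 * precision * sensitivity / (precision + sensitivity)

# F1 written directly in terms of counts: 54 / 89
f1_from_counts <- 2 * TP / (2 * TP + FP + FN)

round(c(precision = precision, sensitivity = sensitivity,
        specificity = specificity, f1 = f1_from_counts), 4)
```

Both F1 expressions give the same value, and the count form makes it clear that true negatives play no role in the F1 score.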

ROC Curve and AUC

The Receiver Operating Characteristic (ROC) curve shows how well a model separates the two classes across a range of probability thresholds. It plots the true positive rate against the false positive rate, with one point per threshold.

  • True Positive Rate (TPR): \[ \text{TPR} = \frac{TP}{TP + FN} \]

  • False Positive Rate (FPR): \[ \text{FPR} = \frac{FP}{FP + TN} \]

The Area Under the Curve (AUC) summarizes the ROC curve in a single number. An AUC of 1 indicates perfect separation of the classes, while an AUC close to 0.5 indicates performance no better than random guessing.
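Each threshold contributes one (FPR, TPR) point to the curve. As an illustration, the counts in the confusion matrix of this report give the point for the single threshold that produced scored.class. The sketch also checks that FPR equals 1 minus specificity; the variable names are illustrative.

```r
# Counts at the threshold used for scored.class (from the confusion matrix)
TP <- 27; TN <- 119; FP <- 5; FN <- 30

tpr <- TP / (TP + FN)           # true positive rate (sensitivity)
fpr <- FP / (FP + TN)           # false positive rate
specificity <- TN / (TN + FP)

# One point on the ROC curve
roc_point <- c(FPR = fpr, TPR = tpr)
round(roc_point, 4)

# FPR and specificity are complements
isTRUE(all.equal(fpr, 1 - specificity))
```

Lowering the threshold moves this point up and to the right (more positives caught, more false alarms); raising it moves the point toward the origin.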

Code
roc_auc_fn <- function(data) {
  actual <- data$class
  probs <- data$scored.probability
  
  thresholds <- seq(0, 1, by = 0.01)
  tpr <- numeric(length(thresholds))
  fpr <- numeric(length(thresholds))
  
  for (i in seq_along(thresholds)) {
    th <- thresholds[i]
    pred <- ifelse(probs >= th, 1, 0)
    
    TP <- sum(actual == 1 & pred == 1)
    TN <- sum(actual == 0 & pred == 0)
    FP <- sum(actual == 0 & pred == 1)
    FN <- sum(actual == 1 & pred == 0)
    
    tpr[i] <- ifelse((TP + FN) == 0, 0, TP / (TP + FN))
    fpr[i] <- ifelse((FP + TN) == 0, 0, FP / (FP + TN))
  }
  
  roc_df <- data.frame(threshold = thresholds, TPR = tpr, FPR = fpr)
  roc_df <- roc_df[order(roc_df$FPR, roc_df$TPR), ]
  
  auc <- sum(diff(roc_df$FPR) *
               (head(roc_df$TPR, -1) + tail(roc_df$TPR, -1)) / 2)
  
  plot(roc_df$FPR, roc_df$TPR, type = "l", lwd = 2,
       xlab = "False Positive Rate",
       ylab = "True Positive Rate",
       main = "ROC Curve")
  abline(0, 1, lty = 2, col = "gray")
  
  return(list(auc = auc, roc_data = roc_df))
}

roc_results <- roc_auc_fn(df)

Code
cat("Manual AUC:", round(roc_results$auc, 4), "\n")
Manual AUC: 0.8489 

The manually calculated AUC is approximately 0.8489, which indicates that the model distinguishes well between the positive and negative classes. The ROC curve lies well above the diagonal reference line, meaning that the classifier performs much better than random guessing.

Why the F1 Score Must Be Between 0 and 1

The F1 score is given by the equation: \[ F1 = \frac{2PR}{P + R} \]

Here P is precision and R is recall (sensitivity). Both are proportions, so both lie between 0 and 1: \[ 0 \leq P \leq 1, \quad 0 \leq R \leq 1 \]

Because both values are nonnegative, the numerator and the denominator of the F1 score are nonnegative, so the F1 score cannot be negative. When P and R are both 0 the ratio is undefined, and the F1 score is conventionally taken to be 0.

Because neither value exceeds 1, their product is no larger than either factor: \[ PR \leq P, \quad PR \leq R \]

Adding these two inequalities gives: \[ 2PR \leq P + R \]

Dividing both sides by (P + R), which is positive whenever the F1 score is defined, we get: \[ \frac{2PR}{P + R} \leq 1 \]

Combining the lower and upper bounds gives the range of the F1 score: \[ 0 \leq F1 \leq 1 \]

The F1 score therefore always lies between 0 and 1, and it reaches 1 only when both precision and recall equal 1.
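The key inequality 2PR ≤ P + R can also be checked with a short factorization, which additionally shows when equality holds. This is an alternative route to the same bound, using only the fact that P and R lie in [0, 1]:

```latex
\[
(P + R) - 2PR = P(1 - R) + R(1 - P) \geq 0
\]
\[
\text{since } 0 \leq P \leq 1 \text{ and } 0 \leq R \leq 1
\text{ make each term } P(1 - R) \text{ and } R(1 - P) \text{ nonnegative.}
\]
\[
\text{For } P + R > 0: \quad F1 = 1 \iff P(1 - R) + R(1 - P) = 0 \iff P = R = 1
\]
```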

Comparison with caret and pROC

To verify the accuracy of the manually computed metrics, the results can be compared with the results obtained using the caret and pROC packages. The caret package is used to verify the accuracy, sensitivity, and specificity computed using the confusion matrix, while the pROC package is used to compute the ROC curve and the AUC.

Code
df$class_factor <- factor(df$class, levels = c(0, 1))
df$pred_factor  <- factor(df$scored.class, levels = c(0, 1))

cm_caret <- caret::confusionMatrix(
  data = df$pred_factor,
  reference = df$class_factor,
  positive = "1"
)

cm_caret
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 119  30
         1   5  27
                                          
               Accuracy : 0.8066          
                 95% CI : (0.7415, 0.8615)
    No Information Rate : 0.6851          
    P-Value [Acc > NIR] : 0.0001712       
                                          
                  Kappa : 0.4916          
                                          
 Mcnemar's Test P-Value : 4.976e-05       
                                          
            Sensitivity : 0.4737          
            Specificity : 0.9597          
         Pos Pred Value : 0.8438          
         Neg Pred Value : 0.7987          
             Prevalence : 0.3149          
         Detection Rate : 0.1492          
   Detection Prevalence : 0.1768          
      Balanced Accuracy : 0.7167          
                                          
       'Positive' Class : 1               
                                          
Code
sens_caret <- caret::sensitivity(
  data = df$pred_factor,
  reference = df$class_factor,
  positive = "1"
)

spec_caret <- caret::specificity(
  data = df$pred_factor,
  reference = df$class_factor,
  negative = "0"
)

cat("Sensitivity:", round(sens_caret, 4), "\n")
Sensitivity: 0.4737 
Code
cat("Specificity:", round(spec_caret, 4), "\n")
Specificity: 0.9597 
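Several of the additional statistics in the caret output can be reproduced from the four counts in base R, which gives a further check that does not depend on the package. The formulas below are the standard definitions (Cohen’s kappa compares observed agreement with chance agreement); the variable names are illustrative.

```r
# Counts from the confusion matrix in this report
TP <- 27; TN <- 119; FP <- 5; FN <- 30
n  <- TP + TN + FP + FN

npv                  <- TN / (TN + FN)      # Neg Pred Value
prevalence           <- (TP + FN) / n       # share of actual positives
detection_rate       <- TP / n
detection_prevalence <- (TP + FP) / n       # share of predicted positives
balanced_accuracy    <- (TP / (TP + FN) + TN / (TN + FP)) / 2

# No Information Rate: accuracy of always predicting the larger class
no_information_rate <- max(prevalence, 1 - prevalence)

# Cohen's kappa: agreement beyond what the marginal totals would give by chance
p_observed <- (TP + TN) / n
p_expected <- ((TP + FP) * (TP + FN) + (TN + FN) * (TN + FP)) / n^2
kappa_stat <- (p_observed - p_expected) / (1 - p_expected)

round(c(NPV = npv, Prevalence = prevalence,
        DetectionRate = detection_rate,
        DetectionPrevalence = detection_prevalence,
        BalancedAccuracy = balanced_accuracy,
        NIR = no_information_rate, Kappa = kappa_stat), 4)
```

The rounded values can be compared line by line with the corresponding rows of the confusionMatrix output above.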
Code
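# Explicit levels/direction: 0 = control, 1 = case, higher score = case
roc_obj <- pROC::roc(df$class, df$scored.probability,
                     levels = c(0, 1), direction = "<")
plot(roc_obj, main = "ROC Curve using pROC")
# Chance line on pROC's specificity axis: sensitivity = 1 - specificity
abline(a = 1, b = -1, lty = 2, col = "gray")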

Code
auc_val <- pROC::auc(roc_obj)
cat("AUC:", round(as.numeric(auc_val), 4), "\n")
AUC: 0.8503 

The pROC package plots sensitivity against specificity, with the specificity axis running from 1 down to 0. This is equivalent to the manual plot, because the false positive rate on the manual plot’s x-axis equals 1 minus specificity; the two curves therefore have the same shape.

The ROC curve is still well above the diagonal line, showing good discriminative ability for the model. The AUC value is close to 0.85, showing that the model is effective in ranking positive instances higher than negative instances.
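The ranking interpretation can be made concrete: the AUC equals the proportion of (positive, negative) pairs in which the positive instance receives the higher score, with ties counted as one half. The sketch below is a minimal illustration on a hypothetical six-row toy dataset; the function name auc_rank and the data are assumptions for illustration, not part of the report or of pROC.

```r
# Pairwise (rank-based) AUC: share of positive/negative pairs ranked correctly
auc_rank <- function(actual, probs) {
  pos <- probs[actual == 1]
  neg <- probs[actual == 0]
  # 1 if the positive scores higher, 0.5 for a tie, 0 otherwise
  wins <- outer(pos, neg, FUN = function(p, n) (p > n) + 0.5 * (p == n))
  mean(wins)
}

# Hypothetical toy data (not from the assignment dataset)
toy_actual <- c(0, 0, 1, 0, 1, 1)
toy_probs  <- c(0.10, 0.40, 0.35, 0.80, 0.65, 0.90)

toy_auc <- auc_rank(toy_actual, toy_probs)
toy_auc  # 6 of the 9 pairs are ranked correctly
```

Because it compares every positive with every negative, this approach is quadratic in the sample size and is meant only to illustrate the interpretation.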

The pROC AUC of 0.8503 is close to the manually calculated value of 0.8489. The small difference most likely arises because the manual function evaluates only a fixed grid of 101 thresholds (0 to 1 in steps of 0.01), whereas pROC derives its thresholds from every distinct predicted probability and therefore traces the curve exactly.
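To illustrate the effect of the threshold choice, the sketch below builds the ROC points from one threshold per distinct score (plus Inf, so that the curve starts at the origin) instead of a fixed grid, and then applies the same trapezoidal rule as the manual function. The function names roc_points_exact and trapezoid_auc and the six-row toy data are hypothetical, and this is a simplified illustration rather than pROC’s actual implementation.

```r
# ROC points using every distinct score as a threshold
roc_points_exact <- function(actual, probs) {
  thresholds <- c(Inf, sort(unique(probs), decreasing = TRUE))
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  tpr <- sapply(thresholds, function(t) sum(probs >= t & actual == 1) / n_pos)
  fpr <- sapply(thresholds, function(t) sum(probs >= t & actual == 0) / n_neg)
  data.frame(threshold = thresholds, FPR = fpr, TPR = tpr)
}

# Trapezoidal rule over points already ordered by increasing FPR
trapezoid_auc <- function(fpr, tpr) {
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
}

# Hypothetical toy data (not from the assignment dataset)
toy_actual <- c(0, 0, 1, 0, 1, 1)
toy_probs  <- c(0.10, 0.40, 0.35, 0.80, 0.65, 0.90)

roc_pts   <- roc_points_exact(toy_actual, toy_probs)
auc_exact <- trapezoid_auc(roc_pts$FPR, roc_pts$TPR)
auc_exact
```

Applied to df$class and df$scored.probability, the same two functions would be expected to give a value closer to the pROC result than the 0.01 grid does, though that has not been run here.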

Conclusion

In this analysis, the performance of a binary classification model was evaluated based on a range of metrics, which are based on the confusion matrix, as well as threshold-based evaluation, which used the ROC curve and AUC.

The results show that the model performs reasonably well overall, with an accuracy of approximately 0.81 and a specificity of approximately 0.96, which means it is highly effective at correctly classifying negative cases. However, the low sensitivity (approximately 0.47) means that the model misses more than half of the positive cases, which may be a concern depending on the context in which the model is used.

The precision (approximately 0.84) and the F1 score (approximately 0.61) tell the same story: the model’s positive predictions are usually correct, but it detects too few of the actual positives. The AUC values obtained manually and with the pROC package, both close to 0.85, indicate good discriminatory ability across thresholds.

The caret package reproduced the manually computed accuracy, sensitivity, specificity, and precision (reported by caret as Pos Pred Value) to four decimal places, and the pROC AUC (0.8503) differed from the manual AUC (0.8489) by only about 0.0014, which validates the custom implementations. Overall, the model performs well in some respects but has clear room for improvement, especially in sensitivity.