library(dplyr)
library(ggplot2)
library(pROC)
library(kableExtra)

Introduction

As part of this homework assignment, we have been given a dataset called classification-output-data.csv, which contains a set of independent variables (features) and an actual class, along with a predictive classification model's scored probability and the scored class derived from that probability. We will use the following three key columns to derive several key classification model metrics:

- class: the actual class for the observation
- scored.class: the predicted class for the observation (based on a threshold of 0.5)
- scored.probability: the predicted probability of success for the observation

1. Download the classification output data set

Loading the dataset into R:

hw_dataset <- read.csv("https://raw.githubusercontent.com/deepakmongia/Data621/master/HW-2/Data/classification-output-data.csv",
                       header = TRUE)
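
Before computing any metrics, it helps to take a quick look at the three key columns. A minimal sketch, re-using the column names listed in the introduction:

### Quick look at the three key columns
head(hw_dataset[, c("class", "scored.class", "scored.probability")])
nrow(hw_dataset)   # 181 observations in total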

2. Understanding the dataset:

Use the table() function to get the raw confusion matrix for this scored dataset. Make sure you understand the output. In particular, do the rows represent the actual or predicted class? The columns?

hw_dataset$class <- as.factor(hw_dataset$class)
hw_dataset$scored.class <- as.factor(hw_dataset$scored.class)

summary(hw_dataset)
##     pregnant         glucose        diastolic        skinfold   
##  Min.   : 0.000   Min.   : 57.0   Min.   : 38.0   Min.   : 0.0  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 64.0   1st Qu.: 0.0  
##  Median : 3.000   Median :112.0   Median : 70.0   Median :22.0  
##  Mean   : 3.862   Mean   :118.3   Mean   : 71.7   Mean   :19.8  
##  3rd Qu.: 6.000   3rd Qu.:136.0   3rd Qu.: 78.0   3rd Qu.:32.0  
##  Max.   :15.000   Max.   :197.0   Max.   :104.0   Max.   :54.0  
##     insulin            bmi           pedigree           age        class  
##  Min.   :  0.00   Min.   :19.40   Min.   :0.0850   Min.   :21.00   0:124  
##  1st Qu.:  0.00   1st Qu.:26.30   1st Qu.:0.2570   1st Qu.:24.00   1: 57  
##  Median :  0.00   Median :31.60   Median :0.3910   Median :30.00          
##  Mean   : 63.77   Mean   :31.58   Mean   :0.4496   Mean   :33.31          
##  3rd Qu.:105.00   3rd Qu.:36.00   3rd Qu.:0.5800   3rd Qu.:41.00          
##  Max.   :543.00   Max.   :50.00   Max.   :2.2880   Max.   :67.00          
##  scored.class scored.probability
##  0:149        Min.   :0.02323   
##  1: 32        1st Qu.:0.11702   
##               Median :0.23999   
##               Mean   :0.30373   
##               3rd Qu.:0.43093   
##               Max.   :0.94633
confusion_matrix <- table(hw_dataset$scored.class, hw_dataset$class)
confusion_matrix %>% kable()
      0    1
 0  119   30
 1    5   27

As seen above, there are two class columns: class, the actual class for each observation, and scored.class, the class predicted by the classification model. The model's scored probability is also provided. In the confusion matrix above, the rows represent the predicted class (scored.class) and the columns represent the actual class (class), because scored.class was passed to table() first. We will now work with these three columns to compute a few important metrics that describe how the model performed.

3. Calculating the accuracy:

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the accuracy of the predictions.

\(Accuracy\quad =\quad \frac { TP+TN }{ TP+FP+TN+FN }\)

Accuracy_func <- function(input_df){
  True_positive_plus_negative <- sum(input_df$class == input_df$scored.class)
  False_positive_plus_negative <- sum(input_df$class != input_df$scored.class)
  
  Accuracy <- True_positive_plus_negative / (True_positive_plus_negative + False_positive_plus_negative)
  return(Accuracy)
}
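
As a quick sanity check, the accuracy can also be read directly off the confusion matrix built in step 2 (rows are predicted, columns are actual). A small sketch re-using the confusion_matrix object from above:

### Accuracy directly from the confusion matrix
TP <- confusion_matrix["1", "1"]   # 27 true positives
TN <- confusion_matrix["0", "0"]   # 119 true negatives
(TP + TN) / sum(confusion_matrix)  # 146 / 181 = 0.8066298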

4. Calculating the classification error rate:

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the classification error rate of the predictions.

\(Classification\quad Error\quad Rate\quad =\quad \frac { FP+FN }{ TP+FP+TN+FN }\)

Class_error_rt_func <- function(input_df){
  True_positive_plus_negative <- sum(input_df$class == input_df$scored.class)
  False_positive_plus_negative <- sum(input_df$class != input_df$scored.class)
  
  error_rate <- False_positive_plus_negative / (True_positive_plus_negative + False_positive_plus_negative)
  return(error_rate)
}
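
Because every observation is either classified correctly or incorrectly, the error rate is simply the complement of the accuracy. A quick cross-check against the confusion matrix and the accuracy function defined above:

### Classification error rate from the confusion matrix
FP <- confusion_matrix["1", "0"]   # 5 false positives
FN <- confusion_matrix["0", "1"]   # 30 false negatives
(FP + FN) / sum(confusion_matrix)  # 35 / 181 = 0.1933702
1 - Accuracy_func(hw_dataset)      # same value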

Verify that you get an accuracy and an error rate that sum to one.

Accuracy_func(hw_dataset) + Class_error_rt_func(hw_dataset)
## [1] 1

5. Calculating Precision:

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the precision of the predictions.

\(Precision\quad =\quad \frac { TP }{ TP+FP }\)

Precision_func <- function(input_df) {
  True_positive <- sum(input_df$class == 1 & input_df$scored.class == 1)
  False_positive <- sum(input_df$class == 0 & input_df$scored.class == 1)
  
  Precision_val <- True_positive / (True_positive + False_positive)
  
  return(Precision_val)
}
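
Precision only considers the observations the model flagged as positive, i.e. the second row of the confusion matrix (TP = 27, FP = 5). A quick hand check:

### Precision from the confusion matrix: TP / (TP + FP)
27 / (27 + 5)   # = 0.84375, matching Precision_func(hw_dataset)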

6. Calculating Sensitivity:

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the sensitivity of the predictions. Sensitivity is also known as recall.

\(Sensitivity\quad =\quad \frac { TP }{ TP+FN }\)

Sensitivity_func <- function(input_df) {
  True_positive <- sum(input_df$class == 1 & input_df$scored.class == 1)
  False_negative <- sum(input_df$class == 1 & input_df$scored.class == 0)
  
  Sensitivity_val <- True_positive / (True_positive + False_negative)
  
  return(Sensitivity_val)
}
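
Sensitivity conditions on the actual positives (class == 1), so it can also be computed as a simple proportion over that subset. A minimal sketch on our dataset:

### Sensitivity as the share of actual positives predicted as positive
mean(hw_dataset$scored.class[hw_dataset$class == 1] == 1)   # 27 / 57 = 0.4736842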

7. Calculating Specificity:

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the specificity of the predictions.

\(Specificity\quad =\quad \frac { TN }{ TN+FP }\)

Specificity_func <- function(input_df) {
  True_negative <- sum(input_df$class == 0 & input_df$scored.class == 0)
  False_positive <- sum(input_df$class == 0 & input_df$scored.class == 1)
  
  Specificity_val <- True_negative / (True_negative + False_positive)
  
  return(Specificity_val)
}
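
Similarly, specificity conditions on the actual negatives (class == 0), which correspond to the first column of the confusion matrix (TN = 119, FP = 5). A quick hand check:

### Specificity from the confusion matrix: TN / (TN + FP)
119 / (119 + 5)   # = 0.9596774, matching Specificity_func(hw_dataset)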

8. Calculating F1 score:

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the F1 score of the predictions.

\(F1\quad Score\quad =\quad \frac { 2*Precision*Sensitivity }{ Precision\quad +\quad Sensitivity }\)

f1score_func <- function(input_df) {

  Precision_val <- Precision_func(input_df)
  Sensitivity_val <- Sensitivity_func(input_df)
  
  f1score_val <- ( 2 * Precision_val * Sensitivity_val ) / (Precision_val + Sensitivity_val)
  return(f1score_val)
  
}
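
Using the counts from the confusion matrix, the F1 score can also be computed directly from the simplified form \(\frac { 2TP }{ 2TP+FP+FN }\) derived in the next question. A quick check:

### F1 score from counts: 2TP / (2TP + FP + FN)
(2 * 27) / (2 * 27 + 5 + 30)   # 54 / 89 = 0.6067416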

9. Bounds of F1 Score

Before we move on, let’s consider a question that was asked: What are the bounds on the F1 score? Show that the F1 score will always be between 0 and 1. (Hint: If \(0<a<1\) and \(0<b<1\) then \(ab<a\).)

Expanding the definition of the F1 score:

\(F1\quad Score\quad =\quad \frac { 2*Precision*Sensitivity }{ Precision\quad +\quad Sensitivity } \\ \\ \quad \quad \quad \quad \quad \quad \quad \quad =\quad \frac { 2*\frac { TP }{ TP+FP } *\frac { TP }{ TP+FN } }{ \frac { TP }{ TP+FP } +\frac { TP }{ TP+FN } } \\ \quad \quad \quad \quad \quad \quad \quad \quad =\quad \frac { \frac { 2*{ TP }^{ 2 } }{ (TP+FP)(TP+FN) } }{ \frac { TP(2TP+FP+FN) }{ (TP+FP)(TP+FN) } } \\ \quad \quad \quad \quad \quad \quad \quad \quad =\quad \frac { 2TP }{ 2TP+FP+FN }\)

In the equation above, TP, FP and FN are all non-negative integers: \(TP,\ FP,\ FN\quad \in \quad N\quad where\quad N\quad =\quad \{ 0,1,2,3,...\}\).

Mathematically, \(2TP\ \le\ 2TP+FP+FN\), so the numerator is at most equal to the denominator, and hence F1 Score \(\le\) 1.

The fraction \(\frac { 2TP }{ 2TP+FP+FN }\) has a minimum value of zero when TP = 0 (and FP + FN > 0), and a maximum value of 1 when FP = FN = 0 (and TP > 0).

Hence it can be concluded that \(0\quad \le \quad F1\quad \le \quad 1\).
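
As an informal numerical check of these bounds, we can evaluate the simplified F1 expression over a grid of random non-negative counts and confirm that every defined value falls between 0 and 1. A small sketch using arbitrary simulated counts (not from our dataset):

### Numerical spot-check that 0 <= F1 <= 1 for non-negative counts
set.seed(42)
tp <- sample(0:100, 1000, replace = TRUE)
fp <- sample(0:100, 1000, replace = TRUE)
fn <- sample(0:100, 1000, replace = TRUE)
f1 <- (2 * tp) / (2 * tp + fp + fn)
range(f1, na.rm = TRUE)   # na.rm drops the undefined 0/0 case when tp = fp = fn = 0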

10. Building the ROC curve:

Write a function that generates an ROC curve from a data set with a true classification column (class in our example) and a probability column (scored.probability in our example). Your function should return a list that includes the plot of the ROC curve and a vector that contains the calculated area under the curve (AUC). Note that I recommend using a sequence of thresholds ranging from 0 to 1 at 0.01 intervals.

ROC_func <- function(input_df) {
  roc_df <- data.frame()
  
  # Sweep the classification threshold from 0 to 1 in steps of 0.01, re-score
  # the predicted class at each threshold, and record the resulting false
  # positive rate (1 - specificity) and true positive rate (sensitivity)
  for (threshold in seq(0, 1, by = 0.01))
  {
    input_df_int <- data.frame(input_df)
    input_df_int$scored.class <- ifelse(input_df_int$scored.probability > threshold, 1, 0)
    roc_df <- rbind(roc_df, data.frame(1 - Specificity_func(input_df_int),
                                       Sensitivity_func(input_df_int)))
  }
  
  colnames(roc_df) <- c("FPR", "TPR")
  
  # ROC curve: sensitivity (TPR) plotted against 1 - specificity (FPR)
  roc_curve <- ggplot(data = roc_df, aes(x = FPR, y = TPR)) + geom_point() +
                geom_path()
  
  # Order the points by FPR (then TPR) so the area can be accumulated left to right
  roc_df2 <- roc_df[order(roc_df$FPR, roc_df$TPR), ]
  
  # Trapezoidal rule: each segment contributes a rectangle (TPR * dFPR) plus a
  # triangular correction (dTPR * dFPR / 2)
  simple_auc <- function(TPR, FPR){
    dFPR <- c(diff(FPR), 0)
    dTPR <- c(diff(TPR), 0)
    sum(TPR * dFPR) + sum(dTPR * dFPR)/2
  }

  auc_value <- with(roc_df2, simple_auc(TPR, FPR))

  # Return the plot and the AUC value as a list
  new_list <- list(roc_curve, auc_value)
    
  return(new_list)
}

Plotting the ROC curve:

ROC_func_response <- ROC_func(hw_dataset)
ROC_func_response[1]
## [[1]]

AUC Value:

ROC_func_response[2]
## [[1]]
## [1] 0.8488964

11. Producing all the metric values from the functions defined above:

Use your created R functions and the provided classification output data set to produce all of the classification metrics discussed above.

### Accuracy
print(Accuracy_func(hw_dataset))
## [1] 0.8066298
### Classification Error rate
print(Class_error_rt_func(hw_dataset))
## [1] 0.1933702
### Precision
print(Precision_func(hw_dataset))
## [1] 0.84375
### Sensitivity
sensitivity_val <- Sensitivity_func(hw_dataset)
print(sensitivity_val)
## [1] 0.4736842
### Specificity
specificity_val <- Specificity_func(hw_dataset)
print(specificity_val)
## [1] 0.9596774
### F1 score
print(f1score_func(hw_dataset))
## [1] 0.6067416

12. Comparing our metric results with the metrics from the "caret" library.

We will now use the "caret" library to produce the confusion matrix and to compute the sensitivity and specificity. These values should match the outputs of our own functions.

library(caret)
## Warning: package 'caret' was built under R version 3.6.2
## Loading required package: lattice
confusion_matrix_caret <- confusionMatrix(hw_dataset$scored.class, hw_dataset$class, positive = '1')

## Confusion matrix from caret
print(confusion_matrix_caret)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 119  30
##          1   5  27
##                                           
##                Accuracy : 0.8066          
##                  95% CI : (0.7415, 0.8615)
##     No Information Rate : 0.6851          
##     P-Value [Acc > NIR] : 0.0001712       
##                                           
##                   Kappa : 0.4916          
##                                           
##  Mcnemar's Test P-Value : 4.976e-05       
##                                           
##             Sensitivity : 0.4737          
##             Specificity : 0.9597          
##          Pos Pred Value : 0.8438          
##          Neg Pred Value : 0.7987          
##              Prevalence : 0.3149          
##          Detection Rate : 0.1492          
##    Detection Prevalence : 0.1768          
##       Balanced Accuracy : 0.7167          
##                                           
##        'Positive' Class : 1               
## 
## Confusion matrix from table function
print(confusion_matrix)
##    
##       0   1
##   0 119  30
##   1   5  27

Getting the sensitivity from caret and then comparing with our earlier calculated value:

sensitivity_caret <- sensitivity(hw_dataset$scored.class, hw_dataset$class, positive = '1')

### sensitivity from caret
print(sensitivity_caret)
## [1] 0.4736842
### sensitivity from our function
print(sensitivity_val)
## [1] 0.4736842

Getting the specificity from caret and then comparing with our earlier calculated value:

specificity_caret <- specificity(hw_dataset$scored.class, hw_dataset$class, negative = '0')

### specificity from caret
print(specificity_caret)
## [1] 0.9596774
### specificity from our function
print(specificity_val)
## [1] 0.9596774

13. ROC curve from the pROC library

roc_curve_proc <- roc(as.numeric(hw_dataset$class),
                      as.numeric(hw_dataset$scored.probability))
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
ggroc(roc_curve_proc)

print(roc_curve_proc$auc)
## Area under the curve: 0.8503
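
The hand-rolled trapezoidal AUC and the AUC reported by pROC can be compared directly. A small sketch using the objects created above:

### Difference between the pROC AUC and our trapezoidal AUC
as.numeric(roc_curve_proc$auc) - ROC_func_response[[2]]   # roughly 0.8503 - 0.8489 = 0.0014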

Conclusion:

The sensitivity and specificity produced by the caret library match the values from our own functions, and the ROC curve and AUC from our hand-built ROC_func (0.8489) are very close to those produced by the pROC library (0.8503).