library(kableExtra)
library(ggplot2)
library(dplyr)
library(MASS)
library(pROC)
library(caret)
As part of this homework assignment, we have been given a dataset called classification-output-data.csv, which contains a set of independent variables (features), an observed class attribute, and the output of a classification model: a scored class probability and a scored class derived from that probability. We will use three key columns to derive the classification model metrics below: class (the actual class), scored.class (the predicted class), and scored.probability (the model's predicted probability for the positive class).
Loading the dataset into R:
hw_dataset <- read.csv("https://raw.githubusercontent.com/deepakmongia/Data621/master/HW-2/Data/classification-output-data.csv",header = TRUE)
Sample snapshot of data set:
head(hw_dataset, 20) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
pregnant | glucose | diastolic | skinfold | insulin | bmi | pedigree | age | class | scored.class | scored.probability |
---|---|---|---|---|---|---|---|---|---|---|
7 | 124 | 70 | 33 | 215 | 25.5 | 0.161 | 37 | 0 | 0 | 0.3284523 |
2 | 122 | 76 | 27 | 200 | 35.9 | 0.483 | 26 | 0 | 0 | 0.2731904 |
3 | 107 | 62 | 13 | 48 | 22.9 | 0.678 | 23 | 1 | 0 | 0.1096604 |
1 | 91 | 64 | 24 | 0 | 29.2 | 0.192 | 21 | 0 | 0 | 0.0559984 |
4 | 83 | 86 | 19 | 0 | 29.3 | 0.317 | 34 | 0 | 0 | 0.1004907 |
1 | 100 | 74 | 12 | 46 | 19.5 | 0.149 | 28 | 0 | 0 | 0.0551546 |
9 | 89 | 62 | 0 | 0 | 22.5 | 0.142 | 33 | 0 | 0 | 0.1071154 |
8 | 120 | 78 | 0 | 0 | 25.0 | 0.409 | 64 | 0 | 0 | 0.4599474 |
1 | 79 | 60 | 42 | 48 | 43.5 | 0.678 | 23 | 0 | 0 | 0.1170237 |
2 | 123 | 48 | 32 | 165 | 42.1 | 0.520 | 26 | 0 | 0 | 0.3153632 |
5 | 88 | 78 | 30 | 0 | 27.6 | 0.258 | 37 | 0 | 0 | 0.1251892 |
5 | 108 | 72 | 43 | 75 | 36.1 | 0.263 | 33 | 0 | 0 | 0.2706248 |
13 | 76 | 60 | 0 | 0 | 32.8 | 0.180 | 41 | 0 | 0 | 0.2098096 |
0 | 100 | 70 | 26 | 50 | 30.8 | 0.597 | 21 | 0 | 0 | 0.0935859 |
7 | 194 | 68 | 28 | 0 | 35.9 | 0.745 | 41 | 1 | 1 | 0.8848457 |
12 | 92 | 62 | 7 | 258 | 27.6 | 0.926 | 44 | 1 | 0 | 0.3966522 |
0 | 173 | 78 | 32 | 265 | 46.5 | 1.159 | 58 | 0 | 1 | 0.8913949 |
3 | 171 | 72 | 33 | 135 | 33.3 | 0.199 | 24 | 1 | 1 | 0.5345490 |
8 | 196 | 76 | 29 | 280 | 37.5 | 0.605 | 57 | 1 | 1 | 0.9463342 |
5 | 99 | 74 | 27 | 0 | 29.0 | 0.203 | 32 | 0 | 0 | 0.1449162 |
Statistical Summary:
summary(hw_dataset)
## pregnant glucose diastolic skinfold
## Min. : 0.000 Min. : 57.0 Min. : 38.0 Min. : 0.0
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 64.0 1st Qu.: 0.0
## Median : 3.000 Median :112.0 Median : 70.0 Median :22.0
## Mean : 3.862 Mean :118.3 Mean : 71.7 Mean :19.8
## 3rd Qu.: 6.000 3rd Qu.:136.0 3rd Qu.: 78.0 3rd Qu.:32.0
## Max. :15.000 Max. :197.0 Max. :104.0 Max. :54.0
## insulin bmi pedigree age
## Min. : 0.00 Min. :19.40 Min. :0.0850 Min. :21.00
## 1st Qu.: 0.00 1st Qu.:26.30 1st Qu.:0.2570 1st Qu.:24.00
## Median : 0.00 Median :31.60 Median :0.3910 Median :30.00
## Mean : 63.77 Mean :31.58 Mean :0.4496 Mean :33.31
## 3rd Qu.:105.00 3rd Qu.:36.00 3rd Qu.:0.5800 3rd Qu.:41.00
## Max. :543.00 Max. :50.00 Max. :2.2880 Max. :67.00
## class scored.class scored.probability
## Min. :0.0000 Min. :0.0000 Min. :0.02323
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.11702
## Median :0.0000 Median :0.0000 Median :0.23999
## Mean :0.3149 Mean :0.1768 Mean :0.30373
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.43093
## Max. :1.0000 Max. :1.0000 Max. :0.94633
Use the table() function to get the raw confusion matrix for this scored dataset. Make sure you understand the output. In particular, do the rows represent the actual or predicted class? The columns?
# Subset original data set
classdata <- hw_dataset %>% dplyr::select(scored.class,class)
# Raw Confusion Matrix
cmat <- table(classdata)
cmat
## class
## scored.class 0 1
## 0 119 30
## 1 5 27
We have reformatted the raw confusion matrix above to define Event and Non-Event. Here we have coded '1' values in the raw data set as "Event" (i.e., the positive or True class) and '0' values as "Non-Event" (i.e., the negative or False class).
cmat <- cmat[order(rownames(cmat),decreasing = T), order(colnames(cmat),decreasing = T)]
rownames(cmat) <- c("Event","Non-Event")
colnames(cmat) <- c("Event","Non-Event")
cmat %>% kable %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="50%",height="200px")
 | Event | Non-Event |
---|---|---|
Event | 27 | 5 |
Non-Event | 30 | 119 |
The rows in the confusion matrix represent the classes predicted by the classification model ("Event" and "Non-Event"), and the columns represent the actual classes observed in the data set.
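One way to make this orientation explicit is to name the table dimensions when building it. A minimal sketch using the same two columns (the dimension labels are added only for illustration):

# Label the dimensions so the rows read as predicted and the columns as actual
table(Predicted = classdata$scored.class, Actual = classdata$class)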
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the accuracy of the predictions.
\(Accuracy\quad =\quad \frac { TP+TN }{ TP+FP+TN+FN }\)
Accuracy_func <- function(input_df){
True_positive_plus_negative <- sum(input_df$class == input_df$scored.class)
False_positive_plus_negative <- sum(input_df$class != input_df$scored.class)
Accuracy <- True_positive_plus_negative / (True_positive_plus_negative + False_positive_plus_negative)
return(Accuracy)
}
Accuracy_func (hw_dataset)
## [1] 0.8066298
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the classification error rate of the predictions.
\(Classification\quad Error\quad Rate\quad =\quad \frac { FP+FN }{ TP+FP+TN+FN }\)
Class_error_rt_func <- function(input_df){
True_positive_plus_negative <- sum(input_df$class == input_df$scored.class)
False_positive_plus_negative <- sum(input_df$class != input_df$scored.class)
error_rate <- False_positive_plus_negative / (True_positive_plus_negative + False_positive_plus_negative)
return(error_rate)
}
Class_error_rt_func (hw_dataset)
## [1] 0.1933702
Verify that you get an accuracy and an error rate that sums to one.
Accuracy_func(hw_dataset) + Class_error_rt_func(hw_dataset)
## [1] 1
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the precision of the predictions.
\(Precision\quad =\quad \frac { TP }{ TP+FP }\)
Precision_func <- function(input_df) {
True_positive <- sum(input_df$class == 1 & input_df$scored.class == 1)
False_positive <- sum(input_df$class == 0 & input_df$scored.class == 1)
Precision_val <- True_positive / (True_positive + False_positive)
return(Precision_val)
}
Precision_func (hw_dataset)
## [1] 0.84375
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the sensitivity of the predictions. Sensitivity is also known as recall.
\(Sensitivity\quad =\quad \frac { TP }{ TP+FN }\)
Sensitivity_func <- function(input_df) {
True_positive <- sum(input_df$class == 1 & input_df$scored.class == 1)
False_negative <- sum(input_df$class == 1 & input_df$scored.class == 0)
Sensitivity_val <- True_positive / (True_positive + False_negative)
return(Sensitivity_val)
}
Sensitivity_func (hw_dataset)
## [1] 0.4736842
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the specificity of the predictions.
\(Specificity\quad =\quad \frac { TN }{ TN+FP }\)
Specificity_func <- function(input_df) {
True_negative <- sum(input_df$class == 0 & input_df$scored.class == 0)
False_positive <- sum(input_df$class == 0 & input_df$scored.class == 1)
Specificity_val <- True_negative / (True_negative + False_positive)
return(Specificity_val)
}
Specificity_func (hw_dataset)
## [1] 0.9596774
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the F1 score of the predictions.
\(F1\quad Score\quad =\quad \frac { 2*Precision*Sensitivity }{ Precision\quad +\quad Sensitivity }\)
f1score_func <- function(input_df) {
Precision_val <- Precision_func(input_df)
Sensitivity_val <- Sensitivity_func(input_df)
f1score_val <- ( 2 * Precision_val * Sensitivity_val ) / (Precision_val + Sensitivity_val)
return(f1score_val)
}
f1score_func (hw_dataset)
## [1] 0.6067416
Before we move on, let’s consider a question that was asked: What are the bounds on the F1 score? Show that the F1 score will always be between 0 and 1. (Hint: If \(0<a<1\) and \(0<b<1\) then \(ab<a\).)
Expanding the definition of F1 Score as below -
\(F1\quad Score\quad =\quad \frac { 2*Precision*Sensitivity }{ Precision\quad +\quad Sensitivity } \\ \\ \quad \quad \quad \quad \quad \quad \quad \quad =\quad \frac { 2*\frac { TP }{ TP+FP } *\frac { TP }{ TP+FN } }{ \frac { TP }{ TP+FP } +\frac { TP }{ TP+FN } } \\ \quad \quad \quad \quad \quad \quad \quad \quad =\quad \frac { \frac { 2*{ TP }^{ 2 } }{ (TP+FP)(TP+FN) } }{ \frac { TP(2TP+FP+FN) }{ (TP+FP)(TP+FN) } } \\ \quad \quad \quad \quad \quad \quad \quad \quad =\quad \frac { 2TP }{ 2TP+FP+FN }\)
In the equation above, TP, FP and FN are all non-negative integers: \(TP,FP,FN\quad \in \quad N\quad where\quad N\quad =\quad \{ 0,1,2,3,...\}\)
Mathematically, \(2TP\ \le\ 2TP+FP+FN\), so the numerator is at most equal to the denominator, and therefore F1 Score \(\le\) 1.
The fraction \(\frac { 2TP }{ 2TP+FP+FN }\) attains its minimum value of zero when TP = 0 (with FP + FN > 0, so the ratio is defined) and its maximum value of 1 when FP = FN = 0.
Hence it can be concluded that \(0\quad \le \quad F1\quad \le \quad 1\).
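As a quick numerical illustration of these bounds (a sketch with simulated counts, not taken from the assignment data), we can draw arbitrary non-negative values for TP, FP and FN and confirm that the resulting F1 scores never leave the interval (0, 1]:

# Simulate arbitrary confusion-matrix counts (TP > 0 so the score is defined)
set.seed(123)  # hypothetical seed, used only for reproducibility
tp <- sample(1:100, 1000, replace = TRUE)
fp <- sample(0:100, 1000, replace = TRUE)
fn <- sample(0:100, 1000, replace = TRUE)
f1 <- (2 * tp) / (2 * tp + fp + fn)
range(f1)  # every simulated F1 score lies in (0, 1]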
Write a function that generates an ROC curve from a data set with a true classification column (class in our example) and a probability column (scored.probability in our example). Your function should return a list that includes the plot of the ROC curve and a vector that contains the calculated area under the curve (AUC). Note that I recommend using a sequence of thresholds ranging from 0 to 1 at 0.01 intervals.
ROC_func <- function(input_df) {
roc_df <- data.frame(matrix(ncol = 3, nrow = 0))
names(roc_df) <- c("Probability","TPR","FPR")
input_df_int <- data.frame(input_df)
for (threshold in seq(0, 1, by = 0.01))
{
input_df_int$scored.class <- ifelse(input_df_int$scored.probability > threshold, 1, 0)
tpr <- Sensitivity_func(input_df_int)
fpr <- 1 - Specificity_func(input_df_int)
row <- data.frame(Probability = threshold, TPR = tpr, FPR = fpr)
roc_df <- rbind(roc_df, row)
}
## Approximate the area under the curve as a sum of rectangles:
## the width of each FPR step times the TPR at the start of that step
roc_df$area <- c(-diff(roc_df$FPR) * head(roc_df$TPR, -1), 0)
## Calculate AUC as the sum of the rectangle areas
AUC <- round(sum(roc_df$area), 4)
roc_curve <- ggplot(data = roc_df, aes(x = FPR, y = TPR)) +
geom_point() +
geom_path(color = "dodgerblue") +
ggtitle("ROC Curve") +
xlab("False Positive Rate (FPR)") +
ylab("True Positive Rate (TPR)") +
theme(
plot.title = element_text(hjust = 0.5))+
geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
coord_equal(ratio = 1) +
annotate(geom = "text", x = 0.7, y = 0.4, label = paste("AUC:", AUC))
return(list(Plot = roc_curve, AUC = AUC))
}
Plotting the ROC curve:
ROC_func(hw_dataset)
## $Plot
##
## $AUC
## [1] 0.854
Use your created R functions and the provided classification output data set to produce all of the classification metrics discussed above.
model_metrics <- c(Accuracy_func (hw_dataset), Class_error_rt_func (hw_dataset), Precision_func (hw_dataset), Sensitivity_func (hw_dataset), Specificity_func (hw_dataset), f1score_func (hw_dataset))
names(model_metrics) <- c("Accuracy", "Classification Error Rate", "Precision", "Sensitivity", "Specificity", "F1 Score")
model_metrics %>% kable(col.names = "Metrics") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="50%",height="300px")
 | Metrics |
---|---|
Accuracy | 0.8066298 |
Classification Error Rate | 0.1933702 |
Precision | 0.8437500 |
Sensitivity | 0.4736842 |
Specificity | 0.9596774 |
F1 Score | 0.6067416 |
Investigate the caret package. In particular, consider the functions confusionMatrix, sensitivity, and specificity. Apply the functions to the data set. How do the results compare with your own functions?
hw_dataset_1 <- hw_dataset %>% dplyr::select(scored.class, class) %>%
mutate(scored.class = as.factor(scored.class),
class = as.factor(class))
caret_op <- caret::confusionMatrix(data = hw_dataset_1$scored.class,
reference = hw_dataset_1$class,
positive = "1")
caret_package <- c(caret_op$overall["Accuracy"], caret_op$byClass["Sensitivity"], caret_op$byClass["Specificity"])
our_function <- c(Accuracy_func (hw_dataset), Sensitivity_func (hw_dataset), Specificity_func (hw_dataset))
model_comp <- cbind(caret_package, our_function)
model_comp %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="75%",height="300px")
 | caret_package | our_function |
---|---|---|
Accuracy | 0.8066298 | 0.8066298 |
Sensitivity | 0.4736842 | 0.4736842 |
Specificity | 0.9596774 | 0.9596774 |
As per the table above, the Accuracy, Sensitivity and Specificity scores generated by the caret package are identical to the corresponding metrics calculated by our custom functions.
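In addition to confusionMatrix, caret also exposes sensitivity() and specificity() directly. A minimal sketch applying them to the same factor columns (with "1" as the positive level and "0" as the negative level, as above); the results should again match Sensitivity_func and Specificity_func:

caret::sensitivity(data = hw_dataset_1$scored.class,
                   reference = hw_dataset_1$class,
                   positive = "1")
caret::specificity(data = hw_dataset_1$scored.class,
                   reference = hw_dataset_1$class,
                   negative = "0")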
Investigate the pROC package. Use it to generate an ROC curve for the data set. How do the results compare with your own functions?
rocCurve <- roc(response = hw_dataset$class,
predictor = hw_dataset$scored.probability)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Plot the ROC Curve:
plot(rocCurve, legacy.axes = TRUE, main = "ROC Curve using pROC",print.auc = TRUE)
## Area Under the Curve (AUC):
auc(rocCurve)
## Area under the curve: 0.8503
## Plot ROC with Custom Function
ROC_func(hw_dataset)$Plot
## Area Under the Curve (AUC) using Custom Function:
ROC_func(hw_dataset)$AUC
## [1] 0.854
Conclusion: The shape of the ROC curve generated by the pROC package and the one produced by our custom function are almost identical. The AUC values are comparable but slightly different (0.8503 from pROC versus 0.854 from our function), largely because our function approximates the area with rectangles over a fixed grid of thresholds spaced 0.01 apart.
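Most of the remaining gap comes from the rectangle approximation used inside ROC_func. As a cross-check, the same 0-to-1 threshold grid can be integrated with the trapezoidal rule instead; the sketch below reuses Sensitivity_func and Specificity_func from above, and auc_trapezoid is a hypothetical helper name introduced only for this illustration:

# Approximate the AUC with trapezoids over the same 0-to-1 threshold grid
auc_trapezoid <- function(input_df, thresholds = seq(0, 1, by = 0.01)) {
  rates <- sapply(thresholds, function(t) {
    tmp <- input_df
    tmp$scored.class <- ifelse(tmp$scored.probability > t, 1, 0)
    c(TPR = Sensitivity_func(tmp), FPR = 1 - Specificity_func(tmp))
  })
  tpr <- rates["TPR", ]
  fpr <- rates["FPR", ]
  # Trapezoid area: width of each FPR step times the mean of the adjacent TPR values
  sum(-diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
}
auc_trapezoid(hw_dataset)

A trapezoidal estimate over this grid should sit closer to the pROC value, although the two need not match exactly because pROC evaluates the curve at thresholds taken from the observed probabilities rather than on a fixed grid.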