library(dplyr)
library(ggplot2)
library(pROC)
library(kableExtra)
As part of this homework assignment, we have been given a dataset called classification-output-data.csv, which contains a set of independent variables (features) and a class, along with the scored probability and scored class produced by a predictive classification model. We will use the following three key columns to derive several key classification model metrics:
class: the actual class for the observation
scored.class: the predicted class for the observation (based on a threshold of 0.5)
scored.probability: the predicted probability of success for the observation
Loading the dataset into R:
hw_dataset <- read.csv("https://raw.githubusercontent.com/deepakmongia/Data621/master/HW-2/Data/classification-output-data.csv",
header = TRUE)
Use the table() function to get the raw confusion matrix for this scored dataset. Make sure you understand the output. In particular, do the rows represent the actual or predicted class? The columns?
hw_dataset$class <- as.factor(hw_dataset$class)
hw_dataset$scored.class <- as.factor(hw_dataset$scored.class)
summary(hw_dataset)
## pregnant glucose diastolic skinfold
## Min. : 0.000 Min. : 57.0 Min. : 38.0 Min. : 0.0
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 64.0 1st Qu.: 0.0
## Median : 3.000 Median :112.0 Median : 70.0 Median :22.0
## Mean : 3.862 Mean :118.3 Mean : 71.7 Mean :19.8
## 3rd Qu.: 6.000 3rd Qu.:136.0 3rd Qu.: 78.0 3rd Qu.:32.0
## Max. :15.000 Max. :197.0 Max. :104.0 Max. :54.0
## insulin bmi pedigree age class
## Min. : 0.00 Min. :19.40 Min. :0.0850 Min. :21.00 0:124
## 1st Qu.: 0.00 1st Qu.:26.30 1st Qu.:0.2570 1st Qu.:24.00 1: 57
## Median : 0.00 Median :31.60 Median :0.3910 Median :30.00
## Mean : 63.77 Mean :31.58 Mean :0.4496 Mean :33.31
## 3rd Qu.:105.00 3rd Qu.:36.00 3rd Qu.:0.5800 3rd Qu.:41.00
## Max. :543.00 Max. :50.00 Max. :2.2880 Max. :67.00
## scored.class scored.probability
## 0:149 Min. :0.02323
## 1: 32 1st Qu.:0.11702
## Median :0.23999
## Mean :0.30373
## 3rd Qu.:0.43093
## Max. :0.94633
confusion_matrix <- table(hw_dataset$scored.class, hw_dataset$class)
confusion_matrix %>% kable()
| scored.class (rows) / class (columns) | 0 | 1 |
|---|---|---|
| 0 | 119 | 30 |
| 1 | 5 | 27 |
In the table above, the rows represent the predicted class (scored.class) and the columns represent the actual class (class). So the dataset contains two class columns, the actual class and the class predicted by the classification model, together with the model's scored probability. We will now work with these three columns to compute a few important metrics that describe how the model performed.
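To make this orientation explicit, the same matrix can be rebuilt with named dimensions (a minimal sketch; the labels predicted and actual are illustrative names we are supplying, not columns of the dataset):
# Name the table dimensions so the orientation is unambiguous:
# rows = predicted class (scored.class), columns = actual class (class)
table(predicted = hw_dataset$scored.class,
      actual = hw_dataset$class)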
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the accuracy of the predictions.
\(Accuracy\quad =\quad \frac { TP+TN }{ TP+FP+TN+FN }\)
Accuracy_func <- function(input_df){
  # Correct predictions (TP + TN): actual class equals predicted class
  True_positive_plus_negative <- sum(input_df$class == input_df$scored.class)
  # Incorrect predictions (FP + FN): actual class differs from predicted class
  False_positive_plus_negative <- sum(input_df$class != input_df$scored.class)
  Accuracy <- True_positive_plus_negative / (True_positive_plus_negative + False_positive_plus_negative)
  return(Accuracy)
}
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the classification error rate of the predictions.
\(Classification\quad Error\quad Rate\quad =\quad \frac { FP+FN }{ TP+FP+TN+FN }\)
Class_error_rt_func <- function(input_df){
  # Correct predictions (TP + TN)
  True_positive_plus_negative <- sum(input_df$class == input_df$scored.class)
  # Incorrect predictions (FP + FN)
  False_positive_plus_negative <- sum(input_df$class != input_df$scored.class)
  error_rate <- False_positive_plus_negative / (True_positive_plus_negative + False_positive_plus_negative)
  return(error_rate)
}
Verify that you get an accuracy and an error rate that sum to one.
Accuracy_func(hw_dataset) + Class_error_rt_func(hw_dataset)
## [1] 1
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the precision of the predictions.
\(Precision\quad =\quad \frac { TP }{ TP+FP }\)
Precision_func <- function(input_df) {
  True_positive <- sum(input_df$class == 1 & input_df$scored.class == 1)
  False_positive <- sum(input_df$class == 0 & input_df$scored.class == 1)
  Precision_val <- True_positive / (True_positive + False_positive)
  return(Precision_val)
}
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the sensitivity of the predictions. Sensitivity is also known as recall.
\(Sensitivity\quad =\quad \frac { TP }{ TP+FN }\)
Sensitivity_func <- function(input_df) {
  True_positive <- sum(input_df$class == 1 & input_df$scored.class == 1)
  False_negative <- sum(input_df$class == 1 & input_df$scored.class == 0)
  Sensitivity_val <- True_positive / (True_positive + False_negative)
  return(Sensitivity_val)
}
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the specificity of the predictions.
\(Specificity\quad =\quad \frac { TN }{ TN+FP }\)
Specificity_func <- function(input_df) {
  True_negative <- sum(input_df$class == 0 & input_df$scored.class == 0)
  False_positive <- sum(input_df$class == 0 & input_df$scored.class == 1)
  Specificity_val <- True_negative / (True_negative + False_positive)
  return(Specificity_val)
}
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the F1 score of the predictions.
\(F1\quad Score\quad =\quad \frac { 2*Precision*Sensitivity }{ Precision\quad +\quad Sensitivity }\)
f1score_func <- function(input_df) {
  Precision_val <- Precision_func(input_df)
  Sensitivity_val <- Sensitivity_func(input_df)
  f1score_val <- (2 * Precision_val * Sensitivity_val) / (Precision_val + Sensitivity_val)
  return(f1score_val)
}
Before we move on, let’s consider a question that was asked: What are the bounds on the F1 score? Show that the F1 score will always be between 0 and 1. (Hint: If \(0<a<1\) and \(0<b<1\) then \(ab<a\).)
Expanding the definition of the F1 score:
\(F1\ Score = \frac{2 \cdot Precision \cdot Sensitivity}{Precision + Sensitivity} = \frac{2 \cdot \frac{TP}{TP+FP} \cdot \frac{TP}{TP+FN}}{\frac{TP}{TP+FP} + \frac{TP}{TP+FN}} = \frac{\frac{2 \cdot TP^{2}}{(TP+FP)(TP+FN)}}{\frac{TP \cdot (2TP+FP+FN)}{(TP+FP)(TP+FN)}} = \frac{2TP}{2TP+FP+FN}\)
In the equation above, TP, FP, and FN are all non-negative integers: \(TP, FP, FN \in N\), where \(N = \{0, 1, 2, 3, ...\}\).
Mathematically, \(2TP \le 2TP + FP + FN\), so the numerator is at most equal to the denominator, and hence F1 Score \(\le 1\).
The fraction \(\frac{2TP}{2TP+FP+FN}\) reaches its minimum value of zero when TP = 0 (with FP + FN > 0, so the fraction is defined) and its maximum value of 1 when FP = FN = 0.
Hence it can be concluded that \(0 \le F1 \le 1\).
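As an informal numerical check of this bound (not part of the proof above), the closed form can be evaluated over a grid of non-negative counts; the object names counts and f1_grid are just illustrative:
counts <- expand.grid(TP = 0:20, FP = 0:20, FN = 0:20)
# 2*TP / (2*TP + FP + FN) is NaN only in the degenerate case TP = FP = FN = 0
f1_grid <- with(counts, 2 * TP / (2 * TP + FP + FN))
range(f1_grid, na.rm = TRUE)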
Write a function that generates an ROC curve from a data set with a true classification column (class in our example) and a probability column (scored.probability in our example). Your function should return a list that includes the plot of the ROC curve and a vector that contains the calculated area under the curve (AUC). Note that I recommend using a sequence of thresholds ranging from 0 to 1 at 0.01 intervals.
ROC_func <- function(input_df) {
  # Sweep the classification threshold from 0 to 1 in steps of 0.01 and record
  # the false positive rate (1 - specificity) and true positive rate (sensitivity)
  roc_df <- data.frame()
  for (threshold in seq(0, 1, by = 0.01))
  {
    input_df_int <- data.frame(input_df)
    input_df_int$scored.class <- ifelse(input_df_int$scored.probability > threshold, 1, 0)
    roc_df <- rbind(roc_df, data.frame(1 - Specificity_func(input_df_int),
                                       Sensitivity_func(input_df_int)))
  }
  colnames(roc_df) <- c("FPR", "TPR")
  # Plot the ROC curve as TPR against FPR
  roc_curve <- ggplot(data = roc_df, aes(x = FPR, y = TPR)) + geom_point() +
    geom_path()
  # Sort the points by FPR and compute the AUC with the trapezoidal rule
  roc_df2 <- roc_df[order(roc_df$FPR, roc_df$TPR), ]
  simple_auc <- function(TPR, FPR){
    # inputs must already be sorted in increasing order of FPR
    dFPR <- c(diff(FPR), 0)
    dTPR <- c(diff(TPR), 0)
    sum(TPR * dFPR) + sum(dTPR * dFPR)/2
  }
  auc_value <- with(roc_df2, simple_auc(TPR, FPR))
  # Return both the plot and the AUC value
  new_list <- list(roc_curve, auc_value)
  return(new_list)
}
Plotting the ROC curve:
ROC_func_response <- ROC_func(hw_dataset)
ROC_func_response[1]
## [[1]]
AUC Value:
ROC_func_response[2]
## [[1]]
## [1] 0.8488964
Use your created R functions and the provided classification output data set to produce all of the classification metrics discussed above.
### Accuracy
print(Accuracy_func(hw_dataset))
## [1] 0.8066298
### Classification Error rate
print(Class_error_rt_func(hw_dataset))
## [1] 0.1933702
### Precision
print(Precision_func(hw_dataset))
## [1] 0.84375
### Sensitivity
sensitivity_val <- Sensitivity_func(hw_dataset)
print(sensitivity_val)
## [1] 0.4736842
### Specificity
specificity_val <- Specificity_func(hw_dataset)
print(specificity_val)
## [1] 0.9596774
### F1 score
print(f1score_func(hw_dataset))
## [1] 0.6067416
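The individual values above can also be gathered into one summary table (a small optional sketch reusing the functions and values computed earlier; metrics_summary is an illustrative name):
metrics_summary <- data.frame(
  Metric = c("Accuracy", "Classification Error Rate", "Precision",
             "Sensitivity", "Specificity", "F1 Score"),
  Value = c(Accuracy_func(hw_dataset), Class_error_rt_func(hw_dataset),
            Precision_func(hw_dataset), sensitivity_val,
            specificity_val, f1score_func(hw_dataset))
)
metrics_summary %>% kable()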
We will now use the caret package to generate the confusion matrix and to compute the sensitivity and specificity; these values should match the outputs of our own functions.
library(caret)
## Warning: package 'caret' was built under R version 3.6.2
## Loading required package: lattice
confusion_matrix_caret <- confusionMatrix(hw_dataset$scored.class, hw_dataset$class, positive = '1')
## Confusion matrix from caret
print(confusion_matrix_caret)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 119 30
## 1 5 27
##
## Accuracy : 0.8066
## 95% CI : (0.7415, 0.8615)
## No Information Rate : 0.6851
## P-Value [Acc > NIR] : 0.0001712
##
## Kappa : 0.4916
##
## Mcnemar's Test P-Value : 4.976e-05
##
## Sensitivity : 0.4737
## Specificity : 0.9597
## Pos Pred Value : 0.8438
## Neg Pred Value : 0.7987
## Prevalence : 0.3149
## Detection Rate : 0.1492
## Detection Prevalence : 0.1768
## Balanced Accuracy : 0.7167
##
## 'Positive' Class : 1
##
## Confusion matrix from table function
print(confusion_matrix)
##
## 0 1
## 0 119 30
## 1 5 27
Getting the sensitivity from caret and then comparing with our earlier calculated value:
sensitivity_caret <- sensitivity(hw_dataset$scored.class, hw_dataset$class, positive = '1')
### sensitivity from caret
print(sensitivity_caret)
## [1] 0.4736842
### sensitivity from our function
print(sensitivity_val)
## [1] 0.4736842
Getting the specificity from caret and then comparing with our earlier calculated value:
specificity_caret <- specificity(hw_dataset$scored.class, hw_dataset$class, negative = '0')
### specificity from caret
print(specificity_caret)
## [1] 0.9596774
### specificity from our function
print(specificity_val)
## [1] 0.9596774
roc_curve_proc <- roc(as.numeric(hw_dataset$class),
as.numeric(hw_dataset$scored.probability))
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
ggroc(roc_curve_proc)
print(roc_curve_proc$auc)
## Area under the curve: 0.8503
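For completeness, the two AUC estimates can be placed side by side (a small sketch using the objects created above):
# Compare the AUC from our threshold sweep with the AUC reported by pROC
data.frame(method = c("ROC_func (manual trapezoid)", "pROC::roc"),
           AUC = c(ROC_func_response[[2]], as.numeric(roc_curve_proc$auc)))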
The ROC curve and the AUC value from the pROC package (0.8503) are very close to those produced by the function we built (0.8489). The small difference is expected, since our function evaluates a fixed grid of 101 thresholds while pROC uses every observed probability as a cut point.