library(kableExtra)
library(ggplot2)
library(dplyr)
library(MASS)
library(pROC)
library(caret)
As part of this homework assignment, we have been given a dataset called classification-output-data.csv, which contains a set of independent variables (features), an observed class attribute, and the output of a classification model: a scored class probability and a scored class derived from that probability. We will use three key columns to derive the classification model metrics below: class (the actual class), scored.class (the predicted class), and scored.probability (the model's predicted probability for the positive class).
Loading the dataset into R:
hw_dataset <- read.csv("https://raw.githubusercontent.com/deepakmongia/Data621/master/HW-2/Data/classification-output-data.csv",header = TRUE)
Sample snapshot of data set:
head(hw_dataset, 20) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
pregnant | glucose | diastolic | skinfold | insulin | bmi | pedigree | age | class | scored.class | scored.probability |
---|---|---|---|---|---|---|---|---|---|---|
7 | 124 | 70 | 33 | 215 | 25.5 | 0.161 | 37 | 0 | 0 | 0.3284523 |
2 | 122 | 76 | 27 | 200 | 35.9 | 0.483 | 26 | 0 | 0 | 0.2731904 |
3 | 107 | 62 | 13 | 48 | 22.9 | 0.678 | 23 | 1 | 0 | 0.1096604 |
1 | 91 | 64 | 24 | 0 | 29.2 | 0.192 | 21 | 0 | 0 | 0.0559984 |
4 | 83 | 86 | 19 | 0 | 29.3 | 0.317 | 34 | 0 | 0 | 0.1004907 |
1 | 100 | 74 | 12 | 46 | 19.5 | 0.149 | 28 | 0 | 0 | 0.0551546 |
9 | 89 | 62 | 0 | 0 | 22.5 | 0.142 | 33 | 0 | 0 | 0.1071154 |
8 | 120 | 78 | 0 | 0 | 25.0 | 0.409 | 64 | 0 | 0 | 0.4599474 |
1 | 79 | 60 | 42 | 48 | 43.5 | 0.678 | 23 | 0 | 0 | 0.1170237 |
2 | 123 | 48 | 32 | 165 | 42.1 | 0.520 | 26 | 0 | 0 | 0.3153632 |
5 | 88 | 78 | 30 | 0 | 27.6 | 0.258 | 37 | 0 | 0 | 0.1251892 |
5 | 108 | 72 | 43 | 75 | 36.1 | 0.263 | 33 | 0 | 0 | 0.2706248 |
13 | 76 | 60 | 0 | 0 | 32.8 | 0.180 | 41 | 0 | 0 | 0.2098096 |
0 | 100 | 70 | 26 | 50 | 30.8 | 0.597 | 21 | 0 | 0 | 0.0935859 |
7 | 194 | 68 | 28 | 0 | 35.9 | 0.745 | 41 | 1 | 1 | 0.8848457 |
12 | 92 | 62 | 7 | 258 | 27.6 | 0.926 | 44 | 1 | 0 | 0.3966522 |
0 | 173 | 78 | 32 | 265 | 46.5 | 1.159 | 58 | 0 | 1 | 0.8913949 |
3 | 171 | 72 | 33 | 135 | 33.3 | 0.199 | 24 | 1 | 1 | 0.5345490 |
8 | 196 | 76 | 29 | 280 | 37.5 | 0.605 | 57 | 1 | 1 | 0.9463342 |
5 | 99 | 74 | 27 | 0 | 29.0 | 0.203 | 32 | 0 | 0 | 0.1449162 |
Statistical Summary:
summary(hw_dataset)
## pregnant glucose diastolic skinfold
## Min. : 0.000 Min. : 57.0 Min. : 38.0 Min. : 0.0
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 64.0 1st Qu.: 0.0
## Median : 3.000 Median :112.0 Median : 70.0 Median :22.0
## Mean : 3.862 Mean :118.3 Mean : 71.7 Mean :19.8
## 3rd Qu.: 6.000 3rd Qu.:136.0 3rd Qu.: 78.0 3rd Qu.:32.0
## Max. :15.000 Max. :197.0 Max. :104.0 Max. :54.0
## insulin bmi pedigree age
## Min. : 0.00 Min. :19.40 Min. :0.0850 Min. :21.00
## 1st Qu.: 0.00 1st Qu.:26.30 1st Qu.:0.2570 1st Qu.:24.00
## Median : 0.00 Median :31.60 Median :0.3910 Median :30.00
## Mean : 63.77 Mean :31.58 Mean :0.4496 Mean :33.31
## 3rd Qu.:105.00 3rd Qu.:36.00 3rd Qu.:0.5800 3rd Qu.:41.00
## Max. :543.00 Max. :50.00 Max. :2.2880 Max. :67.00
## class scored.class scored.probability
## Min. :0.0000 Min. :0.0000 Min. :0.02323
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.11702
## Median :0.0000 Median :0.0000 Median :0.23999
## Mean :0.3149 Mean :0.1768 Mean :0.30373
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.43093
## Max. :1.0000 Max. :1.0000 Max. :0.94633
Use the table() function to get the raw confusion matrix for this scored dataset. Make sure you understand the output. In particular, do the rows represent the actual or predicted class? The columns?
# Subset original data set
classdata <- hw_dataset %>% dplyr::select(scored.class,class)
# Raw Confusion Matrix
cmat <- table(classdata)
cmat
## class
## scored.class 0 1
## 0 119 30
## 1 5 27
We have reformatted the raw confusion matrix above to define Event and Non-Event. Here we have coded '1' values in the raw data set as "Event" (i.e., the positive or True class) and '0' values as "Non-Event" (i.e., the negative or False class).
cmat <- cmat[order(rownames(cmat),decreasing = T), order(colnames(cmat),decreasing = T)]
rownames(cmat) <- c("Event","Non-Event")
colnames(cmat) <- c("Event","Non-Event")
cmat %>% kable %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="50%",height="200px")
 | Event | Non-Event |
---|---|---|
Event | 27 | 5 |
Non-Event | 30 | 119 |
The rows in the confusion matrix represent the classes predicted by the classification model ("Event" and "Non-Event"), and the columns represent the actual classes observed in the data set.
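One way to make this orientation explicit is to name the table dimensions when building it. A minimal sketch using the same two columns (the dimension labels are added only for illustration):

# Label the dimensions so the rows read as predicted and the columns as actual
table(Predicted = classdata$scored.class, Actual = classdata$class)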
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the accuracy of the predictions.
\(Accuracy\quad =\quad \frac { TP+TN }{ TP+FP+TN+FN }\)
Accuracy_func <- function(input_df){
True_positive_plus_negative <- sum(input_df$class == input_df$scored.class)
False_positive_plus_negative <- sum(input_df$class != input_df$scored.class)
Accuracy <- True_positive_plus_negative / (True_positive_plus_negative + False_positive_plus_negative)
return(Accuracy)
}
Accuracy_func (hw_dataset)
## [1] 0.8066298
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the classification error rate of the predictions.
\(Classification\quad Error\quad Rate\quad =\quad \frac { FP+FN }{ TP+FP+TN+FN }\)
Class_error_rt_func <- function(input_df){
True_positive_plus_negative <- sum(input_df$class == input_df$scored.class)
False_positive_plus_negative <- sum(input_df$class != input_df$scored.class)
error_rate <- False_positive_plus_negative / (True_positive_plus_negative + False_positive_plus_negative)
return(error_rate)
}
Class_error_rt_func (hw_dataset)
## [1] 0.1933702
Verify that you get an accuracy and an error rate that sums to one.
Accuracy_func(hw_dataset) + Class_error_rt_func(hw_dataset)
## [1] 1
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the precision of the predictions.
\(Precision\quad =\quad \frac { TP }{ TP+FP }\)
Precision_func <- function(input_df) {
True_positive <- sum(input_df$class == 1 & input_df$scored.class == 1)
False_positive <- sum(input_df$class == 0 & input_df$scored.class == 1)
Precision_val <- True_positive / (True_positive + False_positive)
return(Precision_val)
}
Precision_func (hw_dataset)
## [1] 0.84375
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the sensitivity of the predictions. Sensitivity is also known as recall.
\(Sensitivity\quad =\quad \frac { TP }{ TP+FN }\)
Sensitivity_func <- function(input_df) {
True_positive <- sum(input_df$class == 1 & input_df$scored.class == 1)
False_negative <- sum(input_df$class == 1 & input_df$scored.class == 0)
Sensitivity_val <- True_positive / (True_positive + False_negative)
return(Sensitivity_val)
}
Sensitivity_func (hw_dataset)
## [1] 0.4736842
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the specificity of the predictions.
\(Specificity\quad =\quad \frac { TN }{ TN+FP }\)
Specificity_func <- function(input_df) {
True_negative <- sum(input_df$class == 0 & input_df$scored.class == 0)
False_positive <- sum(input_df$class == 0 & input_df$scored.class == 1)
Specificity_val <- True_negative / (True_negative + False_positive)
return(Specificity_val)
}
Specificity_func (hw_dataset)
## [1] 0.9596774
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the F1 score of the predictions.
\(F1\quad Score\quad =\quad \frac { 2*Precision*Sensitivity }{ Precision\quad +\quad Sensitivity }\)
f1score_func <- function(input_df) {
Precision_val <- Precision_func(input_df)
Sensitivity_val <- Sensitivity_func(input_df)
f1score_val <- ( 2 * Precision_val * Sensitivity_val ) / (Precision_val + Sensitivity_val)
return(f1score_val)
}
f1score_func (hw_dataset)
## [1] 0.6067416
Before we move on, let’s consider a question that was asked: What are the bounds on the F1 score? Show that the F1 score will always be between 0 and 1. (Hint: If \(0<a<1\) and \(0<b<1\) then \(ab<a\).)
Expanding the definition of F1 Score as below -
\(F1\quad Score\quad =\quad \frac { 2*Precision*Sensitivity }{ Precision\quad +\quad Sensitivity } \\ \\ \quad \quad \quad \quad \quad \quad \quad \quad =\quad \frac { 2*\frac { TP }{ TP+FP } *\frac { TP }{ TP+FN } }{ \frac { TP }{ TP+FP } +\frac { TP }{ TP+FN } } \\ \quad \quad \quad \quad \quad \quad \quad \quad =\quad \frac { \frac { 2*{ TP }^{ 2 } }{ (TP+FP)(TP+FN) } }{ \frac { TP(2TP+FP+FN) }{ (TP+FP)(TP+FN) } } \\ \quad \quad \quad \quad \quad \quad \quad \quad =\quad \frac { 2TP }{ 2TP+FP+FN }\)
In the equation above, TP, FP and FN are all non-negative integers: \(TP,FP,FN\quad \in \quad N\quad where\quad N\quad =\quad \{ 0,1,2,3,...\}\)
Mathematically, \(2TP\ \le\ 2TP+FP+FN\), so the numerator is at most equal to the denominator, and therefore F1 Score \(\le\) 1.
The fraction \(\frac { 2TP }{ 2TP+FP+FN }\) attains its minimum value of zero when TP = 0 (with FP + FN > 0, so the ratio is defined) and its maximum value of 1 when FP = FN = 0.
Hence it can be concluded that \(0\quad \le \quad F1\quad \le \quad 1\).
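As a quick numerical illustration of these bounds (a sketch with simulated counts, not taken from the assignment data), we can draw arbitrary non-negative values for TP, FP and FN and confirm that the resulting F1 scores never leave the interval (0, 1]:

# Simulate arbitrary confusion-matrix counts (TP > 0 so the score is defined)
set.seed(123)  # hypothetical seed, used only for reproducibility
tp <- sample(1:100, 1000, replace = TRUE)
fp <- sample(0:100, 1000, replace = TRUE)
fn <- sample(0:100, 1000, replace = TRUE)
f1 <- (2 * tp) / (2 * tp + fp + fn)
range(f1)  # every simulated F1 score lies in (0, 1]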
Write a function that generates an ROC curve from a data set with a true classification column (class in our example) and a probability column (scored.probability in our example). Your function should return a list that includes the plot of the ROC curve and a vector that contains the calculated area under the curve (AUC). Note that I recommend using a sequence of thresholds ranging from 0 to 1 at 0.01 intervals.
ROC_func <- function(input_df) {
roc_df <- data.frame(matrix(ncol = 3, nrow = 0))
names(roc_df) <- c("Probability","TPR","FPR")
input_df_int <- data.frame(input_df)
for (threshold in seq(0, 1, by = 0.01))
{
input_df_int$scored.class <- ifelse(input_df_int$scored.probability > threshold, 1, 0)
tpr <- Sensitivity_func(input_df_int)
fpr <- 1 - Specificity_func(input_df_int)
row <- data.frame(Probability = threshold, TPR = tpr, FPR = fpr)
roc_df <- rbind(roc_df, row)
}
## Approximate the area under the curve as a sum of rectangles:
## the width of each FPR step times the TPR at the start of that step
roc_df$area <- c(-diff(roc_df$FPR) * head(roc_df$TPR, -1), 0)
## Calculate AUC as the sum of the rectangle areas
AUC <- round(sum(roc_df$area), 4)
roc_curve <- ggplot(data = roc_df, aes(x = FPR, y = TPR)) +
geom_point() +
geom_path(color = "dodgerblue") +
ggtitle("ROC Curve") +
xlab("False Positive Rate (FPR)") +
ylab("True Positive Rate (TPR)") +
theme(
plot.title = element_text(hjust = 0.5))+
geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
coord_equal(ratio = 1) +
annotate(geom = "text", x = 0.7, y = 0.4, label = paste("AUC:", AUC))
return(list(Plot = roc_curve, AUC = AUC))
}
Plotting the ROC curve:
ROC_func(hw_dataset)
## $Plot
##
## $AUC
## [1] 0.854
Use your created R functions and the provided classification output data set to produce all of the classification metrics discussed above.
model_metrics <- c(Accuracy_func (hw_dataset), Class_error_rt_func (hw_dataset), Precision_func (hw_dataset), Sensitivity_func (hw_dataset), Specificity_func (hw_dataset), f1score_func (hw_dataset))
names(model_metrics) <- c("Accuracy", "Classification Error Rate", "Precision", "Sensitivity", "Specificity", "F1 Score")
model_metrics %>% kable(col.names = "Metrics") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="50%",height="300px")
 | Metrics |
---|---|
Accuracy | 0.8066298 |
Classification Error Rate | 0.1933702 |
Precision | 0.8437500 |
Sensitivity | 0.4736842 |
Specificity | 0.9596774 |
F1 Score | 0.6067416 |
Investigate the caret package. In particular, consider the functions confusionMatrix, sensitivity, and specificity. Apply the functions to the data set. How do the results compare with your own functions?
hw_dataset_1 <- hw_dataset %>% dplyr::select(scored.class, class) %>%
mutate(scored.class = as.factor(scored.class),
class = as.factor(class))
caret_op <- caret::confusionMatrix(data = hw_dataset_1$scored.class,
reference = hw_dataset_1$class,
positive = "1")
caret_package <- c(caret_op$overall["Accuracy"], caret_op$byClass["Sensitivity"], caret_op$byClass["Specificity"])
our_function <- c(Accuracy_func (hw_dataset), Sensitivity_func (hw_dataset), Specificity_func (hw_dataset))
model_comp <- cbind(caret_package, our_function)
model_comp %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="75%",height="300px")
 | caret_package | our_function |
---|---|---|
Accuracy | 0.8066298 | 0.8066298 |
Sensitivity | 0.4736842 | 0.4736842 |
Specificity | 0.9596774 | 0.9596774 |
As per the table above, the Accuracy, Sensitivity and Specificity scores generated by the caret package are identical to the corresponding metrics calculated by our custom functions.
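In addition to confusionMatrix, caret also exposes sensitivity() and specificity() directly. A minimal sketch applying them to the same factor columns (with "1" as the positive level and "0" as the negative level, as above); the results should again match Sensitivity_func and Specificity_func:

caret::sensitivity(data = hw_dataset_1$scored.class,
                   reference = hw_dataset_1$class,
                   positive = "1")
caret::specificity(data = hw_dataset_1$scored.class,
                   reference = hw_dataset_1$class,
                   negative = "0")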
Investigate the pROC package. Use it to generate an ROC curve for the data set. How do the results compare with your own functions?
rocCurve <- roc(response = hw_dataset$class,
predictor = hw_dataset$scored.probability)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Plot the ROC Curve:
plot(rocCurve, legacy.axes = TRUE, main = "ROC Curve using pROC",print.auc = TRUE)
## Area Under the Curve (AUC):
auc(rocCurve)
## Area under the curve: 0.8503
## Plot ROC with Custom Function
ROC_func(hw_dataset)$Plot
## Area Under the Curve (AUC) using Custom Function:
ROC_func(hw_dataset)$AUC
## [1] 0.854
Conclusion: The shape of the ROC curve generated by the pROC package and the one produced by our custom function are almost identical. The AUC values are comparable but slightly different (0.8503 from pROC versus 0.854 from our function), largely because our function approximates the area with rectangles over a fixed grid of thresholds spaced 0.01 apart.
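Most of the remaining gap comes from the rectangle approximation used inside ROC_func. As a cross-check, the same 0-to-1 threshold grid can be integrated with the trapezoidal rule instead; the sketch below reuses Sensitivity_func and Specificity_func from above, and auc_trapezoid is a hypothetical helper name introduced only for this illustration:

# Approximate the AUC with trapezoids over the same 0-to-1 threshold grid
auc_trapezoid <- function(input_df, thresholds = seq(0, 1, by = 0.01)) {
  rates <- sapply(thresholds, function(t) {
    tmp <- input_df
    tmp$scored.class <- ifelse(tmp$scored.probability > t, 1, 0)
    c(TPR = Sensitivity_func(tmp), FPR = 1 - Specificity_func(tmp))
  })
  tpr <- rates["TPR", ]
  fpr <- rates["FPR", ]
  # Trapezoid area: width of each FPR step times the mean of the adjacent TPR values
  sum(-diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
}
auc_trapezoid(hw_dataset)

A trapezoidal estimate over this grid should sit closer to the pROC value, although the two need not match exactly because pROC evaluates the curve at thresholds taken from the observed probabilities rather than on a fixed grid.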