In this homework assignment, you will work through various classification metrics. You will be asked to create functions in R to carry out the various calculations. You will also investigate some functions in packages that will let you obtain the equivalent results. Finally, you will create graphical output that also can be used to evaluate the output of classification models, such as binary logistic regression.
Complete each of the following steps as instructed:
url <- 'https://raw.githubusercontent.com/jcp9010/MSDA/master/CUNY%20621/Homework%202/classification-output-data.csv'
dm_df <- read.csv(url, header = TRUE)
head(dm_df)
## pregnant glucose diastolic skinfold insulin bmi pedigree age class
## 1 7 124 70 33 215 25.5 0.161 37 0
## 2 2 122 76 27 200 35.9 0.483 26 0
## 3 3 107 62 13 48 22.9 0.678 23 1
## 4 1 91 64 24 0 29.2 0.192 21 0
## 5 4 83 86 19 0 29.3 0.317 34 0
## 6 1 100 74 12 46 19.5 0.149 28 0
## scored.class scored.probability
## 1 0 0.32845226
## 2 0 0.27319044
## 3 0 0.10966039
## 4 0 0.05599835
## 5 0 0.10049072
## 6 0 0.05515460
class
: the actual class for the observation

scored.class
: the predicted class for the observation (based on a threshold of 0.5)

scored.probability
: the predicted probability of success for the observation

Use the table() function to get the raw confusion matrix for this scored dataset. Make sure you understand the output. In particular, do the rows represent the actual or predicted class? The columns?
with(dm_df, table(scored.class, class)[2:1,2:1])
## class
## scored.class 1 0
## 1 27 5
## 0 30 119
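As an aside, the dimensions can also be named when the table is built; this optional variant (not part of the required output above) makes the orientation explicit, with rows as the model's predictions and columns as the actual classes:
with(dm_df, table(Predicted = scored.class, Actual = class))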
The website: http://www.saedsayad.com/model_evaluation_c.htm does a great job explaining the confusion matrix. According to the website, “a confusion matrix shows the number of correct and incorrect predictions made by the classification model compared to the actual outcomes (target value) in the data. The matrix is \(NxN\), where N is the number of target values (classes). Performance of such models is commonly evaluated using the data in the matrix. The following table displays a 2x2 confusion matrix for two classes (Positive and Negative).”
Accuracy
: the proportion of the total number of predictions that were correct.

Positive Predictive Value or Precision
: the proportion of predicted positive cases that were actually positive.

Negative Predictive Value
: the proportion of predicted negative cases that were actually negative.

Sensitivity or Recall
: the proportion of actual positive cases that are correctly identified.

Specificity
: the proportion of actual negative cases that are correctly identified.

Looking at the Diabetes dataset and the table above, I created a confusion matrix with the rows identified as the Model Prediction and the columns identified as the Actual Target. For example, in the top row, scored.class predicts 32 cases as positive (class 1), when in fact only 27 of those cases are actually positive and the other 5 are actually negative.
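To make the cell roles concrete, the short sketch below (using the same reordered table) pulls out the four counts by name, with rows as the predicted class and columns as the actual class:
cm <- with(dm_df, table(scored.class, class)[2:1,2:1])
TP <- cm["1", "1"]  # predicted 1, actually 1 -> 27
FP <- cm["1", "0"]  # predicted 1, actually 0 -> 5
FN <- cm["0", "1"]  # predicted 0, actually 1 -> 30
TN <- cm["0", "0"]  # predicted 0, actually 0 -> 119
c(TP = TP, FP = FP, FN = FN, TN = TN)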
\[\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}\]
accuracy_fct <- function(x){
# Takes the scored data frame and returns the accuracy.
# After the [2:1,2:1] reordering, the table cells (by position) are: [1] = TP, [2] = FN, [3] = FP, [4] = TN.
df_table <- with(x, table(scored.class, class)[2:1,2:1])
acc <- (df_table[1] + df_table[4])/(df_table[1] + df_table[3] + df_table[4] + df_table[2])
return(acc)
}
print(paste0("Accuracy: ", round(accuracy_fct(dm_df),3)))
## [1] "Accuracy: 0.807"
\[\text{Classification Error Rate} = \frac{FP + FN}{TP + FP + TN + FN}\]
class_error_fct <- function(x){
# Takes the scored data frame and returns the classification error rate
df_table <- with(x, table(scored.class, class)[2:1,2:1])
class_error <- (df_table[3] + df_table[2])/(df_table[4] + df_table[3] + df_table[1] + df_table[2])
return(class_error)
}
print(paste0("Classification Error Rate: ", round(class_error_fct(dm_df),3)))
## [1] "Classification Error Rate: 0.193"
Verify that the accuracy and the classification error rate sum to one.
total_sum <- accuracy_fct(dm_df) + class_error_fct(dm_df)
print(paste0("Accuracy: ", accuracy_fct(dm_df), " + ", class_error_fct(dm_df), " = ", total_sum))
## [1] "Accuracy: 0.806629834254144 + 0.193370165745856 = 1"
\[\text{Precision} = \frac{TP}{TP + FP}\]
precision_fct <- function(x){
# Takes the scored data frame and returns the precision
df_table <- with(x, table(scored.class, class)[2:1,2:1])
prec <- (df_table[1])/(df_table[1] + df_table[3])
return(prec)
}
print(paste0("Precision: ", round(precision_fct(dm_df),3)))
## [1] "Precision: 0.844"
\[\text{Sensitivity} = \frac{TP}{TP + FN}\]
sensitivity_fct <- function(x){
# Takes the scored data frame and returns the sensitivity
df_table <- with(x, table(scored.class, class)[2:1,2:1])
sens <- (df_table[1])/(df_table[1] + df_table[2])
return(sens)
}
print(paste0("Sensitivity: ", round(sensitivity_fct(dm_df),3)))
## [1] "Sensitivity: 0.474"
\[\text{Specificity} = \frac{TN}{TN + FP}\]
specificity_fct <- function(x){
# Takes the scored data frame and returns the specificity
df_table <- with(x, table(scored.class, class)[2:1,2:1])
spec <- (df_table[4])/(df_table[4] + df_table[3])
return(spec)
}
print(paste0("Specificity: ", round(specificity_fct(dm_df),3)))
## [1] "Specificity: 0.96"
\[\text{F1 Score} = \frac{2 * Precision * Sensitivity}{Precision + Sensitivity}\]
f1_fct <- function(x){
# Takes the scored data frame and returns the F1 score
df_table <- with(x, table(scored.class, class)[2:1,2:1])
prec <- (df_table[1])/(df_table[1] + df_table[3])
sens <- (df_table[1])/(df_table[1] + df_table[2])
f1 <- (2 * prec * sens)/(prec + sens)
return(f1)
}
print(paste0("F1 Score: ", round(f1_fct(dm_df),3)))
## [1] "F1 Score: 0.607"
We will assume that \(a\) is precision and \(b\) is sensitivity. Precision and sensitivity are both bounded between 0 and 1, or in other words: \(0 < a < 1\) and \(0 < b < 1\).
If we take the boundary values \(a = 1\) and \(b = 1\), then:
\[\text{F1 Score} = \frac{2*1*1}{1+1} = 1\] Likewise, as \(a\) and \(b\) approach 0, the F1 score approaches 0 but never reaches it, because both the numerator \(2ab\) and the denominator \(a+b\) remain strictly positive. For the upper bound, note that \(0 < a < 1\) and \(0 < b < 1\) imply \(ab < a\) and \(ab < b\); adding these inequalities gives \(2ab < a + b\), and therefore \(\frac{2ab}{a+b} < 1\). So, with \(a\) and \(b\) bounded between 0 and 1, the F1 score will always fall between 0 and 1.
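As a quick numerical check (a sketch, not required by the assignment), the F1 formula can be evaluated over a grid of hypothetical precision and sensitivity values strictly between 0 and 1; the resulting range stays inside (0, 1):
# Evaluate F1 = 2ab/(a + b) over a grid of values for a (precision) and b (sensitivity)
f1_grid <- expand.grid(a = seq(0.01, 0.99, by = 0.01), b = seq(0.01, 0.99, by = 0.01))
f1_values <- with(f1_grid, (2 * a * b)/(a + b))
range(f1_values)  # both endpoints lie strictly between 0 and 1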
roc_curve_fct <- function(x){
# Returns a list that includes the plot of the ROC curve and a vector that contains the calculated area under the curve.
# Using sequence of thresholds ranging from 0 to 1 at 0.01 intervals.
library(ggplot2)
seq_int <- seq(0,1,by=0.01)
TPR_vector <- c()
FPR_vector <- c()
for (i in 1:length(seq_int)){
scored_class <- ifelse(x$scored.probability >= seq_int[i], 1, 0)
rev_df <- data.frame(scored.class = scored_class, class = x$class)
df_table <- with(rev_df, table(scored.class, class))
# With the default row/column ordering (0 then 1) and both predicted classes present, the cell positions are: [1] = TN, [2] = FP, [3] = FN, [4] = TP.
TPR <- (df_table[4])/(df_table[4] + df_table[3])
FPR <- (df_table[2])/(df_table[2] + df_table[1])
TPR_vector[i] <- TPR
FPR_vector[i] <- FPR
}
plot_df <- data.frame(TRUE_POSITIVE = TPR_vector, FALSE_POSITIVE = FPR_vector)
ROC_plot <- ggplot(plot_df, aes(x=FALSE_POSITIVE, y=TRUE_POSITIVE)) + geom_point() + geom_line(col="blue") + geom_abline(intercept = 0, slope = 1) + labs(title="Receiver Operating Characteristic (ROC) Curve", x = "False Positive Rate (1 - Specificity)", y = "True Positive Rate (Sensitivity)")
# In order to roughly calculate the area under the curve, we must remove the NA values
AUC_df <- plot_df[complete.cases(plot_df),]
# Now approximate the AUC with a rough rectangle-style sum: TPR values weighted by the absolute changes in FPR
# (dx has one fewer element than y, so R recycles dx when the two vectors are multiplied)
dx <- abs(diff(AUC_df$FALSE_POSITIVE))
y <- AUC_df$TRUE_POSITIVE
area_under_curve <- sum(dx*y)
return(list(ROC_plot, area_under_curve))
}
ROC_list <- roc_curve_fct(dm_df)
ROC_plot <- ROC_list[[1]]
ROC_plot
area_under_curve <- ROC_list[[2]]
print(paste0("Area Under the Curve: ", area_under_curve))
## [1] "Area Under the Curve: 0.829937747594793"
print(paste0("Accuracy: ", round(accuracy_fct(dm_df),3)))
## [1] "Accuracy: 0.807"
print(paste0("Classification Error Rate: ", round(class_error_fct(dm_df),3)))
## [1] "Classification Error Rate: 0.193"
print(paste0("Precision: ", round(precision_fct(dm_df),3)))
## [1] "Precision: 0.844"
print(paste0("Sensitivity: ", round(sensitivity_fct(dm_df),3)))
## [1] "Sensitivity: 0.474"
print(paste0("Specificity: ", round(specificity_fct(dm_df),3)))
## [1] "Specificity: 0.96"
print(paste0("F1 Score: ", round(f1_fct(dm_df),3)))
## [1] "F1 Score: 0.607"
library(caret)
df_table <- with(dm_df, table(scored.class, class)[2:1,2:1])
confusionMatrix(df_table)
## Confusion Matrix and Statistics
##
## class
## scored.class 1 0
## 1 27 5
## 0 30 119
##
## Accuracy : 0.8066
## 95% CI : (0.7415, 0.8615)
## No Information Rate : 0.6851
## P-Value [Acc > NIR] : 0.0001712
##
## Kappa : 0.4916
## Mcnemar's Test P-Value : 4.976e-05
##
## Sensitivity : 0.4737
## Specificity : 0.9597
## Pos Pred Value : 0.8438
## Neg Pred Value : 0.7987
## Prevalence : 0.3149
## Detection Rate : 0.1492
## Detection Prevalence : 0.1768
## Balanced Accuracy : 0.7167
##
## 'Positive' Class : 1
##
print(paste0("Sensitivity: ", sensitivity(df_table)))
## [1] "Sensitivity: 0.473684210526316"
print(paste0("Specificity: ", specificity(df_table)))
## [1] "Specificity: 0.959677419354839"
The sensitivity and specificity from caret match the values produced by the functions written above.
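As a side note, newer versions of caret accept a mode argument on confusionMatrix(); if the installed version supports it, this should also report Precision, Recall, and the F1 score, which can be compared against the hand-written functions above (a sketch, not run here):
# Request precision/recall statistics in addition to sensitivity/specificity
confusionMatrix(df_table, mode = "everything")
# Precision (positive predictive value) is also available as a standalone helper
posPredValue(df_table)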
library(pROC)
plot(roc(dm_df$class, dm_df$scored.probability), main="ROC Curve from pROC Package")
# Please note that the X axis is in Specificity (as opposed to 1 - Specificity in the above function)
# Area Under the Curve
auc(roc(dm_df$class, dm_df$scored.probability))
## Area under the curve: 0.8503
The results are similar, particularly the shape of the curve; the AUC reported by pROC (0.8503) is about 0.02 higher than the rough approximation computed above (0.8299).
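Part of that gap likely comes from the rough rectangle-style sum used in roc_curve_fct(). A trapezoidal approximation over the same (FALSE_POSITIVE, TRUE_POSITIVE) points, sketched below, should land closer to the pROC value; the helper is illustrative only, since the point data frame is built inside roc_curve_fct() and not returned. The pROC plot can also be drawn with the conventional 1 - Specificity x-axis by setting legacy.axes = TRUE.
# Illustrative trapezoidal-rule AUC over a data frame of (FALSE_POSITIVE, TRUE_POSITIVE) points
trapezoid_auc <- function(roc_points){
roc_points <- roc_points[complete.cases(roc_points), ]
roc_points <- roc_points[order(roc_points$FALSE_POSITIVE), ]  # integrate from left to right
dx <- diff(roc_points$FALSE_POSITIVE)
mean_y <- (head(roc_points$TRUE_POSITIVE, -1) + tail(roc_points$TRUE_POSITIVE, -1))/2
sum(dx * mean_y)
}
# pROC can also plot against 1 - Specificity to match the earlier ROC plot
plot(roc(dm_df$class, dm_df$scored.probability), legacy.axes = TRUE, main = "ROC Curve (1 - Specificity axis)")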