Overview
In this homework assignment, we will work through various classification metrics. We will write functions in R to carry out the various calculations, investigate functions in existing packages that produce equivalent results, and finally create graphical output that can also be used to evaluate the output of classification models.
class_output <- read.csv("https://raw.githubusercontent.com/javernw/DATA621-Business-Analytics-and-Data-Mining/master/classification-output-data.csv", header = T)
head(class_output)
confusion_matrix <- table("Predictions" = class_output$scored.class, "Actual" = class_output$class)
confusion_matrix
## Actual
## Predictions 0 1
## 0 119 30
## 1 5 27
The rows represent predictions while the columns represent the actual observations.
\[\large \text{Accuracy} = \large \frac{TP+TN}{TP+FP+TN+FN}\]
Accuracy can be defined as the fraction of predictions our model got right. Like the classification error rate, accuracy makes no distinction about the type of error being made.
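The cl_accuracy function used later in this assignment is not shown above; a minimal sketch, assuming it follows the same confusion-matrix pattern as the other homebrew functions, might look like this:
# Sketch of cl_accuracy (assumed to mirror the other homebrew functions)
cl_accuracy <- function(df){
  cm <- table("Predictions" = df$scored.class, "Actual" = df$class)
  TP <- cm[2,2]  # predicted 1, actual 1
  TN <- cm[1,1]  # predicted 0, actual 0
  FP <- cm[2,1]  # predicted 1, actual 0
  FN <- cm[1,2]  # predicted 0, actual 1
  return((TP + TN)/(TP + FP + TN + FN))
}
With the confusion matrix above, this gives (119 + 27) / 181 ≈ 0.8066.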
\[\large \text{Classification Error Rate} = \large \frac{FP+FN}{TP+FP+TN+FN}\] The Classification Error Rate calculates the number of incorrect predictions out of the total number of predictions in the dataset.
cl_cer <- function(df){
  cm <- table("Predictions" = df$scored.class, "Actual" = df$class)
  TP <- cm[2,2]  # predicted 1, actual 1
  TN <- cm[1,1]  # predicted 0, actual 0
  FP <- cm[2,1]  # predicted 1, actual 0
  FN <- cm[1,2]  # predicted 0, actual 1
  return((FP + FN)/(TP + FP + TN + FN))
}
Verify that the accuracy and the error rate sum to one.
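The chunk that performs this check is not shown; a minimal sketch, assuming the homebrew cl_accuracy and cl_cer functions above, is:
# Accuracy and classification error rate should sum to 1
# (all.equal avoids floating-point comparison surprises)
isTRUE(all.equal(cl_accuracy(class_output) + cl_cer(class_output), 1))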
## [1] TRUE
\[\large \text{Precision} = \large \frac{TP}{TP+FP}\] Precision, also known as the positive predictive value, is the fraction of the positive predictions that are actually positive.
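The cl_precision function referenced later is also not shown; a minimal sketch under the same assumptions:
# Sketch of cl_precision (assumed): TP / (TP + FP)
cl_precision <- function(df){
  cm <- table("Predictions" = df$scored.class, "Actual" = df$class)
  TP <- cm[2,2]  # predicted 1, actual 1
  FP <- cm[2,1]  # predicted 1, actual 0
  return(TP / (TP + FP))
}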
\[\large \text{Sensitivity} = \large \frac{TP}{TP+FN}\]
Sensitivity is also called the true positive rate, since it measures how accurately the model identifies the event (positive) population.
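A corresponding sketch of cl_sensitivity, again assuming the same confusion-matrix layout:
# Sketch of cl_sensitivity (assumed): TP / (TP + FN)
cl_sensitivity <- function(df){
  cm <- table("Predictions" = df$scored.class, "Actual" = df$class)
  TP <- cm[2,2]  # predicted 1, actual 1
  FN <- cm[1,2]  # predicted 0, actual 1
  return(TP / (TP + FN))
}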
\[\large \text{Specificity} = \large \frac{TN}{TN+FP}\] This is the true negative rate, or the proportion of negatives that are correctly identified.
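And a sketch of cl_specificity under the same assumptions:
# Sketch of cl_specificity (assumed): TN / (TN + FP)
cl_specificity <- function(df){
  cm <- table("Predictions" = df$scored.class, "Actual" = df$class)
  TN <- cm[1,1]  # predicted 0, actual 0
  FP <- cm[2,1]  # predicted 1, actual 0
  return(TN / (TN + FP))
}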
\[\large \text{F1 Score} = \large \frac{2 \times Precision \times Sensitivity}{Precision + Sensitivity}\] The F1 score is the harmonic mean of precision and sensitivity. It measures the test's accuracy on a scale of 0 to 1, where 1 is the best possible score and 0 is the worst.
cl_f1score <- function(df){
  precision <- cl_precision(df)
  sensitivity <- cl_sensitivity(df)
  f1score <- (2 * precision * sensitivity) / (precision + sensitivity)
  return(f1score)
}
The function below generates an ROC curve from the true classes and the predicted probabilities, and approximates the area under the curve with the trapezoidal rule.
ROC <- function(x, y){
  # Order the true classes by decreasing predicted probability
  x <- x[order(y, decreasing = TRUE)]
  # Cumulative true positive and false positive rates
  t_p_r <- cumsum(x) / sum(x)
  f_p_r <- cumsum(!x) / sum(!x)
  xy <- data.frame(t_p_r, f_p_r, x)
  f_p_r_df <- c(diff(xy$f_p_r), 0)
  t_p_r_df <- c(diff(xy$t_p_r), 0)
  # Trapezoidal approximation of the area under the curve
  A_U_C <- round(sum(xy$t_p_r * f_p_r_df) + sum(t_p_r_df * f_p_r_df)/2, 4)
  plot(xy$f_p_r, xy$t_p_r, type = "l",
       main = "ROC Curve",
       xlab = "False Positive Rate",
       ylab = "True Positive Rate")
  abline(a = 0, b = 1)
  legend(.6, .4, A_U_C, title = "Area Under Curve")
}
Use your created R functions and the provided classification output data set to produce all of the classification metrics discussed above.
N <- c('Accuracy', 'Classification Error Rate', 'Precision', 'Sensitivity', 'Specificity', 'F1 Score')
V <- round(c(cl_accuracy(class_output), cl_cer(class_output), cl_precision(class_output), cl_sensitivity(class_output), cl_specificity(class_output), cl_f1score(class_output)), 4)
df_1 <- data.frame(Metric = N, Value = V)
kable(df_1)
Metric | Value |
---|---|
Accuracy | 0.8066 |
Classification Error Rate | 0.1934 |
Precision | 0.8438 |
Sensitivity | 0.4737 |
Specificity | 0.9597 |
F1 Score | 0.6067 |
caret
Investigate the caret package. In particular, consider the functions confusionMatrix, sensitivity, and specificity. Apply the functions to the data set. How do the results compare with your own functions?
confusionMatrix(data = factor(class_output$scored.class), reference = factor(class_output$class), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 119 30
## 1 5 27
##
## Accuracy : 0.8066
## 95% CI : (0.7415, 0.8615)
## No Information Rate : 0.6851
## P-Value [Acc > NIR] : 0.0001712
##
## Kappa : 0.4916
##
## Mcnemar's Test P-Value : 4.976e-05
##
## Sensitivity : 0.4737
## Specificity : 0.9597
## Pos Pred Value : 0.8438
## Neg Pred Value : 0.7987
## Prevalence : 0.3149
## Detection Rate : 0.1492
## Detection Prevalence : 0.1768
## Balanced Accuracy : 0.7167
##
## 'Positive' Class : 1
##
# Caret - sensitivity
sensitivity(data = factor(class_output$scored.class), reference = factor(class_output$class), positive = "1")
## [1] 0.4736842
# Homebrew - sensitivity
cl_sensitivity(class_output)
## [1] 0.4736842
The homebrew function matches the result of the caret sensitivity function.
# Caret - specificity
specificity(data = factor(class_output$scored.class), reference = factor(class_output$class), negative = "0")
## [1] 0.9596774
# Homebrew - specificity
cl_specificity(class_output)
## [1] 0.9596774
The homebrew function matches the result of the caret specificity function.
pROC
Investigate the pROC package. Use it to generate an ROC curve for the data set. How do the results compare with your own functions?
plot(roc(class_output$class, class_output$scored.probability), print.auc = TRUE , main = "ROC by pROC")
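For a side-by-side comparison, the homebrew ROC function can be run on the same columns (assuming the column names used throughout this assignment):
# Homebrew ROC curve and AUC for comparison with the pROC output
ROC(class_output$class, class_output$scored.probability)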
The two graphs tell the same story, but there are slight differences in presentation. The pROC package plots specificity on the x-axis, decreasing from roughly 1 down to 0, while the function we wrote for this assignment plots the false positive rate from 0 to 1. In addition, our function reports the area under the curve to one more decimal place.