DATA 621 Homework #2
df <- read.csv("https://raw.githubusercontent.com/mikeasilva/data-621-group-projects/master/hw2/data/classification-output-data.csv")
actual <- "class"
predicted <- "scored.class"
probability <- "scored.probability"
# Load our functions
source("hw2.R")
Caret’s Output
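The confusionMatrix() call behind the summary below is not echoed in the report; it was presumably along these lines (a sketch, reusing the factor conversion shown later in the sensitivity check):

caret::confusionMatrix(data = as.factor(df$scored.class),
                       reference = as.factor(df$class),
                       positive = "0")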
Confusion Matrix and Statistics
          Reference
Prediction   0   1
         0 119  30
         1   5  27
Accuracy : 0.8066
95% CI : (0.7415, 0.8615)
No Information Rate : 0.6851
P-Value [Acc > NIR] : 0.0001712
Kappa : 0.4916
Mcnemar's Test P-Value : 4.976e-05
Sensitivity : 0.9597
Specificity : 0.4737
Pos Pred Value : 0.7987
Neg Pred Value : 0.8438
Prevalence : 0.6851
Detection Rate : 0.6575
Detection Prevalence : 0.8232
Balanced Accuracy : 0.7167
'Positive' Class : 0
Our Function’s Output
Confusion Matrix
| Prediction | Actual 0 | Actual 1 |
|------------|----------|----------|
| 0          | 119      | 30       |
| 1          | 5        | 27       |
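The table above is produced by the confusion_matrix() helper listed in the Source Code section (rows are predictions, columns are actual classes, matching caret's layout); the call, which is not echoed in the report, is presumably:

confusion_matrix(df, actual, predicted)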
Accuracy:
[1] 0.8066298
Classification Error Rate:
[1] 0.1933702
Verify that Accuracy + Classification Error Rate = 1:
[1] 1
Precision:
[1] 0.7986577
Sensitivity (Recall):
[1] 0.9596774
Caret’s Sensitivity: the caret package produces exactly the same sensitivity value.
df2 <- df
df2$scored.class <- as.factor(df2$scored.class)
df2$class <- as.factor(df2$class)
caret::sensitivity(df2$scored.class, df2$class)
[1] 0.9596774
Specificity:
[1] 0.4736842
Caret’s Specificity: the caret package produces exactly the same specificity value.
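(A sketch of the call, mirroring the sensitivity check above; the chunk itself is not echoed in the report.)

caret::specificity(df2$scored.class, df2$class)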
[1] 0.4736842
F1 Score:
[1] 0.8717949
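For reference, the metric values above come from the helper functions defined in the Source Code section below; the calls, not echoed in the rendered output, are presumably:

accuracy(df, actual, predicted)
classification_error_rate(df, actual, predicted)
precision(df, actual, predicted)
sensitivity(df, actual, predicted)
specificity(df, actual, predicted)
f1_score(df, actual, predicted)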
F1 Score Will Always be Between 0 and 1
Before we move on, let’s consider a question that was asked: What are the bounds on the F1 score? Show that the F1 score will always be between 0 and 1. (Hint: If 0<a<1 and 0<b<1 then ab<a).
Proof that the F1 score must be between 0 and 1
Let’s expand the F1 equation using the definitions of precision and sensitivity.
\[\textrm{F1 Score}= \frac{2 \times\mathrm{Precision}\times\mathrm{Sensitivity}}{\mathrm{Precision}+\mathrm{Sensitivity}}\] The numerator becomes \[2\times\mathrm{Precision}\times\mathrm{Sensitivity}=\frac{2 \times TP \times TP}{(TP + FP)(TP+FN)}\] and dividing by the denominator \(\mathrm{Precision}+\mathrm{Sensitivity}\) is the same as multiplying by its reciprocal \[\frac{1}{\frac{TP}{TP+FP}+\frac{TP}{TP+FN}}\]
Combining the two factors: \[F1=\frac{2 \times TP \times TP}{(TP + FP)(TP+FN)} \times \frac{1}{\frac{TP}{TP+FP}+\frac{TP}{TP+FN}}\]
Multiplying the numerator and denominator of the second factor by \((TP+FP)(TP+FN)\) clears the nested fractions:
\[\frac{1}{\frac{TP}{TP+FP}+\frac{TP}{TP+FN}}=\frac{(TP+FP)(TP+FN)}{TP(TP+FN)+TP(TP+FP)}\]
\[F1=\frac{2(TP)^2}{(TP+FP)(TP+FN)}\times \frac{(TP+FP)(TP+FN)}{TP(TP+FN)+TP(TP+FP)}\]
Cancel the denominator of the first term with the numerator of the second term:
\[F1 =\frac{2(TP)^2}{TP(TP+FN)+TP(TP+FP)}\] Expand the denominator: \[F1 =\frac{2(TP)^2}{TP^2+FN\cdot TP+TP^2+TP\cdot FP}\]
Factor \(TP\) out of the denominator: \[F1 =\frac{2(TP)^2}{TP(TP+FN+TP+FP)}\]
\[F1 =\frac{2(TP)^2}{TP(2TP +FN +FP)}\] Cancel a common factor of \(TP\) (this step requires \(TP>0\)) to reach the final form \[F1=\frac{2TP}{2TP+FN+FP}\]
In the equation above, \(TP,FN,FP \in \mathbb{N}\) where \(\mathbb{N} = \{0,1,2,3...\}\)
Since \(2TP \le 2TP+FN+FP\), the numerator is never larger than the denominator, so \(F1\le1\).
The maximum value of \(1\) is attained when \(FN=FP=0\) and \(TP > 0\).
The fraction \(\frac{2TP}{2TP+FN+FP}\) has a minimum value of zero, attained when the numerator is zero (i.e. \(TP=0\)) while \(FN+FP>0\).
We have shown that \(0\le F1\le1\)
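As a quick numeric sanity check of the bound, here is a small helper (a hypothetical sketch, not part of hw2.R) that computes F1 directly from counts:

# Hypothetical helper for checking the F1 bound; not part of hw2.R
f1_from_counts <- function(TP, FP, FN){
  p <- TP / (TP + FP)   # precision
  s <- TP / (TP + FN)   # sensitivity (recall)
  (2 * p * s) / (p + s)
}
f1_from_counts(119, 30, 5)    # counts from our confusion matrix: ~0.872
f1_from_counts(1, 1000, 1000) # very poor classifier: tiny, but still in (0, 1]
f1_from_counts(0, 5, 5)       # TP = 0 gives 0/0 = NaN, which is why the cancellation step needs TP > 0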
ROC Curves
The output from the pROC package and the output from the ROC function our team created are very similar.
pROC Output
par(pty = "s")
roc(df[[actual]], df[[probability]], plot = TRUE, legacy.axes = TRUE, print.auc = TRUE)
Call:
roc.default(response = df[[actual]], predictor = df[[probability]], plot = TRUE, legacy.axes = TRUE, print.auc = TRUE)
Data: df[[probability]] in 124 controls (df[[actual]] 0) < 57 cases (df[[actual]] 1).
Area under the curve: 0.8503
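For comparison, our own roc_curve() function (defined in the Source Code section below) returns the plot, the AUC, and the underlying TPR/FPR table; a sketch of the call used to check it against pROC:

our_roc <- roc_curve(df, actual, probability)
our_roc$plot  # ggplot ROC curve with the AUC annotated in the corner
our_roc$auc   # Riemann-sum AUC (the report notes it is very similar to pROC's)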
Source Code
library(dplyr)
library(ggplot2)
confusion_matrix <- function(df, actual, predicted){
  cm <- table(df[[predicted]], df[[actual]])
  if(nrow(cm) == ncol(cm)){
    return(cm)
  }
  # Make sure any class missing from the predictions still appears in the
  # confusion matrix as a row of zeros
  fixed_cm <- matrix(0, ncol(cm), ncol(cm))
  colnames(fixed_cm) <- colnames(cm)
  rownames(fixed_cm) <- colnames(cm)
  # Horribly inefficient at large scale, but since we're dealing with a confusion matrix...
  for(r in rownames(cm)){
    for(c in colnames(cm)){
      fixed_cm[r, c] <- cm[r, c]
    }
  }
  return(fixed_cm)
}
confusion_matrix_outcomes <- function(cm){
  # This will obviously need some work to handle more than a 2x2 matrix.
  # Cells are read in column-major order from table(predicted, actual),
  # with the first level ("0" here) treated as the positive class.
  TP <- cm[1]
  FN <- cm[2]
  FP <- cm[3]
  TN <- cm[4]
  list("TN" = TN, "FP" = FP, "FN" = FN, "TP" = TP)
}
# Write a function that takes the data set as a dataframe, with actual and
# predicted classifications identified, and returns the accuracy of the
# predictions.
accuracy <- function(df, actual, predicted){
  cm <- confusion_matrix_outcomes(confusion_matrix(df, actual, predicted))
  (cm$TP + cm$TN) / (cm$TP + cm$FP + cm$TN + cm$FN)
}
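# Sanity check (an editorial sketch, not part of the original hw2.R):
# accuracy() should equal the share of rows where the predicted class
# matches the actual class, e.g.
# accuracy(df, "class", "scored.class") == mean(df$class == df$scored.class)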
# Write a function that takes the data set as a dataframe, with actual and
# predicted classifications identified, and returns the classification error
# rate of the predictions.
# Verify that you get an accuracy and an error rate that sums to one.
classification_error_rate <- function(df, actual, predicted){
  cm <- confusion_matrix_outcomes(confusion_matrix(df, actual, predicted))
  (cm$FP + cm$FN) / (cm$TP + cm$FP + cm$TN + cm$FN)
}
# Write a function that takes the data set as a dataframe, with actual and
# predicted classifications identified, and returns the precision of the
# predictions.
precision <- function(df, actual, predicted){
  cm <- confusion_matrix_outcomes(confusion_matrix(df, actual, predicted))
  cm$TP / (cm$TP + cm$FP)
}
# Write a function that takes the data set as a dataframe, with actual and
# predicted classifications identified, and returns the sensitivity of the
# predictions. Sensitivity is also known as recall.
sensitivity <- function(df, actual, predicted){
  cm <- confusion_matrix_outcomes(confusion_matrix(df, actual, predicted))
  cm$TP / (cm$TP + cm$FN)
}

recall <- function(df, actual, predicted){
  sensitivity(df, actual, predicted)
}
# Write a function that takes the data set as a dataframe, with actual and
# predicted classifications identified, and returns the specificity of the
# predictions.
specificity <- function(df, actual, predicted){
  cm <- confusion_matrix_outcomes(confusion_matrix(df, actual, predicted))
  cm$TN / (cm$TN + cm$FP)
}
# Write a function that takes the data set as a dataframe, with actual and
# predicted classifications identified, and returns the F1 score of the
# predictions.
f1_score <- function(df, actual, predicted){
  p <- precision(df, actual, predicted)
  s <- sensitivity(df, actual, predicted)
  (2 * p * s) / (p + s)
}
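# Note (editorial aside, not part of the original hw2.R): by the algebra in the
# F1 section above, this is equivalent to 2*TP / (2*TP + FN + FP) whenever TP > 0.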
# Write a function that generates an ROC curve from a data set with a true
# classification column (class in our example) and a probability column
# (scored.probability in our example). Your function should return a list
# that includes the plot of the ROC curve and a vector that contains the
# calculated area under the curve (AUC). Note that I recommend using a
# sequence of thresholds ranging from 0 to 1 at 0.01 intervals.
roc_curve <- function(df, actual, probability, interval = 0.01){
  outcome <- data.frame(matrix(ncol = 3, nrow = 0))
  names(outcome) <- c("prob", "TPR", "FPR")
  for (threshold in seq(0, 1, interval)){
    df$roc_prediction <- ifelse(df[[probability]] > threshold, 1, 0)
    s <- sensitivity(df, actual, "roc_prediction")
    f <- 1 - specificity(df, actual, "roc_prediction")
    row <- data.frame(prob = threshold, TPR = s, FPR = f)
    outcome <- rbind(outcome, row)
  }
  # Approximate the AUC with a Riemann sum over the FPR steps. FPR increases
  # with the threshold here (the positive class is 0, and predictions flip to
  # 0 as the threshold rises), so the step widths are non-negative.
  outcome$area <- (outcome$FPR - dplyr::lag(outcome$FPR)) * outcome$TPR
  auc <- sum(outcome$area, na.rm = TRUE)
  roc_plot <- ggplot(outcome) +
    geom_line(aes(FPR, TPR), color = "dodgerblue") +
    xlab("False Positive Rate (FPR)") +
    ylab("True Positive Rate (TPR)") +
    theme_minimal() +
    annotate(geom = "text", x = 0.7, y = 0.07, label = paste("AUC:", round(auc, 3))) +
    geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
    coord_equal(ratio = 1)
  return(list(plot = roc_plot, auc = auc, outcome = outcome))
}