DATA 621 Homework #2
From Work by Critical Thinking Group 3
# Read the classification output data and identify the columns of interest
df <- read.csv("./data/classification-output-data.csv")
actual <- "class"
predicted <- "scored.class"
probability <- "scored.probability"
# Load our functions
source("hw2.R")
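For context, here is a quick look at the three columns used throughout the report (a minimal sketch, assuming the CSV contains columns with the names above):
head(df[c(actual, predicted, probability)])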
Caret’s Output
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 119 30
1 5 27
Accuracy : 0.8066
95% CI : (0.7415, 0.8615)
No Information Rate : 0.6851
P-Value [Acc > NIR] : 0.0001712
Kappa : 0.4916
Mcnemar's Test P-Value : 4.976e-05
Sensitivity : 0.9597
Specificity : 0.4737
Pos Pred Value : 0.7987
Neg Pred Value : 0.8438
Prevalence : 0.6851
Detection Rate : 0.6575
Detection Prevalence : 0.8232
Balanced Accuracy : 0.7167
'Positive' Class : 0
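For reference, output like the above comes from caret's confusionMatrix function; a minimal sketch, assuming the caret package is installed (with the default settings the first factor level, 0, is treated as the positive class, which matches the output shown):
library(caret)
confusionMatrix(data = factor(df[[predicted]]), reference = factor(df[[actual]]))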
Our Function’s Output
Confusion Matrix
| Prediction \ Actual | 0 | 1 |
|---|---|---|
| 0 | 119 | 30 |
| 1 | 5 | 27 |
Accuracy: 0.8066298
Classification Error Rate: 0.1933702
Precision: 0.7986577
Sensitivity (Recall): 0.9596774
Specificity: 0.4736842
F1 Score: 0.8717949
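These values come from our own functions, defined in the Source Code section below, and can be reproduced with calls such as:
confusion_matrix(df, actual, predicted)
accuracy(df, actual, predicted)
classification_error_rate(df, actual, predicted)
precision(df, actual, predicted)
sensitivity(df, actual, predicted)
specificity(df, actual, predicted)
f1_score(df, actual, predicted)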
The F1 Score Will Always Be Between 0 and 1
Before we move on, let’s consider a question that was asked: what are the bounds on the F1 score? Show that the F1 score will always be between 0 and 1. (Hint: if \(0 < a < 1\) and \(0 < b < 1\), then \(ab < a\).)
Proof that the F1 score lies between 0 and 1
Let’s expand the F1 equation using the definitions of precision and sensitivity.
\[\textrm{F1 Score}= \frac{2 \times\mathrm{Precision}\times\mathrm{Sensitivity}}{\mathrm{Precision}+\mathrm{Sensitivity}}\] The numerator is equal to \[2 \times \mathrm{Precision} \times \mathrm{Sensitivity} = \frac{2 \times TP \times TP}{(TP + FP)(TP+FN)}\] and dividing by the denominator is the same as multiplying by its reciprocal, \[\frac{1}{\frac{TP}{TP+FP}+\frac{TP}{TP+FN}}\]
Combining the two factors: \[F1=\frac{2 \times TP \times TP}{(TP + FP)(TP+FN)} \times \frac{1}{\frac{TP}{TP+FP}+\frac{TP}{TP+FN}}\]
Multiply the numerator and denominator of the second factor by \((TP + FP)(TP+FN)\):
\[\frac{1}{\frac{TP}{TP+FP}+\frac{TP}{TP+FN}} = \frac{(TP+FP)(TP+FN)}{TP(TP+FN)+TP(TP+FP)}\]
\[F1=\frac{2(TP)^2}{(TP+FP)(TP+FN)}\times \frac{(TP+FP)(TP+FN)}{TP(TP+FN)+TP(TP+FP)}\]
Cancel the denominator of the first term with the numerator of the second term:
\[F1 =\frac{2(TP)^2}{TP(TP+FN)+TP(TP+FP)}\] Simplify the denominator: \[F1 =\frac{2(TP)^2}{TP^2 + FN \cdot TP + TP^2 + TP \cdot FP}\]
\[F1 =\frac{2(TP)^2}{TP(TP+FN+TP+FP)}\]
\[F1 =\frac{2(TP)^2}{TP(2TP +FN +FP)}\] Cancel a factor of \(TP\) to obtain the final form: \[F1=\frac{2TP}{2TP+FN+FP}\]
In the equation above, \(TP, FN, FP \in \mathbb{N}\), where \(\mathbb{N} = \{0, 1, 2, 3, \ldots\}\), so every term is non-negative (we assume at least one count is non-zero so that the fraction is defined).
Since \(2TP \le 2TP + FN + FP\), the numerator is at most the denominator, so \(F1 \le 1\). The maximum value of \(1\) is attained when \(FN = FP = 0\) and \(TP > 0\).
Because both the numerator and the denominator are non-negative, \(F1 \ge 0\); the minimum value of \(0\) occurs when the numerator is minimized, i.e. when \(TP = 0\).
We have shown that \(0 \le F1 \le 1\).
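As a quick numerical sanity check (a sketch; the grid of counts below is purely illustrative), the simplified form agrees with the precision/sensitivity form whenever \(TP > 0\) and always stays within \([0, 1]\):
grid <- expand.grid(TP = 1:20, FP = 0:20, FN = 0:20)
p <- grid$TP / (grid$TP + grid$FP)  # precision
s <- grid$TP / (grid$TP + grid$FN)  # sensitivity (recall)
f1_rates <- (2 * p * s) / (p + s)
f1_simplified <- (2 * grid$TP) / (2 * grid$TP + grid$FN + grid$FP)
all.equal(f1_rates, f1_simplified)  # TRUE
range(f1_simplified)                # stays within [0, 1]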
pROC Output
par(pty = "s")
roc(df[[actual]], df[[probability]], plot = TRUE, legacy.axes = TRUE, print.auc = TRUE)
Call:
roc.default(response = df[[actual]], predictor = df[[probability]], plot = TRUE, legacy.axes = TRUE, print.auc = TRUE)
Data: df[[probability]] in 124 controls (df[[actual]] 0) < 57 cases (df[[actual]] 1).
Area under the curve: 0.8503
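The AUC can also be pulled from the pROC object directly (a brief sketch; roc_obj is simply a name for the stored result of the roc() call above):
roc_obj <- roc(df[[actual]], df[[probability]])
as.numeric(auc(roc_obj))  # 0.8503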
Our ROC Curve
The named vector below is the auc_vector returned by our roc_curve function: the contribution to the total area under the curve at each threshold from 0 to 1 in steps of 0.01. An entry is zero where the false positive rate does not change from the previous threshold (or where the true positive rate is still zero), and the entries sum to our total AUC estimate.
0 0.01 0.02 0.03 0.04 0.05
0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
0.06 0.07 0.08 0.09 0.1 0.11
0.000000000 0.000000000 0.000000000 0.000000000 0.004810413 0.005942275
0.12 0.13 0.14 0.15 0.16 0.17
0.006366723 0.000000000 0.000000000 0.000000000 0.000000000 0.028013582
0.18 0.19 0.2 0.21 0.22 0.23
0.009620826 0.010186757 0.000000000 0.010752688 0.000000000 0.000000000
0.24 0.25 0.26 0.27 0.28 0.29
0.011460102 0.011601585 0.011884550 0.012026033 0.012308998 0.000000000
0.3 0.31 0.32 0.33 0.34 0.35
0.000000000 0.000000000 0.014006791 0.014148274 0.014148274 0.000000000
0.36 0.37 0.38 0.39 0.4 0.41
0.000000000 0.000000000 0.030843237 0.000000000 0.015421619 0.031126203
0.42 0.43 0.44 0.45 0.46 0.47
0.000000000 0.015846067 0.015987550 0.000000000 0.049235993 0.016411998
0.48 0.49 0.5 0.51 0.52 0.53
0.016411998 0.016553480 0.000000000 0.000000000 0.000000000 0.000000000
0.54 0.55 0.56 0.57 0.58 0.59
0.017119411 0.017119411 0.000000000 0.000000000 0.000000000 0.000000000
0.6 0.61 0.62 0.63 0.64 0.65
0.017260894 0.000000000 0.051782683 0.000000000 0.034804754 0.000000000
0.66 0.67 0.68 0.69 0.7 0.71
0.000000000 0.017402377 0.017402377 0.034804754 0.017402377 0.017402377
0.72 0.73 0.74 0.75 0.76 0.77
0.017402377 0.017402377 0.000000000 0.000000000 0.000000000 0.017402377
0.78 0.79 0.8 0.81 0.82 0.83
0.000000000 0.017402377 0.000000000 0.000000000 0.017402377 0.000000000
0.84 0.85 0.86 0.87 0.88 0.89
0.017402377 0.017402377 0.017402377 0.017402377 0.000000000 0.052207131
0.9 0.91 0.92 0.93 0.94 0.95
0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.017543860
0.96 0.97 0.98 0.99 1
0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
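The output above can be reproduced with our roc_curve function, defined in the Source Code section that follows:
results <- roc_curve(df, actual, probability)
results$plot        # ggplot ROC curve with the AUC annotated
results$auc         # total area under the curve
results$auc_vector  # the per-threshold contributions printed above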
Source Code
library(dplyr)
library(ggplot2)
confusion_matrix <- function(df, actual, predicted){
  cm <- table(df[[predicted]], df[[actual]])
  if (nrow(cm) == ncol(cm)){
    return(cm)
  }
  # Pad the table so that every class appears as a row, even when it is
  # never predicted at a given threshold
  fixed_cm <- matrix(0, ncol(cm), ncol(cm))
  colnames(fixed_cm) <- colnames(cm)
  rownames(fixed_cm) <- colnames(cm)
  # Horribly inefficient at large scale but since we're dealing with a
  # confusion matrix...
  for (r in rownames(cm)){
    for (c in colnames(cm)){
      fixed_cm[r, c] <- cm[r, c]
    }
  }
  return(fixed_cm)
}
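# Example usage (illustrative): confusion_matrix(df, "class", "scored.class")
# returns a 2x2 table with predicted classes on the rows and actual classes on
# the columns, padded with a zero row when a class is never predicted.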
confusion_matrix_outcomes <- function(cm){
  # This will obviously need some work to handle more than a 2x2 matrix.
  # With table(predicted, actual) and column-major indexing, cm[1] is the
  # (predicted = 0, actual = 0) cell, so class 0 is treated as the positive
  # class, consistent with caret's "'Positive' Class : 0" output above.
  TP <- cm[1]
  FN <- cm[2]
  FP <- cm[3]
  TN <- cm[4]
  list("TN" = TN, "FP" = FP, "FN" = FN, "TP" = TP)
}
# Write a function that takes the data set as a dataframe, with actual and
# predicted classifications identified, and returns the accuracy of the
# predictions.
accuracy <- function(df, actual, predicted){
  cm <- confusion_matrix_outcomes(confusion_matrix(df, actual, predicted))
  (cm$TP + cm$TN) / (cm$TP + cm$FP + cm$TN + cm$FN)
}
# Write a function that takes the data set as a dataframe, with actual and
# predicted classifications identified, and returns the classification error
# rate of the predictions.
classification_error_rate <- function(df, actual, predicted){
  cm <- confusion_matrix_outcomes(confusion_matrix(df, actual, predicted))
  (cm$FP + cm$FN) / (cm$TP + cm$FP + cm$TN + cm$FN)
}
# Write a function that takes the data set as a dataframe, with actual and
# predicted classifications identified, and returns the precision of the
# predictions.
precision <- function(df, actual, predicted){
  cm <- confusion_matrix_outcomes(confusion_matrix(df, actual, predicted))
  cm$TP / (cm$TP + cm$FP)
}
# Write a function that takes the data set as a dataframe, with actual and
# predicted classifications identified, and returns the sensitivity of the
# predictions. Sensitivity is also known as recall.
sensitivity <- function(df, actual, predicted){
  cm <- confusion_matrix_outcomes(confusion_matrix(df, actual, predicted))
  cm$TP / (cm$TP + cm$FN)
}
recall <- function(df, actual, predicted){
  sensitivity(df, actual, predicted)
}
# Write a function that takes the data set as a dataframe, with actual and
# predicted classifications identified, and returns the specificity of the
# predictions.
specificity <- function(df, actual, predicted){
  cm <- confusion_matrix_outcomes(confusion_matrix(df, actual, predicted))
  cm$TN / (cm$TN + cm$FP)
}
# Write a function that takes the data set as a dataframe, with actual and
# predicted classifications identified, and returns the F1 score of the
# predictions.
f1_score <- function(df, actual, predicted){
  p <- precision(df, actual, predicted)
  s <- sensitivity(df, actual, predicted)
  (2 * p * s) / (p + s)
}
# Write a function that generates an ROC curve from a data set with a true
# classification column (class in our example) and a probability column
# (scored.probability in our example). Your function should return a list
# that includes the plot of the ROC curve and a vector that contains the
# calculated area under the curve (AUC). Note that I recommend using a
# sequence of thresholds ranging from 0 to 1 at 0.01 intervals.
roc_curve <- function(df, actual, probability, interval = 0.01){
  outcome <- data.frame(matrix(ncol = 3, nrow = 0))
  names(outcome) <- c("probability", "TPR", "FPR")
  for (threshold in seq(0, 1, interval)){
    df$roc_prediction <- ifelse(df[[probability]] > threshold, 1, 0)
    tpr <- sensitivity(df, actual, "roc_prediction")
    fpr <- 1 - specificity(df, actual, "roc_prediction")
    row <- data.frame(probability = threshold, TPR = tpr, FPR = fpr)
    outcome <- rbind(outcome, row)
  }
  # Approximate the area under the curve with one rectangle per threshold step
  outcome$area <- (outcome$FPR - dplyr::lag(outcome$FPR)) * outcome$TPR
  # Fill in missing values with zeros
  outcome <- outcome %>%
    mutate(area = ifelse(is.na(area), 0, area))
  # Create a vector of the area contributions, named by threshold
  auc_vector <- outcome$area
  names(auc_vector) <- outcome$probability
  # Sum it to get total area under the curve
  auc <- sum(auc_vector)
  # Create a ROC plot
  roc_plot <- ggplot(outcome) +
    geom_line(aes(FPR, TPR), color = "dodgerblue") +
    xlab("False Positive Rate (FPR)") +
    ylab("True Positive Rate (TPR)") +
    theme_minimal() +
    annotate(geom = "text", x = 0.7, y = 0.07,
             label = paste("AUC:", round(auc, 3))) +
    geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
    coord_equal(ratio = 1)
  # Return the plot, the total AUC, and the per-threshold contributions
  return(list(plot = roc_plot, auc = auc, auc_vector = auc_vector))
}