Work through various classification metrics, creating functions in R to carry out the calculations. Functions from packages that provide equivalent results will also be investigated. Finally, graphical outputs will be created that can be used to evaluate the output of classification models, such as binary logistic regression models.
Download the classification output data set.
classification.data <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/",
"SPS/master/DATA%20621/classification-output-data.csv"))
is.data.frame(classification.data)
## [1] TRUE
The data set has three key columns we will use:

+ class: the actual class for the observation
+ scored.class: the predicted class for the observation (based on a threshold of 0.5)
+ scored.probability: the predicted probability of success for the observation
classification.data <- classification.data[,c("class","scored.class","scored.probability")]
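Before computing any metrics, it helps to glance at the class balance, since the majority-class proportion reappears later as caret's No Information Rate (a minimal base-R sketch):
# counts and proportions of the actual classes
table(classification.data$class)
prop.table(table(classification.data$class))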
Use the table() function to get the raw confusion matrix for this scored dataset. Make sure you understand the output. In particular, do the rows represent the actual or predicted class? The columns?
# table() sorts the 0/1 values in ascending order (0 first)
wrong_order <- table(classification.data$scored.class, classification.data$class)
# reverse both rows and columns so the positive class ("1") appears first
cm <- apply(apply(wrong_order, 1, rev), 1, rev)
colnames(cm) <- c("Class 1", "Class 0"); rownames(cm) <- c("Scored 1", "Scored 0"); cm
##
## Class 1 Class 0
## Scored 1 27 5
## Scored 0 30 119
In the above output the rows represent the predicted class and the columns represent the actual class.
\(n=181\) | Actual 1 | Actual 0 |
---|---|---|
Predicted 1 | \(TP = A = 27\) | \(FP = B = 5\) |
Predicted 0 | \(FN = C = 30\) | \(TN = D = 119\) |
For this dataset the class value 1 is interpreted as “tested positive for diabetes.” True Positive (TP) = \(A\), False Positive (FP) = \(B\), False Negative (FN) = \(C\), and True Negative (TN) = \(D\). These designations depend on what is considered a “positive” for a given dataset and on how the Predicted and Actual values are arranged (transposed or not) in the confusion matrix.
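As a quick orientation check, naming the arguments to table() attaches dimension labels to the printed table, making the predicted-versus-actual layout explicit without any manual relabeling (a minimal sketch using only base R):
# named arguments become dimension labels: rows are predictions, columns are actual classes
table(Predicted = classification.data$scored.class,
      Actual = classification.data$class)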
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the accuracy of the predictions. \[\textrm{Accuracy} = \cfrac { TP+TN }{ TP + TN + FP + FN } = \cfrac { A+D }{ A + B + C + D }\]
my_accuracy <- function(dataframe, predicted, actual) {
datatable <- table(dataframe[ ,predicted], dataframe[ ,actual])
confusion_matrix <- apply(apply(datatable, 1, rev), 1, rev)
A <- confusion_matrix[1,1]; B <- confusion_matrix[1,2]
C <- confusion_matrix[2,1]; D <- confusion_matrix[2,2]
accuracy <- (A + D) / (A + B + C + D)
return (accuracy)
}
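As a sanity check (a minimal sketch), accuracy is simply the proportion of observations whose predicted class equals the actual class:
# proportion of correct predictions; should match the my_accuracy() result reported below
mean(classification.data$scored.class == classification.data$class)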
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the classification error rate of the predictions. \[\textrm{Classification Error Rate} = \cfrac { FP+FN }{ TP + TN + FP + FN } = \cfrac { B+C }{ A + B + C + D }\]
my_error <- function(dataframe, predicted, actual) {
datatable <- table(dataframe[ ,predicted], dataframe[ ,actual])
confusion_matrix <- apply(apply(datatable, 1, rev), 1, rev)
A <- confusion_matrix[1,1]; B <- confusion_matrix[1,2]
C <- confusion_matrix[2,1]; D <- confusion_matrix[2,2]
error <- (C + B) / (A + B + C + D)
return (error)
}
Verify that you get an accuracy and an error rate that sum to one. \[\cfrac { A+D }{ A+B+C+D } + \cfrac { B+C }{ A+B+C+D }= \cfrac { A+B+C+D }{ A+B+C+D } = 1\]
my_accuracy(classification.data, "scored.class", "class") +
my_error(classification.data, "scored.class", "class")
## [1] 1
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the precision of the predictions. \[\textrm{Precision} = \cfrac { TP }{ TP+FP } = \cfrac { A }{ A+B }\]
my_precision <- function(dataframe, predicted, actual) {
datatable <- table(dataframe[ ,predicted], dataframe[ ,actual])
confusion_matrix <- apply(apply(datatable, 1, rev), 1, rev)
A <- confusion_matrix[1,1]; B <- confusion_matrix[1,2]
C <- confusion_matrix[2,1]; D <- confusion_matrix[2,2]
precision <- A / (A + B)
return (precision)
}
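For an independent check, precision corresponds to the positive predictive value reported by the caret package examined later in this write-up (a sketch, assuming caret is installed; posPredValue() expects factors):
# positive predictive value = TP / (TP + FP); should match my_precision()
caret::posPredValue(factor(classification.data$scored.class),
                    factor(classification.data$class), positive = "1")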
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the sensitivity of the predictions. Sensitivity is also known as recall. \[\textrm{Sensitivity} = \textrm{Recall} = \cfrac { TP }{ TP+FN } = \cfrac { A }{ A+C }\]
my_sensitivity <- function(dataframe, predicted, actual) {
datatable <- table(dataframe[ ,predicted], dataframe[ ,actual])
confusion_matrix <- apply(apply(datatable, 1, rev), 1, rev)
A <- confusion_matrix[1,1]; B <- confusion_matrix[1,2]
C <- confusion_matrix[2,1]; D <- confusion_matrix[2,2]
sensitivity <- A / (A + C)
return (sensitivity)
}
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the specificity of the predictions. \[\textrm{Specificity} = \cfrac { TN }{ TN+FP } = \cfrac { D }{ B+D }\]
my_specificity <- function(dataframe, predicted, actual) {
datatable <- table(dataframe[ ,predicted], dataframe[ ,actual])
confusion_matrix <- apply(apply(datatable, 1, rev), 1, rev)
A <- confusion_matrix[1,1]; B <- confusion_matrix[1,2]
C <- confusion_matrix[2,1]; D <- confusion_matrix[2,2]
specificity <- D / (B + D)
return (specificity)
}
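Both of these rates can also be checked directly from the raw columns, without building a confusion matrix (a minimal base-R sketch):
# sensitivity: correctly predicted positives over all actual positives
with(classification.data, sum(scored.class == 1 & class == 1) / sum(class == 1))
# specificity: correctly predicted negatives over all actual negatives
with(classification.data, sum(scored.class == 0 & class == 0) / sum(class == 0))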
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the \(F_1\) score of the predictions. \[F_1 =\cfrac { 2\cdot \textrm{Precision} \cdot \textrm{Sensitivity} }{ \textrm{Precision} + \textrm{Sensitivity} }=\cfrac { 2\cdot \textrm{Precision} \cdot \textrm{Recall} }{ \textrm{Precision} + \textrm{Recall} }\]
my_F1_Score <- function(dataframe, predicted, actual) {
datatable <- table(dataframe[ ,predicted], dataframe[ ,actual])
confusion_matrix <- apply(apply(datatable, 1, rev), 1, rev)
A <- confusion_matrix[1,1]; B <- confusion_matrix[1,2]
C <- confusion_matrix[2,1]; D <- confusion_matrix[2,2]
F1_Score <- (2 *(A / (A + B)) * (A / (A + C))) / ((A / (A + B)) + (A / (A + C)))
return (F1_Score)
}
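Because the components were already written above, the same \(F_1\) value can be assembled from my_precision() and my_sensitivity(), which makes the harmonic-mean structure explicit (a small sketch):
p <- my_precision(classification.data, "scored.class", "class")
r <- my_sensitivity(classification.data, "scored.class", "class")
2 * p * r / (p + r)   # harmonic mean of precision and recall; matches my_F1_Score()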
Before we move on, let’s consider a question that was asked: What are the bounds on the \(F_1\) score? Show that the \(F_1\) score will always be between 0 and 1. (Hint: If \(0 < a < 1\) and \(0 < b < 1\), then \(ab<a\).)
\[F_{ 1 } =\cfrac { 2\cdot \textrm{Precision} \cdot \textrm{Sensitivity} }{ \textrm{Precision} + \textrm{Sensitivity} } = 2\frac { \left( \frac { A }{ A+B } \right) \left( \frac { A }{ A+C } \right) }{ \left( \frac { A }{ A+B } \right) +\left( \frac { A }{ A+C } \right) } \tag{9.1}\] By construction, the denominators of the Precision and Sensitivity classification metrics satisfy \(A+B \geq A\) and \(A+C \geq A\). Hence, \(\textrm{Precision}=\frac { A }{ A+B } \in [0,1]\) and \(\textrm{Sensitivity}=\frac { A }{ A+C } \in [0,1]\). Simplifying equation 9.1 above, we get:
\[F_{ 1 }=\frac { 2\frac { { A }^{ 2 } }{ \left( A+B \right) \left( A+C \right) } }{ \frac { A\left( A+C \right) +A\left( A+B \right) }{ \left( A+B \right) \left( A+C \right) } } =\frac { 2{ A }^{ 2 } }{ A\left( A+C \right) +A\left( A+B \right) } =\frac { 2{ A }^{ 2 } }{ { A }^{ 2 }+AC+{ A }^{ 2 }+AB } =\frac { 2{ A }^{ 2 } }{ 2{ A }^{ 2 }+AB+AC } \tag{9.2}\] Now, considering an extreme case where the actual values are all positive and the model predicts all positive results such that both \(A>0\) and \(B=C=0\), we can find the upper limit boundary point such that: \[F_{ 1 }\left( B=C=0 \right) =\frac { 2{ A }^{ 2 } }{ 2{ A }^{ 2 }+0A+0A } =\frac { 2{ A }^{ 2 } }{ 2{ A }^{ 2 }+0+0 } =\frac { 2{ A }^{ 2 } }{ 2{ A }^{ 2 } } =1 \tag{9.3}\]
Then, considering the extreme case where there are no true positives, such that \(A=0\) while \(B+C>0\), we can find the lower limit boundary point. For \(A>0\) the common factor of \(A\) in equation 9.2 cancels, giving: \[F_{ 1 }=\frac { 2{ A }^{ 2 } }{ 2{ A }^{ 2 }+AB+AC } =\frac { 2A }{ 2A+B+C } \tag{9.4}\] As \(A\) decreases toward 0 with \(B+C>0\), this expression decreases toward \[F_{ 1 }\left( A=0 \right) =\frac { 2\cdot 0 }{ 2\cdot 0+B+C } =\frac { 0 }{ B+C } =0, \tag{9.5}\] which is the value conventionally assigned to \(F_1\) when there are no true positives. Since every term in the numerator and denominator of equation 9.2 is non-negative and the numerator never exceeds the denominator, \(F_1\) is bounded between these two limits, i.e. \(F_1 \in [0,1]\).
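A brute-force numerical check of these bounds is also easy to run (a sketch; the grid of cell counts is arbitrary):
# sweep over non-negative cell counts and confirm F1 stays in [0, 1];
# the A = 0 case is assigned its limiting value of 0
grid <- expand.grid(A = 0:20, B = 0:20, C = 0:20)
f1 <- with(grid, ifelse(A == 0, 0, 2 * A^2 / (2 * A^2 + A * B + A * C)))
range(f1)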
Write a function that generates an ROC curve from a data set with a true classification column (class in our example) and a probability column (scored.probability in our example). Your function should return a list that includes the plot of the ROC curve and a vector that contains the calculated area under the curve (AUC). Use a sequence of thresholds ranging from 0 to 1 at 0.01 intervals.
my_ROC_Curve <- function(dataframe, probabilities, actual) {
  thresholds <- seq(0, 1, 1/180)   # 181 evenly spaced thresholds on [0, 1]
  TPR <- FPR <- numeric(length(thresholds))
  Positive <- sum(dataframe[,actual] == 1)
  Negative <- sum(dataframe[,actual] == 0)
  for (i in 1:length(thresholds)) {
    # Upper bound probability threshold for equation 10.1
    data_subset <- subset(dataframe, dataframe[ ,probabilities] <= thresholds[i])
    # Condition numerator as outlined by indicator function in equation 10.2
    TP <- sum(data_subset[data_subset[,actual] == 1, probabilities] > 0.5)   # A
    TN <- sum(data_subset[data_subset[,actual] == 0, probabilities] <= 0.5)  # D
    FP <- sum(data_subset[data_subset[,actual] == 0, probabilities] > 0.5)   # B
    FN <- sum(data_subset[data_subset[,actual] == 1, probabilities] <= 0.5)  # C
    TPR[i] <- 1 - (TP + FN) / Positive   # Equation 10.3
    FPR[i] <- 1 - (TN + FP) / Negative   # Equation 10.4
  }
  plot(FPR, TPR, type = "l", lwd = 2,
       xlab = "False Positive Rate (1 - Specificity)",
       ylab = "True Positive Rate (Sensitivity)")
  abline(0, 1)
  # diff() returns n-1 suitably lagged and iterated differences
  diffFPR <- c(abs(diff(FPR)), 0); diffTPR <- c(abs(diff(TPR)), 0)
  # Variant of trapezoidal numerical integration from equation 10.5
  area_under_curve <- sum(TPR * diffFPR) - sum(diffTPR * diffFPR) / 2
  return (area_under_curve)
}
\[A:=[0,p] \tag{10.1}\]\[{ 1 }_{ A }\left( x \right) :=\begin{cases} 1 & x\in A \\ 0 & x\notin A \end{cases} \tag{10.2}\]\[TPR_{ i }=\textrm{Sensitivity}_i=1-\sum _{ i,j=1 }^{ n }{ \frac { { 1 }_{ A }\left( x \right) \left[ { TP }_{ i }+{ FN }_{ i } \right] }{ { TP }_{ j }+{ FN }_{ j } } } \tag{10.3}\]\[FPR_{ i }=1-\textrm{Specificity}_i=1-\sum _{ i,j=1 }^{ n }{ \frac { { 1 }_{ A }\left( x \right) \left[ { TN }_{ i }+{ FP }_{ i } \right] }{ { TN }_{ j }+{ FP }_{ j } } } \tag{10.4}\]\[\int _{ a }^{ b }{ f\left( x \right) dx } \approx \frac { 1 }{ 2 } \sum _{ n=0 }^{ N-1 }{ \left( { y }_{ n }+{ y }_{ n+1 } \right) \Delta x } \approx \frac { 1 }{ 2 } \sum _{ n=0 }^{ N-1 }{ \left( { y }_{ n }+{ y }_{ n+1 } \right) \left( x_{ n+1 }-{ x }_{ n } \right) } \tag{10.5}\]
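A threshold-free cross-check of the area is also possible: the AUC equals the probability that a randomly chosen positive observation receives a higher predicted probability than a randomly chosen negative one (the Mann–Whitney formulation), with ties counted as one half. A minimal base-R sketch:
# pairwise comparison of positive and negative scored probabilities;
# the result should be close to the trapezoidal estimate from my_ROC_Curve()
pos <- classification.data$scored.probability[classification.data$class == 1]
neg <- classification.data$scored.probability[classification.data$class == 0]
mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))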
Use your created R functions and the provided classification output data set to produce all of the classification metrics discussed above.
my_accuracy(classification.data, "scored.class", "class")
## [1] 0.8066298
my_error(classification.data, "scored.class", "class")
## [1] 0.1933702
my_precision(classification.data, "scored.class", "class")
## [1] 0.84375
my_sensitivity(classification.data, "scored.class", "class")
## [1] 0.4736842
my_specificity(classification.data, "scored.class", "class")
## [1] 0.9596774
my_F1_Score(classification.data, "scored.class", "class")
## [1] 0.6067416
my_ROC_Curve(classification.data, "scored.probability", "class")
## [1] 0.8504527
caret Package

Investigate the caret package. In particular, consider the functions confusionMatrix, sensitivity, and specificity. Apply the functions to the data set. How do the results compare with your own functions?
library(caret)
confusionMatrix(factor(classification.data$scored.class),
                factor(classification.data$class), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 119 30
## 1 5 27
##
## Accuracy : 0.8066
## 95% CI : (0.7415, 0.8615)
## No Information Rate : 0.6851
## P-Value [Acc > NIR] : 0.0001712
##
## Kappa : 0.4916
## Mcnemar's Test P-Value : 4.976e-05
##
## Sensitivity : 0.4737
## Specificity : 0.9597
## Pos Pred Value : 0.8438
## Neg Pred Value : 0.7987
## Prevalence : 0.3149
## Detection Rate : 0.1492
## Detection Prevalence : 0.1768
## Balanced Accuracy : 0.7167
##
## 'Positive' Class : 1
##
sensitivity(factor(classification.data$scored.class), factor(classification.data$class), positive = "1")
## [1] 0.4736842
specificity(factor(classification.data$scored.class), factor(classification.data$class), negative = "0")
## [1] 0.9596774
All the values match if the positive and negative outcomes are declared properly in both the written functions and the caret package functions. Something worth noting is that the my_precision value is displayed as Pos Pred Value in the confusionMatrix output.
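As an aside, newer versions of caret can report precision, recall, and the \(F_1\) score directly from confusionMatrix() when a mode argument is supplied (a sketch, assuming a caret version that supports this argument):
# mode = "everything" adds Precision, Recall, and F1 to the by-class output
confusionMatrix(factor(classification.data$scored.class),
                factor(classification.data$class),
                positive = "1", mode = "everything")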
pROC Package

Investigate the pROC package. Use it to generate an ROC curve for the data set. How do the results compare with your own functions?
library(pROC)
plot(roc(classification.data$class, classification.data$scored.probability, smooth=F))
auc(classification.data$class, classification.data$scored.probability)
## Area under the curve: 0.8503
The ROC curves are nearly identical except for the \(x\)-axis labeling and tick marks. The written function plots \(y = \textrm{Sensitivity}\) against \(x = 1 - \textrm{Specificity}\) with the \(x\)-axis running from 0 to 1. The pROC package function plots \(y = \textrm{Sensitivity}\) against \(x = \textrm{Specificity}\) with the \(x\)-axis running from 1 to 0. This amounts to the same plot with slightly different labeling and tick marks. The Area Under the Curve (AUC) values agree to three decimal places; the slight difference is likely due to the number of thresholds used or the approximation method applied to calculate the area.
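If an exact visual match to the hand-rolled axes is preferred, pROC's plot method accepts a legacy.axes argument that switches the \(x\)-axis to \(1 - \textrm{Specificity}\) running from 0 to 1 (a sketch, assuming a pROC version with this argument):
# legacy.axes = TRUE plots 1 - Specificity on the x-axis, running from 0 to 1
plot(roc(classification.data$class, classification.data$scored.probability),
     legacy.axes = TRUE)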
Linear Models with R, Faraway, 2005.
A Modern Approach to Regression with R, Sheather, 2009.
http://gim.unmc.edu/dxtests/roc2.htm
http://www.cs.uu.nl/docs/vakken/dm/hc6.pdf
http://www.saedsayad.com/model_evaluation_c.htm
https://cran.r-project.org/web/packages/caret/caret.pdf
http://calculus.seas.upenn.edu/?n=Main.NumericalIntegration
https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
https://stat.ethz.ch/R-manual/R-devel/library/base/html/diff.html