Overview

Work through various classification metrics by writing functions in R that carry out the calculations. Functions from packages that provide equivalent results will also be investigated. Finally, graphical outputs that can be used to evaluate the output of classification models, such as binary logistic regression, will be created.

1. Import Data

Download the classification output data set.

classification.data <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/",
                            "SPS/master/DATA%20621/classification-output-data.csv"))
is.data.frame(classification.data)
## [1] TRUE

2. Confusion Matrix

The data set has three key columns we will use:

+ class: the actual class for the observation
+ scored.class: the predicted class for the observation (based on a threshold of 0.5)
+ scored.probability: the predicted probability of success for the observation

classification.data <- classification.data[,c("class","scored.class","scored.probability")]

Use the table() function to get the raw confusion matrix for this scored dataset. Make sure you understand the output. In particular, do the rows represent the actual or predicted class? The columns?

wrong_order <- table(classification.data$scored.class, classification.data$class)
# Rotate the table 180 degrees so that class 1 leads in both dimensions
cm <- apply(apply(wrong_order, 1, rev), 1, rev)
colnames(cm) <- c("Class 1", "Class 0"); rownames(cm) <- c("Scored 1", "Scored 0"); cm
##           
##            Class 1 Class 0
##   Scored 1      27       5
##   Scored 0      30     119

In the above output the rows represent the predicted class and the columns represent the actual class.

| \(n=181\)   | Actual 1         | Actual 0          |
|-------------|------------------|-------------------|
| Predicted 1 | \(TP = A = 27\)  | \(FP = B = 5\)    |
| Predicted 0 | \(FN = C = 30\)  | \(TN = D = 119\)  |

For this dataset the class value 1 is interpreted as “tested positive for diabetes.” True Positive (TP) = \(A\), False Positive (FP) = \(B\), False Negative (FN) = \(C\), and True Negative (TN) = \(D\). These designations depend on what is considered a “positive” for the dataset at hand and on how the Predicted and Actual values are arranged (transposed or not) in the confusion matrix.
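As a small sketch of that mapping (using only the cm object built above), the four cells can be pulled out by name rather than by position, which keeps the TP/FP/FN/TN assignment explicit however the table happens to be arranged:

TP <- cm["Scored 1", "Class 1"]  # A = 27
FP <- cm["Scored 1", "Class 0"]  # B = 5
FN <- cm["Scored 0", "Class 1"]  # C = 30
TN <- cm["Scored 0", "Class 0"]  # D = 119
c(TP = TP, FP = FP, FN = FN, TN = TN)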

3. Accuracy

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the accuracy of the predictions. \[\textrm{Accuracy} = \cfrac { TP+TN }{ TP + TN + FP + FN } = \cfrac { A+D }{ A + B + C + D }\]

my_accuracy <- function(dataframe, predicted, actual) {
  datatable <- table(dataframe[ ,predicted], dataframe[ ,actual])
  confusion_matrix <- apply(apply(datatable, 1, rev), 1, rev)
  A <- confusion_matrix[1,1]; B <- confusion_matrix[1,2]
  C <- confusion_matrix[2,1]; D <- confusion_matrix[2,2]
  accuracy <- (A + D) / (A + B + C + D)
  return (accuracy)
}

4. Classification Error Rate

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the classification error rate of the predictions. \[\textrm{Classification Error Rate} = \cfrac { FP+FN }{ TP + TN + FP + FN } = \cfrac { B+C }{ A + B + C + D }\]

my_error <- function(dataframe, predicted, actual) {
  datatable <- table(dataframe[ ,predicted], dataframe[ ,actual])
  confusion_matrix <- apply(apply(datatable, 1, rev), 1, rev)
  A <- confusion_matrix[1,1]; B <- confusion_matrix[1,2]
  C <- confusion_matrix[2,1]; D <- confusion_matrix[2,2]
  error <- (C + B) / (A + B + C + D)
  return (error)
}

Verify that the accuracy and the classification error rate sum to one. \[\cfrac { A+D }{ A+B+C+D } + \cfrac { B+C }{ A+B+C+D }= \cfrac { A+B+C+D }{ A+B+C+D } = 1\]

my_accuracy(classification.data, "scored.class", "class") +
my_error(classification.data, "scored.class", "class")
## [1] 1

5. Precision

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the precision of the predictions. \[\textrm{Precision} = \cfrac { TP }{ TP+FP } = \cfrac { A }{ A+B }\]

my_precision <- function(dataframe, predicted, actual) {
  datatable <- table(dataframe[ ,predicted], dataframe[ ,actual])
  confusion_matrix <- apply(apply(datatable, 1, rev), 1, rev)
  A <- confusion_matrix[1,1]; B <- confusion_matrix[1,2]
  C <- confusion_matrix[2,1]; D <- confusion_matrix[2,2]
  precision <- A / (A + B)
  return (precision)
}

6. Sensitivity

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the sensitivity of the predictions. Sensitivity is also known as recall. \[\textrm{Sensitivity} = \textrm{Recall} = \cfrac { TP }{ TP+FN } = \cfrac { A }{ A+C }\]

my_sensitivity <- function(dataframe, predicted, actual) {
  datatable <- table(dataframe[ ,predicted], dataframe[ ,actual])
  confusion_matrix <- apply(apply(datatable, 1, rev), 1, rev)
  A <- confusion_matrix[1,1]; B <- confusion_matrix[1,2]
  C <- confusion_matrix[2,1]; D <- confusion_matrix[2,2]
  sensitivity <- A / (A + C)
  return (sensitivity)
}

7. Specificity

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the specificity of the predictions. \[\textrm{Specificity} = \cfrac { TN }{ TN+FP } = \cfrac { D }{ B+D }\]

my_specificity <- function(dataframe, predicted, actual) {
  datatable <- table(dataframe[ ,predicted], dataframe[ ,actual])
  confusion_matrix <- apply(apply(datatable, 1, rev), 1, rev)
  A <- confusion_matrix[1,1]; B <- confusion_matrix[1,2]
  C <- confusion_matrix[2,1]; D <- confusion_matrix[2,2]
  specificity <- D / (B + D)
  return (specificity)
}

8. \(F_1\) Score

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the \(F_1\) score of the predictions. \[F_1 =\cfrac { 2\cdot \textrm{Precision} \cdot \textrm{Sensitivity} }{ \textrm{Precision} + \textrm{Sensitivity} }=\cfrac { 2\cdot \textrm{Precision} \cdot \textrm{Recall} }{ \textrm{Precision} + \textrm{Recall} }\]

my_F1_Score <- function(dataframe, predicted, actual) {
  datatable <- table(dataframe[ ,predicted], dataframe[ ,actual])
  confusion_matrix <- apply(apply(datatable, 1, rev), 1, rev)
  A <- confusion_matrix[1,1]; B <- confusion_matrix[1,2]
  C <- confusion_matrix[2,1]; D <- confusion_matrix[2,2]
  F1_Score <- (2 *(A / (A + B)) * (A / (A + C))) / ((A / (A + B)) + (A / (A + C)))
  return (F1_Score)
}

9. \(F_1\) Score Bounds

Before we move on, let’s consider a question that was asked: What are the bounds on the \(F_1\) score? Show that the \(F_1\) score will always be between 0 and 1. (Hint: If \(0 < a < 1\) and \(0 < b < 1\), then \(ab<a\).)

\[F_{ 1 } =\cfrac { 2\cdot \textrm{Precision} \cdot \textrm{Sensitivity} }{ \textrm{Precision} + \textrm{Sensitivity} } = 2\frac { \left( \frac { A }{ A+B } \right) \left( \frac { A }{ A+C } \right) }{ \left( \frac { A }{ A+B } \right) +\left( \frac { A }{ A+C } \right) } \tag{9.1}\] By construction, the denominators of the Precision and Sensitivity classification metrics satisfy \(A+B \geq A\) and \(A+C \geq A\). Hence, \(\textrm{Precision}=\frac { A }{ A+B } \in [0,1]\) and \(\textrm{Sensitivity}=\frac { A }{ A+C } \in [0,1]\). Simplifying equation 9.1 above, we get:

\[F_{ 1 }=\frac { 2\frac { { A }^{ 2 } }{ \left( A+B \right) \left( A+C \right) } }{ \frac { A\left( A+C \right) +A\left( A+B \right) }{ \left( A+B \right) \left( A+C \right) } } =\frac { 2{ A }^{ 2 } }{ A\left( A+C \right) +A\left( A+B \right) } =\frac { 2{ A }^{ 2 } }{ { A }^{ 2 }+AC+{ A }^{ 2 }+AB } =\frac { 2{ A }^{ 2 } }{ 2{ A }^{ 2 }+AB+AC } \tag{9.2}\] Now, considering an extreme case where the actual values are all positive and the model predicts all positive results such that both \(A>0\) and \(B=C=0\), we can find the upper limit boundary point such that: \[F_{ 1 }\left( B=C=0 \right) =\frac { 2{ A }^{ 2 } }{ 2{ A }^{ 2 }+0A+0A } =\frac { 2{ A }^{ 2 } }{ 2{ A }^{ 2 }+0+0 } =\frac { 2{ A }^{ 2 } }{ 2{ A }^{ 2 } } =1 \tag{9.3}\]

Then, considering the other extreme, where the model produces no true positives such that \(A=0\) while \(B+C>0\), equation 9.2 reduces to the indeterminate form \(0/0\). The lower limit boundary point can instead be found by simplifying equation 9.2 for \(A>0\) and taking the limit as \(A\to 0^{+}\): \[F_{ 1 }=\frac { 2{ A }^{ 2 } }{ 2{ A }^{ 2 }+AB+AC } =\frac { 2A }{ 2A+B+C }, \quad A>0 \tag{9.4}\]\[\lim _{ A\rightarrow 0^{ + } }{ F_{ 1 } } =\lim _{ A\rightarrow 0^{ + } }{ \frac { 2A }{ 2A+B+C } } =\frac { 0 }{ B+C } =0 \tag{9.5}\] Since \(0\leq 2A\leq 2A+B+C\) for all nonnegative cell counts, the \(F_1\) score is therefore bounded between \(0\) and \(1\).
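As an optional numerical check (not required by the assignment, with arbitrarily chosen cell ranges), the simplified form above can be evaluated for many randomly drawn cell counts to confirm that it never leaves \([0,1]\):

set.seed(621)
A <- sample(0:100, 1000, replace = TRUE)
B <- sample(0:100, 1000, replace = TRUE)
C <- sample(0:100, 1000, replace = TRUE)
F1 <- ifelse(A == 0, 0, 2 * A / (2 * A + B + C))  # take F1 = 0 when there are no true positives
range(F1)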

10. ROC Curve

Write a function that generates an ROC curve from a data set with a true classification column (class in our example) and a probability column (scored.probability in our example). Your function should return a list that includes the plot of the ROC curve and a vector that contains the calculated area under the curve (AUC). Use a sequence of thresholds ranging from 0 to 1 at 0.01 intervals.

my_ROC_Curve <- function(dataframe, probabilities, actual) {
  thresholds <- seq(0, 1, 1/180)  # 181 evenly spaced thresholds on [0, 1] (one per observation)
  TPR <- FPR <- numeric(length(thresholds))
  Positive <- sum(dataframe[,actual] == 1)
  Negative <- sum(dataframe[,actual] == 0)
  for (i in 1:length(thresholds)) {
    # Upper bound probability threshold for equation 10.1
    data_subset <- subset(dataframe, dataframe[ ,probabilities] <= thresholds[i])
    # Condition numerator as outlined by indicator function in equation 10.2
    TP <- sum(data_subset[data_subset[,actual] == 1, probabilities] > 0.5) # A
    TN <- sum(data_subset[data_subset[,actual] == 0, probabilities] <= 0.5) # D
    FP <- sum(data_subset[data_subset[,actual] == 0, probabilities] > 0.5) # B
    FN <- sum(data_subset[data_subset[,actual] == 1, probabilities] <= 0.5) # C
    TPR[i] <- 1 - (TP + FN) / Positive # Equation 10.3
    FPR[i] <- 1 - (TN + FP) / Negative # Equation 10.4
  }
  plot(FPR, TPR, type = "l", lwd = 2,
       xlab = "False Positive Rate (1 - Specificity)", 
       ylab = "True Positive Rate (Sensitivity)")
  abline(0, 1)
  # diff() returns n-1 suitably lagged and iterated differences
  diffFPR <- c(abs(diff(FPR)), 0); diffTPR <- c(abs(diff(TPR)), 0)
  # Variant of Trapezoidal numerical integration from equation 10.5
  area_under_curve <- sum(TPR * diffFPR) - sum(diffTPR * diffFPR) / 2
  # The ROC plot is drawn as a side effect; the function returns the calculated AUC
  return (area_under_curve)
}

\[A:=[0,p] \tag{10.1}\]\[{ 1 }_{ A }\left( x \right) :=\begin{cases} 1 & x\in A \\ 0 & x\notin A \end{cases} \tag{10.2}\]\[TPR_{ i }=\textrm{Sensitivity}_i=1-\sum _{ i,j=1 }^{ n }{ \frac { { 1 }_{ A }\left( x \right) \left[ { TP }_{ i }+{ FN }_{ i } \right] }{ { TP }_{ j }+{ FN }_{ j } } } \tag{10.3}\]\[FPR_{ i }=1-\textrm{Specificity}_i=1-\sum _{ i,j=1 }^{ n }{ \frac { { 1 }_{ A }\left( x \right) \left[ { TN }_{ i }+{ FP }_{ i } \right] }{ { TN }_{ j }+{ FP }_{ j } } } \tag{10.4}\]\[\int _{ a }^{ b }{ f\left( x \right) dx } \approx \frac { 1 }{ 2 } \sum _{ n=0 }^{ N-1 }{ \left( { y }_{ n }+{ y }_{ n+1 } \right) \Delta x } \approx \frac { 1 }{ 2 } \sum _{ n=0 }^{ N-1 }{ \left( { y }_{ n }+{ y }_{ n+1 } \right) \left( x_{ n+1 }-{ x }_{ n } \right) } \tag{10.5}\]
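As a quick sanity check on equation 10.5 (a toy sketch with made-up values, not drawn from the data), both the textbook trapezoid formula and the variant used inside my_ROC_Curve() give the same area when the points are ordered the way the function produces them, from high to low:

x <- c(1, 0.50, 0.25, 0)  # toy FPR-like sequence (decreasing, as in the function)
y <- c(1, 0.80, 0.60, 0)  # toy TPR-like sequence (decreasing)
sum((head(y, -1) + tail(y, -1)) / 2 * abs(diff(x)))  # trapezoid rule, equation 10.5
dx <- c(abs(diff(x)), 0); dy <- c(abs(diff(y)), 0)
sum(y * dx) - sum(dy * dx) / 2                       # variant used in my_ROC_Curve()

Both expressions return the same value (0.7) for this toy curve.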

11. Classification Metrics

Use your created R functions and the provided classification output data set to produce all of the classification metrics discussed above.

my_accuracy(classification.data, "scored.class", "class")
## [1] 0.8066298
my_error(classification.data, "scored.class", "class")
## [1] 0.1933702
my_precision(classification.data, "scored.class", "class")
## [1] 0.84375
my_sensitivity(classification.data, "scored.class", "class")
## [1] 0.4736842
my_specificity(classification.data, "scored.class", "class")
## [1] 0.9596774
my_F1_Score(classification.data, "scored.class", "class")
## [1] 0.6067416
my_ROC_Curve(classification.data, "scored.probability", "class")

## [1] 0.8504527

12. The caret Package

Investigate the caret package. In particular, consider the functions confusionMatrix, sensitivity, and specificity. Apply the functions to the data set. How do the results compare with your own functions?

library(caret)
confusionMatrix(factor(classification.data$scored.class), factor(classification.data$class),
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 119  30
##          1   5  27
##                                           
##                Accuracy : 0.8066          
##                  95% CI : (0.7415, 0.8615)
##     No Information Rate : 0.6851          
##     P-Value [Acc > NIR] : 0.0001712       
##                                           
##                   Kappa : 0.4916          
##  Mcnemar's Test P-Value : 4.976e-05       
##                                           
##             Sensitivity : 0.4737          
##             Specificity : 0.9597          
##          Pos Pred Value : 0.8438          
##          Neg Pred Value : 0.7987          
##              Prevalence : 0.3149          
##          Detection Rate : 0.1492          
##    Detection Prevalence : 0.1768          
##       Balanced Accuracy : 0.7167          
##                                           
##        'Positive' Class : 1               
## 
sensitivity(factor(classification.data$scored.class), factor(classification.data$class), positive = "1")
## [1] 0.4736842
specificity(factor(classification.data$scored.class), factor(classification.data$class), negative = "0")
## [1] 0.9596774

All the values match if the positive and negative outcomes are declared properly in both the written functions and the caret package functions. Something worth noting is that the my_precision value is displayed as Pos Pred Value in the confusionMatrix output.
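As one more cross-check along the same lines (assuming the positive class is again declared as "1"), caret's posPredValue() function can be compared directly against my_precision(); both should return the 0.84375 reported above as Pos Pred Value:

posPredValue(factor(classification.data$scored.class),
             factor(classification.data$class), positive = "1")
my_precision(classification.data, "scored.class", "class")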

13. The pROC Package

Investigate the pROC package. Use it to generate an ROC curve for the data set. How do the results compare with your own functions?

library(pROC)
plot(roc(classification.data$class, classification.data$scored.probability, smooth = FALSE))

auc(classification.data$class, classification.data$scored.probability)
## Area under the curve: 0.8503

The ROC curves are close to identical except for the \(x\)-axis labeling and tick marks. The written function plots \(y = \textrm{Sensitivity}\) and \(x = 1 - \textrm{Specificity}\) with the \(x\)-axis going from \([0,1]\). The pROC package function plots \(y = \textrm{Sensitivity}\) and \(x = \textrm{Specificity}\) with the \(x\)-axis going from \([1,0]\). This equates to the same plot, but with slightly different labeling and tick marks. The Area Under the Curve (AUC) outputs are equivalent to the third decimal. The slight difference is likely due to differences in the number of bins (thresholds) used or the approximation method applied to calculate the area.

References

Linear Models with R, Faraway, 2005.

A Modern Approach to Regression with R, Sheather, 2009.

http://gim.unmc.edu/dxtests/roc2.htm

http://www.cs.uu.nl/docs/vakken/dm/hc6.pdf

http://www.saedsayad.com/model_evaluation_c.htm

https://cran.r-project.org/web/packages/caret/caret.pdf

http://calculus.seas.upenn.edu/?n=Main.NumericalIntegration

https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes

https://stat.ethz.ch/R-manual/R-devel/library/base/html/diff.html