DATA 621 - Assignment 2

Overview

In this homework assignment, you will work through various classification metrics. You will be asked to create functions in R to carry out the various calculations. You will also investigate some functions in packages that will let you obtain the equivalent results. Finally, you will create graphical output that also can be used to evaluate the output of classification models, such as binary logistic regression.

Download the classification output data set (attached in Blackboard to the assignment).

The classification-output-data.csv was downloaded an read-in to the dt variable below. The file contains 181 observations and 3 variables. Three variables will be considered for this report - class (actual class for the observation), scored.class (predicted class for the observation), and scored.probability (predicted probability of success for the observation).

##      class          predicted           prob        
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.02323  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.11702  
##  Median :0.0000   Median :0.0000   Median :0.23999  
##  Mean   :0.3149   Mean   :0.1768   Mean   :0.30373  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.43093  
##  Max.   :1.0000   Max.   :1.0000   Max.   :0.94633

## [1] 181   3

The data set has three key columns we will use:

1. class: the actual class for the observation
2. scored.class: the predicted class for the observation (based on a threshold of 0.5)
3. scored.probability: the predicted probability of success for the observation

Use the table() function to get the raw confusion matrix for this scored dataset. Make sure you understand the output. In particular, do the rows represent the actual or predicted class? The columns?

Here is raw confusion matrix using table():

##      predicted
## class   0   1
##     0 119   5
##     1  30  27

The rows in the table reflect actual class values of 0 or 1. Columns reflected predicted class values of 0 or 1. A table interpretation follows:

Given that 0 is a negative class and 1 is a positive class the table can be read as follows:

119 true negative observations (TN)
5 false positive observations (FP)
30 false negative observations (FN)
27 true positive observations (TP)

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the accuracy of the predictions.

Before calculating accuracy I create a helper function Cm_val()that will provide key formula inputs for accuracy and other metrics.

cm_val Function

The cm_val function is a helper function that calculates the key values for the various metrics: TP, TN, FP, and FN from the provided dataframe as well as the specified columns containing actual and predicted values and specified values to be considered True or False.

cm_val <- function(tbl, actual, predicted, pos_value, neg_value){
   cm <- as.data.frame(table(tbl[,c(actual)], tbl[,c(predicted)]))
   FP <- filter(cm, Var1==neg_value & Var2==pos_value)$Freq
   FN <- filter(cm, Var1==pos_value & Var2==neg_value)$Freq
   TP <- filter(cm, Var1==pos_value & Var2==pos_value)$Freq
   TN <- filter(cm, Var1==neg_value & Var2==neg_value)$Freq

   # Reset to 0 if no category was found
   if (length(FN)==0) FN<-0;
   if (length(FP)==0) FP<-0;
   if (length(TP)==0) TP<-0;
   if (length(TN)==0) TN<-0;
   return (list(TP=TP, FP=FP, TN=TN, FN=FN))
}

Accuracy

# Define positive as 1 and negative as 0
pos_value <- 1
neg_value <- 0

accuracy <- function(tbl, actual, predicted, pos_value, neg_value){
  tf <- cm_val(tbl, actual, predicted, pos_value, neg_value)
  return ((tf$TP+tf$TN)/(tf$TP+tf$TN+tf$FP+tf$FN))
}

(my_accuracy <- accuracy(dt, actual='class', predicted='predicted', 
                 pos_value, neg_value))

## [1] 0.8066298

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the classification error rate of the predictions.

error.rate <- function(tbl, actual, predicted, pos_value, neg_value){
  tf <- cm_val(tbl, actual, predicted, pos_value, neg_value)
  return ((tf$FP+tf$FN)/(tf$TP+tf$TN+tf$FP+tf$FN))
}

(my_error <- error.rate(dt, actual='class', predicted='predicted', 
                   pos_value, neg_value))

## [1] 0.1933702

Accuracy and classification error rate should add up to one.

I am able to confirm that my accuracy ( 0.8066298 ) and my classification error ( 0.1933702 ) equal 1.

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the precision of the predictions.

precision <- function(tbl, actual, predicted, pos_value, neg_value){
  tf <- cm_val(tbl, actual, predicted, pos_value, neg_value)
  return (tf$TP/(tf$TP+tf$FP))
}

(pre <- precision(dt, actual='class', predicted='predicted', 
                  pos_value, neg_value))

## [1] 0.84375

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the sensitivity of the predictions. Sensitivity is also known as recall.

sensitivity <- function(tbl, actual, predicted, pos_value, neg_value){
  tf <- cm_val(tbl, actual, predicted, pos_value, neg_value)
  return (tf$TP/(tf$TP+tf$FN))
}

(sens <- sensitivity(dt, actual='class', predicted='predicted', 
                     pos_value, neg_value))

## [1] 0.4736842

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the specificity of the predictions.

specificity <- function(tbl, actual, predicted, pos_value, neg_value){
  tf <- cm_val(tbl, actual, predicted, pos_value, neg_value)
  return (tf$TN/(tf$TN+tf$FP))
}

(spec <- specificity(dt, actual='class', predicted='predicted', 
                     pos_value, neg_value))

## [1] 0.9596774

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the F1 score of the predictions,

f1.score <- function(tbl, actual, predicted, pos_value, neg_value){
  p <- precision(dt, actual='class', predicted='predicted', 
                 pos_value, neg_value)
  s <- sensitivity(dt, actual='class', predicted='predicted', 
                   pos_value, neg_value)
  return ((2*p*s)/(p+s))
}

(f1 <- f1.score(dt, actual='class', predicted='predicted', 
                pos_value, neg_value))

## [1] 0.6067416

Before we move on, let’s consider a question that was asked: What are the bounds on the F1 score? Show that the F1 score will always be between 0 and 1. (Hint: If 0 < 𝑎 < 1 and 0 < 𝑏 < 1 then 𝑎𝑏 < 𝑎.)

Both Precision and Sensitivity have a range from 0 to 1. Consider that if \(a>0\) and \(0<b<1\), then \(ab<a\) (a fraction of any positive number will be smaller than the original number).

Then \(Precision \times Sensitivity < Precision\) and \(Precision \times Sensitivity < Sensitivity\).

Then \(Precision \times Sensitivity+Precision \times Sensitivity < Precision + Sensitivity\), or

\(2\times Precision \times Sensitivity < Precision + Sensitivity\).

The fraction of these two values will be lower than \(1\). Also, since both values are positive, \(F1\ score\) will be positive. If \(Precision\) is zero, then \(Sensitivity\) is zero and \(F1\ Score\) is not defined. If \(Precision\) is one and \(Sensitivity\) is one, then \(F1\ Score\) is one.

So we have \(0<F\ Score\le1\).

Write a function that generates an ROC curve from a data set with a true classification column (class in our example) and a probability column (scored.probability in our example). Your function should return a list that includes the plot of the ROC curve and a vector that contains the calculated area under the curve (AUC). Note that I recommend using a sequence of thresholds ranging from 0 to 1 at 0.01 intervals.

The following function calculates the proper specificity and sensivity based on provided probability by iterating from 0 to 1 at 0.01 interval. It then computes the area under the curve by summing up all relevant rectangles. Plot and AUC are returned.

manual.roc <- function(tbl, actual, prob, pos_value, neg_value) {
  tbl2 <- tbl[, c(actual, prob)]
  tbl2 <- cbind(tbl2, predicted = rep(0, nrow(tbl)))

  cutoff <- seq(0,1,0.01)
  sens <- rep(0,length(cutoff))
  spec <- rep(0,length(cutoff))
  for (i in 1:length(cutoff)) {
    tbl3 <- within(tbl2, predicted[prob>cutoff[i]] <- 1)
    sens[i] <- sensitivity(tbl3, actual='class', 
                           predicted='predicted', pos_value, neg_value)
    spec[i] <- specificity(tbl3, actual='class', 
                           predicted='predicted', pos_value, neg_value)
  }
  roc.values <- as.data.frame(cbind(cutoff, sens, spec))
  roc.values[,'spec'] <- 1 - roc.values[,'spec']
  
  # Prepare plot
  pl <- ggplot(arrange(roc.values, desc(cutoff)), aes(x=spec, y=sens)) + 
    geom_step() + 
    xlab("1-Specificity") + ylab("Sensitivity") + 
    ggtitle("ROC Curve")
  
  # Find AUC
  auc <- roc.values %>% 
    arrange(desc(cutoff)) %>%
    mutate(auc =  (spec-lag(spec))*sens) %>%
    replace(is.na(.),0) %>%
    summarize(sum(auc))
  
  return (list(plot=pl, AUC=auc))
}

Let us run the function and review plot and AUC value.

mROC <- manual.roc(dt, 'class', 'prob', pos_value, neg_value)
mROC$plot

mROC$AUC

##    sum(auc)
## 1 0.8539898

11. Classification Metrics

All metrics were provided as they were calculated. As we will see below using built-in functions makes life easier.

12. `caret` Package

caret package provides functions to calculate all of the values above and many more. For example, we can calculate sensitivity and specificity.

caret::sensitivity(as.factor(dt$predicted), as.factor(dt$class), positive='1')

## [1] 0.4736842

caret::specificity(as.factor(dt$predicted), as.factor(dt$class), negative='0')

## [1] 0.9596774

Even more encompassing function is confusionMatrix. It not only returns the actual confusion matrix, but other values including sensitivity and specificity.

 (cm <-confusionMatrix(factor(dt$predicted), as.factor(dt$class), positive='1'))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 119  30
##          1   5  27
##                                           
##                Accuracy : 0.8066          
##                  95% CI : (0.7415, 0.8615)
##     No Information Rate : 0.6851          
##     P-Value [Acc > NIR] : 0.0001712       
##                                           
##                   Kappa : 0.4916          
##                                           
##  Mcnemar's Test P-Value : 4.976e-05       
##                                           
##             Sensitivity : 0.4737          
##             Specificity : 0.9597          
##          Pos Pred Value : 0.8438          
##          Neg Pred Value : 0.7987          
##              Prevalence : 0.3149          
##          Detection Rate : 0.1492          
##    Detection Prevalence : 0.1768          
##       Balanced Accuracy : 0.7167          
##                                           
##        'Positive' Class : 1               
##

The F1 score, Precision and other values and with higher precision than the summary output above. All of the values match our calculations.

13. `pROC` Package

Let us try the pROC package.

roc(dt$class, dt$prob, levels=c(0,1), percent=TRUE, plot=TRUE, ci=TRUE)

## Setting direction: controls < cases

## 
## Call:
## roc.default(response = dt$class, predictor = dt$prob, levels = c(0,     1), percent = TRUE, ci = TRUE, plot = TRUE)
## 
## Data: dt$prob in 124 controls (dt$class 0) < 57 cases (dt$class 1).
## Area under the curve: 85.03%
## 95% CI: 79.05%-91.01% (DeLong)

With one line of code we can get the ROC curve and several metrics. This plot looks very similar to our manual plot (some difference is likely due to different aspects). We calculated AUC at 85.4% while pROC calculated it at 85.03%. The small difference is likely due to slight discrepancy in how steps were calculated (it is also possible that the code leaves out a sliver somewhere).