1 Introduction

This assignment explores classification metrics in data science using R.
The assignment has 13 items to be completed. The deliverables are a PDF report, the R code and functions, and the supporting RMarkdown file. The report is divided into five subsequent parts. Section 2 solves Items 1-9, which require implementing the key classification metrics and an ROC curve for a classified data set. Section 3 applies those R functions to the provided data set to solve Item 11. Section 4 investigates the caret package and applies it to the provided data set to solve Item 12. Sections 5 and 6 implement our own ROC curve (Item 10) and compare it to the equivalent curve from the pROC package (Item 13).

2 Formulas for Classification Metrics

In this section, we define a general function for classification statistics and then derive the individual functions as specializations of the general function.

library(dplyr)   # provides %>%, filter() and pull() used below

#
# FUNCTION:  gen_classification_stats
# PURPOSE:   Calculates the classification statistics for a dataframe where the 
#            col_actual is the name of the column of the correct value
#            col_predicted is the name of the column of the predicted value
#
# RETURN:   Returns a dataframe of the summary classification statistics.
#              df[TP] - the count of True Positives
#              df[TN] - the count of True Negatives
#              df[FN] - the count of False Negatives
#              df[FP] - the count of False Positives
#              df[accuracy],  df[error_rate], df[precision], df[sensitivity],
#              df[specificity], df[F1Score]
#
# -----------------------------------------------------------------------
gen_classification_stats <- function( df, col_actual, col_predicted ) {
  
    # Note that the table function is called below as required in Item 2
  
    A = as.data.frame( table( df[ c(col_actual, col_predicted)]))
    
    # To use a variable string as a column name inside filter(),
    # as.symbol() first converts the string (e.g. "class") into a symbol,
    # and !! then unquotes that symbol so dplyr evaluates it as a column name.
    # -------------------------------------------------------------------------------
    A %>% filter( 
    !!as.symbol(col_actual)==1 & 
      !!as.symbol(col_predicted) == 1) %>% pull(Freq) -> val
  
    TP = ifelse( is.integer(val) && length(val) > 0 , val, 0)
  
    A %>% filter( 
    !!as.symbol(col_actual)==0 & 
      !!as.symbol(col_predicted) == 0) %>% pull(Freq) -> val
  
    TN = ifelse( is.integer(val) && length(val) > 0 , val, 0)
  
    A %>% filter( 
    !!as.symbol(col_actual)==1 & 
      !!as.symbol(col_predicted) == 0) %>% pull(Freq) -> val
  
    FN = ifelse( is.integer( val ) && length(val) > 0, val, 0)

    A %>% filter( 
    !!as.symbol(col_actual)==0 & 
      !!as.symbol(col_predicted) == 1) %>% pull(Freq) -> val
  
    FP = ifelse( is.integer(val) && length(val) > 0, val, 0 )
    
    accuracy = (TP + TN) / ( TP + FP + TN + FN)
    
    error_rate = ( FP + FN ) / ( TP + FP + TN + FN)

    precision = TP / ( TP + FP)
    
    sensitivity = TP / ( TP + FN )
        
    specificity = TN/ ( TN + FP)
    
    F1Score = 2 * precision  * sensitivity / ( precision + sensitivity)

    XOut = data.frame( accuracy = accuracy, 
                       error_rate = error_rate ,
                       precision = precision ,
                       sensitivity = sensitivity ,
                       specificity = specificity ,
                       F1Score  = F1Score ,
                       TP = TP ,
                       TN = TN ,
                       FP = FP ,
                       FN = FN 
                       )
    return(XOut)
}
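As a quick illustration of the general function, it can be applied to a small toy data frame (the values below are hypothetical and used only for demonstration):

# Hypothetical toy data: 6 observations with actual and predicted 0/1 labels
toy <- data.frame( actual    = c(1, 1, 1, 0, 0, 0),
                   predicted = c(1, 0, 1, 0, 0, 1) )

# For this toy data the counts are TP = 2, TN = 2, FN = 1, FP = 1
gen_classification_stats( toy, "actual", "predicted" )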

2.1 Item 3: Accuracy

The code below implements the accuracy of the predictions:

\[ Accuracy = \frac{ TP + TN}{TP + FP + TN + FN} \]
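The accuracy function can be written as a thin wrapper around gen_classification_stats(); the sketch below uses a hypothetical function name of our own choosing:

# Accuracy as a specialization of the general function (name is hypothetical)
calc_accuracy <- function( df, col_actual, col_predicted ) {
    stats <- gen_classification_stats( df, col_actual, col_predicted )
    return( stats$accuracy )
}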

2.2 Item 4: Classification Error Rate

The code below implements the classification error rate of the predictions.

\[ ErrorRate = \frac{ FP + FN}{TP + FP + TN + FN } \]
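As with accuracy, the error rate can be obtained as a specialization of the general function; a sketch with a hypothetical function name:

# Classification error rate as a specialization of the general function
calc_error_rate <- function( df, col_actual, col_predicted ) {
    stats <- gen_classification_stats( df, col_actual, col_predicted )
    return( stats$error_rate )
}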

We also need to verify that the error rate and accuracy sum to one. This is proved below as an algebraic identity.

\[ Accuracy + ErrorRate = \frac{TP + TN}{TP+FP+TN+FN} + \frac{FP+FN}{TP+FP+TN+FN} \]

\[ Accuracy + ErrorRate = \frac{ TP + TN + FP + FN}{TP+FP+TN+FN} = 1 \]
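The identity can also be spot-checked numerically, here on the toy data frame introduced after the general function (a sketch; stopifnot() raises an error only if the sum differs from one):

stats <- gen_classification_stats( toy, "actual", "predicted" )
stopifnot( isTRUE( all.equal( stats$accuracy + stats$error_rate, 1 ) ) )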

2.3 Item 5: Precision

Precision is defined as:

\[ Precision = \frac{ TP}{TP + FP} \]

2.4 Item 6: Sensitivity

Sensitivity is defined as:

\[ Sensitivity = \frac{ TP}{TP + FN} \] It can also be viewed as the true positive rate.

2.5 Item 7: Specificity

Specificity is defined as:

\[ Specificity = \frac{ TN }{TN + FP} \]

2.6 Item 8: F1 Score

F1 Score is defined as:

\[ F1\ Score = \frac{2 \cdot Precision \cdot Sensitivity}{ Precision + Sensitivity} \]
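The remaining metrics of Items 5 through 8 follow the same pattern as accuracy and the error rate; a combined sketch with hypothetical function names:

# Precision, sensitivity, specificity and F1 score as specializations
# of gen_classification_stats() (function names are hypothetical)
calc_precision <- function( df, col_actual, col_predicted ) {
    gen_classification_stats( df, col_actual, col_predicted )$precision
}

calc_sensitivity <- function( df, col_actual, col_predicted ) {
    gen_classification_stats( df, col_actual, col_predicted )$sensitivity
}

calc_specificity <- function( df, col_actual, col_predicted ) {
    gen_classification_stats( df, col_actual, col_predicted )$specificity
}

calc_F1Score <- function( df, col_actual, col_predicted ) {
    gen_classification_stats( df, col_actual, col_predicted )$F1Score
}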

2.7 Item 9: Bounds on the F1 Score

To show that the F1 Score is between 0 and 1, we simplify the definition algebraically. We assume \(TP > 0\), which is necessary for the Precision and Sensitivity to be non-zero.

\[ F1 = \frac{ 2 \cdot \frac{TP}{TP+FP} \cdot \frac{ TP }{TP+ FN}} { \frac{ TP}{TP + FP} + \frac{TP}{TP+FN} } \] \[ F1 = \frac{ 2 \frac{ TP^2}{(TP+FP)(TP+FN) } }{ \frac{TP(TP+FN)+TP(TP+FP)}{(TP+FP)(TP+FN) } } \] \[ F1 = \frac{ 2 \cdot TP^2}{2 TP^2 + TP(FN+FP)} \] \[ F1 = \frac{ TP^2}{TP^2 + \frac{1}{2}TP(FN+FP) } \]

\[ F1 = \frac{ TP }{TP + \left( \frac{FN+FP}{2}\right)} \] Since \(TP > 0\), \(FN \geq 0\) and \(FP \geq 0\), this implies \(F1 \geq 0\).

Next, we show \(F1 \leq 1\). This follows because \(0 < TP \leq TP + \frac{1}{2}(FN+FP)\), so the ratio in the final expression above is at most one.

This completes the proof of the inequality \(0 \leq F1 \leq 1\).
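As a numerical spot check of the bound, the F1 score computed on the toy data frame introduced earlier indeed falls inside the interval (a sketch):

F1_toy <- gen_classification_stats( toy, "actual", "predicted" )$F1Score
stopifnot( F1_toy >= 0, F1_toy <= 1 )  # passes silently when the bound holds
F1_toy                                 # equals 2/3 for the toy data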

3 Applying the Metrics to Data

In this section, we use the created R functions and the provided classification output data to produce all of the classification metrics discussed above. We begin by loading the raw data file.
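A minimal sketch of the data load, assuming the provided file is named classification-output-data.csv and resides in the working directory (the file name and location are assumptions):

# Load the provided classification output data (file name assumed)
ts <- read.csv( "classification-output-data.csv", stringsAsFactors = FALSE )

# Columns used in the rest of the report: class (the actual 0/1 label),
# scored.class (the predicted 0/1 label, name assumed) and scored.probability
str( ts[ c("class", "scored.class", "scored.probability") ] )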

3.1 Item 11: Using our created R functions

The created R functions have the following output.

## [1] "accuracy is     "  "0.806629834254144"
## [1] "error rate is   "  "0.193370165745856"
## [1] "precision is    " "0.84375"
## [1] "sensitivity is  "  "0.473684210526316"
## [1] "specificity is  "  "0.959677419354839"
## [1] "F1Score is      "  "0.606741573033708"

4 Using the Caret Package

Using the caret package, we can replicate the statistics produced by our code on the sample data. One caveat is that the factor level representing True (the positive class) must be supplied explicitly to confusionMatrix() through its positive argument, which is set here to “T”, the positive level of the actual class.
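A sketch of the call that produces the output below; the recoding of the 0/1 columns into F/T factors is our assumption about how the levels were prepared:

library(caret)

# Recode the 0/1 columns as F/T factors (recoding approach assumed)
actual_f    <- factor( ifelse( ts$class == 1, "T", "F" ), levels = c("F", "T") )
predicted_f <- factor( ifelse( ts$scored.class == 1, "T", "F" ), levels = c("F", "T") )

# positive = "T" tells caret which level is the positive class
confusionMatrix( data = predicted_f, reference = actual_f, positive = "T" )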

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   F   T
##          F 119  30
##          T   5  27
##                                           
##                Accuracy : 0.8066          
##                  95% CI : (0.7415, 0.8615)
##     No Information Rate : 0.6851          
##     P-Value [Acc > NIR] : 0.0001712       
##                                           
##                   Kappa : 0.4916          
##                                           
##  Mcnemar's Test P-Value : 4.976e-05       
##                                           
##             Sensitivity : 0.4737          
##             Specificity : 0.9597          
##          Pos Pred Value : 0.8438          
##          Neg Pred Value : 0.7987          
##              Prevalence : 0.3149          
##          Detection Rate : 0.1492          
##    Detection Prevalence : 0.1768          
##       Balanced Accuracy : 0.7167          
##                                           
##        'Positive' Class : T               
## 

The above results show that the accuracy (and hence the error rate), sensitivity, specificity, and precision (reported by caret as Pos Pred Value) agree between our code and the caret package.

5 Implementing the ROC curve (Item 10)

library(ggplot2)   # provides ggplot(), geom_step() and geom_abline() used below

# FUNCTION: ImplementROC
#
# PURPOSE:  Calculates the ROC curve from a vector of actual (0,1) values 
#           representing the classification and 
#           the numerical probabilities of the model.
#
# INPUTS:   actual      -- a vector of 0-1 values of length N representing 
#           the true classification of each observation
#           probability -- the numerical probability of being in the
#           1 class for each observation
#
# RETURN:   A list containing (1) a dataframe of the threshold, specificity,
#           sensitivity and false positive rate at each cut-off,
#           (2) a ggplot object of the ROC curve, and (3) the estimated AUC.
ImplementROC <- function( actual, probability )
{
    N = 100 # The number of threshold points to generate the ROC curve.
    
    # Initialize the dataframe where intermediate data from the calculations 
    # will be stored for plots and tables.
    output_df = data.frame( index = integer(0), threshold = double(0),
                            specificity = double(0), 
                            sensitivity = double(0), 
                            falsepositive = double(0) )
       
    for( index in 1:N ){
       threshold = index/N
      
       scored.class = ifelse( probability > threshold, 1, 0)
       
       df = data.frame(actual = actual, predicted = scored.class)
       
       # Re-use our existing classification statistics function to generate
       # the points of the ROC curve at this threshold
       XOut = gen_classification_stats(df, "actual", "predicted")
       
       #print(paste0( "threshold ", threshold, " Specificity: ", 
       #           XOut$specificity[1], " Sensitivity:  " , XOut$sensitivity[1]) )
       
       output_df = rbind(output_df, data.frame(index = index, 
                                               threshold = threshold, 
                                               specificity =XOut$specificity[1],
                                               sensitivity =XOut$sensitivity[1],
                                               falsepositive = 1 - XOut$specificity[1]
                                               ))
    }  
  
    output_df %>% arrange(sensitivity, falsepositive) %>%
      ggplot( aes( x=falsepositive, y = sensitivity ) ) +
       geom_step(size = 1, color="red") + labs(title="ROC Curve", 
                          x="False Positive Rate", 
                          y="Sensitivity") +
       geom_abline(intercept = 0, slope = 1) +
       theme(aspect.ratio = 1 ) -> mylineplot
    
    output_df %>% arrange(sensitivity, falsepositive ) -> output_df
    
    #
    # Now we calculate the AUC: Area Under the Curve
    # -------------------------------------------------
    diffFalsePositive = c(diff(output_df$falsepositive), 0 )
    diffSensitivity = c(diff(output_df$sensitivity), 0 )
    
    # The formula below uses the trapezoid rule for the numerical integration:
    # it sums the rectangle under each subinterval of the curve and then adds
    # a triangle, because the heights at the left and right end points of each
    # subinterval are unequal.
    AUC = sum(output_df$sensitivity * diffFalsePositive ) + 
      sum( diffSensitivity * diffFalsePositive) / 2
    
    outputs = list( output_df, mylineplot, AUC)
    
    
    return(outputs)
}

outputs = ImplementROC( ts$class, ts$scored.probability)

outputs[[2]] # Display the ROC chart

print( paste0( "AUC is: ", outputs[[3]] ) ) # Display the calculated AUC (outputs[[3]])

## [1] "AUC is: 0.848896434634975"

6 Using the pROC Package (Item 13)

In the section below, we illustrate the use of the pROC package to plot the receiver operating characteristic curve (ROC) for this data set. We then display the calculated AUC (area under the curve).
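A minimal sketch of the pROC calls, assuming the same ts data frame as above (the plotting options are our own choices):

library(pROC)

# Build the ROC object from the actual class and the model probabilities
roc_obj <- roc( response = ts$class, predictor = ts$scored.probability )

# Plot the ROC curve and report the area under the curve
plot( roc_obj, legacy.axes = TRUE, main = "pROC ROC Curve" )
auc( roc_obj )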

## Area under the curve: 0.8503

Comparing the area under the curve from our implementation with that from pROC, the two values are close but not identical. The difference can be attributed to the different numerical integration strategies (and, in our case, the finite grid of 100 thresholds) used to compute the area under the ROC curve. However, the two ROC curves are virtually identical in shape.