This assignment explores classification metrics in data science using R.
The assignment consists of 13 items. The deliverables include a PDF report, the R code and functions, and the supporting RMarkdown file. This report is divided into five subsequent parts. In the second section, we solve items 1-9, which require implementing the key classification metrics and an ROC curve for a classified data set. In the third section, we apply those R functions to the provided data set to solve item 11. In the fourth section, we investigate the caret package and apply its functionality to the provided data set to solve item 12. In the fifth and sixth sections, we implement our own ROC curve (item 10) and compare it to the pROC package's equivalent ROC curve implementation to solve item 13.
In this section, we define a general function for classification statistics and then derive the individual functions as specializations of the general function.
library(dplyr)    # provides %>%, filter(), and pull() used below
library(ggplot2)  # used later for the ROC plot

#
# FUNCTION: gen_classification_stats
# PURPOSE: Calculates the classification statistics for a dataframe where the
# col_actual is the name of the column of the correct value
# col_predicted is the name of the column of the predicted value
#
# RETURN: Returns a dataframe of the summary classification statistics.
# df[TP] - the count of True Positives
# df[TN] - the count of True Negatives
# df[FN] - the count of False Negatives
# df[FP] - the count of False Positives
# df[accuracy], df[error_rate], df[precision], df[sensitivity],
# df[specificity], df[F1Score]
#
# -----------------------------------------------------------------------
gen_classification_stats <- function( df, col_actual, col_predicted ) {
# Note that the table function is called below as required in Item 2
A = as.data.frame( table( df[ c(col_actual, col_predicted)]))
# To filter on a column whose name is stored in a variable, we convert the
# string to a symbol with as.symbol() and unquote it with !!, so that dplyr
# evaluates it as the column itself (e.g., col_actual becomes the column 'class').
# -------------------------------------------------------------------------------
A %>% filter(
!!as.symbol(col_actual)==1 &
!!as.symbol(col_predicted) == 1) %>% pull(Freq) -> val
TP = ifelse( is.integer(val) && length(val) > 0 , val, 0)
A %>% filter(
!!as.symbol(col_actual)==0 &
!!as.symbol(col_predicted) == 0) %>% pull(Freq) -> val
TN = ifelse( is.integer(val) && length(val) > 0 , val, 0)
A %>% filter(
!!as.symbol(col_actual)==1 &
!!as.symbol(col_predicted) == 0) %>% pull(Freq) -> val
FN = ifelse( is.integer( val ) && length(val) > 0, val, 0)
A %>% filter(
!!as.symbol(col_actual)==0 &
!!as.symbol(col_predicted) == 1) %>% pull(Freq) -> val
FP = ifelse( is.integer(val) && length(val) > 0, val, 0 )
accuracy = (TP + TN) / ( TP + FP + TN + FN)
error_rate = ( FP + FN ) / ( TP + FP + TN + FN)
precision = TP / ( TP + FP)
sensitivity = TP / ( TP + FN )
specificity = TN/ ( TN + FP)
F1Score = 2 * precision * sensitivity / ( precision + sensitivity)
XOut = data.frame( accuracy = accuracy,
error_rate = error_rate ,
precision = precision ,
sensitivity = sensitivity ,
specificity = specificity ,
F1Score = F1Score ,
TP = TP ,
TN = TN ,
FP = FP ,
FN = FN
)
return(XOut)
}

The code below implements the accuracy of the predictions:
\[ Accuracy = \frac{ TP + TN}{TP + FP + TN + FN} \]
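The accuracy wrapper itself is not reproduced in this section; a minimal sketch, assuming it simply delegates to gen_classification_stats() in the same way as the error_rate wrapper that follows:

# Sketch of an accuracy wrapper (assumed to mirror error_rate below).
accuracy <- function( df, col_actual, col_predicted ) {
  gen_classification_stats(df, col_actual, col_predicted)$accuracy[1]
}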
The code below implements the classification error rate of the predictions.
\[ ErrorRate = \frac{ FP + FN}{TP + FP + TN + FN } \]
error_rate <- function( df, col_actual, col_predicted ) {
gen_classification_stats(df, col_actual, col_predicted)$error_rate[1]
}

We also need to verify that the error rate and accuracy sum to one. This is proved below as an algebraic identity.
\[ Accuracy + ErrorRate = \frac{TP + TN}{TP+FP+TN+FN} + \frac{FP+FN}{TP+FP+TN+FN} \]
\[ Accuracy + ErrorRate = \frac{TP + TN + FP + FN}{TP+FP+TN+FN} = 1 \]
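As a quick numeric sanity check of this identity (a sketch using a small made-up data frame, assuming dplyr is loaded for gen_classification_stats()):

# Hypothetical toy data: verify that accuracy + error_rate equals one.
toy <- data.frame( actual    = c(1, 0, 1, 1, 0, 0),
                   predicted = c(1, 0, 0, 1, 1, 0) )
stats <- gen_classification_stats(toy, "actual", "predicted")
stats$accuracy + stats$error_rate   # returns 1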
Precision is defined as:
\[ Precision = \frac{ TP}{TP + FP} \]
Sensitivity is defined as:
\[ Sensitivity = \frac{ TP}{TP + FN} \] It can also be viewed as the true positive rate.
Specificity is defined as:
\[ Specificity = \frac{ TN }{TN + FP} \]
F1 Score is defined as:
\[ F1\ Score = \frac{2 \cdot Precision \cdot Sensitivity}{Precision + Sensitivity} \]
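The remaining wrappers (precision, sensitivity, specificity, and F1 score) are likewise not reproduced above; minimal sketches, assuming each one delegates to gen_classification_stats() exactly as error_rate does:

# Sketches of the remaining metric wrappers (assumed to mirror error_rate).
# Note: sensitivity() and specificity() mask the caret functions of the same
# name if caret is attached later.
precision <- function( df, col_actual, col_predicted ) {
  gen_classification_stats(df, col_actual, col_predicted)$precision[1]
}
sensitivity <- function( df, col_actual, col_predicted ) {
  gen_classification_stats(df, col_actual, col_predicted)$sensitivity[1]
}
specificity <- function( df, col_actual, col_predicted ) {
  gen_classification_stats(df, col_actual, col_predicted)$specificity[1]
}
F1Score <- function( df, col_actual, col_predicted ) {
  gen_classification_stats(df, col_actual, col_predicted)$F1Score[1]
}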
To show that the F1 Score lies between 0 and 1, we work through the algebra from the definition. We assume \(TP > 0\), which is necessary for the Precision and Sensitivity to be nonzero (and hence for the F1 Score to be defined).
\[ F1 = \frac{ 2 \cdot \frac{TP}{TP+FP} \cdot \frac{ TP }{TP+ FN}} { \frac{ TP}{TP + FP} + \frac{TP}{TP+FN} } \] \[ F1 = \frac{ 2 \frac{ TP^2}{(TP+FP)(TP+FN) } }{ \frac{TP(TP+FN)+TP(TP+FP)}{(TP+FP)(TP+FN) } } \] \[ F1 = \frac{ 2 \cdot TP^2}{2 TP^2 + TP(FN+FP)} \] \[ F1 = \frac{ TP^2}{TP^2 + \frac{1}{2}TP(FN+FP) } \]
\[ F1 = \frac{ TP }{TP + \left( \frac{FN+FP}{2}\right)} \] Since \(TP > 0\), \(FN \geq 0\), and \(FP \geq 0\), both the numerator and the denominator are positive, which implies \(F1 > 0\).
Next, we show \(F1 \leq 1\). This follows because the denominator is at least as large as the numerator: \(0 < TP \leq TP + \frac{1}{2}(FN+FP)\).
This completes the proof of the inequality \(0 < F1 \leq 1\), which in particular gives \(0 \leq F1 \leq 1\).
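As a concrete check, plugging in the counts obtained later from the provided data set (TP = 27, FP = 5, FN = 30):
\[ F1 = \frac{27}{27 + \frac{30 + 5}{2}} = \frac{27}{44.5} \approx 0.6067, \]
which matches the F1 score reported in the next section and indeed lies in \((0, 1]\).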
In this section, we use the created R functions and the provided classification output data to produce all of the classification metrics discussed above. We begin by loading the raw data file.
# Code Block: importing-csv
# Sets the working folder and imports the CSV data
working_dir = "/Volumes/DATA/dat/ang/datascience/621_DATA_MINING/HW2"
setwd(working_dir)
classification_file = "classification-output-data.csv"
# The provided classification output data set
ts = read.csv(classification_file)

The created R functions have the following output.
## [1] "accuracy is " "0.806629834254144"
## [1] "error rate is " "0.193370165745856"
## [1] "precision is " "0.84375"
## [1] "sensitivity is " "0.473684210526316"
## [1] "specificity is " "0.959677419354839"
## [1] "F1Score is " "0.606741573033708"
Using the caret package, we are able to replicate the statistics produced by our code on the sample data. One caveat is that the level representing True (or Positive) must be declared explicitly to confusionMatrix through the positive argument, which we set to the value "T" used in the actual class factor.
library(caret)
predicted_class = factor(ifelse(ts$scored.class == 1 , "T", "F") )
actual_class = factor( ifelse( ts$class == 1, "T", "F" ))
confusionMatrix(predicted_class, actual_class, positive = c("T") )

## Confusion Matrix and Statistics
##
## Reference
## Prediction F T
## F 119 30
## T 5 27
##
## Accuracy : 0.8066
## 95% CI : (0.7415, 0.8615)
## No Information Rate : 0.6851
## P-Value [Acc > NIR] : 0.0001712
##
## Kappa : 0.4916
##
## Mcnemar's Test P-Value : 4.976e-05
##
## Sensitivity : 0.4737
## Specificity : 0.9597
## Pos Pred Value : 0.8438
## Neg Pred Value : 0.7987
## Prevalence : 0.3149
## Detection Rate : 0.1492
## Detection Prevalence : 0.1768
## Balanced Accuracy : 0.7167
##
## 'Positive' Class : T
##
The above results show that accuracy, error rate, sensitivity, specificity, and precision (reported by caret as Pos Pred Value) agree between our code and the caret package.
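For a programmatic side-by-side check, the confusionMatrix result can also be queried directly; a sketch (the object name cm and this comparison are illustrative, not part of the original report):

# Compare caret's summary values against our gen_classification_stats() output.
cm   <- confusionMatrix(predicted_class, actual_class, positive = "T")
ours <- gen_classification_stats(ts, "class", "scored.class")
c( caret_accuracy    = unname(cm$overall["Accuracy"]),
   our_accuracy      = ours$accuracy,
   caret_sensitivity = unname(cm$byClass["Sensitivity"]),
   our_sensitivity   = ours$sensitivity )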
# FUNCTION: ImplementROC
#
# PURPOSE: Calculates the ROC curve from a vector of actual (0,1) values
# representing the classification and
# the numerical probabilities of the model.
#
# INPUTS: actual -- a vector of 0-1 values of length N representing
# the true classification of each observation
# probability -- representing the numerical probability of
# being in the 1 class for each observation
ImplementROC <- function( actual, probability )
{
N = 100 # The number of threshold points to generate the ROC curve.
# Initialize the dataframe where intermediate data from the calculations
# will be stored for plots and tables.
output_df = data.frame( index = integer(0), threshold = double(0),
specificity = double(0),
sensitivity = double(0),
falsepositive = double(0) )
for( index in 1:N ){
threshold = index/N
scored.class = ifelse( probability > threshold, 1, 0)
df = data.frame(actual = actual, predicted = scored.class)
# Re-use our existing classification statistics function to generate
# ROC curve
(XOut = gen_classification_stats(df, "actual", "predicted") )
#print(paste0( "threshold ", threshold, " Specificity: ",
# XOut$specificity[1], " Sensitivity: " , XOut$sensitivity[1]) )
output_df = rbind(output_df, data.frame(index = index,
threshold = threshold,
specificity =XOut$specificity[1],
sensitivity =XOut$sensitivity[1],
falsepositive = 1 - XOut$specificity[1]
))
}
output_df %>% arrange(sensitivity, falsepositive) %>%
ggplot( aes( x=falsepositive, y = sensitivity ) ) +
geom_step(size = 1, color="red") + labs(title="ROC Curve",
x="False Positive Rate",
y="Sensitivity") +
geom_abline(intercept = 0, slope = 1) +
theme(aspect.ratio = 1 ) -> mylineplot
output_df %>% arrange(sensitivity, falsepositive ) -> output_df
#
# Now we calculate the AUC: Area Under the Curve
# -------------------------------------------------
diffFalsePositive = c(diff(output_df$falsepositive), 0 )
diffSensitivity = c(diff(output_df$sensitivity), 0 )
# The formula below essentially uses the trapezoid rule to do the
# numerical integration: it computes the area of each rectangle and then
# adds a triangle, since the left and right end points of each Riemann
# subinterval are unequal.
AUC = sum(output_df$sensitivity * diffFalsePositive ) +
sum( diffSensitivity * diffFalsePositive) / 2
outputs = list( output_df, mylineplot, AUC)
return(outputs)
}
outputs = ImplementROC( ts$class, ts$scored.probability)
outputs[[2]] # Display the ROC chart

## [1] "AUC is: 0.848896434634975"
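Written out, with the sorted sensitivities \(s_i\), false positive rates \(f_i\), and differences \(\Delta f_i = f_{i+1} - f_i\) and \(\Delta s_i = s_{i+1} - s_i\) (notation introduced here for exposition), the area computed by ImplementROC is
\[ AUC \approx \sum_{i} s_i \, \Delta f_i + \frac{1}{2} \sum_{i} \Delta s_i \, \Delta f_i . \]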
In the section below, we illustrate the use of the pROC package to plot the receiver operating characteristic (ROC) curve for this data set. We then display the calculated area under the curve (AUC).
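The pROC code chunk is not reproduced in this report; a minimal sketch that would plot the curve and report the area, assuming the same ts data frame:

# Sketch of the pROC comparison.
library(pROC)
roc_obj <- roc(ts$class, ts$scored.probability)
plot(roc_obj)   # ROC curve from pROC
auc(roc_obj)    # area under the curve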
## Area under the curve: 0.8503
Comparing the area under the curve from our code and from pROC, it is clear that the two give close but not identical answers. The differences can be attributed to differences in the numerical integration strategies used for the ROC curve. However, the ROC curves from the two implementations are virtually identical in shape.