In this homework assignment, you will work through various classification metrics. You will be asked to create functions in R to carry out the various calculations. You will also investigate some functions in packages that will let you obtain the equivalent results. Finally, you will create graphical output that also can be used to evaluate the output of classification models, such as binary logistic regression.
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(message = FALSE)
knitr::opts_chunk$set(warning = FALSE)
library(dplyr)
Download the classification output data set and load it
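A minimal sketch of the load step is shown below; the file name is a hypothetical placeholder, since the assignment only says to download the classification output data set.
# Hedged sketch of the load step; the file name below is a hypothetical placeholder
data <- read.csv("classification-output-data.csv")
head(data, 2)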
## pregnant glucose diastolic skinfold insulin bmi pedigree age class
## 1 7 124 70 33 215 25.5 0.161 37 0
## 2 2 122 76 27 200 35.9 0.483 26 0
## scored.class scored.probability
## 1 0 0.3284523
## 2 0 0.2731904
instances = dim(data)[1]
features = dim(data)[2]
cat('Dataset for this assignment is', instances, 'instances and', features, 'features')
## Dataset for this assignment is 181 instances and 11 features
Create a dataframe of the features needed for questions 1 - 8, and 11
cdata = data %>% select(c(class, scored.class))
head(cdata,2)
## class scored.class
## 1 0 0
## 2 0 0
Use the table() function to get the raw confusion matrix for the scored dataset.
library(data.table)
# Convert cdata to a data.table (used here only for a compact print of the rows)
ctable = setDT(cdata)
ctable
## class scored.class
## 1: 0 0
## 2: 0 0
## 3: 1 0
## 4: 0 0
## 5: 0 0
## ---
## 177: 0 0
## 178: 1 1
## 179: 1 1
## 180: 0 0
## 181: 0 0
# Build the confusion matrix: rows are the predicted class, columns are the actual class
cmatrix = table(ctable$scored.class, ctable$class)
cmatrix
##
## 0 1
## 0 119 30
## 1 5 27
TN = cmatrix[1,1]
TP = cmatrix[2,2]
FP = cmatrix[2,1]
FN = cmatrix[1,2]
cat('True negative(TN):',TN, 'True positive(TP):',TP, 'False negative(FN):',FN, 'False positive(FP):',FP)
## True negative(TN): 119 True positive(TP): 27 False negative(FN): 30 False positive(FP): 5
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the accuracy of the predictions.
faccuracy <- function(TP, FP, TN, FN){
  # Accuracy = (TP + TN) / (TP + FP + TN + FN)
  accuracy <- sum(TP, TN)/sum(TP, FP, TN, FN)
  return(accuracy)
}
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the classification error rate of the predictions.
fclass_error_rate <- function(TP, FP, TN, FN){
  # Classification error rate = (FP + FN) / (TP + FP + TN + FN)
  class_error_rate <- sum(FP, FN)/sum(TP, FP, TN, FN)
  return(class_error_rate)
}
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the precision of the predictions.
fprecision <- function(TP, FP){
  # Precision = TP / (TP + FP)
  precision <- TP/(TP + FP)
  return(precision)
}
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the sensitivity of the predictions. Sensitivity is also known as recall.
fsensitivity <- function(TP, FN){
  # Sensitivity (recall) = TP / (TP + FN)
  sensitivity <- TP/(TP + FN)
  return(sensitivity)
}
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the specificity of the predictions.
fspecificity <- function(FP, TN){
  # Specificity = TN / (TN + FP)
  specificity <- TN/(TN + FP)
  return(specificity)
}
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the F1 score of the predictions.
fF1_score <- function(precision, sensitivity){
  # F1 score = 2 * Precision * Sensitivity / (Precision + Sensitivity)
  F1_score <- (2 * precision * sensitivity)/(precision + sensitivity)
  return(F1_score)
}
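Each prompt above asks for a function that accepts the data set as a dataframe with the actual and predicted classifications identified. The count-based helpers can be wrapped to satisfy that interface; the sketch below is a hedged example (fmetrics_from_df is an illustrative name, and the default column names are assumed from this data set).
# Hedged sketch: a dataframe-based wrapper around the count-based functions above.
# fmetrics_from_df is an illustrative name; default column names are assumed from this data set.
fmetrics_from_df <- function(df, actual = "class", predicted = "scored.class") {
  cm <- table(df[[predicted]], df[[actual]])   # rows = predicted, columns = actual
  TN <- cm["0", "0"]; TP <- cm["1", "1"]
  FP <- cm["1", "0"]; FN <- cm["0", "1"]
  list(accuracy    = faccuracy(TP, FP, TN, FN),
       error_rate  = fclass_error_rate(TP, FP, TN, FN),
       precision   = fprecision(TP, FP),
       sensitivity = fsensitivity(TP, FN),
       specificity = fspecificity(FP, TN),
       F1          = fF1_score(fprecision(TP, FP), fsensitivity(TP, FN)))
}
# Example usage: fmetrics_from_df(cdata)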
What are the bounds on the F1 score? Show that the F1 score will always be between 0 and 1. (Hint: If \(0 < a < 1\) and \(0 < b < 1\) then \(ab < a\).)
According to David M W Powers in Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness & Correlation, “Sensitivity is the proportion of Real Positive cases that are correctly Predicted Positive”. Additionally, he states: “Precision denotes the proportion of Predicted Positive cases that are correctly Real Positives.” Because proportions inherently range from zero to one, we can take zero and one to be the bounds of Sensitivity and Precision.
To calculate the F1 score we insert Sensitivity and Precision into the following formula:
\(F1 = \frac{2 \times Precision \times Sensitivity}{Precision + Sensitivity}\)
Given that if \(0 < a < 1\) and \(0 < b < 1\) then \(ab < a\), the numerator (before multiplying by two) in the above formula cannot exceed the denominator. To show that the numerator still cannot exceed the denominator after multiplying by two, we conduct two simulations (an algebraic note follows the plots below).
First, since Sensitivity and Precision range from zero to one, we create a sequence of possible values for each at intervals of 0.01. Then, for each combination of Sensitivity and Precision, we compute the numerator and the denominator and take their difference. If the difference is never negative, the numerator never exceeds the denominator, so the F1 score cannot exceed one. And because Sensitivity and Precision cannot be below zero, neither the numerator nor the denominator (which sums these two metrics) can be negative, so the F1 score cannot fall below zero.
Second, we again step through every combination of Sensitivity and Precision values from zero to one at 0.01 increments. This time we compute the F1 score for each combination to demonstrate that all simulated values fall between zero and one.
# Simulate Precision and Sensitivity: vectors of values between 0 and 1 at 0.01 interval
precision_sweep <- seq(from = 0, to = 1, by = 0.01)
sensitivity_sweep <- seq(from = 0, to = 1, by = 0.01)
library(tibble)   # tibble() and add_row() used below come from the tibble package
f1_bounds_method_1 <- function(precision_sweep, sensitivity_sweep) {
# Initialize dataframe (first row to be removed later)
df_f1 <- tibble("numerator" = 0, "denominator" = 0, "diff" = 0)
# Loop through each value of Precision (0 to 1 at 0.01 intervals)
for (prec in precision_sweep) {
# Loop through each value of Sensitivity (0 to 1 at 0.01 intervals)
for (sens in sensitivity_sweep) {
# Calculate numerator and denominator
f1_numerator <- (2 * prec * sens)
f1_denominator <- (prec + sens)
# Subtract numerator from denominator
diff_denom_numer <- f1_denominator - f1_numerator
# Append row to dataframe with calculated values
df_f1 <- df_f1 %>% add_row("numerator" = f1_numerator, "denominator" = f1_denominator,
"diff" = diff_denom_numer)
}
}
# Remove first row which was only created to initialize the dataframe
df_f1 <- df_f1[-1,]
return (plot(df_f1$diff, xlab = "Simulation #", ylab = "Difference: Denominator - Numerator"))
}
f1_bounds_method_1(precision_sweep, sensitivity_sweep)
f1_bounds_method_2 <- function(precision_sweep, sensitivity_sweep) {
# Initialize empty vector
f1_scores <- c()
# Loop through each value of Precision (0 to 1 at 0.01 intervals)
for (prec in precision_sweep) {
# Loop through each value of Sensitivity (0 to 1 at 0.01 intervals)
for (sens in sensitivity_sweep) {
# Append F1 score to F1 scores vector
f1_scores <- c(f1_scores, fF1_score(prec, sens))
}
}
return (plot(f1_scores, xlab = "Simulation #", ylab = "F1 Score"))
}
f1_bounds_method_2(precision_sweep, sensitivity_sweep)
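For completeness, the bound the simulations illustrate also follows algebraically from the hint. Writing \(P\) for Precision and \(S\) for Sensitivity (with \(P + S > 0\) so the ratio is defined):
\[
P \cdot S \le P \quad\text{and}\quad P \cdot S \le S \;\Longrightarrow\; 2 \cdot P \cdot S \le P + S \;\Longrightarrow\; 0 \le F1 = \frac{2 \cdot P \cdot S}{P + S} \le 1.
\]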
Write a function that generates an ROC curve from a data set with a true classification column (class in our example) and a probability column (scored.probability in our example). Your function should return a list that includes the plot of the ROC curve and a vector that contains the calculated area under the curve (AUC).
First, to generate the ROC curve we create a sequence of thresholds ranging from 0 to 1 at 0.01 intervals, as specified in the directions. We then loop through each threshold and store the associated Sensitivity and 1 - Specificity values in a dataframe. The resulting values are plotted with 1 - Specificity on the x axis and Sensitivity on the y axis, which gives the ROC curve. Next, we approximate the AUC using a method demonstrated by Bob Horton for Revolutions: sample from the positive-labeled and the negative-labeled observations and estimate the probability that a positive case is scored higher than a negative case.
generate_roc <- function(data) {
# Initialize dataframe (first row to be removed later)
df_sen_spec <- tibble("sensitivity" = 0, "specificity" = 0)
# Loop through thresholds
for (thres in seq(from = 0, to = 1, by = 0.01)) {
# If scored probability is above threshold, assign a value of 1
data$new_threshold <- ifelse(data$scored.probability > thres, 1, 0)
# Create confusion matrix for threshold
thres_method = table(factor(data$new_threshold, levels = 0:1),
factor(data$class, levels = 0:1),
dnn = c("Prediction", "Reference"))
# Extract true negatives, true positives, false positive, and false negatives
TN <- thres_method[1,1]
TP <- thres_method[2,2]
FP <- thres_method[2,1]
FN <- thres_method[1,2]
# Calculate Sensitivity and Specificity via our previously created functions
thres_sensitivity = fsensitivity(TP, FN)
thres_specificity = fspecificity(FP, TN)
# Append this threshold's Sensitivity and 1 - Specificity (stored in the specificity column)
df_sen_spec <- df_sen_spec %>% add_row("sensitivity" = thres_sensitivity,
"specificity" = 1 - thres_specificity)
}
# Remove first row which was only created to initialize the dataframe
df_sen_spec <- df_sen_spec[-1,]
# Generate ROC curve plot
roc_plot <- plot(df_sen_spec$specificity, df_sen_spec$sensitivity,
type="l", asp = 1, xlab = "1 - Specificity", ylab = "Sensitivity")
return (roc_plot)
}
calculate_auc_prob <- function(true_class, scored_probability, num_samples) {
# Sample from positive observations
pos <- sample(scored_probability[true_class], num_samples, replace=TRUE)
# Sample from negative observations
neg <- sample(scored_probability[!true_class], num_samples, replace=TRUE)
# Approximate AUC by summing probability of positive cases outranking negative cases
auc <- (sum(pos > neg) + sum(pos == neg)/2) / num_samples
return (c(auc))
}
roc_auc <- function(data) {
# Generate and store ROC curve plot
roc_plot <- generate_roc(data)
# Calculate and store AUC
auc_cal <- calculate_auc_prob(as.logical(data$class), data$scored.probability, 10000000)
roc_auc_list <- list(roc_plot, auc_cal)
return (roc_auc_list)
}
roc_auc(data)
## [[1]]
## NULL
##
## [[2]]
## [1] 0.8503294
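The first element of the returned list prints as NULL because base plot() draws the figure as a side effect and returns NULL. As a hedged cross-check on the sampling-based approximation (not part of the assignment's method), the AUC could also be estimated with the trapezoidal rule over the same threshold sweep; calculate_auc_trapezoid below is an illustrative name.
# Hedged sketch: trapezoidal-rule AUC over the same 0-to-1 threshold sweep used in generate_roc()
calculate_auc_trapezoid <- function(true_class, scored_probability) {
  thresholds <- seq(from = 0, to = 1, by = 0.01)
  # True positive rate and false positive rate at each threshold
  tpr <- sapply(thresholds, function(t) mean(scored_probability[true_class == 1] > t))
  fpr <- sapply(thresholds, function(t) mean(scored_probability[true_class == 0] > t))
  # Order the points by increasing false positive rate, then sum the trapezoid areas
  ord <- order(fpr, tpr)
  fpr <- fpr[ord]
  tpr <- tpr[ord]
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
}
# Example usage: calculate_auc_trapezoid(data$class, data$scored.probability)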
Use your created R functions and the provided classification output data set to produce all of the classification metrics discussed above.
accuracy = faccuracy(TP, FP, TN, FN)
cat('accuracy:',accuracy,'\n')
## accuracy: 0.8066298
class_error_rate = fclass_error_rate(TP, FP, TN, FN)
cat('classification error rate:',class_error_rate,'\n')
## classification error rate: 0.1933702
precision = fprecision(TP, FP)
cat('precision:',precision,'\n')
## precision: 0.84375
sensitivity = fsensitivity(TP,FN)
cat('sensitivity:',sensitivity,'\n')
## sensitivity: 0.4736842
specificity = fspecificity(FP,TN)
cat('specificity:',specificity,'\n')
## specificity: 0.9596774
F1_score = fF1_score(precision, sensitivity)
cat('F1 score:',F1_score)
## F1 score: 0.6067416
Investigate the caret package. In particular, consider the functions confusionMatrix, sensitivity, and specificity. Apply the functions to the data set. How do the results compare with your own functions?
The caret package builds its confusion matrix with the predictions as rows and the reference (actual) classes as columns, so with 1 as the positive class the cells are laid out as:
TN FN
FP TP
When I first ran the caret package’s confusionMatrix, the sensitivity and specificity values were swapped relative to what my custom functions produced. A Stack Overflow answer helped me solve the problem: I needed to set the parameter “positive” to “1” because caret defaults to treating the first factor level, 0, as the positive class.
library(caret)
# Creates vectors having data points
expected_value <- factor(c(cdata %>% pull(class)))
predicted_value <- factor(c(cdata %>% pull(scored.class)))
#Creating confusion matrix
caret_confusion_matrix <- confusionMatrix(data=predicted_value, reference = expected_value, positive='1')
#Display results
caret_confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 119 30
## 1 5 27
##
## Accuracy : 0.8066
## 95% CI : (0.7415, 0.8615)
## No Information Rate : 0.6851
## P-Value [Acc > NIR] : 0.0001712
##
## Kappa : 0.4916
##
## Mcnemar's Test P-Value : 4.976e-05
##
## Sensitivity : 0.4737
## Specificity : 0.9597
## Pos Pred Value : 0.8438
## Neg Pred Value : 0.7987
## Prevalence : 0.3149
## Detection Rate : 0.1492
## Detection Prevalence : 0.1768
## Balanced Accuracy : 0.7167
##
## 'Positive' Class : 1
##
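The standalone sensitivity() and specificity() helpers in caret should give the same numbers as our functions; a brief sketch, again specifying the positive/negative level because caret’s defaults treat the first factor level (0) as the positive class:
# caret's standalone helpers applied to the same factors; the levels are set explicitly
# so that 1 is treated as the positive class and 0 as the negative class
sensitivity(predicted_value, expected_value, positive = "1")
specificity(predicted_value, expected_value, negative = "0")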
Investigate the pROC package. Use it to generate an ROC curve for the data set. How do the results compare with your own functions?
The ROC curve from the pROC package is close to identical to the plot generated by our own function; the only visible difference is that our curve, built from thresholds at 0.01 increments, appears slightly smoother. The AUC reported by pROC (85.03%) agrees with the sampling-based approximation from our function (0.8503) to within rounding.
library(pROC)
# Generate ROC curve plot and AUC
roc(data$class, data$scored.probability,
percent=TRUE, plot=TRUE, ci=TRUE)
##
## Call:
## roc.default(response = data$class, predictor = data$scored.probability, percent = TRUE, ci = TRUE, plot = TRUE)
##
## Data: data$scored.probability in 124 controls (data$class 0) < 57 cases (data$class 1).
## Area under the curve: 85.03%
## 95% CI: 79.05%-91.01% (DeLong)
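For a direct numeric comparison on the same 0-to-1 scale as our function, the AUC can also be extracted from a pROC object; a small sketch (roc_obj is just an illustrative name):
# Build the pROC object without percent scaling and pull out the AUC as a plain number
roc_obj <- roc(data$class, data$scored.probability)
as.numeric(auc(roc_obj))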