Overview

In this homework assignment, you will work through various classification metrics. You will be asked to create functions in R to carry out the various calculations. You will also investigate some functions in packages that will let you obtain the equivalent results. Finally, you will create graphical output that also can be used to evaluate the output of classification models, such as binary logistic regression.

Supplemental Material

Instructions

Complete each of the following steps as instructed:

  1. Download the classification output data set (attached in Blackboard to the assignment).
url <- 'https://raw.githubusercontent.com/jcp9010/MSDA/master/CUNY%20621/Homework%202/classification-output-data.csv'
dm_df <- read.csv(url, header = TRUE)
head(dm_df)
##   pregnant glucose diastolic skinfold insulin  bmi pedigree age class
## 1        7     124        70       33     215 25.5    0.161  37     0
## 2        2     122        76       27     200 35.9    0.483  26     0
## 3        3     107        62       13      48 22.9    0.678  23     1
## 4        1      91        64       24       0 29.2    0.192  21     0
## 5        4      83        86       19       0 29.3    0.317  34     0
## 6        1     100        74       12      46 19.5    0.149  28     0
##   scored.class scored.probability
## 1            0         0.32845226
## 2            0         0.27319044
## 3            0         0.10966039
## 4            0         0.05599835
## 5            0         0.10049072
## 6            0         0.05515460
  2. The data set has three key columns we will use: class (the actual outcome), scored.class (the predicted class from the model), and scored.probability (the predicted probability of being in class 1).

Use the table() function to get the raw confusion matrix for this scored dataset. Make sure you understand the output. In particular, do the rows represent the actual or predicted class? The columns?

with(dm_df, table(scored.class, class)[2:1,2:1])
##             class
## scored.class   1   0
##            1  27   5
##            0  30 119

The website: http://www.saedsayad.com/model_evaluation_c.htm does a great job explaining the confusion matrix. According to the website, “a confusion matrix shows the number of correct and incorrect predictions made by the classification model compared to the actual outcomes (target value) in the data. The matrix is \(NxN\), where N is the number of target values (classes). Performance of such models is commonly evaluated using the data in the matrix. The following table displays a 2x2 confusion matrix for two classes (Positive and Negative).”

When we look at the diabetes data set and the table above, the confusion matrix I created has the rows identified as the model prediction (scored.class) and the columns identified as the actual target (class). For example, in the top row, the model predicts 32 cases as positive (27 + 5), when in fact only 27 of those cases are truly positive and the other 5 are negative.
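
To make the orientation explicit, one optional approach (not required by the assignment) is to name the table dimensions and add margin totals. The dimension labels below are illustrative choices; naming the arguments to table() and calling addmargins() are both base R.

# Label the dimensions so the predicted/actual roles are unambiguous,
# then append row and column totals with addmargins()
labeled_table <- with(dm_df, table(Predicted = scored.class, Actual = class)[2:1, 2:1])
addmargins(labeled_table)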

  3. Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the accuracy of the predictions.

\[\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}\]

accuracy_fct <- function(x){
  # Takes dataframe Diabetes and returns an accuracy
  df_table <- with(x, table(scored.class, class)[2:1,2:1])
  acc <- (df_table[1] + df_table[4])/(df_table[1] + df_table[3] + df_table[4] + df_table[2])
  return(acc)
}

print(paste0("Accuracy: ", round(accuracy_fct(dm_df),3)))
## [1] "Accuracy: 0.807"
  4. Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the classification error rate of the predictions.

\[\text{Classification Error Rate} = \frac{FP + FN}{TP + FP + TN + FN}\]

class_error_fct <- function(x){
  # Takes dataframe Diabetes and returns a Classification Error Rate
  df_table <- with(x, table(scored.class, class)[2:1,2:1])
  class_error <- (df_table[3] + df_table[2])/(df_table[4] + df_table[3] + df_table[1] + df_table[2])
  return(class_error)
}

print(paste0("Classification Error Rate: ", round(class_error_fct(dm_df),3)))
## [1] "Classification Error Rate: 0.193"

Verify that the accuracy and the error rate sum to one.

total_sum <- accuracy_fct(dm_df) + class_error_fct(dm_df)
print(paste0("Accuracy: ", accuracy_fct(dm_df), " + ", class_error_fct(dm_df), " = ", total_sum))
## [1] "Accuracy: 0.806629834254144 + 0.193370165745856 = 1"
  5. Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the precision of the predictions.

\[\text{Precision} = \frac{TP}{TP + FP}\]

precision_fct <- function(x){
  # Takes dataframe Diabetes and returns Precision
  df_table <- with(x, table(scored.class, class)[2:1,2:1])
  prec <- (df_table[1])/(df_table[1] + df_table[3])
  return(prec)
}

print(paste0("Precision: ", round(precision_fct(dm_df),3)))
## [1] "Precision: 0.844"
  6. Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the sensitivity of the predictions. Sensitivity is also known as recall.

\[\text{Sensitivity} = \frac{TP}{TP + FN}\]

sensitivity_fct <- function(x){
  # Takes dataframe Diabetes and returns Sensitivity
  df_table <- with(x, table(scored.class, class)[2:1,2:1])
  sens <- (df_table[1])/(df_table[1] + df_table[2])
  return(sens)
}

print(paste0("Sensitivity: ", round(sensitivity_fct(dm_df),3)))
## [1] "Sensitivity: 0.474"
  7. Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the specificity of the predictions.

\[\text{Specificity} = \frac{TN}{TN + FP}\]

specificity_fct <- function(x){
  # Takes dataframe Diabetes and returns Specificity
  df_table <- with(x, table(scored.class, class)[2:1,2:1])
  spec <- (df_table[4])/(df_table[4] + df_table[3])
  return(spec)
}

print(paste0("Specificity: ", round(specificity_fct(dm_df),3)))
## [1] "Specificity: 0.96"
  8. Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the F1 score of the predictions.

\[\text{F1 Score} = \frac{2 * Precision * Sensitivity}{Precision + Sensitivity}\]

f1_fct <- function(x){
  # Takes dataframe Diabetes and returns a F1 score
  df_table <- with(x, table(scored.class, class)[2:1,2:1])
  prec <- (df_table[1])/(df_table[1] + df_table[3])
  sens <- (df_table[1])/(df_table[1] + df_table[2])
  
  f1 <- (2 * prec * sens)/(prec + sens)
  return(f1)
}

print(paste0("F1 Score: ", round(f1_fct(dm_df),3)))
## [1] "F1 Score: 0.607"
  9. Before we move on, let’s consider a question that was asked: What are the bounds on the F1 score? Show that the F1 score will always be between 0 and 1. (Hint: if 0 < a < 1 and 0 < b < 1, then ab < a.)

We will take a to be precision and b to be sensitivity. Both are bounded between 0 and 1, or in other words \(0 < a < 1\) and \(0 < b < 1\).

Because a and b are both positive, the numerator \(2ab\) and the denominator \(a + b\) are positive, so the F1 score is always greater than 0. For the upper bound, the hint tells us that \(ab < a\) and \(ab < b\); adding these two inequalities gives \(2ab < a + b\), and dividing both sides by \(a + b > 0\) shows that

\[\text{F1 Score} = \frac{2ab}{a+b} < 1\]

In the limiting case of perfect precision and sensitivity (\(a = b = 1\)), the score attains its maximum value, \(\frac{2 \cdot 1 \cdot 1}{1+1} = 1\). Therefore the F1 score is always bounded between 0 and 1.
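
As an informal sanity check of this bound (not part of the assignment), we can plug randomly drawn precision and sensitivity values from the open interval (0, 1) into the F1 formula and confirm that every result stays strictly between 0 and 1. The sample size and seed below are arbitrary choices.

# Draw random precision (a) and sensitivity (b) values in (0, 1)
set.seed(123)
a <- runif(10000)
b <- runif(10000)
f1 <- (2 * a * b) / (a + b)

# This check should return TRUE
all(f1 > 0 & f1 < 1)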

  10. Write a function that generates an ROC curve from a data set with a true classification column (class in our example) and a probability column (scored.probability in our example). Your function should return a list that includes the plot of the ROC curve and a vector that contains the calculated area under the curve (AUC). Note that I recommend using a sequence of thresholds ranging from 0 to 1 at 0.01 intervals.
roc_curve_fct <- function(x){
  # Returns a list that includes the plot of the ROC curve and a vector that contains the calculated area under the curve.
  # Using sequence of thresholds ranging from 0 to 1 at 0.01 intervals.
  library(ggplot2)
  
  seq_int <- seq(0,1,by=0.01)
  TPR_vector <- c()
  FPR_vector <- c()
  
  for (i in 1:length(seq_int)){
    scored_class <- ifelse(x$scored.probability >= seq_int[i], 1, 0)
    rev_df <- data.frame(scored.class = scored_class, class = x$class)
    df_table <- with(rev_df, table(scored.class, class))
    TPR <- (df_table[4])/(df_table[4] + df_table[3])
    FPR <- (df_table[2]/(df_table[2] + df_table[1]))
    
    TPR_vector[i] <- TPR
    FPR_vector[i] <- FPR
  }
  
  plot_df <- data.frame(TRUE_POSITIVE = TPR_vector, FALSE_POSITIVE = FPR_vector)
  
  ROC_plot <- ggplot(plot_df, aes(x = FALSE_POSITIVE, y = TRUE_POSITIVE)) +
    geom_point() +
    geom_line(col = "blue") +
    geom_abline(intercept = 0, slope = 1) +
    labs(title = "Receiver Operating Characteristic (ROC) Curve",
         x = "False Positive Rate (1 - Specificity)",
         y = "True Positive Rate (Sensitivity)")
  
  # To roughly calculate the area under the curve, we must first remove the NA
  # values produced at thresholds where every case is scored as the same class
  AUC_df <- plot_df[complete.cases(plot_df),]

  # Approximate the AUC with a rectangular sum: each interval's width along the
  # false positive axis times the true positive rate at its left endpoint
  widths <- abs(diff(AUC_df$FALSE_POSITIVE))
  heights <- head(AUC_df$TRUE_POSITIVE, -1)

  area_under_curve <- sum(widths * heights)
  
  return(list(ROC_plot, area_under_curve))
}

ROC_list <- roc_curve_fct(dm_df)
ROC_plot <- ROC_list[[1]]

ROC_plot

area_under_curve <- ROC_list[[2]]

print(paste0("Area Under the Curve: ", area_under_curve))
## [1] "Area Under the Curve: 0.829937747594793"
  11. Use your created R functions and the provided classification output data set to produce all of the classification metrics discussed above.
print(paste0("Accuracy: ", round(accuracy_fct(dm_df),3)))
## [1] "Accuracy: 0.807"
print(paste0("Classification Error Rate: ", round(class_error_fct(dm_df),3)))
## [1] "Classification Error Rate: 0.193"
print(paste0("Precision: ", round(precision_fct(dm_df),3)))
## [1] "Precision: 0.844"
print(paste0("Sensitivity: ", round(sensitivity_fct(dm_df),3)))
## [1] "Sensitivity: 0.474"
print(paste0("Specificity: ", round(specificity_fct(dm_df),3)))
## [1] "Specificity: 0.96"
print(paste0("F1 Score: ", round(f1_fct(dm_df),3)))
## [1] "F1 Score: 0.607"
  12. Investigate the caret package. In particular, consider the functions confusionMatrix, sensitivity, and specificity. Apply the functions to the data set. How do the results compare with your own functions?
library(caret)
df_table <- with(dm_df, table(scored.class, class)[2:1,2:1])
confusionMatrix(df_table)
## Confusion Matrix and Statistics
## 
##             class
## scored.class   1   0
##            1  27   5
##            0  30 119
##                                           
##                Accuracy : 0.8066          
##                  95% CI : (0.7415, 0.8615)
##     No Information Rate : 0.6851          
##     P-Value [Acc > NIR] : 0.0001712       
##                                           
##                   Kappa : 0.4916          
##  Mcnemar's Test P-Value : 4.976e-05       
##                                           
##             Sensitivity : 0.4737          
##             Specificity : 0.9597          
##          Pos Pred Value : 0.8438          
##          Neg Pred Value : 0.7987          
##              Prevalence : 0.3149          
##          Detection Rate : 0.1492          
##    Detection Prevalence : 0.1768          
##       Balanced Accuracy : 0.7167          
##                                           
##        'Positive' Class : 1               
## 
print(paste0("Sensitivity: ", sensitivity(df_table)))
## [1] "Sensitivity: 0.473684210526316"
print(paste0("Specificity: ", specificity(df_table)))
## [1] "Specificity: 0.959677419354839"

The caret results match those from our own functions: sensitivity is 0.4737 and specificity is 0.9597 in both cases, and confusionMatrix() reports the same accuracy (0.8066) that accuracy_fct() produced. The output also confirms that caret is treating class 1 as the positive class, just as our functions do.
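
To make the comparison concrete, the caret values can be checked against our functions directly; the all.equal() comparisons below are a quick illustration and should both return TRUE.

# Confirm that caret's sensitivity and specificity agree with our functions
all.equal(as.numeric(sensitivity(df_table)), as.numeric(sensitivity_fct(dm_df)))
all.equal(as.numeric(specificity(df_table)), as.numeric(specificity_fct(dm_df)))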

  13. Investigate the pROC package. Use it to generate an ROC curve for the data set. How do the results compare with your own functions?
library(pROC)

# Build the ROC object once and reuse it for both the plot and the AUC
roc_obj <- roc(dm_df$class, dm_df$scored.probability)
plot(roc_obj, main = "ROC Curve from pROC Package")

# Please note that the x axis is Specificity (as opposed to 1 - Specificity in the function above)

# Area Under the Curve
auc(roc_obj)
## Area under the curve: 0.8503

The results are very similar, particularly the shape of the curve. The AUC values differ by only about 0.02: roughly 0.83 from our coarse 0.01-threshold approximation versus 0.8503 from pROC, which computes the curve from the observed probabilities themselves rather than a fixed grid.
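
As a final side-by-side check (purely illustrative), the two AUC estimates can be printed together:

# Compare the rough grid-based AUC with pROC's AUC on the same data
round(c(custom = area_under_curve, pROC = as.numeric(auc(roc_obj))), 4)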