Overview

In this homework assignment, you will work through various classification metrics. You will be asked to create functions in R to carry out the various calculations. You will also investigate some functions in packages that will let you obtain the equivalent results. Finally, you will create graphical output that also can be used to evaluate the output of classification models, such as binary logistic regression.

Part 1

The data were downloaded from Blackboard and posted to my GitHub site; the code below reads the CSV directly from the raw GitHub link.

data <- read.csv('https://raw.githubusercontent.com/logicalschema/Fall-2021/main/DATA621/hw2/classification-output-data.csv')

# Snippet of data
head(data)  

Part 2

The data set has three key columns we will use:

  • class: the actual class for the observation

  • scored.class: the predicted class for the observation (based on a threshold of 0.5)

  • scored.probability: the predicted probability of success for the observation

Use the table() function to get the raw confusion matrix for this scored dataset. Make sure you understand the output. In particular, do the rows represent the actual or predicted class? The columns?

library(cvms)
library(tibble) 

set.seed(93021)


# This is creating the confusion matrix using table
confusionMatrix <- table("Actual" = data$class, "Predicted" = data$scored.class)
confusionMatrix
##       Predicted
## Actual   0   1
##      0 119   5
##      1  30  27

The rows represent the actual class and the columns represent the predicted class.
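
A quick way to double-check that orientation is to add row and column totals with base R's addmargins(); the row sums should match the counts of actual 0s and 1s in the data (a small sketch):

# Row and column totals: the row sums equal the actual class counts,
# confirming that rows are the actual class and columns the predicted class
addmargins(confusionMatrix)
table(data$class)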

The page https://cran.r-project.org/web/packages/cvms/vignettes/Creating_a_confusion_matrix.html provides a helpful tutorial using other functions and plotting the confusion matrix.

The following is a plot of the confusion matrix.

cfm <- as_tibble(confusionMatrix)
#https://cran.r-project.org/web/packages/cvms/vignettes/Creating_a_confusion_matrix.html
plot_confusion_matrix(cfm, 
                      target_col = "Actual", 
                      prediction_col = "Predicted",
                      counts_col = "n")

Here is another way to obtain the confusion matrix, using the evaluate() function from the cvms package.

eval <- evaluate(data,
                 target_col = "class",
                 prediction_cols = "scored.class",
                 type = "binomial"
)

eval
conf_mat <- eval$`Confusion Matrix`[[1]]
conf_mat

The website https://www.saedsayad.com/model_evaluation_c.htm provides a reference diagram describing the confusion matrix:

[Figure: confusion matrix reference diagram from saedsayad.com]

Part 3

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the accuracy of the predictions.

\[Accuracy = \frac{TP + TN}{TP + FP + TN + FN}\]

# Function assumes the description of the columns as given in the original dataset
# Returns the Accuracy
calcAccuracy <- function(dataframeX) {
    cm <- table(dataframeX$class, dataframeX$scored.class)
    accuracy <- (cm[1, 1] + cm[2, 2]) / sum(cm)
    return(accuracy)
}


accuracyMeasure <- calcAccuracy(data)

According to the function calcAccuracy defined in the R code, the accuracy for the present dataset is 0.8066298.
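
As a sanity check (a brief sketch, not part of the required function), accuracy is simply the proportion of observations whose predicted class matches the actual class:

# Proportion of correctly classified observations; should equal accuracyMeasure
mean(data$class == data$scored.class)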

Part 4

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the classification error rate of the predictions. Verify that you get an accuracy and an error rate that sum to one.

\[Classification Error Rate = \frac{FP + FN}{TP + FP + TN + FN}\]

# Function assumes the description of the columns as given in the original dataset
# Returns the Classification Error Rate
calcClassErrorRate <- function(dataframex) {
    cm <- table(dataframex$class, dataframex$scored.class)
    cer <- (cm[1, 2] + cm[2, 1]) / sum(cm)
    return(cer)
}

dataCER <- calcClassErrorRate(data)

According to the function calcClassErrorRate defined in the R code, the classification error rate for the present dataset is 0.1933702. In addition, the sum of the accuracy and classification error rate for the dataset is 1.
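
The requested verification is a one-line check using the values computed above:

# Accuracy and classification error rate should sum to exactly 1
accuracyMeasure + dataCER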

Part 5

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the precision of the predictions.

\[Precision = \frac{TP}{TP + FP}\]

# Function assumes the description of the columns as given in the original dataset
# Returns the Precision Measure
calcPrecision <- function(dataframex) {
    cm <- table(dataframex$class, dataframex$scored.class)
    precision <- cm[2, 2] / (cm[2, 2] + cm[1, 2])
    return(precision)
}


precisionMeasure <- calcPrecision(data)

The precision of the predictions for the dataset is 0.84375.
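
Equivalently, precision can be checked directly from the raw columns as the share of predicted positives that are actually positive (a small sketch, independent of calcPrecision):

# Of the observations predicted to be class 1, the fraction that truly are class 1
sum(data$class == 1 & data$scored.class == 1) / sum(data$scored.class == 1)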

Part 6

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the sensitivity of the predictions. Sensitivity is also known as recall.

\[Sensitivity = \frac{TP}{TP + FN}\]

# Function assumes the description of the columns as given in the original dataset
# Returns the Sensitivity Measure
calcSensitivity <-  function(dataframex) {
    # https://stackoverflow.com/questions/30002013/error-in-confusion-matrix-the-data-and-reference-factors-must-have-the-same-nu
    cm <- table(factor(dataframex$class, levels = c(0, 1)),
                  factor(dataframex$scored.class, levels = c(0, 1)))
    sensitivity <- cm[2, 2] / (cm[2, 2] + cm[2, 1])
    return(sensitivity)
}

sensitivityMeasure <- calcSensitivity(data)

The sensitivity for the dataset is 0.4736842.

Part 7

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the specificity of the predictions.

\[Specificity = \frac{TN}{TN + FP}\]

# Function assumes the description of the columns as given in the original dataset
# Returns the Specificity Measure
calcSpecificity <- function(dataframex) {
    # add factor function to prevent errors in ROC calc
    cm <- table(factor(dataframex$class, levels = c(0, 1)), 
                  factor(dataframex$scored.class, levels = c(0, 1)))
    specificity <- cm[1, 1] / (cm[1, 1] + cm[1, 2])
    return(specificity)
}

specificityMeasure <- calcSpecificity(data)

The specificity for the data is 0.9596774.
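
Both of the last two measures can also be checked directly against the raw columns; sensitivity conditions on the actual positives, specificity on the actual negatives (a quick sketch):

# Sensitivity: fraction of actual 1s predicted as 1
sum(data$class == 1 & data$scored.class == 1) / sum(data$class == 1)
# Specificity: fraction of actual 0s predicted as 0
sum(data$class == 0 & data$scored.class == 0) / sum(data$class == 0)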

Part 8

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the F1 score of the predictions.

\[F1 Score = \frac{2 \times Precision \times Sensitivity}{Precision + Sensitivity}\]

# Function assumes the description of the columns as given in the original dataset
# Functions calcPrecision and calcSensitivity have to be defined
# Returns the F1 Score
calcF1 <- function(dataframeX){
  precision <- calcPrecision(dataframeX)
  sensitivity <- calcSensitivity(dataframeX)
  f1score <- (2 * precision * sensitivity)/(precision + sensitivity)
  return(f1score)
}

f1Measure <- calcF1(data)

The F1 score for the dataset is 0.6067416.

Part 9

Before we move on, let’s consider a question that was asked: What are the bounds on the F1 score? Show that the F1 score will always be between 0 and 1. (Hint: If 0 < a < 1 and 0 < b < 1 then ab < a.)

Given:
\[F1 Score = \frac{2 \times Precision \times Sensitivity}{Precision + Sensitivity}\]

Precision is the proportion of predicted positive cases that are actually positive, and sensitivity (i.e. recall) is the proportion of actual positive cases that are correctly identified. Combining the previous formulas for precision and sensitivity, the F1 score can also be written as:

\[F1 Score = \frac{TP}{TP + \frac{1}{2} (FP + FN)}\]

Let p denote precision and s denote sensitivity. Each is a ratio of a count (TP) to a total that is at least as large, so \(0 \le p \le 1\) and \(0 \le s \le 1\); for the F1 score to be defined we also need \(p + s > 0\).

\[F1 Score = \frac{2 \times p \times s}{p + s}\]

Using the hint, if \(0 < p \le 1\) and \(0 < s \le 1\), then \(ps \le p\) and \(ps \le s\). Adding these two inequalities gives \(2ps \le p + s\), and dividing by \(p + s > 0\) shows that the F1 score is at most 1. The numerator \(2ps\) is nonnegative, so the F1 score is at least 0 (it equals 0 exactly when TP = 0). Therefore the F1 score always lies in the interval [0, 1].
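
As a numeric check, the alternate form above gives the same F1 score when plugged with the confusion matrix counts from Part 2 (a brief sketch; TP = 27, FP = 5, FN = 30):

# Alternate form of the F1 score computed from the raw counts
TP <- 27; FP <- 5; FN <- 30
TP / (TP + 0.5 * (FP + FN))   # 0.6067416, matching calcF1(data)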

Part 10

Write a function that generates an ROC curve from a data set with a true classification column (class in our example) and a probability column (scored.probability in our example). Your function should return a list that includes the plot of the ROC curve and a vector that contains the calculated area under the curve (AUC). Note that I recommend using a sequence of thresholds ranging from 0 to 1 at 0.01 intervals.

library(ggplot2) 


# Function assumes the description of the columns as given in the original dataset
# Returns a list element where the 1st item is the plot and the 2nd is the auc
calcROC <- function(dataframex, n=100, plt=TRUE ) {
  temp <- dataframex[, c("class", "scored.probability")]
  specificity <- numeric(n + 1)
  sensitivity <- numeric(n + 1)
  auc <- numeric(n + 1)
  for (j in 0:n) {
    threshold <- j / n
    temp$scored.class <- ifelse(temp$scored.probability > threshold, 1, 0)
    specificity[j+1] <- calcSpecificity(temp)
    sensitivity[j+1] <- calcSensitivity(temp)
    if (j > 0) 
      auc[j+1] <- auc[j] + (sensitivity[j+1] + sensitivity[j]) * (specificity[j+1] - specificity[j]) / 2
  }
  
  # Plot
  # https://www.rdocumentation.org/packages/ggplot2/versions/0.9.1/topics/geom_step
  plt1 <- ggplot() + 
    geom_step(aes(x = 1 - specificity, y = sensitivity), direction = "vh", color='dodgerblue') + 
    geom_segment(aes(x = 0, y = 0, xend = 1, yend = 1), linetype = 3) + 
    labs(title = "ROC Curve", x = "FPR: False Positive Rate", y = "TPR: True Positive Rate") +
    theme_minimal() + 
    theme(axis.line = element_line(color='black'))
  
  
  return(list(plt1, auc[n+1]))
}


ROC <- calcROC(data, 100)

Here is a plot of the ROC:
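
The plot is the first element of the list returned by calcROC and the AUC is the second element, so both can be displayed directly (a brief sketch):

# First element: the ggplot object; second element: the trapezoidal AUC estimate
ROC[[1]]
ROC[[2]]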

The AUC computed by calcROC is 0.8488964.

Part 11

Use your created R functions and the provided classification output data set to produce all of the classification metrics discussed above.

In the previous steps, I ran my functions with the dataset. Here is a table of the measures.

Measure                      Value
Accuracy                     0.8066298
Classification Error Rate    0.1933702
Precision                    0.84375
Sensitivity                  0.4736842
Specificity                  0.9596774
F1 Score                     0.6067416

Part 12

Investigate the caret package. In particular, consider the functions confusionMatrix, sensitivity, and specificity. Apply the functions to the data set. How do the results compare with your own functions?

Documentation about caret can be found on its CRAN page (https://cran.r-project.org/package=caret).

library(caret)
## Loading required package: lattice
caretCM <- confusionMatrix(data = as.factor(data$scored.class), 
                           reference = as.factor(data$class), 
                           positive = "1"
                           )

# names(caretCM)
# "positive" "table" "overall" "byClass" "mode" "dots"
# accuracy = caretCM$overall[1]
# classification error rate = 1 - caretCM$overall[1]
# precision = caretCM$byClass[3]
# sensitivity = caretCM$byClass[1]
# specificity = caretCM$byClass[2]
# F1 score = caretCM$byClass[7]

# Output of the confusionMatrix function
caretCM
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 119  30
##          1   5  27
##                                           
##                Accuracy : 0.8066          
##                  95% CI : (0.7415, 0.8615)
##     No Information Rate : 0.6851          
##     P-Value [Acc > NIR] : 0.0001712       
##                                           
##                   Kappa : 0.4916          
##                                           
##  Mcnemar's Test P-Value : 4.976e-05       
##                                           
##             Sensitivity : 0.4737          
##             Specificity : 0.9597          
##          Pos Pred Value : 0.8438          
##          Neg Pred Value : 0.7987          
##              Prevalence : 0.3149          
##          Detection Rate : 0.1492          
##    Detection Prevalence : 0.1768          
##       Balanced Accuracy : 0.7167          
##                                           
##        'Positive' Class : 1               
## 

Here is the caret confusion matrix:

##           Reference
## Prediction   0   1
##          0 119  30
##          1   5  27

Here is a comparison of the measures using the caret package:

Measure                      My Functions    caret Functions
Accuracy                     0.8066298       0.8066298
Classification Error Rate    0.1933702       0.1933702
Precision                    0.84375         0.84375
Sensitivity                  0.4736842       0.4736842
Specificity                  0.9596774       0.9596774
F1 Score                     0.6067416       0.6067416

The caret functions return the same values as my functions.
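
For reference, the caret column of the table above can be pulled out of the confusionMatrix object by the names of its elements; the standalone sensitivity() and specificity() functions mentioned in the prompt return the same values (a sketch, assuming the caretCM object created above):

# Named elements of caretCM$overall and caretCM$byClass
caretAccuracy    <- caretCM$overall["Accuracy"]
caretErrorRate   <- 1 - caretCM$overall["Accuracy"]
caretPrecision   <- caretCM$byClass["Pos Pred Value"]
caretSensitivity <- caretCM$byClass["Sensitivity"]
caretSpecificity <- caretCM$byClass["Specificity"]
caretF1          <- caretCM$byClass["F1"]

# Standalone caret functions applied to the predicted and actual factors
sensitivity(as.factor(data$scored.class), as.factor(data$class), positive = "1")
specificity(as.factor(data$scored.class), as.factor(data$class), negative = "0")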

Part 13

Investigate the pROC package. Use it to generate an ROC curve for the data set. How do the results compare with your own functions?

Documentation about the pROC package can be found on its CRAN page (https://cran.r-project.org/package=pROC).

library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
roc2 <- roc(response = data$class, 
    predictor = data$scored.probability, 
    plot = TRUE, 
    print.auc = TRUE, 
    col="dodgerblue",
    main = "ROC Plot Using pROC")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

The roc() function from pROC produces an ROC curve and AUC that closely match the output of my calcROC function.
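
The AUC computed by pROC can also be extracted numerically for a direct comparison with the trapezoidal estimate from calcROC (a brief sketch using the roc2 object created above):

# AUC as computed by pROC, for comparison with ROC[[2]] from calcROC
auc(roc2)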

Appendix

The following is the R code used in this assignment.

# Part 1
data <- read.csv('https://raw.githubusercontent.com/logicalschema/Fall-2021/main/DATA621/hw2/classification-output-data.csv')

# Snippet of data
head(data)  

# Part 2
library(cvms)
library(tibble) 

set.seed(93021)


# This is creating the confusion matrix using table
confusionMatrix <- table("Actual" = data$class, "Predicted" = data$scored.class)
confusionMatrix

cfm <- as_tibble(confusionMatrix)
#https://cran.r-project.org/web/packages/cvms/vignettes/Creating_a_confusion_matrix.html
plot_confusion_matrix(cfm, 
                      target_col = "Actual", 
                      prediction_col = "Predicted",
                      counts_col = "n")

eval <- evaluate(data,
                 target_col = "class",
                 prediction_cols = "scored.class",
                 type = "binomial"
)

eval
conf_mat <- eval$`Confusion Matrix`[[1]]
conf_mat

# Part 3
# Function assumes the description of the columns as given in the original dataset
# Returns the Accuracy
calcAccuracy <- function(dataframeX) {
    cm <- table(dataframeX$class, dataframeX$scored.class)
    accuracy <- (cm[1, 1] + cm[2, 2]) / sum(cm)
    return(accuracy)
}


accuracyMeasure <- calcAccuracy(data)

# Part 4

# Function assumes the description of the columns as given in the original dataset
# Returns the Classification Error Rate
calcClassErrorRate <- function(dataframex) {
    cm <- table(dataframex$class, dataframex$scored.class)
    cer <- (cm[1, 2] + cm[2, 1]) / sum(cm)
    return(cer)
}

dataCER <- calcClassErrorRate(data)


# Part 5
# Function assumes the description of the columns as given in the original dataset
# Returns the Precision Measure
calcPrecision <- function(dataframex) {
    cm <- table(dataframex$class, dataframex$scored.class)
    precision <- cm[2, 2] / (cm[2, 2] + cm[1, 2])
    return(precision)
}


precisionMeasure <- calcPrecision(data)

# Part 6
# Function assumes the description of the columns as given in the original dataset
# Returns the Sensitivity Measure
calcSensitivity <-  function(dataframex) {
    # https://stackoverflow.com/questions/30002013/error-in-confusion-matrix-the-data-and-reference-factors-must-have-the-same-nu
    cm <- table(factor(dataframex$class, levels = c(0, 1)),
                  factor(dataframex$scored.class, levels = c(0, 1)))
    sensitivity <- cm[2, 2] / (cm[2, 2] + cm[2, 1])
    return(sensitivity)
}

sensitivityMeasure <- calcSensitivity(data)

# Part 7

# Function assumes the description of the columns as given in the original dataset
# Returns the Specificity Measure
calcSpecificity <- function(dataframex) {
    # add factor function to prevent errors in ROC calc
    cm <- table(factor(dataframex$class, levels = c(0, 1)), 
                  factor(dataframex$scored.class, levels = c(0, 1)))
    specificity <- cm[1, 1] / (cm[1, 1] + cm[1, 2])
    return(specificity)
}

specificityMeasure <- calcSpecificity(data)

# Part 8
# Function assumes the description of the columns as given in the original dataset
# Functions calcPrecision and calcSensitivity have to be defined
# Returns the F1 Score
calcF1 <- function(dataframeX){
  precision <- calcPrecision(dataframeX)
  sensitivity <- calcSensitivity(dataframeX)
  f1score <- (2 * precision * sensitivity)/(precision + sensitivity)
  return(f1score)
}

f1Measure <- calcF1(data)

# Part 10
library(ggplot2) 


# Function assumes the description of the columns as given in the original dataset
# Returns a list element where the 1st item is the plot and the 2nd is the auc
calcROC <- function(dataframex, n=100, plt=TRUE ) {
  temp <- dataframex[, c("class", "scored.probability")]
  specificity <- numeric(n + 1)
  sensitivity <- numeric(n + 1)
  auc <- numeric(n + 1)
  for (j in 0:n) {
    threshold <- j / n
    temp$scored.class <- ifelse(temp$scored.probability > threshold, 1, 0)
    specificity[j+1] <- calcSpecificity(temp)
    sensitivity[j+1] <- calcSensitivity(temp)
    if (j > 0) 
      auc[j+1] <- auc[j] + (sensitivity[j+1] + sensitivity[j]) * (specificity[j+1] - specificity[j]) / 2
  }
  
  # Plot
  # https://www.rdocumentation.org/packages/ggplot2/versions/0.9.1/topics/geom_step
  plt1 <- ggplot() + 
    geom_step(aes(x = 1 - specificity, y = sensitivity), direction = "vh", color='dodgerblue') + 
    geom_segment(aes(x = 0, y = 0, xend = 1, yend = 1), linetype = 3) + 
    labs(title = "ROC Curve", x = "FPR: False Positive Rate", y = "TPR: True Positive Rate") +
    theme_minimal() + 
    theme(axis.line = element_line(color='black'))
  
  
  return(list(plt1, auc[n+1]))
}


ROC <- calcROC(data, 100)

# Part 12
library(caret)
caretCM <- confusionMatrix(data = as.factor(data$scored.class), 
                           reference = as.factor(data$class), 
                           positive = "1"
                           )

# names(caretCM)
# "positive" "table" "overall" "byClass" "mode" "dots"
# accuracy = caretCM$overall[1]
# classification error rate = 1 - caretCM$overall[1]
# precision = caretCM$byClass[3]
# sensitivity = caretCM$byClass[1]
# specificity = caretCM$byClass[2]
# F1 score = caretCM$byClass[7]

# Output of the confusionMatrix function
caretCM


# Part 13
library(pROC)

roc2 <- roc(response = data$class, 
    predictor = data$scored.probability, 
    plot = TRUE, 
    print.auc = TRUE, 
    col="dodgerblue",
    main = "ROC Plot Using pROC")