Overview

In this homework assignment, you will work through various classification metrics. You will be asked to create functions in R to carry out the various calculations. You will also investigate some functions in packages that will let you obtain the equivalent results. Finally, you will create graphical output that also can be used to evaluate the output of classification models, such as binary logistic regression.

setwd("~/DataScience/DATA_621_F23/Homework-2")
library(tidyverse)

Task 1

Download the classification output data set (attached in Blackboard to the assignment).

df <- read_csv("classification-output-data.csv")
head(df, 15)
## # A tibble: 15 × 11
##    pregnant glucose diastolic skinfold insulin   bmi pedigree   age class
##       <dbl>   <dbl>     <dbl>    <dbl>   <dbl> <dbl>    <dbl> <dbl> <dbl>
##  1        7     124        70       33     215  25.5    0.161    37     0
##  2        2     122        76       27     200  35.9    0.483    26     0
##  3        3     107        62       13      48  22.9    0.678    23     1
##  4        1      91        64       24       0  29.2    0.192    21     0
##  5        4      83        86       19       0  29.3    0.317    34     0
##  6        1     100        74       12      46  19.5    0.149    28     0
##  7        9      89        62        0       0  22.5    0.142    33     0
##  8        8     120        78        0       0  25      0.409    64     0
##  9        1      79        60       42      48  43.5    0.678    23     0
## 10        2     123        48       32     165  42.1    0.52     26     0
## 11        5      88        78       30       0  27.6    0.258    37     0
## 12        5     108        72       43      75  36.1    0.263    33     0
## 13       13      76        60        0       0  32.8    0.18     41     0
## 14        0     100        70       26      50  30.8    0.597    21     0
## 15        7     194        68       28       0  35.9    0.745    41     1
## # ℹ 2 more variables: scored.class <dbl>, scored.probability <dbl>

Task 2

The data set has three key columns we will use:

* class: the actual class for the observation
* scored.class: the predicted class for the observation (based on a threshold of 0.5)
* scored.probability: the predicted probability of success for the observation

Use the table() function to get the raw confusion matrix for this scored data set. Make sure you understand the output. In particular, do the rows represent the actual or predicted class? The columns?

confusion_matrix <- df |>
    select(class, scored.class) |>
    mutate(class = recode(class, `0` = "Actual Negative", `1` = "Actual Positive"),
        scored.class = recode(scored.class, `0` = "Predicted Negative", `1` = "Predicted Positive")) |>
    table()

confusion_matrix
##                  scored.class
## class             Predicted Negative Predicted Positive
##   Actual Negative                119                  5
##   Actual Positive                 30                 27

The rows represent the actual class, while the columns represent the predicted class.

Task 3

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the accuracy of the predictions.

\[Accuracy= \frac{TP+TN}{TP+FP+TN+FN}\]

acc_func <- function(df) {
    tp <- sum(df$class == 1 & df$scored.class == 1)
    tn <- sum(df$class == 0 & df$scored.class == 0)
    return((tn + tp)/nrow(df))
}

acc_func(df)
## [1] 0.8066298

Task 4

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the classification error rate of the predictions.

\[ Classification \ Error \ Rate = \frac{FP+FN}{TP+FP+TN+FN} \]

cer_func <- function(df) {
    fp <- sum(df$class == 0 & df$scored.class == 1)
    fn <- sum(df$class == 1 & df$scored.class == 0)
    return((fp + fn)/nrow(df))
}

cer_func(df)
## [1] 0.1933702

Verify that the accuracy and the classification error rate sum to one.

acc_func(df) + cer_func(df)
## [1] 1

Task 5

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the precision of the predictions.

\[ Precision = \frac{TP}{TP+FP} \]

prec_func <- function(df) {
    tp <- sum(df$class == 1 & df$scored.class == 1)
    fp <- sum(df$class == 0 & df$scored.class == 1)
    return(tp/(tp + fp))
}

prec_func(df)
## [1] 0.84375

Task 6

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the sensitivity of the predictions. Sensitivity is also known as recall.

\[ Sensitivity = \frac{TP}{TP+FN} \]

sens_func <- function(df) {
    tp <- sum(df$class == 1 & df$scored.class == 1)
    fn <- sum(df$class == 1 & df$scored.class == 0)
    return(tp/(tp + fn))
}

sens_func(df)
## [1] 0.4736842

Task 7

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the specificity of the predictions.

\[ Specificity = \frac{TN}{TN+FP} \]

spec_func <- function(df) {
    tn <- sum(df$class == 0 & df$scored.class == 0)
    fp <- sum(df$class == 0 & df$scored.class == 1)
    return(tn/(tn + fp))
}

spec_func(df)
## [1] 0.9596774

Task 8

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the F1 score of the predictions.

\[ F1 \ Score = \frac{2 (Precision) (Sensitivity)}{Precision + Sensitivity} \]

f1_func <- function(df) {
    return((2 * prec_func(df) * sens_func(df))/(prec_func(df) + sens_func(df)))
}

f1_func(df)
## [1] 0.6067416

Task 9

Before we move on, let’s consider a question that was asked: What are the bounds on the F1 score? Show that the F1 score will always be between 0 and 1. (Hint: If 0 < a < 1 and 0 < b < 1 then ab < a.)

If we let a = Precision and b = Sensitivity, then

\[ F1 = \frac{2ab}{a+b} \]

Lower bound: precision and sensitivity are ratios of nonnegative counts, so a and b are both nonnegative. The numerator and denominator of F1 are therefore nonnegative, which means F1 is at least zero.

Upper bound: by the hint, if 0 < a < 1 and 0 < b < 1, then ab < a and ab < b. Adding these two inequalities gives 2ab < a + b, and dividing both sides by a + b (which is positive) yields

\[ F1 = \frac{2ab}{a+b} < 1 \]

At the boundary, when a = b = 1 (perfect precision and sensitivity), F1 = 2(1)(1)/(1 + 1) = 1. Therefore

\[ 0 \leq F1 \leq 1 \]
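
As an optional numerical spot check (my own addition, not required by the assignment), we can evaluate 2ab/(a + b) over a grid of precision and sensitivity values and confirm the result stays within [0, 1]:

# Spot check: F1 = 2ab/(a + b) over a grid of values in (0, 1]
grid <- expand.grid(a = seq(0.01, 1, by = 0.01), b = seq(0.01, 1, by = 0.01))
f1_grid <- with(grid, 2 * a * b/(a + b))
range(f1_grid)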

Task 10

Write a function that generates an ROC curve from a data set with a true classification column (class in our example) and a probability column (scored.probability in our example). Your function should return a list that includes the plot of the ROC curve and a vector that contains the calculated area under the curve (AUC). Note that I recommend using a sequence of thresholds ranging from 0 to 1 at 0.01 intervals.

library(pROC)
roc_plot <- roc(df$class, df$scored.probability)
plot(roc_plot, main = "ROC Curve", xlab = "False Positive Rate", ylab = "True Positive Rate",
    col = "red", legacy.axes = TRUE, print.auc = TRUE)

Task 11

Use your created R functions and the provided classification output data set to produce all of the classification metrics discussed above.

library(knitr)
classification_metrics <- c(acc_func(df), cer_func(df), prec_func(df), sens_func(df),
    spec_func(df), f1_func(df))

names(classification_metrics) <- c("Accuracy", "Classification Error Rate", "Precision",
    "Sensitivity", "Specificity", "F1 Score")

metrics_data <- data.frame(Metrics = names(classification_metrics), Values = classification_metrics)
kable(metrics_data, row.names = FALSE, caption = "Classification Metrics")
Classification Metrics

Metrics                         Values
-------------------------   ---------
Accuracy                     0.8066298
Classification Error Rate    0.1933702
Precision                    0.8437500
Sensitivity                  0.4736842
Specificity                  0.9596774
F1 Score                     0.6067416

Task 12

Investigate the caret package. In particular, consider the functions confusionMatrix, sensitivity, and specificity. Apply the functions to the data set. How do the results compare with your own functions?

library(caret)

df_2 <- df |>
    select(class, scored.class) |>
    mutate(scored.class = as.factor(scored.class), class = as.factor(class))

caret_values <- confusionMatrix(df_2$class, df_2$scored.class)

caret_list <- c(caret_values$overall["Accuracy"], 1 - caret_values$overall["Accuracy"],
    caret_values$byClass["Specificity"], caret_values$byClass["Neg Pred Value"],
    caret_values$byClass["Pos Pred Value"], caret_values$byClass["F1"])

metrics_data$caret_values <- caret_list

kable(metrics_data, row.names = FALSE)
Metrics                         Values  caret_values
-------------------------   ---------  ------------
Accuracy                     0.8066298     0.8066298
Classification Error Rate    0.1933702     0.1933702
Precision                    0.8437500     0.8437500
Sensitivity                  0.4736842     0.4736842
Specificity                  0.9596774     0.9596774
F1 Score                     0.6067416     0.8717949
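
The F1 values disagree because confusionMatrix() above was given class as its first argument (which caret treats as the prediction) and scored.class as the reference, so caret scored "0" as the positive class; the other rows match only because the byClass statistics were re-mapped by hand. The sketch below is my own illustration, not part of the original submission: pass the predicted factor first, the actual class as the reference, and declare "1" as the positive class. With that orientation, caret's Sensitivity, Specificity, Pos Pred Value (precision), and F1 should match the hand-rolled functions, and the sensitivity() and specificity() helpers named in the task can be called the same way.

# Predictions first, actual class as the reference, and "1" as the positive
# class, so caret scores the same event as the hand-rolled functions above
caret_direct <- confusionMatrix(data = df_2$scored.class, reference = df_2$class,
    positive = "1")

# These should line up with prec_func(), sens_func(), spec_func(), and f1_func()
caret_direct$byClass[c("Sensitivity", "Specificity", "Pos Pred Value", "F1")]

# The standalone helpers mentioned in the task, called with the same orientation
sensitivity(df_2$scored.class, df_2$class, positive = "1")
specificity(df_2$scored.class, df_2$class, negative = "0")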

Task 13

Investigate the pROC package. Use it to generate an ROC curve for the data set. How do the results compare with your own functions?

roc_plot <- roc(df$class, df$scored.probability)
plot(roc_plot, main = "ROC Curve", xlab = "False Positive Rate", ylab = "True Positive Rate",
    col = "red", legacy.axes = TRUE, print.auc = TRUE)