Question 1

Download the classification output data set (attached to the assignment in Blackboard).

url <- "https://raw.githubusercontent.com/petferns/DATA621/main/classification-output-data.csv"
data <- read.csv(url)
head(data)
##   pregnant glucose diastolic skinfold insulin  bmi pedigree age class
## 1        7     124        70       33     215 25.5    0.161  37     0
## 2        2     122        76       27     200 35.9    0.483  26     0
## 3        3     107        62       13      48 22.9    0.678  23     1
## 4        1      91        64       24       0 29.2    0.192  21     0
## 5        4      83        86       19       0 29.3    0.317  34     0
## 6        1     100        74       12      46 19.5    0.149  28     0
##   scored.class scored.probability
## 1            0         0.32845226
## 2            0         0.27319044
## 3            0         0.10966039
## 4            0         0.05599835
## 5            0         0.10049072
## 6            0         0.05515460

Question 2

The data set has three key columns we will use:

- class: the actual class for the observation
- scored.class: the predicted class for the observation (based on a threshold of 0.5)
- scored.probability: the predicted probability of success for the observation

Use the table() function to get the raw confusion matrix for this scored data set. Make sure you understand the output. In particular, do the rows represent the actual or predicted class? The columns?

Because the table below is built as table(scored.class, class), the rows represent the predicted class (scored.class) and the columns represent the actual class (class).

library(dplyr)

data %>%
  select(scored.class, class) %>%
  table()
##             class
## scored.class   0   1
##            0 119  30
##            1   5  27
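
To make the orientation explicit, the same confusion matrix can be built with named dimensions (a sketch; the labels Predicted and Actual are added here for readability and are not column names in the data):

table(Predicted = data$scored.class, Actual = data$class)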

Question 3

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the accuracy of the predictions.

\(\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}\)

prediction_accuracy <- function(x){
  TP <- sum(x$class == 1 & x$scored.class == 1)
  TN <- sum(x$class == 0 & x$scored.class == 0)
  FP <- sum(x$class == 0 & x$scored.class == 1)
  FN <- sum(x$class == 1 & x$scored.class == 0)
  round((TP + TN)/(TP+FP+TN+FN), 3)
}
prediction_accuracy(data)
## [1] 0.807
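
As a quick sanity check (a sketch, not part of the required function), the same accuracy can be computed as the proportion of observations whose predicted class matches the actual class; it agrees with prediction_accuracy(data) above.

# proportion of correctly classified observations
round(mean(data$class == data$scored.class), 3)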

Question 4

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the classification error rate of the predictions.

\(\text{Classification Error Rate} = \frac{FP + FN}{TP + FP + TN + FN}\)

Verify that you get an accuracy and an error rate that sum to one.

error_rate <- function(x){
  TP <- sum(x$class == 1 & x$scored.class == 1)
  TN <- sum(x$class == 0 & x$scored.class == 0)
  FP <- sum(x$class == 0 & x$scored.class == 1)
  FN <- sum(x$class == 1 & x$scored.class == 0)
  round((FP + FN)/(TP + FP + TN + FN),3)
  
}
prediction_accuracy(data) + error_rate(data)
## [1] 1

Question 5

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the precision of the predictions.

\(\text{Precision} = \frac{TP}{TP + FP}\)

precision <- function(x){
  TP <- sum(x$class == 1 & x$scored.class == 1)
  FP <- sum(x$class == 0 & x$scored.class == 1)
  round(TP/(TP+FP),3)
}

precision(data)
## [1] 0.844

Question 6

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the sensitivity of the predictions. Sensitivity is also known as recall.

\(\text{Sensitivity} = \frac{TP}{TP + FN}\)

sensitivity <- function(x){
  TP <- sum(x$class == 1 & x$scored.class == 1)
  FN <- sum(x$class == 1 & x$scored.class == 0)
  round(TP/(TP+FN),3)
}

sensitivity(data)
## [1] 0.474

Question 7

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the specificity of the predictions.

\(\text{Specificity} = \frac{TN}{TN + FP}\)

specificity <- function(x){
  TN <- sum(x$class == 0 & x$scored.class == 0)
  FP <- sum(x$class == 0 & x$scored.class == 1)
  round(TN/(TN+FP),3)
}

specificity(data)
## [1] 0.96

Question 8

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the F1 score of the predictions.

\(F1\ \text{Score} = \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}\)

f1_score <- function(x){
  round((2*precision(x)*sensitivity(x))/(precision(x)+sensitivity(x)),3)
}

f1_score(data)
## [1] 0.607

Question 9

Before we move on, let’s consider a question that was asked: What are the bounds on the F1 score? Show that the F1 score will always be between 0 and 1. (Hint: If \(0 < a < 1\) and \(0 < b < 1\) then \(ab < a\).)

Precision and Sensitivity are both ratios of non-negative counts and are therefore bounded between 0 and 1, and the hint carries that bound over to the F1 score, as the short derivation below shows.
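
Writing \(P\) for Precision and \(S\) for Sensitivity (both assumed strictly positive so that the score is defined):

\[
0 < P \le 1,\; 0 < S \le 1 \;\Rightarrow\; PS \le P \text{ and } PS \le S \;\Rightarrow\; 2PS \le P + S \;\Rightarrow\; F1 = \frac{2PS}{P + S} \le 1.
\]

\[
P > 0,\; S > 0 \;\Rightarrow\; 2PS > 0 \text{ and } P + S > 0 \;\Rightarrow\; F1 > 0, \quad\text{so}\quad 0 < F1 \le 1.
\]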

Question 10

Write a function that generates an ROC curve from a data set with a true classification column (class in our example) and a probability column (scored.probability in our example). Your function should return a list that includes the plot of the ROC curve and a vector that contains the calculated area under the curve (AUC). Note that I recommend using a sequence of thresholds ranging from 0 to 1 at 0.01 intervals.

ROC_curve <- function(x, y){
  # Order the actual classes by decreasing predicted probability
  x <- x[order(y, decreasing = TRUE)]
  # Cumulative false positive and true positive rates as the cutoff decreases
  FPR <- cumsum(!x) / sum(!x)
  TPR <- cumsum(x) / sum(x)
  xy <- data.frame(TPR, FPR, x)
  
  # Trapezoidal approximation of the area under the curve
  FPR_df <- c(diff(xy$FPR), 0)
  TPR_df <- c(diff(xy$TPR), 0)
  AUC <- round(sum(xy$TPR * FPR_df) + sum(TPR_df * FPR_df)/2, 4)
  
  plot(xy$FPR, xy$TPR, type = "l",
       main = "ROC Curve",
       xlab = "False Positive Rate",
       ylab = "True Positive Rate")
  abline(a = 0, b = 1)
  legend(.6, .4, AUC, title = "AUC")
  
  # Return the plot and the AUC in a list, as the question asks
  invisible(list(plot = recordPlot(), AUC = AUC))
}

ROC_curve(data$class,data$scored.probability)
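
The function above builds the curve by sorting on the predicted probabilities rather than sweeping explicit cutoffs. As a cross-check, here is a minimal sketch of the threshold-sweep approach the question recommends, stepping a cutoff from 0 to 1 in 0.01 increments (the helper name simple_roc_sweep is illustrative, not part of the assignment):

simple_roc_sweep <- function(class, prob){
  thresholds <- seq(0, 1, by = 0.01)
  # classify as positive when the probability meets or exceeds each cutoff
  TPR <- sapply(thresholds, function(t) sum(prob >= t & class == 1) / sum(class == 1))
  FPR <- sapply(thresholds, function(t) sum(prob >= t & class == 0) / sum(class == 0))
  data.frame(threshold = thresholds, TPR = TPR, FPR = FPR)
}

sweep <- simple_roc_sweep(data$class, data$scored.probability)
plot(sweep$FPR, sweep$TPR, type = "l",
     main = "ROC Curve (threshold sweep)",
     xlab = "False Positive Rate", ylab = "True Positive Rate")
abline(a = 0, b = 1)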

Question 11

Use your created R functions and the provided classification output data set to produce all of the classification metrics discussed above.

library(knitr)

metrics <- c(prediction_accuracy(data), error_rate(data), precision(data), sensitivity(data), specificity(data), f1_score(data))
names(metrics) <- c("Accuracy", "Classification Error Rate", "Precision", "Sensitivity", "Specificity", "F1-Score")
kable(metrics, col.names = "Classification Metrics")
|                          | Classification Metrics|
|:-------------------------|----------------------:|
|Accuracy                  |                  0.807|
|Classification Error Rate |                  0.193|
|Precision                 |                  0.844|
|Sensitivity               |                  0.474|
|Specificity               |                  0.960|
|F1-Score                  |                  0.607|

Question 12

Investigate the caret package. In particular, consider the functions confusionMatrix, sensitivity, and specificity. Apply the functions to the data set. How do the results compare with your own functions?

The results from the caret package and my own functions match.

library(caret)

set_df <- data %>%
  select(scored.class, class) %>%
  mutate(scored.class = as.factor(scored.class), 
         class = as.factor(class))

cMat <- confusionMatrix(set_df$scored.class, set_df$class, positive = "1")

caret_package <- c(cMat$overall["Accuracy"], cMat$byClass["Sensitivity"], cMat$byClass["Specificity"])
own_function <- c(prediction_accuracy(data), sensitivity(data), specificity(data))
res <- cbind(caret_package, own_function)
kable(res)
|            | caret_package| own_function|
|:-----------|-------------:|------------:|
|Accuracy    |     0.8066298|        0.807|
|Sensitivity |     0.4736842|        0.474|
|Specificity |     0.9596774|        0.960|
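
The question also asks about caret's sensitivity and specificity functions; they can be applied directly to the factor columns as a sketch (the positive and negative levels are given explicitly, since caret otherwise defaults to the first factor level):

caret::sensitivity(set_df$scored.class, set_df$class, positive = "1")
caret::specificity(set_df$scored.class, set_df$class, negative = "0")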

Question 13

Investigate the pROC package. Use it to generate an ROC curve for the data set. How do the results compare with your own functions?

The ROC curve produced by the pROC package and the one produced by my own function match.

library(pROC)

par(mfrow = c(1, 2))
plot(roc(data$class, data$scored.probability), print.auc = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
ROC_curve(data$class,data$scored.probability)
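
For a numeric comparison of the areas as well (a sketch, relying on the list returned invisibly by ROC_curve above and on pROC's auc function):

roc_obj <- roc(data$class, data$scored.probability)
auc(roc_obj)                                        # AUC from pROC
ROC_curve(data$class, data$scored.probability)$AUC  # AUC from my own function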