Data 621 Homework 1

1. Fetching Data

# Load the packages used throughout: dplyr for data manipulation, ggplot2 for the ROC plot
library(dplyr)
library(ggplot2)

# Fetch the classification output data and keep the three columns of interest
data_raw <- read.csv("https://raw.githubusercontent.com/salma71/Data_621/master/HW_2/classification-output-data.csv")
data <- data_raw %>% select(class, scored.class, scored.probability)
head(data)

2. Raw confusion matrix

# Columns represent the actual class
# Rows represent the scored (predicted) class
conf_mat <- with(data, table(scored.class, class))
conf_mat
            class
scored.class   0   1
           0 119  30
           1   5  27
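
Reading the table with class 1 as the positive class: TN = 119 (scored 0, actual 0), FN = 30 (scored 0, actual 1), FP = 5 (scored 1, actual 0), and TP = 27 (scored 1, actual 1). These counts are used for the hand checks in the sections that follow.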

3. Accuracy

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the accuracy of the predictions.

\[Accuracy=\frac{TP+TN}{TP+FP+TN+FN}\]

# Accuracy: proportion of all observations that are classified correctly
accuracy <- function(x){
  TP <- sum(x$class == 1 & x$scored.class == 1)
  TN <- sum(x$class == 0 & x$scored.class == 0)
  round((TP + TN)/nrow(x), 4)
}
accuracy(data)
[1] 0.8066
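
Hand check against the confusion matrix counts:

\[Accuracy=\frac{27+119}{27+5+119+30}=\frac{146}{181}\approx 0.8066\]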

4. Classification error rate

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the classification error rate of the predictions.

\[Classification\hspace{.1cm}Error\hspace{.1cm}Rate=\frac{FP+FN}{TP+FP+TN+FN}\]

# Classification Error Rate
cls_er_rate <- function(x){
  FP <- sum(x$class == 0 & x$scored.class == 1)
  FN <- sum(x$class == 1 & x$scored.class == 0)
  round((FP + FN)/nrow(x), 4)
}
cls_er_rate(data)
[1] 0.1934
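
Hand check:

\[Classification\hspace{.1cm}Error\hspace{.1cm}Rate=\frac{5+30}{27+5+119+30}=\frac{35}{181}\approx 0.1934\]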

Verify that the accuracy and the classification error rate sum to one.

accuracy(data) + cls_er_rate(data)
[1] 1

5. Precision

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the precision of the predictions.

\[Precision=\frac{TP}{TP+FP}\]

# Precision
precision <- function(x){
  TP <- sum(x$class == 1 & x$scored.class == 1)
  FP <- sum(x$class == 0 & x$scored.class == 1)
  round(TP/(TP + FP), 4)
}
precision(data)
[1] 0.8438
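
Hand check:

\[Precision=\frac{27}{27+5}=\frac{27}{32}\approx 0.8438\]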

6. Sensitivity

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the sensitivity of the predictions. Sensitivity is also known as recall.

\[Sensitivity=\frac{TP}{TP+FN}\]

# Sensitivity
sensitivity <- function(x){
  TP <- sum(x$class == 1 & x$scored.class == 1)
  FN <- sum(x$class == 1 & x$scored.class == 0)
  round(TP/(TP + FN), 4)
}
sensitivity(data)
[1] 0.4737
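
Hand check:

\[Sensitivity=\frac{27}{27+30}=\frac{27}{57}\approx 0.4737\]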

7. Specificity

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the specificity of the predictions.

\[Specificity=\frac{TN}{TN+FP}\]

# Specificity
specificity <- function(x){
  TN <- sum(x$class == 0 & x$scored.class == 0)
  FP <- sum(x$class == 0 & x$scored.class == 1)
  round(TN/(TN + FP), 4)
}
specificity(data)
[1] 0.9597
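
Hand check:

\[Specificity=\frac{119}{119+5}=\frac{119}{124}\approx 0.9597\]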

8. F1 score

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the F1 score of the predictions.

\[F1\hspace{.1cm}Score=\frac{2\times Precision\times Sensitivity}{Precision+Sensitivity}\]

# F1 Score
f1_score <- function(x){
  (2*precision(x)*sensitivity(x))/(precision(x)+sensitivity(x))
}
f1_score(data)
[1] 0.6067675
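
Hand check, using the precision and sensitivity computed above:

\[F1\hspace{.1cm}Score=\frac{2\times 0.8438\times 0.4737}{0.8438+0.4737}\approx 0.6068\]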

9. F1 score bound

Before we move on, let’s consider a question that was asked: What are the bounds on the F1 score? Show that the F1 score will always be between 0 and 1.

Both Precision and Sensitivity, the only quantities used to calculate the F1 score, are bounded between 0 and 1. The numerator is 0 whenever either Precision or Sensitivity is 0, which gives the lower bound of 0; and since the F1 score increases in both quantities, the upper bound is attained at Precision = Sensitivity = 1, where \(\frac{2\times 1\times 1}{1+1}=1\).
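
Equivalently, the F1 score is the harmonic mean of Precision (\(P\)) and Sensitivity (\(S\)); for positive \(P\) and \(S\), the harmonic mean always lies between the smaller and the larger of the two values, both of which are at most 1:

\[0\le\min(P,S)\le F1=\frac{2PS}{P+S}\le\max(P,S)\le 1\]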

10. ROC curve

Write a function that generates an ROC curve from a data set with a true classification column (class in our example) and a probability column (scored.probability in our example). Your function should return a list that includes the plot of the ROC curve and a vector that contains the calculated area under the curve (AUC). Note that I recommend using a sequence of thresholds ranging from 0 to 1 at 0.01 intervals.

# ROC Curve
# x = vector of true class labels (0/1), y = vector of predicted probabilities.
# Rather than a fixed grid of thresholds, this sweeps each observed probability
# (in decreasing order) as a threshold, which traces out the same curve.
ROC <- function(x, y){
  # Order the true labels by decreasing predicted probability
  x <- x[order(y, decreasing = TRUE)]
  # Cumulative true positive and false positive rates as the threshold is lowered
  TPR <- cumsum(x) / sum(x)
  FPR <- cumsum(!x) / sum(!x)
  xy <- data.frame(TPR, FPR, x)
  
  # Approximate the area under the curve with the trapezoidal rule
  FPR_df <- c(diff(xy$FPR), 0)
  TPR_df <- c(diff(xy$TPR), 0)
  AUC <- round(sum(xy$TPR * FPR_df) + sum(TPR_df * FPR_df)/2, 4)
  
  p <- ggplot(data=xy, aes(x=FPR, y=TPR)) + geom_line() +
        geom_abline(slope=1, intercept=0) +
        labs(title="ROC Curve", x="False Positive Rate", y="True Positive Rate") +
        annotate("text", x = 0.75, y = 0.25, label = paste0("AUC: ", AUC))
  
  return(list("plot" = p, "auc" = AUC))
}
roc_ours <- ROC(data$class, data$scored.probability)
roc_ours$auc
[1] 0.8503
roc_ours$plot
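
The prompt also suggests sweeping a fixed sequence of thresholds from 0 to 1 at 0.01 intervals. As a rough cross-check of ROC() (not part of the graded solution), a minimal sketch of that approach is shown below; the helper names thresholds, tpr, and fpr are ours, and the trapezoidal AUC it returns should be close to the 0.8503 computed above.

# Cross-check: fixed grid of thresholds from 0 to 1 in 0.01 steps
thresholds <- seq(0, 1, by = 0.01)
# At each threshold, classify as positive when scored.probability >= threshold
tpr <- sapply(thresholds, function(th) mean(data$scored.probability >= th & data$class == 1) / mean(data$class == 1))
fpr <- sapply(thresholds, function(th) mean(data$scored.probability >= th & data$class == 0) / mean(data$class == 0))
# FPR runs from 1 down to 0 as the threshold increases; trapezoidal rule for the AUC
sum(-diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)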

11. Classification report

Use your created R functions and the provided classification output data set to produce all of the classification metrics discussed above.

metrics <- c(accuracy(data), cls_er_rate(data), precision(data), sensitivity(data), specificity(data), f1_score(data))
names(metrics) <- c("Accuracy", "Classification Error Rate", "Precision", "Sensitivity", "Specificity", "F1 Score")
data.frame(Metrics = metrics)
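
Collected in one place, the report should show Accuracy = 0.8066, Classification Error Rate = 0.1934, Precision = 0.8438, Sensitivity = 0.4737, Specificity = 0.9597, and F1 Score ≈ 0.6068, matching the values computed section by section above.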

12. Caret investigation

Investigate the caret package. In particular, consider the functions confusionMatrix, sensitivity, and specificity. Apply the functions to the data set. How do the results compare with your own functions?

# Investigating caret
library(caret)

b <- data %>%
  select(scored.class, class) %>%
  mutate(scored.class = as.factor(scored.class), 
         class = as.factor(class))

# Treat class 1 as the positive class, as in our own functions
cm <- confusionMatrix(b$scored.class, b$class, positive = "1")

caret_package <- c(cm$overall["Accuracy"], cm$byClass["Sensitivity"], cm$byClass["Specificity"])
written_function <- c(accuracy(data), sensitivity(data), specificity(data))
d <- cbind(caret_package, written_function)
as.data.frame(d)

Up to rounding, the results from the caret package (Accuracy 0.8066, Sensitivity 0.4737, Specificity 0.9597) match those from our own functions.

13. Investigate pROC

Investigate the pROC package. Use it to generate an ROC curve for the data set. How do the results compare with your own functions?

# Investigating pROC
library(pROC)
library(gridExtra)
library(ggplotify)

# Side by side: pROC's ROC curve with its AUC (left) and our own ROC curve (right)
grid.arrange(as.grob(~plot(roc(data$class, data$scored.probability), print.auc = TRUE)),
             roc_ours$plot, ncol = 2)

As shown on the plots above, the results from the pROC package are in agreement with our own functions.