1. Download the classification output data set (attached in Blackboard to the assignment).

2. The data set has three key columns we will use:

* class: the actual class for the observation
* scored.class: the predicted class for the observation (based on a threshold of 0.5)
* scored.probability: the predicted probability of success for the observation

Use the table() function to get the raw confusion matrix for this scored dataset. Make sure you understand the output. In particular, do the rows represent the actual or predicted class? The columns?

# select the predicted and actual class columns (scored.class first, so it forms the table rows)
class2 <- dplyr::select(class_output, scored.class, class)
# raw confusion matrix
table(class2)
##             class
## scored.class   0   1
##            0 119  30
##            1   5  27

Because scored.class was selected first, the rows represent the predicted class (scored.class) and the columns represent the actual class (class). The diagonal holds the correct predictions (119 true negatives and 27 true positives), while the off-diagonal cells hold the 30 false negatives and 5 false positives.
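
For readability, the same confusion matrix can be rebuilt with explicit dimension names; this is a minimal sketch using the same two columns of class_output.

# sketch: the same confusion matrix with labeled dimensions
table(Predicted = class_output$scored.class,
      Actual    = class_output$class)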

3. Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the accuracy of the predictions. Accuracy = \(\frac{TP + TN}{TP + FP + TN + FN}\)

class_accuracy <- function(df){
  TP <- sum(df$class == 1 & df$scored.class == 1)  # true positives
  TN <- sum(df$class == 0 & df$scored.class == 0)  # true negatives
  (TP + TN)/nrow(df)  # nrow(df) = TP + FP + TN + FN
}
accuracy_variable <- class_accuracy(class2)

4. Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the classification error rate of the predictions. Classification Error Rate = \(\frac{FP + FN}{TP + FP + TN + FN}\)
Verify that you get an accuracy and an error rate that sum to one (a quick check follows the code below).

class_error <- function(df) {
  FP <- sum(df$class == 0 & df$scored.class == 1) 
  FN <- sum(df$class == 1 & df$scored.class == 0)
  (FP+FN)/nrow(df)
}
error_variable <- class_error(class2)
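
Since every observation is either classified correctly or incorrectly, the accuracy and the classification error rate partition the data and should sum to one. A quick sketch of the check, using the two variables computed above:

# accuracy and error rate should sum to exactly one
accuracy_variable + error_variable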

5. Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the precision of the predictions. Precision = \(\frac{TP}{TP + FP}\)

class_precision <- function(df){
  TP <- sum(df$class == 1 & df$scored.class == 1)  
  FP <- sum(df$class == 0 & df$scored.class == 1)
  TP/(TP + FP)
}

precision_variable <- class_precision(class2)

6. Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the sensitivity of the predictions. Sensitivity is also known as recall. Sensitivity = \(\frac{TP}{TP + FN}\)

class_sensitivity <- function(df){
  TP <- sum(df$class == 1 & df$scored.class == 1)  
  FN <- sum(df$class == 1 & df$scored.class == 0)
  TP/(TP + FN)
}
sensitivity_variable <- class_sensitivity(class2)

7. Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the specificity of the predictions. Specificity = \(\frac{TN}{TN + FP}\)

class_specificity <- function(df){
  TN <- sum(df$class == 0 & df$scored.class == 0)  
  FP <- sum(df$class == 0 & df$scored.class == 1)
  TN/(TN + FP)
}
specificity_variable <- class_specificity(class2)
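
As a quick sanity check, sensitivity and specificity can also be recomputed by hand from the raw confusion matrix counts in question 2; this sketch uses those counts directly.

# sketch: recompute sensitivity and specificity from the raw counts
c(sensitivity = 27 / (27 + 30),   # TP / (TP + FN)
  specificity = 119 / (119 + 5))  # TN / (TN + FP)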

8. Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the F1 score of the predictions. \(F1\ Score = \frac{2 \cdot Precision \cdot Sensitivity}{Precision + Sensitivity}\)

f1_score <- (2*precision_variable*sensitivity_variable) / (precision_variable+sensitivity_variable)
f1_score
## [1] 0.6067416
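
Since the prompt asks for a function, the same computation can be wrapped in one that reuses class_precision and class_sensitivity; this is a minimal sketch (the name class_f1 is my own).

# sketch: F1 score as a function built on the precision and sensitivity helpers above
class_f1 <- function(df){
  p <- class_precision(df)
  s <- class_sensitivity(df)
  (2 * p * s)/(p + s)
}
class_f1(class2)  # returns the same value as f1_score above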

9. Before we move on, let’s consider a question that was asked: What are the bounds on the F1 score? Show that the F1 score will always be between 0 and 1. (Hint: if \(0 < a < 1\) and \(0 < b < 1\), then \(ab<a\))

Since precision and sensitivity are both proportions, each lies between 0 and 1, so the F1 score is a ratio of non-negative quantities and cannot be negative. Using the hint, if \(0 < a < 1\) and \(0 < b < 1\), then \(ab < a\) and likewise \(ab < b\); adding these two inequalities gives \(2ab < a + b\), so \(F1\ Score = \frac{2 \cdot Precision \cdot Sensitivity}{Precision + Sensitivity} < 1\). The limiting cases agree: as both precision and sensitivity approach 1, \(F1\ Score \rightarrow \frac{2 \cdot 1 \cdot 1}{1 + 1} = 1\), and as both approach 0, the F1 score also approaches 0. So the bounds on the F1 score are 0 and 1.
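
A quick numerical check supports this; the sketch below evaluates the formula on a grid of precision and sensitivity values in (0, 1].

# sketch: evaluate F1 on a grid of precision/sensitivity values between 0 and 1
grid <- expand.grid(precision   = seq(0.01, 1, by = 0.01),
                    sensitivity = seq(0.01, 1, by = 0.01))
f1 <- with(grid, (2 * precision * sensitivity)/(precision + sensitivity))
range(f1)  # stays within (0, 1]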

10. Write a function that generates an ROC curve from a data set with a true classification column (class in our example) and a probability column (scored.probability in our example). Your function should return a list that includes the plot of the ROC curve and a vector that contains the calculated area under the curve (AUC). Note that I recommend using a sequence of thresholds ranging from 0 to 1 at 0.01 intervals.

roc_fx <- function(x, y){
  # sort the actual classes by decreasing predicted probability
  x <- x[order(y, decreasing = TRUE)]
  TP <- cumsum(x)/sum(x)    # true positive rate at each cutoff
  FP <- cumsum(!x)/sum(!x)  # false positive rate at each cutoff

  df <- data.frame(FP, TP)  # FP first so it lands on the x-axis when plotted
  diffFP <- c(diff(FP), 0)
  diffTP <- c(diff(TP), 0)
  # trapezoidal approximation of the area under the curve
  auc <- sum(TP * diffFP) + sum(diffTP * diffFP)/2

  return(list(df = df, auc = auc))
}

roc <- roc_fx(class_output$class, class_output$scored.probability)
plot(roc$df,
     type = "l",
     col = "blue",
     lwd = 2,
     xlab = "False Positive Rate",
     ylab = "True Positive Rate",
     main = "ROC Curve")
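
The function above takes a rank-based shortcut instead of sweeping explicit thresholds. For reference, here is a minimal sketch of the threshold-based approach suggested in the prompt (the name roc_thresholds and its arguments are my own; it assumes a 0/1 class column and probabilities in [0, 1]).

# sketch (assumed helper, not part of the solution above): sweep thresholds from 0 to 1
roc_thresholds <- function(actual, prob, thresholds = seq(0, 1, by = 0.01)){
  rates <- t(sapply(thresholds, function(t){
    pred <- as.integer(prob >= t)
    c(FPR = sum(actual == 0 & pred == 1)/sum(actual == 0),
      TPR = sum(actual == 1 & pred == 1)/sum(actual == 1))
  }))
  rates <- rates[order(rates[, "FPR"], rates[, "TPR"]), ]
  # trapezoidal rule over the ordered (FPR, TPR) points
  auc <- sum(diff(rates[, "FPR"]) *
             (head(rates[, "TPR"], -1) + tail(rates[, "TPR"], -1))/2)
  list(df = as.data.frame(rates), auc = auc)
}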

11. Use your created R functions and the provided classification output data set to produce all of the classification metrics discussed above.

class_metrics <- c(accuracy_variable, error_variable, f1_score, precision_variable, sensitivity_variable, specificity_variable)
names(class_metrics) <- c("Accuracy", "Error", "F1", "Precision", "Sensitivity", "Specificity")
kable(class_metrics, col.names = "Value")

                  Value
------------  ---------
Accuracy      0.8066298
Error         0.1933702
F1            0.6067416
Precision     0.8437500
Sensitivity   0.4736842
Specificity   0.9596774

12. Investigate the caret package. In particular, consider the functions confusionMatrix, sensitivity, and specificity. Apply the functions to the data set. How do the results compare with your own functions?

class_output$scored.class <- as.factor(class_output$scored.class)
class_output$class <- as.factor(class_output$class)

# library(caret)  # caret could not be installed on this machine, so the calls below are left commented out
# caret treats the first factor level ("0") as the positive class by default, so the
# positive/negative class is specified explicitly to match the hand-rolled functions above
#confusionMatrix(class_output$scored.class, class_output$class, positive = "1", mode = 'everything')
#sensitivity(class_output$scored.class, class_output$class, positive = "1")
#specificity(class_output$scored.class, class_output$class, negative = "0")

The calls above are commented out only because the caret package requires a newer version of R than this machine can run. They are the correct functions to apply, and I would expect the results from caret to be very similar to my own calculations.

13. Investigate the pROC package. Use it to generate an ROC curve for the data set. How do the results compare with your own functions?

plot(roc(class_output$class, class_output$scored.probability),
     col = "blue", print.thres = seq(0.1, 0.9, by = 0.1))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

auc(roc(class_output$class, class_output$scored.probability))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.8503

The pROC curve is very similar to the one produced by my own function, aside from formatting and color.
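
As a final check, the two AUC estimates can be compared directly; this is a sketch that assumes the roc object returned by roc_fx in question 10 is still in the workspace.

# sketch: compare the hand-rolled AUC with pROC's estimate
roc$auc
as.numeric(auc(class_output$class, class_output$scored.probability))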