Critical Thinking Group #1 worked independently before meeting together, sharing our work and talking through the points where we each struggled. While we leveraged collaboration and group learning, we ultimately felt it made the most sense to submit independent work.
Download the classification output data set (attached in Blackboard to the assignment).
library(dplyr)
library(ggplot2)
library(caret)
library(pROC)
x <- "C:\\Users\\bloom\\Downloads\\classification-output-data.csv"
class_output_data <- read.csv(x)
The data set has three key columns we will use: class (the actual class of the observation), scored.class (the predicted class), and scored.probability (the model's predicted probability that the observation belongs to the positive class).
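Before building the confusion matrix, a quick look at just those columns confirms that the file loaded as expected. A minimal sketch (assuming dplyr is attached, as above):
head(class_output_data %>% select(class, scored.class, scored.probability))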
Use the table() function to get the raw confusion matrix for this scored data set. Make sure you understand the output. In particular, do the rows represent the actual or predicted class? The columns?
#summary(class_output_data)
table(class_output_data$class, class_output_data$scored.class)
##
##       0   1
##   0 119   5
##   1  30  27
#summary(as.factor(class_output_data$class),as.factor(class_output_data$scored.class))
print("Actual Class")
## [1] "Actual Class"
summary(as.factor(class_output_data$class))
## 0 1
## 124 57
print("Scored Class")
## [1] "Scored Class"
summary(as.factor(class_output_data$scored.class))
## 0 1
## 149 32
As shown above, the columns represent the scored (predicted) class and the rows represent the actual class. We have 119 true negatives, 27 true positives, 5 false positives, and 30 false negatives.
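As a cross-check on how we read the table, the same four cells can be counted directly with dplyr. A short sketch (treating 1 as the positive class, as above):
class_output_data %>%
  count(class, scored.class) %>%
  mutate(cell = case_when(
    class == 0 & scored.class == 0 ~ "TN",
    class == 1 & scored.class == 1 ~ "TP",
    class == 0 & scored.class == 1 ~ "FP",
    TRUE                           ~ "FN"
  ))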
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the accuracy of the predictions.
\[\frac{TP + TN}{TP+FP+TN+FN}\]
Step3 <- function(DF, Actual, Pred){
  # Function takes 3 inputs: the data frame, the column name of the actual classes,
  # and the column name of the predicted classes.
  DF <- table(DF[,Actual], DF[,Pred])
  TN <- DF[1,1]
  TP <- DF[2,2]
  FN <- DF[2,1]
  FP <- DF[1,2]
  AC <- (TP + TN)/(TP + FP + TN + FN)
  print(paste("Accuracy =", round(AC, 3)))
}
Step3(class_output_data,"class","scored.class")
## [1] "Accuracy = 0.807"
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the classification error rate of the predictions.
\[\frac{FP + FN}{TP + FP + TN + FN}\]
Step4 <- function(DF, Actual, Pred){
  # Function takes 3 inputs: the data frame, the column name of the actual classes,
  # and the column name of the predicted classes.
  DF <- table(DF[,Actual], DF[,Pred])
  TN <- DF[1,1]
  TP <- DF[2,2]
  FN <- DF[2,1]
  FP <- DF[1,2]
  CER <- (FP + FN)/(TP + FP + TN + FN)
  print(paste("Classification Error Rate =", round(CER, 3)))
  AC <- (TP + TN)/(TP + FP + TN + FN)
  print(paste("Sum of Classification Error Rate and Accuracy =", AC + CER))
}
Step4(class_output_data,"class","scored.class")
## [1] "Classification Error Rate = 0.193"
## [1] "Sum of Classification Error Rate and Accuracy = 1"
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the precision of the predictions.
\[\frac{TP}{TP+FP}\]
Step5 <- function(DF, Actual, Pred){
  # Function takes 3 inputs: the data frame, the column name of the actual classes,
  # and the column name of the predicted classes.
  DF <- table(DF[,Actual], DF[,Pred])
  TN <- DF[1,1]
  TP <- DF[2,2]
  FN <- DF[2,1]
  FP <- DF[1,2]
  per <- TP/(TP + FP)
  print(paste("Precision =", round(per, 3)))
}
Step5(class_output_data,"class","scored.class")
## [1] "Precision = 0.844"
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the sensitivity of the predictions. Sensitivity is also known as recall.
\[\frac{TP}{TP+FN}\]
Step6 <- function(DF, Actual, Pred){
  # Function takes 3 inputs: the data frame, the column name of the actual classes,
  # and the column name of the predicted classes.
  DF <- table(DF[,Actual], DF[,Pred])
  TN <- DF[1,1]
  TP <- DF[2,2]
  FN <- DF[2,1]
  FP <- DF[1,2]
  sens <- TP/(TP + FN)
  print(paste("Sensitivity =", round(sens, 3)))
}
Step6(class_output_data,"class","scored.class")
## [1] "Sensitivity = 0.474"
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the specificity of the predictions.
\[\frac{TN}{TN+FP}\]
Step7 <- function(DF, Actual, Pred){
  # Function takes 3 inputs: the data frame, the column name of the actual classes,
  # and the column name of the predicted classes.
  DF <- table(DF[,Actual], DF[,Pred])
  TN <- DF[1,1]
  TP <- DF[2,2]
  FN <- DF[2,1]
  FP <- DF[1,2]
  spec <- TN/(TN + FP)
  print(paste("Specificity =", round(spec, 3)))
}
Step7(class_output_data,"class","scored.class")
## [1] "Specificity = 0.96"
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the F1 score of the predictions.
\[\frac{2 \times Precision \times Sensitivity}{Precision + Sensitivity}\]
Step8 <- function(DF, Actual, Pred){
  # Function takes 3 inputs: the data frame, the column name of the actual classes,
  # and the column name of the predicted classes.
  DF <- table(DF[,Actual], DF[,Pred])
  TN <- DF[1,1]
  TP <- DF[2,2]
  FN <- DF[2,1]
  FP <- DF[1,2]
  per <- TP/(TP + FP)
  sens <- TP/(TP + FN)
  F1 <- (2*per*sens)/(per + sens)
  print(paste("F1 Score =", round(F1, 3)))
}
Step8(class_output_data,"class","scored.class")
## [1] "F1 Score = 0.607"
Before we move on, let's consider a question that was asked: What are the bounds on the F1 score? Show that the F1 score will always be between 0 and 1. (Hint: if \(0 < a < 1\) and \(0 < b < 1\), then \(ab < a\).)
Since \(TP \le TP + FP\), \(TP \le TP + FN\), and all counts are \(\ge 0\), both Precision (\(p\)) and Sensitivity (\(s\)) lie between 0 and 1. Applying the hint, \(ps \le p\) and \(ps \le s\), so \(2ps \le p + s\) and therefore \(F1 = 2ps/(p + s) \le 1\); and because every term is non-negative, \(F1 \ge 0\). At the extremes: if \(p = s = 1\) (no false positives or false negatives), F1 equals 1; if \(p = s = 0\) (no true positives), F1 is undefined, but it approaches 0 as the two values approach 0. Setting \(p = s = x\) illustrates this: \(2x^2 / 2x = x\), which goes to 0 as \(x\) goes to 0.
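To back the argument up empirically, a small simulation (a sketch with an arbitrary seed and sample size) draws precision and sensitivity values strictly between 0 and 1 and confirms the resulting F1 scores stay in that range:
set.seed(123)                             # arbitrary seed for reproducibility
p  <- runif(10000, min = 1e-6, max = 1)   # simulated precision values, bounded away from 0
s  <- runif(10000, min = 1e-6, max = 1)   # simulated sensitivity values, bounded away from 0
f1 <- 2 * p * s / (p + s)
range(f1)                                 # all values fall strictly between 0 and 1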
Write a function that generates an ROC curve from a data set with a true classification column (class in our example) and a probability column (scored.probability in our example). Your function should return a list that includes the plot of the ROC curve and a vector that contains the calculated area under the curve (AUC). Note that I recommend using a sequence of thresholds ranging from 0 to 1 at 0.01 intervals.
step10 <- function(database){
  # Function takes a data frame as its argument; it must contain columns named
  # "class" (actual class) and "scored.probability" (predicted probability).
  Thresholds <- seq(0, 1, .01)
  FPR <- numeric(length(Thresholds))
  TPR <- numeric(length(Thresholds))
  for(i in seq_along(Thresholds)) {
    # Classify every observation at the current threshold
    database$Tpred <- ifelse(database$scored.probability > Thresholds[i], 1, 0)
    TN <- database %>% filter(Tpred == 0, class == 0) %>% count() %>% as.numeric()
    FP <- database %>% filter(Tpred == 1, class == 0) %>% count() %>% as.numeric()
    FN <- database %>% filter(Tpred == 0, class == 1) %>% count() %>% as.numeric()
    TP <- database %>% filter(Tpred == 1, class == 1) %>% count() %>% as.numeric()
    TPR[i] <- TP/(TP + FN)
    FPR[i] <- FP/(TN + FP)
  }
  ROC_DF <- data.frame(TPR = TPR, FPR = FPR)
  # Keep only the highest TPR at each distinct FPR so the curve has one point per FPR
  ROC_DF2 <- ROC_DF %>%
    group_by(FPR) %>%
    top_n(1, TPR) %>%
    distinct()
  step10Plot <- ROC_DF2 %>%
    ggplot(aes(x = FPR, y = TPR)) +
    geom_point() + geom_line()
  # Trapezoidal approximation of the area under the curve
  AUC <- 0
  for(i in 1:(nrow(ROC_DF2) - 1)){
    x <- ROC_DF2$FPR[i] - ROC_DF2$FPR[i+1]
    y <- (ROC_DF2$TPR[i] + ROC_DF2$TPR[i+1])/2
    AUC <- AUC + x*y
  }
  # print() keeps the AUC echoed to the console when the function is called
  return(list(step10Plot, print(paste("AUC=", AUC))))
}
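The same threshold sweep can be written without the explicit loops. The vectorized alternative below is only a sketch (it is not the function used to produce the output that follows), but it computes the same TPR/FPR pairs:
roc_points <- function(database, thresholds = seq(0, 1, .01)) {
  # For each threshold, classify all rows at once and compute TPR and FPR
  t(sapply(thresholds, function(th) {
    pred <- as.numeric(database$scored.probability > th)
    c(TPR = sum(pred == 1 & database$class == 1) / sum(database$class == 1),
      FPR = sum(pred == 1 & database$class == 0) / sum(database$class == 0))
  }))
}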
Use your created R functions and the provided classification output data set to produce all of the classification metrics discussed above.
Step3(class_output_data,"class","scored.class")
## [1] "Accuracy = 0.807"
Step4(class_output_data,"class","scored.class")
## [1] "Classification Error Rate = 0.193"
## [1] "Sum of Classification Error Rate and Accuracy = 1"
Step5(class_output_data,"class","scored.class")
## [1] "Precision = 0.844"
Step6(class_output_data,"class","scored.class")
## [1] "Sensitivity = 0.474"
Step7(class_output_data,"class","scored.class")
## [1] "Specificity = 0.96"
Step8(class_output_data,"class","scored.class")
## [1] "F1 Score = 0.607"
step10(class_output_data)
## [1] "AUC= 0.851230899830221"
## [[1]]
##
## [[2]]
## [1] "AUC= 0.851230899830221"
Investigate the caret package. In particular, consider the functions confusionMatrix, sensitivity, and specificity. Apply the functions to the data set. How do the results compare with your own functions?
confusionMatrix(as.factor(class_output_data$scored.class), as.factor(class_output_data$class ), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 119 30
## 1 5 27
##
## Accuracy : 0.8066
## 95% CI : (0.7415, 0.8615)
## No Information Rate : 0.6851
## P-Value [Acc > NIR] : 0.0001712
##
## Kappa : 0.4916
##
## Mcnemar's Test P-Value : 4.976e-05
##
## Sensitivity : 0.4737
## Specificity : 0.9597
## Pos Pred Value : 0.8438
## Neg Pred Value : 0.7987
## Prevalence : 0.3149
## Detection Rate : 0.1492
## Detection Prevalence : 0.1768
## Balanced Accuracy : 0.7167
##
## 'Positive' Class : 1
##
sensitivity(as.factor(class_output_data$scored.class), as.factor(class_output_data$class ), positive = "1")
## [1] 0.4736842
specificity(as.factor(class_output_data$scored.class), as.factor(class_output_data$class ), negative = "0")
## [1] 0.9596774
The caret confusion matrix is transposed relative to ours (its rows are the predicted class and its columns are the actual/reference class), but the resulting metrics are identical to those produced by our own functions.
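For a side-by-side comparison, the relevant entries can be pulled out of the confusionMatrix object's byClass vector. A sketch (the exact set of named statistics depends on the installed caret version):
cm_caret <- confusionMatrix(as.factor(class_output_data$scored.class),
                            as.factor(class_output_data$class), positive = "1")
cm_caret$byClass[c("Sensitivity", "Specificity", "Pos Pred Value")]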
roc1 <- roc(class_output_data$class, class_output_data$scored.probability )
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc1)
auc(roc1)
## Area under the curve: 0.8503
Our curve looks correct. The area under the curve differs slightly from pROC's (0.851 vs. 0.850) because our trapezoidal approximation over 0.01 threshold steps is less precise than the pROC calculation, but the approach is sound.
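As a final check on pROC's figure, the AUC can also be computed exactly from the predicted probabilities, since it equals the probability that a randomly chosen positive case scores higher than a randomly chosen negative case (with ties counted as one half). A short base R sketch:
pos <- class_output_data$scored.probability[class_output_data$class == 1]
neg <- class_output_data$scored.probability[class_output_data$class == 0]
mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))  # exact empirical AUC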