DATA 621 Homework #2
df <- read.csv("https://raw.githubusercontent.com/mikeasilva/data-621-group-projects/master/hw2/data/classification-output-data.csv")
actual <- "class"
predicted <- "scored.class"
probability <- "scored.probability"
# Load our functions
source("hw2.R")
Caret’s Output
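The confusionMatrix() call behind the summary below is not echoed in the report; it was presumably along these lines (a sketch, reusing the factor conversion shown later in the sensitivity check):

caret::confusionMatrix(data = as.factor(df$scored.class),
                       reference = as.factor(df$class),
                       positive = "0")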
Confusion Matrix and Statistics
          Reference
Prediction   0   1
         0 119  30
         1   5  27
Accuracy : 0.8066
95% CI : (0.7415, 0.8615)
No Information Rate : 0.6851
P-Value [Acc > NIR] : 0.0001712
Kappa : 0.4916
Mcnemar's Test P-Value : 4.976e-05
Sensitivity : 0.9597
Specificity : 0.4737
Pos Pred Value : 0.7987
Neg Pred Value : 0.8438
Prevalence : 0.6851
Detection Rate : 0.6575
Detection Prevalence : 0.8232
Balanced Accuracy : 0.7167
'Positive' Class : 0
Our Function’s Output
Confusion Matrix
| Prediction | Actual 0 | Actual 1 |
|------------|----------|----------|
| 0          | 119      | 30       |
| 1          | 5        | 27       |
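The table above is produced by the confusion_matrix() helper listed in the Source Code section (rows are predictions, columns are actual classes, matching caret's layout); the call, which is not echoed in the report, is presumably:

confusion_matrix(df, actual, predicted)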
Accuracy:
[1] 0.8066298
Classification Error Rate:
[1] 0.1933702
Verify that Accuracy + Classification Error Rate = 1:
[1] 1
Precision:
[1] 0.7986577
Sensitivity (Recall):
[1] 0.9596774
Caret’s Sensitivity: the caret package produces exactly the same sensitivity value.
df2 <- df
df2$scored.class <- as.factor(df2$scored.class)
df2$class <- as.factor(df2$class)
caret::sensitivity(df2$scored.class, df2$class)
[1] 0.9596774
Specificity:
[1] 0.4736842
Caret’s Specificity: the caret package produces exactly the same specificity value.
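(A sketch of the call, mirroring the sensitivity check above; the chunk itself is not echoed in the report.)

caret::specificity(df2$scored.class, df2$class)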
[1] 0.4736842
F1 Score:
[1] 0.8717949
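For reference, the metric values above come from the helper functions defined in the Source Code section below; the calls, not echoed in the rendered output, are presumably:

accuracy(df, actual, predicted)
classification_error_rate(df, actual, predicted)
precision(df, actual, predicted)
sensitivity(df, actual, predicted)
specificity(df, actual, predicted)
f1_score(df, actual, predicted)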
F1 Score Will Always be Between 0 and 1
Before we move on, let’s consider a question that was asked: What are the bounds on the F1 score? Show that the F1 score will always be between 0 and 1. (Hint: If 0<a<1 and 0<b<1 then ab<a).
Proof that the F1 score must be between 0 and 1
Let’s expand the F1 equation using the definitions of precision and sensitivity.
\[\textrm{F1 Score}= \frac{2 \times\mathrm{Precision}\times\mathrm{Sensitivity}}{\mathrm{Precision}+\mathrm{Sensitivity}}\] The numerator becomes \[2\times\mathrm{Precision}\times\mathrm{Sensitivity}=\frac{2 \times TP \times TP}{(TP + FP)(TP+FN)}\] and dividing by the denominator \(\mathrm{Precision}+\mathrm{Sensitivity}\) is the same as multiplying by its reciprocal \[\frac{1}{\frac{TP}{TP+FP}+\frac{TP}{TP+FN}}\]
Combining the two factors: \[F1=\frac{2 \times TP \times TP}{(TP + FP)(TP+FN)} \times \frac{1}{\frac{TP}{TP+FP}+\frac{TP}{TP+FN}}\]
Multiplying the numerator and denominator of the second factor by \((TP+FP)(TP+FN)\) clears the nested fractions:
\[\frac{1}{\frac{TP}{TP+FP}+\frac{TP}{TP+FN}}=\frac{(TP+FP)(TP+FN)}{TP(TP+FN)+TP(TP+FP)}\]
\[F1=\frac{2(TP)^2}{(TP+FP)(TP+FN)}\times \frac{(TP+FP)(TP+FN)}{TP(TP+FN)+TP(TP+FP)}\]
Cancel the denominator of the first term with the numerator of the second term:
\[F1 =\frac{2(TP)^2}{TP(TP+FN)+TP(TP+FP)}\] Expand the denominator: \[F1 =\frac{2(TP)^2}{TP^2+FN\cdot TP+TP^2+TP\cdot FP}\]
Factor \(TP\) out of the denominator: \[F1 =\frac{2(TP)^2}{TP(TP+FN+TP+FP)}\]
\[F1 =\frac{2(TP)^2}{TP(2TP +FN +FP)}\] Cancel a common factor of \(TP\) (this step requires \(TP>0\)) to reach the final form \[F1=\frac{2TP}{2TP+FN+FP}\]
In the equation above, \(TP,FN,FP \in \mathbb{N}\) where \(\mathbb{N} = \{0,1,2,3...\}\)
Since \(2TP \le 2TP+FN+FP\), the numerator is never larger than the denominator, so \(F1\le1\).
The maximum value of \(1\) is attained when \(FN=FP=0\) and \(TP > 0\).
The fraction \(\frac{2TP}{2TP+FN+FP}\) has a minimum value of zero, attained when the numerator is zero (i.e. \(TP=0\)) while \(FN+FP>0\).
We have shown that \(0\le F1\le1\)
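As a quick numeric sanity check of the bound, here is a small helper (a hypothetical sketch, not part of hw2.R) that computes F1 directly from counts:

# Hypothetical helper for checking the F1 bound; not part of hw2.R
f1_from_counts <- function(TP, FP, FN){
  p <- TP / (TP + FP)   # precision
  s <- TP / (TP + FN)   # sensitivity (recall)
  (2 * p * s) / (p + s)
}
f1_from_counts(119, 30, 5)    # counts from our confusion matrix: ~0.872
f1_from_counts(1, 1000, 1000) # very poor classifier: tiny, but still in (0, 1]
f1_from_counts(0, 5, 5)       # TP = 0 gives 0/0 = NaN, which is why the cancellation step needs TP > 0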
ROC Curves
The output from the pROC package and the output from the ROC function our team created are very similar.
pROC Output
par(pty = "s")
roc(df[[actual]], df[[probability]], plot = TRUE, legacy.axes = TRUE, print.auc = TRUE)
Call:
roc.default(response = df[[actual]], predictor = df[[probability]], plot = TRUE, legacy.axes = TRUE, print.auc = TRUE)
Data: df[[probability]] in 124 controls (df[[actual]] 0) < 57 cases (df[[actual]] 1).
Area under the curve: 0.8503
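For comparison, our own roc_curve() function (defined in the Source Code section below) returns the plot, the AUC, and the underlying TPR/FPR table; a sketch of the call used to check it against pROC:

our_roc <- roc_curve(df, actual, probability)
our_roc$plot  # ggplot ROC curve with the AUC annotated in the corner
our_roc$auc   # Riemann-sum AUC (the report notes it is very similar to pROC's)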
Source Code
library(dplyr)
library(ggplot2)
confusion_matrix <- function(df, actual, predicted){
  cm <- table(df[[predicted]], df[[actual]])
  if(nrow(cm) == ncol(cm)){
    return(cm)
  }
  # Make sure any class missing from the predictions still appears in the
  # confusion matrix as a row of zeros
  fixed_cm <- matrix(0, ncol(cm), ncol(cm))
  colnames(fixed_cm) <- colnames(cm)
  rownames(fixed_cm) <- colnames(cm)
  # Horribly inefficient at large scale, but since we're dealing with a confusion matrix...
  for(r in rownames(cm)){
    for(c in colnames(cm)){
      fixed_cm[r, c] <- cm[r, c]
    }
  }
  return(fixed_cm)
}
confusion_matrix_outcomes <- function(cm){
  # This will obviously need some work to handle more than a 2x2 matrix.
  # Cells are read in column-major order from table(predicted, actual),
  # with the first level ("0" here) treated as the positive class.
  TP <- cm[1]
  FN <- cm[2]
  FP <- cm[3]
  TN <- cm[4]
  list("TN" = TN, "FP" = FP, "FN" = FN, "TP" = TP)
}
# Write a function that takes the data set as a dataframe, with actual and
# predicted classifications identified, and returns the accuracy of the
# predictions.
accuracy <- function(df, actual, predicted){
  cm <- confusion_matrix_outcomes(confusion_matrix(df, actual, predicted))
  (cm$TP + cm$TN) / (cm$TP + cm$FP + cm$TN + cm$FN)
}
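# Sanity check (an editorial sketch, not part of the original hw2.R):
# accuracy() should equal the share of rows where the predicted class
# matches the actual class, e.g.
# accuracy(df, "class", "scored.class") == mean(df$class == df$scored.class)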
# Write a function that takes the data set as a dataframe, with actual and
# predicted classifications identified, and returns the classification error
# rate of the predictions.
# Verify that you get an accuracy and an error rate that sums to one.
classification_error_rate <- function(df, actual, predicted){
  cm <- confusion_matrix_outcomes(confusion_matrix(df, actual, predicted))
  (cm$FP + cm$FN) / (cm$TP + cm$FP + cm$TN + cm$FN)
}
# Write a function that takes the data set as a dataframe, with actual and
# predicted classifications identified, and returns the precision of the
# predictions.
precision <- function(df, actual, predicted){
  cm <- confusion_matrix_outcomes(confusion_matrix(df, actual, predicted))
  cm$TP / (cm$TP + cm$FP)
}
# Write a function that takes the data set as a dataframe, with actual and
# predicted classifications identified, and returns the sensitivity of the
# predictions. Sensitivity is also known as recall.
sensitivity <- function(df, actual, predicted){
  cm <- confusion_matrix_outcomes(confusion_matrix(df, actual, predicted))
  cm$TP / (cm$TP + cm$FN)
}

recall <- function(df, actual, predicted){
  sensitivity(df, actual, predicted)
}
# Write a function that takes the data set as a dataframe, with actual and
# predicted classifications identified, and returns the specificity of the
# predictions.
specificity <- function(df, actual, predicted){
  cm <- confusion_matrix_outcomes(confusion_matrix(df, actual, predicted))
  cm$TN / (cm$TN + cm$FP)
}
# Write a function that takes the data set as a dataframe, with actual and
# predicted classifications identified, and returns the F1 score of the
# predictions.
f1_score <- function(df, actual, predicted){
  p <- precision(df, actual, predicted)
  s <- sensitivity(df, actual, predicted)
  (2 * p * s) / (p + s)
}
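# Note (editorial aside, not part of the original hw2.R): by the algebra in the
# F1 section above, this is equivalent to 2*TP / (2*TP + FN + FP) whenever TP > 0.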
# Write a function that generates an ROC curve from a data set with a true
# classification column (class in our example) and a probability column
# (scored.probability in our example). Your function should return a list
# that includes the plot of the ROC curve and a vector that contains the
# calculated area under the curve (AUC). Note that I recommend using a
# sequence of thresholds ranging from 0 to 1 at 0.01 intervals.
roc_curve <- function(df, actual, probability, interval = 0.01){
  outcome <- data.frame(matrix(ncol = 3, nrow = 0))
  names(outcome) <- c("prob", "TPR", "FPR")
  for (threshold in seq(0, 1, interval)){
    df$roc_prediction <- ifelse(df[[probability]] > threshold, 1, 0)
    s <- sensitivity(df, actual, "roc_prediction")
    f <- 1 - specificity(df, actual, "roc_prediction")
    row <- data.frame(prob = threshold, TPR = s, FPR = f)
    outcome <- rbind(outcome, row)
  }
  # Approximate the AUC with a Riemann sum over the FPR steps. FPR increases
  # with the threshold here (the positive class is 0, and predictions flip to
  # 0 as the threshold rises), so the step widths are non-negative.
  outcome$area <- (outcome$FPR - dplyr::lag(outcome$FPR)) * outcome$TPR
  auc <- sum(outcome$area, na.rm = TRUE)
  roc_plot <- ggplot(outcome) +
    geom_line(aes(FPR, TPR), color = "dodgerblue") +
    xlab("False Positive Rate (FPR)") +
    ylab("True Positive Rate (TPR)") +
    theme_minimal() +
    annotate(geom = "text", x = 0.7, y = 0.07, label = paste("AUC:", round(auc, 3))) +
    geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
    coord_equal(ratio = 1)
  return(list(plot = roc_plot, auc = auc, outcome = outcome))
}