DATA 621 Homework #2
From Work by Critical Thinking Group 3
# Read the classification output data and identify the columns of interest
df <- read.csv("./data/classification-output-data.csv")
actual <- "class"
predicted <- "scored.class"
probability <- "scored.probability"
# Load our functions
source("hw2.R")
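For context, here is a quick look at the three columns used throughout the report (a minimal sketch, assuming the CSV contains columns with the names above):
head(df[c(actual, predicted, probability)])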
Caret’s Output
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 119 30
1 5 27
Accuracy : 0.8066
95% CI : (0.7415, 0.8615)
No Information Rate : 0.6851
P-Value [Acc > NIR] : 0.0001712
Kappa : 0.4916
Mcnemar's Test P-Value : 4.976e-05
Sensitivity : 0.9597
Specificity : 0.4737
Pos Pred Value : 0.7987
Neg Pred Value : 0.8438
Prevalence : 0.6851
Detection Rate : 0.6575
Detection Prevalence : 0.8232
Balanced Accuracy : 0.7167
'Positive' Class : 0
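For reference, output like the above comes from caret's confusionMatrix function; a minimal sketch, assuming the caret package is installed (with the default settings the first factor level, 0, is treated as the positive class, which matches the output shown):
library(caret)
confusionMatrix(data = factor(df[[predicted]]), reference = factor(df[[actual]]))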
Our Function’s Output
Confusion Matrix
| Prediction \ Actual | 0 | 1 |
|---|---|---|
| 0 | 119 | 30 |
| 1 | 5 | 27 |
Accuracy: 0.8066298
Classification Error Rate: 0.1933702
Precision: 0.7986577
Sensitivity (Recall): 0.9596774
Specificity: 0.4736842
F1 Score: 0.8717949
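These values come from our own functions, defined in the Source Code section below, and can be reproduced with calls such as:
confusion_matrix(df, actual, predicted)
accuracy(df, actual, predicted)
classification_error_rate(df, actual, predicted)
precision(df, actual, predicted)
sensitivity(df, actual, predicted)
specificity(df, actual, predicted)
f1_score(df, actual, predicted)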
The F1 Score Will Always Be Between 0 and 1
Before we move on, let’s consider a question that was asked: what are the bounds on the F1 score? Show that the F1 score will always be between 0 and 1. (Hint: if \(0 < a < 1\) and \(0 < b < 1\), then \(ab < a\).)
Proof that the F1 score lies between 0 and 1
Let’s expand the F1 equation using the definitions of precision and sensitivity.
\[\textrm{F1 Score}= \frac{2 \times\mathrm{Precision}\times\mathrm{Sensitivity}}{\mathrm{Precision}+\mathrm{Sensitivity}}\] The numerator is equal to \[2 \times \mathrm{Precision} \times \mathrm{Sensitivity} = \frac{2 \times TP \times TP}{(TP + FP)(TP+FN)}\] and dividing by the denominator is the same as multiplying by its reciprocal, \[\frac{1}{\frac{TP}{TP+FP}+\frac{TP}{TP+FN}}\]
Combining the two factors: \[F1=\frac{2 \times TP \times TP}{(TP + FP)(TP+FN)} \times \frac{1}{\frac{TP}{TP+FP}+\frac{TP}{TP+FN}}\]
Multiply the numerator and denominator of the second factor by \((TP + FP)(TP+FN)\):
\[\frac{1}{\frac{TP}{TP+FP}+\frac{TP}{TP+FN}} = \frac{(TP+FP)(TP+FN)}{TP(TP+FN)+TP(TP+FP)}\]
\[F1=\frac{2(TP)^2}{(TP+FP)(TP+FN)}\times \frac{(TP+FP)(TP+FN)}{TP(TP+FN)+TP(TP+FP)}\]
Cancel the denominator of the first term with the numerator of the second term:
\[F1 =\frac{2(TP)^2}{TP(TP+FN)+TP(TP+FP)}\] Simplify the denominator: \[F1 =\frac{2(TP)^2}{TP^2 + FN \cdot TP + TP^2 + TP \cdot FP}\]
\[F1 =\frac{2(TP)^2}{TP(TP+FN+TP+FP)}\]
\[F1 =\frac{2(TP)^2}{TP(2TP +FN +FP)}\] Cancel a factor of \(TP\) to obtain the final form: \[F1=\frac{2TP}{2TP+FN+FP}\]
In the equation above, \(TP, FN, FP \in \mathbb{N}\), where \(\mathbb{N} = \{0, 1, 2, 3, \ldots\}\), so every term is non-negative (we assume at least one count is non-zero so that the fraction is defined).
Since \(2TP \le 2TP + FN + FP\), the numerator is at most the denominator, so \(F1 \le 1\). The maximum value of \(1\) is attained when \(FN = FP = 0\) and \(TP > 0\).
Because both the numerator and the denominator are non-negative, \(F1 \ge 0\); the minimum value of \(0\) occurs when the numerator is minimized, i.e. when \(TP = 0\).
We have shown that \(0 \le F1 \le 1\).
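As a quick numerical sanity check (a sketch; the grid of counts below is purely illustrative), the simplified form agrees with the precision/sensitivity form whenever \(TP > 0\) and always stays within \([0, 1]\):
grid <- expand.grid(TP = 1:20, FP = 0:20, FN = 0:20)
p <- grid$TP / (grid$TP + grid$FP)  # precision
s <- grid$TP / (grid$TP + grid$FN)  # sensitivity (recall)
f1_rates <- (2 * p * s) / (p + s)
f1_simplified <- (2 * grid$TP) / (2 * grid$TP + grid$FN + grid$FP)
all.equal(f1_rates, f1_simplified)  # TRUE
range(f1_simplified)                # stays within [0, 1]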
pROC Output
par(pty = "s")
roc(df[[actual]], df[[probability]], plot = TRUE, legacy.axes = TRUE, print.auc = TRUE)
Call:
roc.default(response = df[[actual]], predictor = df[[probability]], plot = TRUE, legacy.axes = TRUE, print.auc = TRUE)
Data: df[[probability]] in 124 controls (df[[actual]] 0) < 57 cases (df[[actual]] 1).
Area under the curve: 0.8503
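The AUC can also be pulled from the pROC object directly (a brief sketch; roc_obj is simply a name for the stored result of the roc() call above):
roc_obj <- roc(df[[actual]], df[[probability]])
as.numeric(auc(roc_obj))  # 0.8503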
Our ROC Curve
The named vector below is the auc_vector returned by our roc_curve function: the contribution to the total area under the curve at each threshold from 0 to 1 in steps of 0.01. An entry is zero where the false positive rate does not change from the previous threshold (or where the true positive rate is still zero), and the entries sum to our total AUC estimate.
0 0.01 0.02 0.03 0.04 0.05
0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
0.06 0.07 0.08 0.09 0.1 0.11
0.000000000 0.000000000 0.000000000 0.000000000 0.004810413 0.005942275
0.12 0.13 0.14 0.15 0.16 0.17
0.006366723 0.000000000 0.000000000 0.000000000 0.000000000 0.028013582
0.18 0.19 0.2 0.21 0.22 0.23
0.009620826 0.010186757 0.000000000 0.010752688 0.000000000 0.000000000
0.24 0.25 0.26 0.27 0.28 0.29
0.011460102 0.011601585 0.011884550 0.012026033 0.012308998 0.000000000
0.3 0.31 0.32 0.33 0.34 0.35
0.000000000 0.000000000 0.014006791 0.014148274 0.014148274 0.000000000
0.36 0.37 0.38 0.39 0.4 0.41
0.000000000 0.000000000 0.030843237 0.000000000 0.015421619 0.031126203
0.42 0.43 0.44 0.45 0.46 0.47
0.000000000 0.015846067 0.015987550 0.000000000 0.049235993 0.016411998
0.48 0.49 0.5 0.51 0.52 0.53
0.016411998 0.016553480 0.000000000 0.000000000 0.000000000 0.000000000
0.54 0.55 0.56 0.57 0.58 0.59
0.017119411 0.017119411 0.000000000 0.000000000 0.000000000 0.000000000
0.6 0.61 0.62 0.63 0.64 0.65
0.017260894 0.000000000 0.051782683 0.000000000 0.034804754 0.000000000
0.66 0.67 0.68 0.69 0.7 0.71
0.000000000 0.017402377 0.017402377 0.034804754 0.017402377 0.017402377
0.72 0.73 0.74 0.75 0.76 0.77
0.017402377 0.017402377 0.000000000 0.000000000 0.000000000 0.017402377
0.78 0.79 0.8 0.81 0.82 0.83
0.000000000 0.017402377 0.000000000 0.000000000 0.017402377 0.000000000
0.84 0.85 0.86 0.87 0.88 0.89
0.017402377 0.017402377 0.017402377 0.017402377 0.000000000 0.052207131
0.9 0.91 0.92 0.93 0.94 0.95
0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.017543860
0.96 0.97 0.98 0.99 1
0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
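The output above can be reproduced with our roc_curve function, defined in the Source Code section that follows:
results <- roc_curve(df, actual, probability)
results$plot        # ggplot ROC curve with the AUC annotated
results$auc         # total area under the curve
results$auc_vector  # the per-threshold contributions printed above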
Source Code
library(dplyr)
library(ggplot2)
confusion_matrix <- function(df, actual, predicted){
  cm <- table(df[[predicted]], df[[actual]])
  if (nrow(cm) == ncol(cm)){
    return(cm)
  }
  # Pad the table so that every class appears as a row, even when it is
  # never predicted at a given threshold
  fixed_cm <- matrix(0, ncol(cm), ncol(cm))
  colnames(fixed_cm) <- colnames(cm)
  rownames(fixed_cm) <- colnames(cm)
  # Horribly inefficient at large scale but since we're dealing with a
  # confusion matrix...
  for (r in rownames(cm)){
    for (c in colnames(cm)){
      fixed_cm[r, c] <- cm[r, c]
    }
  }
  return(fixed_cm)
}
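# Example usage (illustrative): confusion_matrix(df, "class", "scored.class")
# returns a 2x2 table with predicted classes on the rows and actual classes on
# the columns, padded with a zero row when a class is never predicted.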
confusion_matrix_outcomes <- function(cm){
  # This will obviously need some work to handle more than a 2x2 matrix.
  # With table(predicted, actual) and column-major indexing, cm[1] is the
  # (predicted = 0, actual = 0) cell, so class 0 is treated as the positive
  # class, consistent with caret's "'Positive' Class : 0" output above.
  TP <- cm[1]
  FN <- cm[2]
  FP <- cm[3]
  TN <- cm[4]
  list("TN" = TN, "FP" = FP, "FN" = FN, "TP" = TP)
}
# Write a function that takes the data set as a dataframe, with actual and
# predicted classifications identified, and returns the accuracy of the
# predictions.
accuracy <- function(df, actual, predicted){
  cm <- confusion_matrix_outcomes(confusion_matrix(df, actual, predicted))
  (cm$TP + cm$TN) / (cm$TP + cm$FP + cm$TN + cm$FN)
}
# Write a function that takes the data set as a dataframe, with actual and
# predicted classifications identified, and returns the classification error
# rate of the predictions.
classification_error_rate <- function(df, actual, predicted){
  cm <- confusion_matrix_outcomes(confusion_matrix(df, actual, predicted))
  (cm$FP + cm$FN) / (cm$TP + cm$FP + cm$TN + cm$FN)
}
# Write a function that takes the data set as a dataframe, with actual and
# predicted classifications identified, and returns the precision of the
# predictions.
precision <- function(df, actual, predicted){
  cm <- confusion_matrix_outcomes(confusion_matrix(df, actual, predicted))
  cm$TP / (cm$TP + cm$FP)
}
# Write a function that takes the data set as a dataframe, with actual and
# predicted classifications identified, and returns the sensitivity of the
# predictions. Sensitivity is also known as recall.
sensitivity <- function(df, actual, predicted){
  cm <- confusion_matrix_outcomes(confusion_matrix(df, actual, predicted))
  cm$TP / (cm$TP + cm$FN)
}
recall <- function(df, actual, predicted){
  sensitivity(df, actual, predicted)
}
# Write a function that takes the data set as a dataframe, with actual and
# predicted classifications identified, and returns the specificity of the
# predictions.
specificity <- function(df, actual, predicted){
  cm <- confusion_matrix_outcomes(confusion_matrix(df, actual, predicted))
  cm$TN / (cm$TN + cm$FP)
}
# Write a function that takes the data set as a dataframe, with actual and
# predicted classifications identified, and returns the F1 score of the
# predictions.
f1_score <- function(df, actual, predicted){
  p <- precision(df, actual, predicted)
  s <- sensitivity(df, actual, predicted)
  (2 * p * s) / (p + s)
}
# Write a function that generates an ROC curve from a data set with a true
# classification column (class in our example) and a probability column
# (scored.probability in our example). Your function should return a list
# that includes the plot of the ROC curve and a vector that contains the
# calculated area under the curve (AUC). Note that I recommend using a
# sequence of thresholds ranging from 0 to 1 at 0.01 intervals.
roc_curve <- function(df, actual, probability, interval = 0.01){
  outcome <- data.frame(matrix(ncol = 3, nrow = 0))
  names(outcome) <- c("probability", "TPR", "FPR")
  for (threshold in seq(0, 1, interval)){
    df$roc_prediction <- ifelse(df[[probability]] > threshold, 1, 0)
    tpr <- sensitivity(df, actual, "roc_prediction")
    fpr <- 1 - specificity(df, actual, "roc_prediction")
    row <- data.frame(probability = threshold, TPR = tpr, FPR = fpr)
    outcome <- rbind(outcome, row)
  }
  # Approximate the area under the curve with one rectangle per threshold step
  outcome$area <- (outcome$FPR - dplyr::lag(outcome$FPR)) * outcome$TPR
  # Fill in missing values with zeros
  outcome <- outcome %>%
    mutate(area = ifelse(is.na(area), 0, area))
  # Create a vector of the area contributions, named by threshold
  auc_vector <- outcome$area
  names(auc_vector) <- outcome$probability
  # Sum it to get total area under the curve
  auc <- sum(auc_vector)
  # Create a ROC plot
  roc_plot <- ggplot(outcome) +
    geom_line(aes(FPR, TPR), color = "dodgerblue") +
    xlab("False Positive Rate (FPR)") +
    ylab("True Positive Rate (TPR)") +
    theme_minimal() +
    annotate(geom = "text", x = 0.7, y = 0.07,
             label = paste("AUC:", round(auc, 3))) +
    geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
    coord_equal(ratio = 1)
  # Return the plot, the total AUC, and the per-threshold contributions
  return(list(plot = roc_plot, auc = auc, auc_vector = auc_vector))
}