In this homework assignment, you will work through various classification metrics. You will be asked to create functions in R to carry out the various calculations. You will also investigate some functions in packages that will let you obtain the equivalent results. Finally, you will create graphical output that also can be used to evaluate the output of classification models, such as binary logistic regression.
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(message = FALSE)
knitr::opts_chunk$set(warning = FALSE)
library(dplyr)
Download the classification output data set and load it
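A minimal sketch of the load step is shown below; the file name is a hypothetical placeholder, since the assignment only says to download the classification output data set.
# Hedged sketch of the load step; the file name below is a hypothetical placeholder
data <- read.csv("classification-output-data.csv")
head(data, 2)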
## pregnant glucose diastolic skinfold insulin bmi pedigree age class
## 1 7 124 70 33 215 25.5 0.161 37 0
## 2 2 122 76 27 200 35.9 0.483 26 0
## scored.class scored.probability
## 1 0 0.3284523
## 2 0 0.2731904
instances = dim(data)[1]
features = dim(data)[2]
cat('Dataset for this assignment is', instances, 'instances and', features, 'features')
## Dataset for this assignment is 181 instances and 11 features
Create a dataframe of the features needed for questions 1 - 8, and 11
cdata = data %>% select(c(class, scored.class))
head(cdata,2)
## class scored.class
## 1 0 0
## 2 0 0
Use the table() function to get the raw confusion matrix for the scored dataset.
library(data.table)
# Convert cdata to a data.table (used here only for a compact print of the rows)
ctable = setDT(cdata)
ctable
## class scored.class
## 1: 0 0
## 2: 0 0
## 3: 1 0
## 4: 0 0
## 5: 0 0
## ---
## 177: 0 0
## 178: 1 1
## 179: 1 1
## 180: 0 0
## 181: 0 0
# Build the confusion matrix: rows are the predicted class, columns are the actual class
cmatrix = table(ctable$scored.class, ctable$class)
cmatrix
##
## 0 1
## 0 119 30
## 1 5 27
TN = cmatrix[1,1]
TP = cmatrix[2,2]
FP = cmatrix[2,1]
FN = cmatrix[1,2]
cat('True negative(TN):',TN, 'True positive(TP):',TP, 'False negative(FN):',FN, 'False positive(FP):',FP)
## True negative(TN): 119 True positive(TP): 27 False negative(FN): 30 False positive(FP): 5
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the accuracy of the predictions.
faccuracy <- function(TP, FP, TN, FN){
  # Accuracy = (TP + TN) / (TP + FP + TN + FN)
  accuracy <- sum(TP, TN)/sum(TP, FP, TN, FN)
  return(accuracy)
}
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the classification error rate of the predictions.
fclass_error_rate <- function(TP, FP, TN, FN){
  # Classification error rate = (FP + FN) / (TP + FP + TN + FN)
  class_error_rate <- sum(FP, FN)/sum(TP, FP, TN, FN)
  return(class_error_rate)
}
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the precision of the predictions.
fprecision <- function(TP, FP){
  # Precision = TP / (TP + FP)
  precision <- TP/(TP + FP)
  return(precision)
}
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the sensitivity of the predictions. Sensitivity is also known as recall.
fsensitivity <- function(TP, FN){
  # Sensitivity (recall) = TP / (TP + FN)
  sensitivity <- TP/(TP + FN)
  return(sensitivity)
}
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the specificity of the predictions.
fspecificity <- function(FP, TN){
  # Specificity = TN / (TN + FP)
  specificity <- TN/(TN + FP)
  return(specificity)
}
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the F1 score of the predictions.
fF1_score <- function(precision, sensitivity){
  # F1 score = 2 * Precision * Sensitivity / (Precision + Sensitivity)
  F1_score <- (2 * precision * sensitivity)/(precision + sensitivity)
  return(F1_score)
}
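Each prompt above asks for a function that accepts the data set as a dataframe with the actual and predicted classifications identified. The count-based helpers can be wrapped to satisfy that interface; the sketch below is a hedged example (fmetrics_from_df is an illustrative name, and the default column names are assumed from this data set).
# Hedged sketch: a dataframe-based wrapper around the count-based functions above.
# fmetrics_from_df is an illustrative name; default column names are assumed from this data set.
fmetrics_from_df <- function(df, actual = "class", predicted = "scored.class") {
  cm <- table(df[[predicted]], df[[actual]])   # rows = predicted, columns = actual
  TN <- cm["0", "0"]; TP <- cm["1", "1"]
  FP <- cm["1", "0"]; FN <- cm["0", "1"]
  list(accuracy    = faccuracy(TP, FP, TN, FN),
       error_rate  = fclass_error_rate(TP, FP, TN, FN),
       precision   = fprecision(TP, FP),
       sensitivity = fsensitivity(TP, FN),
       specificity = fspecificity(FP, TN),
       F1          = fF1_score(fprecision(TP, FP), fsensitivity(TP, FN)))
}
# Example usage: fmetrics_from_df(cdata)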
What are the bounds on the F1 score? Show that the F1 score will always be between 0 and 1. (Hint: If \(0 < a < 1\) and \(0 < b < 1\) then \(ab < a\).)
According to David M W Powers in Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness & Correlation, “Sensitivity is the proportion of Real Positive cases that are correctly Predicted Positive”. Additionally, he states: “Precision denotes the proportion of Predicted Positive cases that are correctly Real Positives.” Because proportions inherently range from zero to one, we can take zero and one to be the bounds of Sensitivity and Precision.
To calculate the F1 score we insert Sensitivity and Precision into the following formula:
\(F1 = \frac{2 \times Precision \times Sensitivity}{Precision + Sensitivity}\)
Given that if \(0 < a < 1\) and \(0 < b < 1\) then \(ab < a\), the numerator (before multiplying by two) in the above formula cannot exceed the denominator. To show that the numerator still cannot exceed the denominator after multiplying by two, we conduct two simulations (an algebraic note follows the plots below).
First, since Sensitivity and Precision range from zero to one, we create a sequence of possible values for each at intervals of 0.01. Then, for each combination of Sensitivity and Precision, we compute the numerator and the denominator and take their difference. If the difference is never negative, the numerator never exceeds the denominator, so the F1 score cannot exceed one. And because Sensitivity and Precision cannot be below zero, neither the numerator nor the denominator (which sums these two metrics) can be negative, so the F1 score cannot fall below zero.
Second, we again step through every combination of Sensitivity and Precision values from zero to one at 0.01 increments. This time we compute the F1 score for each combination to demonstrate that all simulated values fall between zero and one.
# Simulate Precision and Sensitivity: vectors of values between 0 and 1 at 0.01 interval
precision_sweep <- seq(from = 0, to = 1, by = 0.01)
sensitivity_sweep <- seq(from = 0, to = 1, by = 0.01)
library(tibble)   # tibble() and add_row() used below come from the tibble package
f1_bounds_method_1 <- function(precision_sweep, sensitivity_sweep) {
# Initialize dataframe (first row to be removed later)
df_f1 <- tibble("numerator" = 0, "denominator" = 0, "diff" = 0)
# Loop through each value of Precision (0 to 1 at 0.01 intervals)
for (prec in precision_sweep) {
# Loop through each value of Sensitivity (0 to 1 at 0.01 intervals)
for (sens in sensitivity_sweep) {
# Calculate numerator and denominator
f1_numerator <- (2 * prec * sens)
f1_denominator <- (prec + sens)
# Subtract numerator from denominator
diff_denom_numer <- f1_denominator - f1_numerator
# Append row to dataframe with calculated values
df_f1 <- df_f1 %>% add_row("numerator" = f1_numerator, "denominator" = f1_denominator,
"diff" = diff_denom_numer)
}
}
# Remove first row which was only created to initialize the dataframe
df_f1 <- df_f1[-1,]
return (plot(df_f1$diff, xlab = "Simulation #", ylab = "Difference: Denominator - Numerator"))
}
f1_bounds_method_1(precision_sweep, sensitivity_sweep)
f1_bounds_method_2 <- function(precision_sweep, sensitivity_sweep) {
# Initialize empty vector
f1_scores <- c()
# Loop through each value of Precision (0 to 1 at 0.01 intervals)
for (prec in precision_sweep) {
# Loop through each value of Sensitivity (0 to 1 at 0.01 intervals)
for (sens in sensitivity_sweep) {
# Append F1 score to F1 scores vector
f1_scores <- c(f1_scores, fF1_score(prec, sens))
}
}
return (plot(f1_scores, xlab = "Simulation #", ylab = "F1 Score"))
}
f1_bounds_method_2(precision_sweep, sensitivity_sweep)
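For completeness, the bound the simulations illustrate also follows algebraically from the hint. Writing \(P\) for Precision and \(S\) for Sensitivity (with \(P + S > 0\) so the ratio is defined):
\[
P \cdot S \le P \quad\text{and}\quad P \cdot S \le S \;\Longrightarrow\; 2 \cdot P \cdot S \le P + S \;\Longrightarrow\; 0 \le F1 = \frac{2 \cdot P \cdot S}{P + S} \le 1.
\]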
Write a function that generates an ROC curve from a data set with a true classification column (class in our example) and a probability column (scored.probability in our example). Your function should return a list that includes the plot of the ROC curve and a vector that contains the calculated area under the curve (AUC).
First, to generate the ROC curve we create a sequence of thresholds ranging from 0 to 1 at 0.01 intervals, as specified in the directions. We then loop through each threshold and store the associated Sensitivity and 1 - Specificity values in a dataframe. The resulting values are plotted with 1 - Specificity on the x axis and Sensitivity on the y axis, which gives the ROC curve. Next, we approximate the AUC using a method demonstrated by Bob Horton for Revolutions: sample from the positive-labeled and the negative-labeled observations and estimate the probability that a positive case is scored higher than a negative case.
generate_roc <- function(data) {
# Initialize dataframe (first row to be removed later)
df_sen_spec <- tibble("sensitivity" = 0, "specificity" = 0)
# Loop through thresholds
for (thres in seq(from = 0, to = 1, by = 0.01)) {
# If scored probability is above threshold, assign a value of 1
data$new_threshold <- ifelse(data$scored.probability > thres, 1, 0)
# Create confusion matrix for threshold
thres_method = table(factor(data$new_threshold, levels = 0:1),
factor(data$class, levels = 0:1),
dnn = c("Prediction", "Reference"))
# Extract true negatives, true positives, false positive, and false negatives
TN <- thres_method[1,1]
TP <- thres_method[2,2]
FP <- thres_method[2,1]
FN <- thres_method[1,2]
# Calculate Sensitivity and Specificity via our previously created functions
thres_sensitivity = fsensitivity(TP, FN)
thres_specificity = fspecificity(FP, TN)
# Append this threshold's Sensitivity and 1 - Specificity (stored in the specificity column)
df_sen_spec <- df_sen_spec %>% add_row("sensitivity" = thres_sensitivity,
"specificity" = 1 - thres_specificity)
}
# Remove first row which was only created to initialize the dataframe
df_sen_spec <- df_sen_spec[-1,]
# Generate ROC curve plot
roc_plot <- plot(df_sen_spec$specificity, df_sen_spec$sensitivity,
type="l", asp = 1, xlab = "1 - Specificity", ylab = "Sensitivity")
return (roc_plot)
}
calculate_auc_prob <- function(true_class, scored_probability, num_samples) {
# Sample from positive observations
pos <- sample(scored_probability[true_class], num_samples, replace=TRUE)
# Sample from negative observations
neg <- sample(scored_probability[!true_class], num_samples, replace=TRUE)
# Approximate AUC by summing probability of positive cases outranking negative cases
auc <- (sum(pos > neg) + sum(pos == neg)/2) / num_samples
return (c(auc))
}
roc_auc <- function(data) {
# Generate and store ROC curve plot
roc_plot <- generate_roc(data)
# Calculate and store AUC
auc_cal <- calculate_auc_prob(as.logical(data$class), data$scored.probability, 10000000)
roc_auc_list <- list(roc_plot, auc_cal)
return (roc_auc_list)
}
roc_auc(data)
## [[1]]
## NULL
##
## [[2]]
## [1] 0.8503294
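The first element of the returned list prints as NULL because base plot() draws the figure as a side effect and returns NULL. As a hedged cross-check on the sampling-based approximation (not part of the assignment's method), the AUC could also be estimated with the trapezoidal rule over the same threshold sweep; calculate_auc_trapezoid below is an illustrative name.
# Hedged sketch: trapezoidal-rule AUC over the same 0-to-1 threshold sweep used in generate_roc()
calculate_auc_trapezoid <- function(true_class, scored_probability) {
  thresholds <- seq(from = 0, to = 1, by = 0.01)
  # True positive rate and false positive rate at each threshold
  tpr <- sapply(thresholds, function(t) mean(scored_probability[true_class == 1] > t))
  fpr <- sapply(thresholds, function(t) mean(scored_probability[true_class == 0] > t))
  # Order the points by increasing false positive rate, then sum the trapezoid areas
  ord <- order(fpr, tpr)
  fpr <- fpr[ord]
  tpr <- tpr[ord]
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
}
# Example usage: calculate_auc_trapezoid(data$class, data$scored.probability)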
Use your created R functions and the provided classification output data set to produce all of the classification metrics discussed above.
accuracy = faccuracy(TP, FP, TN, FN)
cat('accuracy:',accuracy,'\n')
## accuracy: 0.8066298
class_error_rate = fclass_error_rate(TP, FP, TN, FN)
cat('classification error rate:',class_error_rate,'\n')
## classification error rate: 0.1933702
precision = fprecision(TP, FP)
cat('precision:',precision,'\n')
## precision: 0.84375
sensitivity = fsensitivity(TP,FN)
cat('sensitivity:',sensitivity,'\n')
## sensitivity: 0.4736842
specificity = fspecificity(FP,TN)
cat('specificity:',specificity,'\n')
## specificity: 0.9596774
F1_score = fF1_score(precision, sensitivity)
cat('F1 score:',F1_score)
## F1 score: 0.6067416
Investigate the caret package. In particular, consider the functions confusionMatrix, sensitivity, and specificity. Apply the functions to the data set. How do the results compare with your own functions?
The caret package builds its confusion matrix with the predictions as rows and the reference (actual) classes as columns, so with 1 as the positive class the cells are laid out as:
TN FN
FP TP
When I first ran the caret package’s confusionMatrix, the sensitivity and specificity values were swapped relative to what my custom functions produced. A Stack Overflow answer helped me solve the problem: I needed to set the parameter “positive” to “1” because caret defaults to treating the first factor level, 0, as the positive class.
library(caret)
# Creates vectors having data points
expected_value <- factor(c(cdata %>% pull(class)))
predicted_value <- factor(c(cdata %>% pull(scored.class)))
#Creating confusion matrix
caret_confusion_matrix <- confusionMatrix(data=predicted_value, reference = expected_value, positive='1')
#Display results
caret_confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 119 30
## 1 5 27
##
## Accuracy : 0.8066
## 95% CI : (0.7415, 0.8615)
## No Information Rate : 0.6851
## P-Value [Acc > NIR] : 0.0001712
##
## Kappa : 0.4916
##
## Mcnemar's Test P-Value : 4.976e-05
##
## Sensitivity : 0.4737
## Specificity : 0.9597
## Pos Pred Value : 0.8438
## Neg Pred Value : 0.7987
## Prevalence : 0.3149
## Detection Rate : 0.1492
## Detection Prevalence : 0.1768
## Balanced Accuracy : 0.7167
##
## 'Positive' Class : 1
##
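The standalone sensitivity() and specificity() helpers in caret should give the same numbers as our functions; a brief sketch, again specifying the positive/negative level because caret’s defaults treat the first factor level (0) as the positive class:
# caret's standalone helpers applied to the same factors; the levels are set explicitly
# so that 1 is treated as the positive class and 0 as the negative class
sensitivity(predicted_value, expected_value, positive = "1")
specificity(predicted_value, expected_value, negative = "0")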
Investigate the pROC package. Use it to generate an ROC curve for the data set. How do the results compare with your own functions?
The ROC curve from the pROC package is close to identical to the plot generated by our own function; the only visible difference is that our curve, built from thresholds at 0.01 increments, appears slightly smoother. The AUC reported by pROC (85.03%) agrees with the sampling-based approximation from our function (0.8503) to within rounding.
library(pROC)
# Generate ROC curve plot and AUC
roc(data$class, data$scored.probability,
percent=TRUE, plot=TRUE, ci=TRUE)
##
## Call:
## roc.default(response = data$class, predictor = data$scored.probability, percent = TRUE, ci = TRUE, plot = TRUE)
##
## Data: data$scored.probability in 124 controls (data$class 0) < 57 cases (data$class 1).
## Area under the curve: 85.03%
## 95% CI: 79.05%-91.01% (DeLong)
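For a direct numeric comparison on the same 0-to-1 scale as our function, the AUC can also be extracted from a pROC object; a small sketch (roc_obj is just an illustrative name):
# Build the pROC object without percent scaling and pull out the AUC as a plain number
roc_obj <- roc(data$class, data$scored.probability)
as.numeric(auc(roc_obj))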