In this homework assignment, you will work through various classification metrics. You will be asked to create functions in R to carry out the various calculations. You will also investigate some functions in packages that will let you obtain the equivalent results. Finally, you will create graphical output that can also be used to evaluate the output of classification models, such as binary logistic regression.
## pregnant glucose diastolic skinfold
## Min. : 0.000 Min. : 57.0 Min. : 38.0 Min. : 0.0
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 64.0 1st Qu.: 0.0
## Median : 3.000 Median :112.0 Median : 70.0 Median :22.0
## Mean : 3.862 Mean :118.3 Mean : 71.7 Mean :19.8
## 3rd Qu.: 6.000 3rd Qu.:136.0 3rd Qu.: 78.0 3rd Qu.:32.0
## Max. :15.000 Max. :197.0 Max. :104.0 Max. :54.0
## insulin bmi pedigree age
## Min. : 0.00 Min. :19.40 Min. :0.0850 Min. :21.00
## 1st Qu.: 0.00 1st Qu.:26.30 1st Qu.:0.2570 1st Qu.:24.00
## Median : 0.00 Median :31.60 Median :0.3910 Median :30.00
## Mean : 63.77 Mean :31.58 Mean :0.4496 Mean :33.31
## 3rd Qu.:105.00 3rd Qu.:36.00 3rd Qu.:0.5800 3rd Qu.:41.00
## Max. :543.00 Max. :50.00 Max. :2.2880 Max. :67.00
## class scored.class scored.probability
## Min. :0.0000 Min. :0.0000 Min. :0.02323
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.11702
## Median :0.0000 Median :0.0000 Median :0.23999
## Mean :0.3149 Mean :0.1768 Mean :0.30373
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.43093
## Max. :1.0000 Max. :1.0000 Max. :0.94633
The data set has three key columns we will use:

- class: the actual class for the observation
- scored.class: the predicted class for the observation (based on a threshold of 0.5)
- scored.probability: the predicted probability of success for the observation
## class scored.class scored.probability
## 1 0 0 0.32845226
## 2 0 0 0.27319044
## 3 1 0 0.10966039
## 4 0 0 0.05599835
## 5 0 0 0.10049072
## 6 0 0 0.05515460
## 7 0 0 0.10711542
## 8 0 0 0.45994744
## 9 0 0 0.11702368
## 10 0 0 0.31536320
## 11 0 0 0.12518925
## 12 0 0 0.27062482
## 13 0 0 0.20980960
## 14 0 0 0.09358589
## 15 1 1 0.88484573
## 16 1 0 0.39665216
## 17 0 1 0.89139491
## 18 1 1 0.53454900
## 19 1 1 0.94633418
## 20 0 0 0.14491618
## 21 0 0 0.21763796
## 22 0 0 0.07521357
## 23 0 0 0.08843254
## 24 0 0 0.30346820
## 25 1 1 0.72448003
## 26 1 0 0.27497369
## 27 1 0 0.42486483
## 28 1 0 0.43092552
## 29 0 0 0.02322803
## 30 0 0 0.04596084
## 31 0 0 0.12798534
## 32 0 0 0.29933706
## 33 1 0 0.45909503
## 34 0 0 0.10479581
## 35 1 1 0.86309177
## 36 1 1 0.63997495
## 37 0 0 0.35818434
## 38 0 0 0.37216467
## 39 1 1 0.81110322
## 40 0 0 0.16812736
## 41 0 0 0.15127796
## 42 0 0 0.10700703
## 43 0 0 0.18796139
## 44 0 0 0.13719711
## 45 0 0 0.30047491
## 46 0 0 0.13688715
## 47 0 0 0.09786911
## 48 0 0 0.06290701
## 49 1 0 0.26941931
## 50 0 0 0.48854279
## 51 0 0 0.37175798
## 52 1 0 0.09947993
## 53 0 0 0.08656230
## 54 0 0 0.17528939
## 55 1 0 0.46937901
## 56 1 1 0.61655438
## 57 0 0 0.09982792
## 58 1 1 0.68917710
## 59 0 0 0.25528619
## 60 1 1 0.85054326
## 61 1 0 0.16886251
## 62 0 0 0.09724147
## 63 0 0 0.24836913
## 64 0 0 0.18156098
## 65 0 0 0.23999362
## 66 0 0 0.40455890
## 67 0 0 0.35830256
## 68 0 0 0.15063079
## 69 1 0 0.48653832
## 70 1 1 0.61503482
## 71 0 0 0.35448950
## 72 1 0 0.17313962
## 73 1 0 0.25961113
## 74 1 1 0.69893986
## 75 0 0 0.29860163
## 76 0 0 0.10423133
## 77 0 0 0.21105283
## 78 0 0 0.06542776
## 79 0 0 0.05051433
## 80 0 0 0.10867969
## 81 0 0 0.07868948
## 82 1 1 0.68128759
## 83 1 0 0.37712032
## 84 0 0 0.16270775
## 85 0 0 0.35215080
## 86 1 0 0.47549641
## 87 0 0 0.13565815
## 88 0 0 0.13391981
## 89 0 1 0.52247107
## 90 0 0 0.25938189
## 91 0 0 0.09946740
## 92 0 0 0.12719232
## 93 1 0 0.37644619
## 94 0 1 0.52088231
## 95 1 1 0.76059210
## 96 1 0 0.20892653
## 97 1 0 0.23335214
## 98 0 0 0.20594069
## 99 0 0 0.11565955
## 100 0 0 0.08399311
## 101 1 0 0.11773115
## 102 1 1 0.71703637
## 103 0 0 0.12928998
## 104 0 0 0.43680373
## 105 0 0 0.28155366
## 106 1 1 0.59198379
## 107 1 1 0.84729320
## 108 0 0 0.31515560
## 109 0 0 0.13731015
## 110 0 0 0.16809635
## 111 0 0 0.05067111
## 112 0 0 0.49835908
## 113 1 0 0.45481435
## 114 0 0 0.45044211
## 115 1 0 0.18149738
## 116 0 0 0.29420367
## 117 1 0 0.40944833
## 118 1 0 0.31676829
## 119 0 0 0.19695491
## 120 0 0 0.06279957
## 121 1 1 0.88335913
## 122 0 0 0.09936447
## 123 1 0 0.40883702
## 124 0 0 0.36249223
## 125 0 0 0.07991838
## 126 1 1 0.61727619
## 127 0 0 0.22358166
## 128 0 0 0.30138627
## 129 0 0 0.06610913
## 130 1 0 0.16702896
## 131 0 0 0.29706445
## 132 0 1 0.62765020
## 133 0 0 0.20362875
## 134 0 0 0.45747349
## 135 0 0 0.37227213
## 136 1 1 0.63578069
## 137 0 0 0.08377338
## 138 0 0 0.15193780
## 139 0 0 0.05320989
## 140 1 1 0.54866439
## 141 0 0 0.49462614
## 142 0 0 0.23532549
## 143 0 0 0.18315190
## 144 0 0 0.06415054
## 145 0 0 0.08595564
## 146 0 0 0.37378785
## 147 0 0 0.41280937
## 148 1 1 0.83049762
## 149 0 0 0.13145383
## 150 0 0 0.06614058
## 151 0 0 0.10096173
## 152 0 0 0.02863965
## 153 0 0 0.26964042
## 154 1 0 0.32814202
## 155 0 0 0.14935090
## 156 1 0 0.45557144
## 157 0 0 0.08094979
## 158 0 0 0.03471434
## 159 1 1 0.66147500
## 160 0 0 0.06598930
## 161 0 0 0.13979031
## 162 0 0 0.04742669
## 163 0 0 0.02660702
## 164 1 1 0.78259456
## 165 0 0 0.14182849
## 166 0 0 0.28503028
## 167 1 0 0.33885542
## 168 1 0 0.16264455
## 169 0 1 0.56490618
## 170 0 0 0.05622422
## 171 0 0 0.18911685
## 172 0 0 0.17072487
## 173 0 0 0.16080488
## 174 1 0 0.24577269
## 175 0 0 0.10999049
## 176 1 1 0.67645162
## 177 0 0 0.31141958
## 178 1 1 0.70720959
## 179 1 1 0.88827658
## 180 0 0 0.42246786
## 181 0 0 0.11998103
Use the table() function to get the raw confusion matrix for this scored dataset. Make sure you understand the output. In particular, do the rows represent the actual or predicted class? The columns?
df1 %>%
select(class, scored.class) %>%
mutate(class = recode(class,
'0' = 'Actual Negative',
'1' = 'Actual Positive'),
scored.class = recode(scored.class,
'0' = 'Predicted Negative',
'1' = 'Predicted Positive')) %>%
table()

## scored.class
## class Predicted Negative Predicted Positive
## Actual Negative 119 5
## Actual Positive 30 27

In this table the rows represent the actual class (class) and the columns the predicted class (scored.class): 119 true negatives, 27 true positives, 5 false positives, and 30 false negatives.
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the accuracy of the predictions.
$$Accuracy = \frac{TP + TN}{TP + FP + TN + FN}$$
#Creating a function to calculate accuracy
calculate_accuracy <- function(df) {
#Making sure the dataframe has 'class' and 'scored.class' columns
if (!all(c("class", "scored.class") %in% colnames(df))) {
stop("The dataframe must have 'class' and 'scored.class' columns.")
}
#Calculating True Positives (TP), True Negatives (TN), False Positives (FP), False Negatives (FN)
TP <- sum(df$class == 1 & df$scored.class == 1)
TN <- sum(df$class == 0 & df$scored.class == 0)
FP <- sum(df$class == 0 & df$scored.class == 1)
FN <- sum(df$class == 1 & df$scored.class == 0)
#Calculating accuracy
accuracy <- (TP + TN) / (TP + FP + TN + FN)
return(accuracy)
}
#Using the calculate_accuracy function with dataframe
accuracy_result <- calculate_accuracy(df1)
print(paste("Accuracy:", accuracy_result))## [1] "Accuracy: 0.806629834254144"
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the classification error rate of the predictions.
$$\text{Classification Error Rate} = \frac{FP + FN}{TP + FP + TN + FN}$$
Verify that the accuracy and the error rate you obtain sum to one.
#Creating a function to calculate classification error rate
calculate_error_rate <- function(df) {
#Making sure the dataframe has 'class' and 'scored.class' columns
if (!all(c("class", "scored.class") %in% colnames(df))) {
stop("The dataframe must have 'class' and 'scored.class' columns.")
}
#Calculating the classification error rate
error_rate <- mean(df$class != df$scored.class)
return(error_rate)
}
#Example usage with dataframe
error_rate_result <- calculate_error_rate(df1)
print(paste("Classification Error Rate:", error_rate_result))## [1] "Classification Error Rate: 0.193370165745856"
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the precision of the predictions.
$$Precision = \frac{TP}{TP + FP}$$
#Creating a function to calculate precision
calculate_precision <- function(df) {
#Making sure the dataframe has 'class' and 'scored.class' columns
if (!all(c("class", "scored.class") %in% colnames(df))) {
stop("The dataframe must have 'class' and 'scored.class' columns.")
}
#Calculating true positive (TP), false positive (FP), and precision
TP <- sum(df$class == 1 & df$scored.class == 1)
FP <- sum(df$class == 0 & df$scored.class == 1)
#Avoiding division by zero
if (TP + FP == 0) {
precision <- NaN
} else {
precision <- TP / (TP + FP)
}
return(precision)
}
# Example usage with dataframe
precision_result <- calculate_precision(df1)
print(paste("Precision:", precision_result))## [1] "Precision: 0.84375"
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the sensitivity of the predictions. Sensitivity is also known as recall.
$$Sensitivity = \frac{TP}{TP + FN}$$
#Creating a function to calculate sensitivity (recall)
calculate_sensitivity <- function(df) {
#Making sure the dataframe has 'class' and 'scored.class' columns
if (!all(c("class", "scored.class") %in% colnames(df))) {
stop("The dataframe has 'class' and 'scored.class' columns.")
}
#Calculating true positive (TP), false negative (FN), and sensitivity
TP <- sum(df$class == 1 & df$scored.class == 1)
FN <- sum(df$class == 1 & df$scored.class == 0)
#Avoiding division by zero
if (TP + FN == 0) {
sensitivity <- NaN
} else {
sensitivity <- TP / (TP + FN)
}
return(sensitivity)
}
#Example usage with dataframe
sensitivity_result <- calculate_sensitivity(df1)
print(paste("Sensitivity (Recall):", sensitivity_result))## [1] "Sensitivity (Recall): 0.473684210526316"
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the specificity of the predictions.
$$Specificity = \frac{TN}{TN + FP}$$
#creating a function to calculate specificity
calculate_specificity <- function(df) {
#Making sure the dataframe has 'class' and 'scored.class' columns
if (!all(c("class", "scored.class") %in% colnames(df))) {
stop("The dataframe has 'class' and 'scored.class' columns.")
}
#Calculating true negative (TN), false positive (FP), and specificity
TN <- sum(df$class == 0 & df$scored.class == 0)
FP <- sum(df$class == 0 & df$scored.class == 1)
#Avoiding division by zero
if (TN + FP == 0) {
specificity <- NaN
} else {
specificity <- TN / (TN + FP)
}
return(specificity)
}
#Example usage with dataframe
specificity_result <- calculate_specificity(df1)
print(paste("Specificity:", specificity_result))## [1] "Specificity: 0.959677419354839"
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the F1 score of the predictions.
$$F_1\ Score = \frac{2 \times Precision \times Sensitivity}{Precision + Sensitivity}$$
#Creating a function to calculate F1 score
calculate_f1_score <- function(df) {
#Making sure the dataframe has 'class' and 'scored.class' columns
if (!all(c("class", "scored.class") %in% colnames(df))) {
stop("The dataframe has 'class' and 'scored.class' columns.")
}
#Calculating true positive (TP), false positive (FP), false negative (FN), precision, recall, and F1 score
TP <- sum(df$class == 1 & df$scored.class == 1)
FP <- sum(df$class == 0 & df$scored.class == 1)
FN <- sum(df$class == 1 & df$scored.class == 0)
#Avoiding division by zero
if (TP + FP == 0 | TP + FN == 0) {
f1_score <- NaN
} else {
precision <- TP / (TP + FP)
recall <- TP / (TP + FN)
f1_score <- 2 * precision * recall / (precision + recall)
}
return(f1_score)
}
#Example usage with dataframe
f1_score_result <- calculate_f1_score(df1)
print(paste("F1 Score:", f1_score_result))## [1] "F1 Score: 0.606741573033708"
Before we move on, let's consider a question that was asked: What are the bounds on the F1 score? Show that the F1 score will always be between 0 and 1. (Hint: if $0 < a < 1$ and $0 < b < 1$, then $ab < a$.)
#Creating: F1 function definition
f1_function <- function(precision, sensitivity) {
f1score <- (2 * precision * sensitivity) / (precision + sensitivity)
return(f1score)
}
#Example: 0 precision, 0.5 sensitivity
f1_example1 <- f1_function(0, 0.5)
cat("F1 Score (0 precision, 0.5 sensitivity):", f1_example1, "\n")## F1 Score (0 precision, 0.5 sensitivity): 0
#Example: 1 precision, 1 sensitivity
f1_example2 <- f1_function(1, 1)
cat("F1 Score (1 precision, 1 sensitivity):", f1_example2, "\n")## F1 Score (1 precision, 1 sensitivity): 1
The F1 score combines precision and sensitivity into a single value that always lies between 0 and 1, as the examples above suggest and the argument below shows.
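Using the hint: for precision $P$ and sensitivity $S$ with $0 < P \le 1$ and $0 < S \le 1$, we have $PS \le P$ and $PS \le S$, so adding these gives $2PS \le P + S$ and therefore

$$F_1 = \frac{2PS}{P + S} \le 1,$$

with equality only when $P = S = 1$. Since the numerator and denominator are both positive whenever $P$ and $S$ are positive, $F_1 > 0$; the score can only reach 0 in the degenerate case where precision or sensitivity is exactly 0, as in the first example above. Hence the F1 score always lies between 0 and 1.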
Write a function that generates an ROC curve from a data set with a true classification column (class in our example) and a probability column (scored.probability in our example). Your function should return a list that includes the plot of the ROC curve and a vector that contains the calculated area under the curve (AUC). Note that I recommend using a sequence of thresholds ranging from 0 to 1 at 0.01 intervals.
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
generate_roc_curve <- function(df, true_col, prob_col) {
#Making sure required libraries are installed
if (!require(pROC)) {
install.packages("pROC")
library(pROC)
}
#Checking if the necessary columns are present in the dataframe
if (!(true_col %in% names(df) && prob_col %in% names(df))) {
stop("The specified columns are not present in the dataframe.")
}
#Calculating ROC curve and AUC
roc_curve <- roc(df[[true_col]], df[[prob_col]])
auc_value <- auc(roc_curve)
#Plotting ROC curve
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
abline(h = seq(0, 1, by = 0.1), v = seq(0, 1, by = 0.1), col = "lightgray", lty = 2)
grid()
#Returning list with plot and AUC value
result_list <- list(ROC_Curve_Plot = roc_curve, AUC = auc_value)
return(result_list)
}
#Example usage:
result <- generate_roc_curve(df, "class", "scored.probability")

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.8503
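The helper above leans on pROC for the heavy lifting. As a rough illustration of the threshold-sweep approach recommended in the prompt (thresholds from 0 to 1 in 0.01 steps), the sketch below computes the TPR/FPR pairs by hand and approximates the AUC with the trapezoidal rule; manual_roc is a hypothetical name, and the column names are assumed to match those in df1.

#Minimal manual sketch (assumption: columns named 'class' and 'scored.probability')
manual_roc <- function(df, true_col = "class", prob_col = "scored.probability") {
#Sequence of classification thresholds, 0 to 1 in 0.01 steps
thresholds <- seq(0, 1, by = 0.01)
actual <- df[[true_col]]
probs <- df[[prob_col]]
#True positive rate and false positive rate at each threshold
tpr <- sapply(thresholds, function(t) {
pred <- as.integer(probs >= t)
sum(pred == 1 & actual == 1) / sum(actual == 1)
})
fpr <- sapply(thresholds, function(t) {
pred <- as.integer(probs >= t)
sum(pred == 1 & actual == 0) / sum(actual == 0)
})
#Order by FPR so the trapezoidal rule integrates from left to right
ord <- order(fpr)
fpr <- fpr[ord]
tpr <- tpr[ord]
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
#Plot the curve and return the recorded plot plus the AUC
plot(fpr, tpr, type = "l", main = "ROC Curve (manual thresholds)",
xlab = "False Positive Rate", ylab = "True Positive Rate")
abline(0, 1, lty = 2, col = "gray")
list(ROC_Curve_Plot = recordPlot(), AUC = auc)
}
#Example usage (assuming df1 as above)
#manual_result <- manual_roc(df1)
#manual_result$AUC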
Use your created R functions and the provided classification output data set to produce all of the classification metrics discussed above.
# Assuming df1 is your dataframe
# Load required libraries
if (!require(knitr)) {
install.packages("knitr")
library(knitr)
}

## Loading required package: knitr
# Create a function to calculate all metrics
calculate_metrics <- function(df) {
# Making sure the dataframe has 'class' and 'scored.class' columns
if (!all(c("class", "scored.class") %in% colnames(df))) {
stop("The dataframe must have 'class' and 'scored.class' columns.")
}
# Calculating True Positives (TP), True Negatives (TN), False Positives (FP), False Negatives (FN)
TP <- sum(df$class == 1 & df$scored.class == 1)
TN <- sum(df$class == 0 & df$scored.class == 0)
FP <- sum(df$class == 0 & df$scored.class == 1)
FN <- sum(df$class == 1 & df$scored.class == 0)
# Calculating metrics
accuracy <- (TP + TN) / (TP + FP + TN + FN)
error_rate <- mean(df$class != df$scored.class)
precision <- if (TP + FP == 0) NaN else TP / (TP + FP)
sensitivity <- if (TP + FN == 0) NaN else TP / (TP + FN)
specificity <- if (TN + FP == 0) NaN else TN / (TN + FP)
f1_score <- if ((TP + FP == 0) | (TP + FN == 0)) NaN else 2 * precision * sensitivity / (precision + sensitivity)
# Return a named vector
result_vector <- c(
Accuracy = accuracy,
`Classification Error Rate` = error_rate,
Precision = precision,
Sensitivity = sensitivity,
Specificity = specificity,
`F1 Score` = f1_score
)
return(result_vector)
}
# Use the calculate_metrics function with your dataframe
created_functions <- calculate_metrics(df1)
# Create a table
kable(data.frame(Metric = names(created_functions), Value = created_functions), col.names = c("Metric", "Value"))

| | Metric | Value |
|---|---|---|
| Accuracy | Accuracy | 0.8066298 |
| Classification Error Rate | Classification Error Rate | 0.1933702 |
| Precision | Precision | 0.8437500 |
| Sensitivity | Sensitivity | 0.4736842 |
| Specificity | Specificity | 0.9596774 |
| F1 Score | F1 Score | 0.6067416 |
#Loading required libraries "knitr"
if (!require(knitr)) {
install.packages("knitr")
library(knitr)
}
#Creating a function to calculate all metrics
calculate_metrics <- function(df) {
#Just making sure the dataframe has 'class' and 'scored.class' columns
if (!all(c("class", "scored.class") %in% colnames(df))) {
stop("dataframe must have 'class' and 'scored.class' columns.")
}
#Calculating True Positives (TP), True Negatives (TN), False Positives (FP), False Negatives (FN)
TP <- sum(df$class == 1 & df$scored.class == 1)
TN <- sum(df$class == 0 & df$scored.class == 0)
FP <- sum(df$class == 0 & df$scored.class == 1)
FN <- sum(df$class == 1 & df$scored.class == 0)
#Calculating metrics
accuracy <- (TP + TN) / (TP + FP + TN + FN)
error_rate <- mean(df$class != df$scored.class)
precision <- if (TP + FP == 0) NaN else TP / (TP + FP)
sensitivity <- if (TP + FN == 0) NaN else TP / (TP + FN)
specificity <- if (TN + FP == 0) NaN else TN / (TN + FP)
f1_score <- if ((TP + FP == 0) | (TP + FN == 0)) NaN else 2 * precision * sensitivity / (precision + sensitivity)
#Returning a data frame
result_df <- data.frame(
Metric = c("Accuracy", "Classification Error Rate", "Precision", "Sensitivity", "Specificity", "F1 Score"),
Value = c(accuracy, error_rate, precision, sensitivity, specificity, f1_score)
)
return(result_df)
}
#Using the calculate_metrics function with your dataframe
result_table <- calculate_metrics(df1)
#Printing the table that displays metrics
kable(result_table, col.names = c("Metric", "Value"))

| Metric | Value |
|---|---|
| Accuracy | 0.8066298 |
| Classification Error Rate | 0.1933702 |
| Precision | 0.8437500 |
| Sensitivity | 0.4736842 |
| Specificity | 0.9596774 |
| F1 Score | 0.6067416 |
Investigate the caret package. In particular, consider the functions confusionMatrix, sensitivity, and specificity. Apply the functions to the data set. How do the results compare with your own functions?
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
cm <- confusionMatrix(data = as.factor(df$scored.class),
reference = as.factor(df$class),
positive = "1")
cm

## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 119 30
## 1 5 27
##
## Accuracy : 0.8066
## 95% CI : (0.7415, 0.8615)
## No Information Rate : 0.6851
## P-Value [Acc > NIR] : 0.0001712
##
## Kappa : 0.4916
##
## Mcnemar's Test P-Value : 4.976e-05
##
## Sensitivity : 0.4737
## Specificity : 0.9597
## Pos Pred Value : 0.8438
## Neg Pred Value : 0.7987
## Prevalence : 0.3149
## Detection Rate : 0.1492
## Detection Prevalence : 0.1768
## Balanced Accuracy : 0.7167
##
## 'Positive' Class : 1
##

The caret results agree with the custom functions above: Accuracy 0.8066, Sensitivity 0.4737, Specificity 0.9597, and Pos Pred Value (precision) 0.8438 all match the values computed earlier.
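Beyond confusionMatrix, the individual sensitivity() and specificity() helpers in caret can also be called directly. A brief sketch, assuming the same factor conversions used above; these should reproduce the values from the custom functions.

#caret's sensitivity() and specificity() applied to the same predictions
sens_caret <- sensitivity(data = as.factor(df$scored.class),
reference = as.factor(df$class),
positive = "1")
spec_caret <- specificity(data = as.factor(df$scored.class),
reference = as.factor(df$class),
negative = "0")
c(Sensitivity = sens_caret, Specificity = spec_caret)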
Investigate the pROC package. Use it to generate an ROC curve for the data set. How do the results compare with your own functions?
library(pROC)
#Assuming data frame with 'class' and 'scored.probability' columns
roc_obj <- roc(df$class, df$scored.probability)

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_val <- auc(roc_obj)
par(mfrow = c(1, 2), pty = 's')
#Plotting ROC curve using pROC package
plot(roc_obj, print.auc = TRUE, main = 'ROC Curve - pROC Package')
#Displaying AUC value
cat("AUC (pROC Package):", auc_val, "\n")## AUC (pROC Package): 0.8503113