Exploring Classification Metrics
Overview
In this homework assignment, we work through various classification metrics. We write R functions to carry out the calculations, investigate package functions that produce the equivalent results, and create graphical output that can also be used to evaluate the output of classification models such as binary logistic regression. The data set was provided by the professor.
Load Data
#read the data
data <- read.csv("./classification-output-data.csv", stringsAsFactors = FALSE)

First we should investigate the data file provided. It looks like the dependent variable class was regressed against several independent variables. The scored.class column is the predicted class, and scored.probability is the predicted probability that the observation belongs to class 1. A further description of the variables is given below:
- pregnant: number of times pregnant
- glucose: plasma glucose concentration
- diastolic: diastolic blood pressure
- skinfold: triceps skin fold thickness
- insulin: serum insulin test
- bmi: body mass index
- pedigree: diabetes pedigree function
- age: age in years
- class: (1: positive for diabetes, 0: negative for diabetes)
Source: https://www.kaggle.com/kumargh/pimaindiansdiabetescsv
| pregnant | glucose | diastolic | skinfold | insulin | bmi | pedigree | age | class | scored.class | scored.probability |
|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 124 | 70 | 33 | 215 | 25.5 | 0.161 | 37 | 0 | 0 | 0.3284523 |
| 2 | 122 | 76 | 27 | 200 | 35.9 | 0.483 | 26 | 0 | 0 | 0.2731904 |
| 3 | 107 | 62 | 13 | 48 | 22.9 | 0.678 | 23 | 1 | 0 | 0.1096604 |
| 1 | 91 | 64 | 24 | 0 | 29.2 | 0.192 | 21 | 0 | 0 | 0.0559984 |
| 4 | 83 | 86 | 19 | 0 | 29.3 | 0.317 | 34 | 0 | 0 | 0.1004907 |
| 1 | 100 | 74 | 12 | 46 | 19.5 | 0.149 | 28 | 0 | 0 | 0.0551546 |
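Before computing any metrics, a quick structural look at the loaded data frame (a minimal sketch using base R only) confirms its dimensions, column types, and whether any values are missing:

#quick structural check of the loaded data
dim(data)            #number of rows and columns
str(data)            #column names and types
colSums(is.na(data)) #count of missing values per column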
Question 2
The data set has three key columns we will use:
- class: the actual class for the observation
- scored.class: the predicted class for the observation (based on a threshold of 0.5)
- scored.probability: the predicted probability of success for the observation
Use the table() function to get the raw confusion matrix for this scored dataset. Make sure you understand the output. In particular, do the rows represent the actual or predicted class? The columns?
First, let us look at the actual class and the predicted class separately.
Actual class
table(data$class, dnn = "Actual class") %>% kable()

| Actual.class | Freq |
|---|---|
| 0 | 124 |
| 1 | 57 |
Predicted class
table(data$scored.class, dnn = "Predicted class") %>% kable()

| Predicted.class | Freq |
|---|---|
| 0 | 149 |
| 1 | 32 |
A quick sanity check of the 0.5 threshold relating scored.class and scored.probability
Let’s examine whether there is any row where scored.class is 1 but scored.probability is less than 0.5, or scored.class is 0 but scored.probability is greater than or equal to 0.5.
Prob_check <- function(df) {
  count <- 0
  for(i in 1:nrow(df)) {
    if ( (df$scored.class[i] == 1 & df$scored.probability[i] < 0.5) | (df$scored.class[i] != 1 & df$scored.probability[i] >= 0.5) ) {
      count <- count + 1
    }
  }
  return(count)
}

print(paste0("Row count: ", sprintf("%1d", Prob_check(data))))

## [1] "Row count: 0"
So we observe that there are no such rows: scored.class is consistent with a 0.5 cutoff applied to scored.probability.
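The same check can be written more concisely with vectorized logic (a minimal sketch using the same column names); it should likewise return 0:

#vectorized equivalent of Prob_check(): count rows where the 0.5 cutoff
#and scored.class disagree
sum((data$scored.class == 1 & data$scored.probability < 0.5) |
    (data$scored.class == 0 & data$scored.probability >= 0.5))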
Raw confusion matrix for the data
table(data$scored.class, data$class,
      dnn = c("Predicted", "Actual")) %>% kable()

|  | 0 | 1 |
|---|---|---|
| 0 | 119 | 30 |
| 1 | 5 | 27 |
A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known.
A true positive is an outcome where the model correctly predicts the positive class. Similarly, a true negative is an outcome where the model correctly predicts the negative class.
A false positive is an outcome where the model incorrectly predicts the positive class. And a false negative is an outcome where the model incorrectly predicts the negative class.
- TP (True Positive), Row 1 Col 1: 119 observations had an actual value of 0 and were predicted as 0.
- TN (True Negative), Row 2 Col 2: 27 observations had an actual value of 1 and were predicted as 1.
- FN (False Negative), Row 2 Col 1: 5 observations had an actual value of 0 but were predicted as 1.
- FP (False Positive), Row 1 Col 2: 30 observations had an actual value of 1 but were predicted as 0.

Note that class 0 is treated as the positive class here, which matches the caret default used in Question 12.
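As a small sketch of how those four cells map onto the table object (the same indexing used by the metric functions below):

cm <- table(data$scored.class, data$class, dnn = c("Predicted", "Actual"))
TP <- cm[1, 1] #predicted 0, actual 0 (class 0 treated as positive)
FP <- cm[1, 2] #predicted 0, actual 1
FN <- cm[2, 1] #predicted 1, actual 0
TN <- cm[2, 2] #predicted 1, actual 1
c(TP = TP, FP = FP, FN = FN, TN = TN)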
Question 3
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the accuracy of the predictions.
\[Accuracy = \frac{TP + TN}{TP + FP + TN + FN}\]

Let’s write an R function to calculate accuracy:
get_accuracy <- function(df){
confusion_matrix <- table(df$scored.class, df$class,
dnn = c("Predicted", "Actual"))
TN <- confusion_matrix[2,2]
FN <- confusion_matrix[2,1]
FP <- confusion_matrix[1,2]
TP <- confusion_matrix[1,1]
accuracy <- (TP+TN)/(TP+FP+TN+FN)
#print(paste0("The Accuracy rate is ", sprintf("%1.2f%%", 100*Accuracy)))
return(accuracy)
}

get_accuracy(data)

## [1] 0.8066298
Running the function on our data, we find an accuracy rate of 80.7%.
acc <- confusionMatrix(table(data$scored.class, data$class))
acc$overall['Accuracy']

##  Accuracy
## 0.8066298
We can do the same using the caret package and it returns the same result.
Question 4
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the classification error rate of the predictions.
Verify that you get an accuracy and an error rate that sums to one.
\[Classification\ Error\ Rate = \frac{FP + FN}{TP + FP + TN + FN}\]

Let’s create an R function to calculate the classification error rate:
get_classification_error <- function(df){
confusion_matrix <- table(df$scored.class, df$class,
dnn = c("Predicted", "Actual"))
TN <- confusion_matrix[2,2]
FN <- confusion_matrix[2,1]
FP <- confusion_matrix[1,2]
TP <- confusion_matrix[1,1]
error_rate <- (FP+FN)/(TP+FP+TN+FN)
return(error_rate)
}

Let’s run it on our data:

get_classification_error(data)

## [1] 0.1933702

print(paste0("The sum is ", (get_classification_error(data) + get_accuracy(data))))

## [1] "The sum is 1"
We can verify from the above that the accuracy and the error rate sum to one.
Question 5
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the precision of the predictions.
\[Precision=\frac{TP}{TP+FP}\]

Let’s create an R function to calculate precision:
get_precision <- function(df){
confusion_matrix <- table(df$scored.class, df$class,
dnn = c("Predicted", "Actual"))
TN <- confusion_matrix[2,2]
FN <- confusion_matrix[2,1]
FP <- confusion_matrix[1,2]
TP <- confusion_matrix[1,1]
Precision <- (TP)/(TP+FP)
return(Precision)
}

Let’s run it on our data:

get_precision(data)

## [1] 0.7986577
Let’s verify using caret:
posPredValue(table(data$scored.class, data$class))

## [1] 0.7986577
Question 6
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the sensitivity of the predictions. Sensitivity is also known as recall.
\[Sensitivity=\frac{TP}{TP+FN}\]
Sensitivity is also known as Recall, Hit rate or True Positive Rate (TPR).
Let’s create an R function to calculate sensitivity:
get_sensitivity <- function(df){
confusion_matrix <- table(df$scored.class, df$class,
dnn = c("Predicted", "Actual"))
TN <- confusion_matrix[2,2]
FN <- confusion_matrix[2,1]
FP <- confusion_matrix[1,2]
TP <- confusion_matrix[1,1]
sensitivity <- (TP)/(TP+FN)
return(sensitivity)
}

Let’s run it on our data:

get_sensitivity(data)

## [1] 0.9596774
Let’s verify using caret:
sensitivity(table(data$scored.class, data$class))

## [1] 0.9596774
Question 7
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the specificity of the predictions.
\[Specificity=\frac{TN}{TN+FP}\]
Specificity is also called selectivity or True Negative Rate (TNR).
Let’s create an R function to calculate specificity:
get_specificity <- function(df){
confusion_matrix <- table(df$scored.class, df$class,
dnn = c("Predicted", "Actual"))
TN <- confusion_matrix[2,2]
FN <- confusion_matrix[2,1]
FP <- confusion_matrix[1,2]
TP <- confusion_matrix[1,1]
specificity <- (TN)/(TN+FP)
return(specificity)
}

Let’s run it on our data:

get_specificity(data)

## [1] 0.4736842
Let’s verify using caret:
specificity(table(data$scored.class, data$class))

## [1] 0.4736842
Question 8
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the F1 score of the predictions.
\[F1\ Score=\frac{2*Precision*Sensitivity}{Precision + Sensitivity}\]
The F1-score or F-measure is a measure of a test’s accuracy. F1 Score is the harmonic mean of precision and sensitivity. The highest possible value of an F-score is 1.0, indicating perfect precision and recall, and the lowest possible value is 0, if either the precision or the recall is zero.
The R function is below:
get_F1_score <- function(df){
confusion_matrix <- table(df$scored.class, df$class,
dnn = c("Predicted", "Actual"))
TN <- confusion_matrix[2,2]
FN <- confusion_matrix[2,1]
FP <- confusion_matrix[1,2]
TP <- confusion_matrix[1,1]
sensitivity <- (TP)/(TP+FN)
precision <- (TP)/(TP+FP)
F1_score <- (2 * precision * sensitivity)/(precision + sensitivity)
return(F1_score)
}

Let’s run it on our data:

get_F1_score(data)

## [1] 0.8717949
Let’s verify using caret:
acc$byClass['F1']

## F1
## 0.8717949
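As a quick cross-check of the harmonic-mean definition, we can also rebuild the F1 score from the precision and sensitivity helpers defined earlier; it should reproduce the value above:

#F1 as the harmonic mean of the previously computed precision and sensitivity
p <- get_precision(data)
s <- get_sensitivity(data)
2 * p * s / (p + s)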
Question 9
Before we move on, let’s consider a question that was asked: What are the bounds on the F1 score? Show that the F1 score will always be between 0 and 1. (Hint: If 0 < a < 1 and 0 < b < 1 then a b < a)
Precision values can range from 0 to 1:

\[0\le p\le 1\]

Sensitivity values can also range from 0 to 1:

\[0\le s\le 1\]

Using the hint that if 0 < a < 1 and 0 < b < 1 then ab < a, we get

\[ps\le s\] \[ps\le p\]

This implies that

\[0\le ps\le p\le 1\] \[0\le ps\le s\le 1\]

Adding the two inequalities ps ≤ p and ps ≤ s gives 2ps ≤ p + s, so

\[F1\ Score=\frac{2ps}{p+s}\le 1\]

Both the numerator and the denominator are non-negative, so the quotient is also at least 0. This proves that the F1 score will always be between 0 and 1.
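A brief numerical check (a sketch over a grid of precision and sensitivity values in (0, 1]) confirms that the harmonic mean never leaves that interval:

#evaluate the F1 formula on a grid of precision/sensitivity values
grid <- expand.grid(p = seq(0.01, 1, by = 0.01), s = seq(0.01, 1, by = 0.01))
f1 <- with(grid, 2 * p * s / (p + s))
range(f1) #both endpoints fall inside [0, 1]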
Question 10
Write a function that generates an ROC curve from a data set with a true classification column (class in our example) and a probability column (scored.probability in our example). Your function should return a list that includes the plot of the ROC curve and a vector that contains the calculated area under the curve (AUC). Note that I recommend using a sequence of thresholds ranging from 0 to 1 at 0.01 intervals.
Let’s create an R function that generates an ROC curve from the data set:
roc_function <- function(d){
  #Count the positive and negative observations
  allPos <- sum(d$class == 1, na.rm=TRUE)
  allNeg <- sum(d$class == 0, na.rm=TRUE)
  #Set thresholds from 0 to 1 at 0.01 intervals
  threshold <- seq(0, 1, 0.01)
  #Calculate sensitivity (y) and 1-specificity (x) at each threshold
  x <- c()
  y <- c()
  for (i in 1:length(threshold)) {
    TP <- sum(d$scored.probability >= threshold[i] & d$class == 1, na.rm=TRUE)
    TN <- sum(d$scored.probability < threshold[i] & d$class == 0, na.rm=TRUE)
    y[i] <- TP / allPos
    x[i] <- 1 - TN / allNeg
  }
  rocPlot <- plot(x, y, type = "s", xlim=c(-0.5,1.5),
                  main = "ROC Curve from function",
                  xlab = "1-Specificity",
                  ylab = "Sensitivity")
  fPlot <- abline(0,1); fPlot
  #Approximate the AUC as a sum of rectangles over the threshold grid
  xd <- c(0, abs(diff(x)))
  fAuc <- sum(xd*y); fAuc
  print(paste0("Area under the curve: ", fAuc))
}

Let’s call the function on our data:

roc_function(data)

## [1] "Area under the curve: 0.843803056027165"
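As a cross-check on the rectangle approximation used inside roc_function(), the sketch below recomputes the AUC with the trapezoidal rule over the same 0.01 threshold grid, using only the columns already described above:

#sensitivity and false positive rate at each threshold, as in roc_function()
thresholds <- seq(0, 1, 0.01)
allPos <- sum(data$class == 1)
allNeg <- sum(data$class == 0)
tpr <- sapply(thresholds, function(t) sum(data$scored.probability >= t & data$class == 1) / allPos)
fpr <- sapply(thresholds, function(t) sum(data$scored.probability >= t & data$class == 0) / allNeg)
#order the points by increasing FPR and integrate with trapezoids
o <- order(fpr)
sum(diff(fpr[o]) * (head(tpr[o], -1) + tail(tpr[o], -1)) / 2)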
Question 11
Use your created R functions and the provided classification output data set to produce all of the classification metrics discussed above.
The classification metrics computed with the R functions created above are as follows:
classification_metrics <-
c(get_accuracy(data),
get_classification_error(data),
get_precision(data),
get_sensitivity(data),
get_specificity(data),
get_F1_score(data))
names(classification_metrics) <- c("Accuracy", "Classification Error", "Precision",
"Sensitivity", "Specificity", "F1 Score")
classification_metrics<-as.data.frame(classification_metrics)
names(classification_metrics)[1]<-'Scores'
kable(classification_metrics)

|  | Scores |
|---|---|
| Accuracy | 0.8066298 |
| Classification Error | 0.1933702 |
| Precision | 0.7986577 |
| Sensitivity | 0.9596774 |
| Specificity | 0.4736842 |
| F1 Score | 0.8717949 |
Question 12
Investigate the caret package. In particular, consider the functions confusionMatrix, sensitivity, and specificity. Apply the functions to the data set. How do the results compare with your own functions?
We have already compared our function output with the caret package for each part. Our functions’ output exactly matches the caret package results.
classification_metrics_Caret <- confusionMatrix(data = as.factor(data$scored.class), reference = as.factor(data$class), positive = '0')
classification_metrics_Caret

## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 119 30
## 1 5 27
##
## Accuracy : 0.8066
## 95% CI : (0.7415, 0.8615)
## No Information Rate : 0.6851
## P-Value [Acc > NIR] : 0.0001712
##
## Kappa : 0.4916
##
## Mcnemar's Test P-Value : 4.976e-05
##
## Sensitivity : 0.9597
## Specificity : 0.4737
## Pos Pred Value : 0.7987
## Neg Pred Value : 0.8438
## Prevalence : 0.6851
## Detection Rate : 0.6575
## Detection Prevalence : 0.8232
## Balanced Accuracy : 0.7167
##
## 'Positive' Class : 0
##
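For a direct side-by-side comparison, the named metrics can be pulled out of the confusionMatrix object and placed next to our own results (a minimal sketch using the objects defined above; recall that caret reports precision as Pos Pred Value here):

#compare caret's metrics with the hand-written functions
data.frame(
  caret = c(classification_metrics_Caret$overall["Accuracy"],
            classification_metrics_Caret$byClass["Sensitivity"],
            classification_metrics_Caret$byClass["Specificity"],
            classification_metrics_Caret$byClass["Pos Pred Value"],
            classification_metrics_Caret$byClass["F1"]),
  ours  = c(get_accuracy(data), get_sensitivity(data), get_specificity(data),
            get_precision(data), get_F1_score(data))
)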
Question 13
Investigate the pROC package. Use it to generate an ROC curve for the data set. How do the results compare with your own functions?
#Generate the ROC curve object with pROC
rCurve <- roc(data$class, data$scored.probability, levels=c(1,0), direction=">")

Area under the curve:

auc(rCurve)

## Area under the curve: 0.8503
Confidence interval for the curve:

ci(rCurve)

## 95% CI: 0.7905-0.9101 (DeLong)
Let us compare the ROC curve from the pROC package to the one we generated for Question 10. The graphs look the same, but our function gives an area under the curve of 0.8438 compared to 0.8503 from pROC. The small difference is expected: our function approximates the area with rectangles over a fixed grid of 0.01 thresholds, while pROC uses every observed probability as a threshold and a trapezoidal calculation.

plot(rCurve, main="ROC Curve from pRoc", legacy.axes = TRUE, print.auc=TRUE)

roc_function(data)

## [1] "Area under the curve: 0.843803056027165"
References
- https://www.kaggle.com/kumargh/pimaindiansdiabetescsv
- https://developers.google.com/machine-learning/crash-course/classification/true-false-positive-negative
- https://en.wikipedia.org/wiki/Confusion_matrix
- https://rdrr.io/cran/caret/man/sensitivity.html
- https://cran.r-project.org/web/packages/pROC/index.html
- https://stackoverflow.com/questions/41056896/proc-changing-scale-of-the-roc-chart