Download the classification output data set.
df <- read.csv("https://raw.githubusercontent.com/che10vek/Data621/master/classification-output-data.csv")
kable(head(df,10), booktabs = T)

| pregnant | glucose | diastolic | skinfold | insulin | bmi | pedigree | age | class | scored.class | scored.probability |
|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 124 | 70 | 33 | 215 | 25.5 | 0.161 | 37 | 0 | 0 | 0.3284523 |
| 2 | 122 | 76 | 27 | 200 | 35.9 | 0.483 | 26 | 0 | 0 | 0.2731904 |
| 3 | 107 | 62 | 13 | 48 | 22.9 | 0.678 | 23 | 1 | 0 | 0.1096604 |
| 1 | 91 | 64 | 24 | 0 | 29.2 | 0.192 | 21 | 0 | 0 | 0.0559984 |
| 4 | 83 | 86 | 19 | 0 | 29.3 | 0.317 | 34 | 0 | 0 | 0.1004907 |
| 1 | 100 | 74 | 12 | 46 | 19.5 | 0.149 | 28 | 0 | 0 | 0.0551546 |
| 9 | 89 | 62 | 0 | 0 | 22.5 | 0.142 | 33 | 0 | 0 | 0.1071154 |
| 8 | 120 | 78 | 0 | 0 | 25.0 | 0.409 | 64 | 0 | 0 | 0.4599474 |
| 1 | 79 | 60 | 42 | 48 | 43.5 | 0.678 | 23 | 0 | 0 | 0.1170237 |
| 2 | 123 | 48 | 32 | 165 | 42.1 | 0.520 | 26 | 0 | 0 | 0.3153632 |
The data set has three key columns we will use: class (the actual class of the observation), scored.class (the predicted class from the model), and scored.probability (the model's predicted probability that the observation belongs to class 1).
Use the table() function to get the raw confusion matrix for this scored dataset. Make sure you understand the output. In particular, do the rows represent the actual or predicted class? The columns?
cmtable<-table(df$class, df$scored.class)
cmtable
##    
##       0   1
##   0 119   5
##   1  30  27
Interpreting the output of this confusion matrix: the rows represent actual values and the columns represent predicted values. Let's rename the rows and columns to make this clearer.
colnames(cmtable) <- c("Predicted No", "Predicted Yes")
rownames(cmtable) <- c("Actual No", "Actual Yes")
kable(cmtable, booktabs = T)

|            | Predicted No | Predicted Yes |
|---|---|---|
| Actual No  | 119 | 5 |
| Actual Yes | 30 | 27 |
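Every metric below is built from the same four cells of table(predicted, actual). The small helper sketched here (the name get_confusion_counts is ours; the functions that follow inline the same indexing rather than calling it) makes the orientation explicit:
# Sketch of a helper returning the four confusion-matrix cells, assuming
# table(predicted, actual): rows = predicted class, columns = actual class,
# with levels ordered 0 then 1.
get_confusion_counts <- function(actual, predicted){
  cm <- as.matrix(table(predicted, actual))
  list(TN = cm[1, 1],  # predicted 0, actual 0
       FN = cm[1, 2],  # predicted 0, actual 1
       FP = cm[2, 1],  # predicted 1, actual 0
       TP = cm[2, 2])  # predicted 1, actual 1
}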
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the accuracy of the predictions.
get_accuracy <- function(actual, predicted){
  cm <- as.matrix(table(predicted, actual))  # rows = predicted, columns = actual
  TN <- cm[1, 1]
  FN <- cm[1, 2]
  FP <- cm[2, 1]
  TP <- cm[2, 2]
  return((TP + TN) / (TN + FN + TP + FP))
}
get_accuracy(df$class, df$scored.class)
## [1] 0.8066298
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the classification error rate of the predictions.
get_class_err_rate <- function(actual, predicted){
  cm <- as.matrix(table(predicted, actual))
  TN <- cm[1, 1]
  FN <- cm[1, 2]
  FP <- cm[2, 1]
  TP <- cm[2, 2]
  return((FP + FN) / (TN + FN + TP + FP))
}
get_class_err_rate(df$class, df$scored.class)
## [1] 0.1933702
Verify that you get an accuracy and an error rate that sum to one.
Note: accuracy is most useful when the data set is balanced, meaning it has roughly equal numbers of observations in each class. If the data set is imbalanced, we should also use other metrics such as precision, recall, and the F1 score.
get_accuracy(df$class, df$scored.class) + get_class_err_rate(df$class, df$scored.class)
## [1] 1
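As a quick check of how balanced this particular data set is (the counts come straight from the confusion matrix above: 124 negatives versus 57 positives, roughly a 69/31 split), we can tabulate the actual class:
# Raw counts and proportions of the actual class: 124 zeros vs. 57 ones,
# so the data set is moderately imbalanced and accuracy alone can mislead.
table(df$class)
prop.table(table(df$class))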
Question 5 - Precision
\(Precision = \frac{TP}{TP + FP}\)
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the precision of the predictions.
Note: the false positive rate of the classifier corresponds to Type-1 error.
\(False Positive Rate = \frac{FP}{FP + TN}\)
Example of a Type-1 error: a spam detection model flags a legitimate email as spam; the impact is that the customer misses that email.
In such use cases we want to keep false positives down, and precision is the metric we track for this: the closer precision is to 1, the smaller the share of predicted positives that are actually false positives.
get_precision <- function(actual, predicted){
  cm <- as.matrix(table(predicted, actual))
  TN <- cm[1, 1]
  FN <- cm[1, 2]
  FP <- cm[2, 1]
  TP <- cm[2, 2]
  return(TP / (TP + FP))
}
prec <- get_precision(df$class, df$scored.class)
prec
## [1] 0.84375
Question 6 - Sensitivity (Recall)
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the sensitivity of the predictions. Sensitivity is also known as recall.
\(Sensitivity = \frac{TP}{TP + FN}\)
Note: sensitivity (recall) tracks the false negative rate of the classifier, which corresponds to Type-2 error.
\(False Negative Rate = \frac{FN}{FN + TP}\)
Example of a Type-2 error: a patient actually has cancer, but the model predicts that they do not; the impact is severe, because the patient may go untreated.
In such use cases we want to keep false negatives down, and recall (sensitivity) is the metric we track for this: the closer sensitivity is to 1, the fewer actual positives the classifier misses.
The goal is usually to reduce both Type-1 and Type-2 errors, so there is a combined metric, the F-beta score (sketched below), where the value of beta is chosen according to how much tolerance we have for false positives versus false negatives.
The F1 score is the special case where beta equals 1.
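For reference, a minimal sketch of a general F-beta score (the helper name get_fbeta_score is ours, not part of the assignment); it reduces to the F1 score of Question 8 when beta = 1.
# General F-beta score (sketch): beta > 1 weights recall (Type-2 errors) more
# heavily, beta < 1 weights precision (Type-1 errors) more heavily.
get_fbeta_score <- function(prec, recall, beta = 1){
  return(((1 + beta^2) * prec * recall) / (beta^2 * prec + recall))
}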
get_sensitivity <- function(actual, predicted){
  cm <- as.matrix(table(predicted, actual))
  TN <- cm[1, 1]
  FN <- cm[1, 2]
  FP <- cm[2, 1]
  TP <- cm[2, 2]
  return(TP / (TP + FN))
}
recall <- get_sensitivity(df$class, df$scored.class)
recall
## [1] 0.4736842
Question 7 - Specificity
\(Specificity = \frac{TN}{TN + FP} = 1 - \frac{FP}{FP + TN}\)
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the specificity of the predictions.
get_specificity <- function(actual, predicted){
  cm <- as.matrix(table(predicted, actual))
  TN <- cm[1, 1]
  FN <- cm[1, 2]
  FP <- cm[2, 1]
  TP <- cm[2, 2]
  return(1 - (FP / (FP + TN)))
}
get_specificity(df$class, df$scored.class)
## [1] 0.9596774
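Since the false positive rate is simply the complement of specificity, the Type-1 error rate from Question 5 can be recovered with a one-line check (for this data it should equal FP / (FP + TN) = 5 / 124):
# False positive rate = FP / (FP + TN) = 1 - specificity
1 - get_specificity(df$class, df$scored.class)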
Question 8 - F1 Score
\(F1 Score = \frac{2 \cdot Precision \cdot Sensitivity}{Precision + Sensitivity}\)
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the F1 score of the predictions.
get_f1score <- function(prec, recall){
  return(2 * prec * recall / (prec + recall))
}
get_f1score(prec, recall)
## [1] 0.6067416
Question 9:
Before we move on, let’s consider a question that was asked: What are the bounds on the F1 score? Show that the F1 score will always be between 0 and 1.
As we know, there are two types of error: Type-1 and Type-2.
Type-1 errors are false positives, which are captured by the metric precision.
\(Precision = \frac{TP}{TP + FP}\)
Type-2 errors are false negatives, which are captured by the metric recall (sensitivity).
\(Sensitivity = \frac{TP}{TP + FN}\)
Our goal is to reduce both Type-1 and Type-2 errors.
When FP is 0, precision equals 1; when FN is 0, recall equals 1. So the maximum value of both precision and recall is 1.
When TP is 0, there are no true positive cases, so precision and recall are both 0; by convention the F1 score is also taken to be 0 in this case (the ratio would otherwise be 0/0). So the minimum value of both precision and recall is 0.
\(F1 Minimum = 0\) (when precision and recall are both 0)
\(F1 Maximum = \frac{2 \cdot 1 \cdot 1}{1 + 1} = 1\) (when precision and recall are both 1)
Hence the F1 score ranges from 0 to 1.
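The same bounds also follow without checking the extreme cases, since precision and sensitivity each lie in the interval \([0, 1]\):
\[
Precision \cdot Sensitivity \le Sensitivity
\quad\text{and}\quad
Precision \cdot Sensitivity \le Precision
\;\Longrightarrow\;
2 \cdot Precision \cdot Sensitivity \le Precision + Sensitivity,
\]
so the ratio defining the F1 score is at most 1, and it is non-negative because its numerator and denominator are both non-negative.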
Question 10: Write a function that generates an ROC curve from a data set with a true classification column (class in our example) and a probability column (scored.probability in our example). Your function should return a list that includes the plot of the ROC curve and a vector that contains the calculated area under the curve (AUC). Note that I recommend using a sequence of thresholds ranging from 0 to 1 at 0.01 intervals
library(ggplot2)
#ROC Curve
my_curve <- function(data) {
  thresholds <- seq(0, 1, by = 0.01)
  X <- c()  # true positive rate at each threshold
  Y <- c()  # false positive rate at each threshold
  for (i in 1:length(thresholds)){
    scored_class <- ifelse(data$scored.probability >= thresholds[i], 1, 0)
    cur_df <- data.frame(scored.class = scored_class, class = data$class)
    df_table <- with(cur_df, table(scored.class, class))
    # Cells in column-major order: [1] = TN, [2] = FP, [3] = FN, [4] = TP
    X[i] <- df_table[4] / (df_table[4] + df_table[3])   # TPR = TP / (TP + FN)
    Y[i] <- df_table[2] / (df_table[2] + df_table[1])   # FPR = FP / (FP + TN)
  }
  plot_df <- data.frame(TRUE_POSITIVE = X, FALSE_POSITIVE = Y)
  # Thresholds where only one class is predicted produce NAs; drop them
  # before roughly calculating the area under the curve.
  plot_df <- na.omit(plot_df)
  ROC_plot <- ggplot(plot_df, aes(x = FALSE_POSITIVE, y = TRUE_POSITIVE)) +
    geom_line(col = rgb(0.2, 0.4, 0.1, 0.7), lwd = 1) +
    geom_abline(intercept = 0, slope = 1) +
    labs(title = "ROC Plot", x = "False Positive Rate", y = "True Positive Rate")
  # Rough rectangular approximation of the AUC: the widths (diffs of the FPR)
  # are one element shorter than the heights (TPR), so this is only approximate.
  x <- abs(diff(plot_df$FALSE_POSITIVE))
  y <- plot_df$TRUE_POSITIVE
  area_under_curve <- sum(x * y)
  # An alternative styling of the plot, with a point at each threshold
  # (constructed but not returned).
  ROC_plot2 <- ggplot(plot_df, aes(x = FALSE_POSITIVE, y = TRUE_POSITIVE)) +
    ggtitle('ROC Curve') +
    geom_line(col = "blue") +
    geom_abline(intercept = 0, slope = 1) +
    geom_point() +
    xlab('False Positive Rate') +
    ylab('True Positive Rate')
  return(list(ROC_plot, area_under_curve))
}
result <- my_curve(df)
ROC_plot <- result[[1]]
ROC_curve <- result[[2]]
result
## [[1]]
##
## [[2]]
## [1] 0.8299377
Use your created R functions and the provided classification output data set to produce all of the classification metrics discussed above.
my_Classification <- data.frame('Accuracy' = get_accuracy(df$class, df$scored.class),
                                'Error rate' = get_class_err_rate(df$class, df$scored.class),
                                'Precision' = get_precision(df$class, df$scored.class),
                                'Sensitivity' = get_sensitivity(df$class, df$scored.class),
                                'Specificity' = get_specificity(df$class, df$scored.class),
                                'F_Score' = get_f1score(prec, recall)
                                )
kable(my_Classification, booktabs = T)

| Accuracy | Error.rate | Precision | Sensitivity | Specificity | F_Score |
|---|---|---|---|---|---|
| 0.8066298 | 0.1933702 | 0.84375 | 0.4736842 | 0.9596774 | 0.6067416 |
Investigate the caret package. In particular, consider the functions confusionMatrix, sensitivity, and specificity. Apply the functions to the data set.
Note: caret's confusionMatrix() treats the first factor level as the positive class (a positive = argument can also be supplied explicitly), so we re-level our variables so that "Yes" (class 1) comes first before passing them into the functions.
truth <- factor(df$class)
levels(truth) = list("Yes" = 1, "No" = 0)
pred <- factor(df$scored.class)
levels(pred) = list("Yes" = 1, "No" = 0)
confusionMatrix(pred, truth)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Yes No
## Yes 27 5
## No 30 119
##
## Accuracy : 0.8066
## 95% CI : (0.7415, 0.8615)
## No Information Rate : 0.6851
## P-Value [Acc > NIR] : 0.0001712
##
## Kappa : 0.4916
##
## Mcnemar's Test P-Value : 0.00004976
##
## Sensitivity : 0.4737
## Specificity : 0.9597
## Pos Pred Value : 0.8438
## Neg Pred Value : 0.7987
## Prevalence : 0.3149
## Detection Rate : 0.1492
## Detection Prevalence : 0.1768
## Balanced Accuracy : 0.7167
##
## 'Positive' Class : Yes
##
sensitivity(pred, truth)
## [1] 0.4736842
specificity(pred, truth)
## [1] 0.9596774
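caret also exposes precision directly as posPredValue() (with negPredValue() as its counterpart); as a further cross-check, the call below should reproduce the Pos Pred Value of 0.8438 reported by confusionMatrix() above.
# Positive predictive value (precision) from caret; should match the
# Pos Pred Value line in the confusionMatrix() output above (about 0.8438).
posPredValue(pred, truth)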
We can see above that the caret functions give the same results as our manually created functions.
Investigate the pROC package. Use it to generate an ROC curve for the data set. How do the results compare with your own functions?
roc_obj <- roc(df$class, df$scored.probability)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_obj)
print(roc_obj)
##
## Call:
## roc.default(response = df$class, predictor = df$scored.probability)
##
## Data: df$scored.probability in 124 controls (df$class 0) < 57 cases (df$class 1).
## Area under the curve: 0.8503
We can see that the pROC AUC (0.8503) is close to the value from our manually created function (0.83); the small difference comes from the coarse rectangular approximation we used for the area.
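Most of the remaining gap between our rough AUC and pROC's comes from the rectangular sum used inside my_curve. A sketch of a trapezoidal alternative over the same threshold grid (the helper name trapezoid_auc is ours) typically lands closer to the exact area of the empirical ROC step curve:
# Trapezoidal AUC sketch: recompute TPR/FPR over the same 0.01 threshold grid
# and integrate with the trapezoid rule instead of right-hand rectangles.
trapezoid_auc <- function(data, thresholds = seq(0, 1, by = 0.01)){
  tpr <- fpr <- numeric(length(thresholds))
  for (i in seq_along(thresholds)){
    predicted <- as.integer(data$scored.probability >= thresholds[i])
    tpr[i] <- sum(predicted == 1 & data$class == 1) / sum(data$class == 1)
    fpr[i] <- sum(predicted == 1 & data$class == 0) / sum(data$class == 0)
  }
  o <- order(fpr, tpr)  # trace the curve from left to right
  fpr <- fpr[o]; tpr <- tpr[o]
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
}
trapezoid_auc(df)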
We can also draw the ROC curve with the ROCR package, which makes it easy to annotate the curve with the probability cutoffs.
#install.packages("ROCR")
library(ROCR)
## Warning: package 'ROCR' was built under R version 3.6.3
## Loading required package: gplots
## Warning: package 'gplots' was built under R version 3.6.3
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
pred <- prediction(df$scored.probability, df$class)
roc.perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(roc.perf, colorize=TRUE , print.cutoffs.at=seq(0,1,0.05), text.adj=c(-.2,1))