Download the classification output data set.
df <- read.csv("https://raw.githubusercontent.com/che10vek/Data621/master/classification-output-data.csv")
head(df,10)
## pregnant glucose diastolic skinfold insulin bmi pedigree age class
## 1 7 124 70 33 215 25.5 0.161 37 0
## 2 2 122 76 27 200 35.9 0.483 26 0
## 3 3 107 62 13 48 22.9 0.678 23 1
## 4 1 91 64 24 0 29.2 0.192 21 0
## 5 4 83 86 19 0 29.3 0.317 34 0
## 6 1 100 74 12 46 19.5 0.149 28 0
## 7 9 89 62 0 0 22.5 0.142 33 0
## 8 8 120 78 0 0 25.0 0.409 64 0
## 9 1 79 60 42 48 43.5 0.678 23 0
## 10 2 123 48 32 165 42.1 0.520 26 0
## scored.class scored.probability
## 1 0 0.32845226
## 2 0 0.27319044
## 3 0 0.10966039
## 4 0 0.05599835
## 5 0 0.10049072
## 6 0 0.05515460
## 7 0 0.10711542
## 8 0 0.45994744
## 9 0 0.11702368
## 10 0 0.31536320
The data set has three key columns we will use:
- class: the actual class for the observation
- scored.class: the predicted class for the observation (based on a threshold of 0.5)
- scored.probability: the predicted probability of success for the observation
Use the table() function to get the raw confusion matrix for this scored dataset. Make sure you understand the output. In particular, do the rows represent the actual or predicted class? The columns?
cmtable<-table(df$class, df$scored.class)
cmtable
##
## 0 1
## 0 119 5
## 1 30 27
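Interpreting the output of this confusion matrix: the rows represent actual values and the columns represent predicted values, because table() puts its first argument (df$class) on the rows and its second (df$scored.class) on the columns. Let’s rename the rows and columns to make this clearer.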
colnames(cmtable) <- c("Predicted No", "Predicted Yes")
rownames(cmtable) <- c("Actual No", "Actual Yes")
cmtable
##
## Predicted No Predicted Yes
## Actual No 119 5
## Actual Yes 30 27
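An alternative to renaming after the fact is to label the two dimensions when the table is built. The sketch below is only illustrative: instead of downloading the file, it rebuilds stand-in `actual` and `predicted` vectors from the four cell counts shown above, and it uses the `dnn` argument of table() to name the axes.

```r
# Stand-ins for df$class and df$scored.class, rebuilt from the cell counts
# in the confusion matrix above (119, 5, 30, 27). Assumption: only the counts
# matter here, not the original row order.
actual    <- c(rep(0, 119), rep(0, 5), rep(1, 30), rep(1, 27))
predicted <- c(rep(0, 119), rep(1, 5), rep(0, 30), rep(1, 27))

# dnn names the dimensions, so the printout states which axis is which.
cm <- table(actual, predicted, dnn = c("Actual", "Predicted"))
cm
```

With the dimensions named, the printed table carries "Actual" and "Predicted" headers, so the row and column question does not depend on remembering the argument order.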
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the accuracy of the predictions.
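get_accuracy <- function(actual, predicted)
{
  # Rows are predicted classes and columns are actual classes (the transpose of the table above)
  cmtable <- as.matrix(table(predicted, actual))
  TN <- cmtable[1,1]  # predicted 0, actual 0
  FN <- cmtable[1,2]  # predicted 0, actual 1
  FP <- cmtable[2,1]  # predicted 1, actual 0
  TP <- cmtable[2,2]  # predicted 1, actual 1
  return ((TP + TN) / (TN + FN + TP + FP))
}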
get_accuracy(df$class, df$scored.class)
## [1] 0.8066298
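As a cross-check, accuracy is simply the share of observations where the predicted class equals the actual class, so it can be computed without a confusion matrix. This is a minimal sketch rather than a replacement for the function above; the `actual` and `predicted` vectors are stand-ins rebuilt from the confusion matrix counts (124 actual negatives, 57 actual positives), and `acc_check` is a name introduced here for illustration.

```r
# Stand-ins for df$class and df$scored.class, rebuilt from the confusion
# matrix counts: 119 TN, 5 FP, 30 FN, 27 TP.
actual    <- c(rep(0, 124), rep(1, 57))
predicted <- c(rep(0, 119), rep(1, 5), rep(0, 30), rep(1, 27))

# The mean of a logical vector is the proportion of TRUE values,
# i.e. (TP + TN) / (TP + TN + FP + FN).
acc_check <- mean(actual == predicted)
round(acc_check, 7)
```

The result is 146/181, which agrees with the 0.8066298 returned by get_accuracy().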
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the classification error rate of the predictions.
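get_class_err_rate <- function(actual, predicted)
{
  # Rows are predicted classes and columns are actual classes (the transpose of the table above)
  cmtable <- as.matrix(table(predicted, actual))
  TN <- cmtable[1,1]  # predicted 0, actual 0
  FN <- cmtable[1,2]  # predicted 0, actual 1
  FP <- cmtable[2,1]  # predicted 1, actual 0
  TP <- cmtable[2,2]  # predicted 1, actual 1
  return ((FP + FN) / (TN + FN + TP + FP))
}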
get_class_err_rate(df$class, df$scored.class)
## [1] 0.1933702
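One caveat with indexing the table by position: if the predictions contain only one class, table() returns a single row and `cmtable[2,1]` fails with a subscript error. A possible guard, sketched below, is to convert both vectors to factors with fixed levels so the table is always 2 x 2. The helper name `confusion_counts` and its `levels` argument are hypothetical and not part of the original assignment.

```r
# Hypothetical helper: always returns the four cell counts, even when one
# class never appears in the actual or predicted values.
confusion_counts <- function(actual, predicted, levels = c(0, 1)) {
  # Fixed factor levels force a 2 x 2 table; absent classes get count 0.
  cm <- table(factor(predicted, levels = levels),
              factor(actual, levels = levels))
  c(TN = cm[1, 1], FN = cm[1, 2], FP = cm[2, 1], TP = cm[2, 2])
}

# Toy example where the model predicts class 0 for every observation.
counts <- confusion_counts(actual = c(0, 1, 1), predicted = c(0, 0, 0))
err_rate <- (counts[["FP"]] + counts[["FN"]]) / sum(counts)
err_rate
```

In the toy example the two actual positives are both missed, so the error rate is 2/3, and the function returns zeros for FP and TP instead of stopping with an error.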
Verify that the accuracy and the error rate sum to one.
get_accuracy(df$class, df$scored.class) + get_class_err_rate(df$class, df$scored.class)
## [1] 1
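The printed 1 is a rounded display of a floating-point sum, so a programmatic check is safer with all.equal(), which compares within a small tolerance, than with `==`. The sketch below again uses stand-in vectors rebuilt from the confusion matrix counts; `accuracy`, `error_rate`, and `sums_to_one` are names introduced here for illustration.

```r
# Stand-ins for df$class and df$scored.class, rebuilt from the confusion
# matrix counts: 119 TN, 5 FP, 30 FN, 27 TP.
actual    <- c(rep(0, 124), rep(1, 57))
predicted <- c(rep(0, 119), rep(1, 5), rep(0, 30), rep(1, 27))

accuracy   <- mean(actual == predicted)  # share of correct predictions
error_rate <- mean(actual != predicted)  # share of incorrect predictions

# all.equal() tolerates tiny floating-point differences; isTRUE() turns its
# result into a plain TRUE/FALSE.
sums_to_one <- isTRUE(all.equal(accuracy + error_rate, 1))
sums_to_one
```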