library(data.table)
Download the classification output data set (attached to the assignment in Blackboard).
Data = fread("https://raw.githubusercontent.com/chrisestevez/DataAnalytics/master/Data/classification-output-data.csv", select = c("class","scored.class","scored.probability"))
The data set has three key columns we will use:
class: the actual class for the observation
scored.class: the predicted class for the observation (based on a threshold of 0.5)
scored.probability: the predicted probability of success for the observation
In the raw confusion matrix produced by table() below, the rows represent the predicted class (scored.class) and the columns represent the actual class (class).
str(Data)
## Classes 'data.table' and 'data.frame': 181 obs. of 3 variables:
## $ class : int 0 0 1 0 0 0 0 0 0 0 ...
## $ scored.class : int 0 0 0 0 0 0 0 0 0 0 ...
## $ scored.probability: num 0.328 0.273 0.11 0.056 0.1 ...
## - attr(*, ".internal.selfref")=<externalptr>
summary(Data)
## class scored.class scored.probability
## Min. :0.0000 Min. :0.0000 Min. :0.02323
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.11702
## Median :0.0000 Median :0.0000 Median :0.23999
## Mean :0.3149 Mean :0.1768 Mean :0.30373
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.43093
## Max. :1.0000 Max. :1.0000 Max. :0.94633
table(Data$scored.class,Data$class)
##
## 0 1
## 0 119 30
## 1 5 27
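To make the orientation concrete, here is a minimal sketch (using the 0/1 coding above) that labels the four cells of the confusion matrix; the name cm is only for illustration.
cm = table(Predicted = Data$scored.class, Actual = Data$class)
TN = cm["0","0"]  # actual 0, predicted 0 (119)
FP = cm["1","0"]  # actual 0, predicted 1 (5)
FN = cm["0","1"]  # actual 1, predicted 0 (30)
TP = cm["1","1"]  # actual 1, predicted 1 (27)
c(TN = TN, FP = FP, FN = FN, TP = TP)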
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the accuracy of the predictions. \[Accuracy=\frac{TP+TN}{TP+FP+TN+FN}\]
Accuracy_ = function(input) {
  # Confusion matrix with the actual class in rows and the predicted class in columns
  tb = table(input$class, input$scored.class)
  TN = tb[1,1]
  TP = tb[2,2]
  FN = tb[2,1]
  FP = tb[1,2]
  return((TP+TN)/(TP+FP+TN+FN))
}
Testing the Accuracy function yields 0.8066298.
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the classification error rate of the predictions.
\[\text{Classification Error Rate}=\frac{FP+FN}{TP+FP+TN+FN}\]
CER_ = function(input) {
  tb = table(input$class, input$scored.class)
  TN = tb[1,1]
  TP = tb[2,2]
  FN = tb[2,1]
  FP = tb[1,2]
  return((FP+FN)/(TP+FP+TN+FN))
}
Testing the Classification Error Rate function yields 0.1933702.
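As a quick sanity check (a one-liner using the two functions above), the accuracy and the classification error rate sum to one:
Accuracy_(Data) + CER_(Data)  # 0.8066298 + 0.1933702 = 1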
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the precision of the predictions. \[Precision=\frac{TP}{TP+FP}\]
Precision_ = function(input) {
  tb = table(input$class, input$scored.class)
  TP = tb[2,2]
  FP = tb[1,2]
  return((TP)/(TP+FP))
}
Testing the Precision function yields 0.84375.
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the sensitivity of the predictions. Sensitivity is also known as recall. \[Sensitivity=\frac{TP}{TP+FN}\]
Sensitivity_ = function(input) {
  tb = table(input$class, input$scored.class)
  TP = tb[2,2]
  FN = tb[2,1]
  return((TP)/(TP+FN))
}
Testing the Sensitivity function yields 0.4736842.
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the specificity of the predictions. \[Specificity=\frac{TN}{TN+FP}\]
Specificity_ = function(input) {
  tb = table(input$class, input$scored.class)
  TN = tb[1,1]
  FP = tb[1,2]
  return((TN)/(TN+FP))
}
Testing the Specificity function yields 0.9596774.
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the F1 score of the predictions. \[\text{F1 Score}=\frac{2 \cdot Precision \cdot Sensitivity}{Precision+Sensitivity}\]
F1_ = function(input) {
  tb = table(input$class, input$scored.class)
  TP = tb[2,2]
  FN = tb[2,1]
  FP = tb[1,2]
  Precision = (TP)/(TP+FP)
  Sensitivity = (TP)/(TP+FN)
  return((2*Precision*Sensitivity)/(Precision+Sensitivity))
}
Testing the F1 Score function yields 0.6067416.
Before we move on, let’s consider a question that was asked: What are the bounds on the F1 score? Show that the F1 score will always be between 0 and 1.
Both quantities used to calculate F1 are bounded between 0 and 1: precision and sensitivity are each a ratio of TP to TP plus a non-negative count, so each lies in [0, 1], and the F1 score, their harmonic mean, inherits these bounds.
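Writing P for precision and S for sensitivity makes the bound explicit: since both lie in [0, 1], PS is at most P and at most S, so \[0 \le F1=\frac{2 \cdot P \cdot S}{P+S} \le \frac{P+S}{P+S}=1,\] with F1 undefined in the degenerate case P + S = 0.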
Write a function that generates an ROC curve from a data set with a true classification column (class in our example) and a probability column (scored.probability in our example), and displays the calculated area under the curve (AUC).
MY_ROC = function(labels, scores){
  # Sort the true labels by predicted probability, highest first
  labels = labels[order(scores, decreasing=TRUE)]
  # Cumulative true-positive and false-positive rates as the threshold drops
  result = data.frame(TPR=cumsum(labels)/sum(labels), FPR=cumsum(!labels)/sum(!labels), labels)
  dFPR = c(diff(result$FPR), 0)
  dTPR = c(diff(result$TPR), 0)
  # Area under the curve via a rectangle-plus-triangle (trapezoidal) approximation
  AUC = round(sum(result$TPR * dFPR) + sum(dTPR * dFPR)/2, 4)
  plot(result$FPR, result$TPR, type="l", main="ROC Curve", ylab="Sensitivity", xlab="1-Specificity")
  abline(a=0, b=1)
  legend(.6, .2, AUC, title="AUC")
}
MY_ROC(Data$class,Data$scored.probability)
Use your created R functions and the provided classification output data set to produce all of the classification metrics discussed above.
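A minimal sketch of the calls, using the functions defined above on the loaded data set:
Accuracy_(Data)
CER_(Data)
Precision_(Data)
Sensitivity_(Data)
Specificity_(Data)
F1_(Data)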
Accuracy: 0.8066298
Classification Error Rate is: 0.1933702
Precision is: 0.84375
Sensitivity is: 0.4736842
Specificity is: 0.9596774
F1 Score is: 0.6067416
Investigate the caret package. In particular, consider the functions confusionMatrix, sensitivity, and specificity. Apply the functions to the data set. How do the results compare with your own functions?
The results from my functions match the values reported by the caret package.
library("caret")
car = confusionMatrix(factor(Data$scored.class), factor(Data$class), positive = '1')
# print the confusion matrix (newer versions of caret require factor inputs)
car$table
## Reference
## Prediction 0 1
## 0 119 30
## 1 5 27
car$byClass
## Sensitivity Specificity Pos Pred Value
## 0.4736842 0.9596774 0.8437500
## Neg Pred Value Precision Recall
## 0.7986577 0.8437500 0.4736842
## F1 Prevalence Detection Rate
## 0.6067416 0.3149171 0.1491713
## Detection Prevalence Balanced Accuracy
## 0.1767956 0.7166808
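For a direct comparison with Accuracy_(Data), the overall accuracy can also be pulled from the confusionMatrix object:
car$overall["Accuracy"]  # 0.8066298, matching Accuracy_(Data)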
For comparison, my own functions return the same values:
Accuracy: 0.8066298
Classification Error Rate is: 0.1933702
Precision is: 0.84375
Sensitivity is: 0.4736842
Specificity is: 0.9596774
F1 Score is: 0.6067416
Investigate the pROC package. Use it to generate an ROC curve for the data set. How do the results compare with your own functions? The ROC curve and AUC produced by my MY_ROC function closely match the output of the pROC package.
library("pROC")
par(mfrow=c(1,2))
plot(roc(Data$class,Data$scored.probability),print.auc=TRUE)
MY_ROC(Data$class,Data$scored.probability)
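If only the area under the curve is needed, it can be extracted directly from the pROC object; a minimal sketch:
roc_obj = roc(Data$class, Data$scored.probability)  # same ROC object as plotted above
auc(roc_obj)                                        # numeric AUC reported by pROC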