Read in the classification data set and display the columns
# Read the data
df <- read.csv("https://raw.githubusercontent.com/L-Velasco/DATA621_FA18/master/HW2/classification-output-data.csv", stringsAsFactors = FALSE)
names(df)
## [1] "pregnant" "glucose" "diastolic"
## [4] "skinfold" "insulin" "bmi"
## [7] "pedigree" "age" "class"
## [10] "scored.class" "scored.probability"
The three columns needed for this assignment are class (the actual class), scored.class (the predicted class), and scored.probability (the predicted probability of the positive class).
The table function creates a confusion matrix displaying the actual class in columns and the predicted class in rows.
table(df$scored.class,df$class)
##
##       0   1
##   0 119  30
##   1   5  27
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the accuracy of the predictions.
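In terms of the confusion-matrix counts, where TP, TN, FP, and FN denote the true positive, true negative, false positive, and false negative counts:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$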
mAccuracy <- function(dframe, pclass, aclass){
  # Confusion matrix: predicted class (pclass) in rows, actual class (aclass) in columns
  ft <- table(dframe[ ,pclass], dframe[ ,aclass])
  # Extract the true positive, true negative, false positive, and false negative counts
  fTP <- ft[2,2]; fTN <- ft[1,1]; fFP <- ft[2,1]; fFN <- ft[1,2]
  A <- (fTP + fTN) / (fTP + fTN + fFP + fFN)
  return(A)
}
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the classification error rate of the predictions.
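Using the same counts:

$$\text{Classification Error Rate} = \frac{FP + FN}{TP + TN + FP + FN}$$

By construction, the error rate and the accuracy sum to one.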
mError <- function(dframe, pclass, aclass){
  ft <- table(dframe[ ,pclass], dframe[ ,aclass])
  fTP <- ft[2,2]; fTN <- ft[1,1]; fFP <- ft[2,1]; fFN <- ft[1,2]
  E <- (fFP + fFN) / (fTP + fTN + fFP + fFN)
  return(E)
}
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the precision of the predictions.
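Precision is the share of predicted positives that are actually positive:

$$\text{Precision} = \frac{TP}{TP + FP}$$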
mPrecision <- function(dframe, pclass, aclass){
  ft <- table(dframe[ ,pclass], dframe[ ,aclass])
  fTP <- ft[2,2]; fTN <- ft[1,1]; fFP <- ft[2,1]; fFN <- ft[1,2]
  P <- (fTP) / (fTP + fFP)
  return(P)
}
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the sensitivity of the predictions.
Sensitivity is also known as recall.
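Sensitivity is the share of actual positives that are correctly predicted:

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$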
mSensitivity <- function(dframe, pclass, aclass){
  ft <- table(dframe[ ,pclass], dframe[ ,aclass])
  fTP <- ft[2,2]; fTN <- ft[1,1]; fFP <- ft[2,1]; fFN <- ft[1,2]
  Se <- (fTP) / (fTP + fFN)
  return(Se)
}
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the specificity of the predictions.
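Specificity is the share of actual negatives that are correctly predicted:

$$\text{Specificity} = \frac{TN}{TN + FP}$$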
mSpecificity <- function(dframe, pclass, aclass){
  ft <- table(dframe[ ,pclass], dframe[ ,aclass])
  fTP <- ft[2,2]; fTN <- ft[1,1]; fFP <- ft[2,1]; fFN <- ft[1,2]
  Sp <- (fTN) / (fTN + fFP)
  return(Sp)
}
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the F1 score of the predictions.
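The F1 score is the harmonic mean of precision and sensitivity:

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}$$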
mF1_score <- function(dframe, pclass, aclass){
  ft <- table(dframe[ ,pclass], dframe[ ,aclass])
  fTP <- ft[2,2]; fTN <- ft[1,1]; fFP <- ft[2,1]; fFN <- ft[1,2]
  P <- (fTP) / (fTP + fFP)
  Se <- (fTP) / (fTP + fFN)
  FS <- (2 * P * Se) / (P + Se)
  return(FS)
}
What are the bounds on the F1 score? Show that the F1 score will always be between 0 and 1. (Hint: If 0 < 𝑎 < 1 and 0 < 𝑏 < 1 then 𝑎𝑏 < 𝑎.)
The F1 score is bounded between 0 and 1. Both precision and sensitivity (recall) are fractions between 0 and 1: when both are at their maximum of 1 the F1 score equals 1, and as either one approaches 0 the F1 score approaches 0. A short derivation using the hint is given below.
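Writing $P$ for precision and $S$ for sensitivity, the hint gives $PS \le P$ and $PS \le S$ whenever $0 < P \le 1$ and $0 < S \le 1$. Adding the two inequalities:

$$2PS \le P + S \quad\Longrightarrow\quad F_1 = \frac{2PS}{P + S} \le 1$$

Since $P > 0$ and $S > 0$, the numerator and denominator are both positive, so $F_1 > 0$. Hence $0 < F_1 \le 1$.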
Write a function that generates an ROC curve from a data set with a true classification column (class in our example) and a probability column (scored.probability in our example). Your function should return a list that includes the plot of the ROC curve and a vector that contains the calculated area under the curve (AUC).
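At each probability threshold $t$, the ROC curve plots the true positive rate against the false positive rate, where

$$\text{TPR}(t) = \frac{TP(t)}{TP(t) + FN(t)}, \qquad \text{FPR}(t) = \frac{FP(t)}{FP(t) + TN(t)} = 1 - \text{Specificity}(t)$$

and $TP(t)$, $FN(t)$, $FP(t)$, $TN(t)$ are the counts obtained by classifying an observation as positive when its scored probability is at least $t$.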
mRoc_curve <- function(dframe, pprob, aclass){
  # Totals of actual positives and negatives
  allPos <- sum(dframe[ ,aclass] == 1, na.rm=TRUE)
  allNeg <- sum(dframe[ ,aclass] == 0, na.rm=TRUE)
  threshold <- seq(0.0, 1.0, 0.01)
  x <- c()
  y <- c()
  for (i in 1:length(threshold)) {
    TP <- sum(dframe[ ,pprob] >= threshold[i] & dframe[ ,aclass] == 1, na.rm=TRUE)
    TN <- sum(dframe[ ,pprob] <  threshold[i] & dframe[ ,aclass] == 0, na.rm=TRUE)
    y[i] <- TP / allPos        # true positive rate at this threshold
    x[i] <- 1 - (TN / allNeg)  # false positive rate at this threshold
  }
  plot(x, y, type = "s",
       main = "ROC Curve",
       xlab = "False Positive Rate (1-Specificity)",
       ylab = "True Positive Rate")
  abline(0, 1)
  # Approximate the AUC with a rectangle sum over the FPR increments
  xd <- c(0, abs(diff(x)))
  fAuc <- sum(xd * y)
  return(fAuc)
}
mRoc_curve(df, "scored.probability", "class")
## [1] 0.8438031
Accuracy
mAccuracy(df, "scored.class", "class")
## [1] 0.8066298
Error
mError(df, "scored.class", "class")
## [1] 0.1933702
Precision
mPrecision(df, "scored.class", "class")
## [1] 0.84375
Sensitivity
mSensitivity(df, "scored.class", "class")
## [1] 0.4736842
Specificity
mSpecificity(df, "scored.class", "class")
## [1] 0.9596774
F1 Score
mF1_score(df, "scored.class", "class")
## [1] 0.6067416
Consider the caret functions confusionMatrix, sensitivity, and specificity. Apply these functions to the data set.
library(caret)
pclass <- factor(df$scored.class,levels = c(1,0))
aclass <- factor(df$class,levels = c(1,0))
confusionMatrix(pclass, aclass)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 0
## 1 27 5
## 0 30 119
##
## Accuracy : 0.8066
## 95% CI : (0.7415, 0.8615)
## No Information Rate : 0.6851
## P-Value [Acc > NIR] : 0.0001712
##
## Kappa : 0.4916
## Mcnemar's Test P-Value : 4.976e-05
##
## Sensitivity : 0.4737
## Specificity : 0.9597
## Pos Pred Value : 0.8438
## Neg Pred Value : 0.7987
## Prevalence : 0.3149
## Detection Rate : 0.1492
## Detection Prevalence : 0.1768
## Balanced Accuracy : 0.7167
##
## 'Positive' Class : 1
##
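The sensitivity and specificity functions can also be applied directly to the same factors; a minimal check (these should reproduce the Sensitivity and Specificity values reported above):

sensitivity(pclass, aclass)
specificity(pclass, aclass)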
Based on these results, the sensitivity and specificity reported by caret match the values from my functions.
Use the pROC package to generate an ROC curve for the data set.
library(pROC)
plot(roc(df$class, df$scored.probability))
auc(df$class, df$scored.probability)
## Area under the curve: 0.8503
Based on the results, the pROC curve is essentially the same as the one produced by my function; the only difference is that pROC plots specificity on the x-axis decreasing from 1 to 0, while my function plots the false positive rate (1 - specificity) increasing from 0 to 1.
The AUC values are also very close, differing by about 0.006 (0.8503 from pROC versus 0.8438 from my function).
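Most of that gap comes from the rectangle sum used in mRoc_curve, which pairs each FPR increment with the lower of the two neighboring TPR values and therefore slightly understates the area relative to a trapezoidal sum over the same points. A minimal trapezoidal refinement is sketched below; fpr and tpr stand for the x and y vectors built inside mRoc_curve (hypothetical names, one point per threshold):

# Sketch: trapezoidal AUC over the ROC points; fpr and tpr are the
# vectors called x and y inside mRoc_curve (hypothetical standalone use)
trap_auc <- function(fpr, tpr){
  ord <- order(fpr)                        # order the points by increasing FPR
  xs <- fpr[ord]; ys <- tpr[ord]
  sum(diff(xs) * (head(ys, -1) + tail(ys, -1)) / 2)  # trapezoid rule
}

This should move the estimate closer to the pROC value, which is computed from all distinct probability cut points rather than a fixed 0.01 grid.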