Set working environment
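The code in this report assumes the packages below are installed and loaded; this is a minimal setup sketch (vis_miss() is assumed to come from the visdat package, and caret is installed and loaded further down where it is first needed).
library(dplyr)    # glimpse()
library(visdat)   # vis_miss()
library(knitr)    # kable()
library(pROC)     # roc() for the ROC curve near the end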
Load data
output <- read.csv('E:/621 GH Bus Analysis/621 WK2 ROC/classification-output-data.csv')
glimpse(output)
## Observations: 181
## Variables: 11
## $ pregnant <int> 7, 2, 3, 1, 4, 1, 9, 8, 1, 2, 5, 5, 13, 0, ...
## $ glucose <int> 124, 122, 107, 91, 83, 100, 89, 120, 79, 12...
## $ diastolic <int> 70, 76, 62, 64, 86, 74, 62, 78, 60, 48, 78,...
## $ skinfold <int> 33, 27, 13, 24, 19, 12, 0, 0, 42, 32, 30, 4...
## $ insulin <int> 215, 200, 48, 0, 0, 46, 0, 0, 48, 165, 0, 7...
## $ bmi <dbl> 25.5, 35.9, 22.9, 29.2, 29.3, 19.5, 22.5, 2...
## $ pedigree <dbl> 0.161, 0.483, 0.678, 0.192, 0.317, 0.149, 0...
## $ age <int> 37, 26, 23, 21, 34, 28, 33, 64, 23, 26, 37,...
## $ class <int> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1...
## $ scored.class <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1...
## $ scored.probability <dbl> 0.32845226, 0.27319044, 0.10966039, 0.05599...
As seen from the exploratory output above, the dataset contains 181 observations of 11 variables, most of which describe a patient's medical history. As instructed, we will use three of these variables in this report: class (the actual class for the observation), scored.class (the predicted class for the observation), and scored.probability (the predicted probability of success for the observation). Using these three variables, we will construct our own formulas for sensitivity, specificity, and related metrics, build the confusion matrix that underlies them, and draw a Receiver Operating Characteristic (ROC) curve. We will then compute the same quantities with existing R packages and compare those results against our manual calculations.
vis_miss(output)
boxplot(output,xlab="variables")
The missing-data check shows that there are no missing values in the file. The box plot shows the distributions of the variables, including the three variables of interest, and nothing there looks surprising.
cf <- table(output[,9:10])
cf
## scored.class
## class 0 1
## 0 119 5
## 1 30 27
The confusion matrix is tabulated above, with the actual class in the rows and the predicted class in the columns. There are 119 true negatives and 27 true positives, along with 30 false negatives and 5 false positives.
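To make the matrix cells easier to reference in the functions that follow, here is a small sketch that labels the four cells of cf; the names TN, FP, FN, and TP are our own labels, not part of the assignment.
TN <- cf["0", "0"]   # actual 0, predicted 0: 119 true negatives
FP <- cf["0", "1"]   # actual 0, predicted 1: 5 false positives
FN <- cf["1", "0"]   # actual 1, predicted 0: 30 false negatives
TP <- cf["1", "1"]   # actual 1, predicted 1: 27 true positives
c(TN = TN, FP = FP, FN = FN, TP = TP)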
Accuracy <- function(df)
{
  names = c("class", "scored.class")
  cmatrix = table(df[, names])
  accuracy = (cmatrix[2,2] + cmatrix[1,1]) / (cmatrix[2,2] + cmatrix[1,2] + cmatrix[1,1] + cmatrix[2,1])
  return(round(accuracy, 2))
}
Accuracy(output)
## [1] 0.81
Using the formula provided in the instructions, we constructed the accuracy function above. Applying our data to this function gives an accuracy of 81%.
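For reference, the formula implemented above, using the cell labels defined earlier, is:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$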
Classification_error_rate <- function(df)
{
  names = c("class", "scored.class")
  cmatrix = table(df[, names])
  classification_error_rate = (cmatrix[1,2] + cmatrix[2,1]) / (cmatrix[2,2] + cmatrix[1,2] + cmatrix[1,1] + cmatrix[2,1])
  return(round(classification_error_rate, 2))
}
Classification_error_rate(output)
## [1] 0.19
Using the formula provided in the instructions, we constructed the function above for the classification error rate. Applying the data to this function gives a classification error rate of 19%. To answer the question of whether accuracy and error rate sum to one: our calculated values were 0.81 and 0.19, which do indeed sum to 1 (100%).
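The corresponding formula, which is simply the complement of accuracy, is:

$$\text{Classification error rate} = \frac{FP + FN}{TP + TN + FP + FN} = 1 - \text{Accuracy}$$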
Precision <- function(df)
{
  names = c("class", "scored.class")
  cmatrix = table(df[, names])
  precision = cmatrix[2,2] / (cmatrix[2,2] + cmatrix[1,2])
  return(round(precision, 2))
}
Precision(output)
## [1] 0.84
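The precision function above implements the standard definition, which for this data gives 0.84:

$$\text{Precision} = \frac{TP}{TP + FP}$$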
Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the sensitivity of the predictions. Sensitivity is also known as recall.
Sensitivity <- function(df)
{
  names = c("class", "scored.class")
  cmatrix = table(df[, names])
  sensitivity = cmatrix[2,2] / (cmatrix[2,2] + cmatrix[2,1])
  return(round(sensitivity, 2))
}
Sensitivity(output)
## [1] 0.47
Using the formula provided in the instructions, we wrote a function for sensitivity. Applying it to our data gives a sensitivity of 47%.
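The formula implemented above is:

$$\text{Sensitivity (Recall)} = \frac{TP}{TP + FN}$$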
Specificity <- function(df)
{
  names = c("class", "scored.class")
  cmatrix = table(df[, names])
  specificity = cmatrix[1,1] / (cmatrix[1,1] + cmatrix[1,2])
  return(round(specificity, 2))
}
Specificity(output)
## [1] 0.96
Using the formula and applying it to our data, the specificity is 0.96.
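The formula implemented above is:

$$\text{Specificity} = \frac{TN}{TN + FP}$$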
F1_Score <- function(df)
{
  precision = Precision(df)
  sensitivity = Sensitivity(df)
  f1_score = (2 * precision * sensitivity) / (precision + sensitivity)
  return(round(f1_score, 2))
}
F1_Score(output)
## [1] 0.6
After manually constructing the formula for the F1 score and applying our data to it, we obtain an F1 score of 0.60.
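The F1 score combines precision and sensitivity as their harmonic mean:

$$F_1 = \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}$$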
Answer: Because 0 < precision ≤ 1 and 0 < sensitivity ≤ 1, we have precision × sensitivity ≤ precision and precision × sensitivity ≤ sensitivity. Adding these two inequalities gives 2 × precision × sensitivity ≤ precision + sensitivity, so the F1 score, 2 × precision × sensitivity / (precision + sensitivity), is at most 1; and since both precision and sensitivity are positive, the F1 score is greater than 0. Therefore the F1 score always lies between 0 and 1.
# 10. Manually create ROC curve
roc_func <- function(x, y){
  # Sort the actual classes (x) by decreasing predicted probability (y)
  x <- x[order(y, decreasing = TRUE)]
  # Cumulative true-positive and false-positive rates as the threshold is lowered
  TP = cumsum(x) / sum(x)
  FP = cumsum(!x) / sum(!x)
  df <- data.frame(FP, TP)
  # Trapezoidal approximation of the area under the curve
  diffFP <- c(diff(FP), 0)
  diffTP <- c(diff(TP), 0)
  auc <- sum(TP * diffFP) + sum(diffTP * diffFP) / 2
  # Return the curve points and the AUC as a proper list
  return(list(df = df, auc = auc))
}
roc_data <- roc_func(output$class, output$scored.probability)
plot(roc_data$df$FP, roc_data$df$TP, type = "l",
     col = "red", lwd = 2,
     xlab = "False positive rate", ylab = "True positive rate")
kable(roc_data$auc)
|         x|
|---------:|
| 0.8503113|
# auc(output) did not work because pROC::auc() expects an roc object (or a response and a predictor), not the whole data frame
Accuracy(output)
## [1] 0.81
Classification_error_rate(output)
## [1] 0.19
Precision(output)
## [1] 0.84
Sensitivity(output)
## [1] 0.47
Specificity(output)
## [1] 0.96
F1_Score(output)
## [1] 0.6
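For convenience, the hand-built metrics can also be collected into a single named vector; this is just a small presentation sketch, not part of the assignment.
metrics <- c(
  Accuracy    = Accuracy(output),
  Error_rate  = Classification_error_rate(output),
  Precision   = Precision(output),
  Sensitivity = Sensitivity(output),
  Specificity = Specificity(output),
  F1          = F1_Score(output)
)
metrics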
if (!"caret" %in% installed.packages()) install.packages(caret)
require(caret)
ls(pos = "package:caret")
## [1] "anovaScores" "avNNet"
## [3] "bag" "bagControl"
## [5] "bagEarth" "bagEarthStats"
## [7] "bagFDA" "best"
## [9] "BoxCoxTrans" "calibration"
## [11] "caretFuncs" "caretGA"
## [13] "caretSA" "caretSBF"
## [15] "caretTheme" "cforestStats"
## [17] "checkConditionalX" "checkInstall"
## [19] "checkResamples" "class2ind"
## [21] "classDist" "cluster"
## [23] "compare_models" "confusionMatrix"
## [25] "confusionMatrix.train" "contr.dummy"
## [27] "contr.ltfr" "createDataPartition"
## [29] "createFolds" "createModel"
## [31] "createMultiFolds" "createResample"
## [33] "createTimeSlices" "ctreeBag"
## [35] "defaultSummary" "dotPlot"
## [37] "downSample" "dummyVars"
## [39] "expandParameters" "expoTrans"
## [41] "extractPrediction" "extractProb"
## [43] "F_meas" "featurePlot"
## [45] "filterVarImp" "findCorrelation"
## [47] "findLinearCombos" "flatTable"
## [49] "gafs" "gafs.default"
## [51] "gafs_initial" "gafs_lrSelection"
## [53] "gafs_raMutation" "gafs_rwSelection"
## [55] "gafs_spCrossover" "gafs_tourSelection"
## [57] "gafs_uCrossover" "gafsControl"
## [59] "gamFormula" "gamFuncs"
## [61] "gamScores" "getModelInfo"
## [63] "getSamplingInfo" "getTrainPerf"
## [65] "ggplot.gafs" "ggplot.safs"
## [67] "groupKFold" "hasTerms"
## [69] "icr" "index2vec"
## [71] "ipredStats" "knn3"
## [73] "knn3Train" "knnreg"
## [75] "knnregTrain" "ldaBag"
## [77] "ldaFuncs" "ldaSBF"
## [79] "learning_curve_dat" "lift"
## [81] "lmFuncs" "lmSBF"
## [83] "LPH07_1" "LPH07_2"
## [85] "lrFuncs" "MAE"
## [87] "maxDissim" "MeanSD"
## [89] "minDiss" "mnLogLoss"
## [91] "modelCor" "modelLookup"
## [93] "multiClassSummary" "nbBag"
## [95] "nbFuncs" "nbSBF"
## [97] "nearZeroVar" "negPredValue"
## [99] "nnetBag" "nullModel"
## [101] "nzv" "oneSE"
## [103] "outcome_conversion" "panel.calibration"
## [105] "panel.lift" "panel.lift2"
## [107] "panel.needle" "pcaNNet"
## [109] "pickSizeBest" "pickSizeTolerance"
## [111] "pickVars" "plot.gafs"
## [113] "plot.rfe" "plot.safs"
## [115] "plot.train" "plotClassProbs"
## [117] "plotObsVsPred" "plsBag"
## [119] "plsda" "posPredValue"
## [121] "postResample" "precision"
## [123] "predict.bagEarth" "predict.gafs"
## [125] "predict.train" "predictionFunction"
## [127] "predictors" "preProcess"
## [129] "print.train" "probFunction"
## [131] "progress" "prSummary"
## [133] "R2" "recall"
## [135] "resampleHist" "resamples"
## [137] "resampleSummary" "resampleWrapper"
## [139] "rfe" "rfeControl"
## [141] "rfeIter" "rfFuncs"
## [143] "rfGA" "rfSA"
## [145] "rfSBF" "rfStats"
## [147] "RMSE" "safs"
## [149] "safs_initial" "safs_perturb"
## [151] "safs_prob" "safsControl"
## [153] "sbf" "sbfControl"
## [155] "sbfIter" "sensitivity"
## [157] "SLC14_1" "SLC14_2"
## [159] "sortImp" "spatialSign"
## [161] "specificity" "splsda"
## [163] "sumDiss" "summary.bagEarth"
## [165] "svmBag" "thresholder"
## [167] "tolerance" "train"
## [169] "trainControl" "treebagFuncs"
## [171] "treebagGA" "treebagSA"
## [173] "treebagSBF" "twoClassSim"
## [175] "twoClassSummary" "upSample"
## [177] "var_seq" "varImp"
## [179] "well_numbered"
?sensitivity
## starting httpd help server ... done
?confusionMatrix
?precision
We transpose the table so that the reference values (i.e., the truth, "class") are in the columns and the predictions ("scored.class") are in the rows, which is the orientation the caret functions expect.
First, get the raw confusion matrix for this scored dataset
data <- read.csv('E:/621 GH Bus Analysis/621 WK2 ROC/classification-output-data.csv')
cmatrix <- table(data$class, data$scored.class)
cmatrix
##
## 0 1
## 0 119 5
## 1 30 27
df <- data[c("class","scored.class")]
cmatrix.t <- t(table(df))
cmatrix.t
## class
## scored.class 0 1
## 0 119 30
## 1 5 27
str(cmatrix.t)
## 'table' int [1:2, 1:2] 119 5 30 27
## - attr(*, "dimnames")=List of 2
## ..$ scored.class: chr [1:2] "0" "1"
## ..$ class : chr [1:2] "0" "1"
Next, we compare the results from the functions we wrote above with those from the caret package.
sens.caret <- round(sensitivity(cmatrix.t, positive = rownames(cmatrix)[2]),2)
sens.caret
## [1] 0.47
identical(Sensitivity(data), sens.caret)
## [1] TRUE
spec.caret <- round(specificity(cmatrix.t, negative = rownames(cmatrix)[1]),2)
spec.caret
## [1] 0.96
identical(Specificity(data), spec.caret)
## [1] TRUE
cMat.caret <- confusionMatrix(cmatrix.t, positive = "1")
cMat.caret
## Confusion Matrix and Statistics
##
## class
## scored.class 0 1
## 0 119 30
## 1 5 27
##
## Accuracy : 0.8066
## 95% CI : (0.7415, 0.8615)
## No Information Rate : 0.6851
## P-Value [Acc > NIR] : 0.0001712
##
## Kappa : 0.4916
##
## Mcnemar's Test P-Value : 4.976e-05
##
## Sensitivity : 0.4737
## Specificity : 0.9597
## Pos Pred Value : 0.8438
## Neg Pred Value : 0.7987
## Prevalence : 0.3149
## Detection Rate : 0.1492
## Detection Prevalence : 0.1768
## Balanced Accuracy : 0.7167
##
## 'Positive' Class : 1
##
str(cMat.caret)
## List of 6
## $ positive: chr "1"
## $ table : 'table' int [1:2, 1:2] 119 5 30 27
## ..- attr(*, "dimnames")=List of 2
## .. ..$ scored.class: chr [1:2] "0" "1"
## .. ..$ class : chr [1:2] "0" "1"
## $ overall : Named num [1:7] 0.807 0.492 0.741 0.861 0.685 ...
## ..- attr(*, "names")= chr [1:7] "Accuracy" "Kappa" "AccuracyLower" "AccuracyUpper" ...
## $ byClass : Named num [1:11] 0.474 0.96 0.844 0.799 0.844 ...
## ..- attr(*, "names")= chr [1:11] "Sensitivity" "Specificity" "Pos Pred Value" "Neg Pred Value" ...
## $ mode : chr "sens_spec"
## $ dots : list()
## - attr(*, "class")= chr "confusionMatrix"
prec.caret <- round(precision(cmatrix.t, relevant = "1"),2)
prec.caret
## [1] 0.84
identical(Precision(data), prec.caret)
## [1] TRUE
acc.caret <- round(cMat.caret$overall[1],2)
acc.caret
## Accuracy
## 0.81
identical(Accuracy(data), acc.caret)
## [1] FALSE
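The FALSE above is caused by the "Accuracy" name that caret attaches to the value, not by a numerical difference; a quick check (a sketch using unname() to drop the name) should confirm that the rounded values match.
identical(Accuracy(data), unname(acc.caret))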
Answer: Below is a comparison of our own calculations against the results from the caret package.

| Metric | Our function | caret |
|---|---|---|
| Accuracy | 0.81 | 0.8066 (95% CI 0.7415-0.8615) |
| Classification error rate | 0.19 | 0.1934 (1 - accuracy) |
| Precision | 0.84 | 0.8438 (Pos Pred Value) |
| Sensitivity | 0.47 | 0.4737 |
| Specificity | 0.96 | 0.9597 |
| F1 score | 0.60 | not reported in the confusionMatrix() output above |
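caret also provides F_meas() (visible in the package listing above) for the F1 comparison left blank in the table; assuming it follows the same table interface as the precision() call used earlier, it could be computed with the sketch below.
round(F_meas(cmatrix.t, relevant = "1"), 2)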
roc(output$class, output$scored.probability, levels=c(0,1), percent=TRUE, plot=TRUE, ci=TRUE)
## Setting direction: controls < cases
##
## Call:
## roc.default(response = output$class, predictor = output$scored.probability, levels = c(0, 1), percent = TRUE, ci = TRUE, plot = TRUE)
##
## Data: output$scored.probability in 124 controls (output$class 0) < 57 cases (output$class 1).
## Area under the curve: 85.03%
## 95% CI: 79.05%-91.01% (DeLong)
Using the pROC package, the area under the curve is 85.03% (95% CI 79.05%-91.01%). This is a reasonably good ROC curve: the model captures most of the trade-off between sensitivity and specificity, leaving only about 15% of the area above the curve (1 - 0.85). We can also see that the ROC curve generated by pROC is smoother than the one from our own function.
Because the two plots are formatted differently (pROC plots specificity on the x-axis by default), it is a little hard to compare this curve directly to the ROC curve generated by our hand-calculated function.
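To make the two curves easier to compare visually, the sketch below re-plots our manual ROC points and overlays the pROC curve on the same false-positive-rate axis (specificity is converted to 1 - specificity for a common scale; roc_obj is a new name introduced here for the pROC object).
roc_obj <- roc(output$class, output$scored.probability, levels = c(0, 1))
plot(roc_data$df$FP, roc_data$df$TP, type = "l", col = "red", lwd = 2,
     xlab = "False positive rate", ylab = "True positive rate")
lines(1 - roc_obj$specificities, roc_obj$sensitivities, col = "blue", lty = 2)
legend("bottomright", legend = c("manual", "pROC"), col = c("red", "blue"), lty = c(1, 2))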