Set working environment

Load data

output <- read.csv('E:/621 GH Bus Analysis/621 WK2 ROC/classification-output-data.csv')
  1. DATA EXPLORATION
glimpse(output)
## Observations: 181
## Variables: 11
## $ pregnant           <int> 7, 2, 3, 1, 4, 1, 9, 8, 1, 2, 5, 5, 13, 0, ...
## $ glucose            <int> 124, 122, 107, 91, 83, 100, 89, 120, 79, 12...
## $ diastolic          <int> 70, 76, 62, 64, 86, 74, 62, 78, 60, 48, 78,...
## $ skinfold           <int> 33, 27, 13, 24, 19, 12, 0, 0, 42, 32, 30, 4...
## $ insulin            <int> 215, 200, 48, 0, 0, 46, 0, 0, 48, 165, 0, 7...
## $ bmi                <dbl> 25.5, 35.9, 22.9, 29.2, 29.3, 19.5, 22.5, 2...
## $ pedigree           <dbl> 0.161, 0.483, 0.678, 0.192, 0.317, 0.149, 0...
## $ age                <int> 37, 26, 23, 21, 34, 28, 33, 64, 23, 26, 37,...
## $ class              <int> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1...
## $ scored.class       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1...
## $ scored.probability <dbl> 0.32845226, 0.27319044, 0.10966039, 0.05599...

As seen from the exploratory output above, the dataset contains 181 observations of 11 variables, most of which describe a patient's medical history. As instructed, we will use three variables for this report: class (the actual class for the observation), scored.class (the predicted class for the observation), and scored.probability (the predicted probability of success for the observation). From these three variables we will build our own functions for accuracy, sensitivity, specificity, the confusion matrix, and the Receiver Operating Characteristic (ROC) curve, and then compare our manually calculated results with those produced by existing R packages (caret and pROC).

vis_miss(output)

boxplot(output,xlab="variables")

The missing-data check shows that there is no missing data in the file. The box plot shows the distribution of each variable, including the three variables of interest; nothing in the distributions looks surprising.

  1. Use the table() function to get the raw confusion matrix for this scored dataset. Make sure you understand the output. In particular, do the rows represent the actual or predicted class? The columns?
cf <- table(output[,9:10])
cf
##      scored.class
## class   0   1
##     0 119   5
##     1  30  27

The confusion matrix is tabulated above, with the actual class (class) in the rows and the predicted class (scored.class) in the columns. There are 119 true negatives, 27 true positives, 30 false negatives, and 5 false positives.
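
For clarity, the four cells of cf can be pulled out by name (a small sketch; rows index the actual class, columns the predicted class):

# Rows of cf are the actual class, columns are the predicted class
TN <- cf["0", "0"]   # actual 0, predicted 0 (119)
FP <- cf["0", "1"]   # actual 0, predicted 1 (5)
FN <- cf["1", "0"]   # actual 1, predicted 0 (30)
TP <- cf["1", "1"]   # actual 1, predicted 1 (27)
c(TN = TN, FP = FP, FN = FN, TP = TP)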

  1. Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the accuracy of the predictions.
Accuracy <- function(df)
{
  names    = c("class", "scored.class")
  cmatrix  = table(df[, names])
  accuracy = (cmatrix[2,2] + cmatrix[1,1]) / (cmatrix[2,2] + cmatrix[1,2] + cmatrix[1,1] + cmatrix[2,1])
  return(round(accuracy, 2))
}
Accuracy(output)
## [1] 0.81

Using the formula given in the instructions, we constructed the accuracy function above. Applying it to our data, the accuracy is 81%.
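
As a quick sanity check (a sketch, not part of the assignment), the same figure can be obtained directly as the proportion of observations whose predicted class matches the actual class:

# Accuracy is the proportion of matching actual/predicted classes; should agree with Accuracy(output), i.e. 0.81
round(mean(output$class == output$scored.class), 2)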

  1. Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the classification error rate of the predictions. Verify that you get an accuracy and an error rate that sums to one
Classification_error_rate <- function(df)
{
  names    = c("class", "scored.class")
  cmatrix  = table(df[, names])
  classification_error_rate = (cmatrix[1,2] + cmatrix[2,1]) / (cmatrix[2,2] + cmatrix[1,2] + cmatrix[1,1] + cmatrix[2,1])
  return(round(classification_error_rate, 2))
}
Classification_error_rate(output)
## [1] 0.19

Using the formula given in the instructions, we constructed a function for the classification error rate. Applying it to our data, the classification error rate is 19%. As for verifying that accuracy and error rate sum to one: our calculated values are 0.81 and 0.19, which indeed sum to 1.
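
The verification can also be expressed in code (a one-line sketch):

# Accuracy and classification error rate should sum to one
Accuracy(output) + Classification_error_rate(output)   # 0.81 + 0.19 = 1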

  1. Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the precision of the predictions.
Precision <- function(df)
{
  names    = c("class", "scored.class")
  cmatrix  = table(df[, names])
  precision = (cmatrix[2,2] / (cmatrix[2,2] + cmatrix[1,2]))
  return(round(precision, 2))
}
Precision(output)
## [1] 0.84

  1. Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the sensitivity of the predictions. Sensitivity is also known as recall.

Sensitivity <- function(df)
{
  names    = c("class", "scored.class")
  cmatrix  = table(df[, names])
  sensitivity = cmatrix[2,2] / (cmatrix[2,2] + cmatrix[2,1])
  return(round(sensitivity, 2))
}
Sensitivity(output)
## [1] 0.47

Using the provided formula, we wrote a function for sensitivity. Applying it to our data, the sensitivity is 47%.

  1. Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the specificity of the predictions.
Specificity <- function(df)
  {
  names    = c("class", "scored.class")
  cmatrix  = table(df[, names])
  specificity =  cmatrix[1,1] / (cmatrix[1,1] + cmatrix[1,2])
  return(round(specificity, 2))
  }
Specificity(output)
## [1] 0.96

Using the formula and applying it to our data, the specificity is 0.96.
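
As an optional cross-check (a sketch using the raw columns rather than the confusion matrix), sensitivity and specificity can also be computed directly:

# Sensitivity: share of actual positives predicted positive (27/57)
round(with(output, sum(scored.class == 1 & class == 1) / sum(class == 1)), 2)
# Specificity: share of actual negatives predicted negative (119/124)
round(with(output, sum(scored.class == 0 & class == 0) / sum(class == 0)), 2)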

  1. Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the F1 score of the predictions.
F1_Score <- function(df)
{
  names    = c("class", "scored.class")
  cmatrix  = table(df[, names])
  precision = Precision(df)
  sensitivity = Sensitivity(df)
  f1_score =  (2 * precision * sensitivity) /(precision + sensitivity)
  return(round(f1_score, 2))
}
F1_Score(output)
## [1] 0.6

After manually constructing the F1 score function and applying our data to it, we obtain an F1 score of 60%.

  1. Before we move on, let’s consider a question that was asked: What are the bounds on the F1 score? Show that the F1 score will always be between 0 and 1. (Hint: if 0 < a < 1 and 0 < b < 1, then ab < a.) F1 = (2 * precision * sensitivity) / (precision + sensitivity)

Answer: Because 0 < precision < 1 and 0 < sensitivity < 1, their product is positive, so F1 > 0. By the hint, precision * sensitivity < precision and precision * sensitivity < sensitivity; adding these two inequalities gives 2 * precision * sensitivity < precision + sensitivity, so F1 = 2 * precision * sensitivity / (precision + sensitivity) < 1. Therefore the F1 score always lies between 0 and 1.
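
A quick numerical illustration (a sketch over a grid of precision/sensitivity values) supports the argument:

# Evaluate F1 on a grid of precision and sensitivity values strictly inside (0, 1)
p    <- seq(0.01, 0.99, by = 0.01)
grid <- expand.grid(precision = p, sensitivity = p)
f1   <- with(grid, 2 * precision * sensitivity / (precision + sensitivity))
range(f1)   # every value falls strictly between 0 and 1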

  1. Write a function that generates an ROC curve from a data set with a true classification column (class in our example) and a probability column (scored.probability in our example). Your function should return a list that includes the plot of the ROC curve and a vector that contains the calculated area under the curve (AUC). Note that I recommend using a sequence of thresholds ranging from 0 to 1 at 0.01 intervals.
# Manually construct the ROC curve and compute the AUC
roc_func <- function(x, y){
  # x: actual class (0/1); y: predicted probability
  x  <- x[order(y, decreasing = TRUE)]   # sort actual classes by descending probability
  TP <- cumsum(x) / sum(x)               # cumulative true positive rate
  FP <- cumsum(!x) / sum(!x)             # cumulative false positive rate
  
  df     <- data.frame(FP, TP)           # FP first so plot() puts it on the x-axis
  diffFP <- c(diff(FP), 0)
  diffTP <- c(diff(TP), 0)
  auc    <- sum(TP * diffFP) + sum(diffTP * diffFP) / 2   # trapezoidal rule
  
  return(list(df = df, auc = auc))
}
roc_data <- roc_func(output$class, output$scored.probability)
plot(roc_data$df, type = "l", col = "red", lwd = 2,
     xlab = "False positive rate", ylab = "True positive rate")

kable(roc_data$auc)
|         x|
|---------:|
| 0.8503113|
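
As a cross-check on the cumulative-sum approach above, here is a minimal sketch of the threshold-sweep method suggested in the prompt (thresholds from 0 to 1 in 0.01 steps); it should produce a very similar curve and AUC:

# Sweep classification thresholds from 0 to 1 and compute TPR/FPR at each one
thresholds <- seq(0, 1, by = 0.01)
tpr <- sapply(thresholds, function(t)
  with(output, sum(scored.probability >= t & class == 1) / sum(class == 1)))
fpr <- sapply(thresholds, function(t)
  with(output, sum(scored.probability >= t & class == 0) / sum(class == 0)))
plot(fpr, tpr, type = "l", col = "blue", lwd = 2,
     xlab = "False positive rate", ylab = "True positive rate")
# Trapezoidal AUC (points run from threshold 0 at the top right to 1 at the bottom left)
sum(-diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)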
  1. Use your created R functions and the provided classification output data set to produce all of the classification metrics discussed above.
# auc(output)  # pROC's auc() expects a roc object or (response, predictor) arguments, not a data frame, so this call does not work
Accuracy(output)
## [1] 0.81
Classification_error_rate(output)
## [1] 0.19
Precision(output)
## [1] 0.84
Sensitivity(output)
## [1] 0.47
Specificity(output)
## [1] 0.96
F1_Score(output)
## [1] 0.6
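
For easier reading (a presentation sketch, not required by the assignment), the hand-rolled metrics can also be gathered into a single named vector:

# Collect all of the manually computed metrics in one place
metrics <- c(Accuracy                  = Accuracy(output),
             Classification_error_rate = Classification_error_rate(output),
             Precision                 = Precision(output),
             Sensitivity               = Sensitivity(output),
             Specificity               = Specificity(output),
             F1_Score                  = F1_Score(output))
metrics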
  1. Investigate the caret package. In particular, consider the functions confusionMatrix, sensitivity, and specificity. Apply the functions to the data set. How do the results compare with your own functions?
if (!"caret" %in% rownames(installed.packages())) install.packages("caret")
require(caret)

ls(pos = "package:caret")
##   [1] "anovaScores"           "avNNet"               
##   [3] "bag"                   "bagControl"           
##   [5] "bagEarth"              "bagEarthStats"        
##   [7] "bagFDA"                "best"                 
##   [9] "BoxCoxTrans"           "calibration"          
##  [11] "caretFuncs"            "caretGA"              
##  [13] "caretSA"               "caretSBF"             
##  [15] "caretTheme"            "cforestStats"         
##  [17] "checkConditionalX"     "checkInstall"         
##  [19] "checkResamples"        "class2ind"            
##  [21] "classDist"             "cluster"              
##  [23] "compare_models"        "confusionMatrix"      
##  [25] "confusionMatrix.train" "contr.dummy"          
##  [27] "contr.ltfr"            "createDataPartition"  
##  [29] "createFolds"           "createModel"          
##  [31] "createMultiFolds"      "createResample"       
##  [33] "createTimeSlices"      "ctreeBag"             
##  [35] "defaultSummary"        "dotPlot"              
##  [37] "downSample"            "dummyVars"            
##  [39] "expandParameters"      "expoTrans"            
##  [41] "extractPrediction"     "extractProb"          
##  [43] "F_meas"                "featurePlot"          
##  [45] "filterVarImp"          "findCorrelation"      
##  [47] "findLinearCombos"      "flatTable"            
##  [49] "gafs"                  "gafs.default"         
##  [51] "gafs_initial"          "gafs_lrSelection"     
##  [53] "gafs_raMutation"       "gafs_rwSelection"     
##  [55] "gafs_spCrossover"      "gafs_tourSelection"   
##  [57] "gafs_uCrossover"       "gafsControl"          
##  [59] "gamFormula"            "gamFuncs"             
##  [61] "gamScores"             "getModelInfo"         
##  [63] "getSamplingInfo"       "getTrainPerf"         
##  [65] "ggplot.gafs"           "ggplot.safs"          
##  [67] "groupKFold"            "hasTerms"             
##  [69] "icr"                   "index2vec"            
##  [71] "ipredStats"            "knn3"                 
##  [73] "knn3Train"             "knnreg"               
##  [75] "knnregTrain"           "ldaBag"               
##  [77] "ldaFuncs"              "ldaSBF"               
##  [79] "learning_curve_dat"    "lift"                 
##  [81] "lmFuncs"               "lmSBF"                
##  [83] "LPH07_1"               "LPH07_2"              
##  [85] "lrFuncs"               "MAE"                  
##  [87] "maxDissim"             "MeanSD"               
##  [89] "minDiss"               "mnLogLoss"            
##  [91] "modelCor"              "modelLookup"          
##  [93] "multiClassSummary"     "nbBag"                
##  [95] "nbFuncs"               "nbSBF"                
##  [97] "nearZeroVar"           "negPredValue"         
##  [99] "nnetBag"               "nullModel"            
## [101] "nzv"                   "oneSE"                
## [103] "outcome_conversion"    "panel.calibration"    
## [105] "panel.lift"            "panel.lift2"          
## [107] "panel.needle"          "pcaNNet"              
## [109] "pickSizeBest"          "pickSizeTolerance"    
## [111] "pickVars"              "plot.gafs"            
## [113] "plot.rfe"              "plot.safs"            
## [115] "plot.train"            "plotClassProbs"       
## [117] "plotObsVsPred"         "plsBag"               
## [119] "plsda"                 "posPredValue"         
## [121] "postResample"          "precision"            
## [123] "predict.bagEarth"      "predict.gafs"         
## [125] "predict.train"         "predictionFunction"   
## [127] "predictors"            "preProcess"           
## [129] "print.train"           "probFunction"         
## [131] "progress"              "prSummary"            
## [133] "R2"                    "recall"               
## [135] "resampleHist"          "resamples"            
## [137] "resampleSummary"       "resampleWrapper"      
## [139] "rfe"                   "rfeControl"           
## [141] "rfeIter"               "rfFuncs"              
## [143] "rfGA"                  "rfSA"                 
## [145] "rfSBF"                 "rfStats"              
## [147] "RMSE"                  "safs"                 
## [149] "safs_initial"          "safs_perturb"         
## [151] "safs_prob"             "safsControl"          
## [153] "sbf"                   "sbfControl"           
## [155] "sbfIter"               "sensitivity"          
## [157] "SLC14_1"               "SLC14_2"              
## [159] "sortImp"               "spatialSign"          
## [161] "specificity"           "splsda"               
## [163] "sumDiss"               "summary.bagEarth"     
## [165] "svmBag"                "thresholder"          
## [167] "tolerance"             "train"                
## [169] "trainControl"          "treebagFuncs"         
## [171] "treebagGA"             "treebagSA"            
## [173] "treebagSBF"            "twoClassSim"          
## [175] "twoClassSummary"       "upSample"             
## [177] "var_seq"               "varImp"               
## [179] "well_numbered"
?sensitivity
## starting httpd help server ... done
?confusionMatrix
?precision

To use the caret functions, we transpose the table so that the reference values (the truth, class) appear in the columns and the predicted classes (scored.class) appear in the rows.

First, get the raw confusion matrix for this scored dataset

data <- read.csv('E:/621 GH Bus Analysis/621 WK2 ROC/classification-output-data.csv')

cmatrix <-  table(data$class, data$scored.class)
cmatrix
##    
##       0   1
##   0 119   5
##   1  30  27
df <- data[c("class","scored.class")]
cmatrix.t <- t(table(df))
cmatrix.t
##             class
## scored.class   0   1
##            0 119  30
##            1   5  27
str(cmatrix.t)
##  'table' int [1:2, 1:2] 119 5 30 27
##  - attr(*, "dimnames")=List of 2
##   ..$ scored.class: chr [1:2] "0" "1"
##   ..$ class       : chr [1:2] "0" "1"

Next, we compare the results from the functions we wrote above with those from the caret package.

sens.caret <- round(sensitivity(cmatrix.t, positive = rownames(cmatrix)[2]),2)
sens.caret
## [1] 0.47
identical(Sensitivity(data), sens.caret)
## [1] TRUE
spec.caret <- round(specificity(cmatrix.t, negative = rownames(cmatrix)[1]),2)
spec.caret
## [1] 0.96
identical(Specificity(data), spec.caret)
## [1] TRUE
cMat.caret <- confusionMatrix(cmatrix.t, positive = "1")
cMat.caret
## Confusion Matrix and Statistics
## 
##             class
## scored.class   0   1
##            0 119  30
##            1   5  27
##                                           
##                Accuracy : 0.8066          
##                  95% CI : (0.7415, 0.8615)
##     No Information Rate : 0.6851          
##     P-Value [Acc > NIR] : 0.0001712       
##                                           
##                   Kappa : 0.4916          
##                                           
##  Mcnemar's Test P-Value : 4.976e-05       
##                                           
##             Sensitivity : 0.4737          
##             Specificity : 0.9597          
##          Pos Pred Value : 0.8438          
##          Neg Pred Value : 0.7987          
##              Prevalence : 0.3149          
##          Detection Rate : 0.1492          
##    Detection Prevalence : 0.1768          
##       Balanced Accuracy : 0.7167          
##                                           
##        'Positive' Class : 1               
## 
str(cMat.caret)
## List of 6
##  $ positive: chr "1"
##  $ table   : 'table' int [1:2, 1:2] 119 5 30 27
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ scored.class: chr [1:2] "0" "1"
##   .. ..$ class       : chr [1:2] "0" "1"
##  $ overall : Named num [1:7] 0.807 0.492 0.741 0.861 0.685 ...
##   ..- attr(*, "names")= chr [1:7] "Accuracy" "Kappa" "AccuracyLower" "AccuracyUpper" ...
##  $ byClass : Named num [1:11] 0.474 0.96 0.844 0.799 0.844 ...
##   ..- attr(*, "names")= chr [1:11] "Sensitivity" "Specificity" "Pos Pred Value" "Neg Pred Value" ...
##  $ mode    : chr "sens_spec"
##  $ dots    : list()
##  - attr(*, "class")= chr "confusionMatrix"
prec.caret <- round(precision(cmatrix.t, relevant = "1"),2)
prec.caret
## [1] 0.84
identical(Precision(data), prec.caret)
## [1] TRUE
acc.caret <- round(cMat.caret$overall[1],2)
acc.caret
## Accuracy 
##     0.81
identical(Accuracy(data), acc.caret)  
## [1] FALSE

Answer: Below is a comparison of our own calculations with the results from the caret package.

Accuracy: 0.81 vs 0.8066 (95% CI 0.7415-0.8615)
Classification error rate: 0.19 vs 0.1934 (1 - 0.8066)
Precision: 0.84 vs 0.8438 (reported as Pos Pred Value)
Sensitivity: 0.47 vs 0.4737
Specificity: 0.96 vs 0.9597
F1 score: 0.60 vs (not reported directly by confusionMatrix in this mode)

Rounded to two decimal places, our results match the caret results.
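
The FALSE from identical() on accuracy is only about attributes: cMat.caret$overall[1] is a named numeric (named "Accuracy"), while our function returns a bare number. Dropping the name should make the comparison pass (a small sketch):

# acc.caret carries a names attribute; strip it before comparing the rounded values
identical(Accuracy(data), unname(acc.caret))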

  1. Investigate the pROC package. Use it to generate an ROC curve for the dataset. How do the results compare with your own function?
roc(output$class, output$scored.probability, levels=c(0,1), percent=TRUE, plot=TRUE, ci=TRUE)
## Setting direction: controls < cases

## 
## Call:
## roc.default(response = output$class, predictor = output$scored.probability,     levels = c(0, 1), percent = TRUE, ci = TRUE, plot = TRUE)
## 
## Data: output$scored.probability in 124 controls (output$class 0) < 57 cases (output$class 1).
## Area under the curve: 85.03%
## 95% CI: 79.05%-91.01% (DeLong)

Using the pROC package, the area under the curve is 85.03% (95% CI 79.05%-91.01%, DeLong). This is a fairly good ROC curve: the model captures most of the trade-off between sensitivity and specificity, leaving only about 15% of the area above the curve (1 - 0.85 = 0.15). The AUC matches the value from our own function (0.8503), although the pROC curve is smoother, and the formatting differences make the two plots a little hard to compare side by side.
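
To set the two AUC values side by side (a sketch assuming the pROC object is saved to a variable), the AUC can be extracted from the pROC result and compared with the value from our own function:

# Store the pROC object and compare its AUC with the manually computed one
roc_obj <- roc(output$class, output$scored.probability, levels = c(0, 1))
c(pROC = as.numeric(auc(roc_obj)), manual = roc_data$auc)   # both about 0.8503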