Libraries

library(tidyverse)
library(e1071)
library(caret)
library(pROC)
library(knitr)
library(ROCR)
library(randomForest)

Shiny App

I built a Shiny app for this homework that can be used to explore the data and run logistic regression, decision trees, and random forests. My plan is to extend the app so that any classification dataset can be uploaded and run through various classifiers, which would free us from writing the same script over and over. Please visit the Shiny app at the following link.

https://forhadakbar.shinyapps.io/HeartDiseaseAnalyzer/

PART-A

STEP#0: Pick any two classifiers of (SVM, Logistic, DecisionTree, NaiveBayes). Pick the heart or ecoli dataset. Heart is simpler; ecoli compounds the problem as it is NOT a balanced dataset. From a grading perspective both carry the same weight.

STEP#1: For each classifier, set a seed (43).

STEP#2: Do an 80/20 split and determine the Accuracy, AUC, and as many metrics as returned by the caret package (confusionMatrix). Call this the base_metric. Note down as best as you can the development (engineering) cost as well as the computing cost (elapsed time).

Start with the original dataset and set a seed (43). Then run cross-validations of 5 and 10 folds of the model on the training set. Determine the same set of metrics and compare the cv_metrics with the base_metric. Note down as best as you can the development (engineering) cost as well as the computing cost (elapsed time).

Start with the original dataset and set a seed (43). Then run a bootstrap of 200 resamples and compute the same set of metrics, and for each of the two classifiers build a three-column table for each experiment (base, bootstrap, cross-validated). Note down as best as you can the development (engineering) cost as well as the computing cost (elapsed time).
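As an aside, caret can drive these same resampling schemes directly through trainControl; a minimal sketch, assuming the trdata split created below (shown for reference only, not used for the reported results):

ctrl_cv10 <- caret::trainControl(method = "cv",   number = 10)  # 10-fold CV
ctrl_boot <- caret::trainControl(method = "boot", number = 200) # 200 bootstrap resamples
fit_cv10  <- caret::train(outcome ~ ., data = trdata, method = "glm", trControl = ctrl_cv10)
fit_cv10$results # resampled Accuracy and Kappa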

Base Model

Load Data

I have downloaded the heart data from https://archive.ics.uci.edu/ml/datasets/Heart+Disease. I will skip most of the data exploration here except for a few necessary checks and jump directly into what the homework instructions ask for. I picked logistic regression and Naive Bayes for Part A. I will time each step to record elapsed time.

st <- proc.time() # start time for common part
heart <- read.csv("heart.csv", header = T, sep = ",", stringsAsFactors=F)
head(heart)
##   age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1  63   1  3      145  233   1       0     150     0     2.3     0  0    1
## 2  37   1  2      130  250   0       1     187     0     3.5     0  0    2
## 3  41   0  1      130  204   0       0     172     0     1.4     2  0    2
## 4  56   1  1      120  236   0       1     178     0     0.8     2  0    2
## 5  57   0  0      120  354   0       1     163     1     0.6     2  0    2
## 6  57   1  0      140  192   0       1     148     0     0.4     1  0    1
##   outcome
## 1       1
## 2       1
## 3       1
## 4       1
## 5       1
## 6       1
dim(heart)
## [1] 303  14

Let us set the name of the first column to 'age' and convert the outcome to a factor.

names(heart)[[1]] <- 'age'
heart$outcome <- as.factor(heart$outcome)
str(heart)
## 'data.frame':    303 obs. of  14 variables:
##  $ age     : int  63 37 41 56 57 57 56 44 52 57 ...
##  $ sex     : int  1 1 0 1 0 1 0 1 1 1 ...
##  $ cp      : int  3 2 1 1 0 0 1 1 2 2 ...
##  $ trestbps: int  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol    : int  233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs     : int  1 0 0 0 0 0 0 0 1 0 ...
##  $ restecg : int  0 1 0 1 1 1 0 1 1 1 ...
##  $ thalach : int  150 187 172 178 163 148 153 173 162 174 ...
##  $ exang   : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ oldpeak : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ slope   : int  0 0 2 2 2 1 1 2 2 2 ...
##  $ ca      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ thal    : int  1 2 2 2 2 1 2 3 3 2 ...
##  $ outcome : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...

Let's check whether any column has a constant value across all its rows.

isConstant <- function(x) length(names(table(x))) < 2 
apply(heart, 2, isConstant) 
##      age      sex       cp trestbps     chol      fbs  restecg  thalach 
##    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE 
##    exang  oldpeak    slope       ca     thal  outcome 
##    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE
pairs(heart)

classLabels <- table(heart$outcome)
print(classLabels)
## 
##   0   1 
## 138 165
names(classLabels)
## [1] "0" "1"
ifelse(length(names(classLabels))==2, "binary classification", "multi-class classification")
## [1] "binary classification"

Split the data

set.seed(43)
tstidx <- sample(1:nrow(heart),0.20*nrow(heart),replace = F)
trdata <- heart[-tstidx,]
tsdata <- heart[tstidx,]
 # time for common portion
common_portion_time_base <- proc.time()-st

Baseline performance of LR

st <- proc.time() # start time for LR
glm.model <- glm(outcome~.,data = trdata,family = 'binomial')
predts_glm <- predict(glm.model,tsdata[,1:13],type = 'response')
predtsclass_glm <- ifelse(predts_glm<0.5,0,1)
tscfm_glm <- caret::confusionMatrix(table(tsdata[[14]],predtsclass_glm))
tscfm_glm
## Confusion Matrix and Statistics
## 
##    predtsclass_glm
##      0  1
##   0 22  5
##   1  6 27
##                                           
##                Accuracy : 0.8167          
##                  95% CI : (0.6956, 0.9048)
##     No Information Rate : 0.5333          
##     P-Value [Acc > NIR] : 4.344e-06       
##                                           
##                   Kappa : 0.6309          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.7857          
##             Specificity : 0.8438          
##          Pos Pred Value : 0.8148          
##          Neg Pred Value : 0.8182          
##              Prevalence : 0.4667          
##          Detection Rate : 0.3667          
##    Detection Prevalence : 0.4500          
##       Balanced Accuracy : 0.8147          
##                                           
##        'Positive' Class : 0               
## 
acc_glm <- sum(diag(tscfm_glm$table))/sum(tscfm_glm$table)
roc_glm <- roc(tsdata$outcome, as.numeric(predts_glm))
auc_glm <- auc(roc_glm)
(df_glm <- data.frame(tscfm_glm$byClass))
##                      tscfm_glm.byClass
## Sensitivity                  0.7857143
## Specificity                  0.8437500
## Pos Pred Value               0.8148148
## Neg Pred Value               0.8181818
## Precision                    0.8148148
## Recall                       0.7857143
## F1                           0.8000000
## Prevalence                   0.4666667
## Detection Rate               0.3666667
## Detection Prevalence         0.4500000
## Balanced Accuracy            0.8147321
LR_Base_time <- proc.time()-st # time for base LR
total_LR_Base_time <- common_portion_time_base + LR_Base_time # time for base LR
total_LR_Base_time
##    user  system elapsed 
##    1.28    3.04    4.47
glm_row <- c("base_LR",acc_glm,auc_glm,data.frame(total_LR_Base_time[3])[1,], df_glm[1:11,])

Baseline performance of NB

st <- proc.time() # start time for NB
nb.model<-naiveBayes(outcome~.,data=trdata)
predts_nb <- predict(nb.model,tsdata[,1:13],type='raw')
predtsclass_nb <- unlist(apply(round(predts_nb),1,which.max))-1 # class with the higher posterior, mapped to 0/1
tstbl_nb <- table(tsdata[[14]], predtsclass_nb)
tscfm_nb <- caret::confusionMatrix(tstbl_nb)
tscfm_nb
## Confusion Matrix and Statistics
## 
##    predtsclass_nb
##      0  1
##   0 23  4
##   1  7 26
##                                           
##                Accuracy : 0.8167          
##                  95% CI : (0.6956, 0.9048)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 3.781e-07       
##                                           
##                   Kappa : 0.6333          
##                                           
##  Mcnemar's Test P-Value : 0.5465          
##                                           
##             Sensitivity : 0.7667          
##             Specificity : 0.8667          
##          Pos Pred Value : 0.8519          
##          Neg Pred Value : 0.7879          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3833          
##    Detection Prevalence : 0.4500          
##       Balanced Accuracy : 0.8167          
##                                           
##        'Positive' Class : 0               
## 
acc_nb <- sum(diag(tscfm_nb$table))/sum(tscfm_nb$table)
roc_nb <- roc(tsdata$outcome, as.numeric(predts_nb[,2]))
auc_nb <- auc(roc_nb)
(df_nb <- data.frame(tscfm_nb$byClass))
##                      tscfm_nb.byClass
## Sensitivity                 0.7666667
## Specificity                 0.8666667
## Pos Pred Value              0.8518519
## Neg Pred Value              0.7878788
## Precision                   0.8518519
## Recall                      0.7666667
## F1                          0.8070175
## Prevalence                  0.5000000
## Detection Rate              0.3833333
## Detection Prevalence        0.4500000
## Balanced Accuracy           0.8166667
NB_Base_time <- proc.time()-st # time for base NB
total_NB_Base_time <- common_portion_time_base + NB_Base_time # time for base NB
total_NB_Base_time
##    user  system elapsed 
##    1.35    3.03    4.46
nb_row <- c("base_NB",acc_nb,auc_nb,data.frame(total_NB_Base_time[3])[1,],df_nb[1:11,])

Let’s compare the metrics of base LR and NB.

table_base <- data.frame(matrix(ncol = 12, nrow = 0))
table_base <- rbind(table_base, glm_row, nb_row)
colnames(table_base) <- c("ALGO", "ACCURACY","AUC", "Total elapsed time", "Sensitivity", "Specificity", "Pos Pred Value", "Neg Pred Value", "Precision", "Recall", "F1", "Prevalence", "Detection Rate", "Detection Prevalence", "Balanced Accuracy")
knitr::kable(table_base)
ALGO ACCURACY AUC Total elapsed time Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall F1 Prevalence Detection Rate Detection Prevalence Balanced Accuracy
base_LR 0.816666666666667 0.906846240179573 4.47 0.785714285714286 0.84375 0.814814814814815 0.818181818181818 0.814814814814815 0.785714285714286 0.8 0.466666666666667 0.366666666666667 0.45 0.814732142857143
base_NB 0.816666666666667 0.897867564534231 4.46 0.766666666666667 0.866666666666667 0.851851851851852 0.787878787878788 0.851851851851852 0.766666666666667 0.807017543859649 0.5 0.383333333333333 0.45 0.816666666666667

Cross validation

I am using the code from the class notes.

st <- proc.time() # start time for common portion of 10 fold CV
set.seed(43)
tstidx_cv <- sample(1:nrow(heart),0.20*nrow(heart),replace = F)
trdata_cv <- heart[-tstidx_cv,] # same split as the base model (same seed)
tsdata_cv <- heart[tstidx_cv,]
N <- nrow(trdata_cv)
ten_NF = 10
ten_folds <- split(1:N,cut(1:N, quantile(1:N, probs = seq(0, 1, by =1/ten_NF))))
length(ten_folds)
## [1] 10
lapply(ten_folds,length)
## $`(1,25.2]`
## [1] 24
## 
## $`(25.2,49.4]`
## [1] 24
## 
## $`(49.4,73.6]`
## [1] 24
## 
## $`(73.6,97.8]`
## [1] 24
## 
## $`(97.8,122]`
## [1] 25
## 
## $`(122,146]`
## [1] 24
## 
## $`(146,170]`
## [1] 24
## 
## $`(170,195]`
## [1] 24
## 
## $`(195,219]`
## [1] 24
## 
## $`(219,243]`
## [1] 25
common_portion_time_cvtenfold <- proc.time()-st
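For reference, caret offers a one-line equivalent of this fold construction; a hedged sketch (not used below):

ten_folds_alt <- caret::createFolds(1:N, k = 10) # list of 10 index vectors, like ten_folds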

LR with 10 fold CV

st <- proc.time() # start time for LR with 10 fold cv 
ridx<-sample(1:nrow(trdata_cv),nrow(trdata_cv),replace=FALSE) # randomize the data

cv_df <- do.call('rbind',lapply(ten_folds,FUN=function(idx,data=trdata_cv[ridx,]) {
  m<-glm(outcome~.,data=data[-idx,],family = 'binomial')# keep one fold for validation
   p<-predict(m,data[idx,-c(14)],type = 'response') # predict for that test fold
   pc<-ifelse(p<0.5,0,1)
  pred_tbl<-table(data[idx,c(14)],pc) #table(actual,predicted)
  pred_cfm<-caret::confusionMatrix(pred_tbl)
  list(fold=idx,m=m,cfm=pred_cfm) # store the fold, model,cfm 
  }
)) # lapply repeats over all folds
cv_df <- as.data.frame(cv_df)

tstcv_preds_LR_tenfold <- lapply(cv_df$m,FUN=function(M,D=tsdata_cv[,-c(14)])predict(M,D,type = 'response'))

tstcv_cfm <- lapply(tstcv_preds_LR_tenfold,FUN=function(P,A=tsdata_cv[[14]])
{pred_class<-ifelse(P<0.5,0,1)
  pred_tbl<-table(pred_class,A)
  pred_cfm<-caret::confusionMatrix(pred_tbl)
  pred_cfm
})

tstcv.perf <- as.data.frame(do.call('rbind',lapply(tstcv_cfm,FUN=function(cfm)c(cfm$overall,cfm$byClass))))

cv.tst.perf_LR_tenfold <-apply(tstcv.perf[tstcv.perf$AccuracyPValue<0.01,-c(6:7)],2,mean)
(df_LR_tenfold <- data.frame(cv.tst.perf_LR_tenfold))
##                      cv.tst.perf_LR_tenfold
## Accuracy                          0.8283333
## Kappa                             0.6547540
## AccuracyLower                     0.7091456
## AccuracyUpper                     0.9132489
## AccuracyNull                      0.5500000
## Sensitivity                       0.8333333
## Specificity                       0.8242424
## Pos Pred Value                    0.7965761
## Neg Pred Value                    0.8590574
## Precision                         0.7965761
## Recall                            0.8333333
## F1                                0.8137043
## Prevalence                        0.4500000
## Detection Rate                    0.3750000
## Detection Prevalence              0.4716667
## Balanced Accuracy                 0.8287879

Now the average confusion matrix and AUC across all 10 folds will be computed.

#average confusion matrix
tstcv.perf_cfm <- as.data.frame(do.call('rbind',lapply(tstcv_cfm,FUN=function(cfm)c(cfm$overall, cfm$table))))

cv.tst.perf_cfm <-apply(tstcv.perf_cfm[tstcv.perf$AccuracyPValue<0.01,-c(6:7)],2,mean)

(cv_cfm <- matrix(c(cv.tst.perf_cfm[6], cv.tst.perf_cfm[7], cv.tst.perf_cfm[8], cv.tst.perf_cfm[9]), nrow = 2))
##      [,1] [,2]
## [1,] 22.5  5.8
## [2,]  4.5 27.2
#Average AUC

tstcv_preds_LR_tenfold <- data.frame(tstcv_preds_LR_tenfold)


LR_tenfold_prediction <- (tstcv_preds_LR_tenfold[1] + tstcv_preds_LR_tenfold[2] + tstcv_preds_LR_tenfold[3] + tstcv_preds_LR_tenfold[4] + tstcv_preds_LR_tenfold[5]+tstcv_preds_LR_tenfold[6]+tstcv_preds_LR_tenfold[7]+tstcv_preds_LR_tenfold[8]+tstcv_preds_LR_tenfold[9]+tstcv_preds_LR_tenfold[10]) / ten_NF
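The same average can be written more compactly; an equivalent sketch (the _alt name is illustrative only):

LR_tenfold_prediction_alt <- rowMeans(tstcv_preds_LR_tenfold) # identical values to the sum above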

LR_tenfold_auc <- performance(prediction(LR_tenfold_prediction, tsdata_cv$outcome), 'auc')@y.values[[1]]

LR_tenfold_auc
## [1] 0.9046016
LR_tenfoldcv_time  <- proc.time()-st # time for LR with 10 fold cv 
total_LR_tenfold_time <- common_portion_time_cvtenfold + LR_tenfoldcv_time # total time for LR with 10 fold cv
total_LR_tenfold_time
##    user  system elapsed 
##    0.61    0.18    0.84
LR_tenfold_row <- c("LR with 10 fold CV",df_LR_tenfold[1,], LR_tenfold_auc,data.frame(total_LR_tenfold_time[3])[1,], df_LR_tenfold[6:16,] )

NB with 10 fold CV

st <- proc.time() # start for NB with 10 fold cv
ridx <- sample(1:nrow(trdata_cv), nrow(trdata_cv), replace = FALSE)   # Randomize the data
cv_df <- do.call('rbind', lapply(ten_folds, FUN = function(idx, data = trdata_cv[ridx,]) {
  m <- naiveBayes(outcome~., data = data[-idx,])         # keep one fold for validation
  p <- predict(m, data[idx, -c(14)], type = 'raw')      # predict for that test fold
  pc <- unlist(apply(round(p), 1, which.max)) - 1
  pred_tbl <- table(data[idx, c(14)], pc)               # table(actual, predicted)
  pred_cfm <- caret::confusionMatrix(pred_tbl)
  list(fold = idx, m = m, cfm = pred_cfm)               # store the fold, model, cfm
}
))
cv_df <- as.data.frame(cv_df)
tstcv_preds_NB_tenfold <- lapply(cv_df$m, FUN = function(M, D = tsdata_cv[,-c(14)]) predict(M, D, type = 'raw'))   

tstcv_cfm <- lapply(tstcv_preds_NB_tenfold, FUN = function(P, A = tsdata_cv[[14]])
  
{pred_class <- unlist(apply(round(P), 1, which.max)) - 1
  pred_tbl <- table(pred_class, A)
  pred_cfm <- caret::confusionMatrix(pred_tbl)
  pred_cfm
})
tstcv_perf <- as.data.frame(do.call('rbind', lapply(tstcv_cfm, FUN = function(cfm) c(cfm$overall, cfm$byClass))))

cv_tst_perf_NB_tenfold <- apply(tstcv_perf[tstcv_perf$AccuracyPValue < 0.01, -c(6:7)], 2, mean)

cv_tst_perf_NB_tenfold <- data.frame(cv_tst_perf_NB_tenfold)

cv_tst_perf_NB_tenfold
##                      cv_tst_perf_NB_tenfold
## Accuracy                          0.7950000
## Kappa                             0.5926211
## AccuracyLower                     0.6711984
## AccuracyUpper                     0.8881932
## AccuracyNull                      0.5500000
## Sensitivity                       0.8629630
## Specificity                       0.7393939
## Pos Pred Value                    0.7309665
## Neg Pred Value                    0.8687755
## Precision                         0.7309665
## Recall                            0.8629630
## F1                                0.7912098
## Prevalence                        0.4500000
## Detection Rate                    0.3883333
## Detection Prevalence              0.5316667
## Balanced Accuracy                 0.8011785

Now the average confusion matrix and AUC across all 10 folds will be computed.

#average confusion matrix
tstcv.perf_cfm  <- as.data.frame(do.call('rbind', lapply(tstcv_cfm, FUN = function(cfm) c(cfm$overall, cfm$table))))

cv.tst.perf_cfm  <- apply(tstcv.perf_cfm[tstcv.perf_cfm$AccuracyPValue < 0.01, -c(6:7)], 2, mean)

(cv_confusion_matrix <- matrix(c(cv.tst.perf_cfm [6], cv.tst.perf_cfm [7], cv.tst.perf_cfm [8], cv.tst.perf_cfm [9]), nrow = 2))
##      [,1] [,2]
## [1,] 23.3  8.6
## [2,]  3.7 24.4
NB_tenfold_prediction  <- (tstcv_preds_NB_tenfold[[1]][,2] + tstcv_preds_NB_tenfold[[2]][,2] + tstcv_preds_NB_tenfold[[3]][,2] + tstcv_preds_NB_tenfold[[4]][,2] + tstcv_preds_NB_tenfold[[5]][,2] + tstcv_preds_NB_tenfold[[6]][,2] + tstcv_preds_NB_tenfold[[7]][,2] + tstcv_preds_NB_tenfold[[8]][,2] + tstcv_preds_NB_tenfold[[9]][,2] + tstcv_preds_NB_tenfold[[10]][,2]) / ten_NF
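Since the NB predictions are a list of probability matrices, the same average can be expressed with Reduce; an equivalent sketch (the _alt name is illustrative only):

NB_tenfold_prediction_alt <- Reduce('+', lapply(tstcv_preds_NB_tenfold, function(p) p[,2])) / ten_NF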


NB_tenfold_auc <- performance(prediction(NB_tenfold_prediction , tsdata_cv$outcome), 'auc')@y.values[[1]]

NB_tenfold_auc 
## [1] 0.9012346
NB_tenfoldcv_time  <- proc.time()-st # time for NB with 10 fold cv 
total_NB_tenfold_time <- common_portion_time_cvtenfold + NB_tenfoldcv_time # total time for NB with 10 fold cv
total_NB_tenfold_time
##    user  system elapsed 
##    0.96    0.14    1.17
NB_tenfold_row <- c("NB with 10 fold CV",cv_tst_perf_NB_tenfold[1,], NB_tenfold_auc, data.frame(total_NB_tenfold_time[3])[1,], cv_tst_perf_NB_tenfold[6:16,] )

LR with 5 fold CV

st <- proc.time() # start time for common portion of 5 fold CV
N <- nrow(trdata_cv)
five_NF = 5
five_folds <- split(1:N,cut(1:N, quantile(1:N, probs = seq(0, 1, by =1/five_NF))))
length(five_folds)
## [1] 5
lapply(five_folds,length)
## $`(1,49.4]`
## [1] 48
## 
## $`(49.4,97.8]`
## [1] 48
## 
## $`(97.8,146]`
## [1] 49
## 
## $`(146,195]`
## [1] 48
## 
## $`(195,243]`
## [1] 49
common_portion_time_cvfivefold <- proc.time()-st
st <- proc.time() # start time for LR with 5 fold cv
ridx<-sample(1:nrow(trdata_cv),nrow(trdata_cv),replace=FALSE) # randomize the data

cv_df <- do.call('rbind',lapply(five_folds,FUN=function(idx,data=trdata_cv[ridx,]) {
  m<-glm(outcome~.,data=data[-idx,],family = 'binomial')# keep one fold for validation
   p<-predict(m,data[idx,-c(14)],type = 'response') # predict for that test fold
   pc<-ifelse(p<0.5,0,1)
  pred_tbl<-table(data[idx,c(14)],pc) #table(actual,predicted)
  pred_cfm<-caret::confusionMatrix(pred_tbl)
  list(fold=idx,m=m,cfm=pred_cfm) # store the fold, model,cfm 
  }
)) # lapply repeats over all folds
cv_df <- as.data.frame(cv_df)

tstcv_preds_LR_fivefold <- lapply(cv_df$m,FUN=function(M,D=tsdata_cv[,-c(14)])predict(M,D,type = 'response'))

tstcv_cfm <- lapply(tstcv_preds_LR_fivefold,FUN=function(P,A=tsdata_cv[[14]])
{pred_class<-ifelse(P<0.5,0,1)
  pred_tbl<-table(pred_class,A)
  pred_cfm<-caret::confusionMatrix(pred_tbl)
  pred_cfm
})

tstcv.perf <- as.data.frame(do.call('rbind',lapply(tstcv_cfm,FUN=function(cfm)c(cfm$overall,cfm$byClass))))

cv.tst.perf_LR_fivefold <-apply(tstcv.perf[tstcv.perf$AccuracyPValue<0.01,-c(6:7)],2,mean)
(df_LR_fivefold <- data.frame(cv.tst.perf_LR_fivefold))
##                      cv.tst.perf_LR_fivefold
## Accuracy                           0.8166667
## Kappa                              0.6319451
## AccuracyLower                      0.6959731
## AccuracyUpper                      0.9043612
## AccuracyNull                       0.5500000
## Sensitivity                        0.8296296
## Specificity                        0.8060606
## Pos Pred Value                     0.7787429
## Neg Pred Value                     0.8526540
## Precision                          0.7787429
## Recall                             0.8296296
## F1                                 0.8030427
## Prevalence                         0.4500000
## Detection Rate                     0.3733333
## Detection Prevalence               0.4800000
## Balanced Accuracy                  0.8178451

Now the average confusion matrix and AUC across all 5 folds will be computed.

#average confusion matrix
tstcv.perf_cfm <- as.data.frame(do.call('rbind',lapply(tstcv_cfm,FUN=function(cfm)c(cfm$overall, cfm$table))))

cv.tst.perf_cfm <-apply(tstcv.perf_cfm[tstcv.perf_cfm$AccuracyPValue<0.01,-c(6:7)],2,mean)

(cv_cfm <- matrix(c(cv.tst.perf_cfm[6], cv.tst.perf_cfm[7], cv.tst.perf_cfm[8], cv.tst.perf_cfm[9]), nrow = 2))
##      [,1] [,2]
## [1,] 22.4  6.4
## [2,]  4.6 26.6
#Average AUC

tstcv_preds_LR_fivefold <- data.frame(tstcv_preds_LR_fivefold)


LR_fivefold_prediction <- (tstcv_preds_LR_fivefold[1] + tstcv_preds_LR_fivefold[2] + tstcv_preds_LR_fivefold[3] + tstcv_preds_LR_fivefold[4] + tstcv_preds_LR_fivefold[5]) / five_NF

LR_fivefold_auc <- performance(prediction(LR_fivefold_prediction, tsdata_cv$outcome), 'auc')@y.values[[1]]

LR_fivefold_auc
## [1] 0.9057239
LR_fivefoldcv_time  <- proc.time()-st # time for LR with 5 fold cv 
total_LR_fivefold_time <- common_portion_time_cvfivefold + LR_fivefoldcv_time # total time for LR with 5 fold cv
total_LR_fivefold_time
##    user  system elapsed 
##    0.54    0.12    0.67
LR_fivefold_row <- c("LR with 5 fold CV",df_LR_fivefold[1,], LR_fivefold_auc, data.frame(total_LR_fivefold_time[3])[1,],df_LR_fivefold[6:16,] )

NB with 5 fold CV

st <- proc.time() # start time for NB with 5 fold cv
ridx <- sample(1:nrow(trdata_cv), nrow(trdata_cv), replace = FALSE)   # Randomize the data
cv_df <- do.call('rbind', lapply(five_folds, FUN = function(idx, data = trdata_cv[ridx,]) {
  m <- naiveBayes(outcome~., data = data[-idx,])         # keep one fold for validation
  p <- predict(m, data[idx, -c(14)], type = 'raw')      # predict for that test fold
  pc <- unlist(apply(round(p), 1, which.max)) - 1
  pred_tbl <- table(data[idx, c(14)], pc)               # table(actual, predicted)
  pred_cfm <- caret::confusionMatrix(pred_tbl)
  list(fold = idx, m = m, cfm = pred_cfm)               # store the fold, model, cfm
}
))
cv_df <- as.data.frame(cv_df)
tstcv_preds_NB_fivefold <- lapply(cv_df$m, FUN = function(M, D = tsdata_cv[,-c(14)]) predict(M, D, type = 'raw'))   

tstcv_cfm <- lapply(tstcv_preds_NB_fivefold, FUN = function(P, A = tsdata_cv[[14]])
  
{pred_class <- unlist(apply(round(P), 1, which.max)) - 1
  pred_tbl <- table(pred_class, A)
  pred_cfm <- caret::confusionMatrix(pred_tbl)
  pred_cfm
})
tstcv_perf <- as.data.frame(do.call('rbind', lapply(tstcv_cfm, FUN = function(cfm) c(cfm$overall, cfm$byClass))))

cv_tst_perf_NB_fivefold <- apply(tstcv_perf[tstcv_perf$AccuracyPValue < 0.01, -c(6:7)], 2, mean)

cv_tst_perf_NB_fivefold <- data.frame(cv_tst_perf_NB_fivefold)

cv_tst_perf_NB_fivefold
##                      cv_tst_perf_NB_fivefold
## Accuracy                           0.8033333
## Kappa                              0.6082071
## AccuracyLower                      0.6806344
## AccuracyUpper                      0.8945124
## AccuracyNull                       0.5500000
## Sensitivity                        0.8592593
## Specificity                        0.7575758
## Pos Pred Value                     0.7440726
## Neg Pred Value                     0.8691176
## Precision                          0.7440726
## Recall                             0.8592593
## F1                                 0.7970774
## Prevalence                         0.4500000
## Detection Rate                     0.3866667
## Detection Prevalence               0.5200000
## Balanced Accuracy                  0.8084175

Now the average confusion matrix and AUC across all 5 folds will be computed.

#average confusion matrix
tstcv.perf_cfm  <- as.data.frame(do.call('rbind', lapply(tstcv_cfm, FUN = function(cfm) c(cfm$overall, cfm$table))))

cv.tst.perf_cfm  <- apply(tstcv.perf_cfm[tstcv.perf_cfm$AccuracyPValue < 0.01, -c(6:7)], 2, mean)

(cv_confusion_matrix <- matrix(c(cv.tst.perf_cfm [6], cv.tst.perf_cfm [7], cv.tst.perf_cfm [8], cv.tst.perf_cfm [9]), nrow = 2))
##      [,1] [,2]
## [1,] 23.2    8
## [2,]  3.8   25
NB_fivefold_prediction  <- (tstcv_preds_NB_fivefold[[1]][,2] + tstcv_preds_NB_fivefold[[2]][,2] + tstcv_preds_NB_fivefold[[3]][,2] + tstcv_preds_NB_fivefold[[4]][,2] + tstcv_preds_NB_fivefold[[5]][,2]) / five_NF


NB_fivefold_auc <- performance(prediction(NB_fivefold_prediction , tsdata_cv$outcome), 'auc')@y.values[[1]]

NB_fivefold_auc 
## [1] 0.9012346
NB_fivefoldcv_time  <- proc.time()-st # time for NB with 5 fold cv 
total_NB_fivefold_time <- common_portion_time_cvfivefold + NB_fivefoldcv_time # total time for NB with 5 fold cv
total_NB_fivefold_time
##    user  system elapsed 
##    0.73    0.16    0.94
NB_fivefold_row <- c("NB with 5 fold CV",cv_tst_perf_NB_fivefold[1,], NB_fivefold_auc, data.frame(total_NB_fivefold_time[3])[1,], cv_tst_perf_NB_fivefold[6:16,] )

Let's compare the metrics of all the cross-validation models.

table_cv <- data.frame(matrix(ncol = 12, nrow = 0))
table_cv <- rbind(table_cv, LR_tenfold_row, NB_tenfold_row,LR_fivefold_row, NB_fivefold_row)
colnames(table_cv) <- c("ALGO", "ACCURACY","AUC", "Total elapsed time", "Sensitivity", "Specificity", "Pos Pred Value", "Neg Pred Value", "Precision", "Recall", "F1", "Prevalence", "Detection Rate", "Detection Prevalence", "Balanced Accuracy")
knitr::kable(table_cv)
ALGO ACCURACY AUC Total elapsed time Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall F1 Prevalence Detection Rate Detection Prevalence Balanced Accuracy
LR with 10 fold CV 0.828333333333333 0.904601571268238 0.840000000000002 0.833333333333333 0.824242424242424 0.796576092098895 0.859057398113457 0.796576092098895 0.833333333333333 0.813704341162079 0.45 0.375 0.471666666666667 0.828787878787879
NB with 10 fold CV 0.795 0.901234567901235 1.17 0.862962962962963 0.739393939393939 0.730966484100972 0.8687754901548 0.730966484100972 0.862962962962963 0.791209781839312 0.45 0.388333333333333 0.531666666666667 0.801178451178451
LR with 5 fold CV 0.816666666666667 0.905723905723906 0.670000000000002 0.82962962962963 0.806060606060606 0.77874293012224 0.852653958944282 0.77874293012224 0.82962962962963 0.80304270777955 0.45 0.373333333333333 0.48 0.817845117845118
NB with 5 fold CV 0.803333333333333 0.901234567901235 0.940000000000001 0.859259259259259 0.757575757575758 0.744072622779519 0.869117596859532 0.744072622779519 0.859259259259259 0.797077439361115 0.45 0.386666666666667 0.52 0.808417508417508

Bootstrap

I will reuse the code from the class notes.

st <- proc.time() # start time for bootstrap common portion
set.seed(43)
tstidx_boot <- sample(1:nrow(heart),0.20*nrow(heart),replace = F)
trdata_boot <- heart[-tstidx_boot,]
tsdata_boot <- heart[tstidx_boot,]
common_portion_time_boot <- proc.time() - st # time for bootstrap common portion

LR with bootstrap

st <- proc.time() # start time for LR with bootstrap 
df<-trdata_boot

runModel <- function(df) {glm(outcome~.,data=df[sample(1:nrow(df),nrow(df),replace=T),], family = 'binomial')} # fit on one bootstrap resample
lapplyrunmodel <- function(x) runModel(df) # x (the iteration index) is ignored

models_LR <- lapply(1:200,lapplyrunmodel)
bagging_preds_LR <- lapply(models_LR,FUN=function(M,D=tsdata_boot[,-c(14)])predict(M,D,type='response'))

bagging_cfm_LR <- lapply(bagging_preds_LR,FUN=function(P,A=tsdata_boot[[14]])
{pred_class<-ifelse(P < 0.5, 0, 1)
  pred_tbl<-table(A,pred_class)
  pred_cfm<-caret::confusionMatrix(pred_tbl)
  pred_cfm
})


bagging.perf_LR <- as.data.frame(do.call('rbind',lapply(bagging_cfm_LR,FUN=function(cfm)c(cfm$overall,cfm$byClass))))

bagging.perf.mean_LR <- apply(bagging.perf_LR[bagging.perf_LR$AccuracyPValue<0.01,-c(6:7)],2,mean)
(bagging.perf.mean_LR <- data.frame(bagging.perf.mean_LR))
##                      bagging.perf.mean_LR
## Accuracy                        0.8120000
## Kappa                           0.6227540
## AccuracyLower                   0.6906240
## AccuracyUpper                   0.9008928
## AccuracyNull                    0.5306667
## Sensitivity                     0.7728446
## Specificity                     0.8516396
## Pos Pred Value                  0.8279630
## Neg Pred Value                  0.7989394
## Precision                       0.8279630
## Recall                          0.7728446
## F1                              0.7983576
## Prevalence                      0.4831667
## Detection Rate                  0.3725833
## Detection Prevalence            0.4500000
## Balanced Accuracy               0.8122421

Now the average confusion matrix and AUC across all 200 models will be computed.

#average confusion matrix

bagging.perf_cfm <- as.data.frame(do.call('rbind',lapply(bagging_cfm_LR,FUN=function(cfm)c(cfm$overall,cfm$table))))


boot.tst.perf_cfm  <- apply(bagging.perf_cfm[bagging.perf_cfm$AccuracyPValue < 0.01, -c(6:7)], 2, mean)

(boot_confusion_matrix <- matrix(c(boot.tst.perf_cfm [6], boot.tst.perf_cfm [7], boot.tst.perf_cfm [8], boot.tst.perf_cfm [9]), nrow = 2))
##        [,1]   [,2]
## [1,] 22.355  4.645
## [2,]  6.635 26.365
#average AUC
prediction <- rep(0, nrow(tsdata_boot)) # accumulator over the 60 test rows (the name shadows ROCR::prediction, but the function call below still resolves)

for(i in 1:200) {
  prediction <- prediction + data.frame(bagging_preds_LR)[i]
}

LR_boot_prediction <- prediction / 200

LR_boot_auc <- performance(prediction(LR_boot_prediction, tsdata_boot$outcome), 'auc')@y.values[[1]]

LR_boot_auc
## [1] 0.9068462
LR_boot_time  <- proc.time()-st # time for LR with bootstrap
total_LR_boot_time <- common_portion_time_boot + LR_boot_time # total time for LR with bootstrap
total_LR_boot_time
##    user  system elapsed 
##    7.02    0.14    7.46
LR_boot_row <- c("LR with Bootstrap",bagging.perf.mean_LR[1,], LR_boot_auc,data.frame(total_LR_boot_time[3])[1,], bagging.perf.mean_LR[6:16,] )

NB with bootstrap

st <- proc.time() # start time for NB with bootstrap
df<-trdata_boot

runModel <- function(df) {naiveBayes(outcome~.,data=df[sample(1:nrow(df),nrow(df),replace=T),])} # fit on one bootstrap resample
lapplyrunmodel <- function(x) runModel(df) # x (the iteration index) is ignored

models_NB <- lapply(1:200,lapplyrunmodel)
bagging_preds_NB <- lapply(models_NB,FUN=function(M,D=tsdata_boot[,-c(14)])predict(M,D,type='raw'))

bagging_cfm_NB <- lapply(bagging_preds_NB,FUN=function(P,A=tsdata_boot[[14]])
{pred_class<-unlist(apply(round(P),1,which.max))-1
  pred_tbl<-table(A,pred_class)
  pred_cfm<-caret::confusionMatrix(pred_tbl)
  pred_cfm
})


bagging.perf_NB <- as.data.frame(do.call('rbind',lapply(bagging_cfm_NB,FUN=function(cfm)c(cfm$overall,cfm$byClass))))

bagging.perf.mean_NB <- apply(bagging.perf_NB[bagging.perf_NB$AccuracyPValue<0.01,-c(6:7)],2,mean)
(bagging.perf.mean_NB <- data.frame(bagging.perf.mean_NB))
##                      bagging.perf.mean_NB
## Accuracy                        0.7958115
## Kappa                           0.5936582
## AccuracyLower                   0.6722935
## AccuracyUpper                   0.8886170
## AccuracyNull                    0.5410995
## Sensitivity                     0.7372101
## Specificity                     0.8649522
## Pos Pred Value                  0.8547605
## Neg Pred Value                  0.7475805
## Precision                       0.8547605
## Recall                          0.7372101
## F1                              0.7902610
## Prevalence                      0.5234729
## Detection Rate                  0.3846422
## Detection Prevalence            0.4500000
## Balanced Accuracy               0.8010811

Now the average confusion matrix and AUC across all 200 models will be computed.

#average confusion matrix

bagging.perf_cfm <- as.data.frame(do.call('rbind',lapply(bagging_cfm_NB,FUN=function(cfm)c(cfm$overall,cfm$table))))


boot.tst.perf_cfm  <- apply(bagging.perf_cfm[bagging.perf_cfm$AccuracyPValue < 0.01, -c(6:7)], 2, mean)

(boot_confusion_matrix <- matrix(c(boot.tst.perf_cfm [6], boot.tst.perf_cfm [7], boot.tst.perf_cfm [8], boot.tst.perf_cfm [9]), nrow = 2))
##           [,1]      [,2]
## [1,] 23.078534  3.921466
## [2,]  8.329843 24.670157
#average AUC
prediction <- rep(0, nrow(tsdata_boot)) # accumulator over the 60 test rows

for(i in 1:200) {
  prediction <- prediction + bagging_preds_NB[[i]][,2]
}

NB_boot_prediction <- prediction / 200

NB_boot_auc <- performance(prediction(NB_boot_prediction, tsdata_boot$outcome), 'auc')@y.values[[1]]

NB_boot_auc
## [1] 0.8989899
NB_boot_time  <- proc.time()-st # time for NB with bootstrap
total_NB_boot_time <- common_portion_time_boot + NB_boot_time # total time for NB with bootstrap
total_NB_boot_time
##    user  system elapsed 
##    5.68    0.09    5.83
NB_boot_row <- c("NB with Bootstrap",bagging.perf.mean_NB[1,], NB_boot_auc,data.frame(total_NB_boot_time[3])[1,], bagging.perf.mean_NB[6:16,] )
table_boot <- data.frame(matrix(ncol = 12, nrow = 0))
table_boot <- rbind(table_boot,LR_boot_row, NB_boot_row )
colnames(table_boot) <- c("ALGO", "ACCURACY","AUC", "Total elapsed time", "Sensitivity", "Specificity", "Pos Pred Value", "Neg Pred Value", "Precision", "Recall", "F1", "Prevalence", "Detection Rate", "Detection Prevalence", "Balanced Accuracy")
knitr::kable(table_boot)
ALGO ACCURACY AUC Total elapsed time Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall F1 Prevalence Detection Rate Detection Prevalence Balanced Accuracy
LR with Bootstrap 0.812 0.906846240179573 7.46 0.772844556441993 0.851639640801397 0.827962962962963 0.798939393939394 0.827962962962963 0.772844556441993 0.798357621150137 0.483166666666667 0.372583333333333 0.45 0.812242098621695
NB with Bootstrap 0.795811518324607 0.898989898989899 5.83 0.737210053584752 0.864952207712975 0.854760519681986 0.747580517214025 0.854760519681986 0.737210053584752 0.79026103085344 0.52347294938918 0.384642233856894 0.45 0.801081130648863

Let's put together all the performance metrics from base, cross-validation, and bootstrap.

table_base_cv_boot <- data.frame(matrix(ncol = 12, nrow = 0))
table_base_cv_boot <- rbind(table_base_cv_boot,glm_row,nb_row,LR_tenfold_row,NB_tenfold_row,LR_fivefold_row,NB_fivefold_row,LR_boot_row, NB_boot_row )
colnames(table_base_cv_boot) <- c("ALGO", "ACCURACY","AUC","Total elapsed time", "Sensitivity", "Specificity", "Pos Pred Value", "Neg Pred Value", "Precision", "Recall", "F1", "Prevalence", "Detection Rate", "Detection Prevalence", "Balanced Accuracy")
knitr::kable(table_base_cv_boot)
ALGO ACCURACY AUC Total elapsed time Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall F1 Prevalence Detection Rate Detection Prevalence Balanced Accuracy
base_LR 0.816666666666667 0.906846240179573 4.47 0.785714285714286 0.84375 0.814814814814815 0.818181818181818 0.814814814814815 0.785714285714286 0.8 0.466666666666667 0.366666666666667 0.45 0.814732142857143
base_NB 0.816666666666667 0.897867564534231 4.46 0.766666666666667 0.866666666666667 0.851851851851852 0.787878787878788 0.851851851851852 0.766666666666667 0.807017543859649 0.5 0.383333333333333 0.45 0.816666666666667
LR with 10 fold CV 0.828333333333333 0.904601571268238 0.840000000000002 0.833333333333333 0.824242424242424 0.796576092098895 0.859057398113457 0.796576092098895 0.833333333333333 0.813704341162079 0.45 0.375 0.471666666666667 0.828787878787879
NB with 10 fold CV 0.795 0.901234567901235 1.17 0.862962962962963 0.739393939393939 0.730966484100972 0.8687754901548 0.730966484100972 0.862962962962963 0.791209781839312 0.45 0.388333333333333 0.531666666666667 0.801178451178451
LR with 5 fold CV 0.816666666666667 0.905723905723906 0.670000000000002 0.82962962962963 0.806060606060606 0.77874293012224 0.852653958944282 0.77874293012224 0.82962962962963 0.80304270777955 0.45 0.373333333333333 0.48 0.817845117845118
NB with 5 fold CV 0.803333333333333 0.901234567901235 0.940000000000001 0.859259259259259 0.757575757575758 0.744072622779519 0.869117596859532 0.744072622779519 0.859259259259259 0.797077439361115 0.45 0.386666666666667 0.52 0.808417508417508
LR with Bootstrap 0.812 0.906846240179573 7.46 0.772844556441993 0.851639640801397 0.827962962962963 0.798939393939394 0.827962962962963 0.772844556441993 0.798357621150137 0.483166666666667 0.372583333333333 0.45 0.812242098621695
NB with Bootstrap 0.795811518324607 0.898989898989899 5.83 0.737210053584752 0.864952207712975 0.854760519681986 0.747580517214025 0.854760519681986 0.737210053584752 0.79026103085344 0.52347294938918 0.384642233856894 0.45 0.801081130648863

PART B

For the same dataset, set seed (43) and split 80/20.
Using randomForest, grow three different forests, varying the number of trees at least three times. Start with seeding and a fresh split for each forest. Note down as best as you can the development (engineering) cost as well as the computing cost (elapsed time) for each run, and compare these results with the experiments in Part A. Submit a pdf and an executable script in Python or R.

I will start with 50 trees and then grow 50 more trees in each of the next two runs, using randomForest::grow() to add trees to the existing forest (see the sketch below). All the metrics will be gathered for each model run, with seeding and a fresh split.
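For reference, randomForest::grow() appends new trees to an existing forest without refitting the earlier ones; a minimal sketch, where rf stands for any fitted randomForest object:

rf_bigger <- randomForest::grow(rf, how.many = 50) # e.g. a 50-tree forest becomes a 100-tree forest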

First run: Random Forest

st <- proc.time() #start time for random forest 1st run
set.seed(43)
tstidx_rf50 <- sample(1:nrow(heart),0.20*nrow(heart),replace = F)
trdata_rf50 <- heart[-tstidx_rf50,]
tsdata_rf50 <- heart[tstidx_rf50,]
rf50_model<-randomForest(outcome~.,data=trdata_rf50,ntree=50)

rf50_pred<-predict(rf50_model,tsdata_rf50[,-c(which(names(tsdata_rf50)=="outcome"))])

rf50_mtab<-table(tsdata_rf50$outcome,rf50_pred)

rf50_cmx<-caret::confusionMatrix(rf50_mtab)

rf50_cmx
## Confusion Matrix and Statistics
## 
##    rf50_pred
##      0  1
##   0 22  5
##   1  4 29
##                                          
##                Accuracy : 0.85           
##                  95% CI : (0.7343, 0.929)
##     No Information Rate : 0.5667         
##     P-Value [Acc > NIR] : 2.679e-06      
##                                          
##                   Kappa : 0.6959         
##                                          
##  Mcnemar's Test P-Value : 1              
##                                          
##             Sensitivity : 0.8462         
##             Specificity : 0.8529         
##          Pos Pred Value : 0.8148         
##          Neg Pred Value : 0.8788         
##              Prevalence : 0.4333         
##          Detection Rate : 0.3667         
##    Detection Prevalence : 0.4500         
##       Balanced Accuracy : 0.8495         
##                                          
##        'Positive' Class : 0              
## 
acc_rf50 <- sum(diag(rf50_cmx$table))/sum(rf50_cmx$table)

Predictionprob_rf50 <- predict(rf50_model,tsdata_rf50,type="prob")[, 2]
auc_rf50 <- roc(as.numeric(tsdata_rf50$outcome),as.numeric(as.matrix((Predictionprob_rf50))))$auc
auc_rf50
## Area under the curve: 0.9321
(df_rf50 <- data.frame(rf50_cmx$byClass))
##                      rf50_cmx.byClass
## Sensitivity                 0.8461538
## Specificity                 0.8529412
## Pos Pred Value              0.8148148
## Neg Pred Value              0.8787879
## Precision                   0.8148148
## Recall                      0.8461538
## F1                          0.8301887
## Prevalence                  0.4333333
## Detection Rate              0.3666667
## Detection Prevalence        0.4500000
## Balanced Accuracy           0.8495475
total_rf50_time <- proc.time() - st # time for 1st random forest run
rf50_row <- c("Random Forest with 50 trees",acc_rf50, auc_rf50,data.frame(total_rf50_time[3])[1,], df_rf50[1:11,])

Second run: Random Forest

st <- proc.time() # start time for 2nd random forest run
set.seed(43)
tstidx_rf100 <- sample(1:nrow(heart),0.20*nrow(heart),replace = F)
trdata_rf100 <- heart[-tstidx_rf100,]
tsdata_rf100 <- heart[tstidx_rf100,]
rfgrow100 <- grow(rf50_model,50)

rfgrow_pred100 <- predict(rfgrow100,tsdata_rf100[,-c(which(names(tsdata_rf100)=="outcome"))])

rfgrow_mtab100 <- table(tsdata_rf100$outcome,rfgrow_pred100)
rfgrow_cmx100 <- caret::confusionMatrix(rfgrow_mtab100)
rfgrow_cmx100
## Confusion Matrix and Statistics
## 
##    rfgrow_pred100
##      0  1
##   0 22  5
##   1  5 28
##                                           
##                Accuracy : 0.8333          
##                  95% CI : (0.7148, 0.9171)
##     No Information Rate : 0.55            
##     P-Value [Acc > NIR] : 3.483e-06       
##                                           
##                   Kappa : 0.6633          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8148          
##             Specificity : 0.8485          
##          Pos Pred Value : 0.8148          
##          Neg Pred Value : 0.8485          
##              Prevalence : 0.4500          
##          Detection Rate : 0.3667          
##    Detection Prevalence : 0.4500          
##       Balanced Accuracy : 0.8316          
##                                           
##        'Positive' Class : 0               
## 
acc_rf100 <- sum(diag(rfgrow_cmx100$table))/sum(rfgrow_cmx100$table)

Predictionprob_rf100 <- predict(rfgrow100,tsdata_rf100,type="prob")[, 2]
auc_rf100 <- roc(as.numeric(tsdata_rf100$outcome),as.numeric(as.matrix((Predictionprob_rf100))))$auc
auc_rf100
## Area under the curve: 0.9321
(df_rf100 <- data.frame(rfgrow_cmx100$byClass))
##                      rfgrow_cmx100.byClass
## Sensitivity                      0.8148148
## Specificity                      0.8484848
## Pos Pred Value                   0.8148148
## Neg Pred Value                   0.8484848
## Precision                        0.8148148
## Recall                           0.8148148
## F1                               0.8148148
## Prevalence                       0.4500000
## Detection Rate                   0.3666667
## Detection Prevalence             0.4500000
## Balanced Accuracy                0.8316498
total_rf100_time <- proc.time() - st # time for 2nd random forest run
rf100_row <- c("Random Forest 2nd run 50 more trees",acc_rf100, auc_rf100,data.frame(total_rf100_time[3])[1,], df_rf100[1:11,])

Third run: Random Forest

st <- proc.time() #start time for 3rd random forest run
set.seed(43)
tstidx_rf150 <- sample(1:nrow(heart),0.20*nrow(heart),replace = F)
trdata_rf150 <- heart[-tstidx_rf150,]
tsdata_rf150 <- heart[tstidx_rf150,]
rfgrow150 <- grow(rfgrow100,50)

rfgrow_pred150 <- predict(rfgrow150,tsdata_rf150[,-c(which(names(tsdata_rf150)=="outcome"))])

rfgrow_mtab150 <- table(tsdata_rf150$outcome,rfgrow_pred150)
rfgrow_cmx150 <- caret::confusionMatrix(rfgrow_mtab150)
rfgrow_cmx150
## Confusion Matrix and Statistics
## 
##    rfgrow_pred150
##      0  1
##   0 22  5
##   1  5 28
##                                           
##                Accuracy : 0.8333          
##                  95% CI : (0.7148, 0.9171)
##     No Information Rate : 0.55            
##     P-Value [Acc > NIR] : 3.483e-06       
##                                           
##                   Kappa : 0.6633          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8148          
##             Specificity : 0.8485          
##          Pos Pred Value : 0.8148          
##          Neg Pred Value : 0.8485          
##              Prevalence : 0.4500          
##          Detection Rate : 0.3667          
##    Detection Prevalence : 0.4500          
##       Balanced Accuracy : 0.8316          
##                                           
##        'Positive' Class : 0               
## 
acc_rf150 <- sum(diag(rfgrow_cmx150$table))/sum(rfgrow_cmx150$table)

Predictionprob_rf150 <- predict(rfgrow150,tsdata_rf150,type="prob")[, 2]
auc_rf150 <- roc(as.numeric(tsdata_rf150$outcome),as.numeric(as.matrix((Predictionprob_rf150))))$auc
auc_rf150
## Area under the curve: 0.9321
(df_rf150 <- data.frame(rfgrow_cmx150$byClass))
##                      rfgrow_cmx150.byClass
## Sensitivity                      0.8148148
## Specificity                      0.8484848
## Pos Pred Value                   0.8148148
## Neg Pred Value                   0.8484848
## Precision                        0.8148148
## Recall                           0.8148148
## F1                               0.8148148
## Prevalence                       0.4500000
## Detection Rate                   0.3666667
## Detection Prevalence             0.4500000
## Balanced Accuracy                0.8316498
total_rf150_time <- proc.time() - st #time for 3rd random forest run
rf150_row <- c("Random Forest 3rd run with 50 more trees",acc_rf150, auc_rf150,data.frame(total_rf150_time[3])[1,], df_rf150[1:11,])
table_rf <- data.frame(matrix(ncol = 12, nrow = 0))
table_rf <- rbind(table_rf, rf50_row, rf100_row, rf150_row )
colnames(table_rf) <- c("ALGO", "ACCURACY","AUC","Total elapsed time", "Sensitivity", "Specificity", "Pos Pred Value", "Neg Pred Value", "Precision", "Recall", "F1", "Prevalence", "Detection Rate", "Detection Prevalence", "Balanced Accuracy")
knitr::kable(table_rf)
ALGO ACCURACY AUC Total elapsed time Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall F1 Prevalence Detection Rate Detection Prevalence Balanced Accuracy
Random Forest with 50 trees 0.85 0.932098765432099 0.34 0.846153846153846 0.852941176470588 0.814814814814815 0.878787878787879 0.814814814814815 0.846153846153846 0.830188679245283 0.433333333333333 0.366666666666667 0.45 0.849547511312217
Random Forest 2nd run 50 more trees 0.833333333333333 0.932098765432099 0.34 0.814814814814815 0.848484848484849 0.814814814814815 0.848484848484849 0.814814814814815 0.814814814814815 0.814814814814815 0.45 0.366666666666667 0.45 0.831649831649832
Random Forest 3rd run with 50 more trees 0.833333333333333 0.932098765432099 0.32 0.814814814814815 0.848484848484849 0.814814814814815 0.848484848484849 0.814814814814815 0.814814814814815 0.814814814814815 0.45 0.366666666666667 0.45 0.831649831649832

PART C

Include a summary of your findings. Which of the two methods bootstrap vs cv do you recommend to your customer? And why? Be elaborate. Including computing costs, engineering costs and model performance. Did you incorporate Pareto’s maxim or the Razor and how did these two heuristics influence your decision?

Putting all the model metrics together:

table <- data.frame(matrix(ncol = 12, nrow = 0))
table <- rbind(table, glm_row, nb_row,LR_tenfold_row, NB_tenfold_row,LR_fivefold_row, NB_fivefold_row,LR_boot_row, NB_boot_row, rf50_row, rf100_row, rf150_row )
colnames(table) <- c("ALGO", "ACCURACY","AUC","Total elapsed time", "Sensitivity", "Specificity", "Pos Pred Value", "Neg Pred Value", "Precision", "Recall", "F1", "Prevalence", "Detection Rate", "Detection Prevalence", "Balanced Accuracy")
knitr::kable(table)
ALGO ACCURACY AUC Total elapsed time Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall F1 Prevalence Detection Rate Detection Prevalence Balanced Accuracy
base_LR 0.816666666666667 0.906846240179573 4.47 0.785714285714286 0.84375 0.814814814814815 0.818181818181818 0.814814814814815 0.785714285714286 0.8 0.466666666666667 0.366666666666667 0.45 0.814732142857143
base_NB 0.816666666666667 0.897867564534231 4.46 0.766666666666667 0.866666666666667 0.851851851851852 0.787878787878788 0.851851851851852 0.766666666666667 0.807017543859649 0.5 0.383333333333333 0.45 0.816666666666667
LR with 10 fold CV 0.828333333333333 0.904601571268238 0.840000000000002 0.833333333333333 0.824242424242424 0.796576092098895 0.859057398113457 0.796576092098895 0.833333333333333 0.813704341162079 0.45 0.375 0.471666666666667 0.828787878787879
NB with 10 fold CV 0.795 0.901234567901235 1.17 0.862962962962963 0.739393939393939 0.730966484100972 0.8687754901548 0.730966484100972 0.862962962962963 0.791209781839312 0.45 0.388333333333333 0.531666666666667 0.801178451178451
LR with 5 fold CV 0.816666666666667 0.905723905723906 0.670000000000002 0.82962962962963 0.806060606060606 0.77874293012224 0.852653958944282 0.77874293012224 0.82962962962963 0.80304270777955 0.45 0.373333333333333 0.48 0.817845117845118
NB with 5 fold CV 0.803333333333333 0.901234567901235 0.940000000000001 0.859259259259259 0.757575757575758 0.744072622779519 0.869117596859532 0.744072622779519 0.859259259259259 0.797077439361115 0.45 0.386666666666667 0.52 0.808417508417508
LR with Bootstrap 0.812 0.906846240179573 7.46 0.772844556441993 0.851639640801397 0.827962962962963 0.798939393939394 0.827962962962963 0.772844556441993 0.798357621150137 0.483166666666667 0.372583333333333 0.45 0.812242098621695
NB with Bootstrap 0.795811518324607 0.898989898989899 5.83 0.737210053584752 0.864952207712975 0.854760519681986 0.747580517214025 0.854760519681986 0.737210053584752 0.79026103085344 0.52347294938918 0.384642233856894 0.45 0.801081130648863
Random Forest with 50 trees 0.85 0.932098765432099 0.34 0.846153846153846 0.852941176470588 0.814814814814815 0.878787878787879 0.814814814814815 0.846153846153846 0.830188679245283 0.433333333333333 0.366666666666667 0.45 0.849547511312217
Random Forest 2nd run 50 more trees 0.833333333333333 0.932098765432099 0.34 0.814814814814815 0.848484848484849 0.814814814814815 0.848484848484849 0.814814814814815 0.814814814814815 0.814814814814815 0.45 0.366666666666667 0.45 0.831649831649832
Random Forest 3rd run with 50 more trees 0.833333333333333 0.932098765432099 0.32 0.814814814814815 0.848484848484849 0.814814814814815 0.848484848484849 0.814814814814815 0.814814814814815 0.814814814814815 0.45 0.366666666666667 0.45 0.831649831649832

Comparing the bootstrap and cross-validation metrics above, the cross-validation models perform comparably or better at much lower computing and engineering cost. The bootstrap requires fitting and aggregating 200 resamples, which makes it computationally expensive relative to cross-validation. Comparing 5-fold and 10-fold CV, 5-fold CV has the lower computing cost, although the LR model with 10-fold CV produces the best accuracy.

Applying Occam's razor (the law of parsimony), we can choose the LR model with 5-fold CV: its accuracy is slightly lower than that of the LR model with 10-fold CV, but it is simpler, has a lower computation cost, and has a slightly better AUC.

The random forest with 50 trees performed best among all the models, with the best accuracy, AUC, and computation time. As mentioned at the beginning of the homework, I built a Shiny app where we can run logistic regression, decision trees, and random forests; my plan is to extend it so that any classification dataset can be uploaded and run through various classifiers, freeing us from writing the same script over and over. Please note that most of the code in this homework could be made cleaner and more reusable with functions and loops; I could not do so because of time constraints, but a minimal sketch of the idea follows.
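As an illustration of that refactoring, here is a minimal sketch of a reusable wrapper for the base LR experiment (the function name and argument names are hypothetical; a similar wrapper could cover the other classifiers):

run_base_lr <- function(trdata, tsdata, label = "base_LR") {
  st <- proc.time()                                             # computing cost
  model <- glm(outcome ~ ., data = trdata, family = 'binomial')
  probs <- predict(model, tsdata[, -14], type = 'response')
  cfm <- caret::confusionMatrix(table(tsdata[[14]], ifelse(probs < 0.5, 0, 1)))
  auc <- pROC::auc(pROC::roc(tsdata$outcome, as.numeric(probs)))
  c(label, cfm$overall['Accuracy'], as.numeric(auc), (proc.time() - st)[3],
    cfm$byClass[1:11])                                          # one table row, as in Part A
}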

Engineering cost: It took me about 5 working days to craft this solution and about one week to build the Shiny app. Estimating a data scientist's cost at $60/hr gives (5 x 8 x 60) = $2,400 for this homework and (7 x 8 x 60) = $3,360 for the Shiny app. There are also many other costs involved in a machine learning project, such as infrastructure costs (cloud compute, data storage), integration costs (data pipeline development, API development, documentation), and maintenance costs, which should be considered when estimating engineering cost.