DATA 622: Homework 02
Libraries
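The library chunk is not echoed in the rendered output; judging from the functions used below, it loads roughly the following (a sketch):

library(caret)         # confusionMatrix
library(pROC)          # roc, auc
library(e1071)         # naiveBayes
library(ROCR)          # prediction, performance (averaged-fold AUC)
library(randomForest)  # randomForest, grow
library(knitr)         # kable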
Shiny App
I built a Shiny app for this homework that can be used to explore the data and run logistic regression, a decision tree, and a random forest. My future plan is to extend the app so that any classification dataset can be uploaded and run through various classifiers, which would save us from writing the same script over and over again. Please visit the Shiny app using the following link.
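The app itself is not included in this report. A minimal sketch of the intended structure is below; the input ids, output names, and the placeholder body are illustrative, not the actual app code.

library(shiny)

ui <- fluidPage(
  fileInput("datafile", "Upload a CSV with a binary outcome column"),   # hypothetical input id
  selectInput("classifier", "Classifier",
              choices = c("Logistic Regression", "Decision Tree", "Random Forest")),
  verbatimTextOutput("summary")
)

server <- function(input, output) {
  output$summary <- renderPrint({
    req(input$datafile)
    dat <- read.csv(input$datafile$datapath)
    # a full app would fit the selected classifier here and print caret::confusionMatrix()
    str(dat)
  })
}

shinyApp(ui, server)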
PART-A
STEP#0: Pick any two classifiers of (SVM, Logistic, DecisionTree, NaiveBayes). Pick the heart or ecoli dataset. Heart is simpler and ecoli compounds the problem as it is NOT a balanced dataset. From a grading perspective both carry the same weight.
STEP#1: For each classifier, set a seed (43).
STEP#2: Do an 80/20 split and determine the Accuracy, AUC and as many metrics as returned by the caret package (confusionMatrix). Call this the base_metric. Note down as best as you can the development (engineering) cost as well as the computing cost (elapsed time).
Start with the original dataset and set a seed (43). Then run a cross validation of 5 and 10 folds of the model on the training set. Determine the same set of metrics and compare the cv_metrics with the base_metric. Note down as best as you can the development (engineering) cost as well as the computing cost (elapsed time).
Start with the original dataset and set a seed (43). Then run a bootstrap of 200 resamples and compute the same set of metrics, and for each of the two classifiers build a three-column table for each experiment (base, bootstrap, cross-validated). Note down as best as you can the development (engineering) cost as well as the computing cost (elapsed time).
Base Model
Load Data
I downloaded the heart data from https://archive.ics.uci.edu/ml/datasets/Heart+Disease. I will skip most of the data exploration here, except for a few necessary checks, and jump directly into what the homework instructions ask for. I picked logistic regression and Naive Bayes for Part A. I will time each step to record elapsed time.
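The load itself is not echoed; a minimal sketch (assuming the downloaded data were saved locally as heart.csv) is:

heart <- read.csv("heart.csv")   # assumed local file name
head(heart)
dim(heart)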
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1 63 1 3 145 233 1 0 150 0 2.3 0 0 1
## 2 37 1 2 130 250 0 1 187 0 3.5 0 0 2
## 3 41 0 1 130 204 0 0 172 0 1.4 2 0 2
## 4 56 1 1 120 236 0 1 178 0 0.8 2 0 2
## 5 57 0 0 120 354 0 1 163 1 0.6 2 0 2
## 6 57 1 0 140 192 0 1 148 0 0.4 1 0 1
## outcome
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
## [1] 303 14
Let us rename the first column and convert the outcome to a factor.
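A sketch of that step (assuming the raw first column imported with a mangled name such as "ï..age"):

colnames(heart)[1] <- "age"                 # restore a clean name for the first column
heart$outcome <- as.factor(heart$outcome)   # treat the target as a two-level factor
str(heart)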
## 'data.frame': 303 obs. of 14 variables:
## $ age : int 63 37 41 56 57 57 56 44 52 57 ...
## $ sex : int 1 1 0 1 0 1 0 1 1 1 ...
## $ cp : int 3 2 1 1 0 0 1 1 2 2 ...
## $ trestbps: int 145 130 130 120 120 140 140 120 172 150 ...
## $ chol : int 233 250 204 236 354 192 294 263 199 168 ...
## $ fbs : int 1 0 0 0 0 0 0 0 1 0 ...
## $ restecg : int 0 1 0 1 1 1 0 1 1 1 ...
## $ thalach : int 150 187 172 178 163 148 153 173 162 174 ...
## $ exang : int 0 0 0 0 1 0 0 0 0 0 ...
## $ oldpeak : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ slope : int 0 0 2 2 2 1 1 2 2 2 ...
## $ ca : int 0 0 0 0 0 0 0 0 0 0 ...
## $ thal : int 1 2 2 2 2 1 2 3 3 2 ...
## $ outcome : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
Let's check whether any column has a constant value across all its rows.
## age sex cp trestbps chol fbs restecg thalach
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## exang oldpeak slope ca thal outcome
## FALSE FALSE FALSE FALSE FALSE FALSE
##
## 0 1
## 138 165
## [1] "0" "1"
## [1] "binary classification"
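The checks above can be reproduced with something along these lines (a sketch):

sapply(heart, function(x) length(unique(x)) == 1)   # TRUE would flag a constant column
table(heart$outcome)                                # class balance: 138 vs 165
levels(heart$outcome)
ifelse(nlevels(heart$outcome) == 2, "binary classification", "multiclass classification")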
Split the data
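The split chunk is not echoed; it presumably follows the same pattern used later for the CV, bootstrap and random forest splits (a sketch):

set.seed(43)
tstidx <- sample(1:nrow(heart), 0.20 * nrow(heart), replace = FALSE)  # 20% test indices
trdata <- heart[-tstidx, ]   # 80% training rows
tsdata <- heart[tstidx, ]    # 20% test rows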
Baseline performance of LR
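The fit itself is not shown; a sketch of the assumed pattern, wrapped in system.time() (glm_time is a hypothetical name; the elapsed figures reported below presumably also cover the shared load and split steps):

glm_time <- system.time(
  glm.model <- glm(outcome ~ ., data = trdata, family = 'binomial')   # logistic regression on the training split
)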
predts_glm <- predict(glm.model,tsdata[,1:13],type = 'response')
predtsclass_glm <- ifelse(predts_glm<0.5,0,1)
tscfm_glm <- caret::confusionMatrix(table(tsdata[[14]],predtsclass_glm))
tscfm_glm
## Confusion Matrix and Statistics
##
## predtsclass_glm
## 0 1
## 0 22 5
## 1 6 27
##
## Accuracy : 0.8167
## 95% CI : (0.6956, 0.9048)
## No Information Rate : 0.5333
## P-Value [Acc > NIR] : 4.344e-06
##
## Kappa : 0.6309
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.7857
## Specificity : 0.8438
## Pos Pred Value : 0.8148
## Neg Pred Value : 0.8182
## Prevalence : 0.4667
## Detection Rate : 0.3667
## Detection Prevalence : 0.4500
## Balanced Accuracy : 0.8147
##
## 'Positive' Class : 0
##
acc_glm <- sum(diag(tscfm_glm$table))/sum(tscfm_glm$table)
roc_glm <- roc(tsdata$outcome, as.numeric(predts_glm))
auc_glm <- auc(roc_glm)
## tscfm_glm.byClass
## Sensitivity 0.7857143
## Specificity 0.8437500
## Pos Pred Value 0.8148148
## Neg Pred Value 0.8181818
## Precision 0.8148148
## Recall 0.7857143
## F1 0.8000000
## Prevalence 0.4666667
## Detection Rate 0.3666667
## Detection Prevalence 0.4500000
## Balanced Accuracy 0.8147321
## user system elapsed
## 1.28 3.04 4.47
Baseline performance of NB
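As with the logistic model, the fit is not echoed; a sketch of the assumed step (nb_time is a hypothetical name):

nb_time <- system.time(
  nb.model <- naiveBayes(outcome ~ ., data = trdata)   # e1071 Naive Bayes on the same training split
)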
predts_nb <- predict(nb.model,tsdata[,1:13],type='raw')
predtsclass_nb <- unlist(apply(round(predts_nb),1,which.max))-1
tstbl_nb <- table(tsdata[[14]], predtsclass_nb)
tscfm_nb <- caret::confusionMatrix(tstbl_nb)
tscfm_nb
## Confusion Matrix and Statistics
##
## predtsclass_nb
## 0 1
## 0 23 4
## 1 7 26
##
## Accuracy : 0.8167
## 95% CI : (0.6956, 0.9048)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 3.781e-07
##
## Kappa : 0.6333
##
## Mcnemar's Test P-Value : 0.5465
##
## Sensitivity : 0.7667
## Specificity : 0.8667
## Pos Pred Value : 0.8519
## Neg Pred Value : 0.7879
## Prevalence : 0.5000
## Detection Rate : 0.3833
## Detection Prevalence : 0.4500
## Balanced Accuracy : 0.8167
##
## 'Positive' Class : 0
##
acc_nb <- sum(diag(tscfm_nb$table))/sum(tscfm_nb$table)
roc_nb <- roc(tsdata$outcome, as.numeric(predts_nb[,2]))
auc_nb <- auc(roc_nb)
## tscfm_nb.byClass
## Sensitivity 0.7666667
## Specificity 0.8666667
## Pos Pred Value 0.8518519
## Neg Pred Value 0.7878788
## Precision 0.8518519
## Recall 0.7666667
## F1 0.8070175
## Prevalence 0.5000000
## Detection Rate 0.3833333
## Detection Prevalence 0.4500000
## Balanced Accuracy 0.8166667
## user system elapsed
## 1.35 3.03 4.46
Let’s compare the metrics of base LR and NB.
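The rows glm_row and nb_row are built from the objects computed above; their construction is not echoed, but a sketch of how such a row could be assembled is (glm_elapsed and nb_elapsed are hypothetical names for the elapsed-time figures noted above):

glm_row <- c("base_LR", acc_glm, auc_glm, glm_elapsed, tscfm_glm$byClass)  # 15 entries: label, accuracy, AUC, time, 11 byClass metrics
nb_row  <- c("base_NB", acc_nb, auc_nb, nb_elapsed, tscfm_nb$byClass)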
table_base <- data.frame(matrix(ncol = 15, nrow = 0))
table_base <- rbind(table_base, glm_row, nb_row)
colnames(table_base) <- c("ALGO", "ACCURACY","AUC", "Total elapsed time", "Sensitivity", "Specificity", "Pos Pred Value", "Neg Pred Value", "Precision", "Recall", "F1", "Prevalence", "Detection Rate", "Detection Prevalence", "Balanced Accuracy")
knitr::kable(table_base)
| ALGO | ACCURACY | AUC | Total elapsed time | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Precision | Recall | F1 | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| base_LR | 0.816666666666667 | 0.906846240179573 | 4.47 | 0.785714285714286 | 0.84375 | 0.814814814814815 | 0.818181818181818 | 0.814814814814815 | 0.785714285714286 | 0.8 | 0.466666666666667 | 0.366666666666667 | 0.45 | 0.814732142857143 |
| base_NB | 0.816666666666667 | 0.897867564534231 | 4.46 | 0.766666666666667 | 0.866666666666667 | 0.851851851851852 | 0.787878787878788 | 0.851851851851852 | 0.766666666666667 | 0.807017543859649 | 0.5 | 0.383333333333333 | 0.45 | 0.816666666666667 |
Cross validation
I am using all code from class notes.
set.seed(43)
tstidx_cv <- sample(1:nrow(heart),0.20*nrow(heart),replace = F)
trdata_cv <- heart[-tstidx_cv,]
tsdata_cv <- heart[tstidx_cv,]
N <- nrow(trdata_cv)
ten_NF = 10
ten_folds <- split(1:N,cut(1:N, quantile(1:N, probs = seq(0, 1, by =1/ten_NF))))
length(ten_folds)
## [1] 10
## $`(1,25.2]`
## [1] 24
##
## $`(25.2,49.4]`
## [1] 24
##
## $`(49.4,73.6]`
## [1] 24
##
## $`(73.6,97.8]`
## [1] 24
##
## $`(97.8,122]`
## [1] 25
##
## $`(122,146]`
## [1] 24
##
## $`(146,170]`
## [1] 24
##
## $`(170,195]`
## [1] 24
##
## $`(195,219]`
## [1] 24
##
## $`(219,243]`
## [1] 25
LR with 10 fold CV
ridx<-sample(1:nrow(trdata_cv),nrow(trdata_cv),replace=FALSE) # randomize the data
cv_df <- do.call('rbind',lapply(ten_folds,FUN=function(idx,data=trdata_cv[ridx,]) {
m<-glm(outcome~.,data=data[-idx,],family = 'binomial')# keep one fold for validation
p<-predict(m,data[idx,-c(14)],type = 'response') # predict for that test fold
pc<-ifelse(p<0.5,0,1)
pred_tbl<-table(data[idx,c(14)],pc) #table(actual,predicted)
pred_cfm<-caret::confusionMatrix(pred_tbl)
list(fold=idx,m=m,cfm=pred_cfm) # store the fold, model,cfm
}
)) # lapply repeats over all folds
cv_df <- as.data.frame(cv_df)
#tstcv.perf<-as.data.frame(do.call('rbind',lapply(cv_df$cfm,FUN=function(cfm)c(cfm$overall,cfm$byClass))))
# (cv.tst.perf<-apply(tstcv.perf[tstcv.perf$AccuracyPValue<0.01,-c(6:7)],2,mean))
tstcv_preds_LR_tenfold <- lapply(cv_df$m,FUN=function(M,D=tsdata_cv[,-c(14)])predict(M,D,type = 'response'))
tstcv_cfm <- lapply(tstcv_preds_LR_tenfold,FUN=function(P,A=tsdata_cv[[14]])
{pred_class<-ifelse(P<0.5,0,1)
pred_tbl<-table(pred_class,A)
pred_cfm<-caret::confusionMatrix(pred_tbl)
pred_cfm
})
tstcv.perf <- as.data.frame(do.call('rbind',lapply(tstcv_cfm,FUN=function(cfm)c(cfm$overall,cfm$byClass))))
cv.tst.perf_LR_tenfold <-apply(tstcv.perf[tstcv.perf$AccuracyPValue<0.01,-c(6:7)],2,mean)
# cv.tst.perf.var_LR_tenfold <- apply(tstcv.perf[tstcv.perf$AccuracyPValue<0.01,-c(6:7)],2,sd)
## cv.tst.perf_LR_tenfold
## Accuracy 0.8283333
## Kappa 0.6547540
## AccuracyLower 0.7091456
## AccuracyUpper 0.9132489
## AccuracyNull 0.5500000
## Sensitivity 0.8333333
## Specificity 0.8242424
## Pos Pred Value 0.7965761
## Neg Pred Value 0.8590574
## Precision 0.7965761
## Recall 0.8333333
## F1 0.8137043
## Prevalence 0.4500000
## Detection Rate 0.3750000
## Detection Prevalence 0.4716667
## Balanced Accuracy 0.8287879
Now, average confusion matrix and AUC from all 10 folds will be computed.
#average confusion matrix
tstcv.perf_cfm <- as.data.frame(do.call('rbind',lapply(tstcv_cfm,FUN=function(cfm)c(cfm$overall, cfm$table))))
cv.tst.perf_cfm <-apply(tstcv.perf_cfm[tstcv.perf$AccuracyPValue<0.01,-c(6:7)],2,mean)
(cv_cfm <- matrix(c(cv.tst.perf_cfm[6], cv.tst.perf_cfm[7], cv.tst.perf_cfm[8], cv.tst.perf_cfm[9]), nrow = 2))
## [,1] [,2]
## [1,] 22.5 5.8
## [2,] 4.5 27.2
#Average AUC
tstcv_preds_LR_tenfold <- data.frame(tstcv_preds_LR_tenfold)
LR_tenfold_prediction <- rowMeans(tstcv_preds_LR_tenfold) # average the ten per-fold test-set probabilities
LR_tenfold_auc <- performance(prediction(LR_tenfold_prediction, tsdata_cv$outcome), 'auc')@y.values[[1]]
LR_tenfold_auc
## [1] 0.9046016
total_LR_tenfold_time <- common_portion_time_cvtenfold + LR_tenfoldcv_time # total time for LR with 10 fold cv
total_LR_tenfold_time
## user system elapsed
## 0.61 0.18 0.84
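The two timing components summed above are not captured in the code shown; presumably each chunk was wrapped in system.time(), along these lines (a sketch; the braced bodies are placeholders for the chunks above):

common_portion_time_cvtenfold <- system.time({
  # seed, 80/20 split and fold construction shared by the LR and NB runs
})
LR_tenfoldcv_time <- system.time({
  # the 10-fold LR loop, test-set scoring and metric aggregation shown above
})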
NB with 10 fold CV
cv_df <- do.call('rbind', lapply(ten_folds, FUN = function(idx, data = trdata_cv[ridx,]) {
m <- naiveBayes(outcome~., data = data[-idx,]) # keep one fold for validation
p <- predict(m, data[idx, -c(14)], type = 'raw') # predict for that test fold
pc <- unlist(apply(round(p), 1, which.max)) - 1
pred_tbl <- table(data[idx, c(14)], pc) # table(actual, predicted)
pred_cfm <- caret::confusionMatrix(pred_tbl)
list(fold = idx, m = m, cfm = pred_cfm) # store the fold, model, cfm
}
))
tstcv_preds_NB_tenfold <- lapply(cv_df$m, FUN = function(M, D = tsdata_cv[,-c(14)]) predict(M, D, type = 'raw'))
tstcv_cfm <- lapply(tstcv_preds_NB_tenfold, FUN = function(P, A = tsdata_cv[[14]])
{pred_class <- unlist(apply(round(P), 1, which.max)) - 1
pred_tbl <- table(pred_class, A)
pred_cfm <- caret::confusionMatrix(pred_tbl)
pred_cfm
})
tstcv_perf <- as.data.frame(do.call('rbind', lapply(tstcv_cfm, FUN = function(cfm) c(cfm$overall, cfm$byClass))))
cv_tst_perf_NB_tenfold <- apply(tstcv_perf[tstcv_perf$AccuracyPValue < 0.01, -c(6:7)], 2, mean)
cv_tst_perf_NB_tenfold <- data.frame(cv_tst_perf_NB_tenfold)
cv_tst_perf_NB_tenfold
## cv_tst_perf_NB_tenfold
## Accuracy 0.7950000
## Kappa 0.5926211
## AccuracyLower 0.6711984
## AccuracyUpper 0.8881932
## AccuracyNull 0.5500000
## Sensitivity 0.8629630
## Specificity 0.7393939
## Pos Pred Value 0.7309665
## Neg Pred Value 0.8687755
## Precision 0.7309665
## Recall 0.8629630
## F1 0.7912098
## Prevalence 0.4500000
## Detection Rate 0.3883333
## Detection Prevalence 0.5316667
## Balanced Accuracy 0.8011785
Now, average confusion matrix and AUC from all 10 folds will be computed.
#average confusion matrix
tstcv.perf_cfm <- as.data.frame(do.call('rbind', lapply(tstcv_cfm, FUN = function(cfm) c(cfm$overall, cfm$table))))
cv.tst.perf_cfm <- apply(tstcv.perf_cfm[tstcv.perf_cfm$AccuracyPValue < 0.01, -c(6:7)], 2, mean)
(cv_confusion_matrix <- matrix(c(cv.tst.perf_cfm[6], cv.tst.perf_cfm[7], cv.tst.perf_cfm[8], cv.tst.perf_cfm[9]), nrow = 2))
## [,1] [,2]
## [1,] 23.3 8.6
## [2,] 3.7 24.4
NB_tenfold_prediction <- rowMeans(sapply(tstcv_preds_NB_tenfold, function(p) p[,2])) # average the class-1 probabilities across the ten fold models
NB_tenfold_auc <- performance(prediction(NB_tenfold_prediction , tsdata_cv$outcome), 'auc')@y.values[[1]]
NB_tenfold_auc
## [1] 0.9012346
total_NB_tenfold_time <- common_portion_time_cvtenfold + NB_tenfoldcv_time # total time for NB with 10 fold cv
total_NB_tenfold_time
## user system elapsed
## 0.96 0.14 1.17
LR with 5 fold CV
N <- nrow(trdata_cv)
five_NF = 5
five_folds <- split(1:N,cut(1:N, quantile(1:N, probs = seq(0, 1, by =1/five_NF))))
length(five_folds)
## [1] 5
## $`(1,49.4]`
## [1] 48
##
## $`(49.4,97.8]`
## [1] 48
##
## $`(97.8,146]`
## [1] 49
##
## $`(146,195]`
## [1] 48
##
## $`(195,243]`
## [1] 49
ridx<-sample(1:nrow(trdata_cv),nrow(trdata_cv),replace=FALSE) # randomize the data
cv_df <- do.call('rbind',lapply(five_folds,FUN=function(idx,data=trdata_cv[ridx,]) {
m<-glm(outcome~.,data=data[-idx,],family = 'binomial')# keep one fold for validation
p<-predict(m,data[idx,-c(14)],type = 'response') # predict for that test fold
pc<-ifelse(p<0.5,0,1)
pred_tbl<-table(data[idx,c(14)],pc) #table(actual,predicted)
pred_cfm<-caret::confusionMatrix(pred_tbl)
list(fold=idx,m=m,cfm=pred_cfm) # store the fold, model,cfm
}
)) # lapply repeats over all folds
cv_df <- as.data.frame(cv_df)
#tstcv.perf<-as.data.frame(do.call('rbind',lapply(cv_df$cfm,FUN=function(cfm)c(cfm$overall,cfm$byClass))))
# (cv.tst.perf<-apply(tstcv.perf[tstcv.perf$AccuracyPValue<0.01,-c(6:7)],2,mean))
tstcv_preds_LR_fivefold <- lapply(cv_df$m,FUN=function(M,D=tsdata_cv[,-c(14)])predict(M,D,type = 'response'))
tstcv_cfm <- lapply(tstcv_preds_LR_fivefold,FUN=function(P,A=tsdata_cv[[14]])
{pred_class<-ifelse(P<0.5,0,1)
pred_tbl<-table(pred_class,A)
pred_cfm<-caret::confusionMatrix(pred_tbl)
pred_cfm
})
tstcv.perf <- as.data.frame(do.call('rbind',lapply(tstcv_cfm,FUN=function(cfm)c(cfm$overall,cfm$byClass))))
cv.tst.perf_LR_fivefold <-apply(tstcv.perf[tstcv.perf$AccuracyPValue<0.01,-c(6:7)],2,mean)
# cv.tst.perf.var_LR_fivefold <- apply(tstcv.perf[tstcv.perf$AccuracyPValue<0.01,-c(6:7)],2,sd)
## cv.tst.perf_LR_fivefold
## Accuracy 0.8166667
## Kappa 0.6319451
## AccuracyLower 0.6959731
## AccuracyUpper 0.9043612
## AccuracyNull 0.5500000
## Sensitivity 0.8296296
## Specificity 0.8060606
## Pos Pred Value 0.7787429
## Neg Pred Value 0.8526540
## Precision 0.7787429
## Recall 0.8296296
## F1 0.8030427
## Prevalence 0.4500000
## Detection Rate 0.3733333
## Detection Prevalence 0.4800000
## Balanced Accuracy 0.8178451
Now, average confusion matrix and AUC from all 5 folds will be computed.
#average confusion matrix
tstcv.perf_cfm <- as.data.frame(do.call('rbind',lapply(tstcv_cfm,FUN=function(cfm)c(cfm$overall, cfm$table))))
cv.tst.perf_cfm <-apply(tstcv.perf_cfm[tstcv.perf_cfm$AccuracyPValue<0.01,-c(6:7)],2,mean)
(cv_cfm <- matrix(c(cv.tst.perf_cfm[6], cv.tst.perf_cfm[7], cv.tst.perf_cfm[8], cv.tst.perf_cfm[9]), nrow = 2))
## [,1] [,2]
## [1,] 22.4 6.4
## [2,] 4.6 26.6
#Average AUC
tstcv_preds_LR_fivefold <- data.frame(tstcv_preds_LR_fivefold)
LR_fivefold_prediction <- rowMeans(tstcv_preds_LR_fivefold) # average the five per-fold test-set probabilities
LR_fivefold_auc <- performance(prediction(LR_fivefold_prediction, tsdata_cv$outcome), 'auc')@y.values[[1]]
LR_fivefold_auc
## [1] 0.9057239
total_LR_fivefold_time <- common_portion_time_cvfivefold + LR_fivefoldcv_time # total time for LR with 5 fold cv
total_LR_fivefold_time
## user system elapsed
## 0.54 0.12 0.67
NB with 5 fold CV
cv_df <- do.call('rbind', lapply(five_folds, FUN = function(idx, data = trdata_cv[ridx,]) {
m <- naiveBayes(outcome~., data = data[-idx,]) # keep one fold for validation
p <- predict(m, data[idx, -c(14)], type = 'raw') # predict for that test fold
pc <- unlist(apply(round(p), 1, which.max)) - 1
pred_tbl <- table(data[idx, c(14)], pc) # table(actual, predicted)
pred_cfm <- caret::confusionMatrix(pred_tbl)
list(fold = idx, m = m, cfm = pred_cfm) # store the fold, model, cfm
}
))
tstcv_preds_NB_fivefold <- lapply(cv_df$m, FUN = function(M, D = tsdata_cv[,-c(14)]) predict(M, D, type = 'raw'))
tstcv_cfm <- lapply(tstcv_preds_NB_fivefold, FUN = function(P, A = tsdata_cv[[14]])
{pred_class <- unlist(apply(round(P), 1, which.max)) - 1
pred_tbl <- table(pred_class, A)
pred_cfm <- caret::confusionMatrix(pred_tbl)
pred_cfm
})
tstcv_perf <- as.data.frame(do.call('rbind', lapply(tstcv_cfm, FUN = function(cfm) c(cfm$overall, cfm$byClass))))
cv_tst_perf_NB_fivefold <- apply(tstcv_perf[tstcv_perf$AccuracyPValue < 0.01, -c(6:7)], 2, mean)
cv_tst_perf_NB_fivefold <- data.frame(cv_tst_perf_NB_fivefold)
cv_tst_perf_NB_fivefold
## cv_tst_perf_NB_fivefold
## Accuracy 0.8033333
## Kappa 0.6082071
## AccuracyLower 0.6806344
## AccuracyUpper 0.8945124
## AccuracyNull 0.5500000
## Sensitivity 0.8592593
## Specificity 0.7575758
## Pos Pred Value 0.7440726
## Neg Pred Value 0.8691176
## Precision 0.7440726
## Recall 0.8592593
## F1 0.7970774
## Prevalence 0.4500000
## Detection Rate 0.3866667
## Detection Prevalence 0.5200000
## Balanced Accuracy 0.8084175
Now, average confusion matrix and AUC from all 5 folds will be computed.
#average confusion matrix
tstcv.perf_cfm <- as.data.frame(do.call('rbind', lapply(tstcv_cfm, FUN = function(cfm) c(cfm$overall, cfm$table))))
cv.tst.perf_cfm <- apply(tstcv.perf_cfm[tstcv.perf_cfm$AccuracyPValue < 0.01, -c(6:7)], 2, mean)
(cv_confusion_matrix <- matrix(c(cv.tst.perf_cfm[6], cv.tst.perf_cfm[7], cv.tst.perf_cfm[8], cv.tst.perf_cfm[9]), nrow = 2))
## [,1] [,2]
## [1,] 23.2 8
## [2,] 3.8 25
NB_fivefold_prediction <- rowMeans(sapply(tstcv_preds_NB_fivefold, function(p) p[,2])) # average the class-1 probabilities across the five fold models
NB_fivefold_auc <- performance(prediction(NB_fivefold_prediction , tsdata_cv$outcome), 'auc')@y.values[[1]]
NB_fivefold_auc
## [1] 0.9012346
total_NB_fivefold_time <- common_portion_time_cvfivefold + NB_fivefoldcv_time # total time for NB with 5 fold cv
total_NB_fivefold_time
## user system elapsed
## 0.73 0.16 0.94
NB_fivefold_row <- c("NB with 5 fold CV",cv_tst_perf_NB_fivefold[1,], NB_fivefold_auc, data.frame(total_NB_fivefold_time[3])[1,], cv_tst_perf_NB_fivefold[6:16,] )
Let's compare the metrics of all cross-validation models.
table_cv <- data.frame(matrix(ncol = 15, nrow = 0))
table_cv <- rbind(table_cv, LR_tenfold_row, NB_tenfold_row,LR_fivefold_row, NB_fivefold_row)
colnames(table_cv) <- c("ALGO", "ACCURACY","AUC", "Total elapsed time", "Sensitivity", "Specificity", "Pos Pred Value", "Neg Pred Value", "Precision", "Recall", "F1", "Prevalence", "Detection Rate", "Detection Prevalence", "Balanced Accuracy")
knitr::kable(table_cv)
| ALGO | ACCURACY | AUC | Total elapsed time | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Precision | Recall | F1 | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LR with 10 fold CV | 0.828333333333333 | 0.904601571268238 | 0.840000000000002 | 0.833333333333333 | 0.824242424242424 | 0.796576092098895 | 0.859057398113457 | 0.796576092098895 | 0.833333333333333 | 0.813704341162079 | 0.45 | 0.375 | 0.471666666666667 | 0.828787878787879 |
| NB with 10 fold CV | 0.795 | 0.901234567901235 | 1.17 | 0.862962962962963 | 0.739393939393939 | 0.730966484100972 | 0.8687754901548 | 0.730966484100972 | 0.862962962962963 | 0.791209781839312 | 0.45 | 0.388333333333333 | 0.531666666666667 | 0.801178451178451 |
| LR with 5 fold CV | 0.816666666666667 | 0.905723905723906 | 0.670000000000002 | 0.82962962962963 | 0.806060606060606 | 0.77874293012224 | 0.852653958944282 | 0.77874293012224 | 0.82962962962963 | 0.80304270777955 | 0.45 | 0.373333333333333 | 0.48 | 0.817845117845118 |
| NB with 5 fold CV | 0.803333333333333 | 0.901234567901235 | 0.940000000000001 | 0.859259259259259 | 0.757575757575758 | 0.744072622779519 | 0.869117596859532 | 0.744072622779519 | 0.859259259259259 | 0.797077439361115 | 0.45 | 0.386666666666667 | 0.52 | 0.808417508417508 |
Bootstrap
I will use all code from class notes.
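The bootstrap split (trdata_boot / tsdata_boot) is not echoed; it presumably mirrors the earlier splits (a sketch):

set.seed(43)
tstidx_boot <- sample(1:nrow(heart), 0.20 * nrow(heart), replace = FALSE)
trdata_boot <- heart[-tstidx_boot, ]   # 80% training rows used for resampling
tsdata_boot <- heart[tstidx_boot, ]    # 20% held-out test rows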
LR with bootstrap
df<-trdata_boot
runModel <- function(df) {glm(outcome~.,data=df[sample(1:nrow(df),nrow(df),replace=T),], family = 'binomial')}
lapplyrunmodel <- function(x)runModel(df)
models_LR <- lapply(1:200,lapplyrunmodel)
bagging_preds_LR <- lapply(models_LR,FUN=function(M,D=tsdata_boot[,-c(14)])predict(M,D,type='response'))
bagging_cfm_LR <- lapply(bagging_preds_LR,FUN=function(P,A=tsdata_boot[[14]])
{pred_class<-ifelse(P < 0.5, 0, 1)
pred_tbl<-table(A,pred_class)
pred_cfm<-caret::confusionMatrix(pred_tbl)
pred_cfm
})
bagging.perf_LR <- as.data.frame(do.call('rbind',lapply(bagging_cfm_LR,FUN=function(cfm)c(cfm$overall,cfm$byClass))))
bagging.perf.mean_LR <- apply(bagging.perf_LR[bagging.perf_LR$AccuracyPValue<0.01,-c(6:7)],2,mean)
## bagging.perf.mean_LR
## Accuracy 0.8120000
## Kappa 0.6227540
## AccuracyLower 0.6906240
## AccuracyUpper 0.9008928
## AccuracyNull 0.5306667
## Sensitivity 0.7728446
## Specificity 0.8516396
## Pos Pred Value 0.8279630
## Neg Pred Value 0.7989394
## Precision 0.8279630
## Recall 0.7728446
## F1 0.7983576
## Prevalence 0.4831667
## Detection Rate 0.3725833
## Detection Prevalence 0.4500000
## Balanced Accuracy 0.8122421
Now, average confusion matrix and AUC from all 200 models will be computed.
#average confusion matrix
bagging.perf_cfm <- as.data.frame(do.call('rbind',lapply(bagging_cfm_LR,FUN=function(cfm)c(cfm$overall,cfm$table))))
boot.tst.perf_cfm <- apply(bagging.perf_cfm[bagging.perf_cfm$AccuracyPValue < 0.01, -c(6:7)], 2, mean)
(boot_confusion_matrix <- matrix(c(boot.tst.perf_cfm[6], boot.tst.perf_cfm[7], boot.tst.perf_cfm[8], boot.tst.perf_cfm[9]), nrow = 2))
## [,1] [,2]
## [1,] 22.355 4.645
## [2,] 6.635 26.365
#average AUC
pred_sum <- rep(0, nrow(tsdata_boot)) # accumulator for the 200 test-set prediction vectors
for(i in 1:200) {
  pred_sum <- pred_sum + data.frame(bagging_preds_LR)[i]
}
LR_boot_prediction <- pred_sum / 200
LR_boot_auc <- performance(prediction(LR_boot_prediction, tsdata_boot$outcome), 'auc')@y.values[[1]]
LR_boot_auc
## [1] 0.9068462
total_LR_boot_time <- common_portion_time_boot + LR_boot_time # total time for LR with bootstrap
total_LR_boot_time
## user system elapsed
## 7.02 0.14 7.46
NB with bootstrap
df<-trdata_boot
runModel <- function(df) {naiveBayes(outcome~.,data=df[sample(1:nrow(df),nrow(df),replace=T),])}
lapplyrunmodel <- function(x)runModel(df)
models_NB <- lapply(1:200,lapplyrunmodel)
bagging_preds_NB <- lapply(models_NB,FUN=function(M,D=tsdata_boot[,-c(14)])predict(M,D,type='raw'))
bagging_cfm_NB <- lapply(bagging_preds_NB,FUN=function(P,A=tsdata_boot[[14]])
{pred_class<-unlist(apply(round(P),1,which.max))-1
pred_tbl<-table(A,pred_class)
pred_cfm<-caret::confusionMatrix(pred_tbl)
pred_cfm
})
bagging.perf_NB <- as.data.frame(do.call('rbind',lapply(bagging_cfm_NB,FUN=function(cfm)c(cfm$overall,cfm$byClass))))
bagging.perf.mean_NB <- apply(bagging.perf_NB[bagging.perf_NB$AccuracyPValue<0.01,-c(6:7)],2,mean)
## bagging.perf.mean_NB
## Accuracy 0.7958115
## Kappa 0.5936582
## AccuracyLower 0.6722935
## AccuracyUpper 0.8886170
## AccuracyNull 0.5410995
## Sensitivity 0.7372101
## Specificity 0.8649522
## Pos Pred Value 0.8547605
## Neg Pred Value 0.7475805
## Precision 0.8547605
## Recall 0.7372101
## F1 0.7902610
## Prevalence 0.5234729
## Detection Rate 0.3846422
## Detection Prevalence 0.4500000
## Balanced Accuracy 0.8010811
Now, average confusion matrix and AUC from all 200 models will be computed.
#average confusion matrix
bagging.perf_cfm <- as.data.frame(do.call('rbind',lapply(bagging_cfm_NB,FUN=function(cfm)c(cfm$overall,cfm$table))))
boot.tst.perf_cfm <- apply(bagging.perf_cfm[bagging.perf_cfm$AccuracyPValue < 0.01, -c(6:7)], 2, mean)
(boot_confusion_matrix <- matrix(c(boot.tst.perf_cfm[6], boot.tst.perf_cfm[7], boot.tst.perf_cfm[8], boot.tst.perf_cfm[9]), nrow = 2))
## [,1] [,2]
## [1,] 23.078534 3.921466
## [2,] 8.329843 24.670157
#average AUC
pred_sum <- rep(0, nrow(tsdata_boot)) # accumulator for the 200 sets of class-1 probabilities
for(i in 1:200) {
  pred_sum <- pred_sum + bagging_preds_NB[[i]][,2]
}
NB_boot_prediction <- pred_sum / 200
NB_boot_auc <- performance(prediction(NB_boot_prediction, tsdata_boot$outcome), 'auc')@y.values[[1]]
NB_boot_auc
## [1] 0.8989899
total_NB_boot_time <- common_portion_time_boot + NB_boot_time # total time for NB with bootstrap
total_NB_boot_time
## user system elapsed
## 5.68 0.09 5.83
NB_boot_row <- c("NB with Bootstrap",bagging.perf.mean_NB[1,], NB_boot_auc,data.frame(total_NB_boot_time[3])[1,], bagging.perf.mean_NB[6:16,] )
table_boot <- data.frame(matrix(ncol = 15, nrow = 0))
table_boot <- rbind(table_boot,LR_boot_row, NB_boot_row )
colnames(table_boot) <- c("ALGO", "ACCURACY","AUC", "Total elapsed time", "Sensitivity", "Specificity", "Pos Pred Value", "Neg Pred Value", "Precision", "Recall", "F1", "Prevalence", "Detection Rate", "Detection Prevalence", "Balanced Accuracy")
knitr::kable(table_boot)
| ALGO | ACCURACY | AUC | Total elapsed time | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Precision | Recall | F1 | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LR with Bootstrap | 0.812 | 0.906846240179573 | 7.46 | 0.772844556441993 | 0.851639640801397 | 0.827962962962963 | 0.798939393939394 | 0.827962962962963 | 0.772844556441993 | 0.798357621150137 | 0.483166666666667 | 0.372583333333333 | 0.45 | 0.812242098621695 |
| NB with Bootstrap | 0.795811518324607 | 0.898989898989899 | 5.83 | 0.737210053584752 | 0.864952207712975 | 0.854760519681986 | 0.747580517214025 | 0.854760519681986 | 0.737210053584752 | 0.79026103085344 | 0.52347294938918 | 0.384642233856894 | 0.45 | 0.801081130648863 |
Let's put together all performance metrics from the base, cross-validation and bootstrap models.
table_base_cv_boot <- data.frame(matrix(ncol = 15, nrow = 0))
table_base_cv_boot <- rbind(table_base_cv_boot,glm_row,nb_row,LR_tenfold_row,NB_tenfold_row,LR_fivefold_row,NB_fivefold_row,LR_boot_row, NB_boot_row )
colnames(table_base_cv_boot) <- c("ALGO", "ACCURACY","AUC","Total elapsed time", "Sensitivity", "Specificity", "Pos Pred Value", "Neg Pred Value", "Precision", "Recall", "F1", "Prevalence", "Detection Rate", "Detection Prevalence", "Balanced Accuracy")
knitr::kable(table_base_cv_boot)
| ALGO | ACCURACY | AUC | Total elapsed time | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Precision | Recall | F1 | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| base_LR | 0.816666666666667 | 0.906846240179573 | 4.47 | 0.785714285714286 | 0.84375 | 0.814814814814815 | 0.818181818181818 | 0.814814814814815 | 0.785714285714286 | 0.8 | 0.466666666666667 | 0.366666666666667 | 0.45 | 0.814732142857143 |
| base_NB | 0.816666666666667 | 0.897867564534231 | 4.46 | 0.766666666666667 | 0.866666666666667 | 0.851851851851852 | 0.787878787878788 | 0.851851851851852 | 0.766666666666667 | 0.807017543859649 | 0.5 | 0.383333333333333 | 0.45 | 0.816666666666667 |
| LR with 10 fold CV | 0.828333333333333 | 0.904601571268238 | 0.840000000000002 | 0.833333333333333 | 0.824242424242424 | 0.796576092098895 | 0.859057398113457 | 0.796576092098895 | 0.833333333333333 | 0.813704341162079 | 0.45 | 0.375 | 0.471666666666667 | 0.828787878787879 |
| NB with 10 fold CV | 0.795 | 0.901234567901235 | 1.17 | 0.862962962962963 | 0.739393939393939 | 0.730966484100972 | 0.8687754901548 | 0.730966484100972 | 0.862962962962963 | 0.791209781839312 | 0.45 | 0.388333333333333 | 0.531666666666667 | 0.801178451178451 |
| LR with 5 fold CV | 0.816666666666667 | 0.905723905723906 | 0.670000000000002 | 0.82962962962963 | 0.806060606060606 | 0.77874293012224 | 0.852653958944282 | 0.77874293012224 | 0.82962962962963 | 0.80304270777955 | 0.45 | 0.373333333333333 | 0.48 | 0.817845117845118 |
| NB with 5 fold CV | 0.803333333333333 | 0.901234567901235 | 0.940000000000001 | 0.859259259259259 | 0.757575757575758 | 0.744072622779519 | 0.869117596859532 | 0.744072622779519 | 0.859259259259259 | 0.797077439361115 | 0.45 | 0.386666666666667 | 0.52 | 0.808417508417508 |
| LR with Bootstrap | 0.812 | 0.906846240179573 | 7.46 | 0.772844556441993 | 0.851639640801397 | 0.827962962962963 | 0.798939393939394 | 0.827962962962963 | 0.772844556441993 | 0.798357621150137 | 0.483166666666667 | 0.372583333333333 | 0.45 | 0.812242098621695 |
| NB with Bootstrap | 0.795811518324607 | 0.898989898989899 | 5.83 | 0.737210053584752 | 0.864952207712975 | 0.854760519681986 | 0.747580517214025 | 0.854760519681986 | 0.737210053584752 | 0.79026103085344 | 0.52347294938918 | 0.384642233856894 | 0.45 | 0.801081130648863 |
PART B
For the same dataset, set the seed (43) and split 80/20.
Using randomForest, grow three different forests, varying the number of trees at least three times. Start with seeding and a fresh split for each forest. Note down as best as you can the development (engineering) cost as well as the computing cost (elapsed time) for each run, and compare these results with the experiment in Part A. Submit a pdf and an executable script in Python or R.
I will start with 50 trees and then grow 50 more trees in each of the next two runs. All metrics will be gathered for each model run, with seeding and a fresh split.
First run: Random Forest
set.seed(43)
tstidx_rf50 <- sample(1:nrow(heart),0.20*nrow(heart),replace = F)
trdata_rf50 <- heart[-tstidx_rf50,]
tsdata_rf50 <- heart[tstidx_rf50,]
rf50_model <- randomForest(outcome~., data=trdata_rf50, ntree=50)
rf50_pred<-predict(rf50_model,tsdata_rf50[,-c(which(names(tsdata_rf50)=="outcome"))])
rf50_mtab<-table(tsdata_rf50$outcome,rf50_pred)
rf50_cmx<-caret::confusionMatrix(rf50_mtab)
rf50_cmx
## Confusion Matrix and Statistics
##
## rf50_pred
## 0 1
## 0 22 5
## 1 4 29
##
## Accuracy : 0.85
## 95% CI : (0.7343, 0.929)
## No Information Rate : 0.5667
## P-Value [Acc > NIR] : 2.679e-06
##
## Kappa : 0.6959
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8462
## Specificity : 0.8529
## Pos Pred Value : 0.8148
## Neg Pred Value : 0.8788
## Prevalence : 0.4333
## Detection Rate : 0.3667
## Detection Prevalence : 0.4500
## Balanced Accuracy : 0.8495
##
## 'Positive' Class : 0
##
acc_rf50 <- sum(diag(rf50_cmx$table))/sum(rf50_cmx$table)
Predictionprob_rf50 <- predict(rf50_model,tsdata_rf50,type="prob")[, 2]
auc_rf50 <- roc(as.numeric(tsdata_rf50$outcome),as.numeric(as.matrix((Predictionprob_rf50))))$auc
auc_rf50
## Area under the curve: 0.9321
## rf50_cmx.byClass
## Sensitivity 0.8461538
## Specificity 0.8529412
## Pos Pred Value 0.8148148
## Neg Pred Value 0.8787879
## Precision 0.8148148
## Recall 0.8461538
## F1 0.8301887
## Prevalence 0.4333333
## Detection Rate 0.3666667
## Detection Prevalence 0.4500000
## Balanced Accuracy 0.8495475
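The summary row for this run is assembled the same way as in Part A; a sketch (rf50_elapsed is a hypothetical name standing in for the elapsed time captured around this run):

rf50_row <- c("Random Forest with 50 trees", acc_rf50, auc_rf50,
              rf50_elapsed,        # elapsed time of this run
              rf50_cmx$byClass)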
2nd run: Random Forest
set.seed(43)
tstidx_rf100 <- sample(1:nrow(heart),0.20*nrow(heart),replace = F)
trdata_rf100 <- heart[-tstidx_rf100,]
tsdata_rf100 <- heart[tstidx_rf100,]
rfgrow100 <- grow(rf50_model, 50)
rfgrow_pred100 <- predict(rfgrow100,tsdata_rf100[,-c(which(names(tsdata_rf100)=="outcome"))])
rfgrow_mtab100 <- table(tsdata_rf100$outcome,rfgrow_pred100)
rfgrow_cmx100 <- caret::confusionMatrix(rfgrow_mtab100)
rfgrow_cmx100
## Confusion Matrix and Statistics
##
## rfgrow_pred100
## 0 1
## 0 22 5
## 1 5 28
##
## Accuracy : 0.8333
## 95% CI : (0.7148, 0.9171)
## No Information Rate : 0.55
## P-Value [Acc > NIR] : 3.483e-06
##
## Kappa : 0.6633
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8148
## Specificity : 0.8485
## Pos Pred Value : 0.8148
## Neg Pred Value : 0.8485
## Prevalence : 0.4500
## Detection Rate : 0.3667
## Detection Prevalence : 0.4500
## Balanced Accuracy : 0.8316
##
## 'Positive' Class : 0
##
acc_rf100 <- sum(diag(rfgrow_cmx100$table))/sum(rfgrow_cmx100$table)
Predictionprob_rf100 <- predict(rfgrow100,tsdata_rf100,type="prob")[, 2]
auc_rf100 <- roc(as.numeric(tsdata_rf100$outcome),as.numeric(as.matrix((Predictionprob_rf100))))$auc
auc_rf100
## Area under the curve: 0.9321
## rfgrow_cmx100.byClass
## Sensitivity 0.8148148
## Specificity 0.8484848
## Pos Pred Value 0.8148148
## Neg Pred Value 0.8484848
## Precision 0.8148148
## Recall 0.8148148
## F1 0.8148148
## Prevalence 0.4500000
## Detection Rate 0.3666667
## Detection Prevalence 0.4500000
## Balanced Accuracy 0.8316498
3rd run: Random Forest
set.seed(43)
tstidx_rf150 <- sample(1:nrow(heart),0.20*nrow(heart),replace = F)
trdata_rf150 <- heart[-tstidx_rf150,]
tsdata_rf150 <- heart[tstidx_rf150,]
rfgrow150 <- grow(rfgrow100, 50)
rfgrow_pred150 <- predict(rfgrow150,tsdata_rf150[,-c(which(names(tsdata_rf150)=="outcome"))])
rfgrow_mtab150 <- table(tsdata_rf150$outcome,rfgrow_pred150)
rfgrow_cmx150 <- caret::confusionMatrix(rfgrow_mtab150)
rfgrow_cmx150
## Confusion Matrix and Statistics
##
## rfgrow_pred150
## 0 1
## 0 22 5
## 1 5 28
##
## Accuracy : 0.8333
## 95% CI : (0.7148, 0.9171)
## No Information Rate : 0.55
## P-Value [Acc > NIR] : 3.483e-06
##
## Kappa : 0.6633
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8148
## Specificity : 0.8485
## Pos Pred Value : 0.8148
## Neg Pred Value : 0.8485
## Prevalence : 0.4500
## Detection Rate : 0.3667
## Detection Prevalence : 0.4500
## Balanced Accuracy : 0.8316
##
## 'Positive' Class : 0
##
acc_rf150 <- sum(diag(rfgrow_cmx150$table))/sum(rfgrow_cmx150$table)
Predictionprob_rf150 <- predict(rfgrow150,tsdata_rf150,type="prob")[, 2]
auc_rf150 <- roc(as.numeric(tsdata_rf150$outcome),as.numeric(as.matrix((Predictionprob_rf150))))$auc
auc_rf150
## Area under the curve: 0.9321
## rfgrow_cmx150.byClass
## Sensitivity 0.8148148
## Specificity 0.8484848
## Pos Pred Value 0.8148148
## Neg Pred Value 0.8484848
## Precision 0.8148148
## Recall 0.8148148
## F1 0.8148148
## Prevalence 0.4500000
## Detection Rate 0.3666667
## Detection Prevalence 0.4500000
## Balanced Accuracy 0.8316498
rf150_row <- c("Random Forest 3rd run with 50 more trees",acc_rf150, auc_rf150,data.frame(total_rf150_time[3])[1,], df_rf150[1:11,])
table_rf <- data.frame(matrix(ncol = 15, nrow = 0))
table_rf <- rbind(table_rf, rf50_row, rf100_row, rf150_row )
colnames(table_rf) <- c("ALGO", "ACCURACY","AUC","Total elapsed time", "Sensitivity", "Specificity", "Pos Pred Value", "Neg Pred Value", "Precision", "Recall", "F1", "Prevalence", "Detection Rate", "Detection Prevalence", "Balanced Accuracy")
knitr::kable(table_rf)
| ALGO | ACCURACY | AUC | Total elapsed time | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Precision | Recall | F1 | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Random Forest with 50 trees | 0.85 | 0.932098765432099 | 0.34 | 0.846153846153846 | 0.852941176470588 | 0.814814814814815 | 0.878787878787879 | 0.814814814814815 | 0.846153846153846 | 0.830188679245283 | 0.433333333333333 | 0.366666666666667 | 0.45 | 0.849547511312217 |
| Random Forest 2nd run 50 more trees | 0.833333333333333 | 0.932098765432099 | 0.34 | 0.814814814814815 | 0.848484848484849 | 0.814814814814815 | 0.848484848484849 | 0.814814814814815 | 0.814814814814815 | 0.814814814814815 | 0.45 | 0.366666666666667 | 0.45 | 0.831649831649832 |
| Random Forest 3rd run with 50 more trees | 0.833333333333333 | 0.932098765432099 | 0.32 | 0.814814814814815 | 0.848484848484849 | 0.814814814814815 | 0.848484848484849 | 0.814814814814815 | 0.814814814814815 | 0.814814814814815 | 0.45 | 0.366666666666667 | 0.45 | 0.831649831649832 |
Part C
Include a summary of your findings. Which of the two methods, bootstrap vs. CV, do you recommend to your customer, and why? Be elaborate, including computing costs, engineering costs and model performance. Did you incorporate Pareto's maxim or the Razor, and how did these two heuristics influence your decision?
Putting all the model metrics together:
table <- data.frame(matrix(ncol = 15, nrow = 0))
table <- rbind(table, glm_row, nb_row,LR_tenfold_row, NB_tenfold_row,LR_fivefold_row, NB_fivefold_row,LR_boot_row, NB_boot_row, rf50_row, rf100_row, rf150_row )
colnames(table) <- c("ALGO", "ACCURACY","AUC","Total elapsed time", "Sensitivity", "Specificity", "Pos Pred Value", "Neg Pred Value", "Precision", "Recall", "F1", "Prevalence", "Detection Rate", "Detection Prevalence", "Balanced Accuracy")
knitr::kable(table)
| ALGO | ACCURACY | AUC | Total elapsed time | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Precision | Recall | F1 | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| base_LR | 0.816666666666667 | 0.906846240179573 | 4.47 | 0.785714285714286 | 0.84375 | 0.814814814814815 | 0.818181818181818 | 0.814814814814815 | 0.785714285714286 | 0.8 | 0.466666666666667 | 0.366666666666667 | 0.45 | 0.814732142857143 |
| base_NB | 0.816666666666667 | 0.897867564534231 | 4.46 | 0.766666666666667 | 0.866666666666667 | 0.851851851851852 | 0.787878787878788 | 0.851851851851852 | 0.766666666666667 | 0.807017543859649 | 0.5 | 0.383333333333333 | 0.45 | 0.816666666666667 |
| LR with 10 fold CV | 0.828333333333333 | 0.904601571268238 | 0.840000000000002 | 0.833333333333333 | 0.824242424242424 | 0.796576092098895 | 0.859057398113457 | 0.796576092098895 | 0.833333333333333 | 0.813704341162079 | 0.45 | 0.375 | 0.471666666666667 | 0.828787878787879 |
| NB with 10 fold CV | 0.795 | 0.901234567901235 | 1.17 | 0.862962962962963 | 0.739393939393939 | 0.730966484100972 | 0.8687754901548 | 0.730966484100972 | 0.862962962962963 | 0.791209781839312 | 0.45 | 0.388333333333333 | 0.531666666666667 | 0.801178451178451 |
| LR with 5 fold CV | 0.816666666666667 | 0.905723905723906 | 0.670000000000002 | 0.82962962962963 | 0.806060606060606 | 0.77874293012224 | 0.852653958944282 | 0.77874293012224 | 0.82962962962963 | 0.80304270777955 | 0.45 | 0.373333333333333 | 0.48 | 0.817845117845118 |
| NB with 5 fold CV | 0.803333333333333 | 0.901234567901235 | 0.940000000000001 | 0.859259259259259 | 0.757575757575758 | 0.744072622779519 | 0.869117596859532 | 0.744072622779519 | 0.859259259259259 | 0.797077439361115 | 0.45 | 0.386666666666667 | 0.52 | 0.808417508417508 |
| LR with Bootstrap | 0.812 | 0.906846240179573 | 7.46 | 0.772844556441993 | 0.851639640801397 | 0.827962962962963 | 0.798939393939394 | 0.827962962962963 | 0.772844556441993 | 0.798357621150137 | 0.483166666666667 | 0.372583333333333 | 0.45 | 0.812242098621695 |
| NB with Bootstrap | 0.795811518324607 | 0.898989898989899 | 5.83 | 0.737210053584752 | 0.864952207712975 | 0.854760519681986 | 0.747580517214025 | 0.854760519681986 | 0.737210053584752 | 0.79026103085344 | 0.52347294938918 | 0.384642233856894 | 0.45 | 0.801081130648863 |
| Random Forest with 50 trees | 0.85 | 0.932098765432099 | 0.34 | 0.846153846153846 | 0.852941176470588 | 0.814814814814815 | 0.878787878787879 | 0.814814814814815 | 0.846153846153846 | 0.830188679245283 | 0.433333333333333 | 0.366666666666667 | 0.45 | 0.849547511312217 |
| Random Forest 2nd run 50 more trees | 0.833333333333333 | 0.932098765432099 | 0.34 | 0.814814814814815 | 0.848484848484849 | 0.814814814814815 | 0.848484848484849 | 0.814814814814815 | 0.814814814814815 | 0.814814814814815 | 0.45 | 0.366666666666667 | 0.45 | 0.831649831649832 |
| Random Forest 3rd run with 50 more trees | 0.833333333333333 | 0.932098765432099 | 0.32 | 0.814814814814815 | 0.848484848484849 | 0.814814814814815 | 0.848484848484849 | 0.814814814814815 | 0.814814814814815 | 0.814814814814815 | 0.45 | 0.366666666666667 | 0.45 | 0.831649831649832 |
From the bootstrap and cross-validation metrics above, the cross-validation models perform at least as well while carrying much lower computing and engineering costs. The bootstrap requires aggregating 200 resamples, which makes it computationally expensive compared with cross validation. Comparing 5-fold and 10-fold CV, 5-fold CV has the lower computing cost, although the LR model with 10-fold CV produces the best accuracy.
Applying Occam's razor (the law of parsimony), I would choose the LR model with 5-fold CV: its accuracy is slightly lower than that of the 10-fold CV model, but it is simpler, costs less to compute, and has a slightly better AUC.
The random forest with 50 trees performed best among all the models, with the best accuracy, AUC and computation time. As mentioned at the beginning of this homework, I built a Shiny app where we can run logistic regression, decision trees and random forests. My future plan is to extend the app so that any classification dataset can be uploaded and run through various classifiers, freeing us from writing the same script over and over again. Please note that most of the code in this homework could be made cleaner and more reusable with functions and loops; I could not do so because of time constraints.
Engineering cost: It took me about 5 working days to craft this solution and about one week to build the Shiny app. Estimating a data scientist's hourly cost at $60/hr gives (5 x 8 x 60) = $2,400 for this homework and (7 x 8 x 60) = $3,360 for the Shiny app. There are also many other costs involved in a machine learning project, such as infrastructure costs (cloud compute, data storage), integration costs (data pipeline development, API development, documentation) and maintenance costs, which should be considered when estimating engineering cost.