library(caret)
library(pROC)
library(tidyverse)
library(kableExtra)
library(ggplot2)
Pick any two classifiers from (SVM, Logistic, DecisionTree, NaiveBayes). Pick the heart or ecoli dataset. Heart is simpler; ecoli compounds the problem as it is NOT a balanced dataset.
We pick the heart dataset.
data <- read.csv('https://raw.githubusercontent.com/monuchacko/cuny_msds/master/data_622/heart.csv',head=T,sep=',',stringsAsFactors=F, fileEncoding = "UTF-8-BOM")
# Check the column names
names(data)
## [1] "age" "sex" "cp" "trestbps" "chol" "fbs"
## [7] "restecg" "thalach" "exang" "oldpeak" "slope" "ca"
## [13] "thal" "target"
# View sample data
head(data) %>% kable() %>% kable_styling() %>% scroll_box(width = "800px", height = "300px")
age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
57 | 1 | 0 | 140 | 192 | 0 | 1 | 148 | 0 | 0.4 | 1 | 0 | 1 | 1 |
Clean and prepare data
data$cp <- as.factor(data$cp)
data$fbs <- as.factor(data$fbs)
data$exang <- as.factor(data$exang)
data$slope <- as.factor(data$slope)
data$ca <- as.factor(data$ca)
data$sex <- as.factor(data$sex)
data$restecg <- as.factor(data$restecg)
data$thal <- as.factor(data$thal)
data$target <- as.factor(data$target)
colSums(is.na(data))
## age sex cp trestbps chol fbs restecg thalach
## 0 0 0 0 0 0 0 0
## exang oldpeak slope ca thal target
## 0 0 0 0 0 0
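As an aside, the same conversions can be written more compactly with dplyr (a sketch equivalent to the per-column as.factor() assignments above; it assumes the same column names and is not run here):
# compact alternative to the as.factor() assignments above (not run)
factor_cols <- c("cp", "fbs", "exang", "slope", "ca", "sex", "restecg", "thal", "target")
data <- data %>% mutate(across(all_of(factor_cols), as.factor))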
str(data)
## 'data.frame': 303 obs. of 14 variables:
## $ age : int 63 37 41 56 57 57 56 44 52 57 ...
## $ sex : Factor w/ 2 levels "0","1": 2 2 1 2 1 2 1 2 2 2 ...
## $ cp : Factor w/ 4 levels "0","1","2","3": 4 3 2 2 1 1 2 2 3 3 ...
## $ trestbps: int 145 130 130 120 120 140 140 120 172 150 ...
## $ chol : int 233 250 204 236 354 192 294 263 199 168 ...
## $ fbs : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 2 1 ...
## $ restecg : Factor w/ 3 levels "0","1","2": 1 2 1 2 2 2 1 2 2 2 ...
## $ thalach : int 150 187 172 178 163 148 153 173 162 174 ...
## $ exang : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 1 ...
## $ oldpeak : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ slope : Factor w/ 3 levels "0","1","2": 1 1 3 3 3 2 2 3 3 3 ...
## $ ca : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ thal : Factor w/ 4 levels "0","1","2","3": 2 3 3 3 3 2 3 4 4 3 ...
## $ target : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
summary(data)
## age sex cp trestbps chol fbs
## Min. :29.00 0: 96 0:143 Min. : 94.0 Min. :126.0 0:258
## 1st Qu.:47.50 1:207 1: 50 1st Qu.:120.0 1st Qu.:211.0 1: 45
## Median :55.00 2: 87 Median :130.0 Median :240.0
## Mean :54.37 3: 23 Mean :131.6 Mean :246.3
## 3rd Qu.:61.00 3rd Qu.:140.0 3rd Qu.:274.5
## Max. :77.00 Max. :200.0 Max. :564.0
## restecg thalach exang oldpeak slope ca thal target
## 0:147 Min. : 71.0 0:204 Min. :0.00 0: 21 0:175 0: 2 0:138
## 1:152 1st Qu.:133.5 1: 99 1st Qu.:0.00 1:140 1: 65 1: 18 1:165
## 2: 4 Median :153.0 Median :0.80 2:142 2: 38 2:166
## Mean :149.6 Mean :1.04 3: 20 3:117
## 3rd Qu.:166.0 3rd Qu.:1.60 4: 5
## Max. :202.0 Max. :6.20
head(data) %>% kable() %>% kable_styling() %>% scroll_box(width = "800px", height = "300px")
age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
57 | 1 | 0 | 140 | 192 | 0 | 1 | 148 | 0 | 0.4 | 1 | 0 | 1 | 1 |
Set a seed (43)
# set the seed (43); the 80/20 split is performed below
set.seed(43)
View Heart Disease vs Age Chart
ggplot(data,aes(x=age,fill=target,color=target)) + geom_histogram(binwidth = 1,color="black") + labs(x = "Age",y = "Frequency", title = "Heart Disease vs Age")
Do an 80/20 split and determine the Accuracy, AUC and as many metrics as returned by the caret package (confusionMatrix). Call this the base_metric. Note down as best as you can the development (engineering) cost as well as the computing cost (elapsed time).
Start with the original dataset and set a seed (43). Then run a cross-validation of 5 and 10 folds of the model on the training set. Determine the same set of metrics and compare the cv_metrics with the base_metric. Note down as best as you can the development (engineering) cost as well as the computing cost (elapsed time). Start with the original dataset and set a seed (43). Then run a bootstrap of 200 resamples and compute the same set of metrics, and for each of the two classifiers build a three-column table covering each experiment (base, bootstrap, cross-validated). Note down as best as you can the development (engineering) cost as well as the computing cost (elapsed time).
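For reference, the resampling schemes described above map onto the following caret::trainControl settings (a sketch only; the actual calls used in the experiments appear below):
# sketch of the resampling controls implied by the assignment (not run here)
tr_base <- trainControl(method = "none")                                   # single 80/20 split
tr_cv5  <- trainControl(method = "cv", number = 5, savePredictions = "final")
tr_cv10 <- trainControl(method = "cv", number = 10, savePredictions = "final")
tr_boot <- trainControl(method = "boot", number = 200, savePredictions = "final", returnResamp = "final")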
# 80/20 split
train_ind <- sample(seq_len(nrow(data)), size = floor(0.8 * nrow(data)))
train_heart <- data[ train_ind,]
test_heart <- data[-train_ind,]
draw_confusion_matrix <- function(cm) {
layout(matrix(c(1,1,2)))
par(mar=c(2,2,2,2))
plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')
title('CONFUSION MATRIX', cex.main=2)
# create the matrix
rect(150, 430, 240, 370, col='#3F97D0')
text(195, 435, 'Class1', cex=1.2)
rect(250, 430, 340, 370, col='#F7AD50')
text(295, 435, 'Class2', cex=1.2)
text(125, 370, 'Predicted', cex=1.3, srt=90, font=2)
text(245, 450, 'Actual', cex=1.3, font=2)
rect(150, 305, 240, 365, col='#F7AD50')
rect(250, 305, 340, 365, col='#3F97D0')
text(140, 400, 'Class1', cex=1.2, srt=90)
text(140, 335, 'Class2', cex=1.2, srt=90)
# add in the cm results
res <- as.numeric(cm$table)
text(195, 400, res[1], cex=1.6, font=2, col='white')
text(195, 335, res[2], cex=1.6, font=2, col='white')
text(295, 400, res[3], cex=1.6, font=2, col='white')
text(295, 335, res[4], cex=1.6, font=2, col='white')
# add in the specifics
plot(c(100, 0), c(100, 0), type = "n", xlab="", ylab="", main = "DETAILS", xaxt='n', yaxt='n')
text(10, 85, names(cm$byClass[1]), cex=1.2, font=2)
text(10, 70, round(as.numeric(cm$byClass[1]), 3), cex=1.2)
text(30, 85, names(cm$byClass[2]), cex=1.2, font=2)
text(30, 70, round(as.numeric(cm$byClass[2]), 3), cex=1.2)
text(50, 85, names(cm$byClass[5]), cex=1.2, font=2)
text(50, 70, round(as.numeric(cm$byClass[5]), 3), cex=1.2)
text(70, 85, names(cm$byClass[6]), cex=1.2, font=2)
text(70, 70, round(as.numeric(cm$byClass[6]), 3), cex=1.2)
text(90, 85, names(cm$byClass[7]), cex=1.2, font=2)
text(90, 70, round(as.numeric(cm$byClass[7]), 3), cex=1.2)
# add in the accuracy information
text(30, 35, names(cm$overall[1]), cex=1.5, font=2)
text(30, 20, round(as.numeric(cm$overall[1]), 3), cex=1.4)
text(70, 35, names(cm$overall[2]), cex=1.5, font=2)
text(70, 20, round(as.numeric(cm$overall[2]), 3), cex=1.4)
}
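A minimal usage sketch of this helper, with hypothetical toy factors, just to show the expected input (a caret confusionMatrix object):
# toy example (hypothetical data) illustrating how draw_confusion_matrix() is called
toy_pred <- factor(c(0, 1, 1, 0, 1, 0), levels = c(0, 1))
toy_obs  <- factor(c(0, 1, 0, 0, 1, 1), levels = c(0, 1))
toy_cm   <- confusionMatrix(toy_pred, toy_obs)
draw_confusion_matrix(toy_cm)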
eval_model <- function(train_method, tr, model_name){
# Timer begin
ptm <- proc.time()
# set seed
set.seed(43)
# Base Model
if (grepl("Base", model_name, fixed=TRUE)) {
# train model
dt_model = train(form = target ~ ., data = train_heart, trControl = tr, method = train_method)
print(dt_model)
# evaluate model
model_cm <- confusionMatrix(predict(dt_model, subset(test_heart, select = -c(target))), test_heart$target)
# Confusion Matrix visuals
draw_confusion_matrix(model_cm)
print(paste(model_name, 'results'))
print(model_cm)
# Timer end
elapsed_time <- (proc.time() - ptm)[[3]]
# determine the Accuracy, AUC and as many metrics as returned by the Caret package (confusionMatrix).
# store results
accuracy <- model_cm$overall[[1]]
auc_val <- as.numeric(auc(roc(test_heart$target, factor(predict(dt_model, test_heart), ordered = TRUE))))
sensitivity <- model_cm$byClass[[1]]
specificity <- model_cm$byClass[[2]]
precision <- model_cm$byClass[[5]]
recall <- model_cm$byClass[[6]]
f1 <- model_cm$byClass[[7]]
# Bootstrap Model
}
else if (grepl("Boot", model_name, fixed=TRUE)){
dt_model = train(form = target ~ ., data = data, trControl = tr, method = train_method)
# end timer
elapsed_time <- (proc.time() - ptm)[[3]]
accuracy <- c()
auc_val <- c()
sensitivity <- c()
specificity <- c()
precision <- c()
recall <- c()
f1 <- c()
i <- 1
pred_df <- dt_model$pred
for (resample in unique(pred_df$Resample)){
temp <- filter(pred_df, Resample == resample)
model_cm <- confusionMatrix(temp$pred, temp$obs)
accuracy[i] <- model_cm$overall[[1]]
auc_val[[i]] <- auc(roc(as.numeric(temp$pred, ordered = TRUE), as.numeric(temp$obs, ordered = TRUE)))
sensitivity[[i]] <- model_cm$byClass[[1]]
specificity[[i]] <- model_cm$byClass[[2]]
precision[[i]] <- model_cm$byClass[[5]]
recall[[i]] <- model_cm$byClass[[6]]
f1[[i]] <- model_cm$byClass[[7]]
i <- i + 1
}
accuracy <- mean(accuracy)
auc_val <- mean(auc_val)
sensitivity <- mean(sensitivity)
specificity <- mean(specificity)
precision <- mean(precision)
recall <- mean(recall)
f1 <- mean(f1)
}
else if (grepl("RF", model_name, fixed=TRUE)){
# Random Forest
# train model
dt_model = train(form = target ~ ., data = train_heart, trControl = tr, ntree = as.numeric(str_sub(model_name, start= -2)), method = train_method)
print(dt_model)
# evaluate model
model_cm <- confusionMatrix(predict(dt_model, subset(test_heart, select = -c(target))), test_heart$target)
draw_confusion_matrix(model_cm)
print(paste(model_name, 'results'))
print(model_cm)
# end timer
elapsed_time <- (proc.time() - ptm)[[3]]
# determine the Accuracy, AUC and as many metrics as returned by the Caret package (confusionMatrix).
# store results
accuracy <- model_cm$overall[[1]]
auc_val <- as.numeric(auc(roc(test_heart$target, factor(predict(dt_model, test_heart), ordered = TRUE))))
sensitivity <- model_cm$byClass[[1]]
specificity <- model_cm$byClass[[2]]
precision <- model_cm$byClass[[5]]
recall <- model_cm$byClass[[6]]
f1 <- model_cm$byClass[[7]]
}
else {
# Cross Validation
dt_model = train(form = target ~ ., data = data, trControl = tr, method = train_method)
print(dt_model)
model_cm <- confusionMatrix(dt_model$pred[order(dt_model$pred$rowIndex),]$pred, data$target)
draw_confusion_matrix(model_cm)
print(paste(model_name, 'results'))
print(model_cm)
# end timer
elapsed_time <- (proc.time() - ptm)[[3]]
# determine the Accuracy, AUC and as many metrics as returned by the Caret package (confusionMatrix).
# store results
accuracy <- model_cm$overall[[1]]
auc_val <- as.numeric(auc(roc(test_heart$target, factor(predict(dt_model, test_heart), ordered = TRUE))))
sensitivity <- model_cm$byClass[[1]]
specificity <- model_cm$byClass[[2]]
precision <- model_cm$byClass[[5]]
recall <- model_cm$byClass[[6]]
f1 <- model_cm$byClass[[7]]
}
full_results <- rbind(accuracy, auc_val, sensitivity, specificity, precision, recall, f1, elapsed_time)
colnames(full_results) <- c(model_name)
return(full_results)
}
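Note that eval_model derives AUC from hard class predictions coerced to an ordered factor. pROC is more often driven by predicted class probabilities; a sketch of that alternative for the base decision tree (an assumption, not the approach used above) would be:
# hypothetical probability-based AUC (alternative to the ordered-factor approach in eval_model)
prob_mod <- train(target ~ ., data = train_heart, method = "rpart",
                  trControl = trainControl(method = "none"))
probs    <- predict(prob_mod, newdata = test_heart, type = "prob")
roc_obj  <- roc(response = test_heart$target, predictor = probs[, "1"])  # P(target == 1)
auc(roc_obj)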
Base Metric - Decision Tree
dt_base <- eval_model("rpart", trainControl(method="none"), "DT Base")
## CART
##
## 242 samples
## 13 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: None
## [1] "DT Base results"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 0 0
## 1 27 34
##
## Accuracy : 0.5574
## 95% CI : (0.4245, 0.6845)
## No Information Rate : 0.5574
## P-Value [Acc > NIR] : 0.5531
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 5.624e-07
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.5574
## Prevalence : 0.4426
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 0
##
Base Metric - SVM
svm_base <- eval_model("svmLinearWeights", trainControl(method="none"), "SVM Base")
## Linear Support Vector Machines with Class Weights
##
## 242 samples
## 13 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: None
## [1] "SVM Base results"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 20 2
## 1 7 32
##
## Accuracy : 0.8525
## 95% CI : (0.7383, 0.9302)
## No Information Rate : 0.5574
## P-Value [Acc > NIR] : 8.993e-07
##
## Kappa : 0.6952
##
## Mcnemar's Test P-Value : 0.1824
##
## Sensitivity : 0.7407
## Specificity : 0.9412
## Pos Pred Value : 0.9091
## Neg Pred Value : 0.8205
## Prevalence : 0.4426
## Detection Rate : 0.3279
## Detection Prevalence : 0.3607
## Balanced Accuracy : 0.8410
##
## 'Positive' Class : 0
##
5-Fold Cross-Validation - Decision Tree
dt_5cv <- eval_model("rpart", tr = trainControl(method = "cv", number = 5, savePredictions = 'final'), "DT 5 cv")
## CART
##
## 303 samples
## 13 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 243, 242, 242, 243, 242
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.03623188 0.7621311 0.5186339
## 0.03985507 0.7391803 0.4718201
## 0.48550725 0.6400000 0.2408988
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.03623188.
## [1] "DT 5 cv results"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 98 32
## 1 40 133
##
## Accuracy : 0.7624
## 95% CI : (0.7104, 0.8092)
## No Information Rate : 0.5446
## P-Value [Acc > NIR] : 3.29e-15
##
## Kappa : 0.5187
##
## Mcnemar's Test P-Value : 0.4094
##
## Sensitivity : 0.7101
## Specificity : 0.8061
## Pos Pred Value : 0.7538
## Neg Pred Value : 0.7688
## Prevalence : 0.4554
## Detection Rate : 0.3234
## Detection Prevalence : 0.4290
## Balanced Accuracy : 0.7581
##
## 'Positive' Class : 0
##
5-Fold Cross-Validation - SVM
svm_5cv <- eval_model("svmLinearWeights", tr = trainControl(method = "cv", number = 5, savePredictions = 'final'), "SVM 5 cv")
## Linear Support Vector Machines with Class Weights
##
## 303 samples
## 13 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 243, 242, 242, 243, 242
## Resampling results across tuning parameters:
##
## cost weight Accuracy Kappa
## 0.25 1 0.8450820 0.6862833
## 0.25 2 0.8383607 0.6670541
## 0.25 3 0.8153005 0.6183641
## 0.50 1 0.8318033 0.6591552
## 0.50 2 0.8285246 0.6468837
## 0.50 3 0.8185792 0.6253926
## 1.00 1 0.8350820 0.6664592
## 1.00 2 0.8252459 0.6399340
## 1.00 3 0.8153005 0.6188832
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were cost = 0.25 and weight = 1.
## [1] "SVM 5 cv results"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 110 19
## 1 28 146
##
## Accuracy : 0.8449
## 95% CI : (0.7991, 0.8837)
## No Information Rate : 0.5446
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6856
##
## Mcnemar's Test P-Value : 0.2432
##
## Sensitivity : 0.7971
## Specificity : 0.8848
## Pos Pred Value : 0.8527
## Neg Pred Value : 0.8391
## Prevalence : 0.4554
## Detection Rate : 0.3630
## Detection Prevalence : 0.4257
## Balanced Accuracy : 0.8410
##
## 'Positive' Class : 0
##
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
10-Fold Cross-Validation - Decision Tree
dt_10cv <- eval_model("rpart", trainControl(method = "cv", number = 10, savePredictions = 'final'), "DT 10 cv")
## CART
##
## 303 samples
## 13 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 273, 272, 273, 273, 273, 273, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.03623188 0.7493548 0.4924633
## 0.03985507 0.7493548 0.4917245
## 0.48550725 0.6560215 0.2761508
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.03985507.
## [1] "DT 10 cv results"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 96 34
## 1 42 131
##
## Accuracy : 0.7492
## 95% CI : (0.6964, 0.797)
## No Information Rate : 0.5446
## P-Value [Acc > NIR] : 1.515e-13
##
## Kappa : 0.4919
##
## Mcnemar's Test P-Value : 0.422
##
## Sensitivity : 0.6957
## Specificity : 0.7939
## Pos Pred Value : 0.7385
## Neg Pred Value : 0.7572
## Prevalence : 0.4554
## Detection Rate : 0.3168
## Detection Prevalence : 0.4290
## Balanced Accuracy : 0.7448
##
## 'Positive' Class : 0
##
10-Fold Cross-Validation - SVM
svm_10cv <- eval_model("svmLinearWeights", trainControl(method = "cv", number = 10, savePredictions = 'final'), "SVM 10 cv")
## Linear Support Vector Machines with Class Weights
##
## 303 samples
## 13 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 273, 272, 273, 273, 273, 273, ...
## Resampling results across tuning parameters:
##
## cost weight Accuracy Kappa
## 0.25 1 0.8320430 0.6592645
## 0.25 2 0.8449462 0.6825573
## 0.25 3 0.8216129 0.6326403
## 0.50 1 0.8351613 0.6656510
## 0.50 2 0.8481720 0.6890891
## 0.50 3 0.8182796 0.6256678
## 1.00 1 0.8383871 0.6718040
## 1.00 2 0.8515054 0.6960616
## 1.00 3 0.8249462 0.6397001
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were cost = 1 and weight = 2.
## [1] "SVM 10 cv results"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 102 9
## 1 36 156
##
## Accuracy : 0.8515
## 95% CI : (0.8064, 0.8896)
## No Information Rate : 0.5446
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6957
##
## Mcnemar's Test P-Value : 0.0001063
##
## Sensitivity : 0.7391
## Specificity : 0.9455
## Pos Pred Value : 0.9189
## Neg Pred Value : 0.8125
## Prevalence : 0.4554
## Detection Rate : 0.3366
## Detection Prevalence : 0.3663
## Balanced Accuracy : 0.8423
##
## 'Positive' Class : 0
##
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
Bootstrap - Decision Tree
dt_bt <- eval_model("rpart", trainControl(method="boot", number=200, savePredictions = 'final', returnResamp = 'final'), "DT Bootstrap")
print(dt_bt)
## DT Bootstrap
## accuracy 0.7420239
## auc_val 0.7458831
## sensitivity 0.7039119
## specificity 0.7768658
## precision 0.7329613
## recall 0.7039119
## f1 0.7115944
## elapsed_time 4.5100000
Bootstrap - SVM
set.seed(43)
svm_bt <- eval_model("svmLinearWeights", trainControl(method="boot", number=200, savePredictions = 'final', returnResamp = 'final'), "SVM Bootstrap")
print(svm_bt)
## SVM Bootstrap
## accuracy 0.8255533
## auc_val 0.8375211
## sensitivity 0.7207400
## specificity 0.9159850
## precision 0.8813384
## recall 0.7207400
## f1 0.7906451
## elapsed_time 120.7300000
data.frame(cbind(dt_base, dt_5cv, dt_10cv, dt_bt, svm_base, svm_5cv, svm_10cv, svm_bt))
## DT.Base DT.5.cv DT.10.cv DT.Bootstrap SVM.Base SVM.5.cv
## accuracy 0.557377 0.7623762 0.7491749 0.7420239 0.8524590 0.8448845
## auc_val 0.500000 0.7854031 0.7412854 0.7458831 0.8409586 0.8594771
## sensitivity 0.000000 0.7101449 0.6956522 0.7039119 0.7407407 0.7971014
## specificity 1.000000 0.8060606 0.7939394 0.7768658 0.9411765 0.8848485
## precision NA 0.7538462 0.7384615 0.7329613 0.9090909 0.8527132
## recall 0.000000 0.7101449 0.6956522 0.7039119 0.7407407 0.7971014
## f1 NA 0.7313433 0.7164179 0.7115944 0.8163265 0.8239700
## elapsed_time 0.520000 0.6400000 0.7200000 4.5100000 0.3900000 1.0400000
## SVM.10.cv SVM.Bootstrap
## accuracy 0.8514851 0.8255533
## auc_val 0.8224401 0.8375211
## sensitivity 0.7391304 0.7207400
## specificity 0.9454545 0.9159850
## precision 0.9189189 0.8813384
## recall 0.7391304 0.7207400
## f1 0.8192771 0.7906451
## elapsed_time 1.6800000 120.7300000
For the same dataset, set the seed (43) and split 80/20.
# do an 80/20 split
set.seed(43)
train_ind <- sample(seq_len(nrow(data)), size = floor(0.8 * nrow(data)))
train_heart <- data[ train_ind,]
test_heart <- data[-train_ind,]
Using randomForest, grow three different forests, varying the number of trees at least three times. Start with seeding and a fresh split for each forest. Note down as best as you can the development (engineering) cost as well as the computing cost (elapsed time) for each run. Compare these results with the experiment in Part A. Submit a PDF and an executable script in Python or R.
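In eval_model, the tree count is parsed from the last two characters of the model name and passed through caret::train to the underlying randomForest call; a direct equivalent using the randomForest package (a sketch, assuming the same train/test split) would be:
# hypothetical direct call; eval_model reaches randomForest via train(..., ntree = ...)
library(randomForest)
set.seed(43)
rf_direct <- randomForest(target ~ ., data = train_heart, ntree = 99)
confusionMatrix(predict(rf_direct, test_heart), test_heart$target)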
rf_10 <- eval_model("rf", trainControl(), "RF 10")
## Random Forest
##
## 242 samples
## 13 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 242, 242, 242, 242, 242, 242, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7792646 0.5549714
## 12 0.7551120 0.5039891
## 22 0.7556300 0.5050744
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
## [1] "RF 10 results"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 18 1
## 1 9 33
##
## Accuracy : 0.8361
## 95% CI : (0.7191, 0.9185)
## No Information Rate : 0.5574
## P-Value [Acc > NIR] : 3.844e-06
##
## Kappa : 0.6573
##
## Mcnemar's Test P-Value : 0.02686
##
## Sensitivity : 0.6667
## Specificity : 0.9706
## Pos Pred Value : 0.9474
## Neg Pred Value : 0.7857
## Prevalence : 0.4426
## Detection Rate : 0.2951
## Detection Prevalence : 0.3115
## Balanced Accuracy : 0.8186
##
## 'Positive' Class : 0
##
rf_50 <- eval_model("rf", trainControl(), "RF 50")
## Random Forest
##
## 242 samples
## 13 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 242, 242, 242, 242, 242, 242, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7972460 0.5896982
## 12 0.7651698 0.5239571
## 22 0.7512177 0.4950400
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
## [1] "RF 50 results"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 19 2
## 1 8 32
##
## Accuracy : 0.8361
## 95% CI : (0.7191, 0.9185)
## No Information Rate : 0.5574
## P-Value [Acc > NIR] : 3.844e-06
##
## Kappa : 0.66
##
## Mcnemar's Test P-Value : 0.1138
##
## Sensitivity : 0.7037
## Specificity : 0.9412
## Pos Pred Value : 0.9048
## Neg Pred Value : 0.8000
## Prevalence : 0.4426
## Detection Rate : 0.3115
## Detection Prevalence : 0.3443
## Balanced Accuracy : 0.8224
##
## 'Positive' Class : 0
##
rf_99 <- eval_model("rf", trainControl(), "RF 99")
## Random Forest
##
## 242 samples
## 13 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 242, 242, 242, 242, 242, 242, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7968746 0.5887862
## 12 0.7688465 0.5307189
## 22 0.7515108 0.4957248
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
## [1] "RF 99 results"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 20 2
## 1 7 32
##
## Accuracy : 0.8525
## 95% CI : (0.7383, 0.9302)
## No Information Rate : 0.5574
## P-Value [Acc > NIR] : 8.993e-07
##
## Kappa : 0.6952
##
## Mcnemar's Test P-Value : 0.1824
##
## Sensitivity : 0.7407
## Specificity : 0.9412
## Pos Pred Value : 0.9091
## Neg Pred Value : 0.8205
## Prevalence : 0.4426
## Detection Rate : 0.3279
## Detection Prevalence : 0.3607
## Balanced Accuracy : 0.8410
##
## 'Positive' Class : 0
##
data.frame(cbind(rf_10, rf_50, rf_99))
## RF.10 RF.50 RF.99
## accuracy 0.8360656 0.8360656 0.8524590
## auc_val 0.8186275 0.8077342 0.8409586
## sensitivity 0.6666667 0.7037037 0.7407407
## specificity 0.9705882 0.9411765 0.9411765
## precision 0.9473684 0.9047619 0.9090909
## recall 0.6666667 0.7037037 0.7407407
## f1 0.7826087 0.7916667 0.8163265
## elapsed_time 1.1100000 2.0000000 3.1200000
Include a summary of your findings. Which of the two methods, bootstrap vs. CV, do you recommend to your customer, and why? Be elaborate, covering computing costs, engineering costs and model performance. Did you incorporate Pareto's maxim or Occam's Razor, and how did these two heuristics influence your decision?
result <- data.frame(cbind(dt_base, dt_5cv, dt_10cv, dt_bt, svm_base, svm_5cv, svm_10cv, svm_bt, rf_10, rf_50, rf_99))
result %>% kable() %>% kable_styling() %>% scroll_box(width = "800px", height = "400px")
DT.Base | DT.5.cv | DT.10.cv | DT.Bootstrap | SVM.Base | SVM.5.cv | SVM.10.cv | SVM.Bootstrap | RF.10 | RF.50 | RF.99 | |
---|---|---|---|---|---|---|---|---|---|---|---|
accuracy | 0.557377 | 0.7623762 | 0.7491749 | 0.7420239 | 0.8524590 | 0.8448845 | 0.8514851 | 0.8255533 | 0.8360656 | 0.8360656 | 0.8524590 |
auc_val | 0.500000 | 0.7854031 | 0.7412854 | 0.7458831 | 0.8409586 | 0.8594771 | 0.8224401 | 0.8375211 | 0.8186275 | 0.8077342 | 0.8409586 |
sensitivity | 0.000000 | 0.7101449 | 0.6956522 | 0.7039119 | 0.7407407 | 0.7971014 | 0.7391304 | 0.7207400 | 0.6666667 | 0.7037037 | 0.7407407 |
specificity | 1.000000 | 0.8060606 | 0.7939394 | 0.7768658 | 0.9411765 | 0.8848485 | 0.9454545 | 0.9159850 | 0.9705882 | 0.9411765 | 0.9411765 |
precision | NA | 0.7538462 | 0.7384615 | 0.7329613 | 0.9090909 | 0.8527132 | 0.9189189 | 0.8813384 | 0.9473684 | 0.9047619 | 0.9090909 |
recall | 0.000000 | 0.7101449 | 0.6956522 | 0.7039119 | 0.7407407 | 0.7971014 | 0.7391304 | 0.7207400 | 0.6666667 | 0.7037037 | 0.7407407 |
f1 | NA | 0.7313433 | 0.7164179 | 0.7115944 | 0.8163265 | 0.8239700 | 0.8192771 | 0.7906451 | 0.7826087 | 0.7916667 | 0.8163265 |
elapsed_time | 0.520000 | 0.6400000 | 0.7200000 | 4.5100000 | 0.3900000 | 1.0400000 | 1.6800000 | 120.7300000 | 1.1100000 | 2.0000000 | 3.1200000 |
I would recommend cross-validation. Cross-validation is a powerful tool: it makes better use of the available data and gives us much more information about algorithm performance than a single train/test split.
SVM
It looks like the base SVM did most of the work, as suggested by the Pareto principle, and cross-validation gave it a modest performance boost. 10-fold CV did not add any value over 5-fold CV, yet increased processing time. According to Occam's razor, we should use the simpler model, i.e. 5-fold CV.
Decision Tree
Here cross-validation helped with parameter selection (the cp value). 5-fold cross-validation yields better results than 10-fold cross-validation, and we did not find any added benefit in using 10-fold CV, so the simpler solution (5-fold CV) should be used (Occam's razor principle).
Random Forest
Random Forest (at 99 trees) matched the performance of the base SVM model but took more time to compute. According to Occam's razor, it is better to use the simpler model.