Part A

library(caret)
library(pROC)
library(tidyverse)
library(kableExtra)
library(ggplot2)

STEP# 0

Pick any two classifiers from (SVM, Logistic Regression, Decision Tree, Naive Bayes). Pick the heart or ecoli dataset. Heart is simpler; ecoli compounds the problem as it is NOT a balanced dataset.

We pick the heart dataset.

data <- read.csv('https://raw.githubusercontent.com/monuchacko/cuny_msds/master/data_622/heart.csv', header = TRUE, sep = ',', stringsAsFactors = FALSE, fileEncoding = "UTF-8-BOM")

# Check the column names
names(data)
##  [1] "age"      "sex"      "cp"       "trestbps" "chol"     "fbs"     
##  [7] "restecg"  "thalach"  "exang"    "oldpeak"  "slope"    "ca"      
## [13] "thal"     "target"
# View sample data
head(data) %>% kable() %>% kable_styling() %>% scroll_box(width = "800px", height = "300px")
age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
 63    1   3       145   233    1        0      150      0      2.3      0   0     1       1
 37    1   2       130   250    0        1      187      0      3.5      0   0     2       1
 41    0   1       130   204    0        0      172      0      1.4      2   0     2       1
 56    1   1       120   236    0        1      178      0      0.8      2   0     2       1
 57    0   0       120   354    0        1      163      1      0.6      2   0     2       1
 57    1   0       140   192    0        1      148      0      0.4      1   0     1       1

Clean and prepare data

data$cp <- as.factor(data$cp)
data$fbs <- as.factor(data$fbs)
data$exang <- as.factor(data$exang)
data$slope <- as.factor(data$slope)
data$ca <- as.factor(data$ca)
data$sex <- as.factor(data$sex)
data$restecg <- as.factor(data$restecg)
data$thal <- as.factor(data$thal)
data$target <- as.factor(data$target)
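# Equivalently, the conversions above can be written in one step with dplyr
# (a sketch; dplyr is loaded via tidyverse):
#   data <- data %>%
#     mutate(across(c(cp, fbs, exang, slope, ca, sex, restecg, thal, target), as.factor))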
colSums(is.na(data))
##      age      sex       cp trestbps     chol      fbs  restecg  thalach 
##        0        0        0        0        0        0        0        0 
##    exang  oldpeak    slope       ca     thal   target 
##        0        0        0        0        0        0
str(data)
## 'data.frame':    303 obs. of  14 variables:
##  $ age     : int  63 37 41 56 57 57 56 44 52 57 ...
##  $ sex     : Factor w/ 2 levels "0","1": 2 2 1 2 1 2 1 2 2 2 ...
##  $ cp      : Factor w/ 4 levels "0","1","2","3": 4 3 2 2 1 1 2 2 3 3 ...
##  $ trestbps: int  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol    : int  233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs     : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 2 1 ...
##  $ restecg : Factor w/ 3 levels "0","1","2": 1 2 1 2 2 2 1 2 2 2 ...
##  $ thalach : int  150 187 172 178 163 148 153 173 162 174 ...
##  $ exang   : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 1 ...
##  $ oldpeak : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ slope   : Factor w/ 3 levels "0","1","2": 1 1 3 3 3 2 2 3 3 3 ...
##  $ ca      : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ thal    : Factor w/ 4 levels "0","1","2","3": 2 3 3 3 3 2 3 4 4 3 ...
##  $ target  : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
summary(data)
##       age        sex     cp         trestbps          chol       fbs    
##  Min.   :29.00   0: 96   0:143   Min.   : 94.0   Min.   :126.0   0:258  
##  1st Qu.:47.50   1:207   1: 50   1st Qu.:120.0   1st Qu.:211.0   1: 45  
##  Median :55.00           2: 87   Median :130.0   Median :240.0          
##  Mean   :54.37           3: 23   Mean   :131.6   Mean   :246.3          
##  3rd Qu.:61.00                   3rd Qu.:140.0   3rd Qu.:274.5          
##  Max.   :77.00                   Max.   :200.0   Max.   :564.0          
##  restecg    thalach      exang      oldpeak     slope   ca      thal    target 
##  0:147   Min.   : 71.0   0:204   Min.   :0.00   0: 21   0:175   0:  2   0:138  
##  1:152   1st Qu.:133.5   1: 99   1st Qu.:0.00   1:140   1: 65   1: 18   1:165  
##  2:  4   Median :153.0           Median :0.80   2:142   2: 38   2:166          
##          Mean   :149.6           Mean   :1.04           3: 20   3:117          
##          3rd Qu.:166.0           3rd Qu.:1.60           4:  5                  
##          Max.   :202.0           Max.   :6.20
head(data) %>% kable() %>% kable_styling() %>% scroll_box(width = "800px", height = "300px")
age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
 63    1   3       145   233    1        0      150      0      2.3      0   0     1       1
 37    1   2       130   250    0        1      187      0      3.5      0   0     2       1
 41    0   1       130   204    0        0      172      0      1.4      2   0     2       1
 56    1   1       120   236    0        1      178      0      0.8      2   0     2       1
 57    0   0       120   354    0        1      163      1      0.6      2   0     2       1
 57    1   0       140   192    0        1      148      0      0.4      1   0     1       1

STEP# 1

Set a seed (43)

# set the seed for the 80/20 split (the split itself is done in STEP# 2)
set.seed(43)

View Heart Disease vs Age Chart

ggplot(data, aes(x = age, fill = target)) + geom_histogram(binwidth = 1, color = "black") + labs(x = "Age", y = "Frequency", title = "Heart Disease vs Age")

STEP# 2

Do an 80/20 split and determine the Accuracy, AUC, and as many metrics as returned by the Caret package (confusionMatrix). Call this the base_metric. Note down as best as you can the development (engineering) cost as well as the computing cost (elapsed time).

Start with the original dataset and set a seed (43). Then run 5-fold and 10-fold cross-validation of the model on the training set. Determine the same set of metrics and compare the cv_metrics with the base_metric. Note down as best as you can the development (engineering) cost as well as the computing cost (elapsed time).

Start with the original dataset and set a seed (43). Then run a bootstrap of 200 resamples and compute the same set of metrics. For each of the two classifiers, build a three-column table for each experiment (base, bootstrap, cross-validated). Note down as best as you can the development (engineering) cost as well as the computing cost (elapsed time).
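As a quick reference, the three experiments differ only in the resampling specification passed to caret's trainControl; these are the exact configurations used in the calls below:

tr_base <- trainControl(method = "none")                                           # single 80/20 holdout
tr_cv5  <- trainControl(method = "cv", number = 5, savePredictions = 'final')      # 5-fold CV
tr_cv10 <- trainControl(method = "cv", number = 10, savePredictions = 'final')     # 10-fold CV
tr_boot <- trainControl(method = "boot", number = 200, savePredictions = 'final')  # 200 bootstrap resamples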

# 80/20 split 
train_ind <- sample(seq_len(nrow(data)), size = floor(0.8 * nrow(data)))
train_heart <- data[ train_ind,]
test_heart  <- data[-train_ind,]
draw_confusion_matrix <- function(cm) {

  layout(matrix(c(1,1,2)))
  par(mar=c(2,2,2,2))
  plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')
  title('CONFUSION MATRIX', cex.main=2)

  # create the matrix (class labels taken from the confusion matrix so they match the data)
  classes <- colnames(cm$table)
  rect(150, 430, 240, 370, col='#3F97D0')
  text(195, 435, classes[1], cex=1.2)
  rect(250, 430, 340, 370, col='#F7AD50')
  text(295, 435, classes[2], cex=1.2)
  text(125, 370, 'Predicted', cex=1.3, srt=90, font=2)
  text(245, 450, 'Actual', cex=1.3, font=2)
  rect(150, 305, 240, 365, col='#F7AD50')
  rect(250, 305, 340, 365, col='#3F97D0')
  text(140, 400, classes[1], cex=1.2, srt=90)
  text(140, 335, classes[2], cex=1.2, srt=90)

  # add in the cm results 
  res <- as.numeric(cm$table)
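  # (as.numeric() flattens the 2x2 table column-major: res[1], res[2] are the two
  # prediction rows under the first reference class; res[3], res[4] under the second)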
  text(195, 400, res[1], cex=1.6, font=2, col='white')
  text(195, 335, res[2], cex=1.6, font=2, col='white')
  text(295, 400, res[3], cex=1.6, font=2, col='white')
  text(295, 335, res[4], cex=1.6, font=2, col='white')

  # add in the specifics 
  plot(c(100, 0), c(100, 0), type = "n", xlab="", ylab="", main = "DETAILS", xaxt='n', yaxt='n')
  text(10, 85, names(cm$byClass[1]), cex=1.2, font=2)
  text(10, 70, round(as.numeric(cm$byClass[1]), 3), cex=1.2)
  text(30, 85, names(cm$byClass[2]), cex=1.2, font=2)
  text(30, 70, round(as.numeric(cm$byClass[2]), 3), cex=1.2)
  text(50, 85, names(cm$byClass[5]), cex=1.2, font=2)
  text(50, 70, round(as.numeric(cm$byClass[5]), 3), cex=1.2)
  text(70, 85, names(cm$byClass[6]), cex=1.2, font=2)
  text(70, 70, round(as.numeric(cm$byClass[6]), 3), cex=1.2)
  text(90, 85, names(cm$byClass[7]), cex=1.2, font=2)
  text(90, 70, round(as.numeric(cm$byClass[7]), 3), cex=1.2)

  # add in the accuracy information 
  text(30, 35, names(cm$overall[1]), cex=1.5, font=2)
  text(30, 20, round(as.numeric(cm$overall[1]), 3), cex=1.4)
  text(70, 35, names(cm$overall[2]), cex=1.5, font=2)
  text(70, 20, round(as.numeric(cm$overall[2]), 3), cex=1.4)
}  
eval_model <- function(train_method, tr, model_name){
  
  # Timer begin
  ptm <- proc.time()
  
  # set seed
  set.seed(43)
  
  # Base Model
  if (grepl("Base", model_name, fixed=TRUE)) {
    
    # train model
    dt_model <- train(form = target ~ ., data = train_heart, trControl = tr, method = train_method)
    print(dt_model)
    
    # evaluate model
    model_cm <- confusionMatrix(predict(dt_model, subset(test_heart, select = -c(target))), test_heart$target)
    
    # Confusion Matrix visuals
    draw_confusion_matrix(model_cm)
    
    print(paste(model_name, 'results'))
    print(model_cm)
    
    # Timer end
    elapsed_time <- (proc.time() - ptm)[[3]]
  
    # determine the Accuracy, AUC and as many metrics as returned by the Caret package (confusionMatrix).
    # store results
    accuracy <- model_cm$overall[[1]]
    auc_val <- as.numeric(auc(roc(test_heart$target, factor(predict(dt_model, test_heart), ordered = TRUE))))
    sensitivity <- model_cm$byClass[[1]]
    specificity <- model_cm$byClass[[2]]
    precision <- model_cm$byClass[[5]]
    recall <- model_cm$byClass[[6]]
    f1 <- model_cm$byClass[[7]]
    
  # Bootstrap Model
  } 
  else if (grepl("Boot", model_name, fixed=TRUE)){
    
    dt_model <- train(form = target ~ ., data = data, trControl = tr, method = train_method)
    
    # end timer (training only; the per-resample metric loop below is not timed)
    elapsed_time <- (proc.time() - ptm)[[3]]
    
    accuracy <- c()
    auc_val <- c()
    sensitivity <- c()
    specificity <- c()
    precision <- c()
    recall <- c()
    f1 <- c()
    i <- 1
    
    pred_df <- dt_model$pred
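    # (savePredictions = 'final' keeps the held-out predictions for each of the
    # 200 bootstrap resamples in dt_model$pred)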
    for (resample in unique(pred_df$Resample)){
      temp <- filter(pred_df, Resample == resample)
      model_cm <- confusionMatrix(temp$pred, temp$obs)
      accuracy[i] <- model_cm$overall[[1]]
      auc_val[[i]] <- as.numeric(auc(roc(temp$obs, as.numeric(temp$pred))))
      sensitivity[[i]] <- model_cm$byClass[[1]]
      specificity[[i]] <- model_cm$byClass[[2]]
      precision[[i]] <- model_cm$byClass[[5]]
      recall[[i]] <- model_cm$byClass[[6]]
      f1[[i]] <- model_cm$byClass[[7]]
      i <- i + 1
    }
  
    accuracy <- mean(accuracy)
    auc_val <- mean(auc_val)
    sensitivity <- mean(sensitivity)
    specificity <- mean(specificity)
    precision <- mean(precision)
    recall <- mean(recall)
    f1 <- mean(f1)
  } 
  else if (grepl("RF", model_name, fixed=TRUE)){
    # Random Forest
    # train model
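    # ntree is parsed from the trailing two characters of model_name (e.g. "RF 50" -> ntree = 50)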
    dt_model <- train(form = target ~ ., data = train_heart, trControl = tr, ntree = as.numeric(str_sub(model_name, start = -2)), method = train_method)
    print(dt_model)
    
    # evaluate model
    model_cm <- confusionMatrix(predict(dt_model, subset(test_heart, select = -c(target))), test_heart$target)
    
    draw_confusion_matrix(model_cm)
    
    print(paste(model_name, 'results'))
    print(model_cm)
    
    # end timer
    elapsed_time <- (proc.time() - ptm)[[3]]
  
    # determine the Accuracy, AUC and as many metrics as returned by the Caret package (confusionMatrix).
    # store results
    accuracy <- model_cm$overall[[1]]
    auc_val <- as.numeric(auc(roc(test_heart$target, factor(predict(dt_model, test_heart), ordered = TRUE))))
    sensitivity <- model_cm$byClass[[1]]
    specificity <- model_cm$byClass[[2]]
    precision <- model_cm$byClass[[5]]
    recall <- model_cm$byClass[[6]]
    f1 <- model_cm$byClass[[7]]
  } 
  else {

    # Cross Validation
    
    dt_model <- train(form = target ~ ., data = data, trControl = tr, method = train_method)

    print(dt_model)
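    # confusion matrix built from the out-of-fold CV predictions, reordered by
    # rowIndex so they line up with the original rows of the dataset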
    model_cm <- confusionMatrix(dt_model$pred[order(dt_model$pred$rowIndex),]$pred, data$target)
  
    draw_confusion_matrix(model_cm)
  
    print(paste(model_name, 'results'))
    print(model_cm)
  
    # end timer
    elapsed_time <- (proc.time() - ptm)[[3]]

    # determine the Accuracy, AUC and as many metrics as returned by the Caret package (confusionMatrix).
    # store results
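    # note: AUC below is computed on the held-out 20% test split, while the
    # confusion matrix above uses the out-of-fold CV predictions on the full dataset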
    accuracy <- model_cm$overall[[1]]
    auc_val <- as.numeric(auc(roc(test_heart$target, factor(predict(dt_model, test_heart), ordered = TRUE))))
    sensitivity <- model_cm$byClass[[1]]
    specificity <- model_cm$byClass[[2]]
    precision <- model_cm$byClass[[5]]
    recall <- model_cm$byClass[[6]]
    f1 <- model_cm$byClass[[7]]

  }
  
  full_results <- rbind(accuracy, auc_val, sensitivity, specificity, precision, recall, f1, elapsed_time)
  colnames(full_results) <- c(model_name)
  return(full_results)
}

Base Metric - Decision Tree

dt_base <- eval_model("rpart", trainControl(method="none"), "DT Base")
## CART 
## 
## 242 samples
##  13 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: None

## [1] "DT Base results"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0  0  0
##          1 27 34
##                                           
##                Accuracy : 0.5574          
##                  95% CI : (0.4245, 0.6845)
##     No Information Rate : 0.5574          
##     P-Value [Acc > NIR] : 0.5531          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 5.624e-07       
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 1.0000          
##          Pos Pred Value :    NaN          
##          Neg Pred Value : 0.5574          
##              Prevalence : 0.4426          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 0               
## 

Base Metric - SVM

svm_base <- eval_model("svmLinearWeights", trainControl(method="none"), "SVM Base")
## Linear Support Vector Machines with Class Weights 
## 
## 242 samples
##  13 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: None

## [1] "SVM Base results"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 20  2
##          1  7 32
##                                           
##                Accuracy : 0.8525          
##                  95% CI : (0.7383, 0.9302)
##     No Information Rate : 0.5574          
##     P-Value [Acc > NIR] : 8.993e-07       
##                                           
##                   Kappa : 0.6952          
##                                           
##  Mcnemar's Test P-Value : 0.1824          
##                                           
##             Sensitivity : 0.7407          
##             Specificity : 0.9412          
##          Pos Pred Value : 0.9091          
##          Neg Pred Value : 0.8205          
##              Prevalence : 0.4426          
##          Detection Rate : 0.3279          
##    Detection Prevalence : 0.3607          
##       Balanced Accuracy : 0.8410          
##                                           
##        'Positive' Class : 0               
## 

5-Fold Cross-Validation - Decision Tree

dt_5cv <- eval_model("rpart", tr = trainControl(method = "cv", number = 5, savePredictions = 'final'), "DT 5 cv")
## CART 
## 
## 303 samples
##  13 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 243, 242, 242, 243, 242 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa    
##   0.03623188  0.7621311  0.5186339
##   0.03985507  0.7391803  0.4718201
##   0.48550725  0.6400000  0.2408988
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.03623188.

## [1] "DT 5 cv results"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  98  32
##          1  40 133
##                                           
##                Accuracy : 0.7624          
##                  95% CI : (0.7104, 0.8092)
##     No Information Rate : 0.5446          
##     P-Value [Acc > NIR] : 3.29e-15        
##                                           
##                   Kappa : 0.5187          
##                                           
##  Mcnemar's Test P-Value : 0.4094          
##                                           
##             Sensitivity : 0.7101          
##             Specificity : 0.8061          
##          Pos Pred Value : 0.7538          
##          Neg Pred Value : 0.7688          
##              Prevalence : 0.4554          
##          Detection Rate : 0.3234          
##    Detection Prevalence : 0.4290          
##       Balanced Accuracy : 0.7581          
##                                           
##        'Positive' Class : 0               
## 

5-Fold Cross-Validation - SVM

svm_5cv <- eval_model("svmLinearWeights", tr = trainControl(method = "cv", number = 5, savePredictions = 'final'), "SVM 5 cv")
## Linear Support Vector Machines with Class Weights 
## 
## 303 samples
##  13 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 243, 242, 242, 243, 242 
## Resampling results across tuning parameters:
## 
##   cost  weight  Accuracy   Kappa    
##   0.25  1       0.8450820  0.6862833
##   0.25  2       0.8383607  0.6670541
##   0.25  3       0.8153005  0.6183641
##   0.50  1       0.8318033  0.6591552
##   0.50  2       0.8285246  0.6468837
##   0.50  3       0.8185792  0.6253926
##   1.00  1       0.8350820  0.6664592
##   1.00  2       0.8252459  0.6399340
##   1.00  3       0.8153005  0.6188832
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were cost = 0.25 and weight = 1.
## [1] "SVM 5 cv results"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 110  19
##          1  28 146
##                                           
##                Accuracy : 0.8449          
##                  95% CI : (0.7991, 0.8837)
##     No Information Rate : 0.5446          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6856          
##                                           
##  Mcnemar's Test P-Value : 0.2432          
##                                           
##             Sensitivity : 0.7971          
##             Specificity : 0.8848          
##          Pos Pred Value : 0.8527          
##          Neg Pred Value : 0.8391          
##              Prevalence : 0.4554          
##          Detection Rate : 0.3630          
##    Detection Prevalence : 0.4257          
##       Balanced Accuracy : 0.8410          
##                                           
##        'Positive' Class : 0               
## 
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

10-Fold Cross-Validation - Decision Tree

dt_10cv <- eval_model("rpart", trainControl(method = "cv", number = 10, savePredictions = 'final'), "DT 10 cv")
## CART 
## 
## 303 samples
##  13 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 273, 272, 273, 273, 273, 273, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa    
##   0.03623188  0.7493548  0.4924633
##   0.03985507  0.7493548  0.4917245
##   0.48550725  0.6560215  0.2761508
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.03985507.

## [1] "DT 10 cv results"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  96  34
##          1  42 131
##                                          
##                Accuracy : 0.7492         
##                  95% CI : (0.6964, 0.797)
##     No Information Rate : 0.5446         
##     P-Value [Acc > NIR] : 1.515e-13      
##                                          
##                   Kappa : 0.4919         
##                                          
##  Mcnemar's Test P-Value : 0.422          
##                                          
##             Sensitivity : 0.6957         
##             Specificity : 0.7939         
##          Pos Pred Value : 0.7385         
##          Neg Pred Value : 0.7572         
##              Prevalence : 0.4554         
##          Detection Rate : 0.3168         
##    Detection Prevalence : 0.4290         
##       Balanced Accuracy : 0.7448         
##                                          
##        'Positive' Class : 0              
## 

10-Fold Cross-Validation - SVM

svm_10cv <- eval_model("svmLinearWeights", trainControl(method = "cv", number = 10, savePredictions = 'final'), "SVM 10 cv")
## Linear Support Vector Machines with Class Weights 
## 
## 303 samples
##  13 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 273, 272, 273, 273, 273, 273, ... 
## Resampling results across tuning parameters:
## 
##   cost  weight  Accuracy   Kappa    
##   0.25  1       0.8320430  0.6592645
##   0.25  2       0.8449462  0.6825573
##   0.25  3       0.8216129  0.6326403
##   0.50  1       0.8351613  0.6656510
##   0.50  2       0.8481720  0.6890891
##   0.50  3       0.8182796  0.6256678
##   1.00  1       0.8383871  0.6718040
##   1.00  2       0.8515054  0.6960616
##   1.00  3       0.8249462  0.6397001
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were cost = 1 and weight = 2.
## [1] "SVM 10 cv results"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 102   9
##          1  36 156
##                                           
##                Accuracy : 0.8515          
##                  95% CI : (0.8064, 0.8896)
##     No Information Rate : 0.5446          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6957          
##                                           
##  Mcnemar's Test P-Value : 0.0001063       
##                                           
##             Sensitivity : 0.7391          
##             Specificity : 0.9455          
##          Pos Pred Value : 0.9189          
##          Neg Pred Value : 0.8125          
##              Prevalence : 0.4554          
##          Detection Rate : 0.3366          
##    Detection Prevalence : 0.3663          
##       Balanced Accuracy : 0.8423          
##                                           
##        'Positive' Class : 0               
## 
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

Bootstrap - Decision Tree

dt_bt <- eval_model("rpart", trainControl(method="boot", number=200, savePredictions = 'final', returnResamp = 'final'), "DT Bootstrap")
print(dt_bt)
##              DT Bootstrap
## accuracy        0.7420239
## auc_val         0.7458831
## sensitivity     0.7039119
## specificity     0.7768658
## precision       0.7329613
## recall          0.7039119
## f1              0.7115944
## elapsed_time    4.5100000

Bootstrap - SVM

set.seed(43)

svm_bt <- eval_model("svmLinearWeights", trainControl(method="boot", number=200, savePredictions = 'final', returnResamp = 'final'), "SVM Bootstrap")
print(svm_bt)
##              SVM Bootstrap
## accuracy         0.8255533
## auc_val          0.8375211
## sensitivity      0.7207400
## specificity      0.9159850
## precision        0.8813384
## recall           0.7207400
## f1               0.7906451
## elapsed_time   120.7300000
data.frame(cbind(dt_base, dt_5cv, dt_10cv, dt_bt, svm_base, svm_5cv, svm_10cv, svm_bt))
##               DT.Base   DT.5.cv  DT.10.cv DT.Bootstrap  SVM.Base  SVM.5.cv
## accuracy     0.557377 0.7623762 0.7491749    0.7420239 0.8524590 0.8448845
## auc_val      0.500000 0.7854031 0.7412854    0.7458831 0.8409586 0.8594771
## sensitivity  0.000000 0.7101449 0.6956522    0.7039119 0.7407407 0.7971014
## specificity  1.000000 0.8060606 0.7939394    0.7768658 0.9411765 0.8848485
## precision          NA 0.7538462 0.7384615    0.7329613 0.9090909 0.8527132
## recall       0.000000 0.7101449 0.6956522    0.7039119 0.7407407 0.7971014
## f1                 NA 0.7313433 0.7164179    0.7115944 0.8163265 0.8239700
## elapsed_time 0.520000 0.6400000 0.7200000    4.5100000 0.3900000 1.0400000
##              SVM.10.cv SVM.Bootstrap
## accuracy     0.8514851     0.8255533
## auc_val      0.8224401     0.8375211
## sensitivity  0.7391304     0.7207400
## specificity  0.9454545     0.9159850
## precision    0.9189189     0.8813384
## recall       0.7391304     0.7207400
## f1           0.8192771     0.7906451
## elapsed_time 1.6800000   120.7300000

Part B

For the same dataset, set seed (43) and split 80/20.

# do an 80/20 split
set.seed(43)
train_ind <- sample(seq_len(nrow(data)), size = floor(0.8 * nrow(data)))
train_heart <- data[ train_ind,]
test_heart  <- data[-train_ind,]

Using randomForest, grow three different forests, varying the number of trees at least three times. Start with seeding and a fresh split for each forest. Note down as best as you can the development (engineering) cost as well as the computing cost (elapsed time) for each run. Compare these results with the experiments in Part A. Submit a PDF and an executable script in Python or R.
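Note that eval_model's RF branch parses the tree count from the last two characters of the model name, so "RF 10", "RF 50", and "RF 99" grow forests of 10, 50, and 99 trees. For reference, a minimal sketch of the equivalent call using randomForest directly (assuming the randomForest package is installed; caret's "rf" method wraps the same engine):

library(randomForest)
set.seed(43)
rf_direct <- randomForest(target ~ ., data = train_heart, ntree = 10)
confusionMatrix(predict(rf_direct, test_heart), test_heart$target)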

rf_10 <- eval_model("rf", trainControl(), "RF 10")
## Random Forest 
## 
## 242 samples
##  13 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 242, 242, 242, 242, 242, 242, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7792646  0.5549714
##   12    0.7551120  0.5039891
##   22    0.7556300  0.5050744
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

## [1] "RF 10 results"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 18  1
##          1  9 33
##                                           
##                Accuracy : 0.8361          
##                  95% CI : (0.7191, 0.9185)
##     No Information Rate : 0.5574          
##     P-Value [Acc > NIR] : 3.844e-06       
##                                           
##                   Kappa : 0.6573          
##                                           
##  Mcnemar's Test P-Value : 0.02686         
##                                           
##             Sensitivity : 0.6667          
##             Specificity : 0.9706          
##          Pos Pred Value : 0.9474          
##          Neg Pred Value : 0.7857          
##              Prevalence : 0.4426          
##          Detection Rate : 0.2951          
##    Detection Prevalence : 0.3115          
##       Balanced Accuracy : 0.8186          
##                                           
##        'Positive' Class : 0               
## 
rf_50 <- eval_model("rf", trainControl(), "RF 50")
## Random Forest 
## 
## 242 samples
##  13 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 242, 242, 242, 242, 242, 242, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7972460  0.5896982
##   12    0.7651698  0.5239571
##   22    0.7512177  0.4950400
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

## [1] "RF 50 results"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 19  2
##          1  8 32
##                                           
##                Accuracy : 0.8361          
##                  95% CI : (0.7191, 0.9185)
##     No Information Rate : 0.5574          
##     P-Value [Acc > NIR] : 3.844e-06       
##                                           
##                   Kappa : 0.66            
##                                           
##  Mcnemar's Test P-Value : 0.1138          
##                                           
##             Sensitivity : 0.7037          
##             Specificity : 0.9412          
##          Pos Pred Value : 0.9048          
##          Neg Pred Value : 0.8000          
##              Prevalence : 0.4426          
##          Detection Rate : 0.3115          
##    Detection Prevalence : 0.3443          
##       Balanced Accuracy : 0.8224          
##                                           
##        'Positive' Class : 0               
## 
rf_99 <- eval_model("rf", trainControl(), "RF 99")
## Random Forest 
## 
## 242 samples
##  13 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 242, 242, 242, 242, 242, 242, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7968746  0.5887862
##   12    0.7688465  0.5307189
##   22    0.7515108  0.4957248
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

## [1] "RF 99 results"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 20  2
##          1  7 32
##                                           
##                Accuracy : 0.8525          
##                  95% CI : (0.7383, 0.9302)
##     No Information Rate : 0.5574          
##     P-Value [Acc > NIR] : 8.993e-07       
##                                           
##                   Kappa : 0.6952          
##                                           
##  Mcnemar's Test P-Value : 0.1824          
##                                           
##             Sensitivity : 0.7407          
##             Specificity : 0.9412          
##          Pos Pred Value : 0.9091          
##          Neg Pred Value : 0.8205          
##              Prevalence : 0.4426          
##          Detection Rate : 0.3279          
##    Detection Prevalence : 0.3607          
##       Balanced Accuracy : 0.8410          
##                                           
##        'Positive' Class : 0               
## 
data.frame(cbind(rf_10, rf_50, rf_99))
##                  RF.10     RF.50     RF.99
## accuracy     0.8360656 0.8360656 0.8524590
## auc_val      0.8186275 0.8077342 0.8409586
## sensitivity  0.6666667 0.7037037 0.7407407
## specificity  0.9705882 0.9411765 0.9411765
## precision    0.9473684 0.9047619 0.9090909
## recall       0.6666667 0.7037037 0.7407407
## f1           0.7826087 0.7916667 0.8163265
## elapsed_time 1.1100000 2.0000000 3.1200000

Part C

Include a summary of your findings. Which of the two methods, bootstrap vs. cross-validation, do you recommend to your customer, and why? Be elaborate, covering computing costs, engineering costs, and model performance. Did you incorporate Pareto's maxim or Occam's Razor, and how did these two heuristics influence your decision?

result <- data.frame(cbind(dt_base, dt_5cv, dt_10cv, dt_bt, svm_base, svm_5cv, svm_10cv, svm_bt, rf_10, rf_50, rf_99))
result %>% kable() %>% kable_styling() %>% scroll_box(width = "800px", height = "400px")
              DT.Base    DT.5.cv   DT.10.cv  DT.Bootstrap   SVM.Base   SVM.5.cv  SVM.10.cv  SVM.Bootstrap      RF.10      RF.50      RF.99
accuracy     0.557377  0.7623762  0.7491749     0.7420239  0.8524590  0.8448845  0.8514851      0.8255533  0.8360656  0.8360656  0.8524590
auc_val      0.500000  0.7854031  0.7412854     0.7458831  0.8409586  0.8594771  0.8224401      0.8375211  0.8186275  0.8077342  0.8409586
sensitivity  0.000000  0.7101449  0.6956522     0.7039119  0.7407407  0.7971014  0.7391304      0.7207400  0.6666667  0.7037037  0.7407407
specificity  1.000000  0.8060606  0.7939394     0.7768658  0.9411765  0.8848485  0.9454545      0.9159850  0.9705882  0.9411765  0.9411765
precision          NA  0.7538462  0.7384615     0.7329613  0.9090909  0.8527132  0.9189189      0.8813384  0.9473684  0.9047619  0.9090909
recall       0.000000  0.7101449  0.6956522     0.7039119  0.7407407  0.7971014  0.7391304      0.7207400  0.6666667  0.7037037  0.7407407
f1                 NA  0.7313433  0.7164179     0.7115944  0.8163265  0.8239700  0.8192771      0.7906451  0.7826087  0.7916667  0.8163265
elapsed_time 0.520000  0.6400000  0.7200000     4.5100000  0.3900000  1.0400000  1.6800000    120.7300000  1.1100000  2.0000000  3.1200000

Analysis

I would recommend cross-validation. Cross-validation is a powerful tool: it makes better use of our data and gives us much more information about our algorithm's performance than a single holdout split.

SVM

Consistent with Pareto's maxim, the base SVM already did most of the work; cross-validation then improved the AUC and sensitivity estimates. 10-fold CV added little over 5-fold while increasing processing time, so following Occam's razor we should use the simpler model, i.e. 5-fold CV.

Decision Tree

Here cross-validation helped with parameter selection (the cp tuning grid). 5-fold cross-validation yields slightly better results than 10-fold (accuracy 0.762 vs. 0.749), and we found no added benefit from 10-fold CV, so the simpler solution (5-fold CV) should be used, per Occam's razor.

Random Forest

Random Forest (at 99 trees) matched the accuracy of the base SVM model (0.8525) but took more time to compute, so by Occam's razor it is better to use the simpler model.