Please refer to the HW2 Document.

2 Part A

2.1 Step 0

I picked two classifiers, SVM and Decision Tree, and the heart disease dataset for the experiments below.

2.1.1 Load Data

## 
## -- Column specification ----------------------------------------------------------------------------------------------------
## cols(
##   age = col_double(),
##   sex = col_double(),
##   cp = col_double(),
##   trestbps = col_double(),
##   chol = col_double(),
##   fbs = col_double(),
##   restecg = col_double(),
##   thalach = col_double(),
##   exang = col_double(),
##   oldpeak = col_double(),
##   slope = col_double(),
##   ca = col_double(),
##   thal = col_double(),
##   target = col_double()
## )
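
The column specification above is printed by readr::read_csv(). A minimal sketch of the loading step, assuming the file is named heart.csv in the working directory (the file name is an assumption) and converting the 0/1 target to a factor for classification:

```r
library(readr)

heart <- read_csv("heart.csv")          # prints the column specification shown above
heart$target <- factor(heart$target)    # 0/1 outcome as a factor for the classifiers
```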

2.2 Step 1

For each classifier, set a seed (43).
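
In code this is simply:

```r
set.seed(43)   # same seed before every split/fit for reproducibility
```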

2.3 Step 2

Do an 80/20 split and determine the Accuracy, AUC, and as many metrics as are returned by the caret package (confusionMatrix). Note down, as best as you can, the development (engineering) cost as well as the computing cost (elapsed time).
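
A sketch of the split, assuming caret's stratified createDataPartition() (a plain sample() split would work as well), with the seed set as in Step 1:

```r
library(caret)

idx   <- createDataPartition(heart$target, p = 0.8, list = FALSE)  # 80/20, stratified on target
train <- heart[idx, ]
test  <- heart[-idx, ]
```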

2.3.2 SVM

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
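
The messages above come from pROC::roc(). A hedged sketch of the base SVM run, assuming e1071::svm with default settings (the package and kernel are assumptions), with system.time() capturing the computing cost:

```r
library(e1071)   # svm()
library(pROC)    # roc(), auc()

svm_time <- system.time({
  svm_fit  <- svm(target ~ ., data = train)
  svm_pred <- predict(svm_fit, test)
})

confusionMatrix(svm_pred, test$target)             # accuracy, kappa, sensitivity, ...
svm_roc <- roc(test$target, as.numeric(svm_pred))  # prints the "Setting levels/direction" messages
auc(svm_roc)
svm_time["elapsed"]                                # computing cost
```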

2.3.3 Decision Tree

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
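
A corresponding sketch for the base Decision Tree, assuming rpart (the tree package is an assumption):

```r
library(rpart)

dt_time <- system.time({
  dt_fit  <- rpart(target ~ ., data = train, method = "class")
  dt_pred <- predict(dt_fit, test, type = "class")
})

confusionMatrix(dt_pred, test$target)
auc(roc(test$target, as.numeric(dt_pred)))
dt_time["elapsed"]
```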

2.4 Step 3

Start with the original dataset and set a seed (43). Then run 5-fold and 10-fold cross-validation of the model on the training set.

Determine the same set of metrics and compare the cv_metrics with the base_metric. Note down, as best as you can, the development (engineering) cost as well as the computing cost (elapsed time).
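
The resampling schemes below can all be expressed as caret trainControl objects. The names ctrl_cv5, ctrl_cv10, and ctrl_boot are hypothetical, used only to keep the sketches in the following subsections short:

```r
ctrl_cv5  <- trainControl(method = "cv",   number = 5)
ctrl_cv10 <- trainControl(method = "cv",   number = 10)
ctrl_boot <- trainControl(method = "boot", number = 200)
```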

2.4.1 SVM w/ 5-fold CV

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
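
A sketch of the 5-fold run with caret::train, assuming the radial-kernel model code "svmRadial" (backed by the kernlab package); the 10-fold run in the next subsection only swaps in ctrl_cv10:

```r
library(kernlab)  # backend for caret's "svmRadial" model code

set.seed(43)
svm_cv5  <- train(target ~ ., data = train,
                  method = "svmRadial", trControl = ctrl_cv5)
getTrainPerf(svm_cv5)                        # resampled metrics (cv_metrics)
cv5_pred <- predict(svm_cv5, test)
confusionMatrix(cv5_pred, test$target)       # same metrics as the base model
auc(roc(test$target, as.numeric(cv5_pred)))  # prints the messages above
```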

2.4.2 SVM w/ 10-fold CV

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

2.4.3 SVM w/ Bootstrap

SVM with bootstrap resampling (200 resamples).
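
A sketch of this run, reusing the hypothetical ctrl_boot object defined in Step 3:

```r
set.seed(43)
svm_boot <- train(target ~ ., data = train,
                  method = "svmRadial", trControl = ctrl_boot)  # 200 bootstrap resamples
getTrainPerf(svm_boot)
```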

2.4.4 DT w/ 5-fold CV

Decision Tree with 5-fold CV

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
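
The Decision Tree runs follow the same pattern, assuming caret's "rpart" model code; the 10-fold and bootstrap variants in the next two subsections swap in ctrl_cv10 and ctrl_boot:

```r
set.seed(43)
dt_cv5   <- train(target ~ ., data = train,
                  method = "rpart", trControl = ctrl_cv5)
getTrainPerf(dt_cv5)
cv5_pred <- predict(dt_cv5, test)
confusionMatrix(cv5_pred, test$target)
auc(roc(test$target, as.numeric(cv5_pred)))
```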

2.4.5 DT w/ 10-fold CV

Decision Tree with 10-fold CV

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

2.4.6 DT w/ Bootstrap

Decision Tree with bootstrap resampling (200 resamples).

3 Part B

For the same dataset, set a seed (43) and split 80/20.

Use randomForest to grow 4 different forests (n = 15, 25, 50, and 75). Note down, as best as you can, the development (engineering) cost as well as the computing cost (elapsed time) for each run, and compare these results with the experiments in Part A.
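
A hedged sketch of Part B, assuming the randomForest package with ntree set to each requested forest size (the loop and variable names are illustrative, not the exact code used):

```r
library(randomForest)

set.seed(43)
idx   <- createDataPartition(heart$target, p = 0.8, list = FALSE)
train <- heart[idx, ]
test  <- heart[-idx, ]

for (n in c(15, 25, 50, 75)) {
  rf_time <- system.time(
    rf_fit <- randomForest(target ~ ., data = train, ntree = n)
  )
  rf_pred <- predict(rf_fit, test)
  print(confusionMatrix(rf_pred, test$target)$overall["Accuracy"])
  print(auc(roc(test$target, as.numeric(rf_pred))))
  print(rf_time["elapsed"])                 # computing cost per forest size
}
```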

3.1 Random Forest with n = 15

3.2 Random Forest with n = 25

3.3 Random Forest with n = 50

3.4 Random Forest with n = 75

3.5 Comparison

Comparing the results from Part B with the experiments in Part A, we can see that the SVM models have better accuracy on average, while the Random Forest models have better AUC values on average. The Decision Tree models have the lowest accuracy and AUC values among all 12 models. The elapsed times of the models are similar.

  • Summary of all 12 models' metrics

4 Part C

Comparing the results from the bootstrap models and the cross-validation models, it is clear that the cross-validation models perform better, with lower engineering and computing cost. The bootstrap models require fitting and aggregating 200 resamples, so their cost is much higher than that of the cross-validation method.

Within the SVM models, the cross-validation models have better accuracy, better AUC values, and lower engineering and computing cost than the bootstrap model.

Within the Decision Tree models, the cross-validation models have better accuracy, better AUC values, and much lower engineering and computing cost than the bootstrap model.

Comparing 5-fold CV and 10-fold CV, 5-fold CV has lower computing cost than 10-fold CV, although the two produce the same accuracy and AUC values.

Thus, I would recommend that my customers use the 5-fold cross-validation model, as it has the best metrics and the lowest cost. This is also consistent with Occam's Razor: when two models perform comparably, the simpler one is usually the better choice.