1 Assignment

Consider the Breast Cancer Coimbra data set (available from the UCI repository).

Compare three different ensemble approaches (bagging, boosting and stacking), using the ‘caret’ package.

Evaluation of the derived models should follow a correct methodology, comparing different estimates of generalization error (e.g. holdout, cross-validation, bootstrap, …)

Submit a report (in PDF, generated from R) with the code and the resulting analysis.

2 Introduction

A quick review shows that ensemble methods are not limited to those listed above. Nevertheless, these are the main ensemble learning methods.

Weak (or base) learners are the building blocks for more complex models. They normally do not perform well by themselves, due to either high variance or high bias. Ensemble methods therefore combine several weak learners in order to create a strong learner.

Normally (in bagging and boosting methods), a single base learning algorithm is used to build what we call a “homogeneous” ensemble model. However, we can also aggregate several different base learning algorithms into a “heterogeneous” ensemble model. It is important to choose base models that complement each other, e.g. one with high bias and another with high variance, so that the ensemble can compensate for the weaknesses of each.
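The benefit of combining weak learners can be illustrated with a small simulation. This is only a sketch on synthetic data: it assumes 25 learners that are each correct 60% of the time and, unrealistically, independent of each other (real base learners trained on the same data are correlated, so the gain is smaller). All names and values are illustrative.

```r
# Majority voting over independent weak classifiers (synthetic illustration).
set.seed(2020)
n_obs <- 10000      # number of simulated cases
n_learners <- 25    # size of the ensemble (odd, so no tied votes)
p_correct <- 0.6    # assumed accuracy of every single weak learner

# TRUE means "this learner classified this case correctly".
correct <- matrix(runif(n_obs * n_learners) < p_correct,
  nrow = n_obs, ncol = n_learners
)

# Accuracy of one learner on its own.
single_accuracy <- mean(correct[, 1])

# The ensemble is right whenever more than half of its members are right.
vote_accuracy <- mean(rowSums(correct) > n_learners / 2)

# Exact value under independence: P(at least 13 of 25 learners are correct).
exact_vote_accuracy <- 1 - pbinom(floor(n_learners / 2), n_learners, p_correct)

c(single = single_accuracy, vote = vote_accuracy, exact = exact_vote_accuracy)
```

Under these assumptions the voted accuracy is clearly higher than that of any single learner, which is the intuition behind the ensemble methods compared in this report.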

The three main ensemble methods are bagging, boosting and stacking. Each one is covered in its own section below.

3 Bagging

Here we will compare two bagging algorithms, treebag and random forest, against the usual rpart model.
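The code in this report uses two data frames, training and testing, whose creation is not shown. The following is a minimal sketch of how such a holdout split could be produced. Everything specific here is an assumption rather than the original code: the local file name dataR2.csv, the coding of Classification (1 = healthy control, 2 = patient), the level order with "Patient" first, and the 75/25 split.

```r
library(caret)

# Assumption: the UCI file has been downloaded to the working directory.
bc <- read.csv("dataR2.csv")

# Assumption: Classification is coded 1 = healthy control, 2 = patient.
# "Patient" is put first so that twoClassSummary treats it as the event class.
bc$Classification <- factor(bc$Classification,
  levels = c(2, 1), labels = c("Patient", "Control")
)

# Stratified holdout split; the 75/25 proportion is an assumption.
set.seed(2020)
in_train <- createDataPartition(bc$Classification, p = 0.75, list = FALSE)
training <- bc[in_train, ]
testing <- bc[-in_train, ]

# Class balance in both partitions.
table(training$Classification)
table(testing$Classification)
```

The testing partition is never used for fitting or tuning; it only serves for the final comparison of predictions.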

library(caret)

# 25 bootstrap resamples; the explicit index makes every model use the same resamples
set.seed(2020)
control <- trainControl(
  method = "boot",
  number = 25,
  savePredictions = "final",
  classProbs = TRUE,
  index = createResample(training$Classification, 25),
  summaryFunction = twoClassSummary
)
metric <- "ROC"

# Single decision tree (baseline)
set.seed(2020)
fit.rpart <- train(Classification ~ ., data = training, method = "rpart", metric = metric, trControl = control)

# Bagged trees; metric is passed explicitly so that tuning uses ROC without a warning
set.seed(2020)
fit.treebag <- train(Classification ~ .,
  data = training,
  method = "treebag", metric = metric, trControl = control, verbose = FALSE
)

# Random forest
set.seed(2020)
fit.rf <- train(Classification ~ .,
  data = training,
  method = "rf", metric = metric, trControl = control, verbose = FALSE
)
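To make the idea behind bagging concrete, here is a by-hand sketch: fit one rpart tree per bootstrap resample and average the predicted probabilities. It runs on synthetic data, not the Coimbra data, and is not the implementation behind treebag; all object names are hypothetical.

```r
library(rpart)

set.seed(2020)
# Synthetic two-class data: the class depends on x1 + x2 plus noise.
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- factor(
  ifelse(sim$x1 + sim$x2 + rnorm(n, sd = 0.5) > 0, "Patient", "Control"),
  levels = c("Patient", "Control")
)
train_idx <- sample(n, 200)
sim_train <- sim[train_idx, ]
sim_test <- sim[-train_idx, ]

n_bags <- 25
# One column of predicted "Patient" probabilities per bootstrap tree.
bag_probs <- sapply(seq_len(n_bags), function(b) {
  boot_idx <- sample(nrow(sim_train), replace = TRUE) # bootstrap resample
  tree <- rpart(y ~ x1 + x2, data = sim_train[boot_idx, ], method = "class")
  predict(tree, newdata = sim_test, type = "prob")[, "Patient"]
})

# Bagged prediction: average the per-tree probabilities, then threshold at 0.5.
bagged_prob <- rowMeans(bag_probs)
bagged_class <- ifelse(bagged_prob > 0.5, "Patient", "Control")
bagged_accuracy <- mean(bagged_class == sim_test$y)
bagged_accuracy
```

Averaging over many trees grown on different resamples smooths out the instability of a single tree. Random forest goes one step further by also restricting each split to a random subset of the predictors.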

3.1 Comparing training metrics

results_bag <- resamples(list(rpart = fit.rpart, treebag = fit.treebag, rf = fit.rf))

# Compare models
dotplot(results_bag)

3.2 Comparing predictions

pred_bag <- list()
pred_bag$rpart <- predict(fit.rpart, newdata = testing, type = "prob")[, "Patient"]
pred_bag$treebag <- predict(fit.treebag, newdata = testing, type = "prob")[, "Patient"]
pred_bag$rf <- predict(fit.rf, newdata = testing, type = "prob")[, "Patient"]
pred_bag <- data.frame(pred_bag)

caTools::colAUC(pred_bag, testing$Classification, plotROC = TRUE)
                        rpart   treebag        rf
Patient vs. Control 0.5673077 0.7644231 0.8052885
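The AUC values can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes this from ranks (the Mann-Whitney formulation) in base R. The function name auc_rank and the toy scores are hypothetical and are not model output.

```r
# AUC from ranks: the share of (positive, negative) pairs in which the
# positive case has the higher score; ties count as half a pair.
auc_rank <- function(score, is_positive) {
  r <- rank(score) # tied scores receive average ranks
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example with three positive and three negative cases.
score <- c(0.9, 0.8, 0.35, 0.6, 0.4, 0.2)
is_positive <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)

# 7 of the 9 positive/negative pairs are ordered correctly.
auc_rank(score, is_positive) # 0.7777778
```

Because the AUC moves in steps of 1 / (n_pos * n_neg), estimates from a small test set are coarse, and modest differences between models should be interpreted with care.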

4 Boosting

Here we will compare two boosting algorithms, gbm and C5.0, against the usual rpart model.

set.seed(2020)
control <- trainControl(
  method = "boot",
  number = 25,
  savePredictions = "final",
  classProbs = TRUE,
  index = createResample(training$Classification, 25),
  summaryFunction = twoClassSummary
)
metric <- "ROC"

set.seed(2020)
fit.rpart <- train(Classification ~ ., data = training, method = "rpart", metric = metric, trControl = control)

# Gradient boosting with the AdaBoost exponential loss
set.seed(2020)
fit.gbm <- train(Classification ~ .,
  data = training,
  method = "gbm", metric = metric, trControl = control, verbose = FALSE, distribution = "adaboost"
)

# C5.0 with boosting trials
set.seed(2020)
fit.c50 <- train(Classification ~ .,
  data = training,
  method = "C5.0", metric = metric, trControl = control, verbose = FALSE
)

4.1 Comparing training metrics

results_boost <- resamples(list(rpart = fit.rpart, gbm = fit.gbm, c50 = fit.c50))

# Compare models
dotplot(results_boost)

4.2 Comparing predictions

pred_boost <- list()
pred_boost$rpart <- predict(fit.rpart, newdata = testing, type = "prob")[, "Patient"]
pred_boost$gbm <- predict(fit.gbm, newdata = testing, type = "prob")[, "Patient"]
pred_boost$c50 <- predict(fit.c50, newdata = testing, type = "prob")[, "Patient"]
pred_boost <- data.frame(pred_boost)

caTools::colAUC(pred_boost, testing$Classification, plotROC = TRUE)
                        rpart       gbm       c50
Patient vs. Control 0.5673077 0.7548077 0.7644231

5 Stacking

Here we will create a stacking model from rpart, svmLinear and naive Bayes, and compare it with the individual base models, in particular with rpart alone.

library(caretEnsemble)

# DO NOT use the trainControl object used to fit the training models to fit the ensemble.
set.seed(2020)
control <- trainControl(
  method = "boot",
  number = 25,
  savePredictions = "final",
  classProbs = TRUE,
  index = createResample(training$Classification, 25),
  summaryFunction = twoClassSummary
)
metric <- "ROC"

# Base models for the stack; metric is forwarded to train() so that tuning uses ROC
model_list <- caretList(
  Classification ~ .,
  data = training,
  trControl = control,
  metric = metric,
  methodList = c("rpart", "svmLinear", "nb")
)

set.seed(2020)
fit_control <- trainControl(
  method = "boot",
  number = 25,
  savePredictions = "final",
  classProbs = TRUE,
  index = createResample(training$Classification, 25),
  summaryFunction = twoClassSummary
)

set.seed(2020)
fit.rpart <- train(Classification ~ .,
  data = training, method = "rpart",
  metric = metric, trControl = fit_control
)

set.seed(2020)
fit.svm <- train(Classification ~ .,
  data = training, method = "svmLinear",
  metric = metric, trControl = fit_control
)

set.seed(2020)
fit.nb <- train(Classification ~ .,
  data = training, method = "nb",
  metric = metric, trControl = fit_control
)

set.seed(2020)
glm_ensemble <- caretStack(
  model_list,
  method = "glm",
  metric = metric,
  trControl = trainControl(
    method = "boot",
    number = 25,
    savePredictions = "final",
    classProbs = TRUE,
    summaryFunction = twoClassSummary
  )
)
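The general idea of stacking can be sketched by hand: obtain out-of-fold predictions from each base model, then fit a meta-learner on those predictions. The example below uses synthetic data, with a logistic regression and an rpart tree as base models and a logistic regression as meta-learner. It is an illustration under these assumptions, not a description of how caretStack works internally, and all names are hypothetical.

```r
library(rpart)

set.seed(2020)
# Synthetic data with a non-linear class boundary (1 = positive class).
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- as.integer(sim$x1 + sim$x2^2 + rnorm(n, sd = 0.5) > 1)

# Assign every row to one of k folds.
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# Out-of-fold predictions: each row is scored by base models that never saw it.
oof <- data.frame(p_glm = numeric(n), p_tree = numeric(n), y = sim$y)
for (f in seq_len(k)) {
  fit_rows <- fold != f
  base_glm <- glm(y ~ x1 + x2, data = sim[fit_rows, ], family = binomial)
  base_tree <- rpart(factor(y) ~ x1 + x2, data = sim[fit_rows, ], method = "class")
  oof$p_glm[!fit_rows] <- predict(base_glm, newdata = sim[!fit_rows, ], type = "response")
  oof$p_tree[!fit_rows] <- predict(base_tree, newdata = sim[!fit_rows, ], type = "prob")[, "1"]
}

# Meta-learner: logistic regression on the base-model probabilities.
meta <- glm(y ~ p_glm + p_tree, data = oof, family = binomial)
coef(meta)
```

Training the meta-learner on out-of-fold predictions is what keeps it from simply rewarding the base model that overfits the training data most.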

5.1 Comparing training metrics

results_stack <- resamples(list(
  rpart = fit.rpart, svm = fit.svm, nb = fit.nb,
  stack = glm_ensemble$ens_model
))

# Compare models
dotplot(results_stack)

5.2 Comparing predictions

model_preds <- lapply(model_list, predict, newdata = testing, type = "prob")
model_preds <- lapply(model_preds, function(x) x[, "Patient"])
model_preds <- data.frame(model_preds)

model_preds$stack <- predict(glm_ensemble, newdata = testing, type = "prob")

caTools::colAUC(model_preds, testing$Classification, plotROC = TRUE)
                        rpart svmLinear        nb     stack
Patient vs. Control 0.5673077 0.7740385 0.7932692 0.7596154

6 Comparing all models

Just as a summary, let’s compare all models together.

6.1 Comparing training metrics

It is interesting to see that, although SVM is ranked as the best model by mean ROC, the stack ensemble is the most stable across resamples, having the narrowest confidence interval. We can also notice that all the ensemble models perform better than rpart alone on the resampled training data.

results_all <- resamples(list(
  rpart = fit.rpart, svm = fit.svm, nb = fit.nb,
  stack = glm_ensemble$ens_model, gbm = fit.gbm, c50 = fit.c50, treebag = fit.treebag, rf = fit.rf
))

# Compare models
dotplot(results_all)
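The width of such intervals reflects how much a metric varies across the resamples. The sketch below computes a t-based 95% interval for the mean of 25 resampled ROC values; it assumes this is comparable to the intervals drawn by dotplot. The ROC values are synthetic placeholders, not results from the models above, and t_interval is a hypothetical helper.

```r
set.seed(2020)
# Synthetic ROC values standing in for 25 bootstrap resamples of two models.
roc_stable <- rnorm(25, mean = 0.78, sd = 0.03) # small spread across resamples
roc_noisy <- rnorm(25, mean = 0.80, sd = 0.08)  # higher mean, larger spread

# Mean with a t-based confidence interval and its width.
t_interval <- function(x, conf_level = 0.95) {
  ci <- t.test(x, conf.level = conf_level)$conf.int
  c(mean = mean(x), lower = ci[1], upper = ci[2], width = ci[2] - ci[1])
}

rbind(stable = t_interval(roc_stable), noisy = t_interval(roc_noisy))
```

A model can therefore rank first on the mean and still be the less reliable choice if its interval is much wider.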

6.2 Comparing predictions

In the “final” test, i.e. the predictions on the independent test set, rpart shows its weakness in generalizing (AUC 0.567), while rf does a good job (AUC 0.805, the highest of all models).

all_preds <- data.frame(cbind(
  rpart = model_preds$rpart,
  svm = model_preds$svmLinear,
  nb = model_preds$nb,
  stack = model_preds$stack,
  gbm = pred_boost$gbm,
  c50 = pred_boost$c50,
  treebag = pred_bag$treebag,
  rf = pred_bag$rf
))
caTools::colAUC(all_preds, testing$Classification)
                        rpart       svm        nb     stack       gbm       c50   treebag        rf
Patient vs. Control 0.5673077 0.7740385 0.7932692 0.7596154 0.7548077 0.7644231 0.7644231 0.8052885
