Introduction

The following notes are from the Coursera Practical Machine Learning course, and are intended to help others understand the concepts and code behind the math. Bootstrap aggregation, also known as bagging, is a machine learning method based on a few basic principles:

- Resample the training data with replacement to create multiple bootstrap data sets.
- Refit the model and recalculate the predictions on each resampled data set.
- Aggregate the results, averaging for regression or taking a majority vote for classification.

This method has the advantages of:

- Keeping a bias similar to that of a single fitted model.
- Reducing the variance of the predictions.
- Being especially useful for non-linear functions.
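
Before turning to the course example, here is a minimal sketch of those principles in base R. The lm model, the mtcars data, and the variable names are our own illustration, not part of the course code:

set.seed(123)
n <- nrow(mtcars)
B <- 25                                        # number of bootstrap resamples
preds <- matrix(NA, nrow = B, ncol = n)
for (b in 1:B) {
  idx <- sample(1:n, replace = TRUE)           # principle 1: resample with replacement
  fit <- lm(mpg ~ wt, data = mtcars[idx, ])    # principle 2: refit and recalculate
  preds[b, ] <- predict(fit, newdata = mtcars) # predictions on the original data
}
bagged <- colMeans(preds)                      # principle 3: aggregate by averaging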

Building an Example of Bagging

Let’s see an example with the ozone data set.

library(ElemStatLearn)
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
data(ozone, package = "ElemStatLearn")
ozone <- ozone[order(ozone$ozone), ]
head(ozone)
##     ozone radiation temperature wind
## 17      1         8          59  9.7
## 19      4        25          61  9.7
## 14      6        78          57 18.4
## 45      7        48          80 14.3
## 106     7        49          69 10.3
## 7       8        19          61 20.1

The example below (taken directly from the Coursera Johns Hopkins University Practical Machine Learning class) is a bit intricate, but it basically builds 10 resampled data sets from which to fit 10 loess curves to the data.

ll <- matrix(NA, nrow = 10, ncol = 155)        # one row per bootstrap resample
for(i in 1:10) {
  # draw a bootstrap sample of row indices, the same size as the data, with replacement
  ss <- sample(1:dim(ozone)[1], replace = TRUE)
  ozone0 <- ozone[ss, ]                        # build the resampled data set
  ozone0 <- ozone0[order(ozone0$ozone), ]      # reorder it by the ozone variable
  # fit a loess smoother of temperature on ozone to this resample
  loess0 <- loess(temperature ~ ozone, data = ozone0, span = 0.2)
  # evaluate the fitted curve on a common grid of ozone values from 1 to 155
  ll[i, ] <- predict(loess0, newdata = data.frame(ozone = 1:155))
}

Loess is a smoothing curve fit through the data, much like a spline. We can see the effect by plotting the 10 resampled curves and the average line. The code above can seem daunting, but we can break it down into pieces to get to know it well.

It is a little more complex because we take several steps: create the loop, sample from the ozone data set, fit the loess smoother, and then generate the prediction using the loess object. Using fitted objects to create new predictions is sometimes a bit harder to understand in R, but that is the way the language works, as the short example below illustrates.
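
Here is the fit-then-predict pattern on its own, outside the loop (a sketch; the object name fit and the example ozone values are ours):

# The model function returns a fitted object; predict() then dispatches
# on that object's class to generate predictions for new data.
fit <- loess(temperature ~ ozone, data = ozone, span = 0.2)
class(fit)                                        # "loess"
predict(fit, newdata = data.frame(ozone = c(25, 50, 100)))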

plot(ozone$ozone, ozone$temperature, pch = 19, cex = 0.5,
     xlab = "Ozone", ylab = "Temperature (degrees F)")
# draw each of the 10 resampled loess curves in grey
for (i in 1:10) {
  lines(1:155, ll[i, ], col = "grey", lwd = 2)
}
# the bagged fit: the average of the 10 curves at each ozone value
lines(1:155, apply(ll, 2, mean), col = "red", lwd = 2)

The last line might be hard to see, but we are basically plotting:

- The observed temperatures against ozone as points.
- The 10 loess curves from the bootstrap resamples in grey.
- The average of the 10 curves at each ozone value in red, which is the bagged loess estimate.

Bagging with Caret

Alternatively, you can bag any model you choose using caret's bag function. You can even construct your own fitting, prediction, and aggregation functions.

library(party)
predictors  <- data.frame(ozone = ozone$ozone)
temperature <- ozone$temperature
treebag <- bag(predictors, temperature, B = 10,
               bagControl = bagControl(fit = ctreeBag$fit,
                                       predict = ctreeBag$pred,
                                       aggregate = ctreeBag$aggregate))
## Warning: executing %dopar% sequentially: no parallel backend registered
plot(ozone$ozone, temperature, col = 'lightgrey', pch = 19)  # observed data
# predictions from the first of the 10 bagged conditional inference trees
points(ozone$ozone, predict(treebag$fits[[1]]$fit, predictors), pch = 19, col = "red")
# aggregated (bagged) predictions across all 10 trees
points(ozone$ozone, predict(treebag, predictors), pch = 19, col = "blue")

Using the bag function requires careful thought and some knowledge of the options it exposes. If you execute the plot one line at a time you will see that a) the grey points are the observed temperatures, b) the red points are the predictions from one particular tree, and c) the blue points are the aggregated predictions from all the models, and they capture the underlying trend a little better.
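
To construct your own bagger, you supply the three functions yourself. Below is a minimal sketch that bags a plain linear model; the helper names lmFit, lmPred, and lmAgg are our own, but their argument signatures follow the fit, predict, and aggregate contract documented for caret's bagControl:

# Custom bagging of a linear model through caret's bag/bagControl interface.
lmFit  <- function(x, y, ...) lm(y ~ ., data = data.frame(x, y = y))
lmPred <- function(object, x) predict(object, newdata = data.frame(x))
lmAgg  <- function(x, type = NULL) {
  # x is a list holding one prediction vector per bootstrap model
  rowMeans(do.call(cbind, x))
}
lmbag <- bag(predictors, temperature, B = 10,
             bagControl = bagControl(fit = lmFit, predict = lmPred, aggregate = lmAgg))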

Conclusion

Bagging is very useful for non-linear models, and it is most often used with trees. An extension of bagging is the random forest. Several methods in caret's train function use bagging internally, such as treebag, bagEarth, and bagFDA. I hope you enjoyed this small example of using bootstrap aggregation.
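
For instance, here is a minimal sketch of bagged trees through train (assuming the ipred package behind the treebag method is installed):

# Bagged regression trees via caret's train interface
modFit <- train(temperature ~ ozone, data = ozone, method = "treebag")
predict(modFit, newdata = data.frame(ozone = c(25, 50, 100)))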