Bootstrap Aggregating, or Bagging, is an ensemble method that improves the accuracy of unstable models by averaging copies of the same model fit to bootstrapped samples of the training data. Consider a regression or classification setting: our data consist of \(n\) observations \((X_1, Y_1), ..., (X_n, Y_n)\), where each \(X_i\) is a vector of \(p\) predictors and \(Y = (Y_1, ..., Y_n)\) is the response vector. We use some base procedure \(\hat{g}(\cdot)\) to model the relationship between \(X\) and \(Y\). A bagged model, \(\hat{g}_{bag}(\cdot)\), is an average of several \(\hat{g}(\cdot)\) fit to bootstrapped samples of the data. Bagging acts as a smoothing operator for hard decision rules (consider a single split in a decision tree); smoothing these decisions reduces the variance of the model and, in turn, its prediction error. Heuristically, the variance of the bagged estimator \(\hat{g}_{bag}(\cdot)\) should be equal to or smaller than the variance of the original estimator \(\hat{g}(\cdot)\), and the reduction in variance is larger when the original estimator is unstable.
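This heuristic can be made concrete with a standard identity (a supplementary note, not part of the original discussion): if the \(M\) estimators being averaged are identically distributed with variance \(\sigma^2\) and pairwise correlation \(\rho \geq 0\), the variance of their average is

\[ \rho\sigma^{2} + \frac{1-\rho}{M}\sigma^{2} \leq \sigma^{2}. \]

As \(M\) grows, the variance shrinks toward \(\rho\sigma^{2}\); the more the base procedure changes from one bootstrap sample to the next (i.e., the more unstable it is), the smaller \(\rho\) tends to be and the larger the reduction. The bagging algorithm is as follows: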
Step 1: Construct a bootstrap sample \((X^{*}, Y^{*})\) by drawing \(n\) times with replacement from the data \((X_1, Y_1), ... ,(X_n, Y_n)\).
Step 2: Compute the bootstrapped estimator \(\hat{g}^{*}(\cdot)\) by fitting the model to the bootstrapped data.
Step 3: Repeat Steps 1 and 2 \(M\) times, yielding \(\hat{g}^{*k}(\cdot)\) for \(k = 1, ..., M\). The bagged estimator is \(\hat{g}_{bag}(\cdot) = \frac{1}{M} \sum_{k = 1}^{M} \hat{g}^{*k}(\cdot)\). For classification, the \(M\) predicted class labels are typically aggregated by majority vote rather than a plain average.
Figure 1: Bagging Procedure
This diagram visualizes bagging. A user generates \(M\) bootstrapped samples from the training data \((X, Y)\) and computes \(M\) estimators, shown as decision trees. The bagged estimator is the average of the \(M\) estimators.
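Before turning to the full example, the short sketch below illustrates the three steps on simulated data with lm() as the base procedure; the data-generating process and the choice of \(M = 25\) are our own assumptions, not part of the original example.

# generic bagging sketch on simulated data (illustration only)
# =============================================================================
set.seed(1)
n <- 100
sim <- data.frame(x = runif(n))
sim$y <- sin(2 * pi * sim$x) + rnorm(n, sd = 0.3)

M <- 25
x.grid <- data.frame(x = seq(0, 1, length.out = 50))
preds <- matrix(NA, nrow = nrow(x.grid), ncol = M)
for(k in 1:M)
{
  idx <- sample(1:n, n, replace = TRUE)        # Step 1: bootstrap sample
  fit <- lm(y ~ poly(x, 5), data = sim[idx, ]) # Step 2: fit the base procedure
  preds[, k] <- predict(fit, newdata = x.grid)
}
bagged.curve <- rowMeans(preds)                # Step 3: average the M estimators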
We demonstrate bagging a CART (Classification and Regression Trees) model for spam email classification. The complete R code can be found here.
For this example we used the spambase dataset from the UCI Machine Learning Repository. The dataset contains 4,601 emails, each described by a vector of 57 attributes together with an indicator of whether or not the message is spam. We randomly selected 300 emails from the dataset and split them into training and testing sets with a 70:30 ratio. We then implemented the bagging algorithm, using the R package {rpart} to fit CART decision trees.
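The data preparation code is not shown in the original post; a minimal setup sketch along the following lines produces the df_train and df_test objects used below (the file name, the column naming, and the seed are our assumptions):

# setup sketch: load spambase, label the response, and split 300 emails 70:30
# =============================================================================
library(rpart)

spam.data <- read.csv("spambase.data", header = FALSE) # assumed file name
names(spam.data)[ncol(spam.data)] <- "spam"            # last column is the 0/1 spam label
spam.data$spam <- factor(spam.data$spam)               # factor so rpart fits a classification tree

set.seed(11)
subset.idx <- sample(1:nrow(spam.data), 300)           # 300 randomly selected emails
train.idx <- sample(subset.idx, round(0.7 * 300))      # 70:30 train/test split
df_train <- spam.data[train.idx, ]
df_test <- spam.data[setdiff(subset.idx, train.idx), ]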
First, we wrote a function resample.spam.train() to produce a bootstrap sample from our training dataset by sampling with replacement:
# bootstrap resampling of the training data
# =============================================================================
resample.spam.train <- function()
{
  # sample row indices with replacement, keeping the original sample size
  indices <- sample(1:nrow(df_train), nrow(df_train), replace = TRUE)
  df <- df_train[indices, ]
  return(df)
}
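A quick sanity check of the resampler (our addition, with an arbitrary seed): a bootstrap sample keeps the size of the training set but typically repeats some rows.

set.seed(1)
boot <- resample.spam.train()
nrow(boot) == nrow(df_train) # TRUE: same sample size as the training data
any(duplicated(boot))        # usually TRUE: some rows are drawn more than once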
Then, we wrote a function that fits a decision tree to each of \(B\) bootstrapped samples and collects the predictions on our testing data:
# bagging
# =============================================================================
bag.trees <- function(B) # B is the number of bootstrap samples
{
  # one column of test-set predictions per bootstrapped tree
  pred.mat <- matrix(NA, nrow = nrow(df_test), ncol = B)
  for(i in 1:B)
  {
    spam.sample <- resample.spam.train() # gets a bootstrap sample
    tree <- rpart(spam ~ .,
                  data = spam.sample) # fits a tree
    # prunes the tree at the complexity with the smallest cross-validated error
    pruned.tree <- prune(tree,
                         cp = tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"])
    # predictions: the predicted class factor is stored as its integer code (1/2), ...
    pred.mat[, i] <- predict(pruned.tree,
                             newdata = df_test[, -ncol(df_test)],
                             type = "class")
    pred.mat[, i] <- pred.mat[, i] - 1 # ... so subtract 1 to convert to 0/1
  }
  return(pred.mat)
}
Finally, we created a bagged estimator by aggregating 50 CART trees.
# run the bagging function with 50 bootstrap samples
# =============================================================================
set.seed(11)
bag <- bag.trees(50)
# aggregation
# =============================================================================
# majority vote: classify as spam (1) when at least half of the trees predict spam
bagged.preds <- (rowMeans(bag) >= 0.5) + 0
# final classification accuracy
round(mean(bagged.preds == df_test$spam),2) #88%
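For reference, the single-tree baseline discussed below can be reproduced with a sketch like the following (our addition, reusing the same train/test split and pruning rule):

# single-tree baseline (sketch): one pruned CART tree on the full training set
single.tree <- rpart(spam ~ ., data = df_train)
single.pruned <- prune(single.tree,
                       cp = single.tree$cptable[which.min(single.tree$cptable[, "xerror"]), "CP"])
single.preds <- predict(single.pruned, newdata = df_test[, -ncol(df_test)], type = "class")
round(mean(single.preds == df_test$spam), 2) # single-tree accuracy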
Bagging CART trees improved prediction accuracy by 10 percentage points on average: the bagged estimator achieved 88% accuracy, while a single CART tree achieved, on average, 78%. The variance in prediction accuracy also dropped dramatically: the accuracy of a single tree ranged from 64% to 89%, with a variance of 24.89, while the accuracy of bagged trees ranged from 81% to 90%, with a variance of 4.49.
Bühlmann, P. (2004). Bagging, Boosting and Ensemble Methods. Handbook of Computational Statistics, 47, 985-1022.
Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
Therneau, T., Atkinson, B., and Ripley, B. (2017). rpart: Recursive Partitioning and Regression Trees. R package version 4.1-11. https://CRAN.R-project.org/package=rpart