Bootstrap Aggregating, or Bagging, is an ensemble method that improves the accuracy of unstable models by averaging copies of the same model fit to bootstrapped samples of the data. Consider a regression or classification setting: our data consist of \(n\) observations \((X_1, Y_1), ..., (X_n, Y_n)\), where each \(X_i\) is a vector of \(p\) predictors and \(Y = (Y_1, ..., Y_n)\) is the response vector. We use some base procedure \(\hat{g}(.)\) to model the relationship between \(X\) and \(Y\). A bagged model, \(\hat{g}_{bag}(.)\), is an average of several \(\hat{g}(.)\), each fit to a bootstrapped sample of the data. Bagging acts as a smoothing operator for hard decision rules (consider the single cut-off used at a split in a decision tree); smoothing these decisions reduces the variance of the model, ultimately improving its prediction error. Heuristically, the variance of the bagged estimator \(\hat{g}_{bag}(.)\) should be equal to or smaller than the variance of the original estimator \(\hat{g}(.)\), and the reduction is larger when the original estimator is unstable. The bagging algorithm is as follows:

Bagging Algorithm

Step 1: Construct a bootstrap sample \((X^{*}, Y^{*})\) by drawing \(n\) times with replacement from the data \((X_1, Y_1), ... ,(X_n, Y_n)\).

Step 2: Compute the bootstrapped estimator \(\hat{g}^{*}(.)\) by fitting the model to the bootstrapped data.

Step 3: Repeat Steps 1 and 2 \(M\) times, yielding \(\hat{g}^{*k}(.), (k = 1, ..., M)\). The bagged estimator is \(\hat{g}_{bag}(.) = \frac{1}{M} \sum_{k = 1}^{M} \hat{g}^{*k}(.)\).
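To make the variance-reduction heuristic concrete, consider an idealized calculation (a standard argument, not part of the algorithm itself): suppose each bootstrapped estimator has variance \(\sigma^2\) at a fixed point \(x\), and any two of them have correlation \(\rho\). The variance of their average is then

\[ \mathrm{Var}\left(\hat{g}_{bag}(x)\right) = \mathrm{Var}\left(\frac{1}{M} \sum_{k=1}^{M} \hat{g}^{*k}(x)\right) = \rho \sigma^{2} + \frac{1 - \rho}{M} \sigma^{2} \leq \sigma^{2}. \]

As \(M\) grows the second term vanishes, so the achievable reduction is governed by how correlated the bootstrapped fits are; unstable base procedures such as unpruned decision trees produce less correlated fits and therefore gain more from bagging. This is only a heuristic: the bootstrapped estimators are all fit to overlapping resamples of the same data, so \(\rho\) is typically well above zero.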


Figure 1: Bagging Procedure

This diagram visualizes the bagging procedure. A user generates \(M\) bootstrapped samples from the training data \((X, Y)\) and computes \(M\) estimators, shown here as decision trees. The bagged estimator is the average of the \(M\) estimators.

Example: Spam classification with CART

We demonstrate bagging a CART (Classification and Regression Trees) model for spam email classification. The complete R code can be found here.

For this example we used the spambase dataset from the UCI Machine Learning Repository. The dataset contains 4,601 emails, each described by a vector of 57 attributes and an indicator of whether or not the message was spam. We randomly selected 300 emails from the dataset and split them into training and testing sets in a 70:30 ratio (a sketch of this setup appears after the resampling function below). We then implemented the bagging algorithm, using the R package {rpart} to fit CART decision trees. First, we wrote a function resample.spam.train() that produces a bootstrap sample from our training dataset by sampling with replacement:

resample.spam.train <- function()
{
  # sample row indices with replacement to form a bootstrap sample of df_train
  indices <- sample(1:nrow(df_train), nrow(df_train), replace = TRUE)
  df <- df_train[indices, ]
  return(df)
}
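The functions in this example assume that the objects df_train and df_test already exist in the workspace. A minimal sketch of the data preparation described above, assuming the raw file is named spambase.data, that it is read with read.csv, and that the last column is the spam label (these details are assumptions for illustration, not the original code), might look like this:

# data preparation (sketch; file name and column handling are assumptions)
# =============================================================================
library(rpart)

spam.data <- read.csv("spambase.data", header = FALSE)   # 4,601 emails x 58 columns
names(spam.data)[ncol(spam.data)] <- "spam"              # last column is the 0/1 spam indicator
spam.data$spam <- factor(spam.data$spam)                 # a factor so rpart grows a classification tree

set.seed(1)                                              # arbitrary seed for reproducibility
subset.idx  <- sample(1:nrow(spam.data), 300)            # keep 300 randomly selected emails
spam.subset <- spam.data[subset.idx, ]

train.idx <- sample(1:nrow(spam.subset), round(0.7 * nrow(spam.subset)))  # 70:30 split
df_train  <- spam.subset[train.idx, ]
df_test   <- spam.subset[-train.idx, ]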

Then, we wrote a function that fits a pruned decision tree to each of \(B\) bootstrapped samples and predicts the responses of our testing data:

# bagging 
# =============================================================================

bag.trees <- function(B) # B is the number of bootstrap samples
{
  # one column of test-set predictions per bootstrapped tree
  pred.mat <- matrix(NA, nrow = nrow(df_test), ncol = B)
  
  for(i in 1:B)
  {
    spam.sample <- resample.spam.train() # draw a bootstrap sample
    tree <- rpart(spam ~ .,
                  data = spam.sample)    # fit a tree to the bootstrap sample
    
    # prune back to the complexity parameter with the smallest cross-validated error
    pruned.tree <- prune(tree,
                         cp = tree$cptable[which.min(tree$cptable[, "xerror"]),
                                           "CP"])
    
    # predicted classes for the test set; the factor is coerced to its
    # integer level codes (1/2) when stored in the numeric matrix
    pred.mat[, i] <- predict(pruned.tree,
                             newdata = df_test[, -ncol(df_test)],
                             type = "class")
    
    pred.mat[, i] <- pred.mat[, i] - 1 # convert level codes to 0/1
  }
  return(pred.mat)
}

Finally, we created a bagged estimator by aggregating 50 CART trees.

# run the bagging function with 50 bootstrap samples
# =============================================================================
set.seed(11)
bag <- bag.trees(50)

# aggregation
# =============================================================================

# the 0/1 output is the one that the majority of the trees select
bagged.preds <- (rowMeans(bag) >= 0.5)+0

# final classification accuracy
round(mean(bagged.preds == df_test$spam), 2) # 0.88, i.e. 88% accuracy

Bagging CART trees improved prediction accuracy by about 10 percentage points on average: the bagged prediction accuracy was 88%, while the prediction accuracy of a single CART tree was, on average, 78%. The variability in prediction accuracy also improved dramatically. Across repeated runs, the accuracy of a single tree ranged from 64% to 89%, with a variance of 24.89 (in squared percentage points), while the accuracy of the bagged trees ranged from 81% to 90%, with a variance of 4.49.
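As a rough illustration of how the single-tree baseline can be computed, the sketch below fits and prunes one CART tree on the full training set and measures its test accuracy. It is not the original code; it assumes the df_train and df_test objects, the loaded {rpart} package, and the pruning rule used above:

# single pruned CART tree for comparison (sketch, not the original code)
# =============================================================================
single.tree.accuracy <- function()
{
  tree <- rpart(spam ~ ., data = df_train)                 # fit one tree on the training set
  pruned <- prune(tree,
                  cp = tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"])
  preds <- predict(pruned,
                   newdata = df_test[, -ncol(df_test)],
                   type = "class")
  mean(preds == df_test$spam)                              # test-set accuracy
}

# accuracy varies from run to run: rpart's internal cross-validation (used to
# pick the pruning cutoff) is random, and redrawing the 300-email subset or the
# train/test split would add further variation
single.accs <- replicate(20, single.tree.accuracy())
round(mean(single.accs), 2)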

References

Bühlmann, P. (2004). Bagging, Boosting and Ensemble Methods. Handbook of Computational Statistics, 47, 985-1022.

Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Therneau, T., Atkinson, B., and Ripley, B. (2017). rpart: Recursive Partitioning and Regression Trees. R package version 4.1-11. https://CRAN.R-project.org/package=rpart