Bagging:

  1. resample cases and recalculate predictions.
  2. average or majority vote.

average for regression problem, majority vote (most commonly occuring class) for classification problem.

WHY BAGGING:

  1. fitting one tree only to the data set will have high variance and the problem of over-fitting.
  2. each bagged model has low Bias but high variance.
  3. averaged or majority voted final model has reduced variance and low bias.
  4. reduced variance helps to improve prediction accuracy of the model.
  5. more useful for non-linear fucntions.
  6. loses interpretability.

Basic Idea:

            The basic idea is that when you fit complicated models, sometimes if you average those models together, you get a smoother model fit, that gives you a better balance between potential bias in your fit and variance in your fit. 
            

So bootstrap aggregating has a very simple idea. The basic idea is take your data and take resamples of the data set. So, this is the similar to the idea of bootstrapping.

After you resample the cases with replacement, then you recalculate your prediction function on that resampled data.

And then you either average the predictions from all these repeated predictors that you built or you majority vote or something like that when you’re doing classification.

The thing is that you get a similar bias that you would get from fitting any one of those models individually, but a reduced variability because you’ve averaged a bunch of different predictors together.

This is most useful for non-linear functions. So, we’ll show an example with smoothing, but it’s also very useful for things like predicting with trees.

SPECIAL REMARKS:

cross validation is used to get a more accurate estimate of model performance on unseen test data, and bagging is used to get a more accurate fit of the training data (reducing overfitting by redcuing the variance, same bias).

POTENIAL ISSUE WITH BAGGING:

if you have a strong dominant predictor in dataset, all bagged trees will have this predictors as a first split and bcoz of that all trees will have similar splits. now when we average these highly correlated trees the reduction in variance will not be big as compared to reduction due to averaging of uncorrelated trees.

This problem of “Bagging” method is overcome in “Random Forests”.

suppressMessages(library(caret))
suppressMessages(library(ElemStatLearn))

data(ozone, package = "ElemStatLearn")
head(ozone)
##   ozone radiation temperature wind
## 1    41       190          67  7.4
## 2    36       118          72  8.0
## 3    12       149          74 12.6
## 4    18       313          62 11.5
## 5    23       299          65  8.6
## 6    19        99          59 13.8
ozone = ozone[order(ozone$ozone),]
index = createDataPartition(y=ozone$temperature, p=0.7, list=FALSE)

train = ozone[index,]
test = ozone[-index,]

dim(train)
## [1] 80  4
dim(test)
## [1] 31  4
train.predictors = data.frame(ozone = train$ozone)
train.temperature = train$temperature

test.predictors = data.frame(ozone = test$ozone)
test.temperature = test$temperature
# fit the model
treebag = bag(train.predictors, train.temperature, B=100,
              bagControl = bagControl(fit = ctreeBag$fit,
                                      predict = ctreeBag$pred,
                                      aggregate = ctreeBag$aggregate),
              trControl = trainControl(method = "oob"))

treebag
## 
## Call:
## bag.default(x = train.predictors, y = train.temperature, B =
##  ctreeBag$pred, aggregate = ctreeBag$aggregate), trControl
##  = trainControl(method = "oob"))
## 
## 
## B: 100 
## Training data: 1 variables and 80 samples
## All variables were used in each model
summary(treebag)
## 
## Call:
## bag.default(x = train.predictors, y = train.temperature, B =
##  ctreeBag$pred, aggregate = ctreeBag$aggregate), trControl
##  = trainControl(method = "oob"))
## 
## Out of bag statistics (B = 100):
## 
##         RMSE Rsquared
##   0.0% 5.305   0.1715
##   2.5% 5.640   0.2931
##  25.0% 6.536   0.4517
##  50.0% 7.130   0.5448
##  75.0% 7.607   0.5866
##  97.5% 8.557   0.6752
## 100.0% 9.275   0.7357

in above model fitting we have used OOB approach for estimating test-error instead of CV. when you have large dataset, OOB (out-of-bag) is better due to CV’s computationally expensive.

temp.pred = predict(treebag, test.predictors)
MSE = mean(test.temperature - temp.pred)^2
model.SE = sqrt(MSE)
model.SE
## [1] 0.7783611
plot(test.predictors$ozone, test.temperature, col='lightgrey', pch=19) # original observed data
points(test.predictors$ozone, predict(treebag$fits[[1]]$fit, test.predictors), pch=19, col="red") # predicted temperature from one model
points(test.predictors$ozone, predict(treebag, test.predictors), pch=19, col="blue") # predicted temperature from 10 model's average

red dots are from a single model.

blue dots are from average of 100 bagged models. less variability as seen by smoother trend.

This example was for regression tree. same can be done for classification tree (it uses majority vote instead of RMSE). 1. very useful for nonlinear models. 2. often used with trees - an extension is random forests.

Part of Bagging

these are the function we have used in fitting the bagging model.

ctreeBag$fit
## function (x, y, ...) 
## {
##     loadNamespace("party")
##     data <- as.data.frame(x)
##     data$y <- y
##     party::ctree(y ~ ., data = data)
## }
## <environment: namespace:caret>
ctreeBag$pred
## function (object, x) 
## {
##     if (!is.data.frame(x)) 
##         x <- as.data.frame(x)
##     obsLevels <- levels(object@data@get("response")[, 1])
##     if (!is.null(obsLevels)) {
##         rawProbs <- party::treeresponse(object, x)
##         probMatrix <- matrix(unlist(rawProbs), ncol = length(obsLevels), 
##             byrow = TRUE)
##         out <- data.frame(probMatrix)
##         colnames(out) <- obsLevels
##         rownames(out) <- NULL
##     }
##     else out <- unlist(party::treeresponse(object, x))
##     out
## }
## <environment: namespace:caret>
ctreeBag$aggregate
## function (x, type = "class") 
## {
##     if (is.matrix(x[[1]]) | is.data.frame(x[[1]])) {
##         pooled <- x[[1]] & NA
##         classes <- colnames(pooled)
##         for (i in 1:ncol(pooled)) {
##             tmp <- lapply(x, function(y, col) y[, col], col = i)
##             tmp <- do.call("rbind", tmp)
##             pooled[, i] <- apply(tmp, 2, median)
##         }
##         if (type == "class") {
##             out <- factor(classes[apply(pooled, 1, which.max)], 
##                 levels = classes)
##         }
##         else out <- as.data.frame(pooled)
##     }
##     else {
##         x <- matrix(unlist(x), ncol = length(x))
##         out <- apply(x, 1, median)
##     }
##     out
## }
## <environment: namespace:caret>