An ensemble method is an approach that combines many simple models (called weak learners) in order to obtain a single, potentially much more powerful model.

Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method, and it is particularly useful for decision trees. The idea is to generate B training sets through bootstrap resampling and build the model on each one. For regression problems, the prediction is the average of the predictions over all B trees; for classification problems, the predicted class is the most commonly occurring class (the mode) across the B trees. Note that making B very large does not lead to overfitting; in practice, B is simply chosen large enough for the error to settle down.
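
As a rough sketch, bagging can be coded by hand: fit a tree on each bootstrap resample and average the predictions. The version below uses the {rpart} package for the individual trees (an illustration only; the randomForest() examples later handle this for us).

# Manual bagging sketch: B regression trees on bootstrap resamples of Credit
library(ISLR2)   # Credit data
library(rpart)   # single regression trees
data("Credit")

set.seed(1)
B <- 100
boot.preds <- sapply(1:B, function(b) {
  idx  <- sample(nrow(Credit), replace = TRUE)       # bootstrap resample
  tree <- rpart(Balance ~ ., data = Credit[idx, ])   # grow one tree
  predict(tree, newdata = Credit)                    # predict on all rows
})
bagged.pred <- rowMeans(boot.preds)   # average over the B trees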

Although bagging appears to be a good solution for reducing variance, one major concern is that it tends to produce highly correlated trees, since each bagged tree is built from nearly the same information.

Random Forest

Random forest provides an improvement over bagged trees by decorrelating the trees: each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors. A common default is m roughly equal to the square root of p (the randomForest package uses p/3 for regression). The rationale behind this approach is that the trees become less correlated with one another, so that a single very strong predictor does not dominate the top split of every tree.
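
As a quick sketch, the out-of-bag (OOB) error reported by randomForest() can be used to compare bagging (m = p) with a random forest (m close to the square root of p) on the Credit data used in the examples below.

# OOB comparison sketch: bagging vs. random forest on Credit
library(ISLR2)
library(randomForest)
data("Credit")

set.seed(1)
p <- ncol(Credit) - 1                                                  # 10 predictors
bag.oob <- randomForest(Balance ~ ., data=Credit, mtry = p)            # bagging
rf.oob  <- randomForest(Balance ~ ., data=Credit, mtry = floor(sqrt(p)))  # random forest
c(bagging = tail(bag.oob$mse, 1), random.forest = tail(rf.oob$mse, 1))    # OOB MSE after 500 trees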

Random Forest: Regression

# Random forest to build predictive model on Credit data
# Balance is the outcome variable
library(ISLR2)
data("Credit")
head(Credit)
##    Income Limit Rating Cards Age Education Own Student Married Region Balance
## 1  14.891  3606    283     2  34        11  No      No     Yes  South     333
## 2 106.025  6645    483     3  82        15 Yes     Yes     Yes   West     903
## 3 104.593  7075    514     4  71        11  No      No      No   West     580
## 4 148.924  9504    681     3  36        11 Yes      No      No   West     964
## 5  55.882  4897    357     2  68        16  No      No     Yes  South     331
## 6  80.180  8047    569     4  77        10  No      No      No  South    1151
# Divide Data to Train and Test Set
set.seed(27)
train.index <- sample(c(1:400), 340, replace=FALSE) # 340 is 85% of 400
train <- Credit[train.index,]
test <- Credit[-train.index,]

# Train
library(randomForest)
set.seed(24)
fit.rf <- randomForest(Balance ~ ., data=train,
                       mtry = 3, importance=TRUE)    # mtry is m < p; here p = 10, so m = 3 is about sqrt(p)
# if m = p, this becomes bagging
fit.rf
## 
## Call:
##  randomForest(formula = Balance ~ ., data = train, mtry = 3, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 19596.53
##                     % Var explained: 90.76
# Test
predicted <- predict(fit.rf, newdata=test)
actual <- test$Balance
plot(predicted, actual)
abline(0,1)

# Test MSE
MSE <- mean((predicted - actual)^2)
MSE
## [1] 25847.34
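
Since the MSE is on a squared scale, the root of the test MSE may be easier to interpret; it is on the same scale as Balance.

sqrt(MSE)   # about 160.8, in the units of Balance
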
# Variable Importance
# Higher values of %IncMSE and IncNodePurity indicate more important variables
importance(fit.rf)
##              %IncMSE IncNodePurity
## Income    31.7779326     8261768.9
## Limit     32.8332463    26608360.1
## Rating    32.8564551    27374732.0
## Cards      1.2411222     1112746.1
## Age        1.5520404     1806808.0
## Education -0.1334746     1253469.7
## Own       -1.1927211      228264.9
## Student   32.0971468     3223706.6
## Married    0.3937348      342933.6
## Region     1.0260184      627750.1
varImpPlot(fit.rf)

Random Forest: Classification

# Random forest classification on type of cell tumor (Khan gene expression data)
# There are 4 types of cell tumor (coded 1-4) that we want to predict here
data("Khan")  # Warning: Do not print this data, it's very big

train <- data.frame(Y=as.factor(Khan$ytrain), Khan$xtrain)
test <- data.frame(Y=as.factor(Khan$ytest), Khan$xtest)

head(train[,1:5]) # Show the first 6 rows and 5 columns
##    Y          X1        X2         X3         X4
## V1 2  0.77334370 -2.438405 -0.4825622 -2.7211350
## V2 2 -0.07817778 -2.415754  0.4127717 -2.8251460
## V3 2 -0.08446916 -1.649739 -0.2413075 -2.8752860
## V4 2  0.96561400 -2.380547  0.6252965 -1.7412560
## V5 2  0.07566390 -1.728785  0.8526265  0.2726953
## V6 2  0.45881630 -2.875286  0.1358412  0.4053984
head(test[,1:5])  # Show the first 6 rows and 5 columns
##    Y         X1         X2         X3        X4
## V1 3  0.1395010 -1.1689275  0.5649728 -3.366796
## V2 2  1.1642752 -2.0181583  1.1035335 -2.165435
## V4 4  0.8410929  0.2547197 -0.2087477 -2.148149
## V6 2  0.6850646 -1.9275792 -0.2330676 -1.640413
## V7 1 -1.9561625 -2.2349264  0.2815634 -2.695628
## V8 3 -0.2586412 -1.6847004  0.1758003 -2.323809
# Train
set.seed(123)
fit.rf <- randomForest(Y ~ . , data=train, importance=TRUE)  # default mtry = floor(sqrt(p)) for classification

# Test
predicted <- predict(fit.rf, newdata=test)
actual <- test$Y
table(predicted, actual)
##          actual
## predicted 1 2 3 4
##         1 3 0 0 0
##         2 0 6 1 0
##         3 0 0 4 0
##         4 0 0 1 5
# Variable Importance
varImpPlot(fit.rf)
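
From the confusion matrix above, the overall test error rate can also be computed directly.

# Overall misclassification rate on the test set
mean(predicted != actual)   # 2 of the 20 test observations are misclassified (0.10)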

Boosting

Boosting is another approach for improving the predictions resulting from a decision tree. Like bagging, boosting is a general approach that can be applied to many statistical learning methods for regression and classification. However, boosting for classification is somewhat more involved, so we restrict the example here to regression.

Boosting works similarly to bagging, except that the trees are grown sequentially: each tree is grown using information from previously grown trees. Boosting does not involve bootstrap sampling; instead, each tree is fit on a modified version of the original data set.

The boosting approach learns slowly. Given the current model, we fit a decision tree to its residuals; that is, we fit a tree using the current residuals, rather than the outcome \(Y\), as the response. We then update the residuals and repeat, until a specified number of trees, denoted by B, has been constructed. The shrinkage parameter \(\lambda\), a small positive number, controls the rate at which boosting learns; typical values are 0.01 or 0.001, depending on the problem.
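
As a rough from-scratch sketch of this idea (using the {rpart} package for the small trees; the gbm() function below is what we actually use), each tree is fit to the current residuals and the fit is updated by a fraction \(\lambda\) of its predictions.

# Manual boosting sketch on the Credit data
library(ISLR2)
library(rpart)
data("Credit")

B <- 100; lambda <- 0.01
X <- Credit[, setdiff(names(Credit), "Balance")]   # predictors only
r <- Credit$Balance                                # residuals start as the outcome itself
f.hat <- rep(0, nrow(Credit))                      # current boosted fit
for (b in 1:B) {
  dat  <- cbind(X, r = r)
  tree <- rpart(r ~ ., data = dat,
                control = rpart.control(maxdepth = 2, cp = 0))   # small tree on residuals
  f.b   <- predict(tree, newdata = dat)
  f.hat <- f.hat + lambda * f.b                    # update the fitted values
  r     <- r - lambda * f.b                        # update the residuals
}
mean((Credit$Balance - f.hat)^2)                   # training MSE of the boosted fit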

The boosting method is applied below to predict credit card balance (Balance) using the Credit data.

##     BOOSTING     ##

# Learns slowly
# Sequential (n.trees = ?)
# Decision Tree 1: Residual 1
# Decision Tree 2: (Residual 1 as outcome) -> Residual 2
# Decision Tree 3: (Residual 2 as outcome) -> Residual 3
# ...

library(ISLR2)
data("Credit")

# Divide Data to Train and Test Set
set.seed(123)
train.index <- sample(c(1:400), 340, replace=FALSE)
train <- Credit[train.index,]
test <- Credit[-train.index,]

# Use the gbm() function from the {gbm} package
library(gbm)
# n.trees is B, interaction.depth limits the depth of each tree, shrinkage is lambda
fit.boost <- gbm(Balance ~ ., distribution = "gaussian", data=train,
                 n.trees = 500, interaction.depth = 2, shrinkage=0.1)
summary(fit.boost)

##                 var     rel.inf
## Limit         Limit 53.55721538
## Rating       Rating 25.40811618
## Income       Income 10.73434565
## Student     Student  6.82529585
## Age             Age  1.96065124
## Cards         Cards  0.73462672
## Education Education  0.46467508
## Region       Region  0.14948169
## Married     Married  0.12367906
## Own             Own  0.04191316
# Model assessment using validation set
predicted <- predict(fit.boost, newdata=test, n.trees = 500)   # predict using all 500 boosted trees
actual <- test$Balance
plot(predicted,actual)
abline(0,1)

MSE <- mean((predicted - actual)^2) 
MSE
## [1] 3274.015
# How did boosting perform compared to random forest? On these (different) train/test splits,
# the boosted model's test MSE of 3274 is far lower than the random forest's 25847.
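
A natural follow-up (a sketch using the same {gbm} package) is to choose the number of trees by cross-validation instead of fixing it at 500.

# Choose n.trees by 5-fold cross-validation
set.seed(123)
fit.cv <- gbm(Balance ~ ., distribution = "gaussian", data=train,
              n.trees = 500, interaction.depth = 2, shrinkage=0.1,
              cv.folds = 5)
best.B <- gbm.perf(fit.cv, method = "cv")          # CV-chosen number of trees
predicted.cv <- predict(fit.cv, newdata=test, n.trees = best.B)
mean((predicted.cv - test$Balance)^2)              # test MSE at the CV-chosen number of trees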