An ensemble method combines many simple models (called weak learners) to obtain a single, potentially much more powerful model.
Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method, and it is particularly useful for decision trees. The idea is to generate B training sets by bootstrap resampling and fit the model once to each resample. For regression problems, the prediction is the average of the predictions from the B trees; for classification problems, the predicted class is the majority vote (mode) across the B trees. Note that B is not a critical tuning parameter here: using a very large number of bootstrap resamples does not cause overfitting, it only increases the computational cost.
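To make the procedure concrete, here is a minimal sketch of bagging by hand for a regression problem. It assumes the {rpart} package for fitting the individual trees, and the bag_predict() helper is purely illustrative, not part of any package:
library(rpart)  # assumed available for fitting the individual regression trees
bag_predict <- function(formula, data, newdata, B = 100) {
  preds <- sapply(1:B, function(b) {
    boot.index <- sample(nrow(data), replace = TRUE)   # bootstrap resample
    tree <- rpart(formula, data = data[boot.index, ])  # grow one tree on the resample
    predict(tree, newdata = newdata)                   # predict on the new data
  })
  rowMeans(preds)  # regression: average over the B trees (majority vote for classification)
}
# Example call (illustrative only): bag_predict(mpg ~ ., mtcars, mtcars, B = 200)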
Although bagging is a good way to reduce variance, one major concern is that it tends to produce highly correlated trees: every bagged tree is built from nearly the same information, so a very strong predictor tends to dominate the top splits of most of the trees.
Random forests improve on bagged trees by decorrelating the trees. Each time a split is considered, a random sample of m predictors is drawn from the full set of p predictors, and only those m predictors are eligible for the split. By default, the {randomForest} package uses roughly the square root of p for classification and p/3 for regression as the value of m. Because a single very strong predictor can no longer dominate the top split of every tree, the bootstrapped trees become less correlated with each other.
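As a quick illustration of how m relates to p (the defaults below are those of the {randomForest} package used in this example; the Credit data below have p = 10 predictors):
p <- 10
floor(sqrt(p))        # classification default: m = sqrt(p) -> 3
max(floor(p / 3), 1)  # regression default: m = p/3 -> 3
# Setting mtry = p considers every predictor at every split, which is ordinary bagging.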
# Random forest to build predictive model on Credit data
# Balance is the outcome variable
library(ISLR2)
data("Credit")
head(Credit)
## Income Limit Rating Cards Age Education Own Student Married Region Balance
## 1 14.891 3606 283 2 34 11 No No Yes South 333
## 2 106.025 6645 483 3 82 15 Yes Yes Yes West 903
## 3 104.593 7075 514 4 71 11 No No No West 580
## 4 148.924 9504 681 3 36 11 Yes No No West 964
## 5 55.882 4897 357 2 68 16 No No Yes South 331
## 6 80.180 8047 569 4 77 10 No No No South 1151
# Divide Data to Train and Test Set
set.seed(27)
train.index <- sample(c(1:400), 340, replace=FALSE) # 340 is 85% of 400
train <- Credit[train.index,]
test <- Credit[-train.index,]
# Train
library(randomForest)
set.seed(24)
fit.rf <- randomForest(Balance ~ ., data=train,
                       mtry = 3, importance=TRUE) # mtry is m < p
# if m=p this becomes bagging
fit.rf
##
## Call:
## randomForest(formula = Balance ~ ., data = train, mtry = 3, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 19596.53
## % Var explained: 90.76
# Test
predicted <- predict(fit.rf, newdata=test)
actual <- test$Balance
plot(predicted, actual)
abline(0,1)
# Test MSE
MSE <- mean((predicted - actual)^2)
MSE
## [1] 25847.34
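Since the MSE is on a squared scale, its square root is easier to interpret; here it corresponds to a typical prediction error of roughly 161 dollars of Balance:
sqrt(MSE)  # root mean squared error, on the original scale of Balance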
# Variable Importance
# Higher values mean a more important variable:
# %IncMSE is the increase in out-of-bag MSE when the variable is permuted;
# IncNodePurity is the total decrease in node impurity (RSS) from splits on it
importance(fit.rf)
## %IncMSE IncNodePurity
## Income 31.7779326 8261768.9
## Limit 32.8332463 26608360.1
## Rating 32.8564551 27374732.0
## Cards 1.2411222 1112746.1
## Age 1.5520404 1806808.0
## Education -0.1334746 1253469.7
## Own -1.1927211 228264.9
## Student 32.0971468 3223706.6
## Married 0.3937348 342933.6
## Region 1.0260184 627750.1
varImpPlot(fit.rf)
# Random forest classification on the Khan gene expression data
# The outcome is one of 4 types of cell tumor that we want to predict
data("Khan") # Warning: Do not print this data, it's very big
train <- data.frame(Y=as.factor(Khan$ytrain), Khan$xtrain)
test <- data.frame(Y=as.factor(Khan$ytest), Khan$xtest)
head(train[,1:5]) # Show the first 6 rows and 5 columns
## Y X1 X2 X3 X4
## V1 2 0.77334370 -2.438405 -0.4825622 -2.7211350
## V2 2 -0.07817778 -2.415754 0.4127717 -2.8251460
## V3 2 -0.08446916 -1.649739 -0.2413075 -2.8752860
## V4 2 0.96561400 -2.380547 0.6252965 -1.7412560
## V5 2 0.07566390 -1.728785 0.8526265 0.2726953
## V6 2 0.45881630 -2.875286 0.1358412 0.4053984
head(test[,1:5]) # Show the first 6 rows and 5 columns
## Y X1 X2 X3 X4
## V1 3 0.1395010 -1.1689275 0.5649728 -3.366796
## V2 2 1.1642752 -2.0181583 1.1035335 -2.165435
## V4 4 0.8410929 0.2547197 -0.2087477 -2.148149
## V6 2 0.6850646 -1.9275792 -0.2330676 -1.640413
## V7 1 -1.9561625 -2.2349264 0.2815634 -2.695628
## V8 3 -0.2586412 -1.6847004 0.1758003 -2.323809
# Train
set.seed(123)
fit.rf <- randomForest(Y ~ . , data=train, importance=TRUE)
# Test
predicted <- predict(fit.rf, newdata=test)
actual <- test$Y
table(predicted, actual)
## actual
## predicted 1 2 3 4
## 1 3 0 0 0
## 2 0 6 1 0
## 3 0 0 4 0
## 4 0 0 1 5
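The confusion matrix above shows that 2 of the 20 test observations are misclassified. The overall test error rate can be computed directly:
mean(predicted != actual)  # test misclassification rate (2/20 = 0.10 here)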
# Variable Importance
varImpPlot(fit.rf)
Boosting is another approach for improving the predictions resulting from a decision tree. Like bagging, boosting is a general approach that can be applied to many statistical learning methods for regression and classification. Boosting for classification is somewhat more involved, however, so we restrict the example here to regression.
Boosting works in a similar way to bagging, except that the trees are grown sequentially: each tree is grown using information from the previously grown trees. Boosting does not involve bootstrap sampling; instead, each tree is fit to a modified version of the original data set.
The boosting approach learns slowly. Given the current model, we fit a decision tree to the residuals from that model; that is, we use the current residuals, rather than the outcome \(Y\), as the response. We then add a shrunken version of this new tree to the fitted model, update the residuals, and repeat until a specified number of trees, denoted by B, has been constructed. The shrinkage parameter \(\lambda\), a small positive number, controls the rate at which boosting learns; typical values are 0.01 or 0.001, depending on the problem.
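To see these mechanics explicitly, here is a minimal sketch of least-squares boosting by hand with shallow trees. It assumes the {rpart} package, and boost_sketch() is a hypothetical helper written only for illustration:
library(rpart)
boost_sketch <- function(x, y, B = 100, lambda = 0.01, depth = 1) {
  fhat  <- rep(0, length(y))  # current fitted values, start at 0
  resid <- y                  # residuals start as the outcome itself
  for (b in 1:B) {
    d    <- data.frame(x, r = resid)
    tree <- rpart(r ~ ., data = d, control = rpart.control(maxdepth = depth))
    fb   <- predict(tree)           # shallow tree fit to the current residuals
    fhat  <- fhat + lambda * fb     # add a shrunken version of the tree to the model
    resid <- resid - lambda * fb    # update the residuals
  }
  fhat  # boosted fitted values after B trees
}
# Example call (illustrative only): boost_sketch(mtcars[, -1], mtcars$mpg, B = 200)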
Below, boosting is applied to predict credit card balance (Balance) using the Credit data.
## BOOSTING ##
# Learns slowly
# Sequential: each tree is fit to the residuals left by the previous trees (n.trees = B)
# Decision Tree 1: (Y as outcome) -> Residual 1
# Decision Tree 2: (Residual 1 as outcome) -> Residual 2
# Decision Tree 3: (Residual 2 as outcome) -> Residual 3
# ...
library(ISLR2)
data("Credit")
# Divide Data to Train and Test Set
set.seed(123)
train.index <- sample(c(1:400), 340, replace=FALSE)
train <- Credit[train.index,]
test <- Credit[-train.index,]
# Use the gbm() function from the {gbm} package
library(gbm)
fit.boost <- gbm(Balance ~ ., distribution = "gaussian", data=train,
                 n.trees = 500, interaction.depth = 2, shrinkage=0.1)
summary(fit.boost)
## var rel.inf
## Limit Limit 53.55721538
## Rating Rating 25.40811618
## Income Income 10.73434565
## Student Student 6.82529585
## Age Age 1.96065124
## Cards Cards 0.73462672
## Education Education 0.46467508
## Region Region 0.14948169
## Married Married 0.12367906
## Own Own 0.04191316
# Model assessment using validation set
predicted <- predict(fit.boost, newdata=test, n.trees=500) # use all 500 boosted trees
actual <- test$Balance
plot(predicted,actual)
abline(0,1)
MSE <- mean((predicted - actual)^2)
MSE
## [1] 3274.015
# How did boosting regression perform compared to random forest?
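The shrinkage parameter and the number of trees trade off against each other: a smaller \(\lambda\) generally requires a larger B. One way to choose B is cross-validation, sketched below using the cv.folds argument and gbm.perf() from the {gbm} package (the specific values of n.trees and shrinkage here are illustrative, not tuned):
# Sketch: choose the number of boosted trees by 5-fold cross-validation
fit.cv <- gbm(Balance ~ ., distribution = "gaussian", data = train,
              n.trees = 2000, interaction.depth = 2, shrinkage = 0.01, cv.folds = 5)
best.B <- gbm.perf(fit.cv, method = "cv")  # number of trees minimizing the CV error
predicted.cv <- predict(fit.cv, newdata = test, n.trees = best.B)
mean((predicted.cv - test$Balance)^2)      # test MSE at the chosen number of trees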