We discuss approaches for using decision trees for regression problems, starting with a simple decision tree and working our way up to xgboost. We will use the Boston Housing dataset as an example. All of the code below, except the xgboost portions, is borrowed from An Introduction to Statistical Learning by G. James, D. Witten, T. Hastie, and R. Tibshirani.
We begin with an example of fitting a tree to the Boston dataset
library(tree)
library(MASS)
library(ggplot2)
set.seed(1)
train = sample(1:nrow(Boston),nrow(Boston)/2) # 50/50 split between test and train
tree.boston = tree (medv ~ ., Boston, subset=train)
summary(tree.boston)
##
## Regression tree:
## tree(formula = medv ~ ., data = Boston, subset = train)
## Variables actually used in tree construction:
## [1] "lstat" "rm" "dis"
## Number of terminal nodes: 8
## Residual mean deviance: 12.65 = 3099 / 245
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -14.10000 -2.04200 -0.05357 0.00000 1.96000 12.60000
The deviance above is the sum of squared errors for the tree; the residual mean deviance divides it by the number of training observations minus the number of terminal nodes (253 - 8 = 245).
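As a quick check (a sketch of my own, not from the book), we can recompute both quantities from the fitted tree:
sum(residuals(tree.boston)^2)                     # total deviance, roughly 3099
sum(residuals(tree.boston)^2)/(length(train)-8)   # residual mean deviance, roughly 12.65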
plot(tree.boston)
text(tree.boston, pretty=0)
We now prune the tree using cross-validation with the cv.tree() function
cv.boston = cv.tree(tree.boston)
plot(cv.boston$size, cv.boston$dev, type='b')
#plot(cv.boston)
#ggplot(mapping = aes(x=cv.boston$size, y=cv.boston$dev)) + geom_line()
The cross-validation results suggest that the most complex tree is the best one. Nevertheless, we can try pruning this tree to keep 5 terminal nodes.
prune.boston = prune.tree(tree.boston,best=5) # We keep 5 terminal nodes
plot(prune.boston)
text(prune.boston, pretty=0)
Finally, we evaluate the performance of the best tree according to cross-validation, the full tree with 8 terminal nodes
yhat = predict(tree.boston, newdata=Boston[-train,])
boston.test = Boston[-train,"medv"]
print(mean((yhat-boston.test)^2))
## [1] 25.04559
plot(yhat, boston.test)
abline(0,1)
The test MSE associated with the tree is 25.05.
In bagging, we use bootstrapping to generate B separate training sets and train a tree on each of them. The predictions are then averaged. For each training set, the data not selected is known as the out-of-bag sample, and is used to evaluate the prediction. The average prediction is given by
\[\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^B\hat{f}^{*b}(x)\]
Averaging the predictions results in lower variance compared to using a single tree. This follows from the observation that the average of n independent random variables, each with variance \({\sigma}^2\), has variance \({\sigma}^2/n\). Bagging thus overcomes a shortcoming of single decision trees, which can produce very different tree structures when built on different samples of the input data.
For classification problems, rather than averaging the results for each tree, a majority vote is taken across the multiple trees.
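To make the idea concrete, here is a minimal hand-rolled bagging sketch of my own (not the book's code); below we will use the randomForest package instead.
set.seed(1)
B <- 100
bag.preds <- matrix(NA, nrow=nrow(Boston[-train,]), ncol=B)
for (b in 1:B) {
  boot <- sample(train, replace=TRUE)                 # bootstrap sample of the training rows
  fit <- tree(medv ~ ., Boston, subset=boot)          # one tree per bootstrap sample
  bag.preds[,b] <- predict(fit, newdata=Boston[-train,])
}
yhat.bag.manual <- rowMeans(bag.preds)                # average the B predictions
mean((yhat.bag.manual - Boston[-train,"medv"])^2)     # test MSE of the hand-rolled bagger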
We now apply bagging to the Boston housing dataset. Since bagging is a special case of a random forest (with \(m=p\)), we use the randomForest package and set the number of variables mtry to all the predictors in the Boston data set.
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
set.seed(1)
bag.boston = randomForest(medv ~., data=Boston, subset=train, mtry=ncol(Boston)-1, importance=TRUE)
bag.boston
##
## Call:
## randomForest(formula = medv ~ ., data = Boston, mtry = ncol(Boston) - 1, importance = TRUE, subset = train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 13
##
## Mean of squared residuals: 11.02509
## % Var explained: 86.65
This spawns 500 trees. Let’s evaluate the prediction of the bagged model on the test set
yhat.bag = predict(bag.boston, newdata=Boston[-train,])
plot(yhat.bag,boston.test)
abline(0,1)
round(mean((yhat.bag-boston.test)^2),2)
## [1] 13.47
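Incidentally, the out-of-bag idea mentioned earlier is what the printed "Mean of squared residuals" reflects; calling predict() on a randomForest object without newdata returns the out-of-bag predictions (a quick check of my own):
yhat.oob <- predict(bag.boston)                       # OOB predictions for the training rows
mean((yhat.oob - Boston[train,"medv"])^2)             # roughly matches the reported mean of squared residuals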
The test error at 13.47 is nearly half of what we obtained with a single tree. We now limit the number of trees to 25 and try again
bag.boston = randomForest(medv ~., data=Boston,subset=train, mtry = ncol(Boston)-1, ntree=25)
yhat.bag = predict(bag.boston, newdata=Boston[-train,])
mean((yhat.bag-boston.test)^2)
## [1] 13.43068
We have a much smaller ensemble, without a significant drop in performance.
Random forests are similar to bagging, except that at each split only a random subset of \(m \approx \sqrt{p}\) of the \(p\) predictors is considered as split candidates. This helps de-correlate the trees. Like bagging, random forests will not overfit as the number of trees \(B\) grows very large; in practice, we use a value of B sufficiently large for the error rate to have settled down.
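Both points can be checked empirically; the sketch below (mine, not the book's) loops over a few values of mtry, and plot() on a fitted forest shows the error settling down as trees are added.
set.seed(1)
for (m in c(13, 6, 4)) {                              # full bagging down to roughly sqrt(p)
  fit <- randomForest(medv ~ ., data=Boston, subset=train, mtry=m, ntree=500)
  yhat <- predict(fit, newdata=Boston[-train,])
  cat("mtry =", m, " test MSE =", round(mean((yhat - Boston[-train,"medv"])^2), 2), "\n")
}
plot(fit)                                             # error as a function of the number of trees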
We now apply random forests to the Boston dataset. By default, randomForest() uses \(p/3\) variables when building a regression tree and \(\sqrt{p}\) variables when building a classification tree. Since we have 13 variables, we try mtry=6
rf.boston = randomForest(medv ~., data=Boston, mtry=6, subset=train, importance=TRUE)
yhat.rf = predict(rf.boston, newdata=Boston[-train, ])
round(mean((yhat.rf - boston.test)^2),2)
## [1] 11.21
We have been able to further bring down the test MSE to 11.21.
The variable importance is given by
importance(rf.boston)
## %IncMSE IncNodePurity
## crim 12.167433 1100.48678
## zn 1.956719 62.54669
## indus 8.245823 954.97538
## chas 1.429269 50.94527
## nox 13.010349 1032.95449
## rm 30.854978 6006.71143
## age 11.161356 491.91157
## dis 14.227309 1356.79634
## rad 4.186869 102.83437
## tax 8.997186 470.72817
## ptratio 11.922732 944.95930
## black 7.765048 352.90710
## lstat 29.955127 7688.11053
The first metric (%IncMSE) is the increase in prediction MSE when the variable's values are permuted in the out-of-bag samples; the second (IncNodePurity) is the total decrease in node impurity from splits on the variable, averaged over all trees. Node impurity is measured by the training RSS for regression trees and by the deviance for classification trees.
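The two columns can also be requested individually via the type argument of importance():
importance(rf.boston, type=1)   # permutation importance (%IncMSE)
importance(rf.boston, type=2)   # node-purity importance (IncNodePurity)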
A plot of the variable importance is given by
varImpPlot(rf.boston)
In boosting, the trees are trained sequentially, with each tree fit to the residuals of the model from the previous step. The resulting prediction is the shrunken sum of all the trees
\[ \hat{f}(x) = \sum_{b=1}^B{\lambda}\hat{f}^b(x) \]
Unlike random forests and bagging, boosting can overfit if the number of trees (B) is very large; an appropriate choice of B can be made using cross-validation. The shrinkage parameter \(\lambda\) controls the rate at which boosting learns and is typically between 0.01 and 0.001; a smaller value implies a slower learning rate. The number of splits \(d\) controls the tree complexity, and often d=1 works well.
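To make the mechanics concrete, here is a bare-bones boosting sketch of my own (not the book's code, and without restricting each tree to d splits): every tree is fit to the current residuals and added to the model with shrinkage \(\lambda\).
set.seed(1)
lambda <- 0.01; B <- 1000
boost.dat <- Boston[train,]
boost.dat$r <- boost.dat$medv                         # start with f_hat(x) = 0, so residuals = y
pred.test <- rep(0, nrow(Boston[-train,]))
for (b in 1:B) {
  fit <- tree(r ~ . - medv, data=boost.dat)           # fit a tree to the current residuals
  boost.dat$r <- boost.dat$r - lambda*predict(fit, boost.dat)     # shrink and update the residuals
  pred.test <- pred.test + lambda*predict(fit, Boston[-train,])   # add the shrunken tree to the test prediction
}
mean((pred.test - Boston[-train,"medv"])^2)           # test MSE of the sketch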
We now apply boosting to the Boston dataset
library(gbm)
## Loading required package: survival
## Loading required package: lattice
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.3
set.seed(1)
boost.boston = gbm(medv ~ ., data=Boston[train,], distribution = "gaussian", n.trees=5000, interaction.depth = 4) # use distribution="bernoulli" for classification
summary(boost.boston)
## var rel.inf
## lstat lstat 45.9627334
## rm rm 31.2238187
## dis dis 6.8087398
## crim crim 4.0743784
## nox nox 2.5605001
## ptratio ptratio 2.2748652
## black black 1.7971159
## age age 1.6488532
## tax tax 1.3595005
## indus indus 1.2705924
## chas chas 0.8014323
## rad rad 0.2026619
## zn zn 0.0148083
We can produce partial dependence plots for the two most important variables, lstat and rm
par(mfrow=c(1,2))
plot(boost.boston, i="rm")
plot(boost.boston, i="lstat")
Next, we use the boosted model for prediction
yhat.boost = predict(boost.boston, newdata=Boston[-train,],n.trees=5000)
round(mean((yhat.boost-boston.test)^2),2)
## [1] 11.84
The above test MSE of 11.84 is similar to what we saw for random forests, and better than bagging. We can fine-tune the shrinkage parameter further; it is currently set at the default of 0.001.
boston.boost <- gbm(medv~., data = Boston[train,], distribution = "gaussian", n.trees=5000, interaction.depth=4, shrinkage = 0.2, verbose=F)
yhat.boost <- predict(boston.boost, Boston[-train,], n.trees=5000)
round(mean((yhat.boost-boston.test)^2),2)
## [1] 11.51
We see a slightly lower error than before.
Let’s try using cross-validation to find the best number of trees
boston.boost.cv <- gbm(medv~., data = Boston[train,], distribution = "gaussian", n.trees=5000, interaction.depth=4, shrinkage = 0.01, verbose=F, cv.folds=5)
bestTreeForPrediction <- gbm.perf(boston.boost.cv)
## Using cv method...
We now predict using this number of trees
yhat.boost = predict(boston.boost.cv, newdata = Boston[-train,],n.trees = bestTreeForPrediction)
round(mean((yhat.boost-boston.test)^2),2)
## [1] 10.88
We observe a further improvement in performance, with the lowest test MSE across all approaches so far: 10.88.
XGBoost, or eXtreme Gradient Boosting, implements gradient boosting with added regularization parameters and parallelized tree construction. It also has a built-in routine to handle missing values. In addition, XGBoost supports continued training: a model trained earlier can be updated when new data becomes available.
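As a quick illustration of the continued-training point (a sketch of my own, using the xgb_model argument, which accepts a previously trained booster):
library(xgboost)
X.train <- as.matrix(Boston[train, names(Boston) != "medv"])
d1 <- xgb.DMatrix(data=X.train, label=Boston[train,"medv"])
m1 <- xgboost(data=d1, max_depth=3, eta=0.2, nrounds=20, objective="reg:linear", verbose=0)
# later, when more rounds (or refreshed data) are wanted, continue boosting from m1 rather than starting over
m2 <- xgboost(data=d1, max_depth=3, eta=0.2, nrounds=20, objective="reg:linear", verbose=0, xgb_model=m1)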
We now try xgboost on the above dataset
library(xgboost)
train.boston <- Boston[train,]
test.boston <- Boston[-train,]
dtrain <- xgb.DMatrix(data = as.matrix(train.boston[!names(train.boston) %in% c("medv")]), label = train.boston$medv)
boston.xgb = xgboost(data=dtrain, max_depth=3, eta = 0.2, nthread=3, nrounds=40, lambda=0, objective="reg:linear")
## [1] train-rmse:19.307873
## [2] train-rmse:15.580692
## [3] train-rmse:12.618803
## [4] train-rmse:10.269051
## [5] train-rmse:8.412157
## [6] train-rmse:6.930980
## [7] train-rmse:5.768116
## [8] train-rmse:4.849423
## [9] train-rmse:4.117004
## [10] train-rmse:3.560823
## [11] train-rmse:3.117196
## [12] train-rmse:2.806568
## [13] train-rmse:2.555753
## [14] train-rmse:2.353195
## [15] train-rmse:2.216175
## [16] train-rmse:2.115286
## [17] train-rmse:2.011023
## [18] train-rmse:1.929796
## [19] train-rmse:1.849981
## [20] train-rmse:1.810580
## [21] train-rmse:1.769735
## [22] train-rmse:1.738504
## [23] train-rmse:1.697501
## [24] train-rmse:1.666304
## [25] train-rmse:1.613824
## [26] train-rmse:1.581723
## [27] train-rmse:1.555577
## [28] train-rmse:1.535946
## [29] train-rmse:1.504930
## [30] train-rmse:1.471925
## [31] train-rmse:1.449668
## [32] train-rmse:1.434701
## [33] train-rmse:1.398550
## [34] train-rmse:1.347438
## [35] train-rmse:1.320076
## [36] train-rmse:1.303965
## [37] train-rmse:1.280769
## [38] train-rmse:1.258030
## [39] train-rmse:1.245997
## [40] train-rmse:1.237618
dtest <- as.matrix(test.boston[!names(train.boston) %in% c("medv")])
yhat.xgb <- predict(boston.xgb,dtest)
round(mean((yhat.xgb - boston.test)^2),2)
## [1] 12.86
The number of rounds above was based on my iteratively trying out different values for nrounds. Let's instead use the cross-validation functionality within xgboost. For convenience, we move the parameters into their own list
#xgboost(data=dtrain, max_depth=3, eta = 0.2, nthread=3, nrounds=40, lambda=0, objective="reg:linear")
set.seed(1)
param <- list("max_depth" = 3, "eta" = 0.2, "objective" = "reg:linear", "lambda" = 0)
cv.nround <- 500
cv.nfold <- 3
boston.xgb.cv <- xgb.cv(param=param, data = dtrain, nfold = cv.nfold, nrounds=cv.nround,
early_stopping_rounds = 200, # training will stop if performance doesn't improve for 200 rounds from the last best iteration
verbose=0)
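A quick peek at what the CV object returns (assuming the usual xgb.cv fields):
boston.xgb.cv$best_iteration          # best round picked by early stopping
head(boston.xgb.cv$evaluation_log)    # per-round train/test RMSE averaged over the folds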
The best iteration is 89. We re-run the model using that value for nrounds
dtrain <- xgb.DMatrix(data = as.matrix(train.boston[!names(train.boston) %in% c("medv")]), label = train.boston$medv)
boston.xgb = xgboost(param=param, data=dtrain, nthread=3, nrounds=boston.xgb.cv$best_iteration, verbose=0)
dtest <- as.matrix(test.boston[!names(train.boston) %in% c("medv")])
yhat.xgb <- predict(boston.xgb,dtest)
round(mean((yhat.xgb - boston.test)^2),2)
## [1] 12.48
We can further explore the feature importance
importance <- xgb.importance(colnames(train.boston[!names(train.boston) %in% c("medv")]),model=boston.xgb)
importance
## Feature Gain Cover Frequency
## 1: lstat 0.4683402175 0.129477645 0.106830123
## 2: rm 0.3352122898 0.176735504 0.171628722
## 3: dis 0.0609717087 0.151639951 0.126094571
## 4: crim 0.0385090928 0.119463127 0.176882662
## 5: ptratio 0.0323205534 0.051998459 0.045534151
## 6: nox 0.0163963972 0.042798732 0.061295972
## 7: tax 0.0155481980 0.032887914 0.031523643
## 8: black 0.0143943347 0.095389766 0.087565674
## 9: age 0.0084895132 0.089567717 0.103327496
## 10: indus 0.0061749558 0.070368285 0.054290718
## 11: rad 0.0015809307 0.024517792 0.021015762
## 12: chas 0.0011536132 0.006533140 0.003502627
## 13: zn 0.0009081953 0.008621967 0.010507881
xgb.plot.importance(importance, rel_to_first=TRUE, xlab="Relative Importance")
One of the final components here is to tune the other parameters in param. For this we will use the caret package
library(caret)
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
ntrees <- boston.xgb.cv$best_iteration
param_grid <- expand.grid(
nrounds = ntrees,
eta = seq(2,24,2)/ntrees,
#eta = c(0.1, 0.2, 0.3, 0.4, 0.5),
subsample = 1.0,
colsample_bytree = 1.0,
max_depth = c(1,2,3,4,5,6),
gamma = 1,
min_child_weight = 1
)
xgb_control <- trainControl(
method="cv",
number = 5
)
set.seed(1)
boston.xgb.tuned <- train(medv~., data=train.boston, trControl=xgb_control, tuneGrid=param_grid,lambda=0, method="xgbTree")
## Loading required package: plyr
The best tuning parameters are given by
boston.xgb.tuned$bestTune
## nrounds max_depth eta gamma colsample_bytree min_child_weight
## 39 89 3 0.1573034 1 1 1
## subsample
## 39 1
Let’s plot the output
#yhat.xgb.tuned <- predict()
plot(boston.xgb.tuned)
The parameter grid with the RMSE values is in boston.xgb.tuned$results.
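A quick way to inspect it (my own snippet, assuming the standard caret results columns):
head(boston.xgb.tuned$results[order(boston.xgb.tuned$results$RMSE), c("eta","max_depth","RMSE")])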
We will now use the tuned final model boston.xgb.tuned$finalModel to predict on the test set
yhat.xgb.tuned <- predict(boston.xgb.tuned$finalModel,newdata=dtest)
round(mean((yhat.xgb.tuned - boston.test)^2),2)
## [1] 12.43
We observe that the MSE is slightly lower than what we had observed before.