We discuss approaches for using decision trees for regression problems, starting with a single decision tree and working our way up to xgboost. We use the Boston Housing dataset as a running example. All of the code below, except for the xgboost section, is borrowed from An Introduction to Statistical Learning by G. James, D. Witten, T. Hastie, and R. Tibshirani.

Regression Trees

We begin with an example of fitting a tree to the Boston dataset

library(tree)
library(MASS)
library(ggplot2)
set.seed(1)
train = sample(1:nrow(Boston),nrow(Boston)/2) # 50/50 split between test and train
tree.boston = tree(medv ~ ., Boston, subset=train)
summary(tree.boston)
## 
## Regression tree:
## tree(formula = medv ~ ., data = Boston, subset = train)
## Variables actually used in tree construction:
## [1] "lstat" "rm"    "dis"  
## Number of terminal nodes:  8 
## Residual mean deviance:  12.65 = 3099 / 245 
## Distribution of residuals:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -14.10000  -2.04200  -0.05357   0.00000   1.96000  12.60000

The deviance is the sum of the squared errors for the tree; the residual mean deviance reported above divides it by the number of training observations minus the number of terminal nodes (3099 / 245 = 12.65).
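As a quick sanity check (a sketch that relies on the frame component of the fitted tree object), we can reproduce the reported residual mean deviance:

pred.train = predict(tree.boston, Boston[train,])
rss = sum((Boston[train,"medv"] - pred.train)^2)    # total deviance = sum of squared errors (~3099)
n.leaves = sum(tree.boston$frame$var == "<leaf>")   # number of terminal nodes (8)
rss / (length(train) - n.leaves)                    # residual mean deviance (~12.65)

Next, we plot the fitted tree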

plot(tree.boston)
text(tree.boston, pretty=0)

We now run cross-validation with the cv.tree() function to determine the optimal level of tree complexity

cv.boston = cv.tree(tree.boston)
plot(cv.boston$size, cv.boston$dev, type='b')


The cross-validation results suggest that the most complex tree is the best one. We can nevertheless try pruning the tree down to 5 terminal nodes.

prune.boston = prune.tree(tree.boston,best=5) # We keep 5 terminal nodes
plot(prune.boston)
text(prune.boston, pretty=0)

Finally, we evaluate the test-set performance of the tree selected by cross-validation, the full tree with 8 terminal nodes

yhat = predict(tree.boston, newdata=Boston[-train,])
boston.test = Boston[-train,"medv"]
print(mean((yhat-boston.test)^2))
## [1] 25.04559
plot(yhat, boston.test)
abline(0,1)

The test MSE associated with the full tree is 25.05.
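For comparison, the same test-set evaluation can be run on the pruned 5-node tree (a quick sketch; the exact value depends on the random split):

yhat.prune = predict(prune.boston, newdata=Boston[-train,])
round(mean((yhat.prune - boston.test)^2), 2)   # test MSE of the pruned tree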

Bagging

In bagging, we use bootstrapping to generate B separate training sets and train a tree on each of them; the predictions are then averaged. For each bootstrap sample, the observations not selected are known as the out-of-bag (OOB) sample and can be used to estimate the test error. The average prediction is given by

\[\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^B\hat{f}^{*b}(x)\]

The averaging of predictions results in lower variance compared to using a single tree. This is based on the observation that for n independent random variables, each with variance \({\sigma}^2\), their average has variance \({\sigma}^2/n\). Thus bagging addresses a shortcoming of single decision trees, which can produce very different tree structures when built on different samples of the input data.
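A quick simulation (a sketch with arbitrary numbers, not tied to the Boston data) illustrates this variance reduction:

set.seed(123)
sigma = 2; n = 100
sims = replicate(10000, rnorm(n, mean=0, sd=sigma))  # each replication draws n independent variables
var(sims[1,])        # variance of a single variable, approximately sigma^2 = 4
var(colMeans(sims))  # variance of their average, approximately sigma^2/n = 0.04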

For classification problems, rather than averaging the results for each tree, a majority vote is taken across the multiple trees.

We now apply bagging to the Boston housing dataset. Since bagging is simply a special case of a random forest with \(m=p\), we use the randomForest package and set the number of variables mtry to the total number of predictors in the Boston data set.

library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
set.seed(1)
bag.boston = randomForest(medv ~., data=Boston, subset=train, mtry=ncol(Boston)-1, importance=TRUE) 
bag.boston
## 
## Call:
##  randomForest(formula = medv ~ ., data = Boston, mtry = ncol(Boston) -      1, importance = TRUE, subset = train) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 13
## 
##           Mean of squared residuals: 11.02509
##                     % Var explained: 86.65
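The "Mean of squared residuals" reported above is itself an out-of-bag estimate. As a quick check (a sketch that assumes predict() returns the OOB predictions when newdata is omitted, which is the documented behaviour of randomForest):

oob.pred = predict(bag.boston)              # OOB predictions for the training observations
mean((oob.pred - Boston[train,"medv"])^2)   # should be close to the 11.03 reported above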

By default, randomForest() grows 500 trees. Let's evaluate the prediction of the bagged model on the test set

yhat.bag = predict(bag.boston, newdata=Boston[-train,])
plot(yhat.bag,boston.test)
abline(0,1)

round(mean((yhat.bag-boston.test)^2),2)
## [1] 13.47

The test MSE of 13.47 is nearly half of what we obtained with a single tree. Next, we limit the number of trees to 25 and refit

bag.boston = randomForest(medv ~., data=Boston,subset=train, mtry = ncol(Boston)-1, ntree=25)
yhat.bag = predict(bag.boston, newdata=Boston[-train,])
mean((yhat.bag-boston.test)^2)
## [1] 13.43068

We have a much smaller ensemble (25 trees instead of 500), without a significant drop in performance.

Random Forests

Random forests are similar to bagging, except that at each split only a random subset of roughly \(m=\sqrt{p}\) of the p predictors is considered as split candidates. This helps de-correlate the trees. Like bagging, random forests will not overfit as the number of trees \(B\) grows large; in practice, we use a value of B large enough for the error rate to have settled down.

We now apply random forests to the Boston dataset. By default randomForest() uses p/3 variables when building a regression tree, and \(\sqrt{p}\) variables when building a classification tree. Since we have 13 predictors, we try mtry=6

rf.boston = randomForest(medv ~., data=Boston, mtry=6, subset=train, importance=TRUE)
yhat.rf = predict(rf.boston, newdata=Boston[-train, ])
round(mean((yhat.rf - boston.test)^2),2)
## [1] 11.21

We have been able to further bring the test MSE down to 11.21.
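We can also look at how the test error varies with mtry (a quick sketch; it refits a forest for each value, so it takes a little while, and the exact curve depends on the random seed):

mtry.vals = 1:13
test.mse = sapply(mtry.vals, function(m) {
  fit = randomForest(medv ~ ., data=Boston, subset=train, mtry=m, ntree=200)
  mean((predict(fit, newdata=Boston[-train,]) - boston.test)^2)
})
plot(mtry.vals, test.mse, type='b', xlab="mtry", ylab="Test MSE")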

The variable importance is given by

importance(rf.boston)
##           %IncMSE IncNodePurity
## crim    12.167433    1100.48678
## zn       1.956719      62.54669
## indus    8.245823     954.97538
## chas     1.429269      50.94527
## nox     13.010349    1032.95449
## rm      30.854978    6006.71143
## age     11.161356     491.91157
## dis     14.227309    1356.79634
## rad      4.186869     102.83437
## tax      8.997186     470.72817
## ptratio 11.922732     944.95930
## black    7.765048     352.90710
## lstat   29.955127    7688.11053

The two metrics are the mean increase in MSE on the out-of-bag samples when the variable is permuted (%IncMSE), and the total increase in node purity from splits on the variable, averaged over all trees (IncNodePurity). Node purity is measured by the training RSS for regression trees and by the deviance for classification trees.
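If we want the variables ranked rather than listed, we can sort the importance matrix directly (a one-line sketch):

imp = importance(rf.boston)
imp[order(imp[,"%IncMSE"], decreasing=TRUE),]   # rank variables by the permutation-based measure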

A plot of the variable importance is given by

varImpPlot(rf.boston)

Boosting

In boosting, the trees are trained sequentially, each one fit to the residuals of the model built so far. The steps are as follows

  1. Set \(\hat{f}(x) = 0\) and initialize the residuals \(r = y\)
  2. Train a tree \(\hat{f}^1(x)\) on the data \((X, r)\) and update the residuals \(r \leftarrow r - \lambda\hat{f}^1(x)\)
  3. For each subsequent tree b=2..B, train the tree on the data \((X, r)\) to get \(\hat{f}^b(x)\), and update the residuals as \(r \leftarrow r - \lambda\hat{f}^b(x)\)
  4. The final boosted model is given by

\[ \hat{f}(x) = \sum_{b=1}^B \lambda\hat{f}^b(x) \]

Unlike random forests and bagging, boosting can overfit if the number of trees \(B\) is very large; an appropriate choice of B can be made using cross-validation. The shrinkage parameter \(\lambda\) controls the rate at which boosting learns and is typically 0.01 or 0.001; smaller values imply slower learning. The number of splits \(d\) controls the complexity of each tree, and often d=1 (stumps) works well.
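To make the algorithm concrete, here is a minimal hand-rolled version of the boosting loop using rpart stumps (d=1). This is purely an illustrative sketch; the gbm package used below implements the same idea far more efficiently:

library(rpart)
B = 1000; lambda = 0.01
X = Boston[train, setdiff(names(Boston), "medv")]
y = Boston[train, "medv"]
f.hat = rep(0, length(y))  # start with f_hat(x) = 0
r = y                      # so the residuals start out equal to the response
for (b in 1:B) {
  stump = rpart(r ~ ., data=data.frame(X, r=r), maxdepth=1, cp=0)  # depth-1 tree fit to the residuals
  pred = predict(stump, X)
  f.hat = f.hat + lambda*pred  # add the shrunken tree to the model
  r = r - lambda*pred          # update the residuals
}
mean((f.hat - y)^2)            # training MSE of the hand-rolled boosted model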

We now apply boosting to the Boston dataset

library(gbm)
## Loading required package: survival
## Loading required package: lattice
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.3
set.seed(1)
boost.boston = gbm(medv ~ ., data=Boston[train,], distribution="gaussian", n.trees=5000, interaction.depth=4) # use distribution="bernoulli" for classification
summary(boost.boston)

##             var    rel.inf
## lstat     lstat 45.9627334
## rm           rm 31.2238187
## dis         dis  6.8087398
## crim       crim  4.0743784
## nox         nox  2.5605001
## ptratio ptratio  2.2748652
## black     black  1.7971159
## age         age  1.6488532
## tax         tax  1.3595005
## indus     indus  1.2705924
## chas       chas  0.8014323
## rad         rad  0.2026619
## zn           zn  0.0148083

We can produce partial dependence plots for the two most important variables, lstat and rm

par(mfrow=c(1,2))
plot(boost.boston, i="rm")
plot(boost.boston, i="lstat")

Next, we use the boosted model for prediction

yhat.boost = predict(boost.boston, newdata=Boston[-train,],n.trees=5000)
round(mean((yhat.boost-boston.test)^2),2)
## [1] 11.84

The above MSE of 11.84 is similar to what we saw for random forests, and better than bagging. We can try tuning the shrinkage parameter, which is currently at its default of 0.001.

boston.boost <- gbm(medv~., data = Boston[train,], distribution = "gaussian", n.trees=5000, interaction.depth=4, shrinkage = 0.2, verbose=F)
yhat.boost <- predict(boston.boost, Boston[-train,], n.trees=5000)
round(mean((yhat.boost-boston.test)^2),2)
## [1] 11.51

We see a slightly lower test error than before.

Let’s try using cross-validation to find the best number of trees

boston.boost.cv <- gbm(medv~., data = Boston[train,], distribution = "gaussian", n.trees=5000, interaction.depth=4, shrinkage = 0.01, verbose=F, cv.folds=5)
bestTreeForPrediction <- gbm.perf(boston.boost.cv)
## Using cv method...

We now predict using this number of trees

yhat.boost = predict(boston.boost.cv, newdata = Boston[-train,],n.trees = bestTreeForPrediction)
round(mean((yhat.boost-boston.test)^2),2)
## [1] 10.88

We observe a further improvement in performance, with 10.88 the lowest test MSE across all the approaches so far.

XGBoost

XGBoost, or eXtreme Gradient Boosting, implements gradient boosting but adds a regularization term to the objective and supports parallel processing. It also has a built-in routine for handling missing values, and it allows training to continue from a previously fitted model when new data becomes available.

We now try xgboost on the above dataset

library(xgboost)
train.boston <- Boston[train,]
test.boston <- Boston[-train,]
dtrain <- xgb.DMatrix(data = as.matrix(train.boston[!names(train.boston) %in% c("medv")]), label = train.boston$medv)
boston.xgb = xgboost(data=dtrain, max_depth=3, eta=0.2, nthread=3, nrounds=40, lambda=0,
                     objective="reg:linear")
## [1]  train-rmse:19.307873 
## [2]  train-rmse:15.580692 
## [3]  train-rmse:12.618803 
## [4]  train-rmse:10.269051 
## [5]  train-rmse:8.412157 
## [6]  train-rmse:6.930980 
## [7]  train-rmse:5.768116 
## [8]  train-rmse:4.849423 
## [9]  train-rmse:4.117004 
## [10] train-rmse:3.560823 
## [11] train-rmse:3.117196 
## [12] train-rmse:2.806568 
## [13] train-rmse:2.555753 
## [14] train-rmse:2.353195 
## [15] train-rmse:2.216175 
## [16] train-rmse:2.115286 
## [17] train-rmse:2.011023 
## [18] train-rmse:1.929796 
## [19] train-rmse:1.849981 
## [20] train-rmse:1.810580 
## [21] train-rmse:1.769735 
## [22] train-rmse:1.738504 
## [23] train-rmse:1.697501 
## [24] train-rmse:1.666304 
## [25] train-rmse:1.613824 
## [26] train-rmse:1.581723 
## [27] train-rmse:1.555577 
## [28] train-rmse:1.535946 
## [29] train-rmse:1.504930 
## [30] train-rmse:1.471925 
## [31] train-rmse:1.449668 
## [32] train-rmse:1.434701 
## [33] train-rmse:1.398550 
## [34] train-rmse:1.347438 
## [35] train-rmse:1.320076 
## [36] train-rmse:1.303965 
## [37] train-rmse:1.280769 
## [38] train-rmse:1.258030 
## [39] train-rmse:1.245997 
## [40] train-rmse:1.237618
dtest <- as.matrix(test.boston[!names(train.boston) %in% c("medv")])
yhat.xgb <- predict(boston.xgb,dtest)
round(mean((yhat.xgb - boston.test)^2),2)
## [1] 12.86
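As an aside, the continued-training feature mentioned earlier is exposed through the xgb_model argument (a sketch, assuming xgboost() accepts xgb_model to warm-start from an existing booster; it reuses the dtrain and dtest objects defined above):

boston.xgb.cont = xgboost(data=dtrain, max_depth=3, eta=0.2, nthread=3, nrounds=10, lambda=0,
                          objective="reg:linear", xgb_model=boston.xgb, verbose=0)  # 10 more rounds on top of the 40 above
round(mean((predict(boston.xgb.cont, dtest) - boston.test)^2), 2)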

The number of rounds in the original fit was chosen by iteratively trying out different values of nrounds. Let's instead use the cross-validation functionality within xgboost. For convenience, we move the parameters into their own list

set.seed(1)
param <- list("max_depth" = 3, "eta" = 0.2, "objective" = "reg:linear", "lambda" = 0)
cv.nround <- 500
cv.nfold <- 3
boston.xgb.cv <- xgb.cv(param=param, data = dtrain, nfold = cv.nfold, nrounds=cv.nround,
                        early_stopping_rounds = 200, # training will stop if performance doesn't improve for 200 rounds from the last best iteration
                        verbose=0)

The best iteration, stored in boston.xgb.cv$best_iteration, is 89. We re-run the model using that value for nrounds

dtrain <- xgb.DMatrix(data = as.matrix(train.boston[!names(train.boston) %in% c("medv")]), label = train.boston$medv)
boston.xgb = xgboost(param=param, data=dtrain, nthread=3, nrounds=boston.xgb.cv$best_iteration, verbose=0)
dtest <- as.matrix(test.boston[!names(train.boston) %in% c("medv")])
yhat.xgb <- predict(boston.xgb,dtest)
round(mean((yhat.xgb - boston.test)^2),2)
## [1] 12.48

We can further explore the feature importance

importance <- xgb.importance(colnames(train.boston[!names(train.boston) %in% c("medv")]),model=boston.xgb)
importance
##     Feature         Gain       Cover   Frequency
##  1:   lstat 0.4683402175 0.129477645 0.106830123
##  2:      rm 0.3352122898 0.176735504 0.171628722
##  3:     dis 0.0609717087 0.151639951 0.126094571
##  4:    crim 0.0385090928 0.119463127 0.176882662
##  5: ptratio 0.0323205534 0.051998459 0.045534151
##  6:     nox 0.0163963972 0.042798732 0.061295972
##  7:     tax 0.0155481980 0.032887914 0.031523643
##  8:   black 0.0143943347 0.095389766 0.087565674
##  9:     age 0.0084895132 0.089567717 0.103327496
## 10:   indus 0.0061749558 0.070368285 0.054290718
## 11:     rad 0.0015809307 0.024517792 0.021015762
## 12:    chas 0.0011536132 0.006533140 0.003502627
## 13:      zn 0.0009081953 0.008621967 0.010507881
xgb.plot.importance(importance, rel_to_first=TRUE, xlab="Relative Importance")

The final step is to tune the remaining parameters in param. For this we use the caret package

library(caret)
## 
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
## 
##     cluster
ntrees <- boston.xgb.cv$best_iteration
param_grid <- expand.grid(
  nrounds = ntrees,
  eta = seq(2,24,2)/ntrees,
  #eta = c(0.1, 0.2, 0.3, 0.4, 0.5),
  subsample = 1.0,
  colsample_bytree = 1.0,
  max_depth = c(1,2,3,4,5,6),
  gamma = 1,
  min_child_weight = 1
)

xgb_control <- trainControl(
  method="cv",
  number = 5
)
set.seed(1)
boston.xgb.tuned <- train(medv~., data=train.boston, trControl=xgb_control, tuneGrid=param_grid,lambda=0, method="xgbTree")
## Loading required package: plyr

The best tuning parameters are given by

boston.xgb.tuned$bestTune
##    nrounds max_depth       eta gamma colsample_bytree min_child_weight
## 39      89         3 0.1573034     1                1                1
##    subsample
## 39         1

Let’s plot the output

plot(boston.xgb.tuned)

The full parameter grid with the corresponding RMSE values is available in boston.xgb.tuned$results.

We will now use the tuned final model boston.xgb.tuned$finalModel to predict on the test set

yhat.xgb.tuned <- predict(boston.xgb.tuned$finalModel,newdata=dtest)
round(mean((yhat.xgb.tuned - boston.test)^2),2)
## [1] 12.43

We observe that the test MSE of 12.43 is slightly lower than the 12.48 from the untuned xgboost model, though still higher than the 10.88 from the cross-validated gbm model.