Decision Trees for Regression on the Boston Housing Dataset

  1. decision trees for regression
  2. bagging (bootstrapping)
  3. random forest for regression
  4. boosting
  5. xgboost

Regression Trees

Load the Boston dataset and fit a tree

library(tree)
## Warning: package 'tree' was built under R version 3.5.2
library(MASS)
library(ggplot2)
set.seed(42)
train = sample(1:nrow(Boston),nrow(Boston)/2) # 50/50 split between test and train
tree.boston = tree (medv ~ ., Boston, subset=train)
summary(tree.boston)
## 
## Regression tree:
## tree(formula = medv ~ ., data = Boston, subset = train)
## Variables actually used in tree construction:
## [1] "lstat"   "rm"      "ptratio" "crim"   
## Number of terminal nodes:  8 
## Residual mean deviance:  13.24 = 3244 / 245 
## Distribution of residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -15.130  -1.908  -0.155   0.000   1.884  16.100

Plot the tree

plot(tree.boston)
text(tree.boston, pretty=0)

head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7

Prune the tree using cross-validation with the cv.tree() function

cv.boston = cv.tree(tree.boston)
plot(cv.boston$size, cv.boston$dev, type='b')

#same graph, but with ggplot
ggplot(mapping = aes(x=cv.boston$size, y=cv.boston$dev)) + geom_line()

Prune based on the graph, keeping 5 terminal nodes

prune.boston = prune.tree(tree.boston,best=5) # We keep 5 terminal nodes
plot(prune.boston)
text(prune.boston, pretty=0)

Finally, evaluate the performance of the tree selected by cross-validation, which here is the full tree with 8 terminal nodes.

yhat = predict(tree.boston, newdata=Boston[-train,])
boston.test = Boston[-train,"medv"]
print(mean((yhat-boston.test)^2))
## [1] 24.83804
plot(yhat, boston.test)
abline(0,1)

The test MSE (mean of the squared errors) associated with the regression tree is 24.84.

Bagging (via randomForest)

In bagging, we use bootstrapping to generate B separate training sets and train a tree on each of them; the predictions are then averaged. For each bootstrapped training set, the observations that were not selected form the out-of-bag sample and are used to evaluate the predictions.
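
To make the mechanics concrete, here is a minimal hand-rolled bagging sketch, reusing the train split and boston.test from above; the names B, preds and yhat.bag.manual are just illustrative, and the randomForest package used below does all of this (plus the out-of-bag bookkeeping) for us.

# Hand-rolled bagging sketch: B bootstrap samples, one tree each, predictions averaged
set.seed(42)
B <- 100
preds <- matrix(NA, nrow = nrow(Boston[-train, ]), ncol = B)
for (b in 1:B) {
  boot <- sample(train, length(train), replace = TRUE)    # bootstrap sample of the training rows
  fit  <- tree(medv ~ ., Boston, subset = boot)           # grow an unpruned tree on it
  preds[, b] <- predict(fit, newdata = Boston[-train, ])  # predict on the test rows
}
yhat.bag.manual <- rowMeans(preds)                         # average the B predictions
round(mean((yhat.bag.manual - boston.test)^2), 2)          # test MSE of the hand-rolled bagger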

We now apply bagging to the Boston housing dataset. Since bagging is just a special case of a random forest (one with m = p), we use the randomForest package and set the number of variables tried at each split, mtry, to the full set of predictors in the Boston data set.

library(randomForest)
## Warning: package 'randomForest' was built under R version 3.5.2
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
set.seed(42)
bag.boston = randomForest(medv ~., data=Boston, subset=train, mtry=ncol(Boston)-1, importance=TRUE) 
bag.boston
## 
## Call:
##  randomForest(formula = medv ~ ., data = Boston, mtry = ncol(Boston) -      1, importance = TRUE, subset = train) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 13
## 
##           Mean of squared residuals: 14.1636
##                     % Var explained: 82.23

This grows 500 trees. Let's evaluate the predictions of the bagged model on the test set

yhat.bag = predict(bag.boston, newdata=Boston[-train,])
plot(yhat.bag,boston.test)
abline(0,1)

Let's calculate the test error

round(mean((yhat.bag-boston.test)^2),2)
## [1] 17.63

The bagging test error (17.63) is much smaller than the single-tree error (24.84). Next, we limit the number of trees to 25 and check the error again:

bag.boston = randomForest(medv ~., data=Boston,subset=train, mtry = ncol(Boston)-1, ntree=25)
yhat.bag = predict(bag.boston, newdata=Boston[-train,])
mean((yhat.bag-boston.test)^2)
## [1] 17.83574

Reducing the number of trees by a factor of 20 results in only a small loss of performance.

Random Forests

Random forests are similar to bagging, except that at each split of each tree only a random subset of m = sqrt(p) of the predictors is considered. This helps de-correlate the trees. Like bagging, random forests will not overfit if the number of trees B is very large; in practice, we use a value of B sufficiently large for the error rate to have settled down.
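
One way to check that B is large enough is to watch the out-of-bag error settle as trees are added; for a regression forest, the per-tree OOB MSE is stored in the fitted object's mse component. A quick sketch (rf.check is just an illustrative name, fitted with the default mtry):

# Sketch: the out-of-bag MSE flattens out once enough trees have been grown
set.seed(42)
rf.check <- randomForest(medv ~ ., data = Boston, subset = train, ntree = 500)
plot(rf.check$mse, type = "l", xlab = "Number of trees", ylab = "OOB MSE")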

We now apply random forests to the Boston dataset. By default randomForest() uses p/3 variables when building a regression tree, and sqrt(p) variables when building a classification tree. Since we have 13 variables, we try with mtry=6

rf.boston = randomForest(medv ~., data=Boston, mtry=6, subset=train, importance=TRUE)
yhat.rf = predict(rf.boston, newdata=Boston[-train, ])
round(mean((yhat.rf - boston.test)^2),2)
## [1] 17.06

The error went further down to 17.06. The variable importance is given by

importance(rf.boston)
##            %IncMSE IncNodePurity
## crim    13.1143005    1173.28359
## zn       3.3873940      64.59933
## indus    6.8846286     363.75935
## chas     0.9923889      23.83435
## nox     11.2555038     753.55124
## rm      33.9120908    6413.20013
## age     14.9473254     512.25432
## dis     10.9395265     557.88297
## rad      5.2052417     159.55581
## tax     10.0015194     672.44225
## ptratio 14.3271608    1244.89390
## black    4.8105479     390.25402
## lstat   32.7596191    7651.40264

The two metrics are %IncMSE, the increase in out-of-bag MSE when the variable's values are permuted, and IncNodePurity, the total decrease in node impurity (i.e. increase in purity) from splits on that variable. Node impurity is measured by the training RSS for regression trees and by the deviance for classification trees.
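
To see where %IncMSE comes from, here is a rough sketch of permutation importance for a single variable, computed on the test set rather than out-of-bag (so the number will not match the importance() output above, which also averages and scales across trees); perm and base.mse are illustrative names.

# Sketch: permutation importance of lstat, measured on the test set
set.seed(42)
test.x <- Boston[-train, ]
base.mse <- mean((predict(rf.boston, newdata = test.x) - boston.test)^2)  # baseline test MSE
perm <- test.x
perm$lstat <- sample(perm$lstat)                                          # shuffle lstat, breaking its link with medv
perm.mse <- mean((predict(rf.boston, newdata = perm) - boston.test)^2)    # test MSE with lstat scrambled
perm.mse - base.mse                                                       # increase in MSE attributable to lstat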

A plot of the variable importance is given by

varImpPlot(rf.boston)

Boosting (package gbm)

In boosting, the trees are trained sequentially, each one fit to the residuals from the previous step. The steps are as follows:

  1. Train a tree f^1(x) on the entire dataset (X, y).
  2. Update the residuals r <- y - lambda*f^1(x).
  3. For each tree b = 2..B, train a tree f^b(x) with d splits on the data (X, r) and update the residuals as r <- r - lambda*f^b(x).
  4. The final boosted model is f(x) = sum_{b=1}^{B} lambda*f^b(x).

Unlike random forests and bagging, boosting can overfit if the number of trees B is very large; an appropriate choice of B can be made using cross-validation. The shrinkage parameter lambda controls the rate at which boosting learns and is typically between 0.01 and 0.001; the smaller the value, the slower the learning rate. The number of splits d controls the tree complexity, and often d=1 works well.
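
To make the algorithm concrete, here is a minimal hand-rolled boosting loop using shallow rpart trees (lambda, B and yhat.manual are illustrative names, and rpart is used here only because it makes the depth limit easy to set); the gbm package used below implements this properly.

# Hand-rolled boosting sketch: shallow trees fit sequentially to the residuals
library(rpart)
set.seed(42)
lambda <- 0.01                                  # shrinkage parameter
B <- 1000                                       # number of boosting rounds
boost.train <- Boston[train, ]
r <- boost.train$medv                           # residuals start out as the raw response
yhat.manual <- rep(0, nrow(Boston[-train, ]))   # running boosted prediction on the test set
for (b in 1:B) {
  boost.train$r <- r
  fit <- rpart(r ~ . - medv, data = boost.train,
               control = rpart.control(maxdepth = 2, cp = 0))  # small tree on the current residuals
  r <- r - lambda * predict(fit, boost.train)                  # shrink the fit, update the residuals
  yhat.manual <- yhat.manual + lambda * predict(fit, Boston[-train, ])
}
round(mean((yhat.manual - boston.test)^2), 2)   # test MSE of the hand-rolled booster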

We now apply boosting to the Boston dataset

library(gbm)
## Warning: package 'gbm' was built under R version 3.5.2
## Loaded gbm 2.1.5
set.seed(42)
boost.boston = gbm(medv ~ ., data=Boston[train,], distribution="gaussian", n.trees=5000, interaction.depth=4) # use distribution='bernoulli' for classification
summary(boost.boston)

##             var    rel.inf
## lstat     lstat 44.3207059
## rm           rm 26.6627313
## crim       crim  5.5744795
## dis         dis  5.3607854
## age         age  4.9119813
## black     black  4.4952152
## ptratio ptratio  3.2548038
## nox         nox  2.8501109
## tax         tax  0.8647899
## indus     indus  0.8526402
## rad         rad  0.6835900
## zn           zn  0.1681667
## chas       chas  0.0000000

We can produce partial dependence plots for the two most important variables, lstat and rm

par(mfrow=c(1,2))
plot(boost.boston, i="rm")

plot(boost.boston, i="lstat")

Next, we use the boosted model for prediction

yhat.boost = predict(boost.boston, newdata=Boston[-train,],n.trees=5000)
round(mean((yhat.boost-boston.test)^2),2)
## [1] 15.75

The above MSE of 15.75 is smaller than what we saw with random forests, and better than bagging. We can further tune the shrinkage parameter, currently left at the default of 0.001.

boston.boost <- gbm(medv~., data = Boston[train,], distribution = "gaussian", n.trees=5000, interaction.depth=4, shrinkage = 0.1, verbose=F)
yhat.boost <- predict(boston.boost, Boston[-train,], n.trees=5000)
round(mean((yhat.boost-boston.test)^2),2)
## [1] 15.25

With the shrinkage increased from 0.001 to 0.1, we see an improvement in MSE. Next, we use cross-validation to find the best number of trees.

#cross-validation
boston.boost.cv <- gbm(medv~., data = Boston[train,], distribution = "gaussian", n.trees=5000, interaction.depth=4, shrinkage = 0.1, verbose=F, cv.folds=10)

#find the best prediction
bestTreeForPrediction <- gbm.perf(boston.boost.cv)

We now predict using this optimal number of trees

yhat.boost = predict(boston.boost.cv, newdata = Boston[-train,],n.trees = bestTreeForPrediction)
round(mean((yhat.boost-boston.test)^2),2)
## [1] 16.11
outputdataset = data.frame("crim" = Boston$crim,
                           "price" = yhat.boost)
head(outputdataset)
##      crim    price
## 1 0.00632 37.08211
## 2 0.02731 24.85468
## 3 0.02729 19.39020
## 4 0.03237 17.20503
## 5 0.06905 15.29022
## 6 0.02985 17.28210
write.csv(outputdataset, "predictions_boost.csv", row.names=FALSE)
write.csv(Boston, "boston_test.csv", row.names=FALSE)

XGBoost

XGBoost, or eXtreme Gradient Boosting, implements gradient boosting but adds a regularization parameter and parallel processing. It also has a built-in routine for handling missing values, and it allows a previously trained model to be updated when new data becomes available.

We now try xgboost on the above dataset

library(xgboost)
## Warning: package 'xgboost' was built under R version 3.5.2
train.boston <- Boston[train,]
test.boston <- Boston[-train,]

dtrain <- xgb.DMatrix(data = as.matrix(train.boston[!names(train.boston) %in% c("medv")]), label = train.boston$medv)

boston.xgb = xgboost(data=dtrain, max_depth=3, eta=0.2, nthread=3, nrounds=40, lambda=0, objective="reg:linear")
## [1]  train-rmse:19.008835 
## [2]  train-rmse:15.350367 
## [3]  train-rmse:12.433320 
## [4]  train-rmse:10.097751 
## [5]  train-rmse:8.249939 
## [6]  train-rmse:6.770641 
## [7]  train-rmse:5.618767 
## [8]  train-rmse:4.705110 
## [9]  train-rmse:4.002415 
## [10] train-rmse:3.475011 
## [11] train-rmse:3.065882 
## [12] train-rmse:2.755373 
## [13] train-rmse:2.522681 
## [14] train-rmse:2.336552 
## [15] train-rmse:2.216177 
## [16] train-rmse:2.057536 
## [17] train-rmse:1.959620 
## [18] train-rmse:1.874514 
## [19] train-rmse:1.818909 
## [20] train-rmse:1.778851 
## [21] train-rmse:1.727912 
## [22] train-rmse:1.679558 
## [23] train-rmse:1.637625 
## [24] train-rmse:1.583728 
## [25] train-rmse:1.548635 
## [26] train-rmse:1.518612 
## [27] train-rmse:1.468672 
## [28] train-rmse:1.440403 
## [29] train-rmse:1.421851 
## [30] train-rmse:1.387431 
## [31] train-rmse:1.361621 
## [32] train-rmse:1.313439 
## [33] train-rmse:1.301068 
## [34] train-rmse:1.285264 
## [35] train-rmse:1.250826 
## [36] train-rmse:1.233393 
## [37] train-rmse:1.195193 
## [38] train-rmse:1.181252 
## [39] train-rmse:1.161372 
## [40] train-rmse:1.156628
dtest <- as.matrix(test.boston[!names(train.boston) %in% c("medv")])
yhat.xgb <- predict(boston.xgb,dtest)
round(mean((yhat.xgb - boston.test)^2),2)
## [1] 14.98

This shows a better result than all of the other algorithms! The number of rounds above was chosen by iteratively trying different values of nrounds; instead, we can use the cross-validation functionality built into xgboost. For convenience, we move the parameters into their own list.

#xgboost(data=dtrain, max_depth=3, eta = 0.2, nthread=3, nrounds=40, lambda=0, objective="reg:linear")
set.seed(42)
param <- list("max_depth" = 3, "eta" = 0.2, "objective" = "reg:linear", "lambda" = 0)
cv.nround <- 500
cv.nfold <- 3
boston.xgb.cv <- xgb.cv(param=param, data = dtrain, nfold = cv.nfold, nrounds=cv.nround,
                        early_stopping_rounds = 200, # training will stop if performance doesn't improve for 200 rounds from the last best iteration
                        verbose=0)
dtrain <- xgb.DMatrix(data = as.matrix(train.boston[!names(train.boston) %in% c("medv")]), label = train.boston$medv)
boston.xgb = xgboost(param=param, data=dtrain, nthread=3, nrounds=boston.xgb.cv$best_iteration, verbose=0)
dtest <- as.matrix(test.boston[!names(train.boston) %in% c("medv")])
yhat.xgb <- predict(boston.xgb,dtest)
round(mean((yhat.xgb - boston.test)^2),2)
## [1] 14.91
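
As an aside, the incremental-training capability mentioned at the start of this section is exposed through the xgb_model argument of xgboost()/xgb.train(); a hedged sketch below (boston.xgb.more is just an illustrative name, and in a real setting the extra rounds would be driven by newly arrived data rather than the same dtrain):

# Sketch: continue training the existing booster for a few more rounds
boston.xgb.more <- xgboost(params = param, data = dtrain, nthread = 3, nrounds = 10,
                           xgb_model = boston.xgb,  # start from the already-trained model
                           verbose = 0)
round(mean((predict(boston.xgb.more, dtest) - boston.test)^2), 2)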

Explore feature importance:

importance <- xgb.importance(colnames(train.boston[!names(train.boston) %in% c("medv")]),model=boston.xgb)
importance
##     Feature         Gain        Cover   Frequency
##  1:   lstat 0.5655334011 0.1248410000 0.107816712
##  2:      rm 0.2706268366 0.1853419367 0.161725067
##  3:    crim 0.0628331607 0.1213256551 0.183288410
##  4: ptratio 0.0334566723 0.0627905363 0.045822102
##  5:     dis 0.0209431241 0.0956081315 0.102425876
##  6:   black 0.0153946032 0.0935729318 0.094339623
##  7:     age 0.0098593208 0.1350169985 0.129380054
##  8:     nox 0.0091309889 0.0400101760 0.059299191
##  9:     tax 0.0072313883 0.0708388261 0.040431267
## 10:   indus 0.0029268536 0.0615879183 0.045822102
## 11:     rad 0.0013241853 0.0039316358 0.008086253
## 12:      zn 0.0004068742 0.0043941812 0.010781671
## 13:    chas 0.0003325909 0.0007400726 0.010781671
xgb.plot.importance(importance, rel_to_first=TRUE, xlab="Relative Importance")

Tuning the parameters with the caret package

library(caret)
## Warning: package 'caret' was built under R version 3.5.2
## Loading required package: lattice
ntrees <- boston.xgb.cv$best_iteration
param_grid <- expand.grid(
  nrounds = ntrees,
  eta = seq(2,24,2)/ntrees,
  #eta = c(0.1, 0.2, 0.3, 0.4, 0.5),
  subsample = 1.0,
  colsample_bytree = 1.0,
  max_depth = c(1,2,3,4,5,6),
  gamma = 1,
  min_child_weight = 1
)

xgb_control <- trainControl(
  method="cv",
  number = 5
)
set.seed(42)
boston.xgb.tuned <- train(medv~., data=train.boston, trControl=xgb_control, tuneGrid=param_grid,lambda=0, method="xgbTree")
boston.xgb.tuned$bestTune
##    nrounds max_depth       eta gamma colsample_bytree min_child_weight
## 21      57         3 0.1403509     1                1                1
##    subsample
## 21         1

The best tuning parameters are shown above; the corresponding final model is stored in boston.xgb.tuned$finalModel. We can plot the tuning results:

plot(boston.xgb.tuned)


The parameter grid with the corresponding RMSE values is stored in boston.xgb.tuned$results.

We will now use the tuned final model boston.xgb.tuned$finalModel to predict on the test set

yhat.xgb.tuned <- predict(boston.xgb.tuned$finalModel,newdata=dtest)
round(mean((yhat.xgb.tuned - boston.test)^2),2)
## [1] 14.84

The best result so far!
