Load the Boston dataset and fit a regression tree
library(tree)
## Warning: package 'tree' was built under R version 3.5.2
library(MASS)
library(ggplot2)
set.seed(42)
train = sample(1:nrow(Boston),nrow(Boston)/2) # 50/50 split between test and train
tree.boston = tree(medv ~ ., Boston, subset=train)
summary(tree.boston)
##
## Regression tree:
## tree(formula = medv ~ ., data = Boston, subset = train)
## Variables actually used in tree construction:
## [1] "lstat" "rm" "ptratio" "crim"
## Number of terminal nodes: 8
## Residual mean deviance: 13.24 = 3244 / 245
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -15.130 -1.908 -0.155 0.000 1.884 16.100
Plot the tree
plot(tree.boston)
text(tree.boston, pretty=0)
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
## 6 5.21 28.7
Prune the tree using cross-validation with the cv.tree() function
cv.boston = cv.tree(tree.boston)
plot(cv.boston$size, cv.boston$dev, type='b')
#same graph, but with ggplot
ggplot(mapping = aes(x=cv.boston$size, y=cv.boston$dev)) + geom_line()
Pruning based on the graph, we keep 5 terminal nodes
prune.boston = prune.tree(tree.boston,best=5) # We keep 5 terminal nodes
plot(prune.boston)
text(prune.boston, pretty=0)
Finally, we evaluate the test-set performance of the best tree according to cross-validation, which here is the full tree with 8 terminal nodes
yhat = predict(tree.boston, newdata=Boston[-train,])
boston.test = Boston[-train,"medv"]
print(mean((yhat-boston.test)^2))
## [1] 24.83804
plot(yhat, boston.test)
abline(0,1)
The test MSE (mean of the squared errors) associated with the regression tree is 24.84.
In bagging, we use bootstrapping to generate B separate training sets and train a tree on each of them; the predictions are then averaged. For each bootstrap sample, the observations that were not selected form the out-of-bag (OOB) sample and are used to evaluate the prediction error.
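To make this concrete, here is a minimal hand-rolled sketch of bagging with the tree package loaded above. The loop, the choice of B = 100 and the object names are my own illustrative choices; the randomForest package used below does all of this (plus the out-of-bag bookkeeping) for us.
# illustrative sketch: bag B regression trees by hand and average their predictions
set.seed(42)
B = 100
train.data = Boston[train, ]
preds = matrix(NA, nrow = nrow(Boston[-train, ]), ncol = B)
for (b in 1:B) {
  boot.idx = sample(nrow(train.data), replace = TRUE)       # bootstrap sample of the training rows
  fit.b = tree(medv ~ ., data = train.data[boot.idx, ])     # tree grown on the b-th bootstrap sample
  preds[, b] = predict(fit.b, newdata = Boston[-train, ])   # predictions on the test set
}
yhat.manual.bag = rowMeans(preds)                           # average the B sets of predictions
mean((yhat.manual.bag - boston.test)^2)                     # test MSE of the hand-rolled bagger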
We now apply bagging to the Boston housing dataset. Since bagging is simply a special case of a random forest with m = p, we use the randomForest package and set the number of variables mtry to the full set of predictors in the Boston data set.
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.5.2
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
set.seed(42)
bag.boston = randomForest(medv ~., data=Boston, subset=train, mtry=ncol(Boston)-1, importance=TRUE)
bag.boston
##
## Call:
## randomForest(formula = medv ~ ., data = Boston, mtry = ncol(Boston) - 1, importance = TRUE, subset = train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 13
##
## Mean of squared residuals: 14.1636
## % Var explained: 82.23
This grows 500 trees by default. Let's evaluate the predictions of the bagged model on the test set:
yhat.bag = predict(bag.boston, newdata=Boston[-train,])
plot(yhat.bag,boston.test)
abline(0,1)
Let's calculate the test MSE:
round(mean((yhat.bag-boston.test)^2),2)
## [1] 17.63
The bagging test MSE (17.63) is much smaller than the single-tree test MSE (24.84). We now limit the number of trees to 25 and check the error again:
bag.boston = randomForest(medv ~., data=Boston,subset=train, mtry = ncol(Boston)-1, ntree=25)
yhat.bag = predict(bag.boston, newdata=Boston[-train,])
mean((yhat.bag-boston.test)^2)
## [1] 17.83574
Reducing the number of trees 20-fold (from 500 to 25) results in only a small loss of performance.
Random forests are similar to bagging, except that at each split only a random subset of m predictors (typically m = sqrt(p)) out of the full set of p predictors is considered. This helps de-correlate the trees. Like bagging, random forests will not overfit if the number of trees B is very large; in practice we use a value of B large enough for the error rate to have settled down, as the sketch below illustrates.
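As a quick illustration of that last point (my own addition, not part of the original lab), randomForest() reports the error after each additional tree via its mse component, and a test-set counterpart in test$mse when xtest and ytest are supplied, so we can watch the error settle down as B grows.
# sketch: out-of-bag and test MSE as a function of the number of trees
x.train = Boston[train, names(Boston) != "medv"]
x.test = Boston[-train, names(Boston) != "medv"]
rf.check = randomForest(x = x.train, y = Boston[train, "medv"],
                        xtest = x.test, ytest = boston.test, ntree = 500)
plot(rf.check$mse, type = "l", xlab = "Number of trees", ylab = "MSE",
     ylim = range(rf.check$mse, rf.check$test$mse))   # out-of-bag error
lines(rf.check$test$mse, col = "red")                 # test error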
We now apply random forests to the Boston dataset. By default randomForest() uses p/3 variables when building a regression tree and sqrt(p) variables when building a classification tree. Since we have 13 predictors, we try mtry=6.
rf.boston = randomForest(medv ~., data=Boston, mtry=6, subset=train, importance=TRUE)
yhat.rf = predict(rf.boston, newdata=Boston[-train, ])
round(mean((yhat.rf - boston.test)^2),2)
## [1] 17.06
The error went further down to 17.06. The variable importance is given by
importance(rf.boston)
## %IncMSE IncNodePurity
## crim 13.1143005 1173.28359
## zn 3.3873940 64.59933
## indus 6.8846286 363.75935
## chas 0.9923889 23.83435
## nox 11.2555038 753.55124
## rm 33.9120908 6413.20013
## age 14.9473254 512.25432
## dis 10.9395265 557.88297
## rad 5.2052417 159.55581
## tax 10.0015194 672.44225
## ptratio 14.3271608 1244.89390
## black 4.8105479 390.25402
## lstat 32.7596191 7651.40264
The two metrics are %IncMSE, the increase in out-of-bag MSE when the variable's values are permuted, and IncNodePurity, the total decrease in node impurity from splits on that variable, averaged over all trees. Node impurity is measured by the training RSS for regression trees and by the deviance for classification trees.
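As an optional quick check, the same table can be sorted by the permutation measure:
# most important variables first, ordered by %IncMSE
imp = importance(rf.boston)
imp[order(imp[, "%IncMSE"], decreasing = TRUE), ]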
A plot of the variable importance is given by
varImpPlot(rf.boston)
In boosting, the trees are trained sequentially on the residuals of the previous step. The steps are as follows (a hand-rolled sketch is given after the list):

1. Train a tree f^1(x) on the training data and set the residual r <- y - λ f^1(x).
2. For each subsequent tree b = 2, ..., B, fit a tree f^b(x) to the data (X, r) and update the residual as r <- r - λ f^b(x).
3. The final boosted model is f(x) = sum_{b=1}^{B} λ f^b(x).

Unlike random forests and bagging, boosting can overfit if the number of trees B is very large; an appropriate choice of B can be made using cross-validation. The shrinkage parameter λ (lambda) controls the rate at which boosting learns and is typically between 0.01 and 0.001, with smaller values implying a slower learning rate. The number of splits d controls the tree complexity; often d = 1 works well.
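To make the steps concrete, here is a hand-rolled sketch using the tree package loaded earlier. The values lambda = 0.01 and B = 300 and the object names are my own illustrative choices, and the tree depth is not restricted here; gbm below is the real implementation.
# illustrative sketch of the boosting loop above
lambda = 0.01
B = 300
boost.data = Boston[train, ]
boost.data$r = boost.data$medv                 # the residual starts as the raw response
fhat.test = rep(0, nrow(Boston[-train, ]))
for (b in 1:B) {
  fit.b = tree(r ~ . - medv, data = boost.data)   # fit a tree to the current residuals
  boost.data$r = boost.data$r - lambda * predict(fit.b, newdata = boost.data)
  fhat.test = fhat.test + lambda * predict(fit.b, newdata = Boston[-train, ])
}
mean((fhat.test - boston.test)^2)              # test MSE of the hand-rolled booster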
We now apply boosting to the Boston dataset
library(gbm)
## Warning: package 'gbm' was built under R version 3.5.2
## Loaded gbm 2.1.5
set.seed(42)
boost.boston = gbm(medv ~ ., data=Boston[train,], distribution="gaussian", n.trees=5000, interaction.depth=4) # use distribution="bernoulli" for classification
summary(boost.boston)
## var rel.inf
## lstat lstat 44.3207059
## rm rm 26.6627313
## crim crim 5.5744795
## dis dis 5.3607854
## age age 4.9119813
## black black 4.4952152
## ptratio ptratio 3.2548038
## nox nox 2.8501109
## tax tax 0.8647899
## indus indus 0.8526402
## rad rad 0.6835900
## zn zn 0.1681667
## chas chas 0.0000000
We can produce partial dependence plots for the two most important variables, lstat and rm.
par(mfrow=c(1,2))
plot(boost.boston, i="rm")
plot(boost.boston, i="lstat")
Next, we use the boosted model for prediction
yhat.boost = predict(boost.boston, newdata=Boston[-train,],n.trees=5000)
round(mean((yhat.boost-boston.test)^2),2)
## [1] 15.75
The above MSE of 15.75 is smaller than what we saw with random forests, and better than bagging. We can further tune the shrinkage parameter, which we have not set explicitly so far.
boston.boost <- gbm(medv~., data = Boston[train,], distribution = "gaussian", n.trees=5000, interaction.depth=4, shrinkage = 0.1, verbose=F)
yhat.boost <- predict(boston.boost, Boston[-train,], n.trees=5000)
round(mean((yhat.boost-boston.test)^2),2)
## [1] 15.25
With the shrinkage set explicitly to 0.1 we see an improvement in MSE. Next, we use cross-validation to find the best number of trees.
#cross-validation
boston.boost.cv <- gbm(medv~., data = Boston[train,], distribution = "gaussian", n.trees=5000, interaction.depth=4, shrinkage = 0.1, verbose=F, cv.folds=10)
# find the best number of boosting iterations
bestTreeForPrediction <- gbm.perf(boston.boost.cv)
We now predict using this number of trees.
yhat.boost = predict(boston.boost.cv, newdata = Boston[-train,],n.trees = bestTreeForPrediction)
round(mean((yhat.boost-boston.test)^2),2)
## [1] 16.11
outputdataset = data.frame("crim" = Boston[-train, "crim"], # crim values for the test rows, matching the predictions
"price" = yhat.boost)
head(outputdataset)
## crim price
## 1 0.00632 37.08211
## 2 0.02731 24.85468
## 3 0.02729 19.39020
## 4 0.03237 17.20503
## 5 0.06905 15.29022
## 6 0.02985 17.28210
write.csv(outputdataset, "predictions_boost.csv", row.names=FALSE)
write.csv(Boston[-train,], "boston_test.csv", row.names=FALSE) # save the test rows
XGBoost (eXtreme Gradient Boosting) implements gradient boosting, but adds a regularization parameter and supports parallel processing. It also has a built-in routine for handling missing values, and it allows a previously trained model to be updated with additional boosting rounds when new data becomes available.
We now try xgboost on the above dataset
library(xgboost)
## Warning: package 'xgboost' was built under R version 3.5.2
train.boston <- Boston[train,]
test.boston <- Boston[-train,]
dtrain <- xgb.DMatrix(data = as.matrix(train.boston[!names(train.boston) %in% c("medv")]), label = train.boston$medv)
boston.xgb = xgboost(data=dtrain, max_depth=3, eta=0.2, nthread=3, nrounds=40,
lambda=0, objective="reg:linear")
## [1] train-rmse:19.008835
## [2] train-rmse:15.350367
## [3] train-rmse:12.433320
## [4] train-rmse:10.097751
## [5] train-rmse:8.249939
## [6] train-rmse:6.770641
## [7] train-rmse:5.618767
## [8] train-rmse:4.705110
## [9] train-rmse:4.002415
## [10] train-rmse:3.475011
## [11] train-rmse:3.065882
## [12] train-rmse:2.755373
## [13] train-rmse:2.522681
## [14] train-rmse:2.336552
## [15] train-rmse:2.216177
## [16] train-rmse:2.057536
## [17] train-rmse:1.959620
## [18] train-rmse:1.874514
## [19] train-rmse:1.818909
## [20] train-rmse:1.778851
## [21] train-rmse:1.727912
## [22] train-rmse:1.679558
## [23] train-rmse:1.637625
## [24] train-rmse:1.583728
## [25] train-rmse:1.548635
## [26] train-rmse:1.518612
## [27] train-rmse:1.468672
## [28] train-rmse:1.440403
## [29] train-rmse:1.421851
## [30] train-rmse:1.387431
## [31] train-rmse:1.361621
## [32] train-rmse:1.313439
## [33] train-rmse:1.301068
## [34] train-rmse:1.285264
## [35] train-rmse:1.250826
## [36] train-rmse:1.233393
## [37] train-rmse:1.195193
## [38] train-rmse:1.181252
## [39] train-rmse:1.161372
## [40] train-rmse:1.156628
dtest <- as.matrix(test.boston[!names(train.boston) %in% c("medv")])
yhat.xgb <- predict(boston.xgb,dtest)
round(mean((yhat.xgb - boston.test)^2),2)
## [1] 14.98
This shows a better result than all of the other algorithms! The number of rounds above was chosen by iteratively trying different values of nrounds; a more principled approach is the cross-validation functionality within xgboost. For convenience, we move the parameters into their own list.
#xgboost(data=dtrain, max_depth=3, eta = 0.2, nthread=3, nrounds=40, lambda=0, objective="reg:linear")
set.seed(42)
param <- list("max_depth" = 3, "eta" = 0.2, "objective" = "reg:linear", "lambda" = 0)
cv.nround <- 500
cv.nfold <- 3
boston.xgb.cv <- xgb.cv(param=param, data = dtrain, nfold = cv.nfold, nrounds=cv.nround,
early_stopping_rounds = 200, # training will stop if performance doesn't improve for 200 rounds from the last best iteration
verbose=0)
dtrain <- xgb.DMatrix(data = as.matrix(train.boston[!names(train.boston) %in% c("medv")]), label = train.boston$medv)
boston.xgb = xgboost(param=param, data=dtrain, nthread=3, nrounds=boston.xgb.cv$best_iteration, verbose=0)
dtest <- as.matrix(test.boston[!names(train.boston) %in% c("medv")])
yhat.xgb <- predict(boston.xgb,dtest)
round(mean((yhat.xgb - boston.test)^2),2)
## [1] 14.91
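Returning to the incremental-training capability mentioned in the XGBoost introduction: xgb.train() accepts a previously fitted booster through its xgb_model argument, so training can be continued for a few extra rounds when new data arrives. A hedged sketch, simply reusing dtrain as a stand-in for newly arrived data:
# sketch: continue training the existing booster for 10 more rounds
# (in practice dtrain would be an xgb.DMatrix built from the new observations)
boston.xgb.updated = xgb.train(params = param, data = dtrain, nrounds = 10,
                               xgb_model = boston.xgb, verbose = 0)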
Explore feature importance:
importance <- xgb.importance(colnames(train.boston[!names(train.boston) %in% c("medv")]),model=boston.xgb)
importance
## Feature Gain Cover Frequency
## 1: lstat 0.5655334011 0.1248410000 0.107816712
## 2: rm 0.2706268366 0.1853419367 0.161725067
## 3: crim 0.0628331607 0.1213256551 0.183288410
## 4: ptratio 0.0334566723 0.0627905363 0.045822102
## 5: dis 0.0209431241 0.0956081315 0.102425876
## 6: black 0.0153946032 0.0935729318 0.094339623
## 7: age 0.0098593208 0.1350169985 0.129380054
## 8: nox 0.0091309889 0.0400101760 0.059299191
## 9: tax 0.0072313883 0.0708388261 0.040431267
## 10: indus 0.0029268536 0.0615879183 0.045822102
## 11: rad 0.0013241853 0.0039316358 0.008086253
## 12: zn 0.0004068742 0.0043941812 0.010781671
## 13: chas 0.0003325909 0.0007400726 0.010781671
xgb.plot.importance(importance, rel_to_first=TRUE, xlab="Relative Importance")
Finally, we tune the parameters in param using the caret package.
library(caret)
## Warning: package 'caret' was built under R version 3.5.2
## Loading required package: lattice
ntrees <- boston.xgb.cv$best_iteration
param_grid <- expand.grid(
nrounds = ntrees,
eta = seq(2,24,2)/ntrees,
#eta = c(0.1, 0.2, 0.3, 0.4, 0.5),
subsample = 1.0,
colsample_bytree = 1.0,
max_depth = c(1,2,3,4,5,6),
gamma = 1,
min_child_weight = 1
)
xgb_control <- trainControl(
method="cv",
number = 5
)
set.seed(42)
boston.xgb.tuned <- train(medv~., data=train.boston, trControl=xgb_control, tuneGrid=param_grid,lambda=0, method="xgbTree")
boston.xgb.tuned$bestTune
## nrounds max_depth eta gamma colsample_bytree min_child_weight
## 21 57 3 0.1403509 1 1 1
## subsample
## 21 1
The best tuning parameters for the final model are shown above. We can also plot the tuning profile:
plot(boston.xgb.tuned)
The plot shows the cross-validated RMSE against the shrinkage (eta) over the tuning grid.
The parameter grid with the cross-validated RMSE values is in boston.xgb.tuned$results.
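For example, the best few rows of that grid by cross-validated RMSE can be inspected with:
# top of the tuning grid, sorted by cross-validated RMSE
head(boston.xgb.tuned$results[order(boston.xgb.tuned$results$RMSE), ])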
We will now use the tuned final model boston.xgb.tuned$finalModel to predict on the test set
yhat.xgb.tuned <- predict(boston.xgb.tuned$finalModel,newdata=dtest)
round(mean((yhat.xgb.tuned - boston.test)^2),2)
## [1] 14.84
The best result so far!
# export the tuned XGBoost predictions for the test rows (the file name is arbitrary)
outputdataset = data.frame("crim" = Boston[-train, "crim"],
"price" = yhat.xgb.tuned)
write.csv(outputdataset, "predictions_xgb_tuned.csv", row.names=FALSE)