Here we apply random forests to the Boston data, using
the randomForest package in R. The exact
results obtained in this section may depend on the version of R and the
version of the randomForest package installed on your
computer. The randomForest() function can be used to
perform both random forests and bagging. We perform a random forest with
randomForest() as follows:
library(ISLR2)
library(randomForest)
set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston) / 2) # Create a random training sample
rf.boston <- randomForest(medv ~ ., data = Boston, subset = train, mtry = 6, importance = TRUE)
rf.boston
Call:
randomForest(formula = medv ~ ., data = Boston, mtry = 6, importance = TRUE, subset = train)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 6
Mean of squared residuals: 10.13394
% Var explained: 86.82
The argument mtry indicates how many predictors should
be considered for each split of the tree; setting mtry equal to the full
set of 12 predictors would correspond to bagging (see the sketch below).
By default, randomForest() uses \(p/3\) variables when building a random
forest of regression trees, and \(\sqrt{p}\)
variables when building a random forest of classification trees. Here we
use mtry = 6.
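Bagging is simply the special case of a random forest in which mtry equals
the total number of predictors. As a sketch (the object name bag.boston is
ours), it could be fit with the same function:
bag.boston <- randomForest(medv ~ ., data = Boston, subset = train,
    mtry = 12, importance = TRUE) # mtry = 12 considers all 12 predictors at each split, i.e. bagging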
We could change the number of trees grown by
randomForest() using the ntree argument.
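For example, a minimal sketch of a smaller forest (the value 25 is
illustrative):
rf.boston.25 <- randomForest(medv ~ ., data = Boston, subset = train,
    mtry = 6, ntree = 25) # ntree = 25 grows a forest of only 25 trees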
How well does this random forest perform on the test set?
boston.test <- Boston[-train, "medv"]
yhat.rf <- predict(rf.boston , newdata = Boston[-train , ])
plot(yhat.rf , boston.test)
abline(0, 1)
mean((yhat.rf - boston.test)^2)
[1] 20.05854
The test set MSE associated with the random forest is 20.06, much better than that obtained using an optimally-pruned single tree.
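For reference, a minimal sketch of that single-tree baseline, assuming the
tree package; the subtree size best = 5 is illustrative and would in
practice be chosen with cv.tree():
library(tree)
tree.boston <- tree(medv ~ ., data = Boston, subset = train) # fit on the training half
prune.boston <- prune.tree(tree.boston, best = 5) # prune to an illustrative 5 terminal nodes
yhat.tree <- predict(prune.boston, newdata = Boston[-train, ])
mean((yhat.tree - boston.test)^2) # test MSE of the pruned single tree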
Using the importance() function, we can view the
importance of each variable.
importance(rf.boston)
%IncMSE IncNodePurity
crim 16.228551 1064.41512
zn 2.978667 74.74542
indus 4.709285 478.99444
chas 0.986362 27.21836
nox 14.858454 753.85506
rm 33.643834 8206.94956
age 13.212521 542.63711
dis 8.449379 644.76518
rad 3.436762 75.21787
tax 10.681850 279.37128
ptratio 9.879386 868.12587
lstat 28.945897 6137.32497
Two measures of variable importance are reported. The first is based
upon the mean decrease of accuracy in predictions on the out-of-bag
samples when a given variable is permuted. The second is a measure of
the total decrease in node impurity that results from splits over that
variable, averaged over all trees. In the case of regression trees, the
node impurity is measured by the training RSS, and for classification
trees by the deviance. Plots of these importance measures can be
produced using the varImpPlot() function.
varImpPlot(rf.boston)
The results indicate that across all of the trees considered in the
random forest, the wealth of the community (lstat) and the
house size (rm) are by far the two most important
variables.
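As a small follow-up sketch, the matrix returned by importance() can be
sorted to rank the predictors (the object name imp is ours):
imp <- importance(rf.boston)
imp[order(imp[, "%IncMSE"], decreasing = TRUE), ] # rank predictors by permutation importance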