Here we apply random forests to the Boston data, using
the randomForest package in R. The exact
results obtained in this section may depend on the version of R and the
version of the randomForest package installed on your
computer. The randomForest() function can be used to
perform both random forests and bagging. We fit a random forest with
randomForest() as follows:
library(ISLR2)
library(randomForest)
set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston) / 2) # Create a random training set of half the observations
rf.boston <- randomForest(medv ~ ., data = Boston, subset = train, mtry = 6, importance = TRUE)
rf.boston
Call:
 randomForest(formula = medv ~ ., data = Boston, mtry = 6, importance = TRUE,      subset = train) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 6
          Mean of squared residuals: 10.13394
                    % Var explained: 86.82
The argument mtry indicates how many predictors should
be considered for each split of the tree. Since the Boston data
contains 12 predictors, setting mtry = 12 would amount to bagging.
By default, randomForest() uses \(p/3\) variables when building a random
forest of regression trees, and \(\sqrt{p}\)
variables when building a random forest of classification trees. Here we
use mtry = 6.
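For comparison, a bagged model could be fit by setting mtry to the full
number of predictors. The following is a minimal sketch (the object name
bag.boston is our own, and this model is not fit elsewhere in this section):
bag.boston <- randomForest(medv ~ ., data = Boston, subset = train,
    mtry = 12, importance = TRUE) # all 12 predictors at each split, i.e. bagging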
We could also change the number of trees grown by
randomForest() using the ntree argument.
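For instance, a smaller forest of 25 trees could be grown as follows; this
is a sketch (the object name rf.boston.25 is our own), and its results are
not reported here:
rf.boston.25 <- randomForest(medv ~ ., data = Boston, subset = train,
    mtry = 6, ntree = 25) # grow only 25 trees instead of the default 500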
How well does this random forest perform on the test set?
boston.test <- Boston[-train, "medv"]
yhat.rf <- predict(rf.boston, newdata = Boston[-train, ])
plot(yhat.rf, boston.test)
abline(0, 1)
mean((yhat.rf - boston.test)^2)
[1] 20.05854
The test set MSE associated with the random forest is 20.06, much lower than that obtained using an optimally-pruned single tree.
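Since medv is recorded in thousands of dollars, taking the square root of
the test MSE puts the error on the scale of the response; a one-line check:
sqrt(mean((yhat.rf - boston.test)^2)) # RMSE, roughly 4.5 (thousand dollars)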
Using the importance() function, we can view the
importance of each variable.
importance(rf.boston)
          %IncMSE IncNodePurity
crim    16.228551    1064.41512
zn       2.978667      74.74542
indus    4.709285     478.99444
chas     0.986362      27.21836
nox     14.858454     753.85506
rm      33.643834    8206.94956
age     13.212521     542.63711
dis      8.449379     644.76518
rad      3.436762      75.21787
tax     10.681850     279.37128
ptratio  9.879386     868.12587
lstat   28.945897    6137.32497
Two measures of variable importance are reported. The first is based
upon the mean decrease in accuracy of predictions on the out-of-bag
samples when a given variable is permuted. The second is a measure of
the total decrease in node impurity that results from splits over that
variable, averaged over all trees. In the case of regression trees, the
node impurity is measured by the training RSS, and for classification
trees by the deviance. Plots of these importance measures can be
produced using the varImpPlot() function.
varImpPlot(rf.boston)
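To rank the predictors numerically rather than reading them off the plot,
the matrix returned by importance() can be sorted on its first column; this
sketch is our own addition:
imp <- importance(rf.boston)
imp[order(imp[, "%IncMSE"], decreasing = TRUE), ] # sort by permutation importance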
The results indicate that across all of the trees considered in the
random forest, the wealth of the community (lstat) and the
house size (rm) are by far the two most important
variables.
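To see how the predicted value of medv varies with these two variables, the
randomForest package also provides a partialPlot() function; a brief sketch,
not part of the original analysis:
partialPlot(rf.boston, Boston[train, ], lstat) # partial dependence of medv on lstat
partialPlot(rf.boston, Boston[train, ], rm)    # partial dependence of medv on rm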