This example is taken from the book “An Introduction to Statistical Learning with Applications in R” by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. Here we fit a regression tree to the Boston data set. First we create a training set, and fit the tree to the training data.
We load the necessary libraries.
require(MASS)  # provides the Boston data set
require(tree)  # fits classification and regression trees
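MASS ships with R's recommended packages; if the tree package is not yet installed, a one-time install from CRAN takes care of it (a standard step, not part of the original code):
install.packages("tree")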
We load the data set and split it into training and test sets. The data set contains housing values in the suburbs of Boston; for more information on it, run ?Boston in R.
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.090 1 296 15.3 396.9 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.967 2 242 17.8 396.9 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.967 2 242 17.8 392.8 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.062 3 222 18.7 394.6 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.062 3 222 18.7 396.9 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.062 3 222 18.7 394.1 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
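As a quick sanity check before modeling (our addition, not part of the original walkthrough), we can confirm the data set has no missing values; the call below should come back as zero, since Boston ships complete:
sum(is.na(Boston))  # count of NA entries across all columns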
set.seed(1)  # for reproducibility of the random split
train <- sample(1:nrow(Boston), nrow(Boston)/2)  # randomly pick half the rows for training
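A quick check of the split (our addition): sample() draws 506/2 = 253 row indices without replacement, so exactly half the suburbs are used for training and the rest are held out for testing.
length(train)  # 253: half of the 506 suburbs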
We fit the regression tree to the training data and summarize the fitted model.
tree.boston <- tree(medv ~ ., Boston, subset = train)
summary(tree.boston)
##
## Regression tree:
## tree(formula = medv ~ ., data = Boston, subset = train)
## Variables actually used in tree construction:
## [1] "lstat" "rm" "dis"
## Number of terminal nodes: 8
## Residual mean deviance: 12.6 = 3100 / 245
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -14.100 -2.040 -0.054 0.000 1.960 12.600
Notice that the output of summary() indicates that only three of the variables (lstat, rm, and dis) have been used in constructing the tree. In the context of a regression tree, the deviance is simply the sum of squared errors for the tree.
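As a quick check of that claim (our addition, not in the original), we can recompute the deviance by hand: predicting on the training rows and summing the squared residuals should reproduce the reported total of about 3,100 (12.6 × 245, where 245 = 253 training rows minus 8 terminal nodes).
yhat.train <- predict(tree.boston, newdata = Boston[train, ])  # fitted values for the training rows
sum((yhat.train - Boston$medv[train])^2)  # should be roughly 3100, the total deviance
We now plot the tree.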
plot(tree.boston)
text(tree.boston, pretty = 0)
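Printing the fitted object gives the same tree in text form, listing each node's split, size, deviance, and fitted value (an alternative view, our addition):
tree.boston  # text listing of every split and leaf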
The variable lstat measures the percentage of individuals with lower socioeconomic status. The tree indicates that lower values of lstat correspond to more expensive houses. It predicts a median house price of $46,400 for larger houses in suburbs in which residents have high socioeconomic status (rm >= 7.437 and lstat < 9.715).
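The leaf predictions can also be read straight from the tree object rather than the plot (an inspection of its internals, our addition): the frame component stores one row per node, with terminal nodes marked "<leaf>".
leaves <- tree.boston$frame$var == "<leaf>"  # flag the terminal nodes
tree.boston$frame$yval[leaves]  # fitted medv (in $1,000s) at each of the 8 leaves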
Next we see how the tree performs on the test data.
yhat <- predict(tree.boston, newdata = Boston[-train, ])  # predictions for the held-out rows
boston.test <- Boston[-train, "medv"]  # true median values for the held-out rows
plot(yhat, boston.test)
abline(0, 1)  # 45-degree line: perfect predictions would lie on it
mean((yhat - boston.test)^2)  # test-set mean squared error
## [1] 25.05
The test set MSE associated with the regression tree is 25.05. The square root of the MSE is therefore around 5.005, indicating that this model leads to test predictions that are within around $5,005 of the true median home value for the suburb.
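That square root can be computed directly (our addition); since medv is recorded in thousands of dollars, the root MSE is in the same units:
sqrt(mean((yhat - boston.test)^2))  # ≈ 5.005, i.e. about $5,005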
Now we use the cv.tree() function to see whether pruning the tree improves performance.
cv.boston <- cv.tree(tree.boston)  # cost-complexity pruning with 10-fold cross-validation (the default)
plot(cv.boston$size, cv.boston$dev, type = "b")  # deviance against number of terminal nodes
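To read off the minimizing size programmatically rather than from the plot (a small convenience, our addition):
cv.boston$size[which.min(cv.boston$dev)]  # tree size with the smallest cross-validated deviance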
The cross-validated deviance is essentially at its lowest for trees of size 7 and 8, so pruning to seven terminal nodes would barely change the result, and we keep the unpruned tree.
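Had we wanted a smaller, more interpretable tree, prune.tree() extracts the best subtree of a given size; the book prunes to five terminal nodes for illustration:
prune.boston <- prune.tree(tree.boston, best = 5)  # best subtree with 5 terminal nodes
plot(prune.boston)
text(prune.boston, pretty = 0)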