Trees stratify or segment the predictor space into a number of simple regions. The prediction for a new observation is the mean of the training responses in the region to which it belongs.
The predictor space is segmented into regions known as the terminal nodes or leaves of the tree. The points along the tree where the space is split are internal nodes, and the segments that connect the nodes are called branches.
Building a regression tree takes two steps:

1. Divide the predictor space \(X_1, X_2, \ldots, X_p\) into \(J\) distinct and non-overlapping regions \(R_1, R_2, \ldots, R_J\).
2. For every observation that falls into region \(R_j\), make the same prediction: the mean of the response values for the training observations in \(R_j\).
For example, baseball players with less than 4.5 years of experience have an average salary (in thousands of dollars) of about 164.0, versus about 601.8 for those with more than 4.5 years of experience. The more experienced players can be further divided by the number of hits they have. The tree is built from the top down in a greedy fashion, meaning it takes the best binary split at the current node without regard to successive splits. To perform recursive binary splitting, first select the predictor \(X_j\) and the cutpoint \(s\) defining the regions \[\begin{aligned} R_1(j, s) = \{ X \mid X_j < s \} \quad &\text{and} \quad R_2(j, s) = \{ X \mid X_j \geq s \} \\ \text{that minimize} \quad \text{RSS} = \sum_{i:\, x_i \in R_1(j, s)} (y_i - \hat y_{R_1})^2 &+ \sum_{i:\, x_i \in R_2(j, s)} (y_i - \hat y_{R_2})^2 \end{aligned}\] where \(\hat y_{R_1}\) and \(\hat y_{R_2}\) are the mean responses for the training observations in each region. Next, one of the previously created regions is split further in the same way. This continues until a stopping criterion is reached (maximum depth, minimum number of observations per node, etc.).
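As a concrete (if naive) illustration, here is a minimal sketch of that split search in R: scan every predictor and every cutpoint, keeping the split that minimizes the RSS. The helper name best_split and the commented Hitters example are my own additions for illustration, not part of the lab code below; rpart does this far more efficiently.

# A naive sketch of recursive binary splitting's inner step: try every
# predictor and every cutpoint, keep the split with the smallest RSS.
best_split <- function(X, y) {
  best <- list(rss = Inf, var = NA, cutpoint = NA)
  for (j in seq_along(X)) {
    x <- X[[j]]
    for (s in sort(unique(x))) {
      left  <- y[x < s]
      right <- y[x >= s]
      if (length(left) == 0 || length(right) == 0) next
      rss <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
      if (rss < best$rss) {
        best <- list(rss = rss, var = names(X)[j], cutpoint = s)
      }
    }
  }
  best
}

# e.g. the first split for the baseball example (requires the ISLR package):
# hitters <- na.omit(ISLR::Hitters)
# best_split(hitters[, c("Years", "Hits")], hitters$Salary)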
A deep tree has high variance and will probably overfit the data. A smaller tree will have lower variance at the cost of increased bias. Simply stopping growth once each split's improvement becomes small is short-sighted, because a seemingly poor split might be followed by a very good one. A better strategy is to grow a very large tree \(T_0\) and then prune it back to obtain a subtree. Ideally the best subtree has the lowest test error, but cross-validating every possible subtree is too expensive, so we use a method called cost complexity pruning, also known as weakest link pruning.
Algorithm for Building a Regression Tree

1. Use recursive binary splitting to grow a large tree on the training data.
2. Apply cost complexity pruning to the large tree to obtain a sequence of best subtrees as a function of \(\alpha\).
3. Use K-fold cross-validation to choose \(\alpha\). For each fold \(k\): (a) repeat Steps 1 and 2 on all but the \(k\)th fold of the training data; (b) evaluate the MSE on the left-out \(k\)th fold as a function of \(\alpha\). Average the results over the folds and pick the \(\alpha\) that minimizes the average error.
4. Return the subtree from Step 2 that corresponds to the chosen value of \(\alpha\).
Rather than considering every possible subtree, we consider a sequence of trees indexed by a nonnegative tuning parameter \(\alpha\). For each value of \(\alpha\) there corresponds a subtree \(T \subset T_0\) that minimizes \[ \sum_{m = 1}^{\left| T \right|} \sum_{i:\, x_i \in R_m} (y_i - \hat y_{R_m})^2 + \alpha \left| T \right| \] where \(\left| T \right|\) is the number of terminal nodes in tree \(T\), \(R_m\) is the rectangle corresponding to the \(m\)th terminal node, and \(\hat y_{R_m}\) is the predicted response (mean value) associated with \(R_m\). The tuning parameter \(\alpha\) controls the bias-variance trade-off: \(\alpha = 0\) gives back the whole tree \(T_0\), and as \(\alpha\) increases there is a growing penalty on the number of terminal nodes \(\left| T \right|\), similar to the lasso penalty in linear regression.
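In rpart, the complexity parameter cp plays a role analogous to \(\alpha\) (rescaled by the root-node error), and the fitted object's cptable stores the cross-validated error for the nested sequence of subtrees. A rough sketch of growing a large tree and pruning at the cp with the smallest xerror, using the Boston data from the lab below as an assumed example; this mirrors the spirit of Steps 1-3 above rather than reproducing the book's algorithm exactly.

library(rpart)
library(MASS)  # for the Boston data

# Grow a deliberately large tree, then prune back at the cp value whose
# cross-validated error (xerror) is smallest.
big.tree <- rpart(medv ~ ., data = Boston, method = "anova",
                  control = rpart.control(cp = 0.001))
cp.table <- big.tree$cptable
best.cp <- cp.table[which.min(cp.table[, "xerror"]), "CP"]
pruned.tree <- prune(big.tree, cp = best.cp)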
Classification trees are very similar to regression trees but they predict a qualitative response. Duh. The prediction in each region is now the most commonly occurring class among the training observations, as in nearest neighbors. Instead of using RSS to make binary splits, we could use the classification error rate \(E = 1 - \text{max}_k (\hat p_{mk})\), where \(\hat p_{mk}\) represents the proportion of training observations in the \(m\)th region that are from the \(k\)th class. In practice, however, this is not sensitive enough for tree-growing, so we use other measures. \[ G = \sum_{k = 1}^K \hat p_{mk} (1 - \hat p_{mk}) \qquad \text{Gini Index} \] The Gini index is a measure of the total variance across the \(K\) classes. A small value indicates that a node contains predominantly observations from a single class: it is pure. \[ D = - \sum_{k = 1}^K \hat p_{mk} \log \hat p_{mk} \qquad \text{Entropy} \] Since \(0 \leq \hat p_{mk} \leq 1\), it follows that \(- \hat p_{mk} \log \hat p_{mk} \geq 0\). One can show that the entropy takes a value near zero if the \(\hat p_{mk}\)'s are all near zero or one, meaning the entropy will be small if the \(m\)th node is pure.
Which is better? It depends on the problem at hand. If the response is well approximated by a linear model, then linear regression it is! If instead the data have highly non-linear and complex relationships between the features and the response, then decision trees may outperform classical approaches.
Decision trees suffer from high variance: splitting the data into two random parts will lead to very different model fits. A low-variance procedure yields similar results when applied repeatedly to distinct datasets. Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method. Given a set of \(n\) independent observations \(Z_1, \ldots, Z_n\), each with variance \(\sigma^2\), the variance of the mean \(\bar Z\) of the observations is \(\sigma^2 / n\). In other words, averaging a set of observations reduces variance. We could calculate \(\hat f^1 (x), \hat f^2 (x), \ldots, \hat f^B (x)\) using \(B\) separate training sets and average them in order to obtain a single low-variance statistical learning model given by \[ \hat f_{\text{bag}} (x) = \frac{1}{B} \sum_{b = 1}^B \hat f^{*b} (x) \] Bagging achieves this by bootstrapping training sets from the single available training set. For each bootstrap sample we build a deep tree, one that has high variance but low bias; averaging these \(B\) trees together reduces the variance.
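To make the averaging formula concrete, here is a hand-rolled bagging sketch on the Boston data: fit \(B\) deep trees to bootstrap samples of the training set and average their test-set predictions. B = 100 and the cp = 0 setting are arbitrary choices of mine; the randomForest call used later in the lab does the same thing more efficiently.

library(rpart)
library(MASS)

set.seed(1)
B <- 100
n <- nrow(Boston)
train <- sample(1:n, n / 2)
test <- setdiff(1:n, train)

# Fit B deep (unpruned) trees, each to a bootstrap sample of the training set.
preds <- matrix(NA, nrow = length(test), ncol = B)
for (b in 1:B) {
  boot <- sample(train, replace = TRUE)
  fit <- rpart(medv ~ ., data = Boston[boot, ], method = "anova",
               control = rpart.control(cp = 0))
  preds[, b] <- predict(fit, newdata = Boston[test, ])
}

# f_bag(x): average the B predictions for each test observation.
yhat.bag.manual <- rowMeans(preds)
mean((yhat.bag.manual - Boston$medv[test])^2)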
The test error of a bagged model can be estimated as follows. On average, each bagged tree uses about two-thirds of the observations (see Ch5 exercise 2). The remaining one-third of observations left out of a given bootstrap sample are referred to as the out-of-bag (OOB) observations. For the \(i\)th observation, we can predict its response using each of the trees for which it was OOB, yielding around \(B/3\) predictions; averaging them gives an OOB prediction for that observation. Doing this for all \(n\) observations gives the overall OOB MSE.
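With randomForest this bookkeeping is built in: as I understand the package, calling predict() on the fitted object without newdata returns the OOB prediction for each training observation, so the OOB MSE falls out directly. A small sketch, reusing the same Boston split as the lab below.

library(randomForest)
library(MASS)

set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston) / 2)
bag.fit <- randomForest(medv ~ ., data = Boston, subset = train, mtry = 13)

# With no newdata, predict() returns the OOB prediction for each training
# observation (only trees that did not see it in their bootstrap sample vote).
oob.pred <- predict(bag.fit)
mean((oob.pred - Boston$medv[train])^2)  # OOB estimate of the test MSE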
Bagging many, many trees together obscures their interpretability. We can still get an overall summary of the importance of each predictor using the RSS (for regression trees) or the Gini index (for classification trees), recording the total amount by which the RSS or Gini index is decreased due to splits over a given predictor, averaged over all \(B\) trees.
Like bagging, random forests build a number of decision trees on bootstrapped training samples. But in building these decision trees, each time a split is considered, a random sample of \(m\) predictors is chosen as split candidates from the full set of \(p\) predictors. A fresh sample of \(m\) predictors is taken at each split, and typically we choose \(m \approx \sqrt p\). Not considering a majority of the available predictors sounds crazy, but it has a clever rationale. Suppose the dataset has one very strong predictor. Then the collection of bagged trees will have many trees with the strong predictor in the top split, so the bagged trees will look quite similar and their predictions will be highly correlated; highly correlated predictions do not reduce the variance as much as uncorrelated ones. Random forests break up this correlation by considering only a subset of the predictors at each split: on average, \((p - m) / p\) of the splits do not even consider the strong predictor. This process decorrelates the trees.
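A quick way to see the effect of \(m\) is to refit the forest for a few values of mtry and compare test MSEs. The grid below is an arbitrary choice of mine; mtry = 13 corresponds to bagging on the Boston data.

library(randomForest)
library(MASS)

set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston) / 2)
test.y <- Boston$medv[-train]

# Test MSE as m shrinks from bagging (m = p = 13) to small random subsets.
for (m in c(13, 6, 4, 2)) {
  fit <- randomForest(medv ~ ., data = Boston, subset = train, mtry = m)
  yhat <- predict(fit, newdata = Boston[-train, ])
  cat("mtry =", m, " test MSE =", round(mean((yhat - test.y)^2), 2), "\n")
}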
Boosting is similar to bagging in that it is a general approach that can be applied to many statistical learning methods for regression or classification.

- Bagging: bootstrap the original training data, fit a model to each bootstrap sample, then average all the models together.
- Boosting: trees are grown sequentially, each one using information from the previously grown trees. Instead of bootstrapping, each tree is fit on a modified version of the original dataset.
What’s the idea? Unlike fitting a single large decision tree to the data, which amounts to fitting the data hard and potentially overfitting, the boosting approach learns slowly. Boosting has three tuning parameters:

1. The number of trees \(B\). Too large a \(B\) can overfit, so use CV to select \(B\).
2. The shrinkage parameter \(\lambda\), which controls the boosting learning rate. Typical values are 0.01 or 0.001.
3. The number of splits \(d\) in each tree. Often \(d = 1\) works well, where each tree is a stump consisting of a single split. In this case the boosted ensemble is fitting an additive model, since each term involves only a single variable.
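The slow-learning idea can be made concrete by hand: repeatedly fit a stump to the current residuals and add a shrunken copy of it to the model. This is a sketch of least-squares boosting in the spirit of the description above, with B, lambda, and the Boston training set as my own illustrative choices; the gbm call in the lab below is the practical route.

library(rpart)
library(MASS)

set.seed(1)
B <- 1000
lambda <- 0.01
train.df <- Boston[sample(1:nrow(Boston), nrow(Boston) / 2), ]

fhat <- rep(0, nrow(train.df))  # the boosted model f(x), starts at zero
res <- train.df$medv            # current residuals r = y - f(x)

for (b in 1:B) {
  # Fit a stump (single split) to the residuals rather than to the response.
  stump <- rpart(res ~ . - medv, data = train.df, method = "anova",
                 control = rpart.control(maxdepth = 1, cp = 0))
  pred.b <- predict(stump)        # stump's fitted values on the training data
  fhat <- fhat + lambda * pred.b  # add a shrunken copy of the stump
  res <- res - lambda * pred.b    # and update the residuals
}

mean((fhat - train.df$medv)^2)  # training MSE, just to show the mechanics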
# Carseats is from the ISLR package; if_else() is from dplyr (both assumed loaded above)
Carseats$High <- as.factor(if_else(Carseats$Sales <= 8, "No", "Yes"))
set.seed(2)
tree.carseats <- rpart(High ~ . -Sales ,data = Carseats, method = "class")
rpart.plot(tree.carseats)
plotcp(tree.carseats)
printcp(tree.carseats)
##
## Classification tree:
## rpart(formula = High ~ . - Sales, data = Carseats, method = "class")
##
## Variables actually used in tree construction:
## [1] Advertising Age CompPrice Income Price ShelveLoc
##
## Root node error: 164/400 = 0.41
##
## n= 400
##
## CP nsplit rel error xerror xstd
## 1 0.286585 0 1.00000 1.00000 0.059980
## 2 0.109756 1 0.71341 0.71341 0.055477
## 3 0.045732 2 0.60366 0.63415 0.053492
## 4 0.036585 4 0.51220 0.61585 0.052981
## 5 0.027439 5 0.47561 0.56707 0.051515
## 6 0.024390 7 0.42073 0.56707 0.051515
## 7 0.012195 8 0.39634 0.55488 0.051124
## 8 0.010000 10 0.37195 0.53049 0.050310
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston) / 2)
tree.boston <- rpart(medv ~ ., data = Boston,
method = "anova", subset = train)
yhat <- predict(tree.boston, newdata = Boston[-train,])
boston.test <- Boston$medv[-train]
plot(yhat, boston.test)
mean((yhat - boston.test)^2)
## [1] 25.35825
printcp(tree.boston)
##
## Regression tree:
## rpart(formula = medv ~ ., data = Boston, subset = train, method = "anova")
##
## Variables actually used in tree construction:
## [1] dis lstat rm
##
## Root node error: 20895/253 = 82.588
##
## n= 253
##
## CP nsplit rel error xerror xstd
## 1 0.462576 0 1.00000 1.01273 0.117728
## 2 0.204673 1 0.53742 0.56339 0.059699
## 3 0.074618 2 0.33275 0.34949 0.039945
## 4 0.039191 3 0.25813 0.29608 0.039976
## 5 0.032082 4 0.21894 0.31072 0.048076
## 6 0.021629 5 0.18686 0.29508 0.048053
## 7 0.011150 6 0.16523 0.26146 0.042604
## 8 0.010000 7 0.15408 0.25032 0.037014
rpart.plot(tree.boston)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
set.seed(1)
bag.boston <- randomForest(medv ~ ., data = Boston,
subset = train, mtry = 13, importance = T)
yhat.bag <- predict(bag.boston, newdata = Boston[-train,])
plot(yhat.bag, boston.test-yhat.bag)
mean((yhat.bag - boston.test)^2)
## [1] 13.50808
rf.boston <- randomForest(medv ~ ., data = Boston,
subset = train, importance = T)
importance(rf.boston)
## %IncMSE IncNodePurity
## crim 12.683368 1109.8339
## zn 2.848951 105.8980
## indus 11.437218 1477.5452
## chas 1.456523 111.5992
## nox 11.941250 1383.9662
## rm 27.947237 5369.1223
## age 11.762910 753.9417
## dis 14.082646 1414.5241
## rad 4.347546 153.0222
## tax 11.118294 751.6709
## ptratio 12.686562 1351.2312
## black 6.791716 428.7641
## lstat 25.621981 5981.0986
# %IncMSE: mean increase in OOB MSE when the values of that predictor are randomly permuted
# IncNodePurity: total decrease in node impurity (RSS here) from splits over that variable, summed over all trees
varImpPlot(rf.boston)
library(gbm)
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:rpart':
##
## solder
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.3
set.seed(1)
boost.boston <- gbm(medv ~ ., data = Boston[train, ],
distribution = "gaussian",
n.trees = 5000,
interaction.depth = 4,
shrinkage = 0.01)
summary(boost.boston)
## var rel.inf
## lstat lstat 38.4712337
## rm rm 26.8997879
## dis dis 9.6829733
## crim crim 6.4257005
## black black 4.2383213
## age age 3.7465696
## nox nox 3.4367222
## ptratio ptratio 2.6381676
## tax tax 1.5671593
## indus indus 1.5032801
## chas chas 0.7598312
## rad rad 0.5285132
## zn zn 0.1017401
plot(boost.boston, i = "lstat")
yhat.boost <- predict(boost.boston, newdata = Boston[-train,], n.trees = 5000)
mean((yhat.boost - boston.test)^2)
## [1] 10.23422
3. Compare the Gini index, classification error, and entropy as a function of \(\hat p_{m1}\).
df <- tibble(
p1 = seq(0, 1, 0.01),
p2 = 1 - p1,
error = 1 - pmax(p1, p2),
gini = p1 * (1 - p1) + p2 * (1 - p2),
entropy = - p1 * log(p1) - p2 * log(p2)
)
df
## # A tibble: 101 x 5
## p1 p2 error gini entropy
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 1 0 0 NaN
## 2 0.01 0.99 0.01 0.0198 0.0560
## 3 0.02 0.98 0.02 0.0392 0.0980
## 4 0.03 0.97 0.03 0.0582 0.135
## 5 0.04 0.96 0.04 0.0768 0.168
## 6 0.05 0.95 0.05 0.095 0.199
## 7 0.06 0.94 0.06 0.113 0.227
## 8 0.07 0.93 0.07 0.130 0.254
## 9 0.08 0.92 0.0800 0.147 0.279
## 10 0.09 0.91 0.0900 0.164 0.303
## # ... with 91 more rows
df %>%
gather(type, value, -p1, -p2) %>%
ggplot(aes(x = p1, y = value, color = type)) +
geom_line() +
labs(title = "Binary Region classification errors",
y = "Error Rate")
## Warning: Removed 2 rows containing missing values (geom_path).
Regression trees are grown by recursive binary splitting: at each step make the split that most reduces the RSS, and continue until a stopping point is reached.
set.seed(1)
train <- sample(1:nrow(Carseats), nrow(Carseats) * 0.8)
# NB: the High column created above is derived from Sales and is included by `.`,
# so the test MSEs below are optimistic; `Sales ~ . - High` would avoid the leak.
tree.carseats <- rpart(Sales ~ ., data = Carseats, subset = train)
rpart.plot(tree.carseats)
yhat <- predict(tree.carseats, newdata = Carseats[-train,])
mean((yhat - Carseats$Sales[-train])^2)
## [1] 1.946852
# printcp(tree.carseats) # 4 splits is the best
tree.carseats4 <- rpart(Sales ~ ., data = Carseats, cp = 0.015971, subset = train)
rpart.plot(tree.carseats4)
yhat4 <- predict(tree.carseats4, newdata = Carseats[-train,])
mean((yhat4 - Carseats$Sales[-train])^2)
## [1] 1.863894
# Pruning reduces the test MSE from about 1.95 to 1.86
bag.carseat <- randomForest(Sales ~ ., data = Carseats,
subset = train,
mtry = length(Carseats)-1,
importance = T)
yhat.bag <- predict(bag.carseat, newdata = Carseats[-train,])
mean((yhat.bag - Carseats$Sales[-train])^2)
## [1] 1.384816
# Bagging reduces the test MSE further, to about 1.38
importance(bag.carseat)
## %IncMSE IncNodePurity
## CompPrice 1.936790e+01 141.280931
## Income -3.081611e-02 72.238399
## Advertising 3.423359e+00 46.889299
## Population 1.684690e+00 82.547358
## Price 3.525324e+01 264.184965
## ShelveLoc 3.928732e+01 159.456659
## Age 6.124197e+00 95.946727
## Education 7.119851e-04 38.019156
## Urban -2.673066e+00 9.052972
## US 9.605100e-01 5.957471
## High 1.048537e+02 1555.734490
rf.carseat <- randomForest(Sales ~ ., data = Carseats,
subset = train,
importance = T)
yhat.rf <- predict(rf.carseat, newdata = Carseats[-train,])
mean((yhat.rf - Carseats$Sales[-train])^2)
## [1] 1.443891
# Random forest test MSE is about 1.44
library(caret)
set.seed(1)
train <- sample(1:nrow(OJ), 800)
tree.oj <- rpart(Purchase ~ ., data = OJ, subset = train)
# NB: column 1 of predict() here is P(CH), but factor() assigns labels in the order
# FALSE, TRUE, so "CH" and "MM" end up swapped below; a cleaner call is
# predict(tree.oj, newdata = OJ[-train, ], type = "class").
yhat.tr.oj <- factor(predict(tree.oj, newdata = OJ[-train, ])[,1] > 0.5, labels = c("CH", "MM"))
yhat.tr.oj
## 1 3 4 7 11 12 19 20 21 23 25 28 30 33 35
## MM MM CH MM MM MM MM MM MM MM MM MM MM MM MM
## 36 37 42 45 54 60 65 74 75 76 79 82 84 86 90
## MM MM MM MM MM MM MM MM MM MM MM MM MM MM MM
## 94 102 107 113 119 121 128 131 134 139 146 148 151 158 167
## MM MM MM MM MM MM MM MM MM MM MM CH CH MM MM
## 178 181 184 189 190 201 204 207 208 244 248 251 252 269 271
## MM MM MM MM MM MM MM MM MM MM MM MM MM MM MM
## 272 273 275 276 279 281 286 289 307 313 317 318 319 320 323
## CH CH CH CH CH CH CH CH CH MM MM MM MM MM MM
## 329 332 337 339 350 353 356 359 360 361 364 366 373 375 376
## MM CH CH CH MM MM CH MM MM MM MM MM MM MM MM
## 380 385 391 403 408 410 418 420 422 423 428 434 435 437 439
## CH CH CH CH MM MM CH CH CH MM MM MM CH CH MM
## 443 448 451 456 458 462 466 472 474 478 484 496 497 501 503
## MM MM MM MM MM CH MM MM MM MM MM MM CH CH MM
## 504 508 509 515 516 531 532 537 542 543 549 554 557 562 571
## MM MM MM MM MM MM MM MM MM MM CH CH CH CH CH
## 572 574 578 584 587 589 591 592 602 615 616 617 618 620 631
## CH CH MM MM MM MM MM MM MM MM MM MM MM MM MM
## 637 641 642 646 650 659 661 663 677 679 680 684 685 695 697
## MM MM MM MM MM MM MM MM MM MM CH CH CH CH CH
## 698 707 712 713 724 728 736 741 743 751 760 762 777 779 781
## CH CH CH CH CH MM CH MM MM MM MM MM MM CH MM
## 784 790 794 799 806 808 809 819 823 825 835 838 840 841 844
## CH MM CH MM MM MM MM MM MM MM CH CH CH CH MM
## 845 850 854 861 863 866 867 869 872 873 874 875 881 882 885
## CH MM MM MM MM MM CH MM MM MM MM MM MM MM MM
## 886 887 891 892 893 895 898 904 910 911 913 917 918 920 924
## MM MM MM MM MM MM MM MM MM MM MM MM MM MM MM
## 925 927 928 929 931 935 936 937 940 941 942 945 946 949 954
## CH CH CH CH CH CH CH CH CH CH CH CH CH CH CH
## 956 974 983 984 987 992 993 994 995 996 998 1002 1012 1013 1014
## CH CH MM MM MM CH CH MM MM CH MM MM MM MM MM
## 1016 1023 1025 1035 1036 1039 1042 1047 1051 1053 1059 1060 1062 1065 1070
## MM MM MM MM CH MM MM MM MM MM MM MM MM MM MM
## Levels: CH MM
rpart.plot(tree.oj)
# Because of the label swap noted above, the confusion matrix below is inverted:
# the tree's actual test accuracy is about 1 - 0.2185 = 0.78.
confusionMatrix(yhat.tr.oj, OJ$Purchase[-train])
## Confusion Matrix and Statistics
##
## Reference
## Prediction CH MM
## CH 12 64
## MM 147 47
##
## Accuracy : 0.2185
## 95% CI : (0.1707, 0.2726)
## No Information Rate : 0.5889
## P-Value [Acc > NIR] : 1
##
## Kappa : -0.4503
## Mcnemar's Test P-Value : 1.651e-08
##
## Sensitivity : 0.07547
## Specificity : 0.42342
## Pos Pred Value : 0.15789
## Neg Pred Value : 0.24227
## Prevalence : 0.58889
## Detection Rate : 0.04444
## Detection Prevalence : 0.28148
## Balanced Accuracy : 0.24945
##
## 'Positive' Class : CH
##
# printcp(tree.oj)