8.1 Decision Tree Basics

Trees stratify or segment the predictor space into a number of simple regions. To make a prediction for a new observation, we use the mean (for regression) or the most common class (for classification) of the training observations in the region to which it belongs.

Regression Trees

The predictor space is segmented into regions that correspond to the terminal nodes or leaves of the tree. The points along the tree where the space is split are internal nodes, and the segments that connect the nodes are called branches.

Prediction via Stratification of the Feature Space

Building a regression tree takes two steps:

  1. Divide the predictor space \(X_1, X_2, \ldots, X_p\) into \(J\) distinct and non-overlapping regions \(R_1, R_2, \ldots, R_J\).
  2. For every observation that falls into region \(R_j\), make the same prediction: the mean of the response values for the training observations in \(R_j\).

For example, baseball players with less than 4.5 years of experience have an average salary (in thousands of dollars) of about 164.0, versus about 601.8 for players with 4.5 or more years of experience. The more experienced players can be further divided by their number of hits. The tree is built from the top down in a greedy fashion, meaning it takes the best binary split at the current node without regard to later splits. To perform recursive binary splitting, first select the predictor \(X_j\) and the cutpoint \(s\) defining the regions

\[ R_1(j, s) = \{ X \mid X_j < s \} \quad \text{and} \quad R_2(j, s) = \{ X \mid X_j \geq s \} \]

that minimize

\[ \text{RSS} = \sum_{i:\, x_i \in R_1(j, s)} (y_i - \hat y_{R_1})^2 + \sum_{i:\, x_i \in R_2(j, s)} (y_i - \hat y_{R_2})^2, \]

where \(\hat y_{R_1}\) and \(\hat y_{R_2}\) are the mean responses for the training observations in \(R_1(j, s)\) and \(R_2(j, s)\). Next, the process is repeated, splitting one of the previously created regions in the same way, and it continues until a stopping criterion is reached (maximum depth, minimum number of observations per node, etc.).
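
To make the greedy search concrete, here is a minimal sketch that scans every predictor and every candidate cutpoint, computes the two-region RSS, and keeps the best pair \((j, s)\). It uses a few columns of the Boston data from the lab below; rpart performs this search, plus the recursion, internally.

library(MASS)   # Boston data

best_split <- function(X, y) {
  best <- list(rss = Inf)
  for (j in names(X)) {
    for (s in sort(unique(X[[j]]))) {
      left  <- y[X[[j]] <  s]
      right <- y[X[[j]] >= s]
      if (length(left) == 0 || length(right) == 0) next
      rss <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
      if (rss < best$rss) best <- list(var = j, cut = s, rss = rss)
    }
  }
  best
}

best_split(Boston[, c("lstat", "rm", "dis")], Boston$medv)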

Tree Pruning

A deep tree has a lot of variance and will probably overfit the data; a smaller tree has lower variance at the cost of increased bias. Simply stopping the splitting early (for example, only splitting when the drop in RSS exceeds some threshold) is short-sighted, because a seemingly bad split might be followed by a very good split. A better strategy is to grow a very large tree \(T_0\) and then prune it back to obtain a subtree. Ideally the chosen subtree has the lowest test error, but cross-validating every possible subtree is far too expensive, so we use a method called cost complexity pruning, also known as weakest link pruning, to select a small sequence of candidate subtrees.

Algorithm for Building a Regression Tree

  1. Use recursive binary splitting to grow a large tree on the training data.
  2. Apply cost complexity pruning to the large tree to obtain a sequence of best subtrees as a function of \(\alpha\).
  3. Use K-fold cross-validation to choose \(\alpha\). For each fold \(k\): (a) repeat Steps 1 and 2 on all but the \(k\)th fold of the training data; (b) evaluate the MSE on the left-out \(k\)th fold as a function of \(\alpha\). Average the results over the folds and choose the \(\alpha\) that minimizes the average error.
  4. Return the subtree from Step 2 that corresponds to the chosen value of \(\alpha\).

Rather than considering every subtree, we consider a sequence of trees indexed by a nonnegative tuning parameter \(\alpha\). For each value of \(\alpha\) there corresponds a subtree \(T \subset T_0\) that minimizes

\[ \sum_{m = 1}^{\left| T \right|} \sum_{i:\, x_i \in R_m} (y_i - \hat y_{R_m})^2 + \alpha \left| T \right|, \]

where \(\left| T \right|\) is the number of terminal nodes in tree \(T\), \(R_m\) is the rectangle corresponding to the \(m\)th terminal node, and \(\hat y_{R_m}\) is the predicted response (the mean of the training observations) in \(R_m\). The tuning parameter \(\alpha\) controls the bias-variance trade-off: with \(\alpha = 0\) the subtree is simply the full tree \(T_0\), and as \(\alpha\) increases there is a penalty for having many terminal nodes, much like the lasso penalty in linear regression.
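
In rpart (used in the lab below), the complexity parameter cp plays the role of \(\alpha\), scaled by the error of the root node, and prune() walks back along this cost-complexity sequence. A minimal sketch on the Boston data:

# Grow a deliberately large tree, inspect the cost-complexity sequence,
# then prune back to the subtree with the lowest cross-validated error
library(rpart)
library(MASS)   # Boston data

set.seed(1)
big.tree <- rpart(medv ~ ., data = Boston,
                  control = rpart.control(cp = 0, xval = 10))
head(big.tree$cptable)        # columns: CP, nsplit, rel error, xerror, xstd

best.cp <- big.tree$cptable[which.min(big.tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(big.tree, cp = best.cp)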

Classification Trees pg. 311

Classification trees are very similar to regression trees but they predict a qualitative response. Duh. Instead of the mean response, the prediction for a region is the most commonly occurring class among the training observations in that region, as in nearest neighbors. Instead of using RSS to choose binary splits, a natural criterion is the classification error rate \(E = 1 - \text{max}_k (\hat p_{mk})\), where \(\hat p_{mk}\) represents the proportion of training observations in the \(m\)th region that are from the \(k\)th class. In practice, however, classification error is not sensitive enough for tree-growing, so we use other measures.

\[ G = \sum_{k = 1}^K \hat p_{mk} (1 - \hat p_{mk}) \qquad \text{Gini Index} \]

The Gini index is a measure of the total variance across the \(K\) classes. A small value indicates that a node contains predominantly observations from a single class; the node is pure.

\[ D = - \sum_{k = 1}^K \hat p_{mk} \log \hat p_{mk} \qquad \text{Entropy} \]

Since \(0 \leq \hat p_{mk} \leq 1\), it follows that \(0 \leq - \hat p_{mk} \log \hat p_{mk}\). One can show that the entropy takes on a value near zero if the \(\hat p_{mk}\)'s are all near zero or one, meaning the entropy will be small if the \(m\)th node is pure.

Trees versus Linear Models

Which is better? It depends on the problem at hand. If the response is well approximated by a linear model, then linear regression it is! If instead the data have highly non-linear and complex relationships between the features and the response, then decision trees may outperform classical approaches.

Advantages and Disadvantages of Trees

  • Trees are very easy to explain! And they can be displayed graphically.
  • Some believe trees more closely mirror human decision-making than other approaches.
  • Trees easily handle qualitative predictors without creating dummy variables.
  • Unfortunately, trees generally do not have the same level of predictive accuracy as other methods.
  • Additionally, trees can be non-robust: small changes in the data can cause large changes in the final estimated tree.

8.2 Bagging, Random Forests, Boosting pg.316

8.2.1 Bagging

Decision trees suffer from high variance: splitting the training data into two random halves and fitting a tree to each will give very different fits, whereas a low-variance procedure yields similar results when applied repeatedly to distinct datasets. Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method. Given a set of \(n\) independent observations \(Z_1, \ldots, Z_n\), each with variance \(\sigma^2\), the variance of the mean \(\bar Z\) of the observations is \(\sigma^2 / n\); in other words, averaging a set of observations reduces variance. Ideally we could calculate \(\hat f^1 (x), \hat f^2 (x), \ldots, \hat f^B (x)\) using \(B\) separate training sets and average them. In practice we do not have multiple training sets, so we bootstrap: draw \(B\) bootstrap samples from the single training set, fit a model \(\hat f^{*b}(x)\) to each, and average them to obtain the single low-variance model

\[ \hat f_{\text{bag}} (x) = \frac{1}{B} \sum_{b = 1}^B \hat f^{*b} (x). \]

For each bootstrap sample we build a deep tree, one that has high variance but low bias; averaging these \(B\) trees together reduces the variance.
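
A hand-rolled version of bagging, just to make the averaging explicit. (The lab below does this with randomForest and mtry set to the number of predictors, which is the standard way to bag.)

library(rpart)
library(MASS)   # Boston data

set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston) / 2)
B     <- 100

# Grow B deep (low-bias, high-variance) trees on bootstrap samples of the
# training data, collecting each tree's predictions on the test set
preds <- sapply(1:B, function(b) {
  boot <- sample(train, replace = TRUE)
  fit  <- rpart(medv ~ ., data = Boston[boot, ],
                control = rpart.control(cp = 0, minsplit = 5))
  predict(fit, newdata = Boston[-train, ])
})

# Average the B sets of predictions to get the bagged prediction
yhat.bagged <- rowMeans(preds)
mean((yhat.bagged - Boston$medv[-train])^2)   # test MSE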

Out-of-Bag Error Estimation pg.318

There is a convenient way to estimate the test error of a bagged model. On average, each bagged tree uses about two-thirds of the observations (see Ch5 exercise 2). The remaining one-third, left out of that bootstrap sample, are the out-of-bag (OOB) observations. We can predict the response for the \(i\)th observation using each of the trees in which that observation was OOB, yielding around \(B/3\) predictions for it; averaging these gives a single OOB prediction for observation \(i\). Computing the MSE of the OOB predictions over all \(n\) observations gives the overall OOB error estimate.
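
With randomForest (used in the lab below) this bookkeeping is automatic; a minimal sketch:

# predict() with no newdata returns the OOB prediction for each training
# observation; for regression, the mse component tracks OOB MSE per tree
library(randomForest)
library(MASS)   # Boston data

set.seed(1)
rf.fit   <- randomForest(medv ~ ., data = Boston, ntree = 500)
oob.pred <- predict(rf.fit)                    # OOB predictions
mean((oob.pred - Boston$medv)^2)               # overall OOB MSE
tail(rf.fit$mse, 1)                            # same estimate, at the final tree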

Variable Importance Measures

Bagging many, many trees together makes the resulting model much harder to interpret than a single tree. We can still get an overall summary of the importance of each predictor using the RSS (for regression trees) or the Gini index (for classification trees): record the total amount that the RSS or Gini index is decreased due to splits over a given predictor, averaged over all \(B\) trees. Large values indicate important predictors.

8.2.2 Random Forests

Like bagging, random forests build a number of decision trees on bootstrapped training samples. But when building these decision trees, each time a split is considered, a random sample of \(m\) predictors is chosen as split candidates from the full set of \(p\) predictors; a fresh sample of \(m\) predictors is taken at each split, and typically we choose \(m \approx \sqrt p\). Not even considering a majority of the available predictors sounds crazy, but it has a clever rationale. Suppose the dataset has one very strong predictor. Then the collection of bagged trees will mostly place that strong predictor in the top split, so the bagged trees will look quite similar and their predictions will be highly correlated, and averaging highly correlated predictions does not reduce the variance as much as averaging uncorrelated ones. By considering only a subset of the predictors at each split, on average \((p - m) / p\) of the splits do not even consider the strong predictor, which decorrelates the trees.
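
A quick way to see the effect of \(m\) is to refit with a few values of mtry and compare the test MSE (a sketch on the Boston data from the lab below; mtry = p is just bagging):

library(randomForest)
library(MASS)   # Boston data

set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston) / 2)
p     <- ncol(Boston) - 1   # number of predictors

for (m in c(p, floor(p / 3), floor(sqrt(p)))) {
  rf   <- randomForest(medv ~ ., data = Boston, subset = train, mtry = m)
  yhat <- predict(rf, newdata = Boston[-train, ])
  cat("mtry =", m, "  test MSE =", mean((yhat - Boston$medv[-train])^2), "\n")
}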

8.2.3 Boosting

Boosting is similar to bagging in that it is a general approach that can be applied to many statistical learning methods for regression or classification.

  • Bagging: bootstrap the original training data, fit a model to each bootstrap sample, then average all the models together.
  • Boosting: similar, but the trees are grown sequentially: each tree is grown using information from the previously grown trees. Instead of bootstrapping, each tree is fit to a modified version of the original dataset (the current residuals).

  • Algorithm for Boosting Regression Trees (a hand-rolled sketch in R follows the steps)
  1. Set \(\hat f (x) = 0\) and \(r_i = y_i\) for all \(i\) in the training set.
  2. For \(b = 1, 2, \ldots, B\), repeat:
     (a) Fit a tree \(\hat f^b\) with \(d\) splits (\(d+1\) terminal nodes) to the training data \((X, r)\).
     (b) Update \(\hat f\) by adding a shrunken version of the new tree: \(\hat f (x) \leftarrow \hat f (x) + \lambda \hat f^b (x)\).
     (c) Update the residuals: \(r_i \leftarrow r_i - \lambda \hat f^b (x_i)\).
  3. Output the boosted model: \(\hat f (x) = \sum_{b = 1}^B \lambda \hat f^b (x)\).
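
Below is a hand-rolled version of this algorithm using rpart stumps, just a sketch for intuition on the Boston data; the lab below uses the gbm package instead.

library(rpart)
library(MASS)   # Boston data

set.seed(1)
train  <- sample(1:nrow(Boston), nrow(Boston) / 2)
B      <- 1000    # number of trees
lambda <- 0.01    # shrinkage
d      <- 1       # splits per tree (stumps)

boston.train   <- Boston[train, ]
boston.train$r <- boston.train$medv            # residuals start at y
yhat.test      <- rep(0, nrow(Boston) - length(train))

for (b in 1:B) {
  # fit a small tree to the current residuals
  fit <- rpart(r ~ . - medv, data = boston.train,
               control = rpart.control(maxdepth = d, cp = 0))
  # shrink its contribution, update residuals and the running test prediction
  boston.train$r <- boston.train$r - lambda * predict(fit)
  yhat.test      <- yhat.test + lambda * predict(fit, newdata = Boston[-train, ])
}
mean((yhat.test - Boston$medv[-train])^2)      # test MSE of the boosted ensemble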

What’s the idea? Unlike fitting a single large decision tree to the data, which amounts to fitting the data hard and potentially overfitting, the boosting approach instead learns slowly. Boosting has three tuning parameters:

  1. The number of trees \(B\). Too large a \(B\) can overfit, so use cross-validation to select \(B\).
  2. The shrinkage parameter \(\lambda\), which controls the boosting learning rate; typical values are 0.01 or 0.001.
  3. The number of splits \(d\) in each tree. Often \(d = 1\) works well, where each tree is a stump consisting of a single split; in this case the boosted ensemble is fitting an additive model, since each term involves only a single variable.
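
For the first tuning parameter, gbm (used in the lab below) can do the cross-validation itself: pass cv.folds and let gbm.perf() pick the number of trees. A sketch:

library(gbm)
library(MASS)   # Boston data

set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston) / 2)

boost.cv <- gbm(medv ~ ., data = Boston[train, ],
                distribution = "gaussian",
                n.trees = 5000,
                interaction.depth = 1,   # stumps, d = 1
                shrinkage = 0.01,
                cv.folds = 5)
best.B <- gbm.perf(boost.cv, method = "cv")   # CV-chosen number of trees
yhat   <- predict(boost.cv, newdata = Boston[-train, ], n.trees = best.B)
mean((yhat - Boston$medv[-train])^2)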

Lab 8: Decision Trees

library(ISLR)        # Carseats and OJ data
library(tidyverse)   # if_else(), tibble(), gather(), ggplot()
library(rpart)
library(rpart.plot)

Carseats$High <- as.factor(if_else(Carseats$Sales <= 8, "No", "Yes"))
set.seed(2)

tree.carseats <- rpart(High ~ . - Sales, data = Carseats, method = "class")

rpart.plot(tree.carseats)

plotcp(tree.carseats)

printcp(tree.carseats)
## 
## Classification tree:
## rpart(formula = High ~ . - Sales, data = Carseats, method = "class")
## 
## Variables actually used in tree construction:
## [1] Advertising Age         CompPrice   Income      Price       ShelveLoc  
## 
## Root node error: 164/400 = 0.41
## 
## n= 400 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.286585      0   1.00000 1.00000 0.059980
## 2 0.109756      1   0.71341 0.71341 0.055477
## 3 0.045732      2   0.60366 0.63415 0.053492
## 4 0.036585      4   0.51220 0.61585 0.052981
## 5 0.027439      5   0.47561 0.56707 0.051515
## 6 0.024390      7   0.42073 0.56707 0.051515
## 7 0.012195      8   0.39634 0.55488 0.051124
## 8 0.010000     10   0.37195 0.53049 0.050310
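
Based on the cptable above, we could prune back to the subtree with the smallest cross-validated error (a quick sketch using rpart's prune()):

# Pick the CP with the lowest cross-validated error (xerror) and prune to it
best.cp         <- tree.carseats$cptable[which.min(tree.carseats$cptable[, "xerror"]), "CP"]
carseats.pruned <- prune(tree.carseats, cp = best.cp)
rpart.plot(carseats.pruned)
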
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston) / 2)

tree.boston <- rpart(medv ~ ., data = Boston, 
                     method = "anova", subset = train)
yhat <- predict(tree.boston, newdata = Boston[-train,])
boston.test <- Boston$medv[-train]
plot(yhat, boston.test)

mean((yhat - boston.test)^2)
## [1] 25.35825
printcp(tree.boston)
## 
## Regression tree:
## rpart(formula = medv ~ ., data = Boston, subset = train, method = "anova")
## 
## Variables actually used in tree construction:
## [1] dis   lstat rm   
## 
## Root node error: 20895/253 = 82.588
## 
## n= 253 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.462576      0   1.00000 1.01273 0.117728
## 2 0.204673      1   0.53742 0.56339 0.059699
## 3 0.074618      2   0.33275 0.34949 0.039945
## 4 0.039191      3   0.25813 0.29608 0.039976
## 5 0.032082      4   0.21894 0.31072 0.048076
## 6 0.021629      5   0.18686 0.29508 0.048053
## 7 0.011150      6   0.16523 0.26146 0.042604
## 8 0.010000      7   0.15408 0.25032 0.037014
rpart.plot(tree.boston)

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
set.seed(1)

bag.boston <- randomForest(medv ~ ., data = Boston,
                           subset = train, mtry = 13, importance = T)

yhat.bag <- predict(bag.boston, newdata = Boston[-train,])
plot(yhat.bag, boston.test-yhat.bag)

mean((yhat.bag - boston.test)^2)
## [1] 13.50808
rf.boston <- randomForest(medv ~ ., data = Boston,
                          subset = train, importance = T)
importance(rf.boston)
##           %IncMSE IncNodePurity
## crim    12.683368     1109.8339
## zn       2.848951      105.8980
## indus   11.437218     1477.5452
## chas     1.456523      111.5992
## nox     11.941250     1383.9662
## rm      27.947237     5369.1223
## age     11.762910      753.9417
## dis     14.082646     1414.5241
## rad      4.347546      153.0222
## tax     11.118294      751.6709
## ptratio 12.686562     1351.2312
## black    6.791716      428.7641
## lstat   25.621981     5981.0986
# %IncMSE: mean increase in OOB MSE when the values of that predictor are randomly permuted
# IncNodePurity: total decrease in node impurity (RSS for regression) from splits on that variable, averaged over all trees
varImpPlot(rf.boston)

library(gbm)
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:rpart':
## 
##     solder
## The following object is masked from 'package:caret':
## 
##     cluster
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.3
set.seed(1)

boost.boston <- gbm(medv ~ ., data = Boston[train, ], 
                    distribution = "gaussian", 
                    n.trees = 5000, 
                    interaction.depth = 4,
                    shrinkage = 0.01)
summary(boost.boston)

##             var    rel.inf
## lstat     lstat 38.4712337
## rm           rm 26.8997879
## dis         dis  9.6829733
## crim       crim  6.4257005
## black     black  4.2383213
## age         age  3.7465696
## nox         nox  3.4367222
## ptratio ptratio  2.6381676
## tax         tax  1.5671593
## indus     indus  1.5032801
## chas       chas  0.7598312
## rad         rad  0.5285132
## zn           zn  0.1017401
plot(boost.boston, i = "lstat")

yhat.boost <- predict(boost.boston, newdata = Boston[-train,], n.trees = 5000)
mean((yhat.boost - boston.test)^2)
## [1] 10.23422

Chapter 8 Exercises pg. 332

3. Compare the Gini index, classification error, and entropy as a function of \(\hat p_{m1}\).

df <- tibble(
  p1 = seq(0, 1, 0.01),
  p2 = 1 - p1,
  error = 1 - pmax(p1, p2),
  gini = p1 * (1 - p1) + p2 * (1 - p2),
  entropy = - p1 * log(p1) - p2 * log(p2)
)
df
## # A tibble: 101 x 5
##       p1    p2  error   gini  entropy
##    <dbl> <dbl>  <dbl>  <dbl>    <dbl>
##  1  0     1    0      0      NaN     
##  2  0.01  0.99 0.01   0.0198   0.0560
##  3  0.02  0.98 0.02   0.0392   0.0980
##  4  0.03  0.97 0.03   0.0582   0.135 
##  5  0.04  0.96 0.04   0.0768   0.168 
##  6  0.05  0.95 0.05   0.095    0.199 
##  7  0.06  0.94 0.06   0.113    0.227 
##  8  0.07  0.93 0.07   0.130    0.254 
##  9  0.08  0.92 0.0800 0.147    0.279 
## 10  0.09  0.91 0.0900 0.164    0.303 
## # ... with 91 more rows
df %>%
  gather(type, value, -p1, -p2) %>%
  ggplot(aes(x = p1, y = value, color = type)) +
  geom_line() +
  labs(title = "Binary Region classification errors",
       y = "Error Rate")
## Warning: Removed 2 rows containing missing values (geom_path).

  1. Describe building a regression tree.
  2. Select the predictor \(X_j\) and cutpoint \(s\) so that splitting into \(\{X \mid X_j < s\}\) and \(\{X \mid X_j \geq s\}\) gives the greatest reduction in RSS, searching over all predictors and all possible cutpoints.
  3. Repeat the process within each of the regions created in the previous step.
  4. Continue until a stopping point is reached.

set.seed(1)
train <- sample(1:nrow(Carseats), nrow(Carseats) * 0.8)
# High was derived from Sales at the top of the lab, so exclude it from the predictors
tree.carseats <- rpart(Sales ~ . - High, data = Carseats, subset = train)
rpart.plot(tree.carseats)

yhat <- predict(tree.carseats, newdata = Carseats[-train,])
mean((yhat - Carseats$Sales[-train])^2)
## [1] 1.946852
# printcp(tree.carseats) # 4 splits is the best

tree.carseats4 <- rpart(Sales ~ . - High, data = Carseats, cp = 0.015971, subset = train)
rpart.plot(tree.carseats4)

yhat4 <- predict(tree.carseats4, newdata = Carseats[-train,])
mean((yhat4 - Carseats$Sales[-train])^2)
## [1] 1.863894
# Test MSE reduced from 4.78 to 4.71

bag.carseat <- randomForest(Sales ~ . - High, data = Carseats, 
                            subset = train, 
                            mtry = ncol(Carseats) - 2,  # all predictors except Sales and High
                            importance = T)
yhat.bag <- predict(bag.carseat, newdata = Carseats[-train,])
mean((yhat.bag - Carseats$Sales[-train])^2)
## [1] 1.384816
# Bagging Test MSE reduced to 2.07
importance(bag.carseat)
##                   %IncMSE IncNodePurity
## CompPrice    1.936790e+01    141.280931
## Income      -3.081611e-02     72.238399
## Advertising  3.423359e+00     46.889299
## Population   1.684690e+00     82.547358
## Price        3.525324e+01    264.184965
## ShelveLoc    3.928732e+01    159.456659
## Age          6.124197e+00     95.946727
## Education    7.119851e-04     38.019156
## Urban       -2.673066e+00      9.052972
## US           9.605100e-01      5.957471
## High         1.048537e+02   1555.734490
rf.carseat <- randomForest(Sales ~ . - High, data = Carseats, 
                            subset = train, 
                            importance = T)
yhat.rf <- predict(rf.carseat, newdata = Carseats[-train,])
mean((yhat.rf - Carseats$Sales[-train])^2)
## [1] 1.443891
# Random Forest Test MSE reduced to 2.54
library(caret)
set.seed(1)
train <- sample(1:nrow(OJ), 800)

tree.oj <- rpart(Purchase ~ ., data = OJ, subset = train)
yhat.tr.oj <- predict(tree.oj, newdata = OJ[-train, ], type = "class")  # predicted class labels
yhat.tr.oj
##    1    3    4    7   11   12   19   20   21   23   25   28   30   33   35 
##   MM   MM   CH   MM   MM   MM   MM   MM   MM   MM   MM   MM   MM   MM   MM 
##   36   37   42   45   54   60   65   74   75   76   79   82   84   86   90 
##   MM   MM   MM   MM   MM   MM   MM   MM   MM   MM   MM   MM   MM   MM   MM 
##   94  102  107  113  119  121  128  131  134  139  146  148  151  158  167 
##   MM   MM   MM   MM   MM   MM   MM   MM   MM   MM   MM   CH   CH   MM   MM 
##  178  181  184  189  190  201  204  207  208  244  248  251  252  269  271 
##   MM   MM   MM   MM   MM   MM   MM   MM   MM   MM   MM   MM   MM   MM   MM 
##  272  273  275  276  279  281  286  289  307  313  317  318  319  320  323 
##   CH   CH   CH   CH   CH   CH   CH   CH   CH   MM   MM   MM   MM   MM   MM 
##  329  332  337  339  350  353  356  359  360  361  364  366  373  375  376 
##   MM   CH   CH   CH   MM   MM   CH   MM   MM   MM   MM   MM   MM   MM   MM 
##  380  385  391  403  408  410  418  420  422  423  428  434  435  437  439 
##   CH   CH   CH   CH   MM   MM   CH   CH   CH   MM   MM   MM   CH   CH   MM 
##  443  448  451  456  458  462  466  472  474  478  484  496  497  501  503 
##   MM   MM   MM   MM   MM   CH   MM   MM   MM   MM   MM   MM   CH   CH   MM 
##  504  508  509  515  516  531  532  537  542  543  549  554  557  562  571 
##   MM   MM   MM   MM   MM   MM   MM   MM   MM   MM   CH   CH   CH   CH   CH 
##  572  574  578  584  587  589  591  592  602  615  616  617  618  620  631 
##   CH   CH   MM   MM   MM   MM   MM   MM   MM   MM   MM   MM   MM   MM   MM 
##  637  641  642  646  650  659  661  663  677  679  680  684  685  695  697 
##   MM   MM   MM   MM   MM   MM   MM   MM   MM   MM   CH   CH   CH   CH   CH 
##  698  707  712  713  724  728  736  741  743  751  760  762  777  779  781 
##   CH   CH   CH   CH   CH   MM   CH   MM   MM   MM   MM   MM   MM   CH   MM 
##  784  790  794  799  806  808  809  819  823  825  835  838  840  841  844 
##   CH   MM   CH   MM   MM   MM   MM   MM   MM   MM   CH   CH   CH   CH   MM 
##  845  850  854  861  863  866  867  869  872  873  874  875  881  882  885 
##   CH   MM   MM   MM   MM   MM   CH   MM   MM   MM   MM   MM   MM   MM   MM 
##  886  887  891  892  893  895  898  904  910  911  913  917  918  920  924 
##   MM   MM   MM   MM   MM   MM   MM   MM   MM   MM   MM   MM   MM   MM   MM 
##  925  927  928  929  931  935  936  937  940  941  942  945  946  949  954 
##   CH   CH   CH   CH   CH   CH   CH   CH   CH   CH   CH   CH   CH   CH   CH 
##  956  974  983  984  987  992  993  994  995  996  998 1002 1012 1013 1014 
##   CH   CH   MM   MM   MM   CH   CH   MM   MM   CH   MM   MM   MM   MM   MM 
## 1016 1023 1025 1035 1036 1039 1042 1047 1051 1053 1059 1060 1062 1065 1070 
##   MM   MM   MM   MM   CH   MM   MM   MM   MM   MM   MM   MM   MM   MM   MM 
## Levels: CH MM
rpart.plot(tree.oj)

confusionMatrix(yhat.tr.oj, OJ$Purchase[-train])
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  CH  MM
##         CH  12  64
##         MM 147  47
##                                           
##                Accuracy : 0.2185          
##                  95% CI : (0.1707, 0.2726)
##     No Information Rate : 0.5889          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : -0.4503         
##  Mcnemar's Test P-Value : 1.651e-08       
##                                           
##             Sensitivity : 0.07547         
##             Specificity : 0.42342         
##          Pos Pred Value : 0.15789         
##          Neg Pred Value : 0.24227         
##              Prevalence : 0.58889         
##          Detection Rate : 0.04444         
##    Detection Prevalence : 0.28148         
##       Balanced Accuracy : 0.24945         
##                                           
##        'Positive' Class : CH              
## 
# printcp(tree.oj)