NOTE: There are no official solutions for these questions. These are my solutions and could be incorrect. If you spot any mistakes/inconsistencies, please contact me or reach out via LinkedIn.

Some of the figures in this presentation are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani

library(tidyverse)
library(MASS) # Boston data
library(ISLR) # Carseats data
library(randomForest) # random forests
library(tree) # trees
library(caret)
library(gridExtra)
library(gbm)
library(gam)

select <- dplyr::select

RNGkind(kind = "Mersenne-Twister", 
        normal.kind = "Inversion", 
        sample.kind = "Rejection") # diff versions of R/RMD using different sampling types was annoying me

# custom summaryFunction for caret::train, with MSE & RSS added to the default metrics:
custom_regression_metrics <- function (data, lev = NULL, model = NULL) {
  c(RMSE = sqrt(mean((data$obs-data$pred)^2)),
    Rsquared = summary(lm(pred ~ obs, data))$r.squared,
    MAE = mean(abs(data$obs-data$pred)), 
    MSE = mean((data$obs-data$pred)^2),
    RSS = sum((data$obs-data$pred)^2))
}

theme_set(theme_light())
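
As a quick usage note, below is a minimal, hypothetical example (my own, and not used elsewhere in this document) of how this summary function would be passed to caret::train() via trainControl(), so that models can be selected on one of the added metrics:

example_fit <- train(medv ~ ., data = Boston, method = "lm",
                     trControl = trainControl(method = "cv", number = 10,
                                              summaryFunction = custom_regression_metrics),
                     metric = "MSE", maximize = FALSE)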




1. Decision Tree Example (Classification)

Q: Draw an example (of your own invention) of a partition of two-dimensional feature space that could result from recursive binary splitting. Your example should contain at least six regions. Draw a decision tree corresponding to this partition. Be sure to label all aspects of your figures, including the regions \(R_1, R_2, \dots\), the cutpoints \(t_1, t_2, \dots\), and so forth.

Hint: Your result should look something like Figures 8.1 and 8.2.

A:

I created the example below that illustrates recursive binary splitting in the context of growing a classification tree.

Consider a situation where a number of identical plants were given different volumes of water (\(X_1\)) and fertilizer (\(X_2\)), and their growth was measured over some subsequent time-frame. This growth is captured by the categorical target variable \(Y\), which takes the value \(Yes\) if the plant grew more than 1cm, and \(No\) otherwise.

Below is a rough example of the two-dimensional feature space that might result from growing a decision tree predicting \(Y\) using \(X_1\) and \(X_2\):

Here is the corresponding decision tree:

Notice how high levels of fertilizer are toxic regardless of the volume of water (and therefore limit growth), but that generally a balance between fertilizer and water is most productive for growth. That is to say, the degree to which fertilizer (\(X_2\)) impacts growth (\(Y\)) depends on the level of water (\(X_1\)).

The fact that the classification tree captures the above relationship (without the need to specify an interaction term) nicely illustrates how decision trees inherently model variable interactions as a consequence of their structure (although sometimes explicitly specifying interaction terms can improve predictive power). This property extends to ensembles of decision trees of sufficient depth.
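
As a small optional illustration of this point, here is a quick simulation of my own invention (the variable names, thresholds and sample size below are arbitrary choices, not from the book): the response depends on \(X_1\) and \(X_2\) jointly, no interaction term is specified, and a single classification tree still recovers the interaction structure through its nested splits.

set.seed(42)
n <- 500
x1 <- runif(n, 0, 10) # 'water'
x2 <- runif(n, 0, 10) # 'fertilizer'

# growth ('Yes') only occurs for moderate fertilizer combined with sufficient water:
y <- factor(ifelse(x2 > 2 & x2 < 7 & x1 > 3, "Yes", "No"))

sim_tree <- tree(y ~ x1 + x2, data = data.frame(x1, x2, y))
plot(sim_tree)
text(sim_tree, pretty = 0)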




2. Boosted Decision Stumps

Q: It is mentioned in Section 8.2.3 that boosting using depth-one trees (or stumps) leads to an additive model: that is, a model of the form

\[f(X) = \sum_{j = 1}^p f_j(X_j)\]

Explain why this is the case. You can begin with (8.12) in Algorithm 8.2.

A:

Algorithm 8.2 and the remaining text on boosting are focused on boosting regression trees, so I presume this question is focused on this too.

Firstly, recall that a decision tree with \(M\) terminal nodes (\(M-1\) splits) can be expressed as:

\[f(X) = \sum_{m = 1}^M c_m \cdot 1_{(X \in R_m)}\]

where, in the regression tree context, \(c_m\) is the mean response value for the observations in region \(m\).

A stump is a tree with just \(M = 2\) terminal nodes (one split), so if we have a decision stump in which a single split occurs on the variable \(X_j\) (at value \(s\)), we can simplify this representation:

\[\begin{align} f(X) & = c_1 \cdot 1_{(X \in R_1)} + c_2 \cdot 1_{(X \in R_2)} \\ & = \begin{cases} c_1 && \text{if } X_j < s\\ c_2 & & \text{if } X_j \ge s \end{cases} \\ & = c_1 \cdot I(X_j < s) + c_2 \cdot I(X_j \ge s) \end{align}\]


With boosted stumps we are not dealing with a single regression stump but with \(B\) of them, so I will use the following notation: the \(b\)th stump splits on the variable \(X_{j_b}\) at the cutpoint \(s_b\), and predicts \(\alpha_b\) and \(\beta_b\) in the two resulting regions.

For \(b = 1, 2, \dots, B\), the stumps therefore take the form:

\[\hat{f}^b(X) = \alpha_b \cdot I(X_{j_b} < s_b) + \beta_b \cdot I(X_{j_b} \ge s_b)\]


Proceeding through Algorithm 8.2, we first set \(\hat{f}(X) = 0\) and \(r_i = y_i\) (\(\forall i \in \{1, 2, \dots, n\}\)).


\(b = 1\):

\[\hat{f}^1(X) = \alpha_1 \cdot I(X_{j_1} < s_1) + \beta_1 \cdot I(X_{j_1} \ge s_1)\]

\[\begin{align} \hat{f}(X) & \leftarrow \hat{f}(X) + \lambda \hat{f}^1(X) \\ & = \lambda \hat{f}^1(X) \\ \end{align}\]

\[\begin{align} r_i & \leftarrow r_i - \lambda \hat{f}^1(X) \\ & = y_i - \lambda \hat{f}^1(X) \\ \end{align}\]


\(b = 2\):

\[\hat{f}^2(X) = \alpha_2 \cdot I(X_{j_2} < s_2) + \beta_2 \cdot I(X_{j_2} \ge s_2)\]

\[\begin{align} \hat{f}(X) & \leftarrow \hat{f}(X) + \lambda \hat{f}^2(X) \\ & = \lambda \hat{f}^1(X) + \lambda \hat{f}^2(X)\\ \end{align}\]

\[\begin{align} r_i & \leftarrow r_i - \lambda \hat{f}^2(X) \\ & = y_i - \lambda \hat{f}^1(X) - \lambda \hat{f}^2(X)\\ \end{align}\]


\(\vdots\)


\(b = B\):

\[\hat{f}^B(X) = \alpha_B \cdot I(X_{j_B} < s_B) + \beta_B \cdot I(X_{j_B} \ge s_B)\]

\[\begin{align} \hat{f}(X) & \leftarrow \hat{f}(X) + \lambda \hat{f}^B(X) \\ & = \lambda \hat{f}^1(X) + \lambda \hat{f}^2(X) + \dots + \lambda \hat{f}^B(X) \\ & = \sum_{b = 1}^B \lambda\hat{f}^b(X) \end{align}\]

\[\begin{align} r_i & \leftarrow r_i - \lambda \hat{f}^B(X) \\ & = y_i - \lambda \hat{f}^1(X) - \lambda \hat{f}^2(X) - \dots - \lambda \hat{f}^B(X) \\ & = y_i - \sum_{b = 1}^B \lambda\hat{f}^b(X) \end{align}\]


From the final stump, we have our final updated boosting model \(\hat{f}\), given by:

\[\begin{align} \hat{f}(X) & = \sum_{b = 1}^B \lambda\hat{f}^b(X) \\ & = \lambda \sum_{b = 1}^B \Big[ \alpha_b \cdot I(X_{j_b} < s_b) + \beta_b \cdot I(X_{j_b} \ge s_b) \Big] \\ \end{align}\]

Since \(X_{j_1}, X_{j_2}, \dots, X_{j_B} \in \{ X_1, X_2, \dots, X_p\}\), we can see that this estimate \(\hat{f}\) is indeed an additive model of the form \(f(X) = \sum_{j = 1}^p f_j(X_j)\).


Further Justification:

To show this in more concrete terms, imagine that \(B = 50\) and \(p = 10\), so \(X = (X_1, X_2, \dots, X_{10})\). Suppose that the boosting algorithm split on \(X_5\) when \(b \in \{ 1, 22, 47\}\).

In terms of the previous notation, this means that \(X_{j_1} = X_{j_{22}} = X_{j_{47}} = X_5\).

Hence, in \(f(X) = \sum_{j = 1}^{10} f_j(X_j)\), we would have:

\[\begin{align} f_5(X_5) & = \lambda \left[ \hat{f}^1(X) + \hat{f}^{22}(X) + \hat{f}^{47}(X) \right] \\ & = \lambda \Big[ \alpha_1 \cdot I(X_{j_1} < s_1) + \beta_1 \cdot I(X_{j_1} \ge s_1) \\ & \; \; \; + \alpha_{22} \cdot I(X_{j_{22}} < s_{22}) + \beta_{22} \cdot I(X_{j_{22}} \ge s_{22}) \\ & \; \; \; + \alpha_{47} \cdot I(X_{j_{47}} < s_{47}) + \beta_{47} \cdot I(X_{j_{47}} \ge s_{47}) \Big] \\ & = \lambda \Big[ \alpha_1 \cdot I(X_{5} < s_1) + \beta_1 \cdot I(X_{5} \ge s_1) \\ & \; \; \; + \alpha_{22} \cdot I(X_{5} < s_{22}) + \beta_{22} \cdot I(X_{5} \ge s_{22}) \\ & \; \; \; + \alpha_{47} \cdot I(X_{5} < s_{47}) + \beta_{47} \cdot I(X_{5} \ge s_{47}) \Big] \\ \end{align}\]

We could do exactly the same \(\forall j \in \{ 1, 2, \dots, 10\}\), separating out the stumps based on which variable from \(X_1, X_2, \dots, X_{10}\) they split on, combining them and expressing them as \(f_1(X_1), f_2(X_2), \dots, f_{10}(X_{10})\).
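
To make this concrete in code, below is a minimal sketch of Algorithm 8.2 restricted to stumps. This is entirely my own illustrative code (the helper find_best_stump() and the list structure are my own choices, not from the book or any package); the point is that each fitted stump depends on a single predictor, so grouping the fitted stumps by their split variable recovers the additive components \(f_j(X_j)\).

# greedily find the single best split (variable j, cutpoint s) for the current residuals r:
find_best_stump <- function(X, r) {
  best <- list(rss = Inf)
  for (j in seq_along(X)) {
    for (s in unique(X[[j]])) {
      left <- X[[j]] < s
      if (!any(left) || all(left)) next
      a <- mean(r[left])
      b <- mean(r[!left])
      rss <- sum((r[left] - a)^2) + sum((r[!left] - b)^2)
      if (rss < best$rss) best <- list(rss = rss, j = j, s = s, alpha = a, beta = b)
    }
  }
  best
}

# Algorithm 8.2 with depth-one trees (X a data frame of numeric predictors, y a numeric response):
boost_stumps <- function(X, y, B = 100, lambda = 0.01) {
  r <- y                          # 1. initialise the residuals r_i = y_i (and f_hat = 0)
  stumps <- vector("list", B)
  for (b in seq_len(B)) {
    st <- find_best_stump(X, r)                           # 2a. fit a stump f_hat^b to the residuals
    pred_b <- ifelse(X[[st$j]] < st$s, st$alpha, st$beta) # f_hat^b(X): a function of X_{j_b} only
    r <- r - lambda * pred_b                              # 2c. update the residuals
    stumps[[b]] <- st
  }
  stumps # the boosted model is f_hat(X) = sum_b lambda * f_hat^b(X)
}

# grouping the stumps by their split variable gives the additive components f_1(X_1), ..., f_p(X_p):
# split(stumps, sapply(stumps, `[[`, "j"))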


Other Comments:

I believe that if this question were extended past boosted stumps, this property would no longer hold. We would still be able to express the predictions of the boosted model as a sum of the predictions of many trees; this is clear from the final step of the algorithm, where the final \(\hat{f}\) takes the form \(\hat{f}(X) = \sum_{b = 1}^B \lambda\hat{f}^b(X)\). The problem arises in expressing these trees as separate functions of the predictors \(X_1, X_2, \dots, X_p\).

For stumps this was simple; each stump could be represented by \(\hat{f}^b(X) = \alpha_b \cdot I(X_{j_b} < s_b) + \beta_b \cdot I(X_{j_b} \ge s_b)\), which is clearly a function of one variable (\(X_{j_b}\)).

A stump that split on \(X_3\) at value \(5.2\) could be represented by:

\[c_1 \cdot I(X_{3} < 5.2) + c_2 \cdot I(X_{3} \ge 5.2)\]

If that stump were instead a tree with a further split (when \(X_{3} \ge 5.2\)) on \(X_{8}\) at the value \(16.3\), it would be represented by:

\[c_1 \cdot I(X_{3} < 5.2) + c_2 \cdot I(X_{3} \ge 5.2) \cdot I(X_{8} < 16.3) + c_3 \cdot I(X_{3} \ge 5.2) \cdot I(X_{8} \ge 16.3)\]

How could we separate this tree into separate functions of variables \(X_3\) and \(X_8\)? We couldn’t, so the additive form \(f(X) = \sum_{j = 1}^{p} f_j(X_j)\) would not be possible here.


Note that throughout the question I have focused on binary splits of numeric variables, but if \(X\) contains some (or all) categorical features this is not an issue.

For example, a stump whose single split on the categorical variable \(X_j\) sends the levels \(v_1, v_2, v_3\) to one region (and all other levels to the other) could be written as:

\[\begin{align} f(X) & = c_1 \cdot 1_{(X \in R_1)} + c_2 \cdot 1_{(X \in R_2)} \\ & = \begin{cases} c_1 && \text{if } X_j \in \{v_1, v_2, v_3\}\\ c_2 & & \text{if } X_j \notin \{v_1, v_2, v_3\} \end{cases} \\ & = c_1 \cdot I \left( X_j \in \{v_1, v_2, v_3\} \right) + c_2 \cdot I \left( X_j \notin \{v_1, v_2, v_3\} \right) \end{align}\]

This notation could continue throughout the proof and nothing would change (the final prediction \(\hat{f}\) would still be a sum of the individual stumps \(\hat{f}^b\), and these individual stumps and therefore the prediction \(\hat{f}\) would still be separable and able to be expressed as an additive model of the form \(f(X) = \sum_{j = 1}^{p} f_j(X_j)\)).




3. Gini Index, Classification Error & Entropy

Q: Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of \(\hat{p}_{m1}\). The \(x\)-axis should display \(\hat{p}_{m1}\), ranging from 0 to 1, and the \(y\)-axis should display the value of the Gini index, classification error, and entropy.

Hint: In a setting with two classes, \(\hat{p}_{m1} = 1 − \hat{p}_{m2}\). You could make this plot by hand, but it will be much easier to make in R.

A:

Recall from the book that \(\hat{p}_{mk}\) is the proportion of the training observations in the \(m\)th region that are from the \(k\)th class.

Classification Error:

\[\begin{align} E & = 1 - \underset{k}{\operatorname{max}} (\hat{p}_{mk}) \\ & = 1 - \max \{\hat{p}_{m1}, \hat{p}_{m2}\} \\ \end{align}\]

Gini Index:

\[\begin{align} G & = \sum_{k = 1}^K \hat{p}_{mk} (1 - \hat{p}_{mk}) \\ & = \hat{p}_{m1} (1 - \hat{p}_{m1}) + \hat{p}_{m2} (1 - \hat{p}_{m2}) \\ \end{align}\]

Entropy:

\[\begin{align} D & = - \sum_{k = 1}^K \hat{p}_{mk} \log \left( \hat{p}_{mk} \right) \\ & = - \hat{p}_{m1} \log \left( \hat{p}_{m1} \right) - \hat{p}_{m2} \log \left( \hat{p}_{m2} \right) \\ \end{align}\]

These quantities can be computed for any region (e.g. \(R_5\) in question 1). Below I plot all three as a function of \(\hat{p}_{m1}\):

prop_class_1 <- seq(0, 1, 0.001)
prop_class_2 <- 1 - prop_class_1


classification_error <- 1 - pmax(prop_class_1, prop_class_2)

gini <- prop_class_1*(1-prop_class_1) + prop_class_2*(1-prop_class_2)

entropy <- -prop_class_1*log(prop_class_1) - prop_class_2*log(prop_class_2) # NaN at the endpoints, since 0*log(0) evaluates to NaN in R (these two points are dropped from the plot)


data.frame(prop_class_1, prop_class_2, classification_error, gini, entropy) %>%
  pivot_longer(cols = c(classification_error, gini, entropy), names_to = "metric") %>%
  ggplot(aes(x = prop_class_1, y = value, col = factor(metric))) + 
  geom_line(size = 1) + 
  scale_y_continuous(breaks = seq(0, 1, 0.1), minor_breaks = NULL) + 
  scale_color_hue(labels = c("Classification Error", "Entropy", "Gini")) +
  labs(col = "Metric", 
       y = "Value", 
       x = "Proportion (of class '1')")




4. Decision Tree Example (Regression)

This question relates to the plots in Figure 8.12.


(a) Sketch the Tree (given the feature space partition)

Q: Sketch the tree corresponding to the partition of the predictor space illustrated in the left-hand panel of Figure 8.12. The numbers inside the boxes indicate the mean of \(Y\) within each region.

A:


(b) Sketch the Feature Space Partition (given the tree)

Q: Create a diagram similar to the left-hand panel of Figure 8.12, using the tree illustrated in the right-hand panel of the same figure. You should divide up the predictor space into the correct regions, and indicate the mean for each region.

A:

I’ve drawn \(X_1\) & \(X_2\) as continuous variables for practical reasons, although the tree splitting at the discrete values \(0, 1, 2\) makes it quite unlikely that they would actually be continuous:




5. Bagged Probabilities to Class Predictions

Q: Suppose we produce ten bootstrapped samples from a data set containing red and green classes. We then apply a classification tree to each bootstrapped sample and, for a specific value of \(X\), produce \(10\) estimates of \(P(\text{Class is Red}|X)\):

\[0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, \text{ and }0.75.\]

There are two common ways to combine these results together into a single class prediction. One is the majority vote approach discussed in this chapter. The second approach is to classify based on the average probability. In this example, what is the final classification under each of these two approaches?

A:

probs <- c(0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75)

Majority Vote:

sum(probs >= 0.5) # number of 'Red' predictions
## [1] 6
sum(probs < 0.5) # number of 'Green' predictions
## [1] 4
ifelse(sum(probs >= 0.5) > sum(probs < 0.5), "Red", "Green")
## [1] "Red"


Average Approach:

mean(probs) # average P(Red)
## [1] 0.45
ifelse(mean(probs) >= 0.5, "Red", "Green")
## [1] "Green"




6. Regression Tree Algorithm

Q: Provide a detailed explanation of the algorithm that is used to fit a regression tree.

A:

This is basically Algorithm 8.1 in the book. Much of the algorithm is actually about selecting the best individual tree through cost complexity pruning and cross-validation; only the recursive binary splitting part carries over to ensemble methods, since the trees in a boosted regression model will usually have their depth/number of splits pre-specified as a hyper-parameter, and the trees in a regression random forest tend to be grown to full depth.

I presume the question wants us to explain what is going on behind each step of the algorithm, so I will try to make the explanation as detailed as possible.


Step 1: Use recursive binary splitting to grow a large tree on the training data, stopping only when a stopping criterion is reached, e.g. when each terminal node has fewer than some minimum number of observations.

Recall that growing a regression tree involves dividing the predictor space into \(J\) distinct, non-overlapping regions \(R_1, R_2, \dots, R_J\), and predicting the mean of the training responses in whichever region an observation falls into.

How do we construct the regions \(R_1, R_2, \dots, R_J\)? We use recursive binary splitting: a top-down, greedy approach in which, at each step, every possible cutpoint \(s\) for every predictor \(X_j\) is tested, and the pair \((j, s)\) is chosen that defines the half-planes \(R_1(j, s) = \{X \mid X_j < s\}\) and \(R_2(j, s) = \{X \mid X_j \ge s\}\) minimizing:

\[\sum_{i: \; x_i \in R_1(j, s)} (y_i - \hat{y}_{R_1})^2 + \sum_{i: \; x_i \in R_2(j, s)} (y_i - \hat{y}_{R_2})^2\]

The process is then repeated within the regions already created (hence 'recursive'), splitting one existing region at each step, until a stopping criterion is reached.


Step 2: Apply cost complexity pruning to the large tree \(T_0\) (from the previous step), in order to obtain a sequence of best subtrees as a function of \(\alpha\).

Growing an individual tree fully will likely overfit the training data, so our solution is to prune the large tree \(T_0\) to obtain the subtree \(T\) with the best generalization error. Obtaining the cross-validation or test error for every possible subtree would again be computationally infeasible, so we use cost complexity pruning to generate a sequence of trees.

Cost complexity pruning uses a complexity parameter \(\alpha\), and says that \(\forall \alpha \ge 0, \exists T \subseteq T_0\) such that the following is minimized:

\[g(\alpha) = \sum_{m = 1}^{|T|} \sum_{i: \; x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T|\] where \(|T|\) is the number of terminal nodes of the tree \(T\).

This is the same \(cost + penalty\) format seen in the regularization methods throughout the book (the lasso, ridge regression, smoothing splines, etc.).

For a sequence of sufficiently different complexity parameters \(\alpha_{1}, \alpha_{2}, \dots, \alpha_{m} \ge 0\), we will hopefully obtain a corresponding sequence of sufficiently different sub-trees \(T_{\alpha_{1}}, T_{\alpha_{2}}, \dots, T_{\alpha_{m}} \subseteq T_0\), so we must have some way of selecting between them.


Step 3: Use \(K\)-fold cross-validation to choose \(\alpha\).

That is, divide the training observations into \(K\) folds. For each \(k = 1, 2, \dots, K\):

  1. Repeat Steps 1 and 2 (grow a regression tree and prune it based on the value of \(\alpha\)) on all but the \(k\)th fold of the training data
  2. Evaluate the MSE on the data in the left-out \(k\)th fold

Average the \(K\) estimates to obtain the cross-validation MSE for this value of \(\alpha\). Repeat these steps to get this estimate for each value of \(\alpha\). Select the \(\alpha_{min}\) that minimizes this cross-validation error:

data <- data.frame(alpha = c('$\\alpha_1$', '$\\alpha_2$', '$\\vdots$', '$\\alpha_{min}$', '$\\vdots$', '$\\alpha_m$'), 
                   mse = c('$MSE_{\\alpha_1}$', '$MSE_{\\alpha_2}$', '$\\vdots$', '$MSE_{\\alpha_{min}}$', '$\\vdots$', '$MSE_{\\alpha_m}$'), 
                   stringsAsFactors = F)

colnames(data) <- c('$\\alpha$', '$MSE$')

library(knitr)
library(kableExtra)

kable(data, row.names = F) %>%
  kable_styling(bootstrap_options = "striped", full_width = F) %>%
  row_spec(4, background = "lightgreen")
\(\alpha\) \(MSE\)
\(\alpha_1\) \(MSE_{\alpha_1}\)
\(\alpha_2\) \(MSE_{\alpha_2}\)
\(\vdots\) \(\vdots\)
\(\alpha_{min}\) \(MSE_{\alpha_{min}}\)
\(\vdots\) \(\vdots\)
\(\alpha_m\) \(MSE_{\alpha_m}\)


Step 4: Return the subtree \(T_{\alpha_{min}}\) from Step 2 that corresponds to the chosen value of \(\alpha\) (this is our final model).


I find it easier to think of steps 1-4 as just two steps (in a different order):

  1. Find the optimum complexity parameter (\(\alpha_{min}\)) through cross-validation
  2. Using this value of \(\alpha\), re-fit a regression tree on the full dataset to obtain the subtree (\(T_{\alpha_{min}}\)) corresponding to that complexity parameter (this is our final model)
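
For reference, here is a minimal sketch of how these steps look with the tree package. It assumes a training data frame called train with response medv (as created for the Boston data in question 7 below), and mirrors what is done there and in question 8(c):

full_tree <- tree(medv ~ ., data = train)             # Step 1: grow a large tree T_0
cv_out <- cv.tree(full_tree, K = 10)                  # Steps 2-3: CV error for the sequence of subtrees / alpha values
best_size <- cv_out$size[which.min(cv_out$dev)]       # number of terminal nodes with the lowest CV error
final_tree <- prune.tree(full_tree, best = best_size) # Step 4: the corresponding pruned subtree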




7. APPLIED: The Boston Dataset (Random Forests)

Q: In the lab, we applied random forests to the Boston data using mtry = 6 and using ntree = 25 and ntree = 500. Create a plot displaying the test error resulting from random forests on this data set for a more comprehensive range of values for mtry and ntree. You can model your plot after Figure 8.10. Describe the results obtained.

A:

I will define the same train/test datasets as in the labs:

set.seed(1, sample.kind = "Rounding") # allows us to match the same train/test split as in the labs

train_index <- sample(1:nrow(Boston), nrow(Boston) / 2)

train <- Boston[train_index, ]
test <- Boston[-train_index, ]

Here is the graph showing how the test MSE changes for each additional tree added to the forest (for different values of mtry):

test_MSE <- c()
OOB_MSE <- c()
mtry <- c()

i <- 1


set.seed(1)

for (Mtry in seq(1, 13, 2)) {
  
  rf_temp <- randomForest(y = train$medv, 
                          x = train[ ,-14], 
                          ytest = test$medv,
                          xtest = test[ ,-14],
                          mtry = Mtry, 
                          ntree = 500)
  
  test_MSE[[i]] <- rf_temp$test$mse
  
  OOB_MSE[[i]] <- rf_temp$mse
  
  mtry[[i]] <- rep(Mtry, 500)
  
  i <- i + 1
}


rf_summary <- data.frame(ntree = rep(1:500, length(seq(1, 13, 2))), 
                         mtry = unlist(mtry), 
                         test_MSE = unlist(test_MSE), 
                         OOB_MSE = unlist(OOB_MSE))


# test MSE:
ggplot(rf_summary, aes(x = ntree, y = test_MSE, col = factor(mtry))) + 
  geom_line() + 
  geom_vline(xintercept = 300, col = "grey55") +
  coord_cartesian(ylim = c(10, 25)) + 
  labs(x = "Number of Trees", 
       y = "Test MSE", 
       col = "mtry:")

As a reference, here is the same with the OOB MSE:

# OOB MSE:
ggplot(rf_summary, aes(x = ntree, y = OOB_MSE, col = factor(mtry))) + 
  geom_line() + 
  geom_vline(xintercept = 300, col = "grey55") +
  coord_cartesian(ylim = c(10, 25)) + 
  labs(x = "Number of Trees", 
       y = "OOB MSE", 
       col = "mtry:")

Focusing on the first graph, for all values of mtry the test MSE starts off very high when there are few trees (< 10), which makes sense to me (the trees are grown to full depth and will overfit without a sufficient number of them). The error quickly decreases as trees are added to the model.

We could probably say that the OOB MSE comfortably stabilized after a few hundred trees (say 300), and this appears to happen faster for the test MSE.

I think this is perhaps to be expected: all 300 of those trees are used to predict each test observation when calculating the test MSE, but each training observation is out-of-bag for only \(\approx\) 36.8% of the trees (\(\approx\) 110 of the 300), so its OOB prediction averages over far fewer trees (a consequence of bootstrap sampling). I verify below that each observation is 'in-bag' for \(\approx\) 63.2% of the trees (and therefore out-of-bag for the remaining \(\approx\) 36.8%):

set.seed(111)

rf_temp <- randomForest(y = train$medv, 
                        x = train[ ,-14], 
                        ntree = 500)


# rf_temp$oob.times = number of times each obs. is OOB
# 500 - rf_temp$oob.times = number of times each obs. is *not* OOB ('in-bag')
# (500 - rf_temp$oob.times) / 500 = proportion of times each obs. is *not* OOB ('in-bag')
# mean((500 - rf_temp$oob.times) / 500) = how often each observation is 'in-bag' (on average)
mean((500 - rf_temp$oob.times) / 500)
## [1] 0.6326166

It is quite interesting to me that the OOB MSE and test MSE are not highly correlated (perhaps even negatively correlated), with the only thing they agree on being that selecting a totally random variable at each split (mtry = 1) is a bad idea for generalization.

The Boston dataset has only 506 rows, so there is significant variability in the optimum value of mtry based on the train/test split and the random seed used. You can test this by re-running the code for this question without setting any seeds and observing how different the results are each time (for a large dataset they would hardly change).

rf_summary %>%
  filter(ntree == 300) %>%
  pivot_longer(cols = c(test_MSE, OOB_MSE), names_to = "metric") %>%
  ggplot(aes(x = mtry, y = value, col = factor(metric))) + 
  geom_line() + 
  geom_point() + 
  scale_y_continuous(breaks = seq(10, 30, 2)) + 
  scale_x_continuous(breaks = seq(1, 13, 2)) +
  scale_color_hue(labels = c("OOB MSE", "Test MSE")) +
  labs(col = "Metric:", 
       title = "Boston Dataset - Random Forest", 
       subtitle = "Optimum mtry using OOB/Test MSE (at ntree = 300)") + 
  theme(legend.position = "bottom", 
        axis.title.y = element_blank())

Nevertheless, for this particular split and seed combination, the test MSE was actually lowest for mtry = 7, although most values give good results.

The OOB error graph suggests that something close to simple bagged decision trees could be a good option (mtry = 13 uses all of the predictors), with the lowest OOB MSE occurring at mtry = 11.




8. APPLIED: The Carseats Dataset (Regression Trees, Random Forests)

In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.


(a) train/test Split

Q: Split the data set into a training set and a test set.

A:

Again, I’ll match the split used in the labs:

set.seed(2, sample.kind = "Rounding") # allows us to match the same train/test split as in the labs

train_index <- sample(1:nrow(Carseats), nrow(Carseats) / 2)

train <- Carseats[train_index, ] # 200
test <- Carseats[-train_index, ] # 200


(b) Regression Tree Plot

Q: Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?

A:

The plot can be seen below:

tree_model <- tree(Sales ~ ., train)

plot(tree_model)
text(tree_model, pretty = 0, cex = 0.7)

We could infer from this that ShelveLoc and Price are the two most important factors in predicting car seat sales, since they appear at the top of the tree (because they provided the best split of the data).

The tree has a total of 17 terminal nodes:

summary(tree_model)
## 
## Regression tree:
## tree(formula = Sales ~ ., data = train)
## Variables actually used in tree construction:
## [1] "ShelveLoc"   "Price"       "Age"         "Income"      "CompPrice"  
## [6] "Population"  "Advertising"
## Number of terminal nodes:  17 
## Residual mean deviance:  2.341 = 428.4 / 183 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -3.76700 -1.00900 -0.01558  0.00000  0.94900  3.58600

Predicting and evaluating the test MSE:

test_pred <- predict(tree_model, test)
mean((test_pred - test$Sales)^2)
## [1] 4.844991

For comparison, here is the baseline test MSE (using the average train$Sales as the prediction for all new test observations):

baseline_test_pred <- mean(train$Sales)
mean((baseline_test_pred - test$Sales)^2)
## [1] 8.085056




(c) Cross-Validation Pruning

Q: Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?

A:

The selected model with the lowest cross-validation error appears to be the model with 17 terminal nodes:

set.seed(3)

cv_tree_model <- cv.tree(tree_model, K = 10)

data.frame(n_leaves = cv_tree_model$size,
           CV_RSS = cv_tree_model$dev) %>%
  mutate(min_CV_RSS = as.numeric(min(CV_RSS) == CV_RSS)) %>%
  ggplot(aes(x = n_leaves, y = CV_RSS)) +
  geom_line(col = "grey55") +
  geom_point(size = 2, aes(col = factor(min_CV_RSS))) +
  scale_x_continuous(breaks = seq(1, 17, 2)) +
  scale_y_continuous(labels = scales::comma_format()) +
  scale_color_manual(values = c("deepskyblue3", "green")) +
  theme(legend.position = "none") +
  labs(title = "Carseats Dataset - Regression Tree",
       subtitle = "Selecting the complexity parameter with cross-validation",
       x = "Terminal Nodes",
       y = "CV RSS")

Since the optimal tree is the fully grown tree without any pruning, it should be clear that the test MSE is the same as in part b).

Here is how we would select a smaller sub-tree, although in this case I select the full tree with 17 terminal nodes (best = 17). The test MSE is indeed the same as in part b):

pruned_tree_model <- prune.tree(tree_model, best = 17)

test_pred <- predict(pruned_tree_model, test)
mean((test_pred - test$Sales)^2)
## [1] 4.844991


Notes on the tree package:

Something that seems unusual to me about the results of tree() is that, for regression, deviance apparently means the RSS. We would therefore expect the deviance from cv.tree() to be the cross-validation RSS, which would usually decrease as we increase the number of folds K (see my Ch.7 Q.9b solutions for why). That doesn’t happen here, as it does with other packages that perform cross-validation (e.g. caret with the RSS metric). This makes me think that the package is producing K out-of-fold RSS estimates and then summing them instead of averaging them. This isn’t really an issue, but it did strike me as quite odd.

Another thing that confuses me about this output is that the book says that, for cv.tree(), the output k corresponds to the complexity parameter \(\alpha\), where \(\alpha \ge 0\). How, then, does our selected model have an \(\alpha\) of -Inf?

data.frame(size = cv_tree_model$size, 
           dev = cv_tree_model$dev, 
           k = cv_tree_model$k)
##    size       dev         k
## 1    17  996.5117      -Inf
## 2    16  998.3865  16.25730
## 3    15 1008.3807  19.34054
## 4    14 1032.8005  21.45248
## 5    13 1032.2548  22.09786
## 6    12 1124.1770  31.59872
## 7    11 1130.3090  32.10994
## 8    10 1125.5635  35.48396
## 9     9 1142.3155  42.57770
## 10    7 1142.3155  43.06856
## 11    6 1165.2758  46.54477
## 12    5 1166.0119  47.65604
## 13    4 1147.0774  80.06393
## 14    3 1166.6474 108.73209
## 15    2 1213.4513 146.62539
## 16    1 1600.1608 406.17230

We might think that k = -Inf corresponds to \(\alpha = 0\) (no pruning), which would make sense in this particular question (as this model still has 17 terminal nodes). However, on the bottom output of page 326 you can see k can apparently take values of both -Inf and 0 which yield models with different numbers of terminal nodes (19 and 17 respectively).

This question was the only relevant question I could find online highlighting this issue. I think based on this example there must be some difference between the mathematical description of cost complexity pruning given on page 309 and its implementation in the tree package that the authors did not mention.

There are a lot of these strange inconsistencies with the tree package and I think I will investigate the more popular rpart package (directly or through caret::train(method = "rpart")) if returning to decision trees in the future.




(d) Bagged Trees

Q: Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.

A:

We can implement bagged trees through the randomForest() function by setting mtry equal to the number of predictors, so that all variables are candidates at every split of every tree. I leave ntree at the default (500).

set.seed(1)

bagged_trees_model <- randomForest(y = train$Sales, 
                                   x = train[ ,-1], 
                                   mtry = ncol(train) - 1, # 10
                                   importance = T) 

Interpretability has decreased, as we now have 500 trees built on different bootstrap samples of the training dataset. However, the predictive power has increased, with the test MSE reducing by ~50%.

test_pred <- predict(bagged_trees_model, test)
mean((test_pred - test$Sales)^2)
## [1] 2.368674

Below are the variable importances:

importance(bagged_trees_model) %>%
  as.data.frame() %>%
  rownames_to_column("varname") %>%
  arrange(desc(IncNodePurity))
##        varname     %IncMSE IncNodePurity
## 1        Price 55.76266629    493.804969
## 2    ShelveLoc 53.87451311    446.816951
## 3    CompPrice 25.47984338    173.982449
## 4          Age 12.07366106    117.502364
## 5  Advertising 13.97464644     96.929928
## 6       Income  1.72616791     71.465227
## 7   Population  1.01449985     68.297498
## 8    Education  0.08382003     37.513944
## 9        Urban -3.06299457      6.909530
## 10          US  0.14346468      5.985091

The first measure, %IncMSE, is calculated by recording the prediction error of every tree on the out-of-bag (OOB) portion of the data (for regression, the MSE). Each predictor is then permuted (scrambled) and the OOB MSE is re-assessed. The difference between these MSEs is averaged across all trees (and normalized by the standard deviation of the differences).

The second measure, IncNodePurity, is the total decrease in node impurity from splitting on that variable, averaged across all the trees. Node impurity is measured by the Gini index for classification and by the RSS for regression.

Whichever measure is used, we can see that for bagged trees, Price and ShelveLoc are the most important variables. This was also the conclusion from inspecting the top splits of the single regression tree produced in part b).
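
As a side note, the randomForest package also provides varImpPlot() for displaying both importance measures side-by-side; something like the following (an optional extra, not required by the question) would plot them for the bagged model above:

varImpPlot(bagged_trees_model, main = "Carseats - Bagged Trees Variable Importance")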




(e) Random Forests

Q: Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of \(m\), the number of variables considered at each split, on the error rate obtained.

A:

For this dataset it seems quite clear that reducing mtry negatively affects (increases) the test MSE, with the mtry = 1 model performing about as well as a single decision tree (but without its ease of interpretability). Increasing mtry quite quickly reduces the test error, and we observe a monotonic reduction in the test MSE as mtry increases (up to mtry = 10).

test_MSE <- c()

i <- 1


for (Mtry in 1:10) {
  set.seed(1)
  
  rf_temp <- randomForest(y = train$Sales, 
                          x = train[ ,-1], 
                          mtry = Mtry, 
                          importance = T)
  
  test_pred <- predict(rf_temp, test)
  
  test_MSE[i] <- mean((test_pred - test$Sales)^2)
  
  i <- i + 1
}


data.frame(mtry = 1:10, test_MSE = test_MSE) %>%
  mutate(min_test_MSE = as.numeric(min(test_MSE) == test_MSE)) %>%
  ggplot(aes(x = mtry, y = test_MSE)) +
  geom_line(col = "grey55") +
  geom_point(size = 2, aes(col = factor(min_test_MSE))) +
  scale_x_continuous(breaks = seq(1, 10), minor_breaks = NULL) +
  scale_color_manual(values = c("deepskyblue3", "green")) +
  theme(legend.position = "none") +
  labs(title = "Carseats Dataset - Random Forests",
       subtitle = "Selecting 'mtry' using the test MSE",
       x = "mtry",
       y = "Test MSE")

The best random forest model used an mtry of 10, which makes it identical to the bagged trees model. Because I used the same random seed as in part d), the test MSE is identical:

tail(test_MSE, 1)
## [1] 2.368674

… as is the variable importance:

importance(rf_temp) %>%
  as.data.frame() %>%
  rownames_to_column("varname") %>%
  arrange(desc(IncNodePurity))
##        varname     %IncMSE IncNodePurity
## 1        Price 55.76266629    493.804969
## 2    ShelveLoc 53.87451311    446.816951
## 3    CompPrice 25.47984338    173.982449
## 4          Age 12.07366106    117.502364
## 5  Advertising 13.97464644     96.929928
## 6       Income  1.72616791     71.465227
## 7   Population  1.01449985     68.297498
## 8    Education  0.08382003     37.513944
## 9        Urban -3.06299457      6.909530
## 10          US  0.14346468      5.985091




9. APPLIED: The OJ Dataset (Classification Trees)

This problem involves the OJ data set which is part of the ISLR package.

glimpse(OJ)
## Rows: 1,070
## Columns: 18
## $ Purchase       <fct> CH, CH, CH, MM, CH, CH, CH, CH, CH, CH, CH, CH, CH, ...
## $ WeekofPurchase <dbl> 237, 239, 245, 227, 228, 230, 232, 234, 235, 238, 24...
## $ StoreID        <dbl> 1, 1, 1, 1, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 1, 2...
## $ PriceCH        <dbl> 1.75, 1.75, 1.86, 1.69, 1.69, 1.69, 1.69, 1.75, 1.75...
## $ PriceMM        <dbl> 1.99, 1.99, 2.09, 1.69, 1.69, 1.99, 1.99, 1.99, 1.99...
## $ DiscCH         <dbl> 0.00, 0.00, 0.17, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00...
## $ DiscMM         <dbl> 0.00, 0.30, 0.00, 0.00, 0.00, 0.00, 0.40, 0.40, 0.40...
## $ SpecialCH      <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SpecialMM      <dbl> 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1...
## $ LoyalCH        <dbl> 0.500000, 0.600000, 0.680000, 0.400000, 0.956535, 0....
## $ SalePriceMM    <dbl> 1.99, 1.69, 2.09, 1.69, 1.69, 1.99, 1.59, 1.59, 1.59...
## $ SalePriceCH    <dbl> 1.75, 1.75, 1.69, 1.69, 1.69, 1.69, 1.69, 1.75, 1.75...
## $ PriceDiff      <dbl> 0.24, -0.06, 0.40, 0.00, 0.00, 0.30, -0.10, -0.16, -...
## $ Store7         <fct> No, No, No, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Y...
## $ PctDiscMM      <dbl> 0.000000, 0.150754, 0.000000, 0.000000, 0.000000, 0....
## $ PctDiscCH      <dbl> 0.000000, 0.000000, 0.091398, 0.000000, 0.000000, 0....
## $ ListPriceDiff  <dbl> 0.24, 0.24, 0.23, 0.00, 0.00, 0.30, 0.30, 0.24, 0.24...
## $ STORE          <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2...


(a) train/test Split

Q: Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.

A:

Since this is my first time seeing this dataset, here is my quick overview: the OJ dataset contains 1,070 purchases of two brands of orange juice (‘Citrus Hill’ or ‘Minute Maid’), captured in the values of the Purchase variable (CH or MM). The remaining 17 predictors are characteristics of the customer, product, store, etc. Throughout this question we are essentially tasked with predicting which orange juice the customer purchased based on these characteristics.

I create train and test below.

set.seed(5)

train_index <- sample(1:nrow(OJ), 800)

train <- OJ[train_index, ]
test <- OJ[-train_index, ]


(b) Classification Tree

Q: Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?

A:

The classification tree has 7 terminal nodes and a training error rate of 18.38%.

tree_model <- tree(Purchase ~ ., train)

summary(tree_model)
## 
## Classification tree:
## tree(formula = Purchase ~ ., data = train)
## Variables actually used in tree construction:
## [1] "LoyalCH"   "PriceDiff" "DiscCH"   
## Number of terminal nodes:  7 
## Residual mean deviance:  0.7786 = 617.4 / 793 
## Misclassification error rate: 0.1838 = 147 / 800

Despite there being 17 predictors in the dataset, only three were used in splits. These were:

  • LoyalCH - Customer brand loyalty for CH
  • PriceDiff - Sale price of MM less sale price of CH
  • DiscCH - Discount offered for CH


(c) tree() - Text Interpretation

Q: Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.

A:

I print the text output below.

tree_model
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 800 1064.00 CH ( 0.61750 0.38250 )  
##    2) LoyalCH < 0.5036 354  435.50 MM ( 0.30508 0.69492 )  
##      4) LoyalCH < 0.142213 100   45.39 MM ( 0.06000 0.94000 ) *
##      5) LoyalCH > 0.142213 254  342.20 MM ( 0.40157 0.59843 )  
##       10) PriceDiff < 0.235 136  153.00 MM ( 0.25000 0.75000 ) *
##       11) PriceDiff > 0.235 118  160.80 CH ( 0.57627 0.42373 ) *
##    3) LoyalCH > 0.5036 446  352.30 CH ( 0.86547 0.13453 )  
##      6) LoyalCH < 0.705699 154  189.50 CH ( 0.69481 0.30519 )  
##       12) PriceDiff < 0.25 85  117.50 CH ( 0.52941 0.47059 )  
##         24) DiscCH < 0.15 77  106.60 MM ( 0.48052 0.51948 ) *
##         25) DiscCH > 0.15 8    0.00 CH ( 1.00000 0.00000 ) *
##       13) PriceDiff > 0.25 69   45.30 CH ( 0.89855 0.10145 ) *
##      7) LoyalCH > 0.705699 292  106.30 CH ( 0.95548 0.04452 ) *

I choose node 11), which is a terminal node (it is marked by a *).

First the root node: 1) root 800 1064.00 CH ( 0.61750 0.38250 )

This means that, at the root node, there are 800 observations, the deviance is 1064.00, the overall prediction is CH and the split is 61.75% CH vs 38.25% MM.

We can see that, from the root node, three splits take place to produce the terminal node labelled by 11):

  • A split at LoyalCH = 0.5036
  • A split at LoyalCH = 0.142213
  • A split at PriceDiff = 0.235
 1) root 800 1064.00 CH ( 0.61750 0.38250 )  
   2) LoyalCH < 0.5036 354  435.50 MM ( 0.30508 0.69492 )  
     4) LoyalCH < 0.142213 100   45.39 MM ( 0.06000 0.94000 ) *
     5) LoyalCH > 0.142213 254  342.20 MM ( 0.40157 0.59843 )  
      10) PriceDiff < 0.235 136  153.00 MM ( 0.25000 0.75000 ) *
      11) PriceDiff > 0.235 118  160.80 CH ( 0.57627 0.42373 ) *

Node 11) is therefore the subset of purchases where 0.142213 < LoyalCH < 0.5036 and PriceDiff > 0.235. The overall prediction is CH, and the node seems quite impure with 57.627% CH vs 42.373% MM.

There are 118 observations in the node, and from the percentages above we know that 68/118 are CH and 50/118 are MM (demonstrated below).

train %>%
  filter(LoyalCH < 0.5036, 
         LoyalCH > 0.142213, 
         PriceDiff > 0.235) %>%
  select(Purchase) %>% 
  table()
## .
## CH MM 
## 68 50

The formula on page 325 gives the overall deviance of a classification tree as \(-2 \sum_{m} \sum_{k} n_{mk} \, \log (\hat{p}_{mk})\), where the outer sum is over the \(m\) terminal nodes (regions). We can calculate the deviance of node 11) alone using the code below:

-2 * (68 * log(68/118) + 50 * log(50/118))
## [1] 160.8262

tree() reports the number as 160.80, and testing with other nodes revealed this is because it’s rounding the result to 4 significant figures.


(d) Plotting

Q: Create a plot of the tree, and interpret the results.

A:

LoyalCH is certainly the most important variable (the top 3 nodes all split on this variable), followed by PriceDiff and DiscCH. We can see node 11) is the third terminal node (from left \(\rightarrow\) right).

LoyalCH ranges from 0 to 1, so the first split sends those less loyal to Citrus Hill (CH) orange juice to the left and those more loyal to the right:

plot(tree_model)
text(tree_model, pretty = 0, cex = 0.7)

Those that scored lowest in Citrus Hill loyalty (LoyalCH < 0.142213) were predicted to buy Minute Maid (MM), which isn’t surprising. Those that were slightly more loyal to CH (0.142213 < LoyalCH < 0.5036) would still buy MM if it wasn’t too much more expensive (PriceDiff < 0.235), but if the price difference is large enough (CH was much cheaper) they could end up purchasing CH.

Those on the far-right terminal node are the most loyal to CH (LoyalCH > 0.705699), so it should be unsurprising that this is their predicted purchase. Those with slightly lower brand loyalty (0.5036 < LoyalCH < 0.705699) would still purchase CH if it was much cheaper (PriceDiff > 0.25), or if it wasn’t but was sufficiently discounted (PriceDiff < 0.25 & DiscCH > 0.15). For those cases where CH wasn’t much cheaper (PriceDiff < 0.25) and wasn’t sufficiently discounted (DiscCH < 0.15), the predicted purchase actually ended up being MM.

This is quite a detailed reading of the tree, but it could be summarized at a much higher level as follows: people go with the brand they are more loyal towards, but there are some edge cases (based on discounts and the prices relative to one another) that can sway people away from their usual brand loyalties.


(e) Test Error

Q: Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?

A:

Here is the confusion matrix for the unpruned classification tree:

test_pred <- predict(tree_model, test, type = "class")
table(test_pred, test_actual = test$Purchase)
##          test_actual
## test_pred  CH  MM
##        CH 125  32
##        MM  34  79

The test error rate corresponding to it:

1 - mean(test_pred == test$Purchase)
## [1] 0.2444444

CH was the most common orange juice in train so, for comparison, a baseline classifier (that predicted CH for all observations in test) would have the following error rate:

1 - mean(test$Purchase == "CH")
## [1] 0.4111111


(f) Cost-Complexity Pruning

Q: Apply the cv.tree() function to the training set in order to determine the optimal tree size.

A:

Since our goal appears to be low test error, I specify FUN = prune.misclass. This indicates that we want the classification error rate to guide the cross-validation and pruning process, rather than the default for the cv.tree() function, which is deviance.

set.seed(2)

cv_tree_model <- cv.tree(tree_model, K = 10, FUN = prune.misclass)
cv_tree_model
## $size
## [1] 7 4 2 1
## 
## $dev
## [1] 164 164 177 298
## 
## $k
## [1] -Inf    1    9  138
## 
## $method
## [1] "misclass"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"


(g) CV Error Plot

Q: Produce a plot with tree size on the \(x\)-axis and cross-validated classification error rate on the \(y\)-axis.

A:

The plot is below. Note that cv_tree_model$size is the number of terminal nodes (so 1 means it is just the root node with no splits), and cv_tree_model$dev gives the total number of errors made from the out-of-fold predictions during cross-validation (only because we specified FUN = prune.misclass - omitting this would mean this reports the deviance). From this we can obtain the cross-validation error rate.

data.frame(size = cv_tree_model$size, CV_Error = cv_tree_model$dev / nrow(train)) %>%
  mutate(min_CV_Error = as.numeric(min(CV_Error) == CV_Error)) %>%
  ggplot(aes(x = size, y = CV_Error)) +
  geom_line(col = "grey55") +
  geom_point(size = 2, aes(col = factor(min_CV_Error))) +
  scale_x_continuous(breaks = seq(1, 7), minor_breaks = NULL) +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_color_manual(values = c("deepskyblue3", "green")) +
  theme(legend.position = "none") +
  labs(title = "OJ Dataset - Classification Tree",
       subtitle = "Selecting tree 'size' (# of terminal nodes) using cross-validation",
       x = "Tree Size",
       y = "CV Error")


(h) Best Tree - CV Error

Q: Which tree size corresponds to the lowest cross-validated classification error rate?

A:

Of the sequence of trees generated, trees of sizes 4 and 7 have the same cross-validation error. It makes sense to select the more parsimonious model here with 4 terminal nodes.


(i) Best Tree - Selecting

Q: Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.

A:

I produce the tree with 4 terminal nodes. Interestingly we have the same cross-validation error as the full tree with 7 terminal nodes, but have only split on LoyalCH to achieve this. This would have the added benefit of simplifying the interpretation in part (d).

pruned_tree_model <- prune.tree(tree_model, best = 4)
pruned_tree_model
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 800 1064.00 CH ( 0.61750 0.38250 )  
##   2) LoyalCH < 0.5036 354  435.50 MM ( 0.30508 0.69492 )  
##     4) LoyalCH < 0.142213 100   45.39 MM ( 0.06000 0.94000 ) *
##     5) LoyalCH > 0.142213 254  342.20 MM ( 0.40157 0.59843 ) *
##   3) LoyalCH > 0.5036 446  352.30 CH ( 0.86547 0.13453 )  
##     6) LoyalCH < 0.705699 154  189.50 CH ( 0.69481 0.30519 ) *
##     7) LoyalCH > 0.705699 292  106.30 CH ( 0.95548 0.04452 ) *


(j) Training Error Comparison

Q: Compare the training error rates between the pruned and unpruned trees. Which is higher?

A:

Here is the training error for the unpruned tree (7 terminal nodes):

mean(predict(tree_model, type = "class") != train$Purchase)
## [1] 0.18375

The same for the pruned tree (4 terminal nodes):

mean(predict(pruned_tree_model, type = "class") != train$Purchase)
## [1] 0.21

The training error for the pruned tree is higher. This isn’t surprising - we would expect the training error of a tree to monotonically decrease as its flexibility (number of splits) increases.


(k) Test Error Comparison

Q: Compare the test error rates between the pruned and unpruned trees. Which is higher?

A:

The test error for the unpruned tree:

mean(predict(tree_model, type = "class", newdata = test) != test$Purchase)
## [1] 0.2444444

The same for the pruned tree:

mean(predict(pruned_tree_model, type = "class", newdata = test) != test$Purchase)
## [1] 0.1703704

Now the order has reversed and the error is higher for the unpruned tree.

It is interesting that the cross-validation errors were in fact equal but the test error is noticeably lower for the simpler tree. A lot of this probably comes from random variability when working with a small dataset; using a different random state for the CV folds and train/test split would likely change all of these results (particularly because decision trees are such high-variance approaches).




10. APPLIED: The Hitters Dataset (Boosting)

We now use boosting to predict Salary in the Hitters data set.

glimpse(Hitters)
## Rows: 322
## Columns: 20
## $ AtBat     <int> 293, 315, 479, 496, 321, 594, 185, 298, 323, 401, 574, 20...
## $ Hits      <int> 66, 81, 130, 141, 87, 169, 37, 73, 81, 92, 159, 53, 113, ...
## $ HmRun     <int> 1, 7, 18, 20, 10, 4, 1, 0, 6, 17, 21, 4, 13, 0, 7, 3, 20,...
## $ Runs      <int> 30, 24, 66, 65, 39, 74, 23, 24, 26, 49, 107, 31, 48, 30, ...
## $ RBI       <int> 29, 38, 72, 78, 42, 51, 8, 24, 32, 66, 75, 26, 61, 11, 27...
## $ Walks     <int> 14, 39, 76, 37, 30, 35, 21, 7, 8, 65, 59, 27, 47, 22, 30,...
## $ Years     <int> 1, 14, 3, 11, 2, 11, 2, 3, 2, 13, 10, 9, 4, 6, 13, 3, 15,...
## $ CAtBat    <int> 293, 3449, 1624, 5628, 396, 4408, 214, 509, 341, 5206, 46...
## $ CHits     <int> 66, 835, 457, 1575, 101, 1133, 42, 108, 86, 1332, 1300, 4...
## $ CHmRun    <int> 1, 69, 63, 225, 12, 19, 1, 0, 6, 253, 90, 15, 41, 4, 36, ...
## $ CRuns     <int> 30, 321, 224, 828, 48, 501, 30, 41, 32, 784, 702, 192, 20...
## $ CRBI      <int> 29, 414, 266, 838, 46, 336, 9, 37, 34, 890, 504, 186, 204...
## $ CWalks    <int> 14, 375, 263, 354, 33, 194, 24, 12, 8, 866, 488, 161, 203...
## $ League    <fct> A, N, A, N, N, A, N, A, N, A, A, N, N, A, N, A, N, A, A, ...
## $ Division  <fct> E, W, W, E, E, W, E, W, W, E, E, W, E, E, E, W, W, W, W, ...
## $ PutOuts   <int> 446, 632, 880, 200, 805, 282, 76, 121, 143, 0, 238, 304, ...
## $ Assists   <int> 33, 43, 82, 11, 40, 421, 127, 283, 290, 0, 445, 45, 11, 1...
## $ Errors    <int> 20, 10, 14, 3, 4, 25, 7, 9, 19, 0, 22, 11, 7, 6, 8, 0, 10...
## $ Salary    <dbl> NA, 475.000, 480.000, 500.000, 91.500, 750.000, 70.000, 1...
## $ NewLeague <fct> A, N, A, N, N, A, A, A, N, A, A, N, N, A, N, A, N, A, A, ...


(a) Salary - Remove Missings & Log-Transform

Q: Remove the observations for whom the salary information is unknown, and then log-transform the salaries.

A:

There do seem to be quite a few observations with missing Salary data:

sum(is.na(Hitters$Salary)) # 59
## [1] 59

I produce a new dataframe with these observations removed and log-transform Salary. I’ll keep the original response variable around and instead create a new variable, log_Salary. A before & after density plot is below, where we can see that log_Salary is less skewed and closer to a normal distribution:

Hitters_notNA <- Hitters[!is.na(Hitters$Salary), ]
Hitters_notNA$log_Salary <- log(Hitters_notNA$Salary)

ggplot(Hitters_notNA, aes(x = Salary)) + 
  geom_density(fill = "mediumseagreen") + 
  scale_x_continuous(labels = scales::comma_format()) +
  theme(axis.text.y = element_blank(), 
        axis.ticks.y = element_blank(), 
        axis.title.y = element_blank()) + 
  labs(title = "Hitters Dataset - Salary Distribution", 
       x = "Salary")

ggplot(Hitters_notNA, aes(x = log_Salary)) + 
  geom_density(fill = "deepskyblue3") + 
  theme(axis.text.y = element_blank(), 
        axis.ticks.y = element_blank(), 
        axis.title.y = element_blank()) + 
  labs(title = "Hitters Dataset - log(Salary) Distribution", 
       x = "log(Salary)")


Extended - Log-Transformation

This is the end of this question, but below I produce some explanations & illustrations for my own understanding. I thought it was quite interesting that the book applies a log-transformation to the response here without really justifying why, since so far transformations have mostly been discussed in the context of linear models. I think the accepted answer here is a good summary of why the authors might be applying this transformation.

Log-transforming will make the splits chosen by regression trees (and regression-tree-based methods) less influenced by extreme values of Salary. I believe that the predictions in the regions produced by these splits should also be less skewed by these extreme values (after all, the prediction in each region is just the mean response of the observations in that region, and the mean is sensitive to outliers). Furthermore, when we select a model based on metrics that use the squared error (e.g. the MSE), we might not want the selected model predicting salary to be too strongly influenced by these top few earners.

It is worth remembering for the future that response-variable transformations and outlier removal could be useful in decreasing the prediction error with tree-based methods when tackling regression problems with skewed response variables.

Below I put together a quick example similar to that seen in the link above, applying it directly to the Hitters dataset.


I fit a decision stump predicting Salary with one predictor (CHits):

### ### ### using y = Salary ### ### ###
CHits_stump_Salary <- tree(Salary ~ CHits, Hitters_notNA) %>%
  prune.tree(best = 2)

CHits_stump_Salary # CHits < 450
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 263 53320000 535.9  
##   2) CHits < 450 117  5931000 227.9 *
##   3) CHits > 450 146 27390000 782.8 *

We can see that the split occurs at the ~45th percentile of the CHits variable (at value 450):

round(sum(Hitters_notNA$CHits < 450) / nrow(Hitters_notNA), 3) # 44.5%
## [1] 0.445

The vertical darker-green line shows the split that partitions the feature space into two regions. The horizontal lighter-green lines are the predictions made for these two regions:

pred_Salary <- Hitters_notNA %>%
  mutate(split = case_when(CHits < 450 ~ "lt 450", 
                           TRUE ~ "ge 450")) %>%
  group_by(split) %>%
  summarize(Salary = mean(Salary))

ggplot(Hitters_notNA, aes(x = CHits, y = Salary)) + 
  geom_point(alpha = 0.5) + 
  geom_segment(aes(x = 450, xend = 4500, y = pred_Salary$Salary[pred_Salary$split == "ge 450"], yend = pred_Salary$Salary[pred_Salary$split == "ge 450"]), col = "mediumseagreen", size = 1) + 
  geom_segment(aes(x = 0, xend = 450, y = pred_Salary$Salary[pred_Salary$split == "lt 450"], yend = pred_Salary$Salary[pred_Salary$split == "lt 450"]), col = "mediumseagreen", size = 1) + 
  geom_vline(xintercept = 450, size = 1.2, col = "darkgreen") +
  scale_x_continuous(labels = scales::comma_format()) +
  scale_y_continuous(labels = scales::comma_format())

The MSE and Median Absolute Error:

data.frame(MSE = mean((predict(CHits_stump_Salary, Hitters_notNA) - Hitters_notNA$Salary)^2), # 126,678
           MAE = median(abs(predict(CHits_stump_Salary, Hitters_notNA) - Hitters_notNA$Salary))) # 137.8547
##      MSE      MAE
## 1 126678 137.8547


I now fit a decision stump predicting log_Salary with the same predictor (CHits):

### ### ### using y = log_Salary ### ### ###
CHits_stump_logSalary <- tree(log_Salary ~ CHits, Hitters_notNA) %>%
  prune.tree(best = 2)

CHits_stump_logSalary # CHits < 358
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 263 207.20 5.927  
##   2) CHits < 358 101  35.24 5.083 *
##   3) CHits > 358 162  55.07 6.454 *

The split has shifted to the left, now occurring at the ~38th percentile of the CHits variable (at value 358):

round(sum(Hitters_notNA$CHits < 358) / nrow(Hitters_notNA), 3) # 38.4%
## [1] 0.384

As before, the vertical darker-blue line shows the split, and the horizontal lighter-blue lines are the predictions made by the tree for the two regions:

pred_logSalary <- Hitters_notNA %>%
  mutate(split = case_when(CHits < 358 ~ "lt 358", 
                           TRUE ~ "ge 358")) %>%
  group_by(split) %>%
  summarize(log_Salary = mean(log_Salary))

ggplot(Hitters_notNA, aes(x = CHits, y = log_Salary)) + 
  geom_point(alpha = 0.5) + 
  geom_segment(aes(x = 358, xend = 4500, y = pred_logSalary$log_Salary[pred_logSalary$split == "ge 358"], yend = pred_logSalary$log_Salary[pred_logSalary$split == "ge 358"]), col = "deepskyblue3", size = 1) +
  geom_segment(aes(x = 0, xend = 358, y = pred_logSalary$log_Salary[pred_logSalary$split == "lt 358"], yend = pred_logSalary$log_Salary[pred_logSalary$split == "lt 358"]), col = "deepskyblue3", size = 1) +
  geom_vline(xintercept = 358, size = 1.2, col = "blue") + 
  scale_x_continuous(labels = scales::comma_format()) +
  labs(y = "log(Salary)")

The MSE and Median Absolute Error:

data.frame(MSE = mean((exp(predict(CHits_stump_logSalary, Hitters_notNA)) - Hitters_notNA$Salary)^2), # 141287.2
           MAE = median(abs(exp(predict(CHits_stump_logSalary, Hitters_notNA)) - Hitters_notNA$Salary))) # 116.2309
##        MSE      MAE
## 1 141287.2 116.2309


Looking at the splits and predictions on the same scale is the best way of seeing how the transformation influenced the split-points, predicted salary and therefore the overall error of the models:

ggplot(Hitters_notNA, aes(x = CHits, y = Salary)) + 
  geom_point(alpha = 0.5) + 
  geom_segment(aes(x = 450, xend = 4500, y = pred_Salary$Salary[pred_Salary$split == "ge 450"], yend = pred_Salary$Salary[pred_Salary$split == "ge 450"]), col = "mediumseagreen", size = 1) + 
  geom_segment(aes(x = 0, xend = 450, y = pred_Salary$Salary[pred_Salary$split == "lt 450"], yend = pred_Salary$Salary[pred_Salary$split == "lt 450"]), col = "mediumseagreen", size = 1) + 
  geom_segment(aes(x = 358, xend = 4500, y = exp(pred_logSalary$log_Salary[pred_logSalary$split == "ge 358"]), yend = exp(pred_logSalary$log_Salary[pred_logSalary$split == "ge 358"])), col = "deepskyblue3", size = 1) +
  geom_segment(aes(x = 0, xend = 358, y = exp(pred_logSalary$log_Salary[pred_logSalary$split == "lt 358"]), yend = exp(pred_logSalary$log_Salary[pred_logSalary$split == "lt 358"])), col = "deepskyblue3", size = 1) +
  geom_vline(aes(col = "log(Salary)", xintercept = 358), size = 1.2) + 
  geom_vline(aes(col = "Salary", xintercept = 450), size = 1.2) + 
  scale_color_manual(values = c("blue", "darkgreen")) + 
  scale_x_continuous(labels = scales::comma_format()) +
  scale_y_continuous(labels = scales::comma_format()) +
  labs(col = "Response:") + 
  theme(legend.position = "bottom")


(b) train/test Split

Q: Create a training set consisting of the first 200 observations, and a test set consisting of the remaining observations.

A:

I create a 200-row train dataset and 63-row test dataset below.

train <- Hitters_notNA[1:200, ]
test <- Hitters_notNA[201:nrow(Hitters_notNA), ]


(c) Boosting - Varying Shrinkage (train MSE)

Q: Perform boosting on the training set with 1,000 trees for a range of values of the shrinkage parameter \(\lambda\). Produce a plot with different shrinkage values on the \(x\)-axis and the corresponding training set MSE on the \(y\)-axis.

A:

Note: for the remainder of the question I calculate performance metrics on log_Salary directly, rather than exponentiating the predictions and evaluating on the original Salary scale (as I did in my extension of part (a)). I presume this is what the authors intended.

I test a wide sequence of \(\lambda\) values (from \(10^{-6}\) to \(1\)), keeping interaction.depth = 2 so the model can capture basic interactions.

I plot the results below, displaying the x-axis on a \(log_{10}\) scale. The training MSE decreases for higher values of \(\lambda\).

lambda_seq <- 10^seq(-6, 0, 0.1)

set.seed(1)

train_MSE <- c()
test_MSE <- c()

for (i in 1:length(lambda_seq)) {
  boost_TEMP <- gbm(log_Salary ~ . - Salary, 
                    data = train, 
                    distribution = "gaussian", 
                    n.trees = 1000, 
                    interaction.depth = 2, 
                    shrinkage = lambda_seq[i])
  
  train_MSE[i] <- mean((predict(boost_TEMP, train, n.trees = 1000) - train$log_Salary)^2)
  
  test_MSE[i] <- mean((predict(boost_TEMP, test, n.trees = 1000) - test$log_Salary)^2)
}

data.frame(lambda = lambda_seq, train_MSE) %>%
  ggplot(aes(x = lambda, y = train_MSE)) + 
  geom_point(size = 2, col = "deepskyblue3") + 
  geom_line(col = "grey55") + 
  scale_x_continuous(trans = 'log10', breaks = 10^seq(-6, 0), labels = 10^seq(-6, 0), minor_breaks = NULL) + 
  labs(x = "Lambda (Shrinkage)", 
       y = "Training MSE")


(d) Boosting - Varying Shrinkage (test MSE)

Q: Produce a plot with different shrinkage values on the x-axis and the corresponding test set MSE on the y-axis.

A:

I use the same sequence of lambdas to assess the corresponding test MSE. I actually did this in the previous question (creating the vector test_MSE alongside train_MSE), so I simply plot the results here:

data.frame(lambda = lambda_seq, test_MSE) %>%
  ggplot(aes(x = lambda, y = test_MSE)) + 
  geom_point(size = 2, col = "deepskyblue3") + 
  geom_line(col = "grey55") + 
  scale_x_continuous(trans = 'log10', breaks = 10^seq(-6, 0), labels = 10^seq(-6, 0), minor_breaks = NULL) + 
  labs(x = "Lambda (Shrinkage)", 
       y = "Test MSE")


(e) Boosting vs PLS vs Lasso

Q: Compare the test MSE of boosting to the test MSE that results from applying two of the regression approaches seen in Chapters 3 and 6.

A:

I’ll choose the following three methods for comparison:

  1. Boosting
  2. Partial Least Squares Regression
  3. Lasso Regression


1. Boosting

Holding n.trees = 1000, interaction.depth = 2 and n.minobsinnode = 10 constant (the last is not in the book, and I find it much less important to tune), I use repeated cross-validation to select the shrinkage parameter:

set.seed(1)

ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3, verboseIter = F, summaryFunction = custom_regression_metrics)

gbm_grid <- expand.grid(n.trees = 1000, interaction.depth = 2, n.minobsinnode = 10, shrinkage = 10^seq(-6, 0, 0.1))

model_gbm <- train(log_Salary ~ . - Salary, 
                   data = train, 
                   method = "gbm",
                   distribution = "gaussian", 
                   verbose = F,
                   metric = "MSE",
                   maximize = F,
                   trControl = ctrl, 
                   tuneGrid = gbm_grid)


model_gbm$results %>%
  rename(CV_MSE = MSE) %>%
  mutate(min_CV_MSE = as.numeric(shrinkage == model_gbm$bestTune$shrinkage)) %>%
  ggplot(aes(x = shrinkage, y = CV_MSE)) + 
  geom_line(col = "grey55") + 
  geom_point(size = 2, aes(col = factor(min_CV_MSE))) + 
  scale_x_continuous(trans = 'log10', breaks = 10^seq(-6, 0), labels = 10^seq(-6, 0), minor_breaks = NULL) + 
  scale_color_manual(values = c("deepskyblue3", "green")) + 
  theme(legend.position = "none") + 
  labs(title = "Hitters Dataset - Boosting", 
       subtitle = "Selecting shrinkage parameter using cross-validation",
       x = "Lambda (Shrinkage)", 
       y = "CV MSE")

The optimal value of shrinkage:

model_gbm$bestTune$shrinkage %>% round(4) # 0.0063
## [1] 0.0063

The corresponding cross-validation MSE:

gbm_cv <- min(model_gbm$results$MSE) %>% round(4) # 0.2089
gbm_cv
## [1] 0.2089

The test MSE:

gbm_test <- mean((predict(model_gbm, test) - test$log_Salary)^2) %>% round(4) # 0.2816
gbm_test
## [1] 0.2816

I did a fair bit of playing around with the hyper-parameters on this dataset, but even with repeated cross-validation there is still a lot of variability in the optimal parameters (due to the small size of the dataset). Any reduction in the CV MSE from here was marginal and often didn't translate into an improvement in the test MSE. I'll take the above stats as the final boosting performance for this dataset and move on.
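As a rough illustration of that variability, one can inspect the spread of the individual resample estimates that caret retained for the chosen shrinkage value (trainControl() defaults to returnResamp = "final", so only the best tune's resamples are kept):

model_gbm$resample %>%
  summarize(n_resamples = n(), # 15 resamples: 5 folds x 3 repeats
            mean_MSE = mean(MSE), 
            sd_MSE = sd(MSE))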


2. Partial Least Squares Regression

set.seed(1)

ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 20, verboseIter = F, summaryFunction = custom_regression_metrics)

model_pls <- train(log_Salary ~ . - Salary, 
                   data = train, 
                   method = "pls", 
                   preProcess = c("center", "scale"), 
                   metric = "MSE", 
                   maximize = F,
                   trControl = ctrl, 
                   tuneGrid = data.frame(ncomp = 1:20))


model_pls$results %>%
  rename(CV_MSE = MSE) %>%
  mutate(min_CV_MSE = as.numeric(ncomp == model_pls$bestTune$ncomp)) %>%
  ggplot(aes(x = ncomp, y = CV_MSE)) + 
  geom_line(col = "grey55") + 
  geom_point(size = 2, aes(col = factor(min_CV_MSE))) + 
  scale_x_continuous(breaks = 1:20, minor_breaks = NULL) +
  scale_color_manual(values = c("deepskyblue3", "green")) + 
  theme(legend.position = "none") + 
  labs(title = "Hitters Dataset - Partial Least Squares", 
       subtitle = "Selecting number of principal components with cross-validation",
       x = "Principal Components", 
       y = "CV MSE")

The optimal number of PLS components:

model_pls$bestTune$ncomp # 1
## [1] 1

The corresponding cross-validation MSE:

pls_cv <- min(model_pls$results$MSE) %>% round(4) # 0.4038
pls_cv
## [1] 0.4038

The test MSE:

pls_test <- mean((predict(model_pls, test) - test$log_Salary)^2) %>% round(4) # 0.466
pls_test
## [1] 0.466


3. Lasso Regression

ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = F, summaryFunction = custom_regression_metrics)

set.seed(1)

model_lasso <- train(log_Salary ~ . - Salary, 
                   data = train, 
                   method = "glmnet", 
                   preProcess = c("center", "scale"), 
                     metric = "MSE",
                     maximize = F,
                     trControl = ctrl, 
                     tuneGrid = expand.grid(alpha = 1,
                                            lambda = seq(0, 0.1, 0.001)))



model_lasso$results %>%
  rename(CV_MSE = MSE) %>%
  mutate(min_CV_MSE = as.numeric(lambda == model_lasso$bestTune$lambda)) %>%
  ggplot(aes(x = lambda, y = CV_MSE)) + 
  geom_line(col = "grey55") + 
  geom_point(size = 2, aes(col = factor(min_CV_MSE))) + 
  scale_color_manual(values = c("deepskyblue3", "green")) + 
  theme(legend.position = "none") + 
  labs(title = "Hitters Dataset - Lasso Regression", 
       subtitle = "Selecting shrinkage parameter with cross-validation",
       y = "CV MSE")

The optimal value of lambda (the shrinkage parameter):

model_lasso$bestTune$lambda # 0.003
## [1] 0.003

The corresponding cross-validation MSE:

lasso_cv <- min(model_lasso$results$MSE) %>% round(4) # 0.3982
lasso_cv
## [1] 0.3982

The test MSE:

lasso_test <- mean((predict(model_lasso, test) - test$log_Salary)^2) %>% round(4) # 0.4712
lasso_test
## [1] 0.4712


The improvement in the test MSE for Boosting over the other (linear) regression approaches appears to be quite large. These results are summarized (alongside bagging) in part (g).


(f) Variable Importance

Q: Which variables appear to be the most important predictors in the boosted model?

A:

We can see the variable importance ranking below. CAtBat (“Number of times at bat during his career”) is the most important variable in the model by some margin. After that, CRuns, CWalks, CHits & CRBI are all of similar importance. Note that the most important variables are all ‘career’ variables (totals across the player’s career to date). For the full definitions you can type ?ISLR::Hitters in the console.

# summary(model_gbm$finalModel)

# ggplot version:
summary(model_gbm$finalModel, plotit = F) %>%
  rename("Importance" = "rel.inf") %>%
  ggplot(aes(x = reorder(var, Importance), y = Importance, fill = Importance)) + 
  geom_bar(stat = "identity") + 
  geom_text(aes(label = round(Importance, 2), col = Importance), hjust = -0.2) + 
  scale_y_continuous(limits = c(0, 31.5)) +
  scale_fill_gradient(low = "grey40", high = "#28B463") + 
  scale_color_gradient(low = "grey40", high = "#28B463") +
  coord_flip() + 
  theme(legend.position = "none") + 
  labs(title = "Hitters Dataset - Boosting - Variable Importance", 
       x = "Varname")


Extended - PDPs & Monotone Constraints

Below I plot the partial dependence plot (PDP) for the most important predictor (CAtBat), which shows its marginal effect on the response after integrating out the other variables. We can see that \(log(Salary)\) increases sharply with CAtBat up to roughly 1,500, after which higher values of CAtBat have little additional effect.

plot(model_gbm$finalModel, "CAtBat", return.grid = T) %>%
  ggplot(aes(CAtBat, y)) + 
  geom_line() + 
  scale_x_continuous(labels = scales::comma_format()) + 
  labs(title = "CAtBat - Partial Dependence Plot", 
       y = "Marginal effect on y = log(Salary)")

I stumbled across the var.monotone argument for gbm(), which allows us to force the relationships of certain predictors with the outcome to be monotonic. This can be very useful in finance and other regulated industries, and the effect of these monotone constraints is immediately visible in the partial dependence plots.

To show an example, I look at the PDP of PutOuts, where we can see that the relationship is clearly not monotonic:

plot(model_gbm$finalModel, "PutOuts", return.grid = T) %>%
  ggplot(aes(PutOuts, y)) + 
  geom_line() + 
  scale_x_continuous(labels = scales::comma_format()) + 
  labs(title = "PutOuts - Partial Dependence Plot", 
       y = "Marginal effect on y = log(Salary)")

I reproduce the PDP for a model trained with all the same hyper-parameters, but now with an additional (monotonic, increasing) constraint on the relationship between PutOuts and the response. I didn’t constrain any of the remaining 18 predictors, meaning arbitrary non-monotonic relationships are still permitted for them. The impact of the monotonic constraint is clear:

set.seed(1)

train_X <- select(train, -c(Salary, log_Salary))

model_gbm_mono <- train(y = train$log_Salary, 
                   x = train_X, 
                   method = "gbm",
                   distribution = "gaussian", 
                   verbose = F,
                   metric = "MSE",
                   maximize = F,
                   trControl = ctrl, 
                   tuneGrid = data.frame(n.trees = 1000, interaction.depth = 2, n.minobsinnode = 10, shrinkage = 10^-2.2), # shrinkage = 10^-2.2 chosen earlier
                   var.monotone = ifelse(names(train_X) == "PutOuts", 1, 0)) # PutOuts = 1 = monotonic increasing (others = 0 = arbitrary)


plot(model_gbm_mono$finalModel, "PutOuts", return.grid = T) %>%
  ggplot(aes(PutOuts, y)) + 
  geom_line() + 
  scale_x_continuous(labels = scales::comma_format()) + 
  labs(title = "PutOuts - Partial Dependence Plot", 
       subtitle = "Monotonic (increasing) constraint on 'PutOuts'",
       y = "Marginal effect on y = log(Salary)")


(g) Bagging

Q: Now apply bagging to the training set. What is the test set MSE for this approach?

A:

set.seed(1)

ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3, verboseIter = F, summaryFunction = custom_regression_metrics)

model_bagging <- train(y = train$log_Salary, 
                   x = train_X, 
                   method = "rf",
                   metric = "MSE",
                   maximize = F,
                   trControl = ctrl, 
                   tuneGrid = expand.grid(mtry = 19))

The cross-validation MSE:

bagging_cv <- model_bagging$results$MSE %>% round(4) # 0.2198
bagging_cv
## [1] 0.2198

The test MSE:

bagging_test <- mean((predict(model_bagging, test) - test$log_Salary)^2) %>% round(4) # 0.2323
bagging_test
## [1] 0.2323

The performance of all four methods is summarized below for comparison:

data.frame(method = c("Boosting", "PLS", "Lasso", "Bagging"), 
           CV_MSE = c(gbm_cv, pls_cv, lasso_cv, bagging_cv), 
           test_MSE = c(gbm_test, pls_test, lasso_test, bagging_test)) %>%
  arrange(test_MSE)
##     method CV_MSE test_MSE
## 1  Bagging 0.2198   0.2323
## 2 Boosting 0.2089   0.2816
## 3      PLS 0.4038   0.4660
## 4    Lasso 0.3982   0.4712


With regard to the test MSE, bagging performed somewhat better than boosting. The dataset is small, so there is an element of randomness in the exact ordering, but it seems clear that there is a large performance gain to be had from using tree ensembles here as opposed to linear methods.




11. APPLIED: The Caravan Dataset (Boosting)

This question uses the Caravan data set.

glimpse(Caravan)
## Rows: 5,822
## Columns: 86
## $ MOSTYPE  <dbl> 33, 37, 37, 9, 40, 23, 39, 33, 33, 11, 10, 9, 33, 41, 23, ...
## $ MAANTHUI <dbl> 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1...
## $ MGEMOMV  <dbl> 3, 2, 2, 3, 4, 2, 3, 2, 2, 3, 4, 3, 2, 3, 1, 2, 2, 3, 4, 2...
## $ MGEMLEEF <dbl> 2, 2, 2, 3, 2, 1, 2, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 3, 2, 4...
## $ MOSHOOFD <dbl> 8, 8, 8, 3, 10, 5, 9, 8, 8, 3, 3, 3, 8, 10, 5, 8, 9, 5, 3,...
## $ MGODRK   <dbl> 0, 1, 0, 2, 1, 0, 2, 0, 0, 3, 1, 1, 1, 0, 0, 0, 0, 0, 2, 0...
## $ MGODPR   <dbl> 5, 4, 4, 3, 4, 5, 2, 7, 1, 5, 4, 3, 4, 5, 6, 7, 6, 5, 4, 2...
## $ MGODOV   <dbl> 1, 1, 2, 2, 1, 0, 0, 0, 3, 0, 1, 2, 1, 0, 1, 0, 0, 0, 0, 0...
## $ MGODGE   <dbl> 3, 4, 4, 4, 4, 5, 5, 2, 6, 2, 4, 4, 4, 4, 2, 2, 3, 4, 3, 7...
## $ MRELGE   <dbl> 7, 6, 3, 5, 7, 0, 7, 7, 6, 7, 7, 7, 6, 7, 1, 7, 7, 7, 7, 9...
## $ MRELSA   <dbl> 0, 2, 2, 2, 1, 6, 2, 2, 0, 0, 1, 1, 2, 1, 2, 2, 0, 0, 0, 0...
## $ MRELOV   <dbl> 2, 2, 4, 2, 2, 3, 0, 0, 3, 2, 2, 2, 3, 1, 6, 0, 2, 2, 2, 0...
## $ MFALLEEN <dbl> 1, 0, 4, 2, 2, 3, 0, 0, 3, 2, 0, 2, 3, 1, 5, 0, 0, 0, 1, 0...
## $ MFGEKIND <dbl> 2, 4, 4, 3, 4, 5, 3, 5, 3, 2, 3, 3, 4, 4, 3, 5, 6, 2, 3, 6...
## $ MFWEKIND <dbl> 6, 5, 2, 4, 4, 2, 6, 4, 3, 6, 6, 5, 3, 5, 1, 4, 3, 7, 6, 3...
## $ MOPLHOOG <dbl> 1, 0, 0, 3, 5, 0, 0, 0, 0, 0, 4, 1, 1, 2, 2, 0, 2, 2, 5, 0...
## $ MOPLMIDD <dbl> 2, 5, 5, 4, 4, 5, 4, 3, 1, 4, 3, 7, 4, 4, 6, 3, 6, 1, 4, 0...
## $ MOPLLAAG <dbl> 7, 4, 4, 2, 0, 4, 5, 6, 8, 5, 3, 1, 5, 4, 2, 6, 2, 7, 1, 9...
## $ MBERHOOG <dbl> 1, 0, 0, 4, 0, 2, 0, 2, 1, 2, 0, 4, 1, 3, 1, 2, 2, 0, 6, 0...
## $ MBERZELF <dbl> 0, 0, 0, 0, 5, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0...
## $ MBERBOER <dbl> 1, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0...
## $ MBERMIDD <dbl> 2, 5, 7, 3, 0, 4, 4, 2, 1, 3, 9, 5, 3, 2, 4, 2, 4, 1, 3, 2...
## $ MBERARBG <dbl> 5, 0, 0, 1, 0, 2, 1, 5, 8, 3, 0, 1, 2, 2, 3, 5, 0, 1, 0, 4...
## $ MBERARBO <dbl> 2, 4, 2, 2, 0, 2, 5, 2, 1, 3, 0, 1, 4, 2, 2, 2, 4, 5, 1, 4...
## $ MSKA     <dbl> 1, 0, 0, 3, 9, 2, 0, 2, 1, 1, 3, 2, 1, 4, 1, 2, 2, 2, 5, 0...
## $ MSKB1    <dbl> 1, 2, 5, 2, 0, 2, 1, 1, 1, 2, 0, 3, 2, 2, 3, 1, 2, 0, 2, 0...
## $ MSKB2    <dbl> 2, 3, 0, 1, 0, 2, 4, 2, 0, 1, 6, 4, 2, 1, 2, 2, 4, 0, 1, 0...
## $ MSKC     <dbl> 6, 5, 4, 4, 0, 4, 5, 5, 8, 4, 0, 1, 5, 4, 4, 5, 2, 7, 2, 7...
## $ MSKD     <dbl> 1, 0, 0, 0, 0, 2, 0, 2, 1, 2, 0, 0, 1, 0, 0, 2, 0, 0, 0, 2...
## $ MHHUUR   <dbl> 1, 2, 7, 5, 4, 9, 6, 0, 9, 0, 0, 6, 5, 5, 9, 0, 6, 4, 4, 9...
## $ MHKOOP   <dbl> 8, 7, 2, 4, 5, 0, 3, 9, 0, 9, 9, 3, 4, 4, 0, 9, 3, 5, 5, 0...
## $ MAUT1    <dbl> 8, 7, 7, 9, 6, 5, 8, 4, 5, 6, 6, 7, 6, 7, 5, 4, 7, 6, 7, 7...
## $ MAUT2    <dbl> 0, 1, 0, 0, 2, 3, 0, 4, 2, 1, 2, 1, 1, 2, 1, 4, 2, 1, 1, 2...
## $ MAUT0    <dbl> 1, 2, 2, 0, 1, 3, 1, 2, 3, 2, 1, 2, 3, 0, 3, 2, 0, 2, 2, 0...
## $ MZFONDS  <dbl> 8, 6, 9, 7, 5, 9, 9, 6, 7, 6, 5, 4, 7, 8, 5, 6, 4, 7, 5, 9...
## $ MZPART   <dbl> 1, 3, 0, 2, 4, 0, 0, 3, 2, 3, 4, 5, 2, 1, 4, 3, 5, 2, 4, 0...
## $ MINKM30  <dbl> 0, 2, 4, 1, 0, 5, 4, 2, 7, 2, 0, 3, 3, 3, 4, 2, 0, 0, 2, 5...
## $ MINK3045 <dbl> 4, 0, 5, 5, 0, 2, 3, 5, 2, 3, 3, 4, 4, 2, 3, 5, 1, 6, 2, 4...
## $ MINK4575 <dbl> 5, 5, 0, 3, 9, 3, 3, 3, 1, 3, 2, 3, 3, 4, 3, 3, 6, 3, 6, 0...
## $ MINK7512 <dbl> 0, 2, 0, 0, 0, 0, 0, 0, 0, 1, 2, 1, 1, 1, 0, 0, 2, 0, 0, 0...
## $ MINK123M <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ MINKGEM  <dbl> 4, 5, 3, 4, 6, 3, 3, 3, 2, 4, 8, 3, 3, 4, 3, 3, 5, 4, 4, 3...
## $ MKOOPKLA <dbl> 3, 4, 4, 4, 3, 3, 5, 3, 3, 7, 7, 4, 3, 4, 3, 3, 4, 2, 6, 1...
## $ PWAPART  <dbl> 0, 2, 2, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 2, 2, 0, 2, 2, 0...
## $ PWABEDR  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ PWALAND  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ PPERSAUT <dbl> 6, 0, 6, 6, 0, 6, 6, 0, 5, 0, 6, 5, 6, 0, 5, 0, 0, 6, 5, 0...
## $ PBESAUT  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ PMOTSCO  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ PVRAAUT  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ PAANHANG <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ PTRACTOR <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ PWERKT   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ PBROM    <dbl> 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3...
## $ PLEVEN   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ PPERSONG <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ PGEZONG  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ PWAOREG  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ PBRAND   <dbl> 5, 2, 2, 2, 6, 0, 0, 0, 0, 3, 0, 2, 0, 0, 2, 4, 0, 3, 2, 0...
## $ PZEILPL  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ PPLEZIER <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ PFIETS   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ PINBOED  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ PBYSTAND <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ AWAPART  <dbl> 0, 2, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0...
## $ AWABEDR  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ AWALAND  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ APERSAUT <dbl> 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0...
## $ ABESAUT  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ AMOTSCO  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ AVRAAUT  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ AAANHANG <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ ATRACTOR <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ AWERKT   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ ABROM    <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1...
## $ ALEVEN   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ APERSONG <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ AGEZONG  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ AWAOREG  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ ABRAND   <dbl> 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0...
## $ AZEILPL  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ APLEZIER <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ AFIETS   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ AINBOED  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ ABYSTAND <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Purchase <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No...


(a) train/test Split

Q: Create a training set consisting of the first 1,000 observations, and a test set consisting of the remaining observations.

A:

In this scenario, train has far fewer rows than test, which I thought was quite unusual.

train <- Caravan[1:1000, ]
test <- Caravan[1001:nrow(Caravan), ]

nrow(train) # 1,000
## [1] 1000
nrow(test) # 4,822
## [1] 4822

Note that there are quite a few low-variance/zero-variance predictors in train. Three examples:

table(AVRAAUT = train$AVRAAUT)
## AVRAAUT
##    0 
## 1000
table(PVRAAUT = train$PVRAAUT)
## PVRAAUT
##    0 
## 1000
table(AZEILPL = train$AZEILPL)
## AZEILPL
##   0   1 
## 999   1

Enough so, in fact, that I wrote the code below to calculate, for every predictor in train, the proportion of observations taking its single most common value. I decided that if a variable is more than 99% one value (e.g. 99% of its values are 0), it is probably safe to exclude it.

prop_1_value_df <- sapply(select(train, -Purchase), function(x) round(max(prop.table(table(x, useNA = "ifany"))), 5)) %>%
  reshape2::melt() %>%
  rownames_to_column("varname") %>%
  rename("prop_1_value" = "value") %>%
  arrange(desc(prop_1_value)) %>%
  mutate(index = 1:nrow(.)) 

near_zero_variance <- prop_1_value_df$varname[prop_1_value_df$prop_1_value >= 0.99]

ggplot(prop_1_value_df, aes(x = index, y = prop_1_value)) +
  geom_line(col = "steelblue", size = 1) + 
  geom_hline(yintercept = 0.99) + 
  geom_vline(xintercept = length(near_zero_variance)) +
  scale_y_continuous(labels = scales::percent_format()) + 
  labs(title = "Caravan 'train' Dataset", 
       subtitle = paste0(length(near_zero_variance), " vars removed using a 99% threshold"), 
       x = "Variable Index", 
       y = "Percentage taking 1 value")

Even this strict exclusion criterion ended up removing close to 25% of the predictors, but I figured these would be the ones least likely to be useful in modelling. My primary reason for this step was to silence the constant ‘zero-variance predictor’ messages from the modelling packages used on this dataset, but improved speed and predictive power are also likely to result from it.
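As an aside, caret ships with nearZeroVar(), which implements a similar filter out of the box. A quick sketch of how it could be used instead (the freqCut/uniqueCut thresholds here are illustrative, chosen to roughly mimic my 99% rule; I did not use this in what follows):

nzv_idx <- caret::nearZeroVar(select(train, -Purchase), freqCut = 99, uniqueCut = 2)
names(select(train, -Purchase))[nzv_idx] # candidate predictors to drop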

I remove the predictors from both train and test:

train <- train %>%
  select(-all_of(near_zero_variance))

test <- test %>%
  select(-all_of(near_zero_variance))

I did a quick check and this step didn’t negatively impact boosting performance whatsoever.


(b) Variable Importance

Q: Fit a boosting model to the training set with Purchase as the response and the other variables as predictors. Use \(1,000\) trees, and a shrinkage value of \(0.01\). Which predictors appear to be the most important?

A:

As previously, I’ll keep interaction.depth = 2 and n.minobsinnode = 10 as reasonable initial values.

gbm_grid <- expand.grid(n.trees = 1000, shrinkage = 0.01, 
                        interaction.depth = 2, n.minobsinnode = 10)

set.seed(5)

model_gbm <- train(Purchase ~ ., 
                   data = train, 
                   method = "gbm",
                   distribution = "bernoulli", 
                   verbose = F,
                   trControl = trainControl(method = "none", summaryFunction = defaultSummary, classProbs = T), 
                   tuneGrid = gbm_grid)

Below are the predictors in descending order of importance. Note that some predictors were excluded from the graph (zero variable importance; not used by the model). The number of such excluded predictors is considerably lower than it would have been without the near-zero-variance exclusions.

# ggplot version:
gbm_importance <- summary(model_gbm$finalModel, plotit = F) %>%
  rename("Importance" = "rel.inf")

gbm_importance %>%
  filter(Importance > 0) %>%
  ggplot(aes(x = reorder(var, Importance), y = Importance, fill = Importance)) + 
  geom_bar(stat = "identity") + 
  geom_text(aes(label = round(Importance, 2), col = Importance), hjust = -0.2) + 
  scale_y_continuous(limits = c(0, 10.5)) +
  scale_fill_gradient(low = "grey40", high = "#28B463") + 
  scale_color_gradient(low = "grey40", high = "#28B463") +
  coord_flip() + 
  theme(legend.position = "none") + 
  labs(title = "Caravan Dataset - Boosting - Variable Importance", 
       subtitle = paste0(sum(gbm_importance$Importance == 0), " zero-importance variables removed"),
       x = "Varname")

The most important predictor, by a reasonable margin, is PPERSAUT. Typing ?ISLR::Caravan in the console points us to the dataset documentation, where we can see that PPERSAUT is defined as “Contribution car policies”.


(c) Precision - Boosting vs KNN vs Logistic Regression

Q: Use the boosting model to predict the response on the test data. Predict that a person will make a purchase if the estimated probability of purchase is greater than 20%. Form a confusion matrix. What fraction of the people predicted to make a purchase do in fact make one? How does this compare with the results obtained from applying KNN or logistic regression to this dataset?

A:

The metric we are being asked to use is basically precision (the proportion of predictions for the positive class that were correct):

\[Precision = \frac{TP}{TP + FP}\]

The only difference is that the probability threshold (above which we make a positive prediction, Purchase = "Yes") is 0.2 instead of 0.5. This metric is easy to calculate on a simple test set, but it requires writing a custom metric for caret::train() if we want a cross-validation estimate of it, and want to use that estimate to tune the model hyper-parameters (e.g. \(k\) in KNN) without touching test. I define custom_precision here:

custom_precision <- function (data, lev = NULL, model = NULL){ 
  actual <- data[ ,"obs"]
  probs <- data[ ,"Yes"] # P(y = "Yes") (when setting 'classProbs = TRUE', column automatically created)
  predicted <- factor(ifelse(probs > 0.2, "Yes", "No"), levels = levels(actual)) # most important line: the 0.2 threshold
  out <- caret::precision(data = predicted, reference = actual, relevant = "Yes")
  names(out) <- c("Precision [thresh = 0.2]")
  out
}

The question isn’t directly asking us to use this as a target metric (only to report it on test), but I think this is good practice for properly implementing custom metrics in caret::train() cross-validation:


Boosting:

I fit a gbm using the same parameters as in part (b):

gbm_grid <- expand.grid(n.trees = 1000, shrinkage = 0.01, 
                        interaction.depth = 2, n.minobsinnode = 10)

ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3, summaryFunction = custom_precision, classProbs = T)

set.seed(5)

model_gbm <- train(Purchase ~ ., 
                   data = train, 
                   method = "gbm",
                   distribution = "bernoulli", 
                   verbose = F,
                   metric = "Precision [thresh = 0.2]",
                   maximize = T,
                   trControl = ctrl, 
                   tuneGrid = gbm_grid)

Here’s the cross-validation precision:

model_gbm$results$`Precision [thresh = 0.2]` %>% round(4) # 0.1703
## [1] 0.1703

Here’s the confusion matrix for test:

# releveling purely so that the confusion matrix entries are in the standard order:
predicted <- factor(ifelse(predict(model_gbm, test, "prob")$Yes > 0.2, "Yes", "No")) %>% relevel(ref = "Yes")
actual <- test$Purchase %>% relevel(ref = "Yes")

conf_matrix <- table(predicted, actual)
conf_matrix
##          actual
## predicted  Yes   No
##       Yes   37  164
##       No   252 4369

Here’s the resulting test precision:

precision <- conf_matrix[1, 1] / (conf_matrix[1, 1] + conf_matrix[1, 2]) # precision = TP / (TP + FP)
precision %>% round(4) # 0.1841 (0.1840796)
## [1] 0.1841
## check:
# caret::precision(data = factor(ifelse(predict(model_gbm, test, "prob")$Yes > 0.2, "Yes", "No")),
#                  reference = test$Purchase,
#                  relevant = "Yes") # 0.1840796


KNN:

Comparing to KNN:

set.seed(5)

model_knn <- train(Purchase ~ ., 
                   data = train, 
                   method = "knn", 
                   preProcess = c("center", "scale"), 
                   trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, summaryFunction = custom_precision, classProbs = T, returnResamp = "all"), 
                   metric = "Precision [thresh = 0.2]", 
                   maximize = T, 
                   tuneGrid = expand.grid(k = seq(1, 45, 2))) # tuneGrid
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.

There are some NA estimates, which result from a combination of factors that lead to zero positive predictions and therefore an undefined precision metric (small dataset, imbalanced target, the metric being used, large ‘k’ in KNN). Details can be found in the notes below, but they are not particularly important:

# NA's occur where the precision metric is undefined, i.e. where there are no positive predictions
# As 'k' in KNN gets very large, it becomes less and less likely that more than 20% of these k nearest neighbors for an observation will have a "yes" purchase value (<6% of train$Purchase takes the value "yes")
# This will mean an observation is less and less likely to have a predicted probability above 0.2
# This also means increasingly few positive predictions (Purchase = "yes") as k increases
# If there are *no* positive predictions, Precision = TP / (TP + FP) = 0 / 0 = NA (undefined, anyway)
# You can verify this is the issue by reducing the threshold for a positive prediction in the custom metric to 0.1 (gives 0 NA's), or accentuating the problem by increasing it to 0.3 (gives 640 NA's)
# Too many NA estimates can be problematic for the accuracy of our precision estimate

# model_knn$resample %>%
#   group_by(k) %>%
#   summarize(n = n(),
#             num_NA = sum(is.na(`Precision [thresh = 0.2]`))) # only occur in some estimates for high values of k (to be expected)

sum(is.na(model_knn$resample$`Precision [thresh = 0.2]`)) # 31
## [1] 31
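To make the undefined-precision point concrete, here is a tiny arithmetic sketch of what happens in a resample with zero positive predictions:

tp <- 0; fp <- 0 # a resample where every prediction is "No": no true or false positives
tp / (tp + fp)   # 0/0 = NaN, i.e. precision is undefined
is.na(NaN)       # TRUE, which is why these resamples appear in the is.na() count above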

The selection of \(k\) using this custom metric and cross-validation is shown below:

data.frame(k = model_knn$results$k,
           CV_Precision = model_knn$results$`Precision [thresh = 0.2]`) %>%
  mutate(max_CV_Precision = as.numeric(max(CV_Precision) == CV_Precision)) %>%
  ggplot(aes(x = k, y = CV_Precision)) +
  geom_line(col = "grey55") +
  geom_point(size = 2, aes(col = factor(max_CV_Precision))) +
  scale_x_continuous(breaks = seq(1, 45, 2), minor_breaks = NULL) +
  scale_y_continuous(labels = scales::comma_format()) +
  scale_color_manual(values = c("deepskyblue3", "green")) +
  theme(legend.position = "none") +
  labs(title = "Caravan Dataset - KNN",
       subtitle = "Selecting k with cross-validation",
       y = "CV Precision")

For \(k\) = 37, the cross-validation Precision:

max(model_knn$results$`Precision [thresh = 0.2]`) %>% round(4) # 0.2403
## [1] 0.2403

The confusion matrix for test:

predicted <- factor(ifelse(predict(model_knn, test, "prob")$Yes > 0.2, "Yes", "No")) %>% relevel(ref = "Yes")
actual <- test$Purchase %>% relevel(ref = "Yes")

conf_matrix <- table(predicted, actual)
conf_matrix
##          actual
## predicted  Yes   No
##       Yes   14   56
##       No   275 4477

Here’s the resulting test precision:

precision <- conf_matrix[1, 1] / (conf_matrix[1, 1] + conf_matrix[1, 2]) # precision = TP / (TP + FP)
precision %>% round(4) # 0.2
## [1] 0.2

Notice how, in optimizing for precision without also penalizing low recall, we seem to be selecting for a model that makes very few positive predictions.


Logistic Regression:

Here I compare to Logistic Regression (no regularization or feature selection):

set.seed(5)

model_logistic <- train(Purchase ~ ., 
                        data = train, 
                        method = "glm", 
                        metric = "Precision [thresh = 0.2]", 
                        maximize = T, 
                        family = "binomial", 
                        trControl = ctrl)

The cross-validation Precision:

model_logistic$results$`Precision [thresh = 0.2]` %>% round(4) # 0.1324
## [1] 0.1324

The confusion matrix for test:

predicted <- factor(ifelse(predict(model_logistic, test, "prob")$Yes > 0.2, "Yes", "No")) %>% relevel(ref = "Yes")
actual <- test$Purchase %>% relevel(ref = "Yes")

conf_matrix <- table(predicted, actual)
conf_matrix
##          actual
## predicted  Yes   No
##       Yes   48  272
##       No   241 4261

Here’s the resulting test precision:

precision <- conf_matrix[1, 1] / (conf_matrix[1, 1] + conf_matrix[1, 2]) # precision = TP / (TP + FP)
precision %>% round(4) # 0.15
## [1] 0.15


Boosting (Tuned):

I can tune the gbm further to optimize for precision, which results in the ‘highest-performance’ model so far. However, it has the same drawback as the tuned KNN model, in that it makes very few positive predictions:

gbm_grid <- expand.grid(n.trees = seq(1500, 3000, 100), shrinkage = 0.001, 
                        interaction.depth = 2, n.minobsinnode = 10)

ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3, summaryFunction = custom_precision, classProbs = T, returnResamp = "all")

set.seed(5)

model_gbm <- train(Purchase ~ ., 
                   data = train, 
                   method = "gbm",
                   distribution = "bernoulli", 
                   verbose = F,
                   metric = "Precision [thresh = 0.2]",
                   maximize = T,
                   trControl = ctrl, 
                   tuneGrid = gbm_grid)


data.frame(n_trees = model_gbm$results$n.trees,
           CV_Precision = model_gbm$results$`Precision [thresh = 0.2]`) %>%
  mutate(max_CV_Precision = as.numeric(max(CV_Precision) == CV_Precision)) %>%
  ggplot(aes(x = n_trees, y = CV_Precision)) +
  geom_line(col = "grey55") +
  geom_point(size = 2, aes(col = factor(max_CV_Precision))) +
  scale_x_continuous(labels = scales::comma_format()) + 
  scale_color_manual(values = c("deepskyblue3", "green")) +
  theme(legend.position = "none") +
  labs(title = "Caravan Dataset - Boosting",
       subtitle = "Selecting number of trees with cross-validation",
       y = "CV Precision")

The confusion matrix for test:

predicted <- factor(ifelse(predict(model_gbm, test, "prob")$Yes > 0.2, "Yes", "No")) %>% relevel(ref = "Yes")
actual <- test$Purchase %>% relevel(ref = "Yes")

conf_matrix <- table(predicted, actual)
conf_matrix
##          actual
## predicted  Yes   No
##       Yes   22   44
##       No   267 4489

Here’s the resulting test precision:

precision <- conf_matrix[1, 1] / (conf_matrix[1, 1] + conf_matrix[1, 2]) # precision = TP / (TP + FP)
precision %>% round(4) # 0.3333
## [1] 0.3333
## check:
# caret::precision(data = factor(ifelse(predict(model_gbm, test, "prob")$Yes > 0.2, "Yes", "No")),
#                  reference = test$Purchase,
#                  relevant = "Yes") # 0.3333333

This shows how optimizing for precision can result in a model that makes only the positive predictions it is most confident in. This is clearest in the extreme case: a model making only one (correct) positive prediction would have a precision of 1.

This property might be desirable in some scenarios, but if not, we could adjust the target metric to one that values precision while simultaneously penalizing low recall (Area Under the Precision-Recall Curve, F1 Score, etc.).
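As an illustration of how such a metric would penalize this kind of model, the F1 score can be computed directly from the tuned boosting confusion matrix above; the very low recall drags it down despite the comparatively high precision:

precision <- conf_matrix[1, 1] / sum(conf_matrix[1, ]) # TP / (TP + FP) = 22 / 66
recall    <- conf_matrix[1, 1] / sum(conf_matrix[, 1]) # TP / (TP + FN) = 22 / 289
2 * precision * recall / (precision + recall)          # F1 is low because recall is low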




12. APPLIED: The Boston Dataset (Bagging, Random Forests & Boosting vs Other Approaches)

Q: Apply boosting, bagging, and random forests to a data set of your choice. Be sure to fit the models on a training set and to evaluate their performance on a test set. How accurate are the results compared to simple methods like linear or logistic regression? Which of these approaches yields the best performance?

A:

I chose to work with the Boston dataset, which we have worked with several times already:

glimpse(Boston)
## Rows: 506
## Columns: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.088...
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5...
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87,...
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.5...
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.6...
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9...
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9...
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4,...
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2,...
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396...
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17...
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9,...

As in q11. of my chapter 6 solutions, I will be predicting crim - the per capita crime rate by town. I won’t be using a train/test split, but I will evaluate the performance of all models using repeated cross-validation, so the principle is the same. All the models below were tuned locally to reduce run times. My target metric is MSE, as this was the one used in chapter 6.


Linear Regression (Best-Subset Selection, No Transformations):

In chapter 6 (q11.) this was the best linear model selected from the unaltered dataset:

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10 , summaryFunction = custom_regression_metrics)

set.seed(1)

model_linear_1 <- train(crim ~ zn + nox + dis + rad + ptratio + black + medv, 
                        data = Boston, 
                        method = "lm", 
                        trControl = ctrl)

model_linear_1$results$MSE %>% round(2) # 42.64
## [1] 42.64


Linear Regression (Best-Subset Selection, Manual Transformations):

Again from chapter 6 (q11.), where a linear model with manual transformations improved performance significantly:

Boston_1 <- Boston

Boston_1$medv_sq <- Boston_1$medv^2
Boston_1$medv_cub <- Boston_1$medv^3

Boston_1$rad_cat <- ifelse(Boston_1$rad > 20, 1, 0)
Boston_1$rad <- NULL
Boston_1$tax <- NULL


set.seed(1)

model_linear_2 <- train(crim ~ nox + dis + ptratio + medv + medv_sq + medv_cub + rad_cat, 
                        data = Boston_1, 
                        method = "lm", 
                        trControl = ctrl)

model_linear_2$results$MSE %>% round(2) # 35.54
## [1] 35.54

The numbers are slightly different for these two models when compared with chapter 6 (due to different CV folds).


Generalized Additive Models:

The reason for testing the transformations above was that I felt many of the non-linear relationships present in the Boston dataset weren’t being captured by a purely linear fit.

The natural next step for me was to compare the performance of my manually transformed linear model to a GAM (with smooth terms for every numeric predictor that has at least 10 unique values).

I was surprised to see that, while performance is better, the gap is small:

set.seed(1)

model_smoothspline <- train(crim ~ .,
                            data = Boston,
                            method = "gamSpline",
                            maximize = F, 
                            metric = "MSE",
                            trControl = ctrl, 
                            tuneGrid = data.frame(df = 4.9))

model_smoothspline$results$MSE %>% round(2) # 35.07
## [1] 35.07


Bagging:

Bagging does better than standard linear regression without predictor transformations, but it underperforms relative to the more flexible approaches above (the transformed linear model and the GAM):

set.seed(1)

model_bagging <- train(crim ~ .,
                       data = Boston,
                       method = "rf", 
                       metric = "MSE",
                       maximize = F,
                       trControl = ctrl, 
                       tuneGrid = expand.grid(mtry = 13))

model_bagging$results$MSE %>% round(2) # 39.64
## [1] 39.64


Random Forest:

Random forests give the best performance so far:

set.seed(1)

model_rf <- train(crim ~ .,
                  data = Boston,
                  method = "rf", 
                  metric = "MSE",
                  maximize = F,
                  trControl = ctrl, 
                  tuneGrid = expand.grid(mtry = 2))

model_rf$results$MSE %>% round(2) # 33.54
## [1] 33.54


Boosting:

Boosting performance was good but didn’t manage to beat Random Forests:

gbm_grid <- expand.grid(n.trees = 3400, interaction.depth = 5, n.minobsinnode = 10, shrinkage = 0.001)

set.seed(1)

model_gbm <- train(crim ~ .,
                   data = Boston,
                   method = "gbm",
                   distribution = "gaussian", 
                   verbose = F,
                   metric = "MSE",
                   maximize = F,
                   trControl = ctrl, 
                   tuneGrid = gbm_grid)


model_gbm$results$MSE %>% round(2) # 34.35
## [1] 34.35


Comparison:

A summary can be seen below. Standard linear regression (without transformations) and bagged trees performed poorly, but the other methods seem reasonably competitive:

model_summary_resamples <- data.frame(method = "Linear Reg.", MSE = model_linear_1$resample$MSE) %>%
  rbind(data.frame(method = "Linear Reg. (Transformations)", MSE = model_linear_2$resample$MSE)) %>%
  rbind(data.frame(method = "GAM (Smooth Terms)", MSE = model_smoothspline$resample$MSE)) %>%
  rbind(data.frame(method = "Bagged Trees", MSE = model_bagging$resample$MSE)) %>%
  rbind(data.frame(method = "Random Forest", MSE = model_rf$resample$MSE)) %>%
  rbind(data.frame(method = "GBM", MSE = model_gbm$resample$MSE)) 

model_summary <- model_summary_resamples %>%
  group_by(method) %>%
  summarize(MSE = mean(MSE)) %>%
  arrange(MSE) %>%
  mutate(MSE = round(MSE, 2))

model_summary
## # A tibble: 6 x 2
##   method                          MSE
##   <chr>                         <dbl>
## 1 Random Forest                  33.5
## 2 GBM                            34.4
## 3 GAM (Smooth Terms)             35.1
## 4 Linear Reg. (Transformations)  35.5
## 5 Bagged Trees                   39.6
## 6 Linear Reg.                    42.6

As I used 10 repeats of 10-fold cross-validation for each model, each model has 100 out-of-sample performance estimates. By setting the same random seed before each call to train(), the cross-validation folds are also identical across models, so the estimates are paired (a quick check of this is sketched after the plot below). I visualize these out-of-sample estimates using the modified box plots below, ordered from best (top) to worst (bottom) performance:

ggplot(model_summary_resamples, aes(x = factor(method, ordered = T, levels = rev(model_summary$method)), y = MSE)) + 
  geom_boxplot(outlier.shape = NA, alpha = 0.3, aes(fill = method)) +
  geom_jitter(alpha = 0.2, position = position_jitter(w = 0.05), col = "grey20") +
  theme(legend.position = "none", axis.title.y = element_blank()) + 
  coord_flip() + 
  labs(title = "Boston Dataset - Model Resample Performance - MSE Comparison")