Contents:
- Boston Dataset (Random Forests)
- Carseats Dataset (Regression Trees, Random Forests)
- OJ Dataset (Classification Trees)
- Hitters Dataset (Boosting)
- Caravan Dataset (Boosting)
- Boston Dataset (Bagging, Random Forests & Boosting vs Other Approaches)

NOTE: There are no official solutions for these questions. These are my solutions and could be incorrect. If you spot any mistakes/inconsistencies, please contact me at Liam95morgan@gmail.com, or via LinkedIn.
Some of the figures in this presentation are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani
library(tidyverse)
library(MASS) # Boston data
library(ISLR) # Carseats data
library(randomForest) # random forests
library(tree) # trees
library(caret)
library(gridExtra)
library(gbm)
library(gam)
select <- dplyr::select
RNGkind(kind = "Mersenne-Twister",
        normal.kind = "Inversion",
        sample.kind = "Rejection") # diff versions of R/RMD using different sampling types was annoying me

# custom summaryFunction for caret::train, with MSE & RSS added:
custom_regression_metrics <- function (data, lev = NULL, model = NULL) {
  c(RMSE = sqrt(mean((data$obs - data$pred)^2)),
    Rsquared = summary(lm(pred ~ obs, data))$r.squared,
    MAE = mean(abs(data$obs - data$pred)),
    MSE = mean((data$obs - data$pred)^2),
    RSS = sum((data$obs - data$pred)^2))
}
theme_set(theme_light())
Q: Draw an example (of your own invention) of a partition of two-dimensional feature space that could result from recursive binary splitting. Your example should contain at least six regions. Draw a decision tree corresponding to this partition. Be sure to label all aspects of your figures, including the regions \(R_1, R_2, \dots\), the cutpoints \(t_1, t_2, \dots\), and so forth.
Hint: Your result should look something like Figures 8.1 and 8.2.
A:
I created the example below that illustrates recursive binary splitting in the context of growing a classification tree.
Consider a situation where a number of identical plants were given different volumes of water (\(X_1\)), and fertilizer (\(X_2\)), and their growth was measured in some time-frame following. This growth is captured by the categorical target variable \(Y\), which takes the value \(Yes\) if the plant grew > 1cm, else \(No\).
Below is a rough example of the two-dimensional feature space that might result from growing a decision tree predicting \(Y\) using \(X_1\) and \(X_2\):
Here is the corresponding decision tree:
Notice how high levels of fertilizer are toxic regardless of the water level and therefore limit growth, but generally a balance between fertilizer and water is most productive for growth. That is to say, the degree to which fertilizer (\(X_2\)) impacts growth (\(Y\)) depends on the level of water (\(X_1\)).
The fact that the classification tree captures the above relationship (without the need to specify an interaction term) nicely illustrates how decision trees inherently model variable interactions as a consequence of their structure (although sometimes explicitly specifying interaction terms can improve predictive power). This property extends to ensembles of decision trees of sufficient depth.
Q: It is mentioned in Section 8.2.3 that boosting using depth-one trees (or stumps) leads to an additive model: that is, a model of the form
\[f(X) = \sum_{j = 1}^p f_j(X_j)\]
Explain why this is the case. You can begin with (8.12) in Algorithm 8.2.
A:
Algorithm 8.2 and the remaining text on boosting are focused on boosting regression trees, so I presume this question is as well.
Firstly, recall that a decision tree with \(M\) terminal nodes (\(M-1\) splits) can be expressed as:
\[f(X) = \sum_{m = 1}^M c_m \cdot 1_{(X \in R_m)}\]
where, in the regression tree context, \(c_m\) is the mean response value for the observations in region \(m\).
A stump is a tree with just \(M = 2\) terminal nodes (one split), so if we have a decision stump in which a single split occurs on the variable \(X_j\) (at value \(s\)), we can simplify this representation:
\[\begin{align} f(X) & = c_1 \cdot 1_{(X \in R_1)} + c_2 \cdot 1_{(X \in R_2)} \\ & = \begin{cases} c_1 && \text{if } X_j < s\\ c_2 & & \text{if } X_j \ge s \end{cases} \\ & = c_1 \cdot I(X_j < s) + c_2 \cdot I(X_j \ge s) \end{align}\]
With boosted stumps we are not dealing with a single regression stump but \(B\) of them, so I will use the following notation: for the \(b\)th stump, \(X_{j_b}\) is the variable it splits on, \(s_b\) is the cutpoint, and \(\alpha_b\), \(\beta_b\) are the predicted values in the two resulting regions.
For \(b = 1, 2, \dots, B\), the stumps will therefore take the form:
\[\hat{f}^b(X) = \alpha_b \cdot I(X_{j_b} < s_b) + \beta_b \cdot I(X_{j_b} \ge s_b)\]
Proceeding through Algorithm 8.2, we first set \(\hat{f}(X) = 0\) and \(r_i = y_i\) (\(\forall i \in \{1, 2, \dots, n\}\)).
\(b = 1\):
\[\hat{f}^1(X) = \alpha_1 \cdot I(X_{j_1} < s_1) + \beta_1 \cdot I(X_{j_1} \ge s_1)\]
\[\begin{align} \hat{f}(X) & \leftarrow \hat{f}(X) + \lambda \hat{f}^1(X) \\ & = \lambda \hat{f}^1(X) \\ \end{align}\]
\[\begin{align} r_i & \leftarrow r_i - \lambda \hat{f}^1(X) \\ & = y_i - \lambda \hat{f}^1(X) \\ \end{align}\]
\(b = 2\):
\[\hat{f}^2(X) = \alpha_2 \cdot I(X_{j_2} < s_2) + \beta_2 \cdot I(X_{j_2} \ge s_2)\]
\[\begin{align} \hat{f}(X) & \leftarrow \hat{f}(X) + \lambda \hat{f}^2(X) \\ & = \lambda \hat{f}^1(X) + \lambda \hat{f}^2(X)\\ \end{align}\]
\[\begin{align} r_i & \leftarrow r_i - \lambda \hat{f}^2(X) \\ & = y_i - \lambda \hat{f}^1(X) - \lambda \hat{f}^2(X)\\ \end{align}\]
\(\vdots\)
\(b = B\):
\[\hat{f}^B(X) = \alpha_B \cdot I(X_{j_B} < s_B) + \beta_B \cdot I(X_{j_B} \ge s_B)\]
\[\begin{align} \hat{f}(X) & \leftarrow \hat{f}(X) + \lambda \hat{f}^B(X) \\ & = \lambda \hat{f}^1(X) + \lambda \hat{f}^2(X) + \dots + \lambda \hat{f}^B(X) \\ & = \sum_{b = 1}^B \lambda\hat{f}^b(X) \end{align}\]
\[\begin{align} r_i & \leftarrow r_i - \lambda \hat{f}^B(X) \\ & = y_i - \lambda \hat{f}^1(X) - \lambda \hat{f}^2(X) - \dots - \lambda \hat{f}^B(X) \\ & = y_i - \sum_{b = 1}^B \lambda\hat{f}^b(X) \end{align}\]
From the final stump, we have our final updated boosting model \(\hat{f}\), given by:
\[\begin{align} \hat{f}(X) & = \sum_{b = 1}^B \lambda\hat{f}^b(X) \\ & = \lambda \sum_{b = 1}^B \left[ \alpha_b \cdot I(X_{j_b} < s_b) + \beta_b \cdot I(X_{j_b} \ge s_b) \right] \\ \end{align}\]
Since \(X_{j_1}, X_{j_2}, \dots, X_{j_B} \in \{ X_1, X_2, \dots, X_p\}\), we can see that this estimate \(\hat{f}\) is indeed an additive model of the form \(f(X) = \sum_{j = 1}^p f_j(X_j)\).
Further Justification:
To show this in more concrete terms, imagine that \(B = 50\) and \(p = 10\), so \(X = (X_1, X_2, \dots, X_{10})\). Suppose that the boosting algorithm split on \(X_5\) when \(b \in \{ 1, 22, 47\}\).
In terms of the previous notation, this means that \(X_{j_1} = X_{j_{22}} = X_{j_{47}} = X_5\).
Hence, in \(f(X) = \sum_{j = 1}^{10} f_j(X_j)\), we would have:
\[\begin{align} f_5(X_5) & = \lambda \left[ \hat{f}^1(X) + \hat{f}^{22}(X) + \hat{f}^{47}(X) \right] \\ & = \lambda \Big[ \alpha_1 \cdot I(X_{j_1} < s_1) + \beta_1 \cdot I(X_{j_1} \ge s_1) \\ & \; \; \; + \alpha_{22} \cdot I(X_{j_{22}} < s_{22}) + \beta_{22} \cdot I(X_{j_{22}} \ge s_{22}) \\ & \; \; \; + \alpha_{47} \cdot I(X_{j_{47}} < s_{47}) + \beta_{47} \cdot I(X_{j_{47}} \ge s_{47}) \Big] \\ & = \lambda \Big[ \alpha_1 \cdot I(X_{5} < s_1) + \beta_1 \cdot I(X_{5} \ge s_1) \\ & \; \; \; + \alpha_{22} \cdot I(X_{5} < s_{22}) + \beta_{22} \cdot I(X_{5} \ge s_{22}) \\ & \; \; \; + \alpha_{47} \cdot I(X_{5} < s_{47}) + \beta_{47} \cdot I(X_{5} \ge s_{47}) \Big] \\ \end{align}\]
We could do exactly the same \(\forall j \in \{ 1, 2, \dots, 10\}\), separating out the stumps based on which variable from \(X_1, X_2, \dots, X_{10}\) they split on, combining them and expressing them as \(f_1(X_1), f_2(X_2), \dots, f_{10}(X_{10})\).
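To make this concrete in code, here is a minimal from-scratch sketch of Algorithm 8.2 restricted to stumps, on simulated data (all object and function names here are my own, not from the book or any package). Grouping each stump's shrunken contribution by the variable it split on recovers the per-variable functions \(f_j(X_j)\), and their sum is exactly the boosted fit:
set.seed(1)
n <- 200
X <- data.frame(X1 = runif(n), X2 = runif(n), X3 = runif(n))
y <- sin(2 * pi * X$X1) + 2 * (X$X2 > 0.5) + rnorm(n, sd = 0.3)

B <- 500
lambda <- 0.05
r <- y                                       # residuals, initialised to y
f_j <- matrix(0, nrow = n, ncol = ncol(X))   # running per-variable contributions f_j(X_j)

best_stump <- function(x, r) {
  # best single split on one numeric predictor, chosen by RSS
  cuts <- quantile(x, probs = seq(0.05, 0.95, 0.05))
  rss <- sapply(cuts, function(s)
    sum((r[x < s] - mean(r[x < s]))^2) + sum((r[x >= s] - mean(r[x >= s]))^2))
  s <- cuts[which.min(rss)]
  list(s = s, alpha = mean(r[x < s]), beta = mean(r[x >= s]), rss = min(rss))
}

for (b in 1:B) {
  cand <- lapply(X, best_stump, r = r)                # best stump for each predictor
  j <- which.min(sapply(cand, `[[`, "rss"))           # greedy choice of j_b
  st <- cand[[j]]
  pred_b <- ifelse(X[[j]] < st$s, st$alpha, st$beta)  # the b-th stump evaluated on the data
  f_j[, j] <- f_j[, j] + lambda * pred_b              # add lambda * f^b to f_{j_b}
  r <- r - lambda * pred_b                            # update the residuals
}

# the final boosted fit is exactly the sum of the per-variable functions:
all.equal(rowSums(f_j), y - r)   # TRUE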
Other Comments:
I believe that if this question were extended past boosted stumps, this property would no longer hold. We would still be able to express the predictions of the boosted model as a sum of the predictions of many trees; this is clear from the final step of the algorithm, where the final \(\hat{f}\) takes the form \(\hat{f}(X) = \sum_{b = 1}^B \lambda\hat{f}^b(X)\). The problem arises in expressing these trees as separate functions of the predictors \(X_1, X_2, \dots, X_p\).
For stumps this was simple; each stump could be represented by \(\hat{f}^b(X) = \alpha_b \cdot I(X_{j_b} < s_b) + \beta_b \cdot I(X_{j_b} \ge s_b)\), which is clearly a function of one variable (\(X_{j_b}\)).
A stump that split on \(X_3\) at value \(5.2\) could be represented by:
\[c_1 \cdot I(X_{3} < 5.2) + c_2 \cdot I(X_{3} \ge 5.2)\]
If that stump were instead a tree with a further split (when \(X_{3} \ge 5.2\)) on \(X_{8}\) at the value \(16.3\), it would then be represented by:
\[c_1 \cdot I(X_{3} < 5.2) + c_2 \cdot I(X_{3} \ge 5.2) \cdot I(X_{8} < 16.3) + c_3 \cdot I(X_{3} \ge 5.2) \cdot I(X_{8} \ge 16.3)\]
How could we separate this tree into separate functions of variables \(X_3\) and \(X_8\)? We couldn’t, so the additive form \(f(X) = \sum_{j = 1}^{p} f_j(X_j)\) would not be possible here.
Note that throughout the question I have focused on binary splits of numeric variables, but if \(X\) contains some (or all) categorical features this is not an issue.
For example, a stump with one split on the categorical variable \(X_j\) (at values \(v_1, v_2, v_3\)) could be written as:
\[\begin{align} f(X) & = c_1 \cdot 1_{(X \in R_1)} + c_2 \cdot 1_{(X \in R_2)} \\ & = \begin{cases} c_1 && \text{if } X_j \in \{v_1, v_2, v_3\}\\ c_2 & & \text{if } X_j \notin \{v_1, v_2, v_3\} \end{cases} \\ & = c_1 \cdot I \left( X_j \in \{v_1, v_2, v_3\} \right) + c_2 \cdot I \left( X_j \notin \{v_1, v_2, v_3\} \right) \end{align}\]
This notation could continue throughout the proof and nothing would change (the final prediction \(\hat{f}\) would still be a sum of the individual stumps \(\hat{f}^b\), and these individual stumps and therefore the prediction \(\hat{f}\) would still be separable and able to be expressed as an additive model of the form \(f(X) = \sum_{j = 1}^{p} f_j(X_j)\)).
Q: Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of \(\hat{p}_{m1}\). The \(x\)-axis should display \(\hat{p}_{m1}\), ranging from 0 to 1, and the \(y\)-axis should display the value of the Gini index, classification error, and entropy.
Hint: In a setting with two classes, \(\hat{p}_{m1} = 1 − \hat{p}_{m2}\). You could make this plot by hand, but it will be much easier to make in R.
A:
Recall from the book that \(\hat{p}_{mk}\) is the proportion of the training observations in the \(m\)th region that are from the \(k\)th class.
Classification Error:
\[\begin{align} E & = 1 - \underset{k}{\operatorname{max}} (\hat{p}_{mk}) \\ & = 1 - \max \{\hat{p}_{m1}, \hat{p}_{m2}\} \\ \end{align}\]
Gini Index:
\[\begin{align} G & = \sum_{k = 1}^K \hat{p}_{mk} (1 - \hat{p}_{mk}) \\ & = \hat{p}_{m1} (1 - \hat{p}_{m1}) + \hat{p}_{m2} (1 - \hat{p}_{m2}) \\ \end{align}\]
Entropy:
\[\begin{align} D & = - \sum_{k = 1}^K \hat{p}_{mk} \log \left( \hat{p}_{mk} \right) \\ & = - \hat{p}_{m1} \log \left( \hat{p}_{m1} \right) - \hat{p}_{m2} \log \left( \hat{p}_{m2} \right) \\ \end{align}\]
e.g. in \(R_5\) of question 1:
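The class proportions in \(R_5\) of my question 1 figure aren't reproduced here, so purely as an assumed illustration, suppose a node with \(\hat{p}_{m1} = 0.7\) (and so \(\hat{p}_{m2} = 0.3\)); the three measures are then:
p1 <- 0.7                        # assumed class-1 proportion for the illustration
p2 <- 1 - p1
1 - max(p1, p2)                  # classification error = 0.3
p1 * (1 - p1) + p2 * (1 - p2)    # Gini index = 0.42
-p1 * log(p1) - p2 * log(p2)     # entropy (natural log) ~ 0.611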
prop_class_1 <- seq(0, 1, 0.001)
prop_class_2 <- 1 - prop_class_1
classification_error <- 1 - pmax(prop_class_1, prop_class_2)
gini <- prop_class_1*(1-prop_class_1) + prop_class_2*(1-prop_class_2)
entropy <- -prop_class_1*log(prop_class_1) - prop_class_2*log(prop_class_2)
data.frame(prop_class_1, prop_class_2, classification_error, gini, entropy) %>%
pivot_longer(cols = c(classification_error, gini, entropy), names_to = "metric") %>%
ggplot(aes(x = prop_class_1, y = value, col = factor(metric))) +
geom_line(size = 1) +
scale_y_continuous(breaks = seq(0, 1, 0.1), minor_breaks = NULL) +
scale_color_hue(labels = c("Classification Error", "Entropy", "Gini")) +
labs(col = "Metric",
y = "Value",
x = "Proportion (of class '1')")
This question relates to the plots in Figure 8.12.
Q: Sketch the tree corresponding to the partition of the predictor space illustrated in the left-hand panel of Figure 8.12. The numbers inside the boxes indicate the mean of \(Y\) within each region.
A:
Q: Create a diagram similar to the left-hand panel of Figure 8.12, using the tree illustrated in the right-hand panel of the same figure. You should divide up the predictor space into the correct regions, and indicate the mean for each region.
A:
I’ve drawn \(X_1\) & \(X_2\) as continuous variables for practical reasons (although the tree splitting at the discrete values \(0, 1, 2\) makes it quite unlikely that they actually are continuous):
Q: Suppose we produce ten bootstrapped samples from a data set containing red and green classes. We then apply a classification tree to each bootstrapped sample and, for a specific value of \(X\), produce \(10\) estimates of \(P(\text{Class is Red}|X)\):
\[0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, \text{ and }0.75.\]
There are two common ways to combine these results together into a single class prediction. One is the majority vote approach discussed in this chapter. The second approach is to classify based on the average probability. In this example, what is the final classification under each of these two approaches?
A:
Majority Vote:
## [1] 6
## [1] 4
## [1] "Red"
Average Approach:
## [1] 0.45
## [1] "Green"
Q: Provide a detailed explanation of the algorithm that is used to fit a regression tree.
A:
This is basically Algorithm 8.1 in the book. Much of the algorithm is actually about selecting the best individual tree through cost complexity pruning and cross-validation. The trees in a boosting regression model will usually have the depth/number of splits pre-specified as a hyper-parameter, and in regression random forests the trees tend to be grown to full depth. Most of this algorithm is therefore about building a single tree for predictions, and only the recursive binary splitting part is relevant to these other methods.
I presume the question wants us to expand on what is going on behind each step of the algorithm, so I will try to make this explanation as detailed as possible.
Step 1: Use recursive binary splitting to grow a large tree on the training data, stopping only when a stopping criterion is reached, e.g. when each terminal node has fewer than some minimum number of observations.
Recall that growing a regression tree involves the following: we divide the predictor space into \(J\) distinct, non-overlapping regions \(R_1, R_2, \dots, R_J\), and for every observation that falls into region \(R_j\) we predict the mean of the training responses in that region. The regions are chosen to minimize the total RSS, \(\sum_{j = 1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2\).
How do we construct the regions \(R_1, R_2, \dots, R_J\)? Considering every possible partition of the feature space is computationally infeasible, so we take a top-down, greedy approach known as recursive binary splitting: at each step we test every possible cutpoint \(s\) for each predictor \(X_j\), and choose the pair \((j, s)\) whose split into the half-planes \(\{X \mid X_j < s\}\) and \(\{X \mid X_j \ge s\}\) gives the greatest reduction in RSS. This is repeated within each of the resulting regions until the stopping criterion is reached.
Step 2: Apply cost complexity pruning to the large tree \(T_0\) (from the previous step), in order to obtain a sequence of best subtrees as a function of \(\alpha\).
Growing an individual tree fully will likely overfit the training data, so our solution is to prune the large tree \(T_0\) to obtain the subtree \(T\) with the best generalization error. Obtaining the cross-validation or test error for every possible subtree would again be computationally infeasible, so we use cost complexity pruning to generate a sequence of trees.
Cost complexity pruning introduces a non-negative complexity parameter \(\alpha\); for each value of \(\alpha \ge 0\) there corresponds a subtree \(T \subseteq T_0\) that minimizes:
\[\sum_{m = 1}^{|T|} \sum_{i: \; x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T|\] where \(|T|\) is the number of terminal nodes of the tree \(T\).
This is the same \(cost + penalty\) format seen in regularization methods throughout the book (Lasso, Ridge, Smoothing Splines, etc.): the first term is the cost (the RSS of the subtree on the training data), and \(\alpha |T|\) is the penalty on subtree complexity.
For a sequence of sufficiently different complexity parameters \(\alpha_{1}, \alpha_{2}, \dots, \alpha_{m} \ge 0\), we will hopefully obtain a corresponding sequence of sufficiently different sub-trees \(T_{\alpha_{1}}, T_{\alpha_{2}}, \dots, T_{\alpha_{m}} \subseteq T_0\), so we must have some way of selecting between them.
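As a rough illustration of this sequence using the tree package (loaded earlier) on the Boston data - the object names below are my own - calling prune.tree() without best or k returns the whole cost-complexity sequence of subtree sizes and the corresponding \(\alpha\) values (reported as k):
big_tree <- tree(medv ~ ., data = Boston)   # grow a tree on Boston (default stopping criteria)
pruning_seq <- prune.tree(big_tree)         # cost-complexity pruning sequence
pruning_seq$size                            # number of terminal nodes of each subtree
pruning_seq$k                               # the corresponding values of alpha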
Step 3: Use \(K\)-fold cross-validation to choose \(\alpha\).
That is, divide the training observations into \(K\) folds. For each \(k = 1, 2, \dots, K\):
- Repeat Steps 1 and 2 on all but the \(k\)th fold of the training data.
- Evaluate the mean squared prediction error on the held-out \(k\)th fold, as a function of \(\alpha\).
Average the \(K\) estimates to obtain the cross-validation MSE for this value of \(\alpha\). Repeat these steps to get this estimate for each value of \(\alpha\). Select the \(\alpha_{min}\) that minimizes this cross-validation error:
data <- data.frame(alpha = c('$\\alpha_1$', '$\\alpha_2$', '$\\vdots$', '$\\alpha_{min}$', '$\\vdots$', '$\\alpha_m$'),
mse = c('$MSE_{\\alpha_1}$', '$MSE_{\\alpha_2}$', '$\\vdots$', '$MSE_{\\alpha_{min}}$', '$\\vdots$', '$MSE_{\\alpha_m}$'),
stringsAsFactors = F)
colnames(data) <- c('$\\alpha$', '$MSE$')
library(knitr)
library(kableExtra)
kable(data, row.names = F) %>%
kable_styling(bootstrap_options = "striped", full_width = F) %>%
row_spec(4, background = "lightgreen")
\(\alpha\) | \(MSE\) |
---|---|
\(\alpha_1\) | \(MSE_{\alpha_1}\) |
\(\alpha_2\) | \(MSE_{\alpha_2}\) |
\(\vdots\) | \(\vdots\) |
\(\alpha_{min}\) | \(MSE_{\alpha_{min}}\) |
\(\vdots\) | \(\vdots\) |
\(\alpha_m\) | \(MSE_{\alpha_m}\) |
Step 4: Return the subtree \(T_{\alpha_{min}}\) from Step 2 that corresponds to the chosen value of \(\alpha\) (this is our final model).
I find it easier to think of steps 1-4 as just two steps (in a different order):
Boston Dataset (Random Forests)
Q: In the lab, we applied random forests to the Boston data using mtry = 6 and using ntree = 25 and ntree = 500. Create a plot displaying the test error resulting from random forests on this data set for a more comprehensive range of values for mtry and ntree. You can model your plot after Figure 8.10. Describe the results obtained.
A:
I will define the same train/test datasets as in the labs:
set.seed(1, sample.kind = "Rounding") # allows us to match the same train/test split as in the labs
train_index <- sample(1:nrow(Boston), nrow(Boston) / 2)
train <- Boston[train_index, ]
test <- Boston[-train_index, ]
Here is the graph showing how the test MSE changes for each additional tree added to the forest (for different values of mtry):
test_MSE <- c()
OOB_MSE <- c()
mtry <- c()
i <- 1
set.seed(1)
for (Mtry in seq(1, 13, 2)) {
rf_temp <- randomForest(y = train$medv,
x = train[ ,-14],
ytest = test$medv,
xtest = test[ ,-14],
mtry = Mtry,
ntree = 500)
test_MSE[[i]] <- rf_temp$test$mse
OOB_MSE[[i]] <- rf_temp$mse
mtry[[i]] <- rep(Mtry, 500)
i <- i + 1
}
rf_summary <- data.frame(ntree = rep(1:500, length(seq(1, 13, 2))),
mtry = unlist(mtry),
test_MSE = unlist(test_MSE),
OOB_MSE = unlist(OOB_MSE))
# test MSE:
ggplot(rf_summary, aes(x = ntree, y = test_MSE, col = factor(mtry))) +
geom_line() +
geom_vline(xintercept = 300, col = "grey55") +
coord_cartesian(ylim = c(10, 25)) +
labs(x = "Number of Trees",
y = "Test MSE",
col = "mtry:")
As a reference, here is the same with the OOB MSE:
# OOB MSE:
ggplot(rf_summary, aes(x = ntree, y = OOB_MSE, col = factor(mtry))) +
geom_line() +
geom_vline(xintercept = 300, col = "grey55") +
coord_cartesian(ylim = c(10, 25)) +
labs(x = "Number of Trees",
y = "OOB MSE",
col = "mtry:")
Focusing on the first graph, for all values of mtry the plot shows the test MSE starts off very high for a low number of trees (<10), which makes sense to me (the trees are grown to full depth and will overfit without a sufficient number of them). This quickly reduces as trees are added to the model.
We could probably say that the OOB MSE comfortably stabilized after a few hundred trees (say 300), and this appears to happen faster for the test MSE.
I think this is perhaps to be expected, since every one of those 300 trees will be used for prediction for each test observation when calculating the test MSE, but each train observation's OOB prediction only uses the trees for which it was out-of-bag - roughly 36.8% of them, since each bootstrap sample contains \(\approx\) 63.2% of the training observations on average (the bagging property). I verify that 63.2% figure below:
set.seed(111)
rf_temp <- randomForest(y = train$medv,
x = train[ ,-14],
ntree = 500)
# rf_temp$oob.times = number of times each obs. is OOB
# 500 - rf_temp$oob.times = number of times each obs. is *not* OOB ('in-bag')
# (500 - rf_temp$oob.times) / 500 = proportion of times each obs. is *not* OOB ('in-bag')
# mean((500 - rf_temp$oob.times) / 500) = how often each observation is 'in-bag' (on average)
mean((500 - rf_temp$oob.times) / 500)
## [1] 0.6326166
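As a quick sanity check on that figure: each bootstrap sample draws \(n\) observations with replacement, so the probability that a given observation appears at least once ('in-bag') is \(1 - (1 - 1/n)^n\), which tends to \(1 - 1/e \approx 0.632\):
n <- nrow(train)     # 253
1 - (1 - 1/n)^n      # ~0.633
1 - exp(-1)          # 0.6321206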
It is quite interesting to me that the OOB MSE and test MSE are not highly correlated (perhaps even negatively correlated), with the only thing they agree on being that selecting a totally random variable at each split (mtry = 1) is a bad idea for generalization.
The Boston dataset is only 506 rows, so there is significant variability in the optimum value of mtry based on the train/test split and the random seed used. You can test this by re-running the code for this question without setting any seeds and observing how different the results are each time (for a large dataset they would hardly change).
rf_summary %>%
filter(ntree == 300) %>%
pivot_longer(cols = c(test_MSE, OOB_MSE), names_to = "metric") %>%
ggplot(aes(x = mtry, y = value, col = factor(metric))) +
geom_line() +
geom_point() +
scale_y_continuous(breaks = seq(10, 30, 2)) +
scale_x_continuous(breaks = seq(1, 13, 2)) +
scale_color_hue(labels = c("OOB MSE", "Test MSE")) +
labs(col = "Metric:",
title = "Boston Dataset - Random Forest",
subtitle = "Optimum mtry using OOB/Test MSE (at ntree = 300)") +
theme(legend.position = "bottom",
axis.title.y = element_blank())
Nevertheless, for this particular split and seed combination, the test MSE was actually lowest for mtry = 7, although most values give good results.
The OOB error graph seems to suggest that simple bagged decision trees could be a good option (mtry = 13, which is the total number of predictors), with the lowest OOB MSE for mtry = 11.
Carseats Dataset (Regression Trees, Random Forests)
In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.
train/test Split
Q: Split the data set into a training set and a test set.
A:
Again, I’ll match the split used in the labs:
set.seed(2, sample.kind = "Rounding") # allows us to match the same train/test split as in the labs
train_index <- sample(1:nrow(Carseats), nrow(Carseats) / 2)
train <- Carseats[train_index, ] # 200
test <- Carseats[-train_index, ] # 200
Q: Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?
A:
The plot can be seen below:
We could infer from this that ShelveLoc and Price are the two most important factors in predicting car seat sales, since they appear at the top of the tree (because they provided the best split of the data).
The tree has a total of 17 terminal nodes:
##
## Regression tree:
## tree(formula = Sales ~ ., data = train)
## Variables actually used in tree construction:
## [1] "ShelveLoc" "Price" "Age" "Income" "CompPrice"
## [6] "Population" "Advertising"
## Number of terminal nodes: 17
## Residual mean deviance: 2.341 = 428.4 / 183
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.76700 -1.00900 -0.01558 0.00000 0.94900 3.58600
Predicting and evaluating the test MSE:
## [1] 4.844991
For comparison, here is the baseline test MSE (using the average train$Sales as the prediction for all new test observations):
## [1] 8.085056
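The code producing these two test MSE figures isn't shown above; a minimal sketch, using the same model formula as in the summary (the object name tree_model matches the one used in the later pruning code):
tree_model <- tree(Sales ~ ., data = train)

test_pred <- predict(tree_model, test)
mean((test_pred - test$Sales)^2)           # regression tree test MSE
mean((mean(train$Sales) - test$Sales)^2)   # baseline test MSE (predict the training mean)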
Q: Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?
A:
The selected model with the lowest cross-validation error appears to be the model with 17 terminal nodes:
set.seed(3)
cv_tree_model <- cv.tree(tree_model, K = 10)
data.frame(n_leaves = cv_tree_model$size,
CV_RSS = cv_tree_model$dev) %>%
mutate(min_CV_RSS = as.numeric(min(CV_RSS) == CV_RSS)) %>%
ggplot(aes(x = n_leaves, y = CV_RSS)) +
geom_line(col = "grey55") +
geom_point(size = 2, aes(col = factor(min_CV_RSS))) +
scale_x_continuous(breaks = seq(1, 17, 2)) +
scale_y_continuous(labels = scales::comma_format()) +
scale_color_manual(values = c("deepskyblue3", "green")) +
theme(legend.position = "none") +
labs(title = "Carseats Dataset - Regression Tree",
subtitle = "Selecting the complexity parameter with cross-validation",
x = "Terminal Nodes",
y = "CV RSS")
Since the optimal tree is the fully grown tree without any pruning, it should be clear that the test MSE is the same as in part b).
Here is how we would select a smaller sub-tree, although in this case I select the full tree with 17 terminal nodes (best = 17). The test MSE is indeed the same as in part b):
pruned_tree_model <- prune.tree(tree_model, best = 17)
test_pred <- predict(pruned_tree_model, test)
mean((test_pred - test$Sales)^2)
## [1] 4.844991
Notes on the tree package:
Something that seems unusual to me about the results of tree() is that, for regression, deviance apparently means the RSS. We would therefore expect the deviance from cv.tree() to be the cross-validation RSS, which would usually get lower as we increase the number of folds K (see my Ch.7 Q.9b solutions for why). This doesn't happen here like it does with other packages that perform cross-validation (e.g. caret with the RSS metric). This makes me think that the package is producing K out-of-fold RSS estimates and then summing them instead of averaging them. This isn't really an issue but it did strike me as quite odd.
Another thing that confuses me about this output is that the book says that for cv.tree() the output k corresponds to the complexity parameter \(\alpha\), where \(\alpha \ge 0\). How is it that our selected model therefore has an \(\alpha\) of -Inf?:
## size dev k
## 1 17 996.5117 -Inf
## 2 16 998.3865 16.25730
## 3 15 1008.3807 19.34054
## 4 14 1032.8005 21.45248
## 5 13 1032.2548 22.09786
## 6 12 1124.1770 31.59872
## 7 11 1130.3090 32.10994
## 8 10 1125.5635 35.48396
## 9 9 1142.3155 42.57770
## 10 7 1142.3155 43.06856
## 11 6 1165.2758 46.54477
## 12 5 1166.0119 47.65604
## 13 4 1147.0774 80.06393
## 14 3 1166.6474 108.73209
## 15 2 1213.4513 146.62539
## 16 1 1600.1608 406.17230
We might think that k = -Inf corresponds to \(\alpha = 0\) (no pruning), which would make sense in this particular question (as this model still has 17 terminal nodes). However, on the bottom output of page 326 you can see k can apparently take values of both -Inf and 0, which yield models with different numbers of terminal nodes (19 and 17 respectively).
This question was the only relevant question I could find online highlighting this issue. I think based on this example there must be some difference between the mathematical description of cost complexity pruning given on page 309 and its implementation in the tree package that the authors did not mention.
There are a lot of these strange inconsistencies with the tree package and I think I will investigate the more popular rpart package (directly or through caret::train(method = "rpart")) if returning to decision trees in the future.
Q: Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.
A:
We can implement bagged trees through the randomForest() function by setting mtry as the number of predictors (at each split of each tree, all variables are then candidates to split on). I leave ntree at the default (500).
set.seed(1)
bagged_trees_model <- randomForest(y = train$Sales,
x = train[ ,-1],
mtry = ncol(train) - 1, # 10
importance = T)
Interpretability has decreased, as we now have 500 trees built on different bootstrap samples of the training dataset. However, the predictive power has increased, with the test MSE reducing by ~50%.
## [1] 2.368674
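The test MSE above can be computed in the same way as before (sketch):
test_pred <- predict(bagged_trees_model, test)
mean((test_pred - test$Sales)^2)   # 2.368674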
Below are the variable importances:
importance(bagged_trees_model) %>%
as.data.frame() %>%
rownames_to_column("varname") %>%
arrange(desc(IncNodePurity))
## varname %IncMSE IncNodePurity
## 1 Price 55.76266629 493.804969
## 2 ShelveLoc 53.87451311 446.816951
## 3 CompPrice 25.47984338 173.982449
## 4 Age 12.07366106 117.502364
## 5 Advertising 13.97464644 96.929928
## 6 Income 1.72616791 71.465227
## 7 Population 1.01449985 68.297498
## 8 Education 0.08382003 37.513944
## 9 Urban -3.06299457 6.909530
## 10 US 0.14346468 5.985091
The first measure, %IncMSE, is calculated by recording the prediction error of every tree on the OOB portion of the data (for regression, the MSE). A predictor is then scrambled (permuted) and the OOB MSE is re-assessed. The difference between these MSEs is averaged across all trees (and normalized by the standard deviation of the differences).
The second measure, IncNodePurity, is the total decrease in node impurity from splitting on that variable (averaged across all the trees). Node impurity is measured by Gini and RSS for classification and regression respectively.
Whichever measure is used, we can see that for bagged trees, Price and ShelveLoc are the most important variables. This was also the conclusion from inspecting the top splits of the single regression tree produced in part b).
Q: Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of \(m\), the number of variables considered at each split, on the error rate obtained.
A:
For this dataset it seems quite clear that reducing mtry negatively affects (increases) the test MSE, with the mtry = 1 model performing about the same as a single decision tree (but without the ease of interpretability). Increasing mtry quite quickly reduces the test error, and we can observe a monotonic reduction in the test MSE as mtry increases (up to mtry = 10).
test_MSE <- c()
i <- 1
for (Mtry in 1:10) {
set.seed(1)
rf_temp <- randomForest(y = train$Sales,
x = train[ ,-1],
mtry = Mtry,
importance = T)
test_pred <- predict(rf_temp, test)
test_MSE[i] <- mean((test_pred - test$Sales)^2)
i <- i + 1
}
data.frame(mtry = 1:10, test_MSE = test_MSE) %>%
mutate(min_test_MSE = as.numeric(min(test_MSE) == test_MSE)) %>%
ggplot(aes(x = mtry, y = test_MSE)) +
geom_line(col = "grey55") +
geom_point(size = 2, aes(col = factor(min_test_MSE))) +
scale_x_continuous(breaks = seq(1, 10), minor_breaks = NULL) +
scale_color_manual(values = c("deepskyblue3", "green")) +
theme(legend.position = "none") +
labs(title = "Carseats Dataset - Random Forests",
subtitle = "Selecting 'mtry' using the test MSE",
x = "mtry",
y = "Test MSE")
The best random forest model used an mtry of 10, which makes it identical to the bagged trees model. Because I used the same random seed as in part d), the test MSE is identical:
## [1] 2.368674
… as is the variable importance:
importance(rf_temp) %>%
as.data.frame() %>%
rownames_to_column("varname") %>%
arrange(desc(IncNodePurity))
## varname %IncMSE IncNodePurity
## 1 Price 55.76266629 493.804969
## 2 ShelveLoc 53.87451311 446.816951
## 3 CompPrice 25.47984338 173.982449
## 4 Age 12.07366106 117.502364
## 5 Advertising 13.97464644 96.929928
## 6 Income 1.72616791 71.465227
## 7 Population 1.01449985 68.297498
## 8 Education 0.08382003 37.513944
## 9 Urban -3.06299457 6.909530
## 10 US 0.14346468 5.985091
OJ Dataset (Classification Trees)
This problem involves the OJ data set, which is part of the ISLR package.
## Rows: 1,070
## Columns: 18
## $ Purchase <fct> CH, CH, CH, MM, CH, CH, CH, CH, CH, CH, CH, CH, CH, ...
## $ WeekofPurchase <dbl> 237, 239, 245, 227, 228, 230, 232, 234, 235, 238, 24...
## $ StoreID <dbl> 1, 1, 1, 1, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 1, 2...
## $ PriceCH <dbl> 1.75, 1.75, 1.86, 1.69, 1.69, 1.69, 1.69, 1.75, 1.75...
## $ PriceMM <dbl> 1.99, 1.99, 2.09, 1.69, 1.69, 1.99, 1.99, 1.99, 1.99...
## $ DiscCH <dbl> 0.00, 0.00, 0.17, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00...
## $ DiscMM <dbl> 0.00, 0.30, 0.00, 0.00, 0.00, 0.00, 0.40, 0.40, 0.40...
## $ SpecialCH <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SpecialMM <dbl> 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1...
## $ LoyalCH <dbl> 0.500000, 0.600000, 0.680000, 0.400000, 0.956535, 0....
## $ SalePriceMM <dbl> 1.99, 1.69, 2.09, 1.69, 1.69, 1.99, 1.59, 1.59, 1.59...
## $ SalePriceCH <dbl> 1.75, 1.75, 1.69, 1.69, 1.69, 1.69, 1.69, 1.75, 1.75...
## $ PriceDiff <dbl> 0.24, -0.06, 0.40, 0.00, 0.00, 0.30, -0.10, -0.16, -...
## $ Store7 <fct> No, No, No, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Y...
## $ PctDiscMM <dbl> 0.000000, 0.150754, 0.000000, 0.000000, 0.000000, 0....
## $ PctDiscCH <dbl> 0.000000, 0.000000, 0.091398, 0.000000, 0.000000, 0....
## $ ListPriceDiff <dbl> 0.24, 0.24, 0.23, 0.00, 0.00, 0.30, 0.30, 0.24, 0.24...
## $ STORE <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2...
train/test Split
Q: Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
A:
Since this is my first time seeing this dataset, here is my quick overview: the OJ dataset contains 1,070 purchases of two brands of orange juice (‘Citrus Hill’ or ‘Minute Maid’), captured in the values of the Purchase variable (CH or MM). The remaining 17 predictors are characteristics of the customer, product, store, etc. Throughout this question we are basically tasked with predicting which orange juice the customer purchased based on these statistics.
I create train and test below.
set.seed(5)
train_index <- sample(1:nrow(OJ), 800)
train <- OJ[train_index, ]
test <- OJ[-train_index, ]
Q: Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?
A:
The classification tree has 7 terminal nodes and a training error rate of 18.38%.
##
## Classification tree:
## tree(formula = Purchase ~ ., data = train)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "DiscCH"
## Number of terminal nodes: 7
## Residual mean deviance: 0.7786 = 617.4 / 793
## Misclassification error rate: 0.1838 = 147 / 800
Despite there being 17 predictors in the dataset, only three were used in splits. These were:
- LoyalCH - Customer brand loyalty for CH
- PriceDiff - Sale price of MM less sale price of CH
- DiscCH - Discount offered for CH
tree() - Text Interpretation
Q: Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.
A:
I print the text output below.
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 1064.00 CH ( 0.61750 0.38250 )
## 2) LoyalCH < 0.5036 354 435.50 MM ( 0.30508 0.69492 )
## 4) LoyalCH < 0.142213 100 45.39 MM ( 0.06000 0.94000 ) *
## 5) LoyalCH > 0.142213 254 342.20 MM ( 0.40157 0.59843 )
## 10) PriceDiff < 0.235 136 153.00 MM ( 0.25000 0.75000 ) *
## 11) PriceDiff > 0.235 118 160.80 CH ( 0.57627 0.42373 ) *
## 3) LoyalCH > 0.5036 446 352.30 CH ( 0.86547 0.13453 )
## 6) LoyalCH < 0.705699 154 189.50 CH ( 0.69481 0.30519 )
## 12) PriceDiff < 0.25 85 117.50 CH ( 0.52941 0.47059 )
## 24) DiscCH < 0.15 77 106.60 MM ( 0.48052 0.51948 ) *
## 25) DiscCH > 0.15 8 0.00 CH ( 1.00000 0.00000 ) *
## 13) PriceDiff > 0.25 69 45.30 CH ( 0.89855 0.10145 ) *
## 7) LoyalCH > 0.705699 292 106.30 CH ( 0.95548 0.04452 ) *
Choosing node 11), which is a terminal node as it is marked by a *:
First the root node: 1) root 800 1064.00 CH ( 0.61750 0.38250 )
This means that, at the root node, there are 800 observations, the deviance is 1064.00, the overall prediction is CH and the split is 61.75% CH vs 38.25% MM.
We can see that, from the root node, three splits take place to produce the terminal node labelled by 11):
- LoyalCH = 0.5036
- LoyalCH = 0.142213
- PriceDiff = 0.235
1) root 800 1064.00 CH ( 0.61750 0.38250 )
2) LoyalCH < 0.5036 354 435.50 MM ( 0.30508 0.69492 )
4) LoyalCH < 0.142213 100 45.39 MM ( 0.06000 0.94000 ) *
5) LoyalCH > 0.142213 254 342.20 MM ( 0.40157 0.59843 )
10) PriceDiff < 0.235 136 153.00 MM ( 0.25000 0.75000 ) *
11) PriceDiff > 0.235 118 160.80 CH ( 0.57627 0.42373 ) *
Node 11) is therefore the subset of purchases where 0.142213 < LoyalCH < 0.5036 and PriceDiff > 0.235. The overall prediction is CH, and the node seems quite impure with 57.627% CH vs 42.373% MM.
There are 118 observations in the node, and from the percentages above we know that 68/118 are CH and 50/118 are MM (demonstrated below).
train %>%
filter(LoyalCH < 0.5036,
LoyalCH > 0.142213,
PriceDiff > 0.235) %>%
select(Purchase) %>%
table()
## .
## CH MM
## 68 50
The formula on page 325 gives the overall deviance of a classification tree as \(-2 \sum_{m} \sum_{k} n_{mk} \, \log (\hat{p}_{mk})\), where the outer sum is over the \(m\) regions (terminal nodes). We can calculate the deviance of node 11) alone using the code below:
## [1] 160.8262
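The calculation itself isn't shown above; it follows directly from the counts in the table (68 CH and 50 MM out of 118):
-2 * (68 * log(68/118) + 50 * log(50/118))   # deviance of node 11)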
tree() reports the number as 160.80, and testing with other nodes revealed this is because it's rounding the result to 4 significant figures.
Q: Create a plot of the tree, and interpret the results.
A:
LoyalCH is certainly the most important variable (the top 3 nodes all split on this variable), followed by PriceDiff and DiscCH. We can see node 11) is the third terminal node (from left \(\rightarrow\) right).
LoyalCH ranges from 0 to 1, so the first split sends those less loyal to Citrus Hill (CH) orange juice to the left and those more loyal to the right:
Those that scored lowest in Citrus Hill loyalty (LoyalCH < 0.142213) were predicted to buy Minute Maid (MM), which isn't surprising. Those that were slightly more loyal to CH (0.142213 < LoyalCH < 0.5036) would still buy MM if it wasn't too much more expensive (PriceDiff < 0.235), but if the price difference is large enough (CH was much cheaper) they could end up purchasing CH.
Those on the far-right terminal node are the most loyal to CH (LoyalCH > 0.705699), so it should be unsurprising that this is their predicted purchase. Those with slightly lower brand loyalty (0.5036 < LoyalCH < 0.705699) would still purchase CH if it was much cheaper (PriceDiff > 0.25), or if it wasn't but was sufficiently discounted (PriceDiff < 0.25 & DiscCH > 0.15). For those cases where CH wasn't much cheaper (PriceDiff < 0.25) and wasn't sufficiently discounted (DiscCH < 0.15), the predicted purchase actually ended up being MM.
This was a much more detailed explanation, but this could be summarized at a much higher level in the following way: people go with the brand they are more loyal towards, but there are some edge cases (based on discounts and the prices relative to one another) that can sway people against their usual brand loyalties.
Q: Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?
A:
Here is the confusion matrix for the unpruned classification tree:
test_pred <- predict(tree_model, test, type = "class")
table(test_pred, test_actual = test$Purchase)
## test_actual
## test_pred CH MM
## CH 125 32
## MM 34 79
The test error rate corresponding to it:
## [1] 0.2444444
CH was the most common orange juice in train so, for comparison, a baseline classifier (that predicted CH for all observations in test) would have the following error rate:
## [1] 0.4111111
Q: Apply the cv.tree() function to the training set in order to determine the optimal tree size.
A:
Since our goal appears to be low test error, I specify FUN = prune.misclass. This indicates that we want the classification error rate to guide the cross-validation and pruning process, rather than the default for the cv.tree() function, which is deviance.
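The call itself isn't shown above; it would look something like this (a seed would also have been set beforehand, but the value used isn't shown):
cv_tree_model <- cv.tree(tree_model, FUN = prune.misclass)   # 10-fold CV by default
cv_tree_model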
## $size
## [1] 7 4 2 1
##
## $dev
## [1] 164 164 177 298
##
## $k
## [1] -Inf 1 9 138
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
Q: Produce a plot with tree size on the \(x\)-axis and cross-validated classification error rate on the \(y\)-axis.
A:
The plot is below. Note that cv_tree_model$size is the number of terminal nodes (so 1 means it is just the root node with no splits), and cv_tree_model$dev gives the total number of errors made from the out-of-fold predictions during cross-validation (only because we specified FUN = prune.misclass - omitting this would mean it reports the deviance). From this we can obtain the cross-validation error rate.
data.frame(size = cv_tree_model$size, CV_Error = cv_tree_model$dev / nrow(train)) %>%
mutate(min_CV_Error = as.numeric(min(CV_Error) == CV_Error)) %>%
ggplot(aes(x = size, y = CV_Error)) +
geom_line(col = "grey55") +
geom_point(size = 2, aes(col = factor(min_CV_Error))) +
scale_x_continuous(breaks = seq(1, 7), minor_breaks = NULL) +
scale_y_continuous(labels = scales::percent_format()) +
scale_color_manual(values = c("deepskyblue3", "green")) +
theme(legend.position = "none") +
labs(title = "OJ Dataset - Classification Tree",
subtitle = "Selecting tree 'size' (# of terminal nodes) using cross-validation",
x = "Tree Size",
y = "CV Error")
Q: Which tree size corresponds to the lowest cross-validated classification error rate?
A:
Of the sequence of trees generated, trees of sizes 4 and 7 have the same cross-validation error. It makes sense to select the more parsimonious model here with 4 terminal nodes.
Q: Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.
A:
I produce the tree with 4 terminal nodes. Interestingly we have the same cross-validation error as the full tree with 7 terminal nodes, but have only split on LoyalCH to achieve this. This would have the added benefit of simplifying the interpretation in part (d).
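The pruning call isn't shown above; it would be something like the following (the object name is my own):
pruned_tree_model <- prune.misclass(tree_model, best = 4)
pruned_tree_model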
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 1064.00 CH ( 0.61750 0.38250 )
## 2) LoyalCH < 0.5036 354 435.50 MM ( 0.30508 0.69492 )
## 4) LoyalCH < 0.142213 100 45.39 MM ( 0.06000 0.94000 ) *
## 5) LoyalCH > 0.142213 254 342.20 MM ( 0.40157 0.59843 ) *
## 3) LoyalCH > 0.5036 446 352.30 CH ( 0.86547 0.13453 )
## 6) LoyalCH < 0.705699 154 189.50 CH ( 0.69481 0.30519 ) *
## 7) LoyalCH > 0.705699 292 106.30 CH ( 0.95548 0.04452 ) *
Q: Compare the training error rates between the pruned and unpruned trees. Which is higher?
A:
Here is the training error for the unpruned tree (7 terminal nodes):
## [1] 0.18375
The same for the pruned tree (4 terminal nodes):
## [1] 0.21
The training error for the pruned tree is higher. This isn’t surprising - we would expect the training error of a tree to monotonically decrease as its flexibility (number of splits) increases.
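A sketch of how these two training error rates could be computed (object names as above):
mean(predict(tree_model, train, type = "class") != train$Purchase)          # unpruned: 0.18375
mean(predict(pruned_tree_model, train, type = "class") != train$Purchase)   # pruned: 0.21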
Q: Compare the test error rates between the pruned and unpruned trees. Which is higher?
A:
The test error for the unpruned tree:
## [1] 0.2444444
The same for the pruned tree:
## [1] 0.1703704
Now the order has reversed and the error is higher for the unpruned tree.
It is interesting that the cross-validation errors were in fact equal but the test error is noticeably lower for the simpler tree. A lot of this probably comes from random variability when working with a small dataset; using a different random state for the CV folds and train/test split would likely change all of these results (particularly because decision trees are such high-variance approaches).
Hitters Dataset (Boosting)
We now use boosting to predict Salary in the Hitters data set.
## Rows: 322
## Columns: 20
## $ AtBat <int> 293, 315, 479, 496, 321, 594, 185, 298, 323, 401, 574, 20...
## $ Hits <int> 66, 81, 130, 141, 87, 169, 37, 73, 81, 92, 159, 53, 113, ...
## $ HmRun <int> 1, 7, 18, 20, 10, 4, 1, 0, 6, 17, 21, 4, 13, 0, 7, 3, 20,...
## $ Runs <int> 30, 24, 66, 65, 39, 74, 23, 24, 26, 49, 107, 31, 48, 30, ...
## $ RBI <int> 29, 38, 72, 78, 42, 51, 8, 24, 32, 66, 75, 26, 61, 11, 27...
## $ Walks <int> 14, 39, 76, 37, 30, 35, 21, 7, 8, 65, 59, 27, 47, 22, 30,...
## $ Years <int> 1, 14, 3, 11, 2, 11, 2, 3, 2, 13, 10, 9, 4, 6, 13, 3, 15,...
## $ CAtBat <int> 293, 3449, 1624, 5628, 396, 4408, 214, 509, 341, 5206, 46...
## $ CHits <int> 66, 835, 457, 1575, 101, 1133, 42, 108, 86, 1332, 1300, 4...
## $ CHmRun <int> 1, 69, 63, 225, 12, 19, 1, 0, 6, 253, 90, 15, 41, 4, 36, ...
## $ CRuns <int> 30, 321, 224, 828, 48, 501, 30, 41, 32, 784, 702, 192, 20...
## $ CRBI <int> 29, 414, 266, 838, 46, 336, 9, 37, 34, 890, 504, 186, 204...
## $ CWalks <int> 14, 375, 263, 354, 33, 194, 24, 12, 8, 866, 488, 161, 203...
## $ League <fct> A, N, A, N, N, A, N, A, N, A, A, N, N, A, N, A, N, A, A, ...
## $ Division <fct> E, W, W, E, E, W, E, W, W, E, E, W, E, E, E, W, W, W, W, ...
## $ PutOuts <int> 446, 632, 880, 200, 805, 282, 76, 121, 143, 0, 238, 304, ...
## $ Assists <int> 33, 43, 82, 11, 40, 421, 127, 283, 290, 0, 445, 45, 11, 1...
## $ Errors <int> 20, 10, 14, 3, 4, 25, 7, 9, 19, 0, 22, 11, 7, 6, 8, 0, 10...
## $ Salary <dbl> NA, 475.000, 480.000, 500.000, 91.500, 750.000, 70.000, 1...
## $ NewLeague <fct> A, N, A, N, N, A, A, A, N, A, A, N, N, A, N, A, N, A, A, ...
Salary - Remove Missings & Log-Transform
Q: Remove the observations for whom the salary information is unknown, and then log-transform the salaries.
A:
There do seem to be quite a few observations with missing Salary data:
## [1] 59
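The code for this count isn't shown; it's simply:
sum(is.na(Hitters$Salary))   # 59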
I produce a new dataframe with these observations removed and log-transform Salary. I'll keep the original response variable around and instead create a new variable, log_Salary. A before & after density plot is below, where we can see that log_Salary is less skewed and closer to a normal distribution:
Hitters_notNA <- Hitters[!is.na(Hitters$Salary), ]
Hitters_notNA$log_Salary <- log(Hitters_notNA$Salary)
ggplot(Hitters_notNA, aes(x = Salary)) +
geom_density(fill = "mediumseagreen") +
scale_x_continuous(labels = scales::comma_format()) +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.title.y = element_blank()) +
labs(title = "Hitters Dataset - Salary Distribution",
x = "Salary")
ggplot(Hitters_notNA, aes(x = log_Salary)) +
geom_density(fill = "deepskyblue3") +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.title.y = element_blank()) +
labs(title = "Hitters Dataset - log(Salary) Distribution",
x = "log(Salary)")
This is the end of this question, but below I produce some explanations & illustrations for my own understanding. I thought it was quite interesting that the book applies a log-transformation to the response here without really justifying why, since so far transformations have really been focused around linear models. I think the accepted answer here is a good summary of why the authors might be applying this transformation.
Log-transforming will make the splits decided by regression trees (and regression-tree-based methods) less influenced by extreme values of Salary. I believe the predictions in the regions produced by these splits should also be skewed less by these extreme values (the prediction in a region is just the mean response for the observations in that region, after all, and the mean is sensitive to outliers). Furthermore, when we select a model based on target metrics that use the squared error (e.g. MSE), we might not want the selected model predicting salary to be too strongly influenced by these top few earners.
It is worth remembering for the future that response-variable transformations and outlier removal could be useful in decreasing the prediction error with tree-based methods when tackling regression problems with skewed response variables.
Below I put together a quick example similar to that seen in the link above, applying it directly to the Hitters dataset.
I fit a decision stump predicting Salary with one predictor (CHits):
### ### ### using y = Salary ### ### ###
CHits_stump_Salary <- tree(Salary ~ CHits, Hitters_notNA) %>%
prune.tree(best = 2)
CHits_stump_Salary # CHits < 450
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 263 53320000 535.9
## 2) CHits < 450 117 5931000 227.9 *
## 3) CHits > 450 146 27390000 782.8 *
We can see that the split occurs at the ~45th percentile of the CHits variable (at value 450):
## [1] 0.445
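The percentile calculation isn't shown; something like the following reproduces it (117 of the 263 players have CHits below the cutpoint):
round(mean(Hitters_notNA$CHits < 450), 3)   # 117/263 = 0.445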
The vertical darker-green line shows the split that partitions the feature space into two regions. The horizontal lighter-green lines are the predictions made for these two regions:
pred_Salary <- Hitters_notNA %>%
mutate(split = case_when(CHits < 450 ~ "lt 450",
TRUE ~ "ge 450")) %>%
group_by(split) %>%
summarize(Salary = mean(Salary))
ggplot(Hitters_notNA, aes(x = CHits, y = Salary)) +
geom_point(alpha = 0.5) +
geom_segment(aes(x = 450, xend = 4500, y = pred_Salary$Salary[pred_Salary$split == "ge 450"], yend = pred_Salary$Salary[pred_Salary$split == "ge 450"]), col = "mediumseagreen", size = 1) +
geom_segment(aes(x = 0, xend = 450, y = pred_Salary$Salary[pred_Salary$split == "lt 450"], yend = pred_Salary$Salary[pred_Salary$split == "lt 450"]), col = "mediumseagreen", size = 1) +
geom_vline(xintercept = 450, size = 1.2, col = "darkgreen") +
scale_x_continuous(labels = scales::comma_format()) +
scale_y_continuous(labels = scales::comma_format())
The MSE and Median Absolute Error:
data.frame(MSE = mean((predict(CHits_stump_Salary, Hitters_notNA) - Hitters_notNA$Salary)^2), # 126,678
MAE = median(abs(predict(CHits_stump_Salary, Hitters_notNA) - Hitters_notNA$Salary))) # 137.8547
## MSE MAE
## 1 126678 137.8547
I now fit a decision stump predicting log_Salary with the same predictor (CHits):
### ### ### using y = log_Salary ### ### ###
CHits_stump_logSalary <- tree(log_Salary ~ CHits, Hitters_notNA) %>%
prune.tree(best = 2)
CHits_stump_logSalary # CHits < 358
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 263 207.20 5.927
## 2) CHits < 358 101 35.24 5.083 *
## 3) CHits > 358 162 55.07 6.454 *
The split has shifted to the left, now occurring at the ~38th percentile of the CHits variable (at value 358):
## [1] 0.384
As before, the vertical darker-blue line shows the split, and the horizontal lighter-blue lines are the predictions made by the tree for the two regions:
pred_logSalary <- Hitters_notNA %>%
mutate(split = case_when(CHits < 358 ~ "lt 358",
TRUE ~ "ge 358")) %>%
group_by(split) %>%
summarize(log_Salary = mean(log_Salary))
ggplot(Hitters_notNA, aes(x = CHits, y = log_Salary)) +
geom_point(alpha = 0.5) +
geom_segment(aes(x = 358, xend = 4500, y = pred_logSalary$log_Salary[pred_logSalary$split == "ge 358"], yend = pred_logSalary$log_Salary[pred_logSalary$split == "ge 358"]), col = "deepskyblue3", size = 1) +
geom_segment(aes(x = 0, xend = 358, y = pred_logSalary$log_Salary[pred_logSalary$split == "lt 358"], yend = pred_logSalary$log_Salary[pred_logSalary$split == "lt 358"]), col = "deepskyblue3", size = 1) +
geom_vline(xintercept = 358, size = 1.2, col = "blue") +
scale_x_continuous(labels = scales::comma_format()) +
labs(y = "log(Salary)")
The MSE and Median Absolute Error:
data.frame(MSE = mean((exp(predict(CHits_stump_logSalary, Hitters_notNA)) - Hitters_notNA$Salary)^2), # 141287.2
MAE = median(abs(exp(predict(CHits_stump_logSalary, Hitters_notNA)) - Hitters_notNA$Salary))) # 116.2309
## MSE MAE
## 1 141287.2 116.2309
Looking at the splits and predictions on the same scale is the best way of seeing how the transformation influenced the split-points, predicted salary and therefore the overall error of the models:
ggplot(Hitters_notNA, aes(x = CHits, y = Salary)) +
geom_point(alpha = 0.5) +
geom_segment(aes(x = 450, xend = 4500, y = pred_Salary$Salary[pred_Salary$split == "ge 450"], yend = pred_Salary$Salary[pred_Salary$split == "ge 450"]), col = "mediumseagreen", size = 1) +
geom_segment(aes(x = 0, xend = 450, y = pred_Salary$Salary[pred_Salary$split == "lt 450"], yend = pred_Salary$Salary[pred_Salary$split == "lt 450"]), col = "mediumseagreen", size = 1) +
geom_segment(aes(x = 358, xend = 4500, y = exp(pred_logSalary$log_Salary[pred_logSalary$split == "ge 358"]), yend = exp(pred_logSalary$log_Salary[pred_logSalary$split == "ge 358"])), col = "deepskyblue3", size = 1) +
geom_segment(aes(x = 0, xend = 358, y = exp(pred_logSalary$log_Salary[pred_logSalary$split == "lt 358"]), yend = exp(pred_logSalary$log_Salary[pred_logSalary$split == "lt 358"])), col = "deepskyblue3", size = 1) +
geom_vline(aes(col = "log(Salary)", xintercept = 358), size = 1.2) +
geom_vline(aes(col = "Salary", xintercept = 450), size = 1.2) +
scale_color_manual(values = c("blue", "darkgreen")) +
scale_x_continuous(labels = scales::comma_format()) +
scale_y_continuous(labels = scales::comma_format()) +
labs(col = "Response:") +
theme(legend.position = "bottom")
train/test Split
Q: Create a training set consisting of the first 200 observations, and a test set consisting of the remaining observations.
A:
I create a 200-row train dataset and 63-row test dataset below.
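The split code isn't shown; per the question it is simply the first 200 rows vs the remaining 63:
train <- Hitters_notNA[1:200, ]
test  <- Hitters_notNA[201:nrow(Hitters_notNA), ]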
Boosting (train MSE)
Q: Perform boosting on the training set with 1,000 trees for a range of values of the shrinkage parameter \(\lambda\). Produce a plot with different shrinkage values on the \(x\)-axis and the corresponding training set MSE on the \(y\)-axis.
A:
Note: For the remainder of the question I am going to be calculating performance metrics based on log_Salary, instead of exponentiating these predictions and calculating performance metrics on the original Salary scale (as I did in my extension of part a). I presume this is what the authors expected.
I test a pretty complete sequence of \(\lambda\) values, choosing to leave interaction.depth = 2 to model basic interactions.
I plot the results below, displaying the x axis on a \(log_{10}\) scale. The training MSE decreases for higher values of \(\lambda\).
lambda_seq <- 10^seq(-6, 0, 0.1)
set.seed(1)
train_MSE <- c()
test_MSE <- c()
for (i in 1:length(lambda_seq)) {
boost_TEMP <- gbm(log_Salary ~ . - Salary,
data = train,
distribution = "gaussian",
n.trees = 1000,
interaction.depth = 2,
shrinkage = lambda_seq[i])
train_MSE[i] <- mean((predict(boost_TEMP, train, n.trees = 1000) - train$log_Salary)^2)
test_MSE[i] <- mean((predict(boost_TEMP, test, n.trees = 1000) - test$log_Salary)^2)
}
data.frame(lambda = lambda_seq, train_MSE) %>%
ggplot(aes(x = lambda, y = train_MSE)) +
geom_point(size = 2, col = "deepskyblue3") +
geom_line(col = "grey55") +
scale_x_continuous(trans = 'log10', breaks = 10^seq(-6, 0), labels = 10^seq(-6, 0), minor_breaks = NULL) +
labs(x = "Lambda (Shrinkage)",
y = "Training MSE")
Boosting (test MSE)
Q: Produce a plot with different shrinkage values on the \(x\)-axis and the corresponding test set MSE on the \(y\)-axis.
A:
I use the same sequence of lambdas to assess the corresponding test MSE. I actually did this in the previous question (creating the vector test_MSE alongside train_MSE), so I simply plot the results here:
data.frame(lambda = lambda_seq, test_MSE) %>%
ggplot(aes(x = lambda, y = test_MSE)) +
geom_point(size = 2, col = "deepskyblue3") +
geom_line(col = "grey55") +
scale_x_continuous(trans = 'log10', breaks = 10^seq(-6, 0), labels = 10^seq(-6, 0), minor_breaks = NULL) +
labs(x = "Lambda (Shrinkage)",
y = "Test MSE")
Q: Compare the test MSE of boosting to the test MSE that results from applying two of the regression approaches seen in Chapters 3 and 6.
A:
I'll choose the following three methods for comparison:
1. Boosting
Holding n.trees = 1000, interaction.depth = 2 and n.minobsinnode = 10 constant (the last is not in the book, and I find it much less important to tune), I used repeated cross-validation to select the shrinkage parameter:
set.seed(1)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3, verboseIter = F, summaryFunction = custom_regression_metrics)
gbm_grid <- expand.grid(n.trees = 1000, interaction.depth = 2, n.minobsinnode = 10, shrinkage = 10^seq(-6, 0, 0.1))
model_gbm <- train(log_Salary ~ . - Salary,
data = train,
method = "gbm",
distribution = "gaussian",
verbose = F,
metric = "MSE",
maximize = F,
trControl = ctrl,
tuneGrid = gbm_grid)
model_gbm$results %>%
rename(CV_MSE = MSE) %>%
mutate(min_CV_MSE = as.numeric(shrinkage == model_gbm$bestTune$shrinkage)) %>%
ggplot(aes(x = shrinkage, y = CV_MSE)) +
geom_line(col = "grey55") +
geom_point(size = 2, aes(col = factor(min_CV_MSE))) +
scale_x_continuous(trans = 'log10', breaks = 10^seq(-6, 0), labels = 10^seq(-6, 0), minor_breaks = NULL) +
scale_color_manual(values = c("deepskyblue3", "green")) +
theme(legend.position = "none") +
labs(title = "Hitters Dataset - Boosting",
subtitle = "Selecting shrinkage parameter using cross-validation",
x = "Lambda (Shrinkage)",
y = "CV MSE")
The optimal value of shrinkage:
## [1] 0.0063
The corresponding cross-validation MSE:
## [1] 0.2089
The test MSE:
## [1] 0.2816
I did a fair bit of playing around with the hyper-parameters on this dataset, but even with repeated cross-validation there is still a lot of variability in the optimal parameters (due to the size of the dataset). Any reduction in the CV MSE from here was marginal and often didn’t really correlate with an improvement in the test MSE. I’ll take the above stats as the final boosting performance for this dataset and move on.
2. Partial Least Squares Regression
set.seed(1)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 20, verboseIter = F, summaryFunction = custom_regression_metrics)
model_pls <- train(log_Salary ~ . - Salary,
data = train,
method = "pls",
preProcess = c("center", "scale"),
metric = "MSE",
maximize = F,
trControl = ctrl,
tuneGrid = data.frame(ncomp = 1:20))
model_pls$results %>%
rename(CV_MSE = MSE) %>%
mutate(min_CV_MSE = as.numeric(ncomp == model_pls$bestTune$ncomp)) %>%
ggplot(aes(x = ncomp, y = CV_MSE)) +
geom_line(col = "grey55") +
geom_point(size = 2, aes(col = factor(min_CV_MSE))) +
scale_x_continuous(breaks = 1:20, minor_breaks = NULL) +
scale_color_manual(values = c("deepskyblue3", "green")) +
theme(legend.position = "none") +
labs(title = "Hitters Dataset - Partial Least Squares",
subtitle = "Selecting number of principal components with cross-validation",
x = "Principal Components",
y = "CV MSE")
The optimal number of principal components:
## [1] 1
The corresponding cross-validation MSE:
## [1] 0.4038
The test MSE:
## [1] 0.466
3. Lasso Regression
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = F, summaryFunction = custom_regression_metrics)
set.seed(1)
model_lasso <- train(log_Salary ~ . - Salary,
data = train,
method = "glmnet",
preProcess = c("center", "scale"),
metric = "MSE",
maximize = F,
trControl = ctrl,
tuneGrid = expand.grid(alpha = 1,
lambda = seq(0, 0.1, 0.001)))
model_lasso$results %>%
rename(CV_MSE = MSE) %>%
mutate(min_CV_MSE = as.numeric(lambda == model_lasso$bestTune$lambda)) %>%
ggplot(aes(x = lambda, y = CV_MSE)) +
geom_line(col = "grey55") +
geom_point(size = 2, aes(col = factor(min_CV_MSE))) +
scale_color_manual(values = c("deepskyblue3", "green")) +
theme(legend.position = "none") +
labs(title = "Hitters Dataset - Lasso Regression",
subtitle = "Selecting shrinkage parameter with cross-validation",
y = "CV MSE")
The optimal value of lambda (the shrinkage parameter):
## [1] 0.003
The corresponding cross-validation MSE:
## [1] 0.3982
The test MSE:
lasso_test <- mean((predict(model_lasso, test) - test$log_Salary)^2) %>% round(4) # 0.4712
lasso_test
## [1] 0.4712
The improvement in the test MSE for Boosting over the other (linear) regression approaches appears to be quite large. These results are summarized (alongside bagging) in part (g).
Q: Which variables appear to be the most important predictors in the boosted model?
A:
We can see the variable importance ranking below. CAtBat ("Number of times at bat during his career") is the most important variable in the model by some margin. After that, CRuns, CWalks, CHits & CRBI are all of similar importance. Note that the most important variables are all 'career' variables (totals across a player's career to date). For the full definitions you can type ?ISLR::Hitters in the console.
# summary(model_gbm$finalModel)
# ggplot version:
summary(model_gbm$finalModel, plotit = F) %>%
rename("Importance" = "rel.inf") %>%
ggplot(aes(x = reorder(var, Importance), y = Importance, fill = Importance)) +
geom_bar(stat = "identity") +
geom_text(aes(label = round(Importance, 2), col = Importance), hjust = -0.2) +
scale_y_continuous(limits = c(0, 31.5)) +
scale_fill_gradient(low = "grey40", high = "#28B463") +
scale_color_gradient(low = "grey40", high = "#28B463") +
coord_flip() +
theme(legend.position = "none") +
labs(title = "Hitters Dataset - Boosting - Variable Importance",
x = "Varname")
Below I plot the partial dependence plot for the most important predictor (CAtBat), which shows its marginal effect on the response after integrating out the other variables. We can see that \(log(Salary)\) increases sharply with CAtBat (up to roughly 1,500), after which higher values of CAtBat no longer have an effect.
plot(model_gbm$finalModel, "CAtBat", return.grid = T) %>%
ggplot(aes(CAtBat, y)) +
geom_line() +
scale_x_continuous(labels = scales::comma_format()) +
labs(title = "CAtBat - Partial Dependence Plot",
y = "Marginal effect on y = log(Salary)")
I stumbled across the var.monotone argument for gbm(), which allows us to force the relationships of certain predictors with the outcome to be monotonic. This can be really useful in finance and other regulated industries, and the effect of these monotone constraints is immediately visible in the partial dependence plots.
To show an example, I look at the PDP of PutOuts, where we can see that the relationship is clearly not monotonic:
plot(model_gbm$finalModel, "PutOuts", return.grid = T) %>%
ggplot(aes(PutOuts, y)) +
geom_line() +
scale_x_continuous(labels = scales::comma_format()) +
labs(title = "PutOuts - Partial Dependence Plot",
y = "Marginal effect on y = log(Salary)")
I reproduce the PDP for a model trained with all the same hyper-parameters, but now with an additional monotonic (increasing) constraint on the relationship between PutOuts and the response. I didn't force any relationship for the remaining 18 predictors, meaning arbitrary non-monotonic relationships are permitted. The impact of the monotonic constraint is clear:
set.seed(1)
train_X <- select(train, -c(Salary, log_Salary))
model_gbm_mono <- train(y = train$log_Salary,
x = train_X,
method = "gbm",
distribution = "gaussian",
verbose = F,
metric = "MSE",
maximize = F,
trControl = ctrl,
tuneGrid = data.frame(n.trees = 1000, interaction.depth = 2, n.minobsinnode = 10, shrinkage = 10^-2.2), # shrinkage = 10^-2.2 chosen earlier
var.monotone = ifelse(names(train_X) == "PutOuts", 1, 0)) # PutOuts = 1 = monotonic increasing (others = 0 = arbitrary)
plot(model_gbm_mono$finalModel, "PutOuts", return.grid = T) %>%
ggplot(aes(PutOuts, y)) +
geom_line() +
scale_x_continuous(labels = scales::comma_format()) +
labs(title = "PutOuts - Partial Dependence Plot",
subtitle = "Monotonic (increasing) constraint on 'PutOuts'",
y = "Marginal effect on y = log(Salary)")
Q: Now apply bagging to the training set. What is the test set MSE for this approach?
A:
set.seed(1)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3, verboseIter = F, summaryFunction = custom_regression_metrics)
model_bagging <- train(y = train$log_Salary,
x = train_X,
method = "rf",
metric = "MSE",
maximize = F,
trControl = ctrl,
tuneGrid = expand.grid(mtry = 19))
The cross-validation MSE:
## [1] 0.2198
The test MSE:
bagging_test <- mean((predict(model_bagging, test) - test$log_Salary)^2) %>% round(4) # 0.2323
bagging_test
## [1] 0.2323
The performance of all 4 methods are summarized below for comparison:
data.frame(method = c("Boosting", "PLS", "Lasso", "Bagging"),
CV_MSE = c(gbm_cv, pls_cv, lasso_cv, bagging_cv),
test_MSE = c(gbm_test, pls_test, lasso_test, bagging_test)) %>%
arrange(test_MSE)
## method CV_MSE test_MSE
## 1 Bagging 0.2198 0.2323
## 2 Boosting 0.2089 0.2816
## 3 PLS 0.4038 0.4660
## 4 Lasso 0.3982 0.4712
With regard to the test MSE, bagging performed slightly better than boosting. The dataset is really small, so there's an element of randomness to the exact ordering, but it does seem pretty clear that there is a large performance increase to be had from using tree ensembles here as opposed to linear methods.
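To put a rough number on that gap (a quick calculation using the test MSEs from the table above):
round(100 * (1 - 0.2323 / 0.4660), 1) # bagging vs PLS: test MSE ~50.2% lower
round(100 * (1 - 0.2816 / 0.4660), 1) # boosting vs PLS: test MSE ~39.6% lower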
Caravan Dataset (Boosting)
This question uses the Caravan data set.
## Rows: 5,822
## Columns: 86
## $ MOSTYPE <dbl> 33, 37, 37, 9, 40, 23, 39, 33, 33, 11, 10, 9, 33, 41, 23, ...
## $ MAANTHUI <dbl> 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1...
## $ MGEMOMV <dbl> 3, 2, 2, 3, 4, 2, 3, 2, 2, 3, 4, 3, 2, 3, 1, 2, 2, 3, 4, 2...
## $ MGEMLEEF <dbl> 2, 2, 2, 3, 2, 1, 2, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 3, 2, 4...
## $ MOSHOOFD <dbl> 8, 8, 8, 3, 10, 5, 9, 8, 8, 3, 3, 3, 8, 10, 5, 8, 9, 5, 3,...
## $ MGODRK <dbl> 0, 1, 0, 2, 1, 0, 2, 0, 0, 3, 1, 1, 1, 0, 0, 0, 0, 0, 2, 0...
## $ MGODPR <dbl> 5, 4, 4, 3, 4, 5, 2, 7, 1, 5, 4, 3, 4, 5, 6, 7, 6, 5, 4, 2...
## $ MGODOV <dbl> 1, 1, 2, 2, 1, 0, 0, 0, 3, 0, 1, 2, 1, 0, 1, 0, 0, 0, 0, 0...
## $ MGODGE <dbl> 3, 4, 4, 4, 4, 5, 5, 2, 6, 2, 4, 4, 4, 4, 2, 2, 3, 4, 3, 7...
## $ MRELGE <dbl> 7, 6, 3, 5, 7, 0, 7, 7, 6, 7, 7, 7, 6, 7, 1, 7, 7, 7, 7, 9...
## $ MRELSA <dbl> 0, 2, 2, 2, 1, 6, 2, 2, 0, 0, 1, 1, 2, 1, 2, 2, 0, 0, 0, 0...
## $ MRELOV <dbl> 2, 2, 4, 2, 2, 3, 0, 0, 3, 2, 2, 2, 3, 1, 6, 0, 2, 2, 2, 0...
## $ MFALLEEN <dbl> 1, 0, 4, 2, 2, 3, 0, 0, 3, 2, 0, 2, 3, 1, 5, 0, 0, 0, 1, 0...
## $ MFGEKIND <dbl> 2, 4, 4, 3, 4, 5, 3, 5, 3, 2, 3, 3, 4, 4, 3, 5, 6, 2, 3, 6...
## $ MFWEKIND <dbl> 6, 5, 2, 4, 4, 2, 6, 4, 3, 6, 6, 5, 3, 5, 1, 4, 3, 7, 6, 3...
## $ MOPLHOOG <dbl> 1, 0, 0, 3, 5, 0, 0, 0, 0, 0, 4, 1, 1, 2, 2, 0, 2, 2, 5, 0...
## $ MOPLMIDD <dbl> 2, 5, 5, 4, 4, 5, 4, 3, 1, 4, 3, 7, 4, 4, 6, 3, 6, 1, 4, 0...
## $ MOPLLAAG <dbl> 7, 4, 4, 2, 0, 4, 5, 6, 8, 5, 3, 1, 5, 4, 2, 6, 2, 7, 1, 9...
## $ MBERHOOG <dbl> 1, 0, 0, 4, 0, 2, 0, 2, 1, 2, 0, 4, 1, 3, 1, 2, 2, 0, 6, 0...
## $ MBERZELF <dbl> 0, 0, 0, 0, 5, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0...
## $ MBERBOER <dbl> 1, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0...
## $ MBERMIDD <dbl> 2, 5, 7, 3, 0, 4, 4, 2, 1, 3, 9, 5, 3, 2, 4, 2, 4, 1, 3, 2...
## $ MBERARBG <dbl> 5, 0, 0, 1, 0, 2, 1, 5, 8, 3, 0, 1, 2, 2, 3, 5, 0, 1, 0, 4...
## $ MBERARBO <dbl> 2, 4, 2, 2, 0, 2, 5, 2, 1, 3, 0, 1, 4, 2, 2, 2, 4, 5, 1, 4...
## $ MSKA <dbl> 1, 0, 0, 3, 9, 2, 0, 2, 1, 1, 3, 2, 1, 4, 1, 2, 2, 2, 5, 0...
## $ MSKB1 <dbl> 1, 2, 5, 2, 0, 2, 1, 1, 1, 2, 0, 3, 2, 2, 3, 1, 2, 0, 2, 0...
## $ MSKB2 <dbl> 2, 3, 0, 1, 0, 2, 4, 2, 0, 1, 6, 4, 2, 1, 2, 2, 4, 0, 1, 0...
## $ MSKC <dbl> 6, 5, 4, 4, 0, 4, 5, 5, 8, 4, 0, 1, 5, 4, 4, 5, 2, 7, 2, 7...
## $ MSKD <dbl> 1, 0, 0, 0, 0, 2, 0, 2, 1, 2, 0, 0, 1, 0, 0, 2, 0, 0, 0, 2...
## $ MHHUUR <dbl> 1, 2, 7, 5, 4, 9, 6, 0, 9, 0, 0, 6, 5, 5, 9, 0, 6, 4, 4, 9...
## $ MHKOOP <dbl> 8, 7, 2, 4, 5, 0, 3, 9, 0, 9, 9, 3, 4, 4, 0, 9, 3, 5, 5, 0...
## $ MAUT1 <dbl> 8, 7, 7, 9, 6, 5, 8, 4, 5, 6, 6, 7, 6, 7, 5, 4, 7, 6, 7, 7...
## $ MAUT2 <dbl> 0, 1, 0, 0, 2, 3, 0, 4, 2, 1, 2, 1, 1, 2, 1, 4, 2, 1, 1, 2...
## $ MAUT0 <dbl> 1, 2, 2, 0, 1, 3, 1, 2, 3, 2, 1, 2, 3, 0, 3, 2, 0, 2, 2, 0...
## $ MZFONDS <dbl> 8, 6, 9, 7, 5, 9, 9, 6, 7, 6, 5, 4, 7, 8, 5, 6, 4, 7, 5, 9...
## $ MZPART <dbl> 1, 3, 0, 2, 4, 0, 0, 3, 2, 3, 4, 5, 2, 1, 4, 3, 5, 2, 4, 0...
## $ MINKM30 <dbl> 0, 2, 4, 1, 0, 5, 4, 2, 7, 2, 0, 3, 3, 3, 4, 2, 0, 0, 2, 5...
## $ MINK3045 <dbl> 4, 0, 5, 5, 0, 2, 3, 5, 2, 3, 3, 4, 4, 2, 3, 5, 1, 6, 2, 4...
## $ MINK4575 <dbl> 5, 5, 0, 3, 9, 3, 3, 3, 1, 3, 2, 3, 3, 4, 3, 3, 6, 3, 6, 0...
## $ MINK7512 <dbl> 0, 2, 0, 0, 0, 0, 0, 0, 0, 1, 2, 1, 1, 1, 0, 0, 2, 0, 0, 0...
## $ MINK123M <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ MINKGEM <dbl> 4, 5, 3, 4, 6, 3, 3, 3, 2, 4, 8, 3, 3, 4, 3, 3, 5, 4, 4, 3...
## $ MKOOPKLA <dbl> 3, 4, 4, 4, 3, 3, 5, 3, 3, 7, 7, 4, 3, 4, 3, 3, 4, 2, 6, 1...
## $ PWAPART <dbl> 0, 2, 2, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 2, 2, 0, 2, 2, 0...
## $ PWABEDR <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ PWALAND <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ PPERSAUT <dbl> 6, 0, 6, 6, 0, 6, 6, 0, 5, 0, 6, 5, 6, 0, 5, 0, 0, 6, 5, 0...
## $ PBESAUT <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ PMOTSCO <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ PVRAAUT <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ PAANHANG <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ PTRACTOR <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ PWERKT <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ PBROM <dbl> 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3...
## $ PLEVEN <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ PPERSONG <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ PGEZONG <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ PWAOREG <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ PBRAND <dbl> 5, 2, 2, 2, 6, 0, 0, 0, 0, 3, 0, 2, 0, 0, 2, 4, 0, 3, 2, 0...
## $ PZEILPL <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ PPLEZIER <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ PFIETS <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ PINBOED <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ PBYSTAND <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ AWAPART <dbl> 0, 2, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0...
## $ AWABEDR <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ AWALAND <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ APERSAUT <dbl> 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0...
## $ ABESAUT <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ AMOTSCO <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ AVRAAUT <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ AAANHANG <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ ATRACTOR <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ AWERKT <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ ABROM <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1...
## $ ALEVEN <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ APERSONG <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ AGEZONG <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ AWAOREG <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ ABRAND <dbl> 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0...
## $ AZEILPL <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ APLEZIER <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ AFIETS <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ AINBOED <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ ABYSTAND <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Purchase <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No...
train/test Split
Q: Create a training set consisting of the first 1,000 observations, and a test set consisting of the remaining observations.
A:
In this scenario, train has far fewer rows than test, which I thought was quite unusual.
## [1] 1000
## [1] 4822
Note that there are quite a few low-variance/zero-variance predictors in train. Three examples:
## AVRAAUT
## 0
## 1000
## PVRAAUT
## 0
## 1000
## AZEILPL
## 0 1
## 999 1
Enough so, in fact, that I wrote the code below to take every predictor in train and calculate the proportion of observations accounted for by its single most common value. If a variable was more than 99% one value (e.g. 99% of its values are 0), I deemed it safe to exclude.
prop_1_value_df <- sapply(select(train, -Purchase), function(x) round(max(prop.table(table(x, useNA = "ifany"))), 5)) %>%
reshape2::melt() %>%
rownames_to_column("varname") %>%
rename("prop_1_value" = "value") %>%
arrange(desc(prop_1_value)) %>%
mutate(index = 1:nrow(.))
near_zero_variance <- prop_1_value_df$varname[prop_1_value_df$prop_1_value >= 0.99]
ggplot(prop_1_value_df, aes(x = index, y = prop_1_value)) +
geom_line(col = "steelblue", size = 1) +
geom_hline(yintercept = 0.99) +
geom_vline(xintercept = length(near_zero_variance)) +
scale_y_continuous(labels = scales::percent_format()) +
labs(title = "Caravan 'train' Dataset",
subtitle = paste0(length(near_zero_variance), " vars removed using a 99% threshold"),
x = "Variable Index",
y = "Percentage taking 1 value")
Even this strict exclusion criterion ended up removing close to 25% of the predictors, but I figured these would be the ones least likely to be useful in modelling. My primary reason for this step was to silence the constant 'zero-variance predictor' warnings from the modelling packages used on this dataset, but improved speed and predictive power are also likely to result from it.
I remove the predictors from both train and test:
train <- train %>%
select(-all_of(near_zero_variance))
test <- test %>%
select(-all_of(near_zero_variance))
I did a quick check and this step didn’t negatively impact boosting performance whatsoever.
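As an aside, caret has a built-in filter for this kind of check. The commented-out sketch below is an alternative to my manual code above; note that nearZeroVar() flags predictors using a frequency-ratio/unique-value criterion rather than the simple 99% rule, so the exact set of removed variables may differ slightly:
# Alternative (not run - the manual 99% rule above is what was actually used):
# nzv_cols <- caret::nearZeroVar(select(train, -Purchase), names = TRUE)
# train <- select(train, -all_of(nzv_cols))
# test <- select(test, -all_of(nzv_cols))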
Q: Fit a boosting model to the training set with Purchase as the response and the other variables as predictors. Use \(1,000\) trees, and a shrinkage value of \(0.01\). Which predictors appear to be the most important?
A:
As previously, I'll keep interaction.depth = 2 and n.minobsinnode = 10 as reasonable initial values.
gbm_grid <- expand.grid(n.trees = 1000, shrinkage = 0.01,
interaction.depth = 2, n.minobsinnode = 10)
set.seed(5)
model_gbm <- train(Purchase ~ .,
data = train,
method = "gbm",
distribution = "bernoulli",
verbose = F,
trControl = trainControl(method = "none", summaryFunction = defaultSummary, classProbs = T),
tuneGrid = gbm_grid)
Below are the predictors in descending order of importance. Note that some predictors were excluded from the graph (zero variable importance, i.e. not used by the model). The number of such zero-importance variables is considerably lower than it would have been had I not performed the near-zero-variance exclusions.
# ggplot version:
gbm_importance <- summary(model_gbm$finalModel, plotit = F) %>%
rename("Importance" = "rel.inf")
gbm_importance %>%
filter(Importance > 0) %>%
ggplot(aes(x = reorder(var, Importance), y = Importance, fill = Importance)) +
geom_bar(stat = "identity") +
geom_text(aes(label = round(Importance, 2), col = Importance), hjust = -0.2) +
scale_y_continuous(limits = c(0, 10.5)) +
scale_fill_gradient(low = "grey40", high = "#28B463") +
scale_color_gradient(low = "grey40", high = "#28B463") +
coord_flip() +
theme(legend.position = "none") +
labs(title = "Caravan Dataset - Boosting - Variable Importance",
subtitle = paste0(sum(gbm_importance$Importance == 0), " zero-importance variables removed"),
x = "Varname")
The most important predictor by a reasonable margin is PPERSAUT. Typing ?ISLR::Caravan in the console points us to the dataset documentation, where we can see that PPERSAUT is defined as "Contribution car policies".
Q: Use the boosting model to predict the response on the test data. Predict that a person will make a purchase if the estimated probability of purchase is greater than 20%. Form a confusion matrix. What fraction of the people predicted to make a purchase do in fact make one? How does this compare with the results obtained from applying KNN or logistic regression to this dataset?
A:
The metric we are being asked to use is basically precision (the proportion of predictions for the positive class that were correct):
\[Precision = \frac{TP}{TP + FP}\]
The only thing slightly different is that the probability threshold (above which we make a positive prediction, Purchase = "Yes") is 0.2 instead of 0.5. It is quite easy to calculate this metric on a simple test set, but it requires us to write a custom metric for caret::train() if we wish to get a cross-validation estimate of it and use that estimate to tune the model hyper-parameters (e.g. \(k\) in KNN) without using test. I define custom_precision here:
custom_precision <- function (data, lev = NULL, model = NULL){
actual <- data[ ,"obs"]
probs <- data[ ,"Yes"] # P(y = "Yes") (when setting 'classProbs = TRUE', column automatically created)
predicted <- factor(ifelse(probs > 0.2, "Yes", "No")) # most important line
levels(actual) <- c("No", "Yes")
levels(predicted) <- c("No", "Yes")
out <- caret::precision(data = predicted, reference = actual, relevant = "Yes")
names(out) <- c("Precision [thresh = 0.2]")
out
}
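As a quick sanity check of the function, here is a hypothetical toy input mimicking the data frame of observed classes and class probabilities that caret passes to a summaryFunction:
# Toy example: probs > 0.2 gives 2 positive predictions, of which 1 is correct:
toy <- data.frame(obs = factor(c("Yes", "No", "No", "Yes"), levels = c("No", "Yes")),
                  Yes = c(0.35, 0.25, 0.10, 0.15),
                  No = c(0.65, 0.75, 0.90, 0.85))
custom_precision(toy) # Precision [thresh = 0.2] = 1 / 2 = 0.5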
The question isn't directly asking us to use this as a target metric (rather to simply report this metric on test), but I think this is good practice for properly implementing custom metrics in caret::train() cross-validation:
Boosting:
I fit a gbm using the same parameters as in part (b):
gbm_grid <- expand.grid(n.trees = 1000, shrinkage = 0.01,
interaction.depth = 2, n.minobsinnode = 10)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3, summaryFunction = custom_precision, classProbs = T)
set.seed(5)
model_gbm <- train(Purchase ~ .,
data = train,
method = "gbm",
distribution = "bernoulli",
verbose = F,
metric = "Precision [thresh = 0.2]",
maximize = T,
trControl = ctrl,
tuneGrid = gbm_grid)
Here’s the cross-validation precision:
## [1] 0.1703
Here's the confusion matrix for test:
# releveling purely so that the confusion matrix entries are in the standard order:
predicted <- factor(ifelse(predict(model_gbm, test, "prob")$Yes > 0.2, "Yes", "No")) %>% relevel(ref = "Yes")
actual <- test$Purchase %>% relevel(ref = "Yes")
conf_matrix <- table(predicted, actual)
conf_matrix
## actual
## predicted Yes No
## Yes 37 164
## No 252 4369
Here's the resulting test precision:
precision <- conf_matrix[1, 1] / (conf_matrix[1, 1] + conf_matrix[1, 2]) # precision = TP / (TP + FP)
precision %>% round(4) # 0.1841 (0.1840796)
## [1] 0.1841
## check:
# caret::precision(data = factor(ifelse(predict(model_gbm, test, "prob")$Yes > 0.2, "Yes", "No")),
# reference = test$Purchase,
# relevant = "Yes") # 0.1840796
KNN:
Comparing to KNN:
set.seed(5)
model_knn <- train(Purchase ~ .,
data = train,
method = "knn",
preProcess = c("center", "scale"),
trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, summaryFunction = custom_precision, classProbs = T, returnResamp = "all"),
metric = "Precision [thresh = 0.2]",
maximize = T,
tuneGrid = expand.grid(k = seq(1, 45, 2))) # tuneGrid
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
There are some NA estimates, which basically result from a combination of factors that lead to zero positive predictions and therefore an undefined precision metric (small dataset, imbalanced target, the metric being used, large \(k\) in KNN). Details can be found in the notes below but are not particularly important:
# NA's occur where the precision metric is undefined, i.e. where there are no positive predictions
# As 'k' in KNN gets very large, it becomes less and less likely that more than 20% of these k nearest neighbors for an observation will have a "yes" purchase value (<6% of train$Purchase takes the value "yes")
# This will mean an observation is less and less likely to have a predicted probability above 0.2
# This also means increasingly few positive predictions (Purchase = "yes") as k increases
# If there are *no* positive predictions, Precision = TP / (TP + FP) = 0 / 0 = NA (undefined, anyway)
# You can verify this is the issue by reducing the threshold for a positive prediction in the custom metric to 0.1 (gives 0 NA's), or accentuating the problem by increasing it to 0.3 (gives 640 NA's)
# Too many NA estimates can be problematic for the accuracy of our precision estimate
# model_knn$resample %>%
# group_by(k) %>%
# summarize(n = n(),
# num_NA = sum(is.na(`Precision [thresh = 0.2]`))) # only occur in some estimates for high values of k (to be expected)
sum(is.na(model_knn$resample$`Precision [thresh = 0.2]`)) # 31
## [1] 31
The selection of \(k\) using this custom metric and cross-validation is shown below:
data.frame(k = model_knn$results$k,
CV_Precision = model_knn$results$`Precision [thresh = 0.2]`) %>%
mutate(max_CV_Precision = as.numeric(max(CV_Precision) == CV_Precision)) %>%
ggplot(aes(x = k, y = CV_Precision)) +
geom_line(col = "grey55") +
geom_point(size = 2, aes(col = factor(max_CV_Precision))) +
scale_x_continuous(breaks = seq(1, 45, 2), minor_breaks = NULL) +
scale_y_continuous(labels = scales::comma_format()) +
scale_color_manual(values = c("deepskyblue3", "green")) +
theme(legend.position = "none") +
labs(title = "Caravan Dataset - KNN",
subtitle = "Selecting k with cross-validation",
y = "CV Precision")
For \(k\) = 37, the cross-validation Precision:
## [1] 0.2403
The confusion matrix for test:
predicted <- factor(ifelse(predict(model_knn, test, "prob")$Yes > 0.2, "Yes", "No")) %>% relevel(ref = "Yes")
actual <- test$Purchase %>% relevel(ref = "Yes")
conf_matrix <- table(predicted, actual)
conf_matrix
## actual
## predicted Yes No
## Yes 14 56
## No 275 4477
Here's the resulting test precision:
precision <- conf_matrix[1, 1] / (conf_matrix[1, 1] + conf_matrix[1, 2]) # precision = TP / (TP + FP)
precision %>% round(4) # 0.2
## [1] 0.2
Notice how, in optimizing for precision without also penalizing low recall, we seem to be selecting for a model that makes very few positive predictions.
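To quantify this, a quick sketch computing the corresponding recall (the proportion of actual purchasers that were identified) from the KNN confusion matrix above:
recall <- conf_matrix[1, 1] / (conf_matrix[1, 1] + conf_matrix[2, 1]) # recall = TP / (TP + FN)
recall %>% round(4) # 14 / (14 + 275) = 0.0484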
Logistic Regression:
Here I compare to Logistic Regression (no regularization or feature selection):
set.seed(5)
model_logistic <- train(Purchase ~ .,
data = train,
method = "glm",
metric = "Precision [thresh = 0.2]",
maximize = T,
family = "binomial",
trControl = ctrl)
The cross-validation Precision:
## [1] 0.1324
The confusion matrix for test:
predicted <- factor(ifelse(predict(model_logistic, test, "prob")$Yes > 0.2, "Yes", "No")) %>% relevel(ref = "Yes")
actual <- test$Purchase %>% relevel(ref = "Yes")
conf_matrix <- table(predicted, actual)
conf_matrix
## actual
## predicted Yes No
## Yes 48 272
## No 241 4261
Here's the resulting test precision:
precision <- conf_matrix[1, 1] / (conf_matrix[1, 1] + conf_matrix[1, 2]) # precision = TP / (TP + FP)
precision %>% round(4) # 0.15
## [1] 0.15
Boosting (Tuned):
I can tune a gbm further to optimize for precision, which results in the ‘highest-performance’ model so far. However, this has the same drawbacks as the tuned KNN model in that it makes almost no positive predictions:
gbm_grid <- expand.grid(n.trees = seq(1500, 3000, 100), shrinkage = 0.001,
interaction.depth = 2, n.minobsinnode = 10)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3, summaryFunction = custom_precision, classProbs = T, returnResamp = "all")
set.seed(5)
model_gbm <- train(Purchase ~ .,
data = train,
method = "gbm",
distribution = "bernoulli",
verbose = F,
metric = "Precision [thresh = 0.2]",
maximize = T,
trControl = ctrl,
tuneGrid = gbm_grid)
data.frame(n_trees = model_gbm$results$n.trees,
CV_Precision = model_gbm$results$`Precision [thresh = 0.2]`) %>%
mutate(max_CV_Precision = as.numeric(max(CV_Precision) == CV_Precision)) %>%
ggplot(aes(x = n_trees, y = CV_Precision)) +
geom_line(col = "grey55") +
geom_point(size = 2, aes(col = factor(max_CV_Precision))) +
scale_x_continuous(labels = scales::comma_format()) +
scale_color_manual(values = c("deepskyblue3", "green")) +
theme(legend.position = "none") +
labs(title = "Caravan Dataset - Boosting",
subtitle = "Selecting number of trees with cross-validation",
y = "CV Precision")
The confusion matrix for test:
predicted <- factor(ifelse(predict(model_gbm, test, "prob")$Yes > 0.2, "Yes", "No")) %>% relevel(ref = "Yes")
actual <- test$Purchase %>% relevel(ref = "Yes")
conf_matrix <- table(predicted, actual)
conf_matrix
## actual
## predicted Yes No
## Yes 22 44
## No 267 4489
Here's the resulting test precision:
precision <- conf_matrix[1, 1] / (conf_matrix[1, 1] + conf_matrix[1, 2]) # precision = TP / (TP + FP)
precision %>% round(4) # 0.3333
## [1] 0.3333
## check:
# caret::precision(data = factor(ifelse(predict(model_gbm, test, "prob")$Yes > 0.2, "Yes", "No")),
# reference = test$Purchase,
# relevant = "Yes") # 0.3333333
This shows how optimizing for precision can result in a model that makes only the positive predictions most likely to be correct. This is clearest in the extreme case: a model making only one (correct) positive prediction would have a precision of 1.
This property might be desirable in some scenarios, but if not, we could adjust the target metric to one that values precision while simultaneously penalizing low recall (Area Under the Precision-Recall Curve, F1 Score, etc.).
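For example, a custom F1 summary function could be defined in the same way as custom_precision. The sketch below (hypothetical, not used elsewhere in these solutions) keeps the same 0.2 threshold and uses caret::F_meas(), the harmonic mean of precision and recall:
custom_F1 <- function (data, lev = NULL, model = NULL){
actual <- data[ ,"obs"]
probs <- data[ ,"Yes"] # P(y = "Yes")
predicted <- factor(ifelse(probs > 0.2, "Yes", "No")) # same 0.2 threshold as custom_precision
levels(actual) <- c("No", "Yes")
levels(predicted) <- c("No", "Yes")
out <- caret::F_meas(data = predicted, reference = actual, relevant = "Yes")
names(out) <- c("F1 [thresh = 0.2]")
out
}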
Boston Dataset (Bagging, Random Forests & Boosting vs Other Approaches)
Q: Apply boosting, bagging, and random forests to a data set of your choice. Be sure to fit the models on a training set and to evaluate their performance on a test set. How accurate are the results compared to simple methods like linear or logistic regression? Which of these approaches yields the best performance?
A:
I chose to work with the Boston dataset, which we have worked with several times already:
## Rows: 506
## Columns: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.088...
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5...
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87,...
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.5...
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.6...
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9...
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9...
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4,...
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2,...
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396...
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17...
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9,...
As in q11. of my chapter 6 solutions, I will be predicting crim - the per capita crime rate by town. I won't be using a train/test split, but will instead evaluate the performance of all models using repeated cross-validation, so the principle is the same. All the models below were tuned locally to reduce run times. My target metric was MSE, as this was the one used in chapter 6.
Linear Regression (Best-Subset Selection, No Transformations):
In chapter 6 (q11.) this was the best linear model selected from the unaltered dataset:
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10 , summaryFunction = custom_regression_metrics)
set.seed(1)
model_linear_1 <- train(crim ~ zn + nox + dis + rad + ptratio + black + medv,
data = Boston,
method = "lm",
trControl = ctrl)
model_linear_1$results$MSE %>% round(2) # 42.64
## [1] 42.64
Linear Regression (Best-Subset Selection, Manual Transformations):
Again from chapter 6 (q11.), where a linear model with manual transformations improved performance significantly:
Boston_1 <- Boston
Boston_1$medv_sq <- Boston_1$medv^2
Boston_1$medv_cub <- Boston_1$medv^3
Boston_1$rad_cat <- ifelse(Boston_1$rad > 20, 1, 0)
Boston_1$rad <- NULL
Boston_1$tax <- NULL
set.seed(1)
model_linear_2 <- train(crim ~ nox + dis + ptratio + medv + medv_sq + medv_cub + rad_cat,
data = Boston_1,
method = "lm",
trControl = ctrl)
model_linear_2$results$MSE %>% round(2) # 35.54
## [1] 35.54
The numbers are slightly different for these two models when compared with chapter 6 (due to different CV folds).
Generalized Additive Models:
The reason for testing the transformations above was that I felt that many of the non-linear relationships present in the Boston dataset weren't being captured.
The natural next step for me was to compare the performance of my manually transformed linear model to a GAM (with smooth terms for every numeric predictor that has at least 10 unique values).
I was surprised to see that, while performance is better, the gap is small:
set.seed(1)
model_smoothspline <- train(crim ~ .,
data = Boston,
method = "gamSpline",
maximize = F,
metric = "MSE",
trControl = ctrl,
tuneGrid = data.frame(df = 4.9))
model_smoothspline$results$MSE %>% round(2) # 35.07
## [1] 35.07
Bagging:
Bagging does better than standard linear regression without predictor transformations, but under-performs relative to the more flexible linear approaches:
set.seed(1)
model_bagging <- train(crim ~ .,
data = Boston,
method = "rf",
metric = "MSE",
maximize = F,
trControl = ctrl,
tuneGrid = expand.grid(mtry = 13))
model_bagging$results$MSE %>% round(2) # 39.64
## [1] 39.64
Random Forest:
Random forests give the best performance so far:
set.seed(1)
model_rf <- train(crim ~ .,
data = Boston,
method = "rf",
metric = "MSE",
maximize = F,
trControl = ctrl,
tuneGrid = expand.grid(mtry = 2))
model_rf$results$MSE %>% round(2) # 33.54
## [1] 33.54
Boosting:
Boosting performance was good but didn’t manage to beat Random Forests:
gbm_grid <- expand.grid(n.trees = 3400, interaction.depth = 5, n.minobsinnode = 10, shrinkage = 0.001)
set.seed(1)
model_gbm <- train(crim ~ .,
data = Boston,
method = "gbm",
distribution = "gaussian",
verbose = F,
metric = "MSE",
maximize = F,
trControl = ctrl,
tuneGrid = gbm_grid)
model_gbm$results$MSE %>% round(2) # 34.35
## [1] 34.35
Comparison:
A summary can be seen below. Standard linear regression (without transformations) and bagged trees performed poorly, but other methods seem reasonably competitive:
model_summary_resamples <- data.frame(method = "Linear Reg.", MSE = model_linear_1$resample$MSE) %>%
rbind(data.frame(method = "Linear Reg. (Transformations)", MSE = model_linear_2$resample$MSE)) %>%
rbind(data.frame(method = "GAM (Smooth Terms)", MSE = model_smoothspline$resample$MSE)) %>%
rbind(data.frame(method = "Bagged Trees", MSE = model_bagging$resample$MSE)) %>%
rbind(data.frame(method = "Random Forest", MSE = model_rf$resample$MSE)) %>%
rbind(data.frame(method = "GBM", MSE = model_gbm$resample$MSE))
model_summary <- model_summary_resamples %>%
group_by(method) %>%
summarize(MSE = mean(MSE)) %>%
arrange(MSE) %>%
mutate(MSE = round(MSE, 2))
model_summary
## # A tibble: 6 x 2
## method MSE
## <chr> <dbl>
## 1 Random Forest 33.5
## 2 GBM 34.4
## 3 GAM (Smooth Terms) 35.1
## 4 Linear Reg. (Transformations) 35.5
## 5 Bagged Trees 39.6
## 6 Linear Reg. 42.6
As I used 10 repeats of 10-fold cross-validation for each model, each model has 100 out-of-sample performance estimates. By setting the same random seed each time, the cross-validation folds are also identical across models. I visualize these out-of-sample estimates using the modified box plot below, ordered from best (top) to worst (bottom) performance:
ggplot(model_summary_resamples, aes(x = factor(method, ordered = T, levels = rev(model_summary$method)), y = MSE)) +
geom_boxplot(outlier.shape = NA, aes(fill = method, alpha = 0.3)) +
geom_jitter(alpha = 0.2, position = position_jitter(w = 0.05), col = "grey20") +
theme(legend.position = "none", axis.title.y = element_blank()) +
coord_flip() +
labs(title = "Boston Dataset - Model Resample Performance - MSE Comparison")
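Since the same seed gives identical folds across models, the resample estimates are paired, and caret's resamples() interface offers a more formal comparison than the box plot. A minimal sketch using the model objects fitted above (MSE being the custom metric used throughout):
# Paired comparison of the out-of-sample MSE distributions:
resamps <- resamples(list("Random Forest" = model_rf,
                          "GBM" = model_gbm,
                          "Bagged Trees" = model_bagging,
                          "GAM (Smooth Terms)" = model_smoothspline,
                          "Linear Reg." = model_linear_1,
                          "Linear Reg. (Transformations)" = model_linear_2))
summary(resamps, metric = "MSE")
# summary(diff(resamps, metric = "MSE")) # pairwise differences with paired t-tests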