NOTE: There are no official solutions for these questions. These are my solutions and could be incorrect. If you spot any mistakes/inconsistencies, please contact me at Liam95morgan@gmail.com, or via LinkedIn.

Some of the figures in this presentation are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani

library(tidyverse)
library(ISLR) # 'Wage' data
library(caret) # cross-validation
library(gridExtra) # combining graphs
library(gam) # generalized additive models
library(MASS) # Boston data

select <- dplyr::select

# caret::train with RSS & MSE added:
custom_regression_metrics <- function (data, lev = NULL, model = NULL) {
  c(RMSE = sqrt(mean((data$obs-data$pred)^2)),
    Rsquared = summary(lm(pred ~ obs, data))$r.squared,
    MAE = mean(abs(data$obs-data$pred)), 
    MSE = mean((data$obs-data$pred)^2),
    RSS = sum((data$obs-data$pred)^2))
}
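# (passed later to trainControl(summaryFunction = custom_regression_metrics))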

RNGkind(kind = "Mersenne-Twister", 
        normal.kind = "Inversion", 
        sample.kind = "Rejection") # diff versions of R/RMD using different sampling types was annoying me

theme_set(theme_light())




1. Cubic Splines

It was mentioned in the chapter that a cubic regression spline with one knot at \(\xi\) can be obtained using a basis of the form \(x, x^2, x^3, (x - \xi)_{+}^3\), where \((x - \xi)_{+}^3 = (x - \xi)^3\) if \(x > \xi\) and equals \(0\) otherwise. We will now show that a function of the form \[f(x) = \beta_0 + \beta_1x + \beta_2x^2 + \beta_3x^3 + \beta_4(x - \xi)_{+}^3\] is indeed a cubic regression spline, regardless of the values of \(\beta_0, \beta_1, \beta_2, \beta_3, \beta_4\).


(a) Cubic Polynomial Format for \(x \le \xi\)

Q: Find a cubic polynomial \[f_1(x) = a_1 + b_1x + c_1x^2 + d_1x^3\] such that \(f(x) = f_1(x)\) for all \(x \le \xi\). Express \(a_1, b_1, c_1, d_1\) in terms of \(\beta_0, \beta_1, \beta_2, \beta_3, \beta_4\).

A:

From the question, we must find a cubic polynomial \(f_1(x)\) with the following properties:

\[\begin{align} f(x) & = \beta_0 + \beta_1x + \beta_2x^2 + \beta_3x^3 + \beta_4(x - \xi)_{+}^3 \\ & = \beta_0 + \beta_1x + \beta_2x^2 + \beta_3x^3 && \text{since } (x - \xi)_{+}^3 = 0 \quad (\forall x \le \xi) \\ & = a_1 + b_1x + c_1x^2 + d_1x^3 \\ & = f_1(x) \end{align}\]

It is clear from equating coefficients that, for \(x \le \xi\), we have \(f_1(x) = f(x)\) when:

  • \(a_1 = \beta_0\)
  • \(b_1= \beta_1\)
  • \(c_1 = \beta_2\)
  • \(d_1 = \beta_3\)


(b) Cubic Polynomial Format for \(x > \xi\)

Q: Find a cubic polynomial \[f_2(x) = a_2 + b_2x + c_2x^2 + d_2x^3\] such that \(f(x) = f_2(x)\) for all \(x > \xi\). Express \(a_2, b_2, c_2, d_2\) in terms of \(\beta_0, \beta_1, \beta_2, \beta_3, \beta_4\). We have now established that \(f(x)\) is a piecewise polynomial.

A:

Similar to before, we must find a cubic polynomial \(f_2(x)\) with the following properties:

\[\begin{align} f(x) & = \beta_0 + \beta_1x + \beta_2x^2 + \beta_3x^3 + \beta_4(x - \xi)_{+}^3 \\ & = \beta_0 + \beta_1x + \beta_2x^2 + \beta_3x^3 + \beta_4(x - \xi)^3 && \text{since } (x - \xi)_{+}^3 = (x - \xi)^3 \quad (\forall x > \xi) \\ & = \beta_0 + \beta_1x + \beta_2x^2 + \beta_3x^3 + \beta_4(x - \xi)(x^2 - 2\xi x + \xi^2) \\ & = \beta_0 + \beta_1x + \beta_2x^2 + \beta_3x^3 + \beta_4(x^3 - 2\xi x^2 + \xi^2 x - \xi x^2 + 2\xi^2 x - \xi^3) \\ & = (\beta_0 - \beta_4 \xi^3) + (\beta_1 + \beta_4 \xi^2 + 2 \beta_4 \xi^2)x + (\beta_2 - 2\beta_4 \xi - \beta_4 \xi)x^2 + (\beta_3 + \beta_4)x^3 \\ & = (\beta_0 - \beta_4 \xi^3) + (\beta_1 + 3 \beta_4 \xi^2)x + (\beta_2 - 3\beta_4 \xi)x^2 + (\beta_3 + \beta_4)x^3 \\ & = a_2 + b_2x + c_2x^2 + d_2x^3 \\ & = f_2(x) \end{align}\]

Equating coefficients we have that, for \(x > \xi\), \(f_2(x) = f(x)\) when:

  • \(a_2 = \beta_0 - \beta_4 \xi^3\)
  • \(b_2 = \beta_1 + 3 \beta_4 \xi^2\)
  • \(c_2 = \beta_2 - 3\beta_4 \xi\)
  • \(d_2 = \beta_3 + \beta_4\)


(c) Continuity at the Knot \(\xi\)

Q: Show that \(f_1(\xi) = f_2(\xi)\). That is, \(f(x)\) is continuous at \(\xi\).

A:

We have already shown that \(f(x) = \beta_0 + \beta_1x + \beta_2x^2 + \beta_3x^3 + \beta_4(x - \xi)_{+}^3\) is a piecewise cubic polynomial that can be expressed in the form:

\[\begin{align} f(x) & = \begin{cases} f_1(x) && \text{for } x \le \xi \\ f_2(x) && \text{for } x > \xi \end{cases} \\ & = \begin{cases} \beta_0 + \beta_1x + \beta_2x^2 + \beta_3x^3 && \text{for } x \le \xi \\ (\beta_0 - \beta_4 \xi^3) + (\beta_1 + 3 \beta_4 \xi^2)x + (\beta_2 - 3\beta_4 \xi)x^2 + (\beta_3 + \beta_4)x^3 && \text{for } x > \xi \end{cases} \end{align}\]

At the knot \(\xi\), we get:

\[f_1(\xi) = \beta_0 + \beta_1 \xi + \beta_2 \xi^2 + \beta_3 \xi^3\]

\[\begin{align} f_2(\xi) & = (\beta_0 - \beta_4 \xi^3) + (\beta_1 + 3 \beta_4 \xi^2)\xi + (\beta_2 - 3\beta_4 \xi)\xi^2 + (\beta_3 + \beta_4)\xi^3 \\ & = \beta_0 - \beta_4 \xi^3 + \beta_1 \xi + 3 \beta_4 \xi^3 + \beta_2 \xi^2 - 3\beta_4 \xi^3 + \beta_3 \xi^3 + \beta_4 \xi^3 \\ & = \beta_0 + \beta_1 \xi + \beta_2 \xi^2 + \beta_3 \xi^3 + (\beta_4 \xi^3 - \beta_4 \xi^3 + 3 \beta_4 \xi^3 - 3\beta_4 \xi^3) \\ & = \beta_0 + \beta_1 \xi + \beta_2 \xi^2 + \beta_3 \xi^3 \end{align}\]

We therefore have \(f_1(\xi) = f_2(\xi)\), so \(f(x)\) is continuous at the knot \(\xi\).


(d) Continuity (in the First Derivative) at the Knot \(\xi\)

Q: Show that \(f_1'(\xi) = f_2'(\xi)\). That is, \(f'(x)\) is continuous at \(\xi\).

A:

Differentiating \(f(x)\) above, we get:

\[\begin{align} f'(x) & = \begin{cases} f_1'(x) && \text{for } x \le \xi \\ f_2'(x) && \text{for } x > \xi \end{cases} \\ & = \begin{cases} \beta_1 + 2\beta_2x + 3\beta_3x^2 && \text{for } x \le \xi \\ (\beta_1 + 3 \beta_4 \xi^2) + 2(\beta_2 - 3\beta_4 \xi)x + 3(\beta_3 + \beta_4)x^2 && \text{for } x > \xi \end{cases} \end{align}\]

At the knot \(\xi\), we get:

\[f_1'(\xi) = \beta_1 + 2\beta_2 \xi + 3\beta_3 \xi^2\]

\[\begin{align} f_2'(\xi) & = (\beta_1 + 3 \beta_4 \xi^2) + 2(\beta_2 - 3\beta_4 \xi) \xi + 3(\beta_3 + \beta_4) \xi^2 \\ & = \beta_1 + 2\beta_2\xi + 3 \beta_3 \xi^2 + 3 \beta_4 \xi^2 - 6 \beta_4 \xi^2 + 3 \beta_4 \xi^2 \\ & = \beta_1 + 2\beta_2\xi + 3 \beta_3 \xi^2 \end{align}\]

We therefore have \(f_1'(\xi) = f_2'(\xi)\), so \(f'(x)\) is continuous at the knot \(\xi\).


(e) Continuity (in the Second Derivative) at the Knot \(\xi\)

Q: Show that \(f_1''(\xi) = f_2''(\xi)\). That is, \(f''(x)\) is continuous at \(\xi\).

A:

Following the same steps and differentiating \(f'(x)\) above, we get:

\[\begin{align} f''(x) & = \begin{cases} f_1''(x) && \text{for } x \le \xi \\ f_2''(x) && \text{for } x > \xi \end{cases} \\ & = \begin{cases} 2\beta_2 + 6\beta_3x && \text{for } x \le \xi \\ 2(\beta_2 - 3\beta_4 \xi) + 6(\beta_3 + \beta_4)x && \text{for } x > \xi \end{cases} \end{align}\]

At the knot \(\xi\), we get:

\[f_1''(\xi) = 2\beta_2 + 6\beta_3\xi\]

\[\begin{align} f_2''(\xi) & = 2(\beta_2 - 3\beta_4 \xi) + 6(\beta_3 + \beta_4)\xi \\ & = 2\beta_2 + 6\beta_3 \xi - 6\beta_4 \xi + 6\beta_4 \xi \\ & = 2\beta_2 + 6\beta_3 \xi \end{align}\]

\(f_1''(\xi) = f_2''(\xi)\), so \(f''(x)\) is continuous at the knot \(\xi\).


We have shown that, regardless of the values of \(\beta_0, \beta_1, \beta_2, \beta_3, \beta_4\), a function of the form \(f(x) = \beta_0 + \beta_1x + \beta_2x^2 + \beta_3x^3 + \beta_4(x - \xi)_{+}^3\) has all the properties of a cubic spline with one knot at \(\xi\):

  • A piecewise polynomial
  • Continuous at \(\xi\)
  • Continuous in the first and second derivative at \(\xi\) (so \(f(x)\) is not only continuous, but ‘smooth’ at \(\xi\))
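
As a quick numerical sanity check (my own addition, using arbitrary made-up coefficients), we can confirm that \(f_1\), \(f_2\) and their first two derivatives agree at the knot:

beta <- c(2, -1, 0.5, 0.3, -0.7) # arbitrary values for beta_0, ..., beta_4
xi <- 1.5

f1 <- function(x) beta[1] + beta[2]*x + beta[3]*x^2 + beta[4]*x^3
f2 <- function(x) (beta[1] - beta[5]*xi^3) + (beta[2] + 3*beta[5]*xi^2)*x + 
  (beta[3] - 3*beta[5]*xi)*x^2 + (beta[4] + beta[5])*x^3

f1_d1 <- function(x) beta[2] + 2*beta[3]*x + 3*beta[4]*x^2
f2_d1 <- function(x) (beta[2] + 3*beta[5]*xi^2) + 2*(beta[3] - 3*beta[5]*xi)*x + 3*(beta[4] + beta[5])*x^2

f1_d2 <- function(x) 2*beta[3] + 6*beta[4]*x
f2_d2 <- function(x) 2*(beta[3] - 3*beta[5]*xi) + 6*(beta[4] + beta[5])*x

# all three differences should be (numerically) zero at the knot
c(value = f1(xi) - f2(xi), 
  first_deriv = f1_d1(xi) - f2_d1(xi), 
  second_deriv = f1_d2(xi) - f2_d2(xi))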




2. Smoothing Splines - Extreme Smoothing Parameters

Suppose that a curve \(\hat{g}\) is computed to smoothly fit a set of \(n\) points using the following formula:

\[\hat{g} = \underset{g}{\operatorname{argmin}} \left( \sum_{i=1}^n (y_i - g(x_i))^2 + \lambda\int \left[g^{(m)}(x) \right]^2 dx \right)\] where \(g^{(m)}\) represents the \(m\)th derivative of \(g\) (and \(g^{(0)} = g\)). Provide example sketches of \(\hat{g}\) in each of the following scenarios.


NOTE: Before proceeding I’m going to generate some data, or we won’t be able to sketch how \(\hat{g}\) will change under different conditions!

This is the example I use:

\[Y = g(X) + \epsilon \\ g(X) = \frac{\text{sin}(12(X + 0.2))}{X + 0.2}\]

With \(X \sim \mathcal{U}[0, 1]\) and \(\epsilon \sim \mathcal{N}(0, 1)\).

I draw 50 samples with this relationship and plot below. I also plot an overlay of the true generating function \(g(X)\):

set.seed(3)

X <- runif(50)
eps <- rnorm(50)
Y <- sin(12*(X + 0.2)) / (X + 0.2) + eps
generating_fn <- function(X) {sin(12*(X + 0.2)) / (X + 0.2)}
df <- data.frame(X, Y)

ggplot(df, aes(x = X, y = Y)) + 
  geom_point(alpha = 0.5) + 
  stat_function(fun = generating_fn, aes(col = "Generating Function")) + 
  scale_color_manual(values = "deepskyblue3") + 
  theme(legend.position = "bottom", legend.title = element_blank())

The explanation given on the first half of p.278 was really helpful on this question - particularly the intuition that \(\int \left[g^{(m)}(x) \right]^2 dx\) can be thought of as a summation of \(\left[g^{(m)}(x) \right]^2\) over the range of \(x\), so \(\hat{g}\) will be the \(g\) that minimizes \(\sum_{i=1}^n (y_i - g(x_i))^2 + \lambda\int \left[g^{(m)}(x) \right]^2 dx\) (for a given \(\lambda \ge 0\)).

We can see that large values of \(\lambda\) make the penalty term dominate, which pushes \(\int \left[g^{(m)}(x) \right]^2 dx\) (and hence \(g^{(m)}(x)\) itself) towards zero. At the opposite extreme, setting \(\lambda = 0\) removes the penalty term entirely from the equation, and we are left with the freedom to choose any \(g\) that minimizes the loss term \(\sum_{i=1}^n (y_i - g(x_i))^2\).

This intuition is used throughout this question, so I will not go into as much detail for individual parts.
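
To make the 'summation over the range of \(x\)' intuition a little more concrete, here is a rough numerical sketch (my own addition, not part of the question): approximating \(\int \left[g''(x) \right]^2 dx\) over \([0, 1]\) with a Riemann sum, for the wiggly generating function versus an arbitrary straight line. The wiggly function pays a far larger roughness penalty:

# crude Riemann-sum approximation of the m = 2 roughness penalty over [0, 1]
x_grid <- seq(0, 1, by = 0.001)

second_deriv <- function(f, x, h = 1e-4) (f(x + h) - 2*f(x) + f(x - h)) / h^2

straight_line <- function(x) 2*x - 3 # any linear function: g''(x) = 0 everywhere

sum(second_deriv(generating_fn, x_grid)^2 * 0.001)  # large penalty (lots of curvature)
sum(second_deriv(straight_line, x_grid)^2 * 0.001)  # ~0 (no curvature)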


(a) Sketch \(\hat{g}\) for \(\lambda = \infty, m = 0\)

Q: \(\lambda = \infty, m = 0\)

A:

For \(m = 0\), we get:

\[\hat{g} = \underset{g}{\operatorname{argmin}} \left( \sum_{i=1}^n (y_i - g(x_i))^2 + \lambda\int \left[ g(x) \right]^2 dx \right)\]

As \(\lambda\) increases, the penalty term becomes more and more important in the above equation. As \(\lambda \rightarrow \infty\), this forces \(g(x) \rightarrow 0\).

We therefore get \(\hat{g}(x) = 0\):

ggplot(df, aes(x = X, y = Y)) + 
  geom_point(alpha = 0.5) + 
  stat_function(fun = generating_fn, aes(col = "Generating Function")) + 
  geom_hline(aes(yintercept = 0, linetype = "Chosen g(X)"), col = "green", size = 0.8) + 
  scale_color_manual(values = "deepskyblue3") + 
  theme(legend.position = "bottom", legend.title = element_blank())


(b) Sketch \(\hat{g}\) for \(\lambda = \infty, m = 1\)

Q: \(\lambda = \infty, m = 1\)

A:

For \(m = 1\), we get:

\[\hat{g} = \underset{g}{\operatorname{argmin}} \left( \sum_{i=1}^n (y_i - g(x_i))^2 + \lambda\int \left[ g'(x) \right]^2 dx \right)\]

As \(\lambda \rightarrow \infty\), this forces \(g'(x) \rightarrow 0\).

This means we would get \(\hat{g}(x) = c\).

More specifically we can take \(\hat{g}(x) = c = \frac{1}{n} \sum_{i=1}^n y_i\), since all constant functions will have a first derivative of zero, but this one will also minimize the RSS (the LHS loss term).

ggplot(df, aes(x = X, y = Y)) + 
  geom_point(alpha = 0.5) + 
  stat_function(fun = generating_fn, aes(col = "Generating Function")) + 
  geom_hline(aes(yintercept = mean(Y), linetype = "Chosen g(X)"), col = "green", size = 0.8) + 
  scale_color_manual(values = "deepskyblue3") + 
  theme(legend.position = "bottom", legend.title = element_blank())


(c) Sketch \(\hat{g}\) for \(\lambda = \infty, m = 2\)

Q: \(\lambda = \infty, m = 2\)

A:

For \(m = 2\), we get:

\[\hat{g} = \underset{g}{\operatorname{argmin}} \left( \sum_{i=1}^n (y_i - g(x_i))^2 + \lambda\int \left[ g''(x) \right]^2 dx \right)\] As \(\lambda \rightarrow \infty\), this forces \(g''(x) \rightarrow 0\).

This means we would get \(\hat{g}(x) = ax + b\).

More specifically we can say \(\hat{g}(x)\) will be the linear least squares line, since all linear functions will have a second derivative of zero, but this one will also minimize the RSS.

ggplot(df, aes(x = X, y = Y)) + 
  geom_point(alpha = 0.5) + 
  stat_function(fun = generating_fn, aes(col = "Generating Function")) + 
  geom_smooth(method = "lm", formula = "y ~ x", se = F, size = 0.8, aes(col = "Chosen g(X)")) + 
  scale_color_manual(values = c("green", "deepskyblue3")) + 
  theme(legend.position = "bottom", legend.title = element_blank())


(d) Sketch \(\hat{g}\) for \(\lambda = \infty, m = 3\)

Q: \(\lambda = \infty, m = 3\)

A:

For \(m = 3\), we get:

\[\hat{g} = \underset{g}{\operatorname{argmin}} \left( \sum_{i=1}^n (y_i - g(x_i))^2 + \lambda\int \left[ g^{(3)}(x) \right]^2 dx \right)\] As \(\lambda \rightarrow \infty\), this forces \(g^{(3)}(x) \rightarrow 0\).

This means we would get \(\hat{g}(x) = ax^2 + bx + c\).

More specifically we can say \(\hat{g}(x)\) will be the quadratic least squares line, since all quadratic functions will have a third derivative of zero, but this one will also minimize the RSS.

ggplot(df, aes(x = X, y = Y)) + 
  geom_point(alpha = 0.5) + 
  stat_function(fun = generating_fn, aes(col = "Generating Function")) + 
  geom_smooth(method = "lm", formula = "y ~ x + I(x^2)", se = F, size = 0.8, aes(col = "Chosen g(X)")) + 
  scale_color_manual(values = c("green", "deepskyblue3")) + 
  theme(legend.position = "bottom", legend.title = element_blank())


(e) Sketch \(\hat{g}\) for \(\lambda = 0, m = 3\)

Q: \(\lambda = 0, m = 3\)

A:

For \(m = 3\), we get the same as before:

\[\hat{g} = \underset{g}{\operatorname{argmin}} \left( \sum_{i=1}^n (y_i - g(x_i))^2 + \lambda\int \left[ g^{(3)}(x) \right]^2 dx \right)\]

However, since \(\lambda = 0\), the penalty term no longer plays any role in the selection of \(\hat{g}(x)\). For this reason, we can achieve RSS = 0 with any \(g(x)\) that interpolates all of the observations!

Taking a cubic smoothing spline (with an essentially-zero smoothing parameter, so that it effectively interpolates the observations) as an example:

interp_spline <- smooth.spline(x = df$X, y = df$Y, all.knots = T, lambda = 1e-13) # lambda ~ 0
fitted <- predict(interp_spline, x = seq(min(X) - 0.02, max(X) + 0.02, by = 0.0001))
fitted <- data.frame(x = fitted$x, fitted_y = fitted$y)

ggplot(df, aes(x = X, y = Y)) + 
  geom_point(alpha = 0.5) + 
  stat_function(fun = generating_fn, aes(col = "Generating Function")) + 
  geom_line(data = fitted, 
            aes(x = x, y = fitted_y, col = "Chosen g(X)"), size = 0.8) + 
  scale_color_manual(values = c("green", "deepskyblue3")) + 
  theme(legend.position = "bottom", legend.title = element_blank())


As a bonus, I fit a cubic smoothing spline to the data, but this time \(\lambda\) is selected automatically by generalized cross-validation (the smooth.spline() default). The fit is much closer to the true generating function than anything we’ve seen so far - impressive considering the dataset is only 50 observations!

smooth_spline <- smooth.spline(x = df$X, y = df$Y, all.knots = T)
fitted <- predict(smooth_spline, x = seq(min(X) - 0.02, max(X) + 0.02, by = 0.0001))
fitted <- data.frame(x = fitted$x, fitted_y = fitted$y)

ggplot(df, aes(x = X, y = Y)) + 
  geom_point(alpha = 0.5) + 
  stat_function(fun = generating_fn, aes(col = "Generating Function")) + 
  geom_line(data = fitted, 
            aes(x = x, y = fitted_y, col = "Chosen g(X)"), size = 0.8) + 
  scale_color_manual(values = c("green", "deepskyblue3")) + 
  theme(legend.position = "bottom", legend.title = element_blank())
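
Out of interest, the smoothing parameter and effective degrees of freedom that were selected can be read straight off the fitted object:

smooth_spline$lambda
smooth_spline$df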




3. Basis Functions - p.1

Q: Suppose we fit a curve with basis functions \(b_1(X) = X\), \(b_2(X) = (X - 1)^2 I(X \ge 1)\).

(Note that \(I(X \ge 1)\) equals \(1\) for \(X \ge 1\) and \(0\) otherwise.)

We fit the linear regression model

\[Y = \beta_0 + \beta_1 b_1(X) + \beta_2 b_2(X) + \epsilon\]

and obtain coefficient estimates \(\hat{\beta_0} = 1\), \(\hat{\beta_1} = 1\), \(\hat{\beta_2} = -2\). Sketch the estimated curve between \(X = -2\) and \(X = 2\). Note the intercepts, slopes, and other relevant information.

A:

I sketch the fitted curve \(\hat{f}(X) = 1 + X - 2(X - 1)^2 I(X \ge 1)\):

X = seq(-2, 2, 0.01)
Y = 1 + X - 2 * (X - 1)^2 * (X >= 1)
df <- data.frame(X, Y)

ggplot(df, aes(x = X, y = Y)) + 
  geom_vline(xintercept = 0) + 
  geom_vline(xintercept = 1, col = "mediumseagreen") + 
  geom_hline(yintercept = 0) + 
  geom_line(size = 1.5)

For \(X < 1\), we simply have \(\hat{f}(X) = 1 + X\), so the slope and intercept are clearly \(1\).

Because of the indicator variable \(I(X \ge 1)\), the quadratic term only starts to become relevant for \(X \ge 1\), which is why the curve is linear before then.

Once we have \(X \ge 1\), the fitted equation becomes:

\[\begin{align} \hat{f}(X) & = 1 + X - 2(X - 1)^2 \\ & = 1 + X - 2(X^2 - 2X + 1) \\ & = -2X^2 + 5X - 1 \end{align}\]

Taking the derivative, we can see that the slope for \(X \ge 1\) will vary (it is a function of \(X\)):

\[\hat{f}'(X) = -4X + 5\]

Setting this derivative equal to zero shows why the critical point (a maximum) in the graph occurs at \(X = \frac{5}{4}\).
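
A quick numerical check of this (my own addition):

f_hat_q3 <- function(x) 1 + x - 2 * (x - 1)^2 * (x >= 1)
optimize(f_hat_q3, interval = c(1, 2), maximum = TRUE) # maximum at X = 1.25, where f_hat = 2.125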




4. Basis Functions - p.2

Q: Suppose we fit a curve with basis functions \(b_1(X) = I(0 \le X \le 2) - (X - 1)I(1 \le X \le 2)\), \(b_2(X) = (X - 3)I(3 \le X \le 4) + I(4 < X \le 5)\).

We fit the linear regression model

\[Y = \beta_0 + \beta_1 b_1(X) + \beta_2 b_2(X) + \epsilon\] and obtain coefficient estimates \(\hat{\beta_0} = 1\), \(\hat{\beta_1} = 1\), \(\hat{\beta_2} = 3\). Sketch the estimated curve between \(X = -2\) and \(X = 2\). Note the intercepts, slopes, and other relevant information.

A:

Plugging in these basis functions and coefficient estimates, we get the fitted model:

\[\hat{f}(X) = 1 + I(0 \le X \le 2) - (X - 1)I(1 \le X \le 2) + 3 (X - 3)I(3 \le X \le 4) + 3 I(4 < X \le 5)\]

Since we are only interested in the function over the range \([-2, 2]\), we can simplify \(\hat{f}(X)\) over this range:

\[\hat{f}(X) = 1 + I(0 \le X \le 2) - (X - 1)I(1 \le X \le 2)\]

X = seq(-2, 2, 0.01)
Y = 1 + (X >= 0 & X <= 2) - (X - 1)*(X >= 1 & X <= 2) + 3*(X - 3)*(X >= 3 & X <= 4) + 3*(X > 4 & X <= 5)
df <- data.frame(X, Y)

ggplot(df, aes(x = X, y = Y)) + 
  geom_vline(xintercept = 0) + 
  geom_hline(yintercept = 0) + 
  geom_line(size = 1.5)

For \(X < 0\) none of the indicator variables are true, so \(\hat{f}(X) = 1\) with a slope of \(0\).

For \(0 \le X < 1\), only the first indicator variable is true, so \(\hat{f}(X) = 1 + 1 = 2\); this is the value of the curve at \(X = 0\) (its y-intercept), again with a slope of \(0\).

For \(1 \le X \le 2\), both indicator variables are true, so \(\hat{f}(X) = 1 + 1 - (X - 1) \cdot 1 = 3 - X\), so the slope is \(-1\) here.
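
A few spot checks of these values (at points I picked arbitrarily):

f_hat_q4 <- function(x) 1 + (x >= 0 & x <= 2) - (x - 1)*(x >= 1 & x <= 2)
f_hat_q4(c(-1, 0.5, 1.5, 2)) # 1, 2, 1.5, 1: flat at 1, then flat at 2, then decreasing with slope -1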




5. Smoothing Splines

Consider two curves, \(\hat{g_1}\) and \(\hat{g_2}\), defined by

\[\hat{g_1} = \underset{g}{\operatorname{argmin}} \left( \sum_{i=1}^n (y_i - g(x_i))^2 + \lambda\int \left[ g^{(3)}(x) \right]^2 dx \right)\]

\[\hat{g_2} = \underset{g}{\operatorname{argmin}} \left( \sum_{i=1}^n (y_i - g(x_i))^2 + \lambda\int \left[ g^{(4)}(x) \right]^2 dx \right)\]

where \(g^{(m)}\) represents the \(m\)th derivative of \(g\).


(a) Training RSS as \(\lambda \rightarrow \infty\)

Q: As \(\lambda \rightarrow \infty\), will \(\hat{g_1}\) or \(\hat{g_2}\) have the smaller training RSS?

A:

Same logic as in question 2: as \(\lambda\) increases, the penalty term becomes more and more important in the above equations.

For \(\hat{g_1}\), \(\lambda \rightarrow \infty\) forces \(g^{(3)}(x) \rightarrow 0\). The highest order (most flexible) polynomial to satisfy this would be of the form \(\hat{g}(x) = ax^2 + bx + c\) (as the third derivative would be zero). \(\hat{g_1}\) will therefore be the quadratic that minimizes the training RSS.

For \(\hat{g_2}\), \(\lambda \rightarrow \infty\) forces \(g^{(4)}(x) \rightarrow 0\). The highest order (most flexible) polynomial to satisfy this would be of the form \(\hat{g}(x) = ax^3 + bx^2 + cx + d\) (as the fourth derivative would be zero). \(\hat{g_2}\) will therefore be the cubic that minimizes the training RSS.

Since \(\hat{g_2}\) would be more flexible as the higher order polynomial, \(\hat{g_2}\) would have the smaller training RSS.
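
Since any quadratic is a special case of a cubic (the models are nested), the fitted cubic's training RSS can never exceed the fitted quadratic's. A quick illustration of this (my own addition, using the Wage data from the ISLR package loaded at the top):

# training RSS of the least-squares quadratic vs cubic (the cubic's is never larger)
c(quadratic = sum(resid(lm(wage ~ poly(age, 2), data = Wage))^2),
  cubic     = sum(resid(lm(wage ~ poly(age, 3), data = Wage))^2))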


(b) Test RSS as \(\lambda \rightarrow \infty\)

Q: As \(\lambda \rightarrow \infty\), will \(\hat{g_1}\) or \(\hat{g_2}\) have the smaller test RSS?

A:

We don’t know - it depends whether the true relationship between \(x\) and \(y\) is better approximated with a cubic or quadratic relationship. It is possible that \(\hat{g_2}\) could be overfit and have a larger test RSS, or \(\hat{g_1}\) could be underfit and have a larger test RSS.


(c) Training & Test RSS for \(\lambda = 0\)

Q: For \(\lambda = 0\), will \(\hat{g_1}\) or \(\hat{g_2}\) have the smaller training and test RSS?

A:

For \(\lambda = 0\) there is no restriction at all on \(g(x)\), so both \(\hat{g_1}\) and \(\hat{g_2}\) would have the same training RSS (of zero if all the \(x_i\) are unique). With no restrictions on \(g\) we can simply have any function that interpolates all the training observations; see question 2) e) as an example.

Both of these functions would be terribly overfit and have high test RSS. If we assume that the same interpolating function was chosen for \(\hat{g_1}\) & \(\hat{g_2}\) (e.g. both were a linear spline with knots at each unique \(x_i\)) they would also clearly have the same test RSS.




6. APPLIED: The Wage Dataset (Polynomial Regression, Step Functions)

In this exercise, you will further analyze the Wage data set considered throughout this chapter.


(a) Polynomial Regression

Q: Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree \(d\) for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.

A:

I iterate through polynomials of order 1 to 10 of age, using 10-fold cross-validation to select the best model. This resulted in the selection of the third-order polynomial:

ctrl <- trainControl(method = "cv", number = 10)

CV_RMSE <- c()

set.seed(123)

for (i in 1:10) {
  model_temp <- train(y = Wage$wage,
                      x = poly(Wage$age, i, raw = T, simple = T),
                      method = "lm",
                      metric = "RMSE",
                      trControl = ctrl)
  CV_RMSE[i] <- model_temp$results$RMSE
}


data.frame(degree = 1:10, CV_RMSE = CV_RMSE) %>%
  mutate(min_CV_RMSE = as.numeric(min(CV_RMSE) == CV_RMSE)) %>%
  ggplot(aes(x = degree, y = CV_RMSE)) +
  geom_line(col = "grey55") +
  geom_point(size = 2, aes(col = factor(min_CV_RMSE))) +
  scale_x_continuous(breaks = seq(1, 10), minor_breaks = NULL) +
  scale_color_manual(values = c("deepskyblue3", "green")) +
  theme(legend.position = "none") +
  labs(title = "Wage Dataset - Polynomial Regression",
       subtitle = "Selecting the 'age' polynomial degree with cross-validation",
       x = "Degree",
       y = "CV RMSE")

I now use anova() to sequentially test the null hypothesis that a model \(\mathcal{M}_1\) is sufficient to explain the data against the alternative hypothesis that a more complex model \(\mathcal{M}_2\) is required.

Note that the predictors in \(\mathcal{M}_1\) must be a subset of the predictors in \(\mathcal{M}_2\) (they clearly are in this case, as the only modification for each fit is the addition of a higher-order term of age).

fit.1 <- lm(wage ~ age, data = Wage)
fit.2 <- lm(wage ~ poly(age, 2, raw = T), data = Wage)
fit.3 <- lm(wage ~ poly(age, 3, raw = T), data = Wage)
fit.4 <- lm(wage ~ poly(age, 4, raw = T), data = Wage)
fit.5 <- lm(wage ~ poly(age, 5, raw = T), data = Wage)

anova(fit.1, fit.2, fit.3, fit.4, fit.5)
## Analysis of Variance Table
## 
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2, raw = T)
## Model 3: wage ~ poly(age, 3, raw = T)
## Model 4: wage ~ poly(age, 4, raw = T)
## Model 5: wage ~ poly(age, 5, raw = T)
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1   2998 5022216                                    
## 2   2997 4793430  1    228786 143.5931 < 2.2e-16 ***
## 3   2996 4777674  1     15756   9.8888  0.001679 ** 
## 4   2995 4771604  1      6070   3.8098  0.051046 .  
## 5   2994 4770322  1      1283   0.8050  0.369682    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Using ANOVA as opposed to cross-validation could result in selecting either the third-order or fourth-order polynomial of age.

I plot the third-order polynomial fit, which certainly looks reasonable:

ggplot(Wage, aes(x = age, y = wage)) + 
  geom_point(alpha = 0.3) + 
  geom_smooth(method = "lm", formula = "y ~ poly(x, 3, raw = T)") + 
  labs(title = "Wage Dataset - Polynomial Regression",
       subtitle = "Predicting 'wage' with a cubic polynomial of 'age'")


(b) Step Function

Q: Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.

A:

I test cutting age into up to 20 intervals for the step function, using cross-validation to select the best model. The selected model had 12 intervals of age:

CV_RMSE <- c()

set.seed(124)

for (i in 2:20) {
  model_temp <- train(y = Wage$wage,
                      x = data.frame(cut(Wage$age, i)),
                      method = "lm",
                      metric = "RMSE",
                      trControl = ctrl)
  CV_RMSE[i-1] <- model_temp$results$RMSE
}


data.frame(cuts = 2:20, CV_RMSE = CV_RMSE) %>%
  mutate(min_CV_RMSE = as.numeric(min(CV_RMSE) == CV_RMSE)) %>%
  ggplot(aes(x = cuts, y = CV_RMSE)) +
  geom_line(col = "grey55") +
  geom_point(size = 2, aes(col = factor(min_CV_RMSE))) +
  scale_x_continuous(breaks = seq(2, 20), minor_breaks = NULL) +
  scale_color_manual(values = c("deepskyblue3", "green")) +
  theme(legend.position = "none") +
  labs(title = "Wage Dataset - Step Function",
       subtitle = "Selecting number of 'age' cut-points with cross-validation",
       x = "Intervals",
       y = "CV RMSE")

I visualize this below.

ggplot(Wage, aes(x = age, y = wage)) + 
  geom_point(alpha = 0.3) + 
  geom_smooth(method = "lm", formula = "y ~ cut(x, 12)") + 
  labs(title = "Wage Dataset - Step Function",
       subtitle = "Predicting 'wage' with a 12-interval step function of 'age'")

If we were using the 1 s.e. rule for cross-validation we would likely end up choosing the step function with 7 intervals, which is probably what I would select in practice (a rough sketch of this selection rule follows below).
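
I didn't store the fold-to-fold variability above, so here is a rough sketch of how the one-standard-error rule could be applied with caret (my own implementation, illustrative only): keep the RMSESD column from the train() results, and pick the smallest number of cuts whose CV RMSE is within one standard error of the minimum.

CV_RMSE <- c()
CV_RMSE_SE <- c()

set.seed(124)

for (i in 2:20) {
  model_temp <- train(y = Wage$wage,
                      x = data.frame(cut(Wage$age, i)),
                      method = "lm",
                      metric = "RMSE",
                      trControl = ctrl)
  CV_RMSE[i-1] <- model_temp$results$RMSE
  CV_RMSE_SE[i-1] <- model_temp$results$RMSESD / sqrt(10) # std. error over the 10 folds
}

one_se_threshold <- min(CV_RMSE) + CV_RMSE_SE[which.min(CV_RMSE)]
(2:20)[min(which(CV_RMSE <= one_se_threshold))] # fewest cuts within 1 SE of the minimum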




7. APPLIED: The Wage Dataset (Various Non-Linear Methods)

Q: The Wage data set contains a number of other features not explored in this chapter, such as marital status (maritl), job class (jobclass), and others. Explore the relationships between some of these other predictors and wage, and use non-linear fitting techniques in order to fit flexible models to the data. Create plots of the results obtained, and write a summary of your findings.

A:

glimpse(Wage)
## Rows: 3,000
## Columns: 11
## $ year       <int> 2006, 2004, 2003, 2003, 2005, 2008, 2009, 2008, 2006, 20...
## $ age        <int> 18, 24, 45, 43, 50, 54, 44, 30, 41, 52, 45, 34, 35, 39, ...
## $ maritl     <fct> 1. Never Married, 1. Never Married, 2. Married, 2. Marri...
## $ race       <fct> 1. White, 1. White, 1. White, 3. Asian, 1. White, 1. Whi...
## $ education  <fct> 1. < HS Grad, 4. College Grad, 3. Some College, 4. Colle...
## $ region     <fct> 2. Middle Atlantic, 2. Middle Atlantic, 2. Middle Atlant...
## $ jobclass   <fct> 1. Industrial, 2. Information, 1. Industrial, 2. Informa...
## $ health     <fct> 1. <=Good, 2. >=Very Good, 1. <=Good, 2. >=Very Good, 1....
## $ health_ins <fct> 2. No, 2. No, 1. Yes, 1. Yes, 1. Yes, 1. Yes, 1. Yes, 1....
## $ logwage    <dbl> 4.318063, 4.255273, 4.875061, 5.041393, 4.318063, 4.8450...
## $ wage       <dbl> 75.04315, 70.47602, 130.98218, 154.68529, 75.04315, 127....


year:

This is the only other numeric variable in the dataset, and can only take 7 unique values (2003 - 2009), so it doesn’t seem appropriate to use splines to describe this relationship:

table(Wage$year)
## 
## 2003 2004 2005 2006 2007 2008 2009 
##  513  485  447  392  386  388  389

Because of the discrete nature of year I fit a step function with the following intervals:

table(cut(Wage$year, breaks = 2002:2009))
## 
## (2002,2003] (2003,2004] (2004,2005] (2005,2006] (2006,2007] (2007,2008] 
##         513         485         447         392         386         388 
## (2008,2009] 
##         389

I plot the relationship between year and wage (with the step function) below. It looks pretty uninformative:

year_step <- lm(wage ~ cut(year, breaks = 2002:2009), Wage)
fitted <- data.frame(year = seq(2003, 2009, 0.1), 
                     wage = predict(year_step, data.frame(year = seq(2003, 2009, 0.1))))


ggplot(Wage, aes(x = year, y = wage)) + 
  geom_point(alpha = 0.5) + 
  geom_line(data = fitted, 
            aes(x = year, y = wage), size = 0.8, col = "green") + 
  scale_x_continuous(breaks = 2003:2009, minor_breaks = NULL)


maritl:

Note that the volume for 3. Widowed in particular is pretty low:

table(Wage$maritl)
## 
## 1. Never Married       2. Married       3. Widowed      4. Divorced 
##              648             2074               19              204 
##     5. Separated 
##               55

It appears that married workers (for those in this dataset - male workers in the Mid-Atlantic region) earn more.

ggplot(Wage, aes(x = maritl, y = wage, fill = maritl)) + 
  geom_boxplot() + 
  theme(legend.position = "none")

Due to the low volumes, it would make sense to combine the 3. Widowed category with another category (e.g. 4. Divorced) to create something like 6. Previously Married before modelling.


race:

Note the low volumes again for the 4. Other category:

table(Wage$race)
## 
## 1. White 2. Black 3. Asian 4. Other 
##     2480      293      190       37

In this dataset Asian men earn the most, followed by White men and then Black men. You could potentially combine 4. Other with 2. Black (the closest category w.r.t. the response) or 3. Asian (the second-smallest category) before modelling.

ggplot(Wage, aes(x = race, y = wage, fill = race)) + 
  geom_boxplot() + 
  theme(legend.position = "none")


education:

The categories here are all reasonably well-populated:

table(Wage$education)
## 
##       1. < HS Grad         2. HS Grad    3. Some College    4. College Grad 
##                268                971                650                685 
## 5. Advanced Degree 
##                426

This is an ordinal categorical variable, and we can see a clear positive relationship between education-level and wage.

ggplot(Wage, aes(x = education, y = wage, fill = education)) + 
  geom_boxplot() + 
  theme(legend.position = "none")

I’d anticipate this variable will be quite powerful (good-sized categories, large spread between them).


region:

Since the dataset is only a sample of Mid-Atlantic male workers, it’s unsurprising that this variable has the following split:

table(Wage$region)
## 
##        1. New England    2. Middle Atlantic 3. East North Central 
##                     0                  3000                     0 
## 4. West North Central     5. South Atlantic 6. East South Central 
##                     0                     0                     0 
## 7. West South Central           8. Mountain            9. Pacific 
##                     0                     0                     0

The variable has no variance and so obviously won’t be used in modelling.


jobclass:

This variable is much more balanced:

table(Wage$jobclass)
## 
##  1. Industrial 2. Information 
##           1544           1456

We see that those with Information as their type of job earn more than those with Industrial jobs.

ggplot(Wage, aes(x = jobclass, y = wage, fill = jobclass)) + 
  geom_boxplot() + 
  theme(legend.position = "none")


health:

This variable has just two levels, with most workers falling in the 2. >=Very Good category for health:

table(Wage$health)
## 
##      1. <=Good 2. >=Very Good 
##            858           2142

We see that those with better health tend to earn more.

ggplot(Wage, aes(x = health, y = wage, fill = health)) + 
  geom_boxplot() + 
  theme(legend.position = "none")


health_ins:

Most workers have health insurance:

table(Wage$health_ins)
## 
## 1. Yes  2. No 
##   2083    917

Workers with health insurance are, on average, higher paid:

ggplot(Wage, aes(x = health_ins, y = wage, fill = health_ins)) + 
  geom_boxplot() + 
  theme(legend.position = "none")


Flexible Modelling:

Using what I learned about the data in the exploration above, I do some combining of categories and fit a linear regression model with flexible basis functions for age (a cubic polynomial) and year (a step function).

The results are seen below.

Wage_cleaned <- Wage %>%
  select(wage, age, year, maritl, race, education, region, jobclass, health, health_ins) %>%
  mutate(maritl = case_when(maritl %in% c("3. Widowed", "4. Divorced") ~ "6. Previously Married", TRUE ~ as.character(maritl)) %>% factor(), 
         race = case_when(race %in% c("2. Black", "4. Other") ~ "5. Black & Other", TRUE ~ as.character(race)) %>% factor())


model <- lm(wage ~ poly(age, 3, raw = T) + cut(year, breaks = 2002:2009) + maritl + race + education + jobclass + health + health_ins, data = Wage_cleaned)
summary(model)
## 
## Call:
## lm(formula = wage ~ poly(age, 3, raw = T) + cut(year, breaks = 2002:2009) + 
##     maritl + race + education + jobclass + health + health_ins, 
##     data = Wage_cleaned)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -101.4  -18.6   -3.3   13.8  209.2 
## 
## Coefficients:
##                                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)                               3.187e+00  1.972e+01   0.162 0.871596
## poly(age, 3, raw = T)1                    3.592e+00  1.409e+00   2.550 0.010832
## poly(age, 3, raw = T)2                   -4.883e-02  3.190e-02  -1.530 0.126009
## poly(age, 3, raw = T)3                    1.634e-04  2.321e-04   0.704 0.481443
## cut(year, breaks = 2002:2009)(2003,2004]  2.603e+00  2.145e+00   1.214 0.224950
## cut(year, breaks = 2002:2009)(2004,2005]  3.122e+00  2.190e+00   1.426 0.154054
## cut(year, breaks = 2002:2009)(2005,2006]  7.629e+00  2.269e+00   3.362 0.000782
## cut(year, breaks = 2002:2009)(2006,2007]  5.201e+00  2.281e+00   2.280 0.022658
## cut(year, breaks = 2002:2009)(2007,2008]  6.408e+00  2.274e+00   2.818 0.004866
## cut(year, breaks = 2002:2009)(2008,2009]  8.393e+00  2.275e+00   3.689 0.000229
## maritl2. Married                          1.337e+01  1.811e+00   7.385 1.98e-13
## maritl5. Separated                        7.160e+00  4.857e+00   1.474 0.140521
## maritl6. Previously Married               1.629e-01  2.845e+00   0.057 0.954348
## race3. Asian                             -2.683e+00  2.587e+00  -1.037 0.299772
## race5. Black & Other                     -4.856e+00  2.022e+00  -2.402 0.016374
## education2. HS Grad                       7.564e+00  2.351e+00   3.218 0.001306
## education3. Some College                  1.799e+01  2.499e+00   7.200 7.59e-13
## education4. College Grad                  3.064e+01  2.531e+00  12.107  < 2e-16
## education5. Advanced Degree               5.315e+01  2.793e+00  19.031  < 2e-16
## jobclass2. Information                    3.455e+00  1.317e+00   2.623 0.008750
## health2. >=Very Good                      6.310e+00  1.412e+00   4.469 8.14e-06
## health_ins2. No                          -1.634e+01  1.404e+00 -11.635  < 2e-16
##                                             
## (Intercept)                                 
## poly(age, 3, raw = T)1                   *  
## poly(age, 3, raw = T)2                      
## poly(age, 3, raw = T)3                      
## cut(year, breaks = 2002:2009)(2003,2004]    
## cut(year, breaks = 2002:2009)(2004,2005]    
## cut(year, breaks = 2002:2009)(2005,2006] ***
## cut(year, breaks = 2002:2009)(2006,2007] *  
## cut(year, breaks = 2002:2009)(2007,2008] ** 
## cut(year, breaks = 2002:2009)(2008,2009] ***
## maritl2. Married                         ***
## maritl5. Separated                          
## maritl6. Previously Married                 
## race3. Asian                                
## race5. Black & Other                     *  
## education2. HS Grad                      ** 
## education3. Some College                 ***
## education4. College Grad                 ***
## education5. Advanced Degree              ***
## jobclass2. Information                   ** 
## health2. >=Very Good                     ***
## health_ins2. No                          ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33.75 on 2978 degrees of freedom
## Multiple R-squared:  0.3503, Adjusted R-squared:  0.3457 
## F-statistic: 76.44 on 21 and 2978 DF,  p-value: < 2.2e-16
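
The summary above only shows in-sample fit. As an extra check (not required by the question), we could also estimate this model's out-of-sample performance with caret; something along these lines:

set.seed(125) # arbitrary seed

model_flexible_cv <- train(wage ~ poly(age, 3, raw = T) + cut(year, breaks = 2002:2009) + maritl + 
                             race + education + jobclass + health + health_ins,
                           data = Wage_cleaned,
                           method = "lm",
                           metric = "Rsquared",
                           trControl = trainControl(method = "cv", number = 10))

model_flexible_cv$results$Rsquared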




8. APPLIED: The Auto Dataset (Various Non-Linear Methods)

Q: Fit some of the non-linear models investigated in this chapter to the Auto data set. Is there evidence for non-linear relationships in this data set? Create some informative plots to justify your answer.

A:

Looking through a pairs() plot of the relationships below, there are quite a few non-linear relationships in the dataset. This is particularly true for mpg, which we were tasked with predicting (and where we noted the non-linearities) in previous chapters.

glimpse(Auto)
## Rows: 392
## Columns: 9
## $ mpg          <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14...
## $ cylinders    <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, ...
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383,...
## $ horsepower   <dbl> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170,...
## $ weight       <dbl> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, ...
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8....
## $ year         <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70...
## $ origin       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, ...
## $ name         <fct> chevrolet chevelle malibu, buick skylark 320, plymouth...
pairs(Auto[ ,-9])

I am going to focus on relationships with mpg here, as I want to adapt this question slightly and confirm that using non-linear models provides better results in predicting mpg than we got in previous chapters through using linear methods.

I explore some of the relationships with mpg below, fitting smoothing lines to the numeric variables with geom_smooth() (which performs local regression as the default smoothing method).

I re-labelled origin and also re-created a brand variable that I engineered in my chapter 2 solutions, instead of discarding the name variable completely:

Auto$origin <- factor(Auto$origin, labels = c("American", "European", "Japanese"))

Auto$brand <- sapply(strsplit(as.character(Auto$name), split = " "),
                     function(x) x[1]) # extract the first item from each list element

Auto$brand <- factor(ifelse(Auto$brand %in% c("vokswagen", "vw"), "volkswagen", 
                     ifelse(Auto$brand == "toyouta", "toyota", 
                            ifelse(Auto$brand %in% c("chevroelt", "chevy"), "chevrolet", 
                                   ifelse(Auto$brand == "maxda", "mazda", 
                                          Auto$brand))))) # fixing typo's

Auto$brand <- fct_lump(Auto$brand, 
                       n = 9, 
                       other_level = "uncommon") # collapse into 10 categories

Auto$name <- NULL

glimpse(Auto)
## Rows: 392
## Columns: 9
## $ mpg          <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14...
## $ cylinders    <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, ...
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383,...
## $ horsepower   <dbl> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170,...
## $ weight       <dbl> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, ...
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8....
## $ year         <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70...
## $ origin       <fct> American, American, American, American, American, Amer...
## $ brand        <fct> chevrolet, buick, plymouth, amc, ford, ford, chevrolet...
g1 <- ggplot(Auto, aes(x = cylinders, y = mpg, group = cylinders)) +
  geom_boxplot()

g2 <- ggplot(Auto, aes(x = displacement, y = mpg)) +
  geom_point(alpha = 0.5) +
  geom_smooth()

g3 <- ggplot(Auto, aes(x = horsepower, y = mpg)) +
  geom_point(alpha = 0.5) +
  geom_smooth()

g4 <- ggplot(Auto, aes(x = weight, y = mpg)) +
  geom_point(alpha = 0.5) +
  geom_smooth()

g5 <- ggplot(Auto, aes(x = acceleration, y = mpg)) +
  geom_point(alpha = 0.5) +
  geom_smooth()

g6 <- ggplot(Auto, aes(x = year, y = mpg)) +
  geom_point(alpha = 0.5) +
  geom_smooth()

g7 <- ggplot(Auto, aes(x = origin, y = mpg, fill = origin)) + 
  geom_boxplot() + 
  theme(legend.position = "none")

g8 <- ggplot(Auto, aes(x = brand, y = mpg, fill = brand)) + 
  geom_boxplot() + 
  scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
  theme(legend.position = "none")

grid.arrange(g1, g2, g3, g4, g5, g6, g7, g8, ncol = 2)

The relationships are pretty strong, and I highly suspect that we will see increased performance with a non-linear methodology when predicting mpg. I will use the caret package again here and quantify the performance of linear models vs non-linear using cross-validation \(R^2\) as my performance metric.


Linear Model:

First I fit a linear model predicting mpg using these 8 predictors:

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

set.seed(1)

model_linear <- train(mpg ~ .,
                      data = Auto,
                      method = "lm",
                      metric = "Rsquared",
                      trControl = ctrl)

model_linear$results$Rsquared
## [1] 0.8239898


Smoothing Spline:

Here, now using method = "gamSpline", caret will instead fit a generalized additive model (with smooth terms for every numeric predictor that has at least 10 unique values). We can tune the effective degrees of freedom:

set.seed(1)

model_smoothspline_broad <- train(mpg ~ .,
                                  data = Auto,
                                  method = "gamSpline",
                                  metric = "Rsquared",
                                  trControl = ctrl, 
                                  tuneGrid = data.frame(df = 1:10))

data.frame(df = 1:10, CV_R2 = model_smoothspline_broad$results$Rsquared) %>%
  mutate(max_CV_R2 = as.numeric(max(CV_R2) == CV_R2)) %>%
  ggplot(aes(x = df, y = CV_R2)) +
  geom_line(col = "grey55") +
  geom_point(size = 2, aes(col = factor(max_CV_R2))) +
  scale_x_continuous(breaks = seq(1, 10), minor_breaks = NULL) +
  scale_color_manual(values = c("deepskyblue3", "green")) +
  theme(legend.position = "none") +
  labs(title = "Auto Dataset - GAM (Smoothing Splines)",
       subtitle = "Selecting the effective degrees of freedom using cross-validation",
       x = "df",
       y = "CV R-squared")

The cross-validation \(R^2\):

max(model_smoothspline_broad$results$Rsquared)
## [1] 0.8823149

In fact, we can tune the degrees of freedom more precisely and are not restricted to integers, since these are effective degrees of freedom (determined from the smoothing parameter \(\lambda\)):

model_smoothspline_fine <- train(mpg ~ .,
                            data = Auto,
                            method = "gamSpline",
                            metric = "Rsquared",
                            trControl = ctrl, 
                            tuneGrid = data.frame(df = seq(2.5, 4, 0.1)))

data.frame(df = seq(2.5, 4, 0.1), CV_R2 = model_smoothspline_fine$results$Rsquared) %>%
  mutate(max_CV_R2 = as.numeric(max(CV_R2) == CV_R2)) %>%
  ggplot(aes(x = df, y = CV_R2)) +
  geom_line(col = "grey55") +
  geom_point(size = 2, aes(col = factor(max_CV_R2))) +
  scale_x_continuous(breaks = seq(2.5, 4, 0.1), minor_breaks = NULL) +
  scale_color_manual(values = c("deepskyblue3", "green")) +
  theme(legend.position = "none") +
  labs(title = "Auto Dataset - GAM (Smoothing Splines)",
       subtitle = "Selecting the effective degrees of freedom using cross-validation",
       x = "df",
       y = "CV R-squared")

The cross-validation \(R^2\):

max(model_smoothspline_fine$results$Rsquared)
## [1] 0.8834424
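
The effective degrees of freedom that this finer grid actually selected can be extracted from the fitted object:

model_smoothspline_fine$bestTune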

So we have not only identified non-linear relationships, but we’ve used that knowledge to adapt our modelling approach in order to model that non-linearity. This resulted in our out-of-sample \(R^2\) increasing from 0.824 to 0.883!




9. APPLIED: The Boston Dataset (Polynomial Regression, Splines)

This question uses the variables dis (the weighted mean of distances to five Boston employment centers) and nox (nitrogen oxides concentration in parts per 10 million) from the Boston data. We will treat dis as the predictor and nox as the response.


(a) Cubic Polynomial

Q: Use the poly() function to fit a cubic polynomial regression to predict nox using dis. Report the regression output, and plot the resulting data and polynomial fits.

A:

I fit the model, providing the model summary():

model <- lm(nox ~ poly(dis, 3, raw = T), data = Boston)
summary(model)
## 
## Call:
## lm(formula = nox ~ poly(dis, 3, raw = T), data = Boston)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.121130 -0.040619 -0.009738  0.023385  0.194904 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             0.9341281  0.0207076  45.110  < 2e-16 ***
## poly(dis, 3, raw = T)1 -0.1820817  0.0146973 -12.389  < 2e-16 ***
## poly(dis, 3, raw = T)2  0.0219277  0.0029329   7.476 3.43e-13 ***
## poly(dis, 3, raw = T)3 -0.0008850  0.0001727  -5.124 4.27e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06207 on 502 degrees of freedom
## Multiple R-squared:  0.7148, Adjusted R-squared:  0.7131 
## F-statistic: 419.3 on 3 and 502 DF,  p-value: < 2.2e-16

To plot the fit in ggplot, we don’t actually need to create a line of predictions and overlay it onto the data. As I’ve done previously, I simply pass the formula into the geom_smooth() function:

ggplot(Boston, aes(x = dis, y = nox)) + 
  geom_point(alpha = 0.3) + 
  geom_smooth(method = "lm", formula = "y ~ poly(x, 3, raw = T)") + 
  labs(title = "Boston Dataset - Polynomial Regression",
       subtitle = "Predicting 'nox' with a cubic polynomial of 'dis'")

The fit looks good for the most part, but we can see the limitations in the tails of dis.


(b) Polynomial Regression - Training MSE

Q: Plot the polynomial fits for a range of different polynomial degrees (say, from 1 to 10), and report the associated residual sum of squares.

A:

Degrees 1 to 10 are plotted below. The cubic & quartic fits visually seem to be the most reasonable (they look virtually identical):

# Fit & plot polynomial regressions of 'nox' on 'dis' for degrees 1 to 10
poly_plots <- lapply(1:10, function(d) {
  ggplot(Boston, aes(x = dis, y = nox)) + 
    geom_point(alpha = 0.3) + 
    geom_smooth(method = "lm", formula = as.formula(paste0("y ~ poly(x, ", d, ", raw = T)"))) + 
    coord_cartesian(ylim = c(0.3, 0.9)) +
    labs(title = paste("Polynomial Degree =", d))
})

grid.arrange(grobs = poly_plots, ncol = 2)

Note: For these questions I’m going to report the MSE instead of the RSS because it allows for nicer comparisons between training & cross-validation results. RSS as a metric is dependent on the number of observations the residuals are being calculated over, so it will (unintuitively) be lower for cross-validation (for 10-fold CV, each test set will be 1/10th of the data, so will naturally have a RSS metric ~1/10th the size of the training RSS on the full data). This means we can look at the cross-validation results on the same ‘scale’ as the train results.

The loop below fits these models and assesses the training MSE of the fits. I plot the results:

train_MSE <- c()

for (i in 1:10) {
    model_temp <- lm(nox ~ poly(dis, i, raw = T), data = Boston)
    train_MSE[i] <- mean(model_temp$residuals^2)
}

data.frame(degree = 1:10, train_MSE) %>%
  mutate(min_train_MSE = as.numeric(min(train_MSE) == train_MSE)) %>%
  ggplot(aes(x = degree, y = train_MSE)) +
  geom_line(col = "grey55") +
  geom_point(size = 2, aes(col = factor(min_train_MSE))) +
  scale_x_continuous(breaks = 1:10, minor_breaks = NULL) +
  scale_y_continuous(breaks = seq(0, 0.2, 0.0002)) +
  scale_color_manual(values = c("deepskyblue3", "green")) +
  theme(legend.position = "none") +
  labs(title = "Boston Dataset - Polynomial Regression",
       subtitle = "Selecting the 'dis' polynomial degree with training MSE",
       x = "Degree",
       y = "Training MSE")

Unsurprisingly, the training MSE decreases monotonically as flexibility increases (the polynomial degree of dis increases), so the minimum training MSE is achieved at degree 10.


(c) Polynomial Regression - Tuning Degree (CV MSE)

Q: Perform cross-validation or another approach to select the optimal degree for the polynomial, and explain your results.

A:

I perform 5 repeats of 10-fold cross-validation to provide an out-of-sample error estimate for polynomial regression from degrees 1 to 10.

The selected model was degree 3, which makes sense from the visuals above; it appeared to be flexible enough to model the ‘true relationship’ without fitting to noise like the higher-order polynomials.

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5, summaryFunction = custom_regression_metrics)

CV_MSE <- c()

set.seed(159)

for (i in 1:10) {
  model_temp <- train(y = Boston$nox,
                      x = poly(Boston$dis, i, raw = T, simple = T),
                      method = "lm",
                      metric = "MSE",
                      trControl = ctrl)
  CV_MSE[i] <- model_temp$results$MSE
}

data.frame(degree = 1:10, CV_MSE = CV_MSE) %>%
  mutate(min_CV_MSE = as.numeric(min(CV_MSE) == CV_MSE)) %>%
  ggplot(aes(x = degree, y = CV_MSE)) +
  geom_line(col = "grey55") +
  geom_point(size = 2, aes(col = factor(min_CV_MSE))) +
  scale_x_continuous(breaks = seq(1, 10), minor_breaks = NULL) +
  scale_y_continuous(breaks = seq(0, 0.03, 0.002)) +
  scale_color_manual(values = c("deepskyblue3", "green")) +
  theme(legend.position = "none") +
  labs(title = "Boston Dataset - Polynomial Regression",
       subtitle = "Selecting the 'dis' polynomial degree with cross-validation MSE",
       x = "Degree",
       y = "Cross-Validation MSE")

min_poly_MSE_raw <- min(CV_MSE)

The large spike/instability in the MSE estimate for higher-order polynomials (e.g. 9) is because of the significant flexibility in the fits. When we partition the data into 10 segments for cross-validation there will be considerable variability in the fitted model for higher values of dis (look at the previous plots and imagine how much the fitted polynomial would vary if you excluded just a few observations).

The results change considerably if we exclude outliers, e.g. where dis >= 10, which goes to show how high-variance these higher order polynomials are at the tails:

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5, summaryFunction = custom_regression_metrics)

Boston1 <- Boston %>%
  filter(dis < 10)

CV_MSE <- c()

set.seed(159)

for (i in 1:10) {
  model_temp <- train(y = Boston1$nox,
                      x = poly(Boston1$dis, i, raw = T, simple = T),
                      method = "lm",
                      metric = "MSE",
                      trControl = ctrl)
  CV_MSE[i] <- model_temp$results$MSE
}

data.frame(degree = 1:10, CV_MSE = CV_MSE) %>%
  mutate(min_CV_MSE = as.numeric(min(CV_MSE) == CV_MSE)) %>%
  ggplot(aes(x = degree, y = CV_MSE)) +
  geom_line(col = "grey55") +
  geom_point(size = 2, aes(col = factor(min_CV_MSE))) +
  scale_x_continuous(breaks = seq(1, 10), minor_breaks = NULL) +
  scale_y_continuous(breaks = seq(0, 0.2, 0.0002)) +
  scale_color_manual(values = c("deepskyblue3", "green")) +
  theme(legend.position = "none") +
  labs(title = "Boston Dataset - Polynomial Regression (outliers removed)",
       subtitle = "Selecting the 'dis' polynomial degree with cross-validation MSE",
       x = "Degree",
       y = "Cross-Validation MSE")

min_poly_MSE_no_outliers <- min(CV_MSE)


(d) Different Splines & Knots

Q: Use the bs() function to fit a regression spline to predict nox using dis. Report the output for the fit using four degrees of freedom. How did you choose the knots? Plot the resulting fit.

A:

1. The direct and literal answer to this is that it is a bit of a trick question (if we are to stay consistent with the definitions given in the text), and that we need to fit a simple cubic polynomial to the data with \(0\) knots.

If you have a cubic spline with \(K = 1\) knot \(\xi\), the most direct representation is via:

\[y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4(x - \xi)_{+}^3\]

We have to estimate \(5\) coefficients, which results in \(5\) degrees of freedom. A cubic spline with \(K\) knots has \(K+4\) degrees of freedom… so to get a fit with 4 degrees of freedom, we need \(K = 0\) knots!

How did you choose the knots?

I didn’t - \(0\) knots were chosen because there were no degrees of freedom left to add any!
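
We can verify this directly with bs() (a quick check of my own; bs() comes from the splines package, already attached here as a dependency of gam):

dim(bs(Boston$dis, df = 3))           # 506 x 3: three basis functions (+ intercept = 4 coefficients)
attr(bs(Boston$dis, df = 3), "knots") # no interior knots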

Plot the resulting fit.

This is just a cubic polynomial.

ggplot(Boston, aes(x = dis, y = nox)) + 
  geom_vline(xintercept = attr(bs(Boston$dis, df = 3), "knots"), col = "grey55", linetype = "dashed") + # df = 3 gives no knots, so no dashed lines are drawn
  geom_point(alpha = 0.3) + 
  geom_smooth(method = "lm", formula = "y ~ poly(x, 3)", aes(col = "Orthogonal Cubic")) + 
  geom_smooth(method = "lm", formula = "y ~ poly(x, 3, raw = T)", aes(col = "Raw Cubic")) + 
  geom_smooth(method = "lm", formula = "y ~ bs(x, df = 3)", aes(col = "B-spline (no knots)")) + # note the identical fit to the cubics
  coord_cartesian(ylim = c(0.3, 0.9)) +
  theme(legend.position = "bottom") +
  labs(title = "Different Representations, Identical Regression Lines", col = "Fit:")


Extended - (1)

This is the end of the question, but I produce some explanations for my own & others’ understanding.

Note that I actually plotted three lines above. The predictors being passed into the model are given in the following dataframes:

An orthogonal polynomial: y ~ poly(x, 3)

head(poly(Boston$dis, 3))
##                1           2           3
## [1,] 0.006233255 -0.04136207  0.03495452
## [2,] 0.024768778 -0.04614946  0.01039356
## [3,] 0.024768778 -0.04614946  0.01039356
## [4,] 0.047911237 -0.03413142 -0.02986110
## [5,] 0.047911237 -0.03413142 -0.02986110
## [6,] 0.047911237 -0.03413142 -0.02986110

The raw polynomial (the most intuitive one we are used to): y ~ poly(x, 3, raw = T)

head(poly(Boston$dis, 3, raw = T))
##           1        2         3
## [1,] 4.0900 16.72810  68.41793
## [2,] 4.9671 24.67208 122.54870
## [3,] 4.9671 24.67208 122.54870
## [4,] 6.0622 36.75027 222.78748
## [5,] 6.0622 36.75027 222.78748
## [6,] 6.0622 36.75027 222.78748

The B-spline basis representation of a cubic spline with no knots: y ~ bs(x, df = 3)

head(bs(Boston$dis, df = 3))
##              1         2          3
## [1,] 0.4313152 0.1588833 0.01950924
## [2,] 0.4437231 0.2378394 0.04249466
## [3,] 0.4437231 0.2378394 0.04249466
## [4,] 0.4092114 0.3328457 0.09024369
## [5,] 0.4092114 0.3328457 0.09024369
## [6,] 0.4092114 0.3328457 0.09024369

All three of these predictor sets, when passed into a linear regression model, give identical predictions (which is why the prediction lines perfectly overlap):

# identical() only compares two objects, so chain two pairwise comparisons
identical(predict(lm(nox ~ poly(dis, 3), Boston)) %>% round(10), 
          predict(lm(nox ~ poly(dis, 3, raw = T), Boston)) %>% round(10)) &
  identical(predict(lm(nox ~ poly(dis, 3, raw = T), Boston)) %>% round(10), 
            predict(lm(nox ~ bs(dis, df = 3), Boston)) %>% round(10))
## [1] TRUE

There are even more representations, as page 273 describes. For example, if we had a cubic spline with one knot \(\xi\), we could produce the B-spline basis representation using bs(), which would be of the form:

\[y = \beta_0 + \beta_1 b_1(x) + \beta_2 b_2(x) + \beta_3 b_3(x) + \beta_4 b_4(x)\] for basis functions \(b_i\). We could also code the representation that makes the most intuitive sense to me, which would be a simple cubic polynomial with the addition of one truncated power basis function:

\[y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4(x - \xi)_{+}^3\]

Once again, the underlying variables and therefore parameter estimates would be different here, but the final predictions, \(R^2\), etc. would be the same!
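As a rough illustration of this (a sketch, with the single knot \(\xi\) placed arbitrarily at the median of dis), the truncated power basis and the B-spline basis give different design matrices and coefficients, but the same fitted values:

xi <- median(Boston$dis)   # hypothetical knot location, for illustration only

# truncated power basis representation of a cubic spline with one knot
tpb_fit <- lm(nox ~ dis + I(dis^2) + I(dis^3) + I(pmax(dis - xi, 0)^3), data = Boston)

# B-spline basis representation of the same cubic spline
bs_fit <- lm(nox ~ bs(dis, knots = xi), data = Boston)

# same fitted values (up to numerical error)
all.equal(as.numeric(fitted(tpb_fit)), as.numeric(fitted(bs_fit)))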


Extended - (2)

I wasn’t sure what we were meant to do on this question, so I turned to stats.stackexchange, where Glen_b suggests that there are five likely things they could have meant by “Report the output for the fit using four degrees of freedom”. I am doing this to learn, and the exercises don’t seem to explore his suggestions (e.g. fitting natural cubic splines), so I take his advice and work through bullet points 2. to 5.


2. The authors wanted 4 knots rather than 4 degrees of freedom

Page 274 of the book suggests that there are two main approaches for knot selection:

  1. Place more knots in places where we feel the function might vary most rapidly, and to place fewer knots where it seems more stable
  2. Place knots in a uniform fashion (determining the position by calculating quantiles of the predictor)

ISLR favours the second approach, and I feel like it is easier and scales better so I will do the same.

The knots are automatically placed at uniform quantiles:

spline2_knots <- attr(bs(Boston$dis, df = 7), "knots")
spline2_knots
##    20%    40%    60%    80% 
## 1.9512 2.6403 3.8750 5.6150

Visualizing the fit:

ggplot(Boston, aes(x = dis, y = nox)) + 
  geom_vline(aes(xintercept = spline2_knots[1], col = "Knots"), linetype = "dashed") +
  geom_vline(aes(xintercept = spline2_knots[2], col = "Knots"), linetype = "dashed") +
  geom_vline(aes(xintercept = spline2_knots[3], col = "Knots"), linetype = "dashed") +
  geom_vline(aes(xintercept = spline2_knots[4], col = "Knots"), linetype = "dashed") +
  scale_color_manual(values = "mediumseagreen") +
  geom_point(alpha = 0.3) + 
  geom_smooth(method = "lm", formula = "y ~ bs(x, df = 7)") +
  coord_cartesian(ylim = c(0.3, 0.9)) +
  theme(legend.position = "bottom") +
  labs(title = "Cubic Spline with 4 Knots", col = "")


3. The authors didn’t intend to count the intercept as a degree of freedom, leaving ‘room’ for one knot (this is possible - some texts like ISLR consider the degrees of freedom as the number of parameters estimated, some consider it the number of variables)

The knot:

spline3_knots <- attr(bs(Boston$dis, df = 4), "knots")
spline3_knots
##     50% 
## 3.20745

Visualizing the fit:

ggplot(Boston, aes(x = dis, y = nox)) + 
  geom_vline(aes(xintercept = spline3_knots[1], col = "Knots"), linetype = "dashed") +
  scale_color_manual(values = "mediumseagreen") +
  geom_point(alpha = 0.3) + 
  geom_smooth(method = "lm", formula = "y ~ bs(x, df = 4)") +
  coord_cartesian(ylim = c(0.3, 0.9)) +
  theme(legend.position = "bottom") +
  labs(title = "Cubic Spline with 1 Knot", col = "")


4. The authors wanted a natural cubic spline (ns()) instead of a regular cubic spline

We ‘save’ some degrees of freedom by using natural splines because of the linearity constraints at the boundaries. This means that we can still have some knots without surpassing 4 degrees of freedom!
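Concretely, for the same number of basis columns, the natural spline 'spends' its budget on two interior knots where the unconstrained cubic spline has none:

length(attr(bs(Boston$dis, df = 3), "knots"))   # cubic spline: 0 interior knots
length(attr(ns(Boston$dis, df = 3), "knots"))   # natural cubic spline: 2 interior knots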

The B-spline basis matrix for a natural cubic spline:

head(ns(Boston$dis, df = 3))
##               1         2          3
## [1,] 0.09767068 0.5995465 -0.3179909
## [2,] 0.28154410 0.5137105 -0.2343426
## [3,] 0.28154410 0.5137105 -0.2343426
## [4,] 0.40934723 0.4353054 -0.1115005
## [5,] 0.40934723 0.4353054 -0.1115005
## [6,] 0.40934723 0.4353054 -0.1115005

The knots:

spline4_knots <- attr(ns(Boston$dis, df = 3), "knots")
spline4_knots
## 33.33333% 66.66667% 
##  2.384033  4.325700

The boundary knots:

spline4_boundary_knots <- attr(ns(Boston$dis, df = 3), "Boundary.knots")
spline4_boundary_knots
## [1]  1.1296 12.1265

The resulting model degrees of freedom is still 4 (see the bottom of page 275 for why):

spline4_fit <- lm(nox ~ ns(dis, df = 3), data = Boston)
summary(spline4_fit)$df[1] # 4 degrees of freedom in total
## [1] 4

Visualizing:

ggplot(Boston, aes(x = dis, y = nox)) + 
  geom_vline(aes(xintercept = spline4_knots[1], col = "Knots"), linetype = "dashed") +
  geom_vline(aes(xintercept = spline4_knots[2], col = "Knots"), linetype = "dashed") +
  geom_vline(aes(xintercept = spline4_boundary_knots[1], col = "Boundary Knots"), linetype = "dashed") +
  geom_vline(aes(xintercept = spline4_boundary_knots[2], col = "Boundary Knots"), linetype = "dashed") +
  scale_color_manual(values = c("grey55", "mediumseagreen")) +
  geom_point(alpha = 0.3) + 
  geom_smooth(method = "lm", formula = "y ~ ns(x, df = 3)") +
  coord_cartesian(ylim = c(0.3, 0.9)) +
  theme(legend.position = "bottom") +
  labs(title = "Natural Cubic Spline with 2 Knots", col = "")


5. The authors wanted a regular spline using bs() with 4 degrees of freedom (which requires us to deviate from a regular cubic spline to a linear or quadratic spline to ‘make room’ for a knot)


A quadratic spline with 1 knot will still have 4 degrees of freedom.

The knots:

spline5a_knots <- attr(bs(Boston$dis, df = 3, degree = 2), "knots")
spline5a_knots
##     50% 
## 3.20745

The resulting model degrees of freedom:

spline5a_fit <- lm(nox ~ bs(dis, df = 3, degree = 2), data = Boston)
summary(spline5a_fit)$df[1] # 4 degrees of freedom in total
## [1] 4


A linear spline with 2 knots will also have 4 degrees of freedom.

The knots:

spline5b_knots <- attr(bs(Boston$dis, df = 3, degree = 1), "knots")
spline5b_knots
## 33.33333% 66.66667% 
##  2.384033  4.325700

The resulting model degrees of freedom:

spline5b_fit <- lm(nox ~ bs(dis, df = 3, degree = 1), data = Boston)
summary(spline5b_fit)$df[1] # 4 degrees of freedom in total
## [1] 4

While the book doesn’t emphasise them much, linear splines are actually very useful as a non-linear modelling approach that mitigates overfitting but maintains model interpretability. If you are interested in learning more, I suggest googling “multivariate adaptive regression splines” for a practical implementation.


Visualizing both of these possibilities:

g1 <- ggplot(Boston, aes(x = dis, y = nox)) + 
  geom_vline(aes(xintercept = spline5a_knots, col = "Knots"), linetype = "dashed") +
  scale_color_manual(values = "mediumseagreen") +
  geom_point(alpha = 0.3) + 
  geom_smooth(method = "lm", formula = "y ~ bs(x, df = 3, degree = 2)") +
  coord_cartesian(ylim = c(0.3, 0.9)) +
  labs(title = "Quadratic Spline with 1 Knot", col = "")

g2 <- ggplot(Boston, aes(x = dis, y = nox)) + 
  geom_vline(aes(xintercept = spline5b_knots[1], col = "Knots"), linetype = "dashed") +
  geom_vline(aes(xintercept = spline5b_knots[2], col = "Knots"), linetype = "dashed") +
  scale_color_manual(values = "mediumseagreen") +
  geom_point(alpha = 0.3) + 
  geom_smooth(method = "lm", formula = "y ~ bs(x, df = 3, degree = 1)") +
  coord_cartesian(ylim = c(0.3, 0.9)) +
  labs(title = "Linear Spline with 2 Knots", col = "")

grid.arrange(g1, g2, nrow = 2)

Out of all the fits I’ve looked at, a natural cubic spline using ns() seems to provide the best fit while keeping the degrees of freedom low, which aligns with the book’s implication that they are generally better.


(e) Cubic Splines - Training MSE

Q: Now fit a regression spline for a range of degrees of freedom, and plot the resulting fits and report the resulting RSS. Describe the results obtained.

A:

Again, I will report MSE instead.

I iterate through 10 regular cubic spline fits, varying the model degrees of freedom from 4 (no knots) to 13 (9 knots). The loop in the code below passes df = 3 to 12 to bs(), since its df argument counts only the basis columns and excludes the intercept (for the reasons outlined in the previous question).

MSE <- c()

i <- 1
for (p in 3:12) {
  model_temp <- lm(nox ~ bs(dis, df = p), data = Boston)

  MSE[i] <- mean(model_temp$residuals^2)
  i <- i+1
}

data.frame(df = 4:13, MSE = MSE) %>%
  mutate(min_MSE = as.numeric(min(MSE) == MSE)) %>%
  ggplot(aes(x = df, y = MSE)) +
  geom_line(col = "grey55") +
  geom_point(size = 2, aes(col = factor(min_MSE))) +
  scale_x_continuous(breaks = seq(4, 13), minor_breaks = NULL) +
  scale_y_continuous(breaks = seq(0, 0.2, 0.00004)) +
  scale_color_manual(values = c("deepskyblue3", "green")) +
  theme(legend.position = "none") +
  labs(title = "Boston Dataset - Cubic Splines",
       subtitle = "Selecting the 'dis' degrees of freedom with training MSE",
       x = "Degrees of Freedom", 
       y = "Training MSE")

Something happens here that initially surprised me - while the spline with 13 degrees of freedom (13 - 4 = 9 knots) had the lowest training MSE, the trend in training MSE is not monotonically decreasing.

The reason is that, while adding predictors to a linear model can never increase the training RSS, that is not quite what is happening here. We are increasing the flexibility of the model, but every time an additional degree of freedom is allowed, the knot locations also change, so consecutive fits are not nested models and no rule is violated.

For example, here is the knot position for the fit with 5 degrees of freedom:

attr(bs(Boston$dis, df = 4), "knots")
##     50% 
## 3.20745

The same for 6 degrees of freedom:

attr(bs(Boston$dis, df = 5), "knots")
## 33.33333% 66.66667% 
##  2.384033  4.325700
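To see that the non-monotonicity really is down to the moving knots (a sketch, with a hypothetical second knot added at the 25th percentile): if we keep the original knot and only add to it, the bases are nested and the training MSE cannot increase:

knots_1 <- attr(bs(Boston$dis, df = 4), "knots")          # the single (median) knot
knots_2 <- sort(c(knots_1, quantile(Boston$dis, 0.25)))   # keep it and add a second knot

mse_1 <- mean(lm(nox ~ bs(dis, knots = knots_1), data = Boston)$residuals^2)
mse_2 <- mean(lm(nox ~ bs(dis, knots = knots_2), data = Boston)$residuals^2)
mse_2 <= mse_1   # TRUE: the larger, nested basis can never fit the training data worse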

Plotting the fits:

g1 <- ggplot(Boston, aes(x = dis, y = nox)) + 
  geom_point(alpha = 0.3) + 
  geom_smooth(method = "lm", formula = "y ~ bs(x, df = 3)") + 
  coord_cartesian(ylim = c(0.3, 0.9)) +
  labs(title = "df = 4 (0 knots)")

g2 <- ggplot(Boston, aes(x = dis, y = nox)) + 
  geom_point(alpha = 0.3) + 
  geom_smooth(method = "lm", formula = "y ~ bs(x, df = 4)") + 
  coord_cartesian(ylim = c(0.3, 0.9)) +
  labs(title = "df = 5 (1 knot)")

g3 <- ggplot(Boston, aes(x = dis, y = nox)) + 
  geom_point(alpha = 0.3) + 
  geom_smooth(method = "lm", formula = "y ~ bs(x, df = 5)") + 
  coord_cartesian(ylim = c(0.3, 0.9)) +
  labs(title = "df = 6 (2 knots)")

g4 <- ggplot(Boston, aes(x = dis, y = nox)) + 
  geom_point(alpha = 0.3) + 
  geom_smooth(method = "lm", formula = "y ~ bs(x, df = 6)") + 
  coord_cartesian(ylim = c(0.3, 0.9)) +
  labs(title = "df = 7 (3 knots)")

g5 <- ggplot(Boston, aes(x = dis, y = nox)) + 
  geom_point(alpha = 0.3) + 
  geom_smooth(method = "lm", formula = "y ~ bs(x, df = 7)") + 
  coord_cartesian(ylim = c(0.3, 0.9)) +
  labs(title = "df = 8 (4 knots)")

g6 <- ggplot(Boston, aes(x = dis, y = nox)) + 
  geom_point(alpha = 0.3) + 
  geom_smooth(method = "lm", formula = "y ~ bs(x, df = 8)") + 
  coord_cartesian(ylim = c(0.3, 0.9)) +
  labs(title = "df = 9 (5 knots)")

g7 <- ggplot(Boston, aes(x = dis, y = nox)) + 
  geom_point(alpha = 0.3) + 
  geom_smooth(method = "lm", formula = "y ~ bs(x, df = 9)") + 
  coord_cartesian(ylim = c(0.3, 0.9)) +
  labs(title = "df = 10 (6 knots)")

g8 <- ggplot(Boston, aes(x = dis, y = nox)) + 
  geom_point(alpha = 0.3) + 
  geom_smooth(method = "lm", formula = "y ~ bs(x, df = 10)") + 
  coord_cartesian(ylim = c(0.3, 0.9)) +
  labs(title = "df = 11 (7 knots)")

g9 <- ggplot(Boston, aes(x = dis, y = nox)) + 
  geom_point(alpha = 0.3) + 
  geom_smooth(method = "lm", formula = "y ~ bs(x, df = 11)") + 
  coord_cartesian(ylim = c(0.3, 0.9)) +
  labs(title = "df = 12 (8 knots)")

g10 <- ggplot(Boston, aes(x = dis, y = nox)) + 
  geom_point(alpha = 0.3) + 
  geom_smooth(method = "lm", formula = "y ~ bs(x, df = 12)") + 
  coord_cartesian(ylim = c(0.3, 0.9)) +
  labs(title = "df = 13 (9 knots)")

grid.arrange(g1, g2, g3, g4, g5, g6, g7, g8, g9, g10, ncol = 2)

The lines get wigglier as we increase the number of knots, as we would expect. While it appears that the fits with the most knots are probably overfitting, it doesn’t look like this is happening to a huge extent - it is certainly not overfitting in the way that a polynomial regression does as the polynomial degree is increased, and even the wiggliest fit still looks preferable to a simple (underfitting) linear fit.


(f) Cubic Splines - Tuning Degrees of Freedom (CV MSE)

Q: Perform cross-validation or another approach in order to select the best degrees of freedom for a regression spline on this data. Describe your results.

A:

I iterate through the same 10 regular cubic spline fits, varying the degrees of freedom from 4 (no knots) to 13 (9 knots), and use 5 repeats of 10-fold cross-validation to select the best fit.

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5, summaryFunction = custom_regression_metrics)

CV_MSE <- c()

set.seed(3)

i <- 1
for (p in 3:12) {
  model_temp <- train(y = Boston$nox,
                      x = bs(Boston$dis, df = p),
                      method = "lm",
                      metric = "MSE",
                      trControl = ctrl)
  CV_MSE[i] <- model_temp$results$MSE
  i <- i + 1
}


data.frame(df = 4:13, CV_MSE = CV_MSE) %>%
  mutate(min_CV_MSE = as.numeric(min(CV_MSE) == CV_MSE)) %>%
  ggplot(aes(x = df, y = CV_MSE)) +
  geom_line(col = "grey55") +
  geom_point(size = 2, aes(col = factor(min_CV_MSE))) +
  geom_hline(yintercept = min_poly_MSE_raw) + 
  geom_hline(yintercept = min_poly_MSE_no_outliers) + 
  scale_x_continuous(breaks = seq(4, 13), minor_breaks = NULL) +
  scale_y_continuous(breaks = seq(0, 0.2, 0.00002)) +
  scale_color_manual(values = c("deepskyblue3", "green")) +
  theme(legend.position = "none") +
  labs(title = "Boston Dataset - Cubic Splines",
       subtitle = "Selecting the 'dis' degrees of freedom with cross-validation MSE",
       x = "Degrees of Freedom", 
       y = "Cross-Validation MSE")

The fit selected by repeated cross-validation was the cubic spline with 6 degrees of freedom (so only 2 knots), with a CV MSE of 0.00369.

For comparison, the best-performing polynomial regression with outliers included (fairest comparison) had a CV MSE of 0.00387. Even when we excluded outliers for polynomial regression, the selected model still had a higher CV MSE of 0.00374. I mark both these polynomial errors on the graph using black horizontal lines so we can see the improvement.

The cross-validation results seem to confirm what the visual inspection told us; while it appears that the splines with higher degrees of freedom did begin to overfit, the overfitting is not nearly as severe as with higher-order polynomial fits.




10. APPLIED: The College Dataset (Generalized Additive Models)

This question relates to the College data set.


(a) Forward Stepwise Selection (applying the ‘one standard error rule’)

Q: Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.

A:

I randomly sample 70% of the observations for train:

set.seed(1)
train_index <- sample(1:nrow(College), round(nrow(College) * 0.7))

train <- College[train_index, ]
nrow(train) / nrow(College)
## [1] 0.7001287

The remaining observations are allocated to the test dataset:

test <- College[-train_index, ]
nrow(test) / nrow(College)
## [1] 0.2998713


In chapter 6 I used forward stepwise selection with the leaps::regsubsets() function, using penalized metrics (Mallows \(C_p\), \(BIC\) and Adjusted \(R^2\)) to select the number of predictors in the final model.

Here I (once again) turn to caret::train(), as one of the 238 methods added to it is method = 'leapForward', which allows us to easily do this selection with cross-validation without having to write a wrapper of our own (as I did previously with best subset selection).

I also utilize selectionFunction = "oneSE" which applies the “one standard error rule” for cross-validation; it identifies the model that minimizes the cross-validation error, then selects the simplest model within 1 standard error of it. Repeated cross-validation will result in smaller error bars, so here I just do 10-fold cross-validation once with the goal of selecting a satisfactory model that is still simple.

ctrl <- trainControl(method = "repeatedcv", 
                     number = 10, 
                     repeats = 1, 
                     summaryFunction = custom_regression_metrics, 
                     selectionFunction = "oneSE")

set.seed(11)

model_forward <- train(Outstate ~ .,
                       data = train,
                       method = "leapForward",
                       metric = "MSE",
                       maximize = F,
                       trControl = ctrl,
                       tuneGrid = data.frame(nvmax = 1:17))

model_forward
## Linear Regression with Forward Selection 
## 
## 544 samples
##  17 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 489, 490, 492, 491, 489, 488, ... 
## Resampling results across tuning parameters:
## 
##   nvmax  RMSE      Rsquared   MAE       MSE      RSS      
##    1     3116.676  0.5047899  2480.277  9924143  537869955
##    2     2670.511  0.6141343  2018.854  7253757  393129955
##    3     2394.728  0.6806436  1847.491  5787625  314365711
##    4     2223.172  0.7179308  1740.158  4974734  270218937
##    5     2160.560  0.7329846  1701.678  4701270  255744224
##    6     2117.879  0.7416528  1665.929  4519142  245832663
##    7     2123.836  0.7408057  1670.549  4548786  247459481
##    8     2119.735  0.7406173  1674.291  4533400  246643018
##    9     2155.722  0.7293720  1691.149  4710988  256372597
##   10     2157.944  0.7291680  1680.271  4726621  257176853
##   11     2113.974  0.7416482  1663.049  4513600  245493174
##   12     2096.665  0.7459114  1653.184  4443008  241535828
##   13     2086.080  0.7482922  1647.345  4399171  239126796
##   14     2082.560  0.7487630  1644.411  4383112  238253554
##   15     2090.142  0.7471492  1651.776  4414671  239941188
##   16     2088.295  0.7476358  1651.515  4404923  239413821
##   17     2088.644  0.7475671  1651.796  4406352  239491261
## 
## MSE was used to select the optimal model using  the one SE rule.
## The final value used for the model was nvmax = 6.

caret selected the 6-variable model. It takes a bit of manual work, but we can produce the following plot to clarify what is happening:

model_forward_SE <- model_forward$results %>%
  mutate(MSE_SE_low = MSE - (MSESD / sqrt(model_forward$control$number * model_forward$control$repeats)),
         MSE_SE_high = MSE + (MSESD / sqrt(model_forward$control$number * model_forward$control$repeats)), 
         min_CV_MSE = as.numeric(min(MSE) == MSE))

ggplot(model_forward_SE, aes(x = nvmax, y = MSE)) +
  geom_line(col = "grey55") +
  geom_hline(aes(yintercept = model_forward_SE$MSE_SE_high[model_forward_SE$min_CV_MSE == 1], col = "1 s.e. model")) +
  geom_vline(aes(xintercept = model_forward$bestTune$nvmax, col = "mediumseagreen")) +
  geom_errorbar(aes(ymin = MSE_SE_low, ymax = MSE_SE_high), 
                colour = "grey55",
                width = 0.4) +
  geom_point(size = 2, aes(col = factor(min_CV_MSE))) +
  scale_color_manual(values = c("deepskyblue3", "green", "mediumseagreen", "mediumseagreen")) +
  scale_x_continuous(breaks = 1:17, minor_breaks = NULL) +
  theme(legend.position = "none") +
  labs(title = "College Dataset - Linear Model",
       subtitle = "Selecting number of predictors with forward selection (cross-validation MSE, 1 s.e. rule)",
       x = "Number of Predictors", 
       y = "Cross-Validation MSE")

The model with the minimum cross-validation error had 14 predictors, but these cross-validation error estimates are the mean of multiple error estimates (in this case 10, as I am doing one pass of 10-fold CV). The ‘one standard error rule for cross-validation’ will then create standard error bars for these estimates (we need only create this for the 14-variable model, but I show them for all models here).

The next step is to find the ‘simplest’ model that is within one standard error of the lowest CV error model. Following the green horizontal line, we can see that the simplest model with a cross-validation error still within this region is the one with 6 predictors.
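The same selection can be reproduced by calling caret’s selection function directly on the stored resampling results (a minimal sketch; I am assuming the exported signature oneSE(x, metric, num, maximize), with num being the number of resamples behind each mean, here 10 folds x 1 repeat):

chosen_row <- caret::oneSE(model_forward$results, metric = "MSE", num = 10, maximize = FALSE)
model_forward$results$nvmax[chosen_row]   # should match the nvmax = 6 chosen above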

This selection approach can in theory be applied to any model where cross-validation is being used (not just linear models), but I find it most useful for creating parsimonious linear models. The challenge with other model types (such as a GBM) is how to rank-order the many combinations of hyperparameter values from simplest to most complex, in order to pick the simplest model within this tolerance. With linear models it is far more straightforward: a model with fewer variables, or with greater shrinkage (for the lasso/ridge), is clearly simpler.




(b) GAM Plot

Q: Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.

A:

I extract the coefficients so we can see the variables selected for the chosen model in the previous step:

coef(model_forward$finalModel, id = 6)
##   (Intercept)    PrivateYes    Room.Board           PhD   perc.alumni 
## -3764.3413062  2793.2069104     0.9703210    38.2157650    59.0358377 
##        Expend     Grad.Rate 
##     0.2031532    28.6548780

I fit a GAM with the predictors Private, Room.Board, PhD, perc.alumni, Expend and Grad.Rate. I fit smoothing spline terms for all predictors except Private (it is qualitative).

I plot the fitted smooths below:

model_gam_1 <- gam(Outstate ~ Private + s(Room.Board) + s(PhD) + s(perc.alumni) + s(Expend) + s(Grad.Rate), data = train)

par(mfrow = c(2, 3))
plot(model_gam_1, se = T, col = "blue")

Interpreting these plots using the logic outlined on page 284, it seems that there is obvious evidence of a nonlinear effect of Expend, whereas a linear relationship seems more reasonable for perc.alumni. As for the other terms, it seems there is moderate evidence of non-linearity, but it is harder for me to say.




(c) GAM - Test Performance

Q: Evaluate the model obtained on the test set, and explain the results obtained.

A:

The GAM test MSE and \(R^2\):

mean((predict(model_gam_1, newdata = test) - test$Outstate)^2)
## [1] 3170227
test_TSS <- sum((test$Outstate - mean(test$Outstate))^2)
test_RSS <- sum((predict(model_gam_1, newdata = test) - test$Outstate)^2)

1 - test_RSS/test_TSS
## [1] 0.7712156

Comparing to the linear model test MSE & \(R^2\) (with the same predictors) chosen by caret:

mean((predict(model_forward, newdata = test) - test$Outstate)^2)
## [1] 3616631
test_RSS <- sum((predict(model_forward, newdata = test) - test$Outstate)^2)

1 - test_RSS/test_TSS
## [1] 0.7390001

The GAM model clearly has better test performance, so it would seem that the non-linear relationships it found (displayed in the previous plot) were not spurious and generalized well to unseen data.




(d) GAM - Detecting Non-Linear Relationships

Q: For which variables, if any, is there evidence of a non-linear relationship with the response?

A:

I produce a summary of the fitted GAM:

summary(model_gam_1)
## 
## Call: gam(formula = Outstate ~ Private + s(Room.Board) + s(PhD) + s(perc.alumni) + 
##     s(Expend) + s(Grad.Rate), data = train)
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max 
## -7511.91 -1172.59    38.71  1267.44  7827.85 
## 
## (Dispersion Parameter for gaussian family taken to be 3617918)
## 
##     Null Deviance: 9263945675 on 543 degrees of freedom
## Residual Deviance: 1888551843 on 521.9997 degrees of freedom
## AIC: 9782.515 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                 Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Private          1 2402224634 2402224634 663.980 < 2.2e-16 ***
## s(Room.Board)    1 1721929895 1721929895 475.945 < 2.2e-16 ***
## s(PhD)           1  620590394  620590394 171.532 < 2.2e-16 ***
## s(perc.alumni)   1  456743095  456743095 126.245 < 2.2e-16 ***
## s(Expend)        1  746743434  746743434 206.401 < 2.2e-16 ***
## s(Grad.Rate)     1   99786036   99786036  27.581 2.197e-07 ***
## Residuals      522 1888551843    3617918                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                Npar Df Npar F   Pr(F)    
## (Intercept)                              
## Private                                  
## s(Room.Board)        3  2.680 0.04629 *  
## s(PhD)               3  1.179 0.31718    
## s(perc.alumni)       3  0.937 0.42233    
## s(Expend)            3 32.503 < 2e-16 ***
## s(Grad.Rate)         3  1.995 0.11380    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Looking at the p-values in the bottom table (‘Anova for Nonparametric Effects’), these tests assess whether a smooth (non-linear) effect is present, with significance (\(p < 0.05\)) indicating evidence of a non-linear effect.

The smallest p-value is for the smooth effect of Expend, with Room.Board also being significant. The remaining terms are not significant. Private isn’t even included in the second table (it is a factor - a parametric term).

For those variables that don’t have evidence of a non-linear effect (PhD, perc.alumni, Grad.Rate), there is evidence of a linear effect (‘Anova for Parametric Effects’).
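As a cross-check (a sketch in the style of the chapter’s lab; gam_no_expend and gam_lin_expend are hypothetical model names), we could also compare nested GAMs that omit Expend, include it linearly, and include it as a smooth, using ANOVA F-tests:

gam_no_expend  <- gam(Outstate ~ Private + s(Room.Board) + s(PhD) + s(perc.alumni) + s(Grad.Rate), data = train)
gam_lin_expend <- gam(Outstate ~ Private + s(Room.Board) + s(PhD) + s(perc.alumni) + Expend + s(Grad.Rate), data = train)

# no Expend vs linear Expend vs smooth Expend (model_gam_1)
anova(gam_no_expend, gam_lin_expend, model_gam_1, test = "F")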

I try fitting a new model that accounts for these findings and evaluate the test MSE & \(R^2\) again:

model_gam_2 <- gam(Outstate ~ Private + s(Room.Board) + PhD + perc.alumni + s(Expend) + Grad.Rate, data = train)

mean((predict(model_gam_2, newdata = test) - test$Outstate)^2)
## [1] 3226447
test_RSS <- sum((predict(model_gam_2, newdata = test) - test$Outstate)^2)
1 - test_RSS/test_TSS
## [1] 0.7671584

We actually end up with slightly lower test performance, but the model has been simplified somewhat.

Reassuringly, if we change it so Expend is simply parametric (where a non-linear effect was much clearer from the plot and summary), the performance basically drops to that of a linear model:

model_gam_3 <- gam(Outstate ~ Private + s(Room.Board) + PhD + perc.alumni + Expend + Grad.Rate, data = train)

mean((predict(model_gam_3, newdata = test) - test$Outstate)^2)
## [1] 3610592
test_RSS <- sum((predict(model_gam_3, newdata = test) - test$Outstate)^2)
1 - test_RSS/test_TSS
## [1] 0.739436




11. APPLIED: Generated Data (GAM Backfitting, \(p = 2\))

In Section 7.7, it was mentioned that GAMs are generally fit using a backfitting approach. The idea behind backfitting is actually quite simple. We will now explore backfitting in the context of multiple linear regression.

Suppose that we would like to perform multiple linear regression, but we do not have software to do so. Instead, we only have software to perform simple linear regression. Therefore, we take the following iterative approach: we repeatedly hold all but one coefficient estimate fixed at its current value, and update only that coefficient estimate using a simple linear regression. The process is continued until convergence - that is, until the coefficient estimates stop changing.

We now try this out on a toy example.


(a) Generate Data

Q: Generate a response \(Y\) and two predictors \(X_1\) and \(X_2\), with \(n = 100\).

A:

I generate \(Y\) using the following model:

\[Y = 16 + 5.1 X_1 - 7.3 X_2 + \epsilon \\ X_1 \sim \mathcal{N}(0, 1) \\ X_2 \sim \mathcal{N}(0, 1) \\ \epsilon \sim \mathcal{N}(0, 1)\]

set.seed(591)
x1 <- rnorm(100)
x2 <- rnorm(100)
eps <- rnorm(100)

y <- 16 + 5.1*x1 - 7.3*x2 + eps

Here are the relationships visualized across two dimensions:

g1 <- data.frame(x1, x2, y) %>%
  ggplot(aes(x1, y)) + 
  geom_point()

g2 <- data.frame(x1, x2, y) %>%
  ggplot(aes(x2, y)) + 
  geom_point()

grid.arrange(g1, g2, ncol = 2)


(b) Initialize \(\hat{\beta_1}\)

Q: Initialize \(\hat{\beta_1}\) to take on a value of your choice. It does not matter what value you choose.

A:

I will initialize \(\hat{\beta_1}\) at \(1\):

beta_1 <- 1
beta_1
## [1] 1


(c) Simple Linear Regression - Estimating \(\beta_2\)

Q: Keeping \(\hat{\beta_1}\) fixed, fit the model

\[Y - \hat{\beta_1} X_1 = \beta_0 + \beta_2 X_2 + \epsilon\]

You can do this as follows:

r2 = y - beta_1 * x1
beta_2 = lm(r2 ~ x2)$coef[2]

A:

I change notation for the code a bit here.

Recall a couple of things from section 7.7 as it is useful to keep perspective on where this is used:

Fitting a GAM with a smoothing spline is not quite as simple as fitting a GAM with a natural spline, since in the case of smoothing splines, least squares cannot be used. However, standard software such as the gam() function in R can be used to fit GAMs using smoothing splines, via an approach known as backfitting. This method fits a model involving multiple predictors by repeatedly updating the fit for each predictor in turn, holding the others fixed. The beauty of this approach is that each time we update a function, we simply apply the fitting method for that variable to a partial residual.

Therefore, we first calculate the partial residual for x2 (‘r2’), which is y with the contribution of every other explanatory variable (in this case, just x1) subtracted: \(r_2 = Y - \hat{\beta_1} X_1\).

r2 <- y - beta_1 * x1

I then fit a simple linear regression predicting this partial residual using x2. Extracting the coefficient gives the initial value for \(\hat{\beta_2}\):

beta_2 <- as.numeric(lm(r2 ~ x2)$coef[2])
beta_2
## [1] -7.396905


(d) Simple Linear Regression - Estimating \(\beta_1\)

Q: Keeping \(\hat{\beta_2}\) fixed, fit the model

\[Y - \hat{\beta_2} X_2 = \beta_0 + \beta_1 X_1 + \epsilon\]

You can do this as follows:

r1 = y - beta_2 * x2
beta_1 = lm(r1 ~ x1)$coef[2]

A:

As before, but now for the partial residual for x1 (‘r1’). Note that we are now making use of the estimate \(\hat{\beta_2}\) to replace the initial value of \(\hat{\beta_1}\):

r1 <- y - beta_2 * x2

beta_1 <- as.numeric(lm(r1 ~ x1)$coef[2])
beta_1
## [1] 5.229975

Since we have fit the model \(r_1 = \beta_0 + \beta_1 X_1 + \epsilon\), we can also get our estimate for \(\beta_0\) by extracting the intercept coefficient:

beta_0 <- as.numeric(lm(r1 ~ x1)$coef[1])
beta_0
## [1] 16.13128

We are already amazingly close to the true parameter values.
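A quick side-by-side of the generating values and these one-pass estimates (using the objects computed above; the ‘true’ column is just the coefficients used in the simulation):

data.frame(true = c(16, 5.1, -7.3),
           backfit = c(beta_0, beta_1, beta_2),
           row.names = c("beta_0", "beta_1", "beta_2"))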


(e) Looping (1,000 Iterations)

Q: Write a for loop to repeat (c) and (d) 1,000 times. Report the estimates of \(\hat{\beta_0}\), \(\hat{\beta_1}\), and \(\hat{\beta_2}\) at each iteration of the for loop. Create a plot in which each of these values is displayed, with \(\hat{\beta_0}\), \(\hat{\beta_1}\), and \(\hat{\beta_2}\) each shown in a different color.

A:

I initialize \(\hat{\beta_1}\) again at \(1\). I do not consider this initialization as part of the first iteration:

beta_1_init <- 1
beta_1_init
## [1] 1

I perform the 1,000 iterations in the code below.

beta_0 <- c()
beta_1 <- c()
beta_2 <- c()

# beta_2:
r2 <- y - beta_1_init * x1
beta_2[1] <- lm(r2 ~ x2)$coef[2]

# beta_1:
r1 <- y - beta_2[1] * x2
lm_coef <- lm(r1 ~ x1)$coef
beta_1[1] <- lm_coef[2]

# beta_0:
beta_0[1] <- lm_coef[1]


for (i in 2:1000) {
  # update beta_2, holding beta_1 at its estimate from the previous iteration
  r2 <- y - beta_1[i-1] * x1
  beta_2[i] <- lm(r2 ~ x2)$coef[2]
  
  # update beta_1 (and beta_0), holding beta_2 at its estimate from the previous iteration
  r1 <- y - beta_2[i-1] * x2
  lm_coef <- lm(r1 ~ x1)$coef
  beta_1[i] <- lm_coef[2]
  
  beta_0[i] <- lm_coef[1]
}

The question says that this process continues until convergence, where the estimated coefficients no longer change. It appears from the graph that this happens very quickly:

data.frame(iteration = 1:1000, beta_0, beta_1, beta_2) %>%
  pivot_longer(-iteration, names_to = "coefficient", values_to = "estimate") %>%
  ggplot(aes(x = iteration, y = estimate, col = factor(coefficient))) + 
  geom_line() + 
  scale_x_continuous(breaks = seq(0, 1000, 100), labels = scales::comma_format()) +
  facet_wrap(coefficient ~ ., scales = "free_y", nrow = 3) + 
  theme(legend.position = "none") + 
  labs(title = "Backfitting Convergence", 
       x = "Iteration", 
       y = "Parameter Estimate")


(f) Comparison - Backfitting vs Multiple Regression

Q: Compare your answer in (e) to the results of simply performing multiple linear regression to predict \(Y\) using \(X_1\) and \(X_2\). Use the abline() function to overlay those multiple linear regression coefficient estimates on the plot obtained in (e).

A:

I simply fit the multiple regression \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon\) to the data.

The fitted coefficients:

lm_coef <- lm(y ~ x1 + x2)$coefficients
lm_coef
## (Intercept)          x1          x2 
##   16.149420    5.234804   -7.273397

Overlaying these coefficients onto the previous plot (with a dashed line), it appears that the looped simple linear regressions via backfitting gave the same (or very close) parameter estimates to those provided by multiple regression, once they had converged:

data.frame(iteration = 1:1000, beta_0, beta_1, beta_2) %>%
  pivot_longer(-iteration, names_to = "coefficient", values_to = "estimate") %>%
  ggplot(aes(x = iteration, y = estimate, col = factor(coefficient))) + 
  geom_line() + 
  geom_hline(data = data.frame(yint = lm_coef[1], coefficient = "beta_0"), aes(yintercept = yint), linetype = "dashed") +
  geom_hline(data = data.frame(yint = lm_coef[2], coefficient = "beta_1"), aes(yintercept = yint), linetype = "dashed") +
  geom_hline(data = data.frame(yint = lm_coef[3], coefficient = "beta_2"), aes(yintercept = yint), linetype = "dashed") +
  scale_x_continuous(breaks = seq(0, 1000, 100), labels = scales::comma_format()) +
  facet_wrap(coefficient ~ ., scales = "free_y", nrow = 3) + 
  theme(legend.position = "none") + 
  labs(title = "Backfitting Convergence", 
       x = "Iteration", 
       y = "Parameter Estimate")

Extracting the backfitting coefficients at iteration 1,000 and comparing to those of multiple regression:

data.frame(simple_reg = c(beta_0[1000], beta_1[1000], beta_2[1000]), multiple_reg = as.numeric(lm_coef))
##   simple_reg multiple_reg
## 1  16.149420    16.149420
## 2   5.234804     5.234804
## 3  -7.273397    -7.273397

It certainly looks like they are identical, and the code below confirms.

identical(round(as.numeric(lm_coef), 12), 
          round(c(beta_0[1000], beta_1[1000], beta_2[1000]), 12))
## [1] TRUE


(g) Backfitting - Convergence Iterations

Q: On this data set, how many backfitting iterations were required in order to obtain a “good” approximation to the multiple regression coefficient estimates?

A:

For this simple example, just one iteration gave us a ‘good’ approximation of multiple regression coefficients:

data.frame(iteration = 1:1000, beta_0, beta_1, beta_2) %>%
  pivot_longer(-iteration, names_to = "coefficient", values_to = "estimate") %>%
  ggplot(aes(x = iteration, y = estimate, col = factor(coefficient))) + 
  geom_line() + 
  geom_hline(data = data.frame(yint = lm_coef[1], coefficient = "beta_0"), aes(yintercept = yint), linetype = "dashed") +
  geom_hline(data = data.frame(yint = lm_coef[2], coefficient = "beta_1"), aes(yintercept = yint), linetype = "dashed") +
  geom_hline(data = data.frame(yint = lm_coef[3], coefficient = "beta_2"), aes(yintercept = yint), linetype = "dashed") +
  geom_point() +
  scale_x_continuous(limits = c(1, 10), breaks = seq(1, 10), minor_breaks = NULL) +
  facet_wrap(coefficient ~ ., scales = "free_y", nrow = 3) + 
  theme(legend.position = "none") + 
  labs(title = "Backfitting Convergence", 
       x = "Iteration", 
       y = "Parameter Estimate")

It looks like convergence happens at iteration 3 from the graph, and that all subsequent iterations give the same estimates.

A closer inspection shows that perhaps the coefficients haven’t completely converged by this point:

length(unique(round(beta_0, 6)[3:1000])) == 1 &
length(unique(round(beta_1, 6)[3:1000])) == 1 &
length(unique(round(beta_2, 6)[3:1000])) == 1
## [1] FALSE

Which iteration achieves ‘convergence’ seems to depend on our chosen level of precision. For example, at 6 decimal places (the precision given by the multiple regression output), it takes 5 iterations:

length(unique(round(beta_0, 6)[5:1000])) == 1 &
length(unique(round(beta_1, 6)[5:1000])) == 1 &
length(unique(round(beta_2, 6)[5:1000])) == 1
## [1] TRUE

I presume that the number of iterations required for a ‘good’ approximation will vary with factors such as the correlation between the predictors, the strength of the relationships and the number of variables.
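One way to make ‘convergence’ precise (a sketch, using an arbitrary tolerance of 1e-8) is to find the first iteration whose update moves every coefficient by less than the tolerance:

est <- cbind(beta_0, beta_1, beta_2)                              # 1000 x 3 matrix of estimates
max_change <- apply(abs(est[-1, ] - est[-nrow(est), ]), 1, max)   # largest coefficient change per update

which(max_change < 1e-8)[1] + 1   # first iteration within 1e-8 of the previous one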




12. APPLIED: Generated Data (GAM Backfitting, \(p = 100\))

Q: This problem is a continuation of the previous exercise. In a toy example with \(p = 100\), show that one can approximate the multiple linear regression coefficient estimates by repeatedly performing simple linear regression in a backfitting procedure. How many backfitting iterations are required in order to obtain a “good” approximation to the multiple regression coefficient estimates? Create a plot to justify your answer.

A:

I simulate the following relationship: \(Y = \beta_{0} + \sum_{i = 1}^{100}\beta_{i}X_{i} + \epsilon\).

I first generate the ‘true’ parameters \(\beta = (\beta_1, \beta_2, \dots, \beta_{100})\) from \(\mathcal{N}(0, 1)\), with the exception of the intercept \(\beta_0\) (beta_0 in the code), which I arbitrarily set to 3:

set.seed(101)

betas <- rnorm(100)
beta_0 <- 3

ggplot(data.frame(parameter = 0:100, value = c(beta_0, betas)), aes(x = parameter, y = value)) + 
  geom_point(alpha = 0.3) + 
  labs(title = "True Generating Parameters", 
       x = "Parameter Number", 
       y = "True Value")

I generate \(X = (X_1, X_2, \dots, X_{100})\) from \(\mathcal{N}(0, 1)\) again. I go with 1,000 observations, so \(X\) is a \(1,000 \times 100\) matrix:

set.seed(102)
X <- matrix(rnorm(1000 * 100), nrow = 1000, ncol = 100)
dim(X)
## [1] 1000  100

I wanted to introduce significant noise so that the relationships aren’t too strong. Here, \(\epsilon \sim \mathcal{N}(0, 10^2)\):

set.seed(103)
noise <- rnorm(1000, sd = 10)

Finally I generate my response with \(Y = \beta_{0} + \sum_{i = 1}^{100}\beta_{i}X_{i} + \epsilon\):

Y <- beta_0 + X %*% betas + noise

I fit a linear model to this data and extract the \(R^2\) to get an idea of the strength of the relationship between the response \(Y\) and the predictors in \(X\):

summary(lm(Y ~ X))$r.squared
## [1] 0.528377


Backfitting:

I presume, based on the previous question, that the first step will need to be to provide an initial guess for all parameters except one. That is, we must initialize \(\hat{\beta_2}, \hat{\beta_3}, \dots, \hat{\beta_{100}}\). I take these as random estimates, as the previous question suggests that these initial values aren’t important:

betas_iter <- rep(NA, 100)
beta_0_iter <- NA

set.seed(104)
betas_iter[2:100] <- rnorm(99)

I now try 50 iterations of the backfitting process, estimating the \(\beta_{i}\) one at a time in each iteration and updating the parameter estimates each time.

The first iteration will fit 100 models, updating the parameters one at a time:

  1. \(Y - \sum_{i=2}^{100}\hat{\beta_i} X_i = \beta_0 + \beta_{1} X_{1} + \epsilon\)
  2. \(Y - \hat{\beta_1} X_1 - \sum_{i=3}^{100}\hat{\beta_i} X_i = \beta_0 + \beta_{2} X_{2} + \epsilon\)

\(\dots\)

  99. \(Y - \sum_{i=1}^{98}\hat{\beta_i} X_i - \hat{\beta}_{100} X_{100} = \beta_0 + \beta_{99} X_{99} + \epsilon\)
  100. \(Y - \sum_{i=1}^{99}\hat{\beta_i} X_i = \beta_0 + \beta_{100} X_{100} + \epsilon\)

On the fit for \(X_{100}\) I will also update the estimate for \(\hat{\beta_0}\).

# rows = iteration (first = initial), cols = beta_0, beta_1, ..., beta_100
beta_matrix <- matrix(nrow = 1 + 50, ncol = 1 + 100) 
beta_matrix[1, 3:101] <- betas_iter[2:100]

for (i in 1:50) {
  
  # update estimates for beta_1, ..., beta_100, then beta_0
  for (p in 1:100) {
    rp <- Y - X[ ,-p] %*% betas_iter[-p] 
    betas_iter[p] <- lm(rp ~ X[ ,p])$coef[2] 
  }
  
  # after the inner loop p = 100, so rp is the partial residual for X_100 and this
  # refit also provides the updated intercept estimate
  beta_0_iter <- lm(rp ~ X[ ,p])$coef[1]
  
  # update matrix
  beta_matrix[i + 1, ] <- c(beta_0_iter, betas_iter)
}

I drop the first row of beta_matrix, since it only contains the initialized random values of \(\hat{\beta_2}, \hat{\beta_3}, \dots, \hat{\beta_{100}}\) that were required for the first simple linear regression.

beta_matrix <- beta_matrix[-1, ]
colnames(beta_matrix) <- c("(Intercept)", paste0("X", 1:100))
dim(beta_matrix)
## [1]  50 101

I display the first \(20 \times 6\) entries of beta_matrix, which now takes the form:

\[\begin{bmatrix} \hat{\beta}_{0(1)} & \hat{\beta}_{1(1)} & \dots & \hat{\beta}_{100(1)} \\ \hat{\beta}_{0(2)} & \ddots & & \\ \vdots & & \ddots & \\ \hat{\beta}_{0(50)} & & & \hat{\beta}_{100(50)} \end{bmatrix}\]

where \(\hat{\beta}_{j(k)}\) is the estimate for \(\beta_j\) at iteration \(k\).

beta_matrix[1:20, 1:6] %>% print(digits = 8)
##       (Intercept)          X1          X2         X3          X4          X5
##  [1,]   3.2392615 -1.70704102 -0.12680456 -1.2913319  0.49017672 -0.42537690
##  [2,]   3.2719487 -0.86775206  0.26510010 -1.1690149 -0.16608137  0.16336896
##  [3,]   3.2818133 -0.77487081  0.34143174 -1.2004922 -0.19236642  0.22145118
##  [4,]   3.2825939 -0.77934076  0.36537150 -1.2131186 -0.19448843  0.25010371
##  [5,]   3.2828479 -0.77968237  0.37034730 -1.2175314 -0.19330619  0.24948653
##  [6,]   3.2829583 -0.77969699  0.37061290 -1.2188563 -0.19344194  0.24843638
##  [7,]   3.2829878 -0.77961440  0.37049857 -1.2192055 -0.19353794  0.24815108
##  [8,]   3.2829950 -0.77957360  0.37044130 -1.2192901 -0.19356060  0.24808928
##  [9,]   3.2829968 -0.77955950  0.37042193 -1.2193107 -0.19356394  0.24807633
## [10,]   3.2829973 -0.77955533  0.37041627 -1.2193158 -0.19356425  0.24807372
## [11,]   3.2829974 -0.77955421  0.37041474 -1.2193170 -0.19356427  0.24807323
## [12,]   3.2829974 -0.77955392  0.37041434 -1.2193173 -0.19356427  0.24807315
## [13,]   3.2829974 -0.77955386  0.37041425 -1.2193173 -0.19356427  0.24807314
## [14,]   3.2829974 -0.77955384  0.37041422 -1.2193173 -0.19356428  0.24807314
## [15,]   3.2829974 -0.77955383  0.37041422 -1.2193173 -0.19356428  0.24807314
## [16,]   3.2829974 -0.77955383  0.37041421 -1.2193173 -0.19356428  0.24807314
## [17,]   3.2829974 -0.77955383  0.37041421 -1.2193173 -0.19356428  0.24807314
## [18,]   3.2829974 -0.77955383  0.37041421 -1.2193173 -0.19356428  0.24807314
## [19,]   3.2829974 -0.77955383  0.37041421 -1.2193173 -0.19356428  0.24807314
## [20,]   3.2829974 -0.77955383  0.37041421 -1.2193173 -0.19356428  0.24807314

I fit a multiple regression to the data and display the first 6 coefficients for a rough comparison.

lm_coef <- lm(Y ~ X)$coef
lm_coef[1:6] %>% print(digits = 8)
## (Intercept)          X1          X2          X3          X4          X5 
##  3.28299744 -0.77955383  0.37041421 -1.21931735 -0.19356428  0.24807314

It looks like we have a “good” approximation of the multiple regression coefficients by the third iteration. While I am only viewing 6 of the 101 coefficients to be estimated, it seems likely that convergence happens to a high degree of precision somewhere around the 15th iteration. One potential (better) way of determining how ‘close’ these estimates are to the multiple regression estimates is shown below.


Parameter Distance:

To quantify how ‘close’ the parameters are to the multiple regression coefficients, I reuse the ‘parameter distance’ metric (see my solutions for chapter 6, question 10(g)). This seems reasonable to use again here: all the variables and corresponding parameter estimates are on the same-ish scale (the data was sampled from \(\mathcal{N}(0, 1)\)).

For each iteration \(k\), I calculate the ‘parameter distance’, which I denote by \(d_k\). I have modified the formula slightly from the chapter 6 solutions to:

\[d_k =\sqrt{\sum_{j=0}^p \left( \hat{\beta}_{j(k)} - \hat{\beta}_j^m \right)^2}\]

where \(\hat{\beta}_j^m\) is the multiple regression estimate of \(\beta_j\). I produce a plot of \(d_k\) by iteration:

param_dist <- c()

for (k in 1:50) {
  param_dist[k] <- sqrt(sum((beta_matrix[k, ] - lm_coef)^2))
}

data.frame(k = 1:50, param_dist = param_dist)  %>%
  ggplot(aes(x = k, y = param_dist)) + 
  geom_line(col = "grey20") + 
  geom_point(col = "steelblue") + 
  scale_x_continuous(limits = c(1, 50), breaks = seq(0, 50, 10)) + 
  labs(title = "Parameter Distance", 
       subtitle = "Backfitted vs Multiple Regression, by Iteration", 
       x = "Iteration", 
       y = "Parameter Distance")

Perhaps a more useful way of displaying this graph is with the y-axis on a \(\log_{10}\) scale, where we can see that the coefficients from the repeated simple linear regressions (backfitting) get very close to those given by multiple regression within a handful of iterations, and continue to converge towards them with each subsequent iteration:

data.frame(k = 1:50, param_dist = param_dist)  %>%
  ggplot(aes(x = k, y = param_dist)) + 
  geom_line(col = "grey20") + 
  geom_point(col = "steelblue") + 
  scale_x_continuous(limits = c(1, 50), breaks = seq(0, 50, 10)) + 
  scale_y_continuous(trans = "log10", breaks = 10^seq(1, -14, -2)) +
  labs(title = "Parameter Distance", 
       subtitle = "Backfitted vs Multiple Regression, by Iteration", 
       x = "Iteration", 
       y = "Parameter Distance")

I presume the relatively flat line after roughly iteration 25 is due to the limits of double-precision floating-point arithmetic in R.
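For reference, double precision carries roughly 15-16 significant digits, so once the parameter distance reaches this scale there is essentially nothing left to converge:

.Machine$double.eps   # relative machine precision for doubles, ~2.2e-16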