1. In this exercise, you will further analyze the Wage data set considered throughout this chapter.
  1. Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.

  2. Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.

(a) Polynomial Regression

We fit polynomial regression models of varying degrees (from 1 to 6) to predict wage using age. Cross-validation is used to determine the optimal degree of the polynomial. We also compare this selection with the results from an ANOVA hypothesis test.

# Packages: ISLR2 (or ISLR) provides Wage; boot provides cv.glm(); scales provides alpha()
library(ISLR2); library(boot); library(scales)

# Cross-validation error for polynomial degrees 1 through 6
cv_errors <- sapply(1:6, function(d) {
  fit <- glm(wage ~ poly(age, d), data = Wage)
  cv.glm(Wage, fit, K = 5)$delta[1]
})

# Plot CV errors and identify minimum
which.min(cv_errors)
## [1] 6
plot(1:6, cv_errors, xlab = "Degree", ylab = "Test MSE", type = "l", lwd = 2)
abline(v = which.min(cv_errors), col = "red", lty = 2)

# Fit best polynomial model
best_degree <- which.min(cv_errors)
fit <- glm(wage ~ poly(age, best_degree), data = Wage)

# Plot data and polynomial fit
plot(Wage$age, Wage$wage, pch = 19, cex = 0.4, col = alpha("steelblue", 0.4),
     xlab = "Age", ylab = "Wage")
lines(1:100, predict(fit, data.frame(age = 1:100)), col = "red", lwd = 2)

# Summary of the degree-6 fit: because poly() uses orthogonal polynomials, the
# coefficient t-tests carry essentially the same information as the sequential
# ANOVA F-tests below
summary(glm(wage ~ poly(age, 6), data = Wage))
## 
## Call:
## glm(formula = wage ~ poly(age, 6), data = Wage)
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    111.7036     0.7286 153.316  < 2e-16 ***
## poly(age, 6)1  447.0679    39.9063  11.203  < 2e-16 ***
## poly(age, 6)2 -478.3158    39.9063 -11.986  < 2e-16 ***
## poly(age, 6)3  125.5217    39.9063   3.145  0.00167 ** 
## poly(age, 6)4  -77.9112    39.9063  -1.952  0.05099 .  
## poly(age, 6)5  -35.8129    39.9063  -0.897  0.36956    
## poly(age, 6)6   62.7077    39.9063   1.571  0.11620    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1592.512)
## 
##     Null deviance: 5222086  on 2999  degrees of freedom
## Residual deviance: 4766389  on 2993  degrees of freedom
## AIC: 30642
## 
## Number of Fisher Scoring iterations: 2
# ANOVA model comparisons (F-tests)
fit1 <- lm(wage ~ poly(age, 1), data = Wage)
fit2 <- lm(wage ~ poly(age, 2), data = Wage)
fit3 <- lm(wage ~ poly(age, 3), data = Wage)
fit4 <- lm(wage ~ poly(age, 4), data = Wage)
fit5 <- lm(wage ~ poly(age, 5), data = Wage)
anova(fit1, fit2, fit3, fit4, fit5, test = "F")

Cross-validation selected degree 6 as having the lowest estimated test error (note that, without a fixed seed, the chosen degree can vary from run to run). The ANOVA F-tests, in contrast, show that terms beyond degree 3 or 4 add little: the cubic term is clearly significant, the quartic term is borderline, and the degree-5 and degree-6 terms are not. Thus hypothesis testing favors a simpler model (degree 3 or 4), while CV prefers a more flexible one; the two criteria answer slightly different questions and trade off bias against variance in different ways.
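As a quick check (this call is not part of the original analysis), the nested sequence of F-tests can be extended to include the degree-6 fit selected by CV; the degree-5 and degree-6 terms should again appear non-significant, mirroring the t-tests in the summary above.

# Extend the nested ANOVA comparison up to degree 6 (fit6 is introduced here)
fit6 <- lm(wage ~ poly(age, 6), data = Wage)
anova(fit1, fit2, fit3, fit4, fit5, fit6, test = "F")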

(b) Step Function Regression with Cross-Validation

# Cross-validation errors for step functions with 2–10 cuts
cv_step <- sapply(2:10, function(k) {
  Wage$cats <- cut(Wage$age, k)
  fit <- glm(wage ~ cats, data = Wage)
  cv.glm(Wage, fit, K = 5)$delta[1]
})

names(cv_step) <- 2:10

# Plot CV errors and identify minimum
plot(2:10, cv_step, xlab = "Cuts", ylab = "Test MSE", type = "l", lwd = 2)
which.min(cv_step)
## 8 
## 7
abline(v = as.numeric(names(which.min(cv_step))), col = "red", lty = 2)

# Fit model with best number of cuts (8 in this case)
fit <- glm(wage ~ cut(age, 8), data = Wage)

# Plot data and step function fit
plot(Wage$age, Wage$wage, pch = 19, cex = 0.4, col = alpha("steelblue", 0.4),
     xlab = "Age", ylab = "Wage")
lines(18:80, predict(fit, data.frame(age = 18:80)), col = "red", lwd = 2)

This is a quirk of how the result is indexed: cv_step stores the CV errors for 2 through 10 cuts in positions 1 through 9, and the names "2" through "10" were attached explicitly with names(cv_step) <- 2:10. which.min(cv_step) therefore returns position 7, whose name "8" indicates that 8 cuts gives the lowest CV error.
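Equivalently, the optimal number of cuts can be read directly from the element's name rather than its position; a small sketch (best_cuts is a name introduced here):

# The name of the minimising element is the number of cuts itself
best_cuts <- as.integer(names(which.min(cv_step)))
best_cuts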


  1. The Wage data set contains a number of other features not explored in this chapter, such as marital status (maritl), job class (jobclass), and others. Explore the relationships between some of these other predictors and wage, and use non-linear fitting techniques in order to fit flexible models to the data. Create plots of the results obtained, and write a summary of your findings.

We begin by plotting wage against several predictors:

plot(Wage$year, Wage$wage, pch = 19, cex = 0.4, col = alpha("steelblue", 0.4))

plot(Wage$age, Wage$wage, pch = 19, cex = 0.4, col = alpha("steelblue", 0.4))

plot(Wage$maritl, Wage$wage, pch = 19, cex = 0.4, col = alpha("steelblue", 0.4))

plot(Wage$jobclass, Wage$wage, pch = 19, cex = 0.4, col = alpha("steelblue", 0.4))

plot(Wage$education, Wage$wage, pch = 19, cex = 0.4, col = alpha("steelblue", 0.4))

From these plots (for the factor variables maritl, jobclass, and education, plot() produces boxplots), wage clearly varies with age and across the levels of the categorical predictors, and shows a mild upward trend over year.

To capture both non-linear and categorical effects, we use Generalized Additive Models (GAMs). The s() function fits smooth splines to continuous predictors, while categorical variables are included directly and interpreted relative to a baseline level.

library(gam)
## Warning: package 'gam' was built under R version 4.4.3
## Loading required package: foreach
## Warning: package 'foreach' was built under R version 4.4.3
## Loaded gam 1.22-5
fit0 <- gam(wage ~ s(year, 4) + s(age, 5) + education, data = Wage)
fit1 <- gam(wage ~ s(year, 4) + s(age, 5) + education + jobclass, data = Wage)
fit2 <- gam(wage ~ s(year, 4) + s(age, 5) + education + maritl, data = Wage)
fit3 <- gam(wage ~ s(year, 4) + s(age, 5) + education + jobclass + maritl, data = Wage)
anova(fit0, fit1, fit2, fit3)
par(mfrow = c(2, 3))
plot(fit3, se = TRUE, col = "blue")
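To gauge the contribution of each term in the selected model, one could also inspect the GAM summary (output omitted; this call was not part of the original analysis):

# ANOVA tables for the parametric and nonparametric effects of the chosen GAM
summary(fit3)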


  1. This question uses the variables dis (the weighted mean of distances to five Boston employment centers) and nox (nitrogen oxides concentration in parts per 10 million) from the Boston data. We will treat dis as the predictor and nox as the response.
  1. Use the poly() function to fit a cubic polynomial regression to predict nox using dis. Report the regression output, and plot the resulting data and polynomial fits.
  2. Plot the polynomial fits for a range of different polynomial degrees (say, from 1 to 10), and report the associated residual sum of squares.
  3. Perform cross-validation or another approach to select the optimal degree for the polynomial, and explain your results.
  4. Use the bs() function to fit a regression spline to predict nox using dis. Report the output for the fit using four degrees of freedom. How did you choose the knots? Plot the resulting fit.
  5. Now fit a regression spline for a range of degrees of freedom, and plot the resulting fits and report the resulting RSS. Describe the results obtained.
  6. Perform cross-validation or another approach in order to select the best degrees of freedom for a regression spline on this data. Describe your results.

(a) Fit a Cubic Polynomial Regression

We use the poly() function to fit a cubic (degree 3) polynomial regression. Then, we visualize the fit.

# Fit cubic polynomial (the Boston data is available from the ISLR2 or MASS packages)
fit_cubic <- glm(nox ~ poly(dis, 3), data = Boston)
summary(fit_cubic)
## 
## Call:
## glm(formula = nox ~ poly(dis, 3), data = Boston)
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.554695   0.002759 201.021  < 2e-16 ***
## poly(dis, 3)1 -2.003096   0.062071 -32.271  < 2e-16 ***
## poly(dis, 3)2  0.856330   0.062071  13.796  < 2e-16 ***
## poly(dis, 3)3 -0.318049   0.062071  -5.124 4.27e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.003852802)
## 
##     Null deviance: 6.7810  on 505  degrees of freedom
## Residual deviance: 1.9341  on 502  degrees of freedom
## AIC: -1370.9
## 
## Number of Fisher Scoring iterations: 2
# Visualization
x_vals <- seq(min(Boston$dis), max(Boston$dis), length.out = 1000)
pred_vals <- predict(fit_cubic, newdata = data.frame(dis = x_vals))

plot(nox ~ dis, data = Boston, col = alpha("steelblue", 0.4), pch = 19,
     main = "Cubic Polynomial Fit: nox vs. dis")
lines(x_vals, pred_vals, col = "red", lwd = 2, lty = 2)

Explanation: All three polynomial terms are highly significant, indicating that the relationship between nox and dis is clearly non-linear. The fitted curve shows a smooth, decreasing trend in nox as dis increases, consistent with NOx concentrations being highest near employment centers and falling with distance.

(b) Compare Polynomial Fits from Degree 1 to 10 and Report RSS

We fit polynomial regressions for degrees 1 through 10 and visualize them together. We also report the residual sum of squares (RSS) for comparison.

# Fit models and predict
polynomial_models <- lapply(1:10, function(deg) glm(nox ~ poly(dis, deg), data = Boston))

# Prediction frame
x_vals <- seq(min(Boston$dis), max(Boston$dis), length.out = 1000)
preds <- sapply(polynomial_models, function(mod) predict(mod, newdata = data.frame(dis = x_vals)))

# Prepare data for ggplot (pivot_longer() is from tidyr, ggplot() from ggplot2)
pred_df <- data.frame(x = x_vals, preds)
colnames(pred_df)[-1] <- paste0("Degree_", 1:10)
pred_long <- pivot_longer(pred_df, -x, names_to = "Degree", values_to = "Predicted")

# Plot fits
ggplot(Boston, aes(dis, nox)) +
  geom_point(color = alpha("steelblue", 0.4)) +
  geom_line(data = pred_long, aes(x = x, y = Predicted, color = Degree), linewidth = 1) +
  labs(title = "Polynomial Fits (Degrees 1–10)", x = "Distance to Employment Centers", y = "NOx Concentration") +
  theme_minimal()

# Report RSS
rss_vals <- sapply(polynomial_models, function(mod) sum(residuals(mod)^2))
names(rss_vals) <- paste0("Degree_", 1:10)
rss_vals
##  Degree_1  Degree_2  Degree_3  Degree_4  Degree_5  Degree_6  Degree_7  Degree_8 
##  2.768563  2.035262  1.934107  1.932981  1.915290  1.878257  1.849484  1.835630 
##  Degree_9 Degree_10 
##  1.833331  1.832171

Explanation:

As the polynomial degree increases, the residual sum of squares (RSS) necessarily decreases, since a more flexible model fits the training data more closely. Most of the improvement comes from degrees 1 through 3; beyond roughly degree 3 to 5 the gains are marginal (the RSS barely changes between degrees 9 and 10), and the higher-degree fits become erratic near the boundaries of dis, a sign of overfitting.
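A quick plot of these RSS values (not part of the original output) makes the diminishing returns visible:

# Training RSS versus polynomial degree
plot(1:10, rss_vals, type = "b", pch = 19,
     xlab = "Polynomial Degree", ylab = "RSS")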

(c) Use Cross-Validation to Select Optimal Polynomial Degree

We apply 10-fold cross-validation to select the best degree.

cv_errors <- sapply(1:10, function(deg) {
  fit <- glm(nox ~ poly(dis, deg), data = Boston)
  cv.glm(Boston, fit, K = 10)$delta[1]
})

# Plot CV errors
plot(1:10, cv_errors, type = "b", pch = 19, col = "darkgreen",
     xlab = "Polynomial Degree", ylab = "CV Error",
     main = "10-Fold Cross-Validation Error")

optimal_degree <- which.min(cv_errors)
optimal_degree
## [1] 3

Explanation:

The cross-validation error is minimized at degree 3, confirming that the cubic fit strikes a balance between bias and variance. This is consistent with our earlier visual inspection, where degree 3 provided a smooth, reasonable fit without overfitting.

(d) Fit a Regression Spline with 4 Degrees of Freedom

We now fit a regression spline using bs() (a B-spline basis) with 4 degrees of freedom (df). When df is specified instead of explicit knots, bs() places the interior knots at uniform quantiles of the data; with df = 4 and the default cubic basis, that means a single interior knot at the median of dis.

fit_spline4 <- glm(nox ~ bs(dis, df = 4), data = Boston)
summary(fit_spline4)
## 
## Call:
## glm(formula = nox ~ bs(dis, df = 4), data = Boston)
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.73447    0.01460  50.306  < 2e-16 ***
## bs(dis, df = 4)1 -0.05810    0.02186  -2.658  0.00812 ** 
## bs(dis, df = 4)2 -0.46356    0.02366 -19.596  < 2e-16 ***
## bs(dis, df = 4)3 -0.19979    0.04311  -4.634 4.58e-06 ***
## bs(dis, df = 4)4 -0.38881    0.04551  -8.544  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.003837874)
## 
##     Null deviance: 6.7810  on 505  degrees of freedom
## Residual deviance: 1.9228  on 501  degrees of freedom
## AIC: -1371.9
## 
## Number of Fisher Scoring iterations: 2
# Visualization
pred_spline <- predict(fit_spline4, newdata = data.frame(dis = x_vals))
plot(nox ~ dis, data = Boston, col = alpha("steelblue", 0.4), pch = 19,
     main = "Regression Spline Fit (4 df)")
lines(x_vals, pred_spline, col = "red", lwd = 2, lty = 2)

Explanation:

The regression spline with 4 degrees of freedom captures the curvature in the relationship without becoming excessively wiggly, giving a stable fit with a good balance between flexibility and overfitting.
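To confirm where the knot was placed, the knot locations stored by bs() can be inspected directly; a minimal sketch, assuming the splines package (which provides bs()) is attached:

# With df = 4 and the default cubic degree, bs() uses df - degree = 1 interior
# knot, placed at the median of dis; knot locations are stored as an attribute
library(splines)
attr(bs(Boston$dis, df = 4), "knots")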

(e) Fit Splines With df = 3 to 10 and Report RSS

Now we fit splines for multiple degrees of freedom and observe model complexity vs. fit.

# Fit spline models
spline_models <- lapply(3:10, function(df) glm(nox ~ bs(dis, df = df), data = Boston))

# Generate predictions
spline_preds <- sapply(spline_models, function(mod) predict(mod, newdata = data.frame(dis = x_vals)))
spline_df <- data.frame(x = x_vals, spline_preds)
colnames(spline_df)[-1] <- paste0("df_", 3:10)
spline_long <- pivot_longer(spline_df, -x, names_to = "DF", values_to = "Predicted")

# Plot splines
ggplot(Boston, aes(dis, nox)) +
  geom_point(color = alpha("steelblue", 0.4)) +
  geom_line(data = spline_long, aes(x = x, y = Predicted, color = DF), linewidth = 1) +
  labs(title = "Regression Splines with Varying df", x = "dis", y = "nox") +
  theme_minimal()
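The heading above also asks for the RSS of each fit; a short sketch reusing the spline_models list fitted above (output not shown, since it was not part of the original):

# Residual sum of squares for each spline fit (df = 3 through 10)
spline_rss <- sapply(spline_models, function(mod) sum(residuals(mod)^2))
names(spline_rss) <- paste0("df_", 3:10)
spline_rss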

Explanation:

As the degrees of freedom increase, the RSS decreases, but the fits become increasingly wiggly, especially beyond df = 5 and near the edges of the data. Around 4 to 5 degrees of freedom provides a good compromise between fit and model complexity.

(f) Use Cross-Validation to Choose Optimal Degrees of Freedom for Splines

We apply 10-fold CV to the spline fits. Note: cv.glm() triggers warnings because, in some folds, held-out values of dis fall outside the boundary knots estimated on the training fold; these warnings are suppressed here.

set.seed(42)

cv_spline_errors <- sapply(3:10, function(df) {
  fit <- glm(nox ~ bs(dis, df = df), data = Boston)
  suppressWarnings(cv.glm(Boston, fit, K = 10)$delta[1])
})

# Plot CV errors
plot(3:10, cv_spline_errors, type = "b", pch = 19, col = "blue",
     xlab = "Degrees of Freedom", ylab = "CV Error",
     main = "CV Error for Spline Fits")

best_df <- which.min(cv_spline_errors) + 2  # shift index to start at 3
best_df
## [1] 10

Explanation:

Cross-validation identifies 10 degrees of freedom as having the lowest estimated test error. However, the fits in part (e) suggest that flexibility beyond roughly df = 4 or 5 mainly adds wiggliness near the boundaries of dis, so a more parsimonious spline may still be a sensible choice in practice.
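To make this comparison concrete, the CV errors can be printed alongside their df values (not shown in the original):

# CV error for each candidate df
names(cv_spline_errors) <- paste0("df_", 3:10)
round(cv_spline_errors, 5)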


  1. In Section 7.7, it was mentioned that GAMs are generally fit using a backfitting approach. The idea behind backfitting is actually quite simple. We will now explore backfitting in the context of multiple linear regression. Suppose that we would like to perform multiple linear regression, but we do not have software to do so. Instead, we only have software to perform simple linear regression. Therefore, we take the following iterative approach: we repeatedly hold all but one coefficient estimate fixed at its current value, and update only that coefficient estimate using a simple linear regression. The process is continued until convergence, that is, until the coefficient estimates stop changing. We now try this out on a toy example.
  1. Generate a response Y and two predictors X1 and X2, with n = 100.

  2. Initialize β̂1 to take on a value of your choice. It does not matter what value you choose.

  3. Keeping β̂1 fixed, fit the model Y − β̂1X1 = β0 + β2X2 + ε. You can do this as follows: > a <- y - beta1 * x1 > beta2 <- lm(a ~ x2)$coef[2]

  4. Keeping β̂2 fixed, fit the model Y − β̂2X2 = β0 + β1X1 + ε. You can do this as follows: > a <- y - beta2 * x2 > beta1 <- lm(a ~ x1)$coef[2]

  5. Write a for loop to repeat (c) and (d) 1,000 times. Report the estimates of β̂0, β̂1, and β̂2 at each iteration of the for loop. Create a plot in which each of these values is displayed, with β̂0, β̂1, and β̂2 each shown in a different color.

  6. Compare your answer in (e) to the results of simply performing multiple linear regression to predict Y using X1 and X2. Use the abline() function to overlay those multiple linear regression coefficient estimates on the plot obtained in (e).

  7. On this data set, how many backfitting iterations were required in order to obtain a “good” approximation to the multiple regression coefficient estimates?

a. Generate predictors and response

We generate X1 and X2 independently from a normal distribution with mean 0 and variance 2. The response Y is generated using the true model:

Y = β₀ + β₁X₁ + β₂X₂ + ε

where:

  • β₀ = 1.5

  • β₁ = 3

  • β₂ = −4.5

  • ε ∼ N(0, 1)

set.seed(42)
n <- 100
x1 <- rnorm(n, mean = 0, sd = sqrt(2))
x2 <- rnorm(n, mean = 0, sd = sqrt(2))
beta0_true <- 1.5
beta1_true <- 3
beta2_true <- -4.5
epsilon <- rnorm(n, mean = 0, sd = 1)

y <- beta0_true + beta1_true * x1 + beta2_true * x2 + epsilon

b. Initialize β₁

We start backfitting with an arbitrary value for β̂₁:

beta1 <- 20

c–d. One iteration of backfitting

We perform one step of backfitting by keeping one coefficient fixed and estimating the other, then alternating:

a <- y - beta1 * x1
beta2 <- lm(a ~ x2)$coef[2]

a <- y - beta2 * x2
fit1 <- lm(a ~ x1)
beta1 <- fit1$coef[2]
beta0 <- fit1$coef[1]

e. Iterative backfitting for 1000 steps

We repeat steps (c) and (d) for 1,000 iterations, storing the estimates from every iteration; for display, we print iterations 1–5 and then every 50th iteration. Results are visualized using ggplot2.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ lubridate 1.9.4     ✔ stringr   1.5.1
## ✔ purrr     1.0.2     ✔ tibble    3.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ purrr::accumulate() masks foreach::accumulate()
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard()    masks scales::discard()
## ✖ dplyr::filter()     masks stats::filter()
## ✖ dplyr::lag()        masks stats::lag()
## ✖ purrr::when()       masks foreach::when()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
res <- matrix(NA, nrow = 1000, ncol = 3)
colnames(res) <- c("beta0", "beta1", "beta2")
beta1 <- 20

for (i in 1:1000) {
  beta2 <- lm(y - beta1 * x1 ~ x2)$coef[2]
  fit1 <- lm(y - beta2 * x2 ~ x1)
  beta1 <- fit1$coef[2]
  beta0 <- fit1$coef[1]
  res[i, ] <- c(beta0, beta1, beta2)
}

res <- as.data.frame(res)
res$Iteration <- 1:1000

# Only show selected iterations
selected_rows <- c(1:5, seq(50, 1000, by = 50))
res_display <- res[selected_rows, ]
print(round(res_display, 4))
##       beta0  beta1   beta2 Iteration
## 1    1.4248 2.9151 -5.0558         1
## 2    1.5017 2.8984 -4.4403         2
## 3    1.5018 2.8984 -4.4397         3
## 4    1.5018 2.8984 -4.4397         4
## 5    1.5018 2.8984 -4.4397         5
## 50   1.5018 2.8984 -4.4397        50
## 100  1.5018 2.8984 -4.4397       100
## 150  1.5018 2.8984 -4.4397       150
## 200  1.5018 2.8984 -4.4397       200
## 250  1.5018 2.8984 -4.4397       250
## 300  1.5018 2.8984 -4.4397       300
## 350  1.5018 2.8984 -4.4397       350
## 400  1.5018 2.8984 -4.4397       400
## 450  1.5018 2.8984 -4.4397       450
## 500  1.5018 2.8984 -4.4397       500
## 550  1.5018 2.8984 -4.4397       550
## 600  1.5018 2.8984 -4.4397       600
## 650  1.5018 2.8984 -4.4397       650
## 700  1.5018 2.8984 -4.4397       700
## 750  1.5018 2.8984 -4.4397       750
## 800  1.5018 2.8984 -4.4397       800
## 850  1.5018 2.8984 -4.4397       850
## 900  1.5018 2.8984 -4.4397       900
## 950  1.5018 2.8984 -4.4397       950
## 1000 1.5018 2.8984 -4.4397      1000

Plot convergence of coefficients

res_long <- pivot_longer(res, cols = c("beta0", "beta1", "beta2"))

ggplot(res_long, aes(x = Iteration, y = value, color = name)) +
  geom_line() +
  geom_hline(yintercept = c(beta0_true, beta1_true, beta2_true), 
             linetype = "dashed", color = c("red", "blue", "darkgreen")) +
  labs(title = "Convergence of Coefficients via Backfitting",
       x = "Iteration",
       y = "Coefficient Estimate",
       color = "Coefficient") +
  scale_color_manual(values = c("darkred", "navy", "forestgreen")) +
  theme_minimal()

Dashed lines represent the true values β₀ = 1.5, β₁ = 3, and β₂ = −4.5.

f. Compare with standard multiple linear regression

Now we compare the backfitting estimates with those from a standard multiple linear regression fit:

fit_lm <- lm(y ~ x1 + x2)
print(round(coef(fit_lm), 4))
## (Intercept)          x1          x2 
##      1.5018      2.8984     -4.4397
# Overlay the OLS coefficient estimates (dotted lines) on the convergence plot
ggplot(res_long, aes(x = Iteration, y = value, color = name)) +
  geom_line() +
  geom_hline(yintercept = coef(fit_lm), 
             linetype = "dotted", color = c("red", "blue", "darkgreen")) +
  labs(title = "Backfitting vs Standard Linear Regression",
       x = "Iteration",
       y = "Coefficient Estimate",
       color = "Coefficient") +
  scale_color_manual(values = c("darkred", "navy", "forestgreen")) +
  theme_minimal()

g. How many iterations to converge?

In this simulation, the estimates essentially match the OLS coefficients after only two to three iterations: by iteration 3 they agree with the multiple regression estimates to four decimal places in the table above. Backfitting therefore reaches a good approximation of the multiple regression solution almost immediately; one way to quantify this is sketched below.
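A minimal sketch of that check, finding the first iteration at which all three estimates agree with the OLS coefficients to within a small tolerance (the 1e-6 tolerance and the name close_enough are arbitrary choices introduced here):

# First iteration at which all three backfitting estimates are within 1e-6 of OLS
ols <- coef(fit_lm)
close_enough <- apply(res[, c("beta0", "beta1", "beta2")], 1, function(b)
  all(abs(b - ols) < 1e-6))
which(close_enough)[1]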