1. In this exercise, you will further analyze the Wage data set considered throughout this chapter.
  1. Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.
library(ISLR2)
library(boot)

set.seed(100)

cv_errors <- numeric(20)

for (d in 1:20) {
  glm_model <- glm(wage ~ poly(age, d), data = Wage)
  cv_errors[d] <- cv.glm(Wage, glm_model, K = 10)$delta[1]
}

plot(cv_errors, type = "l", xlab = "Degree of Polynomial", ylab = "Cross-Validation Error", main = "CV Error vs Degree")

From the plot, degree 3 appears optimal: the CV error drops sharply up to degree 3 and stabilizes beyond it.
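
The exercise also asks for a plot of the chosen fit to the data. A minimal sketch, assuming the degree-3 polynomial selected above, with approximate ±2 SE bands:

fit3 <- lm(wage ~ poly(age, 3), data = Wage)
age_grid <- seq(min(Wage$age), max(Wage$age))
preds <- predict(fit3, newdata = data.frame(age = age_grid), se = TRUE)

plot(Wage$age, Wage$wage, col = "gray", xlab = "Age", ylab = "Wage",
     main = "Degree-3 Polynomial Fit")
lines(age_grid, preds$fit, col = "blue", lwd = 2)
# Approximate 95% confidence band for the fitted curve
matlines(age_grid, cbind(preds$fit + 2 * preds$se.fit,
                         preds$fit - 2 * preds$se.fit),
         col = "blue", lty = 2)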

lm_models <- list()
for (d in 1:5) {
  lm_models[[d]] <- lm(wage ~ poly(age, d), data = Wage)
}

anova(lm_models[[1]], lm_models[[2]], lm_models[[3]], lm_models[[4]], lm_models[[5]], test = "F")

Degree 2 (quadratic) and degree 3 (cubic) significantly improve the model.

Beyond degree 3, higher polynomials are not statistically justified (p-values > 0.05).

Cross-validation and ANOVA both suggest using a 3rd degree polynomial.

  1. Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.
cv_step_errors <- numeric(20)

for (cuts in 1:20) {
  if (cuts == 1) {
    glm_fit <- glm(wage ~ 1, data = Wage)  # No age effect
  } else {
    Wage$cut_age <- cut(Wage$age, cuts)
    glm_fit <- glm(wage ~ cut_age, data = Wage)
  }
  set.seed(100)
  cv_step_errors[cuts] <- cv.glm(Wage, glm_fit, K = 10)$delta[1]
}

plot(cv_step_errors, type = "l", xlab = "Number of Cuts", ylab = "Cross-Validation Error", main = "CV Error for Step Functions")

From the CV error plot, about 8 cuts gives the best balance between bias and variance.
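
The exercise also asks for a plot of the fit obtained. A minimal sketch, assuming 8 intervals for cut() as suggested by the CV curve above:

Wage$cut_age <- cut(Wage$age, 8)
step_fit <- glm(wage ~ cut_age, data = Wage)

plot(Wage$age, Wage$wage, col = "gray", xlab = "Age", ylab = "Wage",
     main = "Step Function Fit (8 Intervals)")
# Plot the fitted step values in order of age so the steps are visible
ord <- order(Wage$age)
lines(Wage$age[ord], predict(step_fit)[ord], col = "red", lwd = 2)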

  1. The Wage data set contains a number of other features not explored in this chapter, such as marital status (maritl), job class (jobclass), and others. Explore the relationships between some of these other predictors and wage, and use non-linear fitting techniques in order to fit flexible models to the data. Create plots of the results obtained, and write a summary of your findings.
library(gam)
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.22-5
plot(Wage$jobclass, Wage$wage, col = "skyblue", main = "Wage by Job Class")

plot(Wage$maritl, Wage$wage, col = "orange", main = "Wage by Marital Status")

gam_fit <- gam(wage ~ s(age, 4) + education + jobclass + maritl, data = Wage)
summary(gam_fit)
## 
## Call: gam(formula = wage ~ s(age, 4) + education + jobclass + maritl, 
##     data = Wage)
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max 
## -112.806  -19.773   -2.918   14.360  209.875 
## 
## (Dispersion Parameter for gaussian family taken to be 1208.038)
## 
##     Null Deviance: 5222086 on 2999 degrees of freedom
## Residual Deviance: 3607202 on 2986 degrees of freedom
## AIC: 29819.86 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##             Df  Sum Sq Mean Sq F value    Pr(>F)    
## s(age, 4)    1  199870  199870 165.450 < 2.2e-16 ***
## education    4 1097688  274422 227.163 < 2.2e-16 ***
## jobclass     1   12811   12811  10.605  0.001141 ** 
## maritl       4  106390   26597  22.017 < 2.2e-16 ***
## Residuals 2986 3607202    1208                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##             Npar Df Npar F     Pr(F)    
## (Intercept)                             
## s(age, 4)         3 23.054 9.437e-15 ***
## education                               
## jobclass                                
## maritl                                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(gam_fit, se = TRUE, col = "darkgreen")

Job class and marital status both show clear wage differences. The GAM indicates that age has a non-linear effect, while education and job class are strong predictors. The plot of s(age, 4) shows that wage increases with age until around 50, then stabilizes and slowly declines beyond 60. The wider confidence bands at the youngest and oldest ages reflect greater uncertainty where observations are sparse.

  1. This question uses the variables dis (the weighted mean of distances to five Boston employment centers) and nox (nitrogen oxides concentration in parts per 10 million) from the Boston data. We will treat dis as the predictor and nox as the response.
  1. Use the poly() function to fit a cubic polynomial regression to predict nox using dis. Report the regression output, and plot the resulting data and polynomial fits.
library(ISLR2)
library(splines)

fit_poly3 <- lm(nox ~ poly(dis, 3), data = Boston)
summary(fit_poly3)
## 
## Call:
## lm(formula = nox ~ poly(dis, 3), data = Boston)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.121130 -0.040619 -0.009738  0.023385  0.194904 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.554695   0.002759 201.021  < 2e-16 ***
## poly(dis, 3)1 -2.003096   0.062071 -32.271  < 2e-16 ***
## poly(dis, 3)2  0.856330   0.062071  13.796  < 2e-16 ***
## poly(dis, 3)3 -0.318049   0.062071  -5.124 4.27e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06207 on 502 degrees of freedom
## Multiple R-squared:  0.7148, Adjusted R-squared:  0.7131 
## F-statistic: 419.3 on 3 and 502 DF,  p-value: < 2.2e-16
plot(Boston$dis, Boston$nox, col = "gray", main = "Cubic Polynomial Fit")
lines(sort(Boston$dis), predict(fit_poly3, newdata = data.frame(dis = sort(Boston$dis))), col = "blue", lwd = 2)

The large negative first-degree coefficient together with the significant quadratic and cubic terms confirms non-linear curvature in the fit: nox falls steeply as dis increases and then levels off. The low residual standard error (0.062) suggests the model predictions are close to the observed values on average, and the large F-statistic with p-value < 2.2e-16 confirms that the overall model is statistically significant.

  1. Plot the polynomial fits for a range of different polynomial degrees (say, from 1 to 10), and report the associated residual sum of squares.
rss <- numeric(10)

for (d in 1:10) {
  fit <- lm(nox ~ poly(dis, d), data = Boston)
  rss[d] <- sum(residuals(fit)^2)
}

plot(1:10, rss, type = "b", xlab = "Degree", ylab = "RSS", main = "Polynomial Degree vs RSS")

The RSS decreases rapidly from degree 1 to degree 3, indicating a clear improvement in fit, and only marginally for higher degrees.
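
The exercise also asks for plots of the fits themselves. One way to overlay all ten polynomial fits on the data (the colors here are an arbitrary choice):

dis_grid <- seq(min(Boston$dis), max(Boston$dis), length.out = 200)

plot(Boston$dis, Boston$nox, col = "gray", xlab = "dis", ylab = "nox",
     main = "Polynomial Fits of Degree 1 to 10")
for (d in 1:10) {
  fit <- lm(nox ~ poly(dis, d), data = Boston)
  lines(dis_grid, predict(fit, newdata = data.frame(dis = dis_grid)),
        col = rainbow(10)[d], lwd = 1.5)
}
legend("topright", legend = paste("Degree", 1:10),
       col = rainbow(10), lty = 1, cex = 0.7)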

  1. Perform cross-validation or another approach to select the optimal degree for the polynomial, and explain your results.
library(boot)
cv_errors_poly <- numeric(10)

for (d in 1:10) {
  glm_fit <- glm(nox ~ poly(dis, d), data = Boston)
  cv_errors_poly[d] <- cv.glm(Boston, glm_fit, K = 10)$delta[1]
}

plot(1:10, cv_errors_poly, type = "b", xlab = "Degree", ylab = "CV Error", main = "CV Error by Degree")

The cross-validation (CV) error is lowest at degree 3, suggesting that a cubic polynomial provides the best generalization to unseen data. Degrees beyond 4 do not improve performance and risk overfitting.
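
As a quick check, the minimizing degree can be read off programmatically; because no seed is set before cv.glm() here, the exact minimizer may shift slightly between runs:

which.min(cv_errors_poly)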

  1. Use the bs() function to fit a regression spline to predict nox using dis. Report the output for the fit using four degrees of freedom. How did you choose the knots? Plot the resulting fit.
fit_spline4 <- lm(nox ~ bs(dis, df = 4), data = Boston)
summary(fit_spline4)
## 
## Call:
## lm(formula = nox ~ bs(dis, df = 4), data = Boston)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.124622 -0.039259 -0.008514  0.020850  0.193891 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.73447    0.01460  50.306  < 2e-16 ***
## bs(dis, df = 4)1 -0.05810    0.02186  -2.658  0.00812 ** 
## bs(dis, df = 4)2 -0.46356    0.02366 -19.596  < 2e-16 ***
## bs(dis, df = 4)3 -0.19979    0.04311  -4.634 4.58e-06 ***
## bs(dis, df = 4)4 -0.38881    0.04551  -8.544  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06195 on 501 degrees of freedom
## Multiple R-squared:  0.7164, Adjusted R-squared:  0.7142 
## F-statistic: 316.5 on 4 and 501 DF,  p-value: < 2.2e-16
plot(Boston$dis, Boston$nox, col = "gray", main = "Regression Spline (df=4)")
lines(sort(Boston$dis), predict(fit_spline4, newdata = data.frame(dis = sort(Boston$dis))), col = "red", lwd = 2)

Using a spline with 4 degrees of freedom provides a flexible yet interpretable fit between nox and dis. All spline terms are statistically significant (p < 0.01). The plot shows a sharp decline in nox up to a dis of roughly 5 to 6, after which the curve levels off, reflecting higher pollution near employment centers. The model captures the key trend without overfitting. The knots were not chosen by hand: bs() places them automatically at quantiles of the predictor.
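
Since bs() uses cubic basis functions by default, df = 4 corresponds to a single interior knot, which bs() places at the median of dis. This can be verified directly:

attr(bs(Boston$dis, df = 4), "knots")  # interior knot chosen by bs()
median(Boston$dis)                     # the knot sits at the 50th percentile of dis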

  1. Now fit a regression spline for a range of degrees of freedom, and plot the resulting fits and report the resulting RSS. Describe the results obtained.
rss_spline <- numeric(8)
dfs <- 3:10

for (i in 1:length(dfs)) {
  fit <- lm(nox ~ bs(dis, df = dfs[i]), data = Boston)
  rss_spline[i] <- sum(residuals(fit)^2)
}

plot(dfs, rss_spline, type = "b", xlab = "Degrees of Freedom", ylab = "RSS", main = "Spline DF vs RSS")

The RSS decreases sharply from df = 3 to df = 5, then levels off, indicating minimal gains in fit beyond 5 degrees of freedom. This suggests that around 5–6 df captures the key non-linear structure without adding unnecessary complexity.
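
To see how the fitted curves change with flexibility, a sketch overlaying a few of the splines (the df values shown are an arbitrary subset):

dis_grid <- seq(min(Boston$dis), max(Boston$dis), length.out = 200)
df_show <- c(3, 5, 7, 10)
cols <- c("blue", "red", "darkgreen", "purple")

plot(Boston$dis, Boston$nox, col = "gray", xlab = "dis", ylab = "nox",
     main = "Regression Spline Fits for Selected df")
for (i in seq_along(df_show)) {
  fit <- lm(nox ~ bs(dis, df = df_show[i]), data = Boston)
  lines(dis_grid, predict(fit, newdata = data.frame(dis = dis_grid)),
        col = cols[i], lwd = 1.5)
}
legend("topright", legend = paste("df =", df_show), col = cols, lty = 1)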

  1. Perform cross-validation or another approach in order to select the best degrees of freedom for a regression spline on this data. Describe your results.
range_dis <- range(Boston$dis)
cv_errors_spline <- numeric(10)

for (df in 3:10) {
  glm_fit <- glm(nox ~ bs(dis, df = df, Boundary.knots = range_dis), data = Boston)
  cv_errors_spline[df] <- cv.glm(Boston, glm_fit, K = 10)$delta[1]
}

plot(3:10, cv_errors_spline[3:10], type = "b", xlab = "Degrees of Freedom", ylab = "CV Error", main = "Spline CV Error")

Cross-validation suggests that 5 degrees of freedom minimizes the prediction error: df = 3 and 4 give slightly higher CV error, and df > 5 does not improve performance, so df = 5 offers the best balance of flexibility and generalization. (Because no seed is set before cv.glm(), the exact minimizer can shift slightly between runs.)

  1. In Section 7.7, it was mentioned that GAMs are generally fit using a backfitting approach. The idea behind backfitting is actually quite simple. We will now explore backfitting in the context of multiple linear regression. Suppose that we would like to perform multiple linear regression, but we do not have software to do so. Instead, we only have software to perform simple linear regression. Therefore, we take the following iterative approach: we repeatedly hold all but one coefficient estimate fixed at its current value, and update only that coefficient estimate using a simple linear regression. The process is continued until convergence—that is, until the coefficient estimates stop changing. We now try this out on a toy example.
  1. Generate a response Y and two predictors X1 and X2, with n = 100.
set.seed(1)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 + 3 * x2 + rnorm(n)
  1. Initialize β̂1 to take on a value of your choice. It does not matter what value you choose.
beta1 <- 0
  1. Keeping β̂1 fixed, fit the model Y − β̂1X1 = β0 + β2X2 + ϵ. You can do this as follows:
a <- y - beta1 * x1
beta2 <- lm(a ~ x2)$coef[2]
  2. Keeping β̂2 fixed, fit the model Y − β̂2X2 = β0 + β1X1 + ϵ. You can do this as follows:
a <- y - beta2 * x2
beta1 <- lm(a ~ x1)$coef[2]
beta0 <- beta2 <- 0
b0 <- b1 <- b2 <- numeric(1000)

for (i in 1:1000) {
  # Update beta2 keeping beta1 fixed, as per (c)
  a <- y - beta1 * x1
  beta2 <- coef(lm(a ~ x2))[2]
  
  # Update beta1 keeping beta2 fixed, as per (d)
  a <- y - beta2 * x2
  beta1 <- coef(lm(a ~ x1))[2]
  
  # Update beta0 using new beta1 and beta2
  beta0 <- mean(y - beta1 * x1 - beta2 * x2)
  
  b0[i] <- beta0
  b1[i] <- beta1
  b2[i] <- beta2
}
  1. Write a for loop to repeat (c) and (d) 1,000 times. Report the estimates of β̂0, β̂1, and β̂2 at each iteration of the for loop. Create a plot in which each of these values is displayed, with β̂0, β̂1, and β̂2 each shown in a different color.
plot(b1, type = "l", col = "blue", ylim = range(c(b0, b1, b2)),
     ylab = "Coefficient Values", xlab = "Iteration",
     main = "Backfitting Estimates Over Time")
lines(b2, col = "red")
lines(b0, col = "green")
legend("bottomright", legend = c("beta0", "beta1", "beta2"),
       col = c("green", "blue", "red"), lty = 1)

  1. Compare your answer in (e) to the results of simply performing multiple linear regression to predict Y using X1 and X2. Use the abline() function to overlay those multiple linear regression coefficient estimates on the plot obtained in (e).
plot(b1, type = "l", col = "blue", ylim = range(c(b0, b1, b2)),
     ylab = "Coefficient Values", xlab = "Iteration",
     main = "Backfitting Estimates Over Time")
lines(b2, col = "red")
lines(b0, col = "green")
legend("bottomright", legend = c("beta0", "beta1", "beta2"),
       col = c("green", "blue", "red"), lty = 1)

lm_fit <- lm(y ~ x1 + x2)
abline(h = coef(lm_fit)[1], col = "green", lty = 2)  
abline(h = coef(lm_fit)[2], col = "blue", lty = 2)   
abline(h = coef(lm_fit)[3], col = "red", lty = 2)    

  1. On this data set, how many backfitting iterations were required in order to obtain a “good” approximation to the multiple regression coefficient estimates?

A: The dashed horizontal lines from the multiple linear regression model (lm) align closely with the final backfitting values, confirming convergence. All three coefficients (beta0, beta1, beta2) stabilize quickly, within roughly the first 15–30 iterations, suggesting that relatively few iterations are needed for a good approximation.
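
One way to make "good approximation" concrete is to find the first iteration at which all three backfitting estimates fall within a small tolerance of the lm() coefficients (the tolerance below is an arbitrary choice):

coefs_lm <- coef(lm_fit)
tol <- 1e-6
close_enough <- abs(b0 - coefs_lm[1]) < tol &
  abs(b1 - coefs_lm[2]) < tol &
  abs(b2 - coefs_lm[3]) < tol
which(close_enough)[1]  # first iteration within tolerance of the lm() estimates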