6. In this exercise, you will further analyze the Wage data set considered throughout this chapter.
(a) Perform polynomial regression to predict wage using age. Use
cross-validation to select the optimal degree d for the polynomial. What
degree was chosen, and how does this compare to the results of
hypothesis testing using ANOVA? Make a plot of the resulting polynomial
fit to the data.
# Loading necessary libraries
library(ISLR2)
library(tidyverse)
library(boot)
library(MASS)
library(splines)
# Polynomial regression to predict wage using age with cross-validation
set.seed(1)
cv.errors <- sapply(1:10, function(d) {
  glm.fit <- glm(wage ~ poly(age, d), data = Wage)
  cv.glm(Wage, glm.fit, K = 10)$delta[1]
})
optimal.degree <- which.min(cv.errors)
print(paste("Optimal polynomial degree chosen by CV:", optimal.degree))
# Fit the polynomial regression model with the optimal degree
best.poly.fit <- lm(wage ~ poly(age, optimal.degree), data = Wage)
# Plot the polynomial fit
plot(Wage$age, Wage$wage, col = "grey", main = "Polynomial Fit")
lines(sort(Wage$age), predict(best.poly.fit, newdata = list(age = sort(Wage$age))), col = "blue", lwd = 2)
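To address the ANOVA comparison asked for in the question, a minimal sketch (following the chapter's lab, using F-tests on nested polynomial fits) is:
# Compare nested polynomial models of increasing degree with F-tests
fit.1 <- lm(wage ~ age, data = Wage)
fit.2 <- lm(wage ~ poly(age, 2), data = Wage)
fit.3 <- lm(wage ~ poly(age, 3), data = Wage)
fit.4 <- lm(wage ~ poly(age, 4), data = Wage)
fit.5 <- lm(wage ~ poly(age, 5), data = Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5)
The degree chosen by cross-validation can then be compared with the smallest degree beyond which the added polynomial terms are no longer significant in the F-tests; in the chapter's lab, this ANOVA approach points to a cubic or quartic fit being sufficient.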
(b) Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.
set.seed(2)
cv.errors.step <- sapply(2:10, function(cuts) {
  Wage$cut.age <- cut(Wage$age, cuts)
  glm.fit <- glm(wage ~ cut.age, data = Wage)
  cv.glm(Wage, glm.fit, K = 10)$delta[1]
})
# which.min() indexes into 2:10, so add 1 to recover the actual number of cuts
optimal.cuts <- which.min(cv.errors.step) + 1
# Fit step function model with the optimal number of cuts
Wage$cut.age <- cut(Wage$age, optimal.cuts)
step.fit <- lm(wage ~ cut.age, data = Wage)
# Plot step function fit
plot(Wage$age, Wage$wage, col = "grey", main = "Step Function Fit")
lines(sort(Wage$age), predict(step.fit)[order(Wage$age)], col = "red", lwd = 2)
Explanation: This code performs ten-fold cross-validation to determine the optimal number of cuts for a step-function model of wage on age and plots the resulting fit; the CV errors used to make that choice are plotted in the sketch below.
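As a quick visual check of how the number of cuts was chosen, the CV errors computed above can be plotted against the number of cuts (a small sketch reusing cv.errors.step and optimal.cuts from the code above):
# Plot CV error against the number of cuts used in the step function
plot(2:10, cv.errors.step, type = "b", xlab = "Number of Cuts", ylab = "CV Error",
     main = "Cross-Validation Errors for Step Functions")
abline(v = optimal.cuts, col = "red", lty = 2)  # highlight the selected number of cuts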
7. The Wage data set contains a number of other features not explored in this chapter, such as marital status (maritl), job class (jobclass), and others. Explore the relationships between some of these other predictors and wage, and use non-linear fitting techniques in order to fit flexible models to the data. Create plots of the results obtained, and write a summary of your findings.
# Non-linear fitting using categorical predictors (e.g., marital status and jobclass)
marital.fit <- lm(wage ~ maritl + poly(age, 4), data = Wage)
jobclass.fit <- lm(wage ~ jobclass + poly(age, 4), data = Wage)
# Plot results
par(mfrow = c(1, 2))
plot(Wage$maritl, predict(marital.fit), col = "blue", main = "Wage vs Marital Status")
plot(Wage$jobclass, predict(jobclass.fit), col = "green", main = "Wage vs Job Class")
Explanation: The code above fits models that combine a degree-4 polynomial in age with the categorical predictors maritl and jobclass, and visualizes the fitted wages across the levels of each factor. The fitted values suggest that married workers and workers in the Information job class tend to earn more, consistent with the chapter's discussion of the Wage data. A GAM combining these predictors is sketched below.
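One possible extension (a sketch, assuming the gam package is installed) is to combine a smoothing spline in age with the categorical predictors in a single GAM, as in the chapter's lab:
# Fit a GAM with a smoothing spline in age plus marital status and job class
library(gam)
gam.fit <- gam(wage ~ s(age, 4) + maritl + jobclass, data = Wage)
par(mfrow = c(1, 3))
plot(gam.fit, se = TRUE, col = "blue")  # one panel per term
summary(gam.fit)
The summary's ANOVA tables give a rough sense of how much each categorical predictor contributes after accounting for the non-linear effect of age.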
9. This question uses the variables dis (the weighted mean of
distances to five Boston employment centers) and nox (nitrogen oxides
concentration in parts per 10 million) from the Boston data. We will
treat dis as the predictor and nox as the response.
(a) Use the poly() function to fit a cubic polynomial regression to
predict nox using dis. Report the regression output, and plot the
resulting data and polynomial fits.
# Fit cubic polynomial regression
poly.fit <- lm(nox ~ poly(dis, 3), data = Boston)
summary(poly.fit)
##
## Call:
## lm(formula = nox ~ poly(dis, 3), data = Boston)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.121130 -0.040619 -0.009738  0.023385  0.194904
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)    0.554695   0.002759 201.021  < 2e-16 ***
## poly(dis, 3)1 -2.003096   0.062071 -32.271  < 2e-16 ***
## poly(dis, 3)2  0.856330   0.062071  13.796  < 2e-16 ***
## poly(dis, 3)3 -0.318049   0.062071  -5.124 4.27e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06207 on 502 degrees of freedom
## Multiple R-squared: 0.7148, Adjusted R-squared: 0.7131
## F-statistic: 419.3 on 3 and 502 DF, p-value: < 2.2e-16
# Plot cubic polynomial fit
plot(Boston$dis, Boston$nox, col = "grey", main = "Cubic Polynomial Fit")
lines(sort(Boston$dis), predict(poly.fit, newdata = list(dis = sort(Boston$dis))), col = "blue", lwd = 2)
Explanation: This fits a cubic polynomial regression of nox on dis and plots the resulting fit. All three polynomial terms are highly significant, and the fitted curve shows nox decreasing with dis and levelling off at larger distances.
(b) Plot the polynomial fits for a range of different polynomial degrees (say, from 1 to 10), and report the associated residual sum of squares.
rss.values <- sapply(1:10, function(d) {
  poly.fit <- lm(nox ~ poly(dis, d), data = Boston)
  sum(residuals(poly.fit)^2)
})
plot(1:10, rss.values, type = "b", main = "RSS vs Degree", xlab = "Degree", ylab = "RSS")
Explanation: This calculates the residual sum of squares (RSS) for polynomial degrees 1 to 10 and plots them; the training RSS decreases monotonically as the degree increases, since the models are nested. The fitted curves themselves are overlaid in the sketch below.
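The question also asks for plots of the fits themselves; a minimal sketch overlaying the fitted curves for degrees 1 to 10 on the data is:
# Overlay the fitted polynomial curves for degrees 1 to 10
dis.grid <- seq(min(Boston$dis), max(Boston$dis), length.out = 100)
cols <- rainbow(10)
plot(Boston$dis, Boston$nox, col = "grey", xlab = "dis", ylab = "nox",
     main = "Polynomial Fits of Degrees 1 to 10")
for (d in 1:10) {
  fit <- lm(nox ~ poly(dis, d), data = Boston)
  lines(dis.grid, predict(fit, newdata = data.frame(dis = dis.grid)), col = cols[d], lwd = 1.5)
}
legend("topright", legend = paste("Degree", 1:10), col = cols, lwd = 1.5, cex = 0.7)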
(c) Perform cross-validation or another approach to select the optimal degree for the polynomial, and explain your results.
# Perform cross-validation to select the optimal polynomial degree
set.seed(1)
cv.errors <- sapply(1:10, function(d) {
  glm.fit <- glm(nox ~ poly(dis, d), data = Boston)
  cv.glm(Boston, glm.fit, K = 10)$delta[1]
})
optimal.degree <- which.min(cv.errors)
# Print the optimal degree
print(paste("Optimal polynomial degree:", optimal.degree))
## [1] "Optimal polynomial degree: 4"
# Plot CV errors vs Polynomial Degree
plot(1:10, cv.errors, type = "b", xlab = "Degree of Polynomial", ylab = "CV Error",
     main = "Cross-Validation Errors for Polynomial Regression")
abline(v = optimal.degree, col = "red", lty = 2) # Highlight optimal degree
Results: Ten-fold cross-validation selects degree 4 (printed above), the degree that minimizes the estimated prediction error on held-out data. The plot of CV error against degree visualizes the trade-off between model complexity and prediction accuracy: beyond the selected degree, additional flexibility does not reduce the estimated test error.
(d) Use the bs() function to fit a regression spline to predict nox using dis. Report the output for the fit using four degrees of freedom. How did you choose the knots? Plot the resulting fit.
# Fit regression spline with four degrees of freedom
spline.fit <- lm(nox ~ bs(dis, df = 4), data = Boston)
# Plot spline fit
plot(Boston$dis, Boston$nox, col = "grey", main = "Regression Spline Fit")
lines(sort(Boston$dis), predict(spline.fit)[order(Boston$dis)], col = "red", lwd = 2)
Explanation: This fits a cubic regression spline for nox on dis with four degrees of freedom using bs() and visualizes the fit; the choice of knots is discussed below.
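Regarding the choice of knots: with df = 4 and the default cubic degree, bs() uses df − degree = 1 interior knot, placed automatically at a quantile of dis (here the median) rather than at hand-picked locations. A quick check, together with the requested model output:
# Report the fitted spline coefficients and inspect the automatically chosen knot
summary(spline.fit)
attr(bs(Boston$dis, df = 4), "knots")  # single interior knot at the median of dis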
(e) Now fit a regression spline for a range of degrees of freedom, and plot the resulting fits and report the resulting RSS. Describe the results obtained.
# Fit regression splines for a range of degrees of freedom (df) and calculate RSS
rss_values <- sapply(3:10, function(df) {
  spline_fit <- lm(nox ~ bs(dis, df = df), data = Boston)
  sum(residuals(spline_fit)^2)
})
# Plot RSS vs Degrees of Freedom
plot(3:10, rss_values, type = "b", xlab = "Degrees of Freedom (df)",
     ylab = "Residual Sum of Squares (RSS)",
     main = "RSS for Regression Splines with Varying Degrees of Freedom",
     col = "blue", lwd = 2)
Explanation: This code fits regression splines using bs() with degrees of freedom ranging from 3 to 10, calculates the residual sum of squares (RSS) for each fit, and plots RSS against degrees of freedom. As with the polynomials, training RSS generally decreases as the degrees of freedom increase, since more flexible splines follow the data more closely; the fitted curves are overlaid in the sketch below.
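Similarly to the polynomial case, the fits themselves can be overlaid on the data (a small sketch):
# Overlay the regression spline fits for df = 3 to 10
dis.grid <- seq(min(Boston$dis), max(Boston$dis), length.out = 100)
cols <- rainbow(8)
plot(Boston$dis, Boston$nox, col = "grey", xlab = "dis", ylab = "nox",
     main = "Regression Spline Fits, df = 3 to 10")
for (df in 3:10) {
  fit <- lm(nox ~ bs(dis, df = df), data = Boston)
  lines(dis.grid, predict(fit, newdata = data.frame(dis = dis.grid)), col = cols[df - 2], lwd = 1.5)
}
legend("topright", legend = paste("df =", 3:10), col = cols, lwd = 1.5, cex = 0.7)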
(f) Perform cross-validation or another approach in order to select the best degrees of freedom for a regression spline on this data. Describe your results.
# Perform cross-validation to select the best degrees of freedom
set.seed(1)
cv_errors <- sapply(3:10, function(df) {
  spline_fit <- glm(nox ~ bs(dis, df = df), data = Boston)
  cv.glm(Boston, spline_fit, K = 10)$delta[1]
})
## Warning in bs(dis, degree = 3L, ...): some 'x' values beyond boundary knots
## may cause ill-conditioned bases
(This warning appears repeatedly, once for each df value and CV fold, because some held-out dis values fall outside the boundary knots computed from that fold's training data.)
# Map the index of the minimum CV error back to the corresponding df value
optimal_df <- (3:10)[which.min(cv_errors)]
# Plot CV Errors vs Degrees of Freedom
plot(3:10, cv_errors, type = "b", xlab = "Degrees of Freedom (df)",
     ylab = "Cross-Validation Error",
     main = "Cross-Validation Errors for Regression Splines", col = "red", lwd = 2)
abline(v = optimal_df, col = "blue", lty = 2) # Highlight optimal df
Explanation: This code uses cross-validation (cv.glm) to estimate prediction error for regression splines with 3 to 10 degrees of freedom. It identifies the degrees of freedom that minimize the cross-validation error and plots the errors against df; the warnings arise because held-out dis values can lie outside the boundary knots of the training folds. The selected fit is plotted in the sketch below.
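Finally, the CV-selected fit can be visualized directly (a small sketch using optimal_df from above):
# Plot the regression spline with the CV-selected degrees of freedom
best.spline <- lm(nox ~ bs(dis, df = optimal_df), data = Boston)
plot(Boston$dis, Boston$nox, col = "grey", main = "Regression Spline with CV-Selected df")
lines(sort(Boston$dis), predict(best.spline)[order(Boston$dis)], col = "blue", lwd = 2)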
11. In Section 7.7, it was mentioned that GAMs are generally
fit using a backfitting approach. The idea behind backfitting is
actually quite simple. We will now explore backfitting in the context of
multiple linear regression. Suppose that we would like to perform
multiple linear regression, but we do not have software to do so.
Instead, we only have software to perform simple linear regression.
Therefore, we take the following iterative approach: we repeatedly hold
all but one coefficient estimate fixed at its current value, and update
only that coefficient estimate using a simple linear regression. The
process is continued until convergence—that is, until the coefficient
estimates stop changing. We now try this out on a toy example.
(a) Generate a response Y and two predictors X1 and X2, with n
=100.
set.seed(1)
# Generate predictors X1 and X2
n <- 100
X1 <- rnorm(n, mean = 0, sd = sqrt(2))
X2 <- rnorm(n, mean = 0, sd = sqrt(2))
# Generate response Y
epsilon <- rnorm(n, mean = 0, sd = sqrt(1))
Y <- 1.5 + 3 * X1 - 4.5 * X2 + epsilon
(b) Initialize β̂1 to take on a value of your choice. It does not matter what value you choose.
# Initialize beta1 to an arbitrary value
beta1 <- runif(1)
(c) Keeping β̂1 fixed, fit the model Y − β̂1X1 = β0 + β2X2 + ϵ.
# Fit model keeping beta1 fixed
a <- Y - beta1 * X1
beta2 <- lm(a ~ X2)$coef[2]
(d) Keeping β̂2 fixed, fit the model Y − β̂2X2 = β0 + β1X1 + ϵ. You can do this as follows:
> a <- y - beta2 * x2
> beta1 <- lm(a ~ x1)$coef[2]
# Fit model keeping beta2 fixed
a <- Y - beta2 * X2
beta1 <- lm(a ~ X1)$coef[2]
(e) Write a for loop to repeat (c) and (d) 1,000 times. Report the estimates of β̂0, β̂1, and β̂2 at each iteration of the for loop. Create a plot in which each of these values is displayed, with β̂0, β̂1, and β̂2 each shown in a different color.
# Perform backfitting for 1000 iterations
iterations <- matrix(NA, nrow = 1000, ncol = 3)
for (i in seq_len(1000)) {
  a <- Y - beta1 * X1
  beta2 <- lm(a ~ X2)$coef[2]
  a <- Y - beta2 * X2
  beta1 <- lm(a ~ X1)$coef[2]
  beta0 <- mean(Y - beta1 * X1 - beta2 * X2)
  iterations[i, ] <- c(beta0, beta1, beta2)
}
# Plot the convergence of coefficients
matplot(iterations, type = "l", lty = 1,
        col = c("red", "blue", "green"), xlab = "Iteration",
        ylab = "Coefficient Value", main = "Backfitting Coefficients")
legend("topright", legend = c("Beta0", "Beta1", "Beta2"),
       col = c("red", "blue", "green"), lty = 1)
(f) Compare your answer in (e) to the results of simply performing multiple linear regression to predict Y using X1 and X2. Use the abline() function to overlay those multiple linear regression coefficient estimates on the plot obtained in (e).
# Fit multiple linear regression model directly
lm.fit <- lm(Y ~ X1 + X2)
# Re-run the backfitting loop from part (e); note that beta1 and beta2 have
# already converged, so the recorded trajectories are essentially flat
iterations <- matrix(NA, nrow = 1000, ncol = 3)
for (i in seq_len(1000)) {
  a <- Y - beta1 * X1
  beta2 <- lm(a ~ X2)$coef[2]
  a <- Y - beta2 * X2
  beta1 <- lm(a ~ X1)$coef[2]
  beta0 <- mean(Y - beta1 * X1 - beta2 * X2)
  iterations[i, ] <- c(beta0, beta1, beta2)
}
# Plot the convergence of coefficients
matplot(iterations, type = "l", lty = 1,
        col = c("red", "blue", "green"), xlab = "Iteration",
        ylab = "Coefficient Value", main = "Backfitting Coefficients")
legend("topright", legend = c("Beta0", "Beta1", "Beta2"),
       col = c("red", "blue", "green"), lty = 1)
# Overlay multiple regression coefficients using abline()
abline(h = coef(lm.fit)[c("(Intercept)", "X1", "X2")], col = c("red", "blue", "green"), lty = 2)
(g) On this data set, how many backfitting iterations were required in order to obtain a “good” approximation to the multiple regression coefficient estimates?
# Maximum change in each coefficient between successive iterations
max_change <- apply(abs(diff(iterations)), MARGIN = 2, max)
print(max_change)
## [1] 0 0 0
# First iteration at which every coefficient changes by less than a small tolerance
convergence_iteration <- which(apply(abs(diff(iterations)), MARGIN = 1, max) < 0.01)[1]
print(paste("Convergence achieved at iteration:", convergence_iteration))
## [1] "Convergence achieved at iteration: 1"
Explanation: Because the loop in (f) was restarted from coefficient estimates that had already converged in part (e), the iteration-to-iteration changes here are exactly zero, so the tolerance is met at the very first recorded iteration. More generally, the flat trajectories in the plots above suggest that only a very small number of backfitting iterations (one or two on this data set, since X1 and X2 were generated independently) is needed to closely match the multiple regression coefficient estimates.