# Packages: ISLR (Wage data) and boot (cv.glm)
library(ISLR)
library(boot)
set.seed(1)
# 10-fold cross-validation error for polynomial fits of degree 1 to 10
cv.errors <- rep(NA, 10)
for (i in 1:10) {
  glm.fit <- glm(wage ~ poly(age, i), data = Wage)
  cv.errors[i] <- cv.glm(Wage, glm.fit, K = 10)$delta[1]
}
which.min(cv.errors)
## [1] 9
plot(1:10, cv.errors, type = "b", xlab = "Degree", ylab = "CV Error")
fit1 <- lm(wage ~ poly(age, 1), data = Wage)
fit2 <- lm(wage ~ poly(age, 2), data = Wage)
fit3 <- lm(wage ~ poly(age, 3), data = Wage)
fit4 <- lm(wage ~ poly(age, 4), data = Wage)
fit5 <- lm(wage ~ poly(age, 5), data = Wage)
anova(fit1, fit2, fit3, fit4, fit5)
## Analysis of Variance Table
##
## Model 1: wage ~ poly(age, 1)
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.5931 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.8888 0.001679 **
## 4 2995 4771604 1 6070 3.8098 0.051046 .
## 5 2994 4770322 1 1283 0.8050 0.369682
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
best.fit <- lm(wage ~ poly(age, 4), data = Wage)
age.grid <- seq(from = min(Wage$age), to = max(Wage$age), length.out = 100)
preds <- predict(best.fit, newdata = list(age = age.grid), se = TRUE)
plot(Wage$age, Wage$wage, col = "gray", main = "Polynomial Fit (Degree 4)", xlab = "Age", ylab = "Wage")
lines(age.grid, preds$fit, col = "blue", lwd = 2)
lines(age.grid, preds$fit + 2 * preds$se.fit, col = "blue", lty = "dashed")
lines(age.grid, preds$fit - 2 * preds$se.fit, col = "blue", lty = "dashed")
Although the CV error is formally minimized at degree 9, the CV curve flattens out after degree 3–4, so the additional flexibility buys very little. This aligns closely with the ANOVA results, where degrees 1 to 3 are clearly significant and degree 4 is marginal. Together these support using a degree 3 or 4 polynomial, with degree 4 offering a slight improvement in prediction performance at the cost of a marginal increase in complexity.
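As a quick numerical check (a sketch, assuming the cv.errors vector from the loop above is still in the workspace), the degree-4 CV error can be compared with the minimum:
round(cv.errors, 2)
cv.errors[4] - min(cv.errors)   # gap between degree 4 and the CV minimum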
set.seed(1)
# 10-fold CV error for step functions with 2 to 10 intervals of age
k.values <- 2:10
cv.errors.step <- rep(NA, length(k.values))
for (i in k.values) {
  Wage$age.cut <- cut(Wage$age, i)               # cut age into i intervals
  glm.fit <- glm(wage ~ age.cut, data = Wage)
  cv.errors.step[i - 1] <- cv.glm(Wage, glm.fit, K = 10)$delta[1]   # i starts at 2
}
which.min(cv.errors.step) + 1
## [1] 8
plot(k.values, cv.errors.step, type = "b", xlab = "Number of Cuts", ylab = "CV Error")
optimal.cuts <- which.min(cv.errors.step) + 1
# Refit the step function with the CV-chosen number of intervals
Wage$age.cut <- cut(Wage$age, optimal.cuts)
step.fit <- lm(wage ~ age.cut, data = Wage)
plot(Wage$age, Wage$wage, col = "gray", xlab = "Age", ylab = "Wage",
     main = paste("Step Function Fit with", optimal.cuts, "Cuts"))
# Prediction grid; cutting it over the same range reproduces the same intervals (and factor levels)
age.grid <- seq(min(Wage$age), max(Wage$age), length.out = 100)
age.grid.cut <- cut(age.grid, breaks = optimal.cuts)
step.preds <- predict(step.fit, newdata = data.frame(age.cut = age.grid.cut))
lines(age.grid, step.preds, col = "blue", lwd = 2)
Cross-validation identified that the optimal number of cuts is 8, as it produced the lowest estimated prediction error.
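To make the chosen step function concrete, a short sketch (assuming step.fit and the age.cut factor created above) reports the average wage within each of the 8 age intervals:
# Mean wage within each of the 8 CV-chosen age intervals
tapply(Wage$wage, Wage$age.cut, mean)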
# Load the gam package for fitting GAMs with smoothing splines via s()
library(gam)
# Use s() for smoothing terms on continuous variables, include categorical as factors
gam.fit <- gam(wage ~ s(year, 4) + s(age, 5) + education + jobclass + maritl, data = Wage)
# Plot smoothing components
par(mfrow = c(2, 2))
plot(gam.fit, se = TRUE, col = "blue")
# Plot effect of categorical variables
par(mfrow = c(1, 2))
boxplot(wage ~ education, data = Wage, col = "lightgray", main = "Wage vs Education")
boxplot(wage ~ jobclass, data = Wage, col = "lightgray", main = "Wage vs Jobclass")
summary(gam.fit)
##
## Call: gam(formula = wage ~ s(year, 4) + s(age, 5) + education + jobclass +
## maritl, data = Wage)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -108.513 -19.579 -2.729 13.818 213.475
##
## (Dispersion Parameter for gaussian family taken to be 1201.537)
##
## Null Deviance: 5222086 on 2999 degrees of freedom
## Residual Deviance: 3581781 on 2981 degrees of freedom
## AIC: 29808.64
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## s(year, 4) 1 26181 26181 21.790 3.177e-06 ***
## s(age, 5) 1 195384 195384 162.612 < 2.2e-16 ***
## education 4 1091423 272856 227.089 < 2.2e-16 ***
## jobclass 1 12846 12846 10.691 0.001089 **
## maritl 4 105966 26492 22.048 < 2.2e-16 ***
## Residuals 2981 3581781 1202
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## s(year, 4) 3 0.9883 0.3972
## s(age, 5) 4 18.3997 5.995e-15 ***
## education
## jobclass
## maritl
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To explore the relationships between wage and other predictors in the Wage dataset, I fit a Generalized Additive Model (GAM) using smoothing splines for the continuous variables year and age, and included education, jobclass, and maritl as categorical predictors.
Model Summary Findings:
The parametric effects (from the ANOVA for parametric terms) show that education is the most important predictor (F = 227.1, p < 2.2e-16), with highly significant effects on wage. Age and year also contribute significantly (p < 2.2e-16 and p ≈ 3.18e-6, respectively). Job class and marital status are both statistically significant (p = 0.0011 and p < 2.2e-16, respectively), though with smaller effect sizes.
The nonparametric effects reveal that the smooth function of age is highly significant (F = 18.4, p ≈ 6e-15), confirming a nonlinear relationship between age and wage, while the smooth function of year is not statistically significant (F = 0.99, p = 0.397), suggesting that modeling year as a nonlinear term does not improve the fit over a linear one.
Interpretation:
Education has the strongest effect on wages, with higher educational attainment clearly associated with higher income. Age shows a significant nonlinear relationship: wages increase with age up to a point, then level off or decline slightly, a pattern the spline captures well. Year does not show a strong nonlinear trend, indicating wages have not changed dramatically in a nonlinear way across years in this dataset. Job class also matters: workers in information jobs earn more on average than those in industrial jobs. Marital status has some effect, possibly reflecting life-stage earnings differences.
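Because the nonparametric test suggests that the smooth term for year adds little beyond a linear trend, a natural follow-up is to compare the two specifications directly. The sketch below is one way to do this, assuming the gam package and the gam.fit object from above; gam.lin is a hypothetical name for the comparison model.
# Sketch: GAM with a linear year term versus the smooth s(year, 4) term
gam.lin <- gam(wage ~ year + s(age, 5) + education + jobclass + maritl, data = Wage)
anova(gam.lin, gam.fit, test = "F")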
# Boston data (here taken from the MASS package)
library(MASS)
data(Boston)
fit.poly3 <- lm(nox ~ poly(dis, 3), data = Boston)
summary(fit.poly3)
##
## Call:
## lm(formula = nox ~ poly(dis, 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.121130 -0.040619 -0.009738 0.023385 0.194904
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.554695 0.002759 201.021 < 2e-16 ***
## poly(dis, 3)1 -2.003096 0.062071 -32.271 < 2e-16 ***
## poly(dis, 3)2 0.856330 0.062071 13.796 < 2e-16 ***
## poly(dis, 3)3 -0.318049 0.062071 -5.124 4.27e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06207 on 502 degrees of freedom
## Multiple R-squared: 0.7148, Adjusted R-squared: 0.7131
## F-statistic: 419.3 on 3 and 502 DF, p-value: < 2.2e-16
plot(Boston$dis, Boston$nox, col = "darkgrey")
lines(sort(Boston$dis), predict(fit.poly3, newdata = list(dis = sort(Boston$dis))), col = "blue")
Interpretation:
The cubic polynomial provides a strong fit to the data. The significant coefficients and high R² value suggest that dis is a strong predictor of nox, and that the relationship is nonlinear. A visual inspection shows a nonlinear decreasing trend, where nox decreases as dis increases, though not in a straight line.
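For a sense of the uncertainty around the cubic fit, a short sketch (assuming fit.poly3 from above and that the scatterplot from the previous chunk is still the active plot) adds approximate ± 2 standard-error bands:
dis.grid <- seq(min(Boston$dis), max(Boston$dis), length.out = 100)
pr <- predict(fit.poly3, newdata = list(dis = dis.grid), se = TRUE)
lines(dis.grid, pr$fit + 2 * pr$se.fit, col = "blue", lty = "dashed")
lines(dis.grid, pr$fit - 2 * pr$se.fit, col = "blue", lty = "dashed")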
# Training RSS for polynomial fits of degree 1 through 10
rss <- rep(NA, 10)
plot(Boston$dis, Boston$nox, col = "lightgrey")
for (i in 1:10) {
  fit <- lm(nox ~ poly(dis, i), data = Boston)
  rss[i] <- sum(resid(fit)^2)
  lines(sort(Boston$dis), predict(fit, newdata = list(dis = sort(Boston$dis))), col = i)
}
rss
## [1] 2.768563 2.035262 1.934107 1.932981 1.915290 1.878257 1.849484 1.835630
## [9] 1.833331 1.832171
Interpretation:
As the degree increases, the RSS decreases, indicating that the model fits the training data better with higher complexity. However, the marginal improvement in RSS becomes very small beyond degree 3–4. For example, the RSS drops substantially from degree 1 to degree 2 (2.77 -> 2.04) but barely changes between degrees 4 and 10, stabilizing around 1.83. This suggests diminishing returns in model fit with increasing polynomial degree.
Conclusion: Although higher-degree polynomials reduce training error slightly, the improvement beyond degree 3 or 4 is minimal. To avoid overfitting, it’s often better to choose a lower-degree polynomial that captures the main trend without unnecessary complexity.
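The diminishing returns are easier to see when the training RSS is plotted against the polynomial degree (a sketch reusing the rss vector computed above):
plot(1:10, rss, type = "b", xlab = "Degree", ylab = "Training RSS")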
# 10-fold CV to choose the polynomial degree for nox ~ dis
cv.errors.poly <- rep(NA, 10)
set.seed(1)
for (i in 1:10) {
  fit <- glm(nox ~ poly(dis, i), data = Boston)
  cv.errors.poly[i] <- cv.glm(Boston, fit, K = 10)$delta[1]
}
which.min(cv.errors.poly)
## [1] 4
plot(1:10, cv.errors.poly, type = "b", xlab = "Degree", ylab = "CV Error")
Interpretation:
This result aligns well with the RSS analysis in part (9b), which showed diminishing improvements in training error beyond degree 3–4. Selecting degree 4 offers a good trade-off between flexibility and generalization: it captures the nonlinear relationship between dis and nox without overfitting the data as higher-degree polynomials might.
Conclusion: Cross-validation confirms that a degree 4 polynomial balances predictive accuracy and model simplicity, and is a reasonable choice for modeling this relationship.
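To close out this part, one might refit and plot only the CV-chosen degree-4 polynomial (a sketch following the same plotting approach used above; fit.poly4 is a hypothetical object name):
fit.poly4 <- lm(nox ~ poly(dis, 4), data = Boston)
plot(Boston$dis, Boston$nox, col = "darkgrey")
lines(sort(Boston$dis), predict(fit.poly4, newdata = list(dis = sort(Boston$dis))), col = "blue", lwd = 2)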
library(splines)   # bs() for regression spline basis functions
fit.bs4 <- lm(nox ~ bs(dis, df = 4), data = Boston)
summary(fit.bs4)
##
## Call:
## lm(formula = nox ~ bs(dis, df = 4), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.124622 -0.039259 -0.008514 0.020850 0.193891
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.73447 0.01460 50.306 < 2e-16 ***
## bs(dis, df = 4)1 -0.05810 0.02186 -2.658 0.00812 **
## bs(dis, df = 4)2 -0.46356 0.02366 -19.596 < 2e-16 ***
## bs(dis, df = 4)3 -0.19979 0.04311 -4.634 4.58e-06 ***
## bs(dis, df = 4)4 -0.38881 0.04551 -8.544 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06195 on 501 degrees of freedom
## Multiple R-squared: 0.7164, Adjusted R-squared: 0.7142
## F-statistic: 316.5 on 4 and 501 DF, p-value: < 2.2e-16
plot(Boston$dis, Boston$nox, col = "gray")
lines(sort(Boston$dis), predict(fit.bs4, newdata = list(dis = sort(Boston$dis))), col = "blue")
How Knots Were Chosen: Using bs(dis, df = 4) lets R place the internal knots automatically at quantiles of dis; with the default cubic basis, df = 4 leaves a single internal knot at the median of dis. This adapts the spline to the distribution of the predictor without manually specifying knot locations, and this default strategy typically performs well.
Conclusion: The regression spline with 4 degrees of freedom provides a flexible yet smooth fit to the nonlinear relationship between dis and nox. The model performs similarly to the best polynomial model in part 9c, and each basis function contributes significantly to the fit.
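The knot placement can be verified directly, since bs() stores the internal knots as an attribute of the basis matrix (a sketch; with a cubic basis, df = 4 implies one knot at the median of dis):
attr(bs(Boston$dis, df = 4), "knots")   # internal knot location chosen by bs()
median(Boston$dis)                      # should match the 50th percentile of dis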
# Training RSS for regression splines with 3 to 10 degrees of freedom
rss.bs <- rep(NA, 10)
plot(Boston$dis, Boston$nox, col = "lightgrey")
for (i in 3:10) {
  fit <- lm(nox ~ bs(dis, df = i), data = Boston)
  rss.bs[i] <- sum(resid(fit)^2)
  lines(sort(Boston$dis), predict(fit, newdata = list(dis = sort(Boston$dis))), col = i)
}
rss.bs
## [1] NA NA 1.934107 1.922775 1.840173 1.833966 1.829884 1.816995
## [9] 1.825653 1.792535
Interpretation:
The RSS generally decreases as the degrees of freedom increase, indicating a better training fit. However, the rate of improvement slows after df = 6 or 7 (and the RSS even ticks up slightly at df = 9), suggesting diminishing returns from added complexity. The lowest RSS occurs at df = 10, but it is only marginally better than df = 6–8.
Conclusion: Although increasing the degrees of freedom improves the model fit on training data, the gain becomes minimal beyond 6–8 degrees of freedom. Therefore, one might prefer a moderately complex model (e.g., df = 6 or 7) to reduce the risk of overfitting, especially when considering test performance.
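Plotting the training RSS against the degrees of freedom makes the leveling-off visible (a sketch reusing the rss.bs vector from above):
plot(3:10, rss.bs[3:10], type = "b", xlab = "Degrees of Freedom", ylab = "Training RSS")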
# 10-fold CV to choose the spline degrees of freedom
cv.errors.bs <- rep(NA, 10)
set.seed(1)
for (i in 3:10) {
  fit <- glm(nox ~ bs(dis, df = i), data = Boston)
  cv.errors.bs[i] <- cv.glm(Boston, fit, K = 10)$delta[1]
}
which.min(cv.errors.bs)
## [1] 6
plot(3:10, cv.errors.bs[3:10], type = "b", xlab = "Degrees of Freedom", ylab = "CV Error")
Interpretation:
This result aligns with the pattern observed in part (9e), where the residual sum of squares (RSS) dropped significantly up to around df = 6, and then leveled off. Using 6 degrees of freedom allows the model to capture the nonlinear relationship between dis and nox without overfitting. It represents a balance between model flexibility and generalization, making it an appropriate choice based on both cross-validation and training fit.
Conclusion: Cross-validation confirms that a regression spline with 6 degrees of freedom provides the best out-of-sample performance for modeling the relationship between dis and nox, effectively capturing the curvature in the data while maintaining strong generalization.
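A natural final step is to plot the CV-selected spline with df = 6 (a sketch, assuming the splines package and Boston are loaded as above; fit.bs6 is a hypothetical object name):
fit.bs6 <- lm(nox ~ bs(dis, df = 6), data = Boston)
plot(Boston$dis, Boston$nox, col = "gray")
lines(sort(Boston$dis), predict(fit.bs6, newdata = list(dis = sort(Boston$dis))), col = "blue", lwd = 2)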
# Simulate n = 100 observations from a linear model with two predictors
set.seed(1)
n <- 100
X1 <- rnorm(n, mean = 0, sd = sqrt(2))
X2 <- rnorm(n, mean = 1, sd = sqrt(2))
epsilon <- rnorm(n, mean = 0, sd = 1)
Y <- 1.5 + 3 * X1 - 4.5 * X2 + epsilon
# Initial guess for beta1 (beta0 and beta2 are produced by the backfitting updates)
beta1 <- 1
# One manual backfitting pass: update beta2 holding beta1 fixed, then beta1 holding beta2 fixed
a <- Y - beta1 * X1
beta2 <- lm(a ~ X2)$coef[2]
a <- Y - beta2 * X2
beta1 <- lm(a ~ X1)$coef[2]
# Storage for the coefficient estimates across 1000 backfitting iterations
beta0_vals <- beta1_vals <- beta2_vals <- numeric(1000)
for (i in 1:1000) {
  # Holding beta2 fixed, update beta1 by regressing the partial residual on X1
  a <- Y - beta2 * X2
  beta1 <- lm(a ~ X1)$coef[2]
  # Holding beta1 fixed, update beta2 by regressing the partial residual on X2
  a <- Y - beta1 * X1
  beta2 <- lm(a ~ X2)$coef[2]
  # Update the intercept and record the current estimates
  beta0 <- mean(Y - beta1 * X1 - beta2 * X2)
  beta0_vals[i] <- beta0
  beta1_vals[i] <- beta1
  beta2_vals[i] <- beta2
}
plot(1:1000, beta0_vals, type = "l", col = "blue", ylim = range(c(beta0_vals, beta1_vals, beta2_vals)),
     xlab = "Iteration", ylab = "Coefficient Estimate", main = "Backfitting Estimates Over Time")
lines(1:1000, beta1_vals, col = "red")
lines(1:1000, beta2_vals, col = "green")
legend("topright", legend = c("beta0", "beta1", "beta2"), col = c("blue", "red", "green"), lty = 1)
lm.full <- lm(Y ~ X1 + X2)
ols.coefs <- coef(lm.full)
# Re-plot the backfitting estimates
plot(1:1000, beta0_vals, type = "l", col = "blue", ylim = range(c(beta0_vals, beta1_vals, beta2_vals, ols.coefs)),
     xlab = "Iteration", ylab = "Coefficient Estimate", main = "Backfitting vs. OLS Estimates")
lines(1:1000, beta1_vals, col = "red")
lines(1:1000, beta2_vals, col = "green")
# Overlay OLS estimates using abline()
abline(h = ols.coefs[1], col = "blue", lty = 2) # β̂0 (intercept)
abline(h = ols.coefs[2], col = "red", lty = 2) # β̂1
abline(h = ols.coefs[3], col = "green", lty = 2) # β̂2
Backfitting recovers the same coefficients as multiple linear regression.
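A quick numerical check of that claim (a sketch comparing the final backfitting iterates to the OLS coefficients computed above):
rbind(backfit = c(beta0_vals[1000], beta1_vals[1000], beta2_vals[1000]),
      OLS = ols.coefs)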
# Numerical convergence check: stop at the first iteration where every coefficient
# changes by less than the threshold relative to the previous iteration
threshold <- 1e-6
converged_at <- NA
for (i in 2:1000) {
  delta0 <- abs(beta0_vals[i] - beta0_vals[i - 1])
  delta1 <- abs(beta1_vals[i] - beta1_vals[i - 1])
  delta2 <- abs(beta2_vals[i] - beta2_vals[i - 1])
  if (max(delta0, delta1, delta2) < threshold) {
    converged_at <- i
    break
  }
}
converged_at
## [1] 3
To determine how many backfitting iterations were required to obtain a “good” approximation to the multiple regression coefficient estimates, I used a numerical convergence check. Specifically, I computed the largest absolute change across the three coefficients between successive iterations and defined convergence as the first iteration at which this change fell below a threshold of 1×10^−6.
Based on this approach, the algorithm converged by iteration 3.
This is consistent with the visual results in part (11f), where the backfitting coefficient paths flattened out very early and closely aligned with the OLS estimates.
Conclusion: Only 3 iterations were needed for the backfitting estimates to converge to a stable solution that matches the multiple linear regression coefficients. This demonstrates that backfitting can be extremely efficient when the predictors are not highly correlated and the model is well specified.
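As a sanity check on the low-correlation point, one can inspect the sample correlation of the simulated predictors (a sketch using the X1 and X2 generated above):
cor(X1, X2)   # should be close to zero, since X1 and X2 were drawn independently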