# Packages used in this section (assumed loaded earlier): ISLR (Wage), MASS or ISLR2 (Boston),
# boot (cv.glm), splines (ns, bs), gam, ggplot2
# For consistent results
set.seed(2)
# Initialize a vector to store cross-validation errors for degrees 1 to 10
cv.error.10 = rep(0, 10)
for (i in 1:10) {
  poly.fit = glm(wage ~ poly(age, i), data = Wage)
  cv.error.10[i] = cv.glm(Wage, poly.fit, K = 10)$delta[1]
}
# Find the degree with the lowest CV error
degree = which.min(cv.error.10)
cat("Optimal degree chosen by cross-validation:", degree, "\n")
## Optimal degree chosen by cross-validation: 5
We now fit several polynomial models and compare them using ANOVA to see if the improvement with higher degrees is statistically significant.
# Fit polynomial models of increasing degree
fit.1 = lm(wage ~ poly(age, 1), data = Wage)
fit.2 = lm(wage ~ poly(age, 2), data = Wage)
fit.3 = lm(wage ~ poly(age, 3), data = Wage)
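# Note: 'degree' was set to 5 by cross-validation above, so this model coincides with fit.5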
poly.fit = lm(wage ~ poly(age, degree), data = Wage)
fit.5 = lm(wage ~ poly(age, 5), data = Wage)
# Compare the models using ANOVA
anova_result = anova(fit.1, fit.2, fit.3, poly.fit, fit.5, test = "F")
print(anova_result)
## Analysis of Variance Table
##
## Model 1: wage ~ poly(age, 1)
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, degree)
## Model 5: wage ~ poly(age, 5)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.5931 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.8888 0.001679 **
## 4 2994 4770322 2 7353 2.3074 0.099697 .
## 5 2994 4770322 0 0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now, we visualize the polynomial fit using the optimal degree selected by cross-validation.
# Fit the final model using the selected degree
final_model <- lm(wage ~ poly(age, degree = degree), data = Wage)
# Create a grid of age values to predict wages
age_grid <- seq(from = min(Wage$age), to = max(Wage$age), length.out = 100)
preds <- predict(final_model, newdata = data.frame(age = age_grid), se = TRUE)
se_bands <- cbind(preds$fit + 2 * preds$se.fit, preds$fit - 2 * preds$se.fit)
# Plot the observed wages and the fitted curve with confidence bands
plot(Wage$age, Wage$wage, cex = 0.5, col = "lightgreen",
main = paste0("Degree-", degree),
xlab = "Age", ylab = "Wage")
lines(age_grid, preds$fit, lwd = 2, col = "blue")
matlines(age_grid, se_bands, lwd = 1, col = "blue", lty = 3)
Insights
Cross-validation picked degree 5 as the best degree for prediction, minimizing the estimated test error.
ANOVA results tell a more nuanced story:
There is a highly significant jump in explanatory power from the linear to the quadratic model.
Adding a cubic term also gives a statistically significant improvement.
Beyond that, the joint contribution of the degree-4 and degree-5 terms is only marginally significant (p ≈ 0.1). (Because the CV-selected degree is 5, Model 4 and Model 5 in the table are identical, which is why the final row has zero degrees of freedom.)
From the plot:
Wages tend to rise from the early 20s to around age 50.
There is a clear peak in midlife, with a slight decline toward later years.
The higher-degree polynomial adds some curvature at the tails, capturing subtle trends.
Takeaway:
Cross-validation prioritizes predictive performance (favoring degree 5), while ANOVA focuses on whether the added complexity is statistically justified. A cubic model (degree 3) might strike a good balance between fit and interpretability (see the sketch below).
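As a quick visual check (a minimal sketch reusing age_grid, preds, and fit.3 from above), we can overlay the degree-3 fit on the CV-selected degree-5 fit:
# Overlay the cubic fit (red, dashed) on the degree-5 fit (blue) for comparison
preds3 <- predict(fit.3, newdata = data.frame(age = age_grid))
plot(Wage$age, Wage$wage, cex = 0.5, col = "lightgreen",
     xlab = "Age", ylab = "Wage", main = "Degree-3 vs Degree-5 Fit")
lines(age_grid, preds$fit, lwd = 2, col = "blue")
lines(age_grid, preds3, lwd = 2, col = "red", lty = 2)
legend("topright", legend = c("Degree 5", "Degree 3"),
       col = c("blue", "red"), lty = c(1, 2), lwd = 2)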
# Try step functions with 2 to 11 intervals
cv.error.10 = rep(0, 10)
for (i in 1:10) {
  Wage$age.cut = cut(Wage$age, i + 1)
  glm.fit = glm(wage ~ age.cut, data = Wage)
  cv.error.10[i] = cv.glm(Wage, glm.fit, K = 10)$delta[1]
}
# Best number of cuts based on CV
number = which.min(cv.error.10)
cat("Optimal number of cuts chosen by cross-validation:", number, "\n")
## Optimal number of cuts chosen by cross-validation: 7
# Fit the final step function model
step.fit = glm(wage ~ cut(age, number + 1), data = Wage)
# Generate prediction data
grid = seq(from = min(Wage$age), to = max(Wage$age), length.out = 100)
preds = predict(step.fit, newdata = data.frame(age = grid), se = TRUE)
se.bands = cbind(preds$fit + 2 * preds$se.fit, preds$fit - 2 * preds$se.fit)
# Plot the step function fit
plot(Wage$age, Wage$wage, cex = 0.5, col = "lightgreen",
xlab = "Age", ylab = "Wage",
main = paste0("Step Function with ", number, " Cuts"))
lines(grid, preds$fit, lwd = 2, col = "blue")
matlines(grid, se.bands, lwd = 1, col = "blue", lty = 3)
# Add cut boundaries as vertical lines (cut() uses equal-width breaks over the age range,
# so compute those breaks directly; cut() pads the range slightly, making this approximate)
cuts = seq(min(Wage$age), max(Wage$age), length.out = number + 2)[-c(1, number + 2)]
abline(v = cuts, lty = 2, col = "lightgreen")
Interpretation
Cross-validation selected 7 cuts, dividing age into 8 intervals.
The step function shows:
Lower wages for younger workers (under age 30)
A noticeable jump in wages around 30
Wages remain fairly stable through midlife (40s to 60s)
A drop-off in wages as workers approach retirement age (65+)
The step function creates discrete wage levels for different age ranges, which provides an intuitive interpretation of how wages change across career stages.
The blue dashed lines represent confidence intervals, which widen at the extremes of the age range where data is sparser.
The vertical dashed lines mark the 7 cut points that divide the age range into 8 segments.
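To make those intervals concrete, a quick check (using the same cut() call as the fitted model) lists the eight age bins and the number of workers in each:
# Show the equal-width age bins used by the step function and their counts
table(cut(Wage$age, number + 1))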
We’ll start by fitting three GAM models:
A base model using a natural spline for age.
A model adding marital status.
A model including both marital status and job class.
# Fit GAM models with increasing complexity
gam.fit1 <- gam(wage ~ ns(age, 4), data = Wage)
gam.fit2 <- gam(wage ~ ns(age, 4) + maritl, data = Wage)
gam.fit3 <- gam(wage ~ ns(age, 4) + maritl + jobclass, data = Wage)
# Compare models using ANOVA
anova(gam.fit1, gam.fit2, gam.fit3)
## Analysis of Deviance Table
##
## Model 1: wage ~ ns(age, 4)
## Model 2: wage ~ ns(age, 4) + maritl
## Model 3: wage ~ ns(age, 4) + maritl + jobclass
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 2995 4773109
## 2 2991 4650697 4 122412 < 2.2e-16 ***
## 3 2990 4479913 1 170784 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ggplot(Wage, aes(x = maritl, y = wage)) +
geom_boxplot(fill = "blue") +
labs(title = "Wage by Marital Status", x = "Marital Status", y = "Wage")
ggplot(Wage, aes(x = jobclass, y = wage)) +
geom_boxplot(fill = "skyblue") +
labs(title = "Wage by Job Class", x = "Job Class", y = "Wage")
ggplot(Wage, aes(x = age, y = wage)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "loess", se = FALSE, color = "red") +
labs(title = "Non-linear Relationship between Age and Wage")
## `geom_smooth()` using formula = 'y ~ x'
Summary of Findings
Using flexible modeling methods like splines and GAMs helped uncover richer insights into the relationship between age, marital status, job class, and wages.
Key Takeaways from the GAM Models
Adding maritl and jobclass significantly improved the model. Each addition reduced the residual deviance:
Age only: 4,773,109
Age + Marital Status: 4,650,697
Age + Marital Status + Job Class: 4,479,913
These drops in deviance, coupled with very small p-values (p < 0.001), suggest that both marital status and job class are important predictors of wage.
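These reductions can be read directly from the fitted objects (a small check reusing gam.fit1 through gam.fit3):
# Deviance reduction attributable to each added predictor
c(maritl = deviance(gam.fit1) - deviance(gam.fit2),
  jobclass = deviance(gam.fit2) - deviance(gam.fit3))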
Insights from the Data Visualizations
Marital Status:
Married individuals tend to earn more on average.
Never Married, Divorced, and Separated individuals generally earn less, with fewer high-earning outliers.
Job Class:
Information sector jobs have a noticeably higher median wage compared to industrial jobs.
There’s also more wage variability in the Information group, with a wider range of high earners.
Age:
The relationship between age and wage is clearly non-linear.
Wages rise through early adulthood, plateau in midlife, and slightly decline later in life.
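A compact way to view all three fitted components of the final model at once is the gam package's plot method (as used in the ISLR labs):
# Plot each fitted term of the final GAM with standard-error bands
par(mfrow = c(1, 3))
plot(gam.fit3, se = TRUE, col = "blue")
par(mfrow = c(1, 1))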
cubic_model <- lm(nox ~ poly(dis, 3), data = Boston)
summary(cubic_model)
##
## Call:
## lm(formula = nox ~ poly(dis, 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.121130 -0.040619 -0.009738 0.023385 0.194904
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.554695 0.002759 201.021 < 2e-16 ***
## poly(dis, 3)1 -2.003096 0.062071 -32.271 < 2e-16 ***
## poly(dis, 3)2 0.856330 0.062071 13.796 < 2e-16 ***
## poly(dis, 3)3 -0.318049 0.062071 -5.124 4.27e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06207 on 502 degrees of freedom
## Multiple R-squared: 0.7148, Adjusted R-squared: 0.7131
## F-statistic: 419.3 on 3 and 502 DF, p-value: < 2.2e-16
# Create a scatter plot with the polynomial fit
plot(Boston$dis, Boston$nox, pch = 20, cex = 0.7, col = "darkgray",
xlab = "Distance to Employment Centers (dis)",
ylab = "Nitrogen Oxides Concentration (nox)",
main = "Cubic Polynomial Regression")
# Create a grid of dis values for smooth prediction line
dis_grid <- seq(min(Boston$dis), max(Boston$dis), length.out = 100)
# Generate predictions
preds <- predict(cubic_model, newdata = data.frame(dis = dis_grid), se = TRUE)
# Add the cubic polynomial fit line
lines(dis_grid, preds$fit, col = "blue", lwd = 2)
# Add confidence bands
lines(dis_grid, preds$fit + 2 * preds$se.fit, col = "blue", lty = 2)
lines(dis_grid, preds$fit - 2 * preds$se.fit, col = "blue", lty = 2)
# Add a legend
legend("topright", legend = c("Data", "Cubic Polynomial Fit", "95% Confidence Interval"),
pch = c(20, NA, NA), lty = c(NA, 1, 2), col = c("darkgray", "blue", "blue"))
We then visualized this fit alongside the actual data:
A scatterplot shows how nox changes with dis.
A smooth blue line shows the predicted values from our cubic model.
Dashed lines represent the 95% confidence interval for the predictions.
Key Takeaways: The model is highly significant, with an adjusted R² of 0.713, meaning it explains about 71% of the variability in NOx levels.
Each term (linear, quadratic, cubic) is statistically significant.
The plot shows that:
Pollution is highest in areas closest to employment centers.
NOx drops sharply as distance increases from 1 to 6 units.
Beyond 6 units, the trend flattens out, and levels stabilize.
There’s a tiny uptick beyond 10 units, but the wide confidence bands suggest uncertainty.
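To put rough numbers on that pattern, we can query the fitted cubic at a few illustrative distances (a quick check; the chosen values are arbitrary):
# Predicted nox at a few illustrative distances
predict(cubic_model, newdata = data.frame(dis = c(2, 6, 10)))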
# Create a function to fit polynomials and calculate RSS
calculate_rss <- function(degree) {
  model <- lm(nox ~ poly(dis, degree), data = Boston)
  return(sum(model$residuals^2))
}
# Calculate RSS for degrees 1 through 10
rss_values <- sapply(1:10, calculate_rss)
# Create a data frame for reporting
rss_df <- data.frame(
Degree = 1:10,
RSS = rss_values
)
# Plot the RSS by degree
plot(1:10, rss_values, type = "b", pch = 19,
xlab = "Polynomial Degree", ylab = "Residual Sum of Squares",
main = "RSS by Polynomial Degree")
# Plot the polynomial fits on a single graph
plot(Boston$dis, Boston$nox, pch = 20, cex = 0.7, col = "darkgray",
xlab = "Distance to Employment Centers (dis)",
ylab = "Nitrogen Oxides Concentration (nox)",
main = "Polynomial Fits of Various Degrees")
# Generate and plot fits for each degree
colors <- rainbow(10)  # one color per degree (1 to 10)
legend_text <- character(10)
for (i in 1:10) {
  # Fit model
  model <- lm(nox ~ poly(dis, i), data = Boston)
  # Create prediction grid
  dis_grid <- seq(min(Boston$dis), max(Boston$dis), length.out = 100)
  preds <- predict(model, newdata = data.frame(dis = dis_grid))
  # Add the polynomial fit line
  lines(dis_grid, preds, col = colors[i], lwd = 1.5)
  # Create legend text
  legend_text[i] <- paste("Degree", i)
}
# Add a legend
legend("topright", legend = legend_text, col = colors, lwd = 1.5, cex = 0.7)
# Print the RSS values
print(rss_df)
## Degree RSS
## 1 1 2.768563
## 2 2 2.035262
## 3 3 1.934107
## 4 4 1.932981
## 5 5 1.915290
## 6 6 1.878257
## 7 7 1.849484
## 8 8 1.835630
## 9 9 1.833331
## 10 10 1.832171
Observations:
The biggest improvement comes from moving from a linear to a quadratic model.
After degree 3, the benefits start to plateau.
The cubic model seems to hit the sweet spot—adding more complexity after that barely helps and might overfit the data.
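The successive reductions in RSS make this plateau explicit (a quick check on the rss_values computed above):
# Drop in RSS from each additional polynomial degree
round(-diff(rss_values), 4)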
We used 10-fold cross-validation to determine the optimal polynomial degree:
# Set seed for reproducibility
set.seed(1)
# Function to calculate CV error for each polynomial degree
cv_error <- function(degree) {
  glm_fit <- glm(nox ~ poly(dis, degree), data = Boston)
  cv_result <- cv.glm(Boston, glm_fit, K = 10)
  return(cv_result$delta[1]) # Return cross-validation error
}
# Calculate CV errors for degrees 1-10
cv_errors <- sapply(1:10, cv_error)
# Create a data frame of results
cv_df <- data.frame(
Degree = 1:10,
CV_MSE = cv_errors
)
# Print results
print(cv_df)
## Degree CV_MSE
## 1 1 0.005558263
## 2 2 0.004085706
## 3 3 0.003876521
## 4 4 0.003863342
## 5 5 0.004237452
## 6 6 0.005686862
## 7 7 0.010278897
## 8 8 0.006810868
## 9 9 0.033308607
## 10 10 0.004075599
# Plot CV errors
plot(1:10, cv_errors, type = "b", pch = 19, col = "skyblue",
xlab = "Polynomial Degree", ylab = "Cross-Validation MSE",
main = "Cross-Validation Error by Polynomial Degree")
best_degree <- which.min(cv_errors)
points(best_degree, cv_errors[best_degree], col = "green", cex = 2, pch = 19)
Results: Degree 4 had the lowest error, but degree 3 was nearly identical—only a 0.000013 difference!
The error drops sharply from degree 1 to 3, then levels out.
Degree 3 is likely the best choice, balancing accuracy and simplicity.
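That gap can be read directly from cv_errors (a quick check):
# Difference between the degree-3 and degree-4 CV errors
cv_errors[3] - cv_errors[4]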
# Fit a regression spline with 4 degrees of freedom
spline_model <- lm(nox ~ bs(dis, df = 4), data = Boston)
# Get the summary of the regression
summary(spline_model)
##
## Call:
## lm(formula = nox ~ bs(dis, df = 4), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.124622 -0.039259 -0.008514 0.020850 0.193891
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.73447 0.01460 50.306 < 2e-16 ***
## bs(dis, df = 4)1 -0.05810 0.02186 -2.658 0.00812 **
## bs(dis, df = 4)2 -0.46356 0.02366 -19.596 < 2e-16 ***
## bs(dis, df = 4)3 -0.19979 0.04311 -4.634 4.58e-06 ***
## bs(dis, df = 4)4 -0.38881 0.04551 -8.544 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06195 on 501 degrees of freedom
## Multiple R-squared: 0.7164, Adjusted R-squared: 0.7142
## F-statistic: 316.5 on 4 and 501 DF, p-value: < 2.2e-16
# Plot the resulting fit
plot(Boston$dis, Boston$nox, pch = 20, cex = 0.7, col = "darkgray",
xlab = "Distance to Employment Centers (dis)",
ylab = "Nitrogen Oxides Concentration (nox)",
main = "Regression Spline with 4 Degrees of Freedom")
# Create a grid for smooth prediction line
dis_grid <- seq(min(Boston$dis), max(Boston$dis), length.out = 100)
# Generate predictions
preds <- predict(spline_model, newdata = data.frame(dis = dis_grid), se = TRUE)
# Add the spline fit
lines(dis_grid, preds$fit, col = "blue", lwd = 2)
# Add confidence bands
lines(dis_grid, preds$fit + 2 * preds$se.fit, col = "blue", lty = 2)
lines(dis_grid, preds$fit - 2 * preds$se.fit, col = "blue", lty = 2)
# Show knot location
knots <- attr(bs(Boston$dis, df = 4), "knots")
abline(v = knots, lty = 2, col = "red")
# Add a legend
legend("topright", legend = c("Data", "Spline Fit", "95% CI", "Knot"),
pch = c(20, NA, NA, NA), lty = c(NA, 1, 2, 2),
col = c("darkgray", "blue", "blue", "red"))
We let R choose the knot locations automatically: with df = 4 and a cubic basis, bs() places df - 3 = 1 interior knot at the median of dis (the red dashed line in the plot). This gives a flexible fit without manually placing any knots; the quick check below confirms where the knot landed.
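# The single interior knot chosen by bs() with df = 4, and the median of dis
attr(bs(Boston$dis, df = 4), "knots")
median(Boston$dis)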
Highlights: All spline terms are statistically significant.
Slightly better R² (0.716) than the cubic polynomial.
Residual error is very similar to what we saw before.
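A side-by-side comparison of the two fits (reusing cubic_model and spline_model from above):
# Compare adjusted R-squared and residual standard error
c(cubic = summary(cubic_model)$adj.r.squared,
  spline = summary(spline_model)$adj.r.squared)
c(cubic = summary(cubic_model)$sigma,
  spline = summary(spline_model)$sigma)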
# Function to fit spline model and calculate RSS
fit_spline_and_get_rss <- function(df) {
  model <- lm(nox ~ bs(dis, df = df), data = Boston)
  return(sum(model$residuals^2))
}
# Calculate RSS for degrees of freedom 3 through 10
df_values <- 3:10
rss_values <- sapply(df_values, fit_spline_and_get_rss)
# Create a data frame of the results
rss_df <- data.frame(DF = df_values, RSS = rss_values)
# Print results
print(rss_df)
## DF RSS
## 1 3 1.934107
## 2 4 1.922775
## 3 5 1.840173
## 4 6 1.833966
## 5 7 1.829884
## 6 8 1.816995
## 7 9 1.825653
## 8 10 1.792535
# Plot RSS by degrees of freedom
plot(df_values, rss_values, type = "b", pch = 19,
xlab = "Degrees of Freedom", ylab = "Residual Sum of Squares",
main = "RSS by Degrees of Freedom (Regression Spline)")
# Plot the spline fits
plot(Boston$dis, Boston$nox, pch = 20, cex = 0.7, col = "lightgreen",
xlab = "Distance to Employment Centers (dis)",
ylab = "Nitrogen Oxides Concentration (nox)",
main = "Regression Spline Fits with Various Degrees of Freedom")
# Create prediction grid
dis_grid <- seq(min(Boston$dis), max(Boston$dis), length.out = 100)
# Generate and plot fits for each degree of freedom
colors <- rainbow(length(df_values))
for (i in 1:length(df_values)) {
  df <- df_values[i]
  model <- lm(nox ~ bs(dis, df = df), data = Boston)
  preds <- predict(model, newdata = data.frame(dis = dis_grid))
  lines(dis_grid, preds, col = colors[i], lwd = 1.5)
}
# Add a legend
legend("topright", legend = paste("df =", df_values),
col = colors, lwd = 1.5, cex = 0.7)
What We Learned: The biggest improvement happened from DF = 4 to DF = 5.
After that, improvements are smaller.
RSS increased slightly at DF = 9. This is possible because splines with different degrees of freedom use different knot locations, so the models are not nested and training RSS need not decrease monotonically (the check below shows how the knots shift).
The lowest RSS came at DF = 10, but again, there’s a risk of overfitting.
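The interior knots chosen by bs() for neighboring df values show how the knot set shifts (a quick look using the same defaults as above):
# Interior knot locations for df = 8 versus df = 9
attr(bs(Boston$dis, df = 8), "knots")
attr(bs(Boston$dis, df = 9), "knots")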
# Set seed for reproducibility
set.seed(1)
# Function to compute CV MSE for a given df
cv_error_spline <- function(df) {
  # Create formula with bs() function
  form <- as.formula(paste("nox ~ bs(dis, df =", df, ")"))
  # Fit the model
  model <- glm(form, data = Boston)
  # Perform 10-fold cross-validation
  cv_result <- cv.glm(Boston, model, K = 10)
  # Return the cross-validation error
  return(cv_result$delta[1])
}
# Calculate CV error for degrees of freedom 3 through 10
df_values <- 3:10
cv_errors <- sapply(df_values, cv_error_spline)
# Create a data frame of results
cv_df <- data.frame(DF = df_values, CV_MSE = cv_errors)
print(cv_df)
## DF CV_MSE
## 1 3 0.003897402
## 2 4 0.003890178
## 3 5 0.003724890
## 4 6 0.003697010
## 5 7 0.003740823
## 6 8 0.003709286
## 7 9 0.003727194
## 8 10 0.003715941
# Find the optimal degrees of freedom
best_df <- df_values[which.min(cv_errors)]
cat("Optimal degrees of freedom:", best_df, "\n")
## Optimal degrees of freedom: 6
# Plot CV errors
plot(df_values, cv_errors, type = "b", pch = 19, col = "skyblue",
xlab = "Degrees of Freedom", ylab = "Cross-Validation MSE",
main = "CV Error by Degrees of Freedom (Regression Spline)")
points(best_df, cv_errors[which.min(cv_errors)], col = "lightgreen", cex = 2, pch = 19)
Final Verdict:
DF = 6 gave the lowest cross-validation error.
Errors steadily declined from DF = 3 to DF = 6, but barely changed after that.
So, a spline with 6 degrees of freedom seems like the best option—great fit, not too complex.
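As a wrap-up, a minimal sketch (reusing best_df and the plotting approach from earlier) refits and plots the CV-selected spline:
# Refit the CV-selected spline (df = 6) and plot it with approximate 95% bands
final_spline <- lm(nox ~ bs(dis, df = best_df), data = Boston)
dis_grid <- seq(min(Boston$dis), max(Boston$dis), length.out = 100)
preds <- predict(final_spline, newdata = data.frame(dis = dis_grid), se = TRUE)
plot(Boston$dis, Boston$nox, pch = 20, cex = 0.7, col = "darkgray",
     xlab = "Distance to Employment Centers (dis)",
     ylab = "Nitrogen Oxides Concentration (nox)",
     main = paste0("Regression Spline with df = ", best_df))
lines(dis_grid, preds$fit, col = "blue", lwd = 2)
lines(dis_grid, preds$fit + 2 * preds$se.fit, col = "blue", lty = 2)
lines(dis_grid, preds$fit - 2 * preds$se.fit, col = "blue", lty = 2)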
# Set a seed for reproducibility
set.seed(1)
# Generate predictors
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
# Generate response with true coefficients beta0 = 3, beta1 = 2, beta2 = 1
beta0_true <- 3
beta1_true <- 2
beta2_true <- 1
y <- beta0_true + beta1_true * x1 + beta2_true * x2 + rnorm(n, sd = 0.5)
beta1 <- 0 # Arbitrary initial value
print(beta1)
## [1] 0
a <- y - beta1 * x1
beta2 <- lm(a ~ x2)$coef[2]
print(beta2)
## x2
## 0.971392
a <- y - beta2 * x2
beta1 <- lm(a ~ x1)$coef[2]
print(beta1)
## x1
## 2.010553
set.seed(123)
n <- 100
x1 <- rnorm(n, mean = 0, sd = sqrt(2))
x2 <- rnorm(n, mean = 1, sd = sqrt(2))
beta0 <- 1.5
beta1 <- 3
beta2 <- -4.5
epsilon <- rnorm(n, mean = 0, sd = 1)
y <- beta0 + beta1 * x1 + beta2 * x2 + epsilon
# Initialize parameters
beta0_current <- 0
beta1_current <- 0
beta2_current <- 0
# Set up storage for coefficient estimates
n_iter <- 1000
beta0_history <- numeric(n_iter)
beta1_history <- numeric(n_iter)
beta2_history <- numeric(n_iter)
# Backfitting algorithm
for (i in 1:n_iter) {
  # Step (c): Update beta2 with beta1 fixed
  a <- y - beta1_current * x1
  lm_result <- lm(a ~ x2)
  beta0_current <- lm_result$coef[1]
  beta2_current <- lm_result$coef[2]
  # Step (d): Update beta1 with beta2 fixed
  a <- y - beta2_current * x2
  lm_result <- lm(a ~ x1)
  beta0_current <- lm_result$coef[1]
  beta1_current <- lm_result$coef[2]
  # Store the current estimates
  beta0_history[i] <- beta0_current
  beta1_history[i] <- beta1_current
  beta2_history[i] <- beta2_current
}
# Print selected iterations as requested
print_iterations <- c(1:5, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500,
                      550, 600, 650, 700, 750, 800, 850, 900, 950, 1000)
for (i in print_iterations) {
  cat("Iteration", i, ": beta0 =", beta0_history[i],
      "beta1 =", beta1_history[i],
      "beta2 =", beta2_history[i], "\n")
}
## Iteration 1 : beta0 = 1.734344 beta1 = 2.898704 beta2 = -4.619032
## Iteration 2 : beta0 = 1.618513 beta1 = 2.905816 beta2 = -4.483496
## Iteration 3 : beta0 = 1.618229 beta1 = 2.905833 beta2 = -4.483164
## Iteration 4 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 5 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 50 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 100 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 150 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 200 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 250 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 300 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 350 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 400 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 450 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 500 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 550 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 600 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 650 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 700 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 750 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 800 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 850 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 900 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 950 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 1000 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
# Plot the coefficient histories
plot(1:n_iter, beta0_history, type = "l", col = "red",
xlab = "Iteration", ylab = "Coefficient Estimate",
main = "Coefficient Estimates by Iteration",
ylim = range(c(beta0_history, beta1_history, beta2_history)))
lines(1:n_iter, beta1_history, col = "skyblue")
lines(1:n_iter, beta2_history, col = "lightgreen")
legend("topright", legend = c("beta0", "beta1", "beta2"),
col = c("red", "skyblue", "lightgreen"), lty = 1)
# Perform direct multiple linear regression
direct_lm <- lm(y ~ x1 + x2)
direct_beta0 <- coef(direct_lm)[1]
direct_beta1 <- coef(direct_lm)[2]
direct_beta2 <- coef(direct_lm)[3]
cat("\nDirect multiple regression estimates:\n")
##
## Direct multiple regression estimates:
cat("beta0 =", direct_beta0, "\n")
## beta0 = 1.618228
cat("beta1 =", direct_beta1, "\n")
## beta1 = 2.905834
cat("beta2 =", direct_beta2, "\n")
## beta2 = -4.483163
# Plot with clear legend and legible colors
plot(1:n_iter, beta0_history, type = "l", col = "darkred", lwd = 2,
xlab = "Iteration", ylab = "Coefficient Estimate",
main = "Backfitting Algorithm Convergence",
ylim = range(c(beta0_history, beta1_history, beta2_history)))
lines(1:n_iter, beta1_history, col = "darkblue", lwd = 2)
lines(1:n_iter, beta2_history, col = "darkgreen", lwd = 2)
# Overlay direct regression estimates
abline(h = direct_beta0, col = "red", lty = 2, lwd = 1.5)
abline(h = direct_beta1, col = "skyblue", lty = 2, lwd = 1.5)
abline(h = direct_beta2, col = "lightgreen", lty = 2, lwd = 1.5)
legend("right",
legend = c("beta0 (backfitting)", "beta1 (backfitting)", "beta2 (backfitting)",
"beta0 (direct)", "beta1 (direct)", "beta2 (direct)"),
col = c("darkred", "darkblue", "darkgreen", "red", "skyblue", "lightgreen"),
lty = c(1, 1, 1, 2, 2, 2),
lwd = c(2, 2, 2, 1.5, 1.5, 1.5),
cex = 0.8)
# Zoomed in plot for early iterations
plot(1:50, beta0_history[1:50], type = "l", col = "darkred", lwd = 2,
xlab = "Iteration", ylab = "Coefficient Estimate",
main = "First 50 Iterations of Backfitting Algorithm",
ylim = range(c(beta0_history[1:50], beta1_history[1:50], beta2_history[1:50])))
lines(1:50, beta1_history[1:50], col = "darkblue", lwd = 2)
lines(1:50, beta2_history[1:50], col = "darkgreen", lwd = 2)
abline(h = direct_beta0, col = "red", lty = 2, lwd = 1.5)
abline(h = direct_beta1, col = "skyblue", lty = 2, lwd = 1.5)
abline(h = direct_beta2, col = "lightgreen", lty = 2, lwd = 1.5)
legend("right",
legend = c("beta0 (backfitting)", "beta1 (backfitting)", "beta2 (backfitting)",
"beta0 (direct)", "beta1 (direct)", "beta2 (direct)"),
col = c("darkred", "darkblue", "darkgreen", "red", "skyblue", "lightgreen"),
lty = c(1, 1, 1, 2, 2, 2),
lwd = c(2, 2, 2, 1.5, 1.5, 1.5),
cex = 0.8)
Observations
The backfitting algorithm does a great job of approximating standard multiple linear regression. After 1,000 iterations, the estimated coefficients were virtually identical to those obtained directly from multiple regression, matching to six decimal places:
beta0 ≈ 1.618228, beta1 ≈ 2.905834, beta2 ≈ -4.483163
This shows that backfitting is a powerful method: by iteratively fitting simple linear regressions and updating one coefficient at a time, it breaks a multivariate regression problem into pieces and still arrives at the least-squares solution. The final estimates are also close to the true parameters used to generate the data; the small differences are expected and are due to the random noise added to the response.
In this dataset, the algorithm converged extremely quickly. After just 4 iterations, the coefficient estimates stabilized and matched the direct multiple regression estimates to 6 decimal places:
Iteration 1: β₀ = 1.7343, β₁ = 2.8987, β₂ = -4.6190
Iteration 2: β₀ = 1.6185, β₁ = 2.9058, β₂ = -4.4835
Iteration 3: β₀ = 1.6182, β₁ = 2.9058, β₂ = -4.4832
Iteration 4+: Values remain unchanged (i.e., convergence achieved)
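A direct numerical check (reusing the stored histories and direct_lm defined above) confirms the match:
# Largest absolute difference between the final backfitting estimates
# and the direct lm() coefficients
max(abs(c(beta0_history[n_iter], beta1_history[n_iter], beta2_history[n_iter]) -
          coef(direct_lm)))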
Key Takeaways
Backfitting is a powerful and intuitive approach that lets you build a multiple regression model using only simple linear regressions.
The algorithm converges quickly and gives the same result as a standard linear regression model.
This approach forms the foundation of more flexible models like GAMs, where each term can be a non-linear smooth function.
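To illustrate that last point, here is a minimal sketch of backfitting with smooth terms instead of simple linear regressions (not part of the exercise; it reuses x1, x2, and y from the simulation above, smooths the partial residuals with smooth.spline(), and the df = 4 choice is arbitrary):
# Sketch: backfitting with smoothers, as in a GAM (reuses x1, x2, y from above)
f1 <- rep(0, length(y))   # current smooth estimate for the x1 term
f2 <- rep(0, length(y))   # current smooth estimate for the x2 term
alpha <- mean(y)          # intercept
for (it in 1:10) {
  # Update f1 by smoothing the partial residuals with f2 held fixed
  f1 <- predict(smooth.spline(x1, y - alpha - f2, df = 4), x1)$y
  f1 <- f1 - mean(f1)     # center each smooth for identifiability
  # Update f2 by smoothing the partial residuals with f1 held fixed
  f2 <- predict(smooth.spline(x2, y - alpha - f1, df = 4), x2)$y
  f2 <- f2 - mean(f2)
}
# The simulated relationship is linear, so each estimated smooth should be
# close to a straight line with slope near the corresponding true coefficient
plot(x1, f1, pch = 20, xlab = "x1", ylab = "f1(x1)",
     main = "Estimated smooth term for x1")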