library(ISLR2)
library(boot)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
wage_data <- read.csv("/Users/saransh/Downloads/Statistical_Learning_Resources/Wage.csv")
# Fit polynomial models of degree 1 to 10 and estimate test error with 10-fold CV
cv_errors <- rep(NA, 10)
for (d in 1:10) {
  fit <- glm(wage ~ poly(age, d), data = wage_data)
  cv_errors[d] <- cv.glm(wage_data, fit, K = 10)$delta[1]
}
# Plot cross-validation error
plot(1:10, cv_errors, type = "b", xlab = "Degree", ylab = "CV Error")
best_degree <- which.min(cv_errors)
best_degree
## [1] 10
Interpretation: The cross-validation error drops sharply from degree 1 to 2 and then flattens out. The minimum error occurs at degree 10, which is therefore selected as the optimal polynomial degree. However, the improvement beyond degree 4 is marginal, so the higher-degree fits risk overfitting.
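The claim that the gains beyond degree 4 are marginal can be checked numerically. A minimal sketch (not part of the original output; the exact values depend on the random CV folds, so a set.seed() before the loop would make it reproducible):
# Sketch: each degree's CV error and its percentage gap above the best degree
round(cv_errors, 2)
round(100 * (cv_errors - min(cv_errors)) / min(cv_errors), 2)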
Comparing using hypothesis testing with ANOVA:
fit_list <- list()
for (i in 1:5) {
  fit_list[[i]] <- lm(wage ~ poly(age, i), data = wage_data)
}
anova(fit_list[[1]], fit_list[[2]], fit_list[[3]], fit_list[[4]], fit_list[[5]])
## Analysis of Variance Table
##
## Model 1: wage ~ poly(age, i)
## Model 2: wage ~ poly(age, i)
## Model 3: wage ~ poly(age, i)
## Model 4: wage ~ poly(age, i)
## Model 5: wage ~ poly(age, i)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.5931 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.8888 0.001679 **
## 4 2995 4771604 1 6070 3.8098 0.051046 .
## 5 2994 4770322 1 1283 0.8050 0.369682
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation: The ANOVA table shows that:
- Degrees 2 and 3 each add significant explanatory power (p < 0.01),
- Degree 4 is only marginally significant (p = 0.051),
- Degree 5 is not significant (p = 0.37).
This suggests that a polynomial of degree 3 or 4 is sufficient and aligns with the principle of parsimony. Although degree 10 minimized the CV error, the ANOVA comparison argues against that unnecessary complexity.
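As a cross-check, because poly() uses an orthogonal basis, the sequential ANOVA F-statistics can also be read off a single degree-5 fit: the squared t-statistics reproduce the F column above. A short sketch (output not shown here):
# Sketch: with orthogonal polynomials, t^2 from one degree-5 fit equals the ANOVA F-statistics
fit5 <- lm(wage ~ poly(age, 5), data = wage_data)
round((coef(summary(fit5))[-1, "t value"])^2, 3)  # compare with the F column above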
Plot of the best-fit polynomial:
age_range <- range(wage_data$age)
age_grid <- seq(age_range[1], age_range[2], length.out = 100)
predicted <- predict(lm(wage ~ poly(age, best_degree), data = wage_data), newdata = data.frame(age = age_grid))
plot(wage_data$age, wage_data$wage, col = "gray", main = paste("Degree", best_degree, "Polynomial Fit"), xlab = "Age", ylab = "Wage")
lines(age_grid, predicted, col = "blue", lwd = 2)
The degree-10 polynomial fit captures the non-linear wage trend quite well, especially around ages 30–60. However, the wiggles toward the extremes (young and old ages) suggest overfitting, reinforcing that a lower-degree polynomial may be preferable for interpretability.
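To visualize that concern, a lower-degree fit can be overlaid on the same figure. A sketch to append to the plotting chunk above (degree 4 is taken from the ANOVA comparison, not from the CV minimum):
# Sketch: overlay a degree-4 fit (dashed red) on the degree-10 plot for comparison
pred4 <- predict(lm(wage ~ poly(age, 4), data = wage_data),
                 newdata = data.frame(age = age_grid))
lines(age_grid, pred4, col = "red", lwd = 2, lty = 2)
legend("topright", legend = c(paste("Degree", best_degree), "Degree 4"),
       col = c("blue", "red"), lty = c(1, 2), lwd = 2)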
# Try cuts from 2 to 10
cv_step_errors <- rep(NA, 9)
for (i in 2:10) {
  wage_data$age_cut <- cut(wage_data$age, i)
  fit <- glm(wage ~ age_cut, data = wage_data)
  cv_step_errors[i - 1] <- cv.glm(wage_data, fit, K = 10)$delta[1]
}
# Plot CV error vs. number of cuts
plot(2:10, cv_step_errors, type = "b", xlab = "Number of Cuts", ylab = "CV Error")
best_cuts <- which.min(cv_step_errors) + 1
best_cuts
## [1] 8
Interpretation: The CV error decreases steadily from 2 to 8 cuts and then flattens, with 8 cuts giving the lowest error. This suggests that age groups can be discretized into 8 bins for optimal predictive performance using a step function.
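For reference, the bin boundaries that cut() chooses (equal-width intervals over the observed age range) can be inspected directly; a quick sketch:
# Sketch: the 8 equal-width age bins and the number of observations in each
table(cut(wage_data$age, best_cuts))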
Plot of the step function fit:
wage_data$age_cut <- cut(wage_data$age, best_cuts)
fit_step <- lm(wage ~ age_cut, data = wage_data)
age_ordered <- order(wage_data$age)
plot(wage_data$age, wage_data$wage, col = "gray", xlab = "Age", ylab = "Wage", main = paste("Step Function Fit with", best_cuts, "Cuts"))
lines(wage_data$age[age_ordered], fitted(fit_step)[age_ordered], col = "red", lwd = 2)
Overall Interpretation: The red step line shows a piecewise constant fit of wage over age brackets. The model captures the overall trend effectively in an interpretable way. However, the lack of smoothness means this approach is less suited for capturing gradual changes compared to polynomials.
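If a smoother alternative to the piecewise-constant fit is wanted, a natural cubic spline is one option. A minimal sketch, using df = 4 as an illustrative (untuned) choice:
# Sketch: natural cubic spline fit for wage vs age; df = 4 is illustrative, not CV-selected
library(splines)
fit_ns <- lm(wage ~ ns(age, df = 4), data = wage_data)
pred_ns <- predict(fit_ns, newdata = data.frame(age = age_grid))
plot(wage_data$age, wage_data$wage, col = "gray", xlab = "Age", ylab = "Wage",
     main = "Natural Spline Fit (df = 4)")
lines(age_grid, pred_ns, col = "blue", lwd = 2)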
In this question, we explore additional predictors such as maritl (marital status), jobclass (job classification), and education, and analyze their relationship with wage. We use non-linear and flexible models to capture complex patterns, if any.
# Visualize wage by marital status
library(ggplot2)
ggplot(wage_data, aes(x = maritl, y = wage)) +
  geom_boxplot(fill = "lightblue") +
  theme_minimal() +
  labs(title = "Wage by Marital Status", x = "Marital Status", y = "Wage")
# Fit model with maritl
fit_maritl <- lm(wage ~ maritl, data = wage_data)
summary(fit_maritl)
##
## Call:
## lm(formula = wage ~ maritl, data = wage_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -98.775 -24.788 -4.754 15.845 221.595
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 92.735 1.582 58.608 < 2e-16 ***
## maritl2. Married 26.126 1.813 14.413 < 2e-16 ***
## maritl3. Widowed 6.804 9.375 0.726 0.46804
## maritl4. Divorced 10.425 3.234 3.224 0.00128 **
## maritl5. Separated 8.481 5.657 1.499 0.13392
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40.28 on 2995 degrees of freedom
## Multiple R-squared: 0.06954, Adjusted R-squared: 0.0683
## F-statistic: 55.96 on 4 and 2995 DF, p-value: < 2.2e-16
Interpretation: From the boxplot, married individuals tend to have the highest median wage, while those who have never married show the lowest. Spread and outliers are visible in all categories. According to the linear model summary, “Married” and “Divorced” have significant positive coefficients relative to the baseline (“Never Married”), indicating these groups earn more on average, whereas “Widowed” and “Separated” are not statistically significant in this single-predictor model.
ggplot(wage_data, aes(x = jobclass, y = wage)) +
  geom_boxplot(fill = "lightgreen") +
  theme_minimal() +
  labs(title = "Wage by Job Class", x = "Job Class", y = "Wage")
# Fit model with jobclass
fit_jobclass <- lm(wage ~ jobclass, data = wage_data)
summary(fit_jobclass)
##
## Call:
## lm(formula = wage ~ jobclass, data = wage_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -100.507 -25.362 -6.117 15.697 197.750
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 103.321 1.039 99.43 <2e-16 ***
## jobclass2. Information 17.272 1.492 11.58 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40.83 on 2998 degrees of freedom
## Multiple R-squared: 0.04281, Adjusted R-squared: 0.04249
## F-statistic: 134.1 on 1 and 2998 DF, p-value: < 2.2e-16
Interpretation: The boxplot shows that workers in the “Information” job class earn higher wages than those in “Industrial”. The linear model confirms this: the “Information” coefficient is positive and statistically significant. Note that this simple model does not yet control for other factors; that is done in the combined model below.
library(splines)
# Education is a categorical factor, so we fit it directly (no spline basis needed)
fit_edu <- lm(wage ~ education, data = wage_data)
ggplot(wage_data, aes(x = education, y = wage)) +
  geom_boxplot(fill = "salmon") +
  theme_minimal() +
  labs(title = "Wage by Education", x = "Education Level", y = "Wage")
Interpretation: There is a strong positive relationship between education and wage. As education level increases from less than a high-school diploma to an advanced degree, both the median wage and the wage variability increase. The difference between adjacent education levels (e.g., HS Grad vs. College Grad) is substantial, and it is largest for those with advanced degrees.
# Combined model
fit_combined <- lm(wage ~ maritl + jobclass + education, data = wage_data)
summary(fit_combined)
##
## Call:
## lm(formula = wage ~ maritl + jobclass + education, data = wage_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -109.75 -19.81 -2.74 14.85 219.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.632 2.502 26.635 < 2e-16 ***
## maritl2. Married 22.332 1.594 14.006 < 2e-16 ***
## maritl3. Widowed 7.309 8.214 0.890 0.373597
## maritl4. Divorced 9.723 2.833 3.432 0.000607 ***
## maritl5. Separated 15.783 4.972 3.174 0.001518 **
## jobclass2. Information 5.199 1.355 3.836 0.000127 ***
## education2. HS Grad 11.268 2.440 4.618 4.04e-06 ***
## education3. Some College 23.128 2.579 8.966 < 2e-16 ***
## education4. College Grad 37.956 2.584 14.689 < 2e-16 ***
## education5. Advanced Degree 61.881 2.840 21.793 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.27 on 2990 degrees of freedom
## Multiple R-squared: 0.2876, Adjusted R-squared: 0.2855
## F-statistic: 134.1 on 9 and 2990 DF, p-value: < 2.2e-16
Interpretation: This additive model allows us to control for the effect of each categorical variable while examining the others. Results show:
Marital Status: “Married”, “Divorced”, and “Separated” are all associated with significantly higher wages relative to “Never Married”. “Widowed” is not statistically significant.
Job Class: Being in the “Information” class leads to a significant wage increase.
Education: A clear upward trend in wages is observed with increasing education levels. The coefficients for “Some College”, “College Grad”, and “Advanced Degree” are especially large and statistically significant.
This model confirms that each predictor is associated with wage differences even after adjusting for the others, and that education shows the strongest association among them.
By including these categorical variables directly in linear models and visualizing them with boxplots, we capture group-level differences in wage determination, which act as non-linear (piecewise-constant) effects.
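A natural extension, sketched below rather than fitted in the output above, is to add a smooth term for age alongside the categorical predictors and test whether it improves on the purely categorical model (ns() with df = 4 is an illustrative choice):
# Sketch: augment the additive model with a natural-spline term for age and
# compare the nested models with an F-test
fit_flex <- lm(wage ~ maritl + jobclass + education + ns(age, df = 4),
               data = wage_data)
anova(fit_combined, fit_flex)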
library(tidyverse)
library(splines)
library(boot)
boston_data <- read.csv("/Users/saransh/Downloads/Statistical_Learning_Resources/Boston.csv")
head(boston_data)
## X crim zn indus chas nox rm age dis rad tax ptratio lstat medv
## 1 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 4.98 24.0
## 2 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 9.14 21.6
## 3 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 4.03 34.7
## 4 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 2.94 33.4
## 5 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 5.33 36.2
## 6 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 5.21 28.7
fit_poly3 <- lm(nox ~ poly(dis, 3), data = boston_data)
summary(fit_poly3)
##
## Call:
## lm(formula = nox ~ poly(dis, 3), data = boston_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.121130 -0.040619 -0.009738 0.023385 0.194904
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.554695 0.002759 201.021 < 2e-16 ***
## poly(dis, 3)1 -2.003096 0.062071 -32.271 < 2e-16 ***
## poly(dis, 3)2 0.856330 0.062071 13.796 < 2e-16 ***
## poly(dis, 3)3 -0.318049 0.062071 -5.124 4.27e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06207 on 502 degrees of freedom
## Multiple R-squared: 0.7148, Adjusted R-squared: 0.7131
## F-statistic: 419.3 on 3 and 502 DF, p-value: < 2.2e-16
# Plot the data and cubic polynomial fit
dis_grid <- seq(min(boston_data$dis), max(boston_data$dis), length.out = 100)
pred_poly3 <- predict(fit_poly3, newdata = data.frame(dis = dis_grid))
plot(boston_data$dis, boston_data$nox, col = "darkgray", xlab = "dis", ylab = "nox", main = "Cubic Polynomial Fit")
lines(dis_grid, pred_poly3, col = "blue", lwd = 2)
Interpretation: The cubic polynomial regression captures a smooth and non-linear inverse relationship between dis (distance to employment centers) and nox (pollution). As distance increases, pollution levels tend to decrease, flattening out after ~5 units of dis.
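Approximate 95% confidence bands help show where the cubic fit is well determined; they widen at large dis, where observations are sparse. A sketch that adds the bands to the plotting chunk above (not part of the original figure):
# Sketch: pointwise bands at roughly +/- 2 standard errors around the cubic fit
pred_se <- predict(fit_poly3, newdata = data.frame(dis = dis_grid), se.fit = TRUE)
lines(dis_grid, pred_se$fit + 2 * pred_se$se.fit, col = "blue", lty = 2)
lines(dis_grid, pred_se$fit - 2 * pred_se$se.fit, col = "blue", lty = 2)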
rss_poly <- rep(NA, 10)
for (d in 1:10) {
  fit <- lm(nox ~ poly(dis, d), data = boston_data)
  rss_poly[d] <- sum(residuals(fit)^2)
}
plot(1:10, rss_poly, type = "b", xlab = "Degree", ylab = "RSS", main = "Polynomial Degree vs RSS")
rss_poly
## [1] 2.768563 2.035262 1.934107 1.932981 1.915290 1.878257 1.849484 1.835630
## [9] 1.833331 1.832171
Interpretation: RSS decreases significantly from degree 1 to 3 and then stabilizes. This suggests that increasing polynomial degree beyond 3 yields diminishing returns in reducing residual error. Degree 3 seems a reasonable choice based on this plot.
cv_errors <- rep(NA, 10)
for (d in 1:10) {
  fit <- glm(nox ~ poly(dis, d), data = boston_data)
  cv_errors[d] <- cv.glm(boston_data, fit, K = 10)$delta[1]
}
plot(1:10, cv_errors, type = "b", xlab = "Degree", ylab = "CV Error", main = "CV Error vs Polynomial Degree")
best_degree <- which.min(cv_errors)
best_degree
## [1] 4
Interpretation: Cross-validation selects degree 4 as the polynomial degree with the lowest estimated prediction error, although degrees 2 and 3 perform nearly as well. This is consistent with the RSS plot, which showed diminishing returns beyond degree 3, and it avoids the overfitting risk of very high-degree polynomials.
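Because the 10-fold estimates are noisy, it is worth checking which degrees come close to the minimum before committing to one; a quick sketch (results vary with the random folds):
# Sketch: CV errors by degree, and the degrees within ~1% of the best value
round(cv_errors, 5)
which(cv_errors <= 1.01 * min(cv_errors))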
fit_spline4 <- lm(nox ~ bs(dis, df = 4), data = boston_data)
summary(fit_spline4)
##
## Call:
## lm(formula = nox ~ bs(dis, df = 4), data = boston_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.124622 -0.039259 -0.008514 0.020850 0.193891
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.73447 0.01460 50.306 < 2e-16 ***
## bs(dis, df = 4)1 -0.05810 0.02186 -2.658 0.00812 **
## bs(dis, df = 4)2 -0.46356 0.02366 -19.596 < 2e-16 ***
## bs(dis, df = 4)3 -0.19979 0.04311 -4.634 4.58e-06 ***
## bs(dis, df = 4)4 -0.38881 0.04551 -8.544 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06195 on 501 degrees of freedom
## Multiple R-squared: 0.7164, Adjusted R-squared: 0.7142
## F-statistic: 316.5 on 4 and 501 DF, p-value: < 2.2e-16
pred_spline4 <- predict(fit_spline4, newdata = data.frame(dis = dis_grid))
plot(boston_data$dis, boston_data$nox, col = "gray", main = "Spline Fit (df = 4)", xlab = "dis", ylab = "nox")
lines(dis_grid, pred_spline4, col = "red", lwd = 2)
Interpretation: The spline fit with 4 degrees of freedom flexibly captures the non-linear trend, similar to the cubic polynomial. Knots are placed automatically based on quantiles of dis.
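The knot placement mentioned above can be verified directly: with cubic B-splines and no intercept term in the basis, df = 4 leaves one interior knot, which bs() places at the median of dis. A one-line sketch:
# Sketch: inspect the interior knot chosen by bs() for df = 4
attr(bs(boston_data$dis, df = 4), "knots")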
rss_spline <- rep(NA, 10)
plot(boston_data$dis, boston_data$nox, col = "lightgray", xlab = "dis", ylab = "nox", main = "Spline Fits (df = 3 to 10)")
for (df in 3:10) {
  fit <- lm(nox ~ bs(dis, df = df), data = boston_data)
  pred <- predict(fit, newdata = data.frame(dis = dis_grid))
  lines(dis_grid, pred, col = rainbow(8)[df - 2], lwd = 1.5)
  rss_spline[df] <- sum(residuals(fit)^2)
}
legend("topright", legend = paste("df =", 3:10), col = rainbow(8), lty = 1, lwd = 1.5)
rss_spline[3:10]
## [1] 1.934107 1.922775 1.840173 1.833966 1.829884 1.816995 1.825653 1.792535
Interpretation: The plots show that higher degrees of freedom result in more flexible fits. The RSS generally decreases with increasing degrees of freedom, but too high a df might lead to overfitting. From visual inspection, degrees 5–7 strike a good balance.
cv_spline_errors <- rep(NA, 10)
for (df in 3:10) {
  fit <- glm(nox ~ bs(dis, df = df), data = boston_data)
  cv_spline_errors[df] <- cv.glm(boston_data, fit, K = 10)$delta[1]
}
plot(3:10, cv_spline_errors[3:10], type = "b", xlab = "Degrees of Freedom", ylab = "CV Error", main = "Spline CV Error vs df")
best_df_spline <- which.min(cv_spline_errors[3:10]) + 2
best_df_spline
## [1] 5
Conclusion: Cross-validation identifies df = 5 as the best spline fit, minimizing the estimated prediction error. This agrees with the visual inspection above: moderate flexibility (df around 5–7) captures the trend in nox versus dis without the overfitting risk of the most flexible fits.
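As a final sanity check, the selected spline can be compared head-to-head with the cubic polynomial using the same 10-fold procedure; a sketch (expect fold-to-fold variation, and bs() may warn about test observations beyond the boundary knots):
# Sketch: 10-fold CV of the df = 5 spline vs the cubic polynomial
set.seed(1)
cv.glm(boston_data, glm(nox ~ bs(dis, df = 5), data = boston_data), K = 10)$delta[1]
cv.glm(boston_data, glm(nox ~ poly(dis, 3), data = boston_data), K = 10)$delta[1]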
set.seed(123)
n <- 100
x1 <- rnorm(n, mean = 0, sd = sqrt(2))
x2 <- rnorm(n, mean = 0, sd = sqrt(2))
epsilon <- rnorm(n, mean = 0, sd = 1)
# True coefficients
beta0_true <- 1.5
beta1_true <- 3
beta2_true <- -4.5
# Generate response
y <- beta0_true + beta1_true * x1 + beta2_true * x2 + epsilon
beta1 <- 0 # Starting value for beta1
iterations <- 1000
beta0_vec <- numeric(iterations)
beta1_vec <- numeric(iterations)
beta2_vec <- numeric(iterations)
for (i in 1:iterations) {
  # Holding beta1 fixed, regress the partial residual on x2 to update beta2
  a <- y - beta1 * x1
  beta2 <- lm(a ~ x2)$coef[2]
  # Holding beta2 fixed, regress the partial residual on x1 to update beta1
  a <- y - beta2 * x2
  beta1 <- lm(a ~ x1)$coef[2]
  # Update the intercept given the current slope estimates
  beta0 <- mean(y - beta1 * x1 - beta2 * x2)
  beta0_vec[i] <- beta0
  beta1_vec[i] <- beta1
  beta2_vec[i] <- beta2
}
steps <- c(1:5, seq(50, 1000, by = 50))
result_table <- data.frame(
  Iteration = steps,
  Beta0 = round(beta0_vec[steps], 4),
  Beta1 = round(beta1_vec[steps], 4),
  Beta2 = round(beta2_vec[steps], 4)
)
knitr::kable(result_table, caption = "Backfitting Coefficient Estimates at Selected Iterations")
Iteration | Beta0 | Beta1 | Beta2 |
---|---|---|---|
1 | 1.6153 | 2.8987 | -4.6190 |
2 | 1.6350 | 2.9058 | -4.4835 |
3 | 1.6351 | 2.9058 | -4.4832 |
4 | 1.6351 | 2.9058 | -4.4832 |
5 | 1.6351 | 2.9058 | -4.4832 |
50 | 1.6351 | 2.9058 | -4.4832 |
100 | 1.6351 | 2.9058 | -4.4832 |
150 | 1.6351 | 2.9058 | -4.4832 |
200 | 1.6351 | 2.9058 | -4.4832 |
250 | 1.6351 | 2.9058 | -4.4832 |
300 | 1.6351 | 2.9058 | -4.4832 |
350 | 1.6351 | 2.9058 | -4.4832 |
400 | 1.6351 | 2.9058 | -4.4832 |
450 | 1.6351 | 2.9058 | -4.4832 |
500 | 1.6351 | 2.9058 | -4.4832 |
550 | 1.6351 | 2.9058 | -4.4832 |
600 | 1.6351 | 2.9058 | -4.4832 |
650 | 1.6351 | 2.9058 | -4.4832 |
700 | 1.6351 | 2.9058 | -4.4832 |
750 | 1.6351 | 2.9058 | -4.4832 |
800 | 1.6351 | 2.9058 | -4.4832 |
850 | 1.6351 | 2.9058 | -4.4832 |
900 | 1.6351 | 2.9058 | -4.4832 |
950 | 1.6351 | 2.9058 | -4.4832 |
1000 | 1.6351 | 2.9058 | -4.4832 |
# Plot coefficient estimates and overlay multiple regression coefficients
plot(1:iterations, beta0_vec, type = "l", col = "blue",
     ylim = range(c(beta0_vec, beta1_vec, beta2_vec)),
     xlab = "Iteration", ylab = "Coefficient Estimate",
     main = "Backfitting Estimates vs LM Coefficients")
lines(1:iterations, beta1_vec, col = "red")
lines(1:iterations, beta2_vec, col = "darkgreen")
# Fit the full multiple linear model
lm_fit <- lm(y ~ x1 + x2)
# Overlay horizontal lines for lm() coefficients
abline(h = coef(lm_fit)[1], col = "blue", lty = 2)
abline(h = coef(lm_fit)[2], col = "red", lty = 2)
abline(h = coef(lm_fit)[3], col = "darkgreen", lty = 2)
# Add legend
legend("bottomright",
legend = c("Beta0", "Beta1", "Beta2", "LM Beta0", "LM Beta1", "LM Beta2"),
col = c("blue", "red", "darkgreen", "blue", "red", "darkgreen"),
lty = c(1, 1, 1, 2, 2, 2))
Convergence Observed: The coefficients stabilize very quickly; the table shows the estimates are essentially unchanged after the third iteration.
Backfitting Effectiveness: The backfitting estimates align very closely with those from the lm() function, validating the iterative method for estimating coefficients using only simple regressions.
Final Coefficient Estimates (around iteration 1000): β₀ ≈ 1.6351, β₁ ≈ 2.9058, β₂ ≈ -4.4832
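A compact numeric comparison of the converged backfitting values with the lm() coefficients (a sketch using the objects defined above):
# Sketch: converged backfitting estimates vs the multiple-regression coefficients
rbind(backfit = c(beta0_vec[iterations], beta1_vec[iterations], beta2_vec[iterations]),
      lm = coef(lm_fit))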