1. In this exercise, you will further analyze the Wage data set considered throughout this chapter.
  1. Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.

    library(ISLR)    
    library(boot)    
    library(ggplot2) 
    
    
    set.seed(1)
    cv.error.10 = rep(0, 10)  
    for (i in 1:10) {
      glm.fit = glm(logwage ~ poly(age, i), data = Wage)  
      cv.error.10[i] = cv.glm(Wage, glm.fit, K = 10)$delta[1]  
    }
    cv.error.10
    ##  [1] 0.1179936 0.1107270 0.1100178 0.1095241 0.1095717 0.1096447 0.1096246
    ##  [8] 0.1097252 0.1095097 0.1096670
    plot(cv.error.10, type = "b", xlab = "Degree", ylab = "CV Error")
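
    The chosen degree can also be read off the CV errors directly rather than eyeballed from the plot. A quick check, reusing the cv.error.10 vector computed above:

    which.min(cv.error.10)                              # degree with the strictly smallest CV error
    which(cv.error.10 < min(cv.error.10) + 0.001)[1]    # lowest degree within 0.001 of the minimum (an informal tolerance)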

    lm.fit = glm(logwage ~ poly(age, 4), data = Wage)
    summary(lm.fit)
    ## 
    ## Call:
    ## glm(formula = logwage ~ poly(age, 4), data = Wage)
    ## 
    ## Coefficients:
    ##                Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)    4.653905   0.006036 771.070  < 2e-16 ***
    ## poly(age, 4)1  4.197217   0.330586  12.696  < 2e-16 ***
    ## poly(age, 4)2 -4.694434   0.330586 -14.200  < 2e-16 ***
    ## poly(age, 4)3  1.644906   0.330586   4.976 6.87e-07 ***
    ## poly(age, 4)4 -1.179518   0.330586  -3.568 0.000365 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for gaussian family taken to be 0.109287)
    ## 
    ##     Null deviance: 371.07  on 2999  degrees of freedom
    ## Residual deviance: 327.31  on 2995  degrees of freedom
    ## AIC: 1879.3
    ## 
    ## Number of Fisher Scoring iterations: 2
    fit.1 = lm(logwage ~ age, data = Wage)
    fit.2 = lm(logwage ~ poly(age, 2), data = Wage)
    fit.3 = lm(logwage ~ poly(age, 3), data = Wage)
    fit.4 = lm(logwage ~ poly(age, 4), data = Wage)
    fit.5 = lm(logwage ~ poly(age, 5), data = Wage)
    
    
    anova(fit.1, fit.2, fit.3, fit.4, fit.5)
    ## Analysis of Variance Table
    ## 
    ## Model 1: logwage ~ age
    ## Model 2: logwage ~ poly(age, 2)
    ## Model 3: logwage ~ poly(age, 3)
    ## Model 4: logwage ~ poly(age, 4)
    ## Model 5: logwage ~ poly(age, 5)
    ##   Res.Df    RSS Df Sum of Sq        F    Pr(>F)    
    ## 1   2998 353.45                                    
    ## 2   2997 331.41  1   22.0377 201.5934 < 2.2e-16 ***
    ## 3   2996 328.71  1    2.7057  24.7510 6.892e-07 ***
    ## 4   2995 327.31  1    1.3913  12.7268 0.0003661 ***
    ## 5   2994 327.30  1    0.0177   0.1615 0.6878137    
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    agelims = range(Wage$age)
    age.grid = seq(from = agelims[1], to = agelims[2])
    
    
    preds = predict(lm.fit, newdata = list(age = age.grid), se = TRUE)
    se.bands = cbind(preds$fit + 2 * preds$se.fit, preds$fit - 2 * preds$se.fit)
    
    
    plot(Wage$age, Wage$logwage, xlim = agelims, cex = 0.5, col = "darkgrey")
    title("Polynomial fit using degree 4 for Age")
    lines(age.grid, preds$fit, lwd = 2, col = "blue")
    matlines(age.grid, se.bands, lwd = 1, col = "blue", lty = 3)

    The plot of CV errors shows a sharp drop through degree 3 or 4, after which the curve levels off, suggesting little benefit from additional polynomial complexity beyond degree 4.

    The degree-4 polynomial fit shows a nonlinear relationship between age and log(wage), with wages increasing until around age 50 and then gradually declining.
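
    The age at which the fitted curve peaks can be read off the grid predictions computed above (a quick check):

    age.grid[which.max(preds$fit)]    # age at which the fitted degree-4 curve is highest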

    To summarize the degree-4 polynomial fit of log(wage) on age:

    • All four polynomial terms are statistically significant (p < 0.001), indicating a clearly non-linear relationship between age and log(wage).

    • Model fit: the AIC is 1879.3 and the residual deviance is 327.31, a substantial reduction from the null deviance of 371.07.

    • The ANOVA comparison shows that each term up to the 4th degree significantly improves the model, while the 5th-degree term does not (p = 0.688).

    • Conclusion: hypothesis testing via ANOVA selects a degree-4 polynomial, which agrees with cross-validation, where the CV error flattens out at about the same degree. A degree-4 polynomial therefore captures the non-linear age–wage relationship well, and higher degrees add little.

  2. Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.

    set.seed(42)
    
    
    cv_errors_step = rep(NA, 19)
    
    
    for (num_cuts in 2:20) {
    
      Wage$age_cut = cut(Wage$age, num_cuts)
    
    
      step_model = glm(wage ~ age_cut, data=Wage)
    
    
      cv_errors_step[num_cuts - 1] = cv.glm(Wage, step_model, K=10)$delta[1]
    }
    
    
    cv_errors_step
    ##  [1] 1733.565 1684.082 1637.465 1632.337 1623.836 1610.572 1602.163 1612.770
    ##  [9] 1606.141 1601.591 1605.167 1604.376 1603.710 1603.690 1602.266 1607.777
    ## [17] 1604.390 1603.445 1604.955
    plot(cv_errors_step, type="b", ylab="Cross-Validation Error", xlab="Number of Cuts", main="CV Errors for Step Function Models")

    The plot shows that cross-validation error for step function models decreases rapidly up to around 8 cuts, after which further increases in the number of cuts yield minimal improvement.
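
    For reference, the strict CV minimizer can also be extracted programmatically; from the errors above it occurs at 11 cuts (1601.6), only marginally below the 8-cut error (1602.2), so the simpler 8-cut fit is used below.

    best_cuts = which.min(cv_errors_step) + 1    # +1 because cv_errors_step[1] corresponds to 2 cuts
    best_cuts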

    final_step_model = glm(wage ~ cut(age, 8), data=Wage)
    
    age_grid = seq(from = min(Wage$age), to = max(Wage$age), length.out = 100)
    predictions_step = predict(final_step_model, newdata = list(age = age_grid), se = TRUE)
    
    se_bands_step = cbind(predictions_step$fit + 2 * predictions_step$se.fit,
                          predictions_step$fit - 2 * predictions_step$se.fit)
    
    plot(Wage$age, Wage$wage, col = "darkgrey", pch = 16, cex = 0.5,
         xlab = "Age", ylab = "Wage")
    title("Step Function Fit with 8 Cuts")
    lines(age_grid, predictions_step$fit, lwd = 2, col = "blue")  
    matlines(age_grid, se_bands_step, lwd = 1, col = "blue", lty = 3)
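
    The interval boundaries that the 8-cut step function actually uses can be inspected directly (a quick check):

    levels(cut(Wage$age, 8))    # the eight age intervals created by cut()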

2. The Wage data set contains a number of other features not explored in this chapter, such as marital status (maritl), job class (jobclass), and others. Explore the relationships between some of these other predictors and wage, and use non-linear fitting techniques in order to fit flexible models to the data. Create plots of the results obtained, and write a summary of your findings.
boxplot(wage ~ maritl, data = Wage, main = "Wage vs Marital Status", ylab = "Wage", xlab = "Marital Status", col = "lightblue")

library(splines)
marital_spline = lm(wage ~ ns(age, df = 5) + maritl, data = Wage)


summary(marital_spline)
## 
## Call:
## lm(formula = wage ~ ns(age, df = 5) + maritl, data = Wage)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -102.024  -23.696   -4.202   14.517  206.687 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         62.2925     4.6612  13.364  < 2e-16 ***
## ns(age, df = 5)1    48.8124     5.1417   9.493  < 2e-16 ***
## ns(age, df = 5)2    43.9898     6.0327   7.292 3.90e-13 ***
## ns(age, df = 5)3    37.3492     5.1940   7.191 8.10e-13 ***
## ns(age, df = 5)4    62.2447    12.3628   5.035 5.07e-07 ***
## ns(age, df = 5)5     0.6223     9.4840   0.066    0.948    
## maritl2. Married    13.7095     2.1049   6.513 8.61e-11 ***
## maritl3. Widowed    -4.8432     9.2854  -0.522    0.602    
## maritl4. Divorced   -3.0008     3.4105  -0.880    0.379    
## maritl5. Separated  -3.5361     5.6406  -0.627    0.531    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.42 on 2990 degrees of freedom
## Multiple R-squared:  0.1101, Adjusted R-squared:  0.1074 
## F-statistic: 41.08 on 9 and 2990 DF,  p-value: < 2.2e-16
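
Since ggplot2 is already loaded, the fitted age curves implied by marital_spline can be plotted for each marital status. This is a sketch; the objects age_seq and newdat are introduced here purely for illustration.

age_seq <- seq(min(Wage$age), max(Wage$age), length.out = 100)
newdat  <- expand.grid(age = age_seq, maritl = levels(Wage$maritl))
newdat$pred <- predict(marital_spline, newdata = newdat)   # fitted wage on the age grid, per marital status

ggplot(newdat, aes(x = age, y = pred, colour = maritl)) +
  geom_line() +
  labs(x = "Age", y = "Predicted wage",
       title = "Natural spline in age, shifted by marital status")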
boxplot(wage ~ jobclass, data = Wage, main = "Wage vs Job Class", ylab = "Wage", xlab = "Job Class", col = "lightgreen")

jobclass_spline = lm(wage ~ ns(age, df = 5) + jobclass, data = Wage)


summary(jobclass_spline)
## 
## Call:
## lm(formula = wage ~ ns(age, df = 5) + jobclass, data = Wage)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -106.293  -23.444   -4.991   16.043  197.146 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              57.217      4.637  12.340  < 2e-16 ***
## ns(age, df = 5)1         57.219      4.646  12.317  < 2e-16 ***
## ns(age, df = 5)2         51.773      5.630   9.196  < 2e-16 ***
## ns(age, df = 5)3         43.319      4.873   8.890  < 2e-16 ***
## ns(age, df = 5)4         74.139     11.742   6.314 3.12e-10 ***
## ns(age, df = 5)5          3.566      9.324   0.382    0.702    
## jobclass2. Information   15.013      1.442  10.414  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.21 on 2993 degrees of freedom
## Multiple R-squared:  0.1188, Adjusted R-squared:  0.117 
## F-statistic: 67.23 on 6 and 2993 DF,  p-value: < 2.2e-16
library(gam)
## Loading required package: foreach
## Loaded gam 1.22-5
model_gam <- gam(wage ~ s(age) + maritl + jobclass, data = Wage)


summary(model_gam)
## 
## Call: gam(formula = wage ~ s(age) + maritl + jobclass, data = Wage)
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max 
## -110.378  -23.260   -4.772   15.361  201.141 
## 
## (Dispersion Parameter for gaussian family taken to be 1497.157)
## 
##     Null Deviance: 5222086 on 2999 degrees of freedom
## Residual Deviance: 4476501 on 2990 degrees of freedom
## AIC: 30459.59 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##             Df  Sum Sq Mean Sq F value    Pr(>F)    
## s(age)       1  199870  199870 133.499 < 2.2e-16 ***
## maritl       4  141136   35284  23.567 < 2.2e-16 ***
## jobclass     1  173181  173181 115.673 < 2.2e-16 ***
## Residuals 2990 4476501    1497                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##             Npar Df Npar F     Pr(F)    
## (Intercept)                             
## s(age)            3 30.202 < 2.2e-16 ***
## maritl                                  
## jobclass                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(model_gam, se = TRUE, terms = "s(age)", main = "Smooth Term for Age")  # plot.Gam() selects terms via 'terms=', not 'select='
Summary of findings: wage depends on age in a clearly non-linear way, rising until roughly middle age and then flattening or declining. Holding age fixed, married workers earn noticeably more than never-married workers (coefficient of about 13.7 in the natural-spline model, p < 0.001), while the widowed, divorced, and separated categories do not differ significantly. Workers in the Information job class earn about 15 more than those in Industrial jobs (p < 0.001). The GAM combining a smooth term in age with maritl and jobclass confirms that all three effects are significant.

3. This question uses the variables dis (the weighted mean of distances to five Boston employment centers) and nox (nitrogen oxides concentration in parts per 10 million) from the Boston data. We will treat dis as the predictor and nox as the response.

    library(MASS)      
    library(ggplot2)    
    library(splines)    
    
    
    data(Boston)
  1. Use the poly() function to fit a cubic polynomial regression to predict nox using dis. Report the regression output, and plot the resulting data and polynomial fits.

    library(MASS)
    data("Boston")
    
    
    fit_cubic <- lm(nox ~ poly(dis, 3), data = Boston)
    summary(fit_cubic)
    ## 
    ## Call:
    ## lm(formula = nox ~ poly(dis, 3), data = Boston)
    ## 
    ## Residuals:
    ##       Min        1Q    Median        3Q       Max 
    ## -0.121130 -0.040619 -0.009738  0.023385  0.194904 
    ## 
    ## Coefficients:
    ##                Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)    0.554695   0.002759 201.021  < 2e-16 ***
    ## poly(dis, 3)1 -2.003096   0.062071 -32.271  < 2e-16 ***
    ## poly(dis, 3)2  0.856330   0.062071  13.796  < 2e-16 ***
    ## poly(dis, 3)3 -0.318049   0.062071  -5.124 4.27e-07 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.06207 on 502 degrees of freedom
    ## Multiple R-squared:  0.7148, Adjusted R-squared:  0.7131 
    ## F-statistic: 419.3 on 3 and 502 DF,  p-value: < 2.2e-16
    plot(Boston$dis, Boston$nox, main="Cubic Polynomial Fit", xlab="Distance to Employment Centers (dis)", ylab="Nitrogen Oxides (nox)", pch=20, col="darkgray")
    dis_range <- seq(min(Boston$dis), max(Boston$dis), length=100)
    preds <- predict(fit_cubic, newdata = data.frame(dis = dis_range))
    lines(dis_range, preds, col="blue", lwd=2)

    The cubic polynomial regression shows a strong nonlinear inverse relationship between distance to employment centers (dis) and nitrogen oxides concentration (nox), with nox decreasing sharply as dis increases.

  2. Plot the polynomial fits for a range of different polynomial degrees (say, from 1 to 10), and report the associated residual sum of squares.

    rss_values <- numeric(10)
    
    plot(Boston$dis, Boston$nox, pch=20, col="gray", main="Polynomial Fits", xlab="dis", ylab="nox")
    
    for (d in 1:10) {
      fit <- lm(nox ~ poly(dis, d), data=Boston)
      rss_values[d] <- sum(residuals(fit)^2)
    
      # Plotting only some of them to avoid overplot
      if (d %in% c(1, 2, 3, 5, 7, 10)) {
        lines(dis_range, predict(fit, newdata = data.frame(dis = dis_range)), col=rainbow(10)[d], lwd=2)
      }
    }

    # Print RSS values
    rss_values
    ##  [1] 2.768563 2.035262 1.934107 1.932981 1.915290 1.878257 1.849484 1.835630
    ##  [9] 1.833331 1.832171

    As the degree of the polynomial increases, the training RSS decreases, but the improvement slows markedly after degree 3 or 4; higher-degree models add little and risk overfitting the data.
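
    Plotting the RSS against degree makes the flattening visible at a glance (a small supplementary plot using the rss_values computed above):

    plot(1:10, rss_values, type = "b", pch = 19,
         xlab = "Polynomial Degree", ylab = "Training RSS",
         main = "Training RSS by Polynomial Degree")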

  3. Perform cross-validation or another approach to select the optimal degree for the polynomial, and explain your results.

    library(boot)
    
    cv_error <- numeric(10)
    
    for (d in 1:10) {
      fit <- glm(nox ~ poly(dis, d), data=Boston)
      cv_error[d] <- cv.glm(Boston, fit, K=10)$delta[2]
    }
    
    
    plot(1:10, cv_error, type="b", pch=19, col="blue", xlab="Polynomial Degree", ylab="10-fold CV Error", main="CV Error vs Polynomial Degree")

    The cross-validation plot shows that degrees 3 and 4 give the lowest prediction error, offering the best balance between accuracy and complexity. Higher degrees (such as 7 or 9) increase the error again, likely due to overfitting, so a degree of 3 or 4 is preferred.
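
    The degree with the strictly smallest CV error can be extracted from the cv_error vector above (a quick check; the exact winner varies with the random fold assignment):

    which.min(cv_error)    # polynomial degree with the lowest 10-fold CV error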

  4. Use the bs() function to fit a regression spline to predict nox using dis. Report the output for the fit using four degrees of freedom. How did you choose the knots? Plot the resulting fit.

    library(splines)
    
    fit_spline_4df <- lm(nox ~ bs(dis, df=4), data=Boston)
    summary(fit_spline_4df)
    ## 
    ## Call:
    ## lm(formula = nox ~ bs(dis, df = 4), data = Boston)
    ## 
    ## Residuals:
    ##       Min        1Q    Median        3Q       Max 
    ## -0.124622 -0.039259 -0.008514  0.020850  0.193891 
    ## 
    ## Coefficients:
    ##                  Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)       0.73447    0.01460  50.306  < 2e-16 ***
    ## bs(dis, df = 4)1 -0.05810    0.02186  -2.658  0.00812 ** 
    ## bs(dis, df = 4)2 -0.46356    0.02366 -19.596  < 2e-16 ***
    ## bs(dis, df = 4)3 -0.19979    0.04311  -4.634 4.58e-06 ***
    ## bs(dis, df = 4)4 -0.38881    0.04551  -8.544  < 2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.06195 on 501 degrees of freedom
    ## Multiple R-squared:  0.7164, Adjusted R-squared:  0.7142 
    ## F-statistic: 316.5 on 4 and 501 DF,  p-value: < 2.2e-16
    plot(Boston$dis, Boston$nox, pch=20, col="gray", xlab="dis", ylab="nox", main="Regression Spline with 4 df")
    lines(dis_range, predict(fit_spline_4df, newdata = data.frame(dis = dis_range)), col="red", lwd=2)

    The regression spline with 4 degrees of freedom provides a smooth fit that explains about 72% of the variation in nox. The knots were not chosen by hand: with df = 4 and the default cubic basis, bs() places a single interior knot automatically at a quantile of dis (its median), with boundary knots at the range of the data.
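
    The knot locations that bs() selected can be verified directly (a quick check):

    attr(bs(Boston$dis, df = 4), "knots")             # interior knot, placed at a quantile of dis
    attr(bs(Boston$dis, df = 4), "Boundary.knots")    # boundary knots at the range of dis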

  5. Now fit a regression spline for a range of degrees of freedom, and plot the resulting fits and report the resulting RSS. Describe the results obtained.

    rss_spline <- numeric(10)
    
    plot(Boston$dis, Boston$nox, pch=20, col="gray", main="Splines with Varying DF", xlab="dis", ylab="nox")
    
    for (df in 3:10) {
      fit <- lm(nox ~ bs(dis, df=df), data=Boston)
      rss_spline[df] <- sum(residuals(fit)^2)
    
      if (df %in% c(3, 4, 5, 7, 10)) {
        lines(dis_range, predict(fit, newdata = data.frame(dis = dis_range)), col=rainbow(10)[df], lwd=2)
      }
    }

    rss_spline
    ##  [1] 0.000000 0.000000 1.934107 1.922775 1.840173 1.833966 1.829884 1.816995
    ##  [9] 1.825653 1.792535

    As degrees of freedom increase in the regression spline, the residual sum of squares (RSS) generally decreases, showing better fit — but improvements become marginal after 6–7 df, suggesting diminishing returns.
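
    Plotting the RSS against the degrees of freedom shows the same flattening (note that rss_spline[1:2] are unused because the loop starts at df = 3):

    plot(3:10, rss_spline[3:10], type = "b", pch = 19,
         xlab = "Degrees of Freedom", ylab = "Training RSS",
         main = "Training RSS for Regression Splines")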

  6. Perform cross-validation or another approach in order to select the best degrees of freedom for a regression spline on this data. Describe your results.

    cv_spline_error <- numeric(10)
    
    for (df in 3:10) {
      fit <- glm(nox ~ bs(dis, df=df), data=Boston)
      cv_spline_error[df] <- cv.glm(Boston, fit, K=10)$delta[2]
    }
    ## (Each CV fold refits the spline on a subset of the data and computes its own
    ## boundary knots, so bs() repeatedly warns that some 'x' values lie beyond the
    ## boundary knots of the training fold; the repeated warnings are omitted here.)
    plot(3:10, cv_spline_error[3:10], type="b", pch=19, col="darkgreen", xlab="Degrees of Freedom", ylab="10-fold CV Error", main="CV for Regression Splines")

    Cross-validation shows that a regression spline with 10 degrees of freedom gives the lowest prediction error, making it the best choice for balancing flexibility and accuracy.
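
    One way to avoid the boundary-knot warnings noted above (a sketch, not the approach used here; dis_limits, fit_fixed, and the df = 6 value are illustrative) is to fix the boundary knots to the full range of dis, so every training fold uses the same basis limits:

    dis_limits <- range(Boston$dis)   # boundary knots shared by every fold
    fit_fixed  <- glm(nox ~ bs(dis, df = 6, Boundary.knots = dis_limits), data = Boston)
    cv.glm(Boston, fit_fixed, K = 10)$delta[2]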

4. In Section 7.7, it was mentioned that GAMs are generally fit using a backfitting approach. The idea behind backfitting is actually quite simple. We will now explore backfitting in the context of multiple linear regression. Suppose that we would like to perform multiple linear regression, but we do not have software to do so. Instead, we only have software to perform simple linear regression. Therefore, we take the following iterative approach: we repeatedly hold all but one coefficient estimate fixed at its current value, and update only that coefficient estimate using a simple linear regression. The process is continued until convergence—that is, until the coefficient estimates stop changing. We now try this out on a toy example.
  1. Generate a response Y and two predictors X1 and X2, with n = 100.

    set.seed(123)  
    
    
    n <- 100
    X1 <- rnorm(n)
    X2 <- rnorm(n)
    
    
    beta0 <- 2
    beta1 <- 3
    beta2 <- 4
    epsilon <- rnorm(n)  
    Y <- beta0 + beta1 * X1 + beta2 * X2 + epsilon
  2. Initialize β̂1 to take on a value of your choice. It does not matter what value you choose.

    beta1 <- 0 
  3. Keeping β̂1 fixed, fit the model Y − β̂1X1 = β0 + β2X2 + ϵ. You can do this as follows: > a <- y - beta1 * x1 > beta2 <- lm(a ~ x2)$coef[2]

    a <- Y - beta1 * X1  
    beta2 <- lm(a ~ X2)$coef[2]  
  4. Keeping β̂2 fixed, fit the model Y − β̂2X2 = β0 + β1X1 + ϵ. You can do this as follows: > a <- y - beta2 * x2 > beta1 <- lm(a ~ x1)$coef[2]

    a <- Y - beta2 * X2  
    beta1 <- lm(a ~ X1)$coef[2]  
  5. Write a for loop to repeat (c) and (d) 1,000 times. Report the estimates of β̂0, β̂1, and β̂2 at each iteration of the for loop. Create a plot in which each of these values is displayed, with β̂0, β̂1, and β̂2 each shown in a different color.

    beta0_vals <- numeric(1000)
    beta1_vals <- numeric(1000)
    beta2_vals <- numeric(1000)
    
    for (i in 1:1000) {
    
      a <- Y - beta1 * X1
      beta2 <- lm(a ~ X2)$coef[2]
    
      a <- Y - beta2 * X2
      beta1 <- lm(a ~ X1)$coef[2]
    
    
      beta0_vals[i] <- mean(Y - beta1 * X1 - beta2 * X2)  # Estimate for beta0
      beta1_vals[i] <- beta1
      beta2_vals[i] <- beta2
    }
    
    
    plot(1:1000, beta0_vals, type = "l", col = "red", ylim = range(c(beta0_vals, beta1_vals, beta2_vals)),
         xlab = "Iteration", ylab = "Coefficient Estimates", main = "Backfitting Iterations")
    lines(1:1000, beta1_vals, col = "blue")
    lines(1:1000, beta2_vals, col = "green")
    legend("topright", legend = c("Beta0", "Beta1", "Beta2"), col = c("red", "blue", "green"), lty = 1)

    Backfitting converged to the multiple linear regression estimates within the first 5–10 iterations, as shown by the stable, overlapping coefficient lines.

  6. Compare your answer in (e) to the results of simply performing multiple linear regression to predict Y using X1 and X2. Use the abline() function to overlay those multiple linear regression coefficient estimates on the plot obtained in (e).

    plot(1:1000, beta0_vals, type = "l", col = "red", ylim = range(c(beta0_vals, beta1_vals, beta2_vals)),
         xlab = "Iteration", ylab = "Coefficient Estimates", main = "Backfitting Iterations")
    lines(1:1000, beta1_vals, col = "blue")
    lines(1:1000, beta2_vals, col = "green")
    legend("topright", legend = c("Beta0", "Beta1", "Beta2"), col = c("red", "blue", "green"), lty = 1)
    
    
    lm_model <- lm(Y ~ X1 + X2)
    
    
    abline(h = coef(lm_model)[1], col = "red", lty = 2)  # Beta0
    abline(h = coef(lm_model)[2], col = "blue", lty = 2)  # Beta1
    abline(h = coef(lm_model)[3], col = "green", lty = 2)  # Beta2

    This plot shows the convergence of the backfitting estimates (β̂0, β̂1, β̂2) over 1,000 iterations. The solid lines are the iterative backfitting values, while the dashed lines show the corresponding multiple linear regression estimates. All three coefficients converge quickly and remain stable, indicating that backfitting recovers the multiple regression coefficient estimates in just a few iterations.

  7. On this data set, how many backfitting iterations were required in order to obtain a “good” approximation to the multiple regression coefficient estimates?

    On this data set, fewer than 10 backfitting iterations were required to obtain a good approximation to the multiple regression coefficient estimates: the coefficient values stabilize and coincide with the dashed multiple-regression lines very early in the process.
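
    This can be quantified from the stored estimates (a quick check, assuming the beta0_vals, beta1_vals, beta2_vals vectors and lm_model from the chunks above are still in memory):

    tol <- 1e-6
    close_enough <- abs(beta0_vals - coef(lm_model)[1]) < tol &
                    abs(beta1_vals - coef(lm_model)[2]) < tol &
                    abs(beta2_vals - coef(lm_model)[3]) < tol
    which(close_enough)[1]    # first iteration at which all three estimates agree to within tol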