library(ISLR)
library(boot)
# Polynomial Regression: Predicting wage using age:
set.seed(123)
# Using 10-fold cross-validation to choose the optimal polynomial degree:
max_degree <- 10
cv_errors <- rep(NA, max_degree)
for (d in 1:max_degree) {
  glm_fit <- glm(wage ~ poly(age, d), data = Wage)
  cv_errors[d] <- cv.glm(Wage, glm_fit, K = 10)$delta[1]
}
print(cv_errors)
## [1] 1676.988 1602.473 1597.036 1594.582 1594.424 1594.157 1596.532 1593.351
## [9] 1593.285 1593.154
# Choosing the degree with minimum CV error:
optimal_degree <- which.min(cv_errors)
cat("Optimal polynomial degree selected by cross-validation:", optimal_degree, "\n")
## Optimal polynomial degree selected by cross-validation: 10
# Fitting the final model with the optimal degree:
fit_poly <- lm(wage ~ poly(age, optimal_degree), data = Wage)
# Plotting the data and the polynomial regression fit:
plot(Wage$age, Wage$wage, col = "grey", main = paste("Polynomial Regression Fit (Degree", optimal_degree, ")"),
xlab = "Age", ylab = "Wage")
age_grid <- seq(min(Wage$age), max(Wage$age), length.out = 100)
preds_poly <- predict(fit_poly, newdata = list(age = age_grid))
lines(age_grid, preds_poly, col = "blue", lwd = 2)
# Comparing with hypothesis testing via ANOVA:
# Fitting polynomial models up to degree 5:
fit1 <- lm(wage ~ poly(age, 1), data = Wage)
fit2 <- lm(wage ~ poly(age, 2), data = Wage)
fit3 <- lm(wage ~ poly(age, 3), data = Wage)
fit4 <- lm(wage ~ poly(age, 4), data = Wage)
fit5 <- lm(wage ~ poly(age, 5), data = Wage)
anova_results <- anova(fit1, fit2, fit3, fit4, fit5, test = "F")
print(anova_results)
## Analysis of Variance Table
##
## Model 1: wage ~ poly(age, 1)
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.5931 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.8888 0.001679 **
## 4 2995 4771604 1 6070 3.8098 0.051046 .
## 5 2994 4770322 1 1283 0.8050 0.369682
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As we can see from the results above, cross-validation selects degree 10. This happens because the CV error keeps drifting down very slightly at higher degrees (apart from small fluctuations), so the raw minimum lands on the largest degree considered even though the differences beyond degree 3 or 4 are tiny.
The ANOVA, in contrast, shows that going from degree 1 to 2 is highly significant (p < 2.2e-16), from 2 to 3 is still significant (p = 0.001679), from 3 to 4 is borderline (p = 0.051046), and from 4 to 5 is not significant (p = 0.369682). This suggests that a lower-degree polynomial, such as degree 3 or 4, may be adequate, with little additional improvement afterwards.
Hence, depending on the objective of the project, either the simpler polynomial model suggested by the ANOVA or the more flexible model suggested by cross-validation could be used.
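One way to reconcile the two criteria is to apply a parsimony rule directly to the CV errors: rather than taking the raw minimum, choose the smallest degree whose error lies within a small tolerance of the best one. The sketch below assumes the cv_errors vector computed above; the 1% tolerance is an arbitrary illustrative choice, not something selected by the analysis.
# A minimal sketch (not part of the original fits): picking the smallest degree
# whose CV error is within a small tolerance of the minimum. The 1% tolerance is
# an arbitrary illustrative choice.
tolerance <- 0.01
best_err <- min(cv_errors)
parsimonious_degree <- min(which(cv_errors <= best_err * (1 + tolerance)))
cat("Smallest degree within 1% of the minimum CV error:", parsimonious_degree, "\n")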
# Using Cross-Validation for Optimal Number of Cuts:
min_cuts <- 2
max_cuts <- 10
cuts_range <- min_cuts:max_cuts
cv_errors_step <- numeric(length(cuts_range))
set.seed(123)
for (i in seq_along(cuts_range)) {
  k <- cuts_range[i]
  Wage$age.cut <- cut(Wage$age, breaks = k, include.lowest = TRUE)
  glm_fit <- glm(wage ~ age.cut, data = Wage)
  cv_errors_step[i] <- cv.glm(Wage, glm_fit, K = 10)$delta[1]
}
print(cv_errors_step)
## [1] 1734.784 1683.196 1636.622 1631.955 1621.842 1612.172 1603.495 1607.430
## [9] 1605.066
optimal_index <- which.min(cv_errors_step)
optimal_cuts <- cuts_range[optimal_index]
cat("Optimal number of age groups selected by cross-validation:", optimal_cuts, "\n")
## Optimal number of age groups selected by cross-validation: 8
# Defining Explicit Breakpoints Using min and max of Age:
age_min <- min(Wage$age, na.rm = TRUE)
age_max <- max(Wage$age, na.rm = TRUE)
# Create equally spaced breakpoints from min to max with (optimal_cuts + 1) points.
user_breaks <- seq(age_min, age_max, length.out = optimal_cuts + 1)
# Creating bins using these explicit breakpoints:
Wage$bin <- cut(Wage$age, breaks = user_breaks, include.lowest = TRUE, labels = FALSE)
# fitting the step function model:
fit_step <- lm(wage ~ factor(bin), data = Wage)
# Preparing Data for Plotting the Step Function:
# Computing midpoints from the explicit breakpoints:
midpoints <- (head(user_breaks, -1) + tail(user_breaks, -1)) / 2
# Computing the average wage per bin:
avg_wage_by_group <- tapply(Wage$wage, factor(Wage$bin, levels = 1:optimal_cuts), mean)
cat("Length of midpoints:", length(midpoints), "\n")
## Length of midpoints: 8
cat("Length of average wages:", length(avg_wage_by_group), "\n")
## Length of average wages: 8
# Plotting the Step Function Fit:
plot(Wage$age, Wage$wage, col = "grey",
main = paste("Step Function Fit (", optimal_cuts, " Groups)", sep = ""),
xlab = "Age", ylab = "Wage")
# Overlaying the step function fit:
points(midpoints, avg_wage_by_group, col = "red", pch = 19, cex = 1.5)
lines(midpoints, avg_wage_by_group, col = "red", lwd = 2)
grid()
From the above results, we can see that a step function does not assume a smooth, continuous relationship (as a polynomial does); instead, it treats each age interval as its own category with a distinct mean wage. Having 8 intervals suggests that wage varies enough across age groups that more than just a few broad bins (e.g., 3 or 4) were needed to capture the underlying pattern, but not so many that we risk overfitting.
This approach is simpler to interpret than a high-degree polynomial (e.g., “From age 20 to 25, the average wage is about 70; from 25 to 30, the average wage is about 85,” etc.), yet still flexible enough to capture the main age-related wage trends.
Overall, the step function with 8 groups is chosen because it strikes a good balance between capturing the wage–age relationship and avoiding unnecessary complexity, as measured by cross-validation.
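To make that interpretation concrete, a small sketch (assuming the user_breaks and avg_wage_by_group objects created above) tabulates each age interval alongside its mean wage:
# A small sketch: pairing each explicit age interval with its average wage.
interval_labels <- paste0("[", round(head(user_breaks, -1), 1), ", ",
                          round(tail(user_breaks, -1), 1), "]")
step_table <- data.frame(age_interval = interval_labels,
                         mean_wage = round(as.numeric(avg_wage_by_group), 1))
print(step_table)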
library(gam)
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.22-5
library(ggplot2)
data("Wage", package = "ISLR")
# Fitting a GAM with a smoothing spline for age, and including maritl and jobclass as predictors.
gam_fit <- gam(wage ~ s(age, 4) + maritl + jobclass, data = Wage)
summary(gam_fit)
##
## Call: gam(formula = wage ~ s(age, 4) + maritl + jobclass, data = Wage)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -110.378 -23.260 -4.772 15.361 201.141
##
## (Dispersion Parameter for gaussian family taken to be 1497.157)
##
## Null Deviance: 5222086 on 2999 degrees of freedom
## Residual Deviance: 4476501 on 2990 degrees of freedom
## AIC: 30459.59
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## s(age, 4) 1 199870 199870 133.499 < 2.2e-16 ***
## maritl 4 141136 35284 23.567 < 2.2e-16 ***
## jobclass 1 173181 173181 115.673 < 2.2e-16 ***
## Residuals 2990 4476501 1497
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## s(age, 4) 3 30.202 < 2.2e-16 ***
## maritl
## jobclass
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Plotting the GAM components:
par(mfrow = c(1, 3))
plot(gam_fit, se = TRUE, col = "blue")
# Resetting the plotting layout:
par(mfrow = c(1, 1))
# Also using ggplot2:
# Plotting wage vs. age with loess fits for each marital status:
p1 <- ggplot(Wage, aes(x = age, y = wage, color = maritl)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "loess", se = TRUE) +
labs(title = "Wage vs. Age by Marital Status",
x = "Age",
y = "Wage") +
theme_minimal()
# Plotting wage vs. age with loess fits for each job class:
p2 <- ggplot(Wage, aes(x = age, y = wage, color = jobclass)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "loess", se = TRUE) +
labs(title = "Wage vs. Age by Job Class",
x = "Age",
y = "Wage") +
theme_minimal()
print(p1)
## `geom_smooth()` using formula = 'y ~ x'
print(p2)
## `geom_smooth()` using formula = 'y ~ x'
The main findings from the output above are as follows:
Age: The smooth function for age (s(age, 4)) is highly significant, both in the parametric and nonparametric ANOVA tables, showing that wage has a non-linear relationship with age.
Marital Status and Job Class: Both categorical predictors significantly contribute to explaining wage variation, as indicated by their very low p-values in the parametric ANOVA.
Model Fit: The reduction in deviance from the null model to the fitted model, along with the significant F-tests, suggests that the combination of a non-linear age effect and the categorical effects of marital status and job class provides a good explanation for the variation in wages.
In summary, the above analysis demonstrates that wages are influenced by a significant non-linear effect of age and by distinct shifts associated with marital status and job class.
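As a further check, nested GAMs can be compared with anova() to ask whether a non-linear term in age is needed at all, as in the ISLR lab. The sketch below uses illustrative model names (gam_fit_none, gam_fit_lin) that are not part of the original analysis.
# A hedged follow-up: F tests comparing a model with no age term, a model with a
# linear age term, and the fitted GAM with a smoothing spline in age.
gam_fit_none <- gam(wage ~ maritl + jobclass, data = Wage)
gam_fit_lin <- gam(wage ~ age + maritl + jobclass, data = Wage)
anova(gam_fit_none, gam_fit_lin, gam_fit, test = "F")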
library(MASS)
library(boot)
library(splines)
set.seed(123)
dis_grid <- seq(min(Boston$dis), max(Boston$dis), length.out = 100)
fit_poly_cubic <- lm(nox ~ poly(dis, 3), data = Boston)
summary(fit_poly_cubic)
##
## Call:
## lm(formula = nox ~ poly(dis, 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.121130 -0.040619 -0.009738 0.023385 0.194904
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.554695 0.002759 201.021 < 2e-16 ***
## poly(dis, 3)1 -2.003096 0.062071 -32.271 < 2e-16 ***
## poly(dis, 3)2 0.856330 0.062071 13.796 < 2e-16 ***
## poly(dis, 3)3 -0.318049 0.062071 -5.124 4.27e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06207 on 502 degrees of freedom
## Multiple R-squared: 0.7148, Adjusted R-squared: 0.7131
## F-statistic: 419.3 on 3 and 502 DF, p-value: < 2.2e-16
# Plotting the data and the cubic polynomial fit:
plot(Boston$dis, Boston$nox, pch = 20, col = "darkgrey",
xlab = "dis", ylab = "nox",
main = "Cubic Polynomial Fit: nox ~ poly(dis, 3)")
preds_cubic <- predict(fit_poly_cubic, newdata = data.frame(dis = dis_grid))
lines(dis_grid, preds_cubic, col = "blue", lwd = 2)
The cubic polynomial regression model for predicting nox using dis is highly significant and explains about 71% of the variability in nox. The overall F-statistic of 419.3 (p < 2e-16) indicates that the model provides a strong fit to the data. Each of the three orthogonal polynomial terms is statistically significant—with the first two having extremely low p-values (< 2e-16) and the third also highly significant (p ≈ 4.27e-07)—which supports the inclusion of a cubic term to capture the non-linear relationship. Additionally, the residual standard error of 0.06207 suggests that the model’s predictions are quite precise. Overall, these results demonstrate that a cubic function of dis effectively captures the complex, non-linear association between distance and nitrogen oxide concentration in the Boston dataset.
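The uncertainty of this fit can also be visualized with approximate pointwise confidence bands; the sketch below assumes the fit_poly_cubic and dis_grid objects from above and that the scatterplot is still the active device.
# A sketch adding approximate 95% pointwise confidence bands around the cubic fit.
preds_se <- predict(fit_poly_cubic, newdata = data.frame(dis = dis_grid), se.fit = TRUE)
lines(dis_grid, preds_se$fit + 2 * preds_se$se.fit, col = "blue", lty = 2)
lines(dis_grid, preds_se$fit - 2 * preds_se$se.fit, col = "blue", lty = 2)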
rss_values <- numeric(10)
colors <- rainbow(10)
plot(Boston$dis, Boston$nox, pch = 20, col = "darkgrey",
xlab = "dis", ylab = "nox",
main = "Polynomial Fits (Degrees 1 to 10)")
for (d in 1:10) {
  fit_poly <- lm(nox ~ poly(dis, d), data = Boston)
  preds <- predict(fit_poly, newdata = data.frame(dis = dis_grid))
  lines(dis_grid, preds, col = colors[d], lwd = 2)
  rss_values[d] <- sum(resid(fit_poly)^2)
}
legend("topright", legend = paste("Degree", 1:10), col = colors, lwd = 2, cex = 0.8)
print(rss_values)
## [1] 2.768563 2.035262 1.934107 1.932981 1.915290 1.878257 1.849484 1.835630
## [9] 1.833331 1.832171
The vector gives the residual sum of squares (RSS) for polynomial regression models of degree 1 through 10. The RSS drops substantially from the degree-1 model (2.768563) to the degree-2 model (2.035262), indicating a clear improvement from adding a quadratic term. Beyond degree 2 the decrease is much smaller; for instance, the degree-3 model has an RSS of 1.934107, and by degree 4 it is nearly unchanged at 1.932981. From degree 5 onward the RSS continues to decrease gradually, reaching around 1.83 for degrees 8-10.
This pattern suggests that while higher-degree polynomials can capture more subtle non-linearities, most of the improvement in fit is achieved with a quadratic or cubic model, and further increases in complexity yield only marginal gains.
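The diminishing returns are easier to see when the RSS values are plotted against degree; a minimal sketch:
# A minimal sketch: training RSS versus polynomial degree.
plot(1:10, rss_values, type = "b",
     xlab = "Polynomial Degree", ylab = "Training RSS",
     main = "Training RSS by Polynomial Degree")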
cv_errors <- numeric(10)
for (d in 1:10) {
  fit_poly_cv <- glm(nox ~ poly(dis, d), data = Boston)
  cv_errors[d] <- cv.glm(Boston, fit_poly_cv, K = 10)$delta[1]
}
plot(1:10, cv_errors, type = "b",
xlab = "Polynomial Degree", ylab = "CV Error",
main = "CV Error for Polynomial Fits")
optimal_degree <- which.min(cv_errors)
cat("Optimal polynomial degree (based on CV):", optimal_degree, "\n")
## Optimal polynomial degree (based on CV): 3
print(cv_errors)
## [1] 0.005513459 0.004083877 0.003873764 0.003912737 0.004129041 0.005440390
## [7] 0.010297393 0.013651886 0.014866196 0.008931853
The cross-validation results indicate that the optimal polynomial degree is 3. The CV errors for degrees 1 through 10 are approximately 0.00551, 0.00408, 0.00387, 0.00391, 0.00413, 0.00544, 0.01030, 0.01365, 0.01487, and 0.00893, respectively. We can notice that the error drops significantly from degree 1 to degree 2 and reaches its minimum at degree 3 (about 0.00387), then increases slightly at degree 4 and degree 5 before rising substantially for higher degrees.
This pattern suggests that a 3rd-degree polynomial strikes the best balance between capturing the underlying non-linear relationship and avoiding overfitting, yielding the lowest prediction error on unseen data.
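As a cross-check of the CV choice, the cubic fit can be compared with a quartic alternative using an F test; the fit names below (fit_dis3, fit_dis4) are illustrative and not part of the original code.
# A hedged cross-check: does adding a fourth-degree term improve on the cubic?
fit_dis3 <- lm(nox ~ poly(dis, 3), data = Boston)
fit_dis4 <- lm(nox ~ poly(dis, 4), data = Boston)
anova(fit_dis3, fit_dis4)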
fit_spline4 <- lm(nox ~ bs(dis, df = 4), data = Boston)
summary(fit_spline4)
##
## Call:
## lm(formula = nox ~ bs(dis, df = 4), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.124622 -0.039259 -0.008514 0.020850 0.193891
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.73447 0.01460 50.306 < 2e-16 ***
## bs(dis, df = 4)1 -0.05810 0.02186 -2.658 0.00812 **
## bs(dis, df = 4)2 -0.46356 0.02366 -19.596 < 2e-16 ***
## bs(dis, df = 4)3 -0.19979 0.04311 -4.634 4.58e-06 ***
## bs(dis, df = 4)4 -0.38881 0.04551 -8.544 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06195 on 501 degrees of freedom
## Multiple R-squared: 0.7164, Adjusted R-squared: 0.7142
## F-statistic: 316.5 on 4 and 501 DF, p-value: < 2.2e-16
# Plotting the spline fit:
plot(Boston$dis, Boston$nox, pch = 20, col = "darkgrey",
xlab = "dis", ylab = "nox",
main = "Regression Spline (4 DF)")
preds_spline4 <- predict(fit_spline4, newdata = data.frame(dis = dis_grid))
lines(dis_grid, preds_spline4, col = "red", lwd = 2)
The regression spline model using bs(dis, df = 4) fits nox as a flexible, non-linear function of dis. The overall model is highly significant, with an F-statistic of 316.5 (p < 2.2e-16) and explains approximately 71.6% of the variance in nox.
All spline basis function coefficients are statistically significant, indicating that the non-linear relationship is well captured; notably, the negative coefficients suggest that, generally, higher dis values are associated with lower nox levels, although the relationship is modeled flexibly. The residual standard error of about 0.062 indicates that the typical prediction error is small.
With bs(dis, df = 4) and the default cubic degree, the knots are chosen automatically: df - 3 = 1 interior knot is placed at the median of dis, with boundary knots at its minimum and maximum. This provides a good balance between flexibility and simplicity.
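The chosen knots can be inspected directly from the basis object; a small sketch (basis_df4 is an illustrative name):
# A small sketch: inspecting the knots that bs() uses for df = 4 (one interior
# knot plus the two boundary knots).
basis_df4 <- bs(Boston$dis, df = 4)
attr(basis_df4, "knots")            # interior knot(s)
attr(basis_df4, "Boundary.knots")   # boundary knots (range of dis)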
df_range <- 3:10
rss_spline <- numeric(length(df_range))
colors_spline <- rainbow(length(df_range))
plot(Boston$dis, Boston$nox, pch = 20, col = "darkgrey",
xlab = "dis", ylab = "nox",
main = "Regression Splines: Different DF")
for (i in seq_along(df_range)) {
  df_val <- df_range[i]
  fit_spline <- lm(nox ~ bs(dis, df = df_val), data = Boston)
  preds_spline <- predict(fit_spline, newdata = data.frame(dis = dis_grid))
  lines(dis_grid, preds_spline, col = colors_spline[i], lwd = 2)
  rss_spline[i] <- sum(resid(fit_spline)^2)
}
legend("topright", legend = paste("df", df_range), col = colors_spline, lwd = 2, cex = 0.8)
print(rss_spline)
## [1] 1.934107 1.922775 1.840173 1.833966 1.829884 1.816995 1.825653 1.792535
These RSS values are the residual sums of squares for regression splines fitted with increasing degrees of freedom, from 3 to 10. Starting at an RSS of 1.934107, the error decreases as the model becomes more flexible, reaching values as low as 1.792535.
The pattern shows that increasing the degrees of freedom generally improves the fit to the training data by reducing the RSS. However, the improvements become marginal at higher degrees of freedom, and slight fluctuations (for example, the small increase at df = 9) indicate that beyond a certain point additional flexibility does not substantially improve the fit.
This suggests that a moderate level of complexity might be sufficient to capture the underlying relationship without overfitting.
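As with the polynomials, plotting the RSS against the degrees of freedom makes the flattening visible; a minimal sketch:
# A minimal sketch: training RSS versus spline degrees of freedom.
plot(df_range, rss_spline, type = "b",
     xlab = "Degrees of Freedom", ylab = "Training RSS",
     main = "Training RSS by Spline Degrees of Freedom")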
cv_errors_spline <- numeric(length(df_range))
for (i in seq_along(df_range)) {
  df_val <- df_range[i]
  fit_spline_cv <- glm(nox ~ bs(dis, df = df_val), data = Boston)
  cv_errors_spline[i] <- cv.glm(Boston, fit_spline_cv, K = 10)$delta[1]
}
## Warning in bs(dis, degree = 3L, knots = numeric(0), Boundary.knots = c(1.137, :
## some 'x' values beyond boundary knots may cause ill-conditioned bases
cv.glm() raises this warning repeatedly during the cross-validation loop: in each fold, some held-out dis values lie outside the boundary knots determined from the training folds, so the spline basis is extrapolated there. The warnings are expected in this setting and the fits still run.
plot(df_range, cv_errors_spline, type = "b",
xlab = "Degrees of Freedom", ylab = "CV Error",
main = "CV Error for Regression Splines")
optimal_df <- df_range[which.min(cv_errors_spline)]
cat("Optimal degrees of freedom for regression spline (based on CV):", optimal_df, "\n")
## Optimal degrees of freedom for regression spline (based on CV): 8
print(cv_errors_spline)
## [1] 0.003871478 0.003896473 0.003713335 0.003689431 0.003737134 0.003681008
## [7] 0.003733003 0.003686694
The cross-validation results for the regression spline indicate that the optimal number of degrees of freedom is 8. The CV errors across the df range are very similar, from about 0.00390 at the high end down to roughly 0.00368 at the minimum. Although the differences are small, the lowest error is achieved at 8 degrees of freedom, suggesting that this level of flexibility best captures the non-linear relationship between dis and nox without overfitting.
In summary, a spline with 8 degrees of freedom provides the best balance between model fit and predictive performance based on these CV error estimates, although the nearly flat error curve suggests that a somewhat simpler spline would predict almost as well.
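An alternative, shown as a sketch below rather than as part of the original analysis, is a smoothing spline, which selects its own effective degrees of freedom by generalized cross-validation.
# A hedged alternative: a smoothing spline whose effective degrees of freedom are
# chosen by generalized cross-validation (the default criterion).
fit_smooth <- smooth.spline(Boston$dis, Boston$nox)
fit_smooth$df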
set.seed(123)
n <- 100
X1 <- rnorm(n, mean = 0, sd = sqrt(2))
X2 <- rnorm(n, mean = 1, sd = sqrt(2))
# Defining true coefficients:
beta0_true <- 1
beta1_true <- 2
beta2_true <- 3
# Generating response Y with added noise ~ N(0,1):
Y <- beta0_true + beta1_true * X1 + beta2_true * X2 + rnorm(n, mean = 0, sd = 1)
beta1 <- 0
beta2 <- 0
a <- Y - beta1 * X1
fit_beta2 <- lm(a ~ X2)
beta2 <- coef(fit_beta2)[2]
# Capturing beta0 estimate:
beta0_temp <- coef(fit_beta2)[1]
a <- Y - beta2 * X2
fit_beta1 <- lm(a ~ X1)
beta1 <- coef(fit_beta1)[2]
# Updating the intercept (beta0 estimate):
beta0_temp <- coef(fit_beta1)[1]
n_iter <- 1000
beta0_store <- numeric(n_iter)
beta1_store <- numeric(n_iter)
beta2_store <- numeric(n_iter)
# Reinitializing beta1 and beta2 arbitrarily:
beta1 <- 0
beta2 <- 0
for (i in 1:n_iter) {
  # Updating beta2 (holding beta1 fixed):
  a <- Y - beta1 * X1
  fit_beta2 <- lm(a ~ X2)
  beta2 <- coef(fit_beta2)[2]
  beta0_temp <- coef(fit_beta2)[1]
  # Updating beta1 (holding beta2 fixed):
  a <- Y - beta2 * X2
  fit_beta1 <- lm(a ~ X1)
  beta1 <- coef(fit_beta1)[2]
  beta0_temp <- coef(fit_beta1)[1]  # update beta0
  # Storing current estimates:
  beta0_store[i] <- beta0_temp
  beta1_store[i] <- beta1
  beta2_store[i] <- beta2
}
# Plotting the evolution of the coefficients:
plot(1:n_iter, beta0_store, type = "l", col = "blue",
ylim = range(c(beta0_store, beta1_store, beta2_store)),
xlab = "Iteration", ylab = "Coefficient Estimates",
main = "Backfitting Coefficient Estimates")
lines(1:n_iter, beta1_store, col = "red")
lines(1:n_iter, beta2_store, col = "green")
legend("topright", legend = c("beta0", "beta1", "beta2"),
col = c("blue", "red", "green"), lty = 1)
plot(1:n_iter, beta0_store, type = "l", col = "blue",
ylim = range(c(beta0_store, beta1_store, beta2_store)),
xlab = "Iteration", ylab = "Coefficient Estimates",
main = "Backfitting Coefficient Estimates")
lines(1:n_iter, beta1_store, col = "red")
lines(1:n_iter, beta2_store, col = "green")
legend("topright", legend = c("beta0", "beta1", "beta2"),
col = c("blue", "red", "green"), lty = 1)
# standard multiple linear regression for comparison:
fit_ml <- lm(Y ~ X1 + X2)
ml_coef <- coef(fit_ml)
print(ml_coef)
## (Intercept) X1 X2
## 1.118228 1.905834 3.016837
# overlaying horizontal lines corresponding to these estimates:
abline(h = ml_coef[1], col = "blue", lty = 2)
abline(h = ml_coef[2], col = "red", lty = 2)
abline(h = ml_coef[3], col = "green", lty = 2)
The results from (e) and (f) are essentially identical. In both, the coefficient estimates settle immediately at roughly β0 ≈ 1.12, β1 ≈ 1.91, and β2 ≈ 3.02, matching the multiple regression solution and close to the true values of 1, 2, and 3, with no further adjustment needed afterwards.
In the plot from (e), the coefficients essentially jump to their final values at the first iteration and remain flat for all subsequent iterations. This indicates that, for this data set and these starting values, the backfitting algorithm converged in just one iteration: after the first update there was no meaningful change in the coefficient estimates, so the solution was effectively reached right away.
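The claim of one-iteration convergence can be checked numerically; the sketch below assumes the beta0_store, beta1_store, and beta2_store vectors from the loop above and uses an arbitrary 1e-8 tolerance.
# A sketch: the first iteration after which the total change in the three
# estimates falls below a small tolerance (1e-8 is an arbitrary choice).
tol <- 1e-8
changes <- abs(diff(beta0_store)) + abs(diff(beta1_store)) + abs(diff(beta2_store))
cat("Estimates stable (to within tol) from iteration:", which(changes < tol)[1] + 1, "\n")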