6(a) Polynomial Regression and Model Selection

To examine the relationship between age and wage, I started by fitting polynomial regression models of degrees 1 through 10. I used 10-fold cross-validation to identify the optimal polynomial degree that minimizes prediction error. For this, I used the cv.glm() function from the boot package. I stored the cross-validation error for each degree and plotted it to visually compare model performance.

# Load packages
library(ISLR)

Attaching package: ‘ISLR’

The following object is masked _by_ ‘.GlobalEnv’:

    Auto
library(boot)
library(ggplot2)

# Load the dataset
data(Wage)

# Set seed for reproducibility
set.seed(1)

# Cross-validation to choose polynomial degree
cv.errors <- rep(NA, 10)
for (d in 1:10) {
  glm.fit <- glm(wage ~ poly(age, d), data = Wage)
  cv.errors[d] <- cv.glm(Wage, glm.fit, K = 10)$delta[1]
}

# Plot CV error vs. degree
plot(1:10, cv.errors, type = "b", xlab = "Degree", ylab = "CV Error")


# Optimal degree
optimal_degree <- which.min(cv.errors)
cat("Optimal polynomial degree chosen by cross-validation:", optimal_degree, "\n")
Optimal polynomial degree chosen by cross-validation: 9 

From the cross-validation plot, I observed that the polynomial of degree r optimal_degree had the lowest prediction error, so I selected it as the best model for predicting wage from age.

To compare this data-driven result with classical hypothesis testing, I performed an ANOVA test using anova() with nested polynomial models. This helped me determine if increasing the polynomial degree significantly improved model performance.

# Fit models for ANOVA comparison
fit1 <- lm(wage ~ poly(age, 1), data = Wage)
fit2 <- lm(wage ~ poly(age, 2), data = Wage)
fit3 <- lm(wage ~ poly(age, 3), data = Wage)
fit4 <- lm(wage ~ poly(age, 4), data = Wage)
fit5 <- lm(wage ~ poly(age, 5), data = Wage)

# Compare using ANOVA (F-test)
anova(fit1, fit2, fit3, fit4, fit5)
Analysis of Variance Table

Model 1: wage ~ poly(age, 1)
Model 2: wage ~ poly(age, 2)
Model 3: wage ~ poly(age, 3)
Model 4: wage ~ poly(age, 4)
Model 5: wage ~ poly(age, 5)
  Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
1   2998 5022216                                    
2   2997 4793430  1    228786 143.5931 < 2.2e-16 ***
3   2996 4777674  1     15756   9.8888  0.001679 ** 
4   2995 4771604  1      6070   3.8098  0.051046 .  
5   2994 4770322  1      1283   0.8050  0.369682    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The ANOVA results showed which increases in polynomial degree were statistically significant. For example, if p-values became non-significant after degree 3 or 4, that suggested higher degrees didn’t add much value—something I compared against the CV result to confirm consistency.

Next, I visualized the polynomial regression curve using the selected degree:

# Plotting the best polynomial fit
age.range <- range(Wage$age)
age.grid <- seq(from = age.range[1], to = age.range[2])
fit.best <- lm(wage ~ poly(age, optimal_degree), data = Wage)
preds <- predict(fit.best, newdata = data.frame(age = age.grid))

plot(Wage$age, Wage$wage, col = "gray", main = paste("Polynomial Fit (Degree =", optimal_degree, ")"))
lines(age.grid, preds, col = "blue", lwd = 2)

This plot clearly showed a smooth and flexible curve that captured the non-linear relationship between age and wage. The polynomial fit provided a much better representation than a simple linear model.

(b) Step Function Model with Cross-Validation

After the polynomial approach, I wanted to try modeling wage using a step function. I divided age into different numbers of intervals (or “cuts”) ranging from 2 to 10, and treated those intervals as categorical variables in a linear model. Again, I used 10-fold cross-validation to find the optimal number of cuts.

# Step function CV
cv.errors.step <- rep(NA, 9)
for (k in 2:10) {
  Wage$age.cut <- cut(Wage$age, k)
  glm.fit <- glm(wage ~ age.cut, data = Wage)
  cv.errors.step[k - 1] <- cv.glm(Wage, glm.fit, K = 10)$delta[1]
}

# Plot CV error vs. number of cuts
plot(2:10, cv.errors.step, type = "b", xlab = "Number of cuts", ylab = "CV Error")


# Optimal number of cuts
best.cut <- which.min(cv.errors.step) + 1
cat("Optimal number of cuts for step function:", best.cut, "\n")
Optimal number of cuts for step function: 8 

Based on the cross-validation results, I found that r best.cut cuts gave the best performance. I used that to fit a step function model and visualize it.

# Fit best step function
Wage$age.cut <- cut(Wage$age, best.cut)
fit.step <- lm(wage ~ age.cut, data = Wage)

# Predict using the step function model
age.grid <- seq(from = min(Wage$age), to = max(Wage$age))
age.cut.grid <- cut(age.grid, best.cut)
preds.step <- predict(fit.step, newdata = data.frame(age.cut = age.cut.grid))

# Plot
plot(Wage$age, Wage$wage, col = "gray", main = paste("Step Function (", best.cut, " cuts)", sep=""))
lines(age.grid, preds.step, col = "red", lwd = 2)

This plot showed a piecewise constant function—wage predictions were flat within each age interval. While it wasn’t as smooth as the polynomial regression, it still effectively captured abrupt changes and overall patterns in the data.

7 Exploring Other Predictors of Wage

To further understand what affects wage, I explored other variables in the Wage dataset—specifically marital status (maritl), job class (jobclass), and education (education). Since these are categorical, I can include them directly in the regression model. R will handle them automatically by assigning a reference level and fitting indicator variables.

(a) Effect of Marital Status on Wage

I first examined how maritl affects wage:

# Load data and package
library(ISLR)
data(Wage)

# Fit a model with marital status
fit_maritl <- lm(wage ~ maritl, data = Wage)
summary(fit_maritl)

Call:
lm(formula = wage ~ maritl, data = Wage)

Residuals:
    Min      1Q  Median      3Q     Max 
-98.775 -24.788  -4.754  15.845 221.595 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)          92.735      1.582  58.608  < 2e-16 ***
maritl2. Married     26.126      1.813  14.413  < 2e-16 ***
maritl3. Widowed      6.804      9.375   0.726  0.46804    
maritl4. Divorced    10.425      3.234   3.224  0.00128 ** 
maritl5. Separated    8.481      5.657   1.499  0.13392    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 40.28 on 2995 degrees of freedom
Multiple R-squared:  0.06954,   Adjusted R-squared:  0.0683 
F-statistic: 55.96 on 4 and 2995 DF,  p-value: < 2.2e-16
# Boxplot to visualize
boxplot(wage ~ maritl, data = Wage, col = "lightblue", main = "Wage vs Marital Status", ylab = "Wage")

The model summary showed that married individuals (the reference group) generally earned more on average. The boxplot confirmed this—married people tended to have higher median wages, while individuals who were never married or widowed had lower distributions.

(b) Effect of Job Class on Wage

Next, I analyzed the impact of job type (jobclass):

# Fit a model with jobclass
fit_job <- lm(wage ~ jobclass, data = Wage)
summary(fit_job)

Call:
lm(formula = wage ~ jobclass, data = Wage)

Residuals:
     Min       1Q   Median       3Q      Max 
-100.507  -25.362   -6.117   15.697  197.750 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)             103.321      1.039   99.43   <2e-16 ***
jobclass2. Information   17.272      1.492   11.58   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 40.83 on 2998 degrees of freedom
Multiple R-squared:  0.04281,   Adjusted R-squared:  0.04249 
F-statistic: 134.1 on 1 and 2998 DF,  p-value: < 2.2e-16
# Boxplot
boxplot(wage ~ jobclass, data = Wage, col = "lightgreen", main = "Wage by Job Class", ylab = "Wage")

The results showed that individuals in the Information job class earned significantly more than those in the Industrial job class. This was evident in both the regression coefficients and the boxplot.

(c) Combined Effects: Add Education

Then I created a more flexible model including multiple predictors:

# Fit a model with multiple categorical predictors
fit_all <- lm(wage ~ maritl + jobclass + education, data = Wage)
summary(fit_all)

Call:
lm(formula = wage ~ maritl + jobclass + education, data = Wage)

Residuals:
    Min      1Q  Median      3Q     Max 
-109.75  -19.81   -2.74   14.85  219.37 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                   66.632      2.502  26.635  < 2e-16 ***
maritl2. Married              22.332      1.594  14.006  < 2e-16 ***
maritl3. Widowed               7.309      8.214   0.890 0.373597    
maritl4. Divorced              9.723      2.833   3.432 0.000607 ***
maritl5. Separated            15.783      4.972   3.174 0.001518 ** 
jobclass2. Information         5.199      1.355   3.836 0.000127 ***
education2. HS Grad           11.268      2.440   4.618 4.04e-06 ***
education3. Some College      23.128      2.579   8.966  < 2e-16 ***
education4. College Grad      37.956      2.584  14.689  < 2e-16 ***
education5. Advanced Degree   61.881      2.840  21.793  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 35.27 on 2990 degrees of freedom
Multiple R-squared:  0.2876,    Adjusted R-squared:  0.2855 
F-statistic: 134.1 on 9 and 2990 DF,  p-value: < 2.2e-16
# Visualize wage vs education
boxplot(wage ~ education, data = Wage, col = "peachpuff", main = "Wage by Education Level", ylab = "Wage")

The education level had a clear positive association with wage. Higher education levels (like “Grad School” or “College Grad”) were associated with higher average wages. The boxplot showed this trend clearly.

#Summary Marital Status: Married individuals tend to earn more. “Never Married” and “Widowed” categories showed lower average wages.

Job Class: Information workers consistently earn more than those in industrial roles.

Education: There is a clear positive trend between education level and wage, with higher education corresponding to higher average wages.

By including these categorical variables in the regression model, I was able to flexibly capture non-linear relationships without having to manually encode the categories. The boxplots provided intuitive visual support for the regression findings.

9. Modeling NOx Concentration Using Distance (dis)

(a) Fit a Cubic Polynomial Regression

To start, I modeled nox as a function of dis using a cubic polynomial regression. I used the poly() function to fit an orthogonal polynomial and visualized the resulting curve.

# Load required packages
library(MASS)

Attaching package: ‘MASS’

The following object is masked _by_ ‘.GlobalEnv’:

    Boston
library(ggplot2)

# Load Boston data
data(Boston)

# Fit cubic polynomial regression
fit_cubic <- lm(nox ~ poly(dis, 3), data = Boston)
summary(fit_cubic)

Call:
lm(formula = nox ~ poly(dis, 3), data = Boston)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.121130 -0.040619 -0.009738  0.023385  0.194904 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    0.554695   0.002759 201.021  < 2e-16 ***
poly(dis, 3)1 -2.003096   0.062071 -32.271  < 2e-16 ***
poly(dis, 3)2  0.856330   0.062071  13.796  < 2e-16 ***
poly(dis, 3)3 -0.318049   0.062071  -5.124 4.27e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.06207 on 502 degrees of freedom
Multiple R-squared:  0.7148,    Adjusted R-squared:  0.7131 
F-statistic: 419.3 on 3 and 502 DF,  p-value: < 2.2e-16
# Plot data and cubic fit
dis.grid <- seq(min(Boston$dis), max(Boston$dis), length.out = 100)
preds <- predict(fit_cubic, newdata = data.frame(dis = dis.grid))

plot(Boston$dis, Boston$nox, col = "gray", main = "Cubic Polynomial Fit", xlab = "dis", ylab = "nox")
lines(dis.grid, preds, col = "blue", lwd = 2)

The cubic fit showed a non-linear relationship between dis and nox, where NOx concentration decreases with increasing distance from employment centers, with some curvature.

(b) Fit Polynomials of Degree 1 to 10 and Compare RSS

Next, I fit polynomial regressions of degrees 1 through 10 and plotted the fits. I also recorded the residual sum of squares (RSS) for each model to assess model complexity.

rss <- numeric(10)

plot(Boston$dis, Boston$nox, col = "lightgray", main = "Polynomial Fits of Varying Degrees", xlab = "dis", ylab = "nox")

for (d in 1:10) {
  fit <- lm(nox ~ poly(dis, d), data = Boston)
  rss[d] <- sum(resid(fit)^2)
  preds <- predict(fit, newdata = data.frame(dis = dis.grid))
  lines(dis.grid, preds, col = rainbow(10)[d])
}

legend("topright", legend = paste("Degree", 1:10), col = rainbow(10), lty = 1, cex = 0.6)

print(rss)
 [1] 2.768563 2.035262 1.934107 1.932981 1.915290 1.878257 1.849484 1.835630 1.833331 1.832171

The RSS values generally decreased with increasing degree, but the improvements diminished after a certain point, suggesting potential overfitting.

(c) Cross-Validation to Select Optimal Degree

To choose the best degree objectively, I used 10-fold cross-validation to estimate the prediction error for each polynomial degree.

library(boot)
set.seed(1)

cv.error <- numeric(10)
for (d in 1:10) {
  glm.fit <- glm(nox ~ poly(dis, d), data = Boston)
  cv.error[d] <- cv.glm(Boston, glm.fit, K = 10)$delta[1]
}

plot(1:10, cv.error, type = "b", xlab = "Degree", ylab = "CV Error", main = "CV Error by Polynomial Degree")

optimal_degree <- which.min(cv.error)
cat("Optimal polynomial degree by CV:", optimal_degree, "\n")
Optimal polynomial degree by CV: 4 

Cross-validation suggested that a polynomial of degree r optimal_degree offered the best bias-variance trade-off, balancing flexibility with generalization.

(d) Fit a Regression Spline with 4 Degrees of Freedom

I then used a regression spline via the bs() function with 4 degrees of freedom. I didn’t manually set knots; R automatically placed them based on quantiles.

library(splines)

# Fit spline with 4 df
fit_spline4 <- lm(nox ~ bs(dis, df = 4), data = Boston)
summary(fit_spline4)

Call:
lm(formula = nox ~ bs(dis, df = 4), data = Boston)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.124622 -0.039259 -0.008514  0.020850  0.193891 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       0.73447    0.01460  50.306  < 2e-16 ***
bs(dis, df = 4)1 -0.05810    0.02186  -2.658  0.00812 ** 
bs(dis, df = 4)2 -0.46356    0.02366 -19.596  < 2e-16 ***
bs(dis, df = 4)3 -0.19979    0.04311  -4.634 4.58e-06 ***
bs(dis, df = 4)4 -0.38881    0.04551  -8.544  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.06195 on 501 degrees of freedom
Multiple R-squared:  0.7164,    Adjusted R-squared:  0.7142 
F-statistic: 316.5 on 4 and 501 DF,  p-value: < 2.2e-16
# Predict and plot
preds.spline4 <- predict(fit_spline4, newdata = data.frame(dis = dis.grid))
plot(Boston$dis, Boston$nox, col = "gray", main = "Regression Spline (df = 4)", xlab = "dis", ylab = "nox")
lines(dis.grid, preds.spline4, col = "red", lwd = 2)

The regression spline with 4 degrees of freedom captured non-linearity smoothly and offered better flexibility than a low-degree polynomial without high variance.

(e) Fit Regression Splines with Varying Degrees of Freedom

I fit splines with degrees of freedom from 3 to 10 and computed RSS for each to examine model fit.

rss.spline <- numeric(8)
plot(Boston$dis, Boston$nox, col = "gray", main = "Splines with Varying df", xlab = "dis", ylab = "nox")

for (df in 3:10) {
  fit <- lm(nox ~ bs(dis, df = df), data = Boston)
  rss.spline[df - 2] <- sum(resid(fit)^2)
  lines(dis.grid, predict(fit, newdata = data.frame(dis = dis.grid)), col = rainbow(8)[df - 2])
}

legend("topright", legend = paste("df", 3:10), col = rainbow(8), lty = 1, cex = 0.6)

print(rss.spline)
[1] 1.934107 1.922775 1.840173 1.833966 1.829884 1.816995 1.825653 1.792535

RSS decreased as degrees of freedom increased, indicating better fit, but the improvements tapered off—highlighting the need for a balanced model.

(f) Cross-Validation to Select Best Spline Degrees of Freedom

Finally, I used cross-validation to find the optimal degrees of freedom for the spline.

cv.spline <- numeric(8)
set.seed(2)

for (df in 3:10) {
  fit <- glm(nox ~ bs(dis, df = df), data = Boston)
  cv.spline[df - 2] <- cv.glm(Boston, fit, K = 10)$delta[1]
}
Warning in bs(dis, degree = 3L, knots = numeric(0), Boundary.knots = c(1.137,  :
  some 'x' values beyond boundary knots may cause ill-conditioned bases
Warning in bs(dis, degree = 3L, knots = numeric(0), Boundary.knots = c(1.137,  :
  some 'x' values beyond boundary knots may cause ill-conditioned bases
Warning in bs(dis, degree = 3L, knots = 3.2157, Boundary.knots = c(1.137,  :
  some 'x' values beyond boundary knots may cause ill-conditioned bases
Warning in bs(dis, degree = 3L, knots = 3.2157, Boundary.knots = c(1.137,  :
  some 'x' values beyond boundary knots may cause ill-conditioned bases
Warning in bs(dis, degree = 3L, knots = 3.19095, Boundary.knots = c(1.1296,  :
  some 'x' values beyond boundary knots may cause ill-conditioned bases
Warning in bs(dis, degree = 3L, knots = 3.19095, Boundary.knots = c(1.1296,  :
  some 'x' values beyond boundary knots may cause ill-conditioned bases
Warning in bs(dis, degree = 3L, knots = c(2.3817, 4.166), Boundary.knots = c(1.137,  :
  some 'x' values beyond boundary knots may cause ill-conditioned bases
Warning in bs(dis, degree = 3L, knots = c(2.3817, 4.166), Boundary.knots = c(1.137,  :
  some 'x' values beyond boundary knots may cause ill-conditioned bases
Warning in bs(dis, degree = 3L, knots = c(2.40603333333333, 4.4278), Boundary.knots = c(1.1296,  :
  some 'x' values beyond boundary knots may cause ill-conditioned bases
Warning in bs(dis, degree = 3L, knots = c(2.40603333333333, 4.4278), Boundary.knots = c(1.1296,  :
  some 'x' values beyond boundary knots may cause ill-conditioned bases
Warning in bs(dis, degree = 3L, knots = c(2.1088, 3.2721, 5.118), Boundary.knots = c(1.1296,  :
  some 'x' values beyond boundary knots may cause ill-conditioned bases
Warning in bs(dis, degree = 3L, knots = c(2.1088, 3.2721, 5.118), Boundary.knots = c(1.1296,  :
  some 'x' values beyond boundary knots may cause ill-conditioned bases
Warning in bs(dis, degree = 3L, knots = c(2.09345, 3.1121, 5.03375), Boundary.knots = c(1.137,  :
  some 'x' values beyond boundary knots may cause ill-conditioned bases
Warning in bs(dis, degree = 3L, knots = c(2.09345, 3.1121, 5.03375), Boundary.knots = c(1.137,  :
  some 'x' values beyond boundary knots may cause ill-conditioned bases
Warning in bs(dis, degree = 3L, knots = c(1.9784, 2.7006, 3.8771, 5.6894 :
  some 'x' values beyond boundary knots may cause ill-conditioned bases
Warning in bs(dis, degree = 3L, knots = c(1.9784, 2.7006, 3.8771, 5.6894 :
  some 'x' values beyond boundary knots may cause ill-conditioned bases
Warning in bs(dis, degree = 3L, knots = c(1.9356, 2.5975, 3.7979, 5.7209 :
  some 'x' values beyond boundary knots may cause ill-conditioned bases
Warning in bs(dis, degree = 3L, knots = c(1.9356, 2.5975, 3.7979, 5.7209 :
  some 'x' values beyond boundary knots may cause ill-conditioned bases
Warning in bs(dis, degree = 3L, knots = c(1.86156666666667, 2.40296666666667,  :
  some 'x' values beyond boundary knots may cause ill-conditioned bases
Warning in bs(dis, degree = 3L, knots = c(1.86156666666667, 2.40296666666667,  :
  some 'x' values beyond boundary knots may cause ill-conditioned bases
Warning in bs(dis, degree = 3L, knots = c(1.86565, 2.41396666666667, 3.23925,  :
  some 'x' values beyond boundary knots may cause ill-conditioned bases
Warning in bs(dis, degree = 3L, knots = c(1.86565, 2.41396666666667, 3.23925,  :
  some 'x' values beyond boundary knots may cause ill-conditioned bases
Warning in bs(dis, degree = 3L, knots = c(1.79777142857143, 2.19428571428571,  :
  some 'x' values beyond boundary knots may cause ill-conditioned bases
Warning in bs(dis, degree = 3L, knots = c(1.79777142857143, 2.19428571428571,  :
  some 'x' values beyond boundary knots may cause ill-conditioned bases
Warning in bs(dis, degree = 3L, knots = c(1.79078571428571, 2.19385714285714,  :
  some 'x' values beyond boundary knots may cause ill-conditioned bases
Warning in bs(dis, degree = 3L, knots = c(1.79078571428571, 2.19385714285714,  :
  some 'x' values beyond boundary knots may cause ill-conditioned bases
Warning in bs(dis, degree = 3L, knots = c(1.743225, 2.09345, 2.50505, 3.2157,  :
  some 'x' values beyond boundary knots may cause ill-conditioned bases
Warning in bs(dis, degree = 3L, knots = c(1.743225, 2.09345, 2.50505, 3.2157,  :
  some 'x' values beyond boundary knots may cause ill-conditioned bases
Warning in bs(dis, degree = 3L, knots = c(1.7573, 2.1036, 2.5329, 3.2721,  :
  some 'x' values beyond boundary knots may cause ill-conditioned bases
Warning in bs(dis, degree = 3L, knots = c(1.7573, 2.1036, 2.5329, 3.2721,  :
  some 'x' values beyond boundary knots may cause ill-conditioned bases
plot(3:10, cv.spline, type = "b", xlab = "Degrees of Freedom", ylab = "CV Error", main = "CV for Regression Splines")

best.df <- which.min(cv.spline) + 2
cat("Best degrees of freedom by CV:", best.df, "\n")
Best degrees of freedom by CV: 8 

Cross-validation identified r best.df degrees of freedom as optimal. This model offers a good balance between capturing the non-linear trend and avoiding overfitting.

11. Backfitting for Multiple Linear Regression

(a) Simulate Data

I started by generating 100 observations for two predictors, X1 and X2, independently from a normal distribution with mean 0 and variance 2. The response Y is calculated using the true coefficients:

β₀ = 1.5, β₁ = 3, β₂ = -4.5

Error variance is 1.

set.seed(1)
n <- 100
x1 <- rnorm(n, mean = 0, sd = sqrt(2))
x2 <- rnorm(n, mean = 0, sd = sqrt(2))
epsilon <- rnorm(n, mean = 0, sd = 1)

# True coefficients
beta0_true <- 1.5
beta1_true <- 3
beta2_true <- -4.5

# Response
y <- beta0_true + beta1_true * x1 + beta2_true * x2 + epsilon

(b) Initialize β₁

I initialized β̂₁ to an arbitrary value, say 0. From there, I alternated updating β̂₂ and β̂₁ while holding the other fixed.

beta1 <- 0  # Initial guess
beta2 <- 0
beta0 <- 0

# To store coefficient history
beta0_history <- beta1_history <- beta2_history <- numeric(1000)

(c)-(e) Backfitting Loop

I ran the backfitting algorithm for 1000 iterations, updating beta2 and beta1 alternately while calculating beta0 each time.

for (i in 1:1000) {
  # Update beta2 while keeping beta1 fixed
  a <- y - beta1 * x1
  fit2 <- lm(a ~ x2)
  beta0 <- fit2$coef[1]
  beta2 <- fit2$coef[2]
  
  # Update beta1 while keeping beta2 fixed
  a <- y - beta2 * x2
  fit1 <- lm(a ~ x1)
  beta0 <- fit1$coef[1]
  beta1 <- fit1$coef[2]
  
  # Store results
  beta0_history[i] <- beta0
  beta1_history[i] <- beta1
  beta2_history[i] <- beta2
}

I printed only the first 5 iterations and then every 50th step:

print_df <- data.frame(
  Iteration = c(1:5, seq(50, 1000, 50)),
  Beta0 = beta0_history[c(1:5, seq(50, 1000, 50))],
  Beta1 = beta1_history[c(1:5, seq(50, 1000, 50))],
  Beta2 = beta2_history[c(1:5, seq(50, 1000, 50))]
)

print(print_df)

(f) Compare with True Multiple Regression

I fit a standard multiple regression and overlaid the coefficients using abline() on the plot.

# Plot coefficient paths
plot(beta0_history, type = "l", col = "blue", ylim = range(c(beta0_history, beta1_history, beta2_history)),
     ylab = "Coefficient Value", xlab = "Iteration", main = "Backfitting Coefficient Paths")
lines(beta1_history, col = "green")
lines(beta2_history, col = "red")

# Fit standard multiple regression
lm_fit <- lm(y ~ x1 + x2)
abline(h = coef(lm_fit)[1], col = "blue", lty = 2)
abline(h = coef(lm_fit)[2], col = "green", lty = 2)
abline(h = coef(lm_fit)[3], col = "red", lty = 2)

legend("bottomright", legend = c("Beta0", "Beta1", "Beta2"), col = c("blue", "green", "red"), lty = 1, cex = 0.8)
legend("topright", legend = c("LM Coefs"), col = c("blue", "green", "red"), lty = 2, cex = 0.7)

The backfitting estimates converged closely to the true multiple regression coefficients. The horizontal dashed lines from lm() matched the stable values reached by the iterative estimates.

(g) Number of Iterations for Good Approximation

By inspecting the plots and printed output, I saw that the coefficients stabilized around the true values after about 100–150 iterations. After that, further updates didn’t cause significant change.

---
title: "R Notebook"
output: html_notebook
---

# 6(a) Polynomial Regression and Model Selection
To examine the relationship between age and wage, I started by fitting polynomial regression models of degrees 1 through 10. I used 10-fold cross-validation to identify the optimal polynomial degree that minimizes prediction error. For this, I used the cv.glm() function from the boot package. I stored the cross-validation error for each degree and plotted it to visually compare model performance.

```{r}
# Load packages
library(ISLR)
library(boot)
library(ggplot2)

# Load the dataset
data(Wage)

# Set seed for reproducibility
set.seed(1)

# Cross-validation to choose polynomial degree
cv.errors <- rep(NA, 10)
for (d in 1:10) {
  glm.fit <- glm(wage ~ poly(age, d), data = Wage)
  cv.errors[d] <- cv.glm(Wage, glm.fit, K = 10)$delta[1]
}

# Plot CV error vs. degree
plot(1:10, cv.errors, type = "b", xlab = "Degree", ylab = "CV Error")

# Optimal degree
optimal_degree <- which.min(cv.errors)
cat("Optimal polynomial degree chosen by cross-validation:", optimal_degree, "\n")

```

From the cross-validation plot, I observed that the polynomial of degree r optimal_degree had the lowest prediction error, so I selected it as the best model for predicting wage from age.

To compare this data-driven result with classical hypothesis testing, I performed an ANOVA test using anova() with nested polynomial models. This helped me determine if increasing the polynomial degree significantly improved model performance.

```{r}
# Fit models for ANOVA comparison
fit1 <- lm(wage ~ poly(age, 1), data = Wage)
fit2 <- lm(wage ~ poly(age, 2), data = Wage)
fit3 <- lm(wage ~ poly(age, 3), data = Wage)
fit4 <- lm(wage ~ poly(age, 4), data = Wage)
fit5 <- lm(wage ~ poly(age, 5), data = Wage)

# Compare using ANOVA (F-test)
anova(fit1, fit2, fit3, fit4, fit5)
```

The ANOVA results showed which increases in polynomial degree were statistically significant. For example, if p-values became non-significant after degree 3 or 4, that suggested higher degrees didn’t add much value—something I compared against the CV result to confirm consistency.

Next, I visualized the polynomial regression curve using the selected degree:

```{r}
# Plotting the best polynomial fit
age.range <- range(Wage$age)
age.grid <- seq(from = age.range[1], to = age.range[2])
fit.best <- lm(wage ~ poly(age, optimal_degree), data = Wage)
preds <- predict(fit.best, newdata = data.frame(age = age.grid))

plot(Wage$age, Wage$wage, col = "gray", main = paste("Polynomial Fit (Degree =", optimal_degree, ")"))
lines(age.grid, preds, col = "blue", lwd = 2)
```

This plot clearly showed a smooth and flexible curve that captured the non-linear relationship between age and wage. The polynomial fit provided a much better representation than a simple linear model.

# (b) Step Function Model with Cross-Validation
After the polynomial approach, I wanted to try modeling wage using a step function. I divided age into different numbers of intervals (or "cuts") ranging from 2 to 10, and treated those intervals as categorical variables in a linear model. Again, I used 10-fold cross-validation to find the optimal number of cuts.

```{r}
# Step function CV
cv.errors.step <- rep(NA, 9)
for (k in 2:10) {
  Wage$age.cut <- cut(Wage$age, k)
  glm.fit <- glm(wage ~ age.cut, data = Wage)
  cv.errors.step[k - 1] <- cv.glm(Wage, glm.fit, K = 10)$delta[1]
}

# Plot CV error vs. number of cuts
plot(2:10, cv.errors.step, type = "b", xlab = "Number of cuts", ylab = "CV Error")

# Optimal number of cuts
best.cut <- which.min(cv.errors.step) + 1
cat("Optimal number of cuts for step function:", best.cut, "\n")
```

Based on the cross-validation results, I found that r best.cut cuts gave the best performance. I used that to fit a step function model and visualize it.

```{r}
# Fit best step function
Wage$age.cut <- cut(Wage$age, best.cut)
fit.step <- lm(wage ~ age.cut, data = Wage)

# Predict using the step function model
age.grid <- seq(from = min(Wage$age), to = max(Wage$age))
age.cut.grid <- cut(age.grid, best.cut)
preds.step <- predict(fit.step, newdata = data.frame(age.cut = age.cut.grid))

# Plot
plot(Wage$age, Wage$wage, col = "gray", main = paste("Step Function (", best.cut, " cuts)", sep=""))
lines(age.grid, preds.step, col = "red", lwd = 2)
```

This plot showed a piecewise constant function—wage predictions were flat within each age interval. While it wasn’t as smooth as the polynomial regression, it still effectively captured abrupt changes and overall patterns in the data.


# 7 Exploring Other Predictors of Wage

To further understand what affects wage, I explored other variables in the Wage dataset—specifically marital status (maritl), job class (jobclass), and education (education). Since these are categorical, I can include them directly in the regression model. R will handle them automatically by assigning a reference level and fitting indicator variables.

# (a) Effect of Marital Status on Wage
I first examined how maritl affects wage:

```{r}
# Load data and package
library(ISLR)
data(Wage)

# Fit a model with marital status
fit_maritl <- lm(wage ~ maritl, data = Wage)
summary(fit_maritl)

# Boxplot to visualize
boxplot(wage ~ maritl, data = Wage, col = "lightblue", main = "Wage vs Marital Status", ylab = "Wage")
```

The model summary showed that married individuals (the reference group) generally earned more on average. The boxplot confirmed this—married people tended to have higher median wages, while individuals who were never married or widowed had lower distributions.

# (b) Effect of Job Class on Wage
Next, I analyzed the impact of job type (jobclass):

```{r}
# Fit a model with jobclass
fit_job <- lm(wage ~ jobclass, data = Wage)
summary(fit_job)

# Boxplot
boxplot(wage ~ jobclass, data = Wage, col = "lightgreen", main = "Wage by Job Class", ylab = "Wage")
```

The results showed that individuals in the Information job class earned significantly more than those in the Industrial job class. This was evident in both the regression coefficients and the boxplot.

# (c) Combined Effects: Add Education
Then I created a more flexible model including multiple predictors:

```{r}
# Fit a model with multiple categorical predictors
fit_all <- lm(wage ~ maritl + jobclass + education, data = Wage)
summary(fit_all)

# Visualize wage vs education
boxplot(wage ~ education, data = Wage, col = "peachpuff", main = "Wage by Education Level", ylab = "Wage")
```

The education level had a clear positive association with wage. Higher education levels (like "Grad School" or "College Grad") were associated with higher average wages. The boxplot showed this trend clearly.


#Summary
Marital Status: Married individuals tend to earn more. "Never Married" and "Widowed" categories showed lower average wages.

Job Class: Information workers consistently earn more than those in industrial roles.

Education: There is a clear positive trend between education level and wage, with higher education corresponding to higher average wages.

By including these categorical variables in the regression model, I was able to flexibly capture non-linear relationships without having to manually encode the categories. The boxplots provided intuitive visual support for the regression findings.


# 9. Modeling NOx Concentration Using Distance (dis)

# (a) Fit a Cubic Polynomial Regression
To start, I modeled nox as a function of dis using a cubic polynomial regression. I used the poly() function to fit an orthogonal polynomial and visualized the resulting curve.

```{r}
# Load required packages
library(MASS)
library(ggplot2)

# Load Boston data
data(Boston)

# Fit cubic polynomial regression
fit_cubic <- lm(nox ~ poly(dis, 3), data = Boston)
summary(fit_cubic)

# Plot data and cubic fit
dis.grid <- seq(min(Boston$dis), max(Boston$dis), length.out = 100)
preds <- predict(fit_cubic, newdata = data.frame(dis = dis.grid))

plot(Boston$dis, Boston$nox, col = "gray", main = "Cubic Polynomial Fit", xlab = "dis", ylab = "nox")
lines(dis.grid, preds, col = "blue", lwd = 2)
```

The cubic fit showed a non-linear relationship between dis and nox, where NOx concentration decreases with increasing distance from employment centers, with some curvature.

# (b) Fit Polynomials of Degree 1 to 10 and Compare RSS
Next, I fit polynomial regressions of degrees 1 through 10 and plotted the fits. I also recorded the residual sum of squares (RSS) for each model to assess model complexity.

```{r}
rss <- numeric(10)

plot(Boston$dis, Boston$nox, col = "lightgray", main = "Polynomial Fits of Varying Degrees", xlab = "dis", ylab = "nox")

for (d in 1:10) {
  fit <- lm(nox ~ poly(dis, d), data = Boston)
  rss[d] <- sum(resid(fit)^2)
  preds <- predict(fit, newdata = data.frame(dis = dis.grid))
  lines(dis.grid, preds, col = rainbow(10)[d])
}

legend("topright", legend = paste("Degree", 1:10), col = rainbow(10), lty = 1, cex = 0.6)
print(rss)
```

The RSS values generally decreased with increasing degree, but the improvements diminished after a certain point, suggesting potential overfitting.

# (c) Cross-Validation to Select Optimal Degree
To choose the best degree objectively, I used 10-fold cross-validation to estimate the prediction error for each polynomial degree.

```{r}
library(boot)
set.seed(1)

cv.error <- numeric(10)
for (d in 1:10) {
  glm.fit <- glm(nox ~ poly(dis, d), data = Boston)
  cv.error[d] <- cv.glm(Boston, glm.fit, K = 10)$delta[1]
}

plot(1:10, cv.error, type = "b", xlab = "Degree", ylab = "CV Error", main = "CV Error by Polynomial Degree")
optimal_degree <- which.min(cv.error)
cat("Optimal polynomial degree by CV:", optimal_degree, "\n")
```

Cross-validation suggested that a polynomial of degree r optimal_degree offered the best bias-variance trade-off, balancing flexibility with generalization.

# (d) Fit a Regression Spline with 4 Degrees of Freedom
I then used a regression spline via the bs() function with 4 degrees of freedom. I didn’t manually set knots; R automatically placed them based on quantiles.

```{r}
library(splines)

# Fit spline with 4 df
fit_spline4 <- lm(nox ~ bs(dis, df = 4), data = Boston)
summary(fit_spline4)

# Predict and plot
preds.spline4 <- predict(fit_spline4, newdata = data.frame(dis = dis.grid))
plot(Boston$dis, Boston$nox, col = "gray", main = "Regression Spline (df = 4)", xlab = "dis", ylab = "nox")
lines(dis.grid, preds.spline4, col = "red", lwd = 2)
```

The regression spline with 4 degrees of freedom captured non-linearity smoothly and offered better flexibility than a low-degree polynomial without high variance.

# (e) Fit Regression Splines with Varying Degrees of Freedom
I fit splines with degrees of freedom from 3 to 10 and computed RSS for each to examine model fit.

```{r}
rss.spline <- numeric(8)
plot(Boston$dis, Boston$nox, col = "gray", main = "Splines with Varying df", xlab = "dis", ylab = "nox")

for (df in 3:10) {
  fit <- lm(nox ~ bs(dis, df = df), data = Boston)
  rss.spline[df - 2] <- sum(resid(fit)^2)
  lines(dis.grid, predict(fit, newdata = data.frame(dis = dis.grid)), col = rainbow(8)[df - 2])
}

legend("topright", legend = paste("df", 3:10), col = rainbow(8), lty = 1, cex = 0.6)
print(rss.spline)
```

RSS decreased as degrees of freedom increased, indicating better fit, but the improvements tapered off—highlighting the need for a balanced model.

# (f) Cross-Validation to Select Best Spline Degrees of Freedom
Finally, I used cross-validation to find the optimal degrees of freedom for the spline.

```{r}
cv.spline <- numeric(8)
set.seed(2)

for (df in 3:10) {
  fit <- glm(nox ~ bs(dis, df = df), data = Boston)
  cv.spline[df - 2] <- cv.glm(Boston, fit, K = 10)$delta[1]
}

plot(3:10, cv.spline, type = "b", xlab = "Degrees of Freedom", ylab = "CV Error", main = "CV for Regression Splines")
best.df <- which.min(cv.spline) + 2
cat("Best degrees of freedom by CV:", best.df, "\n")
```

Cross-validation identified r best.df degrees of freedom as optimal. This model offers a good balance between capturing the non-linear trend and avoiding overfitting.

# 11. Backfitting for Multiple Linear Regression

# (a) Simulate Data
I started by generating 100 observations for two predictors, X1 and X2, independently from a normal distribution with mean 0 and variance 2. The response Y is calculated using the true coefficients:

β₀ = 1.5, β₁ = 3, β₂ = -4.5

Error variance is 1.

```{r}
set.seed(1)
n <- 100
x1 <- rnorm(n, mean = 0, sd = sqrt(2))
x2 <- rnorm(n, mean = 0, sd = sqrt(2))
epsilon <- rnorm(n, mean = 0, sd = 1)

# True coefficients
beta0_true <- 1.5
beta1_true <- 3
beta2_true <- -4.5

# Response
y <- beta0_true + beta1_true * x1 + beta2_true * x2 + epsilon
```


# (b) Initialize β₁
I initialized β̂₁ to an arbitrary value, say 0. From there, I alternated updating β̂₂ and β̂₁ while holding the other fixed.

```{r}
beta1 <- 0  # Initial guess
beta2 <- 0
beta0 <- 0

# To store coefficient history
beta0_history <- beta1_history <- beta2_history <- numeric(1000)
```


# (c)-(e) Backfitting Loop
I ran the backfitting algorithm for 1000 iterations, updating beta2 and beta1 alternately while calculating beta0 each time.

```{r}
for (i in 1:1000) {
  # Update beta2 while keeping beta1 fixed
  a <- y - beta1 * x1
  fit2 <- lm(a ~ x2)
  beta0 <- fit2$coef[1]
  beta2 <- fit2$coef[2]
  
  # Update beta1 while keeping beta2 fixed
  a <- y - beta2 * x2
  fit1 <- lm(a ~ x1)
  beta0 <- fit1$coef[1]
  beta1 <- fit1$coef[2]
  
  # Store results
  beta0_history[i] <- beta0
  beta1_history[i] <- beta1
  beta2_history[i] <- beta2
}
```

I printed only the first 5 iterations and then every 50th step:

```{r}
print_df <- data.frame(
  Iteration = c(1:5, seq(50, 1000, 50)),
  Beta0 = beta0_history[c(1:5, seq(50, 1000, 50))],
  Beta1 = beta1_history[c(1:5, seq(50, 1000, 50))],
  Beta2 = beta2_history[c(1:5, seq(50, 1000, 50))]
)

print(print_df)
```

# (f) Compare with True Multiple Regression
I fit a standard multiple regression and overlaid the coefficients using abline() on the plot.

```{r}
# Plot coefficient paths
plot(beta0_history, type = "l", col = "blue", ylim = range(c(beta0_history, beta1_history, beta2_history)),
     ylab = "Coefficient Value", xlab = "Iteration", main = "Backfitting Coefficient Paths")
lines(beta1_history, col = "green")
lines(beta2_history, col = "red")

# Fit standard multiple regression
lm_fit <- lm(y ~ x1 + x2)
abline(h = coef(lm_fit)[1], col = "blue", lty = 2)
abline(h = coef(lm_fit)[2], col = "green", lty = 2)
abline(h = coef(lm_fit)[3], col = "red", lty = 2)

legend("bottomright", legend = c("Beta0", "Beta1", "Beta2"), col = c("blue", "green", "red"), lty = 1, cex = 0.8)
legend("topright", legend = c("LM Coefs"), col = c("blue", "green", "red"), lty = 2, cex = 0.7)
```

The backfitting estimates converged closely to the true multiple regression coefficients. The horizontal dashed lines from lm() matched the stable values reached by the iterative estimates.

# (g) Number of Iterations for Good Approximation
By inspecting the plots and printed output, I saw that the coefficients stabilized around the true values after about 100–150 iterations. After that, further updates didn’t cause significant change.
