Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.
Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.
a. Polynomial regression with cross-validation
We fit polynomial regression models of degrees 1 through 6 to predict wage using age. Cross-validation is used to determine the optimal degree of the polynomial, and we compare this selection with the results of an ANOVA hypothesis test.
# Cross-validation error for polynomial degrees 1 through 6
cv_errors <- sapply(1:6, function(d) {
fit <- glm(wage ~ poly(age, d), data = Wage)
cv.glm(Wage, fit, K = 5)$delta[1]
})
# Plot CV errors and identify minimum
which.min(cv_errors)
## [1] 6
plot(1:6, cv_errors, xlab = "Degree", ylab = "Test MSE", type = "l", lwd = 2)
abline(v = which.min(cv_errors), col = "red", lty = 2)
# Fit best polynomial model
best_degree <- which.min(cv_errors)
fit <- glm(wage ~ poly(age, best_degree), data = Wage)
# Plot data and polynomial fit
plot(Wage$age, Wage$wage, pch = 19, cex = 0.4, col = alpha("steelblue", 0.4),
xlab = "Age", ylab = "Wage")
lines(1:100, predict(fit, data.frame(age = 1:100)), col = "red", lwd = 2)
# Summary of full degree-6 model (for ANOVA)
summary(glm(wage ~ poly(age, 6), data = Wage))
##
## Call:
## glm(formula = wage ~ poly(age, 6), data = Wage)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.7036 0.7286 153.316 < 2e-16 ***
## poly(age, 6)1 447.0679 39.9063 11.203 < 2e-16 ***
## poly(age, 6)2 -478.3158 39.9063 -11.986 < 2e-16 ***
## poly(age, 6)3 125.5217 39.9063 3.145 0.00167 **
## poly(age, 6)4 -77.9112 39.9063 -1.952 0.05099 .
## poly(age, 6)5 -35.8129 39.9063 -0.897 0.36956
## poly(age, 6)6 62.7077 39.9063 1.571 0.11620
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1592.512)
##
## Null deviance: 5222086 on 2999 degrees of freedom
## Residual deviance: 4766389 on 2993 degrees of freedom
## AIC: 30642
##
## Number of Fisher Scoring iterations: 2
# ANOVA model comparisons (F-tests)
fit1 <- lm(wage ~ poly(age, 1), data = Wage)
fit2 <- lm(wage ~ poly(age, 2), data = Wage)
fit3 <- lm(wage ~ poly(age, 3), data = Wage)
fit4 <- lm(wage ~ poly(age, 4), data = Wage)
fit5 <- lm(wage ~ poly(age, 5), data = Wage)
anova(fit1, fit2, fit3, fit4, fit5, test = "F")
Cross-validation selected degree 6 as having the lowest estimated test error, favoring a more flexible model. The ANOVA comparisons, however, indicate that terms beyond degree 3 add little: the degree-4 term is only marginally significant (p ≈ 0.051 in the degree-6 summary above), and the degree-5 and degree-6 terms are not significant. Thus ANOVA favors a simpler model (roughly degree 3), while CV prefers a more complex one, illustrating the usual tension between in-sample hypothesis testing and estimated out-of-sample error.
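Since cross-validation selected degree 6, it can be worth extending the sequential ANOVA one step further. A minimal sketch, reusing the fits above (with orthogonal polynomials, these F-tests correspond to the t-tests reported in the degree-6 summary):
# Add the degree-6 model to the sequential F-test comparison
fit6 <- lm(wage ~ poly(age, 6), data = Wage)
anova(fit1, fit2, fit3, fit4, fit5, fit6, test = "F")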
b. Step function regression with cross-validation
# Cross-validation errors for step functions with 2–10 cuts
cv_step <- sapply(2:10, function(k) {
Wage$cats <- cut(Wage$age, k)
fit <- glm(wage ~ cats, data = Wage)
cv.glm(Wage, fit, K = 5)$delta[1]
})
names(cv_step) <- 2:10
# Plot CV errors and identify minimum
plot(2:10, cv_step, xlab = "Cuts", ylab = "Test MSE", type = "l", lwd = 2)
which.min(cv_step)
## 8
## 7
abline(v = as.numeric(names(which.min(cv_step))), col = "red", lty = 2)
# Fit model with best number of cuts (8 in this case)
fit <- glm(wage ~ cut(age, 8), data = Wage)
# Plot data and step function fit
plot(Wage$age, Wage$wage, pch = 19, cex = 0.4, col = alpha("steelblue", 0.4),
xlab = "Age", ylab = "Wage")
lines(18:80, predict(fit, data.frame(age = 18:80)), col = "red", lwd = 2)
This is a consequence of how the vector is indexed. cv_step holds the CV errors for 2 through 10 cuts, and we assigned it the names "2" through "10" via names(cv_step) <- 2:10. which.min(cv_step) returns the position of the minimum within the vector, which is 7; the corresponding name, "8", tells us that the optimal step function uses 8 cuts (the seventh element of 2:10).
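For completeness, the optimal number of cuts can also be recovered programmatically rather than read off the printed names; two equivalent one-liners using the cv_step vector above:
# Optimal number of cuts, recovered two equivalent ways
(2:10)[which.min(cv_step)]
as.numeric(names(which.min(cv_step)))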
We begin by plotting wage against several predictors:
plot(Wage$year, Wage$wage, pch = 19, cex = 0.4, col = alpha("steelblue", 0.4))
plot(Wage$age, Wage$wage, pch = 19, cex = 0.4, col = alpha("steelblue", 0.4))
plot(Wage$maritl, Wage$wage, pch = 19, cex = 0.4, col = alpha("steelblue", 0.4))
plot(Wage$jobclass, Wage$wage, pch = 19, cex = 0.4, col = alpha("steelblue", 0.4))
plot(Wage$education, Wage$wage, pch = 19, cex = 0.4, col = alpha("steelblue", 0.4))
From these plots:
- Year and age show non-linear trends with wage.
- Education and jobclass show clear differences in wage levels across categories.
- Marital status appears to influence wage, with some groups earning more than others.
To capture both non-linear and categorical effects, we use generalized additive models (GAMs). The s() function fits smoothing splines to the continuous predictors, while categorical variables enter directly and are interpreted relative to a baseline level.
library(gam)
## Warning: package 'gam' was built under R version 4.4.3
## Loading required package: foreach
## Warning: package 'foreach' was built under R version 4.4.3
## Loaded gam 1.22-5
fit0 <- gam(wage ~ s(year, 4) + s(age, 5) + education, data = Wage)
fit1 <- gam(wage ~ s(year, 4) + s(age, 5) + education + jobclass, data = Wage)
fit2 <- gam(wage ~ s(year, 4) + s(age, 5) + education + maritl, data = Wage)
fit3 <- gam(wage ~ s(year, 4) + s(age, 5) + education + jobclass + maritl, data = Wage)
anova(fit0, fit1, fit2, fit3)
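One caveat: fit1 (which adds jobclass) and fit2 (which adds maritl) are not nested in one another, so the middle comparison in the sequential ANOVA above is not a strictly valid F-test. A minimal sketch of two properly nested chains, reusing the same fits:
# Nested comparisons: education -> + jobclass -> + jobclass + maritl
anova(fit0, fit1, fit3)
# Nested comparisons: education -> + maritl -> + maritl + jobclass
anova(fit0, fit2, fit3)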
par(mfrow = c(2, 3))
plot(fit3, se = TRUE, col = "blue")
We use the poly() function to fit a cubic (degree 3) polynomial regression of nox on dis in the Boston data set, and then visualize the fit.
# Fit cubic polynomial
fit_cubic <- glm(nox ~ poly(dis, 3), data = Boston)
summary(fit_cubic)
##
## Call:
## glm(formula = nox ~ poly(dis, 3), data = Boston)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.554695 0.002759 201.021 < 2e-16 ***
## poly(dis, 3)1 -2.003096 0.062071 -32.271 < 2e-16 ***
## poly(dis, 3)2 0.856330 0.062071 13.796 < 2e-16 ***
## poly(dis, 3)3 -0.318049 0.062071 -5.124 4.27e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.003852802)
##
## Null deviance: 6.7810 on 505 degrees of freedom
## Residual deviance: 1.9341 on 502 degrees of freedom
## AIC: -1370.9
##
## Number of Fisher Scoring iterations: 2
# Visualization
x_vals <- seq(min(Boston$dis), max(Boston$dis), length.out = 1000)
pred_vals <- predict(fit_cubic, newdata = data.frame(dis = x_vals))
plot(nox ~ dis, data = Boston, col = alpha("steelblue", 0.4), pch = 19,
main = "Cubic Polynomial Fit: nox vs. dis")
lines(x_vals, pred_vals, col = "red", lwd = 2, lty = 2)
Explanation: All three polynomial terms are highly significant, and the fitted curve captures a clear non-linear, decreasing relationship: nox falls steeply at small values of dis and flattens out at larger distances. This matches the expectation that nitrogen oxide concentrations are highest close to the employment centers.
We fit polynomial regressions for degrees 1 through 10 and visualize them together. We also report the residual sum of squares (RSS) for comparison.
# Fit models and predict
polynomial_models <- lapply(1:10, function(deg) glm(nox ~ poly(dis, deg), data = Boston))
# Prediction frame
x_vals <- seq(min(Boston$dis), max(Boston$dis), length.out = 1000)
preds <- sapply(polynomial_models, function(mod) predict(mod, newdata = data.frame(dis = x_vals)))
# Prepare data for ggplot
pred_df <- data.frame(x = x_vals, preds)
colnames(pred_df)[-1] <- paste0("Degree_", 1:10)
pred_long <- pivot_longer(pred_df, -x, names_to = "Degree", values_to = "Predicted")
# Plot fits
ggplot(Boston, aes(dis, nox)) +
geom_point(color = alpha("steelblue", 0.4)) +
geom_line(data = pred_long, aes(x = x, y = Predicted, color = Degree), linewidth = 1) +
labs(title = "Polynomial Fits (Degrees 1–10)", x = "Distance to Employment Centers", y = "NOx Concentration") +
theme_minimal()
# Report RSS
rss_vals <- sapply(polynomial_models, function(mod) sum(residuals(mod)^2))
names(rss_vals) <- paste0("Degree_", 1:10)
rss_vals
## Degree_1 Degree_2 Degree_3 Degree_4 Degree_5 Degree_6 Degree_7 Degree_8
## 2.768563 2.035262 1.934107 1.932981 1.915290 1.878257 1.849484 1.835630
## Degree_9 Degree_10
## 1.833331 1.832171
Explanation:
The RSS necessarily decreases as the polynomial degree increases, since higher-degree models are more flexible on the training data. The drop is large from degree 1 to degree 3 (about 2.77 down to 1.93) but marginal thereafter (only about 1.83 by degree 10), suggesting diminishing returns and a growing risk of overfitting, particularly near the boundaries of the data.
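To make the diminishing returns visible, we can plot the training RSS against the degree; a quick sketch using the rss_vals vector computed above:
# Training RSS vs. polynomial degree; the curve flattens after roughly degree 3
plot(1:10, rss_vals, type = "b", pch = 19,
     xlab = "Polynomial Degree", ylab = "Training RSS")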
We apply 10-fold cross-validation to select the best degree.
cv_errors <- sapply(1:10, function(deg) {
fit <- glm(nox ~ poly(dis, deg), data = Boston)
cv.glm(Boston, fit, K = 10)$delta[1]
})
# Plot CV errors
plot(1:10, cv_errors, type = "b", pch = 19, col = "darkgreen",
xlab = "Polynomial Degree", ylab = "CV Error",
main = "10-Fold Cross-Validation Error")
optimal_degree <- which.min(cv_errors)
optimal_degree
## [1] 3
Explanation:
The cross-validation error is minimized at degree 3, confirming that the cubic fit strikes a balance between bias and variance. This is consistent with our earlier visual inspection, where degree 3 provided a smooth, reasonable fit without overfitting.
We now fit a regression spline using bs() (a B-spline basis) with 4 degrees of freedom. The knots are placed automatically at quantiles of dis; with df = 4 and a cubic basis, this means a single interior knot at the median.
fit_spline4 <- glm(nox ~ bs(dis, df = 4), data = Boston)
summary(fit_spline4)
##
## Call:
## glm(formula = nox ~ bs(dis, df = 4), data = Boston)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.73447 0.01460 50.306 < 2e-16 ***
## bs(dis, df = 4)1 -0.05810 0.02186 -2.658 0.00812 **
## bs(dis, df = 4)2 -0.46356 0.02366 -19.596 < 2e-16 ***
## bs(dis, df = 4)3 -0.19979 0.04311 -4.634 4.58e-06 ***
## bs(dis, df = 4)4 -0.38881 0.04551 -8.544 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.003837874)
##
## Null deviance: 6.7810 on 505 degrees of freedom
## Residual deviance: 1.9228 on 501 degrees of freedom
## AIC: -1371.9
##
## Number of Fisher Scoring iterations: 2
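To confirm where bs() placed the interior knot, we can inspect the attributes of the basis; a quick check (bs() comes from the splines package, assumed to be loaded earlier in the document):
# With df = 4 and a cubic basis, bs() places one interior knot at the median of dis
attr(bs(Boston$dis, df = 4), "knots")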
# Visualization
pred_spline <- predict(fit_spline4, newdata = data.frame(dis = x_vals))
plot(nox ~ dis, data = Boston, col = alpha("steelblue", 0.4), pch = 19,
main = "Regression Spline Fit (4 df)")
lines(x_vals, pred_spline, col = "red", lwd = 2, lty = 2)
Explanation:
The regression spline with 4 degrees of freedom produces a smooth curve that captures the curvature in the data without becoming excessively wiggly, striking a reasonable balance between flexibility and overfitting.
Now we fit splines for multiple degrees of freedom and observe model complexity vs. fit.
# Fit spline models
spline_models <- lapply(3:10, function(df) glm(nox ~ bs(dis, df = df), data = Boston))
# Generate predictions
spline_preds <- sapply(spline_models, function(mod) predict(mod, newdata = data.frame(dis = x_vals)))
spline_df <- data.frame(x = x_vals, spline_preds)
colnames(spline_df)[-1] <- paste0("df_", 3:10)
spline_long <- pivot_longer(spline_df, -x, names_to = "DF", values_to = "Predicted")
# Plot splines
ggplot(Boston, aes(dis, nox)) +
geom_point(color = alpha("steelblue", 0.4)) +
geom_line(data = spline_long, aes(x = x, y = Predicted, color = DF), linewidth = 1) +
labs(title = "Regression Splines with Varying df", x = "dis", y = "nox") +
theme_minimal()
Explanation:
As the degrees of freedom increase, the training fit improves (see the RSS check below), but the curves become noticeably more wiggly at higher df, especially near the boundaries of the data. Around 4–5 degrees of freedom offers a good compromise between fit and model complexity.
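The RSS claim above can be checked directly from the spline_models list fitted earlier; a minimal sketch:
# Training RSS for the spline fits with df = 3 through 10
spline_rss <- sapply(spline_models, function(mod) sum(residuals(mod)^2))
names(spline_rss) <- paste0("df_", 3:10)
spline_rss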
We apply 10-fold CV to the spline fits. Note: cv.glm() emits warnings when held-out dis values fall outside the boundary knots of a training fold; these warnings are suppressed here.
set.seed(42)
cv_spline_errors <- sapply(3:10, function(df) {
fit <- glm(nox ~ bs(dis, df = df), data = Boston)
suppressWarnings(cv.glm(Boston, fit, K = 10)$delta[1])
})
# Plot CV errors
plot(3:10, cv_spline_errors, type = "b", pch = 19, col = "blue",
xlab = "Degrees of Freedom", ylab = "CV Error",
main = "CV Error for Spline Fits")
best_df <- which.min(cv_spline_errors) + 2 # shift index to start at 3
best_df
## [1] 10
Explanation:
Cross-validation selects 10 degrees of freedom as having the lowest estimated test error. The extra flexibility does improve the fit, but as noted above the higher-df splines show signs of overfitting near the boundaries, so the practical gain over a simpler spline (around df = 4–5) may be modest.
Generate a response Y and two predictors X1 and X2, with n = 100.
Initialize $\hat{\beta}_1$ to take on a value of your choice. It does not matter what value you choose.
Keeping $\hat{\beta}_1$ fixed, fit the model $Y - \hat{\beta}_1 X_1 = \beta_0 + \beta_2 X_2 + \epsilon$. You can do this as follows: a <- y - beta1 * x1; beta2 <- lm(a ~ x2)$coef[2].
Keeping $\hat{\beta}_2$ fixed, fit the model $Y - \hat{\beta}_2 X_2 = \beta_0 + \beta_1 X_1 + \epsilon$. You can do this as follows: a <- y - beta2 * x2; beta1 <- lm(a ~ x1)$coef[2].
Write a for loop to repeat (c) and (d) 1,000 times. Report the estimates of $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\beta}_2$ at each iteration of the for loop. Create a plot in which each of these values is displayed, with $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\beta}_2$ each shown in a different color.
Compare your answer in (e) to the results of simply performing multiple linear regression to predict Y using X1 and X2. Use the abline() function to overlay those multiple linear regression coefficient estimates on the plot obtained in (e).
On this data set, how many backfitting iterations were required in order to obtain a "good" approximation to the multiple regression coefficient estimates?
We generate X1 and X2 independently from a normal distribution with mean 0 and variance 2. The response Y is generated from the true model

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon$$

where:
- $\beta_0 = 1.5$
- $\beta_1 = 3$
- $\beta_2 = -4.5$
- $\epsilon \sim N(0, 1)$
set.seed(42)
n <- 100
x1 <- rnorm(n, mean = 0, sd = sqrt(2))
x2 <- rnorm(n, mean = 0, sd = sqrt(2))
beta0_true <- 1.5
beta1_true <- 3
beta2_true <- -4.5
epsilon <- rnorm(n, mean = 0, sd = 1)
y <- beta0_true + beta1_true * x1 + beta2_true * x2 + epsilon
We start backfitting with an arbitrary value for $\hat{\beta}_1$:
beta1 <- 20
We perform one step of backfitting by keeping one coefficient fixed and estimating the other, then alternating:
a <- y - beta1 * x1
beta2 <- lm(a ~ x2)$coef[2]
a <- y - beta2 * x2
fit1 <- lm(a ~ x1)
beta1 <- fit1$coef[2]
beta0 <- fit1$coef[1]
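A quick look at the estimates after this single sweep (the exact values depend on the seed set above):
# Estimates after one backfitting sweep
c(beta0 = unname(beta0), beta1 = unname(beta1), beta2 = unname(beta2))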
We repeat steps (c) and (d) for 1,000 iterations, storing the estimates at every iteration. For display we show iterations 1–5 and then every 50th iteration; the results are visualized with ggplot2.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ lubridate 1.9.4 ✔ stringr 1.5.1
## ✔ purrr 1.0.2 ✔ tibble 3.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ purrr::accumulate() masks foreach::accumulate()
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::when() masks foreach::when()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
res <- matrix(NA, nrow = 1000, ncol = 3)
colnames(res) <- c("beta0", "beta1", "beta2")
beta1 <- 20
for (i in 1:1000) {
beta2 <- lm(y - beta1 * x1 ~ x2)$coef[2]
fit1 <- lm(y - beta2 * x2 ~ x1)
beta1 <- fit1$coef[2]
beta0 <- fit1$coef[1]
res[i, ] <- c(beta0, beta1, beta2)
}
res <- as.data.frame(res)
res$Iteration <- 1:1000
# Only show selected iterations
selected_rows <- c(1:5, seq(50, 1000, by = 50))
res_display <- res[selected_rows, ]
print(round(res_display, 4))
## beta0 beta1 beta2 Iteration
## 1 1.4248 2.9151 -5.0558 1
## 2 1.5017 2.8984 -4.4403 2
## 3 1.5018 2.8984 -4.4397 3
## 4 1.5018 2.8984 -4.4397 4
## 5 1.5018 2.8984 -4.4397 5
## 50 1.5018 2.8984 -4.4397 50
## 100 1.5018 2.8984 -4.4397 100
## 150 1.5018 2.8984 -4.4397 150
## 200 1.5018 2.8984 -4.4397 200
## 250 1.5018 2.8984 -4.4397 250
## 300 1.5018 2.8984 -4.4397 300
## 350 1.5018 2.8984 -4.4397 350
## 400 1.5018 2.8984 -4.4397 400
## 450 1.5018 2.8984 -4.4397 450
## 500 1.5018 2.8984 -4.4397 500
## 550 1.5018 2.8984 -4.4397 550
## 600 1.5018 2.8984 -4.4397 600
## 650 1.5018 2.8984 -4.4397 650
## 700 1.5018 2.8984 -4.4397 700
## 750 1.5018 2.8984 -4.4397 750
## 800 1.5018 2.8984 -4.4397 800
## 850 1.5018 2.8984 -4.4397 850
## 900 1.5018 2.8984 -4.4397 900
## 950 1.5018 2.8984 -4.4397 950
## 1000 1.5018 2.8984 -4.4397 1000
We now plot the convergence of the coefficients:
res_long <- pivot_longer(res, cols = c("beta0", "beta1", "beta2"))
ggplot(res_long, aes(x = Iteration, y = value, color = name)) +
geom_line() +
geom_hline(yintercept = c(beta0_true, beta1_true, beta2_true),
linetype = "dashed", color = c("red", "blue", "darkgreen")) +
labs(title = "Convergence of Coefficients via Backfitting",
x = "Iteration",
y = "Coefficient Estimate",
color = "Coefficient") +
scale_color_manual(values = c("darkred", "navy", "forestgreen")) +
theme_minimal()
Dashed lines represent the true values $\beta_0 = 1.5$, $\beta_1 = 3$, and $\beta_2 = -4.5$.
Now we compare the backfitting estimates with those from a standard multiple linear regression of y on x1 and x2:
fit_lm <- lm(y ~ x1 + x2)
print(round(coef(fit_lm), 4))
## (Intercept) x1 x2
## 1.5018 2.8984 -4.4397
# Overlay the OLS coefficient estimates on the convergence plot (ggplot analog of abline())
ggplot(res_long, aes(x = Iteration, y = value, color = name)) +
geom_line() +
geom_hline(yintercept = coef(fit_lm),
linetype = "dotted", color = c("red", "blue", "darkgreen")) +
labs(title = "Backfitting vs Standard Linear Regression",
x = "Iteration",
y = "Coefficient Estimate",
color = "Coefficient") +
scale_color_manual(values = c("darkred", "navy", "forestgreen")) +
theme_minimal()
In this simulation, the backfitting estimates agree with the OLS coefficients to the displayed precision by the third iteration; the very first sweep already gets close, so only a handful of iterations (roughly 2–3) are needed to obtain a good approximation to the multiple regression solution.
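To quantify "how many iterations", we can find the first iteration at which all three backfitting estimates agree with the OLS coefficients to within a small tolerance; a minimal sketch using the res data frame and fit_lm from above (the 1e-6 tolerance is an arbitrary choice):
# First iteration at which all estimates are within 1e-6 of the OLS values
ols <- coef(fit_lm)  # order: (Intercept), x1, x2 matches beta0, beta1, beta2
diffs <- abs(sweep(as.matrix(res[, c("beta0", "beta1", "beta2")]), 2, ols))
min(which(apply(diffs, 1, max) < 1e-6))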