# Packages: ISLR (Wage data) and boot (cv.glm)
library(ISLR)
library(boot)
set.seed(1)
# 10-fold cross-validation error for polynomial fits of degree 1 to 10
cv.errors <- rep(NA, 10)
for (i in 1:10) {
  glm.fit <- glm(wage ~ poly(age, i), data = Wage)
  cv.errors[i] <- cv.glm(Wage, glm.fit, K = 10)$delta[1]
}
which.min(cv.errors)
## [1] 9
plot(1:10, cv.errors, type = "b", xlab = "Degree", ylab = "CV Error")
fit1 <- lm(wage ~ poly(age, 1), data = Wage)
fit2 <- lm(wage ~ poly(age, 2), data = Wage)
fit3 <- lm(wage ~ poly(age, 3), data = Wage)
fit4 <- lm(wage ~ poly(age, 4), data = Wage)
fit5 <- lm(wage ~ poly(age, 5), data = Wage)
anova(fit1, fit2, fit3, fit4, fit5)
## Analysis of Variance Table
##
## Model 1: wage ~ poly(age, 1)
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.5931 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.8888 0.001679 **
## 4 2995 4771604 1 6070 3.8098 0.051046 .
## 5 2994 4770322 1 1283 0.8050 0.369682
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
best.fit <- lm(wage ~ poly(age, 4), data = Wage)
age.grid <- seq(from = min(Wage$age), to = max(Wage$age), length.out = 100)
preds <- predict(best.fit, newdata = list(age = age.grid), se = TRUE)
plot(Wage$age, Wage$wage, col = "gray", main = "Polynomial Fit (Degree 4)", xlab = "Age", ylab = "Wage")
lines(age.grid, preds$fit, col = "blue", lwd = 2)
lines(age.grid, preds$fit + 2 * preds$se.fit, col = "blue", lty = "dashed")
lines(age.grid, preds$fit - 2 * preds$se.fit, col = "blue", lty = "dashed")
Although the CV error is formally minimized at degree 9, the CV curve flattens out after degree 3–4, so the additional flexibility buys very little. This aligns closely with the ANOVA results, where degrees 1 to 3 are clearly significant and degree 4 is marginal. Together these support using a degree 3 or 4 polynomial, with degree 4 offering a slight improvement in prediction performance at the cost of a marginal increase in complexity.
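As a quick numerical check (a sketch, assuming the cv.errors vector from the loop above is still in the workspace), the degree-4 CV error can be compared with the minimum:
round(cv.errors, 2)
cv.errors[4] - min(cv.errors)   # gap between degree 4 and the CV minimum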
set.seed(1)
# 10-fold CV error for step functions with 2 to 10 intervals of age
k.values <- 2:10
cv.errors.step <- rep(NA, length(k.values))
for (i in k.values) {
  Wage$age.cut <- cut(Wage$age, i)               # cut age into i intervals
  glm.fit <- glm(wage ~ age.cut, data = Wage)
  cv.errors.step[i - 1] <- cv.glm(Wage, glm.fit, K = 10)$delta[1]   # i starts at 2
}
which.min(cv.errors.step) + 1
## [1] 8
plot(k.values, cv.errors.step, type = "b", xlab = "Number of Cuts", ylab = "CV Error")
optimal.cuts <- which.min(cv.errors.step) + 1
# Refit the step function with the CV-chosen number of intervals
Wage$age.cut <- cut(Wage$age, optimal.cuts)
step.fit <- lm(wage ~ age.cut, data = Wage)
plot(Wage$age, Wage$wage, col = "gray", xlab = "Age", ylab = "Wage",
     main = paste("Step Function Fit with", optimal.cuts, "Cuts"))
# Prediction grid; cutting it over the same range reproduces the same intervals (and factor levels)
age.grid <- seq(min(Wage$age), max(Wage$age), length.out = 100)
age.grid.cut <- cut(age.grid, breaks = optimal.cuts)
step.preds <- predict(step.fit, newdata = data.frame(age.cut = age.grid.cut))
lines(age.grid, step.preds, col = "blue", lwd = 2)
Cross-validation identified that the optimal number of cuts is 8, as it produced the lowest estimated prediction error.
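To make the chosen step function concrete, a short sketch (assuming step.fit and the age.cut factor created above) reports the average wage within each of the 8 age intervals:
# Mean wage within each of the 8 CV-chosen age intervals
tapply(Wage$wage, Wage$age.cut, mean)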
# Load the gam package for fitting GAMs with smoothing splines via s()
library(gam)
# Use s() for smoothing terms on continuous variables, include categorical as factors
gam.fit <- gam(wage ~ s(year, 4) + s(age, 5) + education + jobclass + maritl, data = Wage)
# Plot smoothing components
par(mfrow = c(2, 2))
plot(gam.fit, se = TRUE, col = "blue")
# Plot effect of categorical variables
par(mfrow = c(1, 2))
boxplot(wage ~ education, data = Wage, col = "lightgray", main = "Wage vs Education")
boxplot(wage ~ jobclass, data = Wage, col = "lightgray", main = "Wage vs Jobclass")
summary(gam.fit)
##
## Call: gam(formula = wage ~ s(year, 4) + s(age, 5) + education + jobclass +
## maritl, data = Wage)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -108.513 -19.579 -2.729 13.818 213.475
##
## (Dispersion Parameter for gaussian family taken to be 1201.537)
##
## Null Deviance: 5222086 on 2999 degrees of freedom
## Residual Deviance: 3581781 on 2981 degrees of freedom
## AIC: 29808.64
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## s(year, 4) 1 26181 26181 21.790 3.177e-06 ***
## s(age, 5) 1 195384 195384 162.612 < 2.2e-16 ***
## education 4 1091423 272856 227.089 < 2.2e-16 ***
## jobclass 1 12846 12846 10.691 0.001089 **
## maritl 4 105966 26492 22.048 < 2.2e-16 ***
## Residuals 2981 3581781 1202
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## s(year, 4) 3 0.9883 0.3972
## s(age, 5) 4 18.3997 5.995e-15 ***
## education
## jobclass
## maritl
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To explore the relationships between wage and other predictors in the Wage dataset, I fit a Generalized Additive Model (GAM) using smoothing splines for the continuous variables year and age, and included education, jobclass, and maritl as categorical predictors.
Model Summary Findings:
The parametric effects (from the ANOVA for parametric terms) show that education is the most important predictor (F = 227.1, p < 2.2e-16), with highly significant effects on wage. Age and year also contribute significantly (p < 2.2e-16 and p ≈ 3.18e-6, respectively). Job class and marital status are both statistically significant (p = 0.0011 and p < 2.2e-16, respectively), though with smaller effect sizes.
The nonparametric effects reveal that the smooth function of age is highly significant (F = 18.4, p ≈ 6e-15), confirming a nonlinear relationship between age and wage, while the smooth function of year is not statistically significant (F = 0.99, p = 0.397), suggesting that modeling year as a nonlinear term does not improve the fit over a linear one.
Interpretation:
Education has the strongest effect on wages, with higher educational attainment clearly associated with higher income. Age shows a significant nonlinear relationship: wages increase with age up to a point, then level off or decline slightly, a pattern the spline captures well. Year does not show a strong nonlinear trend, indicating wages have not changed dramatically in a nonlinear way across years in this dataset. Job class also matters: workers in information jobs earn more on average than those in industrial jobs. Marital status has some effect, possibly reflecting life-stage earnings differences.
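Because the nonparametric test suggests that the smooth term for year adds little beyond a linear trend, a natural follow-up is to compare the two specifications directly. The sketch below is one way to do this, assuming the gam package and the gam.fit object from above; gam.lin is a hypothetical name for the comparison model.
# Sketch: GAM with a linear year term versus the smooth s(year, 4) term
gam.lin <- gam(wage ~ year + s(age, 5) + education + jobclass + maritl, data = Wage)
anova(gam.lin, gam.fit, test = "F")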
# Boston data (here taken from the MASS package)
library(MASS)
data(Boston)
fit.poly3 <- lm(nox ~ poly(dis, 3), data = Boston)
summary(fit.poly3)
##
## Call:
## lm(formula = nox ~ poly(dis, 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.121130 -0.040619 -0.009738 0.023385 0.194904
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.554695 0.002759 201.021 < 2e-16 ***
## poly(dis, 3)1 -2.003096 0.062071 -32.271 < 2e-16 ***
## poly(dis, 3)2 0.856330 0.062071 13.796 < 2e-16 ***
## poly(dis, 3)3 -0.318049 0.062071 -5.124 4.27e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06207 on 502 degrees of freedom
## Multiple R-squared: 0.7148, Adjusted R-squared: 0.7131
## F-statistic: 419.3 on 3 and 502 DF, p-value: < 2.2e-16
plot(Boston$dis, Boston$nox, col = "darkgrey")
lines(sort(Boston$dis), predict(fit.poly3, newdata = list(dis = sort(Boston$dis))), col = "blue")
Interpretation:
The cubic polynomial provides a strong fit to the data. The significant coefficients and high R² value suggest that dis is a strong predictor of nox, and that the relationship is nonlinear. A visual inspection shows a nonlinear decreasing trend, where nox decreases as dis increases, though not in a straight line.
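For a sense of the uncertainty around the cubic fit, a short sketch (assuming fit.poly3 from above and that the scatterplot from the previous chunk is still the active plot) adds approximate ± 2 standard-error bands:
dis.grid <- seq(min(Boston$dis), max(Boston$dis), length.out = 100)
pr <- predict(fit.poly3, newdata = list(dis = dis.grid), se = TRUE)
lines(dis.grid, pr$fit + 2 * pr$se.fit, col = "blue", lty = "dashed")
lines(dis.grid, pr$fit - 2 * pr$se.fit, col = "blue", lty = "dashed")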
# Training RSS for polynomial fits of degree 1 through 10
rss <- rep(NA, 10)
plot(Boston$dis, Boston$nox, col = "lightgrey")
for (i in 1:10) {
  fit <- lm(nox ~ poly(dis, i), data = Boston)
  rss[i] <- sum(resid(fit)^2)
  lines(sort(Boston$dis), predict(fit, newdata = list(dis = sort(Boston$dis))), col = i)
}
rss
## [1] 2.768563 2.035262 1.934107 1.932981 1.915290 1.878257 1.849484 1.835630
## [9] 1.833331 1.832171
Interpretation:
As the degree increases, the RSS decreases, indicating that the model fits the training data better with higher complexity. However, the marginal improvement in RSS becomes very small beyond degree 3–4. For example, the RSS drops substantially from degree 1 to degree 2 (2.77 -> 2.04) but barely changes between degrees 4 and 10, stabilizing around 1.83. This suggests diminishing returns in model fit with increasing polynomial degree.
Conclusion: Although higher-degree polynomials reduce training error slightly, the improvement beyond degree 3 or 4 is minimal. To avoid overfitting, it’s often better to choose a lower-degree polynomial that captures the main trend without unnecessary complexity.
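The diminishing returns are easier to see when the training RSS is plotted against the polynomial degree (a sketch reusing the rss vector computed above):
plot(1:10, rss, type = "b", xlab = "Degree", ylab = "Training RSS")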
# 10-fold CV to choose the polynomial degree for nox ~ dis
cv.errors.poly <- rep(NA, 10)
set.seed(1)
for (i in 1:10) {
  fit <- glm(nox ~ poly(dis, i), data = Boston)
  cv.errors.poly[i] <- cv.glm(Boston, fit, K = 10)$delta[1]
}
which.min(cv.errors.poly)
## [1] 4
plot(1:10, cv.errors.poly, type = "b", xlab = "Degree", ylab = "CV Error")
Interpretation:
This result aligns well with the RSS analysis in part (9b), which showed diminishing improvements in training error beyond degree 3–4. Selecting degree 4 offers a good trade-off between flexibility and generalization: it captures the nonlinear relationship between dis and nox without overfitting the data as higher-degree polynomials might.
Conclusion: Cross-validation confirms that a degree 4 polynomial balances predictive accuracy and model simplicity, and is a reasonable choice for modeling this relationship.
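To close out this part, one might refit and plot only the CV-chosen degree-4 polynomial (a sketch following the same plotting approach used above; fit.poly4 is a hypothetical object name):
fit.poly4 <- lm(nox ~ poly(dis, 4), data = Boston)
plot(Boston$dis, Boston$nox, col = "darkgrey")
lines(sort(Boston$dis), predict(fit.poly4, newdata = list(dis = sort(Boston$dis))), col = "blue", lwd = 2)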
library(splines)   # bs() for regression spline basis functions
fit.bs4 <- lm(nox ~ bs(dis, df = 4), data = Boston)
summary(fit.bs4)
##
## Call:
## lm(formula = nox ~ bs(dis, df = 4), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.124622 -0.039259 -0.008514 0.020850 0.193891
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.73447 0.01460 50.306 < 2e-16 ***
## bs(dis, df = 4)1 -0.05810 0.02186 -2.658 0.00812 **
## bs(dis, df = 4)2 -0.46356 0.02366 -19.596 < 2e-16 ***
## bs(dis, df = 4)3 -0.19979 0.04311 -4.634 4.58e-06 ***
## bs(dis, df = 4)4 -0.38881 0.04551 -8.544 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06195 on 501 degrees of freedom
## Multiple R-squared: 0.7164, Adjusted R-squared: 0.7142
## F-statistic: 316.5 on 4 and 501 DF, p-value: < 2.2e-16
plot(Boston$dis, Boston$nox, col = "gray")
lines(sort(Boston$dis), predict(fit.bs4, newdata = list(dis = sort(Boston$dis))), col = "blue")
How Knots Were Chosen: Using bs(dis, df = 4) lets R place the internal knots automatically at quantiles of dis; with the default cubic basis, df = 4 leaves a single internal knot at the median of dis. This adapts the spline to the distribution of the predictor without manually specifying knot locations, and this default strategy typically performs well.
Conclusion: The regression spline with 4 degrees of freedom provides a flexible yet smooth fit to the nonlinear relationship between dis and nox. The model performs similarly to the best polynomial model in part 9c, and each basis function contributes significantly to the fit.
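The knot placement can be verified directly, since bs() stores the internal knots as an attribute of the basis matrix (a sketch; with a cubic basis, df = 4 implies one knot at the median of dis):
attr(bs(Boston$dis, df = 4), "knots")   # internal knot location chosen by bs()
median(Boston$dis)                      # should match the 50th percentile of dis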
# Training RSS for regression splines with 3 to 10 degrees of freedom
rss.bs <- rep(NA, 10)
plot(Boston$dis, Boston$nox, col = "lightgrey")
for (i in 3:10) {
  fit <- lm(nox ~ bs(dis, df = i), data = Boston)
  rss.bs[i] <- sum(resid(fit)^2)
  lines(sort(Boston$dis), predict(fit, newdata = list(dis = sort(Boston$dis))), col = i)
}
rss.bs
## [1] NA NA 1.934107 1.922775 1.840173 1.833966 1.829884 1.816995
## [9] 1.825653 1.792535
Interpretation:
The RSS generally decreases as the degrees of freedom increase, indicating a better training fit. However, the rate of improvement slows after df = 6 or 7 (and the RSS even ticks up slightly at df = 9), suggesting diminishing returns from added complexity. The lowest RSS occurs at df = 10, but it is only marginally better than df = 6–8.
Conclusion: Although increasing the degrees of freedom improves the model fit on training data, the gain becomes minimal beyond 6–8 degrees of freedom. Therefore, one might prefer a moderately complex model (e.g., df = 6 or 7) to reduce the risk of overfitting, especially when considering test performance.
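Plotting the training RSS against the degrees of freedom makes the leveling-off visible (a sketch reusing the rss.bs vector from above):
plot(3:10, rss.bs[3:10], type = "b", xlab = "Degrees of Freedom", ylab = "Training RSS")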
# 10-fold CV to choose the spline degrees of freedom
cv.errors.bs <- rep(NA, 10)
set.seed(1)
for (i in 3:10) {
  fit <- glm(nox ~ bs(dis, df = i), data = Boston)
  cv.errors.bs[i] <- cv.glm(Boston, fit, K = 10)$delta[1]
}
which.min(cv.errors.bs)
## [1] 6
plot(3:10, cv.errors.bs[3:10], type = "b", xlab = "Degrees of Freedom", ylab = "CV Error")
Interpretation:
This result aligns with the pattern observed in part (9e), where the residual sum of squares (RSS) dropped significantly up to around df = 6, and then leveled off. Using 6 degrees of freedom allows the model to capture the nonlinear relationship between dis and nox without overfitting. It represents a balance between model flexibility and generalization, making it an appropriate choice based on both cross-validation and training fit.
Conclusion: Cross-validation confirms that a regression spline with 6 degrees of freedom provides the best out-of-sample performance for modeling the relationship between dis and nox, effectively capturing the curvature in the data while maintaining strong generalization.
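A natural final step is to plot the CV-selected spline with df = 6 (a sketch, assuming the splines package and Boston are loaded as above; fit.bs6 is a hypothetical object name):
fit.bs6 <- lm(nox ~ bs(dis, df = 6), data = Boston)
plot(Boston$dis, Boston$nox, col = "gray")
lines(sort(Boston$dis), predict(fit.bs6, newdata = list(dis = sort(Boston$dis))), col = "blue", lwd = 2)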
# Simulate n = 100 observations from a linear model with two predictors
set.seed(1)
n <- 100
X1 <- rnorm(n, mean = 0, sd = sqrt(2))
X2 <- rnorm(n, mean = 1, sd = sqrt(2))
epsilon <- rnorm(n, mean = 0, sd = 1)
Y <- 1.5 + 3 * X1 - 4.5 * X2 + epsilon
# Initial guess for beta1 (beta0 and beta2 are produced by the backfitting updates)
beta1 <- 1
# One manual backfitting pass: update beta2 holding beta1 fixed, then beta1 holding beta2 fixed
a <- Y - beta1 * X1
beta2 <- lm(a ~ X2)$coef[2]
a <- Y - beta2 * X2
beta1 <- lm(a ~ X1)$coef[2]
# Storage for the coefficient estimates across 1000 backfitting iterations
beta0_vals <- beta1_vals <- beta2_vals <- numeric(1000)
for (i in 1:1000) {
  # Holding beta2 fixed, update beta1 by regressing the partial residual on X1
  a <- Y - beta2 * X2
  beta1 <- lm(a ~ X1)$coef[2]
  # Holding beta1 fixed, update beta2 by regressing the partial residual on X2
  a <- Y - beta1 * X1
  beta2 <- lm(a ~ X2)$coef[2]
  # Update the intercept and record the current estimates
  beta0 <- mean(Y - beta1 * X1 - beta2 * X2)
  beta0_vals[i] <- beta0
  beta1_vals[i] <- beta1
  beta2_vals[i] <- beta2
}
plot(1:1000, beta0_vals, type = "l", col = "blue", ylim = range(c(beta0_vals, beta1_vals, beta2_vals)),
     xlab = "Iteration", ylab = "Coefficient Estimate", main = "Backfitting Estimates Over Time")
lines(1:1000, beta1_vals, col = "red")
lines(1:1000, beta2_vals, col = "green")
legend("topright", legend = c("beta0", "beta1", "beta2"), col = c("blue", "red", "green"), lty = 1)
lm.full <- lm(Y ~ X1 + X2)
ols.coefs <- coef(lm.full)
# Re-plot the backfitting estimates
plot(1:1000, beta0_vals, type = "l", col = "blue", ylim = range(c(beta0_vals, beta1_vals, beta2_vals, ols.coefs)),
     xlab = "Iteration", ylab = "Coefficient Estimate", main = "Backfitting vs. OLS Estimates")
lines(1:1000, beta1_vals, col = "red")
lines(1:1000, beta2_vals, col = "green")
# Overlay OLS estimates using abline()
abline(h = ols.coefs[1], col = "blue", lty = 2) # β̂0 (intercept)
abline(h = ols.coefs[2], col = "red", lty = 2) # β̂1
abline(h = ols.coefs[3], col = "green", lty = 2) # β̂2
Backfitting recovers the same coefficients as multiple linear regression.
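A quick numerical check of that claim (a sketch comparing the final backfitting iterates to the OLS coefficients computed above):
rbind(backfit = c(beta0_vals[1000], beta1_vals[1000], beta2_vals[1000]),
      OLS = ols.coefs)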
# Numerical convergence check: stop at the first iteration where every coefficient
# changes by less than the threshold relative to the previous iteration
threshold <- 1e-6
converged_at <- NA
for (i in 2:1000) {
  delta0 <- abs(beta0_vals[i] - beta0_vals[i - 1])
  delta1 <- abs(beta1_vals[i] - beta1_vals[i - 1])
  delta2 <- abs(beta2_vals[i] - beta2_vals[i - 1])
  if (max(delta0, delta1, delta2) < threshold) {
    converged_at <- i
    break
  }
}
converged_at
## [1] 3
To determine how many backfitting iterations were required to obtain a “good” approximation to the multiple regression coefficient estimates, I used a numerical convergence check. Specifically, I computed the largest absolute change across the three coefficients between successive iterations and defined convergence as the first iteration at which this change fell below a threshold of 1×10^−6.
Based on this approach, the algorithm converged by iteration 3.
This is consistent with the visual results in part (11f), where the backfitting coefficient paths flattened out very early and closely aligned with the OLS estimates.
Conclusion: Only 3 iterations were needed for the backfitting estimates to converge to a stable solution that matches the multiple linear regression coefficients. This demonstrates that backfitting can be extremely efficient when the predictors are not highly correlated and the model is well specified.
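As a sanity check on the low-correlation point, one can inspect the sample correlation of the simulated predictors (a sketch using the X1 and X2 generated above):
cor(X1, X2)   # should be close to zero, since X1 and X2 were drawn independently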