Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.
library(ISLR)
library(boot)
library(ggplot2)
set.seed(1)
cv.error.10 = rep(0, 10)
for (i in 1:10) {
  glm.fit = glm(logwage ~ poly(age, i), data = Wage)
  cv.error.10[i] = cv.glm(Wage, glm.fit, K = 10)$delta[1]
}
cv.error.10
## [1] 0.1179936 0.1107270 0.1100178 0.1095241 0.1095717 0.1096447 0.1096246
## [8] 0.1097252 0.1095097 0.1096670
plot(cv.error.10, type = "b", xlab = "Degree", ylab = "CV Error")
lm.fit = glm(logwage ~ poly(age, 4), data = Wage)
summary(lm.fit)
##
## Call:
## glm(formula = logwage ~ poly(age, 4), data = Wage)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.653905 0.006036 771.070 < 2e-16 ***
## poly(age, 4)1 4.197217 0.330586 12.696 < 2e-16 ***
## poly(age, 4)2 -4.694434 0.330586 -14.200 < 2e-16 ***
## poly(age, 4)3 1.644906 0.330586 4.976 6.87e-07 ***
## poly(age, 4)4 -1.179518 0.330586 -3.568 0.000365 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.109287)
##
## Null deviance: 371.07 on 2999 degrees of freedom
## Residual deviance: 327.31 on 2995 degrees of freedom
## AIC: 1879.3
##
## Number of Fisher Scoring iterations: 2
fit.1 = lm(logwage ~ age, data = Wage)
fit.2 = lm(logwage ~ poly(age, 2), data = Wage)
fit.3 = lm(logwage ~ poly(age, 3), data = Wage)
fit.4 = lm(logwage ~ poly(age, 4), data = Wage)
fit.5 = lm(logwage ~ poly(age, 5), data = Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5)
## Analysis of Variance Table
##
## Model 1: logwage ~ age
## Model 2: logwage ~ poly(age, 2)
## Model 3: logwage ~ poly(age, 3)
## Model 4: logwage ~ poly(age, 4)
## Model 5: logwage ~ poly(age, 5)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 353.45
## 2 2997 331.41 1 22.0377 201.5934 < 2.2e-16 ***
## 3 2996 328.71 1 2.7057 24.7510 6.892e-07 ***
## 4 2995 327.31 1 1.3913 12.7268 0.0003661 ***
## 5 2994 327.30 1 0.0177 0.1615 0.6878137
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
agelims = range(Wage$age)
age.grid = seq(from = agelims[1], to = agelims[2])
preds = predict(lm.fit, newdata = list(age = age.grid), se = TRUE)
se.bands = cbind(preds$fit + 2 * preds$se.fit, preds$fit - 2 * preds$se.fit)
plot(Wage$age, Wage$logwage, xlim = agelims, cex = 0.5, col = "darkgrey")
title("Polynomial fit using degree 4 for Age")
lines(age.grid, preds$fit, lwd = 2, col = "blue")
matlines(age.grid, se.bands, lwd = 1, col = "blue", lty = 3)
The cross-validation errors drop sharply from degree 1 to degrees 2–3 and are essentially flat afterwards. The absolute minimum happens to occur at degree 9, but the gain over degree 4 is negligible, so degree 4 is a reasonable choice.
The degree-4 fit shows a clearly non-linear relationship between age and log(wage): wages rise until roughly age 50 and then gradually decline. All four polynomial coefficients are highly significant (p-values well below 0.001), the residual deviance (327.31) is well below the null deviance (371.07), and the AIC is 1879.3.
The ANOVA comparison of nested models agrees with cross-validation: each term up to the fourth degree significantly improves the fit, while the fifth-degree term does not (p = 0.688).
In conclusion, a degree-4 polynomial captures the non-linear relationship between age and wage effectively, and adding higher-order terms yields essentially no improvement.
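To make the degree selection explicit rather than reading it off the plot, a short check along these lines (a sketch, not part of the original output) can be applied to the stored CV errors:
# degree with the smallest 10-fold CV error
which.min(cv.error.10)
# smallest degree whose CV error is within 1% of that minimum (a more parsimonious pick)
min(which(cv.error.10 <= 1.01 * min(cv.error.10)))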
Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.
set.seed(42)
cv_errors_step = rep(NA, 19)
for (num_cuts in 2:20) {
  Wage$age_cut = cut(Wage$age, num_cuts)
  step_model = glm(wage ~ age_cut, data = Wage)
  cv_errors_step[num_cuts - 1] = cv.glm(Wage, step_model, K = 10)$delta[1]
}
cv_errors_step
## [1] 1733.565 1684.082 1637.465 1632.337 1623.836 1610.572 1602.163 1612.770
## [9] 1606.141 1601.591 1605.167 1604.376 1603.710 1603.690 1602.266 1607.777
## [17] 1604.390 1603.445 1604.955
plot(2:20, cv_errors_step, type="b", ylab="Cross-Validation Error", xlab="Number of Cuts", main="CV Errors for Step Function Models")
The cross-validation error for the step-function models drops rapidly up to about 8 cuts; the minimum actually occurs a little later (around 11 cuts), but the differences beyond 8 cuts are negligible, so 8 cuts are used for the final fit.
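As a quick numerical check (a sketch; recall that index i of cv_errors_step corresponds to i + 1 cuts), the number of cuts with the smallest CV error can be recovered as follows:
# number of cuts minimizing the 10-fold CV error (index i corresponds to i + 1 cuts)
which.min(cv_errors_step) + 1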
final_step_model = glm(wage ~ cut(age, 8), data=Wage)
age_grid = seq(from = min(Wage$age), to = max(Wage$age), length.out = 100)
predictions_step = predict(final_step_model, newdata = list(age = age_grid), se = TRUE)
se_bands_step = cbind(predictions_step$fit + 2 * predictions_step$se.fit,
predictions_step$fit - 2 * predictions_step$se.fit)
plot(Wage$age, Wage$wage, col = "darkgrey", pch = 16, cex = 0.5,
xlab = "Age", ylab = "Wage")
title("Step Function Fit with 8 Cuts")
lines(age_grid, predictions_step$fit, lwd = 2, col = "blue")
matlines(age_grid, se_bands_step, lwd = 1, col = "blue", lty = 3)
boxplot(wage ~ maritl, data = Wage, main = "Wage vs Marital Status", ylab = "Wage", xlab = "Marital Status", col = "lightblue")
library(splines)
marital_spline = lm(wage ~ ns(age, df = 5) + maritl, data = Wage)
summary(marital_spline)
##
## Call:
## lm(formula = wage ~ ns(age, df = 5) + maritl, data = Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -102.024 -23.696 -4.202 14.517 206.687
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.2925 4.6612 13.364 < 2e-16 ***
## ns(age, df = 5)1 48.8124 5.1417 9.493 < 2e-16 ***
## ns(age, df = 5)2 43.9898 6.0327 7.292 3.90e-13 ***
## ns(age, df = 5)3 37.3492 5.1940 7.191 8.10e-13 ***
## ns(age, df = 5)4 62.2447 12.3628 5.035 5.07e-07 ***
## ns(age, df = 5)5 0.6223 9.4840 0.066 0.948
## maritl2. Married 13.7095 2.1049 6.513 8.61e-11 ***
## maritl3. Widowed -4.8432 9.2854 -0.522 0.602
## maritl4. Divorced -3.0008 3.4105 -0.880 0.379
## maritl5. Separated -3.5361 5.6406 -0.627 0.531
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39.42 on 2990 degrees of freedom
## Multiple R-squared: 0.1101, Adjusted R-squared: 0.1074
## F-statistic: 41.08 on 9 and 2990 DF, p-value: < 2.2e-16
The Married and Never Married groups show a substantial number of high-wage outliers, and the box plots for Widowed and Separated look similar to each other.
Married individuals appear to earn more on average than the other groups, consistent with the significant positive coefficient for maritl2. Married in the spline model above.
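A quick descriptive check of that impression (a sketch; simple group means, not adjusted for age) is to tabulate mean wage by marital status:
# mean wage within each marital-status group
tapply(Wage$wage, Wage$maritl, mean)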
boxplot(wage ~ jobclass, data = Wage, main = "Wage vs Job Class", ylab = "Wage", xlab = "Job Class", col = "lightgreen")
jobclass_spline = lm(wage ~ ns(age, df = 5) + jobclass, data = Wage)
summary(jobclass_spline)
##
## Call:
## lm(formula = wage ~ ns(age, df = 5) + jobclass, data = Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -106.293 -23.444 -4.991 16.043 197.146
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 57.217 4.637 12.340 < 2e-16 ***
## ns(age, df = 5)1 57.219 4.646 12.317 < 2e-16 ***
## ns(age, df = 5)2 51.773 5.630 9.196 < 2e-16 ***
## ns(age, df = 5)3 43.319 4.873 8.890 < 2e-16 ***
## ns(age, df = 5)4 74.139 11.742 6.314 3.12e-10 ***
## ns(age, df = 5)5 3.566 9.324 0.382 0.702
## jobclass2. Information 15.013 1.442 10.414 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39.21 on 2993 degrees of freedom
## Multiple R-squared: 0.1188, Adjusted R-squared: 0.117
## F-statistic: 67.23 on 6 and 2993 DF, p-value: < 2.2e-16
library(gam)
## Loading required package: foreach
## Loaded gam 1.22-5
model_gam <- gam(wage ~ s(age) + maritl + jobclass, data = Wage)
summary(model_gam)
##
## Call: gam(formula = wage ~ s(age) + maritl + jobclass, data = Wage)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -110.378 -23.260 -4.772 15.361 201.141
##
## (Dispersion Parameter for gaussian family taken to be 1497.157)
##
## Null Deviance: 5222086 on 2999 degrees of freedom
## Residual Deviance: 4476501 on 2990 degrees of freedom
## AIC: 30459.59
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## s(age) 1 199870 199870 133.499 < 2.2e-16 ***
## maritl 4 141136 35284 23.567 < 2.2e-16 ***
## jobclass 1 173181 173181 115.673 < 2.2e-16 ***
## Residuals 2990 4476501 1497
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## s(age) 3 30.202 < 2.2e-16 ***
## maritl
## jobclass
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 'select' is an mgcv argument, not one recognized by the gam package; plot()
# here draws each fitted term in turn (the smooth of age, then maritl and jobclass)
plot(model_gam, se = TRUE)
This question uses the variables dis (the weighted mean of distances to five Boston employment centers) and nox (nitrogen oxides concentration in parts per 10 million) from the Boston data. We will treat dis as the predictor and nox as the response.
library(MASS)
library(ggplot2)
library(splines)
data(Boston)
Use the poly() function to fit a cubic polynomial regression to predict nox using dis. Report the regression output, and plot the resulting data and polynomial fits.
fit_cubic <- lm(nox ~ poly(dis, 3), data = Boston)
summary(fit_cubic)
##
## Call:
## lm(formula = nox ~ poly(dis, 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.121130 -0.040619 -0.009738 0.023385 0.194904
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.554695 0.002759 201.021 < 2e-16 ***
## poly(dis, 3)1 -2.003096 0.062071 -32.271 < 2e-16 ***
## poly(dis, 3)2 0.856330 0.062071 13.796 < 2e-16 ***
## poly(dis, 3)3 -0.318049 0.062071 -5.124 4.27e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06207 on 502 degrees of freedom
## Multiple R-squared: 0.7148, Adjusted R-squared: 0.7131
## F-statistic: 419.3 on 3 and 502 DF, p-value: < 2.2e-16
plot(Boston$dis, Boston$nox, main="Cubic Polynomial Fit", xlab="Distance to Employment Centers (dis)", ylab="Nitrogen Oxides (nox)", pch=20, col="darkgray")
dis_range <- seq(min(Boston$dis), max(Boston$dis), length=100)
preds <- predict(fit_cubic, newdata = data.frame(dis = dis_range))
lines(dis_range, preds, col="blue", lwd=2)
The cubic polynomial regression shows a strong non-linear inverse relationship between distance to employment centers (dis) and nitrogen oxides concentration (nox), with nox decreasing sharply as dis increases.
Plot the polynomial fits for a range of different polynomial degrees (say, from 1 to 10), and report the associated residual sum of squares.
rss_values <- numeric(10)
plot(Boston$dis, Boston$nox, pch=20, col="gray", main="Polynomial Fits", xlab="dis", ylab="nox")
for (d in 1:10) {
  fit <- lm(nox ~ poly(dis, d), data = Boston)
  rss_values[d] <- sum(residuals(fit)^2)
  # plot only a subset of the degrees to avoid overplotting
  if (d %in% c(1, 2, 3, 5, 7, 10)) {
    lines(dis_range, predict(fit, newdata = data.frame(dis = dis_range)), col = rainbow(10)[d], lwd = 2)
  }
}
# Print RSS values
rss_values
## [1] 2.768563 2.035262 1.934107 1.932981 1.915290 1.878257 1.849484 1.835630
## [9] 1.833331 1.832171
As the degree of the polynomial increases, the model fits the training data more closely and the RSS decreases. The improvement slows markedly after degree 3 or 4, however, so higher-degree models add little value and risk overcomplicating the fit.
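The diminishing returns are easiest to see from the drop in RSS gained by each additional degree (a quick sketch using the stored values):
# reduction in RSS achieved by moving from degree d to degree d + 1
-diff(rss_values)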
Perform cross-validation or another approach to select the optimal degree for the polynomial, and explain your results.
library(boot)
cv_error <- numeric(10)
for (d in 1:10) {
  fit <- glm(nox ~ poly(dis, d), data = Boston)
  # delta[2] is the bias-adjusted K-fold CV estimate returned by cv.glm()
  cv_error[d] <- cv.glm(Boston, fit, K = 10)$delta[2]
}
plot(1:10, cv_error, type="b", pch=19, col="blue", xlab="Polynomial Degree", ylab="10-fold CV Error", main="CV Error vs Polynomial Degree")
The cross-validation curve is lowest around degrees 3 and 4, suggesting they offer the best balance between accuracy and complexity. Higher degrees (such as 7 or 9) increase the error again, most likely because of overfitting, so degree 3 or 4 is the best choice.
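To state the choice programmatically (a sketch on the stored CV errors), the minimizing degree and the smallest degree within 1% of that minimum can be extracted as follows:
# degree minimizing the 10-fold CV error
which.min(cv_error)
# smallest degree whose CV error is within 1% of the minimum
min(which(cv_error <= 1.01 * min(cv_error)))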
Use the bs() function to fit a regression spline to predict nox using dis. Report the output for the fit using four degrees of freedom. How did you choose the knots? Plot the resulting fit.
library(splines)
fit_spline_4df <- lm(nox ~ bs(dis, df=4), data=Boston)
summary(fit_spline_4df)
##
## Call:
## lm(formula = nox ~ bs(dis, df = 4), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.124622 -0.039259 -0.008514 0.020850 0.193891
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.73447 0.01460 50.306 < 2e-16 ***
## bs(dis, df = 4)1 -0.05810 0.02186 -2.658 0.00812 **
## bs(dis, df = 4)2 -0.46356 0.02366 -19.596 < 2e-16 ***
## bs(dis, df = 4)3 -0.19979 0.04311 -4.634 4.58e-06 ***
## bs(dis, df = 4)4 -0.38881 0.04551 -8.544 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06195 on 501 degrees of freedom
## Multiple R-squared: 0.7164, Adjusted R-squared: 0.7142
## F-statistic: 316.5 on 4 and 501 DF, p-value: < 2.2e-16
plot(Boston$dis, Boston$nox, pch=20, col="gray", xlab="dis", ylab="nox", main="Regression Spline with 4 df")
lines(dis_range, predict(fit_spline_4df, newdata = data.frame(dis = dis_range)), col="red", lwd=2)
The regression spline with 4 degrees of freedom gives a smooth, statistically strong fit that explains about 72% of the variation in nox. The knots were not chosen by hand: with df = 4, bs() places its interior knots automatically at quantiles of dis, and a cubic basis with 4 degrees of freedom leaves exactly one interior knot, located at the median of dis.
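The knot location chosen by bs() can be inspected directly (a quick check; with df = 4 and a cubic basis there should be a single interior knot at the median of dis):
# interior knot(s) placed automatically by bs() for df = 4
attr(bs(Boston$dis, df = 4), "knots")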
Now fit a regression spline for a range of degrees of freedom, and plot the resulting fits and report the resulting RSS. Describe the results obtained.
rss_spline <- numeric(10)
plot(Boston$dis, Boston$nox, pch=20, col="gray", main="Splines with Varying DF", xlab="dis", ylab="nox")
for (df in 3:10) {
  fit <- lm(nox ~ bs(dis, df = df), data = Boston)
  rss_spline[df] <- sum(residuals(fit)^2)
  if (df %in% c(3, 4, 5, 7, 10)) {
    lines(dis_range, predict(fit, newdata = data.frame(dis = dis_range)), col = rainbow(10)[df], lwd = 2)
  }
}
rss_spline
## [1] 0.000000 0.000000 1.934107 1.922775 1.840173 1.833966 1.829884 1.816995
## [9] 1.825653 1.792535
As the degrees of freedom increase (the first two entries of rss_spline are unused placeholders, since the loop starts at df = 3), the residual sum of squares generally decreases, indicating a better in-sample fit, but the improvements become marginal after about 6 or 7 df, suggesting diminishing returns.
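As with the polynomials, the diminishing returns can be read off the successive RSS reductions (a quick sketch; only df = 3 through 10 were fit):
# reduction in RSS from each additional degree of freedom, df = 3, ..., 10
-diff(rss_spline[3:10])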
Perform cross-validation or another approach in order to select the best degrees of freedom for a regression spline on this data. Describe your results.
cv_spline_error <- numeric(10)
for (df in 3:10) {
  fit <- glm(nox ~ bs(dis, df = df), data = Boston)
  cv_spline_error[df] <- cv.glm(Boston, fit, K = 10)$delta[2]
}
## (repeated warnings omitted: "some 'x' values beyond boundary knots may cause
## ill-conditioned bases"; these arise because each CV fold fits bs() on a training
## subset and then predicts held-out dis values outside that fold's boundary knots)
plot(3:10, cv_spline_error[3:10], type="b", pch=19, col="darkgreen", xlab="Degrees of Freedom", ylab="10-fold CV Error", main="CV for Regression Splines")
Cross-validation shows that a regression spline with 10 degrees of freedom gives the lowest prediction error, making it the best choice for balancing flexibility and accuracy.
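A programmatic version of that selection (a sketch on the stored CV errors; only df = 3 through 10 were evaluated) is:
# df with the smallest 10-fold CV error, among df = 3, ..., 10
(3:10)[which.min(cv_spline_error[3:10])]
# smallest df whose CV error is within 1% of that minimum
min((3:10)[cv_spline_error[3:10] <= 1.01 * min(cv_spline_error[3:10])])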
Generate a response Y and two predictors X1 and X2, with n = 100.
set.seed(123)
n <- 100
X1 <- rnorm(n)
X2 <- rnorm(n)
beta0 <- 2
beta1 <- 3
beta2 <- 4
epsilon <- rnorm(n)
Y <- beta0 + beta1 * X1 + beta2 * X2 + epsilon
Initialize β̂1 to take on a value of your choice. It does not matter what value you choose.
beta1 <- 0
Keeping β̂1 fixed, fit the model Y − β̂1 X1 = β0 + β2 X2 + ϵ. You can do this as follows:
a <- y - beta1 * x1
beta2 <- lm(a ~ x2)$coef[2]
a <- Y - beta1 * X1
beta2 <- lm(a ~ X2)$coef[2]
Keeping β̂2 fixed, fit the model Y − β̂2 X2 = β0 + β1 X1 + ϵ. You can do this as follows:
a <- y - beta2 * x2
beta1 <- lm(a ~ x1)$coef[2]
a <- Y - beta2 * X2
beta1 <- lm(a ~ X1)$coef[2]
Write a for loop to repeat (c) and (d) 1,000 times. Report the estimates of β̂0, β̂1, and β̂2 at each iteration of the for loop. Create a plot in which each of these values is displayed, with β̂0, β̂1, and β̂2 each shown in a different color.
beta0_vals <- numeric(1000)
beta1_vals <- numeric(1000)
beta2_vals <- numeric(1000)
for (i in 1:1000) {
  a <- Y - beta1 * X1
  beta2 <- lm(a ~ X2)$coef[2]
  a <- Y - beta2 * X2
  beta1 <- lm(a ~ X1)$coef[2]
  beta0_vals[i] <- mean(Y - beta1 * X1 - beta2 * X2)  # Estimate for beta0
  beta1_vals[i] <- beta1
  beta2_vals[i] <- beta2
}
plot(1:1000, beta0_vals, type = "l", col = "red", ylim = range(c(beta0_vals, beta1_vals, beta2_vals)),
xlab = "Iteration", ylab = "Coefficient Estimates", main = "Backfitting Iterations")
lines(1:1000, beta1_vals, col = "blue")
lines(1:1000, beta2_vals, col = "green")
legend("topright", legend = c("Beta0", "Beta1", "Beta2"), col = c("red", "blue", "green"), lty = 1)
Backfitting converged to the multiple linear regression estimates within the first 5–10 iterations, as shown by the stable, overlapping coefficient lines.
Compare your answer in (e) to the results of simply performing multiple linear regression to predict Y using X1 and X2. Use the abline() function to overlay those multiple linear regression coefficient estimates on the plot obtained in (e).
plot(1:1000, beta0_vals, type = "l", col = "red", ylim = range(c(beta0_vals, beta1_vals, beta2_vals)),
xlab = "Iteration", ylab = "Coefficient Estimates", main = "Backfitting Iterations")
lines(1:1000, beta1_vals, col = "blue")
lines(1:1000, beta2_vals, col = "green")
legend("topright", legend = c("Beta0", "Beta1", "Beta2"), col = c("red", "blue", "green"), lty = 1)
lm_model <- lm(Y ~ X1 + X2)
abline(h = coef(lm_model)[1], col = "red", lty = 2) # Beta0
abline(h = coef(lm_model)[2], col = "blue", lty = 2) # Beta1
abline(h = coef(lm_model)[3], col = "green", lty = 2) # Beta2
This plot shows the convergence of the backfitting estimates (β̂0, β̂1, β̂2) over 1,000 iterations. The solid lines are the iterative backfitting values, while the dashed lines mark the corresponding multiple linear regression estimates. All three coefficients converge quickly and then remain stable, indicating that backfitting matches the multiple regression coefficients after only a few iterations.
On this data set, how many backfitting iterations were required in order to obtain a “good” approximation to the multiple regression coefficient estimates?
On this data set, fewer than 10 backfitting iterations were required to obtain a good approximation to the multiple regression coefficient estimates: the coefficient values stabilized and matched the dashed multiple-regression lines very early in the process.
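That claim can be checked numerically (a minimal sketch, assuming "good" means agreeing with the lm() coefficients to within 1e-6):
tol <- 1e-6
# first iteration at which all three backfitting estimates are within tol of the lm() estimates
close_enough <- abs(beta0_vals - coef(lm_model)[1]) < tol &
  abs(beta1_vals - coef(lm_model)[2]) < tol &
  abs(beta2_vals - coef(lm_model)[3]) < tol
which(close_enough)[1]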