# Packages used in this section (assumed loaded earlier): ISLR (Wage), MASS or ISLR2 (Boston),
# boot (cv.glm), splines (ns, bs), gam, ggplot2
# For consistent results
set.seed(2)
# Initialize a vector to store cross-validation errors for degrees 1 to 10
cv.error.10 = rep(0, 10)
for (i in 1:10) {
  poly.fit = glm(wage ~ poly(age, i), data = Wage)
  cv.error.10[i] = cv.glm(Wage, poly.fit, K = 10)$delta[1]
}
# Find the degree with the lowest CV error
degree = which.min(cv.error.10)
cat("Optimal degree chosen by cross-validation:", degree, "\n")
## Optimal degree chosen by cross-validation: 5
We now fit several polynomial models and compare them using ANOVA to see if the improvement with higher degrees is statistically significant.
# Fit polynomial models of increasing degree
fit.1 = lm(wage ~ poly(age, 1), data = Wage)
fit.2 = lm(wage ~ poly(age, 2), data = Wage)
fit.3 = lm(wage ~ poly(age, 3), data = Wage)
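# Note: 'degree' was set to 5 by cross-validation above, so this model coincides with fit.5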
poly.fit = lm(wage ~ poly(age, degree), data = Wage)
fit.5 = lm(wage ~ poly(age, 5), data = Wage)
# Compare the models using ANOVA
anova_result = anova(fit.1, fit.2, fit.3, poly.fit, fit.5, test = "F")
print(anova_result)
## Analysis of Variance Table
##
## Model 1: wage ~ poly(age, 1)
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, degree)
## Model 5: wage ~ poly(age, 5)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.5931 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.8888 0.001679 **
## 4 2994 4770322 2 7353 2.3074 0.099697 .
## 5 2994 4770322 0 0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now, we visualize the polynomial fit using the optimal degree selected by cross-validation.
# Fit the final model using the selected degree
final_model <- lm(wage ~ poly(age, degree = degree), data = Wage)
# Create a grid of age values to predict wages
age_grid <- seq(from = min(Wage$age), to = max(Wage$age), length.out = 100)
preds <- predict(final_model, newdata = data.frame(age = age_grid), se = TRUE)
se_bands <- cbind(preds$fit + 2 * preds$se.fit, preds$fit - 2 * preds$se.fit)
# Plot the observed wages and the fitted curve with confidence bands
plot(Wage$age, Wage$wage, cex = 0.5, col = "lightgreen",
main = paste0("Degree-", degree),
xlab = "Age", ylab = "Wage")
lines(age_grid, preds$fit, lwd = 2, col = "blue")
matlines(age_grid, se_bands, lwd = 1, col = "blue", lty = 3)
Insights
Cross-validation picked degree 5 as the best degree for prediction, minimizing the estimated test error.
ANOVA results tell a more nuanced story:
There is a highly significant jump in explanatory power from the linear to the quadratic model.
Adding a cubic term also gives a statistically significant improvement.
Beyond that, the joint contribution of the degree-4 and degree-5 terms is only marginally significant (p ≈ 0.1). (Because the CV-selected degree is 5, Model 4 and Model 5 in the table are identical, which is why the final row has zero degrees of freedom.)
From the plot:
Wages tend to rise from the early 20s to around age 50.
There is a clear peak in midlife, with a slight decline toward later years.
The higher-degree polynomial adds some curvature at the tails, capturing subtle trends.
Takeaway:
Cross-validation prioritizes predictive performance (favoring degree 5), while ANOVA focuses on whether the added complexity is statistically justified. A cubic model (degree 3) might strike a good balance between fit and interpretability (see the sketch below).
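As a quick visual check (a minimal sketch reusing age_grid, preds, and fit.3 from above), we can overlay the degree-3 fit on the CV-selected degree-5 fit:
# Overlay the cubic fit (red, dashed) on the degree-5 fit (blue) for comparison
preds3 <- predict(fit.3, newdata = data.frame(age = age_grid))
plot(Wage$age, Wage$wage, cex = 0.5, col = "lightgreen",
     xlab = "Age", ylab = "Wage", main = "Degree-3 vs Degree-5 Fit")
lines(age_grid, preds$fit, lwd = 2, col = "blue")
lines(age_grid, preds3, lwd = 2, col = "red", lty = 2)
legend("topright", legend = c("Degree 5", "Degree 3"),
       col = c("blue", "red"), lty = c(1, 2), lwd = 2)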
# Try step functions with 2 to 11 intervals
cv.error.10 = rep(0, 10)
for (i in 1:10) {
  Wage$age.cut = cut(Wage$age, i + 1)
  glm.fit = glm(wage ~ age.cut, data = Wage)
  cv.error.10[i] = cv.glm(Wage, glm.fit, K = 10)$delta[1]
}
# Best number of cuts based on CV
number = which.min(cv.error.10)
cat("Optimal number of cuts chosen by cross-validation:", number, "\n")
## Optimal number of cuts chosen by cross-validation: 7
# Fit the final step function model
step.fit = glm(wage ~ cut(age, number + 1), data = Wage)
# Generate prediction data
grid = seq(from = min(Wage$age), to = max(Wage$age), length.out = 100)
preds = predict(step.fit, newdata = data.frame(age = grid), se = TRUE)
se.bands = cbind(preds$fit + 2 * preds$se.fit, preds$fit - 2 * preds$se.fit)
# Plot the step function fit
plot(Wage$age, Wage$wage, cex = 0.5, col = "lightgreen",
xlab = "Age", ylab = "Wage",
main = paste0("Step Function with ", number, " Cuts"))
lines(grid, preds$fit, lwd = 2, col = "blue")
matlines(grid, se.bands, lwd = 1, col = "blue", lty = 3)
# Add cut boundaries as vertical lines (cut() uses equal-width breaks over the age range,
# so compute those breaks directly; cut() pads the range slightly, making this approximate)
cuts = seq(min(Wage$age), max(Wage$age), length.out = number + 2)[-c(1, number + 2)]
abline(v = cuts, lty = 2, col = "lightgreen")
Interpretation
Cross-validation selected 7 cuts, dividing age into 8 intervals.
The step function shows:
Lower wages for younger workers (under age 30)
A noticeable jump in wages around 30
Wages remain fairly stable through midlife (40s to 60s)
A drop-off in wages as workers approach retirement age (65+)
The step function creates discrete wage levels for different age ranges, which provides an intuitive interpretation of how wages change across career stages.
The blue dashed lines represent confidence intervals, which widen at the extremes of the age range where data is sparser.
The vertical dashed lines mark the 7 cut points that divide the age range into 8 segments.
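To make those intervals concrete, a quick check (using the same cut() call as the fitted model) lists the eight age bins and the number of workers in each:
# Show the equal-width age bins used by the step function and their counts
table(cut(Wage$age, number + 1))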
We’ll start by fitting three GAM models:
A base model using a natural spline for age.
A model adding marital status.
A model including both marital status and job class.
# Fit GAM models with increasing complexity
gam.fit1 <- gam(wage ~ ns(age, 4), data = Wage)
gam.fit2 <- gam(wage ~ ns(age, 4) + maritl, data = Wage)
gam.fit3 <- gam(wage ~ ns(age, 4) + maritl + jobclass, data = Wage)
# Compare models using ANOVA
anova(gam.fit1, gam.fit2, gam.fit3)
## Analysis of Deviance Table
##
## Model 1: wage ~ ns(age, 4)
## Model 2: wage ~ ns(age, 4) + maritl
## Model 3: wage ~ ns(age, 4) + maritl + jobclass
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 2995 4773109
## 2 2991 4650697 4 122412 < 2.2e-16 ***
## 3 2990 4479913 1 170784 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ggplot(Wage, aes(x = maritl, y = wage)) +
geom_boxplot(fill = "blue") +
labs(title = "Wage by Marital Status", x = "Marital Status", y = "Wage")
ggplot(Wage, aes(x = jobclass, y = wage)) +
geom_boxplot(fill = "skyblue") +
labs(title = "Wage by Job Class", x = "Job Class", y = "Wage")
ggplot(Wage, aes(x = age, y = wage)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "loess", se = FALSE, color = "red") +
labs(title = "Non-linear Relationship between Age and Wage")
## `geom_smooth()` using formula = 'y ~ x'
Summary of Findings
Using flexible modeling methods like splines and GAMs helped uncover richer insights into the relationship between age, marital status, job class, and wages.
Key Takeaways from the GAM Models
Adding maritl and jobclass significantly improved the model. Each addition reduced the residual deviance:
Age only: 4,773,109
Age + Marital Status: 4,650,697
Age + Marital Status + Job Class: 4,479,913
These drops in deviance, coupled with very small p-values (p < 0.001), suggest that both marital status and job class are important predictors of wage.
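These reductions can be read directly from the fitted objects (a small check reusing gam.fit1 through gam.fit3):
# Deviance reduction attributable to each added predictor
c(maritl = deviance(gam.fit1) - deviance(gam.fit2),
  jobclass = deviance(gam.fit2) - deviance(gam.fit3))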
Insights from the Data Visualizations
Marital Status:
Married individuals tend to earn more on average.
Never Married, Divorced, and Separated individuals generally earn less, with fewer high-earning outliers.
Job Class:
Information sector jobs have a noticeably higher median wage compared to industrial jobs.
There’s also more wage variability in the Information group, with a wider range of high earners.
Age:
The relationship between age and wage is clearly non-linear.
Wages rise through early adulthood, plateau in midlife, and slightly decline later in life.
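A compact way to view all three fitted components of the final model at once is the gam package's plot method (as used in the ISLR labs):
# Plot each fitted term of the final GAM with standard-error bands
par(mfrow = c(1, 3))
plot(gam.fit3, se = TRUE, col = "blue")
par(mfrow = c(1, 1))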
cubic_model <- lm(nox ~ poly(dis, 3), data = Boston)
summary(cubic_model)
##
## Call:
## lm(formula = nox ~ poly(dis, 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.121130 -0.040619 -0.009738 0.023385 0.194904
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.554695 0.002759 201.021 < 2e-16 ***
## poly(dis, 3)1 -2.003096 0.062071 -32.271 < 2e-16 ***
## poly(dis, 3)2 0.856330 0.062071 13.796 < 2e-16 ***
## poly(dis, 3)3 -0.318049 0.062071 -5.124 4.27e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06207 on 502 degrees of freedom
## Multiple R-squared: 0.7148, Adjusted R-squared: 0.7131
## F-statistic: 419.3 on 3 and 502 DF, p-value: < 2.2e-16
# Create a scatter plot with the polynomial fit
plot(Boston$dis, Boston$nox, pch = 20, cex = 0.7, col = "darkgray",
xlab = "Distance to Employment Centers (dis)",
ylab = "Nitrogen Oxides Concentration (nox)",
main = "Cubic Polynomial Regression")
# Create a grid of dis values for smooth prediction line
dis_grid <- seq(min(Boston$dis), max(Boston$dis), length.out = 100)
# Generate predictions
preds <- predict(cubic_model, newdata = data.frame(dis = dis_grid), se = TRUE)
# Add the cubic polynomial fit line
lines(dis_grid, preds$fit, col = "blue", lwd = 2)
# Add confidence bands
lines(dis_grid, preds$fit + 2 * preds$se.fit, col = "blue", lty = 2)
lines(dis_grid, preds$fit - 2 * preds$se.fit, col = "blue", lty = 2)
# Add a legend
legend("topright", legend = c("Data", "Cubic Polynomial Fit", "95% Confidence Interval"),
pch = c(20, NA, NA), lty = c(NA, 1, 2), col = c("darkgray", "blue", "blue"))
We then visualized this fit alongside the actual data:
A scatterplot shows how nox changes with dis.
A smooth blue line shows the predicted values from our cubic model.
Dashed lines represent the 95% confidence interval for the predictions.
Key Takeaways: The model is highly significant, with an adjusted R² of 0.713, meaning it explains about 71% of the variability in NOx levels.
Each term (linear, quadratic, cubic) is statistically significant.
The plot shows that:
Pollution is highest in areas closest to employment centers.
NOx drops sharply as distance increases from 1 to 6 units.
Beyond 6 units, the trend flattens out, and levels stabilize.
There’s a tiny uptick beyond 10 units, but the wide confidence bands suggest uncertainty.
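To put rough numbers on that pattern, we can query the fitted cubic at a few illustrative distances (a quick check; the chosen values are arbitrary):
# Predicted nox at a few illustrative distances
predict(cubic_model, newdata = data.frame(dis = c(2, 6, 10)))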
# Create a function to fit polynomials and calculate RSS
calculate_rss <- function(degree) {
  model <- lm(nox ~ poly(dis, degree), data = Boston)
  return(sum(model$residuals^2))
}
# Calculate RSS for degrees 1 through 10
rss_values <- sapply(1:10, calculate_rss)
# Create a data frame for reporting
rss_df <- data.frame(
Degree = 1:10,
RSS = rss_values
)
# Plot the RSS by degree
plot(1:10, rss_values, type = "b", pch = 19,
xlab = "Polynomial Degree", ylab = "Residual Sum of Squares",
main = "RSS by Polynomial Degree")
# Plot the polynomial fits on a single graph
plot(Boston$dis, Boston$nox, pch = 20, cex = 0.7, col = "darkgray",
xlab = "Distance to Employment Centers (dis)",
ylab = "Nitrogen Oxides Concentration (nox)",
main = "Polynomial Fits of Various Degrees")
# Generate and plot fits for each degree
colors <- rainbow(10)  # one color per degree (1 to 10)
legend_text <- character(10)
for (i in 1:10) {
  # Fit model
  model <- lm(nox ~ poly(dis, i), data = Boston)
  # Create prediction grid
  dis_grid <- seq(min(Boston$dis), max(Boston$dis), length.out = 100)
  preds <- predict(model, newdata = data.frame(dis = dis_grid))
  # Add the polynomial fit line
  lines(dis_grid, preds, col = colors[i], lwd = 1.5)
  # Create legend text
  legend_text[i] <- paste("Degree", i)
}
# Add a legend
legend("topright", legend = legend_text, col = colors, lwd = 1.5, cex = 0.7)
# Print the RSS values
print(rss_df)
## Degree RSS
## 1 1 2.768563
## 2 2 2.035262
## 3 3 1.934107
## 4 4 1.932981
## 5 5 1.915290
## 6 6 1.878257
## 7 7 1.849484
## 8 8 1.835630
## 9 9 1.833331
## 10 10 1.832171
Observations:
The biggest improvement comes from moving from a linear to a quadratic model.
After degree 3, the benefits start to plateau.
The cubic model seems to hit the sweet spot—adding more complexity after that barely helps and might overfit the data.
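The successive reductions in RSS make this plateau explicit (a quick check on the rss_values computed above):
# Drop in RSS from each additional polynomial degree
round(-diff(rss_values), 4)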
We used 10-fold cross-validation to determine the optimal polynomial degree:
# Set seed for reproducibility
set.seed(1)
# Function to calculate CV error for each polynomial degree
cv_error <- function(degree) {
  glm_fit <- glm(nox ~ poly(dis, degree), data = Boston)
  cv_result <- cv.glm(Boston, glm_fit, K = 10)
  return(cv_result$delta[1]) # Return cross-validation error
}
# Calculate CV errors for degrees 1-10
cv_errors <- sapply(1:10, cv_error)
# Create a data frame of results
cv_df <- data.frame(
Degree = 1:10,
CV_MSE = cv_errors
)
# Print results
print(cv_df)
## Degree CV_MSE
## 1 1 0.005558263
## 2 2 0.004085706
## 3 3 0.003876521
## 4 4 0.003863342
## 5 5 0.004237452
## 6 6 0.005686862
## 7 7 0.010278897
## 8 8 0.006810868
## 9 9 0.033308607
## 10 10 0.004075599
# Plot CV errors
plot(1:10, cv_errors, type = "b", pch = 19, col = "skyblue",
xlab = "Polynomial Degree", ylab = "Cross-Validation MSE",
main = "Cross-Validation Error by Polynomial Degree")
best_degree <- which.min(cv_errors)
points(best_degree, cv_errors[best_degree], col = "green", cex = 2, pch = 19)
Results: Degree 4 had the lowest error, but degree 3 was nearly identical—only a 0.000013 difference!
The error drops sharply from degree 1 to 3, then levels out.
Degree 3 is likely the best choice, balancing accuracy and simplicity.
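That gap can be read directly from cv_errors (a quick check):
# Difference between the degree-3 and degree-4 CV errors
cv_errors[3] - cv_errors[4]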
# Fit a regression spline with 4 degrees of freedom
spline_model <- lm(nox ~ bs(dis, df = 4), data = Boston)
# Get the summary of the regression
summary(spline_model)
##
## Call:
## lm(formula = nox ~ bs(dis, df = 4), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.124622 -0.039259 -0.008514 0.020850 0.193891
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.73447 0.01460 50.306 < 2e-16 ***
## bs(dis, df = 4)1 -0.05810 0.02186 -2.658 0.00812 **
## bs(dis, df = 4)2 -0.46356 0.02366 -19.596 < 2e-16 ***
## bs(dis, df = 4)3 -0.19979 0.04311 -4.634 4.58e-06 ***
## bs(dis, df = 4)4 -0.38881 0.04551 -8.544 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06195 on 501 degrees of freedom
## Multiple R-squared: 0.7164, Adjusted R-squared: 0.7142
## F-statistic: 316.5 on 4 and 501 DF, p-value: < 2.2e-16
# Plot the resulting fit
plot(Boston$dis, Boston$nox, pch = 20, cex = 0.7, col = "darkgray",
xlab = "Distance to Employment Centers (dis)",
ylab = "Nitrogen Oxides Concentration (nox)",
main = "Regression Spline with 4 Degrees of Freedom")
# Create a grid for smooth prediction line
dis_grid <- seq(min(Boston$dis), max(Boston$dis), length.out = 100)
# Generate predictions
preds <- predict(spline_model, newdata = data.frame(dis = dis_grid), se = TRUE)
# Add the spline fit
lines(dis_grid, preds$fit, col = "blue", lwd = 2)
# Add confidence bands
lines(dis_grid, preds$fit + 2 * preds$se.fit, col = "blue", lty = 2)
lines(dis_grid, preds$fit - 2 * preds$se.fit, col = "blue", lty = 2)
# Show knot location
knots <- attr(bs(Boston$dis, df = 4), "knots")
abline(v = knots, lty = 2, col = "red")
# Add a legend
legend("topright", legend = c("Data", "Spline Fit", "95% CI", "Knot"),
pch = c(20, NA, NA, NA), lty = c(NA, 1, 2, 2),
col = c("darkgray", "blue", "blue", "red"))
We let R choose the knot locations automatically: with df = 4 and a cubic basis, bs() places df - 3 = 1 interior knot at the median of dis (the red dashed line in the plot). This gives a flexible fit without manually placing any knots; the quick check below confirms where the knot landed.
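# The single interior knot chosen by bs() with df = 4, and the median of dis
attr(bs(Boston$dis, df = 4), "knots")
median(Boston$dis)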
Highlights: All spline terms are statistically significant.
Slightly better R² (0.716) than the cubic polynomial.
Residual error is very similar to what we saw before.
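A side-by-side comparison of the two fits (reusing cubic_model and spline_model from above):
# Compare adjusted R-squared and residual standard error
c(cubic = summary(cubic_model)$adj.r.squared,
  spline = summary(spline_model)$adj.r.squared)
c(cubic = summary(cubic_model)$sigma,
  spline = summary(spline_model)$sigma)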
# Function to fit spline model and calculate RSS
fit_spline_and_get_rss <- function(df) {
  model <- lm(nox ~ bs(dis, df = df), data = Boston)
  return(sum(model$residuals^2))
}
# Calculate RSS for degrees of freedom 3 through 10
df_values <- 3:10
rss_values <- sapply(df_values, fit_spline_and_get_rss)
# Create a data frame of the results
rss_df <- data.frame(DF = df_values, RSS = rss_values)
# Print results
print(rss_df)
## DF RSS
## 1 3 1.934107
## 2 4 1.922775
## 3 5 1.840173
## 4 6 1.833966
## 5 7 1.829884
## 6 8 1.816995
## 7 9 1.825653
## 8 10 1.792535
# Plot RSS by degrees of freedom
plot(df_values, rss_values, type = "b", pch = 19,
xlab = "Degrees of Freedom", ylab = "Residual Sum of Squares",
main = "RSS by Degrees of Freedom (Regression Spline)")
# Plot the spline fits
plot(Boston$dis, Boston$nox, pch = 20, cex = 0.7, col = "lightgreen",
xlab = "Distance to Employment Centers (dis)",
ylab = "Nitrogen Oxides Concentration (nox)",
main = "Regression Spline Fits with Various Degrees of Freedom")
# Create prediction grid
dis_grid <- seq(min(Boston$dis), max(Boston$dis), length.out = 100)
# Generate and plot fits for each degree of freedom
colors <- rainbow(length(df_values))
for (i in 1:length(df_values)) {
  df <- df_values[i]
  model <- lm(nox ~ bs(dis, df = df), data = Boston)
  preds <- predict(model, newdata = data.frame(dis = dis_grid))
  lines(dis_grid, preds, col = colors[i], lwd = 1.5)
}
# Add a legend
legend("topright", legend = paste("df =", df_values),
col = colors, lwd = 1.5, cex = 0.7)
What We Learned: The biggest improvement happened from DF = 4 to DF = 5.
After that, improvements are smaller.
RSS increased slightly at DF = 9. This is possible because splines with different degrees of freedom use different knot locations, so the models are not nested and training RSS need not decrease monotonically (the check below shows how the knots shift).
The lowest RSS came at DF = 10, but again, there’s a risk of overfitting.
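The interior knots chosen by bs() for neighboring df values show how the knot set shifts (a quick look using the same defaults as above):
# Interior knot locations for df = 8 versus df = 9
attr(bs(Boston$dis, df = 8), "knots")
attr(bs(Boston$dis, df = 9), "knots")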
# Set seed for reproducibility
set.seed(1)
# Function to compute CV MSE for a given df
cv_error_spline <- function(df) {
  # Create formula with bs() function
  form <- as.formula(paste("nox ~ bs(dis, df =", df, ")"))
  # Fit the model
  model <- glm(form, data = Boston)
  # Perform 10-fold cross-validation
  cv_result <- cv.glm(Boston, model, K = 10)
  # Return the cross-validation error
  return(cv_result$delta[1])
}
# Calculate CV error for degrees of freedom 3 through 10
df_values <- 3:10
cv_errors <- sapply(df_values, cv_error_spline)
# Create a data frame of results
cv_df <- data.frame(DF = df_values, CV_MSE = cv_errors)
print(cv_df)
## DF CV_MSE
## 1 3 0.003897402
## 2 4 0.003890178
## 3 5 0.003724890
## 4 6 0.003697010
## 5 7 0.003740823
## 6 8 0.003709286
## 7 9 0.003727194
## 8 10 0.003715941
# Find the optimal degrees of freedom
best_df <- df_values[which.min(cv_errors)]
cat("Optimal degrees of freedom:", best_df, "\n")
## Optimal degrees of freedom: 6
# Plot CV errors
plot(df_values, cv_errors, type = "b", pch = 19, col = "skyblue",
xlab = "Degrees of Freedom", ylab = "Cross-Validation MSE",
main = "CV Error by Degrees of Freedom (Regression Spline)")
points(best_df, cv_errors[which.min(cv_errors)], col = "lightgreen", cex = 2, pch = 19)
Final Verdict:
DF = 6 gave the lowest cross-validation error.
Errors steadily declined from DF = 3 to DF = 6, but barely changed after that.
So, a spline with 6 degrees of freedom seems like the best option—great fit, not too complex.
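As a wrap-up, a minimal sketch (reusing best_df and the plotting approach from earlier) refits and plots the CV-selected spline:
# Refit the CV-selected spline (df = 6) and plot it with approximate 95% bands
final_spline <- lm(nox ~ bs(dis, df = best_df), data = Boston)
dis_grid <- seq(min(Boston$dis), max(Boston$dis), length.out = 100)
preds <- predict(final_spline, newdata = data.frame(dis = dis_grid), se = TRUE)
plot(Boston$dis, Boston$nox, pch = 20, cex = 0.7, col = "darkgray",
     xlab = "Distance to Employment Centers (dis)",
     ylab = "Nitrogen Oxides Concentration (nox)",
     main = paste0("Regression Spline with df = ", best_df))
lines(dis_grid, preds$fit, col = "blue", lwd = 2)
lines(dis_grid, preds$fit + 2 * preds$se.fit, col = "blue", lty = 2)
lines(dis_grid, preds$fit - 2 * preds$se.fit, col = "blue", lty = 2)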
# Set a seed for reproducibility
set.seed(1)
# Generate predictors
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
# Generate response with true coefficients beta0 = 3, beta1 = 2, beta2 = 1
beta0_true <- 3
beta1_true <- 2
beta2_true <- 1
y <- beta0_true + beta1_true * x1 + beta2_true * x2 + rnorm(n, sd = 0.5)
beta1 <- 0 # Arbitrary initial value
print(beta1)
## [1] 0
a <- y - beta1 * x1
beta2 <- lm(a ~ x2)$coef[2]
print(beta2)
## x2
## 0.971392
a <- y - beta2 * x2
beta1 <- lm(a ~ x1)$coef[2]
print(beta1)
## x1
## 2.010553
set.seed(123)
n <- 100
x1 <- rnorm(n, mean = 0, sd = sqrt(2))
x2 <- rnorm(n, mean = 1, sd = sqrt(2))
beta0 <- 1.5
beta1 <- 3
beta2 <- -4.5
epsilon <- rnorm(n, mean = 0, sd = 1)
y <- beta0 + beta1 * x1 + beta2 * x2 + epsilon
# Initialize parameters
beta0_current <- 0
beta1_current <- 0
beta2_current <- 0
# Set up storage for coefficient estimates
n_iter <- 1000
beta0_history <- numeric(n_iter)
beta1_history <- numeric(n_iter)
beta2_history <- numeric(n_iter)
# Backfitting algorithm
for (i in 1:n_iter) {
  # Step (c): Update beta2 with beta1 fixed
  a <- y - beta1_current * x1
  lm_result <- lm(a ~ x2)
  beta0_current <- lm_result$coef[1]
  beta2_current <- lm_result$coef[2]
  # Step (d): Update beta1 with beta2 fixed
  a <- y - beta2_current * x2
  lm_result <- lm(a ~ x1)
  beta0_current <- lm_result$coef[1]
  beta1_current <- lm_result$coef[2]
  # Store the current estimates
  beta0_history[i] <- beta0_current
  beta1_history[i] <- beta1_current
  beta2_history[i] <- beta2_current
}
# Print selected iterations as requested
print_iterations <- c(1:5, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500,
                      550, 600, 650, 700, 750, 800, 850, 900, 950, 1000)
for (i in print_iterations) {
  cat("Iteration", i, ": beta0 =", beta0_history[i],
      "beta1 =", beta1_history[i],
      "beta2 =", beta2_history[i], "\n")
}
## Iteration 1 : beta0 = 1.734344 beta1 = 2.898704 beta2 = -4.619032
## Iteration 2 : beta0 = 1.618513 beta1 = 2.905816 beta2 = -4.483496
## Iteration 3 : beta0 = 1.618229 beta1 = 2.905833 beta2 = -4.483164
## Iteration 4 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 5 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 50 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 100 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 150 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 200 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 250 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 300 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 350 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 400 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 450 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 500 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 550 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 600 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 650 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 700 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 750 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 800 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 850 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 900 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 950 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
## Iteration 1000 : beta0 = 1.618228 beta1 = 2.905834 beta2 = -4.483163
# Plot the coefficient histories
plot(1:n_iter, beta0_history, type = "l", col = "red",
xlab = "Iteration", ylab = "Coefficient Estimate",
main = "Coefficient Estimates by Iteration",
ylim = range(c(beta0_history, beta1_history, beta2_history)))
lines(1:n_iter, beta1_history, col = "skyblue")
lines(1:n_iter, beta2_history, col = "lightgreen")
legend("topright", legend = c("beta0", "beta1", "beta2"),
col = c("red", "skyblue", "lightgreen"), lty = 1)
# Perform direct multiple linear regression
direct_lm <- lm(y ~ x1 + x2)
direct_beta0 <- coef(direct_lm)[1]
direct_beta1 <- coef(direct_lm)[2]
direct_beta2 <- coef(direct_lm)[3]
cat("\nDirect multiple regression estimates:\n")
##
## Direct multiple regression estimates:
cat("beta0 =", direct_beta0, "\n")
## beta0 = 1.618228
cat("beta1 =", direct_beta1, "\n")
## beta1 = 2.905834
cat("beta2 =", direct_beta2, "\n")
## beta2 = -4.483163
# Plot with clear legend and legible colors
plot(1:n_iter, beta0_history, type = "l", col = "darkred", lwd = 2,
xlab = "Iteration", ylab = "Coefficient Estimate",
main = "Backfitting Algorithm Convergence",
ylim = range(c(beta0_history, beta1_history, beta2_history)))
lines(1:n_iter, beta1_history, col = "darkblue", lwd = 2)
lines(1:n_iter, beta2_history, col = "darkgreen", lwd = 2)
# Overlay direct regression estimates
abline(h = direct_beta0, col = "red", lty = 2, lwd = 1.5)
abline(h = direct_beta1, col = "skyblue", lty = 2, lwd = 1.5)
abline(h = direct_beta2, col = "lightgreen", lty = 2, lwd = 1.5)
legend("right",
legend = c("beta0 (backfitting)", "beta1 (backfitting)", "beta2 (backfitting)",
"beta0 (direct)", "beta1 (direct)", "beta2 (direct)"),
col = c("darkred", "darkblue", "darkgreen", "red", "skyblue", "lightgreen"),
lty = c(1, 1, 1, 2, 2, 2),
lwd = c(2, 2, 2, 1.5, 1.5, 1.5),
cex = 0.8)
# Zoomed in plot for early iterations
plot(1:50, beta0_history[1:50], type = "l", col = "darkred", lwd = 2,
xlab = "Iteration", ylab = "Coefficient Estimate",
main = "First 50 Iterations of Backfitting Algorithm",
ylim = range(c(beta0_history[1:50], beta1_history[1:50], beta2_history[1:50])))
lines(1:50, beta1_history[1:50], col = "darkblue", lwd = 2)
lines(1:50, beta2_history[1:50], col = "darkgreen", lwd = 2)
abline(h = direct_beta0, col = "red", lty = 2, lwd = 1.5)
abline(h = direct_beta1, col = "skyblue", lty = 2, lwd = 1.5)
abline(h = direct_beta2, col = "lightgreen", lty = 2, lwd = 1.5)
legend("right",
legend = c("beta0 (backfitting)", "beta1 (backfitting)", "beta2 (backfitting)",
"beta0 (direct)", "beta1 (direct)", "beta2 (direct)"),
col = c("darkred", "darkblue", "darkgreen", "red", "skyblue", "lightgreen"),
lty = c(1, 1, 1, 2, 2, 2),
lwd = c(2, 2, 2, 1.5, 1.5, 1.5),
cex = 0.8)
Observations
The backfitting algorithm does a great job of approximating standard multiple linear regression. After 1,000 iterations, the estimated coefficients were virtually identical to those obtained directly from multiple regression, matching to six decimal places:
beta0 ≈ 1.618228, beta1 ≈ 2.905834, beta2 ≈ -4.483163
This shows that backfitting is a powerful method: by iteratively fitting simple linear regressions and updating one coefficient at a time, it breaks a multivariate regression problem into pieces and still arrives at the least-squares solution. The final estimates are also close to the true parameters used to generate the data; the small differences are expected and are due to the random noise added to the response.
In this dataset, the algorithm converged extremely quickly. After just 4 iterations, the coefficient estimates stabilized and matched the direct multiple regression estimates to 6 decimal places:
Iteration 1: β₀ = 1.7343, β₁ = 2.8987, β₂ = -4.6190
Iteration 2: β₀ = 1.6185, β₁ = 2.9058, β₂ = -4.4835
Iteration 3: β₀ = 1.6182, β₁ = 2.9058, β₂ = -4.4832
Iteration 4+: Values remain unchanged (i.e., convergence achieved)
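A direct numerical check (reusing the stored histories and direct_lm defined above) confirms the match:
# Largest absolute difference between the final backfitting estimates
# and the direct lm() coefficients
max(abs(c(beta0_history[n_iter], beta1_history[n_iter], beta2_history[n_iter]) -
          coef(direct_lm)))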
Key Takeaways
Backfitting is a powerful and intuitive approach that lets you build a multiple regression model using only simple linear regressions.
The algorithm converges quickly and gives the same result as a standard linear regression model.
This approach forms the foundation of more flexible models like GAMs, where each term can be a non-linear smooth function.
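To illustrate that last point, here is a minimal sketch of backfitting with smooth terms instead of simple linear regressions (not part of the exercise; it reuses x1, x2, and y from the simulation above, smooths the partial residuals with smooth.spline(), and the df = 4 choice is arbitrary):
# Sketch: backfitting with smoothers, as in a GAM (reuses x1, x2, y from above)
f1 <- rep(0, length(y))   # current smooth estimate for the x1 term
f2 <- rep(0, length(y))   # current smooth estimate for the x2 term
alpha <- mean(y)          # intercept
for (it in 1:10) {
  # Update f1 by smoothing the partial residuals with f2 held fixed
  f1 <- predict(smooth.spline(x1, y - alpha - f2, df = 4), x1)$y
  f1 <- f1 - mean(f1)     # center each smooth for identifiability
  # Update f2 by smoothing the partial residuals with f1 held fixed
  f2 <- predict(smooth.spline(x2, y - alpha - f1, df = 4), x2)$y
  f2 <- f2 - mean(f2)
}
# The simulated relationship is linear, so each estimated smooth should be
# close to a straight line with slope near the corresponding true coefficient
plot(x1, f1, pch = 20, xlab = "x1", ylab = "f1(x1)",
     main = "Estimated smooth term for x1")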