library(ISLR2)
library(boot)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
wage_data <- read.csv("/Users/saransh/Downloads/Statistical_Learning_Resources/Wage.csv")
# Fit polynomial models of degree 1 to 10 and estimate test error with 10-fold CV
cv_errors <- rep(NA, 10)
for (d in 1:10) {
  fit <- glm(wage ~ poly(age, d), data = wage_data)
  cv_errors[d] <- cv.glm(wage_data, fit, K = 10)$delta[1]
}
# Plot cross-validation error
plot(1:10, cv_errors, type = "b", xlab = "Degree", ylab = "CV Error")
best_degree <- which.min(cv_errors)
best_degree
## [1] 10
Interpretation: The cross-validation error drops sharply from degree 1 to 2 and then flattens out. The minimum error occurs at degree 10, which is therefore selected as the optimal polynomial degree. However, the improvement beyond degree 4 is marginal, so the higher-degree fits risk overfitting.
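The claim that the gains beyond degree 4 are marginal can be checked numerically. A minimal sketch (not part of the original output; the exact values depend on the random CV folds, so a set.seed() before the loop would make it reproducible):
# Sketch: each degree's CV error and its percentage gap above the best degree
round(cv_errors, 2)
round(100 * (cv_errors - min(cv_errors)) / min(cv_errors), 2)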
Comparing using hypothesis testing with ANOVA:
fit_list <- list()
for (i in 1:5) {
  fit_list[[i]] <- lm(wage ~ poly(age, i), data = wage_data)
}
anova(fit_list[[1]], fit_list[[2]], fit_list[[3]], fit_list[[4]], fit_list[[5]])
## Analysis of Variance Table
##
## Model 1: wage ~ poly(age, i)
## Model 2: wage ~ poly(age, i)
## Model 3: wage ~ poly(age, i)
## Model 4: wage ~ poly(age, i)
## Model 5: wage ~ poly(age, i)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.5931 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.8888 0.001679 **
## 4 2995 4771604 1 6070 3.8098 0.051046 .
## 5 2994 4770322 1 1283 0.8050 0.369682
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation: The ANOVA table shows that:
- Degrees 2 and 3 each add significant explanatory power (p < 0.01),
- Degree 4 is only marginally significant (p = 0.051),
- Degree 5 is not significant (p = 0.37).
This suggests that a polynomial of degree 3 or 4 is sufficient and aligns with the principle of parsimony. Although degree 10 minimized the CV error, the ANOVA comparison argues against that unnecessary complexity.
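As a cross-check, because poly() uses an orthogonal basis, the sequential ANOVA F-statistics can also be read off a single degree-5 fit: the squared t-statistics reproduce the F column above. A short sketch (output not shown here):
# Sketch: with orthogonal polynomials, t^2 from one degree-5 fit equals the ANOVA F-statistics
fit5 <- lm(wage ~ poly(age, 5), data = wage_data)
round((coef(summary(fit5))[-1, "t value"])^2, 3)  # compare with the F column above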
Plot of the best-fit polynomial:
age_range <- range(wage_data$age)
age_grid <- seq(age_range[1], age_range[2], length.out = 100)
predicted <- predict(lm(wage ~ poly(age, best_degree), data = wage_data), newdata = data.frame(age = age_grid))
plot(wage_data$age, wage_data$wage, col = "gray", main = paste("Degree", best_degree, "Polynomial Fit"), xlab = "Age", ylab = "Wage")
lines(age_grid, predicted, col = "blue", lwd = 2)
The degree-10 polynomial fit captures the non-linear wage trend quite well, especially around ages 30–60. However, the wiggles toward the extremes (young and old ages) suggest overfitting, reinforcing that a lower-degree polynomial may be preferable for interpretability.
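To visualize that concern, a lower-degree fit can be overlaid on the same figure. A sketch to append to the plotting chunk above (degree 4 is taken from the ANOVA comparison, not from the CV minimum):
# Sketch: overlay a degree-4 fit (dashed red) on the degree-10 plot for comparison
pred4 <- predict(lm(wage ~ poly(age, 4), data = wage_data),
                 newdata = data.frame(age = age_grid))
lines(age_grid, pred4, col = "red", lwd = 2, lty = 2)
legend("topright", legend = c(paste("Degree", best_degree), "Degree 4"),
       col = c("blue", "red"), lty = c(1, 2), lwd = 2)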
# Try cuts from 2 to 10
cv_step_errors <- rep(NA, 9)
for (i in 2:10) {
  wage_data$age_cut <- cut(wage_data$age, i)
  fit <- glm(wage ~ age_cut, data = wage_data)
  cv_step_errors[i - 1] <- cv.glm(wage_data, fit, K = 10)$delta[1]
}
# Plot CV error vs. number of cuts
plot(2:10, cv_step_errors, type = "b", xlab = "Number of Cuts", ylab = "CV Error")
best_cuts <- which.min(cv_step_errors) + 1
best_cuts
## [1] 8
Interpretation: The CV error decreases steadily from 2 to 8 cuts and then flattens, with 8 cuts giving the lowest error. This suggests that age groups can be discretized into 8 bins for optimal predictive performance using a step function.
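For reference, the bin boundaries that cut() chooses (equal-width intervals over the observed age range) can be inspected directly; a quick sketch:
# Sketch: the 8 equal-width age bins and the number of observations in each
table(cut(wage_data$age, best_cuts))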
Plot of the step function fit:
wage_data$age_cut <- cut(wage_data$age, best_cuts)
fit_step <- lm(wage ~ age_cut, data = wage_data)
age_ordered <- order(wage_data$age)
plot(wage_data$age, wage_data$wage, col = "gray", xlab = "Age", ylab = "Wage", main = paste("Step Function Fit with", best_cuts, "Cuts"))
lines(wage_data$age[age_ordered], fitted(fit_step)[age_ordered], col = "red", lwd = 2)
Overall Interpretation: The red step line shows a piecewise constant fit of wage over age brackets. The model captures the overall trend effectively in an interpretable way. However, the lack of smoothness means this approach is less suited for capturing gradual changes compared to polynomials.
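If a smoother alternative to the piecewise-constant fit is wanted, a natural cubic spline is one option. A minimal sketch, using df = 4 as an illustrative (untuned) choice:
# Sketch: natural cubic spline fit for wage vs age; df = 4 is illustrative, not CV-selected
library(splines)
fit_ns <- lm(wage ~ ns(age, df = 4), data = wage_data)
pred_ns <- predict(fit_ns, newdata = data.frame(age = age_grid))
plot(wage_data$age, wage_data$wage, col = "gray", xlab = "Age", ylab = "Wage",
     main = "Natural Spline Fit (df = 4)")
lines(age_grid, pred_ns, col = "blue", lwd = 2)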
In this question, we explore additional predictors such as maritl (marital status), jobclass (job classification), and education, and analyze their relationship with wage. We use non-linear and flexible models to capture complex patterns, if any.
# Visualize wage by marital status
library(ggplot2)
ggplot(wage_data, aes(x = maritl, y = wage)) +
  geom_boxplot(fill = "lightblue") +
  theme_minimal() +
  labs(title = "Wage by Marital Status", x = "Marital Status", y = "Wage")
# Fit model with maritl
fit_maritl <- lm(wage ~ maritl, data = wage_data)
summary(fit_maritl)
##
## Call:
## lm(formula = wage ~ maritl, data = wage_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -98.775 -24.788 -4.754 15.845 221.595
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 92.735 1.582 58.608 < 2e-16 ***
## maritl2. Married 26.126 1.813 14.413 < 2e-16 ***
## maritl3. Widowed 6.804 9.375 0.726 0.46804
## maritl4. Divorced 10.425 3.234 3.224 0.00128 **
## maritl5. Separated 8.481 5.657 1.499 0.13392
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40.28 on 2995 degrees of freedom
## Multiple R-squared: 0.06954, Adjusted R-squared: 0.0683
## F-statistic: 55.96 on 4 and 2995 DF, p-value: < 2.2e-16
Interpretation: From the boxplot, married individuals tend to have the highest median wage, while those who have never married show the lowest. Spread and outliers are visible in all categories. According to the linear model summary, “Married” and “Divorced” have significant positive coefficients relative to the baseline (“Never Married”), indicating these groups earn more on average, whereas “Widowed” and “Separated” are not statistically significant in this single-predictor model.
ggplot(wage_data, aes(x = jobclass, y = wage)) +
  geom_boxplot(fill = "lightgreen") +
  theme_minimal() +
  labs(title = "Wage by Job Class", x = "Job Class", y = "Wage")
# Fit model with jobclass
fit_jobclass <- lm(wage ~ jobclass, data = wage_data)
summary(fit_jobclass)
##
## Call:
## lm(formula = wage ~ jobclass, data = wage_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -100.507 -25.362 -6.117 15.697 197.750
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 103.321 1.039 99.43 <2e-16 ***
## jobclass2. Information 17.272 1.492 11.58 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40.83 on 2998 degrees of freedom
## Multiple R-squared: 0.04281, Adjusted R-squared: 0.04249
## F-statistic: 134.1 on 1 and 2998 DF, p-value: < 2.2e-16
Interpretation: The boxplot shows that workers in the “Information” job class earn higher wages than those in “Industrial”. The linear model confirms this: the “Information” coefficient is positive and statistically significant. Note that this simple model does not yet control for other factors; that is done in the combined model below.
library(splines)
# Education is a categorical factor, so we fit it directly (no spline basis needed)
fit_edu <- lm(wage ~ education, data = wage_data)
ggplot(wage_data, aes(x = education, y = wage)) +
  geom_boxplot(fill = "salmon") +
  theme_minimal() +
  labs(title = "Wage by Education", x = "Education Level", y = "Wage")
Interpretation: There is a strong positive relationship between education and wage. As education level increases from less than a high-school diploma to an advanced degree, both the median wage and the wage variability increase. The difference between adjacent education levels (e.g., HS Grad vs. College Grad) is substantial, and it is largest for those with advanced degrees.
# Combined model
fit_combined <- lm(wage ~ maritl + jobclass + education, data = wage_data)
summary(fit_combined)
##
## Call:
## lm(formula = wage ~ maritl + jobclass + education, data = wage_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -109.75 -19.81 -2.74 14.85 219.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.632 2.502 26.635 < 2e-16 ***
## maritl2. Married 22.332 1.594 14.006 < 2e-16 ***
## maritl3. Widowed 7.309 8.214 0.890 0.373597
## maritl4. Divorced 9.723 2.833 3.432 0.000607 ***
## maritl5. Separated 15.783 4.972 3.174 0.001518 **
## jobclass2. Information 5.199 1.355 3.836 0.000127 ***
## education2. HS Grad 11.268 2.440 4.618 4.04e-06 ***
## education3. Some College 23.128 2.579 8.966 < 2e-16 ***
## education4. College Grad 37.956 2.584 14.689 < 2e-16 ***
## education5. Advanced Degree 61.881 2.840 21.793 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.27 on 2990 degrees of freedom
## Multiple R-squared: 0.2876, Adjusted R-squared: 0.2855
## F-statistic: 134.1 on 9 and 2990 DF, p-value: < 2.2e-16
Interpretation: This additive model allows us to control for the effect of each categorical variable while examining the others. Results show:
Marital Status: “Married”, “Divorced”, and “Separated” are all associated with significantly higher wages relative to “Never Married”. “Widowed” is not statistically significant.
Job Class: Being in the “Information” class leads to a significant wage increase.
Education: A clear upward trend in wages is observed with increasing education levels. The coefficients for “Some College”, “College Grad”, and “Advanced Degree” are especially large and statistically significant.
This model confirms that each predictor is associated with wage differences even after adjusting for the others, and that education shows the strongest association among them.
By including these categorical variables directly in linear models and visualizing them with boxplots, we capture group-level differences in wage determination, which act as non-linear (piecewise-constant) effects.
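A natural extension, sketched below rather than fitted in the output above, is to add a smooth term for age alongside the categorical predictors and test whether it improves on the purely categorical model (ns() with df = 4 is an illustrative choice):
# Sketch: augment the additive model with a natural-spline term for age and
# compare the nested models with an F-test
fit_flex <- lm(wage ~ maritl + jobclass + education + ns(age, df = 4),
               data = wage_data)
anova(fit_combined, fit_flex)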
library(tidyverse)
library(splines)
library(boot)
boston_data <- read.csv("/Users/saransh/Downloads/Statistical_Learning_Resources/Boston.csv")
head(boston_data)
## X crim zn indus chas nox rm age dis rad tax ptratio lstat medv
## 1 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 4.98 24.0
## 2 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 9.14 21.6
## 3 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 4.03 34.7
## 4 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 2.94 33.4
## 5 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 5.33 36.2
## 6 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 5.21 28.7
fit_poly3 <- lm(nox ~ poly(dis, 3), data = boston_data)
summary(fit_poly3)
##
## Call:
## lm(formula = nox ~ poly(dis, 3), data = boston_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.121130 -0.040619 -0.009738 0.023385 0.194904
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.554695 0.002759 201.021 < 2e-16 ***
## poly(dis, 3)1 -2.003096 0.062071 -32.271 < 2e-16 ***
## poly(dis, 3)2 0.856330 0.062071 13.796 < 2e-16 ***
## poly(dis, 3)3 -0.318049 0.062071 -5.124 4.27e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06207 on 502 degrees of freedom
## Multiple R-squared: 0.7148, Adjusted R-squared: 0.7131
## F-statistic: 419.3 on 3 and 502 DF, p-value: < 2.2e-16
# Plot the data and cubic polynomial fit
dis_grid <- seq(min(boston_data$dis), max(boston_data$dis), length.out = 100)
pred_poly3 <- predict(fit_poly3, newdata = data.frame(dis = dis_grid))
plot(boston_data$dis, boston_data$nox, col = "darkgray", xlab = "dis", ylab = "nox", main = "Cubic Polynomial Fit")
lines(dis_grid, pred_poly3, col = "blue", lwd = 2)
Interpretation: The cubic polynomial regression captures a smooth and non-linear inverse relationship between dis (distance to employment centers) and nox (pollution). As distance increases, pollution levels tend to decrease, flattening out after ~5 units of dis.
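Approximate 95% confidence bands help show where the cubic fit is well determined; they widen at large dis, where observations are sparse. A sketch that adds the bands to the plotting chunk above (not part of the original figure):
# Sketch: pointwise bands at roughly +/- 2 standard errors around the cubic fit
pred_se <- predict(fit_poly3, newdata = data.frame(dis = dis_grid), se.fit = TRUE)
lines(dis_grid, pred_se$fit + 2 * pred_se$se.fit, col = "blue", lty = 2)
lines(dis_grid, pred_se$fit - 2 * pred_se$se.fit, col = "blue", lty = 2)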
rss_poly <- rep(NA, 10)
for (d in 1:10) {
  fit <- lm(nox ~ poly(dis, d), data = boston_data)
  rss_poly[d] <- sum(residuals(fit)^2)
}
plot(1:10, rss_poly, type = "b", xlab = "Degree", ylab = "RSS", main = "Polynomial Degree vs RSS")
rss_poly
## [1] 2.768563 2.035262 1.934107 1.932981 1.915290 1.878257 1.849484 1.835630
## [9] 1.833331 1.832171
Interpretation: RSS decreases significantly from degree 1 to 3 and then stabilizes. This suggests that increasing polynomial degree beyond 3 yields diminishing returns in reducing residual error. Degree 3 seems a reasonable choice based on this plot.
cv_errors <- rep(NA, 10)
for (d in 1:10) {
  fit <- glm(nox ~ poly(dis, d), data = boston_data)
  cv_errors[d] <- cv.glm(boston_data, fit, K = 10)$delta[1]
}
plot(1:10, cv_errors, type = "b", xlab = "Degree", ylab = "CV Error", main = "CV Error vs Polynomial Degree")
best_degree <- which.min(cv_errors)
best_degree
## [1] 4
Interpretation: Cross-validation selects degree 4 as the polynomial degree with the lowest estimated prediction error, although degrees 2 and 3 perform nearly as well. This is consistent with the RSS plot, which showed diminishing returns beyond degree 3, and it avoids the overfitting risk of very high-degree polynomials.
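Because the 10-fold estimates are noisy, it is worth checking which degrees come close to the minimum before committing to one; a quick sketch (results vary with the random folds):
# Sketch: CV errors by degree, and the degrees within ~1% of the best value
round(cv_errors, 5)
which(cv_errors <= 1.01 * min(cv_errors))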
fit_spline4 <- lm(nox ~ bs(dis, df = 4), data = boston_data)
summary(fit_spline4)
##
## Call:
## lm(formula = nox ~ bs(dis, df = 4), data = boston_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.124622 -0.039259 -0.008514 0.020850 0.193891
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.73447 0.01460 50.306 < 2e-16 ***
## bs(dis, df = 4)1 -0.05810 0.02186 -2.658 0.00812 **
## bs(dis, df = 4)2 -0.46356 0.02366 -19.596 < 2e-16 ***
## bs(dis, df = 4)3 -0.19979 0.04311 -4.634 4.58e-06 ***
## bs(dis, df = 4)4 -0.38881 0.04551 -8.544 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06195 on 501 degrees of freedom
## Multiple R-squared: 0.7164, Adjusted R-squared: 0.7142
## F-statistic: 316.5 on 4 and 501 DF, p-value: < 2.2e-16
pred_spline4 <- predict(fit_spline4, newdata = data.frame(dis = dis_grid))
plot(boston_data$dis, boston_data$nox, col = "gray", main = "Spline Fit (df = 4)", xlab = "dis", ylab = "nox")
lines(dis_grid, pred_spline4, col = "red", lwd = 2)
Interpretation: The spline fit with 4 degrees of freedom flexibly captures the non-linear trend, similar to the cubic polynomial. Knots are placed automatically based on quantiles of dis.
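The knot placement mentioned above can be verified directly: with cubic B-splines and no intercept term in the basis, df = 4 leaves one interior knot, which bs() places at the median of dis. A one-line sketch:
# Sketch: inspect the interior knot chosen by bs() for df = 4
attr(bs(boston_data$dis, df = 4), "knots")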
rss_spline <- rep(NA, 10)
plot(boston_data$dis, boston_data$nox, col = "lightgray", xlab = "dis", ylab = "nox", main = "Spline Fits (df = 3 to 10)")
for (df in 3:10) {
  fit <- lm(nox ~ bs(dis, df = df), data = boston_data)
  pred <- predict(fit, newdata = data.frame(dis = dis_grid))
  lines(dis_grid, pred, col = rainbow(8)[df - 2], lwd = 1.5)
  rss_spline[df] <- sum(residuals(fit)^2)
}
legend("topright", legend = paste("df =", 3:10), col = rainbow(8), lty = 1, lwd = 1.5)
rss_spline[3:10]
## [1] 1.934107 1.922775 1.840173 1.833966 1.829884 1.816995 1.825653 1.792535
Interpretation: The plots show that higher degrees of freedom result in more flexible fits. The RSS generally decreases with increasing degrees of freedom, but too high a df might lead to overfitting. From visual inspection, degrees 5–7 strike a good balance.
cv_spline_errors <- rep(NA, 10)
for (df in 3:10) {
  fit <- glm(nox ~ bs(dis, df = df), data = boston_data)
  cv_spline_errors[df] <- cv.glm(boston_data, fit, K = 10)$delta[1]
}
plot(3:10, cv_spline_errors[3:10], type = "b", xlab = "Degrees of Freedom", ylab = "CV Error", main = "Spline CV Error vs df")
best_df_spline <- which.min(cv_spline_errors[3:10]) + 2
best_df_spline
## [1] 5
Conclusion: Cross-validation identifies df = 5 as the best spline fit, minimizing the estimated prediction error. This agrees with the visual inspection above: moderate flexibility (df around 5–7) captures the trend in nox versus dis without the overfitting risk of the most flexible fits.
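As a final sanity check, the selected spline can be compared head-to-head with the cubic polynomial using the same 10-fold procedure; a sketch (expect fold-to-fold variation, and bs() may warn about test observations beyond the boundary knots):
# Sketch: 10-fold CV of the df = 5 spline vs the cubic polynomial
set.seed(1)
cv.glm(boston_data, glm(nox ~ bs(dis, df = 5), data = boston_data), K = 10)$delta[1]
cv.glm(boston_data, glm(nox ~ poly(dis, 3), data = boston_data), K = 10)$delta[1]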
set.seed(123)
n <- 100
x1 <- rnorm(n, mean = 0, sd = sqrt(2))
x2 <- rnorm(n, mean = 0, sd = sqrt(2))
epsilon <- rnorm(n, mean = 0, sd = 1)
# True coefficients
beta0_true <- 1.5
beta1_true <- 3
beta2_true <- -4.5
# Generate response
y <- beta0_true + beta1_true * x1 + beta2_true * x2 + epsilon
beta1 <- 0 # Starting value for beta1
iterations <- 1000
beta0_vec <- numeric(iterations)
beta1_vec <- numeric(iterations)
beta2_vec <- numeric(iterations)
for (i in 1:iterations) {
  # Holding beta1 fixed, regress the partial residual on x2 to update beta2
  a <- y - beta1 * x1
  beta2 <- lm(a ~ x2)$coef[2]
  # Holding beta2 fixed, regress the partial residual on x1 to update beta1
  a <- y - beta2 * x2
  beta1 <- lm(a ~ x1)$coef[2]
  # Update the intercept given the current slope estimates
  beta0 <- mean(y - beta1 * x1 - beta2 * x2)
  beta0_vec[i] <- beta0
  beta1_vec[i] <- beta1
  beta2_vec[i] <- beta2
}
steps <- c(1:5, seq(50, 1000, by = 50))
result_table <- data.frame(
  Iteration = steps,
  Beta0 = round(beta0_vec[steps], 4),
  Beta1 = round(beta1_vec[steps], 4),
  Beta2 = round(beta2_vec[steps], 4)
)
knitr::kable(result_table, caption = "Backfitting Coefficient Estimates at Selected Iterations")
Iteration | Beta0 | Beta1 | Beta2 |
---|---|---|---|
1 | 1.6153 | 2.8987 | -4.6190 |
2 | 1.6350 | 2.9058 | -4.4835 |
3 | 1.6351 | 2.9058 | -4.4832 |
4 | 1.6351 | 2.9058 | -4.4832 |
5 | 1.6351 | 2.9058 | -4.4832 |
50 | 1.6351 | 2.9058 | -4.4832 |
100 | 1.6351 | 2.9058 | -4.4832 |
150 | 1.6351 | 2.9058 | -4.4832 |
200 | 1.6351 | 2.9058 | -4.4832 |
250 | 1.6351 | 2.9058 | -4.4832 |
300 | 1.6351 | 2.9058 | -4.4832 |
350 | 1.6351 | 2.9058 | -4.4832 |
400 | 1.6351 | 2.9058 | -4.4832 |
450 | 1.6351 | 2.9058 | -4.4832 |
500 | 1.6351 | 2.9058 | -4.4832 |
550 | 1.6351 | 2.9058 | -4.4832 |
600 | 1.6351 | 2.9058 | -4.4832 |
650 | 1.6351 | 2.9058 | -4.4832 |
700 | 1.6351 | 2.9058 | -4.4832 |
750 | 1.6351 | 2.9058 | -4.4832 |
800 | 1.6351 | 2.9058 | -4.4832 |
850 | 1.6351 | 2.9058 | -4.4832 |
900 | 1.6351 | 2.9058 | -4.4832 |
950 | 1.6351 | 2.9058 | -4.4832 |
1000 | 1.6351 | 2.9058 | -4.4832 |
# Plot coefficient estimates and overlay multiple regression coefficients
plot(1:iterations, beta0_vec, type = "l", col = "blue",
     ylim = range(c(beta0_vec, beta1_vec, beta2_vec)),
     xlab = "Iteration", ylab = "Coefficient Estimate",
     main = "Backfitting Estimates vs LM Coefficients")
lines(1:iterations, beta1_vec, col = "red")
lines(1:iterations, beta2_vec, col = "darkgreen")
# Fit the full multiple linear model
lm_fit <- lm(y ~ x1 + x2)
# Overlay horizontal lines for lm() coefficients
abline(h = coef(lm_fit)[1], col = "blue", lty = 2)
abline(h = coef(lm_fit)[2], col = "red", lty = 2)
abline(h = coef(lm_fit)[3], col = "darkgreen", lty = 2)
# Add legend
legend("bottomright",
legend = c("Beta0", "Beta1", "Beta2", "LM Beta0", "LM Beta1", "LM Beta2"),
col = c("blue", "red", "darkgreen", "blue", "red", "darkgreen"),
lty = c(1, 1, 1, 2, 2, 2))
Convergence Observed: The coefficients stabilize very quickly; the table shows the estimates are essentially unchanged after the third iteration.
Backfitting Effectiveness: The backfitting estimates align very closely with those from the lm() function, validating the iterative method for estimating coefficients using only simple regressions.
Final Coefficient Estimates (around iteration 1000): β₀ ≈ 1.6351, β₁ ≈ 2.9058, β₂ ≈ -4.4832
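A compact numeric comparison of the converged backfitting values with the lm() coefficients (a sketch using the objects defined above):
# Sketch: converged backfitting estimates vs the multiple-regression coefficients
rbind(backfit = c(beta0_vec[iterations], beta1_vec[iterations], beta2_vec[iterations]),
      lm = coef(lm_fit))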