In this exercise, we will generate simulated data, and will then use this data to perform best subset selection.
set.seed(290924)
n = 100
X = rnorm(n)
e = rnorm(n)
beta <- c(2,1.5,-1,0.5)
Y <- beta[1] + beta[2] * X + beta[3] * X^2 + beta[4] * X^3 + e
library(leaps)
## Warning: package 'leaps' was built under R version 4.3.3
dat <- data.frame(X, Y)
best_subset = regsubsets(Y ~ poly(X, 10, raw = TRUE), data = dat)
summary(best_subset)
## Subset selection object
## Call: regsubsets.formula(Y ~ poly(X, 10, raw = TRUE), data = dat)
## 10 Variables (and intercept)
## Forced in Forced out
## poly(X, 10, raw = TRUE)1 FALSE FALSE
## poly(X, 10, raw = TRUE)2 FALSE FALSE
## poly(X, 10, raw = TRUE)3 FALSE FALSE
## poly(X, 10, raw = TRUE)4 FALSE FALSE
## poly(X, 10, raw = TRUE)5 FALSE FALSE
## poly(X, 10, raw = TRUE)6 FALSE FALSE
## poly(X, 10, raw = TRUE)7 FALSE FALSE
## poly(X, 10, raw = TRUE)8 FALSE FALSE
## poly(X, 10, raw = TRUE)9 FALSE FALSE
## poly(X, 10, raw = TRUE)10 FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## poly(X, 10, raw = TRUE)1 poly(X, 10, raw = TRUE)2
## 1 ( 1 ) " " " "
## 2 ( 1 ) "*" " "
## 3 ( 1 ) "*" "*"
## 4 ( 1 ) "*" "*"
## 5 ( 1 ) "*" " "
## 6 ( 1 ) "*" " "
## 7 ( 1 ) "*" " "
## 8 ( 1 ) "*" " "
## poly(X, 10, raw = TRUE)3 poly(X, 10, raw = TRUE)4
## 1 ( 1 ) "*" " "
## 2 ( 1 ) " " "*"
## 3 ( 1 ) "*" " "
## 4 ( 1 ) "*" "*"
## 5 ( 1 ) "*" "*"
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
## poly(X, 10, raw = TRUE)5 poly(X, 10, raw = TRUE)6
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " "*"
## 6 ( 1 ) "*" " "
## 7 ( 1 ) "*" " "
## 8 ( 1 ) " " "*"
## poly(X, 10, raw = TRUE)7 poly(X, 10, raw = TRUE)8
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) " " " "
## 7 ( 1 ) " " "*"
## 8 ( 1 ) "*" "*"
## poly(X, 10, raw = TRUE)9 poly(X, 10, raw = TRUE)10
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) "*" " "
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
# Plot results
par(mfrow = c(2, 2))
plot(summary(best_subset)$cp, xlab = "Number of Predictors", ylab = "Cp", type = "b")
plot(summary(best_subset)$bic, xlab = "Number of Predictors", ylab = "BIC", type = "b")
plot(summary(best_subset)$adjr2, xlab = "Number of Predictors", ylab = "Adjusted R^2", type = "b")
Based on Cp and BIC, the best model includes 3 predictors, as adding more does not improve the model’s quality. A lower Cp indicates that the model has a good balance of fit and complexity, minimizing prediction error, while a lower BIC suggests that the model is simpler without sacrificing explanatory power. Although the Adjusted R2 technically increases with 4 and 6 predictors, the improvement beyond 3 predictors is minimal, indicating diminishing returns. The 3-predictor model strikes an optimal balance because it captures most of the variance while avoiding unnecessary complexity. Additional predictors may slightly increase R2, but they risk overfitting, adding noise rather than meaningful information. Thus, 3 predictors are considered optimal, providing a simpler yet effective model. In conclusion, the analysis indicates that the best model includes 3 predictors.
coef(best_subset, id = 3) # id = 3 specifies the model with 3 predictors
## (Intercept) poly(X, 10, raw = TRUE)1 poly(X, 10, raw = TRUE)2
## 1.8276879 1.3563268 -0.9147830
## poly(X, 10, raw = TRUE)3
## 0.5279595
The best subset model is ŷ = 1.71593162 - 1.70014172X2 - 0.01355144X6 + 0.19680000X7 The result is far from the real Y.
forward_best <- regsubsets(Y ~ poly(X, 10, raw = TRUE), data = dat, method = "forward")
summary(forward_best)
## Subset selection object
## Call: regsubsets.formula(Y ~ poly(X, 10, raw = TRUE), data = dat, method = "forward")
## 10 Variables (and intercept)
## Forced in Forced out
## poly(X, 10, raw = TRUE)1 FALSE FALSE
## poly(X, 10, raw = TRUE)2 FALSE FALSE
## poly(X, 10, raw = TRUE)3 FALSE FALSE
## poly(X, 10, raw = TRUE)4 FALSE FALSE
## poly(X, 10, raw = TRUE)5 FALSE FALSE
## poly(X, 10, raw = TRUE)6 FALSE FALSE
## poly(X, 10, raw = TRUE)7 FALSE FALSE
## poly(X, 10, raw = TRUE)8 FALSE FALSE
## poly(X, 10, raw = TRUE)9 FALSE FALSE
## poly(X, 10, raw = TRUE)10 FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: forward
## poly(X, 10, raw = TRUE)1 poly(X, 10, raw = TRUE)2
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " "*"
## 3 ( 1 ) "*" "*"
## 4 ( 1 ) "*" "*"
## 5 ( 1 ) "*" "*"
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
## poly(X, 10, raw = TRUE)3 poly(X, 10, raw = TRUE)4
## 1 ( 1 ) "*" " "
## 2 ( 1 ) "*" " "
## 3 ( 1 ) "*" " "
## 4 ( 1 ) "*" "*"
## 5 ( 1 ) "*" "*"
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
## poly(X, 10, raw = TRUE)5 poly(X, 10, raw = TRUE)6
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " "*"
## 6 ( 1 ) " " "*"
## 7 ( 1 ) " " "*"
## 8 ( 1 ) "*" "*"
## poly(X, 10, raw = TRUE)7 poly(X, 10, raw = TRUE)8
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) " " " "
## 7 ( 1 ) " " " "
## 8 ( 1 ) " " " "
## poly(X, 10, raw = TRUE)9 poly(X, 10, raw = TRUE)10
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) "*" " "
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
# Plot results
par(mfrow = c(2, 2))
plot(summary(forward_best)$cp, xlab = "Number of Predictors", ylab = "Cp", type = "b")
plot(summary(forward_best)$bic, xlab = "Number of Predictors", ylab = "BIC", type = "b")
plot(summary(forward_best)$adjr2, xlab = "Number of Predictors", ylab = "Adjusted R^2", type = "b")
backward_best <- regsubsets(Y ~ poly(X, 10, raw = TRUE), data = dat, method = "backward")
summary(backward_best)
## Subset selection object
## Call: regsubsets.formula(Y ~ poly(X, 10, raw = TRUE), data = dat, method = "backward")
## 10 Variables (and intercept)
## Forced in Forced out
## poly(X, 10, raw = TRUE)1 FALSE FALSE
## poly(X, 10, raw = TRUE)2 FALSE FALSE
## poly(X, 10, raw = TRUE)3 FALSE FALSE
## poly(X, 10, raw = TRUE)4 FALSE FALSE
## poly(X, 10, raw = TRUE)5 FALSE FALSE
## poly(X, 10, raw = TRUE)6 FALSE FALSE
## poly(X, 10, raw = TRUE)7 FALSE FALSE
## poly(X, 10, raw = TRUE)8 FALSE FALSE
## poly(X, 10, raw = TRUE)9 FALSE FALSE
## poly(X, 10, raw = TRUE)10 FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: backward
## poly(X, 10, raw = TRUE)1 poly(X, 10, raw = TRUE)2
## 1 ( 1 ) "*" " "
## 2 ( 1 ) "*" " "
## 3 ( 1 ) "*" " "
## 4 ( 1 ) "*" " "
## 5 ( 1 ) "*" " "
## 6 ( 1 ) "*" " "
## 7 ( 1 ) "*" " "
## 8 ( 1 ) "*" " "
## poly(X, 10, raw = TRUE)3 poly(X, 10, raw = TRUE)4
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " "*"
## 3 ( 1 ) "*" "*"
## 4 ( 1 ) "*" "*"
## 5 ( 1 ) "*" "*"
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
## poly(X, 10, raw = TRUE)5 poly(X, 10, raw = TRUE)6
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) " " " "
## 7 ( 1 ) " " " "
## 8 ( 1 ) " " "*"
## poly(X, 10, raw = TRUE)7 poly(X, 10, raw = TRUE)8
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) "*" " "
## 5 ( 1 ) "*" " "
## 6 ( 1 ) "*" " "
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
## poly(X, 10, raw = TRUE)9 poly(X, 10, raw = TRUE)10
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) "*" " "
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
# Plot results
par(mfrow = c(2, 2))
plot(summary(backward_best)$cp, xlab = "Number of Predictors", ylab = "Cp", type = "b")
plot(summary(backward_best)$bic, xlab = "Number of Predictors", ylab = "BIC", type = "b")
plot(summary(backward_best)$adjr2, xlab = "Number of Predictors", ylab = "Adjusted R^2", type = "b")
The plots indicate that the results are consistent with those observed in part (c) for both forward and backward selection methods. Consequently, the model with three predictors remains the optimal choice.
coef(forward_best, id = 3)
## (Intercept) poly(X, 10, raw = TRUE)1 poly(X, 10, raw = TRUE)2
## 1.8276879 1.3563268 -0.9147830
## poly(X, 10, raw = TRUE)3
## 0.5279595
coef(backward_best, id = 3)
## (Intercept) poly(X, 10, raw = TRUE)1 poly(X, 10, raw = TRUE)3
## 1.3817022 1.8065579 0.3078184
## poly(X, 10, raw = TRUE)4
## -0.1877618
The best model from forward selections is ŷ = 1.8276879 + 1.3563268X - 0.9147830X2 + 0.5279595X3 and the best model from backward selection is ŷ = 1.3817022 + 1.8065579X + 0.3078184X3 - 0.1877618X4
The best model from forward selection in this case is the model that is closer to the real model.
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.3.3
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 4.3.2
## Loaded glmnet 4.1-8
res <- cv.glmnet(poly(dat$X, 10, raw = TRUE), dat$Y, alpha = 1)
opt <- res$lambda.min
plot(res)
cat("\nThe coefficient estimates of the lasso model are\n")
##
## The coefficient estimates of the lasso model are
out <- glmnet(poly(dat$X, 10, raw = TRUE), dat$Y, alpha = 1, lambda = res$lambda.min)
predict(out, type = "coefficients", s = opt)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 1.68974017
## 1 1.49125931
## 2 -0.62638921
## 3 0.44163473
## 4 -0.06298481
## 5 .
## 6 .
## 7 .
## 8 .
## 9 .
## 10 .
cat("\nThe optimal λ is", opt)
##
## The optimal λ is 0.0506859
The results suggest that a simpler model, utilizing only the first four polynomial terms, sufficiently explains the variation in the data. The Lasso model’s ability to perform feature selection by shrinking irrelevant coefficients to zero is beneficial here, as it simplifies the model and potentially enhances interpretability without sacrificing predictive power.
The cross-validation error plot versus log(λ) reveals that the Mean-Squared Error (MSE) remains consistently low and stable for smaller λ values but begins to rise as λ increases. This trend is characteristic of Lasso regression, where higher λ values apply stronger penalization, causing more coefficients to shrink to zero. Consequently, this excessive penalization results in higher error due to underfitting.
In summary, the cross-validation plot demonstrates that as λ becomes too large, the model suffers from underfitting. Selecting the optimal λ value of 0.02407991 achieves an effective balance between model simplicity and predictive accuracy.
dat$Y <- 2 - 9 * X^2 + 0.9 * X^7 + e
best_subset = regsubsets(Y ~ poly(X, 10, raw = TRUE), data = dat)
summary(best_subset)
## Subset selection object
## Call: regsubsets.formula(Y ~ poly(X, 10, raw = TRUE), data = dat)
## 10 Variables (and intercept)
## Forced in Forced out
## poly(X, 10, raw = TRUE)1 FALSE FALSE
## poly(X, 10, raw = TRUE)2 FALSE FALSE
## poly(X, 10, raw = TRUE)3 FALSE FALSE
## poly(X, 10, raw = TRUE)4 FALSE FALSE
## poly(X, 10, raw = TRUE)5 FALSE FALSE
## poly(X, 10, raw = TRUE)6 FALSE FALSE
## poly(X, 10, raw = TRUE)7 FALSE FALSE
## poly(X, 10, raw = TRUE)8 FALSE FALSE
## poly(X, 10, raw = TRUE)9 FALSE FALSE
## poly(X, 10, raw = TRUE)10 FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## poly(X, 10, raw = TRUE)1 poly(X, 10, raw = TRUE)2
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " "*"
## 3 ( 1 ) " " "*"
## 4 ( 1 ) " " "*"
## 5 ( 1 ) " " "*"
## 6 ( 1 ) " " "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
## poly(X, 10, raw = TRUE)3 poly(X, 10, raw = TRUE)4
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " "*"
## 5 ( 1 ) " " "*"
## 6 ( 1 ) " " " "
## 7 ( 1 ) " " " "
## 8 ( 1 ) "*" "*"
## poly(X, 10, raw = TRUE)5 poly(X, 10, raw = TRUE)6
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " "*"
## 4 ( 1 ) "*" " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) "*" " "
## 7 ( 1 ) "*" " "
## 8 ( 1 ) " " " "
## poly(X, 10, raw = TRUE)7 poly(X, 10, raw = TRUE)8
## 1 ( 1 ) "*" " "
## 2 ( 1 ) "*" " "
## 3 ( 1 ) "*" " "
## 4 ( 1 ) "*" " "
## 5 ( 1 ) "*" " "
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
## poly(X, 10, raw = TRUE)9 poly(X, 10, raw = TRUE)10
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) "*" "*"
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
# Plot results
par(mfrow = c(2, 2))
plot(summary(best_subset)$cp, xlab = "Number of Predictors", ylab = "Cp", type = "b")
plot(summary(best_subset)$bic, xlab = "Number of Predictors", ylab = "BIC", type = "b")
plot(summary(best_subset)$adjr2, xlab = "Number of Predictors", ylab = "Adjusted R^2", type = "b")
When applying best subset selection, the model identifies the optimal set of predictors as the two most relevant terms, based on the evaluation plots.
res <- cv.glmnet(poly(dat$X, 10, raw = TRUE), dat$Y, alpha = 1)
opt <- res$lambda.min
plot(res)
cat("\nThe coefficient estimates of the lasso model are\n")
##
## The coefficient estimates of the lasso model are
out <- glmnet(poly(dat$X, 10, raw = TRUE), dat$Y, alpha = 1, lambda = res$lambda.min)
predict(out, type = "coefficients", s = opt)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -2.3539249
## 1 .
## 2 -3.0663527
## 3 .
## 4 .
## 5 0.3268560
## 6 -0.2961248
## 7 0.7630483
## 8 .
## 9 .
## 10 .
cat("\nThe optimal λ is", opt)
##
## The optimal λ is 4.744031
When fitting the Lasso model, it does not perfectly replicate the underlying simulated structure, as it retains coefficients for some powers of X that were not originally part of the real model
This question uses the variables dis (the weighted mean of distances to five Boston employment centers) and nox (nitrogen oxides concentration in parts per 10 million) from the Boston data. We will treat dis as the predictor and nox as the response.
library(MASS)
fit <- glm(nox ~ poly(dis, 3), data = Boston)
summary(fit)
##
## Call:
## glm(formula = nox ~ poly(dis, 3), data = Boston)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.554695 0.002759 201.021 < 2e-16 ***
## poly(dis, 3)1 -2.003096 0.062071 -32.271 < 2e-16 ***
## poly(dis, 3)2 0.856330 0.062071 13.796 < 2e-16 ***
## poly(dis, 3)3 -0.318049 0.062071 -5.124 4.27e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.003852802)
##
## Null deviance: 6.7810 on 505 degrees of freedom
## Residual deviance: 1.9341 on 502 degrees of freedom
## AIC: -1370.9
##
## Number of Fisher Scoring iterations: 2
This regression output summarizes the results of a Generalized Linear Model (GLM) that predicts nitrogen oxide concentration (nox) based on a third-degree polynomial of the distance (dis) variable from the Boston dataset. By fitting a cubic polynomial for dis, the model aims to capture potential nonlinear relationships between dis and nox.
The coefficient estimates reveal significant contributions from each term in the polynomial expansion. The intercept is estimated at 0.5547 with a highly significant p-value (p < 2e-16), indicating a strong baseline level of nox concentration when dis is at its mean. The first-degree term of dis has an estimated coefficient of -2.0031, also highly significant, suggesting an inverse relationship between dis and nox. This negative coefficient implies that, on average, as distance increases, nitrogen oxide concentration decreases. The second-degree term for dis has a positive coefficient of 0.8563, which, along with its significance (p < 2e-16), implies a curvilinear adjustment to the relationship, likely capturing a bending effect as dis increases. The third-degree term has a smaller negative coefficient of -0.3180 but remains significant (p < 0.001), fine-tuning the model with a third-order adjustment that further captures the complexity in the dis-nox relationship.
In terms of model fit, the null deviance is 6.7810, which reflects the total variance in nox without any predictors. The model’s residual deviance drops to 1.9341, indicating a substantial reduction in unexplained variance by including the polynomial terms of dis. This improvement suggests that the cubic polynomial is effective at explaining variations in nitrogen oxide concentration. Additionally, the model’s AIC value of -1370.9 supports the quality of fit; lower AIC values generally indicate better model performance for similar types of models, and this negative AIC is suggestive of a strong fit.
Overall, this model highlights a nonlinear relationship between distance and nitrogen oxide concentration, where higher distances are generally associated with lower nox levels. However, the relationship is complex, with each polynomial term adding a layer of adjustment to capture the nuances in the data. The significant contributions of each term demonstrate that dis is a strong predictor of nox, and the model’s fit statistics indicate that it explains a substantial portion of the variation in nox.
library(scales)
## Warning: package 'scales' was built under R version 4.3.3
plot(nox ~ dis, data = Boston, col = alpha("palevioletred1", 0.4), pch = 19)
x <- seq(min(Boston$dis), max(Boston$dis), length.out = 1000)
lines(x, predict(fit, data.frame(dis = x)), col = "red", lty = 2)
The plot displays the relationship between the predictor dis (distance) and the response variable nox (nitrogen oxide concentration), along with a fitted curve from the polynomial regression model. The scatterplot shows an overall negative trend: as distance (dis) increases, nitrogen oxide concentration (nox) tends to decrease.
The fitted curve, shown in red, captures a nonlinear pattern, initially steep and then flattening out as dis increases, aligning with the polynomial regression model of degree 3 used in the analysis. This curvature effectively models the diminishing effect of distance on nitrogen oxide levels, where concentrations drop sharply at shorter distances and stabilize at larger distances.
The curve confirms that a cubic polynomial is appropriate for modeling this relationship, capturing the observed data trend well without significant overfitting. This pattern supports the interpretation that distance from specific sources (such as industrial or traffic-related pollution sources) has a decreasing impact on nitrogen oxide concentration as one moves further away.
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.3.3
##
## Attaching package: 'tidyr'
## The following objects are masked from 'package:Matrix':
##
## expand, pack, unpack
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
fits <- lapply(1:10, function(i) glm(nox ~ poly(dis, i), data = Boston))
x <- seq(min(Boston$dis), max(Boston$dis), length.out=1000)
pred <- data.frame(lapply(fits, function(fit) predict(fit, data.frame(dis = x))))
colnames(pred) <- 1:10
pred$x <- x
pred <- pivot_longer(pred, !x)
ggplot(Boston, aes(dis, nox)) +
geom_point(color = alpha("palevioletred1", 0.4)) +
geom_line(data = pred, aes(x, value, color = name)) +
theme_bw()
# residual sum of squares
do.call(anova, fits)[, 2]
## [1] 2.768563 2.035262 1.934107 1.932981 1.915290 1.878257 1.849484 1.835630
## [9] 1.833331 1.832171
The Residual Sum of Squares (RSS) values show a decreasing trend as the model complexity increases, indicating improved fit with each additional term or predictor. Initially, there’s a notable drop in RSS from 2.768563 to around 1.9, suggesting a significant improvement in capturing the data’s variability. However, the reductions become progressively smaller, with the last values (1.835630 to 1.832171) showing minimal change. This pattern suggests diminishing returns; adding further complexity beyond this point likely yields little benefit, potentially indicating an optimal model around the lower RSS values.
library(boot)
## Warning: package 'boot' was built under R version 4.3.3
set.seed(290924)
res <- sapply(1:10, function(i) {
fit <- glm(nox ~ poly(dis, i), data = Boston)
cv.glm(Boston, fit, K = 10)$delta[1]
})
which.min(res)
## [1] 3
Based on cross-validation, the optimal polynomial degree for the model is 3. Increasing the degree beyond this tends to result in overfitting, as it captures more noise rather than improving the model’s predictive accuracy.
library(splines)
fit <- glm(nox ~ bs(dis, df = 4), data = Boston)
summary(fit)
##
## Call:
## glm(formula = nox ~ bs(dis, df = 4), data = Boston)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.73447 0.01460 50.306 < 2e-16 ***
## bs(dis, df = 4)1 -0.05810 0.02186 -2.658 0.00812 **
## bs(dis, df = 4)2 -0.46356 0.02366 -19.596 < 2e-16 ***
## bs(dis, df = 4)3 -0.19979 0.04311 -4.634 4.58e-06 ***
## bs(dis, df = 4)4 -0.38881 0.04551 -8.544 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.003837874)
##
## Null deviance: 6.7810 on 505 degrees of freedom
## Residual deviance: 1.9228 on 501 degrees of freedom
## AIC: -1371.9
##
## Number of Fisher Scoring iterations: 2
plot(nox ~ dis, data = Boston, col = alpha("palevioletred1", 0.4), pch = 19)
x <- seq(min(Boston$dis), max(Boston$dis), length.out = 1000)
lines(x, predict(fit, data.frame(dis = x)), col = "red", lty = 2)
The fitted regression spline model demonstrates a clear, non-linear relationship between nitrogen oxide concentration (nox) and the distance to employment centers (dis). As dis increases, nox tends to decrease, suggesting that areas closer to employment hubs have higher nitrogen oxide levels, likely due to higher traffic and industrial activity. The curve shows that this decline is steep initially, then gradually levels off as dis increases, reflecting a diminishing impact of distance on nox in areas further from the center.
The regression output further supports this interpretation. Each of the spline basis terms is statistically significant, with p-values below 0.01, indicating that the model effectively captures the non-linear relationship between nox and dis. The reduction in deviance from the null model (6.7810) to the residual deviance (1.9228) suggests a substantial improvement in fit, while the low AIC (-1371.9) indicates a strong model. We used four degrees of freedom, allowing the bs() function to place knots at the quantiles of dis, ensuring an even distribution that captures key trends without overfitting.
Knots are chosen based on quantiles of the data.
fits <- lapply(3:10, function(i) {
glm(nox ~ bs(dis, df = i), data = Boston)
})
x <- seq(min(Boston$dis), max(Boston$dis), length.out = 1000)
pred <- data.frame(lapply(fits, function(fit) predict(fit, data.frame(dis = x))))
colnames(pred) <- 3:10
pred$x <- x
pred <- pivot_longer(pred, !x)
ggplot(Boston, aes(dis, nox)) +
geom_point(color = alpha("palevioletred1", 0.4)) +
geom_line(data = pred, aes(x, value, color = name)) +
theme_bw()
With a high number of degrees of freedom, splines risk overfitting the data, especially at the extreme ends of the predictor variable’s distribution.
set.seed(290924)
err <- sapply(3:10, function(i) {
fit <- glm(nox ~ bs(dis, df = i), data = Boston)
suppressWarnings(cv.glm(Boston, fit, K = 10)$delta[1])
})
which.min(err)
## [1] 4
This approach selects a spline with 4 degrees of freedom.