First Exercise

In this exercise, we will generate simulated data, and will then use this data to perform best subset selection.

(a) Use the rnorm() function to generate a predictor X of length n = 100, as well as a noise vector e of length n = 100.

set.seed(290924)
n = 100
X = rnorm(n)
e = rnorm(n)

(b) Generate a response vector Y of length n = 100 according to the model Y = β0 + β1 X + β2 X2 + β3 X3 + e, where β0, β12, and β3 are constants of your choice.

beta <- c(2,1.5,-1,0.5)
Y <- beta[1] + beta[2] * X + beta[3] * X^2 + beta[4] * X^3 + e

(c) Use the regsubsets() function to perform best subset selection in order to choose the best model containing the predictors X, X2, …, X10. What is the best model obtained according to Cp, BIC, and adjusted R2? Show some plots to provide evidence for your answer, and report the coefficients of the best model obtained. Note you will need to use the data.frame() function to create a single data set containing both X and Y.

library(leaps)
## Warning: package 'leaps' was built under R version 4.3.3
dat <- data.frame(X, Y)
best_subset = regsubsets(Y ~ poly(X, 10, raw = TRUE), data = dat)
summary(best_subset)
## Subset selection object
## Call: regsubsets.formula(Y ~ poly(X, 10, raw = TRUE), data = dat)
## 10 Variables  (and intercept)
##                           Forced in Forced out
## poly(X, 10, raw = TRUE)1      FALSE      FALSE
## poly(X, 10, raw = TRUE)2      FALSE      FALSE
## poly(X, 10, raw = TRUE)3      FALSE      FALSE
## poly(X, 10, raw = TRUE)4      FALSE      FALSE
## poly(X, 10, raw = TRUE)5      FALSE      FALSE
## poly(X, 10, raw = TRUE)6      FALSE      FALSE
## poly(X, 10, raw = TRUE)7      FALSE      FALSE
## poly(X, 10, raw = TRUE)8      FALSE      FALSE
## poly(X, 10, raw = TRUE)9      FALSE      FALSE
## poly(X, 10, raw = TRUE)10     FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          poly(X, 10, raw = TRUE)1 poly(X, 10, raw = TRUE)2
## 1  ( 1 ) " "                      " "                     
## 2  ( 1 ) "*"                      " "                     
## 3  ( 1 ) "*"                      "*"                     
## 4  ( 1 ) "*"                      "*"                     
## 5  ( 1 ) "*"                      " "                     
## 6  ( 1 ) "*"                      " "                     
## 7  ( 1 ) "*"                      " "                     
## 8  ( 1 ) "*"                      " "                     
##          poly(X, 10, raw = TRUE)3 poly(X, 10, raw = TRUE)4
## 1  ( 1 ) "*"                      " "                     
## 2  ( 1 ) " "                      "*"                     
## 3  ( 1 ) "*"                      " "                     
## 4  ( 1 ) "*"                      "*"                     
## 5  ( 1 ) "*"                      "*"                     
## 6  ( 1 ) "*"                      "*"                     
## 7  ( 1 ) "*"                      "*"                     
## 8  ( 1 ) "*"                      "*"                     
##          poly(X, 10, raw = TRUE)5 poly(X, 10, raw = TRUE)6
## 1  ( 1 ) " "                      " "                     
## 2  ( 1 ) " "                      " "                     
## 3  ( 1 ) " "                      " "                     
## 4  ( 1 ) " "                      " "                     
## 5  ( 1 ) " "                      "*"                     
## 6  ( 1 ) "*"                      " "                     
## 7  ( 1 ) "*"                      " "                     
## 8  ( 1 ) " "                      "*"                     
##          poly(X, 10, raw = TRUE)7 poly(X, 10, raw = TRUE)8
## 1  ( 1 ) " "                      " "                     
## 2  ( 1 ) " "                      " "                     
## 3  ( 1 ) " "                      " "                     
## 4  ( 1 ) " "                      " "                     
## 5  ( 1 ) " "                      " "                     
## 6  ( 1 ) " "                      " "                     
## 7  ( 1 ) " "                      "*"                     
## 8  ( 1 ) "*"                      "*"                     
##          poly(X, 10, raw = TRUE)9 poly(X, 10, raw = TRUE)10
## 1  ( 1 ) " "                      " "                      
## 2  ( 1 ) " "                      " "                      
## 3  ( 1 ) " "                      " "                      
## 4  ( 1 ) " "                      " "                      
## 5  ( 1 ) "*"                      " "                      
## 6  ( 1 ) "*"                      "*"                      
## 7  ( 1 ) "*"                      "*"                      
## 8  ( 1 ) "*"                      "*"
# Plot results
par(mfrow = c(2, 2))
plot(summary(best_subset)$cp, xlab = "Number of Predictors", ylab = "Cp", type = "b")
plot(summary(best_subset)$bic, xlab = "Number of Predictors", ylab = "BIC", type = "b")
plot(summary(best_subset)$adjr2, xlab = "Number of Predictors", ylab = "Adjusted R^2", type = "b")

Based on Cp and BIC, the best model includes 3 predictors, as adding more does not improve the model’s quality. A lower Cp indicates that the model has a good balance of fit and complexity, minimizing prediction error, while a lower BIC suggests that the model is simpler without sacrificing explanatory power. Although the Adjusted R2 technically increases with 4 and 6 predictors, the improvement beyond 3 predictors is minimal, indicating diminishing returns. The 3-predictor model strikes an optimal balance because it captures most of the variance while avoiding unnecessary complexity. Additional predictors may slightly increase R2, but they risk overfitting, adding noise rather than meaningful information. Thus, 3 predictors are considered optimal, providing a simpler yet effective model. In conclusion, the analysis indicates that the best model includes 3 predictors.

coef(best_subset, id = 3)  # id = 3 specifies the model with 3 predictors
##              (Intercept) poly(X, 10, raw = TRUE)1 poly(X, 10, raw = TRUE)2 
##                1.8276879                1.3563268               -0.9147830 
## poly(X, 10, raw = TRUE)3 
##                0.5279595

The best subset model is ŷ = 1.71593162 - 1.70014172X2 - 0.01355144X6 + 0.19680000X7 The result is far from the real Y.

(d) Repeat (c), using forward stepwise selection and also using backwards stepwise selection. How does your answer compare to the results in (c)?

forward_best <- regsubsets(Y ~ poly(X, 10, raw = TRUE), data = dat, method = "forward")
summary(forward_best)
## Subset selection object
## Call: regsubsets.formula(Y ~ poly(X, 10, raw = TRUE), data = dat, method = "forward")
## 10 Variables  (and intercept)
##                           Forced in Forced out
## poly(X, 10, raw = TRUE)1      FALSE      FALSE
## poly(X, 10, raw = TRUE)2      FALSE      FALSE
## poly(X, 10, raw = TRUE)3      FALSE      FALSE
## poly(X, 10, raw = TRUE)4      FALSE      FALSE
## poly(X, 10, raw = TRUE)5      FALSE      FALSE
## poly(X, 10, raw = TRUE)6      FALSE      FALSE
## poly(X, 10, raw = TRUE)7      FALSE      FALSE
## poly(X, 10, raw = TRUE)8      FALSE      FALSE
## poly(X, 10, raw = TRUE)9      FALSE      FALSE
## poly(X, 10, raw = TRUE)10     FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: forward
##          poly(X, 10, raw = TRUE)1 poly(X, 10, raw = TRUE)2
## 1  ( 1 ) " "                      " "                     
## 2  ( 1 ) " "                      "*"                     
## 3  ( 1 ) "*"                      "*"                     
## 4  ( 1 ) "*"                      "*"                     
## 5  ( 1 ) "*"                      "*"                     
## 6  ( 1 ) "*"                      "*"                     
## 7  ( 1 ) "*"                      "*"                     
## 8  ( 1 ) "*"                      "*"                     
##          poly(X, 10, raw = TRUE)3 poly(X, 10, raw = TRUE)4
## 1  ( 1 ) "*"                      " "                     
## 2  ( 1 ) "*"                      " "                     
## 3  ( 1 ) "*"                      " "                     
## 4  ( 1 ) "*"                      "*"                     
## 5  ( 1 ) "*"                      "*"                     
## 6  ( 1 ) "*"                      "*"                     
## 7  ( 1 ) "*"                      "*"                     
## 8  ( 1 ) "*"                      "*"                     
##          poly(X, 10, raw = TRUE)5 poly(X, 10, raw = TRUE)6
## 1  ( 1 ) " "                      " "                     
## 2  ( 1 ) " "                      " "                     
## 3  ( 1 ) " "                      " "                     
## 4  ( 1 ) " "                      " "                     
## 5  ( 1 ) " "                      "*"                     
## 6  ( 1 ) " "                      "*"                     
## 7  ( 1 ) " "                      "*"                     
## 8  ( 1 ) "*"                      "*"                     
##          poly(X, 10, raw = TRUE)7 poly(X, 10, raw = TRUE)8
## 1  ( 1 ) " "                      " "                     
## 2  ( 1 ) " "                      " "                     
## 3  ( 1 ) " "                      " "                     
## 4  ( 1 ) " "                      " "                     
## 5  ( 1 ) " "                      " "                     
## 6  ( 1 ) " "                      " "                     
## 7  ( 1 ) " "                      " "                     
## 8  ( 1 ) " "                      " "                     
##          poly(X, 10, raw = TRUE)9 poly(X, 10, raw = TRUE)10
## 1  ( 1 ) " "                      " "                      
## 2  ( 1 ) " "                      " "                      
## 3  ( 1 ) " "                      " "                      
## 4  ( 1 ) " "                      " "                      
## 5  ( 1 ) " "                      " "                      
## 6  ( 1 ) "*"                      " "                      
## 7  ( 1 ) "*"                      "*"                      
## 8  ( 1 ) "*"                      "*"
# Plot results
par(mfrow = c(2, 2))
plot(summary(forward_best)$cp, xlab = "Number of Predictors", ylab = "Cp", type = "b")
plot(summary(forward_best)$bic, xlab = "Number of Predictors", ylab = "BIC", type = "b")
plot(summary(forward_best)$adjr2, xlab = "Number of Predictors", ylab = "Adjusted R^2", type = "b")

backward_best <- regsubsets(Y ~ poly(X, 10, raw = TRUE), data = dat, method = "backward")
summary(backward_best)
## Subset selection object
## Call: regsubsets.formula(Y ~ poly(X, 10, raw = TRUE), data = dat, method = "backward")
## 10 Variables  (and intercept)
##                           Forced in Forced out
## poly(X, 10, raw = TRUE)1      FALSE      FALSE
## poly(X, 10, raw = TRUE)2      FALSE      FALSE
## poly(X, 10, raw = TRUE)3      FALSE      FALSE
## poly(X, 10, raw = TRUE)4      FALSE      FALSE
## poly(X, 10, raw = TRUE)5      FALSE      FALSE
## poly(X, 10, raw = TRUE)6      FALSE      FALSE
## poly(X, 10, raw = TRUE)7      FALSE      FALSE
## poly(X, 10, raw = TRUE)8      FALSE      FALSE
## poly(X, 10, raw = TRUE)9      FALSE      FALSE
## poly(X, 10, raw = TRUE)10     FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: backward
##          poly(X, 10, raw = TRUE)1 poly(X, 10, raw = TRUE)2
## 1  ( 1 ) "*"                      " "                     
## 2  ( 1 ) "*"                      " "                     
## 3  ( 1 ) "*"                      " "                     
## 4  ( 1 ) "*"                      " "                     
## 5  ( 1 ) "*"                      " "                     
## 6  ( 1 ) "*"                      " "                     
## 7  ( 1 ) "*"                      " "                     
## 8  ( 1 ) "*"                      " "                     
##          poly(X, 10, raw = TRUE)3 poly(X, 10, raw = TRUE)4
## 1  ( 1 ) " "                      " "                     
## 2  ( 1 ) " "                      "*"                     
## 3  ( 1 ) "*"                      "*"                     
## 4  ( 1 ) "*"                      "*"                     
## 5  ( 1 ) "*"                      "*"                     
## 6  ( 1 ) "*"                      "*"                     
## 7  ( 1 ) "*"                      "*"                     
## 8  ( 1 ) "*"                      "*"                     
##          poly(X, 10, raw = TRUE)5 poly(X, 10, raw = TRUE)6
## 1  ( 1 ) " "                      " "                     
## 2  ( 1 ) " "                      " "                     
## 3  ( 1 ) " "                      " "                     
## 4  ( 1 ) " "                      " "                     
## 5  ( 1 ) " "                      " "                     
## 6  ( 1 ) " "                      " "                     
## 7  ( 1 ) " "                      " "                     
## 8  ( 1 ) " "                      "*"                     
##          poly(X, 10, raw = TRUE)7 poly(X, 10, raw = TRUE)8
## 1  ( 1 ) " "                      " "                     
## 2  ( 1 ) " "                      " "                     
## 3  ( 1 ) " "                      " "                     
## 4  ( 1 ) "*"                      " "                     
## 5  ( 1 ) "*"                      " "                     
## 6  ( 1 ) "*"                      " "                     
## 7  ( 1 ) "*"                      "*"                     
## 8  ( 1 ) "*"                      "*"                     
##          poly(X, 10, raw = TRUE)9 poly(X, 10, raw = TRUE)10
## 1  ( 1 ) " "                      " "                      
## 2  ( 1 ) " "                      " "                      
## 3  ( 1 ) " "                      " "                      
## 4  ( 1 ) " "                      " "                      
## 5  ( 1 ) "*"                      " "                      
## 6  ( 1 ) "*"                      "*"                      
## 7  ( 1 ) "*"                      "*"                      
## 8  ( 1 ) "*"                      "*"
# Plot results
par(mfrow = c(2, 2))
plot(summary(backward_best)$cp, xlab = "Number of Predictors", ylab = "Cp", type = "b")
plot(summary(backward_best)$bic, xlab = "Number of Predictors", ylab = "BIC", type = "b")
plot(summary(backward_best)$adjr2, xlab = "Number of Predictors", ylab = "Adjusted R^2", type = "b")

The plots indicate that the results are consistent with those observed in part (c) for both forward and backward selection methods. Consequently, the model with three predictors remains the optimal choice.

coef(forward_best, id = 3)
##              (Intercept) poly(X, 10, raw = TRUE)1 poly(X, 10, raw = TRUE)2 
##                1.8276879                1.3563268               -0.9147830 
## poly(X, 10, raw = TRUE)3 
##                0.5279595
coef(backward_best, id = 3)
##              (Intercept) poly(X, 10, raw = TRUE)1 poly(X, 10, raw = TRUE)3 
##                1.3817022                1.8065579                0.3078184 
## poly(X, 10, raw = TRUE)4 
##               -0.1877618

The best model from forward selections is ŷ = 1.8276879 + 1.3563268X - 0.9147830X2 + 0.5279595X3 and the best model from backward selection is ŷ = 1.3817022 + 1.8065579X + 0.3078184X3 - 0.1877618X4

The best model from forward selection in this case is the model that is closer to the real model.

(e) Now fit a lasso model to the simulated data, again using X, X2,…, X10 as predictors. Use cross-validation to select the optimal value of λ. Create plots of the cross-validation error as a function of λ. Report the resulting coefficient estimates, and discuss the results obtained.

library(glmnet)
## Warning: package 'glmnet' was built under R version 4.3.3
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 4.3.2
## Loaded glmnet 4.1-8
res <- cv.glmnet(poly(dat$X, 10, raw = TRUE), dat$Y, alpha = 1)
opt <- res$lambda.min
plot(res)

cat("\nThe coefficient estimates of the lasso model are\n")
## 
## The coefficient estimates of the lasso model are
out <- glmnet(poly(dat$X, 10, raw = TRUE), dat$Y, alpha = 1, lambda = res$lambda.min)
predict(out, type = "coefficients", s = opt)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                      s1
## (Intercept)  1.68974017
## 1            1.49125931
## 2           -0.62638921
## 3            0.44163473
## 4           -0.06298481
## 5            .         
## 6            .         
## 7            .         
## 8            .         
## 9            .         
## 10           .
cat("\nThe optimal λ is", opt)
## 
## The optimal λ is 0.0506859

The results suggest that a simpler model, utilizing only the first four polynomial terms, sufficiently explains the variation in the data. The Lasso model’s ability to perform feature selection by shrinking irrelevant coefficients to zero is beneficial here, as it simplifies the model and potentially enhances interpretability without sacrificing predictive power.

The cross-validation error plot versus log(λ) reveals that the Mean-Squared Error (MSE) remains consistently low and stable for smaller λ values but begins to rise as λ increases. This trend is characteristic of Lasso regression, where higher λ values apply stronger penalization, causing more coefficients to shrink to zero. Consequently, this excessive penalization results in higher error due to underfitting.

In summary, the cross-validation plot demonstrates that as λ becomes too large, the model suffers from underfitting. Selecting the optimal λ value of 0.02407991 achieves an effective balance between model simplicity and predictive accuracy.

(f) Now generate a response vector Y according to the model Y = β0 + β7X7 + e, and perform best subset selection and the lasso. Discuss the results obtained.

dat$Y <- 2 - 9 * X^2 + 0.9 * X^7 + e
best_subset = regsubsets(Y ~ poly(X, 10, raw = TRUE), data = dat)
summary(best_subset)
## Subset selection object
## Call: regsubsets.formula(Y ~ poly(X, 10, raw = TRUE), data = dat)
## 10 Variables  (and intercept)
##                           Forced in Forced out
## poly(X, 10, raw = TRUE)1      FALSE      FALSE
## poly(X, 10, raw = TRUE)2      FALSE      FALSE
## poly(X, 10, raw = TRUE)3      FALSE      FALSE
## poly(X, 10, raw = TRUE)4      FALSE      FALSE
## poly(X, 10, raw = TRUE)5      FALSE      FALSE
## poly(X, 10, raw = TRUE)6      FALSE      FALSE
## poly(X, 10, raw = TRUE)7      FALSE      FALSE
## poly(X, 10, raw = TRUE)8      FALSE      FALSE
## poly(X, 10, raw = TRUE)9      FALSE      FALSE
## poly(X, 10, raw = TRUE)10     FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          poly(X, 10, raw = TRUE)1 poly(X, 10, raw = TRUE)2
## 1  ( 1 ) " "                      " "                     
## 2  ( 1 ) " "                      "*"                     
## 3  ( 1 ) " "                      "*"                     
## 4  ( 1 ) " "                      "*"                     
## 5  ( 1 ) " "                      "*"                     
## 6  ( 1 ) " "                      "*"                     
## 7  ( 1 ) "*"                      "*"                     
## 8  ( 1 ) "*"                      "*"                     
##          poly(X, 10, raw = TRUE)3 poly(X, 10, raw = TRUE)4
## 1  ( 1 ) " "                      " "                     
## 2  ( 1 ) " "                      " "                     
## 3  ( 1 ) " "                      " "                     
## 4  ( 1 ) " "                      "*"                     
## 5  ( 1 ) " "                      "*"                     
## 6  ( 1 ) " "                      " "                     
## 7  ( 1 ) " "                      " "                     
## 8  ( 1 ) "*"                      "*"                     
##          poly(X, 10, raw = TRUE)5 poly(X, 10, raw = TRUE)6
## 1  ( 1 ) " "                      " "                     
## 2  ( 1 ) " "                      " "                     
## 3  ( 1 ) " "                      "*"                     
## 4  ( 1 ) "*"                      " "                     
## 5  ( 1 ) " "                      " "                     
## 6  ( 1 ) "*"                      " "                     
## 7  ( 1 ) "*"                      " "                     
## 8  ( 1 ) " "                      " "                     
##          poly(X, 10, raw = TRUE)7 poly(X, 10, raw = TRUE)8
## 1  ( 1 ) "*"                      " "                     
## 2  ( 1 ) "*"                      " "                     
## 3  ( 1 ) "*"                      " "                     
## 4  ( 1 ) "*"                      " "                     
## 5  ( 1 ) "*"                      " "                     
## 6  ( 1 ) "*"                      "*"                     
## 7  ( 1 ) "*"                      "*"                     
## 8  ( 1 ) "*"                      "*"                     
##          poly(X, 10, raw = TRUE)9 poly(X, 10, raw = TRUE)10
## 1  ( 1 ) " "                      " "                      
## 2  ( 1 ) " "                      " "                      
## 3  ( 1 ) " "                      " "                      
## 4  ( 1 ) " "                      " "                      
## 5  ( 1 ) "*"                      "*"                      
## 6  ( 1 ) "*"                      "*"                      
## 7  ( 1 ) "*"                      "*"                      
## 8  ( 1 ) "*"                      "*"
# Plot results
par(mfrow = c(2, 2))
plot(summary(best_subset)$cp, xlab = "Number of Predictors", ylab = "Cp", type = "b")
plot(summary(best_subset)$bic, xlab = "Number of Predictors", ylab = "BIC", type = "b")
plot(summary(best_subset)$adjr2, xlab = "Number of Predictors", ylab = "Adjusted R^2", type = "b")

When applying best subset selection, the model identifies the optimal set of predictors as the two most relevant terms, based on the evaluation plots.

res <- cv.glmnet(poly(dat$X, 10, raw = TRUE), dat$Y, alpha = 1)
opt <- res$lambda.min
plot(res)

cat("\nThe coefficient estimates of the lasso model are\n")
## 
## The coefficient estimates of the lasso model are
out <- glmnet(poly(dat$X, 10, raw = TRUE), dat$Y, alpha = 1, lambda = res$lambda.min)
predict(out, type = "coefficients", s = opt)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                     s1
## (Intercept) -2.3539249
## 1            .        
## 2           -3.0663527
## 3            .        
## 4            .        
## 5            0.3268560
## 6           -0.2961248
## 7            0.7630483
## 8            .        
## 9            .        
## 10           .
cat("\nThe optimal λ is", opt)
## 
## The optimal λ is 4.744031

When fitting the Lasso model, it does not perfectly replicate the underlying simulated structure, as it retains coefficients for some powers of X that were not originally part of the real model

Second Exercise

This question uses the variables dis (the weighted mean of distances to five Boston employment centers) and nox (nitrogen oxides concentration in parts per 10 million) from the Boston data. We will treat dis as the predictor and nox as the response.

(a) Use the poly() function to fit a cubic polynomial regression to predict nox using dis. Report the regression output, and plot the resulting data and polynomial fits.

library(MASS)
fit <- glm(nox ~ poly(dis, 3), data = Boston)
summary(fit)
## 
## Call:
## glm(formula = nox ~ poly(dis, 3), data = Boston)
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.554695   0.002759 201.021  < 2e-16 ***
## poly(dis, 3)1 -2.003096   0.062071 -32.271  < 2e-16 ***
## poly(dis, 3)2  0.856330   0.062071  13.796  < 2e-16 ***
## poly(dis, 3)3 -0.318049   0.062071  -5.124 4.27e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.003852802)
## 
##     Null deviance: 6.7810  on 505  degrees of freedom
## Residual deviance: 1.9341  on 502  degrees of freedom
## AIC: -1370.9
## 
## Number of Fisher Scoring iterations: 2

This regression output summarizes the results of a Generalized Linear Model (GLM) that predicts nitrogen oxide concentration (nox) based on a third-degree polynomial of the distance (dis) variable from the Boston dataset. By fitting a cubic polynomial for dis, the model aims to capture potential nonlinear relationships between dis and nox.

The coefficient estimates reveal significant contributions from each term in the polynomial expansion. The intercept is estimated at 0.5547 with a highly significant p-value (p < 2e-16), indicating a strong baseline level of nox concentration when dis is at its mean. The first-degree term of dis has an estimated coefficient of -2.0031, also highly significant, suggesting an inverse relationship between dis and nox. This negative coefficient implies that, on average, as distance increases, nitrogen oxide concentration decreases. The second-degree term for dis has a positive coefficient of 0.8563, which, along with its significance (p < 2e-16), implies a curvilinear adjustment to the relationship, likely capturing a bending effect as dis increases. The third-degree term has a smaller negative coefficient of -0.3180 but remains significant (p < 0.001), fine-tuning the model with a third-order adjustment that further captures the complexity in the dis-nox relationship.

In terms of model fit, the null deviance is 6.7810, which reflects the total variance in nox without any predictors. The model’s residual deviance drops to 1.9341, indicating a substantial reduction in unexplained variance by including the polynomial terms of dis. This improvement suggests that the cubic polynomial is effective at explaining variations in nitrogen oxide concentration. Additionally, the model’s AIC value of -1370.9 supports the quality of fit; lower AIC values generally indicate better model performance for similar types of models, and this negative AIC is suggestive of a strong fit.

Overall, this model highlights a nonlinear relationship between distance and nitrogen oxide concentration, where higher distances are generally associated with lower nox levels. However, the relationship is complex, with each polynomial term adding a layer of adjustment to capture the nuances in the data. The significant contributions of each term demonstrate that dis is a strong predictor of nox, and the model’s fit statistics indicate that it explains a substantial portion of the variation in nox.

library(scales)
## Warning: package 'scales' was built under R version 4.3.3
plot(nox ~ dis, data = Boston, col = alpha("palevioletred1", 0.4), pch = 19)
x <- seq(min(Boston$dis), max(Boston$dis), length.out = 1000)
lines(x, predict(fit, data.frame(dis = x)), col = "red", lty = 2)

The plot displays the relationship between the predictor dis (distance) and the response variable nox (nitrogen oxide concentration), along with a fitted curve from the polynomial regression model. The scatterplot shows an overall negative trend: as distance (dis) increases, nitrogen oxide concentration (nox) tends to decrease.

The fitted curve, shown in red, captures a nonlinear pattern, initially steep and then flattening out as dis increases, aligning with the polynomial regression model of degree 3 used in the analysis. This curvature effectively models the diminishing effect of distance on nitrogen oxide levels, where concentrations drop sharply at shorter distances and stabilize at larger distances.

The curve confirms that a cubic polynomial is appropriate for modeling this relationship, capturing the observed data trend well without significant overfitting. This pattern supports the interpretation that distance from specific sources (such as industrial or traffic-related pollution sources) has a decreasing impact on nitrogen oxide concentration as one moves further away.

(b) Plot the polynomial fits for a range of different polynomial degrees (say, from 1 to 10), and report the associated residual sum of squares.

library(tidyr)
## Warning: package 'tidyr' was built under R version 4.3.3
## 
## Attaching package: 'tidyr'
## The following objects are masked from 'package:Matrix':
## 
##     expand, pack, unpack
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
fits <- lapply(1:10, function(i) glm(nox ~ poly(dis, i), data = Boston))

x <- seq(min(Boston$dis), max(Boston$dis), length.out=1000)
pred <- data.frame(lapply(fits, function(fit) predict(fit, data.frame(dis = x))))
colnames(pred) <- 1:10
pred$x <- x
pred <- pivot_longer(pred, !x)
ggplot(Boston, aes(dis, nox)) +
  geom_point(color = alpha("palevioletred1", 0.4)) +
  geom_line(data = pred, aes(x, value, color = name)) +
  theme_bw()

# residual sum of squares
do.call(anova, fits)[, 2]
##  [1] 2.768563 2.035262 1.934107 1.932981 1.915290 1.878257 1.849484 1.835630
##  [9] 1.833331 1.832171

The Residual Sum of Squares (RSS) values show a decreasing trend as the model complexity increases, indicating improved fit with each additional term or predictor. Initially, there’s a notable drop in RSS from 2.768563 to around 1.9, suggesting a significant improvement in capturing the data’s variability. However, the reductions become progressively smaller, with the last values (1.835630 to 1.832171) showing minimal change. This pattern suggests diminishing returns; adding further complexity beyond this point likely yields little benefit, potentially indicating an optimal model around the lower RSS values.

(c) Perform cross-validation or another approach to select the optimal degree for the polynomial, and explain your results.

library(boot)
## Warning: package 'boot' was built under R version 4.3.3
set.seed(290924)
res <- sapply(1:10, function(i) {
  fit <- glm(nox ~ poly(dis, i), data = Boston)
  cv.glm(Boston, fit, K = 10)$delta[1]
})
which.min(res)
## [1] 3

Based on cross-validation, the optimal polynomial degree for the model is 3. Increasing the degree beyond this tends to result in overfitting, as it captures more noise rather than improving the model’s predictive accuracy.

(d) Use the bs() function to fit a regression spline to predict nox using dis. Report the output for the fit using four degrees of freedom. How did you choose the knots? Plot the resulting fit.

library(splines)
fit <- glm(nox ~ bs(dis, df = 4), data = Boston)
summary(fit)
## 
## Call:
## glm(formula = nox ~ bs(dis, df = 4), data = Boston)
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.73447    0.01460  50.306  < 2e-16 ***
## bs(dis, df = 4)1 -0.05810    0.02186  -2.658  0.00812 ** 
## bs(dis, df = 4)2 -0.46356    0.02366 -19.596  < 2e-16 ***
## bs(dis, df = 4)3 -0.19979    0.04311  -4.634 4.58e-06 ***
## bs(dis, df = 4)4 -0.38881    0.04551  -8.544  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.003837874)
## 
##     Null deviance: 6.7810  on 505  degrees of freedom
## Residual deviance: 1.9228  on 501  degrees of freedom
## AIC: -1371.9
## 
## Number of Fisher Scoring iterations: 2
plot(nox ~ dis, data = Boston, col = alpha("palevioletred1", 0.4), pch = 19)
x <- seq(min(Boston$dis), max(Boston$dis), length.out = 1000)
lines(x, predict(fit, data.frame(dis = x)), col = "red", lty = 2)

The fitted regression spline model demonstrates a clear, non-linear relationship between nitrogen oxide concentration (nox) and the distance to employment centers (dis). As dis increases, nox tends to decrease, suggesting that areas closer to employment hubs have higher nitrogen oxide levels, likely due to higher traffic and industrial activity. The curve shows that this decline is steep initially, then gradually levels off as dis increases, reflecting a diminishing impact of distance on nox in areas further from the center.

The regression output further supports this interpretation. Each of the spline basis terms is statistically significant, with p-values below 0.01, indicating that the model effectively captures the non-linear relationship between nox and dis. The reduction in deviance from the null model (6.7810) to the residual deviance (1.9228) suggests a substantial improvement in fit, while the low AIC (-1371.9) indicates a strong model. We used four degrees of freedom, allowing the bs() function to place knots at the quantiles of dis, ensuring an even distribution that captures key trends without overfitting.

Knots are chosen based on quantiles of the data.

(e) Now fit a regression spline for a range of degrees of freedom, and plot the resulting fits and report the resulting RSS. Describe the results obtained.

fits <- lapply(3:10, function(i) {
  glm(nox ~ bs(dis, df = i), data = Boston)
})

x <- seq(min(Boston$dis), max(Boston$dis), length.out = 1000)
pred <- data.frame(lapply(fits, function(fit) predict(fit, data.frame(dis = x))))
colnames(pred) <- 3:10
pred$x <- x
pred <- pivot_longer(pred, !x)
ggplot(Boston, aes(dis, nox)) +
  geom_point(color = alpha("palevioletred1", 0.4)) +
  geom_line(data = pred, aes(x, value, color = name)) +
  theme_bw()

With a high number of degrees of freedom, splines risk overfitting the data, especially at the extreme ends of the predictor variable’s distribution.

(f) Perform cross-validation or another approach in order to select the best degrees of freedom for a regression spline on this data. Describe your results.

set.seed(290924)
err <- sapply(3:10, function(i) {
  fit <- glm(nox ~ bs(dis, df = i), data = Boston)
  suppressWarnings(cv.glm(Boston, fit, K = 10)$delta[1])
})
which.min(err)
## [1] 4

This approach selects a spline with 4 degrees of freedom.