Exercise 1

  (a) Use the rnorm() function to generate a predictor X of length n = 100, as well as a noise vector ϵ of length n = 100.
library(ISLR2)
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
library(leaps)
library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
library(scales)
library(tidyr)
## 
## Attaching package: 'tidyr'
## The following objects are masked from 'package:Matrix':
## 
##     expand, pack, unpack
library(ggplot2)
library(boot)
library(splines)
set.seed(123)
x <- rnorm(100)
ep <- rnorm(100)
  (b) Generate a response vector Y of length n = 100 according to the model Y = β₀ + β₁X + β₂X² + β₃X³ + ϵ, where β₀, β₁, β₂, and β₃ are constants of your choice.
y <- 2 - 9 * x - 0.5 * x^2 + 9 * x^3 + ep
  (c) Use the regsubsets() function to perform best subset selection in order to choose the best model containing the predictors X, X², …, X¹⁰. What is the best model obtained according to Cp, BIC, and adjusted R²? Show some plots to provide evidence for your answer, and report the coefficients of the best model obtained. Note you will need to use the data.frame() function to create a single data set containing both X and Y.
data <- data.frame(x, y)
regfit <- regsubsets(y ~ poly(x, 10, raw = TRUE), data = data, nvmax = 10)  # consider models up to all 10 terms
regfit_summary <- summary(regfit)
# Extract values for plotting
cp <- regfit_summary$cp
bic <- regfit_summary$bic
adjr2 <- regfit_summary$adjr2

# Set up the plotting area
par(mfrow = c(1, 3))

# Plot Cp
plot(cp, type = "b", xlab = "Number of Predictors", ylab = "Cp", main = "Cp")
points(which.min(cp), min(cp), col = "red", pch = 19)  # Highlight min Cp

# Plot BIC
plot(bic, type = "b", xlab = "Number of Predictors", ylab = "BIC", main = "BIC")
points(which.min(bic), min(bic), col = "red", pch = 19)  # Highlight min BIC

# Plot Adjusted R^2
plot(adjr2, type = "b", xlab = "Number of Predictors", ylab = "Adjusted R^2", main = "Adjusted R^2")
points(which.max(adjr2), max(adjr2), col = "red", pch = 19)  # Highlight max Adjusted R^2

All three criteria (Cp, BIC, and adjusted R²) identify the three-predictor model as the best choice, matching the cubic form used to simulate the data; its coefficients are extracted below.

# Perform best subset selection
best_model <- coef(regfit, id = 3)  # id = 3 specifies the model with 3 predictors

# Display the coefficients
best_model
##              (Intercept) poly(x, 10, raw = TRUE)1 poly(x, 10, raw = TRUE)2 
##                1.9703939               -9.0795538               -0.5915431 
## poly(x, 10, raw = TRUE)3 
##                9.0204363

Best model: Y = 1.9703939 - 9.0795538X - 0.5915431X² + 9.0204363X³

It can be seen that the coefficients estimated are similar to those used in the simulation.
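
As a quick sanity check (a small sketch reusing the regfit object above), the true coefficients used in the simulation can be placed alongside the estimates from the selected model:

# True simulation coefficients versus the 3-predictor estimates
rbind(true = c(2, -9, -0.5, 9), estimated = coef(regfit, id = 3))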

  (d) Repeat (c), using forward stepwise selection and also using backwards stepwise selection. How does your answer compare to the results in (c)?
# Perform forward and backward stepwise selection
forward_fit <- regsubsets(y ~ poly(x, 10, raw = TRUE), data = data, nvmax = 10, method = "forward")
backward_fit <- regsubsets(y ~ poly(x, 10, raw = TRUE), data = data, nvmax = 10, method = "backward")

plot_selection_criteria <- function(fit, title_prefix) {
  # Extract the summary of the fit
  fit_summary <- summary(fit)
  
  # Set up the plotting area
  par(mfrow = c(1, 3))
  
  # Plot Cp
  plot(fit_summary$cp, xlab = "Number of Predictors", ylab = "Cp", type = "b", main = paste(title_prefix, "Cp"))
  points(which.min(fit_summary$cp), fit_summary$cp[which.min(fit_summary$cp)], col = "red", pch = 20)
  
  # Plot BIC
  plot(fit_summary$bic, xlab = "Number of Predictors", ylab = "BIC", type = "b", main = paste(title_prefix, "BIC"))
  points(which.min(fit_summary$bic), fit_summary$bic[which.min(fit_summary$bic)], col = "red", pch = 20)
  
  # Plot Adjusted R^2
  plot(fit_summary$adjr2, xlab = "Number of Predictors", ylab = "Adjusted R^2", type = "b", main = paste(title_prefix, "Adjusted R^2"))
  points(which.max(fit_summary$adjr2), fit_summary$adjr2[which.max(fit_summary$adjr2)], col = "red", pch = 20)
}

# Plot for forward selection
plot_selection_criteria(forward_fit, "Forward Selection")

# Plot for backward selection
plot_selection_criteria(backward_fit, "Backward Selection")

Forward Stepwise Selection:

Backward Stepwise Selection:

Comparison: Both forward and backward stepwise selection suggest that a model with three predictors balances model simplicity with fit quality. This aligns with the result from part (c), where the model with 3 predictors was identified as optimal based on Cp, BIC, and adjusted R² values. Adding more predictors beyond this point offers minimal additional benefit and may lead to overfitting.
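
To make the comparison concrete, the coefficients of the 3-variable models chosen by each search can be printed side by side (a brief sketch; for this simulated data all three searches should agree):

coef(regfit, id = 3)       # best subset, from part (c)
coef(forward_fit, id = 3)  # forward stepwise
coef(backward_fit, id = 3) # backward stepwise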

  (e) Now fit a lasso model to the simulated data, again using X, X², …, X¹⁰ as predictors. Use cross-validation to select the optimal value of λ. Create plots of the cross-validation error as a function of λ. Report the resulting coefficient estimates, and discuss the results obtained.
res <- cv.glmnet(poly(data$x, 10, raw = TRUE), data$y, alpha = 1)
(best <- res$lambda.min)
## [1] 0.01732416
plot(res)

out <- glmnet(poly(data$x, 10, raw = TRUE), data$y, alpha = 1, lambda = res$lambda.min)
predict(out, type = "coefficients", s = best)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept)  1.9870003803
## 1           -8.7212254707
## 2           -0.6253768715
## 3            8.7110626436
## 4            .           
## 5            0.0490382100
## 6            .           
## 7            .           
## 8            .           
## 9            .           
## 10           0.0001757493

The plot shows the cross-validation error as a function of log(λ). The selected value, lambda.min ≈ 0.0173, is the λ at which the cross-validated mean squared error is minimized, balancing model complexity against predictive accuracy.

Interpretation of Non-Zero Coefficients:

Based on the three methods applied previously (best subset, forward stepwise, and backward stepwise selection), the optimal model was consistently identified as one with three predictors. Lasso regression, by contrast, retained five predictors. Notably, the intercept and the coefficients for X, X², and X³ in the lasso model are nearly identical to those obtained through the other methods, so the core model structure is the same. The coefficients retained for X⁵ and X¹⁰ are very close to zero, indicating a negligible contribution. Despite these minor differences, all of the approaches effectively point to the same underlying model, since the extra terms kept by the lasso have essentially no effect on predictive performance.
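
Besides lambda.min, cv.glmnet also records lambda.1se, the largest λ whose cross-validation error is within one standard error of the minimum; it usually yields a sparser model and is worth inspecting (a small sketch reusing the cv.glmnet object res from above):

# Lasso coefficients at the more conservative one-standard-error lambda
coef(res, s = "lambda.1se")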

  (f) Now generate a response vector Y according to the model Y = β₀ + β₇X⁷ + ϵ, and perform best subset selection and the lasso. Discuss the results obtained.
data$y <- 2 + 2 * x^7 + ep
regfit <- regsubsets(y ~ poly(x, 10, raw = TRUE), data = data, nvmax = 10)  # consider models up to all 10 terms
regfit_summary <- summary(regfit)
# Extract values for plotting
cp <- regfit_summary$cp
bic <- regfit_summary$bic
adjr2 <- regfit_summary$adjr2

# Set up the plotting area
par(mfrow = c(1, 3))

# Plot Cp
plot(cp, type = "b", xlab = "Number of Predictors", ylab = "Cp", main = "Cp")
points(which.min(cp), min(cp), col = "red", pch = 19)  # Highlight min Cp

# Plot BIC
plot(bic, type = "b", xlab = "Number of Predictors", ylab = "BIC", main = "BIC")
points(which.min(bic), min(bic), col = "red", pch = 19)  # Highlight min BIC

# Plot Adjusted R^2
plot(adjr2, type = "b", xlab = "Number of Predictors", ylab = "Adjusted R^2", main = "Adjusted R^2")
points(which.max(adjr2), max(adjr2), col = "red", pch = 19)  # Highlight max Adjusted R^2

# Perform best subset selection
best_model <- coef(regfit, id = 1)  # id = 1 specifies the model with 1 predictor

# Display the coefficients
best_model
##              (Intercept) poly(x, 10, raw = TRUE)7 
##                 1.893251                 1.999704
res <- cv.glmnet(poly(data$x, 10, raw = TRUE), data$y, alpha = 1)
(best <- res$lambda.min)
## [1] 3.057612
plot(res)

out <- glmnet(poly(data$x, 10, raw = TRUE), data$y, alpha = 1, lambda = best)
predict(out, type = "coefficients", s = best)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept) 2.0535998863
## 1           .           
## 2           .           
## 3           .           
## 4           .           
## 5           .           
## 6           .           
## 7           1.9390812082
## 8           .           
## 9           0.0003877265
## 10          .

Based on best subset selection, the optimal model was identified as the one with a single predictor, Y = 1.893251 + 1.999704X⁷, as it achieved the smallest Cp and BIC values and displayed a relatively high adjusted R². Lasso regression indicated that the value of λ minimizing the prediction error was 3.057612, resulting in a model with two predictors. The intercept and the coefficient for X⁷ in the lasso model are almost identical to those obtained from best subset selection, reinforcing the similarity in the core model structure, while the coefficient for X⁹ is close to zero, indicating a minimal contribution. Despite the lasso including this additional term, both methods effectively point to the same essential model, as the added predictor does not meaningfully affect predictive performance.
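
As a quick cross-check on the best subset result (a small sketch reusing regfit_summary from above), the model size favored by each criterion can be printed directly:

c(Cp = which.min(regfit_summary$cp),
  BIC = which.min(regfit_summary$bic),
  AdjR2 = which.max(regfit_summary$adjr2))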

Exercise 2

  (a) Use the poly() function to fit a cubic polynomial regression to predict nox using dis. Report the regression output, and plot the resulting data and polynomial fits.
fit <- glm(nox ~ poly(dis, 3), data = Boston)
summary(fit)
## 
## Call:
## glm(formula = nox ~ poly(dis, 3), data = Boston)
## 
## Deviance Residuals: 
##       Min         1Q     Median         3Q        Max  
## -0.121130  -0.040619  -0.009738   0.023385   0.194904  
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.554695   0.002759 201.021  < 2e-16 ***
## poly(dis, 3)1 -2.003096   0.062071 -32.271  < 2e-16 ***
## poly(dis, 3)2  0.856330   0.062071  13.796  < 2e-16 ***
## poly(dis, 3)3 -0.318049   0.062071  -5.124 4.27e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.003852802)
## 
##     Null deviance: 6.7810  on 505  degrees of freedom
## Residual deviance: 1.9341  on 502  degrees of freedom
## AIC: -1370.9
## 
## Number of Fisher Scoring iterations: 2
plot(nox ~ dis, data = Boston, col = alpha("pink1", 0.4), pch = 19)
x <- seq(min(Boston$dis), max(Boston$dis), length.out = 1000)
lines(x, predict(fit, data.frame(dis = x)), col = "purple4", lty = 2)

All coefficients are statistically significant at the p < 0.001 level, as indicated by the triple asterisks (***). This suggests that each term in the cubic polynomial model contributes significantly to explaining the variation in nox.

The purple line on the plot represents the fitted cubic polynomial curve, indicating a nonlinear relationship between dis and nox. As dis increases, nox generally decreases, but not in a strictly linear fashion.
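
To convey the uncertainty around the cubic fit, approximate 95% pointwise confidence bands can be overlaid using the standard errors returned by predict() (a small sketch, not part of the required output):

# Re-draw the data and cubic fit, adding approximate 95% pointwise confidence bands
pr <- predict(fit, data.frame(dis = x), se.fit = TRUE)
plot(nox ~ dis, data = Boston, col = alpha("pink1", 0.4), pch = 19)
lines(x, pr$fit, col = "purple4", lty = 2)
lines(x, pr$fit + 2 * pr$se.fit, col = "purple4", lty = 3)
lines(x, pr$fit - 2 * pr$se.fit, col = "purple4", lty = 3)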

  (b) Plot the polynomial fits for a range of different polynomial degrees (say, from 1 to 10), and report the associated residual sum of squares.
fits <- lapply(1:10, function(i) glm(nox ~ poly(dis, i), data = Boston))

x <- seq(min(Boston$dis), max(Boston$dis), length.out=1000)
pred <- data.frame(lapply(fits, function(fit) predict(fit, data.frame(dis = x))))
colnames(pred) <- 1:10
pred$x <- x
pred <- pivot_longer(pred, !x)
ggplot(Boston, aes(dis, nox)) +
  geom_point(color = alpha("pink1", 0.4)) +
  geom_line(data = pred, aes(x, value, color = name)) +
  theme_bw()

# Residual sum of squares for each degree (the "Resid. Dev" column of the anova table of the nested fits)
do.call(anova, fits)[, 2]
##  [1] 2.768563 2.035262 1.934107 1.932981 1.915290 1.878257 1.849484 1.835630
##  [9] 1.833331 1.832171

The residual sum of squares (RSS) decreases as the polynomial degree increases, indicating that higher-degree polynomials fit the data better by capturing more complexity in the relationship between dis and nox.

However, the decrease in RSS becomes much smaller after degree 3, suggesting diminishing returns: the large improvements come from adding the quadratic and cubic terms, and additional complexity beyond that does not substantially improve the model's accuracy.

In conclusion, while higher-degree polynomials yield lower RSS, a degree of 3 to 5 might provide a good balance between model complexity and fit quality.
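
A more formal way to judge where the improvement stops is a sequence of F-tests on the nested polynomial fits (a sketch reusing the fits list from above):

# Sequential F-tests: does each additional degree significantly reduce the RSS?
do.call(anova, c(fits, list(test = "F")))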

  (c) Perform cross-validation or another approach to select the optimal degree for the polynomial, and explain your results.
res <- sapply(1:10, function(i) {
  fit <- glm(nox ~ poly(dis, i), data = Boston)
  cv.glm(Boston, fit, K = 10)$delta[1]
})
which.min(res)
## [1] 3

The degree-3 polynomial is selected as the optimal degree because it minimizes the cross-validation error.

This approach suggests that using a cubic polynomial strikes a good balance between underfitting and overfitting. Lower-degree polynomials (like linear or quadratic) might not capture the non-linear relationship well enough, while higher-degree polynomials (like degree 4 or more) may start to overfit, capturing noise rather than the true pattern in the data.
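
Plotting the cross-validation errors makes the comparison across degrees explicit (a small sketch reusing the res vector computed above):

# 10-fold CV error by polynomial degree, with the minimum highlighted
plot(1:10, res, type = "b", xlab = "Polynomial Degree", ylab = "10-fold CV Error")
points(which.min(res), min(res), col = "red", pch = 19)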

  (d) Use the bs() function to fit a regression spline to predict nox using dis. Report the output for the fit using four degrees of freedom. How did you choose the knots? Plot the resulting fit.
fit <- glm(nox ~ bs(dis, df = 4), data = Boston)
summary(fit)
## 
## Call:
## glm(formula = nox ~ bs(dis, df = 4), data = Boston)
## 
## Deviance Residuals: 
##       Min         1Q     Median         3Q        Max  
## -0.124622  -0.039259  -0.008514   0.020850   0.193891  
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.73447    0.01460  50.306  < 2e-16 ***
## bs(dis, df = 4)1 -0.05810    0.02186  -2.658  0.00812 ** 
## bs(dis, df = 4)2 -0.46356    0.02366 -19.596  < 2e-16 ***
## bs(dis, df = 4)3 -0.19979    0.04311  -4.634 4.58e-06 ***
## bs(dis, df = 4)4 -0.38881    0.04551  -8.544  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.003837874)
## 
##     Null deviance: 6.7810  on 505  degrees of freedom
## Residual deviance: 1.9228  on 501  degrees of freedom
## AIC: -1371.9
## 
## Number of Fisher Scoring iterations: 2
plot(nox ~ dis, data = Boston, col = alpha("pink1", 0.4), pch = 19)
x <- seq(min(Boston$dis), max(Boston$dis), length.out = 1000)
lines(x, predict(fit, data.frame(dis = x)), col = "purple4", lty = 2)

The regression spline model indicates a strong and statistically significant inverse relationship between the distance from employment centers (dis) and nitrogen oxides (nox) concentration. The model includes four spline basis terms, each with negative coefficients, suggesting that as dis increases—meaning as we move further from urban or industrial areas—nox concentrations tend to decrease. This is consistent with the idea that pollution levels drop as one moves away from densely populated or industrialized areas that are often sources of nitrogen oxides. The largest coefficient in magnitude, bs(dis, df = 4)2, shows a strong decrease in nox with this component of dis, reinforcing the downward trend. All spline terms are statistically significant, indicating that the observed relationship is not due to random chance.

Furthermore, the reduction in residual deviance from 6.7810 (in the null model) to 1.9228 in the fitted model suggests a good fit, as the spline model captures the non-linear pattern in the data much better than a simple average-based model would. Overall, this model effectively illustrates how nitrogen oxides concentration diminishes as distance from employment centers increases, capturing a meaningful environmental pattern.

In this case, the choice of knots was handled automatically by the bs() function in R. With df = 4 and the default cubic degree, bs() places df - degree = 1 interior knot, located at the median of dis (knots are placed at quantiles of the predictor). This lets the spline capture the essential shape of the relationship between dis and nox without requiring us to specify knot locations manually. If we wanted finer control, we could supply exact knot locations based on data percentiles or domain knowledge, but automatic placement based on degrees of freedom is often sufficient for exploratory analysis.
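
The automatically chosen knot can be inspected directly; with df = 4 it should be a single interior knot at the median of dis (a quick sketch):

# Interior knot selected by bs() for df = 4, compared with the median of dis
attr(bs(Boston$dis, df = 4), "knots")
median(Boston$dis)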

  (e) Now fit a regression spline for a range of degrees of freedom, and plot the resulting fits and report the resulting RSS. Describe the results obtained.
# Fit models with different degrees of freedom and calculate RSS
fits <- lapply(3:10, function(i) {
  fit <- glm(nox ~ bs(dis, df = i), data = Boston)
  rss <- sum(residuals(fit)^2)  # Calculate RSS
  list(fit = fit, rss = rss)    # Store both model and RSS
})

# Extract predictions for plotting
x <- seq(min(Boston$dis), max(Boston$dis), length.out = 1000)
pred <- data.frame(lapply(fits, function(f) predict(f$fit, data.frame(dis = x))))
colnames(pred) <- 3:10
pred$x <- x
pred <- pivot_longer(pred, !x)

# Plot the fitted splines
ggplot(Boston, aes(dis, nox)) +
  geom_point(color = alpha("pink1", 0.4)) +
  geom_line(data = pred, aes(x, value, color = name)) +
  labs(color = "Degrees of Freedom") +
  theme_bw()

# Print RSS for each degree of freedom
rss_values <- sapply(fits, function(f) f$rss)
names(rss_values) <- 3:10
print(rss_values)
##        3        4        5        6        7        8        9       10 
## 1.934107 1.922775 1.840173 1.833966 1.829884 1.816995 1.825653 1.792535

The RSS generally decreases as the degrees of freedom increase from 3 to 10, showing that the model’s fit improves with greater flexibility: the RSS drops from 1.934 with 3 degrees of freedom to about 1.79 with 10 degrees of freedom.

The improvement is not steady, however. The largest single drop occurs between 4 and 5 degrees of freedom, the RSS actually rises slightly at 9 degrees of freedom, and most of the other changes are small. This irregular, diminishing pattern implies that beyond a certain point additional complexity provides little substantive gain in fit and may lead to overfitting, where the model starts to capture noise in the data rather than the true underlying pattern.

Thus, a balance should be struck: a model with around 5 to 7 degrees of freedom appears to offer a good fit without the excessive complexity that higher degrees of freedom might introduce.
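
The RSS trajectory is easier to read as a plot (a small sketch of the rss_values computed above):

# RSS versus spline degrees of freedom
plot(3:10, rss_values, type = "b", xlab = "Degrees of Freedom", ylab = "RSS")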

  (f) Perform cross-validation or another approach in order to select the best degrees of freedom for a regression spline on this data. Describe your results.
set.seed(123)
err <- sapply(3:10, function(i) {
  fit <- glm(nox ~ bs(dis, df = i), data = Boston)
  suppressWarnings(cv.glm(Boston, fit, K = 10)$delta[1])
})
which.min(err)  # position within the candidate vector 3:10; position 3 corresponds to df = 5
## [1] 3

The 10-fold cross-validation over models with 3 to 10 degrees of freedom selects 5 degrees of freedom: which.min(err) returns position 3 of the candidate vector 3:10, which corresponds to df = 5. In other words, based on cross-validation error, the spline with 5 degrees of freedom generalizes best to unseen data.

While models with more degrees of freedom (such as 8 or 10) fit the training data slightly better, as evidenced by lower RSS values, they also risk capturing noise and overfitting. Cross-validation penalizes this by assessing each model’s performance on held-out folds, favoring simpler models when they predict new data as well or better. Choosing 5 degrees of freedom therefore balances flexibility and generalizability, and it is consistent with the earlier RSS analysis, which suggested that roughly 5 to 7 degrees of freedom capture the main pattern between dis and nox without unnecessary complexity.
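
For completeness, the cross-validation errors can be plotted against the candidate degrees of freedom, with the selected value highlighted (a brief sketch reusing the err vector from above):

# 10-fold CV error by spline degrees of freedom, minimum highlighted
plot(3:10, err, type = "b", xlab = "Degrees of Freedom", ylab = "10-fold CV Error")
points((3:10)[which.min(err)], min(err), col = "red", pch = 19)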