In this exercise, we will generate simulated data and perform best subset selection.
We use the rnorm() function to generate a predictor \(X\) of length \(n = 100\), as well as a noise vector \(\epsilon\) of length \(n = 100\).
set.seed(42)
x <- rnorm(100)
ep <- rnorm(100)
Here, rnorm(100) draws both the predictor \(X\) and the noise vector \(\epsilon\) from a standard normal distribution, and the seed makes the results reproducible.
Next, we generate a response vector \(Y\) of length \(n = 100\) according to the model \[Y = \beta_0 + \beta_1X + \beta_2X^2 + \beta_3X^3 + \epsilon,\] where we choose \(\beta_0 = 2\), \(\beta_1 = 3\), \(\beta_2 = -2\), and \(\beta_3 = 0.5\).
y <- 2 + 3 * x - 2 * x^2 + 0.5 * x^3 + ep
Here, the response vector \(Y\) is generated based on the specified polynomial model with random noise.
We then use the regsubsets() function from the leaps package to perform best subset selection and choose the best model containing the predictors \(X, X^2, \ldots, X^{10}\).
library(leaps)  # provides regsubsets()
dat <- data.frame(x, y)
regfit_full <- regsubsets(y ~ poly(x, 10, raw = TRUE), data = dat)  # by default regsubsets() considers models of size up to nvmax = 8
reg_summary <- summary(regfit_full)
# Plotting Cp, BIC, and Adjusted R-squared
par(mfrow = c(1, 3))
plot(reg_summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
points(which.min(reg_summary$cp), reg_summary$cp[which.min(reg_summary$cp)], col = "red", cex = 2, pch = 20)
plot(reg_summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(which.min(reg_summary$bic), reg_summary$bic[which.min(reg_summary$bic)], col = "red", cex = 2, pch = 20)
plot(reg_summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R2", type = "l")
points(which.max(reg_summary$adjr2), reg_summary$adjr2[which.max(reg_summary$adjr2)], col = "red", cex = 2, pch = 20)
In this step, we perform best subset selection and plot Cp, BIC, and adjusted \(R^2\) against the number of predictors to choose the best model; the red dots mark the optimum of each criterion.
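As a quick check beyond the plots (a sketch using the objects above, not part of the original output), we can extract the coefficients of the model that minimizes BIC:
# Coefficients of the BIC-optimal model from best subset selection
coef(regfit_full, which.min(reg_summary$bic))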
# Forward stepwise selection
regfit_fwd <- regsubsets(y ~ poly(x, 10, raw = TRUE), data = dat, method = "forward")
summary(regfit_fwd)
## Subset selection object
## Call: regsubsets.formula(y ~ poly(x, 10, raw = TRUE), data = dat, method = "forward")
## 10 Variables (and intercept)
## Forced in Forced out
## poly(x, 10, raw = TRUE)1 FALSE FALSE
## poly(x, 10, raw = TRUE)2 FALSE FALSE
## poly(x, 10, raw = TRUE)3 FALSE FALSE
## poly(x, 10, raw = TRUE)4 FALSE FALSE
## poly(x, 10, raw = TRUE)5 FALSE FALSE
## poly(x, 10, raw = TRUE)6 FALSE FALSE
## poly(x, 10, raw = TRUE)7 FALSE FALSE
## poly(x, 10, raw = TRUE)8 FALSE FALSE
## poly(x, 10, raw = TRUE)9 FALSE FALSE
## poly(x, 10, raw = TRUE)10 FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: forward
## poly(x, 10, raw = TRUE)1 poly(x, 10, raw = TRUE)2
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " "*"
## 3 ( 1 ) "*" "*"
## 4 ( 1 ) "*" "*"
## 5 ( 1 ) "*" "*"
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
## poly(x, 10, raw = TRUE)3 poly(x, 10, raw = TRUE)4
## 1 ( 1 ) "*" " "
## 2 ( 1 ) "*" " "
## 3 ( 1 ) "*" " "
## 4 ( 1 ) "*" " "
## 5 ( 1 ) "*" " "
## 6 ( 1 ) "*" " "
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
## poly(x, 10, raw = TRUE)5 poly(x, 10, raw = TRUE)6
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) "*" " "
## 5 ( 1 ) "*" " "
## 6 ( 1 ) "*" " "
## 7 ( 1 ) "*" " "
## 8 ( 1 ) "*" " "
## poly(x, 10, raw = TRUE)7 poly(x, 10, raw = TRUE)8
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) "*" " "
## 7 ( 1 ) "*" " "
## 8 ( 1 ) "*" " "
## poly(x, 10, raw = TRUE)9 poly(x, 10, raw = TRUE)10
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) "*" " "
## 6 ( 1 ) "*" " "
## 7 ( 1 ) "*" " "
## 8 ( 1 ) "*" "*"
# Backward stepwise selection
regfit_bwd <- regsubsets(y ~ poly(x, 10, raw = TRUE), data = dat, method = "backward")
summary(regfit_bwd)
## Subset selection object
## Call: regsubsets.formula(y ~ poly(x, 10, raw = TRUE), data = dat, method = "backward")
## 10 Variables (and intercept)
## Forced in Forced out
## poly(x, 10, raw = TRUE)1 FALSE FALSE
## poly(x, 10, raw = TRUE)2 FALSE FALSE
## poly(x, 10, raw = TRUE)3 FALSE FALSE
## poly(x, 10, raw = TRUE)4 FALSE FALSE
## poly(x, 10, raw = TRUE)5 FALSE FALSE
## poly(x, 10, raw = TRUE)6 FALSE FALSE
## poly(x, 10, raw = TRUE)7 FALSE FALSE
## poly(x, 10, raw = TRUE)8 FALSE FALSE
## poly(x, 10, raw = TRUE)9 FALSE FALSE
## poly(x, 10, raw = TRUE)10 FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: backward
## poly(x, 10, raw = TRUE)1 poly(x, 10, raw = TRUE)2
## 1 ( 1 ) "*" " "
## 2 ( 1 ) "*" "*"
## 3 ( 1 ) "*" "*"
## 4 ( 1 ) "*" "*"
## 5 ( 1 ) "*" "*"
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
## poly(x, 10, raw = TRUE)3 poly(x, 10, raw = TRUE)4
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) " " " "
## 7 ( 1 ) " " " "
## 8 ( 1 ) " " " "
## poly(x, 10, raw = TRUE)5 poly(x, 10, raw = TRUE)6
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) "*" " "
## 4 ( 1 ) "*" " "
## 5 ( 1 ) "*" " "
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
## poly(x, 10, raw = TRUE)7 poly(x, 10, raw = TRUE)8
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) "*" " "
## 5 ( 1 ) "*" " "
## 6 ( 1 ) "*" " "
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
## poly(x, 10, raw = TRUE)9 poly(x, 10, raw = TRUE)10
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) "*" " "
## 6 ( 1 ) "*" " "
## 7 ( 1 ) "*" " "
## 8 ( 1 ) "*" "*"
Here, we apply both forward and backward stepwise selection and obtain the best model of each size. The selected models can then be compared with those from best subset selection; note that the search strategies do not always pick the same terms. For example, the three-variable model from forward selection contains \(X, X^2, X^3\), while backward selection keeps \(X, X^2, X^5\).
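To make the comparison concrete, one option (a sketch using the fits above, not part of the original output) is to extract the coefficients of the BIC-optimal model from each search strategy:
# BIC-optimal model from each search strategy
coef(regfit_full, which.min(summary(regfit_full)$bic))
coef(regfit_fwd, which.min(summary(regfit_fwd)$bic))
coef(regfit_bwd, which.min(summary(regfit_bwd)$bic))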
Next, we fit a lasso model to the same simulated data, again using \(X, X^2, \ldots, X^{10}\) as predictors.
library(glmnet)  # glmnet() / cv.glmnet() for the lasso
X_matrix <- model.matrix(y ~ poly(x, 10, raw = TRUE), data = dat)[, -1]
lasso_cv <- cv.glmnet(X_matrix, dat$y, alpha = 1)
best_lambda <- lasso_cv$lambda.min
# Plot cross-validation error as a function of lambda
plot(lasso_cv)
# Fitting the final lasso model
lasso_fit <- glmnet(X_matrix, dat$y, alpha = 1, lambda = best_lambda)
predict(lasso_fit, type = "coefficients", s = best_lambda)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 1.8457308
## poly(x, 10, raw = TRUE)1 2.9092918
## poly(x, 10, raw = TRUE)2 -1.9287428
## poly(x, 10, raw = TRUE)3 0.5161012
## poly(x, 10, raw = TRUE)4 .
## poly(x, 10, raw = TRUE)5 .
## poly(x, 10, raw = TRUE)6 .
## poly(x, 10, raw = TRUE)7 .
## poly(x, 10, raw = TRUE)8 .
## poly(x, 10, raw = TRUE)9 .
## poly(x, 10, raw = TRUE)10 .
We perform lasso regression, using cross-validation to choose the optimal \(\lambda\). The coefficients of the resulting model are shown above: the lasso keeps \(X\), \(X^2\), and \(X^3\) with estimates close to the true values and shrinks the coefficients of \(X^4\) through \(X^{10}\) exactly to zero, recovering the true cubic model.
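An alternative, more conservative choice is the one-standard-error rule; a minimal sketch using the cross-validation object fitted above:
# Coefficients at the largest lambda within one standard error of the minimum CV error
coef(lasso_cv, s = "lambda.1se")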
Now we generate a new response vector according to the model \[Y = \beta_0 + \beta_7X^7 + \epsilon,\] and perform best subset selection and the lasso.
# New response vector based on X^7 (beta0 = 2, beta7 = 0.2)
y_new <- 2 + 0.2 * x^7 + ep
dat$y <- y_new
# Best subset selection for the new model
regfit_new <- regsubsets(y ~ poly(x, 10, raw = TRUE), data = dat)
summary(regfit_new)
## Subset selection object
## Call: regsubsets.formula(y ~ poly(x, 10, raw = TRUE), data = dat)
## 10 Variables (and intercept)
## Forced in Forced out
## poly(x, 10, raw = TRUE)1 FALSE FALSE
## poly(x, 10, raw = TRUE)2 FALSE FALSE
## poly(x, 10, raw = TRUE)3 FALSE FALSE
## poly(x, 10, raw = TRUE)4 FALSE FALSE
## poly(x, 10, raw = TRUE)5 FALSE FALSE
## poly(x, 10, raw = TRUE)6 FALSE FALSE
## poly(x, 10, raw = TRUE)7 FALSE FALSE
## poly(x, 10, raw = TRUE)8 FALSE FALSE
## poly(x, 10, raw = TRUE)9 FALSE FALSE
## poly(x, 10, raw = TRUE)10 FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## poly(x, 10, raw = TRUE)1 poly(x, 10, raw = TRUE)2
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) " " "*"
## 7 ( 1 ) " " " "
## 8 ( 1 ) " " "*"
## poly(x, 10, raw = TRUE)3 poly(x, 10, raw = TRUE)4
## 1 ( 1 ) " " " "
## 2 ( 1 ) "*" " "
## 3 ( 1 ) "*" " "
## 4 ( 1 ) "*" "*"
## 5 ( 1 ) "*" " "
## 6 ( 1 ) "*" " "
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
## poly(x, 10, raw = TRUE)5 poly(x, 10, raw = TRUE)6
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) "*" " "
## 4 ( 1 ) "*" " "
## 5 ( 1 ) "*" "*"
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
## poly(x, 10, raw = TRUE)7 poly(x, 10, raw = TRUE)8
## 1 ( 1 ) "*" " "
## 2 ( 1 ) "*" " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) " " "*"
## 7 ( 1 ) " " "*"
## 8 ( 1 ) " " "*"
## poly(x, 10, raw = TRUE)9 poly(x, 10, raw = TRUE)10
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) "*" " "
## 4 ( 1 ) "*" " "
## 5 ( 1 ) "*" "*"
## 6 ( 1 ) "*" " "
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
# Lasso for the new model
X_matrix_new <- model.matrix(y ~ poly(x, 10, raw = TRUE), data = dat)[, -1]
lasso_cv_new <- cv.glmnet(X_matrix_new, dat$y, alpha = 1)
best_lambda_new <- lasso_cv_new$lambda.min
plot(lasso_cv_new)
# Fitting the lasso model
lasso_fit_new <- glmnet(X_matrix_new, dat$y, alpha = 1, lambda = best_lambda_new)
predict(lasso_fit_new, type = "coefficients", s = best_lambda_new)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 1.7236801684
## poly(x, 10, raw = TRUE)1 .
## poly(x, 10, raw = TRUE)2 .
## poly(x, 10, raw = TRUE)3 .
## poly(x, 10, raw = TRUE)4 .
## poly(x, 10, raw = TRUE)5 0.0455429629
## poly(x, 10, raw = TRUE)6 .
## poly(x, 10, raw = TRUE)7 0.1820763243
## poly(x, 10, raw = TRUE)8 .
## poly(x, 10, raw = TRUE)9 0.0008828036
## poly(x, 10, raw = TRUE)10 .
Here, we simulate data in which \(X^7\) is the only important predictor and then apply best subset selection and the lasso. Best subset selection picks \(X^7\) alone for the one-variable model, but its larger models add spurious higher-order terms; the lasso assigns \(X^7\) the dominant coefficient while retaining small non-zero coefficients on \(X^5\) and \(X^9\) and shrinking everything else to zero.
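As a concrete check (a sketch using the fit above), the coefficients of the one-variable model from best subset selection confirm that only the \(X^7\) term is selected:
# One-variable best subset model: contains only the X^7 term
coef(regfit_new, 1)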
In this exercise, we aim to predict the number of applications received by various colleges in the College dataset using different modeling techniques, including linear regression, ridge regression, the lasso, principal component regression (PCR), and partial least squares (PLS). Let's walk through each part of the problem step by step, along with some additional insights to better understand the results.
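The code in this exercise assumes the following packages are available (a minimal setup sketch; they may already be loaded elsewhere in the document):
library(ISLR)    # College data set (also available in ISLR2)
library(glmnet)  # ridge regression and the lasso
library(pls)     # pcr() and plsr()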
We start by splitting the College data into a training set and a test set. A random seed is set to ensure the results are reproducible.
set.seed(42)
train <- sample(nrow(College), nrow(College) * 2 / 3)
test <- setdiff(seq_len(nrow(College)), train)
mse <- list() # To store test mean squared errors for comparison
Next, we fit a linear model using ordinary least squares (OLS) on the training set and compute the test mean squared error (MSE) to evaluate its predictive performance.
fit <- lm(Apps ~ ., data = College[train, ])
(mse$lm <- mean((predict(fit, College[test, ]) - College$Apps[test])^2))
## [1] 1695269
The test MSE gives us an indication of how well the linear model predicts the number of college applications. Linear regression serves as a baseline model for comparison.
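To put this error on the scale of the response (a quick sketch, not part of the original output), we can take the square root of the test MSE:
# Root-mean-squared test error, in units of applications received
sqrt(mse$lm)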
We fit a ridge regression model, where \(\lambda\) (the regularization parameter) is chosen through cross-validation. Ridge regression shrinks the coefficients and is particularly useful when multicollinearity is present.
mm <- model.matrix(Apps ~ ., data = College[train, ])
fit2 <- cv.glmnet(mm, College$Apps[train], alpha = 0)
p <- predict(fit2, model.matrix(Apps ~ ., data = College[test, ]), s = fit2$lambda.min)
(mse$ridge <- mean((p - College$Apps[test])^2))
## [1] 2804369
By cross-validating for λ, we select the best regularization strength, which helps reduce overfitting. Ridge regression can be beneficial in high-dimensional settings.
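If desired, we can also inspect the selected \(\lambda\) and the cross-validation curve for the ridge fit (a sketch using the objects above):
# Chosen regularization strength and CV error as a function of log(lambda)
fit2$lambda.min
plot(fit2)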
Next, we fit a lasso model, another form of regularized regression; unlike ridge, it can perform feature selection by forcing some coefficients exactly to zero. Again, \(\lambda\) is chosen through cross-validation.
mm <- model.matrix(Apps ~ ., data = College[train, ])
fit3 <- cv.glmnet(mm, College$Apps[train], alpha = 1)
p <- predict(fit3, model.matrix(Apps ~ ., data = College[test, ]), s = fit3$lambda.min)
(mse$lasso <- mean((p - College$Apps[test])^2))
## [1] 1822322
Lasso also offers insights into which predictors are more influential by zeroing out less important ones. We additionally check how many non-zero coefficients remain in the model.
sum(coef(fit3, s = fit3$lambda.min) != 0)
## [1] 15
This tells us how many features were retained after regularization, which is important for model interpretability.
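To see exactly which predictors survive (a sketch using the cross-validated fit above), we can list the names of the non-zero coefficients:
# Names of the non-zero lasso coefficients (intercept included)
co <- as.matrix(coef(fit3, s = fit3$lambda.min))
rownames(co)[co != 0]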
Principal component regression (PCR) uses principal components of the predictors as regressors. The number of components \(M\) is chosen through cross-validation. PCR is particularly helpful when there is multicollinearity or a need for dimensionality reduction.
fit4 <- pcr(Apps ~ ., data = College[train, ], scale = TRUE, validation = "CV")
validationplot(fit4, val.type = "MSEP") # Plot to select M
p <- predict(fit4, College[test, ], ncomp = 17)
(mse$pcr <- mean((p - College$Apps[test])^2))
## [1] 1695269
The plot of MSEP (mean squared error of prediction) helps us determine the number of principal components \(M\) that minimizes the test error. Note that with \(M = 17\) we are using all of the principal components, so PCR reduces to ordinary least squares, which is why its test MSE matches the linear model exactly.
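Rather than reading \(M\) off the plot, we can also extract it programmatically; a sketch, noting that the first entry of the MSEP array corresponds to the intercept-only model:
# Number of components that minimizes the cross-validated MSEP
cv_msep <- MSEP(fit4, estimate = "CV")$val
which.min(cv_msep) - 1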
Partial least squares (PLS) regression is similar to PCR, but its components are chosen to explain variance in both the predictors and the response. As with PCR, the number of components \(M\) is selected through cross-validation.
fit5 <- plsr(Apps ~ ., data = College[train, ], scale = TRUE, validation = "CV")
validationplot(fit5, val.type = "MSEP")
p <- predict(fit5, College[test, ], ncomp = 12)
(mse$pls <- mean((p - College$Apps[test])^2))
## [1] 1696902
PLS often performs better than PCR because it constructs its components with the response in mind; here, however, its test MSE is essentially the same as PCR's.
We now compare the test MSE from all five models using a bar plot for better visualization.
barplot(unlist(mse), ylab = "Test MSE", horiz = TRUE)
The overall results show that the methods perform broadly similarly on this data set: ordinary least squares and PCR with all 17 components (equivalent to OLS) achieve the lowest test MSE, PLS and the lasso are close behind, and ridge regression performs noticeably worse here. Because the number of observations is much larger than the number of predictors, heavy regularization offers little benefit, although the lasso still yields a somewhat more interpretable model by zeroing out a few coefficients.
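To make the ranking explicit (a small sketch using the mse list built above), we can sort the stored test errors:
# Test MSEs sorted from best to worst
sort(unlist(mse))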