set.seed(1)
n <- 100
X <- rnorm(n)
epsilon <- rnorm(n)
\[ Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon \]
# true coefficients for the cubic model shown above
beta_0 <- 1
beta_1 <- 2
beta_2 <- 3
beta_3 <- 4
# generate the response from the true model plus noise
Y <- beta_0 + beta_1*X + beta_2*X^2 + beta_3*X^3 + epsilon
library(leaps)
data <- data.frame(Y = Y, X = X)
# add polynomial terms X^2 through X^10 as additional predictors
for (i in 2:10) {
  data[[paste0("X", i)]] <- X^i
}
best_fit <- regsubsets(Y ~ ., data = data, nvmax = 10)
best_summary <- summary(best_fit)
par(mfrow = c(1, 3))
plot(best_summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "b")
which.min(best_summary$cp)
## [1] 4
points(which.min(best_summary$cp), best_summary$cp[which.min(best_summary$cp)], col = "red", cex = 2, pch = 20)
plot(best_summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "b")
which.min(best_summary$bic)
## [1] 3
points(which.min(best_summary$bic), best_summary$bic[which.min(best_summary$bic)], col = "red", cex = 2, pch = 20)
plot(best_summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R2", type = "b")
which.max(best_summary$adjr2)
## [1] 4
points(which.max(best_summary$adjr2), best_summary$adjr2[which.max(best_summary$adjr2)], col = "red", cex = 2, pch = 20)
best_summary$cp
## [1] 1123.2892318 109.3256041 2.1859433 0.6067483 2.1782005
## [6] 3.9955812 5.7869063 7.1694092 9.1535580 11.0000000
best_summary$bic
## [1] -262.7744 -437.2907 -509.6393 -508.9084 -504.7773 -500.3748 -496.0018
## [8] -492.0868 -487.4994 -483.0666
best_summary$adjr2
## [1] 0.9334429 0.9887867 0.9947516 0.9948979 0.9948680 0.9948233 0.9947792
## [8] 0.9947581 0.9947008 0.9946505
which.min(best_summary$cp)
## [1] 4
which.min(best_summary$bic)
## [1] 3
which.max(best_summary$adjr2)
## [1] 4
Interpretation:
Best model by Cp: 4 predictors. Cp estimates test error by penalizing model complexity; a smaller value indicates a better model, and the 4-variable model achieves the minimum Cp (about 0.61).
Best model by BIC: 3 predictors. BIC penalizes model complexity more heavily than Cp (see the formulas below), so it prefers the simpler 3-variable model.
Best model by Adjusted R²: 4 predictors. Adjusted R² corrects the ordinary R² for the number of predictors; higher is better, and the maximum occurs at 4 predictors, indicating a good balance of fit and complexity.
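For reference, one common way to write the two criteria for a model with d predictors is shown below (normalization constants vary by convention; σ̂² is an estimate of the error variance):
\[ C_p = \frac{1}{n}\left(\mathrm{RSS} + 2\,d\,\hat{\sigma}^2\right), \qquad \mathrm{BIC} = \frac{1}{n}\left(\mathrm{RSS} + \log(n)\,d\,\hat{\sigma}^2\right) \]
Since log(n) > 2 whenever n > 7, each additional predictor is penalized more heavily by BIC, which is why BIC tends to favor smaller models.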
Based on best subset selection:
The model with 4 predictors was selected as best by Cp and Adjusted R², while the model with 3 predictors was selected as best by BIC. This suggests that either a 3- or 4-variable model is optimal, depending on the trade-off between accuracy and simplicity. Since the data were simulated in part (b) from a model using X, X2, and X3, the 3-predictor model matches the true model, and BIC correctly identified it.
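As a quick sanity check (not run above, so the numbers are not reproduced here), the coefficients of the BIC-selected 3-variable model can be pulled directly from the regsubsets fit; they should land close to the true values β0 = 1, β1 = 2, β2 = 3, β3 = 4:
# estimated coefficients of the best 3-variable model from best subset selection
coef(best_fit, id = 3)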
forward_fit <- regsubsets(Y ~ ., data = data, nvmax = 10, method = "forward")
backward_fit <- regsubsets(Y ~ ., data = data, nvmax = 10, method = "backward")
summary(forward_fit)$which[which.min(summary(forward_fit)$bic), ]
## (Intercept) X X2 X3 X4 X5
## TRUE TRUE TRUE TRUE FALSE FALSE
## X6 X7 X8 X9 X10
## FALSE FALSE FALSE FALSE FALSE
summary(backward_fit)$which[which.min(summary(backward_fit)$bic), ]
## (Intercept) X X2 X3 X4 X5
## TRUE TRUE TRUE TRUE FALSE FALSE
## X6 X7 X8 X9 X10
## FALSE FALSE FALSE FALSE FALSE
In part (d), I performed both forward and backward stepwise selection to identify the best-fitting model using the same set of predictors as in part (c). Both methods selected a 3-variable model that included the predictors X, X2, and X3.
This result aligns with the model selected by BIC in part (c), which also favored a 3-variable model. In contrast, Cp and Adjusted R² selected a slightly more complex model with 4 predictors, adding one higher-order term beyond X, X2, and X3.
Since the data were simulated from the cubic model in part (b), which involves only X, X2, and X3, the 3-variable model is the true underlying model. Therefore, the models chosen by BIC, forward stepwise, and backward stepwise selection recover the correct structure, while Cp and Adjusted R² slightly overfit by including an unnecessary predictor.
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
x_matrix <- model.matrix(Y ~ ., data = data)[, -1]
lasso_fit <- cv.glmnet(x_matrix, Y, alpha = 1)
plot(lasso_fit)
lasso_fit$lambda.min
## [1] 0.05794679
coef(lasso_fit, s = "lambda.min")
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 1.168794337
## X 2.164793590
## X2 2.639485133
## X3 3.800683773
## X4 0.041512567
## X5 0.014068421
## X6 .
## X7 0.004039751
## X8 .
## X9 .
## X10 .
In part (e), I fit a lasso regression model using the predictors X, X2, ..., X10 and selected the tuning parameter λ via cross-validation. The optimal value of λ was approximately 0.058. At this value, the lasso retained X, X2, and X3 with relatively large coefficients, closely matching the true model used to simulate the data. Several higher-order terms (X6, X8, X9, and X10) had their coefficients shrunk to exactly zero, so they were effectively excluded from the model. A few terms (X4, X5, and X7) received small but non-zero coefficients, likely due to correlations among the polynomial terms and random noise in the data. Overall, the lasso identified the most important variables while regularizing or eliminating the irrelevant ones, supporting its usefulness as a model selection tool in this context.
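A common, more conservative alternative to lambda.min in cv.glmnet is lambda.1se, the largest λ whose CV error is within one standard error of the minimum; it would typically shrink the small nuisance coefficients to zero as well. A quick sketch (output not shown):
lasso_fit$lambda.1se                  # larger penalty than lambda.min
coef(lasso_fit, s = "lambda.1se")     # usually a sparser set of coefficients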
\[ Y = \beta_0 + \beta_7 X^7 + \epsilon \]
Y_new <- 1 + 7 * X^7 + rnorm(n)
data_new <- data
data_new$Y <- Y_new
best_fit_new <- regsubsets(Y ~ ., data = data_new, nvmax = 10)
summary(best_fit_new)$which[which.min(summary(best_fit_new)$bic), ]
## (Intercept) X X2 X3 X4 X5
## TRUE FALSE FALSE FALSE FALSE FALSE
## X6 X7 X8 X9 X10
## FALSE TRUE FALSE FALSE FALSE
x_matrix_new <- model.matrix(Y ~ ., data = data_new)[, -1]
lasso_new <- cv.glmnet(x_matrix_new, Y_new, alpha = 1)
plot(lasso_new)
coef(lasso_new, s = "lambda.min")
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 1.897244
## X .
## X2 .
## X3 .
## X4 .
## X5 .
## X6 .
## X7 6.795724
## X8 .
## X9 .
## X10 .
Best Subset Selection Output:
Selected variables: the intercept and X7, the only true predictor in the model. All other predictors (X, X2, X3, etc.) were excluded, which is exactly what we want.
Lasso Output: At the value of λ selected via cross-validation, the coefficient for X7 was approximately 6.80, and all other predictors had their coefficients shrunk to zero.
Both best subset selection and lasso correctly identified X7 as the only relevant predictor in the model. This demonstrates that:
Best subset selection is capable of selecting the true model when the signal is strong and well isolated, as is the case with X7. The lasso shrank all irrelevant coefficients to zero and retained only the true signal (X7), demonstrating strong sparsity recovery. This result is a clear success: both methods accurately recover the structure of a sparse model. It also highlights the strength of the lasso in settings where irrelevant variables outnumber the true predictors.
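As with part (c), the estimated coefficients of the BIC-selected model can be inspected directly; they should be close to the true values β0 = 1 and β7 = 7 (output not shown):
# coefficients of the best-subset model chosen by BIC for the new response
coef(best_fit_new, id = which.min(summary(best_fit_new)$bic))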
library(ISLR2)
library(tidyverse)
data("College")
set.seed(1)
train_indices <- sample(1:nrow(College), nrow(College) * 0.7)
train <- College[train_indices, ]
test <- College[-train_indices, ]
lm.fit <- lm(Apps ~ ., data = train)
lm.pred <- predict(lm.fit, newdata = test)
lm.mse <- mean((lm.pred - test$Apps)^2)
lm.mse
## [1] 1261630
The test MSE for the least squares linear model is 1,261,630.
library(glmnet)
x_train <- model.matrix(Apps ~ ., data = train)[, -1]
x_test <- model.matrix(Apps ~ ., data = test)[, -1]
y_train <- train$Apps
y_test <- test$Apps
set.seed(1)
ridge.cv <- cv.glmnet(x_train, y_train, alpha = 0)
ridge.pred <- predict(ridge.cv, s = ridge.cv$lambda.min, newx = x_test)
ridge.mse <- mean((ridge.pred - y_test)^2)
ridge.mse
## [1] 1121034
The ridge regression test MSE is 1,121,034, a modest improvement over the least squares fit.
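Unlike the lasso below, ridge regression shrinks coefficients toward zero but does not set any of them exactly to zero, so every predictor stays in the model. A quick check (output not shown):
ridge.coef <- predict(ridge.cv, s = ridge.cv$lambda.min, type = "coefficients")
sum(ridge.coef != 0)   # expected to count the intercept plus all 17 predictors, since ridge produces no exact zeros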
set.seed(1)
lasso.cv <- cv.glmnet(x_train, y_train, alpha = 1)
lasso.pred <- predict(lasso.cv, s = lasso.cv$lambda.min, newx = x_test)
lasso.mse <- mean((lasso.pred - y_test)^2)
lasso.mse
## [1] 1233246
lasso.coef <- predict(lasso.cv, s = lasso.cv$lambda.min, type = "coefficients")
sum(lasso.coef != 0)
## [1] 15
The test MSE is 1,233,246, and 15 of the coefficient estimates are non-zero. That count includes the intercept, so 14 of the 17 predictors remain in the model.
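To see which predictors the lasso dropped, the zeroed rows of the coefficient matrix can be listed (output not shown):
lasso.coef.dense <- as.matrix(lasso.coef)                # convert the sparse coefficient matrix to a plain matrix
rownames(lasso.coef.dense)[lasso.coef.dense[, 1] == 0]   # predictors whose coefficients were shrunk exactly to zero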
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
set.seed(1)
pcr.fit <- pcr(Apps ~ ., data = train, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data: X dimension: 543 17
## Y dimension: 543 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3895 3813 2129 2139 1798 1715 1709
## adjCV 3895 3814 2125 2136 1785 1703 1703
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1711 1656 1620 1615 1621 1621 1618
## adjCV 1705 1645 1614 1608 1615 1615 1612
## 14 comps 15 comps 16 comps 17 comps
## CV 1618 1546 1177 1138
## adjCV 1612 1529 1168 1130
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.051 57.00 64.42 70.27 75.65 80.65 84.26 87.61
## Apps 5.788 71.69 71.70 80.97 82.60 82.60 82.69 84.06
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.58 92.84 94.93 96.74 97.82 98.72 99.39
## Apps 84.55 84.82 84.86 84.86 85.01 85.05 89.81
## 16 comps 17 comps
## X 99.85 100.00
## Apps 93.03 93.32
validationplot(pcr.fit, val.type = "MSEP")
pcr.pred <- predict(pcr.fit, test, ncomp = 17)
pcr.mse <- mean((pcr.pred - test$Apps)^2)
pcr.mse
## [1] 1261630
Based on 10-fold cross-validation, the CV error was minimized at 17 components, the maximum number available. With all 17 components, PCR uses every principal component, which spans the same space as the original predictors, so the fit is equivalent to ordinary least squares; the test MSE of 1,261,630 is therefore identical to the linear regression error from part (b). In other words, PCR provided no dimension reduction benefit here, because the predictive information is spread across many components rather than concentrated in the first few.
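For completeness, the CV-minimizing number of components can be pulled out of the fit programmatically rather than read off the summary table. A sketch, assuming a single response (the first entry of the returned array is the intercept-only model, hence the minus one):
cv.rmsep <- RMSEP(pcr.fit, estimate = "CV")$val   # cross-validated RMSEP for 0 through 17 components
which.min(cv.rmsep) - 1                           # number of components with the lowest CV RMSEP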
set.seed(1)
pls.fit <- plsr(Apps ~ ., data = train, scale = TRUE, validation = "CV")
summary(pls.fit)
## Data: X dimension: 543 17
## Y dimension: 543 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3895 1952 1743 1540 1451 1258 1171
## adjCV 3895 1947 1737 1532 1430 1231 1161
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1153 1147 1144 1141 1142 1141 1139
## adjCV 1145 1139 1136 1133 1134 1133 1131
## 14 comps 15 comps 16 comps 17 comps
## CV 1138 1138 1138 1138
## adjCV 1130 1130 1130 1130
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 25.68 47.43 62.46 64.88 67.34 72.68 77.20 80.92
## Apps 76.62 82.39 86.93 90.76 92.82 93.05 93.13 93.20
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 82.69 85.16 87.35 90.73 92.49 95.10 97.09
## Apps 93.26 93.28 93.30 93.31 93.32 93.32 93.32
## 16 comps 17 comps
## X 98.40 100.00
## Apps 93.32 93.32
validationplot(pls.fit, val.type = "MSEP")
pls.pred <- predict(pls.fit, test, ncomp = 15)
pls.mse <- mean((pls.pred - test$Apps)^2)
pls.mse
## [1] 1261576
In part (f), I fit a partial least squares (PLS) regression model on the training data. The 10-fold cross-validated RMSEP drops quickly and is essentially flat beyond about 12 components, so 15 components were used for prediction. With this choice, the test mean squared error (MSE) was 1,261,576, marginally lower than the errors from linear regression and PCR. The model explained over 93% of the variance in the response with fewer components than PCR required, making it more efficient in terms of dimensionality. This indicates that PLS captures the predictive structure in the data while remaining relatively parsimonious.
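Because the CV error is essentially flat beyond roughly 12 components, it is worth confirming that the test error behaves the same way. A quick sketch that computes the test MSE for every component count (values not shown):
# test MSE of the PLS fit for 1 through 17 components
pls.mse.by.ncomp <- sapply(1:17, function(m) mean((predict(pls.fit, test, ncomp = m) - test$Apps)^2))
round(pls.mse.by.ncomp)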
Across the five modeling approaches (linear regression, ridge regression, lasso, principal components regression, and partial least squares), performance in predicting the number of college applications received was broadly similar. Four of the models yielded test mean squared errors (MSE) between roughly 1.23 and 1.27 million, while ridge regression came in somewhat lower at about 1.12 million.
The least squares linear model had a test MSE of 1,261,630, which serves as the baseline. Ridge regression achieved the lowest test error (1,121,034), and the lasso (1,233,246) performed comparably to the baseline while also identifying and retaining a subset of the most relevant predictors; its sparsity reduces model complexity without sacrificing accuracy.
For the dimension reduction methods, PCR required all 17 components to match the performance of the full linear model, resulting in an identical MSE of 1,261,630. PLS, on the other hand, achieved the same level of accuracy using only 15 components, with a slightly lower test MSE of 1,261,576. This suggests that PLS was marginally more efficient in capturing the predictive signal from the data.
Overall, all of the models predicted the number of college applications reasonably well and with broadly similar accuracy; a test R² comparison is sketched below. The small differences in test error suggest that the relationship between the predictors and the response is strongly linear and is captured well by every technique. Ridge regression gave the largest, though still modest, improvement in test error, while the lasso and PLS offered advantages in model simplicity and dimensionality reduction.
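One way to make "reasonably well" concrete is to convert each test MSE into a test R², the fraction of variance in the held-out Apps values that each model explains. A sketch using the objects fit above (values not shown):
test.tss <- mean((test$Apps - mean(test$Apps))^2)     # variance of Apps in the test set
1 - c(lm = lm.mse, ridge = ridge.mse, lasso = lasso.mse,
      pcr = pcr.mse, pls = pls.mse) / test.tss        # test R^2 for each model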