set.seed(1) # for reproducibility
n <- 100
X <- rnorm(n)
epsilon <- rnorm(n)
beta0 <- 2
beta1 <- 3
beta2 <- -1
beta3 <- 0.5
Y <- beta0 + beta1*X + beta2*X^2 + beta3*X^3 + epsilon
library(leaps)
# Creating polynomial predictors X, X^2, ..., X^10:
data <- data.frame(Y = Y, X = X)
for (i in 2:10) {
  data[[paste0("X", i)]] <- X^i
}
# Best subset selection:
best_fit <- regsubsets(Y ~ ., data = data, nvmax = 10)
summary_best <- summary(best_fit)
# Visualizations:
par(mfrow = c(1, 3))
plot(summary_best$cp, xlab = "Number of Variables", ylab = "Cp", type = "b", main="Cp")
which.min(summary_best$cp)
## [1] 3
plot(summary_best$bic, xlab = "Number of Variables", ylab = "BIC", type = "b", main="BIC")
which.min(summary_best$bic)
## [1] 3
plot(summary_best$adjr2, xlab = "Number of Variables", ylab = "Adjusted R2", type = "b", main="Adjusted R²")
which.max(summary_best$adjr2)
## [1] 3
# Coefficients of best model:
coef(best_fit, which.max(summary_best$adjr2))
## (Intercept) X X2 X5
## 2.07219472 3.44514720 -1.15676236 0.09022577
From the output above, all three criteria (Cp, BIC, and Adjusted R2) agree that the best model contains three predictors: X, X2, and X5. The three plots confirm this, since a lower Cp, a lower BIC, and a higher Adjusted R2 all point to the same three-variable model. The fitted model is therefore:
Y = 2.072 + 3.445X - 1.157X2 + 0.090X5 + ε
where β0 = 2.072, β1 = 3.445, β2 = -1.157, and β5 = 0.090.
Note that the procedure selects X5 rather than the true predictor X3. This is most likely due to the strong multicollinearity among the polynomial predictors, which lets X5 act as a near-substitute for X3.
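One way to see this collinearity directly is to look at the correlations among the odd powers of X; the short sketch below uses the data frame and summary_best object created above.
# Correlations among X, X^3, and X^5; the odd powers are strongly correlated,
# so X^5 can stand in for X^3 in the selected model.
round(cor(data[, c("X", "X3", "X5")]), 3)
# Variables included in the three-predictor model chosen by best subset selection:
summary_best$which[3, ]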
# Forward Selection:
regfit.fwd <- regsubsets(Y ~ ., data = data, nvmax = 10, method = "forward")
reg.summary.fwd <- summary(regfit.fwd)
best_model_index_fwd <- which.min(reg.summary.fwd$bic)
coef_fwd <- coef(regfit.fwd, best_model_index_fwd)
cat("\nForward Stepwise Selection Coefficients (based on BIC):\n")
##
## Forward Stepwise Selection Coefficients (based on BIC):
print(coef_fwd)
## (Intercept) X X2 X5
## 2.07219472 3.44514720 -1.15676236 0.09022577
# Backward Selection:
regfit.bwd <- regsubsets(Y ~ ., data = data, nvmax = 10, method = "backward")
reg.summary.bwd <- summary(regfit.bwd)
best_model_index_bwd <- which.min(reg.summary.bwd$bic)
coef_bwd <- coef(regfit.bwd, best_model_index_bwd)
cat("\nBackward Stepwise Selection Coefficients (based on BIC):\n")
##
## Backward Stepwise Selection Coefficients (based on BIC):
print(coef_bwd)
## (Intercept) X X2 X5
## 2.07219472 3.44514720 -1.15676236 0.09022577
Both forward and backward stepwise selection choose the same three predictors (X, X2, and X5) and yield coefficients identical to those from best subset selection.
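As a quick sanity check (a sketch using the objects created above), the selected variable names can be compared directly across the three procedures:
# All three methods should list the same predictors: X, X2 and X5.
names(coef(best_fit, which.min(summary_best$bic)))
names(coef_fwd)
names(coef_bwd)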
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
x_matrix <- model.matrix(Y ~ . - 1, data = data)
cv.out <- cv.glmnet(x_matrix, Y, alpha = 1) # Lasso (alpha=1)
plot(cv.out)
best_lambda <- cv.out$lambda.min
cat("\nOptimal Lambda (min CV error): ", best_lambda, "\n")
##
## Optimal Lambda (min CV error): 0.03458424
lasso_coef <- coef(cv.out, s = "lambda.min")
cat("Lasso Coefficients at Optimal Lambda:\n")
## Lasso Coefficients at Optimal Lambda:
print(lasso_coef)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 2.04367894
## X 3.29157840
## X2 -1.11065890
## X3 0.13441562
## X4 .
## X5 0.06526751
## X6 .
## X7 .
## X8 .
## X9 .
## X10 .
From the results above, cross-validation selects λ(min) ≈ 0.035, at which four polynomial terms remain nonzero: X, X2, X3, and X5.
The fitted model therefore involves powers of X up to the fifth, excluding X4; all remaining terms are shrunk exactly to zero. The coefficient estimates differ somewhat from the true parameters used to simulate the data, which likely reflects noise and the collinearity among the polynomial terms.
Overall, the lasso produces a sparse model (only 4 of the 10 polynomial terms) while keeping the estimated out-of-sample prediction error low.
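A natural follow-up, sketched below with the cv.out object from above, is to inspect the more conservative lambda.1se choice, which often yields an even sparser fit:
# Coefficients at the largest lambda within one standard error of the minimum
# CV error; this rule typically shrinks additional small terms to zero.
coef(cv.out, s = "lambda.1se")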
# Generating a new response according to: Y_new = beta0 + beta7 * X^7 + noise:
beta7 <- 4
Y_new <- beta0 + beta7 * (X^7) + epsilon
# Creating a new data frame (including X, X^2, ..., X^10):
data_new <- data.frame(Y = Y_new, X = X)
for (j in 2:10) {
  data_new[[paste0("X", j)]] <- X^j
}
# Best subset selection on the new data:
regfit.new <- regsubsets(Y ~ ., data = data_new, nvmax = 10)
reg.summary.new <- summary(regfit.new)
best_model_index_new <- which.min(reg.summary.new$bic)
coef_new <- coef(regfit.new, best_model_index_new)
cat("\nBest Subset Selection Coefficients for Model Y = beta0 + beta7*X7 + noise:\n")
##
## Best Subset Selection Coefficients for Model Y = beta0 + beta7*X7 + noise:
print(coef_new)
## (Intercept) X7
## 1.95894 4.00077
# Lasso on the new data:
x_matrix_new <- model.matrix(Y ~ . - 1, data = data_new)
cv.out.new <- cv.glmnet(x_matrix_new, Y_new, alpha = 1)
plot(cv.out.new)
best_lambda_new <- cv.out.new$lambda.min
cat("\nOptimal Lambda for New Data: ", best_lambda_new, "\n")
##
## Optimal Lambda for New Data: 7.068491
lasso_coef_new <- coef(cv.out.new, s = "lambda.min")
cat("Lasso Coefficients for New Model at Optimal Lambda:\n")
## Lasso Coefficients for New Model at Optimal Lambda:
print(lasso_coef_new)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 2.451138
## X .
## X2 .
## X3 .
## X4 .
## X5 .
## X6 .
## X7 3.884146
## X8 .
## X9 .
## X10 .
From the results above, best subset selection and the lasso give similar answers. Both produce intercepts reasonably close to the true value of 2 (about 1.96 for best subset selection and 2.45 for the lasso).
For the coefficient of X7, best subset selection gives roughly 4.00, essentially identical to the true value of 4, while the lasso gives 3.88, slightly below it. This gap arises because best subset selection fits the chosen model without any penalty, whereas the lasso shrinks its coefficient estimates toward zero.
Overall, both methods identify X7 as the only influential predictor and drop the remaining terms, closely matching the true model specified above.
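To illustrate the shrinkage effect, one can refit ordinary least squares on just the predictor selected by the lasso (a sketch using data_new from above); without a penalty, the X7 estimate should move back toward the best-subset value:
# Unpenalized refit on the lasso-selected predictor only.
refit <- lm(Y ~ X7, data = data_new)
coef(refit)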
library(ISLR)
set.seed(1)
train_index <- sample(1:nrow(College), round(0.7 * nrow(College)))
College.train <- College[train_index, ]
College.test <- College[-train_index, ]
# Fitting linear model on training data:
lm.fit <- lm(Apps ~ ., data = College.train)
# Predicting on test set:
lm.pred <- predict(lm.fit, newdata = College.test)
# Computing test mean squared error:
lm.mse <- mean((lm.pred - College.test$Apps)^2)
cat("Test MSE (Least Squares):", lm.mse, "\n")
## Test MSE (Least Squares): 1266407
The test MSE of 1,266,407 serves as the baseline against which the regularized and dimension-reduction methods below are compared.
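For scale, the test MSE can be converted to a root mean squared error in the original units (numbers of applications); a quick sketch using lm.mse from above:
# RMSE is on the same scale as Apps, making the error easier to interpret.
sqrt(lm.mse)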
library(glmnet)
# Preparing matrices:
x.train <- model.matrix(Apps ~ ., data = College.train)[, -1] # remove intercept column
y.train <- College.train$Apps
x.test <- model.matrix(Apps ~ ., data = College.test)[, -1]
y.test <- College.test$Apps
# Performing cross-validation for ridge regression (alpha=0):
cv.ridge <- cv.glmnet(x.train, y.train, alpha = 0)
best.lambda.ridge <- cv.ridge$lambda.min
cat("Optimal lambda (ridge):", best.lambda.ridge, "\n")
## Optimal lambda (ridge): 367.2207
# Making predictions on test set:
ridge.pred <- predict(cv.ridge, s = best.lambda.ridge, newx = x.test)
ridge.mse <- mean((ridge.pred - y.test)^2)
cat("Test MSE (Ridge):", ridge.mse, "\n")
## Test MSE (Ridge): 1125664
Ridge regression shrinks the coefficient estimates and yields a lower test error than least squares, improving prediction by about 140,000 MSE units (1,266,407 - 1,125,664 = 140,743).
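Expressed as a relative improvement over the least-squares baseline (a sketch using the MSE values computed above):
# Proportional reduction in test MSE achieved by ridge; roughly 11% here.
(lm.mse - ridge.mse) / lm.mse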
# Cross-validation for lasso (alpha=1):
cv.lasso <- cv.glmnet(x.train, y.train, alpha = 1)
best.lambda.lasso <- cv.lasso$lambda.min
cat("Optimal lambda (lasso):", best.lambda.lasso, "\n")
## Optimal lambda (lasso): 10.45858
# Predicting on test set:
lasso.pred <- predict(cv.lasso, s = best.lambda.lasso, newx = x.test)
lasso.mse <- mean((lasso.pred - y.test)^2)
cat("Test MSE (Lasso):", lasso.mse, "\n")
## Test MSE (Lasso): 1233415
# Extracting coefficients at the best lambda:
lasso.coef <- predict(cv.lasso, s = best.lambda.lasso, type = "coefficients")
nonzero.count <- sum(lasso.coef != 0) - 1 # Subtract 1 for the intercept
cat("Number of non-zero coefficients (Lasso):", nonzero.count, "\n")
## Number of non-zero coefficients (Lasso): 14
The lasso model, which both shrinks coefficients and performs variable selection, produced a test error slightly lower than least squares (but not as low as ridge). The fact that it retains 14 of the 17 predictors suggests that most of the variables contribute to predicting Apps.
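To see exactly which predictors the lasso dropped, a short sketch using the lasso.coef object from above:
# Names of the predictors whose coefficients were shrunk exactly to zero.
rownames(lasso.coef)[as.vector(lasso.coef) == 0]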
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
set.seed(1)
pcr.fit <- pcr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")
# Cross-validation results:
summary(pcr.fit)
## Data: X dimension: 544 17
## Y dimension: 544 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3892 3821 2139 2155 1806 1742 1719
## adjCV 3892 3821 2134 2152 1776 1717 1711
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1724 1673 1641 1634 1638 1642 1643
## adjCV 1722 1659 1634 1627 1631 1635 1636
## 14 comps 15 comps 16 comps 17 comps
## CV 1646 1552 1179 1152
## adjCV 1639 1535 1169 1143
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.058 57.03 64.43 70.29 75.66 80.65 84.26 87.61
## Apps 5.575 71.65 71.65 81.07 82.59 82.61 82.69 84.06
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.59 92.84 94.93 96.74 97.82 98.72 99.38
## Apps 84.56 84.83 84.86 84.87 85.02 85.06 89.81
## 16 comps 17 comps
## X 99.85 100.00
## Apps 93.04 93.32
validationplot(pcr.fit, val.type = "MSEP", main="PCR CV Plot")
optimal_M_pcr <- which.min(pcr.fit$validation$PRESS)
cat("Optimal number of components (PCR):", optimal_M_pcr, "\n")
## Optimal number of components (PCR): 17
# Predicting on test set using the optimal number of components:
pcr.pred <- predict(pcr.fit, College.test, ncomp = optimal_M_pcr)
pcr.mse <- mean((pcr.pred - College.test$Apps)^2)
cat("Test MSE (PCR):", pcr.mse, "\n")
## Test MSE (PCR): 1266407
PCR uses all 17 components (i.e., the full predictor space), so no dimension reduction actually takes place. Its test error is therefore identical to that of least squares, implying that the components could not be truncated without sacrificing predictive power.
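As a check, PCR can be evaluated with a deliberately smaller number of components, say 10 (a hypothetical choice, sketched below with pcr.fit and the test split from above); the test error is expected to be noticeably higher, consistent with PCR needing nearly all components here.
# Test MSE for PCR truncated at 10 components.
pcr.pred10 <- predict(pcr.fit, College.test, ncomp = 10)
mean((pcr.pred10 - College.test$Apps)^2)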
pls.fit <- plsr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")
summary(pls.fit)
## Data: X dimension: 544 17
## Y dimension: 544 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3892 1949 1742 1545 1469 1268 1200
## adjCV 3892 1944 1738 1537 1446 1240 1187
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1178 1166 1164 1163 1159 1158 1158
## adjCV 1167 1157 1154 1154 1149 1148 1148
## 14 comps 15 comps 16 comps 17 comps
## CV 1157 1156 1156 1156
## adjCV 1148 1147 1147 1147
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 25.67 47.20 62.49 64.91 67.30 72.46 76.95 80.88
## Apps 76.61 82.45 86.93 90.76 92.84 93.06 93.14 93.20
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 82.69 85.27 87.8 90.88 92.47 95.11 97.09
## Apps 93.26 93.28 93.3 93.31 93.32 93.32 93.32
## 16 comps 17 comps
## X 98.37 100.00
## Apps 93.32 93.32
validationplot(pls.fit, val.type = "MSEP", main="PLS CV Plot")
# Determining optimal number of components from the CV plot:
optimal_M_pls <- which.min(pls.fit$validation$PRESS)
cat("Optimal number of components (PLS):", optimal_M_pls, "\n")
## Optimal number of components (PLS): 16
# Predicting on test set:
pls.pred <- predict(pls.fit, College.test, ncomp = optimal_M_pls)
pls.mse <- mean((pls.pred - College.test$Apps)^2)
cat("Test MSE (PLS):", pls.mse, "\n")
## Test MSE (PLS): 1266402
PLS also yields a test error nearly identical to that of least squares, indicating that, like PCR, it provides no meaningful advantage in this case.
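Because the PLS cross-validation curve flattens after roughly 7 to 10 components, a fit with far fewer than 16 components should perform about as well; a sketch using pls.fit and the test split from above:
# Test MSE for PLS with 10 components (a hypothetical, smaller choice).
pls.pred10 <- predict(pls.fit, College.test, ncomp = 10)
mean((pls.pred10 - College.test$Apps)^2)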
From the outputs above, all of the models, whether based on regularization (ridge and lasso) or on dimension reduction (PCR and PLS), achieve broadly similar predictive performance.
The fact that no approach clearly outperforms the others indicates that most of the underlying relationship is already captured by the ordinary linear model. With the exception of ridge regression, which improves somewhat on the baseline, all test MSE values sit around roughly 1.26 million. The ridge improvement suggests that shrinking the coefficients helps prediction here, most likely by reducing variance in the presence of correlated predictors.
Although regularization (ridge in particular) yields a modest advantage, the predictive performance of the various approaches is very comparable overall, indicating a relatively simple underlying pattern in the College data.
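For reference, the test MSEs computed above can be gathered into a single table (a sketch, assuming all of the fitted objects are still in the workspace):
# Side-by-side comparison of test MSEs for the five approaches, sorted best to worst.
results <- data.frame(
  Method = c("Least Squares", "Ridge", "Lasso", "PCR", "PLS"),
  TestMSE = c(lm.mse, ridge.mse, lasso.mse, pcr.mse, pls.mse)
)
results[order(results$TestMSE), ]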