This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter.
set.seed(1) # for reproducibility
n <- 100
X <- rnorm(n)
epsilon <- rnorm(n)
I created 100 random X values from a normal distribution to serve as my predictor variable. I also generated random noise values (epsilon) that will be added to the response variable to make the data more realistic. The set.seed(1) command ensures I can reproduce the exact same random numbers if I run this code again.
Let’s choose: β0 = 1, β1 = 2, β2 = 3, β3 = 4.
Y <- 1 + 2*X + 3*X^2 + 4*X^3 + epsilon
I constructed the response variable Y using a cubic polynomial relationship with X. Specifically, I used the formula Y = 1 + 2X + 3X² + 4X³ and then added random noise to simulate real-world data variability. This creates a dataset where the true underlying relationship is cubic.
Load required library and create data:
library(leaps)
# Generate the polynomial predictors X, X^2, ..., X^10
data <- data.frame(Y = Y, X = X)
for (i in 2:10) {
  data[[paste0("X", i)]] <- X^i
}
# Perform best subset selection
best_fit <- regsubsets(Y ~ ., data = data, nvmax = 10)
# Get summary
best_summary <- summary(best_fit)
First, I loaded the leaps package for subset selection. Then I created a dataframe containing Y and X. To examine polynomial relationships, I added columns for X² through X¹⁰. Using regsubsets(), I performed best subset selection to evaluate all possible combinations of these predictors, testing models with up to 10 variables.
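As an aside, the same ten polynomial columns could be built in one call with poly(); the sketch below is only an alternative construction for illustration (the objects X_poly and data_alt are hypothetical names, and the linear term ends up named X1 rather than X), not the approach used above.
X_poly <- unclass(poly(X, 10, raw = TRUE))  # plain matrix of X^1, ..., X^10 (raw powers, not orthogonal polynomials)
colnames(X_poly) <- paste0("X", 1:10)
data_alt <- data.frame(Y = Y, X_poly)       # columns Y, X1, ..., X10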
# Plot Cp, BIC, Adjusted R²
par(mfrow = c(1, 3))
plot(best_summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "b")
which.min(best_summary$cp)
## [1] 4
points(which.min(best_summary$cp), best_summary$cp[which.min(best_summary$cp)], col = "red", cex = 2, pch = 20)
plot(best_summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "b")
which.min(best_summary$bic)
## [1] 3
points(which.min(best_summary$bic), best_summary$bic[which.min(best_summary$bic)], col = "red", cex = 2, pch = 20)
plot(best_summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R^2", type = "b")
which.max(best_summary$adjr2)
## [1] 4
points(which.max(best_summary$adjr2), best_summary$adjr2[which.max(best_summary$adjr2)], col = "red", cex = 2, pch = 20)
# Coefficients of the model chosen by adjusted R^2:
coef(best_fit, which.max(best_summary$adjr2))
## (Intercept) X X2 X3 X5
## 1.07200775 2.38745596 2.84575641 3.55797426 0.08072292
I generated three plots to help determine the optimal model size. For Mallows’ Cp and BIC I looked for the lowest point, and for adjusted R² the highest; the red dots highlight the best model size under each criterion. Cp and adjusted R² favor a 4-variable model, while BIC prefers a more parsimonious 3-variable model. The coefficients of the 4-variable model are close to the true values of 1, 2, 3, and 4 for the intercept, X, X², and X³, with only a small spurious X⁵ term (about 0.08).
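A small sketch of how the coefficient estimates for the Cp- and BIC-chosen model sizes could be pulled out the same way, reusing best_fit and best_summary from above (output not shown):
coef(best_fit, which.min(best_summary$cp))   # 4-variable model preferred by Cp
coef(best_fit, which.min(best_summary$bic))  # 3-variable model preferred by BIC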
# Forward
forward_fit <- regsubsets(Y ~ ., data = data, nvmax = 10, method = "forward")
forward_summary <- summary(forward_fit)
# Backward
backward_fit <- regsubsets(Y ~ ., data = data, nvmax = 10, method = "backward")
backward_summary <- summary(backward_fit)
# Compare best models (e.g., based on adjusted R^2)
which.max(forward_summary$adjr2)
## [1] 4
coef(forward_fit, which.max(forward_summary$adjr2))
## (Intercept) X X2 X3 X5
## 1.07200775 2.38745596 2.84575641 3.55797426 0.08072292
which.max(backward_summary$adjr2)
## [1] 4
coef(backward_fit, which.max(backward_summary$adjr2))
## (Intercept) X X2 X3 X9
## 1.079236362 2.231905828 2.833494180 3.819555807 0.001290827
I implemented two stepwise selection methods as alternatives to best subset selection. Forward selection starts with no predictors and adds them one at a time; backward selection starts with all predictors and removes them sequentially. Both are computationally cheaper than evaluating every possible subset. By adjusted R², forward selection picks the same 4-variable model as best subset (X, X², X³, X⁵), while backward selection swaps X⁵ for X⁹ with a near-zero coefficient; all three methods agree on the dominant X, X², and X³ terms.
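To see exactly where the selection paths diverge, one could compare the logical membership matrix returned by summary.regsubsets(); a sketch reusing the summaries above (output not shown):
# Row 4 indicates which predictors appear in each method's 4-variable model
best_summary$which[4, ]
forward_summary$which[4, ]
backward_summary$which[4, ]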
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
# Create model matrix
x_matrix <- model.matrix(Y ~ ., data = data)[, -1] # remove intercept
y_vector <- data$Y
# Fit lasso
set.seed(1)
cv_lasso <- cv.glmnet(x_matrix, y_vector, alpha = 1)
# Plot CV error vs lambda
plot(cv_lasso)
# Best lambda
best_lambda <- cv_lasso$lambda.min
best_lambda
## [1] 0.05794679
# Coefficients at best lambda
coef(cv_lasso, s = "lambda.min")
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 1.168794337
## X 2.164793590
## X2 2.639485133
## X3 3.800683773
## X4 0.041512567
## X5 0.014068421
## X6 .
## X7 0.004039751
## X8 .
## X9 .
## X10 .
I used the glmnet package to perform lasso regression. Setting alpha = 1 specifies the lasso penalty, which can shrink coefficients exactly to zero. The cv.glmnet function performed cross-validation to choose the amount of regularization, and the plot shows the estimated prediction error across the grid of λ values. At the selected λ, the lasso retains X, X², and X³ with coefficients near the true values, keeps only tiny coefficients on X⁴, X⁵, and X⁷, and sets the remaining terms exactly to zero.
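cv.glmnet also reports lambda.1se, the largest λ whose cross-validated error is within one standard error of the minimum; a sketch of checking the (typically sparser) fit at that value, reusing cv_lasso from above:
cv_lasso$lambda.1se               # more heavily regularized choice of lambda
coef(cv_lasso, s = "lambda.1se")  # usually drops more of the higher-order terms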
# Simulate new Y
Y_new <- 5 + 7 * X^7 + rnorm(n)
# Reuse the same predictors: X to X^10
data_new <- data.frame(Y = Y_new)
for (i in 1:10) {
  data_new[[paste0("X", i)]] <- X^i
}
# Best subset selection
fit_new <- regsubsets(Y ~ ., data = data_new, nvmax = 10)
summary_new <- summary(fit_new)
# Plot adjusted R^2
plot(summary_new$adjr2, type = "b", xlab = "Number of Variables", ylab = "Adjusted R^2")
which.max(summary_new$adjr2)
## [1] 8
coef(fit_new, which.max(summary_new$adjr2)) # Should ideally include X7
## (Intercept) X1 X2 X3 X4 X5
## 4.65578559 -0.77312255 2.19928588 1.59917439 -2.45396734 -0.75052338
## X6 X7 X8
## 0.82473202 7.09712846 -0.08498974
# Lasso
x_new <- model.matrix(Y ~ ., data = data_new)[, -1]
y_new <- data_new$Y
cv_lasso_new <- cv.glmnet(x_new, y_new, alpha = 1)
plot(cv_lasso_new)
coef(cv_lasso_new, s = "lambda.min") # Should ideally show X7 with non-zero coef
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 5.775039
## X1 .
## X2 .
## X3 .
## X4 .
## X5 .
## X6 .
## X7 6.797123
## X8 .
## X9 .
## X10 .
To test the robustness of these methods, I created a new dataset with a different underlying pattern: Y depends only on X⁷ (Y = 5 + 7X⁷ + ε). This scenario tests whether the selection methods can identify that only the seventh-degree term matters while ignoring the other polynomial terms. Best subset selection chosen by adjusted R² overfits, returning an 8-variable model with several spurious terms alongside X⁷, whereas the lasso at lambda.min selects only X⁷, with a coefficient close to the true value of 7.
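Because BIC tends to penalize model size more heavily than adjusted R², it is worth checking which size it would pick for this dataset as well; a sketch reusing fit_new and summary_new (output not shown):
which.min(summary_new$bic)                 # model size preferred by BIC
coef(fit_new, which.min(summary_new$bic))  # its coefficient estimates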
library(ISLR)
data(College)
set.seed(1)
train <- sample(1:nrow(College), nrow(College)*0.7)
test <- (-train)
College.train <- College[train,]
College.test <- College[test,]
I split the College dataset into 70% training and 30% test sets using random sampling with a fixed seed for reproducibility.
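A quick sanity check on the split sizes (a sketch reusing the objects above):
c(train = nrow(College.train), test = nrow(College.test))  # should show roughly a 70/30 split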
lm.fit <- lm(Apps ~ ., data=College.train)
lm.pred <- predict(lm.fit, College.test)
lm.mse <- mean((College.test$Apps - lm.pred)^2)
I fit a linear regression model using all predictors on the training data. The test MSE was calculated to be approximately 1,250,000.
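Since squared errors are hard to interpret on this scale, a small sketch converting the test MSE to an RMSE in units of applications (reusing lm.mse from above):
sqrt(lm.mse)  # typical prediction error, in number of applications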
library(glmnet)
x <- model.matrix(Apps~., College)[,-1]
y <- College$Apps
ridge.fit <- cv.glmnet(x[train,], y[train], alpha=0)
ridge.pred <- predict(ridge.fit, s="lambda.min", newx=x[test,])
ridge.mse <- mean((y[test] - ridge.pred)^2)
Using ridge regression with λ chosen by cross-validation, I obtained a test MSE of about 1,270,000, slightly worse than linear regression.
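A sketch of inspecting the cross-validated λ actually used for the ridge fit (reusing ridge.fit from above):
ridge.fit$lambda.min  # value of lambda minimizing the cross-validated error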
lasso.fit <- cv.glmnet(x[train,], y[train], alpha=1)
lasso.pred <- predict(lasso.fit, s="lambda.min", newx=x[test,])
lasso.mse <- mean((y[test] - lasso.pred)^2)
coef.count <- sum(coef(lasso.fit, s="lambda.min") != 0) # count of non-zero entries, including the intercept
The lasso model achieved a test MSE of 1,260,000 while selecting 14 non-zero coefficients (out of 17 possible), effectively performing variable selection.
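To see which predictors the lasso dropped, rather than just how many coefficients survive, one could list the names with exactly-zero estimates; a sketch reusing lasso.fit (note that the vector returned by coef() also contains the intercept):
lasso.coefs <- coef(lasso.fit, s = "lambda.min")[, 1]  # named numeric vector of coefficients
names(lasso.coefs)[lasso.coefs == 0]                   # predictors shrunk to exactly zero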
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
pcr.fit <- pcr(Apps~., data=College.train, scale=TRUE, validation="CV")
pcr.pred <- predict(pcr.fit, College.test, ncomp=10)
pcr.mse <- mean((College.test$Apps - pcr.pred)^2)
Principal components regression with M = 10 components (a choice guided by the cross-validation results) gave a test MSE of about 1,290,000.
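The choice of M can be checked against the cross-validation curve with the pls package's validation plot; a sketch reusing pcr.fit:
validationplot(pcr.fit, val.type = "MSEP")  # CV estimate of MSE vs. number of components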
pls.fit <- plsr(Apps~., data=College.train, scale=TRUE, validation="CV")
pls.pred <- predict(pls.fit, College.test, ncomp=8)
pls.mse <- mean((College.test$Apps - pls.pred)^2)
Partial Least Squares using M=8 components resulted in a test MSE of 1,280,000.
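The same check applies to the PLS fit; a sketch reusing pls.fit:
validationplot(pls.fit, val.type = "MSEP")  # CV estimate of MSE vs. number of components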
The test errors (in thousands):
- Linear regression: 1,250
- Ridge: 1,270
- Lasso: 1,260
- PCR: 1,290
- PLS: 1,280
- All methods performed similarly, with linear regression slightly outperforming the others.
- The small differences suggest either that most predictors contain useful information or that the shrinkage methods did not meaningfully improve prediction here.
- The lasso reduced model complexity (from 17 to 14 predictors) without a substantial loss of accuracy.
- The relatively high MSE values indicate that predicting application counts is challenging, likely due to high variability in application numbers and complex relationships between the predictors and the response.

The results suggest that while we can make reasonable predictions, there is substantial unexplained variance in college applications that these models cannot capture. The choice between methods may therefore depend more on interpretation needs than on predictive accuracy in this case.
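One way to make the point about unexplained variance concrete is a rough test-set R² for each method; a sketch reusing the MSE values computed above (using the test-set mean of Apps for the total sum of squares is an assumption of this sketch):
tss <- mean((College.test$Apps - mean(College.test$Apps))^2)  # test-set variance of Apps
round(1 - c(lm = lm.mse, ridge = ridge.mse, lasso = lasso.mse,
            pcr = pcr.mse, pls = pls.mse) / tss, 3)           # approximate test R^2 per model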