#8 a)

# Set seed for reproducibility
set.seed(2)

# Generate predictor X and noise vector ε
n <- 100
X <- rnorm(n)
epsilon <- rnorm(n)

# View the first few values
head(X)
## [1] -0.89691455  0.18484918  1.58784533 -1.13037567 -0.08025176  0.13242028
head(epsilon)
## [1]  1.0744594  0.2605978 -0.3142720 -0.7496301 -0.8621983  2.0480403

#b

# Coefficients
beta0 <- 1
beta1 <- 2
beta2 <- 3
beta3 <- 4

# Response variable Y based on polynomial model
Y <- beta0 + beta1*X + beta2*X^2 + beta3*X^3 + epsilon

head(Y)
## [1] -0.192114667  1.758068478 27.438614992 -3.954480244 -0.005448205
## [6]  3.374774295

#c

library(leaps)

# (a) Generate X and epsilon
set.seed(2)
n <- 100
X <- rnorm(n)
epsilon <- rnorm(n)

# (b) Generate response Y = β0 + β1*X + β2*X^2 + β3*X^3 + ε
beta0 <- 1
beta1 <- 2
beta2 <- 3
beta3 <- 4
Y <- beta0 + beta1*X + beta2*X^2 + beta3*X^3 + epsilon

# (c) Create data frame with X, X^2, ..., X^10
data <- data.frame(Y = Y, X = X)
for (i in 2:10) {
  data[[paste0("X", i)]] <- X^i
}

# Perform best subset selection
best_subset <- regsubsets(Y ~ ., data = data, nvmax = 10)
summary_best <- summary(best_subset)

# Plot Cp, BIC, Adjusted R^2
par(mfrow = c(1, 3))  # One row, three plots

# Cp plot
plot(summary_best$cp, xlab = "Number of Variables", ylab = "Cp", type = "b", main = "Cp")
points(which.min(summary_best$cp), summary_best$cp[which.min(summary_best$cp)],
       col = "red", pch = 19)

# BIC plot
plot(summary_best$bic, xlab = "Number of Variables", ylab = "BIC", type = "b", main = "BIC")
points(which.min(summary_best$bic), summary_best$bic[which.min(summary_best$bic)],
       col = "blue", pch = 19)

# Adjusted R^2 plot
plot(summary_best$adjr2, xlab = "Number of Variables", ylab = "Adjusted R²", type = "b", main = "Adjusted R²")
points(which.max(summary_best$adjr2), summary_best$adjr2[which.max(summary_best$adjr2)],
       col = "green", pch = 19)

# Identify best model size
best_cp_size <- which.min(summary_best$cp)
best_bic_size <- which.min(summary_best$bic)
best_adjr2_size <- which.max(summary_best$adjr2)

cat("Best model size by Cp:", best_cp_size, "\n")
## Best model size by Cp: 3
cat("Best model size by BIC:", best_bic_size, "\n")
## Best model size by BIC: 3
cat("Best model size by Adjusted R²:", best_adjr2_size, "\n")
## Best model size by Adjusted R²: 3
# Report coefficients of the best model (you can choose one criterion, e.g., BIC)
cat("\nCoefficients of best model by BIC:\n")
## 
## Coefficients of best model by BIC:
print(coef(best_subset, best_bic_size))
## (Intercept)           X          X2          X3 
##   0.9621529   1.9867296   3.0504418   3.9875115

Best subset selection works well when the true model is sparse and the signal is strong. Here the simulated noise (epsilon, standard normal) is modest relative to the cubic signal, so the procedure can detect the true polynomial relationship.

Cp plot (left): the red dot at 3 variables marks the lowest Cp, indicating the best model by this criterion. Cp drops steeply up to 3 variables and then flattens, so adding more variables yields no gain.

BIC plot (middle): the blue dot at 3 variables marks the lowest BIC. BIC penalizes complexity even more heavily than Cp, so its agreement confirms that the 3-variable model is optimal.

Adjusted R² plot (right): the green dot at 3 variables marks the highest adjusted R². Adjusted R² increases only when an added variable genuinely improves the fit, and the curve flattens after 3, so additional terms offer no meaningful improvement. All three criteria (Cp, BIC, and adjusted R²) agree that the best model contains 3 predictors: X, X², and X³.
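
As an optional check (not part of the original output), the which matrix returned by summary(regsubsets) can be used to list the predictors that enter the selected 3-variable model; best_bic_size and summary_best are the objects defined above.

# Optional check: which predictors enter the model of the selected size?
names(which(summary_best$which[best_bic_size, ]))[-1]  # drop "(Intercept)"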

#d

# Forward Stepwise Selection
forward_fit <- regsubsets(Y ~ ., data = data, nvmax = 10, method = "forward")
forward_summary <- summary(forward_fit)

# Backward Stepwise Selection
backward_fit <- regsubsets(Y ~ ., data = data, nvmax = 10, method = "backward")
backward_summary <- summary(backward_fit)

# Compare best model sizes
forward_best_size <- which.min(forward_summary$bic)
backward_best_size <- which.min(backward_summary$bic)

cat("Best model size by BIC (Forward):", forward_best_size, "\n")
## Best model size by BIC (Forward): 3
cat("Best model size by BIC (Backward):", backward_best_size, "\n")
## Best model size by BIC (Backward): 3
# Coefficients of best models
cat("\nCoefficients (Forward):\n")
## 
## Coefficients (Forward):
print(coef(forward_fit, forward_best_size))
## (Intercept)           X          X2          X3 
##   0.9621529   1.9867296   3.0504418   3.9875115
cat("\nCoefficients (Backward):\n")
## 
## Coefficients (Backward):
print(coef(backward_fit, backward_best_size))
## (Intercept)           X          X2          X3 
##   0.9621529   1.9867296   3.0504418   3.9875115

All three approaches (best subset, forward, and backward stepwise) identified the same model, X, X², and X³, with identical coefficient estimates, recovering the true structure of the simulated data. This is the expected behavior when the signal is strong relative to the noise, so the particular search strategy does not matter here.
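
As an optional cross-check (not in the original output), the selected variable names can be compared directly across the three fits defined above (best_subset, forward_fit, backward_fit).

# Optional cross-check: do all three procedures select the same predictors?
names(coef(best_subset, best_bic_size))
names(coef(forward_fit, forward_best_size))
identical(names(coef(forward_fit, forward_best_size)),
          names(coef(backward_fit, backward_best_size)))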

#e

# Load glmnet package
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
# Create matrix of predictors (X, X^2, ..., X^10)
X_matrix <- model.matrix(Y ~ . - 1, data = data)  # Removes intercept column

# Perform cross-validated Lasso
set.seed(1)
lasso_cv <- cv.glmnet(X_matrix, Y, alpha = 1)  # Lasso: alpha = 1

# Plot cross-validation error vs. lambda
plot(lasso_cv)
title("Lasso CV Error vs Lambda", line = 2.5)

# Optimal lambda
best_lambda <- lasso_cv$lambda.min
cat("Best lambda (Lasso):", best_lambda, "\n")
## Best lambda (Lasso): 0.08594167
# Coefficients at optimal lambda
lasso_coef <- predict(lasso_cv, type = "coefficients", s = best_lambda)
print(lasso_coef)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                   s1
## (Intercept) 1.045102
## X           1.991867
## X2          2.989260
## X3          3.962895
## X4          .       
## X5          .       
## X6          .       
## X7          .       
## X8          .       
## X9          .       
## X10         .

Best λ (lambda.min) chosen by cross-validation: 0.0859. This is the value that minimizes the cross-validated mean squared error (MSE).

Plot interpretation: the U-shaped curve confirms that too much regularization (the right side of the plot) increases error. The minimum error occurs near log(λ) ≈ −2.5, i.e. λ ≈ 0.0859. At this value the lasso keeps only the important features (X, X², and X³) and sets the remaining coefficients to zero. In short, lasso regression with 10-fold cross-validation selected λ = 0.0859 as the penalty that minimizes the mean squared prediction error.
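
A common, more conservative alternative, sketched below under the assumption that a slightly sparser fit is acceptable, is the one-standard-error rule built into cv.glmnet: lambda.1se is the largest λ whose CV error lies within one standard error of the minimum.

# One-standard-error rule: a more heavily penalized (sparser) alternative to lambda.min
lambda_1se <- lasso_cv$lambda.1se
coef(lasso_cv, s = "lambda.1se")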

#f

# Simulate Data
# --------------------------
set.seed(2)
n <- 100
X <- rnorm(n)
epsilon <- rnorm(n)

# Create data frame with X, X^2, ..., X^10
data_f <- data.frame(X = X)
for (i in 2:10) {
  data_f[[paste0("X", i)]] <- X^i
}

# Generate response Y = β0 + β7 * X^7 + ε
beta0 <- 1
beta7 <- 7
Y <- beta0 + beta7 * X^7 + epsilon

# Add response Y to data
data_f$Y <- Y

# --------------------------
# Best Subset Selection
# --------------------------
subset_fit <- regsubsets(Y ~ ., data = data_f, nvmax = 10)
subset_summary <- summary(subset_fit)

# Plot BIC for model sizes
plot(subset_summary$bic, type = "b", xlab = "Number of Variables", ylab = "BIC", main = "Best Subset Selection - BIC")
points(which.min(subset_summary$bic), min(subset_summary$bic), col = "blue", pch = 19)

# Report best model size and coefficients
best_bic_size <- which.min(subset_summary$bic)
cat("Best model size by BIC (Best Subset):", best_bic_size, "\n\n")
## Best model size by BIC (Best Subset): 1
cat("Coefficients of best model (Best Subset):\n")
## Coefficients of best model (Best Subset):
print(coef(subset_fit, best_bic_size))
## (Intercept)          X7 
##    1.025124    6.998911
# --------------------------
# Lasso Regression with CV
# --------------------------
# Create model matrix for glmnet (without intercept)
X_matrix <- model.matrix(Y ~ . -1, data = data_f)

# Fit lasso model
set.seed(1)
lasso_cv <- cv.glmnet(X_matrix, Y, alpha = 1)

# Plot CV error vs lambda
plot(lasso_cv)
title("Lasso CV Error vs Lambda", line = 2.5)

# Get best lambda and coefficients
best_lambda <- lasso_cv$lambda.min
cat("\nBest lambda (Lasso):", best_lambda, "\n\n")
## 
## Best lambda (Lasso): 16.73074
cat("Lasso Coefficients at best lambda:\n")
## Lasso Coefficients at best lambda:
lasso_coef <- predict(lasso_cv, s = best_lambda, type = "coefficients")
print(lasso_coef)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                    s1
## (Intercept) 0.2586417
## X           .        
## X2          .        
## X3          .        
## X4          .        
## X5          .        
## X6          .        
## X7          6.7948891
## X8          .        
## X9          .        
## X10         .

Best subset selection plot insight: the BIC curve reaches its minimum at model size 1, indicating a single best predictor. Best subset selection recovered the true model exactly, identifying X⁷ as the only important term and ignoring all irrelevant ones (X, X², …, X⁶, X⁸–X¹⁰).

Lasso plot insight: cross-validation selected λ = 16.73 as the penalty minimizing the mean squared error. The lasso also retained only X⁷ and set every other coefficient to zero; its estimate of β₇ (about 6.79) is close to the true value of 7, and the penalty keeps the model sparse.

Both best subset selection and the lasso recovered the sparse true model Y = 1 + 7X⁷ + ε, demonstrating that these selection techniques are highly accurate when the signal is strong and the noise is controlled. Best subset with BIC favors an interpretable model directly, while the lasso shrinks irrelevant coefficients to zero, achieving variable selection and regularization at once.
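
As a small check (not shown in the original output), the non-zero entries of the lasso coefficient matrix can be listed by name to confirm that only the intercept and X7 survive.

# List the predictors retained by the lasso at lambda.min
nonzero <- which(as.vector(lasso_coef) != 0)
rownames(lasso_coef)[nonzero]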

#9 a

library(ISLR2)
# Load the College dataset
data(College)

# Set seed for reproducibility
set.seed(1)

# Split into training and test sets (e.g., 70% training, 30% test)
train_index <- sample(1:nrow(College), nrow(College) * 0.7)
train_data <- College[train_index, ]
test_data <- College[-train_index, ]

nrow(train_data)  
## [1] 543
nrow(test_data)   
## [1] 234
head(train_data)  
##                                 Private Apps Accept Enroll Top10perc Top25perc
## University of Southern Colorado      No 1401   1239    605        10        34
## College of Notre Dame               Yes  344    264     97        11        42
## Salisbury State University           No 4216   2290    736        20        52
## Regis College                       Yes  427    385    143        18        38
## La Salle University                 Yes 2929   1834    622        20        56
## Illinois State University            No 8681   6695   2408        10        35
##                                 F.Undergrad P.Undergrad Outstate Room.Board
## University of Southern Colorado        3716         675     7100       4380
## College of Notre Dame                   500         331    12600       5520
## Salisbury State University             4296        1027     5130       4690
## Regis College                           581         533    12700       5800
## La Salle University                    2738        1662    12600       5610
## Illinois State University             15701        1823     7799       3403
##                                 Books Personal PhD Terminal S.F.Ratio
## University of Southern Colorado   540     2948  63       88      19.4
## College of Notre Dame             630     2250  77       80      10.4
## Salisbury State University        600     1450  73       75      17.9
## Regis College                     450      700  81       85      10.3
## La Salle University               450     3160  90       90      15.1
## Illinois State University         537     2605  77       84      21.0
##                                 perc.alumni Expend Grad.Rate
## University of Southern Colorado           0   5389        36
## College of Notre Dame                     7   9773        43
## Salisbury State University               18   5125        56
## Regis College                            37  11758        84
## La Salle University                       9   9084        84
## Illinois State University                16   5569        54

Training set: 543 colleges; test set: 234 colleges. The training set is used to fit each model and the test set is used to evaluate how well it generalizes to unseen data. Splitting the data this way lets us validate models out-of-sample, which helps guard against overfitting.
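
An optional sanity check on the split (not in the original output): the training and test sets should be disjoint and together cover every row of College.

# Sanity check: no overlap between the two sets, and together they cover all colleges
length(intersect(rownames(train_data), rownames(test_data)))  # expected 0
nrow(train_data) + nrow(test_data) == nrow(College)           # expected TRUE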

#b

# (Reusing the training and test sets from part (a): train_data and test_data)

# Fit linear regression model on training set
lm_fit <- lm(Apps ~ ., data = train_data)

# Predict on test set
lm_preds <- predict(lm_fit, newdata = test_data)

# Compute Mean Squared Error (MSE) on test set
test_mse <- mean((test_data$Apps - lm_preds)^2)
cat("Test MSE (Linear Model):", test_mse, "\n")
## Test MSE (Linear Model): 1261630

The test MSE of 1,261,630 means the average squared difference between the predicted and actual number of applications (Apps) is about 1.26 million, which corresponds to a root mean squared error (RMSE) of sqrt(1261630) ≈ 1,123 applications. In other words, the least squares predictions are typically off by roughly 1,100 applications, which is reasonably close at this scale.
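
The RMSE quoted above can be reproduced directly from the stored test MSE.

# RMSE on the original scale of Apps
lm_rmse <- sqrt(test_mse)
lm_rmse  # roughly 1,100 applications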

#c

library(glmnet)

# Prepare the model matrix (excluding the response)
x_train <- model.matrix(Apps ~ ., data = train_data)[, -1]  # remove intercept
y_train <- train_data$Apps

x_test <- model.matrix(Apps ~ ., data = test_data)[, -1]
y_test <- test_data$Apps

# Set seed and perform cross-validation to find best lambda
set.seed(1)
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0)  # alpha = 0 for ridge

# Best lambda value
best_lambda_ridge <- cv_ridge$lambda.min
cat("Best lambda (Ridge):", best_lambda_ridge, "\n")
## Best lambda (Ridge): 367.5286
# Predict on test set using the best lambda
ridge_preds <- predict(cv_ridge, s = best_lambda_ridge, newx = x_test)

# Compute test MSE
ridge_mse <- mean((y_test - ridge_preds)^2)
cat("Test MSE (Ridge Regression):", ridge_mse, "\n")
## Test MSE (Ridge Regression): 1121034

This model was trained using ridge regression (L2 regularization), with cross-validation choosing the regularization strength. Compared with the linear regression test MSE of 1,261,630, the ridge model has a lower test error: the improvement is 1,261,630 − 1,121,034 = 140,596, roughly an 11% reduction in squared error. Ridge regression therefore generalized better to unseen data than the standard linear model, reducing test error from about 1.26M to 1.12M and showing that regularization improved generalization.
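
The improvement quoted above can be computed from the stored MSE values.

# Absolute and relative improvement of ridge over least squares
test_mse - ridge_mse                 # absolute reduction in test MSE
(test_mse - ridge_mse) / test_mse    # relative reduction, roughly 11%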

#d

# Fit Lasso using cross-validation
set.seed(1)
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1)  # alpha = 1 for Lasso

# Best lambda
best_lambda_lasso <- cv_lasso$lambda.min
cat("Best lambda (Lasso):", best_lambda_lasso, "\n")
## Best lambda (Lasso): 8.690175
# Predict on test set
lasso_preds <- predict(cv_lasso, s = best_lambda_lasso, newx = x_test)

# Compute test MSE
lasso_mse <- mean((y_test - lasso_preds)^2)
cat("Test MSE (Lasso):", lasso_mse, "\n")
## Test MSE (Lasso): 1233246
# Get number of non-zero coefficients
lasso_coefs <- predict(cv_lasso, s = best_lambda_lasso, type = "coefficients")
nonzero_count <- sum(lasso_coefs != 0) - 1  # subtract 1 to exclude intercept
cat("Number of non-zero coefficients (Lasso):", nonzero_count, "\n")
## Number of non-zero coefficients (Lasso): 14

The lasso model had a slightly higher test error (1,233,246) than ridge (1,121,034), though still lower than OLS (1,261,630), so it remains competitive. The lasso also performed automatic variable selection: it kept 14 predictors and set the rest to zero, producing a sparser, more interpretable model. This makes the lasso especially valuable when we want to reduce model complexity, identify the important predictors, and still maintain generalization performance. While ridge regression achieved the lowest test MSE, the lasso offered a good balance between predictive performance and model sparsity, which makes it especially useful when model simplicity or feature selection is a priority.
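
To see which 14 predictors the lasso kept, the names of the non-zero coefficients can be extracted from lasso_coefs (a sketch, not part of the original output).

# Names of the predictors with non-zero lasso coefficients (excluding the intercept)
retained <- rownames(lasso_coefs)[as.vector(lasso_coefs) != 0]
setdiff(retained, "(Intercept)")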

#e

library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
# Fit PCR on training set with cross-validation
set.seed(1)
pcr_fit <- pcr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")

# Plot cross-validation MSE vs number of components
validationplot(pcr_fit, val.type = "MSEP")

# Find number of components (M) with lowest CV error
cv_mse <- MSEP(pcr_fit)$val[1,1,-1]
best_M <- which.min(cv_mse)
cat("Best number of components (M):", best_M, "\n")
## Best number of components (M): 17
# Predict on test set using the best M
pcr_preds <- predict(pcr_fit, newdata = test_data, ncomp = best_M)

# Compute test MSE
pcr_mse <- mean((test_data$Apps - pcr_preds)^2)
cat("Test MSE (PCR):", pcr_mse, "\n")
## Test MSE (PCR): 1261630

PCR chose M = 17 components, i.e. all of the available predictors, to minimize the cross-validated prediction error. With every component retained, PCR is equivalent to ordinary least squares, which is why its test MSE (1,261,630) is identical to the linear model's. PCR selected 17 components but did not outperform the linear model: dimension reduction may not help when most predictors are informative or when the principal components do not align well with the response. In this case, ridge regression remains the best-performing model.
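
Because M = 17 keeps every component, PCR should reproduce the least squares fit exactly; the sketch below (assuming lm_preds from part (b) is still in the workspace) checks that the two sets of predictions agree up to numerical error.

# PCR with all 17 components should match the OLS predictions
max(abs(as.vector(pcr_preds) - lm_preds))  # should be essentially zero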

#f

# Fit PLS model with cross-validation
set.seed(1)
pls_fit <- plsr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")

# Plot validation error
validationplot(pls_fit, val.type = "MSEP")

# Determine best number of components (M)
cv_mse_pls <- MSEP(pls_fit)$val[1,1,-1]
best_M_pls <- which.min(cv_mse_pls)
cat("Best number of components (PLS):", best_M_pls, "\n")
## Best number of components (PLS): 17
# Predict on test data using best M
pls_preds <- predict(pls_fit, newdata = test_data, ncomp = best_M_pls)

# Calculate test MSE
pls_mse <- mean((test_data$Apps - pls_preds)^2)
cat("Test MSE (PLS):", pls_mse, "\n")
## Test MSE (PLS): 1261630

The PLS model also selected 17 components, the maximum possible and the same number chosen by PCR. With all components retained, PLS reduces to the full least squares fit, so its test MSE (1,261,630) is identical to the linear regression and PCR results. PLS did not improve performance here: the predictors are all informative enough that dimension reduction does not simplify the problem, and PLS could not capture more signal with fewer components. Ridge still remains the best-generalizing model overall.

#g

In this exercise, we used five modeling approaches to predict the number of college applications (Apps) from institutional features in the College dataset: linear regression, ridge regression, lasso regression, principal components regression (PCR), and partial least squares (PLS).

All five models performed reasonably well, predicting the number of applications with an average error of roughly 1,100 applications, which is modest relative to the range of the response variable. The differences in test MSE across models were relatively small, except for ridge, which showed a clear performance advantage. Regularization (ridge and lasso) improved generalization by reducing overfitting, while the dimensionality reduction methods (PCR and PLS) yielded no significant benefit here, likely because most of the original predictors contribute meaningful information.
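
For reference, the five test MSEs computed above can be collected into a single table (a convenience sketch using the objects already in the workspace).

# Collect and sort the test MSEs of the five models
results <- data.frame(
  Model    = c("Linear regression", "Ridge", "Lasso", "PCR", "PLS"),
  Test_MSE = c(test_mse, ridge_mse, lasso_mse, pcr_mse, pls_mse)
)
results[order(results$Test_MSE), ]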

We can predict the number of college applications with moderate accuracy. While all methods performed similarly, ridge regression offered the best balance of predictive power and generalization, and the lasso followed closely with the added benefit of variable selection. This suggests that while all predictors are somewhat useful, regularization techniques provide a valuable edge in real-world predictive tasks.