#8 a)
# Set seed for reproducibility
set.seed(2)
# Generate predictor X and noise vector ε
n <- 100
X <- rnorm(n)
epsilon <- rnorm(n)
# View the first few values
head(X)
## [1] -0.89691455 0.18484918 1.58784533 -1.13037567 -0.08025176 0.13242028
head(epsilon)
## [1] 1.0744594 0.2605978 -0.3142720 -0.7496301 -0.8621983 2.0480403
#b
# Coefficients
beta0 <- 1
beta1 <- 2
beta2 <- 3
beta3 <- 4
# Response variable Y based on polynomial model
Y <- beta0 + beta1*X + beta2*X^2 + beta3*X^3 + epsilon
head(Y)
## [1] -0.192114667 1.758068478 27.438614992 -3.954480244 -0.005448205
## [6] 3.374774295
#c
library(leaps)
# (a) Generate X and epsilon
set.seed(2)
n <- 100
X <- rnorm(n)
epsilon <- rnorm(n)
# (b) Generate response Y = β0 + β1*X + β2*X^2 + β3*X^3 + ε
beta0 <- 1
beta1 <- 2
beta2 <- 3
beta3 <- 4
Y <- beta0 + beta1*X + beta2*X^2 + beta3*X^3 + epsilon
# (c) Create data frame with X, X^2, ..., X^10
data <- data.frame(Y = Y, X = X)
for (i in 2:10) {
  data[[paste0("X", i)]] <- X^i
}
# Perform best subset selection
best_subset <- regsubsets(Y ~ ., data = data, nvmax = 10)
summary_best <- summary(best_subset)
# Plot Cp, BIC, Adjusted R^2
par(mfrow = c(1, 3)) # One row, three plots
# Cp plot
plot(summary_best$cp, xlab = "Number of Variables", ylab = "Cp", type = "b", main = "Cp")
points(which.min(summary_best$cp), summary_best$cp[which.min(summary_best$cp)],
col = "red", pch = 19)
# BIC plot
plot(summary_best$bic, xlab = "Number of Variables", ylab = "BIC", type = "b", main = "BIC")
points(which.min(summary_best$bic), summary_best$bic[which.min(summary_best$bic)],
col = "blue", pch = 19)
# Adjusted R^2 plot
plot(summary_best$adjr2, xlab = "Number of Variables", ylab = "Adjusted R²", type = "b", main = "Adjusted R²")
points(which.max(summary_best$adjr2), summary_best$adjr2[which.max(summary_best$adjr2)],
col = "green", pch = 19)
# Identify best model size
best_cp_size <- which.min(summary_best$cp)
best_bic_size <- which.min(summary_best$bic)
best_adjr2_size <- which.max(summary_best$adjr2)
cat("Best model size by Cp:", best_cp_size, "\n")
## Best model size by Cp: 3
cat("Best model size by BIC:", best_bic_size, "\n")
## Best model size by BIC: 3
cat("Best model size by Adjusted R²:", best_adjr2_size, "\n")
## Best model size by Adjusted R²: 3
# Report coefficients of the best model (you can choose one criterion, e.g., BIC)
cat("\nCoefficients of best model by BIC:\n")
##
## Coefficients of best model by BIC:
print(coef(best_subset, best_bic_size))
## (Intercept) X X2 X3
## 0.9621529 1.9867296 3.0504418 3.9875115
Best subset selection works well when the true model is sparse and the signal is strong. The simulated noise level (via epsilon) was modest, so the method could detect the real polynomial relationship.
Cp plot (left): the red dot at 3 variables marks the lowest Cp, indicating the best model. Cp drops steeply up to 3 variables and then flattens, so adding more variables offers no gain.
BIC plot (middle): the blue dot at 3 variables marks the lowest BIC. This agrees with Cp; BIC also penalizes complexity, so it confirms that the 3-variable model is optimal.
Adjusted R² plot (right): the green dot at 3 variables marks the highest Adjusted R². Adjusted R² rises only when added variables genuinely improve the fit, and the curve flattens after 3, so additional terms offer no meaningful improvement. All three criteria (Cp, BIC, and Adjusted R²) agree that the best model includes 3 predictors: X, X2, and X3.
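As a supplementary check, a minimal sketch using the summary object computed above tabulates the three criteria side by side, making the agreement at 3 variables explicit:
# Optional: tabulate Cp, BIC, and Adjusted R^2 for each model size
criteria_table <- data.frame(
  Size = 1:10,
  Cp = summary_best$cp,
  BIC = summary_best$bic,
  AdjR2 = summary_best$adjr2
)
print(criteria_table)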
#d
# Forward Stepwise Selection
forward_fit <- regsubsets(Y ~ ., data = data, nvmax = 10, method = "forward")
forward_summary <- summary(forward_fit)
# Backward Stepwise Selection
backward_fit <- regsubsets(Y ~ ., data = data, nvmax = 10, method = "backward")
backward_summary <- summary(backward_fit)
# Compare best model sizes
forward_best_size <- which.min(forward_summary$bic)
backward_best_size <- which.min(backward_summary$bic)
cat("Best model size by BIC (Forward):", forward_best_size, "\n")
## Best model size by BIC (Forward): 3
cat("Best model size by BIC (Backward):", backward_best_size, "\n")
## Best model size by BIC (Backward): 3
# Coefficients of best models
cat("\nCoefficients (Forward):\n")
##
## Coefficients (Forward):
print(coef(forward_fit, forward_best_size))
## (Intercept) X X2 X3
## 0.9621529 1.9867296 3.0504418 3.9875115
cat("\nCoefficients (Backward):\n")
##
## Coefficients (Backward):
print(coef(backward_fit, backward_best_size))
## (Intercept) X X2 X3
## 0.9621529 1.9867296 3.0504418 3.9875115
All three methods identified the same underlying structure in the simulated data. This is the expected behavior when the signal is strong and the noise is moderate: even though the raw polynomial terms X, X2, ..., X10 are highly correlated with one another, the cubic signal dominates, so forward, backward, and best subset selection all converge on the same three-term model.
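As a quick sanity check, a minimal sketch using the fits above verifies that forward and backward selection chose exactly the same three predictors:
# Optional: confirm the selected variable sets match
fwd_vars <- names(coef(forward_fit, forward_best_size))
bwd_vars <- names(coef(backward_fit, backward_best_size))
identical(fwd_vars, bwd_vars)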
#e
# Load glmnet package
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
# Create matrix of predictors (X, X^2, ..., X^10)
X_matrix <- model.matrix(Y ~ . - 1, data = data) # Removes intercept column
# Perform cross-validated Lasso
set.seed(1)
lasso_cv <- cv.glmnet(X_matrix, Y, alpha = 1) # Lasso: alpha = 1
# Plot cross-validation error vs. lambda
plot(lasso_cv)
title("Lasso CV Error vs Lambda", line = 2.5)
# Optimal lambda
best_lambda <- lasso_cv$lambda.min
cat("Best lambda (Lasso):", best_lambda, "\n")
## Best lambda (Lasso): 0.08594167
# Coefficients at optimal lambda
lasso_coef <- predict(lasso_cv, type = "coefficients", s = best_lambda)
print(lasso_coef)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 1.045102
## X 1.991867
## X2 2.989260
## X3 3.962895
## X4 .
## X5 .
## X6 .
## X7 .
## X8 .
## X9 .
## X10 .
The best λ (lambda.min) chosen by cross-validation is 0.0859; this value minimizes the mean cross-validated mean squared error (MSE).
Plot interpretation: the U-shaped curve confirms that too much regularization (the right side of the plot) increases error. The minimum error occurs near log(λ) ≈ -2.5, i.e., λ ≈ 0.0859. At this value, the Lasso keeps only the important features (X, X2, and X3) and sets the rest to zero. In short, Lasso regression with 10-fold cross-validation selected an optimal penalty of λ = 0.0859, and the cross-validation error plot shows that this λ minimizes the mean squared prediction error.
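As an aside, cv.glmnet also reports lambda.1se, the largest λ whose CV error is within one standard error of the minimum; a minimal sketch using the fit above shows the coefficients under this more conservative choice:
# Optional: inspect the more heavily regularized lambda.1se solution
lambda_1se <- lasso_cv$lambda.1se
cat("lambda.1se (Lasso):", lambda_1se, "\n")
print(predict(lasso_cv, type = "coefficients", s = lambda_1se))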
#f
# Simulate Data
# --------------------------
set.seed(2)
n <- 100
X <- rnorm(n)
epsilon <- rnorm(n)
# Create data frame with X, X^2, ..., X^10
data_f <- data.frame(X = X)
for (i in 2:10) {
  data_f[[paste0("X", i)]] <- X^i
}
# Generate response Y = β0 + β7 * X^7 + ε
beta0 <- 1
beta7 <- 7
Y <- beta0 + beta7 * X^7 + epsilon
# Add response Y to data
data_f$Y <- Y
# --------------------------
# Best Subset Selection
# --------------------------
subset_fit <- regsubsets(Y ~ ., data = data_f, nvmax = 10)
subset_summary <- summary(subset_fit)
# Plot BIC for model sizes
plot(subset_summary$bic, type = "b", xlab = "Number of Variables", ylab = "BIC", main = "Best Subset Selection - BIC")
points(which.min(subset_summary$bic), min(subset_summary$bic), col = "blue", pch = 19)
# Report best model size and coefficients
best_bic_size <- which.min(subset_summary$bic)
cat("Best model size by BIC (Best Subset):", best_bic_size, "\n\n")
## Best model size by BIC (Best Subset): 1
cat("Coefficients of best model (Best Subset):\n")
## Coefficients of best model (Best Subset):
print(coef(subset_fit, best_bic_size))
## (Intercept) X7
## 1.025124 6.998911
# --------------------------
# Lasso Regression with CV
# --------------------------
# Create model matrix for glmnet (without intercept)
X_matrix <- model.matrix(Y ~ . -1, data = data_f)
# Fit lasso model
set.seed(1)
lasso_cv <- cv.glmnet(X_matrix, Y, alpha = 1)
# Plot CV error vs lambda
plot(lasso_cv)
title("Lasso CV Error vs Lambda", line = 2.5)
# Get best lambda and coefficients
best_lambda <- lasso_cv$lambda.min
cat("\nBest lambda (Lasso):", best_lambda, "\n\n")
##
## Best lambda (Lasso): 16.73074
cat("Lasso Coefficients at best lambda:\n")
## Lasso Coefficients at best lambda:
lasso_coef <- predict(lasso_cv, s = best_lambda, type = "coefficients")
print(lasso_coef)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.2586417
## X .
## X2 .
## X3 .
## X4 .
## X5 .
## X6 .
## X7 6.7948891
## X8 .
## X9 .
## X10 .
Best subset selection, plot insight: the BIC plot reaches its minimum at model size = 1, indicating a single best predictor. Best subset selection recovered the true model exactly, identifying X^7 as the only important term and ignoring all irrelevant ones (X, X^2, ..., X^6 and X^8 through X^10).
Lasso with cross-validation, plot insight: cross-validation selected λ = 16.73, minimizing the mean squared error. The Lasso also kept only X^7, setting every other coefficient to zero. Its estimate of β7 is close to 7, and the penalty kept the model sparse.
Both best subset selection and the Lasso recovered the sparse true model Y = 1 + 7X^7 + ε, demonstrating: high accuracy of model selection under strong signal and controlled noise; best subset selection favors an interpretable model via BIC; and the Lasso shrinks irrelevant features to zero, achieving selection and regularization at once.
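Because the Lasso shrinks coefficients toward zero, its estimate of β7 (6.79) is slightly attenuated; a minimal sketch of a follow-up check is to refit ordinary least squares on just the selected term X^7 and compare against the true values β0 = 1 and β7 = 7:
# Optional: unpenalized refit on the lasso-selected predictor
ols_refit <- lm(Y ~ X7, data = data_f)
print(coef(ols_refit))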
#9 a
library(ISLR2)
# Load the College dataset
data(College)
# Set seed for reproducibility
set.seed(1)
# Split into training and test sets (e.g., 70% training, 30% test)
train_index <- sample(1:nrow(College), nrow(College) * 0.7)
train_data <- College[train_index, ]
test_data <- College[-train_index, ]
nrow(train_data)
## [1] 543
nrow(test_data)
## [1] 234
head(train_data)
## Private Apps Accept Enroll Top10perc Top25perc
## University of Southern Colorado No 1401 1239 605 10 34
## College of Notre Dame Yes 344 264 97 11 42
## Salisbury State University No 4216 2290 736 20 52
## Regis College Yes 427 385 143 18 38
## La Salle University Yes 2929 1834 622 20 56
## Illinois State University No 8681 6695 2408 10 35
## F.Undergrad P.Undergrad Outstate Room.Board
## University of Southern Colorado 3716 675 7100 4380
## College of Notre Dame 500 331 12600 5520
## Salisbury State University 4296 1027 5130 4690
## Regis College 581 533 12700 5800
## La Salle University 2738 1662 12600 5610
## Illinois State University 15701 1823 7799 3403
## Books Personal PhD Terminal S.F.Ratio
## University of Southern Colorado 540 2948 63 88 19.4
## College of Notre Dame 630 2250 77 80 10.4
## Salisbury State University 600 1450 73 75 17.9
## Regis College 450 700 81 85 10.3
## La Salle University 450 3160 90 90 15.1
## Illinois State University 537 2605 77 84 21.0
## perc.alumni Expend Grad.Rate
## University of Southern Colorado 0 5389 36
## College of Notre Dame 7 9773 43
## Salisbury State University 18 5125 56
## Regis College 37 11758 84
## La Salle University 9 9084 84
## Illinois State University 16 5569 54
Training set: 543 colleges. Test set: 234 colleges. The training set is used to fit each model, and the test set is used to evaluate how well it generalizes to unseen data. Splitting the data this way allows the models to be validated out-of-sample and helps guard against overfitting.
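Since every model below is scored on the same held-out set, a small helper can standardize that calculation; this is an optional sketch (compute_test_mse is a hypothetical name, not used by the original chunks):
# Optional helper (hypothetical, not used by the chunks below): test-set MSE
compute_test_mse <- function(actual, predicted) {
  mean((actual - predicted)^2)
}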
#b
# (Reusing the training and test sets from part (a): train_data and test_data)
# Fit linear regression model on training set
lm_fit <- lm(Apps ~ ., data = train_data)
# Predict on test set
lm_preds <- predict(lm_fit, newdata = test_data)
# Compute Mean Squared Error (MSE) on test set
test_mse <- mean((test_data$Apps - lm_preds)^2)
cat("Test MSE (Linear Model):", test_mse, "\n")
## Test MSE (Linear Model): 1261630
On average, the squared difference between the predicted and actual number of applications (Apps) is about 1.26 million. A test MSE of 1,261,630 corresponds to a root mean squared error (RMSE) of sqrt(1261630) ≈ 1,123 applications, so the least squares model gets reasonably close on the scale of the response.
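The RMSE quoted above can be reproduced directly; a minimal sketch reusing test_mse from the chunk above:
# Optional: RMSE on the original scale of Apps
cat("Test RMSE (Linear Model):", sqrt(test_mse), "\n")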
#c
library(glmnet)
# Prepare the model matrix (excluding the response)
x_train <- model.matrix(Apps ~ ., data = train_data)[, -1] # remove intercept
y_train <- train_data$Apps
x_test <- model.matrix(Apps ~ ., data = test_data)[, -1]
y_test <- test_data$Apps
# Set seed and perform cross-validation to find best lambda
set.seed(1)
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0) # alpha = 0 for ridge
# Best lambda value
best_lambda_ridge <- cv_ridge$lambda.min
cat("Best lambda (Ridge):", best_lambda_ridge, "\n")
## Best lambda (Ridge): 367.5286
# Predict on test set using the best lambda
ridge_preds <- predict(cv_ridge, s = best_lambda_ridge, newx = x_test)
# Compute test MSE
ridge_mse <- mean((y_test - ridge_preds)^2)
cat("Test MSE (Ridge Regression):", ridge_mse, "\n")
## Test MSE (Ridge Regression): 1121034
This model was fit with ridge regression (L2 regularization), using cross-validation to choose the regularization strength. Compared to the linear regression test MSE of 1,261,630, the ridge model has a lower test error: the improvement is 1,261,630 − 1,121,034 = 140,596, roughly an 11% reduction in squared error, meaning ridge generalized better to unseen data. Ridge regression therefore outperformed the standard linear model, reducing the test error from about 1.26M to 1.12M and showing that regularization improved generalization.
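The percentage improvement quoted above can be computed directly; a minimal sketch using test_mse and ridge_mse from the earlier chunks:
# Optional: relative improvement of ridge over least squares
improvement <- (test_mse - ridge_mse) / test_mse
cat("Relative reduction in test MSE:", round(100 * improvement, 1), "%\n")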
#d
# Fit Lasso using cross-validation
set.seed(1)
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1) # alpha = 1 for Lasso
# Best lambda
best_lambda_lasso <- cv_lasso$lambda.min
cat("Best lambda (Lasso):", best_lambda_lasso, "\n")
## Best lambda (Lasso): 8.690175
# Predict on test set
lasso_preds <- predict(cv_lasso, s = best_lambda_lasso, newx = x_test)
# Compute test MSE
lasso_mse <- mean((y_test - lasso_preds)^2)
cat("Test MSE (Lasso):", lasso_mse, "\n")
## Test MSE (Lasso): 1233246
# Get number of non-zero coefficients
lasso_coefs <- predict(cv_lasso, s = best_lambda_lasso, type = "coefficients")
nonzero_count <- sum(lasso_coefs != 0) - 1 # subtract 1 to exclude intercept
cat("Number of non-zero coefficients (Lasso):", nonzero_count, "\n")
## Number of non-zero coefficients (Lasso): 14
The lasso model had a slightly higher test error than ridge (1,121,034) but a lower test error than OLS (1,261,630), so it remained competitive. The lasso also performed automatic variable selection: it kept 14 predictors and set the rest to zero, yielding a sparser, more interpretable model. This makes the lasso especially valuable when we want to reduce model complexity, identify important predictors, and maintain generalization performance. While ridge regression achieved the lowest test MSE, the lasso offered a good balance between predictive performance and sparsity, selecting only 14 predictors; this makes it especially useful when model simplicity or feature selection is a priority.
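To see which 14 predictors survived, the non-zero coefficients can be listed by name; a minimal sketch using lasso_coefs from the chunk above:
# Optional: names of the predictors kept by the lasso
coef_mat <- as.matrix(lasso_coefs)  # dense matrix with row names
kept <- rownames(coef_mat)[coef_mat[, 1] != 0]
print(setdiff(kept, "(Intercept)"))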
#e
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
# Fit PCR on training set with cross-validation
set.seed(1)
pcr_fit <- pcr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")
# Plot cross-validation MSE vs number of components
validationplot(pcr_fit, val.type = "MSEP")
# Find number of components (M) with lowest CV error
cv_mse <- MSEP(pcr_fit)$val[1,1,-1]
best_M <- which.min(cv_mse)
cat("Best number of components (M):", best_M, "\n")
## Best number of components (M): 17
# Predict on test set using the best M
pcr_preds <- predict(pcr_fit, newdata = test_data, ncomp = best_M)
# Compute test MSE
pcr_mse <- mean((test_data$Apps - pcr_preds)^2)
cat("Test MSE (PCR):", pcr_mse, "\n")
## Test MSE (PCR): 1261630
PCR selected 17 principal components, i.e., all of the available predictors, to minimize cross-validated prediction error. With M = 17 the fit spans the full predictor space, so PCR is equivalent to ordinary least squares, which is why the test MSE (1,261,630) is identical to that of the linear regression model. PCR therefore did not outperform the linear model; dimension reduction may not help when most predictors are informative or when the principal components do not align well with the response. In this case, ridge regression remains the best-performing model.
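Because M = 17 reduces PCR to least squares, a supplementary sketch is to see how the test MSE behaves with a genuinely reduced number of components (the choice of M = 5 here is arbitrary, for illustration only):
# Optional: test MSE with a smaller, arbitrary number of components
pcr_preds_small <- predict(pcr_fit, newdata = test_data, ncomp = 5)
cat("Test MSE (PCR, M = 5):", mean((test_data$Apps - pcr_preds_small)^2), "\n")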
#f
# Fit PLS model with cross-validation
set.seed(1)
pls_fit <- plsr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")
# Plot validation error
validationplot(pls_fit, val.type = "MSEP")
# Determine best number of components (M)
cv_mse_pls <- MSEP(pls_fit)$val[1,1,-1]
best_M_pls <- which.min(cv_mse_pls)
cat("Best number of components (PLS):", best_M_pls, "\n")
## Best number of components (PLS): 17
# Predict on test data using best M
pls_preds <- predict(pls_fit, newdata = test_data, ncomp = best_M_pls)
# Calculate test MSE
pls_mse <- mean((test_data$Apps - pls_preds)^2)
cat("Test MSE (PLS):", pls_mse, "\n")
## Test MSE (PLS): 1261630
The PLS model selected 17 components, the maximum possible and the same number chosen by PCR. As with PCR, using all components makes the fit equivalent to ordinary least squares, so the test MSE (1,261,630) matches the linear regression and PCR models exactly. PLS did not improve performance here, which likely means the predictors were already informative and dimension reduction could not capture the signal with fewer components. Ridge still remains the best generalizing model overall.
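As a side-by-side look, a minimal sketch using the cv_mse and cv_mse_pls vectors computed above compares the cross-validated MSE of PCR and PLS for the first few components, showing whether PLS concentrates the signal faster:
# Optional: CV MSE of PCR vs PLS for the first few components
print(data.frame(
  Components = 1:5,
  PCR_CV_MSE = as.numeric(cv_mse[1:5]),
  PLS_CV_MSE = as.numeric(cv_mse_pls[1:5])
))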
#g
In this exercise, we used five modeling approaches to predict the number of college applications (Apps) from institutional features in the College dataset: linear regression, ridge regression, lasso regression, principal components regression (PCR), and partial least squares (PLS).
All five models performed reasonably well, predicting the number of applications with an average error of roughly 1,100 applications, which is modest relative to the scale of the response variable. The differences in test MSE across models were relatively small, except for ridge, which showed a clear performance advantage. Regularization (ridge and lasso) improved generalization by reducing overfitting, while the dimension reduction methods (PCR and PLS) did not yield additional benefit here, likely because most of the original predictors contribute meaningful information.
We can predict the number of college applications with moderate accuracy, and while all methods performed similarly, Ridge Regression offered the best balance of predictive power and generalization. Lasso followed closely, with the added benefit of variable selection. This suggests that while all predictors are somewhat useful, regularization techniques provide a valuable edge in real-world predictive tasks.
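To make this comparison concrete, the test MSEs from the earlier chunks can be gathered into one table; a minimal sketch, assuming test_mse, ridge_mse, lasso_mse, pcr_mse, and pls_mse are still in the workspace:
# Optional: collect all test MSEs for a side-by-side comparison
results <- data.frame(
  Model = c("Linear Regression", "Ridge", "Lasso", "PCR", "PLS"),
  Test_MSE = c(test_mse, ridge_mse, lasso_mse, pcr_mse, pls_mse)
)
print(results[order(results$Test_MSE), ])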