Question 8:

(a) Predictor X:

set.seed(1)  # for reproducibility
n <- 100
X <- rnorm(n)
epsilon <- rnorm(n)

(b) Response Y:

beta0 <- 2
beta1 <- 3
beta2 <- -1
beta3 <- 0.5

Y <- beta0 + beta1*X + beta2*X^2 + beta3*X^3 + epsilon

(c) Best Subset Selection:

library(leaps)

# Creating predictors X1 to X10:
data <- data.frame(Y = Y, X = X)
for (i in 2:10) {
  data[[paste0("X", i)]] <- X^i
}

# Best subset selection:
best_fit <- regsubsets(Y ~ ., data = data, nvmax = 10)
summary_best <- summary(best_fit)

# Visualizations:
par(mfrow = c(1, 3))
plot(summary_best$cp, xlab = "Number of Variables", ylab = "Cp", type = "b", main="Cp")
which.min(summary_best$cp)
## [1] 3
plot(summary_best$bic, xlab = "Number of Variables", ylab = "BIC", type = "b", main="BIC")
which.min(summary_best$bic)
## [1] 3
plot(summary_best$adjr2, xlab = "Number of Variables", ylab = "Adjusted R2", type = "b", main="Adjusted R²")

which.max(summary_best$adjr2)
## [1] 3
# Coefficients of best model:
coef(best_fit, which.max(summary_best$adjr2))
## (Intercept)           X          X2          X5 
##  2.07219472  3.44514720 -1.15676236  0.09022577

From the above output, all three criteria (Cp, BIC and Adjusted R²) point to the same 3-predictor model, containing X, X2 and X5. The three plots confirm this: Cp and BIC are minimized, and Adjusted R² is maximized, at three variables, which is exactly what we look for when choosing the best model. Hence the selected model is:

Y = 2.072 + 3.445X - 1.157X² + 0.090X⁵ + ε

where β0 = 2.072, β1 = 3.445, β2 = -1.157 and β5 = 0.090.

The point worth noting is that the method chose X5 rather than X3, the true predictor. This is most likely due to the strong correlation (multicollinearity) among the polynomial terms, which makes X5 nearly interchangeable with X3, combined with the flexibility of the candidate model space.
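
The collinearity claim is easy to check directly; a minimal sketch reusing the `data` frame built above:

# Correlations among the odd powers of X; a high correlation between X3 and X5
# would explain why the selection procedure can pick X5 in place of X3:
cor(data$X3, data$X5)
cor(data[, c("X", "X3", "X5", "X7")])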

(d) Forward and Backward selection:

# Forward Selection:

regfit.fwd <- regsubsets(Y ~ ., data = data, nvmax = 10, method = "forward")
reg.summary.fwd <- summary(regfit.fwd)
best_model_index_fwd <- which.min(reg.summary.fwd$bic)
coef_fwd <- coef(regfit.fwd, best_model_index_fwd)
cat("\nForward Stepwise Selection Coefficients (based on BIC):\n")
## 
## Forward Stepwise Selection Coefficients (based on BIC):
print(coef_fwd)
## (Intercept)           X          X2          X5 
##  2.07219472  3.44514720 -1.15676236  0.09022577
# Backward Selection:

regfit.bwd <- regsubsets(Y ~ ., data = data, nvmax = 10, method = "backward")
reg.summary.bwd <- summary(regfit.bwd)
best_model_index_bwd <- which.min(reg.summary.bwd$bic)
coef_bwd <- coef(regfit.bwd, best_model_index_bwd)
cat("\nBackward Stepwise Selection Coefficients (based on BIC):\n")
## 
## Backward Stepwise Selection Coefficients (based on BIC):
print(coef_bwd)
## (Intercept)           X          X2          X5 
##  2.07219472  3.44514720 -1.15676236  0.09022577

Observing the results from both forward and backward selection, we can conclude that the two methods selected the same 3-variable model (X, X2 and X5) with coefficient estimates identical to those from best subset selection.
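
One way to make this agreement explicit is to compare the variable-inclusion indicators of the three fits at the chosen model size; a sketch, assuming the summary objects defined above are still in the workspace:

# Variables included in the 3-predictor model under each search strategy;
# identical rows confirm that all three methods select X, X2 and X5:
rbind(
  best     = summary_best$which[3, ],
  forward  = reg.summary.fwd$which[3, ],
  backward = reg.summary.bwd$which[3, ]
)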

(e) Lasso Model:

library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
x_matrix <- model.matrix(Y ~ . - 1, data = data)
cv.out <- cv.glmnet(x_matrix, Y, alpha = 1)  # Lasso (alpha=1)
plot(cv.out)  

best_lambda <- cv.out$lambda.min
cat("\nOptimal Lambda (min CV error): ", best_lambda, "\n")
## 
## Optimal Lambda (min CV error):  0.03458424
lasso_coef <- coef(cv.out, s = "lambda.min")
cat("Lasso Coefficients at Optimal Lambda:\n")
## Lasso Coefficients at Optimal Lambda:
print(lasso_coef)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                      s1
## (Intercept)  2.04367894
## X            3.29157840
## X2          -1.11065890
## X3           0.13441562
## X4           .         
## X5           0.06526751
## X6           .         
## X7           .         
## X8           .         
## X9           .         
## X10          .

From the above results, cross-validation selects λ(min) of approximately 0.035. At this value, four polynomial terms remain nonzero: X, X2, X3 and X5.

The model therefore suggests a polynomial relationship up to the fifth power of X, excluding X4; the remaining terms are shrunk exactly to zero. The final coefficient estimates differ somewhat from the true simulated parameters, which reflects both the noise term and the collinearity among the powers of X.

Overall, the result demonstrates that the lasso is effective at producing a sparse model (only 4 of the 10 polynomial terms) while minimizing the estimated out-of-sample prediction error.
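
Since the terms X3 and X5 survive at λ(min) with small coefficients, it can also be informative to look at the fit at `lambda.1se`, the largest λ whose CV error is within one standard error of the minimum; this usually gives a sparser model. A sketch using the `cv.out` object above:

# Coefficients at the more conservative lambda.1se; typically fewer
# nonzero terms than at lambda.min:
cv.out$lambda.1se
coef(cv.out, s = "lambda.1se")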

(f) New Response Y:

# Generating a new response according to: Y_new = beta0 + beta7 * X^7 + noise:
beta7 <- 4 
Y_new <- beta0 + beta7 * (X^7) + epsilon

# Creating a new data frame (including X, X^2, ..., X^10):
data_new <- data.frame(Y = Y_new, X = X)
for (j in 2:10) {
  data_new[[paste0("X", j)]] <- X^j
}

# Best subset selection on the new data:
regfit.new <- regsubsets(Y ~ ., data = data_new, nvmax = 10)
reg.summary.new <- summary(regfit.new)
best_model_index_new <- which.min(reg.summary.new$bic)
coef_new <- coef(regfit.new, best_model_index_new)
cat("\nBest Subset Selection Coefficients for Model Y = beta0 + beta7*X7 + noise:\n")
## 
## Best Subset Selection Coefficients for Model Y = beta0 + beta7*X7 + noise:
print(coef_new)
## (Intercept)          X7 
##     1.95894     4.00077
# Lasso on the new data:
x_matrix_new <- model.matrix(Y ~ . - 1, data = data_new)
cv.out.new <- cv.glmnet(x_matrix_new, Y_new, alpha = 1)
plot(cv.out.new)

best_lambda_new <- cv.out.new$lambda.min
cat("\nOptimal Lambda for New Data: ", best_lambda_new, "\n")
## 
## Optimal Lambda for New Data:  7.068491
lasso_coef_new <- coef(cv.out.new, s = "lambda.min")
cat("Lasso Coefficients for New Model at Optimal Lambda:\n")
## Lasso Coefficients for New Model at Optimal Lambda:
print(lasso_coef_new)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                   s1
## (Intercept) 2.451138
## X           .       
## X2          .       
## X3          .       
## X4          .       
## X5          .       
## X6          .       
## X7          3.884146
## X8          .       
## X9          .       
## X10         .

From the above results, best subset selection and the lasso give broadly similar answers. Both yield intercepts in the neighborhood of the true value of 2 (about 1.96 for best subset selection and about 2.45 for the lasso).

For the coefficient of X7, best subset selection gives 4.00, essentially the true value of 4, while the lasso gives 3.88, slightly below it. The difference arises because best subset selection refits the chosen subset by ordinary least squares without shrinkage, whereas the lasso shrinks its coefficient estimates toward zero.

Overall, both methods select X7 as the only influential predictor and discard the rest, closely matching the true model specified above.
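
To confirm that the gap between 3.88 and 4 is due to shrinkage rather than the data, one could refit the selected one-variable model by ordinary least squares, which is essentially what best subset selection reports; a sketch:

# Unpenalized fit of the model selected by both methods; the X7 coefficient
# should land very close to the true value of 4, unlike the shrunken lasso estimate:
ols_x7 <- lm(Y ~ X7, data = data_new)
coef(ols_x7)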

Question 9:

(a) Split the data:

library(ISLR)
set.seed(1)  

train_index <- sample(1:nrow(College), round(0.7 * nrow(College)))
College.train <- College[train_index, ]
College.test  <- College[-train_index, ]

(b) Linear Model:

# Fitting linear model on training data:
lm.fit <- lm(Apps ~ ., data = College.train)

# Predicting on test set:
lm.pred <- predict(lm.fit, newdata = College.test)

# Computing test mean squared error:
lm.mse <- mean((lm.pred - College.test$Apps)^2)
cat("Test MSE (Least Squares):", lm.mse, "\n")
## Test MSE (Least Squares): 1266407

The test MSE of 1,266,407 is the baseline performance of the standard least-squares linear model.
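
For scale, it can help to compare this against the test MSE of an intercept-only model that predicts the training mean for every college; a minimal sketch:

# Test MSE of the naive "predict the training mean" baseline,
# for comparison with the least-squares fit:
null.mse <- mean((mean(College.train$Apps) - College.test$Apps)^2)
null.mse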

(c) Ridge Regression Model:

library(glmnet)

# Preparing matrices: 
x.train <- model.matrix(Apps ~ ., data = College.train)[, -1]  # remove intercept column
y.train <- College.train$Apps
x.test  <- model.matrix(Apps ~ ., data = College.test)[, -1]
y.test  <- College.test$Apps

# Performing cross-validation for ridge regression (alpha=0):
cv.ridge <- cv.glmnet(x.train, y.train, alpha = 0)
best.lambda.ridge <- cv.ridge$lambda.min
cat("Optimal lambda (ridge):", best.lambda.ridge, "\n")
## Optimal lambda (ridge): 367.2207
# Making predictions on test set:
ridge.pred <- predict(cv.ridge, s = best.lambda.ridge, newx = x.test)
ridge.mse <- mean((ridge.pred - y.test)^2)
cat("Test MSE (Ridge):", ridge.mse, "\n")
## Test MSE (Ridge): 1125664

Ridge regression shrinks the coefficient estimates and yields a lower test error than least squares, improving the test MSE by about 140,000 units (1,266,407 - 1,125,664 = 140,743).
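
To see the shrinkage directly, the least-squares and ridge coefficients can be placed side by side; a sketch, assuming the two fits use the same formula so the coefficient ordering matches:

# Side-by-side comparison of the unpenalized and ridge coefficient estimates;
# the ridge column should be pulled toward zero for most predictors:
coef_compare <- cbind(
  OLS   = coef(lm.fit),
  Ridge = as.vector(coef(cv.ridge, s = best.lambda.ridge))
)
round(coef_compare, 4)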

(d) Lasso Model:

# Cross-validation for lasso (alpha=1):
cv.lasso <- cv.glmnet(x.train, y.train, alpha = 1)
best.lambda.lasso <- cv.lasso$lambda.min
cat("Optimal lambda (lasso):", best.lambda.lasso, "\n")
## Optimal lambda (lasso): 10.45858
# Predicting on test set:
lasso.pred <- predict(cv.lasso, s = best.lambda.lasso, newx = x.test)
lasso.mse <- mean((lasso.pred - y.test)^2)
cat("Test MSE (Lasso):", lasso.mse, "\n")
## Test MSE (Lasso): 1233415
# Extracting coefficients at the best lambda:
lasso.coef <- predict(cv.lasso, s = best.lambda.lasso, type = "coefficients")
nonzero.count <- sum(lasso.coef != 0) - 1  # Subtract 1 for the intercept
cat("Number of non-zero coefficients (Lasso):", nonzero.count, "\n")
## Number of non-zero coefficients (Lasso): 14

The lasso model, which both shrinks coefficients and performs variable selection, produced a test error slightly lower than least squares, though not as low as ridge. The fact that it retains 14 predictors suggests that most of the variables contribute to predicting Apps.
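
Which predictors were dropped can be read directly off the sparse coefficient matrix; a sketch using the `lasso.coef` object above:

# Names of the predictors whose lasso coefficients were shrunk exactly to zero:
dropped <- rownames(lasso.coef)[lasso.coef[, 1] == 0]
dropped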

(e) PCR Model:

library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
set.seed(1)
pcr.fit <- pcr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")

# Cross-validation results:
summary(pcr.fit)
## Data:    X dimension: 544 17 
##  Y dimension: 544 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3892     3821     2139     2155     1806     1742     1719
## adjCV         3892     3821     2134     2152     1776     1717     1711
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1724     1673     1641      1634      1638      1642      1643
## adjCV     1722     1659     1634      1627      1631      1635      1636
##        14 comps  15 comps  16 comps  17 comps
## CV         1646      1552      1179      1152
## adjCV      1639      1535      1169      1143
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X      32.058    57.03    64.43    70.29    75.66    80.65    84.26    87.61
## Apps    5.575    71.65    71.65    81.07    82.59    82.61    82.69    84.06
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.59     92.84     94.93     96.74     97.82     98.72     99.38
## Apps    84.56     84.83     84.86     84.87     85.02     85.06     89.81
##       16 comps  17 comps
## X        99.85    100.00
## Apps     93.04     93.32
validationplot(pcr.fit, val.type = "MSEP", main="PCR CV Plot")

optimal_M_pcr <- which.min(pcr.fit$validation$PRESS)  
cat("Optimal number of components (PCR):", optimal_M_pcr, "\n")
## Optimal number of components (PCR): 17
# Predicting on test set using the optimal number of components:
pcr.pred <- predict(pcr.fit, College.test, ncomp = optimal_M_pcr)
pcr.mse <- mean((pcr.pred - College.test$Apps)^2)
cat("Test MSE (PCR):", pcr.mse, "\n")
## Test MSE (PCR): 1266407

PCR uses all 17 components (i.e., essentially the full predictor space) to minimize the cross-validated error. Its test error is virtually the same as least squares, implying that dimension reduction could not simplify the model without sacrificing predictive power.
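
Because the CV error keeps improving all the way to 17 components, one might also check how the test error behaves for smaller M; a sketch looping over the number of components:

# Test MSE for each possible number of PCR components, to see whether a
# smaller M gives nearly the same error as the full 17-component fit:
pcr.test.mse <- sapply(1:17, function(m) {
  pred <- predict(pcr.fit, College.test, ncomp = m)
  mean((pred - College.test$Apps)^2)
})
round(pcr.test.mse)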

(f) PLS Model:

pls.fit <- plsr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")
summary(pls.fit)
## Data:    X dimension: 544 17 
##  Y dimension: 544 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3892     1949     1742     1545     1469     1268     1200
## adjCV         3892     1944     1738     1537     1446     1240     1187
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1178     1166     1164      1163      1159      1158      1158
## adjCV     1167     1157     1154      1154      1149      1148      1148
##        14 comps  15 comps  16 comps  17 comps
## CV         1157      1156      1156      1156
## adjCV      1148      1147      1147      1147
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       25.67    47.20    62.49    64.91    67.30    72.46    76.95    80.88
## Apps    76.61    82.45    86.93    90.76    92.84    93.06    93.14    93.20
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       82.69     85.27      87.8     90.88     92.47     95.11     97.09
## Apps    93.26     93.28      93.3     93.31     93.32     93.32     93.32
##       16 comps  17 comps
## X        98.37    100.00
## Apps     93.32     93.32
validationplot(pls.fit, val.type = "MSEP", main="PLS CV Plot")

# Determining optimal number of components from the CV plot:
optimal_M_pls <- which.min(pls.fit$validation$PRESS)
cat("Optimal number of components (PLS):", optimal_M_pls, "\n")
## Optimal number of components (PLS): 16
# Predicting on test set:
pls.pred <- predict(pls.fit, College.test, ncomp = optimal_M_pls)
pls.mse <- mean((pls.pred - College.test$Apps)^2)
cat("Test MSE (PLS):", pls.mse, "\n")
## Test MSE (PLS): 1266402

PLS also yielded a test error nearly identical to least squares, indicating that, like PCR, it provides no significant advantage in this case.
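
Given that the CV error flattens out after roughly 10 components, one could also check the test error of a more parsimonious PLS fit; a sketch, where the choice of 10 components is only an illustration read off the CV table:

# Test MSE of a smaller PLS model; if it is close to the 16-component error,
# the extra components add little predictive value:
pls.pred.10 <- predict(pls.fit, College.test, ncomp = 10)
mean((pls.pred.10 - College.test$Apps)^2)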

(g) Comment on Results:

From the above outputs, we can summarize that all of the models, whether based on regularization (ridge, lasso) or on dimension reduction (PCR, PLS), achieve broadly similar, moderate predictive performance.

The fact that no approach dramatically outperforms the others in test error indicates that most of the underlying signal is already captured by the plain linear model (or a minor variation of it). With the exception of ridge regression, which improves on it somewhat, all test MSE values sit close to roughly 1.26 million. This suggests that shrinking the coefficients does enhance prediction a little, most likely by lowering variance in the presence of correlated predictors.

Although regularization (ridge regression in particular) yields a modest advantage, our analysis shows that the predictive performance of the different approaches is very comparable, pointing to a relatively simple structure in the College data.
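
A convenient way to back up this comparison is to gather the five test errors into a single table; a sketch, assuming the objects computed in parts (b) through (f) are still available:

# All five test MSEs side by side, sorted from best to worst:
test.mse <- c(
  OLS   = lm.mse,
  Ridge = ridge.mse,
  Lasso = lasso.mse,
  PCR   = pcr.mse,
  PLS   = pls.mse
)
sort(round(test.mse))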