8. In this exercise, we will generate simulated data, and will then use this data to perform best subset selection.
    (a) Use the rnorm() function to generate a predictor X of length n = 100, as well as a noise vector ϵ of length n = 100.

    library(tidyverse)
    library(glmnet)
    library(leaps)
    library(ISLR2)
    library(pls)
    set.seed(1)
    n <- 100
    X <- rnorm(n)
    epsilon <- rnorm(n)

    (b) Generate a response vector Y of length n = 100 according to the model

    Y = β0 + β1X + β2X^2 + β3X^3 + ϵ,

    where β0, β1, β2, and β3 are constants of your choice.

    beta0 <- 3
    beta1 <- 2
    beta2 <- -3
    beta3 <- 0.5
    
    Y <- beta0 + beta1 * X + beta2 * X^2 + beta3 * X^3 + epsilon

    (c) Use the regsubsets() function to perform best subset selection in order to choose the best model containing the predictors X, X^2, …, X^10. What is the best model obtained according to Cp, BIC, and adjusted R²? Show some plots to provide evidence for your answer, and report the coefficients of the best model obtained. Note you will need to use the data.frame() function to create a single data set containing both X and Y.

    data <- data.frame(Y = Y, X1 = X, X2 = X^2, X3 = X^3, 
                       X4 = X^4, X5 = X^5, X6 = X^6,
                       X7 = X^7, X8 = X^8, X9 = X^9, X10 = X^10)
    
    best_subset <- regsubsets(Y ~ ., data = data, nvmax = 10)
    summary_best <- summary(best_subset)
    
    # Plot Cp, BIC, and Adjusted R-squared
    par(mfrow = c(1, 3))
    plot(summary_best$cp, xlab = "Number of Variables", ylab = "Cp", type = "b")
    plot(summary_best$bic, xlab = "Number of Variables", ylab = "BIC", type = "b")
    plot(summary_best$adjr2, xlab = "Number of Variables", ylab = "Adjusted R²", type = "b")

    Explanation: The regsubsets() function performs best subset selection for predictors X through X^10. Plots of Cp, BIC, and Adjusted R² help identify the optimal model.
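
    To pin down the model size each criterion favors and to report the coefficients of the chosen model, the selected sizes can be read off the summary object directly (a minimal sketch using the objects fitted above):

    # Model size chosen by each criterion
    which.min(summary_best$cp)     # Cp
    which.min(summary_best$bic)    # BIC
    which.max(summary_best$adjr2)  # Adjusted R-squared

    # Coefficients of the model chosen by BIC
    coef(best_subset, which.min(summary_best$bic))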

    (d) Repeat (c), using forward stepwise selection and also using backward stepwise selection. How does your answer compare to the results in (c)?

    # Forward stepwise selection
    forward_stepwise <- regsubsets(Y ~ ., data = data, nvmax = 10, method = "forward")
    summary_forward <- summary(forward_stepwise)
    
    # Backward stepwise selection
    backward_stepwise <- regsubsets(Y ~ ., data = data, nvmax = 10, method = "backward")
    summary_backward <- summary(backward_stepwise)
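
    One way to compare the three search strategies directly is to look at the model size chosen by BIC under each method and the corresponding coefficients; a short sketch reusing the fits from (c) and (d):

    # BIC-optimal model size under each search strategy
    sapply(list(best = summary_best, forward = summary_forward,
                backward = summary_backward),
           function(s) which.min(s$bic))

    # Coefficients of the BIC-selected forward and backward models
    coef(forward_stepwise, which.min(summary_forward$bic))
    coef(backward_stepwise, which.min(summary_backward$bic))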

    Comparison of Results: Best Subset Selection (8c) vs Stepwise Selection (8d)

    Best Subset Selection (8c):

    • Best subset selection evaluates all possible combinations of predictors and selects the model that optimizes a given criterion (e.g., Cp, BIC, Adjusted R²).

    • It is computationally exhaustive because it considers every possible subset of predictors.

    • The selected model is guaranteed to be the best according to the chosen criterion since it searches through all subsets.

    Forward and Backward Stepwise Selection (8d):

    • Forward Stepwise Selection: Starts with no predictors and adds one predictor at a time based on which improves the model the most according to a criterion.

    • Backward Stepwise Selection: Starts with all predictors and removes one predictor at a time based on which worsens the model the least.

    • Both methods are computationally less intensive compared to best subset selection because they do not evaluate all possible subsets. Instead, they follow a sequential approach.

    Comparison of Results:

    1. Model Selection:

      • Best subset selection typically identifies a model with slightly better performance metrics (Cp, BIC, Adjusted R²) because it evaluates all subsets exhaustively.

      • Forward and backward stepwise selection may yield similar models but might miss the globally optimal subset since they follow a greedy approach.

    2. Computational Efficiency:

      • Best subset selection is computationally expensive for large numbers of predictors: with p predictors it fits all 2^p subsets (here p = 10, so 1,024 models), and the cost grows exponentially in p.

      • Forward and backward stepwise selection are faster and more practical for datasets with many predictors.

    3. Consistency in Results:

      • If the true model is sparse and the important predictors stand out clearly, forward and backward stepwise selection often produce results similar to best subset selection.

      • However, when the predictors are highly correlated (as the polynomial terms X, X^2, …, X^10 are), the greedy stepwise paths can commit to a suboptimal variable early on, and best subset selection may find combinations that the stepwise searches miss.

    Conclusion: In this exercise, forward and backward stepwise selection likely yield models similar to best subset selection for X, X^2, …, X^10, but they may differ slightly in the selected predictors or performance metrics because of their sequential nature. The choice between methods depends on computational constraints and the complexity of the dataset.

    (e) Now fit a lasso model to the simulated data, again using X, X^2, …, X^10 as predictors. Use cross-validation to select the optimal value of λ. Create plots of the cross-validation error as a function of λ. Report the resulting coefficient estimates, and discuss the results obtained.

    X_matrix <- model.matrix(Y ~ ., data)[,-1]
    lasso_model <- cv.glmnet(X_matrix, Y, alpha = 1)
    
    # Plot cross-validation error vs lambda
    plot(lasso_model)

    # Optimal lambda and coefficients
    optimal_lambda <- lasso_model$lambda.min
    lasso_coefficients <- coef(lasso_model, s = optimal_lambda)
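
    The coefficient estimates at the cross-validated λ can then be printed; a small sketch that keeps only the non-zero entries:

    # Non-zero coefficient estimates at lambda.min
    coefs <- as.matrix(lasso_coefficients)
    coefs[coefs[, 1] != 0, , drop = FALSE]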

    (f) Now generate a response vector Y according to the model Y = β0 + β7X^7 + ϵ, and perform best subset selection and the lasso. Discuss the results obtained.

    # Generate new response variable Y_new using only X^7
    beta7 <- 0.5
    Y_new <- beta0 + beta7 * X^7 + epsilon
    
    # Create a new data frame with predictors X, X^2, ..., X^10
    data_new <- data.frame(
      Y_new = Y_new,
      X1 = X,
      X2 = X^2,
      X3 = X^3,
      X4 = X^4,
      X5 = X^5,
      X6 = X^6,
      X7 = X^7,
      X8 = X^8,
      X9 = X^9,
      X10 = X^10
    )
    
    # Perform best subset selection
    best_subset_new <- regsubsets(Y_new ~ ., data = data_new, nvmax = 10)
    summary_best_new <- summary(best_subset_new)
    
    # Fit a lasso model on the new data
    X_matrix_new <- model.matrix(Y_new ~ ., data_new)[, -1] # Remove intercept column
    lasso_model_new <- cv.glmnet(X_matrix_new, Y_new, alpha = 1)
    
    # Optimal lambda and coefficients for lasso
    optimal_lambda_new <- lasso_model_new$lambda.min
    lasso_coefficients_new <- coef(lasso_model_new, s = optimal_lambda_new)

    Explanation: A new response variable is generated with only X^7 as a predictor. Both best subset selection and lasso are applied to analyze differences in results.
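
    To make the comparison concrete, one can check which variables each approach retains; a sketch using the fits above:

    # Variables in the BIC-selected best-subset model
    coef(best_subset_new, which.min(summary_best_new$bic))

    # Non-zero lasso coefficients at lambda.min
    coefs_new <- as.matrix(lasso_coefficients_new)
    coefs_new[coefs_new[, 1] != 0, , drop = FALSE]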

9. In this exercise, we will predict the number of applications received using the other variables in the College data set.

    (a) Split the data set into a training set and a test set.

    set.seed(1)
    
    train_indices <- sample(1:nrow(College), nrow(College)/2)
    train_data <- College[train_indices, ]
    test_data <- College[-train_indices, ]

    Explanation: The College dataset is split into training and test sets.

    (b) Fit a linear model using least squares on the training set, and report the test error obtained.

    lm_fit <- lm(Apps ~ ., data = train_data)
    lm_pred <- predict(lm_fit, test_data)
    lm_test_error <- mean((test_data$Apps - lm_pred)^2)

    Explanation: A linear model is fit using least squares. Test error is computed as mean squared error (MSE).

    (c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

    ridge_model <- cv.glmnet(model.matrix(Apps ~ ., train_data)[,-1], train_data$Apps,
                             alpha = 0)
    
    ridge_pred <- predict(ridge_model,
                          s = ridge_model$lambda.min,
                          newx = model.matrix(Apps ~ ., test_data)[,-1])
    
    ridge_test_error <- mean((test_data$Apps - ridge_pred)^2)

    Explanation: Ridge regression is fit with cross-validation to select λ. Test error is calculated.

    (d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

    lasso_model_college <- cv.glmnet(model.matrix(Apps ~ ., train_data)[,-1], train_data$Apps,
                                     alpha = 1)
    
    lasso_pred_college <- predict(lasso_model_college,
                                  s = lasso_model_college$lambda.min,
                                  newx = model.matrix(Apps ~ ., test_data)[,-1])
    
    lasso_test_error_college <- mean((test_data$Apps - lasso_pred_college)^2)
    # Count non-zero coefficients at lambda.min, excluding the intercept
    nonzero_coefficients_lasso <- sum(coef(lasso_model_college,
                                           s = lasso_model_college$lambda.min)[-1] != 0)

    Explanation: Lasso regression selects important predictors. Test error and number of non-zero coefficients are reported.

    (e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

    pcr_model <- pcr(Apps ~ ., data=train_data, scale=TRUE,
                     validation="CV")
    
    pcr_pred <- predict(pcr_model,
                        test_data,
                        ncomp=which.min(pcr_model$validation$PRESS))
    
    pcr_test_error <- mean((test_data$Apps - pcr_pred)^2)

    Explanation: PCR selects the number of components via cross-validation. Test error is calculated for the optimal number of components.
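
    The value of M selected by cross-validation, along with the CV error curve, can be reported directly; the same pattern applies to the PLS fit in (f). A sketch:

    # Number of components selected by CV and the CV error curve
    M_pcr <- which.min(pcr_model$validation$PRESS)
    M_pcr
    validationplot(pcr_model, val.type = "MSEP")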

    (f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

    pls_model <- plsr(Apps ~ ., data=train_data,
                      scale=TRUE,
                      validation="CV")
    
    pls_pred <- predict(pls_model,
                        test_data,
                        ncomp=which.min(pls_model$validation$PRESS))
    
    pls_test_error <- mean((test_data$Apps - pls_pred)^2)

    Explanation: PLS regression uses cross-validation to select components. Test error is computed.

    (g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

    The goal of this exercise is to compare the performance of five regression approaches, namely linear regression, ridge regression, lasso regression, principal components regression (PCR), and partial least squares (PLS), in predicting the number of college applications received. Below are comments and insights based on the results:

    Test Errors from Different Methods
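
    Gathering the five test MSEs computed above into one table makes the comparison easier to read (a sketch using the error objects defined earlier; the figures quoted below come from one particular run):

    # Collect the test MSEs from the five fits for side-by-side comparison
    data.frame(
      method   = c("OLS", "Ridge", "Lasso", "PCR", "PLS"),
      test_mse = c(lm_test_error, ridge_test_error, lasso_test_error_college,
                   pcr_test_error, pls_test_error)
    )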

    1. Linear Regression (OLS):

      • Test Error: 1,108,531

      • OLS provides a baseline model without regularization or dimensionality reduction. It uses all predictors in their original form.

    2. Ridge Regression:

      • Test Error: 1,118,820

      • Ridge regression introduces a penalty on large coefficients to reduce overfitting. While it performs slightly worse than OLS in this case, it might be more stable in datasets with multicollinearity.

    3. Lasso Regression:

      • Test Error: 1,115,530

      • Lasso performs feature selection by shrinking some coefficients to zero. The test error is close to that of ridge regression but slightly better. It also reduces model complexity by selecting fewer predictors.

    4. Principal Components Regression (PCR):

      • Test Error: 1,182,625 (with optimal M components)

      • PCR reduces dimensionality by using principal components instead of the original predictors. However, it may discard some relevant information if important predictors are not captured in the first few components.

    5. Partial Least Squares (PLS):

      • Test Error: 1,156,900 (with optimal M components)

      • PLS also reduces dimensionality but focuses on maximizing covariance between predictors and the response variable. It performs better than PCR but worse than OLS, ridge, and lasso.

    Accuracy of Predictions

    • All models achieve comparable predictive accuracy: test MSEs of roughly 1.1–1.2 million correspond to root-mean-squared errors of about 1,050 to 1,090 applications.

    • The test errors are relatively close across all methods, suggesting that the dataset is not highly sensitive to regularization or dimensionality reduction.

    • The small differences in test errors indicate that the models are capturing most of the signal in the data.

    Comparison of Methods

    • OLS vs Regularization (Ridge and Lasso):

      • OLS has slightly lower test error compared to ridge and lasso in this case. This suggests that regularization does not significantly improve performance because overfitting is not a major issue with this dataset.

      • Lasso’s ability to perform feature selection makes it advantageous when interpretability or model simplicity is desired.

    • Dimensionality Reduction (PCR and PLS):

      • PCR and PLS have higher test errors compared to OLS, ridge, and lasso. This could be due to loss of information during dimensionality reduction or suboptimal selection of components.

      • PLS outperforms PCR because it directly incorporates the response variable into its component selection process.

    Key Takeaways

    1. The differences in test errors among the methods are not substantial, indicating that all approaches perform reasonably well for this dataset.

    2. Linear regression (OLS) achieves the lowest test error here, suggesting that regularization or dimensionality reduction may not be necessary for this specific problem.

    3. Ridge and lasso provide more robust alternatives when multicollinearity or overfitting is a concern.

    4. Dimensionality reduction methods like PCR and PLS may be less effective if the original predictors already have strong relationships with the response variable.

    In conclusion, while all methods predict the number of college applications with reasonable accuracy, OLS performs best in this scenario. However, lasso may still be preferred for its ability to simplify the model by selecting fewer predictors without significantly sacrificing accuracy.