(a) Use the rnorm() function to generate a predictor X of length n = 100, as well as a noise vector ε of length n = 100.

set.seed(1234)
x <- rnorm(100, 10, 4) 
eps <- rnorm(100, 0, 2)

(b) Generate a response vector Y of length n = 100 according to the model Y = β0 + β1X + β2X2 + β3X3 + ε, where β0, β1, β2, and β3 are constants of your choice.

y <- 1 + 1 * x + 0.3 * x ^ 2 + 0.5 * x ^ 3 + eps

(c) Use the regsubsets() function to perform best subset selection in order to choose the best model containing the predictors X, X2, …, X10. What is the best model obtained according to Cp, BIC, and adjusted R2? Show some plots to provide evidence for your answer, and report the coefficients of the best model obtained.

library(leaps)
# Create a data frame with predictor variables up to degree 10
data_df <- data.frame(y = y, x = x, x2 = x^2, x3 = x^3, x4 = x^4, x5 = x^5,
                  x6 = x^6, x7 = x^7, x8 = x^8, x9 = x^9, x10 = x^10)
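
The same degree-10 design can also be built in a single call with poly(); a brief sketch (raw = TRUE reproduces the plain powers used above, the columns come out named X1–X10 rather than x, x2, …, and data_df_alt is just an illustrative name):

# Equivalent construction of the degree-10 polynomial design in one call
data_df_alt <- data.frame(y = y, poly(x, 10, raw = TRUE))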

# Perform best subset selection regression
model_bss <- regsubsets(y ~ ., data = data_df, nvmax = 10)

# Graphical setup for optimal models summary
output_matrix <- summary(model_bss)
output_matrix
## Subset selection object
## Call: regsubsets.formula(y ~ ., data = data_df, nvmax = 10)
## 10 Variables  (and intercept)
##     Forced in Forced out
## x       FALSE      FALSE
## x2      FALSE      FALSE
## x3      FALSE      FALSE
## x4      FALSE      FALSE
## x5      FALSE      FALSE
## x6      FALSE      FALSE
## x7      FALSE      FALSE
## x8      FALSE      FALSE
## x9      FALSE      FALSE
## x10     FALSE      FALSE
## 1 subsets of each size up to 10
## Selection Algorithm: exhaustive
##           x   x2  x3  x4  x5  x6  x7  x8  x9  x10
## 1  ( 1 )  " " " " "*" " " " " " " " " " " " " " "
## 2  ( 1 )  " " "*" "*" " " " " " " " " " " " " " "
## 3  ( 1 )  "*" "*" "*" " " " " " " " " " " " " " "
## 4  ( 1 )  "*" "*" "*" " " "*" " " " " " " " " " "
## 5  ( 1 )  "*" "*" " " "*" "*" "*" " " " " " " " "
## 6  ( 1 )  " " " " "*" "*" "*" "*" " " "*" "*" " "
## 7  ( 1 )  " " " " "*" " " "*" "*" "*" "*" "*" "*"
## 8  ( 1 )  " " " " "*" "*" "*" "*" "*" "*" "*" "*"
## 9  ( 1 )  "*" "*" "*" "*" "*" "*" "*" "*" "*" " "
## 10  ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
summary(model_bss)$cp
##  [1] 889.6332691   5.3975099  -0.1084243   1.8664370   2.7088600   3.6145080
##  [7]   5.3078260   7.2507023   9.0100439  11.0000000
# Extract the summary of the model
model_summary <- summary(model_bss)

# Plot Cp against the number of covariates
plot(x = 1:10, y = model_summary$cp, xlab = "Number of Covariates", 
     ylab = "C_p", type = "l")

summary(model_bss)$bic
##  [1]  -961.7926 -1186.6031 -1189.8497 -1185.2719 -1181.9348 -1178.5434
##  [7] -1174.2811 -1169.7399 -1165.4047 -1160.8108
# Extract the summary of the model
model_summary <- summary(model_bss)

# Plot BIC against the number of covariates
plot(x = 1:10, y = model_summary$bic, xlab = "Number of Covariates", 
     ylab = "BIC", type = "l")

summary(model_bss)$adjr2
##  [1] 0.9999387 0.9999938 0.9999942 0.9999941 0.9999941 0.9999941 0.9999941
##  [8] 0.9999940 0.9999940 0.9999939
# Extract adjusted R-squared values from the model summary
adj_r_squared <- summary(model_bss)$adjr2

# Plot adjusted R-squared values against the number of covariates
matplot(x = 1:10, y = adj_r_squared, type = "l", xlab = "Number of Covariates", 
        ylab = "Adjusted R-Squared", col = "blue", lwd = 2)

Judging from the plots and the summary values above, the three-variable model containing x, x2, and x3 is the best: it attains the smallest Cp and BIC and the largest adjusted R2, and it matches the true data-generating model.
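
Its coefficients can be reported with coef() on the regsubsets object (the second argument is the model size):

# Coefficients of the selected three-variable model
coef(model_bss, 3)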

(d) Repeat (c), using forward stepwise selection and also using backwards stepwise selection. How does your answer compare to the results in (c)?

# Perform forward selection
forward_model <- regsubsets(y ~ ., data = data_df, method = "forward", nvmax = 10)

# Perform backward elimination
backward_model <- regsubsets(y ~ ., data = data_df, method = "backward", nvmax = 10)

# Extract summary information for forward selection
forward_summary <- summary(forward_model)

# Extract summary information for backward elimination
backward_summary <- summary(backward_model)

# Extract Cp, BIC, and Adjusted R-squared values for forward selection
forward_cp <- forward_summary$cp
forward_bic <- forward_summary$bic
forward_adjr2 <- forward_summary$adjr2

# Extract Cp, BIC, and Adjusted R-squared values for backward elimination
backward_cp <- backward_summary$cp
backward_bic <- backward_summary$bic
backward_adjr2 <- backward_summary$adjr2

# Combine Cp, BIC, and Adjusted R-squared values into a data frame for backward elimination
backward_metrics <- data.frame(cp = backward_cp, BIC = backward_bic, Adjr2 = backward_adjr2)

# Plot Cp, BIC, and Adjusted R-squared values for forward selection
plot(x = 1:10, y = forward_cp, xlab = "Number of Covariates", ylab = "C_p", type = "l")

plot(x = 1:10, y = forward_bic, xlab = "Number of Covariates", ylab = "BIC", type = "l")

plot(x = 1:10, y = forward_adjr2, xlab = "Number of Covariates", ylab = "Adjusted R-Squared", type = "l")

# Plot Cp, BIC, and Adjusted R-squared values for backward elimination
plot(x = 1:10, y = backward_cp, xlab = "Number of Covariates", ylab = "C_p", type = "l")

plot(x = 1:10, y = backward_bic, xlab = "Number of Covariates", ylab = "BIC", type = "l")

plot(x = 1:10, y = backward_adjr2, xlab = "Number of Covariates", ylab = "Adjusted R-Squared", type = "l")

Backward elimination appears to lean towards a somewhat more complex model than forward or best subset selection. This shows up as a slightly larger rise in adjusted R-squared, and somewhat larger drops in BIC and Cp, when moving from two to three covariates. Consequently, a model with four covariates might be favored under backward elimination, while a model with two covariates could be preferred under forward selection.
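
To make the comparison concrete, the variables retained by each method can be extracted at, say, the BIC-optimal size (a sketch using the objects defined above):

# Variables and coefficients selected by each method at its BIC-optimal size
coef(forward_model, which.min(forward_bic))
coef(backward_model, which.min(backward_bic))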

(e) Now fit a LASSO model to the simulated data, again using X, X2, …, X10 as predictors. Use cross-validation to select the optimal value of λ. Create plots of the cross-validation error as a function of λ. Report the resulting coefficient estimates, and discuss the results obtained.

library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
# Set seed for reproducibility
set.seed(1234)

# Create the predictor matrix
xLASSO <- as.matrix(data_df[, -1])
y <- data_df[, 1]

# Perform cross-validation to find the best lambda value
cv_lasso <- cv.glmnet(xLASSO, y, alpha = 1)

# Best lambda value and corresponding CV error
best_lambda <- cv_lasso$lambda.min
cv_errors <- cv_lasso$cvm

# Experiment with a wider range of lambdas
cv_lasso2 <- cv.glmnet(xLASSO, y, alpha = 1, lambda = 10^seq(10, -15, length.out = 1000))
## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values
## (warning repeated ten times)
# Best lambda from the extended range
best_lambda2 <- cv_lasso2$lambda.min

# Fit the LASSO model with the best lambda
modLASSO <- glmnet(xLASSO, y, alpha = 1, lambda = best_lambda2)

# Extract coefficients for the best lambda
coefLASSO <- predict(modLASSO, s = best_lambda2, type = "coefficients")

# Plot CV errors vs. lambdas to examine structure near minimum
plot(y = cv_lasso2$cvm[500:800], x = cv_lasso2$lambda[500:800], type = "l")
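
glmnet also provides a built-in plot method for cv.glmnet objects that draws the cross-validation error against log(λ) with error bars, which gives the plot requested by the question without manual indexing:

# Built-in plot of cross-validation error versus log(lambda)
plot(cv_lasso)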

# Report the coefficient estimates
print(coefLASSO)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept)  3.368122e+01
## x           -2.351239e+01
## x2           5.204469e+00
## x3           1.617784e-01
## x4           4.859991e-03
## x5           1.315058e-04
## x6           2.914587e-06
## x7           3.308137e-08
## x8          -1.468830e-09
## x9          -1.635593e-10
## x10         -1.019664e-11

These are the coefficient estimates from the fitted LASSO (Least Absolute Shrinkage and Selection Operator) model. LASSO is a regularization technique for linear regression that penalizes the absolute size of the coefficients, encouraging sparsity and effectively performing variable selection by shrinking some coefficients exactly to zero.

Reading the output: the first column lists the intercept and the predictors, and the s1 column gives the estimated coefficient for each. The intercept is the estimated mean response when all predictors are zero; the coefficient on x is the estimated change in the response for a one-unit increase in x, and the coefficients on x2 through x10 are the estimated changes associated with the corresponding powers of x.

Because the λ selected from the extended grid is essentially zero, the penalty is negligible here and none of the coefficients are shrunk exactly to zero: the fit is close to ordinary least squares on a highly collinear polynomial basis, which is why the estimates for x, x2, and x3 differ from the true values of 1, 0.3, and 0.5. The higher-order terms x4 through x10 do receive very small coefficients (roughly 1e-3 down to 1e-11), so they contribute almost nothing to the fitted values. With a larger penalty, such as the lambda.min chosen by the default cv.glmnet grid, the lasso would typically set most of these higher-order coefficients exactly to zero, retaining only the lower-order terms that actually drive the response.
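
A quick way to see that behavior is to look at the coefficients at the lambda.min chosen by the first, default-grid cv.glmnet fit (cv_lasso above); output not shown here:

# Coefficients at the default-grid lambda.min; at this penalty most higher-order terms are typically exactly zero
coef(cv_lasso, s = "lambda.min")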

(f) Now generate a response vector Y according to the model Y = β0 + β7X7 + ε, and perform best subset selection and the lasso. Discuss the results obtained.

# Define the new response variable y2
y2 <- 1 + 0.5 * x ^ 7 + eps

# Create a copy of df1
data_df2 <- data_df

# Add y2 to the dataframe
data_df2$y2 <- y2

# Remove the original y variable
data_df2 <- data_df2[, -1]

# Perform best subsets selection
set.seed(1234)
model_bss2 <- regsubsets(y2 ~ ., data = data_df2, nvmax = 10)

# Visualize the results with plots
par(mfrow = c(1, 3))  # Arrange plots in a 1x3 grid
plot(x = 1:10, y = summary(model_bss2)$cp, xlab = "Number of Covariates", 
     ylab = "C_p", type = "l")
plot(x = 1:10, y = summary(model_bss2)$bic, xlab = "Number of Covariates", 
     ylab = "BIC", type = "l")
plot(x = 1:10, y = summary(model_bss2)$adjr2, xlab = "Number of Covariates", 
     ylab = "Adjusted R-Squared", type = "l")

The plots suggest that, among the models examined using best subset selection, the one-variable model containing only the predictor ‘x7’ performs notably better according to the BIC and Cp criteria. For the lasso, the natural follow-up is whether ‘x7’ is likewise the main predictor retained while the others are shrunk towards zero.
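
The coefficients of that one-variable model can be reported directly from the regsubsets fit:

# Coefficients of the best one-variable model
coef(model_bss2, 1)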

# Fit LASSO model and perform cross-validation
lasso_cv <- cv.glmnet(x = xLASSO, y = y2, alpha = 1, lambda = 10^seq(10, -15, length.out = 1000))
## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values
## (warning repeated ten times)
# Find the lambda that minimizes cross-validation error
best_lambda <- lasso_cv$lambda[which.min(lasso_cv$cvm)]

# Extract coefficient estimates for the best lambda
lasso_coefs <- predict(glmnet(x = xLASSO, y = y2, alpha = 1), s = best_lambda, type = "coefficients")

# Plot cross-validation error vs. lambda
plot(x = lasso_cv$lambda[580:800], y = lasso_cv$cvm[580:800], type = 'l',
     ylab = "Cross Validation Error", xlab = "Lambda")

The plot suggests that the best model selected through cross-validation includes nonzero coefficients for the predictors ‘x6’ and ‘x8’. Given that only a single covariate truly drives the response, the lasso is expected to perform well here because it can select the important predictor while shrinking the rest towards zero; ridge regression, which never sets coefficients exactly to zero, would be less well suited to this scenario. Best subset selection, by contrast, chose its model using Cp, BIC, and adjusted R2 rather than a validation-set or cross-validation estimate of test error, although cross-validation could also be used to choose the subset size.

9. In this exercise, we will predict the number of applications received using the other variables in the College data set.

(a) Split the data set into a training set and a test set.

library(ISLR2)

# Load the College dataset
data(College)

# Set a seed for reproducibility
set.seed(1234)

# Determine the number of rows in the dataset
n_rows <- nrow(College)

# Set the proportion of data to be used for training
train_prop <- 0.8  # 80% of data for training, 20% for testing

# Calculate the number of rows for training
n_train <- round(train_prop * n_rows)

# Create a vector of indices
indices <- sample(1:n_rows)

# Select indices for training data
train_indices <- indices[1:n_train]

# Select indices for testing data
test_indices <- indices[(n_train + 1):n_rows]

# Create training and testing datasets
train_data <- College[train_indices, ]
test_data <- College[test_indices, ]

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

# Fit a linear model using least squares on the training set
lm_model <- lm(Apps ~ ., data = train_data)

# Make predictions on the test set
predictions <- predict(lm_model, newdata = test_data)

# Calculate the test error (Root Mean Squared Error)
test_error <- sqrt(mean((test_data$Apps - predictions)^2))

# Report the test error
print(paste("Test Error (RMSE):", test_error))
## [1] "Test Error (RMSE): 962.845410406067"

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

# Load the glmnet package
library(glmnet)

# Fit a ridge regression model with lambda chosen by cross-validation
ridge_model <- cv.glmnet(x = as.matrix(train_data[, -1]), y = train_data$Apps, alpha = 0)

# Choose the lambda that minimizes the cross-validated error
best_lambda <- ridge_model$lambda.min

# Make predictions on the test set
predictions <- predict(ridge_model, newx = as.matrix(test_data[, -1]), s = best_lambda)

# Calculate the test error (Root Mean Squared Error)
test_error <- sqrt(mean((test_data$Apps - predictions)^2))

# Report the test error
print(paste("Test Error (RMSE) for Ridge Regression:", test_error))
## [1] "Test Error (RMSE) for Ridge Regression: 428.879427144138"

(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

# Load the glmnet package
library(glmnet)

# Fit a LASSO model with lambda chosen by cross-validation
lasso_model <- cv.glmnet(x = as.matrix(train_data[, -1]), y = train_data$Apps, alpha = 1)

# Choose the lambda that minimizes the cross-validated error
best_lambda <- lasso_model$lambda.min

# Make predictions on the test set
predictions <- predict(lasso_model, newx = as.matrix(test_data[, -1]), s = best_lambda)

# Calculate the test error (Root Mean Squared Error)
test_error <- sqrt(mean((test_data$Apps - predictions)^2))

# Get the number of non-zero coefficient estimates
non_zero_coefficients <- sum(coef(lasso_model, s = best_lambda) != 0)

# Report the test error and number of non-zero coefficients
print(paste("Test Error (RMSE) for LASSO Regression:", test_error))
## [1] "Test Error (RMSE) for LASSO Regression: 100.451276357403"
print(paste("Number of Non-zero Coefficients:", non_zero_coefficients))
## [1] "Number of Non-zero Coefficients: 2"

(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
# Fit a PCR model with cross-validation
pcr_model <- pcr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")

# Extract RMSEP values
rmsep_values <- RMSEP(pcr_model)

# Initialize variables to store minimum RMSEP and its index
min_rmsep <- Inf
min_rmsep_index <- NULL

# Iterate through the list of RMSEP values
for (i in 1:length(rmsep_values)) {
  tryCatch({
    # Update minimum RMSEP and its index if a smaller RMSEP is found
    if (min(as.numeric(rmsep_values[[i]])) < min_rmsep) {
      min_rmsep <- min(as.numeric(rmsep_values[[i]]))
      min_rmsep_index <- i
    }
  }, error = function(e) {
    # Handle error gracefully
    cat("An error occurred:", conditionMessage(e), "\n")
  })
}
## Warning in doTryCatch(return(expr), name, parentenv, handler): NAs introduced
## by coercion
## An error occurred: missing value where TRUE/FALSE needed 
## An error occurred: 'language' object cannot be coerced to type 'double'
# Optimal number of components (M)
optimal_components <- min_rmsep_index

# Make predictions on the test set with the optimal number of components
predictions <- predict(pcr_model, newdata = test_data, ncomp = optimal_components)

# Calculate the test error (Root Mean Squared Error)
test_error <- sqrt(mean((test_data$Apps - predictions)^2))

# Report the test error and the value of M selected by cross-validation
print(paste("Test Error (RMSE) for PCR Model:", test_error))
## [1] "Test Error (RMSE) for PCR Model: 1655.49650887646"
print(paste("Optimal Number of Components (M) selected by Cross-Validation:", optimal_components))
## [1] "Optimal Number of Components (M) selected by Cross-Validation: 3"

(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

# Fit a PLS model with cross-validation
pls_model <- plsr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")

# Extract RMSEP values
rmsep_values <- RMSEP(pls_model)

# Initialize variables to store minimum RMSEP and its index
min_rmsep <- Inf
optimal_components <- NULL

# Iterate through the list of RMSEP values
for (i in seq_along(rmsep_values)) {
  current_rmsep <- unname(unlist(rmsep_values[[i]][[1]]))  # Extract the RMSEP value
  if (current_rmsep < min_rmsep) {
    min_rmsep <- current_rmsep
    optimal_components <- i
  }
}

# Make predictions on the test set with the optimal number of components
predictions <- predict(pls_model, newdata = test_data, ncomp = optimal_components)

# Calculate the test error (Root Mean Squared Error)
test_error <- sqrt(mean((test_data$Apps - predictions)^2))

# Report the test error and the value of M selected by cross-validation
print(paste("Test Error (RMSE) for PLS Model:", test_error))
## [1] "Test Error (RMSE) for PLS Model: 1245.97569978583"
print(paste("Optimal Number of Components (M) selected by Cross-Validation:", optimal_components))
## [1] "Optimal Number of Components (M) selected by Cross-Validation: 3"

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

Based on the provided test errors for each approach:

PLS Model: The root mean squared error (RMSE) for the PLS model is 1245.98, with 3 optimal components selected by cross-validation. This indicates that, on average, the predictions deviate from the actual number of college applications by approximately 1246.

PCR Model: The RMSE for the PCR model is 1655.50, also with 3 optimal components selected by cross-validation. PCR performs noticeably worse than PLS, suggesting that the components identified by PLS, which are constructed with the response in mind, capture the predictive structure of the data more effectively.

LASSO Regression: The RMSE for LASSO regression is notably lower at 100.45, with only 2 non-zero coefficients in the model. This suggests that LASSO’s feature selection mechanism effectively identifies the most important predictors, leading to improved prediction accuracy.

Ridge Regression: The RMSE for ridge regression is 428.88. Although it is higher than that of LASSO regression, it is still substantially lower than the errors of the PCR and PLS models. Ridge regression’s regularization helps mitigate multicollinearity and overfitting, resulting in better prediction performance than PCR.

Linear Regression: The RMSE for linear regression is 962.85, which is higher than both LASSO and ridge regression but lower than PCR and PLS. Linear regression provides a baseline for comparison, indicating the predictive performance achievable without regularization or dimensionality reduction techniques.

In summary, LASSO regression achieves the lowest test error among the approaches, indicating its superior predictive performance in this scenario. The differences in test errors among the approaches are substantial, suggesting that the choice of modeling technique significantly impacts the accuracy of predicting the number of college applications received. LASSO regression’s feature selection capability likely contributes to its strong performance, while PCR and PLS provide alternative ways of capturing the underlying structure of the data.