2. For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

(a) The lasso, relative to least squares, is:

  i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
  ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
  iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
  iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

Answer: option iii. is correct.

The lasso adds a penalty on the sum of the absolute values of the coefficients to the least squares loss, so it is less flexible than least squares. The penalty shrinks the coefficient estimates and can set some of them exactly to zero, which performs variable selection and yields a model with fewer effective parameters.

In terms of the bias-variance tradeoff, a less flexible method such as the lasso typically has higher bias but lower variance than a more flexible method such as least squares. The lasso therefore outperforms least squares in prediction accuracy when its increase in bias is smaller than the corresponding decrease in variance, because the expected test mean squared error (MSE) then goes down overall. By accepting some bias through regularization, the lasso can substantially lower variance and limit overfitting.
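
The tradeoff can be written out using the standard decomposition of the expected test MSE at a point. The symbols $x_0$, $y_0$, $\hat{f}$, and $\varepsilon$ are notation assumed here for illustration, not taken from the answer above.

```latex
E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]
  = \mathrm{Var}\left(\hat{f}(x_0)\right)
  + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2
  + \mathrm{Var}(\varepsilon)
```

Moving from least squares to the lasso raises the squared-bias term and lowers the variance term, while $\mathrm{Var}(\varepsilon)$ is unaffected, which is why option iii. compares the two changes directly.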

(b) Repeat (a) for ridge regression relative to least squares.

Answer: option iii. is correct.

The only real difference here is the penalty. Ridge regression minimizes $\mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2$, so its shrinkage penalty is the sum of the squared coefficients, whereas the lasso penalizes the sum of their absolute values.

As a result, the lasso can shrink the coefficients of less useful features exactly to zero, whereas ridge regression only shrinks them toward zero. The rest of the argument still holds: shrinkage lowers variance at the expense of increased bias.
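
The difference between the two penalties is easiest to see in a simplified special case. The sketch below assumes the design matrix is the identity (n = p) and there is no intercept, so each least squares estimate is just y_j. The values of `y` and `lambda` are made up for illustration.

```r
# Special case: X is the identity matrix and there is no intercept,
# so the least squares estimate of each coefficient is y_j itself.
y <- c(-3, -0.4, 0.2, 1.5, 4)   # hypothetical least squares estimates
lambda <- 1                     # hypothetical penalty strength

# Ridge minimizes sum((y - b)^2) + lambda * sum(b^2):
# every coefficient is shrunk by the same factor and none becomes zero.
ridge_est <- y / (1 + lambda)

# Lasso minimizes sum((y - b)^2) + lambda * sum(abs(b)):
# soft-thresholding moves each coefficient toward zero by lambda / 2
# and sets it exactly to zero when |y_j| <= lambda / 2.
lasso_est <- sign(y) * pmax(abs(y) - lambda / 2, 0)

shrinkage <- data.frame(least_squares = y, ridge = ridge_est, lasso = lasso_est)
print(shrinkage)
```

In this toy setting the two small estimates are zeroed by the lasso but merely halved by ridge, which is the variable-selection behaviour described above.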

(c) Repeat (a) for non-linear methods relative to least squares.

Answer: option ii. is correct.

Non-linear methods are more flexible than least squares, so they have higher variance but can have lower bias. They give improved prediction accuracy when the increase in variance is smaller than the decrease in bias, which we can expect when the underlying relationship in the data is substantially non-linear.

9. In this exercise, we will predict the number of applications received using the other variables in the College data set.

(a) Split the data set into a training set and a test set.

Required libraries:

library(glmnet)
## Warning: package 'glmnet' was built under R version 4.3.3
## Loading required package: Matrix
## Loaded glmnet 4.1-8
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.3.2
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
attach(College)
set.seed(123)
#Splitting 70-30 ratio:
subset_split<-sample(nrow(College),nrow(College)*0.7)
train_data<-College[subset_split,]
test_data<-College[-subset_split,]

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

# Fit linear model using least squares on the training set
lm_model <- lm(Apps ~ ., data = train_data)
summary(lm_model)
## 
## Call:
## lm(formula = Apps ~ ., data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3097.8  -455.8   -46.5   343.8  6452.5 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -310.17331  481.30075  -0.644 0.519566    
## PrivateYes  -681.96465  164.08211  -4.156 3.78e-05 ***
## Accept         1.22130    0.05921  20.626  < 2e-16 ***
## Enroll         0.08046    0.21794   0.369 0.712155    
## Top10perc     49.33503    6.18296   7.979 9.31e-15 ***
## Top25perc    -16.11744    5.02717  -3.206 0.001428 ** 
## F.Undergrad    0.02284    0.03985   0.573 0.566831    
## P.Undergrad    0.03541    0.03529   1.003 0.316139    
## Outstate      -0.05446    0.02132  -2.555 0.010910 *  
## Room.Board     0.18967    0.05275   3.596 0.000354 ***
## Books          0.21366    0.28099   0.760 0.447381    
## Personal      -0.03685    0.07279  -0.506 0.612876    
## PhD           -6.00401    5.34580  -1.123 0.261897    
## Terminal      -5.01712    5.77787  -0.868 0.385609    
## S.F.Ratio     -2.18927   14.83898  -0.148 0.882766    
## perc.alumni   -8.01836    4.67330  -1.716 0.086792 .  
## Expend         0.07614    0.01340   5.681 2.23e-08 ***
## Grad.Rate     10.63461    3.38228   3.144 0.001760 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 992.3 on 525 degrees of freedom
## Multiple R-squared:  0.9175, Adjusted R-squared:  0.9148 
## F-statistic: 343.2 on 17 and 525 DF,  p-value: < 2.2e-16
# Make predictions on the test set
test_predictions <- predict(lm_model, test_data)

# Compute test error (mean squared error)
test_error <- mean((test_data$Apps - test_predictions)^2)
test_error
## [1] 1734841

The test MSE for the least squares model is 1734841.

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

Answer:

# alpha=0 for ridge regression model

# Create matrix for training set and validation set
train_data.mat <- model.matrix(Apps ~ ., data = train_data)
validation_data.mat <- model.matrix(Apps ~ ., data = test_data)

# Define grid covering all the range of lambda
grid <- 10^seq(4, -2, length = 100)

# Grid search: fit ridge at each lambda and score it on the held-out set.
# Note: this selects lambda by held-out (test) MSE, not by cross-validation.
mse <- rep(NA, length(grid))
for (i in seq_along(grid)) {
  ridge <- glmnet(train_data.mat, train_data$Apps, alpha = 0, lambda = grid[i], thresh = 1e-12)
  pred <- predict(ridge, s = grid[i], newx = validation_data.mat)
  mse[i] <- mean((test_data$Apps - pred)^2)
}

# Find the index of the lambda with the minimum MSE
best_lambda_index <- which.min(mse)
best_lambda <- grid[best_lambda_index]
best_lambda
## [1] 0.01
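# Refit at the selected lambda so the predictions do not depend on which
# model the loop happened to fit last (here that is also lambda = 0.01)
ridge <- glmnet(train_data.mat, train_data$Apps, alpha = 0, lambda = best_lambda, thresh = 1e-12)
# Get the predicted values on the test set using the ridge model
pred_test <- predict(ridge, s = best_lambda, newx = validation_data.mat)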

# Calculate Mean Square Error (MSE) on the test set
test_mse <- mean((test_data$Apps - pred_test)^2)
test_mse
## [1] 1734931

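The test MSE for ridge regression (1734931) is slightly higher than for least squares (1734841). The selected λ = 0.01 is the smallest value in the grid, so the ridge fit is nearly identical to the least squares fit.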
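
The exercise asks for λ chosen by cross-validation, whereas the grid search above scores each λ on the test set. A minimal sketch of the cross-validated version, assuming `cv.glmnet` from the glmnet package with 10 folds, is shown here. The names `train_idx`, `x_train`, `x_test`, `lambda_grid`, and `cv_ridge` are introduced for illustration, and the resulting numbers are not reported because they depend on the random fold assignment.

```r
library(glmnet)
library(ISLR)

set.seed(123)
# Same 70-30 split idea as above
train_idx <- sample(nrow(College), floor(nrow(College) * 0.7))
train <- College[train_idx, ]
test  <- College[-train_idx, ]

# Drop the intercept column; glmnet fits its own intercept
x_train <- model.matrix(Apps ~ ., data = train)[, -1]
x_test  <- model.matrix(Apps ~ ., data = test)[, -1]

lambda_grid <- 10^seq(4, -2, length.out = 100)

# 10-fold cross-validation on the training set only (alpha = 0 is ridge)
cv_ridge <- cv.glmnet(x_train, train$Apps, alpha = 0,
                      lambda = lambda_grid, nfolds = 10)
lambda_cv <- cv_ridge$lambda.min

# The test set is used once, after lambda has been fixed
pred_cv <- predict(cv_ridge, newx = x_test, s = "lambda.min")
ridge_cv_mse <- mean((test$Apps - pred_cv)^2)
print(c(lambda = lambda_cv, test_mse = ridge_cv_mse))
```

Choosing λ this way keeps the test set out of model selection, so the resulting test MSE is a fairer estimate of out-of-sample error.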

(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

# alpha=1 for lasso regression model

# Create matrix for training set and validation set
train_data_1.mat <- model.matrix(Apps ~ ., data = train_data)
validation_data_1.mat <- model.matrix(Apps ~ ., data = test_data)

# Define grid covering all the range of lambda
grid_1 <- 10^seq(4, -2, length = 100)

# Grid search: fit the lasso at each lambda and score it on the held-out set.
# Note: this selects lambda by held-out (test) MSE, not by cross-validation.
mse_1 <- rep(NA, length(grid_1))
for (i in seq_along(grid_1)) {
  ridge_1 <- glmnet(train_data_1.mat, train_data$Apps, alpha = 1, lambda = grid_1[i], thresh = 1e-12)
  pred_1 <- predict(ridge_1, s = grid_1[i], newx = validation_data_1.mat)
  mse_1[i] <- mean((test_data$Apps - pred_1)^2)
}

# Find the index of the lambda with the minimum MSE
best_lambda_index_1 <- which.min(mse_1)
best_lambda_1 <- grid_1[best_lambda_index_1]
best_lambda_1
## [1] 4.641589
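# Get the predicted values on the test set using the lasso model.
# Caution: ridge_1 is the last model fitted in the loop (lambda = 0.01), and a
# single-lambda glmnet fit returns its own predictions whatever s is passed.
pred_test_1 <- predict(ridge_1, s = best_lambda_1, newx = validation_data_1.mat)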

# Calculate Mean Square Error (MSE) on the test set
test_mse_1 <- mean((test_data$Apps - pred_test_1)^2)
test_mse_1
## [1] 1734857

The lasso test MSE (1734857) is slightly higher than the least squares test MSE (1734841) and slightly lower than the ridge test MSE (1734931). Because `ridge_1` holds the last model fitted in the loop, this figure and the coefficients below come from the λ = 0.01 fit rather than from a fit at the selected λ = 4.641589.

Overall, least squares has the lowest test MSE of the three models, although the differences are very small.

# Coefficient estimates of the lasso fit stored in ridge_1
coefficients_non_zero <- ridge_1$beta

print(coefficients_non_zero[coefficients_non_zero[,1]!=0,]) # extract the non-zero coefficients
##    PrivateYes        Accept        Enroll     Top10perc     Top25perc 
## -681.92813047    1.22128169    0.08049546   49.32924219  -16.11270195 
##   F.Undergrad   P.Undergrad      Outstate    Room.Board         Books 
##    0.02283484    0.03539740   -0.05444704    0.18965658    0.21358532 
##      Personal           PhD      Terminal     S.F.Ratio   perc.alumni 
##   -0.03682464   -6.00292876   -5.01673188   -2.18405765   -8.01780938 
##        Expend     Grad.Rate 
##    0.07614101   10.63317894
print(paste("Number of Non-zero Coefficients:", length(coefficients_non_zero[coefficients_non_zero[,1]!=0,])))
## [1] "Number of Non-zero Coefficients: 17"
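
For comparison, the non-zero coefficients can also be counted at a λ chosen by cross-validation. This is a sketch that assumes `cv.glmnet` with its default λ sequence and 10 folds. The names `x_all`, `y_all`, `cv_lasso`, `slopes`, and `nonzero` are introduced for illustration, and no output is shown because it depends on the random folds.

```r
library(glmnet)
library(ISLR)

set.seed(123)
train_idx <- sample(nrow(College), floor(nrow(College) * 0.7))

# Predictor matrix without the intercept column; glmnet adds its own intercept
x_all <- model.matrix(Apps ~ ., data = College)[, -1]
y_all <- College$Apps

# 10-fold cross-validation on the training rows only (alpha = 1 is the lasso)
cv_lasso <- cv.glmnet(x_all[train_idx, ], y_all[train_idx], alpha = 1, nfolds = 10)

# Coefficients at the cross-validated lambda, converted to a named vector
coef_vec <- as.matrix(coef(cv_lasso, s = "lambda.min"))[, 1]
slopes <- coef_vec[names(coef_vec) != "(Intercept)"]
nonzero <- slopes[slopes != 0]
print(paste("Number of non-zero coefficients:", length(nonzero)))

# Test error at the same lambda
pred_lasso <- predict(cv_lasso, newx = x_all[-train_idx, ], s = "lambda.min")
lasso_cv_mse <- mean((y_all[-train_idx] - pred_lasso)^2)
print(lasso_cv_mse)
```

Excluding the intercept from the count matches the convention used above, where only the 17 slope coefficients are reported.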

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

#Least Square model Accuracy
test_avg <- mean(test_data$Apps)
lm_lsmodel_accu <- 1 - mean((test_predictions - test_data$Apps)^2) / mean((test_avg - test_data$Apps)^2)
print(paste("Least Square Model R-Square:",lm_lsmodel_accu,"~",round(lm_lsmodel_accu*100, digits = 4)))
## [1] "Least Square Model R-Square: 0.924075933783536 ~ 92.4076"
#Ridge model Accuracy
ridge_model_accu <- 1 - mean((pred_test - test_data$Apps)^2) / mean((test_avg - test_data$Apps)^2)
print(paste("Ridge model R-Square: ", ridge_model_accu,"~",round(ridge_model_accu*100,digits = 4)))
## [1] "Ridge model R-Square:  0.924071999658342 ~ 92.4072"
#Lasso model Accuracy
lasso_model_accu <- 1 - mean((pred_test_1 - test_data$Apps)^2) / mean((test_avg - test_data$Apps)^2)
print(paste("Lasso model  R-Square: ", lasso_model_accu,"~",round(lasso_model_accu*100,digits=4)))
## [1] "Lasso model  R-Square:  0.924075231851842 ~ 92.4075"

All three models have a test R-squared of about 0.924, so each explains roughly 92.4% of the variation in the number of applications in the test set. The lasso has a marginally higher R-squared than ridge, consistent with its lower test MSE in 9(d), and least squares has the highest R-squared of the three, but the values differ only from the sixth decimal place onward.

The number of college applications can therefore be predicted fairly accurately, and there is no meaningful difference among the test errors of the three models.
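
To put the errors in the units of the response, the reported test MSEs can be converted to root mean squared errors and compared on a relative scale. The sketch below only reuses the three MSE values printed above. The names `test_mse_reported`, `rmse`, and `relative_gap` are introduced for illustration.

```r
# Test MSEs as reported in the output above
test_mse_reported <- c(least_squares = 1734841, ridge = 1734931, lasso = 1734857)

# RMSE is in the same units as Apps (number of applications)
rmse <- sqrt(test_mse_reported)
print(round(rmse, 1))

# Gap between each test MSE and the smallest one, in percent
relative_gap <- 100 * (test_mse_reported - min(test_mse_reported)) /
  min(test_mse_reported)
print(signif(relative_gap, 2))
```

A typical prediction error of roughly 1317 applications is shared by all three models, and the largest gap between their test MSEs is well under 0.01%.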