2A

The lasso, relative to least squares, is:

iii, Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

The lasso shrinks the coefficient estimates towards 0 (the amount of shrinkage depends on the tuning parameter lambda). This shrinkage reduces variance at the cost of increased bias. So, if the reduction in variance is greater than the increase in bias, the model's prediction accuracy improves over ordinary least squares.
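
To make the shrinkage concrete, here is a minimal sketch using glmnet on simulated data (x.sim, y.sim, and the lambda values are made up for this illustration, not part of the exercise): at a larger lambda the coefficients are pulled much closer to zero, and some are set exactly to zero.

library(glmnet)

# Simulated data: only the first three predictors carry signal (illustrative only)
set.seed(1)
x.sim <- matrix(rnorm(100 * 10), 100, 10)
y.sim <- x.sim[, 1] + 0.5 * x.sim[, 2] - 0.5 * x.sim[, 3] + rnorm(100)

lasso.sim <- glmnet(x.sim, y.sim, alpha = 1)
coef(lasso.sim, s = 0.01)  # small lambda: close to the least squares fit
coef(lasso.sim, s = 0.5)   # large lambda: heavily shrunken, several exact zeros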

2B

Ridge regression, relative to least squares, is:

iii, Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

For the purposes of this question, the same explanation applies to ridge regression: ridge also shrinks the coefficients towards zero. The key difference is that, unlike the lasso, ridge regression never shrinks a coefficient exactly to zero.
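
A quick way to see this difference is to fit both penalties at the same lambda on simulated data (this sketch mirrors the one in 2A and is purely illustrative): the ridge coefficients are shrunken but all nonzero, while the lasso sets several exactly to zero.

library(glmnet)

# Simulated data as in the sketch above (illustrative only)
set.seed(1)
x.sim <- matrix(rnorm(100 * 10), 100, 10)
y.sim <- x.sim[, 1] + 0.5 * x.sim[, 2] - 0.5 * x.sim[, 3] + rnorm(100)

coef(glmnet(x.sim, y.sim, alpha = 0), s = 0.5)  # ridge: shrunken, none exactly zero
coef(glmnet(x.sim, y.sim, alpha = 1), s = 0.5)  # lasso: several coefficients exactly zero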

2C

Non-linear methods, relative to least squares, are:

ii, More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

The goal of any non-linear method is to better approximate a non-linear function. Because it is not constrained to a linear form, the added flexibility reduces bias (it can fit the signal more closely) at the cost of increased variance. So, when the increase in variance is less than the decrease in bias, the model's prediction accuracy improves.
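
As a rough illustration (simulated data and a degree-5 polynomial chosen just for this sketch, not part of the exercise), the code below compares a linear fit to a more flexible fit on data with a non-linear signal; the flexible fit wins on test MSE because its drop in bias outweighs its added variance.

# Illustrative sketch on simulated data: a flexible fit beats a linear fit
# when the true relationship is non-linear.
set.seed(1)
x.sim <- runif(200, -2, 2)
y.sim <- sin(2 * x.sim) + rnorm(200, sd = 0.3)
train.id <- sample(200, 150)

lin.fit  <- lm(y.sim ~ x.sim, subset = train.id)
poly.fit <- lm(y.sim ~ poly(x.sim, 5), subset = train.id)

newdat <- data.frame(x.sim = x.sim[-train.id])
mean((y.sim[-train.id] - predict(lin.fit, newdat))^2)   # linear fit: large bias, higher test MSE here
mean((y.sim[-train.id] - predict(poly.fit, newdat))^2)  # flexible fit: lower bias, lower test MSE here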

9A

Split the data into training and test sets.

library(ISLR2)
library(tidyverse)
library(caret)
library(leaps)
library(glmnet)
college <- College
set.seed(1)
idx <- sample(nrow(college), floor(nrow(college) * 0.8), replace = FALSE)
train <- college[idx,]
test <- college[-idx,]

9B

Fit a linear model to predict number of applications.

lm1 <- lm(Apps ~ ., data = train)
lm1.predict <- predict(lm1,test)
lm1.mse <- mean((test$Apps-lm1.predict)^2)
lm1.mse
## [1] 1567324

9C

Ridge Regression Model

X <- model.matrix(Apps ~ ., college)[,-1]
y <- college$Apps

Create a grid for the lambda hyperparameter and fit the model. Setting alpha = 0 indicates ridge regression, and glmnet standardizes the predictors by default. Then use cross-validation on the training rows to find the best value of lambda, which turns out to be about 363.

grid <- 10^seq(10, -2, length = 100)
ridge.mod <- glmnet(X, y, alpha = 0, lambda = grid)
cv.out <- cv.glmnet(X[idx,],y[idx], alpha = 0)
bestlam <- cv.out$lambda.min
bestlam
## [1] 362.9786

We find a test MSE of 1,321,993, which is lower than that of the OLS model.

ridge.predict <- predict(ridge.mod, s = bestlam, newx = X[-idx,])
mean((ridge.predict - y[-idx])^2)
## [1] 1321993

9D

We repeat the previous steps, but with the lasso (alpha = 1) instead of ridge regression. The test MSE is 1,364,268, which is higher than the ridge model's but still lower than the OLS model's.

grid <- 10^seq(10, -2, length = 100)
lasso.mod <- glmnet(X, y, alpha = 1, lambda = grid)
cv.out <- cv.glmnet(X[idx,],y[idx], alpha = 1)
bestlam <- cv.out$lambda.min
lasso.predict <- predict(lasso.mod, s = bestlam, newx = X[-idx,])
mean((lasso.predict - y[-idx])^2)
## [1] 1364268

None of the lasso coefficients are exactly 0 at this lambda; however, the coefficients for F.Undergrad, P.Undergrad, perc.alumni, and Outstate are smaller than 0.1 in absolute value.

out <- glmnet(X, y, alpha = 1, lambda = grid)
lasso.coef <- predict(out , type = "coefficients", s = bestlam)[1:18,]
lasso.coef
##   (Intercept)    PrivateYes        Accept        Enroll     Top10perc 
## -470.17160404 -491.28468659    1.57028220   -0.76142448   48.14396937 
##     Top25perc   F.Undergrad   P.Undergrad      Outstate    Room.Board 
##  -12.85966706    0.04199054    0.04404111   -0.08323532    0.14947944 
##         Books      Personal           PhD      Terminal     S.F.Ratio 
##    0.01511191    0.02900213   -8.40624357   -3.26090599   14.54506840 
##   perc.alumni        Expend     Grad.Rate 
##   -0.03647543    0.07712303    8.28906847

9G

There is little difference between the three models tested. Using the summary() function, we see that the basic linear model has an adjusted R-squared of 0.93 on the training data, indicating strong predictive power.

summary(lm1)$adj.r.squared
## [1] 0.9328276
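
For a like-for-like comparison on the held-out data, one could also compute the test R-squared of each model from the predictions generated above; test.r2 below is a hypothetical helper, not part of the original code, and test$Apps is the same vector as y[-idx].

# Hypothetical helper: R^2 on the held-out observations
test.r2 <- function(pred, actual) {
  1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)
}
test.r2(lm1.predict, test$Apps)     # OLS
test.r2(ridge.predict, test$Apps)   # ridge
test.r2(lasso.predict, test$Apps)   # lasso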