The lasso, relative to least squares, is:
iii, Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
Lasso shrinks coefficients towards 0 (the amount of shrinkage depends on the tuning parameter lambda). This shrinkage reduces variance at the cost of increased bias. So, if the reduction in variance is greater than the increase in bias, the model’s prediction accuracy improves over ordinary least squares.
Ridge regression, relative to least squares, is:
iii, Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
For the purposes of this question, the same explanation applies to ridge regression: it also shrinks the coefficients towards zero. The key difference is that ridge regression never shrinks a coefficient all the way to exactly zero, whereas the lasso can.
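As a quick illustration of this difference, the sketch below fits ridge and lasso on simulated data (the data, seed, and lambda value are arbitrary choices for illustration only) and counts how many coefficients each method sets exactly to zero.
# Illustrative sketch: ridge vs. lasso shrinkage on simulated data
library(glmnet)
set.seed(1)
n <- 100; p <- 10
Xsim <- matrix(rnorm(n * p), n, p)
ysim <- 3 * Xsim[, 1] - 2 * Xsim[, 2] + rnorm(n)  # only two truly nonzero coefficients
ridge.sim <- glmnet(Xsim, ysim, alpha = 0, lambda = 0.5)
lasso.sim <- glmnet(Xsim, ysim, alpha = 1, lambda = 0.5)
sum(coef(ridge.sim)[-1] == 0)  # ridge leaves no coefficient exactly at zero
sum(coef(lasso.sim)[-1] == 0)  # lasso sets several coefficients exactly to zero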
Non-linear methods, relative to least squares, are:
ii, More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
The goal of any non-linear method is to better approximate a non-linear function or distribution. Freed from the constraint of linearity, these methods reduce bias (fitting the underlying signal more closely) at the cost of increased variance. So, when the increase in variance is less than the decrease in bias, the model improves on least squares.
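As a small illustration (simulated data, not part of the assignment), a smoothing spline fit to data generated from a sine curve captures the curvature that a straight-line fit misses, which typically shows up as a lower test MSE.
# Illustrative sketch: linear vs. non-linear fit on simulated curved data
set.seed(1)
xs <- runif(200, 0, 10)
ys <- sin(xs) + rnorm(200, sd = 0.3)
id.tr <- sample(200, 150)
lin.fit <- lm(ys ~ xs, subset = id.tr)
spl.fit <- smooth.spline(xs[id.tr], ys[id.tr])
mean((ys[-id.tr] - predict(lin.fit, data.frame(xs = xs[-id.tr])))^2)  # linear test MSE
mean((ys[-id.tr] - predict(spl.fit, xs[-id.tr])$y)^2)                 # spline test MSE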
Split the data into training and test sets.
library(ISLR2)
library(tidyverse)
library(caret)
library(leaps)
library(glmnet)
college <- College
set.seed(1)
idx <- sample(nrow(college), nrow(college) * 0.8, replace = FALSE)
train <- college[idx, ]
test <- college[-idx, ]
Fit a linear model to predict number of applications.
lm1 <- lm(Apps ~ ., data = train)
lm1.predict <- predict(lm1,test)
lm1.mse <- mean((test$Apps-lm1.predict)^2)
lm1.mse
## [1] 1567324
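Since caret is already loaded, the same linear model could also be assessed with resampling rather than a single hold-out split. The sketch below is an optional extra; the 10-fold cross-validation setup is assumed here, not taken from the original analysis.
# Optional sketch: cross-validated RMSE for the same OLS model via caret
ctrl <- trainControl(method = "cv", number = 10)
lm1.cv <- train(Apps ~ ., data = train, method = "lm", trControl = ctrl)
lm1.cv$results$RMSE^2  # squared RMSE, comparable to the test MSE above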
Ridge Regression Model
X <- model.matrix(Apps ~ ., college)[,-1]
y <- college$Apps
Create a grid for the lambda hyperparameter and fit the model. Setting alpha = 0 specifies ridge regression. The glmnet function standardizes the predictors automatically. Then use cross-validation to find the best value of lambda, which comes out to roughly 363.
grid <- 10^seq(10, -2, length = 100)
ridge.mod <- glmnet(X, y, alpha = 0, lambda = grid)
cv.out <- cv.glmnet(X[idx,],y[idx], alpha = 0)
bestlam <- cv.out$lambda.min
bestlam
## [1] 362.9786
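To see how this value was chosen, the cross-validation curve can be plotted (an optional extra plot, not in the original output); the dashed line marks the selected lambda.
# Optional: visualize cross-validation error across the lambda path
plot(cv.out)
abline(v = log(bestlam), lty = 2)  # lambda selected by cross-validation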
We find a test MSE of 1,321,993, which is lower than that of the OLS model.
ridge.predict <- predict(ridge.mod, s = bestlam, newx = X[-idx,])
mean((ridge.predict - y[-idx])^2)
## [1] 1321993
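As an extra check of the shrinkage (mirroring the coefficient extraction done for the lasso below), the ridge coefficients at the selected lambda can be pulled out; all of them remain nonzero.
# Ridge coefficients at the CV-selected lambda: shrunk, but none exactly zero
ridge.coef <- predict(ridge.mod, type = "coefficients", s = bestlam)[1:18, ]
ridge.coef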
We repeat the previous steps with the lasso instead of ridge regression. The test MSE is 1,364,268, slightly higher than the ridge model’s but still below the OLS model’s.
grid <- 10^seq(10, -2, length = 100)
lasso.mod <- glmnet(X, y, alpha = 1, lambda = grid)
cv.out <- cv.glmnet(X[idx,],y[idx], alpha = 1)
bestlam <- cv.out$lambda.min
lasso.predict <- predict(lasso.mod, s = bestlam, newx = X[-idx,])
mean((lasso.predict - y[-idx])^2)
## [1] 1364268
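The lasso coefficient paths can also be plotted to watch coefficients drop to zero as the penalty grows (an optional extra plot).
# Optional: lasso coefficient paths versus log(lambda)
plot(lasso.mod, xvar = "lambda", label = TRUE)
abline(v = log(bestlam), lty = 2)  # lambda selected by cross-validation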
None of the lasso coefficients are exactly 0 at this value of lambda; however, F.Undergrad, P.Undergrad, perc.alumni, and Outstate (among others) have magnitudes below 0.1. A quick programmatic check follows the coefficient output below.
out <- glmnet(X, y, alpha = 1, lambda = grid)
lasso.coef <- predict(out , type = "coefficients", s = bestlam)[1:18,]
lasso.coef
## (Intercept) PrivateYes Accept Enroll Top10perc
## -470.17160404 -491.28468659 1.57028220 -0.76142448 48.14396937
## Top25perc F.Undergrad P.Undergrad Outstate Room.Board
## -12.85966706 0.04199054 0.04404111 -0.08323532 0.14947944
## Books Personal PhD Terminal S.F.Ratio
## 0.01511191 0.02900213 -8.40624357 -3.26090599 14.54506840
## perc.alumni Expend Grad.Rate
## -0.03647543 0.07712303 8.28906847
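The quick check promised above: count how many coefficients are exactly zero and list the predictors whose coefficients are below 0.1 in absolute value.
sum(lasso.coef == 0)                     # coefficients set exactly to zero
names(which(abs(lasso.coef[-1]) < 0.1))  # predictors with |coefficient| < 0.1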
There is little difference among the three models tested; a side-by-side comparison of their test MSEs follows the output below. Using the summary() function, we see that the basic linear model has an adjusted R-squared of 0.93, indicating strong predictive power.
summary(lm1)$adj.r.squared
## [1] 0.9328276
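Finally, the side-by-side comparison referenced above, collecting the three test MSEs already computed.
# Gather the three test MSEs for comparison
data.frame(model = c("OLS", "Ridge", "Lasso"),
           test.MSE = c(lm1.mse,
                        mean((ridge.predict - y[-idx])^2),
                        mean((lasso.predict - y[-idx])^2)))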