For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer. (a) The lasso, relative to least squares, is:
Lasso improves prediction accuracy and reduces variance by utilizing the penalty term of λ that penalizes large coefficients and makes the model more simplistic. The penalty term λ controls how much we shrink the coefficients in the model. The closer λ is to 0, the more flexible it becomes and acts more like least squares regression. Getting an ideal value for λ decreases our variance and we gain bias.
We get the same answer as lasso, as both are similar. Ridge will use all of the predictors in the final model though but they will be close to 0. Lasso could change the coefficient of the predictor to 0. Both still use λ to shrink the model thus decreasing variance but increasing bias.
Non-linear methods are more flexible because they generally have less assumptions or can violate assumptions of more linear / parametric models.They can also take on more complex shapes compared to a more simplistic model. These complex shapes increase our variance due to their flexibility in the bias-variance trade off.
In this exercise, we will predict the number of applications received using the other variables in the College data set.
college = ISLR2::College
set.seed(1)
index = sample(1:nrow(college), 0.8*nrow(college))
college_train = college[index,]
college_test = college[-index,]
Splitting the data into an 80/20 train/test split.
college_lm = lm(Apps ~ ., data = college_train)
summary(college_lm)
##
## Call:
## lm(formula = Apps ~ ., data = college_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5555.2 -404.6 19.9 310.3 7577.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -630.58238 435.56266 -1.448 0.148209
## PrivateYes -388.97393 148.87623 -2.613 0.009206 **
## Accept 1.69123 0.04433 38.153 < 2e-16 ***
## Enroll -1.21543 0.20873 -5.823 9.41e-09 ***
## Top10perc 50.45622 5.88174 8.578 < 2e-16 ***
## Top25perc -13.62655 4.67321 -2.916 0.003679 **
## F.Undergrad 0.08271 0.03632 2.277 0.023111 *
## P.Undergrad 0.06555 0.03367 1.947 0.052008 .
## Outstate -0.07562 0.01987 -3.805 0.000156 ***
## Room.Board 0.14161 0.05130 2.760 0.005947 **
## Books 0.21161 0.25184 0.840 0.401102
## Personal 0.01873 0.06604 0.284 0.776803
## PhD -9.72551 4.91228 -1.980 0.048176 *
## Terminal -0.48690 5.43302 -0.090 0.928620
## S.F.Ratio 18.26146 13.83984 1.319 0.187508
## perc.alumni 1.39008 4.39572 0.316 0.751934
## Expend 0.05764 0.01254 4.595 5.26e-06 ***
## Grad.Rate 5.89480 3.11185 1.894 0.058662 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 993.8 on 603 degrees of freedom
## Multiple R-squared: 0.9347, Adjusted R-squared: 0.9328
## F-statistic: 507.5 on 17 and 603 DF, p-value: < 2.2e-16
The linear model is significant at an alpha level of 0.05 (p-value = 2.2e-16) with a few terms not being significant.
lm_pred = predict(college_lm,college_test)
lm_mse = mean((lm_pred - college_test$Apps)^2)
lm_mse
## [1] 1567324
We get the mean square error of 1567324 with this linear model.
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.3.3
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-8
set.seed(1)
x_train = model.matrix(Apps~.,college_train)[,-1]
y_train = college_train$Apps
x_test = model.matrix(Apps~.,college_test)[,-1]
y_test = college_test$Apps
grid <- 10^seq(10, -2, length = 100)
set.seed(1)
college_ridge_cv = cv.glmnet(x_train,y_train,alpha=0, lambda = grid,
standardize = T, nfolds = 10)
best_lambda = college_ridge_cv$lambda.min
print(best_lambda)
## [1] 0.01
ridge.mod = glmnet(x_train, y_train, alpha = 0, lambda = grid, thresh = 1e-12)
ridge.pred = predict(ridge.mod, s=best_lambda, newx=x_test)
ridge_mse = mean((ridge.pred-y_test)^2)
print(ridge_mse)
## [1] 1567278
We get an MSE of 1567278 which is slightly lower than the linear model.
set.seed(1)
college_lasso_cv = cv.glmnet(x_train,y_train,alpha=1, lambda = grid,
standardize = T, nfolds = 10)
best_lambda = college_lasso_cv$lambda.min
print(best_lambda)
## [1] 0.01
plot(college_lasso_cv)
lasso.mod = glmnet(x_train, y_train, alpha = 1, lambda = grid)
lasso.pred = predict(lasso.mod, s=best_lambda, newx=x_test)
lasso_mse = mean((lasso.pred-y_test)^2)
print(lasso_mse)
## [1] 1565789
This is 2000 less than the ridge MSE.
lasso.coef = predict(lasso.mod, type = 'coefficients',s=best_lambda)[1:18,]
length(lasso.coef[lasso.coef != 0])
## [1] 18
lasso.coef[lasso.coef != 0]
## (Intercept) PrivateYes Accept Enroll Top10perc
## -632.95602448 -388.88540852 1.68973478 -1.20585828 50.38010502
## Top25perc F.Undergrad P.Undergrad Outstate Room.Board
## -13.57568416 0.08163106 0.06550365 -0.07553507 0.14188193
## Books Personal PhD Terminal S.F.Ratio
## 0.21093272 0.01873856 -9.74128656 -0.46973434 18.25011259
## perc.alumni Expend Grad.Rate
## 1.36217567 0.05765739 5.89241221
There are 18 non-zero variables.
paste('Linear Model Test MSE: ', round(lm_mse))
## [1] "Linear Model Test MSE: 1567324"
paste('Ridge Model Test MSE: ', round(ridge_mse))
## [1] "Ridge Model Test MSE: 1567278"
paste('Lasso Model Test MSE: ', round(lasso_mse))
## [1] "Lasso Model Test MSE: 1565789"
The test MSE’s are all rather close to each other but lasso had the lowest MSE which was ~2000 less than the ridge and linear models. We can calculate the R2 for each model to see how much variance is explained by each model.
SStotal = sum((college_test$Apps - mean(college_test$Apps))^2)
lm_r2 = 1 - sum((college_test$Apps - lm_pred)^2) / SStotal
ridge_r2 = 1 - sum((college_test$Apps - ridge.pred)^2) / SStotal
lasso_r2 = 1 - sum((college_test$Apps - lasso.pred)^2) / SStotal
paste('Linear Model R2: ', round(lm_r2,3))
## [1] "Linear Model R2: 0.901"
paste('Ridge Model R2: ', round(ridge_r2,3))
## [1] "Ridge Model R2: 0.901"
paste('Lasso Model R2: ', round(lasso_r2,3))
## [1] "Lasso Model R2: 0.901"
All three models explain 90% of the variance in the data and the lasso model had the lowest test MSE so they model should be slightly better at making predictions compared to the linear and ridge models.
We will now try to predict per capita crime rate in the Boston data set. (a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
boston = ISLR2::Boston
library(leaps)
## Warning: package 'leaps' was built under R version 4.3.2
boston_regfit.full = regsubsets(crim ~ ., data=boston)
summary(boston_regfit.full)
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = boston)
## 12 Variables (and intercept)
## Forced in Forced out
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## lstat FALSE FALSE
## medv FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## zn indus chas nox rm age dis rad tax ptratio lstat medv
## 1 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" " "
## 3 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" "*"
## 4 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " " " "*"
## 5 ( 1 ) "*" "*" " " " " " " " " "*" "*" " " " " " " "*"
## 6 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" " " "*"
## 7 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" "*" "*"
## 8 ( 1 ) "*" "*" " " "*" " " " " "*" "*" " " "*" "*" "*"
I just wanted to see the best subset selection.