In this problem, I will learn about lasso regression and how to apply this technique for feature selection.
library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.2.2
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.2
library(magrittr)
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.2
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ✔ purrr 0.3.4
## Warning: package 'readr' was built under R version 4.2.2
## Warning: package 'forcats' was built under R version 4.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::extract() masks magrittr::extract()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::set_names() masks magrittr::set_names()
data("Hitters")
We first remove all observations with NA values from the Hitters dataset and then inspect its structure.
Hitters<-na.omit(Hitters)
str(Hitters)
## 'data.frame': 263 obs. of 20 variables:
## $ AtBat : int 315 479 496 321 594 185 298 323 401 574 ...
## $ Hits : int 81 130 141 87 169 37 73 81 92 159 ...
## $ HmRun : int 7 18 20 10 4 1 0 6 17 21 ...
## $ Runs : int 24 66 65 39 74 23 24 26 49 107 ...
## $ RBI : int 38 72 78 42 51 8 24 32 66 75 ...
## $ Walks : int 39 76 37 30 35 21 7 8 65 59 ...
## $ Years : int 14 3 11 2 11 2 3 2 13 10 ...
## $ CAtBat : int 3449 1624 5628 396 4408 214 509 341 5206 4631 ...
## $ CHits : int 835 457 1575 101 1133 42 108 86 1332 1300 ...
## $ CHmRun : int 69 63 225 12 19 1 0 6 253 90 ...
## $ CRuns : int 321 224 828 48 501 30 41 32 784 702 ...
## $ CRBI : int 414 266 838 46 336 9 37 34 890 504 ...
## $ CWalks : int 375 263 354 33 194 24 12 8 866 488 ...
## $ League : Factor w/ 2 levels "A","N": 2 1 2 2 1 2 1 2 1 1 ...
## $ Division : Factor w/ 2 levels "E","W": 2 2 1 1 2 1 2 2 1 1 ...
## $ PutOuts : int 632 880 200 805 282 76 121 143 0 238 ...
## $ Assists : int 43 82 11 40 421 127 283 290 0 445 ...
## $ Errors : int 10 14 3 4 25 7 9 19 0 22 ...
## $ Salary : num 475 480 500 91.5 750 ...
## $ NewLeague: Factor w/ 2 levels "A","N": 2 1 2 2 1 1 1 2 1 1 ...
## - attr(*, "na.action")= 'omit' Named int [1:59] 1 16 19 23 31 33 37 39 40 42 ...
## ..- attr(*, "names")= chr [1:59] "-Andy Allanson" "-Billy Beane" "-Bruce Bochte" "-Bob Boone" ...
In this problem, we apply lasso regression to select the predictors used to predict the salary of baseball players.
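Recall that the lasso estimates the coefficients by minimizing the penalized residual sum of squares \(\sum_{i=1}^{n}\big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\big)^2+\lambda\sum_{j=1}^{p}|\beta_j|\); the \(\ell_1\) penalty can shrink some coefficients exactly to zero, which is what makes the lasso useful for feature selection.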
#Create a new input matrix for the model
x<-model.matrix(Salary~.,data=Hitters)
x<-x[,-1]
class(x)
## [1] "matrix" "array"
#Label
y<-Hitters$Salary
#Create a sequence of lambda from 10^7 to 10^-4
grid<-10^seq(7,-4,length=100)
Now we use the input matrix x and the response y to fit our lasso regression model.
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.2.2
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-6
lasso<-glmnet(x,y,
alpha=1, #alpha = 1 specifies the lasso penalty
lambda=grid)
coef<-coef(lasso)
dim(coef)
## [1] 20 100
coef is a matrix containing the lasso coefficients (one row per coefficient, including the intercept) for each of the 100 values of \(\lambda\) in the grid from \(10^7\) to \(10^{-4}\).
We can check the lasso coefficients at a specified \(\lambda\).
lasso$lambda[1]
## [1] 1e+07
#The coefficients at the first lambda
coef[,1]
## (Intercept) AtBat Hits HmRun Runs RBI
## 535.9259 0.0000 0.0000 0.0000 0.0000 0.0000
## Walks Years CAtBat CHits CHmRun CRuns
## 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## CRBI CWalks LeagueN DivisionW PutOuts Assists
## 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## Errors NewLeagueN
## 0.0000 0.0000
For \(\lambda=10^7\), the penalty is so large that the lasso shrinks every coefficient to zero, eliminating all predictors from the model.
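As a quick sanity check, a minimal sketch (using the coef matrix and the lasso fit above) counts how many coefficients are non-zero at each \(\lambda\) in the grid; the number of retained predictors grows as the penalty shrinks. The name nonzero below is just a helper for this illustration.
#Sketch: number of non-zero coefficients (excluding the intercept) at each lambda
nonzero<-colSums(as.matrix(coef)[-1,]!=0)
head(data.frame(lambda=lasso$lambda,nonzero))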
After obtaining the lasso coefficients, we use 10-fold cross-validation to estimate the optimal \(\lambda\); cv.glmnet reports the mean cross-validated MSE (cvm) for each \(\lambda\) in the grid.
k_folds<-cv.glmnet(x,y,
alpha=1,
lambda=grid,
nfolds=10)
#Function to find the MSE
mse<-function(x,y){
return(mean((x-y)^2))
}
mse_lambda <- function(lasso, ...) {
mse_lam <- data.frame(Lambda = lasso$lambda,
MSE = lasso$cvm)
mse_lam %>% ggplot(aes(log(Lambda), MSE)) +
geom_line() + geom_point(color = "blue", alpha = 0.3) +
geom_point(data = mse_lam %>% filter(MSE == min(MSE)), color = "red", size = 3) +
labs(title = "MSE vs. Lambda",
subtitle = "Note: The red point corresponds to the smallest value of the MSE") +
theme_minimal()
}
mse_lambda(k_folds)
#Minimum MSE
min(k_folds$cvm)
## [1] 114604.9
#The corresponding lambda
k_folds$lambda.min
## [1] 1.29155
The minimum cross-validated MSE is 114604.9 and the corresponding \(\lambda\) is approximately 1.29. We now look at the lasso coefficients at this optimal \(\lambda\).
#Coefficients at the lambda with the smallest CV error (cv.glmnet used the same grid as the lasso fit)
coef[,which.min(k_folds$cvm)]
## (Intercept) AtBat Hits HmRun Runs RBI
## 149.2323007 -1.8807516 6.5506990 0.8678557 -0.5844286 0.0000000
## Walks Years CAtBat CHits CHmRun CRuns
## 5.4302865 -8.5932498 -0.0441240 0.0000000 0.3565032 0.9958126
## CRBI CWalks LeagueN DivisionW PutOuts Assists
## 0.5056303 -0.6933650 42.1940765 -117.4103836 0.2811973 0.2664320
## Errors NewLeagueN
## -2.7087417 -6.3673670
We can see that, at the optimal \(\lambda\), the lasso sets the coefficients of RBI and CHits exactly to zero, eliminating them from the model, whereas ridge regression keeps all predictors.
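Equivalently, instead of indexing into the coefficient matrix, we could ask glmnet directly for the coefficients at the optimal \(\lambda\) (a small sketch; glmnet interpolates along the fitted path if the requested value is not exactly on the grid).
#Sketch: extract the coefficients directly at the optimal lambda
coef(lasso,s=k_folds$lambda.min)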
lasso_predict<-predict(lasso,s=k_folds$lambda.min,
newx=x)
#MSE between the predicted values and the actual value
mse(lasso_predict,y)
## [1] 92981.75
We now compare the lasso model with a ridge model fitted at the same \(\lambda\).
#Ridge regression model
ridge<-glmnet(x,y,
alpha=0,
lambda=k_folds$lambda.min)
#Predict
ridge_predict<-predict(ridge,
s=k_folds$lambda.min,
newx=x)
mse(ridge_predict,y)
## [1] 92312.53
The MSE of the ridge regression model is slightly lower, meaning its in-sample predictions are marginally better. However, the lasso model is easier to interpret, since it excludes unnecessary predictors, while ridge regression keeps every predictor and only shrinks the magnitude of the coefficients.
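To illustrate this interpretability difference, a small sketch (using the fitted objects above) counts the non-zero coefficients in each model.
#Sketch: number of predictors retained by each model (intercept excluded)
sum(as.matrix(coef(lasso,s=k_folds$lambda.min))[-1,]!=0)
sum(as.matrix(coef(ridge))[-1,]!=0)
Note that the ridge model here reuses the \(\lambda\) chosen for the lasso; in practice one would cross-validate the ridge penalty separately (e.g. cv.glmnet with alpha = 0), since the optimal \(\lambda\) for the lasso need not be optimal for ridge.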