In this problem, I will learn about lasso regression and how to apply this technique for feature selection.
library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.2.2
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.2
library(magrittr)
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.2
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ✔ purrr 0.3.4
## Warning: package 'readr' was built under R version 4.2.2
## Warning: package 'forcats' was built under R version 4.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::extract() masks magrittr::extract()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::set_names() masks magrittr::set_names()
data("Hitters")
We first remove all observations with NA values from the Hitters dataset and then inspect its structure.
Hitters<-na.omit(Hitters)
str(Hitters)
## 'data.frame': 263 obs. of 20 variables:
## $ AtBat : int 315 479 496 321 594 185 298 323 401 574 ...
## $ Hits : int 81 130 141 87 169 37 73 81 92 159 ...
## $ HmRun : int 7 18 20 10 4 1 0 6 17 21 ...
## $ Runs : int 24 66 65 39 74 23 24 26 49 107 ...
## $ RBI : int 38 72 78 42 51 8 24 32 66 75 ...
## $ Walks : int 39 76 37 30 35 21 7 8 65 59 ...
## $ Years : int 14 3 11 2 11 2 3 2 13 10 ...
## $ CAtBat : int 3449 1624 5628 396 4408 214 509 341 5206 4631 ...
## $ CHits : int 835 457 1575 101 1133 42 108 86 1332 1300 ...
## $ CHmRun : int 69 63 225 12 19 1 0 6 253 90 ...
## $ CRuns : int 321 224 828 48 501 30 41 32 784 702 ...
## $ CRBI : int 414 266 838 46 336 9 37 34 890 504 ...
## $ CWalks : int 375 263 354 33 194 24 12 8 866 488 ...
## $ League : Factor w/ 2 levels "A","N": 2 1 2 2 1 2 1 2 1 1 ...
## $ Division : Factor w/ 2 levels "E","W": 2 2 1 1 2 1 2 2 1 1 ...
## $ PutOuts : int 632 880 200 805 282 76 121 143 0 238 ...
## $ Assists : int 43 82 11 40 421 127 283 290 0 445 ...
## $ Errors : int 10 14 3 4 25 7 9 19 0 22 ...
## $ Salary : num 475 480 500 91.5 750 ...
## $ NewLeague: Factor w/ 2 levels "A","N": 2 1 2 2 1 1 1 2 1 1 ...
## - attr(*, "na.action")= 'omit' Named int [1:59] 1 16 19 23 31 33 37 39 40 42 ...
## ..- attr(*, "names")= chr [1:59] "-Andy Allanson" "-Billy Beane" "-Bruce Bochte" "-Bob Boone" ...
In this problem, we apply lasso regression to select the predictors used to predict the salary of baseball players.
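Recall that the lasso estimates the coefficients by minimizing the penalized residual sum of squares \(\sum_{i=1}^{n}\big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\big)^2+\lambda\sum_{j=1}^{p}|\beta_j|\); the \(\ell_1\) penalty can shrink some coefficients exactly to zero, which is what makes the lasso useful for feature selection.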
#Create a new input matrix for the model
x<-model.matrix(Salary~.,data=Hitters)
x<-x[,-1]
class(x)
## [1] "matrix" "array"
#Label
y<-Hitters$Salary
#Create a sequence of lambda from 10^7 to 10^-4
grid<-10^seq(7,-4,length=100)
Now we use the input matrix x and the response y to fit our lasso regression model.
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.2.2
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-6
lasso<-glmnet(x,y,
alpha=1, #alpha = 1 specifies the lasso penalty
lambda=grid)
coef<-coef(lasso)
dim(coef)
## [1] 20 100
coef is a matrix containing the lasso coefficients (one row per coefficient, including the intercept) for each of the 100 values of \(\lambda\) in the grid from \(10^7\) to \(10^{-4}\).
We can check the lasso coefficients at a specified \(\lambda\).
lasso$lambda[1]
## [1] 1e+07
#The coefficients at the first lambda
coef[,1]
## (Intercept) AtBat Hits HmRun Runs RBI
## 535.9259 0.0000 0.0000 0.0000 0.0000 0.0000
## Walks Years CAtBat CHits CHmRun CRuns
## 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## CRBI CWalks LeagueN DivisionW PutOuts Assists
## 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## Errors NewLeagueN
## 0.0000 0.0000
For \(\lambda=10^7\), the penalty is so large that the lasso shrinks every coefficient to zero, eliminating all predictors from the model.
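As a quick sanity check, a minimal sketch (using the coef matrix and the lasso fit above) counts how many coefficients are non-zero at each \(\lambda\) in the grid; the number of retained predictors grows as the penalty shrinks. The name nonzero below is just a helper for this illustration.
#Sketch: number of non-zero coefficients (excluding the intercept) at each lambda
nonzero<-colSums(as.matrix(coef)[-1,]!=0)
head(data.frame(lambda=lasso$lambda,nonzero))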
After obtaining the lasso coefficients, we use 10-fold cross-validation to estimate the optimal \(\lambda\); cv.glmnet reports the mean cross-validated MSE (cvm) for each \(\lambda\) in the grid.
k_folds<-cv.glmnet(x,y,
alpha=1,
lambda=grid,
nfolds=10)
#Function to find the MSE
mse<-function(x,y){
return(mean((x-y)^2))
}
mse_lambda <- function(lasso, ...) {
mse_lam <- data.frame(Lambda = lasso$lambda,
MSE = lasso$cvm)
mse_lam %>% ggplot(aes(log(Lambda), MSE)) +
geom_line() + geom_point(color = "blue", alpha = 0.3) +
geom_point(data = mse_lam %>% filter(MSE == min(MSE)), color = "red", size = 3) +
labs(title = "MSE vs. Lambda",
subtitle = "Note: The red point corresponds to the smallest value of the MSE") +
theme_minimal()
}
mse_lambda(k_folds)
#Minimum MSE
min(k_folds$cvm)
## [1] 114604.9
#The corresponding lambda
k_folds$lambda.min
## [1] 1.29155
The minimum cross-validated MSE is 114604.9 and the corresponding \(\lambda\) is approximately 1.29. We now look at the lasso coefficients at this optimal \(\lambda\).
#Coefficients at the lambda with the smallest CV error (cv.glmnet used the same grid as the lasso fit)
coef[,which.min(k_folds$cvm)]
## (Intercept) AtBat Hits HmRun Runs RBI
## 149.2323007 -1.8807516 6.5506990 0.8678557 -0.5844286 0.0000000
## Walks Years CAtBat CHits CHmRun CRuns
## 5.4302865 -8.5932498 -0.0441240 0.0000000 0.3565032 0.9958126
## CRBI CWalks LeagueN DivisionW PutOuts Assists
## 0.5056303 -0.6933650 42.1940765 -117.4103836 0.2811973 0.2664320
## Errors NewLeagueN
## -2.7087417 -6.3673670
We can see that, at the optimal \(\lambda\), the lasso sets the coefficients of RBI and CHits exactly to zero, eliminating them from the model, whereas ridge regression keeps all predictors.
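Equivalently, instead of indexing into the coefficient matrix, we could ask glmnet directly for the coefficients at the optimal \(\lambda\) (a small sketch; glmnet interpolates along the fitted path if the requested value is not exactly on the grid).
#Sketch: extract the coefficients directly at the optimal lambda
coef(lasso,s=k_folds$lambda.min)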
lasso_predict<-predict(lasso,s=k_folds$lambda.min,
newx=x)
#MSE between the predicted values and the actual value
mse(lasso_predict,y)
## [1] 92981.75
We now compare the lasso model with a ridge model fitted at the same \(\lambda\).
#Ridge regression model
ridge<-glmnet(x,y,
alpha=0,
lambda=k_folds$lambda.min)
#Predict
ridge_predict<-predict(ridge,
s=k_folds$lambda.min,
newx=x)
mse(ridge_predict,y)
## [1] 92312.53
The MSE of the ridge regression model is slightly lower, meaning its in-sample predictions are marginally better. However, the lasso model is easier to interpret, since it excludes unnecessary predictors, while ridge regression keeps every predictor and only shrinks the magnitude of the coefficients.
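To illustrate this interpretability difference, a small sketch (using the fitted objects above) counts the non-zero coefficients in each model.
#Sketch: number of predictors retained by each model (intercept excluded)
sum(as.matrix(coef(lasso,s=k_folds$lambda.min))[-1,]!=0)
sum(as.matrix(coef(ridge))[-1,]!=0)
Note that the ridge model here reuses the \(\lambda\) chosen for the lasso; in practice one would cross-validate the ridge penalty separately (e.g. cv.glmnet with alpha = 0), since the optimal \(\lambda\) for the lasso need not be optimal for ridge.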