LASSO - Least Absolute Shrinkage and Selection Operator

We have a dataset with the bunch of variables and we need to build a regression model. Do we try to fit all variables? If not all, then which ones should we keep? These are important questions. Fortunately, we have a powerful tool that can help us to resolve this issue. The tool is LASSO.

Variable Selection with LASSO

LASSO does variable selection by setting coefficient of poorly fitted variables to zero.

Example on LASSO variable selection.

require(glmnet)
library(mlbench)
library(dplyr)

d<-mtcars

#Standardize covariates before fitting LASSO

dX<- scale(dplyr::select(d,-mpg))

dY<- d[, "mpg"]

cv.lasso<- cv.glmnet(x=dX, y=dY, family = "gaussian", alpha = 1, nfolds = 10)

plot(cv.lasso)

cv.lasso$lambda.1se
## [1] 1.535678
coef(cv.lasso, s=cv.lasso$lambda.1se)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                     1
## (Intercept) 20.090625
## cyl         -1.488435
## disp         .       
## hp          -0.403589
## drat         .       
## wt          -2.238530
## qsec         .       
## vs           .       
## am           .       
## gear         .       
## carb         .

LASSO - priventing overfitting

By setting coefficient of poorly fitted values to zeroes, LASSO prevents overfitting. Yes, our adjusted R squired might drop, but in the end we are less likely to overfit our data.

Let’s look at the specific example.

library(olsrr)

data(BostonHousing)

d<-BostonHousing

d$chas<-as.numeric(d$chas)

smp_size <- floor(0.75 * nrow(d))

## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(d)), size = smp_size)

train <- d[train_ind, ]
test <- d[-train_ind, ]
#Standardize covariates before fitting LASSO
dX<- scale(dplyr::select(train,-medv))
dXtest<-scale(dplyr::select(test,-medv))
dY<- train[, "medv"]

dYtest<-test[,"medv"]
cv.lasso<- cv.glmnet(x=dX, y=dY, family = "gaussian", alpha = 1, nfolds = 10)
plot(cv.lasso)

cv.lasso$lambda.1se
## [1] 0.8555024
coef(cv.lasso, s=cv.lasso$lambda.1se)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                      1
## (Intercept) 22.6773087
## crim         .        
## zn           .        
## indus        .        
## chas         0.3123247
## nox          .        
## rm           2.5246630
## age          .        
## dis          .        
## rad          .        
## tax          .        
## ptratio     -1.2831766
## b            0.1156775
## lstat       -3.6432642
response<-predict(cv.lasso, dXtest , s = "lambda.1se")

actuals_preds <- data.frame(cbind(actuals=dYtest, predicteds=response))  # make actuals_predicteds dataframe.

cor(actuals_preds)  # 87.8%
##           actuals        X1
## actuals 1.0000000 0.8788324
## X1      0.8788324 1.0000000
model <- lm(medv ~ ., data = train)
ols_step_backward_aic(model)
## Backward Elimination Method 
## ---------------------------
## 
## Candidate Terms: 
## 
## 1 . crim 
## 2 . zn 
## 3 . indus 
## 4 . chas 
## 5 . nox 
## 6 . rm 
## 7 . age 
## 8 . dis 
## 9 . rad 
## 10 . tax 
## 11 . ptratio 
## 12 . b 
## 13 . lstat 
## 
## 
## Variables Removed: 
## 
## - age 
## - indus 
## 
## No more variables to be removed.
## 
## 
##                       Backward Elimination Summary                       
## -----------------------------------------------------------------------
## Variable        AIC         RSS        Sum Sq       R-Sq      Adj. R-Sq 
## -----------------------------------------------------------------------
## Full Model    2293.406    8705.468    22519.317    0.72120      0.71127 
## age           2291.507    8707.791    22516.993    0.72113      0.71198 
## indus         2290.155    8722.686    22502.099    0.72065      0.71228 
## -----------------------------------------------------------------------
train1<-train[,-c(3,7)]

modelb <- lm(medv ~ ., data = train1)

response<-predict(modelb, test)

actuals_preds <- data.frame(cbind(actuals=dYtest, predicteds=response))  # make actuals_predicteds dataframe.

cor(actuals_preds)  # 88.7%
##              actuals predicteds
## actuals    1.0000000  0.8869944
## predicteds 0.8869944  1.0000000

LASSO did not perform as well as backward AIC, but still it did got job not to overfit.

LASSO highlightes.

It was first introduced in 1986. Rediscovered and popularized in 1996. SO, it has been around for over 30 years.

LASSO is used for both linear and logistical regression.

LASSO simplifies our model. LASSO is alternative to Ridge, another regularization method.

Interestingly, widely used stepwise method, in some cases, can make a model worse.

LASSO advantages over Ridge - it downweights weak variables as Ridge, but it also addresses issue of multicolinearity which Ridge does not.

While Ridge does not set parameters to 0, LASSO does.

R syntaxis for LASSO

Commonly used package for LASSO is glmnet.

library(glmnet)

We will split data into train and test.

d<-mtcars
set.seed(123)
smp_size <- floor(0.8 * nrow(d))
train_ind <- sample(seq_len(nrow(d)), size = smp_size)

train <- d[train_ind, ]
test <- d[-train_ind, ]

We need to scale our variables and use matrix form for LASSO

dX<- scale(dplyr::select(train,-mpg))

dY<- train[, "mpg"]

dXtest<- scale(dplyr::select(train,-mpg))

dYtest<- train[, "mpg"]

We have to set an alpha parameter to 1, which tells glmnet we are applying LASSO.

cv.lasso<- cv.glmnet(x=dX, y=dY, family = "gaussian", alpha = 1, nfolds = 10)
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations
## per fold

We can plot MSE versus lambda

plot(cv.lasso)

We can see two vertical lines with log lambda around -1 and another log lambda around 0. The first shows lambda.min and the second lambda.1se. Lambda.min is a lambda that minimizes MSE, while lambda.1se is lambda which is less likely to overfit, but still ends up with low MSE.

cv.lasso$lambda.min
## [1] 0.3642596
cv.lasso$lambda.1se
## [1] 0.8414869

We can also see variables LASSO selected for lambda min and lambda 1se.

coef(cv.lasso, s=cv.lasso$lambda.min)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                      1
## (Intercept) 19.6320000
## cyl         -1.6342200
## disp         .        
## hp          -0.5646983
## drat         .        
## wt          -2.6317320
## qsec         .        
## vs           .        
## am           .        
## gear         .        
## carb        -0.1939954
coef(cv.lasso, s=cv.lasso$lambda.1se)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                      1
## (Intercept) 19.6320000
## cyl         -1.5063519
## disp         .        
## hp          -0.4794876
## drat         .        
## wt          -2.3401395
## qsec         .        
## vs           .        
## am           .        
## gear         .        
## carb         .

We can see that min lambda chooses 3 variables, while lambda 1se also chooses 3 variables.

We can predict dependent variable based on our chosen lambda, which we select as lambda 1se.

response<-predict(cv.lasso, dXtest , s = "lambda.1se")

actuals_preds <- data.frame(cbind(actuals=dYtest, predicteds=response))  # make actuals_predicteds dataframe.

SSE <- sum((dYtest - response)^2)

SST <-sum((mean(dYtest) - dYtest)^ 2)
          
R2 <-1- SSE/SST

R2 # 0.8737, pretty good
## [1] 0.873731
cor(actuals_preds)  # 87.8%
##           actuals        X1
## actuals 1.0000000 0.9525146
## X1      0.9525146 1.0000000

Applying LASSO to HW1

ourdata<-read.csv(file="https://raw.githubusercontent.com/simplymathematics/621/master/HW1/moneyball-training-data.csv")

ourdata$TEAM_BATTING_HBP<-NULL
ourdata$INDEX<-NULL

for(i in 1:ncol(ourdata)){
  ourdata[is.na(ourdata[,i]), i] <- mean(ourdata[,i], na.rm = TRUE)
}

library(outliers)

for(i in 1:ncol(ourdata)){
  outs <- scores(ourdata[,i], type="chisq", prob=0.9)
  ourdata<-ourdata[-outs,]
}

library(caret)
d<-ourdata%>%dplyr::mutate_all(funs(BoxCoxTrans(.)%>% predict(.)))

set.seed(123)
smp_size <- floor(0.75 * nrow(d))
train_ind <- sample(seq_len(nrow(d)), size = smp_size)
train <- d[train_ind, ]
test <- d[-train_ind, ]
mod <- lm(TARGET_WINS ~ ., data = train)
sample_size <- nrow(train)
cooksd <- cooks.distance(mod)
influential <- as.numeric(names(cooksd)[(cooksd > (4/sample_size))])

train <- train[-influential, ]

                         
dX<- scale(dplyr::select(train,-TARGET_WINS))                        

dY<- train[, "TARGET_WINS"]

dXtest<- scale(dplyr::select(train,-TARGET_WINS))

dYtest<- train[, "TARGET_WINS"]


cv.lasso<- cv.glmnet(x=dX, y=dY, family = "gaussian", alpha = 1, nfolds = 10)

plot(cv.lasso)

cv.lasso$lambda.min
## [1] 0.0351983
cv.lasso$lambda.1se
## [1] 0.9134043
coef(cv.lasso, s=cv.lasso$lambda.min)
## 15 x 1 sparse Matrix of class "dgCMatrix"
##                           1
## (Intercept)      80.7056650
## TEAM_BATTING_H    6.1244643
## TEAM_BATTING_2B  -0.6741696
## TEAM_BATTING_3B   3.2009863
## TEAM_BATTING_HR   .        
## TEAM_BATTING_BB   4.4275871
## TEAM_BATTING_SO  -4.9724885
## TEAM_BASERUN_SB   2.2250139
## TEAM_BASERUN_CS   0.1080387
## TEAM_PITCHING_H  -0.7048765
## TEAM_PITCHING_HR  2.4200933
## TEAM_PITCHING_BB -2.6240034
## TEAM_PITCHING_SO  2.9410702
## TEAM_FIELDING_E  -5.8362417
## TEAM_FIELDING_DP -3.0731632
coef(cv.lasso, s=cv.lasso$lambda.1se)
## 15 x 1 sparse Matrix of class "dgCMatrix"
##                           1
## (Intercept)      80.7056650
## TEAM_BATTING_H    5.6857000
## TEAM_BATTING_2B   .        
## TEAM_BATTING_3B   0.6848702
## TEAM_BATTING_HR   .        
## TEAM_BATTING_BB   2.2481428
## TEAM_BATTING_SO   .        
## TEAM_BASERUN_SB   0.8136984
## TEAM_BASERUN_CS   .        
## TEAM_PITCHING_H   .        
## TEAM_PITCHING_HR  .        
## TEAM_PITCHING_BB  .        
## TEAM_PITCHING_SO  .        
## TEAM_FIELDING_E  -2.4436460
## TEAM_FIELDING_DP -1.5729289
response<-predict(cv.lasso, dXtest , s = "lambda.1se")

actuals_preds <- data.frame(cbind(actuals=dYtest, predicteds=response))  # make actuals_predicteds dataframe.

SSE <- sum((dYtest - response)^2)

mean1<-mean(dYtest)

SST<-sum((mean1 - dYtest)^2)

R2 <-1- SSE/SST

R2 # 0.27
## [1] 0.2651141
cor(actuals_preds)  # 0.53%
##           actuals        X1
## actuals 1.0000000 0.5283577
## X1      0.5283577 1.0000000