We have a dataset with the bunch of variables and we need to build a regression model. Do we try to fit all variables? If not all, then which ones should we keep? These are important questions. Fortunately, we have a powerful tool that can help us to resolve this issue. The tool is LASSO.
LASSO does variable selection by setting coefficient of poorly fitted variables to zero.
Example on LASSO variable selection.
require(glmnet)
library(mlbench)
library(dplyr)
d<-mtcars
#Standardize covariates before fitting LASSO
dX<- scale(dplyr::select(d,-mpg))
dY<- d[, "mpg"]
cv.lasso<- cv.glmnet(x=dX, y=dY, family = "gaussian", alpha = 1, nfolds = 10)
plot(cv.lasso)
cv.lasso$lambda.1se
## [1] 1.535678
coef(cv.lasso, s=cv.lasso$lambda.1se)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 20.090625
## cyl -1.488435
## disp .
## hp -0.403589
## drat .
## wt -2.238530
## qsec .
## vs .
## am .
## gear .
## carb .
By setting coefficient of poorly fitted values to zeroes, LASSO prevents overfitting. Yes, our adjusted R squired might drop, but in the end we are less likely to overfit our data.
Let’s look at the specific example.
library(olsrr)
data(BostonHousing)
d<-BostonHousing
d$chas<-as.numeric(d$chas)
smp_size <- floor(0.75 * nrow(d))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(d)), size = smp_size)
train <- d[train_ind, ]
test <- d[-train_ind, ]
#Standardize covariates before fitting LASSO
dX<- scale(dplyr::select(train,-medv))
dXtest<-scale(dplyr::select(test,-medv))
dY<- train[, "medv"]
dYtest<-test[,"medv"]
cv.lasso<- cv.glmnet(x=dX, y=dY, family = "gaussian", alpha = 1, nfolds = 10)
plot(cv.lasso)
cv.lasso$lambda.1se
## [1] 0.8555024
coef(cv.lasso, s=cv.lasso$lambda.1se)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 22.6773087
## crim .
## zn .
## indus .
## chas 0.3123247
## nox .
## rm 2.5246630
## age .
## dis .
## rad .
## tax .
## ptratio -1.2831766
## b 0.1156775
## lstat -3.6432642
response<-predict(cv.lasso, dXtest , s = "lambda.1se")
actuals_preds <- data.frame(cbind(actuals=dYtest, predicteds=response)) # make actuals_predicteds dataframe.
cor(actuals_preds) # 87.8%
## actuals X1
## actuals 1.0000000 0.8788324
## X1 0.8788324 1.0000000
model <- lm(medv ~ ., data = train)
ols_step_backward_aic(model)
## Backward Elimination Method
## ---------------------------
##
## Candidate Terms:
##
## 1 . crim
## 2 . zn
## 3 . indus
## 4 . chas
## 5 . nox
## 6 . rm
## 7 . age
## 8 . dis
## 9 . rad
## 10 . tax
## 11 . ptratio
## 12 . b
## 13 . lstat
##
##
## Variables Removed:
##
## - age
## - indus
##
## No more variables to be removed.
##
##
## Backward Elimination Summary
## -----------------------------------------------------------------------
## Variable AIC RSS Sum Sq R-Sq Adj. R-Sq
## -----------------------------------------------------------------------
## Full Model 2293.406 8705.468 22519.317 0.72120 0.71127
## age 2291.507 8707.791 22516.993 0.72113 0.71198
## indus 2290.155 8722.686 22502.099 0.72065 0.71228
## -----------------------------------------------------------------------
train1<-train[,-c(3,7)]
modelb <- lm(medv ~ ., data = train1)
response<-predict(modelb, test)
actuals_preds <- data.frame(cbind(actuals=dYtest, predicteds=response)) # make actuals_predicteds dataframe.
cor(actuals_preds) # 88.7%
## actuals predicteds
## actuals 1.0000000 0.8869944
## predicteds 0.8869944 1.0000000
LASSO did not perform as well as backward AIC, but still it did got job not to overfit.
It was first introduced in 1986. Rediscovered and popularized in 1996. SO, it has been around for over 30 years.
LASSO is used for both linear and logistical regression.
LASSO simplifies our model. LASSO is alternative to Ridge, another regularization method.
Interestingly, widely used stepwise method, in some cases, can make a model worse.
LASSO advantages over Ridge - it downweights weak variables as Ridge, but it also addresses issue of multicolinearity which Ridge does not.
While Ridge does not set parameters to 0, LASSO does.
Commonly used package for LASSO is glmnet.
library(glmnet)
We will split data into train and test.
d<-mtcars
set.seed(123)
smp_size <- floor(0.8 * nrow(d))
train_ind <- sample(seq_len(nrow(d)), size = smp_size)
train <- d[train_ind, ]
test <- d[-train_ind, ]
We need to scale our variables and use matrix form for LASSO
dX<- scale(dplyr::select(train,-mpg))
dY<- train[, "mpg"]
dXtest<- scale(dplyr::select(train,-mpg))
dYtest<- train[, "mpg"]
We have to set an alpha parameter to 1, which tells glmnet we are applying LASSO.
cv.lasso<- cv.glmnet(x=dX, y=dY, family = "gaussian", alpha = 1, nfolds = 10)
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations
## per fold
We can plot MSE versus lambda
plot(cv.lasso)
We can see two vertical lines with log lambda around -1 and another log lambda around 0. The first shows lambda.min and the second lambda.1se. Lambda.min is a lambda that minimizes MSE, while lambda.1se is lambda which is less likely to overfit, but still ends up with low MSE.
cv.lasso$lambda.min
## [1] 0.3642596
cv.lasso$lambda.1se
## [1] 0.8414869
We can also see variables LASSO selected for lambda min and lambda 1se.
coef(cv.lasso, s=cv.lasso$lambda.min)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 19.6320000
## cyl -1.6342200
## disp .
## hp -0.5646983
## drat .
## wt -2.6317320
## qsec .
## vs .
## am .
## gear .
## carb -0.1939954
coef(cv.lasso, s=cv.lasso$lambda.1se)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 19.6320000
## cyl -1.5063519
## disp .
## hp -0.4794876
## drat .
## wt -2.3401395
## qsec .
## vs .
## am .
## gear .
## carb .
We can see that min lambda chooses 3 variables, while lambda 1se also chooses 3 variables.
We can predict dependent variable based on our chosen lambda, which we select as lambda 1se.
response<-predict(cv.lasso, dXtest , s = "lambda.1se")
actuals_preds <- data.frame(cbind(actuals=dYtest, predicteds=response)) # make actuals_predicteds dataframe.
SSE <- sum((dYtest - response)^2)
SST <-sum((mean(dYtest) - dYtest)^ 2)
R2 <-1- SSE/SST
R2 # 0.8737, pretty good
## [1] 0.873731
cor(actuals_preds) # 87.8%
## actuals X1
## actuals 1.0000000 0.9525146
## X1 0.9525146 1.0000000
ourdata<-read.csv(file="https://raw.githubusercontent.com/simplymathematics/621/master/HW1/moneyball-training-data.csv")
ourdata$TEAM_BATTING_HBP<-NULL
ourdata$INDEX<-NULL
for(i in 1:ncol(ourdata)){
ourdata[is.na(ourdata[,i]), i] <- mean(ourdata[,i], na.rm = TRUE)
}
library(outliers)
for(i in 1:ncol(ourdata)){
outs <- scores(ourdata[,i], type="chisq", prob=0.9)
ourdata<-ourdata[-outs,]
}
library(caret)
d<-ourdata%>%dplyr::mutate_all(funs(BoxCoxTrans(.)%>% predict(.)))
set.seed(123)
smp_size <- floor(0.75 * nrow(d))
train_ind <- sample(seq_len(nrow(d)), size = smp_size)
train <- d[train_ind, ]
test <- d[-train_ind, ]
mod <- lm(TARGET_WINS ~ ., data = train)
sample_size <- nrow(train)
cooksd <- cooks.distance(mod)
influential <- as.numeric(names(cooksd)[(cooksd > (4/sample_size))])
train <- train[-influential, ]
dX<- scale(dplyr::select(train,-TARGET_WINS))
dY<- train[, "TARGET_WINS"]
dXtest<- scale(dplyr::select(train,-TARGET_WINS))
dYtest<- train[, "TARGET_WINS"]
cv.lasso<- cv.glmnet(x=dX, y=dY, family = "gaussian", alpha = 1, nfolds = 10)
plot(cv.lasso)
cv.lasso$lambda.min
## [1] 0.0351983
cv.lasso$lambda.1se
## [1] 0.9134043
coef(cv.lasso, s=cv.lasso$lambda.min)
## 15 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 80.7056650
## TEAM_BATTING_H 6.1244643
## TEAM_BATTING_2B -0.6741696
## TEAM_BATTING_3B 3.2009863
## TEAM_BATTING_HR .
## TEAM_BATTING_BB 4.4275871
## TEAM_BATTING_SO -4.9724885
## TEAM_BASERUN_SB 2.2250139
## TEAM_BASERUN_CS 0.1080387
## TEAM_PITCHING_H -0.7048765
## TEAM_PITCHING_HR 2.4200933
## TEAM_PITCHING_BB -2.6240034
## TEAM_PITCHING_SO 2.9410702
## TEAM_FIELDING_E -5.8362417
## TEAM_FIELDING_DP -3.0731632
coef(cv.lasso, s=cv.lasso$lambda.1se)
## 15 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 80.7056650
## TEAM_BATTING_H 5.6857000
## TEAM_BATTING_2B .
## TEAM_BATTING_3B 0.6848702
## TEAM_BATTING_HR .
## TEAM_BATTING_BB 2.2481428
## TEAM_BATTING_SO .
## TEAM_BASERUN_SB 0.8136984
## TEAM_BASERUN_CS .
## TEAM_PITCHING_H .
## TEAM_PITCHING_HR .
## TEAM_PITCHING_BB .
## TEAM_PITCHING_SO .
## TEAM_FIELDING_E -2.4436460
## TEAM_FIELDING_DP -1.5729289
response<-predict(cv.lasso, dXtest , s = "lambda.1se")
actuals_preds <- data.frame(cbind(actuals=dYtest, predicteds=response)) # make actuals_predicteds dataframe.
SSE <- sum((dYtest - response)^2)
mean1<-mean(dYtest)
SST<-sum((mean1 - dYtest)^2)
R2 <-1- SSE/SST
R2 # 0.27
## [1] 0.2651141
cor(actuals_preds) # 0.53%
## actuals X1
## actuals 1.0000000 0.5283577
## X1 0.5283577 1.0000000