From the linear regression model with all the variables included, find the best subset model for medv on the training data using LASSO and the best lambda. What are the important variables? (Dataset = Boston)
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-5
library(MASS)
attach(Boston)
library(plotmo)
## Loading required package: plotrix
## Loading required package: TeachingDemos
grid <- 10^seq(10,-2,length=1000)
# We have chosen to implement the function over a grid of values ranging from lambda = 10^10 to lambda = 10^-2, essentially covering the full range of scenarios from the null model containing only the intercept to the least squares fit.
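As a quick check (not part of the original output), the endpoints and length of the grid can be inspected:

range(grid)   # smallest and largest candidate lambda: 1e-02 and 1e+10
length(grid)  # 1000 candidate values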
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
## 6 5.21 28.7
dim(Boston)
## [1] 506 14
Boston <- na.omit(Boston)
x <- model.matrix(medv~.,Boston)
str(x)
## num [1:506, 1:14] 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:506] "1" "2" "3" "4" ...
## ..$ : chr [1:14] "(Intercept)" "crim" "zn" "indus" ...
## - attr(*, "assign")= int [1:14] 0 1 2 3 4 5 6 7 8 9 ...
### model.matrix() creates a design matrix from the description given in terms(object), using the
### data in `data`, which must supply variables with the same names as would be created by a call to
### model.frame(). It also automatically transforms any qualitative variables into dummy variables.
### This property is important because glmnet() can only take numerical, quantitative inputs.
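As a small illustration, here is a toy example (hypothetical data, not part of the Boston analysis) of how model.matrix() expands a factor into dummy columns:

# Toy data (assumed, for illustration only): one numeric response and one factor predictor
toy <- data.frame(y = c(1, 2, 3), grp = factor(c("a", "b", "c")))
model.matrix(y ~ ., toy)
# The result has an intercept column plus numeric dummy columns grpb and grpc,
# i.e. the kind of purely numeric matrix glmnet() requires.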
y <-Boston$medv
str(y)
## num [1:506] 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
train <- sample(1:nrow(x), nrow(x)*.67)
test <- (-train)
y.test <- y[test]
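Note that no seed is set before sample(), so the exact split (and hence the test-error estimates below) will vary between runs. A minimal sketch of a reproducible split, using an arbitrary seed value and a hypothetical variable name train.seeded purely for illustration:

set.seed(123)                                        # arbitrary seed, for reproducibility only
train.seeded <- sample(1:nrow(x), nrow(x) * 0.67)    # same 67% split, now repeatable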
# Checks
dim (x[train,])
## [1] 339 14
length(y[train])
## [1] 339
length(y.test)
## [1] 167
lasso.mod <-glmnet(x[train,], y[train],alpha=1,lambda = grid)
# glmnet() function standardizes the variables by default so that they are on the same scale.
# If alpha=0 then a ridge regression model is fit, and if alpha=1 then a lasso model is fit.
dim(coef(lasso.mod ))
## [1] 15 1000
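For comparison (not part of the assignment output), the same call with alpha = 0 would fit a ridge regression on the same training data and lambda grid:

# Sketch: ridge fit over the same grid (alpha = 0 selects ridge instead of lasso)
ridge.mod <- glmnet(x[train, ], y[train], alpha = 0, lambda = grid)
dim(coef(ridge.mod))   # also 15 x 1000, but ridge shrinks coefficients without setting any exactly to zero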
# Check the coefficients at grid position 825 (i.e. lambda = 1.265)
lasso.mod$lambda[825]
## [1] 1.265034
coef(lasso.mod)[,825]
## (Intercept) (Intercept) crim zn indus
## 17.781798308 0.000000000 0.000000000 0.000000000 0.000000000
## chas nox rm age dis
## 0.000000000 0.000000000 3.237136004 0.000000000 0.000000000
## rad tax ptratio black lstat
## 0.000000000 0.000000000 -0.563288642 0.002203551 -0.475572298
summary(lasso.mod)
## Length Class Mode
## a0 1000 -none- numeric
## beta 14000 dgCMatrix S4
## df 1000 -none- numeric
## dim 2 -none- numeric
## lambda 1000 -none- numeric
## dev.ratio 1000 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
par(mfrow=c(1,2))
plot_glmnet(lasso.mod, xvar = "lambda", label = 5)
plot_glmnet(lasso.mod, xvar="dev",label=5)
When lambda = 0, the lasso simply gives the least squares fit; when lambda becomes sufficiently large, the lasso gives the null model in which all coefficient estimates equal zero.
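To make this concrete, the criterion glmnet minimizes for a Gaussian response (roughly, before its internal standardization) is

$$\min_{\beta_0,\,\beta}\;\frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i-\beta_0-x_i^{\top}\beta\bigr)^2+\lambda\sum_{j=1}^{p}\lvert\beta_j\rvert,$$

so lambda = 0 recovers least squares and a sufficiently large lambda forces every coefficient to zero.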
Moving along the coefficient path, we observe that when the penalty is strongest the lasso gives a model that contains only the predictor rm.
The five important variables (the five labelled in the plots) are rm, chas, ptratio, dis and nox.
Best lambda approach: select a grid of candidate values, use cross-validation to estimate the test error rate for each value of lambda, and select the value that gives the smallest error rate.
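cv.glmnet() below automates this; purely as a rough hand-rolled sketch of the same idea (assuming 10 folds and a coarser grid of 20 lambda values, not the code used for the results that follow):

# Sketch of K-fold cross-validation for lambda (assumptions: 10 folds, coarse 20-value grid)
lam.grid <- 10^seq(2, -2, length = 20)          # coarser grid than `grid`, for speed only
folds <- sample(rep(1:10, length.out = length(train)))
cv.err <- sapply(lam.grid, function(l) {
  mean(sapply(1:10, function(k) {
    fit <- glmnet(x[train, ][folds != k, ], y[train][folds != k], alpha = 1, lambda = l)
    mean((predict(fit, newx = x[train, ][folds == k, ]) - y[train][folds == k])^2)
  }))
})
lam.grid[which.min(cv.err)]                     # lambda with the smallest estimated CV error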
#how to choose best lambda
set.seed(1)
cv.out=cv.glmnet(x[train,],y[train],alpha=1)
plot(cv.out)
coef(cv.out)
## 15 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 16.245008372
## (Intercept) .
## crim .
## zn .
## indus .
## chas 1.799072743
## nox -0.550758720
## rm 3.892144488
## age .
## dis -0.129415365
## rad .
## tax .
## ptratio -0.771462283
## black 0.008541421
## lstat -0.500038015
The above plot indicates that for large lambda the cross-validation error is very high, because the coefficients are restricted to be too small; at some point the error levels off, which suggests that the full model is doing a good job.
There are also two vertical lines: one at the lambda giving the minimum cross-validation error, and one at the largest lambda whose error is within one standard error of that minimum. The latter gives a slightly more restricted model that does almost as well as the minimum, and sometimes we prefer it.
The axis along the top of the chart indicates the number of variables in the model at each value of lambda.
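Both of these lambdas are stored in cv.out. Note that coef(cv.out) above uses the one-standard-error lambda by default; a quick sketch to inspect the minimum-error lambda instead:

cv.out$lambda.min                 # lambda giving the minimum CV error
cv.out$lambda.1se                 # largest lambda within one standard error of the minimum
coef(cv.out, s = "lambda.min")    # for contrast with coef(cv.out), which defaults to s = "lambda.1se"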
bestlam=cv.out$lambda.min
bestlam
## [1] 0.04776679
# Test MSE associated with this value of lambda
lasso.pred=predict(lasso.mod,s=bestlam,newx=x[test,])
mse= mean((lasso.pred-y.test)^2)
# refit our lasso regression model on the full data set, using the value of lambda chosen by cross-validation, and examine the coefficient estimates
out=glmnet(x,y,alpha=1,lambda=grid)
lasso.coef=predict(out,type="coefficients",s=bestlam)[1:15,]
lasso.coef
## (Intercept) (Intercept) crim zn indus
## 33.101106507 0.000000000 -0.091491877 0.038374347 0.000000000
## chas nox rm age dis
## 2.660045901 -15.558615441 3.911868548 0.000000000 -1.328234520
## rad tax ptratio black lstat
## 0.220760284 -0.008497253 -0.918692608 0.008842988 -0.522384709
lasso.coef[lasso.coef!=0]
## (Intercept) crim zn chas nox
## 33.101106507 -0.091491877 0.038374347 2.660045901 -15.558615441
## rm dis rad tax ptratio
## 3.911868548 -1.328234520 0.220760284 -0.008497253 -0.918692608
## black lstat
## 0.008842988 -0.522384709
oby= Boston[test,]$medv
error=oby-lasso.pred
mse=mean(error^2)
# Actual R-square
R2=1-sum(error^2)/sum((oby-mean(oby))^2)
R2
## [1] 0.7375726
# Approximate R-square
1-mse/var(oby)
## [1] 0.7391441
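The two numbers are close but not identical because var() uses an n-1 divisor while the MSE divides by n:

$$R^2 \;=\; 1-\frac{\sum_i (y_i-\hat y_i)^2}{\sum_i (y_i-\bar y)^2}
\qquad\text{vs.}\qquad
1-\frac{\mathrm{MSE}}{\widehat{\operatorname{Var}}(y)}
\;=\;1-\frac{\tfrac{1}{n}\sum_i (y_i-\hat y_i)^2}{\tfrac{1}{n-1}\sum_i (y_i-\bar y)^2}.$$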
As expected, some of the coefficients (age and indus) are shrunk to exactly zero, since lasso regression performs variable selection as well as shrinkage!
Suppose we want to use our earlier train/validation division to select the lambda for the lasso. This is easy to do.
lasso.tr=glmnet(x[train,],y[train])
pred=predict(lasso.tr,x[-train,])
dim(pred)
## [1] 167 77
So, based on the dimensions, we observe that there are 167 test observations and 77 values of lambda (77 columns).
rmse= sqrt(apply((y[-train]-pred)^2,2,mean))
plot(log(lasso.tr$lambda),rmse,type="b",xlab="Log(lambda)")
lam.best=lasso.tr$lambda[order(rmse)[1]]
lam.best
## [1] 0.005621253
coef(lasso.tr,s=lam.best)
## 15 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 38.496276391
## (Intercept) .
## crim -0.016635415
## zn 0.055737914
## indus 0.039636721
## chas 2.886982584
## nox -19.470840682
## rm 3.497945957
## age 0.001698405
## dis -1.544764544
## rad 0.268917307
## tax -0.011492153
## ptratio -0.972725183
## black 0.012018562
## lstat -0.538987585