Starting from the linear regression model with all variables included, find the best subset model for medv on the training data using the LASSO and the best lambda. What are the important variables? (Dataset = Boston)

library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-5
library(MASS)
attach(Boston)
library(plotmo)
## Loading required package: plotrix
## Loading required package: TeachingDemos
grid <- 10^seq(10,-2,length=1000)

# We implement the function over a grid of values ranging from lambda = 10^10 to lambda = 10^-2, essentially covering the full range of scenarios from the null model containing only the intercept to the least squares fit.
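
A quick sanity check on the grid (not part of the original output; the values follow directly from the seq() call above):

range(grid)    # smallest and largest lambda: 0.01 and 1e10
length(grid)   # 1000 values, equally spaced on the log10 scale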

head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7
dim(Boston)
## [1] 506  14
Boston <- na.omit(Boston) 
x <- model.matrix(medv~.,Boston)
str(x)
##  num [1:506, 1:14] 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:506] "1" "2" "3" "4" ...
##   ..$ : chr [1:14] "(Intercept)" "crim" "zn" "indus" ...
##  - attr(*, "assign")= int [1:14] 0 1 2 3 4 5 6 7 8 9 ...
### model.matrix() creates a design matrix from the formula and the supplied data, which must contain variables with the same names as a call to model.frame() would produce. It also automatically converts any qualitative variables into dummy variables. This property is important because glmnet() can only take numerical (quantitative) inputs.
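
A minimal illustration of the dummy-variable conversion (hypothetical toy data, since Boston itself has no factor columns):

toy <- data.frame(y = c(1, 2, 3), grp = factor(c("a", "b", "b")))
model.matrix(y ~ ., toy)   # the factor grp becomes a 0/1 dummy column "grpb"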

y <-Boston$medv
str(y)
##  num [1:506] 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
train <- sample(1:nrow(x), nrow(x)*.67)
test <- (-train)
y.test <- y[test]

# Checks
dim (x[train,])
## [1] 339  14
length(y[train])
## [1] 339
length(y.test)
## [1] 167
lasso.mod <-glmnet(x[train,], y[train],alpha=1,lambda = grid)
# glmnet() function standardizes the variables by default so that they are on the same scale.
# If alpha=0 then a ridge regression model is fit, and if alpha=1 then a lasso model is fit.
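
For reference, the call above is equivalent to spelling out the defaults explicitly (a sketch; standardize = TRUE is glmnet's default, and alpha = 0 with the same arguments would fit the ridge path instead):

lasso.mod2 <- glmnet(x[train,], y[train], alpha = 1, lambda = grid, standardize = TRUE)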

dim(coef(lasso.mod ))
## [1]   15 1000
# check the coefficients at the 825th lambda value on the path (i.e., lambda = 1.265)
lasso.mod$lambda[825]
## [1] 1.265034
coef(lasso.mod)[,825]
##  (Intercept)  (Intercept)         crim           zn        indus 
## 17.781798308  0.000000000  0.000000000  0.000000000  0.000000000 
##         chas          nox           rm          age          dis 
##  0.000000000  0.000000000  3.237136004  0.000000000  0.000000000 
##          rad          tax      ptratio        black        lstat 
##  0.000000000  0.000000000 -0.563288642  0.002203551 -0.475572298
summary(lasso.mod)
##           Length Class     Mode   
## a0         1000  -none-    numeric
## beta      14000  dgCMatrix S4     
## df         1000  -none-    numeric
## dim           2  -none-    numeric
## lambda     1000  -none-    numeric
## dev.ratio  1000  -none-    numeric
## nulldev       1  -none-    numeric
## npasses       1  -none-    numeric
## jerr          1  -none-    numeric
## offset        1  -none-    logical
## call          5  -none-    call   
## nobs          1  -none-    numeric
par(mfrow=c(1,2))
plot_glmnet(lasso.mod, xvar = "lambda", label = 5)

plot_glmnet(lasso.mod, xvar="dev",label=5)

When lambda = 0, the lasso simply gives the least squares fit, and when lambda becomes sufficiently large, the lasso gives the null model in which all coefficient estimates equal zero.
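
A quick sketch confirming these two extremes on the path fitted above (column 1 corresponds to the largest lambda in the grid, column 1000 to the smallest):

coef(lasso.mod)[, 1]      # lambda = 1e10: every slope is shrunk to exactly zero
coef(lasso.mod)[, 1000]   # lambda = 0.01: estimates close to the least squares fit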

Moving from left to right in the coefficient path plot, we observe that at first the lasso results in a model that contains only the predictor rm.

The five important variables are rm, chas, ptratio, dis, and nox.

Best-lambda approach: select a grid of potential values, use cross-validation to estimate the error rate for each value of lambda, and select the value that gives the lowest error rate.

# how to choose the best lambda
set.seed(1)
cv.out=cv.glmnet(x[train,],y[train],alpha=1)
plot(cv.out, label=TRUE)

coef(cv.out)
## 15 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) 16.245008372
## (Intercept)  .          
## crim         .          
## zn           .          
## indus        .          
## chas         1.799072743
## nox         -0.550758720
## rm           3.892144488
## age          .          
## dis         -0.129415365
## rad          .          
## tax          .          
## ptratio     -0.771462283
## black        0.008541421
## lstat       -0.500038015

The above plot indicates that for high lambda the error is very high, because the coefficients are restricted to be too small, and then at some point it levels off. This suggests that the full model is doing a good job.

Also, there are two vertical lines. One is at the minimum, and the other is at the largest lambda whose error is within one standard error of the minimum. The latter is a slightly more restricted model that does almost as well as the minimum, and sometimes we will go for that.

The numbers across the top of the chart indicate how many variables are in the model at each value of lambda.
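
Both candidate values are stored on the cv.glmnet object; a sketch of how to inspect them and compare the coefficients at the minimum versus the one-standard-error choice:

cv.out$lambda.min                 # lambda with the lowest cross-validated error
cv.out$lambda.1se                 # largest lambda within one standard error of that minimum
coef(cv.out, s = "lambda.min")    # compare with coef(cv.out), which defaults to s = "lambda.1se"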

bestlam=cv.out$lambda.min
bestlam
## [1] 0.04776679
# test MSE associated with this value of lambda
lasso.pred=predict(lasso.mod,s=bestlam,newx=x[test,])
mse= mean((lasso.pred-y.test)^2)
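
The test MSE is stored in mse but not printed above; its exact value depends on the random train/test split, so to inspect it:

mse   # held-out mean squared error at lambda = bestlam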

# refit our lasso regression model on the full data set, using the value of lambda chosen by cross-validation, and examine the coefficient estimates

out=glmnet(x,y,alpha=1,lambda=grid)
lasso.coef=predict(out,type="coefficients",s=bestlam)[1:15,]
lasso.coef
##   (Intercept)   (Intercept)          crim            zn         indus 
##  33.101106507   0.000000000  -0.091491877   0.038374347   0.000000000 
##          chas           nox            rm           age           dis 
##   2.660045901 -15.558615441   3.911868548   0.000000000  -1.328234520 
##           rad           tax       ptratio         black         lstat 
##   0.220760284  -0.008497253  -0.918692608   0.008842988  -0.522384709
lasso.coef[lasso.coef!=0]
##   (Intercept)          crim            zn          chas           nox 
##  33.101106507  -0.091491877   0.038374347   2.660045901 -15.558615441 
##            rm           dis           rad           tax       ptratio 
##   3.911868548  -1.328234520   0.220760284  -0.008497253  -0.918692608 
##         black         lstat 
##   0.008842988  -0.522384709
oby= Boston[test,]$medv

error=oby-lasso.pred
mse=mean(error^2)

# Actual R-square
R2=1-sum(error^2)/sum((oby-mean(oby))^2)
R2
## [1] 0.7375726
# Approximate R-square
1-mse/var(oby)
## [1] 0.7391441
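
The small gap between the two values comes only from var() using an n - 1 denominator while mse divides by n; a short sketch making that correction explicit:

n <- length(oby)
1 - (mse * n) / (var(oby) * (n - 1))   # reproduces the "actual" R-square above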

As expected, some of the coefficients (age and indus) are exactly zero, since the lasso performs variable selection as well as shrinkage!

Suppose we want to use our earlier train/validation division to select the lambda for the lasso. This is easy to do.

lasso.tr=glmnet(x[train,],y[train])
pred=predict(lasso.tr,x[-train,])
dim(pred)
## [1] 167  77

Based on the dimensions, we observe that there are 167 observations and 77 values of lambda (77 columns).
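
When lambda is not supplied, glmnet() chooses its own decreasing sequence (up to 100 values, stopping early once the fit stops improving); its length matches the number of columns of pred:

length(lasso.tr$lambda)   # 77 here, one per column of pred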

rmse= sqrt(apply((y[-train]-pred)^2,2,mean))
plot(log(lasso.tr$lambda),rmse,type="b",xlab="Log(lambda)")

lam.best=lasso.tr$lambda[order(rmse)[1]]
lam.best
## [1] 0.005621253
coef(lasso.tr,s=lam.best)
## 15 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept)  38.496276391
## (Intercept)   .          
## crim         -0.016635415
## zn            0.055737914
## indus         0.039636721
## chas          2.886982584
## nox         -19.470840682
## rm            3.497945957
## age           0.001698405
## dis          -1.544764544
## rad           0.268917307
## tax          -0.011492153
## ptratio      -0.972725183
## black         0.012018562
## lstat        -0.538987585