# An Introduction to Statistical Learning with Applications in R
# Chapter 6 Lab 2: Ridge Regression and the Lasso
# We will use the glmnet package in order to perform ridge regression and
# the lasso. The main function in this package is glmnet(), which can be used
# to fit ridge regression models, lasso models, and more. This function has
# slightly different syntax from other model-fitting functions that we have
# encountered thus far in this book. In particular, we must pass in an x
# matrix as well as a y vector, and we do not use the y ~ x syntax. We will
# now perform ridge regression and the lasso in order to predict Salary on
# the Hitters data.
# Before proceeding ensure that the missing values have
# been removed from the data, as described in Section 6.5.
# load Hitters data and display data editor window
# fix(Hitters)
# this is here just for creating the RPubs file
setwd("C:/Users/sgrambow/Dropbox/CRP241share/F20/production/default-crp241F20-scg")
load("workspace.RData")
# Let's remove the missing values for the purpose of this example.
# We can use na.omit() function to remove all rows containing
# a missing value for Salary (note that this is the only
# variable in the data frame with missing values).
# NOTE that in general, just deleting missing data
# is not a reasonable approach for most real modeling
# applications
Hitters <- na.omit(Hitters)
# as discussed above, set up x and y to feed to glmnet()
x=model.matrix(Salary~.,Hitters)[,-1]
y=Hitters$Salary
# The model.matrix() function is particularly useful for creating x; not only
# does it produce a matrix corresponding to the 19 predictors but it also
# automatically transforms any qualitative variables into dummy variables.
# The latter property is important because glmnet() can only take numerical,
# quantitative inputs.
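# As a quick illustration (a hedged sketch, not part of the original lab):
# League is a factor with levels "A" and "N", and model.matrix() codes it as a
# single 0/1 dummy column, which shows up as LeagueN in x above.
head(model.matrix(~League,Hitters))   # intercept column plus the LeagueN dummy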
# Ridge Regression
# The glmnet() function has an alpha argument that determines what type
# of model is fit. If alpha=0 then a ridge regression model is fit, and if alpha=1
# then a lasso model is fit. We first fit a ridge regression model.
# install.packages('glmnet')
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.0.4
## Loading required package: Matrix
## Loaded glmnet 4.1-1
grid=10^seq(10,-2,length=100)
ridge.mod=glmnet(x,y,alpha=0,lambda=grid)
# By default the glmnet() function performs ridge regression for an
# automatically selected range of lambda values. However, here we have chosen to
# implement the function over a grid of values ranging from lambda = 10^10
# to lambda = 10^-2, essentially covering the full range of scenarios from the
# null model containing only the intercept, to the least squares fit.
# As we will see, we can also compute model fits for a particular value of
# lambda that is not one of the original grid values. Note that by default,
# the glmnet() function standardizes the variables so that they are on the
# same scale. To turn off this default setting, use the argument
# standardize=FALSE.
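# As a side check (a small sketch, not part of the original lab), we could let
# glmnet() pick its own lambda sequence and inspect its range, or refit on our
# grid with standardization turned off:
default.fit=glmnet(x,y,alpha=0)     # glmnet() chooses the lambda sequence itself
range(default.fit$lambda)           # inspect the automatically selected range
ridge.unstd=glmnet(x,y,alpha=0,lambda=grid,standardize=FALSE)  # no standardization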
# Associated with each value of lambda is a vector of ridge regression
# coefficients, stored in a matrix that can be accessed by coef().
# In this case, it is a 20×100 matrix, with 20 rows (one for each predictor,
# plus an intercept) and 100 columns (one for each value of lambda).
dim(coef(ridge.mod))
## [1] 20 100
# We expect the coefficient estimates to be much smaller, in terms of L2 norm,
# when a large value of lambda is used, as compared to when a small value of
# lambda is used. These are the coefficients when lambda = 11,498,
# along with their L2 norm:
ridge.mod$lambda[50]
## [1] 11497.57
coef(ridge.mod)[,50]
## (Intercept) AtBat Hits HmRun Runs
## 407.356050200 0.036957182 0.138180344 0.524629976 0.230701523
## RBI Walks Years CAtBat CHits
## 0.239841459 0.289618741 1.107702929 0.003131815 0.011653637
## CHmRun CRuns CRBI CWalks LeagueN
## 0.087545670 0.023379882 0.024138320 0.025015421 0.085028114
## DivisionW PutOuts Assists Errors NewLeagueN
## -6.215440973 0.016482577 0.002612988 -0.020502690 0.301433531
sqrt(sum(coef(ridge.mod)[-1,50]^2))
## [1] 6.360612
# In contrast, here are the coefficients when lambda = 705, along with their L2
# norm. Note the much larger L2 norm of the coefficients associated with this
# smaller value of lambda.
ridge.mod$lambda[60]
## [1] 705.4802
coef(ridge.mod)[,60]
## (Intercept) AtBat Hits HmRun Runs RBI
## 54.32519950 0.11211115 0.65622409 1.17980910 0.93769713 0.84718546
## Walks Years CAtBat CHits CHmRun CRuns
## 1.31987948 2.59640425 0.01083413 0.04674557 0.33777318 0.09355528
## CRBI CWalks LeagueN DivisionW PutOuts Assists
## 0.09780402 0.07189612 13.68370191 -54.65877750 0.11852289 0.01606037
## Errors NewLeagueN
## -0.70358655 8.61181213
sqrt(sum(coef(ridge.mod)[-1,60]^2))
## [1] 57.11001
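# To visualize the shrinkage across the entire grid (an optional sketch, plot
# not shown here), we can compute the L2 norm of the coefficient vector at
# every value of lambda and plot it against log10(lambda):
ridge.betas=as.matrix(coef(ridge.mod))[-1,]              # drop the intercept row
l2.norm=apply(ridge.betas,2,function(b) sqrt(sum(b^2)))  # one L2 norm per lambda
plot(log10(ridge.mod$lambda),l2.norm,type="l",
     xlab="log10(lambda)",ylab="L2 norm of coefficients")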
# We can use the predict() function for a number of purposes. For instance,
# we can obtain the ridge regression coefficients for a new value of lambda,
# say 50:
# note that the s=50 option below specifies the value of lambda to be 50
# we use type="coefficients" to obtain the beta estimates
predict(ridge.mod,s=50,type="coefficients")[1:20,]
## (Intercept) AtBat Hits HmRun Runs
## 4.876610e+01 -3.580999e-01 1.969359e+00 -1.278248e+00 1.145892e+00
## RBI Walks Years CAtBat CHits
## 8.038292e-01 2.716186e+00 -6.218319e+00 5.447837e-03 1.064895e-01
## CHmRun CRuns CRBI CWalks LeagueN
## 6.244860e-01 2.214985e-01 2.186914e-01 -1.500245e-01 4.592589e+01
## DivisionW PutOuts Assists Errors NewLeagueN
## -1.182011e+02 2.502322e-01 1.215665e-01 -3.278600e+00 -9.496680e+00
# We now split the samples into a training set and a test set in order
# to estimate the test error of ridge regression and the lasso. There are two
# common ways to randomly split a data set. The first is to produce a random
# vector of TRUE, FALSE elements and select the observations corresponding to
# TRUE for the training data. The second is to randomly choose a subset of
# numbers between 1 and n; these can then be used as the indices for the
# training observations. The two approaches work equally well. We used the
# former method in Lab 1 with the best subset regression example;
# here we demonstrate the latter approach.
# in lab 1: train=sample(c(TRUE,FALSE), nrow(Hitters),rep=TRUE)
# We first set a random seed so that the results obtained will be reproducible.
set.seed(1)
train=sample(1:nrow(x), nrow(x)/2)
test=(-train)
y.test=y[test]
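# As a quick sanity check on the split (a small sketch, output not shown), the
# training and test pieces should together account for all nrow(x) observations:
length(train)    # number of training observations (about half of nrow(x))
length(y.test)   # number of test observations (the remainder)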
# Next we fit a ridge regression model on the training set, and evaluate
# its MSE on the test set, using lambda = 4. Note the use of the predict()
# function again. This time we get predictions for a test set, by replacing
# type="coefficients" with the newx argument.
ridge.mod=glmnet(x[train,],y[train],alpha=0,lambda=grid, thresh=1e-12)
ridge.pred=predict(ridge.mod,s=4,newx=x[test,])
mean((ridge.pred-y.test)^2)
## [1] 142199.2
# The test MSE is 142199. Note that if we had instead simply fit a model
# with just an intercept, we would have predicted each test observation using
# the mean of the training observations. In that case, we could compute the
# test set MSE like this:
mean((mean(y[train])-y.test)^2)
## [1] 224669.9
# We could also get the same result by fitting a ridge regression model with
# a very large value of lambda. Note that 1e10 means 10^10.
ridge.pred=predict(ridge.mod,s=1e10,newx=x[test,])
mean((ridge.pred-y.test)^2)
## [1] 224669.8
# So fitting a ridge regression model with lambda = 4 leads to a much lower test
# MSE than fitting a model with just an intercept. We now check whether
# there is any benefit to performing ridge regression with lambda = 4 instead of
# just performing least squares regression. Recall that least squares is simply
# ridge regression with lambda = 0.
# NOTE:
# In order for glmnet() to yield the exact least squares coefficients when
# lambda = 0, we use the argument exact=T when calling the predict() function.
# Otherwise, the predict() function will interpolate over the grid of lambda
# values used in fitting the glmnet() model, yielding approximate results.
# When we use exact=T, there remains a slight discrepancy in the third decimal
# place between the output of glmnet() when lambda = 0 and the output of lm();
# this is due to numerical approximation on the part of glmnet().
ridge.pred=predict(ridge.mod,s=0,newx=x[test,],exact=T,x=x[train,],y=y[train])
mean((ridge.pred-y.test)^2)
## [1] 168588.6
lm(y~x, subset=train)
##
## Call:
## lm(formula = y ~ x, subset = train)
##
## Coefficients:
## (Intercept) xAtBat xHits xHmRun xRuns xRBI
## 274.0145 -0.3521 -1.6377 5.8145 1.5424 1.1243
## xWalks xYears xCAtBat xCHits xCHmRun xCRuns
## 3.7287 -16.3773 -0.6412 3.1632 3.4008 -0.9739
## xCRBI xCWalks xLeagueN xDivisionW xPutOuts xAssists
## -0.6005 0.3379 119.1486 -144.0831 0.1976 0.6804
## xErrors xNewLeagueN
## -4.7128 -71.0951
predict(ridge.mod,s=0,exact=T,type="coefficients",x=x[train,],y=y[train])[1:20,]
## (Intercept) AtBat Hits HmRun Runs RBI
## 274.0200994 -0.3521900 -1.6371383 5.8146692 1.5423361 1.1241837
## Walks Years CAtBat CHits CHmRun CRuns
## 3.7288406 -16.3795195 -0.6411235 3.1629444 3.4005281 -0.9739405
## CRBI CWalks LeagueN DivisionW PutOuts Assists
## -0.6003976 0.3378422 119.1434637 -144.0853061 0.1976300 0.6804200
## Errors NewLeagueN
## -4.7127879 -71.0898914
# In general, if we want to fit an (unpenalized) least squares model, then
# we should use the lm() function, since that function provides more useful
# outputs, such as standard errors and p-values for the coefficients.
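# For example (a brief sketch, not part of the original lab), summary() on the
# least squares fit reports the standard errors, t-statistics, and p-values
# that glmnet() does not provide:
ls.fit=lm(y~x,subset=train)
summary(ls.fit)$coefficients[1:5,]   # estimate, std. error, t value, p-value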
# In general, instead of arbitrarily choosing lambda = 4, it would be better to
# use cross-validation to choose the tuning parameter lambda. We can do this using
# the built-in cross-validation function, cv.glmnet(). By default, the function
# performs ten-fold cross-validation, though this can be changed using the
# argument nfolds. Note that we set a random seed first so our results will
# be reproducible, since the choice of the cross-validation folds is random.
set.seed(1)
cv.out=cv.glmnet(x[train,],y[train],alpha=0)
plot(cv.out)

bestlam=cv.out$lambda.min
bestlam
## [1] 326.0828
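# As noted above, the number of folds can be changed with the nfolds argument.
# A small sketch (results not shown, and not used below) with 5-fold instead of
# the default 10-fold cross-validation:
cv.out.5fold=cv.glmnet(x[train,],y[train],alpha=0,nfolds=5)
cv.out.5fold$lambda.min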
# Therefore, we see that the value of lambda that results in the smallest
# cross-validation error is 326.08. What is the test MSE associated with this
# value of lambda?
ridge.pred=predict(ridge.mod,s=bestlam,newx=x[test,])
mean((ridge.pred-y.test)^2)
## [1] 139856.6
# This represents a further improvement over the test MSE that we got using
# lambda = 4. Finally, we refit our ridge regression model on the full data set,
# using the value of lambda chosen by cross-validation, and examine the
# coefficient estimates.
out=glmnet(x,y,alpha=0)
predict(out,type="coefficients",s=bestlam)[1:20,]
## (Intercept) AtBat Hits HmRun Runs RBI
## 15.44383135 0.07715547 0.85911581 0.60103107 1.06369007 0.87936105
## Walks Years CAtBat CHits CHmRun CRuns
## 1.62444616 1.35254780 0.01134999 0.05746654 0.40680157 0.11456224
## CRBI CWalks LeagueN DivisionW PutOuts Assists
## 0.12116504 0.05299202 22.09143189 -79.04032637 0.16619903 0.02941950
## Errors NewLeagueN
## -1.36092945 9.12487767
# As expected, none of the coefficients are zero: ridge regression does not
# perform variable selection!
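# A quick check of this claim (a small sketch, output not shown): count how many
# of the 19 ridge coefficient estimates at bestlam are exactly zero.
sum(predict(out,type="coefficients",s=bestlam)[-1,]==0)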
# The Lasso
# We saw that ridge regression with a wise choice of lambda can outperform least
# squares as well as the null model on the Hitters data set. We now ask
# whether the lasso can yield either a more accurate or a more interpretable
# model than ridge regression. In order to fit a lasso model, we once again
# use the glmnet() function; however, this time we use the argument alpha=1.
# Other than that change, we proceed just as we did in fitting a ridge model.
lasso.mod=glmnet(x[train,],y[train],alpha=1,lambda=grid)
plot(lasso.mod)
## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values

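# By default, plot.glmnet() draws the coefficient paths against the L1 norm of
# the coefficients. As an optional variation (a sketch, plot not shown), the
# same paths can be drawn against log(lambda) via the xvar argument:
plot(lasso.mod,xvar="lambda")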
# We can see from the coefficient plot that depending on the choice of tuning
# parameter, some of the coefficients will be exactly equal to zero. We now
# perform cross-validation and compute the associated test error.
set.seed(1)
cv.out=cv.glmnet(x[train,],y[train],alpha=1)
plot(cv.out)

bestlam=cv.out$lambda.min
lasso.pred=predict(lasso.mod,s=bestlam,newx=x[test,])
mean((lasso.pred-y.test)^2)
## [1] 143673.6
# This is substantially lower than the test set MSE of the null model and of
# least squares, and very similar to the test MSE of ridge regression with
# lambda chosen by cross-validation.
# However, the lasso has a substantial advantage over ridge regression in
# that the resulting coefficient estimates are sparse. Here we see that 8 of
# the 19 coefficient estimates are exactly zero. So the lasso model with lambda
# chosen by cross-validation contains only 11 variables.
out=glmnet(x,y,alpha=1,lambda=grid)
lasso.coef=predict(out,type="coefficients",s=bestlam)[1:20,]
lasso.coef
## (Intercept) AtBat Hits HmRun Runs
## 1.27479059 -0.05497143 2.18034583 0.00000000 0.00000000
## RBI Walks Years CAtBat CHits
## 0.00000000 2.29192406 -0.33806109 0.00000000 0.00000000
## CHmRun CRuns CRBI CWalks LeagueN
## 0.02825013 0.21628385 0.41712537 0.00000000 20.28615023
## DivisionW PutOuts Assists Errors NewLeagueN
## -116.16755870 0.23752385 0.00000000 -0.85629148 0.00000000
lasso.coef[lasso.coef!=0]
## (Intercept) AtBat Hits Walks Years
## 1.27479059 -0.05497143 2.18034583 2.29192406 -0.33806109
## CHmRun CRuns CRBI LeagueN DivisionW
## 0.02825013 0.21628385 0.41712537 20.28615023 -116.16755870
## PutOuts Errors
## 0.23752385 -0.85629148
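# As a quick check of the counts quoted above (a small sketch, output not
# shown), tally the zero and nonzero coefficient estimates, excluding the
# intercept:
sum(lasso.coef[-1]==0)   # 8 coefficients shrunk exactly to zero
sum(lasso.coef[-1]!=0)   # 11 variables retained in the lasso model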
# END OF FILE