Model Selection

This is an R Markdown document. Markdown is a simple formatting syntax for authoring web pages, and a very nice way of distributing an analysis. It has some very simple syntax rules.

We’re going to look at using validation sets, cross-validation for selecting the tuning parameters in Stepwise Regression, Lasso, Ridge Regression.

library(ISLR)
summary(Hitters)

##      AtBat            Hits         HmRun            Runs       
##  Min.   : 16.0   Min.   :  1   Min.   : 0.00   Min.   :  0.00  
##  1st Qu.:255.2   1st Qu.: 64   1st Qu.: 4.00   1st Qu.: 30.25  
##  Median :379.5   Median : 96   Median : 8.00   Median : 48.00  
##  Mean   :380.9   Mean   :101   Mean   :10.77   Mean   : 50.91  
##  3rd Qu.:512.0   3rd Qu.:137   3rd Qu.:16.00   3rd Qu.: 69.00  
##  Max.   :687.0   Max.   :238   Max.   :40.00   Max.   :130.00  
##                                                                
##       RBI             Walks            Years            CAtBat       
##  Min.   :  0.00   Min.   :  0.00   Min.   : 1.000   Min.   :   19.0  
##  1st Qu.: 28.00   1st Qu.: 22.00   1st Qu.: 4.000   1st Qu.:  816.8  
##  Median : 44.00   Median : 35.00   Median : 6.000   Median : 1928.0  
##  Mean   : 48.03   Mean   : 38.74   Mean   : 7.444   Mean   : 2648.7  
##  3rd Qu.: 64.75   3rd Qu.: 53.00   3rd Qu.:11.000   3rd Qu.: 3924.2  
##  Max.   :121.00   Max.   :105.00   Max.   :24.000   Max.   :14053.0  
##                                                                      
##      CHits            CHmRun           CRuns             CRBI        
##  Min.   :   4.0   Min.   :  0.00   Min.   :   1.0   Min.   :   0.00  
##  1st Qu.: 209.0   1st Qu.: 14.00   1st Qu.: 100.2   1st Qu.:  88.75  
##  Median : 508.0   Median : 37.50   Median : 247.0   Median : 220.50  
##  Mean   : 717.6   Mean   : 69.49   Mean   : 358.8   Mean   : 330.12  
##  3rd Qu.:1059.2   3rd Qu.: 90.00   3rd Qu.: 526.2   3rd Qu.: 426.25  
##  Max.   :4256.0   Max.   :548.00   Max.   :2165.0   Max.   :1659.00  
##                                                                      
##      CWalks        League  Division    PutOuts          Assists     
##  Min.   :   0.00   A:175   E:157    Min.   :   0.0   Min.   :  0.0  
##  1st Qu.:  67.25   N:147   W:165    1st Qu.: 109.2   1st Qu.:  7.0  
##  Median : 170.50                    Median : 212.0   Median : 39.5  
##  Mean   : 260.24                    Mean   : 288.9   Mean   :106.9  
##  3rd Qu.: 339.25                    3rd Qu.: 325.0   3rd Qu.:166.0  
##  Max.   :1566.00                    Max.   :1378.0   Max.   :492.0  
##                                                                     
##      Errors          Salary       NewLeague
##  Min.   : 0.00   Min.   :  67.5   A:176    
##  1st Qu.: 3.00   1st Qu.: 190.0   N:146    
##  Median : 6.00   Median : 425.0            
##  Mean   : 8.04   Mean   : 535.9            
##  3rd Qu.:11.00   3rd Qu.: 750.0            
##  Max.   :32.00   Max.   :2460.0            
##                  NA's   :59

Hitters is a baseball dataset which has a recording of a whole lot of statistics for different baseball players during the baseball season and including a response variable salary, which we’re going to use as the target for a regression model.

Goal: We’re going to try and use these statistics to predict the salary of the player.

For salary variable, there are many missing values, 59 NA’s, so, we should do something about this. Among various ways of dealing with them, we’re going to eliminate any row in the data frame that’s got a missing value in it by using na.omit function.
na.omit function will produce a new data frame, which we just overwrite on Hitters. and won’t have any missing values.

Hitters=na.omit(Hitters)
with(Hitters,sum(is.na(Salary)))

## [1] 0

Best Subset regression

Best subset regression looks through all possible regression models of all different subset sizes and looks for the best of each size. That is, it produces a sequence of models which is the best subset for each particular size.

We will now use the package leaps to evaluate all the best-subset models. There’s a function in leaps called regsubsets that will do the best subset modeling. With salary as the response, it takes a formula and the data is Hitters. We’ve got over 200 baseball players here, we’ve got 19 variables and it did all subsets regression in no time at all.

library(leaps)
regfit.full=regsubsets(Salary~.,data=Hitters)
summary(regfit.full)

## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters)
## 19 Variables  (and intercept)
##            Forced in Forced out
## AtBat          FALSE      FALSE
## Hits           FALSE      FALSE
## HmRun          FALSE      FALSE
## Runs           FALSE      FALSE
## RBI            FALSE      FALSE
## Walks          FALSE      FALSE
## Years          FALSE      FALSE
## CAtBat         FALSE      FALSE
## CHits          FALSE      FALSE
## CHmRun         FALSE      FALSE
## CRuns          FALSE      FALSE
## CRBI           FALSE      FALSE
## CWalks         FALSE      FALSE
## LeagueN        FALSE      FALSE
## DivisionW      FALSE      FALSE
## PutOuts        FALSE      FALSE
## Assists        FALSE      FALSE
## Errors         FALSE      FALSE
## NewLeagueN     FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## 1  ( 1 ) " "   " "  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 2  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 3  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 4  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 5  ( 1 ) "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 6  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "  
## 7  ( 1 ) " "   "*"  " "   " "  " " "*"   " "   "*"    "*"   "*"    " "  
## 8  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   "*"    "*"  
##          CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1  ( 1 ) "*"  " "    " "     " "       " "     " "     " "    " "       
## 2  ( 1 ) "*"  " "    " "     " "       " "     " "     " "    " "       
## 3  ( 1 ) "*"  " "    " "     " "       "*"     " "     " "    " "       
## 4  ( 1 ) "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 5  ( 1 ) "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 6  ( 1 ) "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 7  ( 1 ) " "  " "    " "     "*"       "*"     " "     " "    " "       
## 8  ( 1 ) " "  "*"    " "     "*"       "*"     " "     " "    " "

In summary, for example, if subset size is 1, it puts a star next to the variable that’s in the best subset of size 1. and in the case of 2, 3, … until 8. Anyway, This is one way of looking at our best subset. and by default, it goes only goes up to subsets of size 8. (Maybe that’s why it was so fast)
However, since we’ve got 19 variables, we’re going to go all the way up to size 19 to get the best subset.

regfit.full=regsubsets(Salary~.,data=Hitters, nvmax=19)
reg.summary=summary(regfit.full)
names(reg.summary)

## [1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"

In names on the summary, this tells us what’s on the summary. It has various things like, for best subset models, it has the r squared, the residual sum of squared, the adjusted r squared, the cp statistic, the bic statistic, and a few other things. All the things can help us select the particular model.

plot(reg.summary$cp,xlab="Number of Variables",ylab="Cp") # Cp component
which.min(reg.summary$cp)

## [1] 10

points(10,reg.summary$cp[10],pch=20,col="red")

Remember Cp is an estimate of prediction error. We plotted it against the number of variables and the Cp statistic. The idea is to pick a model with the lowest Cp. We can use the function, which.min to identify the smallest element of the Cp component.

There is a plot method for the regsubsets object, which is like a pattern picture.

plot(regfit.full,scale="Cp")

For example, Cp 5 will have corresponded to our model of size 10. For each value of Cp, black squares are ones indicating that variables in and white squares are the variables are out.
What you notice is near the bottom of the Cp (upper in picture) is the models are reasonably stable. Like the band stays the same. What you also notice is we should expect that bad Cps corresponds to models that had all the variables. It gives a quick summary of all the models as opposed to just seeing the Cp statistics.

Now, having chosen model 10 as a coefficient method for regsubsets, we ask it for the coefficients for the model indexed 10. We get a coefficient vector.

coef(regfit.full,10)

##  (Intercept)        AtBat         Hits        Walks       CAtBat 
##  162.5354420   -2.1686501    6.9180175    5.7732246   -0.1300798 
##        CRuns         CRBI       CWalks    DivisionW      PutOuts 
##    1.4082490    0.7743122   -0.8308264 -112.3800575    0.2973726 
##      Assists 
##    0.2831680

Forward Stepwise Selection

While Base subset is quite aggressive looking at all possible subsets, Forward stepwise is a greedy algorithm. Each time it includes the next best variable, but it produces a nested sequence. So, it’s a much less adventurous search. Again, it keeps a nested sequence of models each time you just add the variable that improves the set the most.

Here we use the regsubsets function but specify the `method=“forward” option:

regfit.fwd=regsubsets(Salary~.,data=Hitters,nvmax=19,method="forward")
summary(regfit.fwd)

## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
## 19 Variables  (and intercept)
##            Forced in Forced out
## AtBat          FALSE      FALSE
## Hits           FALSE      FALSE
## HmRun          FALSE      FALSE
## Runs           FALSE      FALSE
## RBI            FALSE      FALSE
## Walks          FALSE      FALSE
## Years          FALSE      FALSE
## CAtBat         FALSE      FALSE
## CHits          FALSE      FALSE
## CHmRun         FALSE      FALSE
## CRuns          FALSE      FALSE
## CRBI           FALSE      FALSE
## CWalks         FALSE      FALSE
## LeagueN        FALSE      FALSE
## DivisionW      FALSE      FALSE
## PutOuts        FALSE      FALSE
## Assists        FALSE      FALSE
## Errors         FALSE      FALSE
## NewLeagueN     FALSE      FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: forward
##           AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## 1  ( 1 )  " "   " "  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 2  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 3  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 4  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 5  ( 1 )  "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 6  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "  
## 7  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "  
## 8  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    "*"  
## 9  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"  
## 10  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"  
## 11  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"  
## 12  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"  
## 13  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"  
## 14  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    " "   " "    "*"  
## 15  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    "*"   " "    "*"  
## 16  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"  
## 17  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"  
## 18  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   " "    "*"  
## 19  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   "*"    "*"  
##           CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1  ( 1 )  "*"  " "    " "     " "       " "     " "     " "    " "       
## 2  ( 1 )  "*"  " "    " "     " "       " "     " "     " "    " "       
## 3  ( 1 )  "*"  " "    " "     " "       "*"     " "     " "    " "       
## 4  ( 1 )  "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 5  ( 1 )  "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 6  ( 1 )  "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 7  ( 1 )  "*"  "*"    " "     "*"       "*"     " "     " "    " "       
## 8  ( 1 )  "*"  "*"    " "     "*"       "*"     " "     " "    " "       
## 9  ( 1 )  "*"  "*"    " "     "*"       "*"     " "     " "    " "       
## 10  ( 1 ) "*"  "*"    " "     "*"       "*"     "*"     " "    " "       
## 11  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     " "    " "       
## 12  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     " "    " "       
## 13  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 14  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 15  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 16  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 17  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    "*"       
## 18  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    "*"       
## 19  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    "*"

Now, if you look through the models that were selected, you’ll find that they exactly nested. Thus, each new model includes all the one variable that were before, plus new one.

plot(regfit.fwd,scale="Cp")

It looks very similar to what we saw for base subsets. Different in some places, but the same kin of structure near the good, which is low Cp end, where there’s a consistent group of variables that are in a little bit of fluctuation down in terms of which of those variables are in.

Model Selection Using a Validation Set

We’ve got base subset and forward stepwise. and we can select the models using Cp. There’s also adjusted r squared and BIC, but here we’re going to actually use a validation set. We’re going to pick a subset of the observations and put them aside and use them as a validation set and the other as a training data set.

Lets make a training and validation set, so that we can choose a good subset model. We will do it using a slightly different approach from what was done in the the book.

dim(Hitters) # [263 20]

## [1] 263  20

set.seed(1)
train=sample(seq(263),180,replace=FALSE) # sample size = 180, there's 180 numbers chosen at random from the sequence 1 to 263.
train

##   [1]  70  98 150 237  53 232 243 170 161  16 259  45 173  97 192 124 178
##  [18] 245  94 190 228  52 158  31  64  92   4  91 205  80 113 140 115  43
##  [35] 244 153 181  25 163  93 184 144 174 122 117 251   6 104 241 149 102
##  [52] 183 224 242  15  21  66 107 136  83 186  60 211  67 130 210  95 151
##  [69]  17 256 207 162 200 239 236 168 249  73 222 177 234 199 203  59 235
##  [86]  37 126  22 230 226  42  11 110 214 132 134  77  69 188 100 206  58
## [103]  44 159 101  34 208  75 185 201 261 112  54  65  23   2 106 254 257
## [120] 154 142  71 166 221 105  63 143  29 240 212 167 172   5  84 120 133
## [137]  72 191 248 138 182  74 179 135  87 196 157 119  13  99 263 125 247
## [154]  50  55  20  57   8  30 194 139 238  46  78  88  41   7  33 141  32
## [171] 180 164 213  36 215  79 225 229 198  76

regfit.fwd=regsubsets(Salary~.,data=Hitters[train,],nvmax=19,method="forward") # everything is the same as before, except for using only train data.

Now we will make predictions on the observations not used for training. We know there are 19 models, so we set up some vectors to record the errors. We have to do a bit of work here, because there is no predict method for regsubsets.

val.errors=rep(NA,19) # we'll set up a vector having 19 slots.
x.test=model.matrix(Salary~.,data=Hitters[-train,])# notice the -index! (excluding the training data); we'll make an x matrix corresponding to our validation data set.
for(i in 1:19){
  coefi=coef(regfit.fwd,id=i)
  pred=x.test[,names(coefi)]%*%coefi
  val.errors[i]=mean((Hitters$Salary[-train]-pred)^2)
}
plot(sqrt(val.errors),ylab="Root MSE",ylim=c(300,400),pch=19,type="b")
points(sqrt(regfit.fwd$rss[-1]/180),col="blue",pch=19,type="b") # we removed the first one, which was not included in our validation plot.
legend("topright",legend=c("Training","Validation"),col=c("blue","black"),pch=19)

In validation error, it’s slightly jumpy, which OK, because we only have less than 90 observations in validation set. so, there’s going to be a bit of noise. Anyway, there seems to be a minimum around 5.
As we expect, the training error goes down monotonically as the model gets bigger, but not so for the validation error.

This was a little tedious because we had to write code for doing prediction. Instead, now we’re going to make predictions from a method of regsubsets. So we will write one!

predict.regsubsets=function(object,newdata,id,...) {
  form=as.formula(object$call[[2]])
  mat=model.matrix(form,newdata)
  coefi=coef(object,id=id)
  mat[,names(coefi)]%*%coefi
}

function argument (object, which we want to predict from , id, which is going to be the id of the model) regsubset objects got a component called a call. that’s the call that was used to create it. and part of the call is a formula. so, we can extract that formula through this little code line. and then using formula, we make a model matrix from it. much like we did before. then we extract the coefficient and do the matrix multiply.

Now that function is defined, we can use cross validation.

Model Selection by Cross-Validation

We will do 10-fold cross-validation. Its really easy!

set.seed(11)
folds=sample(rep(1:10,length=nrow(Hitters)))
folds # There's the random assignment of folds to each of the observations in hitters.

##   [1]  3  1  4  4  7  7  3  5  5  2  5  2  8  3  3  3  9  2  9  8 10  5  8
##  [24]  5  5  5  5 10 10  4  4  7  6  7  7  7  3  4  8  3  6  8 10  4  3  9
##  [47]  9  3  4  9  8  7 10  6 10  3  6  9  4  2  8  2  5  6 10  7  2  8  8
##  [70]  1  3  6  2  5  8  1  1  2  8  1 10  1  2  3  6  6  5  8  8 10  4  2
##  [93]  6  1  7  4  8  3  7  8  7  1 10  1  6  2  9 10  1  7  7  4  7  4 10
## [116]  3  6 10  6  6  9  8 10  6  7  9  6  7  1 10  2  2  5  9  9  6  1  1
## [139]  2  9  4 10  5  3  7  7 10 10  9  3  3  7  3  1  4  6  6 10  4  9  9
## [162]  1  3  6  8 10  8  5  4  5  6  2  9 10  3  7  7  6  6  2  3  2  4  4
## [185]  4  4  8  2  3  5  9  9 10  2  1  3  9  6  7  3  1  9  4 10 10  8  8
## [208]  8  2  5  9  8 10  5  8  2  4  1  4  4  5  5  2  1  9  5  2  9  9  5
## [231]  3  2  1  9  1  7  2  5  8  1  1  7  6  6  4  5 10  5  7  4  8  6  9
## [254]  1  2  5  7  1  3  1  3  1  2

table(folds) # If we tabulate that, it's pretty balanced.

## folds
##  1  2  3  4  5  6  7  8  9 10 
## 27 27 27 26 26 26 26 26 26 26

Now, we make a matrix for our error and it’s going to have 10 rows and 19 columns because there’re 19 variables. So, there’re going to be 19 subsets and 10 rows for each of the 10 folds.

cv.errors=matrix(NA,10,19)
for(k in 1:10){
  best.fit=regsubsets(Salary~.,data=Hitters[folds!=k,],nvmax=19,method="forward")
  for(i in 1:19){
    pred=predict(best.fit,Hitters[folds==k,],id=i)
    cv.errors[k,i] = mean((Hitters$Salary[folds==k]-pred)^2) # Quality Metrix : RSS
  }
}
rmse.cv=sqrt(apply(cv.errors,2,mean))
plot(rmse.cv,pch=19,type="b")

We processed our output matrix by using Apply funciton to the columns. So, we take the column means becuase they were 10 rows and each row was the mean squared error for a particular fold. But, we wnat to average those.

Now, we can make a plot of that. there’s our 10-fold cross validation curve. It’s not quite as jumphy as the validation curve because this is averaged over the full training set. The errors are computed over the full training set. So, it’s a little bit smoother. As a result, this seems to favor models of size 11 or 12.

Ridge Regression and the Lasso

We will use the package glmnet, which does not use the model formula language, so we will set up an x and y.

library(glmnet) # it doesn't use a formula language

## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-2

x=model.matrix(Salary~.-1,data=Hitters) # predictor matrix
y=Hitters$Salary # response

First we will fit a ridge-regression model. This is achieved by calling glmnet with alpha=0 (see the helpfile). (alpha=1 is lasso, alpha=0 is ridge, alpha between 0 and 1 is elastic ent models, which are in between ridge and lasso) There is also a cv.glmnet function which will do the cross-validation for us.

fit.ridge=glmnet(x,y,alpha=0) 
plot(fit.ridge,xvar="lambda",label=TRUE)

Let’s use the plot method for a glmnet object. It makes a plot as a function of log of lambda and what it’s plotting are the coefficients. Ridge regression is penalized by the sum of squares of the coefficients. So, the penalties put on the sum of squares of the coeiifients and that’s controlled by parameter lambda. The criterion for ridge regression is RSS+\(\lambda \sum_{j=1}^p \beta_j^2\). That is, RSS is been modified by a penality on the sum of squares of the coefficients. If lambda is big, the sum of squares of the coefficients to be small so that will shrink the coefficients towards zero.

What glmnet does is it actually develops a whole part of models on a grid of values of labmda quite a fine grid about 100 values of lambda. When Log Lambda is 12, all the coefficients are essentially zero. and then, as we relax lambda, the coefficients grow away from zero in a nice smooth way and the sum of squares of the coefficients is getting bigger and bigger until we reach a point where lambda is effectively zero and the coefficients are unregularized.

Consequently, unlike subset and forward stepwise regression, which controls the complexity of a model by restricting the number of variables, Ridge regression keeps all the variables and shrinks the coefficients toward zero.

Since Ridge regression gives a whole path, we need to pick a value along the path. and glmnet is got a built-in function called cv.glmnet that’ll do k-fold cross validation.

cv.ridge=cv.glmnet(x,y,alpha=0) # by default 10-fold
plot(cv.ridge)

There’s two vertical lines. The one is at the minimum and the other vertical line is at one standard error of the minimum within one standard error.

Now we fit a lasso model; for this we use the default alpha=1 Lasso is similar to ridge regression, but the only thing that is different is the penalty. like that instead of the sum of squares of the coefficients, we penalize the absolute values of the coefficients : RSS+\(\lambda \sum_{j=1}^p |\beta_j|\).It’s also controlling the size of the coefficients and it seems like absolute value and sum of squares would be rather similar, but the lasso is got a somewhat magical quality by penalizing the absolute values, that is actually going to restrict some of the coefficients to be exactly zero.

fit.lasso=glmnet(x,y) # the default for alpha is 1
plot(fit.lasso,xvar="lambda",label=TRUE)

Again, it’s fits the whole path. The coefficients jump in, having been zero for a length of the path. At the top of the plot, we can see how many non-zero variables are in the model. Lasso is doing shrinkage and variable selection.

plot(fit.lasso,xvar="dev",label=TRUE)

It’s deviance. What it means is the percentage of deviance explained, or percentage of in case of regression, your sum or squares explained, is the r squared.
If you make that plot, it’s changed the orientation. At the bottm, Fraction Deviance Explained is like the r squared. What you find here is that a lot of the r squared was explained for quite heavily shrunk coefficients. Towards the end, with a relatively small increase in r squared from between 0.4 and 0.5, coefficients grow very large. So, this may be an indication that the end of the path is overfitting.

Use cross validation again.

cv.lasso=cv.glmnet(x,y)
plot(cv.lasso)

It’s telling us that the base model, the minimum cross validation areas at full size 15. and within one standard error, we have a model of size 6, and it’s somewhat flat in between. Consequently, we may make our choice based on this.

There’s a coefficient function extractor that works on a cross validation object. Since fit.lasso will have the whole path of coefficients, it’s got roughly 100 coefficient vectors dependent on each value indexed by different values of lambda. However, if we extract the coefficient from the cv object, it’ll pick the coefficient vector corresponding to the best model, which was the model of size. In this case, cv pick the model size of 5, which within one standard error of the minimum.

coef(cv.lasso)

## 21 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) 127.95694754
## AtBat         .         
## Hits          1.42342566
## HmRun         .         
## Runs          .         
## RBI           .         
## Walks         1.58214111
## Years         .         
## CAtBat        .         
## CHits         .         
## CHmRun        .         
## CRuns         0.16027975
## CRBI          0.33667715
## CWalks        .         
## LeagueA       .         
## LeagueN       .         
## DivisionW    -8.06171262
## PutOuts       0.08393604
## Assists       .         
## Errors        .         
## NewLeagueN    .

Suppose we want to use our earlier train/validation division to select the lambda for the lasso. This is easy to do.

lasso.tr=glmnet(x[train,],y[train])
lasso.tr # Summary of a glmnet fit

## 
## Call:  glmnet(x = x[train, ], y = y[train]) 
## 
##       Df    %Dev    Lambda
##  [1,]  0 0.00000 246.40000
##  [2,]  1 0.05013 224.50000
##  [3,]  1 0.09175 204.60000
##  [4,]  2 0.13840 186.40000
##  [5,]  2 0.18000 169.80000
##  [6,]  3 0.21570 154.80000
##  [7,]  3 0.24710 141.00000
##  [8,]  3 0.27320 128.50000
##  [9,]  4 0.30010 117.10000
## [10,]  4 0.32360 106.70000
## [11,]  4 0.34310  97.19000
## [12,]  4 0.35920  88.56000
## [13,]  5 0.37360  80.69000
## [14,]  5 0.38900  73.52000
## [15,]  5 0.40190  66.99000
## [16,]  5 0.41260  61.04000
## [17,]  5 0.42140  55.62000
## [18,]  5 0.42880  50.67000
## [19,]  5 0.43490  46.17000
## [20,]  5 0.43990  42.07000
## [21,]  5 0.44410  38.33000
## [22,]  5 0.44760  34.93000
## [23,]  6 0.45140  31.83000
## [24,]  7 0.45480  29.00000
## [25,]  7 0.45770  26.42000
## [26,]  7 0.46010  24.07000
## [27,]  8 0.46220  21.94000
## [28,]  8 0.46380  19.99000
## [29,]  8 0.46520  18.21000
## [30,]  8 0.46630  16.59000
## [31,]  8 0.46730  15.12000
## [32,]  8 0.46810  13.78000
## [33,]  9 0.47110  12.55000
## [34,]  9 0.47380  11.44000
## [35,]  9 0.47620  10.42000
## [36,] 10 0.48050   9.49500
## [37,]  9 0.48450   8.65200
## [38,] 10 0.48770   7.88300
## [39,] 10 0.49360   7.18300
## [40,] 11 0.49890   6.54500
## [41,] 12 0.50450   5.96300
## [42,] 12 0.51010   5.43400
## [43,] 13 0.51470   4.95100
## [44,] 13 0.51850   4.51100
## [45,] 13 0.52170   4.11000
## [46,] 14 0.52440   3.74500
## [47,] 14 0.52670   3.41200
## [48,] 15 0.52870   3.10900
## [49,] 15 0.53030   2.83300
## [50,] 15 0.53160   2.58100
## [51,] 16 0.53280   2.35200
## [52,] 17 0.53420   2.14300
## [53,] 18 0.53580   1.95300
## [54,] 18 0.53760   1.77900
## [55,] 18 0.53890   1.62100
## [56,] 18 0.54000   1.47700
## [57,] 18 0.54090   1.34600
## [58,] 18 0.54160   1.22600
## [59,] 18 0.54220   1.11700
## [60,] 18 0.54280   1.01800
## [61,] 18 0.54320   0.92770
## [62,] 18 0.54360   0.84530
## [63,] 18 0.54380   0.77020
## [64,] 19 0.54410   0.70180
## [65,] 19 0.54430   0.63940
## [66,] 19 0.54450   0.58260
## [67,] 19 0.54470   0.53090
## [68,] 19 0.54490   0.48370
## [69,] 20 0.54510   0.44070
## [70,] 20 0.54520   0.40160
## [71,] 20 0.54530   0.36590
## [72,] 20 0.54540   0.33340
## [73,] 20 0.54550   0.30380
## [74,] 20 0.54560   0.27680
## [75,] 20 0.54570   0.25220
## [76,] 20 0.54570   0.22980
## [77,] 20 0.54580   0.20940
## [78,] 20 0.54580   0.19080
## [79,] 20 0.54590   0.17380
## [80,] 20 0.54590   0.15840
## [81,] 20 0.54590   0.14430
## [82,] 20 0.54590   0.13150
## [83,] 20 0.54600   0.11980
## [84,] 19 0.54600   0.10920
## [85,] 19 0.54600   0.09948
## [86,] 19 0.54600   0.09064
## [87,] 19 0.54600   0.08259
## [88,] 20 0.54600   0.07525
## [89,] 20 0.54600   0.06856

It tell us the summary of a glmnet fit. For each of the models in the path, it gives you the degrees of freedom which is the number of non-zero coefficients and the percentage deviance expalined, which is like r squared for generalized linear models and the lambda value that corresponds to that fit, actually tries to fit 100 values of lambda, but if it reaches a point where nothing much is changing, it stops. As lambda is getting smaller here, the model is really not changing, that’s pretty much at the ordinary least squares fit of the unpenalized model.

pred=predict(lasso.tr,x[-train,])
dim(pred)

## [1] 83 89

There’s 83 observations in the validation set and we’ve got 89 values of lambda, so there’s going to be 89 different columns in this prediction matrix.

rmse= sqrt(apply((y[-train]-pred)^2,2,mean)) # column-wise computation
plot(log(lasso.tr$lambda),rmse,type="b",xlab="Log(lambda)")

Overfitting - switch spot (Log(lambda)=3.5) - Underfitting

lam.best=lasso.tr$lambda[order(rmse)[1]] # in ascending order, first is the smallest value
lam.best

## [1] 19.98706

coef(lasso.tr,s=lam.best)

## 21 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept)  107.9416686
## AtBat          .        
## Hits           0.1591252
## HmRun          .        
## Runs           .        
## RBI            1.7340039
## Walks          3.4657091
## Years          .        
## CAtBat         .        
## CHits          .        
## CHmRun         .        
## CRuns          0.5386855
## CRBI           .        
## CWalks         .        
## LeagueA      -30.0493021
## LeagueN        .        
## DivisionW   -113.8317016
## PutOuts        0.2915409
## Assists        .        
## Errors         .        
## NewLeagueN     2.0367518

This is printed out in what’s known as sparse matrix format, which means only the non-missing values are recorded.