Model Selection

Using Statistical Learning`s videos on Model Selection, the following report shows methods for performing Model Selection. All Chunks are printed so code is visible. Some chunks were modified from the lesson for cleaner code and show more selection techniques.

Data Import

library(ISLR)

## Warning: package 'ISLR' was built under R version 3.6.3

summary(Hitters)

##      AtBat            Hits         HmRun            Runs       
##  Min.   : 16.0   Min.   :  1   Min.   : 0.00   Min.   :  0.00  
##  1st Qu.:255.2   1st Qu.: 64   1st Qu.: 4.00   1st Qu.: 30.25  
##  Median :379.5   Median : 96   Median : 8.00   Median : 48.00  
##  Mean   :380.9   Mean   :101   Mean   :10.77   Mean   : 50.91  
##  3rd Qu.:512.0   3rd Qu.:137   3rd Qu.:16.00   3rd Qu.: 69.00  
##  Max.   :687.0   Max.   :238   Max.   :40.00   Max.   :130.00  
##                                                                
##       RBI             Walks            Years            CAtBat       
##  Min.   :  0.00   Min.   :  0.00   Min.   : 1.000   Min.   :   19.0  
##  1st Qu.: 28.00   1st Qu.: 22.00   1st Qu.: 4.000   1st Qu.:  816.8  
##  Median : 44.00   Median : 35.00   Median : 6.000   Median : 1928.0  
##  Mean   : 48.03   Mean   : 38.74   Mean   : 7.444   Mean   : 2648.7  
##  3rd Qu.: 64.75   3rd Qu.: 53.00   3rd Qu.:11.000   3rd Qu.: 3924.2  
##  Max.   :121.00   Max.   :105.00   Max.   :24.000   Max.   :14053.0  
##                                                                      
##      CHits            CHmRun           CRuns             CRBI        
##  Min.   :   4.0   Min.   :  0.00   Min.   :   1.0   Min.   :   0.00  
##  1st Qu.: 209.0   1st Qu.: 14.00   1st Qu.: 100.2   1st Qu.:  88.75  
##  Median : 508.0   Median : 37.50   Median : 247.0   Median : 220.50  
##  Mean   : 717.6   Mean   : 69.49   Mean   : 358.8   Mean   : 330.12  
##  3rd Qu.:1059.2   3rd Qu.: 90.00   3rd Qu.: 526.2   3rd Qu.: 426.25  
##  Max.   :4256.0   Max.   :548.00   Max.   :2165.0   Max.   :1659.00  
##                                                                      
##      CWalks        League  Division    PutOuts          Assists     
##  Min.   :   0.00   A:175   E:157    Min.   :   0.0   Min.   :  0.0  
##  1st Qu.:  67.25   N:147   W:165    1st Qu.: 109.2   1st Qu.:  7.0  
##  Median : 170.50                    Median : 212.0   Median : 39.5  
##  Mean   : 260.24                    Mean   : 288.9   Mean   :106.9  
##  3rd Qu.: 339.25                    3rd Qu.: 325.0   3rd Qu.:166.0  
##  Max.   :1566.00                    Max.   :1378.0   Max.   :492.0  
##                                                                     
##      Errors          Salary       NewLeague
##  Min.   : 0.00   Min.   :  67.5   A:176    
##  1st Qu.: 3.00   1st Qu.: 190.0   N:146    
##  Median : 6.00   Median : 425.0            
##  Mean   : 8.04   Mean   : 535.9            
##  3rd Qu.:11.00   3rd Qu.: 750.0            
##  Max.   :32.00   Max.   :2460.0            
##                  NA's   :59

Salary (which we are using as our predict variable) has missing values in the dataset (59). The easiest method is to remove them, however, it will truncate the dataset by 59 lines.

## [1] 0

Best Subset Regression

The package leaps is used to evaluate all of the best-subset models

library(leaps)

## Warning: package 'leaps' was built under R version 3.6.3

regfit.full=regsubsets(Salary~.,data=Hitters)
summary(regfit.full)

## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters)
## 19 Variables  (and intercept)
##            Forced in Forced out
## AtBat          FALSE      FALSE
## Hits           FALSE      FALSE
## HmRun          FALSE      FALSE
## Runs           FALSE      FALSE
## RBI            FALSE      FALSE
## Walks          FALSE      FALSE
## Years          FALSE      FALSE
## CAtBat         FALSE      FALSE
## CHits          FALSE      FALSE
## CHmRun         FALSE      FALSE
## CRuns          FALSE      FALSE
## CRBI           FALSE      FALSE
## CWalks         FALSE      FALSE
## LeagueN        FALSE      FALSE
## DivisionW      FALSE      FALSE
## PutOuts        FALSE      FALSE
## Assists        FALSE      FALSE
## Errors         FALSE      FALSE
## NewLeagueN     FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## 1  ( 1 ) " "   " "  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 2  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 3  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 4  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 5  ( 1 ) "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 6  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "  
## 7  ( 1 ) " "   "*"  " "   " "  " " "*"   " "   "*"    "*"   "*"    " "  
## 8  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   "*"    "*"  
##          CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1  ( 1 ) "*"  " "    " "     " "       " "     " "     " "    " "       
## 2  ( 1 ) "*"  " "    " "     " "       " "     " "     " "    " "       
## 3  ( 1 ) "*"  " "    " "     " "       "*"     " "     " "    " "       
## 4  ( 1 ) "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 5  ( 1 ) "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 6  ( 1 ) "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 7  ( 1 ) " "  " "    " "     "*"       "*"     " "     " "    " "       
## 8  ( 1 ) " "  "*"    " "     "*"       "*"     " "     " "    " "

leaps defaults to a size of 8, however, in this dataset we have 19 variables.

regfit.full =  regsubsets(Salary~.,data = Hitters, nvmax = 19)
reg.summary = summary(regfit.full)

# Showing the names listed in the summary
#names(reg.summary)

cp is an estimate of prediction error. We want to plot the cp value and show the lowest number of variables. The lowest cp score indicates the best model.

min.cp = which.min(reg.summary$cp) #minimum or lowest number
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp")
points(10,reg.summary$cp[min.cp], pch=20,col="red") #add highlight point for lowest number

The smallest number of variables is 10. We can plot the the regfit model in an other way to demonstrate which variables are the best on all cp points.

plot(regfit.full, scale = "Cp")

Black squares the variable is in while the white squares indicate the variable is out. It appears bad cp models have all the variables and almost none of the variables. There is also indication some variables are not useful at all, such as Years, NewLeagueN, CHmRun. These variables are almost always removed from the variable list as the cp goes down toward the better models.

Since we chose model 10. We can run the coefficients of our model.

coef(regfit.full,10)

##  (Intercept)        AtBat         Hits        Walks       CAtBat 
##  162.5354420   -2.1686501    6.9180175    5.7732246   -0.1300798 
##        CRuns         CRBI       CWalks    DivisionW      PutOuts 
##    1.4082490    0.7743122   -0.8308264 -112.3800575    0.2973726 
##      Assists 
##    0.2831680

Forward Stepwise Selection

Using the same regsubsets function we can use the Forward methods and compare the results. While Best subset is a very aggressive, Forward stepwise is greedy as it includes the previous model as it grows model over model by adding the next variable which improves the fit the most., resulting in a nested less adventitious model.

regfit.fwd=regsubsets(Salary~., data = Hitters, nvmax = 19, method = "forward")
summary(regfit.fwd)

## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
## 19 Variables  (and intercept)
##            Forced in Forced out
## AtBat          FALSE      FALSE
## Hits           FALSE      FALSE
## HmRun          FALSE      FALSE
## Runs           FALSE      FALSE
## RBI            FALSE      FALSE
## Walks          FALSE      FALSE
## Years          FALSE      FALSE
## CAtBat         FALSE      FALSE
## CHits          FALSE      FALSE
## CHmRun         FALSE      FALSE
## CRuns          FALSE      FALSE
## CRBI           FALSE      FALSE
## CWalks         FALSE      FALSE
## LeagueN        FALSE      FALSE
## DivisionW      FALSE      FALSE
## PutOuts        FALSE      FALSE
## Assists        FALSE      FALSE
## Errors         FALSE      FALSE
## NewLeagueN     FALSE      FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: forward
##           AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## 1  ( 1 )  " "   " "  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 2  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 3  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 4  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 5  ( 1 )  "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 6  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "  
## 7  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "  
## 8  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    "*"  
## 9  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"  
## 10  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"  
## 11  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"  
## 12  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"  
## 13  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"  
## 14  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    " "   " "    "*"  
## 15  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    "*"   " "    "*"  
## 16  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"  
## 17  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"  
## 18  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   " "    "*"  
## 19  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   "*"    "*"  
##           CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1  ( 1 )  "*"  " "    " "     " "       " "     " "     " "    " "       
## 2  ( 1 )  "*"  " "    " "     " "       " "     " "     " "    " "       
## 3  ( 1 )  "*"  " "    " "     " "       "*"     " "     " "    " "       
## 4  ( 1 )  "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 5  ( 1 )  "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 6  ( 1 )  "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 7  ( 1 )  "*"  "*"    " "     "*"       "*"     " "     " "    " "       
## 8  ( 1 )  "*"  "*"    " "     "*"       "*"     " "     " "    " "       
## 9  ( 1 )  "*"  "*"    " "     "*"       "*"     " "     " "    " "       
## 10  ( 1 ) "*"  "*"    " "     "*"       "*"     "*"     " "    " "       
## 11  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     " "    " "       
## 12  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     " "    " "       
## 13  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 14  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 15  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 16  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 17  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    "*"       
## 18  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    "*"       
## 19  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    "*"

plot(regfit.fwd, scale="Cp")

The model plot looks very similar to the Best Subset model, very few changes.

regfit.summary = summary(regfit.fwd)
minfit.cp = which.min(regfit.summary$cp)
plot(regfit.summary$cp, xlab = "Number of Variables", ylab = "Cp")
points(10,regfit.summary$cp[minfit.cp], pch=20,col="red")

The best Cp is still 10.

Next, we need to break the data up into training and validation sets. Once we have a model we can validate and chose which one we like.

#dim(Hitters)
set.seed(1)
train=sample(seq(263),180,replace = FALSE) #creating sample numbers
#train
regfit.fwd=regsubsets(Salary~., data = Hitters[train,],nvmax = 19, method = "forward") #rerunning forward step with train set

val.errors=rep(NA,19)
x.test=model.matrix(Salary~., data = Hitters[-train,]) #model Matrix without the train set

for(i in 1:19){           #loop for predictions since the Regsubsets has no predict command
  coefi=coef(regfit.fwd,id=i)
  pred=x.test[,names(coefi)]%*%coefi
  val.errors[i]=mean((Hitters$Salary[-train]-pred)^2)  
}

plot(sqrt(val.errors),ylab = "Root MSE", ylim = c(300,400), pch=19, type = "b")
points(sqrt(regfit.fwd$rss[-1]/180),col="blue",pch=19, type = "b")
legend("topright",legend = c("Training","Validation"),col = c("blue", "black"), pch=19)

Since there is a call to use this predict again, a function is created.

predict.regsubsets=function(object, newdata, id,...){
  form=as.formula(object$call[[2]])
  mat=model.matrix(form,newdata)
  coefi=coef(object, id=id)
  mat[,names(coefi)]%*%coefi
}

Cross-Validation

For the Cross Validation test, we will chose a 10-fold method. The “folds” have a random selection of 1-10 for the length of the Hitter dataframe. When complete, the sample will contain 10 separate sample sections of the dataframe.

set.seed(11)
folds=sample(rep(1:10, length=nrow(Hitters))) #creating the sample folds
#folds
table(folds)

## folds
##  1  2  3  4  5  6  7  8  9 10 
## 27 27 27 26 26 26 26 26 26 26

cv.errors=matrix(NA, 10,19) #19 varibles, 10 rows for the folds filled with NA

Double loop created for the validation

for(k in 1:10){
  best.fit=regsubsets(Salary~.,data=Hitters[folds!=k,],nvmax = 19,method="forward") #train on all the varibles NOT in the k'th fold
  for(i in 1:19){
    pred=predict(best.fit, Hitters[folds==k,], id=i) #each of the subsets left out from the first loop using the predict function
    cv.errors[k,i]=mean( (Hitters$Salary[folds==k]-pred)^2)
  }
}

rmse.cv=sqrt(apply(cv.errors,2,mean)) #mean of the columns giving a root mean
plot(rmse.cv,pch=19,type="b")

This seems to prefer models around 9. Remember, we are looking for the lowest point. Comparing to the other models, 10 is still very low in the estimation.

Ridge Regression and Lasso

Ridge-regression model is achieved by using a new package glmnet, which also contains a cross validation function cv.glmnet. Keep in mind there are other packages for Ridge and Lasso.

library(glmnet)

## Warning: package 'glmnet' was built under R version 3.6.3

## Loading required package: Matrix

## Loaded glmnet 4.0-2

x=model.matrix(Salary~.,-1,data = Hitters) #glmnet does not have a form language model.  x,y must be created seperately

## Warning in model.matrix.default(Salary ~ ., -1, data = Hitters): non-list
## contrasts argument ignored

y=Hitters$Salary

alpha = 0 is glmnet’s call for Ridge. alpha = 1 is the call for lasso. Elastic net models are any alpha between 0 & 1. Those models fall between the Ridge and Lasso methods.

First, we create and fit the ridge regression model.

fit.ridge=glmnet(x,y,alpha=0)
plot(fit.ridge,xvar="lambda", label=TRUE)

The plot is the log of Lambda. The model is penalized by the Sum of Squares of the coefficients controlled by lambda.

If Lambda is large the square of coefficients should be small and shrink toward 0. As Lambda is very big the coefficients will be 0. The model looks over a value of around 100 lambda. As lambda is relaxed, the coefficients grow. When Lambda is 0 the coefficients are unregulated (least squares fit model). Ridge keeps all the variables and shrinks the coefficients toward zero.

Using Cross validation (cv.glmnet) to pick a whole value across the path of the coefficient rise.

cv.ridge=cv.glmnet(x,y,alpha=0)
plot(cv.ridge)

The plot of the cross validated MSE (mean squared error). The full model is doing a good job. The verticals lines represent the minimum and 1 SE of the minimum.

Now, we fit the lasso model. The default alpha is 1 for the lasso model. The Lasso model minimizes the residual sum of squares plus lambda penalized by the absolute values of the coefficients. By generating the absolute values of the coefficients it restricts some of the values to be exactly zero.

fit.lasso=glmnet(x,y)
plot(fit.lasso,xvar="lambda", label=TRUE)

Lasso is doing shrinkage AND variable selection as seen on the top of the chart. As Lambda increases, the amount of variables decreases.

If we change from Lambda to dev. We plot the change in deviance explained, in the case of regression the sum of squares explained or technically \(r^2\).

plot(fit.lasso,xvar="dev", label=TRUE)

A lot of \(r^2\) is explained for quite heavily shrunk coefficients. Towards the end with a relativity small increase in \(r^2\) and explosion in coefficients. Could be an indication the end of the path is over-fitted.

If we run the cross validation again, we see a large change from Ridgewise.

cv.lasso=cv.glmnet(x,y)
plot(cv.lasso)

Its telling us the minimum cross validation area is at full size 13. Within one SE, it drops to between 5-6.

If we pull the coefficient models from Lasso, it will pull the values from the best model.

coef(cv.lasso)

## 21 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) 144.37970485
## (Intercept)   .         
## AtBat         .         
## Hits          1.36380384
## HmRun         .         
## Runs          .         
## RBI           .         
## Walks         1.49731098
## Years         .         
## CAtBat        .         
## CHits         .         
## CHmRun        .         
## CRuns         0.15275165
## CRBI          0.32833941
## CWalks        .         
## LeagueN       .         
## DivisionW     .         
## PutOuts       0.06625755
## Assists       .         
## Errors        .         
## NewLeagueN    .

This pulled a model size of 5 which is 1 SE from minimum.

We can now use the earlier train/validation sets to select the lambda for lasso.

lasso.tr=glmnet(x[train,],y[train])
#lasso.tr
pred=predict(lasso.tr,x[-train,])
#dim(pred)

rmse=sqrt(apply((y[-train]-pred)^2,2,mean))
plot(log(lasso.tr$lambda),rmse,type="b",xlab="Log(lambda)")

Extract the Best Lambda.

lam.best=lasso.tr$lambda[order(rmse)[1]]
lam.best

## [1] 1.571184

Coefficients of the best lambda identified.

coef(lasso.tr,s=lam.best)

## 21 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept)  193.01429291
## (Intercept)    .         
## AtBat         -1.14830040
## Hits           4.92901730
## HmRun          .         
## Runs          -3.51936866
## RBI            0.38309009
## Walks          6.01828596
## Years        -20.74174043
## CAtBat        -0.01903175
## CHits          0.08077424
## CHmRun         0.53493799
## CRuns          0.77272746
## CRBI           0.49203970
## CWalks        -0.47458673
## LeagueN       91.21313136
## DivisionW   -161.10222986
## PutOuts        0.28639231
## Assists        0.29245560
## Errors        -4.69237401
## NewLeagueN   -56.95409378

Sources

Pulled from the following: https://www.youtube.com/watch?v=3kwdDGnV8MM&list=PL5-da3qGB5IB-Xdpj_uXJpLGiRfv9UVXI&index=11 https://www.youtube.com/watch?v=mv-vdysZIb4&list=PL5-da3qGB5IB-Xdpj_uXJpLGiRfv9UVXI&index=12 https://www.youtube.com/watch?v=mv-vdysZIb4&list=PL5-da3qGB5IB-Xdpj_uXJpLGiRfv9UVXI&index=13 https://www.youtube.com/watch?v=1REe3qSotx8&list=PL5-da3qGB5IB-Xdpj_uXJpLGiRfv9UVXI&index=14