Following the Statistical Learning course videos on Model Selection, this report demonstrates methods for performing model selection. All chunks are printed so the code is visible. Some chunks were modified from the lesson for cleaner code and to show more selection techniques.
library(ISLR)
## Warning: package 'ISLR' was built under R version 3.6.3
summary(Hitters)
## AtBat Hits HmRun Runs
## Min. : 16.0 Min. : 1 Min. : 0.00 Min. : 0.00
## 1st Qu.:255.2 1st Qu.: 64 1st Qu.: 4.00 1st Qu.: 30.25
## Median :379.5 Median : 96 Median : 8.00 Median : 48.00
## Mean :380.9 Mean :101 Mean :10.77 Mean : 50.91
## 3rd Qu.:512.0 3rd Qu.:137 3rd Qu.:16.00 3rd Qu.: 69.00
## Max. :687.0 Max. :238 Max. :40.00 Max. :130.00
##
## RBI Walks Years CAtBat
## Min. : 0.00 Min. : 0.00 Min. : 1.000 Min. : 19.0
## 1st Qu.: 28.00 1st Qu.: 22.00 1st Qu.: 4.000 1st Qu.: 816.8
## Median : 44.00 Median : 35.00 Median : 6.000 Median : 1928.0
## Mean : 48.03 Mean : 38.74 Mean : 7.444 Mean : 2648.7
## 3rd Qu.: 64.75 3rd Qu.: 53.00 3rd Qu.:11.000 3rd Qu.: 3924.2
## Max. :121.00 Max. :105.00 Max. :24.000 Max. :14053.0
##
## CHits CHmRun CRuns CRBI
## Min. : 4.0 Min. : 0.00 Min. : 1.0 Min. : 0.00
## 1st Qu.: 209.0 1st Qu.: 14.00 1st Qu.: 100.2 1st Qu.: 88.75
## Median : 508.0 Median : 37.50 Median : 247.0 Median : 220.50
## Mean : 717.6 Mean : 69.49 Mean : 358.8 Mean : 330.12
## 3rd Qu.:1059.2 3rd Qu.: 90.00 3rd Qu.: 526.2 3rd Qu.: 426.25
## Max. :4256.0 Max. :548.00 Max. :2165.0 Max. :1659.00
##
## CWalks League Division PutOuts Assists
## Min. : 0.00 A:175 E:157 Min. : 0.0 Min. : 0.0
## 1st Qu.: 67.25 N:147 W:165 1st Qu.: 109.2 1st Qu.: 7.0
## Median : 170.50 Median : 212.0 Median : 39.5
## Mean : 260.24 Mean : 288.9 Mean :106.9
## 3rd Qu.: 339.25 3rd Qu.: 325.0 3rd Qu.:166.0
## Max. :1566.00 Max. :1378.0 Max. :492.0
##
## Errors Salary NewLeague
## Min. : 0.00 Min. : 67.5 A:176
## 1st Qu.: 3.00 1st Qu.: 190.0 N:146
## Median : 6.00 Median : 425.0
## Mean : 8.04 Mean : 535.9
## 3rd Qu.:11.00 3rd Qu.: 750.0
## Max. :32.00 Max. :2460.0
## NA's :59
Salary (which we are using as our response variable) has 59 missing values in the dataset. The easiest method is to remove them, although it truncates the dataset by 59 rows.
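The chunk that drops them is not echoed in the knitted output; a minimal version, assuming the standard na.omit step from the lesson, is:
Hitters=na.omit(Hitters) #drop the 59 rows with missing Salary
sum(is.na(Hitters$Salary)) #confirm that no missing values remain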
## [1] 0
The package leaps is used to evaluate all of the best-subset models.
library(leaps)
## Warning: package 'leaps' was built under R version 3.6.3
regfit.full=regsubsets(Salary~.,data=Hitters)
summary(regfit.full)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters)
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " "
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " "
## 7 ( 1 ) " " "*" " " " " " " "*" " " "*" "*" "*" " "
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " "*" "*"
## CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) "*" " " " " " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " " " " " " " " "
## 3 ( 1 ) "*" " " " " " " "*" " " " " " "
## 4 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 5 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 6 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 7 ( 1 ) " " " " " " "*" "*" " " " " " "
## 8 ( 1 ) " " "*" " " "*" "*" " " " " " "
regsubsets defaults to subsets of size up to 8; since this dataset has 19 variables, we raise nvmax to 19.
regfit.full = regsubsets(Salary~.,data = Hitters, nvmax = 19)
reg.summary = summary(regfit.full)
# Showing the names listed in the summary
#names(reg.summary)
Cp is an estimate of prediction error. We plot Cp against the number of variables and highlight its minimum; the model with the lowest Cp is the preferred one.
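For a model with \(d\) predictors, ISLR defines Mallows' Cp as
\[C_p=\frac{1}{n}\left(\mathrm{RSS}+2d\hat{\sigma}^2\right)\]
where \(\hat{\sigma}^2\) estimates the error variance (typically from the full model); the \(2d\hat{\sigma}^2\) term penalizes larger models.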
min.cp = which.min(reg.summary$cp) #index of the minimum (lowest) Cp
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp")
points(min.cp,reg.summary$cp[min.cp], pch=20,col="red") #highlight the minimum-Cp model
The minimum occurs at 10 variables. We can also plot the regfit model another way to show which variables are in the best model at each Cp value.
plot(regfit.full, scale = "Cp")
Black squares indicate a variable is in the model while white squares indicate it is out. The models with poor Cp appear to include either nearly all of the variables or almost none of them. There is also an indication that some variables are not useful at all, such as Years, NewLeagueN, and CHmRun; these are almost always removed as Cp decreases toward the better models.
Since we chose the 10-variable model, we can extract its coefficients.
coef(regfit.full,10)
## (Intercept) AtBat Hits Walks CAtBat
## 162.5354420 -2.1686501 6.9180175 5.7732246 -0.1300798
## CRuns CRBI CWalks DivisionW PutOuts
## 1.4082490 0.7743122 -0.8308264 -112.3800575 0.2973726
## Assists
## 0.2831680
Using the same regsubsets function, we can use the forward method and compare the results. While best subset is a very aggressive search over every candidate model, forward stepwise is greedy: it keeps the previous model and grows it step by step, adding the one variable that improves the fit the most, resulting in a nested, less adventurous sequence of models.
regfit.fwd=regsubsets(Salary~., data = Hitters, nvmax = 19, method = "forward")
summary(regfit.fwd)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: forward
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " "
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " "
## 7 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " "
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*"
## 9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*"
## 10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*"
## 11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*"
## 12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*"
## 13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*"
## 14 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" " " " " "*"
## 15 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" " " "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) "*" " " " " " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " " " " " " " " "
## 3 ( 1 ) "*" " " " " " " "*" " " " " " "
## 4 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 5 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 6 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 7 ( 1 ) "*" "*" " " "*" "*" " " " " " "
## 8 ( 1 ) "*" "*" " " "*" "*" " " " " " "
## 9 ( 1 ) "*" "*" " " "*" "*" " " " " " "
## 10 ( 1 ) "*" "*" " " "*" "*" "*" " " " "
## 11 ( 1 ) "*" "*" "*" "*" "*" "*" " " " "
## 12 ( 1 ) "*" "*" "*" "*" "*" "*" " " " "
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
plot(regfit.fwd, scale="Cp")
The model plot looks very similar to the best-subset plot, with very few changes.
regfit.summary = summary(regfit.fwd)
minfit.cp = which.min(regfit.summary$cp)
plot(regfit.summary$cp, xlab = "Number of Variables", ylab = "Cp")
points(minfit.cp,regfit.summary$cp[minfit.cp], pch=20,col="red")
The minimum Cp still occurs at 10 variables.
Next, we need to break the data up into training and validation sets. Once we have the candidate models, we can validate them and choose the one we like.
#dim(Hitters)
set.seed(1)
train=sample(seq(263),180,replace = FALSE) #creating sample numbers
#train
regfit.fwd=regsubsets(Salary~., data = Hitters[train,],nvmax = 19, method = "forward") #rerunning forward step with train set
val.errors=rep(NA,19)
x.test=model.matrix(Salary~., data = Hitters[-train,]) #model Matrix without the train set
for(i in 1:19){ #loop for predictions since regsubsets has no built-in predict method
coefi=coef(regfit.fwd,id=i)
pred=x.test[,names(coefi)]%*%coefi
val.errors[i]=mean((Hitters$Salary[-train]-pred)^2)
}
plot(sqrt(val.errors),ylab = "Root MSE", ylim = c(300,400), pch=19, type = "b")
points(sqrt(regfit.fwd$rss[-1]/180),col="blue",pch=19, type = "b")
legend("topright",legend = c("Training","Validation"),col = c("blue", "black"), pch=19)
Since this prediction step will be called for again, a function is created.
predict.regsubsets=function(object, newdata, id,...){
  form=as.formula(object$call[[2]]) #recover the formula from the original regsubsets call
  mat=model.matrix(form,newdata) #build the design matrix for the new data
  coefi=coef(object, id=id) #coefficients of the id-variable model
  mat[,names(coefi)]%*%coefi #predictions = X %*% beta
}
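Because R's S3 dispatch now finds this method for any regsubsets object, we can call predict directly. For example (a hypothetical spot-check, not part of the lesson):
pred10=predict(regfit.fwd,Hitters[-train,],id=10) #validation-set predictions from the 10-variable model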
For the cross-validation test, we choose a 10-fold method. Each row of the Hitters data frame is randomly assigned a fold number from 1 to 10; when complete, the data are divided into 10 roughly equal sections.
set.seed(11)
folds=sample(rep(1:10, length=nrow(Hitters))) #creating the sample folds
#folds
table(folds)
## folds
## 1 2 3 4 5 6 7 8 9 10
## 27 27 27 26 26 26 26 26 26 26
cv.errors=matrix(NA, 10,19) #10 rows for the folds, 19 columns for the model sizes, filled with NA
A double loop performs the validation:
for(k in 1:10){
  best.fit=regsubsets(Salary~.,data=Hitters[folds!=k,],nvmax = 19,method="forward") #train on all the observations NOT in the k'th fold
  for(i in 1:19){
    pred=predict(best.fit, Hitters[folds==k,], id=i) #predict on the held-out fold, using the predict function defined above
    cv.errors[k,i]=mean( (Hitters$Salary[folds==k]-pred)^2) #test MSE for model size i on fold k
  }
}
rmse.cv=sqrt(apply(cv.errors,2,mean)) #column means across the folds, square-rooted to give RMSE per model size
plot(rmse.cv,pch=19,type="b")
Cross-validation seems to prefer models of around 9 variables; remember, we are looking for the lowest point. Compared with the other selections, 10 variables still sits very low on the curve.
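One way to read the preferred size off programmatically rather than from the plot:
which.min(rmse.cv) #model size with the lowest cross-validated RMSE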
A ridge-regression model is fit using a new package, glmnet, which also contains a cross-validation function, cv.glmnet. Keep in mind there are other packages for ridge and lasso.
library(glmnet)
## Warning: package 'glmnet' was built under R version 3.6.3
## Loading required package: Matrix
## Loaded glmnet 4.0-2
x=model.matrix(Salary~.,data = Hitters) #glmnet has no formula interface; x and y must be created separately
y=Hitters$Salary
alpha = 0 is glmnet's setting for ridge; alpha = 1 (the default) is the setting for the lasso. Elastic net models use any alpha between 0 and 1, and fall between the ridge and lasso methods.
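For illustration (this value is not fit in the lesson), an equal mix of the two penalties would be:
fit.enet=glmnet(x,y,alpha=0.5) #elastic net halfway between ridge and lasso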
First, we create and fit the ridge regression model.
fit.ridge=glmnet(x,y,alpha=0)
plot(fit.ridge,xvar="lambda", label=TRUE)
The plot's x-axis is the log of lambda. The model is penalized by lambda times the sum of squares of the coefficients.
If lambda is large, the squared coefficients must be small, so they shrink toward 0; as lambda becomes very large, the coefficients approach 0 exactly. The path covers around 100 values of lambda. As lambda is relaxed, the coefficients grow; when lambda is 0, the coefficients are unregularized (the least squares fit). Ridge keeps all the variables and shrinks the coefficients toward zero.
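Concretely, ridge regression solves
\[\min_{\beta}\;\sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j\Big)^2+\lambda\sum_{j=1}^{p}\beta_j^2\]
so a larger \(\lambda\) pushes the penalty term, and hence the coefficients, toward zero.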
We use cross-validation (cv.glmnet) to pick a good value of lambda along the path of rising coefficients.
cv.ridge=cv.glmnet(x,y,alpha=0)
plot(cv.ridge)
This plots the cross-validated MSE (mean squared error). The full model is doing a good job. The vertical lines mark the minimum and the point 1 SE from the minimum.
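The lambda values marked by those vertical lines are stored on the cv.glmnet object:
cv.ridge$lambda.min #lambda with the lowest cross-validated MSE
cv.ridge$lambda.1se #largest lambda within 1 SE of the minimum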
Now we fit the lasso model; the default alpha is 1, so no alpha argument is needed. The lasso minimizes the residual sum of squares plus lambda times the sum of the absolute values of the coefficients. Penalizing the absolute values restricts some of the coefficients to be exactly zero.
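That is, the lasso solves
\[\min_{\beta}\;\sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j\Big)^2+\lambda\sum_{j=1}^{p}\lvert\beta_j\rvert\]
and it is the absolute-value penalty that allows coefficients to be set exactly to zero.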
fit.lasso=glmnet(x,y)
plot(fit.lasso,xvar="lambda", label=TRUE)
The lasso is doing shrinkage AND variable selection, as seen along the top axis of the chart: as lambda increases, the number of variables decreases.
If we change the x-axis from lambda to dev, we plot against the fraction of deviance explained, which in the case of regression is the fraction of the sum of squares explained, or technically \(r^2\).
plot(fit.lasso,xvar="dev", label=TRUE)
A lot of \(r^2\) is explained even with quite heavily shrunk coefficients. Toward the end of the path, a relatively small increase in \(r^2\) comes with an explosion in the coefficients, which could be an indication that the end of the path is over-fitted.
If we run the cross-validation again, we see a large change from the ridge results.
cv.lasso=cv.glmnet(x,y)
plot(cv.lasso)
It is telling us the minimum cross-validation error occurs at a model size of 13. Within one SE of the minimum, the size drops to between 5 and 6.
If we pull the coefficients from the cross-validated lasso fit, it returns the values at the selected lambda (by default the 1 SE choice, not the minimum).
coef(cv.lasso)
## 21 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 144.37970485
## (Intercept) .
## AtBat .
## Hits 1.36380384
## HmRun .
## Runs .
## RBI .
## Walks 1.49731098
## Years .
## CAtBat .
## CHits .
## CHmRun .
## CRuns 0.15275165
## CRBI 0.32833941
## CWalks .
## LeagueN .
## DivisionW .
## PutOuts 0.06625755
## Assists .
## Errors .
## NewLeagueN .
This pulled a model of size 5, corresponding to the lambda 1 SE from the minimum.
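To pull the larger minimum-MSE model instead, s can be passed explicitly:
coef(cv.lasso,s="lambda.min") #coefficients at the lambda with the lowest CV error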
We can now use the earlier train/validation sets to select the lambda for lasso.
lasso.tr=glmnet(x[train,],y[train])
#lasso.tr
pred=predict(lasso.tr,x[-train,])
#dim(pred)
rmse=sqrt(apply((y[-train]-pred)^2,2,mean)) #validation RMSE at each lambda in the path
plot(log(lasso.tr$lambda),rmse,type="b",xlab="Log(lambda)")
Extract the Best Lambda.
lam.best=lasso.tr$lambda[order(rmse)[1]]
lam.best
## [1] 1.571184
Coefficients of the best lambda identified.
coef(lasso.tr,s=lam.best)
## 21 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 193.01429291
## (Intercept) .
## AtBat -1.14830040
## Hits 4.92901730
## HmRun .
## Runs -3.51936866
## RBI 0.38309009
## Walks 6.01828596
## Years -20.74174043
## CAtBat -0.01903175
## CHits 0.08077424
## CHmRun 0.53493799
## CRuns 0.77272746
## CRBI 0.49203970
## CWalks -0.47458673
## LeagueN 91.21313136
## DivisionW -161.10222986
## PutOuts 0.28639231
## Assists 0.29245560
## Errors -4.69237401
## NewLeagueN -56.95409378
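The chosen lambda can also be passed to predict for new data, for instance on the held-out set:
pred.best=predict(lasso.tr,x[-train,],s=lam.best) #predictions at the validation-selected lambda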
Pulled from the following:
https://www.youtube.com/watch?v=3kwdDGnV8MM&list=PL5-da3qGB5IB-Xdpj_uXJpLGiRfv9UVXI&index=11
https://www.youtube.com/watch?v=mv-vdysZIb4&list=PL5-da3qGB5IB-Xdpj_uXJpLGiRfv9UVXI&index=12
https://www.youtube.com/watch?v=mv-vdysZIb4&list=PL5-da3qGB5IB-Xdpj_uXJpLGiRfv9UVXI&index=13
https://www.youtube.com/watch?v=1REe3qSotx8&list=PL5-da3qGB5IB-Xdpj_uXJpLGiRfv9UVXI&index=14