This document will illustrate some of the methods used in subset selection in R. The data to be used are from the Plasma Retinol dataset found in the StatLib archive. The data were located at http://lib.stat.cmu.edu/datasets/Plasma_Retinol and are related to the paper by Nierenberg DW, Stukel TA, Baron JA, Dain BJ, Greenberg ER, ‘Determinants of plasma levels of beta-carotene and retinol’, American Journal of Epidemiology 1989;130:511-521.

To begin, the data are copied from the archive and pasted to a text file (retinol.dat). That file is subsequently loaded into R and each variable is named.

Variable names in order from left to right:

retinol <- read.table('retinol_beta_carotene.txt',sep='')
names(retinol) <- c('age','sex','smokstat','quetelet','vituse','calories','fat','fiber','alcohol','cholesterol','betadiet','retdiet','betaplasma','retplasma')

We will have to use the regsubsets() command found in the leaps library. The resulting object will aid us in determing the best model to use given a set number of variables to use. The summary() command wil then show those models.

## Up to thirteen variables will be used, with retplasma being the dependent variable and the others as predictors.
library(leaps)
ret.full <- regsubsets(retplasma~.,data=retinol,nvmax=13)
sum.ret.full <- summary(ret.full)

As an example, let’s look at the models determined by R squared.

sum.ret.full$rsq
##  [1] 0.04480491 0.06172468 0.07026813 0.07419006 0.07716292 0.07922344
##  [7] 0.08129587 0.08334431 0.08462178 0.08617945 0.08782692 0.08839048
## [13] 0.08845239

What results is a series of r squared measurements as each variable is added to the model. Similar findings can be made for Cp, adjusted r squared and residual sum of squares.

We can also plot out the results for methods to be used in subset selection. Here we will plot out results under RSS.

plot(sum.ret.full$rss,xlab='No. of Variables',ylab='RSS',type='l')

With this in mind we’ll perform plots for adjusted r squared, Cp, and Bayesian Information Criterion. Once ploted, we can determine the best models for each method using the best performance with the least amount of variables. For adjusted r squared, we’ll use the largest value, while for the others the lowest value will be used. Afterward, we can compare the findings to determine the best number of common variables used.

plot(sum.ret.full$adjr2,xlab='No. of Variables',ylab='Adj. R^2',type='l')
which.max(sum.ret.full$adjr2)
## [1] 4
points(4,sum.ret.full$adjr2[4],pch=19,col='red')

plot(sum.ret.full$cp,xlab='No. of Variables',ylab='Cp',type='l')
which.min(sum.ret.full$cp)
## [1] 3
points(3,sum.ret.full$cp[3],pch=19,col='red')

plot(sum.ret.full$bic,xlab='No. of Variables',ylab='BIC',type='l')
which.min(sum.ret.full$bic)
## [1] 1
points(1,sum.ret.full$bic[1],pch=19,col='red')

To assist in finding which variables to use, we can use the plot.regsubsets method. We can plot the regsubsets object, using the methods in question as scales and keeping in mind the best performing sizes of models.

plot(ret.full,scale='adjr2')

plot(ret.full,scale='Cp')

plot(ret.full,scale='bic')

For our purposes, we will use four variables: age, sex, fat, and betaplasma.

coef(ret.full,4)
##   (Intercept)           age           sex           fat    betaplasma 
##  724.44406527    2.03679667 -103.30825736   -0.58041954    0.07247735

In addition to the previous methods, we can also use forward or backward stepwise selection. Again, we will use the regsubsets() function, but we will have to specify the method in use.

ret.full.fwd <- regsubsets(retplasma~.,data=retinol,nvmax=13,method='forward')
summary(ret.full.fwd)
## Subset selection object
## Call: regsubsets.formula(retplasma ~ ., data = retinol, nvmax = 13, 
##     method = "forward")
## 13 Variables  (and intercept)
##             Forced in Forced out
## age             FALSE      FALSE
## sex             FALSE      FALSE
## smokstat        FALSE      FALSE
## quetelet        FALSE      FALSE
## vituse          FALSE      FALSE
## calories        FALSE      FALSE
## fat             FALSE      FALSE
## fiber           FALSE      FALSE
## alcohol         FALSE      FALSE
## cholesterol     FALSE      FALSE
## betadiet        FALSE      FALSE
## retdiet         FALSE      FALSE
## betaplasma      FALSE      FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: forward
##           age sex smokstat quetelet vituse calories fat fiber alcohol
## 1  ( 1 )  "*" " " " "      " "      " "    " "      " " " "   " "    
## 2  ( 1 )  "*" "*" " "      " "      " "    " "      " " " "   " "    
## 3  ( 1 )  "*" "*" " "      " "      " "    " "      "*" " "   " "    
## 4  ( 1 )  "*" "*" " "      " "      " "    " "      "*" " "   " "    
## 5  ( 1 )  "*" "*" " "      " "      " "    " "      "*" "*"   " "    
## 6  ( 1 )  "*" "*" " "      " "      "*"    " "      "*" "*"   " "    
## 7  ( 1 )  "*" "*" "*"      " "      "*"    " "      "*" "*"   " "    
## 8  ( 1 )  "*" "*" "*"      "*"      "*"    " "      "*" "*"   " "    
## 9  ( 1 )  "*" "*" "*"      "*"      "*"    " "      "*" "*"   " "    
## 10  ( 1 ) "*" "*" "*"      "*"      "*"    "*"      "*" "*"   " "    
## 11  ( 1 ) "*" "*" "*"      "*"      "*"    "*"      "*" "*"   "*"    
## 12  ( 1 ) "*" "*" "*"      "*"      "*"    "*"      "*" "*"   "*"    
## 13  ( 1 ) "*" "*" "*"      "*"      "*"    "*"      "*" "*"   "*"    
##           cholesterol betadiet retdiet betaplasma
## 1  ( 1 )  " "         " "      " "     " "       
## 2  ( 1 )  " "         " "      " "     " "       
## 3  ( 1 )  " "         " "      " "     " "       
## 4  ( 1 )  " "         " "      " "     "*"       
## 5  ( 1 )  " "         " "      " "     "*"       
## 6  ( 1 )  " "         " "      " "     "*"       
## 7  ( 1 )  " "         " "      " "     "*"       
## 8  ( 1 )  " "         " "      " "     "*"       
## 9  ( 1 )  "*"         " "      " "     "*"       
## 10  ( 1 ) "*"         " "      " "     "*"       
## 11  ( 1 ) "*"         " "      " "     "*"       
## 12  ( 1 ) "*"         " "      "*"     "*"       
## 13  ( 1 ) "*"         "*"      "*"     "*"
ret.full.bwk <- regsubsets(retplasma~.,data=retinol,nvmax=13,method='backward')
summary(ret.full.bwk)
## Subset selection object
## Call: regsubsets.formula(retplasma ~ ., data = retinol, nvmax = 13, 
##     method = "backward")
## 13 Variables  (and intercept)
##             Forced in Forced out
## age             FALSE      FALSE
## sex             FALSE      FALSE
## smokstat        FALSE      FALSE
## quetelet        FALSE      FALSE
## vituse          FALSE      FALSE
## calories        FALSE      FALSE
## fat             FALSE      FALSE
## fiber           FALSE      FALSE
## alcohol         FALSE      FALSE
## cholesterol     FALSE      FALSE
## betadiet        FALSE      FALSE
## retdiet         FALSE      FALSE
## betaplasma      FALSE      FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: backward
##           age sex smokstat quetelet vituse calories fat fiber alcohol
## 1  ( 1 )  "*" " " " "      " "      " "    " "      " " " "   " "    
## 2  ( 1 )  "*" "*" " "      " "      " "    " "      " " " "   " "    
## 3  ( 1 )  "*" "*" " "      " "      " "    " "      "*" " "   " "    
## 4  ( 1 )  "*" "*" " "      " "      " "    " "      "*" " "   " "    
## 5  ( 1 )  "*" "*" " "      " "      " "    " "      "*" "*"   " "    
## 6  ( 1 )  "*" "*" " "      " "      " "    "*"      "*" "*"   " "    
## 7  ( 1 )  "*" "*" " "      " "      " "    "*"      "*" "*"   "*"    
## 8  ( 1 )  "*" "*" "*"      " "      " "    "*"      "*" "*"   "*"    
## 9  ( 1 )  "*" "*" "*"      " "      " "    "*"      "*" "*"   "*"    
## 10  ( 1 ) "*" "*" "*"      "*"      " "    "*"      "*" "*"   "*"    
## 11  ( 1 ) "*" "*" "*"      "*"      "*"    "*"      "*" "*"   "*"    
## 12  ( 1 ) "*" "*" "*"      "*"      "*"    "*"      "*" "*"   "*"    
## 13  ( 1 ) "*" "*" "*"      "*"      "*"    "*"      "*" "*"   "*"    
##           cholesterol betadiet retdiet betaplasma
## 1  ( 1 )  " "         " "      " "     " "       
## 2  ( 1 )  " "         " "      " "     " "       
## 3  ( 1 )  " "         " "      " "     " "       
## 4  ( 1 )  " "         " "      " "     "*"       
## 5  ( 1 )  " "         " "      " "     "*"       
## 6  ( 1 )  " "         " "      " "     "*"       
## 7  ( 1 )  " "         " "      " "     "*"       
## 8  ( 1 )  " "         " "      " "     "*"       
## 9  ( 1 )  "*"         " "      " "     "*"       
## 10  ( 1 ) "*"         " "      " "     "*"       
## 11  ( 1 ) "*"         " "      " "     "*"       
## 12  ( 1 ) "*"         " "      "*"     "*"       
## 13  ( 1 ) "*"         "*"      "*"     "*"

Again, we find that the previous four variables also serve us well for this method.

coef(ret.full.fwd,4)
##   (Intercept)           age           sex           fat    betaplasma 
##  724.44406527    2.03679667 -103.30825736   -0.58041954    0.07247735
coef(ret.full.bwk,4)
##   (Intercept)           age           sex           fat    betaplasma 
##  724.44406527    2.03679667 -103.30825736   -0.58041954    0.07247735

Though there is not a direct way to do it, we can also try to determine a best model by using validation set methods. The following code was adapted from the book ‘An Introduction to Statistical Learning with Applications in R’ by Witten, James, Hastie, and Tibshirani.

In their process, a regsubsets object is created with a training set of the data. A matrix of coefficents of the best fitting models for each size of model is then created. Mean squared errors are then calculated between the training and test objects and the size of the model with the least amount of error is noted.

set.seed(1234)
train <- sample(315,215)
ret.train <- regsubsets(retplasma~.,data=retinol[train,],nvmax=13)
test.mat <- model.matrix(retplasma~.,data=retinol[-train,])
val.errors <- rep(NA,13)
for(i in 1:13){
coefi <- coef(ret.train,id=i)
pred <- test.mat[,names(coefi)]%*%coefi
val.errors[i] <- mean((retinol$retplasma[-train] - pred)^2)
}
which.min(val.errors)
## [1] 4

We find that the best model contains four variables. They are listed below.

coef(ret.train,4)
##   (Intercept)           age           sex   cholesterol    betaplasma 
##  745.59453752    1.48593462 -102.67968719   -0.18705331    0.06533189

Because there is not a predict() function already connected with regsubsets, Witten et al have provided the following function. This function will be utilized when cross validation is used to determine a best model.

predict.regsubsets <- function(object,newdata,id,...){
form <- as.formula(object$call[[2]])
mat <- model.matrix(form,newdata)
coefi <- coef(object,id=id)
xvars <- names(coefi)
mat[,xvars]%*%coefi
}

For 10 fold cross validation, the dataset is systematically separated into training and test sets. Mean squared errors are then recorded for each size of model. Averages for each size is then taken.

k <- 10
folds <- sample(1:k,nrow(retinol),replace=T)
cv.errors <- matrix(NA,k,13,dimnames=list(NULL,paste(1:13)))
for(j in 1:k){
    ## Training set
    best.fit <- regsubsets(retplasma~.,data=retinol[folds != j,],nvmax=13)
        for(i in 1:13){
            pred <- predict.regsubsets(best.fit,retinol[folds == j,],id=i)
            cv.errors[j,i] <- mean((retinol$retplasma[folds==j] - pred)^2)
            }
    }
mean.cv.errors <- apply(cv.errors,2,mean)

We can now plot the errors and determine how many variables we can select for the model with this method.

plot(mean.cv.errors,type='b')

Here an argument can be made for using a model of three variables. The coefficients of those variables are noted below.

ret.full.cv <- regsubsets(retplasma~.,data=retinol,nvmax=13)
coef(ret.full.cv,3)
## (Intercept)         age         sex         fat 
## 725.6600724   2.1505134 -98.8628894  -0.5992893