This document will illustrate some of the methods used in subset selection in R. The data to be used are from the Plasma Retinol dataset found in the StatLib archive. The data were located at http://lib.stat.cmu.edu/datasets/Plasma_Retinol and are related to the paper by Nierenberg DW, Stukel TA, Baron JA, Dain BJ, Greenberg ER, ‘Determinants of plasma levels of beta-carotene and retinol’, American Journal of Epidemiology 1989;130:511-521.
To begin, the data are copied from the archive and pasted to a text file (retinol.dat). That file is subsequently loaded into R and each variable is named.
Variable names in order from left to right:
retinol <- read.table('retinol_beta_carotene.txt',sep='')
names(retinol) <- c('age','sex','smokstat','quetelet','vituse','calories','fat','fiber','alcohol','cholesterol','betadiet','retdiet','betaplasma','retplasma')
We will have to use the regsubsets() command found in the leaps library. The resulting object will aid us in determing the best model to use given a set number of variables to use. The summary() command wil then show those models.
## Up to thirteen variables will be used, with retplasma being the dependent variable and the others as predictors.
library(leaps)
ret.full <- regsubsets(retplasma~.,data=retinol,nvmax=13)
sum.ret.full <- summary(ret.full)
As an example, let’s look at the models determined by R squared.
sum.ret.full$rsq
## [1] 0.04480491 0.06172468 0.07026813 0.07419006 0.07716292 0.07922344
## [7] 0.08129587 0.08334431 0.08462178 0.08617945 0.08782692 0.08839048
## [13] 0.08845239
What results is a series of r squared measurements as each variable is added to the model. Similar findings can be made for Cp, adjusted r squared and residual sum of squares.
We can also plot out the results for methods to be used in subset selection. Here we will plot out results under RSS.
plot(sum.ret.full$rss,xlab='No. of Variables',ylab='RSS',type='l')
With this in mind we’ll perform plots for adjusted r squared, Cp, and Bayesian Information Criterion. Once ploted, we can determine the best models for each method using the best performance with the least amount of variables. For adjusted r squared, we’ll use the largest value, while for the others the lowest value will be used. Afterward, we can compare the findings to determine the best number of common variables used.
plot(sum.ret.full$adjr2,xlab='No. of Variables',ylab='Adj. R^2',type='l')
which.max(sum.ret.full$adjr2)
## [1] 4
points(4,sum.ret.full$adjr2[4],pch=19,col='red')
plot(sum.ret.full$cp,xlab='No. of Variables',ylab='Cp',type='l')
which.min(sum.ret.full$cp)
## [1] 3
points(3,sum.ret.full$cp[3],pch=19,col='red')
plot(sum.ret.full$bic,xlab='No. of Variables',ylab='BIC',type='l')
which.min(sum.ret.full$bic)
## [1] 1
points(1,sum.ret.full$bic[1],pch=19,col='red')
To assist in finding which variables to use, we can use the plot.regsubsets method. We can plot the regsubsets object, using the methods in question as scales and keeping in mind the best performing sizes of models.
plot(ret.full,scale='adjr2')
plot(ret.full,scale='Cp')
plot(ret.full,scale='bic')
For our purposes, we will use four variables: age, sex, fat, and betaplasma.
coef(ret.full,4)
## (Intercept) age sex fat betaplasma
## 724.44406527 2.03679667 -103.30825736 -0.58041954 0.07247735
In addition to the previous methods, we can also use forward or backward stepwise selection. Again, we will use the regsubsets() function, but we will have to specify the method in use.
ret.full.fwd <- regsubsets(retplasma~.,data=retinol,nvmax=13,method='forward')
summary(ret.full.fwd)
## Subset selection object
## Call: regsubsets.formula(retplasma ~ ., data = retinol, nvmax = 13,
## method = "forward")
## 13 Variables (and intercept)
## Forced in Forced out
## age FALSE FALSE
## sex FALSE FALSE
## smokstat FALSE FALSE
## quetelet FALSE FALSE
## vituse FALSE FALSE
## calories FALSE FALSE
## fat FALSE FALSE
## fiber FALSE FALSE
## alcohol FALSE FALSE
## cholesterol FALSE FALSE
## betadiet FALSE FALSE
## retdiet FALSE FALSE
## betaplasma FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: forward
## age sex smokstat quetelet vituse calories fat fiber alcohol
## 1 ( 1 ) "*" " " " " " " " " " " " " " " " "
## 2 ( 1 ) "*" "*" " " " " " " " " " " " " " "
## 3 ( 1 ) "*" "*" " " " " " " " " "*" " " " "
## 4 ( 1 ) "*" "*" " " " " " " " " "*" " " " "
## 5 ( 1 ) "*" "*" " " " " " " " " "*" "*" " "
## 6 ( 1 ) "*" "*" " " " " "*" " " "*" "*" " "
## 7 ( 1 ) "*" "*" "*" " " "*" " " "*" "*" " "
## 8 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" " "
## 9 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" " "
## 10 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" " "
## 11 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*"
## cholesterol betadiet retdiet betaplasma
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " " " " " " "
## 3 ( 1 ) " " " " " " " "
## 4 ( 1 ) " " " " " " "*"
## 5 ( 1 ) " " " " " " "*"
## 6 ( 1 ) " " " " " " "*"
## 7 ( 1 ) " " " " " " "*"
## 8 ( 1 ) " " " " " " "*"
## 9 ( 1 ) "*" " " " " "*"
## 10 ( 1 ) "*" " " " " "*"
## 11 ( 1 ) "*" " " " " "*"
## 12 ( 1 ) "*" " " "*" "*"
## 13 ( 1 ) "*" "*" "*" "*"
ret.full.bwk <- regsubsets(retplasma~.,data=retinol,nvmax=13,method='backward')
summary(ret.full.bwk)
## Subset selection object
## Call: regsubsets.formula(retplasma ~ ., data = retinol, nvmax = 13,
## method = "backward")
## 13 Variables (and intercept)
## Forced in Forced out
## age FALSE FALSE
## sex FALSE FALSE
## smokstat FALSE FALSE
## quetelet FALSE FALSE
## vituse FALSE FALSE
## calories FALSE FALSE
## fat FALSE FALSE
## fiber FALSE FALSE
## alcohol FALSE FALSE
## cholesterol FALSE FALSE
## betadiet FALSE FALSE
## retdiet FALSE FALSE
## betaplasma FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: backward
## age sex smokstat quetelet vituse calories fat fiber alcohol
## 1 ( 1 ) "*" " " " " " " " " " " " " " " " "
## 2 ( 1 ) "*" "*" " " " " " " " " " " " " " "
## 3 ( 1 ) "*" "*" " " " " " " " " "*" " " " "
## 4 ( 1 ) "*" "*" " " " " " " " " "*" " " " "
## 5 ( 1 ) "*" "*" " " " " " " " " "*" "*" " "
## 6 ( 1 ) "*" "*" " " " " " " "*" "*" "*" " "
## 7 ( 1 ) "*" "*" " " " " " " "*" "*" "*" "*"
## 8 ( 1 ) "*" "*" "*" " " " " "*" "*" "*" "*"
## 9 ( 1 ) "*" "*" "*" " " " " "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" "*" "*" " " "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*"
## cholesterol betadiet retdiet betaplasma
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " " " " " " "
## 3 ( 1 ) " " " " " " " "
## 4 ( 1 ) " " " " " " "*"
## 5 ( 1 ) " " " " " " "*"
## 6 ( 1 ) " " " " " " "*"
## 7 ( 1 ) " " " " " " "*"
## 8 ( 1 ) " " " " " " "*"
## 9 ( 1 ) "*" " " " " "*"
## 10 ( 1 ) "*" " " " " "*"
## 11 ( 1 ) "*" " " " " "*"
## 12 ( 1 ) "*" " " "*" "*"
## 13 ( 1 ) "*" "*" "*" "*"
Again, we find that the previous four variables also serve us well for this method.
coef(ret.full.fwd,4)
## (Intercept) age sex fat betaplasma
## 724.44406527 2.03679667 -103.30825736 -0.58041954 0.07247735
coef(ret.full.bwk,4)
## (Intercept) age sex fat betaplasma
## 724.44406527 2.03679667 -103.30825736 -0.58041954 0.07247735
Though there is not a direct way to do it, we can also try to determine a best model by using validation set methods. The following code was adapted from the book ‘An Introduction to Statistical Learning with Applications in R’ by Witten, James, Hastie, and Tibshirani.
In their process, a regsubsets object is created with a training set of the data. A matrix of coefficents of the best fitting models for each size of model is then created. Mean squared errors are then calculated between the training and test objects and the size of the model with the least amount of error is noted.
set.seed(1234)
train <- sample(315,215)
ret.train <- regsubsets(retplasma~.,data=retinol[train,],nvmax=13)
test.mat <- model.matrix(retplasma~.,data=retinol[-train,])
val.errors <- rep(NA,13)
for(i in 1:13){
coefi <- coef(ret.train,id=i)
pred <- test.mat[,names(coefi)]%*%coefi
val.errors[i] <- mean((retinol$retplasma[-train] - pred)^2)
}
which.min(val.errors)
## [1] 4
We find that the best model contains four variables. They are listed below.
coef(ret.train,4)
## (Intercept) age sex cholesterol betaplasma
## 745.59453752 1.48593462 -102.67968719 -0.18705331 0.06533189
Because there is not a predict() function already connected with regsubsets, Witten et al have provided the following function. This function will be utilized when cross validation is used to determine a best model.
predict.regsubsets <- function(object,newdata,id,...){
form <- as.formula(object$call[[2]])
mat <- model.matrix(form,newdata)
coefi <- coef(object,id=id)
xvars <- names(coefi)
mat[,xvars]%*%coefi
}
For 10 fold cross validation, the dataset is systematically separated into training and test sets. Mean squared errors are then recorded for each size of model. Averages for each size is then taken.
k <- 10
folds <- sample(1:k,nrow(retinol),replace=T)
cv.errors <- matrix(NA,k,13,dimnames=list(NULL,paste(1:13)))
for(j in 1:k){
## Training set
best.fit <- regsubsets(retplasma~.,data=retinol[folds != j,],nvmax=13)
for(i in 1:13){
pred <- predict.regsubsets(best.fit,retinol[folds == j,],id=i)
cv.errors[j,i] <- mean((retinol$retplasma[folds==j] - pred)^2)
}
}
mean.cv.errors <- apply(cv.errors,2,mean)
We can now plot the errors and determine how many variables we can select for the model with this method.
plot(mean.cv.errors,type='b')
Here an argument can be made for using a model of three variables. The coefficients of those variables are noted below.
ret.full.cv <- regsubsets(retplasma~.,data=retinol,nvmax=13)
coef(ret.full.cv,3)
## (Intercept) age sex fat
## 725.6600724 2.1505134 -98.8628894 -0.5992893