The deficiencies of stepwise procedures are well known. This document illustrates an example where stepwise variable selection procedures fall victim to local minimia and highlights the utility of the various procedures and plots provided by the mplot package.
To begin consider the following data set inspired by the pathoeg data set in the MVP package.
n=50
set.seed(8)
x1 = rnorm(n,0.22,2)
x7 = 0.5*x1 + rnorm(n,0,sd=2)
x6 = -0.75*x1 + rnorm(n,0,3)
x3 = -0.5-0.5*x6 + rnorm(n,0,2)
x9 = rnorm(n,0.6,3.5)
x4 = 0.5*x9 + rnorm(n,0,sd=3)
x2 = -0.5 + 0.5*x9 + rnorm(n,0,sd=2)
x5 = -0.5*x2+0.5*x3+0.5*x6-0.5*x9+rnorm(n,0,1.5)
x8 = x1 + x2 -2*x3 - 0.3*x4 + x5 - 1.6*x6 - 1*x7 + x9 +rnorm(n,0,0.5)
y = 0.6*x8 + rnorm(n,0,2)
df = round(data.frame(x1,x2,x3,x4,x5,x6,x7,x8,x9,y),1)
When looking at the pairs plot and correlation matrix, there is nothing particularly unusual that stands out about this data set:
round(cor(df),1)
## x1 x2 x3 x4 x5 x6 x7 x8 x9 y
## x1 1.0 0.0 0.1 -0.1 0.0 -0.4 0.5 0.4 -0.2 0.4
## x2 0.0 1.0 0.3 0.3 -0.6 0.0 -0.3 0.2 0.5 0.1
## x3 0.1 0.3 1.0 0.0 -0.3 -0.7 -0.1 0.0 0.1 0.1
## x4 -0.1 0.3 0.0 1.0 -0.5 0.0 0.0 -0.1 0.6 -0.2
## x5 0.0 -0.6 -0.3 -0.5 1.0 0.4 0.2 -0.3 -0.8 -0.2
## x6 -0.4 0.0 -0.7 0.0 0.4 1.0 0.0 -0.5 -0.1 -0.5
## x7 0.5 -0.3 -0.1 0.0 0.2 0.0 1.0 -0.4 -0.3 -0.3
## x8 0.4 0.2 0.0 -0.1 -0.3 -0.5 -0.4 1.0 0.3 0.8
## x9 -0.2 0.5 0.1 0.6 -0.8 -0.1 -0.3 0.3 1.0 0.1
## y 0.4 0.1 0.1 -0.2 -0.2 -0.5 -0.3 0.8 0.1 1.0
plot(df)
We can fit the full model and perform the default stepwise variable selection.
mf = lm(y~.,data=df)
summary(mf)
##
## Call:
## lm(formula = y ~ ., data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.725 -1.213 0.246 1.345 3.481
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.100 0.327 -0.31 0.76
## x1 0.640 0.692 0.92 0.36
## x2 0.257 0.618 0.42 0.68
## x3 -0.511 1.239 -0.41 0.68
## x4 -0.298 0.252 -1.18 0.24
## x5 0.355 0.598 0.59 0.56
## x6 -0.543 0.961 -0.56 0.58
## x7 -0.428 0.631 -0.68 0.50
## x8 0.150 0.617 0.24 0.81
## x9 0.400 0.638 0.63 0.53
##
## Residual standard error: 2.02 on 40 degrees of freedom
## Multiple R-squared: 0.751, Adjusted R-squared: 0.695
## F-statistic: 13.4 on 9 and 40 DF, p-value: 1.48e-09
mf.step=step(mf)
## Start: AIC=79.3
## y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9
##
## Df Sum of Sq RSS AIC
## - x8 1 0.24 164 77.4
## - x3 1 0.69 164 77.5
## - x2 1 0.71 164 77.5
## - x6 1 1.31 165 77.7
## - x5 1 1.44 165 77.7
## - x9 1 1.61 165 77.8
## - x7 1 1.88 166 77.9
## - x1 1 3.50 167 78.4
## - x4 1 5.74 169 79.0
## <none> 164 79.3
##
## Step: AIC=77.37
## y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x9
##
## Df Sum of Sq RSS AIC
## <none> 164 77.4
## - x2 1 20.4 184 81.2
## - x5 1 26.0 190 82.7
## - x9 1 33.6 198 84.7
## - x4 1 34.5 198 84.9
## - x7 1 62.1 226 91.4
## - x1 1 68.3 232 92.8
## - x3 1 71.3 235 93.4
## - x6 1 107.9 272 100.7
summary(mf.step)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x9, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.688 -1.235 0.279 1.410 3.542
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.114 0.318 -0.36 0.72095
## x1 0.802 0.194 4.13 0.00017 ***
## x2 0.401 0.178 2.26 0.02944 *
## x3 -0.808 0.191 -4.22 0.00013 ***
## x4 -0.351 0.120 -2.94 0.00541 **
## x5 0.493 0.193 2.55 0.01467 *
## x6 -0.774 0.149 -5.19 6e-06 ***
## x7 -0.577 0.146 -3.94 0.00031 ***
## x9 0.548 0.189 2.90 0.00599 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2 on 41 degrees of freedom
## Multiple R-squared: 0.75, Adjusted R-squared: 0.702
## F-statistic: 15.4 on 8 and 41 DF, p-value: 3.69e-10
It is clear that using the basic stepwise variable selection yields a highly non-parsimonious model. In fact it has removed x8 which was the only variable that y was directly constructed from in the data generating process.
require(mplot)
## Loading required package: mplot
## Loading required package: leaps
## Loading required package: bestglm
## Loading required package: googleVis
##
## Welcome to googleVis version 0.5.3
##
## Please read the Google API Terms of Use
## before you use the package:
## https://developers.google.com/terms/
##
## Note, the plot method of googleVis will by default use
## the standard browser to display its output.
##
## See the googleVis package vignettes for more details,
## or visit http://github.com/mages/googleVis.
##
## To suppress this message use:
## suppressPackageStartupMessages(library(googleVis))
##
## Loading required package: parallel
af1 = af(mf)
## Loading required package: doMC
## Loading required package: foreach
## Loading required package: iterators
plot(af1,html.only=TRUE)
Also could look at the variable importance plots:
vis1 = lvp(mf)
plot(vis1,which="vip",html.only=TRUE)