The deficiencies of stepwise procedures are well known. This document works through an example where stepwise variable selection falls victim to a local minimum and highlights the utility of the various procedures and plots provided by the mplot package.

To begin, consider the following data set, inspired by the pathoeg data set in the MVP package.

n = 50
set.seed(8)
x1 = rnorm(n, 0.22, 2)
x7 = 0.5 * x1 + rnorm(n, 0, sd = 2)
x6 = -0.75 * x1 + rnorm(n, 0, 3)
x3 = -0.5 - 0.5 * x6 + rnorm(n, 0, 2)
x9 = rnorm(n, 0.6, 3.5)
x4 = 0.5 * x9 + rnorm(n, 0, sd = 3)
x2 = -0.5 + 0.5 * x9 + rnorm(n, 0, sd = 2)
x5 = -0.5 * x2 + 0.5 * x3 + 0.5 * x6 - 0.5 * x9 + rnorm(n, 0, 1.5)
# x8 is close to a linear combination of the other predictors (error sd = 0.5)
x8 = x1 + x2 - 2 * x3 - 0.3 * x4 + x5 - 1.6 * x6 - 1 * x7 + x9 + rnorm(n, 0, 0.5)
# the response is generated from x8 only
y = 0.6 * x8 + rnorm(n, 0, 2)
df = round(data.frame(x1, x2, x3, x4, x5, x6, x7, x8, x9, y), 1)

Looking at the pairs plot and correlation matrix, nothing particularly unusual stands out about this data set:

round(cor(df),1)
##      x1   x2   x3   x4   x5   x6   x7   x8   x9    y
## x1  1.0  0.0  0.1 -0.1  0.0 -0.4  0.5  0.4 -0.2  0.4
## x2  0.0  1.0  0.3  0.3 -0.6  0.0 -0.3  0.2  0.5  0.1
## x3  0.1  0.3  1.0  0.0 -0.3 -0.7 -0.1  0.0  0.1  0.1
## x4 -0.1  0.3  0.0  1.0 -0.5  0.0  0.0 -0.1  0.6 -0.2
## x5  0.0 -0.6 -0.3 -0.5  1.0  0.4  0.2 -0.3 -0.8 -0.2
## x6 -0.4  0.0 -0.7  0.0  0.4  1.0  0.0 -0.5 -0.1 -0.5
## x7  0.5 -0.3 -0.1  0.0  0.2  0.0  1.0 -0.4 -0.3 -0.3
## x8  0.4  0.2  0.0 -0.1 -0.3 -0.5 -0.4  1.0  0.3  0.8
## x9 -0.2  0.5  0.1  0.6 -0.8 -0.1 -0.3  0.3  1.0  0.1
## y   0.4  0.1  0.1 -0.2 -0.2 -0.5 -0.3  0.8  0.1  1.0
plot(df)

[Figure: pairs plot of df]
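
As a quick numeric follow-up (a sketch, not part of the original output), we can rank the predictors by their absolute marginal correlation with y. Consistent with the matrix above, x8 has the strongest marginal relationship with the response:

# rank predictors by absolute correlation with y (y itself excluded)
cr = cor(df)[, "y"]
sort(abs(cr[names(cr) != "y"]), decreasing = TRUE)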

We can fit the full model and perform the default stepwise variable selection.

mf = lm(y~.,data=df)
summary(mf)
## 
## Call:
## lm(formula = y ~ ., data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.725 -1.213  0.246  1.345  3.481 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -0.100      0.327   -0.31     0.76
## x1             0.640      0.692    0.92     0.36
## x2             0.257      0.618    0.42     0.68
## x3            -0.511      1.239   -0.41     0.68
## x4            -0.298      0.252   -1.18     0.24
## x5             0.355      0.598    0.59     0.56
## x6            -0.543      0.961   -0.56     0.58
## x7            -0.428      0.631   -0.68     0.50
## x8             0.150      0.617    0.24     0.81
## x9             0.400      0.638    0.63     0.53
## 
## Residual standard error: 2.02 on 40 degrees of freedom
## Multiple R-squared:  0.751,  Adjusted R-squared:  0.695 
## F-statistic: 13.4 on 9 and 40 DF,  p-value: 1.48e-09
mf.step=step(mf)
## Start:  AIC=79.3
## y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9
## 
##        Df Sum of Sq RSS  AIC
## - x8    1      0.24 164 77.4
## - x3    1      0.69 164 77.5
## - x2    1      0.71 164 77.5
## - x6    1      1.31 165 77.7
## - x5    1      1.44 165 77.7
## - x9    1      1.61 165 77.8
## - x7    1      1.88 166 77.9
## - x1    1      3.50 167 78.4
## - x4    1      5.74 169 79.0
## <none>              164 79.3
## 
## Step:  AIC=77.37
## y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x9
## 
##        Df Sum of Sq RSS   AIC
## <none>              164  77.4
## - x2    1      20.4 184  81.2
## - x5    1      26.0 190  82.7
## - x9    1      33.6 198  84.7
## - x4    1      34.5 198  84.9
## - x7    1      62.1 226  91.4
## - x1    1      68.3 232  92.8
## - x3    1      71.3 235  93.4
## - x6    1     107.9 272 100.7
summary(mf.step)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x9, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.688 -1.235  0.279  1.410  3.542 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.114      0.318   -0.36  0.72095    
## x1             0.802      0.194    4.13  0.00017 ***
## x2             0.401      0.178    2.26  0.02944 *  
## x3            -0.808      0.191   -4.22  0.00013 ***
## x4            -0.351      0.120   -2.94  0.00541 ** 
## x5             0.493      0.193    2.55  0.01467 *  
## x6            -0.774      0.149   -5.19    6e-06 ***
## x7            -0.577      0.146   -3.94  0.00031 ***
## x9             0.548      0.189    2.90  0.00599 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2 on 41 degrees of freedom
## Multiple R-squared:  0.75,   Adjusted R-squared:  0.702 
## F-statistic: 15.4 on 8 and 41 DF,  p-value: 3.69e-10

It is clear that basic stepwise variable selection yields a highly non-parsimonious model. In fact, it has removed x8, the only variable from which y was directly constructed in the data generating process.
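
As a quick sketch (not part of the original output), we can fit the data-generating model y ~ x8 directly and compare it with the stepwise result. Note that once backward elimination drops x8 at the first step, the true model can never be reached along that search path; the object name mf.true is introduced here purely for illustration.

# compare the true single-predictor model with the stepwise selection
mf.true = lm(y ~ x8, data = df)
AIC(mf.true, mf.step, mf)
summary(mf.true)$adj.r.squared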

require(mplot)
## Loading required package: mplot
## Loading required package: leaps
## Loading required package: bestglm
## Loading required package: googleVis
## 
## Welcome to googleVis version 0.5.3
## 
## Please read the Google API Terms of Use
## before you use the package:
## https://developers.google.com/terms/
## 
## Note, the plot method of googleVis will by default use
## the standard browser to display its output.
## 
## See the googleVis package vignettes for more details,
## or visit http://github.com/mages/googleVis.
## 
## To suppress this message use:
## suppressPackageStartupMessages(library(googleVis))
## 
## Loading required package: parallel
af1 = af(mf)
## Loading required package: doMC
## Loading required package: foreach
## Loading required package: iterators
plot(af1,html.only=TRUE)
[Interactive googleVis scatter chart of the adaptive fence output]

We could also look at the variable importance plots:

vis1 = lvp(mf)
plot(vis1,which="vip",html.only=TRUE)
[Interactive googleVis line chart of the variable importance output]
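
As a final cross-check (a sketch, not part of the original output), an exhaustive best-subsets search can be run with the leaps package, which mplot already loads as a dependency, to see which subset of predictors minimises BIC over all candidate models:

# exhaustive best-subsets search over all candidate models
library(leaps)
all.sub = regsubsets(y ~ ., data = df, nvmax = 9)
sub.sum = summary(all.sub)
sub.sum$which[which.min(sub.sum$bic), ]  # predictors in the BIC-best subset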