# Load the data, then keep the response (column 6, Y) followed by the five predictors
dat <- read.csv("Z:/Download/dat.csv")
dat <- dat[, c(6, 2:4, 10, 11)]
str(dat)
## 'data.frame': 167 obs. of 6 variables:
## $ Y : num 13 33.8 15.4 20 17.8 14.4 14.9 9.2 17.1 22.4 ...
## $ V1: num 37 40 35 42 45 34 36 32 33 34 ...
## $ V2: num 171 171 179 147 156 ...
## $ V3: num 60.8 94.8 70.8 55.9 54.1 63.1 53.6 67.1 53.6 74.5 ...
## $ V4: num 1.7 2.12 1.87 1.51 1.53 ...
## $ V5: num 20.9 32.6 22.2 25.8 22.1 ...
head(dat)
##      Y V1    V2   V3       V4       V5
## 1 13.0 37 170.6 60.8 1.697423 20.89034
## 2 33.8 40 170.6 94.8 2.119544 32.57244
## 3 15.4 35 178.6 70.8 1.874158 22.19578
## 4 20.0 42 147.3 55.9 1.512363 25.76359
## 5 17.8 45 156.3 54.1 1.532593 22.14518
## 6 14.4 34 167.4 63.1 1.712936 22.51741
The backward procedure begins with the full model, which includes all candidate variables, and eliminates one variable at a time. It stops when no further removal lowers the AIC; the model with the lowest AIC is kept [1].
step(lm(Y~.,data=dat),direction="backward")
## Start: AIC=402.36
## Y ~ V1 + V2 + V3 + V4 + V5
##
##        Df Sum of Sq    RSS    AIC
## - V1    1     2.910 1732.2 400.64
## - V2    1     4.100 1733.4 400.75
## <none>              1729.3 402.36
## - V4    1    27.098 1756.4 402.95
## - V3    1    30.802 1760.1 403.30
## - V5    1    34.131 1763.4 403.62
##
## Step: AIC=400.64
## Y ~ V2 + V3 + V4 + V5
##
##        Df Sum of Sq    RSS    AIC
## - V2    1     3.846 1736.0 399.01
## <none>              1732.2 400.64
## - V4    1    27.202 1759.4 401.24
## - V3    1    32.375 1764.6 401.73
## - V5    1    32.398 1764.6 401.73
##
## Step: AIC=399.01
## Y ~ V3 + V4 + V5
##
##        Df Sum of Sq    RSS    AIC
## <none>              1736.0 399.01
## - V3    1    28.645 1764.7 399.74
## - V4    1    44.860 1780.9 401.27
## - V5    1    74.856 1810.9 404.06
##
## Call:
## lm(formula = Y ~ V3 + V4 + V5, data = dat)
##
## Coefficients:
## (Intercept)           V3           V4           V5
##     25.9461       0.8431     -53.2751       1.2293
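The AIC values printed by step() can be checked by hand. For a linear model, step() relies on extractAIC(), which computes n * log(RSS / n) + 2 * edf, where edf is the number of estimated coefficients including the intercept. The sketch below plugs in the rounded RSS values from the output above and n = 167 from str(dat); the helper name step_aic is hypothetical and exists only for this illustration.

```r
# Hypothetical helper: AIC on the scale reported by step() for a linear model,
# i.e. n * log(RSS / n) + k * edf, with k = 2 for AIC.
step_aic <- function(rss, n, edf, k = 2) {
  n * log(rss / n) + k * edf
}

n <- 167  # number of observations, taken from str(dat)

# Full model Y ~ V1 + V2 + V3 + V4 + V5: 5 slopes + intercept = 6 coefficients
aic_full <- step_aic(rss = 1729.3, n = n, edf = 6)

# Final model Y ~ V3 + V4 + V5: 3 slopes + intercept = 4 coefficients
aic_final <- step_aic(rss = 1736.0, n = n, edf = 4)

print(round(c(full = aic_full, final = aic_final), 2))
```

Because the printed RSS values are rounded, the results should match the reported 402.36 and 399.01 only approximately. This scale also differs from the value returned by AIC() by an additive constant, which does not affect the ranking of models.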
The forward method begins with the simplest model (intercept only, no predictors) and adds one suitable variable at a time. It stops when no further addition lowers the AIC; the model with the lowest AIC is kept [1].
step(lm(Y~1,data=dat),direction="forward",scope=~V1+V2+V3+V4+V5)
## Start: AIC=591.5
## Y ~ 1
##
##        Df Sum of Sq    RSS    AIC
## + V5    1    3566.1 2132.2 429.33
## + V3    1    1298.3 4400.0 550.32
## + V4    1     699.9 4998.4 571.61
## + V2    1     149.3 5548.9 589.06
## <none>              5698.3 591.50
## + V1    1      61.4 5636.8 591.69
##
## Step: AIC=429.33
## Y ~ V5
##
##        Df Sum of Sq    RSS    AIC
## + V4    1    367.47 1764.7 399.74
## + V2    1    366.72 1765.4 399.81
## + V3    1    351.26 1780.9 401.27
## <none>              2132.2 429.33
## + V1    1     17.61 2114.5 429.95
##
## Step: AIC=399.74
## Y ~ V5 + V4
##
##        Df Sum of Sq    RSS    AIC
## + V3    1   28.6446 1736.0 399.01
## <none>              1764.7 399.74
## + V1    1    4.5649 1760.1 401.31
## + V2    1    0.1160 1764.6 401.73
##
## Step: AIC=399.01
## Y ~ V5 + V4 + V3
##
##        Df Sum of Sq    RSS    AIC
## <none>              1736.0 399.01
## + V2    1    3.8460 1732.2 400.64
## + V1    1    2.6564 1733.4 400.75
##
## Call:
## lm(formula = Y ~ V5 + V4 + V3, data = dat)
##
## Coefficients:
## (Intercept)           V5           V4           V3
##     25.9461       1.2293     -53.2751       0.8431
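Typing every predictor into scope becomes tedious when there are many columns. A minimal sketch, assuming the built-in mtcars data as a stand-in for dat (with mpg playing the role of Y), builds the scope formula from the column names with reformulate(). The object names predictors, upper, null_fit and fwd_fit are illustrative only.

```r
# Stand-in data: built-in mtcars, with mpg as the response (an assumption made
# only so the example runs without dat.csv).
predictors <- setdiff(names(mtcars), "mpg")

# Upper scope: a one-sided formula (~ cyl + disp + ...) built from column names
upper <- reformulate(predictors)

# Start from the intercept-only model and add one variable at a time
null_fit <- lm(mpg ~ 1, data = mtcars)
fwd_fit <- step(null_fit, direction = "forward", scope = upper, trace = 0)

print(formula(fwd_fit))
```

For the data used here, the equivalent scope would be reformulate(setdiff(names(dat), "Y")). Setting trace = 0 suppresses the step-by-step tables shown above.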
The stepwise method combines the backward and forward procedures: at each step a variable may be removed or added back. AIC is used as the criterion for selecting the best model; the lower the AIC, the better the model [1,2].
step(lm(Y~.,data=dat),direction="both")
## Start: AIC=402.36
## Y ~ V1 + V2 + V3 + V4 + V5
##
##        Df Sum of Sq    RSS    AIC
## - V1    1     2.910 1732.2 400.64
## - V2    1     4.100 1733.4 400.75
## <none>              1729.3 402.36
## - V4    1    27.098 1756.4 402.95
## - V3    1    30.802 1760.1 403.30
## - V5    1    34.131 1763.4 403.62
##
## Step: AIC=400.64
## Y ~ V2 + V3 + V4 + V5
##
##        Df Sum of Sq    RSS    AIC
## - V2    1     3.846 1736.0 399.01
## <none>              1732.2 400.64
## - V4    1    27.202 1759.4 401.24
## - V3    1    32.375 1764.6 401.73
## - V5    1    32.398 1764.6 401.73
## + V1    1     2.910 1729.3 402.36
##
## Step: AIC=399.01
## Y ~ V3 + V4 + V5
##
##        Df Sum of Sq    RSS    AIC
## <none>              1736.0 399.01
## - V3    1    28.645 1764.7 399.74
## + V2    1     3.846 1732.2 400.64
## + V1    1     2.656 1733.4 400.75
## - V4    1    44.860 1780.9 401.27
## - V5    1    74.856 1810.9 404.06
##
## Call:
## lm(formula = Y ~ V3 + V4 + V5, data = dat)
##
## Coefficients:
## (Intercept)           V3           V4           V5
##     25.9461       0.8431     -53.2751       1.2293
# stepAIC() from the MASS package runs the same AIC-based search as step()
library(MASS)
stepAIC(lm(Y~.,data=dat),direction="both")
## Start: AIC=402.36
## Y ~ V1 + V2 + V3 + V4 + V5
##
##        Df Sum of Sq    RSS    AIC
## - V1    1     2.910 1732.2 400.64
## - V2    1     4.100 1733.4 400.75
## <none>              1729.3 402.36
## - V4    1    27.098 1756.4 402.95
## - V3    1    30.802 1760.1 403.30
## - V5    1    34.131 1763.4 403.62
##
## Step: AIC=400.64
## Y ~ V2 + V3 + V4 + V5
##
##        Df Sum of Sq    RSS    AIC
## - V2    1     3.846 1736.0 399.01
## <none>              1732.2 400.64
## - V4    1    27.202 1759.4 401.24
## - V3    1    32.375 1764.6 401.73
## - V5    1    32.398 1764.6 401.73
## + V1    1     2.910 1729.3 402.36
##
## Step: AIC=399.01
## Y ~ V3 + V4 + V5
##
##        Df Sum of Sq    RSS    AIC
## <none>              1736.0 399.01
## - V3    1    28.645 1764.7 399.74
## + V2    1     3.846 1732.2 400.64
## + V1    1     2.656 1733.4 400.75
## - V4    1    44.860 1780.9 401.27
## - V5    1    74.856 1810.9 404.06
##
## Call:
## lm(formula = Y ~ V3 + V4 + V5, data = dat)
##
## Coefficients:
## (Intercept)           V3           V4           V5
##     25.9461       0.8431     -53.2751       1.2293
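The step() and stepAIC() runs above end at the same model. A small sketch of how that agreement could be checked programmatically, again assuming the built-in mtcars data as a stand-in for dat (the object names are illustrative):

```r
library(MASS)  # provides stepAIC()

# Stand-in full model on built-in data (assumption: mpg plays the role of Y)
full_fit <- lm(mpg ~ ., data = mtcars)

# Run both implementations silently, in both directions
fit_step    <- step(full_fit, direction = "both", trace = 0)
fit_stepAIC <- stepAIC(full_fit, direction = "both", trace = 0)

# Compare the selected terms and the fitted coefficients
same_terms <- setequal(names(coef(fit_step)), names(coef(fit_stepAIC)))
print(same_terms)
print(all.equal(coef(fit_step), coef(fit_stepAIC)))
```

For linear models both functions rank candidates by the same extractAIC() value, so the selected terms are expected to coincide.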
Stepwise selection has several known problems:
Bias in parameter estimation
Inconsistencies among model selection algorithms
An inherent (but often overlooked) problem of multiple hypothesis testing
An inappropriate focus or reliance on a single best model [4]
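The reliance on a single best model can be probed by repeating the selection on resampled data. A rough sketch, assuming the built-in mtcars data as a stand-in for dat, with a hypothetical choice of five predictors and 50 bootstrap resamples; it records how often each variable survives backward selection:

```r
set.seed(1)  # arbitrary seed, only for reproducibility of the sketch

# Assumed stand-in setup: mpg as response, five illustrative predictors
predictors <- c("wt", "hp", "qsec", "drat", "disp")
full_formula <- reformulate(predictors, response = "mpg")
B <- 50  # number of bootstrap resamples (illustrative)

# One row per resample, one column per predictor: TRUE if the variable was kept
selected <- matrix(FALSE, nrow = B, ncol = length(predictors),
                   dimnames = list(NULL, predictors))

for (b in seq_len(B)) {
  # Resample rows with replacement and rerun backward selection
  boot <- mtcars[sample(nrow(mtcars), replace = TRUE), ]
  fit <- step(lm(full_formula, data = boot), direction = "backward", trace = 0)
  selected[b, ] <- predictors %in% attr(terms(fit), "term.labels")
}

# Share of resamples in which each predictor was retained
inclusion <- colMeans(selected)
print(round(inclusion, 2))
```

Inclusion shares well below 1 for variables in the chosen model would suggest that the selection is sensitive to the particular sample. The exact shares depend on the seed and the data.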
References:
Applied Predictive Modeling
Regression Modeling Strategies (page 68)