Objective: Model Y as a function of 5 candidate predictors (V1, V2, V3, V4, V5)

# Read the data, then keep Y (column 6) and the five predictors (columns 2-4, 10, 11)
dat <- read.csv("Z:/Download/dat.csv")
dat <- dat[, c(6, 2:4, 10, 11)]
str(dat)
## 'data.frame':    167 obs. of  6 variables:
##  $ Y : num  13 33.8 15.4 20 17.8 14.4 14.9 9.2 17.1 22.4 ...
##  $ V1: num  37 40 35 42 45 34 36 32 33 34 ...
##  $ V2: num  171 171 179 147 156 ...
##  $ V3: num  60.8 94.8 70.8 55.9 54.1 63.1 53.6 67.1 53.6 74.5 ...
##  $ V4: num  1.7 2.12 1.87 1.51 1.53 ...
##  $ V5: num  20.9 32.6 22.2 25.8 22.1 ...
head(dat)
##      Y V1    V2   V3       V4       V5
## 1 13.0 37 170.6 60.8 1.697423 20.89034
## 2 33.8 40 170.6 94.8 2.119544 32.57244
## 3 15.4 35 178.6 70.8 1.874158 22.19578
## 4 20.0 42 147.3 55.9 1.512363 25.76359
## 5 17.8 45 156.3 54.1 1.532593 22.14518
## 6 14.4 34 167.4 63.1 1.712936 22.51741

1. The backward method

The backward procedure starts from the full model containing all candidate variables and removes one variable at a time, stopping when no single removal lowers the AIC any further. The model it stops at is taken as the best model. [1]

step(lm(Y~.,data=dat),direction="backward")
## Start:  AIC=402.36
## Y ~ V1 + V2 + V3 + V4 + V5
## 
##        Df Sum of Sq    RSS    AIC
## - V1    1     2.910 1732.2 400.64
## - V2    1     4.100 1733.4 400.75
## <none>              1729.3 402.36
## - V4    1    27.098 1756.4 402.95
## - V3    1    30.802 1760.1 403.30
## - V5    1    34.131 1763.4 403.62
## 
## Step:  AIC=400.64
## Y ~ V2 + V3 + V4 + V5
## 
##        Df Sum of Sq    RSS    AIC
## - V2    1     3.846 1736.0 399.01
## <none>              1732.2 400.64
## - V4    1    27.202 1759.4 401.24
## - V3    1    32.375 1764.6 401.73
## - V5    1    32.398 1764.6 401.73
## 
## Step:  AIC=399.01
## Y ~ V3 + V4 + V5
## 
##        Df Sum of Sq    RSS    AIC
## <none>              1736.0 399.01
## - V3    1    28.645 1764.7 399.74
## - V4    1    44.860 1780.9 401.27
## - V5    1    74.856 1810.9 404.06
## 
## Call:
## lm(formula = Y ~ V3 + V4 + V5, data = dat)
## 
## Coefficients:
## (Intercept)           V3           V4           V5  
##     25.9461       0.8431     -53.2751       1.2293
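
The AIC values that step() prints can be reproduced by hand. For linear models, step() relies on extractAIC(), which computes n * log(RSS / n) + 2 * edf, where edf is the number of estimated coefficients. This differs from the value returned by AIC() by an additive constant, so only differences between models matter. A minimal sketch using the RSS values printed above (the helper name step_aic is hypothetical):

```r
# Criterion used by extractAIC() for a Gaussian linear model:
#   AIC = n * log(RSS / n) + 2 * edf
step_aic <- function(rss, n, edf) n * log(rss / n) + 2 * edf

n <- 167  # number of observations, from str(dat)

# Full model: intercept + V1..V5 = 6 coefficients, RSS from the <none> row
aic_full <- step_aic(rss = 1729.3, n = n, edf = 6)

# Final model: intercept + V3, V4, V5 = 4 coefficients
aic_final <- step_aic(rss = 1736.0, n = n, edf = 4)

round(c(full = aic_full, final = aic_final), 2)
```

Both results should agree with the printed 402.36 and 399.01 up to the rounding of the RSS values.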

2. The forward method

The forward method starts from the simplest model (intercept only, no predictors) and adds one variable at a time, choosing at each step the variable that lowers the AIC the most. It stops when no addition lowers the AIC any further. [1]

step(lm(Y~1,data=dat),direction="forward",scope=~V1+V2+V3+V4+V5)
## Start:  AIC=591.5
## Y ~ 1
## 
##        Df Sum of Sq    RSS    AIC
## + V5    1    3566.1 2132.2 429.33
## + V3    1    1298.3 4400.0 550.32
## + V4    1     699.9 4998.4 571.61
## + V2    1     149.3 5548.9 589.06
## <none>              5698.3 591.50
## + V1    1      61.4 5636.8 591.69
## 
## Step:  AIC=429.33
## Y ~ V5
## 
##        Df Sum of Sq    RSS    AIC
## + V4    1    367.47 1764.7 399.74
## + V2    1    366.72 1765.4 399.81
## + V3    1    351.26 1780.9 401.27
## <none>              2132.2 429.33
## + V1    1     17.61 2114.5 429.95
## 
## Step:  AIC=399.74
## Y ~ V5 + V4
## 
##        Df Sum of Sq    RSS    AIC
## + V3    1   28.6446 1736.0 399.01
## <none>              1764.7 399.74
## + V1    1    4.5649 1760.1 401.31
## + V2    1    0.1160 1764.6 401.73
## 
## Step:  AIC=399.01
## Y ~ V5 + V4 + V3
## 
##        Df Sum of Sq    RSS    AIC
## <none>              1736.0 399.01
## + V2    1    3.8460 1732.2 400.64
## + V1    1    2.6564 1733.4 400.75
## 
## Call:
## lm(formula = Y ~ V5 + V4 + V3, data = dat)
## 
## Coefficients:
## (Intercept)           V5           V4           V3  
##     25.9461       1.2293     -53.2751       0.8431
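
A single forward step can be inspected with add1(), which step() calls internally to score the candidate additions. The sketch below assumes the built-in mtcars data as a stand-in for dat, with mpg in the role of Y and an arbitrary choice of three candidate predictors:

```r
# Assumption: mtcars stands in for dat; mpg plays the role of Y
null_fit <- lm(mpg ~ 1, data = mtcars)

# Score each candidate addition; rows are <none> plus one row per candidate
candidates <- add1(null_fit, scope = ~ wt + hp + qsec)
print(candidates)

# The forward step adds the candidate with the lowest AIC;
# if "<none>" had the lowest AIC, the procedure would stop here
best <- rownames(candidates)[which.min(candidates$AIC)]
best
```

Refitting with the chosen variable and calling add1() again on the remaining candidates repeats the process that step(direction = "forward") automates.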

3. The stepwise method

The stepwise method combines the backward and forward procedures: at each step a variable may be either dropped or added. AIC is used as the criterion for selecting the best model; the lower the AIC, the better the model. [1,2]

3.1 Option 1 (base R, stats package)

step(lm(Y~.,data=dat),direction="both")
## Start:  AIC=402.36
## Y ~ V1 + V2 + V3 + V4 + V5
## 
##        Df Sum of Sq    RSS    AIC
## - V1    1     2.910 1732.2 400.64
## - V2    1     4.100 1733.4 400.75
## <none>              1729.3 402.36
## - V4    1    27.098 1756.4 402.95
## - V3    1    30.802 1760.1 403.30
## - V5    1    34.131 1763.4 403.62
## 
## Step:  AIC=400.64
## Y ~ V2 + V3 + V4 + V5
## 
##        Df Sum of Sq    RSS    AIC
## - V2    1     3.846 1736.0 399.01
## <none>              1732.2 400.64
## - V4    1    27.202 1759.4 401.24
## - V3    1    32.375 1764.6 401.73
## - V5    1    32.398 1764.6 401.73
## + V1    1     2.910 1729.3 402.36
## 
## Step:  AIC=399.01
## Y ~ V3 + V4 + V5
## 
##        Df Sum of Sq    RSS    AIC
## <none>              1736.0 399.01
## - V3    1    28.645 1764.7 399.74
## + V2    1     3.846 1732.2 400.64
## + V1    1     2.656 1733.4 400.75
## - V4    1    44.860 1780.9 401.27
## - V5    1    74.856 1810.9 404.06
## 
## Call:
## lm(formula = Y ~ V3 + V4 + V5, data = dat)
## 
## Coefficients:
## (Intercept)           V3           V4           V5  
##     25.9461       0.8431     -53.2751       1.2293

3.2 Option 2 (MASS package)

library(MASS)
stepAIC(lm(Y~.,data=dat),direction="both")
## Start:  AIC=402.36
## Y ~ V1 + V2 + V3 + V4 + V5
## 
##        Df Sum of Sq    RSS    AIC
## - V1    1     2.910 1732.2 400.64
## - V2    1     4.100 1733.4 400.75
## <none>              1729.3 402.36
## - V4    1    27.098 1756.4 402.95
## - V3    1    30.802 1760.1 403.30
## - V5    1    34.131 1763.4 403.62
## 
## Step:  AIC=400.64
## Y ~ V2 + V3 + V4 + V5
## 
##        Df Sum of Sq    RSS    AIC
## - V2    1     3.846 1736.0 399.01
## <none>              1732.2 400.64
## - V4    1    27.202 1759.4 401.24
## - V3    1    32.375 1764.6 401.73
## - V5    1    32.398 1764.6 401.73
## + V1    1     2.910 1729.3 402.36
## 
## Step:  AIC=399.01
## Y ~ V3 + V4 + V5
## 
##        Df Sum of Sq    RSS    AIC
## <none>              1736.0 399.01
## - V3    1    28.645 1764.7 399.74
## + V2    1     3.846 1732.2 400.64
## + V1    1     2.656 1733.4 400.75
## - V4    1    44.860 1780.9 401.27
## - V5    1    74.856 1810.9 404.06
## 
## Call:
## lm(formula = Y ~ V3 + V4 + V5, data = dat)
## 
## Coefficients:
## (Intercept)           V3           V4           V5  
##     25.9461       0.8431     -53.2751       1.2293
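
In practice the selected model is usually stored rather than printed. Both functions accept trace = 0 to suppress the step-by-step log and return the final lm object. A small sketch, assuming mtcars and four arbitrary predictors as stand-in data, that checks whether the two functions arrive at the same terms:

```r
library(MASS)  # provides stepAIC(); step() comes from the stats package

# Assumption: mtcars and these four predictors are illustrative stand-ins
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# trace = 0 silences the per-step tables; the fitted model is returned
fit_step    <- step(full, direction = "both", trace = 0)
fit_stepAIC <- stepAIC(full, direction = "both", trace = 0)

# Compare the predictors retained by each function
terms_step    <- attr(terms(fit_step), "term.labels")
terms_stepAIC <- attr(terms(fit_stepAIC), "term.labels")
setequal(terms_step, terms_stepAIC)
```

The stored object can then be passed to summary(), predict(), and other lm methods as usual.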

4. Problems with the stepwise method

  1. It yields $R^2$ values that are biased high.
  2. The ordinary F and $\chi^2$ test statistics do not have the claimed distribution. Variable selection is based on methods (e.g., F tests for nested models) that were intended to be used to test only prespecified hypotheses.
  3. The method yields standard errors of regression coefficient estimates that are biased low and confidence intervals for effects and predicted values that are falsely narrow.
  4. It yields P-values that are too small (i.e., there are severe multiple comparison problems) and that do not have the proper meaning, and the proper correction for them is a very difficult problem.
  5. It provides regression coefficients that are biased high in absolute value and need shrinkage. Even if only a single predictor were being analyzed and one only reported the regression coefficient for that predictor if its association with Y were “statistically significant,” the estimate of the regression coefficient $\hat{\beta}$ is biased (too large in absolute value). To put this in symbols for the case where we obtain a positive association ($\hat{\beta} > 0$): $E(\hat{\beta} \mid P < 0.05, \hat{\beta} > 0) > \beta$.
  6. In observational studies, variable selection to determine confounders for adjustment results in residual confounding.
  7. Rather than solving problems caused by collinearity, variable selection is made arbitrary by collinearity.
  8. It allows us to not think about the problem. [3]
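
Item 5 can be checked with a small simulation. The sketch below assumes a single predictor with true slope 0.2, n = 30, and standard normal noise (all illustrative choices, not taken from the source). It compares the average estimate over all fits with the average over only those fits that are significant and positive:

```r
set.seed(1)
true_beta <- 0.2   # assumed true slope
n         <- 30    # assumed sample size per simulated study
n_sim     <- 2000  # number of simulated studies

# Fit one simple regression and return the slope estimate and its p-value
sim_one <- function() {
  x  <- rnorm(n)
  y  <- true_beta * x + rnorm(n)
  co <- summary(lm(y ~ x))$coefficients
  c(estimate = unname(co["x", "Estimate"]),
    p        = unname(co["x", "Pr(>|t|)"]))
}

res <- t(replicate(n_sim, sim_one()))

# Keep only the results that would be reported under a significance filter
selected <- res[, "p"] < 0.05 & res[, "estimate"] > 0

mean_all      <- mean(res[, "estimate"])            # close to true_beta
mean_selected <- mean(res[selected, "estimate"])    # inflated by selection
c(all = mean_all, selected = mean_selected, truth = true_beta)
```

The unfiltered mean should sit near the true slope, while the mean of the filtered estimates exceeds it, which is the inequality stated in item 5.
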

Shortcomings of stepwise multiple regression:

  1. Bias in parameter estimation and inconsistencies among model selection algorithms
  2. An inherent (but often overlooked) problem of multiple hypothesis testing
  3. An inappropriate focus or reliance on a single best model [4]
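
The multiple-testing problem shows up even when no predictor is related to the response. A minimal sketch, assuming 20 pure-noise predictors and 50 observations (arbitrary illustrative values): any term that backward selection retains here is a false positive, and any $R^2$ above zero is spurious.

```r
set.seed(42)
n <- 50  # assumed number of observations
p <- 20  # assumed number of noise predictors

# Predictors and response are all independent standard normal draws
noise <- as.data.frame(matrix(rnorm(n * p), nrow = n))
names(noise) <- paste0("X", seq_len(p))
noise$Y <- rnorm(n)

full   <- lm(Y ~ ., data = noise)
picked <- step(full, direction = "backward", trace = 0)

# Predictors that survived selection despite having no true effect
kept <- attr(terms(picked), "term.labels")
length(kept)
summary(picked)$r.squared
```

With this many candidates, AIC-based selection will usually keep a few noise variables. Rerunning with different seeds shows that the retained set changes from run to run, which illustrates why relying on a single best model is fragile.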

References:

  1. https://www.youtube.com/watch?v=TzhgPXrFSm8
  2. Kuhn M, Johnson K. Applied Predictive Modeling.
  3. Harrell FE. Regression Modeling Strategies (page 68).
  4. https://www.ncbi.nlm.nih.gov/pubmed/16922854