Model building is basically adding predictors to or elimnating predictors from the model.

Don’t Use

Don’t use R-squared: proportion of variability in response that’s explained by its linear relationship with predictor(s)

Do/Can Do

  1. T-test for adding a predictor - requires a nested model

  2. F-test (partial) - requires a nested model

  3. adjusted R-squared (like R-squared but penalizes for the number of predictors) - don’t actually use this one really

  4. AIC (Akaike Information Criteria) - only good for comparing models - each AIC doesn’t mean anything on its own AIC = 2(k+1) - 2*log likelihood A smaller AIC is better

  5. Mallow’s C_p \[Sum({y_i} - \hat{y_i})^2/MSE(predictors) \] The numerator is just using a subset of the predictors while the denominator is using all the predictors. Smaller is better, ideally about k+1.

Three Different Methods

To show each method, I will use the dataset of water.

library(alr3)
## Loading required package: car
data(water)
attach(water)
names(water)
## [1] "Year"    "APMAM"   "APSAB"   "APSLAKE" "OPBPC"   "OPRC"    "OPSLAKE"
## [8] "BSAAM"

The response is BSAAM. The predictors are APMAM, APSAB, APSLAKE, OPBPC, OPRC, OPSLAKE.

Forward Selection

Start simple, add more predictors using tests/criteria

Below is an example of forward selection using t-test with a significance level of 0.05.

mod1 = lm(BSAAM ~ APSLAKE + OPSLAKE)
summary(mod1)
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPSLAKE)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13335.8  -5893.2   -171.8   4219.5  19500.2 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  19144.9     3812.0   5.022  1.1e-05 ***
## APSLAKE       1768.8      553.7   3.194  0.00273 ** 
## OPSLAKE       3689.5      196.0  18.829  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8063 on 40 degrees of freedom
## Multiple R-squared:  0.9049, Adjusted R-squared:  0.9002 
## F-statistic: 190.3 on 2 and 40 DF,  p-value: < 2.2e-16

I picked two predictors to start with. I chose these two because I already knew, from working with this data before, that these two did help to predict the response.

mod2 = lm(BSAAM ~ APSLAKE + OPSLAKE + APMAM)
summary(mod2)
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPSLAKE + APMAM)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13849.2  -5728.4   -155.1   4084.6  19668.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  18671.1     4023.8   4.640 3.87e-05 ***
## APSLAKE       1448.9      963.6   1.504    0.141    
## OPSLAKE       3685.9      198.2  18.595  < 2e-16 ***
## APMAM          286.7      703.0   0.408    0.686    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8149 on 39 degrees of freedom
## Multiple R-squared:  0.9053, Adjusted R-squared:  0.898 
## F-statistic: 124.3 on 3 and 39 DF,  p-value: < 2.2e-16

Looking at the p-value for APMAM, we can see that it is over 0.05.

mod2 = lm(BSAAM ~ APSLAKE + OPSLAKE + APSAB)
summary(mod2)
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPSLAKE + APSAB)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13631.2  -5439.5   -331.7   4767.5  19501.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  18279.9     4023.7   4.543 5.23e-05 ***
## APSLAKE        938.8     1292.4   0.726    0.472    
## OPSLAKE       3709.5      199.2  18.626  < 2e-16 ***
## APSAB         1007.6     1415.6   0.712    0.481    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8113 on 39 degrees of freedom
## Multiple R-squared:  0.9061, Adjusted R-squared:  0.8989 
## F-statistic: 125.5 on 3 and 39 DF,  p-value: < 2.2e-16

Looking at the p-value for APSAB, we can see that it is over 0.05.

mod2 = lm(BSAAM ~ APSLAKE + OPSLAKE + OPBPC)
summary(mod2)
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPSLAKE + OPBPC)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13321.6  -5876.3   -166.9   4198.4  19487.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 19198.01    4054.50   4.735 2.88e-05 ***
## APSLAKE      1768.90     560.79   3.154  0.00309 ** 
## OPSLAKE      3665.47     595.45   6.156 3.15e-07 ***
## OPBPC          21.17     493.94   0.043  0.96604    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8166 on 39 degrees of freedom
## Multiple R-squared:  0.9049, Adjusted R-squared:  0.8976 
## F-statistic: 123.7 on 3 and 39 DF,  p-value: < 2.2e-16

Looking at the p-value for OPBPC, we can see that it is over 0.05.

mod2 = lm(BSAAM ~ APSLAKE + OPSLAKE + OPRC)
summary(mod2)
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPSLAKE + OPRC)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12964  -5140  -1252   4446  18649 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15424.6     3638.4   4.239 0.000133 ***
## APSLAKE       1712.5      500.5   3.421 0.001475 ** 
## OPSLAKE       2389.8      447.1   5.346 4.19e-06 ***
## OPRC          1797.5      567.8   3.166 0.002998 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7284 on 39 degrees of freedom
## Multiple R-squared:  0.9244, Adjusted R-squared:  0.9185 
## F-statistic: 158.9 on 3 and 39 DF,  p-value: < 2.2e-16

Looking at the p-value for OPRC, we can see that it is less than 0.05.

This means that our final model should be: lm(BSAAM ~ APSLAKE + OPSLAKE + OPBPC)

Backwards Elimination

Star with complicated model, pare it down with tests/criteria

Below is an example of backwards selection using partial F-test with a significance level of 0.05.

mod3 = lm(BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE)
summary(mod3)
## 
## Call:
## lm(formula = BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + 
##     OPSLAKE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12690  -4936  -1424   4173  18542 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15944.67    4099.80   3.889 0.000416 ***
## APMAM         -12.77     708.89  -0.018 0.985725    
## APSAB        -664.41    1522.89  -0.436 0.665237    
## APSLAKE      2270.68    1341.29   1.693 0.099112 .  
## OPBPC          69.70     461.69   0.151 0.880839    
## OPRC         1916.45     641.36   2.988 0.005031 ** 
## OPSLAKE      2211.58     752.69   2.938 0.005729 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7557 on 36 degrees of freedom
## Multiple R-squared:  0.9248, Adjusted R-squared:  0.9123 
## F-statistic: 73.82 on 6 and 36 DF,  p-value: < 2.2e-16

This is a model of all of the predictors. We’ll take away one at a time until we’ve got the right model.

mod4 = lm(BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE)
summary(mod4)
## 
## Call:
## lm(formula = BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12696  -4933  -1396   4187  18550 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15930.84    3972.50   4.010 0.000283 ***
## APSAB        -673.42    1418.96  -0.475 0.637873    
## APSLAKE      2263.86    1269.35   1.783 0.082714 .  
## OPBPC          68.94     453.50   0.152 0.879996    
## OPRC         1915.75     631.46   3.034 0.004399 ** 
## OPSLAKE      2212.62     740.28   2.989 0.004952 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7454 on 37 degrees of freedom
## Multiple R-squared:  0.9248, Adjusted R-squared:  0.9147 
## F-statistic: 91.05 on 5 and 37 DF,  p-value: < 2.2e-16
anova(mod3, mod4)
## Analysis of Variance Table
## 
## Model 1: BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
## Model 2: BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
##   Res.Df        RSS Df Sum of Sq     F Pr(>F)
## 1     36 2055830733                          
## 2     37 2055849271 -1    -18538 3e-04 0.9857

The p-value is high, so APMAM should be taken out.

mod5 = lm(BSAAM ~ APSLAKE + OPBPC + OPRC + OPSLAKE)
summary(mod5)
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPBPC + OPRC + OPSLAKE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12932  -5010  -1181   4412  18621 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15538.35    3845.65   4.041  0.00025 ***
## APSLAKE      1712.68     506.99   3.378  0.00170 ** 
## OPBPC          46.22     446.34   0.104  0.91806    
## OPRC         1798.53     575.20   3.127  0.00338 ** 
## OPSLAKE      2336.53     685.62   3.408  0.00156 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7378 on 38 degrees of freedom
## Multiple R-squared:  0.9244, Adjusted R-squared:  0.9164 
## F-statistic: 116.1 on 4 and 38 DF,  p-value: < 2.2e-16
anova(mod5, mod4)
## Analysis of Variance Table
## 
## Model 1: BSAAM ~ APSLAKE + OPBPC + OPRC + OPSLAKE
## Model 2: BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
##   Res.Df        RSS Df Sum of Sq      F Pr(>F)
## 1     38 2068363837                           
## 2     37 2055849271  1  12514566 0.2252 0.6379

The p-value is high, so APSAB should be taken out.

mod6 = lm(BSAAM ~ APSLAKE + OPRC + OPSLAKE)
summary(mod6)
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12964  -5140  -1252   4446  18649 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15424.6     3638.4   4.239 0.000133 ***
## APSLAKE       1712.5      500.5   3.421 0.001475 ** 
## OPRC          1797.5      567.8   3.166 0.002998 ** 
## OPSLAKE       2389.8      447.1   5.346 4.19e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7284 on 39 degrees of freedom
## Multiple R-squared:  0.9244, Adjusted R-squared:  0.9185 
## F-statistic: 158.9 on 3 and 39 DF,  p-value: < 2.2e-16
anova(mod6, mod5)
## Analysis of Variance Table
## 
## Model 1: BSAAM ~ APSLAKE + OPRC + OPSLAKE
## Model 2: BSAAM ~ APSLAKE + OPBPC + OPRC + OPSLAKE
##   Res.Df        RSS Df Sum of Sq      F Pr(>F)
## 1     39 2068947585                           
## 2     38 2068363837  1    583748 0.0107 0.9181

The p-value is high, so OPBPC should be taken out.

mod7 = lm(BSAAM ~ APSLAKE + OPSLAKE)
summary(mod7)
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPSLAKE)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13335.8  -5893.2   -171.8   4219.5  19500.2 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  19144.9     3812.0   5.022  1.1e-05 ***
## APSLAKE       1768.8      553.7   3.194  0.00273 ** 
## OPSLAKE       3689.5      196.0  18.829  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8063 on 40 degrees of freedom
## Multiple R-squared:  0.9049, Adjusted R-squared:  0.9002 
## F-statistic: 190.3 on 2 and 40 DF,  p-value: < 2.2e-16
anova(mod7, mod6)
## Analysis of Variance Table
## 
## Model 1: BSAAM ~ APSLAKE + OPSLAKE
## Model 2: BSAAM ~ APSLAKE + OPRC + OPSLAKE
##   Res.Df        RSS Df Sum of Sq      F   Pr(>F)   
## 1     40 2600641788                                
## 2     39 2068947585  1 531694203 10.023 0.002998 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is low, so OPRC should not be taken out.

Therefore, the model should be lm(BSAAM ~ APSLAKE + OPRC + OPSLAKE).

All Subsets Regression

Only works with reasonably small number of predictors Try all of the possible regression models to see which is best

Below is an example of all subsets regression using AIC.

It would take a while to do all of it so I will just do a small section of each.

AIC(lm(BSAAM ~ APMAM))
## [1] 997.1542
AIC(lm(BSAAM ~ APSLAKE))
## [1] 996.9139

The first section is just the one predictor.

AIC(lm(BSAAM ~ APMAM + OPBPC))
## [1] 932.0902
AIC(lm(BSAAM ~ APSLAKE + OPSLAKE))
## [1] 900.4951

The next section is two predictors.

AIC(lm(BSAAM ~ APMAM + OPBPC + APSAB))
## [1] 932.9863
AIC(lm(BSAAM ~ APSLAKE + OPSLAKE + OPRC))
## [1] 892.6603

The next section is three predictors.

AIC(lm(BSAAM ~ APMAM + OPBPC + APSAB + OPRC))
## [1] 910.7397
AIC(lm(BSAAM ~ APSLAKE + OPSLAKE + OPRC + APMAM))
## [1] 894.6309

The next section is four predictors.

AIC(lm(BSAAM ~ APMAM + OPBPC + APSAB + OPRC + OPSLAKE))
## [1] 899.6805
AIC(lm(BSAAM ~ APSLAKE + OPSLAKE + OPRC + APMAM + OPBPC))
## [1] 896.6135

The next section is five predictors.

AIC(lm(BSAAM ~ APMAM + OPBPC + APSAB + OPRC + OPSLAKE + APSLAKE))
## [1] 898.3868

The last section is all the predictors.

We want the model with the smallest AIC. If there are two that are close (within ten), take the simplier model.

The final model is lm(BSAAM ~ APSLAKE + OPSLAKE + OPRC).

Choosing a Criteria for Model Building

If the models will be nested: using tests.

If the models will not be nested: use AIC or Mallow’s C_p