Model building:

Today during class we finalized our discussion on multicollinearity and started talking about model building. Some ways the know if predictors should be added or not is to run t-test, F-test, AIC of Mallow’s \({Cp}\). The one used most often in practice is the AIC or the Akaike Information Criteria. Better models will have a lower AIC score and you would need a difference of 10 or more to justify a more complex model. It is important ot mension that the numbers mean nothing on their own and must be used as a comparison. In this example I used both ANOVA and AIC to justify my model building.

There are three different types of model building:

  1. Forward Selection: starting simple, adding more predictors using test/predictors

  2. Backward Elimination: starting with all the predictors and drop one at a time

  3. All Subsets: Looking at every possible iteration of the model and comparing.

I did a long hand approach to backwards elimination for my example.

library(alr3)
## Loading required package: car
data(water)
summary(water)
##       Year          APMAM            APSAB           APSLAKE     
##  Min.   :1948   Min.   : 2.700   Min.   : 1.450   Min.   : 1.77  
##  1st Qu.:1958   1st Qu.: 4.975   1st Qu.: 3.390   1st Qu.: 3.36  
##  Median :1969   Median : 7.080   Median : 4.460   Median : 4.62  
##  Mean   :1969   Mean   : 7.323   Mean   : 4.652   Mean   : 4.93  
##  3rd Qu.:1980   3rd Qu.: 9.115   3rd Qu.: 5.685   3rd Qu.: 5.83  
##  Max.   :1990   Max.   :18.080   Max.   :11.960   Max.   :13.02  
##      OPBPC             OPRC           OPSLAKE           BSAAM       
##  Min.   : 4.050   Min.   : 4.350   Min.   : 4.600   Min.   : 41785  
##  1st Qu.: 7.975   1st Qu.: 7.875   1st Qu.: 8.705   1st Qu.: 59857  
##  Median : 9.550   Median :11.110   Median :12.140   Median : 69177  
##  Mean   :12.836   Mean   :12.002   Mean   :13.522   Mean   : 77756  
##  3rd Qu.:16.545   3rd Qu.:14.975   3rd Qu.:16.920   3rd Qu.: 92206  
##  Max.   :43.370   Max.   :24.850   Max.   :33.070   Max.   :146345
mod <- lm(BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE + Year, data = water)
summary(mod)
## 
## Call:
## lm(formula = BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + 
##     OPSLAKE + Year, data = water)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12772.4  -5164.9   -360.6   4379.1  16807.6 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -227814.8   197920.2  -1.151  0.25752   
## APMAM           143.4      715.2   0.200  0.84228   
## APSAB          -546.0     1515.1  -0.360  0.72074   
## APSLAKE        1885.0     1368.1   1.378  0.17699   
## OPBPC            76.6      458.4   0.167  0.86827   
## OPRC           2081.5      650.7   3.199  0.00293 **
## OPSLAKE        2055.0      758.1   2.711  0.01033 * 
## Year            123.9      100.6   1.232  0.22621   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7503 on 35 degrees of freedom
## Multiple R-squared:  0.928,  Adjusted R-squared:  0.9136 
## F-statistic:  64.4 on 7 and 35 DF,  p-value: < 2.2e-16
mod2 <- lm(BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE + Year, data = water)
anova(mod, mod2)
## Analysis of Variance Table
## 
## Model 1: BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE + Year
## Model 2: BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE + Year
##   Res.Df        RSS Df Sum of Sq      F Pr(>F)
## 1     35 1970400272                           
## 2     36 1972662466 -1  -2262194 0.0402 0.8423
AIC(mod, mod2)
##      df      AIC
## mod   9 898.5617
## mod2  8 896.6111

Removing the least significant variable first, leaves us with a p value in our ANOVA of 0.9857. This means there is no statistical difference within the two models.

AIC for mod: 898.3868
AIC for mod2: 896.3872
Since these are not more than 10 apart from each other, we can again say that there is no significat difference, but it is lower which is moving in the rigth direction and move on to removing another predictor.

mod3 <- lm(BSAAM ~ APSAB + APSLAKE + OPRC + OPSLAKE + Year, data = water)
anova(mod, mod3)
## Analysis of Variance Table
## 
## Model 1: BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE + Year
## Model 2: BSAAM ~ APSAB + APSLAKE + OPRC + OPSLAKE + Year
##   Res.Df        RSS Df Sum of Sq      F Pr(>F)
## 1     35 1970400272                           
## 2     37 1974597927 -2  -4197655 0.0373 0.9634
AIC(mod, mod3)
##      df      AIC
## mod   9 898.5617
## mod3  7 894.6532

Again the high p value and the lower AIC means that this predictor should be removed.

mod4 <- lm(BSAAM ~ APSLAKE + OPRC + OPSLAKE + Year, data = water)
anova(mod, mod4)
## Analysis of Variance Table
## 
## Model 1: BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE + Year
## Model 2: BSAAM ~ APSLAKE + OPRC + OPSLAKE + Year
##   Res.Df        RSS Df Sum of Sq      F Pr(>F)
## 1     35 1970400272                           
## 2     38 1979541875 -3  -9141602 0.0541 0.9831
AIC(mod, mod4)
##      df      AIC
## mod   9 898.5617
## mod4  6 892.7608

Again the high p value and the lower AIC means that this predictor should be removed.

mod5 <- lm(BSAAM ~ OPRC + OPSLAKE + Year, data = water)
anova(mod, mod5)
## Analysis of Variance Table
## 
## Model 1: BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE + Year
## Model 2: BSAAM ~ OPRC + OPSLAKE + Year
##   Res.Df        RSS Df  Sum of Sq      F  Pr(>F)  
## 1     35 1970400272                               
## 2     39 2503354457 -4 -532954185 2.3667 0.07162 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(mod, mod5)
##      df      AIC
## mod   9 898.5617
## mod5  5 900.8557

As you can see, the AIC increased and the p value fell below .05 which indicates that this predictor needs to stay in the model.

mod6 <- lm(BSAAM ~ APSLAKE + OPSLAKE + Year, data = water)
anova(mod, mod6)
## Analysis of Variance Table
## 
## Model 1: BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE + Year
## Model 2: BSAAM ~ APSLAKE + OPSLAKE + Year
##   Res.Df        RSS Df  Sum of Sq      F  Pr(>F)  
## 1     35 1970400272                               
## 2     39 2593349193 -4 -622948921 2.7663 0.04251 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(mod, mod6)
##      df      AIC
## mod   9 898.5617
## mod6  5 902.3744

As you can see, the AIC increased and the p value fell below .05 which indicates that this predictor needs to stay in the model.

mod7 <- lm(BSAAM ~ APSLAKE + OPRC + OPSLAKE, data = water)
anova(mod, mod7)
## Analysis of Variance Table
## 
## Model 1: BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE + Year
## Model 2: BSAAM ~ APSLAKE + OPRC + OPSLAKE
##   Res.Df        RSS Df Sum of Sq      F Pr(>F)
## 1     35 1970400272                           
## 2     39 2068947585 -4 -98547313 0.4376 0.7805
AIC(mod, mod7)
##      df      AIC
## mod   9 898.5617
## mod7  5 892.6603

Again the high p value and the lower AIC means that this predictor should be removed.

mod8 <- lm(BSAAM ~ APSLAKE + OPRC, data = water)
anova(mod, mod8)
## Analysis of Variance Table
## 
## Model 1: BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE + Year
## Model 2: BSAAM ~ APSLAKE + OPRC
##   Res.Df        RSS Df   Sum of Sq      F   Pr(>F)    
## 1     35 1970400272                                   
## 2     40 3584866125 -5 -1614465853 5.7355 0.000575 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(mod, mod8)
##      df      AIC
## mod   9 898.5617
## mod8  4 914.2965

As you can see, the AIC increased and the p value fell below .05 which indicates that this predictor needs to stay in the model.

This means that our final and best model is mod7 using the predictors, APSLAKE, OPRC and OPSLAKE.

R does not make the user always to this though (Thank goodness!) There is a step command that makes this far faster.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:alr3':
## 
##     forbes
mod <- lm(BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE + Year, data = water)
stepAIC(mod, direction = "backward")
## Start:  AIC=774.53
## BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE + Year
## 
##           Df Sum of Sq        RSS    AIC
## - OPBPC    1   1571591 1971971864 772.57
## - APMAM    1   2262194 1972662466 772.58
## - APSAB    1   7311109 1977711381 772.69
## - Year     1  85430461 2055830733 774.36
## <none>                 1970400272 774.53
## - APSLAKE  1 106880993 2077281265 774.80
## - OPSLAKE  1 413707192 2384107464 780.73
## - OPRC     1 576007855 2546408128 783.56
## 
## Step:  AIC=772.57
## BSAAM ~ APMAM + APSAB + APSLAKE + OPRC + OPSLAKE + Year
## 
##           Df Sum of Sq        RSS    AIC
## - APMAM    1   2626064 1974597927 770.62
## - APSAB    1   6887325 1978859189 770.72
## - Year     1  85160498 2057132362 772.39
## <none>                 1971971864 772.57
## - APSLAKE  1 105315871 2077287734 772.80
## - OPRC     1 574517654 2546489517 781.56
## - OPSLAKE  1 964675516 2936647380 787.69
## 
## Step:  AIC=770.62
## BSAAM ~ APSAB + APSLAKE + OPRC + OPSLAKE + Year
## 
##           Df Sum of Sq        RSS    AIC
## - APSAB    1   4943947 1979541875 768.73
## - Year     1  82535451 2057133378 770.39
## <none>                 1974597927 770.62
## - APSLAKE  1 127432687 2102030614 771.31
## - OPRC     1 575963916 2550561844 779.63
## - OPSLAKE  1 968394770 2942992697 785.78
## 
## Step:  AIC=768.73
## BSAAM ~ APSLAKE + OPRC + OPSLAKE + Year
## 
##           Df  Sum of Sq        RSS    AIC
## - Year     1   89405710 2068947585 768.63
## <none>                  1979541875 768.73
## - APSLAKE  1  523812582 2503354457 776.83
## - OPRC     1  613807319 2593349193 778.35
## - OPSLAKE  1 1175063776 3154605651 786.77
## 
## Step:  AIC=768.63
## BSAAM ~ APSLAKE + OPRC + OPSLAKE
## 
##           Df  Sum of Sq        RSS    AIC
## <none>                  2068947585 768.63
## - OPRC     1  531694203 2600641788 776.47
## - APSLAKE  1  621012173 2689959758 777.92
## - OPSLAKE  1 1515918540 3584866125 790.27
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE, data = water)
## 
## Coefficients:
## (Intercept)      APSLAKE         OPRC      OPSLAKE  
##       15425         1712         1797         2390

This leads us to the exact same model as was shown above!

You can also use this model to go forwards.

mod9 <- lm(BSAAM ~ OPRC, data = water)
stepAIC(mod9, direction = "forward", scope = list(upper = BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE + Year, lower = BSAAM~1))
## Start:  AIC=795.28
## BSAAM ~ OPRC
## 
##           Df  Sum of Sq        RSS    AIC
## + OPSLAKE  1 1529924515 2689959758 777.92
## + OPBPC    1  888721040 3331163233 787.11
## + APSLAKE  1  635018148 3584866125 790.27
## + Year     1  609982013 3609902260 790.57
## + APMAM    1  261271943 3958612330 794.53
## + APSAB    1  205266122 4014618151 795.14
## <none>                  4219884273 795.28
## 
## Step:  AIC=777.92
## BSAAM ~ OPRC + OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## + APSLAKE  1 621012173 2068947585 768.63
## + APSAB    1 457345396 2232614362 771.91
## + APMAM    1 387970039 2301989719 773.22
## + Year     1 186605301 2503354457 776.83
## <none>                 2689959758 777.92
## + OPBPC    1    450573 2689509185 779.91
## 
## Step:  AIC=768.63
## BSAAM ~ OPRC + OPSLAKE + APSLAKE
## 
##         Df Sum of Sq        RSS    AIC
## <none>               2068947585 768.63
## + Year   1  89405710 1979541875 768.73
## + APSAB  1  11814207 2057133378 770.39
## + APMAM  1   1410311 2067537274 770.60
## + OPBPC  1    583748 2068363837 770.62
## 
## Call:
## lm(formula = BSAAM ~ OPRC + OPSLAKE + APSLAKE, data = water)
## 
## Coefficients:
## (Intercept)         OPRC      OPSLAKE      APSLAKE  
##       15425         1797         2390         1712

Again this leads to the exact same model!

This is important because it allows us to be most efficient in our models and only chose predictors that add helpful information.