Model Building

Forward Selection

library(alr3)
## Warning: package 'alr3' was built under R version 3.4.3
## Loading required package: car
## Warning: package 'car' was built under R version 3.4.3
data(water)
attach(water)
head(water)
##   Year APMAM APSAB APSLAKE OPBPC  OPRC OPSLAKE  BSAAM
## 1 1948  9.13  3.58    3.91  4.10  7.43    6.47  54235
## 2 1949  5.28  4.82    5.20  7.55 11.11   10.26  67567
## 3 1950  4.20  3.77    3.67  9.52 12.20   11.35  66161
## 4 1951  4.60  4.46    3.93 11.14 15.15   11.13  68094
## 5 1952  7.15  4.99    4.88 16.34 20.05   22.81 107080
## 6 1953  9.70  5.65    4.91  8.88  8.15    7.41  67594

When choosing which predictors should and shouldn’t be included in a model, start with making a simple linear model for each possible predictor.

mod1 <- lm(BSAAM ~ APMAM)
mod2 <- lm(BSAAM ~ APSAB)
mod3 <- lm(BSAAM ~ APSLAKE)
mod4 <- lm(BSAAM ~ OPBPC)
mod5 <- lm(BSAAM ~ OPRC)
mod6 <- lm(BSAAM ~ OPSLAKE)
summary(mod1)
## 
## Call:
## lm(formula = BSAAM ~ APMAM)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -37043 -16339  -5457  17158  72467 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    63364       9917   6.389 1.21e-07 ***
## APMAM           1965       1249   1.573    0.123    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25080 on 41 degrees of freedom
## Multiple R-squared:  0.05692,    Adjusted R-squared:  0.03391 
## F-statistic: 2.474 on 1 and 41 DF,  p-value: 0.1234
summary(mod2)
## 
## Call:
## lm(formula = BSAAM ~ APSAB)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -41314 -16784  -5101  16492  70942 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    67152       9689   6.931 2.06e-08 ***
## APSAB           2279       1909   1.194    0.239    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25390 on 41 degrees of freedom
## Multiple R-squared:  0.0336, Adjusted R-squared:  0.01003 
## F-statistic: 1.425 on 1 and 41 DF,  p-value: 0.2394
summary(mod3)
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -46438 -16907  -5661  19028  69464 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    63864       9249   6.905 2.25e-08 ***
## APSLAKE         2818       1709   1.649    0.107    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25010 on 41 degrees of freedom
## Multiple R-squared:  0.06217,    Adjusted R-squared:  0.0393 
## F-statistic: 2.718 on 1 and 41 DF,  p-value: 0.1069
summary(mod4)
## 
## Call:
## lm(formula = BSAAM ~ OPBPC)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -21183  -7298   -819   4731  38430 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  40017.4     3589.1   11.15 5.47e-14 ***
## OPBPC         2940.1      240.6   12.22 3.00e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11990 on 41 degrees of freedom
## Multiple R-squared:  0.7845, Adjusted R-squared:  0.7793 
## F-statistic: 149.3 on 1 and 41 DF,  p-value: 2.996e-15
summary(mod5)
## 
## Call:
## lm(formula = BSAAM ~ OPRC)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24356  -5514   -522   7448  24854 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  21741.4     4044.1   5.376 3.32e-06 ***
## OPRC          4667.3      311.3  14.991  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10150 on 41 degrees of freedom
## Multiple R-squared:  0.8457, Adjusted R-squared:  0.842 
## F-statistic: 224.7 on 1 and 41 DF,  p-value: < 2.2e-16
summary(mod6)
## 
## Call:
## lm(formula = BSAAM ~ OPSLAKE)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17603.8  -5338.0    332.1   3410.6  20875.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  27014.6     3218.9   8.393 1.93e-10 ***
## OPSLAKE       3752.5      215.7  17.394  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8922 on 41 degrees of freedom
## Multiple R-squared:  0.8807, Adjusted R-squared:  0.8778 
## F-statistic: 302.6 on 1 and 41 DF,  p-value: < 2.2e-16

From the summary of each of the models, we can immediatly eliminate APSLAKE, APMAM, APSAB since they have large p-values. These large p-values tell us that there is no linear relationship between the predictor and the response, so adding them to a model should not make the model any better.

Next, we look at which predictor had the largest p-value. In this case, it is OPSLAKE. So, we know that OPSLAKE will be in our final model. Then, we create a model with OPSLAKE and one of each of the remaining possible predictors.

MOD1 <- lm(BSAAM ~ OPSLAKE)
MOD2.1 <- lm(BSAAM ~ OPRC + OPSLAKE)
MOD2.2 <- lm(BSAAM ~ OPSLAKE + OPBPC)
summary(MOD2.1)
## 
## Call:
## lm(formula = BSAAM ~ OPRC + OPSLAKE)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15991.2  -6484.6   -498.3   4700.1  19945.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22891.2     3277.8   6.984 1.98e-08 ***
## OPRC          1866.5      638.8   2.922   0.0057 ** 
## OPSLAKE       2400.8      503.3   4.770 2.46e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8201 on 40 degrees of freedom
## Multiple R-squared:  0.9017, Adjusted R-squared:  0.8967 
## F-statistic: 183.4 on 2 and 40 DF,  p-value: < 2.2e-16
summary(MOD2.2)
## 
## Call:
## lm(formula = BSAAM ~ OPSLAKE + OPBPC)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17591.0  -5276.6    275.6   3380.7  20867.0 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 27050.95    3540.07   7.641 2.44e-09 ***
## OPSLAKE      3736.16     658.24   5.676 1.35e-06 ***
## OPBPC          14.37     546.41   0.026    0.979    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9033 on 40 degrees of freedom
## Multiple R-squared:  0.8807, Adjusted R-squared:  0.8747 
## F-statistic: 147.6 on 2 and 40 DF,  p-value: < 2.2e-16

OPRC’s p-value was significant, and OPBPC’s was not, so we know that we should keep OPBPC and not OPRC.

So, our best option is to use OPSLAKE and OPRC to predict BSAAM.

Backward Elimination

We can also use the backward elimination method to determine which predictors we need in the model. We start with the full model, and gradually remove predictors that are unneeded.

FullMod <- lm(BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE)
summary(FullMod)
## 
## Call:
## lm(formula = BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + 
##     OPSLAKE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12690  -4936  -1424   4173  18542 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15944.67    4099.80   3.889 0.000416 ***
## APMAM         -12.77     708.89  -0.018 0.985725    
## APSAB        -664.41    1522.89  -0.436 0.665237    
## APSLAKE      2270.68    1341.29   1.693 0.099112 .  
## OPBPC          69.70     461.69   0.151 0.880839    
## OPRC         1916.45     641.36   2.988 0.005031 ** 
## OPSLAKE      2211.58     752.69   2.938 0.005729 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7557 on 36 degrees of freedom
## Multiple R-squared:  0.9248, Adjusted R-squared:  0.9123 
## F-statistic: 73.82 on 6 and 36 DF,  p-value: < 2.2e-16

We can remove APSAB, APMAM, APSLAKE because they have large p-values in the SLRs.

Mod_1 <- lm(BSAAM ~ OPBPC + OPRC + OPSLAKE)
summary(Mod_1)
## 
## Call:
## lm(formula = BSAAM ~ OPBPC + OPRC + OPSLAKE)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15964.1  -6491.8   -404.4   4741.9  19921.2 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 22991.85    3545.32   6.485  1.1e-07 ***
## OPBPC          40.61     502.40   0.081  0.93599    
## OPRC         1867.46     647.04   2.886  0.00633 ** 
## OPSLAKE      2353.96     771.71   3.050  0.00410 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8304 on 39 degrees of freedom
## Multiple R-squared:  0.9017, Adjusted R-squared:  0.8941 
## F-statistic: 119.2 on 3 and 39 DF,  p-value: < 2.2e-16

We see that OPBPC has a large p-value, so we remove it from the next model:

Mod_2 <- lm(BSAAM ~OPRC + OPSLAKE)
summary(Mod_2)
## 
## Call:
## lm(formula = BSAAM ~ OPRC + OPSLAKE)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15991.2  -6484.6   -498.3   4700.1  19945.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22891.2     3277.8   6.984 1.98e-08 ***
## OPRC          1866.5      638.8   2.922   0.0057 ** 
## OPSLAKE       2400.8      503.3   4.770 2.46e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8201 on 40 degrees of freedom
## Multiple R-squared:  0.9017, Adjusted R-squared:  0.8967 
## F-statistic: 183.4 on 2 and 40 DF,  p-value: < 2.2e-16

Both OPRC and OPSLAKE have significant p-values, so we keep both. So, like we found using the first method, OPRC and OPSLAKE are the best to predict water runoff.

Step AIC

We can also use the StepAIC package in R to do this. We start with backward elimination:

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:alr3':
## 
##     forbes
?StepAIC
## No documentation for 'StepAIC' in specified packages and libraries:
## you could try '??StepAIC'
stepAIC(FullMod)
## Start:  AIC=774.36
## BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## - APMAM    1     18537 2055849271 772.36
## - OPBPC    1   1301629 2057132362 772.39
## - APSAB    1  10869771 2066700504 772.58
## <none>                 2055830733 774.36
## - APSLAKE  1 163662571 2219493304 775.65
## - OPSLAKE  1 493012936 2548843669 781.60
## - OPRC     1 509894399 2565725132 781.89
## 
## Step:  AIC=772.36
## BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## - OPBPC    1   1284108 2057133378 770.39
## - APSAB    1  12514566 2068363837 770.62
## <none>                 2055849271 772.36
## - APSLAKE  1 176735690 2232584961 773.90
## - OPSLAKE  1 496370866 2552220136 779.66
## - OPRC     1 511413723 2567262994 779.91
## 
## Step:  AIC=770.39
## BSAAM ~ APSAB + APSLAKE + OPRC + OPSLAKE
## 
##           Df  Sum of Sq        RSS    AIC
## - APSAB    1   11814207 2068947585 768.63
## <none>                  2057133378 770.39
## - APSLAKE  1  175480984 2232614362 771.91
## - OPRC     1  510159318 2567292697 777.91
## - OPSLAKE  1 1165227857 3222361235 787.68
## 
## Step:  AIC=768.63
## BSAAM ~ APSLAKE + OPRC + OPSLAKE
## 
##           Df  Sum of Sq        RSS    AIC
## <none>                  2068947585 768.63
## - OPRC     1  531694203 2600641788 776.47
## - APSLAKE  1  621012173 2689959758 777.92
## - OPSLAKE  1 1515918540 3584866125 790.27
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE)
## 
## Coefficients:
## (Intercept)      APSLAKE         OPRC      OPSLAKE  
##       15425         1712         1797         2390

Since this method uses AIC and we were using t-tests earlier, we get a slightly different model.

Using the same command, we can also use forward selection to determine which predictors are needed in our model.

simpleMod <- lm(BSAAM ~ 1)
stepAIC(simpleMod, direction = "forward", scope = list(upper = FullMod))
## Start:  AIC=873.65
## BSAAM ~ 1
## 
##           Df  Sum of Sq        RSS    AIC
## + OPSLAKE  1 2.4087e+10 3.2640e+09 784.24
## + OPRC     1 2.3131e+10 4.2199e+09 795.28
## + OPBPC    1 2.1458e+10 5.8928e+09 809.64
## + APSLAKE  1 1.7004e+09 2.5651e+10 872.89
## + APMAM    1 1.5567e+09 2.5794e+10 873.13
## <none>                  2.7351e+10 873.65
## + APSAB    1 9.1891e+08 2.6432e+10 874.18
## 
## Step:  AIC=784.24
## BSAAM ~ OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## + APSLAKE  1 663368666 2600641788 776.47
## + APSAB    1 661988129 2602022326 776.49
## + OPRC     1 574050696 2689959758 777.92
## + APMAM    1 524283532 2739726922 778.71
## <none>                 3264010454 784.24
## + OPBPC    1     56424 3263954031 786.24
## 
## Step:  AIC=776.47
## BSAAM ~ OPSLAKE + APSLAKE
## 
##         Df Sum of Sq        RSS    AIC
## + OPRC   1 531694203 2068947585 768.63
## <none>               2600641788 776.47
## + APSAB  1  33349091 2567292697 777.91
## + APMAM  1  11041158 2589600630 778.28
## + OPBPC  1    122447 2600519341 778.46
## 
## Step:  AIC=768.63
## BSAAM ~ OPSLAKE + APSLAKE + OPRC
## 
##         Df Sum of Sq        RSS    AIC
## <none>               2068947585 768.63
## + APSAB  1  11814207 2057133378 770.39
## + APMAM  1   1410311 2067537274 770.60
## + OPBPC  1    583748 2068363837 770.62
## 
## Call:
## lm(formula = BSAAM ~ OPSLAKE + APSLAKE + OPRC)
## 
## Coefficients:
## (Intercept)      OPSLAKE      APSLAKE         OPRC  
##       15425         2390         1712         1797