Today in class we learned how to build a multiple linear regression model. We can think of many possible predictors for a response variable and these methods will help narrow down the list of relevent predictors.

library(alr3)
## Loading required package: car
data(water)

Use forward selection and backward elimination to build a model for runoff. #Forward Selection

names(water)
## [1] "Year"    "APMAM"   "APSAB"   "APSLAKE" "OPBPC"   "OPRC"    "OPSLAKE"
## [8] "BSAAM"
attach(water)
forAPMAM <- lm(BSAAM~APMAM)
forAPSAB <- lm(BSAAM~APSAB)
forAPSLAKE <- lm(BSAAM~APSLAKE)
forOPBPC <- lm(BSAAM~OPBPC)
forOPRC <- lm(BSAAM~OPRC)
forOPSLAKE <- lm(BSAAM~OPSLAKE)
summary(forAPMAM)
## 
## Call:
## lm(formula = BSAAM ~ APMAM)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -37043 -16339  -5457  17158  72467 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    63364       9917   6.389 1.21e-07 ***
## APMAM           1965       1249   1.573    0.123    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25080 on 41 degrees of freedom
## Multiple R-squared:  0.05692,    Adjusted R-squared:  0.03391 
## F-statistic: 2.474 on 1 and 41 DF,  p-value: 0.1234
summary(forAPSAB)
## 
## Call:
## lm(formula = BSAAM ~ APSAB)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -41314 -16784  -5101  16492  70942 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    67152       9689   6.931 2.06e-08 ***
## APSAB           2279       1909   1.194    0.239    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25390 on 41 degrees of freedom
## Multiple R-squared:  0.0336, Adjusted R-squared:  0.01003 
## F-statistic: 1.425 on 1 and 41 DF,  p-value: 0.2394
summary(forAPSLAKE)
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -46438 -16907  -5661  19028  69464 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    63864       9249   6.905 2.25e-08 ***
## APSLAKE         2818       1709   1.649    0.107    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25010 on 41 degrees of freedom
## Multiple R-squared:  0.06217,    Adjusted R-squared:  0.0393 
## F-statistic: 2.718 on 1 and 41 DF,  p-value: 0.1069
summary(forOPBPC)
## 
## Call:
## lm(formula = BSAAM ~ OPBPC)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -21183  -7298   -819   4731  38430 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  40017.4     3589.1   11.15 5.47e-14 ***
## OPBPC         2940.1      240.6   12.22 3.00e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11990 on 41 degrees of freedom
## Multiple R-squared:  0.7845, Adjusted R-squared:  0.7793 
## F-statistic: 149.3 on 1 and 41 DF,  p-value: 2.996e-15
summary(forOPRC)
## 
## Call:
## lm(formula = BSAAM ~ OPRC)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24356  -5514   -522   7448  24854 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  21741.4     4044.1   5.376 3.32e-06 ***
## OPRC          4667.3      311.3  14.991  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10150 on 41 degrees of freedom
## Multiple R-squared:  0.8457, Adjusted R-squared:  0.842 
## F-statistic: 224.7 on 1 and 41 DF,  p-value: < 2.2e-16
summary(forOPSLAKE)
## 
## Call:
## lm(formula = BSAAM ~ OPSLAKE)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17603.8  -5338.0    332.1   3410.6  20875.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  27014.6     3218.9   8.393 1.93e-10 ***
## OPSLAKE       3752.5      215.7  17.394  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8922 on 41 degrees of freedom
## Multiple R-squared:  0.8807, Adjusted R-squared:  0.8778 
## F-statistic: 302.6 on 1 and 41 DF,  p-value: < 2.2e-16

We can eliminate all of the variables that start with the letter “A” because they had large Single Linear Regression p-values. I now need to move forward with two options:

forOPRC2 <- lm(BSAAM~OPSLAKE+OPRC)
summary(forOPRC2)
## 
## Call:
## lm(formula = BSAAM ~ OPSLAKE + OPRC)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15991.2  -6484.6   -498.3   4700.1  19945.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22891.2     3277.8   6.984 1.98e-08 ***
## OPSLAKE       2400.8      503.3   4.770 2.46e-05 ***
## OPRC          1866.5      638.8   2.922   0.0057 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8201 on 40 degrees of freedom
## Multiple R-squared:  0.9017, Adjusted R-squared:  0.8967 
## F-statistic: 183.4 on 2 and 40 DF,  p-value: < 2.2e-16
forOPBPC2 <- lm(BSAAM~OPSLAKE+OPBPC)
summary(forOPBPC2)
## 
## Call:
## lm(formula = BSAAM ~ OPSLAKE + OPBPC)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17591.0  -5276.6    275.6   3380.7  20867.0 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 27050.95    3540.07   7.641 2.44e-09 ***
## OPSLAKE      3736.16     658.24   5.676 1.35e-06 ***
## OPBPC          14.37     546.41   0.026    0.979    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9033 on 40 degrees of freedom
## Multiple R-squared:  0.8807, Adjusted R-squared:  0.8747 
## F-statistic: 147.6 on 2 and 40 DF,  p-value: < 2.2e-16

OPBPC has a large p-value here, so it is eliminated from consideration. OPRC can be added to the model. Now no more variables remain so the final model is…

forward <- lm(BSAAM~OPSLAKE+OPRC)

Backward Elimination

backward6 <- lm(BSAAM~OPSLAKE+OPRC+OPBPC+APMAM+APSAB+APSLAKE)
summary(backward6)
## 
## Call:
## lm(formula = BSAAM ~ OPSLAKE + OPRC + OPBPC + APMAM + APSAB + 
##     APSLAKE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12690  -4936  -1424   4173  18542 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15944.67    4099.80   3.889 0.000416 ***
## OPSLAKE      2211.58     752.69   2.938 0.005729 ** 
## OPRC         1916.45     641.36   2.988 0.005031 ** 
## OPBPC          69.70     461.69   0.151 0.880839    
## APMAM         -12.77     708.89  -0.018 0.985725    
## APSAB        -664.41    1522.89  -0.436 0.665237    
## APSLAKE      2270.68    1341.29   1.693 0.099112 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7557 on 36 degrees of freedom
## Multiple R-squared:  0.9248, Adjusted R-squared:  0.9123 
## F-statistic: 73.82 on 6 and 36 DF,  p-value: < 2.2e-16

Now we drop variables until the highest p-value is below the significance level.

backward5 <- lm(BSAAM~OPSLAKE+OPRC+OPBPC+APSAB+APSLAKE)
summary(backward5)
## 
## Call:
## lm(formula = BSAAM ~ OPSLAKE + OPRC + OPBPC + APSAB + APSLAKE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12696  -4933  -1396   4187  18550 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15930.84    3972.50   4.010 0.000283 ***
## OPSLAKE      2212.62     740.28   2.989 0.004952 ** 
## OPRC         1915.75     631.46   3.034 0.004399 ** 
## OPBPC          68.94     453.50   0.152 0.879996    
## APSAB        -673.42    1418.96  -0.475 0.637873    
## APSLAKE      2263.86    1269.35   1.783 0.082714 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7454 on 37 degrees of freedom
## Multiple R-squared:  0.9248, Adjusted R-squared:  0.9147 
## F-statistic: 91.05 on 5 and 37 DF,  p-value: < 2.2e-16
backward4 <- lm(BSAAM~OPSLAKE+OPRC+APSAB+APSLAKE)
summary(backward4)
## 
## Call:
## lm(formula = BSAAM ~ OPSLAKE + OPRC + APSAB + APSLAKE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12750  -5095  -1494   4245  18594 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15749.8     3740.8   4.210 0.000151 ***
## OPSLAKE       2295.4      494.8   4.639 4.07e-05 ***
## OPRC          1910.2      622.3   3.070 0.003942 ** 
## APSAB         -650.6     1392.8  -0.467 0.643055    
## APSLAKE       2244.9     1246.9   1.800 0.079735 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7358 on 38 degrees of freedom
## Multiple R-squared:  0.9248, Adjusted R-squared:  0.9169 
## F-statistic: 116.8 on 4 and 38 DF,  p-value: < 2.2e-16
backward3 <- lm(BSAAM~OPSLAKE+OPRC+APSLAKE)
summary(backward3)
## 
## Call:
## lm(formula = BSAAM ~ OPSLAKE + OPRC + APSLAKE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12964  -5140  -1252   4446  18649 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15424.6     3638.4   4.239 0.000133 ***
## OPSLAKE       2389.8      447.1   5.346 4.19e-06 ***
## OPRC          1797.5      567.8   3.166 0.002998 ** 
## APSLAKE       1712.5      500.5   3.421 0.001475 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7284 on 39 degrees of freedom
## Multiple R-squared:  0.9244, Adjusted R-squared:  0.9185 
## F-statistic: 158.9 on 3 and 39 DF,  p-value: < 2.2e-16

All predictors are now significant. The backward elimination method yielded a different model than forward selection did. The final model is…

backward <- lm(BSAAM~OPSLAKE+OPRC+APSLAKE)

We already know how to compare these models with t-tests and f-tests from previous learning logs. We can use these methods because the forward model is “nested” within the backward model. Another method is the AIC method. AIC calculates a value and it is best to take the lower of the two unless the difference is less than 10, in that case, just pick the model with the least predictors.

library(stats)
AIC(backward)
## [1] 892.6603
AIC(forward)
## [1] 901.9472

The backward model is not less than the forward model’s AIC by 10, so we shall stick with the simpler forward model.

model <- lm(BSAAM~OPSLAKE+OPRC)