In section 5.1 we talked about model building. We started off with methods NOT to use: don’t use R2.
What we should use for model building:
Now, we can actually build our model! There are 3 options that we talked about in class for which we can use. First, I’ll talk about how we can manually do these, then I’ll show a shortcut option.
library(alr3)
## Warning: package 'alr3' was built under R version 3.4.3
## Loading required package: car
## Warning: package 'car' was built under R version 3.4.3
data("water")
names(water)
## [1] "Year" "APMAM" "APSAB" "APSLAKE" "OPBPC" "OPRC" "OPSLAKE"
## [8] "BSAAM"
mod1 <- lm(BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE, data = water)
summary(mod1)
##
## Call:
## lm(formula = BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC +
## OPSLAKE, data = water)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12690 -4936 -1424 4173 18542
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15944.67 4099.80 3.889 0.000416 ***
## APMAM -12.77 708.89 -0.018 0.985725
## APSAB -664.41 1522.89 -0.436 0.665237
## APSLAKE 2270.68 1341.29 1.693 0.099112 .
## OPBPC 69.70 461.69 0.151 0.880839
## OPRC 1916.45 641.36 2.988 0.005031 **
## OPSLAKE 2211.58 752.69 2.938 0.005729 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7557 on 36 degrees of freedom
## Multiple R-squared: 0.9248, Adjusted R-squared: 0.9123
## F-statistic: 73.82 on 6 and 36 DF, p-value: < 2.2e-16
The predictor with the lowest pvalue is OPRC, so we will fit a model with that to start. Next, we will add OPSLAKE. Lastly, we will add APSLAKE. After each addition we will check to make sure that the pvalues stay below the correct significance level.
mod2 <- lm(BSAAM ~ OPRC, data = water)
summary(mod2)
##
## Call:
## lm(formula = BSAAM ~ OPRC, data = water)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24356 -5514 -522 7448 24854
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21741.4 4044.1 5.376 3.32e-06 ***
## OPRC 4667.3 311.3 14.991 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10150 on 41 degrees of freedom
## Multiple R-squared: 0.8457, Adjusted R-squared: 0.842
## F-statistic: 224.7 on 1 and 41 DF, p-value: < 2.2e-16
Our OPRC is still very significant, so we will add OPSLAKE and see what happens.
mod3 <- lm(BSAAM ~ OPRC + OPSLAKE, data = water)
summary(mod3)
##
## Call:
## lm(formula = BSAAM ~ OPRC + OPSLAKE, data = water)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15991.2 -6484.6 -498.3 4700.1 19945.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22891.2 3277.8 6.984 1.98e-08 ***
## OPRC 1866.5 638.8 2.922 0.0057 **
## OPSLAKE 2400.8 503.3 4.770 2.46e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8201 on 40 degrees of freedom
## Multiple R-squared: 0.9017, Adjusted R-squared: 0.8967
## F-statistic: 183.4 on 2 and 40 DF, p-value: < 2.2e-16
OPRC and OPSLAKE and still significant, so we will add in APSLAKE to (hopefully) complete our model.
mod4 <- lm(BSAAM ~ OPRC + OPSLAKE + APSLAKE, data = water)
summary(mod4)
##
## Call:
## lm(formula = BSAAM ~ OPRC + OPSLAKE + APSLAKE, data = water)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12964 -5140 -1252 4446 18649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15424.6 3638.4 4.239 0.000133 ***
## OPRC 1797.5 567.8 3.166 0.002998 **
## OPSLAKE 2389.8 447.1 5.346 4.19e-06 ***
## APSLAKE 1712.5 500.5 3.421 0.001475 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7284 on 39 degrees of freedom
## Multiple R-squared: 0.9244, Adjusted R-squared: 0.9185
## F-statistic: 158.9 on 3 and 39 DF, p-value: < 2.2e-16
They all stay as significant predictors, so this will be our final model using the forward selection method.
summary(mod1)
##
## Call:
## lm(formula = BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC +
## OPSLAKE, data = water)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12690 -4936 -1424 4173 18542
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15944.67 4099.80 3.889 0.000416 ***
## APMAM -12.77 708.89 -0.018 0.985725
## APSAB -664.41 1522.89 -0.436 0.665237
## APSLAKE 2270.68 1341.29 1.693 0.099112 .
## OPBPC 69.70 461.69 0.151 0.880839
## OPRC 1916.45 641.36 2.988 0.005031 **
## OPSLAKE 2211.58 752.69 2.938 0.005729 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7557 on 36 degrees of freedom
## Multiple R-squared: 0.9248, Adjusted R-squared: 0.9123
## F-statistic: 73.82 on 6 and 36 DF, p-value: < 2.2e-16
The least significant predictor is APMAM, followed by OPBPC and then APSAB. We will one-by-one get rid of these predictors and see if the signficiant ones stay significant.
mod5 <- lm(BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE, data = water)
summary(mod5)
##
## Call:
## lm(formula = BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE,
## data = water)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12696 -4933 -1396 4187 18550
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15930.84 3972.50 4.010 0.000283 ***
## APSAB -673.42 1418.96 -0.475 0.637873
## APSLAKE 2263.86 1269.35 1.783 0.082714 .
## OPBPC 68.94 453.50 0.152 0.879996
## OPRC 1915.75 631.46 3.034 0.004399 **
## OPSLAKE 2212.62 740.28 2.989 0.004952 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7454 on 37 degrees of freedom
## Multiple R-squared: 0.9248, Adjusted R-squared: 0.9147
## F-statistic: 91.05 on 5 and 37 DF, p-value: < 2.2e-16
We will continue by getting rid of OPBPC.
mod6 <- lm(BSAAM ~ APSAB + APSLAKE + OPRC + OPSLAKE, data = water)
summary(mod6)
##
## Call:
## lm(formula = BSAAM ~ APSAB + APSLAKE + OPRC + OPSLAKE, data = water)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12750 -5095 -1494 4245 18594
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15749.8 3740.8 4.210 0.000151 ***
## APSAB -650.6 1392.8 -0.467 0.643055
## APSLAKE 2244.9 1246.9 1.800 0.079735 .
## OPRC 1910.2 622.3 3.070 0.003942 **
## OPSLAKE 2295.4 494.8 4.639 4.07e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7358 on 38 degrees of freedom
## Multiple R-squared: 0.9248, Adjusted R-squared: 0.9169
## F-statistic: 116.8 on 4 and 38 DF, p-value: < 2.2e-16
Lastly, we will get rid of APSAB. If the rest of our predictors stay significant, then our model will be complete.
mod7 <- lm(BSAAM ~ APSLAKE + OPRC + OPSLAKE, data = water)
summary(mod7)
##
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE, data = water)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12964 -5140 -1252 4446 18649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15424.6 3638.4 4.239 0.000133 ***
## APSLAKE 1712.5 500.5 3.421 0.001475 **
## OPRC 1797.5 567.8 3.166 0.002998 **
## OPSLAKE 2389.8 447.1 5.346 4.19e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7284 on 39 degrees of freedom
## Multiple R-squared: 0.9244, Adjusted R-squared: 0.9185
## F-statistic: 158.9 on 3 and 39 DF, p-value: < 2.2e-16
They are all significant, so we will stop here!
If we download the MASS package and use the stepAIC function, we can find the best combination of predictors for our model. We can specify for it to go backward (starting with a full model) or to go forward (starting with a model with only the intercept). We will go the backward and then the forward.
library(MASS)
## Warning: package 'MASS' was built under R version 3.4.2
##
## Attaching package: 'MASS'
## The following object is masked from 'package:alr3':
##
## forbes
stepAIC(mod1, direction = "backward")
## Start: AIC=774.36
## BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## - APMAM 1 18537 2055849271 772.36
## - OPBPC 1 1301629 2057132362 772.39
## - APSAB 1 10869771 2066700504 772.58
## <none> 2055830733 774.36
## - APSLAKE 1 163662571 2219493304 775.65
## - OPSLAKE 1 493012936 2548843669 781.60
## - OPRC 1 509894399 2565725132 781.89
##
## Step: AIC=772.36
## BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## - OPBPC 1 1284108 2057133378 770.39
## - APSAB 1 12514566 2068363837 770.62
## <none> 2055849271 772.36
## - APSLAKE 1 176735690 2232584961 773.90
## - OPSLAKE 1 496370866 2552220136 779.66
## - OPRC 1 511413723 2567262994 779.91
##
## Step: AIC=770.39
## BSAAM ~ APSAB + APSLAKE + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## - APSAB 1 11814207 2068947585 768.63
## <none> 2057133378 770.39
## - APSLAKE 1 175480984 2232614362 771.91
## - OPRC 1 510159318 2567292697 777.91
## - OPSLAKE 1 1165227857 3222361235 787.68
##
## Step: AIC=768.63
## BSAAM ~ APSLAKE + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## <none> 2068947585 768.63
## - OPRC 1 531694203 2600641788 776.47
## - APSLAKE 1 621012173 2689959758 777.92
## - OPSLAKE 1 1515918540 3584866125 790.27
##
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE, data = water)
##
## Coefficients:
## (Intercept) APSLAKE OPRC OPSLAKE
## 15425 1712 1797 2390
The backwards elimination tells us that our final model should have APSLAKE, OPRC and OPSLAKE as our three predictors.
mod10 <- lm(BSAAM ~ 1, data = water)
stepAIC(mod10, direction = "forward", scope = list(upper = BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE, lower = BSAAM ~ 1), data = water)
## Start: AIC=873.65
## BSAAM ~ 1
##
## Df Sum of Sq RSS AIC
## + OPSLAKE 1 2.4087e+10 3.2640e+09 784.24
## + OPRC 1 2.3131e+10 4.2199e+09 795.28
## + OPBPC 1 2.1458e+10 5.8928e+09 809.64
## + APSLAKE 1 1.7004e+09 2.5651e+10 872.89
## + APMAM 1 1.5567e+09 2.5794e+10 873.13
## <none> 2.7351e+10 873.65
## + APSAB 1 9.1891e+08 2.6432e+10 874.18
##
## Step: AIC=784.24
## BSAAM ~ OPSLAKE
##
## Df Sum of Sq RSS AIC
## + APSLAKE 1 663368666 2600641788 776.47
## + APSAB 1 661988129 2602022326 776.49
## + OPRC 1 574050696 2689959758 777.92
## + APMAM 1 524283532 2739726922 778.71
## <none> 3264010454 784.24
## + OPBPC 1 56424 3263954031 786.24
##
## Step: AIC=776.47
## BSAAM ~ OPSLAKE + APSLAKE
##
## Df Sum of Sq RSS AIC
## + OPRC 1 531694203 2068947585 768.63
## <none> 2600641788 776.47
## + APSAB 1 33349091 2567292697 777.91
## + APMAM 1 11041158 2589600630 778.28
## + OPBPC 1 122447 2600519341 778.46
##
## Step: AIC=768.63
## BSAAM ~ OPSLAKE + APSLAKE + OPRC
##
## Df Sum of Sq RSS AIC
## <none> 2068947585 768.63
## + APSAB 1 11814207 2057133378 770.39
## + APMAM 1 1410311 2067537274 770.60
## + OPBPC 1 583748 2068363837 770.62
##
## Call:
## lm(formula = BSAAM ~ OPSLAKE + APSLAKE + OPRC, data = water)
##
## Coefficients:
## (Intercept) OPSLAKE APSLAKE OPRC
## 15425 2390 1712 1797
Our forward selection process tells us to end with the predictors being OPSLAKE, APSLAKE and OPRC. We got the same answer for both forward and backward, which is a good thing.