When model building and adding predictors to your model, its important to keep in mind what ways you can test if adding predictors is good or not. A bad way is by looking at \(R^2\). It is bad, because \(R^2\) will always increase as more predictors are being added. Therefore, its impossible to know if you are adding unneeded variables.
Better ways to test the added predictors are:
It’s important to know when building models, you can use forward selection or backward elimination. Forward selection means you start with a reduced model and then add more predictors using tests mentioned above. Backward elimination means you start with a complete model and eliminate any predictors that are unnecessary.
I will use the data set water to put these tests in practice.
library(alr3)
## Loading required package: car
## Warning: package 'car' was built under R version 3.4.3
data(water)
attach(water)
I will now create a model for runoff using some of the snowfall measurements and perform multiple tests that use backward elimination.
First, create a model with all variables and run summary(model) to determine what variables are needed. Look at steps below to see what the steps are needed in determining model.
snowmod <- lm(BSAAM~ APMAM +APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE, data=water)
summary(snowmod)
##
## Call:
## lm(formula = BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC +
## OPSLAKE, data = water)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12690 -4936 -1424 4173 18542
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15944.67 4099.80 3.889 0.000416 ***
## APMAM -12.77 708.89 -0.018 0.985725
## APSAB -664.41 1522.89 -0.436 0.665237
## APSLAKE 2270.68 1341.29 1.693 0.099112 .
## OPBPC 69.70 461.69 0.151 0.880839
## OPRC 1916.45 641.36 2.988 0.005031 **
## OPSLAKE 2211.58 752.69 2.938 0.005729 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7557 on 36 degrees of freedom
## Multiple R-squared: 0.9248, Adjusted R-squared: 0.9123
## F-statistic: 73.82 on 6 and 36 DF, p-value: < 2.2e-16
snowmod2 <- lm(BSAAM~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE, data=water)
summary (snowmod2)
##
## Call:
## lm(formula = BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE,
## data = water)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12696 -4933 -1396 4187 18550
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15930.84 3972.50 4.010 0.000283 ***
## APSAB -673.42 1418.96 -0.475 0.637873
## APSLAKE 2263.86 1269.35 1.783 0.082714 .
## OPBPC 68.94 453.50 0.152 0.879996
## OPRC 1915.75 631.46 3.034 0.004399 **
## OPSLAKE 2212.62 740.28 2.989 0.004952 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7454 on 37 degrees of freedom
## Multiple R-squared: 0.9248, Adjusted R-squared: 0.9147
## F-statistic: 91.05 on 5 and 37 DF, p-value: < 2.2e-16
snowmod3 <- lm(BSAAM~ APSAB + APSLAKE + OPRC + OPSLAKE, data=water)
summary (snowmod3)
##
## Call:
## lm(formula = BSAAM ~ APSAB + APSLAKE + OPRC + OPSLAKE, data = water)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12750 -5095 -1494 4245 18594
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15749.8 3740.8 4.210 0.000151 ***
## APSAB -650.6 1392.8 -0.467 0.643055
## APSLAKE 2244.9 1246.9 1.800 0.079735 .
## OPRC 1910.2 622.3 3.070 0.003942 **
## OPSLAKE 2295.4 494.8 4.639 4.07e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7358 on 38 degrees of freedom
## Multiple R-squared: 0.9248, Adjusted R-squared: 0.9169
## F-statistic: 116.8 on 4 and 38 DF, p-value: < 2.2e-16
snowmod4 <- lm(BSAAM~ APSLAKE + OPRC + OPSLAKE, data=water)
summary (snowmod4)
##
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE, data = water)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12964 -5140 -1252 4446 18649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15424.6 3638.4 4.239 0.000133 ***
## APSLAKE 1712.5 500.5 3.421 0.001475 **
## OPRC 1797.5 567.8 3.166 0.002998 **
## OPSLAKE 2389.8 447.1 5.346 4.19e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7284 on 39 degrees of freedom
## Multiple R-squared: 0.9244, Adjusted R-squared: 0.9185
## F-statistic: 158.9 on 3 and 39 DF, p-value: < 2.2e-16
Steps:
drop APMAM because it has highest p value, then create new model without APMAM
drop OPBPC because it has highest p value, then create new model w/out APMAM, OPBPC
drop APSAB because it has highest p value, then create new model w/out APMAM,OPBPC,APSAB
The three site predictors left with low p-values mean that those predictors the most useful and important in our model and the the three predictors I dropped did not have a linear relationship between those predictors and runoff response variable.
When determining whether to use AIC or T-test, you need to look if your two models are nested with each other. They are nested if ones model’s predictors are a subset of the other model’s predictors. If they are, use T-test or F-test.
If the two models are not nested, you should use AIC.
I will use the data set water and use AIC with backward elimination. The stepAIC function takes the complete model with all predictors (snowmod) and then automates the process of backward elimination to find the best model with least amount of predictors.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:alr3':
##
## forbes
stepAIC(snowmod, direction="backward")
## Start: AIC=774.36
## BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## - APMAM 1 18537 2055849271 772.36
## - OPBPC 1 1301629 2057132362 772.39
## - APSAB 1 10869771 2066700504 772.58
## <none> 2055830733 774.36
## - APSLAKE 1 163662571 2219493304 775.65
## - OPSLAKE 1 493012936 2548843669 781.60
## - OPRC 1 509894399 2565725132 781.89
##
## Step: AIC=772.36
## BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## - OPBPC 1 1284108 2057133378 770.39
## - APSAB 1 12514566 2068363837 770.62
## <none> 2055849271 772.36
## - APSLAKE 1 176735690 2232584961 773.90
## - OPSLAKE 1 496370866 2552220136 779.66
## - OPRC 1 511413723 2567262994 779.91
##
## Step: AIC=770.39
## BSAAM ~ APSAB + APSLAKE + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## - APSAB 1 11814207 2068947585 768.63
## <none> 2057133378 770.39
## - APSLAKE 1 175480984 2232614362 771.91
## - OPRC 1 510159318 2567292697 777.91
## - OPSLAKE 1 1165227857 3222361235 787.68
##
## Step: AIC=768.63
## BSAAM ~ APSLAKE + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## <none> 2068947585 768.63
## - OPRC 1 531694203 2600641788 776.47
## - APSLAKE 1 621012173 2689959758 777.92
## - OPSLAKE 1 1515918540 3584866125 790.27
##
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE, data = water)
##
## Coefficients:
## (Intercept) APSLAKE OPRC OPSLAKE
## 15425 1712 1797 2390
We want the smallest AIC as possible.
To begin, we need to look at the first step in the output with all predictors listed. Whatever predictor we choose to eliminate, the AIC of the model will become whatever the AIC value is in that row. Therefore, we will eliminate the predictor with the lowest AIC value because that’s what the model’s AIC will become.
Steps
1) drop APMAM
2) drop OPBPC
3) drop APSAB
4) finally, drop none because <none> has the lowest AIC
THEREFORE,
The model BSAAM ~ APSLAKE + OPRC + OPSLAKE has the lowest AIC of 768.63. These are the 3 sites that are most useful and good enough predictors for understanding the runoff.