Model building is basically adding predictors to or elimnating predictors from the model.
Don’t use R-squared: proportion of variability in response that’s explained by its linear relationship with predictor(s)
T-test for adding a predictor - requires a nested model
F-test (partial) - requires a nested model
adjusted R-squared (like R-squared but penalizes for the number of predictors) - don’t actually use this one really
AIC (Akaike Information Criteria) - only good for comparing models - each AIC doesn’t mean anything on its own AIC = 2(k+1) - 2*log likelihood A smaller AIC is better
Mallow’s C_p \[Sum({y_i} - \hat{y_i})^2/MSE(predictors) \] The numerator is just using a subset of the predictors while the denominator is using all the predictors. Smaller is better, ideally about k+1.
To show each method, I will use the dataset of water.
library(alr3)
## Loading required package: car
data(water)
attach(water)
names(water)
## [1] "Year" "APMAM" "APSAB" "APSLAKE" "OPBPC" "OPRC" "OPSLAKE"
## [8] "BSAAM"
The response is BSAAM. The predictors are APMAM, APSAB, APSLAKE, OPBPC, OPRC, OPSLAKE.
Start simple, add more predictors using tests/criteria
Below is an example of forward selection using t-test with a significance level of 0.05.
mod1 = lm(BSAAM ~ APSLAKE + OPSLAKE)
summary(mod1)
##
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPSLAKE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13335.8 -5893.2 -171.8 4219.5 19500.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19144.9 3812.0 5.022 1.1e-05 ***
## APSLAKE 1768.8 553.7 3.194 0.00273 **
## OPSLAKE 3689.5 196.0 18.829 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8063 on 40 degrees of freedom
## Multiple R-squared: 0.9049, Adjusted R-squared: 0.9002
## F-statistic: 190.3 on 2 and 40 DF, p-value: < 2.2e-16
I picked two predictors to start with. I chose these two because I already knew, from working with this data before, that these two did help to predict the response.
mod2 = lm(BSAAM ~ APSLAKE + OPSLAKE + APMAM)
summary(mod2)
##
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPSLAKE + APMAM)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13849.2 -5728.4 -155.1 4084.6 19668.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18671.1 4023.8 4.640 3.87e-05 ***
## APSLAKE 1448.9 963.6 1.504 0.141
## OPSLAKE 3685.9 198.2 18.595 < 2e-16 ***
## APMAM 286.7 703.0 0.408 0.686
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8149 on 39 degrees of freedom
## Multiple R-squared: 0.9053, Adjusted R-squared: 0.898
## F-statistic: 124.3 on 3 and 39 DF, p-value: < 2.2e-16
Looking at the p-value for APMAM, we can see that it is over 0.05.
mod2 = lm(BSAAM ~ APSLAKE + OPSLAKE + APSAB)
summary(mod2)
##
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPSLAKE + APSAB)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13631.2 -5439.5 -331.7 4767.5 19501.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18279.9 4023.7 4.543 5.23e-05 ***
## APSLAKE 938.8 1292.4 0.726 0.472
## OPSLAKE 3709.5 199.2 18.626 < 2e-16 ***
## APSAB 1007.6 1415.6 0.712 0.481
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8113 on 39 degrees of freedom
## Multiple R-squared: 0.9061, Adjusted R-squared: 0.8989
## F-statistic: 125.5 on 3 and 39 DF, p-value: < 2.2e-16
Looking at the p-value for APSAB, we can see that it is over 0.05.
mod2 = lm(BSAAM ~ APSLAKE + OPSLAKE + OPBPC)
summary(mod2)
##
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPSLAKE + OPBPC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13321.6 -5876.3 -166.9 4198.4 19487.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19198.01 4054.50 4.735 2.88e-05 ***
## APSLAKE 1768.90 560.79 3.154 0.00309 **
## OPSLAKE 3665.47 595.45 6.156 3.15e-07 ***
## OPBPC 21.17 493.94 0.043 0.96604
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8166 on 39 degrees of freedom
## Multiple R-squared: 0.9049, Adjusted R-squared: 0.8976
## F-statistic: 123.7 on 3 and 39 DF, p-value: < 2.2e-16
Looking at the p-value for OPBPC, we can see that it is over 0.05.
mod2 = lm(BSAAM ~ APSLAKE + OPSLAKE + OPRC)
summary(mod2)
##
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPSLAKE + OPRC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12964 -5140 -1252 4446 18649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15424.6 3638.4 4.239 0.000133 ***
## APSLAKE 1712.5 500.5 3.421 0.001475 **
## OPSLAKE 2389.8 447.1 5.346 4.19e-06 ***
## OPRC 1797.5 567.8 3.166 0.002998 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7284 on 39 degrees of freedom
## Multiple R-squared: 0.9244, Adjusted R-squared: 0.9185
## F-statistic: 158.9 on 3 and 39 DF, p-value: < 2.2e-16
Looking at the p-value for OPRC, we can see that it is less than 0.05.
This means that our final model should be: lm(BSAAM ~ APSLAKE + OPSLAKE + OPBPC)
Star with complicated model, pare it down with tests/criteria
Below is an example of backwards selection using partial F-test with a significance level of 0.05.
mod3 = lm(BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE)
summary(mod3)
##
## Call:
## lm(formula = BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC +
## OPSLAKE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12690 -4936 -1424 4173 18542
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15944.67 4099.80 3.889 0.000416 ***
## APMAM -12.77 708.89 -0.018 0.985725
## APSAB -664.41 1522.89 -0.436 0.665237
## APSLAKE 2270.68 1341.29 1.693 0.099112 .
## OPBPC 69.70 461.69 0.151 0.880839
## OPRC 1916.45 641.36 2.988 0.005031 **
## OPSLAKE 2211.58 752.69 2.938 0.005729 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7557 on 36 degrees of freedom
## Multiple R-squared: 0.9248, Adjusted R-squared: 0.9123
## F-statistic: 73.82 on 6 and 36 DF, p-value: < 2.2e-16
This is a model of all of the predictors. We’ll take away one at a time until we’ve got the right model.
mod4 = lm(BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE)
summary(mod4)
##
## Call:
## lm(formula = BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12696 -4933 -1396 4187 18550
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15930.84 3972.50 4.010 0.000283 ***
## APSAB -673.42 1418.96 -0.475 0.637873
## APSLAKE 2263.86 1269.35 1.783 0.082714 .
## OPBPC 68.94 453.50 0.152 0.879996
## OPRC 1915.75 631.46 3.034 0.004399 **
## OPSLAKE 2212.62 740.28 2.989 0.004952 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7454 on 37 degrees of freedom
## Multiple R-squared: 0.9248, Adjusted R-squared: 0.9147
## F-statistic: 91.05 on 5 and 37 DF, p-value: < 2.2e-16
anova(mod3, mod4)
## Analysis of Variance Table
##
## Model 1: BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
## Model 2: BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 36 2055830733
## 2 37 2055849271 -1 -18538 3e-04 0.9857
The p-value is high, so APMAM should be taken out.
mod5 = lm(BSAAM ~ APSLAKE + OPBPC + OPRC + OPSLAKE)
summary(mod5)
##
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPBPC + OPRC + OPSLAKE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12932 -5010 -1181 4412 18621
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15538.35 3845.65 4.041 0.00025 ***
## APSLAKE 1712.68 506.99 3.378 0.00170 **
## OPBPC 46.22 446.34 0.104 0.91806
## OPRC 1798.53 575.20 3.127 0.00338 **
## OPSLAKE 2336.53 685.62 3.408 0.00156 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7378 on 38 degrees of freedom
## Multiple R-squared: 0.9244, Adjusted R-squared: 0.9164
## F-statistic: 116.1 on 4 and 38 DF, p-value: < 2.2e-16
anova(mod5, mod4)
## Analysis of Variance Table
##
## Model 1: BSAAM ~ APSLAKE + OPBPC + OPRC + OPSLAKE
## Model 2: BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 38 2068363837
## 2 37 2055849271 1 12514566 0.2252 0.6379
The p-value is high, so APSAB should be taken out.
mod6 = lm(BSAAM ~ APSLAKE + OPRC + OPSLAKE)
summary(mod6)
##
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12964 -5140 -1252 4446 18649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15424.6 3638.4 4.239 0.000133 ***
## APSLAKE 1712.5 500.5 3.421 0.001475 **
## OPRC 1797.5 567.8 3.166 0.002998 **
## OPSLAKE 2389.8 447.1 5.346 4.19e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7284 on 39 degrees of freedom
## Multiple R-squared: 0.9244, Adjusted R-squared: 0.9185
## F-statistic: 158.9 on 3 and 39 DF, p-value: < 2.2e-16
anova(mod6, mod5)
## Analysis of Variance Table
##
## Model 1: BSAAM ~ APSLAKE + OPRC + OPSLAKE
## Model 2: BSAAM ~ APSLAKE + OPBPC + OPRC + OPSLAKE
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 39 2068947585
## 2 38 2068363837 1 583748 0.0107 0.9181
The p-value is high, so OPBPC should be taken out.
mod7 = lm(BSAAM ~ APSLAKE + OPSLAKE)
summary(mod7)
##
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPSLAKE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13335.8 -5893.2 -171.8 4219.5 19500.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19144.9 3812.0 5.022 1.1e-05 ***
## APSLAKE 1768.8 553.7 3.194 0.00273 **
## OPSLAKE 3689.5 196.0 18.829 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8063 on 40 degrees of freedom
## Multiple R-squared: 0.9049, Adjusted R-squared: 0.9002
## F-statistic: 190.3 on 2 and 40 DF, p-value: < 2.2e-16
anova(mod7, mod6)
## Analysis of Variance Table
##
## Model 1: BSAAM ~ APSLAKE + OPSLAKE
## Model 2: BSAAM ~ APSLAKE + OPRC + OPSLAKE
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 40 2600641788
## 2 39 2068947585 1 531694203 10.023 0.002998 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is low, so OPRC should not be taken out.
Therefore, the model should be lm(BSAAM ~ APSLAKE + OPRC + OPSLAKE).
Only works with reasonably small number of predictors Try all of the possible regression models to see which is best
Below is an example of all subsets regression using AIC.
It would take a while to do all of it so I will just do a small section of each.
AIC(lm(BSAAM ~ APMAM))
## [1] 997.1542
AIC(lm(BSAAM ~ APSLAKE))
## [1] 996.9139
The first section is just the one predictor.
AIC(lm(BSAAM ~ APMAM + OPBPC))
## [1] 932.0902
AIC(lm(BSAAM ~ APSLAKE + OPSLAKE))
## [1] 900.4951
The next section is two predictors.
AIC(lm(BSAAM ~ APMAM + OPBPC + APSAB))
## [1] 932.9863
AIC(lm(BSAAM ~ APSLAKE + OPSLAKE + OPRC))
## [1] 892.6603
The next section is three predictors.
AIC(lm(BSAAM ~ APMAM + OPBPC + APSAB + OPRC))
## [1] 910.7397
AIC(lm(BSAAM ~ APSLAKE + OPSLAKE + OPRC + APMAM))
## [1] 894.6309
The next section is four predictors.
AIC(lm(BSAAM ~ APMAM + OPBPC + APSAB + OPRC + OPSLAKE))
## [1] 899.6805
AIC(lm(BSAAM ~ APSLAKE + OPSLAKE + OPRC + APMAM + OPBPC))
## [1] 896.6135
The next section is five predictors.
AIC(lm(BSAAM ~ APMAM + OPBPC + APSAB + OPRC + OPSLAKE + APSLAKE))
## [1] 898.3868
The last section is all the predictors.
We want the model with the smallest AIC. If there are two that are close (within ten), take the simplier model.
The final model is lm(BSAAM ~ APSLAKE + OPSLAKE + OPRC).
If the models will be nested: using tests.
If the models will not be nested: use AIC or Mallow’s C_p