There are many potential criteria for adding variables. We disucssed a few in class.
R^2. This is bad beccause it will always increase when more variables are added. This makes it difficult to compare models with different numbers of variables. If you use R^2 you could end up overfitting your data.
Adjusted R^2. This is like R^2 but it penalizes for the number of predictors in the model. This might be an okay route to take if you are explaining things to an audience without much stats knowledge.
-start with simple model -gradually add predictors -add most significant first -you can pick which criteria you want to use to determine significance, but remember to make your choice clear.
-sart with many/all predictors -gradually eliminate predictors -eliminate least significant first -stop when all remaining predictors are significant -once again, specify which method you are using to determine significance
-look at ALL possible models -this can get out of control quite quickly
Model1 is said to be nested in model2 if all of the predictor variable in model1 are also contained in model2.
For nested models, we can use all of the possible criteria we talked about. It is better to use a formal test (f-test,t-test) if possible.
For models that are not nested, we cannot use t-test or f-test to determine significance.
Let’s look at some examples using the “water” data.
library(alr3)
## Warning: package 'alr3' was built under R version 3.4.3
## Loading required package: car
## Warning: package 'car' was built under R version 3.4.3
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:alr3':
##
## forbes
data("water")
attach(water)
First, we used forward selection. I used the t-test to tell which predictors were most significant.
#make all models
modone<-lm(BSAAM~APSLAKE)
modtwo<-lm(BSAAM~OPSLAKE)
modthree<-lm(BSAAM~APMAM)
modfour<-lm(BSAAM~APSAB)
modfive<-lm(BSAAM~OPBPC)
modsix<-lm(BSAAM~OPRC)
summary(modone) #no
##
## Call:
## lm(formula = BSAAM ~ APSLAKE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46438 -16907 -5661 19028 69464
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63864 9249 6.905 2.25e-08 ***
## APSLAKE 2818 1709 1.649 0.107
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25010 on 41 degrees of freedom
## Multiple R-squared: 0.06217, Adjusted R-squared: 0.0393
## F-statistic: 2.718 on 1 and 41 DF, p-value: 0.1069
summary(modtwo) #yes
##
## Call:
## lm(formula = BSAAM ~ OPSLAKE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17603.8 -5338.0 332.1 3410.6 20875.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27014.6 3218.9 8.393 1.93e-10 ***
## OPSLAKE 3752.5 215.7 17.394 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8922 on 41 degrees of freedom
## Multiple R-squared: 0.8807, Adjusted R-squared: 0.8778
## F-statistic: 302.6 on 1 and 41 DF, p-value: < 2.2e-16
summary(modthree) #no
##
## Call:
## lm(formula = BSAAM ~ APMAM)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37043 -16339 -5457 17158 72467
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63364 9917 6.389 1.21e-07 ***
## APMAM 1965 1249 1.573 0.123
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25080 on 41 degrees of freedom
## Multiple R-squared: 0.05692, Adjusted R-squared: 0.03391
## F-statistic: 2.474 on 1 and 41 DF, p-value: 0.1234
summary(modfour) #no
##
## Call:
## lm(formula = BSAAM ~ APSAB)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41314 -16784 -5101 16492 70942
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 67152 9689 6.931 2.06e-08 ***
## APSAB 2279 1909 1.194 0.239
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25390 on 41 degrees of freedom
## Multiple R-squared: 0.0336, Adjusted R-squared: 0.01003
## F-statistic: 1.425 on 1 and 41 DF, p-value: 0.2394
summary(modfive) #yes
##
## Call:
## lm(formula = BSAAM ~ OPBPC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21183 -7298 -819 4731 38430
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40017.4 3589.1 11.15 5.47e-14 ***
## OPBPC 2940.1 240.6 12.22 3.00e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11990 on 41 degrees of freedom
## Multiple R-squared: 0.7845, Adjusted R-squared: 0.7793
## F-statistic: 149.3 on 1 and 41 DF, p-value: 2.996e-15
summary(modsix) #yes
##
## Call:
## lm(formula = BSAAM ~ OPRC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24356 -5514 -522 7448 24854
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21741.4 4044.1 5.376 3.32e-06 ***
## OPRC 4667.3 311.3 14.991 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10150 on 41 degrees of freedom
## Multiple R-squared: 0.8457, Adjusted R-squared: 0.842
## F-statistic: 224.7 on 1 and 41 DF, p-value: < 2.2e-16
#if no SIMPLE linear relationship then eliminate from the running.
mod1<-lm(BSAAM ~ OPSLAKE,water)
#Considering these two
mod2<-lm(BSAAM~ OPRC+OPSLAKE, water)
summary(mod2)
##
## Call:
## lm(formula = BSAAM ~ OPRC + OPSLAKE, data = water)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15991.2 -6484.6 -498.3 4700.1 19945.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22891.2 3277.8 6.984 1.98e-08 ***
## OPRC 1866.5 638.8 2.922 0.0057 **
## OPSLAKE 2400.8 503.3 4.770 2.46e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8201 on 40 degrees of freedom
## Multiple R-squared: 0.9017, Adjusted R-squared: 0.8967
## F-statistic: 183.4 on 2 and 40 DF, p-value: < 2.2e-16
mod3<-lm(BSAAM~OPBPC +OPSLAKE)
summary(mod3)
##
## Call:
## lm(formula = BSAAM ~ OPBPC + OPSLAKE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17591.0 -5276.6 275.6 3380.7 20867.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27050.95 3540.07 7.641 2.44e-09 ***
## OPBPC 14.37 546.41 0.026 0.979
## OPSLAKE 3736.16 658.24 5.676 1.35e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9033 on 40 degrees of freedom
## Multiple R-squared: 0.8807, Adjusted R-squared: 0.8747
## F-statistic: 147.6 on 2 and 40 DF, p-value: < 2.2e-16
#mod 3 has worse p-value
#mod 2 is BEST
Using Forward Selection, I ended up with the model: mod2<-lm(BSAAM~ OPRC+OPSLAKE, water) We were considering adding OPRC and OPBPC. OPRC was more significant so we added that one.
Next, we used Backward Elimination. I used the t-test to tell which predictors were least significant.
summary(mod1)
##
## Call:
## lm(formula = BSAAM ~ OPSLAKE, data = water)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17603.8 -5338.0 332.1 3410.6 20875.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27014.6 3218.9 8.393 1.93e-10 ***
## OPSLAKE 3752.5 215.7 17.394 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8922 on 41 degrees of freedom
## Multiple R-squared: 0.8807, Adjusted R-squared: 0.8778
## F-statistic: 302.6 on 1 and 41 DF, p-value: < 2.2e-16
#take out worst
mod4<-lm(BSAAM~APMAM+APSAB+APSLAKE+OPRC+OPSLAKE)
summary(mod4)
##
## Call:
## lm(formula = BSAAM ~ APMAM + APSAB + APSLAKE + OPRC + OPSLAKE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12748 -5096 -1501 4242 18593
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15752.558 3845.515 4.096 0.000219 ***
## APMAM -2.978 696.532 -0.004 0.996611
## APSAB -648.489 1499.041 -0.433 0.667815
## APSLAKE 2246.477 1313.976 1.710 0.095701 .
## OPRC 1910.356 631.576 3.025 0.004506 **
## OPSLAKE 2295.408 501.450 4.578 5.16e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7456 on 37 degrees of freedom
## Multiple R-squared: 0.9248, Adjusted R-squared: 0.9146
## F-statistic: 90.99 on 5 and 37 DF, p-value: < 2.2e-16
#take out worst
mod5<-lm(BSAAM~APSAB+APSLAKE+OPRC+OPSLAKE)
summary(mod5)
##
## Call:
## lm(formula = BSAAM ~ APSAB + APSLAKE + OPRC + OPSLAKE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12750 -5095 -1494 4245 18594
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15749.8 3740.8 4.210 0.000151 ***
## APSAB -650.6 1392.8 -0.467 0.643055
## APSLAKE 2244.9 1246.9 1.800 0.079735 .
## OPRC 1910.2 622.3 3.070 0.003942 **
## OPSLAKE 2295.4 494.8 4.639 4.07e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7358 on 38 degrees of freedom
## Multiple R-squared: 0.9248, Adjusted R-squared: 0.9169
## F-statistic: 116.8 on 4 and 38 DF, p-value: < 2.2e-16
#take out worst
mod6<-lm(BSAAM~APSLAKE+OPRC+OPSLAKE)
summary(mod6)
##
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12964 -5140 -1252 4446 18649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15424.6 3638.4 4.239 0.000133 ***
## APSLAKE 1712.5 500.5 3.421 0.001475 **
## OPRC 1797.5 567.8 3.166 0.002998 **
## OPSLAKE 2389.8 447.1 5.346 4.19e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7284 on 39 degrees of freedom
## Multiple R-squared: 0.9244, Adjusted R-squared: 0.9185
## F-statistic: 158.9 on 3 and 39 DF, p-value: < 2.2e-16
#all are significant so mod6 is final
Using Backward Elimination, I got the model: mod6<-lm(BSAAM~APSLAKE+OPRC+OPSLAKE). At this point all the predictors are significant so we will stop removing any.
It is much easier to get R to do this for us! The first chunk is a way I learned in my other class and the second is the command we learned in 333. Both produce the same resluts. They use AIC to determine which predictors to add/subtract.
To do this for Backward elimntation just change type from “forward” to “backward”.
modnone<-lm(BSAAM~1)
lower.scope <- lm(BSAAM ~ 1, data = water)
upper.scope <- lm(BSAAM ~ ., data = water)
stepfwd<- step(modnone,scope = list(lower=lower.scope, upper= upper.scope),direction = c("forward"))
## Start: AIC=873.65
## BSAAM ~ 1
##
## Df Sum of Sq RSS AIC
## + OPSLAKE 1 2.4087e+10 3.2640e+09 784.24
## + OPRC 1 2.3131e+10 4.2199e+09 795.28
## + OPBPC 1 2.1458e+10 5.8928e+09 809.64
## + APSLAKE 1 1.7004e+09 2.5651e+10 872.89
## + APMAM 1 1.5567e+09 2.5794e+10 873.13
## <none> 2.7351e+10 873.65
## + APSAB 1 9.1891e+08 2.6432e+10 874.18
## + Year 1 7.9010e+08 2.6561e+10 874.38
##
## Step: AIC=784.24
## BSAAM ~ OPSLAKE
##
## Df Sum of Sq RSS AIC
## + APSLAKE 1 663368666 2600641788 776.47
## + APSAB 1 661988129 2602022326 776.49
## + OPRC 1 574050696 2689959758 777.92
## + APMAM 1 524283532 2739726922 778.71
## <none> 3264010454 784.24
## + Year 1 45570705 3218439749 785.63
## + OPBPC 1 56424 3263954031 786.24
##
## Step: AIC=776.47
## BSAAM ~ OPSLAKE + APSLAKE
##
## Df Sum of Sq RSS AIC
## + OPRC 1 531694203 2068947585 768.63
## <none> 2600641788 776.47
## + APSAB 1 33349091 2567292697 777.91
## + APMAM 1 11041158 2589600630 778.28
## + Year 1 7292595 2593349193 778.35
## + OPBPC 1 122447 2600519341 778.46
##
## Step: AIC=768.63
## BSAAM ~ OPSLAKE + APSLAKE + OPRC
##
## Df Sum of Sq RSS AIC
## <none> 2068947585 768.63
## + Year 1 89405710 1979541875 768.73
## + APSAB 1 11814207 2057133378 770.39
## + APMAM 1 1410311 2067537274 770.60
## + OPBPC 1 583748 2068363837 770.62
modall<-lm(BSAAM~.,water)
stepAIC(modall)
## Start: AIC=774.53
## BSAAM ~ Year + APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## - OPBPC 1 1571591 1971971864 772.57
## - APMAM 1 2262194 1972662466 772.58
## - APSAB 1 7311109 1977711381 772.69
## - Year 1 85430461 2055830733 774.36
## <none> 1970400272 774.53
## - APSLAKE 1 106880993 2077281265 774.80
## - OPSLAKE 1 413707192 2384107464 780.73
## - OPRC 1 576007855 2546408128 783.56
##
## Step: AIC=772.57
## BSAAM ~ Year + APMAM + APSAB + APSLAKE + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## - APMAM 1 2626064 1974597927 770.62
## - APSAB 1 6887325 1978859189 770.72
## - Year 1 85160498 2057132362 772.39
## <none> 1971971864 772.57
## - APSLAKE 1 105315871 2077287734 772.80
## - OPRC 1 574517654 2546489517 781.56
## - OPSLAKE 1 964675516 2936647380 787.69
##
## Step: AIC=770.62
## BSAAM ~ Year + APSAB + APSLAKE + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## - APSAB 1 4943947 1979541875 768.73
## - Year 1 82535451 2057133378 770.39
## <none> 1974597927 770.62
## - APSLAKE 1 127432687 2102030614 771.31
## - OPRC 1 575963916 2550561844 779.63
## - OPSLAKE 1 968394770 2942992697 785.78
##
## Step: AIC=768.73
## BSAAM ~ Year + APSLAKE + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## - Year 1 89405710 2068947585 768.63
## <none> 1979541875 768.73
## - APSLAKE 1 523812582 2503354457 776.83
## - OPRC 1 613807319 2593349193 778.35
## - OPSLAKE 1 1175063776 3154605651 786.77
##
## Step: AIC=768.63
## BSAAM ~ APSLAKE + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## <none> 2068947585 768.63
## - OPRC 1 531694203 2600641788 776.47
## - APSLAKE 1 621012173 2689959758 777.92
## - OPSLAKE 1 1515918540 3584866125 790.27
##
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE, data = water)
##
## Coefficients:
## (Intercept) APSLAKE OPRC OPSLAKE
## 15425 1712 1797 2390
stepAIC(modnone,direction = "forward", scope = list(lower=lower.scope,upper=upper.scope))
## Start: AIC=873.65
## BSAAM ~ 1
##
## Df Sum of Sq RSS AIC
## + OPSLAKE 1 2.4087e+10 3.2640e+09 784.24
## + OPRC 1 2.3131e+10 4.2199e+09 795.28
## + OPBPC 1 2.1458e+10 5.8928e+09 809.64
## + APSLAKE 1 1.7004e+09 2.5651e+10 872.89
## + APMAM 1 1.5567e+09 2.5794e+10 873.13
## <none> 2.7351e+10 873.65
## + APSAB 1 9.1891e+08 2.6432e+10 874.18
## + Year 1 7.9010e+08 2.6561e+10 874.38
##
## Step: AIC=784.24
## BSAAM ~ OPSLAKE
##
## Df Sum of Sq RSS AIC
## + APSLAKE 1 663368666 2600641788 776.47
## + APSAB 1 661988129 2602022326 776.49
## + OPRC 1 574050696 2689959758 777.92
## + APMAM 1 524283532 2739726922 778.71
## <none> 3264010454 784.24
## + Year 1 45570705 3218439749 785.63
## + OPBPC 1 56424 3263954031 786.24
##
## Step: AIC=776.47
## BSAAM ~ OPSLAKE + APSLAKE
##
## Df Sum of Sq RSS AIC
## + OPRC 1 531694203 2068947585 768.63
## <none> 2600641788 776.47
## + APSAB 1 33349091 2567292697 777.91
## + APMAM 1 11041158 2589600630 778.28
## + Year 1 7292595 2593349193 778.35
## + OPBPC 1 122447 2600519341 778.46
##
## Step: AIC=768.63
## BSAAM ~ OPSLAKE + APSLAKE + OPRC
##
## Df Sum of Sq RSS AIC
## <none> 2068947585 768.63
## + Year 1 89405710 1979541875 768.73
## + APSAB 1 11814207 2057133378 770.39
## + APMAM 1 1410311 2067537274 770.60
## + OPBPC 1 583748 2068363837 770.62
##
## Call:
## lm(formula = BSAAM ~ OPSLAKE + APSLAKE + OPRC)
##
## Coefficients:
## (Intercept) OPSLAKE APSLAKE OPRC
## 15425 2390 1712 1797
##look if would improve by 10.
Using AIC produced a little different result than using the t-tests. If we wanted we could could check the AICS the test produced because if the AIC is not at least 10 less we should just use the simpler model. So the model we would actually pick would be: BSAAM ~ OPSLAKE.