In class we disscused the rest of multicollinearity. Our main focus in class was Model building. In model building we are trying to find the best model to use to predict our response variable given our predictors that we have. Models can be either nested, like y~x1+x2 and y~x1+x2+x3, or not nested, y~x1+x2 and y~x1+x3. To test if a predictor should be used in the model use T-test or F-test if the model is nested, or use AIC(Akaike Information Criteria) or Mallow’s Cp if the model is not nested. To build a model we can use forward and backward selection as well as all subsets. For backwards and forwards selection we will use the water data in the alr3 library.
In backward selection you start with a complicated model with all your predictars and pare it down. Testing each time to see if the predictor should be left in the model. For backwards selection I used the step by step approch of deleting the predictor with the highest pvalue.
library(alr3)
## Warning: package 'alr3' was built under R version 3.4.3
## Loading required package: car
## Warning: package 'car' was built under R version 3.4.3
data(water)
attach(water)
names(water)
## [1] "Year" "APMAM" "APSAB" "APSLAKE" "OPBPC" "OPRC" "OPSLAKE"
## [8] "BSAAM"
mod1=lm(BSAAM~OPBPC+OPRC+OPSLAKE+APSLAKE+APSAB+APMAM)
summary(mod1)
##
## Call:
## lm(formula = BSAAM ~ OPBPC + OPRC + OPSLAKE + APSLAKE + APSAB +
## APMAM)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12690 -4936 -1424 4173 18542
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15944.67 4099.80 3.889 0.000416 ***
## OPBPC 69.70 461.69 0.151 0.880839
## OPRC 1916.45 641.36 2.988 0.005031 **
## OPSLAKE 2211.58 752.69 2.938 0.005729 **
## APSLAKE 2270.68 1341.29 1.693 0.099112 .
## APSAB -664.41 1522.89 -0.436 0.665237
## APMAM -12.77 708.89 -0.018 0.985725
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7557 on 36 degrees of freedom
## Multiple R-squared: 0.9248, Adjusted R-squared: 0.9123
## F-statistic: 73.82 on 6 and 36 DF, p-value: < 2.2e-16
mod2=lm(BSAAM~OPRC+OPSLAKE+APSLAKE+APSAB+APMAM)
summary(mod2)
##
## Call:
## lm(formula = BSAAM ~ OPRC + OPSLAKE + APSLAKE + APSAB + APMAM)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12748 -5096 -1501 4242 18593
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15752.558 3845.515 4.096 0.000219 ***
## OPRC 1910.356 631.576 3.025 0.004506 **
## OPSLAKE 2295.408 501.450 4.578 5.16e-05 ***
## APSLAKE 2246.477 1313.976 1.710 0.095701 .
## APSAB -648.489 1499.041 -0.433 0.667815
## APMAM -2.978 696.532 -0.004 0.996611
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7456 on 37 degrees of freedom
## Multiple R-squared: 0.9248, Adjusted R-squared: 0.9146
## F-statistic: 90.99 on 5 and 37 DF, p-value: < 2.2e-16
mod3=lm(BSAAM~OPRC+OPSLAKE+APSLAKE+APSAB)
summary(mod3)
##
## Call:
## lm(formula = BSAAM ~ OPRC + OPSLAKE + APSLAKE + APSAB)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12750 -5095 -1494 4245 18594
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15749.8 3740.8 4.210 0.000151 ***
## OPRC 1910.2 622.3 3.070 0.003942 **
## OPSLAKE 2295.4 494.8 4.639 4.07e-05 ***
## APSLAKE 2244.9 1246.9 1.800 0.079735 .
## APSAB -650.6 1392.8 -0.467 0.643055
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7358 on 38 degrees of freedom
## Multiple R-squared: 0.9248, Adjusted R-squared: 0.9169
## F-statistic: 116.8 on 4 and 38 DF, p-value: < 2.2e-16
mod4=lm(BSAAM~OPRC+OPSLAKE+APSLAKE)
summary(mod4)
##
## Call:
## lm(formula = BSAAM ~ OPRC + OPSLAKE + APSLAKE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12964 -5140 -1252 4446 18649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15424.6 3638.4 4.239 0.000133 ***
## OPRC 1797.5 567.8 3.166 0.002998 **
## OPSLAKE 2389.8 447.1 5.346 4.19e-06 ***
## APSLAKE 1712.5 500.5 3.421 0.001475 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7284 on 39 degrees of freedom
## Multiple R-squared: 0.9244, Adjusted R-squared: 0.9185
## F-statistic: 158.9 on 3 and 39 DF, p-value: < 2.2e-16
Our best model is BSAAM~OPRC+OPSLAKE+APSLAKE.
In forward selection you start simple and add more predictors and then test them to see if they should have been added. The stepAIC function, with direction=“foward”, gives us a model that adds a predictor with the next lowest AIC until we reach the lowest AIC possible with our model.
mod5=lm(BSAAM~OPRC, data=water)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:alr3':
##
## forbes
stepAIC(mod5, direction="forward", scope=list(upper=BSAAM~APMAM+APSAB+APSLAKE+OPBPC+OPRC+OPSLAKE, lower=BSAAM~1))
## Start: AIC=795.28
## BSAAM ~ OPRC
##
## Df Sum of Sq RSS AIC
## + OPSLAKE 1 1529924515 2689959758 777.92
## + OPBPC 1 888721040 3331163233 787.11
## + APSLAKE 1 635018148 3584866125 790.27
## + APMAM 1 261271943 3958612330 794.53
## + APSAB 1 205266122 4014618151 795.14
## <none> 4219884273 795.28
##
## Step: AIC=777.92
## BSAAM ~ OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## + APSLAKE 1 621012173 2068947585 768.63
## + APSAB 1 457345396 2232614362 771.91
## + APMAM 1 387970039 2301989719 773.22
## <none> 2689959758 777.92
## + OPBPC 1 450573 2689509185 779.91
##
## Step: AIC=768.63
## BSAAM ~ OPRC + OPSLAKE + APSLAKE
##
## Df Sum of Sq RSS AIC
## <none> 2068947585 768.63
## + APSAB 1 11814207 2057133378 770.39
## + APMAM 1 1410311 2067537274 770.60
## + OPBPC 1 583748 2068363837 770.62
##
## Call:
## lm(formula = BSAAM ~ OPRC + OPSLAKE + APSLAKE, data = water)
##
## Coefficients:
## (Intercept) OPRC OPSLAKE APSLAKE
## 15425 1797 2390 1712
Our best model is again BSAAM~OPRC+OPSLAKE+APSLAKE.
With all subsets you look at every possible model with your predictors and test using AIC and keep the lowest AIC. For this example we will use the Cupid data. Below I made a subset of every possible predictor combination for this model and I used AIC to test it.
data=read.csv("http://cknudson.com/data/OKCupid.csv")
attach(data)
names(data)
## [1] "Sex" "Height" "IdealMateHeight" "Age"
m1=lm(IdealMateHeight~Sex)
AIC(m1)
## [1] 641.2577
m2=lm(IdealMateHeight~Height)
AIC(m2)
## [1] 733.1703
m3=lm(IdealMateHeight~Age)
AIC(m3)
## [1] 741.1309
m4=lm(IdealMateHeight~Sex+Height)
AIC(m4)
## [1] 588.1644
m5=lm(IdealMateHeight~Sex+Age)
AIC(m5)
## [1] 643.2577
m6=lm(IdealMateHeight~Height+Age)
AIC(m6)
## [1] 735.1483
m7=lm(IdealMateHeight~Sex+Height+Age)
AIC(m7)
## [1] 590.0658
As you can see from the AIC our smallest AIC is with model 4 using IdealMaleHeight~Sex+Height.