Objective: We covered the rest of section 5.1 today (model building). Describe the different methods of model building and the different criteria (F-test, AIC, etc). Then build models with the water and/or OKCupid data.
We discussed further regression topics such as forward selection and backward elimination. With this data, we will do a forward selection by hand to find the best model.
modA<-lm(BSAAM~APMAM+APSAB+APSLAKE+OPBPC+OPRC+OPSLAKE)
mod1<-lm(BSAAM~APMAM)
summary(mod1)
mod2<-lm(BSAAM~APSAB)
summary(mod2)
mod3<-lm(BSAAM~APSLAKE)
summary(mod3)
mod4<-lm(BSAAM~OPBPC)
summary(mod4)
mod5<-lm(BSAAM~OPRC)
summary(mod5)
mod6<-lm(BSAAM~OPSLAKE)
summary(mod6)
#compare the p-values of the six single-predictor models to see which predictor is most significant
mod7<-lm(BSAAM~OPSLAKE)#OPSLAKE enters first (same model as mod6)
summary(mod7)
mod8a<-lm(BSAAM~OPSLAKE+OPRC)
summary(mod8a)#has a smaller p value so we move to the next model
mod8B<-lm(BSAAM~OPSLAKE+OPBPC)
summary(mod8B)
mod9<-lm(BSAAM~OPSLAKE+OPRC)#this model is not better
#backward elimination is easier but will give a bigger model
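The first forward-selection step above (fit every one-predictor model, then compare p-values) can also be done in one call with add1(). This is only a sketch: it uses the built-in mtcars data as a stand-in, not the water data, and the names fit0, candidates, and best are made up for illustration.

```r
# Stand-in data: the built-in mtcars set, NOT the water data from class.
fit0 <- lm(mpg ~ 1, data = mtcars)   # start from the intercept-only model
candidates <- ~ wt + hp + disp       # predictors that are allowed to enter

# add1() fits each one-variable extension of fit0; with test = "F" it also
# reports the F-test p-value for adding that variable.
tab <- add1(fit0, scope = candidates, test = "F")
print(tab)

# The last column holds the p-values (NA for the <none> row, which is skipped).
pvals <- tab[[ncol(tab)]]
best <- rownames(tab)[which.min(pvals)]

# Refit with the most significant variable added.
# Repeating add1() on fit1 gives the next forward step.
fit1 <- update(fit0, as.formula(paste(". ~ . +", best)))
summary(fit1)
```

With the water data, the same pattern would start from lm(BSAAM~1) and list the six precipitation variables in the scope.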
To summarize, we can use forward selection or backward elimination to build our model. However, R has the stepAIC function (in the MASS package), which can be used to build the model automatically.
library(MASS)#stepAIC is in the MASS package
stepAIC(modA,direction="backward")
Start: AIC=774.36
BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE

          Df Sum of Sq        RSS    AIC
- APMAM    1     18537 2055849271 772.36
- OPBPC    1   1301629 2057132362 772.39
- APSAB    1  10869771 2066700504 772.58
<none>                 2055830733 774.36
- APSLAKE  1 163662571 2219493304 775.65
- OPSLAKE  1 493012936 2548843669 781.60
- OPRC     1 509894399 2565725132 781.89

Step: AIC=772.36
BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE

          Df Sum of Sq        RSS    AIC
- OPBPC    1   1284108 2057133378 770.39
- APSAB    1  12514566 2068363837 770.62
<none>                 2055849271 772.36
- APSLAKE  1 176735690 2232584961 773.90
- OPSLAKE  1 496370866 2552220136 779.66
- OPRC     1 511413723 2567262994 779.91

Step: AIC=770.39
BSAAM ~ APSAB + APSLAKE + OPRC + OPSLAKE

          Df  Sum of Sq        RSS    AIC
- APSAB    1   11814207 2068947585 768.63
<none>                  2057133378 770.39
- APSLAKE  1  175480984 2232614362 771.91
- OPRC     1  510159318 2567292697 777.91
- OPSLAKE  1 1165227857 3222361235 787.68

Step: AIC=768.63
BSAAM ~ APSLAKE + OPRC + OPSLAKE

          Df  Sum of Sq        RSS    AIC
<none>                  2068947585 768.63
- OPRC     1  531694203 2600641788 776.47
- APSLAKE  1  621012173 2689959758 777.92
- OPSLAKE  1 1515918540 3584866125 790.27

Call:
lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE)

Coefficients:
(Intercept)      APSLAKE         OPRC      OPSLAKE
      15425         1712         1797         2390
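The AIC column in this output can be checked by hand. For a linear model, stepAIC reports n*log(RSS/n) + 2*edf, where edf is the number of fitted coefficients including the intercept. The sketch below plugs in two RSS values from the output. The sample size n = 43 is an assumption (it is not stated in these notes), and aic_lm is just a helper name made up here.

```r
# AIC as reported by stepAIC for a linear model:
#   n * log(RSS / n) + 2 * edf
# where edf counts the fitted coefficients (intercept included).
aic_lm <- function(rss, n, edf) n * log(rss / n) + 2 * edf

n <- 43  # ASSUMPTION: number of rows in the water data (not stated in the notes)

# RSS values are copied from the <none> rows of the first and last steps.
aic_full  <- aic_lm(2055830733, n, edf = 7)  # intercept + 6 predictors
aic_final <- aic_lm(2068947585, n, edf = 4)  # intercept + APSLAKE, OPRC, OPSLAKE

round(c(full = aic_full, final = aic_final), 2)
```

Under that assumption the two values round to 774.36 and 768.63, the same as the Start and final Step lines. The final model has a slightly larger RSS but three fewer coefficients, so its AIC is lower.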
The stepAIC function starts from the original model with all of the predictors and uses the AIC to take one variable out at a time. At each step it drops the variable whose removal gives the lowest AIC, and it repeats until removing any remaining variable would raise the AIC (the <none> row is at the top of the table). Notice that this model may be different from the one we would get doing backward elimination by hand, because by hand we used p-values rather than AIC.
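To see the p-value and AIC criteria side by side, drop1() shows both for each variable in a model. Again this is only a sketch on the built-in mtcars data, not the water data, and the variable names are made up for illustration.

```r
# Stand-in data: the built-in mtcars set, NOT the water data from class.
full <- lm(mpg ~ wt + hp + disp, data = mtcars)

# drop1() refits the model with each term removed. The AIC column is what
# stepAIC compares; test = "F" adds the p-value used in backward elimination.
tab <- drop1(full, test = "F")
print(tab)

pvals <- tab[[ncol(tab)]]                    # last column: p-values (NA for <none>)
drop_by_p <- rownames(tab)[which.max(pvals)] # p-value rule: least significant term

terms_only <- tab[rownames(tab) != "<none>", ]
drop_by_aic <- rownames(terms_only)[which.min(terms_only$AIC)]  # AIC rule

c(p_value = drop_by_p, AIC = drop_by_aic)
```

When every term is a single variable, both rules pick the same variable to drop at a given step. What differs is when they stop: the AIC rule stops once <none> has the lowest AIC, while the p-value rule stops once every remaining p-value is below the chosen cutoff. That is why the two final models can differ.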