Objective: We covered the rest of section 5.1 today (model building). Describe the different methods of model building and the different criteria (F-test, AIC, etc.). Then build models with the water and/or OKCupid data.

We discussed further regression techniques such as forward selection and backward elimination. With the water data, we will do a forward selection to find the best model.

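# Load and attach the data so the column names can be used directly.
# Assumes a package that provides the water data (e.g. alr4) is already loaded.
data(water)
attach(water)

# Full model with all six predictors (used later for backward elimination)
modA <- lm(BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE)

# Forward selection, step 1: fit each predictor on its own
mod1 <- lm(BSAAM ~ APMAM)
summary(mod1)
mod2 <- lm(BSAAM ~ APSAB)
summary(mod2)
mod3 <- lm(BSAAM ~ APSLAKE)
summary(mod3)
mod4 <- lm(BSAAM ~ OPBPC)
summary(mod4)
mod5 <- lm(BSAAM ~ OPRC)
summary(mod5)
mod6 <- lm(BSAAM ~ OPSLAKE)
summary(mod6)
# Look at all of the p-values to see which predictor is most significant

# Step 2: keep OPSLAKE and try adding a second predictor
mod7 <- lm(BSAAM ~ OPSLAKE)
summary(mod7)
mod8a <- lm(BSAAM ~ OPSLAKE + OPRC)
summary(mod8a)  # has a smaller p-value, so we move on to the next model
mod8B <- lm(BSAAM ~ OPSLAKE + OPBPC)
summary(mod8B)
mod9 <- lm(BSAAM ~ OPSLAKE + OPRC)  # same formula as mod8a; this model is not better
# Backward elimination is easier but tends to give a bigger model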
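
The single-predictor fits above can be condensed into one call per step. The following is a minimal sketch, not the method used in class: it uses `add1()` and `drop1()` from base R with `test = "F"` to run one forward step and one backward step with partial F-tests. The data are synthetic and the names `toy`, `x1`, `x2`, `x3`, and `y` are hypothetical, chosen only so the example runs without the water data.

```r
# Synthetic data for illustration only (NOT the water data).
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 2 + 3 * x1 + 0.5 * x2 + rnorm(n)  # x3 has no real effect on y
toy <- data.frame(y, x1, x2, x3)

# Forward step: start from the intercept-only model and test the
# addition of each candidate predictor with a partial F-test.
null_mod <- lm(y ~ 1, data = toy)
fwd <- add1(null_mod, scope = ~ x1 + x2 + x3, test = "F")
print(fwd)

# Backward step: start from the full model and test the removal
# of each predictor with a partial F-test.
full_mod <- lm(y ~ x1 + x2 + x3, data = toy)
bwd <- drop1(full_mod, test = "F")
print(bwd)

# Candidate with the smallest p-value in the forward step
# (the "<none>" row has an NA p-value, which which.min() skips).
best_first <- rownames(fwd)[which.min(fwd[["Pr(>F)"]])]
best_first
```

In forward selection the predictor with the smallest p-value is added and the step is repeated from the new model. In backward elimination the predictor with the largest p-value is dropped and the step is repeated.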

To summarize, we can use forward selection or backward elimination to build our model by hand. Alternatively, the MASS package in R has the stepAIC function, which automates the selection using the AIC.

library(MASS)  # provides stepAIC
stepAIC(modA, direction = "backward")
Start:  AIC=774.36
BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE

          Df Sum of Sq        RSS    AIC
- APMAM    1     18537 2055849271 772.36
- OPBPC    1   1301629 2057132362 772.39
- APSAB    1  10869771 2066700504 772.58
<none>                 2055830733 774.36
- APSLAKE  1 163662571 2219493304 775.65
- OPSLAKE  1 493012936 2548843669 781.60
- OPRC     1 509894399 2565725132 781.89

Step:  AIC=772.36
BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE

          Df Sum of Sq        RSS    AIC
- OPBPC    1   1284108 2057133378 770.39
- APSAB    1  12514566 2068363837 770.62
<none>                 2055849271 772.36
- APSLAKE  1 176735690 2232584961 773.90
- OPSLAKE  1 496370866 2552220136 779.66
- OPRC     1 511413723 2567262994 779.91

Step:  AIC=770.39
BSAAM ~ APSAB + APSLAKE + OPRC + OPSLAKE

          Df  Sum of Sq        RSS    AIC
- APSAB    1   11814207 2068947585 768.63
<none>                  2057133378 770.39
- APSLAKE  1  175480984 2232614362 771.91
- OPRC     1  510159318 2567292697 777.91
- OPSLAKE  1 1165227857 3222361235 787.68

Step:  AIC=768.63
BSAAM ~ APSLAKE + OPRC + OPSLAKE

          Df  Sum of Sq        RSS    AIC
<none>                  2068947585 768.63
- OPRC     1  531694203 2600641788 776.47
- APSLAKE  1  621012173 2689959758 777.92
- OPSLAKE  1 1515918540 3584866125 790.27

Call:
lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE)

Coefficients:
(Intercept)      APSLAKE         OPRC      OPSLAKE  
      15425         1712         1797         2390  

The stepAIC function takes the original model with all of the predictors and uses the AIC to remove one variable at a time. At each step it drops the variable whose removal gives the lowest AIC, and it repeats this until removing any remaining variable would no longer lower the AIC (the <none> row is then at the top of the table). Notice that this model may differ from the one we would get by hand with backward elimination, because by hand we used p-values rather than the AIC.
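
To see where the AIC column comes from, the values can be recomputed from the RSS column. For a linear model, stepAIC reports n * log(RSS / n) + 2 * p, where p counts the coefficients including the intercept. The sketch below assumes n = 43 observations, which is not stated in these notes but reproduces the printed values. The helper name `step_aic` is hypothetical.

```r
# Hypothetical helper: AIC as shown in the stepAIC table for a linear model.
# p is the number of coefficients, including the intercept.
step_aic <- function(rss, n, p) n * log(rss / n) + 2 * p

n <- 43  # ASSUMPTION: sample size, not stated in the notes

# RSS values copied from the "<none>" rows of the stepAIC output
aic_start <- step_aic(2055830733, n, p = 7)  # six predictors + intercept
aic_final <- step_aic(2068947585, n, p = 4)  # three predictors + intercept

# Rounds to 774.36 and 768.63, matching the "Start" and final "Step" lines
round(c(start = aic_start, final = aic_final), 2)
```

Dropping three predictors raises the RSS slightly, but the penalty 2 * p falls by 6, so the AIC of the smaller model is lower overall.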
