Today in class we started by reviewing multicollinearity. Multicollinearity is when two or more predictors are correlated with each other, and it results in a few bad things happening. Point estimates become more unstable. T-tests are more likely to drop variables, because the standard errors are inflated. CIs for the Bj are wider.
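To see the effect on standard errors, here is a small simulation sketch. It is not from class: the variable names, the sample size, and the 0.95 correlation are all made up for illustration. It fits the same kind of model twice, once with an unrelated second predictor and once with a strongly correlated one, and pulls out the standard error of the first slope.

```r
# Simulated illustration only; none of these values come from the class data.
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2_indep <- rnorm(n)                                  # unrelated to x1
x2_corr <- 0.95 * x1 + sqrt(1 - 0.95^2) * rnorm(n)    # strongly correlated with x1
noise <- rnorm(n)

# Same true coefficients in both settings
y_indep <- 1 + 2 * x1 + 2 * x2_indep + noise
y_corr <- 1 + 2 * x1 + 2 * x2_corr + noise

# Standard error of the x1 slope in each fit
se_indep <- summary(lm(y_indep ~ x1 + x2_indep))$coefficients["x1", "Std. Error"]
se_corr <- summary(lm(y_corr ~ x1 + x2_corr))$coefficients["x1", "Std. Error"]

c(independent = se_indep, correlated = se_corr)
```

The standard error for x1 should come out noticeably larger in the correlated fit, which is what makes the confidence interval wider and the t-test weaker.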

After this we went over a measure we could use to decide which variables are worth keeping in a model. The first one was the AIC (Akaike Information Criterion). AIC is computed as 2(k+1) - 2 log(L), where L is the likelihood of the model and k is the number of predictors. For this criterion, lower numbers are better.
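As a rough check of the formula, here is a sketch that computes AIC by hand. It assumes R's built-in mtcars data and an arbitrary model (mpg on wt and hp), neither of which is from class. Two things to be aware of: R's AIC() also counts the error variance as a parameter, so it is larger than the formula above by exactly 2, and step() reports extractAIC(), which drops constant terms. Those constants are the same for every model fit to the same data, so the ranking of models does not change.

```r
# Illustration on the built-in mtcars data (not the class data set).
fit <- lm(mpg ~ wt + hp, data = mtcars)

k <- length(coef(fit)) - 1            # number of predictors, excluding the intercept
log_lik <- as.numeric(logLik(fit))    # maximized log-likelihood of the fitted model

aic_notes <- 2 * (k + 1) - 2 * log_lik   # formula as written in the notes
aic_r <- AIC(fit)                        # R's version, which adds 2 for the error variance

# The value step() prints for a linear model: n * log(RSS / n) + 2 * (k + 1)
n <- nrow(mtcars)
rss <- sum(residuals(fit)^2)
aic_step <- n * log(rss / n) + 2 * (k + 1)

c(notes = aic_notes, AIC = aic_r, step = aic_step)
```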

The other thing we covered today was methods of model building. Both ways we went over were stepwise. The forward selection method involves starting with just the intercept, and looking at the variables to see which single variable would improve the model the most. In the case of AIC, you look for the variable that would decrease the score the most. Once you find the variable that decreases AIC the most, you add it to your equation, then look at the remaining variables to see which one would decrease it the most again. This is done until adding any other variable would cause the AIC to increase.
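The loop described above can be sketched by hand. This is only an illustration of the idea, not how step() is implemented internally; it assumes the built-in mtcars data, and the response (mpg) and candidate predictors are picked arbitrarily.

```r
# Hand-rolled forward selection on mtcars (illustrative variables only).
candidates <- c("wt", "hp", "qsec", "drat")
selected <- character(0)
current <- lm(mpg ~ 1, data = mtcars)   # start with just the intercept

repeat {
  remaining <- setdiff(candidates, selected)
  if (length(remaining) == 0) break

  # AIC of each model formed by adding one of the remaining predictors
  trial_aic <- sapply(remaining, function(v) {
    f <- reformulate(c(selected, v), response = "mpg")
    extractAIC(lm(f, data = mtcars))[2]
  })

  # Stop when no addition lowers the AIC of the current model
  if (min(trial_aic) >= extractAIC(current)[2]) break

  selected <- c(selected, names(which.min(trial_aic)))
  current <- lm(reformulate(selected, response = "mpg"), data = mtcars)
}

selected   # predictors in the order they were added
```

In practice you would let step() do this for you; writing it out just makes it clear that each round refits one model per remaining variable and keeps the best one.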

Here is an example of R going through the steps for forward selection, using AIC.

library(alr3)
## Loading required package: car
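data(water)

# Smallest and largest models that step() is allowed to consider
lower.scope <- lm(BSAAM ~ 1, data = water)   # intercept only
upper.scope <- lm(BSAAM ~ ., data = water)   # every predictor in the data set

# Forward selection: start from the intercept-only model and add one variable per step
stepFwd <- step(lm(BSAAM ~ 1, data = water), scope = list(lower = lower.scope, upper = upper.scope), direction = "forward")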
## Start:  AIC=873.65
## BSAAM ~ 1
## 
##           Df  Sum of Sq        RSS    AIC
## + OPSLAKE  1 2.4087e+10 3.2640e+09 784.24
## + OPRC     1 2.3131e+10 4.2199e+09 795.28
## + OPBPC    1 2.1458e+10 5.8928e+09 809.64
## + APSLAKE  1 1.7004e+09 2.5651e+10 872.89
## + APMAM    1 1.5567e+09 2.5794e+10 873.13
## <none>                  2.7351e+10 873.65
## + APSAB    1 9.1891e+08 2.6432e+10 874.18
## + Year     1 7.9010e+08 2.6561e+10 874.38
## 
## Step:  AIC=784.24
## BSAAM ~ OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## + APSLAKE  1 663368666 2600641788 776.47
## + APSAB    1 661988129 2602022326 776.49
## + OPRC     1 574050696 2689959758 777.92
## + APMAM    1 524283532 2739726922 778.71
## <none>                 3264010454 784.24
## + Year     1  45570705 3218439749 785.63
## + OPBPC    1     56424 3263954031 786.24
## 
## Step:  AIC=776.47
## BSAAM ~ OPSLAKE + APSLAKE
## 
##         Df Sum of Sq        RSS    AIC
## + OPRC   1 531694203 2068947585 768.63
## <none>               2600641788 776.47
## + APSAB  1  33349091 2567292697 777.91
## + APMAM  1  11041158 2589600630 778.28
## + Year   1   7292595 2593349193 778.35
## + OPBPC  1    122447 2600519341 778.46
## 
## Step:  AIC=768.63
## BSAAM ~ OPSLAKE + APSLAKE + OPRC
## 
##         Df Sum of Sq        RSS    AIC
## <none>               2068947585 768.63
## + Year   1  89405710 1979541875 768.73
## + APSAB  1  11814207 2057133378 770.39
## + APMAM  1   1410311 2067537274 770.60
## + OPBPC  1    583748 2068363837 770.62

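This brings us to an end result of OPSLAKE + APSLAKE + OPRC, with an AIC of 768.63. That is the lowest AIC this forward search reached; since stepwise selection does not try every possible combination of variables, it is not guaranteed to be the lowest possible.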

The other method we went over was backwards elimination. With backwards elimination, you start with the most complicated equation you can, and look at how much the AIC would decrease by removing each variable. You then eliminate the variable whose removal lowers the AIC the most. This is repeated until removing any remaining predictor would no longer lower the AIC.
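Backwards elimination can be sketched the same way, using drop1() to score each possible removal. Again, this is an illustration under assumptions: it uses the built-in mtcars data with an arbitrary starting model, and it is not meant to reproduce step() exactly.

```r
# Hand-rolled backwards elimination on mtcars (illustrative variables only).
current <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)   # start with the full model

repeat {
  # Nothing left to remove once only the intercept remains
  if (length(attr(terms(current), "term.labels")) == 0) break

  # One row per candidate removal, plus a <none> row for the current model
  tab <- drop1(current)
  best <- rownames(tab)[which.min(tab$AIC)]

  # Stop when keeping the current model gives the lowest AIC
  if (best == "<none>") break

  current <- update(current, as.formula(paste(". ~ . -", best)))
}

formula(current)   # the model that is left
```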

Here is an example of backwards elimination.

stepback <- step(lm(BSAAM ~ ., data = water), scope = list(lower = lower.scope, upper=upper.scope), direction = "backward")
## Start:  AIC=774.53
## BSAAM ~ Year + APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## - OPBPC    1   1571591 1971971864 772.57
## - APMAM    1   2262194 1972662466 772.58
## - APSAB    1   7311109 1977711381 772.69
## - Year     1  85430461 2055830733 774.36
## <none>                 1970400272 774.53
## - APSLAKE  1 106880993 2077281265 774.80
## - OPSLAKE  1 413707192 2384107464 780.73
## - OPRC     1 576007855 2546408128 783.56
## 
## Step:  AIC=772.57
## BSAAM ~ Year + APMAM + APSAB + APSLAKE + OPRC + OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## - APMAM    1   2626064 1974597927 770.62
## - APSAB    1   6887325 1978859189 770.72
## - Year     1  85160498 2057132362 772.39
## <none>                 1971971864 772.57
## - APSLAKE  1 105315871 2077287734 772.80
## - OPRC     1 574517654 2546489517 781.56
## - OPSLAKE  1 964675516 2936647380 787.69
## 
## Step:  AIC=770.62
## BSAAM ~ Year + APSAB + APSLAKE + OPRC + OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## - APSAB    1   4943947 1979541875 768.73
## - Year     1  82535451 2057133378 770.39
## <none>                 1974597927 770.62
## - APSLAKE  1 127432687 2102030614 771.31
## - OPRC     1 575963916 2550561844 779.63
## - OPSLAKE  1 968394770 2942992697 785.78
## 
## Step:  AIC=768.73
## BSAAM ~ Year + APSLAKE + OPRC + OPSLAKE
## 
##           Df  Sum of Sq        RSS    AIC
## - Year     1   89405710 2068947585 768.63
## <none>                  1979541875 768.73
## - APSLAKE  1  523812582 2503354457 776.83
## - OPRC     1  613807319 2593349193 778.35
## - OPSLAKE  1 1175063776 3154605651 786.77
## 
## Step:  AIC=768.63
## BSAAM ~ APSLAKE + OPRC + OPSLAKE
## 
##           Df  Sum of Sq        RSS    AIC
## <none>                  2068947585 768.63
## - OPRC     1  531694203 2600641788 776.47
## - APSLAKE  1  621012173 2689959758 777.92
## - OPSLAKE  1 1515918540 3584866125 790.27

Here we come to a result of APSLAKE + OPRC + OPSLAKE, the same three predictors that forward selection chose. It should be noted, though, that different methods can end up with different models, so you should specify what you did to get your model. That way anyone who wishes to double-check your results can reproduce them by following the same steps.