In class we continued our discussion of multicollinearity and introduced methods for choosing a subset of predictors, so that the model keeps the useful variables without the redundant, collinear ones. We can perform either Stepwise Regression or Backward Elimination. Stepwise regression (run here in its forward form) begins with a model that contains no predictors and adds one variable at a time, checking a chosen criterion after each variable is added. This process continues until adding any of the remaining predictors no longer improves the criterion. Backward elimination is the opposite: it starts with all variables and removes one variable at a time, based on the same kind of criterion, until no removal improves it.
We can practice these techniques with the water data:
library(alr3)
## Loading required package: car
data(water)    # load the data set shipped with alr3
attach(water)  # make its columns visible by name to the lm() calls below
head(water)
## Year APMAM APSAB APSLAKE OPBPC OPRC OPSLAKE BSAAM
## 1 1948 9.13 3.58 3.91 4.10 7.43 6.47 54235
## 2 1949 5.28 4.82 5.20 7.55 11.11 10.26 67567
## 3 1950 4.20 3.77 3.67 9.52 12.20 11.35 66161
## 4 1951 4.60 4.46 3.93 11.14 15.15 11.13 68094
## 5 1952 7.15 4.99 4.88 16.34 20.05 22.81 107080
## 6 1953 9.70 5.65 4.91 8.88 8.15 7.41 67594
mod0<-lm(BSAAM~APMAM+APSAB+APSLAKE+OPBPC+OPRC+OPSLAKE)
mod2<-lm(BSAAM~1)
summary(mod0)
##
## Call:
## lm(formula = BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC +
## OPSLAKE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12690 -4936 -1424 4173 18542
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15944.67 4099.80 3.889 0.000416 ***
## APMAM -12.77 708.89 -0.018 0.985725
## APSAB -664.41 1522.89 -0.436 0.665237
## APSLAKE 2270.68 1341.29 1.693 0.099112 .
## OPBPC 69.70 461.69 0.151 0.880839
## OPRC 1916.45 641.36 2.988 0.005031 **
## OPSLAKE 2211.58 752.69 2.938 0.005729 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7557 on 36 degrees of freedom
## Multiple R-squared: 0.9248, Adjusted R-squared: 0.9123
## F-statistic: 73.82 on 6 and 36 DF, p-value: < 2.2e-16
Here mod0 is the full model with all six predictors and mod2 is the intercept-only model with none. We will use the full model as the starting point for backward elimination, and the intercept-only model as the starting point for stepwise regression.
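To see what a single step of each procedure looks like, base R's drop1() and add1() report the criterion for every one-variable move from a given model. The sketch below is only an illustration: it uses the built-in mtcars data as a stand-in (an assumption, chosen so the code runs without alr3), and the object names full, empty, back_step, and fwd_step are made up for this example.

```r
# Stand-in data: the built-in mtcars set (an assumption, not the water data).
full  <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)  # all candidate predictors
empty <- lm(mpg ~ 1, data = mtcars)                      # intercept only

# Backward elimination, one step: AIC after dropping each term in turn.
back_step <- drop1(full)
print(back_step)

# Forward selection, one step: AIC after adding each candidate to the empty model.
fwd_step <- add1(empty, scope = formula(full))
print(fwd_step)

# The row with the smallest AIC is the move the procedure would make next;
# "<none>" means no move helps and the procedure would stop.
rownames(back_step)[which.min(back_step$AIC)]
rownames(fwd_step)[which.min(fwd_step$AIC)]
```

A full stepwise run simply repeats this step, refitting the chosen model each time, until the `<none>` row has the smallest AIC.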
From here, we can choose among different ranking criteria. One that we learned in class is the Akaike Information Criterion (AIC), a numerical value that we would like to be as small as possible. The formula is: \[AIC=2(k+1)-2\ln(\hat{L})\] where \(k\) is the number of predictors, so that \(k+1\) counts the regression coefficients including the intercept, and \(\hat{L}\) is the maximized value of the likelihood function. The first term penalizes model size and the second rewards fit, so minimizing AIC balances using few predictor variables against the accuracy of the model. R has an AIC function that takes the fitted model as its argument:
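As a rough check on the formula, the value returned by AIC() can be rebuilt from the model's log-likelihood. This is a sketch on the built-in mtcars data (a stand-in, not the water data), and the names fit, ll, p, and aic_by_hand are made up for the example. One caveat: for an lm fit, R's logLik() also counts the error variance as a parameter, so its count is one more than the number of regression coefficients. That shifts every model by the same constant and does not change which model has the smaller AIC.

```r
# Stand-in model on built-in data (an assumption for illustration only).
fit <- lm(mpg ~ wt + hp, data = mtcars)

ll <- logLik(fit)      # maximized log-likelihood, ln(L-hat)
p  <- attr(ll, "df")   # parameters counted by R: 2 slopes + intercept + error variance

# AIC = 2 * (number of parameters) - 2 * ln(L-hat)
aic_by_hand <- 2 * p - 2 * as.numeric(ll)

# Compare with the built-in function.
all.equal(aic_by_hand, AIC(fit))
```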
AIC(mod0)
## [1] 898.3868
AIC(mod2)
## [1] 997.674
Here the model with all the variables has the lower AIC (898.39 versus 997.67), so of these two we would prefer it. When we use the stepwise/backward model building techniques, we can compare the AIC of each candidate model and pick the one with the smallest value.
We can also still use the t-test and partial F-test that we learned in chapter 4, but only when the models are nested within one another. AIC does not need that restriction, because it simply compares the models against each other. We also learned of another criterion, Mallows' Cp, but were told that it is not as widely used in practice as AIC.
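For nested models, the partial F-test is available through anova(). The sketch below again uses the built-in mtcars data as a stand-in, and the model names reduced and full are made up for the example. It puts the F-test next to the AIC comparison for the same pair of models.

```r
# Two nested models on stand-in data: 'reduced' uses a subset of 'full's predictors.
reduced <- lm(mpg ~ wt, data = mtcars)
full    <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Partial F-test: do hp and qsec jointly improve on the reduced model?
f_test <- anova(reduced, full)
print(f_test)
p_value <- f_test[["Pr(>F)"]][2]   # p-value for the added terms

# AIC for the same two models; the smaller value is preferred.
AIC(reduced, full)
```

The F-test answers a yes/no question about the extra terms at a chosen significance level, while AIC only ranks the models, so the two need not always agree.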
The stepAIC function from the MASS package automates the process of comparing candidate models by their AIC. When no scope is given, the default direction is backward elimination.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:alr3':
##
## forbes
stepAIC(mod0)
## Start: AIC=774.36
## BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## - APMAM 1 18537 2055849271 772.36
## - OPBPC 1 1301629 2057132362 772.39
## - APSAB 1 10869771 2066700504 772.58
## <none> 2055830733 774.36
## - APSLAKE 1 163662571 2219493304 775.65
## - OPSLAKE 1 493012936 2548843669 781.60
## - OPRC 1 509894399 2565725132 781.89
##
## Step: AIC=772.36
## BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## - OPBPC 1 1284108 2057133378 770.39
## - APSAB 1 12514566 2068363837 770.62
## <none> 2055849271 772.36
## - APSLAKE 1 176735690 2232584961 773.90
## - OPSLAKE 1 496370866 2552220136 779.66
## - OPRC 1 511413723 2567262994 779.91
##
## Step: AIC=770.39
## BSAAM ~ APSAB + APSLAKE + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## - APSAB 1 11814207 2068947585 768.63
## <none> 2057133378 770.39
## - APSLAKE 1 175480984 2232614362 771.91
## - OPRC 1 510159318 2567292697 777.91
## - OPSLAKE 1 1165227857 3222361235 787.68
##
## Step: AIC=768.63
## BSAAM ~ APSLAKE + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## <none> 2068947585 768.63
## - OPRC 1 531694203 2600641788 776.47
## - APSLAKE 1 621012173 2689959758 777.92
## - OPSLAKE 1 1515918540 3584866125 790.27
##
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE)
##
## Coefficients:
## (Intercept) APSLAKE OPRC OPSLAKE
## 15425 1712 1797 2390
## Backward function
Note that our final model contains only APSLAKE, OPRC, and OPSLAKE, which gave the smallest AIC of all the models considered, 768.63.
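The AIC values printed by stepAIC (774.36 for the full model) do not match AIC(mod0) (898.39). stepAIC relies on extractAIC(), which for linear models drops terms that depend only on the sample size, so the two scales differ by the constant n(1 + ln 2π) + 2. With the 43 observations here that is about 124.03, which is exactly the gap seen above. The sketch below checks this on the built-in mtcars data (a stand-in), with made-up object names.

```r
# Two stand-in models of different size, fit to the same data.
fit_big   <- lm(mpg ~ wt + hp + qsec, data = mtcars)
fit_small <- lm(mpg ~ wt, data = mtcars)
n <- nrow(mtcars)

# extractAIC() returns c(edf, AIC); element 2 is the value stepAIC prints.
gap_big   <- AIC(fit_big)   - extractAIC(fit_big)[2]
gap_small <- AIC(fit_small) - extractAIC(fit_small)[2]

# The gap depends only on n, so it is the same for every model on this data.
expected_gap <- n * (1 + log(2 * pi)) + 2
c(gap_big, gap_small, expected_gap)
```

Because the offset is identical for every model fit to the same data, both scales rank the models in the same order.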
We can also run stepAIC as stepwise (forward) selection by passing extra arguments. We need to change the direction and use scope to define the smallest (lower) and largest (upper) models it may consider.
stepAIC(mod2, direction="forward", scope=list(upper=mod0,lower=mod2))
## Start: AIC=873.65
## BSAAM ~ 1
##
## Df Sum of Sq RSS AIC
## + OPSLAKE 1 2.4087e+10 3.2640e+09 784.24
## + OPRC 1 2.3131e+10 4.2199e+09 795.28
## + OPBPC 1 2.1458e+10 5.8928e+09 809.64
## + APSLAKE 1 1.7004e+09 2.5651e+10 872.89
## + APMAM 1 1.5567e+09 2.5794e+10 873.13
## <none> 2.7351e+10 873.65
## + APSAB 1 9.1891e+08 2.6432e+10 874.18
##
## Step: AIC=784.24
## BSAAM ~ OPSLAKE
##
## Df Sum of Sq RSS AIC
## + APSLAKE 1 663368666 2600641788 776.47
## + APSAB 1 661988129 2602022326 776.49
## + OPRC 1 574050696 2689959758 777.92
## + APMAM 1 524283532 2739726922 778.71
## <none> 3264010454 784.24
## + OPBPC 1 56424 3263954031 786.24
##
## Step: AIC=776.47
## BSAAM ~ OPSLAKE + APSLAKE
##
## Df Sum of Sq RSS AIC
## + OPRC 1 531694203 2068947585 768.63
## <none> 2600641788 776.47
## + APSAB 1 33349091 2567292697 777.91
## + APMAM 1 11041158 2589600630 778.28
## + OPBPC 1 122447 2600519341 778.46
##
## Step: AIC=768.63
## BSAAM ~ OPSLAKE + APSLAKE + OPRC
##
## Df Sum of Sq RSS AIC
## <none> 2068947585 768.63
## + APSAB 1 11814207 2057133378 770.39
## + APMAM 1 1410311 2067537274 770.60
## + OPBPC 1 583748 2068363837 770.62
##
## Call:
## lm(formula = BSAAM ~ OPSLAKE + APSLAKE + OPRC)
##
## Coefficients:
## (Intercept) OPSLAKE APSLAKE OPRC
## 15425 2390 1712 1797
Note that forward selection built the same model as backward elimination, but that will not always be the case.
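One way to check whether the procedures agree is to store each result and compare the selected terms. stepAIC also accepts direction="both", which reconsiders dropping earlier variables after each addition. The sketch below uses the built-in mtcars data as a stand-in, with made-up object names. trace=FALSE only suppresses the step-by-step printout.

```r
library(MASS)

# Stand-in full and intercept-only models (an assumption, not the water data).
full  <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)
empty <- lm(mpg ~ 1, data = mtcars)
scope <- list(lower = formula(empty), upper = formula(full))

# Run all three search directions quietly and keep the selected models.
fwd  <- stepAIC(empty, direction = "forward",  scope = scope, trace = FALSE)
back <- stepAIC(full,  direction = "backward", trace = FALSE)
both <- stepAIC(empty, direction = "both",     scope = scope, trace = FALSE)

# List the predictors each search ended with, sorted so they are easy to compare.
selected <- sapply(
  list(forward = fwd, backward = back, both = both),
  function(m) paste(sort(attr(terms(m), "term.labels")), collapse = " + ")
)
print(selected)
```

If the three strings differ, the searches stopped at different models, and their AIC values can be compared directly to choose among them.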