Multicollinearity

Today we continued our discussion of multicollinearity. First we discussed why we want to avoid multicollinearity and what problems it causes. Specifically, it causes problems with point estimates, test statistics, and confidence intervals. If our predictors are correlated, our point estimates become much more unstable, with higher variance. Because the standard errors are larger, multicollinearity pulls our test statistics toward zero. Lastly, the larger SEs give us wider confidence intervals, which means our estimates are less precise. The problem then becomes: how do we decide whether to add a given variable when building a model? How do we know whether it will introduce multicollinearity?
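To convince myself of the standard error claim, here is a small simulation sketch. Everything in it is made up for illustration (the simulated data, the variable names, the seed, the sample size, and the noise levels); none of it comes from class. It fits the same kind of model twice, once with independent predictors and once with a second predictor that is nearly a copy of the first, and compares the standard error of the x1 coefficient.

```r
# Illustrative simulation: all values below are assumptions, not class data.
set.seed(1)
n <- 200
x1 <- rnorm(n)

# Case 1: the second predictor is independent of x1
x2_indep <- rnorm(n)
y_indep <- 1 + 2 * x1 + 3 * x2_indep + rnorm(n)

# Case 2: the second predictor is x1 plus a little noise (highly correlated)
x2_corr <- x1 + rnorm(n, sd = 0.1)
y_corr <- 1 + 2 * x1 + 3 * x2_corr + rnorm(n)

fit_indep <- lm(y_indep ~ x1 + x2_indep)
fit_corr  <- lm(y_corr ~ x1 + x2_corr)

# Standard error of the x1 coefficient in each fit
se_indep <- summary(fit_indep)$coefficients["x1", "Std. Error"]
se_corr  <- summary(fit_corr)$coefficients["x1", "Std. Error"]

c(independent = se_indep, correlated = se_corr)
```

The correlated fit should report a noticeably larger standard error for x1, which is what drives the smaller test statistics and wider confidence intervals described above.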

Adding Predictors

Using R^2 to decide whether a predictor should be included in the model is not reliable and should be avoided. R^2 is simply the proportion of variability in the response that is explained by its linear relationship with the predictors, and it never decreases when a predictor is added, even a useless one. Options that we can use instead include the t-test, the F-test, adjusted R^2, AIC, and Mallows' Cp. AIC and Mallows' Cp were both new to me, but they made a lot of sense once I looked at the formulas. Each of these gives us a way to judge whether a predictor contributes enough to the model to be worth keeping.
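A quick way to see why R^2 alone is misleading is to add a predictor that is pure noise. The sketch below uses R's built-in mtcars data rather than our class data, and the noise column, seed, and model choices are my own assumptions for illustration.

```r
# Illustrative only: mtcars stands in for the class data.
set.seed(42)
cars <- mtcars
cars$noise <- rnorm(nrow(cars))  # pure noise, unrelated to mpg by construction

small <- lm(mpg ~ wt, data = cars)
big   <- lm(mpg ~ wt + noise, data = cars)

# Collect the criteria side by side for the two models
comparison <- data.frame(
  model  = c("mpg ~ wt", "mpg ~ wt + noise"),
  r2     = c(summary(small)$r.squared, summary(big)$r.squared),
  adj_r2 = c(summary(small)$adj.r.squared, summary(big)$adj.r.squared),
  aic    = c(AIC(small), AIC(big))
)
comparison
```

R^2 cannot go down when the noise column is added, while adjusted R^2 and AIC both include a penalty for the extra parameter, so they are free to say the larger model is worse.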

Model Building

We discussed a couple of ways to build models. One is to start with a very complex model and remove variables using some of the criteria above; this is called backward selection. One way we did this was with stepAIC in R (from the MASS package). The function goes step by step, removing one variable at a time until no further removal lowers the AIC. This let us drop unnecessary variables and keep the most important ones. Because AIC penalizes the number of predictors, it was often the case, especially with the OKCupid data, that I had only one or two variables left in the model after running stepAIC.

Alternatively, you can start with a simple model and add variables to it, checking each time whether the new variable actually improves the model. This is called forward selection.

It was also interesting to see how you can customize the stepwise process. You can specify both the simple model and the complex model, and R will add or remove variables between those two until it reaches the lowest AIC it can find with your available predictors.
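Here is a minimal sketch of forward selection with a custom scope, as I understand the scope and direction arguments of stepAIC. It uses the built-in mtcars data instead of the OKCupid data, and the choice of response and candidate predictors is just an assumption for illustration.

```r
library(MASS)  # provides stepAIC

# Simple starting model: intercept only
simple <- lm(mpg ~ 1, data = mtcars)

# scope gives the smallest and largest models the search may visit
search_scope <- list(lower = ~ 1, upper = ~ wt + hp + disp + qsec)

# Forward selection: start simple, add one term at a time while AIC drops
forward_fit <- stepAIC(simple, scope = search_scope,
                       direction = "forward", trace = FALSE)

# Stepwise in both directions: terms may be added or removed at each step
both_fit <- stepAIC(simple, scope = search_scope,
                    direction = "both", trace = FALSE)

formula(forward_fit)
formula(both_fit)
```

Setting trace = FALSE hides the step-by-step tables; leaving it at the default prints them the same way as in the backward example.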

Below is an example of the stepwise AIC process in R, using IdealMateHeight as the response variable. Starting with a complex model, stepAIC removes one variable per step until no removal lowers the AIC any further. The final, simpler model is printed at the bottom of the output.

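library(MASS)  # provides stepAIC
mydata <- read.csv("http://cknudson.com/data/OKCupid.csv")
attach(mydata)  # lets lm() find the columns by name without a data argument
# Full model: Sex-by-Height interaction plus squared terms for Height and Age
newmodel2 <- lm(IdealMateHeight ~ I(Sex)*Height + I(Height^2) + I(Age^2))
# With no scope given, stepAIC searches backward from the full model
stepAIC(newmodel2)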
## Start:  AIC=210.65
## IdealMateHeight ~ I(Sex) * Height + I(Height^2) + I(Age^2)
## 
##                 Df Sum of Sq    RSS    AIC
## - I(Age^2)       1    0.8216 590.94 208.84
## - I(Height^2)    1    4.6646 594.78 209.71
## - I(Sex):Height  1    4.8162 594.94 209.74
## <none>                       590.12 210.65
## 
## Step:  AIC=208.84
## IdealMateHeight ~ I(Sex) + Height + I(Height^2) + I(Sex):Height
## 
##                 Df Sum of Sq    RSS    AIC
## - I(Height^2)    1     4.355 595.30 207.82
## - I(Sex):Height  1     4.491 595.43 207.85
## <none>                       590.94 208.84
## 
## Step:  AIC=207.82
## IdealMateHeight ~ I(Sex) + Height + I(Sex):Height
## 
##                 Df Sum of Sq    RSS    AIC
## - I(Sex):Height  1   0.29097 595.59 205.89
## <none>                       595.30 207.82
## 
## Step:  AIC=205.89
## IdealMateHeight ~ I(Sex) + Height
## 
##          Df Sum of Sq     RSS    AIC
## <none>                 595.59 205.89
## - Height  1    302.88  898.47 258.98
## - I(Sex)  1   1188.40 1783.99 350.89
## 
## Call:
## lm(formula = IdealMateHeight ~ I(Sex) + Height)
## 
## Coefficients:
## (Intercept)      I(Sex)M       Height  
##      39.246       -8.538        0.493
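To make sense of the AIC column in the tables above, it helped me to reproduce the number by hand. For a linear model, stepAIC gets its AIC from extractAIC, which computes n * log(RSS / n) + 2 * (number of estimated coefficients). The sketch below checks that on the built-in mtcars data; the model is an arbitrary choice of mine, not one from class.

```r
# Illustrative check on mtcars, not the OKCupid data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

n   <- nrow(mtcars)
rss <- sum(residuals(fit)^2)
edf <- n - df.residual(fit)  # number of estimated coefficients

# AIC as stepAIC reports it for lm fits
aic_by_hand <- n * log(rss / n) + 2 * edf
aic_step    <- extractAIC(fit)[2]

c(by_hand = aic_by_hand, extractAIC = aic_step)
```

This also shows why dropping a term can lower the AIC even though the RSS goes up a little: the 2 * (number of coefficients) penalty shrinks by 2 for each coefficient removed.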