Concepts

Today in class we finished discussing multicollinearity, but mainly focused on model building. Since I spent much of the previous learning log discussing the repercussions of multicollinearity, I will primarily focus on different aspects of model building in this learning log. Model building is a very broad concept, but the general idea is that we are trying to find the optimal model to use to predict our response variable given the predictor variables available to us. We can compare these models in various ways depending on whether they are nested or not.

Many of the concepts discussed with regard to building linear models apply to all types of model building. Linear models are also useful in their own right for many real-world applications, so knowing how to build them well matters in practice.

Example

Directional

To illustrate the functions that can assist in model building, I’ll work through an example using the OK Cupid dataset. I aim to determine the best model for predicting ideal mate height from the predictor variables available to us: age, height, and sex.

I’m going to use AIC as my criterion for adding or removing predictors, but you could use other criteria, including Mallows’ Cp, for any type of model selection procedure. For a nested model selection procedure, including forward selection, backward elimination, and bidirectional selection, we also have the option of using p-values from t-tests and partial F-tests, or adjusted R-squared values, as our criterion; however, I would discourage anyone from using adjusted R-squared because the other criteria are more rigorous.
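For reference, these criteria are commonly defined as follows. The notation is my own rather than from class: n is the number of observations, p is the number of predictors in the candidate model (not counting the intercept), k is the total number of estimated parameters, L-hat is the maximized likelihood, SSE_p is the candidate model's residual sum of squares, and sigma-hat-squared (full) is the mean squared error of the most complex model under consideration.

```latex
\begin{align*}
\mathrm{AIC} &= -2\ln\hat{L} + 2k, \\
C_p &= \frac{\mathrm{SSE}_p}{\hat{\sigma}^2_{\text{full}}} - n + 2(p + 1), \\
R^2_{\text{adj}} &= 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}.
\end{align*}
```

Smaller values of AIC and Cp are better, while a larger adjusted R-squared is better.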

As I said, I will use AIC as my criterion and will require that any model more complex than my current model improve AIC by at least 10 in order to be chosen over my current model. My first example uses backward elimination. We can use the command stepAIC(starting.model, scope = list(upper = complex.model, lower = simple.model), direction = "backward") from the R package MASS. Here starting.model is the model that the stepwise selection begins with; complex.model is the most complex model we would want to have, which for backward elimination is the same as starting.model; and simple.model is the simplest model we would be willing to end up with. I will specify the simplest model as one that uses only the intercept to predict ideal mate height.

OK <- read.csv("http://cknudson.com/data/OKCupid.csv")
library(MASS)
starting.model <- lm(IdealMateHeight ~ ., data = OK)
complex.model <- starting.model
simple.model <- lm(IdealMateHeight ~ 1, data = OK)
final.model <- stepAIC(starting.model, scope = list(upper = complex.model, lower = simple.model), direction = "backward")
## Start:  AIC=207.79
## IdealMateHeight ~ Sex + Height + Age
## 
##          Df Sum of Sq     RSS    AIC
## - Age     1      0.44  595.59 205.89
## <none>                 595.15 207.79
## - Height  1    303.32  898.47 260.98
## - Sex     1   1188.55 1783.70 352.87
## 
## Step:  AIC=205.89
## IdealMateHeight ~ Sex + Height
## 
##          Df Sum of Sq     RSS    AIC
## <none>                 595.59 205.89
## - Height  1    302.88  898.47 258.98
## - Sex     1   1188.40 1783.99 350.89

We can see from the output that dropping age lowers the AIC from 207.79 to 205.89 and that no further removal lowers it, so the model with the lowest AIC uses sex and height to predict ideal mate height.

We could just as easily have used forward selection by specifying starting.model as the simplest model we would like to obtain and changing direction to "forward".
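A minimal sketch of that forward selection, assuming the same dataset and column names as above. The name forward.model is my own, and I pass the scope as formulas rather than fitted models, which is one of the forms stepAIC accepts.

```r
library(MASS)

OK <- read.csv("http://cknudson.com/data/OKCupid.csv")

# Forward selection starts from the simplest model we would accept.
simple.model <- lm(IdealMateHeight ~ 1, data = OK)

# Scope given as formulas: nothing simpler than the intercept-only model,
# nothing more complex than the model with all three predictors.
forward.model <- stepAIC(simple.model,
                         scope = list(upper = ~ Sex + Height + Age, lower = ~ 1),
                         direction = "forward")

# Formula of the model that forward selection settles on
formula(forward.model)
```

Note that stepAIC adds a term whenever doing so lowers AIC at all, so a stricter rule such as my 10-unit requirement still has to be checked by hand against the printed steps.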

Lastly, each step of this kind of procedure compares nested models, so we could have used any of the model comparison criteria mentioned previously.
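As one illustration of a nested comparison, here is a sketch of a partial F-test for whether age is worth keeping once sex and height are in the model. The object names reduced, full, and f.test are my own.

```r
OK <- read.csv("http://cknudson.com/data/OKCupid.csv")

# The reduced model is nested in the full model: same terms, minus Age.
reduced <- lm(IdealMateHeight ~ Sex + Height, data = OK)
full    <- lm(IdealMateHeight ~ Sex + Height + Age, data = OK)

# Partial F-test of the null hypothesis that the Age coefficient is zero
f.test <- anova(reduced, full)
f.test

# p-value for the added term (second row of the anova table)
f.test[["Pr(>F)"]][2]
```

A large p-value here would point the same way as the AIC comparison: age adds little once sex and height are included.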

All Subsets

We can also use an all subsets method to select a model. To do this we need to create models for all of the subsets of the predictor variables and compare their AIC values. The models compared this way are not all nested in one another, so it is important that we only use AIC or Mallows’ Cp as our criterion.

First, we create all of the models.

one <- lm(IdealMateHeight ~ 1, data = OK)
two <- lm(IdealMateHeight ~ Sex, data = OK)
three <- lm(IdealMateHeight ~ Age, data = OK)
four <- lm(IdealMateHeight ~ Height, data = OK)
five <- lm(IdealMateHeight ~ Height + Age, data = OK)
six <- lm(IdealMateHeight ~ Height + Sex, data = OK)
seven <- lm(IdealMateHeight ~ Age + Sex, data = OK)
eight <- lm(IdealMateHeight ~ Age + Sex + Height, data = OK)

Now, we can compare the AIC values using the command AIC(model, model,…, model). We can insert as many models as we want into the command at one time.

AIC(one, two, three, four, five, six, seven, eight)
##       df      AIC
## one    2 739.1402
## two    3 641.2577
## three  3 741.1309
## four   3 733.1703
## five   4 735.1483
## six    4 588.1644
## seven  4 643.2577
## eight  5 590.0658

Comparing the AIC values, we see that models six and eight have the two lowest values, each more than 10 units below every other model, and that they differ from each other by less than 2 units. Model eight is more complex than model six because it has an additional predictor variable, and it does not improve the AIC, so the best model is model six, the model that uses height and sex to predict ideal mate height.
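With only three predictors it is easy to type out all eight models, but the number of subsets doubles with each added predictor. A sketch of building the same set of models programmatically, assuming the same dataset; the names predictors, subsets, formulas, fits, and aic.table are my own.

```r
OK <- read.csv("http://cknudson.com/data/OKCupid.csv")
predictors <- c("Sex", "Height", "Age")

# Every non-empty subset of the predictors, as a list of character vectors
subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# One formula per subset, plus the intercept-only model
formulas <- c(
  list(IdealMateHeight ~ 1),
  lapply(subsets, function(v) reformulate(v, response = "IdealMateHeight"))
)

# Fit every model and collect the AIC values in one table
fits <- lapply(formulas, lm, data = OK)
aic.table <- data.frame(
  model = sapply(formulas, function(f) paste(deparse(f), collapse = "")),
  df    = sapply(fits, function(m) attr(logLik(m), "df")),
  AIC   = sapply(fits, AIC)
)

# Models sorted from lowest to highest AIC
aic.table[order(aic.table$AIC), ]
```

The sorted table makes it easier to apply the 10-unit rule by eye: start at the top and prefer the simplest model among those within 10 units of the lowest AIC.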

Comparison of Methods

It happens in this example that backward elimination and the all subsets method chose the same best model; however, this is not always the case. Different methods of model selection often produce different best models.

One important quality of AIC to take note of is that an AIC value alone has no meaning. It only holds meaning in comparison to another AIC value calculated by the same formula on the same data. As you may have noticed, the AIC values reported for the optimal model by the two selection methods are very different (205.89 from stepAIC versus 588.1644 from AIC) even though the models are the same. This is because the two functions calculate AIC in different ways: stepAIC reports a version that leaves out additive constants, while AIC includes them.
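To make the difference concrete: as far as I understand R's implementation for lm fits, stepAIC reports the value from extractAIC(), while AIC() works from the full Gaussian log-likelihood and also counts the error variance as a parameter. With n observations, p estimated coefficients (including the intercept), and residual sum of squares RSS, the two calculations are:

```latex
\begin{align*}
\mathrm{AIC}_{\texttt{extractAIC}} &= n\ln\!\left(\frac{\mathrm{RSS}}{n}\right) + 2p, \\
\mathrm{AIC}_{\texttt{AIC()}} &= n\ln\!\left(\frac{\mathrm{RSS}}{n}\right) + n\bigl(1 + \ln 2\pi\bigr) + 2(p + 1), \\
\mathrm{AIC}_{\texttt{AIC()}} - \mathrm{AIC}_{\texttt{extractAIC}} &= n\bigl(1 + \ln 2\pi\bigr) + 2.
\end{align*}
```

The gap depends only on n, so it is the same for every model fit to the same data and does not change which model is preferred. The output above agrees: 588.1644 - 205.89 is about 382.27 for the sex-and-height model, and 590.0658 - 207.79 is about 382.28 for the three-predictor model. A gap of that size is what the formula gives for n = 134, though I have not checked the number of rows in the dataset directly.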

Comparison to Concepts

Model building and model selection connect both to the other modeling topics we have covered and to the class as a whole, since our course deals with different ways to model data. Additionally, methods of finding the optimal model are useful for all types of model creation, so these techniques can be applied to any modeling in future courses or in our future careers.