On Tuesday, we discussed the types of tools that can be used for model comparison, along with the methods in which we can build models.

Tools for Model Comparison

T-test

We can use the T-test method for adding or dropping \(x_j\). This is a very familiar method, and we can find the test statistic and p-values using the summary function of the model. We then compare this to the level of significance (usually \(\alpha\)=0.05) to see which predictors are significant.

Partial F-test

We can also use the F-test in order to compare models with each other. Again, this is something that we have used in the past. This can be conducted using the summary function similar to the t-test above, giving us the test statistic and p-values and then comparing them to the level of significance to see which predictors of the model are significant. What’s nice about this method is that it works for any predictor.

AIC

AIC, or Akaike Info Criteria is a method that balances model fit with the number of predictors. The AIC value is conducted using the following formula: \[AIC=2(k+1)-2*(log likelihood of model)\] where k+1 is the number of \(\beta\) values in the model. Our goal is to have the smallest or most negative AIC possible since we want to minimize our log likelihood as well as our number of predictors. This value becomes useful once we have 2 or more AIC values, i.e. we have to have enough to compare at least one model to another. As a general rule of thumb, we will use the model with the lower AIC if it’s simpler, or if it’s more complicated, then we need an AIC drop \(\geq\) 10. We need a large drop in AIC to warrant using the more complicated model since a simpler model is generally preferred if the values between the two are close.

Mallow’s \(C_p\)

Lastly, we have Mallow’s \(C_p\). Similar to AIC, a smaller Mallow’s \(C_p\) is better. We ideally want a value that is close to the number of \(\beta\) coefficients, k+1. In order to find this value, we take a subset of the predictors and make a model. If we have a great subset, then the residual sum of squares from the model with subset of predictors is going to estimate that of the model with all of the predictors really well.

Methods of Model Building

Forward Selection

The first method of model building that we discussed is forward selection. With this method, we start by checking the significance of simple models (SLR model), and then gradually add predictors. For adding predictors, we add the most significant and important predictors first, which is determined by the lowest p-value or smallest AIC (with the assumptions that \(p<\alpha\) and the AIC drop \(\geq\) 10). We can stop adding predictors once the remaining ones are no longer significant. I walked through this process below with the OkCupid dataset. First, I conducted forward selection using the t-test method.

OkCupidData<-read.csv("http://cknudson.com/data/OKCupid.csv")
attach(OkCupidData)
head(OkCupidData)

We are going to start with the grand mean model: IdealMateHeight~1. If none of the predictor variables are significant, this is the model we are going to use. Now we want to test each of the predictor variables separately against the response variable using the t-test.

mod1<-lm(IdealMateHeight~Sex)
summary(mod1)
## 
## Call:
## lm(formula = IdealMateHeight ~ Sex)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.147 -1.697  0.303  1.303  6.303 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  71.1471     0.3164  224.88   <2e-16 ***
## SexM         -5.4501     0.4508  -12.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.609 on 132 degrees of freedom
## Multiple R-squared:  0.5255, Adjusted R-squared:  0.5219 
## F-statistic: 146.2 on 1 and 132 DF,  p-value: < 2.2e-16
mod2<-lm(IdealMateHeight~Height)
summary(mod2)
## 
## Call:
## lm(formula = IdealMateHeight ~ Height)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6262 -2.6511 -0.1796  2.7022  8.9938 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 82.47144    4.93571  16.709  < 2e-16 ***
## Height      -0.20665    0.07266  -2.844  0.00516 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.676 on 132 degrees of freedom
## Multiple R-squared:  0.05774,    Adjusted R-squared:  0.0506 
## F-statistic: 8.089 on 1 and 132 DF,  p-value: 0.005163
mod3<-lm(IdealMateHeight~Age)
summary(mod3)
## 
## Call:
## lm(formula = IdealMateHeight ~ Age)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.4880 -2.4797 -0.4506  3.5203  8.5286 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 68.213927   2.623250  26.004   <2e-16 ***
## Age          0.008304   0.086889   0.096    0.924    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.787 on 132 degrees of freedom
## Multiple R-squared:  6.92e-05,   Adjusted R-squared:  -0.007506 
## F-statistic: 0.009135 on 1 and 132 DF,  p-value: 0.924

Sex has the smallest p-value (2e-16), so start with that predictor and add from there. Age can be eliminated since it is insignificant in SLR.

mod2a<-lm(IdealMateHeight~Height+Sex)
summary(mod2a)
## 
## Call:
## lm(formula = IdealMateHeight ~ Height + Sex)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1770 -1.2642  0.2219  1.2602  5.3020 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.2456     3.9171  10.019  < 2e-16 ***
## Height        0.4930     0.0604   8.162 2.38e-13 ***
## SexM         -8.5383     0.5281 -16.168  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.132 on 131 degrees of freedom
## Multiple R-squared:  0.6854, Adjusted R-squared:  0.6806 
## F-statistic: 142.7 on 2 and 131 DF,  p-value: < 2.2e-16

Height has a really small p-value (2.38e-13), so add that predictor variable to the model. Since we have gone through all of the variables, we are left with the following model:

IdealMateHeight~Height+Sex

Now I want to take a look at the AIC method to see what the resulting model will be.

library(MASS)
mod4<-lm(IdealMateHeight~Sex+Age+Height)
simplemod<-lm(IdealMateHeight~1)
stepAIC(simplemod,direction="forward",scope=list(upper=mod4))
## Start:  AIC=356.86
## IdealMateHeight ~ 1
## 
##          Df Sum of Sq     RSS    AIC
## + Sex     1    994.84  898.47 258.98
## + Height  1    109.33 1783.99 350.89
## <none>                1893.31 356.86
## + Age     1      0.13 1893.18 358.86
## 
## Step:  AIC=258.98
## IdealMateHeight ~ Sex
## 
##          Df Sum of Sq    RSS    AIC
## + Height  1    302.88 595.59 205.89
## <none>                898.47 258.98
## + Age     1      0.00 898.47 260.98
## 
## Step:  AIC=205.89
## IdealMateHeight ~ Sex + Height
## 
##        Df Sum of Sq    RSS    AIC
## <none>              595.59 205.89
## + Age   1   0.43821 595.15 207.79
## 
## Call:
## lm(formula = IdealMateHeight ~ Sex + Height)
## 
## Coefficients:
## (Intercept)         SexM       Height  
##      39.246       -8.538        0.493

This leaves us with the same model as the t-test method: IdealMateHeight~Height+Sex This isn’t always going to be the case since we are testing for 2 different things (p-values and AIC values). It wouldn’t be out of the ordinary if they resulted in two different models.

Backward Elimination

The second method for model building is known as backward elimination. With this method, we start with a more complicated model (i.e. using all the predictors that are being tested), and then gradually pare down the model, eliminating the least important predictors. This process can be completed when the remaining predictors shouldn’t be cut, i.e. when the remaining predictors are all statistically significant. Using our OkCupid dataset again, we start with the model that includes all of the predictor variables: Height, Sex, and Age.

mod2b<-lm(IdealMateHeight~Height+Sex+Age)
summary(mod2b)
## 
## Call:
## lm(formula = IdealMateHeight ~ Height + Sex + Age)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1639 -1.2728  0.2494  1.3221  5.2108 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.65668    4.14920   9.558  < 2e-16 ***
## Height       0.49372    0.06066   8.140  2.8e-13 ***
## SexM        -8.54402    0.53027 -16.113  < 2e-16 ***
## Age         -0.01520    0.04913  -0.309    0.758    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.14 on 130 degrees of freedom
## Multiple R-squared:  0.6857, Adjusted R-squared:  0.6784 
## F-statistic: 94.52 on 3 and 130 DF,  p-value: < 2.2e-16

Age is insignificant(p=0.758), so eliminate that predictor variable from the model. Now we are going to test the model with just Height and Sex.

mod2c<-lm(IdealMateHeight~Height+Sex)
summary(mod2c)
## 
## Call:
## lm(formula = IdealMateHeight ~ Height + Sex)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1770 -1.2642  0.2219  1.2602  5.3020 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.2456     3.9171  10.019  < 2e-16 ***
## Height        0.4930     0.0604   8.162 2.38e-13 ***
## SexM         -8.5383     0.5281 -16.168  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.132 on 131 degrees of freedom
## Multiple R-squared:  0.6854, Adjusted R-squared:  0.6806 
## F-statistic: 142.7 on 2 and 131 DF,  p-value: < 2.2e-16

Both Height and Sex are significant predictors, so that means that we use the following model:

IdealMateHeight~Height+Sex

Now, for the AIC method:

stepAIC(mod2b)
## Start:  AIC=207.79
## IdealMateHeight ~ Height + Sex + Age
## 
##          Df Sum of Sq     RSS    AIC
## - Age     1      0.44  595.59 205.89
## <none>                 595.15 207.79
## - Height  1    303.32  898.47 260.98
## - Sex     1   1188.55 1783.70 352.87
## 
## Step:  AIC=205.89
## IdealMateHeight ~ Height + Sex
## 
##          Df Sum of Sq     RSS    AIC
## <none>                 595.59 205.89
## - Height  1    302.88  898.47 258.98
## - Sex     1   1188.40 1783.99 350.89
## 
## Call:
## lm(formula = IdealMateHeight ~ Height + Sex)
## 
## Coefficients:
## (Intercept)       Height         SexM  
##      39.246        0.493       -8.538

Again, like in forward selection, we get the same model through the t-test and AIC methods: IdealMateHeight~Height+Sex Like I said before, this isn’t always going to be the case, and we could end up with different models for different methods.

Note that the models resulting from backwards elimination and forward selection are the same. This doesn’t always occur, as different methods have the potential to yield different results.

All Subsets Regression

Lastly, we have All Subsets Regression, which takes into consideration the model with all the predictors and every model that is a subset of it. In other words, all nested models of the complete model as well as the complete model are going to be looked at in order to determine the best model to use. To compare models, we look at the AIC values from all of the models. This method is more time consuming since you have to look at each nested model, which can get really tricky with more than 3 predictor variables.