This focus of this week was model selection. Comparing models is a crucial skill for a statistician to have.

This week I’ll go over a few methods that we went over in class.

For the data this week, I’ll use the swiss dataset we utilized earlier in the week.

swiss <- datasets::swiss
attach(swiss)
summary(swiss)
##    Fertility      Agriculture     Examination      Education    
##  Min.   :35.00   Min.   : 1.20   Min.   : 3.00   Min.   : 1.00  
##  1st Qu.:64.70   1st Qu.:35.90   1st Qu.:12.00   1st Qu.: 6.00  
##  Median :70.40   Median :54.10   Median :16.00   Median : 8.00  
##  Mean   :70.14   Mean   :50.66   Mean   :16.49   Mean   :10.98  
##  3rd Qu.:78.45   3rd Qu.:67.65   3rd Qu.:22.00   3rd Qu.:12.00  
##  Max.   :92.50   Max.   :89.70   Max.   :37.00   Max.   :53.00  
##     Catholic       Infant.Mortality
##  Min.   :  2.150   Min.   :10.80   
##  1st Qu.:  5.195   1st Qu.:18.15   
##  Median : 15.140   Median :20.00   
##  Mean   : 41.144   Mean   :19.94   
##  3rd Qu.: 93.125   3rd Qu.:21.70   
##  Max.   :100.000   Max.   :26.60

I will be using these two models to compare:

model1 <- lm(Fertility ~ Agriculture + Education + Catholic + Infant.Mortality)
model2 <- lm(Fertility ~ Agriculture + Catholic + Infant.Mortality)

Is it a good idea to remove the variable ‘Education’ is the question I will address.

Likelihood Ratio Test

This is also considered a Wilks Test:

logmodel1 <- logLik(model1)
logmodel2 <- logLik(model2)

Now that we have the log-likelihoods we can get our Chi-square test statistic.

Test_Stat <- 2*((as.numeric(logmodel1))-(as.numeric(logmodel2)))
Lik_pvalue <- pchisq(Test_Stat, 1, lower.tail = FALSE)
Lik_pvalue
## [1] 6.880997e-09

This very small p-value should lead us to believe that the distribution fit is significantly improved by using the additional variable of ‘Education’. In other words, ‘Education’ is pretty important to the model and should not be removed.

Wald Test

For the Wald Test, we just need to observe the p-value from the summary output of the bigger model for the variable in question.

summary(model1)
## 
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic + 
##     Infant.Mortality)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.6765  -6.0522   0.7514   3.1664  16.1422 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      62.10131    9.60489   6.466 8.49e-08 ***
## Agriculture      -0.15462    0.06819  -2.267  0.02857 *  
## Education        -0.98026    0.14814  -6.617 5.14e-08 ***
## Catholic          0.12467    0.02889   4.315 9.50e-05 ***
## Infant.Mortality  1.07844    0.38187   2.824  0.00722 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared:  0.6993, Adjusted R-squared:  0.6707 
## F-statistic: 24.42 on 4 and 42 DF,  p-value: 1.717e-10

Notice that the p-value is 5.14e-08, AKA very small and very significant to the model. ‘Education’ has a significant linear relationship with ‘Fertility’ with all the other variables accounted for.

ANOVA & F-Test

The F-test compares the variance of two nested models

anova(model1,model2)
## Analysis of Variance Table
## 
## Model 1: Fertility ~ Agriculture + Education + Catholic + Infant.Mortality
## Model 2: Fertility ~ Agriculture + Catholic + Infant.Mortality
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)    
## 1     42 2158.1                                 
## 2     43 4408.0 -1     -2250 43.789 5.14e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The small p-value indicates a significant improvement in the explanation of variance in the larger model so we should keep ‘Education’ in the model.

AIC

AIC(model1)
## [1] 325.2408
AIC(model2)
## [1] 356.809
BIC(model1)
## [1] 336.3417
BIC(model2)
## [1] 366.0597

AIC and BIC confirm everything already established. Despite the extra variable in model 1, it’s contribution to model fit is worth the additional complexity.