This focus of this week was model selection. Comparing models is a crucial skill for a statistician to have.
This week I’ll go over a few methods that we went over in class.
For the data this week, I’ll use the swiss dataset we utilized earlier in the week.
swiss <- datasets::swiss
attach(swiss)
summary(swiss)
## Fertility Agriculture Examination Education
## Min. :35.00 Min. : 1.20 Min. : 3.00 Min. : 1.00
## 1st Qu.:64.70 1st Qu.:35.90 1st Qu.:12.00 1st Qu.: 6.00
## Median :70.40 Median :54.10 Median :16.00 Median : 8.00
## Mean :70.14 Mean :50.66 Mean :16.49 Mean :10.98
## 3rd Qu.:78.45 3rd Qu.:67.65 3rd Qu.:22.00 3rd Qu.:12.00
## Max. :92.50 Max. :89.70 Max. :37.00 Max. :53.00
## Catholic Infant.Mortality
## Min. : 2.150 Min. :10.80
## 1st Qu.: 5.195 1st Qu.:18.15
## Median : 15.140 Median :20.00
## Mean : 41.144 Mean :19.94
## 3rd Qu.: 93.125 3rd Qu.:21.70
## Max. :100.000 Max. :26.60
I will be using these two models to compare:
model1 <- lm(Fertility ~ Agriculture + Education + Catholic + Infant.Mortality)
model2 <- lm(Fertility ~ Agriculture + Catholic + Infant.Mortality)
Is it a good idea to remove the variable ‘Education’ is the question I will address.
This is also considered a Wilks Test:
logmodel1 <- logLik(model1)
logmodel2 <- logLik(model2)
Now that we have the log-likelihoods we can get our Chi-square test statistic.
Test_Stat <- 2*((as.numeric(logmodel1))-(as.numeric(logmodel2)))
Lik_pvalue <- pchisq(Test_Stat, 1, lower.tail = FALSE)
Lik_pvalue
## [1] 6.880997e-09
This very small p-value should lead us to believe that the distribution fit is significantly improved by using the additional variable of ‘Education’. In other words, ‘Education’ is pretty important to the model and should not be removed.
For the Wald Test, we just need to observe the p-value from the summary output of the bigger model for the variable in question.
summary(model1)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic +
## Infant.Mortality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.6765 -6.0522 0.7514 3.1664 16.1422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.10131 9.60489 6.466 8.49e-08 ***
## Agriculture -0.15462 0.06819 -2.267 0.02857 *
## Education -0.98026 0.14814 -6.617 5.14e-08 ***
## Catholic 0.12467 0.02889 4.315 9.50e-05 ***
## Infant.Mortality 1.07844 0.38187 2.824 0.00722 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared: 0.6993, Adjusted R-squared: 0.6707
## F-statistic: 24.42 on 4 and 42 DF, p-value: 1.717e-10
Notice that the p-value is 5.14e-08, AKA very small and very significant to the model. ‘Education’ has a significant linear relationship with ‘Fertility’ with all the other variables accounted for.
The F-test compares the variance of two nested models
anova(model1,model2)
## Analysis of Variance Table
##
## Model 1: Fertility ~ Agriculture + Education + Catholic + Infant.Mortality
## Model 2: Fertility ~ Agriculture + Catholic + Infant.Mortality
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 42 2158.1
## 2 43 4408.0 -1 -2250 43.789 5.14e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The small p-value indicates a significant improvement in the explanation of variance in the larger model so we should keep ‘Education’ in the model.
AIC(model1)
## [1] 325.2408
AIC(model2)
## [1] 356.809
BIC(model1)
## [1] 336.3417
BIC(model2)
## [1] 366.0597
AIC and BIC confirm everything already established. Despite the extra variable in model 1, it’s contribution to model fit is worth the additional complexity.