This week we went further in-depth on model selection and also covered several tests for comparing models.
We learned about two different methods for building a model: forward selection and backward selection. In forward selection you start with the grand mean model and add the most significant variable one at a time until no remaining variable is significant under a certain threshold, say .01. Backward selection, as the name would indicate, operates the opposite way: you start with a model containing all of the predictors and remove the least significant variable one at a time until all the variables left in the model are significant. (A sketch of a single backward-elimination step appears after the forward example below.)
The following steps walk through an example of forward selection. Each iteration begins with the current model; for the first iteration that is the grand mean model. You then fit a set of candidate models, each adding one variable to the current model, so that the current model is nested inside each candidate with one fewer parameter. You compare each candidate to the current model with anova and collect the F-test p-values, as seen below. The candidate whose added variable is most significant becomes the new current model, and the steps repeat.
Iteration 1
currentmod <- lm(lifeExpF~1, newdata)
mod1 <- lm(lifeExpF~region, newdata)
mod2 <- lm(lifeExpF~group, newdata)
mod3 <- lm(lifeExpF~fertility, newdata)
mod4 <- lm(lifeExpF~ppgdp, newdata)
mod5 <- lm(lifeExpF~pctUrban, newdata)
mod6 <- lm(lifeExpF~infantMortality, newdata)
pvals <- c(anova(currentmod, mod1)[6],
           anova(currentmod, mod2)[6],
           anova(currentmod, mod3)[6],
           anova(currentmod, mod4)[6],
           anova(currentmod, mod5)[6],
           anova(currentmod, mod6)[6])
pvals
## $`Pr(>F)`
## [1] NA 4.222081e-38
##
## $`Pr(>F)`
## [1] NA 1.289346e-42
##
## $`Pr(>F)`
## [1] NA 5.719746e-49
##
## $`Pr(>F)`
## [1] NA 3.673818e-17
##
## $`Pr(>F)`
## [1] NA 6.337908e-20
##
## $`Pr(>F)`
## [1] NA 8.225623e-87
Iteration 2
currentmod <- mod6
mod1 <- lm(lifeExpF~infantMortality + region, newdata)
mod2 <- lm(lifeExpF~infantMortality + group, newdata)
mod3 <- lm(lifeExpF~infantMortality + fertility, newdata)
mod4 <- lm(lifeExpF~infantMortality + ppgdp, newdata)
mod5 <- lm(lifeExpF~infantMortality + pctUrban, newdata)
pvals <- c(anova(currentmod, mod1)[6],
           anova(currentmod, mod2)[6],
           anova(currentmod, mod3)[6],
           anova(currentmod, mod4)[6],
           anova(currentmod, mod5)[6])
pvals
## $`Pr(>F)`
## [1] NA 3.543629e-07
##
## $`Pr(>F)`
## [1] NA 3.85907e-09
##
## $`Pr(>F)`
## [1] NA 0.08196719
##
## $`Pr(>F)`
## [1] NA 0.0005004301
##
## $`Pr(>F)`
## [1] NA 0.07870966
Iteration 3
currentmod <- mod2
mod1 <- lm(lifeExpF~group + infantMortality + region, newdata)
mod2 <- lm(lifeExpF~group + infantMortality + fertility, newdata)
mod3 <- lm(lifeExpF~group + infantMortality + ppgdp, newdata)
mod4 <- lm(lifeExpF~group + infantMortality + pctUrban, newdata)
pvals <- c(anova(currentmod, mod1)[6],
           anova(currentmod, mod2)[6],
           anova(currentmod, mod3)[6],
           anova(currentmod, mod4)[6])
pvals
## $`Pr(>F)`
## [1] NA 0.004085691
##
## $`Pr(>F)`
## [1] NA 0.2436129
##
## $`Pr(>F)`
## [1] NA 0.03402495
##
## $`Pr(>F)`
## [1] NA 0.08777467
Iteration 4. At this iteration, no added variable is significant under the .01 threshold, so the final model is the current model, lifeExpF ~ region + group + infantMortality.
currentmod <- mod1
mod1 <- lm(lifeExpF~region + group + infantMortality + fertility, newdata)
mod2 <- lm(lifeExpF~region + group + infantMortality + ppgdp, newdata)
mod3 <- lm(lifeExpF~region + group + infantMortality + pctUrban, newdata)
pvals <- c(anova(currentmod, mod1)[6],
           anova(currentmod, mod2)[6],
           anova(currentmod, mod3)[6])
pvals
## $`Pr(>F)`
## [1] NA 0.8308651
##
## $`Pr(>F)`
## [1] NA 0.01419862
##
## $`Pr(>F)`
## [1] NA 0.2734193
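For comparison, a single backward-elimination step on the same data could look like the sketch below (not run in class). drop1 compares the full model to each model that removes one term, which is the same F-test we computed by hand above; you would remove the term with the largest p-value, refit, and repeat until every remaining term is significant at the chosen threshold.
# Sketch of one backward-elimination step, starting from the full model
fullmod <- lm(lifeExpF~region + group + fertility + ppgdp + pctUrban + infantMortality, newdata)
drop1(fullmod, test = "F")   # F-test p-value for dropping each term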
We also looked at automatic model selection using the stepAIC function from the MASS package, which chooses the best model, where the best model is defined as the one minimizing AIC. The code can be seen below. stepAIC can do both forward and backward selection.
library(MASS)  # provides stepAIC()
mod <- lm(lifeExpF~., newdata)
# Forward search: mod already contains every predictor, so there is nothing
# left to add and the search stops at the starting model.
forwardmod <- stepAIC(mod, direction = "forward",
                      scope = list(upper = ~infantMortality + group + region +
                                     pctUrban + ppgdp + fertility,
                                   lower = ~1))
## Start: AIC=459.88
## lifeExpF ~ region + group + fertility + ppgdp + pctUrban + infantMortality
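A forward search that mirrors the by-hand procedure would start from the grand mean model instead, so that terms can actually be added one at a time. A sketch is below (nullmod and forwardmod2 are names introduced here, and no output is shown because we did not run this in class):
nullmod <- lm(lifeExpF~1, newdata)
forwardmod2 <- stepAIC(nullmod, direction = "forward",
                       scope = list(upper = ~infantMortality + group + region +
                                      pctUrban + ppgdp + fertility,
                                    lower = ~1))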
# Backward search: repeatedly drop the term whose removal lowers AIC the most.
backwardmod <- stepAIC(mod, direction = "backward")
## Start: AIC=459.88
## lifeExpF ~ region + group + fertility + ppgdp + pctUrban + infantMortality
##
## Df Sum of Sq RSS AIC
## - pctUrban 1 0.03 1846.7 457.89
## - fertility 1 0.66 1847.4 457.95
## <none> 1846.7 459.88
## - ppgdp 1 49.87 1896.6 463.03
## - group 1 54.57 1901.3 463.50
## - region 5 175.42 2022.1 467.40
## - infantMortality 1 1925.05 3771.7 595.71
##
## Step: AIC=457.89
## lifeExpF ~ region + group + fertility + ppgdp + infantMortality
##
## Df Sum of Sq RSS AIC
## - fertility 1 0.68 1847.4 455.96
## <none> 1846.7 457.89
## - group 1 54.69 1901.4 461.52
## - ppgdp 1 62.09 1908.8 462.27
## - region 5 184.25 2031.0 466.24
## - infantMortality 1 2026.61 3873.3 598.84
##
## Step: AIC=455.96
## lifeExpF ~ region + group + ppgdp + infantMortality
##
## Df Sum of Sq RSS AIC
## <none> 1847.4 455.96
## - group 1 54.1 1901.5 459.53
## - ppgdp 1 61.9 1909.3 460.32
## - region 5 198.1 2045.5 465.62
## - infantMortality 1 4088.7 5936.1 679.24
Finally, we looked at several different tests to compare two models. These are briefly outlined below.
Chi-Squared Test: The chi-square distribution is used for several purposes, one of those being the likelihood ratio test. It can also be used for goodness-of-fit tests. With these tests, the test statistic is found by summing, over all categories, the squared difference between the observed and expected counts divided by the expected count. This distribution is right skewed and positive, with a mean equal to the degrees of freedom. The code to find a probability from this distribution is pchisq(y, k), where y is the test statistic and k is the degrees of freedom. It finds P(Y < y), so the upper-tail p-value is 1 - pchisq(y, k).
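As a quick illustration of the goodness-of-fit calculation, here is a small sketch with toy counts invented for the example (the observed and expected vectors are not from our data):
observed <- c(30, 25, 22, 23)   # toy observed counts
expected <- c(25, 25, 25, 25)   # toy expected counts
teststat <- sum((observed - expected)^2 / expected)  # chi-square statistic
k <- length(observed) - 1                            # degrees of freedom
1 - pchisq(teststat, k)                              # upper-tail p-value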
T-Test: This test is often used to make inferences about a population's mean when the population standard deviation is unknown, which is typically the case when working with samples. This distribution is symmetric and centered at 0, and as the degrees of freedom head to infinity it approaches the normal distribution. For a two-sample test, the test statistic is found by taking the difference of the two sample means and dividing by the standard error of that difference, which is based on the pooled sample standard deviation. The code for this distribution is pt(y, df), where y is the test statistic and df is the degrees of freedom. This finds P(Y < y).
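Here is a minimal sketch of the pooled two-sample calculation, with the vectors x and y invented purely for illustration:
x <- c(5.1, 4.8, 5.6, 5.0, 4.9)   # toy sample 1
y <- c(4.2, 4.5, 4.1, 4.7, 4.4)   # toy sample 2
sp <- sqrt(((length(x) - 1) * var(x) + (length(y) - 1) * var(y)) /
             (length(x) + length(y) - 2))              # pooled SD
tstat <- (mean(x) - mean(y)) / (sp * sqrt(1/length(x) + 1/length(y)))
df <- length(x) + length(y) - 2
2 * (1 - pt(abs(tstat), df))                           # two-sided p-value
# t.test(x, y, var.equal = TRUE) reports the same statistic and p-value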
F-Test: The F-distribution is used in ANOVA tests, and is also often used for comparing two different variance estimates for the same parameter. This distribution is right skewed and positive, with a mean of k2/(k2 - 2), where k2 is the denominator degrees of freedom (valid when k2 > 2). The distribution takes two different degrees of freedom parameters, one for the numerator and one for the denominator. For the purposes of this class, we will mostly access the F-distribution through the anova function.
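To connect this back to the model comparisons above, the p-value that anova reports can be reproduced with pf. A short sketch using two of the nested models from the forward-selection example (smallmod and bigmod are names introduced here for illustration):
smallmod <- lm(lifeExpF~infantMortality, newdata)
bigmod <- lm(lifeExpF~infantMortality + group, newdata)
a <- anova(smallmod, bigmod)               # F-test comparing the nested models
1 - pf(a$F[2], a$Df[2], a$Res.Df[2])       # matches a$`Pr(>F)`[2]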