May 6, 2016
while accounting for model uncertainty!!!
(so just selecting the model with the smallest AIC and doing the usual stats with it misses the main point…)
The goal is to find the sweet spot between underfitting and overfitting:
The proper balance is achieved when bias and variance are controlled to achieve confidence interval coverage at approximately the nominal level and where interval width is at a minimum. (Burnham and Anderson, page 33)
(and this happens when information loss is minimized)
set.seed(1)
x <- 1:10
y <- rnorm(n=10, mean=2+5*exp(-x)+2*x^0.8-0.2*x^2, sd=0.5)
underfit <- lm(y~x)
coef(underfit)
## (Intercept)           x 
##    8.575699   -1.144184
overfit <- lm(y~x+I(x^2)+I(x^3)+I(x^4)+I(x^5)+I(x^6)+I(x^7)+I(x^8)+I(x^9))
coef(overfit)
##   (Intercept)             x        I(x^2)        I(x^3)        I(x^4) 
## -7.441659e+01  1.974493e+02 -1.874012e+02  9.136103e+01 -2.552343e+01 
##        I(x^5)        I(x^6)        I(x^7)        I(x^8)        I(x^9) 
##  4.254808e+00 -4.200975e-01  2.293207e-02 -5.616357e-04  2.167395e-06
The true model is unknown (and always will be)
Magic number: 4-20 models, and always much fewer than the sample size (Burnham and Anderson, pages 19 and 205)
Using AIC and other similar methods one can only hope to select the best model from this set; if good models are not in the set of candidates, they cannot be discovered by model selection (i.e., data analysis) algorithms. (Burnham and Anderson, page 18)
Data dredging [(data dependent model construction or the consideration of all possible models)] should be reserved for early exploratory phases of initial investigation [and] should probably remain unpublished. (Burnham and Anderson, pages 19 and 39)
Journal editors and referees rarely seem to show concern for the validity of results and conclusions where substantial data dredging has occurred. Thus, the entire methodology based on data dredging has been allowed to be perpetuated in an unthinking manner. (Burnham and Anderson, page 40)
The global model is used to:
IMPORTANT
Use the same package/function to fit all models to make sure that the constant term in the likelihood is the same for all models
\(AIC = -2 \times \left( \text{log} L - K \right) = -2 \times \text{log} L + 2 \times K\)
The AIC is an approximation that holds for large sample sizes and for certain kinds of good models!
## *9/10 turns var() (denominator n-1) into the ML estimate of the residual variance (denominator n, here n=10)
log.densities <- dnorm(x = y, mean = predict(underfit),
                       sd = sqrt(var(resid(underfit))*9/10), log=TRUE)
sum(log.densities)
## [1] -18.09026
logLik(underfit)
## 'log Lik.' -18.09026 (df=3)
as.numeric(-2*(logLik(underfit) - 3))
## [1] 42.18053
AIC(underfit)
## [1] 42.18053
\(K =\) 1 (intercept) + number of regression coefficient(s) + number of parameter(s) used to model the residual variance
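A quick check of this count in R, using the underfit model fitted above (the df attribute of logLik() is R's own parameter count):
length(coef(underfit)) + 1    ## intercept + slope + residual variance = 3
attr(logLik(underfit), "df")  ## R counts the same 3 parameters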
\(TIC = -2 \times \text{log}L + 2 \times tr\left(J(\theta) \times I(\theta)^{-1}\right)\)
The \(K \times K\) matrices \(J(\theta)\) and \(I(\theta)\) involve first and second mixed partial derivatives of the log-likelihood function, and \(tr\) denotes the matrix trace function (Burnham and Anderson, page 65)
When all is good, \(tr\left(J(\theta) \times I(\theta)^{-1}\right) \approx K\), and it requires very large datasets to estimate \(J(\theta)\) and \(I(\theta)\) reliably, so AIC remains best in practice!
\(AIC_c = -2 \times \text{log}L + 2K\times\left(\frac{N}{N-K-1}\right) = AIC + \frac{2K(K+1)}{N-K-1}\)
Unless the sample size is large with respect to the number of estimated parameters (i.e. N/K > 40 for the global model), use of AICc is recommended (Burnham and Anderson, page 66)
This correction has been developed for Gaussian models and should not be needed when the variance is known
However, it seems to give good results as long as the underlying distribution of the data is not strongly skewed (Burnham and Anderson, page 328)
So maybe use it unless you have binomial data with very small or very high probabilities, or Poisson data with a low \(\lambda\)… (the literature is unclear on this)
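A small helper that applies this correction (a sketch; aicc() below is our own function, not part of base R):
aicc <- function(model) {           ## sketch: second-order correction for any fitted model
  K <- attr(logLik(model), "df")    ## number of estimated parameters
  N <- nobs(model)                  ## sample size
  AIC(model) + (2*K*(K+1))/(N-K-1)
}
aicc(underfit)                      ## here K = 3 and N = 10, so AICc = AIC + 4 = 46.18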
\(QAIC = -2 \times \left( \frac{\text{log} L}{\hat{c}} - K \right) = -2 \times \frac{\text{log} L}{\hat{c}} + 2 \times K\)
with \(\hat{c}\) the dispersion parameter (or variance inflation factor)
If \(\hat{c}<1\), use AIC
If \(\hat{c}>4\), the model structure is really bad and no result can be trusted
The number of parameters (\(K\)) must now include one additional parameter because the overdispersion \(\hat{c}\) has been estimated!
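A sketch of how \(\hat{c}\) and QAIC could be computed, on purely illustrative simulated count data (the objects below are not used elsewhere in these slides):
set.seed(2)
xx <- seq(0, 2, length=100)
zz <- rnbinom(n=100, mu=exp(1 + 0.5*xx), size=2)     ## negative binomial counts: overdispersed for a Poisson model
global <- glm(zz ~ xx, family=poisson)               ## illustrative global model
c_hat <- sum(residuals(global, type="pearson")^2)/df.residual(global)  ## dispersion estimate
K <- attr(logLik(global), "df") + 1                  ## + 1 because c-hat has been estimated
QAIC <- -2*as.numeric(logLik(global))/c_hat + 2*K
c(c_hat=c_hat, QAIC=QAIC)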
An AIC value by itself is not interpretable; you need to compare AIC values between models
\(\Delta_i = AIC_i - AIC_{min}\) (Burnham and Anderson, page 70)
| \(\Delta_i\) | Level of Empirical Support of Model \(i\) |
|--------------|-------------------------------------------|
| \(0-2\)      | Substantial                               |
| \(4-7\)      | Considerably less                         |
| \(>10\)      | Essentially none                          |
IMPORTANT
If all models but the best one have a very large \(\Delta_i\) you may stop here: this is your best approximate model
In the context of a string of nested models, when there is a single model that is clearly superior (say, the next best model is \(>\) 9–10 AIC units from the minimum) there is little model selection uncertainty and the theoretical standard errors can be used. (Burnham and Anderson, page 88)
(same goes for an AIC weight > 0.9; see next slides, and Burnham and Anderson, page 184)

HOWEVER
In general, the information-theoretic approach should not mean merely searching for a single best model as a basis for inference. (Burnham and Anderson, page 47)
\(w_i = \frac{\exp\left(-\frac{\Delta_i}{2}\right)}{\sum^{R}_{j=1} \exp\left(-\frac{\Delta_j}{2}\right)}\) (with \(R\) = the number of candidate models)
\(w_i\) gives the relative likelihood (or relative strength of evidence) of the model \(i\) given the data (and the set of candidate models considered)
For example, if \(w_i=0.7\), then we predict that 70% of samples coming from your population will give this model \(i\) as best
Therefore, for 30% of the samples the best model will be another one!!!
The inability to ferret out a single best model is not a defect of AIC or any other selection criterion. Rather, it is an indication that the data are simply inadequate to reach such a strong inference.
(Burnham and Anderson, page 80)
This is not a problem if you perform multimodel inference!
An evidence ratio is the ratio between model likelihoods for any two models i and j, which is also given by the ratio between Akaike weights \(\left(\frac{w_i}{w_j}\right)\)
The evidence ratio tells you how much more likely it is that model \(i\) is the best model compared to model \(j\)
For example, if \(w_i=0.80\) and \(w_j=0.032\), the evidence ratio is 0.80/0.032 = 25
This means that the empirical support (strength of evidence) for model \(i\) is 25 times stronger than for model \(j\)
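A quick check in R, with the two hypothetical Akaike weights of the example above:
w_i <- 0.80; w_j <- 0.032
w_i/w_j             ## evidence ratio: 25
2*log(w_i/w_j)      ## the corresponding difference in AIC units between the two models, about 6.4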
set.seed(1)
x <- seq(1, 10, length=100)
y <- rnorm(n=100, mean=2+5*exp(-x)+2*x^0.8-0.2*x^2, sd=0.5)
m5 <- lm(y~poly(x, 5, raw=TRUE)) ## global model
m4 <- lm(y~poly(x, 4, raw=TRUE))
m3 <- lm(y~poly(x, 3, raw=TRUE))
m2 <- lm(y~poly(x, 2, raw=TRUE))
m1 <- lm(y~poly(x, 1, raw=TRUE))
Ks <- sapply(paste("m", 5:1, sep=""),
function(model) length(coef(get(model)))+1)
AICs <- sapply(paste("m", 5:1, sep=""),
function(model) AIC(get(model)))
d <- data.frame(N=length(y), K=Ks, AIC=AICs)
d$AICc <- d$AIC + (2*d$K*(d$K+1))/(d$N-d$K-1)
d
##      N K      AIC     AICc
## m5 100 7 135.4945 136.7119
## m4 100 6 133.8334 134.7366
## m3 100 5 132.6581 133.2964
## m2 100 4 138.6960 139.1171
## m1 100 3 335.1163 335.3663
d <- d[order(d$AIC), ] ## reorder from best to worst
d
##      N K      AIC     AICc
## m3 100 5 132.6581 133.2964
## m4 100 6 133.8334 134.7366
## m5 100 7 135.4945 136.7119
## m2 100 4 138.6960 139.1171
## m1 100 3 335.1163 335.3663
d$AICc <- NULL ## I remove the AICc for clarity
d$delta_i <- round(d$AIC - d$AIC[1], 2)
d
##      N K      AIC delta_i
## m3 100 5 132.6581    0.00
## m4 100 6 133.8334    1.18
## m5 100 7 135.4945    2.84
## m2 100 4 138.6960    6.04
## m1 100 3 335.1163  202.46
exp_terms <- exp(-d$delta_i/2)
d$weights <- round(exp_terms/sum(exp_terms), 2)
d
##      N K      AIC delta_i weights
## m3 100 5 132.6581    0.00    0.54
## m4 100 6 133.8334    1.18    0.30
## m5 100 7 135.4945    2.84    0.13
## m2 100 4 138.6960    6.04    0.03
## m1 100 3 335.1163  202.46    0.00
results <- replicate(1000, {
i <- sample(1:100, replace=TRUE)
newy <- y[i]
newx <- x[i]
newm5 <- lm(newy~poly(newx, 5, raw=TRUE))
newm4 <- lm(newy~poly(newx, 4, raw=TRUE))
newm3 <- lm(newy~poly(newx, 3, raw=TRUE))
newm2 <- lm(newy~poly(newx, 2, raw=TRUE))
newm1 <- lm(newy~poly(newx, 1, raw=TRUE))
best <- which.min(
c(AIC(newm5), AIC(newm4), AIC(newm3), AIC(newm2), AIC(newm1)))
c("m5", "m4", "m3", "m2", "m1")[best]
}
)
table(results)
## results
##  m2  m3  m4  m5 
##  43 567 236 154
d$pi <- round(table(results)[rownames(d)]/1000, 2)
d$pi[is.na(d$pi)] <- 0; d
##      N K      AIC delta_i weights   pi
## m3 100 5 132.6581    0.00    0.54 0.57
## m4 100 6 133.8334    1.18    0.30 0.24
## m5 100 7 135.4945    2.84    0.13 0.15
## m2 100 4 138.6960    6.04    0.03 0.04
## m1 100 3 335.1163  202.46    0.00 0.00
We can do so because we know the model used to simulate the data, but for real applications you won't know it
true_results <- replicate(1000, {
newy <- rnorm(n=100, mean=2+5*exp(-x)+2*x^0.8-0.2*x^2, sd=0.5)
newm5 <- lm(newy~poly(x, 5, raw=TRUE))
newm4 <- lm(newy~poly(x, 4, raw=TRUE))
newm3 <- lm(newy~poly(x, 3, raw=TRUE))
newm2 <- lm(newy~poly(x, 2, raw=TRUE))
newm1 <- lm(newy~poly(x, 1, raw=TRUE))
best <- which.min(
c(AIC(newm5), AIC(newm4), AIC(newm3), AIC(newm2), AIC(newm1)))
c("m5", "m4", "m3", "m2", "m1")[best]
}
)
table(true_results)
## true_results
##  m2  m3  m4  m5 
## 129 430 297 144
d$true_pi <- round(table(true_results)[rownames(d)]/1000, 2)
d$true_pi[is.na(d$true_pi)] <- 0; d
##      N K      AIC delta_i weights   pi true_pi
## m3 100 5 132.6581    0.00    0.54 0.57    0.43
## m4 100 6 133.8334    1.18    0.30 0.24    0.30
## m5 100 7 135.4945    2.84    0.13 0.15    0.14
## m2 100 4 138.6960    6.04    0.03 0.04    0.13
## m1 100 3 335.1163  202.46    0.00 0.00    0.00
We can do so because we know the model used to simulate the data, but for real applications you won't know it
##    K true_pi_100 true_pi_500 true_pi_1000 true_pi_10000
## m3 5        0.43        0.09         0.01             0
## m4 6        0.30        0.55         0.47             0
## m5 7        0.14        0.36         0.53             1
## m2 4        0.13        0.00         0.00             0
## m1 3        0.00        0.00         0.00             0
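The code behind this table is not shown; here is a sketch of how one of its columns could be reproduced (shown for N = 500 only; the object names are ours):
N <- 500
xN <- seq(1, 10, length=N)
true_results_500 <- replicate(1000, {
  yN <- rnorm(n=N, mean=2+5*exp(-xN)+2*xN^0.8-0.2*xN^2, sd=0.5)
  fits <- lapply(5:1, function(p) lm(yN ~ poly(xN, p, raw=TRUE)))
  paste("m", 5:1, sep="")[which.min(sapply(fits, AIC))]
})
round(table(true_results_500)/1000, 2)  ## compare with the true_pi_500 column above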
The idea is to use all candidate models to obtain a single estimate of each parameter
Poor models could degrade the result, but this should not be the case if you built your set of candidate models properly
\(\overline{\beta_j} = \sum^{R}_{i=1} w_i\beta_{ij}\)
If a model \(i\) does not contain the parameter \(\beta_{j}\), you must consider that \(\beta_{ij}=0\)
In each model \(i\) the value for \(\beta_{ij}\) is associated with a (conditional) uncertainty \(\text{var}(\beta_{ij})\)
We can also get the multimodel average of such uncertainty, which accounts for model selection uncertainty as well (again, if a model \(i\) does not contain \(\beta_{ij}\), consider that \(\text{var}(\beta_{ij})=0\))
\(\overline{\text{var}(\beta_j)}=\left(\sum^{R}_{i=1}w_i\sqrt{\text{var}(\beta_{ij}) + \left(\beta_{ij} - \overline{\beta_j}\right)^2}\right)^2\)
list_beta_i3 <- sapply(rownames(d), function(model) coef(get(model)))
d$beta_i3 <- unlist(lapply(list_beta_i3, function(element) element[4]))
d$beta_i3[is.na(d$beta_i3)] <- 0
d$var_beta_i3 <- sapply(rownames(d), function(model)
  ifelse(length(vcov(get(model))) < 16, NA, vcov(get(model))[4, 4]))
d$var_beta_i3[is.na(d$var_beta_i3)] <- 0
summary(m5)
## 
## Call:
## lm(formula = y ~ poly(x, 5, raw = TRUE))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1779 -0.2868 -0.0044  0.2910  1.1342 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              6.4425874  1.3002361   4.955 3.19e-06 ***
## poly(x, 5, raw = TRUE)1 -1.2769781  1.7570676  -0.727    0.469    
## poly(x, 5, raw = TRUE)2  0.6143718  0.8294561   0.741    0.461    
## poly(x, 5, raw = TRUE)3 -0.1342036  0.1756759  -0.764    0.447    
## poly(x, 5, raw = TRUE)4  0.0108539  0.0170647   0.636    0.526    
## poly(x, 5, raw = TRUE)5 -0.0003493  0.0006184  -0.565    0.573    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4582 on 94 degrees of freedom
## Multiple R-squared:  0.9815, Adjusted R-squared:  0.9806 
## F-statistic: 999.5 on 5 and 94 DF,  p-value: < 2.2e-16
##      N K      AIC delta_i weights     beta_i3  var_beta_i3
## m3 100 5 132.6581    0.00    0.54 -0.00911023 1.032938e-05
## m4 100 6 133.8334    1.18    0.30 -0.03654047 9.667686e-04
## m5 100 7 135.4945    2.84    0.13 -0.13420364 3.086201e-02
## m2 100 4 138.6960    6.04    0.03  0.00000000 0.000000e+00
## m1 100 3 335.1163  202.46    0.00  0.00000000 0.000000e+00
beta_3 <- sum(d$weights*d$beta_i3)
beta_3
## [1] -0.03332814
var_beta_3 <- (sum(d$weights*sqrt(d$var_beta_i3+(d$beta_i3-beta_3)^2)))^2
var_beta_3
## [1] 0.002490494
set.seed(1)
new_beta_3 <- replicate(1000, {
i <- sample(1:100, replace=TRUE)
newy <- y[i]; newx <- x[i]
newm5 <- lm(newy~poly(newx, 5, raw=TRUE))
newm4 <- lm(newy~poly(newx, 4, raw=TRUE))
newm3 <- lm(newy~poly(newx, 3, raw=TRUE))
newm2 <- lm(newy~poly(newx, 2, raw=TRUE))
newm1 <- lm(newy~poly(newx, 1, raw=TRUE))
best <- which.min(
c(AIC(newm5), AIC(newm4), AIC(newm3), AIC(newm2), AIC(newm1)))
best_model <- get(c("m5", "m4", "m3", "m2", "m1")[best])
ifelse(length(coef(best_model))<4, 0, coef(best_model)[4])})
c(mean(new_beta_3), var(new_beta_3))
## [1] -0.034447303 0.001951054
Estimates are shrunk toward zero because, in the models that do not contain \(\beta_j\), we assume a null effect (\(\beta_{ij}=0\))
Solution?
Compute the average over all models in which \(x_j\) appears:
\(\overline{\beta_j} = \frac{\sum^{R}_{i=1} w_i I_{ij}\beta_{ij}}{w_{+(j)}}\)
\(w_{+(j)} = \sum^{R}_{i=1} w_i I_{ij}\)
\(I_{ij} = 1\) if the predictor \(x_j\) (i.e. the parameter \(\beta_{j}\)) is in model \(i\)
\(I_{ij} = 0\) otherwise
We may calculate the unconditional sampling variance with the same equation used for \(\overline{\text{var}(\beta_j)}\), applied only to the models in which \(\beta_j\) appears, but for this you need to rescale the weights (so that they sum to one)
d <- subset(d, beta_i3!=0)
beta_3_noshrink <- sum(d$weights * d$beta_i3)/sum(d$weights)
beta_3_noshrink
## [1] -0.03435891
var_beta_3_noshrink <- (sum(d$weights/sum(d$weights)*
sqrt(d$var_beta_i3 +
(d$beta_i3 - beta_3_noshrink)^2)))^2
var_beta_3_noshrink
## [1] 0.002589843
Or, more simply:
d$weights <- d$weights/sum(d$weights)
beta_3_noshrink <- sum(d$weights * d$beta_i3)
beta_3_noshrink
## [1] -0.03435891
(sum(d$weights*sqrt(d$var_beta_i3 + (d$beta_i3 - beta_3_noshrink)^2)))^2
## [1] 0.002589843
set.seed(1)
new_beta_3 <- replicate(1000, {
i <- sample(1:100, replace=TRUE)
newy <- y[i]; newx <- x[i]
newm5 <- lm(newy~poly(newx, 5, raw=TRUE))
newm4 <- lm(newy~poly(newx, 4, raw=TRUE))
newm3 <- lm(newy~poly(newx, 3, raw=TRUE))
newm2 <- lm(newy~poly(newx, 2, raw=TRUE))
newm1 <- lm(newy~poly(newx, 1, raw=TRUE))
best <- which.min(
c(AIC(newm5), AIC(newm4), AIC(newm3), AIC(newm2), AIC(newm1)))
best_model <- get(c("m5", "m4", "m3", "m2", "m1")[best])
ifelse(length(coef(best_model))<4, NA, coef(best_model)[4])}) # NA not 0!!
c(mean(new_beta_3, na.rm=TRUE), var(new_beta_3, na.rm=TRUE))
## [1] -0.036032743 0.001983758
Little change, why? (models without \(x^3\) have \(w_i \sim 0\))
Yes in principle (not always for AICc)
But if you use AIC to compare different error structures, your software/package MUST give you the complete expression of the likelihood (and this can be hard to check)!
(Burnham and Anderson, pages 317-323)
PROBLEMS
Which likelihood (marginal or conditional)? Which method of likelihood computation (ML or REML)? How to count \(K\) with respect to the number of parameters for the variances of the random effects? (NB: this only matters if models differ in their random structure)
See for example:
Careful: the solutions proposed have often been developed for Gaussian models
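For instance, regarding the ML versus REML question, here is a sketch using the lme4 package and its built-in sleepstudy data (this example is ours, not from Burnham and Anderson):
library(lme4)
m_reml <- lmer(Reaction ~ Days + (Days | Subject), data=sleepstudy, REML=TRUE)
m_ml   <- lmer(Reaction ~ Days + (Days | Subject), data=sleepstudy, REML=FALSE)
AIC(m_reml)   ## REML-based AIC
AIC(m_ml)     ## ML-based AIC: a different value for the very same model
## so never mix REML- and ML-based AIC values, and use REML=FALSE when candidate models differ in their fixed effects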