May 6, 2016
while accounting for model uncertainty!!!
(so just selecting the model with the smallest AIC and doing the usual stats with it misses the main point…)
The goal is to find the sweet spot between underfitting and overfitting:
The proper balance is achieved when bias and variance are controlled to achieve confidence interval coverage at approximately the nominal level and where interval width is at a minimum. (Burnham and Anderson, page 33)
(and this happens when information loss is minimized)
set.seed(1)
x <- 1:10
y <- rnorm(n=10, mean=2+5*exp(-x)+2*x^0.8-0.2*x^2, sd=0.5)
underfit <- lm(y~x)
coef(underfit)
## (Intercept)           x 
##    8.575699   -1.144184
overfit <- lm(y~x+I(x^2)+I(x^3)+I(x^4)+I(x^5)+I(x^6)+I(x^7)+I(x^8)+I(x^9))
coef(overfit)
##   (Intercept)             x        I(x^2)        I(x^3)        I(x^4) 
## -7.441659e+01  1.974493e+02 -1.874012e+02  9.136103e+01 -2.552343e+01 
##        I(x^5)        I(x^6)        I(x^7)        I(x^8)        I(x^9) 
##  4.254808e+00 -4.200975e-01  2.293207e-02 -5.616357e-04  2.167395e-06
The true model is unknown (and always will be)
Magic number: 4-20 models, and always much fewer than the sample size (Burnham and Anderson, pages 19 and 205)
Using AIC and other similar methods one can only hope to select the best model from this set; if good models are not in the set of candidates, they cannot be discovered by model selection (i.e., data analysis) algorithms. (Burnham and Anderson, page 18)
Data dredging [(data dependent model construction or the consideration of all possible models)] should be reserved for early exploratory phases of initial investigation [and] should probably remain unpublished. (Burnham and Anderson, pages 19 and 39)
Journal editors and referees rarely seem to show concern for the validity of results and conclusions where substantial data dredging has occurred. Thus, the entire methodology based on data dredging has been allowed to be perpetuated in an unthinking manner. (Burnham and Anderson, page 40)
The global model is used to:
IMPORTANT
Use the same package/function to fit all models to make sure that the constant term in the likelihood is the same for all models
\(AIC = -2 \times \left( \text{log} L - K \right) = -2 \times \text{log} L + 2 \times K\)
The AIC is an approximation that holds for large sample sizes and for certain kinds of good models!
## *9/10 turns var() (denominator n-1) into the ML estimate of the residual variance (denominator n, here n=10)
log.densities <- dnorm(x = y, mean = predict(underfit),
                       sd = sqrt(var(resid(underfit))*9/10), log=TRUE)
sum(log.densities)
## [1] -18.09026
logLik(underfit)
## 'log Lik.' -18.09026 (df=3)
as.numeric(-2*(logLik(underfit) - 3))
## [1] 42.18053
AIC(underfit)
## [1] 42.18053
\(K =\) 1 (intercept) + number of regression coefficient(s) + number of parameter(s) used to model the residual variance
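A quick check of this count in R, using the underfit model fitted above (the df attribute of logLik() is R's own parameter count):
length(coef(underfit)) + 1    ## intercept + slope + residual variance = 3
attr(logLik(underfit), "df")  ## R counts the same 3 parameters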
\(TIC = -2 \times \text{log}L + 2 \times tr\left(J(\theta) \times I(\theta)^{-1}\right)\)
The \(K \times K\) matrices \(J(\theta)\) and \(I(\theta)\) involve first and second mixed partial derivatives of the log-likelihood function, and \(tr\) denotes the matrix trace function (Burnham and Anderson, page 65)
When all is good, \(tr\left(J(\theta) \times I(\theta)^{-1}\right) \approx K\), and it requires very large datasets to estimate \(J(\theta)\) and \(I(\theta)\) reliably, so AIC remains best in practice!
\(AIC_c = -2 \times \text{log}L + 2K\times\left(\frac{N}{N-K-1}\right) = AIC + \frac{2K(K+1)}{N-K-1}\)
Unless the sample size is large with respect to the number of estimated parameters (i.e. N/K > 40 for the global model), use of AICc is recommended (Burnham and Anderson, page 66)
This correction has been developed for Gaussian models and should not be needed when the variance is known
However, it seems to give good results as long as the underlying distribution of the data is not strongly skewed (Burnham and Anderson, page 328)
So maybe use it unless you have binomial data with very small or very high probabilities, or Poisson data with a low \(\lambda\)… (the literature is unclear on this)
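A small helper that applies this correction (a sketch; aicc() below is our own function, not part of base R):
aicc <- function(model) {           ## sketch: second-order correction for any fitted model
  K <- attr(logLik(model), "df")    ## number of estimated parameters
  N <- nobs(model)                  ## sample size
  AIC(model) + (2*K*(K+1))/(N-K-1)
}
aicc(underfit)                      ## here K = 3 and N = 10, so AICc = AIC + 4 = 46.18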
\(QAIC = -2 \times \left( \frac{\text{log} L}{\hat{c}} - K \right) = -2 \times \frac{\text{log} L}{\hat{c}} + 2 \times K\)
with \(\hat{c}\) the dispersion parameter (or variance inflation factor)
If \(\hat{c}<1\), use AIC
If \(\hat{c}>4\), the model structure is really bad and no result can be trusted
The number of parameters (\(K\)) must now include one additional parameter because the overdispersion \(\hat{c}\) has been estimated!
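A sketch of how \(\hat{c}\) and QAIC could be computed, on purely illustrative simulated count data (the objects below are not used elsewhere in these slides):
set.seed(2)
xx <- seq(0, 2, length=100)
zz <- rnbinom(n=100, mu=exp(1 + 0.5*xx), size=2)     ## negative binomial counts: overdispersed for a Poisson model
global <- glm(zz ~ xx, family=poisson)               ## illustrative global model
c_hat <- sum(residuals(global, type="pearson")^2)/df.residual(global)  ## dispersion estimate
K <- attr(logLik(global), "df") + 1                  ## + 1 because c-hat has been estimated
QAIC <- -2*as.numeric(logLik(global))/c_hat + 2*K
c(c_hat=c_hat, QAIC=QAIC)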
An AIC value by itself is not interpretable; you need to compare AIC values between models
\(\Delta_i = AIC_i - AIC_{min}\) (Burnham and Anderson, page 70)
| \(\Delta_i\) | Level of Empirical Support of Model \(i\) |
|--------------|-------------------------------------------|
| \(0-2\)      | Substantial                               |
| \(4-7\)      | Considerably less                         |
| \(>10\)      | Essentially none                          |
IMPORTANT
If all models but the best one have a very large \(\Delta_i\) you may stop here: this is your best approximate model
In the context of a string of nested models, when there is a single model that is clearly superior (say, the next best model is \(>\) 9–10 AIC units from the minimum) there is little model selection uncertainty and the theoretical standard errors can be used. (Burnham and Anderson, page 88)
(same goes for an AIC weight > 0.9; see next slides, and Burnham and Anderson, page 184)

HOWEVER
In general, the information-theoretic approach should not mean merely searching for a single best model as a basis for inference. (Burnham and Anderson, page 47)
\(w_i = \frac{\exp\left(-\frac{\Delta_i}{2}\right)}{\sum^{R}_{j=1} \exp\left(-\frac{\Delta_j}{2}\right)}\) (with \(R\) = the number of candidate models)
\(w_i\) gives the relative likelihood (or relative strength of evidence) of the model \(i\) given the data (and the set of candidate models considered)
For example, if \(w_i=0.7\), then we predict that 70% of samples coming from your population will give this model \(i\) as best
Therefore, for 30% of the samples the best model will be another one!!!
The inability to ferret out a single best model is not a defect of AIC or any other selection criterion. Rather, it is an indication that the data are simply inadequate to reach such a strong inference.
(Burnham and Anderson, page 80)
This is not a problem if you perform multimodel inference!
An evidence ratio is the ratio between model likelihoods for any two models i and j, which is also given by the ratio between Akaike weights \(\left(\frac{w_i}{w_j}\right)\)
The evidence ratio tells you how much more likely it is that model \(i\) is the best model compared to model \(j\)
For example, if \(w_i=0.80\) and \(w_j=0.032\), the evidence ratio is 0.80/0.032 = 25
This means that the empirical support (strength of evidence) for model \(i\) is 25 times stronger than for model \(j\)
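A quick check in R, with the two hypothetical Akaike weights of the example above:
w_i <- 0.80; w_j <- 0.032
w_i/w_j             ## evidence ratio: 25
2*log(w_i/w_j)      ## the corresponding difference in AIC units between the two models, about 6.4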
set.seed(1)
x <- seq(1, 10, length=100)
y <- rnorm(n=100, mean=2+5*exp(-x)+2*x^0.8-0.2*x^2, sd=0.5)
m5 <- lm(y~poly(x, 5, raw=TRUE)) ## global model
m4 <- lm(y~poly(x, 4, raw=TRUE))
m3 <- lm(y~poly(x, 3, raw=TRUE))
m2 <- lm(y~poly(x, 2, raw=TRUE))
m1 <- lm(y~poly(x, 1, raw=TRUE))
Ks <- sapply(paste("m", 5:1, sep=""),
function(model) length(coef(get(model)))+1)
AICs <- sapply(paste("m", 5:1, sep=""),
function(model) AIC(get(model)))
d <- data.frame(N=length(y), K=Ks, AIC=AICs)
d$AICc <- d$AIC + (2*d$K*(d$K+1))/(d$N-d$K-1)
d
##      N K      AIC     AICc
## m5 100 7 135.4945 136.7119
## m4 100 6 133.8334 134.7366
## m3 100 5 132.6581 133.2964
## m2 100 4 138.6960 139.1171
## m1 100 3 335.1163 335.3663
d <- d[order(d$AIC), ] ## reorder from best to worst
d
##      N K      AIC     AICc
## m3 100 5 132.6581 133.2964
## m4 100 6 133.8334 134.7366
## m5 100 7 135.4945 136.7119
## m2 100 4 138.6960 139.1171
## m1 100 3 335.1163 335.3663
d$AICc <- NULL ## I remove the AICc for clarity
d$delta_i <- round(d$AIC - d$AIC[1], 2)
d
##      N K      AIC delta_i
## m3 100 5 132.6581    0.00
## m4 100 6 133.8334    1.18
## m5 100 7 135.4945    2.84
## m2 100 4 138.6960    6.04
## m1 100 3 335.1163  202.46
exp_terms <- exp(-d$delta_i/2)
d$weights <- round(exp_terms/sum(exp_terms), 2)
d
##      N K      AIC delta_i weights
## m3 100 5 132.6581    0.00    0.54
## m4 100 6 133.8334    1.18    0.30
## m5 100 7 135.4945    2.84    0.13
## m2 100 4 138.6960    6.04    0.03
## m1 100 3 335.1163  202.46    0.00
results <- replicate(1000, {
i <- sample(1:100, replace=TRUE)
newy <- y[i]
newx <- x[i]
newm5 <- lm(newy~poly(newx, 5, raw=TRUE))
newm4 <- lm(newy~poly(newx, 4, raw=TRUE))
newm3 <- lm(newy~poly(newx, 3, raw=TRUE))
newm2 <- lm(newy~poly(newx, 2, raw=TRUE))
newm1 <- lm(newy~poly(newx, 1, raw=TRUE))
best <- which.min(
c(AIC(newm5), AIC(newm4), AIC(newm3), AIC(newm2), AIC(newm1)))
c("m5", "m4", "m3", "m2", "m1")[best]
}
)
table(results)
## results
##  m2  m3  m4  m5 
##  43 567 236 154
d$pi <- round(table(results)[rownames(d)]/1000, 2)
d$pi[is.na(d$pi)] <- 0; d
##      N K      AIC delta_i weights   pi
## m3 100 5 132.6581    0.00    0.54 0.57
## m4 100 6 133.8334    1.18    0.30 0.24
## m5 100 7 135.4945    2.84    0.13 0.15
## m2 100 4 138.6960    6.04    0.03 0.04
## m1 100 3 335.1163  202.46    0.00 0.00
We can do so because we know the model used to simulate the data, but for real applications you won't know it
true_results <- replicate(1000, {
newy <- rnorm(n=100, mean=2+5*exp(-x)+2*x^0.8-0.2*x^2, sd=0.5)
newm5 <- lm(newy~poly(x, 5, raw=TRUE))
newm4 <- lm(newy~poly(x, 4, raw=TRUE))
newm3 <- lm(newy~poly(x, 3, raw=TRUE))
newm2 <- lm(newy~poly(x, 2, raw=TRUE))
newm1 <- lm(newy~poly(x, 1, raw=TRUE))
best <- which.min(
c(AIC(newm5), AIC(newm4), AIC(newm3), AIC(newm2), AIC(newm1)))
c("m5", "m4", "m3", "m2", "m1")[best]
}
)
table(true_results)
## true_results
##  m2  m3  m4  m5 
## 129 430 297 144
d$true_pi <- round(table(true_results)[rownames(d)]/1000, 2)
d$true_pi[is.na(d$true_pi)] <- 0; d
##      N K      AIC delta_i weights   pi true_pi
## m3 100 5 132.6581    0.00    0.54 0.57    0.43
## m4 100 6 133.8334    1.18    0.30 0.24    0.30
## m5 100 7 135.4945    2.84    0.13 0.15    0.14
## m2 100 4 138.6960    6.04    0.03 0.04    0.13
## m1 100 3 335.1163  202.46    0.00 0.00    0.00
We can do so because we know the model used to simulate the data, but for real applications you won't know it
##    K true_pi_100 true_pi_500 true_pi_1000 true_pi_10000
## m3 5        0.43        0.09         0.01             0
## m4 6        0.30        0.55         0.47             0
## m5 7        0.14        0.36         0.53             1
## m2 4        0.13        0.00         0.00             0
## m1 3        0.00        0.00         0.00             0
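The code behind this table is not shown; here is a sketch of how one of its columns could be reproduced (shown for N = 500 only; the object names are ours):
N <- 500
xN <- seq(1, 10, length=N)
true_results_500 <- replicate(1000, {
  yN <- rnorm(n=N, mean=2+5*exp(-xN)+2*xN^0.8-0.2*xN^2, sd=0.5)
  fits <- lapply(5:1, function(p) lm(yN ~ poly(xN, p, raw=TRUE)))
  paste("m", 5:1, sep="")[which.min(sapply(fits, AIC))]
})
round(table(true_results_500)/1000, 2)  ## compare with the true_pi_500 column above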
The idea is to use all candidate models to obtain a single estimate of each parameter
Poor models could degrade the result, but this should not be the case if you built your set of candidate models properly
\(\overline{\beta_j} = \sum^{R}_{i=1} w_i\beta_{ij}\)
If a model \(i\) does not contain the parameter \(\beta_{j}\), you must consider that \(\beta_{ij}=0\)
In each model \(i\) the value for \(\beta_{ij}\) is associated with a (conditional) uncertainty \(\text{var}(\beta_{ij})\)
We can also get the multimodel average of such uncertainty, which accounts for model selection uncertainty as well (again, if a model \(i\) does not contain \(\beta_{ij}\), consider that \(\text{var}(\beta_{ij})=0\))
\(\overline{\text{var}(\beta_j)}=\left(\sum^{R}_{i=1}w_i\sqrt{\text{var}(\beta_{ij}) + \left(\beta_{ij} - \overline{\beta_j}\right)^2}\right)^2\)
list_beta_i3 <- sapply(rownames(d), function(model) coef(get(model)))
d$beta_i3 <- unlist(lapply(list_beta_i3, function(element) element[4]))
d$beta_i3[is.na(d$beta_i3)] <- 0
d$var_beta_i3 <- sapply(rownames(d), function(model)
  ifelse(length(vcov(get(model))) < 16, NA, vcov(get(model))[4, 4]))
d$var_beta_i3[is.na(d$var_beta_i3)] <- 0
summary(m5)
## 
## Call:
## lm(formula = y ~ poly(x, 5, raw = TRUE))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1779 -0.2868 -0.0044  0.2910  1.1342 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              6.4425874  1.3002361   4.955 3.19e-06 ***
## poly(x, 5, raw = TRUE)1 -1.2769781  1.7570676  -0.727    0.469    
## poly(x, 5, raw = TRUE)2  0.6143718  0.8294561   0.741    0.461    
## poly(x, 5, raw = TRUE)3 -0.1342036  0.1756759  -0.764    0.447    
## poly(x, 5, raw = TRUE)4  0.0108539  0.0170647   0.636    0.526    
## poly(x, 5, raw = TRUE)5 -0.0003493  0.0006184  -0.565    0.573    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4582 on 94 degrees of freedom
## Multiple R-squared:  0.9815, Adjusted R-squared:  0.9806 
## F-statistic: 999.5 on 5 and 94 DF,  p-value: < 2.2e-16
##      N K      AIC delta_i weights     beta_i3  var_beta_i3
## m3 100 5 132.6581    0.00    0.54 -0.00911023 1.032938e-05
## m4 100 6 133.8334    1.18    0.30 -0.03654047 9.667686e-04
## m5 100 7 135.4945    2.84    0.13 -0.13420364 3.086201e-02
## m2 100 4 138.6960    6.04    0.03  0.00000000 0.000000e+00
## m1 100 3 335.1163  202.46    0.00  0.00000000 0.000000e+00
beta_3 <- sum(d$weights*d$beta_i3)
beta_3
## [1] -0.03332814
var_beta_3 <- (sum(d$weights*sqrt(d$var_beta_i3+(d$beta_i3-beta_3)^2)))^2
var_beta_3
## [1] 0.002490494
set.seed(1)
new_beta_3 <- replicate(1000, {
i <- sample(1:100, replace=TRUE)
newy <- y[i]; newx <- x[i]
newm5 <- lm(newy~poly(newx, 5, raw=TRUE))
newm4 <- lm(newy~poly(newx, 4, raw=TRUE))
newm3 <- lm(newy~poly(newx, 3, raw=TRUE))
newm2 <- lm(newy~poly(newx, 2, raw=TRUE))
newm1 <- lm(newy~poly(newx, 1, raw=TRUE))
best <- which.min(
c(AIC(newm5), AIC(newm4), AIC(newm3), AIC(newm2), AIC(newm1)))
best_model <- get(c("m5", "m4", "m3", "m2", "m1")[best])
ifelse(length(coef(best_model))<4, 0, coef(best_model)[4])})
c(mean(new_beta_3), var(new_beta_3))
## [1] -0.034447303 0.001951054
Estimates are shrunk toward zero because, in the models that do not contain \(\beta_j\), we assume a null effect (\(\beta_{ij}=0\))
Solution?
Compute the average over all models in which \(x_j\) appears:
\(\overline{\beta_j} = \frac{\sum^{R}_{i=1} w_i I_{ij}\beta_{ij}}{w_{+(j)}}\)
\(w_{+(j)} = \sum^{R}_{i=1} w_i I_{ij}\)
\(I_{ij} = 1\) if the predictor \(x_j\) (i.e. the parameter \(\beta_{j}\)) is in model \(i\)
\(I_{ij} = 0\) otherwise
We may calculate the unconditional sampling variance with the same equation used for \(\overline{\text{var}(\beta_j)}\), applied only to the models in which \(\beta_j\) appears, but for this you need to rescale the weights (so that they sum to one)
d <- subset(d, beta_i3!=0)
beta_3_noshrink <- sum(d$weights * d$beta_i3)/sum(d$weights)
beta_3_noshrink
## [1] -0.03435891
var_beta_3_noshrink <- (sum(d$weights/sum(d$weights)*
sqrt(d$var_beta_i3 +
(d$beta_i3 - beta_3_noshrink)^2)))^2
var_beta_3_noshrink
## [1] 0.002589843
Or, more simply:
d$weights <- d$weights/sum(d$weights)
beta_3_noshrink <- sum(d$weights * d$beta_i3)
beta_3_noshrink
## [1] -0.03435891
(sum(d$weights*sqrt(d$var_beta_i3 + (d$beta_i3 - beta_3_noshrink)^2)))^2
## [1] 0.002589843
set.seed(1)
new_beta_3 <- replicate(1000, {
i <- sample(1:100, replace=TRUE)
newy <- y[i]; newx <- x[i]
newm5 <- lm(newy~poly(newx, 5, raw=TRUE))
newm4 <- lm(newy~poly(newx, 4, raw=TRUE))
newm3 <- lm(newy~poly(newx, 3, raw=TRUE))
newm2 <- lm(newy~poly(newx, 2, raw=TRUE))
newm1 <- lm(newy~poly(newx, 1, raw=TRUE))
best <- which.min(
c(AIC(newm5), AIC(newm4), AIC(newm3), AIC(newm2), AIC(newm1)))
best_model <- get(c("m5", "m4", "m3", "m2", "m1")[best])
ifelse(length(coef(best_model))<4, NA, coef(best_model)[4])}) # NA not 0!!
c(mean(new_beta_3, na.rm=TRUE), var(new_beta_3, na.rm=TRUE))
## [1] -0.036032743 0.001983758
Little change, why? (models without \(x^3\) have \(w_i \sim 0\))
Yes in principle (not always for AICc)
But if you use AIC to compare different error structures, your software/package MUST give you the complete expression of the likelihood (and this can be hard to check)!
(Burnham and Anderson, pages 317-323)
PROBLEMS
Which likelihood (marginal or conditional)? Which method of likelihood computation (ML or REML)? How to count \(K\) with respect to the number of parameters for the variances of the random effects? (NB: this only matters if models differ in their random structure)
See for example:
Careful: the solutions proposed have often been developed for Gaussian models
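For instance, regarding the ML versus REML question, here is a sketch using the lme4 package and its built-in sleepstudy data (this example is ours, not from Burnham and Anderson):
library(lme4)
m_reml <- lmer(Reaction ~ Days + (Days | Subject), data=sleepstudy, REML=TRUE)
m_ml   <- lmer(Reaction ~ Days + (Days | Subject), data=sleepstudy, REML=FALSE)
AIC(m_reml)   ## REML-based AIC
AIC(m_ml)     ## ML-based AIC: a different value for the very same model
## so never mix REML- and ML-based AIC values, and use REML=FALSE when candidate models differ in their fixed effects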