Akaike's Information Criterion (AIC)

M. Drew LaMar
February 22, 2019

Announcements

  • Reading Assignment by Monday
    • Railsback & Grimm, “Agent-Based and Individual-Based Modeling: A Practical Introduction”
      • Chapter 1: Models, Agent-Based Models, and the Modeling Cycle
      • Chapter 2: Getting Started with NetLogo
  • Bring your laptops to class!
  • Lab #5: Predicting Pumpkin Weights: Information Theory and Model Selection
    • Due Monday, March 11, 11:59 pm
  • Homework #4: Intro to NetLogo (Chapter 2)
    • Due Wednesday, March 13, 11:59 pm

Akaike's Information Criterion (AIC)

AIC uses Likelihood and Information Theory to construct another model selection criterion (different from backward elimination or forward selection).

One problem with using adjusted \( R^2 \) values for model selection is it does poorly with out-of-sample prediction. In other words, finding a best-fit model using adjusted \( R^2 \) creates a bias towards the data set that was used.

AIC addresses this out-of-sample prediction issue by its very design. Current state-of-the-art is to use AIC for model selection, and adjusted \( R^2 \) for model validation (all models in the candidate set could be bad!!)

Akaike's Information Criterion (AIC)

\[ AIC = -2\log(\mathcal{L}(\hat{\theta} | x,g) + 2K \]

where

  • \( x \) is the data
  • \( g \) is the model
  • \( K \) is the number of parameters
  • \( \hat{\theta} \) are the parameters of the best-fit model via maximum likelihood
  • \( \log(\mathcal{L}(\hat{\theta} | x,g) \) is the maximum log-likelihood

Parsimony achieved

\[ AIC = -2\log(\mathcal{L}(\hat{\theta}) | x,g) + 2K \]

  • First term reduces bias as more parameters are added
  • Second term increases penalty of adding more parameters

Caveat - need a bias correction term

Corrected AIC

\[ \begin{eqnarray*} AICc & = & -2\log(\mathcal{L}(\hat{\theta}) | x,g) + 2K\left(\frac{n}{n-K-1}\right) \\ & = & AIC + \frac{2K(K+1)}{n-K-1} \end{eqnarray*} \]

where \( n \) is the sample size. When \( n \) is large, AICc converges to AIC.

Example: Hardening of Portland Cement

Hypotheses/Models

g1 <- lm(y ~ 1, data=cement)
g2 <- lm(y ~ x1 + x2, data=cement)
g3 <- lm(y ~ x1 + x2 + x1*x2, data=cement)
g4 <- lm(y ~ x3 + x4, data=cement)
g5 <- lm(y ~ x3 + x4 + x3*x4, data=cement)

Example: Hardening of Portland Cement

Hypotheses/Models

Discuss: What is \( K \) for these models?

Answer: 2, 4, 5, 4, and 5, respectively.

The variance of the residuals is also included in the model as a parameter!!

Example: Hardening of Portland Cement

Hypotheses/Models

The maximum likelihood estimate for the variance of the residuals \( \hat{\sigma}^2 \) is given by \[ \hat{\sigma}^2 = \frac{\textrm{RSS}}{n}, \] where \[ \textrm{RSS} = \sum_{i=1}^{n}\left(\hat{y}_{i} - y_{i}\right)^2. \]

Example: Hardening of Portland Cement

Hypotheses/Models

sig1 <- sum((fitted(g1) - cement$y)^2)/13
sig2 <- sum((fitted(g2) - cement$y)^2)/13
sig3 <- sum((fitted(g3) - cement$y)^2)/13
sig4 <- sum((fitted(g4) - cement$y)^2)/13
sig5 <- sum((fitted(g5) - cement$y)^2)/13
sig <- c(sig1, sig2, sig3, sig4, sig5)

Example: Hardening of Portland Cement

Hypotheses/Models

You can go from RSS to log-likelihood as follows:

\[ \log(\mathcal{L}) = -\frac{n}{2}\log\left(\hat{\sigma}^2\right) \]

Example: Hardening of Portland Cement

Hypotheses/Models

You can go from RSS to log-likelihood as follows:

loglik <- sapply(sig, function (x) {-1*(13/2)*log(x)})

Example: Hardening of Portland Cement

Hypotheses/Models

The corrected Akaike Information Criterion (AICc) is given by

\[ \textrm{AICc} = n\log\left(\hat{\sigma}^2\right) + 2K + \frac{2K(K+1)}{n-K-1}. \]

Example: Hardening of Portland Cement

Hypotheses/Models

K <- c(2,4,5,4,5)
(aicc <- sapply(1:5, function (i) {13*log(sig[i]) + 2*K[i] + (2*K[i]*(K[i]+1))/(13-K[i]-1)}))
[1] 74.64443 32.41999 37.82382 46.85258 51.32391
(rnk <- rank(aicc))
[1] 5 1 2 3 4

Example: Hardening of Portland Cement

Hypotheses/Models

  K        sig     LogLik     AICc Rank
1 2 208.904852 -34.722213 74.64443    5
2 4   4.454191  -9.709995 32.41999    1
3 5   4.397135  -9.626196 37.82382    2
4 4  13.518308 -16.926292 46.85258    3
5 5  12.421406 -16.376238 51.32391    4

Example: Hardening of Portland Cement

Hypotheses/Models

AICtable <- data.frame(Manual = aicc, Package = c(AICc(g1), 
  AICc(g2),
  AICc(g3),
  AICc(g4),
  AICc(g5)))

Example: Hardening of Portland Cement

Hypotheses/Models

    Manual   Package
1 74.64443 111.53683
2 32.41999  69.31239
3 37.82382  74.71622
4 46.85258  83.74499
5 51.32391  88.21631

Example: Hardening of Portland Cement

\[ \Delta_{i} = \mathrm{AICc}_{i} - \mathrm{AICc}_{min}. \]

 Model   Manual   Package  Delta_M  Delta_P
     2 32.41999  69.31239  0.00000  0.00000
     3 37.82382  74.71622  5.40383  5.40383
     4 46.85258  83.74499 14.43259 14.43259
     5 51.32391  88.21631 18.90391 18.90391
     1 74.64443 111.53683 42.22443 42.22443

Example: Hardening of Portland Cement

Model Likelihoods

\[ \mathcal{L}(g_{i} \ | \ \mathrm{data}) \propto \exp\left(-\frac{1}{2}\Delta_{i}\right) \]

Model “Probabilities” (Weights)

\[ w_{i} = \frac{\exp\left(-\frac{1}{2}\Delta_{i}\right)}{\sum_{j=1}^{R}\exp\left(-\frac{1}{2}\Delta_{j}\right)} \]

This is necessary as AICc estimates K-L information. The model probabilities give you a probability each model is the actual K-L best model.

Example: Hardening of Portland Cement

 Model Manual Delta_M          w
     2 32.420  0.0000 9.3643e-01
     3 37.824  5.4038 6.2813e-02
     4 46.853 14.4326 6.8782e-04
     5 51.324 18.9039 7.3543e-05
     1 74.644 42.2244 6.3468e-10

Model averaging (Multimodel Inference)

\[ \mathbf{g} = \sum_{i=1}^{R}w_{i}g_{i}. \]

Example: Hardening of Portland Cement

As an example, create an arbitrary input \( x_{1} \), \( x_{2} \), \( x_{3} \), and \( x_{4} \) and look at the predicted values.

X <- data.frame(x1 = 0.14, x2 = 0.40, x3 = 0.52, x4 = 0.05)
W <- AICtable$w
pred1 <- predict(g1, X)
pred2 <- predict(g2, X)
pred3 <- predict(g3, X)
pred4 <- predict(g4, X)
pred5 <- predict(g5, X)
AICtable$pred <- c(pred1, pred2, pred3, pred4, pred5)

print(AICtable %>% select(Model, w, pred) %>% mutate('w*pred' = w*pred), digits=4, row.names=F)

Example: Hardening of Portland Cement

 Model         w   pred    w*pred
     1 6.347e-10  95.42 6.056e-08
     2 9.364e-01  53.05 4.968e+01
     3 6.281e-02  54.14 3.401e+00
     4 6.878e-04 130.62 8.984e-02
     5 7.354e-05 135.29 9.950e-03

The predicted value for the averaged model is given by

W[1]*pred1 + W[2]*pred2 + W[3]*pred3 + W[4]*pred4 + W[5]*pred5
       1 
53.17581