Akaike's Information Criterion (AIC)

M. Drew LaMar
March 8, 2021

Announcements

  • Readings for Wednesday
    • Railsback & Grimm, “Agent-Based and Individual-Based Modeling: A Practical Introduction”
      • Chapter 1: Models, Agent-Based Models, and the Modeling Cycle
      • Chapter 2: Getting Started with NetLogo
  • Exam #1 went live last night at 11:59 pm!!!
    • Due date: Monday, March 15, 11:59 pm
  • Mini-Homework #5: Predicting Pumpkin Weights using AIC
    • Due date: Friday, March 19, 11:59 pm

Likelihood Theory

Definition: The Likelihood of model parameters \( \theta \) given the model (\( g \)) and data (\( x \)) is given by: \[ \mathcal{L}(\theta | x, g) \]

Note: Likelihood theory describes how to find the most likely parameters of a model that fit the data the best.

Likelihood Theory: Example

Suppose you observe 10 coin flips and you see the following result:

H H H H H H T T T H

Discuss: What’s the most likely value for the probability of heads on an individual coin flip?

Let \( p \) denote the probability of getting heads.

This follows what is known as a binomial model, with the probability of getting 7 heads out of 10 given by:

\[ \mathrm{Prob}(7 \ \textrm{heads}) = \left(\begin{array}{c}10 \\ 7\end{array}\right)p^{7}(1-p)^{3} \]

Likelihood Theory: Example

In this example \[ \mathcal{L}(\theta | x, g) = \left(\begin{array}{c}10 \\ 7\end{array}\right)p^{7}(1-p)^{3} \] we have

  • The model \( g \) is the binomial model
  • The data \( x \) is the number of coin flips (10) and number of heads (7). In other words, it's what we observed.
  • The unknown parameter \( \theta \) is \( p \)

Likelihood Theory: Example

\[ \mathcal{L}(p | 10, 7; \mathrm{binomial}) = \left(\begin{array}{c}10 \\ 7\end{array}\right)p^{7}(1-p)^{3} \]

plot of chunk unnamed-chunk-2

Likelihood Theory: Example

The most likely parameter given the model and data is where the likelihood function is maximized.

It's usually easier to deal with summation rather than products, so we look at the log-likelihood function instead:

\[ \log(\mathcal{L}(\theta | g, x)) \]

which in our case becomes

\[ \log\left(\begin{array}{c}10 \\ 7\end{array}\right) + 7\log p +3\log (1-p) \]

Likelihood Theory: Example

Log-likelihood function:

plot of chunk unnamed-chunk-3

Akaike's Information Criterion (AIC)

AIC uses Likelihood and Information Theory to construct another model selection criterion (different from backward elimination or forward selection).

One problem with using adjusted \( R^2 \) values for model selection is it does poorly with out-of-sample prediction. In other words, finding a best-fit model using adjusted \( R^2 \) creates a bias towards the data set that was used.

AIC addresses this out-of-sample prediction issue by its very design. Current state-of-the-art is to use AIC for model selection, and adjusted \( R^2 \) for model validation (all models in the candidate set could be bad!!)

Akaike's Information Criterion (AIC)

\[ AIC = -2\log(\mathcal{L}(\hat{\theta} | x,g) + 2K \]

where

  • \( x \) is the data
  • \( g \) is the model
  • \( K \) is the number of parameters
  • \( \hat{\theta} \) are the parameters of the best-fit model via maximum likelihood
  • \( \log(\mathcal{L}(\hat{\theta} | x,g) \) is the maximum log-likelihood

Parsimony achieved

\[ AIC = -2\log(\mathcal{L}(\hat{\theta}) | x,g) + 2K \]

  • First term reduces bias as more parameters are added
  • Second term increases penalty of adding more parameters

Caveat - need a bias correction term

Corrected AIC

\[ \begin{eqnarray*} AICc & = & -2\log(\mathcal{L}(\hat{\theta}) | x,g) + 2K\left(\frac{n}{n-K-1}\right) \\ & = & AIC + \frac{2K(K+1)}{n-K-1} \end{eqnarray*} \]

where \( n \) is the sample size. When \( n \) is large, AICc converges to AIC.

Example: Hardening of Portland Cement

Hypotheses/Models

g1 <- lm(y ~ 1, data=cement)
g2 <- lm(y ~ x1 + x2, data=cement)
g3 <- lm(y ~ x1 + x2 + x1*x2, data=cement)
g4 <- lm(y ~ x3 + x4, data=cement)
g5 <- lm(y ~ x3 + x4 + x3*x4, data=cement)

Example: Hardening of Portland Cement

Hypotheses/Models

Discuss: What is \( K \) for these models?

Answer: 2, 4, 5, 4, and 5, respectively.

The variance of the residuals is also included in the model as a parameter!!

Example: Hardening of Portland Cement

Hypotheses/Models

The maximum likelihood estimate for the variance of the residuals \( \hat{\sigma}^2 \) is given by \[ \hat{\sigma}^2 = \frac{\textrm{RSS}}{n}, \] where \[ \textrm{RSS} = \sum_{i=1}^{n}\left(\hat{y}_{i} - y_{i}\right)^2. \]

Example: Hardening of Portland Cement

Hypotheses/Models

sig1 <- sum((fitted(g1) - cement$y)^2)/13
sig2 <- sum((fitted(g2) - cement$y)^2)/13
sig3 <- sum((fitted(g3) - cement$y)^2)/13
sig4 <- sum((fitted(g4) - cement$y)^2)/13
sig5 <- sum((fitted(g5) - cement$y)^2)/13
sig <- c(sig1, sig2, sig3, sig4, sig5)

Example: Hardening of Portland Cement

Hypotheses/Models

You can go from RSS to log-likelihood as follows:

\[ \log(\mathcal{L}) = -\frac{n}{2}\log\left(\hat{\sigma}^2\right) \]

Example: Hardening of Portland Cement

Hypotheses/Models

You can go from RSS to log-likelihood as follows:

loglik <- sapply(sig, function (x) {-1*(13/2)*log(x)})

Example: Hardening of Portland Cement

Hypotheses/Models

The corrected Akaike Information Criterion (AICc) is given by

\[ \textrm{AICc} = n\log\left(\hat{\sigma}^2\right) + 2K + \frac{2K(K+1)}{n-K-1}. \]

Example: Hardening of Portland Cement

Hypotheses/Models

K <- c(2,4,5,4,5)
(aicc <- sapply(1:5, function (i) {13*log(sig[i]) + 2*K[i] + (2*K[i]*(K[i]+1))/(13-K[i]-1)}))
[1] 74.64443 32.41999 37.82382 46.85258 51.32391
(rnk <- rank(aicc))
[1] 5 1 2 3 4

Example: Hardening of Portland Cement

Hypotheses/Models

  K        sig     LogLik     AICc Rank
1 2 208.904852 -34.722213 74.64443    5
2 4   4.454191  -9.709995 32.41999    1
3 5   4.397135  -9.626196 37.82382    2
4 4  13.518308 -16.926292 46.85258    3
5 5  12.421406 -16.376238 51.32391    4

Example: Hardening of Portland Cement

Hypotheses/Models

AICtable <- data.frame(Manual = aicc, Package = c(AICc(g1), 
  AICc(g2),
  AICc(g3),
  AICc(g4),
  AICc(g5)))

Example: Hardening of Portland Cement

Hypotheses/Models

    Manual   Package
1 74.64443 111.53683
2 32.41999  69.31239
3 37.82382  74.71622
4 46.85258  83.74499
5 51.32391  88.21631

Example: Hardening of Portland Cement

\[ \Delta_{i} = \mathrm{AICc}_{i} - \mathrm{AICc}_{min}. \]

 Model   Manual   Package  Delta_M  Delta_P
     2 32.41999  69.31239  0.00000  0.00000
     3 37.82382  74.71622  5.40383  5.40383
     4 46.85258  83.74499 14.43259 14.43259
     5 51.32391  88.21631 18.90391 18.90391
     1 74.64443 111.53683 42.22443 42.22443

Example: Hardening of Portland Cement

Model Likelihoods

\[ \mathcal{L}(g_{i} \ | \ \mathrm{data}) \propto \exp\left(-\frac{1}{2}\Delta_{i}\right) \]

Model “Probabilities” (Weights)

\[ w_{i} = \frac{\exp\left(-\frac{1}{2}\Delta_{i}\right)}{\sum_{j=1}^{R}\exp\left(-\frac{1}{2}\Delta_{j}\right)} \]

This is necessary as AICc estimates K-L information. The model probabilities give you a probability each model is the actual K-L best model.

Example: Hardening of Portland Cement

 Model Manual Delta_M          w
     2 32.420  0.0000 9.3643e-01
     3 37.824  5.4038 6.2813e-02
     4 46.853 14.4326 6.8782e-04
     5 51.324 18.9039 7.3543e-05
     1 74.644 42.2244 6.3468e-10

Model averaging (Multimodel Inference)

\[ \mathbf{g} = \sum_{i=1}^{R}w_{i}g_{i}. \]

Example: Hardening of Portland Cement

As an example, create an arbitrary input \( x_{1} \), \( x_{2} \), \( x_{3} \), and \( x_{4} \) and look at the predicted values.

X <- data.frame(x1 = 0.14, x2 = 0.40, x3 = 0.52, x4 = 0.05)
W <- AICtable$w
pred1 <- predict(g1, X)
pred2 <- predict(g2, X)
pred3 <- predict(g3, X)
pred4 <- predict(g4, X)
pred5 <- predict(g5, X)
AICtable$pred <- c(pred1, pred2, pred3, pred4, pred5)

print(AICtable %>% select(Model, w, pred) %>% mutate('w*pred' = w*pred), digits=4, row.names=F)

Example: Hardening of Portland Cement

 Model         w   pred    w*pred
     1 6.347e-10  95.42 6.056e-08
     2 9.364e-01  53.05 4.968e+01
     3 6.281e-02  54.14 3.401e+00
     4 6.878e-04 130.62 8.984e-02
     5 7.354e-05 135.29 9.950e-03

The predicted value for the averaged model is given by

W[1]*pred1 + W[2]*pred2 + W[3]*pred3 + W[4]*pred4 + W[5]*pred5
       1 
53.17581