M. Drew LaMar
March 8, 2021
Definition: The Likelihood of model parameters \( \theta \) given the model (\( g \)) and data (\( x \)) is given by: \[ \mathcal{L}(\theta | x, g) \]
Note: Likelihood theory describes how to find the most likely values of a model's parameters given the data.
Suppose you observe 10 coin flips and you see the following result:
H H H H H H T T T H
Discuss: What’s the most likely value for the probability of heads on an individual coin flip?
Let \( p \) denote the probability of getting heads.
This follows what is known as a binomial model, with the probability of getting 7 heads out of 10 flips given by:
\[ \mathrm{Prob}(7 \ \textrm{heads}) = \left(\begin{array}{c}10 \\ 7\end{array}\right)p^{7}(1-p)^{3} \]
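As a quick sanity check in R (base functions only), this probability can be evaluated at any trial value of \( p \), e.g. \( p = 0.7 \):

# Probability of 7 heads in 10 flips at a trial value p = 0.7
choose(10, 7) * 0.7^7 * (1 - 0.7)^3  # direct formula: ~0.2668
dbinom(7, 10, 0.7)                   # same value via the built-in binomial density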
In this example, the general form \( \mathcal{L}(\theta \ | \ x, g) \) becomes
\[ \mathcal{L}(p \ | \ 10, 7; \mathrm{binomial}) = \left(\begin{array}{c}10 \\ 7\end{array}\right)p^{7}(1-p)^{3} \]
The most likely parameter value given the model and data is the one at which the likelihood function is maximized.
It's usually easier to work with sums than with products, so we use the log-likelihood function instead:
\[ \log(\mathcal{L}(\theta \ | \ x, g)) \]
which in our case becomes
\[ \log\left(\begin{array}{c}10 \\ 7\end{array}\right) + 7\log p +3\log (1-p) \]
Log-likelihood function: [plot of \( \log\mathcal{L} \) versus \( p \), peaking at \( \hat{p} = 0.7 \)]
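Setting the derivative of the log-likelihood to zero, \( \frac{7}{p} - \frac{3}{1-p} = 0 \), gives \( \hat{p} = 7/10 = 0.7 \). A minimal numerical check in R (the constant binomial coefficient is dropped, since it doesn't affect where the maximum occurs):

# Log-likelihood of p for 7 heads in 10 flips (constant term dropped)
loglik_p <- function(p) 7*log(p) + 3*log(1 - p)
# Numerical maximization over (0, 1); the maximum is at p = 0.7
optimize(loglik_p, interval = c(0.001, 0.999), maximum = TRUE)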
AIC (the Akaike Information Criterion) uses likelihood and information theory to construct another model selection criterion (different from backward elimination or forward selection).
One problem with using adjusted \( R^2 \) values for model selection is that it does poorly with out-of-sample prediction. In other words, a best-fit model chosen by adjusted \( R^2 \) is biased toward the particular data set used to fit it.
AIC addresses this out-of-sample prediction issue by its very design. Current state-of-the-art is to use AIC for model selection, and adjusted \( R^2 \) for model validation (all models in the candidate set could be bad!!)
\[ \mathrm{AIC} = -2\log(\mathcal{L}(\hat{\theta} \ | \ x, g)) + 2K \]
where \( \hat{\theta} \) is the maximum likelihood estimate of the model parameters and \( K \) is the number of estimable parameters in the model.
\[ \begin{eqnarray*} \mathrm{AICc} & = & -2\log(\mathcal{L}(\hat{\theta} \ | \ x, g)) + 2K\left(\frac{n}{n-K-1}\right) \\ & = & \mathrm{AIC} + \frac{2K(K+1)}{n-K-1} \end{eqnarray*} \]
where \( n \) is the sample size. When \( n \) is large, AICc converges to AIC.
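To see the convergence numerically, here is the size of the correction term \( \frac{2K(K+1)}{n-K-1} \) for \( K = 5 \) at a few (hypothetical) sample sizes:

# AICc correction term for K = 5 at increasing sample sizes
sapply(c(13, 50, 100, 1000), function(n) 2*5*(5+1)/(n - 5 - 1))
# [1] 8.5714 1.3636 0.6383 0.0604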
# Five candidate models for the cement data (13 observations; response y)
g1 <- lm(y ~ 1, data=cement)                # intercept-only (null) model
g2 <- lm(y ~ x1 + x2, data=cement)          # main effects x1 and x2
g3 <- lm(y ~ x1 + x2 + x1:x2, data=cement)  # x1, x2, and their interaction
g4 <- lm(y ~ x3 + x4, data=cement)          # main effects x3 and x4
g5 <- lm(y ~ x3 + x4 + x3:x4, data=cement)  # x3, x4, and their interaction
Discuss: What is \( K \) for these models?
Answer: 2, 4, 5, 4, and 5, respectively (the intercept, one coefficient per predictor or interaction term, plus one more parameter; see below).
The variance of the residuals is also included in the model as a parameter!!
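A quick way to count \( K \) in R (a sketch, assuming the five models above): the number of regression coefficients plus one for \( \sigma^2 \).

# K = number of regression coefficients + 1 for the residual variance
sapply(list(g1, g2, g3, g4, g5), function(g) length(coef(g)) + 1)
# [1] 2 4 5 4 5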
The maximum likelihood estimate for the variance of the residuals \( \hat{\sigma}^2 \) is given by \[ \hat{\sigma}^2 = \frac{\textrm{RSS}}{n}, \] where \[ \textrm{RSS} = \sum_{i=1}^{n}\left(\hat{y}_{i} - y_{i}\right)^2. \]
# Maximum likelihood estimate of the residual variance for each model: RSS/n
n <- nrow(cement)  # n = 13
sig1 <- sum((fitted(g1) - cement$y)^2)/n
sig2 <- sum((fitted(g2) - cement$y)^2)/n
sig3 <- sum((fitted(g3) - cement$y)^2)/n
sig4 <- sum((fitted(g4) - cement$y)^2)/n
sig5 <- sum((fitted(g5) - cement$y)^2)/n
sig <- c(sig1, sig2, sig3, sig4, sig5)
You can go from RSS to the log-likelihood as follows, dropping an additive constant \( -\frac{n}{2}(\log(2\pi)+1) \) that is identical for every model and so is irrelevant for model comparison:
\[ \log(\mathcal{L}) = -\frac{n}{2}\log\left(\hat{\sigma}^2\right) \]
# Log-likelihood for each model (constants dropped); log() is vectorized
loglik <- -(n/2)*log(sig)
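As a check, base R's logLik() returns the full Gaussian log-likelihood, which should differ from the values above by exactly the dropped constant (shown here for model g2):

# logLik() includes the constant -(n/2)*(log(2*pi) + 1) that we dropped
as.numeric(logLik(g2))             # full Gaussian log-likelihood
loglik[2] - (n/2)*(log(2*pi) + 1)  # manual value plus the constant; should match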
Substituting this log-likelihood (and again dropping the shared constant), the corrected Akaike Information Criterion (AICc) becomes
\[ \textrm{AICc} = n\log\left(\hat{\sigma}^2\right) + 2K + \frac{2K(K+1)}{n-K-1}. \]
# Number of estimable parameters for g1 through g5, including sigma^2
K <- c(2, 4, 5, 4, 5)
# AICc for each model from sigma-hat^2, K, and n
(aicc <- sapply(1:5, function (i) {n*log(sig[i]) + 2*K[i] + (2*K[i]*(K[i]+1))/(n - K[i] - 1)}))
[1] 74.64443 32.41999 37.82382 46.85258 51.32391
# Rank the models from best (lowest AICc) to worst
(rnk <- rank(aicc))
[1] 5 1 2 3 4
  K        sig     LogLik     AICc Rank
1 2 208.904852 -34.722213 74.64443    5
2 4   4.454191  -9.709995 32.41999    1
3 5   4.397135  -9.626196 37.82382    2
4 4  13.518308 -16.926292 46.85258    3
5 5  12.421406 -16.376238 51.32391    4
# AICc() is assumed here to come from the AICcmodavg package
library(AICcmodavg)
AICtable <- data.frame(Manual = aicc, Package = c(AICc(g1),
                                                  AICc(g2),
                                                  AICc(g3),
                                                  AICc(g4),
                                                  AICc(g5)))
    Manual   Package
1 74.64443 111.53683
2 32.41999  69.31239
3 37.82382  74.71622
4 46.85258  83.74499
5 51.32391  88.21631
Note that the Manual and Package columns differ by a constant, \( n(\log(2\pi)+1) \approx 36.89 \): exactly the additive constant we dropped from the log-likelihood. Constants shared by all models cancel when models are compared, which is why we work with AICc differences:
\[ \Delta_{i} = \mathrm{AICc}_{i} - \mathrm{AICc}_{min}. \]
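The code that adds the Model and Delta columns is not shown in the original; a minimal sketch (the tables below are printed sorted by AICc):

# Model labels and AICc differences (rows of AICtable are in model order 1..5)
AICtable$Model   <- 1:5
AICtable$Delta_M <- AICtable$Manual  - min(AICtable$Manual)   # manual deltas
AICtable$Delta_P <- AICtable$Package - min(AICtable$Package)  # package deltas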
 Model   Manual   Package  Delta_M  Delta_P
     2 32.41999  69.31239  0.00000  0.00000
     3 37.82382  74.71622  5.40383  5.40383
     4 46.85258  83.74499 14.43259 14.43259
     5 51.32391  88.21631 18.90391 18.90391
     1 74.64443 111.53683 42.22443 42.22443
The likelihood of model \( g_{i} \) given the data satisfies
\[ \mathcal{L}(g_{i} \ | \ \mathrm{data}) \propto \exp\left(-\frac{1}{2}\Delta_{i}\right) \]
Normalizing over the \( R \) candidate models gives the Akaike weights:
\[ w_{i} = \frac{\exp\left(-\frac{1}{2}\Delta_{i}\right)}{\sum_{j=1}^{R}\exp\left(-\frac{1}{2}\Delta_{j}\right)} \]
The normalization is necessary because AICc estimates only relative Kullback-Leibler (K-L) information. The weight \( w_{i} \) can be read as the probability that model \( i \) is the actual K-L best model in the candidate set.
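A sketch of the weight computation (omitted in the original), which produces the w column below and is used again for prediction later:

# Akaike weights from the manual AICc differences
AICtable$w <- exp(-AICtable$Delta_M/2) / sum(exp(-AICtable$Delta_M/2))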
 Model Manual Delta_M          w
     2 32.420  0.0000 9.3643e-01
     3 37.824  5.4038 6.2813e-02
     4 46.853 14.4326 6.8782e-04
     5 51.324 18.9039 7.3543e-05
     1 74.644 42.2244 6.3468e-10
Model averaging combines the candidate models, weighting each by its Akaike weight:
\[ \mathbf{g} = \sum_{i=1}^{R}w_{i}g_{i}. \]
In practice this means averaging the models' predictions. As an example, create an arbitrary input point \( (x_{1}, x_{2}, x_{3}, x_{4}) \) and look at each model's predicted value.
# An arbitrary input point
X <- data.frame(x1 = 0.14, x2 = 0.40, x3 = 0.52, x4 = 0.05)
W <- AICtable$w  # Akaike weights
# Each candidate model's prediction at X
pred1 <- predict(g1, X)
pred2 <- predict(g2, X)
pred3 <- predict(g3, X)
pred4 <- predict(g4, X)
pred5 <- predict(g5, X)
AICtable$pred <- c(pred1, pred2, pred3, pred4, pred5)
# select(), mutate(), and %>% require the dplyr package
library(dplyr)
print(AICtable %>% select(Model, w, pred) %>% mutate('w*pred' = w*pred), digits=4, row.names=F)
 Model         w   pred     w*pred
     1 6.347e-10  95.42 6.056e-08
     2 9.364e-01  53.05 4.968e+01
     3 6.281e-02  54.14 3.401e+00
     4 6.878e-04 130.62 8.984e-02
     5 7.354e-05 135.29 9.950e-03
The predicted value for the averaged model is given by
# Weighted average of the five model predictions
W[1]*pred1 + W[2]*pred2 + W[3]*pred3 + W[4]*pred4 + W[5]*pred5
       1
53.17581
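Equivalently, since the weights and predictions are now columns of AICtable, the model-averaged prediction can be computed in one vectorized step:

# Same model-averaged prediction, from the table columns
with(AICtable, sum(w * pred))  # 53.17581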