Introduction

Model selection is the problem of choosing one from among a set of candidate models.

It is common to choose a model that performs the best on a hold-out test dataset or to estimate model performance using a resampling technique, such as k-fold cross-validation.

An alternative approach to model selection involves using probabilistic statistical measures that attempt to quantify both the model's performance on the training dataset and the complexity of the model. Examples include the Akaike Information Criterion, the Bayesian Information Criterion, and the Minimum Description Length. The benefit of these information criteria is that they do not require a hold-out test set, although a limitation is that they do not take the uncertainty of the models into account and may end up selecting models that are too simple.

Probabilistic Model Selection

Probabilistic model selection (or “information criteria”) provides an analytical technique for scoring and choosing among candidate models.

Models are scored both on their performance on the training dataset and on the complexity of the model:

  1. Model Performance. How well a candidate model has performed on the training dataset.

  2. Model Complexity. How complicated the candidate model is after training.

Model performance may be evaluated using a probabilistic framework, such as log-likelihood under the framework of maximum likelihood estimation. Model complexity may be evaluated as the number of degrees of freedom or parameters in the model.
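To make these two ingredients concrete, here is a minimal sketch in Python with NumPy. The data, predictor names, and variable names are made up for illustration; it fits a small linear regression and records the residual sum of squares (the performance ingredient, which drives the log-likelihood) and the number of estimated \(\beta\) coefficients (the complexity ingredient).

```python
import numpy as np

# Hypothetical data: 50 observations, two predictors plus an intercept.
rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3 + 2 * x1 - x2 + rng.normal(scale=0.5, size=n)

# Design matrix with an intercept column.
X = np.column_stack([np.ones(n), x1, x2])

# Least squares fit (equivalently, maximum likelihood under normal errors).
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
rss = np.sum((y - X @ beta_hat) ** 2)

# Ingredients for an information criterion:
p = X.shape[1]   # model complexity: number of beta parameters
print(f"RSS = {rss:.3f}, p = {p}")
```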

AIC : Akaike Information Criterion

The first criterion we will discuss is the Akaike Information Criterion, or \(\text{AIC}\) for short. (Note that, when Akaike first introduced this metric, it was simply called An Information Criterion. The A has changed meaning over the years.)

Recall that the maximized log-likelihood of a linear regression model with normally distributed errors can be written as

\[ \log L(\boldsymbol{\hat{\beta}}, \hat{\sigma}^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\left(\frac{\text{RSS}}{n}\right) - \frac{n}{2}, \]

where \(\text{RSS} = \sum_{i=1}^n (y_i - \hat{y}_i) ^ 2\) and \(\boldsymbol{\hat{\beta}}\) and \(\hat{\sigma}^2\) were chosen to maximize the likelihood.
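As a quick illustration (continuing the hypothetical NumPy sketch above), the maximized log-likelihood can be computed directly from \(\text{RSS}\) and \(n\), and it agrees with summing the normal log-densities evaluated at \(\hat{\sigma}^2 = \text{RSS}/n\):

```python
import numpy as np

def max_log_likelihood(rss, n):
    """Maximized log-likelihood of a normal-error regression, from RSS and n."""
    return -n / 2 * np.log(2 * np.pi) - n / 2 * np.log(rss / n) - n / 2

def max_log_likelihood_direct(y, fitted):
    """Same quantity, computed by summing normal log-densities at the MLEs."""
    n = len(y)
    sigma2_hat = np.sum((y - fitted) ** 2) / n   # MLE of the error variance
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2_hat)
                  - (y - fitted) ** 2 / (2 * sigma2_hat))

# With rss, n, y, and X @ beta_hat from the earlier sketch, the two agree:
# max_log_likelihood(rss, n) == max_log_likelihood_direct(y, X @ beta_hat)
```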

Then we can define \(\text{AIC}\) as

\[ \text{AIC} = -2 \log L(\boldsymbol{\hat{\beta}}, \hat{\sigma}^2) + 2p = n + n \log(2\pi) + n \log\left(\frac{\text{RSS}}{n}\right) + 2p, \]

which is a measure of the quality of the model. The smaller the \(\text{AIC}\), the better. To see why, let's talk about the two main components of \(\text{AIC}\): the likelihood (which measures "goodness-of-fit") and the penalty (which is a function of the size of the model).

The likelihood portion of \(\text{AIC}\) is given by

\[ -2 \log L(\boldsymbol{\hat{\beta}}, \hat{\sigma}^2) = n + n \log(2\pi) + n \log\left(\frac{\text{RSS}}{n}\right). \]

For the sake of comparing models, the only term here that will change is \(n \log\left(\frac{\text{RSS}}{n}\right)\), which is a function of \(\text{RSS}\). The \(n + n \log(2\pi)\) terms will be constant across all models applied to the same data. So, when a model fits well, that is, has a low \(\text{RSS}\), this likelihood component will be small.

Similarly, we can discuss the penalty component of \(\text{AIC}\), which is \(2p\), where \(p\) is the number of \(\beta\) parameters in the model. We call this a penalty because it is large when \(p\) is large, but we are seeking a small \(\text{AIC}\).

Thus, a good model, that is, one with a small \(\text{AIC}\), will strike a balance between fitting well and using a small number of parameters. For comparing models

\[ \text{AIC} = n\log\left(\frac{\text{RSS}}{n}\right) + 2p \]

is a sufficient expression, as \(n + n \log(2\pi)\) is the same across all models for any particular dataset.
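A small sketch of this comparison in practice, using hypothetical \(\text{RSS}\) values and model names with \(n = 50\): compute \(n\log(\text{RSS}/n) + 2p\) for each candidate and keep the smallest.

```python
import numpy as np

def aic_comparison(rss, n, p):
    """AIC up to the constant n + n*log(2*pi); fine for comparing models on one dataset."""
    return n * np.log(rss / n) + 2 * p

# Hypothetical candidates: (name, RSS, number of beta parameters).
candidates = [
    ("intercept only", 412.0, 1),
    ("x1",              95.0, 2),
    ("x1 + x2",         13.0, 3),
    ("x1 + x2 + x3",    12.2, 4),
]

n = 50
aic = {name: aic_comparison(rss, n, p) for name, rss, p in candidates}
print(min(aic, key=aic.get))   # model with the smallest AIC
```

With these made-up numbers, the largest model wins by a narrow margin; the \(\text{BIC}\) section below revisits the same candidates with its heavier penalty.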


BIC : Bayesian Information Criterion

The Bayesian Information Criterion, or \(\text{BIC}\), is similar to \(\text{AIC}\), but has a larger penalty. \(\text{BIC}\) also quantifies the trade-off between how well a model fits and the number of model parameters; however, for a reasonable sample size, it generally picks a smaller model than \(\text{AIC}\). Again, for model selection, use the model with the smallest \(\text{BIC}\).

\[ \text{BIC} = -2 \log L(\boldsymbol{\hat{\beta}}, \hat{\sigma}^2) + \log(n) p = n + n\log(2\pi) + n\log\left(\frac{\text{RSS}}{n}\right) + \log(n)p. \]

Notice that the \(\text{AIC}\) penalty was \(2p\), whereas for \(\text{BIC}\), the penalty is \(\log(n) \, p\).

So, for any dataset where \(\log(n) > 2\), that is, \(n \geq 8\), the \(\text{BIC}\) penalty will be larger than the \(\text{AIC}\) penalty, and thus \(\text{BIC}\) will likely prefer a smaller model.

Note that the penalty is sometimes written as a general expression of the form \(k \cdot p\). Then, for \(\text{AIC}\), \(k = 2\), and for \(\text{BIC}\), \(k = \log(n)\).

For comparing models

\[ \text{BIC} = n\log\left(\frac{\text{RSS}}{n}\right) + \log(n)p \]

is again a sufficient expression, as \(n + n \log(2\pi)\) is the same across all models for any particular dataset.
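Revisiting the same hypothetical candidates from the \(\text{AIC}\) sketch, the only change is the penalty multiplier: \(\log(n)\) instead of \(2\). With \(n = 50\), \(\log(50) \approx 3.9\), so with these made-up numbers \(\text{BIC}\) flips the choice to the smaller model.

```python
import numpy as np

def ic_comparison(rss, n, p, k):
    """n*log(RSS/n) + k*p: k = 2 gives AIC, k = log(n) gives BIC (up to a constant)."""
    return n * np.log(rss / n) + k * p

# Same hypothetical candidates as in the AIC sketch: (name, RSS, p), n = 50.
candidates = [("x1", 95.0, 2), ("x1 + x2", 13.0, 3), ("x1 + x2 + x3", 12.2, 4)]
n = 50

aic = {name: ic_comparison(rss, n, p, k=2)         for name, rss, p in candidates}
bic = {name: ic_comparison(rss, n, p, k=np.log(n)) for name, rss, p in candidates}
print("AIC picks:", min(aic, key=aic.get))   # x1 + x2 + x3
print("BIC picks:", min(bic, key=bic.get))   # x1 + x2
```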
