Problem Statement

Suppose that \(y_1, \ldots, y_n\) are random observations from a model \(f(y \mid \theta)\), and that a prior distribution \(\pi(\theta)\) is assumed for \(\theta\). Let \(\tilde{Y}_i\) denote the future observation corresponding to \(y_i\), with posterior predictive distribution \(f(\tilde{y}_i \mid y)\), where \(y = (y_1, \ldots, y_n)\).


Part (a)

Assume the loss function:

\[ L(\tilde{Y}, y) = \sum_{i=1}^n (\tilde{Y}_i - y_i)^2 \]

where \(\tilde{Y} = (\tilde{Y}_1, \ldots, \tilde{Y}_n)\).

We need to show:

\[ E\left[ L(\tilde{Y}, y) \mid y \right] = \sum_{i=1}^n E\left[ \left( \tilde{Y}_i - E(\tilde{Y}_i \mid y) \right)^2 \mid y \right] + \sum_{i=1}^n \left( y_i - E(\tilde{Y}_i \mid y) \right)^2 \]

Derivation

Let \(m_i = E(\tilde{Y}_i \mid y)\). For each \(i\):

\[ \begin{aligned} E\left[ (\tilde{Y}_i - y_i)^2 \mid y \right] &= E\left[ (\tilde{Y}_i - m_i + m_i - y_i)^2 \mid y \right] \\ &= E\left[ (\tilde{Y}_i - m_i)^2 \mid y \right] + 2 E\left[ (\tilde{Y}_i - m_i)(m_i - y_i) \mid y \right] + (m_i - y_i)^2 \end{aligned} \]

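Since \(m_i\) and \(y_i\) are fixed given \(y\), the factor \((m_i - y_i)\) comes out of the expectation, and \(E[\tilde{Y}_i - m_i \mid y] = m_i - m_i = 0\) by the definition of \(m_i\). The cross term therefore vanishes: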

\[ E\left[ (\tilde{Y}_i - m_i)(m_i - y_i) \mid y \right] = (m_i - y_i) \cdot E\left[ \tilde{Y}_i - m_i \mid y \right] = 0 \]

Thus:

\[ E\left[ (\tilde{Y}_i - y_i)^2 \mid y \right] = E\left[ (\tilde{Y}_i - m_i)^2 \mid y \right] + (m_i - y_i)^2 \]

Summing over \(i = 1, \ldots, n\):

\[ E\left[ L(\tilde{Y}, y) \mid y \right] = \sum_{i=1}^n E\left[ (\tilde{Y}_i - m_i)^2 \mid y \right] + \sum_{i=1}^n (m_i - y_i)^2 \]

Substituting back \(m_i = E(\tilde{Y}_i \mid y)\) completes the proof.

Interpretation of the Two Terms

First Term: Predictive Variance

\[ \sum_{i=1}^n \text{Var}(\tilde{Y}_i \mid y) = \sum_{i=1}^n E\left[ (\tilde{Y}_i - E(\tilde{Y}_i \mid y))^2 \mid y \right] \]

This represents the predictive uncertainty about a new observation: how much the future observation \(\tilde{Y}_i\) varies around its posterior predictive mean. It combines sampling variability under the model with posterior uncertainty about \(\theta\), so it is not purely irreducible noise. This term depends on the observed data only through the conditioning on \(y\).

Second Term: Squared Bias (Lack of Fit)

\[ \sum_{i=1}^n \left( y_i - E(\tilde{Y}_i \mid y) \right)^2 \]

This measures how well the model’s predictions match the observed data. It is the squared distance between each observed value \(y_i\) and its posterior predictive mean. A large value indicates systematic disagreement between the model and the data.

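Together, the two terms decompose the expected predictive squared error loss in a way analogous to the classical bias-variance decomposition, with \(y_i - E(\tilde{Y}_i \mid y)\) playing the role of the bias.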

Monte Carlo Estimation

For each \(i\), let \(\tilde{y}_i^{(1)}, \tilde{y}_i^{(2)}, \ldots, \tilde{y}_i^{(S)}\) be \(S\) Monte Carlo draws from the posterior predictive distribution \(f(\tilde{y}_i \mid y)\).

  1. Estimate \(m_i\) (posterior predictive mean): \[ \hat{m}_i = \frac{1}{S} \sum_{s=1}^S \tilde{y}_i^{(s)} \]

  2. Estimate predictive variance (first term): \[ \widehat{\text{Var}}(\tilde{Y}_i \mid y) = \frac{1}{S} \sum_{s=1}^S \left( \tilde{y}_i^{(s)} - \hat{m}_i \right)^2 \]

  3. Estimate squared bias (second term): \[ (y_i - \hat{m}_i)^2 \]

Then the estimated expected loss is:

\[ \widehat{E[L \mid y]} = \sum_{i=1}^n \widehat{\text{Var}}(\tilde{Y}_i \mid y) + \sum_{i=1}^n (y_i - \hat{m}_i)^2 \]
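The three estimation steps above can be sketched in a few lines of NumPy. This is a minimal sketch, not a definitive implementation: the function name `expected_squared_loss`, the array layout (draws in rows, observations in columns), and the stand-in normal draws are all assumptions made for illustration. In practice `y_rep` would come from the fitted model's posterior predictive distribution.

```python
import numpy as np

def expected_squared_loss(y, y_rep):
    """Estimate E[L | y] from posterior predictive draws.

    y     : observed data, shape (n,)
    y_rep : posterior predictive draws, shape (S, n); row s holds
            (y_tilde_1^(s), ..., y_tilde_n^(s)).
    Returns (variance term, fit term, total).
    """
    y = np.asarray(y, dtype=float)
    y_rep = np.asarray(y_rep, dtype=float)
    m_hat = y_rep.mean(axis=0)                   # step 1: predictive means
    var_term = y_rep.var(axis=0, ddof=0).sum()   # step 2: 1/S variance, summed over i
    fit_term = ((y - m_hat) ** 2).sum()          # step 3: squared distance to the means
    return var_term, fit_term, var_term + fit_term

# Stand-in draws for illustration only (hypothetical, not from a real model).
rng = np.random.default_rng(0)
y = np.array([1.2, 0.7, 2.1, 1.6])
y_rep = rng.normal(loc=1.4, scale=1.0, size=(5000, y.size))

var_term, fit_term, total = expected_squared_loss(y, y_rep)

# Direct Monte Carlo average of the loss, for comparison with the decomposition.
direct = ((y_rep - y) ** 2).sum(axis=1).mean()
```

Because the variance uses the \(1/S\) divisor, the decomposed estimate `total` agrees with the direct average `direct` up to floating-point error, mirroring the identity proved above.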

Using This for Model Comparison

To compare candidate models:

  1. For each model, compute the estimated expected loss using Monte Carlo samples from its posterior predictive distribution.
  2. The model with the smaller estimated expected loss is preferred, since under squared error loss it is expected to predict future observations more accurately.
  3. This criterion balances:
    • Fit: bias term (smaller is better)
    • Complexity/Uncertainty: variance term (often increases with model complexity)

This approach is a fully Bayesian alternative to information criteria like AIC or BIC for predictive model selection.
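As an illustration of the comparison, the sketch below applies the criterion to two toy models for the same data. Everything specific here is an assumption chosen for the example: the data values, Model A (\(y_i \sim N(\theta, 1)\) with prior \(\theta \sim N(0, 10^2)\), which has a closed-form normal posterior), and Model B (\(y_i \sim N(0, 1)\) with no free parameter).

```python
import numpy as np

rng = np.random.default_rng(1)
y = np.array([1.8, 2.3, 2.9, 1.5, 2.2, 2.6])   # hypothetical data
n, S = y.size, 4000

def estimated_loss(y, y_rep):
    """Variance term plus fit term, from draws of shape (S, n)."""
    m_hat = y_rep.mean(axis=0)
    return y_rep.var(axis=0).sum() + ((y - m_hat) ** 2).sum()

# Model A: y_i ~ N(theta, 1), theta ~ N(0, 10^2).
# Conjugate update: posterior precision = 1/100 + n, posterior mean = post_var * sum(y).
prior_var = 100.0
post_var = 1.0 / (1.0 / prior_var + n)
post_mean = post_var * y.sum()
theta = rng.normal(post_mean, np.sqrt(post_var), size=S)
# One replicate data set per posterior draw of theta.
y_rep_a = rng.normal(theta[:, None], 1.0, size=(S, n))

# Model B: y_i ~ N(0, 1); the predictive distribution does not depend on y.
y_rep_b = rng.normal(0.0, 1.0, size=(S, n))

loss_a = estimated_loss(y, y_rep_a)
loss_b = estimated_loss(y, y_rep_b)
preferred = "A" if loss_a < loss_b else "B"
```

Model B has a slightly smaller variance term, since it carries no parameter uncertainty, but its fit term is much larger because its predictive mean of 0 is far from the data. The total therefore favours Model A in this example.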


Part (b)

Definition of Deviance Information Criterion (DIC)

Let \(D(\theta) = -2 \log f(y \mid \theta)\) be the deviance.

Define:

  • \(\bar{\theta} = E[\theta \mid y]\) (posterior mean of parameters)
  • \(D_{\text{mean}} = D(\bar{\theta})\) (deviance evaluated at posterior mean)
  • \(\bar{D} = E[D(\theta) \mid y]\) (posterior mean deviance)

The effective number of parameters is:

\[ p_D = \bar{D} - D_{\text{mean}} \]

Then the Deviance Information Criterion is:

\[ \boxed{\text{DIC} = D_{\text{mean}} + 2p_D} \]

Equivalently:

\[ \text{DIC} = \bar{D} + p_D = 2\bar{D} - D_{\text{mean}} \]
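The equivalence follows by substituting \(p_D = \bar{D} - D_{\text{mean}}\) into the definition:

\[ \begin{aligned} \text{DIC} &= D_{\text{mean}} + 2\left( \bar{D} - D_{\text{mean}} \right) = 2\bar{D} - D_{\text{mean}} \\ &= \bar{D} + \left( \bar{D} - D_{\text{mean}} \right) = \bar{D} + p_D \end{aligned} \]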

Interpretation

  • DIC rewards models that fit the data well (small \(\bar{D}\))
  • DIC penalizes model complexity (large \(p_D\))
  • Smaller DIC indicates a better model for prediction

Using DIC to Compare Models

Procedure:

  1. For each candidate model:
    • Obtain posterior samples \(\theta^{(1)}, \ldots, \theta^{(S)}\)
    • Compute deviance \(D(\theta^{(s)}) = -2\log f(y \mid \theta^{(s)})\) for each sample
    • Estimate \(\bar{D} \approx \frac{1}{S} \sum_{s=1}^S D(\theta^{(s)})\)
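    • Estimate the posterior mean \(\bar{\theta} \approx \frac{1}{S} \sum_{s=1}^S \theta^{(s)}\) (other plug-in estimates, such as the posterior median, are sometimes used and give a different \(p_D\))
    • Compute \(D_{\text{mean}} = D(\bar{\theta})\)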
    • Calculate \(p_D = \bar{D} - D_{\text{mean}}\)
    • Calculate \(\text{DIC} = \bar{D} + p_D\)
  2. Compare DIC values across models:
    • Smaller DIC → better predictive ability given complexity penalty
    • As a rough rule of thumb, DIC differences of more than about 5 to 10 are often treated as meaningful
    • DIC can be negative; only relative differences matter
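The procedure above can be sketched for a single model as follows. This is a minimal sketch under stated assumptions: the model is \(y_i \sim N(\theta, 1)\) with known variance and prior \(\theta \sim N(0, 10^2)\), the posterior draws are taken directly from the closed-form normal posterior rather than from MCMC, and the names `deviance` and `dic_from_draws` are hypothetical helpers, not part of any library.

```python
import numpy as np

def deviance(y, theta, sigma=1.0):
    """D(theta) = -2 log f(y | theta) for y_i ~ N(theta, sigma^2), sigma known."""
    log_lik = np.sum(-0.5 * np.log(2.0 * np.pi * sigma**2)
                     - 0.5 * ((y - theta) / sigma) ** 2)
    return -2.0 * log_lik

def dic_from_draws(y, theta_draws):
    """Compute D_bar, D_mean, p_D and DIC from posterior draws of theta."""
    dev = np.array([deviance(y, t) for t in theta_draws])
    d_bar = dev.mean()                          # posterior mean deviance
    d_mean = deviance(y, theta_draws.mean())    # deviance at the posterior mean
    p_d = d_bar - d_mean                        # effective number of parameters
    return {"D_bar": d_bar, "D_mean": d_mean, "p_D": p_d, "DIC": d_bar + p_d}

rng = np.random.default_rng(2)
y = np.array([1.8, 2.3, 2.9, 1.5, 2.2, 2.6])   # hypothetical data
n = y.size

# Closed-form posterior for theta under the assumed N(0, 10^2) prior.
post_var = 1.0 / (1.0 / 100.0 + n)
post_mean = post_var * y.sum()
theta_draws = rng.normal(post_mean, np.sqrt(post_var), size=4000)

result = dic_from_draws(y, theta_draws)
```

For this one-parameter model the deviance is quadratic in \(\theta\), so \(p_D\) equals \(n\) times the sample variance of the draws, which is close to 1 here. Repeating the calculation for each candidate model and comparing the `"DIC"` entries gives the comparison in step 2.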

Advantages and Limitations

Advantages:

  • Fully Bayesian
  • Applicable to non-nested and hierarchical models
  • Accounts for posterior uncertainty
  • Easy to compute from MCMC output

Limitations:

  • Not valid for singular models (where the Fisher information matrix is singular)
  • Can be sensitive to parameterization
  • Requires proper posterior distributions


Summary

  • Part (a) decomposed the expected predictive squared loss into a variance term (predictive uncertainty) and a bias term (lack of fit), providing a Monte Carlo approach for model comparison.
  • Part (b) defined DIC as a Bayesian model comparison criterion that balances fit and complexity, with smaller values indicating better models.

Both methods emphasize predictive performance rather than just in-sample fit.