Suppose that \(y_1, \ldots, y_n\) are random observations from a model \(f(y \mid \theta)\) and a prior distribution \(\pi(\theta)\) is assumed for \(\theta\). Denote \(\tilde{Y}_i\) to be the future observation corresponding to \(y_i\) and its posterior predictive distribution is given by \(f(\tilde{y}_i \mid y)\) where \(y = (y_1, \ldots, y_n)\).
Assume the loss function:
\[ L(\tilde{Y}, y) = \sum_{i=1}^n (\tilde{Y}_i - y_i)^2 \]
where \(\tilde{Y} = (\tilde{Y}_1, \ldots, \tilde{Y}_n)\).
We need to show:
\[ E\left[ L(\tilde{Y}, y) \mid y \right] = \sum_{i=1}^n E\left[ \left( \tilde{Y}_i - E(\tilde{Y}_i \mid y) \right)^2 \mid y \right] + \sum_{i=1}^n \left( y_i - E(\tilde{Y}_i \mid y) \right)^2 \]
Let \(m_i = E(\tilde{Y}_i \mid y)\). For each \(i\):
\[ \begin{aligned} E\left[ (\tilde{Y}_i - y_i)^2 \mid y \right] &= E\left[ (\tilde{Y}_i - m_i + m_i - y_i)^2 \mid y \right] \\ &= E\left[ (\tilde{Y}_i - m_i)^2 \mid y \right] + 2 E\left[ (\tilde{Y}_i - m_i)(m_i - y_i) \mid y \right] + (m_i - y_i)^2 \end{aligned} \]
Since \(m_i\) and \(y_i\) are constants given \(y\):
\[ E\left[ (\tilde{Y}_i - m_i)(m_i - y_i) \mid y \right] = (m_i - y_i) \cdot E\left[ \tilde{Y}_i - m_i \mid y \right] = 0 \]
Thus:
\[ E\left[ (\tilde{Y}_i - y_i)^2 \mid y \right] = E\left[ (\tilde{Y}_i - m_i)^2 \mid y \right] + (m_i - y_i)^2 \]
Summing over \(i = 1, \ldots, n\):
\[ E\left[ L(\tilde{Y}, y) \mid y \right] = \sum_{i=1}^n E\left[ (\tilde{Y}_i - m_i)^2 \mid y \right] + \sum_{i=1}^n (m_i - y_i)^2 \]
Substituting back \(m_i = E(\tilde{Y}_i \mid y)\) completes the proof.
\[ \sum_{i=1}^n \text{Var}(\tilde{Y}_i \mid y) = \sum_{i=1}^n E\left[ (\tilde{Y}_i - E(\tilde{Y}_i \mid y))^2 \mid y \right] \]
This represents the irreducible uncertainty in predicting a new observation. It measures how much the future observation \(\tilde{Y}_i\) varies around its posterior predictive mean. This term depends only on the model and the posterior distribution, not on the actual observed \(y_i\) except through the conditioning on \(y\).
\[ \sum_{i=1}^n \left( y_i - E(\tilde{Y}_i \mid y) \right)^2 \]
This measures how well the model’s predictions match the observed data. It is the squared distance between each observed value \(y_i\) and its posterior predictive mean. A large value indicates systematic disagreement between the model and the data.
Together, the two terms form a bias-variance decomposition for the predictive squared error loss.
Given \(S\) samples from the posterior predictive distribution \(f(\tilde{y}_i \mid y)\):
Let \(\tilde{y}_i^{(1)}, \tilde{y}_i^{(2)}, \ldots, \tilde{y}_i^{(S)}\) be Monte Carlo draws.
Estimate \(m_i\) (posterior predictive mean): \[ \hat{m}_i = \frac{1}{S} \sum_{s=1}^S \tilde{y}_i^{(s)} \]
Estimate predictive variance (first term): \[ \widehat{\text{Var}}(\tilde{Y}_i \mid y) = \frac{1}{S} \sum_{s=1}^S \left( \tilde{y}_i^{(s)} - \hat{m}_i \right)^2 \]
Estimate squared bias (second term): \[ (y_i - \hat{m}_i)^2 \]
Then the estimated expected loss is:
\[ \widehat{E[L \mid y]} = \sum_{i=1}^n \widehat{\text{Var}}(\tilde{Y}_i \mid y) + \sum_{i=1}^n (y_i - \hat{m}_i)^2 \]
To compare candidate models:
This approach is a fully Bayesian alternative to information criteria like AIC or BIC for predictive model selection.
Let \(D(\theta) = -2 \log f(y \mid \theta)\) be the deviance.
Define:
The effective number of parameters is:
\[ p_D = \bar{D} - D_{\text{mean}} \]
Then the Deviance Information Criterion is:
\[ \boxed{\text{DIC} = D_{\text{mean}} + 2p_D} \]
Equivalently:
\[ \text{DIC} = \bar{D} + p_D = 2\bar{D} - D_{\text{mean}} \]
Procedure:
Advantages: - Fully Bayesian - Applicable to non-nested and hierarchical models - Accounts for posterior uncertainty - Easy to compute from MCMC output
Limitations: - Not valid for singular models (where Fisher information matrix is singular) - Can be sensitive to parameterization - Requires proper posterior distributions
Both methods emphasize predictive performance rather than just in-sample fit.