We want to derive the posterior predictive distribution, which gives the probability of a new data point \(y^{\star}\) given our observed data \(y\):
\[p(y^{\star} \mid y)\]
This derivation relies on two key concepts:
Exchangeability (Conditional Independence): Given the parameters \(\theta\), the observed data \(y\) and the future data \(y^{\star}\) are independent.
Bayes’ Theorem: Relates the posterior \(p(\theta \mid y)\) to the likelihood \(p(y \mid \theta)\) and prior \(p(\theta)\).
By the definition of conditional probability, we have:
\[p(y^{\star} \mid y) = \frac{p(y^{\star}, y)}{p(y)}\]
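where \(p(y^{\star}, y)\) is the joint distribution of the new and observed data, and \(p(y)\) is the marginal distribution of the observed data.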
Since we do not know the true value of the parameters \(\theta\), we must average (marginalize) over all possible values of \(\theta\). We do this for both the numerator and the denominator:
\[p(y^{\star} \mid y) = \frac{\int p(y^{\star}, y, \theta) \, d\theta}{\int p(y, \theta) \, d\theta}\]
The denominator is just the marginal likelihood:
\[p(y) = \int p(y \mid \theta) \, p(\theta) \, d\theta\]
So we can write:
\[p(y^{\star} \mid y) = \frac{\int p(y^{\star}, y, \theta) \, d\theta}{p(y)}\]
Using the chain rule, we factor the joint distribution \(p(y^{\star}, y, \theta)\) into a product of conditional distributions:
\[p(y^{\star}, y, \theta) = p(y^{\star} \mid y, \theta) \cdot p(y \mid \theta) \cdot p(\theta)\]
Substituting this into the numerator gives:
\[p(y^{\star} \mid y) = \frac{\int p(y^{\star} \mid y, \theta) \, p(y \mid \theta) \, p(\theta) \, d\theta}{p(y)}\]
The exchangeability assumption, used here in the sense of conditional independence, states that once the parameters \(\theta\) are given, the observed data \(y\) and the future data \(y^{\star}\) are independent. Mathematically:
\[p(y^{\star} \mid y, \theta) = p(y^{\star} \mid \theta)\]
This is a crucial modeling assumption: if we know the true parameters \(\theta\), the past data gives us no additional information about the future data. All information about the data-generating process is captured in \(\theta\).
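The difference between conditional and marginal independence can be illustrated with a small simulation. This is only a sketch under assumptions that are not part of the derivation: a hypothetical Bernoulli model with a Uniform(0, 1) prior on \(\theta\), and the helper name `simulate` is made up for illustration. When \(\theta\) is held fixed, observing \(y\) does not change the frequency of \(y^{\star} = 1\). When \(\theta\) is unknown and drawn from the prior, \(y\) does carry information about \(y^{\star}\): analytically \(p(y^{\star} = 1 \mid y = 1) = 2/3\) and \(p(y^{\star} = 1 \mid y = 0) = 1/3\) for this prior.

```python
import random


def simulate(n_draws, seed, fixed_theta=None):
    """Estimate P(y* = 1 | y = 1) and P(y* = 1 | y = 0) by simulation.

    If fixed_theta is None, theta is drawn from a Uniform(0, 1) prior on
    every draw (theta unknown); otherwise theta is held at fixed_theta.
    """
    rng = random.Random(seed)
    # counts[y] = [number of draws with this y, number of those with y* = 1]
    counts = {0: [0, 0], 1: [0, 0]}
    for _ in range(n_draws):
        theta = rng.random() if fixed_theta is None else fixed_theta
        y = int(rng.random() < theta)       # observed data point
        y_star = int(rng.random() < theta)  # future data point, same theta
        counts[y][0] += 1
        counts[y][1] += y_star
    return counts[1][1] / counts[1][0], counts[0][1] / counts[0][0]


# theta known: the two conditional frequencies should agree
given_theta = simulate(200_000, seed=1, fixed_theta=0.5)
# theta unknown: the two conditional frequencies should differ
marginal = simulate(200_000, seed=0)

print("theta fixed  :", given_theta)
print("theta unknown:", marginal)
```

The first pair of numbers should be roughly equal, while the second pair should be close to 2/3 and 1/3, up to simulation noise.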
Applying exchangeability, our expression becomes:
\[p(y^{\star} \mid y) = \frac{\int p(y^{\star} \mid \theta) \, p(y \mid \theta) \, p(\theta) \, d\theta}{p(y)}\]
Now we apply Bayes’ theorem to the term \(p(y \mid \theta) \, p(\theta)\). Bayes’ theorem tells us that:
\[p(\theta \mid y) = \frac{p(y \mid \theta) \, p(\theta)}{p(y)}\]
Rearranging this gives:
\[p(y \mid \theta) \, p(\theta) = p(\theta \mid y) \, p(y)\]
This is the joint probability of \(y\) and \(\theta\) written in two equivalent ways:
\[p(y, \theta) = p(y \mid \theta)p(\theta) = p(\theta \mid y)p(y)\]
Substituting \(p(\theta \mid y) \, p(y)\) into the numerator:
\[p(y^{\star} \mid y) = \frac{\int p(y^{\star} \mid \theta) \, p(\theta \mid y) \, p(y) \, d\theta}{p(y)}\]
Since \(p(y)\) is a constant with respect to the integral over \(\theta\), we can factor it out of the integral:
\[p(y^{\star} \mid y) = \frac{p(y) \int p(y^{\star} \mid \theta) \, p(\theta \mid y) \, d\theta}{p(y)}\]
Canceling \(p(y)\) from the numerator and denominator yields the final result:
\[p(y^{\star} \mid y) = \int p(y^{\star} \mid \theta) \, p(\theta \mid y) \, d\theta\]
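As a concrete check of this formula, consider a Beta-Bernoulli model, which is an illustrative assumption and not part of the derivation itself. Let \(\theta \sim \mathrm{Beta}(a, b)\) and let \(y = (y_1, \dots, y_n)\) be conditionally independent Bernoulli(\(\theta\)) observations with \(s\) successes. The posterior is then \(\mathrm{Beta}(a + s,\, b + n - s)\), and since \(p(y^{\star} = 1 \mid \theta) = \theta\), the integral reduces to the posterior mean:
\[p(y^{\star} = 1 \mid y) = \int_0^1 \theta \, \mathrm{Beta}(\theta \mid a + s,\, b + n - s) \, d\theta = \frac{a + s}{a + b + n}\]
For example, with a uniform prior (\(a = b = 1\)) and \(s = 6\) successes in \(n = 8\) trials, this gives \(p(y^{\star} = 1 \mid y) = 7/10\).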
The final expression has a beautiful intuitive interpretation:
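The posterior predictive distribution is the sampling distribution of the new data, \(p(y^{\star} \mid \theta)\), averaged over all parameter values \(\theta\) and weighted by the posterior distribution \(p(\theta \mid y)\).
The steps of the derivation are summarized below: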
| Step | Action | Mathematical Operation |
|---|---|---|
| 1 | Conditional probability | \(p(y^{\star} \mid y) = \frac{p(y^{\star}, y)}{p(y)}\) |
| 2 | Marginalize over \(\theta\) | Introduce \(\int (\cdots) \, d\theta\) |
| 3 | Chain rule | \(p(y^{\star}, y, \theta) = p(y^{\star} \mid y, \theta) \, p(y \mid \theta) \, p(\theta)\) |
| 4 | Exchangeability | \(p(y^{\star} \mid y, \theta) = p(y^{\star} \mid \theta)\) |
| 5 | Bayes’ theorem | \(p(y \mid \theta) \, p(\theta) = p(\theta \mid y) \, p(y)\) |
| 6 | Cancel \(p(y)\) | Simplify to final form |
Combining all steps into a single continuous derivation:
\[ \begin{aligned} p(y^{\star} \mid y) &= \frac{p(y^{\star}, y)}{p(y)} && \text{(Definition of conditional probability)} \\ &= \frac{\int p(y^{\star}, y, \theta) \, d\theta}{p(y)} && \text{(Marginalize over } \theta \text{)} \\ &= \frac{\int p(y^{\star} \mid y, \theta) \, p(y \mid \theta) \, p(\theta) \, d\theta}{p(y)} && \text{(Chain rule)} \\ &= \frac{\int p(y^{\star} \mid \theta) \, p(y \mid \theta) \, p(\theta) \, d\theta}{p(y)} && \text{(Exchangeability / Conditional independence)} \\ &= \frac{\int p(y^{\star} \mid \theta) \, p(\theta \mid y) \, p(y) \, d\theta}{p(y)} && \text{(Bayes' theorem: } p(y \mid \theta)p(\theta) = p(\theta \mid y)p(y) \text{)} \\ &= \int p(y^{\star} \mid \theta) \, p(\theta \mid y) \, d\theta && \text{(Cancel } p(y) \text{)} \end{aligned} \]
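The equality between the intermediate ratio form and the final form can also be checked numerically. The sketch below assumes a hypothetical Bernoulli likelihood with a uniform prior on \(\theta\) and approximates every integral with a midpoint rule on a grid. The function names and the data are made up for illustration.

```python
def bernoulli_lik(y, theta):
    """p(y | theta) for a list of 0/1 observations."""
    s, n = sum(y), len(y)
    return theta ** s * (1 - theta) ** (n - s)


def predictive_two_ways(y, y_star, prior_pdf, n_grid=20_000):
    """Return p(y* | y) computed via the ratio form and via the final form."""
    h = 1.0 / n_grid
    thetas = [(i + 0.5) * h for i in range(n_grid)]  # midpoint grid on (0, 1)

    # Marginal likelihood: p(y) = ∫ p(y | θ) p(θ) dθ
    p_y = sum(bernoulli_lik(y, t) * prior_pdf(t) for t in thetas) * h

    # Ratio form: ∫ p(y* | θ) p(y | θ) p(θ) dθ / p(y)
    numerator = sum(
        bernoulli_lik([y_star], t) * bernoulli_lik(y, t) * prior_pdf(t)
        for t in thetas
    ) * h
    ratio_form = numerator / p_y

    # Final form: ∫ p(y* | θ) p(θ | y) dθ, with p(θ | y) from Bayes' theorem
    posterior = [bernoulli_lik(y, t) * prior_pdf(t) / p_y for t in thetas]
    final_form = sum(
        bernoulli_lik([y_star], t) * post for t, post in zip(thetas, posterior)
    ) * h
    return ratio_form, final_form


def uniform_prior(theta):
    """Density of the assumed Uniform(0, 1) prior."""
    return 1.0


y_obs = [1, 0, 1, 1, 0, 1, 1, 1]  # hypothetical data: 6 successes in 8 trials
ratio_form, final_form = predictive_two_ways(y_obs, 1, uniform_prior)
print(ratio_form, final_form)
```

Both values should agree up to floating-point error. For this particular prior and data, they should also be close to the closed-form value \((1 + 6)/(1 + 1 + 8) = 0.7\).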
We have successfully derived the posterior predictive distribution:
\[\boxed{p(y^{\star} \mid y) = \int p(y^{\star} \mid \theta) \, p(\theta \mid y) \, d\theta}\]
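This formula shows that to make predictions about new data, we compute the prediction \(p(y^{\star} \mid \theta)\) under each possible parameter value, weight it by the posterior \(p(\theta \mid y)\), and integrate over \(\theta\).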
This is the fundamental equation for Bayesian prediction and forms the basis for many machine learning and statistical modeling approaches.
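In practice the integral is often approximated by Monte Carlo: draw \(\theta^{(1)}, \dots, \theta^{(S)}\) from the posterior and average \(p(y^{\star} \mid \theta^{(s)})\) over the draws. The following minimal sketch assumes a hypothetical Beta-Bernoulli model, with a \(\mathrm{Beta}(a, b)\) prior and Bernoulli data. This model is chosen only because its posterior, \(\mathrm{Beta}(a + s,\, b + n - s)\) for \(s\) successes in \(n\) trials, can be sampled directly and its predictive probability \((a + s)/(a + b + n)\) is known in closed form. The function name and the data are illustrative.

```python
import random


def posterior_predictive_mc(y, a, b, n_samples=100_000, seed=0):
    """Monte Carlo estimate of p(y* = 1 | y) for a Beta-Bernoulli model."""
    rng = random.Random(seed)
    s, n = sum(y), len(y)
    total = 0.0
    for _ in range(n_samples):
        # Draw theta from the posterior p(theta | y) = Beta(a + s, b + n - s)
        theta = rng.betavariate(a + s, b + n - s)
        # For a Bernoulli likelihood, p(y* = 1 | theta) = theta
        total += theta
    return total / n_samples


y_obs = [1, 0, 1, 1, 0, 1, 1, 1]  # hypothetical data: 6 successes in 8 trials
estimate = posterior_predictive_mc(y_obs, a=1.0, b=1.0)
exact = (1.0 + sum(y_obs)) / (1.0 + 1.0 + len(y_obs))
print(estimate, exact)
```

The same recipe applies when the posterior has no closed form, provided posterior draws are available from some sampler. Only the line that draws `theta` would change.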