Derivation of the Posterior Predictive Distribution

Introduction

We want to derive the posterior predictive distribution, which gives the probability (or probability density, when the data are continuous) of a new data point \(y^{\star}\) given our observed data \(y\):

\[p(y^{\star} \mid y)\]

This derivation relies on two key concepts:

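Exchangeability (Conditional Independence): Given the parameters \(\theta\), the observed data \(y\) and the future data \(y^{\star}\) are independent. Strictly, the derivation uses only this conditional independence; exchangeability is the usual justification for a model of this form.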
Bayes’ Theorem: Relates the posterior \(p(\theta \mid y)\) to the likelihood \(p(y \mid \theta)\) and prior \(p(\theta)\).

Step 1: Start with the Definition of Conditional Probability

By the definition of conditional probability, we have:

\[p(y^{\star} \mid y) = \frac{p(y^{\star}, y)}{p(y)}\]

where:

\(p(y^{\star}, y)\) is the joint probability of the new data and the observed data,
\(p(y)\) is the marginal probability of the observed data (also called the evidence or marginal likelihood).

Step 2: Introduce the Parameters \(\theta\) via Marginalization

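Since we do not know the true value of the parameters \(\theta\), we bring them into the expression by marginalization: each probability is written as a joint distribution that includes \(\theta\), with \(\theta\) then integrated out over all of its possible values. We do this for both the numerator and the denominator: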

\[p(y^{\star} \mid y) = \frac{\int p(y^{\star}, y, \theta) \, d\theta}{\int p(y, \theta) \, d\theta}\]

The denominator is the marginal likelihood, because \(p(y, \theta) = p(y \mid \theta) \, p(\theta)\):

\[p(y) = \int p(y \mid \theta) \, p(\theta) \, d\theta\]
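To make the marginal likelihood concrete, here is a minimal sketch under assumptions that are not part of the derivation: a Bernoulli likelihood, a Beta(2, 2) prior, and a made-up data set `y_obs`. It approximates the integral with a midpoint rule on a grid over \(\theta\) and compares the result with the closed form available for this conjugate pair.

```python
import math

def beta_fn(a, b):
    """Beta function B(a, b), computed through log-gamma for numerical stability."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))

def prior_pdf(theta, a, b):
    """Beta(a, b) prior density p(theta) (assumed prior, for illustration)."""
    return theta ** (a - 1) * (1 - theta) ** (b - 1) / beta_fn(a, b)

def likelihood(y, theta):
    """Bernoulli likelihood p(y | theta) for a list of 0/1 observations."""
    s = sum(y)
    return theta ** s * (1 - theta) ** (len(y) - s)

def evidence_grid(y, a, b, n_grid=20_000):
    """Midpoint-rule approximation of the integral of p(y | theta) p(theta) over [0, 1]."""
    h = 1.0 / n_grid
    return h * sum(
        likelihood(y, (i + 0.5) * h) * prior_pdf((i + 0.5) * h, a, b)
        for i in range(n_grid)
    )

def evidence_exact(y, a, b):
    """Closed form for the Beta-Bernoulli model: B(a + s, b + n - s) / B(a, b)."""
    s, n = sum(y), len(y)
    return beta_fn(a + s, b + n - s) / beta_fn(a, b)

y_obs = [1, 0, 1, 1, 0, 1, 1, 1]  # hypothetical data: 6 successes in 8 trials
a, b = 2.0, 2.0                   # hypothetical prior hyperparameters

p_y_grid = evidence_grid(y_obs, a, b)
p_y_exact = evidence_exact(y_obs, a, b)
print(p_y_grid, p_y_exact)  # both close to 1/220
```

The grid approach only works here because \(\theta\) is one-dimensional; it is meant as a check on the formula, not as a general method.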

So we can write:

\[p(y^{\star} \mid y) = \frac{\int p(y^{\star}, y, \theta) \, d\theta}{p(y)}\]

Step 3: Apply the Chain Rule of Probability

Using the chain rule, we factor the joint distribution \(p(y^{\star}, y, \theta)\) into a product of conditional distributions:

\[p(y^{\star}, y, \theta) = p(y^{\star} \mid y, \theta) \cdot p(y \mid \theta) \cdot p(\theta)\]

Substituting this into the numerator gives:

\[p(y^{\star} \mid y) = \frac{\int p(y^{\star} \mid y, \theta) \, p(y \mid \theta) \, p(\theta) \, d\theta}{p(y)}\]

Step 4: Apply Exchangeability (Conditional Independence)

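The assumption used in this step is conditional independence: given the parameters \(\theta\), the observed data \(y\) and the future data \(y^{\star}\) are independent. It is usually motivated by exchangeability, since de Finetti's representation theorem shows that an infinitely exchangeable sequence can be modeled as independent draws given some \(\theta\). Mathematically: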

\[p(y^{\star} \mid y, \theta) = p(y^{\star} \mid \theta)\]

This is a crucial modeling assumption: if we know the true parameters \(\theta\), the past data gives us no additional information about the future data. All information about the data-generating process is captured in \(\theta\).
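The difference between conditional and marginal independence can be checked by simulation. The sketch below assumes a Bernoulli model with one past observation `y1` and one future observation `y_star`; the helper `conditional_freqs`, the fixed value 0.7, and the Beta(2, 2) prior are all illustrative choices. With \(\theta\) held fixed, the frequency of `y_star = 1` should not depend on `y1`; with \(\theta\) drawn afresh from the prior each time, it should.

```python
import random

def conditional_freqs(draw_theta, n_sims, rng):
    """Estimate P(y_star = 1 | y1 = k) for k in {0, 1} by simulation."""
    counts = {0: [0, 0], 1: [0, 0]}  # y1 -> [number of runs, runs with y_star == 1]
    for _ in range(n_sims):
        theta = draw_theta(rng)
        # Given theta, y1 and y_star are drawn independently of each other.
        y1 = 1 if rng.random() < theta else 0
        y_star = 1 if rng.random() < theta else 0
        counts[y1][0] += 1
        counts[y1][1] += y_star
    return {k: hits / total for k, (total, hits) in counts.items()}

rng = random.Random(0)

# Theta known and fixed: the past observation carries no extra information.
fixed = conditional_freqs(lambda r: 0.7, 200_000, rng)

# Theta unknown, drawn from a Beta(2, 2) prior: y1 is informative about y_star.
mixed = conditional_freqs(lambda r: r.betavariate(2.0, 2.0), 200_000, rng)

print("theta fixed:  ", fixed)
print("theta unknown:", mixed)
```

In the second case the dependence arises only because `y1` tells us something about \(\theta\), which is the mechanism the posterior predictive distribution formalizes.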

Applying exchangeability, our expression becomes:

\[p(y^{\star} \mid y) = \frac{\int p(y^{\star} \mid \theta) \, p(y \mid \theta) \, p(\theta) \, d\theta}{p(y)}\]
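The expression at this stage is a ratio of two integrals, and it can already be evaluated numerically without ever forming the posterior. The following is a rough sketch for an assumed Beta-Bernoulli model; the data `y_obs`, the prior hyperparameters, and the helper names are hypothetical. It uses a midpoint rule for both integrals.

```python
import math

def prior_pdf(theta, a, b):
    """Beta(a, b) prior density p(theta) (assumed prior, for illustration)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm) * theta ** (a - 1) * (1 - theta) ** (b - 1)

def bernoulli_pmf(y, theta):
    """p(y | theta) for a single 0/1 observation."""
    return theta if y == 1 else 1.0 - theta

def likelihood(ys, theta):
    """p(y | theta) for a list of conditionally independent 0/1 observations."""
    out = 1.0
    for y in ys:
        out *= bernoulli_pmf(y, theta)
    return out

def midpoint_integral(f, n_grid=20_000):
    """Midpoint-rule approximation of the integral of f over [0, 1]."""
    h = 1.0 / n_grid
    return h * sum(f((i + 0.5) * h) for i in range(n_grid))

y_obs = [1, 0, 1, 1, 0, 1, 1, 1]  # hypothetical data: 6 successes in 8 trials
a, b = 2.0, 2.0                   # hypothetical prior hyperparameters

def joint(theta):
    """p(y | theta) p(theta), the integrand of the denominator."""
    return likelihood(y_obs, theta) * prior_pdf(theta, a, b)

evidence = midpoint_integral(joint)  # denominator p(y)

# Numerator: the same integrand multiplied by p(y_star | theta).
p_star_1 = midpoint_integral(lambda t: bernoulli_pmf(1, t) * joint(t)) / evidence
p_star_0 = midpoint_integral(lambda t: bernoulli_pmf(0, t) * joint(t)) / evidence

print(p_star_1, p_star_0)  # close to 2/3 and 1/3 for this model
```

The two predictive probabilities sum to one, as they must for a distribution over \(y^{\star} \in \{0, 1\}\).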

Step 5: Apply Bayes’ Theorem

Now we apply Bayes’ theorem to the term \(p(y \mid \theta) \, p(\theta)\). Bayes’ theorem tells us that:

\[p(\theta \mid y) = \frac{p(y \mid \theta) \, p(\theta)}{p(y)}\]

Rearranging this gives:

\[p(y \mid \theta) \, p(\theta) = p(\theta \mid y) \, p(y)\]

This is the joint probability of \(y\) and \(\theta\) written in two equivalent ways:

\[p(y, \theta) = p(y \mid \theta)p(\theta) = p(\theta \mid y)p(y)\]

Substituting \(p(\theta \mid y) \, p(y)\) into the numerator:

\[p(y^{\star} \mid y) = \frac{\int p(y^{\star} \mid \theta) \, p(\theta \mid y) \, p(y) \, d\theta}{p(y)}\]

Step 6: Cancel \(p(y)\)

Since \(p(y)\) does not depend on \(\theta\), it is a constant inside the integral and can be factored out:

\[p(y^{\star} \mid y) = \frac{p(y) \int p(y^{\star} \mid \theta) \, p(\theta \mid y) \, d\theta}{p(y)}\]

Canceling \(p(y)\) from the numerator and denominator yields the final result:

\[p(y^{\star} \mid y) = \int p(y^{\star} \mid \theta) \, p(\theta \mid y) \, d\theta\]
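As a concrete instance of this formula, consider a sketch under assumptions that are not part of the derivation: a Bernoulli likelihood with a Beta prior, where \(a\), \(b\), \(n\), and \(s\) are illustrative symbols. Let \(\theta \sim \text{Beta}(a, b)\) and let \(y = (y_1, \dots, y_n)\) be binary observations with \(s = \sum_i y_i\) successes. Conjugacy gives the posterior \(\theta \mid y \sim \text{Beta}(a + s, \, b + n - s)\), and since \(p(y^{\star} = 1 \mid \theta) = \theta\), the probability that the next observation is a success is the posterior mean:

\[p(y^{\star} = 1 \mid y) = \int_0^1 \theta \, p(\theta \mid y) \, d\theta = \frac{a + s}{a + b + n}\]

For example, with \(a = b = 2\) and \(s = 6\) successes in \(n = 8\) trials, this gives \(8/12 \approx 0.667\), which lies between the prior mean \(0.5\) and the observed frequency \(6/8 = 0.75\).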

Interpretation

The final expression has a direct interpretation:

The posterior predictive distribution is the sampling distribution of the new data, \(p(y^{\star} \mid \theta)\), averaged over the posterior distribution of the parameters.

In other words:

\(p(y^{\star} \mid \theta)\) is the probability of the new data point for a given \(\theta\),
\(p(\theta \mid y)\) is our updated belief about \(\theta\) after seeing the data \(y\),
The integral averages over all possible \(\theta\), weighting each by how likely it is given our observed data.
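When posterior draws of \(\theta\) are available, this weighted average can be approximated by a plain Monte Carlo average of \(p(y^{\star} \mid \theta)\) over the draws. The sketch below assumes a Bernoulli model whose posterior is Beta(8, 4), which is what a Beta(2, 2) prior and 6 successes in 8 trials would give; these numbers and the function name `posterior_predictive_mc` are illustrative, not taken from the derivation.

```python
import random

def posterior_predictive_mc(a_post, b_post, n_draws, rng):
    """Monte Carlo estimate of p(y_star = 1 | y) under a Beta(a_post, b_post) posterior."""
    total = 0.0
    for _ in range(n_draws):
        theta = rng.betavariate(a_post, b_post)  # one draw from p(theta | y)
        total += theta                           # p(y_star = 1 | theta) = theta for a Bernoulli model
    return total / n_draws

rng = random.Random(42)
a_post, b_post = 8.0, 4.0  # assumed posterior hyperparameters

estimate = posterior_predictive_mc(a_post, b_post, 100_000, rng)
exact = a_post / (a_post + b_post)  # posterior mean, the closed-form answer here

print(estimate, exact)
```

The same pattern carries over to models without a closed-form posterior, where the draws would typically come from an MCMC sampler rather than `betavariate`.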

Summary of Key Steps

Step	Action	Mathematical Operation
1	Conditional probability	\(p(y^{\star} \mid y) = \frac{p(y^{\star}, y)}{p(y)}\)
2	Marginalize over \(\theta\)	Introduce \(\int (\cdots) \, d\theta\)
3	Chain rule	\(p(y^{\star}, y, \theta) = p(y^{\star} \mid y, \theta) \, p(y \mid \theta) \, p(\theta)\)
4	Exchangeability	\(p(y^{\star} \mid y, \theta) = p(y^{\star} \mid \theta)\)
5	Bayes’ theorem	\(p(y \mid \theta) \, p(\theta) = p(\theta \mid y) \, p(y)\)
6	Cancel \(p(y)\)	Simplify to final form

Complete Derivation (All in One)

Combining all steps into a single continuous derivation:

\[
\begin{aligned}
p(y^{\star} \mid y) &= \frac{p(y^{\star}, y)}{p(y)} && \text{(Definition of conditional probability)} \\
&= \frac{\int p(y^{\star}, y, \theta) \, d\theta}{p(y)} && \text{(Marginalize over } \theta \text{)} \\
&= \frac{\int p(y^{\star} \mid y, \theta) \, p(y \mid \theta) \, p(\theta) \, d\theta}{p(y)} && \text{(Chain rule)} \\
&= \frac{\int p(y^{\star} \mid \theta) \, p(y \mid \theta) \, p(\theta) \, d\theta}{p(y)} && \text{(Exchangeability / Conditional independence)} \\
&= \frac{\int p(y^{\star} \mid \theta) \, p(\theta \mid y) \, p(y) \, d\theta}{p(y)} && \text{(Bayes' theorem: } p(y \mid \theta)p(\theta) = p(\theta \mid y)p(y) \text{)} \\
&= \int p(y^{\star} \mid \theta) \, p(\theta \mid y) \, d\theta && \text{(Cancel } p(y) \text{)}
\end{aligned}
\]

Conclusion

We have derived the posterior predictive distribution:

\[\boxed{p(y^{\star} \mid y) = \int p(y^{\star} \mid \theta) \, p(\theta \mid y) \, d\theta}\]

This formula shows that to make predictions about new data, we:

Consider all possible parameter values \(\theta\),
Weight each by how plausible it is given our observed data (the posterior),
Average the likelihood of the new data under each parameter value.

This is the fundamental equation for Bayesian prediction: any Bayesian forecast of new data, in statistical modeling or machine learning, evaluates or approximates this integral.