Derivation of Conditional Normal Distribution and Posterior Predictive Distribution

1. General Conditional Normal Distribution

Suppose we have a jointly normal random vector:

\[ \begin{pmatrix} Z_1 \\ Z_2 \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix},\; \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right) \]

where:
- \(Z_1 \in \mathbb{R}^m\), \(Z_2 \in \mathbb{R}^n\)
- \(\Sigma_{11} \in \mathbb{R}^{m\times m}\), \(\Sigma_{22} \in \mathbb{R}^{n\times n}\), \(\Sigma_{12} \in \mathbb{R}^{m\times n}\), \(\Sigma_{21} = \Sigma_{12}^T\).

We want the conditional distribution of \(Z_1\) given \(Z_2 = z_2\).

Step 1: Define a transformation to achieve independence

Let \[ W = Z_1 - A Z_2, \] where the \(m \times n\) matrix \(A\) is chosen so that \(W\) and \(Z_2\) are uncorrelated.

Compute \(\text{Cov}(W, Z_2)\):

\[ \begin{aligned} \text{Cov}(W, Z_2) &= \text{Cov}(Z_1 - A Z_2, Z_2) \\ &= \text{Cov}(Z_1, Z_2) - A \,\text{Cov}(Z_2, Z_2) \\ &= \Sigma_{12} - A \Sigma_{22}. \end{aligned} \]

Set this to zero and assume \(\Sigma_{22}\) is invertible (for example, positive definite): \[ \Sigma_{12} - A \Sigma_{22} = 0 \quad\Rightarrow\quad A = \Sigma_{12} \Sigma_{22}^{-1}. \]

Thus \(W = Z_1 - \Sigma_{12} \Sigma_{22}^{-1} Z_2\) is uncorrelated with \(Z_2\).

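Because \((W, Z_2)\) is a linear transformation of the jointly normal vector \((Z_1, Z_2)\), it is itself jointly normal. For jointly normal vectors, zero covariance implies independence, so \(W\) is independent of \(Z_2\).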
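The choice of \(A\) can be checked numerically. The following is a minimal sketch, assuming NumPy and a made-up \(3 \times 3\) covariance matrix with \(m = 1\), \(n = 2\); the numbers and variable names are illustrative only and do not come from the derivation.

```python
import numpy as np

# Hypothetical joint covariance of (Z_1, Z_2) with m = 1 and n = 2.
# The values are illustrative; the matrix is symmetric positive definite.
Sigma = np.array([
    [2.0, 0.6, 0.3],
    [0.6, 1.0, 0.2],
    [0.3, 0.2, 1.5],
])
m = 1
Sigma_12 = Sigma[:m, m:]   # shape (1, 2)
Sigma_22 = Sigma[m:, m:]   # shape (2, 2)

# A = Sigma_12 Sigma_22^{-1}. Since Sigma_22 is symmetric, A^T solves
# Sigma_22 A^T = Sigma_12^T, which avoids forming the inverse explicitly.
A = np.linalg.solve(Sigma_22, Sigma_12.T).T

# Cov(W, Z_2) = Sigma_12 - A Sigma_22 should be zero up to rounding error.
cov_W_Z2 = Sigma_12 - A @ Sigma_22
print(np.allclose(cov_W_Z2, 0.0))
```

Using a linear solve rather than an explicit inverse is a common numerical choice, not something the derivation requires.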

Step 2: Express \(Z_1\) in terms of \(W\) and \(Z_2\)

\[ Z_1 = W + \Sigma_{12} \Sigma_{22}^{-1} Z_2. \]

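Conditioning on \(Z_2 = z_2\) fixes the second term, and because \(W\) is independent of \(Z_2\), its distribution is unchanged by the conditioning. Hence, in distribution, \[ Z_1 \mid (Z_2 = z_2) \;\overset{d}{=}\; W + \Sigma_{12} \Sigma_{22}^{-1} z_2. \]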

Step 3: Conditional expectation

Since \(W\) is independent of \(Z_2\): \[ E[W \mid Z_2 = z_2] = E[W]. \]

Now \(E[W] = E[Z_1] - \Sigma_{12} \Sigma_{22}^{-1} E[Z_2] = \mu_1 - \Sigma_{12} \Sigma_{22}^{-1} \mu_2\).

Thus: \[ \begin{aligned} E[Z_1 \mid Z_2 = z_2] &= E[W] + \Sigma_{12} \Sigma_{22}^{-1} z_2 \\ &= \mu_1 - \Sigma_{12} \Sigma_{22}^{-1} \mu_2 + \Sigma_{12} \Sigma_{22}^{-1} z_2 \\ &= \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (z_2 - \mu_2). \end{aligned} \]
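As a quick sanity check, consider the bivariate case \(m = n = 1\). Write \(\Sigma_{11} = \sigma_1^2\), \(\Sigma_{22} = \sigma_2^2\) and \(\Sigma_{12} = \rho\,\sigma_1\sigma_2\); the symbols \(\sigma_1\), \(\sigma_2\), \(\rho\) are introduced here only for illustration. The conditional mean then reduces to the familiar regression line:

```latex
\[ E[Z_1 \mid Z_2 = z_2] = \mu_1 + \rho\,\sigma_1\sigma_2 \cdot \frac{1}{\sigma_2^2}\,(z_2 - \mu_2) = \mu_1 + \rho\,\frac{\sigma_1}{\sigma_2}\,(z_2 - \mu_2). \]
```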

Step 4: Conditional variance

Given \(Z_2 = z_2\), \(\Sigma_{12} \Sigma_{22}^{-1} z_2\) is constant, so: \[ \text{Var}(Z_1 \mid Z_2 = z_2) = \text{Var}(W \mid Z_2 = z_2). \]

Since \(W\) is independent of \(Z_2\): \[ \text{Var}(W \mid Z_2 = z_2) = \text{Var}(W). \]

Now compute \(\text{Var}(W) = \text{Var}(Z_1 - A Z_2)\) with \(A = \Sigma_{12} \Sigma_{22}^{-1}\):

\[ \begin{aligned} \text{Var}(W) &= \text{Var}(Z_1) + \text{Var}(A Z_2) - \text{Cov}(Z_1, A Z_2) - \text{Cov}(A Z_2, Z_1) \\ &= \Sigma_{11} + A \Sigma_{22} A^T - \Sigma_{12} A^T - A \Sigma_{21}. \end{aligned} \]

For vector-valued \(Z_1\), the two cross-covariance terms are transposes of each other, so they are written separately rather than as \(2\,\text{Cov}(Z_1, A Z_2)\).

Substitute \(A = \Sigma_{12} \Sigma_{22}^{-1}\). Since \(\Sigma_{22}\) is symmetric, \(A^T = \Sigma_{22}^{-1} \Sigma_{21}\), and so:

\(A \Sigma_{22} A^T = \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{22} \Sigma_{22}^{-1} \Sigma_{21} = \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}\).
\(\Sigma_{12} A^T = \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}\).
\(A \Sigma_{21} = \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}\).

Thus: \[ \text{Var}(W) = \Sigma_{11} + \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}. \]
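In the bivariate case \(m = n = 1\), with the illustrative notation \(\Sigma_{11} = \sigma_1^2\), \(\Sigma_{22} = \sigma_2^2\) and \(\Sigma_{12} = \Sigma_{21} = \rho\,\sigma_1\sigma_2\) (symbols assumed here, not part of the derivation), the variance formula gives the well-known result that conditioning removes a fraction \(\rho^2\) of the variance:

```latex
\[ \text{Var}(W) = \sigma_1^2 - \frac{(\rho\,\sigma_1\sigma_2)^2}{\sigma_2^2} = \sigma_1^2\,(1 - \rho^2). \]
```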

Step 5: Final conditional distribution

\[ Z_1 \mid Z_2 = z_2 \;\sim\; N\big( \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (z_2 - \mu_2),\; \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \big). \]
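The final formula translates directly into a few lines of code. The sketch below assumes NumPy; the function name `conditional_normal` and its argument layout are hypothetical choices made for this illustration, and \(\Sigma_{22}\) is assumed to be symmetric positive definite.

```python
import numpy as np

def conditional_normal(mu_1, mu_2, Sigma_11, Sigma_12, Sigma_22, z_2):
    """Mean and covariance of Z_1 | Z_2 = z_2 for jointly normal (Z_1, Z_2).

    Assumes Sigma_22 is symmetric positive definite (hence invertible).
    """
    mu_1 = np.asarray(mu_1, dtype=float)
    mu_2 = np.asarray(mu_2, dtype=float)
    Sigma_11 = np.asarray(Sigma_11, dtype=float)
    Sigma_12 = np.asarray(Sigma_12, dtype=float)
    Sigma_22 = np.asarray(Sigma_22, dtype=float)
    z_2 = np.asarray(z_2, dtype=float)

    # A = Sigma_12 Sigma_22^{-1}, obtained from Sigma_22 A^T = Sigma_12^T.
    A = np.linalg.solve(Sigma_22, Sigma_12.T).T

    cond_mean = mu_1 + A @ (z_2 - mu_2)      # mu_1 + Sigma_12 Sigma_22^{-1} (z_2 - mu_2)
    cond_cov = Sigma_11 - A @ Sigma_12.T     # Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21
    return cond_mean, cond_cov

# Bivariate illustration with made-up numbers: standard deviations 2 and 1,
# correlation 0.5, so Sigma_12 = 0.5 * 2 * 1 = 1.
mean, cov = conditional_normal(
    mu_1=[1.0], mu_2=[0.0],
    Sigma_11=[[4.0]], Sigma_12=[[1.0]], Sigma_22=[[1.0]],
    z_2=[2.0],
)
print(mean, cov)
```

For these inputs the conditional mean is \(1 + 1 \cdot (2 - 0) = 3\) and the conditional variance is \(4 - 1 = 3\), which agrees with \(\sigma_1^2(1 - \rho^2) = 4 \cdot 0.75\).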

2. Application to Posterior Predictive Distribution

We have:

\[ \begin{pmatrix} Y_0 \\ Y \end{pmatrix} \mid \beta, \sigma^2 \sim N\left( \begin{pmatrix} x_0\beta \\ X\beta \end{pmatrix},\; \sigma^2 \begin{pmatrix} 1 & \Sigma_{12} \\ \Sigma_{21} & H \end{pmatrix} \right) \]

where:
- \(\Sigma_{12}\) is a \(1 \times n\) row vector of correlations: \(\Sigma_{12}(i) = \text{Cor}(Y_0, Y_i)\).
- \(\Sigma_{21} = \Sigma_{12}^T\) (an \(n \times 1\) column vector).
- \(H\) is the \(n \times n\) correlation matrix of \(Y\) (so \(\text{Cov}(Y) = \sigma^2 H\)).
- \(x_0\) is \(1\times p\), \(\beta\) is \(p\times 1\), \(X\) is \(n\times p\).

We want the conditional distribution of \(Y_0\) given \(Y = y\), \(\beta\) and \(\sigma^2\).

Step 1: Match to general formula

The blocks inside the brackets of the joint distribution are correlations, whereas the general formula expects covariances. To keep the two apart, write \(r\) for the correlation row vector (the \(\Sigma_{12}\) of the joint distribution above) and mark covariance blocks with the superscript \(\text{(cov)}\). Then identify:
- \(Z_1 = Y_0\), \(Z_2 = Y\)
- \(\mu_1 = x_0\beta\), \(\mu_2 = X\beta\)
- \(\Sigma_{11} = \sigma^2 \cdot 1 = \sigma^2\)
- \(\Sigma_{12}^{\text{(cov)}} = \sigma^2 r\)
- \(\Sigma_{22} = \sigma^2 H\)
- \(\Sigma_{21}^{\text{(cov)}} = \sigma^2 r^T\).

Step 2: Conditional mean

\[ \begin{aligned} E[Y_0 \mid Y=y, \beta, \sigma^2] &= \mu_1 + \Sigma_{12}^{\text{(cov)}} \Sigma_{22}^{-1} (y - \mu_2) \\ &= x_0\beta + (\sigma^2 r) (\sigma^2 H)^{-1} (y - X\beta) \\ &= x_0\beta + r H^{-1} (y - X\beta). \end{aligned} \]

In the original notation, \(r = \Sigma_{12}\) (the correlation vector). So: \[ E[Y_0 \mid y, \beta, \sigma^2] = x_0\beta + \Sigma_{12} H^{-1} (y - X\beta). \]

Step 3: Conditional variance

\[ \begin{aligned} \text{Var}(Y_0 \mid y, \beta, \sigma^2) &= \Sigma_{11} - \Sigma_{12}^{\text{(cov)}} \Sigma_{22}^{-1} \Sigma_{21}^{\text{(cov)}} \\ &= \sigma^2 - (\sigma^2 r) (\sigma^2 H)^{-1} (\sigma^2 r^T) \\ &= \sigma^2 - \sigma^2 r H^{-1} r^T. \end{aligned} \]

Factoring out \(\sigma^2\): \[ \text{Var}(Y_0 \mid y, \beta, \sigma^2) = \sigma^2 \big[ 1 - r H^{-1} r^T \big]. \]

Since \(r H^{-1} r^T = \Sigma_{12} H^{-1} \Sigma_{21}\) in the original notation (\(\Sigma_{12} = r\), \(\Sigma_{21} = r^T\)), we have: \[ \text{Var}(Y_0 \mid y, \beta, \sigma^2) = \sigma^2 \big[ 1 - \Sigma_{12} H^{-1} \Sigma_{21} \big]. \]

Step 4: Final posterior predictive distribution

\[ Y_0 \mid y, \beta, \sigma^2 \sim N\Big( x_0\beta + \Sigma_{12} H^{-1} (y - X\beta),\; \sigma^2[1 - \Sigma_{12} H^{-1} \Sigma_{21}] \Big). \]
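The predictive mean and variance can be computed as follows. This is a minimal sketch assuming NumPy; the function name `predictive_given_params`, the argument shapes, and the example numbers are all hypothetical, and \(H\) is assumed to be positive definite. The correlation row \(\Sigma_{12}\) is passed as a one-dimensional array `r`.

```python
import numpy as np

def predictive_given_params(x0, X, y, beta, sigma2, r, H):
    """Mean and variance of Y_0 | y, beta, sigma^2 under the joint normal model.

    Shapes: x0 (p,), X (n, p), y (n,), beta (p,), r (n,), H (n, n).
    r holds the correlations Cor(Y_0, Y_i); H is the correlation matrix of Y.
    """
    x0, X, y, beta, r, H = (
        np.asarray(a, dtype=float) for a in (x0, X, y, beta, r, H)
    )
    # w = H^{-1} r^T via a linear solve; H is symmetric, so r H^{-1} = w^T.
    w = np.linalg.solve(H, r)
    mean = x0 @ beta + w @ (y - X @ beta)   # x0 beta + r H^{-1} (y - X beta)
    var = sigma2 * (1.0 - r @ w)            # sigma^2 [1 - r H^{-1} r^T]
    return mean, var

# Made-up example with n = 2 observations and p = 1 coefficient.
X = [[1.0], [1.0]]
y = [3.0, 1.0]
beta = [2.0]
x0 = [1.0]
H = [[1.0, 0.5], [0.5, 1.0]]
r = [0.5, 0.25]

mean, var = predictive_given_params(x0, X, y, beta, sigma2=4.0, r=r, H=H)
print(mean, var)
```

With these numbers \(r H^{-1} = (0.5,\ 0)\) and the residuals are \((1, -1)\), so the mean is \(2 + 0.5 = 2.5\) and the variance is \(4\,(1 - 0.25) = 3\).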

3. Special case: Independence between \(Y_0\) and \(Y\)

If \(Y_0\) is independent of \(Y\) given \(\beta, \sigma^2\) (the standard regression assumption), then:
- \(\Sigma_{12} = 0\) (a row of zeros)
- \(H\) can be any valid correlation matrix, since it drops out of the result (often \(H = I_n\) when the errors are i.i.d.)

Then: \[ Y_0 \mid y, \beta, \sigma^2 \sim N(x_0\beta, \sigma^2). \]

This matches the usual predictive distribution for a new observation in linear regression with independent errors, conditional on \(\beta\) and \(\sigma^2\).
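To see this directly from the general result, substitute \(\Sigma_{12} = 0\) (and hence \(\Sigma_{21} = 0\)) into the mean and variance, where \(0\) denotes a zero row or column vector as appropriate:

```latex
\[ \begin{aligned} E[Y_0 \mid y, \beta, \sigma^2] &= x_0\beta + 0 \cdot H^{-1} (y - X\beta) = x_0\beta, \\ \text{Var}(Y_0 \mid y, \beta, \sigma^2) &= \sigma^2 \big[ 1 - 0 \cdot H^{-1} \cdot 0 \big] = \sigma^2. \end{aligned} \]
```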

Summary

The conditional normal formula comes from finding a linear combination \(W = Z_1 - A Z_2\) independent of \(Z_2\), then using independence to compute conditional mean and variance.
For prediction, \(\Sigma_{12}\) in the joint distribution is the correlation vector between \(Y_0\) and each \(Y_i\), and \(H\) is the correlation matrix of \(Y\).
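The result shows how the observed data \(y\) update the prediction of \(Y_0\) when \(Y_0\) is correlated with \(Y\): the mean shifts by \(\Sigma_{12} H^{-1} (y - X\beta)\) and the variance shrinks by the factor \(1 - \Sigma_{12} H^{-1} \Sigma_{21}\). All of this is conditional on \(\beta\) and \(\sigma^2\); the full posterior predictive distribution is obtained by averaging this conditional distribution over the posterior of \((\beta, \sigma^2)\).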

\end{document}