Bayesian Inference
- In frequentist theory, \(\theta\) is treated as a constant and the data is treated as random.
\[ X|\theta\sim \text{Bin}(n,\theta),\quad\text{with likelihood } p(x|\theta) \]
- In Bayesian theory, \(\theta\) is treated as a random quantity, and inferences are made in terms of probability statements.
\[ p(\underbrace{\theta}_{\text{Random}}|\underbrace{x}_{\text{Fixed}}) \]
- This posterior distribution can be calculated through the application of Bayes' theorem:
\[ p(\theta|x)=\frac{p(\theta)p(x|\theta)}{p(x)}=\frac{p(\theta)p(x|\theta)}{\int p(\theta)p(x|\theta)d\theta}\]
- As \(x\) is fixed, the denominator acts as a normalising constant. We often work with the unnormalised posterior density and reason in terms of proportionality, so we do not need to worry about normalising constants; we are more interested in the shape of the distribution than its scale.
\[ \underbrace{p(\theta|x)}_{\text{Posterior}}\propto \underbrace{p(\theta)}_{\text{Prior}}p(x|\theta)\]
However, the use of a prior distribution to calculate the posterior is both the strength and the weakness of the Bayesian approach: different prior beliefs result in different posteriors, and therefore different inferences, which is Bayes' theorem's blessing and curse, depending on who you ask.
Bayesian updating is done in four steps (a small conjugate example follows the list):
- Specify a likelihood model \(p(x|\theta)\)
- Determine prior \(p(\theta)\)
- Calculate the posterior, \(p(\theta|x)\) from Bayes’ Theorem
- Draw inference from this posterior information
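As a minimal sketch of these four steps, consider the binomial likelihood from above with a conjugate Beta prior, so the posterior is available in closed form; the data values and the Beta(1, 1) prior here are made up for illustration:
# Step 1: likelihood x | theta ~ Bin(n, theta); made-up data: 13 successes in 20 trials
n <- 20; x <- 13
# Step 2: prior theta ~ Beta(1, 1), i.e. uniform on (0, 1)
a <- 1; b <- 1
# Step 3: conjugacy gives the posterior theta | x ~ Beta(a + x, b + n - x)
post_a <- a + x
post_b <- b + n - x
# Step 4: draw inference, e.g. the posterior mean and a 95% credible interval
post_a / (post_a + post_b)
qbeta(c(0.025, 0.975), post_a, post_b)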
Bayesian Linear Model
The frequentist linear model is what we have focused on previously:
\[ y = X\beta+\epsilon,\quad\epsilon\sim N(0,\sigma^2I) \]
- We have \(\theta = (\beta,\sigma^2)\) as unknown parameters that we plan to estimate
The Bayesian linear model is computed differently, as we incorporate a prior belief.
- In this example we utilise the 'standard' or noninformative prior:
- Uniform prior on \(\beta\), uniform prior on \(\log(\sigma^2)\)
\[ p(\beta,\sigma^2)\propto\sigma^{-2} \]
Note we will be working with the multivariate normal distribution, whose density in this setting is:
\[ \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(y-\mathbf{X\beta})^\top\Sigma^{-1}(y-\mathbf{X\beta})\right)\]
- Our goal is to find the posterior distribution of the two unknown parameters \(\beta,\sigma^2\), so we factorise it using the conditional probability expression below and then find each right-hand-side part separately.
\[ p(\beta,\sigma^2|y)=p(\sigma^2|y)p(\beta|\sigma^2,y)\]
- This results in:
\[ p(\beta|\sigma^2,y)\propto \underbrace{p(\beta,\sigma^2)}_{\sigma^{-2}}\underbrace{p(y|\beta,\sigma^2)}_{\text{Multi Normal}}\]
\[ \beta|\sigma^2,y\sim N((X^\top X)^{-1}X^\top y,\sigma^2(X^\top X)^{-1})\]
\[ p(\sigma^2|y)\propto\underbrace{p(\sigma^2)}_{\sigma^{-2}}\underbrace{p(y|\sigma^2)}_{\text{Marginal: integrate out }\beta}\]
\[ \sigma^2|y\sim\text{Inverse-}\chi^2(n-p,s^2),\quad\text{where } s^2=\frac{y^\top y-y^\top X(X^\top X)^{-1}X^\top y}{n-p}\]
- where we use the following lemmas, labelled (1) and (2):
- Consider a \(q\)-dimensional vector \(Z\), where \(A\) is a symmetric positive definite \(q\times q\) matrix and \(b\) is a fixed \(q\times 1\) vector: \[ p(z)\propto\exp\left(-\frac{1}{2}(z^\top Az-2b^\top z)\right)\quad\therefore\quad Z\sim N(A^{-1}b,A^{-1})\quad(1)\]
- With the same \(A\) and \(b\): \[ \int\exp\left(-\frac{1}{2}(z^\top Az-2b^\top z)\right)dz=(2\pi)^{q/2}|A|^{-1/2}\exp\left(\frac{1}{2}b^\top A^{-1}b\right)\quad(2)\]
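As a quick sanity check (not part of the derivation), lemma (2) can be verified numerically in one dimension, where \(q=1\) and \(A=a\) is a scalar; the constants below are arbitrary made-up values:
# 1-D check of lemma (2): the integral of exp(-(a z^2 - 2 b z)/2)
# should equal sqrt(2*pi/a) * exp(b^2/(2a)); a and b are made-up constants
a <- 3; b <- 1.5
integrate(function(z) exp(-0.5 * (a * z^2 - 2 * b * z)), -Inf, Inf)$value
sqrt(2 * pi / a) * exp(b^2 / (2 * a))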
- Full working (1): combining the prior with the multivariate normal likelihood and keeping only terms involving \(\beta\),
\[ p(\beta|\sigma^2,y)\propto\exp\left(-\frac{1}{2\sigma^2}(y-X\beta)^\top(y-X\beta)\right)\propto\exp\left(-\frac{1}{2}\left(\beta^\top\frac{X^\top X}{\sigma^2}\beta-2\frac{(X^\top y)^\top}{\sigma^2}\beta\right)\right)\]
* Applying lemma (1) with \(A=X^\top X/\sigma^2\) and \(b=X^\top y/\sigma^2\) gives \(\beta|\sigma^2,y\sim N((X^\top X)^{-1}X^\top y,\sigma^2(X^\top X)^{-1})\)
- Full working (2): integrate \(\beta\) out of the joint density using lemma (2) with the same \(A\) and \(b\):
\[ p(\sigma^2|y)\propto\sigma^{-2}\int(\sigma^2)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}(y-X\beta)^\top(y-X\beta)\right)d\beta\propto(\sigma^2)^{-\left(\frac{n-p}{2}+1\right)}\exp\left(-\frac{(n-p)s^2}{2\sigma^2}\right)\]
* The integral contributes factors \((\sigma^2)^{p/2}\) and \(\exp\left(\frac{y^\top X(X^\top X)^{-1}X^\top y}{2\sigma^2}\right)\) via lemma (2), and the remaining kernel is that of the \(\text{Inverse-}\chi^2(n-p,s^2)\) distribution
Direct Simulation from the Posterior Distribution
- Note that the calculation of \((X^\top X)^{-1}\) needed for the mean may be numerically unstable or expensive, so we instead use the QR decomposition:
\[ X = QR,\quad Q^\top Q=I,\quad R\text{ is upper triangular }p\times p \]
- Using the QR decomposition, we obtain an easier-to-compute solution for \(b\):
\[ (X^\top X)b = X^\top y\]
\[ R^\top Q^\top QRb = R^\top Q^\top y \]
\[ R^\top Rb = R^\top Q^\top y\]
\[ Rb = Q^\top y\quad\therefore\quad b = R^{-1}Q^\top y\]
- Also note the following identity, which is useful for the covariance matrix \(\sigma^2(X^\top X)^{-1}\) of \(\beta\):
\[ (X^\top X)^{-1} = (R^\top Q^\top QR)^{-1} = (R^\top R)^{-1} = R^{-1}(R^{-1})^\top\]
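As a small sketch (using a made-up design matrix, not the example data below), this identity can be checked numerically:
# Check (X^T X)^{-1} = R^{-1} (R^{-1})^T on a small made-up design matrix
set.seed(1)
X_demo <- cbind(1, rnorm(10))
R_demo <- qr.R(qr(X_demo))
R_inv <- backsolve(R_demo, diag(ncol(X_demo)))  # R^{-1} via triangular solve
max(abs(solve(t(X_demo) %*% X_demo) - R_inv %*% t(R_inv)))  # ~ 1e-16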
Simulating the Posterior
- We simulate from \(p(\sigma^2|y)\) first so that this simulated value can be used in the simulation of \(p(\beta|\sigma^2,y)\); each \((\sigma^2,\beta)\) pair then constitutes one draw from the joint posterior.
\[ p(\beta,\sigma^2|y)=p(\sigma^2|y)p(\beta|\sigma^2,y)\]
Simulate \(\sigma^2\) from \(\sigma^2|y\sim\text{Inverse-}\chi^2(n-p,s^2)\), which is easy to do: draw \(W\sim\chi^2_{n-p}\) and set \(\sigma^2=(n-p)s^2/W\)
Simulate \(\beta\) from \(\beta|\sigma^2,y\sim N(b,\sigma^2(X^\top X)^{-1})\) as follows:
- Generate a vector of standard normal variables \(Z\sim N(0,I)\)
- Calculate \(\beta\) in the following way:
\[ \beta = b+\sigma R^{-1}Z\]
To get a \(100(1-\alpha)\%\) credible interval for a coefficient we take the \(\alpha/2\) and \(1-\alpha/2\) quantiles of its simulated values, e.g. the 2.5% and 97.5% quantiles for a 95% interval
This can be done in R, using 1000 simulations in this example:
data(USArrests)
predictors <- USArrests$UrbanPop
y <- USArrests$Assault
n <- length(y)
X <- cbind(rep(1, times = n), predictors)  # design matrix with intercept column
p <- dim(X)[2]
QR.decomposition <- qr(X)
R <- qr.R(QR.decomposition)
Q <- qr.Q(QR.decomposition)
b <- backsolve(R, t(Q) %*% y)  # least squares estimate b = R^{-1} Q^T y
b
## [,1]
## [1,] 73.07658
## [2,] 1.49044
res <- y - X %*% b
SSE <- sum(res^2)
s2 <- SSE / (n - p)  # s^2 = SSE / (n - p)
sigma_sim <- rep(0, 1000)
beta_sim <- matrix(ncol = 1000, nrow = 2)
for (i in 1:1000) {
  # sigma^2 | y ~ Inverse-chi^2(n - p, s^2), drawn as (n - p) s^2 / chi^2_{n-p}
  sigma_sim[i] <- (n - p) * s2 / rchisq(1, df = n - p)
  # beta | sigma^2, y ~ N(b, sigma^2 (X^T X)^{-1}), via beta = b + sigma R^{-1} Z
  beta_sim[, i] <- b + sqrt(sigma_sim[i]) * backsolve(R, rnorm(n = p))
}
summary(sigma_sim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3520 5854 6730 6900 7735 13020
summary(beta_sim[1, ])  # intercept draws
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -137.63 36.05 74.01 73.32 110.03 240.60
summary(beta_sim[2, ])  # slope draws
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.1115 0.9032 1.4622 1.4799 2.0504 4.2867
quantile(beta_sim[2, ], probs = c(0.025, 0.975))  # 95% credible interval for the slope
## 2.5% 97.5%
## -0.1397417 3.0554255
Prediction
- We may want to predict future, unknown responses. Write \(\tilde{y}\) as the future response and \(\tilde{x}\) as the future predictor values.
\[ \tilde{y}|\beta,\sigma^2\sim N(\tilde{x}^\top\beta,\sigma^2)\]
- We want to calculate the posterior predictive distribution of \(\tilde{y}\) by integrating out the uncertainty in the parameters; intuitively, this averages what we expect \(\tilde{y}\) to be, given the parameters, over the posterior.
\[ p(\tilde{y}|y)=\int\int p(\tilde{y}|\beta,\sigma^2)p(\beta|\sigma^2,y)d\beta\, p(\sigma^2|y)d\sigma^2\]
- This can be rewritten as follows, which is easier to calculate because the inner integral involves only normal distributions.
\[ \int p(\tilde{y}|\sigma^2,y)p(\sigma^2|y)d\sigma^2,\quad p(\tilde{y}|\sigma^2,y)=\int \underbrace{p(\tilde{y}|\beta,\sigma^2)}_{\text{Univariate}}\underbrace{p(\beta |\sigma^2, y)}_{\text{Multivariate}}d\beta\]
This results in \(\tilde{y}|y\sim t_{n-p}(\tilde{x}^\top b,s^2(1+\tilde{x}^\top(X^\top X)^{-1}\tilde{x}))\)
We utilise lemmas (1), (2) and the additional lemma (3) in this calculation:
\[ \int^\infty_0z^{-(a+1)}\exp(-b/z)dz=\frac{\Gamma(a)}{b^a} \quad(3)\]
- Full working: the integrals can be evaluated directly with lemmas (1)-(3), but the algebra is heavy; the mean and variance are more easily obtained from the laws of iterated expectation and total variance:
\[ \mathbb{E}[\mathbb{E}[U|V]]=\mathbb{E}[U]\]
\[ \mathbb{V}ar(U) = \mathbb{E}[\mathbb{V}ar(U|V)]+\mathbb{V}ar(\mathbb{E}[U|V])\]
- Applying these with \(V=\beta\), conditional on \(\sigma^2\), gives \(\tilde{y}|\sigma^2,y\sim N(\tilde{x}^\top b,\,\sigma^2(1+\tilde{x}^\top(X^\top X)^{-1}\tilde{x}))\); integrating \(\sigma^2\) against its Inverse-\(\chi^2\) posterior then yields:
\[ \frac{\tilde{y}-\tilde{x}^\top b}{s\sqrt{1+\tilde{x}^\top(X^\top X)^{-1}\tilde{x}}}\sim t_{n-p}\]
- And the resulting credible interval:
\[ \tilde{x}^\top b\pm st_{n-p;\alpha/2}\sqrt{1+\tilde{x}^\top(X^\top X)^{-1}\tilde{x}}\]
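As a minimal sketch, continuing from the R objects above (b, R, s2, n, p) and using a hypothetical new predictor value UrbanPop = 80 (not part of the original example), the 95% interval can be computed as:
x_new <- c(1, 80)  # hypothetical new observation: intercept and UrbanPop = 80
y_hat <- sum(x_new * b)  # point prediction x^T b
v <- backsolve(R, x_new, transpose = TRUE)  # solves R^T v = x_new
se <- sqrt(s2 * (1 + sum(v^2)))  # s * sqrt(1 + x^T (X^T X)^{-1} x)
y_hat + c(-1, 1) * qt(0.975, df = n - p) * se  # 95% predictive credible interval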
- Residuals are calculated in the same way as for the frequentist linear model:
\[ e = y-Xb\]