Bayesian Neural Networks
- Bayesian Neural Networks can naturally take into account uncertainty in parameters (weights) and topology
- Probability distributions are used to represent the weights.
- While the uncertainty quantification and the ability to integrate priors are useful, the computation is complex and expensive, so Bayesian neural networks are rarely used with complicated architectures. However, work is being done to improve this, such as parallel tempering. Being overconfident in an incorrect prior is also a problem.
Bayesian Inference
\[ p(\theta|X)=\frac{p(X|\theta)\,p(\theta)}{\int p(X|\theta)\,p(\theta)\, d\theta} \propto p(X|\theta)\,p(\theta)\]
- Combine a prior belief about a variable with the likelihood of the data given that variable to obtain a posterior distribution for the variable.
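As a small illustration of the formula above, the sketch below evaluates Bayes' rule on a grid for a coin's bias. The coin example, the flat prior, and the name `grid_posterior` are assumptions made for illustration; the integral in the denominator is replaced by a sum over grid points.

```python
def grid_posterior(heads, tails, grid_size=101):
    """Posterior over a coin's bias theta on a uniform grid, using a flat prior."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1.0 / grid_size] * grid_size
    # Likelihood p(X | theta) of the observed counts for each candidate theta.
    likelihood = [t ** heads * (1 - t) ** tails for t in thetas]
    unnormalised = [lik * p for lik, p in zip(likelihood, prior)]
    # Discrete stand-in for the integral in the denominator (the evidence).
    evidence = sum(unnormalised)
    posterior = [u / evidence for u in unnormalised]
    return thetas, posterior


thetas, posterior = grid_posterior(heads=7, tails=3)
# Grid point with the highest posterior probability.
map_theta = thetas[max(range(len(posterior)), key=posterior.__getitem__)]
print(map_theta)
```

With a flat prior the posterior peak coincides with the maximum-likelihood estimate; a more informative prior would pull it towards the prior's mode.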
MCMC Sampling
- MCMC sampling draws samples iteratively: candidates come from a proposal distribution and are accepted or rejected using the prior and the likelihood function, so that the accepted samples approximate the posterior distribution.
Markov Processes
- Markov processes are used in MCMC sampling
- A Markov process is uniquely defined by its transition probabilities \(p(x'|x)\), which give the probability of transitioning from any given state \(x\) to any other state \(x'\).
- The process has a unique stationary distribution \(\pi(x)\) provided the following conditions are met:
  - Detailed balance (a sufficient condition), requiring that each transition \(x\rightarrow x'\) is reversible, i.e. \(\pi(x)p(x'|x) = \pi(x')p(x|x')\).
    - In equilibrium the probability flow from \(x\) to \(x'\) equals the flow back, so the chain leaves \(\pi\) unchanged and the distribution can be explored in full.
  - The stationary distribution must be unique, which is guaranteed by the ergodicity of the process.
    - Ergodicity holds when every state is aperiodic (the system does not return to the same state only at fixed intervals) and positive recurrent (the expected number of steps to return to the same state is finite). In other words, an ergodic system is one that mixes well, i.e. you get the same result whether you average its values over time or over space.
- The detailed balance condition is used in the MCMC algorithm, where \(\pi(x) = p(x)\):
\[ p(x)p(x'|x) = p(x')p(x|x')\rightarrow \frac{p(x'|x)}{p(x|x')} = \frac{p(x')}{p(x)}\]
- Note that for a random-walk proposal \(q\), which is symmetric, the proposal ratio \(q(x|y)/q(y|x)\) cancels out:
\[ y = x+\epsilon\]
\[ q(y|x) = q(\epsilon)\]
\[ q(x|y) = q(-\epsilon) = q(\epsilon)\]
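The detailed balance condition can be checked numerically on a tiny discrete chain. The sketch below builds the transition matrix of a Metropolis chain with a uniform (hence symmetric) proposal over three states and verifies both detailed balance and stationarity. The three-state target and the function name are assumptions chosen for illustration.

```python
def metropolis_transition_matrix(target):
    """Transition matrix of a Metropolis chain with a uniform symmetric proposal."""
    n = len(target)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                # Propose j uniformly among the other states,
                # then accept with probability min(1, pi_j / pi_i).
                P[i][j] = (1.0 / (n - 1)) * min(1.0, target[j] / target[i])
        # Rejected proposals keep the chain in state i.
        P[i][i] = 1.0 - sum(P[i])
    return P


target = [0.2, 0.3, 0.5]  # hypothetical stationary distribution pi
P = metropolis_transition_matrix(target)
n = len(target)

# Detailed balance: pi(x) p(x'|x) == pi(x') p(x|x') for every pair of states.
balanced = all(
    abs(target[i] * P[i][j] - target[j] * P[j][i]) < 1e-12
    for i in range(n)
    for j in range(n)
)
# Stationarity: one step of the chain leaves pi unchanged.
pi_next = [sum(target[i] * P[i][j] for i in range(n)) for j in range(n)]
print(balanced, pi_next)
```

Because the proposal is symmetric, only the ratio of target probabilities appears in the acceptance step, which mirrors the cancellation of the \(q\) ratio noted above.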
Basic MCMC Algorithm
- For each iteration \(i\) in range(max_samples) (or until a convergence criterion is reached):
  - Propose a value \(x' \sim q(x'|x_{i-1})\), where \(q\) is the proposal distribution.
  - Given \(x'\), compute the prediction \(\hat{y} = f(x', X)\) and the log-likelihood.
  - Calculate the acceptance probability \(\alpha\) (Metropolis-Hastings criterion).
  - Generate \(u \sim U(0,1)\).
  - If \(u < \alpha\), accept by setting \(x_i = x'\).
  - Otherwise, reject by setting \(x_i = x_{i-1}\).
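The loop above can be sketched as a random-walk Metropolis sampler. This is a minimal sketch, not a definitive implementation: it assumes a one-dimensional parameter, a Gaussian random-walk proposal (so the \(q\) ratio cancels), and a standard normal target chosen purely for illustration. The names `random_walk_metropolis` and `log_target` are hypothetical.

```python
import math
import random


def random_walk_metropolis(log_target, x0, n_samples, step=2.0, seed=0):
    """Sample from exp(log_target) using a symmetric Gaussian random walk."""
    rng = random.Random(seed)
    x = x0
    log_p = log_target(x)
    samples = []
    accepted = 0
    for _ in range(n_samples):
        # Propose x' = x + eps with eps ~ N(0, step^2).
        x_prop = x + rng.gauss(0.0, step)
        log_p_prop = log_target(x_prop)
        # Symmetric proposal: alpha = min(1, p(x') / p(x)), computed in log space.
        log_alpha = min(0.0, log_p_prop - log_p)
        if rng.random() < math.exp(log_alpha):
            x, log_p = x_prop, log_p_prop  # accept the proposal
            accepted += 1
        # On rejection the current value is kept and recorded again.
        samples.append(x)
    return samples, accepted / n_samples


def log_standard_normal(x):
    # Unnormalised log density; the constant cancels in the acceptance ratio.
    return -0.5 * x * x


samples, acceptance_rate = random_walk_metropolis(log_standard_normal, x0=0.0, n_samples=20000)
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, variance, acceptance_rate)
```

Working with log densities avoids numerical underflow when the likelihood is a product over many data points.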
Priors and Likelihoods
- We maximise the log-likelihood rather than the likelihood because it is easier to compute and both attain their maximum at the same point.
- For a continuous output we typically say that the relationship between input and output is a signal plus noise model
\[ y = f(x|\theta) + \epsilon,\quad \epsilon\sim N(0,\tau^2)\]
- This has a Gaussian likelihood, where \(S\) is the number of samples:
\[ p(y|\theta) = \frac{1}{(2\pi\tau^2)^{S/2}}\exp\left(-\frac{1}{2\tau^2}\sum^S_{i=1}(y_i-f(x_i|\theta))^2\right)\]
- If the problem is classification, the likelihood is multinomial.
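Taking the logarithm of the Gaussian likelihood gives \(-\frac{S}{2}\log(2\pi\tau^2) - \frac{1}{2\tau^2}\sum_i (y_i - f(x_i|\theta))^2\). The sketch below computes this quantity from a list of targets and model predictions; the function name and the toy values are assumptions for illustration.

```python
import math


def gaussian_log_likelihood(y, predictions, tau_sq):
    """Log of the Gaussian likelihood for S observations with noise variance tau_sq."""
    S = len(y)
    # Sum of squared residuals between targets and model outputs f(x_i | theta).
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, predictions))
    return -0.5 * S * math.log(2 * math.pi * tau_sq) - sse / (2 * tau_sq)


# Toy values: two observations and the corresponding model predictions.
log_lik = gaussian_log_likelihood(y=[1.0, 2.5], predictions=[1.2, 2.0], tau_sq=0.5)
print(log_lik)
```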
- The weights and biases are typically given Gaussian priors, combined with a prior on the noise; \(L\) is the number of parameters in the model.
- The prior for the noise variance \(\tau^2\) is inverse gamma, as it has to be positive.
\[ p(\theta) \propto \frac{1}{(2\pi\sigma^2)^{L/2}}\exp\left(-\frac{1}{2\sigma^2}\sum^L_{i=1}\theta_i^2\right)\tau^{-2(1+\nu_1)}\exp\left(\frac{-\nu_2}{\tau^2}\right) \]
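In log form, the prior above becomes a sum of a Gaussian term over the parameters and an inverse-gamma term in \(\tau^2\). The sketch below evaluates it up to the same proportionality constant; the function name and the hyperparameter values used in the example call are assumptions, not values taken from these notes.

```python
import math


def log_prior(theta, tau_sq, sigma_sq, nu1, nu2):
    """Unnormalised log prior: Gaussian on the parameters, inverse gamma on tau^2."""
    L = len(theta)
    # Gaussian part: -(L/2) log(2 pi sigma^2) - sum(theta_i^2) / (2 sigma^2).
    gaussian_part = -0.5 * L * math.log(2 * math.pi * sigma_sq) - sum(
        t * t for t in theta
    ) / (2 * sigma_sq)
    # Inverse-gamma part: tau^{-2(1 + nu1)} exp(-nu2 / tau^2), in log form.
    inverse_gamma_part = -(1 + nu1) * math.log(tau_sq) - nu2 / tau_sq
    return gaussian_part + inverse_gamma_part


# Hypothetical hyperparameters for a model with three parameters.
value = log_prior(theta=[0.1, -0.4, 0.3], tau_sq=0.5, sigma_sq=25.0, nu1=1.0, nu2=1.0)
print(value)
```

Adding this to the log-likelihood gives the unnormalised log-posterior used in the acceptance ratio.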
Burn-In
- An initial portion of the MCMC samples is discarded, with a higher percentage for more complicated models. These samples were drawn before the chain converged and are therefore not representative of the posterior distribution.
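Discarding the burn-in amounts to slicing off the start of the chain. A minimal sketch follows, with a hypothetical helper name and a burn-in fraction chosen only for illustration; the appropriate fraction depends on the model.

```python
def discard_burn_in(samples, burn_in_fraction=0.25):
    """Drop the first burn_in_fraction of the chain and keep the rest."""
    if not 0.0 <= burn_in_fraction < 1.0:
        raise ValueError("burn_in_fraction must be in [0, 1)")
    start = int(len(samples) * burn_in_fraction)
    return samples[start:]


chain = [5.0, 3.1, 1.4, 0.9, 1.1, 1.0, 0.8, 1.2]  # toy chain that settles near 1
kept = discard_burn_in(chain, burn_in_fraction=0.25)
print(kept)
```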
Langevin-Gradient Bayesian Neural Networks
- ‘Smart’ proposal distribution utilises Langevin-Gradient rather than a ‘dumb’ random walk.
- However, because the proposal relies on gradients, the model must be differentiable so that gradients can be computed.
- The gradient drives the random walk towards regions of high probability in the manner of gradient flow
Detailed Balance for Proposal Distribution LG
- Acceptance probability for Metropolis-Hastings:
\[ \alpha = \min\left(1,\frac{p(\theta^p|y)q(\theta^{[k]}|\theta^p)}{p(\theta^{[k]}|y)q(\theta^{p}|\theta^{[k]})}\right)\]
\[ q(\theta^{p}|\theta^{[k]})\text{ is given by } \theta^p\sim N(\bar{\theta}^{[k]},\Sigma_\theta) \]
\[ q(\theta^{[k]}|\theta^p)\sim N(\bar{\theta}^p,\Sigma_\theta) \]
- These probabilities take into account the gradient \(\nabla E\) and learning rate \(r\), with \(\bar{\theta}^{[k]}\) defined analogously from \(\theta^{[k]}\):
\[ \bar{\theta}^p = \theta^p + r\nabla E[\theta^p]\]
- The above ensures that the balance condition holds and \(\theta^{[k]}\) converges to draws from the posterior \(p(\theta|y)\)
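The acceptance step above can be sketched for a one-parameter model \(y = \theta x + \epsilon\). This is a sketch under stated assumptions rather than the method of any particular paper: the toy data, the hyperparameters, and all function names are hypothetical, and the drift uses the gradient of the log-posterior so that proposals move towards regions of high probability. Because the proposal is no longer symmetric, both \(q\) terms are kept in the ratio.

```python
import math
import random

# Toy data and hyperparameters (assumptions for illustration).
XS = [1.0, 2.0, 3.0, 4.0]
YS = [2.1, 3.9, 6.2, 7.8]
TAU_SQ = 1.0     # known noise variance
SIGMA_SQ = 25.0  # prior variance on theta


def log_post(theta):
    """Unnormalised log posterior: Gaussian likelihood plus Gaussian prior."""
    sse = sum((y - theta * x) ** 2 for x, y in zip(XS, YS))
    return -sse / (2 * TAU_SQ) - theta ** 2 / (2 * SIGMA_SQ)


def grad_log_post(theta):
    """Derivative of log_post with respect to theta."""
    return sum(x * (y - theta * x) for x, y in zip(XS, YS)) / TAU_SQ - theta / SIGMA_SQ


def log_normal_pdf(value, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)


def langevin_mh(theta0, n_samples, r=0.01, prop_var=0.02, seed=0):
    rng = random.Random(seed)
    theta = theta0
    samples = []
    for _ in range(n_samples):
        # theta-bar^[k]: current value shifted by one gradient step.
        mean_fwd = theta + r * grad_log_post(theta)
        theta_p = rng.gauss(mean_fwd, math.sqrt(prop_var))
        # theta-bar^p: the same gradient step taken from the proposal.
        mean_rev = theta_p + r * grad_log_post(theta_p)
        log_q_fwd = log_normal_pdf(theta_p, mean_fwd, prop_var)  # q(theta^p | theta^[k])
        log_q_rev = log_normal_pdf(theta, mean_rev, prop_var)    # q(theta^[k] | theta^p)
        log_alpha = log_post(theta_p) + log_q_rev - log_post(theta) - log_q_fwd
        if rng.random() < math.exp(min(0.0, log_alpha)):
            theta = theta_p  # accept
        samples.append(theta)
    return samples


samples = langevin_mh(theta0=0.0, n_samples=5000)
posterior_draws = samples[500:]  # discard burn-in
estimate = sum(posterior_draws) / len(posterior_draws)
print(estimate)
```

For this conjugate toy problem the posterior mean is available in closed form, which makes it easy to sanity-check the sampler before moving to a neural network.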