Literature review

(Rainforth et al. 2024)

Basics of BED

  • Problem setting of Bayesian experimental design (BED): choose the experimental design optimally.
  • Optimality: maximise the information gain from the experiment (usually).
  • Relation to Bayesian adaptive design (BAD): after every experiment, re-optimise the design of the next experiment sequentially.

Formulation of BED

  • Maximise information gain

\[ \begin{align} \text{infoGain}_{\theta}(\xi, y) :&= H[p(\theta)] - H[p(\theta | y, \xi)] \\ &= - E_{p(\theta)}[\log p(\theta)] + E_{p(\theta | y, \xi)}[\log p(\theta | y, \xi)] \end{align} \]

  • \(\xi\): the design we can control.
  • \(y\): observation after experiment \(\xi\).
  • \(y\) is unknown before the experiment is run: take the expectation with respect to \(p(y | \xi)\) to obtain the expected information gain (EIG).

\[ \begin{align} \text{EIG}_{\theta}(\xi) :&= E_{p(y | \xi)}[\text{infoGain}_{\theta}(\xi, y)] \\ &= E_{p(y | \xi)}[- E_{p(\theta)}[\log p(\theta)] + E_{p(\theta | y, \xi)}[\log p(\theta | y, \xi)]] \\ &= E_{p(y|\xi)p(\theta|y, \xi)}[\log p(\theta|y, \xi)]- E_{p(y|\xi)p(\theta)}[\log p(\theta)] \end{align} \] From

\[ p(y|\xi)p(\theta|y,\xi) = p(y, \theta| \xi) = p(\theta)p(y|\theta, \xi) \]

and, since \(\log p(\theta)\) depends only on \(\theta\) while the \(\theta\)-marginal of \(p(y, \theta | \xi)\) is \(p(\theta)\),

\[ E_{p(y|\xi)p(\theta)}[\log p(\theta)] = E_{p(\theta)}[\log p(\theta)] = E_{p(y, \theta|\xi)}[\log p(\theta)], \]

we can write both terms as expectations under the joint \(p(\theta, y | \xi) = p(\theta)p(y | \theta, \xi)\):

\[ \text{EIG}_\theta(\xi) = E_{p(\theta)p(y|\theta, \xi)}[\log p(\theta|y, \xi) - \log p(\theta)] \] Interpretation: the KL divergence between the posterior and the prior, averaged over \(p(y | \xi)\).

Additionally,

\[ \begin{align} \text{EIG}_\theta(\xi) &= E_{p(\theta, y|\xi)}[\log p(\theta|y, \xi) - \log p(\theta)] \\ &= E_{p(\theta, y|\xi)}\left[ \log \frac{p(\theta, y | \xi)}{p(y | \xi)} - \log p(\theta) \right] \\ &= E_{p(\theta, y|\xi)} \left[ \log \frac{p(\theta, y | \xi)}{p(y|\xi)p(\theta)} \right] \end{align} \] Interpretation: the mutual information between \(y\) and \(\theta\) given the design \(\xi\).

Moreover, since the prior does not depend on the design, \(p(\theta | \xi) = p(\theta)\), and Bayes' rule conditioned on \(\xi\) gives

\[ \begin{align} p(\theta| y, \xi) &= \frac{p(\theta | \xi)p(y|\xi, \theta)}{p(y|\xi)} \\ &= \frac{p(\theta)p(y|\xi, \theta)}{p(y|\xi)}, \end{align} \] so \[ \begin{align} \text{EIG}_\theta(\xi) &= E_{p(\theta)p(y|\theta, \xi)} \left[ \log \frac{p(\theta)p(y|\xi, \theta)}{p(y|\xi)} - \log p(\theta) \right]\\ &= E_{p(\theta)p(y|\theta, \xi)} [\log p(y|\xi, \theta) - \log p(y | \xi)] \end{align} \]

Interpretation: the expected reduction in predictive uncertainty about \(y\) from conditioning on \(\theta\), with the expectation taken under the joint distribution.
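As a quick sanity check (an illustrative example, not taken from the paper), consider the linear-Gaussian model \(\theta \sim N(0, \tau^2)\), \(y | \theta, \xi \sim N(\xi\theta, \sigma^2)\). The posterior variance \((\tau^{-2} + \xi^2/\sigma^2)^{-1}\) does not depend on \(y\), so

\[ \text{EIG}_\theta(\xi) = \frac{1}{2}\log\left(1 + \frac{\xi^2\tau^2}{\sigma^2}\right), \]

which grows with the signal-to-noise ratio of the design: the optimal design makes \(|\xi|\) as large as the design space allows. This toy model is reused in the code sketches below.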

Formulation of BAD

Now the EIG depends on the history of experiments \(h_{t-1} := \{ (\xi_k, y_k)\}_{k=1}^{t-1}\):

\[ \text{EIG}_\theta (\xi_t|h_{t-1}) := E_{p(\theta | h_{t-1})p(y|\xi_t, \theta, h_{t-1})} \left[ \log p(y|\xi_t, \theta, h_{t-1}) - \log p(y | \xi_t, h_{t-1}) \right] \]

Justification for the use of Bayes

  • Framework for decision making under uncertainty and incomplete information.
  • Previous experimental data are incorporated sequentially through the prior (especially relevant for BAD).
  • Frequentist alternative: use the Fisher information matrix instead of the EIG.
  • However, evaluating the Fisher information requires a point estimate of \(\theta\), precisely when the experiment is meant to gather information about \(\theta\).

Computational challenges

Posterior \(p(\theta | y, \xi)\) and marginal \(p(y | \xi) = \int p(y | \xi, \theta) p(\theta) d\theta\) are intractable in general.

  • Nested Monte Carlo.
  • Multi-level Monte Carlo.
  • Variational approach.
  • Black-box optimisation (e.g., Bayesian optimisation)

Amortised learning

  • For BAD, design decisions need to be made quickly after each observation: the above methods are too slow.
  • Instead of computing the optimal \(\xi\) by maximising the EIG at each step, learn a decision-making policy beforehand.

Policy based BAD

  • The traditional BAD approach is sub-optimal since it fails to account for the fact that the design chosen at the current step also affects future EIGs through \(h_t\).
  • Use the total EIG over all \(T\) experiments instead.

\[ \begin{align} \text{TEIG}_{\theta}(\pi_{\phi}) := E_{p(\theta)p(y_{1:T} | \theta, \pi_{\phi})} [ \log p(y_{1:T} | \theta, \pi_{\phi}) - \log p(y_{1:T} | \pi_{\phi}) ] \end{align} \] where \(\xi_t = \pi_{\phi} (h_{t-1})\) is a random variable depending on the data from the previous experiments \(h_{t-1}\).

  • The policy is parameterised by \(\phi\): learn the policy by maximising the TEIG w.r.t. \(\phi\).

Future directions

  • Scalability of policy-based BAD.
  • Connections with other fields.
  • Model misspecification of the likelihood \(p(y | \theta, \xi)\).
  • We do not only estimate the model, but also use the model to collect more data!

(Foster et al. 2019)

How do we estimate the intractable \(\text{EIG}_\theta(\xi)\) (which will afterwards be optimised w.r.t. \(\xi\), for example using Bayesian optimisation)?

Nested Monte Carlo

Classical approach.

\[ \begin{align} \text{EIG}_\theta(\xi) &= E_{p(y, \theta|\xi)} [\log p(y|\xi, \theta) - \log p(y | \xi)]\\ &\approx \frac{1}{S} \sum_{s=1}^S \log \frac{p(y^{(s)}|\xi, \theta^{(s)})}{p(y^{(s)}| \xi)} \end{align} \]

where \(\theta^{(s)} \sim p(\theta)\) and \(y^{(s)} \sim p(y| \theta = \theta^{(s)}, \xi)\). Note that \(p(y^{(s)} | \xi) = \int p(y^{(s)} | \xi, \theta)p(\theta)d\theta\) is also intractable, hence a nested (inner) Monte Carlo estimate is used:

\[ \begin{align} p(y^{(s)}|\xi) &= \int p(y^{(s)}|\theta, \xi) p(\theta)d\theta \\ &\approx \frac{1}{T}\sum_{t=1}^{T}p(y^{(s)}| \theta^{(s, t)}, \xi) \end{align} \]

where \(\theta^{(s, t)} \sim p(\theta)\). This gives an estimate of \(\text{EIG}_\theta(\xi)\) for a fixed \(\xi\).
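A minimal sketch of this nested estimator for the toy linear-Gaussian model from the earlier example (model, sample sizes, and function names are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, tau = 1.0, 2.0                                  # observation noise and prior std (toy choices)

def log_likelihood(y, theta, xi):
    # log p(y | theta, xi) for y | theta, xi ~ N(xi * theta, sigma^2)
    return -0.5 * ((y - xi * theta) / sigma) ** 2 - 0.5 * np.log(2 * np.pi * sigma**2)

def nmc_eig(xi, S=2000, T=2000):
    """Nested Monte Carlo estimate of EIG_theta(xi)."""
    theta = rng.normal(0.0, tau, size=S)               # theta^(s) ~ p(theta)
    y = rng.normal(xi * theta, sigma)                  # y^(s) ~ p(y | theta^(s), xi)
    log_lik = log_likelihood(y, theta, xi)             # log p(y^(s) | theta^(s), xi)
    theta_inner = rng.normal(0.0, tau, size=(S, T))    # theta^(s,t) ~ p(theta)
    inner = log_likelihood(y[:, None], theta_inner, xi)
    log_marg = np.logaddexp.reduce(inner, axis=1) - np.log(T)   # inner estimate of log p(y^(s) | xi)
    return np.mean(log_lik - log_marg)

print(nmc_eig(xi=1.0))   # should be close to the analytic 0.5 * log(1 + xi^2 * tau^2 / sigma^2) ≈ 0.80
```

The nested structure is visible in the cost: the estimator needs \(S \times T\) likelihood evaluations for a single design \(\xi\), which is what motivates the variational alternatives below.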

Variational inference

Nested Monte Carlo is computationally quite expensive.

Variational posterior

Apply variational inference to the intractable posterior.

\[ \begin{align} E_{p(y, \theta| \xi)}[\log p(\theta|y, \xi) - \log p(\theta)] &\approx E_{p(y, \theta| \xi)}[\log q_\phi(\theta|y, \xi) - \log p(\theta)] \\ &\approx \frac{1}{S} \sum_{s=1}^{S} \log \frac{q_\phi(\theta^{(s)}| y^{(s)}, \xi)}{p(\theta^{(s)})} \end{align} \]

where \(\theta^{(s)}, y^{(s)} \sim p(\theta, y | \xi) = p(\theta)p(y|\theta, \xi)\). First learn the variational parameters \(\phi\), then estimate the EIG with MC. One can show this is a lower bound on the EIG, with tightness achieved when the variational distribution matches the true posterior exactly.
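A minimal sketch of this variational-posterior estimator for the same toy linear-Gaussian model, using an amortised Gaussian \(q_\phi(\theta | y, \xi) = N(ay + b, s^2)\) fitted by SGD (the parameterisation and hyper-parameters are illustrative assumptions):

```python
import torch

torch.manual_seed(0)
sigma, tau, xi = 1.0, 2.0, 1.0                         # toy model and a fixed design

def sample_joint(S):
    theta = tau * torch.randn(S)                       # theta ~ p(theta) = N(0, tau^2)
    y = xi * theta + sigma * torch.randn(S)            # y | theta, xi ~ N(xi * theta, sigma^2)
    return theta, y

# amortised variational posterior q_phi(theta | y, xi) = N(a*y + b, s^2)
a = torch.zeros((), requires_grad=True)
b = torch.zeros((), requires_grad=True)
log_s = torch.zeros((), requires_grad=True)
opt = torch.optim.Adam([a, b, log_s], lr=0.05)

for step in range(2000):
    theta, y = sample_joint(512)
    q = torch.distributions.Normal(a * y + b, log_s.exp())
    loss = -q.log_prob(theta).mean()                   # maximise E[log q_phi(theta | y, xi)]
    opt.zero_grad()
    loss.backward()
    opt.step()

# plug the fitted q_phi into the bound: E[log q_phi(theta | y, xi) - log p(theta)]
with torch.no_grad():
    theta, y = sample_joint(100_000)
    q = torch.distributions.Normal(a * y + b, log_s.exp())
    p = torch.distributions.Normal(0.0, tau)
    print((q.log_prob(theta) - p.log_prob(theta)).mean().item())   # approaches the analytic 0.80 from below
```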

Variational marginal

\[ \begin{align} E_{p(\theta)p(y|\theta, \xi)} [\log p(y|\xi, \theta) - \log p(y | \xi)] &\approx E_{p(\theta)p(y|\theta, \xi)} [\log p(y|\xi, \theta) - \log q_\phi(y | \xi)] \\ &\approx \frac{1}{S} \sum_{s=1}^{S} \log \frac{p(y^{(s)} | \xi, \theta^{(s)})}{q_\phi (y^{(s)} | \xi)} \end{align} \] This is an upper bound on the EIG, with equality when \(q_\phi(y | \xi) = p(y | \xi)\).

Variational Nested MC

  • Problem with the above approaches: they are not consistent when the variational family is misspecified.
  • Instead, use variational inference to obtain a proposal distribution for importance sampling.

\[ \begin{align} E_{p(\theta)p(y|\theta, \xi)} [\log p(y|\xi, \theta) - \log p(y | \xi)] &= \int [\log p(y | \xi, \theta) - \log p(y | \xi) ]p(y, \theta | \xi) dyd\theta \\ &\approx \frac{1}{S} \sum_{s=1}^{S} \left[ \log p(y^{(s)} | \xi, \theta^{(s)}) - \log p(y^{(s)} | \xi) \right] \end{align} \]

where \(\theta^{(s)} \sim p(\theta)\), \(y^{(s)} \sim p(y | \theta = \theta^{(s)}, \xi)\). Again, \(\log p(y^{(s)} | \xi)\) is intractable. This time, use importance sampling with a proposal distribution \(q_\phi\), whose parameters \(\phi\) are learnt beforehand by minimising the resulting upper bound on the EIG.

Then the importance sampling estimate is

\[ \begin{align} p(y^{(s)} | \xi) &= \int p(y^{(s)}, \theta | \xi) d \theta \\ &= \int \frac{p(y^{(s)} ,\theta | \xi)}{q_\phi(\theta | y^{(s)}, \xi)} q_\phi(\theta | y^{(s)}, \xi) d\theta \\ &\approx \frac{1}{T} \sum_{t=1}^{T} \frac{p(y^{(s)}, \theta^{(s,t)} | \xi)}{q_{\phi^*}(\theta^{(s,t)} | y^{(s)}, \xi)} \end{align} \] where \(\theta^{(s,t)} \sim q_{\phi^*}(\theta | y^{(s)}, \xi)\), and the logarithm of this estimate is plugged in for \(\log p(y^{(s)} | \xi)\).
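A minimal numpy sketch of the resulting estimator for the same toy model, with a hand-specified Gaussian proposal \(q(\theta | y) = N(ay, s^2)\) standing in for a learnt \(q_{\phi^*}\) (all constants are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, tau, xi = 1.0, 2.0, 1.0                            # toy model and a fixed design

def log_normal(x, mean, std):
    return -0.5 * ((x - mean) / std) ** 2 - 0.5 * np.log(2 * np.pi * std**2)

a, s = 0.8, 0.95                                          # stand-in proposal q(theta | y) = N(a*y, s^2)

def vnmc_eig(S=2000, T=200):
    theta0 = rng.normal(0.0, tau, size=S)                 # theta^(s) ~ p(theta)
    y = rng.normal(xi * theta0, sigma)                    # y^(s) ~ p(y | theta^(s), xi)
    log_lik = log_normal(y, xi * theta0, sigma)           # log p(y^(s) | theta^(s), xi)
    theta_t = rng.normal(a * y[:, None], s, size=(S, T))  # theta^(s,t) ~ q(theta | y^(s))
    log_w = (log_normal(theta_t, 0.0, tau)                  # log p(theta^(s,t))
             + log_normal(y[:, None], xi * theta_t, sigma)  # + log p(y^(s) | theta^(s,t), xi)
             - log_normal(theta_t, a * y[:, None], s))      # - log q(theta^(s,t) | y^(s))
    log_marg = np.logaddexp.reduce(log_w, axis=1) - np.log(T)  # log of the importance-sampling estimate
    return np.mean(log_lik - log_marg)

print(vnmc_eig())   # close to the analytic 0.5 * log(1 + xi**2 * tau**2 / sigma**2) ≈ 0.80
```

The closer the proposal is to the true posterior, the fewer inner samples \(T\) are needed for the bias to become negligible.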

(Foster et al. 2020)

  • The above variational methods focus on estimating \(\text{EIG}_\theta(\xi)\) at a single design point.
  • The EIG is then evaluated at multiple \(\xi\) and the maximiser is found (for example using BO).
  • This is a two-stage procedure.

Recall the upper bound on the EIG used to obtain the parameters \(\phi\) of the variational distribution. It looks like this:

\[ I_{VNMC}(\xi, \phi, L) = E\left[ \log p(y | \theta^{(0)}, \xi) - \log \frac{1}{L} \sum_{l=1}^L \frac{p(y, \theta^{(l)} | \xi)}{q_{\phi}(\theta^{(l)} | y, \xi)} \right] \] How do we make this one-stage? Minimise \(I_{VNMC}\) w.r.t. \(\phi\) while simultaneously maximising it w.r.t. \(\xi\).

A less complicated alternative: find a lower bound on the EIG and maximise it w.r.t. \((\xi, \phi)\) jointly.

Adaptive contrastive estimation

Include \(\theta^{(0)}\), the sample from which \(y\) was generated, in the sum in the denominator.

\[ I_{ACE}(\xi, \phi, L) = E\left[ \log \frac{p(y | \theta^{(0)}, \xi)} {\frac{1}{L + 1} \sum_{l=0}^L \frac{p(y, \theta^{(l)} | \xi)}{q_{\phi}(\theta^{(l)} | y, \xi)}} \right] \]

Each term in the denominator, \(\frac{p(y, \theta | \xi)}{q(\theta | y, \xi)} \approx \frac{p(y, \theta | \xi)}{p(\theta | y, \xi)} = \frac{p(y, \theta | \xi)}{\frac{p(y, \theta| \xi)}{p(y | \xi)}} = p(y | \xi)\), estimates the marginal, and including \(\theta^{(0)}\) makes the denominator an “over-estimate” of \(p(y | \xi)\). Therefore \(I_{ACE}\) is a lower bound on the EIG.

\(\theta^{(1:L)}\) are called contrastive samples.

Prior contrastive estimation

  • Replace the variational distribution \(q_{\phi}(\theta^{(l)} | y, \xi)\) with the prior \(p(\theta^{(l)})\).
  • Tightness now depends only on the number of contrastive samples \(L\) (the bound becomes tight as \(L \to \infty\)).

\[ I_{PCE}(\xi, L) = E\left[ \log \frac{p(y | \theta^{(0)}, \xi)} {\frac{1}{L + 1} \sum_{l=0}^L p(y | \theta^{(l)}, \xi)} \right] \]

Estimation and optimisation: SGD (hence \(\xi\) is assumed to be continuous).
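A minimal sketch of this SGD approach for the same toy linear-Gaussian model, maximising the PCE lower bound w.r.t. a scalar design constrained to \([-1, 1]\) (the model, the constraint, and the hyper-parameters are illustrative assumptions, not from the paper):

```python
import math
import torch

torch.manual_seed(0)
sigma, tau = 1.0, 2.0                                  # toy linear-Gaussian model
raw_xi = torch.tensor(0.1, requires_grad=True)         # unconstrained design parameter
opt = torch.optim.Adam([raw_xi], lr=0.05)

def pce_bound(xi, S=256, L=31):
    """Monte Carlo estimate of the PCE lower bound I_PCE(xi, L)."""
    theta0 = tau * torch.randn(S)                      # theta^(0) ~ p(theta)
    y = xi * theta0 + sigma * torch.randn(S)           # y ~ p(y | theta^(0), xi), reparameterised
    theta_l = tau * torch.randn(L, S)                  # contrastive samples theta^(1:L) ~ p(theta)
    thetas = torch.cat([theta0.unsqueeze(0), theta_l]) # (L+1, S), includes theta^(0)
    log_lik = torch.distributions.Normal(xi * thetas, sigma).log_prob(y)   # (L+1, S)
    denom = torch.logsumexp(log_lik, dim=0) - math.log(L + 1)
    return (log_lik[0] - denom).mean()

for step in range(500):
    xi = torch.tanh(raw_xi)                            # keep the design in [-1, 1]
    loss = -pce_bound(xi)                              # ascend the lower bound w.r.t. xi
    opt.zero_grad()
    loss.backward()
    opt.step()

print(torch.tanh(raw_xi).item())   # drifts towards |xi| = 1, the most informative design here
```

Gradients flow into \(\xi\) both through the likelihood terms and through the reparameterised samples \(y\).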

(Foster et al. 2021)

  • Consider total expected information gain instead of EIG.
  • Derive prior contrastive estimation for the sequential setting (BAD).

In order to amortise the sequential design decisions, instead of estimating and optimising the EIG at each step, consider a policy network \(\pi_\phi\) which maps the experimental history \(h_{t-1} := \{ (\xi_1, y_1),..., (\xi_{t-1}, y_{t-1}) \}\) to the design at step \(t\), so that \(\xi_t = \pi_\phi(h_{t-1})\).

How to train the network? Maximise the total expected information gain over \(T\) experiments.

\[ \begin{align} I_T(\pi) &:= E_{p(\theta)p(h_T | \theta, \pi)}\left[ \sum_{t=1}^T I_{h_{t-1}}(\xi_t)\right] \\ &= E_{p(\theta)p(h_T | \theta, \pi)}[\log p(h_T| \theta, \pi) - \log p(h_T | \pi)] \end{align} \]

Observations
  • We don't need the intermediate posteriors, so computation is cheaper.
  • The policy is non-myopic: it accounts for future information gains at each step.
  • Find the maximising \(\pi\) to obtain the design network.

\(I_T(\pi)\) is intractable due to the marginal term \(p(h_T | \pi)\).

  • Use a lower bound on \(I_T(\pi)\) and maximise it w.r.t. \(\phi\) (the parameters of the design network).
  • Same idea as prior contrastive estimation, in a sequential version; see the sketch after this list.
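A minimal sketch of this training loop for the same toy linear-Gaussian model, maximising a sequential PCE-style lower bound through reparameterised rollouts. The zero-padded MLP policy here is an illustrative simplification (the paper uses a permutation-invariant history encoder), and the model and hyper-parameters are assumptions, not those of the paper:

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
sigma, tau, T_steps, L = 1.0, 2.0, 5, 31               # toy model, horizon, contrastive samples

# policy network: zero-padded history (xi_1, y_1, ..., xi_T, y_T) -> next design in [-1, 1]
policy = nn.Sequential(nn.Linear(2 * T_steps, 32), nn.ReLU(), nn.Linear(32, 1), nn.Tanh())
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def spce_bound(batch=128):
    """Sequential PCE-style lower bound on I_T(pi), estimated from reparameterised rollouts."""
    theta0 = tau * torch.randn(batch, 1)               # theta generating each rollout
    theta_l = tau * torch.randn(batch, L)              # contrastive samples theta^(1:L)
    thetas = torch.cat([theta0, theta_l], dim=1)       # (batch, L+1)
    log_lik = torch.zeros(batch, L + 1)                # accumulates log p(h_T | theta, pi)
    pairs = []                                         # history so far as (xi_t, y_t) columns
    for t in range(T_steps):
        pad = torch.zeros(batch, 2 * (T_steps - t))    # zero padding for the future steps
        xi = policy(torch.cat(pairs + [pad], dim=1))   # xi_t = pi_phi(h_{t-1}), shape (batch, 1)
        y = xi * theta0 + sigma * torch.randn(batch, 1)             # reparameterised observation
        log_lik = log_lik + torch.distributions.Normal(xi * thetas, sigma).log_prob(y)
        pairs += [xi, y]
    denom = torch.logsumexp(log_lik, dim=1) - math.log(L + 1)
    return (log_lik[:, 0] - denom).mean()

for step in range(1000):
    loss = -spce_bound()                               # ascend the bound w.r.t. the policy parameters
    opt.zero_grad()
    loss.backward()
    opt.step()
```

At deployment time the trained policy produces each \(\xi_t\) with a single forward pass, which is what makes the adaptive design decisions fast.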
Limitations
  • Requires an explicit likelihood \(p(y | \theta, \xi)\).
  • Assumes observations are conditionally independent given \(\theta\).
  • Assumes the designs \(\xi_t\) are continuous.

References

Foster, Adam, Desi R Ivanova, Ilyas Malik, and Tom Rainforth. 2021. “Deep Adaptive Design: Amortizing Sequential Bayesian Experimental Design.” In International Conference on Machine Learning, 3384–95. PMLR.
Foster, Adam, Martin Jankowiak, Eli Bingham, Paul Horsfall, Yee Whye Teh, Thomas Rainforth, and Noah Goodman. 2019. “Variational Bayesian Optimal Experimental Design.” Advances in Neural Information Processing Systems 32.
Foster, Adam, Martin Jankowiak, Matthew O’Meara, Yee Whye Teh, and Tom Rainforth. 2020. “A Unified Stochastic Gradient Approach to Designing Bayesian-Optimal Experiments.” In International Conference on Artificial Intelligence and Statistics, 2959–69. PMLR.
Rainforth, Tom, Adam Foster, Desi R Ivanova, and Freddie Bickford Smith. 2024. “Modern Bayesian Experimental Design.” Statistical Science 39 (1): 100–114.