- Bayesian Deep Learning
- Probabilistic Programming with Edward
- Dropout as a Bayesian Approximation in Deep Neural Networks
- Experiments
7 May 2017
State of the art in a wide range of applications: computer vision, speech recognition, natural language processing etc.
Neural network architectures: Feedforward, Recurrent, Convolutional…
Deeper neural networks and massive datasets: stochastic gradient descent, backpropagation, dropout
Deep Reinforcement Learning: represent Q-value functions, which estimate quality of an agent's next action, as deep neural networks
Aim: Use probability theory to express all forms of uncertainty
Uncertainty estimates needed in forecasting, decision making, learning from limited and noisy data
AI Safety: medicine, engineering, finance, etc.
The probabilistic approach provides a way to reason about uncertainty and learn from data
A Bayesian neural network is a neural network with a prior distribution on the weights
Data \(\mathcal{D} = \{\mathbf{X}_i, y_i \}_{i=1}^{N}\) for input data point \(\mathbf{X}_i \in \mathbb{R}^d\) and output \(y_i \in \mathbb{R}\).
Parameters: network weights \(\mathbf{w}\) and biases \(b\).
For Bayesian neural networks normal priors are often used: \(p(\mathbf{w}) = \mathcal{N}(\mu, \Sigma)\)
Posterior: \(p(\mathbf{w} \vert \mathbf{X},y) = \frac{\displaystyle{ p(y \vert \mathbf{X}, \mathbf{w} )p(\mathbf{w}) }} { \displaystyle{p(y\vert \mathbf{X})} }\)
Predictive distribution of output \(y^*\) given a new input \(\mathbf{x}^*\): \[p(y^* \vert \mathbf{x}^*, \mathbf{X}, y) = \int p(y^* \vert \mathbf{x}^*, \mathbf{w})p(\mathbf{w} \vert \mathbf{X}, y) d\mathbf{w}\]
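The predictive integral is rarely tractable for a neural network, but it can be approximated by sampling weights from the posterior and averaging. The NumPy sketch below illustrates this on a one-weight linear model, chosen only because its posterior and predictive are available in closed form for comparison. The toy data, the noise level `sigma` and the unit-variance prior are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data from y = 2x + noise; model is y ~ N(w * x, sigma^2) with prior w ~ N(0, 1).
sigma = 0.5
x = rng.uniform(-1.0, 1.0, size=50)
y = 2.0 * x + sigma * rng.normal(size=50)

# Conjugate Gaussian posterior p(w | X, y) = N(post_mean, post_var).
post_var = 1.0 / (1.0 + np.sum(x ** 2) / sigma ** 2)
post_mean = post_var * np.sum(x * y) / sigma ** 2

# Monte Carlo approximation of the predictive integral at a new input x_star:
# draw w_s ~ p(w | X, y), then y_s ~ p(y* | x*, w_s).
x_star = 0.8
w_samples = rng.normal(post_mean, np.sqrt(post_var), size=20000)
y_samples = rng.normal(w_samples * x_star, sigma)
mc_mean, mc_var = y_samples.mean(), y_samples.var()

# Closed-form predictive for comparison: N(post_mean * x*, sigma^2 + x*^2 * post_var).
exact_mean = post_mean * x_star
exact_var = sigma ** 2 + x_star ** 2 * post_var
```

For a Bayesian neural network the same two sampling steps apply, but the posterior samples have to come from an approximate inference method.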
NIPS 2016: bayesiandeeplearning.org
Practical Variational Inference for Neural Networks, A Graves, NIPS 2011.
Auto-Encoding Variational Bayes, D P. Kingma, M Welling, CoRR abs/1312.6114, 2013.
Stochastic Backpropagation and Approximate Inference in Deep Generative Models, D Rezende, S Mohamed, and D Wierstra, ICML 2014.
Weight uncertainty in neural networks, C Blundell, J Cornebise, K Kavukcuoglu, and D Wierstra, ICML 2015.
Probabilistic backpropagation for scalable learning of Bayesian neural networks, J M Hernandez-Lobato, R P Adams, ICML 2015
See Yarin Gal's PhD thesis: Uncertainty in Deep Learning, University of Cambridge (2016)
Represent probabilistic models as programs that generate samples
Automate inference of unobserved variables in the model conditioned on observed program output
Probabilistic Programming Languages (PPLs): Church, Venture, Anglican, Figaro, WebPPL, Stan, PyMC3, Edward etc.
Many PPLs treat inference as a black box abstracted away from model
Separation of model and inference: harder to represent some recent advances, e.g. Generative Adversarial Networks
"All sufficiently complex PPLs eventually reimplement probabilistic LISP."
Daniel Roy, University of Toronto
Edward is a thin platform over TensorFlow
Edward = TensorFlow + Random Variables
Edward = TensorFlow + Random Variables + Inference Algorithms
TensorFlow is a computational graph framework
Graph nodes are operations: arithmetic, looping, indexing etc.
Graph edges are tensors representing data communicated between nodes
Do everything in the computational graph: scale, distribute, GPU etc.
Edward's Random Variables are class objects with methods: log_prob(), sample(), summary statistics etc.
Edward's Inference algorithms define a TensorFlow computational graph
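To make the random variable interface concrete, here is a minimal NumPy sketch of a class exposing a log density, a sampler and summary statistics. `ToyNormal` is a hypothetical stand-in written for illustration; it is not Edward's class and does not build a TensorFlow graph.

```python
import math
import numpy as np


class ToyNormal:
    """Hypothetical stand-in for a Normal random variable; not Edward's class."""

    def __init__(self, loc, scale, seed=0):
        self.loc = np.asarray(loc, dtype=float)
        self.scale = np.asarray(scale, dtype=float)
        self._rng = np.random.default_rng(seed)

    def log_prob(self, value):
        # Elementwise log density of N(loc, scale^2) evaluated at value.
        z = (np.asarray(value, dtype=float) - self.loc) / self.scale
        return -0.5 * z ** 2 - np.log(self.scale) - 0.5 * math.log(2.0 * math.pi)

    def sample(self, n=1):
        # Draw n independent samples, each with the shape of loc.
        return self._rng.normal(self.loc, self.scale, size=(n,) + self.loc.shape)

    def mean(self):
        return self.loc

    def stddev(self):
        return self.scale


w = ToyNormal(loc=np.zeros(3), scale=np.ones(3))
draws = w.sample(5)           # array of shape (5, 3)
lp = w.log_prob(np.zeros(3))  # log density at the mode, one value per element
```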
Inference can be written as a collection of separate inference programs
Leverage TensorFlow optimisers, e.g. ADAM for Variational Inference; gradient structure for Hamiltonian Monte Carlo
Allows composition of Models and of Inference methods for greater expressivity
Allows data subsampling for greater scalability
George Edward Pelham Box's loop as a Design Principle
1. Build a model of the science
2. Infer the model given data
3. Criticize the model given data (Posterior Predictive Checks)
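```python
import tensorflow as tf


def neural_network(x, W_0, W_1, W_2, b_0, b_1, b_2):
    # Two tanh hidden layers followed by a linear output layer.
    h = tf.tanh(tf.matmul(x, W_0) + b_0)
    h = tf.tanh(tf.matmul(h, W_1) + b_1)
    h = tf.matmul(h, W_2) + b_2
    # Flatten the [N, 1] output into a vector of N predictions.
    return tf.reshape(h, [-1])
```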
```python
from edward.models import Normal

# Standard normal priors on all weights and biases.
# D (input dimension) and n_hidden (hidden units) must be defined beforehand.
W_0 = Normal(loc=tf.zeros([D, n_hidden]), scale=tf.ones([D, n_hidden]))
W_1 = Normal(loc=tf.zeros([n_hidden, n_hidden]),
             scale=tf.ones([n_hidden, n_hidden]))
W_2 = Normal(loc=tf.zeros([n_hidden, 1]), scale=tf.ones([n_hidden, 1]))
b_0 = Normal(loc=tf.zeros(n_hidden), scale=tf.ones(n_hidden))
b_1 = Normal(loc=tf.zeros(n_hidden), scale=tf.ones(n_hidden))
b_2 = Normal(loc=tf.zeros(1), scale=tf.ones(1))
```
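Placing standard normal priors on the weights defines a prior over functions: each draw of the weights gives one function. The sketch below reproduces the same two-hidden-layer network in plain NumPy and evaluates it for a few prior draws. The sizes `D = 1` and `n_hidden = 10`, the input grid and the helper `sample_prior` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
D, n_hidden = 1, 10  # assumed sizes; the slides do not fix them


def neural_network(x, W_0, W_1, W_2, b_0, b_1, b_2):
    # NumPy version of the two-hidden-layer tanh network.
    h = np.tanh(x @ W_0 + b_0)
    h = np.tanh(h @ W_1 + b_1)
    h = h @ W_2 + b_2
    return h.reshape(-1)


def sample_prior():
    # One draw of every weight and bias from its N(0, 1) prior (hypothetical helper).
    return dict(
        W_0=rng.normal(size=(D, n_hidden)),
        W_1=rng.normal(size=(n_hidden, n_hidden)),
        W_2=rng.normal(size=(n_hidden, 1)),
        b_0=rng.normal(size=n_hidden),
        b_1=rng.normal(size=n_hidden),
        b_2=rng.normal(size=1),
    )


x = np.linspace(-3.0, 3.0, 50).reshape(-1, D)
# Each row is the network output on the grid for one prior draw of the weights.
prior_draws = np.stack([neural_network(x, **sample_prior()) for _ in range(5)])
```

Plotting the rows of `prior_draws` against `x` is a quick way to check that the prior produces plausible functions before running inference.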
Let x and y be observed random variables, and x_train and y_train be actually observed data
Let w_0, w_1, b_0 and b_1 be latent variables in the model
Let qw_0, qw_1, qb_0 and qb_1 be random variables defined to approximate the posterior
Latent variables bound to their posterior approximations: {w_0: qw_0, b_0: qb_0, w_1: qw_1, b_1: qb_1}
Observed random variables bound to data: {x: x_train, y: y_train}
inference = ed.Inference(latent_vars={w_0: qw_0, b_0: qb_0, w_1: qw_1, b_1: qb_1}, data={x: x_train, y: y_train})
Approximate the posterior \(p(\mathbf{w} \vert \mathbf{X, Y})\) with a simpler distribution from a variational family of distributions \(q_{\theta}(\mathbf{w})\) parameterised by \(\theta\)
Minimise the Kullback-Leibler (KL) divergence between approximate posterior \(q_{\theta}(\mathbf{w})\) and the actual posterior \(p(\mathbf{w} \vert \mathbf{X, Y})\) w.r.t. \(\theta\)
Mean Field approximation: factorise the approximating posterior \(q_{\theta}(\mathbf{w})\) into independent distributions for each weight layer
\[\text{ELBO} = \underbrace{\mathbb{E}_{q_{\theta}}[\log p(\mathbf{x} \vert \mathbf{z})]}_{\text{Expected Log Likelihood}} - \underbrace{ \text{KL}( q_{\theta}(\mathbf{z}) \| p(\mathbf{z})) }_{\text{Prior KL}}\]
Edward optimises this objective (maximising the ELBO is equivalent to minimising the KL divergence to the posterior) by automatically selecting from a variety of black box inference techniques
Edward applies variance reduction to the Monte Carlo gradient estimator
Edward computes the Prior KL automatically when \(p(\mathbf{z})\) and \(q_{\theta}(\mathbf{z})\) are Normal
Allows Black Box Variational Inference to be implemented in a wide variety of models
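The two ELBO terms can be computed directly for a small conjugate model, which makes the bound easy to check. The NumPy sketch below estimates the expected log likelihood by Monte Carlo and uses the closed-form KL divergence between two Normals. The toy model (z ~ N(0, 1), x_i | z ~ N(z, 1)), the sample size and all function names are assumptions for illustration; this is not how Edward implements inference internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy model: z ~ N(0, 1), x_i | z ~ N(z, 1) for i = 1..n.
n = 20
x = rng.normal(1.0, 1.0, size=n)


def log_likelihood(z):
    # log p(x | z) for an array of z samples with shape (S,); returns shape (S,).
    diff = x[None, :] - z[:, None]
    return np.sum(-0.5 * diff ** 2 - 0.5 * np.log(2.0 * np.pi), axis=1)


def kl_normal_to_std_normal(m, s):
    # Closed-form KL( N(m, s^2) || N(0, 1) ).
    return 0.5 * (s ** 2 + m ** 2 - 1.0 - 2.0 * np.log(s))


def elbo(m, s, num_samples=50000):
    # Monte Carlo expected log likelihood minus the analytic prior KL.
    z = rng.normal(m, s, size=num_samples)  # z ~ q_theta with theta = (m, s)
    return log_likelihood(z).mean() - kl_normal_to_std_normal(m, s)


# Exact posterior and log evidence are available for this conjugate model.
post_mean = x.sum() / (n + 1.0)
post_std = np.sqrt(1.0 / (n + 1.0))
log_evidence = -0.5 * (n * np.log(2.0 * np.pi) + np.log(n + 1.0)
                       + np.sum(x ** 2) - x.sum() ** 2 / (n + 1.0))

elbo_at_posterior = elbo(post_mean, post_std)  # bound is tight here
elbo_at_prior = elbo(0.0, 1.0)                 # strictly below log evidence
```

When q equals the exact posterior the ELBO matches the log evidence up to Monte Carlo error; any other choice of (m, s) gives a lower value.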
Deep neural networks: many parameters; very prone to overfitting; require large data sets
Dropout is an empirical technique used to avoid overfitting in neural networks
Dropout multiplies hidden activations by Bernoulli distributed random variables which take the value 1 with probability \(p\) and 0 otherwise
Randomly "drop out" hidden units and their connections during training time to prevent hidden units from co-adapting too much
Often yields significant improvements over other regularisation methods
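The Bernoulli masking described above can be sketched in a few lines of NumPy. The function name `dropout` and the array sizes are assumptions for illustration. Many practical implementations also rescale the kept activations by 1/p; that step is left out here to match the plain description.

```python
import numpy as np

rng = np.random.default_rng(0)


def dropout(h, p, rng):
    # Multiply activations elementwise by Bernoulli(p) variables: 1 keeps, 0 drops.
    mask = rng.binomial(1, p, size=h.shape)
    return h * mask, mask


h = np.ones((1000, 50))                    # stand-in hidden activations
h_dropped, mask = dropout(h, p=0.8, rng=rng)
keep_fraction = mask.mean()                # close to p for a large array
```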
\[\mathcal{L}_{\text{dropout}} = \underbrace{ \frac{1}{N} \sum_{i=1}^{N} E(\mathbf{y}_{i}, \mathbf{\hat{y}}_{i})}_\text{Loss function} + \underbrace{ \lambda \sum_{i=1}^{L} (\|\mathbf{W}_{i} \|^2 + \| \mathbf{b}_{i} \|^2 )}_{L_2 \text{ regularisation with weight decay} \lambda}\]
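As a worked instance of this objective, the sketch below evaluates it on tiny arrays. The slide leaves the loss \(E\) unspecified, so squared error is assumed here; the function name `dropout_objective` and the example values are also assumptions.

```python
import numpy as np


def dropout_objective(y, y_hat, weights, biases, lam):
    # Data term: average over N points of the (assumed) squared-error loss E.
    data_term = np.mean(np.sum((y - y_hat) ** 2, axis=-1))
    # L2 term: squared norms of every layer's weight matrix and bias vector.
    l2_term = sum(np.sum(W ** 2) + np.sum(b ** 2) for W, b in zip(weights, biases))
    return data_term + lam * l2_term


y = np.array([[1.0], [2.0]])
y_hat = np.array([[0.0], [2.0]])   # predictions from one dropout forward pass
weights = [np.ones((2, 2))]        # squared norm 4
biases = [np.ones(2)]              # squared norm 2
loss = dropout_objective(y, y_hat, weights, biases, lam=0.1)
# data term 0.5 plus 0.1 * (4 + 2) gives 1.1
```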
Gal & Ghahramani: Reparameterise the approximate variational distribution \(q_{\theta}(\mathbf{w})\) to be non-Gaussian (Bernoulli)
Factorise the distribution for each neural network weight row in each weight matrix
Gal and Ghahramani's method (MC Dropout) requires applying dropout at every weight layer at test time
For input \(\mathbf{x}^{*}\) the predictive distribution for output \(\mathbf{y}^*\) is \[q(\mathbf{y}^{*}\vert \mathbf{x}^{*}) = \int p(\mathbf{y}^{*}\vert \mathbf{x}^{*}, \mathbf{w}) q(\mathbf{w} \vert \mathbf{X, Y} ) d\mathbf{w}\]
MC Dropout averages over \(T\) forward passes through the network at test time
To estimate predictive mean and predictive uncertainty:
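One common way to write these estimators is as the sample mean and sample variance of the \(T\) stochastic forward passes. The NumPy sketch below assumes a toy network with fixed random weights standing in for a trained model, a single dropout mask on the hidden units for brevity, and a keep probability of 0.9. It also leaves out the observation-noise term that is usually added to the predictive variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy network with fixed weights standing in for a trained model.
W_0 = rng.normal(size=(1, 50))
b_0 = np.zeros(50)
W_1 = rng.normal(size=(50, 1)) / np.sqrt(50)
b_1 = np.zeros(1)


def stochastic_forward(x, p=0.9):
    # One forward pass with a fresh Bernoulli(p) dropout mask on the hidden units.
    h = np.tanh(x @ W_0 + b_0)
    h = h * rng.binomial(1, p, size=h.shape)
    return (h @ W_1 + b_1).reshape(-1)


def mc_dropout_predict(x, T=200):
    # Stack T stochastic passes, then summarise them per input point.
    passes = np.stack([stochastic_forward(x) for _ in range(T)])  # shape (T, N)
    return passes.mean(axis=0), passes.var(axis=0)


x_star = np.linspace(-2.0, 2.0, 5).reshape(-1, 1)
pred_mean, pred_var = mc_dropout_predict(x_star)
```

The spread across passes is what carries the uncertainty information: inputs where the masked networks disagree get a larger `pred_var`.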
Highly competitive results for
Convolutional Neural Networks: Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference, Y Gal, Z Ghahramani, ICLR 2016
Recurrent Neural Networks: A Theoretically Grounded Application of Dropout in Recurrent Neural Networks, Y Gal, Z Ghahramani, NIPS 2016
Reinforcement Learning: Improving PILCO with Bayesian Neural Network Dynamics Models, Y Gal, R McAllister, C Rasmussen, ICML 2016
Feedforward Neural Networks with 2 hidden layers and 50 units per layer
Same benchmark data sets and architecture as Hernandez-Lobato & Adams (2015) and Gal & Ghahramani (2016)
Yarin Gal's MC Dropout: https://github.com/yaringal/DropoutUncertaintyExps
Harvard Intelligent Probabilistic Systems Group: https://github.com/HIPS/Probabilistic-Backpropagation (Hernandez-Lobato & Adams):
Uses Keras (Chollet, 2015) and Theano (Bergstra et al, 2010)
Black Box Variational Inference: aims for maximum flexibility and model independence
Edward uses mean field VI: factorises the approximating posterior \(q(\mathbf{w})\) into independent distributions for each latent variable: \(q(\mathbf{w}) = \prod_{i=1}^{l}q_i(w_i)\)
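Two practical consequences of the mean field factorisation are that the log density is a sum of per-variable terms and that the approximation cannot represent correlations between latent variables. A minimal NumPy sketch, assuming Normal factors with made-up means and standard deviations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed variational parameters: one (mean, std) pair per latent variable w_i.
means = np.array([0.5, -1.0, 2.0])
stds = np.array([0.1, 0.3, 0.2])


def log_q_factor(w_i, m, s):
    # Log density of a single Normal factor q_i(w_i).
    return -0.5 * ((w_i - m) / s) ** 2 - np.log(s) - 0.5 * np.log(2.0 * np.pi)


def log_q(w):
    # Under mean field, log q(w) is the sum of the per-variable log densities.
    return np.sum(log_q_factor(w, means, stds))


# Sampling is independent per variable, so the covariance of q is diagonal.
samples = rng.normal(means, stds, size=(100000, 3))
cov = np.cov(samples, rowvar=False)
```

The near-zero off-diagonal entries of `cov` show what the factorisation gives up: any posterior correlation between weights is ignored.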
MC Dropout: joint distribution over each layer's weights is multimodal (mixtures of delta peaks)
For Bayesian Neural Networks, including correlations between layers improves results: see Louizos & Welling (2016)
Structured and Efficient Variational Deep Learning with Matrix Posteriors - C Louizos & M Welling. arXiv:1603.04733 (2016)
Multiplicative Normalizing Flows for Variational Bayesian Neural Networks - C Louizos & M Welling. arXiv:1703.01961 (2017)
Hierarchical variational models, R. Ranganath, D. Tran, and D. M. Blei. ICML 2016.
Black box Variational Inference. R. Ranganath, S. Gerrish, and D. M. Blei. Artificial Intelligence and Statistics, 2014.
Variational inference with normalizing flows, D J Rezende, S Mohamed, ICML 2015.
Reparameterization Gradients through Acceptance-Rejection Sampling Algorithms - C Naesseth, F Ruiz, S Linderman, D M Blei (2017)
Importance Sampled Stochastic Optimization for Variational Inference, J Sakaya, A Klami, arXiv:1704.05786 (2017)
Sticking the Landing: An Asymptotically Zero-Variance Gradient Estimator for Variational Inference, G Roeder, Y Wu, D Duvenaud (2017)
Edward: A library for probabilistic modeling, inference, and criticism, D Tran, A Kucukelbir, A B Dieng, M Rudolph, D Liang, and D M Blei. arXiv:1610.09787.
Deep Probabilistic Programming. D Tran, M Hoffman, R A Saurous, E Brevdo, K Murphy, and D M Blei, ICLR 2017
Practical Variational Inference for Neural Networks, A Graves, NIPS 2011.
Auto-Encoding Variational Bayes, D P. Kingma, M Welling, CoRR abs/1312.6114, 2013.
Stochastic Backpropagation and Approximate Inference in Deep Generative Models, D J Rezende, S Mohamed, and D Wierstra, ICML 2014.
Weight uncertainty in neural networks, C Blundell, J Cornebise, K Kavukcuoglu, and D Wierstra, ICML 2015.
Probabilistic backpropagation for scalable learning of Bayesian neural networks, J M Hernandez-Lobato, R P Adams, ICML 2015
Uncertainty in Deep Learning, Y Gal, PhD Thesis, University of Cambridge (2016)
Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference, Y Gal, Z Ghahramani, ICLR 2016
A Theoretically Grounded Application of Dropout in Recurrent Neural Networks, Y Gal, Z Ghahramani, NIPS 2016