class: center, middle, inverse, title-slide

# tfprobably correct: Adding uncertainty to deep learning with TensorFlow Probability
### Sigrid Keydana
### WhyR 2019

---
class: inverse, middle, center

# Neural networks are awesome ...

---
# Deep learning yields excellent results in e.g.

<br />

- Image recognition (classification, segmentation, object detection)
- Natural language processing (language models, translation, generation)
- Domain modeling (embeddings)
- Entity generation ("fakes"): Faces, animals, scenes...
- Tasks involving tabular data (e.g., recommender systems)
- Playing games (deep reinforcement learning)

---
class: inverse, middle, center

# So what are we missing?

---
# Networks output _point estimates_

<br />

- A single numeric prediction
- A single segmentation mask
- A single translation
- A single embedding

<br />

But wait ... aren't there probabilities in there, _somewhere_?

---
# Let's see an example!

```r
library(magick)

# Attribution: Ben Tubby [CC BY 2.0 (https://creativecommons.org/licenses/by/2.0)]
# https://upload.wikimedia.org/wikipedia/commons/3/30/Falkland_Islands_Penguins_35.jpg
path <- "penguin1.jpg"
image_read(path) %>% image_resize("140")
```

<img src="https://raw.githubusercontent.com/skeydan/whyR2019/master/p1.png" />

---
# Let's ask the network ...

```r
library(tensorflow)
library(keras)

model <- application_mobilenet_v2()
image <- # do some preprocessing ...
probs <- model %>% predict(image)
imagenet_decode_predictions(probs)
```

```
  class_name class_description        score
1  n02056570      king_penguin 0.9899597168
2  n01847000             drake 0.0011793933
3  n01798484   prairie_chicken 0.0002387235
4  n02058221         albatross 0.0002117234
5  n02071294      killer_whale 0.0001432021
```

---
# Let's do another one

```r
# Attribution: M. Murphy [Public domain]
# https://upload.wikimedia.org/wikipedia/commons/2/22/RoyalPenguins3.JPG
path <- "penguin2.jpg"
image_read(path) %>% image_resize("180")
```

<img src="https://raw.githubusercontent.com/skeydan/whyR2019/master/p2.png"/>

---
# So?

```r
probs <- model %>% predict(image)
imagenet_decode_predictions(probs)
```

<br />

```
  class_name class_description      score
1  n02051845           pelican 0.24614049
2  n02009912    American_egret 0.18564136
3  n02058221         albatross 0.06848499
4  n02012849             crane 0.04572001
5  n02009229 little_blue_heron 0.03902744
```

---
# Stepping back: What's a neural network?
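Conceptually, each layer applies an affine transform followed by a nonlinearity. A two-line sketch in plain R (the names here are purely illustrative):

```r
relu  <- function(x) pmax(x, 0)
layer <- function(x, W, b) relu(W %*% x + b)  # one layer: activation(W x + b)
```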
<img src="https://raw.githubusercontent.com/skeydan/whyR2019/master/playground.png" width = "70%"/>

---
# Zooming in: Weights and activations

<figure>
  <img src="https://raw.githubusercontent.com/skeydan/whyR2019/master/perceptron.png" width = "60%">
  <figcaption>Source: https://en.wikipedia.org/wiki/Perceptron</figcaption>
</figure>

---
# Regression with Keras

```r
model <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = "relu", input_shape = 7) %>%
  # default activation = linear
  layer_dense(units = 1)
```

<figure>
  <img src="https://raw.githubusercontent.com/skeydan/whyR2019/master/relu.png" width = "60%">
  <figcaption><b>ReLU</b> activation</figcaption>
</figure>

---
# (Binary) Classification with Keras

```r
model <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = "relu", input_shape = 7) %>%
  layer_dense(units = 1, activation = "sigmoid")
```

<figure>
  <img src="https://raw.githubusercontent.com/skeydan/whyR2019/master/sigmoid.png" width = "60%">
  <figcaption><b>Sigmoid</b> activation</figcaption>
</figure>

---
# (Multi-class) Classification with Keras

```r
model <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = "relu", input_shape = 7) %>%
  layer_dense(units = 10, activation = "softmax")
```

<figure>
  <img src="https://raw.githubusercontent.com/skeydan/whyR2019/master/softmax_pre.png" style = "float:left;" width = "50%">
</figure>
<figure>
  <img src="https://raw.githubusercontent.com/skeydan/whyR2019/master/softmax_post.png" style = "float:right;" width = "50%">
</figure>

Before and after __softmax__ activation.

---
# So our "probability" is just a maximum likelihood estimate ...

<br />

... how can we make this probabilistic?

---
class: inverse, middle, center

# Bayesian neural networks

---
# Put distributions over the network's weights

<figure>
  <img src="https://raw.githubusercontent.com/skeydan/whyR2019/master/weight_uncertainty.png" width = "120%">
  <figcaption>Blundell et al. (2015): Weight Uncertainty in Neural Networks</figcaption>
</figure>

---
# Yeah, but how?

<br />

[TensorFlow Probability](https://www.tensorflow.org/probability/) (Python library on top of TensorFlow)

[tfprobability](https://rstudio.github.io/tfprobability/) (R package)

<br />

```r
devtools::install_github("rstudio/tfprobability")
library(tfprobability)
install_tfprobability()
```

---
# tfprobability (1): Basic building blocks

#### Distributions

```r
d <- tfd_binomial(total_count = 7, probs = 0.3)

d %>% tfd_mean()
#> tf.Tensor(2.1000001, shape=(), dtype=float32)
d %>% tfd_variance()
#> tf.Tensor(1.47, shape=(), dtype=float32)
d %>% tfd_log_prob(2.3)
#> tf.Tensor(-1.1914139, shape=(), dtype=float32)
```

#### Bijectors

```r
b <- tfb_affine_scalar(shift = 3.33, scale = 0.5)

x <- c(100, 1000, 10000)
b %>% tfb_forward(x)
#> tf.Tensor([  53.33  503.33 5003.33], shape=(3,), dtype=float32)
```

---
# tfprobability (2): Higher-level modules

<br />

- Keras layers
- Markov Chain Monte Carlo (Hamiltonian Monte Carlo, NUTS)
- Variational inference
- State space models
- GLMs

---
class: inverse, middle, center

# How does that help us?
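---
# Aside: the building blocks compose

Distributions and bijectors work together: transforming a standard normal with `tfb_exp()`, for instance, yields a log-normal. A minimal sketch (not run here; outputs depend on the random draws):

```r
library(tfprobability)

d  <- tfd_normal(loc = 0, scale = 1)
ln <- tfd_transformed_distribution(distribution = d, bijector = tfb_exp())

ln %>% tfd_sample(3)      # three draws from the resulting log-normal
ln %>% tfd_log_prob(1)    # log-density at 1
```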
---
# With TFP, neural network layers can be _distributions_

<br />

A network that has a multivariate normal distribution as output:

```r
model <- keras_model_sequential() %>%
  layer_dense(units = params_size_multivariate_normal_tri_l(d)) %>%
  layer_multivariate_normal_tri_l(event_size = d)

log_loss <- function (y, model) - (model %>% tfd_log_prob(y))

model %>% compile(optimizer = "adam", loss = log_loss)

model %>% fit(
  x, y,
  batch_size = 100,
  epochs = 1,
  steps_per_epoch = 10
)
```

---
# More distribution layers

<figure>
  <img src="https://raw.githubusercontent.com/skeydan/whyR2019/master/distribution_layers.png" width = "100%">
  <figcaption>Complete list: <a href="https://rstudio.github.io/tfprobability/reference/index.html#section-keras-layers-distribution-layers">Distribution layers</a></figcaption>
</figure>

---
# Outputting a distribution: Aleatoric uncertainty

Instead of a single unit (of a _dense_ layer), we output a __normal distribution__:

```r
model <- keras_model_sequential() %>%
  layer_dense(units = 8, activation = "relu") %>%
  layer_dense(units = 2, activation = "linear") %>%
  layer_distribution_lambda(function(x)
    tfd_normal(loc = x[, 1, drop = FALSE],
               scale = 1e-3 + tf$math$softplus(x[, 2, drop = FALSE])
    )
  )

negloglik <- function(y, model) - (model %>% tfd_log_prob(y))

model %>% compile(
  optimizer = optimizer_adam(lr = 0.01),
  loss = negloglik
)

model %>% fit(x, y, epochs = 1000)
```

---
# Aleatoric uncertainty ~ learned spread in the data

```r
yhat <- model(tf$constant(x_test))
mean <- yhat %>% tfd_mean()
sd <- yhat %>% tfd_stddev()
```

<figure>
  <img src="https://raw.githubusercontent.com/skeydan/whyR2019/master/aleatoric.png" width = "80%">
</figure>

---
# Placing a distribution over the weights: Epistemic uncertainty

```r
model <- keras_model_sequential() %>%
  layer_dense_variational(
    units = 1,
    make_posterior_fn = posterior_mean_field,
    make_prior_fn = prior_trainable,
    kl_weight = 1 / n
  ) %>%
  layer_distribution_lambda(
    function(x) tfd_normal(loc = x, scale = 1)
  )

negloglik <- function(y, model) - (model %>% tfd_log_prob(y))

model %>% compile(
  optimizer = optimizer_adam(lr = 0.1),
  loss = negloglik
)

model %>% fit(x, y, epochs = 1000)
```

---
# Epistemic uncertainty ~ model uncertainty

Every prediction uses a different _sample from the weight distributions_!

```r
yhats <- purrr::map(1:100, function(x) model(tf$constant(x_test)))
```

<figure>
  <img src="https://raw.githubusercontent.com/skeydan/whyR2019/master/epistemic.png" width = "80%">
</figure>

---
# Posterior weights are computed using variational inference

<figure>
  <img src="https://raw.githubusercontent.com/skeydan/whyR2019/master/blei_vi.png" width = "80%">
  <figcaption>Source: <a href="https://media.nips.cc/Conferences/2016/Slides/6199-Slides.pdf">David Blei, Rajesh Ranganath, Shakir Mohamed: Variational Inference: Foundations and Modern Methods. NIPS 2016 Tutorial, December 5, 2016.</a></figcaption>
</figure>

---
# Epistemic and aleatoric uncertainty in one model

```r
model <- keras_model_sequential() %>%
  layer_dense_variational(
    units = 2,
    # posterior and prior as on the previous slide
    make_posterior_fn = posterior_mean_field,
    make_prior_fn = prior_trainable,
    kl_weight = 1 / n
  ) %>%
  layer_distribution_lambda(function(x)
    tfd_normal(loc = x[, 1, drop = FALSE],
               scale = 1e-3 + tf$math$softplus(0.01 * x[, 2, drop = FALSE])
    )
  )

yhats <- purrr::map(1:100, function(x) model(tf$constant(x_test)))

means <- purrr::map(yhats, purrr::compose(as.matrix, tfd_mean)) %>% abind::abind()
sds <- purrr::map(yhats, purrr::compose(as.matrix, tfd_stddev)) %>% abind::abind()
```

---
# Main challenge now is how to display ...
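One way to get the draws into plottable shape: a sketch, assuming `means` and `sds` are the matrices built on the previous slide (one column per posterior draw) and `x_test` is a numeric vector.

```r
library(ggplot2)

# long format: one row per (test point, posterior draw)
df <- data.frame(
  x    = rep(as.vector(x_test), times = ncol(means)),
  mean = as.vector(means),
  sd   = as.vector(sds),
  draw = rep(seq_len(ncol(means)), each = nrow(means))
)

ggplot(df, aes(x, mean, group = draw)) +
  geom_line(alpha = 0.3) +                                    # epistemic: spread of the lines
  geom_ribbon(aes(ymin = mean - 2 * sd, ymax = mean + 2 * sd),
              alpha = 0.02)                                   # aleatoric: width of each ribbon
```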
Each line is one draw from the posterior weights; each line has its own standard deviation.

<figure>
  <img src="https://raw.githubusercontent.com/skeydan/whyR2019/master/both.png" width = "70%">
</figure>

More background: https://blogs.rstudio.com/tensorflow/posts/2019-06-05-uncertainty-estimates-tfprobability/

---
class: inverse, middle, center

# Other ways of modeling uncertainty with TFP

---
# Variational autoencoders: Informative latent codes

<figure>
  <img src="https://raw.githubusercontent.com/skeydan/whyR2019/master/vae.png" width = "80%">
  <figcaption>Source: https://lilianweng.github.io/lil-log/2018/08/12/from-autoencoder-to-beta-vae.html</figcaption>
</figure>

---
# Variational autoencoders with tfprobability

```r
encoder_model <- keras_model_sequential() %>%
  [...] %>%
  layer_multivariate_normal_tri_l(event_size = encoded_size) %>%
  # pass in the prior of your choice ...
  # can use exact KL divergence or Monte Carlo approximation
  layer_kl_divergence_add_loss([...])

decoder_model <- keras_model_sequential() %>%
  [...] %>%
  layer_independent_bernoulli([...])

vae_model <- keras_model(inputs = encoder_model$inputs,
                         outputs = decoder_model(encoder_model$outputs[1]))

vae_loss <- function (x, rv_x) - (rv_x %>% tfd_log_prob(x))
```

---
# Monte Carlo approximations with tfprobability

- used internally by TFP to perform approximate inference
- `mcmc_*` kernels available to the user:

<figure>
  <img src="https://raw.githubusercontent.com/skeydan/whyR2019/master/mcmc.png" width = "105%">
</figure>

---
# Example: Modeling tadpole mortality (1)

https://blogs.rstudio.com/tensorflow/posts/2019-05-06-tadpoles-on-tensorflow/

Define a joint probability distribution:

```r
m2 <- tfd_joint_distribution_sequential(
  list(
    tfd_normal(loc = 0, scale = 1.5),
    tfd_exponential(rate = 1),
    function(sigma, a_bar)
      tfd_sample_distribution(
        tfd_normal(loc = a_bar, scale = sigma),
        sample_shape = list(n_tadpole_tanks)
      ),
    function(l)
      tfd_independent(
        tfd_binomial(total_count = n_start, logits = l),
        reinterpreted_batch_ndims = 1
      )
  ))
```

---
# Example: Modeling tadpole mortality (2)

Define optimization target (loss) and kernel (algorithm):

```r
logprob <- function(a, s, l)
  m2 %>% tfd_log_prob(list(a, s, l, n_surviving))

hmc <- mcmc_hamiltonian_monte_carlo(
  target_log_prob_fn = logprob,
  num_leapfrog_steps = 3,
  step_size = 0.1
) %>%
  mcmc_simple_step_size_adaptation(
    target_accept_prob = 0.8,
    num_adaptation_steps = n_burnin
  )
```

---
# Example: Modeling tadpole mortality (3)

<br />

Get starting values and sample:

```r
c(initial_a, initial_s, initial_logits, .) %<-% (m2 %>% tfd_sample(n_chain))

run_mcmc <- function(kernel) {
  kernel %>% mcmc_sample_chain(
    num_results = n_steps,
    num_burnin_steps = n_burnin,
    current_state = list(initial_a, tf$ones_like(initial_s), initial_logits)
  )
}

res <- hmc %>% run_mcmc()
```

---
# State space models

Use variational inference or (Hamiltonian) Monte Carlo to

- decompose
- filter (as in: Kálmán filter)
- smooth
- forecast

_dynamic linear models_.
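A minimal sketch of specifying and fitting such a model with the `sts_` family (illustrative only: `y` stands for a univariate time series, and the component choice and number of HMC draws are arbitrary):

```r
library(tfprobability)

trend  <- sts_local_linear_trend(observed_time_series = y)
season <- sts_seasonal(observed_time_series = y, num_seasons = 12)
model  <- sts_sum(observed_time_series = y, components = list(trend, season))

# sample from the parameter posterior via Hamiltonian Monte Carlo
fit <- model %>% sts_fit_with_hmc(y, num_results = 50)
```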
Dynamic regression example: https://blogs.rstudio.com/tensorflow/posts/2019-06-25-dynamic_linear_models_tfprobability/

---
# Dynamic linear models: Filtering example

<figure>
  <img src="https://raw.githubusercontent.com/skeydan/whyR2019/master/capm_filtered.png" width = "100%">
</figure>

---
# Wrapping up

<br />

- _tfprobability_ (TensorFlow Probability): your toolbox for "everything Bayesian"
- Lots of ongoing development on the TFP side: stay tuned for cool additions :-)
- Follow the blog for applications and background: https://blogs.rstudio.com/tensorflow/
- Depending on your needs, pick what's most useful to you:
  - Integration with deep learning (Keras layers)
  - Monte Carlo methods
  - Dynamic linear models

__Thanks for listening!__