Obtaining Uncertainty Estimates from Neural Networks Using TensorFlow Probability

# Obtaining Uncertainty Estimates from Neural Networks Using TensorFlow Probability
### Sigrid Keydana
### Predictive Analytics World 2019

---

# Neural networks are awesome ...

---

# Deep learning yields excellent results in e.g.

- Image recognition (classification, segmentation, object detection)

- Natural language processing (language models, translation, generation)

- Domain modeling (embeddings)

- Entity generation ("fakes"): Faces, text ...

- Tasks involving tabular data (e.g., recommender systems)

- Playing games (deep reinforcement learning)

---

# So what are we missing?

---

# Networks output _point estimates_

- A single numeric prediction

- A single segmentation mask

- A single translation

- A single embedding

But wait ... aren't there probabilities in there, _somewhere_?

---
# Name that animal

<figure>
<img src="https://raw.githubusercontent.com/skeydan/whyR2019/master/p1.png"/>
<figcaption> Image: Ben Tubby [CC BY 2.0 (https://creativecommons.org/licenses/by/2.0)]
https://upload.wikimedia.org/wikipedia/commons/3/30/Falkland_Islands_Penguins_35.jpg</figcaption>
</figure>

Let's ask the network ...

---
# What network?

... TensorFlow, but from R!

---
# R-TensorFlow

- [R bindings to TensorFlow](https://tensorflow.rstudio.com/) (subsumes Keras)

- TensorFlow-related infrastructure...
  - [Dataset preprocessing pipeline](https://github.com/rstudio/tfdatasets)
  - [TensorFlow datasets](https://github.com/rstudio/tfds)
  - [tfhub](https://github.com/rstudio/tfhub)
  - and more ...

- [_r-tensorflow_ ecosystem](https://github.com/r-tensorflow):
  - Model implementations, e.g. U-Net, WaveNet ... 
  - User-contributed packages for automation, tuning, ...

- Blog: https://blogs.rstudio.com/tensorflow/

---
# With R-Keras, let's ask a pretrained model...

```r
library(tensorflow)
library(keras)

model <- application_mobilenet_v2()
probs <- model %>% predict(image)
imagenet_decode_predictions(probs)
```

```
  class_name class_description        score
1  n02056570      king_penguin 0.9899597168
2  n01847000             drake 0.0011793933
3  n01798484   prairie_chicken 0.0002387235
4  n02058221         albatross 0.0002117234
5  n02071294      killer_whale 0.0001432021
```

---
# Let's do another one

<figure>
<img src="https://raw.githubusercontent.com/skeydan/whyR2019/master/p2.png"/>
<figcaption> Image: M. Murphy [Public domain] 
https://upload.wikimedia.org/wikipedia/commons/2/22/RoyalPenguins3.JPG</figcaption>
</figure>

---
# So what is that?

```r
probs <- model %>% predict(image)
imagenet_decode_predictions(probs)
```

```
 class_name class_description score
1 n02051845 pelican 0.24614049
2 n02009912 American_egret 0.18564136
3 n02058221 albatross 0.06848499
4 n02012849 crane 0.04572001
5 n02009229 little_blue_heron 0.03902744
```

### Hm... what really is that _score_?

---
# Stepping back: What's a neural network?

---
# Zooming in: Weights and activations

<figure>
 <img src="https://raw.githubusercontent.com/skeydan/whyR2019/master/perceptron.png" width = "60%">
 <figcaption>Source: https://en.wikipedia.org/wiki/Perceptron</figcaption>
</figure>

---

# Activation functions in regression

```r
model <- keras_model_sequential() %>%
 layer_dense(units = 32, activation = "relu", input_shape = 7) %>%
 # default activation = linear
 layer_dense(units = 1)
```

<figure>
 <img src="https://raw.githubusercontent.com/skeydan/whyR2019/master/relu.png" width = "60%">
 <figcaption>ReLU activation</figcaption>
</figure>
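
What the two layers compute can be written out in a few lines of base R. This is only a sketch: the weights are random stand-ins (not a trained model), and the names `relu`, `W1`, `b1`, `W2`, `b2` are made up for illustration.

```r
# ReLU: negative pre-activations become zero, positive ones pass through.
relu <- function(z) pmax(z, 0)

set.seed(1)
x  <- matrix(rnorm(7), nrow = 1)        # one observation with 7 features
W1 <- matrix(rnorm(7 * 32), nrow = 7)   # hidden layer weights (random stand-ins)
b1 <- rep(0, 32)                        # hidden layer biases
W2 <- matrix(rnorm(32), nrow = 32)      # output layer weights (random stand-ins)
b2 <- 0                                 # output layer bias

hidden <- relu(x %*% W1 + b1)           # 1 x 32, every entry >= 0
output <- hidden %*% W2 + b2            # 1 x 1, linear: any real number
```

The output is a single number per observation: a point estimate, with nothing attached that says how sure the network is.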

---

# Sigmoid activation for binary classification

```r
model <- keras_model_sequential() %>%
 layer_dense(units = 32, activation = "relu", input_shape = 7) %>%
 layer_dense(units = 1, activation = "sigmoid")
```

<figure>
 <img src="https://raw.githubusercontent.com/skeydan/whyR2019/master/sigmoid.png" width = "60%">
 <figcaption>Sigmoid activation</figcaption>
</figure>
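
As a quick base-R illustration (the helper name `sigmoid` is ours, not part of Keras), the sigmoid squashes any real-valued pre-activation into the interval (0, 1):

```r
# Sigmoid: maps any real number to a value strictly between 0 and 1.
sigmoid <- function(z) 1 / (1 + exp(-z))

z <- c(-5, -1, 0, 1, 5)
p <- sigmoid(z)
round(p, 3)
#> [1] 0.007 0.269 0.500 0.731 0.993
```

Base R ships the same function as `plogis()`.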

---

# Softmax activation for multiclass classification

```r
model <- keras_model_sequential() %>%
 layer_dense(units = 32, activation = "relu", input_shape = 7) %>%
 layer_dense(units = 10, activation = "softmax")
```

Before and after __softmax__ activation.
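
A minimal base-R sketch of what softmax does to the raw outputs (logits). The helper `softmax` and the example logits are made up for illustration.

```r
# Softmax: exponentiate and normalize so the outputs are positive and sum to 1.
# Subtracting the maximum first avoids overflow and does not change the result.
softmax <- function(z) {
  e <- exp(z - max(z))
  e / sum(e)
}

logits <- c(2, 1, 0.1)
probs <- softmax(logits)
round(probs, 3)
#> [1] 0.659 0.242 0.099

# Same ranking, but logits scaled by 10: the "score" becomes far more extreme.
probs_sharp <- softmax(10 * logits)
```

The scores always sum to one, but how peaked they are depends on the scale of the logits, which is one reason not to read them as a measure of how certain the network is.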

---

# So our "probability" is just a maximum likelihood estimate ...

... how can we make this probabilistic?

---
class: inverse, middle, center

# Indirect ways: ensembling (model averaging), dropout ...

---
# Dropout uncertainty (Gal & Ghahramani, 2016)<sup>1</sup>

- _dropout_: stochastic removal of units in hidden (or input) layers

- usually used as a regularization measure during training

- idea: use at test time and average predictions

## Question: How to choose the dropout rate?

.footnote[
[1] Gal and Ghahramani (2016): Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning
]
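
The "use at test time and average predictions" idea, sketched in base R. Everything here is an assumption for illustration: a tiny network with random stand-in weights, a fixed dropout rate of 0.5, and a made-up helper `predict_with_dropout`. It is not the authors' implementation.

```r
set.seed(42)
relu <- function(z) pmax(z, 0)

# Random stand-ins for trained weights: 1 input, 16 hidden units, 1 output
w1 <- rnorm(16); b1 <- rnorm(16)
w2 <- rnorm(16); b2 <- 0

# One stochastic forward pass: dropout stays switched on at prediction time.
predict_with_dropout <- function(x, rate = 0.5) {
  hidden <- relu(x * w1 + b1)
  keep <- rbinom(length(hidden), size = 1, prob = 1 - rate)  # random mask
  hidden <- hidden * keep / (1 - rate)                       # inverted dropout scaling
  sum(hidden * w2) + b2
}

# Repeat the forward pass many times for the same input ...
preds <- replicate(200, predict_with_dropout(x = 0.3))

# ... and summarize: the mean is the prediction, the spread its uncertainty.
mc_mean <- mean(preds)
mc_sd <- sd(preds)
```

The spread `mc_sd` depends directly on `rate`, which is exactly why the question of how to choose the dropout rate matters.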

---
# Dropout uncertainty, next level (Kendall & Gal, 2017)<sup>1,2</sup>

- let the network learn the dropout rate

- let the network learn the variance (account for _heteroscedasticity_)

- independently calculate _epistemic_ and _aleatoric_ uncertainty

.footnote[
[1] Kendall & Gal (2017): What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?

[2] Gal, Hron, & Kendall (2017): Concrete Dropout.
]

---
# Types of uncertainty: aleatoric vs. epistemic<sup>1</sup>

- aleatoric:
  - variation due to the measurement process; irreducible
  - "known unknowns"

- epistemic:
  - insufficiency in the model; reducible by getting more data
  - "unknown unknowns"

.footnote[
[1]
Hora, S. (1996) Aleatory and epistemic uncertainty in probability elicitation with an example from hazardous waste management.
]

---
# Yet more terminology? How does this matter?

- Narrow down what you don't know ...
  - get more data
  - try other models / model families
  - try getting more context

- Deal with unavoidable risk (safety margins)
  - include complementary methods
  - keep in mind in subsequent decision-making

---
# Assessing both types of uncertainty, the Gal et al. way

- __Aleatoric__: Have network output not just predictive mean, but also _predictive variance_
  - Learning different variances for different data points!

- __Epistemic__: Use Monte Carlo sampling to calculate variance of predictions (that are inherently probabilistic due to use of dropout at test time)
  - Making use of learned dropout rate
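
How the two parts combine for a single input, sketched in base R. The numbers are simulated stand-ins for the outputs of repeated stochastic forward passes; the variable names are ours.

```r
set.seed(1)
n_passes <- 100

# Simulated outputs of 100 stochastic forward passes for ONE input:
mu     <- rnorm(n_passes, mean = 2, sd = 0.3)     # predictive means
sigma2 <- runif(n_passes, min = 0.2, max = 0.3)   # predictive variances

# Aleatoric part: the average of the variances the network itself predicts
aleatoric <- mean(sigma2)

# Epistemic part: how much the predicted means vary across passes
epistemic <- mean(mu^2) - mean(mu)^2

total_variance <- aleatoric + epistemic
```

More data should shrink `epistemic`; `aleatoric` stays, because it reflects noise in the data themselves.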

---
class: inverse, middle, center

## This is nice, but computationally intensive, and, well, _indirect_ ...

---
class: inverse, middle, center

# Enter the direct way: Bayesian neural networks!

---

# The idea: Put distributions over the network's weights

<figure>
 <img src="https://raw.githubusercontent.com/skeydan/whyR2019/master/weight_uncertainty.png" width = "120%">
 <figcaption>Blundell et al. (2015): Weight Uncertainty in Neural Networks</figcaption>
</figure>

---
# The idea itself is anything but new...

- MacKay (1992): A practical Bayesian framework for backpropagation networks
- Hinton and van Camp (1993): Keeping neural networks simple by minimizing the description length of the weights
- R. Neal (1996): Bayesian learning for neural networks

## ... but gained momentum rather recently:

- Graves (2011): Practical Variational Inference for Neural Networks
- Kingma et al. (2015): Variational dropout and the local reparameterization trick
- Blundell et al. (2015): Weight uncertainty in neural networks

---
# So how can we do this?

[TensorFlow Probability](https://www.tensorflow.org/probability/) (Python library on top of TensorFlow)

[tfprobability](https://rstudio.github.io/tfprobability/) (R package)

```r
install.packages("tfprobability")

library(tfprobability)
install_tfprobability()
```

---
# tfprobability (1): Basic building blocks

#### Distributions

```r
d <- tfd_binomial(total_count = 7, probs = 0.3)
d %>% tfd_mean()
#> tf.Tensor(2.1000001, shape=(), dtype=float32)
d %>% tfd_variance()
#> tf.Tensor(1.47, shape=(), dtype=float32)
d %>% tfd_log_prob(2.3)
#> tf.Tensor(-1.1914139, shape=(), dtype=float32)
```

#### Bijectors

```r
b <- tfb_affine_scalar(shift = 3.33, scale = 0.5)
x <- c(100, 1000, 10000)
b %>% tfb_forward(x)
#> tf.Tensor([ 53.33 503.33 5003.33], shape=(3,), dtype=float32)
```
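
As a sanity check, the same quantities can be worked out in base R without TensorFlow. This is a cross-check, not a replacement for the TFP calls. Note that `dbinom()` only accepts integer counts, so the log probability is evaluated at 2 rather than at 2.3 and is not expected to match the value shown above.

```r
n <- 7
p <- 0.3

binom_mean <- n * p              # 2.1, as returned by tfd_mean()
binom_var  <- n * p * (1 - p)    # 1.47, as returned by tfd_variance()

# Log probability of observing exactly 2 successes
log_prob_2 <- dbinom(2, size = n, prob = p, log = TRUE)

# The affine bijector's forward pass is just shift + scale * x
shift <- 3.33
scale <- 0.5
x <- c(100, 1000, 10000)
forward <- shift + scale * x
forward
#> [1]   53.33  503.33 5003.33
```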

---
# tfprobability (2): Higher-level modules

- Keras layers

- Markov Chain Monte Carlo (Hamiltonian Monte Carlo, NUTS)

- Variational inference

- State space models

- GLMs

---
class: inverse, middle, center

# How does that help us?

---
# With TFP, neural network layers can be _distributions_

A network that has a multivariate normal distribution as output

```r
model <- keras_model_sequential() %>%
 layer_dense(units = params_size_multivariate_normal_tri_l(d)) %>%
 layer_multivariate_normal_tri_l(event_size = d)

log_loss <- function (y, model) - (model %>% tfd_log_prob(y))

model %>% compile(optimizer = "adam", loss = log_loss)

model %>% fit(
  x,
  y,
  batch_size = 100,
  epochs = 1,
  steps_per_epoch = 10
)
```

---
# Back to uncertainty...

- If our network can output a distribution, we can have it learn the distribution's parameters

- Choosing e.g. a normal distribution, we can train the network to learn the actual _spread_ in the data

- Thus, we have found a way to estimate _aleatoric_ uncertainty

---
# Example: Aleatoric uncertainty

```r
model <- keras_model_sequential() %>%
 layer_dense(units = 8, activation = "relu") %>%
 layer_dense(units = 2, activation = "linear") %>%
 layer_distribution_lambda(function(x)
   tfd_normal(
     loc = x[, 1, drop = FALSE],
     scale = 1e-3 + tf$math$softplus(x[, 2, drop = FALSE])
   )
 )

negloglik <- function(y, model) - (model %>% tfd_log_prob(y))

model %>% compile(
  optimizer = optimizer_adam(lr = 0.01),
  loss = negloglik
)

model %>% fit(x, y, epochs = 1000)
```
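
What this loss rewards can be seen in a base-R sketch of the same calculation. The helpers `softplus` and `negloglik_normal` are hypothetical stand-ins for what the distribution layer and `tfd_log_prob()` do; `raw` plays the role of the two outputs of the last dense layer.

```r
# Softplus keeps the scale positive: log(1 + exp(z))
softplus <- function(z) log1p(exp(z))

# Negative log-likelihood of y under N(loc, scale), with loc and scale
# taken from the two raw network outputs.
negloglik_normal <- function(y, raw) {
  loc   <- raw[, 1]
  scale <- 1e-3 + softplus(raw[, 2])
  -dnorm(y, mean = loc, sd = scale, log = TRUE)
}

# Same target (3) and same predicted mean (1), but two different raw scales:
y   <- c(3, 3)
raw <- rbind(c(1, -2),   # narrow distribution, overconfident
             c(1,  2))   # wide distribution
nll <- negloglik_normal(y, raw)
```

When the mean is off, the overconfident prediction (`nll[1]`) is penalized much more heavily than the wide one (`nll[2]`). That is the incentive that lets the network learn input-dependent spread.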

---
# The network has learned about variance in the data

```r
yhat <- model(tf$constant(x_test))
mean <- yhat %>% tfd_mean()
sd <- yhat %>% tfd_stddev()
```

---
# How about epistemic uncertainty?

Use _tfprobability_ variational layers:

- layer_dense_variational

- layer_dense_flipout

- layer_dense_reparameterization

- layer_dense_local_reparameterization

- layer_conv_nd_flipout (n = [1..3])

---
# Posterior weights are computed using variational inference

<figure>
 <img src="https://raw.githubusercontent.com/skeydan/whyR2019/master/blei_vi.png" width = "80%">
 <figcaption>Source: <a href="https://media.nips.cc/Conferences/2016/Slides/6199-Slides.pdf">David Blei, Rajesh Ranganath, Shakir Mohamed: Variational Inference:Foundations and Modern Methods. NIPS 2016 Tutorial·December 5, 2016.</a></figcaption>
</figure>

---
# Example: Epistemic uncertainty

```r
model <- keras_model_sequential() %>%
 layer_dense_variational(
   units = 1,
   make_posterior_fn = posterior_mean_field,
   make_prior_fn = prior_trainable,
   kl_weight = 1 / n
 ) %>%
 layer_distribution_lambda(
   function(x) tfd_normal(loc = x, scale = 1)
 )

negloglik <- function(y, model) - (model %>% tfd_log_prob(y))

model %>% compile(
  optimizer = optimizer_adam(lr = 0.1),
  loss = negloglik
)

model %>% fit(x, y, epochs = 1000)
```

---
# Now every prediction uses different _weight samples_

```r
yhats <- purrr::map(1:100, function(x) model(tf$constant(x_test)))
```
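
Why repeated calls differ can be mimicked in base R. This is a sketch under made-up assumptions: a linear model whose slope and intercept have a hypothetical mean-field normal posterior (`post_mean`, `post_sd`), sampled afresh on every forward pass.

```r
set.seed(7)

# Hypothetical posterior over the weights of y = w * x + b
post_mean <- c(w = 1.5, b = 0.2)
post_sd   <- c(w = 0.1, b = 0.05)

x_test <- seq(-1, 1, length.out = 5)

# Each forward pass draws its own weights from the posterior.
forward_pass <- function(x) {
  w <- rnorm(1, mean = post_mean[["w"]], sd = post_sd[["w"]])
  b <- rnorm(1, mean = post_mean[["b"]], sd = post_sd[["b"]])
  w * x + b
}

# 100 passes over the same inputs: a 5 x 100 matrix, one column per draw
yhats <- sapply(1:100, function(i) forward_pass(x_test))

# Spread across draws, per test point: an estimate of epistemic uncertainty
epistemic_sd <- apply(yhats, 1, sd)
```

In this toy setup the spread is smallest at `x = 0`, where only the intercept's uncertainty matters, and grows with the distance from zero.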

---
# Aleatoric, epistemic ... why not have both?

```r
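model <- keras_model_sequential() %>%
 layer_dense_variational(
   units = 2,
   # posterior, prior and KL weight as in the epistemic example
   make_posterior_fn = posterior_mean_field,
   make_prior_fn = prior_trainable,
   kl_weight = 1 / n
 ) %>%
 layer_distribution_lambda(function(x)
   tfd_normal(
     loc = x[, 1, drop = FALSE],
     scale = 1e-3 + tf$math$softplus(0.01 * x[, 2, drop = FALSE])
   )
 )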

yhats <- purrr::map(1:100, function(x) model(tf$constant(x_test)))
means <-
 purrr::map(yhats, purrr::compose(as.matrix, tfd_mean)) %>% abind::abind()
sds <-
 purrr::map(yhats, purrr::compose(as.matrix, tfd_stddev)) %>% abind::abind()
```

---
# Main challenge now is how to display ...

Each line is one draw from the posterior weights; each line has its own standard deviation.

More background: https://blogs.rstudio.com/tensorflow/posts/2019-06-05-uncertainty-estimates-tfprobability/

---
class: inverse, middle, center

# Case study: Smartphone activity detection

---
# Non-probabilistic classification

Nick Strayer, Classifying physical activity from smartphone data [https://blogs.rstudio.com/tensorflow/posts/2018-07-17-activity-detection/](https://blogs.rstudio.com/tensorflow/posts/2018-07-17-activity-detection/)

---
# Original confusion matrix

---
# Goal: Port model to TensorFlow Probability

- replace normal conv layers by _layer_conv_1d_flipout_

- replace deterministic output layer by _layer_one_hot_categorical_

- training objective: maximize the ELBO, i.e. minimize the sum of:
  
   - KL divergence between a (standard) multivariate normal prior and the approximate posterior
   - negative log probability of the data under the approximate posterior
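
The two terms can be made concrete in base R. This is a sketch under simplifying assumptions: a mean-field normal posterior, a standard normal prior (for which the KL divergence has a closed form), and made-up numbers. The helper names are ours, not TFP functions.

```r
# KL divergence between a mean-field normal posterior N(mu, sigma^2)
# and a standard normal prior N(0, 1), summed over all weights.
kl_to_standard_normal <- function(mu, sigma) {
  sum(0.5 * (sigma^2 + mu^2 - 1) - log(sigma))
}

# Negative ELBO per observation: data-fit term plus the KL term scaled by 1 / n,
# where n is the number of training observations.
negative_elbo <- function(nll_per_obs, mu, sigma, n) {
  mean(nll_per_obs) + kl_to_standard_normal(mu, sigma) / n
}

# Made-up posterior parameters and per-observation losses
mu          <- c(0.5, -1.0, 0.0)
sigma       <- c(0.8, 1.2, 1.0)
nll_per_obs <- c(1.3, 0.7, 2.1, 0.9)

kl   <- kl_to_standard_normal(mu, sigma)
loss <- negative_elbo(nll_per_obs, mu, sigma, n = 1000)
```

The KL term is zero only when the posterior equals the prior, so it pulls the weights towards the prior while the likelihood term pulls them towards the data.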

---
# Probabilistic model

```r
model <- keras_model_sequential()

model %>% 
  layer_conv_1d_flipout(
    filters = 12,
    kernel_size = 3, 
    activation = "relu",
    kernel_divergence_fn = kl_div
  ) %>%
  #...
  #...
  layer_dense_flipout(
    num_classes, 
    kernel_divergence_fn = kl_div
  ) %>%
  layer_one_hot_categorical(event_size = num_classes)

nll <- function(y, model) - (model %>% tfd_log_prob(y))
```

---
# Loss curve

---
# Uncertainties

- __maxprob__: single-run highest per-obs mean of categorical distribution
- __maxprob_sd__: standard deviation associated with single-run highest per-obs mean of categorical distribution
- __maxprob_mc_sd__: standard deviation of means obtained from 100 prediction runs (weights are probabilistic!)

```
     obs maxprob maxprob_sd maxprob_mc_sd     predicted         truth correct
     
 1     1   0.997     0.0561        0.175    STAND_TO_SIT   STAND_TO_S…   TRUE   
 2     2   0.653     0.476         0.235    STAND_TO_SIT   STAND_TO_S…   TRUE   
 3     3   0.987     0.112         0.121    STAND_TO_SIT   STAND_TO_S…   TRUE   
 4     4   0.960     0.195         0.130    STAND_TO_SIT   STAND_TO_S…   TRUE   
 5     5   0.987     0.115         0.325    STAND_TO_SIT   STAND_TO_S…   TRUE   
 6     6   0.568     0.495         0.265    STAND_TO_SIT   STAND_TO_S…   TRUE   
 7     7   0.701     0.458         0.320    STAND_TO_SIT   STAND_TO_S…   TRUE   
 8     8   0.982     0.134         0.0651   STAND_TO_SIT   STAND_TO_S…   TRUE   
 9     9   0.589     0.492         0.235    STAND_TO_SIT   STAND_TO_S…   TRUE   
10    10   0.546     0.498         0.114    STAND_TO_SIT   STAND_TO_S…   TRUE   
11    11   0.798     0.401         0.220    STAND_TO_SIT   STAND_TO_S…   TRUE   
... 
```
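
A side observation, checked in base R: the `maxprob_sd` values above are consistent with the standard deviation of a 0/1 indicator that equals 1 with probability `maxprob`, i.e. `sqrt(p * (1 - p))`. That this is how the column was computed is our assumption. The numbers below are copied from rows 2, 6, 7, 9, 10 and 11 of the table.

```r
maxprob          <- c(0.653, 0.568, 0.701, 0.589, 0.546, 0.798)
maxprob_sd_table <- c(0.476, 0.495, 0.458, 0.492, 0.498, 0.401)

# Standard deviation of an indicator that is 1 with probability p
indicator_sd <- sqrt(maxprob * (1 - maxprob))
round(indicator_sd, 3)
#> [1] 0.476 0.495 0.458 0.492 0.498 0.401
```

Read this way, `maxprob_sd` is a function of `maxprob` alone and is largest near 0.5, whereas `maxprob_mc_sd` carries separate information: how much the class probability moves when the weights are resampled.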

---
# Are wrong classifications associated with higher uncertainty?

```
correct count   avg_mean   avg_sd   avg_mc_sd

FALSE      17      0.673    0.420       0.229
TRUE       52      0.814    0.314       0.178
```

```
truth          predicted    avg_mean   avg_sd avg_mc_sd   correct

SIT_TO_LIE     SIT_TO_LIE      0.904    0.240     0.136   TRUE   
LIE_TO_SIT     LIE_TO_STAND    0.532    0.498     0.153   FALSE  
SIT_TO_STAND   SIT_TO_STAND    0.904    0.238     0.165   TRUE   
STAND_TO_LIE   STAND_TO_LIE    0.833    0.297     0.179   TRUE   
LIE_TO_STAND   LIE_TO_STAND    0.651    0.470     0.191   TRUE   
STAND_TO_SIT   STAND_TO_SIT    0.814    0.290     0.194   TRUE   
LIE_TO_SIT     LIE_TO_SIT      0.682    0.448     0.196   TRUE   
LIE_TO_STAND   LIE_TO_SIT      0.639    0.454     0.213   FALSE  
SIT_TO_LIE     STAND_TO_SIT    0.692    0.462     0.240   FALSE  
SIT_TO_LIE     STAND_TO_LIE    0.738    0.379     0.265   FALSE  
STAND_TO_LIE   SIT_TO_LIE      0.813    0.282     0.299   FALSE
```

---
# Where to from here?

- TensorFlow Probability is under rapid development

- Follow the TensorFlow for R blog for updates: https://blogs.rstudio.com/tensorflow/

## Thanks for your attention!