Currently under construction

Understanding NLL and LOO

Some notes on evaluating the performance of a probabilistic prediction. All valid and applicable measures are variants of the negative log-likelihood (NLL). In the Bayesian context a quantity known as the log pointwise predictive density (lppd) is often used. In the regression setting, the question remains at which points \(x_i\) this quantity shall be evaluated. In https://arxiv.org/abs/1507.04544 the elpd, the expected log pointwise predictive density for a new dataset, is therefore defined as:

\[ \mathit{elpd} = \sum_{i=1}^{N} E_{y \sim p(y|x_i)}\left[\log(p(y|x_i,D))\right] \] This quantity requires integrating over the data generating process \(p(y|x_i)\) for each \(x_i\). A first approximation is to evaluate the expectation on a test set, using only the single observed points \(y_i\):

\[ \mathit{elpd.test} = \hat{\mathit{elpd}}= \sum_{i=1}^{N} \log(p(y_i|x_i,D)) = -N \cdot \mathit{NLL} \] It is thus \(-N\) times the usual NLL, which is the average negative log-likelihood over the \(N\) test points.
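A tiny illustration of this relation with made-up pointwise log-likelihood values (the vector log_probs is hypothetical):

  # toy example: hypothetical pointwise log-likelihoods log(p(y_i|x_i,D))
  log_probs = c(-0.5, -1.2, -0.8, -0.3)
  (nll = -mean(log_probs))        # the usual NLL: 0.7
  (elpd.test = sum(log_probs))    # -2.8, i.e. -length(log_probs) * nll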

Random Data

Some data for linear regression:

N = 50     # number of data points
SD = 0.3   # noise standard deviation
A = 1      # true slope
B = 2      # true intercept
gen_data = function(){
  x = seq(-2, 2, length.out = N)
  y = rnorm(N, A*x + B, sd = SD)   # y = A*x + B plus Gaussian noise
  return (data.frame(x = x, y = y))
}
set.seed(1)
dat = gen_data()
x = dat$x
y = dat$y
plot(x,y)

Theoretical value

In principle we know that the data are distributed normally around the regression line with constant spread SD. Since the spread is the same for every \(x_i\), each of the \(N\) terms contributes the same integral over the residual distribution, and the elpd is:

\[ \mathit{elpd}=N \cdot \int_{-\infty}^{+\infty} \log(N(y|0,0.3)) \cdot N(y|0, 0.3) \; dy \]

  # Numerical Integration
  f = function(x) dnorm(x, sd=SD)*log(dnorm(x, sd=SD))
  integrate(f, -1,1)
  integrate(f, -2,2)
  integrate(f, -4,4)
  (elpd.theo = N*integrate(f, -4,4)$value)
  (nll.theo = -integrate(f, -4,4)$value)
  #integrate(f, -Inf,Inf) not working
  
  # Sampling
  d = rnorm(1e6, mean=0,sd=0.3)
  (elpd.samp = N*mean(log(dnorm(d, mean=0, sd=0.3))))
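
For a Gaussian the expectation also has a closed form, \(E[-\log N(y|0,\mathit{SD})] = \tfrac{1}{2}\log(2\pi\,\mathit{SD}^2) + \tfrac{1}{2}\), which can serve as a sanity check for the numerical integration (a small added sketch):

  # closed form for the expected negative log-likelihood of a Gaussian
  (nll.closed = 0.5*log(2*pi*SD^2) + 0.5)   # approx 0.215 for SD = 0.3
  (elpd.closed = -N * nll.closed)           # approx -10.75, matching elpd.theo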

Taking the spread of y into account

  vals = rep(NA, 1000)
  set.seed(1)
  for (i in 1:length(vals)) {
    dat = gen_data()
    # NLL of a freshly drawn dataset, evaluated under the true model
    vals[i] = -mean(log(dnorm(dat$y, A*dat$x + B, sd=SD)))
  }
  plot(density(vals), xlab="NLL", main='Distribution of the NLL estimates')
  abline(v = nll.theo, col='green')
  abline(v = mean(vals), col='red')

Caption: ‘The NLL evaluated from repeated sampling of the data generating process. Note that there is quite some spread.’

Fitting with Stan

  1. Fit the model and check the results
  library(rstan)
  fit = stan(file = 'lr.stan', data=list(N=length(y),y=y,x=x))
  print(fit, pars = c('a','b','sigma'))
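
The file lr.stan is not listed in these notes; the following is only a guessed minimal linear-regression model consistent with the parameters a, b and sigma reported below. It could equivalently be passed to stan() via the model_code argument.

  # hypothetical reconstruction of lr.stan; the actual file may differ
  lr_code = "
  data {
    int<lower=0> N;
    vector[N] x;
    vector[N] y;
  }
  parameters {
    real a;
    real b;
    real<lower=0> sigma;
  }
  model {
    y ~ normal(a * x + b, sigma);
  }
  "
  # fit = stan(model_code = lr_code, data = list(N = length(y), y = y, x = x))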

Estimating the elpd

The package loo provides advanced estimates of the elpd. For that, a matrix of pointwise log-likelihood values needs to be provided. The dimensions are S by N, where S is the size of the posterior sample (with all chains merged) and N is the number of data points. We calculate the log-likelihood matrix ps as follows.

  library(loo)
## Warning: package 'loo' was built under R version 3.5.2
## This is loo version 2.2.0
## - Online documentation and vignettes at mc-stan.org/loo
## - As of v2.0.0 loo defaults to 1 core but we recommend using as many as possible. Use the 'cores' argument or set options(mc.cores = NUM_CORES) for an entire session.
  calc_ps = function () {
    # posterior draws for slope, intercept and noise SD (all chains merged)
    a_sam = rstan::extract(fit, 'a')[[1]]
    b_sam = rstan::extract(fit, 'b')[[1]]
    s_sam = rstan::extract(fit, 'sigma')[[1]]
    samples = length(a_sam)
    # S x N matrix of pointwise log-likelihoods
    ps = matrix(NA, nrow = samples, ncol=N)
    for (i in (1:samples)){
      for (j in (1:length(x))){
        ps[i,j] = dnorm(y[j], a_sam[i] * x[j] + b_sam[i], s_sam[i], log=TRUE)
      }
    }
    return (ps)
  }
  ps = calc_ps()
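
As an aside, the pointwise log-likelihood is often computed directly in the Stan model via a generated quantities block and then extracted with loo::extract_log_lik; a hedged sketch (the name log_lik is an assumption and not part of lr.stan as used here):

  # sketch only: if lr.stan had a generated quantities block like
  #   generated quantities {
  #     vector[N] log_lik;
  #     for (n in 1:N) log_lik[n] = normal_lpdf(y[n] | a * x[n] + b, sigma);
  #   }
  # the S x N matrix could be obtained directly as
  # ps = loo::extract_log_lik(fit, parameter_name = "log_lik")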

Creating summary statistics.

  make_df = function(ps){
    (elpd.train = N*mean(ps)) 
    res = loo::loo(ps)
    (elpd.loo = res$elpd_loo)
    (p.loo = res$estimates['p_loo',1])
    res = loo::waic(ps)$estimates
    (elpd.waic = res['elpd_waic',1])
    (p.waic = res['p_waic',1])
    
    df = data.frame(
        loo = c(elpd.loo, -elpd.loo/N, p.loo),
        waic = c(elpd.waic, -elpd.waic/N, p.waic),
        training = c(elpd.train, -elpd.train/N, NA)
        )
    return(df)
  }
  ps = calc_ps()
  df = make_df(ps)
## Warning: Relative effective sample sizes ('r_eff' argument) not specified.
## For models fit with MCMC, the reported PSIS effective sample sizes and 
## MCSE estimates will be over-optimistic.
## Warning: Accessing elpd_loo using '$' is deprecated and will be removed in
## a future release. Please extract the elpd_loo estimate from the 'estimates'
## component instead.
## Warning: 
## 1 (2.0%) p_waic estimates greater than 0.4. We recommend trying loo instead.
  row.names(df) = c('elpd', 'NLL', 'p')
  print(df)
##              loo       waic    training
## elpd -4.21719908 -4.1904100 -2.61371600
## NLL   0.08434398  0.0838082  0.05227432
## p     2.96634012  2.9395510          NA
  elpd.theo
## [1] -10.74829
  library("kableExtra")
## Warning: package 'kableExtra' was built under R version 3.5.2
  kableExtra::kable(df)
       loo        waic        training
elpd   -4.217199  -4.1904100  -2.6137160
NLL     0.084344   0.0838082   0.0522743
p       2.966340   2.9395510          NA

Comparison with an example from Statistical Rethinking

In the book (page 222, R Code 7.19) they use the following example:

  x = cars$speed
  y = cars$dist
  N = length(y)
  #x =  c(-2.,-0.66666, 0.666, 2.)
  #y = c(-6.25027354, -2.50213382, -6.07525495,  7.92081243)
  plot(x,y)

library(rstan)
fit = stan(file = 'lr.stan', data=list(N=length(y),y=y,x=x))
## 
## SAMPLING FOR MODEL 'lr' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 8.8e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.88 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 1.38323 seconds (Warm-up)
## Chain 1:                1.2598 seconds (Sampling)
## Chain 1:                2.64303 seconds (Total)
## Chain 1: 
## 
## (sampling output for chains 2-4 is analogous and omitted)
print(fit, pars = c('a','b','sigma'))
## Inference for Stan model: lr.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##         mean se_mean   sd   2.5%    25%    50%   75% 97.5% n_eff Rhat
## a       3.62    0.01 0.36   2.89   3.39   3.62  3.86  4.29  1422    1
## b     -12.16    0.15 5.64 -22.90 -16.04 -12.29 -8.34 -1.15  1391    1
## sigma  15.39    0.04 1.60  12.78  14.27  15.22 16.32 19.01  1386    1
## 
## Samples were drawn using NUTS(diag_e) at Tue Jul  7 10:09:15 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
  ps = calc_ps()
  df = make_df(ps)
## Warning: Relative effective sample sizes ('r_eff' argument) not specified.
## For models fit with MCMC, the reported PSIS effective sample sizes and 
## MCSE estimates will be over-optimistic.
## Warning: Some Pareto k diagnostic values are slightly high. See help('pareto-k-diagnostic') for details.
## Warning: Accessing elpd_loo using '$' is deprecated and will be removed in
## a future release. Please extract the elpd_loo estimate from the 'estimates'
## component instead.
## Warning: 
## 2 (4.0%) p_waic estimates greater than 0.4. We recommend trying loo instead.
  row.names(df) = c('elpd', 'NLL', 'p')
  library("kableExtra")
  kableExtra::kable(df)
       loo          waic         training
elpd   -210.121586  -210.062221  -208.252882
NLL       4.202432     4.201244     4.165058
p         3.373714     3.314350           NA

Manual calculation of the WAIC

The effective number of parameters p.waic can be calculated by summing, over the data points, the posterior variance of the pointwise log-likelihood across the MCMC samples; subtracting this penalty from the training elpd then gives the elpd estimate.

  (p.waic.manual = sum(apply(ps, 2, var))) # p.waic
## [1] 3.31435
  elpd.train = df['elpd','training']
  elpd.train -  p.waic.manual #
## [1] -211.5672
  -2*(elpd.train -  p.waic.manual) #422.85 in book 423.
## [1] 423.1345

Discussion: The effective number of parameters is essentially the same as in the loo routine. The WAIC itself, however, comes out slightly different in loo compared to the approach from Statistical Rethinking reproduced manually above. A likely reason is that loo computes the lppd as the log of the posterior-averaged likelihood per observation, whereas the manual approach uses the training elpd, i.e. the average of the log-likelihood; see the sketch below.
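
For reference, a minimal sketch of the WAIC with the lppd computed as the log of the posterior-averaged likelihood (the definition used in the book and, up to numerical details, in loo); this is an added illustration, not part of the original analysis:

  # lppd: per observation, log of the likelihood averaged over posterior draws
  lppd = sum(log(colMeans(exp(ps))))
  # WAIC on the deviance scale
  -2 * (lppd - p.waic.manual)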