Currently under construction
Some notes on evaluating the performance of a probabilistic prediction. All valid and applicable measures are variants of the negative log-likelihood (NLL). In the Bayesian context a quantity known as the log pointwise predictive density (lppd) is often used. In the regression setting the question remains at which points \(x_i\) this quantity should be evaluated. In https://arxiv.org/abs/1507.04544 the elpd, the expected log pointwise predictive density for a new dataset, is therefore defined as:
\[ \mathit{elpd} = \sum_{i=1}^{N} E_{y \sim p(y|x_i)}\left[\log(p(y|x_i,D))\right] \] This quantity requires integrating over the data generating process \(p(y|x_i)\) for each \(x_i\). A first approximation is to evaluate the expectation on a test set, using only the single observed points \(y_i\):
\[ \mathit{elpd.test} = \hat{\mathit{elpd}}= \sum_{i=1}^{N} \log(p(y_i|x_i,D)) = -N \cdot \mathit{NLL} \] It is thus \(-N\) times the usual NLL.
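As a minimal sanity check (with made-up numbers that have nothing to do with the data below), the relation between the test-set elpd and the NLL can be verified directly in R:
# Hypothetical sketch: the test-set elpd is -N times the NLL
set.seed(42)
y.test = rnorm(5)                              # hypothetical test observations
log.p = dnorm(y.test, mean=0, sd=1, log=TRUE)  # log predictive density at each point
(elpd.test = sum(log.p))
(nll = -mean(log.p))
all.equal(elpd.test, -length(y.test) * nll)    # TRUE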
Some data for a linear regression:
N = 50
SD = 0.3
A = 1
B = 2
gen_data = function(){
  x = seq(-2, 2, length.out = N)
  y = rnorm(N, A*x + B, sd = SD)  # y = A*x + B plus Gaussian noise with sd SD
  return(data.frame(x = x, y = y))
}
set.seed(1)
dat = gen_data()
x=dat$x
y=dat$y
plot(x,y)
In principle we know that the data scatters normally around the regression line with constant spread SD. For a model that recovers this data generating process exactly, the elpd would be:
\[ \mathit{elpd}= N \cdot \int_{-\infty}^{+\infty} \log\left(\mathcal{N}(y|0,0.3)\right) \cdot \mathcal{N}(y|0, 0.3) \; dy \]
# Numerical Integration
f = function(x) dnorm(x, sd=SD)*log(dnorm(x, sd=SD))
integrate(f, -1,1)
integrate(f, -2,2)
integrate(f, -4,4)
(elpd.theo = N*integrate(f, -4,4)$value)
(nll.theo = -integrate(f, -4,4)$value)
#integrate(f, -Inf,Inf) not working: far in the tails dnorm underflows to 0 while log(dnorm) is -Inf, giving 0 * -Inf = NaN
# Sampling
d = rnorm(1e6, mean=0,sd=0.3)
(elpd.samp = N*mean(log(dnorm(d, mean=0, sd=0.3))))
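For a normal distribution the integral is also available in closed form, \(E[\log \mathcal{N}(y|0,\mathit{SD})] = -\tfrac{1}{2}\log(2\pi \mathit{SD}^2) - \tfrac{1}{2}\) (the negative differential entropy), which gives a third cross-check:
# Closed-form cross-check (negative entropy of a normal distribution)
(nll.analytic = 0.5 * (log(2*pi*SD^2) + 1))  # approx. 0.215
(elpd.analytic = -N * nll.analytic)          # approx. -10.75, close to elpd.theo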
# Repeat the data generation 1000 times and compute the NLL under the true parameters
vals = rep(NA, 1000)
set.seed(1)
for (i in 1:length(vals)) {
  dat = gen_data()
  vals[i] = -mean(log(dnorm(dat$y, A*dat$x + B, sd=SD)))
}
plot(density(vals), xlab="NLL", main='Distribution of the NLL estimates')
abline(v = nll.theo, col='green')
abline(v = mean(vals), col='red')
Caption: ‘The NLL evaluated by sampling from the data generating process. Note that there is quite some spread.’
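The size of this spread can be quantified: for \(y \sim \mathcal{N}(0, \mathit{SD}^2)\) the variance of the per-point log density is \(1/2\) (a back-of-the-envelope calculation, not from the paper), so the standard error of an NLL estimate based on \(N\) points should be about \(\sqrt{0.5/N}\):
sd(vals)      # empirical spread of the NLL estimates
sqrt(0.5/N)   # approx. 0.1 for N = 50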
library(rstan)
fit = stan(file = 'lr.stan', data=list(N=length(y),y=y,x=x))
print(fit, pars = c('a','b','sigma'))
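The Stan file lr.stan is not listed in these notes. A minimal sketch of a model consistent with the parameters a, b and sigma used below could look as follows (the priors are an assumption, not necessarily what the actual file uses); it could equivalently be passed to stan() via the model_code argument instead of a file:
# Sketch (assumption) of what 'lr.stan' presumably contains
lr_code = "
data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real a;
  real b;
  real<lower=0> sigma;
}
model {
  y ~ normal(a * x + b, sigma);  // flat (improper) priors on a, b, sigma are assumed
}
"
# fit = stan(model_code = lr_code, data = list(N=length(y), y=y, x=x))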
The package loo provides advanced estimates of the elpd. For that, a matrix of the log-likelihood needs to be provided. Its dimensions are S by N, where S is the size of the posterior sample (with all chains merged) and N is the number of data points. We calculate the log-likelihood matrix ps as follows.
library(loo)
## Warning: package 'loo' was built under R version 3.5.2
## This is loo version 2.2.0
## - Online documentation and vignettes at mc-stan.org/loo
## - As of v2.0.0 loo defaults to 1 core but we recommend using as many as possible. Use the 'cores' argument or set options(mc.cores = NUM_CORES) for an entire session.
calc_ps = function () {
  # Extract the posterior samples of the three parameters
  a_sam = rstan::extract(fit, 'a')[[1]]
  b_sam = rstan::extract(fit, 'b')[[1]]
  s_sam = rstan::extract(fit, 'sigma')[[1]]
  samples = length(a_sam)
  # S by N matrix: log-likelihood of data point j under posterior draw i
  ps = matrix(NA, nrow = samples, ncol = N)
  for (i in (1:samples)){
    for (j in (1:length(x))){
      ps[i,j] = dnorm(y[j], a_sam[i] * x[j] + b_sam[i], s_sam[i], log=TRUE)
    }
  }
  return(ps)
}
ps = calc_ps()
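The double loop is easy to read but slow for larger problems. Since dnorm is vectorized over the posterior draws, the same S by N matrix can also be built column by column (a sketch, not used further below):
# Sketch: vectorized alternative to the double loop in calc_ps (same result)
calc_ps_vec = function() {
  a_sam = rstan::extract(fit, 'a')[[1]]
  b_sam = rstan::extract(fit, 'b')[[1]]
  s_sam = rstan::extract(fit, 'sigma')[[1]]
  sapply(1:N, function(j) dnorm(y[j], a_sam * x[j] + b_sam, s_sam, log=TRUE))
}
# max(abs(calc_ps_vec() - calc_ps()))  # should be 0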
Creating the summary statistics.
make_df = function(ps){
  elpd.train = N * mean(ps)        # training elpd: mean of the log-likelihood matrix times N
  res = loo::loo(ps)               # PSIS-LOO estimate
  elpd.loo = res$elpd_loo
  p.loo = res$estimates['p_loo', 1]
  res = loo::waic(ps)$estimates    # WAIC estimate
  elpd.waic = res['elpd_waic', 1]
  p.waic = res['p_waic', 1]
  df = data.frame(
    loo = c(elpd.loo, -elpd.loo/N, p.loo),
    waic = c(elpd.waic, -elpd.waic/N, p.waic),
    training = c(elpd.train, -elpd.train/N, NA)
  )
  return(df)
}
ps = calc_ps()
df = make_df(ps)
## Warning: Relative effective sample sizes ('r_eff' argument) not specified.
## For models fit with MCMC, the reported PSIS effective sample sizes and
## MCSE estimates will be over-optimistic.
## Warning: Accessing elpd_loo using '$' is deprecated and will be removed in
## a future release. Please extract the elpd_loo estimate from the 'estimates'
## component instead.
## Warning:
## 1 (2.0%) p_waic estimates greater than 0.4. We recommend trying loo instead.
row.names(df) = c('elpd', 'NLL', 'p')
print(df)
## loo waic training
## elpd -4.21719908 -4.1904100 -2.61371600
## NLL 0.08434398 0.0838082 0.05227432
## p 2.96634012 2.9395510 NA
elpd.theo
## [1] -10.74829
library("kableExtra")
## Warning: package 'kableExtra' was built under R version 3.5.2
kableExtra::kable(df)
|      | loo       | waic       | training   |
|------|-----------|------------|------------|
| elpd | -4.217199 | -4.1904100 | -2.6137160 |
| NLL  | 0.084344  | 0.0838082  | 0.0522743  |
| p    | 2.966340  | 2.9395510  | NA         |
In the book (Statistical Rethinking, page 222, R Code 7.19) the following example is used:
x = cars$speed
y = cars$dist
N = length(y)
#x = c(-2.,-0.66666, 0.666, 2.)
#y = c(-6.25027354, -2.50213382, -6.07525495, 7.92081243)
plot(x,y)
library(rstan)
fit = stan(file = 'lr.stan', data=list(N=length(y),y=y,x=x))
print(fit, pars = c('a','b','sigma'))
## Inference for Stan model: lr.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## a 3.62 0.01 0.36 2.89 3.39 3.62 3.86 4.29 1422 1
## b -12.16 0.15 5.64 -22.90 -16.04 -12.29 -8.34 -1.15 1391 1
## sigma 15.39 0.04 1.60 12.78 14.27 15.22 16.32 19.01 1386 1
##
## Samples were drawn using NUTS(diag_e) at Tue Jul 7 10:09:15 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
ps = calc_ps()
df = make_df(ps)
## Warning: Relative effective sample sizes ('r_eff' argument) not specified.
## For models fit with MCMC, the reported PSIS effective sample sizes and
## MCSE estimates will be over-optimistic.
## Warning: Some Pareto k diagnostic values are slightly high. See help('pareto-k-diagnostic') for details.
## Warning: Accessing elpd_loo using '$' is deprecated and will be removed in
## a future release. Please extract the elpd_loo estimate from the 'estimates'
## component instead.
## Warning:
## 2 (4.0%) p_waic estimates greater than 0.4. We recommend trying loo instead.
row.names(df) = c('elpd', 'NLL', 'p')
library("kableExtra")
kableExtra::kable(df)
|      | loo         | waic        | training    |
|------|-------------|-------------|-------------|
| elpd | -210.121586 | -210.062221 | -208.252882 |
| NLL  | 4.202432    | 4.201244    | 4.165058    |
| p    | 3.373714    | 3.314350    | NA          |
The effective number of parameters \(p_\mathit{waic}\) can be estimated by summing up the posterior variances of the pointwise log-likelihoods. Subtracting it from the training elpd gives a manual WAIC-style estimate of the elpd.
(p.waic.manual = sum(apply(ps, 2, var))) # p.waic
## [1] 3.31435
elpd.train = df['elpd','training']
elpd.train - p.waic.manual # manual WAIC-style elpd estimate
## [1] -211.5672
-2*(elpd.train - p.waic.manual) # WAIC; approx. 423 in the book
## [1] 423.1345
Discussion: The effective number of parameters is essentially the same as the one reported by the loo routine. However, the WAIC is probably calculated slightly differently in loo compared to the approach in Statistical Rethinking and the manual calculation above.
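To illustrate the difference, here is a sketch of the pointwise WAIC computation as described in the paper cited above: the lppd part uses the log of the posterior mean likelihood, whereas the manual calculation above used the mean of the log-likelihood. It should come close to the loo::waic result in the table:
# Sketch: pointwise WAIC following https://arxiv.org/abs/1507.04544
lppd.i = apply(ps, 2, function(col) log(mean(exp(col))))  # log of posterior mean likelihood per point
p.waic.i = apply(ps, 2, var)                              # pointwise effective number of parameters
(elpd.waic.manual = sum(lppd.i - p.waic.i))               # should be close to elpd_waic from loo::waic
-2 * elpd.waic.manual                                     # the corresponding WAIC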