Summary

One mysterious thing about Stan for me was the complication of putting priors on transformed parameters defined in the transformed parameters {} block.

Long story short, the key is in understanding what the model {} block is defining. As confirmed by @betanalpha, the model {} block defines the log joint density \(f(\text{data}, \text{parameters})\) of the data as defined in the data {} block and the parameters as defined in the parameters {} block.

Note that \(f(\text{data}, \text{parameters})\) is NOT regarded as a function of the transformed parameters as defined in the transformed parameters {} block, even if you use the transformed parameters in the likelihood expression in the model {} block.

The implication of what the model {} block defines is the following factorization.

\[\begin{align*} \log f(\text{data}, \text{parameters}) &= \underbrace{\log f(\text{data} | \text{parameters})}_{\text{log likelihood}} + \underbrace{\log f(\text{parameters})}_{\text{log prior}} \end{align*} \]

That is, the log prior we have to include in the model {} block is the log prior for the parameters as defined in the parameters {} block, NOT the log prior for the transformed parameters per se.

We can build this log prior from the log prior for the transformed parameters, but that alone is typically NOT enough. This is why and where the Jacobian adjustment comes in.

We will take a deeper look into this using a simple example with an analytical solution that we can check against.

Simple coin toss example

Let us now consider a simple coin toss experiment to estimate the head probability \(p\) of a potentially biased coin. We toss it \(n\) times and consider the toss results to be independent and identically distributed (iid). We can describe each toss outcome as \(X_{i} \in \left\{ 0,1 \right\}\), where 1 is for a head and 0 is for a tail.

\[\begin{align*} X_{i} &\overset{\text{iid}}{\sim} \text{Bernoulli}(p)\\ i &= 1, \dots, n \end{align*} \]

We will fix \(n\) at 10 and the data vector at \((X_{1},...,X_{10}) = (0,1,1,1,0,1,1,1,0,1)\). That is, 7 heads and 3 tails out of 10 tosses.

For the prior distribution, we will use a Beta prior for \(p\):

\[\begin{align*} p &\sim \text{Beta}(a,b) \end{align*} \]

For the rest of the discussion, we will use \(a=b=1\), which gives a Uniform(0,1) prior.

The deterministic transformation that we will consider is the following:

\[\begin{align*} \theta &= \text{logit}(p) \end{align*} \]

Note that this likelihood-prior pair is a conjugate pair. Thus, we have the analytical posterior, which is Beta(\(a+\sum^{10}_{i=1}X_{i}, b+(10-\sum^{10}_{i=1}X_{i})\)).

In this example, the posterior is Beta(1+7, 1+3) = Beta(8, 4). The mean of a Beta(\(a,b\)) distribution is \(\frac{a}{a+b}\), so the posterior mean of \(p\) is \(8/12 \approx 0.67\). We will check our Stan results against this analytical solution.
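
As a quick sanity check, the analytical posterior mean can be computed directly in base R:

a <- 1; b <- 1
X <- c(0,1,1,1,0,1,1,1,0,1)
post_a <- a + sum(X)       ## 1 + 7 = 8
post_b <- b + sum(1 - X)   ## 1 + 3 = 4
post_a / (post_a + post_b) ## 0.6666667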

Non-linear transformation, density function, and Jacobian of transformation

Let \(p \in [0,1]\) (\(p\) takes on any value between 0 and 1) be a probability parameter. As discussed above, we assume that \(p\) follows Beta(\(a,b\)). We can consider the logit (log odds) non-linear transformation of \(p\), which we call \(\theta\). That is,

\[\begin{align*} p &\sim \text{Beta}(a,b)\\ \theta &= \text{logit}(p) = \log \left( \frac{p}{1-p} \right)\\ p &= \text{expit}(\theta) = \frac{e^{\theta}}{1+e^{\theta}}\\ \end{align*} \]

expit is the inverse of logit. It is called inv_logit in Stan.
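
Base R ships with the same pair of functions: qlogis() is logit and plogis() is expit.

p <- 0.7
theta <- qlogis(p) ## log(0.7 / 0.3)
plogis(theta)      ## recovers 0.7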

What is the density function of \(\theta\), which we call \(f_{\theta}\)? We can “look up” the density function of \(p\), which we call \(f_{p}\), to figure it out. This is where the absolute determinant of the Jacobian of the inverse transformation comes in (Casella and Berger, p. 51, Theorem 2.1.5).

\[\begin{align*} f_{\theta}(\theta) &= f_{p}(\text{expit}(\theta)) \left| \frac{\text{d}~\text{expit}(\theta)}{\text{d}~\theta} \right| \end{align*} \]

We use the inverse transformation expit to obtain \(p\) from \(\theta\) so that we can “look up” the density function \(f_{p}\) for \(p\), which we already know.

The complication here is the second term, the absolute determinant of the Jacobian of the inverse transformation (expit). See Probability Theory (For Scientists and Engineers) for a more comprehensive review, including multivariate transformations. Here the transformation is univariate, so the Jacobian “matrix” is 1×1 and its determinant is just the derivative itself.

Lambert explains the intuition behind this extra term nicely. “Whenever we apply a non-linear transform to a variable, its probability distribution is stretched in some areas and squashed in others.”

I have always had difficulty remembering what to put in the “numerator” of the Jacobian part. Just make sure it is the same expression as the one fed to \(f_{p}\); this follows from the chain rule when differentiating the distribution function.
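
To spell out that chain-rule step: expit is strictly increasing, so the distribution function of \(\theta\) can be written via that of \(p\), and differentiating both sides gives the density.

\[\begin{align*} F_{\theta}(\theta) &= P(\text{logit}(p) \le \theta) = P(p \le \text{expit}(\theta)) = F_{p}(\text{expit}(\theta))\\ f_{\theta}(\theta) &= \frac{\text{d}}{\text{d}~\theta} F_{p}(\text{expit}(\theta)) = f_{p}(\text{expit}(\theta)) ~ \frac{\text{d}~\text{expit}(\theta)}{\text{d}~\theta} \end{align*} \]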

Let us resort to WolframAlpha for the derivative.

\[\begin{align*} \frac{\text{d}~\text{expit}(\theta)}{\text{d}~\theta} &= \frac{\text{d}}{\text{d}~\theta} \frac{e^{\theta}}{1+e^{\theta}}\\ &= \frac{e^{\theta}}{(1+e^{\theta})^{2}} \ge 0\\ \end{align*} \]

This expression is non-negative, so it can come out of the absolute value.

\[\begin{align*} f_{\theta}(\theta) &= f_{p}(\text{expit}(\theta)) \left( \frac{e^{\theta}}{(1+e^{\theta})^{2}} \right) \end{align*} \]

The Beta(\(a,b\)) density function \(f_{p}\) is the following expression.

\[\begin{align*} f_{p}(p) &= \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} p^{a-1}(1-p)^{b-1} \end{align*} \]

Now we know the density function for \(\theta\), the logit transformation of \(p\)!
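
As a numerical sanity check using only base R, we can verify that \(f_{\theta}\) integrates to 1. Note that dlogis() is exactly \(e^{\theta}/(1+e^{\theta})^{2}\), the Jacobian factor we derived.

## f_theta should integrate to 1 for any valid hyperparameters a, b.
f_theta <- function(theta, a, b) {
    dbeta(plogis(theta), a, b) * dlogis(theta)
}
integrate(f_theta, -Inf, Inf, a = 1, b = 1) ## ~ 1
integrate(f_theta, -Inf, Inf, a = 2, b = 5) ## ~ 1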

Load packages

## https://www.random.org
set.seed(533276351)
library(tidyverse)
## devtools::install_github('jburos/biostan', build_vignettes = TRUE, dependencies = TRUE)
library(biostan)
## Warning: replacing previous import 'rstan::loo' by 'rstanarm::loo' when loading 'biostan'
library(rstan)
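
Note: the later chunks reference n_cores and start_time, which come from a setup chunk not shown here. A minimal stand-in (an assumption, not the original setup) would be:

## Assumed setup (not shown in this post): values used by later chunks.
n_cores <- parallel::detectCores() ## 12 on the machine used here
start_time <- Sys.time()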

Define data

Here we define the hyperparameters and coin toss results.

stan_data <- list(a = 1, b = 1,
                  N = 10,
                  X = c(0,1,1,1,0,1,1,1,0,1))

Modeling \(p\) as a parameter

Here we define \(p\) in the parameters {} block and there is no complication.

stan_model_p_param <- biostan::read_stan_file("./stan_p_param.stan")
biostan::print_stan_code(stan_model_p_param)
## data { 
##     real<lower=0> a; 
##     real<lower=0> b; 
##     int<lower=0> N; 
##     int<lower=0,upper=1> X[N]; 
## } 
##  
## parameters { 
##     real<lower=0,upper=1> p; 
## } 
##  
## transformed parameters { 
##     real theta; 
##     theta = logit(p); 
## } 
##  
## model { 
##     p ~ beta(a, b); 
##     for (i in 1:N) { 
##         X[i] ~ bernoulli(p); 
##     } 
## } 
## 
fit_stan_model_p_param <- rstan::stan(model_code = stan_model_p_param,
                                      data = stan_data,
                                      chains = n_cores)
print(fit_stan_model_p_param, pars = c("theta","p","lp__"))
## Inference for Stan model: 11307b76230518f3b52a9d07241124fb.
## 12 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=12000.
## 
##        mean se_mean   sd   2.5%   25%   50%   75% 97.5% n_eff Rhat
## theta  0.77    0.01 0.65  -0.44  0.33  0.74  1.18  2.14  4135    1
## p      0.67    0.00 0.13   0.39  0.58  0.68  0.76  0.89  4291    1
## lp__  -8.17    0.01 0.76 -10.35 -8.34 -7.87 -7.69 -7.64  4787    1
## 
## Samples were drawn using NUTS(diag_e) at Mon Apr 15 05:54:01 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

The posterior mean is correctly at 0.67.
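
We can also compare the sampled quantiles of \(p\) with the analytical Beta(8, 4) quantiles, which agree with the table above up to Monte Carlo error:

## Analytical Beta(8, 4) quantiles corresponding to the columns above.
qbeta(c(0.025, 0.25, 0.5, 0.75, 0.975), 8, 4)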

The model {} block is where we define the model.

biostan::print_stan_code(stan_model_p_param, section = "model")
## model { 
##     p ~ beta(a, b); 
##     for (i in 1:N) { 
##         X[i] ~ bernoulli(p); 
##     }

This is where we define the log joint density of the data (random variables) and the parameters; this joint density is proportional to the posterior of interest. Note that “parameters” here refers only to the ones defined in the parameters {} block.

In this example, the model block can be understood as defining the following log density. Note that \(\log f_{X,p}(\mathbf{X},p)\) equals the log posterior \(\log p(p | \mathbf{X})\) up to an additive constant.

\[\begin{align*} \log f_{X,p}(\mathbf{X},p) &= \log \left[ f_{X|p}(\mathbf{X} | p) f_{p}(p) \right]\\ &= \log \left[ f_{p}(p) ~ \prod^{n}_{i=1} f_{X|p}(X_{i} | p) \right]\\ &= \underbrace{\log f_{p}(p)}_{\text{log prior on } p} + \underbrace{\sum^{n}_{i=1} \log f_{X|p}(X_{i} | p)}_{\text{log likelihood}}\\ \end{align*} \]

The more explicit way to write the model {} block directly corresponds to the expression above.

biostan::print_stan_code(biostan::read_stan_file("./stan_p_param_explicit.stan"), section = "model")
## model { 
##     target += beta_lpdf(p | a, b); 
##     for (i in 1:N) { 
##         target += bernoulli_lpmf(X[i] | p); 
##     }

Modeling \(p\) as a transformed parameter

In some cases, it may make more sense to declare a different parameter and define the parameter of interest as a non-linear transformation of it. In this specific instance, let us declare \(\theta\), the log odds of \(p\), as a parameter in the parameters {} block and define \(p\) as a transformed parameter in the transformed parameters {} block.

stan_model_theta_param <- biostan::read_stan_file("./stan_theta_param.stan")
biostan::print_stan_code(stan_model_theta_param)
## data { 
##     real<lower=0> a; 
##     real<lower=0> b; 
##     int<lower=0> N; 
##     int<lower=0,upper=1> X[N]; 
## } 
##  
## parameters { 
##     real theta; 
## } 
##  
## transformed parameters { 
##     real<lower=0,upper=1> p; 
##     p = inv_logit(theta); 
## } 
##  
## model { 
##     p ~ beta(a, b); 
##     /* Now we need a log absolute Jacobian adjustment. */ 
##     target += theta - 2 * log(1 + exp(theta)); 
##     for (i in 1:N) { 
##         X[i] ~ bernoulli(p); 
##     } 
## } 
## 
fit_stan_model_theta_param <- rstan::stan(model_code = stan_model_theta_param,
                                          data = stan_data,
                                          cores = n_cores)
DIAGNOSTIC(S) FROM PARSER:
Info (non-fatal):
Left-hand side of sampling statement (~) may contain a non-linear transform of a parameter or local variable.
If it does, you need to include a target += statement with the log absolute determinant of the Jacobian of the transform.
Left-hand-side of sampling statement:
    p ~ beta(...)
print(fit_stan_model_theta_param, pars = c("theta","p","lp__"))
## Inference for Stan model: 8425e26f4079863fb4fd22857ea939be.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##        mean se_mean   sd   2.5%   25%   50%   75% 97.5% n_eff Rhat
## theta  0.77    0.02 0.67  -0.48  0.33  0.75  1.18  2.15  1508    1
## p      0.67    0.00 0.13   0.38  0.58  0.68  0.76  0.90  1558    1
## lp__  -8.20    0.02 0.82 -10.48 -8.40 -7.88 -7.69 -7.64  1458    1
## 
## Samples were drawn using NUTS(diag_e) at Mon Apr 15 05:54:42 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

The posterior mean is correctly at 0.67.
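
Because logit is strictly increasing, the posterior quantiles of \(\theta\) are simply the logit of the Beta(8, 4) quantiles, again matching the table up to Monte Carlo error:

## Quantiles pass through monotone transformations unchanged.
qlogis(qbeta(c(0.025, 0.5, 0.975), 8, 4))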

In this example, the model {} block can be understood as defining the following log joint density of the random variables and the parameter defined in the parameters {} block, which is \(\theta\). Note that \(p\), which was defined in the transformed parameters {} block, is NOT an argument of the log density on the left-hand side. Also, \(\log f_{X,\theta}(\mathbf{X},\theta)\) equals the log posterior \(\log p(\theta | \mathbf{X})\) up to an additive constant; it is again a function of \(\theta\), NOT \(p\).

\[\begin{align*} \log f_{X,\theta}(\mathbf{X},\theta) &= \log \left[ f_{X|\theta}(\mathbf{X} | \theta) f_{\theta}(\theta) \right]\\ &= \log \left[ f_{\theta}(\theta) ~ \prod^{n}_{i=1} f_{X|\theta}(X_{i} | \theta) \right]\\ &= \underbrace{\log f_{\theta}(\theta)}_{\text{log prior on } \theta} + \underbrace{\sum^{n}_{i=1} \log f_{X|\theta}(X_{i} | \theta)}_{\text{log likelihood}}\\ &~~~\text{Using what we derived first, rewrite with $f_{p}$}\\ &= \log \left[ f_{p}(\text{expit}(\theta)) \left( \frac{e^{\theta}}{(1+e^{\theta})^{2}} \right) \right] + \sum^{n}_{i=1} \log f_{X|\theta}(X_{i} | \theta)\\ &= \log f_{p}(\text{expit}(\theta)) + \log \left( \frac{e^{\theta}}{(1+e^{\theta})^{2}} \right) + \sum^{n}_{i=1} \log f_{X|\theta}(X_{i} | \theta)\\ &= \log f_{p}(p) + \left[ \theta - 2 \log (1+e^{\theta}) \right] + \sum^{n}_{i=1} \log f_{X|\theta}(X_{i} | \theta)\\ &~~~\text{Conditioning on $\theta$ is equivalent to conditioning on $p$}\\ &= \underbrace{\log f_{p}(p)}_{\text{log prior on } p} + \underbrace{\left[ \theta - 2 \log (1+e^{\theta}) \right]}_{\text{log absolute Jacobian}} + \underbrace{\sum^{n}_{i=1} \log f_{X|p}(X_{i} | p)}_{\text{log likelihood}}\\ \end{align*} \]

That is, what we need to do is include the log prior for the parameter \(\theta\) defined in the parameters {} block. Thus, including the log prior for the transformed parameter \(p\) is insufficient because \(\log f_{\theta}(\theta) = \log f_{p}(p) + \left[ \theta - 2 \log (1+e^{\theta}) \right]\).
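
We can verify the Jacobian term numerically. Incidentally, for \(a = b = 1\) the Beta log prior vanishes, so the entire adjusted log prior \(\log f_{\theta}(\theta)\) is just the standard logistic log density:

## The adjustment equals log(d expit / d theta), which is also the
## standard logistic log density.
theta <- seq(-3, 3, by = 0.5)
adj <- theta - 2 * log(1 + exp(theta))
all.equal(adj, log(exp(theta) / (1 + exp(theta))^2)) ## TRUE
all.equal(adj, dlogis(theta, log = TRUE))            ## TRUE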

Again the more explicit code may clarify the correspondence between the math above and each line.

biostan::print_stan_code(biostan::read_stan_file("./stan_theta_param_explicit.stan"), section = "model")
## model { 
##     target += beta_lpdf(p | a, b); 
##     /* Now we need a log absolute Jacobian adjustment. */ 
##     target += theta - 2 * log(1 + exp(theta)); 
##     for (i in 1:N) { 
##         target += bernoulli_lpmf(X[i] | p); 
##     }

What happens if we ignore the warning when modeling \(p\) as a transformed parameter

stan_model_theta_param_wrong <- biostan::read_stan_file("./stan_theta_param_wrong.stan")
biostan::print_stan_code(stan_model_theta_param_wrong)
## data { 
##     real<lower=0> a; 
##     real<lower=0> b; 
##     int<lower=0> N; 
##     int<lower=0,upper=1> X[N]; 
## } 
##  
## parameters { 
##     real theta; 
## } 
##  
## transformed parameters { 
##     real<lower=0,upper=1> p; 
##     p = inv_logit(theta); 
## } 
##  
## model { 
##     p ~ beta(a, b); 
##     for (i in 1:N) { 
##         X[i] ~ bernoulli(p); 
##     } 
## } 
## 
fit_stan_model_theta_param_wrong <- rstan::stan(model_code = stan_model_theta_param_wrong,
                                                data = stan_data,
                                                cores = n_cores)
DIAGNOSTIC(S) FROM PARSER:
Info (non-fatal):
Left-hand side of sampling statement (~) may contain a non-linear transform of a parameter or local variable.
If it does, you need to include a target += statement with the log absolute determinant of the Jacobian of the transform.
Left-hand-side of sampling statement:
    p ~ beta(...)
print(fit_stan_model_theta_param_wrong, pars = c("theta","p","lp__"))
## Inference for Stan model: 9cf03771c04db9f07e6d4022132b4407.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##        mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
## theta  0.96    0.02 0.72 -0.38  0.46  0.93  1.43  2.49  1248    1
## p      0.70    0.00 0.14  0.41  0.61  0.72  0.81  0.92  1282    1
## lp__  -6.62    0.02 0.70 -8.63 -6.80 -6.35 -6.17 -6.11  1328    1
## 
## Samples were drawn using NUTS(diag_e) at Mon Apr 15 05:55:18 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

The result does not agree with the analytical solution or the previous two correct models.
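
To see exactly what went wrong: with \(a = b = 1\) the Beta log prior is constant, so the un-adjusted model effectively places a flat prior on \(\theta\) instead of on \(p\). Transforming that posterior back to the \(p\) scale multiplies the correct Beta(8, 4) density by \(1/(p(1-p))\), yielding a Beta(7, 3) distribution, whose summaries reproduce the incorrect output above:

## The un-adjusted model effectively samples p from Beta(7, 3).
7 / (7 + 3)                       ## mean 0.70, as in the table above
qbeta(c(0.025, 0.5, 0.975), 7, 3) ## matches the p quantiles above
digamma(7) - digamma(3)           ## E[logit(p)] = 0.95, close to theta's mean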

Reasoning

This notion, that the model {} block defines a joint density of the random variables and only the parameters defined in the parameters {} block, seems to be only implicit in the following.

“From a user’s perspective, the parameters in the program block are the parameters being sampled by Stan.” (Stan Reference Manual, 8.5 Program Block: parameters)

My understanding based on the above is that, from a user’s perspective, the (log) posterior that Stan samples from is the posterior of the parameters defined in the parameters {} block.

Fortunately, this was confirmed by @betanalpha. That is, the model {} block defines the log joint density \(f(\text{data}, \text{parameters})\) of the data as defined in the data {} block and the parameters as defined in the parameters {} block.

Conclusion

This topic has been discussed many times, but it took me a while to find a coherent explanation that ties together the mathematical statistics and the design of Stan under the hood.

Understanding that the model {} block specifies the log joint density of the random variables and the parameters defined in the parameters {} block is the key. The log priors that have to be included in the model {} block are the log priors for the parameters defined in the parameters {} block.

Including the log priors for the transformed parameters defined in the transformed parameters {} block alone is insufficient for this goal when the transformation is non-linear.

Additional inclusion of the log absolute Jacobian determinant of the inverse transformation ensures that we effectively have the log priors for the parameters defined in the parameters {} block.


print(sessionInfo())
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.4
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] rstan_2.18.2       StanHeaders_2.18.1 biostan_0.1.0      forcats_0.3.0      stringr_1.4.0     
##  [6] dplyr_0.8.0        purrr_0.3.0        readr_1.3.1        tidyr_0.8.2        tibble_2.0.1      
## [11] ggplot2_3.1.0      tidyverse_1.2.1    doRNG_1.7.1        rngtools_1.3.1     pkgmaker_0.27     
## [16] registry_0.5       doParallel_1.0.14  iterators_1.0.10   foreach_1.4.4      knitr_1.21        
## 
## loaded via a namespace (and not attached):
##  [1] minqa_1.2.4        colorspace_1.4-0   ggridges_0.5.1     rsconnect_0.8.13   markdown_0.9      
##  [6] base64enc_0.1-3    rstudioapi_0.9.0   DT_0.5             lubridate_1.7.4    xml2_1.2.0        
## [11] codetools_0.2-16   splines_3.5.2      shinythemes_1.1.2  bayesplot_1.6.0    jsonlite_1.6      
## [16] nloptr_1.2.1       broom_0.5.1        shiny_1.2.0        compiler_3.5.2     httr_1.4.0        
## [21] backports_1.1.3    assertthat_0.2.0   Matrix_1.2-15      lazyeval_0.2.1     cli_1.0.1         
## [26] later_0.8.0        htmltools_0.3.6    prettyunits_1.0.2  tools_3.5.2        igraph_1.2.4      
## [31] gtable_0.2.0       glue_1.3.0         reshape2_1.4.3     Rcpp_1.0.0         cellranger_1.1.0  
## [36] nlme_3.1-137       crosstalk_1.0.0    xfun_0.4           ps_1.3.0           lme4_1.1-20       
## [41] rvest_0.3.2        mime_0.6           miniUI_0.1.1.1     gtools_3.8.1       MASS_7.3-51.1     
## [46] zoo_1.8-4          scales_1.0.0       rstanarm_2.18.2    colourpicker_1.0   hms_0.4.2         
## [51] promises_1.0.1     inline_0.3.15      shinystan_2.5.0    yaml_2.2.0         gridExtra_2.3     
## [56] loo_2.0.0          stringi_1.3.1      dygraphs_1.1.1.6   pkgbuild_1.0.2     bibtex_0.4.2      
## [61] rlang_0.3.1        pkgconfig_2.0.2    matrixStats_0.54.0 evaluate_0.13      lattice_0.20-38   
## [66] rstantools_1.5.1   htmlwidgets_1.3    processx_3.2.1     tidyselect_0.2.5   plyr_1.8.4        
## [71] magrittr_1.5       R6_2.4.0           generics_0.0.2     pillar_1.3.1       haven_2.0.0       
## [76] withr_2.1.2        xts_0.11-2         survival_2.43-3    modelr_0.1.3       crayon_1.3.4      
## [81] rmarkdown_1.11     grid_3.5.2         readxl_1.2.0       callr_3.1.1        threejs_0.3.1     
## [86] digest_0.6.18      xtable_1.8-3       httpuv_1.4.5.1     stats4_3.5.2       munsell_0.5.0     
## [91] shinyjs_1.0
## Record execution time and multicore use
end_time <- Sys.time()
diff_time <- difftime(end_time, start_time, units = "auto")
cat("Started  ", as.character(start_time), "\n",
    "Finished ", as.character(end_time), "\n",
    "Time difference of ", diff_time, " ", attr(diff_time, "units"), "\n",
    "Used ", foreach::getDoParWorkers(), " cores\n",
    "Used ", foreach::getDoParName(), " as backend\n",
    sep = "")
## Started  2019-04-15 05:53:16
## Finished 2019-04-15 05:55:18
## Time difference of 2.040881 mins
## Used 12 cores
## Used doParallelMC as backend