Cmdstan - algebraic solver

Introduction

Say we want to sample from a target distribution. Inverse cdf sampling uses the fact that if \(x\) are random draws from any pdf then the cdf of \(x\) have a uniform distribution (see below). This means that if we can invert the cdf of the target distribution and then pass the inverse samples from a standard uniform distribution, we will generate random draws from our target distribution.

# random draws from exponential distribution
x <- rexp(1000)
# histogram of cdf of x should be (is) uniform
u <- pexp(x)
hist(u)

One way to inverse a function is to do it analytically. If \(F(x) = 1 - \exp(-\lambda x)\) then solving for \(x\) we get the inverse:

\[ \begin{equation} \begin{split} \exp(-\lambda x) = 1 - y \\ x = \frac{\log(1 - y)}{-\lambda} \\ F^{-1}(x) = \frac{\log(1 - x)}{-\lambda} \end{split} \tag{1} \end{equation} \]

Now, when I push draws from a standard uniform into the inverse, the values that pop out are draws from the target distribution:

u <- runif(10000)
lambda <- 1
x <- -(log(1 - u))/lambda
hist(x, prob = T, breaks = 15)
curve(dexp, add = T, col = "red")

But sometimes you won’t be able to solve for the inverse analytically. In these cases you can use a unit-root solver. These return the root of interest, which in this case is the value of \(x\) when the cdf of \(x\) minus \(u\) equals zero.

rootfn <- function(x, u){
  # Assume lambda is 1 again
  lambda <- 1
  cdf <- 1 - exp(-lambda * x)
  # I am using log because for numerical stability.
  return(log(cdf) - log(u))
}
y <- unlist(lapply(u, function(z){
  stats::uniroot(rootfn, 
                 # The interval over which you want to 
                 # generate the draws, but technically 
                 # it is the interval over which you want
                 # to find the root.
                 interval = c(0, 100),
                 # You draws from the standard uniform 
                 # for the inverse cdf method.
                 u = z,
                 check.conv = TRUE)$root
}))
hist(y, prob = T, breaks = 15)
curve(dexp, add = T, col = "red")

The interval specified deserve comment. This is your search space for the unit root, so you need to be careful to ensure that the root will exist within this interval. We are pretty safe here because it is very unlikely that an exponential distribution with rate equal to 1 will produce a value beyond 100. However, if I had set the upper bound to 2 then I would have received an error at some point saying that the root cannot be found because values beyond 2 will commonly be generated by this distribution. Note that usually this error is unhelpfully reported as values at end points not of opposite sign. So, say if you wanted to create censoring in the data, you might want to test for being outside of the limit first, as I did previously here.

You can also do root finding in a rather clunky way using stan, well, more specifically, cmdstan. This uses the algebraic solver (and you therefore need to supply fixed_param = T when running sample). The documentation is fairly crap in that it only gives one example (and it isn’t a very helpful one at that), but here is some stan code that I cobbled together to do something similar to the above.

functions{
  // y is the unknowns (the thing we want to solve for)
  // theta is the parameters, in this instance just lambda from the 
  // cdf for the exponential distribution
  // data (real) is going to be our draws from uniform dist
  // data (int) is just going to be a (mandatory) placeholder
  //vector system(vector y,vector theta,data array[] real x_r, array[] int x_i) {    
  vector system(vector y, vector theta, data array[] real x_r, array[] int x_i) {
    vector[1] z;
    real cdf;
    // this encodes the cdf from which we want to sample
    cdf = 1 - exp(-theta[1]*y[1]);
    z[1] = cdf - theta[2];
    return z;
  }
}
data{
  int N;
  vector[N] u;
  real lambda;
}
transformed data{
  array[0] real x_r;
  array[0] int x_i;
  vector[1] y_guess = [1.0]';
}
parameters{
}
transformed parameters{
  vector[N] y;
  vector[2] theta;

  for(i in 1:N){
    theta[1] = lambda;
    theta[2] = u[i];
    y[i] = algebra_solver_newton(system, y_guess, theta, x_r, x_i)[1];
  }

}
model{
}

The data correspond to the number of points to generate and the u are draws from a standard uniform. Note that in principle, I could have generated these in the transformed data block. Basially, I shoehorn what I need into the mandatory function signature and code up the cdf as I had in the unit root code above. Note that the implementation is pretty rough, because I haven’t yet quite figured out what is going on with this set of functionality. Anyway, it works, or at least seems to work:

if(basename(getwd()) == "rpubs"){
  fname <- ".//stan//solver1.stan"
} else if(basename(getwd()) == "R"){
  fname <- "..//stan//solver1.stan"
}

mod <- cmdstan_model(stan_file = fname)
# mod$print()

ld <- list(
  N = 10000, 
  u = as.array(runif(10000), dims = c(10000, 1)),
  lambda = 1)

f1 <- mod$sample(
  data = ld, 
  chains = 1, 
  iter_sampling = 1,
  fixed_param = TRUE
)

## Running MCMC with 1 chain...
## 
## Chain 1 Iteration: 1 / 1 [100%]  (Sampling) 
## Chain 1 finished in 0.0 seconds.

# f1$summary()
d1 <- data.table(f1$draws(format = "df"))
smpl <- as.numeric(t(d1[, 1:10000, with = F]))
hist(smpl, prob = T, breaks = 15)
curve(dexp, add = T, col = "red")

Now, of course, there isn’t really a need to use inverse cdf sampling via unit roots to generate draws from an exponential distribution, but hopefully this can be applied in the future for arbitrary distributions that do not have an analytic inverse. Ok, that’s enough for now.

Cmdstan - algebraic solver - unit-roots

2022-02-19

Setup

Introduction