An Introduction to RStan and the Stan Modelling Language

Paul Viefers
June 25, 2015

Where to look for more?

Faces of Stan

  • RStan
  • PyStan
  • CmdStan (Command-line or terminal)
  • MatlabStan
  • Stan.jl (Julia)

Stanislaw Ulam

Refresher: Bayesian inference

  • Have a vector of parameters \( \theta \) and data \( y \)
  • Goal: Develop a probability model \( p(y, \theta) \) to arrive at \( p(\theta \mid y) \).
  • Tool: Use Bayes' rule

\[ \begin{align} p(\theta \mid y) &= \frac{p(y,\theta)}{p(y)} = \frac{p(y \mid \theta) p(\theta)}{p(y)} \propto p(y \mid \theta) p(\theta) \end{align} \]

  • \( p(\theta \mid y) \) summarizes what we know given the data.
  • \( p(y \mid \theta) \) the sampling density (likelihood function).
  • \( p(\theta) \) prior distribution of the parameters.

Example: A simple binomial model

  • \( \theta \) an unknown sample proportion or probability of a binary event (see e.g. Markus' blog)

  • likelihood: \[ \begin{align} p(y \mid \theta) = \binom{n}{y} \theta^{y}(1- \theta)^{n-y} \end{align} \]

  • prior: \[ \begin{align} p(\theta) = \frac{\Gamma(a+b)}{\Gamma (a) \Gamma(b)}\theta^{a-1} (1-\theta)^{b-1} \end{align} \]

  • posterior: \[ \begin{align} p(\theta \mid y) \propto \theta^{y + a - 1}(1- \theta)^{n - y + b -1}, \end{align} \] i.e. a \( \text{Beta}(y + a,\, n - y + b) \) distribution (see the short R illustration below).
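Because the Beta prior is conjugate here, the posterior can be written down and plotted directly. A minimal R illustration with made-up numbers (y = 7 successes in n = 10 trials and a Beta(2, 2) prior; both are assumptions for the sketch):

# Conjugate updating: the posterior is Beta(y + a, n - y + b)
y <- 7; n <- 10      # hypothetical data: 7 successes in 10 trials
a <- 2; b <- 2       # hypothetical Beta(2, 2) prior
curve(dbeta(x, a, b), from = 0, to = 1, lty = 2,
      xlab = expression(theta), ylab = "density")               # prior
curve(dbeta(x, y + a, n - y + b), from = 0, to = 1, add = TRUE) # posterior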

More complex models usually do not have analytical posteriors

  • The Bayesian paradigm is very flexible in what counts as a 'parameter': missing observations, latent classes etc.
  • In a more serious application \( \theta \) may thus be \( d \)-dimensional, with \( d \) potentially large.
  • This makes \( p(\theta \mid y) \) very complicated and direct sampling of \( \theta \) impossible.
  • Leads to posterior simulation: grid approximation, rejection sampling, Markov chain simulation.
  • MCMC: Gibbs sampling, Metropolis(-Hastings) algorithm.

Monte Carlo & random-walk exploration

  • Monte Carlo: use a 'trick' to obtain samples \( \theta_1, \ldots, \theta_n \) from \( p(\theta \mid y) \) indirectly.
  • Then it's easy to approximate quantities of interest via \( \mathbb{E}[h(\theta)] \approx \frac{1}{n} \sum_{i=1}^n h(\theta_i) \), e.g. means, variances etc.
  • Example: the random-walk Metropolis-Hastings algorithm (sketched in R after this list).
    • Idea: Start at \( \theta_1 \) and take a random jump to \( \theta_{a} \sim \text{Normal}(\theta_1, \Sigma) \).
    • Accept-reject: admit new location to list of sampled values with probability

\[ \begin{align} \min \left\{1, p(\theta_a \mid y)/p(\theta_1 \mid y) \right\} \end{align} \]

  • But: random-walk behavior inefficient, particularly for large \( d \).
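A minimal sketch of such a random-walk Metropolis sampler in R; the function and argument names (rw_metropolis, log_post, sd_prop) are illustrative, and the target reuses the Beta posterior from the binomial example above:

# Random-walk Metropolis for a scalar parameter and a generic
# (unnormalized) log-posterior function
rw_metropolis <- function(log_post, theta0, n_iter = 5000, sd_prop = 0.2) {
  theta <- theta0
  draws <- numeric(n_iter)
  for (i in 1:n_iter) {
    theta_a <- rnorm(1, mean = theta, sd = sd_prop)  # random jump
    # accept with probability min{1, p(theta_a | y) / p(theta | y)}
    if (log(runif(1)) < log_post(theta_a) - log_post(theta)) theta <- theta_a
    draws[i] <- theta
  }
  draws
}

# Target: log of the (unnormalized) Beta(y + a, n - y + b) posterior
# with the made-up values y = 7, n = 10, a = b = 2
log_post <- function(theta) {
  if (theta <= 0 || theta >= 1) return(-Inf)
  (7 + 2 - 1) * log(theta) + (10 - 7 + 2 - 1) * log(1 - theta)
}
draws <- rw_metropolis(log_post, theta0 = 0.5)
mean(draws)  # Monte Carlo estimate of E[theta | y]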


HMC: the posterior paired with 'standard' mechanics

  • Hamiltonian Monte Carlo (HMC) is more subtle in proposing jumps.
  • HMC introduces more structure by translating the sampling problem into the framework of Hamiltonian mechanics.
  • Intuition: a marble on a surface
    • Imagine the (negative log) posterior as a surface in the 'real world': the marble's height on the surface gives it potential energy, and a random 'kick' gives it kinetic energy (momentum).
    • We may then use 'standard' mechanics to describe how the marble rolls on the posterior surface after the kick (a bare-bones code sketch follows this list).
  • Caveat: HMC requires a lot of user input or tuning of parameters.
  • Stan takes care of a host of problems under the hood (e.g. it computes the gradient of the log-posterior using automatic differentiation; more here).
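For intuition only, here is a bare-bones sketch of a single HMC transition using a leapfrog integrator. The step size eps and number of leapfrog steps L are exactly the kind of tuning parameters mentioned above; Stan's actual sampler (NUTS) chooses these adaptively and is far more sophisticated than this toy version.

# One HMC transition for a generic log-posterior and its gradient
hmc_step <- function(theta, log_post, grad_log_post, eps = 0.1, L = 20) {
  p <- rnorm(length(theta))                         # momentum: the random 'kick'
  theta_new <- theta
  p_new <- p + 0.5 * eps * grad_log_post(theta_new)         # half step (momentum)
  for (l in 1:L) {
    theta_new <- theta_new + eps * p_new                     # full step (position)
    if (l < L) p_new <- p_new + eps * grad_log_post(theta_new)
  }
  p_new <- p_new + 0.5 * eps * grad_log_post(theta_new)      # final half step
  # Metropolis correction on the total energy (potential + kinetic)
  h_old <- -log_post(theta)     + 0.5 * sum(p^2)
  h_new <- -log_post(theta_new) + 0.5 * sum(p_new^2)
  if (log(runif(1)) < h_old - h_new) theta_new else theta
}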


Enter Stan: The Stan project

R & Stan = RStan

The skeletal Stan program

functions {
    #  OPTIONAL: user-defined functions
}
data {
    # OPTIONAL: declare the data passed in from R ...
}
transformed data {
    # OPTIONAL: Create new variables/auxiliary variables
}
parameters {
    # Declare parameters that will be estimated
}
transformed parameters {
    # OPTIONAL: Declare auxiliary parameters (e.g. to reparameterize, Stan Manual ch. 20)
}
model {
    # Declare your probability model: priors, hyperpriors & likelihood
}
generated quantities {
    # OPTIONAL: Declare any quantities other than simulated parameters to be generated
}

Data block

  • Regression model: \( y = X \beta + u \)
data {
    int<lower=0> N;
    int<lower=0> M;
    matrix[N, M] x;
    vector[N] y;
}

Types

int<lower=a, upper=b> x;         // a, b may be previously declared variables
real<lower=a, upper=b> x;

vector[N] x;
row_vector[N] x;
matrix[N, K] X;

simplex[h] x;                   // sum(x) = 1
ordered[j] x;                   // x[1] < ... < x[j]

cov_matrix[N] Sigma;
corr_matrix[N] Rho;

Transformed data block

transformed data {
    vector[N] cons;
    matrix[N, M+1] X;

    cons <- rep_vector(1, N);
    X <- append_col(cons, x);
}

Parameters block

parameters {
    vector[M+1] beta;
    real<lower=0> sigma;
}

Model block

model {
    y ~ normal(X * beta, sigma);
}

Complete model: Regression.stan

data {
    int<lower=0> N;
    int<lower=0> M;
    matrix[N, M] x;
    vector[N] y;
}
transformed data {
    vector[N] cons;
    matrix[N, M+1] X;

    cons <- rep_vector(1, N);
    X <- append_col(cons, x);
}
parameters {
    vector[M+1] beta;
    real<lower=0> sigma;
}
model {
    y ~ normal(X * beta, sigma);
}

Run Stan from R

# Load Stan from CRAN
install.packages('rstan')
library(rstan)

# In case you want to boost performance through parallelization...
source("http://mc-stan.org/rstan/stan.R")

# Read in data 
...

# ... and convert it to a named list (careful with types here)
stan_data <- list(N=N, M=M, y=y, x=x)

# Assume you have Regression.stan in the current WD
fitted.model <- stan(file = 'Regression.stan', model_name = 'Multi_Reg',
                     data = stan_data, iter = 5000, chains = 2)
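Once the call returns, the fit can be inspected with standard rstan functions, for example:

# Posterior summaries, R-hat and effective sample sizes
print(fitted.model, pars = c('beta', 'sigma'))

# Trace plots to check mixing of the two chains
traceplot(fitted.model, pars = 'beta')

# Extract the posterior draws for further computations
post <- extract(fitted.model)
colMeans(post$beta)   # posterior means of the regression coefficients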

Example: Hierarchical Logistic Regression

Example: Hierarchical Logistic Regression

  • 1,000 participants in an online experiment (939 of whom enter the final dataset).
  • Each making 25 choices between two options (get money today, or more later).
  • Authors use logistic regression on pooled data to model binary choice.
  • Stan makes it easy to turn this into a hierarchical model.
  • Outcome: binary choice \( y_{i,j} \) of individual \( i \) on trials \( j=1, \ldots, n_i \).
  • Model: \[ \begin{align} y_{i,j} \mid \theta_{i,j} &\sim \text{Bernoulli}(\theta_{i,j}) \\ \text{logit}(\theta_{i,j}) &= x_{i,j}'\beta_i \\ \beta_i &\sim \text{Normal}(\mu, \tau) \end{align} \]
  • Five explanatory variables in \( x_i \).
  • Hierarchical model has \( 939 \times 5 = 4695 \) different \( \beta \)'s.

Example: Hierarchical Logistic Regression

data {
    int<lower=1> K;     // No. of predictors incl. constant
    int<lower=1> G;     // No. of individuals or units.
    int<lower=0> N;     // Total no. of observations
    int<lower=1, upper = G> id[N];  // mapping obs. to inds.
    int<lower=0, upper = 1> y[N];   // The LHS outcome variable
    matrix[N, K] x;
}
parameters {
    real mu[K];                 // Means of top-level dist.
    real<lower = 0> sigmamu[K]; // Top-level scales (std. deviations)
    matrix[G, K] beta;          // Regression parameters incl. intercept
}
model {
    vector[N] x_beta;
    mu ~ normal(0, 3);
    sigmamu ~ chi_square(2);
    for(j in 1:K)
        for(i in 1:G)
            beta[i, j] ~ normal(mu[j], sigmamu[j]);
    for(i in 1:N)
        x_beta[i] <- dot_product(x[i], beta[id[i]]);

    y ~ bernoulli_logit(x_beta);
}
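
To fit this from R, the data list has to mirror the data block above. A sketch, assuming the choices sit in a long-format data frame df with one row per choice, a participant identifier subject, the binary outcome choice and four predictors p1, ..., p4 (all of these names are illustrative, as is the file name HierLogit.stan):

# Design matrix with a constant column, K = 5 predictors incl. the intercept
x <- model.matrix(~ p1 + p2 + p3 + p4, data = df)

stan_data <- list(
    K  = ncol(x),                         # no. of predictors incl. constant
    G  = length(unique(df$subject)),      # no. of individuals
    N  = nrow(df),                        # total no. of observations
    id = as.integer(factor(df$subject)),  # maps each obs. to an individual 1..G
    y  = as.integer(df$choice),           # binary outcome coded 0/1
    x  = x
)

hier.fit <- stan(file = 'HierLogit.stan', model_name = 'Hier_Logit',
                 data = stan_data, iter = 2000, chains = 4)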