An Introduction to RStan and the Stan Modelling Language

Paul Viefers
June 25, 2015

Where to look for more?

Faces of Stan

  • RStan
  • PyStan
  • CmdStan (Command-line or terminal)
  • MatlabStan
  • Stan.jl (Julia)

Stanislaw Ulam

Refresher: Bayesian inference

  • Have a vector of parameters \( \theta \) and data \( y \)
  • Goal: Develop a probability model \( p(y, \theta) \) to arrive at \( p(\theta \mid y) \).
  • Tool: Use Bayes' rule

\[ \begin{align} p(\theta \mid y) &= \frac{p(y,\theta)}{p(y)} = \frac{p(y \mid \theta) p(\theta)}{p(y)} \propto p(y \mid \theta) p(\theta) \end{align} \]

  • \( p(\theta \mid y) \) summarizes what we know given the data.
  • \( p(y \mid \theta) \) the sampling density (likelihood function).
  • \( p(\theta) \) prior distribution of the parameters.

Example: A simple binomial model

  • \( \theta \) an unknown sample proportion or probability of a binary event (see e.g. Markus' blog)

  • likelihood: \[ \begin{align} p(y \mid \theta) = \binom{n}{y} \theta^{y}(1- \theta)^{n-y} \end{align} \]

  • prior: \[ \begin{align} p(\theta) = \frac{\Gamma(a+b)}{\Gamma (a) \Gamma(b)}\theta^{a-1} (1-\theta)^{b-1} \end{align} \]

  • posterior: \[ \begin{align} p(\theta \mid y) \propto \theta^{y + a - 1}(1- \theta)^{n - y + b -1}, \end{align} \] i.e. a \( \text{Beta}(y + a,\, n - y + b) \) distribution (see the short R illustration below).
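Because the Beta prior is conjugate here, the posterior can be written down and plotted directly. A minimal R illustration with made-up numbers (y = 7 successes in n = 10 trials and a Beta(2, 2) prior; both are assumptions for the sketch):

# Conjugate updating: the posterior is Beta(y + a, n - y + b)
y <- 7; n <- 10      # hypothetical data: 7 successes in 10 trials
a <- 2; b <- 2       # hypothetical Beta(2, 2) prior
curve(dbeta(x, a, b), from = 0, to = 1, lty = 2,
      xlab = expression(theta), ylab = "density")               # prior
curve(dbeta(x, y + a, n - y + b), from = 0, to = 1, add = TRUE) # posterior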

More complex models usually do not have analytical posteriors

  • The Bayesian paradigm is very flexible in what counts as a 'parameter': missing observations, latent classes etc.
  • In a more serious application \( \theta \) may thus be \( d \)-dimensional, with \( d \) potentially large.
  • This makes \( p(\theta \mid y) \) very complicated and direct sampling of \( \theta \) impossible.
  • Leads to posterior simulation: grid approximation, rejection sampling, Markov chain simulation.
  • MCMC: Gibbs sampling, Metropolis(-Hastings) algorithm.

Monte Carlo & random-walk exploration

  • Monte Carlo: use a 'trick' to obtain samples \( \theta_1, \ldots, \theta_n \) from \( p(\theta \mid y) \) indirectly.
  • Then it's easy to approximate quantities of interest via \( \mathbb{E}[h(\theta)] \approx \frac{1}{n} \sum_{i=1}^n h(\theta_i) \), e.g. means, variances etc.
  • Example: the random-walk Metropolis-Hastings algorithm (sketched in R after this list).
    • Idea: Start at \( \theta_1 \) and take a random jump to \( \theta_{a} \sim \text{Normal}(\theta_1, \Sigma) \).
    • Accept-reject: admit new location to list of sampled values with probability

\[ \begin{align} \min \left\{1, p(\theta_a \mid y)/p(\theta_1 \mid y) \right\} \end{align} \]

  • But: random-walk behavior inefficient, particularly for large \( d \).
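A minimal sketch of such a random-walk Metropolis sampler in R; the function and argument names (rw_metropolis, log_post, sd_prop) are illustrative, and the target reuses the Beta posterior from the binomial example above:

# Random-walk Metropolis for a scalar parameter and a generic
# (unnormalized) log-posterior function
rw_metropolis <- function(log_post, theta0, n_iter = 5000, sd_prop = 0.2) {
  theta <- theta0
  draws <- numeric(n_iter)
  for (i in 1:n_iter) {
    theta_a <- rnorm(1, mean = theta, sd = sd_prop)  # random jump
    # accept with probability min{1, p(theta_a | y) / p(theta | y)}
    if (log(runif(1)) < log_post(theta_a) - log_post(theta)) theta <- theta_a
    draws[i] <- theta
  }
  draws
}

# Target: log of the (unnormalized) Beta(y + a, n - y + b) posterior
# with the made-up values y = 7, n = 10, a = b = 2
log_post <- function(theta) {
  if (theta <= 0 || theta >= 1) return(-Inf)
  (7 + 2 - 1) * log(theta) + (10 - 7 + 2 - 1) * log(1 - theta)
}
draws <- rw_metropolis(log_post, theta0 = 0.5)
mean(draws)  # Monte Carlo estimate of E[theta | y]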


HMC: the posterior paired with 'standard' mechanics

  • Hamiltonian Monte Carlo (HMC) is more subtle in proposing jumps.
  • HMC introduces more structure by translating the sampling problem into the framework of Hamiltonian mechanics.
  • Intuition: a marble on a surface
    • Imagine the (negative log) posterior as a surface in the 'real world': the marble's height on the surface gives it potential energy, and a random 'kick' gives it kinetic energy (momentum).
    • We may then use 'standard' mechanics to describe how the marble rolls on the posterior surface after the kick (a bare-bones code sketch follows this list).
  • Caveat: HMC requires a lot of user input or tuning of parameters.
  • Stan takes care of a host of problems under the hood (e.g. it computes the gradient of the log-posterior using automatic differentiation; more here).
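For intuition only, here is a bare-bones sketch of a single HMC transition using a leapfrog integrator. The step size eps and number of leapfrog steps L are exactly the kind of tuning parameters mentioned above; Stan's actual sampler (NUTS) chooses these adaptively and is far more sophisticated than this toy version.

# One HMC transition for a generic log-posterior and its gradient
hmc_step <- function(theta, log_post, grad_log_post, eps = 0.1, L = 20) {
  p <- rnorm(length(theta))                         # momentum: the random 'kick'
  theta_new <- theta
  p_new <- p + 0.5 * eps * grad_log_post(theta_new)         # half step (momentum)
  for (l in 1:L) {
    theta_new <- theta_new + eps * p_new                     # full step (position)
    if (l < L) p_new <- p_new + eps * grad_log_post(theta_new)
  }
  p_new <- p_new + 0.5 * eps * grad_log_post(theta_new)      # final half step
  # Metropolis correction on the total energy (potential + kinetic)
  h_old <- -log_post(theta)     + 0.5 * sum(p^2)
  h_new <- -log_post(theta_new) + 0.5 * sum(p_new^2)
  if (log(runif(1)) < h_old - h_new) theta_new else theta
}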


Enter Stan: The Stan project

R & Stan = RStan

The skeletal Stan program

functions {
    #  OPTIONAL: user-defined functions
}
data {
    # OPTIONAL: declare the data passed in from R ...
}
transformed data {
    # OPTIONAL: Create new variables/auxiliary variables
}
parameters {
    # Declare parameters that will be estimated
}
transformed parameters {
    # OPTIONAL: Declare auxiliary parameters (e.g. to reparameterize, Stan Manual ch. 20)
}
model {
    # Declare your probability model: priors, hyperpriors & likelihood
}
generated quantities {
    # OPTIONAL: Declare any quantities other than simulated parameters to be generated
}

Data block

  • Regression model: \( y = X \beta + u \)
data {
    int<lower=0> N;
    int<lower=0> M;
    matrix[N, M] x;
    vector[N] y;
}

Types

int<lower=a, upper=b> x;         // a, b may be previously declared variables
real<lower=a, upper=b> x;

vector[N] x;
row_vector[N] x;
matrix[N, K] X;

simplex[h] x;                   // sum(x) = 1
ordered[j] x;                   // x[1] < ... < x[j]

cov_matrix[N] Sigma;
corr_matrix[N] Rho;

Transformed data block

transformed data {
    vector[N] cons;
    matrix[N, M+1] X;

    cons <- rep_vector(1, N);
    X <- append_col(cons, x);
}

Parameters block

parameters {
    vector[M+1] beta;
    real<lower=0> sigma;
}

Model block

model {
    y ~ normal(X * beta, sigma);
}

Complete model: Regression.stan

data {
    int<lower=0> N;
    int<lower=0> M;
    matrix[N, M] x;
    vector[N] y;
}
transformed data {
    vector[N] cons;
    matrix[N, M+1] X;

    cons <- rep_vector(1, N);
    X <- append_col(cons, x);
}
parameters {
    vector[M+1] beta;
    real<lower=0> sigma;
}
model {
    y ~ normal(X * beta, sigma);
}

Run Stan from R

# Load Stan from CRAN
install.packages('rstan')
library(rstan)

# In case you want to boost performance through parallelization...
source("http://mc-stan.org/rstan/stan.R")

# Read in data 
...

# ... and convert it to a named list (careful with types here)
stan_data <- list(N=N, M=M, y=y, x=x)

# Assume you have Regression.stan in the current WD
fitted.model <- stan(file = 'Regression.stan', model_name = 'Multi_Reg',
                     data = stan_data, iter = 5000, chains = 2)
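Once the call returns, the fit can be inspected with standard rstan functions, for example:

# Posterior summaries, R-hat and effective sample sizes
print(fitted.model, pars = c('beta', 'sigma'))

# Trace plots to check mixing of the two chains
traceplot(fitted.model, pars = 'beta')

# Extract the posterior draws for further computations
post <- extract(fitted.model)
colMeans(post$beta)   # posterior means of the regression coefficients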

Example: Hierarchical Logistic Regression

Example: Hierarchical Logistic Regression

  • 1,000 participants in an online experiment (939 of whom enter the final dataset).
  • Each making 25 choices between two options (get money today, or more later).
  • Authors use logistic regression on pooled data to model binary choice.
  • Stan makes it easy to turn this into a hierarchical model.
  • Outcome: binary choice \( y_{i,j} \) of individual \( i \) on trials \( j=1, \ldots, n_i \).
  • Model: \[ \begin{align} y_{i,j} \mid \theta_{i,j} &\sim \text{Bernoulli}(\theta_{i,j}) \\ \text{logit}(\theta_{i,j}) &= x_{i,j}'\beta_i \\ \beta_i &\sim \text{Normal}(\mu, \tau) \end{align} \]
  • Five explanatory variables in \( x_i \).
  • Hierarchical model has \( 939 \times 5 = 4695 \) different \( \beta \)'s.

Example: Hierarchical Logistic Regression

data {
    int<lower=1> K;     // No. of predictors incl. constant
    int<lower=1> G;     // No. of individuals or units.
    int<lower=0> N;     // Total no. of observations
    int<lower=1, upper = G> id[N];  // mapping obs. to inds.
    int<lower=0, upper = 1> y[N];   // The LHS outcome variable
    matrix[N, K] x;
}
parameters {
    real mu[K];                 // Means of top-level dist.
    real<lower = 0> sigmamu[K]; // Top-level scales (std. deviations)
    matrix[G, K] beta;          // Regression parameters incl. intercept
}
model {
    vector[N] x_beta;
    mu ~ normal(0, 3);
    sigmamu ~ chi_square(2);
    for(j in 1:K)
        for(i in 1:G)
            beta[i, j] ~ normal(mu[j], sigmamu[j]);
    for(i in 1:N)
        x_beta[i] <- dot_product(x[i], beta[id[i]]);

    y ~ bernoulli_logit(x_beta);
}
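
To fit this from R, the data list has to mirror the data block above. A sketch, assuming the choices sit in a long-format data frame df with one row per choice, a participant identifier subject, the binary outcome choice and four predictors p1, ..., p4 (all of these names are illustrative, as is the file name HierLogit.stan):

# Design matrix with a constant column, K = 5 predictors incl. the intercept
x <- model.matrix(~ p1 + p2 + p3 + p4, data = df)

stan_data <- list(
    K  = ncol(x),                         # no. of predictors incl. constant
    G  = length(unique(df$subject)),      # no. of individuals
    N  = nrow(df),                        # total no. of observations
    id = as.integer(factor(df$subject)),  # maps each obs. to an individual 1..G
    y  = as.integer(df$choice),           # binary outcome coded 0/1
    x  = x
)

hier.fit <- stan(file = 'HierLogit.stan', model_name = 'Hier_Logit',
                 data = stan_data, iter = 2000, chains = 4)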