# An Introduction to RStan and the Stan Modelling Language

Paul Viefers
June 25, 2015

### Faces of Stan

• RStan
• PyStan
• CmdStan (Command-line or terminal)
• MatlabStan
• Stan.jl (Julia)

### Refresher: Bayesian inference

• Have a vector of parameters $$\theta$$ and data $$y$$
• Goal: Develop a probability model $$p(y, \theta)$$ to arrive at $$p(\theta \mid y)$$.
• Tool: Use Bayes' rule

\begin{align} p(\theta \mid y) &= \frac{p(y,\theta)}{p(y)} = \frac{p(y \mid \theta) p(\theta)}{p(y)} \propto p(y \mid \theta) p(\theta) \end{align}

• $$p(\theta \mid y)$$ summarizes what we know given the data.
• $$p(y \mid \theta)$$ the sampling density (likelihood function).
• $$p(\theta)$$ prior distribution of the parameters.

### Example: A simple binomial model

• $$\theta$$ an unknown sample proportion or probability of a binary event (see e.g. Markus' blog)

• likelihood: \begin{align} p(y \mid \theta) = \binom{n}{y} \theta^{y}(1- \theta)^{n-y} \end{align}

• prior: \begin{align} p(\theta) = \frac{\Gamma(a+b)}{\Gamma (a) \Gamma(b)}\theta^{a-1} (1-\theta)^{b-1} \end{align}

• posterior: \begin{align} p(\theta \mid y) \propto \theta^{y + a - 1}(1- \theta)^{n - y + b -1} \end{align}
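• Because the Beta prior is conjugate to the binomial likelihood, the posterior is again a Beta distribution, $$\text{Beta}(y + a, n - y + b)$$, so it can be evaluated directly. A minimal R sketch, with made-up numbers ($$y = 14$$ successes in $$n = 20$$ trials, a $$\text{Beta}(2, 2)$$ prior):

# Conjugate posterior: Beta(y + a, n - y + b)
a <- 2; b <- 2                       # prior hyperparameters (illustrative)
n <- 20; y <- 14                     # made-up data
theta <- seq(0, 1, length.out = 500)
plot(theta, dbeta(theta, y + a, n - y + b), type = "l",
     ylab = "posterior density")
qbeta(c(0.025, 0.975), y + a, n - y + b)   # 95% posterior interval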

### More complex models usually do not have analytical posteriors

• The Bayesian paradigm is very flexible about what counts as a 'parameter': missing observations, latent classes, etc.
• In a more serious application, $$\theta$$ may thus be of (very) high dimension $$d$$.
• This makes $$p(\theta \mid y)$$ very complicated and sampling $$\theta$$ from it directly impossible.
• This leads to posterior simulation: grid approximation, rejection sampling, Markov chain simulation (a grid-approximation sketch follows below).
• MCMC: Gibbs sampling, Metropolis(-Hastings) algorithm.
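• As a concrete illustration of the simplest of these, a hedged R sketch of grid approximation, reusing the binomial example from above:

theta_grid <- seq(0.001, 0.999, length.out = 1000)
post <- dbinom(14, 20, theta_grid) * dbeta(theta_grid, 2, 2)   # unnormalized posterior
post <- post / sum(post)                                       # normalize over the grid
draws <- sample(theta_grid, 5000, replace = TRUE, prob = post) # approximate draws
mean(draws)                                                    # approximate posterior mean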

### Monte Carlo & random-walk exploration

• Monte Carlo: use a 'trick' to obtain samples $$\theta_1, \ldots, \theta_n$$ from $$p(\theta \mid y)$$ indirectly.
• Then it is easy to compute quantities of interest via $$\mathbb{E}[h(\theta)] \approx \frac{1}{n} \sum_{i=1}^n h(\theta_i)$$, e.g. means, variances etc.
• Example: the random-walk Metropolis(-Hastings) algorithm (sketched in R after this list).
• Idea: Start at $$\theta_1$$ and take a random jump to $$\theta_{a} \sim \text{Normal}(\theta_1, \Sigma)$$.
• Accept-reject: admit the new location to the list of sampled values with probability

\begin{align} \min \left\{1, p(\theta_a \mid y)/p(\theta_1 \mid y) \right\} \end{align}

• But: the random-walk behavior is inefficient, particularly for large $$d$$.
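• A minimal R sketch of the random-walk Metropolis algorithm for the binomial posterior from above (data, prior, and proposal scale are illustrative choices):

log_post <- function(theta, y = 14, n = 20, a = 2, b = 2) {
  if (theta <= 0 || theta >= 1) return(-Inf)    # zero density outside (0, 1)
  dbinom(y, n, theta, log = TRUE) + dbeta(theta, a, b, log = TRUE)
}

n_iter <- 5000
draws <- numeric(n_iter)
draws[1] <- 0.5                             # starting value theta_1
for (i in 2:n_iter) {
  proposal <- rnorm(1, draws[i - 1], 0.2)   # random jump around the current value
  log_ratio <- log_post(proposal) - log_post(draws[i - 1])
  # accept with probability min{1, p(theta_a | y) / p(theta_1 | y)}
  draws[i] <- if (log(runif(1)) < log_ratio) proposal else draws[i - 1]
}
mean(draws)                                 # Monte Carlo estimate of E[theta | y]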

### HMC: the posterior paired with 'standard' mechanics

• Hamiltonian Monte Carlo (HMC) is more subtle about proposing jumps.
• HMC introduces more structure by translating the problem into the framework of Hamiltonian mechanics.
• Intuition: a marble rolling on a surface.
• Imagining the posterior as a surface in the 'real world' adds height (potential energy) and motion from a 'kick' (kinetic energy).
• We may then use 'standard' mechanics to describe how the marble rolls over the posterior when we 'kick' it.
• Caveat: HMC requires a lot of user input, i.e. tuning parameters such as the step size and the number of leapfrog steps.
• Stan takes care of a host of problems under the hood (e.g. it obtains the gradient of the log-posterior via automatic differentiation and tunes itself via the No-U-Turn Sampler).
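• To make the mechanics concrete, one step of the 'leapfrog' integrator that HMC uses to simulate the marble's path looks like this in R (the function name `grad_U` is illustrative, not Stan's API; the potential energy $$U(\theta)$$ is the negative log-posterior):

# One leapfrog step: eps is the step size, theta the position (parameters),
# p the momentum given by the 'kick'.
leapfrog <- function(theta, p, eps, grad_U) {
  p <- p - eps / 2 * grad_U(theta)   # half step for the momentum
  theta <- theta + eps * p           # full step for the position
  p <- p - eps / 2 * grad_U(theta)   # half step for the momentum
  list(theta = theta, p = p)
}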

### The skeletal Stan program

functions {
#  OPTIONAL: user-defined functions
}
data {
# OPTIONAL: read in data ...
}
transformed data {
# OPTIONAL: Create new variables/auxiliary variables
}
parameters {
# Declare parameters that will be estimated
}
transformed parameters {
# OPTIONAL: Declare auxiliary parameters (e.g. to reparameterize, Stan Manual ch. 20)
}
model {
# Declare your probability model: priors, hyperpriors & likelihood
}
generated quantities {
# OPTIONAL: Declare any quantities other than simulated parameters to be generated
}


### Data block

• Regression model: $$y = X \beta + u$$

data {
int<lower=0> N;
int<lower=0> M;
matrix[N, M] x;
vector[N] y;
}


### Types

int<lower=a, upper=b> x;         // a, b may be a parameter
real<lower=a, upper=b> x;

vector[N] x;
row_vector[N] x;
matrix[N, K] X;

simplex[h] x;                   // x[i] >= 0, sum(x) = 1
ordered[j] x;                   // x[1] < ... < x[j]

cov_matrix[N] Sigma;
corr_matrix[N] Rho;


### Transformed data block

transformed data {
vector[N] cons;
matrix[N, M+1] X;

cons <- rep_vector(1, N);
X <- append_col(cons, x);
}


### Parameters block

parameters {
vector[M+1] beta;
real<lower=0> sigma;
}


### Model block

model {
y ~ normal(X * beta, sigma);
}
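• With no sampling statements for `beta` and `sigma`, Stan implicitly places flat priors on them over their declared support. A variant with explicit, weakly informative priors (an illustrative choice, not part of the original model) would read:

model {
beta ~ normal(0, 10);        // weakly informative prior (illustrative)
sigma ~ cauchy(0, 5);        // half-Cauchy, via the lower=0 constraint
y ~ normal(X * beta, sigma);
}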


### Complete model: Regression.stan

data {
int<lower=0> N;
int<lower=0> M;
matrix[N, M] x;
vector[N] y;
}
transformed data {
vector[N] cons;
matrix[N, M+1] X;

cons <- rep_vector(1, N);
X <- append_col(cons, x);
}
parameters {
vector[M+1] beta;
real<lower=0> sigma;
}
model {
y ~ normal(X * beta, sigma);
}


### Run Stan from R

# Install and load RStan from CRAN
install.packages('rstan')
library(rstan)

# In case you want to boost performance through parallelization...
source("http://mc-stan.org/rstan/stan.R")

...

# ... and convert it to a named list (careful with types here)
stan_data <- list(N=N, M=M, y=y, x=x)

# Assume you have Regression.stan in the current WD
fitted.model <- stan(file = 'Regression.stan', model_name='Multi_Reg', data=stan_data, iter=5000, chains=2)
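• Once sampling has finished, a few standard rstan calls to inspect the fit:

# Summary table: posterior means, quantiles, effective sample size, R-hat
print(fitted.model, pars = c('beta', 'sigma'))

# Extract the posterior draws as a named list of arrays
draws <- extract(fitted.model)
hist(draws$sigma)

# Trace plots to eyeball mixing of the two chains
traceplot(fitted.model, pars = 'sigma')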


### Example: Hierarchical Logistic Regression

• 1,000 participants in an online experiment (939 of which enter final dataset).
• Each making 25 choices between two options (get money today, or more later).
• Authors use logistic regression on pooled data to model binary choice.
• Stan makes it easy to turn this into a hierarchical model.
• Outcome: choice $$y_{i,j}$$ of individual $$i$$ on trials $$j=1, \ldots, n_i$$.
• Model: \begin{align} y_{i,j} \mid \theta_i &\sim \text{Bernoulli}(\theta_i) \\ \text{logit}(\theta_i) &= x_i'\beta_i \\ \beta_i &\sim \text{Normal}(\mu, \tau) \end{align}
• Five explanatory variables in $$x_i$$.
• Hierarchical model has $$939 \times 5 = 4695$$ different $$\beta$$'s.

### Example: Hierarchical Logistic Regression

data {
int<lower=1> K;     // No. of predictors incl. constant
int<lower=1> G;     // No. of individuals or units.
int<lower=0> N;     // Total no. of observations
int<lower=1, upper = G> id[N];  // mapping obs. to inds.
int<lower=0, upper = 1> y[N];   // The LHS outcome variable
matrix[N, K] x;
}
parameters {
real mu[K];                 // Means of top-level dist.
real<lower = 0> sigmamu[K]; // Top-level variances
matrix[G, K] beta;          // Regression parameters incl. intercept
}
model {
vector[N] x_beta;

mu ~ normal(0, 3);              // prior on top-level means
sigmamu ~ chi_square(2);        // prior on top-level scales
for (j in 1:K)
  for (i in 1:G)
    beta[i, j] ~ normal(mu[j], sigmamu[j]);      // partial pooling across individuals
for (i in 1:N)
  x_beta[i] <- dot_product(x[i], beta[id[i]]);   // linear predictor per observation

y ~ bernoulli_logit(x_beta);
}
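• A hedged sketch of the corresponding R call, assuming the model is saved as `HierLogit.stan` and the choices sit in a data frame `df` with one row per observation (all column names here are illustrative):

# Build the data list; id must map each observation to an integer in 1:G
x <- cbind(1, as.matrix(df[, c('v1', 'v2', 'v3', 'v4', 'v5')]))   # constant + 5 predictors
stan_data <- list(K = ncol(x),
                  G = length(unique(df$subject_id)),
                  N = nrow(df),
                  id = as.integer(factor(df$subject_id)),
                  y = df$choice,
                  x = x)

hier_fit <- stan(file = 'HierLogit.stan', data = stan_data,
                 iter = 2000, chains = 4)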