15 Mar 2018

What is Stan

  • Stan is a Turing-complete probabilistic programming language that is used mostly for Bayesian inference
  • Stan uses a two-stage compiler: you write your model in the Stan language, which is translated into C++ and then compiled to native code for your platform
  • Stan includes a matrix and math library that supports auto-differentiation
  • Stan includes interfaces to R and other languages
  • There are post-inference plotting libraries that can be used with Stan, such as bayesplot
  • Stan has thousands of users, and the community keeps growing. Stan is used in clinical drug trials, genomics and cancer biology, population dynamics, psycholinguistics, social networks, finance and econometrics, professional sports, publishing, recommender systems, educational testing, climate models, and many more fields
  • There are companies building commercial products on top of Stan

What is Known

  • As a matter of notation we call:
    • Observed data \(\boldsymbol{y}\)
    • Unobserved but observable data \(\boldsymbol{\widetilde{y}}\)
    • Unobserved and unobservable parameters \(\boldsymbol{\theta}\)
    • Covariates \(\boldsymbol{x}\)
    • Probability distribution (density) \(\boldsymbol{p(\cdot)}\)
  • Estimation is the process of figuring out the unknowns, i.e. unobserved quantities
  • In classical machine learning and frequentist inference (including prediction), the problem is framed in terms of the most likely value of \(\boldsymbol{\theta}\)
  • Bayesians are extremely greedy people: they are not satisfied with the maximum, they want the whole distribution

Why Bayes

What is Bayes

\[ p(\theta\mid y, X) = \frac{p(y \mid X, \theta) * p(\theta)}{p(y \mid X)} = \frac{p(y \mid X, \theta) * p(\theta)}{\int p(y \mid X, \theta) * p(\theta) \space d\theta} \propto p(y \mid X, \theta) * p(\theta) \]

  • Bayesian inference is an approach to figuring out the updated distribution of \(\boldsymbol{\theta}\), the posterior \(\boldsymbol{p(\theta \mid y, X)}\), after observing \(\boldsymbol{y}\) and \(\boldsymbol{X}\)
  • When \(\boldsymbol{p(y \mid X, \theta)}\) is viewed as a function of \(\boldsymbol{\theta}\) with the observed \(\boldsymbol{y}\) held fixed, it is called the likelihood function - this is our data generating process
  • An MCMC algorithm draws from an implied probability distribution \(\boldsymbol{p(\theta \mid y, X)}\)
  • In Stan we specify the log of the posterior kernel (the sum over observations assumes they are conditionally independent given \(\theta\)): \[ \log[p(\theta) * p(y \mid X, \theta)] = \log[p(\theta)] + \sum_{i=1}^{N}\log[p(y_i \mid x_i, \theta)] \]
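  • A minimal R sketch of this kernel for a hypothetical toy model, \(y_i \sim \text{normal}(\theta, 1)\) with prior \(\theta \sim \text{normal}(0, 1)\): the log kernel is the log prior plus the sum of pointwise log likelihoods
# toy model (hypothetical data): y_i ~ normal(theta, 1), prior theta ~ normal(0, 1)
set.seed(1)
y_toy <- rnorm(5, mean = 0.3, sd = 1)
log_posterior_kernel <- function(theta, y) {
  log_prior <- dnorm(theta, mean = 0, sd = 1, log = TRUE)       # log p(theta)
  log_lik   <- sum(dnorm(y, mean = theta, sd = 1, log = TRUE))  # sum_i log p(y_i | theta)
  log_prior + log_lik  # log posterior kernel, up to the constant log p(y)
}
log_posterior_kernel(0.5, y_toy)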

Stan Manual

QR Parameterization (Section 9.2)

  • What is it?
    • If \(N > K\), where \(N\) is the number of rows and \(K\) is the number of columns of the design matrix \(X\), we can write \(X = QR\), where \(Q\) has orthonormal columns and \(R\) is upper-triangular
    • Notation: in Linear Algebra books people usually write a linear system as \(Ax = b\). We usually write it as \(X\beta = y\)
  • Why do we care?
    • It makes the sampling go faster: loosely speaking, the columns of \(Q\) are orthogonal, so the reparameterized posterior has far fewer tight, correlated regions
    • We like faster! How much faster? It depends, but we will do some simulations
  • When is it not great?
    • When you have informative priors on \(\beta\)

What is it

  • Let's take a simple matrix and see what the transformations look like
set.seed(123)
(X <- matrix(sample(9), ncol = 3))
##      [,1] [,2] [,3]
## [1,]    3    6    2
## [2,]    7    5    8
## [3,]    9    1    4
qrx <- qr(X); Q <- qr.Q(qrx); R <- qr.R(qrx)
Q %*% R
##      [,1] [,2] [,3]
## [1,]    3    6    2
## [2,]    7    5    8
## [3,]    9    1    4

What is it

  • What are the properties of \(Q\)?
    • \(Q\) is orthonormal.
  • This means that for the column vectors \(v_i\) of \(Q\): \(v_i \cdot v_i = 1\) and \(v_i \cdot v_j = 0\) for \(i \neq j\)
apply(Q, 2, function (x) x %*% x) # column vectors are length 1
## [1] 1 1 1
# column vectors are orthogonal to each other
zapsmall(t(Q) %*% Q)
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1

What is it

  • What are the properties of \(R\)?
    • \(R\) is upper-triangular
    • Solving upper-triangular systems is easy
(R <- qr.R(qrx))
##           [,1]      [,2]      [,3]
## [1,] -11.78983 -5.258771 -8.312252
## [2,]   0.00000  5.860488  2.096714
## [3,]   0.00000  0.000000 -3.241954
  • In R you can (back)solve the upper triangular system:
b <- c(1, 2, 3)      # example right-hand side
x <- backsolve(R, b) # solves Rx = b for upper-triangular R

Linear regression in Stan

data {
  int<lower=1> N;
  int<lower=1> K;
  vector[N] y;
  matrix[N, K] X;
}
parameters {
  real alpha;
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  alpha ~ normal(0, 1);    
  beta ~ normal(0, 1);
  sigma ~ cauchy(0, 2.5);
  y ~ normal(alpha + X * beta, sigma);
}

Coding QR regression in Stan

  • Our linear predictor is \(\eta = X\beta\) and \(X = QR\), so the linear predictor becomes \(\eta = QR\beta\)
  • In Stan, we will code \(\theta = R\beta\); the original coefficients are then recovered as \(\beta = R^{-1}\theta\), and the linear predictor becomes \(\eta = Q\theta\) (see the R check after the Stan code)
transformed data {
  matrix[N, K] Q;
  matrix[K, K] R;
  matrix[K, K] R_inverse;
  Q = qr_Q(X)[, 1:K];   // thin Q; cf. qr.Q(qrx, complete = TRUE)[, 1:K] in R
  R = qr_R(X)[1:K, ];   // square R; cf. qr.R(qrx) in R
  R_inverse = inverse(R);
}
parameters {
  real alpha;
  vector[K] theta;
  real<lower=0> sigma;
}
model {
  y ~ normal(Q * theta + alpha, sigma);  // likelihood
}
generated quantities {
  vector[K] beta;
  beta = R_inverse * theta; // coefficients on the original scale of X
}
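  • A quick R check of the algebra, reusing the small \(X\), \(Q\), and \(R\) from earlier and an arbitrary \(\beta\): the linear predictor is unchanged and \(\beta\) is recovered from \(\theta\)
beta <- c(0.5, -1, 2)                # arbitrary coefficients for illustration
theta <- R %*% beta                  # theta = R * beta
max(abs(Q %*% theta - X %*% beta))   # Q * theta equals X * beta (up to rounding)
max(abs(backsolve(R, theta) - beta)) # beta recovered as R^{-1} * theta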

Thinning the QR Matrix

(A <- matrix(sample(4*2), ncol = 2))
##      [,1] [,2]
## [1,]    4    6
## [2,]    7    1
## [3,]    3    2
## [4,]    8    5
qra <- qr(A)
Q <- qr.Q(qra, complete = TRUE)
zapsmall(t(Q) %*% Q) # could also do crossprod(Q, Q)
##      [,1] [,2] [,3] [,4]
## [1,]    1    0    0    0
## [2,]    0    1    0    0
## [3,]    0    0    1    0
## [4,]    0    0    0    1
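  • The thin decomposition keeps only the first \(K = 2\) columns of \(Q\), and thin \(Q\) times \(R\) still reproduces \(A\):
dim(qr.Q(qra))                  # thin Q: 4 x 2
dim(qr.Q(qra, complete = TRUE)) # complete Q: 4 x 4
qr.Q(qra) %*% qr.R(qra)         # reproduces A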

Why do we care

  • Speed! Let's simulate some data.
nc <- 100; nr <- 200
A <- matrix(runif(nc^2, -1, 1), ncol = nc) 
Sigma <- t(A) %*% A
cov2cor(Sigma)[1:10, 1:5]
##              [,1]        [,2]        [,3]        [,4]        [,5]
##  [1,]  1.00000000 -0.05132024 -0.13710853 -0.03067070 -0.08239046
##  [2,] -0.05132024  1.00000000 -0.15538675 -0.01023773 -0.05561505
##  [3,] -0.13710853 -0.15538675  1.00000000  0.15075837 -0.07965899
##  [4,] -0.03067070 -0.01023773  0.15075837  1.00000000  0.06031342
##  [5,] -0.08239046 -0.05561505 -0.07965899  0.06031342  1.00000000
##  [6,] -0.01990631  0.11646252 -0.17306376 -0.09224045  0.15738933
##  [7,]  0.02306227  0.04143631 -0.07601835  0.00829358  0.06744620
##  [8,]  0.06411499 -0.03243986 -0.10831315  0.04595341  0.25958174
##  [9,] -0.11012189  0.08133849 -0.12491623  0.25148685  0.04952560
## [10,]  0.01563579  0.23659090  0.01780008  0.13837934 -0.01362358

Why do we care

b <- runif(nc, -1, 1)
X <- mvtnorm::rmvnorm(nr, sigma = Sigma)
y <- X %*% b + rnorm(nr, mean = 0, sd = 0.5)
qrx <- qr(X)
Q <- qr.Q(qrx)
R <- qr.R(qrx)
betahat <- backsolve(R, crossprod(Q, y))
round(betahat[1:10], 2)
##  [1] -0.41 -0.52  0.45 -0.04 -0.18  0.87  0.73  0.37 -0.20 -0.17
lm_fit <- summary(lm(y ~ X -1))$coef[1:10, 'Estimate']
round(as.vector(lm_fit), 2)
##  [1] -0.41 -0.52  0.45 -0.04 -0.18  0.87  0.73  0.37 -0.20 -0.17

Why do we care

pairs(X[, 1:5])

Why do we care

data <- list(N = nr, K = nc, X = X, y = as.vector(y))
# lr_model <- stan_model("lin_reg.stan")
lr_model <- readRDS("lr_model.rds")
lr_fit <- sampling(lr_model, iter = 500, data = data)
lr_fit_time <- get_elapsed_time(lr_fit)
# qr_model <- stan_model("qr_reg1.stan")
qr_model <- readRDS("qr_model.rds")
qr_fit <- sampling(qr_model, iter = 500, data = data)
qr_fit_time <- get_elapsed_time(qr_fit)

Why do we care

lr_fit_time
##          warmup  sample
## chain:1 13.6533 22.4441
## chain:2 13.1853 21.1318
## chain:3 13.9737 21.8191
## chain:4 13.1125 21.2434
qr_fit_time
##          warmup   sample
## chain:1 1.52713 1.196980
## chain:2 1.59903 0.431617
## chain:3 1.73961 0.352686
## chain:4 1.67594 0.438472
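  • A rough measure of the speedup, using the total (warmup plus sampling) time summed over chains from the runs above:
sum(lr_fit_time) / sum(qr_fit_time) # roughly 15x faster for the QR model in this run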

Simulating correlation matrices

  • Create a file called stan_func.stan containing the following code:
functions {
  matrix qr_Q_stan(matrix A) {
    return qr_Q(A); // qr.Q(qr, complete = TRUE) in R
  }
  matrix qr_R_stan(matrix A) {
    return qr_R(A); // qr.R(qr, complete = TRUE) in R
  }
  matrix lkj_cor_rng(int K, real eta) {
    return lkj_corr_rng(K, eta);
  }
}

Simulating correlation matrices

  • Exposing Stan functions inside of R
expose_stan_functions("stan_func.stan")
# sample 3 x 3 correlation matrix
lkj_cor_rng(K = 3, eta = 1)
##            [,1]       [,2]       [,3]
## [1,]  1.0000000 -0.7097328  0.1850581
## [2,] -0.7097328  1.0000000 -0.2193609
## [3,]  0.1850581 -0.2193609  1.0000000
rho <- numeric(0)
eta <- c(0.7, 1, 2, 4)
for (i in 1:length(eta)) {
  lkj2 <- replicate(1e4, lkj_cor_rng(2, eta[i]))
  rho <- cbind(rho, apply(lkj2, 3, function(x) x[1, 2]))
}
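  • A sketch (base R graphics) of what these draws imply: overlaying the density of \(\rho\) for each \(\eta\) shows that larger \(\eta\) concentrates the LKJ prior mass near zero correlation
plot(density(rho[, 1]), xlim = c(-1, 1), xlab = expression(rho),
     main = "LKJ prior on a 2 x 2 correlation")
for (i in 2:length(eta)) lines(density(rho[, i]), lty = i)
legend("topright", legend = paste("eta =", eta), lty = seq_along(eta))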

Simulating correlation matrices

Priors for Coefficients and Scales (Section 9.3)

  • Improper uniform priors
  • Proper uniform priors
  • Uninformative priors
  • Truncated priors
  • Weakly informative priors
  • Bounded priors
  • Conjugacy
  • References for priors

Priors

x <- rnorm(3); y <- 0.2 * x + rnorm(3)
data <- list(N = 3, y = y, x = x)
qplot(x, y)

Priors

data {
  int<lower=1> N;
  vector[N] y;
  vector[N] x;
}
parameters {
  real beta;
  real sigma;
}
model {
  y ~ normal(beta*x, sigma);
}

Priors

# priors_model <- stan_model("priors.stan")
priors_model <- readRDS("priors_model.rds")
priors_fit <- sampling(priors_model, iter = 300, data = data)
sigma1 <- extract(priors_fit, pars = "sigma")$sigma
p1 <- qplot(sigma1, geom = "histogram"); p1

Priors

data {
  int<lower=1> N;
  vector[N] y;
  vector[N] x;
}
parameters {
  real beta;
  real<lower=0> sigma;
}
model {
  beta ~ normal(0, 1);
  sigma ~ normal(0, 2);
  y ~ normal(beta*x, sigma);
}

Priors

# priors1_model <- stan_model("priors1.stan")
priors1_model <- readRDS("priors1_model.rds")
priors1_fit <- sampling(priors1_model, iter = 300, data = data)
sigma2 <- extract(priors1_fit, pars = "sigma")$sigma; 
p2 <- qplot(sigma2, geom = "histogram"); cowplot::plot_grid(p1, p2)

Prior choice recommendations

Thank you!