15 Mar 2018

What is Stan

  • Stan is a Turing-complete probabilistic programming language that is used mostly for Bayesian inference
  • Stan uses a two-stage compiler: you write your model in the Stan language, which is translated into C++ and then compiled to native code for your platform
  • Stan includes a matrix and math library that supports auto-differentiation
  • Stan includes interfaces to R and other languages
  • There are post-inference plotting libraries that can be used with Stan, such as bayesplot
  • Stan has thousands of users, and the community keeps growing. Stan is used in clinical drug trials, genomics and cancer biology, population dynamics, psycholinguistics, social networks, finance and econometrics, professional sports, publishing, recommender systems, educational testing, climate models, and many more fields
  • There are companies building commercial products on top of Stan

What is Known

  • As a matter of notation we call:
    • Observed data \(\boldsymbol{y}\)
    • Unobserved but observable data \(\boldsymbol{\widetilde{y}}\)
    • Unobserved and unobservable parameters \(\boldsymbol{\theta}\)
    • Covariates \(\boldsymbol{x}\)
    • Probability distribution (density) \(\boldsymbol{p(\cdot)}\)
  • Estimation is the process of figuring out the unknowns, i.e. unobserved quantities
  • In classical machine learning and frequentist inference (including prediction), the problem is framed in terms of the most likely value of \(\boldsymbol{\theta}\)
  • Bayesians are extremely greedy people: they are not satisfied with the maximum, they want the whole distribution

Why Bayes

What is Bayes

\[ p(\theta\mid y, X) = \frac{p(y \mid X, \theta) * p(\theta)}{p(y \mid X)} = \frac{p(y \mid X, \theta) * p(\theta)}{\int p(y \mid X, \theta) * p(\theta) \space d\theta} \propto p(y \mid X, \theta) * p(\theta) \]

  • Bayesian inference is an approach to figuring out the updated distribution of \(\boldsymbol{\theta}\), the posterior \(\boldsymbol{p(\theta \mid y, X)}\), after observing \(\boldsymbol{y}\) and \(\boldsymbol{X}\)
  • When \(\boldsymbol{p(y \mid X, \theta)}\) is viewed as a function of \(\boldsymbol{\theta}\) with the observed \(\boldsymbol{y}\) held fixed, it is called the likelihood function - this is our data generating process
  • An MCMC algorithm draws from an implied probability distribution \(\boldsymbol{p(\theta \mid y, X)}\)
  • In Stan we specify the log of the posterior kernel (the sum over observations assumes they are conditionally independent given \(\theta\)): \[ \log[p(\theta) * p(y \mid X, \theta)] = \log[p(\theta)] + \sum_{i=1}^{N}\log[p(y_i \mid x_i, \theta)] \]
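  • A minimal R sketch of this kernel for a hypothetical toy model, \(y_i \sim \text{normal}(\theta, 1)\) with prior \(\theta \sim \text{normal}(0, 1)\): the log kernel is the log prior plus the sum of pointwise log likelihoods
# toy model (hypothetical data): y_i ~ normal(theta, 1), prior theta ~ normal(0, 1)
set.seed(1)
y_toy <- rnorm(5, mean = 0.3, sd = 1)
log_posterior_kernel <- function(theta, y) {
  log_prior <- dnorm(theta, mean = 0, sd = 1, log = TRUE)       # log p(theta)
  log_lik   <- sum(dnorm(y, mean = theta, sd = 1, log = TRUE))  # sum_i log p(y_i | theta)
  log_prior + log_lik  # log posterior kernel, up to the constant log p(y)
}
log_posterior_kernel(0.5, y_toy)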

Stan Manual

QR Parameterization (Section 9.2)

  • What is it?
    • If \(N > K\), where \(N\) is the number of rows and \(K\) is the number of columns of the design matrix \(X\), we can write \(X = QR\), where \(Q\) has orthonormal columns and \(R\) is upper-triangular
    • Notation: in Linear Algebra books people usually write a linear system as \(Ax = b\). We usually write it as \(X\beta = y\)
  • Why do we care?
    • It makes the sampling go faster: loosely speaking, the columns of \(Q\) are orthogonal, so the reparameterized posterior has far fewer tight, correlated regions
    • We like faster! How much faster? It depends, but we will do some simulations
  • When is it not great?
    • When you have informative priors on \(\beta\)

What is it

  • Let's take a simple matrix and see what the transformations look like
set.seed(123)
(X <- matrix(sample(9), ncol = 3))
##      [,1] [,2] [,3]
## [1,]    3    6    2
## [2,]    7    5    8
## [3,]    9    1    4
qrx <- qr(X); Q <- qr.Q(qrx); R <- qr.R(qrx)
Q %*% R
##      [,1] [,2] [,3]
## [1,]    3    6    2
## [2,]    7    5    8
## [3,]    9    1    4

What is it

  • What are the properties of \(Q\)?
    • \(Q\) is orthonormal.
  • This means that for the column vectors \(v_i\) of \(Q\): \(v_i \cdot v_i = 1\) and \(v_i \cdot v_j = 0\) for \(i \neq j\)
apply(Q, 2, function (x) x %*% x) # column vectors are length 1
## [1] 1 1 1
# column vectors are orthogonal to each other
zapsmall(t(Q) %*% Q)
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1

What is it

  • What are the properties of \(R\)?
    • \(R\) is upper-triangular
    • Solving upper-triangular systems is easy
(R <- qr.R(qrx))
##           [,1]      [,2]      [,3]
## [1,] -11.78983 -5.258771 -8.312252
## [2,]   0.00000  5.860488  2.096714
## [3,]   0.00000  0.000000 -3.241954
  • In R you can (back)solve the upper triangular system:
b <- c(1, 2, 3)      # example right-hand side
x <- backsolve(R, b) # solves Rx = b for upper-triangular R

Linear regression in Stan

data {
  int<lower=1> N;
  int<lower=1> K;
  vector[N] y;
  matrix[N, K] X;
}
parameters {
  real alpha;
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  alpha ~ normal(0, 1);    
  beta ~ normal(0, 1);
  sigma ~ cauchy(0, 2.5);
  y ~ normal(alpha + X * beta, sigma);
}

Coding QR regression in Stan

  • Our linear predictor is \(\eta = X\beta\) and \(X = QR\), so the linear predictor becomes \(\eta = QR\beta\)
  • In Stan, we will code \(\theta = R\beta\); the original coefficients are then recovered as \(\beta = R^{-1}\theta\), and the linear predictor becomes \(\eta = Q\theta\) (see the R check after the Stan code)
transformed data {
  matrix[N, K] Q;
  matrix[K, K] R;
  matrix[K, K] R_inverse;
  Q = qr_Q(X)[, 1:K];   // thin Q; cf. qr.Q(qrx, complete = TRUE)[, 1:K] in R
  R = qr_R(X)[1:K, ];   // square R; cf. qr.R(qrx) in R
  R_inverse = inverse(R);
}
parameters {
  real alpha;
  vector[K] theta;
  real<lower=0> sigma;
}
model {
  y ~ normal(Q * theta + alpha, sigma);  // likelihood
}
generated quantities {
  vector[K] beta;
  beta = R_inverse * theta; // coefficients on the original scale of X
}
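  • A quick R check of the algebra, reusing the small \(X\), \(Q\), and \(R\) from earlier and an arbitrary \(\beta\): the linear predictor is unchanged and \(\beta\) is recovered from \(\theta\)
beta <- c(0.5, -1, 2)                # arbitrary coefficients for illustration
theta <- R %*% beta                  # theta = R * beta
max(abs(Q %*% theta - X %*% beta))   # Q * theta equals X * beta (up to rounding)
max(abs(backsolve(R, theta) - beta)) # beta recovered as R^{-1} * theta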

Thinning the QR Matrix

(A <- matrix(sample(4*2), ncol = 2))
##      [,1] [,2]
## [1,]    4    6
## [2,]    7    1
## [3,]    3    2
## [4,]    8    5
qra <- qr(A)
Q <- qr.Q(qra, complete = TRUE)
zapsmall(t(Q) %*% Q) # could also do crossprod(Q, Q)
##      [,1] [,2] [,3] [,4]
## [1,]    1    0    0    0
## [2,]    0    1    0    0
## [3,]    0    0    1    0
## [4,]    0    0    0    1
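  • The thin decomposition keeps only the first \(K = 2\) columns of \(Q\), and thin \(Q\) times \(R\) still reproduces \(A\):
dim(qr.Q(qra))                  # thin Q: 4 x 2
dim(qr.Q(qra, complete = TRUE)) # complete Q: 4 x 4
qr.Q(qra) %*% qr.R(qra)         # reproduces A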

Why do we care

  • Speed! Let's simulate some data.
nc <- 100; nr <- 200
A <- matrix(runif(nc^2, -1, 1), ncol = nc) 
Sigma <- t(A) %*% A
cov2cor(Sigma)[1:10, 1:5]
##              [,1]        [,2]        [,3]        [,4]        [,5]
##  [1,]  1.00000000 -0.05132024 -0.13710853 -0.03067070 -0.08239046
##  [2,] -0.05132024  1.00000000 -0.15538675 -0.01023773 -0.05561505
##  [3,] -0.13710853 -0.15538675  1.00000000  0.15075837 -0.07965899
##  [4,] -0.03067070 -0.01023773  0.15075837  1.00000000  0.06031342
##  [5,] -0.08239046 -0.05561505 -0.07965899  0.06031342  1.00000000
##  [6,] -0.01990631  0.11646252 -0.17306376 -0.09224045  0.15738933
##  [7,]  0.02306227  0.04143631 -0.07601835  0.00829358  0.06744620
##  [8,]  0.06411499 -0.03243986 -0.10831315  0.04595341  0.25958174
##  [9,] -0.11012189  0.08133849 -0.12491623  0.25148685  0.04952560
## [10,]  0.01563579  0.23659090  0.01780008  0.13837934 -0.01362358

Why do we care

b <- runif(nc, -1, 1)
X <- mvtnorm::rmvnorm(nr, sigma = Sigma)
y <- X %*% b + rnorm(nr, mean = 0, sd = 0.5)
qrx <- qr(X)
Q <- qr.Q(qrx)
R <- qr.R(qrx)
betahat <- backsolve(R, crossprod(Q, y))
round(betahat[1:10], 2)
##  [1] -0.41 -0.52  0.45 -0.04 -0.18  0.87  0.73  0.37 -0.20 -0.17
lm_fit <- summary(lm(y ~ X -1))$coef[1:10, 'Estimate']
round(as.vector(lm_fit), 2)
##  [1] -0.41 -0.52  0.45 -0.04 -0.18  0.87  0.73  0.37 -0.20 -0.17

Why do we care

pairs(X[, 1:5])

Why do we care

data <- list(N = nr, K = nc, X = X, y = as.vector(y))
# lr_model <- stan_model("lin_reg.stan")
lr_model <- readRDS("lr_model.rds")
lr_fit <- sampling(lr_model, iter = 500, data = data)
lr_fit_time <- get_elapsed_time(lr_fit)
# qr_model <- stan_model("qr_reg1.stan")
qr_model <- readRDS("qr_model.rds")
qr_fit <- sampling(qr_model, iter = 500, data = data)
qr_fit_time <- get_elapsed_time(qr_fit)

Why do we care

lr_fit_time
##          warmup  sample
## chain:1 13.6533 22.4441
## chain:2 13.1853 21.1318
## chain:3 13.9737 21.8191
## chain:4 13.1125 21.2434
qr_fit_time
##          warmup   sample
## chain:1 1.52713 1.196980
## chain:2 1.59903 0.431617
## chain:3 1.73961 0.352686
## chain:4 1.67594 0.438472
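  • A rough measure of the speedup, using the total (warmup plus sampling) time summed over chains from the runs above:
sum(lr_fit_time) / sum(qr_fit_time) # roughly 15x faster for the QR model in this run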

Simulating correlation matrices

  • Create a file called stan_func.stan containing the following code:
functions {
  matrix qr_Q_stan(matrix A) {
    return qr_Q(A); // qr.Q(qr, complete = TRUE) in R
  }
  matrix qr_R_stan(matrix A) {
    return qr_R(A); // qr.R(qr, complete = TRUE) in R
  }
  matrix lkj_cor_rng(int K, real eta) {
    return lkj_corr_rng(K, eta);
  }
}

Simulating correlation matrices

  • Exposing Stan functions inside of R
expose_stan_functions("stan_func.stan")
# sample 3 x 3 correlation matrix
lkj_cor_rng(K = 3, eta = 1)
##            [,1]       [,2]       [,3]
## [1,]  1.0000000 -0.7097328  0.1850581
## [2,] -0.7097328  1.0000000 -0.2193609
## [3,]  0.1850581 -0.2193609  1.0000000
rho <- numeric(0)
eta <- c(0.7, 1, 2, 4)
for (i in 1:length(eta)) {
  lkj2 <- replicate(1e4, lkj_cor_rng(2, eta[i]))
  rho <- cbind(rho, apply(lkj2, 3, function(x) x[1, 2]))
}
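  • A sketch (base R graphics) of what these draws imply: overlaying the density of \(\rho\) for each \(\eta\) shows that larger \(\eta\) concentrates the LKJ prior mass near zero correlation
plot(density(rho[, 1]), xlim = c(-1, 1), xlab = expression(rho),
     main = "LKJ prior on a 2 x 2 correlation")
for (i in 2:length(eta)) lines(density(rho[, i]), lty = i)
legend("topright", legend = paste("eta =", eta), lty = seq_along(eta))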

Simulating correlation matrices

Priors for Coefficients and Scales (Section 9.3)

  • Improper uniform priors
  • Proper uniform priors
  • Uninformative priors
  • Truncated priors
  • Weakly informative priors
  • Bounded priors
  • Conjugacy
  • References for priors

Priors

x <- rnorm(3); y <- 0.2 * x + rnorm(3)
data <- list(N = 3, y = y, x = x)
qplot(x, y)

Priors

data {
  int<lower=1> N;
  vector[N] y;
  vector[N] x;
}
parameters {
  real beta;
  real sigma;
}
model {
  y ~ normal(beta*x, sigma);
}

Priors

# priors_model <- stan_model("priors.stan")
priors_model <- readRDS("priors_model.rds")
priors_fit <- sampling(priors_model, iter = 300, data = data)
sigma1 <- extract(priors_fit, pars = "sigma")$sigma
p1 <- qplot(sigma1, geom = "histogram"); p1

Priors

data {
  int<lower=1> N;
  vector[N] y;
  vector[N] x;
}
parameters {
  real beta;
  real<lower=0> sigma;
}
model {
  beta ~ normal(0, 1);
  sigma ~ normal(0, 2);
  y ~ normal(beta*x, sigma);
}

Priors

# priors1_model <- stan_model("priors1.stan")
priors1_model <- readRDS("priors1_model.rds")
priors1_fit <- sampling(priors1_model, iter = 300, data = data)
sigma2 <- extract(priors1_fit, pars = "sigma")$sigma; 
p2 <- qplot(sigma2, geom = "histogram"); cowplot::plot_grid(p1, p2)

Prior choice recommendations

Thank you!