A tutorial in STAN for Bayesian Hierarchical Models

Rasim Muzaffer Musal

Bayesian vs Frequentist Statistics

  • This Tutorial makes use of Dr. McElreath’s book Statistical Rethinking.

  • We will be referring to models that are within the realm of general and generalized linear models.

  • For both of these types of models we need to understand some notation and concepts.

  •   Likelihood (both for frequentist and Bayesian)
  •   Prior (Bayesian only): a quantification of uncertainty, usually about the parameters of the model, before we learn from the data through the updating procedure.

Bayesian vs Frequentist Statistics

  • In Bayesian Statistics we treat every component as a random variable. Data are realizations of random variables.

  • Every random variable can be assigned a probability distribution. This is an important distinction from the frequentist framework when making inferences about model coefficients; for instance, in how \(\beta\) is interpreted in O.L.S.

  • Frequentists treat a coefficient estimate as the best estimate of a fixed but unknown value.

Bayesian Concepts:

  • Y is the variable of interest that you would like to learn more about.

  • Y is observed as a result of a process which we statistically model.

  • Models have parameters that we need to learn about, both to make inferences about them and to predict values of Y.

  • You update your knowledge about the parameters of the model with Bayesian updating.

  • The rule below is a law of probability inversion; it holds whether or not one adopts the Bayesian interpretation.

\[p(\theta|Y)=\frac{p(Y|\theta)*p(\theta)}{\int\limits_{\theta}p(Y|\theta)*p(\theta)}=\frac{p(Y,\theta)}{p(Y)}\]
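To make the updating formula concrete, here is a minimal grid-approximation sketch in R; the binomial data and the Beta prior are invented for illustration:

# Grid approximation of p(theta | Y) for a binomial example (illustrative data)
theta <- seq(0, 1, length.out = 1000)            # grid over the parameter
prior <- dbeta(theta, 2, 2)                      # p(theta), an assumed prior
likelihood <- dbinom(7, size = 10, prob = theta) # p(Y | theta) for 7 of 10 successes
unnormalized <- likelihood * prior               # numerator of Bayes' rule
posterior <- unnormalized / sum(unnormalized)    # grid stand-in for the integral p(Y)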

Bayesian vs Frequentist Statistics, Notationally

  • Imagine a very simple, even trivial, scenario: predicting real estate prices from their square footage.

  • We need to choose a form for the likelihood.

  • The likelihood is the joint probability density function of the data, evaluated at a particular set of parameters assumed to generate it. The frequentist form would be written as \[L(Y;X,\beta)\]

  • Bayesians would write it as \[f(Y|X,\beta)\] The question: is \(\beta\) an unknown fixed quantity (frequentist) or a random variable (Bayesian)?

Frequentist Simple Linear Regression

  • Assume \(Y\sim Normal (\mu,\sigma)\)
  • We model \(\mu\) via \(\alpha + \beta\times X\)
  • Assume \(i=1, \ldots, N\): \[L(Y;X,\beta)=\prod\limits_{i=1}^{N}N(\alpha + \beta\times X_{i},\sigma)\]
  • Any choice of \(\alpha\) and \(\beta\) yields a likelihood value.
  • Find the maximum likelihood estimators by taking the derivatives, setting them equal to 0, and solving for \(\alpha\) and \(\beta\) (a numerical sketch follows): \[\prod\limits_{i=1}^{N}\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{\left(y_{i}-(\alpha+\beta X_{i})\right)^{2}}{2\sigma^{2}}}\]
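As a sketch of this maximization done numerically rather than analytically (the simulated data and names here are illustrative, not the tutorial's data set):

# Numerical maximum likelihood for simple linear regression (simulated data)
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100, sd = 1.5)
neg_log_lik <- function(par) {
  # par = (alpha, beta, log sigma); exp() keeps sigma positive
  -sum(dnorm(y, mean = par[1] + par[2] * x, sd = exp(par[3]), log = TRUE))
}
mle <- optim(c(0, 0, 0), neg_log_lik)
mle$par[1:2]   # alpha-hat, beta-hat; compare with coef(lm(y ~ x))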

Bayesian Simple Linear Regression

\[f(\alpha,\beta,\sigma|X,Y)=\frac{\prod\limits_{i=1}^{i=N} f(Y_{i}|\alpha+\beta X_{i},\sigma)f(\alpha)f(\beta)f(\sigma)}{\int\limits_{\alpha}\int\limits_{\beta}\int\limits_{\sigma}\prod\limits_{i=1}^{i=N} f(Y_{i}|\alpha+\beta X_{i},\sigma)f(\alpha)f(\beta)f(\sigma)} \] - Perhaps it is better to show this as

\[f(\alpha,\beta,\sigma|X,Y)=\frac{f(Y|X,\alpha,\beta,\sigma)f(\alpha)f(\beta)f(\sigma)}{\int\limits_{\alpha}\int\limits_{\beta}\int\limits_{\sigma}f(Y|X,\alpha,\beta,\sigma)f(\alpha)f(\beta)f(\sigma)} \] - The likelihood is calculated in the same manner as in the frequentist framework. The priors are declared depending on context.

Bayesian Simple Linear Regression

  • To summarize, the above can be written as:

\[f(\alpha,\beta,\sigma|X,Y)=\frac{f(Y,X,\alpha,\beta,\sigma)}{f(Y,X)}\] - It is straightforward to calculate the numerator. Not so much the denominator. Luckily the denominator is a constant and does not need to be calculated (in most cases).

\[f(\alpha,\beta,\sigma|X,Y)\propto f(Y|X,\alpha,\beta,\sigma)f(\alpha)f(\beta)f(\sigma) \]
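This proportional form is exactly what an MCMC sampler evaluates. A minimal sketch of the unnormalized log posterior in R, using the normal(0, 10000) priors adopted later in this tutorial:

# Unnormalized log posterior: log-likelihood plus log priors
log_posterior <- function(alpha, beta, sigma, X, Y) {
  if (sigma <= 0) return(-Inf)   # sigma is constrained to be positive
  log_lik <- sum(dnorm(Y, mean = alpha + beta * X, sd = sigma, log = TRUE))
  log_prior <- dnorm(alpha, 0, 10000, log = TRUE) +
               dnorm(beta,  0, 10000, log = TRUE) +
               dnorm(sigma, 0, 10000, log = TRUE)
  log_lik + log_prior   # equals log f(alpha, beta, sigma | X, Y) up to a constant
}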

Types of questions we can answer with Bayesian Framework

  • What is the probability that \(\beta>0\)?
  • Credible and predictive intervals involving \(\beta\)
  • If all we are doing is simple linear regression there will not be that much of a difference especially if \(N\) is large.
  • We shall use STAN to compute \[f(\alpha,\beta,\sigma|X,Y).\]

Example of STAN

A basic linear regression in STAN.

data {
  int<lower=0> N;        // number of observations
  vector[N] x;           // covariate
  vector[N] y;           // outcome
}
parameters {
  real alpha;            // intercept
  real beta;             // slope
  real<lower=0> sigma;   // residual standard deviation
}
model {
  y ~ normal(alpha + beta * x, sigma);  // no explicit priors: implicit flat priors
}

Outputs from Frequentist vs Bayesian

  • If there is enough data, at least for linear models, frequentist MLEs and posterior means of coefficients converge.

  • N = 518

  • Predicting Price from Square Footage

  • O.L.S., non-hierarchical

O.L.S. Application in STAN

  • N=518
  • Note that the covariate (Size) is not standardized.
data {
  int<lower=0> N;
  vector[N] Price;
  vector[N] Size;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  Price ~ normal(alpha + beta * Size, sigma);
  alpha ~ normal(0, 10000);   // vague prior on the price scale
  beta  ~ normal(0, 10000);   // vague prior
  sigma ~ normal(0, 10000);   // effectively half-normal, given the lower=0 constraint
}

How to run this model in STAN

df_list=as.list(df)
df_list$N=518

fit_M2 <- stan(
  file = "C:/Users/rm84/Desktop/service/Simple Linear Regression.stan",  # Stan program
  data = df_list,    # named list of data
  chains = 3,             # number of Markov chains
  warmup = 15000,          # number of warmup iterations per chain
  iter = 25000,            # total number of iterations per chain
  cores = 3,              # number of cores (could use one per chain)
  refresh = 1000             # progress update every 1000 iterations
)
summary(fit_M2)$summary
             mean     se_mean          sd        2.5%         25%         50%
alpha 20879.02444 28.92380335 2915.283381 15232.40700 18924.72431 20849.33244
beta     38.77437  0.01518594    1.537100    35.75203    37.73455    38.78952
sigma 12639.02775  3.42252230  395.673778 11886.91114 12370.78073 12629.24751
lp__  -5144.87517  0.01252116    1.217237 -5148.04311 -5145.42177 -5144.56262
              75%       97.5%     n_eff     Rhat
alpha 22851.81357 26635.90924 10158.995 1.000081
beta     39.80971    41.73124 10245.211 1.000076
sigma 12900.60077 13446.03127 13365.400 1.000406
lp__  -5143.98472 -5143.48271  9450.632 1.000143

How to analyze basic output

[1] "Convergence Check"
[1] 0
named numeric(0)
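The code behind this output is not shown in the slides; a minimal sketch that produces output of this shape, assuming the fit_M2 object from above, counts the parameters whose Rhat exceeds 1.01:

print("Convergence Check")
rhats <- summary(fit_M2)$summary[, "Rhat"]
bad <- rhats[rhats > 1.01]   # parameters that fail the convergence check
print(length(bad))           # 0: every parameter passes
print(bad)                   # named numeric(0) when nothing fails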

How to analyze basic output

  • How much did we learn about \(\beta\)? We can compare the prior and posterior distributions of the parameters: \(P(\beta)\) vs \(P(\beta|X,Y)\).

  • Compare the prior vs posterior distribution of \(\beta\).

Learning about model parameters

  • Prior vs Posterior
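The figure is not reproduced here; a minimal sketch of such a plot for \(\beta\), assuming the draws are extracted with rstan::extract:

# Overlay the normal(0, 10000) prior on the posterior draws of beta
listofdraws_M2 <- rstan::extract(fit_M2)
plot(density(listofdraws_M2$beta), xlab = expression(beta),
     main = "Prior vs posterior of beta")
curve(dnorm(x, 0, 10000), add = TRUE, lty = 2)  # prior: essentially flat at this scale
legend("topright", legend = c("posterior", "prior"), lty = c(1, 2))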

Learning about model parameters

  • What is the probability that \(\beta>0\)?
sum(listofdraws_M2$beta>0)/length(listofdraws_M2$beta)
[1] 1
  • What is the probability that \(\beta\) is larger than 37?
sum(listofdraws_M2$beta>37)/length(listofdraws_M2$beta)
[1] 0.8745667
  • What is the posterior predictive distribution of Price (Y) when Size (X) is 1500 square feet? (See the sketch after this list.)

  • These questions are not that interesting on their own, since analogous results are available asymptotically in frequentist statistics.
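A minimal sketch answering the posterior predictive question above: draw one simulated Price per posterior draw at Size = 1500.

# Posterior predictive distribution of Price at Size = 1500
price_pred <- rnorm(length(listofdraws_M2$alpha),
                    mean = listofdraws_M2$alpha + listofdraws_M2$beta * 1500,
                    sd   = listofdraws_M2$sigma)
quantile(price_pred, c(0.025, 0.5, 0.975))  # 95% posterior predictive interval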

How would this work with Frequentist Framework

It would be straightforward to compare the Bayesian posterior summary of \(\beta\) with the frequentist estimate below. There really is not much point to it for S.L.R., since the two essentially agree here.


Call:
lm(formula = df$Price ~ df$Size)

Residuals:
   Min     1Q Median     3Q    Max 
-26100 -10552  -1141  11000  28267 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 22938.737   3110.831   7.374 6.65e-13 ***
df$Size        37.708      1.637  23.037  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12620 on 516 degrees of freedom
Multiple R-squared:  0.507, Adjusted R-squared:  0.5061 
F-statistic: 530.7 on 1 and 516 DF,  p-value: < 2.2e-16

Hypothesis tests vs Model Comparison

  • “You can reject the null hypothesis that Size has no effect on Price at \(\alpha=0.0001\).”

  • How do we do this in Bayesian Framework? Bayes Factors.

  • Assume there are two hypotheses: H1 states \(\beta=0\) and H2 states \(\beta \neq 0\).

\[\frac{p(H_{2}|Y)}{p(H_{1}|Y)} =\frac{p(H_{2})}{p(H_{1})} \frac{p(Y|H_{2})}{p(Y|H_{1})} \]

- If the prior probabilities \(p(H_{1})\) and \(p(H_{2})\) are identical, all we need to calculate is the ratio of marginal likelihoods:

\[\frac{p(Y|H_{2})}{p(Y|H_{1})}= \frac{\int\limits_{\theta_{M2}} p(Y|\theta_{M2})\,p(\theta_{M2})\,d\theta_{M2}}{\int\limits_{\theta_{M1}} p(Y|\theta_{M1})\,p(\theta_{M1})\,d\theta_{M1}}\]

Fitting the Intercept Only Model M1

\[Y\sim N(\alpha,\sigma)\]

           mean     se_mean         sd      2.5%      25%      50%       75%
alpha 92864.311 4.902302247 787.367401 91304.063 92340.87 92870.62 93396.620
sigma 17953.881 3.488005390 551.786524 16906.594 17576.06 17942.42 18316.791
lp__  -5369.022 0.008648082   1.000786 -5371.716 -5369.40 -5368.71 -5368.313
          97.5%    n_eff      Rhat
alpha 94396.118 25796.14 0.9999191
sigma 19083.488 25025.80 0.9999517
lp__  -5368.053 13391.91 1.0000118

Calculating Bayes Factors via MCMC Integration

  • All we need to do is evaluate \[\int\limits_{\alpha_{M2},\beta_{M2},\sigma_{M2}} P(Y|\alpha_{M2}+\beta_{M2}\times Size,\sigma_{M2})\]

  • and

\[\int\limits_{\alpha_{M1},\sigma_{M1}} P(Y|\alpha_{M1},\sigma_{M1})\]

MCMC Integration and Bayes Factors in R

library(BayesFactor)
N <- length(df$Price)
BF2 <- array(dim = c(N))
BF1 <- array(dim = c(N))

for (i in 1:N) {
  # average each observation's log density over the posterior draws
  BF2[i] <- mean(dnorm(df$Price[i],
                       mean = listofdraws_M2$alpha + listofdraws_M2$beta * df$Size[i],
                       sd = listofdraws_M2$sigma, log = TRUE))
  BF1[i] <- mean(dnorm(df$Price[i],
                       mean = listofdraws_M1$alpha,
                       sd = listofdraws_M1$sigma, log = TRUE))
}
# BF21: approximate Bayes factor of M2 over M1 (regressionBF below uses
# JZS priors, so its value differs in magnitude)
exp(sum(BF2) - sum(BF1))
[1] 2.41002e+79
bf = regressionBF(Price ~ Size, data = df)
print(bf)
Bayes factor analysis
--------------
[1] Size : 2.754885e+77 ±0.01%

Against denominator:
  Intercept only 
---
Bayes factor type: BFlinearModel, JZS


What are Hierarchical Models

  • Hierarchical Models are also called Multilevel or Mixed Effects Models.

  • The idea is that the model structure contains parameters that can learn from each other.

  • For each group \(j\): \(P(Y_{j}|\theta_{j})P(\theta_{j}|\phi)\), so the group-level parameters \(\theta_{j}\) share a common distribution governed by \(\phi\) (the full posterior is written out after this list).

  • Useful with repeated sampling, to adjust estimates for units that are sampled more or less often than others.

  • Useful when groups of data are unbalanced, so that larger groups do not dominate the estimates.

  • Estimation of variation (how do groups in the data vary?).

  • Avoids pre-averaging the data (e.g., averaging unemployment rates across months discards variation).
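Writing the structure above out in full: with \(J\) groups and a hyperprior \(p(\phi)\), the joint posterior is

\[p(\theta_{1},\ldots,\theta_{J},\phi|Y)\propto \left[\prod\limits_{j=1}^{J} p(Y_{j}|\theta_{j})\,p(\theta_{j}|\phi)\right]p(\phi)\]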

Example of a Hierarchical Model

library(rethinking)
data(reedfrogs)
d<-reedfrogs
str(d)
'data.frame':   48 obs. of  5 variables:
 $ density : int  10 10 10 10 10 10 10 10 10 10 ...
 $ pred    : Factor w/ 2 levels "no","pred": 1 1 1 1 1 1 1 1 2 2 ...
 $ size    : Factor w/ 2 levels "big","small": 1 1 1 1 2 2 2 2 1 1 ...
 $ surv    : int  9 10 7 10 9 9 10 9 4 9 ...
 $ propsurv: num  0.9 1 0.7 1 0.9 0.9 1 0.9 0.4 0.9 ...
  • Density is the initial number of tadpoles in each tank.
  • Surv is the number of tadpoles surviving to the end of the experiment.

What are the models for no pooling and partial pooling of information?

Non-hierarchical, no pooling: \[S_{i} \sim Binomial(N_{i},p_{i}) \] \[logit(p_{i}) = \alpha_{Tank_{i}} \] \[\alpha_{j}\sim Normal(0,1.5) \]

Hierarchical, partial pooling: \[ S_{i} \sim Binomial(N_{i},p_{i})\] \[logit(p_{i}) = \alpha_{Tank_{i}} \] \[\alpha_{j}\sim Normal(\bar{\alpha},\sigma) \] \[\bar{\alpha}\sim Normal(0,1.5) ; \quad \sigma\sim Exponential(1)\]

Using RStan for unpooled model

library(tidyverse)
library(tidybayes)
library(rstan)
library(rethinking)
library(patchwork)
options(mc.cores = 4)

stan_data <- list(
  n    = nrow(d),
  surv = d$surv,
  dens = d$density,
  tank = 1:nrow(d)
)

stan_program <- '
data {
  int n;                 // number of tanks
  array[n] int surv;     // surviving tadpoles per tank
  array[n] int dens;     // initial tadpole count per tank
  array[n] int tank;     // tank index
}
parameters {
  array[n] real a;       // one intercept per tank
}
transformed parameters {
  vector[n] p;
  for (i in 1:n) {
    p[i] = inv_logit(a[i]);
  }
}
model {
  a ~ normal(0, 1.5);    // independent priors: tanks do not share information
  for (i in 1:n) {
    surv[i] ~ binomial(dens[i], p[i]);
  }
}
'

m13.1 <- stan(model_code = stan_program, data = stan_data)
Running MCMC with 1 chain...

Chain 1 Iteration:   1 / 1000 [  0%]  (Warmup) 
Chain 1 Iteration: 100 / 1000 [ 10%]  (Warmup) 
Chain 1 Iteration: 200 / 1000 [ 20%]  (Warmup) 
Chain 1 Iteration: 300 / 1000 [ 30%]  (Warmup) 
Chain 1 Iteration: 400 / 1000 [ 40%]  (Warmup) 
Chain 1 Iteration: 500 / 1000 [ 50%]  (Warmup) 
Chain 1 Iteration: 501 / 1000 [ 50%]  (Sampling) 
Chain 1 Iteration: 600 / 1000 [ 60%]  (Sampling) 
Chain 1 Iteration: 700 / 1000 [ 70%]  (Sampling) 
Chain 1 Iteration: 800 / 1000 [ 80%]  (Sampling) 
Chain 1 Iteration: 900 / 1000 [ 90%]  (Sampling) 
Chain 1 Iteration: 1000 / 1000 [100%]  (Sampling) 
Chain 1 finished in 0.2 seconds.

Using RStan for partial pooled model

stan_program <- '
data {
  int n;                 // number of tanks
  array[n] int surv;     // surviving tadpoles per tank
  array[n] int dens;     // initial tadpole count per tank
  array[n] int tank;     // tank index
}
parameters {
  real abar;             // population mean log-odds of survival
  real<lower=0> sigma;   // between-tank standard deviation
  array[n] real a;       // tank-level intercepts
}
transformed parameters {
  vector[n] p;
  for (i in 1:n) {
    p[i] = inv_logit(a[i]);
  }
}
model {
  abar ~ normal(0, 1.5);     // hyperprior, as in the model statement above
  sigma ~ exponential(1);
  a ~ normal(abar, sigma);   // partial pooling: tanks share abar and sigma
  for (i in 1:n) {
    surv[i] ~ binomial(dens[i], p[i]);
  }
}
'

m13.2 <- stan(model_code = stan_program, data = stan_data)
Running MCMC with 1 chain...

Chain 1 Iteration:   1 / 1000 [  0%]  (Warmup) 
Chain 1 Iteration: 100 / 1000 [ 10%]  (Warmup) 
Chain 1 Iteration: 200 / 1000 [ 20%]  (Warmup) 
Chain 1 Iteration: 300 / 1000 [ 30%]  (Warmup) 
Chain 1 Iteration: 400 / 1000 [ 40%]  (Warmup) 
Chain 1 Iteration: 500 / 1000 [ 50%]  (Warmup) 
Chain 1 Iteration: 501 / 1000 [ 50%]  (Sampling) 
Chain 1 Iteration: 600 / 1000 [ 60%]  (Sampling) 
Chain 1 Iteration: 700 / 1000 [ 70%]  (Sampling) 
Chain 1 Iteration: 800 / 1000 [ 80%]  (Sampling) 
Chain 1 Iteration: 900 / 1000 [ 90%]  (Sampling) 
Chain 1 Iteration: 1000 / 1000 [100%]  (Sampling) 
Chain 1 finished in 0.2 seconds.

Results
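The results figure is not reproduced here; a minimal sketch of the comparison it would show, assuming the m13.1 and m13.2 fits above:

# Compare tank-level survival estimates from the two models
post_unpooled <- rstan::extract(m13.1)
post_partial  <- rstan::extract(m13.2)
p_unpooled <- colMeans(post_unpooled$p)   # posterior mean survival per tank
p_partial  <- colMeans(post_partial$p)

plot(p_unpooled, ylim = c(0, 1), pch = 1, xlab = "Tank",
     ylab = "Posterior mean survival")
points(p_partial, pch = 16)               # partial pooling shrinks toward the mean
abline(h = mean(d$propsurv), lty = 2)     # observed grand mean survival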

Model Comparison- Cross Validation

  • Cross validation estimates out-of-sample accuracy.

    • K-fold and leave-one-out (full LOO requires fitting as many posterior distributions as there are observations).

    • Pareto-smoothed importance sampling (PSIS) approximates LOO: each observed value of Y receives an importance weight, and the out-of-sample likelihood is computed from the re-weighted posterior draws (a sketch follows this list).
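A minimal sketch with the loo package, assuming the Stan program above is extended with a generated quantities block that stores the pointwise log-likelihood in a variable named log_lik:

library(loo)
# The assumed addition to the Stan program:
#   generated quantities {
#     vector[n] log_lik;
#     for (i in 1:n) log_lik[i] = binomial_lpmf(surv[i] | dens[i], p[i]);
#   }
ll <- extract_log_lik(m13.2, parameter_name = "log_lik")
loo_m13.2 <- loo(ll)   # PSIS-LOO: one importance weight per observation
print(loo_m13.2)       # also reports Pareto-k diagnostics for the weights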

Model Comparison - Information Criteria

  • The log pointwise predictive density (lppd) is a measure we have already calculated: \[\sum\limits_{i}\log\frac{1}{S}\sum\limits_{s}p(y_{i}|\theta_{s})\]

  • AIC = \(-2\times lppd + 2k\), where \(k\) is the number of estimated parameters. AIC is only valid when:

    • priors are flat or overwhelmed by the likelihood,
    • the posterior distribution is approximately multivariate Gaussian, and
    • \(N \gg k\).

Model Comparison - Information Criteria

  • DIC, the Deviance Information Criterion, accommodates informative priors but still assumes the posterior is approximately multivariate Gaussian and that \(N \gg k\). The newer index is:

  • Watanabe (2010) and Gelman et al. (2014): the widely applicable information criterion, also called the Watanabe-Akaike information criterion (WAIC).

\[WAIC(y,\Theta) = -2\left(lppd - \sum\limits_{i}var_{\theta}\,\log p(y_{i}|\theta)\right)\]
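A minimal sketch computing lppd and WAIC from a matrix of pointwise log-likelihood draws (the matrix log_lik, S draws by N observations, is assumed to come from a fitted model as in the PSIS sketch above):

# WAIC from an S x N matrix of pointwise log-likelihood draws
compute_waic <- function(log_lik) {
  # lppd_i = log( (1/S) sum_s p(y_i | theta_s) ), computed stably on the log scale
  lppd <- apply(log_lik, 2, function(ll) log(mean(exp(ll - max(ll)))) + max(ll))
  p_waic <- apply(log_lik, 2, var)   # pointwise variance of the log-likelihood
  -2 * (sum(lppd) - sum(p_waic))
}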