A Short Talk on Bayesian Generalized Linear Models

Mike Grosskopf

Overview

Quick Review of Bayesian Inference
General Approach to Bayesian GLM
Bayesian Logistic Regression with bayesglm()
Fitting Bayesian Models in Stan

References

Bayesian Data Analysis; 3\( ^{rd} \) edition

Gelman, et al.

Data Analysis Using Regression and Multilevel/Hierarchical Models

Gelman and Hill

Catagorical Data Analysis

Agresti

The basics of Bayesian inference

Everything not observed is treated as random
- Variables, parameters, etc \( \sim p(\theta,Y) \)
Randomness reflects the imperfect knowledge of the r.v.
Upon observation of data, we use Bayes rule to update the prior distribution and obtain the posterior distribution \[ p(\theta \mid Y=y) = p(\theta)\frac{p(Y=y \mid \theta)}{p(Y=y)} \] \[ p(Y=y) = \int p(\theta)p(Y=y \mid \theta) d \theta \]
For prediction \[ p(Y_{new}\mid Y=y) = \int p(Y_{new} \mid \theta,Y=y)*p(\theta \mid Y=y) d \theta \]

Why Bayes?

Coherent, consistent inductive inference
Intuitive interpretation
Allows for a wider class of applicable inference than frequency interpretation of probability
Prior information can allow for reasonable inference with moderate samples
- Asymptotically equivalent to MLE estimates
Natural for complex or hierarchical models

Bayesian models allow use of external information into estimates

The one advantage of the Bayesian approach is the use of external information to improve the estimates of the linear model coefficients

Gelman sex-ratio examples: “Of Beauty, Sex, and Power”

Uncertainty introduced by adding addtional model complexity leads to a natural regularization

Lasso and other regularized estimators can be viewed as Bayesian estimators with a particular prior

The choice of prior distribution for the parameters is an important part of the statistical model

Independence assumptions, etc can have notable effects on the inference
A common point of contention for Bayesian methods
Many philosophies about the priors
- Subjective Bayes, Objective Bayes, Prior as part of the Model, etc.
Statement of prior distribution makes assumptions more explicit
A Bayes estimate is consistent as long as the true value in is the support of the prior

Bayesian Generalized Linear Models

For logisitic model with a covariate vector \( \beta \), we have:

\[ p(Y_i \mid X_i,\beta) \sim \text{Binomial}(n_i,p_i) \\ \text{where } p_i = \text{logit}^{-1}(\beta^TX) \]

For Poisson model with a covariate vector \( \beta \), we have:

\[ p(Y_i \mid X_i,\beta) \sim \text{Poisson}(\lambda_i) \\ \text{where } \lambda_i = e^{\beta^TX} \]

\( p(\beta \mid Y,X) \propto p(Y_i \mid X_i,\beta)*p(\beta) \)
Normal or t distribution for model parameters are common

The arm package in R provides tools for quick fitting of basic Bayesian GLM

bayesglm()
- Equivalent of the glm command in R
- Restricted to t-distributions for priors
- Fit with a pseudo-data approach or modified EM algorithm
sim() and coef()
- Can be used to draw samples from the posterior distribution for the model parameters

In standard logistic regression, complete separation can be a problem

Can generate simulated data where covariate \( x1 \) has no association with the outcome and covariate \( x2 \) completely determines the outcome:

\( y=1 \) if \( x2>0 \) and \( y=0 \) otherwise.

  y      x1      x2
1 0 -0.5605 -0.8572
2 0 -0.2302 -0.5855
3 0  1.5587 -0.5863

   y      x1     x2
18 1 -1.9666 0.7533
19 1  0.7014 0.8950
20 1 -0.4728 0.3745

Standard logistic regression poorly estimates covariate coefficients

glm(formula = y ~ x1 + x2, family = "binomial", data = d)
            coef.est  coef.se  
(Intercept)      5.28  33507.11
x1              25.40  65424.13
x2             117.66 170423.35
---
  n = 20, k = 3
  residual deviance = 0.0, null deviance = 27.7 (difference = 27.7)

Using a normal prior density for the coefficients solves this

Prior for \( \beta_1 \) approximately N(0,10)
- bayesglm() does some automatic scaling to priors

bayesglm(formula = y ~ x1 + x2, family = "binomial", data = d, 
    prior.scale = 10, prior.df = Inf)
            coef.est coef.se
(Intercept) 0.64     1.41   
x1          1.49     1.75   
x2          9.36     4.16   
---
n = 20, k = 3
residual deviance = 1.3, null deviance = 3.3 (difference = 2.0)

We can get similar results with a Cauchy prior

Cauchy(\( \gamma=10 \)) prior has wider tails than Normal

bayesglm(formula = y ~ x1 + x2, family = "binomial", data = d, 
    prior.scale = 10)
            coef.est coef.se
(Intercept)  0.69     1.50  
x1           1.50     1.84  
x2          10.21     4.88  
---
n = 20, k = 3
residual deviance = 1.1, null deviance = 2.9 (difference = 1.8)

We can get similar results with a Cauchy prior

plot of chunk unnamed-chunk-5

We can recover the classical estimate using a "uninformative prior"

bayesglm(formula = y ~ x1 + x2, family = "binomial", data = d, 
    prior.scale = Inf)
            coef.est  coef.se  
(Intercept)      5.53  55242.70
x1              26.54 107862.04
x2             122.82 280964.40
---
n = 20, k = 3
residual deviance = 0.0, null deviance = 0.0 (difference = 0.0)

Note: Uninformative priors can lead to improper posteriors

Poorly specified prior distributions can lead to poor results

bayesglm(formula = y ~ x1 + x2, family = "binomial", data = d, 
    prior.mean = 100)
            coef.est   coef.se   
(Intercept) -1.721e+14  1.539e+07
x1           4.956e+14  1.590e+07
x2           3.138e+15  2.513e+07
---
n = 20, k = 3
residual deviance = 0.0, null deviance = 0.0 (difference = 0.0)

Stan provides a valuable tool for fitting more complex Bayesian models

A framework for flexible models with Hamiltonian MCMC

logitReg <- 'data {
  int<lower=0> N; real x1[N]; real x2[N]; int<lower=0,upper=1> y[N];}

parameters {
  real alpha; real beta1; real beta2;}

model {
  alpha ~ normal(0,10.); beta1 ~ normal(0,10.); beta2 ~ normal(0,10.);
  for (n in 1:N)
    y[n] ~ bernoulli(inv_logit(alpha + beta1 * x1[n]+beta2 * x2[n]));
}'

Stan provides a valuable tool for fitting more complex Bayesian models

A framework for model evaluation with Hamiltonian MCMC

logitData <- list(N=10, x1=c(runif(5,-1,0),runif(5,0,1)), x2=c(rnorm(10)),y=c(rep(0,5),rep(1,5)))

fit <- stan(model_code=logitReg,
            data=logitData,iter=1000,chains=4)


TRANSLATING MODEL 'logitReg' FROM Stan CODE TO C++ CODE NOW.
COMPILING THE C++ CODE FOR MODEL 'logitReg' NOW.
SAMPLING FOR MODEL 'logitReg' NOW (CHAIN 1).

Iteration:   1 / 1000 [  0%]  (Warmup)
Iteration: 100 / 1000 [ 10%]  (Warmup)
Iteration: 200 / 1000 [ 20%]  (Warmup)
Iteration: 300 / 1000 [ 30%]  (Warmup)
Iteration: 400 / 1000 [ 40%]  (Warmup)
Iteration: 500 / 1000 [ 50%]  (Warmup)
Iteration: 600 / 1000 [ 60%]  (Sampling)
Iteration: 700 / 1000 [ 70%]  (Sampling)
Iteration: 800 / 1000 [ 80%]  (Sampling)
Iteration: 900 / 1000 [ 90%]  (Sampling)
Iteration: 1000 / 1000 [100%]  (Sampling)
Elapsed Time: 0.01 seconds (Warm-up)
              0.01 seconds (Sampling)
              0.02 seconds (Total)

SAMPLING FOR MODEL 'logitReg' NOW (CHAIN 2).

Iteration:   1 / 1000 [  0%]  (Warmup)
Iteration: 100 / 1000 [ 10%]  (Warmup)
Iteration: 200 / 1000 [ 20%]  (Warmup)
Iteration: 300 / 1000 [ 30%]  (Warmup)
Iteration: 400 / 1000 [ 40%]  (Warmup)
Iteration: 500 / 1000 [ 50%]  (Warmup)
Iteration: 600 / 1000 [ 60%]  (Sampling)
Iteration: 700 / 1000 [ 70%]  (Sampling)
Iteration: 800 / 1000 [ 80%]  (Sampling)
Iteration: 900 / 1000 [ 90%]  (Sampling)
Iteration: 1000 / 1000 [100%]  (Sampling)
Elapsed Time: 0.01 seconds (Warm-up)
              0.01 seconds (Sampling)
              0.02 seconds (Total)

SAMPLING FOR MODEL 'logitReg' NOW (CHAIN 3).

Iteration:   1 / 1000 [  0%]  (Warmup)
Iteration: 100 / 1000 [ 10%]  (Warmup)
Iteration: 200 / 1000 [ 20%]  (Warmup)
Iteration: 300 / 1000 [ 30%]  (Warmup)
Iteration: 400 / 1000 [ 40%]  (Warmup)
Iteration: 500 / 1000 [ 50%]  (Warmup)
Iteration: 600 / 1000 [ 60%]  (Sampling)
Iteration: 700 / 1000 [ 70%]  (Sampling)
Iteration: 800 / 1000 [ 80%]  (Sampling)
Iteration: 900 / 1000 [ 90%]  (Sampling)
Iteration: 1000 / 1000 [100%]  (Sampling)
Elapsed Time: 0.01 seconds (Warm-up)
              0.01 seconds (Sampling)
              0.02 seconds (Total)

SAMPLING FOR MODEL 'logitReg' NOW (CHAIN 4).

Iteration:   1 / 1000 [  0%]  (Warmup)
Iteration: 100 / 1000 [ 10%]  (Warmup)
Iteration: 200 / 1000 [ 20%]  (Warmup)
Iteration: 300 / 1000 [ 30%]  (Warmup)
Iteration: 400 / 1000 [ 40%]  (Warmup)
Iteration: 500 / 1000 [ 50%]  (Warmup)
Iteration: 600 / 1000 [ 60%]  (Sampling)
Iteration: 700 / 1000 [ 70%]  (Sampling)
Iteration: 800 / 1000 [ 80%]  (Sampling)
Iteration: 900 / 1000 [ 90%]  (Sampling)
Iteration: 1000 / 1000 [100%]  (Sampling)
Elapsed Time: 0.01 seconds (Warm-up)
              0.01 seconds (Sampling)
              0.02 seconds (Total)

Stan provides a valuable tool for fitting more complex Bayesian models

A framework for model evaluation with Hamiltonian MCMC

Inference for Stan model: logitReg.
4 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=2000.

      mean se_mean  sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
alpha -1.2     0.1 2.9 -8.1 -3.0 -1.0  0.7   4.1   430    1
beta1 16.5     0.3 6.6  5.8 11.7 15.8 20.8  30.5   448    1
beta2  0.8     0.2 3.5 -5.9 -1.4  0.7  2.9   8.3   481    1
lp__  -2.3     0.1 1.4 -5.7 -2.9 -1.9 -1.2  -0.7   451    1

Samples were drawn using NUTS(diag_e) at Wed Apr  2 00:02:24 2014.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Stan provides a valuable tool for fitting more complex Bayesian models

plot of chunk unnamed-chunk-11

Posterior predictive model checking is important for assessing model quality

Comparing data from the model to the observed data, \( y_{obs} \):

Simulating data from the probabilistic model, \( y_{rep} \)
Compare statistics of the \( y_{rep} \) to \( y_{obs} \)

Graphical and quantitative methods are useful

References

Bayesian Data Analysis; 3\( ^{rd} \) edition

Gelman, et al.

Data Analysis Using Regression and Multilevel/Hierarchical Models

Gelman and Hill

Catagorical Data Analysis

Agresti