In this publication I want to show you how to sample from the Bayesian equivalent of a Poisson regression model with the Metropolis sampler using R. So let’s get started…
The Metropolis-Hastings sampling algorithm belongs to the class of Markov Chain Monte Carlo (MCMC) methods, whose main idea is to generate a Markov Chain \(\{X_t \mid t=0,1,2,\ldots\}\) such that its stationary distribution is the target distribution. One of the most frequent applications of this algorithm is sampling from the posterior density in Bayesian statistics, which is the goal of this publication.
The algorithm specifies, for a given state \(X_t\), how to generate the next state \(X_{t+1}\): a candidate point \(Y\) is generated from a proposal distribution \(g(\cdot|X_t)\) and, according to a decision criterion, it is either accepted, so the chain moves to state \(Y\) at time \(t+1\) (that is, \(X_{t+1} = Y\)), or rejected, so the chain stays in state \(X_t\) at time \(t+1\) (that is, \(X_{t+1} = X_t\)).
To learn more about the Metropolis-Hastings algorithm, check this book: Statistical Computing with R.
In the Metropolis algorithm, the proposal distribution is symmetric, that is, the proposal distribution \(g(\cdot|X_t)\) satisfies \(g(X|Y)=g(Y|X)\), so the Metropolis sampler generates the Markov Chain as follows:
Choose a proposal distribution \(g(\cdot|X_t)\). Before choosing it, check the book mentioned above to learn more about the desirable features of this function.
Generate \(X_0\) from the proposal distribution \(g\).
Repeat until the chain has converged to a stationary distribution:
Generate \(Y\) from \(g(\cdot|X_t)\).
Generate \(U\) from Uniform(0, 1).
If \(U \leq \frac{f(Y)}{f(X_t)}\), accept \(Y\) and set \(X_{t+1}=Y\); otherwise set \(X_{t+1}=X_t\). This means that the candidate point \(Y\) is accepted with probability \(\alpha(X_t, Y)=\min\left(1, \frac{f(Y)}{f(X_t)}\right)\).
Increment \(t\).
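To make these steps concrete, here is a minimal sketch of the Metropolis sampler for a toy problem: the target \(f\) is a standard normal density and the proposal is a normal random walk centered at the current state. The target, the proposal standard deviation, and the number of iterations are illustrative choices, not part of the model used later in this publication.
f <- function(x) dnorm(x) # Target density: standard normal
m <- 5000 # Number of iterations
x <- numeric(m + 1)
x[1] <- 0 # Starting state X_0
set.seed(1)
for (t in 1:m){
  Y <- rnorm(1, mean = x[t], sd = 1) # Candidate from the symmetric proposal g(.|X_t)
  U <- runif(1)
  if (U <= f(Y) / f(x[t])){
    x[t + 1] <- Y # Accept: move to the candidate point
  } else {
    x[t + 1] <- x[t] # Reject: stay at the current state
  }
}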
As I mentioned before, we are going to sample from the Bayesian equivalent of a Poisson regression model. The data and the true parameter values are simulated as:
\[\begin{align*} y_i &\sim \text{Poisson}(\lambda_i), \\ \lambda_i &= \exp(\beta_0 + \beta_1 x_i), \\ \beta_0 &\sim \text{Uniform}(-2, 2),\\ \beta_1 &\sim \text{Uniform}(-2, 2),\\ x_i &\sim \text{Uniform}(-2, 2). \end{align*}\]
To estimate parameters in a Bayesian analysis we need the likelihood function of the model of interest; in this case, for the Poisson regression model with \(\lambda_i = \exp(\beta_0 + \beta_1 X_i)\):
\[\begin{align*} L(X,Y|\lambda)&=\prod_{i=1}^{n}\frac{\exp(-\lambda_i)\times\lambda_i^{Y_i}}{Y_i!},\\ L(X,Y|\beta_0, \beta_1)&=\prod_{i=1}^{n}\frac{\exp\left(-e^{\beta_0 + \beta_1 X_i}\right)\times\left(e^{\beta_0 + \beta_1 X_i}\right)^{Y_i}}{Y_i!}. \end{align*}\]
Now we have to specify a prior distribution for each parameter \(\beta_0\) and \(\beta_1\). We are going to use vague normal priors for the two parameters, \(\beta_0 \sim \mathcal{N}(0, 100)\) and \(\beta_1 \sim \mathcal{N}(0, 100)\), where 100 is the variance:
\[ \pi(\beta_0)=\frac{1}{\sqrt{2\pi(100)}}\exp\left(\frac{-(\beta_0)^{2}}{2(100)}\right),\\ \pi(\beta_1)=\frac{1}{\sqrt{2\pi(100)}}\exp\left(\frac{-(\beta_1)^{2}}{2(100)}\right). \]
Finally, we define the posterior distribution as proportional to the product of the likelihood and the prior distributions:
\[ p(\beta_0, \beta_1 | X,Y) \propto L(X,Y|\beta_0, \beta_1)\times \pi(\beta_0)\times \pi(\beta_1). \]
The posterior distribution will be the target distribution when using the Metropolis sampler.
Here you are going to learn how to implement the Metropolis sampler in R to sample from the posterior distribution of the parameters \(\beta_0\) and \(\beta_1\).
First, we generate the data from the Poisson regression model presented above:
n <- 1000 # Sample size
J <- 2 # Number of parameters
set.seed(159)
X <- runif(n, -2, 2) # Generate values for the independent variable
beta <- runif(J, -2, 2) # Generate true values for the parameters
lambda <- exp(beta[1] + beta[2] * X)
y <- rpois(n, lambda = lambda) # Generate values for the dependent variable
Now we define the likelihood function. In this case we will work with its logarithm, which is highly recommended to avoid numerical underflow while running the algorithm:
LikelihoodFunction <- function(param){
  beta0 <- param[1]
  beta1 <- param[2]
  lambda <- exp(beta0 + beta1 * X)
  # Log-likelihood of the Poisson regression model
  loglikelihoods <- sum(dpois(y, lambda = lambda, log = TRUE))
  return(loglikelihoods)
}
Next we define the prior distributions for the parameters \(\beta_0\) and \(\beta_1\). As with the likelihood function, we are going to use the logarithm of the prior distributions:
LogPriorFunction <- function(param){
  beta0 <- param[1]
  beta1 <- param[2]
  beta0prior <- dnorm(beta0, mean = 0, sd = sqrt(100), log = TRUE)
  beta1prior <- dnorm(beta1, mean = 0, sd = sqrt(100), log = TRUE)
  return(beta0prior + beta1prior) # Sum of the log-priors
}
As we are working with logarithms, we define the log-posterior as the sum of the log-likelihood and the log-priors. Remember that this is the logarithm of our target function \(f(\cdot)\) from which we want to sample.
PosteriorFunction <- function(param){
  return(LikelihoodFunction(param) + LogPriorFunction(param))
}
Next, we define the proposal distribution \(g(\cdot|X_t)\). As we are working with the Metropolis sampler, the proposal distribution must be symmetric and depend on the current state of the chain, so we will use a normal distribution with mean equal to the parameter values at the current state.
ProposalFunction <- function(param){
  return(rnorm(2, mean = param, sd = 0.01))
}
Finally, we write the function that runs the Metropolis sampler. In this case, as we are working with logarithms, we must compute the probability that the candidate point \(Y\) is accepted as:
\[ \alpha(X_t, Y)=\min\left(1, \exp\left(\log f(Y)-\log f(X_t)\right)\right). \]
RunMetropolisMCMC <- function(startvalue, iterations){
  # Create an array to store the chain values
  chain <- array(dim = c(iterations + 1, 2))
  chain[1, ] <- startvalue # Starting value of the chain
  for (i in 1:iterations){
    # Generate the candidate point Y from the proposal function
    Y <- ProposalFunction(chain[i, ])
    # Probability that the candidate point is accepted
    probability <- exp(PosteriorFunction(Y) -
                         PosteriorFunction(chain[i, ]))
    # Decision criterion to accept or reject Y
    if (runif(1) < probability) {
      chain[i + 1, ] <- Y
    } else {
      chain[i + 1, ] <- chain[i, ]
    }
  }
  return(chain)
}
As the MCMC chain is strongly autocorrelated, in the short run it may produce samples that are unrepresentative of the true underlying posterior distribution. In order to decrease the autocorrelation we can thin the sample, keeping only every nth value of the chain. In this case, we will keep a value for our final chain every 20 iterations of the algorithm. To learn more about this method check this publication: Thinning to reduce autocorrelation: Rarely useful.
startvalue <- c(0, 0) # Starting value of the chain
iterations <- 20 * 10000
chain <- RunMetropolisMCMC(startvalue, iterations)
# Keep one value every 20 iterations for the final (thinned) chain
cfinal <- chain[seq(20, iterations, by = 20), ]
# Number of initial draws of the thinned chain to discard as burn-in
burnIn <- 5000
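We then discard the first burnIn draws of the thinned chain before summarizing it; a minimal sketch (the object name cfinalKept is my own):
# Drop the burn-in period from the thinned chain (cfinalKept is an illustrative name)
cfinalKept <- cfinal[-(1:burnIn), ]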
Here you can observe the ACF plot, which shows the autocorrelation of a series with its lagged values. In this case, we present ACF plots of the two parameters for the initial MCMC chain and for the final chain after thinning. From the graphs we can conclude that the thinning procedure does significantly decrease the autocorrelation.
To learn more about ACF check this publication: Significance of ACF and PACF Plots In Time Series Analysis.
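The ACF plots can be reproduced with the acf() function from base R; a minimal sketch, with panel titles of my own choosing:
# ACF of the original chain and of the thinned chain for both parameters
par(mfrow = c(2, 2))
acf(chain[, 1], main = "beta0, original chain")
acf(chain[, 2], main = "beta1, original chain")
acf(cfinal[, 1], main = "beta0, thinned chain")
acf(cfinal[, 2], main = "beta1, thinned chain")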
In this section, we present the chain generated by the Metropolis sampler and its distribution for the parameters \(\beta_0\) and \(\beta_1\). The true value of each parameter is represented by a red line.
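Figures of this kind can be drawn with base R graphics; a minimal sketch, assuming the burn-in has been applied as above (the layout and titles are my own):
# Posterior histograms and trace plots, with the true values in red
par(mfrow = c(2, 2))
hist(cfinalKept[, 1], breaks = 30, main = "Posterior of beta0", xlab = "beta0")
abline(v = beta[1], col = "red")
hist(cfinalKept[, 2], breaks = 30, main = "Posterior of beta1", xlab = "beta1")
abline(v = beta[2], col = "red")
plot(cfinalKept[, 1], type = "l", main = "Chain of beta0", ylab = "beta0")
abline(h = beta[1], col = "red")
plot(cfinalKept[, 2], type = "l", main = "Chain of beta1", ylab = "beta1")
abline(h = beta[2], col = "red")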
Now we compare the results obtained using the Metropolis sampler with the \(glm()\) function, which is used to fit generalized linear models.
mod <- glm(y ~ X, family = poisson())
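For reference, here is one way the comparison table below can be assembled, assuming the posterior means are computed from the thinned chain after burn-in:
# True values, posterior means from the Metropolis sampler, and glm() estimates
comparison <- cbind("True value" = beta,
                    "Mean MCMC" = colMeans(cfinalKept),
                    "glm" = coef(mod))
rownames(comparison) <- c("beta0", "beta1")
comparison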
The next table presents the true values of the parameters, the averages of the estimates obtained using the Metropolis sampler, and the \(glm()\) estimates:
##       True value Mean MCMC       glm
## beta0  1.0578047 1.0769213 1.0769789
## beta1  0.8113144 0.8007347 0.8009269
From the results, we can conclude that the estimates of the parameters \(\beta_0\) and \(\beta_1\) of the Poisson regression model obtained with the Metropolis sampler and with the \(glm()\) function are very similar, and both are close to the actual values of the parameters. Also, it is important to realize that the choice of the prior distributions, the proposal distribution, and the initial values of the chain significantly influences the results, so this choice must be made carefully.