Applied Bayesian Lab 4

Complete all Questions and submit your final PDF or html (either works!) under Assignments in Canvas.

Goal

In our last class, we encountered a scenario we hadn’t seen before. We had not one but two parameters that we needed to estimate in our Bayesian model, which meant that our posterior distribution had two random variables in it.

When we encounter a scenario when the posterior distribution is not a known distribution (like a Gamma, a Beta, a Normal, etc.), we are not able to directly use properties of that distribution when we do our Bayesian analysis. In these situations, we need to use something called Markov Chain Monte Carlo (MCMC) to allow us to estimate the posterior distribution for our parameters of interest and hence perform inference on those parameters.

Today, we are going to explore one specific type of MCMC called Gibbs sampling. We will practice (1) deriving the full conditionals that we need, (2) coding the Gibbs sampler in R, and (3) working with the samples that we obtain from our Gibbs sampler to approximate our posterior of interest.

The Data

Today we are interested in modeling the average number of people who click on an advertisement for a new bubble tea shop in Winston-Salem. This advertisement is posted on articles for the online version of the local paper. We have information on \(Y\)= the number of people who have clicked on the ad per day for a random sample of \(n=30\) days.

Now, because \(Y\) is a count random variable, our first instinct is probably to reach for the Poisson-Gamma Bayesian model.

\[Y_1, \dots, Y_{n} \stackrel{iid}{\sim} Poisson(\theta)\]

\[\theta \sim Gamma(\alpha, \beta)\]

That’s a great starting spot! However, today we are told by our client to use something slightly different.

The Requested Bayesian Model

\[Y_1, \dots, Y_{n} \stackrel{iid}{\sim} Poisson(\theta)\]

\[\theta|\beta \sim Gamma(\alpha, \beta)\]

\[\beta \sim Exponential(c)\]

Question 1

Show that choosing the prior \(\beta \sim Gamma(1, c)\) is the same as choosing the prior \(\beta \sim Exponential(c)\). Hint: Look up the Exponential pdf and the Gamma pdf.

So, essentially what our client has asked us to do is to treat \(\alpha\) as a known constant. However, they do not want to set a fixed \(\beta\) hyperparameter. Instead, they want to treat \(\beta\) as a random variable.

Why would we do that?? Well, this allows us to actually model our uncertainty about \(\beta\). We are moving towards hierarchical models in this course, where we will want to be able to set priors on hyperparameters. This is a good first step towards those models!

Starting the Process

Now that we have our Bayesian model, let’s write out the posterior distribution. We have two parameters \(\theta\) and \(\beta\) that we need to estimate. This means that for Bayesian inference, we want to use the following posterior calculations:

\[P(\theta, \beta| Y_1, \dots, Y_n) = \frac{P(Y_1,\dots,Y_n| \theta,\beta)P(\theta,\beta)}{P(Y_1,\dots,Y_n)}\] Hmmm. Let’s see if we can simplify that a bit. Using properties of joint probability, we can write

\[P(\theta,\beta) = P(\theta|\beta)P(\beta)\] Also, the distribution for \(Y_i\) does not involve \(\beta\), so

\[P(Y_1,\dots,Y_n| \theta,\beta) = P(Y_1,\dots,Y_n| \theta)\]

This gives us

\[P(\theta, \beta| Y_1, \dots, Y_n) = \frac{P(Y_1,\dots,Y_n| \theta)P(\theta|\beta)P(\beta)}{P(Y_1,\dots,Y_n)}\]

We know all the parts of the numerator! However, the denominator is going to prove problematic. Even more of an issue is that fact that with two parameters, this is not going to give us a known distribution. Because of this, we will not be able to directly use properties of a known distribution to perform posterior inference.

In these settings, we need to approximate the posterior distribution through a process called Markov Chain Monte Carlo (MCMC). For today, we will use Gibbs sampling to help us approximate the posterior.

The Full Conditionals

Okay, so the goal is approximate the posterior. This means we really want to draw samples from

\[P(\theta, \beta| Y_1, \dots, Y_n) = \frac{P(Y_1,\dots,Y_n| \theta)P(\theta|\beta)P(\beta)}{P(Y_1,\dots,Y_n)}\]

We can then use these samples to do posterior inference on \(\theta\) and \(\beta\). However, we can’t directly sample from this distribution (it’s not one that is known) so we need to figure out another way to get the samples that we need.

We know that the denominator just plays the role of the normalizing constant, so

\[P(\theta, \beta| Y_1, \dots, Y_n) \propto P(Y_1,\dots,Y_n| \theta)P(\theta|\beta)P(\beta)\]

Question 2

Simplify \[P(\theta, \beta| Y_1, \dots, Y_n) \propto P(Y_1,\dots,Y_n| \theta)P(\theta|\beta)P(\beta)\] By “simplify” I mean fill in the distributions for \(P(\beta)\), \(P(\theta|\beta)\) and \(P(Y_1,\dots,Y_n| \theta)\). Remember anything that does not involve a parameter \(\theta\) or \(\beta\) can be absorbed into the proportionality.

So, this is the distribution that we would like to sample from…but we can’t because there are two parameters. We can’t sample from this, but we can approximate a draw from this distribution using Gibbs Sampling.

The First Full Conditional

Gibbs sampling works as follows. We start with an initial guess at \(\theta\), and we call this guess \(\theta_{(0)}\). This is how we initialize, or start, our Markov Chain:

\[\theta_{(0)} \rightarrow ?\]

Our posterior of interest has two random variables, \(\theta\) and \(\beta\). We now have \(\theta\), so we need \(\beta\). Ideally, we would like to not have to guess at \(\beta\). Instead, we would like to choose \(\beta\) conditional on our starting guess at \(\theta\) and the data \(Y_1, \dots, Y_n\). Why do we want to condition on the data? We want to do this because we want to draw a sample from the posterior distribution.

This means we are going to draw our sample of \(\beta\) from its full conditional distribution:

\[\begin{aligned} P(\beta| \theta, Y_1, \dots, Y_n) & \propto \underbrace{P(Y_1,\dots,Y_n, \theta | \beta )}_\text{The Joint Likelihood}P(\beta) \\ & = \underbrace{P(Y_1, \dots,Y_n |\theta,\beta)P(\theta|\beta)}_\text{The Joint Likelihood}P(\beta) \\ & = \underbrace{P(Y_1, \dots,Y_n |\theta)P(\theta|\beta)}_\text{The Joint Likelihood}P(\beta) \end{aligned}\]

We are hoping that this full condition is a known distribution. If it is, we can (1) plug in \(\theta = \theta_{(0)}\) and then (2) draw a sample from this distribution to take a first guess at what \(\beta\) might we. We will call this sample \(\beta_{(1)}\).

Question 3

Show that the full conditional for \(\beta\) is a Gamma distribution. Show your work and clearly state the form of the posterior hyperparameters. Hint: I strongly recommend that you use the proportionality trick here. This means that anything that does not involve \(\beta\) can be absorbed into the proportionality.

So now, we have an estimate of \(\theta\) (\(\theta_{(0)}\)) and an estimate of \(\beta\) (\(\beta_{(1)}\)). This means we have now sampled:

\[\theta_{(0)} \rightarrow \beta_{(1)}|\theta_{(0)}\]

You will notice that our samples are not independent. A Markov chain is a sequence of dependent values.

Do we stop here??? Nope.

The Second Full Conditional

Our first guess at \(\theta\), \(\theta_{(0)}\), didn’t actually use the dependence between \(\theta\) and \(\beta\). Our prior for \(\theta\) is conditional on \(\beta\), so we know that there is a dependence between the two parameters. So, now that we have a sample \(\beta_{(1)}\) for \(\beta\), we want to update our guess at \(\theta\).

\[\underbrace{\theta_{(0)}}_\text{Initialized Value} \rightarrow \beta_{(1)}|\theta_{(0)} \rightarrow \underbrace{\theta_{(1)}|\beta_{(1)}}_\text{Updated Sample} \]

As we did for \(\beta\), this means we want to draw our sample \(\theta_{(1)}\) conditional on \(\beta_{(1)}\) and \(Y_1, \dots, Y_n\). This means we need to find the full conditional distribution for \(\theta\):

\[P(\theta| \beta, Y_1, \dots, Y_n) \propto P(Y_1, \dots, Y_n, \beta|\theta) P(\theta|\beta) = P(Y_1,\dots, Y_n|\theta)P(\theta|\beta)\]

If this is a known distribution, too, we can then (1) plug in \(\beta = \beta_{(1)}\) and (2) draw a sample of \(\theta\), and call this sample \(\theta_{(1)}\).

Question 4

Find the full conditional for \(\theta\). Show your work and clearly state what distribution you end up with as well as any needed hyperparameters. Hint: I strongly recommend that you use the proportionality trick here. Remember that anything that does not involve \(\theta\) can be absorbed into the proportionality.

By sampling from the distribution in Question 4, we have just updated our belief about \(\theta\) from \(\theta_{(0)}\) to \(\theta_{(1)}\).

\[\theta_{(0)} \rightarrow \theta_{(1)}|\beta_{(1)}\]

Okay, so \(\theta\) is updated now. What about \(\beta\)? Well, we can now use \(\theta_{(1)}\) to update our initial draw of \(\beta_{(1)}\) by sampling from

\[P(\beta| \theta_{(1)}, Y_1, \dots, Y_n)\]

We then have

\[{\theta_{(0)}} \rightarrow \overbrace{\beta_{(1)}|\theta_{(0)}}^\text{First Sample} \rightarrow {\theta_{(1)}|\beta_{(1)}} \rightarrow \overbrace{\beta_{(2)}|\theta_{(1)}}^\text{Updated Sample}\]

Growing the Chain

And we keep going! Starting from this first guess \(\theta_{(0)}\), we update our beliefs about \(\theta\) and \(\beta\), switching back and forth between one parameter and the other to create our Markov Chain.

When we are done with \(S\) iterations of this process, we will have \(S\) samples for \(\theta\):

\[\theta_{(1)} \rightarrow \dots \rightarrow \theta_{(S-1)} \rightarrow \theta_{(S)}\]

Notice that we do not include the initialized value.

We will also have \(S\) samples of \(\beta\):

\[\beta_{(1)} \rightarrow \dots \rightarrow \beta_{(S-1)} \rightarrow \beta_{(S)}\]

These draws for \(\beta\) and \(\theta\) we get will approximate the draws we would have gotten if we could have sampled from the joint posterior that we wanted in the first place!!! So, if we can get our hands on the full conditionals, we can do posterior inference!!

Coding the Gibbs Sampler

Now that we have done that math that we need, let’s try coding this up! Things like Gibbs samplers tend to make more sense when we can see the algorithm running, so let’s do it.

Some last important details: We are told that in our \(n=30\) days of data, the average number of people who clicked on the ad each day was 87. The client has told us to set \(\alpha = 1125\) and \(c = .05\).

Question 5

What value would you suggest we use for \(\theta_{(0)}\)? Explain your reasoning.

Question 6

Based on the given prior hyperparameters, what is the prior mean for \(\beta\)?

Now we are going to run just one iteration of the Gibbs sampler to start. This means we

    1. Initialize with \(\theta = \theta_{(0)}\).
    1. Draw \(\beta_{(1)}\) from \(\beta|\theta_{(0)}, Y_1 = y_1, \dots, Y_n =y_n\)
    1. Draw \(\theta_{(1)}\) from \(\theta|\beta_{(1)}, Y_1 = y_1, \dots, Y_n =y_n\)

Question 7

Set a random seed of 365. Using the given information and your choice of \(\theta_{(0)}\), draw a random sample from the full conditional of \(\beta\). We will call this sampled value \(\beta_{(1)}\). State the value of \(\beta_{(1)}\).

Question 8

Set a random seed of 365. Using the given information and your sampled \(\beta_{(1)}\) from the previous question, draw a random sample from the full conditional of \(\theta\). We will call this sampled value \(\theta_{(1)}\). State the value of \(\theta_{(1)}\).

This is the core of the code that we need! Now that we can do one iteration, let’s perform our Gibbs sampler for \(S = 10000\) iterations.

Question 9

Set a random seed of 365. Create a Gibbs sampler code that creates \(S = 10000\) samples of \(\theta\) and \(\beta\). Show your code as the answer to this question. Hint: Use the code from Class 15 if you need a starting point! Do NOT print out all your posterior samples.

When you finish Question 9, you should have an 10000 by 2 data frame filled with your posterior draws for \(\theta\) and \(\beta\). Now what??

Assessing Convergence

We now have the results of our Gibbs sampler - fantastic!! However, there is one thing that we need to do before we can use our results. We need to pause and assess convergence. This means we need to check to make sure that the results of our draws look fairly consistent from sample to sample.

We don’t want the results to be identical (we need to see the variability from sample to sample in order to estimate the distribution of the parameter). However, wildly different values for each sample can mean that we need to run the process longer.

As we discussed in class, one method that we use to assess convergence is a trace plot.

Question 10

Create a trace plot for your draws of \(\beta\). Does it look like your sampler has converged for \(\beta\)? Explain your reasoning.

Question 11

Create a trace plot for your draws of \(\theta\). Does it look like your sampler has converged for \(\theta\)? Explain your reasoning.

It is important to check for convergence before we proceed with any posterior inference. If you do not check, and the sampler has not converged, your draws from your full conditionals will not approximate draws from your posterior of interest.

There are two other things that we are going to explore before we start using our posterior samples.

But what if…we changed the intialization?

We had to make a few choices in running our Gibbs sampler. The first was that we needed to choose a first guess at \(\theta\), \(\theta_{(0)}\). What if we guessed really poorly? Would that impact our results?

Let’s explore this by choosing a really unlikely initialization for \(\theta\).

Question 12

Initialize \(\theta_{(0)} = 2\). Run your Gibbs sampler (remember to change the name of your data frame or you will lose your original draws!) and show your trace plots for \(\theta\) and \(\beta\). Comment on what you see in your trace plots.

When we have a part of the trace plot where it looks like the draws have not yet converged, we call these draws burn-in samples. These samples represent the sampler attempting to get draws that approximate the posterior, but it is has not quite gotten to the values it needs yet. Before we do posterior inference, we need to disregard these draws.

But what if…we changed the update order?

In our Gibbs sampler, we chose to update \(\beta\) first. Why? Because it was easier to guess at an initial \(\theta\) than it would be in this scenario to guess at an initial \(\beta\).

Question 13

Explain briefly why in this scenario is it easier to guess at an initial value for \(\theta\) rather than \(\beta\).

However, let’s suppose we started with a guess \(\beta_{(0)}\) and updated \(\theta\) first. Would that have impacted our posterior inference?

Let’s start with a reasonable guess at \(\beta\).

Question 14

Initialize \(\beta_{(0)} = 12\). Run your Gibbs sampler (remember to change the name of your data frame) and show your trace plots for \(\theta\) and \(\beta\). Comment on what you see in your trace plots.

Now, let’s put \(\beta\) in an unreasonable part of the space.

Question 15

Initialize \(\beta_{(0)} = 1\) or \(\beta_{(0)} = 200\) (your choice!). Run your Gibbs sampler and show your trace plots for \(\theta\) and \(\beta\). Comment on what you see in your trace plots.

It is important to note that choosing poor initial values has stronger effects in some posterior analysis than others, depending on the complexity of the model. We will explore this more as we move through more types of models!

Using the Samples

At this point, we have explored convergence under a variety of settings. Let’s return to our original Gibbs sampler (when we initialized \(\theta_{(0)}\)) and do some posterior analysis!

We have posterior samples for \(\beta\) and \(\theta\), and we could do inference on both. \(\theta\) is really the parameter of interest, but let’s explore both for now.

Question 16

Create a plot to visualize the distribution of your draws of \(\beta\).

Question 17

Create a plot to visualize the distribution of your draws of \(\theta\).

Question 18

Create and interpret a 95% HPD for \(\beta\). Hint: Use the following code:

suppressMessages(library(HDInterval))
HDInterval::hdi( GibbsSamples$beta, credMass=.95)

Question 19

Create and interpret a 95% HPD for \(\theta\). Note: We know what \(\theta\) represents in the context of the question, so your interpretation should not involve the word (or symbol) \(\theta\).

Question 20

Do we have convincing evidence that the average number of clicks on the ads per day is more than 85? Clearly explain your steps and conclusion. Note: You may not use credible intervals to answer this question.

Turning in your assignment

When your Markdown document is complete, do a final knit and make sure everything compiles. You will then submit your document on Canvas. You must submit a PDF or html document to Canvas. No other formats will be accepted. Make sure you look through the final document to make sure everything has knit correctly.

Creative Commons License
This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2023 March 13.