Applied Bayesian Lab 2

Complete all Questions and submit your final PDF or HTML (either works!) under Assignments in Canvas.

Goal

We have been exploring a variety of things you can do with Bayesian models, and last class we were introduced to a new model: the Poisson-Gamma. We will start our lab by exploring the Poisson-Gamma model a little more.

In our lab today we will also dig a little more deeply into posterior predictive distributions. This is a tool we use in Bayesian statistics to predict the outcome for a new data point (a new observation or new trial). We are going to start by thinking about posterior predictive distributions (PPDs) in terms of the Beta-Binomial. Why? Because we have already derived the form! We will come back around to the Poisson-Gamma in Homework 4.

Part 1: The Poisson - Gamma

The Data Scenario

We are working with a local animal shelter. They are attempting to determine how many days it takes, on average, for a mixed-breed puppy that is brought into the shelter to be adopted. During the height of the pandemic, puppies were adopted extremely quickly. However, now the adoption rate is slower, and understanding how long it might take a puppy to be adopted now is critical for determining the number of foster parents that are needed to care for the puppies until they can be placed for adoption.

Let \(Y_i\) be a random variable that indicates the number of days it takes a mixed-breed puppy to be adopted (once the puppy is old enough for adoption). We have information on \(n = 25\) puppies that were brought into the animal shelter system during the same weekend. All of these puppies are mixed breeds that were roughly the same age when they were brought to the shelter. They all became eligible for adoption on the same day.

Question 1

Why is it important to know that the puppies were brought in on the same weekend and are about the same age?

Based on the work we have done in class, we know that one possible sampling model we could use is

\[Y_1, \dots, Y_{25} \stackrel{iid}{\sim} Poisson(\theta).\]

When we choose this model, we assume that it is reasonable to use one parameter, \(\theta\), to represent the average number of days it takes a mixed-breed puppy to be adopted. \(\theta\) also represents the variance in the number of days it takes a puppy to be adopted. For today, you may take these assumptions as reasonable.

Question 2

Suppose we are told that \(\theta = 15\). Using our sampling model, what is the probability that it takes one mixed-breed puppy exactly 21 days to be adopted? Hint: This means just use \(Y_i|\theta \sim Poisson(\theta)\).
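As a reminder, the Poisson pmf can be evaluated directly in R with dpois(). The numbers below are made up purely to show the syntax; they are not the values in this question.

dpois(3, lambda = 5)  # illustration only: P(Y = 3) when Y ~ Poisson(5)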

The Prior

Now that we have the sampling model established, it is time to bring in the prior. As we saw in class,

\[\theta \sim Gamma(\alpha,\beta)\] is a convenient prior choice. We have discussed that we use the shape-rate parameterization of Gamma random variables in Bayesian statistics. In this parameterization, \(\alpha\) is called the shape hyperparameter while \(\beta\) is called the rate hyperparameter.

However…what does changing these hyperparameters actually do??

Exploring the Gamma

To plot the pdf of a Gamma random variable, we are going to use a function modified from the bayesrules package in R. To load the function, create a chunk, copy and paste the following inside the chunk, and press play.

plot_gamma <- function (shape, rate, titlein, mean = FALSE, mode = FALSE){
    suppressMessages(library(ggplot2))
    x_min <- qgamma(1e-25, shape, rate)
    x_max <- qgamma(0.99999, shape, rate)
    p <- ggplot(data = data.frame(x = c(x_min, x_max)), aes(x)) + 
        stat_function(fun = dgamma, n = 101, args = list(shape = shape, 
            rate = rate)) + labs(x = expression(theta), y = "Density", title = titlein) + theme_light()
    if (mean == TRUE & mode == FALSE) {
        mean <- shape/rate
        p <- p + geom_segment(aes(x = mean, y = 0, xend = mean, 
            yend = dgamma(mean, shape, rate), linetype = "mean")) + 
            scale_linetype_manual(values = c(mean = "solid")) + 
            theme(legend.title = element_blank()) + theme_light()
    }
    if (mean == FALSE & mode == TRUE) {
        if (shape >= 1) {
            mode <- (shape - 1)/rate
            p <- p + geom_segment(aes(x = mode, y = 0, xend = mode, 
                yend = dgamma(mode, shape, rate), linetype = "mode")) + 
                scale_linetype_manual(values = c(mode = "dashed")) + 
                theme(legend.title = element_blank()) + xlim(x_min, 
                x_max)
        }
        else {
            stop("In order to plot the mode the shape parameter must be greater than\n           or equal to 1.")
        }
    }
    if (mean == TRUE & mode == TRUE) {
        mean <- shape/rate
        if (shape >= 1) {
            mode <- (shape - 1)/rate
            p <- p + geom_segment(aes(x = mean, y = 0, xend = mean, 
                yend = dgamma(mean, shape, rate), linetype = "mean")) + 
                geom_segment(aes(x = mode, y = 0, xend = mode, 
                  yend = dgamma(mode, shape, rate), linetype = "mode")) + 
                scale_linetype_manual(values = c(mean = "solid", 
                  mode = "dashed")) + theme(legend.title = element_blank())
        }
        else {
            stop("In order to plot the mode the shape parameter must be greater than\n           or equal to 1.")
        }
    }
    p
}

To plot a Gamma pdf in R using the function we have just loaded, you use the following:

plot_gamma(shape, rate, "Put your Title here")

You will replace shape and rate with the numeric values of the hyperparameters. You can also title the graph as you like using the 3rd argument in the function. Note: Make sure your chosen title is in quotation marks.
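For example (the hyperparameter values here are arbitrary and just for illustration), the following plots the pdf of a Gamma(2, 10) random variable and uses the optional fourth argument to mark its mean:

plot_gamma(shape = 2, rate = 10, titlein = "Example: Gamma(2, 10) pdf", mean = TRUE)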

Question 3

Using the function above, explore what happens when you change the shape hyperparameter \(\alpha\) in a Gamma pdf. Set \(\beta = 10\) and play with at least 4 choices of \(\alpha\). Show the graphs you create. Make sure each plot has a title (Figure 1A, Figure 1B, etc.), and stack the four plots. Note: If you need a reminder on how to do this, check here.

Question 4

Repeat the same process, but this time set \(\alpha = 10\) and adjust your rate hyperparameter \(\beta\). Hint: This is a little more subtle. To really see the results, consider limiting the x-axis range of all of your graphs to be the same by adding + xlim(c(lower, upper)), as shown in the sketch below. Replace lower with the lower bound you want on the x-axis for all plots and upper with the upper bound you want on the x-axis for all plots.
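In case it helps, here is a minimal sketch of how that hint can be applied; the hyperparameter and axis values below are arbitrary placeholders, not the ones you should use.

library(ggplot2)  # xlim() comes from ggplot2
plot_gamma(shape = 10, rate = 5, titlein = "Example with a fixed x-axis") + xlim(c(0, 10))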

In addition to visual checks, we can also look at the measures of central tendency to explore how changing the hyperparameters impacts the shape of the Gamma pdf.

Mean

\[E(\theta) = \frac{\alpha}{\beta}\]

Mode

\[Mode(\theta) = \frac{\alpha - 1}{\beta}, \alpha > 1\]

Standard Deviation

\[StDev(\theta) = \frac{\sqrt{\alpha}}{\beta}\]
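If you would like to check your visual impressions against these formulas, a small helper like the one below computes all three quantities; the hyperparameter values shown are arbitrary examples.

# Mean, mode, and standard deviation of a Gamma(shape, rate) random variable
gamma_summary <- function(shape, rate) {
  c(mean = shape / rate,
    mode = ifelse(shape > 1, (shape - 1) / rate, NA),
    sd = sqrt(shape) / rate)
}
gamma_summary(shape = 5, rate = 10)   # arbitrary example
gamma_summary(shape = 20, rate = 10)  # arbitrary example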

Question 5

Based on what you saw in your graphs as well as the measures of central tendency above, describe what happens to the Gamma pdf when you increase the shape parameter \(\alpha\).

Question 6

Based on what you saw in your graphs as well as the measures of central tendency above, describe what happens to the pdf when you increase the rate parameter \(\beta\).

The Gamma Prior

The next step is to choose the hyperparameters for our prior.

\[\theta \sim Gamma(\alpha,\beta)\]

To help us do this, we speak to an individual who works in the shelter. They tell us that before the pandemic, it typically took about 18 days for a mixed-breed puppy to be adopted once it became eligible for adoption.

Question 7

The expert from the shelter suggests a rate hyperparameter of \(\beta = 10\). Based on this, what value would you suggest we use for \(\alpha\)? Explain your choice.

Question 8

Create a plot to visualize the pdf of your prior. In what range of \(\theta\) values have you allocated most of the prior mass?

Question 9

Comment on the prior sample size and explain why it might make sense that this is smaller than the data sample size in this scenario.

The Gamma Posterior

At this point, we have a sampling model, and we have our prior with hyperparameters. It is time to look at the posterior!

In our data from \(n = 25\) puppies, it took an average of 20 days for a puppy to be adopted.

Question 10

State the posterior distribution for \(\theta\) given this data. Note: This means state the distribution and the numeric value for any hyperparameters.

Question 11

State the posterior sample size.

There is also a function in the bayesrules package that allows you to visualize the prior density function, a scaled version of the likelihood and the posterior density function all on the same graph. I’ve adapted the function slightly to allow you to add titles to your plots.

plot_poisson_gamma <- function (shape, rate, sum_y = NULL, n = NULL, prior = TRUE, likelihood = TRUE, posterior = TRUE, titlein) 
{
    suppressMessages(library(ggplot2))
    if (is.null(sum_y) | is.null(n)) 
        warning("To visualize the posterior,\n            specify information about the data: sum_y and n")
    x_min <- min(qgamma(1e-05, shape, rate), qgamma(1e-05, shape + 
        sum_y, rate + n), qgamma(1e-05, sum_y + 1, n))
    x_max <- max(qgamma(0.99999, shape, rate), qgamma(0.99999, 
        shape + sum_y, rate + n), qgamma(0.99999, sum_y + 1, 
        n))
    g <- ggplot(data = data.frame(x = c(x_min, x_max)), aes(x)) + 
        labs(x = expression(theta), y = "density", title = titlein) + scale_fill_manual("", 
        values = c(prior = "#f0e442", `(scaled) likelihood` = "#0071b2", 
            posterior = "#009e74"), breaks = c("prior", 
            "(scaled) likelihood", "posterior")) + theme_light()
    if (prior == TRUE) {
        g <- g + stat_function(fun = dgamma, args = list(shape = shape, 
            rate = rate)) + stat_function(fun = dgamma, args = list(shape = shape, 
            rate = rate), geom = "area", alpha = 0.5, aes(fill = "prior"))
    }
    if (!is.null(sum_y) & !is.null(n)) {
        shape_post <- shape + sum_y
        rate_post <- rate + n
        like_scaled <- function(x) {
            dgamma(x, shape = sum_y + 1, rate = n)
        }
    }
    if (!is.null(sum_y) & !is.null(n) & (likelihood != FALSE)) {
        g <- g + stat_function(fun = like_scaled) + stat_function(fun = like_scaled, 
            geom = "area", alpha = 0.5, aes(fill = "(scaled) likelihood"))
    }
    if (!is.null(sum_y) & !is.null(n) & posterior == TRUE) {
        g <- g + stat_function(fun = dgamma, args = list(shape = shape_post, 
            rate = rate_post)) + stat_function(fun = dgamma, 
            args = list(shape = shape_post, rate = rate_post), 
            geom = "area", alpha = 0.5, aes(fill = "posterior"))
    }
    g
}

To use the function, use the following structure:

plot_poisson_gamma( shape = , rate = , sum_y = , n = , titlein = "The Title you want")
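For example, a call filled in with made-up values (a Gamma(3, 1) prior and data with sum_y = 40 over n = 10 observations; these are not the values for this lab) would look like:

plot_poisson_gamma(shape = 3, rate = 1, sum_y = 40, n = 10, titlein = "Example: prior, scaled likelihood, and posterior")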

Question 12

Using the code above, plot the prior for \(\theta\), the posterior for \(\theta\), and the scaled likelihood. Comment on how the posterior for \(\theta\) differs from the prior.

Question 13

Compute and interpret a 95% HPD interval for \(\theta\).

With this information, we can start helping our client determine how long foster parents may be needed to care for puppies!

Part 2: Making predictions

Now a new puppy is brought into the shelter. This puppy’s age and mixed-breed characteristics are similar to those from our sample. Based on the data we have, we want to predict how many days it will take this puppy to get adopted so they can be placed with an appropriate foster home.

This is an example of a common goal in statistics: prediction. Based on data that we have observed, we want to predict what will happen next.

We have seen how to do this once before, in the context of a different model: the Beta-Binomial. We are going to go back to that model for lab today. We will explore prediction with the Poisson-Gamma in Homework 4.

The Data Scenario: Beta-Binomial

A common problem in koala wildlife rehabilitation is orphaned or abandoned infants (called joeys). In order to care for koala joeys who do not have mothers or whose mothers have rejected them, these joeys are often placed with surrogate koala mothers. However, it is very difficult to predict which surrogate parent might bond with a specific koala joey.

A researcher has built a new statistical model to try and match koala joeys with surrogate mothers. They use the model to match \(n = 50\) koala joeys with prospective mothers, and they record whether or not the surrogate mother accepts the joey.

Now, a new koala joey is brought in. What is the probability that the surrogate mother suggested by the model will accept this joey?

Posterior Predictive Distributions: The Concept

So, based on 50 koalas that have been matched, we want to predict what will happen with this new koala joey. Let’s add some notation to this. Let \(Y_i\) be a binary random variable that represents whether koala surrogate mother \(i\) will bond with the joey that is placed with her by the new model. \(Y_i = 1\) if the mother bonds with the joey and \(Y_i =0\) otherwise. We assume that

\[Y_1, \dots, Y_{50} | \theta \stackrel{iid}{\sim} Bernoulli(\theta),\]

where \(\theta\) is the proportion of surrogate mothers who will bond with the joey placed with them by the new model.

Recall that we already have observations on 50 koalas. Let the outcomes for these 50 koalas be denoted \(y_1, \dots, y_{50}\). Now, we want to predict what will happen with koala 51.

Let \(Y_{51}^{*}\) be a binary random variable that represents whether koala surrogate mother \(51\) will bond with the joey that is placed with her by the new model. We assume that

\[Y_{51}^{*} | \theta \sim Bernoulli(\theta).\]

We also assume that conditional on \(\theta\), \(Y_{51}^{*}\) is independent of \(Y_1, \dots, Y_{50}\).

Question 14

Explain why it might be reasonable to assume \(Y_{51}^{*}\) is independent of \(Y_1, \dots, Y_{50}\), conditional on \(\theta\). Yes, we stated that we are making this assumption, but explain to me why it might make sense that if we know \(\theta\), we may not gain additional information from knowing \(Y_1 = y_1, \dots, Y_n = y_n\).

The problem is…we don’t know \(\theta\). We do know \((Y_1 = y_1, \dots, Y_n=y_n)\). So, based on data that we have observed, we want to predict what will happen next. This means we want to know the distribution of

\[Y_{51}^{*} | (Y_1 = y_1, \dots, Y_n=y_n).\]

This is the posterior predictive distribution (PPD).

Question 15

Now that we have removed the conditioning on \(\theta\), explain why it is no longer reasonable to assume that \(Y_{51}^{*}\) is independent of \(Y_1, \dots, Y_{50}\). In other words, explain to me why it might make sense that if we do NOT know \(\theta\), we would likely gain additional information about \(Y_{n+1}^{*}\) from knowing \(Y_1 = y_1, \dots, Y_n = y_n\).

Okay, so the goal is to find the PPD,

\[Y_{51}^{*} | (Y_1 = y_1, \dots, Y_n=y_n).\]

However, we don’t know this distribution. Is there something close that we do know? Yes!

Stuff we know #1:

\[Y_{51}^{*} | \theta \sim Bernoulli(\theta).\]

Stuff we know #2:

\[Y_{51}^{*} | (Y_1 = y_1, \dots, Y_n=y_n, \theta) \sim Bernoulli(\theta).\]

Question 16

Explain why Stuff we know #2 is true given our modeling assumptions and Stuff we know #1.

Okay, so we know that

\[Y_{51}^{*} | (Y_1 = y_1, \dots, Y_n=y_n, \theta) \sim Bernoulli(\theta).\]

This means that to get the PPD, all we need to do is remove the dependence on \(\theta\).

How do we do that??

A Short Pause to be…Discrete

Suppose you have 1 hour before an exam. You are a person who likes to study in exact 30-minute intervals (go with me here, folks). You are interested in your chances of getting an A on the exam.

Well, there are exactly 3 ways that can happen.

  • You can get an A and not study any more
  • You can get an A and study for 30 minutes
  • You can get an A and study for an hour

Question 17

By the law of total probability, this means we can express \(P(A~on~exam)\) as the sum of which 3 joint probabilities?

See what we have done here?? There is an unknown factor - how long are you going to study? We have removed the need to condition on that unknown variable by adding up all the possible combinations of how the event we want (getting an A) can happen.

Let’s take this one step further. Suppose you get an A on the first exam - go you! Now the second exam comes around, and again you have one hour to study and you like to study in exact 30-minute intervals. We now want to know the probability that you get an A on the second exam, now with the knowledge that you got an A on the first. This means we want to know:

\[P(A~on~second~exam|A~on~first~exam)\]

Now again, we have three different ways this can happen:

  • You can get an A on the 2nd exam and do not study any more, given that you got an A on the first exam.
  • You can get an A on the 2nd exam and study for 30 minutes, given that you got an A on the first exam.
  • You can get an A on the 2nd exam and study for an hour, given that you got an A on the first exam.

Question 18

By the law of total probability, this means we can express \(P(A~on~second|A~on~first)\) as the sum of which 3 joint probabilities?

Back to the Beta-Binomial

The PPD uses exactly this concept! We don’t know what \(\theta\) is, so we want to be able to make predictions without needing to condition on \(\theta\). So, we remove the need for conditioning on it by adding up the probability that, after we observed the original \(n\) results, we can get \(Y_{n+1}^{*} = y_{n+1}^{*}\) for every possible choice of \(\theta\). However, since \(\theta\) is continuous, we need to use an integral rather than a finite sum.

\[P(Y_{51}^{*} | Y_1 = y_1, \dots, Y_n=y_n) = \int_{\Theta} P(Y_{51}^{*} ~ and~\theta| Y_1 = y_1, \dots, Y_n=y_n) d\theta\]

Notice that we do NOT remove the \(Y_1 = y_1, \dots, Y_n=y_n\) from the conditioning. We need that!! Our goal was to predict a new observation based on the observed data, so we need to keep the observed data. Because we know that \(P(A~and~B|C) = P(A|B,C)P(B|C)\), we have:

\[\begin{aligned} P(Y_{51}^{*} & | Y_1 = y_1, \dots, Y_n=y_n) = \int_{\Theta} P(Y_{51}^{*} ~ and~\theta| Y_1 = y_1, \dots, Y_n=y_n) d\theta \\& = \int_{\Theta} P(Y_{51}^{*}| \theta, Y_1 = y_1, \dots, Y_n=y_n) P(\theta| Y_1 = y_1, \dots, Y_n=y_n) d\theta \end{aligned}\]

This is the form of the PPD! Do we know all the pieces we need for this??

Stuff we know #3

\[\theta|(Y_1 = y_1, \dots, Y_n=y_n) \sim Beta(\alpha + \sum_{i=1}^n y_i, n + \beta - \sum_{i=1}^n y_i)\]

So, this means that we have all the pieces that we need!

For the Beta-Binomial, we have already done the math in class, so we know that

\[P(Y_{n+1}^* =1 |y_1,y_2,...y_n) = \frac{\alpha + \sum_{i=1}^n y_i }{\alpha + \beta + n}\]

and

\[P(Y_{n+1}^* =0 |y_1,y_2,...y_n) = \frac{n + \beta - \sum_{i=1}^n y_i}{\alpha + \beta + n}\]

This gives us that the PPD is

\[Y_{n+1}^* | (Y_1 = y_1, \dots, Y_n = y_n) \sim Bernoulli\left( \frac{\alpha + \sum_{i=1}^n y_i }{\alpha + \beta + n}\right)\]

Great!
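As a quick sanity check on this formula (with made-up numbers, not the koala data below), you could compute the posterior predictive probability of a 1 in R:

# Posterior predictive P(Y*_{n+1} = 1 | data) for the Beta-Binomial
# (alpha, beta, n, sum_y below are arbitrary illustration values)
ppd_prob_one <- function(alpha, beta, n, sum_y) {
  (alpha + sum_y) / (alpha + beta + n)
}
ppd_prob_one(alpha = 2, beta = 2, n = 10, sum_y = 6)  # = 8/14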

Question 19

In our koala data setting, \(n = 50\), \(\alpha = 5\), \(\beta=20\), and \(\sum_{i=1}^{n} y_i = 35\). What is the posterior predictive probability that the 51st koala mother who is matched by the model will accept the joey?

This is a nice result! We don’t just state a prediction of 1 or 0. Instead, we can estimate the probability that the predicted value will be 1 or 0. This allows us to represent our uncertainty in our prediction, and this is why PPDs are so useful in practice.

Part 3: Monte Carlo Approximation

Now, this worked out fairly neatly in our case (though it took a lot of algebra). However, in general, the PPD is very difficult, if not impossible, to compute. That means it would be handy to figure out another way to find the posterior predictive distribution without the need to integrate. We can’t find the exact PPD without integration, but we can get very close using something called Monte Carlo approximation.

The goal of Monte Carlo approximation is to approximate the PPD,

\[P(Y_{n+1}^* | Y_1 = y_1, \dots, Y_n = y_n) = \int_{\Theta} P(Y_{n+1}^{*} | \theta) P(\theta | y_1, \dots, y_n) d\theta\]

For our Beta-Binomial model, this means we need to approximate

\[P(Y_{n+1}^* =1 |y_1,y_2,...y_n)\]

and

\[P(Y_{n+1}^* =0 |y_1,y_2,...y_n)\]

How might we do that? Well, one way is to draw a lot of random samples of \(y_j^{*}\), \(j= 1, \dots, s\), from \(P(Y_{n+1}^*|y_1,y_2,...y_n)\) and determine what percent of these draws are 1 and what percent are 0. In other words,

\[P(Y_{n+1}^* =1 |y_1,y_2,...y_n) \approx \frac{1}{s} \sum_{j=1}^{s} y^{*}_{j}\]

By the law of large numbers, this approximation will approach the true probability \(P(Y_{n+1}^* =1 |y_1,y_2,...y_n)\) we want as \(s\) approaches infinity!
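To see this law of large numbers idea in action with a probability we already know (this example is unrelated to the koala data), we can draw many Bernoulli(0.3) samples and watch the proportion of 1s settle near 0.3:

set.seed(1)  # arbitrary seed, for reproducibility
draws <- rbinom(100000, size = 1, prob = 0.3)
mean(draws)  # proportion of 1s; should be close to 0.3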

So, all we need to do is figure out how to get our samples \(y_{j}^{*}\) and we will have a process to approximate the probability we need. Great!! But, how do we get the samples we need?

Well, there is one distribution that we know that we could sample from:

\[Y_{n+1}^* | \theta \sim Bernoulli(\theta)\]

The only problem is, we don’t know \(\theta\) (that’s kind of why we are doing this whole model to begin with!!). What we could do is plug in an estimate of \(\theta\), like the posterior mean or posterior mode. However, if we do this, we treat that estimate like it is the true value of \(\theta\). That means we ignore the fact that \(\theta\) is estimated, and that tends to underestimate our uncertainty about \(Y_{n+1}^{*}\). So, we need another option.

Steps of Monte Carlo Approximation

The way Monte Carlo approximation works is that instead of one estimate of \(\theta\), we use many different values of \(\theta\). We then draw our samples \(y_{j}^{*}\) from

\[Y_{n+1}^* | \theta \sim Bernoulli(\theta)\]

How do we decide what different values of \(\theta\) to use? Well, we draw these values of \(\theta\) from the posterior distribution of \(\theta\). This means that we select values of \(\theta\) that the posterior suggests are likely values.

The steps of Monte Carlo approximation are then as follows.

  • Step 1: Sample a value \(\theta^{*}\) from the posterior distribution \(\theta| y_1, \dots, y_n\).
  • Step 2: Conditional on this sampled value \(\theta^{*}\), sample a value of \(y_{j}^{*}\) from the sampling distribution \(Y_{n+1}^{*} | \theta^{*}\). Store the value of \(y_{j}^{*}\).
  • Step 3: Repeat Steps 1 and 2 over and over.
  • Step 4: Use the draws of \(y_{j}^{*}, j = 1, \dots, s\), to approximate your quantity of interest.

Let’s try this and see what happens.

Step 1:

  • Sample a value \(\theta^{*}\) from the posterior distribution \(\theta| y_1, \dots, y_n\).

In our koala data setting, \(n = 50\), \(\alpha = 5\), \(\beta=20\), and \(\sum_{i=1}^{n} y_i = 35\). This means the posterior distribution for \(\theta\) is

\[\theta|Y_1 = y_1, \dots, Y_n = y_n \sim Beta(\alpha + \sum_{i=1}^n y_i, n- \sum_{i=1}^n y_i + \beta)= Beta(40, 35)\]

Question 20

Draw a random sample from the posterior distribution for \(\theta\). Set a random seed of 100, and state what value of \(\theta^{*}\) you drew.

Step 2:

  • Conditional on this sampled value \(\theta^{*}\), sample a value of \(y_{j=1}^{*}\) from the sampling distribution \(Y_{n+1}^{*} | \theta^{*}\). Store the value of \(y_{j=1}^{*}\).

Ah, now we have a value of \(\theta\)! If \(\theta^{*}\) were the true value of \(\theta\), what value of \(y_{j}^{*}\) would we be likely to draw? This is what we actually need for Monte Carlo approximation - we need a lot of draws of \(y_{j}^{*}\).

Question 21

Draw a random sample from \(P(Y^{*}_{n+1} | \theta^{*})\). State what value you get. Hint: The following code might be helpful: ystar <- sample(c(0,1), 1, prob = c(1-theta_star,theta_star))

So, what have we just seen? Conditional on this particular value of \(\theta=\theta^{*}\), this is the value we would predict for \(Y_{n+1}^{*}\).

But…there are other values of \(\theta\). What would we predict if \(\theta\) were truly one of those other values?

Step 3

  • Repeat Steps 1 and 2 over and over.

This allows us to get the many samples of \(Y_{n+1}^{*}\), but we sample using different estimated values of \(\theta\) to reflect the fact that we really don’t know what \(\theta\) is!

Question 22

Set a random seed of 100. Write code that performs \(s = 100000\) iterations of Step 1 and Step 2. Each time through the process, you need to store your value of \(y_{j}^{*}\). Show your code as the answer to this question.

Step 4

  • Use the draws of \(y_{j}^{*}, j = 1, \dots, s\), to approximate your quantity of interest.

\[P(Y_{n+1}^* =1 |y_1,y_2,...y_n) \approx \frac{1}{s} \sum_{j=1}^{s} y^{*}_{j}\]

Question 23

What proportion of your \(y_{j}^{*}\) values are 0s, and what proportion are 1s?

Question 24

Take a look back at Question 19, where we computed the exact posterior predictive probability that \(Y_{n+1}^{*}=1\). Compare this to the value you got in Question 23. What do you notice?

This is pretty powerful!!! This means that even without the ability to do the integration, we can approximate the PPD very accurately! This is going to be extremely useful as we move into more complex Bayesian models.

Putting the Pieces Together

There is one last thing we should notice. Take a look at the two distributions we use in Steps 1 and 2.

  • Step 1: Sample a value \(\theta^{*}\) from the posterior distribution \(\theta| y_1, \dots, y_n\). \[\theta|Y_1 = y_1, \dots, Y_n = y_n\]
  • Step 2: Conditional on this sampled value \(\theta^{*}\), sample a value of \(y_{n+1}^{*}\) from the sampling distribution \(Y_{n+1}^{*} | \theta^{*}\). Store the value of \(y_{n+1}^{*}\). \[Y_{n+1}^{*}|\theta\]

Have we seen these two together somewhere before??

\[P(Y_{n+1}^* | Y_1 = y_1, \dots, Y_n = y_n) = \int_{\Theta} P(Y_{n+1}^{*} | \theta) P(\theta | y_1, \dots, y_n) d\theta\]

Yes!! These are exactly the same two terms we have inside the integral!! So, Monte Carlo approximation approximates the value of this integral with a finite sum: instead of integrating over every possible value of \(\theta\), we average over values of \(\theta\) drawn from the posterior.
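To make that connection concrete, here is a minimal sketch of Steps 1 through 4 for a generic Beta posterior. The posterior hyperparameters and the seed below are made-up placeholders, not the values you should use for the lab questions.

set.seed(1)                                       # placeholder seed
s <- 100000                                       # number of Monte Carlo draws
a_post <- 3                                       # placeholder posterior shape parameters
b_post <- 7
theta_star <- rbeta(s, a_post, b_post)            # Step 1, done s times at once
ystar <- rbinom(s, size = 1, prob = theta_star)   # Step 2, one draw per theta_star
mean(ystar)                                       # Steps 3-4: proportion of 1s approximates P(Y*_{n+1} = 1 | data)

Vectorizing the draws like this is equivalent to looping over Steps 1 and 2 many times and storing each result.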

Turning in your assignment

When your Markdown document is complete, do a final knit and make sure everything compiles. You will then submit your document on Canvas. You must submit a PDF or HTML document to Canvas. No other formats will be accepted. Look through the final document to check that everything has knit correctly.

Creative Commons License
This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2023 February 6.