Applied Bayesian Lab 6

Complete all Questions and submit your final PDF or html (either works!) under Assignments in Canvas.

Goal

In our lab today, we are going to explore Bayesian Regression models. The goal for today is not only to reinforce the ideas of Bayesian regression but also to explore a coding method that is commonly used in practice to build Bayesian regression models. This should be useful for your project!

The Data

Today we have an actual data set (gasp!!!) with \(X\) and \(Y\) information. (I know, you’ve been waiting for this.)

Our data for today is from a bike rental company. We have information on \(n = 500\) days in which bikes were rented. We are interested in modeling \(Y=\) the number of bikes rented on a given day (rides), and we want to see how \(X=\) the temperature of the day in degrees Fahrenheit (temp_actual) is associated with the number of bikes rented. The data set can be downloaded from Canvas.

Question 1

Why might the relationship between \(X\) = temperature and \(Y\) = the number of bikes rented be of interest to our client (who owns the bike rental shop)?

Because we have data, the first thing we are going to do is exploratory data analysis (EDA). The goal of our EDA is to (1) determine if there is any missing data we need to handle and (2) determine the shape of the relationship between \(X\) and \(Y\) so we know if a linear regression model might be an appropriate choice.

Question 2

Is there any missing data present in the two variables of interest (temperature and the number of bikes)? Note: There are two temperature variables, and we are using temp_actual.

Question 3

Create a plot to visualize the relationship between temperature and the number of bikes rented. Label your axes and title your figure. Based on what you see, is a line a reasonable shape to use to describe the relationship between temperature and the number of bikes?

The Bayesian Regression Model

Spoiler alert if you haven’t done Question 3: Yes, using a line is a reasonable choice here. This means we can proceed with Bayesian Linear Regression. Just like in Frequentist regression, we start with the idea of a line. For \(i = 1, \dots, 500\),

\[Y_i = \beta_0 + \beta_1 X_i + \epsilon_i,\] where \[\epsilon_i \sim N(0, \sigma^2)\]

This means we have a shape component (\(\beta_0 + \beta_1 X_i\)) that captures the relationship between \(Y\) and \(X\) and a noise component (\(\epsilon_i\)) that captures the rest of the variation in \(Y\) (the part not associated with \(X\)).

As we have discussed in class, we can represent the model above in terms of a sampling model as follows:

The Sampling Model

\[Y_i | \beta_0, \beta_1, \sigma^2, X_i \sim N(\beta_0 + \beta_1 X_i, \sigma^2)\]

This means that we have 3 different parameters that we need to estimate: \(\beta_0, \beta_1,\) and \(\sigma^2\).

There are a variety of different choices for priors that we can use. For today, we are going to use a very standard prior structure.

The Priors

\[\beta_0 \sim N(m_0, \tau_0^2)\]

\[\beta_1 \sim N(m_1, \tau_1^2)\]

\[\sigma \sim Exponential(\alpha)\]

Hmm…this is a little different from the prior structure that we used in class but not by much. What’s different?

First of all, we decided that we did not want to center the priors for \(\beta_0\) and \(\beta_1\) at 0. This was a choice that we made on the slides. The prior notation above is actually the more general form; we don’t always want to center our priors at 0!

However, the choice of not centering the priors at 0 means we now have quite a few hyperparameters that we have to set: \(\alpha\) for the prior on the standard deviation and \(m_0, m_1, \tau_0\), and \(\tau_1\) for the priors on the intercept and slope. We’ll get to those in just a minute.

The last difference between the prior structure above and the one from the slides is the prior for \(\sigma\). On the slides, we used:

\[\sigma^2 \sim InverseGamma(\alpha,\beta)\]

Here, we use:

\[\sigma \sim Exponential(\alpha)\]

In this prior,

\[E(\sigma) = \frac{1}{\alpha}, \quad Var(\sigma) = \frac{1}{\alpha^2}\]
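As a quick check on this parameterization (the value of \(\alpha\) below is just an illustration, not the one we will use for the bike data):

\[\alpha = 0.001 \implies E(\sigma) = \frac{1}{0.001} = 1000, \quad SD(\sigma) = \sqrt{\frac{1}{0.001^2}} = 1000\]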

Why do we do this?? Well, one reason is that we are using rstanarm to build our model today, and this is the default prior in rstanarm. It turns out that there are a few priors that are used in practice, including the Inverse Gamma, the exponential, and the half-Cauchy. Each has pros and cons, but we will not dig into this today.

The last step we need to take before building our model is to center and scale our \(X = (X_1,\dots,X_n)\). This means that for each temperature in the data set, we will subtract the average temperature in the data set and divide by the standard deviation. The result should be a new variable \(X^{C} = (X_1^C,\dots,X_n^C)\) that has mean 0 and standard deviation 1.
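A minimal sketch of that transformation for a generic numeric vector X (in the lab, X would be the temp_actual column of your data):

# Subtract the mean, then divide by the standard deviation
XC <- (X - mean(X)) / sd(X)

# scale() does the same thing; it returns a matrix, so wrap it in as.numeric()
# XC <- as.numeric(scale(X))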

Question 4

Create a vector called XC in R that holds the centered and scaled version of \(X =\) temp_actual. Show that the mean and standard deviation of your new vector are what they should be.

With this new version of \(X\), we update our sampling model:

\[Y_i | \beta_0, \beta_1, \sigma^2, X_i^C \sim N(\beta_0 + \beta_1 X_i^C, \sigma^2)\]

The Hyperparameters

Okay, with the model set up, it is time to turn our attention to the hyperparameters. The bike shop has been operating for a while, so our client has quite a bit of prior information to provide to us. They ask us to incorporate this prior information into our Bayesian regression model.

Here’s what we know:

  1. On an average temperature day, there are typically around 5000 riders, though this average could be somewhere between 3000 and 7000.

  2. For every increase in \(X_i^C\) by 1, ridership typically increases by 100 rides, though this average increase could be as low as 20 or as high as 180.

  3. At any given temperature, daily ridership will tend to vary with a moderate standard deviation of 1250 rides.

Question 5

Based on this prior information, what values would we choose for the hyperparameters in our Bayesian regression model? Make sure to explain your reasoning or show your work (something to back up your choices). Hint 1: Remember that we have centered \(X\). Hint 2: When we say “typically” in a normal, we mean within 2 standard deviations.

Question 6

Because it will make your lives easier as we proceed, write out the Bayesian regression model with the numeric values you chose in Question 5 placed in the model instead of the generic parameters \(\alpha\), etc.

Now, let’s do something we haven’t done before - let’s check on this prior specification and see if it actually reflects what we want it to. We can do this by running a simulation study. A simulation study just means that we generate data ourselves under a given model and then check to see what properties that data has. We have been given very specific prior beliefs - let’s make sure our prior is reflecting those beliefs well.

Let’s see how to do this.
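Before you dive in, here is a minimal sketch of one simulation pass so you can see the mechanics. The hyperparameter names (m0, tau0, m1, tau1, alpha) are placeholders for the values you choose in Question 5, and XC is your centered and scaled temperature vector from Question 4.

set.seed(365)
# Draw one value of each parameter from its prior, in the order beta_0, beta_1, sigma
beta0 <- rnorm(1, mean = m0, sd = tau0)
beta1 <- rnorm(1, mean = m1, sd = tau1)
sigma <- rexp(1, rate = alpha)
# Simulate n = 500 days of ridership from the sampling model
Y_Sim1 <- rnorm(500, mean = beta0 + beta1 * XC, sd = sigma)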

Question 7

To start, set a seed of 365 and then draw a value for \(\beta_0\), \(\beta_1\), and \(\sigma\) (in this order!!) from the priors you specified in Question 6. State the values you draw.

Question 8

Using the values from Question 7 and XC, simulate \(n=500\) values for the number of bikes rented. Call the vector of simulated \(Y\) values Y_Sim1. Provide a 6 number summary of your Y_Sim1. Note: We have information on \(\sigma\). Don’t ignore it when you are generating your data.

One piece of prior information we were given is that at any given temperature, daily ridership will tend to vary with a moderate standard deviation of 1250 rides.

Question 9

Check to see if your simulated \(Y\) values reflect this prior belief. Show your work or explain your steps.

A second piece of prior information is that on an average temperature day, there are typically around 5000 riders, though this average could be somewhere between 3000 and 7000. This is going to require a little more work, as it describes how much \(\beta_0\) can vary, which we cannot assess from a single simulated data set.

Question 10

Using the same process as in Questions 7 and 8, generate 4 more versions of \(Y\). In other words, we are going to have 5 different columns of \(Y\) information, one generated with each of 5 different seeds. Use the seeds 100, 200, 300, and 400 for the 4 remaining data sets. Build a Frequentist linear regression model on each of the 5 different \(Y\) vectors and report the coefficients of each. Do not just output the results of summary(); I only want to see the coefficients.

Question 11

Our prior information says that on an average temperature day, there are typically around 5000 riders, though this average could be somewhere between 3000 and 7000. Check to see if your computed values of \(\beta_0\) reflect this prior belief. Show your work or explain your steps.

Question 12

Our last piece of prior information says that for every increase of 1 in \(X_i^C\), ridership typically increases by 100 rides, though this average increase could be as low as 20 or as high as 180. Check to see if your computed values of \(\beta_1\) reflect this prior belief. Show your work or explain your steps.

That seemed like a lot of work…why did we do all of this?? Well, it’s a really important tool that we use in Bayesian statistics, not only to verify that our priors are specified (calibrated) the way we intend, but also to explore what relationships different choices of hyperparameters imply. This is important when we don’t necessarily have strong prior beliefs. Our client was very kind to us today and gave us very specific information, but this is not always the case.

In practice, we would typically simulate more than 5 data sets. I would have done at least 100, probably more, to get a good idea of what information the priors were conveying.
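A minimal sketch of what that larger check might look like (again, the hyperparameter names are placeholders for your Question 5 choices, and XC is your centered and scaled temperature vector):

n_sims <- 100
# Each column of Y_Sims holds one simulated data set of 500 days
Y_Sims <- replicate(n_sims, {
  beta0 <- rnorm(1, mean = m0, sd = tau0)
  beta1 <- rnorm(1, mean = m1, sd = tau1)
  sigma <- rexp(1, rate = alpha)
  rnorm(500, mean = beta0 + beta1 * XC, sd = sigma)
})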

The MCMC

With our model now set and our prior hyperparameters specified, it’s time to actually proceed with estimating the posterior distributions of our parameters!

For our model, we need to use MCMC to estimate the posterior distributions of our parameters. For the first time since we learned MCMC, we are not going to code it ourselves!! Instead, we are going to be using a very commonly used package called rstanarm, which runs the Stan MCMC engine for us. To use this package, you need to load a few libraries:

# Allows us to run MCMC
library(rstanarm)
# Allows us to make pretty pictures with our MCMC draws
library(bayesplot)

The function we will be using is called stan_glm. Yes, glm like Generalized Linear Models, which is great news for us because that means this package can be used for things like Bayesian Logistic Regression or Bayesian Poisson Regression. Yay for learning a versatile function that can be applied in a variety of settings!!

So, let’s get to know Stan. The skeleton of the code you will need looks like this:

rental_model <- stan_glm( Y ~ X, data = ,
                       family = gaussian,
                       prior_intercept = normal(, ),
                       prior = normal(, ), 
                       prior_aux = exponential(),
                       chains = , iter = , seed = 84735)

Let’s talk through this.

  • Y: this is your \(Y\) variable, so for us rides, which represents the number of bikes that are rented for rides on a given day.
  • X: this is your CENTERED and SCALED \(X\) variable. This is important!!! This function will not center for you.
  • family = gaussian: Gaussian is another word for a normal distribution, so this is telling the computer that we are fitting a linear regression model with normal errors. For different regression models, this would change.

Now we are into the prior specification part.

  • prior_intercept: This is the prior on \(\beta_0\). We are specifying a normal, and you need to fill in the prior mean and standard deviation (not variance).
  • prior: This is the prior on \(\beta_1\). We are specifying a normal, and you need to fill in the prior mean and standard deviation (not variance).
  • prior_aux: This is the prior on \(\sigma\), and you need to feed it the prior hyperparameter \(\alpha\).

The last few arguments have to do with MCMC related things.

  • chains: How many chains do we want to run? For today, let’s do 3.
  • iter: This is \(S\), the number of iterations of MCMC we want to run.
  • seed: This is the random seed we are all going to use today. This makes Dr. Dalzell’s life easier, so please and thank you for using the seed in the skeleton code.

Question 13

Fill in the skeleton code given above to run the MCMC for 3 chains for \(S = 10000\) iterations. The code will helpfully output how long it took to run each chain; provide the information as your answer to this question. I think you will be pleasantly surprised!

Now, what you have created is a list object with lots of things in it. Let’s see how we can use our results.

Assessing Convergence

The first thing we probably want to do is to create a trace plot. Using the bayesplot package, we can do this quite easily.

mcmc_trace(rental_model, size = 0.1)

Question 14

Use the code above to create trace plots. Show the plots for all parameters and discuss whether or not it looks like the three parameters have converged (this discussion can be very brief).

It is often also helpful to look at the empirical posterior distributions that we have created using our draws from the MCMC chains. To do that, we use the following code:

mcmc_dens_overlay(rental_model)

This shows us what the posterior distribution for each parameter looks like, based on our MCMC samples!

One other thing we may want to assess is effective sample size. Luckily, we can do this, too.

neff_ratio(rental_model)

The result of running the code above is a little different from what we are used to when we assess effective sample size. Instead of returning a number, it returns a ratio. To get the effective sample size we are used to, we just multiply this ratio by \(S\). Note: This code assumes we have only fed in post burn-in samples. If you want to burn any, remove them BEFORE running the code above.
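As a quick sketch of that conversion (using the \(S = 10000\) iterations from Question 13):

# Effective sample size = ratio times the number of iterations S
neff_ratio(rental_model) * 10000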

Question 15

State the effective sample size for each of the parameters. Does this suggest that there is strong autocorrelation? Explain your reasoning.

If you want to look at the autocorrelation itself, we can do that, too! Let’s take a look for \(\beta_1\).

mcmc_acf(rental_model, pars = "XC")

Question 16

Why are there 3 different graphs when we are only looking at one parameter?

Using the Results

Okay, we have done a lot of things. However, one thing we have not done is answer our client’s question. Let’s do that now, shall we?

At the moment, we have a lot of samples for each of the three parameters. Is there a neat way to summarize all of this?

library(broom.mixed)
tidy(rental_model, effects = c("fixed", "aux"),
     conf.int = TRUE, conf.level = )

The function above provides a very nice summary of our posterior results. The effects argument tells the computer which of the parameters we want to do posterior inference on. If you remember from the skeleton code, aux means the standard deviation \(\sigma\), and fixed means the rest (\(\beta_0, \beta_1\)).

You will also notice that we can request a credible interval. Yes it says confidence - I didn’t write this code!! It is a credible interval.
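If you would like to see where the equal-tails interval comes from, you can also compute it directly from the MCMC draws. A minimal sketch (assuming, as in the code above, that your centered predictor is named XC):

# Grab the posterior draws and take the middle 95% of the slope draws
rental_draws <- as.data.frame(rental_model)
quantile(rental_draws$XC, probs = c(0.025, 0.975))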

Question 17

Build and interpret a 95% posterior credible interval for \(\beta_1\). Note: This is the equal tails posterior credible interval, which means it looks at the middle 95% of the posterior draws.

Posterior Prediction

The last step in our regression journey for today is making predictions for new observations. Let’s suppose we want to predict how many bikes would be rented on a day with a temperature of 70 degrees Fahrenheit.

Question 18

Let \(X^{*} = 70\). What is the value of \((X^{*})^C\)?

In Frequentist statistics, we estimate a new \(Y^{*}\) by plugging in point estimates of \(\hat{\beta}_0\) and \(\hat{\beta}_1\), so that

\[{Y}^{*} = \hat{\beta}_0 + \hat{\beta}_1 (X^{*})^C\]

However, this ignores two things. First of all, we don’t expect that every single day with a temperature of 70 degrees would have the same number of bikes rented; we expect that the number of rentals might deviate from day to day. This means it would be more helpful to have a distribution of \({Y}^{*}\) rather than just one estimate.

Secondly, using the above equation completely ignores the fact that we have estimated the intercept and slope parameters. These values have a distribution, as well, and ignoring this means we tend to underestimate our variability in \({Y}^{*}\).

So what is the solution? Well, we want to use our MCMC results to estimate \(Y^{*}\). To do this, we assume that

\[Y^* | \beta_0, \beta_1, \sigma, (X^*)^C \sim N(\beta_0+ \beta_1(X^*)^C, \sigma^2)\]

Let’s do this for the first set of MCMC estimates of \(\beta_0^{(1)},\beta_1^{(1)},\) and \(\sigma^{(1)}\) (meaning from the first iteration). To see what these MCMC draws were, use the following:

# Convert the MCMC draws into a data frame (one row per iteration)
rental_model_df <- as.data.frame(rental_model)
# Pull out the draws of beta_0, beta_1, and sigma from the first iteration
first_it <- rental_model_df[1,]

Question 19

For these values, simulate a value of \(Y^{*}\) for a 70 degree day. Show your work and your result. Set a seed of 365.

Question 20

Repeat the process for the second MCMC iteration values. Show your resultant prediction.

You will notice your predictions are not the same!

Question 21

Repeat the process for all \(S\) MCMC iterations. Create a plot to show the empirical posterior predictive distribution of \(Y^{*}\).

Question 22

Create and interpret a 95% posterior credible interval for the posterior predictive distribution of \(Y^{*}\).

Turning in your assignment

When your Markdown document is complete, do a final knit and make sure everything compiles. You will then submit your document on Canvas. You must submit a PDF or html document to Canvas. No other formats will be accepted. Make sure you look through the final document to make sure everything has knit correctly.

Last updated 2023 April 12 by Nicole Dalzell.