In the handouts we have focused on inference for a population proportion. This lab will introduce you to another of the most common problems in statistics: inference for a population mean of a numerical variable. You will apply a process similar to what we covered in handouts to perform Bayesian inference for a population mean. This lab will make some simplifying assumptions to illustrate the basic process and ideas. We’ll consider more realistic scenarios later.
Important note regarding code: We will cover code for performing and summarizing Bayesian analyses in much more detail throughout the course. Early in the course I have tried to keep the code as simple and bare bones as possible, especially with respect to plotting. There are certainly better ways to code, especially better plots that could be made. You are welcome to jazz it up (or tidyverse it up) if you want, but it’s certainly not required. The main goal for this lab is to understand the basic process.
This RMarkdown file provides a template for you to fill in. Read the file from start to finish, completing the parts as indicated. Some code is provided for you. Be sure to run this code, and make sure you understand what it’s doing. Some blank “code chunks” are provided; you are welcome to add more (Code > Insert chunk or CTRL-ALT-I) as needed. There are also a few places where you should type text responses. Feel free to add additional text responses as necessary.
You can run individual code chunks using the play button. You can use objects defined in one chunk in others. Just keep in mind that chunks are evaluated in order. So if you call something x in one chunk, but redefine x as something else in another chunk, it’s essential that you evaluate the chunks in the proper order.
When you click the Knit button, a document will be generated that includes both the content and the output of any embedded R code chunks within the document. When you are finished, knit the file one final time to generate your completed document.
You’ll need a few R packages, which you can install using install.packages.
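For example, you can install them once from the console (not inside the knitted document):
install.packages(c("ggplot2", "dplyr", "knitr", "janitor"))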
library(ggplot2)
library(dplyr)
library(knitr)
library(janitor)
knitr::opts_chunk$set(
echo = TRUE,
warning = FALSE,
message = FALSE
)
Recall that the likelihood function is the probability of observing the sample data \(y\) viewed as a function of the parameter \(\theta\). When the data \(y\) takes values on a continuous scale, the likelihood is determined by the probability density function of \(Y\) given \(\theta\), \(f(y|\theta)\). In the likelihood function, the observed value of the data \(y\) is treated as a fixed constant, and the likelihood of observing that \(y\) is evaluated for all potential values of \(\theta\).
Recall that a continuous random variable[^1] \(U\) follows a Normal (a.k.a., Gaussian) distribution with mean \(\mu\) and standard deviation \(\sigma>0\) if its probability density function is[^2] \[\begin{align*} f_U(u) & = \frac{1}{\sigma\sqrt{2\pi}}\,\exp\left(-\frac{1}{2}\left(\frac{u-\mu}{\sigma}\right)^2\right), \quad -\infty<u<\infty,\\ & \propto \frac{1}{\sigma}\,\exp\left(-\frac{1}{2}\left(\frac{u-\mu}{\sigma}\right)^2\right), \quad -\infty<u<\infty. \end{align*}\] The constant \(1/\sqrt{2\pi}\) ensures that the total area under the density is 1, but it doesn’t affect the shape of the density.
In R, dnorm(u, mu, sigma) returns the density at \(u\).
Remember, density is NOT probability: \(f_U(u) \neq P(U = u)\), because for a continuous random variable the probability of any particular value is 0. We’ll review continuous distributions in a little more detail soon. In this lab, you just need to use dnorm.
Normal distributions follow the empirical rule; in particular, approximately 68% of values lie within 1 standard deviation of the mean, approximately 95% lie within 2 standard deviations, and approximately 99.7% lie within 3 standard deviations.
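You can check these percentages with pnorm, which returns Normal cumulative probabilities; for example, for a standard Normal:
# P(mu - k sigma < U < mu + k sigma) for k = 1, 2, 3
pnorm(1) - pnorm(-1) # approximately 0.683
pnorm(2) - pnorm(-2) # approximately 0.954
pnorm(3) - pnorm(-3) # approximately 0.997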
Assume body temperatures (degrees Fahrenheit) of healthy adults follow a Normal distribution with unknown mean \(\theta\) and known standard deviation \(\sigma=1\). We wish to estimate \(\theta\), the population mean healthy human body temperature, based on sample data.
Note: it’s unrealistic to assume the population standard deviation is known. We’ll consider the case of unknown standard deviation later.
Exercises
TYPE YOUR RESPONSE HERE. 1) The population standard deviation describes the spread of individual healthy human body temperatures in the population about the population mean. 2) Assuming the population SD is 1, we would expect 68% of all observations of healthy human body temperatures to be between the population mean - 1 and the population mean + 1. Similarly, we would expect 95% of all observations to be between the population mean - 2 and the population mean + 2. Finally, we would expect 99.7% of all observations to be between the population mean - 3 and the population mean + 3.
In this section, all the code and output is provided for you. But be sure to carefully read through the section and run the code. In particular, mind the PAUSEs; be sure to PAUSE and think about these questions before proceeding.
We’ll walk through a discrete example to illustrate ideas. Try to see what is the same and what is different from what we have done in the handouts. The basic process is the same as what we did for proportions. The main difference is in the type of data - categorical versus numerical - resulting in a different likelihood.
First, assume the following discrete prior distribution for \(\theta\), which places probability 0.10, 0.25, 0.30, 0.25, 0.10 on the values 97.6, 98.1, 98.6, 99.1, 99.6, respectively.
Suppose a single temperature value of 97.9 is observed. We’ll construct a Bayes table and find the posterior distribution of \(\theta\) given this single temperature value.
PAUSE. Before proceeding, pause to consider: how do you determine the likelihood?
The likelihood is determined by evaluating the Normal(\(\theta\), 1) density at \(y=97.9\) for different values of \(\theta\): dnorm(97.9, theta, 1) or \[
f(97.9|\theta) = \frac{1}{\sqrt{2\pi}}\,\exp\left(-\frac{1}{2}\left(\frac{97.9-\theta}{1}\right)^2\right)
\] For example, the likelihood for \(\theta = 98.6\) is \(\frac{1}{\sqrt{2\pi}}\,\exp\left(-\frac{1}{2}\left(\frac{97.9-98.6}{1}\right)^2\right)\), or dnorm(97.9, 98.6, 1) = 0.312.
See the Bayes table below.
PAUSE. Before proceeding, pause to consider: how do you expect the posterior distribution to compare to the prior distribution?
theta = seq(97.6, 99.6, 0.5)
# prior
prior = c(0.10, 0.25, 0.30, 0.25, 0.10)
prior = prior / sum(prior)
# data
y = 97.9 # single observed value
sigma = 1 # assume known population SD
# likelihood
likelihood = dnorm(y, theta, sigma) # function of theta
# posterior
product = likelihood * prior
posterior = product / sum(product)
# bayes table
bayes_table = data.frame(theta,
prior,
likelihood,
product,
posterior)
bayes_table %>%
adorn_totals("row") %>%
kable(digits = 4, align = 'r')
| theta | prior | likelihood | product | posterior |
|---|---|---|---|---|
| 97.6 | 0.10 | 0.3814 | 0.0381 | 0.1326 |
| 98.1 | 0.25 | 0.3910 | 0.0978 | 0.3400 |
| 98.6 | 0.30 | 0.3123 | 0.0937 | 0.3258 |
| 99.1 | 0.25 | 0.1942 | 0.0485 | 0.1688 |
| 99.6 | 0.10 | 0.0940 | 0.0094 | 0.0327 |
| Total | 1.00 | 1.3729 | 0.2875 | 1.0000 |
We see that posterior probability is shifted towards the smaller values of \(\theta\), since those values give a higher likelihood to the observed value \(y=97.9\).
Now suppose a second temperature value, 97.5, is observed, independently of the first. We’ll construct a Bayes table and find the posterior distribution of \(\theta\) after observing these two measurements, using the posterior distribution after the first observation as the prior distribution before the second observation.
PAUSE. Before proceeding, pause to consider: how do you determine the likelihood?
The likelihood is determined by evaluating the Normal(\(\theta\), 1) density at \(y=97.5\) for different values of \(\theta\): dnorm(97.5, theta, 1) or \[
f(97.5|\theta) = \frac{1}{\sqrt{2\pi}}\,\exp\left(-\frac{1}{2}\left(\frac{97.5-\theta}{1}\right)^2\right)
\] For example, the likelihood for \(\theta = 98.6\) is \(\frac{1}{\sqrt{2\pi}}\,\exp\left(-\frac{1}{2}\left(\frac{97.5-98.6}{1}\right)^2\right)\), or dnorm(97.5, 98.6, 1) = 0.218.
See the Bayes table below.
PAUSE. Before proceeding, pause to consider: how do you expect the posterior distribution to compare to the prior distribution?
# prior is posterior from previous part
prior = posterior
# data
y = 97.5 # second observed value
# likelihood
likelihood = dnorm(y, theta, sigma) # function of theta
# posterior
product = likelihood * prior
posterior = product / sum(product)
# bayes table
bayes_table = data.frame(theta,
prior,
likelihood,
product,
posterior)
bayes_table %>%
adorn_totals("row") %>%
kable(digits = 4, align = 'r')
| theta | prior | likelihood | product | posterior |
|---|---|---|---|---|
| 97.6 | 0.1326 | 0.3970 | 0.0527 | 0.2048 |
| 98.1 | 0.3400 | 0.3332 | 0.1133 | 0.4407 |
| 98.6 | 0.3258 | 0.2179 | 0.0710 | 0.2761 |
| 99.1 | 0.1688 | 0.1109 | 0.0187 | 0.0728 |
| 99.6 | 0.0327 | 0.0440 | 0.0014 | 0.0056 |
| Total | 1.0000 | 1.1029 | 0.2571 | 1.0000 |
Again, we see that posterior probability is shifted towards the smaller values of \(\theta\), since those values give a higher likelihood to the observed value \(y=97.5\).
PAUSE. Before proceeding, pause to consider: does the order in which the observations were recorded matter? That is, if we had first observed 97.5 and then 97.9, would the posterior after the second observation be the same as for the (97.9, 97.5) sample? Think about it; if you’re not convinced, you can always try it out and see what happens!
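Here is a quick sketch you could run to check, updating on 97.5 first and then on 97.9, reusing the grid and the original prior:
theta = seq(97.6, 99.6, 0.5)
prior = c(0.10, 0.25, 0.30, 0.25, 0.10)
sigma = 1
posterior1 = prior * dnorm(97.5, theta, sigma)
posterior1 = posterior1 / sum(posterior1) # posterior after observing 97.5
posterior2 = posterior1 * dnorm(97.9, theta, sigma)
posterior2 = posterior2 / sum(posterior2) # posterior after observing 97.5 then 97.9
posterior2 # compare with the (97.9, 97.5) posterior above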
Now consider the original prior again. Suppose we update our posterior distribution only after observing the two measurements, rather than after each measurement. That is, we’ll construct a Bayes table to find the posterior distribution after observing a sample of size 2 with a first measurement of 97.9 and a second measurement of 97.5.
PAUSE. Before proceeding, pause to consider: how do you determine the likelihood?
Since the two measurements are independent, the likelihood is the product of the likelihoods for \(y=97.9\) and \(y=97.5\): dnorm(97.9, theta, 1) * dnorm(97.5, theta, 1) or \[
f((97.9, 97.5)|\theta) = \left(\frac{1}{\sqrt{2\pi}}\,\exp\left(-\frac{1}{2}\left(\frac{97.9-\theta}{1}\right)^2\right)\right)\left(\frac{1}{\sqrt{2\pi}}\,\exp\left(-\frac{1}{2}\left(\frac{97.5-\theta}{1}\right)^2\right)\right)
\] For example, the likelihood for \(\theta = 98.6\) is \[
f((97.9, 97.5)|\theta=98.6) = \left(\frac{1}{\sqrt{2\pi}}\,\exp\left(-\frac{1}{2}\left(\frac{97.9-98.6}{1}\right)^2\right)\right)\left(\frac{1}{\sqrt{2\pi}}\,\exp\left(-\frac{1}{2}\left(\frac{97.5-98.6}{1}\right)^2\right)\right)
\] or dnorm(97.9, 98.6, 1) * dnorm(97.5, 98.6, 1) = (0.312)(0.218) = 0.068.
See the Bayes table below.
PAUSE. Before proceeding, pause to consider: how do you expect the posterior distribution to compare to the posterior distribution after the second measurement above?
theta = seq(97.6, 99.6, 0.5)
# prior
prior = c(0.10, 0.25, 0.30, 0.25, 0.10)
prior = prior / sum(prior)
# data
y = c(97.9, 97.5) # two observed values
sigma = 1
# likelihood - function of theta
likelihood = dnorm(y[1], theta, sigma) * dnorm(y[2], theta, sigma)
# posterior
product = likelihood * prior
posterior = product / sum(product)
# bayes table
bayes_table = data.frame(theta,
prior,
likelihood,
product,
posterior)
bayes_table %>%
adorn_totals("row") %>%
kable(digits = 4, align = 'r')
| theta | prior | likelihood | product | posterior |
|---|---|---|---|---|
| 97.6 | 0.10 | 0.1514 | 0.0151 | 0.2048 |
| 98.1 | 0.25 | 0.1303 | 0.0326 | 0.4407 |
| 98.6 | 0.30 | 0.0680 | 0.0204 | 0.2761 |
| 99.1 | 0.25 | 0.0215 | 0.0054 | 0.0728 |
| 99.6 | 0.10 | 0.0041 | 0.0004 | 0.0056 |
| Total | 1.00 | 0.3754 | 0.0739 | 1.0000 |
The posterior is the same as in the previous part. It doesn’t matter if we update the posterior after each observation or all at once.
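Here is a quick sketch confirming this in code: the loop updates the posterior one observation at a time, and the result matches the all-at-once posterior.
theta = seq(97.6, 99.6, 0.5)
prior = c(0.10, 0.25, 0.30, 0.25, 0.10)
sigma = 1
y = c(97.9, 97.5)
# sequential updating: one observation at a time
post_seq = prior
for (yi in y) {
  post_seq = post_seq * dnorm(yi, theta, sigma)
  post_seq = post_seq / sum(post_seq) # renormalize after each observation
}
# all-at-once updating: product of the individual likelihoods
post_all = prior * dnorm(y[1], theta, sigma) * dnorm(y[2], theta, sigma)
post_all = post_all / sum(post_all)
all.equal(post_seq, post_all) # TRUE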
Consider the original prior again. Suppose that we take a random sample of two temperature measurements, but instead of observing the two individual values, we only observe that the sample mean is 97.7. We’ll construct a Bayes table to find the posterior distribution after observing a sample of size 2 with a sample mean of 97.7.
PAUSE. Before proceeding, pause to consider: how do you determine the likelihood? Hint: if \(\bar{Y}\) is the sample mean of \(n\) values from a \(N(\theta, \sigma)\) distribution, what is the distribution of \(\bar{Y}\)?
For a sample of size \(n\) from a \(N(\theta,\sigma)\) distribution, the sample mean \(\bar{Y}\) follows a \(N\left(\theta, \frac{\sigma}{\sqrt{n}}\right)\) distribution. Remember: \(\sigma/\sqrt{n}\) represents the sample-to-sample variability of sample means.
The likelihood is determined by evaluating the Normal(\(\theta\), \(\frac{1}{\sqrt{2}}\)) density at \(\bar{y}=97.7\) for different values of \(\theta\): dnorm(97.7, theta, 1 / sqrt(2)) or \[
f_{\bar{Y}}(97.7|\theta) \propto \exp\left(-\frac{1}{2}\left(\frac{97.7-\theta}{1/\sqrt{2}}\right)^2\right)
\] For example, the likelihood for \(\theta = 98.6\) is \[
f_{\bar{Y}}(97.7|\theta=98.6) \propto \exp\left(-\frac{1}{2}\left(\frac{97.7-98.6}{1/\sqrt{2}}\right)^2\right)
\] or dnorm(97.7, 98.6, 1 / sqrt(2)) = 0.251.
See the Bayes table below.
PAUSE. Before proceeding, pause to consider: how do you expect this posterior distribution to compare to the posterior distribution based on the two individual measurements above?
theta = seq(97.6, 99.6, 0.5)
# prior
prior = c(0.10, 0.25, 0.30, 0.25, 0.10)
prior = prior / sum(prior)
# data
n = 2
y = 97.7 # sample mean
sigma = 1
# likelihood
likelihood = dnorm(y, theta, sigma / sqrt(n)) # function of theta
# posterior
product = likelihood * prior
posterior = product / sum(product)
# bayes table
bayes_table = data.frame(theta,
prior,
likelihood,
product,
posterior)
bayes_table %>%
adorn_totals("row") %>%
kable(digits = 4, align = 'r')
| theta | prior | likelihood | product | posterior |
|---|---|---|---|---|
| 97.6 | 0.10 | 0.5586 | 0.0559 | 0.2048 |
| 98.1 | 0.25 | 0.4808 | 0.1202 | 0.4407 |
| 98.6 | 0.30 | 0.2510 | 0.0753 | 0.2761 |
| 99.1 | 0.25 | 0.0795 | 0.0199 | 0.0728 |
| 99.6 | 0.10 | 0.0153 | 0.0015 | 0.0056 |
| Total | 1.00 | 1.3851 | 0.2727 | 1.0000 |
The likelihood is not the same as in the previous part. There are many samples with a sample mean of 97.7, of which the sample (97.9, 97.5) is only one possibility, so the likelihood of observing a sample mean of 97.7 is greater than the likelihood of observing (97.9, 97.5). However, while the likelihood is not the same as in the previous part, it is proportionally the same: the likelihood in this part has the same shape as the likelihood in the previous part. For example, the ratio of the likelihoods for \(\theta=98.1\) and \(\theta=98.6\) is the same in both parts: \(0.4808/0.2510 = 1.92\) and \(0.1303/0.068 = 1.92\). Since the prior distribution is the same in both parts, and the likelihoods are proportionally the same, the posterior distributions are the same.
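You can verify the proportionality directly; the ratio below is the same constant for every value of theta:
# likelihood of the two individual values vs. likelihood of the sample mean alone
theta = seq(97.6, 99.6, 0.5)
sigma = 1
lik_values = dnorm(97.9, theta, sigma) * dnorm(97.5, theta, sigma)
lik_mean = dnorm(97.7, theta, sigma / sqrt(2))
lik_mean / lik_values # constant ratio (about 3.69) for every theta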
It is often not necessary to know all the individual data values to evaluate the shape of the likelihood as a function of the parameter \(\theta\), but rather simply the values of a few summary statistics.
For example, when estimating the population mean of a Normal distribution with known standard deviation \(\sigma\), it is sufficient to know the sample mean for the purposes of evaluating the shape of the likelihood of the observed data under different potential values of the population mean. For the population proportion examples from handouts, we didn’t need to know the individual success/failure responses but rather just the number (or proportion) of successes.
Recall that if \(Y_1, \ldots, Y_n\) is a random sample from a \(N(\mu, \sigma)\) distribution, then \(\bar{Y}\) has a \(N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)\) distribution.
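If you want to convince yourself of this fact, here is a quick simulation sketch for the case n = 2 and sigma = 1:
# simulate many sample means of n = 2 draws from a Normal(98.6, 1) population
sim_means = replicate(10000, mean(rnorm(2, 98.6, 1)))
mean(sim_means) # close to 98.6
sd(sim_means) # close to 1 / sqrt(2) = 0.707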
The small discrete example above illustrated the main ideas. For the remaining parts we’ll now use a grid approximation and assume that any multiple of 0.0001 between 96.0 and 100.0 is a possible value of \(\theta\): \(96.0, 96.0001, 96.0002, \ldots, 99.9999, 100.0\).
As in the previous section, we’ll consider data from a sample of size \(n=2\) with a sample mean of \(\bar{y} = 97.7\).
We’ll investigate the sensitivity of the posterior distribution to changes in the prior. For each of the following priors: a) describe the prior distribution; b) construct a Bayes table (you don’t need to print it); c) plot the prior, (scaled) likelihood, and posterior distributions; d) compute and interpret appropriate summary characteristics of the posterior distribution.
Note: “appropriate summary characteristics” can include: mean/median/mode, SD, credible intervals, percentiles, or probabilities of events (e.g., the probability that \(\theta\) is less than 98.0). You can choose - you don’t have to include everything - but include what you think gives a reasonable summary of the distribution.
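For a grid posterior, these summaries can all be computed with weighted sums over the grid. Here is a small helper sketch that mirrors the computations used in the scenarios below (the function name summarize_posterior is just for illustration):
# summaries of a posterior distribution defined on a grid of theta values
summarize_posterior = function(theta, posterior) {
  post_mean = sum(theta * posterior)
  post_sd = sqrt(sum(theta ^ 2 * posterior) - post_mean ^ 2)
  # endpoints of a central 98% posterior credible interval
  lower = theta[max(which(cumsum(posterior) < 0.01))]
  upper = theta[max(which(cumsum(posterior) < 0.99))]
  c(mean = post_mean, sd = post_sd, lower = lower, upper = upper)
}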
Scenario 1: Prior distribution is Uniform over the possible values of \(\theta\)
# 1)
theta = seq(96, 100, 0.0001)
# uniform prior over the grid values
prior = rep(1, length(theta))
prior = prior / sum(prior)
n = 2
y = 97.7 # sample mean
sigma = 1
likelihood = dnorm(y, theta, sigma / sqrt(n)) # function of theta
# posterior
product = likelihood * prior
posterior = product / sum(product)
# b) Creation of Bayes Table without printing it
bayes_table = data.frame(theta,
prior,
likelihood,
product,
posterior)
# c) Creation of plot of prior, likelihood, and posterior
ggplot(bayes_table, aes(x = theta)) +
geom_line(aes(y = posterior), col = "seagreen") +
geom_line(aes(y = prior), col = "skyblue") +
geom_line(aes(y = likelihood / sum(likelihood)), col = "orange") +
labs(x = "Population mean (theta)",
y = "Plausibility",
title = "Prior (blue), (Scaled) Likelihood (orange), Posterior (green)") +
theme_bw()
# d) Appropriate summary characteristics of posterior distribution
posterior_mean = sum(theta * posterior)
posterior_mean
## [1] 97.71438
posterior_var = sum(theta ^ 2 * posterior) - posterior_mean ^ 2
posterior_sd = sqrt(posterior_var)
posterior_sd
## [1] 0.6852828
sum(posterior[theta > y])
## [1] 0.5037704
theta[max(which(cumsum(posterior) < 0.01))]
## [1] 96.2174
theta[max(which(cumsum(posterior) < 0.99))]
## [1] 99.3323
mean(prior)
## [1] 2.499938e-05
TYPE YOUR RESPONSE HERE. a) The prior distribution of theta says that every grid value between 96 and 100 (in increments of 0.0001, i.e., 96.0001, 96.0002, etc.) is equally plausible as the true population mean human body temperature. The prior probability of any single grid value is 1/40001, about 2.5*10^-5. d) The posterior distribution says that, after observing the data and taking into account our flat prior, the center of the posterior distribution for the true population mean human body temperature is around 97.71 degrees Fahrenheit, with a posterior standard deviation of about 0.69 degrees. There is a posterior probability of about 50.4% that the population mean is greater than the observed sample mean of 97.7 degrees. There is also a posterior probability of 98% that the population mean body temperature of healthy adults is between 96.22 and 99.33 degrees Fahrenheit; according to the posterior, it is 49 times more plausible for theta to lie inside the interval [96.22, 99.33] than outside it.
Scenario 2: Prior distribution is proportional to \(2-|\theta - 98.0|\) over the possible values of \(\theta\).
theta = seq(96, 100, .0001)
prior = 2 - abs(theta - 98)
prior = prior / sum(prior)
n = 2
y = 97.7 # sample mean
sigma = 1
likelihood = dnorm(y, theta, sigma / sqrt(n)) # function of theta
# posterior
product = likelihood * prior
posterior = product / sum(product)
# b) Creation of Bayes Table without printing it
bayes_table = data.frame(theta,
prior,
likelihood,
product,
posterior)
# c) Creation of plot of prior, likelihood, and posterior
ggplot(bayes_table, aes(x = theta)) +
geom_line(aes(y = posterior), col = "seagreen") +
geom_line(aes(y = prior), col = "skyblue") +
geom_line(aes(y = likelihood / sum(likelihood)), col = "orange") +
labs(x = "Population mean (theta)",
y = "Plausibility",
title = "Prior (blue), (Scaled) Likelihood (orange), Posterior (green)") +
theme_bw()
# d) Appropriate summary characteristics of posterior distribution
posterior_mean = sum(theta * posterior)
posterior_mean
## [1] 97.81569
posterior_var = sum(theta ^ 2 * posterior) - posterior_mean ^ 2
posterior_sd = sqrt(posterior_var)
posterior_sd
## [1] 0.5540791
sum(posterior[theta > y])
## [1] 0.5893807
theta[max(which(cumsum(posterior) < 0.01))]
## [1] 96.5356
theta[max(which(cumsum(posterior) < 0.99))]
## [1] 99.1198
TYPE YOUR RESPONSE HERE. a) The prior distribution of theta says that each grid value between 96 and 100 (in increments of 0.0001) has plausibility proportional to 2 - |theta - 98|, so values near 98 are most plausible and plausibility decreases linearly toward 96 and 100. The average prior probability of a single grid value is 1/40001, about 2.5*10^-5. d) The posterior distribution says that, after observing the data and taking into account our prior, the center of the posterior distribution for the true population mean human body temperature is around 97.82 degrees Fahrenheit, with a posterior standard deviation of about 0.55 degrees. There is a posterior probability of about 58.9% that the population mean is greater than the observed sample mean of 97.7 degrees. There is also a posterior probability of 98% that the population mean body temperature of healthy adults is between 96.54 and 99.12 degrees Fahrenheit; according to the posterior, it is 49 times more plausible for theta to lie inside the interval [96.54, 99.12] than outside it.
Scenario 3: Prior distribution is proportional to a Normal distribution with mean 98.6 and standard deviation 0.7 over the possible values of \(\theta\). Careful: be sure not to confuse a Normal prior distribution with the likelihood.
theta = seq(96, 100, .0001)
prior = dnorm(theta, 98.6, 0.7) # shape of prior
prior = prior / sum(prior)
n = 2
y = 97.7 # sample mean
sigma = 1
likelihood = dnorm(y, theta, sigma / sqrt(n)) # function of theta
# posterior
product = likelihood * prior
posterior = product / sum(product)
# b) Creation of Bayes Table without printing it
bayes_table = data.frame(theta,
prior,
likelihood,
product,
posterior)
# c) Creation of plot of prior, likelihood, and posterior
ggplot(bayes_table, aes(x = theta)) +
geom_line(aes(y = posterior), col = "seagreen") +
geom_line(aes(y = prior), col = "skyblue") +
geom_line(aes(y = likelihood / sum(likelihood)), col = "orange") +
labs(x = "Population mean (theta)",
y = "Plausibility",
title = "Prior (blue), (Scaled) Likelihood (orange), Posterior (green)") +
theme_bw()
# d) Appropriate summary characteristics of posterior distribution
posterior_mean = sum(theta * posterior)
posterior_mean
## [1] 98.15436
posterior_var = sum(theta ^ 2 * posterior) - posterior_mean ^ 2
posterior_sd = sqrt(posterior_var)
posterior_sd
## [1] 0.4970538
sum(posterior[theta > y])
## [1] 0.8195283
theta[max(which(cumsum(posterior) < 0.01))]
## [1] 96.9973
theta[max(which(cumsum(posterior) < 0.99))]
## [1] 99.3098
mean(prior)
## [1] 2.499938e-05
TYPE YOUR RESPONSE HERE. a) The prior distribution of theta says that each grid value between 96 and 100 (in increments of 0.0001) has plausibility proportional to a Normal density with mean 98.6 and SD 0.7 degrees Fahrenheit, so values near 98.6 are most plausible. d) The posterior distribution says that, after observing the data and taking into account our prior, the center of the posterior distribution for the true population mean human body temperature is around 98.15 degrees Fahrenheit, with a posterior standard deviation of about 0.50 degrees. There is a posterior probability of about 82.0% that the population mean is greater than the observed sample mean of 97.7 degrees. There is also a posterior probability of 98% that the population mean body temperature of healthy adults is between 97.00 and 99.31 degrees Fahrenheit; according to the posterior, it is 49 times more plausible for theta to lie inside the interval [97.00, 99.31] than outside it.
Scenario 4: Prior distribution is your choice!
theta = seq(96, 100, .0001)
prior = dlnorm(theta) # shape of prior
prior = prior / sum(prior)
n = 2
y = 97.7 # sample mean
sigma = 1
likelihood = dnorm(y, theta, sigma / sqrt(n)) # function of theta
# posterior
product = likelihood * prior
posterior = product / sum(product)
# b) Creation of Bayes Table without printing it
bayes_table = data.frame(theta,
prior,
likelihood,
product,
posterior)
# c) Creation of plot of prior, likelihood, and posterior
ggplot(bayes_table, aes(x = theta)) +
geom_line(aes(y = posterior), col = "seagreen") +
geom_line(aes(y = prior), col = "skyblue") +
geom_line(aes(y = likelihood / sum(likelihood)), col = "orange") +
labs(x = "Population mean (theta)",
y = "Plausibility",
title = "Prior (blue), (Scaled) Likelihood (orange), Posterior (green)") +
theme_bw()
# d) Appropriate summary characteristics of posterior distribution
posterior_mean = sum(theta * posterior)
posterior_mean
## [1] 97.68761
posterior_var = sum(theta ^ 2 * posterior) - posterior_mean ^ 2
posterior_sd = sqrt(posterior_var)
posterior_sd
## [1] 0.6839398
sum(posterior[theta > y])
## [1] 0.4880199
theta[max(which(cumsum(posterior) < 0.01))]
## [1] 96.2033
theta[max(which(cumsum(posterior) < 0.99))]
## [1] 99.3061
mean(prior)
## [1] 2.499938e-05
TYPE YOUR RESPONSE HERE. a) The prior distribution of theta says that each grid value between 96 and 100 (in increments of 0.0001) has plausibility proportional to the Lognormal density with meanlog 0 and sdlog 1 (the R defaults), which is nearly flat and slowly decreasing over this range. The average prior probability of a single grid value is 1/40001, about 2.5*10^-5. d) The posterior distribution says that, after observing the data and taking into account our prior, the center of the posterior distribution for the true population mean human body temperature is around 97.69 degrees Fahrenheit, with a posterior standard deviation of about 0.68 degrees. There is a posterior probability of about 48.8% that the population mean is greater than the observed sample mean of 97.7 degrees. There is also a posterior probability of 98% that the population mean body temperature of healthy adults is between 96.20 and 99.31 degrees Fahrenheit; according to the posterior, it is 49 times more plausible for theta to lie inside the interval [96.20, 99.31] than outside it.
Discussion. Compare the posterior distributions from the scenarios above and discuss what influence the prior has on the results.
TYPE YOUR RESPONSE HERE. With the uniform prior, the posterior distribution coincides with the scaled likelihood, since a flat prior does no reweighting; when the prior is uniform, the likelihood alone determines the posterior. In the second scenario, the triangular prior peaks at 98, so it pulls the posterior toward its maximum plausibility, which also falls near 98 degrees Fahrenheit. In the third scenario, both the prior and the likelihood are Normal, and the posterior is a compromise between the two with thinner tails than either; because the prior mean (98.6) is higher than the observed sample mean, the posterior is shifted upward and has a noticeably larger mean than the other three posteriors, as well as the smallest SD. The final prior, the Lognormal distribution with the default settings (meanlog = 0, sdlog = 1), is nearly flat over the grid from 96 to 100, so it had little effect: the posterior is only slightly different from the scaled likelihood, with its mean shifted slightly to the left of the center of the scaled likelihood.
# Data Analysis
Continuing the body temperature example, now we’ll do a Bayesian analysis based on actual sample data.
We’ll again use a grid approximation and assume that any multiple of 0.0001 between 96.0 and 100.0 is a possible value of \(\mu\): \(96.0, 96.0001, 96.0002, \ldots, 99.9999, 100.0\).
Choose one of the prior distributions from the four scenarios in the previous section as the prior distribution.
In a recent study[^3], the sample mean body temperature in a sample of 208 healthy adults was 97.7 degrees F. You can assume this is a reasonably representative sample of healthy human body temperatures.
theta = seq(96, 100, .0001)
prior = dnorm(theta, 98.6, 0.7) # shape of prior
prior = prior / sum(prior)
n = 208
y = 97.7 # sample mean
sigma = 1
likelihood = dnorm(y, theta, sigma / sqrt(n)) # function of theta
# posterior
product = likelihood * prior
posterior = product / sum(product)
# b) Creation of Bayes Table without printing it
bayes_table = data.frame(theta,
prior,
likelihood,
product,
posterior)
# c) Creation of plot of prior, likelihood, and posterior
ggplot(bayes_table, aes(x = theta)) +
geom_line(aes(y = posterior), col = "seagreen") +
geom_line(aes(y = prior), col = "skyblue") +
geom_line(aes(y = likelihood / sum(likelihood)), col = "orange") +
labs(x = "Population mean (theta)",
y = "Plausibility",
title = "Prior (blue), (Scaled) Likelihood (orange), Posterior (green)") +
theme_bw()
# d) Appropriate summary characteristics of posterior distribution
posterior_mean = sum(theta * posterior)
posterior_mean
## [1] 97.70874
posterior_var = sum(theta ^ 2 * posterior) - posterior_mean ^ 2
posterior_sd = sqrt(posterior_var)
posterior_sd
## [1] 0.06899985
sum(posterior[theta > y])
## [1] 0.5501379
theta[max(which(cumsum(posterior) < 0.01))]
## [1] 97.5481
theta[max(which(cumsum(posterior) < 0.99))]
## [1] 97.8692
mean(prior)
## [1] 2.499938e-05
TYPE YOUR RESPONSE HERE. c) The posterior distribution says that, after observing the data and taking into account our Normal(98.6, 0.7) prior, the center of the posterior distribution for the true population mean human body temperature is around 97.71 degrees Fahrenheit, with a posterior standard deviation of about 0.069 degrees. There is a posterior probability of about 55.0% that the population mean is greater than the observed sample mean of 97.7 degrees, and a posterior probability of 98% that the population mean is between 97.55 and 97.87 degrees Fahrenheit.
If you had chosen a different prior in the previous part, would your conclusions have changed substantially?
Redo the analysis from the previous part with a different prior from the four scenarios above. Choose whichever prior seems most different from the one you used in your data analysis in the previous section. (You can try all the other priors if you want, but you don’t have to.) Are the conclusions sensitive to the choice of prior?
theta = seq(96, 100, .0001)
prior = 2 - abs(theta - 98)
prior = prior / sum(prior)
n = 208
y = 97.7 # sample mean
sigma = 1
likelihood = dnorm(y, theta, sigma / sqrt(n)) # function of theta
# posterior
product = likelihood * prior
posterior = product / sum(product)
# b) Creation of Bayes Table without printing it
bayes_table = data.frame(theta,
prior,
likelihood,
product,
posterior)
# c) Creation of plot of prior, likelihood, and posterior
ggplot(bayes_table, aes(x = theta)) +
geom_line(aes(y = posterior), col = "seagreen") +
geom_line(aes(y = prior), col = "skyblue") +
geom_line(aes(y = likelihood / sum(likelihood)), col = "orange") +
labs(x = "Population mean (theta)",
y = "Plausibility",
title = "Prior (blue), (Scaled) Likelihood (orange), Posterior (green)") +
theme_bw()
# d) Appropriate summary characteristics of posterior distribution
posterior_mean = sum(theta * posterior)
posterior_mean
## [1] 97.70283
posterior_var = sum(theta ^ 2 * posterior) - posterior_mean ^ 2
posterior_sd = sqrt(posterior_var)
posterior_sd
## [1] 0.06927973
sum(posterior[theta > y])
## [1] 0.5159838
theta[max(which(cumsum(posterior) < 0.01))]
## [1] 97.5416
theta[max(which(cumsum(posterior) < 0.99))]
## [1] 97.8639
TYPE YOUR RESPONSE HERE. c) The posterior distribution says that, after observing the data and taking into account the triangular prior, the center of the posterior distribution for the true population mean human body temperature is around 97.70 degrees Fahrenheit, with a posterior standard deviation of about 0.069 degrees. There is a posterior probability of about 51.6% that the population mean is greater than the observed sample mean of 97.7 degrees, and a posterior probability of 98% that the population mean is between 97.54 and 97.86 degrees Fahrenheit.
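As a further check on sensitivity, here is a quick sketch comparing the posterior mean under the flat prior and the Normal(98.6, 0.7) prior, for n = 2 versus n = 208:
theta = seq(96, 100, 0.0001)
sigma = 1
y = 97.7
prior_flat = rep(1, length(theta)) / length(theta)
prior_normal = dnorm(theta, 98.6, 0.7)
prior_normal = prior_normal / sum(prior_normal)
post_mean = function(prior, n) {
  posterior = prior * dnorm(y, theta, sigma / sqrt(n))
  posterior = posterior / sum(posterior)
  sum(theta * posterior)
}
c(post_mean(prior_flat, 2), post_mean(prior_normal, 2)) # noticeably different: about 97.71 versus 98.15
c(post_mean(prior_flat, 208), post_mean(prior_normal, 208)) # nearly identical: both about 97.70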
Thus, because of our large sample size (n = 208), the posterior distribution is much less affected by the prior distribution. The conclusions from the triangular (absolute value) prior and the Normal prior are extremely similar, so with a large sample size the prior distribution plays only a small role in determining the posterior distribution. Therefore, our conclusions are not sensitive to the choice of prior.
# Comparison between Bayesian and frequentist results
Continue to assume \(\sigma=1\) and the sample data with \(n=208\) and \(\bar{y}=97.7\). This section contains some exercises to compare the results and interpretations of Bayesian and frequentist analyses.
Bayesian
theta = seq(96, 100, .0001)
prior = dnorm(theta, 98.6, 0.7) # shape of prior
prior = prior / sum(prior)
n = 208
y = 97.7 # sample mean
sigma = 1
likelihood = dnorm(y, theta, sigma / sqrt(n)) # function of theta
# posterior
product = likelihood * prior
posterior = product / sum(product)
theta[max(which(cumsum(posterior) < 0.025))]
## [1] 97.5734
theta[max(which(cumsum(posterior) < 0.975))]
## [1] 97.8439
TYPE YOUR RESPONSE HERE. 2) There is a posterior probability of 95% that the population mean body temperature of healthy adults is between 97.57 and 97.84 degrees Fahrenheit. According to the posterior, the population mean is about 19 times more likely to be inside this interval than outside it. 3) There is a 95% posterior probability that the true mean body temperature for the population of healthy adults of interest is between 97.57 and 97.84 degrees Fahrenheit.
Frequentist
error <- qnorm(0.975) * sigma / sqrt(n) # margin of error: z* times the standard error
left <- y - error # lower endpoint of the 95% confidence interval
right <- y + error # upper endpoint of the 95% confidence interval
left
## [1] 97.5641
right
## [1] 97.8359
TYPE YOUR RESPONSE HERE. 2) We are 95% confident that the true population mean body temperature of healthy adults is between 97.56 and 97.84 degrees Fahrenheit. 3) 95% confidence means that if we were to repeat the process of random sampling from the population of interest many times, approximately 95% of the resulting intervals would contain the true population mean body temperature of healthy adults.
Write a few sentences comparing both the numerical results and the interpretations of the Bayesian and frequentist interval estimates.
TYPE YOUR RESPONSE HERE. 1) The two intervals are numerically similar but not identical: [97.57, 97.84] for the Bayesian interval and [97.56, 97.84] for the frequentist interval. The small difference comes from the prior: the Normal(98.6, 0.7) prior pulls the posterior slightly toward 98.6, so the Bayesian interval is shifted slightly upward. The interpretations differ more fundamentally. The Bayesian interval says there is a 95% posterior probability that the population mean body temperature of healthy adults is between 97.57 and 97.84 degrees Fahrenheit. The frequentist interval says we are 95% confident that the population mean is between 97.56 and 97.84 degrees Fahrenheit, where the confidence is in the procedure: over many samples, about 95% of such intervals would cover the true mean. Frequentists place their confidence in the procedure, whereas Bayesians make a direct probability statement about the parameter lying in the interval.
Bayesian
### Type your code here. Feel free to add chunks as necessary
sum(posterior[theta < 98])
## [1] 0.9999878
TYPE YOUR RESPONSE HERE. 2) There is a 99.99% posterior probability that theta, the population mean body temperature of healthy adults, is less than 98 degrees Fahrenheit.
Frequentist
### Type your code here. Feel free to add chunks as necessary
se = sigma / sqrt(n) # standard error of the sample mean
z = (y - 98) / se # z statistic for the observed sample mean under H0: theta = 98
pval = pnorm(z, lower.tail = FALSE) # p-value for Ha: theta > 98: P(sample mean >= 97.7 when theta = 98)
pval
## [1] 0.9999924
TYPE YOUR RESPONSE HERE. 2) With a p-value of 0.9999, we do not have evidence against the null hypothesis, and we cannot conclude that the true population mean body temperature of healthy adults is greater than 98 degrees Fahrenheit. 3) If the population mean body temperature of healthy adults were equal to 98 degrees Fahrenheit, then we would observe a sample mean of 97.7 degrees Fahrenheit or greater in about 99.99% of random samples of size 208; a result this consistent with the null hypothesis provides no evidence that the mean exceeds 98.
Write a few sentences comparing both the numerical results (posterior probability versus p-value) and the interpretations of the Bayesian and frequentist analyses.
TYPE YOUR RESPONSE HERE. The Bayesian analysis computes a probability that theta is less than 98: we found a 99.99% posterior probability that theta is less than 98 degrees Fahrenheit. But such a probability makes no sense from a frequentist perspective. From the frequentist perspective, the unknown parameter is a number: either that number is less than 98 or it’s not; there is no probability about it. The p-value is a probability referring to what would happen over many hypothetical samples, not a probability statement about the parameter. The two numbers happen to be numerically close here, but they answer very different questions.
TYPE YOUR RESPONSE HERE. 1) One important concept I learned in this lab is that when our sample size is small, the prior distribution can have a large effect on the posterior distribution. In contrast, when our sample size is large and we have a lot of data from the population of interest, the prior distribution we choose has a negligible effect on the posterior distribution. 2) I am still wondering whether there is any difference between Bayesian credible intervals and frequentist confidence intervals other than interpretation. Are there any other meaningful differences, or is the interpretation of these intervals the only differing factor?
[^1]: Why \(U\) and not \(X\) or \(Y\)? In a Bayesian analysis, we might assume the data follow a Normal distribution, but we might also use a Normal distribution to quantify the uncertainty about a parameter. We generally associate \(X\) and \(Y\) with data; \(U\) is meant to represent a more general variable, which could be either data or a parameter.
[^2]: \(\exp\) is just another way of writing the exponential function, \(\exp(u)=e^u\).
[^3]: Source and a related article.