This RMarkdown file provides a template for you to fill in. Read the file from start to finish, completing the parts as indicated. Some blank “code chunks” are provided; you are welcome to add more (CTRL-ALT-I) as needed. There are also a few places where you should type text responses. Feel free to add additional text responses as necessary.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. When you are finished

Problem 1.

Does uniform color give athletes an advantage over their competitors? To investigate this question, Hill and Barton (Nature, 2005) examined the records in the 2004 Olympic Games for four combat sports: boxing, tae kwon do, Greco-Roman wrestling, and freestyle wrestling. Competitors in these sports were randomly assigned to wear either a red or a blue uniform. The researchers investigated whether competitors wearing one color won significantly more often than those wearing the other color. They collected data on a total of 457 matches.

Let the parameter \(\theta\) represent the probability that a competitor wearing red wins a match. Let \(Y\) be the number of matches, out of 457, in which the competitor wearing red defeats the competitor wearing blue.

  1. Assume a Beta(50, 50) prior for \(\theta\). Use simulation to approximate the prior predictive distribution of \(Y\) and plot it. (It might make more sense to plot \(Y/457\).) Explain why a Beta(50, 50) distribution seems like an unreasonable choice for the prior of \(\theta\).

    # Type your code here.
    theta = seq(0, 1, 0.0001)
    
    # prior
    prior = theta ^ 49 * (1-theta)^49
    prior = prior / sum(prior)
    
    ylim = c(0, max(prior))
    plot(theta, prior, type='l', xlim=c(0, 1), ylim=ylim, col="skyblue", xlab='theta', ylab='')

    Type your response here. The prior predictive distribution of theta in the case of using Beta(50,50) has a mean at around 0.5, which is reasonable. however, the prior SD (spread) of this distribution is a little too wide and gives too much plausibility to smaller and bigger numbers for theta (comapared to what we know is reasonable for this problem, 0.5). Thus, this prior distribution is unreasonable and I think we can make it better.

  2. Assume a Beta(200, 200) prior distribution for \(\theta\). Use simulation to approximate the prior predictive distribution of \(Y\) and plot it. (It might make more sense to plot \(Y/457\).) Explain why a Beta(200, 200) distribution seems like a more reasonable choice than Beta(50, 50) for the prior of \(\theta\).

    # Type your code here.
    theta = seq(0, 1, 0.0001)
    
    # prior
    prior = theta ^ 199 * (1-theta)^199
    prior = prior / sum(prior)
    
    ylim = c(0, max(prior))
    plot(theta, prior, type='l', xlim=c(0, 1), ylim=ylim, col="skyblue", xlab='theta', ylab='')

    Type your response here. We know that the most plausible value for theta is 0.5 and we should be confident in this given research question. Thus, in order to reflect this, we should have a relatively small SD. This is better shown in our prior distribution for theta in the case of Beta(200,200) vs Beta(50,50) where the latter has a much too large value for the prior SD and the former has a smaller SD, reflecting our confidence that we should expect 0.5 as our true value of theta. Thus, the Beta(200,200) seems more reasonable than the Beta(50,50) distribution.

  3. Assume a Beta(50, 50) prior distribution for \(\theta\). The researchers analyzed a total of 457 matches, and they found that the competitor wearing red defeated the competitor wearing blue in 248 of these matches. Identify the posterior distribution, its mean and SD, and central 50%, 80%, and 98% posterior credible intervals. Create a plot showing prior, (scaled) likelihood, and posterior.

    theta = seq(0, 1, 0.0001) # the grid is just for plotting
    
    # prior
    alpha_prior = 50
    beta_prior = 50
    prior = dbeta(theta, alpha_prior, beta_prior)
    
    # data
    n = 457
    y = 248
    
    # likelihood
    likelihood = dbinom(y, n, theta)
    
    # posterior
    alpha_post = alpha_prior + y
    beta_post = beta_prior + n - y
    posterior = dbeta(theta, alpha_post, beta_post)
    
    # plot
    ymax = max(c(prior, posterior))
    scaled_likelihood = likelihood * ymax / max(likelihood)
    
    plot(theta, prior, type='l', col='skyblue', xlim=c(0, 1), ylim=c(0, ymax), ylab='', yaxt='n')
    par(new=T)
    plot(theta, scaled_likelihood, type='l', col='orange', xlim=c(0, 1), ylim=c(0, ymax), ylab='',  yaxt='n')
    par(new=T)
    plot(theta, posterior, type='l', col='seagreen', xlim=c(0, 1), ylim=c(0, ymax), ylab='', yaxt='n')
    legend("topleft", c("prior", "scaled likelihood", "posterior"), lty=1, col=c("skyblue", "orange", "seagreen"))

    # 50% posterior credible interval
    qbeta(c(0.25, 0.75), alpha_post, beta_post)
    ## [1] 0.5207740 0.5492897
    # 80% posterior credible interval
    qbeta(c(0.1, 0.9), alpha_post, beta_post)
    ## [1] 0.5079061 0.5620579
    # 98% posterior credible interval
    qbeta(c(0.01, 0.99), alpha_post, beta_post)
    ## [1] 0.4857573 0.5838912
    #posterior mean 
    post.mean = (alpha_prior + y) / (alpha_prior + beta_prior + n)
    post.mean
    ## [1] 0.535009
    post.sd = sqrt((post.mean* (1-post.mean))/(alpha_prior +beta_prior+n+1))
    post.sd
    ## [1] 0.02111474

    Type your response here. The posterior distriubtion is now ~ Beta(298,259), with a posterior mean of 0.535 and a posterior SD of 0.0211. The central posterior 50% credible interval is: 0.5207740 0.5492897. The central posterior 80% credible interval is: 0.5079061 0.5620579. The central posterior 98% credible interval is: 0.4857573 0.5838912.

  4. Assume a Beta(200, 200) prior distribution for \(\theta\). The researchers analyzed a total of 457 matches, and they found that the competitor wearing red defeated the competitor wearing blue in 248 of these matches. Identify the posterior distribution, its mean and SD, and central 50%, 80%, and 98% posterior credible intervals. Create a plot showing prior, (scaled) likelihood, and posterior. How sensitive are the results to the choice of prior (Beta(50, 50) versus Beta(200, 200))?

       theta = seq(0, 1, 0.0001) # the grid is just for plotting
    
    # prior
    alpha_prior = 200
    beta_prior = 200
    prior = dbeta(theta, alpha_prior, beta_prior)
    
    # data
    n = 457
    y = 248
    
    # likelihood
    likelihood = dbinom(y, n, theta)
    
    # posterior
    alpha_post = alpha_prior + y
    beta_post = beta_prior + n - y
    posterior = dbeta(theta, alpha_post, beta_post)
    
    # plot
    ymax = max(c(prior, posterior))
    scaled_likelihood = likelihood * ymax / max(likelihood)
    
    plot(theta, prior, type='l', col='skyblue', xlim=c(0, 1), ylim=c(0, ymax), ylab='', yaxt='n')
    par(new=T)
    plot(theta, scaled_likelihood, type='l', col='orange', xlim=c(0, 1), ylim=c(0, ymax), ylab='',  yaxt='n')
    par(new=T)
    plot(theta, posterior, type='l', col='seagreen', xlim=c(0, 1), ylim=c(0, ymax), ylab='', yaxt='n')
    legend("topleft", c("prior", "scaled likelihood", "posterior"), lty=1, col=c("skyblue", "orange", "seagreen"))

    # 50% posterior credible interval
    qbeta(c(0.25, 0.75), alpha_post, beta_post)
    ## [1] 0.5112535 0.5342734
    # 80% posterior credible interval
    qbeta(c(0.1, 0.9), alpha_post, beta_post)
    ## [1] 0.5008807 0.5446041
    # 98% posterior credible interval
    qbeta(c(0.01, 0.99), alpha_post, beta_post)
    ## [1] 0.4830347 0.5623168
    #posterior mean 
    post.mean = (alpha_prior + y) / (alpha_prior + beta_prior + n)
    post.mean
    ## [1] 0.5227538
    post.sd = sqrt((post.mean* (1-post.mean))/(alpha_prior +beta_prior+n+1))
    post.sd
    ## [1] 0.01705203

    Type your response here. The posterior distriubtion is now ~ Beta(448,409), with a posterior mean of 0.523 and a posterior SD of 0.0171. The central posterior 50% credible interval is: 0.5112535 0.5342734. The central posterior 80% credible interval is: 0.5008807 0.5446041. The central posterior 98% credible interval is: 0.4830347 0.5623168.

The results for the posterior distribution are less sensitive in the Beta(50,50) case when compared to the Beta(200,200) case. The reason for this is that the former is assuming a smaller degree of certain in the prior distribution, thus the data (likelihood function) takes more precedence and the posterior is less sensitive to this case. The posterior is more sensitive in the Beta(200,200) case as we are assuming a high degree of confidence in our prior distribution in this case because of the larger alpha and beta values acting as sample size causing our greater degree of certainty.

  1. Express the posterior mean as a weighted average of the prior mean and the sample proportion.

    Type your response here. 298/557 = ((50/100) * (100/557)) + ((248/457)*(457/557))

and

448/857 = ((200/400) * (400/857)) + ((248/457)*(457/857))

  1. We might say that if the probability of winning by the competitor wearing red is between 0.49 and 0.51, then any color advantage is small enough to be considered unimportant. In other words \(\theta\) outside of 0.49 and 0.51 could be interpreted as one color giving a meaningful advantage. Compute and interpret the posterior probability that \(\theta\) is not in [0.49, 0.51]. (If you disagree with the range [0.49, 0.51] feel free to use different values.)

    qbeta(c(0.016747, 0.118445), alpha_post, beta_post)
    ## [1] 0.4864531 0.5025683
    res = 1 - (0.118445 - 0.016747)
    res
    ## [1] 0.898302

    Type your response here. Therefore, there is a posterior probability of 0.8983 that the true value of theta, the true probability of winning a match when the team in question wears red, is not within the range of 0.49 and .51. Thus, there is a high plausbility in the idea that the true probability of winning a match when the team in question wears red is not within the 0.49 and 0.51 range as previously believed.

  2. Write 1-3 clearly worded sentences reporting the conclusions of your analysis in this context.

    Type your response here. Thus, after taking into consideration a prior distribution with a center of 0.5, as well as observing data with a sample size of 457 matches and 248 wins by the team that was wearing red, we have determine a posterior distribution for the true probability of winning a match given that a team is wearing red. This distribution is centered at 0.5227 and has a spread of 0.017. Additionally, there is a posterior probability of 98% that the true probability of a team winning given that they are wearing red is 48.3% and 56.2%, for competitors similar to those in the 2004 Olympic Games for four combat sports: boxing, tae kwon do, Greco-Roman wrestling, and freestyle wrestling ; after observing the sample data, it’s 49 times mores plausible that θ is inside [0.4830347 0.5623168] as outside.

  3. Assume a Beta(200, 200) prior and the 248/457 sample data. Use JAGS to approximate the posterior distribution of \(\theta\). Include a plot of the simulated posterior distribution, and some summary statistics. Does the JAGS output reasonably match the posterior distribution that you computed previously?

    # Type your code here.
    library(rjags)
    ## Warning: package 'rjags' was built under R version 4.0.5
    ## Loading required package: coda
    ## Warning: package 'coda' was built under R version 4.0.5
    ## Linked to JAGS 4.3.0
    ## Loaded modules: basemod,bugs
    library(bayesplot)
    ## Warning: package 'bayesplot' was built under R version 4.0.5
    ## This is bayesplot version 1.8.1
    ## - Online documentation and vignettes at mc-stan.org/bayesplot
    ## - bayesplot theme set to bayesplot::theme_default()
    ##    * Does _not_ affect other ggplot2 plots
    ##    * See ?bayesplot_theme_set for details on theme setting
    library(runjags) 
    ## Warning: package 'runjags' was built under R version 4.0.5
    n = 457 # sample size
    y = 248 # number of successes
    
    model_string <- "model{
    
      # Likelihood
      y ~ dbinom(theta, n)
    
      # Prior
      theta ~ dbeta(alpha, beta)
      alpha <- 200 # prior successes
      beta <- 200 # prior failures
    
    }"
    
    dataList = list(y = y, n = n)
    
    model <- jags.model(file = textConnection(model_string), 
                    data = dataList)
    ## Compiling model graph
    ##    Resolving undeclared variables
    ##    Allocating nodes
    ## Graph information:
    ##    Observed stochastic nodes: 1
    ##    Unobserved stochastic nodes: 1
    ##    Total graph size: 4
    ## 
    ## Initializing model
    update(model, n.iter = 1000)
    
    Nrep = 10000 # number of values to simulate
    
    posterior_sample <- coda.samples(model,
                       variable.names = c("theta"),
                       n.iter = Nrep)
    summary(posterior_sample)
    ## 
    ## Iterations = 2001:12000
    ## Thinning interval = 1 
    ## Number of chains = 1 
    ## Sample size per chain = 10000 
    ## 
    ## 1. Empirical mean and standard deviation for each variable,
    ##    plus standard error of the mean:
    ## 
    ##           Mean             SD       Naive SE Time-series SE 
    ##      0.5231261      0.0167655      0.0001677      0.0002145 
    ## 
    ## 2. Quantiles for each variable:
    ## 
    ##   2.5%    25%    50%    75%  97.5% 
    ## 0.4906 0.5119 0.5231 0.5344 0.5563
    plot(posterior_sample)

    library(bayesplot)
    mcmc_hist(posterior_sample)
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

    Type your response here.

The posterior distribution we calculated before without using JAGS is centered at 0.5227 and has a spread of 0.017. In comparison, the JAGS simulation has a mean of 0.5228 and a SD of 0.01689. These two summary statistics are extremely close to each and thus the JAGS output definitely does match the posterior distribution I previously computed.

  1. A Beta(200, 200) distribution is closely approximated by a Normal distribution with mean 0.5 and SD 0.025. Suppose we had assumed a N(0.5, 0.025) prior for \(\theta\), with the 248/457 sample data. Use grid approximation to approximate the posterior distribution of \(\theta\) and its mean and SD, and central 50%, 80%, and 98% posterior credible intervals. Create a plot showing prior, (scaled) likelihood, and posterior. How does it compare to the posterior from the previous part (with the Beta(200, 200) prior)?

    # Type your code here.
    theta = seq(0, 1, 0.0001)
    
    # prior
    prior = dnorm(theta, 0.5, 0.025) # shape of prior
    prior = prior / sum(prior) # scales so that prior sums to 1
    
    # data
    n = 457 # sample size
    y = 248 # sample count of success
    
    # likelihood, using binomial
    likelihood = dbinom(y, n, theta) # function of theta
    
    # posterior
    product = likelihood * prior
    posterior = product / sum(product)
    
    # plot
    ymax = max(c(prior, posterior))
    scaled_likelihood = likelihood * ymax / max(likelihood)
    
    plot(theta, prior, type='l', col='skyblue', xlim=c(0, 1), ylim=c(0, ymax), ylab='', yaxt='n')
    par(new=T)
    plot(theta, scaled_likelihood, type='l', col='orange', xlim=c(0, 1), ylim=c(0, ymax), ylab='',  yaxt='n')
    par(new=T)
    plot(theta, posterior, type='l', col='seagreen', xlim=c(0, 1), ylim=c(0, ymax), ylab='', yaxt='n')
    legend("topleft", c("prior", "scaled likelihood", "posterior"), lty=1, col=c("skyblue", "orange", "seagreen"))

    posterior_mean = sum(theta * posterior)
    posterior_mean
    ## [1] 0.5227598
    posterior_var = sum(theta ^ 2 * posterior) - posterior_mean ^ 2
    posterior_sd = sqrt(posterior_var)
    posterior_sd
    ## [1] 0.01707074

    Type your response here. The posterior distribution we calculated before with a prior distribution of Beta(200,200) is centered at 0.5227 and has a spread of 0.017. In comparison, when using a normal distribution for our prior with mean 0.5 and SD of 0.025 the posterior distribution in this latter case has a mean of 0.52275 and a SD of 0.01707. These two summary statistics are extremely close to each and thus the posterior distribution using a prior of N(0.5,0.025) definitely does match the posterior distribution I previously computed using a prior distribution of Beta(200,200).

  2. Use simulation to approximation the posterior distribution of the odds \(\phi = \frac{\theta}{1-\theta}\). Find and interpret in context a central 98% credible interval for \(\phi\). (Your posteriors for \(\theta\) from the previous parts should be pretty similar, so it doesn’t matter what you use here.)

      # Type your code here.
    theta = seq(0, 1, 0.0001)
    
    # prior
    prior = dnorm(theta, 0.5, 0.025) # shape of prior
    prior = prior / sum(prior) # scales so that prior sums to 1
    
    # data
    n = 457 # sample size
    y = 248 # sample count of success
    
    # likelihood, using binomial
    likelihood = dbinom(y, n, theta) # function of theta
    
    # posterior
    product = likelihood * prior
    posterior1 = product / sum(product)
    posterior = posterior1 / (1 - posterior1)
    
    # plot
    ymax = max(c(prior, posterior))
    scaled_likelihood = likelihood * ymax / max(likelihood)
    
    plot(theta, prior, type='l', col='skyblue', xlim=c(0, 1), ylim=c(0, ymax), ylab='', yaxt='n')
    par(new=T)
    plot(theta, scaled_likelihood, type='l', col='orange', xlim=c(0, 1), ylim=c(0, ymax), ylab='',  yaxt='n')
    par(new=T)
    plot(theta, posterior, type='l', col='seagreen', xlim=c(0, 1), ylim=c(0, ymax), ylab='', yaxt='n')
    legend("topleft", c("prior", "scaled likelihood", "posterior"), lty=1, col=c("skyblue", "orange", "seagreen"))

    posterior_cdf = cumsum(posterior)
    # posterior 98% central credible interval
    c(theta[max(which(posterior_cdf <= 0.01))], theta[min(which(posterior_cdf >= 0.99))])
    ## [1] 0.4829 0.5614

    Type your response here. There is a posterior probability of 98% that the true posterior odds of a team winning given that they are wearing red is 48.29% and 56.14%, for competitors similar to those in the 2004 Olympic Games for four combat sports: boxing, tae kwon do, Greco-Roman wrestling, and freestyle wrestling ; after observing the sample data, it’s 49 times mores plausible that θ is inside [0.4829 0.5614] as outside.

Problem 2.

This problem will investigate frequentist properties of Bayesian procedures (or lack thereof).

Consider estimating a population proportion \(\theta\) based on a sample of size \(n\) with \(Y\) successes, where \(Y\) has a Binomial(\(n\), \(\theta\)) distribution. Let \(\hat{p}=Y/n\) be the sample proportion.

  1. The sample proportion \(\hat{p}\) is an unbiased estimator of the population proportion \(p\): \(E(\hat{p}) = p\). That is, over many hypothetical samples, sample proportions do not systematically overestimate or underestimate the population proportion, they are on average just right (assuming the sampling procedure tends to produce representative samples.)

    Assume \(n=100\) and \(p=0.5\). Conduct a simulation to demonstrate that \(\hat{p}\) is unbiased. Hint: simulate many values of \(\hat{p}\), compute their average, and compare to \(p=0.5\).

    # Type your code here.
    theta = seq(0, 1, 0.0001)
    # data
    n = 100 # sample size
    
    # likelihood, using binomial
    y = rbinom(100000, n, theta) # function of theta
    p.hat = mean(y) / n
    p.hat
    ## [1] 0.4999513

    Type your response here. After simulating 100,000 samples of p.hat, the average comes out to be very close or exactly 0.5 depending on the iteration of the simulation. Thus, we can say that p.hat is an unbiased estimator of the population proportion p, because the expected value of p.hat = population proportion (p).

  2. Let \(\hat{\theta}\) denote the posterior mean in a Bayesian analysis. Is a Bayesian posterior mean \(\hat{\theta}\) unbiased for \(p\)? Assume a Beta(\(\alpha\), \(\beta\)) prior distribution for \(\theta\), with \(\alpha = 50\), \(\beta = 50\). Assume \(n=100\) and \(p=0.5\). Conduct a simulation to determine if the posterior mean is unbiased in this scenario. Hint: how does the Bayesian posterior mean \(\hat{\theta}\) relate to \(\hat{p}\)? You can simulate values of \(\hat{\theta}\) by first simulating values of \(\hat{p}\) as in the previous part, and then using the relationship between \(\hat{p}\) and \(\hat{\theta}\).

    # prior
    alpha_prior = 50
    beta_prior = 50
    
    # data
    n = 100
    p = .5
    # posterior
    alpha_post = alpha_prior + y
    beta_post = beta_prior + n - y
    posterior = rbeta(100000, alpha_post, beta_post)
    mean(posterior)
    ## [1] 0.500082

    Type your response here.

After simulating 100,000 samples of from the posterior distribution, utilizing the Beta-Binomial model, the average comes out to be very close or exactly 0.5 depending on the iteration of the simulation. Thus, we can say that mean of the posterior mean from the posterior distribution using Beta-Binomial is an unbiased estimator of the population proportion p.

  1. Repeat the two previous parts for different scenarios of \(n\), \(p\), \(\alpha\), and \(\beta\). Here are a few suggestions, but you can experiment with your own choices too.

    • \(n = 10, 100, 1000\)
    • \(p = 0.5, 0.7, 0.9\)
    • \(\alpha = 1, 10, 100\)
    • \(\beta = 1, 10, 100\)
library(magrittr) # needs to be run every time you start R and want to use %>%
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:runjags':
## 
##     extract
library(dplyr)  
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(janitor)
## Warning: package 'janitor' was built under R version 4.0.5
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(knitr)
## Warning: package 'knitr' was built under R version 4.0.5
# prior
alpha_prior = 10
beta_prior = 10

# data
n = 100

# posterior
alpha_post = alpha_prior + y
beta_post = beta_prior + n - y
posterior = rbeta(1000, alpha_post, beta_post)
p.hat = mean(posterior)
p.hat
## [1] 0.1251362
# prior
alpha_prior = 100
beta_prior = 100

# data
n = 100

# posterior
alpha_post = alpha_prior + y
beta_post = beta_prior + n - y
posterior = rbeta(1000, alpha_post, beta_post)
p.hat = mean(posterior)
p.hat
## [1] 0.3501234
# prior
alpha_prior = 1000
beta_prior = 1000

# data
n = 1000

# posterior
alpha_post = alpha_prior + y
beta_post = beta_prior + n - y
posterior = rbeta(1000, alpha_post, beta_post)
p.hat = mean(posterior)
p.hat
## [1] 0.3348879
Then write a few sentences summarizing your findings and comparing the sample proportion and the posterior mean in terms of their bias.

As we can see with these results, when the prior distribution becomes more in line with our expected value for the mean of p.hat when our alpha and betas become larger. This means that as our alpha and betas become larger our bias in this case decreases and similarly as our n (sample size) increases our bias decreases.

(You've already written code for the two previous parts.

Don’t copy the code a bunch of times. Instead, just change the inputs and record the outputs in a table like below.)

**Type your response here.**

| $n$ | $p$ | Mean of $\hat{p}$ | $\alpha$ | $\beta$ | Mean of $\hat{\theta}$ |
|----:|----:|------------------:|---------:|--------:|-----------------------:|
| 100 | 0.5 |             0.498 |        1 |       1 |                  0.506 |
  1. A commonly used frequentist 95% confidence interval for \(p\) has endpoints \[ \hat{p}\pm 2 \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \] Remember what 95% confidence means: over many hypothetical samples, the interval computed using the above formula will contain, or “cover”, the value of \(p\) for 95% of samples.

    Assume \(n=100\) and \(p=0.5\). Conduct a simulation to approximate the coverage level of the confidence interval procedure above. Hint: simulate a value of \(\hat{p}\), plug it into the formula to get a confidence interval, and then check if the interval contains \(p=0.5\). Repeat many times and find the proportion of intervals that contain \(p=0.5\).

    theta = seq(0, 1, 0.0001)
    # data
    n = 100 # sample size
    
    # likelihood, using binomial
    y = rbinom(10000, n, theta) # function of theta
    p.hat = mean(y) / n
    cil = p.hat - 2*sqrt((p.hat*(1-p.hat))/n)
    ciu = p.hat + 2*sqrt((p.hat*(1-p.hat))/n)
    cil
    ## [1] 0.399903
    ciu
    ## [1] 0.599903

    Type your response here. Therefore, we are 95% confident that the true value of the population proportion p is between 0.400091 and 0.600091.

  2. What is the frequentist coverage level for a central 95% posterior credible interval? Assume a Beta(\(\alpha\), \(\beta\)) prior distribution for \(\theta\), with \(\alpha = 50\), \(\beta = 50\). Assume \(n=100\) and \(p=0.5\). Conduct a simulation to approximate the coverage level for a central 95% posterior credible interval in this scenario. Hint: simulate \(\hat{p}\), find the Beta posterior distribution given \(\hat{p}\), then use qbeta to get the endpoints of the confidence interval. Then continue as in the previous part.

    # Type your code here.
    # prior
    alpha_prior = 50
    beta_prior = 50
    
    # data
    n = 100
    p = .5
    # posterior
    alpha_post = alpha_prior + y
    beta_post = beta_prior + n - y
    posterior = rbeta(100000, alpha_post, beta_post)
    # 98% posterior credible interval
    cil = qbeta(.01, alpha_post, beta_post)
    ciu = qbeta(.99, alpha_post, beta_post)
    mean(cil) 
    ## [1] 0.4217952
    mean(ciu)
    ## [1] 0.5781092

    Type your response here. There is a posterior probability of 98% that the true value of the population proportion is between 42.2% and 57.8%, after observing the sample data, it’s 49 times mores plausible that θ is inside [0.422, 0.578] as outside.

theta = seq(0, 1, 0.0001)
    # data
n = 1000 # sample size

# likelihood, using binomial
y1 = rbinom(10000, n, theta) # function of theta
p.hat = mean(y1) / n
cil = p.hat - 2*sqrt((p.hat*(1-p.hat))/n)
ciu = p.hat + 2*sqrt((p.hat*(1-p.hat))/n)
cil
## [1] 0.4682201
ciu
## [1] 0.5314657
 ####################################
# prior
alpha_prior = 100
beta_prior = 100

# data
n = 100
p = .5
# posterior
alpha_post = alpha_prior + y
beta_post = beta_prior + n - y
posterior = rbeta(100000, alpha_post, beta_post)
# 98% posterior credible interval
cil = qbeta(.01, alpha_post, beta_post)
ciu = qbeta(.99, alpha_post, beta_post)
mean(cil) 
## [1] 0.4343498
mean(ciu)
## [1] 0.5655862
  1. Repeat the two previous parts for different scenarios of \(n\), \(p\), \(\alpha\), and \(\beta\). Here are a few suggestions, but you can experiment with your own choices too

    • \(n = 100, 1000\)
    • \(p = 0.5, 0.7, 0.9\)
    • \(\alpha = 1, 10, 100\)
    • \(\beta = 1, 10, 100\)

    Then write a few sentences summarizing your findings and comparing the frequentist confidence interval and Bayesian cedible interval in terms of their coverage.

    (You’ve already written code for the two previous parts. Don’t copy the code a bunch of times. Instead, just change the inputs and record the outputs in a table like below.)

    Type your response here. Thus, we can see that as the sample size increases, the coverage of the confidence interval also increases. This makes sense as our posterior distribution is shifitng towards what we see in the data when p = .5. As, p increases, the coverage of confidence interval as well as the coverage of the credible interval deceases especially if the sample size also increases along with p. In the Beta prior case, as alpha and beta increase, so does the coverage of the credible interval, similar to the frequentists case

    \(n\) \(p\) Coverage of Confidence interval \(\alpha\) \(\beta\) Coverage of Credible interval
    100 0.5 0.937 1 1 0.957