STAT 415 Investigation 2

This RMarkdown file provides a template for you to fill in. Read the file from start to finish, completing the parts as indicated. Some blank “code chunks” are provided; you are welcome to add more (CTRL-ALT-I) as needed. There are also a few places where you should type text responses. Feel free to add additional text responses as necessary.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. When you are finished

click Knit and check to make sure that you have no errors and everything runs properly. (Fix any problems and redo this step until it works.)
Make sure your type your name(s) at the top of the notebook where it says “Type your name(s) here”. If you worked in a team, you will submit a single notebook with both names; make sure both names are included
Submit your completed file in Canvas.

Problem 1.

Does uniform color give athletes an advantage over their competitors? To investigate this question, Hill and Barton (Nature, 2005) examined the records in the 2004 Olympic Games for four combat sports: boxing, tae kwon do, Greco-Roman wrestling, and freestyle wrestling. Competitors in these sports were randomly assigned to wear either a red or a blue uniform. The researchers investigated whether competitors wearing one color won significantly more often than those wearing the other color. They collected data on a total of 457 matches.

Let the parameter \(\theta\) represent the probability that a competitor wearing red wins a match. Let \(Y\) be the number of matches, out of 457, in which the competitor wearing red defeats the competitor wearing blue.

Assume a Beta(50, 50) prior for \(\theta\). Use simulation to approximate the prior predictive distribution of \(Y\) and plot it. (It might make more sense to plot \(Y/457\).) Explain why a Beta(50, 50) distribution seems like an unreasonable choice for the prior of \(\theta\).
```
# Type your code here.
theta = seq(0, 1, 0.0001)

# prior
prior = theta ^ 49 * (1-theta)^49
prior = prior / sum(prior)

ylim = c(0, max(prior))
plot(theta, prior, type='l', xlim=c(0, 1), ylim=ylim, col="skyblue", xlab='theta', ylab='')
```
Type your response here. The prior predictive distribution of theta in the case of using Beta(50,50) has a mean at around 0.5, which is reasonable. however, the prior SD (spread) of this distribution is a little too wide and gives too much plausibility to smaller and bigger numbers for theta (comapared to what we know is reasonable for this problem, 0.5). Thus, this prior distribution is unreasonable and I think we can make it better.
Assume a Beta(200, 200) prior distribution for \(\theta\). Use simulation to approximate the prior predictive distribution of \(Y\) and plot it. (It might make more sense to plot \(Y/457\).) Explain why a Beta(200, 200) distribution seems like a more reasonable choice than Beta(50, 50) for the prior of \(\theta\).
```
# Type your code here.
theta = seq(0, 1, 0.0001)

# prior
prior = theta ^ 199 * (1-theta)^199
prior = prior / sum(prior)

ylim = c(0, max(prior))
plot(theta, prior, type='l', xlim=c(0, 1), ylim=ylim, col="skyblue", xlab='theta', ylab='')
```
Type your response here. We know that the most plausible value for theta is 0.5 and we should be confident in this given research question. Thus, in order to reflect this, we should have a relatively small SD. This is better shown in our prior distribution for theta in the case of Beta(200,200) vs Beta(50,50) where the latter has a much too large value for the prior SD and the former has a smaller SD, reflecting our confidence that we should expect 0.5 as our true value of theta. Thus, the Beta(200,200) seems more reasonable than the Beta(50,50) distribution.

Assume a Beta(50, 50) prior distribution for \(\theta\). The researchers analyzed a total of 457 matches, and they found that the competitor wearing red defeated the competitor wearing blue in 248 of these matches. Identify the posterior distribution, its mean and SD, and central 50%, 80%, and 98% posterior credible intervals. Create a plot showing prior, (scaled) likelihood, and posterior.

theta = seq(0, 1, 0.0001) # the grid is just for plotting

# prior
alpha_prior = 50
beta_prior = 50
prior = dbeta(theta, alpha_prior, beta_prior)

# data
n = 457
y = 248

# likelihood
likelihood = dbinom(y, n, theta)

# posterior
alpha_post = alpha_prior + y
beta_post = beta_prior + n - y
posterior = dbeta(theta, alpha_post, beta_post)

# plot
ymax = max(c(prior, posterior))
scaled_likelihood = likelihood * ymax / max(likelihood)

plot(theta, prior, type='l', col='skyblue', xlim=c(0, 1), ylim=c(0, ymax), ylab='', yaxt='n')
par(new=T)
plot(theta, scaled_likelihood, type='l', col='orange', xlim=c(0, 1), ylim=c(0, ymax), ylab='',  yaxt='n')
par(new=T)
plot(theta, posterior, type='l', col='seagreen', xlim=c(0, 1), ylim=c(0, ymax), ylab='', yaxt='n')
legend("topleft", c("prior", "scaled likelihood", "posterior"), lty=1, col=c("skyblue", "orange", "seagreen"))

# 50% posterior credible interval
qbeta(c(0.25, 0.75), alpha_post, beta_post)

## [1] 0.5207740 0.5492897

# 80% posterior credible interval
qbeta(c(0.1, 0.9), alpha_post, beta_post)

## [1] 0.5079061 0.5620579

# 98% posterior credible interval
qbeta(c(0.01, 0.99), alpha_post, beta_post)

## [1] 0.4857573 0.5838912

#posterior mean 
post.mean = (alpha_prior + y) / (alpha_prior + beta_prior + n)
post.mean

## [1] 0.535009

post.sd = sqrt((post.mean* (1-post.mean))/(alpha_prior +beta_prior+n+1))
post.sd

## [1] 0.02111474

Type your response here. The posterior distriubtion is now ~ Beta(298,259), with a posterior mean of 0.535 and a posterior SD of 0.0211. The central posterior 50% credible interval is: 0.5207740 0.5492897. The central posterior 80% credible interval is: 0.5079061 0.5620579. The central posterior 98% credible interval is: 0.4857573 0.5838912.

Assume a Beta(200, 200) prior distribution for \(\theta\). The researchers analyzed a total of 457 matches, and they found that the competitor wearing red defeated the competitor wearing blue in 248 of these matches. Identify the posterior distribution, its mean and SD, and central 50%, 80%, and 98% posterior credible intervals. Create a plot showing prior, (scaled) likelihood, and posterior. How sensitive are the results to the choice of prior (Beta(50, 50) versus Beta(200, 200))?

   theta = seq(0, 1, 0.0001) # the grid is just for plotting

# prior
alpha_prior = 200
beta_prior = 200
prior = dbeta(theta, alpha_prior, beta_prior)

# data
n = 457
y = 248

# likelihood
likelihood = dbinom(y, n, theta)

# posterior
alpha_post = alpha_prior + y
beta_post = beta_prior + n - y
posterior = dbeta(theta, alpha_post, beta_post)

# plot
ymax = max(c(prior, posterior))
scaled_likelihood = likelihood * ymax / max(likelihood)

plot(theta, prior, type='l', col='skyblue', xlim=c(0, 1), ylim=c(0, ymax), ylab='', yaxt='n')
par(new=T)
plot(theta, scaled_likelihood, type='l', col='orange', xlim=c(0, 1), ylim=c(0, ymax), ylab='',  yaxt='n')
par(new=T)
plot(theta, posterior, type='l', col='seagreen', xlim=c(0, 1), ylim=c(0, ymax), ylab='', yaxt='n')
legend("topleft", c("prior", "scaled likelihood", "posterior"), lty=1, col=c("skyblue", "orange", "seagreen"))

# 50% posterior credible interval
qbeta(c(0.25, 0.75), alpha_post, beta_post)

## [1] 0.5112535 0.5342734

# 80% posterior credible interval
qbeta(c(0.1, 0.9), alpha_post, beta_post)

## [1] 0.5008807 0.5446041

# 98% posterior credible interval
qbeta(c(0.01, 0.99), alpha_post, beta_post)

## [1] 0.4830347 0.5623168

#posterior mean 
post.mean = (alpha_prior + y) / (alpha_prior + beta_prior + n)
post.mean

## [1] 0.5227538

post.sd = sqrt((post.mean* (1-post.mean))/(alpha_prior +beta_prior+n+1))
post.sd

## [1] 0.01705203

Type your response here. The posterior distriubtion is now ~ Beta(448,409), with a posterior mean of 0.523 and a posterior SD of 0.0171. The central posterior 50% credible interval is: 0.5112535 0.5342734. The central posterior 80% credible interval is: 0.5008807 0.5446041. The central posterior 98% credible interval is: 0.4830347 0.5623168.

The results for the posterior distribution are less sensitive in the Beta(50,50) case when compared to the Beta(200,200) case. The reason for this is that the former is assuming a smaller degree of certain in the prior distribution, thus the data (likelihood function) takes more precedence and the posterior is less sensitive to this case. The posterior is more sensitive in the Beta(200,200) case as we are assuming a high degree of confidence in our prior distribution in this case because of the larger alpha and beta values acting as sample size causing our greater degree of certainty.

Express the posterior mean as a weighted average of the prior mean and the sample proportion.

Type your response here. 298/557 = ((50/100) * (100/557)) + ((248/457)*(457/557))

and

448/857 = ((200/400) * (400/857)) + ((248/457)*(457/857))

We might say that if the probability of winning by the competitor wearing red is between 0.49 and 0.51, then any color advantage is small enough to be considered unimportant. In other words \(\theta\) outside of 0.49 and 0.51 could be interpreted as one color giving a meaningful advantage. Compute and interpret the posterior probability that \(\theta\) is not in [0.49, 0.51]. (If you disagree with the range [0.49, 0.51] feel free to use different values.)
```
qbeta(c(0.016747, 0.118445), alpha_post, beta_post)
```
```
## [1] 0.4864531 0.5025683
```
```
res = 1 - (0.118445 - 0.016747)
res
```
```
## [1] 0.898302
```
Type your response here. Therefore, there is a posterior probability of 0.8983 that the true value of theta, the true probability of winning a match when the team in question wears red, is not within the range of 0.49 and .51. Thus, there is a high plausbility in the idea that the true probability of winning a match when the team in question wears red is not within the 0.49 and 0.51 range as previously believed.
Write 1-3 clearly worded sentences reporting the conclusions of your analysis in this context.

Type your response here. Thus, after taking into consideration a prior distribution with a center of 0.5, as well as observing data with a sample size of 457 matches and 248 wins by the team that was wearing red, we have determine a posterior distribution for the true probability of winning a match given that a team is wearing red. This distribution is centered at 0.5227 and has a spread of 0.017. Additionally, there is a posterior probability of 98% that the true probability of a team winning given that they are wearing red is 48.3% and 56.2%, for competitors similar to those in the 2004 Olympic Games for four combat sports: boxing, tae kwon do, Greco-Roman wrestling, and freestyle wrestling ; after observing the sample data, it’s 49 times mores plausible that θ is inside [0.4830347 0.5623168] as outside.

Assume a Beta(200, 200) prior and the 248/457 sample data. Use JAGS to approximate the posterior distribution of \(\theta\). Include a plot of the simulated posterior distribution, and some summary statistics. Does the JAGS output reasonably match the posterior distribution that you computed previously?

# Type your code here.
library(rjags)

## Warning: package 'rjags' was built under R version 4.0.5

## Loading required package: coda

## Warning: package 'coda' was built under R version 4.0.5

## Linked to JAGS 4.3.0

## Loaded modules: basemod,bugs

library(bayesplot)

## Warning: package 'bayesplot' was built under R version 4.0.5

## This is bayesplot version 1.8.1

## - Online documentation and vignettes at mc-stan.org/bayesplot

## - bayesplot theme set to bayesplot::theme_default()

##    * Does _not_ affect other ggplot2 plots

##    * See ?bayesplot_theme_set for details on theme setting

library(runjags)

## Warning: package 'runjags' was built under R version 4.0.5

n = 457 # sample size
y = 248 # number of successes

model_string <- "model{

  # Likelihood
  y ~ dbinom(theta, n)

  # Prior
  theta ~ dbeta(alpha, beta)
  alpha <- 200 # prior successes
  beta <- 200 # prior failures

}"

dataList = list(y = y, n = n)

model <- jags.model(file = textConnection(model_string), 
                data = dataList)

## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 1
##    Unobserved stochastic nodes: 1
##    Total graph size: 4
## 
## Initializing model

update(model, n.iter = 1000)

Nrep = 10000 # number of values to simulate

posterior_sample <- coda.samples(model,
                   variable.names = c("theta"),
                   n.iter = Nrep)
summary(posterior_sample)

## 
## Iterations = 2001:12000
## Thinning interval = 1 
## Number of chains = 1 
## Sample size per chain = 10000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##           Mean             SD       Naive SE Time-series SE 
##      0.5231261      0.0167655      0.0001677      0.0002145 
## 
## 2. Quantiles for each variable:
## 
##   2.5%    25%    50%    75%  97.5% 
## 0.4906 0.5119 0.5231 0.5344 0.5563

plot(posterior_sample)

library(bayesplot)
mcmc_hist(posterior_sample)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Type your response here.

The posterior distribution we calculated before without using JAGS is centered at 0.5227 and has a spread of 0.017. In comparison, the JAGS simulation has a mean of 0.5228 and a SD of 0.01689. These two summary statistics are extremely close to each and thus the JAGS output definitely does match the posterior distribution I previously computed.

A Beta(200, 200) distribution is closely approximated by a Normal distribution with mean 0.5 and SD 0.025. Suppose we had assumed a N(0.5, 0.025) prior for \(\theta\), with the 248/457 sample data. Use grid approximation to approximate the posterior distribution of \(\theta\) and its mean and SD, and central 50%, 80%, and 98% posterior credible intervals. Create a plot showing prior, (scaled) likelihood, and posterior. How does it compare to the posterior from the previous part (with the Beta(200, 200) prior)?

# Type your code here.
theta = seq(0, 1, 0.0001)

# prior
prior = dnorm(theta, 0.5, 0.025) # shape of prior
prior = prior / sum(prior) # scales so that prior sums to 1

# data
n = 457 # sample size
y = 248 # sample count of success

# likelihood, using binomial
likelihood = dbinom(y, n, theta) # function of theta

# posterior
product = likelihood * prior
posterior = product / sum(product)

# plot
ymax = max(c(prior, posterior))
scaled_likelihood = likelihood * ymax / max(likelihood)

plot(theta, prior, type='l', col='skyblue', xlim=c(0, 1), ylim=c(0, ymax), ylab='', yaxt='n')
par(new=T)
plot(theta, scaled_likelihood, type='l', col='orange', xlim=c(0, 1), ylim=c(0, ymax), ylab='',  yaxt='n')
par(new=T)
plot(theta, posterior, type='l', col='seagreen', xlim=c(0, 1), ylim=c(0, ymax), ylab='', yaxt='n')
legend("topleft", c("prior", "scaled likelihood", "posterior"), lty=1, col=c("skyblue", "orange", "seagreen"))

posterior_mean = sum(theta * posterior)
posterior_mean

## [1] 0.5227598

posterior_var = sum(theta ^ 2 * posterior) - posterior_mean ^ 2
posterior_sd = sqrt(posterior_var)
posterior_sd

## [1] 0.01707074

Type your response here. The posterior distribution we calculated before with a prior distribution of Beta(200,200) is centered at 0.5227 and has a spread of 0.017. In comparison, when using a normal distribution for our prior with mean 0.5 and SD of 0.025 the posterior distribution in this latter case has a mean of 0.52275 and a SD of 0.01707. These two summary statistics are extremely close to each and thus the posterior distribution using a prior of N(0.5,0.025) definitely does match the posterior distribution I previously computed using a prior distribution of Beta(200,200).

Use simulation to approximation the posterior distribution of the odds \(\phi = \frac{\theta}{1-\theta}\). Find and interpret in context a central 98% credible interval for \(\phi\). (Your posteriors for \(\theta\) from the previous parts should be pretty similar, so it doesn’t matter what you use here.)

  # Type your code here.
theta = seq(0, 1, 0.0001)

# prior
prior = dnorm(theta, 0.5, 0.025) # shape of prior
prior = prior / sum(prior) # scales so that prior sums to 1

# data
n = 457 # sample size
y = 248 # sample count of success

# likelihood, using binomial
likelihood = dbinom(y, n, theta) # function of theta

# posterior
product = likelihood * prior
posterior1 = product / sum(product)
posterior = posterior1 / (1 - posterior1)

# plot
ymax = max(c(prior, posterior))
scaled_likelihood = likelihood * ymax / max(likelihood)

plot(theta, prior, type='l', col='skyblue', xlim=c(0, 1), ylim=c(0, ymax), ylab='', yaxt='n')
par(new=T)
plot(theta, scaled_likelihood, type='l', col='orange', xlim=c(0, 1), ylim=c(0, ymax), ylab='',  yaxt='n')
par(new=T)
plot(theta, posterior, type='l', col='seagreen', xlim=c(0, 1), ylim=c(0, ymax), ylab='', yaxt='n')
legend("topleft", c("prior", "scaled likelihood", "posterior"), lty=1, col=c("skyblue", "orange", "seagreen"))

posterior_cdf = cumsum(posterior)
# posterior 98% central credible interval
c(theta[max(which(posterior_cdf <= 0.01))], theta[min(which(posterior_cdf >= 0.99))])

## [1] 0.4829 0.5614

Type your response here. There is a posterior probability of 98% that the true posterior odds of a team winning given that they are wearing red is 48.29% and 56.14%, for competitors similar to those in the 2004 Olympic Games for four combat sports: boxing, tae kwon do, Greco-Roman wrestling, and freestyle wrestling ; after observing the sample data, it’s 49 times mores plausible that θ is inside [0.4829 0.5614] as outside.

Problem 2.

This problem will investigate frequentist properties of Bayesian procedures (or lack thereof).

Consider estimating a population proportion \(\theta\) based on a sample of size \(n\) with \(Y\) successes, where \(Y\) has a Binomial(\(n\), \(\theta\)) distribution. Let \(\hat{p}=Y/n\) be the sample proportion.

The sample proportion \(\hat{p}\) is an unbiased estimator of the population proportion \(p\): \(E(\hat{p}) = p\). That is, over many hypothetical samples, sample proportions do not systematically overestimate or underestimate the population proportion, they are on average just right (assuming the sampling procedure tends to produce representative samples.)

Assume \(n=100\) and \(p=0.5\). Conduct a simulation to demonstrate that \(\hat{p}\) is unbiased. Hint: simulate many values of \(\hat{p}\), compute their average, and compare to \(p=0.5\).
```
# Type your code here.
theta = seq(0, 1, 0.0001)
# data
n = 100 # sample size

# likelihood, using binomial
y = rbinom(100000, n, theta) # function of theta
p.hat = mean(y) / n
p.hat
```
```
## [1] 0.4999513
```
Type your response here. After simulating 100,000 samples of p.hat, the average comes out to be very close or exactly 0.5 depending on the iteration of the simulation. Thus, we can say that p.hat is an unbiased estimator of the population proportion p, because the expected value of p.hat = population proportion (p).
Let \(\hat{\theta}\) denote the posterior mean in a Bayesian analysis. Is a Bayesian posterior mean \(\hat{\theta}\) unbiased for \(p\)? Assume a Beta(\(\alpha\), \(\beta\)) prior distribution for \(\theta\), with \(\alpha = 50\), \(\beta = 50\). Assume \(n=100\) and \(p=0.5\). Conduct a simulation to determine if the posterior mean is unbiased in this scenario. Hint: how does the Bayesian posterior mean \(\hat{\theta}\) relate to \(\hat{p}\)? You can simulate values of \(\hat{\theta}\) by first simulating values of \(\hat{p}\) as in the previous part, and then using the relationship between \(\hat{p}\) and \(\hat{\theta}\).
```
# prior
alpha_prior = 50
beta_prior = 50

# data
n = 100
p = .5
# posterior
alpha_post = alpha_prior + y
beta_post = beta_prior + n - y
posterior = rbeta(100000, alpha_post, beta_post)
mean(posterior)
```
```
## [1] 0.500082
```
Type your response here.

After simulating 100,000 samples of from the posterior distribution, utilizing the Beta-Binomial model, the average comes out to be very close or exactly 0.5 depending on the iteration of the simulation. Thus, we can say that mean of the posterior mean from the posterior distribution using Beta-Binomial is an unbiased estimator of the population proportion p.

Repeat the two previous parts for different scenarios of \(n\), \(p\), \(\alpha\), and \(\beta\). Here are a few suggestions, but you can experiment with your own choices too.
- \(n = 10, 100, 1000\)
- \(p = 0.5, 0.7, 0.9\)
- \(\alpha = 1, 10, 100\)
- \(\beta = 1, 10, 100\)

library(magrittr) # needs to be run every time you start R and want to use %>%

## 
## Attaching package: 'magrittr'

## The following object is masked from 'package:runjags':
## 
##     extract

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(janitor)

## Warning: package 'janitor' was built under R version 4.0.5

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(knitr)

## Warning: package 'knitr' was built under R version 4.0.5

# prior
alpha_prior = 10
beta_prior = 10

# data
n = 100

# posterior
alpha_post = alpha_prior + y
beta_post = beta_prior + n - y
posterior = rbeta(1000, alpha_post, beta_post)
p.hat = mean(posterior)
p.hat

## [1] 0.1251362

# prior
alpha_prior = 100
beta_prior = 100

# data
n = 100

# posterior
alpha_post = alpha_prior + y
beta_post = beta_prior + n - y
posterior = rbeta(1000, alpha_post, beta_post)
p.hat = mean(posterior)
p.hat

## [1] 0.3501234

# prior
alpha_prior = 1000
beta_prior = 1000

# data
n = 1000

# posterior
alpha_post = alpha_prior + y
beta_post = beta_prior + n - y
posterior = rbeta(1000, alpha_post, beta_post)
p.hat = mean(posterior)
p.hat

## [1] 0.3348879

Then write a few sentences summarizing your findings and comparing the sample proportion and the posterior mean in terms of their bias.

As we can see with these results, when the prior distribution becomes more in line with our expected value for the mean of p.hat when our alpha and betas become larger. This means that as our alpha and betas become larger our bias in this case decreases and similarly as our n (sample size) increases our bias decreases.

(You've already written code for the two previous parts.

Don’t copy the code a bunch of times. Instead, just change the inputs and record the outputs in a table like below.)

**Type your response here.**

| $n$ | $p$ | Mean of $\hat{p}$ | $\alpha$ | $\beta$ | Mean of $\hat{\theta}$ |
|----:|----:|------------------:|---------:|--------:|-----------------------:|
| 100 | 0.5 |             0.498 |        1 |       1 |                  0.506 |

A commonly used frequentist 95% confidence interval for \(p\) has endpoints \[ \hat{p}\pm 2 \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \] Remember what 95% confidence means: over many hypothetical samples, the interval computed using the above formula will contain, or “cover”, the value of \(p\) for 95% of samples.

Assume \(n=100\) and \(p=0.5\). Conduct a simulation to approximate the coverage level of the confidence interval procedure above. Hint: simulate a value of \(\hat{p}\), plug it into the formula to get a confidence interval, and then check if the interval contains \(p=0.5\). Repeat many times and find the proportion of intervals that contain \(p=0.5\).
```
theta = seq(0, 1, 0.0001)
# data
n = 100 # sample size

# likelihood, using binomial
y = rbinom(10000, n, theta) # function of theta
p.hat = mean(y) / n
cil = p.hat - 2*sqrt((p.hat*(1-p.hat))/n)
ciu = p.hat + 2*sqrt((p.hat*(1-p.hat))/n)
cil
```
```
## [1] 0.399903
```
```
ciu
```
```
## [1] 0.599903
```
Type your response here. Therefore, we are 95% confident that the true value of the population proportion p is between 0.400091 and 0.600091.
What is the frequentist coverage level for a central 95% posterior credible interval? Assume a Beta(\(\alpha\), \(\beta\)) prior distribution for \(\theta\), with \(\alpha = 50\), \(\beta = 50\). Assume \(n=100\) and \(p=0.5\). Conduct a simulation to approximate the coverage level for a central 95% posterior credible interval in this scenario. Hint: simulate \(\hat{p}\), find the Beta posterior distribution given \(\hat{p}\), then use qbeta to get the endpoints of the confidence interval. Then continue as in the previous part.
```
# Type your code here.
# prior
alpha_prior = 50
beta_prior = 50

# data
n = 100
p = .5
# posterior
alpha_post = alpha_prior + y
beta_post = beta_prior + n - y
posterior = rbeta(100000, alpha_post, beta_post)
# 98% posterior credible interval
cil = qbeta(.01, alpha_post, beta_post)
ciu = qbeta(.99, alpha_post, beta_post)
mean(cil) 
```
```
## [1] 0.4217952
```
```
mean(ciu)
```
```
## [1] 0.5781092
```
Type your response here. There is a posterior probability of 98% that the true value of the population proportion is between 42.2% and 57.8%, after observing the sample data, it’s 49 times mores plausible that θ is inside [0.422, 0.578] as outside.

theta = seq(0, 1, 0.0001)
    # data
n = 1000 # sample size

# likelihood, using binomial
y1 = rbinom(10000, n, theta) # function of theta
p.hat = mean(y1) / n
cil = p.hat - 2*sqrt((p.hat*(1-p.hat))/n)
ciu = p.hat + 2*sqrt((p.hat*(1-p.hat))/n)
cil

## [1] 0.4682201

ciu

## [1] 0.5314657

 ####################################
# prior
alpha_prior = 100
beta_prior = 100

# data
n = 100
p = .5
# posterior
alpha_post = alpha_prior + y
beta_post = beta_prior + n - y
posterior = rbeta(100000, alpha_post, beta_post)
# 98% posterior credible interval
cil = qbeta(.01, alpha_post, beta_post)
ciu = qbeta(.99, alpha_post, beta_post)
mean(cil)

## [1] 0.4343498

mean(ciu)

## [1] 0.5655862

Repeat the two previous parts for different scenarios of \(n\), \(p\), \(\alpha\), and \(\beta\). Here are a few suggestions, but you can experiment with your own choices too
- \(n = 100, 1000\)
- \(p = 0.5, 0.7, 0.9\)
- \(\alpha = 1, 10, 100\)
- \(\beta = 1, 10, 100\)
Then write a few sentences summarizing your findings and comparing the frequentist confidence interval and Bayesian cedible interval in terms of their coverage.

(You’ve already written code for the two previous parts. Don’t copy the code a bunch of times. Instead, just change the inputs and record the outputs in a table like below.)

Type your response here. Thus, we can see that as the sample size increases, the coverage of the confidence interval also increases. This makes sense as our posterior distribution is shifitng towards what we see in the data when p = .5. As, p increases, the coverage of confidence interval as well as the coverage of the credible interval deceases especially if the sample size also increases along with p. In the Beta prior case, as alpha and beta increase, so does the coverage of the credible interval, similar to the frequentists case

\(n\) \(p\) Coverage of Confidence interval \(\alpha\) \(\beta\) Coverage of Credible interval

100 0.5 0.937 1 1 0.957

STAT 415 Investigation 2

Gabriel Rodriguez

Problem 1.

Problem 2.