This chapter introduced the basic procedures for manipulating posterior distributions. Our fundamental tool is samples of parameter values drawn from the posterior distribution. These samples can be used to produce intervals, point estimates, posterior predictive checks, as well as other kinds of simulations. Posterior predictive checks combine uncertainty about parameters, as described by the posterior distribution, with uncertainty about outcomes, as described by the assumed likelihood function. These checks are useful for verifying that your software worked correctly. They are also useful for prospecting for ways in which your models are inadequate.
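For example, a posterior predictive check for the chapter's globe-tossing model can be sketched in just a few lines (a minimal illustration only; the object names p_samples and ppc are introduced here for this sketch):
# Sketch of a posterior predictive check for 6 water in 9 tosses:
# sample p from the posterior, then simulate new counts with rbinom(),
# so parameter uncertainty and outcome uncertainty are both propagated.
library(rethinking)
p_grid <- seq(from = 0, to = 1, length.out = 1000)
posterior <- dbinom(6, size = 9, prob = p_grid)   # likelihood; with a flat prior this is proportional to the posterior
posterior <- posterior / sum(posterior)
p_samples <- sample(p_grid, size = 1e4, replace = TRUE, prob = posterior)
ppc <- rbinom(1e4, size = 9, prob = p_samples)    # simulated water counts
simplehist(ppc)                                   # distribution of simulated counts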
Place each answer inside a code chunk (grey box). The code chunks should contain a text response or code that completes or answers the question or activity requested. Make sure to include plots if the question requests them. Problems are labeled Easy (E), Medium (M), and Hard (H).
Finally, upon completion, name your final output .html file YourName_ANLY505-Year-Semester.html, publish the assignment to your R Pubs account, and submit the link to Canvas. Each question is worth 5 points.
library(rethinking)
# Grid approximation of the posterior for 6 water in 9 tosses with a flat prior
p_grid <- seq( from=0 , to=1 , length.out=1000 )   # grid of candidate values of p
prior <- rep( 1 , 1000 )                           # flat prior
likelihood <- dbinom( 6 , size=9 , prob=p_grid )   # binomial likelihood
posterior <- likelihood * prior
posterior <- posterior / sum(posterior)            # standardize so it sums to 1
# Draw 10,000 samples of p in proportion to their posterior probability
set.seed(100)
samples <- sample( p_grid , prob=posterior , size=1e4 , replace=TRUE )
Use the values in samples to answer the questions that follow.
3E1. How much posterior probability lies below p = 0.2?
# We can use logical indexing, sum the TRUE values, and divide by the number of samples
sum(samples < 0.2) / 1e4
## [1] 4e-04
# So about 0.04% of the posterior probability lies below p = 0.2.
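As an optional sanity check (using the p_grid and posterior objects from the setup chunk above), the same tail probability can be read directly off the grid approximation; it should agree with the sample-based estimate up to sampling noise:
# Posterior mass on grid points below p = 0.2
sum(posterior[p_grid < 0.2])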
3E2. How much posterior probability lies above p = 0.8?
# Similar to 3E1
sum(samples > 0.8) / 1e4
## [1] 0.1116
# So 11.16% of the posterior probability lies above p = 0.8.
3E3. How much posterior probability lies between p = 0.2 and p = 0.8?
# similar to 3E1, 3E2
sum(samples > 0.2 & samples < 0.8) / 1e4
## [1] 0.888
# So 88.8% of the posterior probability lies between p = 0.2 and p = 0.8.
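As another optional check, the three regions from 3E1-3E3 partition the samples (no grid point equals exactly 0.2 or 0.8), so their proportions should add up to 1:
# Proportions below 0.2, between 0.2 and 0.8, and above 0.8 sum to one
( sum(samples < 0.2) + sum(samples > 0.2 & samples < 0.8) + sum(samples > 0.8) ) / 1e4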
3E4. 20% of the posterior probability lies below which value of p?
# We can use the quantile() function
quantile(samples, 0.2)
## 20%
## 0.5185185
3E5. 20% of the posterior probability lies above which value of p?
quantile(samples, 1 - 0.2)
## 80%
## 0.7557558
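Both boundaries can also be requested in a single call, which serves as a convenient cross-check of the two answers above:
# 20th and 80th percentiles of the posterior samples together
quantile(samples, c(0.2, 0.8))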
3E6. Which values of p contain the narrowest interval equal to 66% of the posterior probability?
# Use the highest posterior density interval (HPDI)
HPDI(samples, prob = 0.66)
## |0.66 0.66|
## 0.5085085 0.7737738
# The narrowest 66% interval runs from about p = 0.51 to p = 0.77.
3E7. Which values of p contain 66% of the posterior probability, assuming equal posterior probability both below and above the interval?
# Use the percentile interval, which places equal probability mass in each tail
PI(samples, prob = 0.66)
## 17% 83%
## 0.5025025 0.7697698
# The central 66% interval runs from about p = 0.50 to p = 0.77, very close to the HPDI in 3E6.
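As a quick optional comparison, the widths of the two 66% intervals can be placed side by side; by construction the HPDI is the narrowest interval containing that probability mass, so its width should be no larger than the PI's:
# Interval widths: HPDI (narrowest) versus PI (central)
diff(HPDI(samples, prob = 0.66))
diff(PI(samples, prob = 0.66))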
3M1. Suppose the globe tossing data had turned out to be 8 water in 15 tosses. Construct the posterior distribution, using grid approximation. Plot the posterior. Use the same flat prior as before.
p_grid <- seq( from=0 , to=1 , length.out=1000 )
prior <- rep( 1 , 1000 )
likelihood <- dbinom( 8 , size=15 , prob=p_grid )
posterior <- likelihood * prior
posterior <- posterior / sum(posterior)
set.seed(100)
plot(x = p_grid, y = posterior, type = "l")
3M2. Draw 10,000 samples from the grid approximation from above. Then use the samples to calculate the 90% HPDI for p.
samples <- sample( p_grid , prob=posterior , size=1e4 , replace=TRUE )
HPDI(samples, prob = 0.90)
## |0.9 0.9|
## 0.3343343 0.7217217
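For comparison (optional, using the same samples), the central 90% percentile interval can be computed as well; it will differ slightly from the HPDI because the posterior for 8/15 is not perfectly symmetric:
# Central 90% percentile interval for the 8/15 posterior
PI(samples, prob = 0.90)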
3M3. Construct a posterior predictive check for this model and data. This means simulate the distribution of samples, averaging over the posterior uncertainty in p. What is the probability of observing 8 water in 15 tosses?
# We can use rbinom() to simulate the posterior predictive distribution, drawing one simulated count of water per posterior sample of p
simulate_samples <- rbinom( 10000, size = 15, prob = samples)
hist(simulate_samples)
mean(simulate_samples == 8)
## [1] 0.1499
# The probability of observing 8 water in 15 tosses is about 15% (0.1499).
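Beyond that single probability, the whole posterior predictive distribution can be tabulated (an optional extra, using the simulate_samples vector above):
# Proportion of simulated data sets yielding each possible count of water
round( table(simulate_samples) / length(simulate_samples), 3 )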
3M4. Using the posterior distribution constructed from the new (8/15) data, now calculate the probability of observing 6 water in 9 tosses.
simulate_samples <- rbinom( 10000, size = 9, prob = samples)
hist(simulate_samples)
mean(simulate_samples == 6)
## [1] 0.1842
# The probability of observing 6 water in 9 tosses is about 18.4% (0.1842).
3M5. Start over at 3M1, but now use a prior that is zero below p = 0.5 and a constant above p = 0.5. This corresponds to prior information that a majority of the Earth’s surface is water. Repeat each problem above and compare the inferences. Plot the posterior. What difference does the better prior make? If it helps, compare inferences (using both priors) to the true value p = 0.7.
# Redo 3M1
# The prior is no longer flat, so we can use ifelse() to set it to zero below p = 0.5 and to a constant (1) at or above 0.5
p_grid <- seq( from=0 , to=1 , length.out=1000 )
prior <- ifelse(p_grid < 0.5, 0, 1)
likelihood <- dbinom( 8 , size=15 , prob=p_grid )
posterior <- likelihood * prior
posterior <- posterior / sum(posterior)
set.seed(100)
plot(x = p_grid, y = posterior, type = "l")
# Compared with 3M1, the posterior is zero for every value of p below 0.5 and correspondingly taller above 0.5: the probability mass that previously sat below 0.5 has been reallocated to the right-hand side of the curve, which also lies closer to the true value p = 0.7.
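To see this shift directly, the two posteriors can be overlaid (a sketch; posterior_flat is a name introduced here only for the comparison):
# Recompute the flat-prior posterior from 3M1 and overlay both curves
posterior_flat <- dbinom(8, size = 15, prob = p_grid) * rep(1, 1000)
posterior_flat <- posterior_flat / sum(posterior_flat)
plot(p_grid, posterior, type = "l", xlab = "proportion of water (p)", ylab = "posterior probability")
lines(p_grid, posterior_flat, lty = 2)
abline(v = 0.7, lty = 3)   # true value for reference
legend("topleft", legend = c("step prior", "flat prior"), lty = c(1, 2))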
# Redo 3M2
samples <- sample( p_grid , prob=posterior , size=1e4 , replace=TRUE )
HPDI(samples, prob = 0.90)
## |0.9 0.9|
## 0.5005005 0.7097097
# Compared with 3M2, the informative prior mainly moves the lower bound of the HPDI: the 90% interval narrows from roughly [0.33, 0.72] to [0.50, 0.71], concentrating around the true value p = 0.7.
# Redo 3M3
simulate_samples <- rbinom( 10000, size = 15, prob = samples)
hist(simulate_samples)
mean(simulate_samples == 8)
## [1] 0.163
# Compared with 3M3, the peak of the posterior predictive distribution shifts from 8 toward 9, and the probability of observing 8 water in 15 tosses increases from about 15.0% to 16.3%.
# Redo 3M4
simulate_samples <- rbinom( 10000, size = 9, prob = samples)
hist(simulate_samples)
mean(simulate_samples == 6)
## [1] 0.2353
# Compared with 3M4, the peak of the posterior predictive distribution shifts from 5 toward 6, and the probability of observing 6 water in 9 tosses increases from about 18.4% to 23.5%.
3M6. Suppose you want to estimate the Earth’s proportion of water very precisely. Specifically, you want the 99% percentile interval of the posterior distribution of p to be only 0.05 wide. This means the distance between the upper and lower bound of the interval should be 0.05. How many times will you have to toss the globe to do this?
# We can set up a function that simulates N tosses with the true proportion of water (p = 0.7), re-fits the grid posterior, and returns the width of the 99% percentile interval
compute_width <- function(N) {
  p <- 0.7                                             # true proportion of water
  p_grid <- seq( from=0 , to=1 , length.out=1000 )
  simulate_samples <- rbinom( 1, size = N, prob = p )  # one simulated data set of N tosses
  prior <- rep(1, 1000)                                # flat prior
  likelihood <- dbinom( simulate_samples , size=N , prob=p_grid )
  posterior <- likelihood * prior
  posterior <- posterior / sum(posterior)
  samples <- sample( p_grid, prob=posterior , size=10000, replace=TRUE )
  interval <- PI(samples, 0.99)                        # 99% percentile interval
  width <- interval[2] - interval[1]
  return(width)
}
# Now we try different numbers of tosses
N <- 100
compute_width(N)
## 100%
## 0.2302352
N <- 500
compute_width(N)
## 100%
## 0.1021021
N <- 1000
compute_width(N)
## 100%
## 0.07407407
N <- 1500
compute_width(N)
## 100%
## 0.06006006
N <- 2000
compute_width(N)
## 100%
## 0.05205205
N <- 2100
compute_width(N)
## 100%
## 0.05305305
N <- 2200
compute_width(N)
## 100%
## 0.05105105
N <- 2300
compute_width(N)
## 100%
## 0.04904905
# So roughly 2,200 to 2,300 tosses are needed before the 99% percentile interval is about 0.05 wide; these single-run estimates are noisy because each call simulates only one data set.
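Because compute_width() simulates only one data set per call, the widths above bounce around from run to run. One optional refinement (a sketch; avg_width is a helper name introduced here) is to average the width over several replicates for each candidate N before deciding:
# Average the 99% interval width over several simulated data sets to
# smooth out Monte Carlo noise for a given number of tosses N.
avg_width <- function(N, reps = 20) {
  mean(replicate(reps, compute_width(N)))
}
sapply(c(2000, 2200, 2400, 2600), avg_width)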