It is unclear how much water is on the planet earth. This is important to know because it will help us survive.
To determine how much water is on earth, we sampled 9 random locations on a mock globe and recorded the presence of water at each location.
Statistical Analysis
To estimate the proportion of water versus land on earth, we fit a Bayesian binomial model using the number of successes (water) out of the number of trials (total samples). We assigned equal prior probability to each proportion of water, since we had no prior information about the proportion of water on this planet, we used a flat prior in which every possible proportion was given the same prior probability. However, based on previous surveys of other planets, we know that water is generally rare, but could also be common on some planets. Thus, alternative priors that reflect that knowledge might be more appropriate.
We estimated the posterior distribution of the proportion of water on earth using grid approximation. We then used the posterior predictive distribution to predict the number of water samples we would expect to get in a future planned study of 15,640 samples.
# Run the Analysis
waters = 6
tosses = 9
N = 10000 # number of conjectures to make for p_grid
p_grid <- seq(from = 0, to = 1, length.out = N) #1000 equally spaced numbers from 0 to 1
prob_p <- rep(1, N) #Flat Prior - repeat the number 1 1000 times.
prob_data <- dbinom(waters, size = tosses, prob = p_grid) # Likelihood - prob of 6 W in 9 tosses given each value in p_grid
posterior <- prob_data*prob_p # multiply prob_data by the prior probability
posterior <- posterior/sum(posterior) # Normalize the posterior so it sums to 1
# Extract the posterior via sampling (do this to summarize quantities of interest like the mean, credible intervals, median, sd, etc.)
samples <- sample(p_grid, prob = posterior, size = 60000, replace = TRUE)
Out of 9 samples, 6 contained water. Given these data and the model, there was a 95% probability that the proportion of water on earth was between 0.35 and 0.88 (Table 1) with a median of 0.6 (Figure 1).
Based on the posterior predictive distribution, we estimate that future surveys with a sample size of 15,640 should expect a median of ~10,000 water hits, but with reasonable expectations ranging from ~6,000 to 14,000 (Figure 2).
# Make table of posterior summary statistics
table1 <- tibble(median = median(samples),
mean = mean(samples),
sd = sd(samples),
low95 = quantile(samples, probs = 0.025),
high95 = quantile(samples, probs = 0.975)) %>%
mutate_if(is.numeric, round, 2) # round numeric values to 2 decimals
kable(table1, caption = "Table 1. Summary statistics of the posterior distribution of the proportion of water on earth") %>%
kable_styling(latex_options = c("striped", "hold_position")) #not essential, just a styling option
| median | mean | sd | low95 | high95 |
|---|---|---|---|---|
| 0.64 | 0.64 | 0.14 | 0.35 | 0.88 |
# Make a figure of the posterior distribution using ggplot
samples_as_tibble <- tibble(samples = samples) # tibble of samples
# Density plot with line for the median
ggplot(data = samples_as_tibble, aes(x = samples)) +
geom_histogram(fill = "dodgerblue", alpha = .8,
color = "grey80",
binwidth = 0.01) +
geom_vline(xintercept = median(samples), linetype = "dashed") +
labs(y = "Density",
x = "Proportion of Water")
Figure 1. Posterior distribution of the proportion of water on earth. The dashed line is the median.
# Make a figure of the posterior distribution using ggplot
forecast <- tibble(waters = rbinom(n = 10000, 15640, prob = samples))
# Density plot with line for the median
ggplot(data = forecast, aes(x = waters)) +
geom_histogram(fill = "dodgerblue", alpha = .8,
color = "grey80") +
geom_vline(xintercept = median(forecast$waters), linetype = "dashed") +
labs(y = "Density",
x = "Number of waters expected out of 15,640 future samples.")
Figure 2. Posterior predictive distribution of the number of water hits expected out of 15,640 future samples. The dashed line is the median.