Practice set 4.7

Submission instructions

I want you to start familiarizing yourself with Quarto, a publishing system for weaving text, code, and output into a single reproducible document. For this problem, use this .qmd file as a starting point for writing your simulation studies.

When finished with your solutions:

  • Click the Publish button
  • Select RPubs
  • Follow the prompts for creating an account on RPubs and publishing your solution

Submit BOTH your .qmd file and a link to your published HTML.

Question 1

Recall the following problem. Let \(X\) represent the number of cars driving past a parking ramp in any given hour, and suppose that \(X\sim POI(\lambda)\). Given \(X=x\) cars driving past the ramp, let \(Y\) represent the number that decide to park in the ramp, with \(Y|X=x\sim BIN(x,p)\). We’ve shown analytically that unconditionally, \(Y\sim POI(\lambda p)\). Verify this result via a simulation study over a grid of \(\lambda \in \{10, 20, 30\}\) and \(p\in \{0.2, 0.4, 0.6\}\) using the framework presented in your notes, with 10,000 simulated outcomes per \((\lambda, p)\) combination. Create a faceted plot of the overlaid analytic and empirical CDFs as well as the P–P plot.
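For reference, the result follows by conditioning on \(X\) and summing it out:

\[
P(Y=y)=\sum_{x=y}^{\infty}\binom{x}{y}p^{y}(1-p)^{x-y}\cdot\frac{\lambda^{x}e^{-\lambda}}{x!}
=\frac{(\lambda p)^{y}e^{-\lambda}}{y!}\sum_{k=0}^{\infty}\frac{\left(\lambda(1-p)\right)^{k}}{k!}
=\frac{(\lambda p)^{y}e^{-\lambda p}}{y!}.
\]

The sum collapses because the binomial coefficient cancels against \(x!\), leaving the series for \(e^{\lambda(1-p)}\).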

I’ll get you started:

library(tidyverse)
library(purrrfect)
N <- 10000
set.seed(227)
(simstudy <- parameters(~ lambda, ~ p,
                        c(10, 20, 30),
                        c(0.2, 0.4, 0.6)) %>%
  add_trials(N) %>%
  # draw the number of passing cars, X ~ POI(lambda)
  mutate(X = map_int(lambda, \(l) rpois(1, l))) %>%
  # given X = x, draw the number that park, Y | X = x ~ BIN(x, p)
  mutate(Y = pmap_int(list(X, p), \(x, prob) rbinom(1, size = x, prob = prob))) %>%
  # empirical CDF, computed within each (lambda, p) cell
  mutate(Fhat_Y = cume_dist(Y), .by = c(lambda, p)) %>%
  # analytic CDF under the claimed result Y ~ POI(lambda * p)
  mutate(F_Y = ppois(Y, lambda * p))
) %>% head()
# A tibble: 6 × 7
  lambda     p .trial     X     Y Fhat_Y   F_Y
   <dbl> <dbl>  <dbl> <int> <int>  <dbl> <dbl>
1     10   0.2      1    11     3  0.858 0.857
2     10   0.2      2     7     1  0.413 0.406
3     10   0.2      3     8     0  0.138 0.135
4     10   0.2      4     7     3  0.858 0.857
5     10   0.2      5     9     1  0.413 0.406
6     10   0.2      6     6     1  0.413 0.406
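Before plotting, it is worth a quick sanity check that each cell’s simulated mean tracks \(\lambda p\) (an optional sketch using the simstudy tibble above):

# each cell's simulated mean of Y should be close to lambda * p
simstudy %>%
  summarize(mean_Y = mean(Y), lambda_p = first(lambda * p), .by = c(lambda, p))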
(ggplot(aes(x = Y), data = simstudy)
  + geom_step(aes(y = Fhat_Y, color = "Simulated CDF"))
  + geom_step(aes(y = F_Y, color = "Analytic CDF"))
  + labs(y = expression(P(Y <= y)), color = NULL)
  + theme_classic(base_size = 12)
  + facet_grid(lambda ~ p, labeller = label_both))

Empirical vs analytic CDFs across λ and p.
(ggplot(aes(x = Fhat_Y, y = F_Y), data = simstudy)
  + geom_point(alpha = 0.6)
  + geom_abline(intercept = 0, slope = 1, color = "gray50", linetype = "dashed")
  + labs(y = expression(P(Y <= y)),
         x = expression(hat(P)(Y <= y)))
  + theme_classic(base_size = 12)
  + facet_grid(lambda ~ p, labeller = label_both)
)

P–P plot comparing empirical and analytic CDFs across λ and p.
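As a numeric companion to the P–P plot, one can also report the largest gap between the empirical and analytic CDFs in each cell (optional, not part of the required solution):

# Kolmogorov-style check: the maximum |Fhat - F| per (lambda, p) cell
simstudy %>%
  summarize(max_gap = max(abs(Fhat_Y - F_Y)), .by = c(lambda, p))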

Question 2

In this problem we will study, for the beta distribution, how \(Var(Y)\) depends on \(\alpha\) when \(\mu\) is held fixed.

Recall that if \(Y\sim BETA(\alpha,\beta)\) with \(\mu \equiv E(Y) = \frac{\alpha}{\alpha+\beta}\), then solving for \(\beta\) gives \(\beta = \alpha \cdot \frac{1-\mu}{\mu}\).

Use this fact to simulate 10,000 realizations of \(Y\sim BETA(\alpha,\beta)\) for each combination of \(\alpha \in \{2,4,8,16\}\) and \(\mu\in\{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7\}\).

library(tidyverse)
library(purrrfect)
N <- 10000
set.seed(227)
(simstudy2 <- parameters(~alpha, ~mu,
                        c(2, 4, 8, 16),
                        c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7)
) %>%
  add_trials(N) %>%
  mutate(
    beta = alpha * (1 - mu) / mu,           # solve E(Y) = mu for beta
    Y = pmap_dbl(list(alpha, beta), \(a, b) rbeta(1, a, b)),
    f_Y = dbeta(Y, alpha, beta)             # analytic density
  ) %>%
  # empirical CDF, computed within each (alpha, mu) cell
  mutate(Fhat_Y = cume_dist(Y), .by = c(alpha, mu)) %>%
  mutate(F_Y = pbeta(Y, alpha, beta))       # analytic CDF
) %>% head()
# A tibble: 6 × 8
  alpha    mu .trial  beta      Y   f_Y Fhat_Y   F_Y
  <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>
1     2   0.1      1    18 0.152   3.17 0.190  0.807
2     2   0.1      2    18 0.0508  7.16 0.0208 0.251
3     2   0.1      3    18 0.0773  6.73 0.0558 0.438
4     2   0.1      4    18 0.168   2.52 0.213  0.853
5     2   0.1      5    18 0.154   3.08 0.193  0.813
6     2   0.1      6    18 0.0764  6.77 0.0542 0.432
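Before plotting, a quick sanity check (not required) that the \(\beta = \alpha(1-\mu)/\mu\) construction hits the target mean:

# the simulated mean of Y should be close to the target mu in every cell
simstudy2 %>%
  summarize(mean_Y = mean(Y), .by = c(alpha, mu))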

A)

Create a faceted plot of the empirical densities overlaid with the analytic densities for each \((\alpha, \mu)\) combination.

(ggplot(aes(x = Y), data = simstudy2)
  + geom_histogram(aes(y = after_stat(density)),
                   fill = 'goldenrod',
                   binwidth = 0.02)
  + geom_line(aes(y = f_Y), col = 'cornflowerblue', linewidth = 1)
  + labs(y = 'Beta density', x = 'Y')
  + theme_classic(base_size = 12)
  + facet_grid(alpha ~ mu, labeller = label_both, scales = "free_x")
)

B)

Use summarize() to find the simulated variance of \(Y\) for each \((\alpha,\mu)\) combination.

(sim_var <- simstudy2 %>%
  group_by(alpha, mu) %>%
  summarize(
    simulated_var = var(Y),
    # alpha and beta are constant within each group, so take the first value
    theoretical_var = first(alpha * beta / ((alpha + beta)^2 * (alpha + beta + 1))),
    .groups = "drop"
  )
)
# A tibble: 28 × 4
   alpha    mu simulated_var theoretical_var
   <dbl> <dbl>         <dbl>           <dbl>
 1     2   0.1       0.00424         0.00429
# ℹ 27 more rows
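As an optional one-number summary of the agreement (a quick sketch using the sim_var tibble above):

# largest relative discrepancy between simulated and theoretical variances
sim_var %>%
  summarize(max_rel_err = max(abs(simulated_var / theoretical_var - 1)))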

C)

Create a plot of the simulated variances as a function of \(\alpha\). Use lines as your geometry, and color-code each line by \(\mu\).

(ggplot(sim_var, aes(x = alpha, y = simulated_var, color = as.factor(mu), group = mu))
  + geom_line(linewidth = 1)
  + labs(
      x = expression(alpha),
      y = "Simulated Variance",
      color = expression(mu)
    )
  + theme_classic(base_size = 12)
)
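Though not required by the prompt, overlaying the theoretical variances from part B as dashed lines is a quick visual check that the simulated curves track them:

# simulated (solid) vs theoretical (dashed) variance as a function of alpha
(ggplot(sim_var, aes(x = alpha, color = as.factor(mu), group = mu))
  + geom_line(aes(y = simulated_var), linewidth = 1)
  + geom_line(aes(y = theoretical_var), linetype = "dashed")
  + labs(x = expression(alpha), y = "Variance", color = expression(mu))
  + theme_classic(base_size = 12)
)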

D)

Comment on how the combination of \(\alpha\) and \(\mu\) impacts the variance.

We can see that for small \(\alpha\), the variances for different values of \(\mu\) are far apart, with larger \(\mu\) generally giving larger variance. As \(\alpha\) increases, every curve decreases, and by \(\alpha = 16\) the variances are much closer together than they were at \(\alpha = 2\). In short, \(\alpha\) controls the overall concentration of the distribution while \(\mu\) sets where it sits on \((0,1)\): small \(\alpha\) amplifies the influence of \(\mu\) on the variance, whereas large \(\alpha\) dominates, shrinking the variance and producing nearly uniform values across different \(\mu\).
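This behavior matches the closed form for the beta variance. Substituting \(\beta = \alpha\frac{1-\mu}{\mu}\) (so that \(\alpha+\beta = \alpha/\mu\)) gives

\[
Var(Y)=\frac{\alpha\beta}{(\alpha+\beta)^{2}(\alpha+\beta+1)}=\frac{\mu(1-\mu)}{\alpha/\mu+1}=\frac{\mu^{2}(1-\mu)}{\alpha+\mu},
\]

so for fixed \(\mu\) the variance decays like \(1/(\alpha+\mu)\) as \(\alpha\) grows, while for fixed \(\alpha\) the numerator \(\mu^{2}(1-\mu)\) increases in \(\mu\) until its maximum at \(\mu = 2/3\). This is why the curves fan out for small \(\alpha\) and pinch together as \(\alpha\) increases.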