Practice set 4.7

Submission instructions

I want you to start familiarizing yourself with R Quarto, a medium for compiling text, code, and output in one reproducible document. For this problem, use this .qmd file as a starting point for writing your simulation studies.

When finished with your solutions:

Click the Publish button
Select RPubs
Follow the prompts for creating an account on RPubs and publishing your solution

Submit BOTH your .qmd file and a link to your published html.

Question 1

Recall the following problem. Let \(X\) represents the number of cars driving past a parking ramp in any given hour, and suppose that \(X\sim POI(\lambda)\). Given \(X=x\) cars driving past a ramp, let \(Y\) represent the number that decide to park in the ramp, with \(Y|X=x\sim BIN(x,p)\). We’ve shown analytically that unconditionally, \(Y\sim POI(\lambda p)\). Verify this result via a simulation study over a grid of \(\lambda \in \{10, 20, 30\}\) and \(p\in \{0.2, 0.4, 0.6\}\) using the framework presented in your notes, with 10,000 simulated outcomes per \((\lambda, p)\) combination. Create a faceted plot of the overlaid analytic and empirial CDFs as well as the p-p plot.

I’ll get you started:

library(tidyverse)
library(purrrfect)
# Create Sim / Analytical Framework
N <- 10000
(simstudy <- parameters(~lambda, ~p,
                        c(10,20,30), c(.2,.4,.6))
%>% add_trials(N)
%>% mutate(X = map_dbl(lambda, \(L) rpois(1, L)))
%>% mutate(Y = pmap_dbl(list(X,p), \(var1, var2) rbinom(1, size = var1, prob = var2)))
%>% mutate(Fhat_Y = cume_dist(Y), .by = c(lambda, p))
%>% mutate(F_Y = ppois(Y, lambda * p))
)

# A tibble: 90,000 × 7
   lambda     p .trial     X     Y Fhat_Y   F_Y
    <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl> <dbl>
 1     10   0.2      1    14     3  0.857 0.857
 2     10   0.2      2    11     3  0.857 0.857
 3     10   0.2      3     8     2  0.682 0.677
 4     10   0.2      4    11     2  0.682 0.677
 5     10   0.2      5    13     3  0.857 0.857
 6     10   0.2      6     9     1  0.417 0.406
 7     10   0.2      7    10     1  0.417 0.406
 8     10   0.2      8    13     3  0.857 0.857
 9     10   0.2      9    10     3  0.857 0.857
10     10   0.2     10     8     1  0.417 0.406
# ℹ 89,990 more rows

Overlaid CDFs

(ggplot(aes(x = Y), data = simstudy)
 + geom_step(aes(y = Fhat_Y, color = "Simulated CDF"))
 + geom_step(aes(y = F_Y, color = "Analytic CDF"))
 + labs(y = expression(P(Y <= y)), x = "x", color = "")
 + theme_classic(base_size = 12)
 + facet_grid(lambda ~ p, labeller = label_both, scale= 'free_x')
)

P-P Plot

(ggplot(aes(x = Fhat_Y, y = F_Y), data = simstudy)
 + geom_point(alpha = 0.4)
 + geom_abline(intercept = 0, slope = 1, color = "grey", linetype = "dashed")
 + labs(y = expression(P(Y <= y)),
        x = expression(hat(P)(Y <= y)))
 + theme_classic(base_size = 12)
 + facet_grid(lambda ~ p, labeller = label_both)
)

Question 2

In this problem we will study the relationship between \(\alpha\) and \(Var(Y)\) for fixed \(\mu\) for the beta distribution.

Recall that if \(Y\sim BETA(\alpha,\beta)\) and with \(\mu \equiv E(Y)\), \(\beta = \alpha \cdot \frac{1-\mu}{\mu}\).

Use this fact to simulate 10,000 realizations of \(Y\sim BETA(\alpha,\beta)\) for each combination of \(\alpha \in \{2,4,8,16\}\) and \(\mu\in\{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7\}\).

A)

Create a faceted plot of the empirical densities overlaid with the analytic densities for each \((\alpha, \mu)\) combination.

Sim Framework

N <- 10000

(simstudy <- parameters(~alpha, ~mu,
                        c(2, 4, 8, 16),
                        seq(.1, .7, by = .1))
  %>% add_trials(N)
  %>% mutate(beta = alpha * (1 - mu) / mu)
  %>% mutate(Y = pmap_dbl(list(alpha, beta),\(a, b) rbeta(1, shape1 = a, shape2 = b)))
  %>% mutate(Fhat_Y = cume_dist(Y), .by = c(alpha, mu))
  %>% mutate(F_Y = pbeta(Y, alpha, beta))
)

# A tibble: 280,000 × 7
   alpha    mu .trial  beta      Y Fhat_Y    F_Y
   <dbl> <dbl>  <dbl> <dbl>  <dbl>  <dbl>  <dbl>
 1     2   0.1      1    18 0.0267 0.0905 0.0902
 2     2   0.1      2    18 0.236  0.962  0.959 
 3     2   0.1      3    18 0.111  0.648  0.640 
 4     2   0.1      4    18 0.0576 0.306  0.300 
 5     2   0.1      5    18 0.0228 0.0675 0.0686
 6     2   0.1      6    18 0.0973 0.569  0.564 
 7     2   0.1      7    18 0.119  0.686  0.678 
 8     2   0.1      8    18 0.0675 0.374  0.370 
 9     2   0.1      9    18 0.180  0.888  0.881 
10     2   0.1     10    18 0.0357 0.147  0.146 
# ℹ 279,990 more rows

Plots

(ggplot(aes(x = Y), data = simstudy)
 + geom_step(aes(y = Fhat_Y, color = 'Simulated CDF'))
 + geom_step(aes(y = F_Y, color = 'Analytic CDF'))
 + labs(y = expression(P(Y <= y)), x = 'x', color = '')
 + theme_classic(base_size = 12)
 + facet_grid(alpha ~ mu, labeller = label_both, scales = 'free_x')
)

(ggplot(aes(x = Fhat_Y, y = F_Y), data = simstudy)
 + geom_point(alpha = 0.1, size = 0.3, color = "steelblue")
 + geom_abline(intercept = 0, slope = 1, color = "gray", linetype = "dashed")
 + labs(y = expression(P(Y <= y)),
        x = expression(hat(P)(Y <= y)))
 + coord_fixed()
 + theme_classic(base_size = 11)
 + facet_grid(alpha ~ mu, labeller = label_both)
)

B)

Use summarize() to find the simulated variance of \(Y\) for each \((\alpha,\mu)\) combination.

(sim_var <- simstudy %>%
  summarize(varY = var(Y), .by = c(alpha, mu)))

# A tibble: 28 × 3
   alpha    mu    varY
   <dbl> <dbl>   <dbl>
 1     2   0.1 0.00416
 2     2   0.2 0.0145 
 3     2   0.3 0.0269 
 4     2   0.4 0.0391 
 5     2   0.5 0.0505 
 6     2   0.6 0.0542 
 7     2   0.7 0.0542 
 8     4   0.1 0.00227
 9     4   0.2 0.00762
10     4   0.3 0.0149 
# ℹ 18 more rows

C)

Create a plot of the simulated variances as a function of \(\alpha\). Use lines as your geometry, and color-code each line by \(\mu\).

(ggplot(sim_var, aes(x = alpha, y = varY, color = as.factor(mu), group = mu))
 + geom_line(linewidth = 1)
 + geom_point()
 + labs(y = expression(hat(Var)(Y)), x = expression(alpha), color = expression(mu))
 + theme_classic(base_size = 12)
)

D)

Comment on how the combination of \(\alpha\) and \(\mu\) impact variance.

For small values of sigma, when mu is fixed, variance is larger. For large values of sigma when mu is fixed, variance is smaller. As mu increases, so does variance.