PS 7.1

5. Simulate the sampling distribution of Ȳ in the dice-rolling example for n ∈ {20, 40, 80, 160}. Obtain 10,000 realizations of Ȳ per n. (Hint: you may want to write a helper function that takes as argument n and simulates n dice rolls. Then use this in the map() function to obtain your sample realization. Why can we get away with using map() here instead of pmap()?) Do the following: plot the sampling distributions of Ȳ for each n and comment on their center and variability; aggregate to find E(Ȳ) and Var(Ȳ) for each n, and verify that these well approximate the theoretical values.
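Following the hint, one possible helper (the name roll_dice is just illustrative, not prescribed by the assignment) and the reason map() suffices are sketched below:

# Hypothetical helper: simulate n rolls of a fair six-sided die
roll_dice <- function(n) sample(1:6, size = n, replace = TRUE)

# map() is enough because n is the only argument that varies across
# parameter rows; pmap() is only needed when several columns must be
# passed in parallel (as in Problem 6, where both n and lambda vary).
# Example: purrr::map(c(20, 40, 80, 160), roll_dice)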

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(purrrfect)

Attaching package: 'purrrfect'

The following objects are masked from 'package:base':

    replicate, tabulate
library(ggh4x)
dice_sim <- parameters(~n, c(20, 40, 80, 160)) %>%
  add_trials(10000) %>%
  mutate(
    # simulate n fair-die rolls for each parameter/trial row
    y_sample = map(n, ~ sample(1:6, size = .x, replace = TRUE)),
    # sample mean of each simulated set of rolls
    Ybar = map_dbl(y_sample, mean)
  )

head(dice_sim)
# A tibble: 6 × 4
      n .trial y_sample    Ybar
  <dbl>  <dbl> <list>     <dbl>
1    20      1 <int [20]>  3.25
2    20      2 <int [20]>  3.45
3    20      3 <int [20]>  3.6 
4    20      4 <int [20]>  3.8 
5    20      5 <int [20]>  3.45
6    20      6 <int [20]>  3.8 
ggplot(data = dice_sim, aes(x = Ybar)) + 
  geom_histogram(bins = 40, fill = "skyblue", color = "black") +
  facet_wrap(~ n, scales = "free_y") +
  labs(
    x = expression(bar(Y)),
    y = "Count",
    title = "Sampling Distributions of the Sample Mean"
  ) +
  theme_classic()

dice_summary <- dice_sim %>%
  group_by(n) %>%
  summarize(
    Ybar_mean = mean(Ybar),
    Ybar_var  = var(Ybar),
    .groups = "drop"
  ) %>%
  mutate(
    theoretical_mean = 3.5,
    theoretical_var  = 35 / (12 * n)
  )
dice_summary
# A tibble: 4 × 5
      n Ybar_mean Ybar_var theoretical_mean theoretical_var
  <dbl>     <dbl>    <dbl>            <dbl>           <dbl>
1    20      3.50   0.143               3.5          0.146 
2    40      3.50   0.0744              3.5          0.0729
3    80      3.50   0.0364              3.5          0.0365
4   160      3.50   0.0182              3.5          0.0182
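The theoretical columns follow from the moments of a single fair-die roll:

$$
E(Y) = \frac{1 + 2 + \cdots + 6}{6} = 3.5, \qquad
\operatorname{Var}(Y) = \frac{1^2 + 2^2 + \cdots + 6^2}{6} - 3.5^2 = \frac{91}{6} - \frac{49}{4} = \frac{35}{12},
$$

so $E(\bar{Y}) = 3.5$ and $\operatorname{Var}(\bar{Y}) = \operatorname{Var}(Y)/n = 35/(12n)$, which the simulated columns approximate closely for every $n$.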

For n = 20 the sampling distribution is the most variable: the histogram is wide and ragged, with irregular spikes at particular values of Ȳ, and the center is only loosely located somewhere between 3 and 4. At n = 40 the spread narrows noticeably, though it is still wide, and the distribution begins to peak around 3.5. At n = 80 the histogram is tighter and more symmetric, with a clear peak at 3.5 and bars falling off fairly evenly on either side. At n = 160 there is little variability left: the distribution is concentrated tightly around 3.5 and the center is unmistakable. In every case the center stays at about 3.5 while the variability shrinks as n grows, consistent with Var(Ȳ) = 35/(12n).

6. Verify via simulation study that the sum of an i.i.d. sample of POI(λ) random variables, S_n, is also Poisson with mean nλ by simulating 10,000 realizations of S_n for each combination of n∈{2,5,10} and λ∈{0.5,1.0,1.5} and plotting the simulated and analytic CDFs for each combination.
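The analytic benchmark uses the fact that the Poisson family is closed under independent sums:

$$
Y_1, \dots, Y_n \overset{\text{iid}}{\sim} \mathrm{POI}(\lambda) \;\Longrightarrow\; S_n = \sum_{i=1}^{n} Y_i \sim \mathrm{POI}(n\lambda),
$$

so the analytic CDF evaluated at each simulated realization is ppois(S_n, n * lambda), which is what the F column computes below.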

(sum_of_pois <- parameters(~n, ~lambda,
                           c(2, 5, 10), c(0.5, 1.0, 1.5)) %>%
   add_trials(10000) %>%
   # i.i.d. Poisson(lambda) sample of size n for each trial;
   # both n and lambda vary, so pmap() is needed here
   mutate(y_sample = pmap(list(n, lambda), \(nn, l) rpois(nn, l))) %>%
   # sum of each sample
   mutate(S_n = map_dbl(y_sample, sum)) %>%
   # analytic CDF of POI(n * lambda) evaluated at each realization
   mutate(F = ppois(S_n, n * lambda)) %>%
   # empirical CDF within each (n, lambda) combination
   mutate(Fhat = cume_dist(S_n), .by = c(n, lambda))
)
# A tibble: 90,000 × 7
       n lambda .trial y_sample    S_n     F  Fhat
   <dbl>  <dbl>  <dbl> <list>    <dbl> <dbl> <dbl>
 1     2    0.5      1 <int [2]>     2 0.920 0.918
 2     2    0.5      2 <int [2]>     0 0.368 0.369
 3     2    0.5      3 <int [2]>     1 0.736 0.733
 4     2    0.5      4 <int [2]>     2 0.920 0.918
 5     2    0.5      5 <int [2]>     0 0.368 0.369
 6     2    0.5      6 <int [2]>     0 0.368 0.369
 7     2    0.5      7 <int [2]>     1 0.736 0.733
 8     2    0.5      8 <int [2]>     0 0.368 0.369
 9     2    0.5      9 <int [2]>     1 0.736 0.733
10     2    0.5     10 <int [2]>     1 0.736 0.733
# ℹ 89,990 more rows
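As a quick optional check before plotting (a sketch that uses only the columns already in sum_of_pois), the largest gap between the empirical and analytic CDF values within each (n, lambda) cell should be small if S_n is indeed POI(n * lambda):

# maximum absolute gap between empirical and analytic CDFs per (n, lambda) cell
sum_of_pois %>%
  group_by(n, lambda) %>%
  summarize(max_gap = max(abs(Fhat - F)), .groups = "drop")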
ggplot(data = sum_of_pois, aes(x = S_n)) + 
  geom_step(aes(y = F, col = 'Analytic CDF')) + 
  geom_step(aes(y = Fhat, col = 'Empirical CDF')) + 
  facet_nested(n ~ lambda, labeller = label_both, scales = 'free_x') + 
  labs(x = expression(S[n]),
       y = 'CDF',
       title = 'Simulated and analytic CDFs of the sum of an i.i.d. Poisson sample',
       color = '') +
  theme_classic()