Practice set 4.7

Submission instructions

I want you to start familiarizing yourself with R Quarto, a medium for compiling text, code, and output in one reproducible document. For this problem, use this .qmd file as a starting point for writing your simulation studies.

When finished with your solutions:

Click the Publish button
Select RPubs
Follow the prompts for creating an account on RPubs and publishing your solution

Submit BOTH your .qmd file and a link to your published html.

Question 1

Recall the following problem. Let \(X\) represents the number of cars driving past a parking ramp in any given hour, and suppose that \(X\sim POI(\lambda)\). Given \(X=x\) cars driving past a ramp, let \(Y\) represent the number that decide to park in the ramp, with \(Y|X=x\sim BIN(x,p)\). We’ve shown analytically that unconditionally, \(Y\sim POI(\lambda p)\). Verify this result via a simulation study over a grid of \(\lambda \in \{10, 20, 30\}\) and \(p\in \{0.2, 0.4, 0.6\}\) using the framework presented in your notes, with 10,000 simulated outcomes per \((\lambda, p)\) combination. Create a faceted plot of the overlaid analytic and empirical CDFs as well as the p-p plot.

I’ll get you started:

Creating the dataframe

library(tidyverse)

Warning: package 'tidyverse' was built under R version 4.3.3

Warning: package 'ggplot2' was built under R version 4.3.3

Warning: package 'tibble' was built under R version 4.3.3

Warning: package 'tidyr' was built under R version 4.3.3

Warning: package 'readr' was built under R version 4.3.3

Warning: package 'purrr' was built under R version 4.3.3

Warning: package 'dplyr' was built under R version 4.3.3

Warning: package 'stringr' was built under R version 4.3.3

Warning: package 'forcats' was built under R version 4.3.3

Warning: package 'lubridate' was built under R version 4.3.3

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(purrrfect)


Attaching package: 'purrrfect'

The following objects are masked from 'package:base':

    replicate, tabulate

N <- 10000
(sim1 <- parameters(~lambda,~p,
                    c(10,20,30),c(.2,.4,.6)
                    )
  %>%add_trials(N)
  %>%mutate(X = map_int(lambda,\(x) rpois(1,x)))
  %>%mutate(Y = pmap_int(list(X,p), \(x,y) rbinom(1,x,y)))
  %>%mutate(SimcdfY = cume_dist(Y), .by = c(lambda,p))
  %>%mutate(FY = ppois(Y,lambda*p))
)

# A tibble: 90,000 × 7
   lambda     p .trial     X     Y SimcdfY    FY
    <dbl> <dbl>  <dbl> <int> <int>   <dbl> <dbl>
 1     10   0.2      1    15     3   0.862 0.857
 2     10   0.2      2     9     5   0.982 0.983
 3     10   0.2      3    10     2   0.678 0.677
 4     10   0.2      4     9     1   0.407 0.406
 5     10   0.2      5    14     7   0.999 0.999
 6     10   0.2      6    10     0   0.132 0.135
 7     10   0.2      7    15     6   0.996 0.995
 8     10   0.2      8     7     2   0.678 0.677
 9     10   0.2      9    15     4   0.947 0.947
10     10   0.2     10     8     4   0.947 0.947
# ℹ 89,990 more rows

Plotting overlaid analytic and empirical CDF

(ggplot(aes(x=Y),data=sim1)
+ geom_step(aes(y =SimcdfY, color = 'Simulated CDF'))
+ geom_step(aes(y = FY, color = 'Analytic CDF'))
+ labs(y=expression(P(Y <= y)), x= 'y', color='')
+ theme_classic(base_size = 12)
+ facet_grid(lambda~p, labeller = label_both)
)

Plotting p-plot

(ggplot(aes(x=SimcdfY, y = FY),data=sim1)
+ geom_point()
+ labs(y=expression(P(Y<= y)), x= expression(hat(P)(Y <= y)))
+ geom_abline(intercept=0, slope=1)
+ theme_classic(base_size = 10)
+ facet_grid(lambda~p, labeller = label_both)
)

Question 2

In this problem we will study the relationship between \(\alpha\) and \(Var(Y)\) for fixed \(\mu\) for the beta distribution.

Recall that if \(Y\sim BETA(\alpha,\beta)\) and with \(\mu \equiv E(Y)\), \(\beta = \alpha \cdot \frac{1-\mu}{\mu}\).

Use this fact to simulate 10,000 realizations of \(Y\sim BETA(\alpha,\beta)\) for each combination of \(\alpha \in \{2,4,8,16\}\) and \(\mu\in\{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7\}\).

A)

Create a faceted plot of the empirical densities overlaid with the analytic densities for each \((\alpha, \mu)\) combination.

Create dataframe

(sim2 <- parameters(~alpha,~mu,c(2,4,8,16),c(.1,.2,.3,.4,.5,.6,.7))
  
  %>%add_trials(N)
  %>%mutate(U = map_dbl(.trial, \(t) runif(1,0,1)))
  %>%mutate(B = pmap_dbl(list(alpha,mu),\(a,u) a*(1-u)/u))
  %>%mutate(Y = qbeta(U,alpha,B))
  %>%mutate(fY = dbeta(Y,alpha,B))
  %>%mutate(SimCDFY = cume_dist(Y), .by = c(alpha,mu))
  %>%mutate(FY = pbeta(Y,alpha,B))
)

# A tibble: 280,000 × 9
   alpha    mu .trial       U     B       Y    fY SimCDFY      FY
   <dbl> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>   <dbl>   <dbl>
 1     2   0.1      1 0.518      18 0.0897   6.21  0.520  0.518  
 2     2   0.1      2 0.612      18 0.106    5.41  0.612  0.612  
 3     2   0.1      3 0.361      18 0.0662   7.07  0.361  0.361  
 4     2   0.1      4 0.0470     18 0.0184   4.59  0.0462 0.0470 
 5     2   0.1      5 0.643      18 0.112    5.10  0.645  0.643  
 6     2   0.1      6 0.00225    18 0.00370  1.19  0.0019 0.00225
 7     2   0.1      7 0.0479     18 0.0186   4.62  0.047  0.0479 
 8     2   0.1      8 0.540      18 0.0932   6.04  0.540  0.540  
 9     2   0.1      9 0.521      18 0.0901   6.19  0.522  0.521  
10     2   0.1     10 0.659      18 0.115    4.94  0.660  0.659  
# ℹ 279,990 more rows

plot analytic and simulated probability density function

(ggplot(aes(x=Y),data=sim2)
+ geom_histogram(aes(y =after_stat(density)),
fill = 'red',
binwidth = 0.05, center = 0.05)
+ geom_line(aes(y = fY), col = 'blue', linewidth=1)
+ labs(y='Beta Density', x= 'y')
+ theme_classic(base_size = 12)
+ facet_grid(alpha~mu,labeller = label_both,scales = "free")
)

Plot analytic and simulated CDF

(ggplot(aes(x=Y),data=sim2)
+ geom_step(aes(y =SimCDFY, color = 'Simulated CDF'))
+ geom_step(aes(y = FY, color = 'Analytic CDF'))
+ labs(y=expression(P(Y <= y)), x= 'y', color='')
+ theme_classic(base_size = 12)
+ facet_grid(alpha~mu, labeller = label_both,scales = "free")
)

B)

Use summarize() to find the simulated variance of \(Y\) for each \((\alpha,\mu)\) combination.

simvar <-  (sim2 
  %>%summarize(var = var(Y),.by = c(alpha,mu))
  
)
simvar

# A tibble: 28 × 3
   alpha    mu     var
   <dbl> <dbl>   <dbl>
 1     2   0.1 0.00429
 2     2   0.2 0.0144 
 3     2   0.3 0.0276 
 4     2   0.4 0.0403 
 5     2   0.5 0.0511 
 6     2   0.6 0.0554 
 7     2   0.7 0.0545 
 8     4   0.1 0.00223
 9     4   0.2 0.00755
10     4   0.3 0.0146 
# ℹ 18 more rows

C)

Create a plot of the simulated variances as a function of \(\alpha\). Use lines as your geometry, and color-code each line by \(\mu\).

simvar <- (simvar%>%mutate(mu = factor(mu)))
(ggplot(data=simvar)
+geom_line(aes(x=alpha,y=var,group = mu, color = mu))
+ labs(y=expression(Variance(Y)), x= 'Alpha', color='Mu')
+ theme_classic(base_size = 12)
)

D)

Comment on how the combination of \(\alpha\) and \(\mu\) impact variance.

It appears that \(E(x) = \mu\) and \(Var(Y)\) have a positive relationship, i.e a higher \(\mu\) implies a higher \(Var(Y)\) when \(Y \sim BETA(\alpha,\beta)\) . And \(\alpha\) and \(Var(Y)\) have an inverse relationship, i.e a lower \(\alpha\) implies a lower \(Var(Y)\) .