PS 8.2

Consider the previous question. When θ=2, simulate the coverage of the exact and asymptotically-justified 95% confidence intervals for n∈{5,10,20,40,80,160,320}. Plot the coverages of these two intervals as a function of n and comment on your findings.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(purrrfect)

Attaching package: 'purrrfect'

The following objects are masked from 'package:base':

    replicate, tabulate
N <- 10000

(many_gamma_CIs <- parameters(~n, ~theta, c(5, 10, 20, 40, 80, 160,320), c(2))
           %>% add_trials(N)
           %>% mutate(Ysample = pmap(list(n,theta), .f = \(n,t) rgamma(n, t,rate = t/theta)))
           %>% mutate(Ybar = map_dbl(Ysample, mean), S2 = map_dbl(Ysample, var))
           %>% mutate(LCL_exact = Ybar/qgamma(0.975, shape = n, scale = 1/n), 
                      UCL_exact = Ybar/qgamma(0.025, shape = n, scale = 1/n),
                      LCL_asymptotic = Ybar - 1.96*sqrt(S2/n), 
                      UCL_asymptotic = Ybar + 1.96*sqrt(S2/n))
           %>% mutate(exact_covers = ifelse(LCL_exact < theta & UCL_exact > theta, 1, 0))
           %>%mutate(asymptotic_covers = ifelse(LCL_asymptotic < theta & UCL_asymptotic > theta, 1, 0))
)
# A tibble: 70,000 × 12
       n theta .trial Ysample    Ybar    S2 LCL_exact UCL_exact LCL_asymptotic
   <dbl> <dbl>  <dbl> <list>    <dbl> <dbl>     <dbl>     <dbl>          <dbl>
 1     5     2      1 <dbl [5]>  1.81 1.45      0.885      5.58          0.758
 2     5     2      2 <dbl [5]>  1.73 1.82      0.845      5.33          0.547
 3     5     2      3 <dbl [5]>  1.67 1.73      0.813      5.13          0.514
 4     5     2      4 <dbl [5]>  2.57 3.64      1.25       7.91          0.897
 5     5     2      5 <dbl [5]>  2.57 2.91      1.25       7.90          1.07 
 6     5     2      6 <dbl [5]>  2.36 0.678     1.15       7.26          1.64 
 7     5     2      7 <dbl [5]>  2.28 0.702     1.11       7.03          1.55 
 8     5     2      8 <dbl [5]>  3.59 3.16      1.75      11.1           2.03 
 9     5     2      9 <dbl [5]>  1.25 1.62      0.612      3.86          0.137
10     5     2     10 <dbl [5]>  4.01 4.12      1.96      12.3           2.23 
# ℹ 69,990 more rows
# ℹ 3 more variables: UCL_asymptotic <dbl>, exact_covers <dbl>,
#   asymptotic_covers <dbl>
(gamma_coverage <- many_gamma_CIs 
     %>% group_by(n, theta)    
     %>% summarize(exact_coverage = mean(exact_covers), 
                   asymptotic_coverage = mean(asymptotic_covers)
                   )
) 
`summarise()` has grouped output by 'n'. You can override using the `.groups`
argument.
# A tibble: 7 × 4
# Groups:   n [7]
      n theta exact_coverage asymptotic_coverage
  <dbl> <dbl>          <dbl>               <dbl>
1     5     2          0.994               0.842
2    10     2          0.995               0.898
3    20     2          0.994               0.922
4    40     2          0.993               0.931
5    80     2          0.996               0.946
6   160     2          0.994               0.94 
7   320     2          0.995               0.949
ggplot(data = gamma_coverage) + 
  geom_line(aes(x = n, y = exact_coverage, color = 'exact')) + 
  geom_line(aes(x = n, y = asymptotic_coverage, color = 'asymptotic')) +
  geom_hline(aes(yintercept = 0.95)) + 
  scale_color_discrete(name='') + ylab('Coverage') +
  theme_classic() + 
  ggtitle('Simulated coverage of nominal 95% CIs for exponential scale parameter') +
  facet_wrap(~theta, labeller = label_both)

We can see that the coverage grows stronger as the sample grows and that with a larger sample size we have better coverage.

In this problem, we investigate the impact of violations of the normality assumption on 95% confidence interval coverage of population mean μ and variance σ^2. Consider Y_1,…,Y_n, an i.i.d. sample from a not-necessarily-normal distribution. In class notes we derived the following 95% confidence intervals, which should have exact coverage for all n if the sample comes from a normal population. Assuming normal data, we derived a 95% confidence interval for μ as: Y ‾±t_(.975,n-1)⋅s/√n where t_(.975,n-1) is the 97.5th percentile from a t-distribution with n-1 degrees-of-freedom. We also derived a 95% confidence interval for σ^2, which should have correct coverage for all n if the population is normal: (((n-1) S2)/(χ_(.975,n-1)2 ),((n-1) S2)/(χ_(.025,n-1)2 )) where χ_(.975,n-1)^2 and χ_(.025,n-1)^2 are, respectively, the 97.5th and 2.5th percentiles from a χ^2-distribution with n-1 degrees-of-freedom. Via simulation study, investigate actual coverage of these 95% confidence intervals for μ and σ^2 when the data come from various non-normal distributions. Present your coverages in easily readable tables following the structure below. The first row should contain the true values of μ and σ^2, and the rest of the entries with the simulated coverages of the true population parameters. (The geometric distribution models the number of independent Bernoulli failures until the first success, with pmf p(y)=p(1-p)^y,y∈0,1,2,3,…). Comment on your findings. For what distributions and sample size combinations are the consequences of the normality assumption violation most severe? Least severe? Are the confidence intervals for μ or for σ^2 most effected?

library(tidyverse)
library(purrrfect)
library(knitr)

N <- 10000

(mu_CI_sim <- parameters(
    ~n, ~dist,
    c(5, 10, 20, 40, 80, 160, 320),
    c("unif", "exp", "geo")
  ) %>%
  add_trials(N) %>%
  mutate(
    Ysample = pmap(list(n, dist),\(n, d) switch(d,
                                                   "unif" = runif(n, 0, 8),
                                                   "exp"  = rexp(n, rate = 1/4),
                                                   "geo"  = rgeom(n, prob = 0.2)
        )
    )
  ) %>%
  mutate(
    Ybar = map_dbl(Ysample, mean), S   = map_dbl(Ysample, sd)) %>%
  mutate(LCL = Ybar - qt(0.975, df = n - 1) * S / sqrt(n),
         UCL = Ybar + qt(0.975, df = n - 1) * S / sqrt(n) ) %>%
  mutate(covers_mu = ifelse(LCL < 4 & UCL > 4, 1, 0)
  )
)
# A tibble: 210,000 × 9
       n dist  .trial Ysample    Ybar     S      LCL   UCL covers_mu
   <dbl> <chr>  <dbl> <list>    <dbl> <dbl>    <dbl> <dbl>     <dbl>
 1     5 unif       1 <dbl [5]>  2.57 1.88   0.233    4.90         1
 2     5 unif       2 <dbl [5]>  3.34 1.84   1.05     5.63         1
 3     5 unif       3 <dbl [5]>  4.81 2.06   2.25     7.37         1
 4     5 unif       4 <dbl [5]>  3.26 2.63  -0.00762  6.53         1
 5     5 unif       5 <dbl [5]>  3.48 1.95   1.07     5.90         1
 6     5 unif       6 <dbl [5]>  4.36 0.582  3.63     5.08         1
 7     5 unif       7 <dbl [5]>  3.72 2.72   0.342    7.11         1
 8     5 unif       8 <dbl [5]>  4.49 1.95   2.07     6.91         1
 9     5 unif       9 <dbl [5]>  4.99 2.31   2.12     7.87         1
10     5 unif      10 <dbl [5]>  3.12 3.16  -0.803    7.04         1
# ℹ 209,990 more rows
(var_CI_sim <- parameters(
    ~n, ~dist,
    c(5, 10, 20, 40, 80, 160, 320),
    c("unif", "exp", "geo")
  ) %>%
  add_trials(N) %>%
  mutate(
    Ysample = pmap(list(n, dist), \(n, d)  switch(d,
                                                    "unif" = runif(n, 0, 8),
                                                    "exp"  = rexp(n, rate = 1/4),
                                                    "geo"  = rgeom(n, prob = 0.2)
        )
    )
  ) %>%
  mutate(S2 = map_dbl(Ysample, var), sigma2_true = case_when(
      dist == "unif" ~ 64/12,
      dist == "exp"  ~ 16,
      dist == "geo"  ~ 20
    )
  ) %>% mutate(LCL = (n - 1) * S2 / qchisq(0.975, df = n - 1),
               UCL = (n - 1) * S2 / qchisq(0.025, df = n - 1)
  ) %>%mutate( covers_var = ifelse(LCL < sigma2_true & UCL > sigma2_true, 1, 0)
  )
)
# A tibble: 210,000 × 9
       n dist  .trial Ysample      S2 sigma2_true   LCL   UCL covers_var
   <dbl> <chr>  <dbl> <list>    <dbl>       <dbl> <dbl> <dbl>      <dbl>
 1     5 unif       1 <dbl [5]> 7.24         5.33 2.60  59.8           1
 2     5 unif       2 <dbl [5]> 9.43         5.33 3.39  77.9           1
 3     5 unif       3 <dbl [5]> 5.06         5.33 1.82  41.8           1
 4     5 unif       4 <dbl [5]> 8.46         5.33 3.04  69.9           1
 5     5 unif       5 <dbl [5]> 8.06         5.33 2.89  66.5           1
 6     5 unif       6 <dbl [5]> 4.02         5.33 1.44  33.2           1
 7     5 unif       7 <dbl [5]> 3.56         5.33 1.28  29.4           1
 8     5 unif       8 <dbl [5]> 0.443        5.33 0.159  3.66          0
 9     5 unif       9 <dbl [5]> 7.92         5.33 2.84  65.4           1
10     5 unif      10 <dbl [5]> 4.52         5.33 1.62  37.3           1
# ℹ 209,990 more rows
var_coverage_table <- var_CI_sim %>%
  summarize(
    coverage = mean(covers_var),
    .by = c(n, dist)
  )

var_coverage_wide <- var_coverage_table %>%
  pivot_wider(
    names_from  = dist,
    values_from = coverage
  )


library(knitr)

kable(
  var_coverage_wide,
  digits = 3,
  col.names = c("n",
                "Y ~ UNIF(0,8)",
                "Y ~ EXP(4)",
                "Y ~ GEO(0.2)")
)
n Y ~ UNIF(0,8) Y ~ EXP(4) Y ~ GEO(0.2)
5 0.986 0.821 0.819
10 0.993 0.768 0.769
20 0.995 0.729 0.730
40 0.997 0.697 0.711
80 0.998 0.696 0.687
160 0.997 0.679 0.684
320 0.998 0.680 0.674
mu_coverage_table <- mu_CI_sim %>%
  summarize(
    coverage = mean(covers_mu),
    .by = c(n, dist)
  )
mu_coverage_wide <- mu_coverage_table %>%
  pivot_wider(
    names_from  = dist,
    values_from = coverage
  )


library(knitr)

kable(mu_coverage_wide, digits = 3,
      col.names = c("n",
                    "Y ~ UNIF(0,8)",
                    "Y ~ EXP(4)",
                    "Y ~ GEO(0.2)"))
n Y ~ UNIF(0,8) Y ~ EXP(4) Y ~ GEO(0.2)
5 0.935 0.879 0.880
10 0.944 0.901 0.904
20 0.950 0.919 0.921
40 0.948 0.931 0.935
80 0.951 0.943 0.938
160 0.948 0.950 0.943
320 0.953 0.950 0.946

The t-based confidence interval for the mean is reasonably robust to violations of normality, with coverage approaching the nominal 95% level for moderate to large sample sizes. However, the χ²-based confidence interval for the variance performs poorly under skewed and discrete distributions, particularly for small samples, where coverage can fall far below the nominal level. The most severe violations occur for the geometric distribution with small n, while the uniform distribution exhibits the least severe effects.