7.1 - Intro to sampling distributions

What is a sampling distribution?

  • In STAT 460 we turn in full force to the concept of sampling distributions
  • The term “sampling distribution” is a rather nondescript name for the key idea of:

The distribution of a statistic, computed from a sample of size \(n\), across repeated samples from a population

The sampling distribution visual

What do we need them for?

  • Sampling distributions are essential for basically all of statistics:
    • Forming confidence intervals
    • Obtaining test statistics and p-values
    • Performing hypothesis tests
    • Discussing the quality of parameter estimators
  • All of which we will discuss in more detail this semester!

An i.i.d. sample

  • We will now often talk about a sample \(Y_1,...,Y_n\) drawn independently and identically distributed, or i.i.d. for short, from a population
  • We have actually already considered several examples of sampling distributions when \(n=2\):
    • Let \(Y_1\) and \(Y_2\) be independent, and identically distributed (i.i.d.) \(N(0,1)\) random variables
    • What is the distribution of \(U = Y_1 + Y_2\)? (a quick MGF sketch of the answer follows this list)
  • Note that previously we simplified our notation with \(Y_1 \equiv X\) and \(Y_2 \equiv Y\), but starting now we will often want to consider the impact of \(n\) on our sampling distribution!
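
A quick sketch of the answer, using the MGF method reviewed on the next slide (and the fact that a \(N(0,1)\) random variable has MGF \(M_Y(t) = e^{t^2/2}\)):

\[M_U(t) = M_{Y_1}(t)M_{Y_2}(t) = \left(e^{t^2/2}\right)^2 = e^{t^2} = e^{0 \cdot t + \frac{2t^2}{2}}\]

which is the MGF of a \(N(0,2)\) distribution, so \(U \sim N(0,2)\).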

Review of the MGF method

  • The method of moment generating functions is commonly applied to find the distribution of sums, or sample means, of i.i.d. samples
  • Let \(Y_1,...,Y_n\) be an i.i.d. sample, so that each \(Y_i\) has the common MGF \(M_Y(t)\)
  • What is the distribution of \(U = Y_1+Y_2 + ... + Y_n = \sum_{i=1}^n Y_i\)?

\[M_U(t) = E(e^{tU}) = E\left(e^{t(Y_1 + Y_2+...+Y_n)}\right)=E\left(e^{tY_1}e^{tY_2}...e^{tY_n}\right)\]

\[=E\left(e^{tY_1}\right)E\left(e^{tY_2}\right)...E\left(e^{tY_n}\right) \mbox{ (by independence)}\]

\[= \prod_{i=1}^n M_{Y_i}(t)=\prod_{i=1}^n M_{Y}(t)=M_Y(t)^n \mbox{ (by having identical distributions)}\]
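
As a quick numerical sanity check of this identity (a minimal sketch, using an arbitrarily chosen exponential parent with MGF \(M_Y(t) = \lambda/(\lambda - t)\) and arbitrary values of \(n\) and \(t\); not part of the course code):

# Monte Carlo check that E(e^{tU}) matches M_Y(t)^n for an i.i.d. EXP(lambda) sample
set.seed(460)
n      <- 3
lambda <- 2
t_val  <- 0.5                                           # any t < lambda works

u_draws <- replicate(1e5, sum(rexp(n, rate = lambda)))  # 100,000 simulated values of U
mean(exp(t_val * u_draws))                              # Monte Carlo estimate of M_U(t)
(lambda / (lambda - t_val))^n                           # analytic M_Y(t)^n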

Example: Sum of exponentials

  • Let \(Y_1\), \(Y_2\),…,\(Y_n\) be an i.i.d. sample from an \(EXP(\lambda)\) distribution
  • Find the distribution of \(U = \sum_{i=1}^nY_i\)

\[M_Y(t) = \frac{\lambda}{\lambda-t}\]

\[M_U(t) = M_Y(t)^n = \left(\frac{\lambda}{\lambda-t}\right)^n\]

  • Note this is the MGF of a \(GAM(n,\lambda)\) distribution: thus \(U\sim GAM( n, \lambda)\)
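
One small point to confirm before simulating (a side check, not from the slides): the \(GAM(n,\lambda)\) notation here is the shape/rate parameterization, which is what R's dgamma(shape = , rate = ) uses; in particular, \(EXP(\lambda)\) is the shape-1 special case.

# EXP(lambda) equals GAM(1, lambda) under the rate parameterization assumed here
lambda <- 0.5
all.equal(dexp(1.7, rate = lambda),
          dgamma(1.7, shape = 1, rate = lambda))        # TRUE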

Simulating sampling distributions

  • To simulate sampling distributions, we will often want to incorporate \(n\) as one of our parameters in a parameter grid using parameters()
  • Use pmap(list(n, parent_pars), .f = \(n, parent_pars) r_*(n, parent_pars)) to generate a list column where each row is one sample of size \(n\) drawn i.i.d. from the parent population with parameters parent_pars
    • Why pmap()? Input: list (of parameters); Output: list (of the sample results)
  • Then, use map_dbl(sample_result, .f = summary_method) to take each sample (a list element) as input and return a single number (a double) as output

Simulating \(n\) i.i.d. \(EXP(\lambda)\) across grid

library(purrrfect)
library(tidyverse)

(sum_of_exps <- parameters(~n, ~lambda,
                             c(2, 5, 10), c(0.5, 1.0, 1.5)
                            )
                  %>%  mutate(y_sample = pmap(list(n, lambda), .f = \(n, l) rexp(n, rate = l))) 
)  
# A tibble: 9 × 3
      n lambda y_sample  
  <dbl>  <dbl> <list>    
1     2    0.5 <dbl [2]> 
2     2    1   <dbl [2]> 
3     2    1.5 <dbl [2]> 
4     5    0.5 <dbl [5]> 
5     5    1   <dbl [5]> 
6     5    1.5 <dbl [5]> 
7    10    0.5 <dbl [10]>
8    10    1   <dbl [10]>
9    10    1.5 <dbl [10]>

Forming \(U\) and evaluating analytic density

(sum_of_exps <- parameters(~n, ~lambda,
                             c(2, 5, 10), c(0.5, 1.0, 1.5)
                            )
                  %>%  mutate(y_sample = pmap(list(n, lambda), .f = \(n, l) rexp(n, rate = l)))
                  %>%  mutate(u = map_dbl(y_sample, .f = sum))
                  %>%  mutate(f_U = dgamma(u, shape = n, rate = lambda))
)  
# A tibble: 9 × 5
      n lambda y_sample       u     f_U
  <dbl>  <dbl> <list>     <dbl>   <dbl>
1     2    0.5 <dbl [2]>   4.13 0.131  
2     2    1   <dbl [2]>   2.83 0.167  
3     2    1.5 <dbl [2]>   1.22 0.439  
4     5    0.5 <dbl [5]>  10.7  0.0812 
5     5    1   <dbl [5]>   2.90 0.162  
6     5    1.5 <dbl [5]>   2.14 0.268  
7    10    0.5 <dbl [10]> 33.1  0.00826
8    10    1   <dbl [10]> 10.2  0.122  
9    10    1.5 <dbl [10]>  3.96 0.0998 

Obtaining 10,000 replications per parameter combo

(sum_of_exps <- parameters(~n, ~lambda,
                             c(2, 5, 10), c(0.5, 1.0, 1.5)
                            )
                  %>%  add_trials(10000)
                  %>%  mutate(y_sample = pmap(list(n, lambda), .f = \(n, l) rexp(n, rate = l)))
                  %>%  mutate(u = map_dbl(y_sample, .f = sum))
                  %>%  mutate(f_U = dgamma(u, shape = n, rate = lambda))
)  %>% head()
# A tibble: 6 × 6
      n lambda .trial y_sample      u    f_U
  <dbl>  <dbl>  <dbl> <list>    <dbl>  <dbl>
1     2    0.5      1 <dbl [2]> 0.846 0.139 
2     2    0.5      2 <dbl [2]> 3.01  0.167 
3     2    0.5      3 <dbl [2]> 1.10  0.159 
4     2    0.5      4 <dbl [2]> 6.17  0.0705
5     2    0.5      5 <dbl [2]> 2.44  0.180 
6     2    0.5      6 <dbl [2]> 6.26  0.0684

Plotting

ggplot(data = sum_of_exps)+
  geom_histogram(aes(x = u, y = after_stat(density)), 
                 fill = 'goldenrod',
                 center = 0.1, binwidth = .2) +
  geom_line(aes(x = u, y = f_U), col='cornflowerblue') +
  facet_grid(n~lambda, labeller = label_both)+
  theme_classic(base_size = 16) +
  labs(title='Simulated and analytic densities for sum of exponentials')
[Figure: simulated and analytic densities for the sum of exponentials]

Example: Mean of exponentials

  • Let \(Y_1\), \(Y_2\),…,\(Y_n\) be an i.i.d. sample from an \(EXP(\lambda)\) distribution
  • Find the distribution of \(\bar Y = \frac{\sum_{i=1}^nY_i}{n} = \frac{U}{n}\)

\[ M_U(t) = \left(\frac{\lambda}{\lambda-t}\right)^n\]

\[M_{\bar Y}(t) = E(e^{t \bar Y}) = E(e^{t \cdot U/n})=E(e^{t/n\cdot U}) \]

\[= M_U(t/n) = \left(\frac{\lambda}{\lambda-t/n}\right)^n= \left(\frac{n\lambda}{n\lambda-t}\right)^n\]

  • Note this is the MGF of a \(GAM(n,n\lambda)\) distribution: thus \(\bar Y\sim GAM( n, n\lambda)\)
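
As a quick aside, the standard gamma mean (shape/rate) and variance (shape/rate\(^2\)) give:

\[E(\bar Y) = \frac{n}{n\lambda} = \frac{1}{\lambda}, \qquad Var(\bar Y) = \frac{n}{(n\lambda)^2} = \frac{1}{n\lambda^2}\]

so the sampling distribution of \(\bar Y\) stays centered at the population mean \(1/\lambda\), while its variance shrinks at rate \(1/n\) as the sample size grows.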

Verifying via simulation

(mean_of_exps <- parameters(~n, ~lambda,
                             c(2, 5, 10), c(0.5, 1.0, 1.5)
                            )
                  %>%  add_trials(10000)
                  %>%  mutate(y_sample = pmap(list(n, lambda), .f = \(n, l) rexp(n, rate = l)))
                  %>%  mutate(ybar = map_dbl(y_sample, .f = mean))
                  %>%  mutate(f_ybar = dgamma(ybar, shape = n, rate = n*lambda))
)  %>% head()
# A tibble: 6 × 6
      n lambda .trial y_sample   ybar f_ybar
  <dbl>  <dbl>  <dbl> <list>    <dbl>  <dbl>
1     2    0.5      1 <dbl [2]> 0.175 0.147 
2     2    0.5      2 <dbl [2]> 0.766 0.356 
3     2    0.5      3 <dbl [2]> 3.04  0.145 
4     2    0.5      4 <dbl [2]> 5.14  0.0300
5     2    0.5      5 <dbl [2]> 3.81  0.0843
6     2    0.5      6 <dbl [2]> 3.95  0.0763

Plotting

ggplot(data = mean_of_exps)+
  geom_histogram(aes(x = ybar, y = after_stat(density)), 
                 fill = 'goldenrod',
                 center = 0.1, binwidth = .2) +
  geom_line(aes(x = ybar, y = f_ybar), col='cornflowerblue') +
  facet_grid(n~lambda, labeller = label_both)+
  theme_classic(base_size = 16) +
  labs(title='Simulated and analytic densities for sample mean of exponentials')
[Figure: simulated and analytic densities for the sample mean of exponentials]

General result for MGF of \(\bar Y\)

  • The previous example illustrates a result that holds more generally
  • If \(Y_1\), \(Y_2\), …,\(Y_n\) represent an i.i.d. sample with common MGF \(M_Y(t)\), then with \(U = \sum_{i=1}^nY_i\) and \(\bar Y=\frac{U}{n}\):

\[M_{\bar Y}(t) = E(e^{t \bar Y}) =E(e^{t \cdot U/n})= M_U(t/n) = M_Y(t/n)^n\]
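
As a small illustration of this identity (a minimal sketch using a hypothetical helper, mgf_of_mean(), and the exponential MGF from the earlier example; not part of the course code):

# hypothetical helper: MGF of the sample mean, given the common MGF M_Y and sample size n
mgf_of_mean <- function(M_Y, t, n) M_Y(t / n)^n

lambda <- 2; n <- 4; t_val <- 1                         # arbitrary values with t < n*lambda
M_Y <- function(t) lambda / (lambda - t)                # MGF of EXP(lambda)
mgf_of_mean(M_Y, t_val, n)                              # numerically equal to...
(n * lambda / (n * lambda - t_val))^n                   # ...the GAM(n, n*lambda) MGF from that example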

Example: mean of Normal samples

  • Let \(Y_1,Y_2,...,Y_n\) be an i.i.d. sample drawn from a \(N(\mu,\sigma^2)\) population.
  • Find the distribution of \(\bar Y =\frac{1}{n}\sum_{i=1}^n Y_i\).

\[M_Y(t) = e^{\mu t + \frac{\sigma^2 t^2}{2}}\]

\[M_{\bar Y}(t) = M_Y(t/n)^n = \left(e^{\mu (t/n) + \frac{\sigma^2 (t/n)^2}{2}}\right)^n\]

\[= e^{\frac{\mu t}{n}\cdot n + \frac{\sigma^2 t^2}{2n^2}\cdot n}= e^{\mu t + \frac{\sigma^2 t^2}{2n}}\]

  • Note this is the MGF of a \(N\left(\mu, \frac{\sigma^2}{n}\right)\) distribution: thus \(\bar Y\sim N\left( \mu, \frac{\sigma^2}{n}\right)\)
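
In particular, this gives the familiar mean and standard error of the sample mean, which is exactly what the simulation code below plugs into dnorm() and pnorm() via sigma/sqrt(n):

\[E(\bar Y) = \mu, \qquad Var(\bar Y) = \frac{\sigma^2}{n}, \qquad SD(\bar Y) = \frac{\sigma}{\sqrt{n}}\]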

Verifying via simulation study

We now have three parameters to consider:

N <- 10000
(normal_mean_sims <- parameters(~n, ~mu, ~sigma,
           c(5, 10, 20), c(-2, 0, 2), c(2, 4)
           )
          %>% add_trials(N)
          %>% mutate(ysample = pmap(list(n, mu, sigma), \(nn, m, s) rnorm(nn, m, s)))   # one N(mu, sigma^2) sample of size n per row
          %>% mutate(ybar = map_dbl(ysample, mean))                                     # sample mean of each sample
          %>% mutate(f = dnorm(ybar, mu, sigma/sqrt(n)), F = pnorm(ybar, mu, sigma/sqrt(n)))  # analytic density and CDF of ybar
          %>% mutate(Fhat = cume_dist(ybar), .by = c(n, mu, sigma))                     # empirical CDF within each parameter combo
) %>% head()
# A tibble: 6 × 9
      n    mu sigma .trial ysample     ybar       f     F  Fhat
  <dbl> <dbl> <dbl>  <dbl> <list>     <dbl>   <dbl> <dbl> <dbl>
1     5    -2     2      1 <dbl [5]> -1.92  0.444   0.535 0.531
2     5    -2     2      2 <dbl [5]>  0.696 0.00475 0.999 0.999
3     5    -2     2      3 <dbl [5]> -0.998 0.238   0.869 0.865
4     5    -2     2      4 <dbl [5]> -0.982 0.233   0.873 0.868
5     5    -2     2      5 <dbl [5]> -2.65  0.342   0.233 0.231
6     5    -2     2      6 <dbl [5]> -2.03  0.446   0.485 0.481

Plotting densities

To visualize results across a 3-parameter grid, we can use facet_nested() from the ggh4x package:

library(ggh4x)
ggplot(data = normal_mean_sims, aes(x = ybar)) + 
  geom_histogram(aes(y = after_stat(density)), bins = 50, fill = 'goldenrod') + 
  geom_line(aes(y = f), col ='cornflowerblue') + 
  facet_nested(mu~sigma+n, labeller = label_both, scales = 'free_y') + 
  labs(x = expression(bar(Y)), 
       title='Simulated and analytic densities of sample mean from normal population')+
  theme_classic()
[Figure: simulated and analytic densities of sample mean from normal population]

Plotting overlaid CDF plots

library(ggh4x)
ggplot(data = normal_mean_sims, aes(x = ybar)) + 
  geom_step(aes(y = F, col = 'Analytic CDF')) + 
  geom_step(aes(y = Fhat, col = 'Empirical CDF')) + 
  facet_nested(mu~sigma+n, labeller = label_both, scales = 'free_y') + 
  labs(x = expression(bar(Y)), 
       title='Simulated and analytic CDFs of sample mean from normal population',
       color='')+
  theme_classic()
[Figure: simulated and analytic CDFs of sample mean from normal population]