7.1 - Intro to sampling distributions

What is a sampling distribution?

  • In STAT 460 we turn in full force to the concept of sampling distributions
  • The term “sampling distribution” is a rather nondescript name for the key idea of:

The distribution of a statistic, computed from a sample of size \(n\), across repeated samples from a population

The sampling distribution visual

What do we need them for?

  • Sampling distributions are essential for basically all of statistics:
    • Forming confidence intervals
    • Obtaining test statistics and p-values
    • Performing hypothesis tests
    • Discussing the quality of parameter estimators
  • All of which we will discuss in more detail this semester!

An i.i.d. sample

  • We will now often talk about a sample \(Y_1,...,Y_n\) drawn independently and identically distributed, or i.i.d. for short, from a population
  • We have actually already considered several examples of sampling distributions when \(n=2\):
    • Let \(Y_1\) and \(Y_2\) be independent, and identically distributed (i.i.d.) \(N(0,1)\) random variables
    • What is the distribution of \(U = Y_1 + Y_2\)? (a quick MGF sketch of the answer follows this list)
  • Note that previously we simplified our notation with \(Y_1 \equiv X\) and \(Y_2 \equiv Y\), but starting now we will often want to consider the impact of \(n\) on our sampling distribution!
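
A quick sketch of the answer, using the MGF method reviewed on the next slide (and the fact that a \(N(0,1)\) random variable has MGF \(M_Y(t) = e^{t^2/2}\)):

\[M_U(t) = M_{Y_1}(t)M_{Y_2}(t) = \left(e^{t^2/2}\right)^2 = e^{t^2} = e^{0 \cdot t + \frac{2t^2}{2}}\]

which is the MGF of a \(N(0,2)\) distribution, so \(U \sim N(0,2)\).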

Review of the MGF method

  • The method of moment generating functions is commonly applied to find the distribution of sums, or sample means, of i.i.d. samples
  • Let \(Y_1,...,Y_n\) be an i.i.d. sample, so that each \(Y_i\) has the common MGF \(M_Y(t)\)
  • What is the distribution of \(U = Y_1+Y_2 + ... + Y_n = \sum_{i=1}^n Y_i\)?

\[M_U(t) = E(e^{tU}) = E\left(e^{t(Y_1 + Y_2+...+Y_n)}\right)=E\left(e^{tY_1}e^{tY_2}...e^{tY_n}\right)\]

\[=E\left(e^{tY_1}\right)E\left(e^{tY_2}\right)...E\left(e^{tY_n}\right) \mbox{ (by independence)}\]

\[= \prod_{i=1}^n M_{Y_i}(t)=\prod_{i=1}^n M_{Y}(t)=M_Y(t)^n \mbox{ (by having identical distributions)}\]
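
As a quick numerical sanity check of this identity (a minimal sketch, using an arbitrarily chosen exponential parent with MGF \(M_Y(t) = \lambda/(\lambda - t)\) and arbitrary values of \(n\) and \(t\); not part of the course code):

# Monte Carlo check that E(e^{tU}) matches M_Y(t)^n for an i.i.d. EXP(lambda) sample
set.seed(460)
n      <- 3
lambda <- 2
t_val  <- 0.5                                           # any t < lambda works

u_draws <- replicate(1e5, sum(rexp(n, rate = lambda)))  # 100,000 simulated values of U
mean(exp(t_val * u_draws))                              # Monte Carlo estimate of M_U(t)
(lambda / (lambda - t_val))^n                           # analytic M_Y(t)^n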

Example: Sum of exponentials

  • Let \(Y_1\), \(Y_2\),…,\(Y_n\) be an i.i.d. sample from an \(EXP(\lambda)\) distribution
  • Find the distribution of \(U = \sum_{i=1}^nY_i\)

\[M_Y(t) = \frac{\lambda}{\lambda-t}\]

\[M_U(t) = M_Y(t)^n = \left(\frac{\lambda}{\lambda-t}\right)^n\]

  • Note this is the MGF of a \(GAM(n,\lambda)\) distribution: thus \(U\sim GAM( n, \lambda)\)
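
One small point to confirm before simulating (a side check, not from the slides): the \(GAM(n,\lambda)\) notation here is the shape/rate parameterization, which is what R's dgamma(shape = , rate = ) uses; in particular, \(EXP(\lambda)\) is the shape-1 special case.

# EXP(lambda) equals GAM(1, lambda) under the rate parameterization assumed here
lambda <- 0.5
all.equal(dexp(1.7, rate = lambda),
          dgamma(1.7, shape = 1, rate = lambda))        # TRUE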

Simulating sampling distributions

  • To simulate sampling distributions, we will often want to incorporate \(n\) as one of our parameters in a parameter grid using parameters()
  • Use pmap(list(n, parent_pars), .f = \(n, parent_pars) r_*(n, parent_pars)) to generate a list column where each row is one sample of size \(n\) drawn i.i.d. from the parent population with parameters parent_pars
    • Why pmap()? Input: list (of parameters); Output: list (of the sample results)
  • Then, use map_dbl(sample_result, .f = summary_method) to take each sample (a list element) as input and return a single number (a double) as output

Simulating \(n\) i.i.d. \(EXP(\lambda)\) across grid

library(purrrfect)
library(tidyverse)

(sum_of_exps <- parameters(~n, ~lambda,
                             c(2, 5, 10), c(0.5, 1.0, 1.5)
                            )
                  %>%  mutate(y_sample = pmap(list(n, lambda), .f = \(n, l) rexp(n, rate = l))) 
)  
# A tibble: 9 × 3
      n lambda y_sample  
  <dbl>  <dbl> <list>    
1     2    0.5 <dbl [2]> 
2     2    1   <dbl [2]> 
3     2    1.5 <dbl [2]> 
4     5    0.5 <dbl [5]> 
5     5    1   <dbl [5]> 
6     5    1.5 <dbl [5]> 
7    10    0.5 <dbl [10]>
8    10    1   <dbl [10]>
9    10    1.5 <dbl [10]>

Forming \(U\) and evaluating analytic density

(sum_of_exps <- parameters(~n, ~lambda,
                             c(2, 5, 10), c(0.5, 1.0, 1.5)
                            )
                  %>%  mutate(y_sample = pmap(list(n, lambda), .f = \(n, l) rexp(n, rate = l)))
                  %>%  mutate(u = map_dbl(y_sample, .f = sum))
                  %>%  mutate(f_U = dgamma(u, shape = n, rate = lambda))
)  
# A tibble: 9 × 5
      n lambda y_sample       u     f_U
  <dbl>  <dbl> <list>     <dbl>   <dbl>
1     2    0.5 <dbl [2]>   4.13 0.131  
2     2    1   <dbl [2]>   2.83 0.167  
3     2    1.5 <dbl [2]>   1.22 0.439  
4     5    0.5 <dbl [5]>  10.7  0.0812 
5     5    1   <dbl [5]>   2.90 0.162  
6     5    1.5 <dbl [5]>   2.14 0.268  
7    10    0.5 <dbl [10]> 33.1  0.00826
8    10    1   <dbl [10]> 10.2  0.122  
9    10    1.5 <dbl [10]>  3.96 0.0998 

Obtaining 10,000 replications per parameter combo

(sum_of_exps <- parameters(~n, ~lambda,
                             c(2, 5, 10), c(0.5, 1.0, 1.5)
                            )
                  %>%  add_trials(10000)
                  %>%  mutate(y_sample = pmap(list(n, lambda), .f = \(n, l) rexp(n, rate = l)))
                  %>%  mutate(u = map_dbl(y_sample, .f = sum))
                  %>%  mutate(f_U = dgamma(u, shape = n, rate = lambda))
)  %>% head()
# A tibble: 6 × 6
      n lambda .trial y_sample      u    f_U
  <dbl>  <dbl>  <dbl> <list>    <dbl>  <dbl>
1     2    0.5      1 <dbl [2]> 0.846 0.139 
2     2    0.5      2 <dbl [2]> 3.01  0.167 
3     2    0.5      3 <dbl [2]> 1.10  0.159 
4     2    0.5      4 <dbl [2]> 6.17  0.0705
5     2    0.5      5 <dbl [2]> 2.44  0.180 
6     2    0.5      6 <dbl [2]> 6.26  0.0684

Plotting

ggplot(data = sum_of_exps)+
  geom_histogram(aes(x = u, y = after_stat(density)), 
                 fill = 'goldenrod',
                 center = 0.1, binwidth = .2) +
  geom_line(aes(x = u, y = f_U), col='cornflowerblue') +
  facet_grid(n~lambda, labeller = label_both)+
  theme_classic(base_size = 16) +
  labs(title='Simulated and analytic densities for sum of exponentials')
[Figure: simulated and analytic densities for the sum of exponentials]

Example: Mean of exponentials

  • Let \(Y_1\), \(Y_2\),…,\(Y_n\) be an i.i.d. sample from an \(EXP(\lambda)\) distribution
  • Find the distribution of \(\bar Y = \frac{\sum_{i=1}^nY_i}{n} = \frac{U}{n}\)

\[ M_U(t) = \left(\frac{\lambda}{\lambda-t}\right)^n\]

\[M_{\bar Y}(t) = E(e^{t \bar Y}) = E(e^{t \cdot U/n})=E(e^{t/n\cdot U}) \]

\[= M_U(t/n) = \left(\frac{\lambda}{\lambda-t/n}\right)^n= \left(\frac{n\lambda}{n\lambda-t}\right)^n\]

  • Note this is the MGF of a \(GAM(n,n\lambda)\) distribution: thus \(\bar Y\sim GAM( n, n\lambda)\)
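
As a quick aside, the standard gamma mean (shape/rate) and variance (shape/rate\(^2\)) give:

\[E(\bar Y) = \frac{n}{n\lambda} = \frac{1}{\lambda}, \qquad Var(\bar Y) = \frac{n}{(n\lambda)^2} = \frac{1}{n\lambda^2}\]

so the sampling distribution of \(\bar Y\) stays centered at the population mean \(1/\lambda\), while its variance shrinks at rate \(1/n\) as the sample size grows.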

Verifying via simulation

(mean_of_exps <- parameters(~n, ~lambda,
                             c(2, 5, 10), c(0.5, 1.0, 1.5)
                            )
                  %>%  add_trials(10000)
                  %>%  mutate(y_sample = pmap(list(n, lambda), .f = \(n, l) rexp(n, rate = l)))
                  %>%  mutate(ybar = map_dbl(y_sample, .f = mean))
                  %>%  mutate(f_ybar = dgamma(ybar, shape = n, rate = n*lambda))
)  %>% head()
# A tibble: 6 × 6
      n lambda .trial y_sample   ybar f_ybar
  <dbl>  <dbl>  <dbl> <list>    <dbl>  <dbl>
1     2    0.5      1 <dbl [2]> 0.175 0.147 
2     2    0.5      2 <dbl [2]> 0.766 0.356 
3     2    0.5      3 <dbl [2]> 3.04  0.145 
4     2    0.5      4 <dbl [2]> 5.14  0.0300
5     2    0.5      5 <dbl [2]> 3.81  0.0843
6     2    0.5      6 <dbl [2]> 3.95  0.0763

Plotting

ggplot(data = mean_of_exps)+
  geom_histogram(aes(x = ybar, y = after_stat(density)), 
                 fill = 'goldenrod',
                 center = 0.1, binwidth = .2) +
  geom_line(aes(x = ybar, y = f_ybar), col='cornflowerblue') +
  facet_grid(n~lambda, labeller = label_both)+
  theme_classic(base_size = 16) +
  labs(title='Simulated and analytic densities for sample mean of exponentials')
[Figure: simulated and analytic densities for the sample mean of exponentials]

General result for MGF of \(\bar Y\)

  • The previous example illustrates a result that holds more generally
  • If \(Y_1\), \(Y_2\), …,\(Y_n\) represent an i.i.d. sample with common MGF \(M_Y(t)\), then with \(U = \sum_{i=1}^nY_i\) and \(\bar Y=\frac{U}{n}\):

\[M_{\bar Y}(t) = E(e^{t \bar Y}) =E(e^{t \cdot U/n})= M_U(t/n) = M_Y(t/n)^n\]
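
As a small illustration of this identity (a minimal sketch using a hypothetical helper, mgf_of_mean(), and the exponential MGF from the earlier example; not part of the course code):

# hypothetical helper: MGF of the sample mean, given the common MGF M_Y and sample size n
mgf_of_mean <- function(M_Y, t, n) M_Y(t / n)^n

lambda <- 2; n <- 4; t_val <- 1                         # arbitrary values with t < n*lambda
M_Y <- function(t) lambda / (lambda - t)                # MGF of EXP(lambda)
mgf_of_mean(M_Y, t_val, n)                              # numerically equal to...
(n * lambda / (n * lambda - t_val))^n                   # ...the GAM(n, n*lambda) MGF from that example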

Example: mean of Normal samples

  • Let \(Y_1,Y_2,...,Y_n\) be an i.i.d. sample drawn from a \(N(\mu,\sigma^2)\) population.
  • Find the distribution of \(\bar Y =\frac{1}{n}\sum_{i=1}^n Y_i\).

\[M_Y(t) = e^{\mu t + \frac{\sigma^2 t^2}{2}}\]

\[M_{\bar Y}(t) = M_Y(t/n)^n = \left(e^{\mu (t/n) + \frac{\sigma^2 (t/n)^2}{2}}\right)^n\]

\[= e^{\frac{\mu t}{n}\cdot n + \frac{\sigma^2 t^2}{2n^2}\cdot n}= e^{\mu t + \frac{\sigma^2 t^2}{2n}}\]

  • Note this is the MGF of a \(N\left(\mu, \frac{\sigma^2}{n}\right)\) distribution: thus \(\bar Y\sim N\left( \mu, \frac{\sigma^2}{n}\right)\)
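
In particular, this gives the familiar mean and standard error of the sample mean, which is exactly what the simulation code below plugs into dnorm() and pnorm() via sigma/sqrt(n):

\[E(\bar Y) = \mu, \qquad Var(\bar Y) = \frac{\sigma^2}{n}, \qquad SD(\bar Y) = \frac{\sigma}{\sqrt{n}}\]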

Verifying via simulation study

We now have three parameters to consider:

N <- 10000
(normal_mean_sims <- parameters(~n, ~mu, ~sigma,
           c(5, 10, 20), c(-2, 0, 2), c(2, 4)
           )
          %>% add_trials(N)
          %>% mutate(ysample = pmap(list(n, mu, sigma), \(nn, m, s) rnorm(nn, m, s)))   # one N(mu, sigma^2) sample of size n per row
          %>% mutate(ybar = map_dbl(ysample, mean))                                     # sample mean of each sample
          %>% mutate(f = dnorm(ybar, mu, sigma/sqrt(n)), F = pnorm(ybar, mu, sigma/sqrt(n)))  # analytic density and CDF of ybar
          %>% mutate(Fhat = cume_dist(ybar), .by = c(n, mu, sigma))                     # empirical CDF within each parameter combo
) %>% head()
# A tibble: 6 × 9
      n    mu sigma .trial ysample     ybar       f     F  Fhat
  <dbl> <dbl> <dbl>  <dbl> <list>     <dbl>   <dbl> <dbl> <dbl>
1     5    -2     2      1 <dbl [5]> -1.92  0.444   0.535 0.531
2     5    -2     2      2 <dbl [5]>  0.696 0.00475 0.999 0.999
3     5    -2     2      3 <dbl [5]> -0.998 0.238   0.869 0.865
4     5    -2     2      4 <dbl [5]> -0.982 0.233   0.873 0.868
5     5    -2     2      5 <dbl [5]> -2.65  0.342   0.233 0.231
6     5    -2     2      6 <dbl [5]> -2.03  0.446   0.485 0.481

Plotting densities

To visualize results across a 3-parameter grid, we can use facet_nested() from the ggh4x package:

library(ggh4x)
ggplot(data = normal_mean_sims, aes(x = ybar)) + 
  geom_histogram(aes(y = after_stat(density)), bins = 50, fill = 'goldenrod') + 
  geom_line(aes(y = f), col ='cornflowerblue') + 
  facet_nested(mu~sigma+n, labeller = label_both, scales = 'free_y') + 
  labs(x = expression(bar(Y)), 
       title='Simulated and analytic densities of sample mean from normal population')+
  theme_classic()
[Figure: simulated and analytic densities of sample mean from normal population]

Plotting overlaid CDF plots

library(ggh4x)
ggplot(data = normal_mean_sims, aes(x = ybar)) + 
  geom_step(aes(y = F, col = 'Analytic CDF')) + 
  geom_step(aes(y = Fhat, col = 'Empirical CDF')) + 
  facet_nested(mu~sigma+n, labeller = label_both, scales = 'free_y') + 
  labs(x = expression(bar(Y)), 
       title='Simulated and analytic CDFs of sample mean from normal population',
       color='')+
  theme_classic()
[Figure: simulated and analytic CDFs of sample mean from normal population]