7.7 - Asymptotic sampling distributions and the central limit theorem

Exact sampling distributions

  • So far, all the sampling distributions we’ve covered have been exact sampling distributions
  • Definition: A distribution of a statistic that holds exactly for any sample size \(n\)
  • Requires making assumptions about the parent population from which the sample comes
  • Examples:
    • \(Y_1,Y_2,...,Y_n \stackrel{i.i.d.}{\sim}EXP(\lambda) \Rightarrow Y_{(1)} \sim EXP(n\lambda)\)
    • \(Y_1,Y_2,...,Y_n \stackrel{i.i.d.}{\sim}GAM(\alpha, \lambda) \Rightarrow \bar Y \sim GAM(n\alpha, n\lambda)\)
    • \(Y_1,Y_2,...,Y_n \stackrel{i.i.d.}{\sim}N(\mu, \sigma^2) \Rightarrow \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}\)
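As a quick sanity check of the first result (a standalone base-R sketch), we can simulate minima of i.i.d. exponential samples and compare them to the claimed \(EXP(n\lambda)\) distribution:

set.seed(1)
n <- 5; lambda <- 2
mins <- replicate(10000, min(rexp(n, rate = lambda)))  # 10,000 simulated values of Y_(1)
mean(mins)                                             # should be near 1/(n*lambda) = 0.1
ks.test(mins, pexp, rate = n*lambda)                   # a non-small p-value is consistent with EXP(n*lambda)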

Asymptotic sampling distribution

  • Asymptotic sampling distributions are distributions of statistics that hold for large \(n\) only
  • Disadvantage: requires large \(n\) to be valid
  • Advantages (major):
    • can relax assumptions about the population from which the sample is drawn
    • an exact sampling distribution may not exist or may be difficult to find; asymptotic distributions provide answers that are approximately correct when \(n\) is large

Formal definition of asymptotic distribution

  • Let \(U_n\) represent a statistic from a sample of size \(n\)
  • Let \(F_{U_n}(u) = P(U_n \le u)\) represent the CDF of \(U_n\).
  • Let \(U\) represent a variable with CDF \(F_U\).
  • \(U_n\) converges in distribution to \(U\), written \(U_n \rightarrow_d U\), if \(\lim_{n\rightarrow \infty} F_{U_n}(u) = F_U(u)\) for all \(u\) at which \(F_U\) is continuous.
  • \(F_U\) is the asymptotic distribution of the statistic \(U_n\)
  • If \(F_U\) is the CDF of a well-known distribution, we may write that distribution's name in place of \(U\), e.g., \(U_n \rightarrow_d \chi^2_p\)

Example: geometric to exponential

  • Let \(Y \sim GEO(p)\), and write \(q = 1-p\). The pmf is:

\[P(Y=y) = pq^{y-1}, y=1,2,3...\]

  • Can show that \(F_Y(y) = P(Y\le y) = 1-q^y\) for \(y \in \{1,2,3,...\}\), since \(P(Y > y) = q^y\) (the first \(y\) trials must all be failures).
  • Suppose \(p = \frac{1}{n}\), and let \(U_n = \frac{Y}{n}\). Then (treating \(nu\) as an integer; using \(\lfloor nu \rfloor\) instead does not change the limit):

\[F_{U_n}(u) = P(U_n \le u) = P\left(\frac{Y}{n}\le u \right)\] \[= P(Y\le nu) = 1-\left(1-\frac{1}{n}\right)^{nu}= 1-\left[\left(1+\frac{-1}{n}\right)^n\right]^{u}\]

\[\rightarrow 1-(e^{-1})^{u} = 1-e^{-u} = F_U(u)\]

where \(U\sim EXP(1)\).

  • Thus \(U_n \rightarrow_d U\) or \(U_n \rightarrow_d EXP(1)\).
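A quick simulation (a standalone base-R sketch) makes the convergence visible: draw \(Y \sim GEO(p)\) with \(p = 1/n\) and compare the empirical CDF of \(U_n = Y/n\) to the \(EXP(1)\) CDF:

set.seed(1)
n <- 50
Y <- rgeom(10000, prob = 1/n) + 1   # rgeom() counts failures before the first success, so add 1 to count trials
U <- Y / n
plot(ecdf(U), main = "ECDF of U_n vs EXP(1) CDF")
curve(pexp(x, rate = 1), add = TRUE, col = "red")  # the two curves should nearly coincide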

The central limit theorem

  • A powerful theorem that gives asymptotic normality of \(\bar Y\), regardless of the parent population:

CLT formally stated

  • Let \(Y_1,Y_2,...,Y_n\) be i.i.d. random variables from a population with \(E(Y_i) = \mu < \infty\) and \(Var(Y_i) = \sigma^2 < \infty\).
    • Note that no assumptions are made about normality of the individual \(Y_i\)!
  • Let:

\[U_n = \frac{\bar Y - \mu}{\sigma/\sqrt{n}} = \sqrt{n}\left(\frac{\bar Y - \mu}{\sigma}\right).\]

Then \(U_n \rightarrow_d N(0,1)\).

CLT in practice

  • Formally:

\[\frac{\bar Y - \mu}{\sigma/\sqrt{n}} \rightarrow_d N(0,1)\]

  • In practice, we say:

\[\bar Y \sim AN\left(\mu, \frac{\sigma^2}{n}\right)\]

  • \(\mu\), \(\sigma^2\): the true mean and variance of the (possibly non-normal) parent population

Points of clarification

  • The following facts are always true, for any size \(n\):

    1. \(E(\bar Y) = \mu\) (whatever it is), by linearity of expectation
    2. \(Var(\bar Y) = \sigma^2/n\), using properties of variance of linear combinations
  • What the CLT adds is the approximate normality of \(\bar Y\) for large \(n\); a quick simulation check is sketched below.
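A standalone base-R sketch, assuming an \(EXP(1)\) parent (so \(\mu = 1\), \(\sigma^2 = 1\)): even at a small \(n\) where normality clearly fails, the mean and variance of \(\bar Y\) match \(\mu\) and \(\sigma^2/n\):

set.seed(1)
n <- 3                                                # small n: normality has not 'kicked in' yet
ybars <- replicate(100000, mean(rexp(n, rate = 1)))
mean(ybars)   # close to mu = 1, for any n
var(ybars)    # close to sigma^2/n = 1/3, for any n
hist(ybars)   # ...but clearly right-skewed at n = 3, not normal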

“Proof” by simulation

Let’s illustrate the CLT via simulation study using the following scenario:

  • Suppose \(Y_1,Y_2, ..., Y_n\stackrel{i.i.d.}{\sim} EXP(\beta)\). Thus:

    • \(\mu = E(Y_i) = \beta\)
    • \(\sigma^2 = Var(Y_i) = \beta^2\)
    • \(E(\bar Y) = \mu = \beta\) for all \(n\)
    • \(Var(\bar Y) = \sigma^2/n = \beta^2/n\) for all \(n\)
    • \(\bar Y\) is approximately normal for large \(n\) only

Code for simulation study

Note that in our simulation, the analytic pdf and CDF against which we compare the simulated \(\bar Y\)'s are those of the \(N(\beta, \beta^2/n)\) distribution given by the CLT:

library(purrrfect)
library(tidyverse)

N <- 10000
clt_sims <- (parameters(~beta, ~n, c(1, 1.5, 2), c(2,4,8,16,32))  # all (beta, n) combinations
 %>% add_trials(N)                                                # N simulated trials per combination
 %>% mutate(Y_sample = pmap(list(beta, n), .f = \(b,n) rexp(n, rate = 1/b)))  # one EXP(beta) sample per trial
 %>% mutate(Ybar = map_dbl(Y_sample, mean))                       # sample mean of each sample
 %>% mutate(fU = dnorm(Ybar, mean = beta, sd = beta/sqrt(n)),     # analytic normal pdf from the CLT
            FU = pnorm(Ybar, mean = beta, sd = beta/sqrt(n)),     # analytic normal CDF from the CLT
            Fhat = cume_dist(Ybar),                               # empirical CDF of the simulated Ybar's
            .by = c(beta,n))
) 

Plots of densities

ggplot(data = clt_sims, aes(x = Ybar)) + 
  geom_histogram(aes(y = after_stat(density)), 
                 binwidth = .2, fill = 'goldenrod') + 
  geom_line(aes(y = fU), col ='cornflowerblue') + 
  facet_grid(n~beta, labeller = label_both, scales = 'free_y') + 
  labs(x=expression(bar(Y)))+
  theme_classic()
densities of simulation study

Overlaid CDF plots

ggplot(data = clt_sims, aes(x = Ybar) ) + 
  geom_step(aes(y = FU, col = 'Analytic Normal CDF')) +
  geom_step(aes(y = Fhat, col = 'Empirical CDF')) + 
  facet_grid(n~beta, labeller = label_both, scales = 'free') + 
  labs(color='', 
       x = expression(bar(Y)),
       y = 'CDF')+
  theme_classic()
Overlaid CDF plots of simulation study

P-P plots

ggplot(data = clt_sims ) + 
  geom_abline(intercept = 0, slope =1, col = 'red')+
  geom_point(aes(x= Fhat, y = FU), shape = '.') + 
  facet_grid(n~beta, labeller = label_both) + 
  xlab('Empirical CDF') + ylab('Analytic NORMAL CDF') + 
  theme_classic()
p-p plots of simulation study

Proof of the CLT: preliminaries

  • A function \(f(n)\) is \(o(n)\) (“little oh of n”) if it goes to 0 faster than \(1/n\) does as \(n \rightarrow \infty\); more precisely, if \(\lim_{n\rightarrow \infty} n f(n) = 0\). Examples:

    • \(f(n) = \frac{1}{n^2}\) is \(o(n)\)
    • \(f(n) = e^{-n}\) is \(o(n)\)
    • \(f(n) = \frac{1}{\sqrt{n}}\) is not \(o(n)\)
  • For any \(t\), \(\left(1 + \frac{t}{n} + o(n) \right)^n \rightarrow e^t\) (a calculus fact; a numeric check follows this list)
  • If \(Y_1,Y_2,...,Y_n\) are i.i.d. and \(U = \sum_{i=1}^n Y_i\), then \(M_{U}(t) = M_Y(t)^n\) (7.1)
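The second fact can be checked numerically (a quick sketch): take \(t = 2\) and add a \(1/n^2\) term, which is \(o(n)\) in the sense above:

t <- 2
n <- 10^(1:6)
(1 + t/n + 1/n^2)^n   # approaches exp(2) = 7.389... as n grows
exp(t)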

CLT restated

  • Let \(Y_1,Y_2,...,Y_n\) be i.i.d. random variables from a population with \(E(Y_i) = \mu < \infty\) and \(Var(Y_i) = \sigma^2 < \infty\).
  • Let:

\[U_n = \frac{\bar Y - \mu}{\sigma/\sqrt{n}} = \sqrt{n}\left(\frac{\bar Y - \mu}{\sigma}\right),\]

then \(U_n \rightarrow_d N(0,1).\)

  • Proof outline:
    • Find MGF of \(U_n\)
    • Expand using Maclaurin series
    • Show it converges to a \(N(0,1)\) MGF

Finding MGF of \(U_n\)

With \(X_i = (Y_i - \mu)\) (note that \(E(X_i) = 0\) and \(E(X_i^2) = E((Y_i-\mu)^2) = \sigma^2\)): \[U_n = \frac{\bar Y - \mu}{\sigma/\sqrt{n}} = \frac{\frac{1}{n}\sum_{i=1}^n (Y_i - \mu) }{\sigma/\sqrt{n}}= \frac{\sum_{i=1}^n (Y_i - \mu) }{\sqrt{n}\sigma} = \frac{\sum_{i=1}^n X_i}{\sqrt{n} \sigma}\]

\[M_{U_n}(t) = E\left(e^{U_n t}\right)= E\left(e^{\frac{\sum_{i=1}^n X_i}{\sqrt{n} \sigma} t}\right) =E\left(e^{\sum_{i=1}^n X_i\cdot \frac{t}{\sqrt{n} \sigma} }\right)\]

\[=M_{\sum_{i=1}^n X_i}\left(\frac{t}{\sqrt{n} \sigma} \right) =\left(M_{X}\left(\frac{t}{\sqrt{n} \sigma} \right)\right)^n\]

Expanding MGF

\[\left(M_{X}\left(\frac{t}{\sqrt{n} \sigma} \right)\right)^n =\left[E\left(e^{\frac{Xt}{\sqrt{n} \sigma}} \right)\right]^n =\left[E\left(1+ \frac{Xt}{\sqrt{n} \sigma}+\frac{1}{2}\left(\frac{Xt}{\sqrt{n} \sigma}\right)^2+\frac{1}{3!}\left(\frac{Xt}{\sqrt{n} \sigma}\right)^3+\frac{1}{4!}\left(\frac{Xt}{\sqrt{n} \sigma}\right)^4+ ...\right)\right]^n\]

Apply linearity of expectation: \[=\left[1+ \frac{t}{\sqrt{n} \sigma}E(X)+\frac{1}{2}\left(\frac{t}{\sqrt{n} \sigma}\right)^2E(X^2)+\frac{1}{3!}\left(\frac{t}{\sqrt{n} \sigma}\right)^3E(X^3)+\frac{1}{4!}\left(\frac{t}{\sqrt{n} \sigma}\right)^4E(X^4)+ ...\right]^n\]

Recall that \(E(X) = 0, E(X^2)= \sigma^2\):

\[=\left[1+ 0+\frac{t^2/2}{n}+\underbrace{\frac{t^3}{3!n^{3/2}\sigma^3}E(X^3)+\frac{t^4}{4!n^{4/2}\sigma^4}E(X^4)+ ...}_{=o(n)}\right]^n \rightarrow e^{t^2/2},\]

the MGF of a \(N(0,1)\) random variable!
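The conclusion can be checked by Monte Carlo as well (a standalone base-R sketch, assuming an \(EXP(1)\) parent so that \(\mu = \sigma = 1\)): estimate \(M_{U_n}(t) = E(e^{tU_n})\) by averaging \(e^{tU_n}\) over simulated \(U_n\)'s and compare to \(e^{t^2/2}\):

set.seed(1)
n <- 200; t <- 0.5                # keep |t| modest so the Monte Carlo average is stable
Un <- replicate(100000, sqrt(n) * (mean(rexp(n, rate = 1)) - 1))  # (Ybar - mu)/(sigma/sqrt(n)) with mu = sigma = 1
mean(exp(t * Un))                 # Monte Carlo estimate of M_{U_n}(t)
exp(t^2/2)                        # N(0,1) MGF at t; the two should be close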

Application 1: exponential parent population

  • Suppose the times between calls at a help center are approximately exponentially distributed with mean \(\beta = 2\) minutes between calls.
  • The call center opens at 8AM. What is the probability that the average of the first 30 times between calls exceeds 2.5 minutes, i.e., \(P(\bar Y > 2.5)\)?
  • Population quantities:
    • \(\mu = 2\)
    • \(\sigma^2 = 2^2 = 4\)
  • Exact solution, using the fact that \(\bar Y \sim GAM(n, \beta/n)\) (shape-scale parameterization) for all \(n\):
1-pgamma(2.5, shape = 30, scale = 2/30)
[1] 0.0919017
  • Approximate solution, using \(\bar Y \sim AN(2, 4/n)\) for large \(n\):
1-pnorm(2.5, mean = 2, sd = sqrt(4/30))
[1] 0.08545176
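  • As a third answer, a quick Monte Carlo check (a standalone base-R sketch); the proportion of simulated sample means exceeding 2.5 should land near the exact 0.092:

set.seed(1)
ybars <- replicate(100000, mean(rexp(30, rate = 1/2)))  # mean beta = 2 corresponds to rate = 1/2
mean(ybars > 2.5)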

Application 2: Weibull parent population

  • In the previous example we knew the exact distribution of \(\bar Y\), so we didn’t need the CLT. That’s not always the case! Suppose \(Y_1,...,Y_n\) is an i.i.d. sample from a Weibull distribution with pdf given by:

\[ f_Y(y) = \frac{\theta}{\beta}\left(\frac{y}{\beta}\right)^{\theta-1} e^{-(y/\beta)^\theta}, y > 0\]

  • Properties:

    • \(E(Y) =\mu= \beta \Gamma(1+1/\theta)\)
    • \(Var(Y) = \sigma^2 = \beta^2\left[\Gamma\left(1+\frac{2}{\theta}\right)-\left(\Gamma\left(1+\frac{1}{\theta}\right)\right)^2\right]\)
    • \(M_Y(t) = \sum_{i=0}^\infty \frac{t^i\beta^i}{i!}\Gamma(1+i/\theta),\)

which has no closed form, meaning there is no simple way to derive the exact distribution of \(\bar Y\). Normal approximation to the rescue!

Application 2: Finding approximate probability

  • Suppose we have a sample of size \(n = 50\) survival times from a Weibull population with parameters \(\beta = 3\) and \(\theta = 0.5\).
  • What is the approximate probability that the average of these survival times exceeds 8 days? I.e., we want \(P(\bar Y > 8)\).

\[\bar Y \sim AN\left(3 \Gamma(1+1/0.5),3^2\left[\Gamma\left(1+\frac{2}{0.5}\right)-\left(\Gamma\left(1+\frac{1}{0.5}\right)\right)^2\right]/50\right)\]

mu <- 3*gamma(1+1/0.5)
sigma2 <- 3^2*(gamma(1+2/0.5)-gamma(1+1/0.5)^2)

1-pnorm(8, mean = mu, sd = sqrt(sigma2/50))
[1] 0.1459203

Application 2: Simulating the probability

  • Verify that this approximation is decent by simulating 10,000 sample means when \(n = 50\) from a \(Weibull(\theta = 0.5,\beta = 3)\) distribution:
1-pnorm(8, mean = mu, sd = sqrt(sigma2/50))  # CLT approximation, shown again for comparison
[1] 0.1459203
(many_samples <- replicate(10000,  rweibull(50, shape =0.5, scale = 3),
                           .as = y_sample  # purrrfect's replicate(): a tibble with one row per simulated sample
                           )
                  %>% mutate(ybar = map_dbl(y_sample, mean))  # sample mean of each sample
 ) %>% head
# A tibble: 6 × 3
  .trial y_sample    ybar
   <dbl> <list>     <dbl>
1      1 <dbl [50]>  4.05
2      2 <dbl [50]>  7.64
3      3 <dbl [50]>  5.50
4      4 <dbl [50]>  5.33
5      5 <dbl [50]>  4.60
6      6 <dbl [50]>  3.65
many_samples  %>% 
  summarize('P(ybar >8)' = mean(ybar > 8))
# A tibble: 1 × 1
  `P(ybar >8)`
         <dbl>
1        0.132