7.7 - Asymptotic sampling distributions and the central limit theorem

Exact sampling distributions

  • So far, all the sampling distributions we’ve covered have been exact sampling distributions
  • Definition: A distribution of a statistic that holds exactly for any sample size \(n\)
  • Requires making assumptions about the parent population from which the sample comes
  • Examples:
    • \(Y_1,Y_2,...,Y_n \stackrel{i.i.d.}{\sim}EXP(\lambda) \Rightarrow Y_{(1)} \sim EXP(n\lambda)\)
    • \(Y_1,Y_2,...,Y_n \stackrel{i.i.d.}{\sim}GAM(\alpha, \lambda) \Rightarrow \bar Y \sim GAM(n\alpha, n\lambda)\)
    • \(Y_1,Y_2,...,Y_n \stackrel{i.i.d.}{\sim}N(\mu, \sigma^2) \Rightarrow \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}\)
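As a quick sanity check of the first result (a standalone base-R sketch), we can simulate minima of i.i.d. exponential samples and compare them to the claimed \(EXP(n\lambda)\) distribution:

set.seed(1)
n <- 5; lambda <- 2
mins <- replicate(10000, min(rexp(n, rate = lambda)))  # 10,000 simulated values of Y_(1)
mean(mins)                                             # should be near 1/(n*lambda) = 0.1
ks.test(mins, pexp, rate = n*lambda)                   # a non-small p-value is consistent with EXP(n*lambda)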

Asymptotic sampling distribution

  • Asymptotic sampling distributions are distributions of statistics that hold for large \(n\) only
  • Disadvantage: requires large \(n\) to be valid
  • Advantages (major):
    • can relax assumptions about the population from which the sample is drawn
    • an exact sampling distribution may not exist or may be difficult to find; asymptotic distributions provide answers that are approximately correct when \(n\) is large

Formal definition of asymptotic distribution

  • Let \(U_n\) represent a statistic from a sample of size \(n\)
  • Let \(F_{U_n}(u) = P(U_n \le u)\) represent the CDF of \(U_n\).
  • Let \(U\) represent a variable with CDF \(F_U\).
  • \(U_n\) converges in distribution to \(U\), written \(U_n \rightarrow_d U\), if \(\lim_{n\rightarrow \infty} F_{U_n}(u) = F_U(u)\) for all \(u\) at which \(F_U\) is continuous.
  • \(F_U\) is the asymptotic distribution of the statistic \(U_n\)
  • If \(F_U\) is the CDF of a well-known distribution, we may write that distribution's name in place of \(U\), e.g., \(U_n \rightarrow_d \chi^2_p\)

Example: geometric to exponential

  • Let \(Y \sim GEO(p)\), and write \(q = 1-p\). The pmf is:

\[P(Y=y) = pq^{y-1}, y=1,2,3...\]

  • Can show that \(F_Y(y) = P(Y\le y) = 1-q^y\) for \(y \in \{1,2,3,...\}\), since \(P(Y > y) = q^y\) (the first \(y\) trials must all be failures).
  • Suppose \(p = \frac{1}{n}\), and let \(U_n = \frac{Y}{n}\). Then (treating \(nu\) as an integer; using \(\lfloor nu \rfloor\) instead does not change the limit):

\[F_{U_n}(u) = P(U_n \le u) = P\left(\frac{Y}{n}\le u \right)\] \[= P(Y\le nu) = 1-\left(1-\frac{1}{n}\right)^{nu}= 1-\left[\left(1+\frac{-1}{n}\right)^n\right]^{u}\]

\[\rightarrow 1-(e^{-1})^{u} = 1-e^{-u} = F_U(u)\]

where \(U\sim EXP(1)\).

  • Thus \(U_n \rightarrow_d U\) or \(U_n \rightarrow_d EXP(1)\).
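A quick simulation (a standalone base-R sketch) makes the convergence visible: draw \(Y \sim GEO(p)\) with \(p = 1/n\) and compare the empirical CDF of \(U_n = Y/n\) to the \(EXP(1)\) CDF:

set.seed(1)
n <- 50
Y <- rgeom(10000, prob = 1/n) + 1   # rgeom() counts failures before the first success, so add 1 to count trials
U <- Y / n
plot(ecdf(U), main = "ECDF of U_n vs EXP(1) CDF")
curve(pexp(x, rate = 1), add = TRUE, col = "red")  # the two curves should nearly coincide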

The central limit theorem

  • A powerful theorem that gives asymptotic normality of \(\bar Y\), regardless of the parent population:

CLT formally stated

  • Let \(Y_1,Y_2,...,Y_n\) be i.i.d. random variables from a population with \(E(Y_i) = \mu < \infty\) and \(Var(Y_i) = \sigma^2 < \infty\).
    • Note that no assumptions are made about normality of the individual \(Y_i\)!
  • Let:

\[U_n = \frac{\bar Y - \mu}{\sigma/\sqrt{n}} = \sqrt{n}\left(\frac{\bar Y - \mu}{\sigma}\right).\]

Then \(U_n \rightarrow_d N(0,1)\).

CLT in practice

  • Formally:

\[\frac{\bar Y - \mu}{\sigma/\sqrt{n}} \rightarrow_d N(0,1)\]

  • In practice, we say:

\[\bar Y \sim AN\left(\mu, \frac{\sigma^2}{n}\right)\]

  • \(\mu\), \(\sigma^2\): the true mean and variance of the (possibly non-normal) parent population

Points of clarification

  • The following facts are always true, for any size \(n\):

    1. \(E(\bar Y) = \mu\) (whatever it is), by linearity of expectation
    2. \(Var(\bar Y) = \sigma^2/n\), using properties of variance of linear combinations
  • What the CLT adds is the approximate normality of \(\bar Y\) for large \(n\); a quick simulation check is sketched below.
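A standalone base-R sketch, assuming an \(EXP(1)\) parent (so \(\mu = 1\), \(\sigma^2 = 1\)): even at a small \(n\) where normality clearly fails, the mean and variance of \(\bar Y\) match \(\mu\) and \(\sigma^2/n\):

set.seed(1)
n <- 3                                                # small n: normality has not 'kicked in' yet
ybars <- replicate(100000, mean(rexp(n, rate = 1)))
mean(ybars)   # close to mu = 1, for any n
var(ybars)    # close to sigma^2/n = 1/3, for any n
hist(ybars)   # ...but clearly right-skewed at n = 3, not normal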

“Proof” by simulation

Let’s illustrate the CLT via simulation study using the following scenario:

  • Suppose \(Y_1,Y_2, ..., Y_n\stackrel{i.i.d.}{\sim} EXP(\beta)\). Thus:

    • \(\mu = E(Y_i) = \beta\)
    • \(\sigma^2 = Var(Y_i) = \beta^2\)
    • \(E(\bar Y) = \mu = \beta\) for all \(n\)
    • \(Var(\bar Y) = \sigma^2/n = \beta^2/n\) for all \(n\)
    • \(\bar Y\) is approximately normal for large \(n\) only

Code for simulation study

Note that in our simulation, the analytic pdf and CDF against which we compare the simulated \(\bar Y\)'s are those of the \(N(\beta, \beta^2/n)\) distribution given by the CLT:

library(purrrfect)
library(tidyverse)

N <- 10000
clt_sims <- (parameters(~beta, ~n, c(1, 1.5, 2), c(2,4,8,16,32))  # all (beta, n) combinations
 %>% add_trials(N)                                                # N simulated trials per combination
 %>% mutate(Y_sample = pmap(list(beta, n), .f = \(b,n) rexp(n, rate = 1/b)))  # one EXP(beta) sample per trial
 %>% mutate(Ybar = map_dbl(Y_sample, mean))                       # sample mean of each sample
 %>% mutate(fU = dnorm(Ybar, mean = beta, sd = beta/sqrt(n)),     # analytic normal pdf from the CLT
            FU = pnorm(Ybar, mean = beta, sd = beta/sqrt(n)),     # analytic normal CDF from the CLT
            Fhat = cume_dist(Ybar),                               # empirical CDF of the simulated Ybar's
            .by = c(beta,n))
) 

Plots of densities

ggplot(data = clt_sims, aes(x = Ybar)) + 
  geom_histogram(aes(y = after_stat(density)), 
                 binwidth = .2, fill = 'goldenrod') + 
  geom_line(aes(y = fU), col ='cornflowerblue') + 
  facet_grid(n~beta, labeller = label_both, scales = 'free_y') + 
  labs(x=expression(bar(Y)))+
  theme_classic()
densities of simulation study

Overlaid CDF plots

ggplot(data = clt_sims, aes(x = Ybar) ) + 
  geom_step(aes(y = FU, col = 'Analytic Normal CDF')) +
  geom_step(aes(y = Fhat, col = 'Empirical CDF')) + 
  facet_grid(n~beta, labeller = label_both, scales = 'free') + 
  labs(color='', 
       x = expression(bar(Y)),
       y = 'CDF')+
  theme_classic()
Overlaid CDF plots of simulation study

P-P plots

ggplot(data = clt_sims ) + 
  geom_abline(intercept = 0, slope =1, col = 'red')+
  geom_point(aes(x= Fhat, y = FU), shape = '.') + 
  facet_grid(n~beta, labeller = label_both) + 
  xlab('Empirical CDF') + ylab('Analytic NORMAL CDF') + 
  theme_classic()
p-p plots of simulation study

Proof of the CLT: preliminaries

  • A function \(f(n)\) is \(o(n)\) (“little oh of n”) if it goes to 0 faster than \(1/n\) does as \(n \rightarrow \infty\); more precisely, if \(\lim_{n\rightarrow \infty} n f(n) = 0\). Examples:

    • \(f(n) = \frac{1}{n^2}\) is \(o(n)\)
    • \(f(n) = e^{-n}\) is \(o(n)\)
    • \(f(n) = \frac{1}{\sqrt{n}}\) is not \(o(n)\)
  • For any \(t\), \(\left(1 + \frac{t}{n} + o(n) \right)^n \rightarrow e^t\) (a calculus fact; a numeric check follows this list)
  • If \(Y_1,Y_2,...,Y_n\) are i.i.d. and \(U = \sum_{i=1}^n Y_i\), then \(M_{U}(t) = M_Y(t)^n\) (7.1)
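The second fact can be checked numerically (a quick sketch): take \(t = 2\) and add a \(1/n^2\) term, which is \(o(n)\) in the sense above:

t <- 2
n <- 10^(1:6)
(1 + t/n + 1/n^2)^n   # approaches exp(2) = 7.389... as n grows
exp(t)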

CLT restated

  • Let \(Y_1,Y_2,...,Y_n\) be i.i.d. random variables from a population with \(E(Y_i) = \mu < \infty\) and \(Var(Y_i) = \sigma^2 < \infty\).
  • Let:

\[U_n = \frac{\bar Y - \mu}{\sigma/\sqrt{n}} = \sqrt{n}\left(\frac{\bar Y - \mu}{\sigma}\right),\]

then \(U_n \rightarrow_d N(0,1).\)

  • Proof outline:
    • Find MGF of \(U_n\)
    • Expand using Maclaurin series
    • Show it converges to a \(N(0,1)\) MGF

Finding MGF of \(U_n\)

With \(X_i = (Y_i - \mu)\) (note that \(E(X_i) = 0\) and \(E(X_i^2) = E((Y_i-\mu)^2) = \sigma^2\)): \[U_n = \frac{\bar Y - \mu}{\sigma/\sqrt{n}} = \frac{\frac{1}{n}\sum_{i=1}^n (Y_i - \mu) }{\sigma/\sqrt{n}}= \frac{\sum_{i=1}^n (Y_i - \mu) }{\sqrt{n}\sigma} = \frac{\sum_{i=1}^n X_i}{\sqrt{n} \sigma}\]

\[M_{U_n}(t) = E\left(e^{U_n t}\right)= E\left(e^{\frac{\sum_{i=1}^n X_i}{\sqrt{n} \sigma} t}\right) =E\left(e^{\sum_{i=1}^n X_i\cdot \frac{t}{\sqrt{n} \sigma} }\right)\]

\[=M_{\sum_{i=1}^n X_i}\left(\frac{t}{\sqrt{n} \sigma} \right) =\left(M_{X}\left(\frac{t}{\sqrt{n} \sigma} \right)\right)^n\]

Expanding MGF

\[\left(M_{X}\left(\frac{t}{\sqrt{n} \sigma} \right)\right)^n =\left[E\left(e^{\frac{Xt}{\sqrt{n} \sigma}} \right)\right]^n =\left[E\left(1+ \frac{Xt}{\sqrt{n} \sigma}+\frac{1}{2}\left(\frac{Xt}{\sqrt{n} \sigma}\right)^2+\frac{1}{3!}\left(\frac{Xt}{\sqrt{n} \sigma}\right)^3+\frac{1}{4!}\left(\frac{Xt}{\sqrt{n} \sigma}\right)^4+ ...\right)\right]^n\]

Apply linearity of expectation: \[=\left[1+ \frac{t}{\sqrt{n} \sigma}E(X)+\frac{1}{2}\left(\frac{t}{\sqrt{n} \sigma}\right)^2E(X^2)+\frac{1}{3!}\left(\frac{t}{\sqrt{n} \sigma}\right)^3E(X^3)+\frac{1}{4!}\left(\frac{t}{\sqrt{n} \sigma}\right)^4E(X^4)+ ...\right]^n\]

Recall that \(E(X) = 0, E(X^2)= \sigma^2\):

\[=\left[1+ 0+\frac{t^2/2}{n}+\underbrace{\frac{t^3}{3!n^{3/2}\sigma^3}E(X^3)+\frac{t^4}{4!n^{4/2}\sigma^4}E(X^4)+ ...}_{=o(n)}\right]^n \rightarrow e^{t^2/2},\]

the MGF of a \(N(0,1)\) random variable!
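The conclusion can be checked by Monte Carlo as well (a standalone base-R sketch, assuming an \(EXP(1)\) parent so that \(\mu = \sigma = 1\)): estimate \(M_{U_n}(t) = E(e^{tU_n})\) by averaging \(e^{tU_n}\) over simulated \(U_n\)'s and compare to \(e^{t^2/2}\):

set.seed(1)
n <- 200; t <- 0.5                # keep |t| modest so the Monte Carlo average is stable
Un <- replicate(100000, sqrt(n) * (mean(rexp(n, rate = 1)) - 1))  # (Ybar - mu)/(sigma/sqrt(n)) with mu = sigma = 1
mean(exp(t * Un))                 # Monte Carlo estimate of M_{U_n}(t)
exp(t^2/2)                        # N(0,1) MGF at t; the two should be close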

Application 1: exponential parent population

  • Suppose the times between calls at a help center are approximately exponentially distributed with mean \(\beta = 2\) minutes between calls.
  • The call center opens at 8AM. What is the probability that the average of the first 30 times between calls exceeds 2.5 minutes, i.e., \(P(\bar Y > 2.5)\)?
  • Population quantities:
    • \(\mu = 2\)
    • \(\sigma^2 = 2^2 = 4\)
  • Exact solution, using the fact that \(\bar Y \sim GAM(n, \beta/n)\) (shape-scale parameterization) for all \(n\):
1-pgamma(2.5, shape = 30, scale = 2/30)
[1] 0.0919017
  • Approximate solution, using \(\bar Y \sim AN(2, 4/n)\) for large \(n\):
1-pnorm(2.5, mean = 2, sd = sqrt(4/30))
[1] 0.08545176
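  • As a third answer, a quick Monte Carlo check (a standalone base-R sketch); the proportion of simulated sample means exceeding 2.5 should land near the exact 0.092:

set.seed(1)
ybars <- replicate(100000, mean(rexp(30, rate = 1/2)))  # mean beta = 2 corresponds to rate = 1/2
mean(ybars > 2.5)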

Application 2: Weibull parent population

  • In the previous example we knew the exact distribution of \(\bar Y\), so we didn’t need the CLT. That’s not always the case! Suppose \(Y_1,...,Y_n\) is an i.i.d. sample from a Weibull distribution with pdf given by:

\[ f_Y(y) = \frac{\theta}{\beta}\left(\frac{y}{\beta}\right)^{\theta-1} e^{-(y/\beta)^\theta}, y > 0\]

  • Properties:

    • \(E(Y) =\mu= \beta \Gamma(1+1/\theta)\)
    • \(Var(Y) = \sigma^2 = \beta^2\left[\Gamma\left(1+\frac{2}{\theta}\right)-\left(\Gamma\left(1+\frac{1}{\theta}\right)\right)^2\right]\)
    • \(M_Y(t) = \sum_{i=0}^\infty \frac{t^i\beta^i}{i!}\Gamma(1+i/\theta),\)

which has no closed form, meaning there is no simple way to derive the exact distribution of \(\bar Y\). Normal approximation to the rescue!

Application 2: Finding approximate probability

  • Suppose we have a sample of size \(n = 50\) survival times from a Weibull population with parameters \(\beta = 3\) and \(\theta = 0.5\).
  • What is the approximate probability that the average of these survival times exceeds 8 days? I.e., we want \(P(\bar Y > 8)\).

\[\bar Y \sim AN\left(3 \Gamma(1+1/0.5),3^2\left[\Gamma\left(1+\frac{2}{0.5}\right)-\left(\Gamma\left(1+\frac{1}{0.5}\right)\right)^2\right]/50\right)\]

mu <- 3*gamma(1+1/0.5)
sigma2 <- 3^2*(gamma(1+2/0.5)-gamma(1+1/0.5)^2)

1-pnorm(8, mean = mu, sd = sqrt(sigma2/50))
[1] 0.1459203

Application 2: Simulating the probability

  • Verify that this approximation is decent by simulating 10,000 sample means when \(n = 50\) from a \(Weibull(\theta = 0.5,\beta = 3)\) distribution:
1-pnorm(8, mean = mu, sd = sqrt(sigma2/50))  # CLT approximation, shown again for comparison
[1] 0.1459203
(many_samples <- replicate(10000,  rweibull(50, shape =0.5, scale = 3),
                           .as = y_sample  # purrrfect's replicate(): a tibble with one row per simulated sample
                           )
                  %>% mutate(ybar = map_dbl(y_sample, mean))  # sample mean of each sample
 ) %>% head
# A tibble: 6 × 3
  .trial y_sample    ybar
   <dbl> <list>     <dbl>
1      1 <dbl [50]>  4.05
2      2 <dbl [50]>  7.64
3      3 <dbl [50]>  5.50
4      4 <dbl [50]>  5.33
5      5 <dbl [50]>  4.60
6      6 <dbl [50]>  3.65
many_samples  %>% 
  summarize('P(ybar >8)' = mean(ybar > 8))
# A tibble: 1 × 1
  `P(ybar >8)`
         <dbl>
1        0.132