In STAT 460 we turn in full force to the concept of sampling distributions
The term “sampling distribution” is a rather nondescript way of talking about the key idea of:
The distribution of a statistic, computed from a sample of size \(n\), across repeated samples from a population
The sampling distribution visual
What do we need them for?
Sampling distributions are essential for basically all of statistics:
Forming confidence intervals
Obtaining test statistics and p-values
Performing hypothesis tests
Discussing quality of parameter estimators
All of which we will discuss in more detail this semester!
An i.i.d. sample
We will now often talk about a sample \(Y_1,\ldots,Y_n\) of independent and identically distributed (i.i.d. for short) random variables drawn from a population
We have actually already considered several examples of sampling distributions when \(n=2\):
Let \(Y_1\) and \(Y_2\) be independent, and identically distributed (i.i.d.) \(N(0,1)\) random variables
What is the distribution of \(U = Y_1 +Y_2\)?
Note that previously we simplified our notation with \(Y_1 \equiv X\) and \(Y_2 \equiv Y\), but starting now we will often want to consider the impact of \(n\) on our sampling distribution!
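One way to check the answer to the \(n=2\) question above is with MGFs, using the standard fact that a \(N(0,1)\) random variable has MGF \(e^{t^2/2}\). A short sketch:

\[
M_U(t) = E\left[e^{t(Y_1+Y_2)}\right] = M_{Y_1}(t)\,M_{Y_2}(t) = e^{t^2/2}\,e^{t^2/2} = e^{2t^2/2}
\]

This is the MGF of a normal distribution with mean 0 and variance 2, so \(U\) is normal with mean 0 and variance 2.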
Review of the MGF method
The method of moment generating functions is commonly applied to find the distribution of sums, or sample means, of i.i.d. samples
Let \(Y_1,...,Y_n\) be an i.i.d. sample, thus with common MGF \(M_Y(t)\)
What is the distribution of \(U = Y_1+Y_2 + ... + Y_n = \sum_{i=1}^n Y_i\)?
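By independence, \(M_U(t) = \prod_{i=1}^n M_{Y_i}(t) = [M_Y(t)]^n\). When the \(Y_i\) are \(EXP(\lambda)\), this is the MGF of a \(GAM(n,\lambda)\) distribution: thus \(U\sim GAM(n, \lambda)\)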
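For concreteness, here is the computation behind the \(GAM(n,\lambda)\) claim for an i.i.d. \(EXP(\lambda)\) sample. These notes do not state whether \(\lambda\) is a rate or a scale (mean), so both conventions are sketched; treat the choice as an assumption to check against the course's definition of \(EXP(\lambda)\).

\[
\begin{aligned}
\text{Rate convention: } & M_Y(t) = \frac{\lambda}{\lambda - t},\ t<\lambda
  &&\Rightarrow\quad M_U(t) = [M_Y(t)]^n = \left(\frac{\lambda}{\lambda - t}\right)^n \\
\text{Scale convention: } & M_Y(t) = (1-\lambda t)^{-1},\ t<1/\lambda
  &&\Rightarrow\quad M_U(t) = [M_Y(t)]^n = (1-\lambda t)^{-n}
\end{aligned}
\]

In either case the result is the MGF of a gamma distribution with shape \(n\) and the same \(\lambda\) (as rate or as scale, respectively).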
Simulating sampling distributions
To simulate sampling distributions, we will often want to incorporate \(n\) as one of our parameters in a parameter grid using parameters()
Use pmap(list(n, parent_pars), .f = \(n, parent_pars) r_*(n, parent_pars)) to generate a list column where each row is one sample of size \(n\) drawn i.i.d. from the parent population with parameters parent_pars
Why pmap()? Input: list (of parameters); Output: list (of the sample results)
Then, use map_dbl(sample_result, .f=summary_method) to take each sample (one element of the list column) as input and return a single number (a double) as output
Simulating \(n\) i.i.d. \(EXP(\lambda)\) across grid
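The simulation code itself is not shown in these notes, so the following is only a minimal sketch of the pmap() and map_dbl() workflow described above. It makes several assumptions: tidyr::expand_grid() stands in for parameters(), whose exact arguments are not given here; the grid values and the column names trial and y_sample are hypothetical; and \(\lambda\) is treated as a rate, matching R's rexp() and dgamma(). The columns u and f_U are named to match the plotting code.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(460)

# Parameter grid: every (n, lambda) combination, repeated over 1000 trials.
# expand_grid() is a stand-in for parameters() (an assumption of this sketch).
sum_of_exps <- expand_grid(n = c(2, 5, 10), lambda = c(0.5, 1), trial = 1:1000) |>
  mutate(
    # List column: each row holds one i.i.d. sample of size n (lambda as a rate)
    y_sample = pmap(list(n, lambda), .f = \(n, lambda) rexp(n, rate = lambda)),
    # Summarize each sample to a single double: the sum U
    u = map_dbl(y_sample, .f = sum),
    # Analytic density of U: gamma with shape n and rate lambda
    f_U = dgamma(u, shape = n, rate = lambda)
  )
```

If the course treats \(\lambda\) as the mean instead, the same sketch applies with rate = 1 / lambda in both rexp() and dgamma().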
ggplot(data = sum_of_exps) +
  geom_histogram(aes(x = u, y = after_stat(density)),
                 fill = 'goldenrod', center = 0.1, binwidth = 0.2) +
  geom_line(aes(x = u, y = f_U), col = 'cornflowerblue') +
  facet_grid(n ~ lambda, labeller = label_both) +
  theme_classic(base_size = 16) +
  labs(title = 'Simulated and analytic densities for sum of exponentials')
Example: Mean of exponentials
Let \(Y_1\), \(Y_2\), …, \(Y_n\) be an i.i.d. sample from an \(EXP(\lambda)\) distribution
Find the distribution of \(\bar Y = \frac{\sum_{i=1}^nY_i}{n} = \frac{U}{n}\)
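A sketch of one way to answer this with MGFs, using \(M_{\bar Y}(t) = E\left[e^{tU/n}\right] = M_U(t/n)\). As before, whether \(\lambda\) is a rate or a scale is an assumption that depends on the course's definition of \(EXP(\lambda)\), so both cases are shown:

\[
\begin{aligned}
\text{Rate convention: } & M_{\bar Y}(t) = \left(\frac{\lambda}{\lambda - t/n}\right)^n = \left(\frac{n\lambda}{n\lambda - t}\right)^n,\ t < n\lambda
  &&\Rightarrow\quad \bar Y \sim \text{gamma with shape } n,\ \text{rate } n\lambda \\
\text{Scale convention: } & M_{\bar Y}(t) = \left(1 - \frac{\lambda t}{n}\right)^{-n},\ t < n/\lambda
  &&\Rightarrow\quad \bar Y \sim \text{gamma with shape } n,\ \text{scale } \lambda/n
\end{aligned}
\]

Either way, the shape stays at \(n\) and only the rate (or scale) is rescaled by \(n\).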
ggplot(data = mean_of_exps) +
  geom_histogram(aes(x = ybar, y = after_stat(density)),
                 fill = 'goldenrod', center = 0.1, binwidth = 0.2) +
  geom_line(aes(x = ybar, y = f_ybar), col = 'cornflowerblue') +
  facet_grid(n ~ lambda, labeller = label_both) +
  theme_classic(base_size = 16) +
  labs(title = 'Simulated and analytic densities for sample mean of exponentials')
General result for MGF of \(\bar Y\)
The previous example illustrates a result that holds more generally
If \(Y_1\), \(Y_2\), …, \(Y_n\) represent an i.i.d. sample with common MGF \(M_Y(t)\), then with \(U = \sum_{i=1}^nY_i\) and \(\bar Y=\frac{U}{n}\): \(M_{\bar Y}(t) = M_U(t/n) = [M_Y(t/n)]^n\)
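As a worked instance, consider a normal population with mean \(\mu\) and variance \(\sigma^2\), the case simulated next. This sketch uses the standard normal MGF \(M_Y(t) = \exp\left(\mu t + \sigma^2 t^2/2\right)\):

\[
M_{\bar Y}(t) = \left[M_Y(t/n)\right]^n
= \left[\exp\left(\frac{\mu t}{n} + \frac{\sigma^2 t^2}{2n^2}\right)\right]^n
= \exp\left(\mu t + \frac{(\sigma^2/n)\,t^2}{2}\right)
\]

This is the MGF of a normal distribution with mean \(\mu\) and variance \(\sigma^2/n\), so the sampling distribution of \(\bar Y\) has three parameters: \(\mu\), \(\sigma\), and \(n\).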
We must now use the ggh4x package to visualize results across a 3-parameter grid:
library(ggh4x)

ggplot(data = normal_mean_sims, aes(x = ybar)) +
  geom_histogram(aes(y = after_stat(density)), bins = 50, fill = 'goldenrod') +
  geom_line(aes(y = f), col = 'cornflowerblue') +
  # Nest n within sigma across the columns; mu defines the rows
  facet_nested(mu ~ sigma + n, labeller = label_both, scales = 'free_y') +
  # The sample mean is on the x-axis; the y-axis is density
  labs(x = expression(bar(Y)),
       title = 'Simulated and analytic densities of sample mean from normal population') +
  theme_classic()
Plotting overlaid CDF plots
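The CDF plot relies on columns F (analytic CDF) and Fhat (empirical CDF) whose construction is not shown in these notes. The following is one possible sketch, not necessarily how the original normal_mean_sims was built. Assumptions: tidyr::expand_grid() stands in for parameters(); the grid values and the column names trial and y_sample are hypothetical; and Fhat is computed with dplyr::cume_dist() within each parameter combination.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(460)

normal_mean_sims <- expand_grid(
  mu = c(-2, 2), sigma = c(1, 3), n = c(5, 20), trial = 1:500
) |>
  mutate(
    # One i.i.d. normal sample of size n per row, stored in a list column
    y_sample = pmap(list(n, mu, sigma),
                    .f = \(n, mu, sigma) rnorm(n, mean = mu, sd = sigma)),
    ybar = map_dbl(y_sample, .f = mean),
    # Analytic density and CDF of Ybar: normal with mean mu, sd sigma / sqrt(n)
    f = dnorm(ybar, mean = mu, sd = sigma / sqrt(n)),
    F = pnorm(ybar, mean = mu, sd = sigma / sqrt(n))
  ) |>
  # Empirical CDF: proportion of simulated means <= ybar, per grid cell
  group_by(mu, sigma, n) |>
  mutate(Fhat = cume_dist(ybar)) |>
  ungroup()
```

Grouping before computing Fhat matters: without it, the empirical CDF would pool simulated means across all parameter combinations.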
library(ggh4x)

ggplot(data = normal_mean_sims, aes(x = ybar)) +
  geom_step(aes(y = F, col = 'Analytic CDF')) +
  geom_step(aes(y = Fhat, col = 'Empirical CDF')) +
  facet_nested(mu ~ sigma + n, labeller = label_both, scales = 'free_y') +
  # The sample mean is on the x-axis; the y-axis is cumulative probability
  labs(x = expression(bar(Y)),
       title = 'Simulated and analytic CDFs of sample mean from normal population',
       color = '') +
  theme_classic()