According to the central limit theorem, when multiple independent samples are taken from the same population, the means of those samples will tend to be normally distributed, even if the original population is not normal. The mean and standard deviation of the sample means can also be predicted from the population mean \(\mu\) and standard deviation \(\sigma\). Specifically, the mean of the sample means equals \(\mu\), and the standard error of the mean (SEM) is:

\[SEM = \frac{\sigma}{\sqrt{n}}\]

In this document I test this using simulated data. Samples of size 40 are drawn from an exponential distribution 1000 times. The theoretical and observed values for the mean, standard deviation, and standard error of the mean are compared, and the observed distribution of sample means is compared to the standard normal distribution using quantile-quantile (Q-Q) plots.
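
With the rate parameter \(\lambda = 0.2\) used in the code below, the mean and standard deviation of the exponential distribution are both \(1/\lambda\), so these formulas give the theoretical values referenced throughout:

\[\mu = \sigma = \frac{1}{\lambda} = 5, \qquad SEM = \frac{\sigma}{\sqrt{n}} = \frac{5}{\sqrt{40}} \approx 0.79\]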

Simulations

In the code below I set parameters, randomly sample data, and calculate the sample means, the standard deviation, and the standard error of the mean.

# load the packages used below
library(reshape2)   # melt()
library(dplyr)      # %>% and select()
library(ggplot2)    # plotting

# set parameters
m <- 1000
n  <- 40
lambda <- 0.2

# draw an n x m matrix of random values from the exponential distribution (one column per sample)
set.seed(420)
my.simulations <- as.data.frame(matrix(rexp(n * m, lambda), n))

# take the mean of each sample (one mean per column)
my.sample.means<- as.data.frame(apply(my.simulations, 2, mean))
colnames(my.sample.means) <- 'sample.mean'
my.simulations <- melt(my.simulations)
colnames(my.simulations) <- c('sample.number','value')

# record the sample number (1..m) and the observation number within each sample (1..n)
my.simulations$sample.number <- rep(1:m, each=n)
my.simulations$resample.number <- as.factor(rep(1:n,times=m))
my.simulations <- my.simulations %>% select(sample.number, resample.number, value)

# Calculate theoretical and sample mean, sd, and sem
theory_mu <- 1/lambda
theory_sigma  <- 1/lambda
theory_sem <- theory_sigma / sqrt(n)

sample_mean <- mean(my.simulations$value)
sample_sd <- sd(my.simulations$value)
sample_sem <- sd(my.sample.means$sample.mean)
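
As a quick check (an addition to the original code), the theoretical and observed values computed above can be collected side by side:

# collect theoretical and observed values for comparison
comparison <- data.frame(quantity    = c('mean', 'sd', 'sem'),
                         theoretical = c(theory_mu, theory_sigma, theory_sem),
                         observed    = c(sample_mean, sample_sd, sample_sem))
comparison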

The Samples

Three example samples are shown below.

# per-sample means for the three example samples shown below
example.means <- my.sample.means %>% mutate(sample.number = 1:m) %>% filter(sample.number < 4)

my.simulations %>% filter(sample.number < 4) %>% ggplot(aes(x=value)) +
    geom_histogram(aes(y=..density..), bins=10, fill='deepskyblue4', color='gray', alpha=.7) +
    stat_function(fun=dexp, args=list(rate=lambda), size=1.5, color='darkgray', alpha=.7) +
    geom_vline(xintercept=theory_mu, color='white', alpha=.8, size=1.5) +
    geom_vline(data=example.means, aes(xintercept=sample.mean), color='black', alpha=.5, size=1.5, linetype=2) +
    facet_grid(.~sample.number) 

For each of the example samples above, the sample mean is shown with a dashed dark line and the population mean \(\mu = 5\) is indicated with a solid white line.

The Combined Sample

Next I combine all the samples into one data set to show the distribution of all 40,000 observations (1000 samples of size 40) drawn from the exponential distribution.

ggplot(my.simulations, aes(x=value)) + 
    geom_histogram(aes(y=..density..), bins=40, alpha=.8, fill='deepskyblue4', color='gray') + 
    stat_function(fun=dexp, args=list(rate=lambda), size=2, color='darkgray', alpha=0.7) +
    geom_vline(xintercept=theory_mu, color='white', alpha=1, size=2) +
    geom_vline(aes(xintercept=mean(value)), color='black', size=2, linetype=2, alpha=0.5) +
    labs(y='density')

The mean of the combined sample was 4.9557223 and its standard deviation was 4.9764622, compared to a predicted value of 5 for both.

The Distribution of Sample Means

A histogram of the distribution of sample means with the normal distribution superimposed:

ggplot(my.sample.means, aes(x=sample.mean)) +
        geom_histogram(aes(y=..density..), 
                       bins=20,fill='darkorchid4',color='gray',alpha=.7) + 
        geom_vline(xintercept=theory_mu,
                   color='darkgray',size=2,alpha=0.7,linetype=1) +
        geom_vline(aes(xintercept=mean(sample.mean)), 
                   color='midnightblue',size=2,linetype=2,alpha=0.7) +
        stat_function(fun=dnorm,args=list(mean=sample_mean,sd=sample_sem),
                      color='midnightblue',size=3,alpha=0.7,linetype=2) +
        stat_function(fun=dnorm,args=list(mean=theory_mu,sd=theory_sem),
                      color='darkgray',size=3,alpha=0.7,linetype=1) +
        scale_x_continuous(breaks = (theory_mu) + c(-3:3)*theory_sem,
                           sec.axis = sec_axis(~(.- theory_mu)/theory_sem,
                                               name = 'standard normal'),
                           limits = c(3,7)) +
        labs(x='sample mean')

The mean of the distribution of sample means is 4.9557223, compared to the predicted value of 5.

The standard error of the mean is 0.7846463, compared to the predicted value of 0.7905694.

Comparison to the Standard Normal Distribution Using Q-Q Plots

One way to compare the shape of a distribution with the standard normal distribution is with a Q-Q plot. If two distributions have the same shape, the Q-Q plot will show a straight line along the diagonal of the plot. The Q-Q plots below compare the distribution of sample means (left) and exponential samples of different sizes (right) with the standard normal distribution.
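
As a reminder of what stat_qq does under the hood, the same comparison can be built by hand by pairing the sorted, standardized sample means with the corresponding standard normal quantiles. A minimal sketch (the names z and qq.df are my own, not part of the original code):

# Q-Q plot by hand: sorted, standardized sample means vs. standard normal quantiles
z <- with(my.sample.means, sort((sample.mean - mean(sample.mean)) / sd(sample.mean)))
qq.df <- data.frame(theoretical = qnorm(ppoints(length(z))), observed = z)
ggplot(qq.df, aes(x = theoretical, y = observed)) +
    geom_abline(intercept = 0, slope = 1, color = 'gray', size = 1) +
    geom_point(color = 'darkorchid', alpha = .5) +
    labs(x = 'Standard Normal Quantiles', y = 'Standardized Sample Means')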

ggplot(my.sample.means, aes(sample=sample.mean)) + 
    geom_abline(intercept = theory_mu, slope=theory_sem, color='white', size=2)+
    stat_qq(color='darkorchid') +
    labs(x='Standard Normal Distribution',
         y='Sample Means', 
         title='Normal Distribution vs. Sample Means') 


ggplot(my.simulations) + 
    stat_qq(aes(sample=value, group=sample.number), 
            color='yellowgreen',alpha=0.1)  +  
    stat_qq(aes(sample=value, group=resample.number), 
            color='deepskyblue', alpha=.2) +
    stat_qq(aes(sample=value),color='deeppink',alpha=0.3)  + 
    labs(x='Standard Normal Distribution', y='Sample',
         title='Normal Distribution vs. Exponential Samples') +
    annotate("text", x = 2.5, y = 9, label = 'n = 40', 
             color='chartreuse4') +
    annotate("text", x = 3.5, y = 24, label = 'n = 1000', 
             color='deepskyblue4') +
    annotate("text", x = 4.1, y = 39, label = 'n = 40,000', 
             color='deeppink4') 

The left plot shows how the observed distribution of sample means compares to the expected standard normal distribution. Overall, the points follow the diagonal, so the distribution of sample means appears to be approximately normal.

We can see that at extreme values of the sample mean (below 4 or above 6) the observed quantiles depart somewhat from the expected line, but since there are few sample means that far in the tails it is hard to say whether this departure is meaningful without adding standard errors.
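
One way to judge this (an addition to the original analysis, not part of it) is to add a rough pointwise envelope to the Q-Q plot by simulating many standard normal data sets of the same size and taking quantiles of their sorted values. A sketch, with 500 simulated data sets as an arbitrary choice:

# rough 95% pointwise envelope for the normal Q-Q plot of the sample means
set.seed(1)
z.obs <- with(my.sample.means, sort((sample.mean - mean(sample.mean)) / sd(sample.mean)))
sims  <- replicate(500, sort(rnorm(length(z.obs))))          # simulated order statistics
band  <- apply(sims, 1, quantile, probs = c(.025, .975))     # pointwise 2.5% and 97.5% limits
qq.band <- data.frame(theoretical = qnorm(ppoints(length(z.obs))),
                      observed = z.obs, lower = band[1, ], upper = band[2, ])
ggplot(qq.band, aes(x = theoretical)) +
    geom_ribbon(aes(ymin = lower, ymax = upper), fill = 'gray', alpha = .5) +
    geom_point(aes(y = observed), color = 'darkorchid', alpha = .5) +
    geom_abline(intercept = 0, slope = 1) +
    labs(x = 'Standard Normal Quantiles', y = 'Standardized Sample Means')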

On the right I have plotted the exponential samples relative to the normal distribution at three different sample sizes. We can see that the distributions are not normal, as they depart widely from the diagonal. In addition, we see that the spread of the sample quantiles becomes narrower with increasing sample size.

Next I show two more Q-Q plots comparing the sample means and the samples with the exponential distribution.

ggplot(my.sample.means, aes(sample=sample.mean)) + 
    stat_qq(distribution = stats::qexp, color='darkorchid')+
    labs(x='Exponential Distribution', y='Sample Means',
         title='Exponential Distribution vs. Sample Means') 

ggplot(my.simulations) +
    geom_abline(intercept = 0, slope = 1/lambda, color='white', size=1)+
    stat_qq(aes(sample=value, group=sample.number), 
            distribution = stats::qexp, color='yellowgreen',alpha=0.1) +  
    stat_qq(aes(sample=value, group=resample.number), 
            distribution = stats::qexp, color='deepskyblue', alpha=.2) +
    stat_qq(aes(sample=value), 
            distribution = stats::qexp, color='deeppink',alpha=0.3)  +
    annotate("text", x = 5, y = 12, label = 'n = 40', color='chartreuse4') +
    annotate("text", x = 6.7, y = 24, label = 'n = 1000', color='deepskyblue4') +
    annotate("text", x = 8.5, y = 38, label = 'n = 40,000', color='deeppink4')+
    labs(x='Exponential Distribution', y='Sample',
         title='Exponential Distribution vs. Exponential Samples') 

On the left we see that the distribution of sample means is not exponential: the quantile-quantile plot departs widely from the diagonal, and the shape of the curve indicates how the two distributions differ.

On the right we see the samples relative to the exponential distribution. At the largest sample size the sampled distribution has the expected shape (the quantile-quantile plot for n = 40,000 lies along the diagonal). For smaller sample sizes (n = 40, in green) individual samples can be far from the expected shape, but on average they match it. This parallels the central limit theorem and extends it: the distributions of sample quantiles also tend to be approximately normal and unbiased, with variance inversely proportional to sample size.
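
For reference, the claim about sample quantiles matches the standard large-sample result (not derived in the original text): for a population with density \(f\), the sample \(p\)-quantile \(\hat{x}_p\) is approximately normal and unbiased with

\[\operatorname{Var}(\hat{x}_p) \approx \frac{p(1-p)}{n\, f(x_p)^2},\]

which is where the inverse dependence on sample size comes from.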

The Kolmogorov-Smirnov Statistic

The Kolmogorov-Smirnov (K-S) test is used to test whether a sample comes from a given distribution. Below I standardize the sample means and test whether their distribution differs from the standard normal.
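
For reference, the statistic \(D\) reported below is the largest absolute difference between the empirical distribution function of the standardized sample means, \(F_n\), and the standard normal CDF \(\Phi\):

\[D = \sup_x \left| F_n(x) - \Phi(x) \right|\]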

# First standardize the means
x <- my.sample.means$sample.mean
x_std <- (x - mean(x)) / sd(x)

# now apply the K-S test
ks.test(x_std, "pnorm")
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  x_std
## D = 0.028812, p-value = 0.3776
## alternative hypothesis: two-sided

The result of the K-S test shows that the observed distribution is not significantly different from the standard normal distribution, D = 0.03, p = .38.

Conclusion

I have shown that the parameters of the distribution of sample means can be predicted from the parameters of the sampled distribution. The sample means are approximately normally distributed and have the mean, variance, and standard error predicted by the central limit theorem. I used quantile-quantile plots to compare the shapes of the sampled distribution and the distribution of sample means to the theoretical distributions. Finally, I used the Kolmogorov-Smirnov test to compare the observed distribution of sample means to the standard normal and found that it is not significantly different.


This is the work of Lisa Marie Pritchett and was updated on 2017-12-07.