According to Wikipedia the Central Limit Theorem states that “… the arithmetic mean of a sufficiently large number of iterates of independent random variables… will be approximately normally distributed, regardless of the underlying distribution”. I will demonstrate this theorem by simulating an exponential distribution (note this is, by definition, different than a normal distribution) and comparing this distribution to a thousand simulations of that same distribution, demonstrating that the means of these 1000 distributions will approximate a normal distribution.
To proceed, load appropriate R packages.
library(ggplot2)
First, I will randomly generate a sample of 40 numbers using the exponential distribution employing the R function rexp(n, rate) with rate = 0.2 and save this into the vector exp_sample.
exp_sample <- data.frame(random.values = rexp(n=40, rate = 0.2))
Next I will randomly generate 1000 of these distributions, take the arithmetic mean of each of these distributions and save this into sample_means.
exp_means = NULL
for (i in 1:1000) exp_means = c(exp_means, mean(rexp(n=40, rate = 0.2)))
sample_means <- data.frame(means = exp_means)
The mean of exponential distribution is given by 1/lambda, which in this case will equal 1/0.2 = 5. The distribution of our sample is plotted below, its arithmetic mean is 3.4335399 is plotted as the dark red vertical line.
g <- ggplot(data=exp_sample, aes(random.values)) +
geom_histogram(color="black",fill="light blue", binwidth = 0.5) +
theme_bw() +
geom_vline(xintercept = mean(exp_sample$random.values), color="dark red") +
ggtitle("Histogram of 40 Randoms,
Exponential Distribution with Lambda = 0.2")
print(g)
Next, we will look at the theoretical mean prdeicted by the Central Limit Theorem by taking the arithmetic mean of the means of our 1000 random samples. The distribution of these means is plotted below, their arithmetic mean is 5.0129359 and is plotted as the dark red vertical line.
g1 <- ggplot(data=sample_means, aes(means)) +
geom_histogram(color="black",fill="light blue", binwidth = 0.5) +
theme_bw() +
geom_vline(xintercept = mean(sample_means$means), color="dark red") +
ggtitle("Histogram of Means from 1000 Samples of 40 Randoms,
Exponential Distribution with Lambda = 0.2")
print(g1)
Note, as the Central Limit Theorem predicts even though the underlying distribution is exponential the means of a large number of randomly generated (i.e. independent) samples will have a normal distribution with their mean approximating the true mean.
To describe the spread of our data I will use standard deviation, which is the square root of the variance, which is the average of the squared differences of each data value from the sample’s mean. Note, that since we are using a sample, we average by N-1, or 39 in this case.
#A manual equation for standard deviation
sqrt(sum((exp_sample$random.values - mean(exp_sample$random.values))^2)/39)
## [1] 3.411844
#compared to the handy R function sd()!!
sd(exp_sample$random.values)
## [1] 3.411844
Now, as we demonstrated above our sample’s mean is itself a random variable whose distribution approximates a normal distribution given a large enough sample of means. The standard deviation of this large distribution of means is called the standard error of the mean and represents the spread, or variability, of the mean in a general population. Our small sample of N = 40 can estimate this standard error of the population mean by taking the sample standard deviation and dividing by the square root of the sample size.
se <- function(x) (sd(x)/sqrt(length(x)))
se(exp_sample$random.values)
## [1] 0.5394599
Let’s compare this standard error of the mean, estimated from our small sample to an actual standard deviation of our simulation of 1000 sample means.
sd(sample_means$means)
## [1] 0.7998722
An estimate that is not too bad given a sample size of 40!
To examine whether data follow a normal distribution, it may help to normalize the data, plot its density function and compare to a standard normal distribution with mean of zero and standard deviation of 1. To normalize the data, simply subtract each data value from the sample mean and divide by the sample standard deviation. The function below does this.
normalize <- function(x) (x - mean(x))/sd(x)
Let’s try this with our sample and plot a density function of our normalized data in blue and a standard normal density plot in red for comparison.
normal_sample <- data.frame(normals =normalize(exp_sample$random.values))
g3 <- ggplot(data=normal_sample, aes(normals)) +
xlim(-4, 4)+
geom_density(color = "light blue") +
theme_bw() +
stat_function(fun=dnorm, color = "dark red")+
ggtitle("Histogram of Normalized Values
of 40 Random Exponentials")
print(g3)
Not an awful overlap but not amazing either, unfortunately. Let’s try this with our 1000 sample means…
normal_means <- data.frame(normals =normalize(sample_means$means))
g4 <- ggplot(data=normal_means, aes(normals)) +
xlim(-4, 4)+
geom_density(color = "light blue") +
theme_bw() +
stat_function(fun=dnorm, color = "dark red")+
ggtitle("Histogram of Normalized Values of Means from
1000 Samples of 40 Random Exponentials")
print(g4)
A bit closer to approximating a normal distribution.