The Central Limit Theorem

Introduction

In this exercise we will compare an exponential distribution with a thousand means of 40 random variables from a simulated exponential distribution. This will help illustrate the Central Limit Theorem, which states that if you take a sufficiently large sample from a population, the distribution of the sample means will be normally distributed, regardless of the distribution of the population that was sampled.

Exponential distribution

The exponential distribution can be simulated in R with the function rexp(n, lambda), where n is the amount of random variables we want to get, and lambda is the rate parameter.

For a random variable \(X\), given \(X \sim Exp(\lambda)\), then:

The mean or expected value of an exponential distribution is as follows: \[E[X] = \frac{1}{\lambda}\]
It’s standard distribution is the same as the mean: \[ \sigma[X] = \frac{1}{\lambda}\]
For this exercise, a rate parameter of \(\lambda = 0.2\) was used. The expected value for the mean would be: \[E[X] = \frac{1}{\lambda} = \frac{1}{0.2} = 5\]
Then the theoretical standard distribution would be the same: \[ \sigma[X] = \frac{1}{\lambda} = \frac{1}{0.2} = 5\]
The standard error of the means, where n = sample size, which for our simulation exercise is 40, would be: \[ SE[X] = \frac{\sigma}{\sqrt{n}} = \frac{5}{\sqrt{40}} = 0.7905694\]
And the variance would be: \[ Var[X] = \frac{1}{\lambda^2} = \frac{1}{5^2} = 25\]

Simulation Exercise

A simulation was run, obtaining a sample consisting of 40 random variables with an exponential distribution. This process was repeated a thousand times, and the obtained values were stored in a matrix. The matrix was turned into a data frame. The mean, standard deviation and variance for each iteration of the simulation were computed and each value stored on new columns.

Since it’s a fairly large number of samples, the central limit theorem states that the distributions we obtain from the samples should be approximately normal.
Plots were created to observe the distribution of the sample mean and variance, and to compare them with their theoretical values.

Defining the parameters

As it has been stated before, for the random exponential distribution simulations the rate parameter lambda = 0.2 and n = 40 were used.

lambda <- 0.2
n <- 40

Running the simulated data

A for loop was used for the simulation, and the simulated data was stored in an object called sample. This sample will be a matrix, with each row corresponding to an iteration of the simulated data, containing 40 random exponential variables, for a total of a thousand iterations.

sample <- NULL
for(i in 1:1000){
        set.seed(i*2+8)
        sample <- rbind(sample, rexp(n = n, rate = lambda))
}

Computing new variables

Two new variables, mean and sd, were added for the mean and standard deviation for each row, respectively. We will do this with the mutate function, which adds a new variable without modifying the existing ones.

sample <- as.data.frame(sample) %>% 
        mutate(mean = apply(sample, 1, mean), 
               sd = apply(sample, 1, sd), 
               var = apply(sample, 1, var))

Now we will calculate the total mean, standard deviation and variance from the samples, and compare them with the corresponding theoretical values

sample_mean <- round(mean(sample$mean), 4)
sample_sd <- round(mean(sample$sd),4)
sample_var <- mean(sample$var)
theoretical_mean <- 1/lambda
theoretical_sd <- 1/lambda
se <- round(theoretical_sd/sqrt(40), 4)
theoretical_var <- 1/(0.2^2)

means <- c(theoretical_mean, sample_mean)
sds <- c(theoretical_sd, sample_sd)
vars <- c(theoretical_var, sample_var)

Comparing theoretical and sample values

A table was created using the kbl function, from the kableExtra library.

tabl <- data.frame("Mean" = means,
                    "Standard Deviation" = sds,
                    "Variance" = vars, 
                    row.names = c("Theoretical", "Sample"))
tabl %>% kbl() %>% 
        kable_styling(latex_options = "striped")

	Mean	Standard.Deviation	Variance
Theoretical	5.0000	5.0000	25.00000
Sample	4.9956	0.8201	25.06253

Above we can see that the values calculated from the sample are almost equal to their corresponding theoretical values.

Creating plots

Plots were created to visualize the distribution of our samples mean and variance. In the plot below are shown the distribution of the means from the sample.

g <- ggplot(data = sample, aes(x = mean))
plot1 <- g + geom_histogram(aes(y = ..density..), bins = 25, color = "black", fill = "lightblue")
plot1 + scale_x_continuous() + 
        geom_vline(aes(xintercept = mean(sample$mean), color = "Sample Mean"),
                    size = 1.5) + 
        geom_density(color = "black", lwd = 1.5) +
        geom_vline(aes(xintercept = theoretical_mean, color = "Theoretical Mean"),
                       linetype = "dashed", size = 1.5) +
        scale_color_manual(values = c("Sample Mean" = "blue",
                                      "Theoretical Mean" = "yellow")) +
        labs(title = "Random exponential variables mean distribution", 
             x = "Random Exponential Variables Means", 
             y = "Density") +
        theme_minimal()

The plot shown above demonstrates the central limit theorem and the law of large numbers. The distribution of the means of the sample follows a normal distribution, And the sample mean is a good estimator of the theoretical mean.

The same process was repeated for the samples variance.

p <- ggplot(data = sample, aes(x = var))
plot2 <- p + geom_histogram(aes(y = ..density..), bins = 25, color = "black", fill = "gray")
plot2 + scale_x_continuous() + 
        geom_vline(aes(xintercept = mean(sample$var), color = "Sample Variance"),
                    size = 1.5) + 
        geom_density(color = "black", lwd = 1.5) +
        geom_vline(aes(xintercept = theoretical_var, color = "Theoretical Variance"),
                       linetype = "dashed", size = 1.5) +
        scale_color_manual(values = c("Sample Variance" = "blue",
                                      "Theoretical Variance" = "orange")) +
        labs(title = "Random exponential variables variance distribution", 
             x = "Random Exponential Variables Variance", 
             y = "Density") +
        theme_minimal()

The sample variance follows an approximate normal distribution with a right skew. The sample variance is also a good estimator of the theoretical variance, following the central limit theorem and the law of large numbers.

A plot of a single row from the sample was plotted just for comparison.

t <- ggplot(data = sample, aes(x = V1))
t + geom_histogram(aes(y = ..density..), bins = 25, color = "black", fill = "orange") +
        geom_density(color = "black", lwd = 1.5) +
        labs(title = "Random exponential variables distribution", 
             x = "Random exponential variables", 
             y = "Count") + 
        theme_minimal()