Overview

In this project, the exponential distribution in R will be investigated and compared with the Central Limit Theorem. Through simulations, the Central Limit Theorem will be demostrated through computing the mean of a thousand simulations of size forty and see how the distribution of the sample means follow more a Gaussian curve than the original exponential.

Simulations

To produce the exponential distribution in R, the function rexp(n,lambda) will be used, where n is the sample size and lambda is the rate parameter. According to the project’s rules n is set to 40 and lambda is set to 0.2, as well as the number of simulations done will be 1000.

For this exercise, it will not be necesary to load any data; all the required data will be generated through R functions.

There are no required packages for this excercise.

So we have a population that follows a exponential distribution with a rate parameter lambda; theory says that the population mean pop_mean = 1 / lambda and the population standar deviation is pop_sd = 1 / lambda.
Let’s define those parameters in R:

n <- 40
lambda <- 0.2

pop_mean <- 1 / lambda
pop_sd <- 1 / lambda
pop_var <- pop_sd ^ 2

simulations <- 1000

# Set seed to generate always the same random numbers
set.seed(4)

For every simulation of size n, a sample mean is being generated, so 1000 simulations are generating 1000 sample means.
According to the Central Limit Theorem, those sample means follow a normal distribution if n is large enought; as well as the distribution is centered on the population mean and the standard deviation is equal to the standar error, which is the population standard deviation divided by squared root of n; that is std_error = pop_sd / sqrt( n )

The assumption I have made for this excercise is that n is large enough to use the normal distribution as the distribution resulting from the sample means.

To continue with the work, I create a matrix of random exponentials with n columns and 1000 rows, as said before, I am going to run 1000 simulations.

E <- matrix(rexp(simulations*n,lambda),simulations,n)

Let’s start answering some question for this project:

1. Comparison between sample mean and theoretical mean of the distribution

As said before, the theoretical mean of the distribution is the population mean, and it is defined as pop_mean = 1 / lambda.
The sample mean is just the average of the 1000 means from all the sample of size 40. To calculate this value, first we have to compute the average from the sample of size 40 for every simulation and then compute the average of the 1000 means. In other words, we calculate the average of every row from the E matrix (so we have 1000 averages) and then we calculate the average of those 1000 averages, which is the sample mean.

sample_mean <- mean(apply(E,1,mean))

# Create df1 as a data frame to present the results
df1 <- data.frame(
                   "Variable" = c("Population mean","Sample mean")
                  ,"Value" = c(round(pop_mean,3),round(sample_mean,3)))
diff <- round(abs(pop_mean - sample_mean),3)
print(df1)
##          Variable Value
## 1 Population mean 5.000
## 2     Sample mean 5.025

Sample mean is really close to its theoretical value. The different beetween both is 0.025.

2. How is the variance of the sample compared to the theoretical variance?

Recall that the theoretical variability of the distribution is the standard error, that “in terms of variance” is defined as std_error^2 = pop_sd^2 / n.
To compute the standard deviation of the sample means distribution, we just calculate the standard deviation of the 1000 means. It tells how variable the sample means are.

std_error <- round(pop_sd / sqrt( n ),3)
std_error2 <- std_error ^ 2

sample_sd <- sd(apply(E, 1, mean))
sample_variance <- sample_sd ^ 2

df2 <- data.frame(
                  "Variable" = c("Standard error^2","Sample variance")
                 ,"Value" = c(round(std_error2,3),round(sample_variance,3)))

diff2 <- round(abs(std_error2 - sample_variance),3)
print(df2)
##           Variable Value
## 1 Standard error^2 0.626
## 2  Sample variance 0.658

The sample variance is very close to the theoretical value, the standard error. The different between both is 0.032.

3. How different the 2 distributions are?

To answer this question I am going to plot the 2 distributions and make some comments about those.

# Plot the exponential distribution
hist(rexp(simulations, lambda)
     ,main = "Distribution of 1000 random Exponential"
     ,xlab = "variables"
     ,breaks = 15
     ,density = 15)

abline(v = pop_mean, col = "red")

All the values for the exponential distribution are positives. Most of the values are concentrated around 0 and as long as you get further from 0 the number of variables gets smaller. It forms a decreasing exponential curve as the varables go away from 0.
I have drawn a vertical line that shows the population mean equal to 5.

# Plot distribution of 1000 averages of 40 random exponential
mns = NULL
for (i in 1 : simulations){
  mns = c(mns, mean(rexp(40, lambda)))
} 
hist(mns
     ,main = "Distribution of 1000 means of 40 exponential"
     ,xlab = "Means of 40 exponential"
     ,breaks = 20 
     ,density = 15)
abline(v = pop_mean, col = "red")

The distribution of 1000 means of 40 random exponentials follows approximately a bell curve as predicted by the Central Limit Theorem. This curve is center on the population mean = 5 as drawn in the red vertical line, and it has a standard deviation equals to the standard error = 0.791.

Conclusions

With this excercise has been demonstrated how the Central Limit Theorem works with the Exponential distribution. 1000 means of random exponential of size 40 follows a Gaussian curve centered on the population mean with a standard deviation equals to the standard error when n is large enough.

We have also checked how the sample mean gets very close to the population mean and how the sample mean variance gets very close to the theoretical value, the standard error.