Author: Daniel Kinpara
Date: June, 2015
The objective of this study is to compare the exponential distribution with the theoretical distribution predicted by the Central Limit Theorem (CLT). To do that, the mean and the variance of an exponential distribution will be calculated. Since the population mean and variance are unknown, they will be estimated through simulations with randomized data.
The Negative Exponential Distribution (NED) belongs to the exponential family of distributions, which also includes the normal, binomial, gamma, and Poisson distributions (Wikipedia). The NED describes the time between events in a Poisson process. Let’s generate some NED data in R. The following code will accomplish that:
set.seed(1) ## for reproducibility reasons
lambda <- 0.2 ## defines lambda parameter
n <- 40 ## sample size
de <- rexp(n, lambda) ## generates 40 randomized numbers
sampleMean <- mean(de) ## sample mean
sampleVar <- var(de) ## sample variance
hist(de,
     xlim = c(0, 30),
     ylim = c(0, 30),
     main = "Exponential Distribution Histogram",
     xlab = "data",
     ylab = "frequency") ## plots the histogram
For an exponential distribution, the mean and the standard deviation are equal: both are the inverse of the parameter lambda, so the theoretical mean is 1/0.2 = 5 and the theoretical variance is (1/0.2)^2 = 25. The sample mean is 4.860 and the sample variance is 23.569. Looking at the histogram, it’s hard to locate the “center of mass” of the graph in order to observe the average. The variance of the sample mean for a sample of size n can be estimated from the sample variance as follows:
estimatedMeanVar <- sampleVar / n ## estimates Var(sample mean) = sigma^2 / n
The estimated variance of the sample mean is 0.589.
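For reference, here is a minimal sketch computing the theoretical values implied by lambda = 0.2 (the variable names below are illustrative, not from the original code):

theoreticalMean <- 1 / lambda ## 1 / 0.2 = 5
theoreticalVar <- 1 / lambda^2 ## 1 / 0.2^2 = 25
theoreticalMeanVar <- theoreticalVar / n ## 25 / 40 = 0.625, Var(sample mean)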
However, those results concern a single sample’s mean and variance. What about the population? Repeating the sampling many times (iterating) will get us closer and closer to the population’s mean and variance. This process of repeated sampling is called simulation. A simulation of 1000 iterations, each with a sample size of 40 numbers, can be accomplished by the following code:
sampleAvg <- numeric(1000) ## preallocates the result vector
for (i in 1:1000) { ## simulates 1000 times
    de <- rexp(40, lambda) ## generates 40 randomized numbers
    sampleAvg[i] <- mean(de) ## stores the average of the 40 numbers
}
meansAvg <- mean(sampleAvg) ## calculates the average of the averages
meansVar <- var(sampleAvg) ## calculates the variance of the averages
hist(sampleAvg,
     xlim = c(0, 10),
     ylim = c(0, 250),
     main = "Sample Means Histogram of a NED",
     xlab = "data",
     ylab = "frequency") ## plots the histogram
The calculated mean of the sample means is 4.989 and the variance of the sample means is 0.612.
The Central Limit Theorem (CLT) states that “the distribution of averages of independent and identically distributed (iid) variables becomes that of a standard normal as the sample size increases” (Caffo, 2015).
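In symbols, for iid variables with mean \(\mu\) and variance \(\sigma^2\), the standardized sample mean converges in distribution to a standard normal:

\[
\frac{\bar X_n - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} N(0, 1) \quad \text{as } n \to \infty
\]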
The theoretical mean is \(\mu = 1/\lambda = 5\). The mean of the 1000 simulated sample means, 4.989, is very close to \(\mu\), and the single calculated sample mean (\(\bar X = 4.860\)) is also close to it.
The theoretical variance of the sample mean is \(\sigma^2/n = (1/\lambda)^2/40 = 0.625\). The variance of the 1000 simulated sample means, 0.612, is close to this value, and the single-sample estimate \(S^2/n = 0.589\) is close to it as well.
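These comparisons can be checked numerically in R (theoreticalMean and theoreticalMeanVar are the illustrative names from the earlier sketch):

c(simulated = meansAvg, theoretical = theoreticalMean) ## 4.989 vs 5
c(simulated = meansVar, theoretical = theoreticalMeanVar) ## 0.612 vs 0.625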
Let’s simulate the effect of the sample size on the distribution of the averages in order to check whether the means of many samples of size n converge to a normal distribution. The code below plots three histograms, each built from 1000 iterations of the calculated sample mean, covering three different sample sizes: n = 10, n = 20, and n = 30.
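The original code chunk is not shown here; a minimal sketch consistent with the description above (the panel layout and axis limits are assumptions of this sketch) could be:

par(mfrow = c(1, 3)) ## three panels side by side
for (size in c(10, 20, 30)) {
    avgs <- numeric(1000) ## preallocates the result vector
    for (i in 1:1000) {
        avgs[i] <- mean(rexp(size, lambda)) ## mean of 'size' exponentials
    }
    hist(avgs,
         xlim = c(0, 10),
         main = paste("n =", size),
         xlab = "sample mean",
         ylab = "frequency")
}
par(mfrow = c(1, 1)) ## restores the default layout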
As the figure shows, increasing the sample size makes the graph approach a bell shape. That is indicative of a normal distribution with mean 5.000 and standard deviation 0.768.
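As an additional check (not part of the original analysis), a normal Q-Q plot of the simulated sample means should lie close to a straight line if the CLT approximation holds:

qqnorm(sampleAvg) ## quantile-quantile plot against the normal distribution
qqline(sampleAvg) ## reference line through the quartiles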