In this project, we investigate the exponential distribution in R and compare the distribution of its sample means with what the Central Limit Theorem (CLT) predicts.
The exponential distribution can be simulated in R with rexp(n, λ), where \(\lambda\) is the rate parameter. The mean and the standard deviation of the exponential distribution are both \(1/\lambda\). For this project, we will simulate \(40\) exponential values with \(\lambda=0.2\).
We know that the theoretical mean, \(\mu\), of the exponential distribution is \(1/\lambda\), or \(1/0.2=5.0\). From the Law of Large Numbers, we also know that the sample mean, \(\bar X\), converges to \(\mu\) as the sample size grows.
Therefore, to establish this comparison, we will simulate the following:
We will see that the most frequently occurring values of the sample mean are close to the theoretical mean.
For this project, we will plot a histogram of all the calculated means and overlay a vertical line at \(x=5.0\). We will see that the peak of the histogram is close to this line.
As we can see, the thick vertical line passes through \(5.0\), and the peak of the histogram is very close to it.
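The comparison described above can be sketched roughly as follows. This is a minimal illustration, not necessarily the code behind the original plot: the variable names (`lambda`, `n_exp`, `n_sims`, `sample_means`) and the seed are assumptions made for the example, and the figure of \(1000\) simulations is taken from the value quoted later in this report.

```r
set.seed(1)          # arbitrary seed, used only to make the sketch reproducible
lambda <- 0.2        # rate parameter used throughout the report
n_exp  <- 40         # exponentials per simulation
n_sims <- 1000       # number of simulations

# One sample mean per simulation
sample_means <- replicate(n_sims, mean(rexp(n_exp, rate = lambda)))

theoretical_mean <- 1 / lambda

# Histogram of the simulated means with a thick vertical line at the theoretical mean
hist(sample_means, breaks = 30,
     main = "Means of 40 exponentials", xlab = "Sample mean")
abline(v = theoretical_mean, lwd = 3)

# Numerical comparison of the simulated and theoretical centers
c(simulated = mean(sample_means), theoretical = theoretical_mean)
```

The average of the simulated means should land close to \(5.0\), which is the numerical counterpart of the histogram peaking near the vertical line.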
We know that the theoretical variance, \(\sigma ^2\), of the exponential distribution is \(1/\lambda ^2\), or \(1/(0.2^2)=25.0\).
Therefore, to establish this comparison, we will simulate the following:
We will see that the sample variance is close to the theoretical variance.
For this project, we will plot the ratio of the sample variance to the theoretical variance. The sample variance is calculated as \(var(rexp(k,λ))\), where \(k\) is the sample size (\(40\) in this case) and \(\lambda =0.2\). The theoretical variance is \(\sigma ^2 = 1/\lambda ^2\).
To understand the plot better, let us first do a dry run with only one simulation but a large sample size.
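# Set the rate of the exponential distribution, λ, to 0.2
λ <- 0.2
# The population standard deviation of the exponential distribution is 1/λ
σ <- 1 / λ
# Set the number of exponentials (large, for the dry run)
k <- 40000
# Draw k exponentials and compute their sample variance
expSampleVector <- rexp(k, λ)
expSampleVariance <- var(expSampleVector)
# σ is the population standard deviation, so σ^2 is the variance
expTheortVariance <- σ^2
# Ratio of sample variance to theoretical variance; should be close to 1
expSampleVariance / expTheortVariance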
## [1] 1.006607
The following plot essentially repeats the code shown above with a smaller sample size and a large number of simulations.
As we can see, the peak of the histogram is around \(1.0\), i.e., the sample variance is close to the theoretical variance.
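A rough sketch of how the ratio could be simulated is given here. It assumes \(1000\) simulations of \(40\) exponentials each, and the names `variance_ratios` and `theoretical_var` are illustrative rather than taken from the original code.

```r
set.seed(2)          # arbitrary seed, used only to make the sketch reproducible
lambda <- 0.2        # rate parameter
n_exp  <- 40         # exponentials per simulation
n_sims <- 1000       # number of simulations (assumed)

theoretical_var <- (1 / lambda)^2   # sigma^2 = 1 / lambda^2 = 25

# Ratio of sample variance to theoretical variance, one value per simulation
variance_ratios <- replicate(
  n_sims,
  var(rexp(n_exp, rate = lambda)) / theoretical_var
)

# Histogram of the ratios with a reference line at 1
hist(variance_ratios, breaks = 30,
     main = "Sample variance / theoretical variance", xlab = "Ratio")
abline(v = 1, lwd = 3)

# Average ratio across all simulations
mean(variance_ratios)
```

With only \(40\) values per sample, individual ratios scatter widely and the histogram is right-skewed, so the average ratio is a steadier check against \(1.0\) than the location of the tallest bar.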
In the real world, exponential distributions come up when we look at a series of events and measure the times between events, which are called interarrival times. If the events are equally likely to occur at any time, the distribution of interarrival times tends to look like an exponential distribution.
The Cumulative Distribution Function (CDF), or distribution function, is given by \[y = 1 - \exp(-\lambda x)\]
(Source: http://greenteapress.com/thinkstats/html/thinkstats005.html)
Since this project uses \(\lambda = 0.2\), a sample size of \(40\) and number of simulations as \(1000\), let us plot the CDF for \(40000\) values.
This plot appears as a straight line here because of the large number of values. For a more realistic example, consider the plot of inter-arrival times of births (inter-arrival birth times).
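As a sketch of the relationship between the formula and simulated data, the empirical CDF of \(40000\) exponentials can be compared with \(1-\exp(-\lambda x)\). The grid of \(x\) values and the names `empirical_cdf`, `x_grid`, and `theoretical_cdf` are assumptions made for this illustration, and the plot is drawn on ordinary linear axes, which may differ from the original figure.

```r
set.seed(3)          # arbitrary seed, used only to make the sketch reproducible
lambda <- 0.2        # rate parameter
x <- rexp(40000, rate = lambda)

# Empirical CDF of the simulated values (ecdf() returns a function)
empirical_cdf <- ecdf(x)

# Theoretical CDF evaluated on an illustrative grid
x_grid <- seq(0, 40, by = 0.5)
theoretical_cdf <- 1 - exp(-lambda * x_grid)   # same values as pexp(x_grid, rate = lambda)

# Empirical CDF with the theoretical curve overlaid
plot(empirical_cdf, do.points = FALSE,
     main = "CDF of the exponential distribution", xlab = "x", ylab = "CDF")
lines(x_grid, theoretical_cdf, col = "red", lty = 2)

# Largest gap between empirical and theoretical CDF on the grid
max(abs(empirical_cdf(x_grid) - theoretical_cdf))
```

With this many values, the two curves should be nearly indistinguishable, so the largest gap on the grid is expected to be small.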
The Central Limit Theorem says that:
The distribution of averages of IID variables, properly normalized, becomes that of a standard normal as the sample size increases.
So, if we take the means of variables drawn randomly from an exponential distribution and repeat this over and over again, the distribution of those means should resemble a normal distribution. Specifically, if we
then our distribution will converge to that of a standard normal distribution.
Formally, \[\frac{\bar X_n - \mu}{\sigma / \sqrt{n}}= \frac{\sqrt n (\bar X_n - \mu)}{\sigma} = \frac{\mbox{Estimate} - \mbox{Mean of estimate}}{\mbox{Std. Err. of estimate}}\] has a distribution like that of a standard normal for large \(n\).
As seen in the plot, the blue line displays the standard normal curve and the histogram shows how the values converge around the peak of the standard normal curve.
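The normalization in the formula above can be sketched as follows. This is an illustrative reconstruction, assuming \(1000\) simulations of \(40\) exponentials; the names `sample_means` and `z` are not from the original code.

```r
set.seed(4)          # arbitrary seed, used only to make the sketch reproducible
lambda <- 0.2        # rate parameter
n_exp  <- 40         # exponentials per simulation
n_sims <- 1000       # number of simulations (assumed)

mu    <- 1 / lambda  # theoretical mean
sigma <- 1 / lambda  # theoretical standard deviation

# One sample mean per simulation
sample_means <- replicate(n_sims, mean(rexp(n_exp, rate = lambda)))

# Normalize: (estimate - mean of estimate) / (standard error of estimate)
z <- (sample_means - mu) / (sigma / sqrt(n_exp))

# Density histogram of the normalized means with the standard normal curve in blue
hist(z, breaks = 30, freq = FALSE,
     main = "Normalized means of 40 exponentials", xlab = "z")
curve(dnorm(x), add = TRUE, col = "blue", lwd = 2)

# If the CLT approximation holds, these should be near 0 and 1
c(mean = mean(z), sd = sd(z))
```

Setting `freq = FALSE` puts the histogram on a density scale, so it can be compared directly with `dnorm`.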
To show the convergence better, here is a plot with a larger sample size, \(k=500\) exponentials per simulation, and more simulations, \(i=8000\).
Better!