In this analysis I investigate the exponential distribution in R and compare the distribution of its sample means with what the Central Limit Theorem (CLT) predicts. The exponential distribution can be simulated in R with rexp(n, lambda), where lambda is the rate parameter. The mean of the exponential distribution is 1/lambda and the standard deviation is also 1/lambda. I have set lambda = 0.2 for all of the simulations.
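As a quick sanity check on these two facts, the sketch below draws one large sample and compares its mean and standard deviation with 1/lambda. The seed and the sample size of 100000 are arbitrary choices made for this illustration and are not part of the analysis.

```r
# Illustrative check only: the seed and sample size are arbitrary assumptions
set.seed(42)
lambda <- 0.2
x <- rexp(100000, rate = lambda)

# Both the sample mean and the sample standard deviation should be close to 1/lambda = 5
check <- c(theoretical = 1 / lambda, sample_mean = mean(x), sample_sd = sd(x))
print(check)
```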
set.seed(1)
library(ggplot2)
expmns = NULL
Creating a distribution of 1000 averages, each of 40 exponentials
for (i in 1 : 1000) expmns = c(expmns, mean(rexp(40,0.2)))
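Growing a vector inside a loop works, but it copies the vector on every iteration. One possible alternative, sketched here with hypothetical names (nosim, n, draws, expmns_alt) that do not appear in the analysis, draws all values at once and averages each row of a matrix:

```r
# Sketch of a vectorised alternative; all variable names are illustrative
set.seed(1)
nosim <- 1000   # number of simulated averages
n <- 40         # exponentials per average
lambda <- 0.2

# One row per simulation; byrow = TRUE puts 40 consecutive draws in each row
draws <- matrix(rexp(nosim * n, rate = lambda), nrow = nosim, ncol = n, byrow = TRUE)
expmns_alt <- rowMeans(draws)

length(expmns_alt)
mean(expmns_alt)
```

The result is again a vector of 1000 averages, so it could be passed to the same plotting code.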
Plotting the histogram of the distribution
g <- ggplot(as.data.frame(expmns), aes(x = expmns))
g <- g + geom_histogram(fill = "salmon", aes(y = ..density..), colour = "black")
g <- g + geom_vline(xintercept = 5, colour="red")+ geom_vline(xintercept = mean(expmns), colour = "blue")
g <- g + geom_density(size = 2)
g
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In the above plot, the theoretical mean is drawn as a red vertical line and the sample mean as a blue one. The lines overlap because the two values are nearly the same. To be precise, let's compute the mean of the exponential distribution as 1/(rate parameter).
1/0.2
## [1] 5
And also get the mean of the simulated averages.
mean(expmns)
## [1] 4.990025
This shows that the sample mean and the theoretical mean are very close.
The theoretical variance of the average of 40 exponentials is given by,
((1/0.2)/sqrt(40))^2
## [1] 0.625
whereas the variance of the simulated averages is given by,
var(expmns)
## [1] 0.6111165
We can conclude that the two variances are very similar.
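The value 0.625 used above follows from the variance of a mean of independent draws. A short derivation, writing n = 40 for the number of exponentials per average, sigma = 1/lambda for their standard deviation, and X-bar for the average (symbols introduced here for illustration):

```latex
\operatorname{Var}(\bar{X})
  = \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
  = \frac{1}{n^{2}}\sum_{i=1}^{n}\operatorname{Var}(X_i)
  = \frac{\sigma^{2}}{n}
  = \frac{(1/\lambda)^{2}}{n}
  = \frac{(1/0.2)^{2}}{40}
  = \frac{25}{40}
  = 0.625
```

The second equality uses the independence of the draws, which is what lets the variances add.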
From the initial plot, the density line shows that the distribution is slightly right skewed (the longer tail is on the right, as expected for averages of exponentials), but it is approximately normal.
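One way to put a number on the skew is the moment-based sample skewness, and a normal Q-Q plot gives a visual check. The sketch below regenerates a set of averages under its own arbitrary seed so that it is self-contained; the names means_chk and skew are illustrative. The comparison value 2/sqrt(n) relies on the fact that an average of n IID exponentials follows a gamma distribution with shape n, whose skewness is 2/sqrt(n).

```r
# Illustrative skewness check; seed and variable names are assumptions
set.seed(2)
n <- 40
lambda <- 0.2
means_chk <- replicate(1000, mean(rexp(n, rate = lambda)))

# Moment-based sample skewness (base R has no built-in skewness function)
skew <- mean((means_chk - mean(means_chk))^3) / sd(means_chk)^3
skew
2 / sqrt(n)   # theoretical skewness of the average of n exponentials

# Points bending above the line in the upper tail indicate right skew
qqnorm(means_chk)
qqline(means_chk, col = "red")
```

A positive value of skew indicates a longer right tail; the theoretical value shrinks toward 0 as n grows, which is consistent with the CLT.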
To better analyze the CLT, let's create the distribution of a large collection of random exponentials and compare it with the distribution of a large collection of averages of 40 exponentials.
Plot of distribution of a large collection of random exponentials
two<-NULL
for (i in 1 : 1000) two = c(two, rexp(40,0.2))
g <- ggplot(as.data.frame(two), aes(x = two))
g <- g + geom_histogram(fill = "salmon", aes(y = ..density..), colour = "black")
g <- g + geom_vline(xintercept = mean(two), colour = "blue")
g <- g + geom_density(size = 2)
g
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Plot of distribution of a large collection of averages of 40 exponentials.
largemns<-NULL
for (i in 1 : 10000) largemns = c(largemns, mean(rexp(40,0.2)))
g <- ggplot(as.data.frame(largemns), aes(x = largemns))
g <- g + geom_histogram(fill = "salmon", aes(y = ..density..), colour = "black")
g <- g + geom_vline(xintercept = 5, colour="red")+ geom_vline(xintercept = mean(largemns), colour = "blue")
g <- g + geom_density(size = 2)
g
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
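This demonstrates the Central Limit Theorem: the distribution of averages of independent and identically distributed (IID) variables approaches a normal distribution as the sample size increases. Equivalently, the standardized average approaches a standard normal. The raw exponentials are strongly right skewed, while their averages are roughly bell-shaped and centred on the theoretical mean of 5.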
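To make the comparison with the standard normal explicit, the averages can be standardized by subtracting the theoretical mean 1/lambda and dividing by the standard error (1/lambda)/sqrt(n). A minimal sketch, with an arbitrary seed and illustrative names (means_clt, z, summary_z):

```r
# Illustrative standardization; seed and variable names are assumptions
set.seed(3)
n <- 40
lambda <- 0.2
mu <- 1 / lambda                # theoretical mean of one exponential
se <- (1 / lambda) / sqrt(n)    # standard error of the average of n exponentials

means_clt <- replicate(10000, mean(rexp(n, rate = lambda)))
z <- (means_clt - mu) / se

# For a standard normal these would be 0, 1, and about 0.95
summary_z <- c(mean = mean(z), sd = sd(z), within_1.96 = mean(abs(z) < 1.96))
print(summary_z)
```

If the normal approximation is reasonable for n = 40, the three values should land near their standard normal counterparts.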