In this project you will investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda. Set lambda = 0.2 for all of the simulations. You will investigate the distribution of averages of 40 exponentials. Note that you will need to do a thousand simulations.
The code in the Appendix simulates 1000 sets of 40 samples each from a exponential distribution with lambda rate of 0.2 using the sapply function. It is collected to a tbl_df format for use with dplyr package later on. Also means of each simulation run of 40 samples are calculated.
Here is the summary of values from our simulation
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00004 1.45800 3.52700 5.02000 7.00100 48.25000
When comparing the results from simulation to the theoretical values. We can see that they are close to each other. For example, theoretical variance is 25 and we get about 24.55 from our simulation
## Mean SD Variance
## Simulated 5.02020855 4.95487795 24.5508155
## Theoretical 5.00000000 5.00000000 25.0000000
## Difference 0.02020855 -0.04512205 -0.4491845
Distribution of simulated values from the expotential distribution, indeed looks like an right-skewed exponential distribution. This is shown in the first plot of the Appendix.
When we calculate sample means and plot them, their distribution looks like a normal distribution as it should. This can be seen in the second plot of the Appendix. There the red line shows the mean of means and the green dashed line is the theoretical mean (5).
In our simulation, the differences of the sample statistics and theoretical values are marginal. Here the distribution of samples is a good estimate for the theoretical exponential distribution.
# Libraries and options
library(knitr)
library(dplyr)
library(ggplot2)
# Simulate data
set.seed(5678)
n <- 40
lambda <- 0.2
nSimulations <- 1000
dt <- tbl_df(data.frame(group = as.factor(rep(1:nSimulations, each = n)))) %>%
mutate(x = as.numeric(sapply(1 : nSimulations, function(x) rexp(n, lambda))))
# Calculate means per run
dtMeans <- dt %>%
group_by(group) %>%
summarise(mean = mean(x))
# Compare mean, standard deviation and variance of sample means to theoretical values
theoreticalMean <- 1 / lambda
xMean <- mean(dt$x)
meanDiff <- xMean - theoreticalMean
theoreticalSD <- 1 / lambda
xSD <- sd(dt$x)
sdDiff <- xSD - theoreticalSD
theoreticalVar <- 1 / lambda^2
xVar <- var(dt$x)
varDiff <- xVar - theoreticalVar
dfComparison <- data.frame(Mean = c(xMean, theoreticalMean, meanDiff),
SD = c(xSD, theoreticalSD, sdDiff),
Variance = c(xVar, theoreticalVar, varDiff),
row.names = c('Simulated', 'Theoretical', 'Difference'))
## Histogram of x
plot1 <- ggplot(dt, aes(x)) +
geom_histogram(binwidth = 0.1) +
labs(title = 'Distribution of All Samples')
## Histogram of sample means
plot2 <- ggplot(dtMeans, aes(mean)) +
geom_histogram(binwidth = 0.15) +
geom_vline(aes(xintercept = mean(dtMeans$mean)), colour = 'red') +
geom_vline(aes(xintercept = theoreticalMean), colour = 'green', linetype = 'dashed') +
labs(title = 'Distribution of Sample Means')