The objective of this exercise is to investigate the exponential distribution in R and compare the distribution of its sample means with what the Central Limit Theorem predicts. We will compare the simulated mean, variance and distribution shape of the sample means against their theoretical counterparts.
Before we start the simulation, we shall determine the corresponding parameters of the distribution. We will set the random seed to 1000, lambda = 0.2, n = 40 (the size of each sample) and the number of simulations to 1000.
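For reference, the theoretical values used later follow from the standard properties of the exponential distribution with rate $\lambda$. The symbols $X$ and $f$ below are notation introduced here for illustration, not taken from the original text.

```latex
f(x;\lambda) = \lambda e^{-\lambda x}, \quad x \ge 0,
\qquad
\mathbb{E}[X] = \frac{1}{\lambda} = \frac{1}{0.2} = 5,
\qquad
\operatorname{Var}(X) = \frac{1}{\lambda^{2}} = 25 .
```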
The sample matrix is generated as below.
# parameter of the simulation
lambda = 0.2
n = 40
n.sim = 1000
set.seed(1000)
x = rexp(n.sim * n, rate = lambda)
sim = matrix(data = x, nrow = n.sim, ncol = n)
After that, the sample mean of each simulation is calculated.
# calculate the mean of each simulated sample (one sample per row)
sim.mean = apply(sim, 1, mean)
head(sim.mean)
## [1] 4.450697 6.105520 4.933228 5.329610 4.989080 7.080864
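The row-wise `apply()` call can also be written with `rowMeans()`, which is the more common base R idiom for this operation. The sketch below regenerates the matrix with the same seed and parameters and checks that both approaches agree; `means.apply` and `means.row` are names introduced here for illustration.

```r
# Parameters and seed copied from the simulation above
lambda <- 0.2
n <- 40
n.sim <- 1000
set.seed(1000)
sim <- matrix(rexp(n.sim * n, rate = lambda), nrow = n.sim, ncol = n)

# Row means computed two ways: generic apply() and the dedicated rowMeans()
means.apply <- apply(sim, 1, mean)
means.row <- rowMeans(sim)

# Compare the two results up to floating-point tolerance
all.equal(means.apply, means.row)

# One mean per simulation
length(means.row)
```

Because the draws are independent and identically distributed, the column-wise fill order used by `matrix()` does not affect the analysis.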
Now compare the two means and visualise them with a histogram.
t.mean = 1/lambda
sampling.mean = mean(sim.mean)
## [1] "Theoretical mean: 5"
## [1] "Simulation mean: 4.99"
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From the histogram of the sampling distribution, we can see that the distribution is approximately centered at 5. The reference lines further show that the mean of the sampling distribution (4.99) is a good approximation of the theoretical mean (5).
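Whether a gap of about 0.01 is small can be judged against the Monte Carlo error of the simulated mean. The following is a rough sketch, assuming the 1000 sample means are independent and using the rounded value 4.99 reported above; `mc.se` and `gap` are names introduced here for illustration.

```r
lambda <- 0.2
n <- 40
n.sim <- 1000

t.mean <- 1 / lambda            # theoretical mean
t.var <- (1 / lambda)^2 / n     # theoretical variance of one sample mean

# Standard error of the average of n.sim independent sample means
mc.se <- sqrt(t.var / n.sim)

# Gap between the reported simulation mean (rounded to 4.99) and the theoretical mean
gap <- abs(4.99 - t.mean)

mc.se
gap / mc.se   # a ratio well below 2 indicates ordinary simulation noise
```

Here `mc.se` works out to 0.025, so the observed gap is less than one standard error.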
Calculate the respective variances of the sampling and theoretical distributions.
# calculate the theoretical variance of the sample mean (the squared standard error)
t.var = (1/lambda)^2/n
sim.var = var(sim.mean)
## [1] "Theoretical variance: 0.625"
## [1] "Simulation variance: 0.6584"
Again, we can see that the variance of the simulated sample means (0.6584) is a fairly good approximation of the theoretical variance of the sample mean (0.625).
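The theoretical value can be traced through a short derivation, assuming the 40 draws in each sample are independent and identically distributed. $\bar{X}_n$ is notation introduced here for the mean of one sample.

```latex
\operatorname{Var}(\bar{X}_n) = \frac{\operatorname{Var}(X)}{n} = \frac{1}{\lambda^{2} n} = \frac{25}{40} = 0.625,
\qquad
\operatorname{SE}(\bar{X}_n) = \sqrt{0.625} \approx 0.79 .
```

The simulated variance of 0.6584 is roughly 5% above this value.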
Superimpose the theoretical distribution of the mean and the simulated distribution of the mean in one plot to compare their shapes.
From the above plot, we can see that the shapes of the two distributions are almost identical.
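One way to put a number on this agreement is to count how many simulated means fall inside the 95% limits of the normal approximation. The following is a minimal sketch that regenerates the sample with the same seed and parameters as above; `lower`, `upper` and `coverage` are names introduced here for illustration.

```r
# Parameters and seed copied from the simulation above
lambda <- 0.2
n <- 40
n.sim <- 1000
set.seed(1000)
sim <- matrix(rexp(n.sim * n, rate = lambda), nrow = n.sim, ncol = n)
sim.mean <- apply(sim, 1, mean)

# Normal approximation for the sample mean
t.mean <- 1 / lambda
t.var <- (1 / lambda)^2 / n

# 95% limits of the normal approximation
lower <- qnorm(0.025, t.mean, sqrt(t.var))
upper <- qnorm(0.975, t.mean, sqrt(t.var))

# Share of simulated means inside the limits; a value near 0.95 supports the approximation
coverage <- mean(sim.mean > lower & sim.mean < upper)
coverage
```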
We can further confirm this result with a quantile-quantile plot.
From the quantile plot, we observe that the quantiles of the simulated distribution fit the theoretical quantiles well, especially for data points close to the center of the distribution.
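A Q-Q plot of this kind typically departs slightly from the line in the tails, because the mean of n independent exponential draws follows exactly a gamma distribution with shape n and rate n * lambda, which is still mildly right-skewed at n = 40. The sketch below compares a few quantiles of that exact distribution with the normal approximation. This comparison is an addition for illustration, not part of the original analysis, and `probs`, `normal.q` and `gamma.q` are names introduced here.

```r
lambda <- 0.2
n <- 40

# Probabilities at which to compare the two distributions
probs <- c(0.01, 0.05, 0.5, 0.95, 0.99)

# Quantiles of the normal approximation to the sample mean
normal.q <- qnorm(probs, mean = 1 / lambda, sd = (1 / lambda) / sqrt(n))

# Quantiles of the exact distribution of the sample mean
gamma.q <- qgamma(probs, shape = n, rate = n * lambda)

# Side-by-side table; the differences are largest in the tails
round(data.frame(probs, normal.q, gamma.q, diff = gamma.q - normal.q), 3)
```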
From the above analysis, we can conclude that the results are consistent with the Central Limit Theorem (CLT), which holds regardless of the distribution of the underlying population. However, we should note that several assumptions must be made when applying the CLT.
The assumptions are as below:
library(ggplot2)
# Plot the histogram of simulated sample
g = ggplot(data = as.data.frame(sim.mean)) +
  geom_histogram(mapping = aes(x = sim.mean), fill = 'white', color = 'gray') +
  theme_bw()
# Add reference line to show the sampling mean and theoretical mean
g + geom_vline(aes(xintercept = sampling.mean, color = 'Simulated mean')) +
  geom_vline(aes(xintercept = t.mean, color = 'Theoretical mean')) +
  scale_color_manual(name = '', values = c('Simulated mean' = 'black', 'Theoretical mean' = 'red')) +
  labs(title = 'Figure 1\nHistogram of sampling distribution', x = 'Sampling mean', y = 'Frequency')
# Overlay the simulated density with the theoretical normal density and its 95% limits
ggplot() +
  geom_density(aes(x = sim.mean, color = 'Simulation')) +
  stat_function(aes(x = c(2, 8), color = 'Theoretical'), fun = dnorm, args = list(mean = t.mean, sd = sqrt(t.var))) +
  geom_vline(mapping = aes(xintercept = qnorm(.975, t.mean, sqrt(t.var)), linetype = '95% conf'), show.legend = FALSE) +
  geom_vline(mapping = aes(xintercept = qnorm(.025, t.mean, sqrt(t.var)), linetype = '95% conf')) +
  scale_color_manual(name = '', values = c('Simulation' = 'red', 'Theoretical' = 'black')) +
  scale_linetype_manual(name = '', values = c('95% conf' = 2)) +
  theme_bw() +
  labs(title = 'Figure 2\nDensity plot of Simulation Distribution and Theoretical Distribution', x = 'Mean', y = 'Density')
# Normal quantile-quantile plot of the simulated means with a reference line
qqnorm(sim.mean, main = 'Figure 3\nQuantile - Quantile Plot')
qqline(sim.mean, col = 2)