by Richard Bagnall July 2015
The exponential distribution will be investigated and compared with the central limit threorem. The exponential distribution will be simulated in R with rexp(n, lambda) where lambda (\(\lambda\)), the rate parameter, will be set to 0.2; the mean (\(\mu\)) and the standard deviation (\(\sigma\)) of the exponential distribution is 1/lambda. The distribution of 1000 averages of 40 random exponentials will be simulated.
Aim: Use R code to simulate 1000 means of 40 random exponentials. Set lambda to 0.2 for all simulations.
lambda <- 0.2 # set lambda to 0.2
n <- 40 # set number of random exponentials to 40
means <- NULL # create an empty vector for the mean values
set.seed(1) # set seed for reproducibility
# calculate the mean of 40 random exponentials, one thousand times.
for (i in 1:1000) {means <- c(means, mean(rexp(n, lambda)))}
summary(means)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.108 4.445 4.950 4.990 5.492 7.491
Question 1. Show the sample mean and compare it to the theoretical mean: The theoretical population mean of the exponential distribution is 1/lambda: \[ \begin{aligned} \text{E}[X]=\frac{1}\lambda\ = \frac{1}{0.2} = 5.0 \end{aligned} \]
The above equation tells us that the distribution of the original population is centred at 5.0, and the theoretical mean is also 5.0. Taking a random sample of the original population and calculating the mean will give us an estimate of the population mean (i.e. the sample mean \(\overline{X}\)); it is the centre of mass of the sample. Mathemetically calculating the sample mean of 1000 lots of 40 random exponentials from the simulations is simple using R:
round(mean(means),2) # round to two decimal places
## [1] 4.99
The sample mean is 4.99 and the centre of the distribution of 1000 sample means of 40 random exponentials is also 4.99 (see Figure 1: r code is shown in Appendix 1). The sample mean 4.99 is a good approximation of the population mean 5.0. Increasing the number of random exponentials in each set, or increasing the number of simulations, would bring the sample mean even closer to the theoretical mean, and concentrate the density of the sample mean around the population mean.
Figure 1. Histogram of 1000 means of 40 random exponentials. Blue vertical line shows the sample mean; Red verticle line shows the theoretical mean.
ANSWER TO QUESTION 1: The sample mean and centre of the sample distribution is 4.99 and the theoretical mean and centre of distribution is 5.0.
Question 2. Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution. The variance (\(\sigma^2\)) is a measure of the spread of a set of numbers, i.e. how far each number in the set is from the mean. The variance of the population is the square of the population standard deviation:
\[ \begin{aligned} \text{Var}(X)=\sigma^2 = \left(\frac{1}{\lambda}\right)^2 = \left(\frac{1}{0.2}\right)^2 = 25 \end{aligned} \]
The sample variance (\(S^2\)) estimates the population variance. The sample variance of 1000 means of 40 random exponentials is calculated by squaring the difference between each sample mean and the population mean, and dividing the sum of the squares by the number of values in the set minus 1.
\[ \begin{aligned} S^2=\frac{\sum_{i=1}\left(X_i-\bar{X}\right)^2}{n-1} \end{aligned} \]
In R code, the sample variance of the simulated 1000 means of 40 random exponentials is:
round(sum(c(means - 5.0)^2) / (1000 - 1), 3) # rounded to 3 decimals
## [1] 0.611
or more simply, var(means)
= 0.611.
The distribution of the variances of 1000 means of 40 random exponentials is shown in Figure 2 (r code is shown in Appendix 2). The variance of the sample mean, 0.611, estimates how variable the means of 40 random exponentials are and shows the precision of the estimate of the means of 40 random exponentials. The theoretical variance of the sample = \(\sigma^2\)/n = 25/40 = 0.625
Figure 2. Histogram of the Sample Variance. Blue vertical line shows the sample variance; Red verticle line shows the theoretical variance.
ANSWER TO QUESTION 2: The variance of the sample distrbution is 0.611 and the theoretical variance of the distribution is 0.625.
The distribution of random exponentials is, not surprisingly, exponentially distributed! However, the central limit theorem states that the distribution of averages of independent and identically distributed (iid) variables becomes that of a standard normal as the sample size increases.
\[ \begin{aligned} \frac{\text{Estimate}-\text{Mean of Estimate}}{\text{Std. Err. of estimate}} = \frac{\bar{X}_n -\mu}{\sigma / \sqrt{n}} \end{aligned} \]
Plots of random exponentials and means of random exponentials are shown in Figure 3 (r code is shown in Appendix 3).
Figure 3. Histogram of distribution of 1000 exponentials (left panel) and distribution of 1000 means of 40 exponentials (right panel). Blue lines show the sample distribution; red line shows a standard normal distribution.
It is clear that the distribution of the 1000 random exponentials is not normally distributed (left panel of Figure 3). However, the distribution of a large collection of means of 40 random exponentials is approximately normally distributed (right panel of Figure 3). This can be seen from the bell shaped curve, which is symmetric about its mean; also, the mean, 0, and the median, -0.01 are approximately the same.
ANSWER TO QUESTION 3. The distribution is approximately normal.
# par(mar=c(4.1, 4.1, 4.1, 2.1))
# plot histogram of means
hist(means, breaks=20, col="grey", border="grey",
cex.main=0.8, cex.axis=0.8, cex.lab=0.8,
main="Histogram of 1000 Means of 40 Exponentials",
xlab="Mean Value of 40 Exponentials", yaxt="n")
# format y axis
axis(2, at=c(0,20,40,60,80), labels=c("0","20","40","60","80"), las=2)
# highlight the centre of the sample mass
abline(v=mean(means), col="blue")
# highlight the theoretical centre
abline(v=5.0, col="red")
# compare the sample mean and theoretical mean
text(max(means)-0.3, 80.0, cex=0.8,
paste("Sample mean = ", round(mean(means), 4),
"\nTheoretical mean = 5.00"), pos = 2)
# theoretical mean
theoMean <- 1/lambda
# calculate the sample variance
sampleVariance <- sum(c(means - theoMean)^2) / (1000-1)
# calculate the theoretical variance
theoStdDev <- 1/lambda
# calculate the theoretical standard deviation
theoVariance <- (theoStdDev^2) / 40
# plot the variances of 1000 means of 40 random exponentials
hist(c(means-5)^2, breaks=50, cex.main=0.8, cex.lab=0.8,
main = "Variance of 1000 Means of 40 Exponentials",
xlab="Variance of the Sample Mean", yaxt="n")
# format the y axis
axis(2, at=c(0,100,200,300), labels=c("0","100","200","300"), las=2)
# highlight the sample variance
abline(v=sampleVariance, col="blue")
# highlight the theoretical variance
abline(v=0.625, col="red")
# compare the sample variance and theoretical variance
text(5, 250.0, paste("Sample variance = ", round(sampleVariance, 3),
"\nTheoretical variance = ", theoVariance), pos = 2)
suppressMessages(suppressWarnings(library(ggplot2)))
suppressMessages(suppressWarnings(library(gridExtra)))
# make data frame of 1000 random exponentials
dfrexp <- as.data.frame(rexp(1000, 0.2))
# plot 1000 random exponentials
exp_plot <- ggplot(dfrexp, aes(x=dfrexp[,1]))
p1 <-exp_plot + geom_histogram(aes(y=..density..), color="black", fill="grey")
+ geom_density(color="blue")
+ ggtitle("Distribution of Exponentials")
+ xlab("Exponent Value")
+ ylab("Frequency")
+ theme(plot.title = element_text(size = 10.0),
axis.title = element_text(size = 8.0))
# make data frame of CLT of distribution of means
df <- as.data.frame( c(means - 5.0)/(25/sqrt(40)) )
# plot CLT distributions
means_plot <- ggplot(df, aes(x=df[,1]))
p2 <- means_plot + geom_histogram(aes(y=..density..), binwidth=0.05, color="black", fill="grey")
+ geom_density(color="blue")
+ ggtitle("Distribution of Means of 40 Exponentials")
+ xlab("Exponentials CLT simulations")
+ ylab("Frequency")
+ theme(plot.title = element_text(size = 10.0), axis.title = element_text(size = 8.0))
# add standard normal
p3 <-p2 + stat_function(fun = dnorm, arg = list(mean = 0, sd = sd(df[,1])), color="red")
# panel plot
suppressMessages(grid.arrange(p1, p3, ncol=2))