Relating the sample mean from an exponential distribution to the population mean

Overview

In this report we explore the relationship between the sample mean and the population mean of an exponential distribution. The parameters and statistics of interest are the mean, the variance and, in a broader sense, the shape of the distribution. We run 1000 simulations, each consisting of 40 observations.

Simulations

We create a vector of length 1000 and fill it with the sample means of 1000 samples, each of size 40. The distribution from which we draw the samples is exponential with rate parameter lambda = 0.2.

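set.seed(2024)  # arbitrary seed, added only so that the simulation is reproducible
# 1000 sample means, each computed from 40 exponential draws with rate 0.2
sample_means <- replicate(1000, mean(rexp(40, rate = 0.2)))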

Population Mean versus Sample Mean

An exponential distribution with rate parameter equal to 0.2 is characterized by expected value (mean) equal to 5. In other words, E(x) = 1/rate = 1/0.2 = 5.

hist(rexp(1000, rate = 0.2), main = "Exponential population distribution of random variable x", xlab = "x ~ Exp(rate = 0.2), E(x) = 1/rate, V(x) = 1/rate^2")
abline(v = 1/0.2, col = "red", lwd = 3)
legend("topright", legend = "Mean of x = 1/rate", text.col = "red")

The sample mean (Xbar) of samples drawn from an exponential distribution (rate = 0.2) has an expected value equal to the population mean. In other words, the sampling distribution of the sample mean is centred on the population mean. In mathematical notation, E(Xbar) = E(x) = 1/rate for x ~ Exp(rate = 0.2).
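The equality follows from linearity of expectation. A short derivation is sketched here, writing \lambda for the rate and x_1, ..., x_n for the n = 40 observations in one sample (this notation is introduced only for illustration):

```latex
\begin{aligned}
E(\bar{X})
  &= E\!\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)
   = \frac{1}{n}\sum_{i=1}^{n} E(x_i)
   = \frac{1}{n}\cdot n\cdot\frac{1}{\lambda}
   = \frac{1}{\lambda}
   = \frac{1}{0.2}
   = 5 .
\end{aligned}
```

Note that this step does not require the observations to be independent, only that each has mean 1/\lambda.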

hist(sample_means, main = "Sampling distribution of sample mean Xbar", xlab = "Xbar ~ Norm(5, 1/1.6), E(xbar) = 1/rate, V(xbar) = 1/(n*rate^2)")
abline(v = 1/0.2, col = "red", lwd = 3)
legend("topright", legend = "Mean of Xbar = 1/rate", text.col = "red")

Because the expected values of Xbar and x are equal, both graphs show a red vertical line at 5. It marks the centre (the mean) of each distribution.
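The agreement between the two means can also be checked numerically. The following is a minimal sketch that repeats the simulation under an arbitrary seed and compares the average of the simulated sample means with 1/rate. The variable names (sim_means, theoretical_mean, empirical_mean) and the seed are assumptions made for this illustration.

```r
set.seed(42)  # arbitrary seed, assumed here only for reproducibility

rate   <- 0.2    # rate parameter of the exponential distribution
n      <- 40     # observations per sample
n_sims <- 1000   # number of simulated samples

# One sample mean per simulated sample
sim_means <- replicate(n_sims, mean(rexp(n, rate = rate)))

theoretical_mean <- 1 / rate
empirical_mean   <- mean(sim_means)

cat("Theoretical mean E(x) = 1/rate:  ", theoretical_mean, "\n")
cat("Mean of simulated sample means:  ", round(empirical_mean, 3), "\n")
```

The two numbers should be close but not identical, because the empirical value is itself subject to simulation error.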

Population Variance versus Variance of the Sample Mean

An exponentially distributed random variable x has variance equal to 1/rate^2. In other words, V(x) = [E(x)]^2. In this example the rate parameter is 0.2, so the population variance is 1/0.2^2 = 25, and the sample variance of the 1000 simulated observations plotted below should be close to that value.

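boxplot(rexp(1000, rate = 0.2), ylim = c(0, 40), main = "Exponential population distribution of random variable x", xlab = "x ~ Exp(rate = 0.2), E(x) = 1/rate, V(x) = 1/rate^2")
# The variance (25) is in squared units, so it cannot be marked on the axis of x; mark mean +/- 1 standard deviation instead, where SD = sqrt(V(x)) = 1/rate
abline(h = 1/0.2 + c(-1, 1) * sqrt(1/(0.2*0.2)), col = "red", lwd = 3)
legend("topright", legend = "Mean of x +/- 1 SD, SD = sqrt(1/rate^2)", text.col = "red")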
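The value V(x) = 1/rate^2 = 25 can be checked against simulated data. The sketch below uses a larger number of draws than the plot (100000, an assumption chosen so that the estimate is stable) and an arbitrary seed. The names draws, theoretical_var and empirical_var are introduced only for this illustration.

```r
set.seed(42)  # arbitrary seed, assumed here only for reproducibility

rate  <- 0.2
draws <- rexp(100000, rate = rate)  # many draws, so the estimate is stable

theoretical_var <- 1 / rate^2   # population variance of Exp(rate)
empirical_var   <- var(draws)   # sample variance of the simulated draws

cat("Theoretical variance 1/rate^2: ", theoretical_var, "\n")
cat("Sample variance of the draws:  ", round(empirical_var, 2), "\n")
```

With only 1000 draws, as in the plot, the sample variance fluctuates noticeably more around 25.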

Just as with the expected values of the population and the sampling distributions, there is a theoretically known link between the variances. For independent observations, the variance of the sample mean is equal to the population variance divided by the sample size. In mathematical notation, V(Xbar) = V(x)/n. The bigger the sample size, the smaller the uncertainty in the sample mean.
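The relationship can be derived in a few steps and evaluated for this report's settings. In the sketch below, \lambda denotes the rate and x_1, ..., x_n the n = 40 observations of one sample (notation introduced for illustration). The second equality relies on the observations being independent.

```latex
\begin{aligned}
V(\bar{X})
  &= V\!\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)
   = \frac{1}{n^{2}}\sum_{i=1}^{n} V(x_i)
   = \frac{V(x)}{n}
   = \frac{1}{n\lambda^{2}} \\
  &= \frac{25}{40}
   = 0.625,
\qquad
\sqrt{V(\bar{X})} \approx 0.79 .
\end{aligned}
```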

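boxplot(sample_means, ylim = c(0,40), main = "Sampling distribution of sample mean Xbar", xlab = "Xbar ~ Norm(5, 1/1.6), E(xbar) = 1/rate, V(xbar) = 1/(n*rate^2)")
# The variance (0.625) is in squared units, so it cannot be marked on the axis of Xbar; mark mean +/- 1 standard deviation instead, where SD = sqrt(1/(n*rate^2))
abline(h = 1/0.2 + c(-1, 1) * sqrt(1/(40*0.2*0.2)), col = "red", lwd = 3)
legend("topright", legend = "Mean of Xbar +/- 1 SD, SD = sqrt(1/(n*rate^2))", text.col = "red")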
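A numerical comparison makes the relationship V(Xbar) = V(x)/n concrete. This is a minimal sketch that repeats the simulation under an arbitrary seed. The names sim_means, theoretical_var_mean and empirical_var_mean are assumptions made for this illustration.

```r
set.seed(42)  # arbitrary seed, assumed here only for reproducibility

rate   <- 0.2
n      <- 40
n_sims <- 1000

# One sample mean per simulated sample
sim_means <- replicate(n_sims, mean(rexp(n, rate = rate)))

theoretical_var_mean <- 1 / (n * rate^2)  # V(x) / n
empirical_var_mean   <- var(sim_means)    # variance across simulated means

cat("Theoretical V(Xbar) = 1/(n*rate^2):", theoretical_var_mean, "\n")
cat("Variance of simulated sample means:", round(empirical_var_mean, 3), "\n")
```

The empirical variance should be far smaller than the population variance of 25 and close to 25/40.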

Distribution

To show that the sampling distribution of the sample mean is approximately normal, even though the underlying distribution is exponential (non-normal), we overlay the histogram with a normal curve whose mean and standard deviation are estimated from the simulated sample means.

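x <- sample_means
h <- hist(x, breaks = 10, xlim = c(2, 8), ylim = c(0, 260),
          xlab = "Xbar ~ Norm(5, 1/1.6), E(xbar) = 1/rate, V(xbar) = 1/(n*rate^2)",
          main = "Sampling distribution of sample mean Xbar \n with Normal curve overlaid")
# Normal density evaluated on a grid spanning the observed sample means
xfit <- seq(min(x), max(x), length.out = 40)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))
# Rescale the density to the count scale: bin width times number of observations
yfit <- yfit * diff(h$mids[1:2]) * length(x)
lines(xfit, yfit, col = "red", lwd = 3)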

The Central Limit Theorem states that, for independent observations from a distribution with finite variance, the distribution of the sample mean approaches a normal distribution as the sample size grows. The larger the sample, the closer the sampling distribution is to normal. Some textbook authors say that a sample size of just over 30 is enough to achieve this result; here we had samples of 40. Drawing many samples (1000) matters for a different reason: it does not make the sampling distribution itself more normal, but it makes the histogram a smoother and more reliable picture of that distribution.
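The effect of the sample size on normality can be illustrated with skewness, which is 0 for a normal distribution. For the mean of n independent exponential observations the theoretical skewness is 2/sqrt(n), so it shrinks as n grows. The sketch below estimates it by simulation. The sample sizes 5, 40 and 200, the seed, and the helper function skewness are assumptions made for this illustration (base R has no built-in skewness function).

```r
set.seed(42)  # arbitrary seed, assumed here only for reproducibility

rate   <- 0.2
n_sims <- 1000

# Sample skewness: third central moment divided by the variance to the power 1.5
skewness <- function(v) {
  centred <- v - mean(v)
  mean(centred^3) / mean(centred^2)^1.5
}

sizes <- c(5, 40, 200)  # illustrative sample sizes

# For each sample size, simulate n_sims sample means and measure their skewness
skew_by_n <- sapply(sizes, function(n) {
  sim_means <- replicate(n_sims, mean(rexp(n, rate = rate)))
  skewness(sim_means)
})

result <- data.frame(n           = sizes,
                     simulated   = round(skew_by_n, 2),
                     theoretical = round(2 / sqrt(sizes), 2))
print(result)
```

The simulated skewness should fall toward 0 as n increases, while changing n_sims only affects how noisy the estimates are.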