In this report, we are going to explore the large sample distribution of sample means of a collection of independent and identically distributed (IID) observations from random exponentials, and willl show that the distribution will approach to standard normal distribution as the sample size increases by comparing it with a normal distribution based on Central Limit Theorem (CLT).
n_samples <- 40
lambda <- 0.2
E_x <- 1/lambda; SD_x <- 1/lambda; SE_x <- SD_x/sqrt(n_samples)
# simulation
n_simulations <- 1000
set.seed(10)
obs <- matrix(rexp(n_samples*n_simulations, lambda), n_simulations, n_samples)
means_obs <- apply(obs, 1, mean); means_center <- mean(means_obs)
sd_obs <- apply(obs, 1, sd); sd_center <- mean(sd_obs)
se_obs <- sd_obs/sqrt(n_samples); se_center <- mean(sd_obs)/sqrt(n_samples)
The exponential distributoion has been simulated with rexp(n, lambda) function to investigate the asymptotic behaviour of sampled means of random variables with exponential distribution. The rate parameter lambda of exponential distribution has been set to 0.2, and total 40 samples were drawn from the exponential distribution. Histogram of an example samples drawn has been shown in Fig.1 below, and it shows the frequency of sample values distribute in an approximately exponential shape.
This process has been repeated 1000 times resulting in 1000 sample means of 40 random exponentials.
The 25%, 50% and 75% quantiles of the averages of 40 random exponentials from these 1000 simulations are:
summary(means_obs)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.772 4.484 5.017 5.045 5.579 7.880
The histogram of the sample mean distribution has been shown in Fig.2 below, which shows a similar shape to that of a normal distribution.
The theoretical mean of exponential distribution is \(\frac{1}{\lambda}\) = 5, and the corresponding mean from the simulation is 5.0450596. When both values are overlaid as the vertical lines over the histogram of Fig.2 (blue: theoretical mean, red: sample mean average), the similarity of the values can be confirmed visually.
Likewise, The theoretical standard error of 40 random exponentials, \(\frac{\frac{1}{\lambda}}{\sqrt{n}}\) = 0.7905694 and the mean of standard errors of 40 random exponential from 1000 simulations, SE = 0.7800232 are overlaid as vertical lines over the histogram of standard errors of 40 random exponentials from 1000 simulations (Fig.3). Both values are well matched visually in the figure.
To see if the distribution of averages of 40 random exponentials from 1000 simulations is approximately normal, let’s first look at the distribution of 1000 random exponential (equivalent to 1000 simulations with sample size = 1). The frequency of 1000 random exponentials has been displayed in the histogam in Fig.4-(Left) below, where the exponential probability density function (pdf) with \(\lambda\) = 0.2 has been overlaid to visually guide its approximation. So, when sample size = 1, the distribution of a large collection of random exponentials follows its exponential distribution not a normal distribution.
However, when we sample 40 random exponentials, the frequency of averages of 40 random exponentials shows more like a normal distribution as shown in the histogram of Fig.4-(Right) below, where a normal pdf of N(5, 0.625) has been overlaid. We can visually identify that the distribution of a large collection of averages of 40 exponentials is approximately normal.
Let’s do a inferential test for this. If we set the null hypothesis, \(H_{0}\): the mean of \(\bar X\)s is equal to \(\mu\) of N(\(\mu\) = 5 , \(\frac{\sigma^2}{n}\) = 0.625) and the alternative hypothesis, \(H_{a}\): the mean of \(\bar X\)s is not equal to \(\mu\) of N(\(\mu\) = 5 , \(\frac{\sigma^2}{n}\) = 0.625), then the two-sided t-test shows the following 95% confidence interval:
df <- n_samples - 1
t_95_twosided <- qt(0.975, df)
means_center +c(-1,1)*t_95_twosided*mean(sd_obs/sqrt(n_samples))
## [1] 3.467314 6.622805
Like wise, the 95% confidence interval for standard error of the averages of 40 random exponentials with \(H_{0}\): SE of \(\bar X\)s is equal to SE of N(\(\mu\) = 5 , \(\frac{\sigma^2}{n}\) = 0.625) and \(H_{a}\): SE of \(\bar X\)s is not equal to SE of N(\(\mu\) = 5 , \(\frac{\sigma^2}{n}\) = 0.625) is:
df <- n_simulations - 1
t_95_twosided <- qt(0.975, df)
se_center +c(-1,1)*t_95_twosided*sd(se_obs)/sqrt(n_simulations)
## [1] 0.7692365 0.7908099
Both the theoretica mean and standard error are located within the correponding T confidence intervals. Therefore, we fail to reject null hypotheis that the samples come from a normal distribution with the 95% T confidence intervals, and you can conclude that the averages of 40 random exponentials from 1000 simulations asymptotically approximates the normal distribution as the sample size increases.
main_str <- expression(paste("histogram of random exponential ",X[i]," (i=1,...,40)"))
hist(obs[3,], main = main_str, xlab = expression(X[i])); rug(obs[3,])
[Fig.1. Frequency of the value of random exponential \(X_i \; (i=1,...,40)\) has been shown in a histogram with bin size = 5. The individual \(X_i\) value is represented as a short line along the x-axis.]
main_str <- expression(paste("histogram of sample mean, ",bar(X),", from 1000 simulations"))
hist(means_obs, main = main_str, xlab = expression(bar(x))); rug(means_obs)
abline(v=E_x, col="blue", lwd=3); abline(v=means_center, col="red", lwd=3)
legend("topright", col=c("blue","red"), lwd=c(3,3), bty="n", legend = c("theoretical mean", "sample mean average"))
[Fig.2. Frequency of sample mean \(\bar X\) from 1000 simulations has been shown in histogram with bin size = 5. The individual \(\bar X\) value is represented as a short line along the x-axis, and both theoretical population mean (=5, blue) and average of sample means (=5.0450596, red) are overlaid over the histogram]
main_str <- expression(paste("histogram of standard error(SE) of ",bar(X)," from 1000 simulations"))
xlab_str <- expression(paste("Standard error of ",bar(X)))
hist(se_obs, main=main_str, xlab=xlab_str); rug(se_obs)
abline(v=SE_x, col="blue", lwd=3); abline(v=se_center, col="red", lwd=3)
legend("topright", col=c("blue","red"), lwd=c(3,3), bty="n", legend = c("theoretical SE", "sample SE average"))
[Fig.3. Frequency of standard error(SE) of \(\bar X\) from 1000 simulations has been shown in histogram with bin size = 0.1. The individual value of standard error of \(\bar X\) is represented as a short line along the x-axis, and both theoretical population SE (=0.7905694, blue) and average of sample SE (=0.7800232, red) are overlaid over the histogram]
par(mfrow=c(1,2), mar=c(4,4,2,1) )
# 1st sub-fig.
data1 <- rexp(1000, lambda)
main_str <- expression(paste("histogram of ", X[i]))
xlab_str <- expression(paste("Random exponential ",X[i]))
p1<-hist(data1, plot=FALSE); multiplier <- p1$counts/p1$density
p1_den <- density(data1); p1_den$y <- p1_den$y*multiplier[1]
y_val_den <- dexp(p1_den$x,lambda) # theoretical exponential pdf at the histogram bin locations
plot(p1, main=main_str, xlab=xlab_str); rug(data1)
lines(p1_den$x, y_val_den*multiplier, col="red", lwd=2) #
# 2nd sub-fig, 40 random exponential samples
main_str_p3 <- expression(paste("histogram of ", bar(X)))
xlab_str_p3 <- expression(paste("mean of 40 random exponentials, ", bar(X)))
p3 <- hist(means_obs, plot = FALSE ); multiplier_p3 <- p3$counts/p3$density
p3_den <- density(means_obs); p3_den$y <- p3_den$y*multiplier_p3[1]
y_val_den_p3 <- dnorm(p3_den$x,mean =E_x, sd=SD_x/sqrt(40))
plot(p3, main = main_str_p3, xlab=xlab_str_p3, xlim=c(2,10), ylim=c(0,250));rug(means_obs)
lines(p3_den$x, y_val_den_p3*multiplier_p3, col="blue", lwd=2) # => works!
## Warning in y_val_den_p3 * multiplier_p3: longer object length is not a
## multiple of shorter object length
[Fig.4. (Left) Frequency of sample values of 1000 random exponential, \(X_{i}\) has been shown in histogram with bin size = 0.5. The individual value of the sample \(X_{i}\) is represented as a short line along the x-axis. Then, the theoretical exponential probability densitiy function wiht the rate \(\lambda\) = 0.2 has been overlaid over the histogram. (Right) Frequency of averages of 40 random exponentials, \(\bar X\) has been shown in histogram with bin size = 0.5. The individual sample mean \(\bar X\) is represented as a short line along the x-axis. The theoretical normal probability densitiy function wiht mean = 5 and standard error = 0.7905694 has been overlaid over the histogram]