Overview

In this short report, we will show that the mean of a sample of exponentially distributed random numbers is approximately normally distributed. To do this, we will simulate a large number of such samples, compute their means, and compare the resulting distribution to the normal distribution predicted by the Central Limit Theorem.

Introduction

The exponential distribution describes the waiting times between events of a Poisson process with rate \(\lambda\). It is therefore parametrized by \(\lambda\) and has density \(f(x) = \lambda e^{-\lambda x}\), where \(x\) is the time between two events. Its mean is \(\mu=1/\lambda\) and its variance is \(\sigma^2 = 1/\lambda^2\).

In the following, we will consider the sample mean of this distribution for \(n=40\) samples. We will fix \(\lambda=0.2\) for all simulations. To analyze the distribution of the sample mean, we will perform \(N=1000\) simulations, each consisting of \(n=40\) samples from the exponential distribution. We will use the R function rexp(m, rate=0.2) to draw samples from the exponential distribution, where \(m\) is the number of samples to be drawn.
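
As a quick illustration (a minimal sketch, separate from the simulation code in the appendix; the variable name and seed here are arbitrary), the theoretical mean and variance for \(\lambda=0.2\) and a single realization of the sample mean of \(n=40\) draws can be obtained as follows:

lambda <- 0.2
1/lambda     # theoretical mean mu = 5
1/lambda^2   # theoretical variance sigma^2 = 25

set.seed(1)                    # arbitrary seed, only for reproducibility of this example
x <- rexp(40, rate = lambda)   # one batch of n = 40 exponential draws
mean(x)                        # one realization of the sample mean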

Simulations

We performed \(N=1000\) simulations, each consisting of \(n=40\) draws from the exponential distribution with \(\lambda=0.2\). We defined variables that contain the theoretical mean (\(\mu\)) and variance (\(\sigma^2\)), as well as a data frame with \(N\) rows and two columns, one containing the sample means (column header "sampleMean") and one containing the sample variances ("sampleVar"). We also computed the expected standard error of the mean ("SEMean") when drawing \(n=40\) samples from the exponential distribution. Finally, we computed a Z-statistic and the corresponding p-value to quantify how compatible the sample means are with the normal distribution predicted by the central limit theorem.
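
For reference, the Z-statistic computed in the appendix code compares the mean of the \(N\) sample means, \(\overline{\bar{X}}\), with the theoretical mean \(\mu\), using the standard error of that grand mean:

\[
Z = \frac{\overline{\bar{X}} - \mu}{\mathrm{SEMean}/\sqrt{N}},
\qquad
\mathrm{SEMean} = \frac{\sigma}{\sqrt{n}} .
\]

The p-value is then obtained from the standard normal distribution as \(p = 2\,\Phi(-|Z|)\).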

Sample Mean versus Theoretical Mean

As we will see in the appendix ("Simulated distribution versus theoretical distribution"), the R function rexp indeed generates exponentially distributed random variables. We will now analyse the distribution of the sample mean of \(n=40\) samples from that distribution. The Central Limit Theorem tells us that the distribution of these sample means approaches a normal distribution as \(n\rightarrow\infty\).

In this case, the expected distribution follows \(\mathcal{N}(\mu,\sigma^2/n)\), where \(\mu\) is the population mean and \(\sigma^2\) the population variance of the underlying distribution (here the exponential distribution with \(\lambda=0.2\)).
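
For \(\lambda=0.2\) and \(n=40\), these parameters evaluate to

\[
\mu = \frac{1}{\lambda} = 5,
\qquad
\frac{\sigma^2}{n} = \frac{1/\lambda^2}{n} = \frac{25}{40} = 0.625,
\qquad
\sqrt{\sigma^2/n} \approx 0.79 .
\]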

\label{fig:distribution-of-mean}Theoretical and simulated distributions of the sample mean

Figure \ref{fig:distribution-of-mean} displays the distribution of the sample mean of \(n=40\) samples from the exponential distribution and compares it to the distribution expected from the central limit theorem. Note that we only expect the simulated distribution to follow the expected distribution exactly in the limit \(n \rightarrow \infty\). As we have \(n=40\), we expect the distribution to show some deviations from the theoretical one. The mean of the sample means, \(\mu_{Sample}=4.987\), is very close to the expected mean of 5.

Comparison of the sample Mean to the theoretical mean

\(\mu_{Sample}\)   \(\mu\)
----------------   -------
4.986508           5

Figure \ref{fig:distribution-of-mean} clearly shows that the distribution of the sample mean is very close to the theoretical distribution (which is a normal distribution). There is, to our knowledge, no simple test that proves that a given data set follows a normal distribution. We can, however, perform a test that aims to reject the null hypothesis \(H_0\) that the data (the \(N=1000\) sample means) are drawn from a normal distribution with mean \(\mu\) and variance \(\sigma^2/n\).
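
With the simulated value \(\mu_{Sample} = 4.986508\), the Z-statistic defined above evaluates to

\[
Z = \frac{4.986508 - 5}{(5/\sqrt{40})/\sqrt{1000}} \approx -0.540,
\qquad
p = 2\,\Phi(-|Z|) \approx 0.589,
\]

in agreement with the values reported in the following table.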

p-Values for a test checking if the data could be drawn from a normal distribution given by the central limit theorem

Value of Statistic   p-Value of Statistic
------------------   --------------------
-0.5396673           0.5894265

The large p-value of more than 58% indicates that we would have to accept a type I error rate of more than 58% in order to reject \(H_0\). We therefore conclude that we cannot reject the hypothesis that the data are drawn from a normal distribution with mean \(\mu\) and variance \(\sigma^2/n\).

Sample Variance versus Theoretical Variance

We want to answer the question of whether the sample variance corresponds to the theoretical value. We do not know the theoretical distribution of the sample variance for an arbitrary number of samples. Therefore, we can only plot the distribution of the sample variance, display its mean, and compare that mean to the expected mean of the sample variance, which is \(\sigma^2\).
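
Since the (Bessel-corrected) sample variance is an unbiased estimator of the population variance, this expected mean is

\[
\mathrm{E}[S^2] = \sigma^2 = \frac{1}{\lambda^2} = 25 .
\]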

\label{fig:distribution-of-var}Theoretical and simulated distributions of the sample variance

From Figure \ref{fig:distribution-of-var}, we can see that this distribution is skewed; it therefore resembles a normal distribution less closely than the distribution of the sample mean does. However, the mean of the distribution is very close to the theoretical value:

Comparison of the sample Mean and Variance to the theoretical values

Kind       Sample      Theory
--------   ---------   ------
Mean       4.986508    5
Variance   25.119897   25

Distribution

As we have seen in the sections "Sample Mean versus Theoretical Mean" and "Sample Variance versus Theoretical Variance", the distribution of the sample mean is very close to the theoretical (normal) distribution. The distribution of the sample variance is not known to us, but we do know its expected mean. The observed mean is very close to this theoretical value, while the distribution itself is skewed and therefore resembles a normal distribution less closely than the distribution of the sample mean does.

Appendix

Simulated distribution versus theoretical distribution

First, we plot the theoretical distribution alongside the simulated results. Figure \ref{fig:check-distribution} shows the results. As we can see, the simulated distribution is very close to the theoretical one, so we are satisfied that rexp really yields samples from an exponential distribution.

require(ggplot2)
# df_m and lmbd are defined in the simulation code below ("Code for the generation of simulations")
g <- 
    ggplot(data=df_m,aes(x=sample)) + 
    geom_histogram(aes(y=..density..),color="black",fill="steelblue",alpha=0.5,bins=100) + # normalized histogram of all draws
    geom_density(aes(color="Simulation")) +                                                # empirical density estimate
    stat_function(aes(color="Theory"),fun=dexp,args=list(rate=lmbd)) +                     # theoretical exponential density
    scale_colour_manual("", values = c("black", "red")) +
    labs(x="x",y="normalized histogram",title="Exponential Distribution") +
    theme(plot.title = element_text(hjust = 0.5))
g
\label{fig:check-distribution}Histogram of 40*1000 samples drawn with the R function rexp(40*1000, rate=0.2), compared to the theoretical exponential density \(f(x) = \lambda e^{-\lambda x}\).

Code for the simulations, tables and figures

Code for the generation of simulations:

require(dplyr)
set.seed(42);
n <- 40;          # number of draws per simulation
N <- 1000;        # number of simulations
lmbd <- 0.2;      # rate parameter lambda of the exponential distribution
mu <- 1./lmbd;    # theoretical mean (1/lambda)
sigma <- 1./lmbd; # theoretical standard deviation (1/lambda)
m <- rexp(n=n*N,rate=lmbd);   # all n*N draws at once
df_m <- data.frame(sample=m);
m2 <- matrix(m,N,n);          # reshape: one row per simulation (N rows of n draws)
df_m2 <- data.frame(sampleMean=apply(m2,1,mean),
                    sampleVar=apply(m2,1,var));

mu_sample <- mean(df_m2$sampleMean);  # mean of the N sample means
var_sample <- var(df_m2$sampleMean);  # variance of the N sample means

SEMean <- sigma/sqrt(n); # Expected standard error of the mean for n samples

Z <- (mu_sample - mu) / (SEMean/sqrt(N)); # Z-statistic for the mean of the N sample means
pZ <- min(c(pnorm(Z),pnorm(-Z)))*2        # two-sided p-value for the Z-statistic
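
As an additional sanity check (not part of the original analysis above), the empirical variance of the \(N\) sample means, var_sample, can be compared with the variance \(\sigma^2/n\) predicted by the central limit theorem; the two values should be close if the approximation holds:

var_sample   # empirical variance of the N sample means
SEMean^2     # variance predicted by the CLT: sigma^2/n = 25/40 = 0.625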

Code for the generation of Table \ref{table:Comparison-sample-mean-with-theoretical-mean}:

require(knitr)
my.df <- data.frame(mu_sample=mean(df_m2$sampleMean),mu_theory=mu)
kable(my.df, col.names=c("$\\mu_{Sample}$","$\\mu$"),
      caption="\\label{table:Comparison-sample-mean-with-theoretical-mean}Comparison of the sample Mean to the theoretical mean")

Code for the generation of Table \ref{table:p-Values}:

require(knitr)
my.df <- data.frame(statistic_value=Z,p_value=pZ)
kable(my.df, col.names=c("Value of Statistic","p-Value of Statistic"),
       caption="\\label{table:p-Values}p-Values for a test checking if the data could be drawn from a normal distribution given by the central limit theorem")

Code for the generation of Table \ref{table:Comparison-sample-mean-var-vs-theo-mean-var}:

require(knitr)
my.df <- data.frame(type=c("Mean","Variance"),sample=c(mean(df_m2$sampleMean),mean(df_m2$sampleVar)),theory=c(mu,sigma^2))
kable(my.df, col.names=c("Kind","Sample","Theory"),
      caption="\\label{table:Comparison-sample-mean-var-vs-theo-mean-var}Comparison of the sample Mean and Variance to the theoretical values")

Code for Fig. \ref{fig:distribution-of-mean}:

require(dplyr)
require(ggplot2)
dnorm2 <- function(x,...) {dnorm(x,mean=mu,sd=SEMean,...)};
qnorm2 <- function(x,...) {qnorm(x,mean=mu,sd=SEMean,...)};
brks <- c("Sim","Thr","samMean","thrMean");
lbls <- c("Simulation","Theory (CLT)","Sample Mean","Theoretical Mean")
vls <- c("Sim"="black","Thr"="red","samMean"="green","thrMean"="brown")
p1 <- 
    ggplot(data=df_m2,aes(x=sampleMean)) + 
    geom_histogram(aes(y=..density..,color="Sim"),fill="steelblue",alpha=0.3,bins=30,size=0.1) +
    stat_function(aes(color="Thr"),fun = dnorm2) + 
    geom_vline(aes(xintercept = mean(df_m2$sampleMean), color = 'samMean')) +
    geom_vline(aes(xintercept = mu, color = 'thrMean')) +
    scale_colour_manual("", values = vls,labels=lbls,breaks=brks) +
    labs(x="mean values",y="density histogram", title="Distribution of the Sample Mean") +
    theme(plot.title = element_text(hjust = 0.5))
p1

Code for Fig. \ref{fig:distribution-of-var}:

require(dplyr)
require(ggplot2)
dnorm2 <- function(x,...) {dnorm(x,mean=mu,sd=SEMean,...)};
qnorm2 <- function(x,...) {qnorm(x,mean=mu,sd=SEMean,...)};
brks <- c("Sim","samVar","thrVar");
lbls <- c("Simulation","Mean of Sample Variance","Mean of theoretical Variance")
vls <- c("Sim"="black","samVar"="green","thrVar"="brown")
p1 <- 
    ggplot(data=df_m2,aes(x=sampleVar)) + 
    geom_histogram(aes(y=..density..,color="Sim"),fill="steelblue",alpha=0.3,bins=30,size=0.1) +
    geom_vline(aes(xintercept = mean(df_m2$sampleVar), color = 'samVar')) +
    geom_vline(aes(xintercept = sigma^2, color = 'thrVar')) +
    scale_colour_manual("", values = vls,labels=lbls,breaks=brks) +
    labs(x="variance",y="density histogram", title="Distribution of the Sample Variance") +
    theme(plot.title = element_text(hjust = 0.5))
p1