Investigation into the distribution of the mean of samples from an exponential distribution using sample size of 40

The purpose of this markdown document is to investigate the distribution of the mean of samples of random variables from an exponential distribution with lambda = 0.2 . A sample size of 40 will be used, and 1000 samples will be drawn.

For each sample of size 40 from an exponential distribution with lambda=0.2, the mean and variance will be calculated for that sample. This procedure will be repeated 1000 times. The 1000 means will then be plotted in a histogram so that the general shape of the distribution can be examined to see if it resembles a normal distribution.

Then the key quantiles for the distribution of the mean will be compared to the quantiles for a normal distribution to verify that the mean is in fact approximately normally distributed.

Finally, a 95% confidence interval will be calculated for each of the samples, and then the number of times this would include the true population mean of 1/lambda will be calculated, to see what the coverage is for the given sample size of 40, i.e. is it included in the calculated confidence intervals 95% of the time or more.

Setting up the simulation

The following code chunk sets up the basic parameters for the simulation:

true_lambda<-0.2
true_mean<-1/true_lambda
true_var<-1/true_lambda
nosim<-1000
sample_size<-40
set.seed(1234)

Note the use of set.seed(1234) to set the seed for the pseudo random number generator in R and enable the analysis to be reproduced exactly by anyone studying this markdown document. Next, we obtain 1000 random samples of size 40 from an exponential distribution with parameter lambda=0.2, saving the mean and variance of each sample into a vector:

mean_vec<-rep(NA,1000)
sd_vec<-rep(NA,1000)
for (i in 1:1000) {
    x<-rexp(sample_size,true_lambda)
    mean_vec[i]<-mean(x)
    sd_vec[i]<-sd(x)
    }
mean_of_sample_means<-mean(mean_vec)
sd_of_sample_means<-sd(mean_vec)
print(paste("mean of sample means",round(mean_of_sample_means,3),sep=" "))

## [1] "mean of sample means 4.974"

print(paste("standard deviation of sample means",round(sd_of_sample_means,3),sep=" "))

## [1] "standard deviation of sample means 0.755"

print(paste("theoretical mean",round(1/true_lambda,3)))

## [1] "theoretical mean 5"

print(paste("theoretical standard deviation",round((1/true_lambda)/sqrt(sample_size),3)))

## [1] "theoretical standard deviation 0.791"

The mean of the sample means, and the standard error of the sample means, are both very close to the theoretical values (the mean of the sample means should be about the same as the population mean of 1/true_lambda and the variance of the sample means should be equal to the population standard deviation divided by the square root of the sample size).

To investigate the shape of the distribution, a histogram of the sample means can be plotted:

hist(mean_vec,main="hisogram of sample means of size 40 from exponential distribution",xlab="sample means of size 40")

plot of chunk plot_data

The distribution has a single peak, appears roughly symmetrical about that peak, and has a shape approximating that of a normal distribution.

To investigate further, we can use the qqnorm function in r to compare the quantiles from our distribution against the theoretical quantiles from a normal distribution:

qqnorm(mean_vec,main="test distribution of sample means for normality")

plot of chunk quantiles

The points from the distribution lie along a well defined straight line, indicating that the distribution is approximately normally distributed.

Finally, we can calculate 95% confidence intervals for each of our 1000 samples, and then see what proportion of times the resulting interval would contain the true population mean of 1/lambda_true:

ll <- mean_vec - qnorm(.975) * sd_vec/sqrt(sample_size)
ul <- mean_vec + qnorm(.975) * sd_vec/sqrt(sample_size)
coverage<- mean(ll < 1/true_lambda & ul > 1/true_lambda)
print(paste("coverage for 95% confidence interval using samples of size 40 is",coverage,sep=" "))

## [1] "coverage for 95% confidence interval using samples of size 40 is 0.932"

The coverage is just over 93%, falling short of the 95% of times that we would require the confidence interval to contain the true population parameter.

Investigation into the distribution of the mean of samples from an exponential distribution using sample size of 40

Roy Murphy

Saturday, October 18, 2014

Setting up the simulation