Distribution of means of 40 samples from the exponential distribution

Overview

This report investigates the distribution of means of 40 samples from the exponential distribution and tests whether the central limit theorem can be used to approximate it with a normal distribution. One thousand means of 40 values drawn from an exponential distribution with \(\lambda = 0.2\) were generated. A density histogram of the resulting distribution was plotted and compared with the normal density predicted by the central limit theorem.

Simulations

The data were simulated using the rexp function with \(\lambda = 0.2\) and organized into a matrix with 40 rows and 1000 columns. The mean of each column was calculated and assigned to the variable means. The first row of the matrix (1000 values) was assigned to the variable orig.

For exploratory data analysis (EDA), a descriptive statistical summary of the sample distribution (Table 1 in the Appendix) and a density histogram (Figure 1 in the Appendix) were produced. The skewness criterion (skew.2SE) is 1.80, which is greater than 1, so the skewness is significantly different from 0: the sample distribution retains a slight right skew. The kurtosis criterion (kurt.2SE) is -0.45, less than 1 in absolute value, hence the kurtosis is not significantly different from that of a normal distribution (the sample distribution is mesokurtic). The Shapiro-Wilk test rejects normality (p < 0.01), but with 1000 observations the test is sensitive to very small departures from normality, so this result alone says little about how good the normal approximation is in practice.

In contrast with the distribution of the means, the original sample distribution follows a different pattern (Figure 2 in the Appendix). Its density peaks at values close to 0 and decays exponentially as the values increase.

Sample Mean versus Theoretical Mean

The theoretical mean of the exponential distribution is \(1 / \lambda\). The theoretical mean of the distribution of means of n samples is also \(1 / \lambda\) (5 for \(\lambda = 0.2\); see also Table 2 in the Appendix). The mean of the simulated distribution is 4.99. According to the central limit theorem, the confidence interval for the sample mean is \(\overline{X} \pm z_{\alpha / 2} \times \sigma / \sqrt{n}\). Applying this equation to the simulated data gives a 95% confidence interval of (4.68, 5.30). Since the confidence interval contains the theoretical mean (5), we fail to reject the null hypothesis that the mean of the simulated distribution equals the theoretical mean.
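The report does not state which values of \(\sigma\) and n were plugged into the interval formula. As a minimal sketch, the reported half-width of about 0.31 is consistent with taking \(\sigma = 1 / \lambda = 5\) and n = 1000; both choices are assumptions made here for illustration, not something the report confirms.

```r
lambda <- 0.2
xbar   <- 4.99          # mean of the simulated distribution, as reported in the text
sigma  <- 1 / lambda    # ASSUMPTION: standard deviation of the exponential distribution
n_obs  <- 1000          # ASSUMPTION: number of observations behind the interval
z      <- qnorm(0.975)  # z quantile for a two-sided 95% interval

# Interval from the formula: xbar +/- z * sigma / sqrt(n)
ci <- xbar + c(-1, 1) * z * sigma / sqrt(n_obs)
round(ci, 2)
```

Under these assumptions the interval rounds to the (4.68, 5.30) quoted above.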

To visualize the sample distribution, a density histogram was created (see Figure 1 in the Appendix). The sample mean and the theoretical mean are close to each other relative to the sample variability.

Sample Variance versus Theoretical Variance

The theoretical variance of the exponential distribution is \(1 / (\lambda ^ 2)\). The theoretical variance of the distribution of means of n samples is \(1 / (\lambda ^ 2 \times n)\) (0.625 for \(\lambda = 0.2\) and n = 40). The sample variance of the means of 40 samples from the exponential distribution is 0.611. The sample variance is therefore within about 2% of the theoretical variance.
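The variance of the mean follows from the independence of the n draws. A short derivation, writing \(X_1, \dots, X_n\) for the individual draws (notation introduced here for illustration):

\[
\operatorname{Var}(\overline{X})
= \operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
= \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i)
= \frac{1}{n^2} \times \frac{n}{\lambda ^ 2}
= \frac{1}{\lambda ^ 2 \times n}
\]

For \(\lambda = 0.2\) and n = 40 this gives \(1 / (0.04 \times 40) = 1 / 1.6 = 0.625\).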

Distribution

The central limit theorem states that, for large enough samples, the means of these samples are approximately normally distributed. However, the central limit theorem does not specify how large the sample should be for the approximation to hold. In this section I check whether the simulated distribution of means can be approximated by the normal distribution.
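One way to see how the sample size matters is to repeat the simulation for several sample sizes and compare the skewness of the resulting means. The sketch below is illustrative only: the sample sizes 5 and 200 and the helper sample_skewness are hypothetical choices, not part of the original analysis. The theoretical column uses the fact that the mean of n independent exponential draws has skewness \(2 / \sqrt{n}\).

```r
set.seed(1)       # fixed seed so the sketch is reproducible
lambda <- 0.2
nosim  <- 1000

# Moment-based sample skewness (hypothetical helper, not from the report)
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Skewness of 1000 simulated means for each sample size
sizes <- c(5, 40, 200)
skews <- sapply(sizes, function(n) {
  m <- colMeans(matrix(rexp(n * nosim, lambda), n, nosim))
  sample_skewness(m)
})

data.frame(n = sizes,
           simulated   = round(skews, 2),
           theoretical = round(2 / sqrt(sizes), 2))
```

The skewness shrinks as n grows, so a mean of 40 draws is expected to be close to normal but still slightly right-skewed.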

The sample mean is similar to the theoretical mean (Figure 1 and Table 2). The spread, quantified by the variance, is similar between the theoretical normal distribution and the density histogram of the sample distribution. The density histogram of the sample distribution closely matches the theoretical normal density (Figure 1). The summary statistics (Table 1) show kurtosis that is not significantly different from that of a normal distribution, together with a slight right skew (skew.2SE of 1.80). Because the Shapiro-Wilk test detects even very small departures from normality at this sample size, I also constructed a quantile-quantile plot of sample quantiles versus theoretical normal quantiles (Figure 3). The points on the quantile-quantile plot follow a mostly linear pattern, although the left-most points deviate slightly from the straight line. Based on these results I conclude that the distribution of the sample means is approximately normal.

Appendix

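# Load the packages used for tables, plots, and descriptive statistics
library(knitr)
library(ggplot2)
library(pastecs)

lambda <- 0.2   # rate of the exponential distribution
n <- 40         # size of each sample
nosim <- 1000   # number of simulated samples
set.seed(1)     # fix the seed for reproducibility
# One sample per column: 40 rows by 1000 columns
samples <- matrix(rexp(n * nosim, lambda), n, nosim)
means <- colMeans(samples)   # mean of each column
orig <- samples[1, ]         # first row: 1000 raw exponential draws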

kable(round(pastecs::stat.desc(means, norm = TRUE), 2)[c("min", "max", "median", "mean", 
                                    "skew.2SE", "kurt.2SE", "normtest.W", "normtest.p")], 
      caption = "Summary of the sample distribution")

Statistic <- c("Mean", "Variance")
Sample <- c(round(mean(means), 2), round(var(means), 2))
# Theoretical mean and variance of the mean of n exponential draws
Theoretical <- c(1 / lambda, 1 / (lambda ^ 2 * n))
theTable <- data.frame(Statistic, Sample, Theoretical)
rownames(theTable) <- NULL
kable(theTable, caption = 'Mean and variance for sample and theoretical distributions.')

# Density histogram of the means with the CLT normal density overlaid
# (after_stat() and linewidth assume ggplot2 >= 3.4.0)
e <- ggplot(data.frame(means = means), aes(means)) + 
     geom_histogram(binwidth = lambda, aes(y = after_stat(density))) 
e <- e + stat_function(fun = function(x){dnorm(x, 
                                               mean = 1 / lambda, 
                                               sd = (1 / (lambda * sqrt(n))))}, 
                       linewidth = 1, color = "red")
# Theoretical mean (green) and mean of the simulated means (yellow, dashed)
e <- e + geom_vline(xintercept = 1 / lambda, linewidth = 2, color = "green")
e <- e + geom_vline(xintercept = mean(means), linewidth = 2, color = "yellow", linetype = 2)
e <- e + ggtitle("Density histogram of means of 40 samples from exponential distribution")
e

# Density histogram of the raw exponential draws (assumes ggplot2 >= 3.3.0)
h <- ggplot(data.frame(orig = orig), aes(orig)) + 
     geom_histogram(binwidth = 1, aes(y = after_stat(density))) +
     labs(title = "Density histogram of the original exponential distribution",
          x = "original samples from exponential distribution")
h

qqnorm(means)
qqline(means, col = "red", lwd = 2)

Table 1. Summary of the sample distribution
min	3.11
max	7.49
median	4.95
mean	4.99
skew.2SE	1.80
kurt.2SE	-0.45
normtest.W	0.99
normtest.p	0.00

Table 2. Mean and variance for sample and theoretical distributions.
Statistic	Sample	Theoretical
Mean	4.99	5.000
Variance	0.61	0.625

Figure 1. Density histogram of 1000 simulated means of 40 samples from the exponential distribution. The theoretical density is shown by the red line. The theoretical mean is shown by the green line. The mean of the sample distribution is shown by the yellow dashed line.

Figure 2. Density histogram of the original exponential distribution.

Figure 3. Q-Q plot of sample quantiles vs normal theoretical quantiles.