Examination of Theoretical and Empirical Means and Variances of Exponential Distributions

Aaron Chandler

Assignment Purpose

This assignment compares the theoretical and empirical means and variances of the exponential distribution with lambda given as .2. The comparison requires an understanding and application of the Central Limit Theorom. Applying the CLT allows for the comparison between the mean of sample averages and the expected mean, as well as the variance of the sample averages and the expected variance.

Analytical Questions

How does the sample mean of exponentials compare to the expected mean of exponentials, both with lambda=.2?
Is the distribution of the sample means of exponentials approximately normal?

The procedure for answering this question includes:

Define the expected mean of an exponential distribution with lambda=.2.
Generate 40 exponentials from the exponential distribution with the same lambda as above.
Take the mean of the 40 exponentials and store the value in a vector.
Standardize the 1000 means of the 40 exponentials using the calculation: (x_bar-mu)/standard error of x_bar
Calculate the mean of the 1000 means and compare the value to theoretical mean using a z-test and the expected 95% confidence interval.
Plot the histogram of the distribution of the 40 point sample averages, the normal distribution given by the empirical mean and standard error of the means, and the normal distribution given by the theoretical mean and standard error of the mean.
Plot the confidence interval coverage and calculate the KS statistic of the distribution of the means.

The r code below carries out the above procedure. The results and analysis are below the code.

mu <- 1/0.2
mns = NULL
for (i in 1:1000) mns = c(mns, mean(rexp(40, 0.2)))
vars <- var(mns)
x_bar <- mean(mns)
theor_ci <- mu + c(1, -1) * qnorm(0.975) * (5/sqrt(40))

The theoretical mean-also known as the center of the distribution described above, is:

print(mu)

## [1] 5

The empirical mean of the distribution of sample means (based on a simulation of 1000 means of 40 expoentials) is:

print(x_bar)

## [1] 4.983

This is the center of the empirical distribution.

A t-test between x_bar and mu yields the following results:

t.test(mns, alternative = "two.sided", m = 5, conf.level = 0.95)

## 
##  One Sample t-test
## 
## data:  mns 
## t = -0.691, df = 999, p-value = 0.4897
## alternative hypothesis: true mean is not equal to 5 
## 95 percent confidence interval:
##  4.933 5.032 
## sample estimates:
## mean of x 
##     4.983

From the results above we fail to reject the null hypothesis that mu=5, that x_bar is not statistically different from mu, and x_bar is within the 95% confidence interval. Thus, we can conclude that mean of the sample averages of the exponential distribution is a consistent estimator of the population mean.

The figure below shows a comparison between the empirical distribution of sample means and normal probability density functions given by the theoretical and empirical point estimates. The distribution of sample means is standardized.

hist(mns, main = "Comparison between Distribution \nof Sample Means and Normal Curves", 
    xlab = "Sample Means", ylab = "Density", prob = T, ylim = c(0, 0.6))
curve(dnorm(x, mu, 5/sqrt(40)), col = "red", add = T)
curve(dnorm(x, mean(mns), sqrt(var(mns))), add = T, col = "blue")
abline(v = 5, col = "red")
legend("topright", c("Sample Means", "Normal", "Empirical", "E[x]"), col = c("black", 
    "red", "blue", "red"), lty = 1)

plot of chunk unnamed-chunk-5

A review of the figure above shows how closely the distributions based on theoretical and estimated parameters are to eachother.

The standard error of the sample means is:

print(sqrt(var(mns)))

## [1] 0.7937

compared to theoretical standard error of:

5/sqrt(40)

## [1] 0.7906

The standard error was used here in place of the variance because we are concerned with the variability in averages of means. And remember, the variance of sample mean is equal to sigma^2/n. That is, var(mns)=sigma^2/n. Therefore, the standard error of the mean is the square root of var(mns). The standard errors of the sample mean and population mean are quite comparable.

lvals <- seq(0, 10, by = 0.5)
coverage <- sapply(lvals, function(l) {
    lhats <- mns
    ll <- lhats - qnorm(0.975) * sqrt((1/0.2^2)/40)
    ul <- lhats + qnorm(0.975) * sqrt((1/0.2^2)/40)
    mean(ll < l & ul > l)
})

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 2.15.3

qplot(lvals, coverage) + geom_hline(yintercept = 0.95) + ylab("Percent Coverage") + 
    xlab("Lambda Values") + ggtitle("Coverage of 95% Confidence Interval of Lambda_Hat") + 
    geom_vline(xintercept = c(3.45, 6.55), color = "red")

plot of chunk unnamed-chunk-8

The plot above provides more evidence that the distribution of sample means follows a normal distribution. However, the level of coverage expected is not met. From visual examination, the confidence interval coverage of lambda_hat seems to only be 68%

How does the variance of the means of exponentials compare to the expected variance of exponentials?

The procedure for answering this question inovles the following:

Review the figure above for insight into variances from the distributions.
Define the expected variance of the given exponential distribution.
Calculate the variance of the distribution of the sample means.
Generate a distribution of sample variances from sample exponentials and repeat that 1000 times.
Compare the distribution of variances to theoretical distribution of variances using the same method as for the means.

Moving on to the theoretical and empirical mean sample variances, the expected variance is 1/lambda^2, or 1/.2^2=25. To get the empiricial sample variances, we employ a similar method as the one use to estimate the population mean above, but calculating the variance instead of the mean. Doing this only required substituting the population variance in for the population mean as the parameter to estimate.

The r code below carries out the procedure for estimating variance and the results and analysis are below the code.

sigma <- 1/0.2^2
vars = NULL
for (i in 1:1000) vars = c(vars, var(rexp(40, 0.2)))
s_bar <- mean(vars)
theor_ci <- sigma + c(1, -1) * qnorm(0.975) * sqrt(25)/sqrt(40)

The empirical variance is:

print(s_bar)

## [1] 25.08

This represents the center of the distribution of the average of the sample variances, and is the estimate of the population variance.

A t-test between s_bar and sigma yields the following results:

t.test(vars, alternative = "two.sided", m = 25, conf.level = 0.95)

## 
##  One Sample t-test
## 
## data:  vars 
## t = 0.212, df = 999, p-value = 0.8322
## alternative hypothesis: true mean is not equal to 25 
## 95 percent confidence interval:
##  24.37 25.78 
## sample estimates:
## mean of x 
##     25.08

The results above lead to the conclusion of failing to reject the null hypothesis and conclude that s_bar is not statistically different from sigma. This conclusion supports that the variance estimate using means of sample variances is a consistent estimator of the population variance, sigma=25.

The figure below compares the distribution of sample variances to normal distribution with mu=sigma and population standard error of 5/sqrt(40). The figure belows shows the standardized distribution of the sample variances.

hist(vars, main = "Comparison between Distribution \nof Sample Variances and Normal Distribution", 
    xlab = "Sample Variances", ylab = "Density", prob = T, ylim = c(0, 0.6))
curve(dnorm(x, 25, 5/sqrt(40)), col = "red", add = T)
curve(dnorm(x, mean(vars), sd(vars)/sqrt(40)), add = T, col = "blue")
abline(v = 25, col = "red")
legend("topright", c("Sample Means", "Normal", "Empirical", "E[x]"), col = c("black", 
    "red", "blue", "red"), lty = 1)

plot of chunk unnamed-chunk-12

The figure shows the poor fit of the distribution of the sample variances with the normal curve of mu=sigma and standard error of the population, although the distribution is centered around 25, close to the population variance. However, the standard error of the sample variances is much higher compared to the expected standard error, which explains the flatter curve. A simulation of a higher order could solve this.