I will explore, using simulated data, the behavior of the sample means, for iid data taken from an exponential distribution.
The simulation consists of taking 100,000 samples of size 40 each, from an exponentially distributed infinite population with parameter lambda = 0.2. The iid data with exponential distribution are generated with the rexp() function of the base distribution of R.
#---< The simulation
nsim <- 100000
n <- 40
lambda <- .2
simdata <- t(replicate(nsim, rexp(n, lambda)))
means <- rowMeans(simdata)
Next, the histogram of the 4,000,000 iid data taken from the population is show on the left. And on the right, the histogram of the 100,000 sample means.
The average of the 100,000 sample means is the estimator of the population mean (1/lambda), and the average of the 10,000 sample variances is the estimator of the population variance (1/lambda^2).
| Item | Mean | Variance |
|---|---|---|
| Simulated | 5.01 | 25.03 |
| Theoretical | 5 | 25 |
| % Error | 0.11 | 0.12 |
The two parameters are very well estimated, with estimation errors less than 0.13%
Let’s take a closer look at the distribution of sample means presented above.
By superimposing the normal distribution curve (red) and the density distribution curve for the sample means (blue), it can be seen that the latter has a behavior that is quite close to the normal distribution, for a sample size of 40.
Now, let’s evaluate the coverage provided by the confidence intervals for the lambda estimation.
#---< Coverage of Confidence intervals
alfa <- .05
Z_alf <- qnorm(1-alfa/2)
mean(1/means*(1 - Z_alf/sqrt(n)) < lambda &
1/means*(1 + Z_alf/sqrt(n)) > lambda) # (Guerriero, 2012)
## [1] 0.95119
As expected, given the chosen alpha value (0.05), in almost 95% of the confidence intervals calculated with the samples of size 40, the estimation of the lambda value is correct.
Guerriero, V. 2012. Power law distribution: Methods of multi-scale inferential statistics. Journal of Modern Mathematics Frontier 1:21-28.
##======================= This is All the code ======================
#---< Packages uses
library(pander)
#---< Seed for reproducibility
set.seed(1234)
#---< The simulation
nsim <- 100000
n <- 40
lambda <- .2
simdata <- t(replicate(nsim, rexp(n, lambda)))
means <- rowMeans(simdata)
#---< The histograms
par(mfrow=c(1, 2))
hist(simdata, breaks="FD")
hist(means, breaks="FD")
#---< Estimation vs Theoretical
varEst <- mean(apply(simdata, 1, var))
df <- data.frame(Item=c("Simulated", "Theoretical", "% Error"),
Mean=c(mean(means), 1/lambda,
abs(mean(means)-1/lambda)/(1/lambda)*100),
Variance=c(varEst, 1/lambda^2,
abs(varEst - 1/lambda^2)/(1/lambda^2)*100))
pander(df, round=2)
#---< Sampling distribution
hist(means, breaks="FD", prob=T, xlim=c(2, 8), ylim=c(0, .55))
abline(v=mean(means), col="green", lwd=2)
lines(density(means), col="blue", lwd=3)
curve(dnorm(x, mean=mean(means), sd=sd(means)), add=T, col="red", lwd=3)
legend(6.2, .52, c("Simulation", "Normal"), lty=c(1, 1), lwd=c(3,3),
col=c("blue", "red"))
#---< Coverage of Confidence intervals
alfa <- .05
Z_alf <- qnorm(1-alfa/2)
mean(1/means*(1 - Z_alf/sqrt(n)) < lambda &
1/means*(1 + Z_alf/sqrt(n)) > lambda) # (Guerriero, 2012)
#====================================================================