An HTML version is available at RPubs: http://rpubs.com/svicente99/Inference_Assesment_1


Synopsis

In this project we investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution is simulated in R with rexp(n, lambda), where lambda=0.2 is the rate parameter. We run 1,000 simulations and show the distribution of the averages of 40 exponentials.


Data Processing - Simulations

Setting parameters to be processed:

set.seed(100)
lambda <- 0.2; n_sims <- 1000; sample_size <- 40

The exponential distribution is simulated with the R function rexp(n, lambda), where lambda is the rate parameter. We create a matrix that stores one thousand simulations of 40 exponentials each, one simulation per row.

mat_simulation <- matrix(rexp(n=n_sims*sample_size, rate=lambda), n_sims, sample_size)
exp_means <- rowMeans(mat_simulation)
head(exp_means,6)    # first 6 simulated means
## [1] 5.159714 5.006515 6.030689 5.268777 6.836385 5.116371
length(exp_means)    # one mean per simulation, a thousand in total
## [1] 1000
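
As a cross-check on the matrix approach, the same experiment can be sketched with replicate(), drawing one sample of 40 per simulation. The name means_alt and the seed are illustrative assumptions. Because the random draws are consumed in a different order, the individual values will not match those shown above; only their overall behaviour should agree.

```r
# Sketch: one sample of 40 exponentials per simulation, keeping only its mean.
set.seed(100)                      # illustrative seed
lambda <- 0.2; n_sims <- 1000; sample_size <- 40

means_alt <- replicate(n_sims, mean(rexp(sample_size, rate = lambda)))

length(means_alt)                  # one mean per simulation
mean(means_alt)                    # should land near 1/lambda = 5
```

The matrix version is usually faster because it makes a single call to rexp(), while the replicate() version reads closer to the description "1,000 averages of 40 exponentials".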

Then we set variables to store the mean and standard deviation of the simulated distribution. For the exponential distribution, the theoretical mean and the theoretical standard deviation are both 1/lambda. We therefore calculate these values and compare them with the statistics obtained from the simulation.

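m <- mean(exp_means)       ## mean of the 1,000 simulated means
s <- sd(mat_simulation)    ## standard deviation of all the simulated values
mu <- (1/lambda)           ## theoretical mean
sd <- (1/lambda)           ## theoretical standard deviation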

Comparing the theoretical and simulated values, we can see that they are very close:
-------+------------------+-------------------
 stat. |   Theoretical    |   Simulated
-------+------------------+-------------------
 mean  |        5         |   4.9997019
-------+------------------+-------------------
 sd    |        5         |   5.0380337
-------+------------------+-------------------
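
Note that ‘s’ measures the spread of the individual exponential draws. The Central Limit Theorem also predicts the spread of the sample means themselves: (1/lambda)/sqrt(40), about 0.79. A minimal sketch of that check follows, re-running the simulation so the block is self-contained; sims, means, sd_means and sd_means_theory are illustrative names, not part of the original analysis.

```r
set.seed(100)
lambda <- 0.2; n_sims <- 1000; sample_size <- 40
sims  <- matrix(rexp(n_sims * sample_size, rate = lambda), n_sims, sample_size)
means <- rowMeans(sims)                             # one mean per row (simulation)

sd_means        <- sd(means)                        # empirical sd of the 1,000 means
sd_means_theory <- (1/lambda) / sqrt(sample_size)   # value predicted by the CLT
c(simulated = sd_means, theoretical = sd_means_theory)
```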

Results - Inference

Let’s come back to the question this analysis should answer: what can we infer about the population (the theoretical exponential distribution) from the data we simulated (the samples)?

The best unbiased point estimate of the population mean is the sample mean (‘m’). To express the uncertainty of that estimate, we add a confidence interval. We choose a significance level of 1% (alfa), which corresponds to a confidence level of 99% (high!).

The standard error of the mean is ‘s’ divided by the square root of the sample size. Multiplying it by the z-score at 1 - alfa/2 (two-sided interval), that is, the 0.995 quantile of the standard normal, gives the margin of error. Adding this margin to the point estimate of the mean, and subtracting it, gives the upper and lower limits of the interval.

std_err <- s/sqrt(sample_size)  ## standard error of the mean
alfa <- 1/100                   ## significance level (99% confidence)
z <- qnorm(1 - alfa/2)          ## two-sided critical value (0.995 quantile)
lim_sup <- m + z * std_err      ## upper limit
lim_inf <- m - z * std_err      ## lower limit
C.I <- c(lim_inf, lim_sup)
print(C.I)  # this is our Confidence Interval
## [1] 2.947840 7.051564
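
One way to make the "99% confident" statement concrete is to build a separate interval from each of the 1,000 samples of 40 and count how often the true mean 1/lambda falls inside. This is a sketch under the same simulation settings, not part of the original analysis; sims, xbar, se, covered and coverage are illustrative names. Because the exponential distribution is skewed and each sample has only 40 values, the observed share may fall somewhat below the nominal 99%.

```r
set.seed(100)
lambda <- 0.2; n_sims <- 1000; sample_size <- 40
sims <- matrix(rexp(n_sims * sample_size, rate = lambda), n_sims, sample_size)

z        <- qnorm(1 - 0.01/2)                       # two-sided 99% critical value
xbar     <- rowMeans(sims)                          # mean of each sample of 40
se       <- apply(sims, 1, sd) / sqrt(sample_size)  # standard error of each sample
covered  <- (xbar - z * se <= 1/lambda) & (1/lambda <= xbar + z * se)
coverage <- mean(covered)                           # share of intervals containing 1/lambda
coverage
```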

Comparison - Theoretical x Sample Distributions

In the next figure we show the distribution of the sample means (obtained via 1,000 simulations) compared with the normal distribution that the Central Limit Theorem gives as its approximation.

par(mfrow=c(1,1), oma=c(2,0,2,0))  ## outer margins leave room for the mtext() lines
hist(exp_means, prob=TRUE, breaks=20, xlab="exponential means simulated")

abline(v=lim_inf, col="orange", lty=5)  ## lim.inf of interval
abline(v=lim_sup, col="orange", lty=5)  ## lim.sup of interval
abline(v=1/lambda, col="red", lty=2, lwd=3)  ## theoretical mean
curve(dnorm(x, mean=m, sd=std_err), add=TRUE, col="blue", lwd=2)  ## normal approximation

mtext( "Means of 40 exponentials vs. normal approximation", side=3, outer=TRUE, col="blue", font=2, cex=1.15 )
mtext( "1,000 simulations, lambda = 0.2", side=1, outer=TRUE, col="blue", font=1, cex=0.9 )
box("outer", col="maroon", lwd=3)

# add legends
legend('topright',
       c("simulation", "normal approx.", "C.I. lower", "C.I. upper", "theoretical mean"),
       lty=c(1,1,5,5,2), lwd=c(1,2,1,1,3),
       col=c("black", "blue", "orange", "orange", "red"))

The distribution of the sample means appears to be approximately normal. From the histogram above, it’s easy to see that it is roughly symmetric and follows the normal curve closely.
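
The visual impression can be backed by a small numerical comparison: standardise the simulated means and set a few of their quantiles next to those of the standard normal. This is a sketch, with z_means, probs and comparison as illustrative names; it re-runs the simulation so that it is self-contained.

```r
set.seed(100)
lambda <- 0.2; n_sims <- 1000; sample_size <- 40
means <- rowMeans(matrix(rexp(n_sims * sample_size, rate = lambda), n_sims, sample_size))

# Standardise with the theoretical mean and the theoretical standard error
z_means <- (means - 1/lambda) / ((1/lambda) / sqrt(sample_size))

probs <- c(0.10, 0.25, 0.50, 0.75, 0.90)
comparison <- rbind(simulated = quantile(z_means, probs, names = FALSE),
                    normal    = qnorm(probs))
colnames(comparison) <- paste0(probs * 100, "%")
comparison
```

The same comparison can be drawn as a plot with qqnorm(z_means) followed by qqline(z_means); points lying close to the line support the normal approximation.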


Conclusion

The estimated mean (‘m’) of the 1,000 sample means, 4.9997, is very close to the exponential mean (mu). The distribution of sample means is therefore centered at about 5, which agrees closely with the theoretical center of the distribution (1/lambda).

The estimated standard deviation (‘s’), computed over all the simulated values, is 5.038. It’s also very close to the theoretical standard deviation (sd).

Even when we don’t know what the real population looks like, we can be 99% confident that the true mean is contained between the lower and upper limits of the interval.

Here we are in the special situation of knowing the true population (an exponential distribution). So, after the statistical inference, we can point out that the distribution of the mean of 40 exponentials is approximately normal.
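
How good is "approximately"? The mean of 40 independent exponentials with rate lambda follows exactly a gamma distribution with shape 40 and rate 40*lambda, so the normal approximation can be compared with the exact distribution directly. The sketch below is an addition, not part of the simulation above; the grid x and the names exact, approx and max_gap are illustrative choices.

```r
lambda <- 0.2; sample_size <- 40
x <- seq(2, 9, by = 0.01)                       # grid covering the bulk of the distribution

# Exact CDF of the mean of 40 exponentials (gamma) and its normal approximation
exact  <- pgamma(x, shape = sample_size, rate = sample_size * lambda)
approx <- pnorm(x, mean = 1/lambda, sd = (1/lambda) / sqrt(sample_size))

max_gap <- max(abs(exact - approx))             # largest CDF difference on the grid
max_gap
```

A small value of max_gap indicates that probabilities computed from the normal curve differ only slightly from the exact ones at this sample size.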