INTRODUCTION

The purpose of this analysis is investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution will be simulated in R with rexp(n, lambda) where lambda, the rate parameter will be set to 0.2 . Our study will lead us to investigate the distribution of averages of 40 exponentials and perform a thousand simulations. Note : - The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda.

OUR STUDY STEPS :

1. How do the sample mean behaves in comparison with the theoretical mean of the distribution ?

First of all, one should remember that the theoretical mean is expressed as the quantity 1/lambda; concerning our study, our theoretical mean is equal to 1/0.2=5.

Let us perform our simulation :
0. To be able to get the same results, let set the seed to 0 :) .
1. Generate our SIM_40_EXP_1000 matrix, 1000 lines and 40 cols exponentials with lambda=0.2 .
2. Extract our mean vector VECT_40AVG_EXP_1000 of 1000 lines; each line being the average of 40 exponentials .
3. Let us describe our mean vector distribution .
4. Deduce our mean sim_mean = mean(VECT_40AVG_EXP_1000) which should be near our theoretical mean.

install.packages("gridExtra", repos = "http://cran.us.r-project.org")

## package 'gridExtra' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\mekie_000\AppData\Local\Temp\Rtmpc7k2C8\downloaded_packages

library(ggplot2)
library(gridExtra)
# 0 - To be able to get the same results, let set the seed to
# 1000 :)
set.seed(1000)
# 1a - SIM_40_by_1000_EXP is a 40*1000 exponentials
SIM_40_by_1000_EXP <- rexp(40 * 1000, 0.2)
# 1b - SIM_40_EXP_1000 is a 1000 lines and 40 cols matrix
SIM_40_EXP_1000 <- matrix(SIM_40_by_1000_EXP, nrow = 1000, ncol = 40)
# dim(SIM_40_EXP_1000)
writeLines(paste("\n", "The SIM_40_EXP_1000 matrix is", dim(SIM_40_EXP_1000)[1], 
    "lines(set of 40 exponential distributions) and\n", dim(SIM_40_EXP_1000)[2], 
    "columns(each being our randomely simulated exponential of parameter 0.2).\n"))

## 
##  The SIM_40_EXP_1000 matrix is 1000 lines(set of 40 exponential distributions) and
##  40 columns(each being our randomely simulated exponential of parameter 0.2).

# 2 - VECT_40AVG_EXP_1000 is a 1000 lines vector
VECT_40AVG_EXP_1000 <- apply(SIM_40_EXP_1000, 1, mean)
# dim(VECT_40AVG_EXP_1000)
writeLines(paste("\n", "The SIM_40_EXP_1000 matrix is", dim(data.frame(VECT_40AVG_EXP_1000))[1], 
    "lines(set of 40 exponential 
distributions averages) and", 
    dim(data.frame(VECT_40AVG_EXP_1000))[2], "column(average of our 40 randomely simulated
exponential of parameter 0.2).\n"))

## 
##  The SIM_40_EXP_1000 matrix is 1000 lines(set of 40 exponential 
## distributions averages) and 1 column(average of our 40 randomely simulated
## exponential of parameter 0.2).

# 3 - Let us describe our mean vector distribution .
ggplot(data.frame(VECT_40AVG_EXP_1000), aes(x = VECT_40AVG_EXP_1000)) + 
    geom_bar(stat = "density", aes(fill = VECT_40AVG_EXP_1000), 
        color = "cyan") + geom_vline(xintercept = 5, color = "red", 
    size = 2) + ggtitle("Our mean vector distribution approximately centered around '5' ") + 
    labs(x = "Averages of 40 exponential distributions")

One can observe that our mean vector is centered around “5” and seem to be approximately normal.

# 4 - sim_mean our mean of the 40 averages of expnential with
# lambda=0.2
sim_mean <- round(mean(VECT_40AVG_EXP_1000), 4)
paste("Our simulation mean is equal to :", sim_mean)

## [1] "Our simulation mean is equal to : 4.987"

As we expected, our simulation mean is near our theoretical one “5”.

2. How variable the sample is (via variance) ? Let us compare it to the theoretical standard deviation of the distribution.

As previously, one should remember that the theoretical standard deviation is expressed as the quantity 1/lambda; concerning our study; therefore, our theoretical standard deviation is equal to 1/0.2=5.

Let us perform our simulation :
1. Generate our “SIM_40_EXP_1000” matrix, 1000 lines and 40 cols exponentials with lambda=0.2 . 2. Extract our mean vector “VECT_40AVG_EXP_1000” of 1000 lines; each line being the average of 40 exponentials.
3. Deduce our standard deviation sim_sd = sd(VECT_40AVG_EXP_1000 which should be near our theoretical standard deviation.

As the 2 first steps were performed previously, les us execute the 3rd and last one :

# 3 - sim_sd our standard deviation of the 40 averages of
# expnential with lambda=0.2
sim_sd <- round(sd(VECT_40AVG_EXP_1000) * sqrt(40), 4)
paste("Our simulation standard deviation is equal to :", sim_sd)

## [1] "Our simulation standard deviation is equal to : 5.1317"

As we expected, our simulation standard error is near our theoretical one “5”.
The outstanding quantity to fit the theoretical standard deviation can be explained by the combination of the “Large Law Numbers”; in fact, increasing our number of simulation will lead us to our theoretical value.

3. Let us show that the distribution yielded from the averages is approximately normal.

For this section, we will proceed in 2 steps:
1. Graphically compare our previous mean vector of set of 40 exponential averages to the generated dataframe of 40,000 exponential.
2. Compare (graphically and statistically) compare our previous mean vector with a set 1,000 normal distribution of mean=5 and StandardDev=5.

# 1 - Graphically compare our previous mean vector of set of
# 40 exponential averages to the generated dataframe of 40000
# exponential.
#----------------------------------
par(mfcol = c(1, 2), mar = c(4, 4, 4, 4))
grid.arrange(ggplot(data.frame(VECT_40AVG_EXP_1000), aes(x = VECT_40AVG_EXP_1000)) + 
    geom_bar(stat = "density", aes(fill = VECT_40AVG_EXP_1000), 
        color = "cyan") + geom_vline(xintercept = 5, color = "red", 
    size = 2) + ggtitle("Our mean vector distribution approximately centered around '5' ") + 
    labs(x = "Averages of 40 exponential distributions"), ggplot(data.frame(SIM_40_by_1000_EXP), 
    aes(x = SIM_40_by_1000_EXP)) + geom_bar(stat = "density", 
    aes(fill = SIM_40_by_1000_EXP), color = "green") + geom_vline(xintercept = 5, 
    color = "red", size = 1) + ggtitle("Our 40,000 Exponential distributions ") + 
    labs(x = "Exponential distributions"), nrow = 1)

As we can see, our mean vector is far more gaussian than our initial set of 40,000 exponentials.

# 2 - Graphically compare our previous mean vector of set of
# 40 exponential averages to the generated dataframe of 40000
# exponential. 2a - Graphically comparison
#----------------------------------
par(mfcol = c(1, 2), mar = c(4, 4, 4, 4))
VECT_1000_NORM <- rnorm(1000, mean = 5, sd = 5)

grid.arrange(ggplot(data.frame(VECT_40AVG_EXP_1000), aes(x = VECT_40AVG_EXP_1000)) + 
    geom_bar(stat = "density", aes(fill = VECT_40AVG_EXP_1000), 
        color = "cyan") + geom_vline(xintercept = 5, color = "red", 
    size = 2) + ggtitle("Our mean vector distribution approximately centered around '5' ") + 
    labs(x = "Averages of 40 exponential distributions"), ggplot(data.frame(VECT_1000_NORM), 
    aes(x = VECT_1000_NORM)) + geom_bar(stat = "density", aes(fill = VECT_1000_NORM), 
    color = "green") + geom_vline(xintercept = 5, color = "red", 
    size = 2) + ggtitle("Our 10,000 Normal distributions of mean=5 and Stdev=5") + 
    labs(x = "Normal distributions"), nrow = 1)

As we can see, our mean vector distribution is not far from the real gaussian shown by the vector of 1000 normals.

Let us look more statistically at the differencies :
- As exponential averages are made from set of 40 exponentials (over “30”),
- As our graphical analysis have shown us some approximately normal distribution,
- As the variances are approximately equal, let us assume that they are.

We can assume that our mean vector of 40 averages is at least following the Student Gosset distribution. Following our assumption, let also assume :
- H0 hypothesis as the distrubutions are approximately the same, resulting in the difference between them equal to “0”;
- Ha hypothesis as the distrubutions are the not approximately same, resulting in the difference between them not equal to “0”;
We may perform a two-sided t.test on the difference of our “mean vector” and our “normal vector simulation”.

Setting the alpha parameter to 5% (==> 95% confidence interval); the beta to 10% (minimal required power to 90%), the yielded confidence interval off the difference is :

# 2b - Statistical comparison
#----------------------------------
EXPOS <- data.frame(VECT_40AVG_EXP_1000)
names(EXPOS) <- c("EXPOS")
NORMS <- data.frame(VECT_1000_NORM)
names(NORMS) <- c("NORMS")
# t.test compilation
the_test <- t.test(x = EXPOS$EXPOS, y = NORMS$NORMS, var.equal = FALSE, 
    paired = FALSE)
# preparing our comon Stdev
sd1 <- sim_sd
sd2 <- 5
n1 <- 1000
n2 <- 1000
sdg <- (((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2)/(n1 + n2 - 2))^0.5
s_tot = sdg * (((1/n1) + (1/n2))^0.5)
# power.t.test -- get the corresponding power
Thepower <- power.t.test(n = 1000, delta = EXPOS$EXPOS - NORMS$NORMS, 
    sd = s_tot)
# Start writing concusions
writeLines(paste("-> Lower border =", round(the_test$conf[1], 
    4), "\n-> Upper border =", round(the_test$conf[2], 4), "\n-> P-Value =", 
    round(the_test$p.value, 4), "; largely over our 5% type1 error rate.", 
    "\n-> The corresponding power =", round(mean(Thepower$power), 
        4), "; highly reducing type2 error rate."))

## -> Lower border = -0.2975 
## -> Upper border = 0.3529 
## -> P-Value = 0.8673 ; largely over our 5% type1 error rate. 
## -> The corresponding power = 0.9946 ; highly reducing type2 error rate.

CONCLUSION

We can then easily conclude that H0, our null hyppothesis is confirmed; our difference is centered around “0”, with the p-value largely over 5% (= 0.05) and the corresponding power near 100%.
Our mean vector distribution of exponential averages is graphically and statistically “approximately normal”.

Statistical Inference : Peer Assessment 1 – PART1 – Exponential distribution & ‘Central Limit Theorem’

Ralph Kevin MEKIE

19 february 2019