This document is the final report of the Peer Assessment project - Part 1 from Coursera’s course Statistical Inference, as part of the Specialization in Data Science. It was built up in RStudio, using its knitr functions, meant to be published in pdf format.
Here, we investigate the exponential distribution in R and compare it with the Central Limit Theorem.
The exponential distribution can be simulated in R with the command rexp(n, lambda) where lambda is the rate parameter.
In probability theory and statistics, the Exponential Distribution is the probability distribution that describes the time between events in a Poisson process, i.e. a process in which events occur continuously and independently at a constant average rate.
The probability density function (pdf) of an exponential distribution is
\(f(x;\lambda) = \begin{cases} \lambda e^{-\lambda x} & x \ge 0, \\0 & x < 0.\end{cases}\)
The mean of exponential distribution is \(1/\lambda\) and the standard deviation is also \(1/\lambda\).
The graphical behavior of the Exponential Distribution can be better visualized with a small simulation of 200 data shown in the graphs below:
n <- 200
x <- 1:n
lambda <- 0.2
y <- rexp(n,lambda)
z <- data.frame(x,y)
graph01 <- ggplot(z, aes(y)) +
geom_histogram(binwidth = 1, colour = "#999999", fill= "#000099") +
ggtitle(expression(atop("Exponential Distribution",
atop(italic("n = 200, lambda = 0.2"),"")))) +
xlab("rexp(n,lambda)")
graph02 <- ggplot(z, aes(y)) +
geom_line(stat = "density") +
stat_function(fun = dexp, arg=list(rate=lambda), colour = "red") +
ggtitle(expression(atop("Exponential Distribution",
atop(italic("n = 200, lambda = 0.2"),"")))) +
xlab("rexp(n,lambda)")
grid.arrange(graph01, graph02, ncol = 2)
For all simulations, we set \(\lambda\)(lambda) = 0.2 and the seed as 1234, using the comand set.seed(123).
We will investigate the distribution of averages of 40 exponential datasets. Note that we will use a thousand simulations to guarantee data consistency and then compare the distribution they generate with both the normal distribution and the results that are predicted by the Central Limit Theorem..
The matrix with the 1000 datasets (rows) and 40 random exponential data each (columns) is obtained by:
numberSimul <- 1000
lambda <- 0.2
numberDatasets <- 40
set.seed(123)
dataExp <- rexp(numberSimul * numberDatasets, rate=lambda)
simulData <- matrix(dataExp, numberSimul, numberDatasets)
simulDataFR <- as.data.frame(simulData)
The vector with the averages of each dataset is:
simulMean <- rowMeans(simulData)
countMean <- 1:numberSimul
simulMeanFR <- data.frame(countMean, simulMean)
The distribution of averages has sample mean (\(\bar x\)) equal to the population mean (\(\mu\)). The sample data mean is :
sampleMean <- mean(simulMean)
sampleMean
## [1] 5.011911
In an Exponential Distribution, the predicted mathematical mean is \(\mu=1/\lambda\), where \(\mu= 1/\lambda = 1/0.2 = 5.00\).
The distribution of the averages of the datasets can be seen in the following graph:
ggplot(simulMeanFR, aes(simulMean)) +
geom_histogram(binwidth = 0.1, colour = "#999999", fill="#000099") +
geom_vline(xintercept = sampleMean, colour = "red", size=2) +
ggtitle(expression(atop("Distribution of Averages",
atop(italic("1000 datasets of 40 exponentials, lambda = 0.2"),"")))) +
xlab("averages")
In this case, we can conclude they are almost exactly the same as predicted.
The distribution of averages has its variance as \(\sigma^2 = s^2 / n\) , or :
sampleSD <- sd(simulMean)
sampleVar <- var(simulMean)
sampleVar
## [1] 0.6088292
The theoretical variance is defined as \(\sigma^2 = (1 / \lambda) ^ 2 / n\), or \(\sigma^2 = (1 / 0.2)^2 / 40 = 0.625\).
And, once again, the two numers (0.609 and 0.625) are very close to each other, as predicted by the CLT.
The approximation of the sample distribution to a Normal Distribution can be visually seen with the following graph:
ggplot(simulMeanFR, aes(simulMean) ) +
geom_histogram(aes(y = ..density..), binwidth=0.1,
colour = "black", fill = "#F8766D") +
geom_density(stat = "density", colour="blue",size=1.2) +
geom_vline(xintercept = sampleMean, colour = "blue", size=1) +
geom_vline(xintercept = 5, colour = "red", size=1) +
stat_function(fun = dnorm, arg = list(mean = sampleMean, sd = sampleSD),
colour = "red", size=1.5) +
ggtitle(expression(atop("Distribution of Averages",
atop(italic("1000 datasets of 40 exponentials,lambda = 0.2"),"")))) +
xlab("averages")
The averages are almos coincident and the shapes of the sample and the theoretical normal curves are very close.
The normality plot will look like :
qqnorm(simulMean)
qqline(simulMean, col = "red", lwd = 1)
Which are also very close.In any case, it is also recommended to perform a normality test to check the assumptions.
A Kolmogorov-Smirnov test can be used for this example.
ks.test(simulMean, "pnorm", mean = sampleMean, sd = sampleSD)
##
## One-sample Kolmogorov-Smirnov test
##
## data: simulMean
## D = 0.0206, p-value = 0.7881
## alternative hypothesis: two-sided
With such a high value for p, there is no evidence that we should doubt the distribution is normal, as stated by the Central Limit Theorem.