This report covers part 1 of the course project of the Statistical Inference course of the John Hopkins University, for which a simulation exercise has been performed. Accordingly, this report will cover the simulation of exponential distributions from which a distribution of averages is generated, which is then used to:
Below you can find an overview of the packages which were used to generate the results in this report. Make sure you have these installed and loaded, before running the rest of this document’s code in R (or R Studio).
library(ggplot2)
First of all, the exponential distribution must be simulated. As stated in the exercise, 1000 means must be obtained from simulated exponential distributions, where each mean is calculated using 40 simulated exponentials. Lambda is set equal to 0.2.
In order to do so, the functions rexp() and sapply() can be used to run 1000 simulations, in which each contains 40 exponentials. Subsequently, apply() can be used to calculate the mean of each of these simulations, returning a vector of means which can then be used to answer the questions of this exercise. Note that the set.seed() function is used to ensure that the same exponentials are generated if the code is rerun. The code to achieve the above is as follows:
# Parameters and seed
set.seed(1122)
nsim=1000
n=40
lambda=.2
# Simulation of the distributions
sim <- sapply(1:nsim, function(i) rexp(n=n, rate=lambda))
# Calculation of the means
means <- apply(sim, 2, mean)
After running this code, sim contains the simulations of the distributions, while means contains the means of each of these distributions.
Using the mean() function, the mean of the sample can be calculated. The theoretical mean of the exponential distribution is calculated as 1/lambda. Hence, the required code is as follows:
# Sample mean
mean(means)
## [1] 4.981784
# Theoretical mean
1/lambda
## [1] 5
Accordingly, it can be seen that the sample mean of 4.982 is very close to the theoretical mean of 5. To further prove this point, the graph below shows the distribution of the sample with a line plotted at the theoretical mean. ggplot2 is used to generate the graph:
g <- ggplot()
g +
aes(means) +
geom_histogram(bins=40, fill="Darkblue") +
geom_vline(xintercept=1/.2, lwd=1.5, col="Darkgrey") +
labs(x="Mean of simulated sample", y="Frequency", title="Distribution of means of simulated data")
As can be seen in the graph, the distribution seems to be centered around the theoretical mean, which further proves the point that the sample mean is very close to it.
To calculate the variance of the simulated sample, the sd() function can be used to obtain the standard error, which can then be squared to obtain the sample’s estimated variance of the mean. To obtain the theoretical standard error of the mean, lambda can be multiplied by 1 over the square root of n. Subsequently squaring this term gives the theoretical variance. The code to do so is as follows:
# Sample standard error and variance
vs <- sd(means)
vs^2
## [1] 0.6127445
# Theoretical standard error and variance
vh <- ((1/.2)/sqrt(n))
vh^2
## [1] 0.625
As previously seen with the means, the sample’s variance of 0.613 is also close to the theoretical variance of 0.625.
Before comparing the distribution of the means to the normal distribution, it is interesting to analyze the effect of using 1000 means of 40 distributions of exponentials, instead of simply using the distribution of 40000 exponentials. Below, you can see a histogram of 40000 exponentials.
g +
aes(rexp(n=40000, rate=lambda)) +
geom_histogram(fill="Darkblue", aes(y=..density..)) +
labs(x="Randomly generated exponentials", y="Frequency",
title="Histogram of randomly generated exponentials")
As is (hopefully) apparant, the distribution is not very similar to the normal distribution (as it is substantially, positively skewed). In the next graph, the distribution of the means of the 1000 exponential distributions is shown.
g +
aes(means) +
geom_histogram(aes(y=..density..), fill="Darkblue", bins=40) +
stat_function(fun=dnorm, args=list(mean=1/lambda, sd=(1/lambda)/sqrt(n)),
lwd=1.25, col="Darkred", lty=2) +
geom_density(col="Blue", lwd=1.25) +
labs(x="Means of simulated sample", y="Density", title="Density plot of simulated data")
As can be seen in the graph, this distribution approximates the normal distribution far better than the previous one. To make this comparison somewhat easier, the distribution of the sample is plotted as the light blue curve, whereas the normal distribution is plotted as the dashed, dark red line. As is apparent, these distributions do not differ that much from each other.
To bring the point home, the last part of this report shows the qqplot of the exponential distribution, in which the sample’s distribution is compared to the normal distribution. The closer these two are to eachother, the more the line in the graph will approximate a line from the origin to the right, upper corner of the graph. The code to generate said plot is as follows:
qqnorm(means)
Although somewhat redundant at this point, it is clear that the line approximates a line from the origin to the right upper corner of the graph, again showing that the simulated distribution approximates the normal distribution.