By: Leslie Self
For the Statistical Inference Course Project, I analyzed the exponential distribution sample of means and variance and compared that to the theoretical mean and variance and walked through how the sample distribution of the sample means will look more like a normal distribution the higher the sample size. From a variance perspective, regardless of sample size, the sample variance was close to the theoretical variance. With the sample mean, the higher your sample size, the more like the sample mean becomes to the theoretical mean.
In this section, I will investigate the average of 40 random exponential using 1000 simulations and compare the mean of the simulations to the theoretical mean. To get that sample, I need to create a data set by looping through 1000 simulations, where it will take the rexp function, which provides the exponential, to take 40 averages and get the mean where lambda = .2. Lambda = .2 is a set variable that was given as part of this assignment.
set.seed(1)
mns = NULL
lambda <- .2
n <- 40
for (i in 1 : 1000) mns = c(mns, mean(rexp(n, lambda)))
ExpMean <- mean(mns)
So of our random sample of 40 exponenetials at 1000 simulations, we get a sample mean of 4.9900252. Plotting those samples, you get
ExpSimulation <- as.data.frame(mns)
names(ExpSimulation) <- "MeanSample"
ggplot(ExpSimulation, aes(x=MeanSample)) +
geom_histogram(color="lightblue", fill = "lightblue", binwidth=1, aes(y=..density..)) +
geom_density(aes(y=..density..)) + ggtitle("Histogram of Sample\nExponential Distribution") +
geom_vline(xintercept=ExpMean, linetype="dotted")
Now that I have the sample mean, I will explain the theoretical mean. The Theoretical mean of an exponential distribution can be found using the formula: 1/lambda. So the theoretical mean in our case with a lambda = .2 is 1/.2 or 5. So using a 1000 simulations of 40 random exponentials, I get a mean of 4.9900252 which is very close to the theoretical mean which is 5. So in my simulation, taking 40 random exponentials, gets me almost to the theoretical mean of the distribution.
So in the example above, I looked at sample mean and compared that to the theoretical mean of an exponential distribution. Now I am going to do that again and use variance instead of mean. Variance is calculated by squaring the standard deviation. Standard Deviation for exponential distributions is the same as the mean or 1/lambda. So to calculate variance it is (1/lambda)^2. So following the example above, if I take the standard deviation of 1000 simulations of 40 exponential values, I can then take the average of that standard deviation vector and square it to get variance.
lambda <- .2
n <- 40
for (i in 1 : 1000) stddev = c(stddev, sd(rexp(n, lambda)))
ExpVar <- (mean(stddev))^2 #get the average Standard Deviation and then square it to get var
ExpVar
## [1] 23.96863
Taking it a step further, if I change the sample to get the average variance of 5 random exponentials instead of 40, I get 23.9686311. If I go the other direction, and take a sample size of 100, then I get a variance 24.3830181. Plotting for 5, 40 and 100 variances shows where all the sample variances hover around the 25 mark. Please note, I hid the r code that created the 5 and 100 variance vectors as well as the code to take my vectors to a dataframe called ExpVarianceDS.
ggplot(ExpVarianceDS, aes(x = variance, fill = size)) + geom_histogram(aes(y = ..density..)) +
ggtitle("Histograms of Variances") + facet_grid(. ~ size)
The variance doesn’t really change much with differences in sample size, but lets see how the sample variance is compared to the theoretical variance as the mean of the variance hovers around 25 in all three cases.
The theoretical variance of an exponential distribution is (1/lambda)^2. So in our case, with lambda = .2, which was given in the assignment, the theoretical variance should be (1/.2)^2 or 25. So using a 1000 simulations of 5, 40, and 100 random exponentials, from the datasets above, I get 23.9686311, 23.9686311, and 24.3830181 respectively, where all three are close to the theoretical mean which is 25.
So in the class, I have learned that it doesn’t matter what the population distribution looks like, if you take enough samples of that distribution, and take the mean and plot those on a graph, that sample distribution will look more and more normal. The higher the sample size of values chosen, the more normal it looks. This can be demonstrated through a couple plots. On the first plot, this is what a population exponential distribution resembles.
require("ggplot2")
set.seed(1)
ExpDS <- as.data.frame(rexp(1000, .2))
names(ExpDS) <- "obs"
ggplot(ExpDS, aes(x=obs)) + geom_histogram(color="lightblue", fill = "lightblue", binwidth=1, aes(y=..density..)) +
geom_density(aes(y=..density..)) + ggtitle("Histogram of\nExponential Distribution")
Below shows a plot with sample of 10, 50, and 100 averages of random exponential values. The sample distribution looks more and more normal, the higher the sample size. Again, I hid the code that just goes through what I did in the first example above for means for 10, 50 and 100 exponentials and combined those vectors into a dataframe called ExpMeansDS.
ggplot(ExpMeansDS, aes(x = Means, fill = size)) + geom_histogram(aes(y = ..density..), binwidth=1) +
geom_density(size = 1, aes(y=..density..)) + ggtitle("Histograms of Different Samples") + facet_grid(. ~ size)
So even though the population distribution of a exponential distribution looks nothing like a normal distribution, the samples I got in the simulation of 10, 50, 100 random exponentials 1000 times, the normal distribution is showing through, especially the higher the sample size.