First, we turn the warnings off:
knitr::opts_chunk$set(warning = FALSE)
In the first part of the Coursera Statistical Inference course project, we check the Central Limit Theorem by simulation. The assessment includes some exploratory plots, with the help of which we conclude the analysis and show that the distribution of the sample means is approximately normal.
In this project you will investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda. Set lambda = 0.2 for all of the simulations. You will investigate the distribution of averages of 40 exponentials. Note that you will need to do a thousand simulations.
Illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponentials. You should:
1. Show the sample mean and compare it to the theoretical mean of the distribution.
2. Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.
3. Show that the distribution is approximately normal.
In point 3, focus on the difference between the distribution of a large collection of random exponentials and the distribution of a large collection of averages of 40 exponentials.
For the simulation, lambda = 0.2 throughout, and each of the 1000 simulated averages is the mean of 40 draws from rexp(n, lambda).
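The mean and the standard deviation of the exponential distribution are both 1/lambda. As a quick check, the sketch below draws a large number of exponentials and compares their sample mean and standard deviation with 1/lambda. The sample size of 100000, the seed, and the name `draws` are illustrative choices, not part of the assignment.

```r
# Illustrative check: mean and sd of exponential draws should both be near 1/lambda
set.seed(1)                      # arbitrary seed, used only for reproducibility
lambda <- 0.2
draws <- rexp(100000, lambda)    # a large collection of individual exponentials
c(mean = mean(draws), sd = sd(draws), theoretical = 1 / lambda)
```

Both sample values should land close to 5, with small differences due to sampling noise.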
set.seed(12345)
lambda <- 0.2
s_size <- 1000
n <- 40
simulated_sample <- replicate(s_size, rexp(n, lambda))
means_of_exponentials <- apply(simulated_sample, 2, mean)
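When the expression returns a vector, replicate() builds a matrix with one column per simulation, which is why apply() runs over margin 2 here. A minimal sketch on a smaller, hypothetical example (5 simulations of 3 draws, with the illustrative name `small`) shows the layout and suggests that colMeans() is an equivalent shortcut:

```r
set.seed(12345)
small <- replicate(5, rexp(3, 0.2))   # hypothetical small example
dim(small)                            # rows = draws per sample, columns = simulations
via_apply <- apply(small, 2, mean)    # column-wise means, as in the report
via_colmeans <- colMeans(small)       # same computation with a dedicated function
all.equal(via_apply, via_colmeans)
```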
The theoretical mean and the sample mean of the data are calculated:
s_mean <- mean(means_of_exponentials)
t_mean <- 1/lambda
s_mean
## [1] 4.971972
t_mean
## [1] 5
The sample mean is 4.971972 and the theoretical mean is 5, which are very close.
s_var <- var(means_of_exponentials)
t_var <- (1 / lambda)^2 / (n)
s_sd <- sd(means_of_exponentials)
t_sd <- 1/(lambda * sqrt(n))
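The theoretical values follow from the variance of an average of n independent draws, each with standard deviation 1/lambda. As a worked step, with lambda = 0.2 and n = 40:

```latex
\operatorname{Var}(\bar{X}) = \frac{\sigma^2}{n} = \frac{(1/\lambda)^2}{n} = \frac{25}{40} = 0.625,
\qquad
\operatorname{sd}(\bar{X}) = \frac{1}{\lambda\sqrt{n}} = \frac{5}{\sqrt{40}} \approx 0.7906
```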
Now we check the variances and standard deviations of the sample and of the theoretical distribution:
s_var
## [1] 0.5954369
t_var
## [1] 0.625
s_sd
## [1] 0.7716456
t_sd
## [1] 0.7905694
The sample variance (0.595) is close to the theoretical variance (0.625), and the sample standard deviation (0.772) is close to the theoretical one (0.791). Now it's time for the plot.
finaldata <- data.frame(means_of_exponentials)
library(ggplot2)
pl <- ggplot(finaldata, aes(x = means_of_exponentials))
pl <- pl + geom_histogram(aes(y = ..density..), fill = "grey66", color = "grey")
pl <- pl + labs(title = "Distribution of means of 40 Samples", x = "Mean of 40 Samples", y = "Density")
pl <- pl + geom_vline(aes(xintercept = s_mean, colour = "sample"))
pl <- pl + geom_vline(aes(xintercept = t_mean, colour = "theoretical"))
pl <- pl + stat_function(fun = dnorm, args = list(mean = s_mean, sd = s_sd), color = "gold1", size = 1.0)
pl <- pl + stat_function(fun = dnorm, args = list(mean = t_mean, sd = t_sd), colour = "red", size = 1.0)
pl
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The density of the actual data is shown by the grey bars. The vertical lines mark the theoretical mean and the sample mean, which are so close that they nearly overlap. The red curve is the normal density with the theoretical mean and standard deviation. The gold curve is the normal density with the sample mean and standard deviation. As the graph shows, the distribution of means of 40 exponentials is close to the normal distribution with the theoretical values implied by the given lambda.
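The assignment also asks for a contrast between a large collection of raw exponentials and a large collection of averages of 40 exponentials. A minimal sketch of that contrast follows, using a simple skewness helper. The function `skew`, the seed, and the names `raw` and `avgs` are hypothetical, written only for this illustration. Raw exponentials have theoretical skewness 2, while averages of 40 have skewness 2/sqrt(40), about 0.32, so the averages should look far more symmetric.

```r
set.seed(2024)                                        # arbitrary seed for this illustration
lambda <- 0.2
raw <- rexp(1000, lambda)                             # 1000 individual exponentials
avgs <- colMeans(replicate(1000, rexp(40, lambda)))   # 1000 averages of 40 exponentials

# Hypothetical helper: third central moment divided by the cubed standard deviation
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

c(raw = skew(raw), averages = skew(avgs))
```

A histogram of `raw` would show the long right tail of the exponential, while a histogram of `avgs` would show the bell shape seen in the plot of the means.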
# The sd of the means is already the standard error of a mean of 40 draws,
# so it is not divided by sqrt(n) again.
s_confinterval <- round(s_mean + c(-1, 1) * 1.96 * s_sd, 3)
t_confinterval <- round(t_mean + c(-1, 1) * 1.96 * t_sd, 3)
s_confinterval
## [1] 3.460 6.484
t_confinterval
## [1] 3.45 6.55
Hence the 95% intervals for the mean of 40 exponentials, one based on the simulated means and one based on theory, are very close.
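As a rough sanity check on a 95% interval for the mean of 40 exponentials, one can count how many simulated means fall within 1.96 theoretical standard deviations of the theoretical mean. Under the normal approximation this share should be near 0.95. The sketch below is self-contained, and the seed and the names `sim_means`, `half_width`, and `inside` are illustrative assumptions.

```r
set.seed(99)                                  # arbitrary seed for this illustration
lambda <- 0.2
n <- 40
sim_means <- colMeans(replicate(1000, rexp(n, lambda)))
half_width <- 1.96 / (lambda * sqrt(n))       # 1.96 theoretical sds of the mean
inside <- abs(sim_means - 1 / lambda) <= half_width
mean(inside)                                  # share of simulated means inside the interval
```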