In this project we: - generate 40 exponentials as our sample - resample with replacement 1000 times - calculate both theoritical and sample statistics and compare
We illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponentials. We will:
1- Show the sample mean and compare it to the theoretical mean of the distribution. 2- Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution. 3- Show that the distribution is approximately normal.
library(data.table)
library(ggplot2)
library(stats)
echo = TRUE # display the code
results = 'asis' # display the output without formatting
Set some of the variables as predefined in the project
set.seed(777)
lambda <- 0.2 # the rate
n <- 40 # exponential sample size
nsim <- 1000 # no of simulations
Create the data table
sampleData <- matrix(rexp(nsim*n, lambda), nrow=nsim)
dim(sampleData)
## [1] 1000 40
Now we have a matrix with 1000 rows and 40 columns of random data with exponential distribution
Show the sample mean and compare it to the theoretical mean of the distribution
th_mean <- 1/lambda
th_mean
## [1] 5
sim_mean <- rowMeans(sampleData) # calculate the average for each row
sam_mean <- mean(sim_mean)
sam_mean
## [1] 4.969789
Calculate the difference between the means
mean_diff <- th_mean - sam_mean
mean_diff
## [1] 0.03021104
The difference between the theoritical mean (expected mean) and the sample mean is 0.03021104.
Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution
th_variance <- (1/lambda)^2 / n
th_variance
## [1] 0.625
var(sim_mean)
## [1] 0.6028996
The theoritical variance (expected variance) and the sample variance are very close.
Doing this comparison because we are going to use the results for plotting
th_sd <- 1/(lambda * sqrt(n))
th_sd
## [1] 0.7905694
sam_sd <- sd(sim_mean)
sam_sd
## [1] 0.7764661
Here as well, the difference between the expected sd and the sample sd is insignificant.
Since the theoritical statistics are very close to sample statistics the thickness of the lines for the plots are going to be different to be able to identify them in case of an overlap.
df <- data.frame(sim_mean)
plot_1 <- ggplot(df) +
geom_histogram(mapping = aes(x = sim_mean , y = stat(density)), color = "black", fill = "lightgray") +
geom_vline(xintercept = th_mean, size = 2, color = "red") +
geom_vline(xintercept = sam_mean, size = 1, color = "blue")
plot_1
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As expected, theoritical mean (red) and sample mean (blue) are overlapping.
Now, let’s plot the standard deviations superimposed on the same plot of the means
plot_2 <- plot_1 +
stat_function(fun = dnorm, args = list(mean = th_mean, sd = th_sd), color = "red", size = 2) +
stat_function(fun = dnorm, args = list(mean = sam_mean, sd = sam_sd), color = "blue", size = 1)
plot_2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As you can see, the distribution of means of 40 exponential distributions is close to a normal distribution. Compare this graph with the one below for 1000 random exponentials:
hist(rexp(1000, lambda), breaks = 20)
Theoritical confidence intervals
th_CI <- th_mean + c(-1,1) * 1.96 * sqrt(th_variance) / sqrt(n)
th_CI
## [1] 4.755 5.245
Sample confidence interval
sam_CI <- mean(sim_mean) + c(-1,1) * 1.96 * sd(sim_mean)/sqrt(n)
sam_CI
## [1] 4.729160 5.210418
The intervals are close as well.
Show that the distribution is approximately normal.
Perform a visual inspection using Q-Q plots The Q-Q plot shows that normality is probably a reasonably good approximation.
library(car)
## Loading required package: carData
qqPlot(sim_mean)
## [1] 291 98
If the data is normally distributed, the points in the QQ-normal plot lie on a straight diagonal line. As we can see in the above graph.
Let’s plot one for the exponential distribution so that we can see the difference.
qqPlot(rexp(1000, lambda))
## [1] 455 629
qqnorm is a generic function the default method of which produces a normal QQ plot of the values in y. qqline adds a line to a normal quantile-quantile plot which passes through the first and third quartiles.
qqnorm(sim_mean) ; qqline(sim_mean, col = "red")
qqnorm(rexp(1000, lambda)); qqline(rexp(1000, lambda), col = "red")
By comparing the graph of the distribution of 1000 random exponentials and the the graph of the distribution of 1000 means of 40 exponential distributions, we can see how Cnetral Limit Theorem manifests as a normal distribution.