This analysis is carried out as part I of the Statistical Inference course project. In this assignment we will explore an exponential distribution and learn about its mean and variance.
First, we will draw a sample of size \(n = 40\) from an exponential distribution with rate \(\lambda = 0.2\) and explore it with a density plot.
The Central Limit Theorem (CLT) states that the sampling distribution of the sample mean approaches a normal distribution centered on the population mean as the sample size grows. We will therefore simulate a large number of such samples and examine their means to see whether they behave as the CLT predicts.
The following code draws \(n = 40\) values from an exponential distribution with rate \(\lambda = 0.2\) and stores them in a variable called simulation.
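As a compact restatement in standard textbook notation (the symbols \(X_i\), \(\bar{X}_n\), \(\mu\) and \(\sigma\) are introduced here for illustration and do not appear in the original text): for independent draws \(X_1, \dots, X_n\) from a distribution with mean \(\mu\) and finite variance \(\sigma^2\),

\[
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad
\frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma} \xrightarrow{\;d\;} \mathcal{N}(0, 1)
\quad \text{as } n \to \infty .
\]

For the exponential distribution used here, \(\mu = \sigma = 1/\lambda = 5\), so with \(n = 40\) the sample mean should be approximately \(\mathcal{N}(5,\ 25/40) = \mathcal{N}(5,\ 0.625)\).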
n <- 40; lambda <- 0.2
simulation <- rexp(n, rate=lambda)
To get an idea of how these values are distributed, let’s plot their density.
library(ggplot2); theme_set(theme_bw(12))
ggplot(data.frame(values=simulation), aes(values)) + geom_density() +
labs(
title = 'Density plot of the values in our simulation',
caption = 'Fig 1: This density plot indicates the distribution of values in our simulated exponential distribution'
) + theme(
plot.title = element_text(hjust = 0.5),
plot.caption = element_text(hjust = 0)
)
In theory, both the mean and the standard deviation of an exponential distribution are \(1/\lambda\), so its variance is \(\sigma^2 = 1/\lambda^2\). In our case these values are \(1/0.2 = 5\) and \(5^2 = 25\) respectively. Let’s check whether the mean and variance of our sample match the theory.
c(mean = mean(simulation), variance = var(simulation))
## mean variance
## 4.825273 18.941988
As shown above, the sample mean (\(4.83\)) is close to its theoretical value of \(5\), while the sample variance (\(18.94\)) falls noticeably short of \(25\). Deviations of this size are expected from a single sample of only 40 values.
The above is a single sample of 40 values. To explore the sampling distribution of the mean, we need many such samples. Recall that, by the CLT, the sample means should be approximately normally distributed around the population mean.
The following code draws 10000 samples, each of size \(n=40\), from an exponential distribution with \(\lambda=.2\), and stores them in a matrix called s10k. Each row of s10k holds one sample.
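To put the gap between the observed and theoretical mean in perspective, the following sketch computes the standard error of the mean for a sample of 40. The variable names (`se_mean`, `observed_mean`, `z`) are illustrative and not part of the original analysis; the observed value is copied from the output above.

```r
# Values taken from the text: n = 40, lambda = 0.2
n <- 40
lambda <- 0.2
mu <- 1 / lambda             # theoretical mean
sigma <- 1 / lambda          # theoretical standard deviation
se_mean <- sigma / sqrt(n)   # standard error of the mean of n draws

observed_mean <- 4.825273    # sample mean reported above
z <- (observed_mean - mu) / se_mean  # distance from mu in standard errors
c(se_mean = se_mean, z = z)
```

The standard error is roughly 0.79, so the observed mean lies well within one standard error of 5.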
s10k <- t(sapply(1:10000, function(i) rexp(40, rate=.2)))
dim(s10k)
## [1] 10000 40
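An equivalent matrix can also be built in a single vectorized call. This is only a sketch: the seed value and the names `n_sims` and `sims` are hypothetical, and because the original run did not set a seed, this will not reproduce the exact numbers reported in this document.

```r
set.seed(2024)   # hypothetical seed, chosen only so reruns give the same draws
n_sims <- 10000  # number of samples
n <- 40          # size of each sample
lambda <- 0.2    # rate of the exponential distribution

# Draw all values at once and arrange them as one sample per row
sims <- matrix(rexp(n_sims * n, rate = lambda), nrow = n_sims, ncol = n)
dim(sims)
```

Since all draws are independent and identically distributed, how they are arranged into rows does not affect the statistical properties of the samples.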
The sample mean is an unbiased estimator of the population mean, so the means of the 10000 samples in s10k should average close to the theoretical population mean of \(5\). Let’s check this, and also compare the spread within the samples to the theoretical variance of \(1/\lambda^2\), i.e., \(25\).
c(averageMean = mean(rowMeans(s10k)), averageVariance = mean(apply(s10k, 1, sd))^2)
## averageMean averageVariance
## 5.00277 23.92067
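Indeed, unlike the single sample tested above, the means of the 10000 samples in s10k average very close to the theoretical population mean of \(5\). This agreement follows from the unbiasedness of the sample mean and the law of large numbers; the CLT concerns the shape of the distribution of the means. The variance figure of \(23.92\) is also near \(25\) but somewhat low, because the code above squares the average sample standard deviation, and the sample standard deviation tends to underestimate \(\sigma\).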
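The variance figure above is computed as the square of the mean sample standard deviation, which is not the same as the mean of the sample variances. The sketch below compares the two on a freshly simulated matrix; the seed and the names `sims`, `sample_sds` and `sample_vars` are assumptions made for illustration, so the numbers will differ from those reported above.

```r
set.seed(1)  # hypothetical seed for repeatability
sims <- matrix(rexp(10000 * 40, rate = 0.2), nrow = 10000)

sample_sds <- apply(sims, 1, sd)    # one standard deviation per sample
sample_vars <- apply(sims, 1, var)  # one variance per sample

c(
  squaredMeanSd = mean(sample_sds)^2,  # what the code above computes
  meanVariance  = mean(sample_vars)    # average of the sample variances
)
```

The first quantity is always at most the second (Jensen's inequality), while the sample variance is an unbiased estimator of \(\sigma^2\), so `meanVariance` should land closer to 25.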
Now let’s see how the sample means are distributed.
averageMean <- mean(rowMeans(s10k))
caption <- paste0(
'Fig 2: Distribution of sample means of the 10000 samples in s10k. ',
'The dark blue vertical line indicates\nthe theoretical population mean. ',
'The red dashed line indicates the average of the means of the\n10000 samples in s10k.')
theme_set(theme_bw(12))
ggplot(data.frame(means = rowMeans(s10k)), aes(means)) +
geom_density(fill='mistyrose2') + geom_vline(xintercept=5, size=1, color='navyblue') +
geom_vline(xintercept=averageMean, size=1, linetype='dashed', color='red') +
annotate(geom='text',
label= c('theoretical mean', 'average mean'),
x=c(5.2, averageMean - .2), y=.3, angle=90, color=c('navyblue', 'red'), size=3) +
xlab('Sample means') +
labs(
title = 'Distribution of sample means',
caption = caption) +
theme(
plot.title = element_text(hjust = 0.5),
plot.caption = element_text(hjust = 0)
)
Figure 2 shows that the sample means of our simulated samples are approximately normally distributed: the density curve is bell shaped and centered on the theoretical population mean of \(5\). The average sample mean, \(5.003\), is very close to that value.
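The CLT also predicts the spread of the sample means: their variance should be close to \(\sigma^2/n = 25/40 = 0.625\). A minimal sketch of that check follows, using a freshly simulated matrix with a hypothetical seed and illustrative names (`sims`, `sample_means`, `theoretical_var`), so it does not reuse the exact draws in s10k.

```r
set.seed(1)  # hypothetical seed for repeatability
n <- 40
lambda <- 0.2
sims <- matrix(rexp(10000 * n, rate = lambda), nrow = 10000)

sample_means <- rowMeans(sims)
theoretical_var <- (1 / lambda)^2 / n  # sigma^2 / n

c(observed = var(sample_means), theoretical = theoretical_var)
```

The same comparison can be run on the original matrix by replacing `sims` with s10k.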
Now let’s see how the sample variances of samples in s10k are distributed.
sampleSds <- apply(s10k, 1, sd)
sampleVars <- sampleSds^2
meanVariance <- mean(sampleVars)
caption <- paste0(
'Fig 3: Distribution of sample variances of the 10000 samples in s10k. ',
'The dark blue vertical line\nindicates the theoretical population variance. ',
'The red dashed line indicates the average\nof the variances of the 10000 samples in s10k.'
)
theme_set(theme_bw(12))
ggplot(data.frame(vars = sampleVars), aes(vars)) +
geom_density(fill='mistyrose2') + geom_vline(xintercept=25, size=1, color='navyblue') +
geom_vline(xintercept=meanVariance, size=1, linetype='dashed', color='red') +
annotate(geom='text',
label= c('population variance', 'average sample variance'),
x=c(28, meanVariance - 3), y=.025, angle=90, color=c('navyblue', 'red'), size=3) +
xlab('Sample variance') +
labs(
title = 'Distribution of sample variances',
caption = caption) +
theme(
plot.title = element_text(hjust = 0.5),
plot.caption = element_text(hjust = 0)
)
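The density curve in Fig 3 is unimodal and centered near the theoretical variance of \(25\), but it is visibly asymmetrical, with a longer right tail. The sample variances are therefore only roughly normal at \(n = 40\); their distribution approaches normality much more slowly than that of the sample means.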
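The asymmetry can be quantified with a moment-based skewness coefficient, which is 0 for a symmetric distribution. The helper `skewness()` below is a hypothetical function written for this sketch (it is not from base R or from the original analysis), and the data are re-simulated with an assumed seed.

```r
# Hypothetical helper: third central moment divided by the
# second central moment raised to the power 3/2
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(1)  # hypothetical seed for repeatability
sims <- matrix(rexp(10000 * 40, rate = 0.2), nrow = 10000)

c(
  means     = skewness(rowMeans(sims)),       # skewness of the sample means
  variances = skewness(apply(sims, 1, var))   # skewness of the sample variances
)
```

A clearly larger value for the variances than for the means would support the visual impression of a longer right tail in the variance plot.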