The project consists of two parts:
1. A simulation exercise.
2. Basic inferential data analysis.
Part 1: Simulation Exercise Instructions
In this project you will investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda. Set lambda = 0.2 for all of the simulations. You will investigate the distribution of averages of 40 exponentials. Note that you will need to do a thousand simulations.
Illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponentials.
You should:
1. Show the sample mean and compare it to the theoretical mean of the distribution.
2. Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.
3. Show that the distribution is approximately normal.
In point 3, focus on the difference between the distribution of a large collection of random exponentials and the distribution of a large collection of averages of 40 exponentials.
In this project the exponential distribution is investigated in R and compared with the Central Limit Theorem. The mean of the exponential distribution is 1/lambda and the standard deviation is also 1/lambda. The exponential distribution is simulated in R with rexp(n, lambda), where lambda = 0.2 for all of the simulations, the sample size is n = 40, and the number of simulations is 1000.
A series of 1000 simulations is run to create a data set for comparison. Each simulation contains 40 observations drawn with rexp(40, 0.2), where 0.2 is the value of lambda.
Given data: n = 40; simNum = 1000; lambda = 0.2
For reproducibility, the seed is set with set.seed(10000).
Simulation of exponential distribution and calculation of summary statistics
The following code performs the simulations and collects the necessary data.
Exponential sampling parameters
n = 40
lambda = 0.2
simNum = 1000
set.seed(10000)
Make a data frame with the sampled exponential distribution data
simExp = function(n, lambda){
mean(rexp(n,lambda))
}
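# preallocate one row per simulation with named columns
# (data.frame(ncol=2, nrow=simNum) would create a single row with columns named ncol and nrow)
simul = data.frame(Sample = integer(simNum), Mean = numeric(simNum))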
for (i in 1:simNum)
{
simul[i,1] = i
simul[i,2] = simExp(n,lambda)
}
#check the data frame created
#the top 6
head(simul)
## Sample Mean
## 1 1 5.276785
## 2 2 5.373668
## 3 3 4.936764
## 4 4 4.919416
## 5 5 5.738125
## 6 6 5.489181
# the bottom 6
tail(simul)
## Sample Mean
## 995 995 3.806482
## 996 996 4.542863
## 997 997 5.415689
## 998 998 5.558451
## 999 999 5.922309
## 1000 1000 5.019615
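The loop above can also be written without explicit iteration. The following is a minimal sketch, not the code used to produce the results in this report: it draws all 40 x 1000 values in one call, arranges them in a matrix with one simulation per row, and takes row means. The names draws and simulVec are introduced here for illustration only.

```r
# Same parameters as in the report
n = 40
lambda = 0.2
simNum = 1000
set.seed(10000)

# Draw all values at once; byrow = TRUE puts each consecutive block of
# n draws into its own row, so each row is one simulated sample of size n
draws = matrix(rexp(n * simNum, lambda), nrow = simNum, ncol = n, byrow = TRUE)

# One mean per row, stored next to the simulation index
simulVec = data.frame(Sample = seq_len(simNum), Mean = rowMeans(draws))

head(simulVec)
```

Because the random draws are consumed in the same order as in the loop, this should give the same means for the same seed, up to floating-point rounding.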
Sample Mean
meanSample = mean(simul$Mean)
Theoretical Mean
meanTheory = 1/lambda
The simulated sample mean of 5.006 is close to the theoretical value of 5.
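As a rough worked check of how close this is, assume the 1000 simulated means are independent and each has variance \((1/\lambda)^2/n\). Under that assumption the standard error of their average is:

```latex
\[
\mathrm{SE} = \sqrt{\frac{(1/\lambda)^2}{n \cdot 1000}}
            = \frac{5}{\sqrt{40 \cdot 1000}}
            = \frac{5}{200}
            = 0.025
\]
```

The observed difference of 5.006 - 5 = 0.006 is well within one such standard error.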
Plot a histogram of the distribution of sample means
hist(simul$Mean, breaks = 30, prob = TRUE, col = "lightblue",
main = "Distribution of Sample Means",
xlab = "Means of 40 Simulated Exponentials", ylab = "Density")
abline(v = meanTheory, col = "red", lwd = 3)
abline(v = meanSample, col = "blue", lwd = 2)
legend('topright', c("Sample Mean", "Theoretical Mean"),
bty = "n",
lty = c(1,1),
col = c("blue", "red"))
The red vertical line marks the theoretical mean, whereas the blue vertical line marks the mean of the simulated sample means. The center of the distribution of averages of 40 exponentials is very close to the theoretical center of the distribution.
Here, the variance of the sample means from the 1000 simulations is compared with the theoretical variance of the mean of 40 exponentials.
The Sample Variance
The sample variance is the variance of the 1000 entries in the means vector. It estimates the variance of the mean of 40 exponentials; multiplying it by the sample size, 40, would give an estimate of the population variance.
var_sample = var(simul$Mean)
The Theoretical Variance
The theoretical variance of the mean of n exponentials is given by \(\sigma^2/n = (1/\lambda)^2/n\), where \(\sigma^2 = (1/\lambda)^2\) is the population variance.
var_theory = (1/lambda)^2/n
The sample variance of the distribution is 0.614 and the theoretical variance is 0.625.
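The relationship between the variance of the sample means and the population variance can be checked directly. The sketch below reruns the simulation in compact form with replicate() and rescales the variance of the means by n, which should land near the population variance (1/lambda)^2 = 25. The names means, var_means, var_pop_est and var_pop_theory are introduced here for illustration.

```r
n = 40
lambda = 0.2
simNum = 1000
set.seed(10000)

# 1000 means, each of n exponentials
means = replicate(simNum, mean(rexp(n, lambda)))

var_means = var(means)           # estimates sigma^2 / n
var_pop_est = var_means * n      # rescaled estimate of sigma^2
var_pop_theory = (1 / lambda)^2  # sigma^2 for the exponential distribution

round(c(var_means = var_means, var_pop_est = var_pop_est,
        var_pop_theory = var_pop_theory), 3)
```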
The following table shows the theoretical and sample values of the mean and variance.
##          Theoretical Sample
## Mean           5.000  5.006
## Variance       0.625  0.614
By the central limit theorem, averages of samples are approximately normally distributed. The figure below shows the density estimated from the simulated means together with the normal density computed from the theoretical mean and variance; their agreement indicates that the distribution is approximately normal.
hist(simul$Mean,
breaks = 30,
prob = TRUE, col = "lightblue",
main = "Density of Simulated Sample Means",
xlab = "Means of 40 Exponentials", ylab = "Density")
# kernel density estimate of the simulated means
lines(density(simul$Mean), col = "blue", lwd = 2)
abline(v = 1/lambda, col = "orange", lwd = 2)
# normal density with the theoretical mean and standard deviation
xfit <- seq(min(simul$Mean), max(simul$Mean), length.out = 100)
yfit <- dnorm(xfit, mean = 1/lambda, sd = (1/lambda/sqrt(n)))
lines(xfit, yfit, col = "red", lwd = 2)
legend('topright', c("Simulated Values", "Theoretical Values"),
bty = "n", lwd = c(2,2), col = c("blue", "red"))
The above plot shows that the density curve of the simulated values is very similar to the normal distribution curve built from the theoretical values. This indicates that the distribution of sample means is approximately normal.
The q-q plot below also supports normality: the sample quantiles match the theoretical normal quantiles closely.
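A numerical complement to the visual comparison, offered as a sketch rather than a formal test: if the means were exactly normal with the theoretical mean and standard deviation, about 95% of them would fall within 1.96 standard deviations of 1/lambda. The names means, mu, se and coverage are introduced here for illustration.

```r
n = 40
lambda = 0.2
simNum = 1000
set.seed(10000)

# 1000 means, each of n exponentials
means = replicate(simNum, mean(rexp(n, lambda)))

mu = 1 / lambda              # theoretical mean
se = (1 / lambda) / sqrt(n)  # theoretical sd of the mean of n exponentials

# Share of simulated means inside the central 95% band of N(mu, se^2)
coverage = mean(abs(means - mu) <= qnorm(0.975) * se)
coverage
```

A value near 0.95 is consistent with approximate normality; it does not prove it, since different distributions can share the same central coverage.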
qqnorm(simul$Mean,main ="Normal Q-Q Plot", col = "blue")
qqline(simul$Mean, col = "red", lwd = 2)
Taken together, these comparisons show that the distribution of sample means is approximately normal.
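The contrast between a large collection of raw exponentials and a large collection of averages of 40 exponentials can be sketched as follows. This is an illustrative addition, not part of the analysis above; the helper skewness() and the names rawExp and means are defined here and are not taken from any package.

```r
n = 40
lambda = 0.2
simNum = 1000
set.seed(10000)

rawExp = rexp(simNum, lambda)                     # 1000 single exponentials
means = replicate(simNum, mean(rexp(n, lambda)))  # 1000 averages of 40

# Moment-based sample skewness; 0 for a symmetric distribution
skewness = function(x) mean((x - mean(x))^3) / mean((x - mean(x))^2)^1.5

# Side-by-side histograms on the density scale
par(mfrow = c(1, 2))
hist(rawExp, breaks = 30, prob = TRUE, col = "lightblue",
     main = "1000 Exponentials", xlab = "Value")
hist(means, breaks = 30, prob = TRUE, col = "lightblue",
     main = "1000 Averages of 40 Exponentials", xlab = "Mean")
par(mfrow = c(1, 1))

c(raw = skewness(rawExp), means = skewness(means))
```

The exponential distribution has theoretical skewness 2, while the mean of 40 exponentials has skewness 2/sqrt(40), about 0.32, so the raw values should look strongly right-skewed and the averages much closer to symmetric.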