knitr::opts_chunk$set(echo = TRUE)
This is the final assignment in the Statistical Significance course, and involves two parts * Simulation exercise * Basic Inferential data analysis # Part 1: Simulation Exercise This is the first part of the assignment
This task is set out in the assignment as follows:
In this project you will investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution can be simulated in R wiht ‘resp(n, lambda)’, where lambda is the rate parameter. The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda. Set lambda = 0.2 for all of the simulations. You will investigate the distribution of averages of 40 exponentials. NOte that you will need to do a thousand simulations
Illustrate via simulation and associated explanatory text, the properties of the distribution of the mean of 40 exponentials * Show the sample mean and compare it to the theoretical mean of the distribution * Show how variable the sample is (via variance), and compare it to the theoretical variance of the distribution * Show that the distribution is approximately normal
First, we create a matrix and populate it with 40 exponential observations, through 1,000 simulations
The distribution is shown in the figure below, and features the theoretical mean as the thick vertical line
setwd("/home/johnm/Desktop")
set.seed(1234) ## Set the seed value in order to ensure reproducability
lambda <- 0.2
nrow <- 1000
ncol <- 40
simulatedData <- matrix(rexp(1000*40, lambda), nrow, ncol)
Means <- apply(simulatedData, 1, mean)
hist(Means, breaks = 50, main = "1,000 averages of 40 random exponentials", xlab = "Means", ylab = "Frequency")
abline(v = 1/lambda, lty = 1, lwd = 5)
legend("topright", lty = 1, lwd = 5, legend = "Theoretical Mean")
The actual sample mean is shown below, and is clearly very close to the theoretical mean
SampleMean <- NULL
for(index in 1:nrow)
{
SampleMean <- c(SampleMean, mean(rexp(ncol, lambda)))
}
mean(SampleMean)
## [1] 5.002242
The figure above, and the calculation both show that this is close to the theoretical value for the mean of 5
The standard error is given by standard deviation divided by the square root of the sample size : StdErr = (1/lambda)/sqrt(nrow). The standard variance, StdVar = (StdErr)^2
StdErr <- (1/lambda)/sqrt(nrow)
StdVar <- StdErr^2
standardDeviation <- sd(SampleMean)
standardVariance <- var(SampleMean)
distVar <- apply(simulatedData, 1, var)
fit <-seq(min(SampleMean), max(SampleMean), length = 100)
standard_fit <- dnorm(fit, mean = SampleMean, sd=StdErr)
hist(distVar, breaks = ncol, prob = T, main = "This distribution of 1,000 variance of 40 random exponentials", xlab = "Value of variances", ylab = "Frequency of variance", col = "lightblue")
abline(v = StdVar, lty = 1, lwd = 2, col = "red")
legend("topright", lty = 1, lwd = 5, col = "blue", legend = "Theoretical variance")
The sample variances have a distribution that approaches a normal distributoin, and is nearly centered on the theoretical variance
The following plots demonstrate that that distribution approaches a Normal distribution. The first plot below shows the distribution of exponentials with lambda equal to 0.2
par(mfrow = c(3, 1))
hist(simulatedData, breaks = 50, main = "Distribution of exponentials with lambda equals to 0.2", xlab = "Exponentials", col = "orange")
The second plot below shows the distribution of 1,000 averages of 40 random exponentials
hist(Means, breaks = 50, main = "The distribution of 1,000 averages of 40 random exponentials", xlab = "Value of means", ylab = "Frequency of means", col = "pink")
The third plot below shows the normal distribution, with the mean and standard deviation equal to that in the second plot above.
Comparing the second figure with the third figure, it shows that the distribution of the means tends toward the normal distribution with the same mean and standard deviation.
simNorm <- rnorm(1000, mean = mean(Means), sd = sd(Means))
hist(simNorm, breaks = 50, main = "A normal distribution with theoretical mean and sd of the exponentials", xlab = "Normal variables", col = "lightgreen")