Author: Sora Jin
This is a report for the Statistical Reference course project. The exponential distribution (lambda = 0.2 for all simulations) is used to illustrate the Central Limit Theorem. We generate 1000 samples of 40 exponential random variables each and compare the sample mean and variance with the theoretical mean and variance. The overall goal is to show that the distribution of the sample means tends toward a normal distribution as the sample size grows.
Generate 1000 samples of 40 exponential random variables each, for 40,000 draws in total.
data <- rexp(40000, 0.2)
data <- matrix(data, nrow = 1000, ncol = 40)
dim(data)
## [1] 1000 40
By reshaping the draws into a matrix with 1000 rows and 40 columns, each row holds one sample of 40 exponential random variables, which gives 1000 samples.
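The setup above can also be written so that the results are reproducible from run to run. The following is only a sketch: the seed value and the names `lambda`, `n`, `n_sims`, `sims`, and `sample_means` are illustrative assumptions and do not appear in the original analysis, so the numbers it produces will differ from those reported here.

```r
# Illustrative names and seed; not part of the original analysis.
set.seed(2024)            # arbitrary seed, used only so reruns give the same draws
lambda <- 0.2             # rate parameter used throughout the report
n      <- 40              # observations per sample
n_sims <- 1000            # number of samples

# Each row is one sample of n exponential draws
sims <- matrix(rexp(n_sims * n, rate = lambda), nrow = n_sims, ncol = n)

# One mean per row; equivalent to apply(sims, 1, mean) but faster
sample_means <- rowMeans(sims)
length(sample_means)
```

Because the draws are independent and identically distributed, it does not matter that `matrix()` fills the values column by column: every row is still a valid sample of 40 draws.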
To get the overall mean, first take the mean of each sample (each sample contains 40 exponential draws, which gives 1000 sample means), then take the mean of these sample means. To check whether the distribution of the sample means is close to a normal distribution, a histogram is created.
The theoretical mean, 1/lambda = 1/0.2 = 5, is highlighted in the following histogram.
mean1=apply(data,1,mean)
mean(mean1)
## [1] 5.004336
hist(mean1, col="grey", probability=TRUE)
abline(v=5,col="blue",lwd=3)
lines(density(mean1), col="blue", lwd=2)
lines(density(mean1, adjust=2), lty="dotted", col="darkgreen", lwd=2)
The mean of the sample means, 5.004, is very close to the theoretical mean of 5. The histogram of the 1000 sample means is also roughly bell-shaped and centered near 5, as expected for an approximately normal distribution.
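A visual impression of normality can be checked a little more directly by overlaying the normal density that the Central Limit Theorem predicts and by drawing a normal Q-Q plot. The sketch below regenerates its own data with an arbitrary seed, so it is an illustration under those assumptions rather than a reproduction of the figures in this report; the names `means`, `mu`, `se`, and `within` are hypothetical.

```r
# Illustrative sketch with its own simulated data (arbitrary seed).
set.seed(1)
lambda <- 0.2
n      <- 40
means  <- rowMeans(matrix(rexp(1000 * n, rate = lambda), nrow = 1000))

mu <- 1 / lambda                 # theoretical mean, 5
se <- (1 / lambda) / sqrt(n)     # theoretical sd of a mean of 40 draws

# Histogram of the sample means with the predicted normal density on top
hist(means, probability = TRUE, col = "grey",
     main = "Sample means vs. normal density")
curve(dnorm(x, mean = mu, sd = se), add = TRUE, col = "red", lwd = 2)

# Points close to the reference line indicate approximate normality
qqnorm(means)
qqline(means, col = "red")

# Share of sample means within 1.96 theoretical standard errors of mu;
# a normal distribution would put about 95% of them there
within <- mean(abs(means - mu) <= 1.96 * se)
within
```

With only 40 observations per sample, some right skew inherited from the exponential distribution can still show up in the upper tail of the Q-Q plot.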
For the variance, we take the variance of each of the 1000 samples. The theoretical variance is 1/lambda^2 = 1/0.2^2 = 25. This value is highlighted in blue in the histogram.
variance=apply(data,1,var)
hist(variance, probability=TRUE, col="grey")
abline(v=25, col="blue", lwd=3)  # theoretical variance, 1/lambda^2
lines(density(variance), col="blue", lwd=2)
lines(density(variance, adjust=2), lty="dotted", col="darkgreen", lwd=2)
The histogram shows that the sample variances are centered near the theoretical variance of 25, so on average the sample variance estimates it well.
On the other hand, the variance of the 1000 sample means (computed below) should not be compared with 25. It estimates the variance of a mean of 40 observations, which in theory is 1/(lambda^2 * 40) = 25/40 = 0.625.
var(mean1)
## [1] 0.6026256
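The value printed above can be read against the variance of a sample mean. As a short worked derivation, assuming the 40 draws in each sample are independent with common variance 1/lambda^2 (the symbols n, sigma^2, and X-bar are notation introduced here for illustration):

```latex
\operatorname{Var}(\bar{X})
  = \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
  = \frac{1}{n^{2}}\sum_{i=1}^{n}\operatorname{Var}(X_i)
  = \frac{\sigma^{2}}{n}
  = \frac{1/\lambda^{2}}{n}
  = \frac{25}{40}
  = 0.625
```

The simulated value of about 0.603 is close to this figure, and its square root, about 0.776, is close to the theoretical standard error 5/sqrt(40), which is about 0.791.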
Based on the analysis of the mean and variance above, the simulated sample means appear to be approximately normally distributed. To quantify how close the simulated mean is to theory, we compute a 95% confidence interval for the mean, using the normal quantile 1.96, and check whether it contains the theoretical mean of 5.
se = sd(mean1)/sqrt(40)
lower = mean(mean1) - 1.96 * se
upper = mean(mean1) + 1.96 * se
c(lower, upper)
## [1] 4.763762 5.244911
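The interval runs from about 4.76 to 5.24 and contains the theoretical mean of 5, so the simulated mean is consistent with theory. Note that the standard error above divides by sqrt(40). Because mean1 holds 1000 sample means, dividing by sqrt(1000) would give a narrower interval around the same center, 5.004, and that interval also contains 5. The interval checks the location of the mean rather than the shape of the distribution. Together with the roughly bell-shaped histogram, it supports the conclusion that the sample means are approximately normally distributed.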
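As a cross-check on the interval, the standard error of the mean of the 1000 sample means can be computed with sqrt(1000) in the denominator and compared with the interval that `t.test()` reports. This is a minimal sketch on freshly simulated data with an arbitrary seed, so its numbers will not match the output above; the names `means`, `se_grand`, `ci`, and `ci_t` are hypothetical.

```r
# Illustrative sketch with its own simulated data (arbitrary seed).
set.seed(1)
lambda <- 0.2
n      <- 40
n_sims <- 1000
means  <- rowMeans(matrix(rexp(n_sims * n, rate = lambda), nrow = n_sims))

# Standard error of the mean of the 1000 sample means
se_grand <- sd(means) / sqrt(n_sims)

# Normal-based 95% interval; qnorm(0.975) is approximately 1.96
ci <- mean(means) + c(-1, 1) * qnorm(0.975) * se_grand
ci

# t.test() gives a nearly identical interval with this many observations
ci_t <- t.test(means, mu = 1 / lambda)$conf.int
ci_t
```

Using `qnorm(0.975)` instead of the hard-coded 1.96 keeps the confidence level explicit and makes it easy to change.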