This is the project for the statistical inference class. In it, you will use simulation to explore inference and do some simple inferential data analysis. The project consists of two parts:
Each pdf report should be no more than 3 pages with 3 pages of supporting appendix material if needed.
Illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponential(0.2)s. You should.
Note that for point 3, focus on the difference between the distribution of a large collection of random exponentials and the distribution of a large collection of averages of 40 exponentials.
First, load the packages knitr and ggplot2
library(knitr)
library(ggplot2)
The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. The mean of exponential distribution is 1/lambda and the standard deviation is also also 1/lambda. Set lambda = 0.2 for all of the simulations. In this simulation, you will investigate the distribution of averages of 40 exponential(0.2)s. Note that you will need to do a thousand or so simulated averages of 40 exponentials.
lambda = 0.2 # lambda for all simulations
n = 40 # number of exponential random variables
nsims = 1:10000 # number of simulated averages
set.seed(901)
means <- data.frame(x=sapply(nsims,function(x) {mean(rexp(n,lambda))})) # necessary means
head(means,2) # show the results
## x
## 1 3.981570
## 2 4.148078
Evaluate the stats needed to answer the questions.
mean(means$x) # Center of the distribution of sample means
## [1] 5.001601
sd(means$x) # Standard deviation of sample means
## [1] 0.7882242
(1/lambda)/(sqrt(40)) # Expected standard deviation
## [1] 0.7905694
var(means$x) # Variance of our simulations
## [1] 0.6212973
((1/lambda)/sqrt(40))^2 # Expected variance
## [1] 0.625
hmeans <- rowMeans(means) # Center of the distribution of sample means
1. Show where the distribution is centered at and compare it to the theoretical center of the distribution.
Answer: The Center of the distribution:5.0016014. Expected center: \(\lambda^{-1}\) = 5. The mean of the means of the exponential of 10000 simulations of 40 exponential(0.2)s is 5.0016014, which is very close to the expected mean of 1/0.2 = 5.0.
The distribution of sample means is centered at 5.0016014 and the theoretical center of the distribution is \(\lambda^{-1}\) = 5. The variance of sample means is 0.6212973 where the theoretical variance of the distribution is \(\sigma^2 / n = 1/(\lambda^2 n) = 1/(0.04 \times 40)\) = 0.625.
The averages of samples follow normal distribution for the central limit theorem. The first figure shows the density computed using the histogram and the normal density plotted with theoretical mean and variance values. The quantiles comparison figure suggests the normality.
2. Show how variable it is and compare it to the theoretical variance of the distribution.
Answer: The standard deviation of 0.7882242 is also close to the expected standard deviation of 0.7905694. Expected standard deviation using Central Limit Theorem. Likewise, the variance and expected variance are 0.6212973 and 0.625, respectively.
3. Show that the distribution is approximately normal.
Answer: The results of simulation are into a new vector called simulation, where the distribution is close to standard normal (theoretical). The plot of distribution (histogram and its estimated density) in black colour and the actual density of the standard normal in blue colour. If you can see, the red curve is close to the blue.
### System info My analysis has been originally created and run in RStudio Version 0.98.994 under OS X 10.9.5. Time and date about of the report generation
## [1] "Sat Nov 22 20:05:15 2014"
hmeans <- rowMeans(means)
hist(hmeans, col = I('#00e6ff'), breaks=40, prob=TRUE,
main="Distribution of averages of samples,
from exponential distribution",
xlab="")
# density of the averages of samples
lines(density(hmeans))
# theoretical center of distribution
abline(v=1/lambda, col="red")
# theoretical density of the averages of samples
xfit <- seq(min(hmeans), max(hmeans), length=100)
yfit <- dnorm(xfit, mean=1/lambda, sd=(1/lambda/sqrt(n)))
lines(xfit, yfit, pch=19, col="blue", lty=2)
# add legend
legend('topright', c("simulation", "theoretical"), lty=c(1,2), col=c("black", "blue"))
qqnorm(hmeans, col=I('red')); qqline(hmeans, col=I('blue'));
legend('topleft', c("qqnorm", "qqline"),col=c("red", "blue"))
Code of my first aproach
ggplot(data = means, aes(x = x)) +
geom_histogram(aes(y=..density..), fill = I('#00e6ff'),
binwidth = 0.20, color = I('blue')) +
stat_function(fun = dnorm, arg = list(mean = 5, sd = sd(means$x)))