Structure: Data Preparation, Section 1, Section 2, Section 3

The goal of this project is to investigate the exponential distribution in R and compare the behavior of its sample means with what the Central Limit Theorem (CLT) predicts.

The exponential distribution is simulated in R with rexp(n, lambda), where lambda is the rate parameter. Both the mean and the standard deviation of the distribution equal 1/lambda.

We investigate the distribution of averages of 40 exponentials, based on 1,000 simulations.
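
As a quick check of the rate parameterisation, the sketch below draws one large sample with rexp and compares its mean and standard deviation with 1/lambda. The sample size of 100,000 and the name check are illustrative assumptions, not part of the analysis that follows.

```r
# Illustrative check only; not part of the original analysis.
set.seed(12345)
lambda <- 0.2

# One large sample, so the empirical moments should be close to theory
draws <- rexp(100000, rate = lambda)

check <- c(mean = mean(draws), sd = sd(draws), theoretical = 1 / lambda)
print(round(check, 3))
```

Both empirical values should land close to 5 when lambda is 0.2.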

Data Preparation

Set the parameters.
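set.seed(12345)   # fix the seed so the simulation is reproducible
lambda <- 0.2     # rate parameter
mu <- 1/lambda    # theoretical mean (named mu to avoid masking base::mean)
sdev <- 1/lambda  # theoretical standard deviation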
Run 1,000 simulations of 40 exponentials each.
data <- matrix(NA_real_, nrow=40, ncol=1000)  # one column per simulation
for (i in 1:1000){
    sim <- rexp(40,rate=lambda)  # 40 exponential draws
    data[,i]  <- sim
}
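
The loop above can also be written without explicit iteration. This is a minimal sketch, assuming the same seed and parameters. Because matrix() fills column by column, each column again holds one simulation of 40 draws. The names data_vec, n_obs and n_sim are hypothetical.

```r
set.seed(12345)
lambda <- 0.2
n_obs <- 40     # exponentials per simulation
n_sim <- 1000   # number of simulations

# Draw all values at once and arrange them as 40 rows by 1000 columns
data_vec <- matrix(rexp(n_obs * n_sim, rate = lambda),
                   nrow = n_obs, ncol = n_sim)
dim(data_vec)
```

Since the values are generated in the same order, this is expected to reproduce the draws of the loop under the same seed.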
Create a vector of sample means.
sample_means <- colMeans(data)
library(pander)
panderOptions("digits", 2)
pander(summary(sample_means))
Min.   1st Qu.   Median   Mean   3rd Qu.   Max.
2.7    4.5       4.9      5      5.5       8.3

Section 1

Show the sample mean and compare it to the theoretical mean of the distribution.

Calculate theoretical and sample summary statistics.
# theoretical mean
theoretical <- 1/.2
theoretical
## [1] 5
# mean of the simulated sample means
distr_mean <- mean(sample_means)
distr_mean
## [1] 4.971972
Visualize the comparison (the plotting code is in the Appendix).

Conclusion:

The mean of the 1,000 sample means is 4.971972, whereas the theoretical mean is 5, a difference of about 0.028. The sampling distribution plot indicates that the center of the distribution lies near the theoretical mean.
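
To judge whether the gap between 4.971972 and 5 is small, it can be compared with the standard error of an average of 1,000 sample means. This is a rough sketch using the figures reported above. The names n_sim, se and z are illustrative.

```r
lambda <- 0.2
n <- 40          # observations per sample
n_sim <- 1000    # number of simulated samples

observed <- 4.971972        # mean of the sample means, as reported above
theoretical <- 1 / lambda   # theoretical mean

# Standard error of the average of n_sim sample means: sigma / sqrt(n * n_sim)
se <- (1 / lambda) / sqrt(n * n_sim)

# Distance from the theoretical mean, in standard errors
z <- (observed - theoretical) / se
c(se = se, z = z)
```

Under these assumptions the observed mean lies roughly one standard error below 5, which is consistent with ordinary sampling noise.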

Section 2

Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.

Calculate theoretical and sample summary statistics.
Conclusion:

The standard deviation of the sample means is 0.772, against a theoretical standard deviation of 0.791 (5/sqrt(40)). The variance of the sample means is 0.595, against a theoretical variance of 0.625 (25/40).

The following table shows how variable the sample is compared to the theoretical values.

                     Sample   Theoretical
Standard deviation   0.772    0.791
Variance             0.595    0.625
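
The theoretical values follow from the variance of a mean of n independent draws. As a worked calculation with lambda = 0.2 (so sigma = 1/lambda = 5) and n = 40:

```latex
\operatorname{Var}(\bar{X}) = \frac{\sigma^2}{n} = \frac{1/\lambda^2}{n} = \frac{25}{40} = 0.625,
\qquad
\operatorname{SD}(\bar{X}) = \frac{\sigma}{\sqrt{n}} = \frac{5}{\sqrt{40}} \approx 0.791
```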

Section 3

Show that the distribution is approximately normal.

Exponential distribution plot.

Conclusion:

The distribution of averages of 40 exponentials is approximately normal, as the Central Limit Theorem (CLT) predicts and as the histogram, density overlay and quantile-quantile plot in the Appendix show.
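
A numerical check can complement the plots. The sketch below re-simulates the data under the same seed and parameters and standardises the sample means with the theoretical mean and standard error. It then reports the share of means that fall within 1.96 standard errors, which for a normal distribution would be about 0.95. The names sims, means, z and within are illustrative.

```r
set.seed(12345)
lambda <- 0.2
n <- 40

# 1,000 simulations of 40 exponentials, one simulation per column
sims <- matrix(rexp(n * 1000, rate = lambda), nrow = n)
means <- colMeans(sims)

# Standardise with the theoretical mean and standard error of the mean
z <- (means - 1 / lambda) / ((1 / lambda) / sqrt(n))

# Share of standardised means within +/- 1.96
within <- mean(abs(z) <= 1.96)
within
```

A share near 0.95 supports the normal approximation, although it is only a coarse check and does not replace the quantile-quantile plot.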

End

APPENDIX

Section 1

library(ggplot2)
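# Histogram of the sample means; red line = simulated mean, orange line = theoretical mean.
# Colours are set outside aes() so they are used literally rather than mapped to legend labels.
# (On ggplot2 >= 3.4, use linewidth instead of size for lines.)
ggplot(data.frame(sample_means), aes(x = sample_means)) +
    geom_histogram(binwidth = 1/6, colour = "black", fill = "steelblue") +
    labs(title = "Sampling Distribution", x = "Sample means", y = "Count") +
    geom_rug(col = "steelblue", alpha = 0.3) +
    geom_vline(xintercept = distr_mean, colour = "red", size = 1.1) +
    geom_vline(xintercept = theoretical, colour = "orange2", size = 1.1)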

Section 2

Calculate theoretical and sample summary statistics.
SSD <- sd(sample_means) #sample standard deviation
SVAR <- SSD^2 #sample variance

n <- 40 # sample size: each of the 1,000 samples consists of 40 observations

TSD <- sdev/sqrt(n) #theoretical standard deviation
TVAR <- TSD^2 #theoretical variance
Conclusion table
x <- rbind(c(SSD,TSD),c(SVAR,TVAR)) # rows: statistic; columns: sample, theoretical
rownames(x) <- c("Standard deviation","Variance"); colnames(x) <- c("Sample","Theoretical")

library(pander)
panderOptions("digits", 3)
pander(x)

Section 3

Plots 1&2:

library(ggplot2)
library(gridExtra)

plot1 <-  qplot(as.vector(data)) + 
        geom_histogram(colour="black", fill="steelblue") +
            labs(title = "Exponential Distribution",x="Values",y="Count") +
                geom_rug(col = "steelblue", alpha = 0.2) +
                    geom_vline(xintercept=distr_mean, colour="red", size=1.1) # colour set outside aes() so the line is drawn in red
     
df <- data.frame(Means=sample_means)

plot2 <-  ggplot(data = df, aes(x = Means)) + 
        geom_histogram(aes(y=..density..), fill = "whitesmoke", binwidth = 1/6, color = "royalblue", alpha = 1/2) +
            geom_vline(aes(xintercept=distr_mean, colour="Sample mean"), size = 1.25,linetype="dotdash") +
            geom_vline(aes(xintercept=theoretical,colour = "Theoretical mean"), size = 1.25, linetype="dashed") +
                geom_density(aes(color = "Means distribution"), size = 2.25, show_guide=FALSE) +
                stat_function(fun=dnorm, args=list(mean=theoretical, sd=TSD), aes(color = "Normal distribution"), size = 2) +
        theme(legend.justification=c(1,0), legend.position=c(1.15,0.65)) + 
        labs(title = "Means Distribution", x = "Exponential means") +
        scale_x_continuous(limits = c(1, 10), breaks=1:10) +
        scale_color_discrete(name ="Compared Parameters") +
        geom_rug(col = "royalblue", alpha = 0.2)

grid.arrange(plot1, plot2,nrow=1)

Plot 3:

qqnorm(sample_means,main="Quantile-Quantile plot",xlab = "Theoretical Quantiles", ylab = "Sample Quantiles") 
qqline(sample_means,col=4)

The current system configuration:

sessionInfo()
## R version 3.1.3 (2015-03-09)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.3 (Yosemite)
## 
## locale:
## [1] ru_RU.UTF-8/ru_RU.UTF-8/ru_RU.UTF-8/C/ru_RU.UTF-8/ru_RU.UTF-8
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] gridExtra_0.9.1 ggplot2_1.0.1   pander_0.5.2   
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.2-6 digest_0.6.8     evaluate_0.7     formatR_1.2     
##  [5] gtable_0.1.2     htmltools_0.2.6  knitr_1.10.5     labeling_0.3    
##  [9] magrittr_1.5     MASS_7.3-40      munsell_0.4.2    plyr_1.8.2      
## [13] proto_0.3-10     Rcpp_0.11.6      reshape2_1.4.1   rmarkdown_0.5.1 
## [17] scales_0.2.4     stringi_0.4-1    stringr_1.0.0    tools_3.1.3     
## [21] yaml_2.1.13