Overview

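In this project we investigate the exponential distribution by simulation, using 1000 samples of 40 exponentials each with rate lambda = 0.2. We aim to demonstrate three things: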

Show the sample mean and compare it to the theoretical mean of the distribution.
Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.
Show that the distribution of the sample means is approximately normal.

Load libraries

library(ggplot2)
library(knitr)

Simulations

These are the simulation parameters:

n <- 40
lambda <- 0.2
num_simulations <- 1000
set.seed(41925) # To ensure people can reproduce the research

Here I perform the simulation. Each iteration draws n exponentials and stores their mean and variance:

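means <- numeric(num_simulations)  # preallocate instead of growing with c()
vars <- numeric(num_simulations)

for (i in seq_len(num_simulations)) {
  simulatedData <- rexp(n, lambda)  # one sample of n exponentials
  means[i] <- mean(simulatedData)
  vars[i] <- var(simulatedData)
}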
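
The loop above can also be written without an explicit loop. The following is a minimal sketch, not the code used for the results in this report: it draws all values in one call, arranges them in a matrix with one simulated sample per row, and summarises each row. The names sim_matrix, means_vec and vars_vec are introduced here for illustration only. Because the values are drawn in the same order, it should give the same numbers as the loop for the same seed.

```r
n <- 40
lambda <- 0.2
num_simulations <- 1000
set.seed(41925)

# One row per simulated sample of n exponentials
sim_matrix <- matrix(rexp(n * num_simulations, lambda),
                     nrow = num_simulations, ncol = n, byrow = TRUE)

# Row-wise mean and variance of each simulated sample
means_vec <- rowMeans(sim_matrix)
vars_vec <- apply(sim_matrix, 1, var)

length(means_vec)
mean(means_vec)
```

The matrix form trades a little memory (40000 stored values) for shorter code.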

Sample Mean versus Theoretical Mean:

Here we calculate the mean of the 1000 simulated sample means and compare it with the theoretical mean, 1/lambda:

sample_mean <- mean(means)
sample_mean

## [1] 5.008348

theoretical_mean <- 1/lambda
theoretical_mean

## [1] 5

As we can see, the sample mean is almost the same as the theoretical one. The following plot illustrates this:

g1 <- qplot(means, geom="histogram", xlab="Mean of 40 exponentials simulation", binwidth=0.2, xlim=c(1,9),
            main="Distribution of the mean of 1000 data samples (40 exponentials each)")
# Vertical line at the theoretical mean
g1 <- g1 + geom_vline(xintercept = theoretical_mean, color="yellow")
# annotate() draws a single fixed label, so no aes() mapping is needed
g1 <- g1 + annotate("text", x=sample_mean, y=110, label=paste("sample mean=",round(sample_mean,3)), size=4, vjust=1, hjust=-0.1)
g1
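
The histogram also shows how tightly the sample means cluster around 5. As a side check that is not part of the original analysis: for independent draws, the variance of a mean of n values is the population variance divided by n, here (1/lambda)^2 / n = 25/40 = 0.625. The sketch below re-runs the simulation so that it is self-contained. The names sim_means, var_of_means and theoretical_var_of_means are illustrative.

```r
n <- 40
lambda <- 0.2
num_simulations <- 1000
set.seed(41925)

# Mean of each simulated sample of n exponentials
sim_means <- replicate(num_simulations, mean(rexp(n, lambda)))

# Empirical variance of the sample means
var_of_means <- var(sim_means)

# Theoretical variance of a mean of n independent exponentials
theoretical_var_of_means <- (1 / lambda)^2 / n

var_of_means
theoretical_var_of_means
```

If the two values are close, the width of the histogram is consistent with what theory predicts for samples of size 40.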

Sample Variance versus Theoretical Variance:

Here we calculate the mean of the 1000 sample variances and observe that it is very close to the theoretical variance, (1/lambda)^2:

sample_var <- mean(vars)
sample_var

## [1] 25.11678

theoretical_var <- (1/lambda)^2
theoretical_var

## [1] 25

The following plot shows this as well:

g2 <- qplot(vars, geom="histogram", xlab="Variance of 40 exponentials simulation", binwidth=2,
            main="Distribution of the variance of 1000 data samples (40 exponentials each)")
# Vertical line at the theoretical variance
g2 <- g2 + geom_vline(xintercept = theoretical_var, color="yellow")
# annotate() draws a single fixed label, so no aes() mapping is needed
g2 <- g2 + annotate("text", x=sample_var, y=130, label=paste("sample variance=",round(sample_var,3)), size=4, hjust=-0.1)
g2
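
The two comparisons above can be collected in one small table. This is a sketch under the same simulation parameters, with the hypothetical names sim_stats and comparison. It also reports the relative difference in percent, which makes the two statistics comparable even though they are on different scales.

```r
n <- 40
lambda <- 0.2
num_simulations <- 1000
set.seed(41925)

# One row per simulated sample: column 1 is its mean, column 2 its variance
sim_stats <- t(replicate(num_simulations, {
  x <- rexp(n, lambda)
  c(mean(x), var(x))
}))

comparison <- data.frame(
  statistic   = c("mean", "variance"),
  simulated   = c(mean(sim_stats[, 1]), mean(sim_stats[, 2])),
  theoretical = c(1 / lambda, (1 / lambda)^2)
)

# Relative difference between simulated and theoretical values, in percent
comparison$rel_diff_pct <- 100 * (comparison$simulated - comparison$theoretical) /
  comparison$theoretical
comparison
```

From the values printed in this report (5.008348 against 5, and 25.11678 against 25), the relative differences are about 0.17% for the mean and 0.47% for the variance.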

Distribution:

One of the easiest and clearest ways to see whether the data is normally distributed is to draw a Q-Q plot.

qqnorm(means)
qqline(means, col = "yellow")

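The points lie close to the straight line, which suggests that the sample means are approximately normally distributed, as we expected. Note that this concerns the distribution of the means; the underlying exponential population is itself not normal.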
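
A numerical complement to the Q-Q plot, offered as a rough sketch rather than a formal test: if the sample means are approximately normal with mean 1/lambda and standard deviation (1/lambda)/sqrt(n), about 95% of them should fall within 1.96 standard deviations of 1/lambda. The names sim_means, se, z and coverage are illustrative, and the simulation is repeated so the block runs on its own.

```r
n <- 40
lambda <- 0.2
num_simulations <- 1000
set.seed(41925)

# Mean of each simulated sample of n exponentials
sim_means <- replicate(num_simulations, mean(rexp(n, lambda)))

# Standardise using the theoretical mean and standard error
se <- (1 / lambda) / sqrt(n)
z <- (sim_means - 1 / lambda) / se

# Share of standardised means inside the central 95% normal interval
coverage <- mean(abs(z) <= qnorm(0.975))
coverage
```

A coverage near 0.95 agrees with the visual impression from the Q-Q plot. A value far from it would point to heavier or lighter tails than the normal distribution.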

Conclusions

We have shown that the sample mean and sample variance are good estimators of the population mean and variance when the number of simulated samples is large enough. Additionally, we can see that the distribution of the sample means tends toward a normal distribution.

The session info is:

sessionInfo()

## R version 3.0.2 (2013-09-25)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
## 
## locale:
## [1] es_ES.UTF-8/es_ES.UTF-8/es_ES.UTF-8/C/es_ES.UTF-8/es_ES.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] knitr_1.11    ggplot2_1.0.1
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.2-6 digest_0.6.8     evaluate_0.8     formatR_1.0     
##  [5] grid_3.0.2       gtable_0.1.2     htmltools_0.2.6  labeling_0.3    
##  [9] MASS_7.3-29      munsell_0.4.2    plyr_1.8.1       proto_0.3-10    
## [13] Rcpp_0.11.5      reshape2_1.4.1   rmarkdown_0.8    scales_0.2.4    
## [17] stringr_0.6.2    tools_3.0.2      yaml_2.1.13

Statistical Inference Project - Part 1

Jose A. Ruiperez Valiente