Statistical Inference Assignment

Overview

This project consists of a simulation exercise to explore the properties of the distribution of averages of 40 exponentials. More details below.

In this project you will investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda), where lambda is the rate parameter. The mean of the exponential distribution is 1/lambda and the standard deviation is also 1/lambda. Set lambda = 0.2 for all of the simulations. You will investigate the distribution of averages of 40 exponentials. Note that you will need to do a thousand simulations.
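
As a quick sanity check of the stated mean and standard deviation, the following minimal sketch draws a large exponential sample and compares its moments with 1/lambda. The seed, the sample size of 100,000, and the variable name `draws` are illustrative choices, not part of the assignment.

```r
# Illustrative check (assumed seed, sample size, and names)
set.seed(1)
lambda <- 0.2
draws <- rexp(100000, rate = lambda)

# Both sample moments should land close to 1/lambda = 5
c(sample_mean = mean(draws), sample_sd = sd(draws), theoretical = 1 / lambda)
```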

Illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponentials. You should:

1. Show the sample mean and compare it to the theoretical mean of the distribution.
2. Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.
3. Show that the distribution is approximately normal.

In point 3, focus on the difference between the distribution of a large collection of random exponentials and the distribution of a large collection of averages of 40 exponentials.

Central limit theorem

In probability theory, the central limit theorem (CLT) states that, given certain conditions, the arithmetic mean of a sufficiently large number of iterates of independent random variables, each with a well-defined expected value and well-defined variance, will be approximately normally distributed, regardless of the underlying distribution.
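
Applied to this assignment, the statement can be written out as a short worked calculation. The symbols X_i, \bar{X}_n, \mu and \sigma are notation introduced here for illustration. The numbers follow from lambda = 0.2 and n = 40 as given in the overview.

```latex
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad
\mathbb{E}[\bar{X}_n] = \mu = \frac{1}{\lambda} = 5, \qquad
\operatorname{Var}(\bar{X}_n) = \frac{\sigma^2}{n} = \frac{1}{\lambda^2 n} = \frac{1}{0.04 \cdot 40} = 0.625

\bar{X}_n \;\overset{\text{approx.}}{\sim}\; \mathcal{N}(5,\ 0.625), \qquad
\operatorname{sd}(\bar{X}_n) = \sqrt{0.625} \approx 0.79
```

Note that 0.625 is the variance of the averages. The corresponding standard deviation is about 0.79.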

Simulating the Exponential Distribution

# setting the seed so the research is reproducible
set.seed(13233)
# sample size and number of simulations
n <- 40; simulations <- 1000
# distribution parameters
lambda <- 0.2
# creating an empty vector
mns <- NULL
# for each simulation we calculate the mean
for (i in 1:simulations) mns <- c(mns, mean(rexp(n, lambda)))
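
The loop above grows `mns` one element at a time. A possible vectorized alternative is sketched below. The names `samples` and `mns_vec` are hypothetical. Because the draws are consumed in the same order, this should give the same means as the loop up to floating-point rounding, but treat that as an assumption to verify rather than a guarantee.

```r
# Sketch: the same simulation without growing a vector inside a loop
set.seed(13233)
n <- 40; simulations <- 1000
lambda <- 0.2

# each row holds one simulated sample of n exponentials
samples <- matrix(rexp(n * simulations, lambda),
                  nrow = simulations, ncol = n, byrow = TRUE)

# one mean per row, i.e. one mean per simulation
mns_vec <- rowMeans(samples)
length(mns_vec)
```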

1. Show the sample mean and compare it to the theoretical mean of the distribution.

# Sample mean from the simulated distributions
sm <- mean(mns)

# 95% confidence interval for the mean of the simulated averages
t.test(mns)$conf.int

## [1] 4.956908 5.053388
## attr(,"conf.level")
## [1] 0.95

# Theoretical mean of the distribution
tm <- 1/lambda

## [1] "The Sample mean is equal to 5.0051 and the theoritical mean is equal to 5 . The difference between the two means is equal to 0.0051"

meanDF <- data.frame(Mean.Title=c("Sample Mean", "Theoretical Mean"), Mean.Values=c(sm, tm))
meanDF

##         Mean.Title Mean.Values
## 1      Sample Mean    5.005148
## 2 Theoretical Mean    5.000000

# Generate a histogram to show the sample and theoretical means
require(ggplot2)

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 3.2.3

ggplot(NULL, aes(x = mns)) +
  geom_histogram(aes(y = ..density..), color = "black", fill = NA, binwidth = .25) +
  geom_density(color = "blue", lwd = 1.2) +
  geom_vline(data = meanDF, aes(xintercept = Mean.Values, colour = Mean.Title, linetype = Mean.Title), lwd = 1.2, show.legend = TRUE) +
  labs(title = "Sample Means Distribution", x = "Sample Means") +
  # reference normal curve: mean 1/lambda, sd (1/lambda)/sqrt(n), about 0.79
  # (0.625 is the variance of the means, not their standard deviation)
  stat_function(fun = dnorm, args = list(mean = tm, sd = tm / sqrt(n)),
                color = "red", size = 1) +
  geom_rug(col = "darkred", alpha = 0.1) +
  scale_x_continuous(breaks = 1:10)

The graph above shows the distribution of 1,000 means simulated from the exponential distribution (each computed from 40 observations). The density of the means (blue curve) is close to a normal density (red curve), as the Central Limit Theorem predicts. With a larger sample size per simulation, the distribution of the means would approximate the normal distribution even more closely.
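
The remark about larger sample sizes can be checked with a small sketch that compares the skewness of the simulated means for two sample sizes. The helper functions `skewness` and `sim_means`, the comparison size n = 400, and the use of 10,000 simulations (to make the estimate less noisy) are all assumptions made for illustration. For averages of n exponentials the theoretical skewness is 2/sqrt(n), so values closer to 0 indicate a more symmetric, more normal-looking distribution.

```r
set.seed(13233)
lambda <- 0.2
simulations <- 10000  # assumed: more runs than the report, for a steadier estimate

# hypothetical helper: sample skewness (third standardized moment)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# hypothetical helper: means of `simulations` samples of size n
sim_means <- function(n) replicate(simulations, mean(rexp(n, lambda)))

skew_40  <- skewness(sim_means(40))
skew_400 <- skewness(sim_means(400))

# compare with the theoretical values 2/sqrt(n)
rbind(simulated   = c(n40 = skew_40, n400 = skew_400),
      theoretical = c(2 / sqrt(40), 2 / sqrt(400)))
```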

For comparison, we also plot the raw exponential values themselves: 1,000 samples of 40 observations each, i.e. 40,000 draws in total.

set.seed(13233)
expDist <- replicate(n = simulations, expr = rexp(n, lambda))

ggplot(NULL, aes(x = as.vector(expDist))) +
  geom_histogram(color = "black", fill = "steelblue", binwidth = 1.2) +
  labs(title = "Exponential Distribution", x = "values", y = "count") +
  geom_rug(col = "darkred", alpha = 0.1)

2. Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.

# Sample Variance
sv <- var(mns)
# Theoretical variance of the mean of n exponentials: (1/lambda)^2 / n
tv <-  1/(lambda^2*n)
# Variance Dataframe
varDF <- data.frame(Var.Title=c("Sample Variance", "Theoretical Variance"), Var.Values=c(sv, tv))
varDF

##              Var.Title Var.Values
## 1      Sample Variance  0.6043142
## 2 Theoretical Variance  0.6250000
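
One way to put the two variances on a common scale is the relative difference. With the values reported above it is (0.6043142 - 0.625) / 0.625, roughly -3.3%. The sketch below recomputes it from a fresh simulation. The name `rel_diff` is hypothetical, and `replicate` is used here in place of the original loop.

```r
set.seed(13233)
n <- 40; simulations <- 1000
lambda <- 0.2

# means of 1,000 samples of 40 exponentials
mns <- replicate(simulations, mean(rexp(n, lambda)))

sv <- var(mns)             # sample variance of the means
tv <- 1 / (lambda^2 * n)   # theoretical variance, 0.625

# relative difference, as a percentage of the theoretical variance
rel_diff <- (sv - tv) / tv
round(100 * rel_diff, 1)
```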

3. Show that the distribution is approximately normal.

# We can check this using a qqplot

d <- data.frame(mns)
y <- quantile(mns[!is.na(mns)], c(0.25, 0.75))
x <- qnorm(c(0.25, 0.75))
slope <- diff(y)/diff(x)
int <- y[1L] - slope * x[1L]
ggplot(d, aes(sample = mns)) + stat_qq() + geom_abline(slope = slope, intercept = int)

# Alternatively, base R produces an equivalent plot:
# qqnorm(mns); qqline(mns)

A Q-Q plot is a graphical method for comparing two probability distributions by plotting their quantiles against each other. Here we compare the simulated means against the normal distribution. The points lie approximately on the straight line, which suggests that the means are approximately normally distributed.
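
As a numeric complement to the Q-Q plot, one could standardize the means and count how many fall within one and two standard deviations of the theoretical mean, then compare with the normal benchmarks of about 68% and 95%. This is only a rough sketch. The names `z`, `within_1sd` and `within_2sd` are hypothetical, and `replicate` is used here in place of the original loop.

```r
set.seed(13233)
n <- 40; simulations <- 1000
lambda <- 0.2

mns <- replicate(simulations, mean(rexp(n, lambda)))

# standardize with the theoretical mean and standard error
z <- (mns - 1 / lambda) / ((1 / lambda) / sqrt(n))

within_1sd <- mean(abs(z) <= 1)
within_2sd <- mean(abs(z) <= 2)

# simulated shares next to the exact normal probabilities
rbind(simulated = c(within_1sd, within_2sd),
      normal    = c(pnorm(1) - pnorm(-1), pnorm(2) - pnorm(-2)))
```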

Statistical Inference Assignment - Part 1 Simulation

Konstantinos Saittis

29 March 2016
