Analysing R exponential data with simulations in R

Introduction

In this project the exponential distribution in R will be compared with the Central Limit Theorem. The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda. Lambda will be set to 0.2 for all of the simulations. Will focus on the investiagion of the distribution of averages of 40 exponential withing 1000 simulations

Questions to be answered or approached

Show the sample mean and compare it to the theoretical mean of the distribution
Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution
Show that the distribution is approximately normal.

Simulation of data

First lets run 1000 simulations and compute the mean of the 40 random generated numbers

#Set seed for reproducibility
set.seed(123)
lambda <- 0.2 #Expected mean
theor_mean <- 1/lambda #Theoretical mean
theor_sigma <- 1/lambda #Theoretica standard deviation
n <- 40 #Nº of observations
expected_var <- theor_sigma^2/n

simulation=NULL
for (i in 1 : 1000) {
    simulation = c(simulation, mean(rexp(40,lambda)))
}

results <- data.frame(simulation,size=40)

- Question 1: Compare the theoretical mean vs the simulated one?

hist(simulation,breaks = 40, main="Distribution of 1000 random exponentials", xlab="Means")

#Mean from simulations
mean(simulation)

## [1] 5.011911

#Expected mean from exponential distribution
print(theor_mean <- 1/lambda)

## [1] 5

Difference <- (mean(simulation)/(1/lambda))*100 -1

As seen from the data, the simulated mean is close to the theoretical one which is 5.

- Question 2: Show how variable the sample is (via variance) and compare both

# Theoretical variance 
print(expected_var <- theor_sigma^2/n)

## [1] 0.625

# Simulated variance
var(simulation)

## [1] 0.6004928

The variance of the simulation, 0.6 is close to the theoretical variance, which is 0.625.

- Question 3: Is the distribution approximately normal?

According to the central limit theorem, which statesthat if you have a population with mean μ and standard deviation σ and take sufficiently large random samples from the population with replacement , then the distribution of the sample means will be approximately normally distributed (according to: https://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Probability/BS704_Probability12.html) This should be the case of the previous simulation.

hist(simulation,breaks = 20, main="Distribution of 1000 random exponentials", xlab="means"
     ,probability = TRUE)
x<-seq(0,100,0.01) 
curve(dnorm(x, mean = mean(simulation), sd=sd(simulation)), col="green", lwd=2, add = TRUE)

From the graph it is clear that the simulated data is approximately normal. This should be more obvious if we run the simulations more times, for example 20000 times. Les see what happens.

simulation2=NULL
for (i in 1 : 20000) {
    simulation2 = c(simulation2, mean(rexp(40,lambda)))
}


hist(simulation2, breaks = 50, main="Distribution of 20.000 random exponentials", xlab="means"
     ,probability = TRUE)
x<-seq(0,100,0.01) 
curve(dnorm(x, mean = mean(simulation2), sd=sd(simulation2)), col="blue", lwd=2, add = TRUE)

The final plot shows a more normal distribution than the previous one. This shows that with more data or simulations the distribution is more related to a normal distribution.

Statistical Inference Course Project. Part 1

Maximiliano