Assignment

The project consists of two parts:
1. A simulation exercise.
2. Basic inferential data analysis.

This document focuses on the first part.

A simulation exercise.

Assignment

In this project you will investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. The mean of the exponential distribution is 1/lambda and the standard deviation is also 1/lambda. Set lambda = 0.2 for all of the simulations. You will investigate the distribution of averages of 40 exponentials. Note that you will need to do a thousand simulations.

Illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponentials. You should:

  1. Show the sample mean and compare it to the theoretical mean of the distribution.
  2. Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.
  3. Show that the distribution is approximately normal.

In point 3, focus on the difference between the distribution of a large collection of random exponentials and the distribution of a large collection of averages of 40 exponentials.

Summary

Based on simulated samples from an exponential distribution, the Central Limit Theorem is investigated. The theoretical and the simulated characteristics of the distribution of the means are compared and found to be almost identical. A histogram of the means (one mean per row of 40 values) is also made; it shows that the means are approximately normally distributed.

Research

Initialisation

The initialisation consists of two things: setting up R and defining the variables given in the assignment.
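The initialisation code itself is not shown in this report. A minimal sketch of what it could look like is given below, assuming the variable names used in the later steps (n, set, lambda, theo_mean, theo_sd, theo_var) and the values from the assignment. The theoretical values computed this way agree with the "theoretical" column printed further down.

```r
# Hypothetical reconstruction of the initialisation step; the original code is not shown.
lambda <- 0.2   # rate parameter given in the assignment
n      <- 40    # number of exponentials per average
set    <- 1000  # number of simulations (rows of the simulation set)

# Theoretical characteristics of the mean of n exponentials
theo_mean <- 1 / lambda             # same as the mean of a single exponential
theo_sd   <- (1 / lambda) / sqrt(n) # standard deviation of the mean
theo_var  <- theo_sd^2
```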

Generating the simulation set

In this step the simulation set is generated. First the seed is set, so that the simulation can be reproduced.

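# Fix the seed so that the simulation is reproducible
set.seed(1)
# One row per simulation: `set` rows of `n` exponentials each
simulation_set <- matrix(rexp(n * set, lambda), nrow = set)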

Calculating the mean, standard deviation and variance of the row means

The mean of each row has to be calculated, so that the distribution of these means can be checked against the CLT.

row_mean <- apply(simulation_set, 1, mean)

sim_mean <- mean(row_mean)
sim_sd <- sd(row_mean)
sim_var <- sim_sd^2

print_result <- matrix(c(theo_mean, theo_sd, theo_var, sim_mean, sim_sd, sim_var), nrow = 3, ncol = 2)
dimnames(print_result) <- list(c("mean", "standard deviation", "variance"),
                               c("theoretical", "simulation"))
print(print_result)
##                    theoretical simulation
## mean                 5.0000000  4.9900252
## standard deviation   0.7905694  0.7859435
## variance             0.6250000  0.6177072

The differences between the theoretical characteristics of the distribution and those of the simulation are negligible. Rounded to two decimals, the simulated mean is only .01 off. The standard deviation and the variance are even closer (.005 and .007 off). This answers questions 1 and 2.
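For reference, the theoretical values in the table can be derived as follows, assuming the 40 exponentials in each row are independent, with rate $\lambda = 0.2$ and $n = 40$. Each exponential has mean $1/\lambda$ and variance $1/\lambda^2$, so for the average $\bar{X}$ of $n$ of them:

$$
\begin{aligned}
E[\bar{X}] &= \frac{1}{\lambda} = \frac{1}{0.2} = 5, \\
\operatorname{Var}(\bar{X}) &= \frac{1}{n\lambda^2} = \frac{25}{40} = 0.625, \\
\operatorname{sd}(\bar{X}) &= \frac{1}{\lambda\sqrt{n}} = \sqrt{0.625} \approx 0.7906.
\end{aligned}
$$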

Plotting the means

In this step a histogram of the row means is plotted.

hist(row_mean, density = 100, breaks = 20, prob = TRUE, col = "blue",
     xlab = "average of 40", ylab = "density",
     main = "means of exponential distribution")
curve(dnorm(x, mean = theo_mean, sd = theo_sd),
      col = "black", lwd = 2, add = TRUE, yaxt = "n")

The bars show the distribution of the row means. The black curve is the normal density based on the theoretical characteristics (mean and standard deviation). The histogram follows this curve closely, so the means are approximately normally distributed. This is consistent with the Central Limit Theorem; a simulation illustrates the theorem rather than proving it.
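The assignment also asks for the contrast between a large collection of single exponentials and a large collection of averages of 40 exponentials. The sketch below is one possible way to show it and is not part of the original analysis. It is self-contained, and all names in it (n_sim, raw_values, mean_values, skewness) are illustrative. It plots both histograms side by side and computes the sample skewness, which is 0 for a symmetric distribution such as the normal.

```r
# Illustrative sketch; variable and function names are assumptions, not from the report.
set.seed(1)
lambda <- 0.2
n      <- 40
n_sim  <- 1000

raw_values  <- rexp(n_sim, lambda)                                # 1000 single exponentials
mean_values <- rowMeans(matrix(rexp(n * n_sim, lambda), n_sim))   # 1000 averages of 40

# Sample skewness: third central moment divided by the cubed standard deviation
skewness <- function(x) mean((x - mean(x))^3) / mean((x - mean(x))^2)^1.5

raw_skew  <- skewness(raw_values)
mean_skew <- skewness(mean_values)

# Left: single exponentials; right: averages of 40 with the theoretical normal density
par(mfrow = c(1, 2))
hist(raw_values, breaks = 20, prob = TRUE,
     main = "1000 exponentials", xlab = "value")
hist(mean_values, breaks = 20, prob = TRUE,
     main = "1000 averages of 40", xlab = "average of 40")
curve(dnorm(x, mean = 1 / lambda, sd = 1 / (lambda * sqrt(n))),
      add = TRUE, lwd = 2)
par(mfrow = c(1, 1))
```

The single exponentials should appear strongly right-skewed, while the averages should be much closer to symmetric and centred around 5. Comparing raw_skew with mean_skew gives a numerical check of the same contrast.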