Part 1 of the course project - Simulation Exercise

Simulating the exponential distribution

First of all, the exponential distribution must be simulated. As stated in the exercise, 1000 means must be obtained from simulated exponential distributions, where each mean is calculated using 40 simulated exponentials. Lambda is set equal to 0.2.

In order to do so, the functions rexp() and sapply() can be used to run 1000 simulations, in which each contains 40 exponentials. Subsequently, apply() can be used to calculate the mean of each of these simulations, returning a vector of means which can then be used to answer the questions of this exercise. Note that the set.seed() function is used to ensure that the same exponentials are generated if the code is rerun. The code to achieve the above is as follows:

# Parameters and seed
set.seed(1122)
nsim=1000
n=40
lambda=.2

# Simulation of the distributions
sim <- sapply(1:nsim, function(i) rexp(n=n, rate=lambda))

# Calculation of the means
means <- apply(sim, 2, mean)

After running this code, sim contains the simulations of the distributions, while means contains the means of each of these distributions.

Comparing the sample mean with the theoretical mean

Using the mean() function, the mean of the sample can be calculated. The theoretical mean of the exponential distribution is calculated as 1/lambda. Hence, the required code is as follows:

# Sample mean
mean(means)

## [1] 4.981784

# Theoretical mean
1/lambda

## [1] 5

Accordingly, it can be seen that the sample mean of 4.982 is very close to the theoretical mean of 5. To further prove this point, the graph below shows the distribution of the sample with a line plotted at the theoretical mean. ggplot2 is used to generate the graph:

g <- ggplot()

g +
      aes(means) +
      geom_histogram(bins=40, fill="Darkblue") +
      geom_vline(xintercept=1/.2, lwd=1.5, col="Darkgrey") +
      labs(x="Mean of simulated sample", y="Frequency", title="Distribution of means of simulated data")

As can be seen in the graph, the distribution seems to be centered around the theoretical mean, which further proves the point that the sample mean is very close to it.

Comparing the sample variance with the theoretical variance

To calculate the variance of the simulated sample, the sd() function can be used to obtain the standard error, which can then be squared to obtain the sample’s estimated variance of the mean. To obtain the theoretical standard error of the mean, lambda can be multiplied by 1 over the square root of n. Subsequently squaring this term gives the theoretical variance. The code to do so is as follows:

#  Sample standard error and variance
vs <- sd(means)
vs^2

## [1] 0.6127445

# Theoretical standard error and variance
vh <- ((1/.2)/sqrt(n))
vh^2

## [1] 0.625

As previously seen with the means, the sample’s variance of 0.613 is also close to the theoretical variance of 0.625.

Comparing the distribution to the normal distribution

Before comparing the distribution of the means to the normal distribution, it is interesting to analyze the effect of using 1000 means of 40 distributions of exponentials, instead of simply using the distribution of 40000 exponentials. Below, you can see a histogram of 40000 exponentials.

g +
      aes(rexp(n=40000, rate=lambda)) +
      geom_histogram(fill="Darkblue", aes(y=..density..)) +
      labs(x="Randomly generated exponentials", y="Frequency", 
           title="Histogram of randomly generated exponentials")

As is (hopefully) apparant, the distribution is not very similar to the normal distribution (as it is substantially, positively skewed). In the next graph, the distribution of the means of the 1000 exponential distributions is shown.

g +
      aes(means) +
      geom_histogram(aes(y=..density..), fill="Darkblue", bins=40) +
      stat_function(fun=dnorm, args=list(mean=1/lambda, sd=(1/lambda)/sqrt(n)),
                    lwd=1.25, col="Darkred", lty=2) +
      geom_density(col="Blue", lwd=1.25) +
      labs(x="Means of simulated sample", y="Density", title="Density plot of simulated data")

As can be seen in the graph, this distribution approximates the normal distribution far better than the previous one. To make this comparison somewhat easier, the distribution of the sample is plotted as the light blue curve, whereas the normal distribution is plotted as the dashed, dark red line. As is apparent, these distributions do not differ that much from each other.

To bring the point home, the last part of this report shows the qqplot of the exponential distribution, in which the sample’s distribution is compared to the normal distribution. The closer these two are to eachother, the more the line in the graph will approximate a line from the origin to the right, upper corner of the graph. The code to generate said plot is as follows:

qqnorm(means)

Although somewhat redundant at this point, it is clear that the line approximates a line from the origin to the right upper corner of the graph, again showing that the simulated distribution approximates the normal distribution.

Course Project Statistical Inference - Part 1

Jorik Schra

19 juli 2017

Synopsis

Library

Part 1 of the course project - Simulation Exercise

Simulating the exponential distribution

Comparing the sample mean with the theoretical mean

Comparing the sample variance with the theoretical variance

Comparing the distribution to the normal distribution