Overview

The objective of this analysis is to examine the mean, variance and standard error of numbers drawn at random from the exponential distribution. The theoretical properties of this distribution are known, and the analysis shows that the corresponding statistics of a simulated sample are close to the values those properties predict.

Introduction

The exponential distribution is defined by a ‘rate parameter’, known as ‘lambda’ ($\lambda$). It has a mean of $1/\lambda$ and a standard deviation that is also $1/\lambda$, so its variance is $1/\lambda^2$. Finally, its standard error of the mean, or ‘SEM’, is $\sigma/\sqrt{n}$, meaning that the means of a number of samples of size $n$ should have this standard deviation.

Below is a table of the variables used in this analysis, with their mathematical symbols, a description of what each represents, and how each is calculated:

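| Variable | Symbol | Description | Calculated As |
|---|---|---|---|
| lambda | $\lambda$ | rate parameter of the exponential distribution | This analysis will use a lambda value of 0.2 |
| mu | $\mu$ | population mean | $\frac{1}{\lambda}$ |
| sigma | $\sigma$ | population standard deviation | $\frac{1}{\lambda}$ |
| sigma2 | $\sigma^2$ | population variance | $\frac{1}{\lambda^2}$ |
| sem | $\sigma_{\bar{x}}$ | standard error of the mean (SEM) | $\frac{\sigma}{\sqrt{n}}$ |
| xbar | $\bar{x}$ | sample mean | $\frac{1}{n}\sum_{i=1}^{n} x_i$ (where $x$ is the sampled values) |
| s2 | $s^2$ | sample variance | $\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$ |
| muxbar | $\mu_{\bar{x}}$ | mean of sample means | $\frac{1}{s}\sum_{i=1}^{s} \bar{x}_i$ (where $s$ is the number of simulations) |
| s2xbar | $s_{\bar{x}}^2$ | variance of sample means | $\frac{1}{s-1}\sum_{i=1}^{s} (\bar{x}_i - \mu_{\bar{x}})^2$ (where $s$ is the number of simulations) |
| sxbar | $s_{\bar{x}}$ | standard deviation of sample means | $\sqrt{s_{\bar{x}}^2}$ |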

With lambda of 0.2, the distribution’s theoretical statistics are:

| Stat | Description | Value |
|---|---|---|
| lambda | rate parameter | 0.2 |
| mu | mean | 5 |
| sigma | standard deviation | 5 |
| sigma2 | variance | 25 |
| n | size of each sample | 40 |
| sem | standard error of the mean | 0.7905694 |
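
As a quick check on the values in the table, the mean and variance can be recovered by numerically integrating the exponential density, f(x) = λ·exp(-λx) for x ≥ 0. The sketch below is not part of the original analysis; the names `rate`, `first_moment`, `second_moment`, `num_mean`, `num_var` and `num_sem` are chosen here purely for illustration.

```r
# Illustrative check (not part of the original analysis): recover the
# theoretical mean and variance by numerical integration of the density.
rate <- 0.2   # same lambda as used in this analysis

# E[X] and E[X^2] for X ~ Exp(rate), using base R's integrate() and dexp()
first_moment  <- integrate(function(x) x   * dexp(x, rate), 0, Inf)$value
second_moment <- integrate(function(x) x^2 * dexp(x, rate), 0, Inf)$value

num_mean <- first_moment                    # should be close to 1 / rate
num_var  <- second_moment - first_moment^2  # should be close to 1 / rate^2
num_sem  <- sqrt(num_var) / sqrt(40)        # SEM for samples of size 40

print(c(mean = num_mean, variance = num_var, sem = num_sem))
```

The printed values should agree with the table up to the tolerance of the numerical integration.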

Distribution of Sample

Here I take a random sample of 40000 values from the exponential distribution. These same values are later split into 1000 groups of 40 so that multiple sample means can be taken, but here I plot the distribution of all of the sampled values, along with the sample mean and variance and the population mean and variance.

Figure 1 demonstrates that the mean of this sample (4.9949399) is very close to the theoretical mean of the whole population (5) and that the sample variance (25.028788) is similarly close to the population variance (25). The density of the sampled values closely follows the shape of the exponential distribution.
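
The visual agreement with the exponential distribution can also be checked numerically. The sketch below is an addition, not part of the original analysis: it redraws a sample with the seed and parameters from the appendix, compares a few sample quantiles with the theoretical ones from `qexp()`, and runs a one-sample Kolmogorov-Smirnov test against the exponential CDF. The names `x`, `probs`, `quantile_check` and `ks` are illustrative.

```r
# Illustrative check (not part of the original analysis): compare a simulated
# sample with the theoretical exponential distribution.
set.seed(20150110)            # seed taken from the appendix code
rate <- 0.2
x <- rexp(40 * 1000, rate)    # 40000 draws, as in this section

# Sample quantiles next to the theoretical quantiles from qexp()
probs <- c(0.1, 0.25, 0.5, 0.75, 0.9)
quantile_check <- rbind(
    sample      = quantile(x, probs),
    theoretical = qexp(probs, rate))
print(round(quantile_check, 3))

# One-sample Kolmogorov-Smirnov test against the Exp(rate) CDF
ks <- ks.test(x, "pexp", rate = rate)
print(ks$statistic)   # largest gap between empirical and theoretical CDF
```

A small KS statistic means the empirical and theoretical cumulative distributions are never far apart.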

Distribution of Sample Means

I next split this sample of 40000 into 1000 groups of 40 and take the mean of each group. I then plot the distribution of these sample means and compare its standard deviation with the population ‘standard error of the mean’ (sem), and its mean with the population mean.

Whereas the sampled values themselves followed the exponential distribution, Figure 2 shows that the sample means are approximately normally distributed, as the central limit theorem predicts. It also shows that the mean of the sample means (4.9949399) is very close to the population mean (5) and that the standard deviation of the sample means (0.7849616) is very close to its expected value, the standard error of the mean, or SEM (0.7905694).
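
One way to gauge how close the sample means are to a normal distribution is to count how many fall within one and two SEM of the population mean, and compare those shares with the roughly 68% and 95% a normal distribution would give. The sketch below is an addition, not part of the original analysis; `n_obs`, `n_sims`, `means`, `within_1` and `within_2` are illustrative names.

```r
# Illustrative check (not part of the original analysis): how closely do the
# sample means follow a normal distribution with mean mu and sd sem?
set.seed(20150110)
rate <- 0.2
n_obs <- 40      # values per sample
n_sims <- 1000   # number of samples
mu_theory  <- 1 / rate
sem_theory <- (1 / rate) / sqrt(n_obs)

# One row per simulated sample of n_obs values; rowMeans gives the sample means
means <- rowMeans(matrix(rexp(n_obs * n_sims, rate), nrow = n_sims))

# Share of sample means within 1 and 2 SEM of the population mean
z <- abs(means - mu_theory) / sem_theory
within_1 <- mean(z <= 1)
within_2 <- mean(z <= 2)

# For an exactly normal distribution these shares would be
normal_1 <- pnorm(1) - pnorm(-1)
normal_2 <- pnorm(2) - pnorm(-2)
print(round(c(within_1 = within_1, normal_1 = normal_1,
              within_2 = within_2, normal_2 = normal_2), 3))
```

The agreement is only approximate: the mean of 40 exponential values follows a gamma distribution, which is still slightly right-skewed at this sample size.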

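Note that the mean of all sampled values (xbar = 4.9949399) is identical to the mean of the sample means (muxbar = 4.9949399). This is because every group contains the same number of values, so averaging the group means weights each value equally.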
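
The equality of the two means depends on the groups being the same size. A toy example, using made-up numbers rather than the simulated data, sketches the difference between equal and unequal groups; `values`, `equal_groups` and `unequal_groups` are illustrative names.

```r
# Illustrative example (not part of the original analysis): with equal-sized
# groups, the mean of the group means equals the mean of all values.
values <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24)

equal_groups <- matrix(values, nrow = 3)        # 3 groups of 4 values each
mean_of_means_equal <- mean(rowMeans(equal_groups))

# With unequal group sizes the two quantities generally differ
unequal_groups <- list(values[1:2], values[3:12])
mean_of_means_unequal <- mean(sapply(unequal_groups, mean))

print(c(overall = mean(values),
        equal   = mean_of_means_equal,
        unequal = mean_of_means_unequal))
```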

The standard deviation of the sample means (0.7849616) is much smaller than the population standard deviation, sigma (5). This is expected: averaging 40 independent values cancels much of the variation between individual values, so the sample means are spread about √40 ≈ 6.3 times less widely than the raw data.
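
The size of this reduction follows from the variance of a mean of independent values. A short derivation, assuming the $n$ values in each sample are independent draws with common variance $\sigma^2$ (as they are when generated with rexp):

```latex
\operatorname{Var}(\bar{x})
  = \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)
  = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(x_i)
  = \frac{n\sigma^2}{n^2}
  = \frac{\sigma^2}{n},
\qquad
\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{5}{\sqrt{40}} \approx 0.79
```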

Appendix

Introduction Code

lambda <- 0.2       # rate parameter
numsims <- 1000     # number of simulations
n <- 40             # size of each sample (values per simulation)

# calculate mean, standard deviation and variance of the population using
# the known characteristics of the exponential distribution
mu <- 1 / lambda
sigma <- 1 / lambda
sigma2 <- sigma^2
sem <- sigma / sqrt(n)

Distribution of Sample Code

set.seed(20150110)

# take a sample of rexp variables
sample <- rexp(n * numsims, lambda)

# calculate the sample mean and variance
xbar <- sum(sample) / length(sample)
s2 <- sum( (sample - xbar)^2) / (length(sample) - 1)

# make a plot of the sample values, with sample mean, sample variance,
# population mean and population variance overlaid
library(ggplot2)
col1 <- "red"
col2 <- "blue"
col3 <- "green4"
col4 <- "chocolate1"

g1  <- ggplot(
    data = data.frame(x = sample), aes(x = x)) +
    geom_density(size = 2, alpha = 0.2, fill = col1) +
    geom_vline(
        size = 2,
        alpha = 0.8,
        xintercept = c(mu, xbar, sigma2, s2),
        colour = c(col1, col2, col3, col4),
        lty = c(1, 2, 1, 2)) +
    annotate(
        "text",
        size = 5,
        alpha = 0.8,
        label = c(
            paste0("Population Mean (", mu, ")"),
            paste0("Sample\nMean\n(", round(xbar, 3), ")"),
            paste0("Population Variance (", round(sigma2, 3), ")"),
            paste0("Sample Variance (", round(s2, 3), ")")),
        x = c(mu, xbar, sigma2, s2),
        y = c(0.1, 0.05, 0.05, 0.03),
        hjust = c(-0.06, 1.06, -0.06, 1.06),
        colour = c(col1, col2, col3, col4)) +
    ggtitle(paste(
        "Figure 1: Density of", length(sample),
        "rexp variables (lambda=", lambda, ")")) +
    labs(y = "Density") +
    theme(plot.title = element_text(size = rel(1.5)))

print(g1)

Distribution of Sample Means Code

# split the sample into a matrix of 'numsims' rows each containing 'n' samples
# and take the mean of each sample, leaving 1000 mean values
smp.means <- rowMeans(matrix(sample, nrow = numsims))

# calculate the mean, variance and standard deviation of these sample means
muxbar <- sum(smp.means) / numsims
s2xbar <- sum( (smp.means - muxbar)^2 ) / (numsims - 1)
sxbar <- sqrt(s2xbar)

# make a plot of the sample means, with their mean and standard deviation
# overlaid, together with the population mean and standard error of the mean
g2 <- ggplot(data = data.frame(x = smp.means), aes(x = x)) +
    geom_density(alpha = 0.2, size = 2, fill = col1) +
    geom_vline(
        size = 2,
        alpha = 0.8,
        xintercept = c(
            mu,
            muxbar,
            mu + c(1,-1)*sem,
            muxbar + c(1,-1)*sxbar),
        colour = c(col1, col2, col3, col3, col4, col4),
        lty = c(1, 2, 1, 1, 2, 2)) +
    annotate(
        "text",
        size = 5,
        alpha = 0.8,
        x = c(
            mu, muxbar, mu - sem,
            muxbar + sxbar),
        y = c(0.2, 0.3, 0.5, 0.1),
        hjust = c(-0.06, 1.06, -0.06, 1.06),
        colour = c(col1, col2, col3, col4),
        label = c(
            paste0("Population\nMean (", mu, ")"),
            paste0("Mean of Sample\nMeans (", round(muxbar,3), ")"),
            paste0("SEM\n(+/- ", round(sem,3), ")"),
            paste0("SD of\nSample Means\n(+/- ", round(sxbar,3),")")))+
    ggtitle(paste(
        "Figure 2: Distribution of the means of", numsims, "samples of", n,
        "\nrexp distribution variables (lambda=", lambda, ")")) +
    labs(y = "Density") +
    theme(plot.title = element_text(size = rel(1.5)))

print(g2)