The objective of this analysis is to examine the mean, variance and standard error of a simulated sample of numbers drawn at random from the exponential distribution. The theoretical population has known properties, and this analysis shows that the corresponding statistics of the simulated sample are close to the values those properties predict.
The exponential distribution is defined by a ‘rate parameter’, known as ‘lambda’ ($\lambda$). It has a mean of $1/\lambda$ and its standard deviation is also $1/\lambda$, meaning its variance is $1/\lambda^2$. Finally, its standard error of the mean, or ‘SEM’, is $\sigma/\sqrt{n} = 1/(\lambda\sqrt{n})$, meaning that the distribution of the means of a number of samples of size $n$ should have this standard deviation.
Below is a table of the variables used in this analysis, with their associated mathematical symbols, a description of what each represents and how it is calculated:
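The stated mean ($1/\lambda$) and variance ($1/\lambda^2$) can be checked numerically. The following is a minimal sketch, assuming base R's `integrate()` and `dexp()`; the names `rate`, `pop_mean` and `pop_var` are illustrative and not part of the original analysis.

```r
# Numerically check the mean and variance of an exponential distribution
# by integrating against its density (illustrative names, not the report's).
rate <- 0.2  # same lambda value as used in this analysis

# E[X] = integral of x * f(x) over [0, Inf)
pop_mean <- integrate(function(x) x * dexp(x, rate = rate), 0, Inf)$value

# Var[X] = integral of (x - E[X])^2 * f(x) over [0, Inf)
pop_var <- integrate(
  function(x) (x - pop_mean)^2 * dexp(x, rate = rate), 0, Inf)$value

# compare the numerical results with the closed-form values
print(c(mean = pop_mean, theory_mean = 1 / rate,
        variance = pop_var, theory_variance = 1 / rate^2))
```

The numerical integrals should agree with 1/lambda and 1/lambda^2 to within the integrator's tolerance.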
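Variable | Symbol | Description | Calculated As |
---|---|---|---|
lambda | $\lambda$ | rate parameter of the exponential distribution | This analysis will use a lambda value of 0.2 |
mu | $\mu$ | population mean | $1/\lambda$ |
sigma | $\sigma$ | population standard deviation | $1/\lambda$ |
sigma2 | $\sigma^2$ | population variance | $1/\lambda^2$ |
sem | $\sigma_{\bar{x}}$ | standard error of the mean (SEM) | $\sigma/\sqrt{n}$ |
xbar | $\bar{x}$ | sample mean (over all $N$ sampled values) | $\frac{1}{N}\sum_{i=1}^{N} x_i$ |
s2 | $s^2$ | sample variance | $\frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2$ |
muxbar | $\mu_{\bar{x}}$ | mean of sample means (over the $k$ group means $\bar{x}_j$) | $\frac{1}{k}\sum_{j=1}^{k} \bar{x}_j$ |
s2xbar | $s^2_{\bar{x}}$ | variance of sample means | $\frac{1}{k-1}\sum_{j=1}^{k} (\bar{x}_j - \mu_{\bar{x}})^2$ |
sxbar | $s_{\bar{x}}$ | standard deviation of sample means | $\sqrt{s^2_{\bar{x}}}$ |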
With lambda of 0.2, the distribution’s theoretical statistics are:
Stat | Description | Value |
---|---|---|
lambda | rate parameter | 0.2 |
mu | mean | 5 |
sigma | standard deviation | 5 |
sigma2 | variance | 25 |
n | size of each sample | 40 |
sem | standard error of the mean | 0.7905694 |
Here I take a random sample of 40000 values from the exponential distribution. These same values will be split later into 1000 groups of 40 so that multiple sample means can be taken, but here I plot the distribution of all of the sampled values along with the sample mean and variance and the population mean and variance. Figure 1 demonstrates that the mean of this sample (4.9949399) is very close to the theoretical mean of the whole population (5) and that the sample variance (25.028788) is similarly very close to the population variance (25). The density of the sampled values clearly follows the shape of the exponential distribution.
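The sample mean and variance described above can also be obtained with base R's `mean()` and `var()`, which use the same formulas (including the N - 1 denominator for the variance). The sketch below is illustrative only: it uses an arbitrary seed rather than the report's seed, so its numbers will differ slightly from those quoted above, and the variable names are assumptions.

```r
# Compare manual sample mean/variance formulas with base R's built-ins.
set.seed(1)  # arbitrary seed for this illustration, not the report's seed
lambda <- 0.2
x <- rexp(40000, rate = lambda)

# manual formulas, mirroring the report's own calculation
xbar_manual <- sum(x) / length(x)
s2_manual <- sum((x - xbar_manual)^2) / (length(x) - 1)

# base R equivalents
xbar_builtin <- mean(x)
s2_builtin <- var(x)

# both comparisons should report TRUE
print(all.equal(xbar_manual, xbar_builtin))
print(all.equal(s2_manual, s2_builtin))

# the estimates should sit near the theoretical values 5 and 25
print(c(sample_mean = xbar_builtin, sample_variance = s2_builtin))
```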
I next split this sample of 40000 into 1000 groups of 40 and take the mean of each group. I then plot the distribution of these sample means and compare its standard deviation with the population ‘standard error of the mean’ (sem), and its mean with the population mean. Whereas the individual random exponential values followed the exponential distribution, Figure 2 shows that these sample means are approximately normally distributed, as the central limit theorem predicts. It also shows that the mean of the sample means (4.9949399) is very close to the population mean (5) and that the standard deviation of the sample means (0.7849616) is very close to the expected value, the standard error of the mean, or SEM (0.7905694).
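One rough way to check that the sample means behave approximately normally is to count how many fall within 1 and 1.96 standard errors of the population mean; a normal distribution would put about 68% and 95% of them there. This is a minimal sketch with an arbitrary seed and illustrative names (`means`, `within_1`, `within_2`), not the report's own code.

```r
# Rough normality check: proportion of sample means within 1 and 1.96 SEM of mu.
set.seed(1)  # arbitrary seed for this illustration, not the report's seed
lambda <- 0.2
n <- 40          # size of each sample
numsims <- 1000  # number of samples

mu <- 1 / lambda
sem <- (1 / lambda) / sqrt(n)

# each row of the matrix holds one sample of n values; because the draws are
# independent, any partition into equal-sized groups is equally valid
means <- rowMeans(matrix(rexp(n * numsims, rate = lambda), nrow = numsims))

within_1 <- mean(abs(means - mu) <= sem)         # normal reference: about 0.68
within_2 <- mean(abs(means - mu) <= 1.96 * sem)  # normal reference: about 0.95

print(c(within_1_sem = within_1, within_1.96_sem = within_2))
```

With samples of only 40 values the distribution of means is still slightly right-skewed, so the proportions should be close to, but not exactly, the normal reference values.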
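Note that the mean of all sampled values (xbar = 4.9949399) is identical to the mean of the sample means (muxbar = 4.9949399). This is expected: every group has the same size (40), so averaging the group means gives each sampled value equal weight, exactly as the overall mean does.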
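The equality of the overall mean and the mean of the group means can be illustrated directly. The sketch below uses an arbitrary seed and illustrative names, and adds a contrasting case with unequal group sizes, where the two quantities generally differ.

```r
# The mean of equal-sized group means equals the overall mean.
set.seed(1)  # arbitrary seed for this illustration, not the report's seed
x <- rexp(40000, rate = 0.2)

groups <- matrix(x, nrow = 1000)  # 1000 groups (rows) of 40 values each
overall_mean <- mean(x)
mean_of_means <- mean(rowMeans(groups))

# should report TRUE (equal up to floating-point rounding)
print(all.equal(overall_mean, mean_of_means))

# contrast: with unequal group sizes the mean of the group means
# generally differs from the overall mean
unequal_means <- c(mean(x[1:10]), mean(x[11:40000]))
print(c(overall = overall_mean, mean_of_unequal_groups = mean(unequal_means)))
```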
The standard deviation of the sample means (0.7849616) is much smaller than the population standard deviation, sigma (5). This is because averaging 40 independent values cancels much of their individual variation: the variance of a mean of n independent values is the population variance divided by n, so its standard deviation is sigma divided by the square root of n.
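The factor of $\sqrt{n}$ follows from a short derivation, sketched here under the assumption that the $n$ values $X_1, \dots, X_n$ in each sample are independent draws with common variance $\sigma^2$:

```latex
\begin{aligned}
\operatorname{Var}(\bar{X})
  &= \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
   = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i)
   = \frac{\sigma^2}{n}, \\
\operatorname{SD}(\bar{X})
  &= \frac{\sigma}{\sqrt{n}}
   = \frac{5}{\sqrt{40}}
   \approx 0.7906 .
\end{aligned}
```

With $\sigma = 5$ and $n = 40$ this reproduces the SEM of 0.7905694 listed among the theoretical statistics.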
lambda <- 0.2 # rate parameter
numsims <- 1000 # number of simulations
n <- 40 # size of each simulation
# calculate mean, standard deviation and variance of the population using
# the known characteristics of the exponential distribution
mu <- 1 / lambda
sigma <- 1 / lambda
sigma2 <- sigma^2
sem <- sigma / sqrt(n)
set.seed(20150110)
# take a sample of rexp variables
sample <- rexp(n * numsims, lambda)
# calculate the sample mean and variance
xbar <- sum(sample) / length(sample)
s2 <- sum( (sample - xbar)^2) / (length(sample) - 1)
# make a plot of the sample values, with sample mean, sample variance,
# population mean and population variance overlaid
library(ggplot2)
col1 <- "red"
col2 <- "blue"
col3 <- "green4"
col4 <- "chocolate1"
g1 <- ggplot(
data = data.frame(x = sample), aes(x = x)) +
geom_density(size = 2, alpha = 0.2, fill = col1) +
geom_vline(
size = 2,
alpha = 0.8,
xintercept = c(mu, xbar, sigma2, s2),
colour = c(col1, col2, col3, col4),
lty = c(1, 2, 1, 2)) +
annotate(
"text",
size = 5,
alpha = 0.8,
label = c(
paste0("Population Mean (", mu, ")"),
paste0("Sample\nMean\n(", round(xbar, 3), ")"),
paste0("Population Variance (", round(sigma2, 3), ")"),
paste0("Sample Variance (", round(s2, 3), ")")),
x = c(mu, xbar, sigma2, s2),
y = c(0.1, 0.05, 0.05, 0.03),
hjust = c(-0.06, 1.06, -0.06, 1.06),
colour = c(col1, col2, col3, col4)) +
ggtitle(paste(
"Figure 1: Density of", length(sample),
"rexp variables (lambda=", lambda, ")")) +
labs(y = "Density") +
theme(plot.title = element_text(size = rel(1.5)))
print(g1)
# split the sample into a matrix of 'numsims' rows each containing 'n' samples
# and take the mean of each sample, leaving 1000 mean values
smp.means <- apply(matrix(sample, numsims), 1, mean)
# calculate the mean, variance and standard deviation of these sample means
muxbar <- sum(smp.means) / numsims
s2xbar <- sum( (smp.means - muxbar)^2 ) / (numsims - 1)
sxbar <- sqrt(s2xbar)
# make a plot of the sample means, with their mean and standard deviation overlaid
# together with the population mean and population standard error of the mean
g2 <- ggplot(data = data.frame(x = smp.means), aes(x = x)) +
geom_density(alpha = 0.2, size = 2, fill = col1) +
geom_vline(
size = 2,
alpha = 0.8,
xintercept = c(
mu,
muxbar,
mu + c(1,-1)*sem,
muxbar + c(1,-1)*sxbar),
colour = c(col1, col2, col3, col3, col4, col4),
lty = c(1, 2, 1, 1, 2, 2)) +
annotate(
"text",
size = 5,
alpha = 0.8,
x = c(
mu, muxbar, mu - sem,
muxbar + sxbar),
y = c(0.2, 0.3, 0.5, 0.1),
hjust = c(-0.06, 1.06, -0.06, 1.06),
colour = c("red", col2, col3, col4),
label = c(
paste0("Population\nMean (", mu, ")"),
paste0("Mean of Sample\nMeans (", round(muxbar,3), ")"),
paste0("SEM\n(+/- ", round(sem,3), ")"),
paste0("SD of\nSample Means\n(+/- ", round(sxbar,3),")")))+
ggtitle(paste(
"Figure 2: Distribution of the means of", numsims, "samples of", n,
"\nrexp distribution variables (lambda=", lambda, ")")) +
labs(y = "Density") +
theme(plot.title = element_text(size = rel(1.5)))
print(g2)