library("knitr")
library("ggplot2")
library("ggthemes")
opts_chunk$set(fig.width=6, fig.height=2)
n <- 100; lambda <- 0.2
This document explores the variance found in exponentially distributed data sets. Exponential data will be simulated as the waiting times between events of a Poisson process, and descriptive statistics will be compared to their theoretical values.
Statistics of interest:
The exponential distribution models the time between events that occur independently at a constant average rate. For example, if cars on a freeway pass at an average rate of one every 5 minutes, the times between passings follow an exponential distribution with rate \(\lambda = 1/5 = 0.2\) per minute.
The exponential distribution is defined with a probability density function (pdf) of
\[ pdf(x, \lambda) = \left\{ \begin{array}{lr} 0 & : x < 0 \\ \lambda e^{-\lambda x} & : x \ge 0 \end{array} \right. \]
and cumulative distribution function (cdf) of
\[ cdf(x, \lambda) = \left\{ \begin{array}{lr} 0 & : x < 0 \\ 1 - e^{-\lambda x} & : x \ge 0 \end{array} \right. \]
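As a quick sanity check on the two definitions above, the following sketch writes them out by hand and compares them with R's built-in `dexp` and `pexp`. The helper names `exp_pdf` and `exp_cdf` and the test points in `x` are assumptions made for this illustration only.

```r
# Hand-written versions of the pdf and cdf defined above (hypothetical helper names)
exp_pdf <- function(x, lambda) ifelse(x < 0, 0, lambda * exp(-lambda * x))
exp_cdf <- function(x, lambda) ifelse(x < 0, 0, 1 - exp(-lambda * x))

lambda <- 0.2
x <- c(-1, 0, 1, 5, 10, 25)  # arbitrary test points, including a negative one

# Compare against R's built-in density and distribution functions
pdf_matches <- all.equal(exp_pdf(x, lambda), dexp(x, rate = lambda))
cdf_matches <- all.equal(exp_cdf(x, lambda), pexp(x, rate = lambda))

# Probability that the gap between two cars exceeds 10 minutes:
# 1 - cdf(10) = exp(-lambda * 10) = exp(-2), roughly 0.135
p_gap_over_10 <- 1 - exp_cdf(10, lambda)
```

With \(\lambda = 0.2\), a gap longer than 10 minutes (twice the mean) should therefore occur for roughly one car in seven.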
First we’ll create the data set that consists of the times (in minutes) at which the first 100 cars pass. From there, the differences between consecutive passing times form a data set of delays that should follow an exponential distribution.
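The simulation below turns uniform random numbers into exponential delays. One way to see why this works is the inverse-transform argument, sketched here with \(U\) (a symbol introduced only for this illustration) denoting a Uniform(0, 1) draw. Setting \(U\) equal to the cdf and solving for \(x\) gives

\[ U = 1 - e^{-\lambda x} \quad \Longrightarrow \quad x = \frac{-\ln(1 - U)}{\lambda} \]

The resulting delay \(X = -\ln(1 - U)/\lambda\) then has the cdf given earlier, since for \(x \ge 0\)

\[ P(X \le x) = P\left(U \le 1 - e^{-\lambda x}\right) = 1 - e^{-\lambda x} \]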
create_spacing <- function() {
  # Inverse-transform sampling: for U ~ Uniform(0, 1), -log(1 - U) / lambda ~ Exp(lambda)
  delays <- -log(1 - runif(n)) / lambda
  # Passing times are the running total of the delays, starting at time 0
  pass_times <- c(0, cumsum(delays))
  # Spacing between consecutive passing times
  spacing <- diff(pass_times)
  list(pass_times = pass_times, spacing = spacing)
}
res <- create_spacing(); pass_times <- res$pass_times; spacing <- res$spacing
The plot above shows the variability in the car spacing. Over the full experiment of 100 cars, the overall rate stays close to the simulated average of 5 minutes per car:
sprintf("%d cars pass in %g minutes for an average of %g minutes per car",
n, tail(pass_times, 1), tail(pass_times, 1) / n)
## [1] "100 cars pass in 537.075 minutes for an average of 5.37075 minutes per car"
The spacings between cars are shown to follow an exponential distribution.
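Beyond a visual comparison, one possible numerical check is a Kolmogorov-Smirnov test against the exponential cdf, together with a comparison of a few quantiles. The sketch below is self-contained: it draws a stand-in `spacing` vector with `rexp` rather than reusing the simulation above, and the seed and the choice of quantiles are arbitrary assumptions.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
lambda <- 0.2
n <- 100

# Stand-in for the simulated spacing vector: n exponential delays
spacing <- rexp(n, rate = lambda)

# Kolmogorov-Smirnov test against the Exp(lambda) cdf
ks <- ks.test(spacing, "pexp", rate = lambda)
ks$statistic  # largest gap between the empirical and theoretical cdf
ks$p.value    # a very small value would be evidence against the exponential model

# Compare a few empirical quantiles with the theoretical ones
probs <- c(0.25, 0.5, 0.75)
quantile_table <- rbind(empirical = quantile(spacing, probs),
                        theoretical = qexp(probs, rate = lambda))
quantile_table
```

With only 100 observations the empirical quantiles can differ noticeably from the theoretical ones, so this is a rough consistency check rather than proof of the distribution.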
The expected duration between cars is \(1/\lambda=5\) min. In our simulated car data above, the sample mean (average duration between cars) is calculated to be mean(spacing) = 5.3707488, which is 7.41% off the theoretical value of \(1/\lambda=5\).
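To put that deviation in context, the arithmetic can be sketched as follows. The observed mean is copied from the output above. The standard error \((1/\lambda)/\sqrt{n}\) is the theoretical standard deviation of a mean of n exponential delays. The variable names are chosen only for this illustration.

```r
lambda <- 0.2
n <- 100
observed_mean <- 5.3707488       # sample mean reported above
theoretical_mean <- 1 / lambda   # 5 minutes

# Relative error of the sample mean, in percent
rel_error_pct <- 100 * abs(observed_mean - theoretical_mean) / theoretical_mean

# Theoretical standard error of the mean: sigma / sqrt(n), with sigma = 1 / lambda
std_error <- (1 / lambda) / sqrt(n)

# Deviation expressed in standard errors
z <- (observed_mean - theoretical_mean) / std_error
```

Here `std_error` is 0.5 minutes and `z` is about 0.74, so the observed mean lies within one standard error of the theoretical value.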
The accuracy of the sample mean depends on the sample size n. If you wait until more cars have passed, the estimate of the overall rate improves. Over repeated observations, the sample mean is expected to vary about 5, and for larger n the variance of the means decreases. Next, we’ll repeat the experiment 1000 times and observe the variance in the collection of sample means.
xbar <- replicate(1000, mean(create_spacing()$spacing))
The collection of sample means is approximately normally distributed, with a variance of var(xbar) = 0.2772142, which estimates the theoretical variance of the sample mean, \(\sigma^2 / n = (1/\lambda^2) / n =\) 0.25.
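The \(\sigma^2 / n\) relationship can also be illustrated by repeating the experiment at several sample sizes. The sketch below is an assumption-laden stand-in for the simulation above: it uses `rexp` directly instead of `create_spacing`, and the sample sizes in `n_values`, the seed, and the variable names are all chosen only for this example.

```r
set.seed(42)  # arbitrary seed for reproducibility
lambda <- 0.2
n_values <- c(25, 100, 400)  # hypothetical sample sizes to compare
n_reps <- 1000               # repetitions per sample size

# For each n, simulate n_reps sample means and take their variance
observed_var <- sapply(n_values, function(n) {
  var(replicate(n_reps, mean(rexp(n, rate = lambda))))
})

# Theoretical variance of the sample mean: (1 / lambda^2) / n
theoretical_var <- (1 / lambda^2) / n_values

data.frame(n = n_values, observed = observed_var, theoretical = theoretical_var)
```

The theoretical column is 1, 0.25 and 0.0625: quadrupling n cuts the variance of the sample mean by a factor of four, and the observed values should track this up to simulation noise.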