This is a Statistical Inference Course Project report which explores the asymptotic behavior of the sample mean and sample variance when the samples are drawn from exponential distribution. The analysis compares the sample mean and sample variance with the population mean and population variance using Monte Carlo simulation for drawing the samples. The simulations demonstrate that the sample mean is an unbiased estimate for the theoretical mean and also examines the relationship between the variance of the sample mean and the theoretical variance articulated by the Central Limit Theorem.
## Warning: package 'lattice' was built under R version 3.1.3
## Warning: package 'knitr' was built under R version 3.1.3
opts_chunk$set(echo=TRUE, cache=TRUE)
Q1: Show the sample mean and compare it to the theoretical mean of the distribution.
nsim <- 1000
# Let ns be the sample size. ns=40
ns <- 40
# lambda is parameter of exponential distribution
lambda <- 0.2
Mu <- 1/lambda; sigma <- 1/lambda
# For exploring the sample size dependency, let us construct a vector with 6 possible sample sizes including the specific sample size under investigation (ns = 40)
n <- c(5,10,20,30,40,100)
nvec <- c(rep(5,nsim),rep(10,nsim),rep(20,nsim),rep(30,nsim),rep(40,nsim),rep(100,nsim))
sampleStats <- data.frame(SampleMean=rep(0,nsim*length(n)),SampleVariance=rep(0,nsim*length(n)),SampleSize=nvec)
# Populate the sampleStats dataframe for sample mean and sample variance for all sample sizes
for(i in 1:length(n)) {
start <- (i-1)*nsim+1
end <- start + nsim - 1
sampleStats[start:end,1] <- apply(matrix(rexp(nsim*n[i], lambda), nrow=nsim, ncol=n[i]),1,mean)
sampleStats[start:end,2] <- apply(matrix(rexp(nsim*n[i], lambda), nrow=nsim, ncol=n[i]),1,var)
}
sampleStats_5 <- filter(sampleStats, SampleSize == 5)
sampleStats_10 <- filter(sampleStats, SampleSize == 10)
sampleStats_20 <- filter(sampleStats, SampleSize == 20)
sampleStats_30 <- filter(sampleStats, SampleSize == 30)
sampleStats_40 <- filter(sampleStats, SampleSize == 40)
sampleStats_100 <- filter(sampleStats, SampleSize == 100)
See figure 1 in Appendix for the comparison of histograms of sample means as a function of different sample sizes including the sample size in question (40). It is clear from the density plots that the sample mean is centered around the theoretical mean which shows that the sample mean is an unbiased estimate.
# Compute the sample mean (m) for the specific case of interest (sample size = ns = 40)
m <- mean(sampleStats_40$SampleMean)
cat("Mean of Sample Means (1000 simulations with ns=40): ", m, "\n")
## Mean of Sample Means (1000 simulations with ns=40): 4.984159
cat("Theoretical Mean of exponential distribution(lambda = 0.2): ", Mu, "\n")
## Theoretical Mean of exponential distribution(lambda = 0.2): 5
cat("Sample Mean deviation from the theoretical mean: ", abs(Mu-m))
## Sample Mean deviation from the theoretical mean: 0.01584138
Q2: Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.
In theory, the variance of the Sample Mean statistic v = sigma^2/ns, where sigma is the population variance and ns is the sample size.
cat("Theoretical Variance (exponential lambda=0.2) is: ", sigma^2, "\n")
## Theoretical Variance (exponential lambda=0.2) is: 25
# Compute the variance of the sample mean (s^2)
v <- var(sampleStats_40$SampleMean)
cat("Variance of the sample means (sample size=40): ", v, "\n")
## Variance of the sample means (sample size=40): 0.655065
cat("Theoretical Variance of the Sample Mean (sigma^2/ns): ", sigma^2/ns, "\n")
## Theoretical Variance of the Sample Mean (sigma^2/ns): 0.625
varDev <- 100 * abs(v-sigma^2/ns)/sigma^2/ns
cat("Variance of the Sample Mean relative to theoretical prediction: ",varDev, "%\n")
## Variance of the Sample Mean relative to theoretical prediction: 0.003006502 %
# Theoretical standard error for sample size = 40
se <- sigma/sqrt(ns)
cat("Standard error for ns=40: ", se, "\n")
## Standard error for ns=40: 0.7905694
# Compute 95% Confidence Interval
CI <- Mu + c(-1,1) * 1.96 * se
cat("CI Limits: ", CI, "\n")
## CI Limits: 3.450484 6.549516
The observed sample mean statistic falls well within the 95% CI.
Q3. Show that the distribution is approximately normal
Let us examine the difference between the distribution of a large collection of random exponentials and the distribution of a large collection of averages of 40 exponentials.
# Generate ns (=1000) samples drawn from a exponential distribution (lambda=0.2)
refDist <- rexp(nsim, lambda)
# Bind these 1000 samples in first column with the 1000 of the samples means with 40 samples from earlier on.
compDistDS <- data.frame(SampleMean_40=sampleStats_40$SampleMean, RefDist=refDist)
See figure 2 in Appendix for comparison of the inherent distribution function of the population from which the samples are drawn from (expontial distribution with lambda = 0.2). It can be seen that the probability distribution of the sample mean has no relationship to the original exponential pdf, but actually approaches normal distribution with a mean equal to the population mean as Central Limit Theorem predicts.
Refer to the following figure (Figure 1) which compares the sample mean distribution for different sample sizes.
Note: The panels in the figure are based on the sample size as conditional variable.
histogram(~SampleMean | as.factor(SampleSize),
sampleStats,
nint=50,
type="density",
panel=function(x, ...) {
mean.values <- mean(x)
panel.abline(v=mean.values, col.line="purple",lwd=3)
panel.histogram(x, ...)
panel.mathdensity(
dmath=dnorm,
col="brown",
lwd=c(2),
args=list(mean=mean(x),sd=sd(x))
)
panel.grid(h=5, v=5)
},
xlab="Sample Mean Statistic",
index.cond=list(c(5,6,3,4,1,2)),
col=c("orange"),
layout=c(2,3),
main="Figure 1: Histograms of Sample Means for Different Sample Sizes"
)
The following figure (figure 2) is a comparison of the original population distribution (exponential pdf) vs the sample mean pdf for sample size of 40. From the figure, it is discernible that the original distribution is exponential but the sample mean of 40 samples is asymptotically approaching a normal distribution function. This is a corroboration of CLT predictions about asymptotic behavior of sample mean. The normal curve fit in figure 1 for ns=40 is indicative of the closeness of the sample mean to normal distribution.
histogram( ~SampleMean_40 + RefDist,
compDistDS,
type="density",
nint=100,
xlim=c(0,20),
col=c("orange"),
panel=function(x, ...) {
mean.values <- mean(x)
panel.abline(v=mean.values, col.line="purple",lwd=3)
panel.histogram(x, ...)
panel.grid(h=5, v=5)
},
main="Figure 2: Sample Mean PDF vs Population PDF"
)
The following figure attmpts (figure 3) a comparison of the sample variance pdf as a function of sample size. It is discernible that the sample variance is an unbiased estimate of the population variance. The sample variance also asymptotically approaches the theoretical variance of the exponential distribution.
# Let us explore how the sample variance reflects on the theoretical variance for sample size = 40
histogram(~SampleVariance | as.factor(SampleSize),
sampleStats,
nint=100,
type="density",
panel=function(x, ...) {
mean.values <- mean(x)
panel.abline(v=mean.values, col.line="purple",lwd=3)
panel.histogram(x, ...)
panel.mathdensity(
dmath=dnorm,
col="brown",
lwd=c(2),
args=list(mean=mean(x),sd=sd(x))
)
panel.grid(h=5, v=5)
},
xlab="Sample Variance Statistic",
index.cond=list(c(5,6,3,4,1,2)),
col=c("orange"),
xlim=c(0,120),
layout=c(2,3),
main="Figure 3: Histograms of Sample Variance for Different Sample Sizes"
)