This is a Statistical Inference Course Project report which explores the asymptotic behavior of the sample mean and sample variance when the samples are drawn from an exponential distribution. The analysis compares the sample mean and sample variance with the population mean and population variance, using Monte Carlo simulation to draw the samples. The simulations demonstrate that the sample mean is an unbiased estimator of the theoretical mean and also confirm the relationship between the variance of the sample mean and the theoretical variance articulated by the Central Limit Theorem.
# Load the packages used in this analysis
library(dplyr)
library(lattice)
library(knitr)
opts_chunk$set(echo=TRUE, cache=TRUE)
nsim <- 1000
# Let ns be the sample size. ns=40
ns <- 40
# lambda is the rate parameter of the exponential distribution
lambda <- 0.2
# For exploring the sample size dependency, let us construct a vector with 6 possible sample sizes including the specific sample size under investigation (ns = 40)
n <- c(5,10,20,30,40,100)
nvec <- rep(n, each=nsim)
# Set up a data frame of 1000 simulations of the following:
# - sample mean (first column) for the various sample sizes, all in the same column
# - sample variance (second column) for the various sample sizes, all in the same column
# - sample size (third column) matching the simulations in the first two columns
# - Note: The sample sizes chosen for the analysis are a superset and include 40 as one case
sampleStats <- data.frame(Mean=rep(0,nsim*length(n)), Variance=rep(0,nsim*length(n)), SampleSize=nvec)
for(i in 1:length(n)) {
    start <- (i-1)*nsim + 1
    end <- start + nsim - 1
    # Draw nsim samples of size n[i] (one matrix row per sample) and compute
    # the mean and variance of each sample from the same draws
    samples <- matrix(rexp(nsim*n[i], lambda), nrow=nsim, ncol=n[i])
    sampleStats[start:end, 1] <- apply(samples, 1, mean)
    sampleStats[start:end, 2] <- apply(samples, 1, var)
}
sampleStats_5 <- filter(sampleStats, SampleSize == 5)
sampleStats_10 <- filter(sampleStats, SampleSize == 10)
sampleStats_20 <- filter(sampleStats, SampleSize == 20)
sampleStats_30 <- filter(sampleStats, SampleSize == 30)
sampleStats_40 <- filter(sampleStats, SampleSize == 40)
sampleStats_100 <- filter(sampleStats, SampleSize == 100)
Let us plot histograms of the sample mean for the various sample sizes (5, 10, 20, 30, 40, and 100) to get a high-level understanding of how the sample mean depends on the sample size.
See Appendix Figure 1 for a comparison of the histograms of sample means for the different sample sizes, including the sample size in question (40). It is clear from the density plots that the sample mean is centered around the theoretical mean, which shows that the sample mean is an unbiased estimator.
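As a supplementary numerical check (a sketch added here for illustration, not part of the original figures), the same comparison can be tabulated for every simulated sample size using dplyr; it assumes the sampleStats data frame constructed above, and the output column names are chosen only for readability.
# Average of the 1000 sample means per sample size vs. the theoretical mean 1/lambda
sampleStats %>%
    group_by(SampleSize) %>%
    summarise(MeanOfSampleMeans = mean(Mean)) %>%
    mutate(TheoreticalMean = 1/lambda)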
# Compute the sample mean (m) for the specific case of interest (sample size = ns = 40)
m <- mean(sampleStats_40$Mean)
cat("Mean of Sample Means (1000 simulations with ns=40): ", m, "\n")
## Mean of Sample Means (1000 simulations with ns=40): 4.987502
# With a sample size (ns) of 40, the mean of the sample means closely matches the theoretical mean
# of the exponential distribution with lambda = 0.2, which is 1/lambda = 5
cat("Theoretical Mean of exponential distribution(lambda = 0.2): ", 1/lambda, "\n")
## Theoretical Mean of exponential distribution(lambda = 0.2): 5
# Compute the variance of the sample mean (s^2)
v <- var(sampleStats_40$Mean)
cat("Estimated Variance of the sample means: ", v, "\n")
## Estimated Variance of the sample means: 0.6585668
In theory, the variance of the sample mean statistic is v = sigma^2/ns, where sigma^2 is the population variance and ns is the sample size.
Equivalently, sigma^2 ~ v * ns. Let us check how the theory stacks up in these simulations for ns = 40.
cat("Estimate of theoretical variance (lambda=0.2): ", v*ns, "\n")
## Estimate of theoretical variance (lambda=0.2): 26.34267
For a sample size (ns) of 40, this estimate corroborates the theoretical variance of the exponential distribution with lambda = 0.2: the theoretical standard deviation is 1/lambda = 5, so the theoretical variance is 25, and the estimate v*ns = 26.34 is close to it. Equivalently, the theoretical variance of the sample mean is 25/40 = 0.625, close to the estimated 0.659.
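The 1/ns scaling of the variance of the sample mean can also be checked across all of the simulated sample sizes. The following sketch (an addition for illustration, not part of the original report output) tabulates the variance of the sample means against (1/lambda)^2/n, assuming the sampleStats data frame built above.
# Variance of the 1000 sample means per sample size vs. sigma^2/n = (1/lambda)^2/n
sampleStats %>%
    group_by(SampleSize) %>%
    summarise(VarOfSampleMeans = var(Mean)) %>%
    mutate(TheoreticalVarOfMean = (1/lambda)^2 / SampleSize)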
Let us examine the difference between the distribution of a large collection of random exponentials and the distribution of a large collection of averages of 40 exponentials.
# Generate nsim (=1000) individual draws from an exponential distribution (lambda=0.2)
refDist <- rexp(nsim, lambda)
# Bind these 1000 draws (second column) with the 1000 sample means of 40 samples computed earlier (first column)
compDistDS <- data.frame(SampleMean_40=sampleStats_40$Mean, RefDist=refDist)
See Appendix Figure 2 for a comparison with the inherent distribution of the population from which the samples are drawn (exponential distribution with lambda = 0.2). It can be seen that the probability distribution of the sample mean does not resemble the original exponential pdf; instead it approaches a normal distribution with a mean equal to the population mean, as the Central Limit Theorem predicts. Also, the variance of the sample mean asymptotically shrinks toward zero as the sample size grows arbitrarily large. This clearly corroborates the CLT.
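A quick way to probe the normality claim for ns = 40 (a supplementary sketch, not part of the original figures) is a normal Q-Q plot of the 1000 sample means from base R.
# Points falling near the reference line indicate approximate normality of the sample means
qqnorm(sampleStats_40$Mean, main="Normal Q-Q Plot of Sample Means (ns = 40)")
qqline(sampleStats_40$Mean, col="red")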
In addition, this analysis compares the distribution of the sample variance for the various sample sizes, including 40. Please refer to Figure 3 in the Appendix to see how the sample variance approaches the population variance for large sample sizes.
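This, too, can be summarized numerically. The following sketch (added for illustration, assuming the sampleStats data frame above) tabulates the average sample variance per sample size against the theoretical variance (1/lambda)^2 = 25.
# Average of the 1000 sample variances per sample size vs. the theoretical variance
sampleStats %>%
    group_by(SampleSize) %>%
    summarise(MeanOfSampleVariances = mean(Variance)) %>%
    mutate(TheoreticalVariance = (1/lambda)^2)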
Refer to the following figure (Figure 1), which compares the sample mean distribution for different sample sizes.
Note: Each panel in the figure reflects a different sample size.
histogram(~Mean | as.factor(SampleSize),
          sampleStats,
          nint=50,
          type="density",
          panel=function(x, ...) {
              mean.values <- mean(x)
              panel.abline(v=mean.values, col.line="red", lwd=2, lty=2)
              panel.histogram(x, ...)
              panel.mathdensity(
                  dmath=dnorm,
                  col="black",
                  lwd=c(2),
                  args=list(mean=mean(x), sd=sd(x))
              )
          },
          xlab="Sample Mean Statistic",
          index.cond=list(c(5,6,3,4,1,2)),
          col=c("orange"),
          layout=c(2,3),
          main="Fig 1: Histograms of Sample Means"
)
The following figure (Figure 2) compares the original population distribution (exponential pdf) with the sample mean pdf for a sample size of 40. From the figure, it is discernible that the original distribution is exponential, but the distribution of the mean of 40 samples is asymptotically approaching a normal distribution. This corroborates the CLT's prediction about the asymptotic behavior of the sample mean.
histogram(~SampleMean_40 + RefDist,
          compDistDS,
          type="density",
          nint=50,
          col=c("orange"),
          panel=function(x, ...) {
              mean.values <- mean(x)
              panel.abline(v=mean.values, col.line="red", lwd=2, lty=2)
              panel.histogram(x, ...)
          },
          main="Fig 2: Sample Mean PDF vs Population PDF"
)
The following figure (Figure 3) compares the sample variance pdf as a function of sample size. It is discernible that the sample variance asymptotically approaches the theoretical variance of the exponential distribution.
# Let us explore how the sample variance relates to the theoretical variance for the various sample sizes, including 40
histogram(~Variance | as.factor(SampleSize),
          sampleStats,
          nint=100,
          type="density",
          panel=function(x, ...) {
              mean.values <- mean(x)
              panel.abline(v=mean.values, col.line="red", lwd=2, lty=2)
              panel.histogram(x, ...)
              panel.mathdensity(
                  dmath=dnorm,
                  col="black",
                  lwd=c(2),
                  args=list(mean=mean(x), sd=sd(x))
              )
          },
          xlab="Sample Variance Statistic",
          index.cond=list(c(5,6,3,4,1,2)),
          col=c("orange"),
          xlim=c(0,120),
          layout=c(2,3),
          main="Fig 3: Histograms of Sample Variance"
)