Statistical Inference Course Project

Overview

This project investigates the exponential distribution in R through the lens of the Central Limit Theorem (CLT). The CLT states that the sampling distribution of the mean of independent, identically distributed draws from a distribution with finite variance will be normal or nearly normal if the sample size is large enough.

The shape of the exponential distribution is defined by its rate \(\lambda\), which yields \(\mu = 1/\lambda\) and \(\sigma = 1/\lambda\). Our exponential distribution has \(\lambda = 0.2\), so \(\mu = 5\), \(\sigma = 5\), and \(\sigma^2 = 25\).
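As a quick sanity check of these values, the following minimal sketch draws a large sample with `rexp` and compares its mean and standard deviation with \(1/\lambda\). The seed and the 100,000-draw size are arbitrary choices for illustration and are not part of the original analysis.

```r
# Illustrative check; the seed and draw count are assumptions, not from the report.
set.seed(1)
lambda <- 0.2
mu     <- 1 / lambda   # theoretical mean
sigma  <- 1 / lambda   # theoretical standard deviation

x <- rexp(100000, rate = lambda)
c(theoretical_mean = mu, simulated_mean = mean(x))
c(theoretical_sd = sigma, simulated_sd = sd(x))
```

With this many draws, both simulated values should land close to 5.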

The two statistics we are interested in are the mean, \(\bar{X} = \sum_{i=1}^n X_i/n\), and the variance, \(\sum_{i=1}^n (X_i - \bar{X})^2/(n-1)\).
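These two formulas are what R's built-in `mean()` and `var()` compute. The sketch below, on an arbitrary illustrative sample (the seed is an assumption), checks this by hand and shows that `var()` divides by n - 1.

```r
# Illustrative sample; the seed is arbitrary.
set.seed(1)
x <- rexp(40, rate = 0.2)
n <- length(x)

mean_by_hand <- sum(x) / n
var_by_hand  <- sum((x - mean_by_hand)^2) / (n - 1)  # note the n - 1 divisor

all.equal(mean_by_hand, mean(x))  # TRUE
all.equal(var_by_hand, var(x))    # TRUE
```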

We will use kurtosis and skewness to see how “normal” our sampling distributions are.
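For reference, skewness and kurtosis can be written in terms of central moments. The sketch below defines two hypothetical helpers, `skew` and `kurt`, in base R; they are not from the report, which uses the `moments` package, but they should follow the same convention in which a normal distribution has skewness 0 and kurtosis 3.

```r
# Hypothetical helpers (not from the report): moment-based skewness and
# Pearson kurtosis, using divisors of n for the central moments.
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3 / 2)
}
kurt <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

set.seed(1)                 # arbitrary seed for illustration
z <- rnorm(100000)          # normal draws: skewness near 0, kurtosis near 3
e <- rexp(100000, 0.2)      # exponential draws: skewness near 2, kurtosis near 9
round(c(skew(z), kurt(z), skew(e), kurt(e)), 2)
```

Values near 0 and 3 therefore suggest a roughly normal shape, while a large positive skewness or a kurtosis well above 3 points to a long right tail.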

All of the code used to produce the results will be in the appendix.

Simulations

First, let’s construct a sampling distribution of the mean by running 1000 simulations, each drawing a sample of size 40 and taking its mean. Our \(\lambda\) = 0.2.

This distribution looks far more normal than the underlying exponential distribution. We can clearly see that most of the mass is near the expected value and the density is bell shaped. It looks approximately normal because each mean is computed from a reasonably large sample, n = 40; the 1000 simulations only make the estimated density smoother. The skewness of 0.239 and kurtosis of 2.886 are near those of the normal distribution, 0 and 3.

We can also see that the mean of the sampling distribution, 5.001, is very close to \(\mu\) = 5. It is so close that the black and red mean lines are practically on top of each other. The black line on the right marks three theoretical standard deviations above the expected mean, and the red line on the right marks three computed standard deviations above the computed mean. The variance of the distribution is 0.638. This value is also very close to 0.625, the theoretical variance given by \(\sigma^2/n\) = 25/40.
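The theoretical variance of 0.625 follows from the independence of the draws. A short derivation, using only the values already stated (\(\sigma^2\) = 25, \(n\) = 40):

\[
\operatorname{Var}(\bar{X})
= \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^n X_i\right)
= \frac{1}{n^2}\sum_{i=1}^n \operatorname{Var}(X_i)
= \frac{\sigma^2}{n}
= \frac{25}{40}
= 0.625,
\qquad
\operatorname{SD}(\bar{X}) = \frac{\sigma}{\sqrt{n}} = \frac{5}{\sqrt{40}} \approx 0.791
\]

The three-standard-deviation mark above the expected mean therefore sits at about 5 + 3(0.791) ≈ 7.37.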

Let’s take a look at the variance distribution.

This distribution looks somewhat normal but is not quite there. The skewness, 1.515, indicates that the distribution is too right skewed to be considered normal. The right skew makes sense because a variance cannot be negative. Furthermore, the kurtosis, 7.146, indicates that there is a lot of mass in the tails. These two measures show that the distribution is not normal, although visually it resembles one. The distribution mean, 25.007, is very close to the theoretical variance of 25. They are so close that the black line representing the expected value is right on top of the red line representing the computed mean of the variances. This is where the CLT gets a little fuzzy with the “large enough sample size” requirement: a sample size of 40 is enough for the mean but not for the variance. If we increase the sample size to something much larger, say 1000, the distribution will approximate the normal; increasing the number of simulations, say to 1000000, only smooths the estimate and does not change its shape.
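To illustrate how the sample size affects the shape of the variance distribution, the sketch below compares the skewness of simulated sample variances at n = 40 and n = 1000. The helper `skew`, the seed, and the choice of 2000 simulations are assumptions made for illustration; the report itself uses the `moments` package and 1000 simulations.

```r
# Hypothetical helper: moment-based skewness (central moments divided by n).
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3 / 2)
}

set.seed(1)    # arbitrary seed for illustration
nsim <- 2000   # number of simulated samples, held fixed for both sample sizes

vars_n40   <- replicate(nsim, var(rexp(40, rate = 0.2)))
vars_n1000 <- replicate(nsim, var(rexp(1000, rate = 0.2)))

# Skewness of the simulated variances at each sample size
c(n40 = skew(vars_n40), n1000 = skew(vars_n1000))
```

Holding the number of simulations fixed isolates the effect of the sample size, so any drop in skewness comes from the larger samples alone.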

Now let’s take a look at the difference between a large collection of random exponentials and the distribution of 10,000 averages of 40 exponentials. We will construct a histogram of the underlying exponential distribution and a histogram of the distribution of exponential means.

As mentioned before, a large enough sample size will cause the sampling distribution of the mean to be approximately normal. The skewness of 0.307 and kurtosis of 3.144 are consistent with the normal distribution. By comparison, the exponential draws have a skewness of 1.996 and a kurtosis of 8.939, close to the theoretical exponential values of 2 and 9 and far from normal.

To sum things up, we have used the CLT to show that the sampling distribution of the mean is approximately normal when the sample size is large enough, even when the underlying distribution is not normal. To see a sampling distribution of the variance that is approximately normal, one would need much larger samples, because the variance distribution starts off with a lingering right skew that disappears only at very large sample sizes.

Appendix

set.seed(517)
# 1000 simulated means, each from a sample of 40 exponentials
mns <- replicate(1000, mean(rexp(n = 40, rate = 0.2)))
plot(density(mns),main = "Density Plot of Mean Distribution", xlab = "Mean of X", col = "blue", sub = "Black = Expected Value, Red = Computed Value")
abline(v=5,lwd=1)

sampdistmean<-round(mean(mns),3)
sampdistvar<-round(var(mns),3)
sampdistsd<-round(sd(mns),3)
theorvar<-25/40

abline(v=sampdistmean,lwd=1,col="red")

# Three standard deviations above the mean: black = theoretical, red = computed
abline(v=5+3*sqrt(theorvar),lwd=1,col="black")
abline(v=sampdistmean+3*sampdistsd,lwd=1,col="red")

if("moments" %in% rownames(installed.packages()) == FALSE) {install.packages("moments")}

library(moments)

# Store results under new names so the moments functions are not masked
mnskurtosis<-round(kurtosis(mns),3)
mnsskewness<-round(skewness(mns),3)

set.seed(517)
# 1000 simulated variances, each from a sample of 40 exponentials
vars <- replicate(1000, var(rexp(n = 40, rate = 0.2)))
plot(density(vars),main = "Density Plot of Variance Distribution", xlab = " Var of X", col = "blue", sub = "Black = Expected Value, Red = Computed Value")

abline(v=25,lwd=1)

vardistmean<-round(mean(vars),3)
vardistsd<-round(sd(vars),3)

abline(v=vardistmean,lwd=1,col="red")

library(moments)

varskurtosis<-round(kurtosis(vars),3)
varsskewness<-round(skewness(vars),3)

par(mfrow=c(1,2))

set.seed(517)
expoDist <- replicate(n = 10000, expr = rexp(40, 0.2))
hist(expoDist, main = "10,000 Exp. Dists.", xlab = "X",xlim = c(0,30),col="blue", sub = "Black Line = Expected Value")
abline(v=5,lwd=10)

# 10,000 simulated means, each from a fresh sample of 40 exponentials
mns2 <- replicate(10000, mean(rexp(n = 40, rate = 0.2)))
hist(mns2, main = "10,000 Means Of Exp. Dists.", xlab = "X Bar",xlim = c(3,8),col="blue", sub = "Black Line = Expected Value")
abline(v=5,lwd=10)

library(moments)

# Moments of the 10,000 means
mns2kurtosis<-round(kurtosis(mns2),3)
mns2skewness<-round(skewness(mns2),3)

# Moments of the pooled exponential draws
expokurtosis<-round(kurtosis(as.vector(expoDist)),3)
exposkewness<-round(skewness(as.vector(expoDist)),3)

Statistical Inference Course Project - Simulations

Daniel Alaiev

August 22, 2015
