Demonstrating the Central Limit Theorem with Exponential Data

Overview

This project investigates the exponential distribution in R and compares it with the predictions of the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda), where lambda is the rate parameter. The mean of the exponential distribution is 1/lambda and the standard deviation is also 1/lambda. I set lambda = 0.2 for all of the simulations and investigated the distribution of averages of 40 exponentials, based on a thousand simulations.
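
As a quick reference for the values used throughout, the theoretical quantities implied by lambda = 0.2 and samples of size 40 can be computed directly. This is a minimal sketch; the variable names (n, pop_mean, se_mean and so on) are illustrative choices and not part of the original analysis.

```r
# Theoretical quantities for an exponential population with rate lambda.
# Variable names here are illustrative, not taken from the original report.
lambda <- 0.2   # rate parameter used in every simulation
n <- 40         # number of exponentials averaged in each sample

pop_mean <- 1 / lambda          # population mean: 5
pop_sd   <- 1 / lambda          # population standard deviation: 5
pop_var  <- pop_sd^2            # population variance: 25
se_mean  <- pop_sd / sqrt(n)    # standard deviation of a mean of 40 draws, about 0.79

print(c(mean = pop_mean, sd = pop_sd, var = pop_var, se_mean = se_mean))
```

Under the Central Limit Theorem, the means of 40 exponentials should therefore be approximately normal with mean 5 and standard deviation of about 0.79.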

Data

First I generate a set of random exponential samples with the rate parameter, lambda, set to 0.2. This is done with the R statement Ex<-rexp(1000,.2). The following histogram shows the distribution of 1000 random exponential numbers with the rate set to 0.2. The red line marks the population mean of 5. From the histogram, the sample mean can be estimated to lie in the vicinity of 5.

set.seed(6)
Ex<-rexp(1000,.2)
hist(Ex, ylab="Frequency of Occurrence", xlab="Exponential Random Variable Value",main="Histogram of Frequency vs. Value")
# Red vertical line at the population mean, 1/lambda = 5
abline(v=5,col="red")

Simulation

The simulation consists of two sets of 1000 entries each. In the first set (mns), each entry is the mean of a sample of 40 exponential data items. The second (vrs) is a similar set of sample variances, computed from separate draws. Thus we have two 1000-entry vectors, one estimating the mean and one estimating the variance of the exponential population. The code for each calculation is given below.

#CREATE SIMULATION DATA FOR THE MEAN
# 1000 sample means, each from 40 draws with rate 0.2
mns = replicate(1000, mean(rexp(40,.2)))

#CREATE SIMULATION DATA FOR THE VARIANCE
# 1000 sample variances, each from a fresh set of 40 draws
vrs = replicate(1000, var(rexp(40,.2)))

Mean Simulation

Examining the mean data, mns, we find individual values that appear to be centered about the population mean of 5, although this may not be immediately obvious from the raw numbers. The histogram makes it more obvious that the values are centered about 5. The red line represents the population mean.

head(mns)

## [1] 6.144469 4.338597 3.955624 5.952396 3.905764 4.097218

str(mns)

##  num [1:1000] 6.14 4.34 3.96 5.95 3.91 ...

hist(mns,xlab="Sample Mean of 40 Exponential Random Variables",ylab="Number of Values",main="Frequency vs. Mean")
abline(v=5,col="red")

Using the R function mean on the entire set of sample means, mns, we see that the result is extremely close to the population mean of 5. This agrees with the Central Limit Theorem, which says that the sample means are approximately normally distributed around the theoretical mean of 5; the more sample means we average, the closer their mean comes to that value.

mean(mns)

## [1] 4.952961
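
The center is only half of what the Central Limit Theorem predicts. It also implies that the sample means should have a standard deviation close to (1/lambda)/sqrt(40), about 0.79. The following sketch checks this on a fresh simulation. The seed and the name sim_means are assumptions made for illustration, so the numbers will differ from the mns values above.

```r
# Sketch: compare the spread of simulated sample means with the CLT prediction.
# The seed and variable names are illustrative; results will not match mns exactly.
set.seed(1)
lambda <- 0.2
n <- 40

sim_means <- replicate(1000, mean(rexp(n, lambda)))

observed_sd <- sd(sim_means)            # empirical spread of the 1000 means
expected_sd <- (1 / lambda) / sqrt(n)   # theoretical standard error, about 0.79

print(c(observed = observed_sd, expected = expected_sd))
```

If the two printed values are close, the simulated means match the theorem in spread as well as in center.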

Variance Simulation

Examining the variance data, vrs, we find individual values that could be centered about the population variance of 25, although this may not be immediately obvious from the raw numbers. The histogram makes it more obvious that the values are centered about 25. The red line represents the population variance.

head(vrs)

## [1] 36.29161 16.69447 47.83164 26.16336 12.11534 48.43394

str(vrs)

##  num [1:1000] 36.3 16.7 47.8 26.2 12.1 ...

hist(vrs,xlab="Sample Variance of 40 Exponential Random Variables",ylab="Number of Values",main="Frequency vs. Variance")
abline(v=25,col="red")

Using the R function mean on the entire set of sample variances, vrs, we see that the result is extremely close to the population variance of 25. As with the sample means, averaging many sample variances brings their mean close to the theoretical value of 25, because the sample variance is an unbiased estimator of the population variance.

mean(vrs)

## [1] 25.33699
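
The wide scatter of the individual values in vrs, and the fact that their average still lands near 25, can both be read off from standard results for the sample variance S^2 of n independent draws. The derivation below is a sketch using textbook formulas, with sigma^2 = 1/lambda^2 and the exponential distribution's fourth central moment mu_4 = 9/lambda^4. The symbols are introduced here for illustration and do not appear in the original analysis.

```latex
\begin{align*}
E[S^2] &= \sigma^2 = \frac{1}{\lambda^2} = \frac{1}{0.2^2} = 25, \\
\operatorname{Var}(S^2) &= \frac{1}{n}\left(\mu_4 - \frac{n-3}{n-1}\,\sigma^4\right)
  = \frac{1}{40}\left(9 \cdot 625 - \frac{37}{39} \cdot 625\right) \approx 125.8, \\
\operatorname{sd}(S^2) &\approx \sqrt{125.8} \approx 11.2 .
\end{align*}
```

So a single sample variance from 40 draws typically misses 25 by around 11, which is broadly in line with the values printed by head(vrs), while the average of 1000 of them is much more tightly concentrated.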

Distribution Analysis

If we examine the first histogram, reproduced below, we see that the distribution of the raw exponential data is far from normal. It is skewed, with most values piled up near the origin, as one would expect. However, by taking many samples and examining the mean, standard deviation and variance of each, one can conclude that the distributions of these sample statistics are approximately normal. If I increase each of the parameters 1000 and 40 by one order of magnitude, I get even better results. Examine the histograms below, where the parameters have been changed to 10000 and 400.

set.seed(6)
Ex<-rexp(1000,.2)
hist(Ex, ylab="Frequency of Occurrence", xlab="Exponential Random Variable Value",main="Histogram of Frequency vs. Value")
# Red vertical line at the population mean, 1/lambda = 5
abline(v=5,col="red")

#CREATE SIMULATION DATA FOR THE MEAN
# 10000 sample means, each from 400 draws with rate 0.2
ms = replicate(10000, mean(rexp(400,.2)))

#CREATE SIMULATION DATA FOR THE VARIANCE
# 10000 sample variances, each from a fresh set of 400 draws
vs = replicate(10000, var(rexp(400,.2)))

# freq=FALSE puts the histogram on the density scale so the density curve overlays it correctly
hist(ms,freq=FALSE,xlab="Sample Mean of 400 Exponential Random Variables",ylab="Density",main="Density vs. Mean")
abline(v=5,col="red")
lines(density(ms))

hist(vs,freq=FALSE,xlab="Sample Variance of 400 Exponential Random Variables",ylab="Density",main="Density vs. Variance")
abline(v=25,col="red")
lines(density(vs),lwd=3,col="blue")

The distributions of the sample mean and the sample variance both appear approximately normal.
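
One way to put a number on "appears normal" is the sample skewness, which is 0 for a symmetric distribution such as the normal. For the mean of n exponentials the theoretical skewness is 2/sqrt(n), so it should shrink from about 0.32 at n = 40 to 0.1 at n = 400. The sketch below is an illustration under assumed settings: the helper skewness(), the seed, and the use of 10000 simulations for both sample sizes are choices made here, not part of the original analysis.

```r
# Sketch: measure how skewed the distribution of sample means is.
# skewness() is a hypothetical helper defined here, not a base R function.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5   # third standardized moment
}

set.seed(1)
lambda <- 0.2
means_40  <- replicate(10000, mean(rexp(40,  lambda)))
means_400 <- replicate(10000, mean(rexp(400, lambda)))

# Theory for a mean of n exponentials: skewness = 2 / sqrt(n)
print(c(n40 = skewness(means_40),  theory40  = 2 / sqrt(40)))
print(c(n400 = skewness(means_400), theory400 = 2 / sqrt(400)))
```

A normal Q-Q plot, drawn with qqnorm(means_400) followed by qqline(means_400), gives a visual version of the same check.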