Statistical Inference Course Project Part 1

Simulated comparison of the means of samples of exponentiallly distributed iids and the CLT.

Introduction

The Central Limit Theorum (CLT) states that the mean of large numbers of random variables (identical and independantly distributed, with a defined mean and variance) will be approximately normally distributed, and as the number of observations in each mean calculation increases so the distribution of means approaches a normal distribution.

Although exponentially distributed random variables do have a defined mean and variance, they are significantly non-normal:
1. An exponential random variable cannot take a negative value.
2. Skewness and kurtosis for an exponential random variable are both greater than for a normally distributed random variable.

This makes exponentially distributed random variables a good illustration of the CLT. The random variable considered in the following is that created by taking the mean of forty iid exponentially distributed random variables with rate or lamda = 0.2.

Results

1. Theoretical and sample mean

On the histogram below, the theoretical mean is shown as a solid line and the average sample mean as a dashed line.

The values are quite close to each other – the theoretical mean, 1 / lambda = 5, and the average sample mean in this case ~ 5.02.

2. Theoretical and sample variance

This time the theoretical and average sample variances of the mean are shown as a solid line a dashed line, respectively.

Again, the values are close to each other. The variance of an exponentially distributed random variable (when lambda = 0.2) is 1/0.2^2 = 25. Therefore the theoretical CLT variance is sigma / n ^ 0.5 = sqrt(25/40) ~ 0.79, and the average sample variance is ~ 0.63. However, the histogram shows that although the theoretical and sample variances are close, none of the sample variances are as large as the theoretical variance. In fact, the largest sample variance is only approximately 0.73

3. Comparison with the normal distribution

There are a number of ways to assess the “normality” of a distribution, and the simplest of these is a visual one - the QQ plot. Sample quantiles are plotted against theoretical quantiles, and the relationship should be close to 1:1. In this case, means of forty iid exponential random variables compared to normally distributed iid random variables with mean = 5 and sd = 5/sqrt(40).

The result is quite close to the theoretical distribution (the relationship is close to 1:1 for a large interval) but not exact. In fact the Kolgmogorov-Smirnov goodness of fit test p value is 0.3999976 – which suggests that the null hypothesis that the data are normally distributed is unlikely but not impossible.

4. Comparison with the gamma distribution

As a comparison, here are the same results when comparing the data to the actual (known) distribution of the means of forty exponential random variables - the gamma distribution where alpha = 40 and beta = 8.

This is a much closer relationship between sample and theoretical quantiles, and the Kolgmogorov-Smirnov goodness of fit test p value is 0.3268093.

Code

1. Generating random variables

This code snippet replicates one thousand times the mean of forty exponential random variables, with lambda=0.2, and assigns to the variable “rv” in dataframe ‘rnd_df’:

lambda <- 0.2
rnd_df <- data.frame(rv = replicate(1000, mean(rexp(40, rate=lambda))))

Similarly, the next code snippet replicates one thousand times the mean of forty exponential random variables, takes the variance, and then replicates that procedure another thousand times. The variance rvs are assigned to the variable “rv” in dataframe ‘rnd_df2’:

lambda <- 0.2
rnd_df2 <- data.frame(rv = replicate(1000, 
                                     var(replicate(1000, mean(rexp(40, rate=lambda))))))

2. Plots and analysis

The histograms and QQ plots are then straight forward using ggplot2:

### first histogram:
ggplot(rnd_df, aes(rv)) + stat_bin(binwidth= 0.15, alpha=0.75) + 
        xlab("") + ylab("Count") + 
        geom_vline(data = mean_df,aes(xintercept=value, linetype=type)) +
        theme_bw() +  
        ggtitle("Histogram: averages of means of iid exponential variables") 

### second histogram:

ggplot(rnd_df2, aes(rv)) + stat_bin(binwidth= 0.01, alpha=0.75) + 
        xlab("") + ylab("Count") + 
        geom_vline(data = var_df, aes(xintercept=value, linetype=type)) +
        theme_bw() +  
        ggtitle("Histogram: variance of means of iid exponential variables") 

### first qq plot:

ggplot(rnd_df, aes(sample=rv)) + 
        stat_qq(distribution=qnorm, dparams=list(mean=5, sd = 5/sqrt(40))) +
        geom_abline(a=0,b=1) + 
        coord_fixed() + xlab("Theoretical normal") + ylab("Sample") + 
        theme_bw() +  
        ggtitle("QQ plot") 

### second qq plot:

ggplot(rnd_df, aes(sample=rv)) + 
        stat_qq(distribution=qgamma, dparams=list(shape=40, rate = 8)) +
        geom_abline(a=0,b=1) + 
        coord_fixed() + xlab("Theoretical gamma") + ylab("Sample") + 
        theme_bw() +  
        ggtitle("QQ plot")

Finally, the the ks.test p.values are calculated as follows:

# comparison to normal
ks.test(rnd_df$rv, pnorm, 5, 5/sqrt(40))

# comparison to gamma
ks.test(rnd_df$rv, pgamma, 40, 8)

To return only the p value, use eg

ks.test(rnd_df$rv, pnorm, 5, 5/sqrt(40))$p.value