Analysis

First, I’ll set some globals and constants for this R Markdown document.

knitr::opts_chunk$set(echo = TRUE,results='markup',cache=FALSE)
library(ggplot2)
set.seed(101)
lambda <- 0.2
n <- 40; B <- 1000
alpha <- 0.05

Next, I’ll store a single simulation of 40 random exponentials and a set of 1000 averages of 40 random exponentials. I’ll also calculate the mean and variance of the simulated averages, along with the corresponding theoretical values for the exponential distribution.

singleSimulation <- rexp(n,lambda)
allExpSimulations <- NULL
for (i in 1:B) allExpSimulations <- c(allExpSimulations,mean(rexp(n,lambda)))
simulationsMean <- mean(allExpSimulations); simulationsVar <- var(allExpSimulations)
theoreticalMean <- 1/lambda; theoreticalVar <- (1/lambda)^2
## we'll plot a two-sided 95% t confidence interval around the theoretical mean
## (since the mean of 40 exponentials is approximately normal)
tquantiles <- qt(c(alpha/2,1-(alpha/2)), n-1)
tquantiles <- theoreticalMean + tquantiles*sqrt(theoreticalVar/n)
## and I'll make some "neater" ways of displaying these values (for the math later)
## note that these are *only* used when describing something, *not* when calculating something
simMean <- round(simulationsMean,digits=3); simVar <- round(simulationsVar,digits=3)
theoMean <- round(theoreticalMean,digits=3); theoVar <- round(theoreticalVar,digits=3)
simSD <- round(sqrt(simulationsVar),digits=3)
theoSD <- round(sqrt(theoreticalVar),digits=3)
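
As an aside, the accumulation loop above could also be written without growing a vector inside a loop. This is only a sketch and is not used for any of the numbers reported below; if run in place of the loop, starting from the same seed, it draws the same values in the same order:

allExpSimulations <- replicate(B, mean(rexp(n, lambda)))  ## one mean of 40 exponentials per replication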

According to the CLT, we should see the following relations within our data: \[\bar{X}\sim N(\mu_{0},\sigma^{2}/n)\] \[E(\bar{X})\approx \mu_{0}\] \[S(\bar{X})\approx \sigma/\sqrt{n}\]
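
Plugging in the values used here (\(\lambda=0.2\), \(n=40\), and \(\mu_{0}=\sigma=1/\lambda\) for the exponential distribution), these predictions become:

\[\mu_{0}=\frac{1}{\lambda}=5,\qquad \frac{\sigma^{2}}{n}=\frac{25}{40}=0.625,\qquad \frac{\sigma}{\sqrt{n}}=\frac{5}{\sqrt{40}}\approx 0.791\]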

So now let’s plot the densities. I sampled \(10^{6}\) random points for the theoretical normal distribution, which seemed high enough for a smooth curve in the visual demonstration; note that this number is not used in any of the calculations.

g <- ggplot() +
    geom_density(aes(x=singleSimulation,color='single',fill='single'),alpha=0.25) +
    geom_vline(aes(xintercept=mean(singleSimulation)),color='red',linetype='solid',size=0.5) +
    
    geom_density(aes(x=allExpSimulations,color='sim',fill='sim'),alpha=0.3) +
    geom_vline(aes(xintercept=simulationsMean),color='blue',linetype='solid',size=1) +

    geom_vline(aes(xintercept=theoreticalMean),color='black',linetype='dashed',size=1) +
    geom_vline(aes(xintercept=tquantiles),color='black',linetype='dotted') +
    geom_density(aes(x=rnorm(1e+06,mean=theoreticalMean,sd=sqrt(theoreticalVar/n)),
                     color='theo',fill='theo'),alpha=0.2) +

    coord_cartesian(xlim = c(0,10)) +
    ylab("Density") + xlab("Value") +
    scale_color_manual(guide='none',
                       values=c('blue','red','black'),
                       breaks=c('single','sim','theo'),
                       labels=c(expression(X[exp]),
                                expression(bar(X)[exp]),
                                expression(paste('N(',mu[0],', ',sigma^2/n,')')))) +
    scale_fill_manual(guide='legend',name = '',
                      values=c('blue','red','black'),
                      breaks=c('single','sim','theo'),
                      labels=c(expression(X[exp]),
                               expression(bar(X)[exp]),
                               expression(paste('N(',mu[0],', ',sigma^2/n,')')))) +
    labs(caption = 'The dotted black lines show the two-sided 95% confidence interval for 
             our measurement, centered around the theoretical mean (dashed black). 
             The solid red line shows the mean of a single simulation of 40 random 
             exponentials. Finally, the solid blue line shows the average of the 
             1000 simulated means of 40 random exponentials.')
print(g)
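
As an aside, the theoretical curve could also be drawn analytically with stat_function() instead of sampling \(10^{6}\) points; a minimal sketch of just that layer (not used for the figure above):

## draws N(mu0, sigma^2/n) directly over the same x-range as the plot above
ggplot(data.frame(x = c(0, 10)), aes(x = x)) +
    stat_function(fun = dnorm,
                  args = list(mean = theoreticalMean, sd = sqrt(theoreticalVar/n))) +
    ylab("Density") + xlab("Value")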

As we can see from the plot, the (red) distribution of a single simulation of 40 random exponentials is not Gaussian at all; however, the (blue) distribution of 1000 averages of 40 random exponentials is roughly bell-shaped and well described by a t-distribution with \(n-1\) degrees of freedom. Furthermore, as predicted by the CLT, this t-distribution approaches a normal distribution centered around the theoretical mean, which for an exponential distribution is also equal to its standard deviation (\(\mu_{0}=\sigma={\lambda^{-1}}=5\)). Substituting the values we calculated into the relations above, we can re-emphasize some of the values highlighted in the plot and compute a t-statistic for this distribution:

\[\bar{X}_{exp}\sim t\left(E(\bar{X}_{exp}),\,n-1\right)=t(5.014,\,39)\approx N(5,\,0.625)=N(\mu_{0},\,\sigma^{2}/n)\] \[S(\bar{X}_{exp})=0.773\approx 0.791=5/\sqrt{40}=\sigma/\sqrt{n}\] \[T=\frac{E(\bar{X}_{exp})-\mu_{0}}{\sigma/\sqrt{n}}=\frac{5.014-5}{5/\sqrt{40}}\approx 0.0174\]
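
For reference, the first of these comparisons can also be read directly off the simulation objects defined earlier; a minimal sketch (output omitted):

## sample statistics of the 1000 averages vs. the CLT predictions
rbind(simulated   = c(mean = simulationsMean, variance = simulationsVar),
      theoretical = c(mean = theoreticalMean, variance = theoreticalVar/n))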

From here, it’s a simple matter to convert the t-statistic into a p-value. Note that we want the upper tail, since we are finding the probability of values at least as extreme as \(E(\bar{X}_{exp})\):

pt(((simulationsMean-theoreticalMean)/(sqrt(theoreticalVar/n))),
   df=n-1,lower.tail = FALSE)
## [1] 0.4931206
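
As a quick sanity check (a sketch, not part of the calculation above): with 39 degrees of freedom the t-distribution is already very close to normal, so the same upper-tail probability can be approximated with pnorm() and should come out nearly identical:

## normal approximation to the same upper-tail probability
pnorm(simulationsMean, mean = theoreticalMean,
      sd = sqrt(theoreticalVar/n), lower.tail = FALSE)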

This tells us that, under the same conditions (samples of 40 random exponentials), we have an approximately 49.31% chance of arriving at a similar or greater value of \(E(\bar{X}_{exp})\) in future simulations. This makes sense: the value we calculated is so close to \(\mu_{0}\), relative to the spread of the distribution, that we are approaching the point where we are equally likely (50%) to see \(E(\bar{X}_{exp})<\mu_{0}\) as \(E(\bar{X}_{exp})>\mu_{0}\). In other words, \(E(\bar{X}_{exp})\) approaches \(\mu_{0}\) as \(n\) increases, just as the CLT predicts.
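
To make that last point concrete, here is a small sketch (not part of the analysis above; the sample sizes are arbitrary choices for illustration) that repeats the experiment for a few values of \(n\): the spread of the simulated averages shrinks like \(\sigma/\sqrt{n}\), pulling their mean ever closer to \(\mu_{0}=5\).

## for each sample size m, simulate B averages of m exponentials and summarize them
sapply(c(10, 40, 160, 640), function(m) {
    avgs <- replicate(B, mean(rexp(m, lambda)))
    c(n = m, mean = mean(avgs), sd = sd(avgs))
})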