Statistical Inference Assignment: Part 1

Overview

In this report we investigate the exponential distribution and compare the distribution of its sample means with what the Central Limit Theorem predicts. The distribution is simulated with the rexp(n, lambda) command, where lambda is the rate parameter. The mean of the exponential distribution is 1/lambda, and its standard deviation is also 1/lambda (so its variance is 1/lambda^2).
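For reference, the facts this report relies on can be written out as follows. This is the standard textbook statement rather than anything specific to the assignment; the symbols X_i, \bar{X}_n, \mu and \sigma are notation introduced here for illustration.

```latex
% Requires amsmath and amssymb.
% X_1, ..., X_n are independent draws from an exponential distribution with rate \lambda.
\[
  \mathbb{E}[X_i] = \mu = \frac{1}{\lambda}, \qquad
  \operatorname{Var}(X_i) = \sigma^2 = \frac{1}{\lambda^2}, \qquad
  \operatorname{SD}(X_i) = \sigma = \frac{1}{\lambda}
\]
% Central Limit Theorem: for large n the sample mean is approximately normal.
\[
  \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i
  \;\overset{\text{approx.}}{\sim}\;
  \mathcal{N}\!\left(\frac{1}{\lambda},\; \frac{1}{\lambda^{2} n}\right)
\]
```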

We address three main issues concerning the distribution of means of 40 exponentials: 1. Show where the distribution is centered and compare it to the theoretical center of the distribution. 2. Show how variable it is and compare it to the theoretical variance of the distribution. 3. Show that the distribution is approximately normal.

We use the parameter lambda = 0.2 for all the simulations.

Setting up the environment and the data.

First, we set the working directory and load knitr and ggplot2:

setwd("C:/Users/ftorrent/Desktop/Data Science Track1/Coursera/Statistical Inference")
library(knitr)
library(ggplot2)

I set echo to TRUE so that all code is shown in the report. To make the analysis reproducible, meaning it gives the same results each time it is run, the random seed must be set to a known value so that the rexp function always generates the same values. In the following code I set it to 1.

opts_chunk$set(echo=TRUE)
set.seed(1)

Now I need to set the variables as defined in the problem: the number of values per sample, n = 40; the rate, lambda = 0.2; and the number of simulations, which must be at least 1000 and is set here to numsim = 2000.

n<-40
lambda<-0.2
numsim<-2000

Using the rexp function, I now generate the data as a matrix with numsim rows and n columns, so that each row holds one simulated sample of n values:

dataset<-matrix(rexp(n*numsim,lambda),numsim)
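To make the layout of that matrix concrete, here is a minimal sketch on a much smaller scale. The names n_demo, numsim_demo and demo are hypothetical and exist only for this illustration; they are not part of the assignment code.

```r
# Hypothetical small-scale version of the simulation layout (not the real data)
n_demo      <- 4   # values per simulated sample
numsim_demo <- 3   # number of simulated samples
lambda_demo <- 0.2 # same rate as in the report

# matrix() fills column by column; because the draws are independent and
# identically distributed, each row is still a valid sample of n_demo values.
demo <- matrix(rexp(n_demo * numsim_demo, rate = lambda_demo), nrow = numsim_demo)

dim(demo)                     # rows = simulations, columns = draws per simulation
demo_means <- rowMeans(demo)  # one mean per simulated sample
length(demo_means)
```

In the full simulation the same shape applies with numsim rows and n columns, so taking the mean of each row yields numsim sample means.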

Sample Mean versus Theoretical Mean

The theoretical mean is 1/lambda = 1/0.2 = 5. We compare it with the sample mean, that is, the mean of the row means of the dataset we created:

TheoryMean<-1/lambda
RowMeans<-apply(dataset,1,mean)
ActualMean<-mean(RowMeans)

Therefore, we get a sample mean of 5.01876, slightly above the theoretical mean of 5.
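One way to judge whether that gap is small is to compare it with the standard error of the average of numsim sample means. The sketch below does this arithmetic; the value reported_mean is simply copied from the text above, and the variable names are hypothetical.

```r
# Rough plausibility check of the reported gap (values hard-coded from the text)
lambda <- 0.2
n      <- 40
numsim <- 2000
reported_mean <- 5.01876  # sample mean quoted in the report

theory_mean <- 1 / lambda
# Standard error of the average of numsim sample means, each based on n draws
se_of_average <- (1 / lambda) / sqrt(n) / sqrt(numsim)
gap_in_se     <- (reported_mean - theory_mean) / se_of_average

se_of_average
gap_in_se
```

Under these assumptions the gap works out to roughly one standard error, which is the size of deviation one would expect from simulation noise alone.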

Sample Variance versus Theoretical Variance

We follow the same steps as above, but now we compare the variance instead of the mean, obtaining it through the standard deviation. The theoretical standard deviation of the mean of n values is (1/lambda)/sqrt(n):

TheorySD<-((1/lambda) * (1/sqrt(n)))
ActualSD<-sd(RowMeans)
TheoryVar<-TheorySD^2
ActualVar<-var(RowMeans)

So we can observe that the theoretical variance is 0.625, whereas the sample variance is around 0.607, slightly below.
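The theoretical figure can be checked by hand. The following worked calculation uses the same lambda = 0.2 and n = 40 as the code; the notation \bar{X}_{40} for the mean of 40 draws is introduced here for illustration.

```latex
% Requires amsmath.
\[
  \operatorname{Var}\!\left(\bar{X}_{40}\right)
  = \frac{\sigma^2}{n}
  = \frac{(1/\lambda)^2}{n}
  = \frac{(1/0.2)^2}{40}
  = \frac{25}{40}
  = 0.625
\]
\[
  \operatorname{SD}\!\left(\bar{X}_{40}\right)
  = \frac{1/\lambda}{\sqrt{n}}
  = \frac{5}{\sqrt{40}}
  \approx 0.791
\]
```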

Show that the distribution is approximately normal

The clearest way to show this is with a graph:

dfRowMeans<-data.frame(RowMeans) # convert to data.frame for ggplot
mp<-ggplot(dfRowMeans,aes(x=RowMeans))
mp<-mp+geom_histogram(binwidth = lambda,fill="green",color="black",aes(y = ..density..))
mp<-mp + labs(title="Density of Means of 40 Draws from the Exponential Distribution", x="Mean of 40 Draws", y="Density")
mp<-mp + geom_vline(xintercept=ActualMean,size=1.0, color="black") # add a line for the actual mean
mp<-mp + stat_function(fun=dnorm,args=list(mean=ActualMean, sd=ActualSD),color = "blue", size = 1.0)
mp<-mp + geom_vline(xintercept=TheoryMean,size=1.0,color="yellow",linetype = "longdash")
mp<-mp + stat_function(fun=dnorm,args=list(mean=TheoryMean, sd=TheorySD),color = "red", size = 1.0)
mp

In the plot, the dashed yellow line marks the theoretical mean and the black line marks the sample mean. The red curve is the normal density with the theoretical mean and standard deviation, the blue curve is the normal density with the sample mean and standard deviation, and the green bars show the density of the simulated means.

The graph shows the Central Limit Theorem at work: the histogram of the sample means closely follows a normal curve. In addition, the red and blue curves are virtually the same, which confirms that the sample mean and standard deviation are close to their theoretical values.
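A normal Q-Q plot is a common complementary check of normality. The sketch below is not part of the original analysis; it re-runs a simulation with the same settings so that it is self-contained, and the names sim_means and z are hypothetical.

```r
# Complementary normality check (an illustration, not the report's own method)
set.seed(1)
lambda <- 0.2
n      <- 40
numsim <- 2000

# One mean per simulated sample of n exponential draws
sim_means <- rowMeans(matrix(rexp(n * numsim, rate = lambda), nrow = numsim))

# Standardize with the theoretical mean and standard deviation of the sample mean
z <- (sim_means - 1 / lambda) / ((1 / lambda) / sqrt(n))

# If the means are approximately normal, the points lie close to the line y = x
qqnorm(z, main = "Normal Q-Q Plot of Standardized Sample Means")
abline(0, 1, col = "red")
```

Points that bend away from the line in the upper tail would indicate some remaining right skew, which is plausible because the mean of 40 exponential draws is only approximately normal.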

Conclusion

Both the mean and the standard deviation of the simulated sample means are very close to their theoretical values. Moreover, as the Central Limit Theorem predicts, the sample means are approximately normally distributed, as shown in the graph.