Statistical Inference Course Project: PART 1

Overview:

The goal of this project is to answer a set of questions, demonstrate the use of simulation to explore inference, and do some simple inferential data analysis. This report presents the answers. I chose RPubs to publish the project to demonstrate my knowledge of markdown, knitr and PDF conversion. Ultimately, the goal is to investigate the exponential distribution in R and compare the distribution of its sample means with what the Central Limit Theorem predicts.

1. Show the sample mean and compare it to the theoretical mean of the distribution.

2. Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.

3. Show that the distribution is approximately normal.

Loading necessary packages

library(knitr)
library(ggplot2)
library(dplyr)

Simulations:

Part ONE: Sample Mean versus Theoretical Mean:

1. We set the number of simulations to 1,000, with sample size n = 40 and lambda = 0.2.

numberOfSim <- 1000
sampleSize <- 40
lambda <- 0.2

2. To make the results reproducible, the seed is set before the random values are generated.

set.seed(2)
simulatedData <- matrix(rexp(numberOfSim*sampleSize, rate=lambda), numberOfSim, sampleSize)
rowMeansSimulated <- rowMeans(simulatedData)
generatedMeansData <- data.frame(rowMeansSimulated)

3. Derive theoretical values:

theoreticalMean <- 1/lambda
theoreticalSD <- 1/(lambda*sqrt(sampleSize))
theoreticalVar <- theoreticalSD^2
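
For reference, the three values above follow from the standard moments of the exponential distribution. The derivation below is a short sketch; the symbols X for a single draw and \bar{X}_n for the mean of n = 40 independent draws are notation introduced here, not taken from the code.

```latex
\begin{aligned}
E[X] &= \frac{1}{\lambda} = \frac{1}{0.2} = 5,
&\qquad \operatorname{Var}(X) &= \frac{1}{\lambda^{2}} = 25, \\
E[\bar{X}_n] &= \frac{1}{\lambda} = 5,
&\qquad \operatorname{Var}(\bar{X}_n) &= \frac{1}{\lambda^{2} n} = \frac{25}{40} = 0.625, \\
\operatorname{SD}(\bar{X}_n) &= \frac{1}{\lambda \sqrt{n}} = \frac{5}{\sqrt{40}} \approx 0.79 .
\end{aligned}
```

So theoreticalMean is 5, theoreticalSD is about 0.79, and theoreticalVar is 0.625.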

4. Plotting the distribution of the simulated sample means (lambda = 0.2)

# histogram of the simulated sample means with a smoothed density overlay
firstplot <- ggplot(generatedMeansData, aes(x = rowMeansSimulated)) + labs(title = "Sample Mean versus Theoretical Mean") + geom_histogram(binwidth = lambda, fill = "white", color = "black", aes(y = ..density..)) + geom_density(alpha = .2, fill = "#FF6666")
# dashed red line: mean of the simulated sample means
firstplot <- firstplot + geom_vline(xintercept = mean(rowMeansSimulated, na.rm = TRUE), color = "red", linetype = "dashed", size = 1) + xlab("Sample Mean") + ylab("Density")
# solid blue line: theoretical mean
firstplot <- firstplot + geom_vline(xintercept = theoreticalMean, color = "blue", linetype = "solid", size = 1)
firstplot

Looking at the plot above, the dashed line marks the mean of the simulated sample means. It lies close to 5, the theoretical mean 1/lambda, which is marked by the solid line. Let us compute the actual values:

The sample mean is computed by mean(generatedMeansData$rowMeansSimulated). The theoretical mean, as computed above, is 1/lambda = 5.
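
The comparison can be made explicit with a few lines of R. This is a minimal sketch that regenerates the data with the same seed and parameters as above, so that it runs on its own; the name sampleMean is introduced here only for illustration.

```r
# Re-create the simulated sample means with the same seed and parameters
numberOfSim <- 1000
sampleSize <- 40
lambda <- 0.2
set.seed(2)
simulatedData <- matrix(rexp(numberOfSim * sampleSize, rate = lambda),
                        numberOfSim, sampleSize)
rowMeansSimulated <- rowMeans(simulatedData)

# sampleMean is an illustrative name, not part of the original report
sampleMean <- mean(rowMeansSimulated)
theoreticalMean <- 1 / lambda

# Print both values and their absolute difference
print(c(sample = sampleMean,
        theoretical = theoreticalMean,
        difference = abs(sampleMean - theoreticalMean)))
```

The difference should be small relative to the spread of the sample means.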

Observation:

We can see that the distribution of the sample means is centered very close to the theoretical mean and is roughly bell-shaped. This observation agrees with the predictions of the Central Limit Theorem.
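
One way to check the approximate normality more directly is a normal Q-Q plot of the standardized sample means. The sketch below uses base R's qqnorm and regenerates the data so that it runs on its own. Standardizing with the theoretical mean and standard deviation is a choice made for this illustration, and standardizedMeans is a name introduced here.

```r
# Re-create the simulated sample means with the same seed and parameters
numberOfSim <- 1000
sampleSize <- 40
lambda <- 0.2
set.seed(2)
rowMeansSimulated <- rowMeans(matrix(rexp(numberOfSim * sampleSize, rate = lambda),
                                     numberOfSim, sampleSize))

# Standardize with the theoretical mean and standard deviation of the sample mean
theoreticalMean <- 1 / lambda
theoreticalSD <- 1 / (lambda * sqrt(sampleSize))
standardizedMeans <- (rowMeansSimulated - theoreticalMean) / theoreticalSD

# Points near the reference line y = x indicate approximate normality
qqnorm(standardizedMeans, main = "Normal Q-Q Plot of Standardized Sample Means")
abline(0, 1, col = "blue")
```

Because the exponential distribution is right-skewed, some departure from the line in the tails is to be expected at n = 40.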

Part TWO: Sample Variance versus Theoretical Variance:

1. First, generate the data that will be plotted

set.seed(2)
simulatedDataForVariance <- replicate(numberOfSim, sd(rexp(sampleSize, rate = lambda) / sqrt(sampleSize))^2)
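
The expression inside replicate() squares the standard deviation of a sample that has been scaled by 1/sqrt(n). That is the same as the sample variance divided by n, which estimates the variance of the sample mean. A minimal check on a single sample, with x, viaScaledSd and viaVariance as illustrative names introduced here:

```r
sampleSize <- 40
lambda <- 0.2
set.seed(2)

# One sample of 40 exponential draws (x is an illustrative name)
x <- rexp(sampleSize, rate = lambda)

# Form used in the report: squared sd of the scaled sample
viaScaledSd <- sd(x / sqrt(sampleSize))^2

# Equivalent form: sample variance divided by the sample size
viaVariance <- var(x) / sampleSize

# The two forms should agree up to floating-point rounding
print(isTRUE(all.equal(viaScaledSd, viaVariance)))
```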

2. Plotting the sample variances as a histogram, overlaid with a normal density that has the same mean and standard deviation

#sample variance vs normal distribution 
dfVariance <- data.frame(simulatedDataForVariance)
secondplot <- ggplot(dfVariance, aes(x=simulatedDataForVariance)) +  geom_histogram(binwidth = lambda,fill="blue3",color="white", alpha=.3, aes(y=..density..))+ ylab("Density") + xlab("Sample Variance")+ ggtitle("Sample Variance VS Theoretical Variance (w/1,000 simulations)")
secondplot <- secondplot + stat_function(fun = dnorm, color = "black", size = 2, args = list(mean = mean(simulatedDataForVariance), sd = sd(simulatedDataForVariance)))
print(secondplot)

Observation:

We can see that the distribution of the sample variances roughly follows the shape and central weight of the normal curve drawn by dnorm, which uses the mean and standard deviation of the simulated values. Also, the larger the sample size, the more narrowly the values concentrate around the center. Again, this is in accordance with the predictions of the Central Limit Theorem and Borel's Law of Large Numbers.
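
The plot compares shapes. A numeric comparison against the theoretical variance of the sample mean, 1/(lambda^2 * n), can be sketched as follows. The block regenerates the data with the same seed so that it runs on its own, and averageSimulatedVar is a name introduced here for illustration.

```r
# Re-create the simulated variances with the same seed and parameters
numberOfSim <- 1000
sampleSize <- 40
lambda <- 0.2
set.seed(2)
simulatedDataForVariance <- replicate(
  numberOfSim,
  sd(rexp(sampleSize, rate = lambda) / sqrt(sampleSize))^2
)

# Theoretical variance of the mean of 40 exponential draws
theoreticalVar <- 1 / (lambda^2 * sampleSize)

# Average of the simulated variances (illustrative name)
averageSimulatedVar <- mean(simulatedDataForVariance)

print(c(simulated = averageSimulatedVar, theoretical = theoreticalVar))
```

The average of the simulated values should fall close to the theoretical value of 0.625.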