An overview of the project

This work completes the Statistical Inference course in the Coursera Data Science class. It consists of two parts. Part 1: run a simulation that generates random data and analyze it in light of the Central Limit Theorem. Part 2: using one of the datasets in the R datasets library, perform an analysis, draw some inferences, and state a conclusion about the data.

Part 1: Simulation exercise

Using rexp, we draw 40 random values from an exponential distribution with lambda = 0.2, take the mean of these 40 draws, and repeat the process to build a data vector of 1500 such means. We then analyze the distribution of these means to find out what kind of distribution they follow.

#Load libraries to help 
library(ggplot2)
#Set parameters
ECHO=TRUE
set.seed(2222)
lambda=0.2
exponentials=40
#Create the values
simulationMeans = NULL
for (i in 1:1500)simulationMeans = c(simulationMeans,mean(rexp(exponentials, lambda)))
summary(simulationMeans)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.847   4.450   4.928   4.967   5.441   7.248
#Obtain the Mean of the Means
mean(simulationMeans)
## [1] 4.967263
#calculate the theoretical Mean
theoreticalmean<-lambda^-1
theoreticalmean
## [1] 5
#Plot the simulated means in a histogram
hist(simulationMeans, col="#B1EFFF", main="Sample Mean versus Theoretical Mean", breaks=20)
#and draw a vertical line for each of the two means
abline(v=mean(simulationMeans), lwd=2, col="#149403")
abline(v=theoreticalmean, lwd=2, col="#d90b23")
text(6.5, 150, paste("Actual mean (green)= ", round(mean(simulationMeans),3), "\nTheoretical mean (red)= ", round(theoreticalmean,3)), col="#888888")

abs(mean(simulationMeans)-theoreticalmean)
## [1] 0.03273665

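The difference between the theoretical mean and the simulated mean is small: about 0.033, or roughly 0.65% of the theoretical value of 5. This is what we expect, because the average of the simulated means gets closer to the theoretical value as the number of simulations increases (the law of large numbers). The Central Limit Theorem itself concerns the shape of the distribution of the means, which is examined in the last part of this exercise.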
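
The convergence described above can be illustrated with a running average. The following is a minimal sketch, not part of the original analysis: it regenerates the means with replicate instead of the loop, and the names n, means and running_mean are introduced here only for illustration.

```r
# Sketch: how the average of the simulated means approaches 1/lambda
set.seed(2222)
lambda <- 0.2   # rate of the exponential distribution (as in the text)
n <- 40         # number of draws per mean (as in the text)

# 1500 means of n exponential draws each
means <- replicate(1500, mean(rexp(n, lambda)))

# Average of the first k means, for k = 1, ..., 1500
running_mean <- cumsum(means) / seq_along(means)

# Absolute distance from the theoretical mean after 10, 100 and 1500 simulations
abs(running_mean[c(10, 100, 1500)] - 1 / lambda)

# Visual check: the running average should settle near the red line
plot(running_mean, type = "l", xlab = "Number of simulations",
     ylab = "Running average of the means")
abline(h = 1 / lambda, col = "#d90b23", lwd = 2)
```

The distance does not have to shrink at every step, but it should tend to shrink as more simulations are included.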

Compare Variances

#Sample variance
simulationvar<-var(simulationMeans)
simulationvar
## [1] 0.5899575
#Theoretical Variance
Theoreticalvar<-(lambda * sqrt(exponentials))^-2
Theoreticalvar
## [1] 0.625
# Comparison 
simulationvar-Theoreticalvar
## [1] -0.0350425

Comparing the two variances, we see that they are close: 0.590 for the simulation against 0.625 in theory, a difference of about 0.035 (roughly 5.6% of the theoretical value).
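
For reference, the theoretical value used above follows from the variance of a mean of independent draws. This worked equation assumes the standard result that an exponential distribution with rate lambda has variance 1/lambda^2; the symbols \bar{X}, \sigma and n (the 40 draws per mean) are notation introduced here.

```latex
\[
\operatorname{Var}(\bar{X}) = \frac{\sigma^2}{n} = \frac{1}{n\lambda^2}
= \frac{1}{40 \times 0.2^2} = \frac{1}{1.6} = 0.625,
\qquad
\operatorname{sd}(\bar{X}) = \sqrt{0.625} \approx 0.791
\]
```

This is the same quantity as (lambda * sqrt(exponentials))^-2 in the code. Note that 0.625 is a variance, so the corresponding standard deviation is about 0.791.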

Can we say that the distribution is normal?

#Draw the histogram of the simulated means on a density scale
hist(simulationMeans, prob=TRUE, col="#FFF4A1", main=" Distribution of the Means", breaks=20)
text(6.4, 0.55, paste("red: density function for the simulation\nblue:  the theoretical normal distribution"), col="#888888")
#Create random values from a normal distribution with the theoretical mean and standard deviation
#(0.625 is the theoretical variance, so the standard deviation is its square root)
x<-rnorm(10000, mean=5, sd=sqrt(0.625))
#Compare with the kernel density estimate of simulationMeans (by default a gaussian kernel evaluated on a grid of 512 points using a fast Fourier transform)
lines(density(simulationMeans), lwd=3, col="#E6652E")
lines(density(x), lwd=3, col="#4444AB")
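
A normal Q-Q plot is another common way to judge how close the means are to a normal distribution. The sketch below is an addition to the original analysis, under the same parameters: it regenerates the means, standardizes them with the theoretical mean and standard deviation, and the names n, means, z and coverage are hypothetical.

```r
# Sketch: Q-Q plot of the standardized means against the normal distribution
set.seed(2222)
lambda <- 0.2   # rate of the exponential distribution (as in the text)
n <- 40         # number of draws per mean (as in the text)

means <- replicate(1500, mean(rexp(n, lambda)))

# Standardize with the theoretical mean 1/lambda and sd (1/lambda)/sqrt(n)
z <- (means - 1 / lambda) / ((1 / lambda) / sqrt(n))

# Points close to the reference line suggest an approximately normal shape
qqnorm(z, main = "Normal Q-Q plot of the standardized means")
qqline(z, col = "#d90b23", lwd = 2)

# Share of standardized means within +/- 1.96
# (about 0.95 would be expected for a standard normal distribution)
coverage <- mean(abs(z) <= 1.96)
coverage
```

Some departure from the line in the tails would not be surprising, since each mean is based on only 40 draws from a skewed distribution.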

Conclusion of the simulation

The 1500 means, each obtained from 40 random draws from an exponential distribution with lambda = 0.2, follow a distribution that is close to normal.

My best regards. Thanks for reading C.Werneck