Statistical Inference - Coursera Project

An overview of the project

This work completes the Statistical Inference project in the Coursera Data Science class and consists of two parts. Part 1 runs a simulation to create random data and analyses it in light of the Central Limit Theorem. Part 2 takes one of the data sets in the R datasets library, analyses it, draws some inferences and states a conclusion about the data.

Part 1: Simulation exercise

Using rexp, we draw 40 random values from an exponential distribution with lambda = 0.2 and take their mean. We repeat this 1500 times to build a data vector of 1500 sample means. Then we analyse the distribution of these means to find out what kind of distribution they follow.

#Load libraries to help 
library(ggplot2)
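#Set parameters
set.seed(2222)      #fix the seed so the results are reproducible
lambda = 0.2        #rate of the exponential distribution
exponentials = 40   #number of exponential draws per sample
#Create the values: 1500 means, each taken over 40 exponential draws
simulationMeans = NULL
for (i in 1:1500) simulationMeans = c(simulationMeans, mean(rexp(exponentials, lambda)))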
summary(simulationMeans)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.847   4.450   4.928   4.967   5.441   7.248

#Now We Obtain the mean of means

mean(simulationMeans)

## [1] 4.967263

#calculate the theoretical Mean
theoreticalmean<-lambda^-1
theoreticalmean

## [1] 5

#Let's plot it in a histogram
hist(simulationMeans, col="#FFE633", main="Sample Mean versus Theoretical Mean", breaks=20)
#and draw the two lines for the means
abline(v=mean(simulationMeans), lwd=2, col="#33FFB2")
abline(v=theoreticalmean, lwd=2, col="#335BFF")
text(6.5, 150, paste("Actual mean (green)= ", round(mean(simulationMeans),3), "\nTheoretical mean (blue)= ", round(theoreticalmean,3)), col="#888888")

abs(mean(simulationMeans)-theoreticalmean)

## [1] 0.03273665

The difference between the theoretical mean and the simulated mean is very small (about 0.03). This agrees with the Central Limit Theorem, under which the sample means are centered on the population mean 1/lambda = 5. As the number of simulated samples increases, their average gets closer to the theoretical value.
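The claim that more simulations bring the average closer to the theoretical value can be checked with a running mean. The following is a minimal sketch, not part of the original analysis; the names sim_means, running_mean, n and nsim are introduced here for illustration only.

```r
# Sketch: watch the average of the sample means settle near 1/lambda.
set.seed(2222)   # same seed as the report, for reproducibility
lambda <- 0.2    # rate of the exponential distribution
n <- 40          # draws per sample
nsim <- 1500     # number of simulated samples

# One sample mean per simulation
sim_means <- replicate(nsim, mean(rexp(n, lambda)))

# Average of the first k sample means, for k = 1, ..., nsim
running_mean <- cumsum(sim_means) / seq_along(sim_means)

plot(running_mean, type = "l",
     xlab = "Number of simulations",
     ylab = "Running mean of sample means")
abline(h = 1 / lambda, lty = 2)   # theoretical mean, 1/lambda = 5
```

The curve should fluctuate widely for the first few simulations and then flatten out close to the dashed line.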

Comparing Variances

#Sample variance
simulationvar<-var(simulationMeans)
simulationvar

## [1] 0.5899575

#Theoretical Variance
Theoreticalvar<-(lambda * sqrt(exponentials))^-2
Theoreticalvar

## [1] 0.625

# Comparison 
simulationvar-Theoreticalvar

## [1] -0.0350425

Comparing the two variances, we see that they are very close: 0.590 from the simulation against a theoretical value of 0.625, a difference of about 0.035.
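The theoretical value comes from the variance of a mean of independent draws. As a short worked derivation, writing n for the 40 draws per sample and sigma for the standard deviation of the exponential distribution, 1/lambda (both symbols are introduced here for illustration):

```latex
\[
\operatorname{Var}(\bar{X}) = \frac{\sigma^2}{n}
= \frac{1/\lambda^2}{n}
= \frac{1}{0.2^2 \times 40}
= \frac{1}{1.6}
= 0.625
\]
```

This is the same quantity as (lambda * sqrt(exponentials))^-2 in the code. The corresponding standard deviation of the sample mean is sqrt(0.625), roughly 0.79.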

Can we say that the distribution is normal?

#Let's draw the histogram for the simulation
hist(simulationMeans, prob=TRUE, col="#6BFF33", main="Distribution of the Means", breaks=20)
text(6.4, 0.55, paste("red: density function for the simulation\nblue: the theoretical normal distribution"), col="#888888")
#create random values from a normal distribution with the theoretical mean and
#standard deviation (the square root of the theoretical variance 0.625)
x<-rnorm(10000, mean=5, sd=sqrt(0.625))
#and compare with the kernel density estimate of simulationMeans; by default
#density() uses a Gaussian kernel evaluated on a grid of 512 points via FFT
lines(density(simulationMeans), lwd=3, col="#E6652E")
lines(density(x), lwd=3, col="#4444AB")
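A normal Q-Q plot is another common way to judge normality, since departures in the tails are easier to see there than in overlaid density curves. The sketch below is an optional addition rather than part of the original analysis; it regenerates the means so that it runs on its own, and the names sim_means, n and z are illustrative.

```r
# Sketch: normal Q-Q plot of the standardized sample means.
set.seed(2222)
lambda <- 0.2
n <- 40          # draws per sample
sim_means <- replicate(1500, mean(rexp(n, lambda)))

# Standardize with the theoretical mean 1/lambda and the theoretical
# standard error 1/(lambda * sqrt(n))
z <- (sim_means - 1 / lambda) / (1 / (lambda * sqrt(n)))

qqnorm(z, main = "Normal Q-Q plot of standardized sample means")
qqline(z, col = "#4444AB", lwd = 2)   # reference line through the quartiles
```

Points lying close to the reference line support the near-normal reading. Because the mean of 40 exponential draws is still slightly right-skewed, some bending away from the line in the tails would not be surprising.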

Conclusion of the simulation

The 1500 means, each computed from 40 random draws from an exponential distribution with lambda = 0.2, follow a distribution that is close to normal.

Saqib Ali

2023-07-08