Statistical inference - Course Project: Part 1

Overview

This exercise aims to illustrate the Central Limit Theorem (CLT) by comparing simulation results with theoretical expectations. It is based on 1,000 sample means, each computed from 40 random draws from an exponential distribution with rate lambda = 0.2. The mean and variance of the simulated means are compared with their theoretical values, and a plot shows that the distribution of the simulated means is approximately normal, as the CLT predicts.

Requirements and Settings

Requirements to reproduce this exercise

# Loading libraries
library(ggplot2)

# Force results to be in English (the "English" locale name is Windows-specific)
Sys.setlocale("LC_ALL","English")
## [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
# Set seed
set.seed(2015)

Simulations

Exercise parameters:

simulations <- 1000
sample_size <- 40
lambda <- 0.2

Preparing the data for analysis:

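# Draw 1000 samples of 40 exponential values; each column of the matrix is one sample
raw_sample_exponential <- replicate(simulations, rexp(sample_size, lambda))
# Mean of each column, i.e. one mean per simulated sample
sample_exponential <- apply(raw_sample_exponential, 2, mean)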

The resulting dataset is a vector of 1000 sample means.
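As a quick sanity check on the simulation step, the sketch below rebuilds the data and confirms its shape. It uses the same seed and parameters as above; the names `raw`, `means_apply` and `means_col` are introduced here only for illustration, and `colMeans` is shown as an equivalent alternative to `apply(..., 2, mean)`, not as the method used in this report.

```r
# Same seed and parameters as in the report
set.seed(2015)
simulations <- 1000
sample_size <- 40
lambda <- 0.2

# replicate() returns a matrix: one row per draw, one column per simulation
raw <- replicate(simulations, rexp(sample_size, lambda))
print(dim(raw))

# Two equivalent ways to compute one mean per column
means_apply <- apply(raw, 2, mean)
means_col <- colMeans(raw)

# Check that both approaches agree up to floating-point tolerance
print(all.equal(means_apply, means_col))
```

For a matrix of this size either approach is fast; `colMeans` is simply the more compact form.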

Sample Mean versus Theoretical Mean

Theoretical Mean: The theoretical mean of an exponential distribution is the inverse of its rate, lambda

theoretical_mean <- 1 / lambda

So in this exercise the theoretical mean is 5.

Sample Mean: The sample mean is shown below

sample_mean <- mean(sample_exponential)

The sample mean is 5.0115634.

Conclusions

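Based on the results above, the sample mean (5.0116) is very close to the theoretical mean (5), as expected with 1,000 simulations of 40 observations each. The distribution of the sample means is centered on the population mean, which is consistent with the Central Limit Theorem.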

Sample Variance versus Theoretical Variance

The variance is calculated in two steps: first the standard deviation, then its square. Thus, this section is divided into 2 parts.

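Theoretical Standard Deviation: The standard deviation of the mean of 40 observations is the standard deviation of the exponential distribution (1/lambda) divided by the square root of the sample size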

theoretical_sd <- (1/lambda)/(sqrt(sample_size))

So in this exercise the theoretical standard deviation is 0.7905694.
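The formula used above can be written out as a short derivation. The symbols here are notation added for illustration and do not appear in the code: \sigma = 1/\lambda is the standard deviation of the exponential distribution, n = 40 is the sample size, and \bar{X} is the mean of one sample.

```latex
\begin{align*}
\operatorname{Var}(\bar{X}) &= \frac{\sigma^2}{n} = \frac{1}{\lambda^2 n} = \frac{1}{0.2^2 \cdot 40} = 0.625, \\
\operatorname{sd}(\bar{X}) &= \frac{\sigma}{\sqrt{n}} = \frac{1}{\lambda \sqrt{n}} = \frac{5}{\sqrt{40}} \approx 0.7906.
\end{align*}
```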

Sample Standard Deviation: The sample standard deviation is shown below

sample_sd <- sd(sample_exponential)

The sample standard deviation is 0.7913907.

Using the standard deviations calculated above, it is possible to calculate the variances.

Theoretical Variance: The variance is the square of the standard deviation

theoretical_variance <- theoretical_sd^2

So in this exercise the theoretical variance is 0.625.

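Sample Variance: The sample variance is the square of the sample standard deviation

sample_variance <- var(sample_exponential)

The sample variance is approximately 0.6263 (the square of the sample standard deviation, 0.7913907).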

Conclusions

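Based on the results above, the sample standard deviation (0.7914) and the theoretical standard deviation (0.7906), and therefore the corresponding variances, are very close, as expected with 1,000 simulations. The spread of the sample means matches the value predicted by the Central Limit Theorem.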
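The comparisons in the last two sections can be gathered into a single table. The following is a minimal sketch, assuming the same seed and parameters as above; the names `means` and `comparison` are introduced here for illustration and are not part of the original analysis.

```r
# Same seed and parameters as in the report
set.seed(2015)
simulations <- 1000
sample_size <- 40
lambda <- 0.2

# One mean per simulated sample of 40 exponential values
means <- colMeans(replicate(simulations, rexp(sample_size, lambda)))

# Theoretical and simulated values side by side
comparison <- data.frame(
  statistic   = c("mean", "sd", "variance"),
  theoretical = c(1 / lambda,
                  (1 / lambda) / sqrt(sample_size),
                  (1 / lambda)^2 / sample_size),
  simulated   = c(mean(means), sd(means), var(means))
)
print(comparison)
```

Computing the variance with `var` directly, rather than squaring a rounded standard deviation, avoids carrying rounding error into the comparison.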

Show that the distribution is approximately normal

This question can be answered graphically by comparing the distribution of the simulated means with the theoretical normal density.

# Histogram of averages
hist(sample_exponential, breaks=20, prob=TRUE,
     main="Simulated vs. theoretical distribution of sample means (lambda=0.2)",
     xlab="")

# Overlay the kernel density estimate of the simulated means
lines(density(sample_exponential),lwd=2,col="red")

# Theoretical center of the distribution, in other words the theoretical mean
abline(v=1/lambda, col="red",lwd=2)

# Theoretical normal density of the sample means
xfit <- seq(min(sample_exponential), max(sample_exponential), length=100)
yfit <- dnorm(xfit, mean=1/lambda, sd=(1/lambda/sqrt(sample_size)))

# Draw the theoretical density curve
lines(xfit, yfit, col="blue", lty=4, lwd=2)

# Add a legend; the line types match the two curves drawn above
legend('topright', c("simulation", "theoretical"), lty=c(1,4), col=c("red", "blue"),lwd=2)

Conclusions

The distribution of the simulated means is approximately normal. The plot shows the histogram and density of the simulated means together with the theoretical normal density, and the two curves are very close. This is what the Central Limit Theorem predicts: the means of sufficiently large samples are approximately normally distributed.
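A normal Q-Q plot is another common way to check approximate normality. The sketch below is an optional complement to the histogram, not part of the original analysis; it assumes the same seed and parameters as above, and `qq` and `qq_correlation` are names introduced here for illustration.

```r
# Same seed and parameters as in the report
set.seed(2015)
simulations <- 1000
sample_size <- 40
lambda <- 0.2
sample_exponential <- colMeans(replicate(simulations, rexp(sample_size, lambda)))

# Normal Q-Q plot: points close to the reference line suggest approximate normality
qq <- qqnorm(sample_exponential, main = "Normal Q-Q plot of simulated means")
qqline(sample_exponential, col = "blue", lwd = 2)

# Correlation between theoretical and sample quantiles (values near 1 indicate a good fit)
qq_correlation <- cor(qq$x, qq$y)
print(qq_correlation)
```

Because the exponential distribution is right-skewed, a small departure from the line in the tails would not be surprising with samples of only 40 observations.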