Statistical Inference & Modeling: Central Limit Theorem

Overview

This simulation investigates the exponential distribution in R and compares the distribution of its sample means with what the Central Limit Theorem predicts. The exponential distribution with rate lambda has mean 1/lambda and standard deviation 1/lambda; here lambda equals 0.2. We examine the distribution of averages of 40 exponentials across a thousand simulations.
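For reference, the comparison rests on the standard statement of the Central Limit Theorem applied to exponential draws, written out below with the numbers used in this report. The symbols X_i and \bar{X}_n are notation introduced here for illustration; they do not appear in the original analysis.

```latex
\begin{align*}
X_1, \dots, X_n &\overset{\text{iid}}{\sim} \operatorname{Exp}(\lambda),
  \qquad \mathbb{E}[X_i] = \frac{1}{\lambda},
  \qquad \operatorname{sd}(X_i) = \frac{1}{\lambda} \\[4pt]
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i
  &\;\overset{\text{approx.}}{\sim}\;
  \mathcal{N}\!\left(\frac{1}{\lambda},\ \frac{1}{\lambda^{2} n}\right) \\[4pt]
\lambda = 0.2,\ n = 40:\quad
  \mathbb{E}[\bar{X}_n] &= 5, \qquad
  \operatorname{sd}(\bar{X}_n) = \frac{5}{\sqrt{40}} \approx 0.7906, \qquad
  \operatorname{Var}(\bar{X}_n) = \frac{25}{40} = 0.625
\end{align*}
```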

We will then analyze the ToothGrowth dataset, performing basic exploratory analysis and using confidence intervals to compare tooth growth by supp and dose.

Simulated Data

#Set seed for reproducibility
set.seed(82)

#Define lambda, number of exponentials, and 1000 random simulations
lambda <- .2
n <- 40
sims <- 1000

#Plot the draws to see the distribution of 1000 random exponentials
plot(rexp(sims, lambda), main = "Distribution of 1000 random exponentials with rate lambda")

#Now compare to the distribution of 1000 averages of 40 random exponentials
sim_avs <- replicate(sims, mean(rexp(n, lambda)))
hist(sim_avs, col="light blue", main="Distribution of means of 40 exponentials", breaks=60)

We can see right away in the plots above that the 1000 random exponentials are strongly right-skewed, with most values near zero and a long tail of larger values, whereas the 1000 averages of 40 random exponentials look much more Gaussian.
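The contrast between the two plots can also be put in numbers. The following is a minimal, self-contained sketch that compares the sample skewness of raw exponential draws with that of averages of 40 draws. The helper `sample_skewness` and the variables `raw_draws`, `draw_means`, `skew_raw`, and `skew_means` are illustrative names introduced here, not part of the original analysis.

```r
# Self-contained sketch; the names below are illustrative, not from the report.
set.seed(82)
lambda <- 0.2
n <- 40
sims <- 1000

# Sample skewness: third central moment divided by the cubed standard deviation
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# 1000 raw exponential draws versus 1000 averages of 40 draws
raw_draws <- rexp(sims, lambda)
draw_means <- replicate(sims, mean(rexp(n, lambda)))

skew_raw <- sample_skewness(raw_draws)
skew_means <- sample_skewness(draw_means)

print(c(raw = skew_raw, means = skew_means))
```

In theory the exponential distribution has skewness 2, and the average of n independent draws has skewness 2/sqrt(n), about 0.32 for n = 40. The simulated values should therefore show a large skewness for the raw draws and a much smaller one for the averages.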

Means Comparison

Comparing the sample mean with the theoretical mean shows how closely the simulated averages are centered on the value theory predicts, 1/lambda.

#Actual mean
real_mean <- mean(sim_avs)
print(real_mean)

## [1] 4.977319

#Theoretical mean
t_mean <- 1/lambda
print(t_mean)

## [1] 5

hist(sim_avs, col="light blue", main="Theoretical vs Actual Mean", breaks=60)
abline(v = real_mean, col="green")
abline(v = t_mean, col="red")

The actual mean is 4.977319 (green line) and the theoretical mean is 5 (red line), so the distribution of simulated averages is centered very close to the value we expect from theory.
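One way to judge whether the gap between the two means is small is to compare it with the Monte Carlo standard error of the simulated mean. The sketch below does this using the sample mean printed in the output above (4.977319). The names `sd_of_one_mean`, `mc_se`, `printed_mean`, and `gap_in_se` are introduced here for illustration only.

```r
lambda <- 0.2
n <- 40
sims <- 1000

# Theoretical standard deviation of one average of n exponentials
sd_of_one_mean <- (1 / lambda) / sqrt(n)

# Monte Carlo standard error of the mean of `sims` such averages
mc_se <- sd_of_one_mean / sqrt(sims)

# Gap between the printed sample mean and the theoretical mean,
# expressed in units of the Monte Carlo standard error
printed_mean <- 4.977319  # value copied from the printed output above
gap_in_se <- abs(printed_mean - 1 / lambda) / mc_se

print(mc_se)
print(gap_in_se)
```

Under these assumptions the standard error is 0.025, so the printed mean sits less than one standard error from the theoretical value of 5, which is the kind of deviation ordinary simulation noise would produce.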

Comparison of Standard Deviations and Variances

# Standard deviation of data
real_sd <- sd(sim_avs)
print(real_sd)

## [1] 0.7997376

# Theoretical standard deviation
t_sd <- (1/lambda)/sqrt(n)
print(t_sd)

## [1] 0.7905694

# Variance of data
real_var <- real_sd^2
print(real_var)

## [1] 0.6395802

# Theoretical Variance
t_var <- ((1/lambda)*(1/sqrt(n)))^2
print(t_var)

## [1] 0.625

The simulated and theoretical standard deviations (0.7997 versus 0.7906) and variances (0.6396 versus 0.625) are very close. With more simulations we would expect the simulated values to move closer to the theoretical ones.
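The effect of the number of simulations can be explored with a short, self-contained sketch. The grid `sim_counts` and the variable `sd_by_count` are hypothetical choices made here for illustration; the original analysis uses only 1000 simulations.

```r
set.seed(82)
lambda <- 0.2
n <- 40
t_sd <- (1 / lambda) / sqrt(n)

# Hypothetical grid of simulation counts to compare
sim_counts <- c(100, 1000, 10000, 100000)

# For each count, simulate that many averages of n exponentials
# (one average per matrix row) and take their standard deviation
sd_by_count <- sapply(sim_counts, function(s) {
  draws <- matrix(rexp(s * n, lambda), nrow = s)
  sd(rowMeans(draws))
})

print(data.frame(sims = sim_counts,
                 simulated_sd = sd_by_count,
                 abs_error = abs(sd_by_count - t_sd)))
```

For any single seed the error need not shrink at every step, but it should tend to fall as the number of simulations grows.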

Compare Plot to Normal Distribution

Now we will overlay a normal density curve on the histogram of our data to see how closely they match.

# Plot our data again with an overlaid normal density curve
exes <- seq(min(sim_avs), max(sim_avs), length.out = 100)
whys <- dnorm(exes, mean = 1/lambda, sd = 1/lambda/sqrt(n))
h <- hist(sim_avs, col="light blue", main="Comparison to Normal Distribution", breaks=60)
# Scale the density to the count axis: number of simulations times bin width
lines(exes, whys * sims * diff(h$breaks)[1])

Comparing our distribution with the expected one, we see that the Central Limit Theorem describes the simulated averages well: their mean, standard deviation, and variance differ only slightly from the theoretical values. Although individual exponential draws are strongly skewed, the averages of samples of only 40 draws are already approximately normal, as the Central Limit Theorem predicts.
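A normal Q-Q comparison is another way to check how close the averages are to a normal distribution. The following self-contained sketch is not part of the original analysis; `draw_means`, `qq`, and `qq_cor` are illustrative names. It computes the Q-Q pairs without drawing a plot and summarizes their straightness with a correlation.

```r
set.seed(82)
lambda <- 0.2
n <- 40
sims <- 1000

# 1000 averages of 40 exponentials
draw_means <- replicate(sims, mean(rexp(n, lambda)))

# Pair sample quantiles with standard normal quantiles without drawing a plot
qq <- qqnorm(draw_means, plot.it = FALSE)

# A correlation near 1 means the Q-Q points lie close to a straight line
qq_cor <- cor(qq$x, qq$y)
print(qq_cor)
```

Calling qqnorm(draw_means) followed by qqline(draw_means) draws the corresponding plot. Mild curvature in the upper tail would reflect the small right skew that remains at n = 40.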