Overview

This project consists of 2 parts:

  1. A simulation experiment to explore inference.
  2. Basic inferential analysis using the ToothGrowth data in R.

This will be Part 1 of the project.

Experiment: The task is to investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution will be simulated with rexp(n,lambda), where lambda is the rate parameter. The mean of the exponential distribution and standard deviation is 1/lambda. Lambda is set to 0.2 for all simulations.

Loading Required Libraries

library(ggplot2)
library(knitr)

Simulation

The exponential distribution will be simulated with rexp(n,lambda), where lambda is the rate parameter. The mean of the exponential distribution and standard deviation is 1/lambda. Lambda is set to 0.2 for all simulations.

Sample Exponential Distribution

# Sample Exponential Distribution
set.seed(2025) # For Reproducibility
nosim <- 1000 # Number of Simulations
n <- 40 # Number of Exponentials
lambda <- 0.2 # Rate Parameter 

simdata <- matrix(rexp(nosim * n, rate=lambda), nosim)
sim_mean <- rowMeans(simdata) # Row Means

# Calculate Mean, SC, and Variance of Sample Exponential Distribution
simdata_mean <- mean(sim_mean)
simdata_sd <- sd(sim_mean)
simdata_var <- var(sim_mean)

Given the number of exponentials and the rate parameter, the exponential distribution can be simulated by multiplying the exponential by the number of simulations, resulting in 1000 simulations of 40 exponentials.

Putting it in a matrix, we can use the apply function to find the mean for each row. With this, we can calculate the sample mean, standard deviation, and variance.

Theoretical Exponential Distribution

# Calculate the mean, df, and variance of the theoretical exponential distribution
t_mean = 1/lambda
t_sd = (1/lambda) * (1/sqrt(n))
t_var = t_sd^2

Plot of the Simulated Exponential Distribution vs. Averages of Simulated Exponentials

par(mfrow = c(1, 2))
par(mfrow = c(1, 2))
hist(simdata,
     col = "coral",
     main = "Simulated exponential distribution",
     xlab = "40 random exponentials")
hist(sim_mean,
     col = "cornsilk",
     main = "Averages of Simulated Exponentials",
     xlab = "Average of 40 exponentials")
abline(v = t_mean, col = "royalblue", lwd = 4) # theoretical mean

The left plot represents the simulated exponential distribution. Most of the data is shifted left due to the properties of the distribution.

The histogram on the right represents the averages for the simulated exponentials. The distribution appears more normal.

The blue line is the theoretical mean of the distribution and, as you can see, the distribution is closely centered over the mean.

Comparison Between the Sample and Theoretical Statistics

# Compute stats
sample_stats <- c(simdata_mean, simdata_sd, simdata_var)
theoretical_stats <- c(t_mean, t_sd, t_var)
diff <- c(
  abs(t_mean - simdata_mean),
  abs(t_sd - simdata_sd),
  t_var - simdata_var
)
# Create the data frame
stat_table <- data.frame(
  Sample = sample_stats,
  Theoretical = theoretical_stats,
  Difference = diff,
  row.names = c("Mean", "Std", "Variance")
)

knitr::kable(stat_table, caption = "Comparison of Sample vs Theoretical Statistics")
Comparison of Sample vs Theoretical Statistics
Sample Theoretical Difference
Mean 5.0142946 5.0000000 0.0142946
Std 0.7781771 0.7905694 0.0123923
Variance 0.6055597 0.6250000 0.0194403

Observing the table:

  1. The sample mean of the exponential distribution is 5.014 where the theoretical mean is 5.00 (1/lambda).
  2. The difference between the two is 0.014.
  3. The sample variance is 0.606, versus the theoretical of 0.625, a difference of 0.019.

Histogram and Density Plot

simdata_mean <- data.frame(sim_mean)
ggplot(simdata_mean, aes(sim_mean)) +
  geom_histogram(
    binwidth = .3,
    fill = "cornsilk",
    color = "black",
    aes(y = ..density..)
  ) +
  geom_density(color = "blue", lwd = 1) +
  labs(title = "Distribution of Random Exponential Values with 1000 simulations",
       x = "Average of 40 Exponentials", y = "Density") +
  stat_function(
    fun = dnorm,
    args = list(mean = t_mean, sd = t_sd),
    color = "red",
    lwd = 1
  ) +
  theme_bw()
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

The red line is the theoretical distribution, while the blue is the density of the sample distribution. In the image, it is accurate to say the sample distribution is near normal.

Q-Q Plot

We can conclude that the sample distribution approximates the theoretical normal distribution fairly closely, as shown by the Q-Q plot. Unexpectedly, the tails are less normal.

qqnorm(sim_mean, col="black") # Sample Distribution
qqline(sim_mean, col="red", lwd=3) #Theoretical Distribution

Conclusion

Based on the comparisons and the plots, the simulated distribution does have similar means and variance as n grows larger. This affirms the validity of the Central Limit Theorem.

An important condition for the CLT is that the random variables are Independent and Identically Distributed (IID). As shown in the analysis, we can verify that these conditions are satisfied.