This project consists of 2 parts:
This will be Part 1 of the project.
Experiment: The task is to investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution will be simulated with rexp(n,lambda), where lambda is the rate parameter. The mean of the exponential distribution and standard deviation is 1/lambda. Lambda is set to 0.2 for all simulations.
library(ggplot2)
library(knitr)
The exponential distribution will be simulated with rexp(n,lambda), where lambda is the rate parameter. The mean of the exponential distribution and standard deviation is 1/lambda. Lambda is set to 0.2 for all simulations.
# Sample Exponential Distribution
set.seed(2025) # For Reproducibility
nosim <- 1000 # Number of Simulations
n <- 40 # Number of Exponentials
lambda <- 0.2 # Rate Parameter
simdata <- matrix(rexp(nosim * n, rate=lambda), nosim)
sim_mean <- rowMeans(simdata) # Row Means
# Calculate Mean, SC, and Variance of Sample Exponential Distribution
simdata_mean <- mean(sim_mean)
simdata_sd <- sd(sim_mean)
simdata_var <- var(sim_mean)
Given the number of exponentials and the rate parameter, the exponential distribution can be simulated by multiplying the exponential by the number of simulations, resulting in 1000 simulations of 40 exponentials.
Putting it in a matrix, we can use the apply function to find the mean for each row. With this, we can calculate the sample mean, standard deviation, and variance.
# Calculate the mean, df, and variance of the theoretical exponential distribution
t_mean = 1/lambda
t_sd = (1/lambda) * (1/sqrt(n))
t_var = t_sd^2
par(mfrow = c(1, 2))
par(mfrow = c(1, 2))
hist(simdata,
col = "coral",
main = "Simulated exponential distribution",
xlab = "40 random exponentials")
hist(sim_mean,
col = "cornsilk",
main = "Averages of Simulated Exponentials",
xlab = "Average of 40 exponentials")
abline(v = t_mean, col = "royalblue", lwd = 4) # theoretical mean
The left plot represents the simulated exponential distribution. Most of the data is shifted left due to the properties of the distribution.
The histogram on the right represents the averages for the simulated exponentials. The distribution appears more normal.
The blue line is the theoretical mean of the distribution and, as you can see, the distribution is closely centered over the mean.
# Compute stats
sample_stats <- c(simdata_mean, simdata_sd, simdata_var)
theoretical_stats <- c(t_mean, t_sd, t_var)
diff <- c(
abs(t_mean - simdata_mean),
abs(t_sd - simdata_sd),
t_var - simdata_var
)
# Create the data frame
stat_table <- data.frame(
Sample = sample_stats,
Theoretical = theoretical_stats,
Difference = diff,
row.names = c("Mean", "Std", "Variance")
)
knitr::kable(stat_table, caption = "Comparison of Sample vs Theoretical Statistics")
Sample | Theoretical | Difference | |
---|---|---|---|
Mean | 5.0142946 | 5.0000000 | 0.0142946 |
Std | 0.7781771 | 0.7905694 | 0.0123923 |
Variance | 0.6055597 | 0.6250000 | 0.0194403 |
Observing the table:
simdata_mean <- data.frame(sim_mean)
ggplot(simdata_mean, aes(sim_mean)) +
geom_histogram(
binwidth = .3,
fill = "cornsilk",
color = "black",
aes(y = ..density..)
) +
geom_density(color = "blue", lwd = 1) +
labs(title = "Distribution of Random Exponential Values with 1000 simulations",
x = "Average of 40 Exponentials", y = "Density") +
stat_function(
fun = dnorm,
args = list(mean = t_mean, sd = t_sd),
color = "red",
lwd = 1
) +
theme_bw()
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The red line is the theoretical distribution, while the blue is the density of the sample distribution. In the image, it is accurate to say the sample distribution is near normal.
We can conclude that the sample distribution approximates the theoretical normal distribution fairly closely, as shown by the Q-Q plot. Unexpectedly, the tails are less normal.
qqnorm(sim_mean, col="black") # Sample Distribution
qqline(sim_mean, col="red", lwd=3) #Theoretical Distribution
Based on the comparisons and the plots, the simulated distribution does have similar means and variance as n grows larger. This affirms the validity of the Central Limit Theorem.
An important condition for the CLT is that the random variables are Independent and Identically Distributed (IID). As shown in the analysis, we can verify that these conditions are satisfied.