How does a Simulated Exponential Distribution Compare with the Central Limit Theorem (CLT)?

Synopsis

A simulation of the exponential distribution (1,000 samples of 40 exponentials each) was compared to the predictions of the CLT using RStudio 4.0.4. The exploration demonstrated that the distribution of the sample means of the exponential distribution does follow the normal distribution. A comparison of quantiles demonstrates that the probabilities of the simulation and the expected values proposed by the CLT converge on one another as the sample size approaches the population size.

Introduction

The Central Limit Theorem states that “a properly normalized sum will tend toward a normal distribution even when the sample is not itself normally distributed”. But what is this statement trying to convey with the term “properly normalized”? A properly normalized sum differs depending on the range of the values. Normalization brings all the sampled values down to a common scale. For example, suppose our sample were the number of bacteria in a flask at any time point t, and the population ranged from 2 per liter to 100,000 per liter. Taking the log of those counts would be an appropriate way to normalize the values. There are other types of normalization, such as dividing by n, dividing by sqrt(n), and the z-score.
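As a minimal sketch of these ideas (the bacterial counts below are invented purely for illustration), the following R code shows a log transform and a z-score standardization of a small sample:

# hypothetical bacterial counts per liter (illustrative values only)
bacteria <- c(2, 150, 4300, 87000, 100000)

# log normalization compresses the wide range onto a comparable scale
log_bacteria <- log10(bacteria)

# z-score normalization centers on the mean and scales by the standard deviation
z_bacteria <- (bacteria - mean(bacteria)) / sd(bacteria)

log_bacteria
z_bacteria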

Exponential Distribution of 40 exponentials:

In the formula \(f(x;\lambda)=\lambda e^{-\lambda x}\) for \(x \ge 0\), we have three important players: \(\lambda\), which is the rate, \(e\) (Euler's number), and \(x\), which is never negative. Lambda is very interesting because it is the rate: as lambda gets very large, \(e^{-\lambda x}\) converges to 0, but as lambda approaches 0 (a very small, non-negative number), \(e^{-\lambda x}\) converges to 1. But what if lambda were a negative number, and is that even possible in probability? For the exponential distribution the rate must be positive, so a negative lambda does not define a valid density. The exponential distribution (also called the negative exponential distribution) is a special case of both the gamma and Weibull distributions, falling at the intersection of these two families on the skewness-kurtosis plot, which I will not explore. However, the reader is invited to “play” with the manipulate function of the graph loaded below.

library(manipulate)
manipulate(
  hist(rexp(n, rate = lambda), breaks = 10),  # rexp takes the sample size first, then the rate
  lambda = slider(1, 100),
  n = slider(1, 1000)
)
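For readers without the manipulate package (it requires RStudio), here is a minimal static sketch of the same idea; the rates 0.5, 1, and 2 are arbitrary choices used only to show how a larger lambda pulls the density toward zero more quickly:

# plot the exponential density for several illustrative rates
x_vals <- seq(0, 10, by = 0.1)
plot(x_vals, dexp(x_vals, rate = 0.5), type = "l", col = "blue",
     xlab = "x", ylab = "density", main = "Exponential density for several rates")
lines(x_vals, dexp(x_vals, rate = 1), col = "red")
lines(x_vals, dexp(x_vals, rate = 2), col = "green")
legend("topright", legend = c("lambda = 0.5", "lambda = 1", "lambda = 2"),
       col = c("blue", "red", "green"), lty = 1)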

Simulating the Exponential Curve

To understand how a sample mean approaches the population mean through the law of large numbers, we will use a hypothetical example.

Suppose we had a health-app start-up whose user sign-in behavior is described by the following equation:
\(f(x;\lambda)=\lambda e^{-\lambda x}\) for \(x \ge 0\), where \(\lambda\) is the rate at which users sign into the app.

The original sample of users is n = 40, and the mean is \(\mu = 1/\lambda\).

lambda <- 0.2                      # rate of the exponential distribution
exponentials <- 40                 # number of exponentials per sample
mu <- 1 / lambda                   # theoretical mean, 1/lambda = 5
standard_deviation <- 1 / lambda   # theoretical standard deviation, also 1/lambda = 5
x <- c(1:exponentials)

# exponential density evaluated at each x
exp_distr_formula <- data.frame(lambda * exp(-(lambda) * x))

# vectorized helper returning the exponential density for a vector x and rate lambda
exp_distr <- function(x, lambda) {
  lambda * exp(-(lambda) * x)
}
y <- exp_distr(c(1:exponentials), 0.2)

colnames(exp_distr_formula) <- c("x")
plot(exp_distr_formula$x)
abline(v = mu, col = "red")                 # theoretical mean
abline(v = (1 / lambda^2), col = "purple")  # theoretical variance, 1/lambda^2 = 25

Now the start-up wants to know how best to scale for usage when it goes from n = 40 users to N = 10,000. To do this, we simulate repeated independent simple random samples, each of size n = 40, drawn from the population.

For this simulation a rate of 0.2 was chosen for lambda; therefore both mu and the standard deviation are 1/lambda = 5, and each sample contains 40 exponentials.

Creating the 40,000 Simulated Values (1,000 samples of 40 exponentials) with \(\mu = 5\)

These are our initial conditions

library(ggplot2); library(dplyr)           # plotting and data-manipulation packages used below
simulation_number <- 1000                  # number of repeated samples
total <- simulation_number * exponentials  # 40,000 simulated values in all
# 1,000 rows x 40 columns of exponential draws
Exponential_Distribution <- matrix(rexp(total, lambda), simulation_number)
theoretical_2 <- data.frame(Exponential_Distribution)
# theoretical density at each x, scaled to the number of simulated values
den_plot <- dexp(x = x, rate = 0.2) * length(Exponential_Distribution)
# kernel density estimate of the simulated values, used for the density plot below
d <- density(as.vector(Exponential_Distribution))

# histogram of all 40,000 simulated exponential values
g <- ggplot() +
  geom_histogram(
    mapping = aes(x = as.vector(Exponential_Distribution)),  # flatten the matrix into a vector
    color = "purple",
    bins = 100) +
  theme_classic() + ggtitle("Histogram of Exponential Distribution") +
  xlab("") + ylab("Frequency")

# kernel density estimate of the same simulated values
g2 <- ggplot() +
  geom_line(mapping = aes(x = d$x, y = d$y)) +
  theme_classic() + ggtitle("Density Plot of Exponential Distribution") +
  xlab("") + ylab("Density")
g

g2

The sampling frame is 40 columns of 1,000 rows, and the mean is taken across each row.
The means from the samples vary, as demonstrated by rowMeans; this shows that any \(\bar{x}\) is one example of the means which might be drawn from the population. A histogram can now be constructed from the sampling distribution of the means.

# mean of each row (each row is one sample of 40 exponentials)
theoretical_average <- theoretical_2 %>%
  mutate(Mean_simulations = rowMeans(theoretical_2)) %>%
  select(Mean_simulations)
head(theoretical_average)

theoretical_mean <- mean(theoretical_average$Mean_simulations)

# vertical reference lines at the mean plus/minus the variance and standard deviation
theoretical_variance <- c(mu + var(theoretical_average$Mean_simulations),
                          mu - var(theoretical_average$Mean_simulations))
theoretical_sd <- c(mu + sd(theoretical_average$Mean_simulations),
                    mu - sd(theoretical_average$Mean_simulations))

g3 <- ggplot(theoretical_average) +
  geom_histogram(mapping = aes(x = Mean_simulations),
                 binwidth = 0.1,
                 color = "steelblue") +
  geom_vline(xintercept = theoretical_mean, color = "lightblue") +
  geom_vline(xintercept = mu, color = "green") +
  geom_vline(xintercept = theoretical_variance, color = "purple") +
  geom_vline(xintercept = theoretical_sd, color = "black") +
  geom_text(aes(x = theoretical_mean, label = "theoretical mean", y = 25),
            colour = "lightblue", size = 5) +
  geom_text(aes(x = mu, label = "Mu", y = 20), colour = "green", size = 4) +
  geom_text(aes(x = theoretical_sd[1], label = "theoretical sd", y = 10), colour = "black") +
  geom_text(aes(x = theoretical_sd[2], label = "theoretical sd", y = 60), colour = "black") +
  theme_classic() + ggtitle("Histogram of Sample Means") +
  xlab("Sample Means") + ylab("Frequency")

g3

Now we compare the histogram of the exponential distribution with the histogram of the sample means.

The findings are generalized in the following statements. The sampling distribution of \(\bar{x}\) is more symmetrical than the population. Both the population distribution and the sampling distribution of \(\bar{x}\) are centered on \(\mu\). The sampling distribution of \(\bar{x}\) is less spread out than the population distribution.
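A minimal numerical check of these statements, assuming the objects Exponential_Distribution and theoretical_average created above are still in the workspace:

# center: both the raw draws and the sample means sit near mu = 5
mean(as.vector(Exponential_Distribution))
mean(theoretical_average$Mean_simulations)

# spread: the sample means vary far less than single observations
sd(as.vector(Exponential_Distribution))    # approximately 5, the population sd (1/lambda)
sd(theoretical_average$Mean_simulations)   # approximately 5/sqrt(40), about 0.79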

This finding is what the central limit theorem states: the sampling distribution of the sample mean tends toward normality even when the underlying population is not normal. The influence of the CLT becomes stronger as n gets large (the law of large numbers).
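To see this strengthening with n, here is a small illustrative sketch (the sample sizes 5, 20, and 40 are arbitrary choices) that repeats the simulation for each size and reports how the spread of the sample means shrinks:

# compare the sampling distribution of the mean for several sample sizes
set.seed(42)  # arbitrary seed so the sketch is reproducible
sample_sizes <- c(5, 20, 40)
for (n in sample_sizes) {
  sample_means <- replicate(1000, mean(rexp(n, rate = 0.2)))
  cat("n =", n, " sd of sample means =", round(sd(sample_means), 3), "\n")
}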

While not all sample means are equal to the population mean, the expected value of the sampling distribution of the sample mean is the population mean \(\mu\).
This variability is quantified by the standard deviation, variance, and standard error.

The sampling distribution will be more tightly clustered around the population mean; there will be less variability in the sample means than in any single observation.

According to the square root law, the precision of the mean is inversely related to the square root of the sample size. When single observations have a standard deviation of \(\sigma\), the sample mean has a standard deviation of \(\sigma/\sqrt{n}\), better known as the standard error of the mean.

standard_error <- standard_deviation / sqrt(exponentials)  # sigma / sqrt(n) = 5 / sqrt(40)

But why is this important? The standard error of the mean tells us how much a random sample mean is likely to deviate from the population mean.
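As a quick sanity check (assuming the simulation objects created above are still in the workspace), this standard error can be compared with the observed standard deviation of the 1,000 simulated sample means:

# theoretical standard error of the mean: sigma / sqrt(n) = 5 / sqrt(40), about 0.79
standard_error

# observed standard deviation of the 1,000 simulated sample means; should be close to 0.79
sd(theoretical_average$Mean_simulations)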

probabilities_vector <- c(0.05, 0.25, 0.5, 0.75, 0.95)
probabilities_vector
## [1] 0.05 0.25 0.50 0.75 0.95

# quantiles of the simulated sample means
quantiles_for_each_sample <- quantile(theoretical_average$Mean_simulations, probabilities_vector)
quantiles_for_each_sample
##       5%      25%      50%      75%      95% 
## 3.831339 4.419586 5.000255 5.544597 6.291083

# the same quantiles predicted by a normal distribution fitted to the sample means
qnorm_of_central_limit <- qnorm(probabilities_vector,
                                mean = mean(theoretical_average$Mean_simulations),
                                sd = sd(theoretical_average$Mean_simulations))
qnorm_of_central_limit
## [1] 3.731311 4.490306 5.017875 5.545445 6.304440

The 50th percentile of the simulated sample means is 5.00 and the corresponding value predicted by the CLT is 5.02, so the two agree closely. If you struggled with this project as I did, I highly recommend the book Basic Biostatistics by Dr. B. Burt Gerstman of San Jose State University. It provides great examples of sample size and the Central Limit Theorem.