The information given below are all for the completion of the Statistical Inference Course Project from Coursera’s Data Science by Johns Hopkins University. This project aims to investigate the exponential distribution while comparing with the Central Limit Theorem. Additionally, illustrations are shown via simulation and explanatory texts by the means of the following:
# Load libraries
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(datasets)
library(gridExtra)
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(dplyr)
library(knitr)
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.3.3
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
Instructions –> You will investigate the distribution of the averages of 40 exponential
What you need to do:
In point 3, focus on the difference between the distribution of a large collection of random exponentials and the distribution of a large collection of averages of 40 exponentials.
The following pre-defined paramters are given for Part 1:
Exponential Distribution -> describes times between events happening at constant rate lambda with expected value 1/lambda.(Wikipedia)
The exponential distribution is simulated with
rexp(n, lambda)
where lambda is the rate parameter. The
mean of exponential distribution andstandard dev iation is
1/lambda
.
# Set constants
lambda <- 0.2 # Lamba for rexp
n <- 40 # Number of exponentials
simulations <- 1000 # Number of tests
# Set the seed to create reproducibility (generates uniform results)
set.seed(2021)
# Runing the simulations
rsims <- matrix(rexp(simulations * n,
rate = lambda),
simulations)
# Calculating the means of the exponential simulations
msim <- rowMeans(rsims) # Row Means
msim_df <- data.frame(msim) # DataFrame for Histogram representation
#Calculate mean, sd and variance of sample exp. simulations
rsims_mean <- mean(msim)
rsims_sd <- sd(msim)
rsims_var <- var(msim)
With the number of simulations, number of exponentials and the rate parameter, we can simulate the exponential distribution by multiplying the exponential by the number of simulations (giving us 1000 simulations of 40 exponentials).
We have then put it on a matrix form and used the apply function to find the mean of each row.
With this data we can then calculate the different requested statistical variables (variance, mean and standard deviation)
# Generate a histogram with the means + highlight the means of means
ggplot(msim_df) +
geom_histogram(aes(x = msim),
color = "white",
fill = "salmon",
bins = 39) +
scale_y_continuous(expand = c(0,0),
limits = c(0, 80, 10)) +
geom_vline(aes(xintercept = mean(msim)),
color = "black",
linewidth = 1.25,
linetype = "dashed") +
annotate("text",
label = "Mean = 5.008",
x = 5.6,
y = 70,
color = "black") +
labs(title = "Exp. Function Simulation",
subtitle = "Mean of Simulated Exponentials",
y = "Frequency",
x = "Mean") +
theme_test()
The histogram represents the averaged means of the simulated
exponentials. We can appreciate a normal distribution of the data. The
black dashed line represents the mean of the means.
# Calculate mean, sd, and variance of the theoretical exp. dist.
t_mean <- 1/lambda
t_sd <- (1/lambda) * (1/sqrt(n))
t_var <- t_sd^2
ggplot(msim_df) +
geom_histogram(aes(x = msim),
color = "white",
fill = "salmon",
bins = 39) +
scale_y_continuous(expand = c(0,0),
limits = c(0, 80, 10)) +
geom_vline(aes(xintercept = mean(msim)),
color = "black",
linewidth = 1.25,
linetype = "dashed") +
annotate("text",
label = "Mean = 5.008",
x = 5.6,
y = 70,
color = "black") +
geom_vline(aes(xintercept = t_mean),
color = "green",
linewidth = 1) +
annotate("text",
label = "Theoretical mean = 5",
x = 5.8,
y = 75,
color = "green") +
labs(title = "Exp. Function Simulation",
subtitle = "Mean of Simulated Exponentials",
y = "Frequency",
x = "Mean") +
theme_test()
In this histogram we can see, in graphical format, the difference
between the experimental (black) and theoretical (green) means.
mean <- data.frame(stat = "Mean",
exp_value = rsims_mean,
theoretical_value = t_mean)
sd <- data.frame(stat = "SD",
exp_value = rsims_sd,
theoretical_value = t_sd)
var <- data.frame(stat = "Var",
exp_value = rsims_var,
theoretical_value = t_var)
stats <- data.frame() %>%
rbind(mean) %>%
rbind(sd) %>%
rbind(var)
stats$difference <- stats$exp_value - stats$theoretical_value
kable(stats,
align = "lccrr") %>%
kable_styling(bootstrap_options = "bordered",
full_width = FALSE) %>%
row_spec(row = 0,
bold = TRUE)
stat | exp_value | theoretical_value | difference |
---|---|---|---|
Mean | 5.0086386 | 5.0000000 | 0.0086386 |
SD | 0.7882570 | 0.7905694 | -0.0023124 |
Var | 0.6213491 | 0.6250000 | -0.0036509 |
From the central limit theorem we know that the distribution of averages of normalized variables becomes that of a standard normal distribution as the sample size increases.
We want to normalize our sample means. To do this, we need to transform each mean in our simulated dataset according to
z-score = (xbar - 1/lambda) / (1/lambda / sqrt(n))
Based on the CLT result, it should be a normal distribution centered at approximately zero. To verify this, we create a plot comparing the density of our transformed sample means distribution with the density of the standard normal distribution.
# Compute the mean of the normalised means
z_mean <- msim_df %>%
mutate(z_score = (msim - 1/lambda) / (1/lambda / sqrt(n))) %>%
select(z_score) %>%
summarise(z_mean = mean(z_score)) %>%
unlist()
# Create Z-scores for sample means
msim_df_z <- msim_df %>%
mutate(z_score = (msim - 1/lambda) / (1/lambda / sqrt(n)))
ggplot(msim_df,
aes(x = msim)) +
geom_histogram(aes(y = ..density..),
color = "white",
fill = "salmon",
alpha = 0.4,
bins = 39) +
geom_density(color = "blue",
lwd = 1) +
stat_function(fun = dnorm,
args = list(mean = t_mean, sd = t_sd),
size = 1.3) +
scale_y_continuous(expand = c(0,0),
limits = c(0, 0.7, 0.1)) +
labs(title = "Distribution of Standarized Simulated Means",
y = "Density",
x = "Mean") +
annotate("text",
label = "Theoretical Normal Distribution",
x = 6.2,
y = 0.5,
color = "black")+
annotate("text",
label = "Experimental Distribution",
x = 5.7,
y = 0.53,
color = "blue") +
theme_test()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The black line is the theoretical normal distribution and, as we can see, it is very similar to the experimental distribution represented by the blue line.
# Q-Q plot of the sample distribution
qqnorm(msim,
col = "black")
# Q-Q plot of the theoretical distribution
qqline(msim,
col = "red",
lwd = 3)
Observing the normal Q-Q plot, we can conclude that the sample distribution approximates the theoretical normal distribution quite closely, with the tails being less normal.