OVERVIEW

The information given below are all for the completion of the Statistical Inference Course Project from Coursera’s Data Science by Johns Hopkins University. This project aims to investigate the exponential distribution while comparing with the Central Limit Theorem. Additionally, illustrations are shown via simulation and explanatory texts by the means of the following:

A Simulation Exercise (Part 1)
Basic inferential data analysis (Part 2) on the ToothGrowth dataset

# Load libraries

library(ggplot2)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(datasets)
library(gridExtra)

## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine

library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(dplyr)
library(knitr)
library(kableExtra)

## Warning: package 'kableExtra' was built under R version 4.3.3

## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

PART 1: SIMULATIONS

Instructions –> You will investigate the distribution of the averages of 40 exponential

What you need to do:

Show the sample mean and compare it to the theoretical mean of the distributions.
Show how variable the sample is (via variance) and compare it to the thoretical variance of the distribution.
Show that the distribution is approximately normal.

In point 3, focus on the difference between the distribution of a large collection of random exponentials and the distribution of a large collection of averages of 40 exponentials.

The following pre-defined paramters are given for Part 1:

Set lambda = 0.2 for all of the simulations.
Investigate the distribution of averages of 40 exponentials.
Do a thousand simulations.

Exponential Distribution -> describes times between events happening at constant rate lambda with expected value 1/lambda.(Wikipedia)

The exponential distribution is simulated with rexp(n, lambda) where lambda is the rate parameter. The mean of exponential distribution andstandard dev iation is 1/lambda.

# Set constants

lambda <- 0.2 # Lamba for rexp
n <- 40 # Number of exponentials
simulations <- 1000 # Number of tests

# Set the seed to create reproducibility (generates uniform results)

set.seed(2021)
         
# Runing the simulations

rsims <- matrix(rexp(simulations * n,
                     rate = lambda),
                simulations)

# Calculating the means of the exponential simulations

msim <- rowMeans(rsims) # Row Means
msim_df <- data.frame(msim) # DataFrame for Histogram representation

#Calculate mean, sd and variance of sample exp. simulations

rsims_mean <- mean(msim)
rsims_sd <- sd(msim)
rsims_var <- var(msim)

With the number of simulations, number of exponentials and the rate parameter, we can simulate the exponential distribution by multiplying the exponential by the number of simulations (giving us 1000 simulations of 40 exponentials).

We have then put it on a matrix form and used the apply function to find the mean of each row.

With this data we can then calculate the different requested statistical variables (variance, mean and standard deviation)

# Generate a histogram with the means + highlight the means of means

ggplot(msim_df) +
  geom_histogram(aes(x = msim),
                 color = "white",
                 fill = "salmon",
                 bins = 39) +
  scale_y_continuous(expand = c(0,0),
                     limits = c(0, 80, 10)) +
  geom_vline(aes(xintercept = mean(msim)),
             color = "black",
             linewidth = 1.25,
             linetype = "dashed") +
  annotate("text",
           label = "Mean = 5.008",
           x = 5.6,
           y = 70,
           color = "black") +
  labs(title = "Exp. Function Simulation",
       subtitle = "Mean of Simulated Exponentials",
       y = "Frequency",
       x = "Mean") +
  theme_test()

The histogram represents the averaged means of the simulated exponentials. We can appreciate a normal distribution of the data. The black dashed line represents the mean of the means.

THEORETICAL EXPONENTIAL DISTRIBUTION

# Calculate mean, sd, and variance of the theoretical exp. dist.

t_mean <- 1/lambda
t_sd <- (1/lambda) * (1/sqrt(n))
t_var <- t_sd^2

Graphical representation of theoretical vs experimental mean

GRAPHICAL REPRESENTATION

ggplot(msim_df) +
  geom_histogram(aes(x = msim),
                 color = "white",
                 fill = "salmon",
                 bins = 39) +
  scale_y_continuous(expand = c(0,0),
                     limits = c(0, 80, 10)) +
  geom_vline(aes(xintercept = mean(msim)),
             color = "black",
             linewidth = 1.25,
             linetype = "dashed") +
  annotate("text",
           label = "Mean = 5.008",
           x = 5.6,
           y = 70,
           color = "black") +
  geom_vline(aes(xintercept = t_mean),
             color = "green",
             linewidth = 1) +
  annotate("text",
           label = "Theoretical mean = 5",
           x = 5.8,
           y = 75,
           color = "green") +
  labs(title = "Exp. Function Simulation",
       subtitle = "Mean of Simulated Exponentials",
       y = "Frequency",
       x = "Mean") +
  theme_test()

In this histogram we can see, in graphical format, the difference between the experimental (black) and theoretical (green) means.

COMPARISON BETWEEN EXPERIMENTAL AND THEORETICAL STATISTICS

mean <- data.frame(stat = "Mean", 
                   exp_value = rsims_mean, 
                   theoretical_value =  t_mean)
sd <- data.frame(stat = "SD", 
                 exp_value = rsims_sd,
                 theoretical_value = t_sd)
var <- data.frame(stat = "Var", 
                  exp_value = rsims_var, 
                  theoretical_value = t_var)

stats <- data.frame() %>%
  rbind(mean) %>%
  rbind(sd) %>%
  rbind(var)

stats$difference <- stats$exp_value - stats$theoretical_value

kable(stats,
      align = "lccrr") %>%
  kable_styling(bootstrap_options = "bordered",
                full_width = FALSE) %>%
  row_spec(row = 0,
           bold = TRUE)

stat	exp_value	theoretical_value	difference
Mean	5.0086386	5.0000000	0.0086386
SD	0.7882570	0.7905694	-0.0023124
Var	0.6213491	0.6250000	-0.0036509

EXPERIMENTAL VS THEORETICAL DISTRIBUTION

From the central limit theorem we know that the distribution of averages of normalized variables becomes that of a standard normal distribution as the sample size increases.

We want to normalize our sample means. To do this, we need to transform each mean in our simulated dataset according to

z-score = (xbar - 1/lambda) / (1/lambda / sqrt(n))

Based on the CLT result, it should be a normal distribution centered at approximately zero. To verify this, we create a plot comparing the density of our transformed sample means distribution with the density of the standard normal distribution.

# Compute the mean of the normalised means
z_mean <- msim_df %>%  
  mutate(z_score = (msim - 1/lambda) / (1/lambda / sqrt(n)))  %>%
  select(z_score) %>%
  summarise(z_mean = mean(z_score)) %>%
  unlist()

# Create Z-scores for sample means

msim_df_z <- msim_df %>%  
  mutate(z_score = (msim - 1/lambda) / (1/lambda / sqrt(n)))

ggplot(msim_df,
       aes(x = msim)) +
  geom_histogram(aes(y = ..density..),
                 color = "white",
                 fill = "salmon",
                 alpha = 0.4,
                 bins = 39) +
  geom_density(color = "blue",
               lwd = 1) +
  stat_function(fun = dnorm,
                args = list(mean = t_mean, sd = t_sd),
                size = 1.3) +
  scale_y_continuous(expand = c(0,0),
                     limits = c(0, 0.7, 0.1)) +
  labs(title = "Distribution of Standarized Simulated Means",
       y = "Density",
       x = "Mean") +
  annotate("text", 
           label = "Theoretical Normal Distribution",
           x = 6.2,
           y = 0.5,
           color = "black")+
   annotate("text", 
           label = "Experimental Distribution",
           x = 5.7,
           y = 0.53,
           color = "blue") +
  theme_test()

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

The black line is the theoretical normal distribution and, as we can see, it is very similar to the experimental distribution represented by the blue line.

# Q-Q plot of the sample distribution

qqnorm(msim,
       col = "black")

# Q-Q plot of the theoretical distribution

qqline(msim,
       col = "red",
       lwd = 3)

Observing the normal Q-Q plot, we can conclude that the sample distribution approximates the theoretical normal distribution quite closely, with the tails being less normal.

Statistical Inference Course Project

Macarena L. Fernandez Carro

2024-03-12