Part 1: Comparison of Exponential Distribution With the Central Limit Theorem

1: Simulation exercise

Overview

In this project the exponential distribution is investigated in R and compare it with Central Limit Theorem. The mean of exponential distribution is 1/lambda and the standard deviation is also a function of 1/lambda. The exponential distribution is simulated in R with rexp(n,lambda), where lambda=0.2 for all of the simulations, sample size n = 40, and the number of simulation =1000.

1.1.Simulations

A series of 1000 simulations is run to create a data set for comparison purpose. Each simulation contain 40 observations and the expoential distribution function will be set to rexp(40, 0.2) where 0.2 is lambda value.

Given data: n = 40; simNum = 1000; lambda = 0.2 For reproducibility, set seed = 22062024.

Simulation of exponential distribution and calculation of summary statistics

The code below executes the simulation to generate the data for the exponential distribution data

n = 40
lambda = 0.2
simNum = 1000
set.seed(22062024)

Next, we make a data frame with the sampled exponential distribution data and then visualize the first few lines and the last few lines of the data using head() and tail() function.

simExp = function(n, lambda){
    mean(rexp(n,lambda))
    }

simul = data.frame(ncol=2,nrow=simNum)
names(simul) = c("Sample","Mean")
for (i in 1:simNum)
{
    simul[i,1] = i
    simul[i,2] = simExp(n,lambda)
}
#check the data frame created
#the top 6
head(simul)
tail(simul)

1.2. Sample Mean vs Theoretical Mean of the Distribution.

We first find the sample mean as follows:

meanSample <- mean(simul$Mean)

meanSample
## [1] 5.033469

Next, the theoretical mean is evaluated by :

meanTheory <- 1/lambda

meanTheory
## [1] 5

We can see that the sample mean of 5.03 is close to the theoretical value of 5.

Then, we make a plot(Histogram) for the exponential distributions of the sample means. Herein, the Sample and theoretical means are showcased on the plot with distinct colored lines.

hist(simul$Mean, breaks = 30, prob = TRUE,col = "green", 
     main="Exponential Distribution of Sample Means", 
     xlab="Means of 40 Simulated Samples", ylab = "Counts")
abline(v = meanTheory, col= "red", lwd = 3)
abline(v = meanSample, col = "blue",lwd = 2)
legend('topright', c("Sample Mean", "Theoretical Mean"), 
       bty = "n",       
       lty = c(1,1), 
       col = c(col = "blue", col = "red"))

The red vertical line indicates the theoretical sample mean, whereas the blue vertical line is the sample mean. The center of distribution of averages of 40 exponentials is very close to the theoretical center of the distribution.

1.3. Sample Variance versus Theoretical Variance.

In this section, we compare the variance of the sample means of the 1000 simulations to the theoretical variance of the population.

Firstly, the sample variance is estimated by using the variance of the 1000 entries in the means vector times the sample size, 40.

var_sample <- var(simul$Mean)

var_sample
## [1] 0.6210099

Secondly we estimate the theoretical variance of the population given by the formula σ2=(1/lambda)2/n as follows:

var_theory = (1/lambda)^2/n

var_theory
## [1] 0.625

We conclude that, the sample variance of the distribution is 0.621 and the theoretical variance is 0.625.

The following illustrates the values for the sample and theoretical mean distribution and their variances.

##          Theroetical sample
## Mean           5.000  5.03
## Variance       0.625  0.621

1.4. Showing that the distribution is approximately normal

The central limit theorem tells us that the means of samples tend to follow a normal distribution. The figure below demonstrates this by comparing the density from the histogram with a normal density curve, which is plotted using the theoretical mean and variance. This comparison shows that the distribution is approximately normal.

hist(simul$Mean, 
     breaks = 30, 
     prob = TRUE,col = "green", 
     main = "Density of Simulated Samples Means", 
     xlab = "Means of Exponential", ylab = "Mean Density")
lines(density(simul$Mean), col = "blue", lwd = 2)
abline(v = 1/lambda, col = "orange", lwd = 2)
xfit <- seq(min(simul$Mean), max(simul$Mean), length = 100)
yfit <- dnorm(xfit, mean = 1/lambda, sd = (1/lambda/sqrt(n)))
lines(xfit, yfit, pch = 22, col = "red", lwd = 2)
legend('topright', c("Simulated Values", "Theoretical Values"), 
       bty = "n", lwd = c(2,2), col = c("blue", "red"))

The plot above illustrates that the density curve (simulated values) closely aligns with the normal distribution curve (theoretical values), indicating that the sample distribution is approximately normal. Additionally, the q-q plot below supports this, as the theoretical quantiles match closely with the actual quantiles, further suggesting normality.

qqnorm(simul$Mean,main ="Normal Q-Q Plot", col = "red")
qqline(simul$Mean, col = "blue", lwd = 2)

Hence, from the above, we can conclude that the comparisons confirms that the distribution is approximately normal.

Part 2 : Basic inferential data Analysis

Loading the required packages

library(ggplot2)
library(datasets)
#library(tidyverse)
library(gridExtra)

Load the ToothGrowth data and perform some basic exploratory data analyses

# Load the data ToothGrowth
data(ToothGrowth)
# Look at the structure of the data
str(ToothGrowth)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
head(ToothGrowth)
tail(ToothGrowth)

Provide a basic summary of the data.

summary(ToothGrowth)
##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

Compare means of the different delivery methods

tapply(ToothGrowth$len,ToothGrowth$supp, mean)
##       OJ       VC 
## 20.66333 16.96333

Make a boxplot to look at the data graphically

# Basic plots to visualize the data
plot1 <- ggplot(ToothGrowth, aes(x = supp, y = len)) +
  geom_boxplot() +
  labs(title = "Tooth Length by Supplement Type", x = "Supplement Type", y = "Tooth Length")

plot2 <- ggplot(ToothGrowth, aes(x = factor(dose), y = len)) +
  geom_boxplot() +
  labs(title = "Tooth Length by Dose", x = "Dose (mg/day)", y = "Tooth Length")

plot3 <- ggplot(ToothGrowth, aes(x = factor(dose), y = len, fill = supp)) +
  geom_boxplot() +
  labs(title = "Tooth Length by Dose and Supplement Type", x = "Dose (mg/day)", y = "Tooth Length") +
  facet_wrap(~supp)

# Arrange the plots in one place using grid.arrange
grid.arrange(plot1, plot2, plot3, ncol = 2)

Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose.

We make the following Hypothesis :

  • Hypothesis for 0.5 and 1 mg/day:

    H_0: There is no difference in tooth growth by the delivery method.

    H_1: There is a difference in tooth growth by the delivery method.

  • Hypothesis for 2 mg/day:

    H_0: There is no difference in tooth growth by the delivery method.

    H_1: There is a difference in tooth growth by the delivery method.

First, make a comparison by delivery method for the same dosage

t05 <- t.test(len ~ supp, 
       data = rbind(ToothGrowth[(ToothGrowth$dose == 0.5) & 
                                      (ToothGrowth$supp == "OJ"),],
                    ToothGrowth[(ToothGrowth$dose == 0.5) & 
                                      (ToothGrowth$supp == "VC"),]), 
       var.equal = FALSE)

t1 <- t.test(len ~ supp, 
       data = rbind(ToothGrowth[(ToothGrowth$dose == 1) & 
                                      (ToothGrowth$supp == "OJ"),],
                    ToothGrowth[(ToothGrowth$dose == 1) & 
                                      (ToothGrowth$supp == "VC"),]), 
       var.equal = FALSE)

t2 <- t.test(len ~ supp, 
       data = rbind(ToothGrowth[(ToothGrowth$dose == 2) & 
                                      (ToothGrowth$supp == "OJ"),],
                    ToothGrowth[(ToothGrowth$dose == 2) & 
                                      (ToothGrowth$supp == "VC"),]), 
       var.equal = FALSE)

Make a summary of the conducted t.tests, which compare the delivery methods by dosage and take the p-values and CI

summaryBYsupp <- data.frame(
      "p-value" = c(t05$p.value, t1$p.value, t2$p.value),
      "Conf.Low" = c(t05$conf.int[1],t1$conf.int[1], t2$conf.int[1]),
      "Conf.High" = c(t05$conf.int[2],t1$conf.int[2], t2$conf.int[2]),
      row.names = c("Dosage .05","Dosage 1","Dosage 2"))
# Show the data table 
summaryBYsupp

State your conclusions and the assumptions needed for your conclusions

In a nutshell, with 95% confidence, we reject the null hypothesis that there is no difference in tooth growth by the delivery method for doses of 0.5 and 1 milligram/day. We observe p-values less than the threshold of 0.05, and the confidence intervals do not include 0. This indicates that the delivery method does significantly affect tooth growth at these dosages.

However, with 95% confidence, we fail to reject the null hypothesis for a dosage of 2 milligrams/day. In this case, we observe p-values greater than the threshold of 0.05, and the confidence intervals include 0. This suggests that, at a dosage of 2 milligrams/day, the delivery method does not significantly impact tooth growth.