In this project the exponential distribution is investigated in R and compare it with Central Limit Theorem. The mean of exponential distribution is 1/lambda and the standard deviation is also a function of 1/lambda. The exponential distribution is simulated in R with rexp(n,lambda), where lambda=0.2 for all of the simulations, sample size n = 40, and the number of simulation =1000.
A series of 1000 simulations is run to create a data set for comparison purpose. Each simulation contain 40 observations and the expoential distribution function will be set to rexp(40, 0.2) where 0.2 is lambda value.
Given data: n = 40; simNum = 1000; lambda = 0.2 For reproducibility, set seed = 22062024.
The code below executes the simulation to generate the data for the exponential distribution data
n = 40
lambda = 0.2
simNum = 1000
set.seed(22062024)
simExp = function(n, lambda){
mean(rexp(n,lambda))
}
simul = data.frame(ncol=2,nrow=simNum)
names(simul) = c("Sample","Mean")
for (i in 1:simNum)
{
simul[i,1] = i
simul[i,2] = simExp(n,lambda)
}
#check the data frame created
#the top 6
head(simul)
tail(simul)
We first find the sample mean as follows:
meanSample <- mean(simul$Mean)
meanSample
## [1] 5.033469
Next, the theoretical mean is evaluated by :
meanTheory <- 1/lambda
meanTheory
## [1] 5
We can see that the sample mean of 5.03 is close to the theoretical value of 5.
Then, we make a plot(Histogram) for the exponential distributions of the sample means. Herein, the Sample and theoretical means are showcased on the plot with distinct colored lines.
hist(simul$Mean, breaks = 30, prob = TRUE,col = "green",
main="Exponential Distribution of Sample Means",
xlab="Means of 40 Simulated Samples", ylab = "Counts")
abline(v = meanTheory, col= "red", lwd = 3)
abline(v = meanSample, col = "blue",lwd = 2)
legend('topright', c("Sample Mean", "Theoretical Mean"),
bty = "n",
lty = c(1,1),
col = c(col = "blue", col = "red"))
The red vertical line indicates the theoretical sample mean, whereas the blue vertical line is the sample mean. The center of distribution of averages of 40 exponentials is very close to the theoretical center of the distribution.
In this section, we compare the variance of the sample means of the 1000 simulations to the theoretical variance of the population.
Firstly, the sample variance is estimated by using the variance of the 1000 entries in the means vector times the sample size, 40.
var_sample <- var(simul$Mean)
var_sample
## [1] 0.6210099
Secondly we estimate the theoretical variance of the population given by the formula σ2=(1/lambda)2/n as follows:
var_theory = (1/lambda)^2/n
var_theory
## [1] 0.625
We conclude that, the sample variance of the distribution is 0.621 and the theoretical variance is 0.625.
The following illustrates the values for the sample and theoretical mean distribution and their variances.
## Theroetical sample
## Mean 5.000 5.03
## Variance 0.625 0.621
The central limit theorem tells us that the means of samples tend to follow a normal distribution. The figure below demonstrates this by comparing the density from the histogram with a normal density curve, which is plotted using the theoretical mean and variance. This comparison shows that the distribution is approximately normal.
hist(simul$Mean,
breaks = 30,
prob = TRUE,col = "green",
main = "Density of Simulated Samples Means",
xlab = "Means of Exponential", ylab = "Mean Density")
lines(density(simul$Mean), col = "blue", lwd = 2)
abline(v = 1/lambda, col = "orange", lwd = 2)
xfit <- seq(min(simul$Mean), max(simul$Mean), length = 100)
yfit <- dnorm(xfit, mean = 1/lambda, sd = (1/lambda/sqrt(n)))
lines(xfit, yfit, pch = 22, col = "red", lwd = 2)
legend('topright', c("Simulated Values", "Theoretical Values"),
bty = "n", lwd = c(2,2), col = c("blue", "red"))
The plot above illustrates that the density curve (simulated values) closely aligns with the normal distribution curve (theoretical values), indicating that the sample distribution is approximately normal. Additionally, the q-q plot below supports this, as the theoretical quantiles match closely with the actual quantiles, further suggesting normality.
qqnorm(simul$Mean,main ="Normal Q-Q Plot", col = "red")
qqline(simul$Mean, col = "blue", lwd = 2)
Hence, from the above, we can conclude that the comparisons confirms that the distribution is approximately normal.
library(ggplot2)
library(datasets)
#library(tidyverse)
library(gridExtra)
# Load the data ToothGrowth
data(ToothGrowth)
# Look at the structure of the data
str(ToothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
head(ToothGrowth)
tail(ToothGrowth)
summary(ToothGrowth)
## len supp dose
## Min. : 4.20 OJ:30 Min. :0.500
## 1st Qu.:13.07 VC:30 1st Qu.:0.500
## Median :19.25 Median :1.000
## Mean :18.81 Mean :1.167
## 3rd Qu.:25.27 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
tapply(ToothGrowth$len,ToothGrowth$supp, mean)
## OJ VC
## 20.66333 16.96333
# Basic plots to visualize the data
plot1 <- ggplot(ToothGrowth, aes(x = supp, y = len)) +
geom_boxplot() +
labs(title = "Tooth Length by Supplement Type", x = "Supplement Type", y = "Tooth Length")
plot2 <- ggplot(ToothGrowth, aes(x = factor(dose), y = len)) +
geom_boxplot() +
labs(title = "Tooth Length by Dose", x = "Dose (mg/day)", y = "Tooth Length")
plot3 <- ggplot(ToothGrowth, aes(x = factor(dose), y = len, fill = supp)) +
geom_boxplot() +
labs(title = "Tooth Length by Dose and Supplement Type", x = "Dose (mg/day)", y = "Tooth Length") +
facet_wrap(~supp)
# Arrange the plots in one place using grid.arrange
grid.arrange(plot1, plot2, plot3, ncol = 2)
We make the following Hypothesis :
Hypothesis for 0.5 and 1 mg/day:
H_0: There is no difference in tooth growth by the delivery method.
H_1: There is a difference in tooth growth by the delivery method.
Hypothesis for 2 mg/day:
H_0: There is no difference in tooth growth by the delivery method.
H_1: There is a difference in tooth growth by the delivery method.
t05 <- t.test(len ~ supp,
data = rbind(ToothGrowth[(ToothGrowth$dose == 0.5) &
(ToothGrowth$supp == "OJ"),],
ToothGrowth[(ToothGrowth$dose == 0.5) &
(ToothGrowth$supp == "VC"),]),
var.equal = FALSE)
t1 <- t.test(len ~ supp,
data = rbind(ToothGrowth[(ToothGrowth$dose == 1) &
(ToothGrowth$supp == "OJ"),],
ToothGrowth[(ToothGrowth$dose == 1) &
(ToothGrowth$supp == "VC"),]),
var.equal = FALSE)
t2 <- t.test(len ~ supp,
data = rbind(ToothGrowth[(ToothGrowth$dose == 2) &
(ToothGrowth$supp == "OJ"),],
ToothGrowth[(ToothGrowth$dose == 2) &
(ToothGrowth$supp == "VC"),]),
var.equal = FALSE)
summaryBYsupp <- data.frame(
"p-value" = c(t05$p.value, t1$p.value, t2$p.value),
"Conf.Low" = c(t05$conf.int[1],t1$conf.int[1], t2$conf.int[1]),
"Conf.High" = c(t05$conf.int[2],t1$conf.int[2], t2$conf.int[2]),
row.names = c("Dosage .05","Dosage 1","Dosage 2"))
# Show the data table
summaryBYsupp
In a nutshell, with 95% confidence, we reject the null hypothesis that there is no difference in tooth growth by the delivery method for doses of 0.5 and 1 milligram/day. We observe p-values less than the threshold of 0.05, and the confidence intervals do not include 0. This indicates that the delivery method does significantly affect tooth growth at these dosages.
However, with 95% confidence, we fail to reject the null hypothesis for a dosage of 2 milligrams/day. In this case, we observe p-values greater than the threshold of 0.05, and the confidence intervals include 0. This suggests that, at a dosage of 2 milligrams/day, the delivery method does not significantly impact tooth growth.