Recall what the Central Limit Theorem (CLT) states:
Let \(\{X_1,X_2,\ldots,X_n\}\) be a random sample of size \(n\)- that is, a sequence of independent and identically distributed random variables drawn from a distribution with mean \(\mu_X\) and variance \(\sigma_X^2\). If you take sufficiently large random samples from the population with replacement, then the distribution of the sample means will be approximately normally distributed. mathematical terms, for \(n\) (size of the sample) sufficiently large we have
\[\bar{X}= \frac{1}{n}\sum_{i=1}^n X_i \sim \mathcal{N}\left(\mu_X, \sigma_X^2/n\right).\]
Here we intend to show how the sample mean of exponential observations satisfies the CLT. That is, how the distribution of sample mean have a very gaussian style via simulation of \(1,000\) samples of size \(40\).
We denote and use the following variables:
n: Number of simulations to be runS: Size of each samplelambda: The rate parameter of our exponential distribution#Parameters of the simulation
n <- 1000
S <- 40
lambda <- 0.2
Now, we create a matrix of size n\(\times\)S where each of the columns are a simulation of size 40 and then we compute the mean of each of these columns.
#Simulations
dataMatrix <- matrix(rexp(S*n, lambda), nrow = S)
sampleMeans <- apply(dataMatrix, 2, mean)
Therefore, if we perform the sample mean of our observations we get an approximation of the population mean.
theoretical.mean <- 1/lambda
sample.mean <- mean(sampleMeans)
cbind(theoretical.mean, sample.mean)
## theoretical.mean sample.mean
## [1,] 5 5.026962
Also, the same relationship works with the sample and population variance
theoretical.variance <- ((1/lambda)^2)
sample.variance <- var(sampleMeans)*S
cbind(theoretical.variance, sample.variance)
## theoretical.variance sample.variance
## [1,] 25 24.89185
To show the normal distribution, we create the probability histogram of the observations and the theoretical distribution (in red). Clearly, the sample mean (marked in blue) is the central point of our distribution.
#Plotting the histogram
par(mfrow = c(1,2), mar =c(4.1,3.1,3.1,1.1))
hist(sampleMeans, 30, freq = F, main = "Sample Distribution")
abline(v = 1/0.2, lwd = 3, col = 'blue') #Theoretical Mean
#Theoretical Distribution
pp <- rnorm(2500, mean = 1/lambda, sd = 1/(lambda*sqrt(S)))
lines(sort(pp), dnorm(sort(pp), mean = 1/lambda, sd = 1/(lambda*sqrt(S))), col = "red", lwd = 2)
#Exponential Distribution
xx <- sort(rexp(2500, rate = lambda))
plot(xx, dexp(xx, rate = lambda), type = "l", main = "Exponential Distribution", xlab = "", lwd = 3)
Both distributions of the sample mean and populations, respectively.
We pretend to perform some basic inferential data analysis on the ToothGrowth dataset.
data(ToothGrowth)
From the documentation obtained by help(ToothGrowth):
First we obtain a few statistics of our dataset:
library(ggplot2); library(tidyverse)
summary(ToothGrowth)
## len supp dose
## Min. : 4.20 OJ:30 Min. :0.500
## 1st Qu.:13.07 VC:30 1st Qu.:0.500
## Median :19.25 Median :1.000
## Mean :18.81 Mean :1.167
## 3rd Qu.:25.27 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
str(ToothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
ToothGrowth %>% group_by(supp, dose) %>% summarise(lenMean = mean(len), count = n())
We are going to perform three analysis comparing the means between the two delivery methods: Oranje Juice (OJ) and Ascorbic Acid, i.e, vitamin C (VC), via hypothesis Testing:
Since we don’t have enough information about the data, we can perform t-tests for this purpose. Considerer the hypothesis test \[H_0: \mu_{_{OJ}} = \mu_{_{VC}} \quad \text{vs}\quad H_a: \mu_{_{OJ}} \neq \mu_{_{VC}}\] where \(\mu_{_{OJ}}\) and \(\mu_{_{VC}}\) are the means of the population of the Oranje Juice and Vitamin C methods, respectively.
aux <- ToothGrowth %>% mutate(dose = "ALL")
ToothGrowth %>% rbind(aux) %>%
ggplot(aes(x=supp, y = len, fill = supp)) +
geom_violin(size=1.5) +
facet_wrap(~dose)
Violin plots for each dosage and without grouping by that.
First, without taking in consideration the dosage.
t.test(len ~ supp, paired = F, data = ToothGrowth)
##
## Welch Two Sample t-test
##
## data: len by supp
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1710156 7.5710156
## sample estimates:
## mean in group OJ mean in group VC
## 20.66333 16.96333
Since the confidence interval contains the cero and its \(p\)-value is 0.0606345 \(>0.05\) we failed to reject the null hypothesis. In other words, there is no significance evidence to ensure mean distinction.
We now filter for the low-dosage.
lowdosage <- ToothGrowth %>% filter(dose == 0.5)
t.test(len ~ supp, paired = F, data = lowdosage)
##
## Welch Two Sample t-test
##
## data: len by supp
## t = 3.1697, df = 14.969, p-value = 0.006359
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.719057 8.780943
## sample estimates:
## mean in group OJ mean in group VC
## 13.23 7.98
Then, we filter for the mid-dosage.
middosage <- ToothGrowth %>% filter(dose == 1)
t.test(len ~ supp, paired = F, data = middosage)
##
## Welch Two Sample t-test
##
## data: len by supp
## t = 4.0328, df = 15.358, p-value = 0.001038
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.802148 9.057852
## sample estimates:
## mean in group OJ mean in group VC
## 22.70 16.77
Finall, we filter for the high-dosage.
highdosage <- ToothGrowth %>% filter(dose == 2)
t.test(len ~ supp, paired = F, data = highdosage)
##
## Welch Two Sample t-test
##
## data: len by supp
## t = -0.046136, df = 14.04, p-value = 0.9639
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.79807 3.63807
## sample estimates:
## mean in group OJ mean in group VC
## 26.06 26.14
In the next table, we present all the \(p-\)values of each hypothesis test
## all.dose low.dose mid.dose high.dose
## [1,] 0.06063451 0.006358607 0.001038376 0.9638516
From the table, we can conclude the following
0.5) and mid (1) dosages are significant distinct.