Author:Mrityunjay Pandey
As per the ask in the project this section will execute simulations and data analysises to illustrate application of the central limit theorem.
knitr::opts_chunk$set(echo = TRUE)
lambda <- 0.2
simulation_Data <- matrix(rexp(1000*40, lambda), nrow = 1000, ncol = 40)
distMean <- apply(simulation_Data, 1, mean)
hist(distMean, breaks = 50, main = "The distribution of 1000 averages of 40 random exponentials", xlab = "Value of means", ylab = "Frequency of means", col = "pink")
abline(v = 1/lambda, lty = 1, lwd = 5, col = "red")
legend("topright", lty = 1, lwd = 5, col = "red", legend = "theoretical mean")
The simulated sample means are normally distributed with a center very close to the theoretical mean.
distVar <- apply(simulation_Data, 1, var)
hist(distVar, breaks = 50, main = "The distribution of 1000 variance of 40 random exponentials", xlab = "Value of variances", ylab = "Frequency of variance", col = "light blue")
abline(v = (1/lambda)^2, lty = 1, lwd = 5, col = "blue")
legend("topright", lty = 1, lwd = 5, col = "blue", legend = "theoretical variance")
par(mfrow = c(3, 1))
hist(simulation_Data, breaks = 50, main = "Distribution of exponentials with lambda equals to 0.2", xlab = "Exponentials", col = "yellow")
hist(distMean, breaks = 50, main = "The distribution of 1000 averages of 40 random exponentials", xlab = "Value of means", ylab = "Frequency of means", col = "pink")
simNorm <- rnorm(1000, mean = mean(distMean), sd = sd(distMean))
hist(simNorm, breaks = 50, main = "A normal distribution with theoretical mean and sd of the exponentials", xlab = "Normal variables", col = "light green")
The first histogram is the distribution of the exponentials with lambda equals to 0.2. The second histogram is the distribution of 1000 averages of 40 random exponentials. The third histogram is a real normal distribution with a mean and standard deviation equals to the second histogram’s.Comparing the first with the second histogram, we can see the distrubution becames normal as the means were taken from each groups. It is a result of the central limit theorem. Comparing the second and the third histogram, we can see the distribution of the means is similar to a real normal distribution with the same mean and standard deviation.
library(stats)
data(ToothGrowth)
library(ggplot2)
qplot(dose, len, data = ToothGrowth, color = supp, geom = "point") + geom_smooth(method = "lm") + labs(title = "ToothGrowth") + labs(x = "Dose of supplements", y = "Length of teeth")
1.the length of teeth goes up as the dose of supplements increases, which indicates that the supplements may help teeth growth. 2.at the same dose, OJ seems to incur a higher increase of teeth growth than VC. 3.the slope of OJ is not as steep as the slope of VC, meaning an increase in VC may make a larger increase in teeth length than in OJ.
summary(ToothGrowth)
len supp dose
Min. : 4.20 OJ:30 Min. :0.500
1st Qu.:13.07 VC:30 1st Qu.:0.500
Median :19.25 Median :1.000
Mean :18.81 Mean :1.167
3rd Qu.:25.27 3rd Qu.:2.000
Max. :33.90 Max. :2.000
If we assume the data is normally distributed, we have a null hypothesis that there is no difference between the mean under each kind of supplements, or each dose of the supplements:
t.test(x = ToothGrowth$len, data = ToothGrowth, paired = FALSE, conf.level = 0.95)$conf.
[1] 16.83731 20.78936
attr(,"conf.level")
[1] 0.95
We will be able to construct a confidence interval that 95% of the time, an interval between 16.84 and 20.79 will contain the true mean of the population.
Then we calculate the mean under each kind of supplements, and each dose of the supplements
summary(ToothGrowth[ToothGrowth$supp == "OJ", ]$len)[4]
Mean
20.66333
summary(ToothGrowth[ToothGrowth$supp == "VC", ]$len)[4]
Mean
16.96333
Thus the mean of teeth growth after taking OJ is 20.66; the mean of teeth growth after taking VC is 16.96. Both the two are within the confidence interval. We fail to reject the null hypothesis that there is not a difference in teeth growth after taking the two kinds of supplements.
summary(ToothGrowth[ToothGrowth$dose == 0.5, ]$len)[4]
Mean
10.605
summary(ToothGrowth[ToothGrowth$dose == 1, ]$len)[4]
Mean
19.735
summary(ToothGrowth[ToothGrowth$dose == 2, ]$len)[4]
Mean
26.1
Thus the mean of teeth growth after taking dose of 0.5 is 10.6; the mean of teeth growth after taking dose of 1.0 is 19.74; the mean of teeth growth after taking dose of 2.0 is 26.1. We are able to reject the null hypotheis, and there is a difference in teeth growth between each dose of supplements.
Now we know the data is not normally distributed under each dose, along with this conclusion, we may assume the data is normally distributed within each dose. Then we will be able to compare the teeth growth between each supplements under each dose.
dose05 <- ToothGrowth[ToothGrowth$dose == 0.5, ]
t.test(x = dose05$len, paired = FALSE, conf.level = 0.95)$conf.
[1] 8.499046 12.710954
attr(,"conf.level")
[1] 0.95
mean(dose05[dose05$supp == "VC", ]$len)
[1] 7.98
mean(dose05[dose05$supp == "OJ", ]$len)
[1] 13.23
Thus under the dose of 0.5, there are 95% of the time that a confidence interval between 8.50 and 12.71 will contain the true population mean. We also know that the mean of teeth growth after taking 0.5 dose of VC is 7.98 and the mean of teeth growth after taking 0.5 dose of OJ is 13.23. We rejected the null hypo.
dose10 <- ToothGrowth[ToothGrowth$dose == 1, ]
t.test(x = dose10$len, paired = FALSE, conf.level = 0.95)$conf.
[1] 17.66851 21.80149
attr(,"conf.level")
[1] 0.95
mean(dose10[dose10$supp == "VC", ]$len)
[1] 16.77
mean(dose10[dose10$supp == "OJ", ]$len)
[1] 22.7
Thus under the dose of 1.0, there are 95% of the time that a confidence interval between 17.67 and 21.80 will contain the true population mean. We also know that the mean of teeth growth after taking 1.0 dose of VC is 16.77 and the mean of teeth growth after taking 1.0 dose of OJ is 22.7. We rejected the null hypo.
dose20 <- ToothGrowth[ToothGrowth$dose == 2, ]
t.test(x = dose20$len, paired = FALSE, conf.level = 0.95)$conf.
[1] 24.33364 27.86636
attr(,"conf.level")
[1] 0.95
mean(dose20[dose20$supp == "VC", ]$len)
[1] 26.14
mean(dose20[dose20$supp == "OJ", ]$len)
[1] 26.06
Thus under the dose of 2.0, there are 95% of the time that a confidence interval between 24.33 and 27.87 will contain the true population mean. We also know that the mean of teeth growth after taking 2.0 dose of VC is 26.14 and the mean of teeth growth after taking 2.0 dose of OJ is 26.06. Both of them are within the confidence interval. We fail to reject the null hypothesis.
The conclusion is when the dose is 0.5 or 1.0 there is a difference between the teeth growth after taking OJ and VC, while when the dose is 2.0, there is no difference between the teeth growth after taking OJ and VC. The assumption needed is we first assumed the whole population is normally distributed, then we assumed the population is normally distributed under each dose.