Overview: This part is going to execute simulations and data analysises to illustrate application of the central limit theorem. R programming will be the major tool to realize the mentioned goal.
knitr::opts_chunk$set(echo = TRUE)
lambda <- 0.2
simData <- matrix(rexp(1000*40, lambda), nrow = 1000, ncol = 40)
distMean <- apply(simData, 1, mean)
hist(distMean, breaks = 50, main = "The distribution of 1000 averages of 40 random exponentials", xlab = "Value of means", ylab = "Frequency of means", col = "pink")
abline(v = 1/lambda, lty = 1, lwd = 5, col = "red")
legend("topright", lty = 1, lwd = 5, col = "red", legend = "theoretical mean")
The simulated sample means are normally distributed with a center very close to the theoretical mean.
distVar <- apply(simData, 1, var)
hist(distVar, breaks = 50, main = "The distribution of 1000 variance of 40 random exponentials", xlab = "Value of variances", ylab = "Frequency of variance", col = "light blue")
abline(v = (1/lambda)^2, lty = 1, lwd = 5, col = "blue")
legend("topright", lty = 1, lwd = 5, col = "blue", legend = "theoretical variance")
The simulated sample variances are almost normally distributed with a center near the theoretical variance.
par(mfrow = c(3, 1))
hist(simData, breaks = 50, main = "Distribution of exponentials with lambda equals to 0.2", xlab = "Exponentials", col = "yellow")
hist(distMean, breaks = 50, main = "The distribution of 1000 averages of 40 random exponentials", xlab = "Value of means", ylab = "Frequency of means", col = "pink")
simNorm <- rnorm(1000, mean = mean(distMean), sd = sd(distMean))
hist(simNorm, breaks = 50, main = "A normal distribution with theoretical mean and sd of the exponentials", xlab = "Normal variables", col = "light green")
The first histogram is the distribution of the exponentials with lambda equals to 0.2. The second histogram is the distribution of 1000 averages of 40 random exponentials. The third histogram is a real normal distribution with a mean and standard deviation equals to the second histogram’s.Comparing the first with the second histogram, we can see the distrubution becames normal as the means were taken from each groups. It is a result of the central limit theorem. Comparing the second and the third histogram, we can see the distribution of the means is similar to a real normal distribution with the same mean and standard deviation.
Overview: In this part we will do some statistical data analysises about the Toothlength data. ###Question 1: Load the ToothGrowth data and perform some basic exploratory data analyses.
library(stats)
data(ToothGrowth)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.2
qplot(dose, len, data = ToothGrowth, color = supp, geom = "point") + geom_smooth(method = "lm") + labs(title = "ToothGrowth") + labs(x = "Dose of supplements", y = "Length of teeth")
Overally, the spread of the length data is as follows, with a mean of 18.81 and a standard deviation of 7.65.
summary(ToothGrowth)
## len supp dose
## Min. : 4.20 OJ:30 Min. :0.500
## 1st Qu.:13.07 VC:30 1st Qu.:0.500
## Median :19.25 Median :1.000
## Mean :18.81 Mean :1.167
## 3rd Qu.:25.27 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
sd(ToothGrowth$len)
## [1] 7.649315
The summary of the length data while taking OJ as supplement is as follows, with a mean of 20.66 and a standard deviation of 6.61.
summary(ToothGrowth[ToothGrowth$supp == "OJ", ])
## len supp dose
## Min. : 8.20 OJ:30 Min. :0.500
## 1st Qu.:15.53 VC: 0 1st Qu.:0.500
## Median :22.70 Median :1.000
## Mean :20.66 Mean :1.167
## 3rd Qu.:25.73 3rd Qu.:2.000
## Max. :30.90 Max. :2.000
sd(ToothGrowth[ToothGrowth$supp == "OJ", ]$len)
## [1] 6.605561
The summary of the length data while taking VC as supplement is as follows, with a mean of 16.96 and a standard deviation of 8.27.
summary(ToothGrowth[ToothGrowth$supp == "VC", ])
## len supp dose
## Min. : 4.20 OJ: 0 Min. :0.500
## 1st Qu.:11.20 VC:30 1st Qu.:0.500
## Median :16.50 Median :1.000
## Mean :16.96 Mean :1.167
## 3rd Qu.:23.10 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
sd(ToothGrowth[ToothGrowth$supp == "VC", ]$len)
## [1] 8.266029
To see a better spread of the data, we may look at the boxplots under different values of dose and different kinds of supplements.
g <- ggplot(ToothGrowth, aes(dose, len, group = supp))
g + geom_boxplot() + facet_grid(supp ~ dose) + labs(x = "Dose of supplements", y = "Length of teeth growth")
t.test(x = ToothGrowth$len, data = ToothGrowth, paired = FALSE, conf.level = 0.95)$conf.
## [1] 16.83731 20.78936
## attr(,"conf.level")
## [1] 0.95
We will be able to construct a confidence interval that 95% of the time, an interval between 16.84 and 20.79 will contain the true mean of the population.
Then we calculate the mean under each kind of supplements, and each dose of the supplements:
summary(ToothGrowth[ToothGrowth$supp == "OJ", ]$len)[4]
## Mean
## 20.66
summary(ToothGrowth[ToothGrowth$supp == "VC", ]$len)[4]
## Mean
## 16.96
Thus the mean of teeth growth after taking OJ is 20.66; the mean of teeth growth after taking VC is 16.96. Both the two are within the confidence interval. We fail to reject the null hypothesis that there is not a difference in teeth growth after taking the two kinds of supplements.
summary(ToothGrowth[ToothGrowth$dose == 0.5, ]$len)[4]
## Mean
## 10.6
summary(ToothGrowth[ToothGrowth$dose == 1, ]$len)[4]
## Mean
## 19.74
summary(ToothGrowth[ToothGrowth$dose == 2, ]$len)[4]
## Mean
## 26.1
Thus the mean of teeth growth after taking dose of 0.5 is 10.6; the mean of teeth growth after taking dose of 1.0 is 19.74; the mean of teeth growth after taking dose of 2.0 is 26.1. We are able to reject the null hypotheis, and there is a difference in teeth growth between each dose of supplements.
dose05 <- ToothGrowth[ToothGrowth$dose == 0.5, ]
t.test(x = dose05$len, paired = FALSE, conf.level = 0.95)$conf.
## [1] 8.499046 12.710954
## attr(,"conf.level")
## [1] 0.95
mean(dose05[dose05$supp == "VC", ]$len)
## [1] 7.98
mean(dose05[dose05$supp == "OJ", ]$len)
## [1] 13.23
Thus under the dose of 0.5, there are 95% of the time that a confidence interval between 8.50 and 12.71 will contain the true population mean. We also know that the mean of teeth growth after taking 0.5 dose of VC is 7.98 and the mean of teeth growth after taking 0.5 dose of OJ is 13.23. We rejected the null hypo.
dose10 <- ToothGrowth[ToothGrowth$dose == 1, ]
t.test(x = dose10$len, paired = FALSE, conf.level = 0.95)$conf.
## [1] 17.66851 21.80149
## attr(,"conf.level")
## [1] 0.95
mean(dose10[dose10$supp == "VC", ]$len)
## [1] 16.77
mean(dose10[dose10$supp == "OJ", ]$len)
## [1] 22.7
Thus under the dose of 1.0, there are 95% of the time that a confidence interval between 17.67 and 21.80 will contain the true population mean. We also know that the mean of teeth growth after taking 1.0 dose of VC is 16.77 and the mean of teeth growth after taking 1.0 dose of OJ is 22.7. We rejected the null hypo.
dose20 <- ToothGrowth[ToothGrowth$dose == 2, ]
t.test(x = dose20$len, paired = FALSE, conf.level = 0.95)$conf.
## [1] 24.33364 27.86636
## attr(,"conf.level")
## [1] 0.95
mean(dose20[dose20$supp == "VC", ]$len)
## [1] 26.14
mean(dose20[dose20$supp == "OJ", ]$len)
## [1] 26.06
Thus under the dose of 2.0, there are 95% of the time that a confidence interval between 24.33 and 27.87 will contain the true population mean. We also know that the mean of teeth growth after taking 2.0 dose of VC is 26.14 and the mean of teeth growth after taking 2.0 dose of OJ is 26.06. Both of them are within the confidence interval. We fail to reject the null hypothesis.
The conclusion is when the dose is 0.5 or 1.0 there is a difference between the teeth growth after taking OJ and VC, while when the dose is 2.0, there is no difference between the teeth growth after taking OJ and VC. The assumption needed is we first assumed the whole population is normally distributed, then we assumed the population is normally distributed under each dose.