@ Yiyang Zhao
This is a Coursera course project for Statistical Inference in the Data Science Specialization by Johns Hopkins University.
This part of the project investigates the exponential distribution by simulation and compares the distribution of its sample means with what the Central Limit Theorem predicts.
# Warm-up with the uniform distribution: 1000 raw draws ...
hist(runif(1000))
# ... versus the means of 1000 samples of 40 draws each
mns <- replicate(1000, mean(runif(40)))
hist(mns)
# Setting the parameters
lambda <- 0.2
n <- 40
# Take a look at the original exponential distribution
x <- rexp(1000, lambda)
hist(x, 50, col = "gray80", border = 0, main = "Histogram of the Exponential Distribution", xlab = "Values of the Random Variables")
# Calculate the mean and variance
mx <- mean(x)
vx <- var(x)
# Sampling from the exponential distribution 1000 times, each with 40 variables
x1000 <- sapply(rep(n, 1000), rexp, lambda)
x1000_m <- colMeans(x1000)
hist(x1000_m, 50, col = "gray80", border = 0, main = "Histogram of the Means of the 1000 Samples", xlab = "Values of the Sample Means")
# Calculate the mean and variance
m1000 <- mean(x1000_m)
v1000 <- var(x1000_m)
mat <- matrix(c(1/lambda, 1/(lambda)^2, mx, vx, m1000, v1000), nrow=3, ncol=2, byrow = TRUE)
dimnames(mat) <- list(c("Theoretical Expectation", "Random Variables", "Samples of 40"), c("Mean", "Variance"))
print(mat)
##                             Mean   Variance
## Theoretical Expectation 5.000000 25.0000000
## Random Variables        5.094366 25.9633321
## Samples of 40           5.036862  0.6544547
The simulation histograms show that the original distribution is positively skewed, while the distribution of the sample means is approximately normal: it is nearly symmetrical about 5, with roughly half of its area on either side of 5.
The mean of the sample means is approximately equal to \(1/\lambda = 5\), i.e. the mean of the original exponential distribution. The variance of the sample means, however, is only about 1/40 of the original variance, because each sample averages \(n = 40\) random variables: \(\sigma^2 / n = 25 / 40 = 0.625\), close to the simulated value of 0.654.
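The 1/40 factor can be checked directly against the formula \(\sigma^2 / n\). The following is a minimal sketch that reuses lambda = 0.2 and n = 40 from above; the seed and the names `theo_var_mean` and `sim_means` are illustrative assumptions, not part of the original analysis, so the simulated value will differ slightly from the table.

```r
# Sketch: compare the simulated variance of the sample mean with sigma^2 / n.
# The seed and variable names are illustrative assumptions.
set.seed(1)

lambda <- 0.2   # rate of the exponential distribution
n      <- 40    # observations per sample
n_sim  <- 1000  # number of simulated samples

# Theoretical values for the exponential distribution
theo_mean     <- 1 / lambda      # E[X]
theo_var      <- 1 / lambda^2    # Var(X)
theo_var_mean <- theo_var / n    # variance of the mean of n observations

# Simulate n_sim sample means, each from n exponential draws
sim_means <- replicate(n_sim, mean(rexp(n, lambda)))

print(c(theoretical = theo_var_mean, simulated = var(sim_means)))
```

The two printed numbers should be close to each other, and both should be roughly 1/40 of the original variance of 25.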
# Load the data.
library(datasets)
data(ToothGrowth)
tg <- ToothGrowth
# Take a look at the data.
head(tg)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5
# Obtain a basic summary of the data.
summary(tg)
##       len        supp         dose
##  Min.   : 4.20   OJ:30   Min.   :0.500
##  1st Qu.:13.07   VC:30   1st Qu.:0.500
##  Median :19.25           Median :1.000
##  Mean   :18.81           Mean   :1.167
##  3rd Qu.:25.27           3rd Qu.:2.000
##  Max.   :33.90           Max.   :2.000
# Briefly understand the len-supp and len-dose relationships.
boxplot(len ~ supp, data = tg, main="Tooth Length by Supps",
xlab="supp", ylab="len")
boxplot(len ~ dose, data = tg, main="Tooth Length by Doses",
xlab="dose", ylab="len")
From the boxplots above, OJ appears to result in a greater tooth length than VC. We therefore conduct a one-tailed test of whether the mean tooth length under OJ exceeds that under VC.
\(\alpha = 0.05\)
\(H_0: \mu_{oj} - \mu_{vc} = 0\)
\(H_a: \mu_{oj} - \mu_{vc} > 0\)
t.test(len ~ supp, data = tg, alternative = "greater", paired = FALSE, var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: len by supp
## t = 1.9153, df = 55.309, p-value = 0.03032
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 0.4682687 Inf
## sample estimates:
## mean in group OJ mean in group VC
## 20.66333 16.96333
Since 0.03032 < 0.05 (i.e. p-value < alpha), the null hypothesis is rejected.
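The comparison of the p-value with alpha can also be done programmatically, since `t.test()` returns an object whose `p.value` element holds the value printed above. This is a minimal sketch; the names `res`, `p_value` and `reject_h0` are illustrative, and `paired` is left at its default (unpaired).

```r
# Sketch: read the test decision off the object returned by t.test().
# Variable names are illustrative assumptions.
library(datasets)
tg    <- ToothGrowth
alpha <- 0.05

# One-tailed Welch test: is mean len under OJ greater than under VC?
res <- t.test(len ~ supp, data = tg, alternative = "greater",
              var.equal = FALSE)

p_value   <- res$p.value       # the p-value shown in the printed output
reject_h0 <- p_value < alpha   # TRUE when the null hypothesis is rejected

cat("p-value:", signif(p_value, 4), "- reject H0:", reject_h0, "\n")
```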
Note that in the above hypothesis test the other variable, dose, is not held constant. Since we do not know in advance which supplement gives the greater mean tooth length at each individual dose, we conduct a two-tailed test within each dose level.
\(\alpha = 0.05\)
\(H_0: \mu_{oj} - \mu_{vc} = 0\)
\(H_a: \mu_{oj} - \mu_{vc} \neq 0\)
§ 2.1 dose = 0.5
dose1 <- tg[tg$dose == 0.5, ]
t.test(len ~ supp, data = dose1, paired = FALSE, var.equal=FALSE)
##
## Welch Two Sample t-test
##
## data: len by supp
## t = 3.1697, df = 14.969, p-value = 0.006359
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.719057 8.780943
## sample estimates:
## mean in group OJ mean in group VC
## 13.23 7.98
Since 0.006359 < 0.05 (i.e. p-value < alpha), the null hypothesis is rejected.
§ 2.2 dose = 1.0
dose2 <- tg[tg$dose == 1.0, ]
t.test(len ~ supp, data = dose2, paired = FALSE, var.equal=FALSE)
##
## Welch Two Sample t-test
##
## data: len by supp
## t = 4.0328, df = 15.358, p-value = 0.001038
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.802148 9.057852
## sample estimates:
## mean in group OJ mean in group VC
## 22.70 16.77
Since 0.001038 < 0.05 (i.e. p-value < alpha), the null hypothesis is rejected.
§ 2.3 dose = 2.0
dose3 <- tg[tg$dose == 2.0, ]
t.test(len ~ supp, data = dose3, paired = FALSE, var.equal=FALSE)
##
## Welch Two Sample t-test
##
## data: len by supp
## t = -0.046136, df = 14.04, p-value = 0.9639
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.79807 3.63807
## sample estimates:
## mean in group OJ mean in group VC
## 26.06 26.14
Since 0.9639 > 0.05 (i.e. p-value > alpha), the null hypothesis cannot be rejected.
Pooled over all doses, the null hypothesis is rejected: there is a statistically significant relation between the supplement and tooth length, with OJ giving a greater mean length than VC. This conclusion does not hold uniformly once dose is taken into account. The difference remains significant at dose = 0.5 and dose = 1.0, but at dose = 2.0 the mean tooth lengths for the two supplements are nearly equal (26.06 vs. 26.14) and the null hypothesis cannot be rejected. The effect of supp on len therefore appears to depend on dose, and the pooled test alone is not conclusive.
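The three per-dose tests can be collected into a single vector of p-values, which makes the contrast between the low doses and dose = 2.0 easier to see. This is a sketch under the same settings as the tests above (two-sided Welch tests); the name `p_by_dose` is an illustrative assumption.

```r
# Sketch: repeat the two-sided Welch test of len by supp within each dose.
# The name p_by_dose is an illustrative assumption.
library(datasets)
tg <- ToothGrowth

doses <- sort(unique(tg$dose))   # 0.5, 1.0, 2.0

# One p-value per dose level
p_by_dose <- sapply(doses, function(d) {
  t.test(len ~ supp, data = tg[tg$dose == d, ], var.equal = FALSE)$p.value
})
names(p_by_dose) <- doses

print(signif(p_by_dose, 4))
```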
From the boxplots above, tooth length appears to increase as the dose increases. We therefore conduct one-tailed tests between consecutive dose levels.
\(\alpha = 0.05\)
dlow <- tg[tg$dose == 0.5, ]
dmed <- tg[tg$dose == 1.0, ]
dhigh <- tg[tg$dose == 2.0, ]
[Part 1]
\(H_0: \mu_{d_{med}} - \mu_{d_{low}} = 0\)
\(H_a: \mu_{d_{med}} - \mu_{d_{low}} > 0\)
t.test(dmed$len, dlow$len, alternative = "greater", paired = FALSE, var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: dmed$len and dlow$len
## t = 6.4766, df = 37.986, p-value = 6.342e-08
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 6.753323 Inf
## sample estimates:
## mean of x mean of y
## 19.735 10.605
Since 6.342e-08 < 0.05 (i.e. p-value < alpha), the null hypothesis is rejected.
[Part 2]
\(H_0: \mu_{d_{high}} - \mu_{d_{med}} = 0\)
\(H_a: \mu_{d_{high}} - \mu_{d_{med}} > 0\)
t.test(dhigh$len, dmed$len, alternative = "greater", paired = FALSE, var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: dhigh$len and dmed$len
## t = 4.9005, df = 37.101, p-value = 9.532e-06
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 4.17387 Inf
## sample estimates:
## mean of x mean of y
## 26.100 19.735
Since 9.532e-06 < 0.05 (i.e. p-value < alpha), the null hypothesis is rejected.
Therefore, there is a statistically significant relationship between the dose and the tooth length.
Note that in the above hypothesis tests the other variable, supp, is not held constant. The tests are therefore repeated within each supplement.
§ 2.1 supp = OJ
dlow_oj <- tg[tg$dose == 0.5 & tg$supp == "OJ", ]
dmed_oj <- tg[tg$dose == 1.0 & tg$supp == "OJ", ]
dhigh_oj <- tg[tg$dose == 2.0 & tg$supp == "OJ", ]
\(\alpha = 0.05\)
[Part 1]
\(H_0: \mu_{d_{med,oj}} - \mu_{d_{low,oj}} = 0\)
\(H_a: \mu_{d_{med,oj}} - \mu_{d_{low,oj}} > 0\)
t.test(dmed_oj$len, dlow_oj$len, alternative = "greater", paired = FALSE, var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: dmed_oj$len and dlow_oj$len
## t = 5.0486, df = 17.698, p-value = 4.392e-05
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 6.214316 Inf
## sample estimates:
## mean of x mean of y
## 22.70 13.23
Since 4.392e-05 < 0.05 (i.e. p-value < alpha), the null hypothesis is rejected.
[Part 2]
\(H_0: \mu_{d_{high,oj}} - \mu_{d_{med,oj}} = 0\)
\(H_a: \mu_{d_{high,oj}} - \mu_{d_{med,oj}} > 0\)
t.test(dhigh_oj$len, dmed_oj$len, alternative = "greater", paired = FALSE, var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: dhigh_oj$len and dmed_oj$len
## t = 2.2478, df = 15.842, p-value = 0.0196
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 0.7486236 Inf
## sample estimates:
## mean of x mean of y
## 26.06 22.70
Since 0.0196 < 0.05 (i.e. p-value < alpha), the null hypothesis is rejected.
Therefore, under OJ supp, there is a statistically significant relationship between the dose and the tooth length.
§ 2.2 supp = VC
dlow_vc <- tg[tg$dose == 0.5 & tg$supp == "VC", ]
dmed_vc <- tg[tg$dose == 1.0 & tg$supp == "VC", ]
dhigh_vc <- tg[tg$dose == 2.0 & tg$supp == "VC", ]
\(\alpha = 0.05\)
[Part 1]
\(H_0: \mu_{d_{med,vc}} - \mu_{d_{low,vc}} = 0\)
\(H_a: \mu_{d_{med,vc}} - \mu_{d_{low,vc}} > 0\)
t.test(dmed_vc$len, dlow_vc$len, alternative = "greater", paired = FALSE, var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: dmed_vc$len and dlow_vc$len
## t = 7.4634, df = 17.862, p-value = 3.406e-07
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 6.746867 Inf
## sample estimates:
## mean of x mean of y
## 16.77 7.98
Since 3.406e-07 < 0.05 (i.e. p-value < alpha), the null hypothesis is rejected.
[Part 2]
\(H_0: \mu_{d_{high,vc}} - \mu_{d_{med,vc}} = 0\)
\(H_a: \mu_{d_{high,vc}} - \mu_{d_{med,vc}} > 0\)
t.test(dhigh_vc$len, dmed_vc$len, alternative = "greater", paired = FALSE, var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: dhigh_vc$len and dmed_vc$len
## t = 5.4698, df = 13.6, p-value = 4.578e-05
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 6.346525 Inf
## sample estimates:
## mean of x mean of y
## 26.14 16.77
Since 4.578e-05 < 0.05 (i.e. p-value < alpha), the null hypothesis is rejected.
Therefore, under VC supp, there is a statistically significant relationship between the dose and the tooth length.
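The four within-supplement tests follow the same pattern, so they can be written once and applied to each supplement. This is a sketch only; the helper `dose_step_p` and the result name `p_steps` are hypothetical names introduced here, not part of the original analysis.

```r
# Sketch: one-tailed Welch tests between consecutive doses, per supplement.
# dose_step_p and p_steps are hypothetical names for illustration.
library(datasets)
tg <- ToothGrowth

# p-value for H_a: mean len at dose `hi` is greater than at dose `lo`,
# using only the rows of data frame `d`
dose_step_p <- function(d, lo, hi) {
  t.test(d$len[d$dose == hi], d$len[d$dose == lo],
         alternative = "greater", var.equal = FALSE)$p.value
}

# Apply both dose steps within each supplement (columns: OJ, VC)
p_steps <- sapply(split(tg, tg$supp), function(d) {
  c(low_to_med  = dose_step_p(d, 0.5, 1.0),
    med_to_high = dose_step_p(d, 1.0, 2.0))
})

print(signif(p_steps, 4))
```

Each column of the printed matrix corresponds to one supplement and each row to one dose step, matching Part 1 and Part 2 above.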
The null hypothesis can be rejected. There is a statistically significant relation between the dose and the length of the tooth. More specifically, teeth treated with a higher dose have a greater mean length.
At doses of 0.5 and 1.0, the OJ supplement results in significantly greater tooth growth than VC at the same dose. At a dose of 2.0, the OJ and VC supplements do not result in significantly different tooth growth. Increasing the supplement dose significantly increases tooth growth.
These conclusions rest on the following assumptions: the variances of the populations being compared are not equal, the samples are not paired, and no other variables affect the tooth length.
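The unequal-variance assumption can be inspected informally by looking at the sample variance of len in each supp-by-dose group. This is a descriptive sketch only, not a formal test of equal variances; the name `cell_var` is an illustrative assumption.

```r
# Sketch: sample variance of len for each combination of supp and dose.
# This is a descriptive check, not a formal test; cell_var is an
# illustrative name.
library(datasets)
tg <- ToothGrowth

# Rows: supplement (OJ, VC); columns: dose (0.5, 1, 2)
cell_var <- tapply(tg$len, list(tg$supp, tg$dose), var)

print(round(cell_var, 2))
```

Visibly different variances across groups are consistent with using Welch tests (`var.equal = FALSE`) throughout.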