@ Yiyang Zhao

Project Overview

This is a Coursera course project for Statistical Inference in the Data Science Specialization by Johns Hopkins University.

Part 1 - A Simulation Exercise

This part of the project investigates the exponential distribution by simulation and compares the distribution of its sample means with what the Central Limit Theorem predicts.

Motivating Example

hist(runif(1000))

mns = NULL
for (i in 1 : 1000) mns = c(mns, mean(runif(40)))
hist(mns)

Exponential Distribution

# Setting the parameters
lambda <- 0.2
n <- 40

# Take a look at the original exponential distribution
x <- rexp(1000, lambda)
hist(x, 50, col = "gray80", border = 0, main = "Histogram of the Exponential Distribution", xlab = "Values of the Random Variables")

# Calculate the mean and variance
mx <- mean(x)
vx <- var(x)

# Draw 1000 samples from the exponential distribution, each of size 40 (one sample per column)
x1000 <- sapply(rep(n, 1000), rexp, lambda)
x1000_m <- colMeans(x1000)
hist(x1000_m, 50, col = "gray80", border = 0, main = "Histogram of the Mean of the 1000 Samples", xlab = "Values of the Samples Means")

# Calculate the mean and variance of the 1000 sample means
m1000 <- mean(x1000_m)
v1000 <- var(x1000_m)


mat <- matrix(c(1/lambda, 1/(lambda)^2, mx, vx, m1000, v1000), nrow=3, ncol=2, byrow = TRUE)
dimnames(mat) <- list(c("Theoretical Expectation", "Random Variables", "Samples of 40"), c("Mean", "Variance"))
print(mat)

##                             Mean   Variance
## Theoretical Expectation 5.000000 25.0000000
## Random Variables        5.094366 25.9633321
## Samples of 40           5.036862  0.6544547

Conclusion

The simulation histograms show that the original distribution is positively skewed, while the distribution of the sample means is approximately normal: it is nearly symmetric about 5, with about half of the area on each side.
The mean of the sample means (5.037) is approximately equal to 1/lambda = 5, the mean of the original exponential distribution. The variance of the sample means (0.654) is only about 1/40 of the original variance (25), since each sample mean averages 40 random variables.
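
The 1/40 ratio follows from the variance of a mean of n independent draws, which is the population variance divided by n, here (1/lambda)^2 / n = 25/40 = 0.625. The following is a minimal sketch that checks this numerically. The seed value and the names `theo_var_mean` and `sim_means` are illustrative assumptions and are not part of the original analysis, so the simulated figure will differ slightly from the table above.

```r
# Illustrative check; the seed value is an arbitrary assumption.
set.seed(2024)
lambda <- 0.2
n <- 40

# Theoretical variance of the mean of n exponential draws: (1/lambda)^2 / n
theo_var_mean <- (1 / lambda)^2 / n

# Simulate 1000 sample means, each computed from n exponential draws
sim_means <- replicate(1000, mean(rexp(n, lambda)))

# Compare the theoretical value with the simulated one
c(theoretical = theo_var_mean, simulated = var(sim_means))
```

The simulated variance should land close to the theoretical value, and closer still as the number of simulated means grows.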

Part 2 - Basic Inferential Data Analysis Instructions

Data Summary

# Load the data.
library(datasets)
data(ToothGrowth)
tg <- ToothGrowth

# Take a look at the data.
head(tg)

##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

# Obtain a basic summary of the data.
summary(tg)

##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

# Briefly understand the len-supp and len-dose relationships.
boxplot(len ~ supp, data = tg, main="Tooth Length by Supps", 
   xlab="supp", ylab="len")

boxplot(len ~ dose, data = tg, main="Tooth Length by Doses", 
   xlab="dose", ylab="len")

Hypothesis Testing

I. There is no relation between the supplement and the length of the tooth.

1. Basic Analysis

From the boxplots above, OJ appears to result in a greater tooth length than VC, so we conduct a one-tailed test.
\(\alpha = 0.05\)
\(H_0: \mu_{oj} - \mu_{vc} = 0\)
\(H_a: \mu_{oj} - \mu_{vc} > 0\)

t.test(len ~ supp, data = tg, alternative = "greater", var.equal = FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 1.9153, df = 55.309, p-value = 0.03032
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.4682687       Inf
## sample estimates:
## mean in group OJ mean in group VC 
##         20.66333         16.96333

Since 0.03032 < 0.05 (i.e. p-value < alpha), the null hypothesis is rejected.
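
To make the test above less of a black box, the Welch statistic can be recomputed by hand. This is a sketch rather than the author's method: the names `se2_oj`, `se2_vc`, `t_stat`, `welch_df` and `p_value` are introduced here for illustration, and the formulas are the standard Welch t statistic and Welch-Satterthwaite degrees of freedom.

```r
library(datasets)
tg <- ToothGrowth

# Tooth lengths for each supplement
oj <- tg$len[tg$supp == "OJ"]
vc <- tg$len[tg$supp == "VC"]

# Squared standard error of each group mean: s^2 / n
se2_oj <- var(oj) / length(oj)
se2_vc <- var(vc) / length(vc)

# Welch t statistic for the difference in means (OJ minus VC)
t_stat <- (mean(oj) - mean(vc)) / sqrt(se2_oj + se2_vc)

# Welch-Satterthwaite approximation to the degrees of freedom
welch_df <- (se2_oj + se2_vc)^2 /
  (se2_oj^2 / (length(oj) - 1) + se2_vc^2 / (length(vc) - 1))

# One-sided p-value for the alternative mu_oj - mu_vc > 0
p_value <- pt(t_stat, welch_df, lower.tail = FALSE)
c(t = t_stat, df = welch_df, p = p_value)
```

These values should agree with the `t`, `df` and `p-value` lines reported by `t.test` above.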

2. More Accurate Analysis

Note that in the above hypothesis test, the other variable, dose, is not held constant. Since we are unsure which supplement gives the greater mean tooth length once dose is taken into account, we conduct a two-tailed test at each dose level.
\(\alpha = 0.05\)
\(H_0: \mu_{oj} - \mu_{vc} = 0\)
\(H_a: \mu_{oj} - \mu_{vc} \neq 0\)

§ 2.1 dose = 0.5

dose1 <- tg[tg$dose == 0.5, ]
t.test(len ~ supp, data = dose1, var.equal = FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 3.1697, df = 14.969, p-value = 0.006359
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.719057 8.780943
## sample estimates:
## mean in group OJ mean in group VC 
##            13.23             7.98

Since 0.006359 < 0.05 (i.e. p-value < alpha), the null hypothesis is rejected.

§ 2.2 dose = 1.0

dose2 <- tg[tg$dose == 1.0, ]
t.test(len ~ supp, data = dose2, var.equal = FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 4.0328, df = 15.358, p-value = 0.001038
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.802148 9.057852
## sample estimates:
## mean in group OJ mean in group VC 
##            22.70            16.77

Since 0.001038 < 0.05 (i.e. p-value < alpha), the null hypothesis is rejected.

§ 2.3 dose = 2.0

dose3 <- tg[tg$dose == 2.0, ]
t.test(len ~ supp, data = dose3, var.equal = FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = -0.046136, df = 14.04, p-value = 0.9639
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.79807  3.63807
## sample estimates:
## mean in group OJ mean in group VC 
##            26.06            26.14

Since 0.9639 > 0.05 (i.e. p-value > alpha), the null hypothesis cannot be rejected.
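
The three per-dose tests above follow the same pattern, so they can be sketched as a single loop that collects the p-values side by side. The name `p_by_dose` is an illustrative assumption. The `paired` argument is left out because unpaired is the default.

```r
library(datasets)
tg <- ToothGrowth

# Dose levels present in the data (0.5, 1.0, 2.0)
doses <- sort(unique(tg$dose))

# Two-sided Welch test of len ~ supp within each dose level
p_by_dose <- sapply(doses, function(d) {
  t.test(len ~ supp, data = tg[tg$dose == d, ], var.equal = FALSE)$p.value
})
names(p_by_dose) <- doses

# Compare each p-value with alpha = 0.05
p_by_dose < 0.05
```

Under these assumptions the comparison is `TRUE` for the two lower doses and `FALSE` for dose = 2.0, matching the three decisions above.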

3. Conclusion

When dose is ignored, the null hypothesis is rejected: there is a statistically significant relation between the supplement and the length of the tooth, with teeth treated with OJ having a greater mean length than those treated with VC. However, this conclusion does not hold at every dose. The difference remains significant at dose = 0.5 and dose = 1.0, but at dose = 2.0 the mean tooth lengths for the two supplements are approximately equal (26.06 and 26.14), so there is no evidence of a relation between len and supp at that dose. The overall result is therefore not conclusive.

II. There is no relation between the dose and the length of the tooth.

1. Basic Analysis

From the boxplots above, tooth length appears to increase as the dose increases, so we conduct one-tailed tests between adjacent dose levels.
\(\alpha = 0.05\)

dlow <- tg[tg$dose == 0.5, ]
dmed <- tg[tg$dose == 1.0, ]
dhigh <- tg[tg$dose == 2.0, ]

[Part 1]
\(H_0: \mu_{d_{med}} - \mu_{d_{low}} = 0\)
\(H_a: \mu_{d_{med}} - \mu_{d_{low}} > 0\)

t.test(dmed$len, dlow$len, alternative = "greater", paired = FALSE, var.equal = FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  dmed$len and dlow$len
## t = 6.4766, df = 37.986, p-value = 6.342e-08
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  6.753323      Inf
## sample estimates:
## mean of x mean of y 
##    19.735    10.605

Since 6.342e-08 < 0.05 (i.e. p-value < alpha), the null hypothesis is rejected.

[Part 2]
\(H_0: \mu_{d_{high}} - \mu_{d_{med}} = 0\)
\(H_a: \mu_{d_{high}} - \mu_{d_{med}} > 0\)

t.test(dhigh$len, dmed$len, alternative = "greater", paired = FALSE, var.equal = FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  dhigh$len and dmed$len
## t = 4.9005, df = 37.101, p-value = 9.532e-06
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  4.17387     Inf
## sample estimates:
## mean of x mean of y 
##    26.100    19.735

Since 9.532e-06 < 0.05 (i.e. p-value < alpha), the null hypothesis is rejected.
Therefore, there is a statistically significant relationship between the dose and the tooth length.

2. More Accurate Analysis

Note that in the above hypothesis tests, the other variable, supp, is not held constant, so we repeat the tests within each supplement.

§ 2.1 supp = OJ

dlow_oj <- tg[tg$dose == 0.5 & tg$supp == "OJ", ]
dmed_oj <- tg[tg$dose == 1.0 & tg$supp == "OJ", ]
dhigh_oj <- tg[tg$dose == 2.0 & tg$supp == "OJ", ]

\(\alpha = 0.05\)

[Part 1]
\(H_0: \mu_{d_{med,oj}} - \mu_{d_{low,oj}} = 0\)
\(H_a: \mu_{d_{med,oj}} - \mu_{d_{low,oj}} > 0\)

t.test(dmed_oj$len, dlow_oj$len, alternative = "greater", paired = FALSE, var.equal = FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  dmed_oj$len and dlow_oj$len
## t = 5.0486, df = 17.698, p-value = 4.392e-05
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  6.214316      Inf
## sample estimates:
## mean of x mean of y 
##     22.70     13.23

Since 4.392e-05 < 0.05 (i.e. p-value < alpha), the null hypothesis is rejected.

[Part 2]
\(H_0: \mu_{d_{high,oj}} - \mu_{d_{med,oj}} = 0\)
\(H_a: \mu_{d_{high,oj}} - \mu_{d_{med,oj}} > 0\)

t.test(dhigh_oj$len, dmed_oj$len, alternative = "greater", paired = FALSE, var.equal = FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  dhigh_oj$len and dmed_oj$len
## t = 2.2478, df = 15.842, p-value = 0.0196
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.7486236       Inf
## sample estimates:
## mean of x mean of y 
##     26.06     22.70

Since 0.0196 < 0.05 (i.e. p-value < alpha), the null hypothesis is rejected.
Therefore, for the OJ supplement, there is a statistically significant relationship between the dose and the tooth length.

§ 2.2 supp = VC

dlow_vc <- tg[tg$dose == 0.5 & tg$supp == "VC", ]
dmed_vc <- tg[tg$dose == 1.0 & tg$supp == "VC", ]
dhigh_vc <- tg[tg$dose == 2.0 & tg$supp == "VC", ]

\(\alpha = 0.05\)

[Part 1]
\(H_0: \mu_{d_{med,vc}} - \mu_{d_{low,vc}} = 0\)
\(H_a: \mu_{d_{med,vc}} - \mu_{d_{low,vc}} > 0\)

t.test(dmed_vc$len, dlow_vc$len, alternative = "greater", paired = FALSE, var.equal = FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  dmed_vc$len and dlow_vc$len
## t = 7.4634, df = 17.862, p-value = 3.406e-07
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  6.746867      Inf
## sample estimates:
## mean of x mean of y 
##     16.77      7.98

Since 3.406e-07 < 0.05 (i.e. p-value < alpha), the null hypothesis is rejected.

[Part 2]
\(H_0: \mu_{d_{high,vc}} - \mu_{d_{med,vc}} = 0\)
\(H_a: \mu_{d_{high,vc}} - \mu_{d_{med,vc}} > 0\)

t.test(dhigh_vc$len, dmed_vc$len, alternative = "greater", paired = FALSE, var.equal = FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  dhigh_vc$len and dmed_vc$len
## t = 5.4698, df = 13.6, p-value = 4.578e-05
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  6.346525      Inf
## sample estimates:
## mean of x mean of y 
##     26.14     16.77

Since 4.578e-05 < 0.05 (i.e. p-value < alpha), the null hypothesis is rejected.

Therefore, for the VC supplement, there is a statistically significant relationship between the dose and the tooth length.
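
The four within-supplement comparisons (OJ and VC, each at two dose steps) can be sketched more compactly with one helper. The function `dose_p` and the `results` data frame are hypothetical names introduced for illustration. The tests themselves are the same one-sided Welch tests used above.

```r
library(datasets)
tg <- ToothGrowth

# Hypothetical helper: one-sided Welch test that, within one supplement,
# the higher dose gives a greater mean tooth length than the lower dose
dose_p <- function(supp, hi, lo) {
  x <- tg$len[tg$supp == supp & tg$dose == hi]
  y <- tg$len[tg$supp == supp & tg$dose == lo]
  t.test(x, y, alternative = "greater", var.equal = FALSE)$p.value
}

# Collect the four p-values in one table
results <- data.frame(
  supp = c("OJ", "OJ", "VC", "VC"),
  comparison = c("1.0 vs 0.5", "2.0 vs 1.0", "1.0 vs 0.5", "2.0 vs 1.0"),
  p_value = c(dose_p("OJ", 1.0, 0.5), dose_p("OJ", 2.0, 1.0),
              dose_p("VC", 1.0, 0.5), dose_p("VC", 2.0, 1.0))
)
results
```

Collecting the p-values in one table makes it easier to see that all four fall below alpha = 0.05, with the OJ step from 1.0 to 2.0 being the weakest.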

3. Conclusion

The null hypothesis can be rejected. There is a statistically significant relation between the dose and the length of the tooth. More specifically, teeth treated with a higher dose have a greater mean length, for both supplements.

Conclusions and Assumptions

Conclusions

At doses of 0.5 and 1.0, the OJ supplement results in significantly greater tooth growth than VC at the same dose. At a dose of 2.0, the OJ and VC supplements do not result in significantly different tooth growth. Increasing the dose significantly increases tooth growth.

Assumptions

The variances of the sampled populations are not assumed to be equal. The sample data are not paired. No other variables affect the tooth length.
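
The unequal-variance assumption is what `var.equal = FALSE` encodes in the tests above. As a rough, informal check, the sample variance of each supplement-by-dose group can be tabulated. This is a sketch for inspection only, not a formal test of equal variances, and `group_var` and `var_len` are illustrative names.

```r
library(datasets)
tg <- ToothGrowth

# Sample variance of len within each supplement-by-dose group
group_var <- aggregate(len ~ supp + dose, data = tg, FUN = var)
names(group_var)[3] <- "var_len"
group_var
```

Noticeable differences between the group variances would support using the Welch test rather than a pooled-variance test.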
