Part 2: Analyze the ToothGrowth data in the R datasets package Overview: In this part we will do some statistical data analysises about the Toothlength data.
Question 1: Load the ToothGrowth data and perform some basic exploratory data analyses.
library(stats)
data(ToothGrowth)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.3
qplot(dose, len, data = ToothGrowth, color = supp, geom = "point") + geom_smooth(method = "lm") + labs(title = "ToothGrowth") + labs(x = "Dose of supplements", y = "Length of teeth")
## `geom_smooth()` using formula 'y ~ x'
From the plot we can get the length of teeth goes up as the dose of supplements increases, which indicates that the supplements may help teeth growth. at the same dose, OJ seems to incur a higher increase of teeth growth than VC. the slope of OJ is not as steep as the slope of VC, meaning an increase in VC may make a larger increase in teeth length than in OJ. Question 2: Provide a basic summary of the data. Overally, the spread of the length data is as follows, with a mean of 18.81 and a standard deviation of 7.65.
summary(ToothGrowth)
## len supp dose
## Min. : 4.20 OJ:30 Min. :0.500
## 1st Qu.:13.07 VC:30 1st Qu.:0.500
## Median :19.25 Median :1.000
## Mean :18.81 Mean :1.167
## 3rd Qu.:25.27 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
sd(ToothGrowth$len)
## [1] 7.649315
The summary of the length data while taking OJ as supplement is as follows, with a mean of 20.66 and a standard deviation of 6.61.
summary(ToothGrowth[ToothGrowth$supp == "OJ", ])
## len supp dose
## Min. : 8.20 OJ:30 Min. :0.500
## 1st Qu.:15.53 VC: 0 1st Qu.:0.500
## Median :22.70 Median :1.000
## Mean :20.66 Mean :1.167
## 3rd Qu.:25.73 3rd Qu.:2.000
## Max. :30.90 Max. :2.000
sd(ToothGrowth[ToothGrowth$supp == "OJ", ]$len)
## [1] 6.605561
The summary of the length data while taking VC as supplement is as follows, with a mean of 16.96 and a standard deviation of 8.27.
summary(ToothGrowth[ToothGrowth$supp == "VC", ])
## len supp dose
## Min. : 4.20 OJ: 0 Min. :0.500
## 1st Qu.:11.20 VC:30 1st Qu.:0.500
## Median :16.50 Median :1.000
## Mean :16.96 Mean :1.167
## 3rd Qu.:23.10 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
sd(ToothGrowth[ToothGrowth$supp == "VC", ]$len)
## [1] 8.266029
To see a better spread of the data, we may look at the boxplots under different values of dose and different kinds of supplements.
g <- ggplot(ToothGrowth, aes(dose, len, group = supp))
g + geom_boxplot() + facet_grid(supp ~ dose) + labs(x = "Dose of supplements", y = "Length of teeth growth")
Question 3: Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose. (Only use the techniques from class, even if thereโs other approaches worth considering) If we assume the data is normally distributed, we have a null hypothesis that there is no difference between the mean under each kind of supplements, or each dose of the supplements:
t.test(x = ToothGrowth$len, data = ToothGrowth, paired = FALSE, conf.level = 0.95)$conf.
## [1] 16.83731 20.78936
## attr(,"conf.level")
## [1] 0.95
We will be able to construct a confidence interval that 95% of the time, an interval between 16.84 and 20.79 will contain the true mean of the population.
Then we calculate the mean under each kind of supplements, and each dose of the supplements:
summary(ToothGrowth[ToothGrowth$supp == "OJ", ]$len)[4]
## Mean
## 20.66333
summary(ToothGrowth[ToothGrowth$supp == "VC", ]$len)[4]
## Mean
## 16.96333
Thus the mean of teeth growth after taking OJ is 20.66; the mean of teeth growth after taking VC is 16.96. Both the two are within the confidence interval. We fail to reject the null hypothesis that there is not a difference in teeth growth after taking the two kinds of supplements.
summary(ToothGrowth[ToothGrowth$dose == 0.5, ]$len)[4]
## Mean
## 10.605
summary(ToothGrowth[ToothGrowth$dose == 1, ]$len)[4]
## Mean
## 19.735
summary(ToothGrowth[ToothGrowth$dose == 2, ]$len)[4]
## Mean
## 26.1
Thus the mean of teeth growth after taking dose of 0.5 is 10.6; the mean of teeth growth after taking dose of 1.0 is 19.74; the mean of teeth growth after taking dose of 2.0 is 26.1. We are able to reject the null hypotheis, and there is a difference in teeth growth between each dose of supplements.
Now we know the data is not normally distributed under each dose, along with this conclusion, we may assume the data is normally distributed within each dose. Then we will be able to compare the teeth growth between each supplements under each dose.
dose05 <- ToothGrowth[ToothGrowth$dose == 0.5, ]
t.test(x = dose05$len, paired = FALSE, conf.level = 0.95)$conf.
## [1] 8.499046 12.710954
## attr(,"conf.level")
## [1] 0.95
mean(dose05[dose05$supp == "VC", ]$len)
## [1] 7.98
mean(dose05[dose05$supp == "OJ", ]$len)
## [1] 13.23
Thus under the dose of 0.5, there are 95% of the time that a confidence interval between 8.50 and 12.71 will contain the true population mean. We also know that the mean of teeth growth after taking 0.5 dose of VC is 7.98 and the mean of teeth growth after taking 0.5 dose of OJ is 13.23. We rejected the null hypo.
dose10 <- ToothGrowth[ToothGrowth$dose == 1, ]
t.test(x = dose10$len, paired = FALSE, conf.level = 0.95)$conf.
## [1] 17.66851 21.80149
## attr(,"conf.level")
## [1] 0.95
mean(dose10[dose10$supp == "VC", ]$len)
## [1] 16.77
mean(dose10[dose10$supp == "OJ", ]$len)
## [1] 22.7
Thus under the dose of 1.0, there are 95% of the time that a confidence interval between 17.67 and 21.80 will contain the true population mean. We also know that the mean of teeth growth after taking 1.0 dose of VC is 16.77 and the mean of teeth growth after taking 1.0 dose of OJ is 22.7. We rejected the null hypo.
dose20 <- ToothGrowth[ToothGrowth$dose == 2, ]
t.test(x = dose20$len, paired = FALSE, conf.level = 0.95)$conf.
## [1] 24.33364 27.86636
## attr(,"conf.level")
## [1] 0.95
mean(dose20[dose20$supp == "VC", ]$len)
## [1] 26.14
mean(dose20[dose20$supp == "OJ", ]$len)
## [1] 26.06
Thus under the dose of 2.0, there are 95% of the time that a confidence interval between 24.33 and 27.87 will contain the true population mean. We also know that the mean of teeth growth after taking 2.0 dose of VC is 26.14 and the mean of teeth growth after taking 2.0 dose of OJ is 26.06. Both of them are within the confidence interval. We fail to reject the null hypothesis.
Question 4: State your conclusions and the assumptions needed for your conclusions. The conclusion is when the dose is 0.5 or 1.0 there is a difference between the teeth growth after taking OJ and VC, while when the dose is 2.0, there is no difference between the teeth growth after taking OJ and VC. The assumption needed is we first assumed the whole population is normally distributed, then we assumed the population is normally distributed under each dose.