In this project, we will analyze the ToothGrowth dataset that ships with R. As we review the characteristics of the data, we will also report summary statistics and confidence intervals.
Let’s load and review the data and the structure.
data("ToothGrowth")
str(ToothGrowth) # Provides information about the structure of the data
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
A quick look at the structure shows 60 observations (rows) of 3 variables (columns): len (tooth length), supp (supplement type) and dose.
Let’s check whether any values are missing.
any(is.na(ToothGrowth)) # Checks whether there are any missing (NA) values
## [1] FALSE
The result indicates that we don’t have any ‘NA’ values in the dataset.
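If a dataset did contain missing values, a single TRUE/FALSE would not say where they are. As a minimal sketch (the names na_per_column and all_complete are illustrative, not from the original analysis), the same check can be broken down by column and by row using base R:

```r
data("ToothGrowth")

# Count missing values in each column separately
na_per_column <- colSums(is.na(ToothGrowth))
print(na_per_column)

# TRUE only if every row has a value in every column
all_complete <- all(complete.cases(ToothGrowth))
print(all_complete)
```

For ToothGrowth every count should be zero, consistent with the result above.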
summary(ToothGrowth) # Gives statistical information about the data
##       len        supp         dose
##  Min.   : 4.20   OJ:30   Min.   :0.500
##  1st Qu.:13.07   VC:30   1st Qu.:0.500
##  Median :19.25           Median :1.000
##  Mean   :18.81           Mean   :1.167
##  3rd Qu.:25.27           3rd Qu.:2.000
##  Max.   :33.90           Max.   :2.000
The supp field has two levels, OJ and VC, with 30 observations each. The natural question at this point is whether len is related to dose. We can create a scatter plot of len against dose and overlay the mean length of each supplement at every dose.
library(ggplot2) # Plotting
library(dplyr) # group_by(), summarise() and the %>% pipe
d <- ToothGrowth
# Plot the data points
p <- ggplot(d, aes(x = dose, y = len, color = supp)) + geom_point()
p <- p + labs(title = "Length vs. Dose by Supplement", x = "Dose", y = "Length")
p <- p + theme(plot.title = element_text(hjust = 0.5))
# Group the data by supp and dose, then take the mean length of each group
d_summary <- d %>% group_by(supp, dose) %>% summarise(len = mean(len), .groups = "drop")
# Add a line through the group means of each supplement
p <- p + geom_line(data = d_summary, aes(group = supp, colour = supp))
print(p)
In this plot, OJ gives a higher mean tooth length than VC at doses of 0.5 and 1. At a dose of 2, the two supplements have nearly the same sample mean.
Let’s look at the distribution of len for each supplement in a panel plot, with one panel per dose.
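The group means behind the trend lines can also be read as numbers rather than off the plot. The following is a small base R sketch, not part of the original analysis; mean_table is an illustrative name. It builds a supplement-by-dose table of mean lengths:

```r
data("ToothGrowth")

# Mean tooth length for every supplement (rows) and dose (columns)
mean_table <- with(ToothGrowth, tapply(len, list(supp = supp, dose = dose), mean))
print(round(mean_table, 2))

# Difference in means (OJ minus VC) at each dose
print(round(mean_table["OJ", ] - mean_table["VC", ], 2))
```

A positive difference at a given dose means OJ has the higher sample mean there.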
p<- ggplot(aes(x = supp, y = len), data = ToothGrowth)
p <- p + geom_boxplot(aes(fill = supp)) + facet_wrap(~ dose)
p<- p+ labs(title="Treatment Efficacy by Dose and Method") + labs(x="Supplement")+labs(y="Length")
p <- p+ theme(plot.title = element_text(hjust = 0.5))
print(p)
Based on the plot, for doses below 2 units OJ appears to have a clear advantage over VC in supporting tooth length.
In this section, we will look at dosage and supplement as factors supporting tooth growth.
To compare growth by dose, we test three null hypotheses (H0), one for each pair of doses. In each case the alternative hypothesis is that the two means are not equal.
1. mu0.5 - mu1 = 0
2. mu0.5 - mu2 = 0
3. mu1 - mu2 = 0
# Separate the data into groups
ds1<- subset(ToothGrowth,dose>=0.5 & dose <=1)
ds2<-subset(ToothGrowth, dose==0.5 | dose==2)
ds3<-subset(ToothGrowth, dose>=1 & dose<=2)
# Apply t-test for small samples
# For mu0.5 - mu1 =0
t.test(len ~ dose, data = ds1, var.equal = FALSE) # Unpaired is the default
##
## Welch Two Sample t-test
##
## data: len by dose
## t = -6.4766, df = 37.986, p-value = 1.268e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.983781 -6.276219
## sample estimates:
## mean in group 0.5 mean in group 1
## 10.605 19.735
# For mu0.5 - mu2 = 0
t.test(len ~ dose, data = ds2, var.equal = FALSE) # Unpaired is the default
##
## Welch Two Sample t-test
##
## data: len by dose
## t = -11.799, df = 36.883, p-value = 4.398e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -18.15617 -12.83383
## sample estimates:
## mean in group 0.5 mean in group 2
## 10.605 26.100
# For mu1 - mu2 = 0
t.test(len ~ dose, data = ds3, var.equal = FALSE) # Unpaired is the default
##
## Welch Two Sample t-test
##
## data: len by dose
## t = -4.9005, df = 37.101, p-value = 1.906e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -8.996481 -3.733519
## sample estimates:
## mean in group 1 mean in group 2
## 19.735 26.100
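To make the reported numbers less of a black box, the Welch statistic for the first comparison (dose 0.5 vs. 1) can be reproduced by hand. The sketch below is not part of the original analysis, and the variable names (x, y, t_stat, welch_df and so on) are chosen here for illustration. It uses the standard Welch formulas: the difference in means divided by the combined standard error, with the Welch-Satterthwaite approximation for the degrees of freedom.

```r
data("ToothGrowth")

# Tooth lengths for the two dose groups being compared
x <- ToothGrowth$len[ToothGrowth$dose == 0.5]
y <- ToothGrowth$len[ToothGrowth$dose == 1]

# Squared standard error of each group mean
se2_x <- var(x) / length(x)
se2_y <- var(y) / length(y)
se_diff <- sqrt(se2_x + se2_y)

# Welch t statistic for H0: mu_x - mu_y = 0
t_stat <- (mean(x) - mean(y)) / se_diff

# Welch-Satterthwaite degrees of freedom
welch_df <- (se2_x + se2_y)^2 /
  (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))

# Two-sided p-value and 95% confidence interval for the difference
p_value <- 2 * pt(-abs(t_stat), welch_df)
ci <- (mean(x) - mean(y)) + c(-1, 1) * qt(0.975, welch_df) * se_diff

print(c(t = t_stat, df = welch_df, p = p_value))
print(ci)
```

These values should agree with the first t.test() output shown above.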
For each comparison, the 95% confidence interval for the difference in means is as follows.
dose 0.5 vs 1 ==> -11.983781 -6.276219
dose 0.5 vs 2 ==> -18.15617 -12.83383
dose 1 vs 2 ==> -8.996481 -3.733519
None of the three confidence intervals contains zero, and every p-value is far below 0.05, so we reject the null hypothesis in each case. This means that the mean tooth lengths of these groups differ from each other and that dose has a strong effect on tooth growth.
If we run the same t-test on the full sample to check the effect of the supplement (supp), we find the following confidence interval.
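The decision rule used here is that the null hypothesis of equal means is rejected at the 5% level when the 95% confidence interval for the difference excludes zero. As a sketch of how the three dose comparisons could be collected in one table (the names dose_pairs and results and the column names are illustrative assumptions, not part of the original analysis):

```r
data("ToothGrowth")

# All pairs of dose levels: (0.5, 1), (0.5, 2), (1, 2)
dose_pairs <- combn(sort(unique(ToothGrowth$dose)), 2, simplify = FALSE)

# Run a Welch t-test for each pair and keep the interval and p-value
results <- do.call(rbind, lapply(dose_pairs, function(pair) {
  d <- ToothGrowth[ToothGrowth$dose %in% pair, ]
  tt <- t.test(len ~ dose, data = d, var.equal = FALSE)
  data.frame(
    comparison = paste(pair, collapse = " vs "),
    ci_lower = tt$conf.int[1],
    ci_upper = tt$conf.int[2],
    p_value = tt$p.value,
    # TRUE when zero lies outside the interval, i.e. H0 is rejected
    excludes_zero = tt$conf.int[1] > 0 | tt$conf.int[2] < 0
  )
}))
print(results)
```

Running three tests on the same data raises the question of multiple comparisons; no adjustment is applied here, to stay close to the analysis above.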
t.test(len ~ supp, data = ToothGrowth, var.equal = FALSE) # Unpaired is the default
##
## Welch Two Sample t-test
##
## data: len by supp
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1710156 7.5710156
## sample estimates:
## mean in group OJ mean in group VC
## 20.66333 16.96333
Here the 95% confidence interval (-0.1710156 to 7.5710156) contains zero and the p-value (0.06063) is above 0.05. Therefore, we fail to reject the null hypothesis that the two group means are equal. With all doses pooled together, the delivery method does not show a statistically significant effect on tooth growth.
Based on the calculations above, the delivery method for vitamin C has no statistically significant effect on tooth growth at the 5% level when all doses are considered together. However, dose has a strong effect.
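The test above pools all three doses, while the earlier plots suggested that any difference between supplements depends on the dose. One possible follow-up, which is not part of the original analysis, is to repeat the supplement comparison within each dose level. In this sketch the names by_dose and supp_by_dose are illustrative:

```r
data("ToothGrowth")

# Split the data by dose and run a Welch t-test of len by supp in each part
by_dose <- lapply(split(ToothGrowth, ToothGrowth$dose), function(d) {
  tt <- t.test(len ~ supp, data = d, var.equal = FALSE)
  c(ci_lower = tt$conf.int[1],
    ci_upper = tt$conf.int[2],
    p_value = tt$p.value)
})

# One row per dose: confidence interval for mean(OJ) - mean(VC) and p-value
supp_by_dose <- do.call(rbind, by_dose)
print(round(supp_by_dose, 4))
```

Each dose group contains only 10 observations per supplement, so these per-dose results are best read as exploratory.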