A Summary of the Tooth Growth Data Set

The Details

Exploratory Analysis

The ToothGrowth data set is a standard data set in R. Let’s load the data set:

require("datasets")
str(ToothGrowth)

## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

And look at an excerpt of the documentation for more information on the columns:

?ToothGrowth

The response is the length of odontoblasts (teeth) in each of 10 guinea pigs at 
each of three dose levels of Vitamin C (0.5, 1, and 2 mg) with each of two delivery 
methods (orange juice or ascorbic acid).

A data frame with 60 observations on 3 variables:
    
    [,1]  len   numeric  Tooth length
    [,2]  supp  factor   Supplement type (VC or OJ)
    [,3]  dose  numeric  Dose in milligrams

Note:

We don’t know the units for tooth length. We will assume these measurements are millimeters (mm).
For supplement type, OJ is the encoded value for orange juice and VC is the encoded value for ascorbic acid.

Tooth length is the dependent variable. Let’s look at how many samples we have for the independent variables, supplement type and dosage:

table(ToothGrowth$supp, ToothGrowth$dose)

##     
##      0.5  1  2
##   OJ  10 10 10
##   VC  10 10 10

We see there are 10 samples for each dosage of each supplement, 20 samples for each dosage, and 30 samples for each supplement type. Ten is a small number of samples per group, so let’s keep that in mind in later analysis.
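The per-dosage and per-supplement totals quoted above can be checked directly by adding margins to the same table. This is a minimal sketch using base R only; the names counts and counts_with_totals are just illustrative.

```r
# Cross-tabulate supplement type against dose.
counts <- table(supp = ToothGrowth$supp, dose = ToothGrowth$dose)

# Append a "Sum" row (totals per dose) and a "Sum" column (totals per supplement).
counts_with_totals <- addmargins(counts)
print(counts_with_totals)
```

The Sum row gives the number of samples at each dosage and the Sum column gives the number for each supplement type.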

We can do our analysis for each combination of supplement type and dosage, or we can analyze a single variable at an aggregate level, ignoring the other variable (for example, comparing supplement types while ignoring dosage levels). Again, let’s keep this in mind.

Let’s look at a box plot of tooth length broken down by supplement type and dosage:

require(ggplot2)
ggplot(ToothGrowth, aes(x=factor(dose), y=len, fill=supp)) + 
  geom_boxplot() +
  ggtitle('Tooth Length by Supplement Type and Dosage') +
  xlab('Dosage (mg)') +
  ylab('Tooth Length (mm)') +
  guides(fill=guide_legend(title='Supplement Type'))

From this plot, we see:

For 2.0 mg dosage, there appears to be no difference between OJ and VC.
For 1.0 mg dosage, OJ clearly appears to promote higher tooth growth than VC.
For 0.5 mg dosage, OJ appears to promote higher tooth growth than VC.
For both OJ and VC, a dosage of 1.0 mg appears to promote higher tooth growth than a dosage of 0.5 mg.
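The visual impressions above can be cross-checked against the group means. The following is a rough sketch in base R; it looks only at means and ignores the spread that the box plot shows, so it supports the observations rather than proving them. The names group_means and mean_gap are illustrative.

```r
# Mean tooth length for every supplement/dose combination (rows: supp, columns: dose).
group_means <- with(ToothGrowth,
                    tapply(len, list(supp = supp, dose = dose), mean))
print(round(group_means, 2))

# Difference in means, OJ minus VC, at each dose.
# Positive values mean OJ is higher on average at that dose.
mean_gap <- group_means["OJ", ] - group_means["VC", ]
print(round(mean_gap, 2))
```

A gap near zero at one dose and clearly positive gaps at the others would match what the plot suggests.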

Exclude the 2.0 dosage samples

I know this is a somewhat arbitrary decision, but:

I’d prefer not to do the analysis for each combination of dosage and supplement with only 10 samples.
I’d prefer to test across each variable with 20 samples.
We can see that for the 2.0 dosage, there is no difference between OJ and VC.
Including the 2.0 dosage samples could very well impact the analysis of which supplement type is more effective across dosage levels.

For those reasons, I’m going to exclude the 2.0 mg dosage data from the rest of the analysis:

require(dplyr)
myToothData <- ToothGrowth %>% filter(dose != 2)
table(myToothData$supp, myToothData$dose)

##     
##      0.5  1
##   OJ  10 10
##   VC  10 10

Assumptions

For this analysis, we’re going to use Welch’s t-test. We make the following assumptions (from the Wikipedia entry for Welch’s t-test):

The populations being compared are normally distributed. We could formally test this, but for this exercise we’re just going to assume it.
The variances of the two populations do not have to be the same. This is an advantage of using Welch’s test.
The two sample sets should be independent, meaning different, randomly selected guinea pigs should have been used for each supplement and dosage. There’s no way for us to test this, so again we will assume it.
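For readers who do want to check the normality assumption, one possible approach is a Shapiro-Wilk test within each supplement group. This is only a sketch: the choice of shapiro.test and the names tooth_subset and shapiro_p are my assumptions, not part of the original analysis, and with 20 samples per group the test has limited power.

```r
# Same subset the analysis uses: drop the 2.0 mg dose
# (base R equivalent of the dplyr filter).
tooth_subset <- subset(ToothGrowth, dose != 2)

# Shapiro-Wilk p-value for tooth length within each supplement group.
# A small p-value is evidence against normality; a large one is not proof of it.
shapiro_p <- tapply(tooth_subset$len, tooth_subset$supp,
                    function(x) shapiro.test(x)$p.value)
print(shapiro_p)
```

A normal Q-Q plot per group (qqnorm and qqline) would be a reasonable visual complement.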

Does supplement type have an impact on tooth growth?

We have seen that the supplement type appears to have an impact on tooth growth. Now let’s do a formal test to see if we can legitimately make that claim.

We will compare the mean tooth length for the two supplement types, OJ and VC, pooled across the two remaining dosage levels (0.5 and 1.0 mg).

For this test:

The null hypothesis is that the mean of the OJ population is the same as the mean of the VC population.
The alternative hypothesis is that the two means are different.

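Let’s do the test. We set var.equal=FALSE to say the variances are not assumed to be equal. The formula interface already treats the two sets as unpaired, and recent versions of R reject the paired argument there, so we leave it out:

t.test(len ~ supp, var.equal=FALSE, data=myToothData)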

## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 3.0503, df = 36.553, p-value = 0.004239
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.875234 9.304766
## sample estimates:
## mean in group OJ mean in group VC 
##           17.965           12.375

The output of the test states that:

We can be 95% confident that the interval (1.875234, 9.304766) contains the difference between the two population means (OJ minus VC). This interval does not include 0. An interval that included 0 would mean we could not rule out that the population means are equal.
The p-value of 0.004239 says that, if the two population means were the same, there would be only about a 0.4% chance of seeing a difference in sample means at least this large.
The mean len of the OJ set, 17.965, is greater than the mean len of the VC set, 12.375.

So, since the 95% confidence interval does not contain 0 and the p-value is less than 5%, we reject the null hypothesis and conclude that:

At the 0.5 mg and 1.0 mg dosages, giving guinea pigs orange juice had a greater impact on tooth growth than giving them ascorbic acid.
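To make it clearer where the numbers in the supplement test come from, here is a sketch that recomputes Welch’s t statistic, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value by hand. It uses the standard textbook formulas and base R only; the variable names are illustrative.

```r
# Same subset the analysis uses: drop the 2.0 mg dose.
tooth_subset <- subset(ToothGrowth, dose != 2)
oj <- tooth_subset$len[tooth_subset$supp == "OJ"]
vc <- tooth_subset$len[tooth_subset$supp == "VC"]

# Squared standard error of each group mean: s^2 / n.
se2_oj <- var(oj) / length(oj)
se2_vc <- var(vc) / length(vc)

# Welch's t statistic: difference in means over the combined standard error.
t_stat <- (mean(oj) - mean(vc)) / sqrt(se2_oj + se2_vc)

# Welch-Satterthwaite approximation to the degrees of freedom.
df_welch <- (se2_oj + se2_vc)^2 /
  (se2_oj^2 / (length(oj) - 1) + se2_vc^2 / (length(vc) - 1))

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

print(c(t = t_stat, df = df_welch, p = p_value))
```

These values should agree with the t, df, and p-value lines reported by t.test for the supplement comparison.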

Does dosage have an impact on tooth growth?

Now let’s test two dosage levels. For this test:

The null hypothesis is that the mean of the 0.5 dosage population is the same as the mean of the 1.0 dosage population.
The alternative hypothesis is that the two means are different.

t.test(len ~ dose, var.equal=FALSE, data = myToothData)

## 
##  Welch Two Sample t-test
## 
## data:  len by dose
## t = -6.4766, df = 37.986, p-value = 1.268e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.983781  -6.276219
## sample estimates:
## mean in group 0.5   mean in group 1 
##            10.605            19.735
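The same decision rule used for the supplement comparison (does the 95% confidence interval exclude 0, and is the p-value below 5%?) can be applied to the dosage test programmatically. This is a small sketch that pulls those values out of the object t.test returns; the names dose_test, ci_excludes_zero, and reject_null are illustrative.

```r
# Same subset the analysis uses: drop the 2.0 mg dose.
tooth_subset <- subset(ToothGrowth, dose != 2)

# Welch's t-test of tooth length between the 0.5 mg and 1.0 mg groups.
dose_test <- t.test(len ~ dose, data = tooth_subset, var.equal = FALSE)

# 95% confidence interval for mean(0.5 mg) minus mean(1.0 mg), and the p-value.
ci <- as.numeric(dose_test$conf.int)
p_value <- dose_test$p.value

# Decision rule: reject the null hypothesis if the interval excludes 0
# and the p-value is below the 5% level.
ci_excludes_zero <- ci[1] > 0 || ci[2] < 0
reject_null <- ci_excludes_zero && p_value < 0.05

print(list(conf_int = ci, p_value = p_value, reject_null = reject_null))
```

Because the difference is computed as the 0.5 mg mean minus the 1.0 mg mean, a negative interval corresponds to longer teeth at the higher dosage.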