Statistical Analysis the ‘ToothGrowth’ dataset

Linear models are used to analyse the built in R data set “ToothGrowth”. The effect of supplements (vitamin C (VC) and orange juice (OJ)) at three different dose levels (0.5, 1, and 2 mg/day) on the length of odontoblasts in 60 guinea pigs are examined.

1. Exploratory Data Analysis

data("ToothGrowth")
head(ToothGrowth)

##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

summary(ToothGrowth)

##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

ggpairs(ToothGrowth)

Looking at the scatterplot matrix above, we can see from the box plot (first row, second column) that the supplement “OJ” has a higher median “len” than the supplement “VC”. Looking at the scatterplot of “dose” versus “len” we can see a positive correlation - there is a 0.803 correlation between the two variables in the data.

2. Comparing tooth growth by supp and dose using statistical tests

There appears to be a relationship between “dose” and “len” and “supp” and “len”. We start by performing statistical tests to see if there are differences in the effect on “len” of “supp” and "dose.

2.2 Supplement

par(mfrow=c(2,3))
plot_ly(data=ToothGrowth, x=~supp, y=~len, type="box")

The above boxplot suggests a difference in the effect on length of supplement.

VC = ToothGrowth$len[ToothGrowth$supp == "VC"]
OJ = ToothGrowth$len[ToothGrowth$supp == "OJ"]
t.test(VC, OJ, paired=FALSE, var.equal=TRUE)

## 
##  Two Sample t-test
## 
## data:  VC and OJ
## t = -1.9153, df = 58, p-value = 0.06039
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -7.5670064  0.1670064
## sample estimates:
## mean of x mean of y 
##  16.96333  20.66333

At a 5% significance level, we cannot conclude that the two supplements have a different effect on the “len”.

2.3 Dosage

plot_ly(data=ToothGrowth, x=~dose, y=~len, type="scatter", color=~supp)

## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode

To compare three independent groups (dosages of 0.5, 1 and 2 mg/day) we use an extension of the two-sample t-test, the one factor ANOVA. The null hypothesis here is that the means are the same. The alternative hypothesis is that at least one of the means are different to the others.

dose_0_5 = ToothGrowth$len[ToothGrowth$dose == 0.5]
dose_1_0 = ToothGrowth$len[ToothGrowth$dose == 1]
dose_2_0 = ToothGrowth$len[ToothGrowth$dose == 2]

aov_fit = aov(len ~ dose, ToothGrowth)
summary(aov_fit)

##             Df Sum Sq Mean Sq F value   Pr(>F)    
## dose         1   2224  2224.3   105.1 1.23e-14 ***
## Residuals   58   1228    21.2                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on this one-way analysis, dosage has a very significant effect on the tooth length.

3. Using linear models

The trouble with the above analysis is that it does not allow for confounding variables. To allow for this, we fit a few linear models to the data, incrementally adding more and more variables. We use ANOVA, the significance of predictors as well as adjusted R-squared to arrive at the “best” model.

lm_1 has poor fit with a low adjusted R-squared and insignificant (at a 5% level) predictor. We are essentially performing the same t-test in part 2.2 when we look at the significance of the predictor. See the Appendix for details of the fitted models.

After fitting dose as well in lm_2, both supplement and dose are very significant and the R-squared has risen dramatically. The F-test from the ANOVA also shows that adding the “dose” factor added a statistically significantly better model fit. lm_2 looks to be a better model than lm_1.

For completeness, we also test interactions between the “dose” and “supp” variables. At a 5% level, the interaction term is significant. All predictors in lm_3 are significant at a 5% level, it has the best adjusted R-squared, and the ANOVA also indicates that it is “worth” adding more predictors to arrive at lm_3.

Model diagnostics were performed. The Q-Q plot does not show obvious signs of non-normality. There is no clear pattern in the residual plots so there are no clear signs of misfit or heteroscedasticity. The one-way plots do not show clear signs of misfit or non-linearity.

Based on the diagnostics, the assumptions underpinning the linear models appear appropriate.

Thus, we conclude that lm_3 is the best model.

lm_1 = lm(len ~ supp, ToothGrowth)
lm_2 = lm(len ~ supp + dose, ToothGrowth)
lm_3 = lm(len ~ dose + dose + supp * dose, ToothGrowth)

summary(lm_3)

## 
## Call:
## lm(formula = len ~ dose + dose + supp * dose, data = ToothGrowth)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2264 -2.8462  0.0504  2.2893  7.9386 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   11.550      1.581   7.304 1.09e-09 ***
## dose           7.811      1.195   6.534 2.03e-08 ***
## suppVC        -8.255      2.236  -3.691 0.000507 ***
## dose:suppVC    3.904      1.691   2.309 0.024631 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.083 on 56 degrees of freedom
## Multiple R-squared:  0.7296, Adjusted R-squared:  0.7151 
## F-statistic: 50.36 on 3 and 56 DF,  p-value: 6.521e-16

anova(lm_1, lm_2, lm_3)

## Analysis of Variance Table
## 
## Model 1: len ~ supp
## Model 2: len ~ supp + dose
## Model 3: len ~ dose + dose + supp * dose
##   Res.Df    RSS Df Sum of Sq        F  Pr(>F)    
## 1     58 3246.9                                  
## 2     57 1022.6  1   2224.30 133.4151 < 2e-16 ***
## 3     56  933.6  1     88.92   5.3335 0.02463 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

par(mfrow=c(2,3))
plot(lm_3)

for (i in c("supp", "dose")) {
    oneway(i,"len", lm_3,ToothGrowth) #oneway is a function i made to quickly plot one-way fits
}

4. Summary and Conclusions

4.1 Conclusions

Both dosage as well as supplement have a significant effect on teeth growth. Dosage has a positive association with teeth growth, with an estimated 7.811mm of teeth growth per mg/day. Vitamin C is associated with lower levels of teeth growth, with an estimated 8.255mm less teeth growth when compared to orange juice. Interestingly, whilst dosage is positively associated with teeth growth, the effect of dosage is stronger with vitamin C than orange juice.

4.2 Assumptions

Normal errors - normal QQ-plot validates this assumption
Homoscedasticity - residual plot shows no sign of heteroscedasticity
Linear relationship between dose and length - one-way tables do not show non-linearity between dosage and length

Appendix

summary(lm_1)

## 
## Call:
## lm(formula = len ~ supp, data = ToothGrowth)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.7633  -5.7633   0.4367   5.5867  16.9367 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   20.663      1.366  15.127   <2e-16 ***
## suppVC        -3.700      1.932  -1.915   0.0604 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.482 on 58 degrees of freedom
## Multiple R-squared:  0.05948,    Adjusted R-squared:  0.04327 
## F-statistic: 3.668 on 1 and 58 DF,  p-value: 0.06039

summary(lm_2)

## 
## Call:
## lm(formula = len ~ supp + dose, data = ToothGrowth)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.600 -3.700  0.373  2.116  8.800 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.2725     1.2824   7.231 1.31e-09 ***
## suppVC       -3.7000     1.0936  -3.383   0.0013 ** 
## dose          9.7636     0.8768  11.135 6.31e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.236 on 57 degrees of freedom
## Multiple R-squared:  0.7038, Adjusted R-squared:  0.6934 
## F-statistic: 67.72 on 2 and 57 DF,  p-value: 8.716e-16

summary(lm_3)

## 
## Call:
## lm(formula = len ~ dose + dose + supp * dose, data = ToothGrowth)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2264 -2.8462  0.0504  2.2893  7.9386 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   11.550      1.581   7.304 1.09e-09 ***
## dose           7.811      1.195   6.534 2.03e-08 ***
## suppVC        -8.255      2.236  -3.691 0.000507 ***
## dose:suppVC    3.904      1.691   2.309 0.024631 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.083 on 56 degrees of freedom
## Multiple R-squared:  0.7296, Adjusted R-squared:  0.7151 
## F-statistic: 50.36 on 3 and 56 DF,  p-value: 6.521e-16