Linear models are used to analyse the built in R data set “ToothGrowth”. The effect of supplements (vitamin C (VC) and orange juice (OJ)) at three different dose levels (0.5, 1, and 2 mg/day) on the length of odontoblasts in 60 guinea pigs are examined.
data("ToothGrowth")
head(ToothGrowth)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
summary(ToothGrowth)
## len supp dose
## Min. : 4.20 OJ:30 Min. :0.500
## 1st Qu.:13.07 VC:30 1st Qu.:0.500
## Median :19.25 Median :1.000
## Mean :18.81 Mean :1.167
## 3rd Qu.:25.27 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
ggpairs(ToothGrowth)
Looking at the scatterplot matrix above, we can see from the box plot (first row, second column) that the supplement “OJ” has a higher median “len” than the supplement “VC”. Looking at the scatterplot of “dose” versus “len” we can see a positive correlation - there is a 0.803 correlation between the two variables in the data.
There appears to be a relationship between “dose” and “len” and “supp” and “len”. We start by performing statistical tests to see if there are differences in the effect on “len” of “supp” and "dose.
par(mfrow=c(2,3))
plot_ly(data=ToothGrowth, x=~supp, y=~len, type="box")
The above boxplot suggests a difference in the effect on length of supplement.
VC = ToothGrowth$len[ToothGrowth$supp == "VC"]
OJ = ToothGrowth$len[ToothGrowth$supp == "OJ"]
t.test(VC, OJ, paired=FALSE, var.equal=TRUE)
##
## Two Sample t-test
##
## data: VC and OJ
## t = -1.9153, df = 58, p-value = 0.06039
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -7.5670064 0.1670064
## sample estimates:
## mean of x mean of y
## 16.96333 20.66333
At a 5% significance level, we cannot conclude that the two supplements have a different effect on the “len”.
plot_ly(data=ToothGrowth, x=~dose, y=~len, type="scatter", color=~supp)
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
To compare three independent groups (dosages of 0.5, 1 and 2 mg/day) we use an extension of the two-sample t-test, the one factor ANOVA. The null hypothesis here is that the means are the same. The alternative hypothesis is that at least one of the means are different to the others.
dose_0_5 = ToothGrowth$len[ToothGrowth$dose == 0.5]
dose_1_0 = ToothGrowth$len[ToothGrowth$dose == 1]
dose_2_0 = ToothGrowth$len[ToothGrowth$dose == 2]
aov_fit = aov(len ~ dose, ToothGrowth)
summary(aov_fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## dose 1 2224 2224.3 105.1 1.23e-14 ***
## Residuals 58 1228 21.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on this one-way analysis, dosage has a very significant effect on the tooth length.
The trouble with the above analysis is that it does not allow for confounding variables. To allow for this, we fit a few linear models to the data, incrementally adding more and more variables. We use ANOVA, the significance of predictors as well as adjusted R-squared to arrive at the “best” model.
lm_1 has poor fit with a low adjusted R-squared and insignificant (at a 5% level) predictor. We are essentially performing the same t-test in part 2.2 when we look at the significance of the predictor. See the Appendix for details of the fitted models.
After fitting dose as well in lm_2, both supplement and dose are very significant and the R-squared has risen dramatically. The F-test from the ANOVA also shows that adding the “dose” factor added a statistically significantly better model fit. lm_2 looks to be a better model than lm_1.
For completeness, we also test interactions between the “dose” and “supp” variables. At a 5% level, the interaction term is significant. All predictors in lm_3 are significant at a 5% level, it has the best adjusted R-squared, and the ANOVA also indicates that it is “worth” adding more predictors to arrive at lm_3.
Model diagnostics were performed. The Q-Q plot does not show obvious signs of non-normality. There is no clear pattern in the residual plots so there are no clear signs of misfit or heteroscedasticity. The one-way plots do not show clear signs of misfit or non-linearity.
Based on the diagnostics, the assumptions underpinning the linear models appear appropriate.
Thus, we conclude that lm_3 is the best model.
lm_1 = lm(len ~ supp, ToothGrowth)
lm_2 = lm(len ~ supp + dose, ToothGrowth)
lm_3 = lm(len ~ dose + dose + supp * dose, ToothGrowth)
summary(lm_3)
##
## Call:
## lm(formula = len ~ dose + dose + supp * dose, data = ToothGrowth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.2264 -2.8462 0.0504 2.2893 7.9386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.550 1.581 7.304 1.09e-09 ***
## dose 7.811 1.195 6.534 2.03e-08 ***
## suppVC -8.255 2.236 -3.691 0.000507 ***
## dose:suppVC 3.904 1.691 2.309 0.024631 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.083 on 56 degrees of freedom
## Multiple R-squared: 0.7296, Adjusted R-squared: 0.7151
## F-statistic: 50.36 on 3 and 56 DF, p-value: 6.521e-16
anova(lm_1, lm_2, lm_3)
## Analysis of Variance Table
##
## Model 1: len ~ supp
## Model 2: len ~ supp + dose
## Model 3: len ~ dose + dose + supp * dose
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 58 3246.9
## 2 57 1022.6 1 2224.30 133.4151 < 2e-16 ***
## 3 56 933.6 1 88.92 5.3335 0.02463 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow=c(2,3))
plot(lm_3)
for (i in c("supp", "dose")) {
oneway(i,"len", lm_3,ToothGrowth) #oneway is a function i made to quickly plot one-way fits
}
Both dosage as well as supplement have a significant effect on teeth growth. Dosage has a positive association with teeth growth, with an estimated 7.811mm of teeth growth per mg/day. Vitamin C is associated with lower levels of teeth growth, with an estimated 8.255mm less teeth growth when compared to orange juice. Interestingly, whilst dosage is positively associated with teeth growth, the effect of dosage is stronger with vitamin C than orange juice.
summary(lm_1)
##
## Call:
## lm(formula = len ~ supp, data = ToothGrowth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.7633 -5.7633 0.4367 5.5867 16.9367
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.663 1.366 15.127 <2e-16 ***
## suppVC -3.700 1.932 -1.915 0.0604 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.482 on 58 degrees of freedom
## Multiple R-squared: 0.05948, Adjusted R-squared: 0.04327
## F-statistic: 3.668 on 1 and 58 DF, p-value: 0.06039
summary(lm_2)
##
## Call:
## lm(formula = len ~ supp + dose, data = ToothGrowth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.600 -3.700 0.373 2.116 8.800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.2725 1.2824 7.231 1.31e-09 ***
## suppVC -3.7000 1.0936 -3.383 0.0013 **
## dose 9.7636 0.8768 11.135 6.31e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.236 on 57 degrees of freedom
## Multiple R-squared: 0.7038, Adjusted R-squared: 0.6934
## F-statistic: 67.72 on 2 and 57 DF, p-value: 8.716e-16
summary(lm_3)
##
## Call:
## lm(formula = len ~ dose + dose + supp * dose, data = ToothGrowth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.2264 -2.8462 0.0504 2.2893 7.9386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.550 1.581 7.304 1.09e-09 ***
## dose 7.811 1.195 6.534 2.03e-08 ***
## suppVC -8.255 2.236 -3.691 0.000507 ***
## dose:suppVC 3.904 1.691 2.309 0.024631 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.083 on 56 degrees of freedom
## Multiple R-squared: 0.7296, Adjusted R-squared: 0.7151
## F-statistic: 50.36 on 3 and 56 DF, p-value: 6.521e-16