The ToothGrowth dataset loads with the datasets package. The data are from an experiment examining how vitamin C dosage delivered in 2 different methods predicts tooth growth in guinea pigs.
The data consist of 60 observations, representing 60 guinea pigs, and 3 variables:
VC'' is ascorbic acid, and
OJ’’ is orange juiceA quick look at ToothGrowth reveals that many guinea pigs were given the same dose of vitamin C. Three doses, 0.5, 1.0, and 2.0, were used. ​
head(ToothGrowth)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
ToothGrowth %>% group_by(dose) %>% tally()
## # A tibble: 3 x 2
## dose n
## <dbl> <int>
## 1 0.5 20
## 2 1 20
## 3 2 20
#bar plot of counts of dose by supp
#data are balanced, so not so interesting
p1 <- ggplot(ToothGrowth, aes(x=as.factor(dose), fill=supp)) + geom_bar()
#density of len
p2 <-ggplot(ToothGrowth, aes(x=len)) + geom_density()
#density of len by supp
p3 <- ggplot(ToothGrowth, aes(x=len, color=supp)) +
geom_density()
grid.arrange(p1, p2, p3, ncol=2)
#not the best scatterplot
tp <- ggplot(ToothGrowth, aes(x=dose, y=len))
tp + geom_point()
#mean and cl of len at each dose
tp.1 <- tp + stat_summary(fun.data="mean_cl_normal")
tp.1
#add a line plot of means to see dose-len relationship
#maybe not linear
tp.2 <- tp.1 + stat_summary(fun.y="mean", geom="line")
## Warning: `fun.y` is deprecated. Use `fun` instead.
tp.2
Does a third variable moderate the x-y relationship?
Does the dose-len relationship depend on supp? We can specify new global aesthetics in aes.
#all plots in tp.2 will now be colored by supp
tp.2 + aes(color=supp)
#create dose-squared variable
ToothGrowth <- ToothGrowth %>% mutate(dosesq = (dose)^2)
lm2 <- lm(len ~ dose + dosesq*supp, data=ToothGrowth)
summary(lm2)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.7491667 2.7983895 0.2677135 7.899213e-01
## dose 30.1550000 5.2474684 5.7465806 4.114588e-07
## dosesq -8.7238095 2.0402571 -4.2758383 7.640686e-05
## suppVC -6.4783333 1.3762287 -4.7073088 1.739152e-05
## dosesq:suppVC 1.5876190 0.5770719 2.7511635 8.024694e-03
#create dataset with original data and diagnostic variables
lm2diag <- fortify(lm2)
#quick look
head(lm2diag)
## len dose dosesq supp .hat .sigma .cooksd .fitted .resid
## 1 4.2 0.5 0.25 VC 0.08095238 3.623134 0.0165458459 7.564286 -3.3642857
## 2 11.5 0.5 0.25 VC 0.08095238 3.611516 0.0226438547 7.564286 3.9357143
## 3 7.3 0.5 0.25 VC 0.08095238 3.654279 0.0001021058 7.564286 -0.2642857
## 4 5.8 0.5 0.25 VC 0.08095238 3.645881 0.0045503109 7.564286 -1.7642857
## 5 6.4 0.5 0.25 VC 0.08095238 3.650733 0.0019816291 7.564286 -1.1642857
## 6 10.0 0.5 0.25 VC 0.08095238 3.638080 0.0086727319 7.564286 2.4357143
## .stdresid
## 1 -0.96913367
## 2 1.13374237
## 3 -0.07613152
## 4 -0.50822934
## 5 -0.33539021
## 6 0.70164455
#q-q plot of residuals and diagonal reference line
#geom_abline default aesthetics are yintercept=0 and slope=1
ggplot(lm2diag, aes(sample=.stdresid)) +
stat_qq() +
geom_abline()
We next assess the assumptions of homoscedescasticity and linear relationships between the outcome and predictors. A residuals vs fitted (predicted value) plot assesses both of these assmuptions.
An even spread of residuals around 0 suggests homoscedasticity, and a zero, flat slope for residuals against fitted suggests linearity of predictor effects.
We build our residuals vs fitted plot like so:
#residual vs fitted, means and s.e.
ggplot(lm2diag, aes(x=.fitted, y=.stdresid)) +
geom_point() +
stat_summary() +
stat_summary(fun.y="mean", geom="line")
## Warning: `fun.y` is deprecated. Use `fun` instead.
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()
# in geom_text we SET size to 4 so that size of text is constant,
# and not tied to .cooksd. We also nudge the text .001 (x-axis units)
# to the left, and separate overlapping labels
ggplot(lm2diag, aes(x=.hat, y=.stdresid, size=.cooksd)) +
geom_point() +
geom_text(aes(label=row.names(lm2diag)),
size=4, nudge_x=-0.003, check_overlap=T)