ToothGrowth

The ToothGrowth dataset loads with the datasets package. The data are from an experiment examining how vitamin C dosage delivered in 2 different methods predicts tooth growth in guinea pigs.

The data consist of 60 observations, representing 60 guinea pigs, and 3 variables:

A quick look at ToothGrowth reveals that many guinea pigs were given the same dose of vitamin C. Three doses, 0.5, 1.0, and 2.0, were used. ​

head(ToothGrowth)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5
ToothGrowth %>% group_by(dose) %>% tally()
## # A tibble: 3 x 2
##    dose     n
##   <dbl> <int>
## 1   0.5    20
## 2   1      20
## 3   2      20

Distribution

#bar plot of counts of dose by supp
#data are balanced, so not so interesting
p1 <- ggplot(ToothGrowth, aes(x=as.factor(dose), fill=supp)) +  geom_bar()

#density of len
p2 <-ggplot(ToothGrowth, aes(x=len))  + geom_density()

#density of len by supp
p3 <- ggplot(ToothGrowth, aes(x=len, color=supp)) + 
  geom_density()

grid.arrange(p1, p2, p3, ncol=2)

Scatterplot

#not the best scatterplot
tp <- ggplot(ToothGrowth, aes(x=dose, y=len))
tp + geom_point()

#mean and cl of len at each dose
tp.1 <- tp + stat_summary(fun.data="mean_cl_normal")
tp.1

#add a line plot of means to see dose-len relationship
#maybe not linear
tp.2 <- tp.1 + stat_summary(fun.y="mean", geom="line")
## Warning: `fun.y` is deprecated. Use `fun` instead.
tp.2

Does a third variable moderate the x-y relationship?

Does the dose-len relationship depend on supp? We can specify new global aesthetics in aes.

#all plots in tp.2 will now be colored by supp
tp.2 + aes(color=supp)

Linear Model Regression

#create dose-squared variable
ToothGrowth <- ToothGrowth %>% mutate(dosesq = (dose)^2)
lm2 <- lm(len ~ dose + dosesq*supp, data=ToothGrowth)
summary(lm2)$coef
##                 Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)    0.7491667  2.7983895  0.2677135 7.899213e-01
## dose          30.1550000  5.2474684  5.7465806 4.114588e-07
## dosesq        -8.7238095  2.0402571 -4.2758383 7.640686e-05
## suppVC        -6.4783333  1.3762287 -4.7073088 1.739152e-05
## dosesq:suppVC  1.5876190  0.5770719  2.7511635 8.024694e-03
#create dataset with original data and diagnostic variables
lm2diag <- fortify(lm2)
#quick look
head(lm2diag)
##    len dose dosesq supp       .hat   .sigma      .cooksd  .fitted     .resid
## 1  4.2  0.5   0.25   VC 0.08095238 3.623134 0.0165458459 7.564286 -3.3642857
## 2 11.5  0.5   0.25   VC 0.08095238 3.611516 0.0226438547 7.564286  3.9357143
## 3  7.3  0.5   0.25   VC 0.08095238 3.654279 0.0001021058 7.564286 -0.2642857
## 4  5.8  0.5   0.25   VC 0.08095238 3.645881 0.0045503109 7.564286 -1.7642857
## 5  6.4  0.5   0.25   VC 0.08095238 3.650733 0.0019816291 7.564286 -1.1642857
## 6 10.0  0.5   0.25   VC 0.08095238 3.638080 0.0086727319 7.564286  2.4357143
##     .stdresid
## 1 -0.96913367
## 2  1.13374237
## 3 -0.07613152
## 4 -0.50822934
## 5 -0.33539021
## 6  0.70164455
#q-q plot of residuals and diagonal reference line
#geom_abline default aesthetics are yintercept=0 and slope=1
ggplot(lm2diag, aes(sample=.stdresid)) + 
  stat_qq() + 
  geom_abline()

Linearity and Homoscedasticity: residuals vs fitted

We next assess the assumptions of homoscedescasticity and linear relationships between the outcome and predictors. A residuals vs fitted (predicted value) plot assesses both of these assmuptions.

An even spread of residuals around 0 suggests homoscedasticity, and a zero, flat slope for residuals against fitted suggests linearity of predictor effects.

We build our residuals vs fitted plot like so:

  1. Start with a scatter plot of x=.fitted and y=.stdresid.
  2. Add a plot the means and standard errors of the residuals across fitted values using stat_summary. The standard error bars somewhat address homoskedasticity.
  3. Plot a line representing the mean trend of the residuals with another call to stat_summary (changing function to mean and geom to line). Normally, we would use geom_smooth to plot a loess curve to visualize the trend among residuals, but loess smooths do not work well when the variable mapped to x is discrete.
  4. The error bars do not appear to vary too much and the line seems sufficiently flat to feel that neither assumption has been violated.
#residual vs fitted, means and s.e.
ggplot(lm2diag, aes(x=.fitted, y=.stdresid)) + 
  geom_point() + 
  stat_summary() + 
  stat_summary(fun.y="mean", geom="line")
## Warning: `fun.y` is deprecated. Use `fun` instead.
## No summary function supplied, defaulting to `mean_se()`

## No summary function supplied, defaulting to `mean_se()



# in geom_text we SET size to 4 so that size of text is constant,
#   and not tied to .cooksd.  We also  nudge the text .001 (x-axis units)
#   to the left, and separate overlapping labels
ggplot(lm2diag, aes(x=.hat, y=.stdresid, size=.cooksd)) +     
  geom_point() +
  geom_text(aes(label=row.names(lm2diag)), 
             size=4, nudge_x=-0.003, check_overlap=T)