W5 Practical Demo

Author

Swidbert R. Ott

Mid-module survey, if you haven’t done it yet

1 Lecture data example

For this week’s demo, I’m using the lecture example of the bacterial growth experiment with or without lactose in the medium. I have simulated a larger sample size than in the lecture dataset, to make it more similar to what you have in the practical.

Research question: Effect of the presence of lactose in the growth medium on bacterial counts over time.

Outcome: Bacterial count (bacteria)

Predictors:

day of counting (day: 1, 2, 3, 4, 5)
growth medium (lactose – 0 = without, 1 = with; in the lecture, it was 1 = without, 2 = with)

2 Load and plot the data

TASK

Plot the data as a scatterplot, with different colours for the two levels of [the factor EV; here, lactose]. Because day is 1, 2, 3…, the points for one day will all line up and some may land on top of another, make it harder to interpret the plot.

You can try two solutions in ggplot:

Use geom_jitter to jitter the points a little along the day-axis - but be mindful that doing so for a continuous variable means that your plot is no longer a faithful representation of the data;
Use geom_point but make the points bigger (e.g, size = 3) and use the alpha option to make them semi-transparent: they will appear darker where they overlap. alpha = 0.5 is a good starting point.

Tip

If you click on the circled numbers below the code, it will highlight the code that I’m referring to.

MyBugs <- read.csv(file = "w5-classdemo.csv")
1MyBugs <- within(MyBugs, {
  lactose <- factor(lactose);             #
2  contrasts(lactose) <- contr.sum
})

library(ggplot2)
3base.plot <- ggplot(MyBugs, aes(x = day, y = bacteria, colour = lactose))
4base.plot + geom_point()

1: The within( ... ) trick saves me having to say MyBugs$ every time I want to refer to a variable within the dataframe (in this instance, it is just lactose).
2: I’m setting up the contrast for lactose here already.
3: I’m setting up and saving an “empty” ggplot, for reusing later; this saves me typing this line over and over for each of the plots below.
4: Only now am I actually plotting it with the points.

Solution 1: Jittering the points sideways (only!)

base.plot + geom_jitter(height = 0, width = 0.1)

Solution 2: Using transparency (alpha)

base.plot + geom_point(size=3, alpha=0.5)

BB TEST Q1

From just looking at the plot of the raw data, what statements best describe the conclusions from the experiment?

Interpretation

In my example, the counts are higher with lactose on all days, but the difference clearly gets bigger:

Lactose-containing medium has higher bacterial counts on average than lactose-free medium on all days tested.
The difference between lactose and lactose-free medium is quite small on day 1 but increases very clearly as time progresses. On the last day, the difference between the two media is much larger than the spread of the data for either medium.

2.1 `day`: continuous or categorical?

BB TEST Q2

What is the most compelling reason for treating day as a continuous EV in this analysis?

Interpretation

In my example, the response seems to closely follow a linear trend for both growth media.

3 Finally: the LM

3.1 Set up contrasts

I’ve already set up my contrasts above, while I declared lactose a factor.

3.2 Fit the LM and interpret the results

TASK

Fit the LM that is most appropriate for answering this research question.

Interpret the results of your LM analysis to answer the following questions:

What question does the test for an interaction relate to in this analysis?
How strong is the evidence that the difference in bacterial counts in the two growth media is dependent on time (i.e., days since starting the culture)?

m.bugs <- lm(bacteria ~ day * lactose, MyBugs)
anova(m.bugs)

Analysis of Variance Table

Response: bacteria
            Df  Sum Sq Mean Sq F value    Pr(>F)    
day          1 1212.04 1212.04 505.526 < 2.2e-16 ***
lactose      1 1772.62 1772.62 739.338 < 2.2e-16 ***
day:lactose  1   97.09   97.09  40.497 1.351e-08 ***
Residuals   76  182.22    2.40                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Remember how to interpret models with interaction terms

Recall, we have two EVs ($X_1 = \mathtt{day},\ X_2 = \mathtt{lactose}$), but we are asking three research questions.

If the evidence for an interaction day:lactose is weak (“non-significant” – in the practical, P>0.01), report the separate findings for the two main effects, day and lactose.
If there is clear evidence for an interaction (in the practical, P<0.01), only report the results for the interaction – since this implies that both main effects (day and lactose) are important.
This is also why we don’t do, or need, a “Type III ANOVA”!

BB TEST Q3

Select the statement that is most appropriate for summarising your ANOVA results.

Interpretation

In my example, the two growth media differ in the bacterial counts they yield, and this difference in bacterial counts increases over time ($P = 1.35\times 10^{-8}$).

3.3 Using `emmeans` for “ANCOVA” interaction plots

TASK

Use the emmip() function to visualise the model estimates as an interaction plot.

library(emmeans)
emmip(m.bugs, lactose ~ day, cov.reduce = range)

3.4 Prediction

TASK

Use the model coefficients to predict the response [here, bacterial counts] 2.5 days into the experiment, and with lactose in the medium.

Do it ‘by hand’

I’d strongly suggest you do it by hand from the model coefficients, it will help you get your head round the meaning of the coefficients.

summary(m.bugs)             # Not needed - only to see the results.


Call:
lm(formula = bacteria ~ day * lactose, data = MyBugs)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8368 -1.0668 -0.2863  0.8977  4.1417 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    2.6557     0.4060   6.541 6.34e-09 ***
day            2.7523     0.1224  22.484  < 2e-16 ***
lactose1      -2.3702     0.4060  -5.838 1.23e-07 ***
day:lactose1  -0.7790     0.1224  -6.364 1.35e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.548 on 76 degrees of freedom
Multiple R-squared:  0.9442,    Adjusted R-squared:  0.942 
F-statistic: 428.5 on 3 and 76 DF,  p-value: < 2.2e-16

bug.coefs <- coef(m.bugs)   # Saving the coefficients for use below.

What do the coefficients mean here

Careful: With sum contrasts, the coefficient for the first factor level will be level1. In my example, the first level of lactose is “0” which means no lactose, so this is what lactose1 refers to!
Sanity check: Refer to the plot - without lactose (“0”) comes out lower, so it makes sense that the estimate for lactose1 comes out negative! (Remember that the lactose1 estimate would be for day = 0…)
Same for interaction: The slope without lactose (“0”) is shallower and the day:lactose1 estimate is negative (since it is the deviation from the fictive average slope).

1bug.coefs["(Intercept)"] +
2  bug.coefs["day"] * 2.5 +
3  bug.coefs["lactose1"] * -1 +
4  bug.coefs["day:lactose1"] * -1 * 2.5

1: Self-explanatory, we need to include the intercept…
2: Add the estimate for day multiplied by 2.5 (days).
At this point, we’d have the estimate for the mean of with and without lactose 2.5 days into the experiment.
3: Add the estimate for the lactose1 effect multiplied by the value of lactose in terms of its contrast: for “with lactose” this is -1, i.e., we sign-invert.
4: Finally, add the interaction – we need to multiply by both the value of day and the contrast value of lactose.

(Intercept) 
   13.85415

3.4.1 Easy visual check

Using the interaction plot we did earlier:

1emmip(m.bugs, lactose ~ day, cov.reduce = range) +
2  geom_vline(xintercept = 2.5) +
3  geom_hline(yintercept = 13.85)

1: emmip() is just a ‘wrapper’ around ggplot() so we can use ggplot commands to add stuff to the plot!
2: draw a vertical line at 2.5 along the day axis – the value we’re predicting for.
3: draw a horizontal line at 13.85 along the y-axis – the value that we got for our prediction.

Great, the predicted point is bang on the fit line for “with lactose”!

3.4.2 Visualising the slopes

The mean slope – half-way between with and without lactose:

emmip(m.bugs, lactose ~ day, cov.reduce = range) +
  geom_abline(linetype = 2,  # dashed
    intercept = bug.coefs["(Intercept)"],
    slope = bug.coefs["day"]
    )

The slope for “with lactose” (but not yet with the “with lactose” shift in intercept):

emmip(m.bugs, lactose ~ day, cov.reduce = range) +
  geom_abline(linetype = 2,  # dashed
    intercept = bug.coefs["(Intercept)"],
    slope = bug.coefs["day"] + bug.coefs["day:lactose1"] * -1
    )

Correcting the intercept to be “with lactose” as well:

emmip(m.bugs, lactose ~ day, cov.reduce = range) +
  geom_abline(linetype = 2,  # dashed
    intercept = bug.coefs["(Intercept)"] + bug.coefs["lactose1"] * -1,
    slope = bug.coefs["day"] + bug.coefs["day:lactose1"] * -1
    )

Putting the predicted lines on the original data

base.plot + geom_point(size=3, alpha=0.5) +
  geom_abline(
    intercept = bug.coefs["(Intercept)"] + bug.coefs["lactose1"],
    slope = bug.coefs["day"] + bug.coefs["day:lactose1"]
    ) +
  geom_abline(
    intercept = bug.coefs["(Intercept)"] + bug.coefs["lactose1"] * -1,
    slope = bug.coefs["day"] + bug.coefs["day:lactose1"] * -1
    )

4 Treating dose as categorical

TASK

Fit a new model that treats day as a factor (categorical EV). Hint: You could do this from within the model formula, without changing the data frame itself.

m.bugs2 <- lm(bacteria ~ factor(day) * lactose, MyBugs)

BB TEST Q5

Which of the following statements about treating day as continuous or categorical are correct?

anova(m.bugs)     # `days` as covariate

Analysis of Variance Table

Response: bacteria
            Df  Sum Sq Mean Sq F value    Pr(>F)    
day          1 1212.04 1212.04 505.526 < 2.2e-16 ***
lactose      1 1772.62 1772.62 739.338 < 2.2e-16 ***
day:lactose  1   97.09   97.09  40.497 1.351e-08 ***
Residuals   76  182.22    2.40                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(m.bugs2)    # `days` as factor

Analysis of Variance Table

Response: bacteria
                    Df  Sum Sq Mean Sq F value    Pr(>F)    
factor(day)          4 1225.39  306.35 127.757 < 2.2e-16 ***
lactose              1 1772.62 1772.62 739.240 < 2.2e-16 ***
factor(day):lactose  4   98.11   24.53  10.229 1.404e-06 ***
Residuals           70  167.85    2.40                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

TASK

Use emmip() from the emmeans package to create an interaction plot for the model that treats day as a factor. Note there are two ways of plotting this that amount to the same thing. Can you figure out what the two ways of plotting are, and which one is better for comparing with the continuous day version of the model?

emmip(m.bugs2, lactose ~ factor(day))

or the other way around – probably less intuitive here:

emmip(m.bugs2, factor(day) ~ lactose)

5 BONUS: multiple predictions in one go

Lets do this with the model that has day as continuous: predict

without lactose, on days 4, 4.5 and 6 (6 is dangerous – we are extrapolating beyond the range of the available data!)
with lactose, on the same days.

pred.df <- data.frame(
  day = c(4, 4.5, 6, 4, 4.5, 6),
  lactose = factor( c(0, 0, 0, 1, 1, 1) )
)

pred.df

  day lactose
1 4.0       0
2 4.5       0
3 6.0       0
4 4.0       1
5 4.5       1
6 6.0       1

Looking good… could also do this using rep() which comes in handy for building bigger dataframes from scratch:

pred.df <- data.frame(
  day = rep( c(4, 4.5, 6), 2),
  lactose = factor(
    rep(c(0, 1), each=3))
)
pred.df

  day lactose
1 4.0       0
2 4.5       0
3 6.0       0
4 4.0       1
5 4.5       1
6 6.0       1

Now, predict on these data.

predict(m.bugs, newdata = pred.df)

        1         2         3         4         5         6 
 8.178718  9.165378 12.125356 19.151126 20.916784 26.213758

But may be easier to read if we put it together with the data

pred.df$prediction <- predict(m.bugs, newdata = pred.df)
pred.df

  day lactose prediction
1 4.0       0   8.178718
2 4.5       0   9.165378
3 6.0       0  12.125356
4 4.0       1  19.151126
5 4.5       1  20.916784
6 6.0       1  26.213758

1 Lecture data example

2 Load and plot the data

2.1 day: continuous or categorical?

3 Finally: the LM

3.1 Set up contrasts

3.2 Fit the LM and interpret the results

3.3 Using emmeans for “ANCOVA” interaction plots

3.4 Prediction

3.4.1 Easy visual check

3.4.2 Visualising the slopes

4 Treating dose as categorical

5 BONUS: multiple predictions in one go

2.1 `day`: continuous or categorical?

3.3 Using `emmeans` for “ANCOVA” interaction plots