- 1
-
The
within( ... )trick saves me having to sayMyBugs$every time I want to refer to a variable within the dataframe (in this instance, it is justlactose). - 2
-
I’m setting up the contrast for
lactosehere already. - 3
- I’m setting up and saving an “empty” ggplot, for reusing later; this saves me typing this line over and over for each of the plots below.
- 4
- Only now am I actually plotting it with the points.
W5 Practical Demo
1 Lecture data example
For this week’s demo, I’m using the lecture example of the bacterial growth experiment with or without lactose in the medium. I have simulated a larger sample size than in the lecture dataset, to make it more similar to what you have in the practical.
Research question: Effect of the presence of lactose in the growth medium on bacterial counts over time.
Outcome: Bacterial count (bacteria)
Predictors:
- day of counting (
day: 1, 2, 3, 4, 5) - growth medium (
lactose– 0 = without, 1 = with; in the lecture, it was 1 = without, 2 = with)
2 Load and plot the data
Plot the data as a scatterplot, with different colours for the two levels of [the factor EV; here, lactose]. Because day is 1, 2, 3…, the points for one day will all line up and some may land on top of another, make it harder to interpret the plot.
You can try two solutions in ggplot:
- Use
geom_jitterto jitter the points a little along theday-axis - but be mindful that doing so for a continuous variable means that your plot is no longer a faithful representation of the data; - Use
geom_pointbut make the points bigger (e.g,size = 3) and use thealphaoption to make them semi-transparent: they will appear darker where they overlap.alpha = 0.5is a good starting point.
If you click on the circled numbers below the code, it will highlight the code that I’m referring to.
Solution 1: Jittering the points sideways (only!)
base.plot + geom_jitter(height = 0, width = 0.1)Solution 2: Using transparency (alpha)
base.plot + geom_point(size=3, alpha=0.5)From just looking at the plot of the raw data, what statements best describe the conclusions from the experiment?
In my example, the counts are higher with lactose on all days, but the difference clearly gets bigger:
- Lactose-containing medium has higher bacterial counts on average than lactose-free medium on all days tested.
- The difference between lactose and lactose-free medium is quite small on day 1 but increases very clearly as time progresses. On the last day, the difference between the two media is much larger than the spread of the data for either medium.
2.1 day: continuous or categorical?
What is the most compelling reason for treating day as a continuous EV in this analysis?
In my example, the response seems to closely follow a linear trend for both growth media.
3 Finally: the LM
3.1 Set up contrasts
I’ve already set up my contrasts above, while I declared lactose a factor.
3.2 Fit the LM and interpret the results
Fit the LM that is most appropriate for answering this research question.
Interpret the results of your LM analysis to answer the following questions:
- What question does the test for an interaction relate to in this analysis?
- How strong is the evidence that the difference in bacterial counts in the two growth media is dependent on time (i.e., days since starting the culture)?
m.bugs <- lm(bacteria ~ day * lactose, MyBugs)
anova(m.bugs)Analysis of Variance Table
Response: bacteria
Df Sum Sq Mean Sq F value Pr(>F)
day 1 1212.04 1212.04 505.526 < 2.2e-16 ***
lactose 1 1772.62 1772.62 739.338 < 2.2e-16 ***
day:lactose 1 97.09 97.09 40.497 1.351e-08 ***
Residuals 76 182.22 2.40
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Recall, we have two EVs (\(X_1 = \mathtt{day},\ X_2 = \mathtt{lactose}\)), but we are asking three research questions.
- If the evidence for an interaction
day:lactoseis weak (“non-significant” – in the practical, P>0.01), report the separate findings for the two main effects,dayandlactose. - If there is clear evidence for an interaction (in the practical, P<0.01), only report the results for the interaction – since this implies that both main effects (
dayandlactose) are important. - This is also why we don’t do, or need, a “Type III ANOVA”!
Select the statement that is most appropriate for summarising your ANOVA results.
In my example, the two growth media differ in the bacterial counts they yield, and this difference in bacterial counts increases over time (\(P = 1.35\times 10^{-8}\)).
3.3 Using emmeans for “ANCOVA” interaction plots
Use the emmip() function to visualise the model estimates as an interaction plot.
library(emmeans)
emmip(m.bugs, lactose ~ day, cov.reduce = range)3.4 Prediction
Use the model coefficients to predict the response [here, bacterial counts] 2.5 days into the experiment, and with lactose in the medium.
I’d strongly suggest you do it by hand from the model coefficients, it will help you get your head round the meaning of the coefficients.
summary(m.bugs) # Not needed - only to see the results.
Call:
lm(formula = bacteria ~ day * lactose, data = MyBugs)
Residuals:
Min 1Q Median 3Q Max
-2.8368 -1.0668 -0.2863 0.8977 4.1417
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.6557 0.4060 6.541 6.34e-09 ***
day 2.7523 0.1224 22.484 < 2e-16 ***
lactose1 -2.3702 0.4060 -5.838 1.23e-07 ***
day:lactose1 -0.7790 0.1224 -6.364 1.35e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.548 on 76 degrees of freedom
Multiple R-squared: 0.9442, Adjusted R-squared: 0.942
F-statistic: 428.5 on 3 and 76 DF, p-value: < 2.2e-16
bug.coefs <- coef(m.bugs) # Saving the coefficients for use below.Careful: With sum contrasts, the coefficient for the first factor level will be
level1. In my example, the first level oflactoseis “0” which means no lactose, so this is whatlactose1refers to!Sanity check: Refer to the plot - without lactose (“0”) comes out lower, so it makes sense that the estimate for
lactose1comes out negative! (Remember that thelactose1estimate would be for day = 0…)Same for interaction: The slope without lactose (“0”) is shallower and the
day:lactose1estimate is negative (since it is the deviation from the fictive average slope).
- 1
- Self-explanatory, we need to include the intercept…
- 2
-
Add the estimate for
daymultiplied by 2.5 (days).
At this point, we’d have the estimate for the mean of with and without lactose 2.5 days into the experiment. - 3
-
Add the estimate for the
lactose1effect multiplied by the value oflactosein terms of its contrast: for “with lactose” this is-1, i.e., we sign-invert. - 4
-
Finally, add the interaction – we need to multiply by both the value of
dayand the contrast value oflactose.
(Intercept)
13.85415
3.4.1 Easy visual check
Using the interaction plot we did earlier:
- 1
-
emmip()is just a ‘wrapper’ aroundggplot()so we can use ggplot commands to add stuff to the plot! - 2
-
draw a vertical line at 2.5 along the
dayaxis – the value we’re predicting for. - 3
- draw a horizontal line at 13.85 along the y-axis – the value that we got for our prediction.
Great, the predicted point is bang on the fit line for “with lactose”!
3.4.2 Visualising the slopes
The mean slope – half-way between with and without lactose:
emmip(m.bugs, lactose ~ day, cov.reduce = range) +
geom_abline(linetype = 2, # dashed
intercept = bug.coefs["(Intercept)"],
slope = bug.coefs["day"]
)The slope for “with lactose” (but not yet with the “with lactose” shift in intercept):
emmip(m.bugs, lactose ~ day, cov.reduce = range) +
geom_abline(linetype = 2, # dashed
intercept = bug.coefs["(Intercept)"],
slope = bug.coefs["day"] + bug.coefs["day:lactose1"] * -1
)Correcting the intercept to be “with lactose” as well:
emmip(m.bugs, lactose ~ day, cov.reduce = range) +
geom_abline(linetype = 2, # dashed
intercept = bug.coefs["(Intercept)"] + bug.coefs["lactose1"] * -1,
slope = bug.coefs["day"] + bug.coefs["day:lactose1"] * -1
)Putting the predicted lines on the original data
base.plot + geom_point(size=3, alpha=0.5) +
geom_abline(
intercept = bug.coefs["(Intercept)"] + bug.coefs["lactose1"],
slope = bug.coefs["day"] + bug.coefs["day:lactose1"]
) +
geom_abline(
intercept = bug.coefs["(Intercept)"] + bug.coefs["lactose1"] * -1,
slope = bug.coefs["day"] + bug.coefs["day:lactose1"] * -1
)4 Treating dose as categorical
Fit a new model that treats day as a factor (categorical EV). Hint: You could do this from within the model formula, without changing the data frame itself.
m.bugs2 <- lm(bacteria ~ factor(day) * lactose, MyBugs)Which of the following statements about treating day as continuous or categorical are correct?
anova(m.bugs) # `days` as covariateAnalysis of Variance Table
Response: bacteria
Df Sum Sq Mean Sq F value Pr(>F)
day 1 1212.04 1212.04 505.526 < 2.2e-16 ***
lactose 1 1772.62 1772.62 739.338 < 2.2e-16 ***
day:lactose 1 97.09 97.09 40.497 1.351e-08 ***
Residuals 76 182.22 2.40
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(m.bugs2) # `days` as factorAnalysis of Variance Table
Response: bacteria
Df Sum Sq Mean Sq F value Pr(>F)
factor(day) 4 1225.39 306.35 127.757 < 2.2e-16 ***
lactose 1 1772.62 1772.62 739.240 < 2.2e-16 ***
factor(day):lactose 4 98.11 24.53 10.229 1.404e-06 ***
Residuals 70 167.85 2.40
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Use emmip() from the emmeans package to create an interaction plot for the model that treats day as a factor. Note there are two ways of plotting this that amount to the same thing. Can you figure out what the two ways of plotting are, and which one is better for comparing with the continuous day version of the model?
emmip(m.bugs2, lactose ~ factor(day))or the other way around – probably less intuitive here:
emmip(m.bugs2, factor(day) ~ lactose)5 BONUS: multiple predictions in one go
Lets do this with the model that has day as continuous: predict
- without lactose, on days 4, 4.5 and 6 (6 is dangerous – we are extrapolating beyond the range of the available data!)
- with lactose, on the same days.
pred.df <- data.frame(
day = c(4, 4.5, 6, 4, 4.5, 6),
lactose = factor( c(0, 0, 0, 1, 1, 1) )
)
pred.df day lactose
1 4.0 0
2 4.5 0
3 6.0 0
4 4.0 1
5 4.5 1
6 6.0 1
Looking good… could also do this using rep() which comes in handy for building bigger dataframes from scratch:
pred.df <- data.frame(
day = rep( c(4, 4.5, 6), 2),
lactose = factor(
rep(c(0, 1), each=3))
)
pred.df day lactose
1 4.0 0
2 4.5 0
3 6.0 0
4 4.0 1
5 4.5 1
6 6.0 1
Now, predict on these data.
predict(m.bugs, newdata = pred.df) 1 2 3 4 5 6
8.178718 9.165378 12.125356 19.151126 20.916784 26.213758
But may be easier to read if we put it together with the data
pred.df$prediction <- predict(m.bugs, newdata = pred.df)
pred.df day lactose prediction
1 4.0 0 8.178718
2 4.5 0 9.165378
3 6.0 0 12.125356
4 4.0 1 19.151126
5 4.5 1 20.916784
6 6.0 1 26.213758