Practical 4 - AESC40180 25/26

Author

Virginia Morera-Pujol

Introduction

The goal of this practical is for you to run different versions of linear models, explore the outputs, and learn how to interpret and diagnose them. The data for this practical comes from an experiment in which the growth (in mm) of two species of caterpillars (A and B) was studied, and the effect of the content of tannins in their diet explored. The data is contained in the file named caterpillars.csv.

Housekeeping

Use the well known code to clear environment, set working directory if necessary, and load the data.

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 20 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Species
dbl (2): growth, tannin

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data cleaning

Check if any data cleaning is necessary for this dataset: is it in long, or wide format? Does it need pivoting? Any character variables need to be turned into factors? If so, perform the necessary transformations.

Data visualisation

We will use ggplot to produce a scatterplot and add a model fit line. Below I have provided part of the code, but you need to replace the text in quotation marks by the proper aes() arguments. Check previous practicals to figure out how, if you’re struggling. I have, however, given you the full code to add the fit line (geom_smooth()) as this is new to you.

ggplot(caterpillars) + 
  geom_point("replace this with the necessary code") + 
  geom_smooth(aes(x = tannin, y = growth), method = lm, se = FALSE) + 
  theme_bw()

`geom_smooth()` using formula = 'y ~ x'

Are tannins good for caterpillars?
Can you tell, from this plot, if different species have different susceptibility to tannins?

Run linear model

The way to write lm equations is exactly the same as an ANOVA equation. Because, as we’ve said plenty of times, an ANOVA is just a particular case of lm. So, you should be able to write the code to run a lm testing the effect of tannins (only, for now) on growth based on the caterpillars dataset, and store it in an object (in my case I called it lm1, but feel free to name it whatever). Then, print the model summary.


Call:
lm(formula = growth ~ tannin, data = caterpillars)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.76364 -0.58182  0.08864  0.77614  2.46364 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 11.80909    0.48424   24.39 3.06e-15 ***
tannin      -1.09091    0.09071  -12.03 4.87e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.165 on 18 degrees of freedom
Multiple R-squared:  0.8893,    Adjusted R-squared:  0.8832 
F-statistic: 144.6 on 1 and 18 DF,  p-value: 4.869e-10

Write down the equation of the model, and then replace the betas with the actual values obtained from the summary

If you prefer to check confident intervals rather than p-values (I do!) you can obtain confidence intervals for your parameters with the function confint()

confint(lm1)

                2.5 %     97.5 %
(Intercept) 10.791736 12.8264457
tannin      -1.281477 -0.9003411

As an alternative to p-values, we would say that the parameters are significantly different from zero if the confidence intervals don’t contain zero.

Predict from the model

One thing we haven’t learned in the lecture is that we can actually use the model to predict y (growth) for different values of x (tannins). For that, we need to use the function predict(), which takes at least 2 arguments.

The first argument is the name of the model (in my case lm1, you’ll have to use whatever you’ve called your model)
The second argument is the new data onto which we are predicting. In this case, let’s say I’m interested in predicting the growth if the tannin content of the diet is 0.5, 1.5, 2.5, 3.5, 4.5, 5.5, and 6.5.
The third argument in this case is telling R that I want it to produce confidence intervals around my predicted values. That is, considering the error of the model, give me the most likely prediction, but also a range of values where that prediction could be, considering the error of the model.
Feel free to explore other arguments that the function can take by typing help(predict) into your console

*Importantly, we shouldn’t use this function to predict outside the data space. That is, we can ask R to predict the growth of the caterpillas if the tannin content in the diet is 15. However, this is too far away from the values of tannins we’ve used to run the model, and there’s no guarantee that the relationship between tannins and growth will stay the same that far out.

In the code below, you will learn how to generate a new dataset that has the needed values in the tannin column, and then all empty (NA) cells in the growth column, that we will fill in with our prediction

new_data <- data.frame(
  tannin = c(0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5), 
  growth = NA
)

Now we use our model (lm1) and our newly created dataframe (new_data) to predict, and we’ll store the output in an object called pred. The additional pipe and as.data.frame will just turn the output into a proper dataframe that we can work with.

pred <- predict(lm1, new_data, interval = "prediction") %>% 
  as.data.frame()

If you open that dataframe you’ll see it has three columns, fit, lwr, and upr. Fit corresponds to the predicted value, lwr to the lower end of the confidence interval, and upr to the upper end of the confidence interval. We are now going to add those columns to our new_data dataset which already contains the pre-defined tannin value that we have predicted on

new_data <- new_data %>% 
  mutate(growth = pred$fit, 
         lwr = pred$lwr, 
         upr = pred$upr)

This now allows us to plot the predicted values with geom_point and geom_errorbar. Since you’ve already encountered this type of plot before, I’m going to give you only the “skeleton” and you have to replace the text inside the geoms with the proper code

ggplot(new_data) +
  geom_point("replace this with the necessary code") +
  geom_errorbar("replace this with the necessary code") + 
  theme_bw()

Model diagnostics

Diagnostic plots

Diagnostic plots are obtained in the same way as for ANOVAS as (once again all together now) ANOVAS are just a special case of linear models.Remember to divide the plotting area into a 2x2 plot grid, and to turn it back into a 1x1 grid.

ANOVA to compare with null model

In reality, as we have seen in the lecture, this nested ANOVAS test the significance of the betas in the same way as the t-test in the model summary do, but it’s good practice for you to run the null model and the anova. I’ve given you the “skeleton” again, but make sure to replace the text with your actual code

First, construct and run the null model and store it in an object named lm0 (or whatever you want)

lm0 <- lm("your code here")

Now, compare the two models (null model and the tannins model) using the function anova

anova("your code here")

What hypotheses are we testing here?
How do these F values relate to the t values in the lm summary?

Other diagnostics

RE-run the model summary and explore it.

What other ways of evaluating a model fit can you find here?
Is the model a good fit to the data? How can you tell?
Is the effect of tannins significantly different from 0? How can you tell? (there should be 2 ways of telling)
Do caterpillars experience any significant growth when the amount of tannins in the diet is 0? How much?
What hypothesis is the F value testing?

Your job

This is the part of the lab where you’ll actually learn, as I’m not giving you any code and very little instructions and you get to independently decide what to do. I would rather you spend most of the lab working on this part, and only a bit of time working through the introductory part, so if you find yourself still in the first part after 11, make sure to move onto this.

If you explore the dataset that you loaded at the beginning, in addition to growth and tannins content, it has a third column containing a categorical variable, “Species”. For this to work, you need to turn this into a factor (which you should have done at the start, but double check).

Run a multivariate linear model (that could also be called an ANCOVA) by testing the effect of tannins and species on caterpillar growth. These are some of the questions you can try and answer

Is there a difference in growth between species?
Do different species respond differently to the presence of tannins on their diet? (hint: is there an interaction between species and tannins)
Which model is a better fit to the data, the model with just tannins, the model with tannins and species, or the model with tannins, species, and an interaction? (hint: check adjusted R²).
Is the model with tannins better than the null model? Is the model with tannins and species better than the model with just tannins? Is the model with tannins, species, and the interaction between them better than the model with just tannins and species? (hint: these models are nested, use an ANOVA!)
Look at the model residuals for all models. Which residuals are more adjusted to a normal distribution?

As always, feel free to call me or the demonstrators over to discuss the code and outputs of these more complex model, or to send me the code or answers to these questions on an email if you want some feedback.