In this practical you will run your first model (yay!), which will be an ANOVA. You will have to go through the process of checking the variables normality, running the ANOVA, interpreting and reporting the results. The data we will be using is contained in the dataset called yields.csv. This is a very simple dataset with 10 observations (rows) of three variables (columns). The dataset represents crop yields per unit area measured from 10 randomly selected fields on each of three soil types (sand, clay, loam). All fields were sown with the same variety of seed and provided with the same fertilizer and pest control inputs, so we’re only testing the effect of soil type.
Housekeeping
You know the drill by now, clear environment, set working directory, load necessary packages. The functions necessary to run an anova are available on base R (no packages needed) so we only really need tidyverse for data wrangling and plotting.
rm(list=ls()) # this clears the environment from any leftover objects# setwd("Yourpath") #this tells R what folder you'll be working out oflibrary(tidyverse) #this loads the packages we'll need for easy coding
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Load data
You can load the data with the function read_csv in the same way you did in the two previous labs, but I’m not giving you the code here so you can practice. Call the data whatever you want, but I am going to call it yields
# add line of code here that will load your dataset. Hint: the function is called read_csv
Rows: 10 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): sand, clay, loam
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Data wrangling
If you check the structure of the data (hint: the function to do that is called str) you’ll realise the dataset has three columns, one for each soil type, and 10 rows (10 fields sampled in each soil type). You now will be introduced to another beautiful bit of data wrangling that is made easier with the tidyverse functions, we are going to pivot this table from wide to long format. If you remember the theory about the ANOVAs, there are two type of variables, response variable, and explanatory variable.
What would be the response variable for this ANOVA?
And what would be the explanatory variable?
Now that we have our variables, we need to transform our dataset, so that we have our response variable (yield) in one numeric column and our explanatory variable (soil_type) in one factor column.
Now that we have our dataset in the long format, we can display the means and standard deviations of our three types of soil. You have already done this before using the group_by() and summarise() functions, so I am going to let you figure out how to do it on your own (you can find the code to do it on the day 1 practical).
We can visualise the two components of variance (within, spread of each sample, and between, distance between sample means) by plotting the sample distributions. Follow the code below.
ggplot(yields_long) +geom_density(aes(x = yield, fill = soil_type), alpha =0.5) +theme_bw()
If now all the yields are in the one column called yields, how did I manage to plot one distribution for each treatment (each factor of the variable soil_type ?
Do you think we could display the overall distribution (not by source) behind these three? How?
Code
ggplot(yields_long) +geom_density(aes(x = yield), fill ="gray") +geom_density(aes(x = yield, fill = soil_type), alpha =0.5) +theme_bw()
Normality checks
One of the assumptions of ANOVA is that data are normally distributed. In addition to checking graphically, through histograms or density plots, we can check by performing a Shapiro-Wilks test. In case you don’t remember how to perform it, here’s the code
shapiro.test(yields_long$yield)
Shapiro-Wilk normality test
data: yields_long$yield
W = 0.97214, p-value = 0.5993
Do you remember how to interpret this output? What does the W mean? What does the p-value mean?
Is our data normally distributed?
Now, you might think it makes more sense to check normality for each level of the factor soil_type. That is, is the distribution of the clay sample normal? Is the distribution of the loam sample normal? And you would not be wrong! You will learn more about the assumption of normality and how we check for normality when there are many factors affecting our data, but for now, write your own code to check for normality of each soil type independently (Hint: it might be easier if you use the wide version of the data, so yields instead of yields_long
# add code here to perform Shapiro-Wilks test on each soil type independently
Let’s run the anova!
There are two different ways to run an ANOVA, and they are equivalent. Remember how I told you an ANOVA is just a particular case of a linear model, just with a categorical explanatory variable? This is where I prove it.
The formula for our model is (formula 1) \[
\text{yield}_i = \beta_0 + \beta_1 \,\text{soil\_type}_i + \varepsilon_i,
\]
However, because we know soil_type has three different levels, this can be rewritten as (formula 2)
Why do you think there is no specified coefficient for “clay”?
You can directly run the anova with this data. To do that, you can run the aov() function and store the result in an object called aov_yields, and then you can ask R for a summary of the anova by running the function summary() on that object:
aov_yields <-aov(yield ~ soil_type, data = yields_long)summary(aov_yields)
Df Sum Sq Mean Sq F value Pr(>F)
soil_type 2 99.2 49.60 4.245 0.025 *
Residuals 27 315.5 11.69
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
However, to prove to you how ANOVA is just a particular case of a linear model, this is another way of doing this:
lm_yields <-lm(yield ~ soil_type, data = yields_long)anova(lm_yields)
Analysis of Variance Table
Response: yield
Df Sum Sq Mean Sq F value Pr(>F)
soil_type 2 99.2 49.600 4.2447 0.02495 *
Residuals 27 315.5 11.685
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Yes, the fact that the direct function is called aov() whereas the function to get an ANOVA output from the lm() function is called anova() is confusing. And it is really not important for you to remember this detail. I’m just proving to you that ANOVA = lm.
Why do you think there are 27 degrees of freedom in the “residuals” field? Look at formula 2. How many parameters are we estimating?
Why do you think there are 2 df in the “soil_type” field? This is a bit more complicated and it takes a bit to wrap your head arund.
How is the value of F calculated?
How do you interpret that p-value?
Great, now we know that the three samples are not drawn from the same population. However, can we know which comparisons (clay vs loam or clay vs sand) are causing this output? We can check the output of the linear model itself, to check differences between levels:
summary(lm_yields)
Call:
lm(formula = yield ~ soil_type, data = yields_long)
Residuals:
Min 1Q Median 3Q Max
-8.5 -1.8 0.3 1.7 7.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.500 1.081 10.638 3.7e-11 ***
soil_typeloam 2.800 1.529 1.832 0.0781 .
soil_typesand -1.600 1.529 -1.047 0.3046
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.418 on 27 degrees of freedom
Multiple R-squared: 0.2392, Adjusted R-squared: 0.1829
F-statistic: 4.245 on 2 and 27 DF, p-value: 0.02495
You can interpret the linear model result by replacing the coefficients in formula 2:
Which soil type has the highest yield? Is that significantly lower than the yield of clay soils?
Which soil type has the lowest yield? Is that significantly higher than the yield of clay soils?
From the comparisons displayed, what can you infer about the difference in yield between loam and sand?
Model diagnostics
Now we know what the results of our ANOVA are, there is only one question. Do we trust these results? To check our ANOVA fit and see if we can trust it or not, there are several graphics we can plot. For once, we won’t be plotting with ggplot2, we will use base R plotting. This can be done with ggplot2 but it’s a more complicated process, and we don’t need complication in our lives.
We will use the function plot() on our aov object, however, this will produce several plots in a sequence and we need a way to tell R that we want to see all the plots at once, in a 2x2 grid. That’s what the first line of code is doing (I won’t dwell on it because we’ll seldom use it, but feel free to explore on your own)
par(mfrow =c(2,2))plot(aov_yields)
How to interpret these? Sometimes looking at the plot titles doesn’t help much. “Residuals vs. Fitted” might make some sort of sense, but “Scale-Location” gives very little information, for example. However, if you check the axes names it might give you a few more clues. Find below a short explanation of what each plot is doing:
Residuals vs fitted: this plot displays fitted values (since the explanatory variable here is a factor, the fitted values are the predicted mean for each group) in the x axis and the residuals (how far each individual value is from the mean) in the y axis. You want your residuals to be centered around 0 for all groups, and the way to check if they aren’t is to see how much the red line (LOESS fit) deviates from a flat, horizontal line. In this case, it is “flat enough”
Q-Q residuals: this plot checks the assumption that residuals are linear. You already know what a Q-Q plot is, this one just displays the residuals (difference between each value and its group mean) instead of the sample. Can you remember how you identify normality with a Q-Q plot?
Scale-Location: this displays the fitted values (again, predicted group means) vs the square root of the standardised residuals (just a different way of displaying residuals is all you need to know). Check the outlier values (16, 2, and 8) to see how this plot is similar or differs from the first one. This plot is checking homogeneity of variances amongst groups (homoscedasticity). Here, instead of checking if the mean of the residuals of each level is above or below 0 (position) you’re checking the spread of these resiudals. If one level had much more spread than the others, that means that the variances are not equivalent amongst groups, and that is a violation of ANOVA assumptions.
Constant leverage: in an ANOVA,there is not a huge difference between this plot and the first and third plots. This shows you the levels in the x axis (which means they’re not ordered by mean anymore, but alphabetically) and the standardised residuals (without the square root in this case) in the y axis. Here you can also check that the means of the errors are around zero, that the variances are constant, but additionally you can identify which group the outliers are in.
What is your task here?
Understand what each line of code is doing. Add your own code where required, so you can produce your own ANOVA analysis and diagnostic plots.
Look at the ANOVA outputs (both from the aov() and the summary(lm()) functions. Make sure you understand what each of the lines in those outputs mean.
If you’ve done the two tasks above and still have time, practice how you’d report these results in a scientific publication. Look up other papers using anova, find examples online, ask your LLM of choice. Feel free to show me what you’ve written (in person or sending it to me by email) and get some feedback if you want, but this is not a graded activity.