Topic 7: One-way ANOVA


In this computer lab, we will extend our understanding of one-way ANOVA by practising what we have learnt in Topic 7.


1 One-way ANOVA

🏡 A one-way ANOVA is used to test for differences in means between two or more independent groups.

In this question, we will assess the average life expectancy of people living on the different continents around the world (excluding Antarctica!), using a subset of data from the gapminder R package (see Gapminder 2021a; Gapminder 2021b).

The data set we will analyse, gapminder_2002.csv, contains data from 142 different countries, and was recorded for the year 2002. In this question, we will focus solely on the following variables:

  • Dependent variable: Life Expectancy
    (Numeric: The average life expectancy at birth, in years, of individuals from a country)

  • Independent variable: Continent\(^\dagger\)
    (Categorical: The continent from which the data was obtained. One of Africa, Americas, Europe, Asia_Oceania)

\(^\dagger\) Note that for this data set, North America and South America have been combined into the (super) continent Americas, while Asia_Oceania includes Asia, Australia and New Zealand.

1.1

💻 Download the gapminder_2002.csv file from the LMS, and save it in a relevant location on your device.

1.2

💻 Open RStudio and create a new script file. Set your working directory to the folder in which you saved the gapminder_2002.csv file, then import the data set by running the code below:

gapminder_2002 <- read.csv("gapminder_2002.csv", header = TRUE)

1.3 Initial Exploratory Analysis

💻 Let’s start by carrying out some exploratory data analysis. To begin, visualise our data by constructing a box plot of people’s Life Expectancy, separated by Continent. Make sure to label your box plot clearly.

Hint: Recall that we covered how to create box plots of one variable, separated by another variable, in previous core computer labs (see e.g. Computer Lab 7).
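As a reminder of the general pattern, here is a minimal sketch using R's built-in PlantGrowth data set rather than the gapminder data. In PlantGrowth, weight is the numeric variable and group is the categorical variable; the title and axis labels are illustrative choices only.

```r
# Illustration only: built-in PlantGrowth data, not gapminder_2002
# Box plot of a numeric variable (weight), separated by a categorical variable (group)
plant_box <- boxplot(weight ~ group, data = PlantGrowth,
                     main = "Dried Plant Weight by Treatment Group",
                     xlab = "Treatment group",
                     ylab = "Dried weight")
```

The same formula layout (numeric variable on the left of the ~, grouping variable on the right) should carry over to your own variables.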

1.3.1

💻 Compute the sample mean and standard deviation of people’s Life Expectancy by Continent. Also note down the sample size for each Continent.

Hint: Recall that we performed similar calculations in Computer Lab 7. If you are not sure how to proceed with this question, check the following code.

# Compute the mean life expectancy for each continent
tapply(gapminder_2002$lifeExp, gapminder_2002$continent, mean)
# Count the number of observations (countries) in each continent
table(gapminder_2002$continent)

🎧 Online students 💬 Enter your answers next to the question in the shared Google Doc.
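The tapply pattern is not limited to the mean: any summary function can go in the last position. The sketch below, again on the built-in PlantGrowth data rather than the gapminder data, gathers the mean, standard deviation and sample size for each group into one table. The object names are illustrative only.

```r
# Illustration only: built-in PlantGrowth data, not gapminder_2002
group_means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)
group_sds   <- tapply(PlantGrowth$weight, PlantGrowth$group, sd)
group_sizes <- table(PlantGrowth$group)

# Collect the three summaries in one table, with one row per group
data.frame(mean = as.vector(group_means),
           sd   = as.vector(group_sds),
           n    = as.vector(group_sizes),
           row.names = names(group_means))
```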

1.3.2

🏡 What do you observe from your results for 1.3 and 1.3.1?

🎧 Online students 💬 Leave a comment about your observations next to the question in the shared Google Doc.

1.4 Defining Hypotheses

🏡 We would like to test, at the \(5\%\) level of significance, whether average Life Expectancy differs depending upon the Continent in which people live. In order to carry out our one-way ANOVA, we first need to clearly define our null and alternative hypotheses (\(H_0\) and \(H_1\) respectively).

Suppose that we let \(\mu_1\) denote the true (population) average life expectancy at birth of people born in the continent Africa.

Using this notation as a guide, define similar notation for the other Continent categories, and use these to define an appropriate \(H_0\) and \(H_1\).

Hint: Check the Topic 7 readings if you are unsure how to proceed.
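For reference, the general form of the hypotheses for a one-way ANOVA with \(k\) groups can be sketched as follows. Here \(k\) and the indices \(i\) and \(j\) are generic placeholders, so you will still need to define the group means for this data set yourself.

\[
H_0: \mu_1 = \mu_2 = \cdots = \mu_k
\qquad \text{versus} \qquad
H_1: \mu_i \neq \mu_j \ \text{for at least one pair } i \neq j .
\]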

1.5 Conducting a one-way ANOVA in R

💻 In R, we can use the aov function to carry out the one-way ANOVA described in part 1.4 above.

The aov function has the following structure:

example_anova <- aov(y_variable ~ x_variable, data = example_data)

Here, we model a chosen y_variable from our data set against a chosen x_variable. We specify the data set to use via the data = argument.

Using this information, conduct a one-way ANOVA of Life Expectancy across Continents, using the gapminder_2002 data, as discussed in 1.4.

Note: Since we specify the data set via the data = argument, we don’t need to write e.g. gapminder_2002$gdpPercap if we want to include the gdpPercap variable in our ANOVA. We can simply write gdpPercap in the y_variable or x_variable position, as desired.
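To see the aov structure with real variable names, here is a minimal sketch on the built-in PlantGrowth data (not the gapminder data), where weight plays the role of y_variable and group plays the role of x_variable. The object name plant_anova is an arbitrary choice.

```r
# Illustration only: built-in PlantGrowth data, not gapminder_2002
# y_variable = weight (numeric), x_variable = group (categorical)
plant_anova <- aov(weight ~ group, data = PlantGrowth)

# Printing the fitted object shows the sums of squares and degrees of freedom
plant_anova
```

Note that the variables are referred to by their column names alone, because the data set is supplied through the data = argument.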

1.5.1

💻 Assess your ANOVA results using the summary R command, and note the following:

  • The degrees of freedom \(d_1\) and \(d_2\);
  • The \(p\)-value;
  • The test statistic (\(F\) value).

Hint: You can use the summary function as shown in the R code below. You will need to adapt this example to your own ANOVA object:

# Summarise results stored in object `example_anova`
summary(example_anova)

🎧 Online students 💬 Enter your answers next to the question in the shared Google Doc.
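If you would like to pull the individual values out of the summary rather than read them off the printed table, one possible approach is sketched below on the built-in PlantGrowth data (not the gapminder data). PlantGrowth has 3 groups and 30 observations, so we expect \(d_1 = 3 - 1 = 2\) and \(d_2 = 30 - 3 = 27\). All object names are illustrative.

```r
# Illustration only: built-in PlantGrowth data, not gapminder_2002
plant_anova <- aov(weight ~ group, data = PlantGrowth)
plant_summary <- summary(plant_anova)
plant_summary

# The first element of the summary is the ANOVA table itself
anova_table <- plant_summary[[1]]
d1 <- anova_table[1, "Df"]            # between-groups degrees of freedom
d2 <- anova_table[2, "Df"]            # residual (within-groups) degrees of freedom
f_value <- anova_table[1, "F value"]  # test statistic
p_value <- anova_table[1, "Pr(>F)"]   # p-value
c(d1 = d1, d2 = d2, F = f_value, p = p_value)
```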

1.5.2

🏡 Write a brief statement that summarises your results.

1.6 Test Assumption Checks

🏡 So far, we have proceeded assuming that the one-way ANOVA test assumptions were satisfied for our analysis. We should check these assumptions now.

Similar to the independent samples \(t\)-test, we have 4 test assumptions to check:

  1. The data are numeric
  2. The observations are independent
  3. The groups have equal variances
  4. The one-way ANOVA residuals are normally distributed

1.6.1 Equal Variances Assumption Check

We know that the data are numeric (1), and we can assume that the observations are independent between continents (2). Therefore, we next need to test for the equality of variances between the groups (3).

To check this, run the code library(car), and then use the leveneTest R command to carry out Levene’s Test for equal variances.

What do you conclude? Provide a simple sentence, using the \(p\)-value you obtain from Levene’s Test to support your decision.

Hint: We can interpret the Levene’s Test output just as we did in the independent samples \(t\)-test scenario in Computer Lab 7.

🎧 Online students 💬 Enter your answer next to the question in the shared Google Doc.
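To see what Levene’s Test is doing, the sketch below reproduces it by hand in base R on the built-in PlantGrowth data (not the gapminder data). It assumes the median-centred version of the test, which is the default for leveneTest in the car package: a one-way ANOVA is run on the absolute deviations of each observation from its group median. The object names are illustrative, and the car call at the end only runs if the package is installed.

```r
# Illustration only: built-in PlantGrowth data, not gapminder_2002
# Step 1: absolute deviation of each observation from its own group median
group_medians <- ave(PlantGrowth$weight, PlantGrowth$group, FUN = median)
abs_dev <- abs(PlantGrowth$weight - group_medians)

# Step 2: one-way ANOVA on those absolute deviations
levene_by_hand <- anova(lm(abs_dev ~ PlantGrowth$group))
levene_p <- levene_by_hand[1, "Pr(>F)"]
levene_p

# If the car package is installed, leveneTest should report the same p-value
if (requireNamespace("car", quietly = TRUE)) {
  print(car::leveneTest(weight ~ group, data = PlantGrowth))
}
```

For PlantGrowth the \(p\)-value is greater than \(0.05\), so there is no evidence against equal variances in that example. If your grouping column was read in as text rather than as a factor, wrapping it in factor() first is a cautious habit, since leveneTest may otherwise warn that it has coerced the variable.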

1.6.2 Normality Assumption Check - Visual

🏡 We also need to check the normality of the residuals produced for our one-way ANOVA (4). We can access these using ...$residuals (where you will need to replace the ...s with the name you chose for your ANOVA - e.g. example_anova$residuals).

Complete the following:

  • Create a histogram of the residuals.
  • Overlay a normal curve on this histogram, using the residuals data to inform your choice of mean and standard deviation.
  • Also create a Normal Q-Q plot of the residuals.

Based upon visual inspection of these plots, what do you conclude?

🎧 Online students 💬 Enter your answer next to the question in the shared Google Doc.
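The three plotting steps above can be sketched as follows, using the built-in PlantGrowth data (not the gapminder data). The key detail is freq = FALSE, which puts the histogram on the density scale so that the normal curve is drawn on a comparable vertical axis. Titles and object names are illustrative choices.

```r
# Illustration only: built-in PlantGrowth data, not gapminder_2002
plant_anova <- aov(weight ~ group, data = PlantGrowth)
plant_res <- plant_anova$residuals

# Histogram on the density scale, so that a density curve can be overlaid
hist(plant_res, freq = FALSE,
     main = "Histogram of ANOVA Residuals", xlab = "Residual")

# Normal curve using the mean and standard deviation of the residuals
curve(dnorm(x, mean = mean(plant_res), sd = sd(plant_res)), add = TRUE)

# Normal Q-Q plot of the residuals, with a reference line
qqnorm(plant_res)
qqline(plant_res)
```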

1.6.3 Normality Assumption Check - Formal

🏡 To support your conclusion to 1.6.2, it is important to carry out a formal statistical test. Use the Shapiro-Wilk test to assess the normality of the residuals.

What do you find? Does this support your answer to 1.6.2?

🎧 Online students 💬 Enter your answer next to the question in the shared Google Doc.
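As a pattern to adapt, here is a minimal sketch of the Shapiro-Wilk test applied to ANOVA residuals, using the built-in PlantGrowth data (not the gapminder data). The null hypothesis of the test is that the data are normally distributed, so a small \(p\)-value is evidence against normality. Object names are illustrative.

```r
# Illustration only: built-in PlantGrowth data, not gapminder_2002
plant_anova <- aov(weight ~ group, data = PlantGrowth)

# Shapiro-Wilk test on the residuals of the fitted ANOVA
plant_sw <- shapiro.test(plant_anova$residuals)
plant_sw

# The p-value can also be extracted directly
plant_sw$p.value
```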

1.7 Post-hoc Testing

🏡 Regardless of your conclusions above in 1.6, we will proceed under the assumption that our one-way ANOVA test assumptions have been safely met. Our next step is to conduct a Tukey HSD post-hoc test.

Use the TukeyHSD R function to carry out this test for our selected data.
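The sketch below shows the TukeyHSD call pattern on the built-in PlantGrowth data (not the gapminder data). The function takes the fitted aov object and returns, for each pair of groups, the difference in means (diff), a confidence interval (lwr, upr) and an adjusted \(p\)-value (p adj). Object names are illustrative.

```r
# Illustration only: built-in PlantGrowth data, not gapminder_2002
plant_anova <- aov(weight ~ group, data = PlantGrowth)

# Tukey HSD post-hoc test on the fitted one-way ANOVA
plant_tukey <- TukeyHSD(plant_anova)
plant_tukey

# Adjusted p-values for each pairwise comparison of groups
plant_tukey$group[, "p adj"]
```

In this example only the trt2-trt1 comparison has an adjusted \(p\)-value below \(0.05\); the two comparisons against ctrl do not.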

1.7.1

🏡 Interpret the results of the Tukey HSD post-hoc test for any 2 of the various comparisons. Which comparisons, if any, are statistically significant? Are there any comparisons that are not statistically significant?

🎧 Online students 💬 Enter your answer next to the question in the shared Google Doc.

1.8 Effect Size

🏡 To conclude, we should also check the effect size for our one-way ANOVA.

Use the etaSquared function from the lsr R package to calculate the \(\eta^2\) effect size, and provide an interpretation of this effect size.

Hint: You can use the etaSquared function in a similar manner to how you used the summary function in 1.5.1.

🎧 Online students 💬 Enter your answer next to the question in the shared Google Doc.
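For a one-way ANOVA, \(\eta^2\) is the between-groups sum of squares divided by the total sum of squares, so it can also be computed by hand as a check. The sketch below does this on the built-in PlantGrowth data (not the gapminder data); the object names are illustrative, and the lsr call at the end only runs if the package is installed.

```r
# Illustration only: built-in PlantGrowth data, not gapminder_2002
plant_anova <- aov(weight ~ group, data = PlantGrowth)
anova_table <- summary(plant_anova)[[1]]

# eta squared = SS between groups / SS total
ss_between <- anova_table[1, "Sum Sq"]
ss_total   <- sum(anova_table[, "Sum Sq"])
eta_sq <- ss_between / ss_total
eta_sq

# If the lsr package is installed, etaSquared should agree with the value above
if (requireNamespace("lsr", quietly = TRUE)) {
  print(lsr::etaSquared(plant_anova))
}
```

The result is the proportion of the total variation in the dependent variable that is accounted for by the grouping variable.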

2 Extension: One-way ANOVA Practice

💻 The gapminder_2002.csv file also includes the variable GDP per capita, which records the 2002 Gross Domestic Product (GDP) per capita for the different countries.

If you have time, repeat Question 1, but this time use the following structure:

  • Dependent variable: GDP per capita
  • Independent variable: Continent (Africa, Americas, Europe, Asia_Oceania)

🎧 Online students 💬 Volunteer to share your screen and explain your answers to this question.


References

Gapminder. 2021a. “Happiness Score (WHR) [.csv File].” 2021. http://gapm.io/dhapiscore_whr.
———. 2021b. “Income Per Person [.csv File].” 2021. http://gapm.io/dgdppc.


These notes have been prepared by Rupert Kuveke. The copyright for the material in these notes resides with the author named above, with the Department of Mathematical and Physical Sciences and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.