One-Way ANOVA

1 What is ANOVA?

ANOVA (Analysis of Variance) tests whether the means of three or more groups are significantly different from each other. Despite the name, it works by comparing variances: specifically, it compares the variance between groups to the variance within groups.

If the between-group variance is much larger than the within-group variance, we have evidence that at least one group mean differs from the others.
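To make the comparison concrete, here is a minimal sketch on a tiny made-up dataset (the `scores` data frame and its values are hypothetical, chosen only so the arithmetic is easy to follow). It computes the between-group and within-group variability by hand and checks the resulting ratio against `aov()`.

```r
# Hypothetical toy data: three groups of three observations each
scores <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 3)),
  value = c(1, 2, 3,  2, 3, 4,  5, 6, 7)
)

grand_mean  <- mean(scores$value)
group_means <- tapply(scores$value, scores$group, mean)
group_sizes <- tapply(scores$value, scores$group, length)

# Between-group sum of squares: how far each group mean sits from the grand mean
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)

# Within-group sum of squares: how far each observation sits from its own group mean
ss_within <- sum((scores$value - ave(scores$value, scores$group))^2)

df_between <- nlevels(scores$group) - 1             # groups minus one
df_within  <- nrow(scores) - nlevels(scores$group)  # observations minus groups

# F is the ratio of the two variance estimates (mean squares)
f_stat <- (ss_between / df_between) / (ss_within / df_within)

# The same statistic as computed by aov()
aov_f <- summary(aov(value ~ group, data = scores))[[1]][["F value"]][1]

c(by_hand = f_stat, aov = aov_f)
```

In this toy case the group means (2, 3 and 6) are far apart relative to the spread inside each group, so the ratio comes out well above 1.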

2 The Question

We’ll use the Gapminder dataset to ask: Does life expectancy differ across continents?

In other words, does where you’re born affect how long you can expect to live?
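Stated formally, and writing μ for the mean life expectancy of the countries in each continent (the subscripts are simply labels chosen here), the question corresponds to the following pair of hypotheses:

```latex
\begin{aligned}
H_0 &: \mu_{\text{Africa}} = \mu_{\text{Americas}} = \mu_{\text{Asia}} = \mu_{\text{Europe}} = \mu_{\text{Oceania}} \\
H_1 &: \text{at least one continent mean differs from the others}
\end{aligned}
```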

3 Setup

library(tidyverse)
library(gapminder)
library(gt)
library(gtExtras)

4 The Data

The gapminder dataset contains life expectancy, GDP per capita, and population data for 142 countries spanning from 1952 to 2007. We’ll focus on the most recent year: 2007.

gapminder_2007 <- gapminder |>
  filter(year == 2007)

glimpse(gapminder_2007)
1. Filter to only the 2007 data to get a single observation per country
Rows: 142
Columns: 6
$ country   <fct> "Afghanistan", "Albania", "Algeria", "Angola", "Argentina", …
$ continent <fct> Asia, Europe, Africa, Africa, Americas, Oceania, Europe, Asi…
$ year      <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, …
$ lifeExp   <dbl> 43.828, 76.423, 72.301, 42.731, 75.320, 81.235, 79.829, 75.6…
$ pop       <int> 31889923, 3600523, 33333216, 12420476, 40301927, 20434176, 8…
$ gdpPercap <dbl> 974.5803, 5937.0295, 6223.3675, 4797.2313, 12779.3796, 34435…

5 Visualising the Question

Before running the test, let’s visualise the distribution of life expectancy across continents.

continent_colors <- c(                           
  "Africa" = "#e63946",
  "Americas" = "#f4a261",
  "Asia" = "#2a9d8f",
  "Europe" = "#457b9d",
  "Oceania" = "#9c89b8"
)

gapminder_2007 |>
  ggplot(aes(x = fct_reorder(continent, lifeExp),
             y = lifeExp,
             fill = continent)) +
  geom_boxplot(alpha = 0.7,
               outlier.shape = NA,
               width = 0.5) +
  geom_jitter(aes(colour = continent),
              width = 0.15,
              alpha = 0.6,
              size = 2) +
  stat_summary(fun = mean,
               geom = "point",
               shape = 23,
               size = 3,
               fill = "white",
               colour = "black") +
  scale_fill_manual(values = continent_colors) +
  scale_colour_manual(values = continent_colors) +
  labs(
    title = "Life Expectancy by Continent (2007)",
    subtitle = "Each point represents a country; diamond shows the mean",
    x = NULL,
    y = "Life Expectancy (years)"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    legend.position = "none",
    panel.grid.major.x = element_blank(),
    plot.title = element_text(face = "bold"),
    plot.title.position = "plot",
    axis.text.x = element_text(face = "bold", size = 12)
  )
1. Reorder continents by median life expectancy
2. Boxplots show the distribution (median, IQR, range)
3. Jittered points show each individual country
4. Diamond markers indicate the group means (what ANOVA compares)

The visualisation reveals dramatic differences: African countries cluster around 50–55 years, while European countries cluster around 75–80 years.

6 Summary Table

gapminder_2007 |>
  group_by(continent) |>
  summarise(
    Countries = n(),
    Mean = mean(lifeExp),
    SD = sd(lifeExp),
    Min = min(lifeExp),
    Max = max(lifeExp),
    .groups = "drop"
  ) |>
  arrange(desc(Mean)) |>
  gt() |>
  fmt_number(columns = c(Mean, SD),
             decimals = 1) |>
  fmt_number(columns = c(Min, Max),
             decimals = 1) |>
  data_color(
    columns = Mean,
    palette = c("#e63946", "#f4a261", "#2a9d8f", "#457b9d"),
    domain = c(50, 80)
  ) |>
  cols_align(align = "center",
             columns = -continent) |>
  tab_header(
    title = md("**Life Expectancy by Continent (2007)**"),
    subtitle = "Summary statistics for each continent"
  ) |>
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_column_labels()
  ) |>
  tab_options(
    table.font.size = px(14),
    heading.align = "left"
  )
1. Count the number of countries in each continent
2. Standard deviation measures spread within each group
3. Arrange from highest to lowest mean life expectancy
4. Format numbers to one decimal place
5. Colour-code the mean column (red = lower, blue = higher)
6. Centre-align numeric columns

Life Expectancy by Continent (2007)

Summary statistics for each continent
Continent   Countries   Mean    SD    Min    Max
Oceania             2   80.7   0.7   80.2   81.2
Europe             30   77.6   3.0   71.8   81.8
Americas           25   73.6   4.4   60.9   80.7
Asia               33   70.7   8.0   43.8   82.6
Africa             52   54.8   9.6   39.6   76.4

The gap between the top of the table (Oceania at 80.7 years, Europe at 77.6) and Africa (54.8 years) is stark: roughly 23 to 26 years of life expectancy. Note that Oceania's mean rests on only two countries.
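These summary statistics are already enough to approximate the two quantities ANOVA compares. The sketch below retypes the rounded counts, means and standard deviations from the table (the `continent_summary` data frame and its column names are introduced here for illustration) and rebuilds the between-group and within-group sums of squares from them. Because the inputs are rounded to one decimal place, the results will be close to, but not identical to, what `aov()` reports on the full data.

```r
# Group summaries copied from the table above (rounded to one decimal place)
continent_summary <- data.frame(
  continent = c("Oceania", "Europe", "Americas", "Asia", "Africa"),
  n         = c(2, 30, 25, 33, 52),
  mean_life = c(80.7, 77.6, 73.6, 70.7, 54.8),
  sd_life   = c(0.7, 3.0, 4.4, 8.0, 9.6)
)

# Grand mean: the group means weighted by the number of countries in each group
grand_mean <- with(continent_summary, sum(n * mean_life) / sum(n))

# Between-group sum of squares: squared distance of each group mean from the
# grand mean, weighted by group size
ss_between <- with(continent_summary, sum(n * (mean_life - grand_mean)^2))

# Within-group sum of squares: each group's variance (SD squared) times its
# own degrees of freedom (n - 1), added up across groups
ss_within <- with(continent_summary, sum((n - 1) * sd_life^2))

round(c(between = ss_between, within = ss_within))
```

Both values should land near 13,000 and 7,500 respectively, which is already a hint that the continent means are much further apart than the spread within continents would suggest.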

7 The ANOVA Test

aov(lifeExp ~ continent, data = gapminder_2007) |>
  summary()
1. Fit an ANOVA model: life expectancy as a function of continent
2. Display the ANOVA summary table
             Df Sum Sq Mean Sq F value Pr(>F)    
continent     4  13061    3265   59.71 <2e-16 ***
Residuals   137   7491      55                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

8 Interpreting the Results

The ANOVA output shows:

  • Df (Degrees of freedom): Between groups = 4 (five continents minus one); Within groups = 137 (142 countries minus 5 groups).

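  • Sum Sq (Sum of Squares): The variability attributed to each source. The between-group sum of squares (13061) is large relative to the within-group sum of squares (7491), even though it is spread over only 4 degrees of freedom.

  • Mean Sq (Mean Square): Sum of squares divided by its degrees of freedom (13061 / 4 ≈ 3265 between groups; 7491 / 137 ≈ 55 within groups). This puts the two sources of variability on a comparable per-degree-of-freedom scale.

  • F value: The test statistic (59.71). This is the ratio of the between-group mean square to the within-group mean square. Values near 1 are what we would expect if all group means were equal; large values indicate the group means differ more than chance alone would produce.

  • Pr(>F): The p-value (< 2e-16). Extremely small, far below the conventional 0.05 threshold.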
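The columns of the ANOVA table are linked by simple arithmetic, which can be checked directly. The following sketch starts from the Df and Sum Sq values printed in the output (retyped by hand, so they carry the table's rounding) and derives the mean squares, the F value and the p-value using base R's `pf()`. The variable names are chosen here for illustration.

```r
# Values retyped from the printed ANOVA table (rounded as displayed)
df_between <- 4
df_within  <- 137
ss_between <- 13061
ss_within  <- 7491

# Mean square = sum of squares / degrees of freedom
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F value = between-group mean square / within-group mean square
f_value <- ms_between / ms_within

# p-value = upper tail of the F distribution with (4, 137) degrees of freedom
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

round(c(ms_between = ms_between, ms_within = ms_within, f_value = f_value), 2)
p_value < 2e-16
```

The F value recovered this way agrees with the printed 59.71 up to the rounding of the sums of squares.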

Conclusion: We reject the null hypothesis that all continent means are equal. There is a statistically significant difference in mean life expectancy across continents in 2007. Where you’re born is strongly associated with how long you can expect to live, although ANOVA on observational data cannot by itself establish why.

Note: ANOVA tells us that groups differ, but not which groups. Post-hoc tests (like Tukey’s HSD) would identify specific pairwise differences.
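As a rough sketch of what that follow-up might look like, base R's `TukeyHSD()` can be applied to the fitted `aov` object. The block rebuilds the 2007 subset with `subset()` so that it runs on its own; the object names `fit` and `tukey` are chosen here for illustration.

```r
library(gapminder)

# Rebuild the 2007 subset so this block is self-contained
gapminder_2007 <- subset(gapminder, year == 2007)

# Same model as in the ANOVA section
fit <- aov(lifeExp ~ continent, data = gapminder_2007)

# All pairwise differences in continent means, with family-wise 95% intervals
tukey <- TukeyHSD(fit, conf.level = 0.95)
tukey
```

With five continents there are ten pairwise comparisons. Each row gives the estimated difference in means (`diff`), the interval bounds (`lwr`, `upr`) and an adjusted p-value (`p adj`). Comparisons involving Oceania should be read with caution, since that group contains only two countries.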