ANOVA in R

Analysis of Variance with Gapminder Life Expectancy Data

Author

Abdullah Al Shamim

Published

May 31, 2026

1 What is ANOVA?

ANOVA (Analysis of Variance) tests whether the means of three or more groups are significantly different from each other. Despite the name, it works by comparing variances: specifically, it compares the variance between groups to the variance within groups.

If the between-group variance is much larger than the within-group variance, we have evidence that at least one group mean differs from the others.
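
The comparison of between-group and within-group variance can be made concrete with a small hand calculation. The sketch below uses made-up toy data (three hypothetical groups of three values, not taken from Gapminder) and base R only; it computes the F-statistic from the two sums of squares and then checks it against aov().

```r
# Toy data, invented purely for illustration: three groups of three values
scores <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 3)),
  value = c(4, 5, 6, 7, 8, 9, 10, 11, 12)
)

grand_mean  <- mean(scores$value)
group_means <- tapply(scores$value, scores$group, mean)
group_sizes <- tapply(scores$value, scores$group, length)

# Between-group sum of squares: how far each group mean is from the grand mean
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)

# Within-group sum of squares: how far each value is from its own group mean
ss_within <- sum((scores$value - group_means[as.character(scores$group)])^2)

# Degrees of freedom: groups minus one, and observations minus groups
df_between <- nlevels(scores$group) - 1
df_within  <- nrow(scores) - nlevels(scores$group)

# F is the ratio of the two mean squares
f_manual <- (ss_between / df_between) / (ss_within / df_within)

# The same statistic as reported by aov()
f_aov <- summary(aov(value ~ group, data = scores))[[1]][["F value"]][1]

c(manual = f_manual, aov = f_aov)
```

In this toy example the group means are far apart relative to the spread inside each group, so the F-statistic is large.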


2 The Question

We’ll use the Gapminder dataset to ask: Does life expectancy differ across continents?

In other words, does where you’re born affect how long you can expect to live?


3 Setup

library(tidyverse)
library(gapminder)
library(gt)
library(gtExtras)
1. Load the tidyverse, which provides ggplot2, dplyr, tidyr, and forcats used throughout
2. Load gapminder, which provides the Gapminder dataset with life expectancy, GDP per capita, and population for 142 countries from 1952 to 2007
3. Load gt, for creating polished, publication-ready HTML tables
4. Load gtExtras, which extends gt with extra themes and formatting helpers (data_color(), used below for conditional cell colouring, is part of gt itself)
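
If any of these packages is not installed, the library() calls will stop with an error. A minimal sketch of a pre-flight check, assuming the four package names above are all that is needed (the variable names are illustrative only):

```r
# Packages loaded in the setup code
required <- c("tidyverse", "gapminder", "gt", "gtExtras")

# requireNamespace() returns FALSE instead of erroring when a package is absent
is_installed  <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
not_installed <- required[!is_installed]

if (length(not_installed) > 0) {
  message("Install before continuing: ", paste(not_installed, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```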

4 The Data

The gapminder dataset contains life expectancy, GDP per capita, and population data for 142 countries, recorded every five years from 1952 to 2007. We’ll focus on the most recent year: 2007.

gapminder_2007 <- gapminder |>
  filter(year == 2007)

glimpse(gapminder_2007)
1. gapminder contains panel data from 1952 to 2007; we assign the filtered subset to a new object so it can be reused across all subsequent steps
2. filter(year == 2007) retains only the most recent year in the dataset, reducing 1,704 rows to exactly 142, one per country
3. glimpse() prints a compact preview, confirming 142 rows and 6 columns (country, continent, year, lifeExp, pop, gdpPercap)
Rows: 142
Columns: 6
$ country   <fct> "Afghanistan", "Albania", "Algeria", "Angola", "Argentina", …
$ continent <fct> Asia, Europe, Africa, Africa, Americas, Oceania, Europe, Asi…
$ year      <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, …
$ lifeExp   <dbl> 43.828, 76.423, 72.301, 42.731, 75.320, 81.235, 79.829, 75.6…
$ pop       <int> 31889923, 3600523, 33333216, 12420476, 40301927, 20434176, 8…
$ gdpPercap <dbl> 974.5803, 5937.0295, 6223.3675, 4797.2313, 12779.3796, 34435…
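
Before moving on, it can help to confirm the row counts quoted above and to see how many countries fall in each continent. The following is a small sanity-check sketch using dplyr; it repeats the filter step so that it runs on its own.

```r
library(dplyr)
library(gapminder)

gapminder_2007 <- gapminder |>
  filter(year == 2007)

# The full panel: 142 countries observed in 12 survey years
nrow(gapminder)
n_distinct(gapminder$year)

# After filtering, each country should appear exactly once
stopifnot(nrow(gapminder_2007) == n_distinct(gapminder_2007$country))

# Number of countries per continent (the groups are unbalanced)
count(gapminder_2007, continent)
```

The group sizes are quite unequal, which is worth keeping in mind when reading the per-continent summaries later on.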

5 Visualising the Question

Before running the test, let’s visualise the distribution of life expectancy across continents.

continent_colors <- c(
  "Africa" = "#e63946",
  "Americas" = "#f4a261",
  "Asia" = "#2a9d8f",
  "Europe" = "#457b9d",
  "Oceania" = "#9c89b8"
)

gapminder_2007 |>
  ggplot(aes(x = fct_reorder(continent, lifeExp),
             y = lifeExp,
             fill = continent)) +
  geom_boxplot(alpha = 0.7,
               outlier.shape = NA,
               width = 0.5) +
  geom_jitter(aes(colour = continent),
              width = 0.15,
              alpha = 0.6,
              size = 2) +
  stat_summary(fun = mean,
               geom = "point",
               shape = 23,
               size = 3,
               fill = "white",
               colour = "black") +
  scale_fill_manual(values = continent_colors) +
  scale_colour_manual(values = continent_colors) +
  labs(
    title = "Life Expectancy by Continent (2007)",
    subtitle = "Each point represents a country; diamond shows the mean",
    x = NULL,
    y = "Life Expectancy (years)"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    legend.position = "none",
    panel.grid.major.x = element_blank(),
    plot.title = element_text(face = "bold"),
    plot.title.position = "plot",
    axis.text.x = element_text(face = "bold", size = 12)
  )
1. Named colour vector mapping each continent to a distinct hex colour; defined once and reused by both scale_fill_manual() and scale_colour_manual() so fills and points stay visually consistent
2. fct_reorder(continent, lifeExp) orders continents left to right by median life expectancy (the default summary function), so the trend from Africa to Oceania is immediately visible without manual factor specification
3. geom_boxplot() shows the distribution (median, IQR, whiskers) for each continent; outlier.shape = NA suppresses outlier dots since the jittered points already show every individual country
4. geom_jitter() overlays all 142 country points with a small horizontal offset (width = 0.15) to reduce overplotting; each point is one country, coloured to match its boxplot
5. stat_summary() computes and plots the group mean as a white-filled diamond (shape = 23), visually distinct from both the median line in the boxplot and the individual jittered points
6. scale_fill_manual() applies the continent_colors palette to the boxplot fills, the same colours as the jittered points
7. scale_colour_manual() applies the same palette to the jitter point colours; it needs its own scale call because colour and fill are separate aesthetics in ggplot2
8. legend.position = "none" removes the legend since the x-axis labels already name each continent; panel.grid.major.x = element_blank() removes vertical gridlines that would clutter the categorical axis

The visualisation reveals dramatic differences: African countries cluster around 50–55 years, while European countries cluster around 75–80 years.
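
The left-to-right order in the plot comes from fct_reorder(), which by default ranks the factor levels by the median of the second argument. A short sketch, repeating the filter step so it is self-contained, that lists those medians in the same ascending order:

```r
library(dplyr)
library(gapminder)

gapminder_2007 <- gapminder |>
  filter(year == 2007)

# Median life expectancy per continent, sorted the way the x-axis is sorted
continent_medians <- gapminder_2007 |>
  group_by(continent) |>
  summarise(median_lifeExp = median(lifeExp), .groups = "drop") |>
  arrange(median_lifeExp)

continent_medians
```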


6 Summary Table

gapminder_2007 |>
  group_by(continent) |>
  summarise(
    Countries = n(),
    Mean = mean(lifeExp),
    SD = sd(lifeExp),
    Min = min(lifeExp),
    Max = max(lifeExp),
    .groups = "drop"
  ) |>
  arrange(desc(Mean)) |>
  gt() |>
  fmt_number(columns = c(Mean, SD),
             decimals = 1) |>
  fmt_number(columns = c(Min, Max),
             decimals = 1) |>
  data_color(
    columns = Mean,
    palette = c("#e63946", "#f4a261", "#2a9d8f", "#457b9d"),
    domain = c(50, 85)  # must cover the largest mean (Oceania, 80.7)
  ) |>
  cols_align(align = "center",
             columns = -continent) |>
  tab_header(
    title = md("**Life Expectancy by Continent (2007)**"),
    subtitle = "Summary statistics for each continent"
  ) |>
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_column_labels()
  ) |>
  tab_options(
    table.font.size = px(14),
    heading.align = "left"
  )
1. summarise() computes 5 statistics per continent within the group_by() context: country count, mean, SD, min, and max life expectancy; .groups = "drop" removes the grouping structure after summarising
2. arrange(desc(Mean)) sorts rows from highest to lowest mean life expectancy: Oceania and Europe rise to the top and Africa falls to the bottom, the reverse of the left-to-right order in the plot
3. gt() converts the tibble into a gt table object; all subsequent |> calls are gt formatting verbs layered on top
4. fmt_number() rounds Mean and SD to 1 decimal place, an appropriate precision for life expectancy statistics in years
5. A second fmt_number() call rounds Min and Max to 1 decimal place; it is kept as a separate call to mirror the data_color() column targeting pattern used below
6. data_color() applies a red → orange → teal → blue gradient to the Mean column; domain = c(50, 85) sets the range the gradient spans and is chosen to cover every observed mean, so Africa (54.8) falls near the red end and Oceania (80.7) near the blue end. A value outside the domain is not mapped to the gradient and receives the NA colour instead
7. cols_align() centre-aligns all numeric columns; -continent excludes the continent name column, keeping it left-aligned by default
8. tab_header() adds a bold md() title and plain subtitle above the table; heading alignment is controlled by tab_options() below
9. tab_style() bolds all column label text using cell_text(weight = "bold") targeted at cells_column_labels()
10. tab_options() sets a 14px base font size and left-aligns the heading block to match the plot’s title alignment
Life Expectancy by Continent (2007)
Summary statistics for each continent

continent  Countries  Mean   SD   Min   Max
Oceania            2  80.7  0.7  80.2  81.2
Europe            30  77.6  3.0  71.8  81.8
Americas          25  73.6  4.4  60.9  80.7
Asia              33  70.7  8.0  43.8  82.6
Africa            52  54.8  9.6  39.6  76.4

The gap between Oceania (80.7 years) and Africa (54.8 years) is stark: roughly 26 years of life expectancy. Europe (77.6 years) is about 23 years ahead of Africa.


7 The ANOVA Test

aov(lifeExp ~ continent, data = gapminder_2007) |>
  summary()
1. aov() fits a one-way ANOVA model: lifeExp ~ continent specifies life expectancy as the outcome and continent as the single grouping factor; data = gapminder_2007 supplies the filtered dataset
2. summary() extracts the ANOVA table from the fitted model object, printing degrees of freedom, sum of squares, mean squares, F-statistic, and p-value in standard tabular form
             Df Sum Sq Mean Sq F value Pr(>F)    
continent     4  13061    3265   59.71 <2e-16 ***
Residuals   137   7491      55                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
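
Piping straight into summary() prints the table but discards the fitted model. A possible variation, sketched below with base R only, stores the fit so that individual numbers can be pulled out of the table and reused. The object names (fit, anova_table, lm_table) are illustrative choices, not part of the original code. The last step also shows that a one-way ANOVA gives the same F-statistic as a linear model with a categorical predictor.

```r
library(gapminder)

gapminder_2007 <- subset(gapminder, year == 2007)

# Keep the fitted model so it can be reused later
fit <- aov(lifeExp ~ continent, data = gapminder_2007)

# summary() returns a list; its first element is the ANOVA table as a data frame
anova_table <- summary(fit)[[1]]
f_value <- anova_table[["F value"]][1]
p_value <- anova_table[["Pr(>F)"]][1]

# The same test expressed as a linear model
lm_table <- anova(lm(lifeExp ~ continent, data = gapminder_2007))

c(aov = f_value, lm = lm_table[["F value"]][1])
```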

8 Interpreting the Results

The ANOVA output shows:

  • Df (Degrees of freedom): Between groups = 4 (five continents minus one); Within groups = 137 (142 countries minus 5 groups).

  • Sum Sq (Sum of Squares): Measures variability. The between-group sum of squares (13061) captures how far the continent means lie from the overall mean; the within-group (residual) sum of squares (7491) captures how far countries lie from their own continent’s mean.

  • Mean Sq (Mean Square): Sum of squares divided by its degrees of freedom (3265 between groups, 55 within groups). This puts the two sources of variability on a comparable per-degree-of-freedom scale.

  • F value: The test statistic (59.71). This is the ratio of the between-group mean square to the within-group mean square. Large values indicate that the group means differ more than expected by chance.

  • Pr(>F): The p-value (< 2e-16). Extremely small, far below 0.05.
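
The relationships between these columns can be checked by hand. The sketch below copies the rounded sums of squares and degrees of freedom from the printed ANOVA table and rebuilds the mean squares, the F value, and the p-value; pf() is the base R distribution function for F.

```r
# Values copied from the printed ANOVA table (rounded as shown there)
ss_between <- 13061
df_between <- 4
ss_within  <- 7491
df_within  <- 137

# Mean square = sum of squares / degrees of freedom
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F = between-group mean square / within-group mean square
f_value <- ms_between / ms_within

# The p-value is the upper-tail probability of the F distribution
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

c(ms_between = ms_between, ms_within = ms_within,
  f_value = f_value, p_value = p_value)
```

Because the inputs are rounded, the recomputed F value can differ from the printed 59.71 in the second decimal place.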

Important

Conclusion: We reject the null hypothesis that all continent means are equal. There is a statistically significant difference in life expectancy across continents. Where you’re born is strongly associated with how long you can expect to live.

Note

ANOVA tells us that groups differ, but not which groups. Post-hoc tests (like Tukey’s HSD) would identify specific pairwise differences.
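
As a pointer for that next step, base R provides TukeyHSD(), which works directly on an aov fit. The following is a minimal sketch rather than a full post-hoc analysis; it refits the model so that it runs on its own, and the object names are illustrative.

```r
library(gapminder)

gapminder_2007 <- subset(gapminder, year == 2007)
fit <- aov(lifeExp ~ continent, data = gapminder_2007)

# Tukey's Honest Significant Difference: all pairwise continent comparisons
tukey <- TukeyHSD(fit, conf.level = 0.95)

# One row per pair: difference in means, interval bounds, adjusted p-value
tukey$continent
```

With five continents there are ten pairs to compare. Oceania contributes only two countries, so intervals involving it are likely to be wide and should be read with caution.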