Analysis of Variance with Gapminder Life Expectancy Data
Author: Abdullah Al Shamim
Published: May 31, 2026
1 What is ANOVA?
ANOVA (Analysis of Variance) tests whether the means of three or more groups are significantly different from each other. Despite the name, it works by comparing variances: specifically, it compares the variance between groups to the variance within groups.
If the between-group variance is much larger than the within-group variance, we have evidence that at least one group mean differs from the others.
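To make the between/within comparison concrete, here is a minimal sketch in base R that computes the F-statistic by hand and checks it against aov(). The data frame `scores` and its values are hypothetical toy data, not taken from Gapminder.

```r
# Toy data (hypothetical values): three groups of four observations each
scores <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 4)),
  value = c(4, 5, 6, 5,  7, 8, 9, 8,  10, 11, 12, 11)
)

grand_mean  <- mean(scores$value)
group_means <- tapply(scores$value, scores$group, mean)
group_sizes <- tapply(scores$value, scores$group, length)

# Between-group sum of squares: distance of each group mean from the grand mean
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)

# Within-group sum of squares: distance of each observation from its own group mean
ss_within <- sum((scores$value - group_means[as.character(scores$group)])^2)

df_between <- nlevels(scores$group) - 1
df_within  <- nrow(scores) - nlevels(scores$group)

# F is the ratio of the two mean squares
f_manual <- (ss_between / df_between) / (ss_within / df_within)

# The same statistic from R's built-in one-way ANOVA
f_aov <- summary(aov(value ~ group, data = scores))[[1]][["F value"]][1]

c(manual = f_manual, aov = f_aov)
```

The two values agree because aov() performs the same decomposition internally for a one-way design.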
2 The Question
We’ll use the Gapminder dataset to ask: Does life expectancy differ across continents?
In other words, does where you’re born affect how long you can expect to live?
1. Load the tidyverse: provides ggplot2, dplyr, tidyr, and forcats, used throughout
2. Load gapminder: provides the Gapminder dataset with life expectancy, GDP per capita, and population for 142 countries across decades
3. Load gt: for creating polished, publication-ready HTML tables
4. Load gtExtras: extends gt with additional themes and helper functions (data_color(), used below for conditional cell colouring, is part of gt itself)
4 The Data
The gapminder dataset contains life expectancy, GDP per capita, and population data for 142 countries spanning from 1952 to 2007. We’ll focus on the most recent year: 2007.
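A minimal sketch of the filtering step, assuming the dplyr and gapminder packages are installed. The object name gapminder_2007 matches the one passed to aov() later in this post; the exact original code is not shown here, so treat this as a reconstruction.

```r
library(dplyr)
library(gapminder)

# Keep only the most recent survey year: one row per country
gapminder_2007 <- gapminder |>
  filter(year == 2007)

# Quick sanity checks on the filtered data
nrow(gapminder_2007)             # number of countries
table(gapminder_2007$continent)  # countries per continent
```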
1. Named colour vector mapping each continent to a distinct hex colour; defined once and reused by both scale_fill_manual() and scale_colour_manual() so fills and points stay visually consistent
2. fct_reorder(continent, lifeExp) orders continents left to right by median life expectancy, so the trend from Africa to Oceania is immediately visible without manual factor specification
3. geom_boxplot() shows the distribution shape (median, IQR, range) for each continent; outlier.shape = NA suppresses outlier dots since the jittered points already show every individual country
4. geom_jitter() overlays all 142 country points with a small horizontal offset (width = 0.15) to prevent overplotting; each point is one country, coloured to match its boxplot
5. stat_summary() computes and plots the group mean as a white-filled diamond (shape = 23), visually distinct from both the median line in the boxplot and the individual jittered points
6. scale_fill_manual() applies the continent_colors palette to the boxplot fills, using the same colours as the jittered points for visual consistency
7. scale_colour_manual() applies the same palette to the jitter point colours; it needs its own scale call because colour and fill are separate aesthetics in ggplot2
8. legend.position = "none" removes the legend since the x-axis labels already name each continent; panel.grid.major.x = element_blank() removes vertical gridlines that would clutter the categorical axis
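The annotated steps above can be sketched as follows. This is a reconstruction rather than the original code: the hex values in continent_colors, the alpha and size settings, the axis label, and the use of theme_minimal() are all assumptions made for illustration.

```r
library(ggplot2)
library(dplyr)
library(forcats)
library(gapminder)

gapminder_2007 <- gapminder |> filter(year == 2007)

# Hypothetical palette: the original hex colours are not shown in the text
continent_colors <- c(
  Africa = "#E15759", Americas = "#F28E2B", Asia = "#59A14F",
  Europe = "#4E79A7", Oceania = "#B07AA1"
)

p <- ggplot(
  gapminder_2007,
  # fct_reorder() sorts continents by median lifeExp (its default summary)
  aes(x = fct_reorder(continent, lifeExp), y = lifeExp, fill = continent)
) +
  geom_boxplot(outlier.shape = NA, alpha = 0.6) +            # hide outlier dots
  geom_jitter(aes(colour = continent), width = 0.15, alpha = 0.7) +
  stat_summary(fun = mean, geom = "point",                   # group mean marker
               shape = 23, fill = "white", size = 3) +
  scale_fill_manual(values = continent_colors) +
  scale_colour_manual(values = continent_colors) +
  labs(x = NULL, y = "Life expectancy (years)") +
  theme_minimal() +
  theme(
    legend.position = "none",
    panel.grid.major.x = element_blank()
  )

p
```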
The visualisation reveals dramatic differences: African countries cluster around 50–55 years, while European countries cluster around 75–80 years.
1. summarise() computes 5 statistics per continent within the group_by() context: country count, mean, SD, min, and max life expectancy; .groups = "drop" removes the grouping structure after summarising
2. arrange(desc(Mean)) sorts rows from highest to lowest mean life expectancy, so Oceania and Europe rise to the top and Africa falls to the bottom
3. gt() converts the tibble into a gt table object; all subsequent |> calls are gt formatting verbs layered on top
4. fmt_number() rounds Mean and SD to 1 decimal place, an appropriate precision for life expectancy statistics in years
5. A second fmt_number() call rounds Min and Max to 1 decimal place; it is kept as a separate call to mirror the data_color() column targeting pattern used below
6. data_color() applies a red → orange → teal → blue gradient to the Mean column; domain = c(50, 80) anchors the colour scale to approximately the observed range of means, so Africa (54.8) falls at the red end and Oceania (80.7) at the blue end
7. cols_align() centre-aligns all numeric columns; -continent excludes the continent name column, keeping it left-aligned by default
8. tab_header() adds a bold md() title and plain subtitle above the table; heading alignment is controlled by tab_options() below
9. tab_style() bolds all column label text using cell_text(weight = "bold"), targeted at cells_column_labels()
10. tab_options() sets a 14px base font size and left-aligns the heading block to match the plot’s title alignment
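The data-preparation part of this pipeline (steps 1 and 2) can be sketched as below. The gt formatting steps are left out because their exact arguments are not shown in the text; the object name summary_tbl is an assumption, while the column names follow the table that appears next.

```r
library(dplyr)
library(gapminder)

gapminder_2007 <- gapminder |> filter(year == 2007)

summary_tbl <- gapminder_2007 |>
  group_by(continent) |>
  summarise(
    Countries = n(),          # number of countries in the continent
    Mean      = mean(lifeExp),
    SD        = sd(lifeExp),
    Min       = min(lifeExp),
    Max       = max(lifeExp),
    .groups   = "drop"        # return an ungrouped tibble
  ) |>
  arrange(desc(Mean))         # highest mean life expectancy first

summary_tbl
```

Piping summary_tbl into gt() would then start the formatting chain described in steps 3 to 10.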
Life Expectancy by Continent (2007)
Summary statistics for each continent

continent  Countries  Mean   SD   Min   Max
Oceania            2  80.7  0.7  80.2  81.2
Europe            30  77.6  3.0  71.8  81.8
Americas          25  73.6  4.4  60.9  80.7
Asia              33  70.7  8.0  43.8  82.6
Africa            52  54.8  9.6  39.6  76.4
The gap between Oceania and Europe (means of roughly 78 to 81 years) and Africa (about 55 years) is stark: some 23 to 26 years of life expectancy.
7 The ANOVA Test
aov(lifeExp ~ continent, data = gapminder_2007) |>
  summary()
1. aov() fits a one-way ANOVA model: lifeExp ~ continent specifies life expectancy as the outcome and continent as the single grouping factor; data = gapminder_2007 supplies the filtered dataset
2. summary() extracts the ANOVA table from the fitted model object, printing degrees of freedom, sum of squares, mean squares, F-statistic, and p-value in standard tabular form
             Df Sum Sq Mean Sq F value Pr(>F)
continent     4  13061    3265   59.71 <2e-16 ***
Residuals   137   7491      55
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
8 Interpreting the Results
The ANOVA output shows:
Df (Degrees of freedom): Between groups = 4 (five continents minus one); Within groups = 137 (142 countries minus 5 groups).
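Sum Sq (Sum of Squares): Measures the variability attributable to each source. The between-group sum of squares (13061) is large relative to the within-group sum of squares (7491), even though it is spread over far fewer degrees of freedom.
Mean Sq (Mean Square): Sum of squares divided by its degrees of freedom (13061 / 4 ≈ 3265 between groups; 7491 / 137 ≈ 55 within groups). This puts the two sources of variability on a comparable per-degree-of-freedom scale.
F value: The test statistic (59.71). This is the ratio of the between-group mean square to the within-group mean square. Large values indicate that the group means differ more than would be expected by chance.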
Pr(>F): The p-value (< 2e-16). Extremely small—far below 0.05.
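As a check on the reading above, the mean squares, F value, and p-value can be recomputed in base R from the printed sums of squares and degrees of freedom. This is a small sketch; the variable names are illustrative, and the result differs slightly from the printed 59.71 because the sums of squares in the output are rounded.

```r
# Values copied from the ANOVA table above
ss_between <- 13061; df_between <- 4
ss_within  <- 7491;  df_within  <- 137

# Mean square = sum of squares / degrees of freedom
ms_between <- ss_between / df_between   # 3265.25
ms_within  <- ss_within / df_within     # roughly 54.7

# F = between-group mean square / within-group mean square
f_value <- ms_between / ms_within       # roughly 59.7

# Upper-tail probability of the F distribution with (4, 137) degrees of freedom
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

c(F = f_value, p = p_value)
```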
Important
Conclusion: We reject the null hypothesis that all continent means are equal. There is a statistically significant difference in life expectancy across continents: where you’re born is strongly associated with how long you can expect to live.
Note: ANOVA tells us that groups differ, but not which groups. Post-hoc tests (like Tukey’s HSD) would identify specific pairwise differences.
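A minimal sketch of such a follow-up, using base R's TukeyHSD() on the same one-way model. It assumes the gapminder package is installed; the object names fit and tukey are illustrative, and no results are quoted here since they are not part of the original analysis.

```r
library(gapminder)

# Same 2007 subset used for the ANOVA above
gapminder_2007 <- subset(gapminder, year == 2007)

# Refit the one-way model and keep the fitted object this time
fit <- aov(lifeExp ~ continent, data = gapminder_2007)

# Tukey's Honest Significant Differences for every pair of continents
tukey <- TukeyHSD(fit)

# One row per pair: difference in means, confidence bounds, adjusted p-value
tukey$continent
```

With five continents there are ten pairwise comparisons; the "p adj" column indicates which pairs differ after adjusting for multiple testing.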