library(tidyverse)
library(gapminder)
library(gt)
library(scales)

Simple linear regression models the relationship between two continuous variables by fitting a straight line through the data. It answers two questions: (1) Is there a relationship? and (2) How strong is it?
The model takes the form: \(y = \beta_0 + \beta_1 x + \varepsilon\)
Where \(\beta_0\) is the intercept, \(\beta_1\) is the slope (the change in \(y\) for each unit increase in \(x\)), and \(\varepsilon\) is the error term.
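To make the roles of \(\beta_0\), \(\beta_1\), and \(\varepsilon\) concrete, here is a minimal sketch that simulates data from a known line and checks that lm() recovers it. The true values (intercept 5, slope 2, noise standard deviation 1) are arbitrary choices for illustration and have nothing to do with the Gapminder data.

```r
# Simulate data from y = beta_0 + beta_1 * x + epsilon with known parameters
set.seed(42)                      # make the simulation reproducible
n      <- 200
beta_0 <- 5                       # true intercept (illustrative assumption)
beta_1 <- 2                       # true slope (illustrative assumption)

x       <- runif(n, min = 0, max = 10)
epsilon <- rnorm(n, mean = 0, sd = 1)   # random error term
y       <- beta_0 + beta_1 * x + epsilon

# Fit a simple linear regression and compare the estimates with the truth
sim_fit   <- lm(y ~ x)
estimates <- coef(sim_fit)
print(round(estimates, 2))
```

The estimates should land close to 5 and 2 but not exactly on them, because the error term adds noise to every observation.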
We’ll use the Gapminder dataset to ask: Does national wealth predict life expectancy?
In other words, does money buy longer life?
We’ll use 2007 data from Gapminder, examining the relationship between GDP per capita and life expectancy across 142 countries.
gapminder_2007 <- gapminder |>
filter(year == 2007) |>
mutate(
log_gdp = log10(gdpPercap),
label = case_when(
country %in% c("Norway", "United States",
"South Africa", "Botswana",
"Afghanistan", "Japan",
"Sierra Leone", "Gabon") ~ as.character(country),
TRUE ~ NA_character_
)
)
glimpse(gapminder_2007)

Rows: 142
Columns: 8
$ country <fct> "Afghanistan", "Albania", "Algeria", "Angola", "Argentina", …
$ continent <fct> Asia, Europe, Africa, Africa, Americas, Oceania, Europe, Asi…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, …
$ lifeExp <dbl> 43.828, 76.423, 72.301, 42.731, 75.320, 81.235, 79.829, 75.6…
$ pop <int> 31889923, 3600523, 33333216, 12420476, 40301927, 20434176, 8…
$ gdpPercap <dbl> 974.5803, 5937.0295, 6223.3675, 4797.2313, 12779.3796, 34435…
$ log_gdp <dbl> 2.988818, 3.773569, 3.794025, 3.680991, 4.106510, 4.537005, …
$ label <chr> "Afghanistan", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
Let’s first look at the relationship using raw GDP per capita.
continent_colors <- c(
"Africa" = "#e63946",
"Americas" = "#f4a261",
"Asia" = "#2a9d8f",
"Europe" = "#457b9d",
"Oceania" = "#9c89b8"
)
ggplot(gapminder_2007,
aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(colour = continent, size = pop),
alpha = 0.7) +
geom_smooth(method = "lm",
colour = "#1d3557",
fill = "#1d3557",
alpha = 0.2) +
annotate("rect",
xmin = -2000, xmax = 15000,
ymin = 38, ymax = 83,
fill = "#e63946", alpha = 0.1) +
annotate("text",
x = 7500, y = 85,
label = "Most of the world\nis crammed in here",
fontface = "italic",
colour = "#e63946",
size = 4) +
annotate("curve",
x = 42000, y = 60,
xend = 49500, yend = 78,
curvature = -0.3,
arrow = arrow(length = unit(2, "mm")),
colour = "grey40") +
annotate("text",
x = 38000, y = 58,
label = "Norway:\nrich AND healthy",
size = 3.5,
colour = "grey30") +
scale_x_continuous(labels = dollar_format()) +
scale_colour_manual(values = continent_colors) +
scale_size_continuous(range = c(2, 12),
labels = label_number(scale = 1e-6,
suffix = "M"),
breaks = c(10e6, 100e6, 500e6, 1000e6)) +
labs(
title = "The Problem: A Curved Relationship",
subtitle = "GDP per capita vs life expectancy shows diminishing returns",
x = "GDP per Capita",
y = "Life Expectancy (years)",
colour = "Continent",
size = "Population"
) +
theme_minimal(base_size = 14) +
theme(
plot.title = element_text(face = "bold", size = 18),
plot.subtitle = element_text(colour = "grey40", size = 12),
plot.title.position = "plot",
legend.position = "right",
panel.grid.minor = element_blank()
)

The relationship is clearly curved: gains in life expectancy diminish as GDP increases, so a straight line fits the raw data poorly. The solution? Log-transform GDP.
Taking the logarithm of GDP “stretches out” the lower values and “compresses” the higher values, linearising the relationship.
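A small sketch of the stretching and compressing effect, using three made-up GDP values that are each ten times the previous one (the numbers are illustrative, not taken from the dataset):

```r
# GDP per capita values that are each 10 times the previous one
gdp <- c(300, 3000, 30000)

# On the raw scale the second gap is ten times the first
raw_gaps <- diff(gdp)

# On the log10 scale both gaps are one unit, so the points are evenly spaced
log_gaps <- diff(log10(gdp))

print(raw_gaps)
print(log_gaps)
```

Equal ratios in GDP become equal distances on the log scale, which is why a relationship driven by proportional changes in wealth looks straight after the transformation.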
# Find positions for labeled countries
label_data <- gapminder_2007 |>
filter(!is.na(label)) |>
mutate(
hjust = case_when(
country == "Norway" ~ 1.1,
country == "Japan" ~ 1.1,
country == "United States" ~ -0.1,
country == "Gabon" ~ -0.1,
country == "South Africa" ~ -0.1,
country == "Botswana" ~ -0.1,
country == "Afghanistan" ~ 0.5,
country == "Sierra Leone" ~ 0.5,
TRUE ~ 0.5
),
vjust = case_when(
country == "Afghanistan" ~ 1.5,
country == "Sierra Leone" ~ 1.5,
country == "Japan" ~ 0.5,
TRUE ~ 0.5
)
)
ggplot(gapminder_2007,
aes(x = log_gdp, y = lifeExp)) +
# Regression line (behind points)
geom_smooth(method = "lm",
colour = "#1d3557",
fill = "#1d3557",
linewidth = 1.2,
alpha = 0.2) +
# All points
geom_point(aes(colour = continent, size = pop),
alpha = 0.7) +
# Highlighted points (interesting countries)
geom_point(data = label_data,
aes(colour = continent),
size = 5,
shape = 21,
fill = "white",
stroke = 2) +
# Country labels
geom_text(data = label_data,
aes(label = country,
hjust = hjust,
vjust = vjust),
size = 3.5,
fontface = "bold") +
# Annotation: the story
annotate("label",
x = 2.8, y = 81,
label = "10× richer ≈ 17 years longer life",
fill = "#f1faee",
colour = "#1d3557",
fontface = "bold",
size = 4.5,
label.padding = unit(0.5, "lines"),
label.r = unit(0.3, "lines")) +
# Annotation: residual callout
annotate("segment",
x = 4.1, xend = 4.1,
y = 52.5, yend = 72,
linetype = "dashed",
colour = "#e63946",
linewidth = 0.8) +
annotate("text",
x = 4.2, y = 62,
label = "Botswana &\nSouth Africa:\nHIV epidemic",
hjust = 0,
size = 3,
colour = "#e63946",
lineheight = 0.9) +
# Scales
scale_x_continuous(
breaks = c(2.5, 3, 3.5, 4, 4.5, 5),
labels = c("$300", "$1K", "$3K",
"$10K", "$30K", "$100K")
) +
scale_colour_manual(values = continent_colors) +
scale_size_continuous(range = c(2, 12),
guide = "none") +
labs(
title = "Does Money Buy Longer Life?",
subtitle = "Log-transforming GDP reveals a clear linear relationship",
x = "GDP per Capita (log scale)",
y = "Life Expectancy (years)",
colour = "Continent"
) +
coord_cartesian(clip = "off") +
theme_minimal(base_size = 14) +
theme(
plot.title = element_text(face = "bold", size = 20),
plot.subtitle = element_text(colour = "grey40", size = 12),
plot.title.position = "plot",
legend.position = "bottom",
legend.title = element_text(face = "bold"),
panel.grid.minor = element_blank(),
plot.margin = margin(10, 20, 10, 10)
) +
guides(colour = guide_legend(override.aes = list(size = 4)))

Now the relationship is approximately linear. The log transformation also reveals an important story: some countries (like Botswana and South Africa) fall well below the line. Their life expectancy is lower than their wealth would predict, largely due to the HIV/AIDS epidemic.
model <- lm(lifeExp ~ log_gdp, data = gapminder_2007)
summary(model)
Call:
lm(formula = lifeExp ~ log_gdp, data = gapminder_2007)

Residuals:
    Min      1Q  Median      3Q     Max
-25.947  -2.661   1.215   4.469  13.115

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    4.950      3.858   1.283    0.202
log_gdp       16.585      1.019  16.283   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.122 on 140 degrees of freedom
Multiple R-squared: 0.6544, Adjusted R-squared: 0.652
F-statistic: 265.2 on 1 and 140 DF, p-value: < 2.2e-16
The regression output tells us:
Coefficients:
Intercept (4.95): The predicted life expectancy when log₁₀(GDP) = 0 (i.e., GDP per capita = $1). Not meaningful in practice, and not significantly different from zero here (p = 0.20), but necessary for the equation.
log_gdp (16.59): For each one-unit increase in log₁₀(GDP), predicted life expectancy rises by about 16.6 years. Because the logarithm is base 10, a one-unit increase corresponds to a 10-fold increase in GDP per capita, so 10× the GDP is associated with about 16.6 additional years of life expectancy.
Model Fit:
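R² = 0.65: About 65% of the variation in life expectancy across countries is explained by log GDP per capita. For a single predictor, this is a strong relationship.
p-value (< 2.2e-16): The relationship is highly statistically significant; a slope this large would be extremely unlikely if there were no underlying association.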
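To see what the slope means in years, the sketch below plugs a few GDP values into the fitted line. It uses the rounded coefficients from the summary() output rather than the model object, and predict_life_exp is a hypothetical helper name introduced only for this illustration.

```r
# Coefficients copied (rounded) from the summary() output
intercept <- 4.950
slope     <- 16.585

# Hypothetical helper: predicted life expectancy for a GDP per capita in dollars
predict_life_exp <- function(gdp_per_capita) {
  intercept + slope * log10(gdp_per_capita)
}

pred_1k  <- predict_life_exp(1000)    # log10(1000)  = 3
pred_10k <- predict_life_exp(10000)   # log10(10000) = 4

# A 10-fold increase in GDP moves the prediction by exactly one slope
gain_10x <- pred_10k - pred_1k

# Doubling GDP moves it by slope * log10(2), roughly 5 years
gain_2x <- slope * log10(2)

print(c(pred_1k = pred_1k, pred_10k = pred_10k,
        gain_10x = gain_10x, gain_2x = gain_2x))
```

With the fitted object available, predict(model, newdata = data.frame(log_gdp = c(3, 4))) should give the same predictions up to rounding of the coefficients.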
coefs <- coef(model)

Our fitted model is:
\[\widehat{\text{Life Expectancy}} = 4.9 + 16.6 \times \log_{10}(\text{GDP per capita})\]
Conclusion: National wealth is a powerful predictor of life expectancy. However, the relationship isn’t deterministic—countries like Japan exceed expectations while others like South Africa fall short, reminding us that health outcomes depend on more than just money.
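One way to see which countries exceed or fall short of expectations is to compute residuals (observed minus fitted life expectancy). The sketch below is only a rough illustration: it uses the five countries visible in the glimpse() output and the rounded coefficients from summary(), so its numbers are approximate.

```r
# First five countries, with values copied from the glimpse() output
first_five <- data.frame(
  country = c("Afghanistan", "Albania", "Algeria", "Angola", "Argentina"),
  lifeExp = c(43.828, 76.423, 72.301, 42.731, 75.320),
  log_gdp = c(2.988818, 3.773569, 3.794025, 3.680991, 4.106510)
)

# Rounded coefficients from the summary() output
intercept <- 4.950
slope     <- 16.585

# Residual = observed - fitted; negative means shorter lives than wealth predicts
first_five$fitted   <- intercept + slope * first_five$log_gdp
first_five$residual <- first_five$lifeExp - first_five$fitted

# Sort from furthest below the line to furthest above it
print(first_five[order(first_five$residual), ])
```

On the full data, residuals(model) returns the same quantity for all 142 countries without rounding error.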