Simple Linear Regression

1 What is Simple Linear Regression?

Simple linear regression models the relationship between two continuous variables by fitting a straight line through the data. It answers two questions: (1) Is there a relationship? and (2) How strong is it?

The model takes the form: \(y = \beta_0 + \beta_1 x + \varepsilon\)

Where \(\beta_0\) is the intercept, \(\beta_1\) is the slope (the change in \(y\) for each unit increase in \(x\)), and \(\varepsilon\) is the error term.
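To make the roles of \(\beta_0\), \(\beta_1\) and \(\varepsilon\) concrete, here is a minimal sketch on simulated data. The "true" parameter values and variable names are assumptions chosen for illustration and have nothing to do with Gapminder. It computes the ordinary least squares estimates from the closed-form formulas and checks them against lm().

```r
# Simulated data with known (hypothetical) parameters, for illustration only
set.seed(42)
n     <- 200
beta0 <- 2     # assumed true intercept
beta1 <- 0.5   # assumed true slope
x <- runif(n, min = 0, max = 10)
y <- beta0 + beta1 * x + rnorm(n, sd = 1)   # error term: epsilon ~ N(0, 1)

# Closed-form ordinary least squares estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# lm() fits the same line, so its coefficients should match
fit <- lm(y ~ x)
ols_by_hand <- c(intercept = b0, slope = b1)
ols_by_lm   <- unname(coef(fit))

print(ols_by_hand)
print(ols_by_lm)
```

The estimates land close to, but not exactly on, the assumed true values because of the random error term.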

2 The Question

We’ll use the Gapminder dataset to ask: Does national wealth predict life expectancy?

In other words, does money buy longer life?

3 Setup

library(tidyverse)
library(gapminder)
library(gt)
library(scales)

4 The Data

We’ll use 2007 data from Gapminder, examining the relationship between GDP per capita and life expectancy across 142 countries.

gapminder_2007 <- gapminder |>
  filter(year == 2007) |>
  mutate(
    log_gdp = log10(gdpPercap),
    label = case_when(
      country %in% c("Norway", "United States", 
                     "South Africa", "Botswana",
                     "Afghanistan", "Japan", 
                     "Sierra Leone", "Gabon") ~ as.character(country),
      TRUE ~ NA_character_
    )
  )

glimpse(gapminder_2007)

1. Filter to 2007 data only
2. Create log-transformed GDP (base 10 for interpretability)
3. Flag interesting countries for labelling

Rows: 142
Columns: 8
$ country   <fct> "Afghanistan", "Albania", "Algeria", "Angola", "Argentina", …
$ continent <fct> Asia, Europe, Africa, Africa, Americas, Oceania, Europe, Asi…
$ year      <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, …
$ lifeExp   <dbl> 43.828, 76.423, 72.301, 42.731, 75.320, 81.235, 79.829, 75.6…
$ pop       <int> 31889923, 3600523, 33333216, 12420476, 40301927, 20434176, 8…
$ gdpPercap <dbl> 974.5803, 5937.0295, 6223.3675, 4797.2313, 12779.3796, 34435…
$ log_gdp   <dbl> 2.988818, 3.773569, 3.794025, 3.680991, 4.106510, 4.537005, …
$ label     <chr> "Afghanistan", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

5 The Problem with Raw GDP

Let’s first look at the relationship using raw GDP per capita.

continent_colors <- c(
  "Africa" = "#e63946",
  "Americas" = "#f4a261",
  "Asia" = "#2a9d8f",
  "Europe" = "#457b9d",
  "Oceania" = "#9c89b8"
)

ggplot(gapminder_2007, 
       aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(colour = continent, size = pop),
             alpha = 0.7) +
  geom_smooth(method = "lm",
              colour = "#1d3557",
              fill = "#1d3557",
              alpha = 0.2) +
  annotate("rect",
           xmin = -2000, xmax = 15000,
           ymin = 38, ymax = 83,
           fill = "#e63946", alpha = 0.1) +
  annotate("text",
           x = 7500, y = 85,
           label = "Most of the world\nis crammed in here",
           fontface = "italic",
           colour = "#e63946",
           size = 4) +
  annotate("curve",
           x = 42000, y = 60,
           xend = 49500, yend = 78,
           curvature = -0.3,
           arrow = arrow(length = unit(2, "mm")),
           colour = "grey40") +
  annotate("text",
           x = 38000, y = 58,
           label = "Norway:\nrich AND healthy",
           size = 3.5,
           colour = "grey30") +
  scale_x_continuous(labels = dollar_format()) +
  scale_colour_manual(values = continent_colors) +
  scale_size_continuous(range = c(2, 12),
                        labels = label_number(scale = 1e-6, 
                                              suffix = "M"),
                        breaks = c(10e6, 100e6, 500e6, 1000e6)) +
  labs(
    title = "The Problem: A Curved Relationship",
    subtitle = "GDP per capita vs life expectancy shows diminishing returns",
    x = "GDP per Capita",
    y = "Life Expectancy (years)",
    colour = "Continent",
    size = "Population"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", size = 18),
    plot.subtitle = element_text(colour = "grey40", size = 12),
    plot.title.position = "plot",
    legend.position = "right",
    panel.grid.minor = element_blank()
  )

1. Point size represents population; colour represents continent
2. Add linear regression line with confidence interval
3. Highlight the crowded region where most countries cluster
4. Curved arrow annotation pointing to an outlier (Norway)
5. Format x-axis as currency
6. Scale point sizes and format population labels

The relationship is clearly curved: gains in life expectancy diminish as GDP increases, so a straight line fitted to raw GDP describes the data poorly. The solution? Log-transform GDP.
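One rough way to see why the raw scale is a poor fit is to compare R² for the two specifications. The sketch below uses simulated data, not Gapminder: the GDP range, the coefficients and the noise level are all assumptions picked to resemble the situation described here.

```r
# Hypothetical data in which life expectancy is linear in log10(GDP)
set.seed(1)
gdp  <- 10^runif(150, min = 2.5, max = 4.7)          # assumed GDP per capita range
life <- 5 + 16.6 * log10(gdp) + rnorm(150, sd = 7)   # assumed log-linear truth plus noise

fit_raw <- lm(life ~ gdp)          # straight line on the raw scale
fit_log <- lm(life ~ log10(gdp))   # straight line on the log10 scale

r2_raw <- summary(fit_raw)$r.squared
r2_log <- summary(fit_log)$r.squared
print(round(c(raw = r2_raw, log10 = r2_log), 2))
```

When the underlying relationship is logarithmic, the fit on the log scale explains noticeably more variance than the fit on the raw scale.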

6 The Solution: Log Transformation

Taking the logarithm of GDP “stretches out” the lower values and “compresses” the higher values, linearising the relationship.
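A small numeric sketch of the stretching and compressing, using round illustrative GDP values rather than real countries: each step multiplies GDP by 10, so the raw gaps grow enormously while the gaps on the log₁₀ scale stay equal.

```r
# Illustrative GDP per capita values, each 10 times the previous one
gdp <- c(100, 1000, 10000, 100000)

raw_gaps <- diff(gdp)          # gaps on the raw scale grow tenfold each step
log_gaps <- diff(log10(gdp))   # gaps on the log10 scale are all equal

print(raw_gaps)
print(log_gaps)
```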

# Find positions for labeled countries
label_data <- gapminder_2007 |>
  filter(!is.na(label)) |>
  mutate(
    hjust = case_when(
      country == "Norway" ~ 1.1,
      country == "Japan" ~ 1.1,
      country == "United States" ~ -0.1,
      country == "Gabon" ~ -0.1,
      country == "South Africa" ~ -0.1,
      country == "Botswana" ~ -0.1,
      country == "Afghanistan" ~ 0.5,
      country == "Sierra Leone" ~ 0.5,
      TRUE ~ 0.5
    ),
    vjust = case_when(
      country == "Afghanistan" ~ 1.5,
      country == "Sierra Leone" ~ 1.5,
      country == "Japan" ~ 0.5,
      TRUE ~ 0.5
    )
  )

ggplot(gapminder_2007, 
       aes(x = log_gdp, y = lifeExp)) +
  
  # Regression line (behind points)
  geom_smooth(method = "lm",
              colour = "#1d3557",
              fill = "#1d3557",
              linewidth = 1.2,
              alpha = 0.2) +
  
  # All points
  geom_point(aes(colour = continent, size = pop),
             alpha = 0.7) +
  
  # Highlighted points (interesting countries)
  geom_point(data = label_data,
             aes(colour = continent),
             size = 5,
             shape = 21,
             fill = "white",
             stroke = 2) +
  
  # Country labels
  geom_text(data = label_data,
            aes(label = country,
                hjust = hjust,
                vjust = vjust),
            size = 3.5,
            fontface = "bold") +
  
  # Annotation: the story
  annotate("label",
           x = 2.8, y = 81,
           label = "10× richer ≈ 17 years longer life",
           fill = "#f1faee",
           colour = "#1d3557",
           fontface = "bold",
           size = 4.5,
           label.padding = unit(0.5, "lines"),
           label.r = unit(0.3, "lines")) +
  
  # Annotation: residual callout
  annotate("segment",
           x = 4.1, xend = 4.1,
           y = 52.5, yend = 72,
           linetype = "dashed",
           colour = "#e63946",
           linewidth = 0.8) +
  annotate("text",
           x = 4.2, y = 62,
           label = "Botswana &\nSouth Africa:\nHIV epidemic",
           hjust = 0,
           size = 3,
           colour = "#e63946",
           lineheight = 0.9) +
  
  # Scales
  scale_x_continuous(
    breaks = c(2.5, 3, 3.5, 4, 4.5, 5),
    labels = c("$300", "$1K", "$3K", 
               "$10K", "$30K", "$100K")
  ) +
  scale_colour_manual(values = continent_colors) +
  scale_size_continuous(range = c(2, 12),
                        guide = "none") +
  
  labs(
    title = "Does Money Buy Longer Life?",
    subtitle = "Log-transforming GDP reveals a clear linear relationship",
    x = "GDP per Capita (log scale)",
    y = "Life Expectancy (years)",
    colour = "Continent"
  ) +
  coord_cartesian(clip = "off") +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", size = 20),
    plot.subtitle = element_text(colour = "grey40", size = 12),
    plot.title.position = "plot",
    legend.position = "bottom",
    legend.title = element_text(face = "bold"),
    panel.grid.minor = element_blank(),
    plot.margin = margin(10, 20, 10, 10)
  ) +
  guides(colour = guide_legend(override.aes = list(size = 4)))

1. Regression line with 95% confidence band
2. Points sized by population, coloured by continent
3. Ring highlights around labeled countries (white fill, coloured border)
4. Country labels with custom positioning
5. Label annotation box summarising the key finding
6. Dashed line showing residuals for countries affected by HIV
7. Custom axis labels showing actual dollar values for log scale
8. Hide population legend to reduce clutter
9. Allow annotations to extend beyond plot area
10. Make legend points larger for visibility
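The dollar labels on the x-axis are rounded back-transformations of the log₁₀ breaks. A quick sketch of that conversion, where the variable names are illustrative:

```r
# Breaks used on the log10 axis in the plot above
log_breaks <- c(2.5, 3, 3.5, 4, 4.5, 5)

# Undo the log10 transform to recover dollar values
dollar_values <- 10^log_breaks
print(round(dollar_values))
```

The half-step breaks come out near 316, 3,162 and 31,623, which is why they are labelled "$300", "$3K" and "$30K".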

Now the relationship is approximately linear. The log scale also makes an important story visible: some countries (like Botswana and South Africa) fall well below the line. Their life expectancy is lower than their wealth would predict, largely due to the HIV/AIDS epidemic.

7 The Regression Model

model <- lm(lifeExp ~ log_gdp, data = gapminder_2007)

summary(model)

1. Fit a linear model: life expectancy predicted by log GDP
2. Display the full model summary

Call:
lm(formula = lifeExp ~ log_gdp, data = gapminder_2007)

Residuals:
    Min      1Q  Median      3Q     Max 
-25.947  -2.661   1.215   4.469  13.115 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    4.950      3.858   1.283    0.202    
log_gdp       16.585      1.019  16.283   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.122 on 140 degrees of freedom
Multiple R-squared:  0.6544,    Adjusted R-squared:  0.652 
F-statistic: 265.2 on 1 and 140 DF,  p-value: < 2.2e-16
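The pieces of this printout are tied together by simple arithmetic. The sketch below recomputes a few of them from the rounded numbers shown above. Because the inputs are rounded, the results match the printout only approximately, and the variable names are illustrative.

```r
# Values copied from the summary(model) printout above
estimate  <- 16.585   # log_gdp coefficient
std_error <- 1.019    # its standard error
r_squared <- 0.6544   # multiple R-squared
df_resid  <- 140      # residual degrees of freedom (142 countries - 2 parameters)

t_value <- estimate / std_error                     # t = estimate / standard error
f_value <- r_squared / (1 - r_squared) * df_resid   # F statistic with one predictor
r_value <- sqrt(r_squared)                          # correlation between log_gdp and lifeExp

round(c(t = t_value, t_squared = t_value^2, F = f_value, r = r_value), 2)
```

With a single predictor the F-statistic is the square of the slope's t value, so both tests lead to the same p-value.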

8 Interpreting the Results

The regression output tells us:

Coefficients:

  • Intercept (4.950): The predicted life expectancy when log₁₀(GDP) = 0 (i.e., GDP = $1). Not meaningful in practice, but necessary for the equation.

  • log_gdp (16.585): For each unit increase in log₁₀(GDP), predicted life expectancy increases by ~16.6 years. Since a one-unit increase in log₁₀(GDP) is a 10-fold increase in GDP, a 10-fold increase in GDP is associated with ~16.6 additional years of life expectancy.

Model Fit:

  • R² = 0.65: About 65% of the variation in life expectancy across countries is explained by log₁₀ GDP per capita. This is a strong relationship.

  • p-value (< 2.2e-16): The relationship is highly statistically significant.
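The slope can also be read for multiples other than 10: under this model, a k-fold increase in GDP per capita is associated with a change of slope × log₁₀(k) years. A minimal sketch, with the slope copied from the summary output and a helper name (life_gain) that is purely illustrative:

```r
slope <- 16.585   # log_gdp coefficient from the summary output

# Predicted change in life expectancy for a k-fold increase in GDP per capita
life_gain <- function(k) slope * log10(k)

gains <- sapply(c(doubling = 2, fivefold = 5, tenfold = 10), life_gain)
round(gains, 1)
```

Doubling GDP per capita corresponds to roughly 5 additional years of predicted life expectancy. These are associations across countries in 2007, not causal effects.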

9 The Equation

coefs <- coef(model)

Our fitted model is:

\[\widehat{\text{Life Expectancy}} = 4.9 + 16.6 \times \log_{10}(\text{GDP per capita})\]
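As a worked example of using the fitted equation, the sketch below plugs in Afghanistan's 2007 values from the glimpse() output earlier and computes the prediction and the residual. The coefficients are the rounded values from the summary printout, and predict_life_exp is a hypothetical helper name. In a live session, predict(model, newdata = ...) does the same job with unrounded coefficients.

```r
# Rounded coefficients from the summary(model) printout
b0 <- 4.950
b1 <- 16.585

# Hypothetical helper: predicted life expectancy for a given GDP per capita
predict_life_exp <- function(gdp_per_cap) b0 + b1 * log10(gdp_per_cap)

# Afghanistan, 2007 (values shown in the glimpse() output)
afg_gdp  <- 974.5803
afg_life <- 43.828

afg_pred  <- predict_life_exp(afg_gdp)   # what the line predicts
afg_resid <- afg_life - afg_pred         # observed minus predicted

round(c(predicted = afg_pred, observed = afg_life, residual = afg_resid), 2)
```

The model predicts about 54.5 years for a country at Afghanistan's income level, so the observed value sits roughly 10.7 years below the line.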

Conclusion: National wealth is a powerful predictor of life expectancy. However, the relationship isn’t deterministic—countries like Japan exceed expectations while others like South Africa fall short, reminding us that health outcomes depend on more than just money.