2025-10-23

About the Gapminder Dataset

  • Compiled by the Gapminder Foundation, a non-profit venture in Sweden that collects global statistics on health, economy, & population development
  • Tracks data for 142 countries across five continents from 1952 to 2007, in five-year intervals
  • Variables:
    • Country name
    • Continent of country
    • Year of observation
    • Average life expectancy (years)
    • Total population
    • GDP per capita (USD, adjusted for inflation)
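The figures above can be checked directly in R. This is a minimal sketch, assuming the data comes from the gapminder R package (the data frame used in the code at the end of these slides); the column names are the ones that package defines.

```r
library(gapminder)  # assumption: the slides use the gapminder R package

# One row per country-year observation, six variables
dim(gapminder)
names(gapminder)  # country, continent, year, lifeExp, pop, gdpPercap

# Number of distinct countries and continents
n_countries  <- length(unique(gapminder$country))
n_continents <- length(unique(gapminder$continent))

# First and last year of observation
year_range <- range(gapminder$year)

n_countries
n_continents
year_range
```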

How Wealth Relates to Health

Linear Regression Example

GDP & Life Expectancy Over Time

Linear Regression Model

  • For this example, I used the linear regression equation to model the relationship between GDP & life expectancy:

    \(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 \times X\)

  • Variables:

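    • \(\hat{Y}\) = Predicted life expectancy (in years) for the country
    • \(\hat{\beta}_0\) = Predicted life expectancy when log GDP per capita equals 0 (i.e., GDP per capita of 1 USD)
    • \(\hat{\beta}_1\) = Change in predicted life expectancy per one-unit increase in log GDP per capita (a tenfold increase in GDP per capita for a base-10 log)
    • \(X\) = Log of GDP per capita for the country
  • Using this equation, we can plot the regression line (slide 4) & predict a country’s life expectancy given its GDP per capita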
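The slides don't show the call that fits this model, so the following is only a sketch of how it could be done with lm(). It assumes a base-10 log of GDP per capita (matching the log10 axis in the plot code) and a fit on all years; `new_country` and its value of 10,000 are hypothetical and purely for illustration.

```r
library(gapminder)

# Assumption: life expectancy modeled on log10(GDP per capita), all years
fit <- lm(lifeExp ~ log10(gdpPercap), data = gapminder)

# Estimated intercept (beta_0) and slope (beta_1)
coef(fit)

# Predict life expectancy for a hypothetical country
# with a GDP per capita of 10,000 (so X = log10(10000) = 4)
new_country <- data.frame(gdpPercap = 10000)
pred <- predict(fit, newdata = new_country)
pred
```

The prediction is simply \(\hat{\beta}_0 + \hat{\beta}_1 \times 4\), which is the regression equation above evaluated at \(X = 4\).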

P-value

  • Using the summary() function, we can determine the p-value & evaluate the statistical significance of the model:

    \(p < 2.2 \times 10^{-16}\)

    • Relationship between GDP & life expectancy is highly statistically significant (p < 0.05)
    • Wealthier nations tend to have a higher average life expectancy than poorer ones
    • Suggests that a country’s wealth (GDP per capita) is correlated with its population’s health
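As a sketch of where this p-value comes from, the values can be pulled out of the summary() object rather than read off the printed output. The model call is an assumption (log10 GDP per capita, all years), since the slides don't show it. Note that summary() prints "< 2.2e-16" for any p-value below machine precision, so the stored value is smaller than what is displayed.

```r
library(gapminder)

# Assumption: same model as in the regression slide
fit <- lm(lifeExp ~ log10(gdpPercap), data = gapminder)
s <- summary(fit)

# p-value for the slope (second row of the coefficient table)
p_slope <- s$coefficients[2, "Pr(>|t|)"]

# p-value for the overall F-test, computed from the stored F statistic
f <- s$fstatistic
p_model <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

p_slope
p_model
```

With a single predictor, the slope's t-test and the overall F-test are testing the same hypothesis, so both p-values lead to the same conclusion.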

Coefficient of Determination

  • For this example, I used the coefficient of determination to measure how much of the variation in life expectancy is explained by GDP:

    \(R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}\)

    • \(SS_{\text{res}}\) = Residual sum of squares \(\to\) variation in life expectancy not explained by GDP
    • \(SS_{\text{tot}}\) = Total sum of squares \(\to\) total variation in life expectancy across all countries
  • Plugging in our data:

    • \(R^2\) = 0.6522466 \(\to\) log GDP per capita explains ~65% of the variation in life expectancy (strong positive correlation between wealth & overall well-being)
    • Remaining ~35% is not explained by the model & likely reflects other variables (education, inequality, etc.)
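The formula above can be reproduced by hand and compared with the value summary() reports. This is a sketch under the assumption that the model is life expectancy on log10 GDP per capita, fit on all years; a different subset or transformation would give a different \(R^2\).

```r
library(gapminder)

# Assumption: same model as in the regression slide
fit <- lm(lifeExp ~ log10(gdpPercap), data = gapminder)

# Residual sum of squares: variation left unexplained by the model
ss_res <- sum(residuals(fit)^2)

# Total sum of squares: variation of life expectancy around its mean
ss_tot <- sum((gapminder$lifeExp - mean(gapminder$lifeExp))^2)

# R^2 = 1 - SS_res / SS_tot
r2_manual  <- 1 - ss_res / ss_tot
r2_summary <- summary(fit)$r.squared

# The two values should agree up to floating-point error
all.equal(r2_manual, r2_summary)
```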

R Code

# Load the packages used below
library(gapminder)  # gapminder data frame
library(plotly)     # plot_ly(), layout(), %>%
library(ggplot2)    # ggplot(), geom_smooth(), etc.

# 3D Scatter Plot of Changes in GDP & Life Expectancy Over Time
# plot_ly() maps data columns with formulas, so each column needs a leading ~
plot_ly(gapminder, x = ~gdpPercap, y = ~lifeExp, z = ~year,
        type = "scatter3d", mode = "markers", color = ~continent,
        marker = list(size = 4, opacity = 0.8)) %>%
    layout(title = "GDP vs. Life Expectancy vs. Year", scene = list(
      xaxis = list(title = "GDP per Capita (log)", type = "log"),
      yaxis = list(title = "Life Expectancy (years)"),
      zaxis = list(title = "Year")))

# Linear Relationship Between GDP & Life Expectancy
data2007 <- subset(gapminder, year == 2007)
# scale_x_log10() transforms x before the fit, so the line is linear in log10(GDP)
ggplot(data2007, aes(x = gdpPercap, y = lifeExp)) +
  scale_x_log10(labels = scales::comma) +
  geom_point(aes(color = continent), size = 1.5, alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE, color = "midnightblue") +
  labs(title = "Linear Regression of GDP vs. Life Expectancy (2007)",
       x = "GDP per Capita (log)", y = "Life Expectancy (years)") +
  theme_light()