Stage 1 WDI Regression Analysis: Life Expectancy

Author

Selhan Cil

Published

May 10, 2026

# Packages used in this report
library(WDI)
library(tidyverse)
library(knitr)
library(scales)

theme_set(theme_minimal(base_size = 12))

1 Overview

This report uses a real-world economic dataset from the World Bank World Development Indicators (WDI) database. The report focuses only on a regression problem. The outcome variable is life expectancy at birth, which is a continuous numeric variable measured in years.

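The dataset is downloaded directly from the World Bank WDI API, so the report can be re-run as long as the required R packages are installed and an internet connection is available. Because the World Bank revises WDI series over time, the exact figures may differ slightly between runs.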

Data source: https://databank.worldbank.org/source/world-development-indicators

2 Dataset Description and Source

The dataset is a country-year panel for the years 2010-2023. Each observation represents one country in one year. The target variable is life_expectancy. The main explanatory variables are gdp_per_capita, health_expenditure, unemployment, inflation, and urban_population.

This dataset is economically relevant because life expectancy is not only a health outcome but also a development outcome. Countries with higher income, better health investment, more stable labor markets, lower macroeconomic instability, and better access to urban services may experience longer life expectancy. Therefore, life expectancy can be studied as an outcome related to economic development and social welfare.

The main regression model is:

life_expectancy = β0 + β1·log(gdp_per_capita) + β2·health_expenditure + β3·unemployment + β4·inflation + β5·urban_population + ε

| Variable | WDI indicator | Economic meaning |
|---|---|---|
| life_expectancy | SP.DYN.LE00.IN | Life expectancy at birth, total (years) |
| gdp_per_capita | NY.GDP.PCAP.KD | Economic development / income level |
| health_expenditure | SH.XPD.CHEX.GD.ZS | Health investment |
| unemployment | SL.UEM.TOTL.ZS | Labor market conditions |
| inflation | FP.CPI.TOTL.ZG | Macroeconomic stability |
| urban_population | SP.URB.TOTL.IN.ZS | Urbanization and access to services |

3 Research Question

To what extent do economic development, health investment, labor market conditions, macroeconomic stability, and urbanization explain differences in life expectancy across countries?

4 Data Import and Cleaning

# WDI indicators used in the regression dataset
wdi_indicators <- c(
  life_expectancy = "SP.DYN.LE00.IN",
  gdp_per_capita = "NY.GDP.PCAP.KD",
  health_expenditure = "SH.XPD.CHEX.GD.ZS",
  unemployment = "SL.UEM.TOTL.ZS",
  inflation = "FP.CPI.TOTL.ZG",
  urban_population = "SP.URB.TOTL.IN.ZS"
)

# Import data directly from the World Bank WDI API.
# extra = TRUE adds country metadata such as region and income group.
wdi_raw <- WDI(
  country = "all",
  indicator = wdi_indicators,
  start = 2010,
  end = 2023,
  extra = TRUE
)

# Clean the dataset:
# 1. Remove aggregate regions.
# 2. Keep only the variables needed for the regression analysis.
# 3. Convert year and indicator variables to the correct data types.
# 4. Remove observations with a missing value in any retained column.
# 5. Create log transformations: life expectancy (for the distribution
#    analysis) and GDP per capita (for the regression).
wdi_regression <- wdi_raw |>
  filter(region != "Aggregates") |>
  select(
    country, iso3c, year, region, income,
    life_expectancy, gdp_per_capita, health_expenditure,
    unemployment, inflation, urban_population
  ) |>
  mutate(
    year = as.integer(year),
    across(
      c(
        life_expectancy, gdp_per_capita, health_expenditure,
        unemployment, inflation, urban_population
      ),
      as.numeric
    )
  ) |>
  drop_na() |>
  mutate(
    log_life_expectancy = log(life_expectancy),
    log_gdp_per_capita = log(gdp_per_capita)
  )

dataset_size <- tibble(
  observations = nrow(wdi_regression),
  variables = ncol(wdi_regression)
)

kable(dataset_size, caption = "Cleaned WDI Regression Dataset Size")
Cleaned WDI Regression Dataset Size

| observations | variables |
|---|---|
| 2350 | 13 |

The cleaned dataset satisfies the assignment requirement because it contains more than 500 observations and more than 5 variables including the target variable.
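Because drop_na() removes a row when any retained column is missing, it can help to see which indicator drives the loss before trusting the final sample size. The following is a minimal base R sketch on hypothetical toy data: the rows and values are invented for illustration, and complete.cases() is used as a stand-in for drop_na().

```r
# Hypothetical toy panel (invented values, not WDI data).
toy_panel <- data.frame(
  country = c("A", "A", "B", "B", "C", "C"),
  year = c(2010L, 2011L, 2010L, 2011L, 2010L, 2011L),
  life_expectancy = c(70.1, 70.4, 81.2, NA, 64.9, 65.3),
  health_expenditure = c(5.2, NA, 9.8, 9.9, NA, 4.1),
  inflation = c(3.1, 2.8, 1.2, 1.4, 8.5, 7.9)
)

# Number of missing values per column.
missing_by_column <- colSums(is.na(toy_panel))

# Rows that would survive listwise deletion (the base R analogue of drop_na()).
complete_rows <- complete.cases(toy_panel)
rows_kept <- sum(complete_rows)

# Surviving observations per country, to spot countries that mostly drop out.
rows_per_country <- table(toy_panel$country[complete_rows])

print(missing_by_column)
print(rows_kept)  # 3
print(rows_per_country)
```

Running the same three summaries on the selected WDI columns before the drop_na() step would show whether a single indicator accounts for most of the discarded country-years.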

5 Summary Statistics

target_summary <- wdi_regression |>
  summarise(
    mean = mean(life_expectancy),
    median = median(life_expectancy),
    standard_deviation = sd(life_expectancy),
    minimum = min(life_expectancy),
    q1 = quantile(life_expectancy, 0.25),
    q3 = quantile(life_expectancy, 0.75),
    maximum = max(life_expectancy)
  )

kable(
  target_summary,
  digits = 2,
  caption = "Summary Statistics for Life Expectancy"
)
Summary Statistics for Life Expectancy

| mean | median | standard_deviation | minimum | q1 | q3 | maximum |
|---|---|---|---|---|---|---|
| 71.62 | 72.78 | 8.07 | 18.82 | 65.87 | 77.72 | 84.56 |

regressor_summary <- wdi_regression |>
  summarise(
    across(
      c(
        gdp_per_capita, health_expenditure, unemployment,
        inflation, urban_population
      ),
      list(
        mean = mean,
        median = median,
        sd = sd
      ),
      .names = "{.col}_{.fn}"
    )
  ) |>
  pivot_longer(
    everything(),
    names_to = c("variable", ".value"),
    names_pattern = "(.+)_(mean|median|sd)"
  )

kable(
  regressor_summary,
  digits = 2,
  caption = "Summary Statistics for Explanatory Variables"
)
Summary Statistics for Explanatory Variables

| variable | mean | median | sd |
|---|---|---|---|
| gdp_per_capita | 13492.86 | 5348.27 | 18811.81 |
| health_expenditure | 6.35 | 6.03 | 2.70 |
| unemployment | 7.76 | 5.77 | 5.94 |
| inflation | 6.25 | 3.40 | 18.35 |
| urban_population | 58.83 | 61.04 | 22.21 |

6 Probability Distribution Analysis

6.1 Original Target Variable

ggplot(wdi_regression, aes(x = life_expectancy)) +
  geom_histogram(bins = 30, fill = "#2f6f73", color = "white") +
  labs(
    title = "Distribution of Life Expectancy",
    x = "Life expectancy at birth (years)",
    y = "Number of country-year observations"
  )

The distribution of life expectancy is not perfectly normal. It is mildly left-skewed because many country-year observations are concentrated at relatively high life expectancy values, while a smaller group of countries has much lower life expectancy. Still, the distribution is more compact than many monetary economic variables.
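The direction of the skew can be checked against the summary statistics reported in Section 5 without re-downloading the data. The sketch below computes two rough, summary-based skewness measures (Pearson's second coefficient and the quartile-based Bowley coefficient); the variable names are illustrative, and both measures are only approximations to the moment-based skewness of the full sample.

```r
# Values copied from the "Summary Statistics for Life Expectancy" table.
le_mean <- 71.62
le_median <- 72.78
le_sd <- 8.07
le_q1 <- 65.87
le_q3 <- 77.72

# Pearson's second skewness coefficient: 3 * (mean - median) / sd.
pearson_skew <- 3 * (le_mean - le_median) / le_sd

# Bowley (quartile) skewness: (Q3 + Q1 - 2 * median) / (Q3 - Q1).
bowley_skew <- (le_q3 + le_q1 - 2 * le_median) / (le_q3 - le_q1)

print(round(pearson_skew, 2))  # -0.43
print(round(bowley_skew, 2))   # -0.17
```

Both measures are negative and modest in size, which is consistent with a mild left skew.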

6.2 Log-Transformed Target Variable

ggplot(wdi_regression, aes(x = log_life_expectancy)) +
  geom_histogram(bins = 30, fill = "#b46a3c", color = "white") +
  labs(
    title = "Distribution of Log Life Expectancy",
    x = "Log of life expectancy",
    y = "Number of country-year observations"
  )

The log transformation compresses higher values and stretches lower ones, so it can correct a right skew but not a left skew. It does not improve the shape here because life expectancy is not right-skewed, and the regression in Section 7 uses life expectancy in levels. A normal distribution is a reasonable first approximation for life expectancy, although the variable is naturally bounded and therefore cannot be exactly normal.
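The effect of a log transformation on a left-skewed variable can be illustrated with simulated data. This is a sketch under stated assumptions: the values are drawn from a rescaled Beta(5, 2) distribution chosen only because it is left-skewed on a plausible 40 to 85 range, and sample_skewness() is a helper defined here, not a function from any package used in the report.

```r
set.seed(42)

# Simulated left-skewed values on a 40 to 85 scale (not WDI data).
simulated_le <- 40 + 45 * rbeta(5000, shape1 = 5, shape2 = 2)

# Moment-based sample skewness: third central moment over variance^(3/2).
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_raw <- sample_skewness(simulated_le)
skew_log <- sample_skewness(log(simulated_le))

# The log is a concave transformation, so it pushes skewness further negative.
print(c(raw = skew_raw, log = skew_log))
```

In this simulation the logged series is more left-skewed than the original, which is why a log transformation is not a remedy when the long tail is on the low side.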

7 Regression Analysis

# GDP per capita is logged because income variables are usually right-skewed.
life_expectancy_model <- lm(
  life_expectancy ~ log_gdp_per_capita + health_expenditure +
    unemployment + inflation + urban_population,
  data = wdi_regression
)

model_summary <- summary(life_expectancy_model)

coefficient_table <- as.data.frame(coef(model_summary)) |>
  rownames_to_column("variable") |>
  rename(
    estimate = Estimate,
    standard_error = `Std. Error`,
    t_value = `t value`,
    p_value = `Pr(>|t|)`
  )

kable(
  coefficient_table,
  digits = 4,
  caption = "Linear Regression Results"
)
Linear Regression Results

| variable | estimate | standard_error | t_value | p_value |
|---|---|---|---|---|
| (Intercept) | 31.3488 | 0.6310 | 49.6828 | 0.0000 |
| log_gdp_per_capita | 4.3600 | 0.0979 | 44.5546 | 0.0000 |
| health_expenditure | 0.2507 | 0.0355 | 7.0574 | 0.0000 |
| unemployment | -0.0767 | 0.0153 | -5.0250 | 0.0000 |
| inflation | 0.0029 | 0.0049 | 0.5999 | 0.5486 |
| urban_population | 0.0285 | 0.0060 | 4.7339 | 0.0000 |

fit_statistics <- tibble(
  r_squared = model_summary$r.squared,
  adjusted_r_squared = model_summary$adj.r.squared,
  residual_standard_error = model_summary$sigma,
  observations = nobs(life_expectancy_model)
)

kable(
  fit_statistics,
  digits = 3,
  caption = "Model Fit Statistics"
)
Model Fit Statistics

| r_squared | adjusted_r_squared | residual_standard_error | observations |
|---|---|---|---|
| 0.719 | 0.718 | 4.284 | 2350 |
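As a quick consistency check, the adjusted R-squared can be reproduced from the reported R-squared, the number of observations, and the five regressors. The sketch below starts from the rounded R-squared of 0.719, so it can only match the reported value to about three decimals.

```r
# Values taken from the model fit statistics and the model specification.
r_squared <- 0.719
n_obs <- 2350
n_predictors <- 5  # regressors, excluding the intercept

# Adjusted R-squared penalizes R-squared for the number of regressors.
adjusted_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

print(round(adjusted_r_squared, 3))  # 0.718
```

With 2,350 observations and only five regressors the penalty is negligible, which is why the two fit measures are nearly identical.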

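The model uses logged GDP per capita because income is highly right-skewed: in this sample the mean (13,492.86) is more than twice the median (5,348.27). With a logged regressor, the coefficient on log_gdp_per_capita measures the expected change in life expectancy associated with a proportional change in income, holding the other variables constant. The estimate of 4.36 implies that a 1% increase in GDP per capita is associated with roughly 0.04 additional years of life expectancy.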
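For larger income differences, the associated change in life expectancy is the coefficient multiplied by the log of the income ratio. The sketch below applies this to the reported estimate; effect_of_income_ratio() is a helper introduced here for illustration, and the results describe associations in the fitted model, not causal effects.

```r
# Coefficient on log_gdp_per_capita from the regression table.
beta_log_gdp <- 4.3600

# Change in predicted life expectancy (years) when GDP per capita is
# multiplied by `ratio`, holding the other regressors fixed.
effect_of_income_ratio <- function(ratio, beta = beta_log_gdp) {
  beta * log(ratio)
}

effect_1pct <- effect_of_income_ratio(1.01)
effect_10pct <- effect_of_income_ratio(1.10)
effect_doubling <- effect_of_income_ratio(2)

print(round(c(effect_1pct, effect_10pct, effect_doubling), 3))
# 0.043 0.416 3.022
```

A doubling of GDP per capita is therefore associated with about three additional years of life expectancy in this model, regardless of the starting income level.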

8 Regression Visualizations

ggplot(wdi_regression, aes(x = log_gdp_per_capita, y = life_expectancy)) +
  geom_point(alpha = 0.35, color = "#345995") +
  geom_smooth(method = "lm", se = TRUE, color = "#c44536") +
  labs(
    title = "Life Expectancy and GDP per Capita",
    x = "Log GDP per capita",
    y = "Life expectancy at birth (years)"
  )

ggplot(wdi_regression, aes(x = health_expenditure, y = life_expectancy)) +
  geom_point(alpha = 0.35, color = "#2f6f73") +
  geom_smooth(method = "lm", se = TRUE, color = "#c44536") +
  labs(
    title = "Life Expectancy and Health Expenditure",
    x = "Current health expenditure (% of GDP)",
    y = "Life expectancy at birth (years)"
  )

9 Interpretation

The expected signs of the main coefficients are based on economic theory. GDP per capita is expected to have a positive relationship with life expectancy because higher income can improve nutrition, housing, sanitation, and access to medical services. Health expenditure is also expected to be positively related to life expectancy because it reflects investment in health systems. Unemployment and inflation are expected to have negative relationships because they may represent weaker labor market conditions and macroeconomic instability. Urban population share may be positive if urbanization improves access to hospitals, education, and infrastructure, although the effect can vary across countries.
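The expected signs can be compared with the estimates reported in Section 7. The sketch below re-enters the coefficient table by hand; the expected_sign column encodes the expectations stated in the paragraph above (with urban population coded as positive, although the text notes its effect can vary), and the 5% threshold is a conventional choice assumed here.

```r
# Estimates and p-values copied from the "Linear Regression Results" table.
results <- data.frame(
  variable = c("log_gdp_per_capita", "health_expenditure", "unemployment",
               "inflation", "urban_population"),
  estimate = c(4.3600, 0.2507, -0.0767, 0.0029, 0.0285),
  p_value = c(0.0000, 0.0000, 0.0000, 0.5486, 0.0000),
  expected_sign = c(1, 1, -1, -1, 1)
)

# Flag whether each estimated sign matches the expectation and whether the
# estimate is statistically distinguishable from zero at the 5% level.
results$sign_matches <- sign(results$estimate) == results$expected_sign
results$significant_5pct <- results$p_value < 0.05

print(results)
```

Four of the five estimates have the expected sign and are significant at the 5% level. The inflation coefficient is positive rather than negative, but it is close to zero and not statistically significant (p = 0.5486), so the data do not support a clear relationship in either direction.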

This regression is an associational model, not a causal model. The results show whether the selected economic indicators are correlated with life expectancy after controlling for the other variables in the model. A stronger causal analysis would require additional design choices, such as country fixed effects, year fixed effects, instrumental variables, or a more detailed panel-data strategy.
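To illustrate why country and year fixed effects matter, the following sketch simulates a small panel in which an unobserved country characteristic affects both the regressor and the outcome. Everything here is hypothetical: the data are simulated, the true slope is set to 2 by construction, and the fixed effects are implemented with dummy variables in lm() rather than a dedicated panel package.

```r
set.seed(1)

n_countries <- 40
n_years <- 10
country <- rep(seq_len(n_countries), each = n_years)
year <- rep(seq_len(n_years), times = n_countries)

# Unobserved, time-invariant country characteristic.
country_effect <- rnorm(n_countries)

# The regressor is correlated with the country effect; the true slope is 2.
x <- country_effect[country] + rnorm(n_countries * n_years)
y <- 1 + 2 * x + 3 * country_effect[country] + rnorm(n_countries * n_years)

panel <- data.frame(country = factor(country), year = factor(year), x = x, y = y)

# Pooled OLS ignores the country effect and absorbs it into the slope.
pooled_fit <- lm(y ~ x, data = panel)

# Country and year dummies remove time-invariant country differences and
# common year shocks.
fe_fit <- lm(y ~ x + country + year, data = panel)

pooled_slope <- coef(pooled_fit)[["x"]]
fe_slope <- coef(fe_fit)[["x"]]

print(c(pooled = pooled_slope, fixed_effects = fe_slope))
```

In the simulation the pooled slope overstates the true value while the fixed-effects slope is close to 2. The same logic suggests that the pooled coefficients in this report may partly reflect persistent cross-country differences that the five regressors do not capture.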

10 Conclusion

This WDI regression dataset is appropriate for the assignment because it has a continuous target variable, more than 500 observations, and more than 5 variables. The analysis focuses on life expectancy as an economic development outcome and studies how income, health investment, labor market conditions, macroeconomic stability, and urbanization are associated with differences in life expectancy across countries.