Stage 1 WDI Regression Analysis: Life Expectancy

Author

Selhan Cil

Published

May 10, 2026

# Packages used in this report
library(WDI)
library(tidyverse)
library(knitr)
library(scales)

theme_set(theme_minimal(base_size = 12))

1 Overview

This report uses a real-world economic dataset from the World Bank World Development Indicators (WDI) database. The report focuses only on a regression problem. The outcome variable is life expectancy at birth, which is a continuous numeric variable measured in years.

The dataset is downloaded directly from the World Bank WDI API, so the report is reproducible as long as the required R packages are installed and an internet connection is available.

Data source: https://databank.worldbank.org/source/world-development-indicators
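
Because the World Bank revises WDI series over time, a later download may not return exactly the values reported here. One way to pin the data is to cache the first download on disk. The sketch below is a minimal illustration and not part of the original workflow: `cached_fetch`, `demo_fetch`, and the file name are hypothetical, and a small stand-in data frame replaces the real download so the example runs offline.

```r
# Hypothetical helper: run `fetch_fn` once, store the result as an .rds file,
# and read that file on later runs instead of downloading again.
cached_fetch <- function(path, fetch_fn) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- fetch_fn()
  saveRDS(result, path)
  result
}

# Stand-in for the real download (assumption: in the report this function
# would wrap the WDI() call from the import section).
demo_fetch <- function() {
  data.frame(iso3c = c("AAA", "BBB"), year = c(2010L, 2010L))
}

cache_file <- file.path(tempdir(), "wdi_raw_demo.rds")
first_run <- cached_fetch(cache_file, demo_fetch)   # saves the file if it is missing
second_run <- cached_fetch(cache_file, demo_fetch)  # reads the saved copy
all.equal(first_run, second_run)
```

Deleting the cached file forces a fresh download, so the choice between a frozen snapshot and the latest revision stays explicit.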

2 Dataset Description and Source

The dataset is a country-year panel for the years 2010-2023. Each observation represents one country in one year. The target variable is life_expectancy. The main explanatory variables are gdp_per_capita, health_expenditure, unemployment, inflation, and urban_population.

This dataset is economically relevant because life expectancy is not only a health outcome but also a development outcome. Countries with higher income, better health investment, more stable labor markets, lower macroeconomic instability, and better access to urban services may experience longer life expectancy. Therefore, life expectancy can be studied as an outcome related to economic development and social welfare.

The main regression model is:

life_expectancy = β0 + β1·log(gdp_per_capita) + β2·health_expenditure + β3·unemployment + β4·inflation + β5·urban_population + ε

Variable	WDI Indicator	Economic Meaning
`life_expectancy`	`SP.DYN.LE00.IN`	Life expectancy at birth, total (years)
`gdp_per_capita`	`NY.GDP.PCAP.KD`	Economic development / income level
`health_expenditure`	`SH.XPD.CHEX.GD.ZS`	Health investment
`unemployment`	`SL.UEM.TOTL.ZS`	Labor market conditions
`inflation`	`FP.CPI.TOTL.ZG`	Macroeconomic stability
`urban_population`	`SP.URB.TOTL.IN.ZS`	Urbanization and access to services

3 Research Question

To what extent do economic development, health investment, labor market conditions, macroeconomic stability, and urbanization explain differences in life expectancy across countries?

4 Data Import and Cleaning

# WDI indicators used in the regression dataset
wdi_indicators <- c(
  life_expectancy = "SP.DYN.LE00.IN",
  gdp_per_capita = "NY.GDP.PCAP.KD",
  health_expenditure = "SH.XPD.CHEX.GD.ZS",
  unemployment = "SL.UEM.TOTL.ZS",
  inflation = "FP.CPI.TOTL.ZG",
  urban_population = "SP.URB.TOTL.IN.ZS"
)

# Import data directly from the World Bank WDI API.
# extra = TRUE adds country metadata such as region and income group.
wdi_raw <- WDI(
  country = "all",
  indicator = wdi_indicators,
  start = 2010,
  end = 2023,
  extra = TRUE
)

# Clean the dataset:
# 1. Remove aggregate regions.
# 2. Keep only the variables needed for the regression analysis.
# 3. Convert year and indicator variables to the correct data types.
# 4. Remove observations with missing values.
# 5. Add log transforms: life expectancy (for the distribution comparison)
#    and GDP per capita (used as a regressor).
wdi_regression <- wdi_raw |>
  filter(region != "Aggregates") |>
  select(
    country, iso3c, year, region, income,
    life_expectancy, gdp_per_capita, health_expenditure,
    unemployment, inflation, urban_population
  ) |>
  mutate(
    year = as.integer(year),
    across(
      c(
        life_expectancy, gdp_per_capita, health_expenditure,
        unemployment, inflation, urban_population
      ),
      as.numeric
    )
  ) |>
  drop_na() |>
  mutate(
    log_life_expectancy = log(life_expectancy),
    log_gdp_per_capita = log(gdp_per_capita)
  )

dataset_size <- tibble(
  observations = nrow(wdi_regression),
  variables = ncol(wdi_regression)
)

kable(dataset_size, caption = "Cleaned WDI Regression Dataset Size")

Cleaned WDI Regression Dataset Size
observations	variables
2350	13

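The cleaned dataset satisfies the assignment requirement of more than 500 observations and more than 5 variables including the target variable. It contains 2,350 country-year observations and 13 columns: 5 identifier and metadata columns, 6 WDI indicators (the target plus 5 regressors), and 2 log-transformed variables.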

5 Summary Statistics

target_summary <- wdi_regression |>
  summarise(
    mean = mean(life_expectancy),
    median = median(life_expectancy),
    standard_deviation = sd(life_expectancy),
    minimum = min(life_expectancy),
    q1 = quantile(life_expectancy, 0.25),
    q3 = quantile(life_expectancy, 0.75),
    maximum = max(life_expectancy)
  )

kable(
  target_summary,
  digits = 2,
  caption = "Summary Statistics for Life Expectancy"
)

Summary Statistics for Life Expectancy
mean	median	standard_deviation	minimum	q1	q3	maximum
71.62	72.78	8.07	18.82	65.87	77.72	84.56

regressor_summary <- wdi_regression |>
  summarise(
    across(
      c(
        gdp_per_capita, health_expenditure, unemployment,
        inflation, urban_population
      ),
      list(
        mean = mean,
        median = median,
        sd = sd
      ),
      .names = "{.col}_{.fn}"
    )
  ) |>
  pivot_longer(
    everything(),
    names_to = c("variable", ".value"),
    names_pattern = "(.+)_(mean|median|sd)"
  )

kable(
  regressor_summary,
  digits = 2,
  caption = "Summary Statistics for Explanatory Variables"
)

Summary Statistics for Explanatory Variables
variable	mean	median	sd
gdp_per_capita	13492.86	5348.27	18811.81
health_expenditure	6.35	6.03	2.70
unemployment	7.76	5.77	5.94
inflation	6.25	3.40	18.35
urban_population	58.83	61.04	22.21

6 Probability Distribution Analysis

6.1 Original Target Variable

ggplot(wdi_regression, aes(x = life_expectancy)) +
  geom_histogram(bins = 30, fill = "#2f6f73", color = "white") +
  labs(
    title = "Distribution of Life Expectancy",
    x = "Life expectancy at birth (years)",
    y = "Number of country-year observations"
  )

The distribution of life expectancy is not perfectly normal. It is mildly left-skewed: many country-year observations are concentrated at relatively high life expectancy values, while a smaller group has much lower values. The summary statistics are consistent with this, since the mean (71.62) lies below the median (72.78). Still, the distribution is more compact than that of many monetary economic variables.
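
The direction of the skew can also be checked numerically rather than read off the histogram. The sketch below uses a moment-based skewness measure written in base R. `sample_skewness` is a hypothetical helper, and the toy vector is invented for illustration and is not WDI data.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 1.5.
# Negative values indicate a longer left tail.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Toy vector (not WDI data): most values are high, one is much lower.
toy_values <- c(40, 62, 68, 70, 72, 73, 74, 75)
sample_skewness(toy_values)
```

In the report, the same function could be applied as `sample_skewness(wdi_regression$life_expectancy)`. A negative result would support the left-skew reading.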

6.2 Log-Transformed Target Variable

ggplot(wdi_regression, aes(x = log_life_expectancy)) +
  geom_histogram(bins = 30, fill = "#b46a3c", color = "white") +
  labs(
    title = "Distribution of Log Life Expectancy",
    x = "Log of life expectancy",
    y = "Number of country-year observations"
  )

The log transformation compresses higher values more than lower ones. That helps with right-skewed variables, but life expectancy is mildly left-skewed, so the transformation does not improve the shape of the distribution. The regression therefore uses life expectancy in its original units. A normal distribution is a reasonable first approximation for life expectancy, although the variable is naturally bounded and therefore cannot be perfectly normal.

7 Regression Analysis

# GDP per capita is logged because income variables are usually right-skewed.
life_expectancy_model <- lm(
  life_expectancy ~ log_gdp_per_capita + health_expenditure +
    unemployment + inflation + urban_population,
  data = wdi_regression
)

model_summary <- summary(life_expectancy_model)

coefficient_table <- as.data.frame(coef(model_summary)) |>
  rownames_to_column("variable") |>
  rename(
    estimate = Estimate,
    standard_error = `Std. Error`,
    t_value = `t value`,
    p_value = `Pr(>|t|)`
  )

kable(
  coefficient_table,
  digits = 4,
  caption = "Linear Regression Results"
)

Linear Regression Results
variable	estimate	standard_error	t_value	p_value
(Intercept)	31.3488	0.6310	49.6828	0.0000
log_gdp_per_capita	4.3600	0.0979	44.5546	0.0000
health_expenditure	0.2507	0.0355	7.0574	0.0000
unemployment	-0.0767	0.0153	-5.0250	0.0000
inflation	0.0029	0.0049	0.5999	0.5486
urban_population	0.0285	0.0060	4.7339	0.0000

fit_statistics <- tibble(
  r_squared = model_summary$r.squared,
  adjusted_r_squared = model_summary$adj.r.squared,
  residual_standard_error = model_summary$sigma,
  observations = nobs(life_expectancy_model)
)

kable(
  fit_statistics,
  digits = 3,
  caption = "Model Fit Statistics"
)

Model Fit Statistics
r_squared	adjusted_r_squared	residual_standard_error	observations
0.719	0.718	4.284	2350

The model uses logged GDP per capita because income is usually highly right-skewed. With the regressor in natural logs, the coefficient on log_gdp_per_capita is read in terms of proportional changes in income: the estimate of 4.36 implies that a 1% increase in GDP per capita is associated with roughly 0.04 additional years of life expectancy, holding the other variables constant.
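
The arithmetic behind this interpretation can be sketched as follows. The coefficient is copied from the Linear Regression Results table. `life_expectancy_change` is a hypothetical helper name, and the percentage changes are arbitrary illustrative choices.

```r
# Coefficient on log_gdp_per_capita, copied from the regression table.
beta_log_gdp <- 4.3600

# Predicted change in life expectancy (years) when GDP per capita rises
# by `pct` percent, holding the other regressors fixed:
# beta * [log(gdp * (1 + pct/100)) - log(gdp)] = beta * log(1 + pct/100)
life_expectancy_change <- function(beta, pct) {
  beta * log(1 + pct / 100)
}

changes <- c(
  one_percent = life_expectancy_change(beta_log_gdp, 1),
  ten_percent = life_expectancy_change(beta_log_gdp, 10),
  doubling    = life_expectancy_change(beta_log_gdp, 100)
)
round(changes, 2)
```

Under this model, a 1% rise in income corresponds to about 0.04 years, a 10% rise to about 0.42 years, and a doubling to about 3.02 years. The effect depends only on the proportional change, not on the starting income level. These are associations within the fitted model, not causal effects.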

8 Regression Visualizations

ggplot(wdi_regression, aes(x = log_gdp_per_capita, y = life_expectancy)) +
  geom_point(alpha = 0.35, color = "#345995") +
  geom_smooth(method = "lm", se = TRUE, color = "#c44536") +
  labs(
    title = "Life Expectancy and GDP per Capita",
    x = "Log GDP per capita",
    y = "Life expectancy at birth (years)"
  )

ggplot(wdi_regression, aes(x = health_expenditure, y = life_expectancy)) +
  geom_point(alpha = 0.35, color = "#2f6f73") +
  geom_smooth(method = "lm", se = TRUE, color = "#c44536") +
  labs(
    title = "Life Expectancy and Health Expenditure",
    x = "Current health expenditure (% of GDP)",
    y = "Life expectancy at birth (years)"
  )

9 Interpretation

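The expected signs of the main coefficients are based on economic theory. GDP per capita is expected to have a positive relationship with life expectancy because higher income can improve nutrition, housing, sanitation, and access to medical services. Health expenditure is also expected to be positively related to life expectancy because it reflects investment in health systems. Unemployment and inflation are expected to have negative relationships because they may represent weaker labor market conditions and macroeconomic instability. Urban population share may be positive if urbanization improves access to hospitals, education, and infrastructure, although the effect can vary across countries.

The estimates match these expectations for four of the five regressors. Log GDP per capita (4.36), health expenditure (0.25), and urban population (0.03) have positive coefficients, unemployment (-0.08) has a negative coefficient, and all four have p-values that round to 0.0000. Inflation is the exception: its coefficient is positive but very small (0.0029) and not statistically distinguishable from zero (p = 0.55). Overall, the model accounts for about 72% of the variation in life expectancy (R-squared = 0.719).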

This regression is an associational model, not a causal model. The results show whether the selected economic indicators are correlated with life expectancy after controlling for the other variables in the model. A stronger causal analysis would require additional design choices, such as country fixed effects, year fixed effects, instrumental variables, or a more detailed panel-data strategy.
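
As a rough illustration of the first two design choices, country and year fixed effects can be added to the same specification as dummy variables in base R. This is a sketch under stated assumptions, not an analysis performed in this report: `fit_two_way_fe` is a hypothetical helper, and the toy panel is simulated only so that the example runs on its own.

```r
# Hypothetical helper: the report's specification plus country and year
# dummies (two-way fixed effects estimated by ordinary least squares).
fit_two_way_fe <- function(data) {
  lm(
    life_expectancy ~ log_gdp_per_capita + health_expenditure +
      unemployment + inflation + urban_population +
      factor(iso3c) + factor(year),
    data = data
  )
}

# Simulated toy panel (NOT WDI data): 4 made-up countries over 6 years.
set.seed(1)
toy_panel <- expand.grid(
  iso3c = c("AAA", "BBB", "CCC", "DDD"),
  year = 2010:2015,
  stringsAsFactors = FALSE
)
n <- nrow(toy_panel)
toy_panel$log_gdp_per_capita <- rnorm(n, mean = 9, sd = 1)
toy_panel$health_expenditure <- rnorm(n, mean = 6, sd = 2)
toy_panel$unemployment <- rnorm(n, mean = 7, sd = 3)
toy_panel$inflation <- rnorm(n, mean = 4, sd = 2)
toy_panel$urban_population <- rnorm(n, mean = 60, sd = 10)
toy_panel$life_expectancy <- 30 + 4 * toy_panel$log_gdp_per_capita + rnorm(n)

fe_model <- fit_two_way_fe(toy_panel)

# Show only the five slope coefficients, not the country and year dummies.
coef(fe_model)[2:6]
```

Applied to the real data, the call would be `fit_two_way_fe(wdi_regression)`. With fixed effects, the slopes are identified from variation within countries over time, net of shocks common to all countries in a given year, so they can differ noticeably from the pooled estimates above.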

10 Conclusion

This WDI regression dataset is appropriate for the assignment because it has a continuous target variable, more than 500 observations, and more than 5 variables. The analysis focuses on life expectancy as an economic development outcome and studies how income, health investment, labor market conditions, macroeconomic stability, and urbanization are associated with differences in life expectancy across countries.