# Packages used in this report
library(WDI)
library(tidyverse)
library(knitr)
library(scales)
theme_set(theme_minimal(base_size = 12))

Stage 1 WDI Regression Analysis: Life Expectancy
1 Overview
This report uses a real-world economic dataset from the World Bank World Development Indicators (WDI) database. The report focuses only on a regression problem. The outcome variable is life expectancy at birth, which is a continuous numeric variable measured in years.
The dataset is downloaded directly from the World Bank WDI API, so the report is reproducible as long as the required R packages are installed and an internet connection is available.
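Because the download depends on a live API, one common way to keep repeated renders stable is to cache the raw download on disk. The helper below is a minimal sketch of that idea and is not part of the original report: `cached()` and the temporary file are hypothetical, and `fake_download()` is a stand-in for the real API call so the example runs without network access.

```r
# Hypothetical helper (not part of the original report): run fetch_fn() once,
# store the result as an .rds file, and read it back on later calls.
cached <- function(path, fetch_fn) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- fetch_fn()
  saveRDS(result, path)
  result
}

# Stand-in for the real download so the sketch runs offline.
download_count <- 0
fake_download <- function() {
  download_count <<- download_count + 1
  data.frame(country = c("A", "B"), year = c(2010L, 2010L))
}

cache_file <- tempfile(fileext = ".rds")
first_call <- cached(cache_file, fake_download)   # runs the download
second_call <- cached(cache_file, fake_download)  # reads the cached file
```

In the report itself, the same wrapper could surround the `WDI()` call. The trade-off is that a cached file will not pick up later World Bank revisions until it is deleted.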
Data source: https://databank.worldbank.org/source/world-development-indicators
2 Dataset Description and Source
The dataset is a country-year panel for the years 2010-2023. Each observation represents one country in one year. The target variable is life_expectancy. The main explanatory variables are gdp_per_capita, health_expenditure, unemployment, inflation, and urban_population.
This dataset is economically relevant because life expectancy is not only a health outcome but also a development outcome. Countries with higher income, better health investment, more stable labor markets, lower macroeconomic instability, and better access to urban services may experience longer life expectancy. Therefore, life expectancy can be studied as an outcome related to economic development and social welfare.
The main regression model, with GDP per capita entering in logs as in Section 7, is:
life_expectancy = β0 + β1 log(gdp_per_capita) + β2 health_expenditure + β3 unemployment + β4 inflation + β5 urban_population + ε
| Variable | WDI Indicator | Economic Meaning |
|---|---|---|
| life_expectancy | SP.DYN.LE00.IN | Life expectancy at birth, total (years) |
| gdp_per_capita | NY.GDP.PCAP.KD | Economic development / income level |
| health_expenditure | SH.XPD.CHEX.GD.ZS | Health investment |
| unemployment | SL.UEM.TOTL.ZS | Labor market conditions |
| inflation | FP.CPI.TOTL.ZG | Macroeconomic stability |
| urban_population | SP.URB.TOTL.IN.ZS | Urbanization and access to services |
3 Research Question
To what extent do economic development, health investment, labor market conditions, macroeconomic stability, and urbanization explain differences in life expectancy across countries?
4 Data Import and Cleaning
# WDI indicators used in the regression dataset
wdi_indicators <- c(
  life_expectancy = "SP.DYN.LE00.IN",
  gdp_per_capita = "NY.GDP.PCAP.KD",
  health_expenditure = "SH.XPD.CHEX.GD.ZS",
  unemployment = "SL.UEM.TOTL.ZS",
  inflation = "FP.CPI.TOTL.ZG",
  urban_population = "SP.URB.TOTL.IN.ZS"
)
# Import data directly from the World Bank WDI API.
# extra = TRUE adds country metadata such as region and income group.
wdi_raw <- WDI(
  country = "all",
  indicator = wdi_indicators,
  start = 2010,
  end = 2023,
  extra = TRUE
)
# Clean the dataset:
# 1. Remove aggregate regions.
# 2. Keep only the variables needed for the regression analysis.
# 3. Convert year and indicator variables to the correct data types.
# 4. Remove observations with missing values.
# 5. Create log transformations for skewed economic variables.
wdi_regression <- wdi_raw |>
  filter(region != "Aggregates") |>
  select(
    country, iso3c, year, region, income,
    life_expectancy, gdp_per_capita, health_expenditure,
    unemployment, inflation, urban_population
  ) |>
  mutate(
    year = as.integer(year),
    across(
      c(
        life_expectancy, gdp_per_capita, health_expenditure,
        unemployment, inflation, urban_population
      ),
      as.numeric
    )
  ) |>
  drop_na() |>
  mutate(
    log_life_expectancy = log(life_expectancy),
    log_gdp_per_capita = log(gdp_per_capita)
  )
dataset_size <- tibble(
  observations = nrow(wdi_regression),
  variables = ncol(wdi_regression)
)
kable(dataset_size, caption = "Cleaned WDI Regression Dataset Size")

| observations | variables |
|---|---|
| 2350 | 13 |
The cleaned dataset satisfies the assignment requirement because it contains more than 500 observations and more than 5 variables including the target variable.
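Since `drop_na()` removes any country-year with a missing value in any selected column, it can be worth checking which variables cost the most rows before accepting the cleaned sample. The following is a small sketch of such an audit on made-up toy values (not WDI data); the object names are illustrative only.

```r
# Toy data with deliberate gaps (illustrative values, not WDI data).
toy <- data.frame(
  life_expectancy = c(70.1, NA, 65.3, 80.2, 75.0),
  gdp_per_capita = c(1200, 3400, NA, 45000, 9800),
  inflation = c(2.1, 5.5, 3.0, NA, NA)
)

# Number of missing values in each column.
missing_per_column <- colSums(is.na(toy))

# Rows that would survive listwise deletion (what drop_na() keeps).
complete_rows <- sum(complete.cases(toy))

missing_per_column
complete_rows
```

Applied to the uncleaned panel, the same two calls would show whether a single sparsely reported indicator is responsible for most of the dropped observations.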
5 Summary Statistics
target_summary <- wdi_regression |>
  summarise(
    mean = mean(life_expectancy),
    median = median(life_expectancy),
    standard_deviation = sd(life_expectancy),
    minimum = min(life_expectancy),
    q1 = quantile(life_expectancy, 0.25),
    q3 = quantile(life_expectancy, 0.75),
    maximum = max(life_expectancy)
  )
kable(
  target_summary,
  digits = 2,
  caption = "Summary Statistics for Life Expectancy"
)

| mean | median | standard_deviation | minimum | q1 | q3 | maximum |
|---|---|---|---|---|---|---|
| 71.62 | 72.78 | 8.07 | 18.82 | 65.87 | 77.72 | 84.56 |
regressor_summary <- wdi_regression |>
  summarise(
    across(
      c(
        gdp_per_capita, health_expenditure, unemployment,
        inflation, urban_population
      ),
      list(
        mean = mean,
        median = median,
        sd = sd
      ),
      .names = "{.col}_{.fn}"
    )
  ) |>
  pivot_longer(
    everything(),
    names_to = c("variable", ".value"),
    names_pattern = "(.+)_(mean|median|sd)"
  )
kable(
  regressor_summary,
  digits = 2,
  caption = "Summary Statistics for Explanatory Variables"
)

| variable | mean | median | sd |
|---|---|---|---|
| gdp_per_capita | 13492.86 | 5348.27 | 18811.81 |
| health_expenditure | 6.35 | 6.03 | 2.70 |
| unemployment | 7.76 | 5.77 | 5.94 |
| inflation | 6.25 | 3.40 | 18.35 |
| urban_population | 58.83 | 61.04 | 22.21 |
6 Probability Distribution Analysis
6.1 Original Target Variable
ggplot(wdi_regression, aes(x = life_expectancy)) +
  geom_histogram(bins = 30, fill = "#2f6f73", color = "white") +
  labs(
    title = "Distribution of Life Expectancy",
    x = "Life expectancy at birth (years)",
    y = "Number of country-year observations"
  )

The distribution of life expectancy is not perfectly normal. It is mildly left-skewed: the mean (71.62) lies below the median (72.78), because most country-year observations cluster at relatively high values while a smaller group has much lower life expectancy. Even so, the distribution is far more compact than that of typical monetary variables such as GDP per capita.
6.2 Log-Transformed Target Variable
ggplot(wdi_regression, aes(x = log_life_expectancy)) +
  geom_histogram(bins = 30, fill = "#b46a3c", color = "white") +
  labs(
    title = "Distribution of Log Life Expectancy",
    x = "Log of life expectancy",
    y = "Number of country-year observations"
  )

The log transformation slightly compresses higher values, but it does not improve the shape: logs help with right-skewed variables, whereas life expectancy is mildly left-skewed. A normal distribution is a reasonable first approximation for life expectancy, although the variable is naturally bounded and therefore cannot be exactly normal. The regression in Section 7 therefore uses life expectancy in levels.
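The skewness statements in this section can also be checked numerically rather than read off the histograms. The sketch below defines a simple moment-based skewness function in base R; `sample_skewness()` is a hypothetical helper, and the two vectors are toy values chosen to mimic a left-tailed and a right-tailed variable, not WDI data.

```r
# Moment-based sample skewness:
# negative = longer left tail, positive = longer right tail.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Toy vectors (illustrative values, not WDI data).
left_tailed <- c(52, 68, 70, 72, 73, 75, 76, 78, 80, 82)
right_tailed <- c(500, 900, 1500, 2500, 4000, 9000, 20000, 60000)

sample_skewness(left_tailed)        # negative
sample_skewness(right_tailed)       # positive
sample_skewness(log(right_tailed))  # closer to zero than before logging
```

Calling `sample_skewness(wdi_regression$life_expectancy)` and `sample_skewness(wdi_regression$gdp_per_capita)` would give a direct comparison of the two variables in the cleaned dataset.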
7 Regression Analysis
# GDP per capita is logged because income variables are usually right-skewed.
life_expectancy_model <- lm(
  life_expectancy ~ log_gdp_per_capita + health_expenditure +
    unemployment + inflation + urban_population,
  data = wdi_regression
)
model_summary <- summary(life_expectancy_model)
coefficient_table <- as.data.frame(coef(model_summary)) |>
  rownames_to_column("variable") |>
  rename(
    estimate = Estimate,
    standard_error = `Std. Error`,
    t_value = `t value`,
    p_value = `Pr(>|t|)`
  )
kable(
  coefficient_table,
  digits = 4,
  caption = "Linear Regression Results"
)

| variable | estimate | standard_error | t_value | p_value |
|---|---|---|---|---|
| (Intercept) | 31.3488 | 0.6310 | 49.6828 | 0.0000 |
| log_gdp_per_capita | 4.3600 | 0.0979 | 44.5546 | 0.0000 |
| health_expenditure | 0.2507 | 0.0355 | 7.0574 | 0.0000 |
| unemployment | -0.0767 | 0.0153 | -5.0250 | 0.0000 |
| inflation | 0.0029 | 0.0049 | 0.5999 | 0.5486 |
| urban_population | 0.0285 | 0.0060 | 4.7339 | 0.0000 |
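The estimates and standard errors in this table are enough to reconstruct approximate 95% confidence intervals. The sketch below does so from the reported (rounded) values, assuming the conventional OLS standard errors shown in the table and the 2350 observations of the cleaned dataset; `coefficients_reported` is an illustrative name, and the intervals inherit the rounding of the table.

```r
# Estimates and standard errors copied from the reported coefficient table.
coefficients_reported <- data.frame(
  variable = c(
    "log_gdp_per_capita", "health_expenditure", "unemployment",
    "inflation", "urban_population"
  ),
  estimate = c(4.3600, 0.2507, -0.0767, 0.0029, 0.0285),
  standard_error = c(0.0979, 0.0355, 0.0153, 0.0049, 0.0060)
)

# Residual degrees of freedom: 2350 observations minus 6 estimated coefficients.
residual_df <- 2350 - 6
critical_t <- qt(0.975, df = residual_df)

# Interval = estimate +/- critical value * standard error.
margin <- critical_t * coefficients_reported$standard_error
coefficients_reported$ci_lower <- coefficients_reported$estimate - margin
coefficients_reported$ci_upper <- coefficients_reported$estimate + margin
coefficients_reported$excludes_zero <-
  coefficients_reported$ci_lower > 0 | coefficients_reported$ci_upper < 0

coefficients_reported
```

With the fitted object available, `confint(life_expectancy_model)` returns the same kind of intervals without rounding. Because the data are a country-year panel, these intervals are likely too narrow if errors are correlated within countries.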
fit_statistics <- tibble(
  r_squared = model_summary$r.squared,
  adjusted_r_squared = model_summary$adj.r.squared,
  residual_standard_error = model_summary$sigma,
  observations = nobs(life_expectancy_model)
)
kable(
  fit_statistics,
  digits = 3,
  caption = "Model Fit Statistics"
)

| r_squared | adjusted_r_squared | residual_standard_error | observations |
|---|---|---|---|
| 0.719 | 0.718 | 4.284 | 2350 |
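The model uses logged GDP per capita because income is highly right-skewed (mean 13,492.86 against a median of 5,348.27 in the summary table). With a logged regressor and an outcome in levels, the coefficient on log_gdp_per_capita divided by 100 approximates the change in life expectancy, in years, associated with a 1% increase in GDP per capita, holding the other variables constant. With the estimate of 4.36, that is roughly 0.04 years per 1% increase in income.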
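For larger income differences the divide-by-100 shortcut becomes inaccurate, and the exact expression is the coefficient times the log of the income ratio. The sketch below applies that formula to the reported estimate; `effect_of_pct_change()` is a hypothetical helper introduced only for illustration.

```r
# Coefficient on log_gdp_per_capita from the regression table.
beta_log_gdp <- 4.36

# Change in life expectancy (years) associated with a given percentage
# change in GDP per capita in a level-log model: beta * log(1 + pct / 100).
effect_of_pct_change <- function(beta, pct) {
  beta * log(1 + pct / 100)
}

effect_one_percent <- effect_of_pct_change(beta_log_gdp, 1)   # about 0.043 years
effect_doubling <- effect_of_pct_change(beta_log_gdp, 100)    # about 3.02 years

round(c(one_percent = effect_one_percent, doubling = effect_doubling), 3)
```

Read this way, a country with twice the GDP per capita of another is associated with about three additional years of life expectancy, other regressors held equal. This is an association, not an estimate of what would happen if income doubled.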
8 Regression Visualizations
ggplot(wdi_regression, aes(x = log_gdp_per_capita, y = life_expectancy)) +
  geom_point(alpha = 0.35, color = "#345995") +
  geom_smooth(method = "lm", se = TRUE, color = "#c44536") +
  labs(
    title = "Life Expectancy and GDP per Capita",
    x = "Log GDP per capita",
    y = "Life expectancy at birth (years)"
  )

ggplot(wdi_regression, aes(x = health_expenditure, y = life_expectancy)) +
  geom_point(alpha = 0.35, color = "#2f6f73") +
  geom_smooth(method = "lm", se = TRUE, color = "#c44536") +
  labs(
    title = "Life Expectancy and Health Expenditure",
    x = "Current health expenditure (% of GDP)",
    y = "Life expectancy at birth (years)"
  )

9 Interpretation
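The expected signs of the main coefficients are based on economic theory. GDP per capita is expected to have a positive relationship with life expectancy because higher income can improve nutrition, housing, sanitation, and access to medical services. Health expenditure is also expected to be positively related to life expectancy because it reflects investment in health systems. Unemployment and inflation are expected to have negative relationships because they may represent weaker labor market conditions and macroeconomic instability. Urban population share may be positive if urbanization improves access to hospitals, education, and infrastructure, although the effect can vary across countries.
In the estimated model, four of the five signs match these expectations. The coefficients on log GDP per capita (4.36), health expenditure (0.25), and urban population share (0.03) are positive, the coefficient on unemployment (-0.08) is negative, and all four are statistically significant. The inflation coefficient is positive but very small (0.0029) and not statistically distinguishable from zero (p = 0.55), so the data give no clear evidence of an association between inflation and life expectancy once the other variables are held constant. Together the regressors account for about 72% of the variation in life expectancy (R-squared = 0.719).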
This regression is an associational model, not a causal model. The results show whether the selected economic indicators are correlated with life expectancy after controlling for the other variables in the model. A stronger causal analysis would require additional design choices, such as country fixed effects, year fixed effects, instrumental variables, or a more detailed panel-data strategy.
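To illustrate why country and year fixed effects matter, the sketch below simulates a small panel in which unobserved country characteristics are correlated with income, then compares pooled OLS with a two-way fixed-effects fit using base R `lm()` and factor dummies. All data are synthetic, every name and parameter value is an assumption chosen for illustration, and the results say nothing about the WDI estimates above.

```r
# Synthetic panel: 30 hypothetical countries observed from 2010 to 2023.
set.seed(42)
n_countries <- 30
panel <- expand.grid(country_id = seq_len(n_countries), year = 2010:2023)

# Unobserved country effect, deliberately correlated with income.
country_effect <- rnorm(n_countries, sd = 5)
panel$log_gdp <- 8 + 0.15 * country_effect[panel$country_id] +
  rnorm(nrow(panel), sd = 0.5)

# Outcome built with a known slope of 2 on log income plus a common time trend.
true_slope <- 2
panel$life_exp <- 50 + true_slope * panel$log_gdp +
  country_effect[panel$country_id] +
  0.1 * (panel$year - 2010) +
  rnorm(nrow(panel), sd = 1)

# Pooled OLS ignores the country effect and absorbs it into the slope.
pooled_fit <- lm(life_exp ~ log_gdp, data = panel)

# Two-way fixed effects: one dummy per country and one per year.
fixed_effects_fit <- lm(
  life_exp ~ log_gdp + factor(country_id) + factor(year),
  data = panel
)

pooled_slope <- coef(pooled_fit)[["log_gdp"]]
fixed_effects_slope <- coef(fixed_effects_fit)[["log_gdp"]]
c(pooled = pooled_slope, fixed_effects = fixed_effects_slope, truth = true_slope)
```

In this simulation the pooled slope is pushed well above the true value, while the fixed-effects slope stays close to it, because the country dummies absorb time-invariant differences between countries. Fixed effects do not address time-varying confounders or reverse causality, which is where the other strategies mentioned above come in.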
10 Conclusion
This WDI regression dataset is appropriate for the assignment because it has a continuous target variable, more than 500 observations, and more than 5 variables. The analysis focuses on life expectancy as an economic development outcome and studies how income, health investment, labor market conditions, macroeconomic stability, and urbanization are associated with differences in life expectancy across countries.