ECON 465 – Stage 1: Data Acquisition & Probability Analysis
Author
Sude Arslan & Selhan Çil
Published
May 3, 2026
1. Economic Question
To what extent can health expenditure, GDP per capita, urbanization, and fertility rates predict life expectancy across countries?
2. Dataset Description
This dataset was obtained from the World Bank World Development Indicators (WDI) database using the WDI package in R. It contains data for 200+ countries from 2000 to 2020, with 3,948 observations and 11 variables. The dataset includes key health, demographic, and economic indicators that may influence life expectancy across countries. The WDI API was used directly in R to ensure full reproducibility, with no manual downloads required.
Source: https://data.worldbank.org
3. Data Import & Cleaning
library(WDI)
library(tidyverse)

# Import data directly from the World Bank API
df <- WDI(
  country = "all",
  indicator = c(
    life_expectancy    = "SP.DYN.LE00.IN",
    health_expenditure = "SH.XPD.CHEX.GD.ZS",
    gdp_per_capita     = "NY.GDP.PCAP.CD",
    urbanization_rate  = "SP.URB.TOTL.IN.ZS",
    fertility_rate     = "SP.DYN.TFRT.IN"
  ),
  start = 2000,
  end = 2020,
  extra = TRUE
)

# Remove aggregates (e.g., "World", "Euro Area") to keep country-level data only
# Drop missing values and select relevant variables
# Note: drop_na() runs before select(), so a row is also dropped when any of
# the extra metadata columns returned by extra = TRUE is missing
df_clean <- df %>%
  filter(region != "Aggregates") %>%
  drop_na() %>%
  select(country, iso3c, year, region, income,
         life_expectancy, health_expenditure, gdp_per_capita,
         urbanization_rate, fertility_rate) %>%
  mutate(
    # Log transformation applied due to strong right skew in GDP per capita
    # This better captures proportional differences across countries
    gdp_per_capita_log = log(gdp_per_capita),
    year = as.integer(year)
  )

cat("Observations:", nrow(df_clean), "\n")
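Because drop_na() discards every row with a missing value in any column, it can be useful to see which columns drive the loss before cleaning. The following is a minimal base R sketch; the data frame `toy` and its values are made up for illustration and are not WDI data.

```r
# Hypothetical toy data standing in for the raw WDI download
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  life_expectancy = c(70.1, NA, 65.3, 80.2),
  health_expenditure = c(5.2, 6.1, NA, 9.8),
  capital = c("X", "Y", "Z", NA)  # stands in for an extra metadata column
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Rows that survive when every column must be complete
rows_all_columns <- sum(complete.cases(toy))

# Rows that survive when only the analysis variables must be complete
analysis_vars <- c("life_expectancy", "health_expenditure")
rows_analysis_only <- sum(complete.cases(toy[, analysis_vars]))

cat("Complete rows (all columns):", rows_all_columns, "\n")
cat("Complete rows (analysis variables only):", rows_analysis_only, "\n")
```

If the two counts differ on the real download, some rows are being lost to missing metadata rather than to missing indicators.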
Statistic  life_expectancy  health_expenditure  gdp_per_capita  urbanization_rate  fertility_rate
Min.       14.66            1.223               109.6           8.044              0.837
1st Qu.    64.23            4.099               1307.8          37.796             1.714
Median     71.44            5.565               4201.3          57.895             2.413
Mean       69.98            6.126               13011.1         56.816             2.929
3rd Qu.    76.56            7.860               14477.1         74.927             3.913
Max.       86.15            24.458              204263.8        100.000            7.829
5. Probability Distribution Analysis
5.1 Selected Variable: Life Expectancy
Life expectancy is a continuous variable representing the average number of years a person is expected to live at birth. It is expressed as a decimal figure at the country-year level.
      Mean Median  Std_Dev      Q1      Q3
1 69.97972 71.438 8.824294 64.2305 76.5631
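The code behind the table above is not shown. Statistics of this kind could be computed along the following lines; `describe_variable` is a hypothetical helper and the input vector is made up for illustration.

```r
# Hypothetical helper: mean, median, standard deviation and quartiles
describe_variable <- function(x) {
  x <- x[!is.na(x)]  # ignore missing values
  data.frame(
    Mean = mean(x),
    Median = median(x),
    Std_Dev = sd(x),
    Q1 = unname(quantile(x, 0.25)),
    Q3 = unname(quantile(x, 0.75))
  )
}

example_values <- c(1, 2, 3, 4, 5)  # illustrative values only
stats <- describe_variable(example_values)
print(stats)
```

On the cleaned data the call would be describe_variable(df_clean$life_expectancy).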
5.2 Histogram - Life Expectancy (Original)
ggplot(df_clean, aes(x = life_expectancy)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  labs(
    title = "Distribution of Life Expectancy",
    x = "Life Expectancy (years)",
    y = "Frequency"
  ) +
  theme_minimal()
The distribution is left-skewed: the mean (69.98) lies below the median (71.44), most country-year observations cluster at higher life expectancy levels, and a smaller number of low values, largely from low-income countries, form the left tail.
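The direction of the skew can also be checked numerically from the summary statistics reported in section 5.1. The sketch below uses two standard summary-based coefficients; the variable names are introduced here for illustration.

```r
# Summary statistics for life expectancy as reported in section 5.1
le_mean   <- 69.97972
le_median <- 71.438
le_sd     <- 8.824294
le_q1     <- 64.2305
le_q3     <- 76.5631

# Pearson's second skewness coefficient: 3 * (mean - median) / sd
pearson_skew <- 3 * (le_mean - le_median) / le_sd

# Bowley (quartile) skewness: (Q3 + Q1 - 2 * median) / (Q3 - Q1)
bowley_skew <- (le_q3 + le_q1 - 2 * le_median) / (le_q3 - le_q1)

cat("Pearson skewness:", round(pearson_skew, 3), "\n")
cat("Bowley skewness: ", round(bowley_skew, 3), "\n")
```

Both coefficients come out negative (roughly -0.5 and -0.17), which agrees with a left-skewed shape.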
5.3 Histogram - Life Expectancy (Log Transformed)
df_clean <- df_clean %>%
  mutate(life_expectancy_log = log(life_expectancy))

ggplot(df_clean, aes(x = life_expectancy_log)) +
  geom_histogram(bins = 30, fill = "seagreen", color = "white") +
  labs(
    title = "Distribution of Life Expectancy (Log Transformed)",
    x = "log(Life Expectancy)",
    y = "Frequency"
  ) +
  theme_minimal()
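The log transformation changes the shape only modestly. Because the logarithm compresses large values more than small ones, it corrects right skew rather than left skew, so the left tail remains after the transformation. The log-transformed variable is kept as an alternative outcome for the regression analysis in the next stage, where the relevant normality assumption concerns the residuals rather than life expectancy itself.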
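A log transformation compresses large values more than small ones, so its effect on a left-skewed variable is worth checking directly by comparing skewness before and after. The sketch below does this on a small made-up left-skewed vector (not WDI data); `moment_skewness` is a hypothetical helper.

```r
# Hypothetical helper: moment-based sample skewness
moment_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Illustrative left-skewed values: dense near 85, thinning out toward 35
k <- 1:50
x <- 85 - k^2 / 50

skew_raw <- moment_skewness(x)
skew_log <- moment_skewness(log(x))

cat("Skewness of x:     ", round(skew_raw, 3), "\n")
cat("Skewness of log(x):", round(skew_log, 3), "\n")
```

The same comparison on the real data would use df_clean$life_expectancy in place of x.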
5.4 Histogram - GDP per Capita (Original vs Log)
GDP per capita is strongly right-skewed, a shape that is often well approximated by a log-normal distribution. A log transformation was applied to better capture proportional differences across countries.
ggplot(df_clean, aes(x = gdp_per_capita)) +
  geom_histogram(bins = 30, fill = "tomato", color = "white") +
  labs(
    title = "Distribution of GDP per Capita (Original)",
    x = "GDP per Capita (USD)",
    y = "Frequency"
  ) +
  theme_minimal()

ggplot(df_clean, aes(x = gdp_per_capita_log)) +
  geom_histogram(bins = 30, fill = "darkorange", color = "white") +
  labs(
    title = "Distribution of GDP per Capita (Log Transformed)",
    x = "log(GDP per Capita)",
    y = "Frequency"
  ) +
  theme_minimal()
The original distribution is heavily right-skewed with extreme outliers (wealthy nations). After log transformation, the distribution becomes approximately normal, consistent with a log-normal distribution.
5.5 Proposed Theoretical Distribution
Life expectancy: Approximately normal distribution, with a slight left skew due to low-income country outliers.
GDP per capita: Consistent with a log-normal distribution. Strongly right-skewed in original form, approximately normal after log transformation.
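One way to back up a proposed log-normal distribution beyond the histogram is to compare how closely the raw and log-transformed values follow normal quantiles. The sketch below uses simulated data that is log-normal by construction (not WDI data); `ppcc` is a hypothetical helper, and the simulation parameters are arbitrary assumptions.

```r
set.seed(465)

# Simulated stand-in for GDP per capita (log-normal by construction)
gdp_sim <- rlnorm(2000, meanlog = 8.5, sdlog = 1.5)

# Probability-plot correlation: correlation between the sample values and
# their matching normal quantiles; values near 1 indicate approximate normality
ppcc <- function(x) {
  qq <- qqnorm(x, plot.it = FALSE)
  cor(qq$x, qq$y)
}

ppcc_raw <- ppcc(gdp_sim)
ppcc_log <- ppcc(log(gdp_sim))

cat("PPCC, original scale:", round(ppcc_raw, 3), "\n")
cat("PPCC, log scale:     ", round(ppcc_log, 3), "\n")
```

Applied to df_clean$gdp_per_capita, a value close to 1 on the log scale and a clearly lower value on the original scale would support the log-normal description.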
6. Exploratory Visualizations
6.1 Life Expectancy by Region
ggplot(df_clean, aes(x = reorder(region, life_expectancy),
                     y = life_expectancy, fill = region)) +
  geom_boxplot() +
  coord_flip() +
  labs(
    title = "Life Expectancy by Region",
    x = "",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
Regional differences in life expectancy are substantial, suggesting that region should be considered as a control variable in the predictive model.
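How much of the variation a grouping variable such as region accounts for can be summarized with the R-squared of a one-way model. The following is a minimal sketch on made-up values for three hypothetical regions, not WDI data.

```r
# Illustrative values only: three hypothetical regions, three observations each
toy <- data.frame(
  region = rep(c("Region A", "Region B", "Region C"), each = 3),
  life_expectancy = c(60, 62, 64, 70, 72, 74, 80, 82, 84)
)

# Median life expectancy per region
region_medians <- aggregate(life_expectancy ~ region, data = toy, FUN = median)
print(region_medians)

# Share of variance explained by region alone (R-squared of a one-way model)
fit <- lm(life_expectancy ~ region, data = toy)
r_squared <- summary(fit)$r.squared
cat("Share of variance explained by region:", round(r_squared, 3), "\n")
```

On the cleaned data the same model would be lm(life_expectancy ~ region, data = df_clean).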
6.2 Life Expectancy by Income Group
ggplot(df_clean, aes(x = reorder(income, life_expectancy),
                     y = life_expectancy, fill = income)) +
  geom_boxplot() +
  labs(
    title = "Life Expectancy by Income Group",
    x = "Income Group",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
Higher income groups show clearly higher life expectancy, consistent with human capital theory.
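The plot orders the groups with reorder(income, life_expectancy), that is, by their mean life expectancy. Since income groups have a natural order of their own, an alternative is to fix that order explicitly. The sketch below assumes the income column uses the four World Bank labels shown; the data values are made up for illustration.

```r
# Income labels, assumed to match the World Bank classification in the data
income_levels <- c("Low income", "Lower middle income",
                   "Upper middle income", "High income")

# Illustrative values only (not WDI data)
toy <- data.frame(
  income = c("High income", "Low income", "Upper middle income",
             "Lower middle income", "Low income"),
  life_expectancy = c(81, 58, 74, 67, 61)
)

# Fix the category order explicitly instead of deriving it from the data
toy$income <- factor(toy$income, levels = income_levels, ordered = TRUE)

# Mean life expectancy per group, listed from low to high income
group_means <- tapply(toy$life_expectancy, toy$income, mean)
print(group_means)
```

With the factor defined this way, aes(x = income) would plot the groups from low to high income regardless of their means.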
6.3 Life Expectancy vs GDP per Capita
ggplot(df_clean, aes(x = gdp_per_capita_log, y = life_expectancy)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_smooth(method = "lm", color = "red") +
  labs(
    title = "Life Expectancy vs GDP per Capita (Log)",
    x = "log(GDP per Capita)",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal()
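There is a positive relationship between log GDP per capita and life expectancy. Because the horizontal axis is logarithmic, a straight fitted line means that each proportional increase in income is associated with the same gain in years, so the gain per additional dollar shrinks at higher income levels. This is the pattern of diminishing returns.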
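Writing out a linear fit against log GDP per capita shows what it implies on the original dollar scale. The symbols below (LE for life expectancy, GDP for GDP per capita, and the coefficients) are notation introduced here for illustration.

```latex
\begin{aligned}
\mathrm{LE} &= \beta_0 + \beta_1 \ln(\mathrm{GDP}) + \varepsilon \\
\frac{\partial\, \mathrm{LE}}{\partial\, \mathrm{GDP}} &= \frac{\beta_1}{\mathrm{GDP}} \\
\Delta \mathrm{LE}\ \text{from doubling GDP} &= \beta_1 \ln 2 \approx 0.693\, \beta_1
\end{aligned}
```

With a positive slope coefficient, the marginal gain per dollar falls as GDP per capita rises, while each doubling of GDP per capita is associated with the same number of additional years.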
6.4 Life Expectancy vs Health Expenditure
ggplot(df_clean, aes(x = health_expenditure, y = life_expectancy)) +
  geom_point(alpha = 0.3, color = "tomato") +
  geom_smooth(method = "lm", color = "darkred") +
  labs(
    title = "Life Expectancy vs Health Expenditure",
    x = "Health Expenditure (% of GDP)",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal()
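Higher health spending is associated with higher life expectancy, though the scatter around the fitted line indicates that spending as a share of GDP explains only part of the variation, which may reflect differences in efficiency across countries.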
6.5 Life Expectancy vs Fertility Rate
ggplot(df_clean, aes(x = fertility_rate, y = life_expectancy)) +
  geom_point(alpha = 0.3, color = "purple") +
  geom_smooth(method = "lm", color = "darkviolet") +
  labs(
    title = "Life Expectancy vs Fertility Rate",
    x = "Fertility Rate",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal()
There is a strong negative relationship between fertility rate and life expectancy, consistent with demographic transition theory.
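The strength of the relationships seen in the scatter plots can be put into numbers with correlation coefficients. The sketch below computes Pearson and Spearman correlations with life expectancy on a small made-up data frame (not WDI data); the object names are introduced here for illustration.

```r
# Illustrative values only (not WDI data)
toy <- data.frame(
  life_expectancy    = c(55, 62, 68, 74, 80),
  fertility_rate     = c(6.1, 4.8, 3.2, 2.4, 1.6),
  health_expenditure = c(3.5, 6.0, 4.2, 7.5, 9.1)
)

predictors <- c("fertility_rate", "health_expenditure")

# Pearson measures linear association
pearson <- sapply(predictors, function(v) {
  cor(toy[[v]], toy$life_expectancy)
})

# Spearman measures monotone association and is less sensitive to curvature
spearman <- sapply(predictors, function(v) {
  cor(toy[[v]], toy$life_expectancy, method = "spearman")
})

print(round(rbind(pearson, spearman), 3))
```

On the cleaned data the same calls would use df_clean in place of toy, with the remaining predictors added to the vector of names.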