ECON 465 – Stage 1: Data Acquisition & Probability Analysis

Author

Sude Arslan & Selhan Çil

Published

May 3, 2026

1. Economic Question

To what extent can health expenditure, GDP per capita, urbanization, and fertility rates predict life expectancy across countries?

2. Dataset Description

This dataset was obtained from the World Bank World Development Indicators (WDI) database using the WDI package in R. The download covers 200+ countries and territories from 2000 to 2020; after cleaning, the dataset contains 3,948 country-year observations and 11 variables. It includes key health, demographic, and economic indicators that may influence life expectancy across countries. The WDI API was queried directly from R so that the analysis can be reproduced without manual downloads.

Source: https://data.worldbank.org

3. Data Import & Cleaning

library(WDI)
library(tidyverse)

# Import data directly from World Bank API
df <- WDI(
  country = "all",
  indicator = c(
    life_expectancy    = "SP.DYN.LE00.IN",
    health_expenditure = "SH.XPD.CHEX.GD.ZS",
    gdp_per_capita     = "NY.GDP.PCAP.CD",
    urbanization_rate  = "SP.URB.TOTL.IN.ZS",
    fertility_rate     = "SP.DYN.TFRT.IN"
  ),
  start = 2000,
  end = 2020,
  extra = TRUE
)

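# Remove aggregates (e.g., "World", "Euro Area") to keep country-level data only
# Drop rows with missing values, then select relevant variables
# Note: drop_na() runs before select(), so an NA in any downloaded column drops the row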
df_clean <- df %>%
  filter(region != "Aggregates") %>%
  drop_na() %>%
  select(country, iso3c, year, region, income,
         life_expectancy, health_expenditure,
         gdp_per_capita, urbanization_rate, fertility_rate) %>%
  mutate(
    # Log transformation applied due to strong right skew in GDP per capita
    # This better captures proportional differences across countries
    gdp_per_capita_log = log(gdp_per_capita),
    year = as.integer(year)
  )

cat("Observations:", nrow(df_clean), "\n")
Observations: 3948 
cat("Variables:", ncol(df_clean), "\n")
Variables: 11 
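
As a hedged illustration of the cleaning step, the sketch below uses a small invented data frame (not WDI data) to show how the number of retained rows depends on which columns are checked for missing values. Because drop_na() is called before select() above, a missing value in any column returned with extra = TRUE, including columns not used later, removes the row. The column names and values are assumptions for demonstration only, and base R is used to keep the example self-contained.

```r
# Toy stand-in for the WDI download; values are invented for illustration only
toy <- data.frame(
  country         = c("A", "B", "C", "D"),
  life_expectancy = c(70.1, NA, 65.3, 80.2),
  gdp_per_capita  = c(1200, 3400, 560, 42000),
  capital         = c("X", "Y", NA, "Z"),  # extra column not used in the analysis
  stringsAsFactors = FALSE
)

# Count missing values per column
na_counts <- colSums(is.na(toy))

# Rows kept when NAs in ANY column are dropped (the order used above)
rows_all_cols <- sum(complete.cases(toy))

# Rows kept when only the analysis variables are checked
analysis_vars <- c("country", "life_expectancy", "gdp_per_capita")
rows_analysis_cols <- sum(complete.cases(toy[, analysis_vars]))

c(all_columns = rows_all_cols, analysis_columns = rows_analysis_cols)
```

Running the same comparison on df would show whether the observation count is sensitive to the order of these two steps.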

4. Summary Statistics

summary(df_clean %>% select(life_expectancy, health_expenditure,
                             gdp_per_capita, urbanization_rate,
                             fertility_rate))
 life_expectancy health_expenditure gdp_per_capita     urbanization_rate
 Min.   :14.66   Min.   : 1.223     Min.   :   109.6   Min.   :  8.044  
 1st Qu.:64.23   1st Qu.: 4.099     1st Qu.:  1307.8   1st Qu.: 37.796  
 Median :71.44   Median : 5.565     Median :  4201.3   Median : 57.895  
 Mean   :69.98   Mean   : 6.126     Mean   : 13011.1   Mean   : 56.816  
 3rd Qu.:76.56   3rd Qu.: 7.860     3rd Qu.: 14477.1   3rd Qu.: 74.927  
 Max.   :86.15   Max.   :24.458     Max.   :204263.8   Max.   :100.000  
 fertility_rate 
 Min.   :0.837  
 1st Qu.:1.714  
 Median :2.413  
 Mean   :2.929  
 3rd Qu.:3.913  
 Max.   :7.829  

5. Probability Distribution Analysis

5.1 Selected Variable: Life Expectancy

Life expectancy at birth is a continuous variable: the average number of years a newborn would be expected to live if the mortality patterns prevailing at the time of birth stayed the same throughout its life. It is recorded as a decimal figure for each country-year.

df_clean %>%
  summarise(
    Mean    = mean(life_expectancy),
    Median  = median(life_expectancy),
    Std_Dev = sd(life_expectancy),
    Q1      = quantile(life_expectancy, 0.25),
    Q3      = quantile(life_expectancy, 0.75)
  )
      Mean Median  Std_Dev      Q1      Q3
1 69.97972 71.438 8.824294 64.2305 76.5631

5.2 Histogram - Life Expectancy (Original)

ggplot(df_clean, aes(x = life_expectancy)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  labs(
    title = "Distribution of Life Expectancy",
    x = "Life Expectancy (years)",
    y = "Frequency"
  ) +
  theme_minimal()

The distribution is slightly left-skewed: the mean (69.98) lies below the median (71.44). Most country-years cluster at higher life expectancy levels, while a smaller number of observations, mostly from low-income countries, create a long left tail.
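
The direction of the skew can also be checked from the summary statistics alone. The sketch below computes two simple skewness coefficients (Bowley's quartile measure and Pearson's second coefficient) from the values printed in Section 5.1. These are rough descriptive measures, not formal tests, and the variable names are introduced here for illustration.

```r
# Summary statistics copied from the output in Section 5.1
le_mean   <- 69.97972
le_median <- 71.438
le_sd     <- 8.824294
le_q1     <- 64.2305
le_q3     <- 76.5631

# Bowley (quartile) skewness: compares the two halves of the interquartile range
bowley <- (le_q3 + le_q1 - 2 * le_median) / (le_q3 - le_q1)

# Pearson's second skewness coefficient: scaled gap between mean and median
pearson2 <- 3 * (le_mean - le_median) / le_sd

round(c(bowley = bowley, pearson2 = pearson2), 2)
```

Both coefficients are negative (roughly -0.17 and -0.50), which agrees with a mild to moderate left skew.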

5.3 Histogram - Life Expectancy (Log Transformed)

df_clean <- df_clean %>%
  mutate(life_expectancy_log = log(life_expectancy))

ggplot(df_clean, aes(x = life_expectancy_log)) +
  geom_histogram(bins = 30, fill = "seagreen", color = "white") +
  labs(
    title = "Distribution of Life Expectancy (Log Transformed)",
    x = "log(Life Expectancy)",
    y = "Frequency"
  ) +
  theme_minimal()

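A log transformation compresses large values more than small ones, so it reduces right skew but not left skew. Because life expectancy is left-skewed, the log-transformed distribution should be expected to remain left-skewed, with the lower tail stretched rather than shortened; the transformed histogram is shown for comparison only. Note also that linear regression assumes approximately normal residuals rather than a normally distributed outcome, so the shape of this distribution matters for the next stage only indirectly.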
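
One way to check the effect of the log transformation is to compare a moment-based skewness coefficient before and after transforming. The sketch below does this on an invented left-skewed vector, not on the WDI data. Since the logarithm is concave, it tends to push the coefficient of left-skewed data further below zero rather than towards it. The helper skewness() is defined here and is not part of base R.

```r
# Moment-based sample skewness (third standardized central moment)
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Invented, left-skewed values loosely resembling life expectancies (illustration only)
x <- c(45, 52, 58, 63, 67, 70, 72, 74, 76, 77, 78, 79, 80, 81, 82)

skew_raw <- skewness(x)
skew_log <- skewness(log(x))

c(raw = skew_raw, log = skew_log)
```

Applying skewness() to df_clean$life_expectancy and df_clean$life_expectancy_log would give the corresponding comparison for the actual data.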

5.4 Histogram - GDP per Capita (Original vs Log)

GDP per capita is strongly right-skewed (mean 13,011 versus median 4,201), as is typical of variables that are approximately log-normally distributed. A log transformation was applied to better capture proportional differences across countries.

ggplot(df_clean, aes(x = gdp_per_capita)) +
  geom_histogram(bins = 30, fill = "tomato", color = "white") +
  labs(
    title = "Distribution of GDP per Capita (Original)",
    x = "GDP per Capita (current USD)",
    y = "Frequency"
  ) +
  theme_minimal()

ggplot(df_clean, aes(x = gdp_per_capita_log)) +
  geom_histogram(bins = 30, fill = "darkorange", color = "white") +
  labs(
    title = "Distribution of GDP per Capita (Log Transformed)",
    x = "log(GDP per Capita)",
    y = "Frequency"
  ) +
  theme_minimal()

The original distribution is heavily right-skewed, with a long upper tail of very wealthy economies (the maximum exceeds 200,000 USD). After log transformation, the distribution becomes approximately normal, consistent with a log-normal distribution.

5.5 Proposed Theoretical Distribution

  • Life expectancy: Approximately normal distribution, with a slight left skew due to low-income country outliers.
  • GDP per capita: Consistent with a log-normal distribution. Strongly right-skewed in original form, approximately normal after log transformation.
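
If GDP per capita were exactly log-normal with log-scale parameters mu and sigma, its median would be exp(mu) and its mean exp(mu + sigma^2 / 2), so the mean-to-median ratio pins down sigma. The sketch below applies this to the mean and median reported in Section 4 and then simulates from the implied distribution. It is a rough consistency check under the log-normal assumption, not an estimate from the data; mu_implied and sigma_implied are names introduced here.

```r
# Mean and median of GDP per capita, copied from the summary output in Section 4
gdp_mean   <- 13011.1
gdp_median <- 4201.3

# Under a log-normal model: median = exp(mu), mean = exp(mu + sigma^2 / 2)
mu_implied    <- log(gdp_median)
sigma_implied <- sqrt(2 * log(gdp_mean / gdp_median))

# Simulate from the implied distribution and compare the mean/median ratio
set.seed(465)
sim <- rlnorm(1e5, meanlog = mu_implied, sdlog = sigma_implied)
ratio_sim <- mean(sim) / median(sim)

round(c(mu = mu_implied, sigma = sigma_implied), 2)
```

Comparing sigma_implied with sd(df_clean$gdp_per_capita_log) would indicate how closely the data follow this model.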

6. Exploratory Visualizations

6.1 Life Expectancy by Region

ggplot(df_clean, aes(x = reorder(region, life_expectancy),
                     y = life_expectancy, fill = region)) +
  geom_boxplot() +
  coord_flip() +
  labs(
    title = "Life Expectancy by Region",
    x = "", y = "Life Expectancy (years)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Regional differences in life expectancy are substantial, suggesting that region should be considered as a control variable in the predictive model.
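
Including region as a control usually means entering it as a set of dummy variables. The sketch below shows, on invented data with placeholder region labels, how R expands a factor into dummy columns and how the resulting coefficients read as differences from a baseline region. It is an illustration of the encoding only, not the model planned for the next stage.

```r
# Invented toy data; region labels are placeholders, not the WDI region names
toy <- data.frame(
  life_expectancy = c(62, 64, 75, 77, 81, 83),
  region          = factor(c("R1", "R1", "R2", "R2", "R3", "R3"))
)

# model.matrix() expands the factor into dummy columns,
# dropping the first level (R1) as the baseline
X <- model.matrix(~ region, data = toy)

# With only region dummies, the intercept is the baseline mean and
# each coefficient is the difference in means from the baseline
fit <- lm(life_expectancy ~ region, data = toy)
coef(fit)
```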

6.2 Life Expectancy by Income Group

ggplot(df_clean, aes(x = reorder(income, life_expectancy),
                     y = life_expectancy, fill = income)) +
  geom_boxplot() +
  labs(
    title = "Life Expectancy by Income Group",
    x = "Income Group", y = "Life Expectancy (years)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Higher income groups show clearly higher life expectancy, consistent with human capital theory.
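
The reorder() call above sorts the income groups by their mean life expectancy, which matches the income ranking only if those means are monotone in income. A sketch of fixing the order explicitly is shown below. The level labels are assumed to match those returned by WDI with extra = TRUE and should be checked against unique(df_clean$income); the example values are invented.

```r
# Assumed income-group labels, from lowest to highest; verify against the actual data
income_levels <- c("Low income", "Lower middle income",
                   "Upper middle income", "High income")

# Invented example values standing in for df_clean$income
income_raw <- c("High income", "Low income", "Upper middle income",
                "Lower middle income", "High income")

# An ordered factor keeps the groups in economic order on plot axes and in tables
income_ord <- factor(income_raw, levels = income_levels, ordered = TRUE)

table(income_ord)
```

Any label not listed in income_levels would become NA, so a check with anyNA() is advisable before plotting.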

6.3 Life Expectancy vs GDP per Capita

ggplot(df_clean, aes(x = gdp_per_capita_log, y = life_expectancy)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_smooth(method = "lm", color = "red") +
  labs(
    title = "Life Expectancy vs GDP per Capita (Log)",
    x = "log(GDP per Capita)",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

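There is a positive relationship between income and life expectancy. Because GDP per capita enters in logs, the fitted straight line implies diminishing returns in levels: the same absolute increase in income is associated with a smaller gain in life expectancy at higher income levels.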
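
With life expectancy in levels and GDP per capita in logs, a straight fitted line corresponds to the level-log model LE = a + b log(GDP), in which doubling GDP per capita changes predicted life expectancy by b log(2) years regardless of the starting income. The sketch below illustrates this on invented, noise-free data with a = 30 and b = 5; these are assumed values, not estimates from the WDI data.

```r
# Invented, noise-free data generated from LE = 30 + 5 * log(GDP)
d <- data.frame(gdp = c(500, 1000, 2000, 4000, 8000, 16000, 32000))
d$le <- 30 + 5 * log(d$gdp)

fit <- lm(le ~ log(gdp), data = d)
b   <- unname(coef(fit)[2])

# Effect of doubling GDP per capita: the same number of years at every income level
gain_doubling <- b * log(2)

# Effect of an extra 1,000 USD: large at low incomes, small at high incomes
pred <- function(g) unname(predict(fit, newdata = data.frame(gdp = g)))
gain_poor <- pred(1500)  - pred(500)
gain_rich <- pred(33000) - pred(32000)

round(c(doubling = gain_doubling, poor = gain_poor, rich = gain_rich), 2)
```

The same extra 1,000 USD buys several years of predicted life expectancy at an income of 500 USD but only a fraction of a year at 32,000 USD, which is the diminishing-returns pattern described above.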

6.4 Life Expectancy vs Health Expenditure

ggplot(df_clean, aes(x = health_expenditure, y = life_expectancy)) +
  geom_point(alpha = 0.3, color = "tomato") +
  geom_smooth(method = "lm", color = "darkred") +
  labs(
    title = "Life Expectancy vs Health Expenditure",
    x = "Health Expenditure (% of GDP)",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

Higher health spending as a share of GDP is associated with higher life expectancy, though the wide scatter around the fitted line suggests that countries differ in how efficiently spending translates into health outcomes.

6.5 Life Expectancy vs Fertility Rate

ggplot(df_clean, aes(x = fertility_rate, y = life_expectancy)) +
  geom_point(alpha = 0.3, color = "purple") +
  geom_smooth(method = "lm", color = "darkviolet") +
  labs(
    title = "Life Expectancy vs Fertility Rate",
    x = "Fertility Rate (births per woman)",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

There is a strong negative relationship between fertility rate and life expectancy, consistent with demographic transition theory.
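
The pairwise relationships plotted in this section could also be summarised numerically with a correlation matrix. The sketch below uses invented values in a toy data frame whose column names mirror df_clean; the numbers carry no empirical meaning.

```r
# Invented toy values; only the column names mirror df_clean
toy <- data.frame(
  life_expectancy    = c(55, 60, 66, 71, 75, 79, 82),
  gdp_per_capita_log = c(6.2, 6.9, 7.8, 8.5, 9.3, 10.1, 10.8),
  health_expenditure = c(3.1, 4.5, 4.0, 6.2, 5.8, 8.9, 10.4),
  fertility_rate     = c(6.1, 5.2, 4.0, 2.9, 2.3, 1.8, 1.6)
)

# Pearson correlations between every pair of variables
cor_mat <- cor(toy)

# Correlation of each predictor with life expectancy
round(cor_mat["life_expectancy", -1], 2)
```

Applying cor() to the same columns of df_clean would give the corresponding matrix for the actual data. Since each country appears in many years, such pooled correlations ignore the panel structure and are best read as descriptive.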