ECON 465 – Stage 1: Data Acquisition & Probability Analysis

Author

Selhan Çil & Sude Arslan

Published

May 10, 2026

1. Economic Question

Regression: To what extent can health expenditure, GDP per capita, urbanization, and fertility rates predict life expectancy across countries?

Classification: Can a country be classified as having high child mortality based on economic and health indicators such as GDP per capita, health expenditure, and access to sanitation?

2. Dataset Description

2.1 Regression Dataset (Continuous Outcome)

This dataset was obtained from the World Bank World Development Indicators (WDI) database using the WDI package in R. The raw download covers 200+ countries and aggregates from 2000 to 2020; after removing aggregates and rows with missing values, 3,948 country-year observations and 11 variables remain. The dataset includes key health, demographic, and economic indicators that may influence life expectancy across countries. The data are pulled through the WDI API directly in R, so the analysis can be reproduced without manual downloads.

Outcome variable: Life expectancy at birth (continuous, in years)

Source: https://data.worldbank.org

2.2 Classification Dataset (Binary Outcome)

This dataset is also sourced from the World Bank WDI database using the same WDI package in R, covering 200+ countries from 2000 to 2020. The binary outcome variable is constructed from the under-5 child mortality rate: country-year observations above the pooled sample median are coded as 1 (high mortality), and those at or below the median are coded as 0 (low mortality).

Outcome variable: high_mortality = 1 if under-5 mortality rate is above the sample median, 0 otherwise.
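The coding rule can be illustrated on a small made-up vector. The values below are hypothetical, not WDI data. Because the comparison is strict, observations equal to the median fall into the low group, so the two classes need not be exactly the same size.

```r
# Hypothetical under-5 mortality rates (per 1,000); not taken from WDI
child_mortality <- c(4, 8, 21, 21, 21, 55, 90)

# Pooled sample median used as the cutoff
cutoff <- median(child_mortality)

# Strictly above the median -> 1 (high); at or below -> 0 (low)
high_mortality <- as.integer(child_mortality > cutoff)

table(high_mortality)
```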

Predictors: GDP per capita (log), health expenditure (% of GDP), access to basic sanitation (% of population), urbanization rate.

Source: https://data.worldbank.org


3. Data Import & Cleaning

3.1 Regression Dataset

library(WDI)
library(tidyverse)

# Import regression data directly from World Bank API
df <- WDI(
  country = "all",
  indicator = c(
    life_expectancy    = "SP.DYN.LE00.IN",
    health_expenditure = "SH.XPD.CHEX.GD.ZS",
    gdp_per_capita     = "NY.GDP.PCAP.CD",
    urbanization_rate  = "SP.URB.TOTL.IN.ZS",
    fertility_rate     = "SP.DYN.TFRT.IN"
  ),
  start = 2000,
  end = 2020,
  extra = TRUE
)

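# Drop World Bank aggregates and incomplete rows, then keep the analysis columns.
# Note: drop_na() checks every column, including the extra = TRUE metadata.
df_clean <- df %>%
  filter(region != "Aggregates") %>%
  drop_na() %>%
  select(country, iso3c, year, region, income,
         life_expectancy, health_expenditure,
         gdp_per_capita, urbanization_rate, fertility_rate) %>%
  mutate(
    gdp_per_capita_log = log(gdp_per_capita),  # natural log to reduce right skew
    year = as.integer(year)
  )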

cat("Observations:", nrow(df_clean), "\n")
Observations: 3948 
cat("Variables:", ncol(df_clean), "\n")
Variables: 11 
cat("Rows dropped:", nrow(df) - nrow(df_clean), "\n")
Rows dropped: 1638 
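
One detail that affects the row count is the order of drop_na() and select(). With extra = TRUE, WDI() attaches metadata columns alongside the indicators, so a row can be dropped because a metadata field is missing even when every model variable is present. The toy tibble below is hypothetical and only sketches the mechanism; `capital` stands in for any metadata column the analysis does not use.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: `capital` is a metadata column the model never uses
toy <- tibble(
  country         = c("A", "B", "C"),
  life_expectancy = c(70.1, 65.3, NA),
  capital         = c("X", NA, "Z")
)

# Order used in the report: drop rows with ANY missing value, then select
n_drop_first <- toy %>%
  drop_na() %>%
  select(country, life_expectancy) %>%
  nrow()

# Alternative order: select the analysis columns, then drop missing values
n_select_first <- toy %>%
  select(country, life_expectancy) %>%
  drop_na() %>%
  nrow()

c(drop_first = n_drop_first, select_first = n_select_first)
```

Whether this changes the counts in the real data depends on how complete the metadata columns are, which can be checked by comparing the two orders on df.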

3.2 Classification Dataset

# Import classification data from World Bank API
df_class_raw <- WDI(
  country = "all",
  indicator = c(
    child_mortality    = "SH.DYN.MORT",       # Under-5 mortality rate (per 1,000)
    gdp_per_capita     = "NY.GDP.PCAP.CD",
    health_expenditure = "SH.XPD.CHEX.GD.ZS",
    sanitation         = "SH.STA.BASS.ZS",    # Access to basic sanitation (%)
    urbanization_rate  = "SP.URB.TOTL.IN.ZS"
  ),
  start = 2000,
  end = 2020,
  extra = TRUE
)

# Same cleaning steps as the regression dataset
df_class <- df_class_raw %>%
  filter(region != "Aggregates") %>%
  drop_na() %>%
  select(country, iso3c, year, region, income,
         child_mortality, gdp_per_capita,
         health_expenditure, sanitation, urbanization_rate) %>%
  mutate(
    gdp_per_capita_log = log(gdp_per_capita),
    # Binary outcome: 1 = high child mortality (above median), 0 = low
    high_mortality = ifelse(child_mortality > median(child_mortality), 1, 0),
    year = as.integer(year)
  )

cat("Observations:", nrow(df_class), "\n")
Observations: 3880 
cat("Variables:", ncol(df_class), "\n")
Variables: 12 
cat("Rows dropped:", nrow(df_class_raw) - nrow(df_class), "\n")
Rows dropped: 1706 
cat("High mortality (1):", sum(df_class$high_mortality), "\n")
High mortality (1): 1939 
cat("Low mortality  (0):", sum(df_class$high_mortality == 0), "\n")
Low mortality  (0): 1941 

4. Summary Statistics

4.1 Regression Dataset

summary(df_clean %>% select(life_expectancy, health_expenditure,
                             gdp_per_capita, urbanization_rate,
                             fertility_rate))
 life_expectancy health_expenditure gdp_per_capita     urbanization_rate
 Min.   :14.66   Min.   : 1.223     Min.   :   109.6   Min.   :  8.044  
 1st Qu.:64.23   1st Qu.: 4.099     1st Qu.:  1307.8   1st Qu.: 37.796  
 Median :71.44   Median : 5.565     Median :  4201.3   Median : 57.895  
 Mean   :69.98   Mean   : 6.126     Mean   : 13011.1   Mean   : 56.816  
 3rd Qu.:76.56   3rd Qu.: 7.860     3rd Qu.: 14477.1   3rd Qu.: 74.927  
 Max.   :86.15   Max.   :24.458     Max.   :204263.8   Max.   :100.000  
 fertility_rate 
 Min.   :0.837  
 1st Qu.:1.714  
 Median :2.413  
 Mean   :2.929  
 3rd Qu.:3.913  
 Max.   :7.829  

4.2 Classification Dataset

summary(df_class %>% select(child_mortality, gdp_per_capita,
                              health_expenditure, sanitation,
                              urbanization_rate))
 child_mortality  gdp_per_capita     health_expenditure   sanitation     
 Min.   :  1.50   Min.   :   109.6   Min.   : 1.223     Min.   :  2.966  
 1st Qu.:  8.40   1st Qu.:  1299.7   1st Qu.: 4.092     1st Qu.: 46.768  
 Median : 21.00   Median :  4177.2   Median : 5.561     Median : 85.717  
 Mean   : 37.68   Mean   : 12755.0   Mean   : 6.130     Mean   : 71.950  
 3rd Qu.: 55.00   3rd Qu.: 14356.5   3rd Qu.: 7.864     3rd Qu.: 97.205  
 Max.   :489.30   Max.   :204263.8   Max.   :24.458     Max.   :100.000  
 urbanization_rate
 Min.   :  8.044  
 1st Qu.: 37.623  
 Median : 57.969  
 Mean   : 56.779  
 3rd Qu.: 74.891  
 Max.   :100.000  

5. Probability Distribution Analysis

5.1 Regression Dataset – Life Expectancy

Life expectancy at birth is a continuous variable: the average number of years a newborn is expected to live, measured at the country-year level.

df_clean %>%
  summarise(
    Mean    = mean(life_expectancy),
    Median  = median(life_expectancy),
    Std_Dev = sd(life_expectancy),
    Q1      = quantile(life_expectancy, 0.25),
    Q3      = quantile(life_expectancy, 0.75)
  )
      Mean Median  Std_Dev      Q1      Q3
1 69.97972 71.438 8.824294 64.2305 76.5631

Histogram – Life Expectancy (Original)

ggplot(df_clean, aes(x = life_expectancy)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  labs(
    title = "Distribution of Life Expectancy",
    x = "Life Expectancy (years)",
    y = "Frequency"
  ) +
  theme_minimal()

The distribution is slightly left-skewed (the mean of 69.98 lies below the median of 71.44): most country-year observations cluster at higher life expectancy levels, while a smaller number of observations, likely from low-income countries, form a long left tail.
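
The direction of skew can be checked numerically as well as visually. The sketch below computes the moment-based sample skewness in base R. `sample_skewness` is a hypothetical helper that is not part of the report, and the input vector is made up to mimic a long left tail. A negative value indicates left skew, which typically also shows up as a mean below the median.

```r
# Hypothetical helper: moment-based sample skewness
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m2 <- mean((x - mean(x))^2)  # second central moment
  m3 <- mean((x - mean(x))^3)  # third central moment
  m3 / m2^(3 / 2)
}

# Made-up values with a long left tail (not WDI data)
x <- c(45, 58, 66, 70, 72, 74, 75, 76, 78, 80)

sample_skewness(x)   # negative for this left-skewed sample
mean(x) < median(x)  # the left tail pulls the mean below the median
```

Applied to the report's data, the equivalent call would be sample_skewness(df_clean$life_expectancy).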

Histogram – GDP per Capita (Original vs Log)

ggplot(df_clean, aes(x = gdp_per_capita)) +
  geom_histogram(bins = 30, fill = "tomato", color = "white") +
  labs(
    title = "Distribution of GDP per Capita (Original)",
    x = "GDP per Capita (USD)", y = "Frequency"
  ) +
  theme_minimal()

ggplot(df_clean, aes(x = gdp_per_capita_log)) +
  geom_histogram(bins = 30, fill = "darkorange", color = "white") +
  labs(
    title = "Distribution of GDP per Capita (Log Transformed)",
    x = "log(GDP per Capita)", y = "Frequency"
  ) +
  theme_minimal()

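The original distribution is heavily right-skewed, with a long upper tail driven by a few very wealthy economies (mean 13,011 versus median 4,201). After the log transformation the distribution is approximately symmetric and bell-shaped, which is consistent with GDP per capita being roughly log-normally distributed.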
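
One way to back the visual impression with a number is the correlation between the sorted values and the matching normal quantiles, which is the relationship a normal Q-Q plot displays. The sketch uses simulated log-normal draws rather than the WDI data, and `qq_correlation` is a hypothetical helper. Values close to 1 indicate an approximately normal shape.

```r
# Hypothetical helper: correlation between sorted data and normal quantiles
qq_correlation <- function(x) {
  x <- sort(x[!is.na(x)])
  theoretical <- qnorm(ppoints(length(x)))  # expected normal quantiles
  cor(x, theoretical)
}

set.seed(465)
# Simulated stand-in for GDP per capita: log-normal by construction
gdp_sim <- rlnorm(2000, meanlog = 8.5, sdlog = 1.4)

qq_correlation(gdp_sim)       # raw scale: pulled down by the right tail
qq_correlation(log(gdp_sim))  # log scale: close to 1
```

On the report's data the same comparison would use df_clean$gdp_per_capita and df_clean$gdp_per_capita_log.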

Proposed Theoretical Distribution – Regression

  • Life expectancy: Approximately normal distribution with a slight left skew due to low-income country outliers.
  • GDP per capita: Consistent with a log-normal distribution; approximately normal after log transformation.

5.2 Classification Dataset – Child Mortality

Child mortality rate (under-5, per 1,000 live births) is a continuous variable that is dichotomized into a binary outcome based on the sample median.

df_class %>%
  summarise(
    Mean    = mean(child_mortality),
    Median  = median(child_mortality),
    Std_Dev = sd(child_mortality),
    Q1      = quantile(child_mortality, 0.25),
    Q3      = quantile(child_mortality, 0.75)
  )
      Mean Median  Std_Dev  Q1 Q3
1 37.67892     21 40.75137 8.4 55

Histogram – Child Mortality Rate

ggplot(df_class, aes(x = child_mortality)) +
  geom_histogram(bins = 30, fill = "firebrick", color = "white") +
  labs(
    title = "Distribution of Under-5 Child Mortality Rate",
    x = "Child Mortality (per 1,000 live births)",
    y = "Frequency"
  ) +
  theme_minimal()

The distribution is strongly right-skewed (mean 37.68 versus median 21.00): most country-year observations have relatively low child mortality rates, while a small number, likely in low-income countries, have very high rates. This shape is compatible with a log-normal distribution, although the histogram alone does not confirm it.

Binary Outcome Balance

df_class %>%
  count(high_mortality) %>%
  mutate(
    label = ifelse(high_mortality == 1, "High (1)", "Low (0)"),
    share = round(n / sum(n) * 100, 1)
  )
  high_mortality    n    label share
1              0 1941  Low (0)    50
2              1 1939 High (1)    50

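By construction (median split), the binary outcome is almost exactly balanced (1,941 low versus 1,939 high). The classes are not split exactly 1,940 to 1,940 because observations tied with the median are coded 0. A balanced outcome is convenient for classification because accuracy can be read directly against a 50% baseline.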

Proposed Theoretical Distribution – Classification

  • Child mortality rate: Right-skewed, consistent with a log-normal distribution.
  • Binary outcome: Bernoulli distribution with p ≈ 0.5 by construction (median split).
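
Using the class counts reported in the balance table, the Bernoulli parameter and the implied variance can be written out explicitly. This is a worked calculation in which \hat{p} denotes the sample share of high-mortality observations:

```latex
\hat{p} = \frac{1939}{3880} \approx 0.4997,
\qquad
\widehat{\operatorname{Var}}(\text{high\_mortality}) = \hat{p}\,(1 - \hat{p}) \approx 0.25
```

A Bernoulli variable has its largest possible variance at p = 0.5, so the median split yields an outcome with close to maximal variability.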

6. Exploratory Visualizations

6.1 Regression Dataset

Life Expectancy by Region

ggplot(df_clean, aes(x = reorder(region, life_expectancy),
                     y = life_expectancy, fill = region)) +
  geom_boxplot() +
  coord_flip() +
  labs(title = "Life Expectancy by Region",
       x = "", y = "Life Expectancy (years)") +
  theme_minimal() +
  theme(legend.position = "none")

Regional differences in life expectancy are substantial, suggesting that region should be considered as a control variable in the model.

Life Expectancy by Income Group

ggplot(df_clean, aes(x = reorder(income, life_expectancy),
                     y = life_expectancy, fill = income)) +
  geom_boxplot() +
  labs(title = "Life Expectancy by Income Group",
       x = "Income Group", y = "Life Expectancy (years)") +
  theme_minimal() +
  theme(legend.position = "none")

Higher income groups show clearly higher life expectancy, consistent with human capital theory.

Life Expectancy vs GDP per Capita (Log)

ggplot(df_clean, aes(x = gdp_per_capita_log, y = life_expectancy)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Life Expectancy vs GDP per Capita (Log)",
       x = "log(GDP per Capita)", y = "Life Expectancy (years)") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

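There is a positive relationship between income and life expectancy that is roughly linear in log(GDP per capita). A relationship that is linear in logs implies diminishing returns in levels: each additional dollar of income is associated with a smaller gain in life expectancy at higher income levels.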
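
The link between a linear fit in logs and diminishing returns can be made explicit. As a sketch, assume the fitted line has the form below, where \alpha and \beta > 0 are the intercept and slope. These symbols are introduced here for illustration and are not estimates from the report.

```latex
\text{LE} = \alpha + \beta \ln(\text{GDP})
\quad\Longrightarrow\quad
\frac{\partial\, \text{LE}}{\partial\, \text{GDP}} = \frac{\beta}{\text{GDP}}
```

The marginal gain from an extra dollar shrinks as GDP per capita rises, while a 1% increase in GDP per capita is associated with roughly \beta/100 additional years at any income level.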

Life Expectancy vs Health Expenditure

ggplot(df_clean, aes(x = health_expenditure, y = life_expectancy)) +
  geom_point(alpha = 0.3, color = "tomato") +
  geom_smooth(method = "lm", color = "darkred") +
  labs(title = "Life Expectancy vs Health Expenditure",
       x = "Health Expenditure (% of GDP)",
       y = "Life Expectancy (years)") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

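Higher health spending is associated with higher life expectancy, though the wide scatter around the fitted line shows that spending as a share of GDP explains only part of the variation. Differences in how efficiently countries convert spending into health outcomes are one possible explanation.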

Life Expectancy vs Fertility Rate

ggplot(df_clean, aes(x = fertility_rate, y = life_expectancy)) +
  geom_point(alpha = 0.3, color = "purple") +
  geom_smooth(method = "lm", color = "darkviolet") +
  labs(title = "Life Expectancy vs Fertility Rate",
       x = "Fertility Rate", y = "Life Expectancy (years)") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

There is a strong negative relationship between fertility rate and life expectancy, consistent with demographic transition theory.
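
The strength of a relationship like this can be summarised with a correlation coefficient. The sketch below uses simulated data with a built-in negative relationship, not the WDI sample. It reports Pearson's r, which measures linear association, and Spearman's rho, which only assumes a monotone relationship. The variable names mirror the report's columns, but the numbers are made up.

```r
set.seed(465)

# Simulated stand-ins: higher fertility goes with lower life expectancy
fertility_rate  <- runif(500, min = 1, max = 7)
life_expectancy <- 85 - 4 * fertility_rate + rnorm(500, sd = 4)

pearson  <- cor(fertility_rate, life_expectancy, method = "pearson")
spearman <- cor(fertility_rate, life_expectancy, method = "spearman")

round(c(pearson = pearson, spearman = spearman), 2)
```

On the report's data the equivalent call would be cor(df_clean$fertility_rate, df_clean$life_expectancy).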


6.2 Classification Dataset

Child Mortality by Region

ggplot(df_class, aes(x = reorder(region, child_mortality),
                     y = child_mortality, fill = region)) +
  geom_boxplot() +
  coord_flip() +
  labs(title = "Child Mortality Rate by Region",
       x = "", y = "Under-5 Mortality (per 1,000)") +
  theme_minimal() +
  theme(legend.position = "none")

Sub-Saharan Africa shows substantially higher child mortality rates, while regions such as Europe & Central Asia and North America cluster at very low rates.

Child Mortality vs GDP per Capita (Log)

ggplot(df_class, aes(x = gdp_per_capita_log, y = child_mortality,
                     color = factor(high_mortality))) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", color = "black") +
  scale_color_manual(values = c("0" = "steelblue", "1" = "firebrick"),
                     labels = c("Low Mortality", "High Mortality"),
                     name = "") +
  labs(title = "Child Mortality vs GDP per Capita (Log)",
       x = "log(GDP per Capita)",
       y = "Under-5 Mortality (per 1,000)") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

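Wealthier countries have much lower child mortality. Because high_mortality is defined from child mortality itself, the two colours split along the vertical axis by construction. The informative pattern is that high-mortality observations are concentrated at low values of log GDP per capita, which suggests income will be a useful predictor.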

Child Mortality vs Sanitation Access

ggplot(df_class, aes(x = sanitation, y = child_mortality,
                     color = factor(high_mortality))) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", color = "black") +
  scale_color_manual(values = c("0" = "steelblue", "1" = "firebrick"),
                     labels = c("Low Mortality", "High Mortality"),
                     name = "") +
  labs(title = "Child Mortality vs Access to Sanitation",
       x = "Population with Basic Sanitation (%)",
       y = "Under-5 Mortality (per 1,000)") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

Access to sanitation is strongly negatively associated with child mortality, which is consistent with infrastructure playing an important role in health outcomes.