ECON 465 – Stage 1: Data Acquisition & Probability Analysis
Authors: Selhan Çil & Sude Arslan
Published: May 10, 2026
1. Economic Question
Regression: To what extent can health expenditure, GDP per capita, urbanization, and fertility rates predict life expectancy across countries?
Classification: Can a country be classified as having high child mortality based on economic and health indicators such as GDP per capita, health expenditure, and access to sanitation?
2. Dataset Description
2.1 Regression Dataset (Continuous Outcome)
This dataset was obtained from the World Bank World Development Indicators (WDI) database using the WDI package in R. It contains data for 200+ countries from 2000 to 2020, with approximately 3,948 observations and 11 variables. The dataset includes key health, demographic, and economic indicators that may influence life expectancy across countries. The WDI API was used directly in R to ensure full reproducibility, with no manual downloads required.
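The acquisition step described above might look roughly like the sketch below. The indicator codes and the mapping to column names are assumptions chosen to match the variables named in this report; the report does not list the exact series used, and `fetch_wdi_panel` is a hypothetical helper name. The download needs the WDI package and an internet connection, so the function is only defined here, not called.

```r
# Hypothetical mapping from report variable names to WDI series codes.
# These codes are assumptions; the report does not state which series were used.
wdi_indicators <- c(
  life_expectancy    = "SP.DYN.LE00.IN",     # life expectancy at birth (years)
  gdp_per_capita     = "NY.GDP.PCAP.CD",     # GDP per capita (current USD)
  health_expenditure = "SH.XPD.CHEX.GD.ZS",  # current health expenditure (% of GDP)
  urbanization_rate  = "SP.URB.TOTL.IN.ZS",  # urban population (% of total)
  fertility_rate     = "SP.DYN.TFRT.IN"      # fertility rate (births per woman)
)

# Hypothetical helper: download a country-year panel for 2000-2020.
# Requires the WDI package and network access when it is actually called.
fetch_wdi_panel <- function(indicators = wdi_indicators,
                            start = 2000, end = 2020) {
  raw <- WDI::WDI(country = "all", indicator = indicators,
                  start = start, end = end, extra = TRUE)
  # With extra = TRUE, aggregates such as "World" carry region == "Aggregates";
  # drop them so that only individual countries remain.
  raw[!is.na(raw$region) & raw$region != "Aggregates", ]
}
```

Calling `fetch_wdi_panel()` would return one row per country-year, with the `region` and `income` columns that `extra = TRUE` adds.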
Outcome variable: Life expectancy at birth (continuous, in years)
Source: https://data.worldbank.org
2.2 Classification Dataset (Binary Outcome)
This dataset is also sourced from the World Bank WDI database using the same WDI package in R, covering 200+ countries from 2000 to 2020. The binary outcome variable is constructed from the under-5 child mortality rate: countries above the sample median are coded as 1 (high mortality), and those at or below the median are coded as 0 (low mortality).
Outcome variable: high_mortality = 1 if the under-5 mortality rate is above the sample median, 0 otherwise.
Predictors: GDP per capita (log), health expenditure (% of GDP), access to basic sanitation (% of population), urbanization rate.
Summary statistics for the classification dataset:

Statistic  child_mortality  gdp_per_capita  health_expenditure  sanitation  urbanization_rate
Min.                  1.50           109.6               1.223       2.966              8.044
1st Qu.               8.40          1299.7               4.092      46.768             37.623
Median               21.00          4177.2               5.561      85.717             57.969
Mean                 37.68         12755.0               6.130      71.950             56.779
3rd Qu.              55.00         14356.5               7.864      97.205             74.891
Max.                489.30        204263.8              24.458     100.000            100.000
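The median split described in Section 2.2 can be sketched in base R as follows. The mortality values are toy numbers chosen for illustration, not WDI data. The point of the example is that observations exactly at the median fall into class 0, so the two classes need not be exactly equal in size.

```r
# Toy under-5 mortality rates (per 1,000 live births); illustrative values only.
child_mortality <- c(4.2, 8.4, 15.0, 21.0, 21.0, 38.5, 55.0, 97.3)

# The sample median is used as the cutoff.
cutoff <- median(child_mortality, na.rm = TRUE)

# 1 = strictly above the median (high mortality), 0 = at or below it.
high_mortality <- as.integer(child_mortality > cutoff)

print(cutoff)
print(table(high_mortality))
```

In this toy sample two observations sit exactly at the cutoff, so class 0 ends up with five observations and class 1 with three.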
5. Probability Distribution Analysis
5.1 Regression Dataset – Life Expectancy
Life expectancy is a continuous variable representing the average number of years a person is expected to live at birth, measured at the country-year level.
Mean      Median  Std_Dev   Q1       Q3
69.97972  71.438  8.824294  64.2305  76.5631
Histogram – Life Expectancy (Original)
ggplot(df_clean, aes(x = life_expectancy)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  labs(
    title = "Distribution of Life Expectancy",
    x = "Life Expectancy (years)",
    y = "Frequency"
  ) +
  theme_minimal()
The distribution is slightly left-skewed: the mean (69.98) lies below the median (71.44). Most countries cluster at higher life expectancy levels, while a smaller number of low-income countries create a long left tail.
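A minimal sketch of how the summary statistics and the direction of skew could be checked numerically is given below. The values are a toy sample, not the report's data, and `sample_skewness` is a hypothetical helper that computes the moment-based skewness (dividing by n). A negative value indicates a left skew.

```r
# Toy life-expectancy values (years); illustrative only, not the report's data.
life_expectancy <- c(52, 58, 63, 68, 71, 73, 75, 77, 79, 81)

# Hypothetical helper: moment-based sample skewness,
# i.e. mean cubed deviation divided by the (divide-by-n) variance to the power 1.5.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  dev <- x - mean(x)
  mean(dev^3) / mean(dev^2)^1.5
}

le_summary <- data.frame(
  Mean    = mean(life_expectancy),
  Median  = median(life_expectancy),
  Std_Dev = sd(life_expectancy),
  Q1      = unname(quantile(life_expectancy, 0.25)),
  Q3      = unname(quantile(life_expectancy, 0.75))
)
skew <- sample_skewness(life_expectancy)

print(le_summary)
print(skew)
```

In the toy sample the mean falls below the median and the skewness is negative, the same pattern the report describes for life expectancy.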
Histogram – GDP per Capita (Original vs Log)
ggplot(df_clean, aes(x = gdp_per_capita)) +
  geom_histogram(bins = 30, fill = "tomato", color = "white") +
  labs(
    title = "Distribution of GDP per Capita (Original)",
    x = "GDP per Capita (USD)",
    y = "Frequency"
  ) +
  theme_minimal()

ggplot(df_clean, aes(x = gdp_per_capita_log)) +
  geom_histogram(bins = 30, fill = "darkorange", color = "white") +
  labs(
    title = "Distribution of GDP per Capita (Log Transformed)",
    x = "log(GDP per Capita)",
    y = "Frequency"
  ) +
  theme_minimal()
The original distribution is heavily right-skewed, with extreme outliers among wealthy nations. After the log transformation, the distribution is approximately normal, which is what a log-normal distribution for GDP per capita in levels implies.
Proposed Theoretical Distribution – Regression
Life expectancy: Approximately normal distribution with a slight left skew due to low-income country outliers.
GDP per capita: Consistent with a log-normal distribution; approximately normal after log transformation.
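The log-normal reading of GDP per capita can be illustrated with simulated data. In the sketch below, the sample is drawn with `rlnorm()` using made-up parameters (they are not estimates from the WDI data), and `sample_skewness` is a hypothetical helper. The example compares skewness before and after the log transformation and recovers the log-normal parameters as the mean and standard deviation of the logged values.

```r
set.seed(465)  # arbitrary seed so the simulation is reproducible

# Simulated incomes from a log-normal distribution; the parameters are
# invented for illustration and are not estimates from the WDI data.
gdp_per_capita     <- rlnorm(2000, meanlog = 8.5, sdlog = 1.4)
gdp_per_capita_log <- log(gdp_per_capita)

# Hypothetical helper: moment-based skewness (dividing by n).
sample_skewness <- function(x) {
  dev <- x - mean(x)
  mean(dev^3) / mean(dev^2)^1.5
}

skew_raw <- sample_skewness(gdp_per_capita)
skew_log <- sample_skewness(gdp_per_capita_log)

# Maximum-likelihood estimates of the log-normal parameters are the
# mean and the (divide-by-n) standard deviation of the logged values.
meanlog_hat <- mean(gdp_per_capita_log)
sdlog_hat   <- sqrt(mean((gdp_per_capita_log - meanlog_hat)^2))

print(c(skew_raw = skew_raw, skew_log = skew_log))
print(c(meanlog_hat = meanlog_hat, sdlog_hat = sdlog_hat))
```

A strongly positive skewness in levels together with a near-zero skewness in logs is the pattern one would expect if the variable is close to log-normal.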
5.2 Classification Dataset – Child Mortality
Child mortality rate (under-5, per 1,000 live births) is a continuous variable that is dichotomized into a binary outcome based on the sample median.
Mean      Median  Std_Dev   Q1   Q3
37.67892  21      40.75137  8.4  55
Histogram – Child Mortality Rate
ggplot(df_class, aes(x = child_mortality)) +
  geom_histogram(bins = 30, fill = "firebrick", color = "white") +
  labs(
    title = "Distribution of Under-5 Child Mortality Rate",
    x = "Child Mortality (per 1,000 live births)",
    y = "Frequency"
  ) +
  theme_minimal()
The distribution is strongly right-skewed: the mean (37.68) is well above the median (21.00). Most countries have relatively low child mortality rates, while a small number of low-income countries have very high rates. This shape is consistent with a log-normal distribution.
high_mortality  n     label     share (%)
0               1941  Low (0)   50
1               1939  High (1)  50
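By construction (median split), the binary outcome is almost exactly balanced: 1,941 observations are coded 0 and 1,939 are coded 1. The small gap is expected because observations exactly at the median are coded 0. A balanced outcome is desirable for classification analysis because it avoids class-imbalance problems.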
Proposed Theoretical Distribution – Classification
Child mortality rate: Right-skewed, consistent with a log-normal distribution.
Binary outcome: Bernoulli distribution with p ≈ 0.5 by construction (median split).
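As a short worked example of the Bernoulli description, the class counts reported in the balance table above give the estimated success probability directly. The variable names in the sketch are illustrative.

```r
# Class counts as reported in the balance table of this section.
counts <- c(low = 1941, high = 1939)

n     <- sum(counts)
p_hat <- unname(counts["high"] / n)  # estimated P(high_mortality = 1)

# A Bernoulli(p) variable has mean p and variance p * (1 - p);
# the variance reaches its maximum of 0.25 at p = 0.5.
bern_mean <- p_hat
bern_var  <- p_hat * (1 - p_hat)

print(round(c(n = n, p_hat = p_hat, variance = bern_var), 4))
```

The estimate is within a fraction of a percentage point of 0.5, as expected for a median split.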
6. Exploratory Visualizations
6.1 Regression Dataset
Life Expectancy by Region
ggplot(df_clean, aes(x = reorder(region, life_expectancy),
                     y = life_expectancy, fill = region)) +
  geom_boxplot() +
  coord_flip() +
  labs(
    title = "Life Expectancy by Region",
    x = "",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
Regional differences in life expectancy are substantial, suggesting that region should be considered as a control variable in the model.
Life Expectancy by Income Group
ggplot(df_clean, aes(x = reorder(income, life_expectancy),
                     y = life_expectancy, fill = income)) +
  geom_boxplot() +
  labs(
    title = "Life Expectancy by Income Group",
    x = "Income Group",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
Higher income groups show clearly higher life expectancy, consistent with human capital theory.
Life Expectancy vs GDP per Capita (Log)
ggplot(df_clean, aes(x = gdp_per_capita_log, y = life_expectancy)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_smooth(method = "lm", color = "red") +
  labs(
    title = "Life Expectancy vs GDP per Capita (Log)",
    x = "log(GDP per Capita)",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal()
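There is a positive relationship between log GDP per capita and life expectancy. The point cloud flattens at higher income levels, suggesting diminishing returns to income; the fitted line itself is linear (method = "lm") and cannot show this curvature.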
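One way to make the diminishing-returns reading concrete is a level-log specification, in which life expectancy is linear in log income. The sketch below uses invented coefficients and noise-free toy data, so it is not an estimate from the WDI data. It shows that each doubling of income adds the same number of years, so an extra 1,000 USD matters far less at high income than at low income.

```r
# Hypothetical level-log relationship: life_expectancy = a + b * log(gdp).
# The coefficients are invented for illustration, not estimated from WDI data.
a <- 30
b <- 4.5
gdp_per_capita  <- c(500, 1000, 2000, 4000, 8000, 16000, 32000, 64000)
life_expectancy <- a + b * log(gdp_per_capita)

# Recover the slope with ordinary least squares on log income.
fit   <- lm(life_expectancy ~ log(gdp_per_capita))
b_hat <- unname(coef(fit)[2])

# Doubling income adds b * log(2) years at every income level.
gain_per_doubling <- b_hat * log(2)

# The gain from an extra 1,000 USD shrinks as income rises.
gain_1000 <- function(gdp) b_hat * (log(gdp + 1000) - log(gdp))

print(gain_per_doubling)
print(c(at_1000 = gain_1000(1000), at_32000 = gain_1000(32000)))
```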
Life Expectancy vs Health Expenditure
ggplot(df_clean, aes(x = health_expenditure, y = life_expectancy)) +
  geom_point(alpha = 0.3, color = "tomato") +
  geom_smooth(method = "lm", color = "darkred") +
  labs(
    title = "Life Expectancy vs Health Expenditure",
    x = "Health Expenditure (% of GDP)",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal()
Higher health spending is associated with higher life expectancy, though variation suggests efficiency differences across countries.
Life Expectancy vs Fertility Rate
ggplot(df_clean, aes(x = fertility_rate, y = life_expectancy)) +
  geom_point(alpha = 0.3, color = "purple") +
  geom_smooth(method = "lm", color = "darkviolet") +
  labs(
    title = "Life Expectancy vs Fertility Rate",
    x = "Fertility Rate",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal()
There is a strong negative relationship between fertility rate and life expectancy, consistent with demographic transition theory.
6.2 Classification Dataset
Child Mortality by Region
ggplot(df_class, aes(x = reorder(region, child_mortality),
                     y = child_mortality, fill = region)) +
  geom_boxplot() +
  coord_flip() +
  labs(
    title = "Child Mortality Rate by Region",
    x = "",
    y = "Under-5 Mortality (per 1,000)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
Sub-Saharan Africa shows substantially higher child mortality rates, while Europe and high-income regions cluster near zero.
Child Mortality vs GDP per Capita (Log)
ggplot(df_class, aes(x = gdp_per_capita_log, y = child_mortality,
                     color = factor(high_mortality))) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", color = "black") +
  scale_color_manual(
    values = c("0" = "steelblue", "1" = "firebrick"),
    labels = c("Low Mortality", "High Mortality"),
    name = ""
  ) +
  labs(
    title = "Child Mortality vs GDP per Capita (Log)",
    x = "log(GDP per Capita)",
    y = "Under-5 Mortality (per 1,000)"
  ) +
  theme_minimal()
Wealthier countries have much lower child mortality. The two classes of the binary outcome are well separated along the income dimension.
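A rough way to quantify how well income alone separates the two classes is a one-variable threshold rule. The sketch below is an illustrative check on toy values, not the report's method: it predicts high mortality whenever log GDP per capita falls below the midpoint of the two class means and reports the share of correct predictions.

```r
# Toy data: log GDP per capita and the median-split label (illustrative values).
gdp_per_capita_log <- c(6.1, 6.8, 7.2, 7.9, 8.3, 8.8, 9.4, 10.1, 10.6, 11.0)
high_mortality     <- c(1,   1,   1,   1,   0,   1,   0,   0,    0,    0)

# Mean log GDP per capita within each class.
class_means <- tapply(gdp_per_capita_log, high_mortality, mean)

# One-variable rule: predict high mortality when log GDP per capita is
# below the midpoint of the two class means.
threshold <- mean(class_means)
predicted <- as.integer(gdp_per_capita_log < threshold)
accuracy  <- mean(predicted == high_mortality)

print(class_means)
print(c(threshold = threshold, accuracy = accuracy))
```

An accuracy close to 1 would indicate near-complete separation on income alone; values well below 1 would suggest that the other predictors carry additional information.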
Child Mortality vs Sanitation Access
ggplot(df_class, aes(x = sanitation, y = child_mortality,
                     color = factor(high_mortality))) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", color = "black") +
  scale_color_manual(
    values = c("0" = "steelblue", "1" = "firebrick"),
    labels = c("Low Mortality", "High Mortality"),
    name = ""
  ) +
  labs(
    title = "Child Mortality vs Access to Sanitation",
    x = "Population with Basic Sanitation (%)",
    y = "Under-5 Mortality (per 1,000)"
  ) +
  theme_minimal()
Access to sanitation is strongly negatively associated with child mortality, highlighting the importance of infrastructure in health outcomes.