Stage 1: Data Proposal and Probability Analysis Report

Author

Selhan Çil & Sude Arslan

Published

May 10, 2026

# Packages used in this report
library(WDI)
library(tidyverse)
library(knitr)
library(scales)

theme_set(theme_minimal(base_size = 12))

1 Overview

This report proposes two real-world economic datasets obtained from the World Bank World Development Indicators (WDI) database. The first dataset is designed for a regression problem with a continuous target variable. The second dataset is designed for a classification problem with a binary target variable created from a macroeconomic growth indicator.

Both datasets are retrieved from the World Bank WDI API using the WDI R package. The indicators can also be browsed at: https://databank.worldbank.org/source/world-development-indicators

2 Dataset 1: Regression Dataset

2.1 Dataset Description and Source

The first dataset is a country-year panel from the World Bank WDI database for the years 2010-2023. Each observation represents one country in one year. The target variable is life expectancy at birth, which is a continuous numeric outcome. The main model uses GDP per capita, health expenditure, unemployment, inflation, and urban population share as explanatory variables.

This dataset is relevant to economics because life expectancy is closely connected to economic development, health spending, income levels, labor market conditions, urbanization, and macroeconomic stability. It can be used to study whether countries with higher income and stronger health-related investment tend to have better development outcomes.

The main model uses income, health expenditure, unemployment, inflation, and urbanization because these variables are widely available across countries in WDI. Education and poverty are theoretically important because they capture human capital and economic deprivation, but their WDI series may contain many missing values. They are therefore reserved for an alternative model, to be used only if enough complete observations remain.

Main model (R formula notation):

life_expectancy ~ gdp_per_capita + health_expenditure + unemployment + inflation + urban_population

Alternative model (R formula notation):

life_expectancy ~ gdp_per_capita + health_expenditure + education + poverty + urban_population
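Whether the alternative model is viable depends on how many complete country-year rows remain once education and poverty are added. The following is a minimal sketch of that check on a small invented data frame, not on WDI data. The columns `education` and `poverty` and the helper `count_complete()` are assumptions made for illustration only.

```r
# Toy data frame standing in for the WDI panel (all values are invented).
toy <- data.frame(
  life_expectancy    = c(70.1, 65.3, 80.2, 74.8, 59.9),
  gdp_per_capita     = c(5200, 1800, 41000, 9800, 900),
  health_expenditure = c(6.1, 4.3, 10.8, 7.2, 3.9),
  unemployment       = c(7.5, 4.1, 5.0, NA, 3.2),
  inflation          = c(3.2, 8.9, 1.4, 4.0, 11.5),
  urban_population   = c(61, 38, 83, 70, 29),
  education          = c(88, NA, 101, 93, NA),    # hypothetical column
  poverty            = c(4.5, NA, 0.3, NA, 38.0)  # hypothetical column
)

main_vars <- c("life_expectancy", "gdp_per_capita", "health_expenditure",
               "unemployment", "inflation", "urban_population")
alt_vars <- c("life_expectancy", "gdp_per_capita", "health_expenditure",
              "education", "poverty", "urban_population")

# Hypothetical helper: number of rows with no missing value in `vars`.
count_complete <- function(data, vars) {
  sum(complete.cases(data[, vars]))
}

n_main <- count_complete(toy, main_vars)
n_alt <- count_complete(toy, alt_vars)
c(main = n_main, alternative = n_alt)
```

Applied to the real panel, the same comparison would show how many observations the alternative specification costs relative to the main one.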

Source indicators:

Variable	WDI Indicator	Description
`life_expectancy`	`SP.DYN.LE00.IN`	Life expectancy at birth, total (years)
`gdp_per_capita`	`NY.GDP.PCAP.KD`	GDP per capita, constant 2015 US$
`health_expenditure`	`SH.XPD.CHEX.GD.ZS`	Current health expenditure (% of GDP)
`inflation`	`FP.CPI.TOTL.ZG`	Inflation, consumer prices (annual %)
`unemployment`	`SL.UEM.TOTL.ZS`	Unemployment, total (% of total labor force)
`urban_population`	`SP.URB.TOTL.IN.ZS`	Urban population (% of total population)

2.2 Economic Question

To what extent do economic development, health investment, labor market conditions, macroeconomic stability, and urbanization explain differences in life expectancy across countries?

2.3 Data Import and Cleaning

# WDI indicators for the regression dataset
regression_indicators <- c(
  life_expectancy = "SP.DYN.LE00.IN",
  gdp_per_capita = "NY.GDP.PCAP.KD",
  health_expenditure = "SH.XPD.CHEX.GD.ZS",
  inflation = "FP.CPI.TOTL.ZG",
  unemployment = "SL.UEM.TOTL.ZS",
  urban_population = "SP.URB.TOTL.IN.ZS"
)

# Import data directly from the World Bank WDI API.
# extra = TRUE adds metadata such as region and income group.
regression_raw <- WDI(
  country = "all",
  indicator = regression_indicators,
  start = 2010,
  end = 2023,
  extra = TRUE
)

# Clean the data:
# 1. Remove regional and income-group aggregates.
# 2. Keep only the variables needed for analysis.
# 3. Convert the year and indicator columns to numeric types.
# 4. Add a log-transformed target variable for distribution analysis.
# 5. Remove rows with missing values in any retained column.
regression_data <- regression_raw |>
  filter(region != "Aggregates") |>
  select(
    country, iso3c, year, region, income,
    life_expectancy, gdp_per_capita, health_expenditure, inflation, unemployment,
    urban_population
  ) |>
  mutate(
    year = as.integer(year),
    across(
      c(
        life_expectancy, gdp_per_capita, health_expenditure, inflation, unemployment,
        urban_population
      ),
      as.numeric
    ),
    log_life_expectancy = log(life_expectancy)
  ) |>
  drop_na()

# Check whether the dataset satisfies the assignment requirements.
regression_size <- tibble(
  observations = nrow(regression_data),
  variables = ncol(regression_data)
)

kable(regression_size, caption = "Regression Dataset Size")

Regression Dataset Size
observations	variables
2350	12

2.4 Summary Statistics for Target Variable

regression_summary <- regression_data |>
  summarise(
    mean = mean(life_expectancy),
    median = median(life_expectancy),
    standard_deviation = sd(life_expectancy),
    minimum = min(life_expectancy),
    q1 = quantile(life_expectancy, 0.25),
    q3 = quantile(life_expectancy, 0.75),
    maximum = max(life_expectancy)
  )

kable(
  regression_summary,
  digits = 2,
  caption = "Summary Statistics for Life Expectancy"
)

Summary Statistics for Life Expectancy
mean	median	standard_deviation	minimum	q1	q3	maximum
71.62	72.78	8.07	18.82	65.87	77.72	84.56

2.5 Histogram of Original Target Variable

ggplot(regression_data, aes(x = life_expectancy)) +
  geom_histogram(bins = 30, fill = "#2f6f73", color = "white") +
  labs(
    title = "Distribution of Life Expectancy",
    x = "Life expectancy at birth (years)",
    y = "Number of country-year observations"
  )

The distribution of life expectancy is not perfectly normal. It is mildly left-skewed: most country-years are concentrated at relatively high life expectancy values, a smaller group has much lower values, and the mean (71.62) lies below the median (72.78). Even so, compared with many economic variables, the distribution is fairly compact and close to bell-shaped.
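One quick way to put a number on the skewness is Pearson's second (median) skewness coefficient, 3 * (mean - median) / standard deviation, which needs only the summary statistics from Section 2.4. This is a rough sketch and not part of the original analysis. A moment-based skewness computed from the raw data could give a different value.

```r
# Summary statistics copied from the table in Section 2.4.
le_mean <- 71.62
le_median <- 72.78
le_sd <- 8.07

# Pearson's second skewness coefficient: negative values indicate left skew.
pearson_skew <- 3 * (le_mean - le_median) / le_sd
round(pearson_skew, 2)
```

The coefficient comes out at roughly -0.43, which is negative but modest in size.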

2.6 Histogram of Log-Transformed Target Variable

ggplot(regression_data, aes(x = log_life_expectancy)) +
  geom_histogram(bins = 30, fill = "#b46a3c", color = "white") +
  labs(
    title = "Distribution of Log Life Expectancy",
    x = "Log of life expectancy",
    y = "Number of country-year observations"
  )

The log transformation compresses higher values, which mainly helps variables with a long right tail. Life expectancy is not right-skewed, so the transformation does not improve the shape of the distribution and, if anything, stretches the lower tail slightly. A normal distribution is a reasonable first approximation for the untransformed target variable, although life expectancy is bounded and therefore will not be perfectly normal.
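A simple plausibility check of the normal approximation is to compare the observed quartiles with those implied by a normal distribution that has the same mean and standard deviation. The sketch below uses only the rounded values from Section 2.4 and is illustrative. A Q-Q plot on the raw data would be a more complete diagnostic.

```r
# Mean and standard deviation copied from the table in Section 2.4.
le_mean <- 71.62
le_sd <- 8.07

# Quartiles implied by a normal distribution with the same mean and sd.
normal_q <- qnorm(c(0.25, 0.75), mean = le_mean, sd = le_sd)

# Observed quartiles from the same summary table.
observed_q <- c(65.87, 77.72)

comparison <- data.frame(
  quartile = c("q1", "q3"),
  observed = observed_q,
  normal_implied = round(normal_q, 2)
)
comparison
```

Both observed quartiles lie within about 0.7 years of the normal-implied values, which supports treating the normal distribution as a first approximation.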

3 Dataset 2: Classification Dataset

3.1 Dataset Description and Source

The second dataset is also a country-year panel from the World Bank WDI database for the years 2010-2023. Each observation represents one country in one year. The binary target variable is high_growth, which equals 1 if real GDP growth is at least 3 percent in that country-year and equals 0 otherwise.

This dataset is relevant to economics because identifying high-growth country-years is a common macroeconomic classification problem. Governments, investors, and international organizations often want to understand whether indicators such as investment, inflation, unemployment, trade openness, education, and population growth are associated with stronger economic performance.

Source indicators:

Variable	WDI Indicator	Description
`gdp_growth`	`NY.GDP.MKTP.KD.ZG`	GDP growth (annual %)
`investment_share`	`NE.GDI.TOTL.ZS`	Gross capital formation (% of GDP)
`inflation`	`FP.CPI.TOTL.ZG`	Inflation, consumer prices (annual %)
`unemployment`	`SL.UEM.TOTL.ZS`	Unemployment, total (% of total labor force)
`trade_share`	`NE.TRD.GNFS.ZS`	Trade (% of GDP)
`secondary_enrollment`	`SE.SEC.ENRR`	School enrollment, secondary (% gross)
`population_growth`	`SP.POP.GROW`	Population growth (annual %)

3.2 Economic Question

Can macroeconomic indicators classify whether a country-year experienced high GDP growth?

3.3 Data Import and Cleaning

# WDI indicators for the classification dataset
classification_indicators <- c(
  gdp_growth = "NY.GDP.MKTP.KD.ZG",
  investment_share = "NE.GDI.TOTL.ZS",
  inflation = "FP.CPI.TOTL.ZG",
  unemployment = "SL.UEM.TOTL.ZS",
  trade_share = "NE.TRD.GNFS.ZS",
  secondary_enrollment = "SE.SEC.ENRR",
  population_growth = "SP.POP.GROW"
)

# Import data directly from the World Bank WDI API.
classification_raw <- WDI(
  country = "all",
  indicator = classification_indicators,
  start = 2010,
  end = 2023,
  extra = TRUE
)

# Clean the data:
# 1. Remove aggregate regions.
# 2. Keep the relevant columns.
# 3. Convert variables to numeric types.
# 4. Remove missing values.
# 5. Create a binary classification target.
classification_data <- classification_raw |>
  filter(region != "Aggregates") |>
  select(
    country, iso3c, year, region, income,
    gdp_growth, investment_share, inflation, unemployment,
    trade_share, secondary_enrollment, population_growth
  ) |>
  mutate(
    year = as.integer(year),
    across(
      c(
        gdp_growth, investment_share, inflation, unemployment,
        trade_share, secondary_enrollment, population_growth
      ),
      as.numeric
    )
  ) |>
  drop_na() |>
  mutate(
    high_growth = if_else(gdp_growth >= 3, 1L, 0L),
    high_growth_label = factor(
      high_growth,
      levels = c(0, 1),
      labels = c("Low or moderate growth", "High growth")
    ),
    log_high_growth = log1p(high_growth)
  )

# Check whether the dataset satisfies the assignment requirements.
classification_size <- tibble(
  observations = nrow(classification_data),
  variables = ncol(classification_data)
)

kable(classification_size, caption = "Classification Dataset Size")

Classification Dataset Size
observations	variables
1631	15

3.4 Summary Statistics for Target Variable

classification_summary <- classification_data |>
  summarise(
    mean = mean(high_growth),
    median = median(high_growth),
    standard_deviation = sd(high_growth),
    minimum = min(high_growth),
    q1 = quantile(high_growth, 0.25),
    q3 = quantile(high_growth, 0.75),
    maximum = max(high_growth),
    high_growth_share = mean(high_growth)
  )

kable(
  classification_summary,
  digits = 3,
  caption = "Summary Statistics for High-Growth Classification Target"
)

Summary Statistics for High-Growth Classification Target
mean	median	standard_deviation	minimum	q1	q3	maximum	high_growth_share
0.54	1	0.499	0	0	1	1	0.54

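For a binary variable, the sample mean equals the share of observations with a value of 1, which is an estimate of the probability that the outcome equals 1. The mean of high_growth (0.54) therefore says that about 54 percent of country-year observations had GDP growth of at least 3 percent. The quartiles, minimum, and maximum can only take the values 0 or 1, so they add little beyond this share.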
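For a 0/1 variable the standard deviation is also determined by the share of ones. The sketch below illustrates this on a small invented vector and then applies the same formula to the rounded values reported in this section (share 0.54, 1631 observations). It is a consistency check under those rounded inputs, not a recomputation from the data.

```r
# Toy binary vector (invented values, not WDI data).
y <- c(1L, 0L, 1L, 1L, 0L)
n <- length(y)
p_hat <- mean(y)  # share of ones

# The sample sd of a 0/1 variable follows from p_hat and n alone.
sd_from_p <- sqrt(n / (n - 1) * p_hat * (1 - p_hat))
all.equal(sd(y), sd_from_p)

# Same formula with the rounded values reported above.
sd_reported <- sqrt(1631 / 1630 * 0.54 * (1 - 0.54))
round(sd_reported, 4)
```

With the rounded inputs the formula gives about 0.4986, in line with the reported standard deviation of 0.499.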

3.5 Histogram of Original Target Variable

ggplot(classification_data, aes(x = high_growth)) +
  geom_histogram(
    binwidth = 0.5,
    boundary = -0.25,
    fill = "#5f6f9f",
    color = "white"
  ) +
  scale_x_continuous(
    breaks = c(0, 1),
    labels = c("0 = Low/moderate", "1 = High")
  ) +
  labs(
    title = "Distribution of High-Growth Target",
    x = "High-growth classification target",
    y = "Number of country-year observations"
  )

The target variable is binary, so it is not normally distributed. The histogram has mass only at 0 and 1. This is expected for a classification outcome.

3.6 Histogram of Log-Transformed Target Variable

ggplot(classification_data, aes(x = log_high_growth)) +
  geom_histogram(
    binwidth = 0.25,
    boundary = -0.125,
    fill = "#8a6b35",
    color = "white"
  ) +
  scale_x_continuous(
    breaks = c(0, log(2)),
    labels = c("log(1 + 0)", "log(1 + 1)")
  ) +
  labs(
    title = "Distribution of Log-Transformed High-Growth Target",
    x = "log(1 + high_growth)",
    y = "Number of country-year observations"
  )

The log transformation does not make the binary target continuous or normal. It only moves the value 1 to log(2) while the value 0 remains 0. Because this is a classification problem, the most appropriate theoretical distribution is a Bernoulli distribution, not a normal, log-normal, or exponential distribution. If the analysis were instead focused on the continuous variable gdp_growth, a normal distribution might be considered as a rough approximation, but the binary target itself is best modeled as Bernoulli.
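The two points made above can be sketched in a few lines of base R: a Bernoulli distribution is a binomial with a single trial, and log1p() only relabels the two support points. The probability 0.54 is the rounded share from Section 3.4, and the example vector is invented.

```r
p <- 0.54  # rounded share of high-growth country-years from Section 3.4

# Bernoulli probabilities for y = 0 and y = 1 (binomial with one trial).
bernoulli_pmf <- dbinom(c(0, 1), size = 1, prob = p)

# log1p() moves 1 to log(2) and leaves 0 at 0; no new values appear.
y <- c(0, 1, 1, 0, 1)  # invented example values
transformed <- log1p(y)
support <- sort(unique(transformed))

bernoulli_pmf
support
```

The transformed variable still takes exactly two values, so it remains a two-point distribution rather than a continuous one.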

4 Conclusion

Both datasets satisfy the assignment requirements: each contains more than 500 observations (2,350 and 1,631 country-years, respectively) and more than 5 variables, including the target variable. The first dataset supports a regression analysis of life expectancy, while the second supports a classification analysis of high-growth country-years. The probability analysis shows that life expectancy is approximately normal with mild left skew, while the high-growth target is binary and should be treated as Bernoulli.
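The requirement check can be written out explicitly. The sketch below uses the sizes reported in the two dataset-size tables and the thresholds stated in this conclusion. The data frame `dataset_sizes` is introduced here only for illustration.

```r
# Sizes copied from the two dataset-size tables in this report.
dataset_sizes <- data.frame(
  dataset = c("regression", "classification"),
  observations = c(2350, 1631),
  variables = c(12, 15)
)

# Thresholds as stated above: more than 500 rows and more than 5 variables.
dataset_sizes$meets_requirements <-
  dataset_sizes$observations > 500 & dataset_sizes$variables > 5

dataset_sizes
```

Note that the variable counts include identifier columns such as country, iso3c, and year, as in the original size tables.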