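# Load the packages used in this report
library(WDI)        # WDI() downloads World Bank indicators
library(tidyverse)  # dplyr, tidyr, and ggplot2
library(janitor)    # clean_names()
# Import World Bank indicators
indicators <- c(
  gdp_growth = "NY.GDP.MKTP.KD.ZG",
  renewable_energy = "EG.FEC.RNEW.ZS",
  population_growth = "SP.POP.GROW",
  trade_openness = "NE.TRD.GNFS.ZS",
  inflation = "FP.CPI.TOTL.ZG",
  gdp_per_capita = "NY.GDP.PCAP.CD"
)
Stage 1: Renewable Energy and Economic Growth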
Introduction
This Stage 1 report presents two real-world economic datasets built from the World Bank World Development Indicators. The first dataset is designed for a regression problem with a continuous target variable, GDP growth. The second dataset is designed for a classification problem with a binary target variable, high inflation status.
The purpose of this report is to collect, clean, and explore both datasets before predictive modeling in Stage 2.
Dataset 1: Regression Dataset
Economic Question
Can renewable energy consumption help predict GDP growth across countries?
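# Download the indicators for all countries, 2000 to 2023;
# extra = TRUE adds region, income, and other country metadata
raw_data <- WDI(
  country = "all",
  indicator = indicators,
  start = 2000,
  end = 2023,
  extra = TRUE
)
glimpse(raw_data)
Rows: 6,384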
Columns: 18
$ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghan…
$ iso2c <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF"…
$ iso3c <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AF…
$ year <int> 2007, 2010, 2011, 2021, 2012, 2009, 2020, 2000, 2014…
$ status <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", …
$ lastupdated <chr> "2026-04-08", "2026-04-08", "2026-04-08", "2026-04-0…
$ gdp_growth <dbl> 13.8263195, 14.3624415, 0.4263548, -20.7388394, 12.7…
$ renewable_energy <dbl> 28.8, 15.2, 12.6, 20.0, 15.4, 16.5, 18.2, 45.0, 19.1…
$ population_growth <dbl> 1.8925975, 2.9346867, 3.6915031, 2.3560978, 4.047862…
$ trade_openness <dbl> NA, NA, NA, 51.41172, NA, NA, 46.70989, NA, NA, NA, …
$ inflation <dbl> 8.6805708, 2.1785375, 11.8041858, 5.1332034, 6.44121…
$ gdp_per_capita <dbl> 376.2232, 560.6215, 606.6947, 356.4962, 651.4171, 45…
$ region <chr> "Middle East, North Africa, Afghanistan & Pakistan",…
$ capital <chr> "Kabul", "Kabul", "Kabul", "Kabul", "Kabul", "Kabul"…
$ longitude <chr> "69.1761", "69.1761", "69.1761", "69.1761", "69.1761…
$ latitude <chr> "34.5228", "34.5228", "34.5228", "34.5228", "34.5228…
$ income <chr> "Low income", "Low income", "Low income", "Low incom…
$ lending <chr> "IDA", "IDA", "IDA", "IDA", "IDA", "IDA", "IDA", "ID…
# Clean dataset and remove missing values
clean_data <- raw_data %>%
  clean_names() %>%
  filter(region != "Aggregates") %>%
  select(
    country,
    iso2c,
    year,
    gdp_growth,
    renewable_energy,
    population_growth,
    trade_openness,
    inflation,
    gdp_per_capita,
    income
  ) %>%
  drop_na(
    gdp_growth,
    renewable_energy,
    population_growth,
    trade_openness,
    inflation,
    gdp_per_capita
  )
glimpse(clean_data)
Rows: 3,472
Columns: 10
$ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Albani…
$ iso2c <chr> "AF", "AF", "AF", "AL", "AL", "AL", "AL", "AL", "AL"…
$ year <int> 2021, 2020, 2022, 2003, 2019, 2020, 2005, 2021, 2018…
$ gdp_growth <dbl> -20.7388394, -2.3511007, -6.2401720, 5.3332643, 2.06…
$ renewable_energy <dbl> 20.0, 18.2, 20.0, 33.7, 40.1, 44.4, 36.8, 41.9, 37.8…
$ population_growth <dbl> 2.3560978, 3.1536092, 1.4357044, -0.3741492, -1.5431…
$ trade_openness <dbl> 51.41172, 46.70989, 72.88547, 64.82322, 75.38213, 59…
$ inflation <dbl> 5.13320341, 5.60188791, 13.71210237, 0.48400261, 1.4…
$ gdp_per_capita <dbl> 356.4962, 510.7871, 357.2612, 1908.6990, 6069.4390, …
$ income <chr> "Low income", "Low income", "Low income", "Upper mid…
nrow(clean_data)
[1] 3472
Dataset Description and Source
The dataset was obtained from the World Bank World Development Indicators database using the WDI package in R. It contains country-level observations from 2000 to 2023. The raw download has 6,384 rows and 18 columns; after removing regional aggregates and rows with missing values in any of the six indicators, the final dataset contains 3,472 country-year observations and 10 variables.
This dataset is relevant because renewable energy and economic growth are important topics in modern economics. Governments increasingly try to balance economic growth with environmental sustainability.
This is the regression dataset because the target variable, GDP growth, is continuous.
Source: https://data.worldbank.org/
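The cleaning step described above can be sanity-checked by counting missing values per column before dropping rows. The following is a minimal base R sketch on a made-up toy data frame (the values and the names `toy`, `needed`, and `retained_share` are illustrative assumptions, not part of the report); `complete.cases()` on the selected columns mirrors what `drop_na()` does for the listed variables.

```r
# Toy data frame standing in for the real download (values are made up).
toy <- data.frame(
  country = c("A", "A", "B", "B", "C"),
  gdp_growth = c(2.1, NA, 3.4, 1.0, -0.5),
  trade_openness = c(NA, 55.2, 60.1, NA, 48.9)
)

# Count missing values in each column before dropping anything.
na_counts <- colSums(is.na(toy))

# Keep only rows that are complete in the chosen columns.
needed <- c("gdp_growth", "trade_openness")
toy_complete <- toy[complete.cases(toy[, needed]), ]

# Share of rows that survive the filter.
retained_share <- nrow(toy_complete) / nrow(toy)

print(na_counts)
print(retained_share)
```

Running the same counts on the real data would show which indicator is responsible for most of the dropped rows.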
# Calculate summary statistics for renewable energy consumption
renewable_summary <- clean_data %>%
  summarise(
    mean = mean(renewable_energy),
    median = median(renewable_energy),
    sd = sd(renewable_energy),
    min = min(renewable_energy),
    q1 = quantile(renewable_energy, 0.25),
    q3 = quantile(renewable_energy, 0.75),
    max = max(renewable_energy)
  )
renewable_summary
      mean median       sd min  q1     q3  max
1 31.59453   23.1 28.75385   0 6.5 51.325 98.3
Interpretation of Summary Statistics
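The average renewable energy consumption is approximately 31.6 percent, while the median is 23.1 percent. The mean exceeds the median, which points to a right-skewed distribution. The standard deviation (about 28.8 percentage points) is nearly as large as the mean, indicating substantial variation across countries. Values range from 0 to 98.3 percent, and the quartiles show that a quarter of observations have a renewable share of 6.5 percent or less, while another quarter exceed 51.3 percent. Some countries use very little renewable energy, while others rely heavily on renewable sources.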
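The spread and skew described here can be quantified directly from the reported summary statistics. The sketch below copies the numbers from the `renewable_summary` output and computes the coefficient of variation, the interquartile range, and Pearson's second skewness coefficient; these particular measures and the variable names are an illustrative choice, not something the report itself computes.

```r
# Summary statistics copied from the renewable_summary output.
mean_re   <- 31.59453
median_re <- 23.1
sd_re     <- 28.75385
q1_re     <- 6.5
q3_re     <- 51.325

# Coefficient of variation: spread relative to the mean.
cv <- sd_re / mean_re

# Interquartile range: width of the middle half of the data.
iqr <- q3_re - q1_re

# Pearson's second skewness coefficient; positive values suggest right skew.
pearson_skew <- 3 * (mean_re - median_re) / sd_re

print(round(c(cv = cv, iqr = iqr, pearson_skew = pearson_skew), 3))
```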
# Create histogram of renewable energy consumption
ggplot(clean_data, aes(x = renewable_energy)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Renewable Energy Consumption Distribution",
    x = "Renewable energy consumption (% of total final energy consumption)",
    y = "Number of observations"
  )
Histogram Interpretation
The histogram indicates that renewable energy consumption is not normally distributed. Most observations are concentrated at lower and moderate renewable energy levels, while fewer countries exhibit extremely high renewable energy shares. This creates a right-skewed distribution with a long upper tail.
# Apply log transformation; 1 is added first because the minimum
# value is 0 and log(0) is -Inf
clean_data <- clean_data %>%
  mutate(log_renewable_energy = log(renewable_energy + 1))
ggplot(clean_data, aes(x = log_renewable_energy)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Distribution of Log Renewable Energy Consumption",
    x = "Log of renewable energy consumption",
    y = "Number of observations"
  )
Log Transformation Interpretation
After applying the log transformation, the distribution becomes less skewed and more balanced. The transformed variable appears closer to a normal distribution than the original variable.
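One way to put a number on "less skewed" is to compare sample skewness before and after the transformation. The sketch below does this on simulated right-skewed values, not on the World Bank data; `sample_skewness` is a hypothetical helper defined here, and the simulation parameters are arbitrary assumptions chosen only to produce a long upper tail.

```r
set.seed(42)

# Simulated right-skewed values; NOT the World Bank data.
x <- rlnorm(1000, meanlog = 3, sdlog = 0.8)

# Moment-based sample skewness (hypothetical helper):
# third central moment divided by the 1.5th power of the second.
sample_skewness <- function(v) {
  centered <- v - mean(v)
  mean(centered^3) / mean(centered^2)^1.5
}

# Same transformation as in the report: log(x + 1).
skew_raw <- sample_skewness(x)
skew_log <- sample_skewness(log(x + 1))

print(c(raw = skew_raw, logged = skew_log))
```

Applying `sample_skewness()` to `renewable_energy` and `log_renewable_energy` would give the corresponding figures for the actual dataset.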
Theoretical Distribution
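Based on the histogram, renewable energy consumption appears to follow a right-skewed distribution, so a log-normal distribution may approximate the data better than a normal distribution. The fit can only be approximate, because the variable is a percentage bounded between 0 and 100 and includes zeros, which is why the transformation uses log(x + 1) rather than log(x). After the transformation, the distribution becomes more symmetric and closer to a normal shape.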
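A simple way to compare the two candidate distributions is to fit each one and compare log-likelihoods. The sketch below does so on simulated, strictly positive values (an assumption: the real variable contains zeros, which would have to be shifted or excluded before `dlnorm()` can be used). The parameter estimates are plain sample means and standard deviations rather than formal maximum-likelihood fits.

```r
set.seed(123)

# Simulated, strictly positive, right-skewed values; NOT the World Bank data.
x <- rlnorm(500, meanlog = 3, sdlog = 0.9)

# Normal fit: mean and sd of the raw values.
ll_normal <- sum(dnorm(x, mean = mean(x), sd = sd(x), log = TRUE))

# Log-normal fit: mean and sd of the logged values.
ll_lognormal <- sum(
  dlnorm(x, meanlog = mean(log(x)), sdlog = sd(log(x)), log = TRUE)
)

# The larger (less negative) log-likelihood indicates the better fit.
print(c(normal = ll_normal, lognormal = ll_lognormal))
```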
Dataset 2: Classification Dataset
Economic Question
Can macroeconomic indicators classify countries as high-inflation or low-inflation?
Dataset Description and Source
The second dataset is also created from the World Bank World Development Indicators. It uses the same country-year observations, but the target variable is transformed into a binary classification outcome. The variable high_inflation equals 1 if a country-year observation has inflation above 10 percent, and 0 otherwise.
This dataset is relevant to economics because high inflation is an important macroeconomic problem. Classifying high-inflation periods can help policymakers understand which economic conditions are associated with inflation risk.
Source: https://data.worldbank.org/
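The threshold rule can be illustrated on a few made-up inflation values, including the boundary case. In this sketch an inflation rate of exactly 10 is coded 0, because the rule is strictly "above 10 percent", and a missing value stays missing. The vector below is hypothetical, and `as.integer()` is used as an alternative to `ifelse()` that yields the same 0/1 coding.

```r
# Hypothetical inflation rates in percent, including the boundary and an NA.
inflation <- c(2.5, 10, 10.01, 37.8, NA)

# Strict inequality: exactly 10 percent is NOT classified as high inflation.
high_inflation <- as.integer(inflation > 10)

# The ifelse() form gives the same coding, stored as doubles.
high_inflation_ifelse <- ifelse(inflation > 10, 1, 0)

print(high_inflation)
print(high_inflation_ifelse)
```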
# Create a binary target variable for the classification dataset
classification_data <- clean_data %>%
  mutate(
    high_inflation = ifelse(inflation > 10, 1, 0)
  ) %>%
  select(
    country,
    iso2c,
    year,
    high_inflation,
    inflation,
    gdp_growth,
    renewable_energy,
    population_growth,
    trade_openness,
    gdp_per_capita,
    income
  )
glimpse(classification_data)
Rows: 3,472
Columns: 11
$ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Albani…
$ iso2c <chr> "AF", "AF", "AF", "AL", "AL", "AL", "AL", "AL", "AL"…
$ year <int> 2021, 2020, 2022, 2003, 2019, 2020, 2005, 2021, 2018…
$ high_inflation <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ inflation <dbl> 5.13320341, 5.60188791, 13.71210237, 0.48400261, 1.4…
$ gdp_growth <dbl> -20.7388394, -2.3511007, -6.2401720, 5.3332643, 2.06…
$ renewable_energy <dbl> 20.0, 18.2, 20.0, 33.7, 40.1, 44.4, 36.8, 41.9, 37.8…
$ population_growth <dbl> 2.3560978, 3.1536092, 1.4357044, -0.3741492, -1.5431…
$ trade_openness <dbl> 51.41172, 46.70989, 72.88547, 64.82322, 75.38213, 59…
$ gdp_per_capita <dbl> 356.4962, 510.7871, 357.2612, 1908.6990, 6069.4390, …
$ income <chr> "Low income", "Low income", "Low income", "Upper mid…
nrow(classification_data)
[1] 3472
# Calculate summary statistics for the binary target variable
classification_summary <- classification_data %>%
  summarise(
    mean = mean(high_inflation),
    median = median(high_inflation),
    sd = sd(high_inflation),
    min = min(high_inflation),
    q1 = quantile(high_inflation, 0.25),
    q3 = quantile(high_inflation, 0.75),
    max = max(high_inflation)
  )
classification_summary
   mean median        sd min q1 q3 max
1 0.125      0 0.3307666   0  0  0   1
Interpretation of Summary Statistics
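The variable high_inflation is a binary target variable. A value of 1 means that inflation is above 10 percent, while a value of 0 means that inflation is 10 percent or below. The mean of this variable is the share of country-year observations classified as high inflation: 0.125, or 12.5 percent (434 of 3,472). Because the median and both quartiles are 0, the low-inflation class clearly dominates, so the two classes are imbalanced.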
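The link between the mean of a 0/1 variable and the class share can be checked by rebuilding a vector with the same size and share as the reported summary. This is a reconstruction from the printed figures (3,472 rows, mean 0.125), not the original data, and the names `n_total`, `share_high`, and `y` are illustrative.

```r
n_total    <- 3472    # rows reported by nrow(classification_data)
share_high <- 0.125   # mean reported in classification_summary

# Number of high-inflation observations implied by the reported share.
n_high <- round(n_total * share_high)

# Rebuild a 0/1 vector with the same class counts.
y <- rep(c(1, 0), times = c(n_high, n_total - n_high))

print(table(y))   # class counts
print(mean(y))    # share of ones
print(sd(y))      # sample standard deviation of the 0/1 variable
```

The standard deviation of this reconstructed vector can be compared with the `sd` column of the summary as a consistency check.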
# Create histogram for high inflation classification target
ggplot(classification_data, aes(x = high_inflation)) +
  geom_histogram(bins = 2) +
  labs(
    title = "High Inflation Classification Distribution",
    x = "High inflation status (0 = Low inflation, 1 = High inflation)",
    y = "Number of observations"
  )
Histogram Interpretation
The histogram shows the distribution of the binary classification target. Since high_inflation only takes the values 0 and 1, the distribution is not normal. Instead, it shows how many observations belong to the low-inflation group and how many belong to the high-inflation group.
# Apply log transformation to the binary target variable
classification_data <- classification_data %>%
  mutate(log_high_inflation = log(high_inflation + 1))
ggplot(classification_data, aes(x = log_high_inflation)) +
  geom_histogram(bins = 2) +
  labs(
    title = "Log-Transformed High Inflation Distribution",
    x = "Log of high inflation status",
    y = "Number of observations"
  )
Log Transformation Interpretation
Because high_inflation is a binary variable, the log transformation does not create a normal distribution. It only changes the values from 0 and 1 to 0 and log(2). Therefore, log transformation is more useful for continuous skewed variables than for binary classification targets.
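The point can be verified in a few lines: for a 0/1 variable, log(x + 1) is just the original variable multiplied by log(2), so the shape of the distribution is unchanged. The vector `y` below is a made-up example.

```r
# Made-up binary vector standing in for high_inflation.
y <- c(0, 1, 0, 0, 1)

# Same transformation as in the report.
log_y <- log(y + 1)

# Only two distinct values remain: 0 and log(2).
print(sort(unique(log_y)))

# The transform is a pure rescaling of the original 0/1 values.
print(isTRUE(all.equal(log_y, y * log(2))))
```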
Theoretical Distribution
The high_inflation variable follows a Bernoulli distribution because it has only two possible outcomes: 0 or 1. Therefore, a normal or log-normal distribution is not appropriate for this target variable.
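For reference, the Bernoulli model can be written out as follows, where p denotes the probability that a country-year observation is classified as high inflation. The numerical line plugs in the sample mean of 0.125 from the summary statistics as an estimate of p; treating the observations as independent draws is a simplifying assumption.

```latex
P(X = x) = p^{x}(1-p)^{1-x}, \quad x \in \{0, 1\}
\mathbb{E}[X] = p, \qquad \operatorname{Var}(X) = p(1-p)
\hat{p} = 0.125 \;\Rightarrow\; \hat{p}(1-\hat{p}) = 0.125 \times 0.875 = 0.109375
```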
Conclusion
In Stage 1, two economic datasets were prepared for future predictive modeling. The first dataset is a regression dataset with GDP growth as the continuous target variable. The second dataset is a classification dataset with high inflation status as the binary target variable.
The probability analysis showed that renewable energy consumption is right-skewed and becomes more balanced after log transformation. For the classification dataset, high_inflation follows a Bernoulli distribution because it only takes the values 0 and 1. Overall, both datasets satisfy the project requirements and are ready for Stage 2 modeling.