As an Economics major, I’ve learned quite a bit about macroeconomic theory. Each country reports economic indicators that give insight into its economy’s performance. Whether it’s GDP (gross domestic product) or the inflation rate, these indicators help showcase the state of an economy. For this project, I’ve examined these indicators to help answer the following questions:
The data set itself is from Kaggle, and it describes each country’s economic indicators from 2010-2025. The data are a sample: the population would be every country’s indicators across all periods, including those outside the 2010-2025 interval. However, since nations usually record their economic activity on a quarterly basis, it would be difficult to assess all of them. The data set relies on data reported by the World Bank, which in turn collects information from national statistical offices, central banks, and federal governments. Although I won’t be analyzing every variable, there are 10 key variables: GDP, GDP per Capita, Inflation, Unemployment Rate, Public Debt, Government Revenue/Expense, Current Account Balance, Tax Revenue, Interest Rate, and GDP Growth.
# Load packages
library(readr)
## Warning: package 'readr' was built under R version 4.5.2
# Read file
world_bank_data <- read_csv("world_bank_data_2025.csv")
## Rows: 3472 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): country_name, country_id
## dbl (14): year, Inflation (CPI %), GDP (Current USD), GDP per Capita (Curren...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View raw data + check for missing values
summary(world_bank_data)
## country_name country_id year Inflation (CPI %)
## Length:3472 Length:3472 Min. :2010 Min. : -6.687
## Class :character Class :character 1st Qu.:2014 1st Qu.: 1.402
## Mode :character Mode :character Median :2018 Median : 3.214
## Mean :2018 Mean : 6.233
## 3rd Qu.:2021 3rd Qu.: 6.187
## Max. :2025 Max. :557.202
## NA's :778
## GDP (Current USD) GDP per Capita (Current USD) Unemployment Rate (%)
## Min. :3.211e+07 Min. : 193 Min. : 0.100
## 1st Qu.:6.265e+09 1st Qu.: 2281 1st Qu.: 3.611
## Median :2.587e+10 Median : 6828 Median : 5.771
## Mean :3.964e+11 Mean : 18483 Mean : 7.841
## 3rd Qu.:1.875e+11 3rd Qu.: 23727 3rd Qu.:10.732
## Max. :2.772e+13 Max. :256581 Max. :35.359
## NA's :539 NA's :534 NA's :677
## Interest Rate (Real, %) Inflation (GDP Deflator, %) GDP Growth (% Annual)
## Min. :-81.132 Min. :-28.760 Min. :-54.336
## 1st Qu.: 1.734 1st Qu.: 1.218 1st Qu.: 0.997
## Median : 5.079 Median : 3.223 Median : 3.100
## Mean : 5.405 Mean : 6.635 Mean : 2.854
## 3rd Qu.: 8.869 3rd Qu.: 6.905 3rd Qu.: 5.355
## Max. : 61.883 Max. :921.536 Max. : 86.827
## NA's :1737 NA's :568 NA's :560
## Current Account Balance (% GDP) Government Expense (% of GDP)
## Min. :-60.878 Min. : 0.00014
## 1st Qu.: -7.497 1st Qu.: 17.51148
## Median : -2.656 Median : 26.00085
## Mean : -2.363 Mean : 27.32536
## 3rd Qu.: 1.855 3rd Qu.: 34.88458
## Max. :235.751 Max. :103.72579
## NA's :909 NA's :1652
## Government Revenue (% of GDP) Tax Revenue (% of GDP)
## Min. : 0.00008 Min. : 0.00006
## 1st Qu.: 17.63915 1st Qu.: 12.28534
## Median : 24.82142 Median : 16.32144
## Mean : 26.67747 Mean : 16.96992
## 3rd Qu.: 32.70078 3rd Qu.: 21.44866
## Max. :344.99945 Max. :147.64020
## NA's :1643 NA's :1639
## Gross National Income (USD) Public Debt (% of GDP)
## Min. :5.108e+07 Min. : 1.846
## 1st Qu.:7.476e+09 1st Qu.: 33.894
## Median :2.987e+10 Median : 51.651
## Mean :4.142e+11 Mean : 61.864
## 3rd Qu.:1.973e+11 3rd Qu.: 81.931
## Max. :2.758e+13 Max. :249.366
## NA's :676 NA's :2620
sum(is.na(world_bank_data))
## [1] 14532
# Delete rows with missing values
world_bank_clean <- world_bank_data[complete.cases(world_bank_data),]
# Change 2 columns to factor variables
world_bank_clean$country_name <- as.factor(world_bank_clean$country_name)
world_bank_clean$country_id <- as.factor(world_bank_clean$country_id)
str(world_bank_clean[c("country_name", "country_id", "year")])
## tibble [535 × 3] (S3: tbl_df/tbl/data.frame)
## $ country_name: Factor w/ 59 levels "Albania","Armenia",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ country_id : Factor w/ 59 levels "al","am","au",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : num [1:535] 2011 2012 2013 2014 2015 ...
As shown in the code, I decided to delete any rows with missing values using the complete.cases() function (I found this method online and cite the website at the end). When I initially looked at the data on Kaggle, I noticed that the majority of rows were missing data in at least one column. Because of this, I was planning to drop the columns that lacked the most data, since I didn’t want to lose too many rows. However, after using the summary function, I saw that there were 3,472 observations across 16 columns, so I decided instead to prioritize a complete dataset. After removing the rows with missing values, I ended up with 535 observations, which is still enough to show the big picture.
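The trade-off described above (dropping sparse columns versus dropping incomplete rows) can be checked before committing to either. The following is a minimal sketch on a made-up data frame; `toy`, its columns, and its values are hypothetical stand-ins, not the World Bank data.

```r
# Toy data frame standing in for the real data (all values are made up)
toy <- data.frame(
  country = c("A", "A", "B", "B", "C", "C"),
  gdp     = c(1.0, 1.1, NA, 2.0, 3.0, 3.1),
  debt    = c(NA, NA, NA, 40, 55, NA),
  stringsAsFactors = FALSE
)

# Missing values per column
na_counts <- colSums(is.na(toy))

# Option 1: keep all columns, drop every row with any missing value
rows_kept_all <- sum(complete.cases(toy))

# Option 2: drop the sparsest column first, then drop incomplete rows
sparsest <- names(which.max(na_counts))
toy_trimmed <- toy[, setdiff(names(toy), sparsest)]
rows_kept_trimmed <- sum(complete.cases(toy_trimmed))

print(na_counts)
print(c(all_columns = rows_kept_all, without_sparsest = rows_kept_trimmed))
```

In the summary output above, Public Debt (% of GDP) has the most missing values (2,620 of 3,472 rows), so it is the column that such a comparison would single out.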
# Descriptive Statistics:
# 1. View structure of cleaned data
str(world_bank_clean)
## tibble [535 × 16] (S3: tbl_df/tbl/data.frame)
## $ country_name : Factor w/ 59 levels "Albania","Armenia",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ country_id : Factor w/ 59 levels "al","am","au",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : num [1:535] 2011 2012 2013 2014 2015 ...
## $ Inflation (CPI %) : num [1:535] 3.43 2.03 1.94 1.63 3.5 ...
## $ GDP (Current USD) : num [1:535] 1.29e+10 1.23e+10 1.28e+10 1.32e+10 1.14e+10 ...
## $ GDP per Capita (Current USD) : num [1:535] 4437 4248 4413 4579 3953 ...
## $ Unemployment Rate (%) : num [1:535] 13.5 13.4 15.9 18.1 17.2 ...
## $ Interest Rate (Real, %) : num [1:535] 9.89 9.74 9.51 6.32 7.27 ...
## $ Inflation (GDP Deflator, %) : num [1:535] 2.315 1.043 0.289 1.55 0.564 ...
## $ GDP Growth (% Annual) : num [1:535] 2.55 1.42 1 1.77 2.22 ...
## $ Current Account Balance (% GDP): num [1:535] -12.93 -10.2 -9.27 -10.78 -8.6 ...
## $ Government Expense (% of GDP) : num [1:535] 23 23 24.2 24.4 24.4 ...
## $ Government Revenue (% of GDP) : num [1:535] 23.9 23.1 22.3 24.5 24.7 ...
## $ Tax Revenue (% of GDP) : num [1:535] 18 17.5 16.5 18.3 18.5 ...
## $ Gross National Income (USD) : num [1:535] 1.29e+10 1.22e+10 1.30e+10 1.33e+10 1.15e+10 ...
## $ Public Debt (% of GDP) : num [1:535] 69.6 63.7 70.6 73.3 79.9 ...
# 2. Averages
mean(world_bank_clean$`Inflation (CPI %)`, na.rm = TRUE)
## [1] 4.706523
mean(world_bank_clean$`GDP (Current USD)`, na.rm = TRUE)
## [1] 771288857174
mean(world_bank_clean$`GDP per Capita (Current USD)`, na.rm = TRUE)
## [1] 16222.4
mean(world_bank_clean$`Unemployment Rate (%)`, na.rm = TRUE)
## [1] 7.992738
mean(world_bank_clean$`Public Debt (% of GDP)`, na.rm = TRUE)
## [1] 54.6657
# 3. Standard Deviations
sd(world_bank_clean$`Inflation (CPI %)`, na.rm = TRUE)
## [1] 5.553966
sd(world_bank_clean$`GDP (Current USD)`, na.rm = TRUE)
## [1] 2.828027e+12
sd(world_bank_clean$`GDP per Capita (Current USD)`, na.rm = TRUE)
## [1] 21064.36
sd(world_bank_clean$`Unemployment Rate (%)`, na.rm = TRUE)
## [1] 6.117287
sd(world_bank_clean$`Public Debt (% of GDP)`, na.rm = TRUE)
## [1] 30.56349
# Minimums & Maximums
min(world_bank_clean$`GDP (Current USD)`, na.rm = TRUE)
## [1] 420828262
max(world_bank_clean$`GDP (Current USD)`, na.rm = TRUE)
## [1] 2.368117e+13
# 4. Proportion of countries GDPs above average?
average_gdp <- mean(world_bank_clean$`GDP (Current USD)`, na.rm = TRUE)
gdp_above_average <- world_bank_clean[
world_bank_clean$`GDP (Current USD)` > average_gdp,
]
head(gdp_above_average)
## # A tibble: 6 × 16
## country_name country_id year `Inflation (CPI %)` `GDP (Current USD)`
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 Australia au 2010 2.92 1.15e12
## 2 Australia au 2011 3.30 1.40e12
## 3 Australia au 2012 1.76 1.55e12
## 4 Australia au 2013 2.45 1.58e12
## 5 Australia au 2014 2.49 1.47e12
## 6 Australia au 2015 1.51 1.35e12
## # ℹ 11 more variables: `GDP per Capita (Current USD)` <dbl>,
## # `Unemployment Rate (%)` <dbl>, `Interest Rate (Real, %)` <dbl>,
## # `Inflation (GDP Deflator, %)` <dbl>, `GDP Growth (% Annual)` <dbl>,
## # `Current Account Balance (% GDP)` <dbl>,
## # `Government Expense (% of GDP)` <dbl>,
## # `Government Revenue (% of GDP)` <dbl>, `Tax Revenue (% of GDP)` <dbl>,
## # `Gross National Income (USD)` <dbl>, `Public Debt (% of GDP)` <dbl>
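The subset above lists the above-average rows but does not turn them into a proportion. A minimal sketch of that last step follows, on a made-up data frame (`toy` and its values are hypothetical). Because each row is a country-year, the share of rows and the share of countries can differ, so both are shown.

```r
# Toy country-year data (values are made up)
toy <- data.frame(
  country_name = c("A", "A", "B", "B", "C", "C", "D", "D"),
  gdp          = c(1, 2, 3, 4, 50, 10, 5, 6),
  stringsAsFactors = FALSE
)

avg <- mean(toy$gdp)

# Share of country-year rows above the average:
# the mean of a logical vector is the proportion of TRUE values
prop_rows <- mean(toy$gdp > avg)

# Share of countries with at least one above-average year
countries_above <- unique(toy$country_name[toy$gdp > avg])
prop_countries <- length(countries_above) / length(unique(toy$country_name))

print(c(rows = prop_rows, countries = prop_countries))
```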
# 5. What's the poorest country and richest country?
# Richest nation:
world_bank_clean[
which.max(world_bank_clean$`GDP (Current USD)`),
]
## # A tibble: 1 × 16
## country_name country_id year `Inflation (CPI %)` `GDP (Current USD)`
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 United States us 2021 4.70 2.37e13
## # ℹ 11 more variables: `GDP per Capita (Current USD)` <dbl>,
## # `Unemployment Rate (%)` <dbl>, `Interest Rate (Real, %)` <dbl>,
## # `Inflation (GDP Deflator, %)` <dbl>, `GDP Growth (% Annual)` <dbl>,
## # `Current Account Balance (% GDP)` <dbl>,
## # `Government Expense (% of GDP)` <dbl>,
## # `Government Revenue (% of GDP)` <dbl>, `Tax Revenue (% of GDP)` <dbl>,
## # `Gross National Income (USD)` <dbl>, `Public Debt (% of GDP)` <dbl>
# Poorest nation:
world_bank_clean[
which.min(world_bank_clean$`GDP (Current USD)`),
]
## # A tibble: 1 × 16
## country_name country_id year `Inflation (CPI %)` `GDP (Current USD)`
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 Tonga to 2016 2.58 420828262.
## # ℹ 11 more variables: `GDP per Capita (Current USD)` <dbl>,
## # `Unemployment Rate (%)` <dbl>, `Interest Rate (Real, %)` <dbl>,
## # `Inflation (GDP Deflator, %)` <dbl>, `GDP Growth (% Annual)` <dbl>,
## # `Current Account Balance (% GDP)` <dbl>,
## # `Government Expense (% of GDP)` <dbl>,
## # `Government Revenue (% of GDP)` <dbl>, `Tax Revenue (% of GDP)` <dbl>,
## # `Gross National Income (USD)` <dbl>, `Public Debt (% of GDP)` <dbl>
# Visualizations
# 1. Histogram
hist_gdp <- hist(world_bank_clean$`GDP (Current USD)`,
main = "Histogram of GDP (Current USD)",
xlab = "GDP",
ylab = "Frequency",
col = "lightblue",
breaks = 20)
# 2. Boxplot
boxplot_gdp <- boxplot(world_bank_clean$`GDP (Current USD)`,
main = "Boxplot of GDP",
ylab = "GDP (Current USD)",
col = "orange")
# 3. Scatter plot
plot(world_bank_clean$`GDP (Current USD)`,
world_bank_clean$`GDP per Capita (Current USD)`,
main = "Scatter Plot: GDP vs GDP per Capita",
xlab = "GDP (Current USD)",
ylab = "GDP per Capita (Current USD)",
pch = 19,
col = "blue")
# 4. Showcasing top ten countries by GDP!
# (rows are country-years, so keep each country's largest recorded GDP;
# otherwise one country can fill several of the ten bars)
peak_gdp <- tapply(world_bank_clean$`GDP (Current USD)`,
world_bank_clean$country_name,
max)
# (top 10 countries, highest to lowest)
top_10 <- sort(peak_gdp, decreasing = TRUE)[1:10]
# Bar plot
barplot(top_10,
names.arg = names(top_10),
main = "Top 10 Countries by Peak GDP",
ylab = "GDP (Current USD)",
col = "pink",
las = 2,
cex.names = 0.7)
In macroeconomics, the Phillips curve describes the short-run inverse relationship between inflation and unemployment. Before testing how inflation and unemployment relate along the curve, we need to know how each variable is distributed.
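The central limit theorem states that as the sample size grows (a common rule of thumb is n > 30), the sampling distribution of the sample mean becomes approximately normal, whatever the shape of the underlying data. Since there are 535 rows here, n is well above 30, so the sample means used in the intervals and tests below should be approximately normally distributed. The histograms and QQ-plots that follow show the shape of the raw data themselves.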
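A small simulation can make the theorem concrete. This sketch assumes an exponential distribution as a stand-in for skewed data (it is not the project’s data) and compares the skewness of one raw sample with the skewness of many sample means, using the same n = 535. The `skewness` helper is defined here for illustration.

```r
set.seed(42)
n <- 535

# Simple moment-based skewness (0 for a symmetric distribution)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# One raw sample from a strongly right-skewed distribution
raw_draw <- rexp(n, rate = 1)

# Means of 2000 independent samples of the same size
sample_means <- replicate(2000, mean(rexp(n, rate = 1)))

cat("Skewness of one raw sample:  ", round(skewness(raw_draw), 2), "\n")
cat("Skewness of the sample means:", round(skewness(sample_means), 2), "\n")
cat("SD of the sample means:      ", round(sd(sample_means), 4),
    "(theory: 1 / sqrt(n) =", round(1 / sqrt(n), 4), ")\n")
```

The raw sample stays skewed no matter how large n is; only the distribution of the mean tightens and becomes symmetric.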
# Inflation:
# Histogram
hist(world_bank_clean$`Inflation (CPI %)`,
main = "Histogram of Inflation (CPI %)",
xlab = "Inflation (CPI %)",
col = "lightblue",
breaks = 20)
# QQ-plot
qqnorm(world_bank_clean$`Inflation (CPI %)`)
qqline(world_bank_clean$`Inflation (CPI %)`, col="lightgreen")
# Unemployment:
# Histogram
hist(world_bank_clean$`Unemployment Rate (%)`,
main = "Histogram of Unemployment Rate",
xlab = "Unemployment Rate (%)",
col = "lightgreen",
breaks = 20)
# QQ-plot
qqnorm(world_bank_clean$`Unemployment Rate (%)`)
qqline(world_bank_clean$`Unemployment Rate (%)`, col="yellow")
Based on the visualizations, both inflation and unemployment are approximately normal. Although the points don’t align perfectly with the reference line, we can still test these two variables and their relationship in the Phillips curve.
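A visual check can be paired with a formal one. The sketch below uses `shapiro.test()` from base R on simulated data (not the project’s data) to show what the test reports for a roughly normal sample and for a right-skewed one. With several hundred observations the test is sensitive to small departures, so it is best read alongside the plots rather than instead of them.

```r
set.seed(1)

# Simulated stand-ins (values are made up)
x_normal <- rnorm(500, mean = 5, sd = 2)   # roughly bell-shaped
x_skewed <- rexp(500, rate = 0.2)          # long right tail

# Shapiro-Wilk test: a small p-value is evidence against normality
p_normal <- shapiro.test(x_normal)$p.value
p_skewed <- shapiro.test(x_skewed)$p.value

print(c(normal_sample = p_normal, skewed_sample = p_skewed))
```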
# 1. Confidence interval for inflation (only)
mean_inflation <- mean(world_bank_clean$`Inflation (CPI %)`)
se_inflation <- sd(world_bank_clean$`Inflation (CPI %)`) /
sqrt(length(world_bank_clean$`Inflation (CPI %)`))
ci_lower <- mean_inflation - 1.96 * se_inflation
ci_upper <- mean_inflation + 1.96 * se_inflation
print(ci_lower)
## [1] 4.23589
print(ci_upper)
## [1] 5.177156
# 2. Confidence interval for unemployment difference between rich and poor countries
median_gdp <- median(world_bank_clean$`GDP per Capita (Current USD)`, na.rm = TRUE)
rich <- world_bank_clean$`Unemployment Rate (%)`[
world_bank_clean$`GDP per Capita (Current USD)` >= median_gdp
]
poor <- world_bank_clean$`Unemployment Rate (%)`[
world_bank_clean$`GDP per Capita (Current USD)` < median_gdp
]
mean_rich <- mean(rich, na.rm = TRUE)
mean_poor <- mean(poor, na.rm = TRUE)
diff <- mean_rich - mean_poor
se_diff <- sqrt(
sd(rich, na.rm = TRUE)^2 / length(na.omit(rich)) +
sd(poor, na.rm = TRUE)^2 / length(na.omit(poor))
)
lower_diff <- diff - 1.96 * se_diff
upper_diff <- diff + 1.96 * se_diff
print(lower_diff)
## [1] -2.906194
print(upper_diff)
## [1] -0.853623
Based on the first interval, we’re 95% confident that the true mean inflation rate for all countries lies between 4.2359 and 5.1772.
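For reference, the first interval can be reproduced by hand from the mean (4.7065) and standard deviation (5.5540) reported earlier, with n = 535 and the normal critical value 1.96:

$$
\bar{x} \pm 1.96\,\frac{s}{\sqrt{n}}
= 4.7065 \pm 1.96 \times \frac{5.5540}{\sqrt{535}}
\approx 4.7065 \pm 0.4706
$$

This gives roughly (4.236, 5.177), matching the printed bounds up to rounding.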
Based on the second interval, we’re 95% confident that the difference in mean unemployment rates between rich and poor countries (rich minus poor) lies between -2.9062 and -0.8536 percentage points. The interval does not contain zero, so the difference is statistically significant at the 5% level: richer countries have lower average unemployment than poorer ones.
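The hand-built interval for a difference in means can be cross-checked against `t.test()`, which runs a Welch two-sample test by default and returns its own 95% interval. The sketch below uses simulated groups (`rich` and `poor` here are made-up numbers, not the project’s vectors). The two intervals differ only slightly because `t.test()` uses a t critical value instead of 1.96.

```r
set.seed(7)

# Simulated unemployment-like values for two groups (made up)
rich <- rnorm(260, mean = 6, sd = 5)
poor <- rnorm(260, mean = 10, sd = 7)

# Manual interval: difference in means +/- 1.96 standard errors
diff_means <- mean(rich) - mean(poor)
se_diff <- sqrt(var(rich) / length(rich) + var(poor) / length(poor))
manual_ci <- diff_means + c(-1, 1) * 1.96 * se_diff

# Welch interval from t.test()
welch_ci <- as.numeric(t.test(rich, poor)$conf.int)

# An interval supports "no difference" only if it straddles zero
contains_zero <- manual_ci[1] <= 0 && manual_ci[2] >= 0

print(rbind(manual = manual_ci, welch = welch_ci))
print(contains_zero)
```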
# Hypothesis test 1: unemployment in 2016 vs. 2017
unemp_2016 <- world_bank_clean$`Unemployment Rate (%)`[
world_bank_clean$year == 2016
]
unemp_2017 <- world_bank_clean$`Unemployment Rate (%)`[
world_bank_clean$year == 2017
]
# Two-sample t-test (Welch by default; t.test() drops NAs itself, so na.rm is not needed)
test1 <- t.test(unemp_2016, unemp_2017)
# Results
test1$statistic
## t
## 0.1491322
test1$p.value
## [1] 0.8818253
mean(unemp_2016, na.rm = TRUE)
## [1] 8.24119
mean(unemp_2017, na.rm = TRUE)
## [1] 8.027525
# Hypothesis test 2: proportion of countries with unemployment above 6%, 2016 vs. 2017
high_unemp_2016 <- unemp_2016 > 6
high_unemp_2017 <- unemp_2017 > 6
# Counts
x <- c(sum(high_unemp_2016, na.rm = TRUE),
sum(high_unemp_2017, na.rm = TRUE))
n <- c(length(na.omit(unemp_2016)),
length(na.omit(unemp_2017)))
# Proportion test
test2 <- prop.test(x, n)
# Results
test2$statistic
## X-squared
## 6.314619e-31
test2$p.value
## [1] 1
For the first hypothesis test, the null hypothesis states that the average unemployment rate in 2016 equals the average in 2017, and the alternative states that they are different. The t-test gave a test statistic of 0.1491 and a p‑value of 0.8818. Since the p‑value is greater than 0.05, we fail to reject the null hypothesis. There is no statistically significant difference in average unemployment rates between 2016 and 2017.
For the second hypothesis test, a proportion test, the null hypothesis states that the proportion of countries with high unemployment (above 6%) is the same in 2016 and 2017. The alternative hypothesis states that the two proportions differ. The chi-square statistic is approximately zero, with a p-value of 1. Since the p-value is well above 0.05, we also fail to reject the null hypothesis. These numbers indicate that the two proportions were nearly identical.
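A chi-square statistic of essentially zero can arise when the two proportions are very close and the continuity correction that `prop.test()` applies by default absorbs the small remaining gap. The sketch below illustrates this with made-up counts; it is an assumption that something similar happened in the test above, since the actual counts are not printed.

```r
# Made-up counts: 21 of 40 and 20 of 40 countries above the 6% threshold
x <- c(21, 20)
n <- c(40, 40)

# Default: Yates continuity correction is applied
corrected <- prop.test(x, n)

# Same data without the correction
uncorrected <- prop.test(x, n, correct = FALSE)

print(c(corrected   = unname(corrected$statistic),
        uncorrected = unname(uncorrected$statistic)))
print(c(corrected   = corrected$p.value,
        uncorrected = uncorrected$p.value))
```

Either way the conclusion is the same here (fail to reject), but the uncorrected statistic shows that the proportions are close, not necessarily identical.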
Overall, these results make sense, given that unemployment usually changes only slightly from one period to the next. Although it could be a different story on a quarter-to-quarter basis, yearly unemployment is consistent unless a major event (such as COVID-19 or the Great Depression) occurs.
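With a p-value above 0.05, the 95% confidence interval that the t-test reports for the 2016-2017 difference in means must contain 0, which is consistent with failing to reject the null hypothesis. The two intervals computed earlier answer different questions (mean inflation, and the rich-poor unemployment gap of -2.9062 to -0.8536), and neither of them contains 0.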
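The link between a two-sided t-test and its confidence interval can be checked directly, since `t.test()` returns both. This sketch uses simulated values for the two years (made-up numbers, not the project’s vectors).

```r
set.seed(3)

# Simulated unemployment-like values for two years (made up)
y2016 <- rnorm(40, mean = 8.2, sd = 6)
y2017 <- rnorm(40, mean = 8.0, sd = 6)

res <- t.test(y2016, y2017)
ci <- as.numeric(res$conf.int)

# The 95% interval covers 0 exactly when the two-sided p-value exceeds 0.05
ci_contains_zero <- ci[1] <= 0 && ci[2] >= 0
fails_to_reject <- res$p.value > 0.05

print(ci)
print(c(ci_contains_zero = ci_contains_zero, fails_to_reject = fails_to_reject))
```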
After a thorough analysis of the World Bank data, I drew the following conclusions:
The main limitation was most likely the missing data, which would’ve given clearer insight into the research questions. I started with 3,472 observations and ended with 535, so the loss could’ve affected the accuracy of the results. In addition, removing rows with missing data (NAs) could’ve introduced bias, since I’m selecting observations out of convenience rather than taking a holistic view of all countries. If a country such as Algeria, for example, had missing data points in every recorded year, the entire country would drop out of my analysis.
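One way to gauge this bias is to list which countries disappear entirely after cleaning and how many rows each remaining country keeps. The following is a minimal sketch on a made-up data frame (`raw`, `clean`, and their values are hypothetical stand-ins for the raw and cleaned data).

```r
# Toy raw data (values are made up); country "B" has no complete row
raw <- data.frame(
  country_name = c("A", "A", "B", "B", "C", "C"),
  year         = c(2010, 2011, 2010, 2011, 2010, 2011),
  gdp          = c(1, 2, NA, NA, 3, NA),
  stringsAsFactors = FALSE
)

clean <- raw[complete.cases(raw), ]

# Countries present in the raw data but absent after cleaning
dropped <- setdiff(unique(raw$country_name), unique(clean$country_name))

# Rows each surviving country contributes to the cleaned data
rows_per_country <- table(clean$country_name)

print(dropped)
print(rows_per_country)
```

In this project the cleaned data keep 59 countries (the factor levels shown by `str()`), so the same comparison against the raw file would show which countries were excluded.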
For the future, a good extension would be factoring in major events such as COVID-19. Since this dataset spans 2010-2025, I could examine the years 2020-2021 and compare them to earlier years to see how they change the answers to the fundamental questions being asked. Also, even though this specific dataset doesn’t have it, perhaps more local data, such as specific towns or cities, could be analyzed instead of entire nations; I wonder how the data would differ in a smaller context?
For this project, I used the AI model DeepSeek to help format my code towards the end, and specifically to help debug and simplify the code for the confidence interval section. I also referenced various websites for coding help, which are listed below:
Sources:
https://www.geeksforgeeks.org/r-language/remove-rows-with-missing-values-using-r/
https://www.statology.org/character-to-factor-in-r/
https://www.geeksforgeeks.org/r-language/which-function-in-r/
https://www.statology.org/two-sample-t-test-in-r/
https://www.sthda.com/english/wiki/two-proportions-z-test-in-r
https://www.statisticshowto.com/probability-and-statistics/