1. Project Overview

As an Economics major, I’ve learned quite a bit about macroeconomic theory. Each country reports different economic indicators that give insight into its economy’s performance. Whether it’s GDP (gross domestic product) or the inflation rate, all of these factors help showcase the state of an economy. For this project, I examined these indicators to help answer the following questions:

The data set itself is from Kaggle, and it describes each country’s economic indicators from 2010-2025. The population would be these indicators for every country across all years, not just the 2010-2025 window covered here. However, since nations usually record their economic activity on a quarterly basis, it would be difficult to assess all of them. The data set relies on reported data from the World Bank, which in turn collects information from national statistics offices, central banks, and federal governments. Although I won’t constantly be analyzing all of them, there are 10 key variables: GDP, GDP per Capita, Inflation, Unemployment Rate, Public Debt, Government Revenue/Expense, Current Account Balance, Tax Revenue, Interest Rate, and GDP Growth.

2. Download Files
# Load packages

library(readr)
## Warning: package 'readr' was built under R version 4.5.2
# Read file

world_bank_data <- read_csv("world_bank_data_2025.csv")
## Rows: 3472 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): country_name, country_id
## dbl (14): year, Inflation (CPI %), GDP (Current USD), GDP per Capita (Curren...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
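As the message itself suggests, the column-specification note can be silenced on future knits (an optional tweak, not something the original call does):

# Optional: re-read the file quietly
world_bank_data <- read_csv("world_bank_data_2025.csv", show_col_types = FALSE)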
3. Data Cleaning
# View raw data + check for missing values  

summary(world_bank_data)
##  country_name        country_id             year      Inflation (CPI %)
##  Length:3472        Length:3472        Min.   :2010   Min.   : -6.687  
##  Class :character   Class :character   1st Qu.:2014   1st Qu.:  1.402  
##  Mode  :character   Mode  :character   Median :2018   Median :  3.214  
##                                        Mean   :2018   Mean   :  6.233  
##                                        3rd Qu.:2021   3rd Qu.:  6.187  
##                                        Max.   :2025   Max.   :557.202  
##                                                       NA's   :778      
##  GDP (Current USD)   GDP per Capita (Current USD) Unemployment Rate (%)
##  Min.   :3.211e+07   Min.   :   193               Min.   : 0.100       
##  1st Qu.:6.265e+09   1st Qu.:  2281               1st Qu.: 3.611       
##  Median :2.587e+10   Median :  6828               Median : 5.771       
##  Mean   :3.964e+11   Mean   : 18483               Mean   : 7.841       
##  3rd Qu.:1.875e+11   3rd Qu.: 23727               3rd Qu.:10.732       
##  Max.   :2.772e+13   Max.   :256581               Max.   :35.359       
##  NA's   :539         NA's   :534                  NA's   :677          
##  Interest Rate (Real, %) Inflation (GDP Deflator, %) GDP Growth (% Annual)
##  Min.   :-81.132         Min.   :-28.760             Min.   :-54.336      
##  1st Qu.:  1.734         1st Qu.:  1.218             1st Qu.:  0.997      
##  Median :  5.079         Median :  3.223             Median :  3.100      
##  Mean   :  5.405         Mean   :  6.635             Mean   :  2.854      
##  3rd Qu.:  8.869         3rd Qu.:  6.905             3rd Qu.:  5.355      
##  Max.   : 61.883         Max.   :921.536             Max.   : 86.827      
##  NA's   :1737            NA's   :568                 NA's   :560          
##  Current Account Balance (% GDP) Government Expense (% of GDP)
##  Min.   :-60.878                 Min.   :  0.00014            
##  1st Qu.: -7.497                 1st Qu.: 17.51148            
##  Median : -2.656                 Median : 26.00085            
##  Mean   : -2.363                 Mean   : 27.32536            
##  3rd Qu.:  1.855                 3rd Qu.: 34.88458            
##  Max.   :235.751                 Max.   :103.72579            
##  NA's   :909                     NA's   :1652                 
##  Government Revenue (% of GDP) Tax Revenue (% of GDP)
##  Min.   :  0.00008             Min.   :  0.00006     
##  1st Qu.: 17.63915             1st Qu.: 12.28534     
##  Median : 24.82142             Median : 16.32144     
##  Mean   : 26.67747             Mean   : 16.96992     
##  3rd Qu.: 32.70078             3rd Qu.: 21.44866     
##  Max.   :344.99945             Max.   :147.64020     
##  NA's   :1643                  NA's   :1639          
##  Gross National Income (USD) Public Debt (% of GDP)
##  Min.   :5.108e+07           Min.   :  1.846       
##  1st Qu.:7.476e+09           1st Qu.: 33.894       
##  Median :2.987e+10           Median : 51.651       
##  Mean   :4.142e+11           Mean   : 61.864       
##  3rd Qu.:1.973e+11           3rd Qu.: 81.931       
##  Max.   :2.758e+13           Max.   :249.366       
##  NA's   :676                 NA's   :2620
sum(is.na(world_bank_data))
## [1] 14532
# Delete rows with missing values 

world_bank_clean <- world_bank_data[complete.cases(world_bank_data),]

# Change 2 columns to factor variables 


world_bank_clean$country_name <- as.factor(world_bank_clean$country_name)
world_bank_clean$country_id <- as.factor(world_bank_clean$country_id)

str(world_bank_clean[c("country_name", "country_id", "year")])
## tibble [535 × 3] (S3: tbl_df/tbl/data.frame)
##  $ country_name: Factor w/ 59 levels "Albania","Armenia",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ country_id  : Factor w/ 59 levels "al","am","au",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year        : num [1:535] 2011 2012 2013 2014 2015 ...

As shown in the code, I decided to delete any rows with missing values using the complete.cases() function (a method I found online and cited at the end). When I initially looked at the data on Kaggle, I noticed that the majority of rows were missing data in different columns. Because of this, I originally planned to erase the columns that lacked the most data entirely, since I didn’t want to lose too much information. However, after using the summary function, I saw that there were 3,472 observations across 16 columns, so I decided instead to prioritize a complete data set. After erasing the rows with missing values, I ended up with 535 observations, which is still a large enough sample to show the big picture.
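For reference, here is a quick way to see which columns drive most of the data loss, along with a sketch of the column-dropping alternative I considered (the 1,500-NA threshold is arbitrary and purely illustrative; this is not the approach used above):

# Missing values per column, worst offenders first
sort(colSums(is.na(world_bank_data)), decreasing = TRUE)

# Sketch of the alternative: drop the sparsest columns, then keep complete rows
keep_cols <- colSums(is.na(world_bank_data)) < 1500
world_bank_alt <- world_bank_data[, keep_cols]
world_bank_alt <- world_bank_alt[complete.cases(world_bank_alt), ]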

4. Exploratory Data Analysis
# Descriptive Statistics: 


# 1. View structure of cleaned data 

str(world_bank_clean)
## tibble [535 × 16] (S3: tbl_df/tbl/data.frame)
##  $ country_name                   : Factor w/ 59 levels "Albania","Armenia",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ country_id                     : Factor w/ 59 levels "al","am","au",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year                           : num [1:535] 2011 2012 2013 2014 2015 ...
##  $ Inflation (CPI %)              : num [1:535] 3.43 2.03 1.94 1.63 3.5 ...
##  $ GDP (Current USD)              : num [1:535] 1.29e+10 1.23e+10 1.28e+10 1.32e+10 1.14e+10 ...
##  $ GDP per Capita (Current USD)   : num [1:535] 4437 4248 4413 4579 3953 ...
##  $ Unemployment Rate (%)          : num [1:535] 13.5 13.4 15.9 18.1 17.2 ...
##  $ Interest Rate (Real, %)        : num [1:535] 9.89 9.74 9.51 6.32 7.27 ...
##  $ Inflation (GDP Deflator, %)    : num [1:535] 2.315 1.043 0.289 1.55 0.564 ...
##  $ GDP Growth (% Annual)          : num [1:535] 2.55 1.42 1 1.77 2.22 ...
##  $ Current Account Balance (% GDP): num [1:535] -12.93 -10.2 -9.27 -10.78 -8.6 ...
##  $ Government Expense (% of GDP)  : num [1:535] 23 23 24.2 24.4 24.4 ...
##  $ Government Revenue (% of GDP)  : num [1:535] 23.9 23.1 22.3 24.5 24.7 ...
##  $ Tax Revenue (% of GDP)         : num [1:535] 18 17.5 16.5 18.3 18.5 ...
##  $ Gross National Income (USD)    : num [1:535] 1.29e+10 1.22e+10 1.30e+10 1.33e+10 1.15e+10 ...
##  $ Public Debt (% of GDP)         : num [1:535] 69.6 63.7 70.6 73.3 79.9 ...
# 2. Averages 

mean(world_bank_clean$`Inflation (CPI %)`, na.rm = TRUE)
## [1] 4.706523
mean(world_bank_clean$`GDP (Current USD)`, na.rm = TRUE)
## [1] 771288857174
mean(world_bank_clean$`GDP per Capita (Current USD)`, na.rm = TRUE)
## [1] 16222.4
mean(world_bank_clean$`Unemployment Rate (%)`, na.rm = TRUE)
## [1] 7.992738
mean(world_bank_clean$`Public Debt (% of GDP)`, na.rm = TRUE)
## [1] 54.6657
# 3. Standard Deviations

sd(world_bank_clean$`Inflation (CPI %)`, na.rm = TRUE)
## [1] 5.553966
sd(world_bank_clean$`GDP (Current USD)`, na.rm = TRUE)
## [1] 2.828027e+12
sd(world_bank_clean$`GDP per Capita (Current USD)`, na.rm = TRUE)
## [1] 21064.36
sd(world_bank_clean$`Unemployment Rate (%)`, na.rm = TRUE)
## [1] 6.117287
sd(world_bank_clean$`Public Debt (% of GDP)`, na.rm = TRUE)
## [1] 30.56349
# Minimums & Maximums

min(world_bank_clean$`GDP (Current USD)`, na.rm = TRUE)
## [1] 420828262
max(world_bank_clean$`GDP (Current USD)`, na.rm = TRUE)
## [1] 2.368117e+13
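As a stylistic alternative (not part of the original analysis), the repeated mean/sd/min/max calls above could be condensed by looping over the key columns with sapply:

# Summaries for several key variables at once
key_vars <- c("Inflation (CPI %)", "GDP (Current USD)",
              "GDP per Capita (Current USD)", "Unemployment Rate (%)",
              "Public Debt (% of GDP)")
sapply(world_bank_clean[key_vars], mean, na.rm = TRUE)
sapply(world_bank_clean[key_vars], sd, na.rm = TRUE)
sapply(world_bank_clean[key_vars], range, na.rm = TRUE)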
# 4. Proportion of countries' GDPs above average?


average_gdp <- mean(world_bank_clean$`GDP (Current USD)`, na.rm = TRUE)
gdp_above_average <- world_bank_clean[
  world_bank_clean$`GDP (Current USD)` > average_gdp, 
]
head(gdp_above_average)
## # A tibble: 6 × 16
##   country_name country_id  year `Inflation (CPI %)` `GDP (Current USD)`
##   <fct>        <fct>      <dbl>               <dbl>               <dbl>
## 1 Australia    au          2010                2.92             1.15e12
## 2 Australia    au          2011                3.30             1.40e12
## 3 Australia    au          2012                1.76             1.55e12
## 4 Australia    au          2013                2.45             1.58e12
## 5 Australia    au          2014                2.49             1.47e12
## 6 Australia    au          2015                1.51             1.35e12
## # ℹ 11 more variables: `GDP per Capita (Current USD)` <dbl>,
## #   `Unemployment Rate (%)` <dbl>, `Interest Rate (Real, %)` <dbl>,
## #   `Inflation (GDP Deflator, %)` <dbl>, `GDP Growth (% Annual)` <dbl>,
## #   `Current Account Balance (% GDP)` <dbl>,
## #   `Government Expense (% of GDP)` <dbl>,
## #   `Government Revenue (% of GDP)` <dbl>, `Tax Revenue (% of GDP)` <dbl>,
## #   `Gross National Income (USD)` <dbl>, `Public Debt (% of GDP)` <dbl>
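The comment above asks for a proportion, but the chunk only filters the rows; the actual share of country-year observations above the mean could be computed with a one-liner like this (a sketch, not part of the original output):

# Proportion of country-year rows with GDP above the overall average
mean(world_bank_clean$`GDP (Current USD)` > average_gdp, na.rm = TRUE)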
# 5. What's the poorest country and richest country? 


# Richest nation:

world_bank_clean[
  which.max(world_bank_clean$`GDP (Current USD)`), 
]
## # A tibble: 1 × 16
##   country_name  country_id  year `Inflation (CPI %)` `GDP (Current USD)`
##   <fct>         <fct>      <dbl>               <dbl>               <dbl>
## 1 United States us          2021                4.70             2.37e13
## # ℹ 11 more variables: `GDP per Capita (Current USD)` <dbl>,
## #   `Unemployment Rate (%)` <dbl>, `Interest Rate (Real, %)` <dbl>,
## #   `Inflation (GDP Deflator, %)` <dbl>, `GDP Growth (% Annual)` <dbl>,
## #   `Current Account Balance (% GDP)` <dbl>,
## #   `Government Expense (% of GDP)` <dbl>,
## #   `Government Revenue (% of GDP)` <dbl>, `Tax Revenue (% of GDP)` <dbl>,
## #   `Gross National Income (USD)` <dbl>, `Public Debt (% of GDP)` <dbl>
# Poorest nation: 

world_bank_clean[
  which.min(world_bank_clean$`GDP (Current USD)`), 
]
## # A tibble: 1 × 16
##   country_name country_id  year `Inflation (CPI %)` `GDP (Current USD)`
##   <fct>        <fct>      <dbl>               <dbl>               <dbl>
## 1 Tonga        to          2016                2.58          420828262.
## # ℹ 11 more variables: `GDP per Capita (Current USD)` <dbl>,
## #   `Unemployment Rate (%)` <dbl>, `Interest Rate (Real, %)` <dbl>,
## #   `Inflation (GDP Deflator, %)` <dbl>, `GDP Growth (% Annual)` <dbl>,
## #   `Current Account Balance (% GDP)` <dbl>,
## #   `Government Expense (% of GDP)` <dbl>,
## #   `Government Revenue (% of GDP)` <dbl>, `Tax Revenue (% of GDP)` <dbl>,
## #   `Gross National Income (USD)` <dbl>, `Public Debt (% of GDP)` <dbl>
# Visualizations 


# 1. Histogram


hist_gdp <- hist(world_bank_clean$`GDP (Current USD)`, 
                 main = "Histogram of GDP (Current USD)",
                 xlab = "GDP",
                 ylab = "Frequency",
                 col = "lightblue",
                 breaks = 20)
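Because GDP is heavily right-skewed (the minimum and maximum computed earlier span roughly 4.2e+08 to 2.4e+13), a log-scale histogram can be easier to read. This variant is optional and was not part of the original plots:

# Optional: histogram of GDP on a log10 scale
hist(log10(world_bank_clean$`GDP (Current USD)`),
     main = "Histogram of log10 GDP (Current USD)",
     xlab = "log10(GDP)",
     col = "lightblue",
     breaks = 20)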

# 2. Boxplot

boxplot_gdp <- boxplot(world_bank_clean$`GDP (Current USD)`,
                       main = "Boxplot of GDP",
                       ylab = "GDP (Current USD)",
                       col = "orange")

# 3. Scatter plot

plot(world_bank_clean$`GDP (Current USD)`, 
     world_bank_clean$`GDP per Capita (Current USD)`,
     main = "Scatter Plot: GDP vs GDP per Capita",
     xlab = "GDP (Current USD)",
     ylab = "GDP per Capita (Current USD)",
     pch = 19,
     col = "blue")

# 4.  Showcasing top ten countries by wealth!


# (highest to lowest gdps)

sorted_data <- world_bank_clean[
  order(-world_bank_clean$`GDP (Current USD)`), 
]
# (top 10 countries)

top_10 <- sorted_data[1:10,]

# Bar plot

barplot(top_10$`GDP (Current USD)`,
        names.arg = top_10$country_name,
        main = "Top 10 Countries by GDP",
        ylab = "GDP (Current USD)",
        col = "pink",
        las = 2,
        cex.names = 0.7)
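One caveat worth noting: since each country appears once per year, the ten highest-GDP rows can be dominated by the same few countries across multiple years. A sketch (my own assumption about how to handle this, using base R's aggregate()) that keeps one row per country first:

# Top 10 countries using each country's highest recorded GDP
max_gdp <- aggregate(`GDP (Current USD)` ~ country_name,
                     data = world_bank_clean, FUN = max)
max_gdp <- max_gdp[order(-max_gdp$`GDP (Current USD)`), ]
head(max_gdp, 10)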

5. Distributions and Modeling Assumptions

In macroeconomics, the Phillips curve describes an inverse relationship between inflation and unemployment (traditionally viewed as a short-run trade-off, since the long-run curve is generally considered vertical). Before testing whether inflation and unemployment in this data are consistent with that relationship, we should look at how each variable is distributed.

The central limit theorem states that as the sample size grows (a common rule of thumb is above 30), the sampling distribution of the sample mean becomes approximately normal, regardless of the shape of the underlying data. With roughly 535 rows, estimates based on sample means should therefore behave well. The raw inflation and unemployment values themselves are not guaranteed to be normal, however, so I check their distributions directly with histograms and QQ-plots below.
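As a quick illustration of what the CLT does and does not say, the distribution of repeated sample means can be compared with the raw inflation values (this simulation is illustrative only and is not part of the original analysis):

# Illustrative CLT check: raw values vs. means of repeated samples of size 30
set.seed(123)
infl <- world_bank_clean$`Inflation (CPI %)`
sample_means <- replicate(1000, mean(sample(infl, size = 30, replace = TRUE)))

par(mfrow = c(1, 2))
hist(infl, breaks = 20, main = "Raw inflation values", xlab = "Inflation (CPI %)")
hist(sample_means, breaks = 20, main = "Means of samples of 30", xlab = "Sample mean")
par(mfrow = c(1, 1))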

# Inflation:


# Histogram

hist(world_bank_clean$`Inflation (CPI %)`,
     main = "Histogram of Inflation (CPI %)",
     xlab = "Inflation (CPI %)",
     col = "lightblue",
     breaks = 20)

# QQ-plot

qqnorm(world_bank_clean$`Inflation (CPI %)`)
qqline(world_bank_clean$`Inflation (CPI %)`, col="lightgreen")

# Unemployment:


# Histogram


hist(world_bank_clean$`Unemployment Rate (%)`,
     main = "Histogram of Unemployment Rate",
     xlab = "Unemployment Rate (%)",
     col = "lightgreen",
     breaks = 20)

# QQ-plot

qqnorm(world_bank_clean$`Unemployment Rate (%)`)
qqline(world_bank_clean$`Unemployment Rate (%)`, col="yellow")

Based on the visualizations, both inflation and unemployment look approximately normal, although neither aligns perfectly with the QQ-line. Given the sample size, we can still reasonably test these two variables and their relationship in the Phillips curve.
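A simple first look at the Phillips-curve relationship itself (not run in the original analysis) is a scatter plot and correlation between the two variables:

# Inflation vs. unemployment across all country-year observations
plot(world_bank_clean$`Unemployment Rate (%)`,
     world_bank_clean$`Inflation (CPI %)`,
     main = "Inflation vs. Unemployment",
     xlab = "Unemployment Rate (%)",
     ylab = "Inflation (CPI %)",
     pch = 19,
     col = "darkgray")

cor(world_bank_clean$`Unemployment Rate (%)`,
    world_bank_clean$`Inflation (CPI %)`,
    use = "complete.obs")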

6. Confidence Intervals
# 1. Confidence interval for inflation (only)

mean_inflation <- mean(world_bank_clean$`Inflation (CPI %)`)
se_inflation <- sd(world_bank_clean$`Inflation (CPI %)`) / 
  sqrt(length(world_bank_clean$`Inflation (CPI %)`))
ci_lower <- mean_inflation - 1.96 * se_inflation
ci_upper <- mean_inflation + 1.96 * se_inflation

print(ci_lower)
## [1] 4.23589
print(ci_upper)
## [1] 5.177156
# 2. Confidence interval for unemployment difference between rich and poor countries

median_gdp <- median(world_bank_clean$`GDP per Capita (Current USD)`, na.rm = TRUE)
rich <- world_bank_clean$`Unemployment Rate (%)`[
  world_bank_clean$`GDP per Capita (Current USD)` >= median_gdp
]
poor <- world_bank_clean$`Unemployment Rate (%)`[
  world_bank_clean$`GDP per Capita (Current USD)` < median_gdp
]

mean_rich <- mean(rich, na.rm = TRUE)
mean_poor <- mean(poor, na.rm = TRUE)
diff <- mean_rich - mean_poor
se_diff <- sqrt(
  sd(rich, na.rm = TRUE)^2 / length(na.omit(rich)) + 
  sd(poor, na.rm = TRUE)^2 / length(na.omit(poor))
)
lower_diff <- diff - 1.96 * se_diff
upper_diff <- diff + 1.96 * se_diff

print(lower_diff)
## [1] -2.906194
print(upper_diff)
## [1] -0.853623
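As a sanity check (not part of the original output), the same intervals can be reproduced with t.test(), which uses the t distribution rather than the normal 1.96 cutoff, so its bounds may differ slightly:

# t-based cross-checks of the two intervals
t.test(world_bank_clean$`Inflation (CPI %)`)$conf.int
t.test(rich, poor)$conf.int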

Based on the first interval, we’re 95% confident that the true mean inflation rate across all country-year observations lies between 4.2359 and 5.1772 percent.

Based on the second interval, we’re 95% confident that the difference in mean unemployment rates between rich and poor countries (rich minus poor) lies between -2.9062 and -0.8536 percentage points. Since the interval does not contain zero, there is a statistically significant difference at the 95% confidence level: countries above the median GDP per capita have lower unemployment on average than those below it.

7. Hypothesis Testing
# hypothesis test 1: unemployment in 2016 vs. 2017

unemp_2016 <- world_bank_clean$`Unemployment Rate (%)`[
  world_bank_clean$year == 2016
]

unemp_2017 <- world_bank_clean$`Unemployment Rate (%)`[
  world_bank_clean$year == 2017
]

# Two-sample t-test
test1 <- t.test(unemp_2016, unemp_2017, na.rm = TRUE)

# Results
test1$statistic
##         t 
## 0.1491322
test1$p.value
## [1] 0.8818253
mean(unemp_2016, na.rm = TRUE)
## [1] 8.24119
mean(unemp_2017, na.rm = TRUE)
## [1] 8.027525
# Hypothesis test 2: proportion of countries with unemployment above 6% in 2016 vs. 2017

high_unemp_2016 <- unemp_2016 > 6
high_unemp_2017 <- unemp_2017 > 6

# Counts
x <- c(sum(high_unemp_2016, na.rm = TRUE),
       sum(high_unemp_2017, na.rm = TRUE))

n <- c(length(na.omit(unemp_2016)),
       length(na.omit(unemp_2017)))

# Proportion test
test2 <- prop.test(x, n)

# Results
test2$statistic
##    X-squared 
## 6.314619e-31
test2$p.value
## [1] 1
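For completeness (not shown in the original output), the estimated proportions and the t-test’s own confidence interval can be pulled straight from the test objects:

# Estimated share of high-unemployment countries in 2016 and 2017
test2$estimate

# 95% CI for the 2016 minus 2017 difference in mean unemployment
test1$conf.int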

After conducting the hypothesis tests: for the first test, the null hypothesis states that the average unemployment rate in 2016 equals the average in 2017, and the alternative states that they differ. The t-test returned a statistic of approximately 0.149 with a p-value of 0.8818. Since the p-value is greater than 0.05, we fail to reject the null hypothesis, indicating no significant difference in average unemployment rates between 2016 and 2017.

For the second hypothesis test, a proportion test, the null hypothesis states that the proportion of countries with high unemployment (above 6%) is the same in 2016 and 2017; the alternative says it differs. The chi-square statistic is approximately zero, with a p-value of 1. Since the p-value is far above 0.05, we again fail to reject the null hypothesis. The two years’ proportions are essentially identical.

Overall, these results make sense, given that unemployment usually changes only slightly from year to year. It could be a different story quarter to quarter, but yearly unemployment is fairly consistent unless a major event (such as COVID-19 or the Great Depression) occurs.

This is also consistent with a confidence-interval view: because the p-value is so large, a 95% confidence interval for the 2016-2017 difference in mean unemployment would contain zero, which aligns with the failure to reject the null hypothesis.

8. Interpretation and Conclusion

After a thorough analysis of the World Bank data, I drew the following conclusions:

The main limitation was the missing data, which would have given clearer insight into the research questions. I started with 3,472 observations and ended with 535, so the loss could well have affected the accuracy of the results. In addition, removing rows with missing values (NAs) may have introduced selection bias: I am effectively keeping the countries that report consistently rather than a holistic sample of all countries. If a country such as Algeria, for example, had missing data points across all recorded years, it would not be factored into the analysis at all.

For the future, a few extensions are worth mentioning, such as factoring in major events like COVID-19. Since this data set spans 2010-2025, I could examine 2020-2021 and compare them to earlier years to see how they change the answers to the fundamental questions. Also, even though this particular data set doesn’t include it, more microeconomic units such as specific towns or cities could be analyzed instead of entire nations; I wonder how the results would differ in a smaller context.

9. AI + Source Acknowledgements

For this project, I used the AI model DeepSeek to help format my code towards the end, and specifically to help debug and simplify the code for the confidence-interval section. I also referenced various websites for coding help, listed below:

Sources:

https://www.geeksforgeeks.org/r-language/remove-rows-with-missing-values-using-r/
https://www.statology.org/character-to-factor-in-r/
https://www.geeksforgeeks.org/r-language/which-function-in-r/
https://www.statology.org/two-sample-t-test-in-r/
https://www.sthda.com/english/wiki/two-proportions-z-test-in-r
https://www.statisticshowto.com/probability-and-statistics/