Load required packages
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Load the World Bank Dataset
#fill '..' values in numerical columns with NA.
world_bank <- read_csv("C:/Users/SP KHALID/Downloads/WDI- World Bank Dataset.csv" , na = c('..'))
## Rows: 1675 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Time Code, Country Name, Country Code, Region, Income Group
## dbl (14): Time, GDP (constant 2015 US$), GDP growth (annual %), GDP (current...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
world_bank
## # A tibble: 1,675 × 19
## Time `Time Code` `Country Name` `Country Code` Region `Income Group`
## <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 2000 YR2000 Brazil BRA Latin America… Upper middle …
## 2 2000 YR2000 China CHN East Asia & P… Upper middle …
## 3 2000 YR2000 France FRA Europe & Cent… High income
## 4 2000 YR2000 Germany DEU Europe & Cent… High income
## 5 2000 YR2000 India IND South Asia Lower middle …
## 6 2000 YR2000 Indonesia IDN East Asia & P… Upper middle …
## 7 2000 YR2000 Italy ITA Europe & Cent… High income
## 8 2000 YR2000 Japan JPN East Asia & P… High income
## 9 2000 YR2000 Korea, Rep. KOR East Asia & P… High income
## 10 2000 YR2000 Mexico MEX Latin America… Upper middle …
## # ℹ 1,665 more rows
## # ℹ 13 more variables: `GDP (constant 2015 US$)` <dbl>,
## # `GDP growth (annual %)` <dbl>, `GDP (current US$)` <dbl>,
## # `Unemployment, total (% of total labor force)` <dbl>,
## # `Inflation, consumer prices (annual %)` <dbl>, `Labor force, total` <dbl>,
## # `Population, total` <dbl>,
## # `Exports of goods and services (% of GDP)` <dbl>, …
dim(world_bank)
## [1] 1675 19
# Check column data types
glimpse(world_bank)
## Rows: 1,675
## Columns: 19
## $ Time <dbl> 2000, 20…
## $ `Time Code` <chr> "YR2000"…
## $ `Country Name` <chr> "Brazil"…
## $ `Country Code` <chr> "BRA", "…
## $ Region <chr> "Latin A…
## $ `Income Group` <chr> "Upper m…
## $ `GDP (constant 2015 US$)` <dbl> 1.18642e…
## $ `GDP growth (annual %)` <dbl> 4.387949…
## $ `GDP (current US$)` <dbl> 6.554482…
## $ `Unemployment, total (% of total labor force)` <dbl> NA, 3.70…
## $ `Inflation, consumer prices (annual %)` <dbl> 7.044141…
## $ `Labor force, total` <dbl> 80295093…
## $ `Population, total` <dbl> 17401828…
## $ `Exports of goods and services (% of GDP)` <dbl> 10.18805…
## $ `Imports of goods and services (% of GDP)` <dbl> 12.45171…
## $ `General government final consumption expenditure (% of GDP)` <dbl> 18.76784…
## $ `Foreign direct investment, net inflows (% of GDP)` <dbl> 5.033917…
## $ `Gross savings (% of GDP)` <dbl> 13.99170…
## $ `Current account balance (% of GDP)` <dbl> -4.04774…
# Convert Time column to integer
world_bank$Time <- as.integer(world_bank$Time)
# Clean column names
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
df <- world_bank |> clean_names()
glimpse(df)
## Rows: 1,675
## Columns: 19
## $ time <int> 2000, …
## $ time_code <chr> "YR200…
## $ country_name <chr> "Brazi…
## $ country_code <chr> "BRA",…
## $ region <chr> "Latin…
## $ income_group <chr> "Upper…
## $ gdp_constant_2015_us <dbl> 1.1864…
## $ gdp_growth_annual_percent <dbl> 4.3879…
## $ gdp_current_us <dbl> 6.5544…
## $ unemployment_total_percent_of_total_labor_force <dbl> NA, 3.…
## $ inflation_consumer_prices_annual_percent <dbl> 7.0441…
## $ labor_force_total <dbl> 802950…
## $ population_total <dbl> 174018…
## $ exports_of_goods_and_services_percent_of_gdp <dbl> 10.188…
## $ imports_of_goods_and_services_percent_of_gdp <dbl> 12.451…
## $ general_government_final_consumption_expenditure_percent_of_gdp <dbl> 18.767…
## $ foreign_direct_investment_net_inflows_percent_of_gdp <dbl> 5.0339…
## $ gross_savings_percent_of_gdp <dbl> 13.991…
## $ current_account_balance_percent_of_gdp <dbl> -4.047…
head(df["gdp_constant_2015_us"], 5)
## # A tibble: 5 × 1
## gdp_constant_2015_us
## <dbl>
## 1 1186420000000
## 2 2825100000000
## 3 2025860000000
## 4 2885960000000
## 5 800534000000
This column was unclear because of the term ‘constant_2015’ and is it inflation adjusted. Does it represent GDP per capita or total GDP?
After going through the world bank documentation, it is clear that GDP is adjusted for inflation and for the constant GDP term, world bank uses a uniform 2015 exchange rate for the entire series.
By doing this, they can ensure that changes in todays market don’t influence any country’s historical economic performance and comparisons can be done without distortion
If I didn’t read the documentation, I might have made comparisons using gdp_current_us. Growth might look higher than it actually is.
head(df["foreign_direct_investment_net_inflows_percent_of_gdp"],5)
## # A tibble: 5 × 1
## foreign_direct_investment_net_inflows_percent_of_gdp
## <dbl>
## 1 5.03
## 2 3.44
## 3 3.04
## 4 12.6
## 5 0.765
This column was unclear because of the term ‘net_inflows’ and if it actually meant inflows minus outflows. Whether it is expressed in terms of gdp percentage or USD.
After going through the world bank documentation I found ou that it shows net foreign direct investment for each country in terms of percentage of gdp.
By doing this, world bank enures that comparison between small and large economies is fair.
I might have treated it as total net amount in dollars and compared it with current_gdp incorrectly.
head(df["current_account_balance_percent_of_gdp"], 5)
## # A tibble: 5 × 1
## current_account_balance_percent_of_gdp
## <dbl>
## 1 -4.05
## 2 1.68
## 3 1.18
## 4 -1.51
## 5 -0.982
Even after going through documentation, I am unclear whether all countries are calculated the same way and if remittances are included. The world bank follows the IMF standards for this but it is still ambiguous how exports, imports and primary/secondary income are defined for each country. The documentatio doesn’t explain the exact method used to calculate these values and how missing values for developing countries are incorporated.
df_income <- df |>
group_by(time,income_group) |>
summarise(
mean_fdi = mean(foreign_direct_investment_net_inflows_percent_of_gdp, na.rm = TRUE),
.groups = "drop"
)
df_income
## # A tibble: 100 × 3
## time income_group mean_fdi
## <int> <chr> <dbl>
## 1 2000 High income 5.61
## 2 2000 Low income 2.40
## 3 2000 Lower middle income 1.66
## 4 2000 Upper middle income 2.72
## 5 2001 High income 3.83
## 6 2001 Low income 3.97
## 7 2001 Lower middle income 1.90
## 8 2001 Upper middle income 3.83
## 9 2002 High income 3.19
## 10 2002 Low income 5.09
## # ℹ 90 more rows
library(ggplot2)
ggplot(df_income, aes(x = time, y = mean_fdi, color = income_group)) +
geom_line() +
labs(
title = "Foreign Direct Investment (% of GDP) over time")
FDI volatility differs significantly across income groups, with low-income countries showing large spikes that may exaggerate economic impact due to small GDP denominators. It is unclear whether extreme fluctuations reflect real investment shocks or reporting inconsistencies in the dataset. This poses a risk of misleading cross income group comparisons and we can counter this using median values or smoothing trends eliminating distortion.
top_10_pop_names <- df |>
filter(time == max(time)) |>
slice_max(order_by = population_total, n = 10) |>
pull(country_name)
df_top_10 <- df |>
filter(country_name %in% top_10_pop_names) |>
group_by(time, country_name) |>
summarise(
mean_fdi = mean(foreign_direct_investment_net_inflows_percent_of_gdp, na.rm = TRUE),
.groups = "drop")
df_top_10
## # A tibble: 250 × 3
## time country_name mean_fdi
## <int> <chr> <dbl>
## 1 2000 Bangladesh 0.525
## 2 2000 Brazil 5.03
## 3 2000 China 3.44
## 4 2000 Ethiopia 1.63
## 5 2000 India 0.765
## 6 2000 Indonesia -2.76
## 7 2000 Mexico 2.48
## 8 2000 Pakistan 0.310
## 9 2000 Russian Federation 1.03
## 10 2000 United States 3.41
## # ℹ 240 more rows
ggplot(df_top_10, aes(x = time, y = mean_fdi, color = country_name)) +
geom_line() +
labs(
title = "Foreign Direct Investment for top 10 populated countries")
Top 10 populated country trends reveal negative FDI values and sharp swings, raising questions about whether “net inflows” reflect capital increase or data revisions. Differences across countries may originate from structural size effects rather than true economic performance. The risk is misinterpreting volatility as instability, so checking documentation and comparing multi-year averages would reduce misjudgment.
The significant risks are associated with misrepresentation of inflation due to presence of constant and current gdp. Without going through the documentation, some indiactors might be wrongly plotted total sum instead of percentage of gdp. It is suggested to always check documentation to avoid any ambiguity. Consistent units should be used for the whole datase and some indicators should be labelled as inflation-adjusted values.
summary(df$income_group)
## Length Class Mode
## 1675 character character
sum(is.na(df$income_group))
## [1] 0
df |>
group_by(income_group) |>
summarise(
not_null = sum(is.na(df$income_group))
)
## # A tibble: 4 × 2
## income_group not_null
## <chr> <int>
## 1 High income 0
## 2 Low income 0
## 3 Lower middle income 0
## 4 Upper middle income 0
summary(df$region)
## Length Class Mode
## 1675 character character
sum(is.na(df$region))
## [1] 0
df |>
group_by(region) |>
summarise(
not_null = sum(is.na(df$region))
)
## # A tibble: 7 × 2
## region not_null
## <chr> <int>
## 1 East Asia & Pacific 0
## 2 Europe & Central Asia 0
## 3 Latin America & Caribbean 0
## 4 Middle East & North Africa 0
## 5 North America 0
## 6 South Asia 0
## 7 Sub-Saharan Africa 0
Both Categorical columns have zero null values and each row is assigned to an income group and region.
df |>
count(time, income_group)
## # A tibble: 100 × 3
## time income_group n
## <int> <chr> <int>
## 1 2000 High income 24
## 2 2000 Low income 11
## 3 2000 Lower middle income 16
## 4 2000 Upper middle income 16
## 5 2001 High income 24
## 6 2001 Low income 12
## 7 2001 Lower middle income 15
## 8 2001 Upper middle income 16
## 9 2002 High income 24
## 10 2002 Low income 12
## # ℹ 90 more rows
Each income group is present in each time period.
df |>
count(time, region)
## # A tibble: 175 × 3
## time region n
## <int> <chr> <int>
## 1 2000 East Asia & Pacific 11
## 2 2000 Europe & Central Asia 18
## 3 2000 Latin America & Caribbean 9
## 4 2000 Middle East & North Africa 8
## 5 2000 North America 2
## 6 2000 South Asia 4
## 7 2000 Sub-Saharan Africa 15
## 8 2001 East Asia & Pacific 11
## 9 2001 Europe & Central Asia 18
## 10 2001 Latin America & Caribbean 9
## # ℹ 165 more rows
Similarly, each region is present in each time period.
Defining outlier using interquartile range IQR
Q1 <- quantile(df$inflation_consumer_prices_annual_percent, 0.25, na.rm = TRUE)
Q3 <- quantile(df$inflation_consumer_prices_annual_percent, 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1
lower <- Q1 - 1.5 * IQR_val
upper <- Q3 + 1.5 * IQR_val
cat("Lower Quartile (Q1):", Q1, "\n")
## Lower Quartile (Q1): 1.850795
cat("Upper Quartile (Q3):", Q3, "\n")
## Upper Quartile (Q3): 7.848777
cat("Lower Bound:", lower, "\n")
## Lower Bound: -7.14618
cat("Upper Bound:" , upper, "\n")
## Upper Bound: 16.84575
These extreme values for lower and upper bounds likely correspond to periods of economic crisis or hyperinflation. While statistically classified as outliers, they may represent real economic conditions rather than data errors. This raises the question of whether such values should be removed or analyzed separately.