Data Dive : Week 5

Load required packages

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load the World Bank Dataset

#fill '..' values in numerical columns with NA.
world_bank <- read_csv("C:/Users/SP KHALID/Downloads/WDI- World Bank Dataset.csv" , na = c('..'))

## Rows: 1675 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): Time Code, Country Name, Country Code, Region, Income Group
## dbl (14): Time, GDP (constant 2015 US$), GDP growth (annual %), GDP (current...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

world_bank

## # A tibble: 1,675 × 19
##     Time `Time Code` `Country Name` `Country Code` Region         `Income Group`
##    <dbl> <chr>       <chr>          <chr>          <chr>          <chr>         
##  1  2000 YR2000      Brazil         BRA            Latin America… Upper middle …
##  2  2000 YR2000      China          CHN            East Asia & P… Upper middle …
##  3  2000 YR2000      France         FRA            Europe & Cent… High income   
##  4  2000 YR2000      Germany        DEU            Europe & Cent… High income   
##  5  2000 YR2000      India          IND            South Asia     Lower middle …
##  6  2000 YR2000      Indonesia      IDN            East Asia & P… Upper middle …
##  7  2000 YR2000      Italy          ITA            Europe & Cent… High income   
##  8  2000 YR2000      Japan          JPN            East Asia & P… High income   
##  9  2000 YR2000      Korea, Rep.    KOR            East Asia & P… High income   
## 10  2000 YR2000      Mexico         MEX            Latin America… Upper middle …
## # ℹ 1,665 more rows
## # ℹ 13 more variables: `GDP (constant 2015 US$)` <dbl>,
## #   `GDP growth (annual %)` <dbl>, `GDP (current US$)` <dbl>,
## #   `Unemployment, total (% of total labor force)` <dbl>,
## #   `Inflation, consumer prices (annual %)` <dbl>, `Labor force, total` <dbl>,
## #   `Population, total` <dbl>,
## #   `Exports of goods and services (% of GDP)` <dbl>, …

dim(world_bank)

## [1] 1675   19

# Check column data types
glimpse(world_bank)

## Rows: 1,675
## Columns: 19
## $ Time                                                          <dbl> 2000, 20…
## $ `Time Code`                                                   <chr> "YR2000"…
## $ `Country Name`                                                <chr> "Brazil"…
## $ `Country Code`                                                <chr> "BRA", "…
## $ Region                                                        <chr> "Latin A…
## $ `Income Group`                                                <chr> "Upper m…
## $ `GDP (constant 2015 US$)`                                     <dbl> 1.18642e…
## $ `GDP growth (annual %)`                                       <dbl> 4.387949…
## $ `GDP (current US$)`                                           <dbl> 6.554482…
## $ `Unemployment, total (% of total labor force)`                <dbl> NA, 3.70…
## $ `Inflation, consumer prices (annual %)`                       <dbl> 7.044141…
## $ `Labor force, total`                                          <dbl> 80295093…
## $ `Population, total`                                           <dbl> 17401828…
## $ `Exports of goods and services (% of GDP)`                    <dbl> 10.18805…
## $ `Imports of goods and services (% of GDP)`                    <dbl> 12.45171…
## $ `General government final consumption expenditure (% of GDP)` <dbl> 18.76784…
## $ `Foreign direct investment, net inflows (% of GDP)`           <dbl> 5.033917…
## $ `Gross savings (% of GDP)`                                    <dbl> 13.99170…
## $ `Current account balance (% of GDP)`                          <dbl> -4.04774…

# Convert Time column to integer
world_bank$Time <- as.integer(world_bank$Time)

# Clean column names
library(janitor)

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

df <- world_bank |> clean_names()
glimpse(df)

## Rows: 1,675
## Columns: 19
## $ time                                                            <int> 2000, …
## $ time_code                                                       <chr> "YR200…
## $ country_name                                                    <chr> "Brazi…
## $ country_code                                                    <chr> "BRA",…
## $ region                                                          <chr> "Latin…
## $ income_group                                                    <chr> "Upper…
## $ gdp_constant_2015_us                                            <dbl> 1.1864…
## $ gdp_growth_annual_percent                                       <dbl> 4.3879…
## $ gdp_current_us                                                  <dbl> 6.5544…
## $ unemployment_total_percent_of_total_labor_force                 <dbl> NA, 3.…
## $ inflation_consumer_prices_annual_percent                        <dbl> 7.0441…
## $ labor_force_total                                               <dbl> 802950…
## $ population_total                                                <dbl> 174018…
## $ exports_of_goods_and_services_percent_of_gdp                    <dbl> 10.188…
## $ imports_of_goods_and_services_percent_of_gdp                    <dbl> 12.451…
## $ general_government_final_consumption_expenditure_percent_of_gdp <dbl> 18.767…
## $ foreign_direct_investment_net_inflows_percent_of_gdp            <dbl> 5.0339…
## $ gross_savings_percent_of_gdp                                    <dbl> 13.991…
## $ current_account_balance_percent_of_gdp                          <dbl> -4.047…

Unclear Columns

gdp_constant_2015_us

head(df["gdp_constant_2015_us"], 5)

## # A tibble: 5 × 1
##   gdp_constant_2015_us
##                  <dbl>
## 1        1186420000000
## 2        2825100000000
## 3        2025860000000
## 4        2885960000000
## 5         800534000000

Why unclear

This column was unclear because of the term ‘constant_2015’ and is it inflation adjusted. Does it represent GDP per capita or total GDP?

Documentation

After going through the world bank documentation, it is clear that GDP is adjusted for inflation and for the constant GDP term, world bank uses a uniform 2015 exchange rate for the entire series.

Why encoded this way

By doing this, they can ensure that changes in todays market don’t influence any country’s historical economic performance and comparisons can be done without distortion

What could go wrong if you didn’t read docs?

If I didn’t read the documentation, I might have made comparisons using gdp_current_us. Growth might look higher than it actually is.

foreign_direct_investment_net_inflows_percent_of_gdp

head(df["foreign_direct_investment_net_inflows_percent_of_gdp"],5)

## # A tibble: 5 × 1
##   foreign_direct_investment_net_inflows_percent_of_gdp
##                                                  <dbl>
## 1                                                5.03 
## 2                                                3.44 
## 3                                                3.04 
## 4                                               12.6  
## 5                                                0.765

Why unclear

This column was unclear because of the term ‘net_inflows’ and if it actually meant inflows minus outflows. Whether it is expressed in terms of gdp percentage or USD.

Documentation

After going through the world bank documentation I found ou that it shows net foreign direct investment for each country in terms of percentage of gdp.

Why encoded this way?

By doing this, world bank enures that comparison between small and large economies is fair.

What could go wrong if you didn’t read docs?

I might have treated it as total net amount in dollars and compared it with current_gdp incorrectly.

Something still unclear after documentation

head(df["current_account_balance_percent_of_gdp"], 5)

## # A tibble: 5 × 1
##   current_account_balance_percent_of_gdp
##                                    <dbl>
## 1                                 -4.05 
## 2                                  1.68 
## 3                                  1.18 
## 4                                 -1.51 
## 5                                 -0.982

Even after going through documentation, I am unclear whether all countries are calculated the same way and if remittances are included. The world bank follows the IMF standards for this but it is still ambiguous how exports, imports and primary/secondary income are defined for each country. The documentatio doesn’t explain the exact method used to calculate these values and how missing values for developing countries are incorporated.

Visualizations

By income_group

df_income <- df |>
  group_by(time,income_group) |>
  summarise(
    mean_fdi = mean(foreign_direct_investment_net_inflows_percent_of_gdp, na.rm = TRUE),
    .groups = "drop"
  )
df_income

## # A tibble: 100 × 3
##     time income_group        mean_fdi
##    <int> <chr>                  <dbl>
##  1  2000 High income             5.61
##  2  2000 Low income              2.40
##  3  2000 Lower middle income     1.66
##  4  2000 Upper middle income     2.72
##  5  2001 High income             3.83
##  6  2001 Low income              3.97
##  7  2001 Lower middle income     1.90
##  8  2001 Upper middle income     3.83
##  9  2002 High income             3.19
## 10  2002 Low income              5.09
## # ℹ 90 more rows

library(ggplot2)

ggplot(df_income, aes(x = time, y = mean_fdi, color = income_group)) +
  geom_line() +
  labs(
    title = "Foreign Direct Investment (% of GDP) over time")

FDI volatility differs significantly across income groups, with low-income countries showing large spikes that may exaggerate economic impact due to small GDP denominators. It is unclear whether extreme fluctuations reflect real investment shocks or reporting inconsistencies in the dataset. This poses a risk of misleading cross income group comparisons and we can counter this using median values or smoothing trends eliminating distortion.

By top 10 countries with highest population

top_10_pop_names <- df |>
  filter(time == max(time)) |>
  slice_max(order_by = population_total, n = 10) |>
  pull(country_name)

df_top_10 <- df |>
  filter(country_name %in% top_10_pop_names) |>
  group_by(time, country_name) |>
  summarise(
    mean_fdi = mean(foreign_direct_investment_net_inflows_percent_of_gdp, na.rm = TRUE),
    .groups = "drop")
  df_top_10

## # A tibble: 250 × 3
##     time country_name       mean_fdi
##    <int> <chr>                 <dbl>
##  1  2000 Bangladesh            0.525
##  2  2000 Brazil                5.03 
##  3  2000 China                 3.44 
##  4  2000 Ethiopia              1.63 
##  5  2000 India                 0.765
##  6  2000 Indonesia            -2.76 
##  7  2000 Mexico                2.48 
##  8  2000 Pakistan              0.310
##  9  2000 Russian Federation    1.03 
## 10  2000 United States         3.41 
## # ℹ 240 more rows

ggplot(df_top_10, aes(x = time, y = mean_fdi, color = country_name)) +
  geom_line() +
  labs(
    title = "Foreign Direct Investment for top 10 populated countries")

Top 10 populated country trends reveal negative FDI values and sharp swings, raising questions about whether “net inflows” reflect capital increase or data revisions. Differences across countries may originate from structural size effects rather than true economic performance. The risk is misinterpreting volatility as instability, so checking documentation and comparing multi-year averages would reduce misjudgment.

Overall Significant Risks and Mitigation.

The significant risks are associated with misrepresentation of inflation due to presence of constant and current gdp. Without going through the documentation, some indiactors might be wrongly plotted total sum instead of percentage of gdp. It is suggested to always check documentation to avoid any ambiguity. Consistent units should be used for the whole datase and some indicators should be labelled as inflation-adjusted values.

Two Categorical Columns

summary(df$income_group)

##    Length     Class      Mode 
##      1675 character character

sum(is.na(df$income_group))

## [1] 0

df |>
  group_by(income_group) |>
  summarise(
    not_null = sum(is.na(df$income_group))
  )

## # A tibble: 4 × 2
##   income_group        not_null
##   <chr>                  <int>
## 1 High income                0
## 2 Low income                 0
## 3 Lower middle income        0
## 4 Upper middle income        0

summary(df$region)

##    Length     Class      Mode 
##      1675 character character

sum(is.na(df$region))

## [1] 0

df |>
  group_by(region) |>
  summarise(
    not_null = sum(is.na(df$region))
  )

## # A tibble: 7 × 2
##   region                     not_null
##   <chr>                         <int>
## 1 East Asia & Pacific               0
## 2 Europe & Central Asia             0
## 3 Latin America & Caribbean         0
## 4 Middle East & North Africa        0
## 5 North America                     0
## 6 South Asia                        0
## 7 Sub-Saharan Africa                0

Both Categorical columns have zero null values and each row is assigned to an income group and region.

Implicit missing

df |>
  count(time, income_group)

## # A tibble: 100 × 3
##     time income_group            n
##    <int> <chr>               <int>
##  1  2000 High income            24
##  2  2000 Low income             11
##  3  2000 Lower middle income    16
##  4  2000 Upper middle income    16
##  5  2001 High income            24
##  6  2001 Low income             12
##  7  2001 Lower middle income    15
##  8  2001 Upper middle income    16
##  9  2002 High income            24
## 10  2002 Low income             12
## # ℹ 90 more rows

Each income group is present in each time period.

df |>
  count(time, region)

## # A tibble: 175 × 3
##     time region                         n
##    <int> <chr>                      <int>
##  1  2000 East Asia & Pacific           11
##  2  2000 Europe & Central Asia         18
##  3  2000 Latin America & Caribbean      9
##  4  2000 Middle East & North Africa     8
##  5  2000 North America                  2
##  6  2000 South Asia                     4
##  7  2000 Sub-Saharan Africa            15
##  8  2001 East Asia & Pacific           11
##  9  2001 Europe & Central Asia         18
## 10  2001 Latin America & Caribbean      9
## # ℹ 165 more rows

Similarly, each region is present in each time period.

Continuous Column

Defining outlier using interquartile range IQR

Q1 <- quantile(df$inflation_consumer_prices_annual_percent, 0.25, na.rm = TRUE)
Q3 <- quantile(df$inflation_consumer_prices_annual_percent, 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1

lower <- Q1 - 1.5 * IQR_val
upper <- Q3 + 1.5 * IQR_val

cat("Lower Quartile (Q1):", Q1, "\n")

## Lower Quartile (Q1): 1.850795

cat("Upper Quartile (Q3):", Q3, "\n")

## Upper Quartile (Q3): 7.848777

cat("Lower Bound:", lower, "\n")

## Lower Bound: -7.14618

cat("Upper Bound:" , upper, "\n")

## Upper Bound: 16.84575

These extreme values for lower and upper bounds likely correspond to periods of economic crisis or hyperinflation. While statistically classified as outliers, they may represent real economic conditions rather than data errors. This raises the question of whether such values should be removed or analyzed separately.

Data Dive : Week 5

Mohid

2026-02-14

Unclear Columns

Why unclear

Documentation

Why encoded this way

What could go wrong if you didn’t read docs?

Why unclear

Documentation

Why encoded this way?

What could go wrong if you didn’t read docs?

Something still unclear after documentation

Visualizations

By income_group

By top 10 countries with highest population

Overall Significant Risks and Mitigation.

Two Categorical Columns

Implicit missing

Continuous Column