title: ‘Data Dive : Week 3’ author: “Mohid” date: “2026-02-03”
## Data Dive
### Week 3
# Load required packages
library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ purrr 1.0.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
Load the World Bank Dataset
# Treat '..' entries as NA so the numeric columns parse as doubles.
world_bank <- read_csv("C:/Users/SP KHALID/Downloads/WDI- World Bank Dataset.csv", na = c('..'))
## Rows: 1675 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Time Code, Country Name, Country Code, Region, Income Group
## dbl (14): Time, GDP (constant 2015 US$), GDP growth (annual %), GDP (current...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
world_bank
## # A tibble: 1,675 × 19
## Time `Time Code` `Country Name` `Country Code` Region `Income Group`
## <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 2000 YR2000 Brazil BRA Latin America… Upper middle …
## 2 2000 YR2000 China CHN East Asia & P… Upper middle …
## 3 2000 YR2000 France FRA Europe & Cent… High income
## 4 2000 YR2000 Germany DEU Europe & Cent… High income
## 5 2000 YR2000 India IND South Asia Lower middle …
## 6 2000 YR2000 Indonesia IDN East Asia & P… Upper middle …
## 7 2000 YR2000 Italy ITA Europe & Cent… High income
## 8 2000 YR2000 Japan JPN East Asia & P… High income
## 9 2000 YR2000 Korea, Rep. KOR East Asia & P… High income
## 10 2000 YR2000 Mexico MEX Latin America… Upper middle …
## # ℹ 1,665 more rows
## # ℹ 13 more variables: `GDP (constant 2015 US$)` <dbl>,
## # `GDP growth (annual %)` <dbl>, `GDP (current US$)` <dbl>,
## # `Unemployment, total (% of total labor force)` <dbl>,
## # `Inflation, consumer prices (annual %)` <dbl>, `Labor force, total` <dbl>,
## # `Population, total` <dbl>,
## # `Exports of goods and services (% of GDP)` <dbl>, …
dim(world_bank)
## [1] 1675 19
# Check column data types
glimpse(world_bank)
## Rows: 1,675
## Columns: 19
## $ Time <dbl> 2000, 20…
## $ `Time Code` <chr> "YR2000"…
## $ `Country Name` <chr> "Brazil"…
## $ `Country Code` <chr> "BRA", "…
## $ Region <chr> "Latin A…
## $ `Income Group` <chr> "Upper m…
## $ `GDP (constant 2015 US$)` <dbl> 1.18642e…
## $ `GDP growth (annual %)` <dbl> 4.387949…
## $ `GDP (current US$)` <dbl> 6.554482…
## $ `Unemployment, total (% of total labor force)` <dbl> NA, 3.70…
## $ `Inflation, consumer prices (annual %)` <dbl> 7.044141…
## $ `Labor force, total` <dbl> 80295093…
## $ `Population, total` <dbl> 17401828…
## $ `Exports of goods and services (% of GDP)` <dbl> 10.18805…
## $ `Imports of goods and services (% of GDP)` <dbl> 12.45171…
## $ `General government final consumption expenditure (% of GDP)` <dbl> 18.76784…
## $ `Foreign direct investment, net inflows (% of GDP)` <dbl> 5.033917…
## $ `Gross savings (% of GDP)` <dbl> 13.99170…
## $ `Current account balance (% of GDP)` <dbl> -4.04774…
# Convert Time column to integer
world_bank$Time <- as.integer(world_bank$Time)
# Verify data types
glimpse(world_bank)
## Rows: 1,675
## Columns: 19
## $ Time <int> 2000, 20…
## $ `Time Code` <chr> "YR2000"…
## $ `Country Name` <chr> "Brazil"…
## $ `Country Code` <chr> "BRA", "…
## $ Region <chr> "Latin A…
## $ `Income Group` <chr> "Upper m…
## $ `GDP (constant 2015 US$)` <dbl> 1.18642e…
## $ `GDP growth (annual %)` <dbl> 4.387949…
## $ `GDP (current US$)` <dbl> 6.554482…
## $ `Unemployment, total (% of total labor force)` <dbl> NA, 3.70…
## $ `Inflation, consumer prices (annual %)` <dbl> 7.044141…
## $ `Labor force, total` <dbl> 80295093…
## $ `Population, total` <dbl> 17401828…
## $ `Exports of goods and services (% of GDP)` <dbl> 10.18805…
## $ `Imports of goods and services (% of GDP)` <dbl> 12.45171…
## $ `General government final consumption expenditure (% of GDP)` <dbl> 18.76784…
## $ `Foreign direct investment, net inflows (% of GDP)` <dbl> 5.033917…
## $ `Gross savings (% of GDP)` <dbl> 13.99170…
## $ `Current account balance (% of GDP)` <dbl> -4.04774…
df_income <- world_bank |>
  filter(Time == max(Time, na.rm = TRUE)) |>
  group_by(`Income Group`) |>
  summarise(
    n = n(),
    avg_gdp = mean(`GDP (current US$)`, na.rm = TRUE)
  ) |>
  mutate(probability = n / sum(n))
df_income
## # A tibble: 4 × 4
## `Income Group` n avg_gdp probability
## <chr> <int> <dbl> <dbl>
## 1 High income 24 2.65e12 0.358
## 2 Low income 12 3.21e10 0.179
## 3 Lower middle income 15 4.52e11 0.224
## 4 Upper middle income 16 1.82e12 0.239
# Tag the income group(s) with the fewest countries as low probability
df_income <- df_income |>
  mutate(tag = ifelse(n == min(n), "LOW_PROBABILITY", "NORMAL"))
df_income
## # A tibble: 4 × 5
## `Income Group` n avg_gdp probability tag
## <chr> <int> <dbl> <dbl> <chr>
## 1 High income 24 2.65e12 0.358 NORMAL
## 2 Low income 12 3.21e10 0.179 LOW_PROBABILITY
## 3 Lower middle income 15 4.52e11 0.224 NORMAL
## 4 Upper middle income 16 1.82e12 0.239 NORMAL
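Among the rows for the latest year, low-income countries have the lowest probability of being selected at random (12 of 67, about 0.18). This suggests that low-income economies are underrepresented in the dataset, which may reflect limited reporting capacity or missing economic records.
Note that the count n here is the number of countries in the latest year, not the number of observations across years. The lower count therefore shows that fewer low-income countries are included; whether their data is also reported less consistently across years would need a separate check of missing values.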
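The probabilities in the table are each group's count divided by the total number of latest-year rows. A minimal sketch of that arithmetic, using only the counts printed in the df_income output above (the vector name `counts` is introduced here purely for illustration):

```r
# Counts per income group, copied from the df_income output above.
counts <- c(
  "High income" = 24,
  "Low income" = 12,
  "Lower middle income" = 15,
  "Upper middle income" = 16
)

# Probability of drawing each group when one latest-year row is picked at random.
probability <- counts / sum(counts)
round(probability, 3)

# The group(s) with the smallest count receive the LOW_PROBABILITY tag.
low_groups <- names(counts)[counts == min(counts)]
low_groups
```

Because the tag is based on `n == min(n)`, every group tied for the smallest count would be flagged, not just one.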
df_region <- world_bank |>
  filter(Time == max(Time, na.rm = TRUE)) |>
  group_by(Region) |>
  summarise(
    n = n(),
    avg_unemployment = mean(`Unemployment, total (% of total labor force)`, na.rm = TRUE)
  ) |>
  mutate(
    probability = n / sum(n),
    tag = ifelse(n == min(n), "LOW_PROBABILITY", "NORMAL")
  )
df_region
## # A tibble: 7 × 5
## Region n avg_unemployment probability tag
## <chr> <int> <dbl> <dbl> <chr>
## 1 East Asia & Pacific 11 2.91 0.164 NORMAL
## 2 Europe & Central Asia 18 5.80 0.269 NORMAL
## 3 Latin America & Caribbean 9 5.54 0.134 NORMAL
## 4 Middle East & North Africa 8 2.77 0.119 NORMAL
## 5 North America 2 5.19 0.0299 LOW_PROBABILITY
## 6 South Asia 4 4.17 0.0597 NORMAL
## 7 Sub-Saharan Africa 15 15.7 0.224 NORMAL
The probability of selecting a country from Europe & Central Asia (18 of 67, about 0.27) is higher than for any other region, so this region is the most heavily represented in the dataset. Conversely, regions with fewer countries have a lower probability of being selected; North America, with only 2 countries, is tagged LOW_PROBABILITY.
Regions with limited data reporting may contribute fewer countries, while Europe & Central Asia may appear more often because it contains more countries or has better reporting infrastructure.
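One way to probe the reporting explanation is to count, per region, how many countries are present and what share of their yearly values is actually non-missing. A minimal sketch on a made-up panel follows; the `toy` data and its short column names are hypothetical stand-ins, and the real column would be `Unemployment, total (% of total labor force)`:

```r
library(dplyr)

# Hypothetical toy panel: 3 countries x 2 years, with one missing value.
toy <- tibble::tibble(
  country = c("A", "A", "B", "B", "C", "C"),
  region = c("R1", "R1", "R1", "R1", "R2", "R2"),
  year = c(2000L, 2001L, 2000L, 2001L, 2000L, 2001L),
  unemployment = c(5.1, 5.3, NA, 4.8, 9.0, 9.2)
)

coverage <- toy |>
  group_by(region) |>
  summarise(
    countries = n_distinct(country),             # number of countries in the region
    rows = n(),                                  # number of country-year rows
    share_reported = mean(!is.na(unemployment))  # fraction of non-missing values
  )
coverage
```

A region with few countries but a high reported share would point to sample composition rather than weak reporting.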
ggplot(df_region, aes(x = Region, y = n)) +
  geom_col() +
  coord_flip()
df_combo <- world_bank |>
  filter(Time == max(Time, na.rm = TRUE)) |>
  group_by(`Income Group`, Region) |>
  summarise(n = n(), .groups = "drop") |>
  mutate(
    probability = n / sum(n),
    tag = ifelse(n == min(n), "LOW_PROBABILITY", "NORMAL")
  )
df_combo
## # A tibble: 18 × 5
## `Income Group` Region n probability tag
## <chr> <chr> <int> <dbl> <chr>
## 1 High income East Asia & Pacific 5 0.0746 NORMAL
## 2 High income Europe & Central Asia 12 0.179 NORMAL
## 3 High income Latin America & Caribbean 2 0.0299 NORMAL
## 4 High income Middle East & North Africa 3 0.0448 NORMAL
## 5 High income North America 2 0.0299 NORMAL
## 6 Low income Middle East & North Africa 1 0.0149 LOW_PROBABI…
## 7 Low income Sub-Saharan Africa 11 0.164 NORMAL
## 8 Lower middle income East Asia & Pacific 2 0.0299 NORMAL
## 9 Lower middle income Europe & Central Asia 1 0.0149 LOW_PROBABI…
## 10 Lower middle income Latin America & Caribbean 2 0.0299 NORMAL
## 11 Lower middle income Middle East & North Africa 3 0.0448 NORMAL
## 12 Lower middle income South Asia 4 0.0597 NORMAL
## 13 Lower middle income Sub-Saharan Africa 3 0.0448 NORMAL
## 14 Upper middle income East Asia & Pacific 4 0.0597 NORMAL
## 15 Upper middle income Europe & Central Asia 5 0.0746 NORMAL
## 16 Upper middle income Latin America & Caribbean 5 0.0746 NORMAL
## 17 Upper middle income Middle East & North Africa 1 0.0149 LOW_PROBABI…
## 18 Upper middle income Sub-Saharan Africa 1 0.0149 LOW_PROBABI…
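Four combinations contain only one country each (probability 1/67, about 0.015): Low income in Middle East & North Africa, Lower middle income in Europe & Central Asia, and Upper middle income in both Middle East & North Africa and Sub-Saharan Africa. A randomly selected row is very unlikely to come from any of these combinations, so they can be treated as anomalies.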
### Hypothesis
Some regions might have no countries in certain income groups because of economic constraints.
### Missing Combinations
# Identify existing pairs
df_pairs <- world_bank |>
  distinct(`Income Group`, Region)
df_pairs
## # A tibble: 18 × 2
## `Income Group` Region
## <chr> <chr>
## 1 Upper middle income Latin America & Caribbean
## 2 Upper middle income East Asia & Pacific
## 3 High income Europe & Central Asia
## 4 Lower middle income South Asia
## 5 High income East Asia & Pacific
## 6 High income Middle East & North Africa
## 7 Upper middle income Europe & Central Asia
## 8 High income North America
## 9 Lower middle income Middle East & North Africa
## 10 Upper middle income Sub-Saharan Africa
## 11 High income Latin America & Caribbean
## 12 Upper middle income Middle East & North Africa
## 13 Lower middle income East Asia & Pacific
## 14 Lower middle income Sub-Saharan Africa
## 15 Lower middle income Europe & Central Asia
## 16 Lower middle income Latin America & Caribbean
## 17 Low income Sub-Saharan Africa
## 18 Low income Middle East & North Africa
# All possible pairs
all_combos <- expand.grid(
  `Income Group` = unique(world_bank$`Income Group`),
  Region = unique(world_bank$Region),
  stringsAsFactors = FALSE  # keep character columns so they match df_pairs in the join
)
all_combos
## Income Group Region
## 1 Upper middle income Latin America & Caribbean
## 2 High income Latin America & Caribbean
## 3 Lower middle income Latin America & Caribbean
## 4 Low income Latin America & Caribbean
## 5 Upper middle income East Asia & Pacific
## 6 High income East Asia & Pacific
## 7 Lower middle income East Asia & Pacific
## 8 Low income East Asia & Pacific
## 9 Upper middle income Europe & Central Asia
## 10 High income Europe & Central Asia
## 11 Lower middle income Europe & Central Asia
## 12 Low income Europe & Central Asia
## 13 Upper middle income South Asia
## 14 High income South Asia
## 15 Lower middle income South Asia
## 16 Low income South Asia
## 17 Upper middle income Middle East & North Africa
## 18 High income Middle East & North Africa
## 19 Lower middle income Middle East & North Africa
## 20 Low income Middle East & North Africa
## 21 Upper middle income North America
## 22 High income North America
## 23 Lower middle income North America
## 24 Low income North America
## 25 Upper middle income Sub-Saharan Africa
## 26 High income Sub-Saharan Africa
## 27 Lower middle income Sub-Saharan Africa
## 28 Low income Sub-Saharan Africa
# Missing pairs
missing_combos <- anti_join(
  all_combos,
  df_pairs,
  by = c("Region", "Income Group")
)
missing_combos
## Income Group Region
## 1 Low income Latin America & Caribbean
## 2 Low income East Asia & Pacific
## 3 Low income Europe & Central Asia
## 4 Upper middle income South Asia
## 5 High income South Asia
## 6 Low income South Asia
## 7 Upper middle income North America
## 8 Lower middle income North America
## 9 Low income North America
## 10 High income Sub-Saharan Africa
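Ten of the 28 possible region–income group combinations do not appear anywhere in the dataset. This likely reflects real-world economic structure rather than missing data. For example, North America has no countries outside the high-income category, and low-income countries appear only in Sub-Saharan Africa and Middle East & North Africa.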
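The same gaps can also be surfaced in one pipeline by counting observed pairs and then filling in the unobserved ones with a count of zero. A minimal sketch using `tidyr::complete()` on made-up data follows; the `toy` tibble and its `income` and `region` columns are hypothetical, not taken from the World Bank file:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per country.
toy <- tibble::tibble(
  income = c("High", "High", "Low", "Middle"),
  region = c("North", "South", "South", "North")
)

# Count observed pairs, then add every unobserved pair with n = 0.
pair_counts <- toy |>
  count(income, region) |>
  complete(income, region, fill = list(n = 0L))

# Pairs that never occur in the data.
missing_pairs <- pair_counts |>
  filter(n == 0L)
missing_pairs
```

This keeps the observed counts and the missing pairs in a single table, which can be convenient for plotting, whereas the `expand.grid()` plus `anti_join()` approach returns only the missing pairs.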
# Most and least frequent combinations
combo_counts <- world_bank |>
  filter(Time == max(Time, na.rm = TRUE)) |>
  count(Region, `Income Group`, sort = TRUE)
combo_counts
## # A tibble: 18 × 3
## Region `Income Group` n
## <chr> <chr> <int>
## 1 Europe & Central Asia High income 12
## 2 Sub-Saharan Africa Low income 11
## 3 East Asia & Pacific High income 5
## 4 Europe & Central Asia Upper middle income 5
## 5 Latin America & Caribbean Upper middle income 5
## 6 East Asia & Pacific Upper middle income 4
## 7 South Asia Lower middle income 4
## 8 Middle East & North Africa High income 3
## 9 Middle East & North Africa Lower middle income 3
## 10 Sub-Saharan Africa Lower middle income 3
## 11 East Asia & Pacific Lower middle income 2
## 12 Latin America & Caribbean High income 2
## 13 Latin America & Caribbean Lower middle income 2
## 14 North America High income 2
## 15 Europe & Central Asia Lower middle income 1
## 16 Middle East & North Africa Low income 1
## 17 Middle East & North Africa Upper middle income 1
## 18 Sub-Saharan Africa Upper middle income 1
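The most frequent combinations are High income in Europe & Central Asia (12 countries) and Low income in Sub-Saharan Africa (11 countries), which together account for about a third of the 67 countries. These dominant combinations may reflect groups of countries with broadly similar economic profiles. At the other end, four combinations contain a single country each: Lower middle income in Europe & Central Asia, Low income and Upper middle income in Middle East & North Africa, and Upper middle income in Sub-Saharan Africa.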
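The most and least frequent combinations can be pulled out programmatically rather than read off the sorted table. A minimal sketch with `slice_max()` and `slice_min()` follows, run on a subset of rows copied from the combo_counts output above (the names `combo_subset`, `most_common`, and `least_common` are introduced here for illustration):

```r
library(dplyr)

# A subset of rows copied from the combo_counts output above.
combo_subset <- tibble::tribble(
  ~Region,                      ~`Income Group`,       ~n,
  "Europe & Central Asia",      "High income",         12L,
  "Sub-Saharan Africa",         "Low income",          11L,
  "Sub-Saharan Africa",         "Lower middle income",  3L,
  "Europe & Central Asia",      "Lower middle income",  1L,
  "Middle East & North Africa", "Low income",           1L,
  "Middle East & North Africa", "Upper middle income",  1L,
  "Sub-Saharan Africa",         "Upper middle income",  1L
)

# slice_max()/slice_min() keep tied rows by default (with_ties = TRUE).
most_common <- combo_subset |> slice_max(n, n = 1)
least_common <- combo_subset |> slice_min(n, n = 1)

most_common
least_common
```

Keeping ties matters here because several combinations share the minimum count of one.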
ggplot(combo_counts, aes(x = Region, y = n, fill = `Income Group`)) +
  geom_col() +
  coord_flip()
The bar chart shows the count for each combination of region and income group. The High income and Europe & Central Asia combination dominates, possibly because of these countries' larger role in global economic patterns and their more complete data reporting in the WDI dataset.