title: "Data Dive: Week 3"
author: "Mohid"
date: "2026-02-03"

## Data Dive

### Week 3

# Load required packages

library(readr)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.0.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)

Load the World Bank Dataset

# Treat '..' placeholders as NA so the indicator columns parse as numeric.
world_bank <- read_csv("C:/Users/SP KHALID/Downloads/WDI- World Bank Dataset.csv", na = c(".."))

## Rows: 1675 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): Time Code, Country Name, Country Code, Region, Income Group
## dbl (14): Time, GDP (constant 2015 US$), GDP growth (annual %), GDP (current...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
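The `..` handling can be checked on a small inline sample before relying on it for the full file. The sketch below is only an illustration: the three rows are invented, and `sample_csv` and `sample_wb` are hypothetical names. It also lists `""` and `"NA"` explicitly, because passing `na` replaces readr's default of `c("", "NA")`.

```r
library(readr)

# Hypothetical three-row sample; values are invented for illustration.
sample_csv <- I(paste0(
  "Country Name,Time,GDP growth (annual %)\n",
  "Brazil,2000,4.39\n",
  "China,2000,..\n",
  "France,2000,3.92\n"
))

# Keep readr's default NA strings and add the '..' placeholder.
sample_wb <- read_csv(
  sample_csv,
  na = c("", "NA", ".."),
  show_col_types = FALSE   # silences the column specification message
)

# With '..' mapped to NA, the growth column is numeric rather than character.
class(sample_wb$`GDP growth (annual %)`)
sum(is.na(sample_wb$`GDP growth (annual %)`))
```

If the placeholder were not declared, a column containing `..` would most likely be read as character, and later calls such as `mean()` would fail or need extra conversion.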

world_bank

## # A tibble: 1,675 × 19
##     Time `Time Code` `Country Name` `Country Code` Region         `Income Group`
##    <dbl> <chr>       <chr>          <chr>          <chr>          <chr>         
##  1  2000 YR2000      Brazil         BRA            Latin America… Upper middle …
##  2  2000 YR2000      China          CHN            East Asia & P… Upper middle …
##  3  2000 YR2000      France         FRA            Europe & Cent… High income   
##  4  2000 YR2000      Germany        DEU            Europe & Cent… High income   
##  5  2000 YR2000      India          IND            South Asia     Lower middle …
##  6  2000 YR2000      Indonesia      IDN            East Asia & P… Upper middle …
##  7  2000 YR2000      Italy          ITA            Europe & Cent… High income   
##  8  2000 YR2000      Japan          JPN            East Asia & P… High income   
##  9  2000 YR2000      Korea, Rep.    KOR            East Asia & P… High income   
## 10  2000 YR2000      Mexico         MEX            Latin America… Upper middle …
## # ℹ 1,665 more rows
## # ℹ 13 more variables: `GDP (constant 2015 US$)` <dbl>,
## #   `GDP growth (annual %)` <dbl>, `GDP (current US$)` <dbl>,
## #   `Unemployment, total (% of total labor force)` <dbl>,
## #   `Inflation, consumer prices (annual %)` <dbl>, `Labor force, total` <dbl>,
## #   `Population, total` <dbl>,
## #   `Exports of goods and services (% of GDP)` <dbl>, …

dim(world_bank)

## [1] 1675   19

# Check column data types
glimpse(world_bank)

## Rows: 1,675
## Columns: 19
## $ Time                                                          <dbl> 2000, 20…
## $ `Time Code`                                                   <chr> "YR2000"…
## $ `Country Name`                                                <chr> "Brazil"…
## $ `Country Code`                                                <chr> "BRA", "…
## $ Region                                                        <chr> "Latin A…
## $ `Income Group`                                                <chr> "Upper m…
## $ `GDP (constant 2015 US$)`                                     <dbl> 1.18642e…
## $ `GDP growth (annual %)`                                       <dbl> 4.387949…
## $ `GDP (current US$)`                                           <dbl> 6.554482…
## $ `Unemployment, total (% of total labor force)`                <dbl> NA, 3.70…
## $ `Inflation, consumer prices (annual %)`                       <dbl> 7.044141…
## $ `Labor force, total`                                          <dbl> 80295093…
## $ `Population, total`                                           <dbl> 17401828…
## $ `Exports of goods and services (% of GDP)`                    <dbl> 10.18805…
## $ `Imports of goods and services (% of GDP)`                    <dbl> 12.45171…
## $ `General government final consumption expenditure (% of GDP)` <dbl> 18.76784…
## $ `Foreign direct investment, net inflows (% of GDP)`           <dbl> 5.033917…
## $ `Gross savings (% of GDP)`                                    <dbl> 13.99170…
## $ `Current account balance (% of GDP)`                          <dbl> -4.04774…

# Convert Time column to integer
world_bank$Time <- as.integer(world_bank$Time)

# Verify Datatypes
glimpse(world_bank)

## Rows: 1,675
## Columns: 19
## $ Time                                                          <int> 2000, 20…
## $ `Time Code`                                                   <chr> "YR2000"…
## $ `Country Name`                                                <chr> "Brazil"…
## $ `Country Code`                                                <chr> "BRA", "…
## $ Region                                                        <chr> "Latin A…
## $ `Income Group`                                                <chr> "Upper m…
## $ `GDP (constant 2015 US$)`                                     <dbl> 1.18642e…
## $ `GDP growth (annual %)`                                       <dbl> 4.387949…
## $ `GDP (current US$)`                                           <dbl> 6.554482…
## $ `Unemployment, total (% of total labor force)`                <dbl> NA, 3.70…
## $ `Inflation, consumer prices (annual %)`                       <dbl> 7.044141…
## $ `Labor force, total`                                          <dbl> 80295093…
## $ `Population, total`                                           <dbl> 17401828…
## $ `Exports of goods and services (% of GDP)`                    <dbl> 10.18805…
## $ `Imports of goods and services (% of GDP)`                    <dbl> 12.45171…
## $ `General government final consumption expenditure (% of GDP)` <dbl> 18.76784…
## $ `Foreign direct investment, net inflows (% of GDP)`           <dbl> 5.033917…
## $ `Gross savings (% of GDP)`                                    <dbl> 13.99170…
## $ `Current account balance (% of GDP)`                          <dbl> -4.04774…

Three group-by data frames

1: Group by income group for a single time period (the latest year)

df_income <- world_bank |>
  filter(Time == max(Time, na.rm = TRUE)) |>
  group_by(`Income Group`) |>
  summarise(
    n = n(),
    avg_gdp = mean(`GDP (current US$)`, na.rm = TRUE)) |>
  mutate(probability = n / sum(n))
df_income

## # A tibble: 4 × 4
##   `Income Group`          n avg_gdp probability
##   <chr>               <int>   <dbl>       <dbl>
## 1 High income            24 2.65e12       0.358
## 2 Low income             12 3.21e10       0.179
## 3 Lower middle income    15 4.52e11       0.224
## 4 Upper middle income    16 1.82e12       0.239

# Tag the income group(s) with the fewest rows
df_income <- df_income |>
  mutate(tag = ifelse(n == min(n), "LOW_PROBABILITY", "NORMAL"))
df_income

## # A tibble: 4 × 5
##   `Income Group`          n avg_gdp probability tag            
##   <chr>               <int>   <dbl>       <dbl> <chr>          
## 1 High income            24 2.65e12       0.358 NORMAL         
## 2 Low income             12 3.21e10       0.179 LOW_PROBABILITY
## 3 Lower middle income    15 4.52e11       0.224 NORMAL         
## 4 Upper middle income    16 1.82e12       0.239 NORMAL

Among rows from the most recent year, low-income countries have the lowest probability of being selected at random (12 of 67 rows, about 0.18). This suggests that low-income economies are underrepresented in the data, which may reflect limited reporting capacity or missing economic records.
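The probabilities in the table follow directly from the counts, and the arithmetic can be reproduced without the full dataset. The sketch below copies the four `n` values from the `df_income` output; the vector names `n` and `probability` simply mirror the column names.

```r
# Counts copied from the df_income output (latest year only).
n <- c(
  "High income" = 24,
  "Low income" = 12,
  "Lower middle income" = 15,
  "Upper middle income" = 16
)

# Each probability is the group's share of all rows in that year.
probability <- n / sum(n)
round(probability, 3)

# Group(s) with the smallest share; ties would all be returned.
names(probability)[probability == min(probability)]
```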

Hypothesis

Low-income countries have fewer observations because their economic data is reported less consistently across years.
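One way to probe this hypothesis is to separate "fewer countries" from "fewer years per country". Note that 1,675 rows divided by the 67 rows in the latest year is exactly 25, which would be consistent with every country appearing in every year; if that holds, any reporting gap would show up as missing values rather than missing rows. The sketch below uses an invented toy panel (the names `toy` and `coverage` are hypothetical) to show the kind of summary that could be run on `world_bank`.

```r
library(dplyr)

# Hypothetical toy panel; column names follow the dataset, values are invented.
toy <- tibble(
  `Country Name` = c("A", "A", "A", "B", "B", "C"),
  `Income Group` = c("High income", "High income", "High income",
                     "Low income", "Low income", "Low income"),
  Time = c(2000L, 2001L, 2002L, 2000L, 2002L, 2001L),
  `GDP (current US$)` = c(1, 2, 3, 4, NA, 5)
)

coverage <- toy |>
  # Per country: how many years appear, and how many have a GDP value.
  group_by(`Income Group`, `Country Name`) |>
  summarise(
    years_present = n_distinct(Time),
    years_with_gdp = sum(!is.na(`GDP (current US$)`)),
    .groups = "drop"
  ) |>
  # Per income group: number of countries and average coverage.
  group_by(`Income Group`) |>
  summarise(
    countries = n(),
    mean_years_present = mean(years_present),
    mean_years_with_gdp = mean(years_with_gdp)
  )
coverage
```

If `mean_years_present` were the same for every income group in the real data, the lower count for low-income countries would come from the number of countries in the sample, not from gaps across years.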

2: Group by region

df_region <- world_bank |>
  filter(Time == max(Time, na.rm = TRUE)) |>
  group_by(`Region`) |>
  summarise(
    n = n(),
    avg_unemployment = mean(`Unemployment, total (% of total labor force)`, na.rm = TRUE)
  ) |>
  mutate(probability = n / sum(n),
         tag = ifelse(n == min(n), "LOW_PROBABILITY", "NORMAL"))
df_region

## # A tibble: 7 × 5
##   Region                         n avg_unemployment probability tag            
##   <chr>                      <int>            <dbl>       <dbl> <chr>          
## 1 East Asia & Pacific           11             2.91      0.164  NORMAL         
## 2 Europe & Central Asia         18             5.80      0.269  NORMAL         
## 3 Latin America & Caribbean      9             5.54      0.134  NORMAL         
## 4 Middle East & North Africa     8             2.77      0.119  NORMAL         
## 5 North America                  2             5.19      0.0299 LOW_PROBABILITY
## 6 South Asia                     4             4.17      0.0597 NORMAL         
## 7 Sub-Saharan Africa            15            15.7       0.224  NORMAL

Europe & Central Asia has the highest probability of being selected (18 of 67 rows, about 0.27), so it is the most heavily represented region in the latest year. North America has the lowest (2 rows, about 0.03), followed by South Asia (4 rows, about 0.06).

Hypothesis

Regions whose countries report data less completely may have fewer observations, while Europe & Central Asia may appear more often because it contains more countries or has better reporting infrastructure.
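The two explanations in this hypothesis can be told apart with a per-region summary: the number of distinct countries addresses "more countries", and the share of missing values addresses "reporting". The sketch below is a minimal illustration on invented rows; `toy` and `region_check` are hypothetical names.

```r
library(dplyr)

# Hypothetical rows for one year; values are invented for illustration.
toy <- tibble(
  `Country Name` = c("A", "B", "C", "D", "E"),
  Region = c("Europe & Central Asia", "Europe & Central Asia",
             "Europe & Central Asia", "North America", "North America"),
  `Unemployment, total (% of total labor force)` = c(5.1, NA, 6.3, 4.2, 5.0)
)

region_check <- toy |>
  group_by(Region) |>
  summarise(
    # How many countries the region contributes.
    countries = n_distinct(`Country Name`),
    # Fraction of rows where unemployment is not reported.
    share_missing = mean(is.na(`Unemployment, total (% of total labor force)`))
  )
region_check
```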

ggplot(df_region, aes(x = `Region`, y = n)) +
  geom_col() +
  coord_flip()
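The chart above lists regions alphabetically. As an optional variation, the sketch below orders the bars by count and adds axis labels, which makes the largest and smallest regions easier to spot. The counts are copied from the `df_region` output; `region_counts` and `p` are hypothetical names.

```r
library(ggplot2)

# Counts copied from the df_region output (latest year only).
region_counts <- data.frame(
  Region = c("East Asia & Pacific", "Europe & Central Asia",
             "Latin America & Caribbean", "Middle East & North Africa",
             "North America", "South Asia", "Sub-Saharan Africa"),
  n = c(11, 18, 9, 8, 2, 4, 15)
)

# reorder() sorts the regions by n so bars appear from smallest to largest.
p <- ggplot(region_counts, aes(x = reorder(Region, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Region", y = "Rows in latest year")
p
```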

3: Group by two categorical variables (income group and region)

df_combo <- world_bank |>
  filter(Time == max(Time, na.rm = TRUE)) |>
  group_by(`Income Group`, `Region` ) |>
  summarise(n = n(), .groups = "drop") |>
  mutate(probability = n / sum(n),
         tag = ifelse(n == min(n), "LOW_PROBABILITY", "NORMAL"))
df_combo

## # A tibble: 18 × 5
##    `Income Group`      Region                         n probability tag         
##    <chr>               <chr>                      <int>       <dbl> <chr>       
##  1 High income         East Asia & Pacific            5      0.0746 NORMAL      
##  2 High income         Europe & Central Asia         12      0.179  NORMAL      
##  3 High income         Latin America & Caribbean      2      0.0299 NORMAL      
##  4 High income         Middle East & North Africa     3      0.0448 NORMAL      
##  5 High income         North America                  2      0.0299 NORMAL      
##  6 Low income          Middle East & North Africa     1      0.0149 LOW_PROBABI…
##  7 Low income          Sub-Saharan Africa            11      0.164  NORMAL      
##  8 Lower middle income East Asia & Pacific            2      0.0299 NORMAL      
##  9 Lower middle income Europe & Central Asia          1      0.0149 LOW_PROBABI…
## 10 Lower middle income Latin America & Caribbean      2      0.0299 NORMAL      
## 11 Lower middle income Middle East & North Africa     3      0.0448 NORMAL      
## 12 Lower middle income South Asia                     4      0.0597 NORMAL      
## 13 Lower middle income Sub-Saharan Africa             3      0.0448 NORMAL      
## 14 Upper middle income East Asia & Pacific            4      0.0597 NORMAL      
## 15 Upper middle income Europe & Central Asia          5      0.0746 NORMAL      
## 16 Upper middle income Latin America & Caribbean      5      0.0746 NORMAL      
## 17 Upper middle income Middle East & North Africa     1      0.0149 LOW_PROBABI…
## 18 Upper middle income Sub-Saharan Africa             1      0.0149 LOW_PROBABI…

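Four combinations occur only once in the latest year: low income in Middle East & North Africa, lower middle income in Europe & Central Asia, and upper middle income in both Middle East & North Africa and Sub-Saharan Africa. Each has a selection probability of about 0.015 (1 of 67 rows), which is why they are tagged as low-probability anomalies.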
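The `n == min(n)` rule tags every combination tied for the smallest count and nothing else. A probability cutoff is one alternative; the sketch below compares the two rules using the 18 counts copied from the `df_combo` output. The cutoff of 0.03 is an arbitrary assumption chosen for illustration, and `combo`, `tag_min`, and `tag_cutoff` are hypothetical names.

```r
library(dplyr)

# Counts copied from the df_combo output (18 observed pairs, 67 rows).
combo <- tibble(
  n = c(5, 12, 2, 3, 2, 1, 11, 2, 1, 2, 3, 4, 3, 4, 5, 5, 1, 1)
) |>
  mutate(
    probability = n / sum(n),
    # Rule used in this analysis: only the minimum count is flagged.
    tag_min = if_else(n == min(n), "LOW_PROBABILITY", "NORMAL"),
    # Hypothetical alternative: flag anything below a fixed probability.
    tag_cutoff = if_else(probability < 0.03, "LOW_PROBABILITY", "NORMAL")
  )

count(combo, tag_min)
count(combo, tag_cutoff)
```

With this cutoff, the combinations with two rows (2/67, about 0.0299) are flagged as well, so the choice of rule changes which pairs are treated as anomalies.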

### Hypothesis

Some region and income-group pairings may be rare because economic constraints keep most countries in a region within a narrow range of income levels.

### Missing Combinations

# Identify existing pairs
df_pairs <- world_bank |>
  distinct(`Income Group`, `Region`)
df_pairs

## # A tibble: 18 × 2
##    `Income Group`      Region                    
##    <chr>               <chr>                     
##  1 Upper middle income Latin America & Caribbean 
##  2 Upper middle income East Asia & Pacific       
##  3 High income         Europe & Central Asia     
##  4 Lower middle income South Asia                
##  5 High income         East Asia & Pacific       
##  6 High income         Middle East & North Africa
##  7 Upper middle income Europe & Central Asia     
##  8 High income         North America             
##  9 Lower middle income Middle East & North Africa
## 10 Upper middle income Sub-Saharan Africa        
## 11 High income         Latin America & Caribbean 
## 12 Upper middle income Middle East & North Africa
## 13 Lower middle income East Asia & Pacific       
## 14 Lower middle income Sub-Saharan Africa        
## 15 Lower middle income Europe & Central Asia     
## 16 Lower middle income Latin America & Caribbean 
## 17 Low income          Sub-Saharan Africa        
## 18 Low income          Middle East & North Africa

# All possible pairs (kept as character so the join keys match df_pairs)
all_combos <- expand.grid(
  `Income Group` = unique(world_bank$`Income Group`),
  Region = unique(world_bank$Region),
  stringsAsFactors = FALSE
)
all_combos

##           Income Group                     Region
## 1  Upper middle income  Latin America & Caribbean
## 2          High income  Latin America & Caribbean
## 3  Lower middle income  Latin America & Caribbean
## 4           Low income  Latin America & Caribbean
## 5  Upper middle income        East Asia & Pacific
## 6          High income        East Asia & Pacific
## 7  Lower middle income        East Asia & Pacific
## 8           Low income        East Asia & Pacific
## 9  Upper middle income      Europe & Central Asia
## 10         High income      Europe & Central Asia
## 11 Lower middle income      Europe & Central Asia
## 12          Low income      Europe & Central Asia
## 13 Upper middle income                 South Asia
## 14         High income                 South Asia
## 15 Lower middle income                 South Asia
## 16          Low income                 South Asia
## 17 Upper middle income Middle East & North Africa
## 18         High income Middle East & North Africa
## 19 Lower middle income Middle East & North Africa
## 20          Low income Middle East & North Africa
## 21 Upper middle income              North America
## 22         High income              North America
## 23 Lower middle income              North America
## 24          Low income              North America
## 25 Upper middle income         Sub-Saharan Africa
## 26         High income         Sub-Saharan Africa
## 27 Lower middle income         Sub-Saharan Africa
## 28          Low income         Sub-Saharan Africa

# Missing pairs
missing_combos <- anti_join(
  all_combos,
  df_pairs,
  by = c("Region", "Income Group")
)
missing_combos

##           Income Group                    Region
## 1           Low income Latin America & Caribbean
## 2           Low income       East Asia & Pacific
## 3           Low income     Europe & Central Asia
## 4  Upper middle income                South Asia
## 5          High income                South Asia
## 6           Low income                South Asia
## 7  Upper middle income             North America
## 8  Lower middle income             North America
## 9           Low income             North America
## 10         High income        Sub-Saharan Africa

Some region–income group combinations do not exist in the dataset. This likely reflects real-world economic structure rather than missing data. For example, a region may have no countries in the “low income” category.
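An alternative to building the grid and anti-joining is to count the observed pairs and let `tidyr::complete()` add the absent ones with a count of zero. The sketch below shows the idea on three invented rows; `toy` and `zero_filled` are hypothetical names, and the region and income labels are reused only for readability.

```r
library(dplyr)
library(tidyr)

# Hypothetical rows; only two of the four possible pairs are present.
toy <- tibble(
  `Income Group` = c("High income", "High income", "Low income"),
  Region = c("North America", "North America", "Sub-Saharan Africa")
)

zero_filled <- toy |>
  count(`Income Group`, Region) |>
  # complete() adds every missing pair and fills its count with 0.
  complete(`Income Group`, Region, fill = list(n = 0L))

# Pairs that never occur in the toy data.
filter(zero_filled, n == 0)
```

Keeping the zero-count rows in the same table as the observed counts can be convenient when the result feeds a plot or a probability calculation.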

# Most and least frequent combinations
combo_counts <- world_bank |>
  filter(Time == max(Time, na.rm = TRUE)) |>
  count(Region,`Income Group`, sort = TRUE)
combo_counts

## # A tibble: 18 × 3
##    Region                     `Income Group`          n
##    <chr>                      <chr>               <int>
##  1 Europe & Central Asia      High income            12
##  2 Sub-Saharan Africa         Low income             11
##  3 East Asia & Pacific        High income             5
##  4 Europe & Central Asia      Upper middle income     5
##  5 Latin America & Caribbean  Upper middle income     5
##  6 East Asia & Pacific        Upper middle income     4
##  7 South Asia                 Lower middle income     4
##  8 Middle East & North Africa High income             3
##  9 Middle East & North Africa Lower middle income     3
## 10 Sub-Saharan Africa         Lower middle income     3
## 11 East Asia & Pacific        Lower middle income     2
## 12 Latin America & Caribbean  High income             2
## 13 Latin America & Caribbean  Lower middle income     2
## 14 North America              High income             2
## 15 Europe & Central Asia      Lower middle income     1
## 16 Middle East & North Africa Low income              1
## 17 Middle East & North Africa Upper middle income     1
## 18 Sub-Saharan Africa         Upper middle income     1

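The most frequent combinations are high-income countries in Europe & Central Asia (12) and low-income countries in Sub-Saharan Africa (11), which together account for about a third of the rows in the latest year (23 of 67). These dominant pairings may reflect countries with similar economic profiles clustering within a region. The least frequent combinations, with one country each, are lower-middle-income Europe & Central Asia, low-income Middle East & North Africa, and upper-middle-income countries in Middle East & North Africa and Sub-Saharan Africa; such pairings contribute little to any region-level or income-level average.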

Visualization

ggplot(combo_counts, aes(x = Region, y = n, fill = `Income Group`)) +
  geom_col() +
  coord_flip()

The stacked bar chart shows the count for each combination of region and income group. The high-income Europe & Central Asia combination dominates, which may reflect both the number of high-income countries in that region and their more complete data reporting in the WDI dataset.