Data Dive: Week 4

Data Dive

Week 3

Load required packages

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load the World Bank Dataset

# Treat '..' placeholders as NA so the indicator columns parse as numeric.
world_bank <- read_csv("C:/Users/SP KHALID/Downloads/WDI- World Bank Dataset.csv", na = c('..'))

## Rows: 1675 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): Time Code, Country Name, Country Code, Region, Income Group
## dbl (14): Time, GDP (constant 2015 US$), GDP growth (annual %), GDP (current...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
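The effect of the `na` argument can be checked on a small inline example. The sketch below is illustrative only: the three rows are made up, and it assumes readr 2.0 or later, where `I()` marks a string as literal CSV data rather than a file path.

```r
library(readr)

# Hypothetical three-row extract; ".." marks a missing value, as in the WDI export
csv_text <- I("Country Name,Unemployment\nBrazil,..\nChina,3.7\nIndia,..\n")

toy <- read_csv(csv_text, na = c(".."), show_col_types = FALSE)

# With ".." mapped to NA, the column parses as numeric rather than character
is.numeric(toy$Unemployment)

# Count missing values per column
na_counts <- colSums(is.na(toy))
na_counts
```

Running the same `colSums(is.na(...))` call on `world_bank` would show how many values each indicator is missing before any sampling is done.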

world_bank

## # A tibble: 1,675 × 19
##     Time `Time Code` `Country Name` `Country Code` Region         `Income Group`
##    <dbl> <chr>       <chr>          <chr>          <chr>          <chr>         
##  1  2000 YR2000      Brazil         BRA            Latin America… Upper middle …
##  2  2000 YR2000      China          CHN            East Asia & P… Upper middle …
##  3  2000 YR2000      France         FRA            Europe & Cent… High income   
##  4  2000 YR2000      Germany        DEU            Europe & Cent… High income   
##  5  2000 YR2000      India          IND            South Asia     Lower middle …
##  6  2000 YR2000      Indonesia      IDN            East Asia & P… Upper middle …
##  7  2000 YR2000      Italy          ITA            Europe & Cent… High income   
##  8  2000 YR2000      Japan          JPN            East Asia & P… High income   
##  9  2000 YR2000      Korea, Rep.    KOR            East Asia & P… High income   
## 10  2000 YR2000      Mexico         MEX            Latin America… Upper middle …
## # ℹ 1,665 more rows
## # ℹ 13 more variables: `GDP (constant 2015 US$)` <dbl>,
## #   `GDP growth (annual %)` <dbl>, `GDP (current US$)` <dbl>,
## #   `Unemployment, total (% of total labor force)` <dbl>,
## #   `Inflation, consumer prices (annual %)` <dbl>, `Labor force, total` <dbl>,
## #   `Population, total` <dbl>,
## #   `Exports of goods and services (% of GDP)` <dbl>, …

dim(world_bank)

## [1] 1675   19

# Check column data types
glimpse(world_bank)

## Rows: 1,675
## Columns: 19
## $ Time                                                          <dbl> 2000, 20…
## $ `Time Code`                                                   <chr> "YR2000"…
## $ `Country Name`                                                <chr> "Brazil"…
## $ `Country Code`                                                <chr> "BRA", "…
## $ Region                                                        <chr> "Latin A…
## $ `Income Group`                                                <chr> "Upper m…
## $ `GDP (constant 2015 US$)`                                     <dbl> 1.18642e…
## $ `GDP growth (annual %)`                                       <dbl> 4.387949…
## $ `GDP (current US$)`                                           <dbl> 6.554482…
## $ `Unemployment, total (% of total labor force)`                <dbl> NA, 3.70…
## $ `Inflation, consumer prices (annual %)`                       <dbl> 7.044141…
## $ `Labor force, total`                                          <dbl> 80295093…
## $ `Population, total`                                           <dbl> 17401828…
## $ `Exports of goods and services (% of GDP)`                    <dbl> 10.18805…
## $ `Imports of goods and services (% of GDP)`                    <dbl> 12.45171…
## $ `General government final consumption expenditure (% of GDP)` <dbl> 18.76784…
## $ `Foreign direct investment, net inflows (% of GDP)`           <dbl> 5.033917…
## $ `Gross savings (% of GDP)`                                    <dbl> 13.99170…
## $ `Current account balance (% of GDP)`                          <dbl> -4.04774…

# Convert Time column to integer
world_bank$Time <- as.integer(world_bank$Time)

# Clean column names
library(janitor)

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

df <- world_bank |> clean_names()
glimpse(df)

## Rows: 1,675
## Columns: 19
## $ time                                                            <int> 2000, …
## $ time_code                                                       <chr> "YR200…
## $ country_name                                                    <chr> "Brazi…
## $ country_code                                                    <chr> "BRA",…
## $ region                                                          <chr> "Latin…
## $ income_group                                                    <chr> "Upper…
## $ gdp_constant_2015_us                                            <dbl> 1.1864…
## $ gdp_growth_annual_percent                                       <dbl> 4.3879…
## $ gdp_current_us                                                  <dbl> 6.5544…
## $ unemployment_total_percent_of_total_labor_force                 <dbl> NA, 3.…
## $ inflation_consumer_prices_annual_percent                        <dbl> 7.0441…
## $ labor_force_total                                               <dbl> 802950…
## $ population_total                                                <dbl> 174018…
## $ exports_of_goods_and_services_percent_of_gdp                    <dbl> 10.188…
## $ imports_of_goods_and_services_percent_of_gdp                    <dbl> 12.451…
## $ general_government_final_consumption_expenditure_percent_of_gdp <dbl> 18.767…
## $ foreign_direct_investment_net_inflows_percent_of_gdp            <dbl> 5.0339…
## $ gross_savings_percent_of_gdp                                    <dbl> 13.991…
## $ current_account_balance_percent_of_gdp                          <dbl> -4.047…
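To see how the renaming works on individual names, `janitor::make_clean_names()` can be applied to a character vector directly. This is a small sketch using three of the original column names; the vector `raw_names` is introduced here only for illustration.

```r
library(janitor)

# Three of the original World Bank column names shown earlier
raw_names <- c(
  "Time Code",
  "GDP growth (annual %)",
  "GDP (constant 2015 US$)"
)

# Lower-case, replace spaces and punctuation with underscores, spell out "%"
clean <- make_clean_names(raw_names)
clean
```

The results should match the corresponding names in the `glimpse(df)` output above.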

Each row of this WDI dataset is one country-year observation, along with a set of economic indicators. For this investigation, I focus on the following indicators to assess how sampling variability affects conclusions:

GDP growth
Unemployment
Inflation
Income group

Select Indicators

df <- df |>
  select(
    time,
    country_name,
    region,
    income_group,
    gdp_growth_annual_percent,
    unemployment_total_percent_of_total_labor_force,
    inflation_consumer_prices_annual_percent
  )

Sampling Parameters 1

A sampling fraction of 25% is chosen to simulate a mid-sized data collection, and three sub-samples are drawn.

sample_frac <- 0.25
n_samples <- 3

Sub-samples 1

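# Start with an empty tibble and append one sub-sample per iteration
df_samples <- tibble()

for (sample_i in seq_len(n_samples)) {
  # Draw rows with replacement and tag them with the sub-sample number
  df_i <- df |>
    sample_n(size = sample_frac * nrow(df), replace = TRUE) |>
    mutate(sample_num = sample_i)

  df_samples <- bind_rows(df_samples, df_i)
}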
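Because the loop draws new random rows on every run, the printed numbers change each time the document is knit. One way to make the draws repeatable is to set a seed first. The following is a minimal sketch under stated assumptions: `draw_samples()` and the `toy` data are hypothetical, and `slice_sample()` (the current dplyr replacement for `sample_n()`) is swapped in for the original call.

```r
library(dplyr)

# Toy stand-in for `df` (made-up values, not World Bank data)
toy <- tibble(
  time = rep(2000:2003, each = 10),
  gdp_growth_annual_percent = seq(-2, 7.75, length.out = 40)
)

# Draw `n_samples` sub-samples with replacement, repeatably for a given seed
draw_samples <- function(data, sample_frac, n_samples, seed) {
  set.seed(seed)
  lapply(seq_len(n_samples), function(i) {
    data |>
      slice_sample(prop = sample_frac, replace = TRUE) |>
      mutate(sample_num = i)
  }) |>
    bind_rows()
}

# Two runs with the same seed give identical sub-samples
run_a <- draw_samples(toy, sample_frac = 0.25, n_samples = 3, seed = 42)
run_b <- draw_samples(toy, sample_frac = 0.25, n_samples = 3, seed = 42)
identical(run_a, run_b)
```

Wrapping the loop in a function also avoids repeating the same block for each sampling fraction; the seed value itself is arbitrary.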

Summaries across samples 1

df_samples |>
  filter(time == max(time, na.rm = TRUE)) |>
  group_by(sample_num) |>
  summarise(
    mean_gdp_growth = mean(gdp_growth_annual_percent, na.rm = TRUE),
    median_unemployment = median(unemployment_total_percent_of_total_labor_force, na.rm = TRUE),
    mean_inflation = mean(inflation_consumer_prices_annual_percent, na.rm = TRUE),
    n = n()
  )

## # A tibble: 3 × 5
##   sample_num mean_gdp_growth median_unemployment mean_inflation     n
##        <int>           <dbl>               <dbl>          <dbl> <int>
## 1          1            4.22                5.20           4.38    15
## 2          2            2.39                3.44           4.63    14
## 3          3            3.27                4.25           6.37    19

The summary statistics are broadly comparable across the three 25% sub-samples, although each rests on only 14 to 19 latest-year rows. Mean GDP growth ranges from 2.39% to 4.22% and median unemployment from 3.44% to 5.20%. Mean inflation shows the widest spread, from 4.38% to 6.37%.
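The spread of each statistic can be computed directly rather than read off by eye. The sketch below re-enters the values from the summary table printed above (the tibble `summary_25` is introduced only for this illustration) and takes the difference between the largest and smallest value per column.

```r
library(dplyr)

# Values copied from the summary table printed above (25% sub-samples)
summary_25 <- tibble(
  sample_num          = 1:3,
  mean_gdp_growth     = c(4.22, 2.39, 3.27),
  median_unemployment = c(5.20, 3.44, 4.25),
  mean_inflation      = c(4.38, 4.63, 6.37)
)

# Spread (max minus min) of each statistic across the three sub-samples
spread_25 <- summary_25 |>
  summarise(across(-sample_num, ~ max(.x) - min(.x)))

spread_25
```

In the actual analysis, the same `summarise(across(...))` step could be piped onto the grouped summary itself instead of onto re-typed values.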

Comparing income groups across samples 1

df_samples |>
  filter(time == max(time, na.rm = TRUE)) |>
  group_by(sample_num, income_group) |>
  summarise(
    count = n(),
    .groups = 'drop'
  )

## # A tibble: 12 × 3
##    sample_num income_group        count
##         <int> <chr>               <int>
##  1          1 High income             6
##  2          1 Low income              2
##  3          1 Lower middle income     4
##  4          1 Upper middle income     3
##  5          2 High income             5
##  6          2 Low income              3
##  7          2 Lower middle income     2
##  8          2 Upper middle income     4
##  9          3 High income             5
## 10          3 Low income              3
## 11          3 Lower middle income     4
## 12          3 Upper middle income     7

Anomalies 1

df_samples |>
  filter(time == max(time, na.rm = TRUE)) |>
  group_by(sample_num) |>
  summarise(
    min_gdp_growth = min(gdp_growth_annual_percent, na.rm = TRUE),
    max_gdp_growth = max(gdp_growth_annual_percent, na.rm = TRUE)
  )

## # A tibble: 3 × 3
##   sample_num min_gdp_growth max_gdp_growth
##        <int>          <dbl>          <dbl>
## 1          1          2.00            8.89
## 2          2         -4.17            7.09
## 3          3          0.103          10.3

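Negative GDP growth appears in only one sub-sample (a minimum of -4.17% in sub-sample 2), while the other two have minimums of 2.00% and 0.103%. Because these extremes are not consistent across sub-samples, they may reflect sampling variability rather than systematic differences.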
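To judge whether an extreme is an anomaly, it helps to see which row produced it. The sketch below uses made-up rows (the countries "A" to "E" and their values are hypothetical) and dplyr's `slice_min()` and `slice_max()` to pull the lowest and highest growth row from each sub-sample.

```r
library(dplyr)

# Hypothetical sub-sample rows (made-up countries and values)
toy_samples <- tibble(
  sample_num   = c(1, 1, 1, 2, 2, 2),
  country_name = c("A", "B", "C", "A", "D", "E"),
  gdp_growth_annual_percent = c(2.0, 8.9, 4.1, -4.2, 7.1, 3.0)
)

# Row with the lowest growth in each sub-sample
lowest <- toy_samples |>
  group_by(sample_num) |>
  slice_min(gdp_growth_annual_percent, n = 1, with_ties = FALSE) |>
  ungroup()

# Row with the highest growth in each sub-sample
highest <- toy_samples |>
  group_by(sample_num) |>
  slice_max(gdp_growth_annual_percent, n = 1, with_ties = FALSE) |>
  ungroup()

# Stack both, labelling which extreme each row represents
extremes <- bind_rows(lowest = lowest, highest = highest, .id = "extreme")
extremes
```

Applied to `df_samples`, this would show whether the same country-year drives the minimum or maximum in several sub-samples or only in one.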

Changing sampling parameters (10%)

sample_frac <- 0.10
n_samples <- 3

Sub-samples 2

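# Start with an empty tibble and append one sub-sample per iteration
df_samples <- tibble()

for (sample_i in seq_len(n_samples)) {
  # Draw rows with replacement and tag them with the sub-sample number
  df_i <- df |>
    sample_n(size = sample_frac * nrow(df), replace = TRUE) |>
    mutate(sample_num = sample_i)

  df_samples <- bind_rows(df_samples, df_i)
}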

Summaries across samples 2

df_samples |>
  filter(time == max(time, na.rm = TRUE)) |>
  group_by(sample_num) |>
  summarise(
    mean_gdp_growth = mean(gdp_growth_annual_percent, na.rm = TRUE),
    median_unemployment = median(unemployment_total_percent_of_total_labor_force, na.rm = TRUE),
    mean_inflation = mean(inflation_consumer_prices_annual_percent, na.rm = TRUE),
    n = n()
  )

## # A tibble: 3 × 5
##   sample_num mean_gdp_growth median_unemployment mean_inflation     n
##        <int>           <dbl>               <dbl>          <dbl> <int>
## 1          1            3.20                7.09           6.35     9
## 2          2            3.72                3.90           5.60     4
## 3          3            4.07                4.17           3.86    10

The summary statistics still vary noticeably across sub-samples, and each now rests on only 4 to 10 latest-year rows. Mean GDP growth ranges from 3.20% to 4.07%, median unemployment from 3.90% to 7.09%, and mean inflation from 3.86% to 6.35%.

Comparing income groups across samples 2

df_samples |>
  filter(time == max(time, na.rm = TRUE)) |>
  group_by(sample_num, income_group) |>
  summarise(
    count = n(),
    .groups = 'drop'
  )

## # A tibble: 11 × 3
##    sample_num income_group        count
##         <int> <chr>               <int>
##  1          1 High income             3
##  2          1 Low income              1
##  3          1 Lower middle income     2
##  4          1 Upper middle income     3
##  5          2 High income             1
##  6          2 Low income              1
##  7          2 Lower middle income     2
##  8          3 High income             2
##  9          3 Low income              2
## 10          3 Lower middle income     4
## 11          3 Upper middle income     2
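The table above has 11 rows rather than 12 because one sub-sample drew no rows from one income group, and `summarise()` silently omits empty combinations. A small sketch of making such gaps explicit with `tidyr::complete()` follows; the counts are re-typed from the table above, and the tibble `counts_10` is introduced only for this illustration.

```r
library(dplyr)
library(tidyr)

# Counts copied from the 10% income-group table printed above
counts_10 <- tibble(
  sample_num   = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  income_group = c(
    "High income", "Low income", "Lower middle income", "Upper middle income",
    "High income", "Low income", "Lower middle income",
    "High income", "Low income", "Lower middle income", "Upper middle income"
  ),
  count = c(3, 1, 2, 3, 1, 1, 2, 2, 2, 4, 2)
)

# Add explicit zero rows for groups a sub-sample never drew
counts_full <- counts_10 |>
  complete(sample_num, income_group, fill = list(count = 0))

# Show the combinations that were missing from the original table
missing_groups <- filter(counts_full, count == 0)
missing_groups
```

Here the missing combination is sub-sample 2 and "Upper middle income", which is easy to overlook when only non-empty groups are printed.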

Anomalies 2

df_samples |>
  filter(time == max(time, na.rm = TRUE)) |>
  group_by(sample_num) |>
  summarise(
    min_gdp_growth = min(gdp_growth_annual_percent, na.rm = TRUE),
    max_gdp_growth = max(gdp_growth_annual_percent, na.rm = TRUE)
  )

## # A tibble: 3 × 3
##   sample_num min_gdp_growth max_gdp_growth
##        <int>          <dbl>          <dbl>
## 1          1          0.916           5.69
## 2          2          1.13            4.80
## 3          3         -1.12            8.89

Extreme GDP growth values also appear: sub-sample 3 spans -1.12% to 8.89%, while the other two stay between 0.916% and 5.69%. These results suggest that small samples are highly sensitive to which countries are included, increasing the likelihood that extreme values could be misinterpreted as meaningful patterns or anomalies.

Changing sampling parameters (75%)

sample_frac <- 0.75
n_samples <- 3

Sub-samples 3

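# Start with an empty tibble and append one sub-sample per iteration
df_samples <- tibble()

for (sample_i in seq_len(n_samples)) {
  # Draw rows with replacement and tag them with the sub-sample number
  df_i <- df |>
    sample_n(size = sample_frac * nrow(df), replace = TRUE) |>
    mutate(sample_num = sample_i)

  df_samples <- bind_rows(df_samples, df_i)
}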

Summaries across samples 3

df_samples |>
  filter(time == max(time, na.rm = TRUE)) |>
  group_by(sample_num) |>
  summarise(
    mean_gdp_growth = mean(gdp_growth_annual_percent, na.rm = TRUE),
    median_unemployment = median(unemployment_total_percent_of_total_labor_force, na.rm = TRUE),
    mean_inflation = mean(inflation_consumer_prices_annual_percent, na.rm = TRUE),
    n = n()
  )

## # A tibble: 3 × 5
##   sample_num mean_gdp_growth median_unemployment mean_inflation     n
##        <int>           <dbl>               <dbl>          <dbl> <int>
## 1          1            2.96                4.36           7.86    47
## 2          2            3.35                4.17           5.47    52
## 3          3            3.02                4.34           7.11    52

With a 75% sampling fraction, the estimates of growth and unemployment become more stable across sub-samples. Mean GDP growth falls within a narrow range (2.96% to 3.35%), and median unemployment differs only slightly (4.17% to 4.36%). Mean inflation, however, still ranges from 5.47% to 7.86%, so it remains the least stable of the three statistics.

Comparing income groups across samples 3

df_samples |>
  filter(time == max(time, na.rm = TRUE)) |>
  group_by(sample_num, income_group) |>
  summarise(
    count = n(),
    .groups = 'drop'
  )

## # A tibble: 12 × 3
##    sample_num income_group        count
##         <int> <chr>               <int>
##  1          1 High income            13
##  2          1 Low income             10
##  3          1 Lower middle income    11
##  4          1 Upper middle income    13
##  5          2 High income            13
##  6          2 Low income             12
##  7          2 Lower middle income    11
##  8          2 Upper middle income    16
##  9          3 High income            19
## 10          3 Low income             13
## 11          3 Lower middle income     9
## 12          3 Upper middle income    11

Anomalies 3

df_samples |>
  filter(time == max(time, na.rm = TRUE)) |>
  group_by(sample_num) |>
  summarise(
    min_gdp_growth = min(gdp_growth_annual_percent, na.rm = TRUE),
    max_gdp_growth = max(gdp_growth_annual_percent, na.rm = TRUE)
  )

## # A tibble: 3 × 3
##   sample_num min_gdp_growth max_gdp_growth
##        <int>          <dbl>          <dbl>
## 1          1          -4.17           8.89
## 2          2          -4.17           8.89
## 3          3          -4.17          10.3

The same minimum GDP growth (-4.17%) appears in all three sub-samples, and the maximum is 8.89% in two of them, so the extremes themselves are now largely consistent. This indicates that larger samples reduce the influence of random variation on summary statistics.

What would you have called an anomaly in one sub-sample that you wouldn’t in another?

Extreme inflation values and unusually high or low GDP growth rates appear in some sub-samples but not in others. For example, the -4.17% minimum GDP growth shows up in only one of the 25% sub-samples and in none of the 10% sub-samples, yet it appears in every 75% sub-sample. Viewed within a single small sub-sample, such a value might be classified as an anomaly. Its inconsistent appearance across sub-samples suggests that it is more likely the result of sampling variability than a true outlier.

Are there aspects of the data that are consistent among all sub-samples?

Across all sub-samples, mean GDP growth remains positive (2.39% to 4.22%) and median unemployment stays within a moderate range (3.44% to 7.09%). These consistent features indicate underlying stability in the data despite sampling-related variation.

Observation and Conclusion

The observations above show a clear relationship between sample size and the reliability of the corresponding summary statistics. Smaller samples (10%) produce more variable results and apparent anomalies, while increasing the sampling fraction to 75% produces more consistent patterns. This shows how drawing conclusions from limited data can undermine the credibility of the statistics, and why an adequate sample size is important.
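Three sub-samples per fraction give only a rough impression of this relationship. A simulation with many repeated draws makes it easier to see. The sketch below is illustrative only: the `population` vector is simulated (it is not the World Bank data), `sd_of_means()` is a hypothetical helper, and the seed and number of repetitions are arbitrary choices.

```r
set.seed(123)  # arbitrary seed so the sketch is repeatable

# Hypothetical "population" of 1,675 values (same row count as the dataset,
# but simulated numbers, not World Bank data)
population <- rnorm(1675, mean = 3, sd = 4)

# Standard deviation of the sample mean over many repeated draws
sd_of_means <- function(frac, n_reps = 500) {
  size <- floor(frac * length(population))
  means <- replicate(n_reps, mean(sample(population, size, replace = TRUE)))
  sd(means)
}

# Compare the three sampling fractions used in this report
fractions <- c(0.10, 0.25, 0.75)
spread <- vapply(fractions, sd_of_means, numeric(1))
names(spread) <- paste0(fractions * 100, "%")
spread
```

The variability of the sample mean should shrink as the fraction grows, roughly in proportion to one over the square root of the sample size, which is the pattern the three sampling rounds above hint at.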