Load required packages
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Load the World Bank Dataset
# Treat '..' entries as missing values (NA) when reading the file.
world_bank <- read_csv("C:/Users/SP KHALID/Downloads/WDI- World Bank Dataset.csv" , na = c('..'))
## Rows: 1675 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Time Code, Country Name, Country Code, Region, Income Group
## dbl (14): Time, GDP (constant 2015 US$), GDP growth (annual %), GDP (current...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
world_bank
## # A tibble: 1,675 × 19
## Time `Time Code` `Country Name` `Country Code` Region `Income Group`
## <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 2000 YR2000 Brazil BRA Latin America… Upper middle …
## 2 2000 YR2000 China CHN East Asia & P… Upper middle …
## 3 2000 YR2000 France FRA Europe & Cent… High income
## 4 2000 YR2000 Germany DEU Europe & Cent… High income
## 5 2000 YR2000 India IND South Asia Lower middle …
## 6 2000 YR2000 Indonesia IDN East Asia & P… Upper middle …
## 7 2000 YR2000 Italy ITA Europe & Cent… High income
## 8 2000 YR2000 Japan JPN East Asia & P… High income
## 9 2000 YR2000 Korea, Rep. KOR East Asia & P… High income
## 10 2000 YR2000 Mexico MEX Latin America… Upper middle …
## # ℹ 1,665 more rows
## # ℹ 13 more variables: `GDP (constant 2015 US$)` <dbl>,
## # `GDP growth (annual %)` <dbl>, `GDP (current US$)` <dbl>,
## # `Unemployment, total (% of total labor force)` <dbl>,
## # `Inflation, consumer prices (annual %)` <dbl>, `Labor force, total` <dbl>,
## # `Population, total` <dbl>,
## # `Exports of goods and services (% of GDP)` <dbl>, …
dim(world_bank)
## [1] 1675 19
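As a small illustration of what the `na` argument does, the sketch below reads a toy CSV string instead of the real file. The column names and values are made up for this example; only the `..` placeholder mirrors the WDI export.

```r
library(readr)

# Toy CSV text (hypothetical values) that uses ".." for a missing number,
# the same placeholder handled by na = c('..') in the real import.
toy_csv <- I("country,gdp_growth\nA,1.5\nB,..\nC,3.0\n")

toy <- read_csv(toy_csv, na = "..", show_col_types = FALSE)

# With na = "..", the column is parsed as numeric and ".." becomes NA.
is.numeric(toy$gdp_growth)
sum(is.na(toy$gdp_growth))
```

Without the `na` argument, the `..` entry would force the whole column to be read as character.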
# Check column data types
glimpse(world_bank)
## Rows: 1,675
## Columns: 19
## $ Time <dbl> 2000, 20…
## $ `Time Code` <chr> "YR2000"…
## $ `Country Name` <chr> "Brazil"…
## $ `Country Code` <chr> "BRA", "…
## $ Region <chr> "Latin A…
## $ `Income Group` <chr> "Upper m…
## $ `GDP (constant 2015 US$)` <dbl> 1.18642e…
## $ `GDP growth (annual %)` <dbl> 4.387949…
## $ `GDP (current US$)` <dbl> 6.554482…
## $ `Unemployment, total (% of total labor force)` <dbl> NA, 3.70…
## $ `Inflation, consumer prices (annual %)` <dbl> 7.044141…
## $ `Labor force, total` <dbl> 80295093…
## $ `Population, total` <dbl> 17401828…
## $ `Exports of goods and services (% of GDP)` <dbl> 10.18805…
## $ `Imports of goods and services (% of GDP)` <dbl> 12.45171…
## $ `General government final consumption expenditure (% of GDP)` <dbl> 18.76784…
## $ `Foreign direct investment, net inflows (% of GDP)` <dbl> 5.033917…
## $ `Gross savings (% of GDP)` <dbl> 13.99170…
## $ `Current account balance (% of GDP)` <dbl> -4.04774…
# Convert Time column to integer
world_bank$Time <- as.integer(world_bank$Time)
# Clean column names
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
df <- world_bank |> clean_names()
glimpse(df)
## Rows: 1,675
## Columns: 19
## $ time <int> 2000, …
## $ time_code <chr> "YR200…
## $ country_name <chr> "Brazi…
## $ country_code <chr> "BRA",…
## $ region <chr> "Latin…
## $ income_group <chr> "Upper…
## $ gdp_constant_2015_us <dbl> 1.1864…
## $ gdp_growth_annual_percent <dbl> 4.3879…
## $ gdp_current_us <dbl> 6.5544…
## $ unemployment_total_percent_of_total_labor_force <dbl> NA, 3.…
## $ inflation_consumer_prices_annual_percent <dbl> 7.0441…
## $ labor_force_total <dbl> 802950…
## $ population_total <dbl> 174018…
## $ exports_of_goods_and_services_percent_of_gdp <dbl> 10.188…
## $ imports_of_goods_and_services_percent_of_gdp <dbl> 12.451…
## $ general_government_final_consumption_expenditure_percent_of_gdp <dbl> 18.767…
## $ foreign_direct_investment_net_inflows_percent_of_gdp <dbl> 5.0339…
## $ gross_savings_percent_of_gdp <dbl> 13.991…
## $ current_account_balance_percent_of_gdp <dbl> -4.047…
This WDI dataset has one row per country-year observation, with a set of economic indicators as columns. For this investigation, I focus on the following variables to assess how sampling variability affects conclusions:
GDP growth
Unemployment
Inflation
Income group
df <- df |>
  select(
    time,
    country_name,
    region,
    income_group,
    gdp_growth_annual_percent,
    unemployment_total_percent_of_total_labor_force,
    inflation_consumer_prices_annual_percent
  )
A sampling fraction of 25% is chosen to simulate a mid-sized data collection. Three sub-samples are drawn, each with replacement.
sample_frac <- 0.25
n_samples <- 3
df_samples <- tibble()
for (sample_i in 1:n_samples) {
  df_i <- df |>
    sample_n(size = sample_frac * nrow(df), replace = TRUE) |>
    mutate(sample_num = sample_i)
  df_samples <- bind_rows(df_samples, df_i)
}
df_samples |>
  filter(time == max(time, na.rm = TRUE)) |>
  group_by(sample_num) |>
  summarise(
    mean_gdp_growth = mean(gdp_growth_annual_percent, na.rm = TRUE),
    median_unemployment = median(unemployment_total_percent_of_total_labor_force, na.rm = TRUE),
    mean_inflation = mean(inflation_consumer_prices_annual_percent, na.rm = TRUE),
    n = n()
  )
## # A tibble: 3 × 5
## sample_num mean_gdp_growth median_unemployment mean_inflation n
## <int> <dbl> <dbl> <dbl> <int>
## 1 1 4.22 5.20 4.38 15
## 2 2 2.39 3.44 4.63 14
## 3 3 3.27 4.25 6.37 19
The summary statistics are broadly comparable across the three sub-samples but not identical. In the run shown above, mean GDP growth ranges from 2.39% to 4.22%, median unemployment from 3.44% to 5.20%, and mean inflation from 4.38% to 6.37%. Because the sampling is random and no seed is set, these figures change every time the code is run.
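The sampling loop above draws different rows on every run, so quoted numbers can drift away from the printed output. One way to make the draws repeatable is sketched below. It is not the code used for the results in this report: `draw_samples` and its `seed` argument are hypothetical names, `slice_sample()` stands in for `sample_n()`, and the toy data is made up.

```r
library(dplyr)

# Draw `n_samples` sub-samples of `data`, each a fraction `frac` of its rows,
# sampled with replacement. `draw_samples` and `seed` are hypothetical names
# introduced for this sketch.
draw_samples <- function(data, frac, n_samples, seed = 1) {
  set.seed(seed)                     # fix the RNG so reruns give the same rows
  size <- floor(frac * nrow(data))   # whole number of rows per sub-sample
  lapply(seq_len(n_samples), function(i) {
    data |>
      slice_sample(n = size, replace = TRUE) |>
      mutate(sample_num = i)
  }) |>
    bind_rows()
}

# Toy data standing in for `df` (values are made up).
toy <- tibble(id = 1:40, x = seq(0.5, 20, by = 0.5))

a <- draw_samples(toy, frac = 0.25, n_samples = 3, seed = 42)
b <- draw_samples(toy, frac = 0.25, n_samples = 3, seed = 42)
identical(a, b)   # same seed, same rows
```

Rounding the size down with `floor()` also makes the number of rows per sub-sample explicit, since `sample_frac * nrow(df)` is not a whole number here.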
df_samples |>
  filter(time == max(time, na.rm = TRUE)) |>
  group_by(sample_num, income_group) |>
  summarise(
    count = n(),
    .groups = 'drop'
  )
## # A tibble: 12 × 3
## sample_num income_group count
## <int> <chr> <int>
## 1 1 High income 6
## 2 1 Low income 2
## 3 1 Lower middle income 4
## 4 1 Upper middle income 3
## 5 2 High income 5
## 6 2 Low income 3
## 7 2 Lower middle income 2
## 8 2 Upper middle income 4
## 9 3 High income 5
## 10 3 Low income 3
## 11 3 Lower middle income 4
## 12 3 Upper middle income 7
df_samples |>
  filter(time == max(time, na.rm = TRUE)) |>
  group_by(sample_num) |>
  summarise(
    min_gdp_growth = min(gdp_growth_annual_percent, na.rm = TRUE),
    max_gdp_growth = max(gdp_growth_annual_percent, na.rm = TRUE)
  )
## # A tibble: 3 × 3
## sample_num min_gdp_growth max_gdp_growth
## <int> <dbl> <dbl>
## 1 1 2.00 8.89
## 2 2 -4.17 7.09
## 3 3 0.103 10.3
Negative GDP growth appears in only one sub-sample here (a minimum of -4.17% in sample 2), and the maxima also differ. Because these extremes are not consistent across sub-samples, they more plausibly reflect sampling variability than systematic differences.
sample_frac <- 0.10
n_samples <- 3
df_samples <- tibble()
for (sample_i in 1:n_samples) {
  df_i <- df |>
    sample_n(size = sample_frac * nrow(df), replace = TRUE) |>
    mutate(sample_num = sample_i)
  df_samples <- bind_rows(df_samples, df_i)
}
df_samples |>
  filter(time == max(time, na.rm = TRUE)) |>
  group_by(sample_num) |>
  summarise(
    mean_gdp_growth = mean(gdp_growth_annual_percent, na.rm = TRUE),
    median_unemployment = median(unemployment_total_percent_of_total_labor_force, na.rm = TRUE),
    mean_inflation = mean(inflation_consumer_prices_annual_percent, na.rm = TRUE),
    n = n()
  )
## # A tibble: 3 × 5
## sample_num mean_gdp_growth median_unemployment mean_inflation n
## <int> <dbl> <dbl> <dbl> <int>
## 1 1 3.20 7.09 6.35 9
## 2 2 3.72 3.90 5.60 4
## 3 3 4.07 4.17 3.86 10
The summary statistics vary noticeably across sub-samples. In the run shown above, mean GDP growth ranges from 3.20% to 4.07% and median unemployment from 3.90% to 7.09%. Mean inflation also spans a wide range, from 3.86% to 6.35%. Each estimate rests on only 4 to 10 latest-year observations.
df_samples |>
  filter(time == max(time, na.rm = TRUE)) |>
  group_by(sample_num, income_group) |>
  summarise(
    count = n(),
    .groups = 'drop'
  )
## # A tibble: 11 × 3
## sample_num income_group count
## <int> <chr> <int>
## 1 1 High income 3
## 2 1 Low income 1
## 3 1 Lower middle income 2
## 4 1 Upper middle income 3
## 5 2 High income 1
## 6 2 Low income 1
## 7 2 Lower middle income 2
## 8 3 High income 2
## 9 3 Low income 2
## 10 3 Lower middle income 4
## 11 3 Upper middle income 2
df_samples |>
  filter(time == max(time, na.rm = TRUE)) |>
  group_by(sample_num) |>
  summarise(
    min_gdp_growth = min(gdp_growth_annual_percent, na.rm = TRUE),
    max_gdp_growth = max(gdp_growth_annual_percent, na.rm = TRUE)
  )
## # A tibble: 3 × 3
## sample_num min_gdp_growth max_gdp_growth
## <int> <dbl> <dbl>
## 1 1 0.916 5.69
## 2 2 1.13 4.80
## 3 3 -1.12 8.89
Extreme GDP growth values also appear, including a maximum of 8.89% in one sub-sample. These results suggest that small samples are highly sensitive to which countries are included, increasing the likelihood that extreme values could be misinterpreted as meaningful patterns or anomalies.
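In the 10% income-group table above, sample 2 has no "Upper middle income" row at all, which is easy to miss because a missing group is simply absent from the output. A minimal sketch of making such gaps explicit with `tidyr::complete()` follows; the toy rows are made up and only mimic the shape of `df_samples`.

```r
library(dplyr)
library(tidyr)

# Toy sub-samples (made-up rows): sample 2 contains no "Upper middle income".
toy_samples <- tibble(
  sample_num   = c(1, 1, 1, 2, 2),
  income_group = c("High income", "Low income", "Upper middle income",
                   "High income", "Low income")
)

counts <- toy_samples |>
  count(sample_num, income_group, name = "count") |>
  # Add the missing sample/group combinations with an explicit count of 0.
  complete(sample_num, income_group, fill = list(count = 0L))

counts
```

A zero count makes it clear that a sub-sample says nothing about that income group, rather than leaving the reader to notice a missing row.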
sample_frac <- 0.75
n_samples <- 3
df_samples <- tibble()
for (sample_i in 1:n_samples) {
  df_i <- df |>
    sample_n(size = sample_frac * nrow(df), replace = TRUE) |>
    mutate(sample_num = sample_i)
  df_samples <- bind_rows(df_samples, df_i)
}
df_samples |>
  filter(time == max(time, na.rm = TRUE)) |>
  group_by(sample_num) |>
  summarise(
    mean_gdp_growth = mean(gdp_growth_annual_percent, na.rm = TRUE),
    median_unemployment = median(unemployment_total_percent_of_total_labor_force, na.rm = TRUE),
    mean_inflation = mean(inflation_consumer_prices_annual_percent, na.rm = TRUE),
    n = n()
  )
## # A tibble: 3 × 5
## sample_num mean_gdp_growth median_unemployment mean_inflation n
## <int> <dbl> <dbl> <dbl> <int>
## 1 1 2.96 4.36 7.86 47
## 2 2 3.35 4.17 5.47 52
## 3 3 3.02 4.34 7.11 52
With a 75% sampling fraction, estimates of GDP growth and unemployment become substantially more stable across sub-samples. In the run shown above, mean GDP growth falls in a narrow range (2.96% to 3.35%), and median unemployment differs very little (4.17% to 4.36%). Mean inflation remains the most variable estimate, ranging from 5.47% to 7.86%.
df_samples |>
  filter(time == max(time, na.rm = TRUE)) |>
  group_by(sample_num, income_group) |>
  summarise(
    count = n(),
    .groups = 'drop'
  )
## # A tibble: 12 × 3
## sample_num income_group count
## <int> <chr> <int>
## 1 1 High income 13
## 2 1 Low income 10
## 3 1 Lower middle income 11
## 4 1 Upper middle income 13
## 5 2 High income 13
## 6 2 Low income 12
## 7 2 Lower middle income 11
## 8 2 Upper middle income 16
## 9 3 High income 19
## 10 3 Low income 13
## 11 3 Lower middle income 9
## 12 3 Upper middle income 11
df_samples |>
  filter(time == max(time, na.rm = TRUE)) |>
  group_by(sample_num) |>
  summarise(
    min_gdp_growth = min(gdp_growth_annual_percent, na.rm = TRUE),
    max_gdp_growth = max(gdp_growth_annual_percent, na.rm = TRUE)
  )
## # A tibble: 3 × 3
## sample_num min_gdp_growth max_gdp_growth
## <int> <dbl> <dbl>
## 1 1 -4.17 8.89
## 2 2 -4.17 8.89
## 3 3 -4.17 10.3
Although the minimum GDP growth is negative in all three sub-samples (-4.17% in each), the overall patterns are consistent. Larger samples tend to capture the same extreme observations every time, which reduces the influence of outliers and random variation on the summary statistics.
What would you have called an anomaly in one sub-sample that you wouldn’t in another?
Extreme inflation values and unusually high or low GDP growth rates appear in some of the 10% sub-samples but not in others, whereas the 75% sub-samples show nearly the same extremes each time. Viewed within a single small sub-sample, such values might be classified as anomalies. However, their inconsistency across sub-samples suggests they are more likely a product of sampling variability than true outliers.
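To make this concrete, the sketch below applies Tukey's 1.5 × IQR rule to two made-up sub-samples that both contain the same growth rate of 8.9%. The rule is an assumption chosen for illustration (the analysis above does not define a formal anomaly test), and `flag_outliers` is a hypothetical helper.

```r
# Flag values lying more than 1.5 * IQR beyond the quartiles (Tukey's rule).
# `flag_outliers` is a hypothetical helper, not part of the analysis above.
flag_outliers <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Made-up growth rates: the same 8.9 sits in two different sub-samples.
sample_a <- c(2.0, 2.5, 3.0, 3.1, 3.4, 8.9)            # tightly clustered
sample_b <- c(-4.2, 0.1, 2.5, 5.0, 7.1, 8.9, 10.3)     # widely spread

flag_outliers(sample_a)[sample_a == 8.9]   # flagged in the narrow sample
flag_outliers(sample_b)[sample_b == 8.9]   # not flagged in the wide sample
```

The value itself is unchanged; only the company it keeps differs, which is why a flag raised in one small sub-sample is weak evidence of a real anomaly.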
Are there aspects of the data that are consistent among all sub-samples?
Across all sub-samples, average GDP growth and median unemployment remain positive and within a relatively narrow range. These consistent features indicate underlying stability in the data despite sampling-related variation.
These observations show a clear relationship between sample size and the reliability of the corresponding summary statistics. Smaller samples (10%) produce highly variable results and apparent anomalies, while increasing the fraction to 75% produces more consistent and robust patterns. Conclusions drawn from limited data can therefore undermine the credibility of the statistics, which is why an adequate sample size is important.
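Three sub-samples per fraction give only a rough impression of this relationship. A hedged sketch of how it could be quantified is shown below: for each fraction, the sample mean is recomputed over many resamples and its standard deviation is reported. The data is synthetic (skewed, made-up values), and `spread_of_means` and `reps` are hypothetical names, so the numbers say nothing about the WDI data itself.

```r
library(dplyr)
library(purrr)

set.seed(123)
# Synthetic stand-in for one indicator (made-up, right-skewed values).
toy <- tibble(value = rexp(500, rate = 1 / 4))

# Standard deviation of the sample mean over `reps` resamples of size
# frac * nrow(data), drawn with replacement. Hypothetical helper.
spread_of_means <- function(data, frac, reps = 200) {
  size <- floor(frac * nrow(data))
  means <- map_dbl(seq_len(reps), function(i) {
    mean(sample(data$value, size = size, replace = TRUE))
  })
  sd(means)
}

spread <- tibble(frac = c(0.10, 0.25, 0.75)) |>
  mutate(sd_of_mean = map_dbl(frac, function(f) spread_of_means(toy, f)))

spread
```

The spread of the mean shrinks as the fraction grows, roughly in proportion to one over the square root of the sample size, which matches the pattern seen in the sub-sample tables.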