In this data dive, I examine the Statistical Performance Indicators dataset from the World Bank to identify unclear elements in the data and documentation. The dataset contains 4,340 rows and 12 columns, with each row representing a country in a specific year. The goal is to critically evaluate what’s clear, what’s unclear, and what issues might affect analysis.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
dataset <- read.csv("dataset.csv")
Before reading the documentation, several columns and values in this dataset were unclear to me:
iso3c ColumnAt first glance, the iso3c column contains three-letter
country codes like “DNK”, “FIN”, “POL”, etc. Without documentation, I
didn’t know what these codes meant or what standard they followed.
What could have happened if I didn’t read the documentation? I might have assumed these were just random abbreviations or tried to create my own mapping to country names, which could lead to errors if I misidentified a country code.
The dataset has six different score columns:
overall_score, data_use_score,
data_services_score, data_products_score,
data_sources_score, and
data_infrastructure_score. Before reading the
documentation, it wasn’t clear:
What could have happened? I might have compared scores incorrectly or misinterpreted what a “high” or “low” score means for a country. For example, I might have thought a score of 50 was “average” when it could actually be quite low in this context.
Even after reading the documentation, there’s one major element that remains unclear: How Are the Scores Calculated? The documentation tells us that the scores measure “statistical capacity” and range from 0-100, but it doesn’t explain:
data_use_score?overall_score? Is it a simple average, or are some
components more important than others?This is concerning because without knowing how scores are calculated, it’s hard to: - Interpret differences between countries - Understand if small differences (like 85 vs 87) are meaningful - Know if we can compare scores across different years
This histogram shows that most countries cluster in the 40-80 range, but without knowing how the score is calculated, I can’t interpret what this distribution means. Are countries genuinely improving over time, or is the scoring system biased toward certain types of countries?
dataset |>
ggplot() +
geom_histogram(mapping = aes(x = overall_score),
bins = 30,
fill = "#3182bd",
color = "white") +
labs(title = "Distribution of Overall Scores Across All Countries",
x = "Overall Score",
y = "Count") +
theme_minimal()
## Warning: Removed 2915 rows containing non-finite outside the scale range
## (`stat_bin()`).
This scatterplot shows a positive relationship between infrastructure score and overall score, but without documentation on the calculation formula, I can’t tell: - Is infrastructure weighted more heavily in the overall score? - Are there outliers that don’t follow the expected pattern? - Should all countries fall along a specific line if the formula is consistent?
Why this is concerning: If I wanted to advise a country on how to improve their overall score, I wouldn’t know which component score to focus on without understanding the weighting system.
dataset |>
filter(year == 2023) |>
ggplot() +
geom_point(mapping = aes(x = data_infrastructure_score,
y = overall_score,
color = income),
alpha = 0.6) +
labs(title = "Infrastructure Score vs Overall Score (2023)",
subtitle = "Without knowing the calculation method, we can't tell if this is the right relationship",
x = "Data Infrastructure Score",
y = "Overall Score",
color = "Income Level") +
theme_minimal()
## Warning: Removed 30 rows containing missing values or values outside the scale range
## (`geom_point()`).
The lack of transparency in score construction introduces several risks:
1. Policy Misinterpretation Risk
Countries may be ranked or compared without understanding weighting decisions. Policymakers could allocate resources based on misleading comparisons.
2. Overinterpretation of Small Differences
Small numerical differences may not represent meaningful performance gaps. A country scoring 85 vs 87 might be treated as significantly different when the difference could be statistical noise.
3. Temporal Comparison Risk
If methodology changes across years, trends may reflect scoring adjustments rather than real improvements. Countries might appear to improve or decline artificially.
To reduce these risks, I would:
Avoid strict country rankings without understanding the methodology
Clearly state methodological uncertainty in all analyses
Avoid causal claims based on score differences
Treat small differences (less than 5 points) cautiously
Investigate whether methodology changed over time before analyzing trends
Check for NA values in categorical columns shows that there are no explicitly missing values in the categorical columns (iso3c, country, region, income).
dataset |>
summarize(
missing_iso3c = sum(is.na(iso3c)),
missing_country = sum(is.na(country)),
missing_region = sum(is.na(region)),
missing_income = sum(is.na(income))
)
## missing_iso3c missing_country missing_region missing_income
## 1 0 0 0 0
Check if there are gaps in the years for countries shows that there ARE implicitly missing rows! Not all countries have data for every year in the dataset. Some countries might be missing certain years, which could indicate:
Insight and Significance
This implicit missingness creates bias in longitudinal analysis. If missing years are systematically related to a country’s region, income level, or performance, then trends over time may be misleading. For example, if low-performing countries stop reporting data, average scores might appear to improve artificially.
Further Investigation Needed
I would need to determine: - Whether missing years are randomly distributed or systematic - If certain regions or income groups have more missing data - Whether countries that stopped reporting had lower scores - How to handle these gaps when analyzing trends
# See what years are available
year_range <- dataset |>
summarize(min_year = min(year),
max_year = max(year))
year_range
## min_year max_year
## 1 2004 2023
# Check how many unique countries we have
n_countries <- dataset |>
distinct(country) |>
nrow()
# Expected rows if every country had data for every year
expected_rows <- n_countries * (year_range$max_year - year_range$min_year + 1)
cat("Unique countries:", n_countries, "\n")
## Unique countries: 217
cat("Actual rows:", nrow(dataset), "\n")
## Actual rows: 4340
cat("Expected rows if complete:", expected_rows, "\n")
## Expected rows if complete: 4340
# Find countries with missing years
dataset |>
group_by(country) |>
summarize(n_years = n(),
years_present = list(year)) |>
filter(n_years < (year_range$max_year - year_range$min_year + 1)) |>
head(10)
## # A tibble: 0 × 3
## # ℹ 3 variables: country <chr>, n_years <int>, years_present <list>
Check if any region or income categories have very few or zero countries shows that Some combinations of region and income level have very few countries. For example, there might be very few “Low income” countries in “Europe & Central Asia” or “High income” countries in “Sub-Saharan Africa”. These aren’t technically “empty” but having only 1-2 countries in a category makes comparisons difficult.
Insight and Significance
Very small groups behave similarly to empty groups statistically. Comparisons across such categories may be unstable and misleading. A single country’s performance could be interpreted as representing an entire region-income combination, which is not valid.
dataset |>
filter(year == 2023) |>
count(region, income) |>
arrange(n)
## region income n
## 1 East Asia & Pacific Low income 1
## 2 Latin America & Caribbean Not classified 1
## 3 South Asia Low income 1
## 4 South Asia Upper middle income 1
## 5 Sub-Saharan Africa High income 1
## 6 Middle East & North Africa Low income 2
## 7 Europe & Central Asia Lower middle income 3
## 8 North America High income 3
## 9 Latin America & Caribbean Lower middle income 4
## 10 Middle East & North Africa Upper middle income 4
## 11 South Asia Lower middle income 6
## 12 Sub-Saharan Africa Upper middle income 6
## 13 Middle East & North Africa Lower middle income 7
## 14 Middle East & North Africa High income 8
## 15 East Asia & Pacific Upper middle income 9
## 16 East Asia & Pacific Lower middle income 12
## 17 East Asia & Pacific High income 15
## 18 Europe & Central Asia Upper middle income 15
## 19 Latin America & Caribbean High income 18
## 20 Latin America & Caribbean Upper middle income 19
## 21 Sub-Saharan Africa Lower middle income 19
## 22 Sub-Saharan Africa Low income 22
## 23 Europe & Central Asia High income 40
populationI’ll examine the population column to identify outliers. I’ll use multiple methods to define outliers:
dataset |>
filter(year == 2023) |>
ggplot() +
geom_boxplot(mapping = aes(x = population, y = "")) +
scale_x_continuous(labels = scales::comma) +
labs(title = "Population Distribution (2023) - IQR Method",
x = "Population",
y = "") +
theme_minimal()
Using the standard boxplot definition (1.5 × IQR), I can identify extreme population outliers:
pop_2023 <- dataset |>
filter(year == 2023)
q1 <- quantile(pop_2023$population, 0.25)
q3 <- quantile(pop_2023$population, 0.75)
iqr <- q3 - q1
lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr
outliers <- pop_2023 |>
filter(population > upper_bound | population < lower_bound) |>
select(country, population) |>
arrange(desc(population))
outliers
## country population
## 1 India 1428627663
## 2 China 1410710000
## 3 United States 334914895
## 4 Indonesia 277534122
## 5 Pakistan 240485658
## 6 Nigeria 223804632
## 7 Brazil 216422446
## 8 Bangladesh 172954319
## 9 Russian Federation 143826130
## 10 Mexico 128455567
## 11 Ethiopia 126527060
## 12 Japan 124516650
## 13 Philippines 117337368
## 14 Egypt, Arab Rep. 112716598
## 15 Congo, Dem. Rep. 102262808
## 16 Viet Nam 98858950
## 17 Iran, Islamic Rep. 89172767
## 18 Turkiye 85326000
## 19 Germany 84482267
## 20 Thailand 71801279
## 21 United Kingdom 68350000
## 22 France 68170228
## 23 Tanzania 67438106
Looking at this histogram, there’s a clear gap between the bulk of countries (under 100 million) and a few extremely large countries (over 500 million). This visual gap suggests that countries with populations over 300 million could be considered outliers.
dataset |>
filter(year == 2023) |>
ggplot() +
geom_histogram(mapping = aes(x = population),
bins = 50,
fill = "#3182bd",
color = "white") +
scale_x_continuous(labels = scales::comma) +
labs(title = "Population Distribution Shows Clear Gaps",
x = "Population",
y = "Count") +
theme_minimal()
I define population outliers as countries with populations above 300 million based on the following reasons:
These outliers (China, India, USA, Indonesia, Pakistan) represent over 40% of the world’s population but are less than 3% of countries in the dataset.
nsight and Significance
When analyzing relationships between population and scores, these outliers create a dilemma: removing them may improve statistical stability but could distort global representation. Any decision to exclude them must be explicitly justified based on the research question.
How I would handle them:
dataset |>
filter(year == 2023, population > 300000000) |>
select(country, population, overall_score) |>
arrange(desc(population))
## country population overall_score
## 1 India 1428627663 73.62833
## 2 China 1410710000 59.50375
## 3 United States 334914895 93.43042
The biggest question remaining is: How exactly are these scores calculated? To properly interpret this data, I would need: - The formula for combining component scores into the overall score - The criteria and thresholds for each individual score - Whether the scoring methodology has changed over time - What a score of 0 vs missing data means
Without this information, any conclusions about country performance or trends over time could be misleading.