Week 5 Data Dive — Documentation

Introduction

In this data dive, I examine the Statistical Performance Indicators dataset from the World Bank to identify unclear elements in the data and documentation. The dataset contains 4,340 rows and 12 columns, with each row representing a country in a specific year. The goal is to critically evaluate what’s clear, what’s unclear, and what issues might affect analysis.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
dataset <- read.csv("dataset.csv")
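
To confirm the stated 4,340 × 12 shape and see each column's type at a glance, a quick structural check:

# Quick structural check: confirm the dataset's dimensions and
# inspect each column's type before digging into the documentation
glimpse(dataset)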

Task 1: Unclear Columns Before Reading Documentation

Before reading the documentation, several columns and values in this dataset were unclear to me:

The iso3c Column

At first glance, the iso3c column contains three-letter country codes like “DNK”, “FIN”, “POL”, etc. Without documentation, I didn’t know what these codes meant or what standard they followed.

What could have happened if I didn’t read the documentation? I might have assumed these were just random abbreviations or tried to create my own mapping to country names, which could lead to errors if I misidentified a country code.
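
Had I gone ahead without the documentation, a safer route than hand-building a lookup would have been an established ISO 3166-1 alpha-3 mapping. A minimal sketch, assuming the countrycode package is available (it is not loaded above):

library(countrycode)

# Decode the three-letter codes using the package's ISO 3166-1
# alpha-3 table; codes it cannot match come back as NA (with a
# warning), flagging any non-standard entries such as aggregates
dataset |>
  distinct(iso3c) |>
  mutate(country_name = countrycode(iso3c,
                                    origin = "iso3c",
                                    destination = "country.name")) |>
  filter(is.na(country_name))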

The Score Columns (values between 0-100)

The dataset has six different score columns: overall_score, data_use_score, data_services_score, data_products_score, data_sources_score, and data_infrastructure_score. Before reading the documentation, it wasn’t clear:

  • What these scores actually measure
  • Whether higher scores are better or worse
  • How these scores relate to each other
  • Whether they’re on the same scale

What could have happened? I might have compared scores incorrectly or misinterpreted what a “high” or “low” score means for a country. For example, I might have thought a score of 50 was “average” when it could actually be quite low in this context.
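
Some of these questions can be checked empirically even without documentation. The sketch below assumes the six score columns are the only columns whose names end in _score; it confirms the observed scale, though not the intended meaning:

# Observed range and center of each score column; a shared 0-100
# scale would show up here, but this says nothing about meaning
dataset |>
  pivot_longer(ends_with("_score"),
               names_to = "score", values_to = "value") |>
  group_by(score) |>
  summarize(min = min(value, na.rm = TRUE),
            median = median(value, na.rm = TRUE),
            max = max(value, na.rm = TRUE))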

Task 2: Unclear Elements Even AFTER Reading Documentation

Even after reading the documentation, there’s one major element that remains unclear: How Are the Scores Calculated? The documentation tells us that the scores measure “statistical capacity” and range from 0-100, but it doesn’t explain:

  • What specific metrics or criteria go into each score? For example, what exactly makes a country score 95 vs 85 on data_use_score?
  • How are the component scores weighted to create the overall_score? Is it a simple average, or are some components more important than others?
  • What does a score of 0 mean vs a missing value? Are there countries with 0 scores, and if so, what does that indicate? (A quick check follows this list.)
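
That last question can at least be quantified. A minimal check that counts exact-zero and missing overall_score values, without resolving what either represents:

# Zeros and NAs are different things; count both in overall_score
dataset |>
  summarize(zero_scores = sum(overall_score == 0, na.rm = TRUE),
            missing_scores = sum(is.na(overall_score)))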

This is concerning because without knowing how scores are calculated, it's hard to:

  • Interpret differences between countries
  • Understand if small differences (like 85 vs 87) are meaningful
  • Know if we can compare scores across different years

Task 3: Two Visualizations Highlighting the Unclear Issue

Visualization: Distribution of Overall Scores

This histogram pools all country-years and shows that most overall scores cluster in the 40-80 range, but without knowing how the score is calculated, I can't interpret what this distribution means. Is a score of 60 genuinely middling, or is the scoring system biased toward certain types of countries? The warning below is also telling: 2,915 country-year rows have no overall_score at all, which ties back to the 0-vs-missing question from Task 2.

dataset |>
  ggplot() +
  geom_histogram(mapping = aes(x = overall_score), 
                 bins = 30, 
                 fill = "#3182bd", 
                 color = "white") +
  labs(title = "Distribution of Overall Scores Across All Countries",
       x = "Overall Score",
       y = "Count") +
  theme_minimal()
## Warning: Removed 2915 rows containing non-finite outside the scale range
## (`stat_bin()`).

Visualization: Relationship Between Component Scores and Overall Score

This scatterplot shows a positive relationship between infrastructure score and overall score, but without documentation on the calculation formula, I can't tell:

  • Is infrastructure weighted more heavily in the overall score?
  • Are there outliers that don't follow the expected pattern?
  • Should all countries fall along a specific line if the formula is consistent?

Why this is concerning: If I wanted to advise a country on how to improve their overall score, I wouldn’t know which component score to focus on without understanding the weighting system.

dataset |>
  filter(year == 2023) |>
  ggplot() +
  geom_point(mapping = aes(x = data_infrastructure_score, 
                          y = overall_score,
                          color = income),
             alpha = 0.6) +
  labs(title = "Infrastructure Score vs Overall Score (2023)",
       subtitle = "Without knowing the calculation method, we can't tell if this is the right relationship",
       x = "Data Infrastructure Score",
       y = "Overall Score",
       color = "Income Level") +
  theme_minimal()
## Warning: Removed 30 rows containing missing values or values outside the scale range
## (`geom_point()`).
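
One way to probe the weighting question is to test a candidate formula directly. The check below is a sketch under an explicit assumption, that overall_score might be a simple unweighted mean of the five component scores; it is not the World Bank's documented method:

# Compare overall_score to an unweighted mean of the five components;
# a near-zero max difference would make a simple average plausible,
# while large gaps would point to unequal weights or extra inputs
dataset |>
  mutate(simple_mean = rowMeans(pick(data_use_score,
                                     data_services_score,
                                     data_products_score,
                                     data_sources_score,
                                     data_infrastructure_score))) |>
  summarize(max_abs_diff = max(abs(overall_score - simple_mean),
                               na.rm = TRUE))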

Risk Analysis and Mitigation

The lack of transparency in score construction introduces several risks:

1. Policy Misinterpretation Risk

Countries may be ranked or compared without understanding weighting decisions. Policymakers could allocate resources based on misleading comparisons.

2. Overinterpretation of Small Differences

Small numerical differences may not represent meaningful performance gaps. A country scoring 85 vs 87 might be treated as significantly different when the difference could be statistical noise.

3. Temporal Comparison Risk

If methodology changes across years, trends may reflect scoring adjustments rather than real improvements. Countries might appear to improve or decline artificially.

To reduce these risks, I would:

  • Avoid strict country rankings without understanding the methodology

  • Clearly state methodological uncertainty in all analyses

  • Avoid causal claims based on score differences

  • Treat small differences (less than 5 points) cautiously

  • Investigate whether methodology changed over time before analyzing trends

Task 4: Categorical Columns and Missing Data

Explicitly Missing Rows

Checking for NA values in the categorical columns shows that there are no explicitly missing values in iso3c, country, region, or income.

dataset |>
  summarize(
    missing_iso3c = sum(is.na(iso3c)),
    missing_country = sum(is.na(country)),
    missing_region = sum(is.na(region)),
    missing_income = sum(is.na(income))
  )
##   missing_iso3c missing_country missing_region missing_income
## 1             0               0              0              0

Implicitly Missing Rows

Checking for gaps in the years covered by each country shows that there are NO implicitly missing rows: the dataset contains exactly 217 countries × 20 years (2004-2023) = 4,340 rows, and no country falls short of the full run of years (see the checks below). Missingness instead takes the explicit form of NA score values (recall the 2,915 rows dropped from the Task 3 histogram). An NA score for a country-year could indicate:

  • The country didn't report the underlying data that year
  • The country didn't yet exist in its current form (new countries)
  • Data wasn't collected for that country

Insight and Significance

This explicit missingness can still bias longitudinal analysis. If missing scores are systematically related to a country's region, income level, or performance, then trends over time may be misleading. For example, if low-performing countries stop reporting the inputs behind the scores, average scores might appear to improve artificially.

Further Investigation Needed

I would need to determine:

  • Whether missing scores are randomly distributed or systematic
  • If certain regions or income groups have more missing data
  • Whether countries with missing scores tend to score lower when they do report
  • How to handle these gaps when analyzing trends

A first pass at the second question appears after the completeness checks below.

# See what years are available
year_range <- dataset |>
  summarize(min_year = min(year),
            max_year = max(year))

year_range
##   min_year max_year
## 1     2004     2023
# Check how many unique countries we have
n_countries <- dataset |>
  distinct(country) |>
  nrow()

# Expected rows if every country had data for every year
expected_rows <- n_countries * (year_range$max_year - year_range$min_year + 1)

cat("Unique countries:", n_countries, "\n")
## Unique countries: 217
cat("Actual rows:", nrow(dataset), "\n")
## Actual rows: 4340
cat("Expected rows if complete:", expected_rows, "\n")
## Expected rows if complete: 4340
# Find countries with missing years
dataset |>
  group_by(country) |>
  summarize(n_years = n(),
            years_present = list(year)) |>
  filter(n_years < (year_range$max_year - year_range$min_year + 1)) |>
  head(10)
## # A tibble: 0 × 3
## # ℹ 3 variables: country <chr>, n_years <int>, years_present <list>
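
Since the panel itself is complete, the natural follow-up is where the NA score values concentrate. A minimal sketch for the region/income question above, tabulating the share of missing overall_score by income group:

# Share of country-years with a missing overall_score, by income
# group; a strongly uneven spread would signal systematic missingness
dataset |>
  group_by(income) |>
  summarize(n = n(),
            missing_overall = sum(is.na(overall_score)),
            share_missing = missing_overall / n) |>
  arrange(desc(share_missing))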

Empty Groups

Checking whether any region-income combinations have very few or zero countries shows that some cells are very thin. In 2023, for example, East Asia & Pacific has a single low-income country and Sub-Saharan Africa has a single high-income country (see the table below). These cells aren't technically "empty", but having only 1-2 countries in a category makes comparisons difficult.

Insight and Significance

Very small groups behave similarly to empty groups statistically. Comparisons across such categories may be unstable and misleading. A single country’s performance could be interpreted as representing an entire region-income combination, which is not valid.

dataset |>
  filter(year == 2023) |>
  count(region, income) |>
  arrange(n)
##                        region              income  n
## 1         East Asia & Pacific          Low income  1
## 2   Latin America & Caribbean      Not classified  1
## 3                  South Asia          Low income  1
## 4                  South Asia Upper middle income  1
## 5          Sub-Saharan Africa         High income  1
## 6  Middle East & North Africa          Low income  2
## 7       Europe & Central Asia Lower middle income  3
## 8               North America         High income  3
## 9   Latin America & Caribbean Lower middle income  4
## 10 Middle East & North Africa Upper middle income  4
## 11                 South Asia Lower middle income  6
## 12         Sub-Saharan Africa Upper middle income  6
## 13 Middle East & North Africa Lower middle income  7
## 14 Middle East & North Africa         High income  8
## 15        East Asia & Pacific Upper middle income  9
## 16        East Asia & Pacific Lower middle income 12
## 17        East Asia & Pacific         High income 15
## 18      Europe & Central Asia Upper middle income 15
## 19  Latin America & Caribbean         High income 18
## 20  Latin America & Caribbean Upper middle income 19
## 21         Sub-Saharan Africa Lower middle income 19
## 22         Sub-Saharan Africa          Low income 22
## 23      Europe & Central Asia         High income 40
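
One caveat: count() only lists combinations that occur at least once, so truly empty cells never appear in the table above. A short sketch using tidyr's complete() to surface zero-count region-income combinations explicitly:

# complete() adds every region-income pairing, filling the absent
# ones with n = 0, so genuinely empty cells become visible
dataset |>
  filter(year == 2023) |>
  count(region, income) |>
  complete(region, income, fill = list(n = 0)) |>
  filter(n == 0)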

Task 5: Defining Outliers for a Continuous Column

Outliers in population

I’ll examine the population column to identify outliers. I’ll use multiple methods to define outliers:

Method 1: Interquartile Range (IQR)

dataset |>
  filter(year == 2023) |>
  ggplot() +
  geom_boxplot(mapping = aes(x = population, y = "")) +
  scale_x_continuous(labels = scales::comma) +
  labs(title = "Population Distribution (2023) - IQR Method",
       x = "Population",
       y = "") +
  theme_minimal()

Using the standard boxplot definition (1.5 × IQR), I can identify extreme population outliers:

pop_2023 <- dataset |>
  filter(year == 2023)

q1 <- quantile(pop_2023$population, 0.25)
q3 <- quantile(pop_2023$population, 0.75)
iqr <- q3 - q1

lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr

outliers <- pop_2023 |>
  filter(population > upper_bound | population < lower_bound) |>
  select(country, population) |>
  arrange(desc(population))

outliers
##               country population
## 1               India 1428627663
## 2               China 1410710000
## 3       United States  334914895
## 4           Indonesia  277534122
## 5            Pakistan  240485658
## 6             Nigeria  223804632
## 7              Brazil  216422446
## 8          Bangladesh  172954319
## 9  Russian Federation  143826130
## 10             Mexico  128455567
## 11           Ethiopia  126527060
## 12              Japan  124516650
## 13        Philippines  117337368
## 14   Egypt, Arab Rep.  112716598
## 15   Congo, Dem. Rep.  102262808
## 16           Viet Nam   98858950
## 17 Iran, Islamic Rep.   89172767
## 18            Turkiye   85326000
## 19            Germany   84482267
## 20           Thailand   71801279
## 21     United Kingdom   68350000
## 22             France   68170228
## 23           Tanzania   67438106

Method 2: Visual Gaps in the Data

Looking at this histogram, the bulk of countries sits below roughly 100 million, a thinner tail runs up to the United States at about 335 million, and then nothing appears until China and India above 1.4 billion. A threshold of 300 million falls in the gap between Indonesia (about 278 million) and the United States, so it isolates just the three largest countries as outliers.

dataset |>
  filter(year == 2023) |>
  ggplot() +
  geom_histogram(mapping = aes(x = population), 
                 bins = 50,
                 fill = "#3182bd", 
                 color = "white") +
  scale_x_continuous(labels = scales::comma) +
  labs(title = "Population Distribution Shows Clear Gaps",
       x = "Population",
       y = "Count") +
  theme_minimal()

My Definition of Outliers for Population

I define population outliers as countries with populations above 300 million based on the following reasons:

  1. Clear visual separation: The histogram shows a distinct gap above this threshold
  2. Extreme values: These countries are fundamentally different in scale from most nations
  3. Analytical impact: Including these countries in averages or correlations would heavily skew results
  4. Practical significance: Countries like China, India, and USA operate at a completely different scale than most countries

These outliers (India, China, and the United States) together account for roughly 40% of the world's population yet make up less than 2% of the 217 countries in the dataset.

Insight and Significance

When analyzing relationships between population and scores, these outliers create a dilemma: removing them may improve statistical stability but could distort global representation. Any decision to exclude them must be explicitly justified based on the research question.

How I would handle them:

  • Analyze them separately as a distinct category
  • Use logarithmic scales to reduce their influence (see the sketch after the table below)
  • Note their influence explicitly in conclusions
  • Report both analyses (with and without outliers) for transparency

dataset |>
  filter(year == 2023, population > 300000000) |>
  select(country, population, overall_score) |>
  arrange(desc(population))
##         country population overall_score
## 1         India 1428627663      73.62833
## 2         China 1410710000      59.50375
## 3 United States  334914895      93.43042
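
As an illustration of the log-scale option, this sketch redraws the 2023 population histogram on a log10 axis, pulling the giants back toward the bulk of countries instead of dropping them:

dataset |>
  filter(year == 2023) |>
  ggplot() +
  geom_histogram(mapping = aes(x = population),
                 bins = 50,
                 fill = "#3182bd",
                 color = "white") +
  scale_x_log10(labels = scales::comma) +  # log10 axis compresses the right tail
  labs(title = "Population Distribution (2023, log10 scale)",
       x = "Population (log scale)",
       y = "Count") +
  theme_minimal()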

Summary of Findings

  1. Before documentation: Country codes and score meanings were unclear
  2. After documentation: The score calculation methodology remains unexplained
  3. Visualizations: Showed why unclear calculations make interpretation difficult
  4. Missing data:
    • No explicit missing values in categorical columns
    • The country-year panel is complete (no implicitly missing rows), but many score values are recorded as NA
    • Some region-income combinations have very few observations
  5. Outliers: Defined population outliers as countries over 300 million based on visual gaps and practical significance

Further Questions

The biggest question remaining is: How exactly are these scores calculated? To properly interpret this data, I would need:

  • The formula for combining component scores into the overall score
  • The criteria and thresholds for each individual score
  • Whether the scoring methodology has changed over time
  • What a score of 0 vs missing data means

Without this information, any conclusions about country performance or trends over time could be misleading.