In this data dive, I explore the Statistical Performance Indicators (SPI) dataset from the World Bank, accessed via TidyTuesday. Each row in this dataset represents a country–year observation, tracking how well countries manage and use statistical data across multiple dimensions over the years 2004–2023.
I will work with two pairs of variables:
Pair 1: overall_score (original, response) paired with infra_dev (created, explanatory: each country’s deviation from its region’s average infrastructure score that year). infra_dev is derived from data_infrastructure_score (original).
Pair 2: population (original) paired with years_since (created: how many years ago the observation was recorded). This pair does not include a defined response variable, so no confidence interval is constructed for this relationship.
#load libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(boot)
#load dataset
dataset <- read_csv("dataset.csv")
## Rows: 4340 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): iso3c, country, region, income
## dbl (8): year, population, overall_score, data_use_score, data_services_scor...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
In this step, I create two new variables that will be used in my analysis:
infra_dev: Measures how far a country’s
infrastructure score is from its region’s average in the same year.
infra_dev allows us to compare countries relative to their
regional peers instead of using raw infrastructure scores.
years_since: Measures how many years have passed
since the observation year. years_since helps us examine
whether older observations differ systematically from more recent
ones.
dataset_clean <- dataset |>
  # 1: Group by region and year
  group_by(region, year) |>
  # 2: Create infra_dev
  mutate(
    infra_dev = data_infrastructure_score -
      mean(data_infrastructure_score, na.rm = TRUE)
  ) |>
  # 3: Remove grouping
  ungroup() |>
  # 4: Create years_since
  mutate(
    years_since = max(year) - year
  )
dataset_clean |>
select(country, region, year,
data_infrastructure_score, infra_dev,
overall_score, population, years_since) |>
head(10)
## # A tibble: 10 × 8
## country region year data_infrastructure_…¹ infra_dev overall_score
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Denmark Europe & Ce… 2023 100 11.1 95.3
## 2 Finland Europe & Ce… 2023 100 11.1 95.1
## 3 Poland Europe & Ce… 2023 100 11.1 94.7
## 4 Sweden Europe & Ce… 2023 100 11.1 94.4
## 5 Spain Europe & Ce… 2023 100 11.1 94.3
## 6 Netherlands Europe & Ce… 2023 100 11.1 94.3
## 7 Slovenia Europe & Ce… 2023 100 11.1 94.2
## 8 Portugal Europe & Ce… 2023 100 11.1 93.8
## 9 Italy Europe & Ce… 2023 100 11.1 93.6
## 10 Norway Europe & Ce… 2023 100 11.1 93.6
## # ℹ abbreviated name: ¹data_infrastructure_score
## # ℹ 2 more variables: population <dbl>, years_since <dbl>
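As a quick sanity check on the grouping logic, the sketch below repeats the same group_by/mutate pattern on a tiny made-up table (the toy values and the names toy, toy_dev, and group_check are hypothetical, not taken from the SPI data). Within each region-year group the deviations should average to zero, which is also why the ten rows above all show infra_dev = 11.1: each has a score of 100 in the same region and year, so the regional mean for that group is 88.9.

```r
library(dplyr)

# Toy data (hypothetical values, not from the SPI dataset)
toy <- data.frame(
  region = c("A", "A", "A", "B", "B", "B"),
  year   = rep(2023, 6),
  data_infrastructure_score = c(100, 80, 60, 50, NA, 30)
)

# Same pattern as above: deviation from the region-year mean
toy_dev <- toy |>
  group_by(region, year) |>
  mutate(
    infra_dev = data_infrastructure_score -
      mean(data_infrastructure_score, na.rm = TRUE)
  ) |>
  ungroup()

# Within each region-year group, the deviations should average to zero
group_check <- toy_dev |>
  group_by(region, year) |>
  summarise(mean_dev = mean(infra_dev, na.rm = TRUE), .groups = "drop")

print(toy_dev)
print(group_check)
```

Note that a missing infrastructure score stays missing in infra_dev; na.rm = TRUE only keeps it from contaminating the group mean.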
infra_dev (Explanatory) and overall_score (Response)
A country’s data infrastructure (its systems, standards, and capacity for collecting and managing data) is a logical driver of its overall statistical performance. infra_dev captures how much a country’s infrastructure stands out relative to its regional peers. We treat overall_score as the response and infra_dev as the explanatory variable.
Unlike using data_infrastructure_score directly,
infra_dev adds interpretive value: it tells us whether a
country’s infrastructure advantage (or disadvantage) relative to its
region predicts its overall score, which is a more nuanced and
meaningful question.
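This pairing is also reasonably clean statistically: infra_dev is
calculated from data_infrastructure_score, not directly from
overall_score. One caveat is that overall_score aggregates several
sub-scores, infrastructure among them, so some positive association
between the two is expected by construction.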
ggplot(dataset_clean, aes(x = infra_dev, y = overall_score)) +
  geom_point(alpha = 0.2, color = "steelblue") +
  geom_smooth(method = "lm", se = FALSE, color = "darkred", linewidth = 0.8) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
  labs(
    title = "Overall Score vs. Infrastructure Deviation from Regional Average",
    subtitle = "Each point is one country-year. Positive infra_dev = above regional infrastructure average.",
    x = "Infrastructure Score Deviation from Regional Average (infra_dev)",
    y = "Overall Score (0-100)"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2915 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2915 rows containing missing values or values outside the scale range
## (`geom_point()`).
overall_score depends on multiple sub-scores beyond infrastructure alone.
Some country-years have strongly negative infra_dev (far below their regional peers in infrastructure) but not necessarily the lowest overall scores, suggesting that strong performance in other areas (e.g., data use or services) can partially offset infrastructure weaknesses.
The scatter is wider at low infra_dev values, indicating more variability among countries that underperform regionally in infrastructure.
Since both variables are continuous, we use Pearson’s correlation coefficient.
The correlation is positive and moderately strong (r ≈ 0.64, computed
below). Countries that outperform their region in infrastructure tend to
have higher overall scores, but the relationship is far from perfect,
which makes sense since overall_score aggregates multiple dimensions
beyond infrastructure. Because infrastructure is one of those
dimensions, part of the association is expected by construction, so the
value should not be read as purely causal. A further question worth
investigating: does this correlation differ by region? For example, does
infrastructure matter more as a predictor in lower-income regions where
it may be more of a binding constraint?
r_pair1 <- cor(dataset_clean$infra_dev, dataset_clean$overall_score, use = "complete.obs")
cat("Pearson's r (infra_dev vs. overall_score):", round(r_pair1, 3))
## Pearson's r (infra_dev vs. overall_score): 0.637
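One way to approach the by-region question is to compute the same Pearson correlation within each region. The sketch below does this on simulated stand-in data (the two regions, their slopes, and the object names toy and r_by_region are all hypothetical); with the real data, the same summarise() call could be applied to dataset_clean grouped by region.

```r
library(dplyr)

set.seed(1)
# Simulated stand-in data (hypothetical; not the SPI values)
toy <- data.frame(
  region    = rep(c("Region A", "Region B"), each = 50),
  infra_dev = rnorm(100, sd = 10)
)
# Region A is given a steeper slope than Region B on purpose
slope <- ifelse(toy$region == "Region A", 0.9, 0.3)
toy$overall_score <- 60 + slope * toy$infra_dev + rnorm(100, sd = 5)
toy$overall_score[c(3, 57)] <- NA  # mimic missing scores

# Pearson's r within each region, using complete pairs only
r_by_region <- toy |>
  group_by(region) |>
  summarise(
    n = sum(!is.na(infra_dev) & !is.na(overall_score)),
    r = cor(infra_dev, overall_score, use = "complete.obs"),
    .groups = "drop"
  )

print(r_by_region)
```

Reporting n alongside r matters here: a correlation based on a handful of complete pairs is much less stable than one based on hundreds.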
population and years_since
population captures the size of a country in a given
year, while years_since captures how long ago that
observation was recorded. These two variables do not have a clear causal
direction: the passage of time doesn’t cause a country to be
larger, and population size doesn’t cause years to pass. So
here, we treat them as two standalone variables with no
defined response or explanatory role. We might expect a slight negative
relationship: older observations (larger years_since) would
tend to show smaller population counts, since the world’s population has
grown over this time window.
ggplot(dataset_clean, aes(x = years_since, y = population)) +
  geom_point(alpha = 0.15, color = "darkorange") +
  scale_y_log10(labels = scales::label_number(scale = 1e-6, suffix = "M")) +
  annotation_logticks(sides = "l") +
  geom_smooth(method = "lm", se = FALSE, color = "darkred", linewidth = 0.8) +
  labs(
    title = "Population vs. Years Since Observation",
    subtitle = "Y-axis is log-scaled. Older observations (higher years_since) may show smaller populations.",
    x = "Years Since Observation",
    y = "Population (log scale, millions)"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The trend line is nearly flat across years_since,
suggesting that population size is mostly determined by which country is
being observed, not when.
The correlation is very weak and slightly negative, which matches the
plot. The negative sign makes sense: older observations correspond to
slightly smaller populations, since the world’s population has grown.
The magnitude is tiny, however, meaning time explains very little of the
variation in population size across countries. Most of that variation
comes from country-level differences that years_since
cannot capture.
r_pair2 <- cor(dataset_clean$years_since, log10(dataset_clean$population), use = "complete.obs")
cat("Pearson's r (years_since vs. log10 population):", round(r_pair2, 3))
## Pearson's r (years_since vs. log10 population): -0.03
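To put the size of this correlation in perspective, squaring r = -0.03 gives about 0.0009, so years_since accounts for under 0.1% of the variance in log10 population. The sketch below illustrates the contrast between time and country as sources of variation on a simulated panel (five hypothetical countries with made-up baseline sizes; none of these numbers come from the SPI data). It compares the R² of a model using only years_since with one using only country.

```r
set.seed(2)
# Simulated panel (hypothetical): 5 countries observed over 10 years
toy <- expand.grid(country = paste0("C", 1:5), years_since = 0:9)
base_log_pop <- c(C1 = 5.5, C2 = 6.2, C3 = 7.0, C4 = 7.8, C5 = 8.9)

# log10 population: country baseline, a slight decline going back in time, small noise
toy$log_pop <- unname(base_log_pop[as.character(toy$country)]) -
  0.005 * toy$years_since +
  rnorm(nrow(toy), sd = 0.01)

# Share of variance explained by time alone vs. by country alone
r2_time    <- summary(lm(log_pop ~ years_since, data = toy))$r.squared
r2_country <- summary(lm(log_pop ~ country, data = toy))$r.squared

cat("R^2 from years_since alone:", round(r2_time, 4), "\n")
cat("R^2 from country alone:    ", round(r2_country, 4), "\n")
```

If the real data behave like this toy panel, nearly all of the variation sits between countries, which is consistent with the near-zero correlation reported above.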
Following the approach from the lab notebook, I define a reusable bootstrapped confidence interval function:
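# Percentile bootstrap CI for a statistic `func` of the numeric vector `v`
boot_ci <- function(v, func = mean, conf = 0.95, n_iter = 1000) {
  # boot() passes the data and a vector of resampled indices to the statistic
  boot_func <- function(x, i) func(x[i], na.rm = TRUE)
  b <- boot(v, boot_func, R = n_iter)
  boot.ci(b, conf = conf, type = "perc")
}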
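To make the percentile method less of a black box, here is a minimal sketch that computes a percentile bootstrap interval by hand and compares it with the boot package on simulated scores (the vector x and its parameters are made up for illustration). The manual version uses quantile(), whereas boot.ci() interpolates between order statistics, so the two should agree closely but not exactly.

```r
library(boot)

set.seed(42)
# Simulated scores with two missing values (hypothetical, not SPI data)
x <- c(rnorm(200, mean = 65, sd = 17), NA, NA)
x_obs <- x[!is.na(x)]

# Manual percentile bootstrap: resample with replacement, take the mean,
# then read off the 2.5th and 97.5th percentiles of the resampled means
boot_means <- replicate(2000, mean(sample(x_obs, replace = TRUE)))
manual_ci <- quantile(boot_means, c(0.025, 0.975))

# Same idea via boot::boot() and boot::boot.ci()
b <- boot(x_obs, function(d, i) mean(d[i]), R = 2000)
pkg_ci <- boot.ci(b, conf = 0.95, type = "perc")$percent[4:5]

print(manual_ci)
print(pkg_ci)
```

Dropping the missing values before resampling, as done here and in the next section, keeps every bootstrap replicate at the same effective sample size.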
overall_score (Response Variable, Pair 1)
We want to estimate the mean overall statistical performance score across all countries and years in this dataset.
set.seed(42)
overall_vals <- dataset_clean$overall_score[!is.na(dataset_clean$overall_score)]
cat("Number of observations:", length(overall_vals), "\n")
## Number of observations: 1425
cat("Sample mean:", round(mean(overall_vals), 2), "\n")
## Sample mean: 64.95
cat("Sample SD:", round(sd(overall_vals), 2), "\n")
## Sample SD: 17.51
set.seed(42)
ci_overall <- boot_ci(overall_vals, func = mean, conf = 0.95)
ci_overall
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = b, conf = conf, type = "perc")
##
## Intervals :
## Level Percentile
## 95% (64.07, 65.90 )
## Calculations and Intervals on Original Scale
The 95% bootstrapped confidence interval for the mean
overall_score, (64.07, 65.90), is the range we are 95% confident
contains the true mean overall_score for the population represented in
this dataset (country-year observations with a recorded overall_score).
The interval is quite narrow relative to the full 0–100 scale, at under 2 points wide. This makes sense given the sample size: 1,425 of the 4,340 rows have a non-missing overall_score, which is still enough for a precise estimate of the mean. However, we should interpret this carefully: this CI pools all available years together. Since statistical capacity has generally improved over time, the true mean for 2023 is likely higher than the true mean for the earliest years with recorded scores. The CI here reflects an overall average across the entire time window, not any single year’s performance.
How would the CI change if we computed it only for the most recent year (2023)? And does the CI differ meaningfully by income group? Are high-income countries tightly clustered (narrow CI), while low-income countries show more spread (wider CI)?
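A sketch of how the by-group question could be explored: split the scores by group and compute a percentile bootstrap interval for each. The data below are simulated for two hypothetical income groups (group sizes, means, and spreads are made up), and group_ci is an illustrative helper rather than part of the lab notebook. With the real data, the same split() call could be applied to overall_score and income after dropping missing scores.

```r
library(boot)

set.seed(42)
# Simulated scores for two hypothetical income groups (not SPI values)
toy <- data.frame(
  income = rep(c("High income", "Low income"), times = c(120, 40)),
  overall_score = c(rnorm(120, mean = 85, sd = 5),
                    rnorm(40, mean = 50, sd = 15))
)

# Percentile bootstrap CI for the mean of one numeric vector
group_ci <- function(v, n_iter = 2000) {
  b <- boot(v, function(d, i) mean(d[i]), R = n_iter)
  ci <- boot.ci(b, conf = 0.95, type = "perc")$percent[4:5]
  c(n = length(v), mean = mean(v), lower = ci[1], upper = ci[2])
}

# One row per group: sample size, mean, and interval endpoints
ci_by_income <- t(sapply(split(toy$overall_score, toy$income), group_ci))
print(round(ci_by_income, 2))
```

The width of each interval reflects both the spread of scores within the group and the group’s size, so a wide interval for one group may indicate fewer observations rather than more heterogeneity.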
In this data dive, I worked with the World Bank Statistical Performance Indicators dataset and:
Created two variables: infra_dev (deviation of infrastructure score from regional average) and years_since (years since the observation year).
Examined infra_dev vs. overall_score, finding a moderate positive correlation: countries that outperform their region in infrastructure tend to have higher overall statistical performance scores.
Examined years_since vs. population, finding a very weak negative correlation, confirming that time alone is a poor predictor of population size. This pair has no response variable and no CI was constructed for it.
Built a 95% bootstrapped confidence interval for the mean overall_score, finding a narrow and precise estimate of the population mean given the large sample size.
The broader takeaway is that statistical infrastructure is strongly associated with overall capacity, but it is not the only factor; the moderate (rather than perfect) correlation tells us that countries can compensate for infrastructure gaps through strong performance in other dimensions such as data use and services.