Week 6 Data Dive — Confidence Intervals

Introduction

In this data dive, I explore the Statistical Performance Indicators (SPI) dataset from the World Bank, accessed via TidyTuesday. Each row in this dataset represents a country–year observation, tracking how well countries manage and use statistical data across multiple dimensions over the years 2004–2023.

I will work with two pairs of variables:

  • Pair 1 (Explanatory → Response): infra_dev (created, explanatory: each country’s deviation from its region’s average data_infrastructure_score that year) paired with overall_score (original, response).
  • Pair 2: population (original) paired with years_since (created: how many years before the most recent year in the dataset the observation was recorded). This pair does not include a defined response variable, so no confidence interval is constructed for this relationship.
#load libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(boot)

#load dataset
dataset <- read_csv("dataset.csv")
## Rows: 4340 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): iso3c, country, region, income
## dbl (8): year, population, overall_score, data_use_score, data_services_scor...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Loading and Building the Dataset

In this step, I create two new variables that will be used in my analysis:

  1. infra_dev: Measures how far a country’s infrastructure score is from its region’s average in the same year. infra_dev allows us to compare countries relative to their regional peers instead of using raw infrastructure scores.

  2. years_since: Measures how many years separate the observation from the most recent year in the dataset (max(year) - year). years_since helps us examine whether older observations differ systematically from more recent ones.

dataset_clean <- dataset |>
  
  # 1: Group by region and year
  group_by(region, year) |>
  
  # 2: Create infra_dev
  mutate(
    infra_dev = data_infrastructure_score - 
                mean(data_infrastructure_score, na.rm = TRUE)
  ) |>
  
  # 3: Remove grouping
  ungroup() |>
  
  # 4: Create years_since
  mutate(
    years_since = max(year) - year
  )
dataset_clean |>
  select(country, region, year,
         data_infrastructure_score, infra_dev,
         overall_score, population, years_since) |>
  head(10)
## # A tibble: 10 × 8
##    country     region        year data_infrastructure_…¹ infra_dev overall_score
##    <chr>       <chr>        <dbl>                  <dbl>     <dbl>         <dbl>
##  1 Denmark     Europe & Ce…  2023                    100      11.1          95.3
##  2 Finland     Europe & Ce…  2023                    100      11.1          95.1
##  3 Poland      Europe & Ce…  2023                    100      11.1          94.7
##  4 Sweden      Europe & Ce…  2023                    100      11.1          94.4
##  5 Spain       Europe & Ce…  2023                    100      11.1          94.3
##  6 Netherlands Europe & Ce…  2023                    100      11.1          94.3
##  7 Slovenia    Europe & Ce…  2023                    100      11.1          94.2
##  8 Portugal    Europe & Ce…  2023                    100      11.1          93.8
##  9 Italy       Europe & Ce…  2023                    100      11.1          93.6
## 10 Norway      Europe & Ce…  2023                    100      11.1          93.6
## # ℹ abbreviated name: ¹​data_infrastructure_score
## # ℹ 2 more variables: population <dbl>, years_since <dbl>
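A quick way to sanity-check a group-centered variable like infra_dev is to confirm that it averages to zero within every region-year group. The sketch below does this in base R on a small made-up data frame; the regions "A" and "B" and all of the scores are hypothetical, not values from the SPI dataset.

```r
# Toy data: made-up scores for two hypothetical regions in a single year
toy <- data.frame(
  region = c("A", "A", "A", "B", "B"),
  year   = c(2023, 2023, 2023, 2023, 2023),
  data_infrastructure_score = c(100, 80, 60, 50, 30)
)

# Deviation from the region-year mean, mirroring the grouped mutate() above
toy$infra_dev <- with(
  toy,
  data_infrastructure_score -
    ave(data_infrastructure_score, region, year,
        FUN = function(s) mean(s, na.rm = TRUE))
)

# Within each region-year group the deviations should average to zero
group_means <- tapply(toy$infra_dev, list(toy$region, toy$year), mean)

print(toy)
print(group_means)
```

The same logic explains the output above: every European country with a score of 100 in 2023 shares the same regional mean, so they all receive the same infra_dev of 11.1.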

Pair 1: infra_dev (Explanatory) and overall_score (Response)

A country’s data infrastructure (its systems, standards, and capacity for collecting and managing data) is a plausible driver of its overall statistical performance. infra_dev captures how much a country’s infrastructure stands out relative to its regional peers. We treat overall_score as the response and infra_dev as the explanatory variable. Unlike data_infrastructure_score used directly, infra_dev adds interpretive value: it tells us whether a country’s infrastructure advantage (or disadvantage) relative to its region predicts its overall score.

This pairing avoids direct circularity, since infra_dev is calculated from data_infrastructure_score rather than from overall_score. It is not fully independent, though: overall_score aggregates several sub-scores, infrastructure among them, so some positive association is built in by construction.

Visualization

ggplot(dataset_clean, aes(x = infra_dev, y = overall_score)) +
  geom_point(alpha = 0.2, color = "steelblue") +
  geom_smooth(method = "lm", se = FALSE, color = "darkred", linewidth = 0.8) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
  labs(
    title = "Overall Score vs. Infrastructure Deviation from Regional Average",
    subtitle = "Each point is one country-year. Positive infra_dev = above regional infrastructure average.",
    x = "Infrastructure Score Deviation from Regional Average (infra_dev)",
    y = "Overall Score (0-100)"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2915 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2915 rows containing missing values or values outside the scale range
## (`geom_point()`).

Observations from the plot:

  • There is a clear positive trend: countries whose infrastructure score exceeds their regional average tend to also have higher overall scores.
  • The relationship is moderately strong but not perfectly linear; there is noticeable spread around the trend line, which is expected since overall_score depends on multiple sub-scores beyond infrastructure alone.
  • There are some outliers on the left side: countries with very negative infra_dev (far below their regional peers in infrastructure) that do not necessarily have the lowest overall scores. This suggests that strong performance in other areas (e.g., data use or services) can partially offset infrastructure weaknesses.
  • The spread appears slightly wider for negative infra_dev values, indicating more variability among countries that underperform regionally in infrastructure.
  • Insight: Infrastructure relative to regional peers is a meaningful predictor of overall statistical performance. Countries that lead their region in infrastructure tend to also lead in overall capacity. However, the spread in the plot tells us that infrastructure is not the only driver; other dimensions of statistical capacity also contribute.

Correlation Coefficient

Since both variables are continuous, we use Pearson’s correlation coefficient.

The correlation is positive and moderately strong (r = 0.637), reflecting a real relationship between infrastructure advantage and overall performance. The value is meaningful: it tells us that countries which outperform their region in infrastructure tend to have higher overall scores, but the relationship is not perfect, which makes sense since overall_score aggregates multiple dimensions beyond infrastructure. A further question worth investigating: does this correlation differ by region? For example, does infrastructure matter more as a predictor in lower-income regions where it may be more of a binding constraint?

r_pair1 <- cor(dataset_clean$infra_dev, dataset_clean$overall_score, use = "complete.obs")
cat("Pearson's r (infra_dev vs. overall_score):", round(r_pair1, 3))
## Pearson's r (infra_dev vs. overall_score): 0.637
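The follow-up question about regional differences could be approached by computing Pearson’s r separately within each region. The sketch below shows one way to do that in base R; cor_by_group is a hypothetical helper name, and the data frame holds made-up values rather than SPI data.

```r
# Hypothetical helper: Pearson's r between two columns within each group level
cor_by_group <- function(df, group, x, y) {
  sapply(split(df, df[[group]]), function(g) {
    cor(g[[x]], g[[y]], use = "complete.obs")
  })
}

# Made-up values for illustration only
toy <- data.frame(
  region        = rep(c("A", "B"), each = 4),
  infra_dev     = c(-10, -5, 5, 10, -10, -5, 5, 10),
  overall_score = c(40, 50, 60, 70, 55, 50, 60, 58)
)

# One correlation per region, returned as a named numeric vector
r_by_region <- cor_by_group(toy, "region", "infra_dev", "overall_score")
print(round(r_by_region, 3))
```

On the real data the call would presumably be cor_by_group(dataset_clean, "region", "infra_dev", "overall_score"). Regions with few complete rows would give unstable estimates, so the per-region sample sizes should be reported alongside the correlations.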

Pair 2: population and years_since

population captures the size of a country in a given year, while years_since captures how long ago that observation was recorded. These two variables do not have a clear causal direction: the passage of time doesn’t cause a country to be larger, and population size doesn’t cause years to pass. So here, we treat them as two standalone variables with no defined response or explanatory role. We might expect a slight negative relationship: older observations (larger years_since) would tend to show smaller population counts, since the world’s population has grown over this time window.

Visualization

ggplot(dataset_clean, aes(x = years_since, y = population)) +
  geom_point(alpha = 0.15, color = "darkorange") +
  scale_y_log10(labels = scales::label_number(scale = 1e-6, suffix = "M")) +
  annotation_logticks(sides = "l") +
  geom_smooth(method = "lm", se = FALSE, color = "darkred", linewidth = 0.8) +
  labs(
    title = "Population vs. Years Since Observation",
    subtitle = "Y-axis is log-scaled. Older observations (higher years_since) may show smaller populations.",
    x = "Years Since Observation",
    y = "Population (log scale, millions)"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Observations from the plot:

  • The y-axis is log-scaled because population spans an enormous range, from tiny island nations to countries with over a billion people.
  • There are clear outliers at the very top; these are almost certainly China and India, which consistently have populations far exceeding all other countries across all years.
  • The trend line shows a slight negative relationship: older observations have slightly smaller population values, consistent with global population growth over 2004–2023.
  • The relationship is very weak visually; the points are spread widely across all values of years_since, suggesting that population size is mostly determined by which country is being observed, not when.
  • Insight: Time alone is a poor predictor of population size within this 20-year window. The dominant factor is simply the country identity, since the difference between the smallest and largest countries far exceeds any within-country change over 20 years. To properly study population trends over time, each country would need to be analyzed individually.

Correlation Coefficient

The correlation is very weak and slightly negative (r = -0.03), which matches the plot. The negative sign makes sense, since older observations correspond to slightly smaller populations as the world’s population has grown. The magnitude is tiny, however, meaning time explains very little of the variation in population size across countries. Most of that variation comes from country-level differences that years_since cannot capture.

r_pair2 <- cor(dataset_clean$years_since, log10(dataset_clean$population), use = "complete.obs")
cat("Pearson's r (years_since vs. log10 population):", round(r_pair2, 3))
## Pearson's r (years_since vs. log10 population): -0.03
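The near-zero pooled correlation does not mean populations are flat over time. A small made-up example illustrates the point: two hypothetical countries that each grow by 1% per year show a perfect within-country relationship with years_since, yet the pooled correlation is almost zero because the gap between the countries swamps the growth. The country names and populations below are invented for illustration.

```r
# Made-up populations for two hypothetical countries, each growing 1% a year
ys <- 0:4
toy <- rbind(
  data.frame(country = "Small", years_since = ys,
             population = 1e5 * 1.01^(4 - ys)),
  data.frame(country = "Large", years_since = ys,
             population = 1e9 * 1.01^(4 - ys))
)

# Pooled correlation: dominated by the size gap between the two countries
r_pooled <- cor(toy$years_since, log10(toy$population))

# Within-country correlation: each country analyzed on its own
r_within <- sapply(split(toy, toy$country), function(g) {
  cor(g$years_since, log10(g$population))
})

print(round(r_pooled, 3))
print(round(r_within, 3))
```

This is the sense in which each country would need to be analyzed individually: the time trend only becomes visible once country-level differences are held fixed.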

Confidence Interval

Bootstrapped CI Helper Function

Following the approach from the lab notebook, I define a reusable bootstrapped confidence interval function:

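# Percentile bootstrap CI for a statistic `func` of the vector `v`
boot_ci <- function(v, func = mean, conf = 0.95, n_iter = 1000) {
  # boot() passes the data and a vector of resampled indices to the statistic
  boot_func <- function(x, i) func(x[i], na.rm = TRUE)
  b <- boot(v, boot_func, R = n_iter)
  # type = "perc" requests the percentile interval
  boot.ci(b, conf = conf, type = "perc")
}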
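To make the idea behind the helper concrete, here is a minimal base-R sketch of a percentile bootstrap written by hand: resample with replacement, recompute the statistic each time, and take the tail quantiles. manual_boot_ci is a hypothetical name and the scores are made up. boot.ci interpolates its quantiles somewhat differently, so its percentile bounds may differ slightly from this version.

```r
# Hypothetical hand-rolled percentile bootstrap (not the boot package's code)
manual_boot_ci <- function(v, func = mean, conf = 0.95, n_iter = 1000) {
  v <- v[!is.na(v)]
  # Resample with replacement n_iter times and recompute the statistic
  stats <- replicate(n_iter, func(sample(v, replace = TRUE)))
  # The CI bounds are the lower and upper tail quantiles of those statistics
  alpha <- 1 - conf
  quantile(stats, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

set.seed(42)
toy_scores <- c(52, 61, 48, 75, 66, 58, 70, 63, 55, 68)  # made-up values
ci <- manual_boot_ci(toy_scores)
print(ci)
```

The boot package remains the better choice for real work, since it also offers other interval types (such as BCa) through the same interface.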

CI for overall_score (Response Variable, Pair 1)

We want to estimate the mean overall statistical performance score across all countries and years in this dataset.

set.seed(42)
overall_vals <- dataset_clean$overall_score[!is.na(dataset_clean$overall_score)]

cat("Number of observations:", length(overall_vals), "\n")
## Number of observations: 1425
cat("Sample mean:", round(mean(overall_vals), 2), "\n")
## Sample mean: 64.95
cat("Sample SD:", round(sd(overall_vals), 2), "\n")
## Sample SD: 17.51
set.seed(42)
ci_overall <- boot_ci(overall_vals, func = mean, conf = 0.95)
ci_overall
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = b, conf = conf, type = "perc")
## 
## Intervals : 
## Level     Percentile     
## 95%   (64.07, 65.90 )  
## Calculations and Intervals on Original Scale

Conclusion

The 95% bootstrapped confidence interval for the mean overall_score is (64.07, 65.90). This is the range we are 95% confident contains the true mean overall_score for the population represented in this dataset (country-year observations from 2004–2023 with a recorded overall_score).

The interval is quite narrow relative to the full 0–100 scale: it spans less than two points. This makes sense given that it is based on 1,425 non-missing observations (of the 4,340 rows in the dataset), which is enough for a precise estimate of the mean. However, we should interpret this carefully: this CI pools all years together. Since statistical capacity has generally improved over time, the true mean for 2023 is likely higher than the true mean for 2004. The CI here reflects an overall average across the entire time window, not any single year’s performance.
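As a rough cross-check on the interval width, the normal-approximation interval can be computed directly from the summary statistics reported earlier (mean 64.95, SD 17.51, n = 1425). This is only a sketch of the textbook formula, mean ± z × SD / √n, and is not part of the bootstrap procedure itself.

```r
# Summary statistics as reported in the output above
n     <- 1425
x_bar <- 64.95
s     <- 17.51

se <- s / sqrt(n)      # standard error of the mean
z  <- qnorm(0.975)     # about 1.96 for a 95% interval

# Normal-approximation 95% CI for the mean
ci_normal <- x_bar + c(-1, 1) * z * se

print(round(se, 3))
print(round(ci_normal, 2))
```

The result lands within a few hundredths of the bootstrapped bounds of 64.07 and 65.90, which is what we would expect for a sample mean at this sample size.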

Further questions

How would the CI change if we computed it only for the most recent year (2023)? And does the CI differ meaningfully by income group? For example, are high-income countries tightly clustered (narrow CI) while low-income countries show more spread (wider CI), keeping in mind that CI width also depends on how many observations each group has?

Summary

In this data dive, I worked with the World Bank Statistical Performance Indicators dataset and:

  1. Created two new columns: infra_dev (deviation of infrastructure score from regional average) and years_since (years since the observation year).
  2. Explored infra_dev vs. overall_score, a moderate positive correlation (r = 0.637) reflecting a real, meaningful relationship: countries that outperform their region in infrastructure tend to have higher overall statistical performance scores.
  3. Explored years_since vs. population, a very weak negative correlation (r = -0.03), consistent with time alone being a poor predictor of population size. This pair has no response variable and no CI was constructed for it.
  4. Built a bootstrapped 95% confidence interval for the mean overall_score, (64.07, 65.90), a narrow and precise estimate of the population mean given the 1,425 observations available.

The broader takeaway is that statistical infrastructure is closely tied to overall capacity, but it is not the only factor; the moderate (rather than perfect) correlation is consistent with countries compensating for infrastructure gaps through strong performance in other dimensions such as data use and services.