Introduction

This report analyzes data from 187 countries to understand what drives overall statistical performance. Using 2023 data from the World Bank Statistical Performance Indicators (SPI) dataset, we explore which data capability dimensions – such as data use, data products, and data infrastructure – are most strongly associated with a country’s overall score.

The analysis includes exploratory data visualizations, a hypothesis test comparing income groups, and regression models to identify the strongest predictors. The findings are intended to help international development organizations decide where to focus their investments when supporting countries with weak data systems.


Audience

This report is written for policymakers and analysts at international development organizations, such as the World Bank or the United Nations. These are professionals who work on improving national data systems in developing countries and need to know where to focus their efforts.

The goal is to give them clear, data-driven guidance on which areas of a country’s data system matter most for overall statistical performance.


Background

The World Bank Statistical Performance Indicators (SPI) dataset measures how effectively countries collect, produce, and use data. Strong statistical systems are essential for informed policymaking and resource allocation. Without reliable data, governments cannot design effective policies or track progress toward development goals.


Objective

Main Question:

Which data capability dimensions are most strongly associated with a country’s overall statistical performance?

Understanding this helps organizations decide where to invest when supporting countries with weak statistical systems.


Data Overview

Variable             Role            Description
overall_score        Outcome         Overall statistical performance (0-100)
data_use_score       Predictor       How well data is used
data_products_score  Predictor       Quality of statistical outputs
income               Group variable  Income level of the country
region               Group variable  World region

Exploratory Data Analysis

# Load required packages
library(ggplot2)
library(dplyr)

# Load the dataset
df <- read.csv("dataset.csv")

# Keep only 2023 data and remove rows with missing values in the three score variables
df2023 <- df %>%
  filter(year == 2023) %>%
  filter(!is.na(overall_score), !is.na(data_use_score), !is.na(data_products_score))

# Remove "Not classified" income group
df2023 <- df2023 %>%
  filter(income != "Not classified")

# Set income as an ordered factor for better plot ordering
df2023$income <- factor(df2023$income,
  levels = c("Low income", "Lower middle income", "Upper middle income", "High income"))

Plot 1: Distribution of Overall Score

p1 <- ggplot(df2023, aes(x = overall_score)) +
  geom_histogram(binwidth = 5, fill = "#2196F3", color = "white", alpha = 0.8) +
  labs(
    title = "Distribution of Overall Statistical Performance Score (2023)",
    x = "Overall Score (0 to 100)",
    y = "Number of Countries"
  ) +
  theme_minimal()

p1

ggsave("plot1_histogram.png", plot = p1, width = 8, height = 5)

Interpretation: The overall score ranges from about 28 to 95 across countries. Most countries fall in the mid-to-high range (60 to 90), but the lowest scores sit far below that range. A spread of nearly 70 points shows that statistical performance differs substantially across countries, and that difference is what the rest of the analysis tries to explain.


Plot 2: Overall Score by Income Group

p2 <- ggplot(df2023, aes(x = income, y = overall_score, fill = income)) +
  geom_boxplot(alpha = 0.7) +
  labs(
    title = "Overall Score by Income Group (2023)",
    x = "Income Group",
    y = "Overall Score (0 to 100)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

p2

ggsave("plot2_boxplot.png", plot = p2, width = 8, height = 5)

Interpretation: High-income countries score around 81 on average, compared to around 56 for low-income countries, a gap of about 25 points. The two middle-income groups fall in between, so scores rise with each step up in income group. This pattern motivates the hypothesis test in the analysis section.


Plot 3: Data Use Score vs Overall Score

p3 <- ggplot(df2023, aes(x = data_use_score, y = overall_score)) +
  geom_point(alpha = 0.5, color = "#2196F3") +
  geom_smooth(method = "lm", color = "darkred", se = TRUE) +
  labs(
    title = "Data Use Score vs Overall Score (2023)",
    x = "Data Use Score",
    y = "Overall Score (0 to 100)"
  ) +
  theme_minimal()

p3

ggsave("plot3_scatter.png", plot = p3, width = 8, height = 5)

Interpretation: There is a clear positive relationship between data use score and overall score. Countries that score higher on data use tend to have higher overall statistical performance. The red line shows the fitted linear trend, and the shaded band shows the 95% confidence interval around that line (the geom_smooth default). This linear pattern supports the use of regression in the analysis section.


Assumptions

Before running the analysis, it is important to state the assumptions made:

  1. Using 2023 only: The dataset covers 20 years, but earlier years have many missing values. Using 2023 gives the most complete and consistent picture. This makes the analysis simpler and more reliable.

  2. Linear relationships: The scatterplot in Plot 3 shows that the relationship between data use score and overall score appears roughly linear. The same is assumed for data products score, which was not plotted. This supports the use of linear regression.

  3. Missing values: Rows with missing values in key variables were removed. This is acceptable because the missing data is spread across different regions and income groups, so it is unlikely to create a strong bias.

  4. Independence of observations: Each row represents one country in one year. Countries are treated as independent observations in this cross-sectional analysis.

  5. No causation: This analysis shows associations between variables. It does not prove that one variable causes another. Other factors such as government investment or international support may also explain the patterns.
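Assumption 3 can be checked directly by comparing the share of incomplete rows across groups. The sketch below uses a small made-up data frame (the `toy` values are hypothetical and are not taken from the SPI dataset) and base R only; with the real data, the same two lines would be applied to the 2023 rows before any filtering.

```r
# Hypothetical toy data, for illustration only (NOT from the SPI dataset)
toy <- data.frame(
  income              = c("Low income", "Low income",
                          "High income", "High income", "High income"),
  overall_score       = c(55, NA, 80, 82, 79),
  data_use_score      = c(60, 58, NA, 90, 88),
  data_products_score = c(70, 65, 85, 84, 86)
)

# The three variables that the report requires to be non-missing
key_vars <- c("overall_score", "data_use_score", "data_products_score")

# TRUE for rows that would be dropped by the missing-value filter
toy$incomplete <- !complete.cases(toy[, key_vars])

# Share of dropped rows within each income group
missing_share <- tapply(toy$incomplete, toy$income, mean)
print(round(missing_share, 2))
```

If the shares were very different across groups in the real data, the claim that removing incomplete rows creates little bias would need a closer look.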


Analysis

Part A: Hypothesis Test

Question: Do high-income countries have a significantly higher overall score than low-income countries?

  • H0 (Null hypothesis): There is no difference in overall score between high-income and low-income countries
  • H1 (Alternative hypothesis): High-income countries have a higher overall score than low-income countries

# Separate the two groups
high_income <- df2023 %>% filter(income == "High income") %>% pull(overall_score)
low_income  <- df2023 %>% filter(income == "Low income")  %>% pull(overall_score)

# Run a one-sided Welch two-sample t-test (t.test does not assume equal variances by default)
t.test(high_income, low_income, alternative = "greater")
## 
##  Welch Two Sample t-test
## 
## data:  high_income and low_income
## t = 7.5645, df = 53.091, p-value = 2.76e-10
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  19.3647     Inf
## sample estimates:
## mean of x mean of y 
##  81.23470  56.36651

Interpretation: The p-value is 2.76e-10, far below 0.05, so we reject the null hypothesis and conclude that high-income countries have a significantly higher overall score than low-income countries. The average score is 81.2 for high-income countries and 56.4 for low-income countries, a difference of about 25 points, and the one-sided 95% confidence interval puts the true difference at no less than 19.4 points. Income is not one of the data capability dimensions in the main question, but this result shows that it is strongly associated with statistical performance and gives useful context for the regression results.
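As a rough consistency check, the reported confidence bound and p-value can be rebuilt from the printed summary numbers alone. The sketch below copies the means, t statistic, and degrees of freedom from the output above; the variable names are illustrative, and small differences from the printed values are expected because the inputs are rounded.

```r
# Values copied from the Welch t-test output above
mean_high <- 81.23470
mean_low  <- 56.36651
t_stat    <- 7.5645
welch_df  <- 53.091

# Difference in group means
diff_means <- mean_high - mean_low

# Standard error implied by t = difference / standard error
se_diff <- diff_means / t_stat

# One-sided 95% lower confidence bound for the difference
lower_bound <- diff_means - qt(0.95, df = welch_df) * se_diff

# One-sided p-value: probability of a t value at least this large under H0
p_value <- pt(t_stat, df = welch_df, lower.tail = FALSE)

print(c(diff = diff_means, se = se_diff, lower = lower_bound, p = p_value))
```

The lower bound should land close to the reported 19.36, which confirms that the interval and the test are describing the same one-sided comparison.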


Part B: Regression Model

Step 1 – Simple Model

We start with one predictor to keep the interpretation clear.

# Simple linear regression: one predictor
model1 <- lm(overall_score ~ data_use_score, data = df2023)
summary(model1)
## 
## Call:
## lm(formula = overall_score ~ data_use_score, data = df2023)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -30.1424  -6.6160   0.6688   6.9058  20.5067 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     8.45691    2.79396   3.027  0.00283 ** 
## data_use_score  0.75100    0.03324  22.590  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.73 on 184 degrees of freedom
## Multiple R-squared:  0.735,  Adjusted R-squared:  0.7336 
## F-statistic: 510.3 on 1 and 184 DF,  p-value: < 2.2e-16

Interpretation: A 1-point increase in data use score is associated with a 0.75-point increase in overall score. The R-squared value of 0.735 means that data use alone explains about 73.5% of the variation in overall statistical performance across countries. This is a strong result for a single predictor.
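To make the fitted line concrete, the sketch below plugs a data use score into the estimated equation. The coefficients are copied from the summary(model1) output above; the score of 80 is a hypothetical input chosen only for illustration. It also recovers the sample size implied by the residual degrees of freedom.

```r
# Coefficients copied from the summary(model1) output above
intercept <- 8.45691
slope     <- 0.75100

# Fitted overall score for a hypothetical data use score of 80
predicted_80 <- intercept + slope * 80
print(predicted_80)

# Ten more points of data use are associated with slope * 10 more overall points
gain_per_10 <- slope * 10
print(gain_per_10)

# Sample size implied by the output:
# residual degrees of freedom + number of estimated coefficients
n_obs <- 184 + 2
print(n_obs)
```

The implied sample size of 186 is the number of countries that remain after the filtering steps in the data preparation code.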


Step 2 – Extended Model

We add a second predictor to see whether data products score also helps explain overall performance.

# Extended model: two predictors
model2 <- lm(overall_score ~ data_use_score + data_products_score, data = df2023)
summary(model2)
## 
## Call:
## lm(formula = overall_score ~ data_use_score + data_products_score, 
##     data = df2023)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.9811  -5.7568   0.6582   5.8016  18.1231 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -20.68848    4.11925  -5.022 1.21e-06 ***
## data_use_score        0.48687    0.04153  11.722  < 2e-16 ***
## data_products_score   0.66491    0.07700   8.635 2.83e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.379 on 183 degrees of freedom
## Multiple R-squared:  0.8117, Adjusted R-squared:  0.8097 
## F-statistic: 394.5 on 2 and 183 DF,  p-value: < 2.2e-16

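Interpretation: Adding data products score increases R-squared to 0.812, so the two predictors together explain about 81.2% of the variation in overall score, nearly 8 percentage points more than the simple model. Both predictors show a positive and statistically significant association with overall performance. Holding data products score constant, a 1-point increase in data use score is associated with a 0.49-point increase in overall score (down from 0.75 in the simple model). Holding data use score constant, a 1-point increase in data products score is associated with a 0.66-point increase. The drop in the data use coefficient suggests that the two predictors overlap in what they explain. The focus here is on general patterns, not causal claims.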
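The comparison between the two models can be reproduced from the printed output. The sketch below copies the R-squared values and the t value for data_products_score from the two summaries, and takes n = 186 from the residual degrees of freedom (183 plus 3 coefficients). The helper adj_r2 is an illustrative name, not part of the original analysis.

```r
# Values copied from the two model summaries above
r2_simple   <- 0.735
r2_extended <- 0.8117
n_obs       <- 183 + 3   # residual df + coefficients in the extended model

# Gain in explained variation from adding data_products_score
r2_gain <- r2_extended - r2_simple
print(r2_gain)

# Adjusted R-squared penalizes each extra predictor (p = number of predictors)
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)
adj_simple   <- adj_r2(r2_simple, n_obs, 1)
adj_extended <- adj_r2(r2_extended, n_obs, 2)
print(c(simple = adj_simple, extended = adj_extended))

# When one predictor is added, the partial F statistic for that predictor
# equals the square of its t value in the larger model
partial_f <- 8.635^2
print(partial_f)
```

Adjusted R-squared still rises after the penalty for the extra predictor, so the improvement is not just an artifact of adding a variable.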


Conclusions and Recommendations

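Based on this analysis, the following patterns were found:

  • Overall scores vary widely across countries, from about 28 to 95.
  • High-income countries average 81.2 points compared to 56.4 for low-income countries, and the difference is statistically significant (p = 2.76e-10).
  • Data use score alone explains about 73.5% of the variation in overall score.
  • Adding data products score raises the explained variation to about 81.2%, and both predictors are statistically significant.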

Recommendation for international organizations:

When supporting countries with weak statistical systems, data use capacity should be the first priority. Countries that actively use data in policy and planning tend to perform better overall. Because the analysis shows associations rather than causal effects, investment in data use training, open data policies, and data-driven governance is best seen as a promising way to support overall statistical performance, not a guaranteed one.

A second area to consider is data products – the quality and availability of statistical outputs. Countries that produce better data products also tend to score higher overall.

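Limitations:

  • The analysis shows associations, not causation. Other factors such as government investment or international support may also explain the patterns.
  • Only 2023 data is used, so the results describe one cross-section and say nothing about change over time.
  • Rows with missing values in the key variables were removed.
  • Only two capability dimensions, data use and data products, are included in the regression models.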


Data source: World Bank Statistical Performance Indicators (SPI), 2023.