# Load required packages
library(ggplot2)
library(dplyr)

# Load the dataset
df <- read.csv("dataset.csv")

# Keep only 2023 data and remove rows with missing values
df2023 <- df %>%
  filter(year == 2023) %>%
  filter(!is.na(overall_score), !is.na(data_use_score), !is.na(data_products_score))

# Remove "Not classified" income group
df2023 <- df2023 %>%
  filter(income != "Not classified")

# Fix the order of the income levels (so plots run from low to high income)
df2023$income <- factor(df2023$income,
                        levels = c("Low income", "Lower middle income", "Upper middle income", "High income"))

What Drives a Country’s Data Performance? A Global Analysis Using World Bank Indicators
Audience
This report is written for policymakers and analysts at international development organizations, such as the World Bank or the United Nations. These are professionals who work on improving national data systems in developing countries and need to know where to focus their efforts.
The goal is to give them clear, data-driven guidance on which areas of a country’s data system matter most for overall statistical performance.
Background and Objective
Good data systems are important for every country. When governments can collect, manage, and use data effectively, they can make better decisions about health, education, and economic policy. However, not all countries have the same level of data capability.
The World Bank measures statistical performance through its Statistical Performance Indicators (SPI) dataset. It tracks how well countries are doing across five key dimensions:
- Data Use – how well data is used by government and public institutions
- Data Services – the quality of data services provided to users
- Data Products – the quality and availability of statistical outputs
- Data Sources – the quality and coverage of data collection
- Data Infrastructure – the systems and laws that support data work
Each country receives an overall score (0 to 100) based on these five dimensions.
Main Question:
Which data capability dimensions are most strongly associated with a country’s overall statistical performance?
Understanding this helps organizations decide where to invest when supporting countries with weak statistical systems.
Data Overview
- Source: World Bank Statistical Performance Indicators (SPI)
- Coverage: 217 countries, years 2004 to 2023
- Focus for this analysis: Year 2023 only
- Reason: 2023 is the most recent and most complete year in the dataset
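- Final sample after removing missing values: 187 countries (the regression output below reports 184 residual degrees of freedom, which corresponds to 186 observations; this is consistent with one further country being dropped by the "Not classified" income filter)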
- Key variables used:
| Variable | Role | Description |
|---|---|---|
| overall_score | Outcome | Overall statistical performance (0-100) |
| data_use_score | Predictor | How well data is used |
| data_products_score | Predictor | Quality of statistical outputs |
| income | Group variable | Income level of the country |
| region | Group variable | World region |
Exploratory Data Analysis
Plot 1: Distribution of Overall Score
# Histogram
p1 <- ggplot(df2023, aes(x = overall_score)) +
  geom_histogram(binwidth = 5, fill = "#2196F3", color = "white", alpha = 0.8) +
  labs(
    title = "Distribution of Overall Statistical Performance Score (2023)",
    x = "Overall Score (0 to 100)",
    y = "Number of Countries"
  ) +
  theme_minimal()

# Show on webpage
p1

# Save as image file
ggsave("plot1_histogram.png", plot = p1, width = 8, height = 5)
Interpretation: The overall score ranges from about 28 to 95 across countries. The distribution is spread out with many countries scoring between 50 and 85. This shows a large gap in statistical performance around the world, which confirms that there is a real difference to explain.
Plot 2: Overall Score by Income Group
# Boxplot
p2 <- ggplot(df2023, aes(x = income, y = overall_score, fill = income)) +
  geom_boxplot(alpha = 0.7) +
  labs(
    title = "Overall Score by Income Group (2023)",
    x = "Income Group",
    y = "Overall Score (0 to 100)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

# Show on webpage
p2

# Save as image file
ggsave("plot2_boxplot.png", plot = p2, width = 8, height = 5)

Interpretation: High-income countries clearly score higher on average (around 81) than low-income countries (around 56). This pattern is consistent across all income groups – as income increases, so does statistical performance. This suggests that resources and development level play a role in data capability.
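The group averages behind a boxplot like this can be tabulated directly. The sketch below is illustrative only: `toy` is a small invented data frame standing in for `df2023`, so its numbers are not the SPI results, although the column names mirror the ones used in this report.

```r
# Toy stand-in for df2023 (all values are invented for illustration)
income_levels <- c("Low income", "Lower middle income",
                   "Upper middle income", "High income")
toy <- data.frame(
  income = factor(rep(income_levels, each = 2), levels = income_levels),
  overall_score = c(50, 60, 62, 68, 70, 76, 80, 84)
)

# Mean and median overall score per income group, in factor-level order
group_means   <- tapply(toy$overall_score, toy$income, mean)
group_medians <- tapply(toy$overall_score, toy$income, median)
print(cbind(mean = group_means, median = group_medians))

# TRUE only if the mean rises at every step from low to high income
is_increasing <- all(diff(as.vector(group_means)) > 0)
print(is_increasing)
```

Checking every step, not just the two extremes, is what supports a statement that the pattern holds across all income groups.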
Plot 3: Data Use Score vs Overall Score
# Scatterplot
p3 <- ggplot(df2023, aes(x = data_use_score, y = overall_score)) +
  geom_point(alpha = 0.5, color = "#2196F3") +
  geom_smooth(method = "lm", color = "darkred", se = TRUE) +
  labs(
    title = "Data Use Score vs Overall Score (2023)",
    x = "Data Use Score",
    y = "Overall Score (0 to 100)"
  ) +
  theme_minimal()

# Show on webpage
p3

# Save as image file
ggsave("plot3_scatter.png", plot = p3, width = 8, height = 5)

Interpretation: There is a clear positive relationship between data use score and overall score. Countries that score higher on data use tend to have higher overall statistical performance. The red line shows the linear trend, and the shaded area shows the confidence interval around it (95% by default in geom_smooth).
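The strength of a linear trend like this one can be summarised with the Pearson correlation. The sketch below uses simulated data, not the SPI dataset, and the coefficients used to generate it are arbitrary assumptions. It also shows that, for a regression with a single predictor, the squared correlation equals the model's R-squared.

```r
set.seed(1)

# Simulated scores (illustrative only, not the SPI data)
n <- 100
data_use_score <- runif(n, min = 20, max = 100)
overall_score  <- 10 + 0.7 * data_use_score + rnorm(n, sd = 8)

# Pearson correlation with a test of H0: correlation = 0
ct <- cor.test(data_use_score, overall_score)
r  <- unname(ct$estimate)

# R-squared from the matching one-predictor regression
fit       <- lm(overall_score ~ data_use_score)
r_squared <- summary(fit)$r.squared

print(c(r = r, r_squared_from_r = r^2, r_squared_from_lm = r_squared))
print(ct$conf.int)  # 95% confidence interval for the correlation
```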
Assumptions
Before running the analysis, it is important to state the assumptions made:
Using 2023 only: The dataset covers 20 years, but earlier years have many missing values. Using 2023 gives the most complete and consistent picture. This makes the analysis simpler and more reliable.
Linear relationships: The scatterplot in Plot 3 shows that the relationship between data use score and overall score appears roughly linear. The same is assumed for data products score, which is not plotted here. This supports the use of linear regression.
Missing values: Rows with missing values in key variables were removed. This is acceptable because the missing data is spread across different regions and income groups, so it is unlikely to create a strong bias.
Independence of observations: Each row represents one country in one year. Countries are treated as independent observations in this cross-sectional analysis.
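The missing-values assumption can be checked rather than taken on trust. The following is a minimal sketch on an invented data frame (`toy`, a hypothetical stand-in for the unfiltered 2023 rows); it counts how many rows would be dropped in each income group.

```r
# Toy stand-in for the 2023 rows before filtering (values are invented)
toy <- data.frame(
  income = c("Low income", "Low income", "High income", "High income",
             "Lower middle income", "Upper middle income"),
  overall_score       = c(55, NA, 80, 85, 60, NA),
  data_use_score      = c(50, 40, 90, NA, 70, 75),
  data_products_score = c(60, 55, 88, 90, 65, 70)
)

# A row is complete if none of the three score columns is missing
score_cols   <- c("overall_score", "data_use_score", "data_products_score")
toy$complete <- complete.cases(toy[, score_cols])

# Kept vs. dropped rows per income group
missing_by_group <- table(toy$income, toy$complete,
                          dnn = c("income", "complete"))
print(missing_by_group)

# Share of rows dropped in each group
drop_rate <- tapply(!toy$complete, toy$income, mean)
print(drop_rate)
```

If the drop rate were much higher in one income group or region than in the others, the remaining sample would under-represent that group, and the claim that missingness is unlikely to bias the results would need revisiting.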
Analysis
Part A: Hypothesis Test
Question: Do high-income countries have a significantly higher overall score than low-income countries?
- H0 (Null hypothesis): There is no difference in overall score between high-income and low-income countries
- H1 (Alternative hypothesis): High-income countries have a higher overall score than low-income countries
# Separate the two groups
high_income <- df2023 %>% filter(income == "High income") %>% pull(overall_score)
low_income <- df2023 %>% filter(income == "Low income") %>% pull(overall_score)
# Run a two-sample t-test
t.test(high_income, low_income, alternative = "greater")
Welch Two Sample t-test
data: high_income and low_income
t = 7.5645, df = 53.091, p-value = 2.76e-10
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
19.3647 Inf
sample estimates:
mean of x mean of y
81.23470 56.36651
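Interpretation: The Welch t-test compares the average overall score of the two groups. The high-income mean is 81.2 and the low-income mean is 56.4, a difference of about 24.9 points, with t = 7.56 on about 53 degrees of freedom and a one-sided p-value of 2.76e-10. Because the p-value is far below 0.05, we reject the null hypothesis and conclude that high-income countries score significantly higher. The one-sided 95% confidence interval puts the difference at 19.4 points or more. This result connects directly to the main objective: income level is associated with statistical performance.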
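To make the test statistic less of a black box, the sketch below recomputes a Welch t-test by hand and compares it with `t.test()`. The two groups are simulated with arbitrary means, spreads, and sizes, so the printed numbers will not match the output above.

```r
set.seed(42)

# Simulated scores for two groups (illustrative only, not the SPI data)
high <- rnorm(60, mean = 80, sd = 10)
low  <- rnorm(25, mean = 57, sd = 15)

# Welch statistic: difference in means divided by its standard error,
# without assuming equal variances in the two groups
v_high   <- var(high) / length(high)
v_low    <- var(low)  / length(low)
t_manual <- (mean(high) - mean(low)) / sqrt(v_high + v_low)

# Welch-Satterthwaite approximation to the degrees of freedom
df_manual <- (v_high + v_low)^2 /
  (v_high^2 / (length(high) - 1) + v_low^2 / (length(low) - 1))

# One-sided p-value for H1: mean(high) > mean(low)
p_manual <- pt(t_manual, df = df_manual, lower.tail = FALSE)

# Same test with the built-in function used in this report
res <- t.test(high, low, alternative = "greater")

print(c(t = t_manual, df = df_manual, p = p_manual))
print(c(t = unname(res$statistic), df = unname(res$parameter), p = res$p.value))
```

The fractional degrees of freedom in the report's output (53.091) come from this approximation, which is why they are not a whole number.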
Part B: Regression Model
Step 1 – Simple Model
We start with one predictor to keep the interpretation clear.
# Simple linear regression: one predictor
model1 <- lm(overall_score ~ data_use_score, data = df2023)
summary(model1)
Call:
lm(formula = overall_score ~ data_use_score, data = df2023)
Residuals:
Min 1Q Median 3Q Max
-30.1424 -6.6160 0.6688 6.9058 20.5067
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.45691 2.79396 3.027 0.00283 **
data_use_score 0.75100 0.03324 22.590 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.73 on 184 degrees of freedom
Multiple R-squared: 0.735, Adjusted R-squared: 0.7336
F-statistic: 510.3 on 1 and 184 DF, p-value: < 2.2e-16
Interpretation: This model asks: how much does overall score change for each 1-point increase in data use score? The estimated coefficient is 0.751, so a country scoring 1 point higher on data use is expected to score about 0.75 points higher overall, and the relationship is highly significant (p < 2e-16). The R-squared of 0.735 means that data use alone accounts for about 74% of the variation in overall score.
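The reading of the slope as "change in overall score per 1-point change in data use" can be illustrated as follows. This is a sketch on simulated data (`sim` is a hypothetical data frame, and the generating coefficients are assumptions), showing how to pull out the slope, its confidence interval, and a pair of predictions one point apart.

```r
set.seed(7)

# Simulated data (illustrative only, not the SPI data)
n <- 120
data_use_score <- runif(n, min = 20, max = 100)
overall_score  <- 8 + 0.75 * data_use_score + rnorm(n, sd = 9)
sim <- data.frame(data_use_score, overall_score)

fit <- lm(overall_score ~ data_use_score, data = sim)

# Estimated slope and its 95% confidence interval
slope <- unname(coef(fit)["data_use_score"])
ci    <- confint(fit, "data_use_score", level = 0.95)
print(slope)
print(ci)

# Predicted overall score at data use = 60 and 61:
# the two predictions differ by exactly the slope
pred <- predict(fit, newdata = data.frame(data_use_score = c(60, 61)))
print(unname(diff(pred)))
```

Reporting the confidence interval alongside the point estimate gives readers a sense of how precisely the slope is pinned down.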
Step 2 – Extended Model
We add a second predictor to see if data products also explain overall performance.
# Extended model: two predictors
model2 <- lm(overall_score ~ data_use_score + data_products_score, data = df2023)
summary(model2)
Call:
lm(formula = overall_score ~ data_use_score + data_products_score,
data = df2023)
Residuals:
Min 1Q Median 3Q Max
-21.9811 -5.7568 0.6582 5.8016 18.1231
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -20.68848 4.11925 -5.022 1.21e-06 ***
data_use_score 0.48687 0.04153 11.722 < 2e-16 ***
data_products_score 0.66491 0.07700 8.635 2.83e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.379 on 183 degrees of freedom
Multiple R-squared: 0.8117, Adjusted R-squared: 0.8097
F-statistic: 394.5 on 2 and 183 DF, p-value: < 2.2e-16
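Interpretation: This model includes both data use and data products as predictors, and both are highly significant. Adding data products raises R-squared from 0.735 to 0.812 (adjusted: 0.734 to 0.810) and lowers the residual standard error from 8.73 to 7.38, so it improves the explanation of overall score. Holding data products constant, the data use coefficient falls from 0.751 to 0.487, which indicates that part of the association in model 1 was shared with data products; the data products coefficient is 0.665. The focus here is on general patterns, not causal claims.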
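One way to formalise the comparison of the two models is an F-test for nested models, and one way to compare predictors measured on different scales is to standardise the variables first. Neither step is part of the original analysis; the sketch below uses simulated data (`sim`, with assumed generating coefficients) purely to show the mechanics.

```r
set.seed(3)

# Simulated data with two correlated predictors (not the SPI data)
n <- 150
data_use_score      <- runif(n, min = 20, max = 100)
data_products_score <- 30 + 0.5 * data_use_score + rnorm(n, sd = 10)
overall_score <- -20 + 0.5 * data_use_score + 0.65 * data_products_score +
  rnorm(n, sd = 7)
sim <- data.frame(overall_score, data_use_score, data_products_score)

m1 <- lm(overall_score ~ data_use_score, data = sim)
m2 <- lm(overall_score ~ data_use_score + data_products_score, data = sim)

# Adjusted R-squared penalises the extra predictor
adj_r2 <- c(m1 = summary(m1)$adj.r.squared, m2 = summary(m2)$adj.r.squared)
print(adj_r2)

# F-test: does the added predictor significantly reduce residual error?
cmp <- anova(m1, m2)
print(cmp)

# Standardised coefficients: refit with every variable rescaled to
# mean 0 and standard deviation 1, so the slopes are comparable
m2_std <- lm(overall_score ~ data_use_score + data_products_score,
             data = as.data.frame(scale(sim)))
print(coef(m2_std)[-1])
```

Raw coefficients depend on the spread of each predictor, so standardised slopes are often a fairer basis for asking which dimension is more strongly associated with the outcome.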
Conclusions and Recommendations
Based on this analysis, the following patterns were found:
- Countries with higher data use scores tend to have significantly higher overall statistical performance scores.
- High-income countries score much higher than low-income countries on average, which suggests that resources and institutional capacity matter.
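- The regression models show that data use is a strong predictor of overall performance: on its own it explains about 74% of the variation in overall score, and about 81% together with data products.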
Recommendation for international organizations:
When supporting countries with weak statistical systems, data use capacity should be the first priority. Countries that actively use data in policy and planning tend to perform better overall. Investment in data use training, open data policies, and data-driven governance can have a meaningful impact on overall statistical performance.
A second area to consider is data products – the quality and availability of statistical outputs. Countries that produce better data products also tend to score higher overall.
Limitations:
- This analysis uses data from 2023 only. It does not show changes over time.
- The relationship between variables does not prove causation. Other factors (such as government investment or international support) may also explain the patterns.
- Roughly 30 of the 217 countries were removed due to missing data, which may affect the results if the excluded countries differ systematically from those retained.
Data source: World Bank Statistical Performance Indicators (SPI), 2023.