This report analyzes data from 187 countries to understand what drives overall statistical performance. Using the World Bank Statistical Performance Indicators (SPI) dataset from 2023, we explore which data capability dimensions – such as data use, data products, and data infrastructure – are most strongly associated with a country’s overall score.
The analysis includes exploratory data visualizations, a hypothesis test comparing income groups, and regression models to identify the strongest predictors. The findings are intended to help international development organizations decide where to focus their investments when supporting countries with weak data systems.
This report is written for policymakers and analysts at international development organizations, such as the World Bank or the United Nations. These are professionals who work on improving national data systems in developing countries and need to know where to focus their efforts.
The goal is to give them clear, data-driven guidance on which areas of a country’s data system matter most for overall statistical performance.
The World Bank Statistical Performance Indicators (SPI) dataset measures how effectively countries collect, produce, and use data. Strong statistical systems are essential for informed policymaking and resource allocation. Without reliable data, governments cannot design effective policies or track progress toward development goals.
Main Question:
Which data capability dimensions are most strongly associated with a country’s overall statistical performance?
Understanding this helps organizations decide where to invest when supporting countries with weak statistical systems.
| Variable | Role | Description |
|---|---|---|
| `overall_score` | Outcome | Overall statistical performance (0-100) |
| `data_use_score` | Predictor | How well data is used |
| `data_products_score` | Predictor | Quality of statistical outputs |
| `income` | Group variable | Income level of the country |
| `region` | Group variable | World region |
# Load required packages
library(ggplot2)
library(dplyr)
# Load the dataset
df <- read.csv("dataset.csv")
# Keep only 2023 data and remove rows with missing values
df2023 <- df %>%
  filter(year == 2023) %>%
  filter(!is.na(overall_score), !is.na(data_use_score), !is.na(data_products_score))
# Remove "Not classified" income group
df2023 <- df2023 %>%
  filter(income != "Not classified")
# Set the income factor levels explicitly so plots order groups from low to high income
df2023$income <- factor(df2023$income,
                        levels = c("Low income", "Lower middle income", "Upper middle income", "High income"))
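One detail of the factor step is worth a quick illustration: `factor()` silently converts any value not listed in `levels` to `NA`, which is why the "Not classified" rows are filtered out first. A minimal sketch on made-up labels (the vector `income_raw` is hypothetical and not taken from the dataset):

```r
# Hypothetical income labels, including one that is not in the level list
income_raw <- c("High income", "Low income", "Not classified", "Upper middle income")

income_levels <- c("Low income", "Lower middle income",
                   "Upper middle income", "High income")

income_fct <- factor(income_raw, levels = income_levels)

# Values outside `levels` become NA rather than raising an error
sum(is.na(income_fct))

# Levels keep the order given, so plots run from low to high income
levels(income_fct)
```

Counting `NA` values after the conversion is a cheap way to catch a misspelled or unexpected income label.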
p1 <- ggplot(df2023, aes(x = overall_score)) +
  geom_histogram(binwidth = 5, fill = "#2196F3", color = "white", alpha = 0.8) +
  labs(
    title = "Distribution of Overall Statistical Performance Score (2023)",
    x = "Overall Score (0 to 100)",
    y = "Number of Countries"
  ) +
  theme_minimal()
p1
ggsave("plot1_histogram.png", plot = p1, width = 8, height = 5)
Interpretation: The overall score ranges from about 28 to 95 across countries. Most countries fall in the mid-to-high range (60 to 90), with noticeable variation. This wide spread confirms that there is a real difference in statistical performance across countries that is worth explaining.
p2 <- ggplot(df2023, aes(x = income, y = overall_score, fill = income)) +
  geom_boxplot(alpha = 0.7) +
  labs(
    title = "Overall Score by Income Group (2023)",
    x = "Income Group",
    y = "Overall Score (0 to 100)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
p2
ggsave("plot2_boxplot.png", plot = p2, width = 8, height = 5)
Interpretation: High-income countries score around 81 on average, compared to around 56 for low-income countries, a gap of about 25 points. The upward trend holds across all four income groups: as income increases, so does statistical performance. This pattern motivates the hypothesis test in the analysis section.
p3 <- ggplot(df2023, aes(x = data_use_score, y = overall_score)) +
  geom_point(alpha = 0.5, color = "#2196F3") +
  geom_smooth(method = "lm", color = "darkred", se = TRUE) +
  labs(
    title = "Data Use Score vs Overall Score (2023)",
    x = "Data Use Score",
    y = "Overall Score (0 to 100)"
  ) +
  theme_minimal()
p3
ggsave("plot3_scatter.png", plot = p3, width = 8, height = 5)
Interpretation: There is a clear positive relationship between data use score and overall score. Countries that score higher on data use tend to have higher overall statistical performance. The red line shows the linear trend, and the shaded area shows the confidence interval. This linear pattern supports the use of regression in the analysis section.
Before running the analysis, it is important to state the assumptions made:
Using 2023 only: The dataset covers 20 years, but earlier years have many missing values. Using 2023 gives the most complete and consistent picture. This makes the analysis simpler and more reliable.
Linear relationships: Plot 3 shows a roughly linear relationship between data use score and overall score, which supports the use of linear regression. No equivalent plot was made for data products score, so linearity is assumed for that predictor.
Missing values: Rows with missing values in key variables were removed. This is acceptable because the missing data is spread across different regions and income groups, so it is unlikely to create a strong bias.
Independence of observations: Each row represents one country in one year. Countries are treated as independent observations in this cross-sectional analysis.
No causation: This analysis shows associations between variables. It does not prove that one variable causes another. Other factors such as government investment or international support may also explain the patterns.
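The missing-values assumption above can be checked rather than taken on trust. The sketch below shows one way to tabulate the share of missing outcomes per income group; the data frame `toy` is a hypothetical stand-in with invented values, not the SPI data.

```r
# Hypothetical toy data standing in for the SPI extract
toy <- data.frame(
  income = c("Low income", "Low income", "High income",
             "High income", "Upper middle income", "Upper middle income"),
  overall_score = c(55, NA, 80, 85, NA, 70)
)

# Share of rows with a missing outcome in each income group
missing_share <- tapply(is.na(toy$overall_score), toy$income, mean)
missing_share
```

If one group showed a much larger share than the others, dropping incomplete rows could bias the comparison between groups. Running the same tabulation on the unfiltered data, before the `is.na()` filter, would show whether that is a concern here.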
Question: Do high-income countries have a significantly higher overall score than low-income countries?
# Separate the two groups
high_income <- df2023 %>% filter(income == "High income") %>% pull(overall_score)
low_income <- df2023 %>% filter(income == "Low income") %>% pull(overall_score)
# Run a one-sided Welch two-sample t-test (H1: high-income mean > low-income mean)
t.test(high_income, low_income, alternative = "greater")
##
## Welch Two Sample t-test
##
## data: high_income and low_income
## t = 7.5645, df = 53.091, p-value = 2.76e-10
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 19.3647 Inf
## sample estimates:
## mean of x mean of y
## 81.23470 56.36651
Interpretation: The p-value is 2.76e-10, which is much smaller than 0.05. This means we reject the null hypothesis and conclude that high-income countries have a significantly higher overall score than low-income countries. The average score for high-income countries is 81.2, compared to 56.4 for low-income countries, a difference of about 25 points. This result connects directly to the main objective: income level is strongly associated with statistical performance.
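As a sanity check, the reported confidence bound and p-value can be recovered from the other numbers in the output. This is a sketch that uses only the printed values, so small rounding differences are expected; the variable names are illustrative.

```r
# Values copied from the t-test output above
mean_high <- 81.23470
mean_low  <- 56.36651
t_stat    <- 7.5645
df_welch  <- 53.091

# Difference in means and the standard error implied by t = diff / SE
diff_means <- mean_high - mean_low
se_diff    <- diff_means / t_stat

# One-sided 95% lower confidence bound: diff - t_crit * SE
lower_bound <- diff_means - qt(0.95, df_welch) * se_diff

# One-sided p-value for H1: high-income mean > low-income mean
p_value <- pt(t_stat, df_welch, lower.tail = FALSE)

round(c(diff = diff_means, se = se_diff, lower = lower_bound), 2)
p_value
```

The lower bound should land close to the 19.36 shown in the output. Because the test is one-sided, the interval has no upper limit, which is why R prints `Inf`.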
Step 1 – Simple Model
We start with one predictor to keep the interpretation clear.
# Simple linear regression: one predictor
model1 <- lm(overall_score ~ data_use_score, data = df2023)
summary(model1)
##
## Call:
## lm(formula = overall_score ~ data_use_score, data = df2023)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.1424 -6.6160 0.6688 6.9058 20.5067
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.45691 2.79396 3.027 0.00283 **
## data_use_score 0.75100 0.03324 22.590 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.73 on 184 degrees of freedom
## Multiple R-squared: 0.735, Adjusted R-squared: 0.7336
## F-statistic: 510.3 on 1 and 184 DF, p-value: < 2.2e-16
Interpretation: A 1-point increase in data use score is associated with a 0.75-point increase in overall score. The R-squared value of 0.735 means that data use alone explains about 73.5% of the variation in overall statistical performance across countries. This is a strong result for a single predictor.
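To make the slope concrete, the fitted line can be evaluated by hand. The sketch below plugs the printed coefficients into the regression equation for two illustrative data use scores (50 and 80, chosen arbitrarily), then confirms that the reported F-statistic is consistent with the R-squared. The helper `predict_overall` is a hypothetical name.

```r
# Coefficients copied from the model1 summary above
intercept <- 8.45691
slope     <- 0.75100

# Fitted overall score for a given data use score
predict_overall <- function(data_use_score) {
  intercept + slope * data_use_score
}

fitted_scores <- predict_overall(c(50, 80))
fitted_scores

# A 30-point gap in data use corresponds to 30 * slope points of overall score
diff(fitted_scores)

# Consistency check for a one-predictor model:
# F = R^2 / (1 - R^2) * residual degrees of freedom
r2     <- 0.735
f_stat <- r2 / (1 - r2) * 184
f_stat
```

With the fitted object available, `predict(model1, newdata = data.frame(data_use_score = c(50, 80)))` gives the same fitted values without copying coefficients by hand.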
Step 2 – Extended Model
We add a second predictor to see whether data products score also helps explain overall performance.
# Extended model: two predictors
model2 <- lm(overall_score ~ data_use_score + data_products_score, data = df2023)
summary(model2)
##
## Call:
## lm(formula = overall_score ~ data_use_score + data_products_score,
## data = df2023)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.9811 -5.7568 0.6582 5.8016 18.1231
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -20.68848 4.11925 -5.022 1.21e-06 ***
## data_use_score 0.48687 0.04153 11.722 < 2e-16 ***
## data_products_score 0.66491 0.07700 8.635 2.83e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.379 on 183 degrees of freedom
## Multiple R-squared: 0.8117, Adjusted R-squared: 0.8097
## F-statistic: 394.5 on 2 and 183 DF, p-value: < 2.2e-16
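Interpretation: Adding data products score increases the R-squared value to 0.812, meaning the two predictors together explain about 81.2% of the variation in overall score, an improvement of nearly 8 percentage points over the simple model. Holding data products score constant, a 1-point increase in data use score is associated with a 0.49-point increase in overall score; holding data use score constant, a 1-point increase in data products score is associated with a 0.66-point increase. The data use coefficient is smaller than the 0.75 estimated in the simple model, which suggests the two predictors are correlated with each other. Both predictors show a positive and statistically significant association with overall performance. The focus here is on general patterns, not causal claims.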
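Whether the gain in R-squared is more than noise can be checked with a partial F-test for the added predictor. The sketch below computes it from the printed summary values only, so the result is approximate; the variable names are illustrative.

```r
# Values copied from the two model summaries above
r2_simple   <- 0.735    # model1: data_use_score only
r2_extended <- 0.8117   # model2: adds data_products_score
df_resid    <- 183      # residual degrees of freedom of model2
n_added     <- 1        # number of predictors added

# Partial F statistic for the added predictor
f_partial <- ((r2_extended - r2_simple) / n_added) /
  ((1 - r2_extended) / df_resid)

# With one added predictor, F equals the square of its t value
t_products <- 8.635
c(f_partial = f_partial, t_squared = t_products^2)

# p-value of the partial F test
p_partial <- pf(f_partial, n_added, df_resid, lower.tail = FALSE)
p_partial
```

The two statistics should agree up to rounding. With the fitted objects available, `anova(model1, model2)` runs the same nested-model comparison directly.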
Based on this analysis, the following patterns were found:
Recommendation for international organizations:
When supporting countries with weak statistical systems, data use capacity should be the first priority. Countries that actively use data in policy and planning tend to perform better overall. Investment in data use training, open data policies, and data-driven governance is therefore a plausible starting point, although this analysis shows association rather than causation.
A second area to consider is data products – the quality and availability of statistical outputs. Countries that produce better data products also tend to score higher overall.
Limitations:
Data source: World Bank Statistical Performance Indicators (SPI), 2023.