This notebook continues the analysis of the World Bank Statistical Performance Indicators (SPI) dataset, a longitudinal country-level dataset covering 217 countries from 2004 to 2023. Each row represents one country-year observation and includes multiple measures of statistical capacity, such as data use, data products, and infrastructure.
Building on the hypothesis testing work from Week 7, this analysis
shifts toward modeling. Two questions are addressed. First, does income
group explain differences in overall statistical performance across
countries? This is tested using a one-way ANOVA. Second, can a single
continuous sub-indicator, data_products_score, predict a
country’s overall performance? This is explored through simple linear
regression.
These two approaches move beyond comparing group means toward understanding the structure of the relationship between statistical capacity indicators.
As in the previous analysis, the dataset is filtered to the most recent year (2023) to ensure that each country appears only once. Because the same country is recorded multiple times across years, using all years would violate the independence assumption required for both ANOVA and regression.
# Load libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
# Load dataset
dataset <- read_csv("dataset.csv")
## Rows: 4340 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): iso3c, country, region, income
## dbl (8): year, population, overall_score, data_use_score, data_services_scor...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Filter to 2023 and remove "Not classified" income group
data_2023 <- dataset %>%
filter(year == 2023) %>%
filter(income != "Not classified")
cat("Rows in 2023 snapshot:", nrow(data_2023), "\n")
## Rows in 2023 snapshot: 216
cat("Unique countries:", n_distinct(data_2023$country), "\n")
## Unique countries: 216
# Confirm group sizes
table(data_2023$income)
##
## High income Low income Lower middle income Upper middle income
## 85 26 51 54
The 2023 snapshot contains 216 countries across four income groups: High income (n = 85), Upper middle income (n = 54), Lower middle income (n = 51), and Low income (n = 26). These are sufficient group sizes for ANOVA.
Before running any tests, descriptive statistics for
overall_score across income groups ground the analysis and
set expectations for what the formal test should find.
data_2023 %>%
filter(!is.na(overall_score)) %>%
group_by(income) %>%
summarise(
n = n(),
mean = round(mean(overall_score), 2),
sd = round(sd(overall_score), 2),
min = round(min(overall_score), 2),
max = round(max(overall_score), 2)
) %>%
arrange(desc(mean))
## # A tibble: 4 × 6
## income n mean sd min max
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 High income 60 81.2 15.9 42.2 95.3
## 2 Upper middle income 52 69.2 16.1 30.7 91.6
## 3 Lower middle income 50 63.5 12.0 36.5 85.2
## 4 Low income 24 56.4 12.6 27.5 74.3
Key observations:
The response variable for this analysis is
overall_score, a composite index that aggregates a
country’s performance across all dimensions of its statistical system,
including data use, data products, data services, data sources, and data
infrastructure. This is the most meaningful single measure of a
country’s statistical capacity, and it is the quantity most relevant to
policymakers and development organizations assessing where support is
needed.
The categorical explanatory variable is income, which
classifies each country into one of four World Bank income groups: Low
income, Lower middle income, Upper middle income, and High income. This
grouping reflects structural economic differences between countries that
may correspond to different levels of investment in national statistical
systems.
\[H_0: \mu_{\text{Low}} = \mu_{\text{Lower middle}} = \mu_{\text{Upper middle}} = \mu_{\text{High}}\]
\[H_A: \text{At least one group mean differs}\]
The null hypothesis states that the average
overall_score is identical across all four income groups.
The alternative hypothesis states that at least one income group has a
different mean overall score. This is a one-way ANOVA
because there is a single categorical predictor with more than two
levels.
data_2023 %>%
filter(!is.na(overall_score)) %>%
mutate(income = factor(income,
levels = c("Low income", "Lower middle income",
"Upper middle income", "High income"))) %>%
ggplot(aes(x = income, y = overall_score, fill = income)) +
geom_boxplot(alpha = 0.7, outlier.shape = 21, outlier.size = 2) +
stat_summary(fun = mean, geom = "point", shape = 18,
size = 4, color = "darkred",
position = position_nudge(x = 0.25)) +
scale_fill_manual(values = c(
"Low income" = "#FF7043",
"Lower middle income" = "#FFA726",
"Upper middle income" = "#42A5F5",
"High income" = "#26A69A"
)) +
labs(
title = "Overall Statistical Performance Score by Income Group (2023)",
subtitle = "Red diamond = group mean",
x = "Income Group",
y = "Overall Score (0–100)",
fill = "Income Group"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none")
The boxplot shows a clear ascending pattern in median
overall_score as income group increases. The median and
mean (red diamond) both rise consistently from Low income to High
income. There is visible overlap between adjacent groups, particularly
between Low income and Lower middle income, and between Upper middle and
High income, which motivates the formal test and post-hoc
comparisons.
# Fit one-way ANOVA model
anova_model <- aov(overall_score ~ income, data = filter(data_2023, !is.na(overall_score)))
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## income 3 14170 4723 22.18 2.74e-12 ***
## Residuals 182 38749 213
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpreting the output:
The ANOVA table reports an F-statistic of 22.18 with 3 degrees of freedom between groups and 182 degrees of freedom within groups. The corresponding p-value is approximately 2.7 × 10⁻¹², which is far below any conventional significance threshold (α = 0.05).
This means: assuming that income group has no effect on
overall_score, the probability of observing an F-statistic
this large by chance alone is essentially zero. We therefore
reject the null hypothesis and conclude that the mean
overall score is not the same across all income groups, at least one
group differs significantly from the others.
Because ANOVA only tells us that some difference exists, a Tukey Honest Significant Difference (HSD) test is used to identify which specific pairs of income groups differ significantly. Tukey HSD adjusts for multiple comparisons, controlling the family-wise error rate at 5%.
TukeyHSD(anova_model)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = overall_score ~ income, data = filter(data_2023, !is.na(overall_score)))
##
## $income
## diff lwr upr
## Low income-High income -24.868191 -34.005981 -15.730401
## Lower middle income-High income -17.720876 -24.965545 -10.476208
## Upper middle income-High income -12.047586 -19.215858 -4.879314
## Lower middle income-Low income 7.147315 -2.247929 16.542558
## Upper middle income-Low income 12.820605 3.484144 22.157066
## Upper middle income-Lower middle income 5.673290 -1.820408 13.166989
## p adj
## Low income-High income 0.0000000
## Lower middle income-High income 0.0000000
## Upper middle income-High income 0.0001281
## Lower middle income-Low income 0.2022206
## Upper middle income-Low income 0.0026408
## Upper middle income-Lower middle income 0.2059216
Key findings from the Tukey HSD:
| Comparison | Mean Difference | Significant? |
|---|---|---|
| High income vs Low income | +24.87 points | Yes |
| High income vs Lower middle income | +17.72 points | Yes |
| High income vs Upper middle income | +12.05 points | Yes |
| Upper middle income vs Low income | +12.82 points | Yes |
| Lower middle income vs Low income | +7.15 points | No |
| Upper middle income vs Lower middle income | +5.67 points | No |
The largest and most statistically clear gap is between High income and Low income countries (~25 points). High income countries differ significantly from all other groups. However, adjacent income groups, Low vs. Lower middle, and Lower middle vs. Upper middle, do not differ significantly from each other after adjusting for multiple comparisons.
There is strong evidence that income group is associated with differences in overall statistical performance. In practical terms, this means it would not be safe to assume that all income groups perform at the same level. High income countries perform substantially and significantly better than all other groups, while the three lower income tiers form a more graduated pattern where only the widest gaps reach statistical significance.
For development organizations, this suggests that targeting the High income vs. Low income gap should remain a priority. However, the lack of significant differences between adjacent groups implies that interventions may not need to distinguish sharply between Low and Lower middle income countries, they face comparably constrained statistical ecosystems. Further questions worth investigating include whether this gap has widened or narrowed over time, and which specific sub-dimensions of statistical capacity drive the High income advantage.
The response variable remains overall_score. The
continuous predictor selected is data_products_score, which
measures the quality and availability of data products that a country’s
statistical system produces, such as national accounts, household
surveys, and administrative data. This sub-indicator was chosen because
it is one of the most complete columns in the 2023 data (255 missing
values across all years, compared to over 2,800 for other sub-scores),
and it has a strong conceptual connection to overall statistical
performance: countries that produce high-quality data products are
likely to score well on the composite index too.
Before fitting a regression model, the relationship between the two variables should be roughly linear. A scatter plot is used to verify this.
data_2023 %>%
filter(!is.na(overall_score), !is.na(data_products_score)) %>%
ggplot(aes(x = data_products_score, y = overall_score)) +
geom_point(alpha = 0.6, color = "#2C7FB8", size = 2.5) +
geom_smooth(method = "lm", se = TRUE, color = "#D95F02", linewidth = 1) +
labs(
title = "Overall Score vs. Data Products Score (2023)",
subtitle = "Fitted regression line with 95% confidence band",
x = "Data Products Score (0–100)",
y = "Overall Score (0–100)"
) +
theme_minimal(base_size = 13)
## `geom_smooth()` using formula = 'y ~ x'
The scatter plot shows a clear positive linear relationship between
data_products_score and overall_score.
Countries with higher data products scores consistently tend to have
higher overall scores. The fitted regression line tracks the data well,
with no obvious curvature, which confirms that a linear model is
appropriate here.
# Fit simple linear regression
lm_model <- lm(overall_score ~ data_products_score,
data = filter(data_2023, !is.na(overall_score), !is.na(data_products_score)))
summary(lm_model)
##
## Call:
## lm(formula = overall_score ~ data_products_score, data = filter(data_2023,
## !is.na(overall_score), !is.na(data_products_score)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.0225 -7.2264 -0.4948 6.7765 27.7395
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -31.59587 5.29525 -5.967 1.22e-08 ***
## data_products_score 1.32961 0.06874 19.343 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.737 on 184 degrees of freedom
## Multiple R-squared: 0.6703, Adjusted R-squared: 0.6685
## F-statistic: 374.1 on 1 and 184 DF, p-value: < 2.2e-16
The regression equation is:
\[\widehat{\text{Overall Score}} = -31.60 + 1.33 \times \text{Data Products Score}\]
Intercept (−31.60): This is the model’s predicted
overall_score when data_products_score equals
zero. A score of zero on data products has no real-world meaning for any
observed country (the actual range is 45.2 to 94.3), so the intercept
should not be interpreted literally, it is a mathematical anchor for the
line.
Slope (1.33): For every 1-point
increase in data_products_score, the model
predicts an average increase of 1.33 points in
overall_score. This is greater than a one-for-one
relationship, meaning improvements in data products performance are
associated with disproportionately larger gains in overall statistical
performance. A country that improves its data products score by 10
points would be expected to see its overall score rise by approximately
13 points.
Both coefficients are highly statistically significant (p < 0.001), meaning the probability of observing slopes this large by chance, if there were truly no relationship, is negligible.
The model’s R² = 0.67, meaning that
data_products_score alone explains approximately
67% of the variation in overall_score
across countries. This is a strong fit for a simple one-predictor model,
particularly given that overall_score is a composite of
five sub-dimensions, only one of which is being used as a predictor
here.
The remaining 33% of variation is unexplained, likely attributable to the other sub-indicators (data use, services, sources, infrastructure) and country-level factors not captured in this model.
The strong linear relationship between
data_products_score and overall_score suggests
that a country’s capacity to produce high-quality statistical outputs,
censuses, surveys, national accounts, is one of the most important
single drivers of overall statistical system performance. For
policymakers and development funders, this implies that investments
specifically targeted at improving data product quality and availability
may yield broad-based gains across the entire statistical system, not
just in one sub-dimension.
However, since the model explains 67% of the variation, it would be an oversimplification to treat data products as the only lever worth pulling. The remaining unexplained variation suggests that other components of statistical capacity, particularly data services and infrastructure, also play meaningful independent roles. Further investigation could explore a multiple regression model incorporating all five sub-scores to understand their relative contributions.
This analysis extends the Week 7 findings in two ways. The ANOVA confirms what the t-test suggested: income group is strongly associated with overall statistical performance. But by including all four income tiers, it becomes clear that the relationship is not simply binary, the gradient from Low to High income is real, though the steps between adjacent groups are not always statistically distinguishable. The Tukey HSD results are especially informative here, showing that the High income group stands apart while the lower three tiers form a more compressed cluster.
The regression analysis adds a different kind of insight: within the 2023 cross-section, roughly two-thirds of the country-to-country variation in overall performance can be predicted from a single sub-indicator. This points to the coherence of the SPI framework, countries that do well on one dimension of statistical capacity tend to do well overall, and the relationship is approximately linear and strong.
Together, these results suggest that statistical performance is structured and predictable, which is encouraging for planning purposes. Countries do not perform randomly, performance tracks with both economic resources (income group) and specific competencies (data products quality).
overall_score. If missingness is
non-random (e.g., the weakest performers are least likely to be scored),
estimates could be biased.overall_score changed
from 2016 to 2023? A mixed-effects model could answer this while
accounting for repeated measurements per country.overall_score than their data_products_score
would predict? These residual outliers may reveal cases where other
sub-dimensions are either compensating for or dragging down overall
performance.