This notebook continues the analysis of the World Bank Statistical Performance Indicators (SPI) dataset, a longitudinal country-level dataset covering 217 countries from 2004 to 2023. Each row represents one country-year observation and includes multiple measures of statistical capacity, such as data use, production, and infrastructure.
This week, hypothesis testing is used to examine whether meaningful differences in statistical performance exist between income groups. Specifically, A/B testing compares High income countries (Group A) and Low income countries (Group B) across two performance indicators: overall_score and data_use_score.
Two hypothesis testing frameworks are applied. Hypothesis 1 uses the Neyman–Pearson framework, which involves pre-specified error rates, power analysis, and a reject or fail-to-reject decision. Hypothesis 2 uses Fisher’s significance testing framework, which focuses on interpreting the p-value and assessing the strength of evidence against the null hypothesis.
Understanding the relationship between income level and statistical capacity has policy relevance, as it may inform decisions related to development funding, technical assistance, and governance priorities.
The dataset is longitudinal, meaning each country appears once per year from 2004 to 2023. Because the same country is recorded multiple times across years, the observations are not independent. However, t-tests require independent observations.
To satisfy this assumption, I restricted the analysis to a single cross-section: the most recent year, 2023. This ensures that each country appears only once in the dataset and that there are no repeated measurements.
#load libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(effsize)
#load dataset
dataset <- read_csv("dataset.csv")
## Rows: 4340 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): iso3c, country, region, income
## dbl (8): year, population, overall_score, data_use_score, data_services_scor...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Filter to 2023 only
data_2023 <- dataset %>% filter(year == 2023)
cat("Rows in 2023 snapshot:", nrow(data_2023), "\n")
## Rows in 2023 snapshot: 217
cat("Unique countries:", n_distinct(data_2023$country), "\n")
## Unique countries: 217
For our A/B testing framework, we define two groups:
- Group A: High income countries
- Group B: Low income countries
We exclude “Upper middle income”, “Lower middle income”, and “Not classified” to create a clean contrast between the two extremes of the income spectrum. This maximizes the conceptual interpretability of any observed difference.
filtered_data <- data_2023 %>%
filter(income %in% c("High income", "Low income"))
# Confirm group sizes
table(filtered_data$income)
##
## High income Low income
## 85 26
Before running any tests, we examine the descriptive statistics for both outcome variables across the two income groups. This grounds the later analysis and provides the standard deviations needed for the power calculation.
filtered_data %>%
group_by(income) %>%
summarise(
n = n(),
mean_overall = round(mean(overall_score, na.rm = TRUE), 2),
sd_overall = round(sd(overall_score, na.rm = TRUE), 2),
mean_data_use = round(mean(data_use_score, na.rm = TRUE), 2),
sd_data_use = round(sd(data_use_score, na.rm = TRUE), 2)
)
## # A tibble: 2 × 6
## income n mean_overall sd_overall mean_data_use sd_data_use
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 High income 85 81.2 15.9 76.3 25.4
## 2 Low income 26 56.4 12.6 71.8 24.3
Key observations:
- High income countries have a substantially higher overall_score (mean ≈ 81.2) compared to Low income countries (mean ≈ 56.4), a gap of roughly 25 points.
- The gap in data_use_score is far more modest (mean ≈ 76.3 vs 71.8), approximately 4.5 points.
Do High income countries have a significantly higher overall_score than Low income countries in 2023?
The overall_score is a composite index of a country’s
statistical capacity across all dimensions. A meaningful difference here
would indicate that income level is associated with a country’s ability
to produce, use, and disseminate high-quality statistical data, a
question directly relevant to development policy and international
technical assistance.
This is formulated as a two-sided test, since there is no formal reason to rule out that Low income countries could outperform High income ones in some sub-groups:
\[H_0: \mu_{\text{High}} - \mu_{\text{Low}} = 0\] \[H_1: \mu_{\text{High}} - \mu_{\text{Low}} \neq 0\]
where \(\mu\) denotes the population
mean overall_score for each group.
The Neyman–Pearson framework requires specifying three parameters before running the test. These choices are grounded in the practical context of the analysis, not selected arbitrarily.
Significance Level: α = 0.05
A Type I error here means concluding that income level affects statistical capacity when it actually does not. In a policy context, this could lead to misallocating development resources toward countries that do not need them, while neglecting those that do. A 5% false alarm rate is acceptable here because this is exploratory analysis and findings can be verified before acting on them.
Power: 1 − β = 0.80
A Type II error means failing to detect a real difference between income groups. Missing this signal could lead policymakers to withhold technical assistance that is genuinely needed. A power of 80% is a widely accepted standard for exploratory social science research.
Minimum Meaningful Effect Size: Δ = 10 points
The observed difference is approximately 25 points, but it should not be used as Δ, because that would be circular reasoning. Instead, we ask: what is the smallest difference in overall_score that would actually matter for policy?
A 10-point gap on a 0–100 scale represents a meaningful, operationally significant difference in a country’s statistical infrastructure. Differences smaller than 10 points would be difficult to act upon with current interventions. The pooled standard deviation from the sample (~15 points) is used as an estimate of population variability.
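As a rough cross-check on these inputs, the per-group sample size can be approximated with the standard normal-theory formula. This is only a sketch: it assumes equal group sizes, a two-sided test, and σ ≈ 15 as stated above.

\[ n \approx \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2\,\sigma^2}{\Delta^2} = \frac{2\,(1.960 + 0.842)^2 \times 15^2}{10^2} \approx 35.3 \]

A t-based calculation, such as the one power.t.test() performs, gives a slightly larger figure because it accounts for the variance being estimated from the sample.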
# Pooled SD from descriptive stats
h1_data <- filtered_data %>% filter(!is.na(overall_score))
n_hi <- sum(h1_data$income == "High income")
n_lo <- sum(h1_data$income == "Low income")
sd_hi <- sd(h1_data$overall_score[h1_data$income == "High income"])
sd_lo <- sd(h1_data$overall_score[h1_data$income == "Low income"])
pooled_sd <- sqrt(((n_hi - 1) * sd_hi^2 + (n_lo - 1) * sd_lo^2) /
(n_hi + n_lo - 2))
cat("Pooled SD:", round(pooled_sd, 2), "\n")
## Pooled SD: 15.02
# Sample size calculation using base R power.t.test
power_result <- power.t.test(
delta = 10, # minimum meaningful difference
sd = pooled_sd, # estimated pooled SD
sig.level = 0.05,
power = 0.80,
type = "two.sample",
alternative = "two.sided"
)
print(power_result)
##
## Two-sample t test power calculation
##
## n = 36.39359
## delta = 10
## sd = 15.01866
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
required_n <- ceiling(power_result$n)
cat("Required n per group:", required_n, "\n")
## Required n per group: 37
cat("Actual n — High income:", n_hi, "\n")
## Actual n — High income: 60
cat("Actual n — Low income: ", n_lo, "\n")
## Actual n — Low income: 24
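Interpretation: The power analysis indicates that 37 observations per group are required to detect a 10-point difference with 80% power at α = 0.05. The High income group (n = 60) exceeds this threshold, but the Low income group (n = 24) falls short of it. With these unequal group sizes, the test has somewhat less than the planned 80% power to detect a 10-point difference, so a non-significant result would need to be interpreted cautiously. We proceed with that caveat in mind.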
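Because power.t.test() assumes equal group sizes, the power actually achieved with 60 and 24 observations can be approximated separately. The sketch below uses the noncentral t distribution for a pooled-variance two-sample test. `power_unequal_n` is a hypothetical helper introduced for illustration, and the inputs are the group sizes and pooled SD reported in the output above.

```r
# Hypothetical helper (not part of the original notebook): power of a
# two-sided, pooled-variance two-sample t-test with unequal group sizes.
power_unequal_n <- function(n1, n2, delta, sd, sig.level = 0.05) {
  df     <- n1 + n2 - 2
  ncp    <- delta / (sd * sqrt(1 / n1 + 1 / n2))  # noncentrality parameter
  t_crit <- qt(1 - sig.level / 2, df)
  # Probability of falling in either rejection region when the true gap is delta
  pt(t_crit, df, ncp = ncp, lower.tail = FALSE) + pt(-t_crit, df, ncp = ncp)
}

# Group sizes and pooled SD as reported in the notebook output
power_actual <- power_unequal_n(n1 = 60, n2 = 24, delta = 10, sd = 15.02)
cat("Approximate power with n = 60 and n = 24:", round(power_actual, 3), "\n")
```

This approximation assumes equal variances, whereas the test actually run is Welch's, so the figure is indicative rather than exact.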
Welch’s two-sample t-test was used because group sizes are unequal (High = 60, Low = 24) and variance equality was not assumed, making it a more conservative and appropriate choice than Student’s t-test in this context.
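To make the Welch calculation concrete, the statistic and its Welch–Satterthwaite degrees of freedom can be sketched from group summaries alone. The inputs below are the rounded means and SDs from the descriptive table and the non-missing group sizes, so the results should land close to, but not exactly on, what t.test() reports from the raw data. The list names `hi` and `lo` are illustrative.

```r
# Rounded summary statistics from the descriptive table (illustrative inputs)
hi <- list(mean = 81.2, sd = 15.9, n = 60)
lo <- list(mean = 56.4, sd = 12.6, n = 24)

# Per-group variance of the sample mean
v_hi <- hi$sd^2 / hi$n
v_lo <- lo$sd^2 / lo$n

# Welch t statistic: difference in means over its unpooled standard error
t_stat <- (hi$mean - lo$mean) / sqrt(v_hi + v_lo)

# Welch–Satterthwaite approximation to the degrees of freedom
df_welch <- (v_hi + v_lo)^2 /
  (v_hi^2 / (hi$n - 1) + v_lo^2 / (lo$n - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df_welch)

cat("t =", round(t_stat, 2), " df =", round(df_welch, 1),
    " p =", signif(p_value, 3), "\n")
```

Unlike Student's test, the standard error here never pools the two variances, which is why the degrees of freedom fall well below n_hi + n_lo - 2 = 82.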
h1_result <- t.test(overall_score ~ income,
data = h1_data,
var.equal = FALSE,
alternative = "two.sided")
print(h1_result)
##
## Welch Two Sample t-test
##
## data: overall_score by income
## t = 7.5645, df = 53.091, p-value = 5.521e-10
## alternative hypothesis: true difference in means between group High income and group Low income is not equal to 0
## 95 percent confidence interval:
## 18.27456 31.46182
## sample estimates:
## mean in group High income mean in group Low income
## 81.23470 56.36651
cohen.d(overall_score ~ income, data = h1_data)
##
## Cohen's d
##
## d estimate: 1.65582 (large)
## 95 percent confidence interval:
## lower upper
## 1.112284 2.199356
| Statistic | Value |
|---|---|
| t-statistic | 7.564 |
| Degrees of freedom | 53.1 |
| p-value | 5.52e-10 |
| 95% CI (difference) | [18.27, 31.46] |
| Cohen’s D | ≈ 1.66 (Large) |
Decision: The p-value is far below our α = 0.05
threshold. We reject the null hypothesis. There is
strong statistical evidence that High income countries have a
significantly higher overall_score than Low income
countries in 2023.
The Cohen’s D of approximately 1.66 shows that the difference is not just statistically significant but also large in practical terms.
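The effect size can be checked by hand, since Cohen's d is the difference in group means divided by the pooled standard deviation. The sketch below plugs in the means and pooled SD printed in the output above. `d_label` is a hypothetical helper that applies the conventional 0.2 / 0.5 / 0.8 cut-offs.

```r
# Values copied from the notebook output
mean_hi_reported   <- 81.23470
mean_lo_reported   <- 56.36651
pooled_sd_reported <- 15.01866

# Cohen's d: standardized mean difference
d_overall <- (mean_hi_reported - mean_lo_reported) / pooled_sd_reported

# Hypothetical helper: label |d| using Cohen's conventional thresholds
d_label <- function(d) {
  as.character(cut(abs(d),
                   breaks = c(0, 0.2, 0.5, 0.8, Inf),
                   labels = c("negligible", "small", "medium", "large"),
                   right  = FALSE))
}

cat("Cohen's d:", round(d_overall, 3), "(", d_label(d_overall), ")\n")
```

The thresholds are conventions rather than hard rules, so the label is best read alongside the raw point difference on the 0–100 scale.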
The boxplot clearly shows that High income countries cluster in the upper range of the score distribution (median ≈ 85), while Low income countries cluster considerably lower (median ≈ 55). The non-overlapping notches are consistent with a reliable difference in medians, although ggplot2 warns that a notch extends beyond the hinges, so the notches should be read as a rough guide only. Even the lowest-performing High income countries generally outperform the median Low income country.
h1_data %>%
ggplot(aes(x = income, y = overall_score, fill = income)) +
geom_boxplot(notch = TRUE, alpha = 0.7, outlier.shape = NA) +
geom_jitter(width = 0.15, alpha = 0.5, size = 2, color = "gray30") +
stat_summary(fun = mean, geom = "point", shape = 18,
size = 4, color = "darkred",
position = position_nudge(x = 0.25)) +
scale_fill_manual(values = c("High income" = "#2196F3",
"Low income" = "#FF7043")) +
labs(
title = "Overall Statistical Performance Score by Income Group (2023)",
subtitle = "Red diamond = group mean | Notch approximates 95% CI around median",
x = "Income Group",
y = "Overall Score (0–100)",
fill = "Income Group"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none")
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
Is there a difference in data_use_score between High income and Low income countries in 2023?
data_use_score measures how well countries apply
available statistical data for governance and policy, a distinct
dimension from the composite overall_score. Unlike
Hypothesis 1, no error rates are pre-specified. Instead, the p-value is
interpreted directly as a measure of evidence against the null
hypothesis.
\[H_0: \mu_{\text{High}} - \mu_{\text{Low}} = 0 \quad \text{(data\_use\_score is not affected by income group)}\]
\[H_1: \mu_{\text{High}} - \mu_{\text{Low}} \neq 0\]
Welch’s two-sample t-test was used here as well, because group sizes are unequal and variance equality was not assumed. This ensures the test is robust to differences in spread between the two groups.
h2_result <- t.test(data_use_score ~ income,
data = filtered_data,
var.equal = FALSE,
alternative = "two.sided")
print(h2_result)
##
## Welch Two Sample t-test
##
## data: data_use_score by income
## t = 0.82341, df = 43.069, p-value = 0.4148
## alternative hypothesis: true difference in means between group High income and group Low income is not equal to 0
## 95 percent confidence interval:
## -6.561846 15.618498
## sample estimates:
## mean in group High income mean in group Low income
## 76.31294 71.78462
cohen.d(data_use_score ~ income, data = filtered_data)
##
## Cohen's d
##
## d estimate: 0.1802127 (negligible)
## 95 percent confidence interval:
## lower upper
## -0.2646166 0.6250421
What does the p-value tell us?
The p-value is 0.4148. Under Fisher’s framework, this means: assuming that income group has no effect on data_use_score, the probability of observing a difference in sample means at least as large as the one we observed is approximately 41.5%. A result this common under the null hypothesis provides little to no evidence against it.
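The link between the test statistic, the p-value, and the confidence interval can be traced with a few lines of base R. This sketch takes the t statistic, degrees of freedom, and group means from the Welch output above. The variable names are illustrative.

```r
# Values copied from the Welch test output for data_use_score
t_reported  <- 0.82341
df_reported <- 43.069
diff_means  <- 76.31294 - 71.78462

# Two-sided p-value: probability of a |t| at least this large under H0
p_value <- 2 * pt(abs(t_reported), df_reported, lower.tail = FALSE)

# Standard error implied by the statistic, then the 95% confidence interval
se_diff <- diff_means / t_reported
ci_95   <- diff_means + c(-1, 1) * qt(0.975, df_reported) * se_diff

cat("p-value:", round(p_value, 4), "\n")
cat("95% CI :", round(ci_95, 2), "\n")
```

Because the interval is built from the same standard error as the test, it contains zero exactly when the two-sided p-value exceeds 0.05.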
Statistical vs. Practical Significance:
| Statistic | Value |
|---|---|
| t-statistic | 0.823 |
| Degrees of freedom | 43.1 |
| p-value | 0.4148 |
| 95% CI (difference) | [-6.56, 15.62] |
| Cohen’s D | ≈ 0.18 (Negligible) |
The Cohen’s D of approximately 0.18 falls below the threshold for a small effect by Cohen’s conventions (0.2) and is classed as negligible. The observed difference of about 4.5 points is therefore small in practical as well as statistical terms. The 95% confidence interval [-6.56, 15.62] includes zero, so even the direction of the difference is not established.
Why are we confident in the data?
- data_use_score has zero missing values in the 2023 cross-section, eliminating concerns about imputation or attrition bias.
Recommendation: There is insufficient evidence to conclude that data_use_score differs between High income and Low income countries in 2023. The result is inconclusive rather than proof of no difference: the confidence interval is wide (-6.56 to 15.62 points), so a moderate gap cannot be ruled out. On this evidence alone, income group is not a sound basis for targeting assistance aimed specifically at translating existing data into governance and policy outcomes.
The violin plot shows that both groups have a wide spread in data_use_score (SD ≈ 25.4 for High income and 24.3 for Low income) and that the two distributions overlap heavily, with some low income countries scoring comparably to high income ones. This overlap suggests that some low income countries have made notable progress in data utilization despite resource constraints, a finding worth investigating further.
filtered_data %>%
ggplot(aes(x = income, y = data_use_score, fill = income)) +
geom_violin(alpha = 0.5, trim = FALSE) +
geom_boxplot(width = 0.2, outlier.shape = NA, alpha = 0.8) +
geom_jitter(width = 0.08, alpha = 0.4, size = 2, color = "gray30") +
stat_summary(fun = mean, geom = "point", shape = 18,
size = 4, color = "darkred",
position = position_nudge(x = 0.15)) +
scale_fill_manual(values = c("High income" = "#2196F3",
"Low income" = "#FF7043")) +
labs(
title = "Data Use Score by Income Group (2023)",
subtitle = "Violin + boxplot | Red diamond = group mean",
x = "Income Group",
y = "Data Use Score (0–100)",
fill = "Income Group"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none")
The two frameworks complement each other and reflect different philosophies of inference.
The Neyman–Pearson framework (Hypothesis 1) is oriented toward decision-making under uncertainty. By specifying α, power, and Δ in advance, we commit to a decision rule before seeing the data. The result is a binary action: reject or fail to reject. This is most useful when the cost of errors can be explicitly reasoned about, such as in policy interventions with budget implications.
Fisher’s Significance Testing (Hypothesis 2) is oriented toward measuring evidence. The p-value is treated as a continuous index of how surprising our data would be under H₀, not a threshold that triggers a binary conclusion. This encourages more nuanced interpretation, including consideration of effect size, confidence intervals, and data quality.
In this analysis, the two tests pointed in different directions. The Neyman–Pearson test on overall_score yielded a very small p-value (5.5e-10) and a large effect size (d ≈ 1.66), leading to a clear rejection of the null hypothesis. The Fisher test on data_use_score yielded p = 0.41 and a negligible effect size (d ≈ 0.18), providing little evidence of any difference. The contrast shows that income group is strongly associated with overall statistical capacity but not detectably with data utilization.
This analysis finds that income level is strongly associated with overall statistical performance among countries in 2023, but not detectably with data use.
High income countries score approximately 25 points higher than Low income countries on the composite overall_score, a large, statistically significant difference that far exceeds the minimum meaningful threshold of 10 points. On data_use_score, High income countries score only about 4.5 points higher, a negligible effect (d ≈ 0.18) whose 95% confidence interval includes zero.
Together, these findings suggest that the gap between rich and poor countries lies mainly in dimensions of statistical capacity other than data use. For development organizations, this implies that capacity-building efforts may be better directed at the other components of the composite score than at data use specifically, although the wide confidence interval for data_use_score means that a moderate gap in data use cannot be ruled out.
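Missing data: 27 of the 111 countries in the 2023 sample (25 High income, 2 Low income) have no overall_score and were excluded from Hypothesis 1. If missingness is non-random, estimates may be biased.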
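One way to probe this concern is to check whether missingness in overall_score differs by income group. The sketch below rebuilds the 2×2 table from counts in the output above (85 and 26 countries in the sample, 60 and 24 with a non-missing score). Fisher's exact test is an illustrative choice here, not part of the original analysis.

```r
# Counts derived from the notebook output: group totals minus non-missing n
miss_tab <- matrix(
  c(85 - 60, 60,
    26 - 24, 24),
  nrow = 2, byrow = TRUE,
  dimnames = list(income        = c("High income", "Low income"),
                  overall_score = c("missing", "observed"))
)

# Share of countries with a missing overall_score in each group
miss_rate <- miss_tab[, "missing"] / rowSums(miss_tab)
print(round(miss_rate, 3))

# Exact test of association between income group and missingness
miss_test <- fisher.test(miss_tab)
print(miss_test)
```

A clear difference in missing rates would not by itself show bias, but it would indicate that the 60 High income countries with scores may not represent the full group.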