Week 7 Data Dive — Hypothesis Testing

Introduction

This notebook continues the analysis of the World Bank Statistical Performance Indicators (SPI) dataset, a longitudinal country-level dataset covering 217 countries from 2004 to 2023. Each row represents one country-year observation and includes multiple measures of statistical capacity, such as data use, production, and infrastructure.

This week, hypothesis testing is used to examine whether meaningful differences in statistical performance exist between income groups. Specifically, an A/B-style comparison contrasts High income countries (Group A) with Low income countries (Group B) on two performance indicators.

Two hypothesis testing frameworks are applied. Hypothesis 1 uses the Neyman–Pearson framework, which involves pre-specified error rates, power analysis, and a reject or fail-to-reject decision. Hypothesis 2 uses Fisher’s significance testing framework, which focuses on interpreting the p-value and assessing the strength of evidence against the null hypothesis.

Understanding the relationship between income level and statistical capacity has policy relevance, as it may inform decisions related to development funding, technical assistance, and governance priorities.

Data Preparation

Filtering to 2023

The dataset is longitudinal, meaning each country appears once per year from 2004 to 2023. Because the same country is recorded multiple times across years, the observations are not independent. However, t-tests require independent observations.

To satisfy this assumption, I restricted the analysis to a single cross-section: the most recent year, 2023. This ensures that each country appears only once in the dataset and that there are no repeated measurements.

#load libraries
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)
library(effsize)

#load dataset
dataset <- read_csv("dataset.csv")

## Rows: 4340 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): iso3c, country, region, income
## dbl (8): year, population, overall_score, data_use_score, data_services_scor...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Filter to 2023 only
data_2023 <- dataset %>% filter(year == 2023)

cat("Rows in 2023 snapshot:", nrow(data_2023), "\n")

## Rows in 2023 snapshot: 217

cat("Unique countries:", n_distinct(data_2023$country), "\n")

## Unique countries: 217

Selecting Income Groups

For our A/B testing framework, we define two groups:

Group A: High income countries
Group B: Low income countries

We exclude “Upper middle income”, “Lower middle income”, and “Not classified” to create a clean contrast between the two extremes of the income spectrum. This maximizes the conceptual interpretability of any observed difference.

filtered_data <- data_2023 %>%
  filter(income %in% c("High income", "Low income"))

# Confirm group sizes
table(filtered_data$income)

## 
## High income  Low income 
##          85          26

Descriptive Summary

Before running any tests, we examine the descriptive statistics for both outcome variables across the two income groups. This grounds the later analysis and provides the standard deviations needed for the power calculation.

filtered_data %>%
  group_by(income) %>%
  summarise(
    n             = n(),
    mean_overall  = round(mean(overall_score,  na.rm = TRUE), 2),
    sd_overall    = round(sd(overall_score,    na.rm = TRUE), 2),
    mean_data_use = round(mean(data_use_score, na.rm = TRUE), 2),
    sd_data_use   = round(sd(data_use_score,   na.rm = TRUE), 2)
  )

## # A tibble: 2 × 6
##   income          n mean_overall sd_overall mean_data_use sd_data_use
##   <chr>       <int>        <dbl>      <dbl>         <dbl>       <dbl>
## 1 High income    85         81.2       15.9          76.3        25.4
## 2 Low income     26         56.4       12.6          71.8        24.3

Key observations:

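High income countries score notably higher on overall_score (mean ≈ 81.2) than Low income countries (mean ≈ 56.4), a gap of roughly 25 points.
The difference in data_use_score is far more modest (mean ≈ 76.3 vs 71.8), approximately 4.5 points.
Standard deviations are broadly comparable across groups for both variables (15.9 vs 12.6 for overall_score; 25.4 vs 24.3 for data_use_score), though we will use Welch’s t-test to remain conservative.
The n column counts every country in each group (85 and 26), including those with a missing overall_score; the overall_score statistics are computed on the non-missing cases only.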

Hypothesis 1 — Neyman–Pearson Framework

Question

Do High income countries have a significantly higher overall_score than Low income countries in 2023?

The overall_score is a composite index of a country’s statistical capacity across all dimensions. A meaningful difference here would indicate that income level is associated with a country’s ability to produce, use, and disseminate high-quality statistical data, a question directly relevant to development policy and international technical assistance.

Hypotheses

This is formulated as a two-sided test, since there is no formal reason to rule out that Low income countries could outperform High income ones in some sub-groups:

\[H_0: \mu_{\text{High}} - \mu_{\text{Low}} = 0\] \[H_1: \mu_{\text{High}} - \mu_{\text{Low}} \neq 0\]

where \(\mu\) denotes the population mean overall_score for each group.

Choosing Alpha, Power, and Effect Size

The Neyman–Pearson framework requires specifying three parameters before running the test. These choices are grounded in the practical context of the analysis, not selected arbitrarily.

Significance Level: α = 0.05

A Type I error here means concluding that income level affects statistical capacity when it actually does not. In a policy context, this could lead to misallocating development resources toward countries that do not need them, while neglecting those that do. A 5% false alarm rate is acceptable here because this is exploratory analysis and findings can be verified before acting on them.
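To make the meaning of α concrete, the following simulation is a minimal sketch and not part of the original analysis. It assumes normally distributed scores with a common mean and SD (hypothetical values of 70 and 15) and group sizes of 60 and 24, then counts how often Welch’s t-test rejects a null hypothesis that is true by construction.

```r
# Simulate the Type I error rate of Welch's t-test when H0 is true.
# All parameters are illustrative assumptions, not estimates from the SPI data.
set.seed(42)

alpha  <- 0.05
n_sims <- 2000
n_a    <- 60   # assumed size of group A
n_b    <- 24   # assumed size of group B

p_values <- replicate(n_sims, {
  a <- rnorm(n_a, mean = 70, sd = 15)  # both groups share the same mean,
  b <- rnorm(n_b, mean = 70, sd = 15)  # so any rejection is a false alarm
  t.test(a, b, var.equal = FALSE)$p.value
})

false_alarm_rate <- mean(p_values < alpha)
cat("Estimated Type I error rate:", round(false_alarm_rate, 3), "\n")
```

Under these assumptions the estimated rate should land close to the nominal 5%, which is the long-run false alarm rate that choosing α = 0.05 commits us to.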

Power: 1 − β = 0.80

A Type II error means failing to detect a real difference between income groups. Missing this signal could lead policymakers to withhold technical assistance that is genuinely needed. A power of 80% is a widely accepted standard for exploratory social science research.

Minimum Meaningful Effect Size: Δ = 10 points

The observed difference is approximately 25 points, but it should not be used as Δ, because that would be circular reasoning. Instead, we ask: what is the smallest difference in overall_score that would actually matter for policy?

A 10-point gap on a 0–100 scale represents a meaningful, operationally significant difference in a country’s statistical infrastructure. Differences smaller than 10 points would be difficult to act upon with current interventions. The pooled standard deviation from the sample (~15 points) is used as an estimate of population variability.
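Because the choice of Δ drives the required sample size, it can help to see how sensitive the calculation is to that choice. The sketch below uses base R’s power.t.test with the approximate pooled SD of 15 quoted above; the Δ values of 5 and 15 are hypothetical alternatives added for comparison, not thresholds considered in this analysis.

```r
# Required n per group for several candidate minimum effect sizes.
# sd = 15 is the approximate pooled SD quoted in the text; the deltas
# other than 10 are hypothetical alternatives for comparison.
candidate_deltas <- c(5, 10, 15)

n_needed <- sapply(candidate_deltas, function(delta) {
  result <- power.t.test(delta = delta, sd = 15, sig.level = 0.05,
                         power = 0.80, type = "two.sample",
                         alternative = "two.sided")
  ceiling(result$n)
})

sensitivity <- data.frame(delta = candidate_deltas, n_per_group = n_needed)
print(sensitivity)
```

Halving Δ roughly quadruples the required sample size, which is one reason to fix the threshold on substantive grounds before looking at the data.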

Sample Size Calculation

# Pooled SD from descriptive stats
h1_data <- filtered_data %>% filter(!is.na(overall_score))

n_hi <- sum(h1_data$income == "High income")
n_lo <- sum(h1_data$income == "Low income")
sd_hi <- sd(h1_data$overall_score[h1_data$income == "High income"])
sd_lo <- sd(h1_data$overall_score[h1_data$income == "Low income"])

pooled_sd <- sqrt(((n_hi - 1) * sd_hi^2 + (n_lo - 1) * sd_lo^2) /
                    (n_hi + n_lo - 2))

cat("Pooled SD:", round(pooled_sd, 2), "\n")

## Pooled SD: 15.02

# Sample size calculation using base R power.t.test
power_result <- power.t.test(
  delta      = 10,          # minimum meaningful difference
  sd         = pooled_sd,   # estimated pooled SD
  sig.level  = 0.05,
  power      = 0.80,
  type       = "two.sample",
  alternative = "two.sided"
)

print(power_result)

## 
##      Two-sample t test power calculation 
## 
##               n = 36.39359
##           delta = 10
##              sd = 15.01866
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

required_n <- ceiling(power_result$n)

cat("Required n per group:", required_n, "\n")

## Required n per group: 37

cat("Actual n — High income:", n_hi, "\n")

## Actual n — High income: 60

cat("Actual n — Low income: ", n_lo, "\n")

## Actual n — Low income:  24

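Interpretation: The power analysis indicates that 37 observations per group are required to detect a 10-point difference with 80% power at α = 0.05. The High income group (n = 60) exceeds this threshold, but the Low income group (n = 24) falls short of it. The test therefore has somewhat less than 80% power to detect a difference as small as 10 points, and a non-significant result would have to be interpreted with that in mind. We proceed with the test while noting this limitation.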
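power.t.test assumes equal group sizes, so the power actually achieved with unequal groups has to be approximated. The sketch below brackets it in two ways, both of which are approximations introduced here rather than part of the original analysis: using the smaller group size (a conservative bound) and using the harmonic mean of the two group sizes (an equal-n equivalent under equal variances).

```r
# Approximate achieved power for delta = 10 with the actual group sizes.
n_hi <- 60
n_lo <- 24
pooled_sd <- 15.02  # rounded pooled SD reported above

achieved_power <- function(n_per_group) {
  power.t.test(n = n_per_group, delta = 10, sd = pooled_sd,
               sig.level = 0.05, type = "two.sample",
               alternative = "two.sided")$power
}

# Harmonic mean: the equal group size that gives the same standard error
# of the mean difference when both groups share one variance.
n_harmonic <- 2 * n_hi * n_lo / (n_hi + n_lo)

power_conservative <- achieved_power(min(n_hi, n_lo))
power_harmonic     <- achieved_power(n_harmonic)

cat("Harmonic-mean n per group:", round(n_harmonic, 1), "\n")
cat("Power using smaller group:", round(power_conservative, 2), "\n")
cat("Power using harmonic mean:", round(power_harmonic, 2), "\n")
```

Both figures fall below the 0.80 target, because the effective group size is smaller than the 37 per group that the calculation calls for.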

Hypothesis Test

Welch’s two-sample t-test was used because group sizes are unequal (High = 60, Low = 24) and variance equality was not assumed, making it a more conservative and appropriate choice than Student’s t-test in this context.

h1_result <- t.test(overall_score ~ income,
                    data        = h1_data,
                    var.equal   = FALSE,
                    alternative = "two.sided")

print(h1_result)

## 
##  Welch Two Sample t-test
## 
## data:  overall_score by income
## t = 7.5645, df = 53.091, p-value = 5.521e-10
## alternative hypothesis: true difference in means between group High income and group Low income is not equal to 0
## 95 percent confidence interval:
##  18.27456 31.46182
## sample estimates:
## mean in group High income  mean in group Low income 
##                  81.23470                  56.36651

cohen.d(overall_score ~ income, data = h1_data)

## 
## Cohen's d
## 
## d estimate: 1.65582 (large)
## 95 percent confidence interval:
##    lower    upper 
## 1.112284 2.199356

Results summary:

Statistic	Value
t-statistic	7.564
Degrees of freedom	53.1
p-value	5.52e-10
95% CI (difference)	[18.27, 31.46]
Cohen’s D	≈ 1.66 (Large)

Decision: The p-value is far below our α = 0.05 threshold. We reject the null hypothesis. There is strong statistical evidence that High income countries have a significantly higher overall_score than Low income countries in 2023.

The Cohen’s D of approximately 1.66 confirms that this difference is not just statistically significant; it is also practically large.
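As a cross-check, the test statistic, degrees of freedom, and effect size can be approximately reproduced by hand from the summary statistics. The sketch below is an illustration rather than the method t.test and cohen.d use internally; it takes the group means from the test output and the rounded SDs from the descriptive table, so small discrepancies from the reported values are expected.

```r
# Recompute Welch's t, its degrees of freedom, and Cohen's d from
# summary statistics. SDs are the rounded values from the descriptive
# table, so results match the reported output only approximately.
mean_hi <- 81.23470; sd_hi <- 15.9; n_hi <- 60
mean_lo <- 56.36651; sd_lo <- 12.6; n_lo <- 24

diff_means <- mean_hi - mean_lo

# Welch standard error: each group contributes its own variance / n
var_term_hi <- sd_hi^2 / n_hi
var_term_lo <- sd_lo^2 / n_lo
se_welch    <- sqrt(var_term_hi + var_term_lo)
t_stat      <- diff_means / se_welch

# Welch-Satterthwaite degrees of freedom
df_welch <- (var_term_hi + var_term_lo)^2 /
  (var_term_hi^2 / (n_hi - 1) + var_term_lo^2 / (n_lo - 1))

# Cohen's d divides the same difference by the pooled SD
pooled_sd <- sqrt(((n_hi - 1) * sd_hi^2 + (n_lo - 1) * sd_lo^2) /
                    (n_hi + n_lo - 2))
cohens_d  <- diff_means / pooled_sd

cat("t =", round(t_stat, 2), " df =", round(df_welch, 1),
    " d =", round(cohens_d, 2), "\n")
```

The two statistics share a numerator but answer different questions: t scales the difference by its standard error, which shrinks as n grows, while d scales it by the spread of individual scores, which does not.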

Visualization

The boxplot shows that High income countries cluster in the upper range of the score distribution (median ≈ 85), while Low income countries cluster considerably lower (median ≈ 55). The non-overlapping notches are consistent with a reliable difference in medians, although ggplot2 warns that a notch extends beyond the hinges, so the notches should be read as approximate. Even the lowest-performing High income countries generally outperform the median Low income country.

h1_data %>%
  ggplot(aes(x = income, y = overall_score, fill = income)) +
  geom_boxplot(notch = TRUE, alpha = 0.7, outlier.shape = NA) +
  geom_jitter(width = 0.15, alpha = 0.5, size = 2, color = "gray30") +
  stat_summary(fun = mean, geom = "point", shape = 18,
               size = 4, color = "darkred",
               position = position_nudge(x = 0.25)) +
  scale_fill_manual(values = c("High income" = "#2196F3",
                                "Low income"  = "#FF7043")) +
  labs(
    title    = "Overall Statistical Performance Score by Income Group (2023)",
    subtitle = "Red diamond = group mean | Notch approximates 95% CI around median",
    x        = "Income Group",
    y        = "Overall Score (0–100)",
    fill     = "Income Group"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
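The warning above appears when a notch extends past the box itself, which tends to happen in small groups. The sketch below shows how base R computes notch limits (median ± 1.58 × IQR / √n), using a small vector of hypothetical scores that is not taken from the SPI data. ggplot2 documents the same 1.58 × IQR / √n rule, though its quartiles may be computed slightly differently from the hinges used here.

```r
# Notch limits as computed by base R: median +/- 1.58 * IQR / sqrt(n),
# with the IQR taken between the hinges returned by fivenum().
notch_limits <- function(x) {
  x    <- x[!is.na(x)]
  five <- fivenum(x)              # min, lower hinge, median, upper hinge, max
  iqr  <- five[4] - five[2]
  five[3] + c(-1.58, 1.58) * iqr / sqrt(length(x))
}

# Hypothetical scores for a small group (not taken from the SPI data)
scores <- c(40, 45, 50, 52, 55, 56, 57, 58, 60, 75)

hinges <- fivenum(scores)[c(2, 4)]
notch  <- notch_limits(scores)

cat("Hinges:", hinges, "\n")
cat("Notch: ", round(notch, 2), "\n")
cat("Notch outside hinges:", notch[1] < hinges[1] || notch[2] > hinges[2], "\n")
```

Because the notch width shrinks with √n, a group of only a couple of dozen countries can easily produce a notch wider than the distance from the median to the nearest hinge.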

Hypothesis 2 — Fisher’s Significance Testing

Question

Is there a difference in data_use_score between High income and Low income countries in 2023?

data_use_score measures how well countries apply available statistical data for governance and policy, a distinct dimension from the composite overall_score. Unlike Hypothesis 1, no error rates are pre-specified. Instead, the p-value is interpreted directly as a measure of evidence against the null hypothesis.

Hypotheses

\[H_0: \mu_{\text{High}} - \mu_{\text{Low}} = 0 \quad \text{(data\_use\_score is not affected by income group)}\]

\[H_1: \mu_{\text{High}} - \mu_{\text{Low}} \neq 0\]

Hypothesis Test

Welch’s two-sample t-test was used here as well, because group sizes are unequal and variance equality was not assumed. This ensures the test is robust to differences in spread between the two groups.

h2_result <- t.test(data_use_score ~ income,
                    data        = filtered_data,
                    var.equal   = FALSE,
                    alternative = "two.sided")

print(h2_result)

## 
##  Welch Two Sample t-test
## 
## data:  data_use_score by income
## t = 0.82341, df = 43.069, p-value = 0.4148
## alternative hypothesis: true difference in means between group High income and group Low income is not equal to 0
## 95 percent confidence interval:
##  -6.561846 15.618498
## sample estimates:
## mean in group High income  mean in group Low income 
##                  76.31294                  71.78462

cohen.d(data_use_score ~ income, data = filtered_data)

## 
## Cohen's d
## 
## d estimate: 0.1802127 (negligible)
## 95 percent confidence interval:
##      lower      upper 
## -0.2646166  0.6250421

Interpretation — Fisher Lens

What does the p-value tell us?

The p-value is 0.4148. Under Fisher’s framework, this means: assuming that income group has no effect on data_use_score, the probability of observing a difference in sample means at least as large in magnitude as the one we observed is approximately 41.5%. This provides little to no evidence against the null hypothesis; a gap of this size would be unremarkable if the two population means were equal.

Statistical vs. Practical Significance:

Statistic	Value
t-statistic	0.823
Degrees of freedom	43.1
p-value	0.4148
95% CI (difference)	[-6.56, 15.62]
Cohen’s D	≈ 0.18 (Negligible)

The Cohen’s D of approximately 0.18 is negligible by the conventional thresholds (below 0.2). The observed gap of about 4.5 points is small relative to the within-group spread (SD ≈ 25). The 95% confidence interval, [-6.56, 15.62], includes zero, so the data are compatible with no difference, with a modest High income advantage, and even with a small Low income advantage. Statistical and practical significance agree here: neither is established.
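The p-value and confidence interval in the test output can be approximately recomputed from the summary statistics, which makes it easy to check whether the interval covers zero. This is a sketch under the assumption that the rounded SDs from the descriptive table are close enough to the unrounded ones; the results should agree with the t.test output to about two decimal places.

```r
# Recompute the Welch p-value and 95% CI for data_use_score from
# summary statistics. SDs are rounded values from the descriptive
# table, so results match the reported output only approximately.
mean_hi <- 76.31294; sd_hi <- 25.4; n_hi <- 85
mean_lo <- 71.78462; sd_lo <- 24.3; n_lo <- 26

diff_means <- mean_hi - mean_lo
v_hi <- sd_hi^2 / n_hi
v_lo <- sd_lo^2 / n_lo
se   <- sqrt(v_hi + v_lo)
df   <- (v_hi + v_lo)^2 / (v_hi^2 / (n_hi - 1) + v_lo^2 / (n_lo - 1))

t_stat  <- diff_means / se
p_value <- 2 * pt(-abs(t_stat), df = df)                 # two-sided p-value
ci      <- diff_means + c(-1, 1) * qt(0.975, df = df) * se

cat("p =", round(p_value, 3), "\n")
cat("95% CI: [", round(ci[1], 2), ",", round(ci[2], 2), "]\n")
cat("CI includes zero:", ci[1] < 0 && ci[2] > 0, "\n")
```

A two-sided p-value above 0.05 and a 95% interval that includes zero are two views of the same result, so they will always agree for this test.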

Why are we confident in the data?

data_use_score has zero missing values in the 2023 cross-section, eliminating concerns about imputation or attrition bias.
The dataset covers virtually all countries in both income categories (85 High income, 26 Low income), making this close to a census rather than a sample.
By filtering to a single year, each country appears exactly once, satisfying the independence assumption.
This is a post-hoc analysis on a fixed, pre-existing dataset. The p-value was calculated once, after all data were collected. There is no sequential testing or repeated checking, so p-value peeking does not apply.

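Recommendation: There is insufficient evidence to conclude that data_use_score differs between High income and Low income countries in 2023. The absence of a detectable gap does not prove the groups are equal, since the confidence interval still allows differences of up to roughly 15 points, but it does suggest that data use is not where the income gap in statistical capacity is concentrated. Assistance aimed at low income countries may therefore be better targeted at the other dimensions of statistical capacity that drive the overall_score gap.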

Visualization

The violin plot shows that both groups span a wide range of data_use_score values (SD ≈ 25 in each) and overlap heavily, with many low income countries scoring comparably to high income ones. This overlap suggests that some low income countries have made notable progress in data utilization despite resource constraints, a finding worth investigating further.

filtered_data %>%
  ggplot(aes(x = income, y = data_use_score, fill = income)) +
  geom_violin(alpha = 0.5, trim = FALSE) +
  geom_boxplot(width = 0.2, outlier.shape = NA, alpha = 0.8) +
  geom_jitter(width = 0.08, alpha = 0.4, size = 2, color = "gray30") +
  stat_summary(fun = mean, geom = "point", shape = 18,
               size = 4, color = "darkred",
               position = position_nudge(x = 0.15)) +
  scale_fill_manual(values = c("High income" = "#2196F3",
                                "Low income"  = "#FF7043")) +
  labs(
    title    = "Data Use Score by Income Group (2023)",
    subtitle = "Violin + boxplot | Red diamond = group mean",
    x        = "Income Group",
    y        = "Data Use Score (0–100)",
    fill     = "Income Group"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Comparison of Frameworks

The two frameworks complement each other and reflect different philosophies of inference.

The Neyman–Pearson framework (Hypothesis 1) is oriented toward decision-making under uncertainty. By specifying α, power, and Δ in advance, we commit to a decision rule before seeing the data. The result is a binary action: reject or fail to reject. This is most useful when the cost of errors can be explicitly reasoned about, such as in policy interventions with budget implications.

Fisher’s Significance Testing (Hypothesis 2) is oriented toward measuring evidence. The p-value is treated as a continuous index of how surprising our data would be under H₀, not as a threshold that triggers a binary conclusion. This encourages more nuanced interpretation, including consideration of effect size, confidence intervals, and data quality.

In this analysis, the two tests led to different conclusions for the two variables. The Neyman–Pearson test rejected H₀ for overall_score, with a very small p-value (5.5e-10) and a large effect size (d ≈ 1.66). The Fisher test found little evidence of a difference in data_use_score (p ≈ 0.41, d ≈ 0.18). The contrast reflects the variables rather than the frameworks: income group is strongly associated with overall statistical capacity but not detectably with data use.

Discussion

This analysis provides a mixed picture of how income level relates to statistical performance among countries in 2023.

High income countries score approximately 25 points higher than Low income countries on the composite overall_score, a large, statistically significant difference that far exceeds the minimum meaningful threshold of 10 points. High income countries also score higher on data_use_score, but only by about 4.5 points, a negligible effect that is not statistically distinguishable from zero.

Together, these findings suggest that the gap between rich and poor countries is concentrated in dimensions of statistical capacity other than data use. For development organizations, this implies that identifying which components of the composite score drive the gap is a precondition for targeting capacity-building efforts effectively.

Limitations

Single-year snapshot: By filtering to 2023 to satisfy independence, the longitudinal dimension of the dataset is lost. It is not possible to determine from this analysis whether the gap is growing, shrinking, or stable over time.
Broad income categories: “High income” encompasses countries as diverse as the United States and Qatar. Within-group heterogeneity is substantial, and aggregate comparisons mask important variation.
Causality not established: This is an observational, cross-sectional analysis. Income does not necessarily cause better statistical performance; confounders such as governance quality, historical investment, and institutional capacity are not controlled for.
Missing values in overall_score: 30 out of 217 countries in 2023 are missing overall_score. Within the two groups analyzed, 27 of 111 countries are missing (25 High income and 2 Low income), so the High income estimate in particular rests on a subset of that group. If missingness is non-random, estimates may be biased.

Future Questions

How has the income-group gap in statistical capacity evolved from 2004 to 2023? A longitudinal mixed-effects model could answer this while accounting for repeated measures.
Which sub-dimensions of statistical capacity drive the overall gap the most? Decomposing the composite score could pinpoint where targeted interventions would be most effective.
Are there low income countries that significantly outperform expectations? Identifying these positive outliers could yield actionable policy lessons.
Does regional affiliation moderate the income–performance relationship? Sub-Saharan African countries may face structural barriers beyond income level alone.

Data Dive 7- Hypothesis Testing

2026-03-01
