Week 8 Data Dive — ANOVA and Linear Regression

Introduction

This notebook continues the analysis of the World Bank Statistical Performance Indicators (SPI) dataset, a longitudinal country-level dataset covering 217 countries from 2004 to 2023. Each row represents one country-year observation and includes multiple measures of statistical capacity, such as data use, data products, and infrastructure.

Building on the hypothesis testing work from Week 7, this analysis shifts toward modeling. Two questions are addressed. First, does income group explain differences in overall statistical performance across countries? This is tested using a one-way ANOVA. Second, can a single continuous sub-indicator, data_products_score, predict a country’s overall performance? This is explored through simple linear regression.

These two approaches move beyond comparing group means toward understanding the structure of the relationship between statistical capacity indicators.

Data Preparation

Filtering to 2023

As in the previous analysis, the dataset is filtered to the most recent year (2023) to ensure that each country appears only once. Because the same country is recorded multiple times across years, using all years would violate the independence assumption required for both ANOVA and regression.

# Load libraries
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)

# Load dataset
dataset <- read_csv("dataset.csv")

## Rows: 4340 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): iso3c, country, region, income
## dbl (8): year, population, overall_score, data_use_score, data_services_scor...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Filter to 2023 and remove "Not classified" income group
data_2023 <- dataset %>%
  filter(year == 2023) %>%
  filter(income != "Not classified")

cat("Rows in 2023 snapshot:", nrow(data_2023), "\n")

## Rows in 2023 snapshot: 216

cat("Unique countries:", n_distinct(data_2023$country), "\n")

## Unique countries: 216

Income Group Distribution

# Confirm group sizes
table(data_2023$income)

## 
##         High income          Low income Lower middle income Upper middle income 
##                  85                  26                  51                  54

The 2023 snapshot contains 216 countries across four income groups: High income (n = 85), Upper middle income (n = 54), Lower middle income (n = 51), and Low income (n = 26). These are sufficient group sizes for ANOVA.

Descriptive Summary

Before running any tests, descriptive statistics for overall_score across income groups ground the analysis and set expectations for what the formal test should find.

data_2023 %>%
  filter(!is.na(overall_score)) %>%
  group_by(income) %>%
  summarise(
    n    = n(),
    mean = round(mean(overall_score), 2),
    sd   = round(sd(overall_score),   2),
    min  = round(min(overall_score),  2),
    max  = round(max(overall_score),  2)
  ) %>%
  arrange(desc(mean))

## # A tibble: 4 × 6
##   income                  n  mean    sd   min   max
##   <chr>               <int> <dbl> <dbl> <dbl> <dbl>
## 1 High income            60  81.2  15.9  42.2  95.3
## 2 Upper middle income    52  69.2  16.1  30.7  91.6
## 3 Lower middle income    50  63.5  12.0  36.5  85.2
## 4 Low income             24  56.4  12.6  27.5  74.3

Key observations:

High income countries score highest on average (mean ≈ 81.2), followed by Upper middle income (≈ 69.2), Lower middle income (≈ 63.5), and Low income (≈ 56.4).
The gap between High income and Low income countries is approximately 25 points on a 0–100 scale.
Standard deviations are similar across groups (roughly 12–16 points each), suggesting that spread is comparable and the ANOVA assumption of roughly equal variances is plausible.

Part 1 — ANOVA: Does Income Group Explain Overall Score?

Response Variable

The response variable for this analysis is overall_score, a composite index that aggregates a country’s performance across all dimensions of its statistical system, including data use, data products, data services, data sources, and data infrastructure. This is the most meaningful single measure of a country’s statistical capacity, and it is the quantity most relevant to policymakers and development organizations assessing where support is needed.

Explanatory Variable

The categorical explanatory variable is income, which classifies each country into one of four World Bank income groups: Low income, Lower middle income, Upper middle income, and High income. This grouping reflects structural economic differences between countries that may correspond to different levels of investment in national statistical systems.

Hypotheses

\[H_0: \mu_{\text{Low}} = \mu_{\text{Lower middle}} = \mu_{\text{Upper middle}} = \mu_{\text{High}}\]

\[H_A: \text{At least one group mean differs}\]

The null hypothesis states that the average overall_score is identical across all four income groups. The alternative hypothesis states that at least one income group has a different mean overall score. This is a one-way ANOVA because there is a single categorical predictor with more than two levels.

Visualization

data_2023 %>%
  filter(!is.na(overall_score)) %>%
  mutate(income = factor(income,
                         levels = c("Low income", "Lower middle income",
                                    "Upper middle income", "High income"))) %>%
  ggplot(aes(x = income, y = overall_score, fill = income)) +
  geom_boxplot(alpha = 0.7, outlier.shape = 21, outlier.size = 2) +
  stat_summary(fun = mean, geom = "point", shape = 18,
               size = 4, color = "darkred",
               position = position_nudge(x = 0.25)) +
  scale_fill_manual(values = c(
    "Low income"           = "#FF7043",
    "Lower middle income"  = "#FFA726",
    "Upper middle income"  = "#42A5F5",
    "High income"          = "#26A69A"
  )) +
  labs(
    title    = "Overall Statistical Performance Score by Income Group (2023)",
    subtitle = "Red diamond = group mean",
    x        = "Income Group",
    y        = "Overall Score (0–100)",
    fill     = "Income Group"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

The boxplot shows a clear ascending pattern in median overall_score as income group increases. The median and mean (red diamond) both rise consistently from Low income to High income. There is visible overlap between adjacent groups, particularly between Low income and Lower middle income, and between Upper middle and High income, which motivates the formal test and post-hoc comparisons.

ANOVA Test

# Fit one-way ANOVA model
anova_model <- aov(overall_score ~ income, data = filter(data_2023, !is.na(overall_score)))
summary(anova_model)

##              Df Sum Sq Mean Sq F value   Pr(>F)    
## income        3  14170    4723   22.18 2.74e-12 ***
## Residuals   182  38749     213                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpreting the output:

The ANOVA table reports an F-statistic of 22.18 with 3 degrees of freedom between groups and 182 degrees of freedom within groups. The corresponding p-value is approximately 2.7 × 10⁻¹², which is far below any conventional significance threshold (α = 0.05).

This means: assuming that income group has no effect on overall_score, the probability of observing an F-statistic this large by chance alone is essentially zero. We therefore reject the null hypothesis and conclude that the mean overall score is not the same across all income groups, at least one group differs significantly from the others.

Post-Hoc Test — Tukey HSD

Because ANOVA only tells us that some difference exists, a Tukey Honest Significant Difference (HSD) test is used to identify which specific pairs of income groups differ significantly. Tukey HSD adjusts for multiple comparisons, controlling the family-wise error rate at 5%.

TukeyHSD(anova_model)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = overall_score ~ income, data = filter(data_2023, !is.na(overall_score)))
## 
## $income
##                                               diff        lwr        upr
## Low income-High income                  -24.868191 -34.005981 -15.730401
## Lower middle income-High income         -17.720876 -24.965545 -10.476208
## Upper middle income-High income         -12.047586 -19.215858  -4.879314
## Lower middle income-Low income            7.147315  -2.247929  16.542558
## Upper middle income-Low income           12.820605   3.484144  22.157066
## Upper middle income-Lower middle income   5.673290  -1.820408  13.166989
##                                             p adj
## Low income-High income                  0.0000000
## Lower middle income-High income         0.0000000
## Upper middle income-High income         0.0001281
## Lower middle income-Low income          0.2022206
## Upper middle income-Low income          0.0026408
## Upper middle income-Lower middle income 0.2059216

Key findings from the Tukey HSD:

Comparison	Mean Difference	Significant?
High income vs Low income	+24.87 points	Yes
High income vs Lower middle income	+17.72 points	Yes
High income vs Upper middle income	+12.05 points	Yes
Upper middle income vs Low income	+12.82 points	Yes
Lower middle income vs Low income	+7.15 points	No
Upper middle income vs Lower middle income	+5.67 points	No

The largest and most statistically clear gap is between High income and Low income countries (~25 points). High income countries differ significantly from all other groups. However, adjacent income groups, Low vs. Lower middle, and Lower middle vs. Upper middle, do not differ significantly from each other after adjusting for multiple comparisons.

Practical Interpretation

There is strong evidence that income group is associated with differences in overall statistical performance. In practical terms, this means it would not be safe to assume that all income groups perform at the same level. High income countries perform substantially and significantly better than all other groups, while the three lower income tiers form a more graduated pattern where only the widest gaps reach statistical significance.

For development organizations, this suggests that targeting the High income vs. Low income gap should remain a priority. However, the lack of significant differences between adjacent groups implies that interventions may not need to distinguish sharply between Low and Lower middle income countries, they face comparably constrained statistical ecosystems. Further questions worth investigating include whether this gap has widened or narrowed over time, and which specific sub-dimensions of statistical capacity drive the High income advantage.

Part 2 — Linear Regression: Predicting Overall Score from Data Products Score

Response and Predictor Variables

The response variable remains overall_score. The continuous predictor selected is data_products_score, which measures the quality and availability of data products that a country’s statistical system produces, such as national accounts, household surveys, and administrative data. This sub-indicator was chosen because it is one of the most complete columns in the 2023 data (255 missing values across all years, compared to over 2,800 for other sub-scores), and it has a strong conceptual connection to overall statistical performance: countries that produce high-quality data products are likely to score well on the composite index too.

Checking Linearity

Before fitting a regression model, the relationship between the two variables should be roughly linear. A scatter plot is used to verify this.

data_2023 %>%
  filter(!is.na(overall_score), !is.na(data_products_score)) %>%
  ggplot(aes(x = data_products_score, y = overall_score)) +
  geom_point(alpha = 0.6, color = "#2C7FB8", size = 2.5) +
  geom_smooth(method = "lm", se = TRUE, color = "#D95F02", linewidth = 1) +
  labs(
    title    = "Overall Score vs. Data Products Score (2023)",
    subtitle = "Fitted regression line with 95% confidence band",
    x        = "Data Products Score (0–100)",
    y        = "Overall Score (0–100)"
  ) +
  theme_minimal(base_size = 13)

## `geom_smooth()` using formula = 'y ~ x'

The scatter plot shows a clear positive linear relationship between data_products_score and overall_score. Countries with higher data products scores consistently tend to have higher overall scores. The fitted regression line tracks the data well, with no obvious curvature, which confirms that a linear model is appropriate here.

Regression Model

# Fit simple linear regression
lm_model <- lm(overall_score ~ data_products_score,
               data = filter(data_2023, !is.na(overall_score), !is.na(data_products_score)))

summary(lm_model)

## 
## Call:
## lm(formula = overall_score ~ data_products_score, data = filter(data_2023, 
##     !is.na(overall_score), !is.na(data_products_score)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.0225  -7.2264  -0.4948   6.7765  27.7395 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -31.59587    5.29525  -5.967 1.22e-08 ***
## data_products_score   1.32961    0.06874  19.343  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.737 on 184 degrees of freedom
## Multiple R-squared:  0.6703, Adjusted R-squared:  0.6685 
## F-statistic: 374.1 on 1 and 184 DF,  p-value: < 2.2e-16

Coefficient Interpretation

The regression equation is:

\[\widehat{\text{Overall Score}} = -31.60 + 1.33 \times \text{Data Products Score}\]

Intercept (−31.60): This is the model’s predicted overall_score when data_products_score equals zero. A score of zero on data products has no real-world meaning for any observed country (the actual range is 45.2 to 94.3), so the intercept should not be interpreted literally, it is a mathematical anchor for the line.

Slope (1.33): For every 1-point increase in data_products_score, the model predicts an average increase of 1.33 points in overall_score. This is greater than a one-for-one relationship, meaning improvements in data products performance are associated with disproportionately larger gains in overall statistical performance. A country that improves its data products score by 10 points would be expected to see its overall score rise by approximately 13 points.

Both coefficients are highly statistically significant (p < 0.001), meaning the probability of observing slopes this large by chance, if there were truly no relationship, is negligible.

Model Fit

The model’s R² = 0.67, meaning that data_products_score alone explains approximately 67% of the variation in overall_score across countries. This is a strong fit for a simple one-predictor model, particularly given that overall_score is a composite of five sub-dimensions, only one of which is being used as a predictor here.

The remaining 33% of variation is unexplained, likely attributable to the other sub-indicators (data use, services, sources, infrastructure) and country-level factors not captured in this model.

Practical Interpretation

The strong linear relationship between data_products_score and overall_score suggests that a country’s capacity to produce high-quality statistical outputs, censuses, surveys, national accounts, is one of the most important single drivers of overall statistical system performance. For policymakers and development funders, this implies that investments specifically targeted at improving data product quality and availability may yield broad-based gains across the entire statistical system, not just in one sub-dimension.

However, since the model explains 67% of the variation, it would be an oversimplification to treat data products as the only lever worth pulling. The remaining unexplained variation suggests that other components of statistical capacity, particularly data services and infrastructure, also play meaningful independent roles. Further investigation could explore a multiple regression model incorporating all five sub-scores to understand their relative contributions.

Discussion

This analysis extends the Week 7 findings in two ways. The ANOVA confirms what the t-test suggested: income group is strongly associated with overall statistical performance. But by including all four income tiers, it becomes clear that the relationship is not simply binary, the gradient from Low to High income is real, though the steps between adjacent groups are not always statistically distinguishable. The Tukey HSD results are especially informative here, showing that the High income group stands apart while the lower three tiers form a more compressed cluster.

The regression analysis adds a different kind of insight: within the 2023 cross-section, roughly two-thirds of the country-to-country variation in overall performance can be predicted from a single sub-indicator. This points to the coherence of the SPI framework, countries that do well on one dimension of statistical capacity tend to do well overall, and the relationship is approximately linear and strong.

Together, these results suggest that statistical performance is structured and predictable, which is encouraging for planning purposes. Countries do not perform randomly, performance tracks with both economic resources (income group) and specific competencies (data products quality).

Limitations

Single-year snapshot: Filtering to 2023 satisfies the independence assumption but loses the longitudinal dimension. It is not possible to determine from this analysis whether the income group gap is growing or shrinking over time.
Causality not established: Both analyses are observational. Higher income does not necessarily cause better statistical performance, and better data products scores do not necessarily cause a higher overall score. Confounders such as governance quality and historical institutional investment are not controlled for.
Overall score missingness: 31 out of 217 countries in 2023 are missing overall_score. If missingness is non-random (e.g., the weakest performers are least likely to be scored), estimates could be biased.
Simple regression limitations: The one-predictor model explains 67% of variance, but the remaining 33% is unaccounted for. The model should not be used for precise predictions without acknowledging this uncertainty.

Future Questions

How has the income-group gap in overall_score changed from 2016 to 2023? A mixed-effects model could answer this while accounting for repeated measurements per country.
Would a multiple regression model using all five sub-scores substantially improve predictive accuracy over the single-predictor model?
Are there countries that score much higher or lower on overall_score than their data_products_score would predict? These residual outliers may reveal cases where other sub-dimensions are either compensating for or dragging down overall performance.
Does the slope of the regression relationship differ by income group? An interaction model could test whether the data products–overall score relationship is stronger in some income tiers than others.

Data Dive 8 — Regression Modeling

2026-03-13

Week 8 Data Dive — ANOVA and Linear Regression

Introduction

Data Preparation

Filtering to 2023

Income Group Distribution

Descriptive Summary

Part 1 — ANOVA: Does Income Group Explain Overall Score?

Response Variable

Explanatory Variable

Hypotheses

Visualization

ANOVA Test

Post-Hoc Test — Tukey HSD

Practical Interpretation

Part 2 — Linear Regression: Predicting Overall Score from Data Products Score

Response and Predictor Variables

Checking Linearity

Regression Model

Coefficient Interpretation

Model Fit

Practical Interpretation

Discussion

Limitations

Future Questions