This notebook critiques the analysis from Data Dive 8 - Regression Modeling, which used the World Bank Statistical Performance Indicators (SPI) dataset. The dataset covers 217 countries from 2004 to 2023 and measures statistical capacity across five sub-dimensions: data use, data products, data services, data sources, and data infrastructure.
Two models were used in Week 8:
data_products_score predict
overall_score?The goal is not to redo the analysis, but to critique it. This includes identifying what was missing, what assumptions were not checked, and what risks exist if these models were used in practice.
The original Week 8 analysis can be found here: Data Dive 8 - Regression Modeling
Organizations like the World Bank or UNDP could use this analysis to decide where to direct funding for improving national statistical systems. Poor decisions based on flawed models could misallocate resources and disadvantage already-struggling countries.
library(tidyverse)
dataset <- read_csv("dataset.csv")
# Filter to 2023, remove unclassified
data_2023 <- dataset %>%
filter(year == 2023) %>%
filter(income != "Not classified")
cat("Countries in 2023:", nrow(data_2023), "\n")
## Countries in 2023: 216
Question: Do income groups differ in
overall_score?
anova_model <- aov(overall_score ~ income, data = data_2023)
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## income 3 14170 4723 22.18 2.74e-12 ***
## Residuals 182 38749 213
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 30 observations deleted due to missingness
Result: The ANOVA was significant (p < 0.001), meaning at least one income group differs in overall statistical performance.
Question: Can data_products_score
predict overall_score?
lm_model <- lm(overall_score ~ data_products_score,
data = filter(data_2023, !is.na(overall_score), !is.na(data_products_score)))
summary(lm_model)
##
## Call:
## lm(formula = overall_score ~ data_products_score, data = filter(data_2023,
## !is.na(overall_score), !is.na(data_products_score)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.0225 -7.2264 -0.4948 6.7765 27.7395
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -31.59587 5.29525 -5.967 1.22e-08 ***
## data_products_score 1.32961 0.06874 19.343 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.737 on 184 degrees of freedom
## Multiple R-squared: 0.6703, Adjusted R-squared: 0.6685
## F-statistic: 374.1 on 1 and 184 DF, p-value: < 2.2e-16
Result: For every 1-point increase in
data_products_score, overall score increases by ~1.33
points. R² = 0.67, meaning the model explains about 67% of the variation
in overall_score within this dataset.
Income group is strongly associated with statistical performance.
High-income countries significantly outperform others. A single
predictor (data_products_score) explains about 67% of the
variation in overall_score within this dataset. These
results are informative but come with important limitations, discussed
next.
1. Regression assumptions were never checked
Week 8 fit the model without verifying its assumptions. Below are the standard diagnostic plots:
par(mfrow = c(2, 2))
plot(lm_model)
par(mfrow = c(1, 1))
What to look for:
Since these assumptions were not checked in Week 8, the reported statistical significance may not be reliable.
2. Only one predictor was used
The Week 8 regression used only data_products_score. But
overall_score is a composite of five sub-dimensions. This
highlights a limitation of the original model. Relying on a single
predictor ignores other relevant components and may overstate the
importance of data products alone.
The code below shows what a multiple regression would look like and how much additional variance it explains:
lm_multi <- lm(overall_score ~ data_products_score + data_use_score +
data_services_score + data_sources_score + data_infrastructure_score,
data = filter(data_2023, !is.na(overall_score)))
summary(lm_multi)
##
## Call:
## lm(formula = overall_score ~ data_products_score + data_use_score +
## data_services_score + data_sources_score + data_infrastructure_score,
## data = filter(data_2023, !is.na(overall_score)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.878e-09 -2.936e-09 1.104e-10 1.674e-09 5.307e-09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.221e-09 1.726e-09 2.446e+00 0.0154 *
## data_products_score 2.000e-01 3.382e-11 5.913e+09 <2e-16 ***
## data_use_score 2.000e-01 1.842e-11 1.085e+10 <2e-16 ***
## data_services_score 2.000e-01 1.724e-11 1.160e+10 <2e-16 ***
## data_sources_score 2.000e-01 1.827e-11 1.095e+10 <2e-16 ***
## data_infrastructure_score 2.000e-01 1.735e-11 1.152e+10 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.877e-09 on 180 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.279e+21 on 5 and 180 DF, p-value: < 2.2e-16
cat("Simple regression R²: ", round(summary(lm_model)$r.squared, 3), "\n")
## Simple regression R²: 0.67
cat("Multiple regression R²:", round(summary(lm_multi)$r.squared, 3), "\n")
## Multiple regression R²: 1
The difference in R² shows that the original model left out important information. A more complete model would have included all sub-scores from the start.
3. The independence assumption may be violated
Filtering to 2023 was the right move for ANOVA and regression. But the dataset is longitudinal. The same countries appear every year. If someone were to model trends over time using the full dataset, standard regression would be inappropriate because repeated measurements from the same country are not independent.
A quick check of how many years each country appears:
dataset %>%
group_by(country) %>%
summarise(n_years = n()) %>%
summary()
## country n_years
## Length:217 Min. :20
## Class :character 1st Qu.:20
## Mode :character Median :20
## Mean :20
## 3rd Qu.:20
## Max. :20
Each country appears multiple times. Any model using the full dataset without accounting for this structure would produce misleading standard errors.
4. ANOVA assumes equal variances across groups
The Levene test checks this formally:
# install.packages("car") if needed
library(car)
leveneTest(overall_score ~ income, data = data_2023)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 0.9273 0.4287
## 182
If the p-value is significant, the equal variance assumption is violated and a Welch ANOVA would be more appropriate:
oneway.test(overall_score ~ income, data = data_2023, var.equal = FALSE)
##
## One-way analysis of means (not assuming equal variances)
##
## data: overall_score and income
## F = 22.792, num df = 3.000, denom df = 83.048, p-value = 7.355e-11
Income group labels oversimplify countries
The World Bank income classification puts 217 countries into just 4 buckets. Countries within the same group can differ dramatically in governance, geography, and history. Treating “Low income” as a homogeneous category risks masking important differences and could lead to blanket policies that don’t fit individual country contexts.
Risk of misleading comparisons
A 25-point gap between High income and Low income countries looks large numerically, but this says nothing about why the gap exists or whether it is closing. Presenting this result to policymakers without context could lead to harmful conclusions. For example, it might suggest that low-income countries are simply less capable, rather than structurally under-resourced.
Missing data is not random
31 out of 217 countries are missing overall_score in
2023. These are likely the weakest-performing countries, the ones most
relevant to development organizations. Excluding them makes the dataset
look more optimistic than reality.
data_2023 %>%
group_by(income) %>%
summarise(
total = n(),
missing_overall = sum(is.na(overall_score)),
pct_missing = round(100 * missing_overall / total, 1)
)
## # A tibble: 4 × 4
## income total missing_overall pct_missing
## <chr> <int> <int> <dbl>
## 1 High income 85 25 29.4
## 2 Low income 26 2 7.7
## 3 Lower middle income 51 1 2
## 4 Upper middle income 54 2 3.7
This table shows whether missingness is concentrated in lower-income groups, which is a key ethical concern.
Correlation is not causation
The regression shows that data_products_score and
overall_score move together. But
data_products_score is one of the components used to
calculate overall_score. This creates a built-in
correlation that may inflate R². The model partially explains a
constructed relationship rather than discovering a fully independent
pattern. The relationship is partially mathematical, not purely
empirical.
The model cannot support policy claims
A slope of 1.33 is not a policy lever. We cannot conclude that improving data products by 10 points will raise overall score by 13 points. The model is observational and cross-sectional. It cannot tell us what happens when an actual intervention is made.
Results depend on how variables are measured
The SPI scores are constructed indices. They reflect how the World Bank chooses to measure statistical capacity. Different measurement choices would produce different results. The model’s conclusions are only as valid as the underlying measurement framework.
The Week 8 analysis provided a useful starting point. Income group matters, and data products quality is a strong predictor of overall statistical performance. However, the models had real gaps: no diagnostic checks, a single predictor, and no discussion of what the results can and cannot support. Overall, the original analysis provides useful descriptive insights, but should not be used as a basis for causal claims or high-stakes policy decisions without further validation.