Week 14 - Model Critique

1. Introduction
2. Business Context
3. Data & Analysis (from Week 8)
- 3.1 ANOVA
- 3.2 Linear Regression
4. Results Summary
5. Model Critique
6. Conclusion

1. Introduction

This notebook critiques the analysis from Data Dive 8 - Regression Modeling, which used the World Bank Statistical Performance Indicators (SPI) dataset. The dataset covers 217 countries from 2004 to 2023 and measures statistical capacity across five sub-dimensions: data use, data products, data services, data sources, and data infrastructure.

Two models were used in Week 8:

ANOVA: Does income group explain differences in overall statistical performance?
Simple Linear Regression: Can data_products_score predict overall_score?

The goal is not to redo the analysis, but to critique it. This includes identifying what was missing, what assumptions were not checked, and what risks exist if these models were used in practice.

The original Week 8 analysis can be found here: Data Dive 8 - Regression Modeling

2. Business Context

Organizations like the World Bank or UNDP could use this analysis to decide where to direct funding for improving national statistical systems. Poor decisions based on flawed models could misallocate resources and disadvantage already-struggling countries.

3. Data & Analysis (from Week 8)

library(tidyverse)

dataset <- read_csv("dataset.csv")

# Filter to 2023, remove unclassified
data_2023 <- dataset %>%
  filter(year == 2023) %>%
  filter(income != "Not classified")

cat("Countries in 2023:", nrow(data_2023), "\n")

## Countries in 2023: 216

3.1 ANOVA

Question: Do income groups differ in overall_score?

anova_model <- aov(overall_score ~ income, data = data_2023)
summary(anova_model)

##              Df Sum Sq Mean Sq F value   Pr(>F)    
## income        3  14170    4723   22.18 2.74e-12 ***
## Residuals   182  38749     213                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 30 observations deleted due to missingness

Result: The ANOVA was significant (p < 0.001), meaning at least one income group differs in overall statistical performance.

3.2 Linear Regression

Question: Can data_products_score predict overall_score?

lm_model <- lm(overall_score ~ data_products_score,
               data = filter(data_2023, !is.na(overall_score), !is.na(data_products_score)))

summary(lm_model)

## 
## Call:
## lm(formula = overall_score ~ data_products_score, data = filter(data_2023, 
##     !is.na(overall_score), !is.na(data_products_score)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.0225  -7.2264  -0.4948   6.7765  27.7395 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -31.59587    5.29525  -5.967 1.22e-08 ***
## data_products_score   1.32961    0.06874  19.343  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.737 on 184 degrees of freedom
## Multiple R-squared:  0.6703, Adjusted R-squared:  0.6685 
## F-statistic: 374.1 on 1 and 184 DF,  p-value: < 2.2e-16

Result: For every 1-point increase in data_products_score, overall score increases by ~1.33 points. R² = 0.67, meaning the model explains about 67% of the variation in overall_score within this dataset.

4. Results Summary

Income group is strongly associated with statistical performance. High-income countries significantly outperform others. A single predictor (data_products_score) explains about 67% of the variation in overall_score within this dataset. These results are informative but come with important limitations, discussed next.

5. Model Critique

A. Analytical Issues

1. Regression assumptions were never checked

Week 8 fit the model without verifying its assumptions. Below are the standard diagnostic plots:

par(mfrow = c(2, 2))
plot(lm_model)

par(mfrow = c(1, 1))

What to look for:

Residuals vs Fitted: should show no clear pattern (checks linearity + equal variance)
Q-Q Plot: points should follow the diagonal line (checks normality of residuals)
Scale-Location: should be roughly flat (checks homoscedasticity)
Residuals vs Leverage: identifies influential outliers

Since these assumptions were not checked in Week 8, the reported statistical significance may not be reliable.

2. Only one predictor was used

The Week 8 regression used only data_products_score. But overall_score is a composite of five sub-dimensions. This highlights a limitation of the original model. Relying on a single predictor ignores other relevant components and may overstate the importance of data products alone.

The code below shows what a multiple regression would look like and how much additional variance it explains:

lm_multi <- lm(overall_score ~ data_products_score + data_use_score +
                 data_services_score + data_sources_score + data_infrastructure_score,
               data = filter(data_2023, !is.na(overall_score)))

summary(lm_multi)

## 
## Call:
## lm(formula = overall_score ~ data_products_score + data_use_score + 
##     data_services_score + data_sources_score + data_infrastructure_score, 
##     data = filter(data_2023, !is.na(overall_score)))
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -4.878e-09 -2.936e-09  1.104e-10  1.674e-09  5.307e-09 
## 
## Coefficients:
##                            Estimate Std. Error   t value Pr(>|t|)    
## (Intercept)               4.221e-09  1.726e-09 2.446e+00   0.0154 *  
## data_products_score       2.000e-01  3.382e-11 5.913e+09   <2e-16 ***
## data_use_score            2.000e-01  1.842e-11 1.085e+10   <2e-16 ***
## data_services_score       2.000e-01  1.724e-11 1.160e+10   <2e-16 ***
## data_sources_score        2.000e-01  1.827e-11 1.095e+10   <2e-16 ***
## data_infrastructure_score 2.000e-01  1.735e-11 1.152e+10   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.877e-09 on 180 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.279e+21 on 5 and 180 DF,  p-value: < 2.2e-16

cat("Simple regression R²:  ", round(summary(lm_model)$r.squared, 3), "\n")

## Simple regression R²:   0.67

cat("Multiple regression R²:", round(summary(lm_multi)$r.squared, 3), "\n")

## Multiple regression R²: 1

The difference in R² shows that the original model left out important information. A more complete model would have included all sub-scores from the start.

3. The independence assumption may be violated

Filtering to 2023 was the right move for ANOVA and regression. But the dataset is longitudinal. The same countries appear every year. If someone were to model trends over time using the full dataset, standard regression would be inappropriate because repeated measurements from the same country are not independent.

A quick check of how many years each country appears:

dataset %>%
  group_by(country) %>%
  summarise(n_years = n()) %>%
  summary()

##    country             n_years  
##  Length:217         Min.   :20  
##  Class :character   1st Qu.:20  
##  Mode  :character   Median :20  
##                     Mean   :20  
##                     3rd Qu.:20  
##                     Max.   :20

Each country appears multiple times. Any model using the full dataset without accounting for this structure would produce misleading standard errors.

4. ANOVA assumes equal variances across groups

The Levene test checks this formally:

# install.packages("car") if needed
library(car)
leveneTest(overall_score ~ income, data = data_2023)

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   3  0.9273 0.4287
##       182

If the p-value is significant, the equal variance assumption is violated and a Welch ANOVA would be more appropriate:

oneway.test(overall_score ~ income, data = data_2023, var.equal = FALSE)

## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  overall_score and income
## F = 22.792, num df = 3.000, denom df = 83.048, p-value = 7.355e-11

B. Ethical Concerns

Income group labels oversimplify countries

The World Bank income classification puts 217 countries into just 4 buckets. Countries within the same group can differ dramatically in governance, geography, and history. Treating “Low income” as a homogeneous category risks masking important differences and could lead to blanket policies that don’t fit individual country contexts.

Risk of misleading comparisons

A 25-point gap between High income and Low income countries looks large numerically, but this says nothing about why the gap exists or whether it is closing. Presenting this result to policymakers without context could lead to harmful conclusions. For example, it might suggest that low-income countries are simply less capable, rather than structurally under-resourced.

Missing data is not random

31 out of 217 countries are missing overall_score in 2023. These are likely the weakest-performing countries, the ones most relevant to development organizations. Excluding them makes the dataset look more optimistic than reality.

data_2023 %>%
  group_by(income) %>%
  summarise(
    total = n(),
    missing_overall = sum(is.na(overall_score)),
    pct_missing = round(100 * missing_overall / total, 1)
  )

## # A tibble: 4 × 4
##   income              total missing_overall pct_missing
##   <chr>               <int>           <int>       <dbl>
## 1 High income            85              25        29.4
## 2 Low income             26               2         7.7
## 3 Lower middle income    51               1         2  
## 4 Upper middle income    54               2         3.7

This table shows whether missingness is concentrated in lower-income groups, which is a key ethical concern.

C. Epistemological Limits

Correlation is not causation

The regression shows that data_products_score and overall_score move together. But data_products_score is one of the components used to calculate overall_score. This creates a built-in correlation that may inflate R². The model partially explains a constructed relationship rather than discovering a fully independent pattern. The relationship is partially mathematical, not purely empirical.

The model cannot support policy claims

A slope of 1.33 is not a policy lever. We cannot conclude that improving data products by 10 points will raise overall score by 13 points. The model is observational and cross-sectional. It cannot tell us what happens when an actual intervention is made.

Results depend on how variables are measured

The SPI scores are constructed indices. They reflect how the World Bank chooses to measure statistical capacity. Different measurement choices would produce different results. The model’s conclusions are only as valid as the underlying measurement framework.

6. Conclusion

The Week 8 analysis provided a useful starting point. Income group matters, and data products quality is a strong predictor of overall statistical performance. However, the models had real gaps: no diagnostic checks, a single predictor, and no discussion of what the results can and cannot support. Overall, the original analysis provides useful descriptive insights, but should not be used as a basis for causal claims or high-stakes policy decisions without further validation.