Week 10 Data Dive — Generalized Linear Models

Introduction

This notebook extends the regression analysis from previous weeks by introducing a generalized linear model (GLM), specifically logistic regression, to model a binary outcome.

Prior analyses focused on continuous outcomes using linear regression — predicting overall_score from sub-indicator scores and income group. Logistic regression is needed here because the response variable is binary. Rather than modeling a score, the goal is to model the probability that a country is high-performing, defined as having an overall_score above 85.

This threshold was chosen because it represents a meaningfully high level of statistical performance, well above the dataset mean, and results in a reasonably balanced split: 48 high-performing countries and 138 that fall below the threshold. The three predictors used are data_use_score, data_services_score, and data_infrastructure_score — sub-indicators that reflect distinct dimensions of a country’s statistical capacity and were not directly used as outcome variables in prior models.

The dataset is the World Bank Statistical Performance Indicators dataset, covering 217 countries from 2004 to 2023. The analysis uses only the 2023 cross-sectional snapshot.

Data Preparation

# Load libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
# Load dataset
dataset <- read_csv("dataset.csv")
## Rows: 4340 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): iso3c, country, region, income
## dbl (8): year, population, overall_score, data_use_score, data_services_scor...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Filter to 2023, remove unclassified income group, create binary outcome
data_2023 <- dataset %>%
  filter(year == 2023) %>%
  filter(income != "Not classified") %>%
  mutate(high_performance = ifelse(overall_score > 85, 1, 0))

cat("Rows in 2023 snapshot:", nrow(data_2023), "\n")
## Rows in 2023 snapshot: 216
cat("high_performance = 1:", sum(data_2023$high_performance, na.rm = TRUE), "\n")
## high_performance = 1: 48
cat("high_performance = 0:", sum(data_2023$high_performance == 0, na.rm = TRUE), "\n")
## high_performance = 0: 138
# Retain only rows with no missing values across the variables used in the model
model_data <- data_2023 %>%
  filter(
    !is.na(high_performance),
    !is.na(data_use_score),
    !is.na(data_services_score),
    !is.na(data_infrastructure_score)
  )

cat("Rows available for modeling:", nrow(model_data), "\n")
## Rows available for modeling: 186

Part 1 — Building the Logistic Regression Model

Model Specification

A logistic regression model is fit using glm() with family = "binomial". The outcome is high_performance, and the three predictors are data_use_score, data_services_score, and data_infrastructure_score. Each of these sub-indicators measures a different aspect of statistical capacity: how data is used in policy and decision-making, the quality and accessibility of data services, and the strength of the underlying statistical infrastructure.

# Fit logistic regression model
model <- glm(
  high_performance ~ data_use_score + data_services_score + data_infrastructure_score,
  data = model_data,
  family = "binomial"
)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model)
## 
## Call:
## glm(formula = high_performance ~ data_use_score + data_services_score + 
##     data_infrastructure_score, family = "binomial", data = model_data)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)   
## (Intercept)               -102.0330    36.8884  -2.766  0.00567 **
## data_use_score               0.4999     0.2291   2.182  0.02912 * 
## data_services_score          0.3108     0.1177   2.640  0.00828 **
## data_infrastructure_score    0.3175     0.1037   3.063  0.00219 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 212.42  on 185  degrees of freedom
## Residual deviance:  17.07  on 182  degrees of freedom
## AIC: 25.07
## 
## Number of Fisher Scoring iterations: 11

Odds Ratios

The model coefficients are on the log-odds scale, which is not directly interpretable. Exponentiating them converts to odds ratios, which are easier to reason about.

# Convert log-odds coefficients to odds ratios
exp(coef(model))
##               (Intercept)            data_use_score       data_services_score 
##              4.870975e-45              1.648510e+00              1.364555e+00 
## data_infrastructure_score 
##              1.373728e+00

All three predictors have odds ratios above 1, confirming that higher scores across each dimension are associated with increased probability of being classified as high-performing. This suggests that statistical capacity is multi-dimensional: no single sub-indicator drives the outcome alone, but rather all three contribute independently.


Part 2 — Coefficient Interpretation

The logistic regression model estimates the log-odds of a country being high-performing as a function of the three sub-indicator scores. All three predictors have positive coefficients and statistically significant p-values (all below 0.05), meaning each is positively associated with the probability of high performance when the others are held constant.

data_use_score (OR = 1.65, p = 0.029)

The coefficient for data_use_score is positive (log-odds = 0.50). The odds ratio of 1.65 means that for each one-unit increase in data use score, the odds of a country being classified as high-performing multiply by 1.65 (a 65% increase), holding data services and infrastructure scores constant. Put differently, as data use score increases, the probability of being classified as high-performing also increases — not just the odds. This is the largest effect among the three predictors, suggesting that how actively a country uses its data for decision-making is particularly important in distinguishing top performers.

data_services_score (OR = 1.36, p = 0.008)

The coefficient for data_services_score is also positive (log-odds = 0.31). Each one-unit increase in data services score is associated with a 36% increase in the odds of high performance, holding other variables constant. This reflects the importance of accessible, timely, and well-disseminated statistical outputs in reaching high overall performance.

data_infrastructure_score (OR = 1.37, p = 0.002)

The coefficient for data_infrastructure_score is positive (log-odds = 0.32). Each one-unit increase is associated with a 37% increase in the odds of high performance, holding other variables constant. This is the most precisely estimated effect (smallest p-value), suggesting that the foundational systems underlying a national statistical office are a consistent predictor of whether a country reaches top-tier performance.

The key insight from this section is that all three sub-indicators matter independently. Even after controlling for infrastructure and services, data use retains a significant positive association — suggesting that countries that invest in using their data, not just collecting it, are more likely to be top performers.


Part 3 — Confidence Interval for data_use_score

A 95% confidence interval for the data_use_score coefficient is constructed manually using the standard error from the model summary.

# Extract coefficient and standard error for data_use_score
beta <- coef(model)["data_use_score"]
se   <- summary(model)$coefficients["data_use_score", "Std. Error"]

# Construct 95% CI on log-odds scale
lower_log <- beta - 1.96 * se
upper_log <- beta + 1.96 * se

# Convert to odds ratio scale
lower_or <- exp(lower_log)
upper_or <- exp(upper_log)

cat("Log-odds coefficient:", round(beta, 4), "\n")
## Log-odds coefficient: 0.4999
cat("Standard Error:", round(se, 4), "\n")
## Standard Error: 0.2291
cat("95% CI (log-odds): [", round(lower_log, 4), ",", round(upper_log, 4), "]\n")
## 95% CI (log-odds): [ 0.0508 , 0.9489 ]
cat("95% CI (odds ratio): [", round(lower_or, 4), ",", round(upper_or, 4), "]\n")
## 95% CI (odds ratio): [ 1.0521 , 2.5829 ]

The 95% confidence interval for the log-odds coefficient of data_use_score is [0.05, 0.95]. This interval does not include zero, confirming that the association is statistically significant at the 0.05 level.

Converted to the odds ratio scale, the interval is approximately [1.05, 2.58]. This means that, with 95% confidence, a one-unit increase in data use score multiplies the odds of high performance by somewhere between 1.05 and 2.58, holding other variables constant. Even the lower bound of the interval represents a positive association, which reinforces the conclusion that data use is a meaningful predictor of top-tier statistical performance. The confidence interval reinforces that this effect is consistently positive and not attributable to sampling variability alone.


Discussion

The logistic regression model identifies all three sub-indicators as statistically significant positive predictors of whether a country achieves high performance (overall score above 85). Among the three, data_use_score has the largest point estimate (OR = 1.65), while data_infrastructure_score has the most precise estimate (lowest p-value = 0.002). Together, the model suggests that high-performing countries are distinguished not by any single dimension of statistical capacity, but by strength across use, services, and infrastructure simultaneously.

The McFadden pseudo-R² for this model is approximately 0.92, indicating a strong fit relative to a null model — though this metric is not directly comparable to the R² from linear regression and should be interpreted with caution. This unusually high value likely indicates near-complete separation, meaning the predictors almost perfectly distinguish high-performing countries in this dataset. While the model coefficients remain valid, the wide confidence interval for data_use_score (OR: 1.05 to 2.58) is consistent with this concern and suggests the estimates should be interpreted cautiously.

Limitations

  • Threshold choice: The 85-point cutoff for high_performance is somewhat arbitrary. A different threshold would change which countries are classified as high-performing and could affect the model’s coefficients and significance.
  • Separation concerns: The very high pseudo-R² suggests possible near-complete separation, where the predictors almost perfectly predict the outcome. This can inflate standard errors and make coefficient estimates unstable.
  • Single-year snapshot: Using only 2023 data satisfies the independence assumption but limits the generalizability of findings. It is not possible to assess whether these associations have strengthened or weakened over time.
  • Omitted variables: Economic development, regional context, and governance quality are likely associated with both the predictors and the outcome. Excluding them means the model may be picking up indirect effects rather than the direct contribution of each sub-indicator.

Further Questions

  • Does the association between data_use_score and high performance differ across income groups? An interaction term could test whether the effect of data use is stronger in high-income countries versus lower-income ones.
  • How sensitive are the results to the choice of 85 as the performance threshold? Testing alternative cutoffs (e.g., 80 or 90) would reveal how stable the model is.
  • Could a Poisson regression model using a count-based outcome — such as the number of sub-indicators above a given benchmark — provide a more nuanced picture of statistical performance than a binary classification?