This notebook extends the regression analysis from previous weeks by introducing a generalized linear model (GLM), specifically logistic regression, to model a binary outcome.
Prior analyses focused on continuous outcomes using linear regression
— predicting overall_score from sub-indicator scores and
income group. Logistic regression is needed here because the response
variable is binary. Rather than modeling a score, the goal is to model
the probability that a country is high-performing,
defined as having an overall_score above 85.
This threshold was chosen because it represents a meaningfully high
level of statistical performance, well above the dataset mean, and
results in a reasonably balanced split: 48 high-performing countries and
138 that fall below the threshold. The three predictors used are
data_use_score, data_services_score, and
data_infrastructure_score — sub-indicators that reflect
distinct dimensions of a country’s statistical capacity and were not
directly used as outcome variables in prior models.
The dataset is the World Bank Statistical Performance Indicators dataset, covering 217 countries from 2004 to 2023. The analysis uses only the 2023 cross-sectional snapshot.
# Load libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
# Load dataset
dataset <- read_csv("dataset.csv")
## Rows: 4340 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): iso3c, country, region, income
## dbl (8): year, population, overall_score, data_use_score, data_services_scor...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Filter to 2023, remove unclassified income group, create binary outcome
data_2023 <- dataset %>%
filter(year == 2023) %>%
filter(income != "Not classified") %>%
mutate(high_performance = ifelse(overall_score > 85, 1, 0))
cat("Rows in 2023 snapshot:", nrow(data_2023), "\n")
## Rows in 2023 snapshot: 216
cat("high_performance = 1:", sum(data_2023$high_performance, na.rm = TRUE), "\n")
## high_performance = 1: 48
cat("high_performance = 0:", sum(data_2023$high_performance == 0, na.rm = TRUE), "\n")
## high_performance = 0: 138
# Retain only rows with no missing values across the variables used in the model
model_data <- data_2023 %>%
filter(
!is.na(high_performance),
!is.na(data_use_score),
!is.na(data_services_score),
!is.na(data_infrastructure_score)
)
cat("Rows available for modeling:", nrow(model_data), "\n")
## Rows available for modeling: 186
A logistic regression model is fit using glm() with
family = "binomial". The outcome is
high_performance, and the three predictors are
data_use_score, data_services_score, and
data_infrastructure_score. Each of these sub-indicators
measures a different aspect of statistical capacity: how data is used in
policy and decision-making, the quality and accessibility of data
services, and the strength of the underlying statistical
infrastructure.
# Fit logistic regression model
model <- glm(
high_performance ~ data_use_score + data_services_score + data_infrastructure_score,
data = model_data,
family = "binomial"
)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model)
##
## Call:
## glm(formula = high_performance ~ data_use_score + data_services_score +
## data_infrastructure_score, family = "binomial", data = model_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -102.0330 36.8884 -2.766 0.00567 **
## data_use_score 0.4999 0.2291 2.182 0.02912 *
## data_services_score 0.3108 0.1177 2.640 0.00828 **
## data_infrastructure_score 0.3175 0.1037 3.063 0.00219 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 212.42 on 185 degrees of freedom
## Residual deviance: 17.07 on 182 degrees of freedom
## AIC: 25.07
##
## Number of Fisher Scoring iterations: 11
The model coefficients are on the log-odds scale, which is not directly interpretable. Exponentiating them converts to odds ratios, which are easier to reason about.
# Convert log-odds coefficients to odds ratios
exp(coef(model))
## (Intercept) data_use_score data_services_score
## 4.870975e-45 1.648510e+00 1.364555e+00
## data_infrastructure_score
## 1.373728e+00
All three predictors have odds ratios above 1, confirming that higher scores across each dimension are associated with increased probability of being classified as high-performing. This suggests that statistical capacity is multi-dimensional: no single sub-indicator drives the outcome alone, but rather all three contribute independently.
The logistic regression model estimates the log-odds of a country being high-performing as a function of the three sub-indicator scores. All three predictors have positive coefficients and statistically significant p-values (all below 0.05), meaning each is positively associated with the probability of high performance when the others are held constant.
data_use_score (OR = 1.65, p =
0.029)
The coefficient for data_use_score is positive (log-odds
= 0.50). The odds ratio of 1.65 means that for each one-unit increase in
data use score, the odds of a country being classified as
high-performing multiply by 1.65 (a 65% increase), holding data services
and infrastructure scores constant. Put differently, as data use score
increases, the probability of being classified as high-performing also
increases — not just the odds. This is the largest effect among the
three predictors, suggesting that how actively a country uses its data
for decision-making is particularly important in distinguishing top
performers.
data_services_score (OR = 1.36, p =
0.008)
The coefficient for data_services_score is also positive
(log-odds = 0.31). Each one-unit increase in data services score is
associated with a 36% increase in the odds of high performance, holding
other variables constant. This reflects the importance of accessible,
timely, and well-disseminated statistical outputs in reaching high
overall performance.
data_infrastructure_score (OR = 1.37, p =
0.002)
The coefficient for data_infrastructure_score is
positive (log-odds = 0.32). Each one-unit increase is associated with a
37% increase in the odds of high performance, holding other variables
constant. This is the most precisely estimated effect (smallest
p-value), suggesting that the foundational systems underlying a national
statistical office are a consistent predictor of whether a country
reaches top-tier performance.
The key insight from this section is that all three sub-indicators matter independently. Even after controlling for infrastructure and services, data use retains a significant positive association — suggesting that countries that invest in using their data, not just collecting it, are more likely to be top performers.
data_use_scoreA 95% confidence interval for the data_use_score
coefficient is constructed manually using the standard error from the
model summary.
# Extract coefficient and standard error for data_use_score
beta <- coef(model)["data_use_score"]
se <- summary(model)$coefficients["data_use_score", "Std. Error"]
# Construct 95% CI on log-odds scale
lower_log <- beta - 1.96 * se
upper_log <- beta + 1.96 * se
# Convert to odds ratio scale
lower_or <- exp(lower_log)
upper_or <- exp(upper_log)
cat("Log-odds coefficient:", round(beta, 4), "\n")
## Log-odds coefficient: 0.4999
cat("Standard Error:", round(se, 4), "\n")
## Standard Error: 0.2291
cat("95% CI (log-odds): [", round(lower_log, 4), ",", round(upper_log, 4), "]\n")
## 95% CI (log-odds): [ 0.0508 , 0.9489 ]
cat("95% CI (odds ratio): [", round(lower_or, 4), ",", round(upper_or, 4), "]\n")
## 95% CI (odds ratio): [ 1.0521 , 2.5829 ]
The 95% confidence interval for the log-odds coefficient of
data_use_score is [0.05, 0.95]. This interval does not
include zero, confirming that the association is statistically
significant at the 0.05 level.
Converted to the odds ratio scale, the interval is approximately [1.05, 2.58]. This means that, with 95% confidence, a one-unit increase in data use score multiplies the odds of high performance by somewhere between 1.05 and 2.58, holding other variables constant. Even the lower bound of the interval represents a positive association, which reinforces the conclusion that data use is a meaningful predictor of top-tier statistical performance. The confidence interval reinforces that this effect is consistently positive and not attributable to sampling variability alone.
The logistic regression model identifies all three sub-indicators as
statistically significant positive predictors of whether a country
achieves high performance (overall score above 85). Among the three,
data_use_score has the largest point estimate (OR = 1.65),
while data_infrastructure_score has the most precise
estimate (lowest p-value = 0.002). Together, the model suggests that
high-performing countries are distinguished not by any single dimension
of statistical capacity, but by strength across use, services, and
infrastructure simultaneously.
The McFadden pseudo-R² for this model is approximately 0.92,
indicating a strong fit relative to a null model — though this metric is
not directly comparable to the R² from linear regression and should be
interpreted with caution. This unusually high value likely indicates
near-complete separation, meaning the predictors almost perfectly
distinguish high-performing countries in this dataset. While the model
coefficients remain valid, the wide confidence interval for
data_use_score (OR: 1.05 to 2.58) is consistent with this
concern and suggests the estimates should be interpreted cautiously.
high_performance is somewhat arbitrary. A different
threshold would change which countries are classified as high-performing
and could affect the model’s coefficients and significance.data_use_score and high
performance differ across income groups? An interaction term could test
whether the effect of data use is stronger in high-income countries
versus lower-income ones.