Week 11 Data Dive - Generalized Linear Models (Part 2)

Introduction

This notebook builds on the logistic regression model from Week 10 by returning to a continuous outcome and applying tools from this week’s lab: model comparison (AIC, BIC, ANOVA) and multicollinearity diagnostics (VIF).

The response variable is data_products_score, which measures the quality and availability of statistical outputs produced by a country. Two sub-indicators are used as predictors: data_use_score and data_services_score. Both reflect distinct but related dimensions of statistical capacity, making them natural candidates for a multiple linear regression model.

To simplify the analysis, only the 2023 cross-sectional snapshot is used. This avoids the repeated-measures complexity that would arise from using multiple years and ensures the independence assumption for linear regression is more plausible.


Data Preparation

# Load libraries
library(tidyverse)
library(broom)
library(lindia)
library(car)
# Load dataset
dataset <- read_csv("dataset.csv")
## Rows: 4340 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): iso3c, country, region, income
## dbl (8): year, population, overall_score, data_use_score, data_services_scor...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Filter to 2023 and remove unclassified income group
dataset_clean <- dataset %>%
  filter(year == 2023) %>%
  filter(income != "Not classified") %>%
  filter(
    !is.na(data_products_score),
    !is.na(data_use_score),
    !is.na(data_services_score)
  )

cat("Rows available for modeling:", nrow(dataset_clean), "\n")
## Rows available for modeling: 187

Part 1 — Model Building

Two nested linear models are fit using lm(). The reduced model uses only data_use_score as a predictor. The full model adds data_services_score as a second predictor. Both models use data_products_score as the response variable. These models assume a linear relationship between each predictor and the response.

# Reduced model: one predictor
model_reduced <- lm(data_products_score ~ data_use_score,
                    data = dataset_clean)

summary(model_reduced)
## 
## Call:
## lm(formula = data_products_score ~ data_use_score, data = dataset_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.251  -3.974   1.105   5.388  14.860 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    43.18895    2.38328   18.12   <2e-16 ***
## data_use_score  0.40293    0.02838   14.20   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.461 on 185 degrees of freedom
## Multiple R-squared:  0.5214, Adjusted R-squared:  0.5188 
## F-statistic: 201.6 on 1 and 185 DF,  p-value: < 2.2e-16
# Full model: two predictors
model_full <- lm(data_products_score ~ data_use_score + data_services_score,
                 data = dataset_clean)

summary(model_full)
## 
## Call:
## lm(formula = data_products_score ~ data_use_score + data_services_score, 
##     data = dataset_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.283  -3.426   0.445   4.830  12.149 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         41.27610    2.22008  18.592  < 2e-16 ***
## data_use_score       0.27215    0.03446   7.898 2.49e-13 ***
## data_services_score  0.18287    0.03138   5.828 2.47e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.873 on 184 degrees of freedom
## Multiple R-squared:  0.596,  Adjusted R-squared:  0.5916 
## F-statistic: 135.7 on 2 and 184 DF,  p-value: < 2.2e-16

Insight: Both data_use_score and data_services_score have positive coefficients and statistically significant p-values in the full model. This means that higher scores on both predictors are associated with higher data_products_score, even after accounting for the other predictor.

Significance: This suggests that the dimensions of statistical capacity measured by these sub-indicators are connected. Countries that use data effectively and provide accessible data services also tend to produce higher-quality statistical outputs. This points to a broader pattern where statistical capacity is mutually reinforcing across dimensions.

Further Questions:

  • Are data_use_score and data_services_score too similar to each other to be used as separate predictors? This will be examined in the diagnostics section.
  • Is one predictor more important than the other in explaining variation in data_products_score?

Part 2 — Model Diagnostics

Diagnostic plots are produced using gg_diagnose() from the lindia package. These plots check the core assumptions of linear regression: linearity, constant variance (homoscedasticity), normality of residuals, and the presence of influential observations.

# Diagnostic plots for the full model
gg_diagnose(model_full)
## Warning: `fortify(<lm>)` was deprecated in ggplot2 4.0.0.
## ℹ Please use `broom::augment(<lm>)` instead.
## ℹ The deprecated feature was likely used in the lindia package.
##   Please report the issue at <https://github.com/yeukyul/lindia/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the lindia package.
##   Please report the issue at <https://github.com/yeukyul/lindia/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

The Variance Inflation Factor (VIF) is used to check for multicollinearity between the two predictors. A VIF above 10 would indicate a problematic level of collinearity.

# Check VIF for the full model
vif(model_full)
##      data_use_score data_services_score 
##            1.736664            1.736664

Insight: The diagnostic plots show that residuals are approximately normally distributed with no severe heteroscedasticity. The VIF values for both predictors are well below 10, indicating that multicollinearity is not a concern here. The two predictors, while related, are contributing sufficiently independent information to the model. While no major violations are present, the slight curvature in the residuals suggests potential model misspecification and that the model may not fully capture nonlinear relationships in the data.

Significance: The absence of major assumption violations means that the inference drawn from this model — p-values, confidence intervals, and coefficient estimates — is more reliable. Because assumptions are reasonably met, we can be more confident in the inference from this model. If the VIF had been high, the standard errors for the coefficients would have been inflated, making it harder to detect real effects.

Further Questions:

  • The residual vs. fitted plot may show a slight pattern at the extremes of the fitted values. Would a transformation of the response variable (e.g., log or square root) improve the fit?
  • Are there any high-leverage or influential observations that are disproportionately driving the model estimates?

Part 3 — Coefficient Interpretation

The focus here is on the data_use_score coefficient from the full model.

# Extract and display the full model coefficients
tidy(model_full)
## # A tibble: 3 × 5
##   term                estimate std.error statistic  p.value
##   <chr>                  <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)           41.3      2.22       18.6  4.12e-44
## 2 data_use_score         0.272    0.0345      7.90 2.49e-13
## 3 data_services_score    0.183    0.0314      5.83 2.47e- 8

The estimated coefficient for data_use_score is approximately 0.272. This means that for each one-unit increase in data_use_score, the predicted data_products_score increases by about 0.272 points, holding data_services_score constant.

The p-value for this coefficient is well below 0.05, indicating that this association is statistically significant. The intercept alone does not have a meaningful real-world interpretation here, since a data_use_score of zero is outside the observed range of the data.

Insight: Higher data use scores are associated with higher data product scores. Even after controlling for data services, the relationship between how actively a country uses its data and the quality of its statistical outputs remains positive and significant.

Significance: This suggests a practical linkage between data usage and statistical outputs. Countries that invest in applying their data — for policy, planning, and decision-making — tend to also produce better statistical products. This is consistent with the idea that data use creates demand and feedback loops that improve data quality over time.

Further Questions:

  • Is this relationship causal, or are both variables jointly driven by a third factor such as national income or governance quality?
  • Are there missing confounders — such as regional context or institutional capacity — that, if included, would change or reduce this estimated effect?

Part 4 — Model Comparison

The two nested models are compared using AIC, BIC, and an ANOVA likelihood ratio test. These tools address the question of whether adding data_services_score to the model meaningfully improves fit, or whether the simpler model is sufficient.

# AIC comparison
cat("Reduced model AIC:", round(AIC(model_reduced), 2), "\n")
## Reduced model AIC: 1286.28
cat("Full model AIC:   ", round(AIC(model_full), 2), "\n")
## Full model AIC:    1256.6
# BIC comparison
cat("Reduced model BIC:", round(BIC(model_reduced), 2), "\n")
## Reduced model BIC: 1295.97
cat("Full model BIC:   ", round(BIC(model_full), 2), "\n")
## Full model BIC:    1269.52
# ANOVA test between nested models
anova(model_reduced, model_full)
## Analysis of Variance Table
## 
## Model 1: data_products_score ~ data_use_score
## Model 2: data_products_score ~ data_use_score + data_services_score
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1    185 10297.2                                  
## 2    184  8692.6  1    1604.5 33.964 2.469e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Insight: Both AIC and BIC are lower for the full model, and the ANOVA test produces a statistically significant p-value. This means that adding data_services_score as a second predictor meaningfully reduces the residual variance — it is not just adding noise to the model.

Significance: The model comparison confirms that the full model is the better choice. AIC and BIC penalize models for additional parameters, so the fact that the full model still wins despite having one more predictor indicates that data_services_score is genuinely contributing explanatory power. The ANOVA result reinforces this: the reduction in residual sum of squares when adding the second predictor is larger than what would be expected by chance alone.

Further Questions:

  • Are there other sub-indicators in the dataset (e.g., data_infrastructure_score) that might further improve the model?
  • At what point does adding more predictors stop improving the model, and could backward stepwise selection using step() help identify the optimal set?

Discussion

The full linear model — predicting data_products_score from data_use_score and data_services_score — performs well across all checks. Both predictors are statistically significant, the diagnostic plots do not reveal major assumption violations, VIF values are acceptable, and the full model outperforms the reduced model on AIC, BIC, and ANOVA.

The most practically meaningful finding is that data use and data services together explain a meaningful share of the variation in data product quality across countries. This is consistent with a view of statistical capacity as systemic: countries that score well on one dimension of capacity tend to score well on others, and these dimensions reinforce each other.


Limitations

  • Single-year snapshot: Using only 2023 data simplifies the analysis but limits generalizability. Trends over time are not captured.
  • Missing observations: 29 countries were dropped due to missing values in data_services_score. If missingness is not random (e.g., lower-capacity countries are more likely to be missing), the model may not represent the full range of countries.
  • Linearity assumption: The model assumes a linear relationship between each predictor and the response. This was not formally tested and may not hold across the full range of the data.
  • Omitted variables: Factors such as income level, regional context, and governance quality are likely correlated with all three variables and are not included in the model.

Further Questions

  • Would including income as a categorical covariate change the estimated slopes for data_use_score and data_services_score?
  • Could a Poisson or other GLM family be more appropriate if the response variable were reframed as a count or proportion?
  • How stable are these estimates across different years in the dataset? A panel model accounting for country-level fixed effects might provide more reliable inference.