This notebook builds on the logistic regression model from Week 10 by returning to a continuous outcome and applying tools from this week’s lab: model comparison (AIC, BIC, ANOVA) and multicollinearity diagnostics (VIF).
The response variable is data_products_score, which
measures the quality and availability of statistical outputs produced by
a country. Two sub-indicators are used as predictors:
data_use_score and data_services_score. Both
reflect distinct but related dimensions of statistical capacity, making
them natural candidates for a multiple linear regression model.
To simplify the analysis, only the 2023 cross-sectional snapshot is used. This avoids the repeated-measures complexity that would arise from using multiple years and ensures the independence assumption for linear regression is more plausible.
# Load libraries
library(tidyverse)
library(broom)
library(lindia)
library(car)
# Load dataset
dataset <- read_csv("dataset.csv")
## Rows: 4340 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): iso3c, country, region, income
## dbl (8): year, population, overall_score, data_use_score, data_services_scor...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Filter to 2023 and remove unclassified income group
dataset_clean <- dataset %>%
filter(year == 2023) %>%
filter(income != "Not classified") %>%
filter(
!is.na(data_products_score),
!is.na(data_use_score),
!is.na(data_services_score)
)
cat("Rows available for modeling:", nrow(dataset_clean), "\n")
## Rows available for modeling: 187
Two nested linear models are fit using lm(). The reduced
model uses only data_use_score as a predictor. The full
model adds data_services_score as a second predictor. Both
models use data_products_score as the response variable.
These models assume a linear relationship between each predictor and the
response.
# Reduced model: one predictor
model_reduced <- lm(data_products_score ~ data_use_score,
data = dataset_clean)
summary(model_reduced)
##
## Call:
## lm(formula = data_products_score ~ data_use_score, data = dataset_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.251 -3.974 1.105 5.388 14.860
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43.18895 2.38328 18.12 <2e-16 ***
## data_use_score 0.40293 0.02838 14.20 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.461 on 185 degrees of freedom
## Multiple R-squared: 0.5214, Adjusted R-squared: 0.5188
## F-statistic: 201.6 on 1 and 185 DF, p-value: < 2.2e-16
# Full model: two predictors
model_full <- lm(data_products_score ~ data_use_score + data_services_score,
data = dataset_clean)
summary(model_full)
##
## Call:
## lm(formula = data_products_score ~ data_use_score + data_services_score,
## data = dataset_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.283 -3.426 0.445 4.830 12.149
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.27610 2.22008 18.592 < 2e-16 ***
## data_use_score 0.27215 0.03446 7.898 2.49e-13 ***
## data_services_score 0.18287 0.03138 5.828 2.47e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.873 on 184 degrees of freedom
## Multiple R-squared: 0.596, Adjusted R-squared: 0.5916
## F-statistic: 135.7 on 2 and 184 DF, p-value: < 2.2e-16
Insight: Both data_use_score and
data_services_score have positive coefficients and
statistically significant p-values in the full model. This means that
higher scores on both predictors are associated with higher
data_products_score, even after accounting for the other
predictor.
Significance: This suggests that the dimensions of statistical capacity measured by these sub-indicators are connected. Countries that use data effectively and provide accessible data services also tend to produce higher-quality statistical outputs. This points to a broader pattern where statistical capacity is mutually reinforcing across dimensions.
Further Questions:
data_use_score and data_services_score
too similar to each other to be used as separate predictors? This will
be examined in the diagnostics section.data_products_score?Diagnostic plots are produced using gg_diagnose() from
the lindia package. These plots check the core assumptions
of linear regression: linearity, constant variance (homoscedasticity),
normality of residuals, and the presence of influential
observations.
# Diagnostic plots for the full model
gg_diagnose(model_full)
## Warning: `fortify(<lm>)` was deprecated in ggplot2 4.0.0.
## ℹ Please use `broom::augment(<lm>)` instead.
## ℹ The deprecated feature was likely used in the lindia package.
## Please report the issue at <https://github.com/yeukyul/lindia/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the lindia package.
## Please report the issue at <https://github.com/yeukyul/lindia/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
The Variance Inflation Factor (VIF) is used to check for multicollinearity between the two predictors. A VIF above 10 would indicate a problematic level of collinearity.
# Check VIF for the full model
vif(model_full)
## data_use_score data_services_score
## 1.736664 1.736664
Insight: The diagnostic plots show that residuals are approximately normally distributed with no severe heteroscedasticity. The VIF values for both predictors are well below 10, indicating that multicollinearity is not a concern here. The two predictors, while related, are contributing sufficiently independent information to the model. While no major violations are present, the slight curvature in the residuals suggests potential model misspecification and that the model may not fully capture nonlinear relationships in the data.
Significance: The absence of major assumption violations means that the inference drawn from this model — p-values, confidence intervals, and coefficient estimates — is more reliable. Because assumptions are reasonably met, we can be more confident in the inference from this model. If the VIF had been high, the standard errors for the coefficients would have been inflated, making it harder to detect real effects.
Further Questions:
The focus here is on the data_use_score coefficient from
the full model.
# Extract and display the full model coefficients
tidy(model_full)
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 41.3 2.22 18.6 4.12e-44
## 2 data_use_score 0.272 0.0345 7.90 2.49e-13
## 3 data_services_score 0.183 0.0314 5.83 2.47e- 8
The estimated coefficient for data_use_score is
approximately 0.272. This means that for each one-unit increase in
data_use_score, the predicted
data_products_score increases by about 0.272 points,
holding data_services_score constant.
The p-value for this coefficient is well below 0.05, indicating that
this association is statistically significant. The intercept alone does
not have a meaningful real-world interpretation here, since a
data_use_score of zero is outside the observed range of the
data.
Insight: Higher data use scores are associated with higher data product scores. Even after controlling for data services, the relationship between how actively a country uses its data and the quality of its statistical outputs remains positive and significant.
Significance: This suggests a practical linkage between data usage and statistical outputs. Countries that invest in applying their data — for policy, planning, and decision-making — tend to also produce better statistical products. This is consistent with the idea that data use creates demand and feedback loops that improve data quality over time.
Further Questions:
The two nested models are compared using AIC, BIC, and an ANOVA
likelihood ratio test. These tools address the question of whether
adding data_services_score to the model meaningfully
improves fit, or whether the simpler model is sufficient.
# AIC comparison
cat("Reduced model AIC:", round(AIC(model_reduced), 2), "\n")
## Reduced model AIC: 1286.28
cat("Full model AIC: ", round(AIC(model_full), 2), "\n")
## Full model AIC: 1256.6
# BIC comparison
cat("Reduced model BIC:", round(BIC(model_reduced), 2), "\n")
## Reduced model BIC: 1295.97
cat("Full model BIC: ", round(BIC(model_full), 2), "\n")
## Full model BIC: 1269.52
# ANOVA test between nested models
anova(model_reduced, model_full)
## Analysis of Variance Table
##
## Model 1: data_products_score ~ data_use_score
## Model 2: data_products_score ~ data_use_score + data_services_score
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 185 10297.2
## 2 184 8692.6 1 1604.5 33.964 2.469e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Insight: Both AIC and BIC are lower for the full
model, and the ANOVA test produces a statistically significant p-value.
This means that adding data_services_score as a second
predictor meaningfully reduces the residual variance — it is not just
adding noise to the model.
Significance: The model comparison confirms that the
full model is the better choice. AIC and BIC penalize models for
additional parameters, so the fact that the full model still wins
despite having one more predictor indicates that
data_services_score is genuinely contributing explanatory
power. The ANOVA result reinforces this: the reduction in residual sum
of squares when adding the second predictor is larger than what would be
expected by chance alone.
Further Questions:
data_infrastructure_score) that might further improve the
model?step() help
identify the optimal set?The full linear model — predicting data_products_score
from data_use_score and data_services_score —
performs well across all checks. Both predictors are statistically
significant, the diagnostic plots do not reveal major assumption
violations, VIF values are acceptable, and the full model outperforms
the reduced model on AIC, BIC, and ANOVA.
The most practically meaningful finding is that data use and data services together explain a meaningful share of the variation in data product quality across countries. This is consistent with a view of statistical capacity as systemic: countries that score well on one dimension of capacity tend to score well on others, and these dimensions reinforce each other.
data_services_score. If missingness is
not random (e.g., lower-capacity countries are more likely to be
missing), the model may not represent the full range of countries.income as a categorical covariate
change the estimated slopes for data_use_score and
data_services_score?