Week 9 Data Dive — Regression Diagnostics

Introduction

This notebook extends the regression analysis from Week 8 by expanding the simple linear regression model and applying diagnostic tools to evaluate whether the model’s assumptions are reasonably met.

In Week 8, a simple linear regression was fit using data_products_score as the sole predictor of overall_score. That model explained approximately 67% of the variation in overall statistical performance across countries. This week, 1–3 additional variables are introduced to improve the model, and the 5 diagnostic plots discussed in class are used to assess its validity. The central goal is to ensure that any conclusions drawn from the model rest on a valid foundation.

The dataset is the World Bank Statistical Performance Indicators dataset, a longitudinal country-level dataset covering 217 countries from 2004 to 2023. The analysis continues to use the 2023 cross-sectional snapshot.

Data Preparation

# Load libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lindia)
library(GGally)
library(broom)
# Load dataset
dataset <- read_csv("dataset.csv")
## Rows: 4340 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): iso3c, country, region, income
## dbl (8): year, population, overall_score, data_use_score, data_services_scor...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Filter to 2023, remove unclassified income group, create binary variable
data_2023 <- dataset %>%
  filter(year == 2023) %>%
  filter(income != "Not classified") %>%
  mutate(high_income = ifelse(income == "High income", 1, 0))

cat("Rows in 2023 snapshot:", nrow(data_2023), "\n")
## Rows in 2023 snapshot: 216
# Retain only rows with no missing values across the variables used in the model
model_data <- data_2023 %>%
  filter(
    !is.na(overall_score),
    !is.na(data_products_score),
    !is.na(data_infrastructure_score)
  )

cat("Rows available for modeling:", nrow(model_data), "\n")
## Rows available for modeling: 186

Part 1 — Building the Expanded Model

Variable Selection

The Week 8 model used data_products_score as a single predictor. The assignment asks to include 1–3 additional variables, trying at least one binary or interaction term. Three candidate variables were considered.

data_infrastructure_score (included)

This sub-indicator measures the strength of a country’s statistical infrastructure, including civil registration systems, geospatial frameworks, and administrative data systems. It has a strong correlation with overall_score (r = 0.912) and a moderate correlation with data_products_score (r = 0.699). The correlation between the two predictors is moderate, not severe, so both can reasonably be included together. Multicollinearity is worth monitoring but does not rise to a level that warrants exclusion.

high_income binary variable (included)

A binary variable was created from the income column: high_income = 1 for High income countries and high_income = 0 for all others. This captures the structural economic divide identified in the Week 8 ANOVA, where High income countries scored significantly higher than all other groups. It has a moderate correlation with overall_score (r = 0.464) and a low correlation with data_products_score (r = 0.130), so multicollinearity is not a concern here.

data_use_score (excluded)

This sub-indicator was considered but excluded due to multicollinearity concerns. Its correlation with data_products_score is r = 0.736, the highest among all remaining sub-scores. Including both in the same model would introduce redundancy that complicates the interpretation of each coefficient. Since data_infrastructure_score offers comparable predictive value with lower collinearity, data_use_score was set aside.

Correlation Check

Before fitting the model, a correlation heatmap is used to visualize the relationships among the three predictors and confirm that the multicollinearity assessment above is reasonable.

# Correlation heatmap of predictors
ggcorr(
  select(model_data, data_products_score, data_infrastructure_score, high_income),
  label = TRUE,
  label_round = 2
) +
  labs(title = "Correlation Heatmap — Model Predictors")

The heatmap confirms that data_products_score and data_infrastructure_score have a moderate positive correlation (r ≈ 0.70), while high_income shows low correlations with both continuous predictors. This pattern is consistent with the expectations discussed above. The moderate correlation between the two continuous predictors is worth noting, but is not severe enough to invalidate the model.

Interaction Term

The prompt asks to try either a binary or an interaction term. An interaction term between data_products_score and high_income was tested to explore whether the slope of the data products–overall score relationship differs for High income countries versus all others.

# Test interaction term
lm_interact <- lm(overall_score ~ data_products_score + data_infrastructure_score +
                    high_income + data_products_score:high_income,
                  data = model_data)

tidy(lm_interact, conf.int = TRUE)
## # A tibble: 5 × 7
##   term                  estimate std.error statistic  p.value conf.low conf.high
##   <chr>                    <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)            -19.6      4.11       -4.78 3.67e- 6  -27.7     -11.5  
## 2 data_products_score      0.798    0.0648     12.3  1.04e-25    0.670     0.926
## 3 data_infrastructure_…    0.412    0.0282     14.6  1.64e-32    0.357     0.468
## 4 high_income             25.5      5.65        4.52 1.11e- 5   14.4      36.7  
## 5 data_products_score:…   -0.278    0.0724     -3.84 1.70e- 4   -0.421    -0.135

The interaction term data_products_score:high_income has a p-value well above 0.05, indicating insufficient evidence that the slope of data_products_score differs between High income and other countries. Including it does not meaningfully improve the model and adds unnecessary complexity. For these reasons, the interaction term is excluded from the final model.

Final Model

Based on the variable selection discussion above, the final model includes three terms: data_products_score (retained from Week 8), data_infrastructure_score (new continuous variable), and high_income (new binary variable).

# Fit the final regression model
lm_model <- lm(overall_score ~ data_products_score + data_infrastructure_score + high_income,
               data = model_data)

summary(lm_model)
## 
## Call:
## lm(formula = overall_score ~ data_products_score + data_infrastructure_score + 
##     high_income, data = model_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.4645  -2.6528  -0.4722   2.7818  14.8517 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -9.30681    3.22271  -2.888  0.00435 ** 
## data_products_score        0.66217    0.05631  11.759  < 2e-16 ***
## data_infrastructure_score  0.41119    0.02922  14.070  < 2e-16 ***
## high_income                4.19840    1.04793   4.006 8.98e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.271 on 182 degrees of freedom
## Multiple R-squared:  0.9044, Adjusted R-squared:  0.9029 
## F-statistic: 574.2 on 3 and 182 DF,  p-value: < 2.2e-16

Coefficient Interpretation

The regression equation is:

\[\widehat{\text{Overall Score}} = -9.31 + 0.66 \times \text{Data Products Score} + 0.41 \times \text{Data Infrastructure Score} + 4.20 \times \text{High Income}\]

Intercept (−9.31): The predicted overall_score when all predictors are zero. This has no practical meaning since no country has a score of zero on either sub-indicator, but it serves as the mathematical anchor for the regression surface.

data_products_score (0.66): Holding the other variables constant, each 1-point increase in data products score is associated with an average increase of 0.66 points in overall score. Note that this slope is smaller than the 1.33 estimated in the Week 8 simple regression. This is expected: when data_infrastructure_score is added to the model, it absorbs some of the explanatory work previously attributed to data_products_score alone.

data_infrastructure_score (0.41): Holding the other variables constant, each 1-point increase in data infrastructure score is associated with an average increase of 0.41 points in overall score. This confirms that infrastructure has an independent contribution to overall performance beyond what data products alone capture.

high_income (4.20): Controlling for both sub-indicator scores, High income countries score approximately 4.2 points higher on average than non-High income countries. This residual income effect is smaller than the raw group difference seen in the Week 8 ANOVA (≈ 25 points), which makes sense: much of that gap was already explained by the sub-indicator scores themselves. The remaining 4.2 points may reflect structural advantages in governance, institutional stability, or data funding not captured by the two sub-scores.

Model Fit

The model’s R² = 0.90, meaning the three predictors together explain approximately 90% of the variation in overall_score across countries. This is a substantial improvement over the Week 8 simple regression (R² = 0.67), confirming that data_infrastructure_score and high_income add meaningful explanatory power. The adjusted R² = 0.90 as well, indicating that the improvement is not simply a consequence of adding more variables.


Part 2 — Model Diagnostics

Running a regression model is not sufficient on its own. The validity of the model’s conclusions depends on whether key assumptions are reasonably satisfied. Five diagnostic plots are used here to evaluate the model systematically.

1. Residuals vs. Fitted Values

# Residuals vs. fitted values
gg_resfitted(lm_model) +
  geom_smooth(se = FALSE, color = "steelblue", linewidth = 0.8) +
  labs(title = "Residuals vs. Fitted Values")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the lindia package.
##   Please report the issue at <https://github.com/yeukyul/lindia/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

What to look for: If the linearity assumption holds and errors have constant variance (homoscedasticity), the residuals should be randomly scattered around zero with no discernible pattern across the range of fitted values.

Observation: The residuals appear roughly centered around zero and do not show a strong systematic curve. The smoothing line stays relatively close to zero across most of the fitted value range, which is consistent with the linearity assumption being approximately met. There is some mild spread variation at the lower and upper ends of the fitted range, suggesting a slight concern with constant variance, though it does not appear severe.

Confidence: Moderate to high confidence that linearity holds. The homoscedasticity assumption is plausible but should be further assessed in the scale-location context using the residual histogram and QQ-plot below.

2. Residuals vs. X Values

# Residuals vs. each predictor
plots <- gg_resX(lm_model, plot.all = FALSE)

plots$data_products_score +
  geom_smooth(se = FALSE, color = "steelblue", linewidth = 0.8) +
  labs(title = "Residuals vs. Data Products Score")

plots$data_infrastructure_score +
  geom_smooth(se = FALSE, color = "steelblue", linewidth = 0.8) +
  labs(title = "Residuals vs. Data Infrastructure Score")

What to look for: Residuals should be randomly scattered with no trend as each predictor increases. A systematic pattern (e.g., curvature or fanning) would suggest that the linearity assumption is violated for that variable, or that its variance is not constant.

Observation for data_products_score: Residuals are generally scattered without a strong directional trend, supporting the assumption that the relationship between this predictor and overall_score is approximately linear. There may be slightly more spread in the middle of the range, but no severe pattern is apparent.

Observation for data_infrastructure_score: The residuals also appear reasonably scattered, though there is some visible increase in spread toward higher values of this predictor. This is a mild concern consistent with what was observed in the residuals vs. fitted plot, and suggests the homoscedasticity assumption may be modestly strained at higher infrastructure scores.

Confidence: Moderate confidence in linearity for both predictors. The mild heteroscedasticity at higher values of data_infrastructure_score warrants acknowledgment but does not appear severe enough to invalidate the model.

3. Correlation Heatmap

# Correlation heatmap (re-examined as a diagnostic tool)
ggcorr(
  select(model_data, data_products_score, data_infrastructure_score, high_income),
  label = TRUE,
  label_round = 2
) +
  labs(title = "Correlation Heatmap — Multicollinearity Diagnostic")

What to look for: High correlations between predictor variables (typically |r| > 0.80) indicate multicollinearity, which can inflate standard errors and make individual coefficients unreliable, even when the overall model fit is strong.

Observation: The two continuous predictors, data_products_score and data_infrastructure_score, show a moderate positive correlation of r ≈ 0.70. This is noticeable but below the threshold typically considered problematic. The high_income binary variable shows low correlations with both continuous predictors (r ≈ 0.13 and r ≈ 0.52, respectively). There is no pair here with a correlation above 0.80, so severe multicollinearity does not appear to be present in this model.

Confidence: Moderate to high confidence that multicollinearity is not a significant issue. The moderate correlation between the two continuous sub-scores is worth monitoring, particularly if additional sub-indicators were added to the model in a future expansion.

4. Residual Histogram

# Histogram of residuals
gg_reshist(lm_model) +
  labs(title = "Histogram of Residuals")
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

What to look for: The normality of errors assumption requires that residuals follow an approximately normal distribution. In a histogram, this would appear as a roughly symmetric, bell-shaped distribution centered at zero.

Observation: The histogram of residuals is roughly symmetric and centered near zero, with a shape that is approximately bell-shaped. There is a slight suggestion of a longer right tail compared to the left, which may reflect a handful of countries where the model underestimates overall performance. The departure from perfect normality is mild.

Confidence: Moderate to high confidence that the normality assumption is reasonably met. The slight right skew is not severe enough to raise serious concerns about the validity of the model’s inference, but it suggests that some dimension of statistical performance is not fully captured by the three predictors currently in the model.

5. QQ-Plot

# Normal QQ plot of residuals
gg_qqplot(lm_model) +
  labs(title = "Normal QQ-Plot of Residuals")

What to look for: In a QQ-plot, residuals that are perfectly normally distributed fall exactly on the diagonal reference line. Departures from the line, particularly in the tails, indicate deviations from normality. As emphasized in the lab materials, QQ-plots are sensitive and can flag deviations that are not practically meaningful, so conclusions here should be interpreted alongside the histogram.

Observation: The residuals track the reference line closely through the middle of the distribution. There is some deviation in the upper tail, where a small number of points fall above the line. This is consistent with the slight right skew observed in the histogram. The deviations are confined to the tails and do not represent a wholesale departure from normality.

Confidence: Moderate confidence that residuals are normal enough for the model’s inferences to be trustworthy. The tail deviations are mild and consistent across both the histogram and QQ-plot, which together suggest the normality assumption is approximately, if not perfectly, satisfied.

Cook’s Distance

# Cook's distance — identify high-influence observations
gg_cooksd(lm_model, threshold = "matlab") +
  labs(title = "Cook's Distance by Observation")

What to look for: Cook’s distance measures the influence of each individual observation on the fitted model. Observations above the threshold line are considered potentially influential and may be worth investigating further.

Observation: A small number of observations are flagged as potentially influential. These are worth examining to determine whether they represent genuine outliers (e.g., countries with unusual combinations of sub-indicator scores) or data entry concerns. It is important to note that influence alone is not sufficient reason to remove an observation. These points should be investigated in context before any decision about their treatment is made.

Significance: If any of these influential observations come from countries with atypical statistical systems or data anomalies, their presence could be pulling the regression line in ways that do not represent the broader pattern. A useful next step would be to identify which countries these rows correspond to and examine whether they share any characteristics.


Discussion

Expanding the Week 8 model from one predictor to three improved the fit substantially, with R² rising from 0.67 to 0.90. The added variables, data_infrastructure_score and the high_income binary, each contribute independent explanatory power and were included for theoretically grounded reasons.

The diagnostic results are generally encouraging. The linearity assumption appears approximately satisfied for both continuous predictors. The residuals are roughly normally distributed, with only mild deviations in the upper tail. Multicollinearity is present at a moderate level between the two continuous sub-scores, but does not appear severe enough to undermine the model. The most consistent finding across the diagnostic plots is a mild increase in residual spread at higher fitted values and higher values of data_infrastructure_score, suggesting that the homoscedasticity assumption is not perfectly met.

Together, these diagnostics support the conclusion that the model is reasonably valid, though not without limitations. Results should be interpreted with awareness that the mild heteroscedasticity and tail deviations may slightly affect the precision of the coefficient estimates. The model should not be used for precise numerical predictions without acknowledging this uncertainty.

Limitations

  • Single-year snapshot: The analysis uses only 2023 data to satisfy the independence assumption. Longitudinal patterns, such as whether the infrastructure-performance relationship has strengthened over time, cannot be assessed here.
  • Moderate multicollinearity: The correlation of r ≈ 0.70 between data_products_score and data_infrastructure_score means their individual coefficients should be interpreted with some caution. The model explains how they jointly relate to overall performance, but attributing precise credit to each is less reliable.
  • Mild heteroscedasticity: The slight increase in residual variance at higher fitted values is a modest violation of the constant variance assumption. Confidence intervals and p-values remain interpretable, but may be slightly optimistic.
  • Influential observations: A small number of observations with high Cook’s distance were identified. Their impact on the model has not been fully investigated and could be a source of bias.

Further Questions

  • Which specific countries correspond to the high Cook’s distance observations, and do they share any regional or structural characteristics that might explain their outsized influence?
  • Would a log transformation of overall_score reduce the mild heteroscedasticity identified in the residual plots?
  • How do the remaining sub-indicators, data_use_score, data_services_score, and data_sources_score, contribute to the unexplained 10% of variance? A model including all five sub-scores would address this.
  • Does the relationship between data_infrastructure_score and overall_score differ across income groups? An interaction term between these two variables could be explored in a follow-up analysis.