This notebook extends the regression analysis from Week 8 by expanding the simple linear regression model and applying diagnostic tools to evaluate whether the model’s assumptions are reasonably met.
In Week 8, a simple linear regression was fit using
data_products_score as the sole predictor of
overall_score. That model explained approximately 67% of
the variation in overall statistical performance across countries. This
week, 1–3 additional variables are introduced to improve the model, and
the 5 diagnostic plots discussed in class are used to assess its
validity. The central goal is to ensure that any conclusions drawn from
the model rest on a valid foundation.
The dataset is the World Bank Statistical Performance Indicators dataset, a longitudinal country-level dataset covering 217 countries from 2004 to 2023. The analysis continues to use the 2023 cross-sectional snapshot.
# Load libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lindia)
library(GGally)
library(broom)
# Load dataset
dataset <- read_csv("dataset.csv")
## Rows: 4340 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): iso3c, country, region, income
## dbl (8): year, population, overall_score, data_use_score, data_services_scor...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Filter to 2023, remove unclassified income group, create binary variable
data_2023 <- dataset %>%
filter(year == 2023) %>%
filter(income != "Not classified") %>%
mutate(high_income = ifelse(income == "High income", 1, 0))
cat("Rows in 2023 snapshot:", nrow(data_2023), "\n")
## Rows in 2023 snapshot: 216
# Retain only rows with no missing values across the variables used in the model
model_data <- data_2023 %>%
filter(
!is.na(overall_score),
!is.na(data_products_score),
!is.na(data_infrastructure_score)
)
cat("Rows available for modeling:", nrow(model_data), "\n")
## Rows available for modeling: 186
The Week 8 model used data_products_score as a single
predictor. The assignment asks to include 1–3 additional variables,
trying at least one binary or interaction term. Three candidate
variables were considered.
data_infrastructure_score
(included)
This sub-indicator measures the strength of a country’s statistical
infrastructure, including civil registration systems, geospatial
frameworks, and administrative data systems. It has a strong correlation
with overall_score (r = 0.912) and a moderate correlation
with data_products_score (r = 0.699). The correlation
between the two predictors is moderate, not severe, so both can
reasonably be included together. Multicollinearity is worth monitoring
but does not rise to a level that warrants exclusion.
high_income binary variable
(included)
A binary variable was created from the income column:
high_income = 1 for High income countries and
high_income = 0 for all others. This captures the
structural economic divide identified in the Week 8 ANOVA, where High
income countries scored significantly higher than all other groups. It
has a moderate correlation with overall_score (r = 0.464)
and a low correlation with data_products_score (r = 0.130),
so multicollinearity is not a concern here.
data_use_score (excluded)
This sub-indicator was considered but excluded due to
multicollinearity concerns. Its correlation with
data_products_score is r = 0.736, the highest among all
remaining sub-scores. Including both in the same model would introduce
redundancy that complicates the interpretation of each coefficient.
Since data_infrastructure_score offers comparable
predictive value with lower collinearity, data_use_score
was set aside.
Before fitting the model, a correlation heatmap is used to visualize the relationships among the three predictors and confirm that the multicollinearity assessment above is reasonable.
# Correlation heatmap of predictors
ggcorr(
select(model_data, data_products_score, data_infrastructure_score, high_income),
label = TRUE,
label_round = 2
) +
labs(title = "Correlation Heatmap — Model Predictors")
The heatmap confirms that data_products_score and
data_infrastructure_score have a moderate positive
correlation (r ≈ 0.70), while high_income shows low
correlations with both continuous predictors. This pattern is consistent
with the expectations discussed above. The moderate correlation between
the two continuous predictors is worth noting, but is not severe enough
to invalidate the model.
The prompt asks to try either a binary or an interaction term. An
interaction term between data_products_score and
high_income was tested to explore whether the slope of the
data products–overall score relationship differs for High income
countries versus all others.
# Test interaction term
lm_interact <- lm(overall_score ~ data_products_score + data_infrastructure_score +
high_income + data_products_score:high_income,
data = model_data)
tidy(lm_interact, conf.int = TRUE)
## # A tibble: 5 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -19.6 4.11 -4.78 3.67e- 6 -27.7 -11.5
## 2 data_products_score 0.798 0.0648 12.3 1.04e-25 0.670 0.926
## 3 data_infrastructure_… 0.412 0.0282 14.6 1.64e-32 0.357 0.468
## 4 high_income 25.5 5.65 4.52 1.11e- 5 14.4 36.7
## 5 data_products_score:… -0.278 0.0724 -3.84 1.70e- 4 -0.421 -0.135
The interaction term data_products_score:high_income has
a p-value well above 0.05, indicating insufficient evidence that the
slope of data_products_score differs between High income
and other countries. Including it does not meaningfully improve the
model and adds unnecessary complexity. For these reasons, the
interaction term is excluded from the final model.
Based on the variable selection discussion above, the final model
includes three terms: data_products_score (retained from
Week 8), data_infrastructure_score (new continuous
variable), and high_income (new binary variable).
# Fit the final regression model
lm_model <- lm(overall_score ~ data_products_score + data_infrastructure_score + high_income,
data = model_data)
summary(lm_model)
##
## Call:
## lm(formula = overall_score ~ data_products_score + data_infrastructure_score +
## high_income, data = model_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.4645 -2.6528 -0.4722 2.7818 14.8517
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.30681 3.22271 -2.888 0.00435 **
## data_products_score 0.66217 0.05631 11.759 < 2e-16 ***
## data_infrastructure_score 0.41119 0.02922 14.070 < 2e-16 ***
## high_income 4.19840 1.04793 4.006 8.98e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.271 on 182 degrees of freedom
## Multiple R-squared: 0.9044, Adjusted R-squared: 0.9029
## F-statistic: 574.2 on 3 and 182 DF, p-value: < 2.2e-16
The regression equation is:
\[\widehat{\text{Overall Score}} = -9.31 + 0.66 \times \text{Data Products Score} + 0.41 \times \text{Data Infrastructure Score} + 4.20 \times \text{High Income}\]
Intercept (−9.31): The predicted
overall_score when all predictors are zero. This has no
practical meaning since no country has a score of zero on either
sub-indicator, but it serves as the mathematical anchor for the
regression surface.
data_products_score (0.66): Holding the
other variables constant, each 1-point increase in data products score
is associated with an average increase of 0.66 points in overall score.
Note that this slope is smaller than the 1.33 estimated in the Week 8
simple regression. This is expected: when
data_infrastructure_score is added to the model, it absorbs
some of the explanatory work previously attributed to
data_products_score alone.
data_infrastructure_score (0.41):
Holding the other variables constant, each 1-point increase in data
infrastructure score is associated with an average increase of 0.41
points in overall score. This confirms that infrastructure has an
independent contribution to overall performance beyond what data
products alone capture.
high_income (4.20): Controlling for
both sub-indicator scores, High income countries score approximately 4.2
points higher on average than non-High income countries. This residual
income effect is smaller than the raw group difference seen in the Week
8 ANOVA (≈ 25 points), which makes sense: much of that gap was already
explained by the sub-indicator scores themselves. The remaining 4.2
points may reflect structural advantages in governance, institutional
stability, or data funding not captured by the two sub-scores.
The model’s R² = 0.90, meaning the three predictors
together explain approximately 90% of the variation in
overall_score across countries. This is a substantial
improvement over the Week 8 simple regression (R² = 0.67), confirming
that data_infrastructure_score and high_income
add meaningful explanatory power. The adjusted R² = 0.90 as well,
indicating that the improvement is not simply a consequence of adding
more variables.
Running a regression model is not sufficient on its own. The validity of the model’s conclusions depends on whether key assumptions are reasonably satisfied. Five diagnostic plots are used here to evaluate the model systematically.
# Residuals vs. fitted values
gg_resfitted(lm_model) +
geom_smooth(se = FALSE, color = "steelblue", linewidth = 0.8) +
labs(title = "Residuals vs. Fitted Values")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the lindia package.
## Please report the issue at <https://github.com/yeukyul/lindia/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
What to look for: If the linearity assumption holds and errors have constant variance (homoscedasticity), the residuals should be randomly scattered around zero with no discernible pattern across the range of fitted values.
Observation: The residuals appear roughly centered around zero and do not show a strong systematic curve. The smoothing line stays relatively close to zero across most of the fitted value range, which is consistent with the linearity assumption being approximately met. There is some mild spread variation at the lower and upper ends of the fitted range, suggesting a slight concern with constant variance, though it does not appear severe.
Confidence: Moderate to high confidence that linearity holds. The homoscedasticity assumption is plausible but should be further assessed in the scale-location context using the residual histogram and QQ-plot below.
# Residuals vs. each predictor
plots <- gg_resX(lm_model, plot.all = FALSE)
plots$data_products_score +
geom_smooth(se = FALSE, color = "steelblue", linewidth = 0.8) +
labs(title = "Residuals vs. Data Products Score")
plots$data_infrastructure_score +
geom_smooth(se = FALSE, color = "steelblue", linewidth = 0.8) +
labs(title = "Residuals vs. Data Infrastructure Score")
What to look for: Residuals should be randomly scattered with no trend as each predictor increases. A systematic pattern (e.g., curvature or fanning) would suggest that the linearity assumption is violated for that variable, or that its variance is not constant.
Observation for data_products_score:
Residuals are generally scattered without a strong directional trend,
supporting the assumption that the relationship between this predictor
and overall_score is approximately linear. There may be
slightly more spread in the middle of the range, but no severe pattern
is apparent.
Observation for
data_infrastructure_score: The residuals also
appear reasonably scattered, though there is some visible increase in
spread toward higher values of this predictor. This is a mild concern
consistent with what was observed in the residuals vs. fitted plot, and
suggests the homoscedasticity assumption may be modestly strained at
higher infrastructure scores.
Confidence: Moderate confidence in linearity for
both predictors. The mild heteroscedasticity at higher values of
data_infrastructure_score warrants acknowledgment but does
not appear severe enough to invalidate the model.
# Correlation heatmap (re-examined as a diagnostic tool)
ggcorr(
select(model_data, data_products_score, data_infrastructure_score, high_income),
label = TRUE,
label_round = 2
) +
labs(title = "Correlation Heatmap — Multicollinearity Diagnostic")
What to look for: High correlations between predictor variables (typically |r| > 0.80) indicate multicollinearity, which can inflate standard errors and make individual coefficients unreliable, even when the overall model fit is strong.
Observation: The two continuous predictors,
data_products_score and
data_infrastructure_score, show a moderate positive
correlation of r ≈ 0.70. This is noticeable but below the threshold
typically considered problematic. The high_income binary
variable shows low correlations with both continuous predictors (r ≈
0.13 and r ≈ 0.52, respectively). There is no pair here with a
correlation above 0.80, so severe multicollinearity does not appear to
be present in this model.
Confidence: Moderate to high confidence that multicollinearity is not a significant issue. The moderate correlation between the two continuous sub-scores is worth monitoring, particularly if additional sub-indicators were added to the model in a future expansion.
# Histogram of residuals
gg_reshist(lm_model) +
labs(title = "Histogram of Residuals")
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
What to look for: The normality of errors assumption requires that residuals follow an approximately normal distribution. In a histogram, this would appear as a roughly symmetric, bell-shaped distribution centered at zero.
Observation: The histogram of residuals is roughly symmetric and centered near zero, with a shape that is approximately bell-shaped. There is a slight suggestion of a longer right tail compared to the left, which may reflect a handful of countries where the model underestimates overall performance. The departure from perfect normality is mild.
Confidence: Moderate to high confidence that the normality assumption is reasonably met. The slight right skew is not severe enough to raise serious concerns about the validity of the model’s inference, but it suggests that some dimension of statistical performance is not fully captured by the three predictors currently in the model.
# Normal QQ plot of residuals
gg_qqplot(lm_model) +
labs(title = "Normal QQ-Plot of Residuals")
What to look for: In a QQ-plot, residuals that are perfectly normally distributed fall exactly on the diagonal reference line. Departures from the line, particularly in the tails, indicate deviations from normality. As emphasized in the lab materials, QQ-plots are sensitive and can flag deviations that are not practically meaningful, so conclusions here should be interpreted alongside the histogram.
Observation: The residuals track the reference line closely through the middle of the distribution. There is some deviation in the upper tail, where a small number of points fall above the line. This is consistent with the slight right skew observed in the histogram. The deviations are confined to the tails and do not represent a wholesale departure from normality.
Confidence: Moderate confidence that residuals are normal enough for the model’s inferences to be trustworthy. The tail deviations are mild and consistent across both the histogram and QQ-plot, which together suggest the normality assumption is approximately, if not perfectly, satisfied.
# Cook's distance — identify high-influence observations
gg_cooksd(lm_model, threshold = "matlab") +
labs(title = "Cook's Distance by Observation")
What to look for: Cook’s distance measures the influence of each individual observation on the fitted model. Observations above the threshold line are considered potentially influential and may be worth investigating further.
Observation: A small number of observations are flagged as potentially influential. These are worth examining to determine whether they represent genuine outliers (e.g., countries with unusual combinations of sub-indicator scores) or data entry concerns. It is important to note that influence alone is not sufficient reason to remove an observation. These points should be investigated in context before any decision about their treatment is made.
Significance: If any of these influential observations come from countries with atypical statistical systems or data anomalies, their presence could be pulling the regression line in ways that do not represent the broader pattern. A useful next step would be to identify which countries these rows correspond to and examine whether they share any characteristics.
Expanding the Week 8 model from one predictor to three improved the
fit substantially, with R² rising from 0.67 to 0.90. The added
variables, data_infrastructure_score and the
high_income binary, each contribute independent explanatory
power and were included for theoretically grounded reasons.
The diagnostic results are generally encouraging. The linearity
assumption appears approximately satisfied for both continuous
predictors. The residuals are roughly normally distributed, with only
mild deviations in the upper tail. Multicollinearity is present at a
moderate level between the two continuous sub-scores, but does not
appear severe enough to undermine the model. The most consistent finding
across the diagnostic plots is a mild increase in residual spread at
higher fitted values and higher values of
data_infrastructure_score, suggesting that the
homoscedasticity assumption is not perfectly met.
Together, these diagnostics support the conclusion that the model is reasonably valid, though not without limitations. Results should be interpreted with awareness that the mild heteroscedasticity and tail deviations may slightly affect the precision of the coefficient estimates. The model should not be used for precise numerical predictions without acknowledging this uncertainty.
data_products_score and
data_infrastructure_score means their individual
coefficients should be interpreted with some caution. The model explains
how they jointly relate to overall performance, but attributing precise
credit to each is less reliable.overall_score reduce the
mild heteroscedasticity identified in the residual plots?data_use_score,
data_services_score, and data_sources_score,
contribute to the unexplained 10% of variance? A model including all
five sub-scores would address this.data_infrastructure_score
and overall_score differ across income groups? An
interaction term between these two variables could be explored in a
follow-up analysis.