The purpose of this Week 9 data dive is to critically evaluate and diagnose a multiple linear regression model predicting COVID-19 case fatality rates.
Last week, we constructed a simple linear regression model. In this analysis, we extend that framework by including additional structural variables and an interaction term. The goal is not only to improve explanatory power, but to carefully assess whether regression assumptions are satisfied and whether the model is statistically reliable.
This notebook will:
In Week 8, we estimated a simple linear regression model predicting case fatality rate using vaccination coverage alone.
While that model established an initial relationship, it ignored structural demographic and economic differences across countries.
The Week 9 model extends this framework by incorporating median age, GDP per capita, and an interaction term. This allows for a more realistic representation of global heterogeneity and reduces omitted variable bias.
covid <- read.csv("covid_combined_groups.csv")
covid_model <- covid %>%
select(case_fatality_rate,
vax_coverage,
median_age,
gdp_per_capita) %>%
drop_na()
summary(covid_model)## case_fatality_rate vax_coverage median_age gdp_per_capita
## Min. :0.000000 Min. : 0.00 Min. :18.10 Min. : 1730
## 1st Qu.:0.005885 1st Qu.: 0.00 1st Qu.:31.90 1st Qu.:17336
## Median :0.014787 Median : 0.00 Median :39.70 Median :29481
## Mean :0.027773 Mean :10.54 Mean :37.48 Mean :30159
## 3rd Qu.:0.030259 3rd Qu.: 5.79 3rd Qu.:43.20 3rd Qu.:42659
## Max. :2.281690 Max. :84.68 Max. :48.20 Max. :94278
We restrict the dataset to relevant predictors and remove missing values to ensure valid regression estimation.
Vaccines reduce severe illness and mortality.
We expect a negative relationship with case fatality
rate.
Older populations are biologically more vulnerable.
We expect a positive relationship.
GDP proxies healthcare quality and system strength.
However, wealthier countries may also have older populations and better
reporting.
The direction is theoretically ambiguous.
We include:
\[ CFR = \beta_0 + \beta_1 Vax + \beta_2 Age + \beta_3 (Vax \times Age) + \beta_4 GDP + \epsilon \]
This allows the effect of vaccination to depend on population age.
The marginal effect of vaccination becomes:
\[ \frac{\partial CFR}{\partial Vax} = \beta_1 + \beta_3 (MedianAge) \]
This improves realism compared to a purely additive model.
model_week9 <- lm(case_fatality_rate ~
vax_coverage * median_age +
gdp_per_capita,
data = covid_model)
summary(model_week9)##
## Call:
## lm(formula = case_fatality_rate ~ vax_coverage * median_age +
## gdp_per_capita, data = covid_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.03936 -0.02086 -0.01108 0.00196 2.25016
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.118e-02 1.903e-03 5.873 4.31e-09 ***
## vax_coverage 5.593e-04 1.157e-04 4.832 1.36e-06 ***
## median_age 8.717e-04 5.805e-05 15.017 < 2e-16 ***
## gdp_per_capita -4.601e-07 2.274e-08 -20.235 < 2e-16 ***
## vax_coverage:median_age -1.940e-05 2.878e-06 -6.742 1.59e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06653 on 39574 degrees of freedom
## Multiple R-squared: 0.01562, Adjusted R-squared: 0.01552
## F-statistic: 156.9 on 4 and 39574 DF, p-value: < 2.2e-16
The coefficient for vax_coverage is
0.0005593 (p = 1.36e-06).
This positive coefficient alone suggests that, at a median age of zero, vaccination coverage increases case fatality rate. However, this interpretation is incomplete because the interaction term modifies this relationship.
The interaction term must be incorporated to understand the true marginal effect of vaccination.
The coefficient for median_age is
0.0008717 (p < 2e-16).
This indicates that, holding other variables constant, a one-year increase in median age increases the case fatality rate by approximately 0.00087. This confirms that older populations experience significantly higher COVID-19 fatality rates.
The magnitude is both statistically and substantively meaningful.
The coefficient for gdp_per_capita is
-4.601e-07 (p < 2e-16).
This negative relationship indicates that wealthier countries tend to have lower case fatality rates, holding other variables constant. Although the coefficient appears small, GDP is measured in large units, making the aggregate impact meaningful.
This result supports the hypothesis that economic development and healthcare infrastructure reduce mortality risk.
The interaction coefficient is -1.940e-05 (p = 1.59e-11).
This negative and statistically significant interaction term indicates that the effect of vaccination becomes more protective as median age increases.
The marginal effect of vaccination is:
∂CFR / ∂Vaccination = 0.0005593 − 0.00001940 × (Median Age)
Because the interaction term is negative, the protective impact of vaccination strengthens in older populations.
This confirms that vaccination is especially important in countries with aging demographics.
The residual summary shows a maximum residual of 2.25, which is substantially larger than the interquartile range. This suggests the presence of potential outliers that may require further diagnostic inspection.
Overall, the regression model demonstrates strong statistical significance across predictors, meaningful interaction effects, and theoretically consistent relationships between demographic structure, economic development, vaccination coverage, and case fatality rate.
We evaluate:
If the interaction term is significant, vaccination effectiveness depends on demographic structure.
## vax_coverage median_age gdp_per_capita
## 53.719738 1.673028 1.474565
## vax_coverage:median_age
## 54.553866
The Variance Inflation Factor (VIF) results reveal severe multicollinearity within the model.
vax_coverage: 53.72vax_coverage:median_age:
54.55median_age: 1.67gdp_per_capita: 1.47A VIF value above 10 is considered severe. Values above 50 indicate extreme multicollinearity.
The vaccination variable and its interaction term exhibit extremely high VIF values. This occurs because interaction terms are mechanically correlated with their component variables. Without centering the variables before constructing the interaction, strong multicollinearity is expected.
Severe multicollinearity does not bias coefficient estimates, but it:
The instability particularly affects interpretation of
vax_coverage and the interaction coefficient.
The model currently exhibits severe multicollinearity driven by the interaction structure.
To correct this issue, the predictors should be mean-centered before constructing the interaction term. Centering will dramatically reduce VIF values while preserving the substantive interpretation of the interaction effect.
Without centering, coefficient magnitudes should be interpreted cautiously due to inflated variance.
We now compute diagnostic statistics manually using base R functions.
model_data <- data.frame(
fitted = fitted(model_week9),
resid = resid(model_week9),
std_resid = rstandard(model_week9),
cooks = cooks.distance(model_week9),
leverage = hatvalues(model_week9)
)
n <- nrow(model_data)
cooks_cutoff <- 4/nBefore interpreting regression coefficients, we must verify that the linearity assumption holds.
Linear regression assumes that the relationship between the predictors and the response variable is approximately linear.
The Residuals vs Fitted plot helps us evaluate whether systematic patterns remain unexplained by the model. If the model is correctly specified, residuals should be randomly scattered around zero with no visible structure.
If curvature or systematic trends appear, this suggests the model may be mis-specified and nonlinear terms or transformations may be necessary.
ggplot(model_data, aes(fitted, resid)) +
geom_point(color = "steelblue", alpha = 0.6) +
geom_smooth(method = "loess", color = "red", se = FALSE) +
labs(title = "Residuals vs Fitted",
x = "Fitted Values",
y = "Residuals") +
theme_minimal()The residuals are heavily concentrated near zero at lower fitted values, but dispersion increases substantially as fitted values rise beyond approximately 0.025.
The LOESS smoothing line remains relatively flat near zero, indicating that strong nonlinearity is not present. However, the clear widening spread of residuals at higher fitted values indicates non-constant variance.
Several extreme positive residuals above 1.0 and even exceeding 2.0 are visible at higher fitted values. These represent substantial prediction errors for specific observations.
Conclusion: The linearity assumption is largely satisfied, but the homoscedasticity assumption is violated due to increasing variance at higher fitted values.
Linear regression assumes that residuals are approximately normally distributed. While this assumption becomes less critical in large samples due to the Central Limit Theorem, it remains important for accurate hypothesis testing and confidence intervals.
The Normal Q-Q plot compares standardized residuals to a theoretical normal distribution.
If residuals follow a straight diagonal line, the normality assumption is supported.
Significant deviations, especially in the tails, suggest heavy-tailed or skewed errors.
ggplot(model_data, aes(sample = std_resid)) +
stat_qq(color = "darkgreen") +
stat_qq_line(color = "red") +
labs(title = "Normal Q-Q Plot") +
theme_minimal()The residuals deviate dramatically from the theoretical normal reference line in the upper tail.
The right tail curves sharply upward, with standardized residuals reaching extremely large values. This indicates strong positive skewness and heavy-tailed behavior.
The lower tail remains relatively compressed, reinforcing asymmetry in the residual distribution.
Homoscedasticity means that the variance of residuals remains constant across levels of the fitted values.
If variance increases or decreases systematically, the model exhibits heteroscedasticity, which can lead to biased standard errors and unreliable hypothesis testing.
The Scale-Location plot displays the square root of standardized residuals against fitted values to detect variance patterns.
ggplot(model_data, aes(fitted, sqrt(abs(std_resid)))) +
geom_point(color = "purple", alpha = 0.6) +
geom_smooth(method = "loess", color = "red", se = FALSE) +
labs(title = "Scale-Location Plot",
x = "Fitted Values",
y = "Sqrt(|Standardized Residuals|)") +
theme_minimal()The red LOESS line trends upward as fitted values increase, and the vertical spread of points expands significantly for fitted values above 0.025.
This clear upward pattern indicates increasing variance of residuals across fitted values.
The presence of multiple high standardized residuals above 4 and even near 6 confirms substantial heteroscedasticity.
Conclusion: The homoscedasticity assumption is strongly violated. Variance increases systematically with predicted case fatality rate. Robust standard errors or transformation would improve reliability.
Not all data points contribute equally to model estimation. Some observations may exert disproportionate influence on coefficient estimates.
Cook’s Distance measures how much the regression model changes if a specific observation is removed.
Observations exceeding the threshold (4/n) may be considered influential.
ggplot(model_data, aes(seq_along(cooks), cooks)) +
geom_bar(stat = "identity", fill = "orange") +
geom_hline(yintercept = cooks_cutoff, color = "red") +
labs(title = "Cook's Distance",
x = "Observation Index",
y = "Cook's Distance") +
theme_minimal()Cook’s Distance values are extremely small across nearly all observations.
No observation exceeds the typical threshold of concern. Even the largest spikes remain very small relative to the scale.
Conclusion: There are no highly influential observations materially distorting the regression estimates. The model is not driven by a small subset of extreme data points.
Leverage measures how far an observation’s predictor values are from the mean of all predictor values.
High-leverage points can strongly influence the regression line, especially if they also have large residuals.
This plot helps identify potentially problematic observations that combine high leverage and large standardized residuals.
ggplot(model_data, aes(leverage, std_resid)) +
geom_point(color = "brown", alpha = 0.6) +
geom_smooth(method = "loess", color = "red", se = FALSE) +
labs(title = "Residuals vs Leverage",
x = "Leverage",
y = "Standardized Residuals") +
theme_minimal()Most observations cluster at very low leverage values, indicating that predictor values are not extreme relative to the overall dataset.
However, several observations exhibit extremely large standardized residuals, exceeding 20 and even above 30, despite low leverage.
This indicates that while these observations do not exert structural influence on the regression line, they represent substantial prediction errors.
Conclusion: The model does not suffer from high-leverage influential distortion. However, the presence of extreme residual outliers confirms the heavy-tailed residual distribution observed in the Q-Q plot.
The regression model demonstrates extremely strong statistical significance across all predictors. Vaccination coverage, median age, GDP per capita, and the interaction term are all significant at p < 0.001, indicating meaningful relationships with case fatality rate.
However, diagnostic analysis reveals several important structural issues.
First, severe multicollinearity is present between vaccination coverage and its interaction term, with VIF values exceeding 50. This level of multicollinearity inflates standard errors and reduces coefficient stability, making individual parameter interpretation less reliable.
Second, the residual diagnostics indicate that while the linearity assumption is reasonably satisfied, the homoscedasticity and normality assumptions are violated.
The Scale-Location plot shows clear heteroscedasticity, with residual variance increasing at higher fitted values.
The Q-Q plot reveals heavy right-skew and extreme upper-tail deviations, confirming non-normal residuals.
Despite these distributional violations, Cook’s Distance and leverage plots show no evidence of influential high-leverage points driving the model.
Because the dataset is large, coefficient estimates remain consistent. However, standard errors are likely affected by heteroscedasticity.
To strengthen inference, the model should incorporate:
Overall, the model captures meaningful structural relationships but requires refinement for optimal statistical reliability.
The multicollinearity observed (VIF > 50) is severe and materially inflates variance of coefficient estimates. The heteroscedasticity observed is also substantial, as variance clearly increases across fitted values. The normality violation is pronounced due to extreme upper-tail residuals.
The results reveal several substantive findings:
Median age has a strong positive association with case fatality rate, confirming that aging populations experience higher mortality risk.
GDP per capita is negatively associated with fatality rate, indicating that wealthier countries tend to experience lower mortality, likely reflecting healthcare capacity and infrastructure advantages.
The interaction term is negative and highly significant, demonstrating that vaccination becomes increasingly protective as median age rises. Vaccination is particularly critical in older populations.
These findings provide clear evidence that demographic structure and economic development fundamentally shape pandemic outcomes.
The model highlights the importance of prioritizing vaccination efforts in aging societies and investing in healthcare infrastructure to reduce mortality disparities.
The diagnostics suggest several directions for refinement:
Centering predictors before constructing the interaction to eliminate severe multicollinearity.
Applying heteroscedasticity-robust standard errors to correct for non-constant variance.
Investigating extreme residual observations to understand structural outliers.
Exploring transformation of the response variable to address heavy-tailed residual behavior.
Evaluating whether additional healthcare capacity variables (e.g., hospital beds per thousand) improve explanatory strength.
These refinements would enhance both statistical precision and substantive interpretability.