Introduction

The purpose of this Week 9 data dive is to critically evaluate and diagnose a multiple linear regression model predicting COVID-19 case fatality rates.

Last week, we constructed a simple linear regression model. In this analysis, we extend that framework by including additional structural variables and an interaction term. The goal is not only to improve explanatory power, but to carefully assess whether regression assumptions are satisfied and whether the model is statistically reliable.

This notebook will:

Extend the model to include 2–4 explanatory variables
Include an interaction term
Evaluate multicollinearity
Diagnose model assumptions using 5 diagnostic plots
Assess severity of violations
Interpret statistical and practical significance
Reflect critically on limitations and improvements

Connection to Week 8 Model

In Week 8, we estimated a simple linear regression model predicting case fatality rate using vaccination coverage alone.

While that model established an initial relationship, it ignored structural demographic and economic differences across countries.

The Week 9 model extends this framework by incorporating median age, GDP per capita, and an interaction term. This allows for a more realistic representation of global heterogeneity and reduces omitted variable bias.

Data Preparation

covid <- read.csv("covid_combined_groups.csv")

covid_model <- covid %>%
  select(case_fatality_rate,
         vax_coverage,
         median_age,
         gdp_per_capita) %>%
  drop_na()

summary(covid_model)

##  case_fatality_rate  vax_coverage     median_age    gdp_per_capita 
##  Min.   :0.000000   Min.   : 0.00   Min.   :18.10   Min.   : 1730  
##  1st Qu.:0.005885   1st Qu.: 0.00   1st Qu.:31.90   1st Qu.:17336  
##  Median :0.014787   Median : 0.00   Median :39.70   Median :29481  
##  Mean   :0.027773   Mean   :10.54   Mean   :37.48   Mean   :30159  
##  3rd Qu.:0.030259   3rd Qu.: 5.79   3rd Qu.:43.20   3rd Qu.:42659  
##  Max.   :2.281690   Max.   :84.68   Max.   :48.20   Max.   :94278

What Is Being Done

We restrict the dataset to relevant predictors and remove missing values to ensure valid regression estimation.

Response Variable

case_fatality_rate (CFR)
Proportion of confirmed COVID cases resulting in death.

Explanatory Variables

vax_coverage — vaccination coverage percentage
median_age — demographic vulnerability proxy
gdp_per_capita — economic development proxy

Model Specification and Theoretical Justification

1. Vaccination Coverage

Vaccines reduce severe illness and mortality.
We expect a negative relationship with case fatality rate.

2. Median Age

Older populations are biologically more vulnerable.
We expect a positive relationship.

3. GDP per Capita

GDP proxies healthcare quality and system strength.
However, wealthier countries may also have older populations and better reporting.
The direction is theoretically ambiguous.

4. Interaction: Vaccination × Median Age

We include:

\[ CFR = \beta_0 + \beta_1 Vax + \beta_2 Age + \beta_3 (Vax \times Age) + \beta_4 GDP + \epsilon \]

This allows the effect of vaccination to depend on population age.

The marginal effect of vaccination becomes:

\[ \frac{\partial CFR}{\partial Vax} = \beta_1 + \beta_3 (MedianAge) \]

This improves realism compared to a purely additive model.

Multiple Linear Regression Model

model_week9 <- lm(case_fatality_rate ~ 
                    vax_coverage * median_age + 
                    gdp_per_capita,
                  data = covid_model)

summary(model_week9)

## 
## Call:
## lm(formula = case_fatality_rate ~ vax_coverage * median_age + 
##     gdp_per_capita, data = covid_model)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.03936 -0.02086 -0.01108  0.00196  2.25016 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              1.118e-02  1.903e-03   5.873 4.31e-09 ***
## vax_coverage             5.593e-04  1.157e-04   4.832 1.36e-06 ***
## median_age               8.717e-04  5.805e-05  15.017  < 2e-16 ***
## gdp_per_capita          -4.601e-07  2.274e-08 -20.235  < 2e-16 ***
## vax_coverage:median_age -1.940e-05  2.878e-06  -6.742 1.59e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06653 on 39574 degrees of freedom
## Multiple R-squared:  0.01562,    Adjusted R-squared:  0.01552 
## F-statistic: 156.9 on 4 and 39574 DF,  p-value: < 2.2e-16

Regression Results Interpretation

The multiple linear regression results indicate that all included predictors are statistically significant at the 0.001 level.

Overall Statistical Significance

All coefficients — vaccination coverage, median age, GDP per capita, and the interaction term — have p-values below 0.001. This provides extremely strong statistical evidence that each predictor contributes meaningfully to explaining variation in case fatality rate.

Intercept

The intercept (0.01118) represents the predicted case fatality rate when all predictors are zero. While not substantively meaningful in isolation, it serves as the baseline from which other effects are interpreted.

Vaccination Coverage

The coefficient for vax_coverage is 0.0005593 (p = 1.36e-06).
This positive coefficient alone suggests that, at a median age of zero, vaccination coverage increases case fatality rate. However, this interpretation is incomplete because the interaction term modifies this relationship.
The interaction term must be incorporated to understand the true marginal effect of vaccination.

Median Age

The coefficient for median_age is 0.0008717 (p < 2e-16).
This indicates that, holding other variables constant, a one-year increase in median age increases the case fatality rate by approximately 0.00087. This confirms that older populations experience significantly higher COVID-19 fatality rates.
The magnitude is both statistically and substantively meaningful.

GDP per Capita

The coefficient for gdp_per_capita is -4.601e-07 (p < 2e-16).
This negative relationship indicates that wealthier countries tend to have lower case fatality rates, holding other variables constant. Although the coefficient appears small, GDP is measured in large units, making the aggregate impact meaningful.
This result supports the hypothesis that economic development and healthcare infrastructure reduce mortality risk.

Interaction Term: Vaccination × Median Age

The interaction coefficient is -1.940e-05 (p = 1.59e-11).
This negative and statistically significant interaction term indicates that the effect of vaccination becomes more protective as median age increases.
The marginal effect of vaccination is:
∂CFR / ∂Vaccination = 0.0005593 − 0.00001940 × (Median Age)
Because the interaction term is negative, the protective impact of vaccination strengthens in older populations.
This confirms that vaccination is especially important in countries with aging demographics.

Residual Distribution

The residual summary shows a maximum residual of 2.25, which is substantially larger than the interquartile range. This suggests the presence of potential outliers that may require further diagnostic inspection.
Overall, the regression model demonstrates strong statistical significance across predictors, meaningful interaction effects, and theoretically consistent relationships between demographic structure, economic development, vaccination coverage, and case fatality rate.

Interpretation Strategy

We evaluate:

Coefficient signs
Statistical significance
Magnitude and practical importance
Adjusted R² (model fit improvement over simple regression)

If the interaction term is significant, vaccination effectiveness depends on demographic structure.

Multicollinearity Assessment

vif(model_week9)

##            vax_coverage              median_age          gdp_per_capita 
##               53.719738                1.673028                1.474565 
## vax_coverage:median_age 
##               54.553866

Interpretation

The Variance Inflation Factor (VIF) results reveal severe multicollinearity within the model.
- VIF for vax_coverage: 53.72
- VIF for vax_coverage:median_age: 54.55
- VIF for median_age: 1.67
- VIF for gdp_per_capita: 1.47
A VIF value above 10 is considered severe. Values above 50 indicate extreme multicollinearity.
The vaccination variable and its interaction term exhibit extremely high VIF values. This occurs because interaction terms are mechanically correlated with their component variables. Without centering the variables before constructing the interaction, strong multicollinearity is expected.

Implications

Severe multicollinearity does not bias coefficient estimates, but it:
- Inflates standard errors
- Reduces numerical stability
- Makes individual coefficient interpretation unreliable
- Increases sensitivity to small data changes
The instability particularly affects interpretation of vax_coverage and the interaction coefficient.

Conclusion

The model currently exhibits severe multicollinearity driven by the interaction structure.

To correct this issue, the predictors should be mean-centered before constructing the interaction term. Centering will dramatically reduce VIF values while preserving the substantive interpretation of the interaction effect.

Without centering, coefficient magnitudes should be interpreted cautiously due to inflated variance.

Although multicollinearity is severe for the interaction term, it arises mechanically from including an uncentered interaction. The interaction is theoretically essential and therefore retained. Centering the variables would resolve the instability without removing substantively meaningful structure.

Diagnostic Data Construction

We now compute diagnostic statistics manually using base R functions.

model_data <- data.frame(
  fitted = fitted(model_week9),
  resid = resid(model_week9),
  std_resid = rstandard(model_week9),
  cooks = cooks.distance(model_week9),
  leverage = hatvalues(model_week9)
)

n <- nrow(model_data)
cooks_cutoff <- 4/n

1. Residuals vs Fitted (Linearity)

Diagnostic Objective: Testing the Linearity Assumption

Before interpreting regression coefficients, we must verify that the linearity assumption holds.
Linear regression assumes that the relationship between the predictors and the response variable is approximately linear.
The Residuals vs Fitted plot helps us evaluate whether systematic patterns remain unexplained by the model. If the model is correctly specified, residuals should be randomly scattered around zero with no visible structure.
If curvature or systematic trends appear, this suggests the model may be mis-specified and nonlinear terms or transformations may be necessary.

ggplot(model_data, aes(fitted, resid)) +
  geom_point(color = "steelblue", alpha = 0.6) +
  geom_smooth(method = "loess", color = "red", se = FALSE) +
  labs(title = "Residuals vs Fitted",
       x = "Fitted Values",
       y = "Residuals") +
  theme_minimal()

Residuals vs Fitted Interpretation

The residuals are heavily concentrated near zero at lower fitted values, but dispersion increases substantially as fitted values rise beyond approximately 0.025.
The LOESS smoothing line remains relatively flat near zero, indicating that strong nonlinearity is not present. However, the clear widening spread of residuals at higher fitted values indicates non-constant variance.
Several extreme positive residuals above 1.0 and even exceeding 2.0 are visible at higher fitted values. These represent substantial prediction errors for specific observations.
Conclusion: The linearity assumption is largely satisfied, but the homoscedasticity assumption is violated due to increasing variance at higher fitted values.

2. Normal Q-Q Plot (Normality of Errors)

Diagnostic Objective: Testing Normality of Residuals

Linear regression assumes that residuals are approximately normally distributed. While this assumption becomes less critical in large samples due to the Central Limit Theorem, it remains important for accurate hypothesis testing and confidence intervals.
The Normal Q-Q plot compares standardized residuals to a theoretical normal distribution.
If residuals follow a straight diagonal line, the normality assumption is supported.
Significant deviations, especially in the tails, suggest heavy-tailed or skewed errors.

ggplot(model_data, aes(sample = std_resid)) +
  stat_qq(color = "darkgreen") +
  stat_qq_line(color = "red") +
  labs(title = "Normal Q-Q Plot") +
  theme_minimal()

Normal Q-Q Plot Interpretation

The residuals deviate dramatically from the theoretical normal reference line in the upper tail.
The right tail curves sharply upward, with standardized residuals reaching extremely large values. This indicates strong positive skewness and heavy-tailed behavior.
The lower tail remains relatively compressed, reinforcing asymmetry in the residual distribution.
Conclusion : The normality assumption is clearly violated. The residual distribution exhibits heavy right-skew and extreme outliers. While large sample size reduces concern for coefficient estimation, hypothesis tests relying on normality may be distorted.

3. Scale-Location Plot (Homoscedasticity)

Diagnostic Objective: Testing Homoscedasticity

Homoscedasticity means that the variance of residuals remains constant across levels of the fitted values.
If variance increases or decreases systematically, the model exhibits heteroscedasticity, which can lead to biased standard errors and unreliable hypothesis testing.
The Scale-Location plot displays the square root of standardized residuals against fitted values to detect variance patterns.

ggplot(model_data, aes(fitted, sqrt(abs(std_resid)))) +
  geom_point(color = "purple", alpha = 0.6) +
  geom_smooth(method = "loess", color = "red", se = FALSE) +
  labs(title = "Scale-Location Plot",
       x = "Fitted Values",
       y = "Sqrt(|Standardized Residuals|)") +
  theme_minimal()

Scale-Location Plot Interpretation

The red LOESS line trends upward as fitted values increase, and the vertical spread of points expands significantly for fitted values above 0.025.
This clear upward pattern indicates increasing variance of residuals across fitted values.
The presence of multiple high standardized residuals above 4 and even near 6 confirms substantial heteroscedasticity.
Conclusion: The homoscedasticity assumption is strongly violated. Variance increases systematically with predicted case fatality rate. Robust standard errors or transformation would improve reliability.

4. Cook’s Distance (Influential Observations)

Diagnostic Objective: Identifying Influential Observations

Not all data points contribute equally to model estimation. Some observations may exert disproportionate influence on coefficient estimates.
Cook’s Distance measures how much the regression model changes if a specific observation is removed.
Observations exceeding the threshold (4/n) may be considered influential.

ggplot(model_data, aes(seq_along(cooks), cooks)) +
  geom_bar(stat = "identity", fill = "orange") +
  geom_hline(yintercept = cooks_cutoff, color = "red") +
  labs(title = "Cook's Distance",
       x = "Observation Index",
       y = "Cook's Distance") +
  theme_minimal()

Cook’s Distance Interpretation

Cook’s Distance values are extremely small across nearly all observations.
No observation exceeds the typical threshold of concern. Even the largest spikes remain very small relative to the scale.
Conclusion: There are no highly influential observations materially distorting the regression estimates. The model is not driven by a small subset of extreme data points.

5. Residuals vs Leverage

Diagnostic Objective: Detecting High-Leverage Influential Points

Leverage measures how far an observation’s predictor values are from the mean of all predictor values.
High-leverage points can strongly influence the regression line, especially if they also have large residuals.
This plot helps identify potentially problematic observations that combine high leverage and large standardized residuals.

ggplot(model_data, aes(leverage, std_resid)) +
  geom_point(color = "brown", alpha = 0.6) +
  geom_smooth(method = "loess", color = "red", se = FALSE) +
  labs(title = "Residuals vs Leverage",
       x = "Leverage",
       y = "Standardized Residuals") +
  theme_minimal()

Residuals vs Leverage Interpretation

Most observations cluster at very low leverage values, indicating that predictor values are not extreme relative to the overall dataset.
However, several observations exhibit extremely large standardized residuals, exceeding 20 and even above 30, despite low leverage.
This indicates that while these observations do not exert structural influence on the regression line, they represent substantial prediction errors.
Conclusion: The model does not suffer from high-leverage influential distortion. However, the presence of extreme residual outliers confirms the heavy-tailed residual distribution observed in the Q-Q plot.

Overall Model Evaluation

The regression model demonstrates extremely strong statistical significance across all predictors. Vaccination coverage, median age, GDP per capita, and the interaction term are all significant at p < 0.001, indicating meaningful relationships with case fatality rate.
However, diagnostic analysis reveals several important structural issues.
First, severe multicollinearity is present between vaccination coverage and its interaction term, with VIF values exceeding 50. This level of multicollinearity inflates standard errors and reduces coefficient stability, making individual parameter interpretation less reliable.
Second, the residual diagnostics indicate that while the linearity assumption is reasonably satisfied, the homoscedasticity and normality assumptions are violated.
The Scale-Location plot shows clear heteroscedasticity, with residual variance increasing at higher fitted values.
The Q-Q plot reveals heavy right-skew and extreme upper-tail deviations, confirming non-normal residuals.
Despite these distributional violations, Cook’s Distance and leverage plots show no evidence of influential high-leverage points driving the model.
Because the dataset is large, coefficient estimates remain consistent. However, standard errors are likely affected by heteroscedasticity.
To strengthen inference, the model should incorporate:
- Mean-centering before interaction construction (to resolve multicollinearity)
- Heteroscedasticity-robust standard errors
- Potential transformation of the response variable to reduce skewness
Overall, the model captures meaningful structural relationships but requires refinement for optimal statistical reliability.
The multicollinearity observed (VIF > 50) is severe and materially inflates variance of coefficient estimates. The heteroscedasticity observed is also substantial, as variance clearly increases across fitted values. The normality violation is pronounced due to extreme upper-tail residuals.
Despite these issues, the large sample size ensures consistency of coefficient estimates. However, standard errors and hypothesis tests may be distorted without correction.

Key Insights and Significance

The results reveal several substantive findings:

Median age has a strong positive association with case fatality rate, confirming that aging populations experience higher mortality risk.
GDP per capita is negatively associated with fatality rate, indicating that wealthier countries tend to experience lower mortality, likely reflecting healthcare capacity and infrastructure advantages.
The interaction term is negative and highly significant, demonstrating that vaccination becomes increasingly protective as median age rises. Vaccination is particularly critical in older populations.
These findings provide clear evidence that demographic structure and economic development fundamentally shape pandemic outcomes.
The model highlights the importance of prioritizing vaccination efforts in aging societies and investing in healthcare infrastructure to reduce mortality disparities.

Further Questions

The diagnostics suggest several directions for refinement:

Centering predictors before constructing the interaction to eliminate severe multicollinearity.
Applying heteroscedasticity-robust standard errors to correct for non-constant variance.
Investigating extreme residual observations to understand structural outliers.
Exploring transformation of the response variable to address heavy-tailed residual behavior.
Evaluating whether additional healthcare capacity variables (e.g., hospital beds per thousand) improve explanatory strength.
These refinements would enhance both statistical precision and substantive interpretability.

Week 9 Completion Checklist

Added 2–4 explanatory variables
Included interaction term
Justified inclusion of each variable
Evaluated multicollinearity
Produced and interpreted 5 diagnostic plots
Assessed severity of violations
Interpreted statistical and practical significance
Reflected on limitations and improvements

Week 9 Data Dive: Data Dive — Regression Diagnostics

Krish Shah

March 10, 2026