Week 11 Model

This notebook completes the Week 11 requirements only:

build one linear or generalized linear model
diagnose the model
highlight issues with the model
interpret at least one coefficient

Data and model choice

covid <- read.csv("covid_combined_groups.csv")
covid$date <- as.Date(covid$date)

model_data <- covid %>%
  select(new_cases_smoothed_per_million,
         population,
         vax_coverage,
         median_age,
         reproduction_rate,
         stringency_index) %>%
  drop_na() %>%
  mutate(new_cases_est = round(new_cases_smoothed_per_million * population / 1000000)) %>%
  filter(new_cases_est >= 0)

poisson_model <- glm(
  new_cases_est ~ vax_coverage + median_age + reproduction_rate + stringency_index,
  family = poisson(link = "log"),
  data = model_data
)

summary(poisson_model)

## 
## Call:
## glm(formula = new_cases_est ~ vax_coverage + median_age + reproduction_rate + 
##     stringency_index, family = poisson(link = "log"), data = model_data)
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        7.771e+00  5.146e-04   15102   <2e-16 ***
## vax_coverage       1.865e-02  2.762e-06    6753   <2e-16 ***
## median_age        -1.891e-02  8.743e-06   -2163   <2e-16 ***
## reproduction_rate -2.620e-01  2.183e-04   -1200   <2e-16 ***
## stringency_index   2.689e-02  4.272e-06    6295   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 737274115  on 41601  degrees of freedom
## Residual deviance: 666882985  on 41597  degrees of freedom
## AIC: 667227179
## 
## Number of Fisher Scoring iterations: 7

I used a Poisson generalized linear model because the response variable is a count: new_cases_est. A Poisson model is a natural choice for count outcomes because it is designed for non-negative integers and uses a log link. That means the predictors act multiplicatively on the expected count rather than additively.

The goal of the model is to see how country-level vaccination coverage, age structure, reproduction rate, and policy stringency are associated with the expected number of new cases.

Actual Poisson distribution

lambda_hat <- mean(model_data$new_cases_est)
pois_dist <- tibble(
  count = 0:max(ceiling(lambda_hat + 4 * sqrt(lambda_hat)), 20),
  probability = dpois(0:max(ceiling(lambda_hat + 4 * sqrt(lambda_hat)), 20), lambda_hat)
)

ggplot(pois_dist, aes(x = count, y = probability)) +
  geom_col() +
  labs(
    title = "Poisson Distribution Based on the Observed Mean Count",
    subtitle = paste0("Lambda (mean count) = ", round(lambda_hat, 2)),
    x = "Count",
    y = "Probability"
  )

This is an actual Poisson distribution using the average observed count in the dataset as the rate parameter, or lambda. In a Poisson distribution, lambda is both the mean and the variance. That is important because it tells us what the model expects: counts near the average are most likely, and the probability falls as values move farther away from the center.

For this dataset, the distribution is shifted far to the right because the average number of estimated cases is large. That means the model is working with a high-count outcome, which makes it especially important to check whether the Poisson assumption of equal mean and variance is realistic.

Coefficient interpretation

beta_vax <- coef(poisson_model)["vax_coverage"]
irr_vax <- exp(beta_vax)

The coefficient for vax_coverage is 0.0187. Because this is a Poisson model with a log link, the coefficient should be interpreted on the multiplicative scale using the incidence rate ratio (IRR). The IRR for vaccination coverage is 1.0188.

This means that, holding the other predictors constant, a one-unit increase in vaccination coverage is associated with multiplying the expected case count by about 1.0188. If the IRR is below 1, higher vaccination coverage is associated with fewer expected cases. That is meaningful because it gives a direct numeric summary of the relationship rather than only a visual impression.

Model diagnostics

Residuals versus fitted values

plot(poisson_model, which = 1)

This plot checks whether the model is capturing the overall pattern in the data. A good residuals-versus-fitted plot should look like a random cloud around zero.

Here, the points do not form a clean random cloud. Instead, the spread becomes much larger in the middle and upper fitted range, and there are several extreme residuals. This suggests the model is not explaining the data equally well across the full range of predictions.

The main issue shown here is non-constant variance and a few very large outliers. That means the model is struggling most for observations with very high case counts.

Q-Q plot

plot(poisson_model, which = 2)

This plot compares the residuals to the pattern expected under the model. For a well-behaved fit, the points should follow the reference line fairly closely.

In this plot, the points bend strongly away from the line in the upper tail. That means the residuals are heavily skewed and the model is not matching the extreme values well. The far-right points are especially large, which suggests that a small number of observations are much more extreme than the Poisson model expects.

The main takeaway is that the model fits the center of the data better than the tails. That matters because the most extreme observations are often the ones that drive public-health interpretation.

Scale-location plot

plot(poisson_model, which = 3)

This plot shows whether the spread of the residuals stays about the same across fitted values. A flat pattern would suggest stable variance.

Here, the trend rises as fitted values increase, which means the residual spread gets larger for larger predicted counts. That is another sign of heteroscedasticity and likely overdispersion.

This is important because a Poisson model assumes the variance is tied closely to the mean. When the variance grows faster than expected, the standard errors can become too optimistic. That can make some predictors look more significant than they really are.

Cook’s distance

plot(poisson_model, which = 4)

Cook’s distance identifies observations that have a large effect on the fitted model. The tallest spikes are the most influential points.

In this plot, a few observations stand out clearly from the rest. That means most rows are not unusually influential, but a small number are pulling the model fit a lot. Those points should be checked to see whether they are valid extreme observations or possible data issues.

To avoid a long scroll of raw values, I only show the top influential rows:

cooks <- cooks.distance(poisson_model)
cook_threshold <- 4 / nrow(model_data)

influential_points <- model_data %>%
  mutate(row_id = row_number(), cooks_distance = cooks) %>%
  slice_max(order_by = cooks_distance, n = 8) %>%
  select(row_id, new_cases_est, vax_coverage, median_age, reproduction_rate, stringency_index, cooks_distance)

kable(influential_points, digits = 4, caption = "Top 8 observations by Cook's distance")

Top 8 observations by Cook’s distance
row_id	new_cases_est	vax_coverage	median_age	reproduction_rate	stringency_index	cooks_distance
21913	389014	2.85	28.2	0.88	81.94	377.8152
21912	389014	2.81	28.2	0.90	81.94	369.6059
21911	389014	2.77	28.2	0.92	81.94	362.4849
21910	389014	2.70	28.2	0.94	81.94	356.4433
21909	389014	2.63	28.2	0.96	81.94	351.4931
21908	389014	2.53	28.2	0.99	81.94	346.1178
21920	339115	2.94	28.2	0.79	81.94	322.6333
21919	339115	0.00	28.2	0.80	81.94	318.4159

A common rough cutoff is 9.6^{-5}. Any observation above that value is worth extra attention. The important idea is not that every influential point must be removed, but that it should be understood.

Multicollinearity

vif(poisson_model)

##      vax_coverage        median_age reproduction_rate  stringency_index 
##          1.239360          1.072237          1.000734          1.226067

The VIF values are all close to 1. That is a good sign because it means the predictors are not highly redundant. So, the model does not appear to have a multicollinearity problem. This helps because the coefficient estimates are easier to interpret separately.

Overdispersion check

dispersion_ratio <- deviance(poisson_model) / df.residual(poisson_model)
dispersion_ratio

## [1] 16032

For a Poisson model, a dispersion ratio near 1 is ideal. A ratio much larger than 1 suggests the data are more variable than the model expects.

If this value is clearly above 1, it supports the visual diagnostics showing overdispersion. That would mean the Poisson model is too simple for the data and that a negative binomial model may fit better.

Issues with the model

The main issues visible in this analysis are:

overdispersion
large residual spread for high fitted values
several influential observations
poor fit in the extreme upper tail

These issues matter because they affect how reliable the coefficient estimates and standard errors are. The model still gives a useful first summary of the relationship between the predictors and case counts, but it does not capture the full complexity of the data.

Conclusion

This Week 11 notebook builds a Poisson generalized linear model, checks the diagnostics, highlights the model issues, and interprets the vaccination coefficient. The main insight is that higher vaccination coverage is associated with lower expected case counts, while higher reproduction rate is expected to increase case counts. At the same time, the diagnostics show that the Poisson assumptions are not fully satisfied, especially because of overdispersion and influential outliers. A natural next question is whether a negative binomial model would better handle the extra variability in the data.

Week 11 Data Dive: Poisson GLM

Krish Shah

March 31, 2026