This notebook completes the Week 11 requirements only:
covid <- read.csv("covid_combined_groups.csv")
covid$date <- as.Date(covid$date)
model_data <- covid %>%
select(new_cases_smoothed_per_million,
population,
vax_coverage,
median_age,
reproduction_rate,
stringency_index) %>%
drop_na() %>%
mutate(new_cases_est = round(new_cases_smoothed_per_million * population / 1000000)) %>%
filter(new_cases_est >= 0)
poisson_model <- glm(
new_cases_est ~ vax_coverage + median_age + reproduction_rate + stringency_index,
family = poisson(link = "log"),
data = model_data
)
summary(poisson_model)##
## Call:
## glm(formula = new_cases_est ~ vax_coverage + median_age + reproduction_rate +
## stringency_index, family = poisson(link = "log"), data = model_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.771e+00 5.146e-04 15102 <2e-16 ***
## vax_coverage 1.865e-02 2.762e-06 6753 <2e-16 ***
## median_age -1.891e-02 8.743e-06 -2163 <2e-16 ***
## reproduction_rate -2.620e-01 2.183e-04 -1200 <2e-16 ***
## stringency_index 2.689e-02 4.272e-06 6295 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 737274115 on 41601 degrees of freedom
## Residual deviance: 666882985 on 41597 degrees of freedom
## AIC: 667227179
##
## Number of Fisher Scoring iterations: 7
I used a Poisson generalized linear model because
the response variable is a count: new_cases_est. A Poisson
model is a natural choice for count outcomes because it is designed for
non-negative integers and uses a log link. That means the predictors act
multiplicatively on the expected count rather than additively.
The goal of the model is to see how country-level vaccination coverage, age structure, reproduction rate, and policy stringency are associated with the expected number of new cases.
lambda_hat <- mean(model_data$new_cases_est)
pois_dist <- tibble(
count = 0:max(ceiling(lambda_hat + 4 * sqrt(lambda_hat)), 20),
probability = dpois(0:max(ceiling(lambda_hat + 4 * sqrt(lambda_hat)), 20), lambda_hat)
)
ggplot(pois_dist, aes(x = count, y = probability)) +
geom_col() +
labs(
title = "Poisson Distribution Based on the Observed Mean Count",
subtitle = paste0("Lambda (mean count) = ", round(lambda_hat, 2)),
x = "Count",
y = "Probability"
)This is an actual Poisson distribution using the average observed count in the dataset as the rate parameter, or lambda. In a Poisson distribution, lambda is both the mean and the variance. That is important because it tells us what the model expects: counts near the average are most likely, and the probability falls as values move farther away from the center.
For this dataset, the distribution is shifted far to the right because the average number of estimated cases is large. That means the model is working with a high-count outcome, which makes it especially important to check whether the Poisson assumption of equal mean and variance is realistic.
The coefficient for vax_coverage is 0.0187. Because this is a Poisson model with a log link, the coefficient should be interpreted on the multiplicative scale using the incidence rate ratio (IRR). The IRR for vaccination coverage is 1.0188.
This means that, holding the other predictors constant, a one-unit increase in vaccination coverage is associated with multiplying the expected case count by about 1.0188. If the IRR is below 1, higher vaccination coverage is associated with fewer expected cases. That is meaningful because it gives a direct numeric summary of the relationship rather than only a visual impression.
This plot checks whether the model is capturing the overall pattern in the data. A good residuals-versus-fitted plot should look like a random cloud around zero.
Here, the points do not form a clean random cloud. Instead, the spread becomes much larger in the middle and upper fitted range, and there are several extreme residuals. This suggests the model is not explaining the data equally well across the full range of predictions.
The main issue shown here is non-constant variance and a few very large outliers. That means the model is struggling most for observations with very high case counts.
This plot compares the residuals to the pattern expected under the model. For a well-behaved fit, the points should follow the reference line fairly closely.
In this plot, the points bend strongly away from the line in the upper tail. That means the residuals are heavily skewed and the model is not matching the extreme values well. The far-right points are especially large, which suggests that a small number of observations are much more extreme than the Poisson model expects.
The main takeaway is that the model fits the center of the data better than the tails. That matters because the most extreme observations are often the ones that drive public-health interpretation.
This plot shows whether the spread of the residuals stays about the same across fitted values. A flat pattern would suggest stable variance.
Here, the trend rises as fitted values increase, which means the residual spread gets larger for larger predicted counts. That is another sign of heteroscedasticity and likely overdispersion.
This is important because a Poisson model assumes the variance is tied closely to the mean. When the variance grows faster than expected, the standard errors can become too optimistic. That can make some predictors look more significant than they really are.
Cook’s distance identifies observations that have a large effect on the fitted model. The tallest spikes are the most influential points.
In this plot, a few observations stand out clearly from the rest. That means most rows are not unusually influential, but a small number are pulling the model fit a lot. Those points should be checked to see whether they are valid extreme observations or possible data issues.
To avoid a long scroll of raw values, I only show the top influential rows:
cooks <- cooks.distance(poisson_model)
cook_threshold <- 4 / nrow(model_data)
influential_points <- model_data %>%
mutate(row_id = row_number(), cooks_distance = cooks) %>%
slice_max(order_by = cooks_distance, n = 8) %>%
select(row_id, new_cases_est, vax_coverage, median_age, reproduction_rate, stringency_index, cooks_distance)
kable(influential_points, digits = 4, caption = "Top 8 observations by Cook's distance")| row_id | new_cases_est | vax_coverage | median_age | reproduction_rate | stringency_index | cooks_distance |
|---|---|---|---|---|---|---|
| 21913 | 389014 | 2.85 | 28.2 | 0.88 | 81.94 | 377.8152 |
| 21912 | 389014 | 2.81 | 28.2 | 0.90 | 81.94 | 369.6059 |
| 21911 | 389014 | 2.77 | 28.2 | 0.92 | 81.94 | 362.4849 |
| 21910 | 389014 | 2.70 | 28.2 | 0.94 | 81.94 | 356.4433 |
| 21909 | 389014 | 2.63 | 28.2 | 0.96 | 81.94 | 351.4931 |
| 21908 | 389014 | 2.53 | 28.2 | 0.99 | 81.94 | 346.1178 |
| 21920 | 339115 | 2.94 | 28.2 | 0.79 | 81.94 | 322.6333 |
| 21919 | 339115 | 0.00 | 28.2 | 0.80 | 81.94 | 318.4159 |
A common rough cutoff is 9.6^{-5}. Any observation above that value is worth extra attention. The important idea is not that every influential point must be removed, but that it should be understood.
## vax_coverage median_age reproduction_rate stringency_index
## 1.239360 1.072237 1.000734 1.226067
The VIF values are all close to 1. That is a good sign because it means the predictors are not highly redundant. So, the model does not appear to have a multicollinearity problem. This helps because the coefficient estimates are easier to interpret separately.
## [1] 16032
For a Poisson model, a dispersion ratio near 1 is ideal. A ratio much larger than 1 suggests the data are more variable than the model expects.
If this value is clearly above 1, it supports the visual diagnostics showing overdispersion. That would mean the Poisson model is too simple for the data and that a negative binomial model may fit better.
The main issues visible in this analysis are:
These issues matter because they affect how reliable the coefficient estimates and standard errors are. The model still gives a useful first summary of the relationship between the predictors and case counts, but it does not capture the full complexity of the data.
This Week 11 notebook builds a Poisson generalized linear model, checks the diagnostics, highlights the model issues, and interprets the vaccination coefficient. The main insight is that higher vaccination coverage is associated with lower expected case counts, while higher reproduction rate is expected to increase case counts. At the same time, the diagnostics show that the Poisson assumptions are not fully satisfied, especially because of overdispersion and influential outliers. A natural next question is whether a negative binomial model would better handle the extra variability in the data.