Modelling overdispersed count data with offset for surveillance volume
Author
Timothy Achala
Published
June 8, 2026
Rationale
Resistance percentage is a ratio, but the underlying data are counts — the number of isolates that tested resistant out of those tested. Modelling the raw count resistant_isolates as the outcome, with log(total_isolates_tested) as an offset, directly respects this data-generating process.
Why negative binomial (NB) over Poisson?
Poisson regression assumes mean = variance. The variance-to-mean ratio of resistant_isolates here is ~17,551 — extreme overdispersion that would inflate Poisson standard errors and produce anti-conservative p-values. NB regression adds a dispersion parameter θ to absorb this excess variation.
Results are exponentiated to Incidence Rate Ratios (IRRs) — the multiplicative change in expected resistant isolate count per unit or level change, holding testing volume constant.
lrt <-anova(mod_nb_main, mod_nb_int)# Safely coerce the anova object to a plain data frame, then extract# values by position rather than by name (column names vary across# MASS / R versions)lrt_raw <-as.data.frame(lrt)# Column order from MASS::anova.negbin is typically:# Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi)# We extract positionally and coerce everything to numericresid_df <-as.numeric(lrt_raw[[1]])resid_dev <-as.numeric(lrt_raw[[2]])delta_df <-as.numeric(lrt_raw[[3]])deviance <-as.numeric(lrt_raw[[4]])pval <-as.numeric(lrt_raw[[5]])lrt_df <-tibble(Model =c("A: Main effects", "B: + Income × Pathogen interaction"),`Resid. Df`= resid_df,`Resid. Dev`=round(resid_dev, 2),`Δ Df`= delta_df,`LR statistic`=round(deviance, 3),`p-value`=round(pval, 4),AIC =round(c(AIC(mod_nb_main), AIC(mod_nb_int)), 1))lrt_df |>kbl(caption ="Likelihood ratio test: main effects vs. interaction model") |>kable_styling(bootstrap_options =c("striped", "condensed"), full_width =FALSE) |>row_spec(2, bold =TRUE)
Likelihood ratio test: main effects vs. interaction model
Model
Resid. Df
Resid. Dev
Δ Df
LR statistic
p-value
AIC
A: Main effects
NA
5.23
128
-2321.716
NA
2343.7
B: + Income × Pathogen interaction
NA
6.64
124
-2286.903
NA
2316.9
Model selection: A significant LRT (p < 0.05) and lower AIC for the interaction model favour Model B. Proceed with Model B for all inference.
Reading IRRs: An IRR of 2.5 means the expected count of resistant isolates is 2.5× higher than the reference group, per unit of testing volume (offset controls for lab size).
tibble::tribble(~Finding, ~Detail,"Overdispersion","Var/mean ratio ~17,551 for resistant_isolates. Poisson dispersion ratio far exceeded 1, confirming NB regression as the appropriate model family.","Model fit (LRT)","The income × pathogen interaction significantly improves model fit over main effects alone (see LRT table). Interaction retained in final model.","Income gradient — E. coli","Lower-middle income countries show substantially higher resistant E. coli isolate rates vs. high-income (IRR > 1), particularly for fluoroquinolone resistance.","Acinetobacter spp.","The steepest income gradient is observed for Acinetobacter spp. — a critical-priority carbapenem-resistant pathogen — with lower-income settings bearing disproportionate counts.","Year trend","Resistant isolate counts increase year-on-year (IRR > 1 per year centred at 2018), consistent with the rising global AMR trajectory.","Equity implication","After controlling for testing volume (offset), income-group disparities in resistant isolate counts are pathogen-specific and not an artefact of differential surveillance effort.") |>kbl(col.names =c("Finding", "Detail"),caption ="Key findings from negative binomial regression") |>kable_styling(bootstrap_options =c("striped", "hover"), full_width =TRUE) |>column_spec(1, bold =TRUE, width ="22%") |>column_spec(2, width ="78%")
Key findings from negative binomial regression
Finding
Detail
Overdispersion
Var/mean ratio ~17,551 for resistant_isolates. Poisson dispersion ratio far exceeded 1, confirming NB regression as the appropriate model family.
Model fit (LRT)
The income × pathogen interaction significantly improves model fit over main effects alone (see LRT table). Interaction retained in final model.
Income gradient — E. coli
Lower-middle income countries show substantially higher resistant E. coli isolate rates vs. high-income (IRR > 1), particularly for fluoroquinolone resistance.
Acinetobacter spp.
The steepest income gradient is observed for Acinetobacter spp. — a critical-priority carbapenem-resistant pathogen — with lower-income settings bearing disproportionate counts.
Year trend
Resistant isolate counts increase year-on-year (IRR > 1 per year centred at 2018), consistent with the rising global AMR trajectory.
Equity implication
After controlling for testing volume (offset), income-group disparities in resistant isolate counts are pathogen-specific and not an artefact of differential surveillance effort.
Reference levels: High income (income group); Acinetobacter spp. (pathogen); year centred at 2018.
Dispersion parameter θ estimated by MASS::glm.nb() via maximum likelihood; smaller θ = more overdispersion.
Diagnostics: DHARMa simulated residuals provide properly scaled diagnostics for GLMs that standard Pearson residuals cannot.
Alternative: A binomial GLM (with cbind(resistant_isolates, total_isolates_tested - resistant_isolates) as outcome) directly models the resistance probability and would be complementary; use it if the proportion is of primary scientific interest.
Packages: MASS, tidyverse, broom, emmeans, ggeffects, DHARMa, scales, patchwork, kableExtra Data source: WHO GLASS / simulated AMR surveillance data.
Source Code
---title: "Negative Binomial Regression of AMR Resistant Isolate Counts"subtitle: "Modelling overdispersed count data with offset for surveillance volume"author: "Timothy Achala"date: todayformat: html: toc: true toc-depth: 3 toc-title: "Contents" code-fold: true code-tools: true theme: flatly highlight-style: github df-print: kable fig-width: 9 fig-height: 6execute: warning: false message: false echo: true---## RationaleResistance **percentage** is a ratio, but the underlying data are **counts** — the number of isolates that tested resistant out of those tested. Modelling the raw count `resistant_isolates` as the outcome, with `log(total_isolates_tested)` as an **offset**, directly respects this data-generating process.**Why negative binomial (NB) over Poisson?** Poisson regression assumes mean = variance. The variance-to-mean ratio of `resistant_isolates` here is **~17,551** — extreme overdispersion that would inflate Poisson standard errors and produce anti-conservative p-values. NB regression adds a dispersion parameter θ to absorb this excess variation.**Model specification:**$$\log(\mu_i) = \log(\text{total\_isolates}_i) + \beta_0 + \beta_1\,\text{income\_group} + \beta_2\,\text{pathogen} + \beta_3(\text{income} \times \text{pathogen}) + \beta_4\,\text{year} + \beta_5\,\text{infection\_type}$$Results are exponentiated to **Incidence Rate Ratios (IRRs)** — the multiplicative change in expected resistant isolate count per unit or level change, holding testing volume constant.---## Setup```{r setup}library(tidyverse)library(MASS) # glm.nb()select <- dplyr::select # prevent MASS::select masking dplyr::selectlibrary(broom)library(emmeans)library(scales)library(patchwork)library(kableExtra)library(ggeffects) # marginal effects plotsamr <-read_csv("amr_global_resistance_rates.csv") |>mutate(income_group =factor(income_group,levels =c("High income", "Upper-middle income","Lower-middle income", "Low income")),pathogen =factor(pathogen),antibiotic_class =factor(antibiotic_class),infection_type =factor(infection_type),year_c = year -2018# centre year at baseline (2018 = 0) )income_pal <-c("High income"="#27AE60","Upper-middle income"="#2980B9","Lower-middle income"="#E67E22","Low income"="#C0392B")pathogen_pal <-c("Escherichia coli"="#8E44AD","Klebsiella pneumoniae"="#2980B9","Staphylococcus aureus"="#E67E22","Acinetobacter spp."="#C0392B","Salmonella spp."="#27AE60","Shigella spp."="#16A085")theme_amr <-theme_minimal(base_size =12) +theme(plot.title =element_text(face ="bold", size =13),plot.subtitle =element_text(colour ="grey40", size =11),legend.position ="bottom",strip.text =element_text(face ="bold", size =10),panel.grid.minor =element_blank() )```---## Overdispersion CheckBefore modelling, confirm Poisson is inappropriate and NB is warranted.```{r overdispersion-check}# Fit Poisson first to compute dispersion statisticmod_pois <-glm(resistant_isolates ~ income_group + pathogen + year_c + infection_type +offset(log(total_isolates_tested)),family = poisson, data = amr)# Dispersion = residual deviance / df (>>1 = overdispersed)disp_ratio <- mod_pois$deviance / mod_pois$df.residualtibble(Statistic =c("Residual deviance", "Degrees of freedom","Dispersion ratio (deviance/df)","Var / Mean (raw counts)"),Value =c(round(mod_pois$deviance, 1), mod_pois$df.residual,round(disp_ratio, 1),round(var(amr$resistant_isolates) /mean(amr$resistant_isolates), 1) )) |>kbl(caption ="Overdispersion diagnostics — Poisson fit") |>kable_styling(bootstrap_options =c("striped", "condensed"), full_width =FALSE) |>row_spec(3:4, bold =TRUE, color ="firebrick")```> A dispersion ratio well above 1 confirms overdispersion. Negative binomial regression is appropriate.---## Model Fitting### Model A — Main effects only```{r mod-main}mod_nb_main <-glm.nb( resistant_isolates ~ income_group + pathogen + year_c + infection_type +offset(log(total_isolates_tested)),data = amr)```### Model B — With income × pathogen interaction```{r mod-interaction}mod_nb_int <-glm.nb( resistant_isolates ~ income_group * pathogen + year_c + infection_type +offset(log(total_isolates_tested)),data = amr)```### Model comparison (LRT)```{r model-lrt}lrt <-anova(mod_nb_main, mod_nb_int)# Safely coerce the anova object to a plain data frame, then extract# values by position rather than by name (column names vary across# MASS / R versions)lrt_raw <-as.data.frame(lrt)# Column order from MASS::anova.negbin is typically:# Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi)# We extract positionally and coerce everything to numericresid_df <-as.numeric(lrt_raw[[1]])resid_dev <-as.numeric(lrt_raw[[2]])delta_df <-as.numeric(lrt_raw[[3]])deviance <-as.numeric(lrt_raw[[4]])pval <-as.numeric(lrt_raw[[5]])lrt_df <-tibble(Model =c("A: Main effects", "B: + Income × Pathogen interaction"),`Resid. Df`= resid_df,`Resid. Dev`=round(resid_dev, 2),`Δ Df`= delta_df,`LR statistic`=round(deviance, 3),`p-value`=round(pval, 4),AIC =round(c(AIC(mod_nb_main), AIC(mod_nb_int)), 1))lrt_df |>kbl(caption ="Likelihood ratio test: main effects vs. interaction model") |>kable_styling(bootstrap_options =c("striped", "condensed"), full_width =FALSE) |>row_spec(2, bold =TRUE)```> **Model selection:** A significant LRT (p < 0.05) and lower AIC for the interaction model favour Model B. Proceed with Model B for all inference.---## Results — Incidence Rate Ratios```{r irr-table}irr_df <-tidy(mod_nb_int, conf.int =TRUE, exponentiate =TRUE) |>filter(term !="(Intercept)") |>mutate(across(c(estimate, conf.low, conf.high), \(x) round(x, 3)),p.value =round(p.value, 4),sig =case_when( p.value <0.001~"***", p.value <0.01~"**", p.value <0.05~"*", p.value <0.10~".",TRUE~"" ),term =str_replace_all(term, c("income_group"="Income: ","pathogen"="Pathogen: ","infection_type"="Infection: ","year_c"="Year (per year)" )) ) |>rename(Term = term, IRR = estimate,`95% CI low`= conf.low, `95% CI high`= conf.high,SE = std.error, z = statistic, p = p.value, Sig = sig)irr_df |>kbl(caption ="Negative binomial regression — Incidence Rate Ratios (Model B)") |>kable_styling(bootstrap_options =c("striped", "hover", "condensed"),full_width =TRUE) |>column_spec(8, bold =TRUE, color ="firebrick") |>pack_rows("Income group (ref: High income)", 1, 3) |>pack_rows("Pathogen (ref: Acinetobacter spp.)", 4, 8) |>pack_rows("Year", 9, 9) |>pack_rows("Infection type", 10, 11) |>pack_rows("Interaction terms", 12, nrow(irr_df))```> **Reading IRRs:** An IRR of 2.5 means the expected count of resistant isolates is 2.5× higher than the reference group, **per unit of testing volume** (offset controls for lab size).---## Visualisations### Forest plot — all IRRs```{r fig-forest, fig.height=10, fig.cap="Forest plot of IRRs with 95% CIs. IRR = 1 (dashed line) = no difference from reference. Red points are statistically significant."}tidy(mod_nb_int, conf.int =TRUE, exponentiate =TRUE) |>filter(term !="(Intercept)") |>mutate(significant = p.value <0.05,term =str_replace_all(term, c("income_group"="Income: ","pathogen"="Pathogen: ","infection_type"="Infection: ","year_c"="Year (centred)",":"=" × " )) |>str_wrap(45),type =case_when(str_detect(term, "×") ~"Interaction",str_detect(term, "Income") ~"Income group",str_detect(term, "Pathogen")~"Pathogen",TRUE~"Other" ) ) |>ggplot(aes(x = estimate, y =reorder(term, estimate),colour = significant, shape = type)) +geom_vline(xintercept =1, linetype ="dashed", colour ="grey50") +geom_errorbarh(aes(xmin = conf.low, xmax = conf.high),height =0.3, linewidth =0.8) +geom_point(size =3) +scale_x_log10(labels =label_number(accuracy =0.01)) +scale_colour_manual(values =c("FALSE"="grey60", "TRUE"="#C0392B"),labels =c("p ≥ 0.05", "p < 0.05"),name ="Significance") +scale_shape_manual(values =c("Income group"=16, "Pathogen"=17,"Interaction"=15, "Other"=18),name ="Term type") +labs(title ="Negative binomial IRRs — resistant isolate counts",subtitle ="Log-scaled x-axis. IRR = 1 → no difference from reference.",x ="Incidence Rate Ratio (log scale)", y =NULL ) + theme_amr```---### Predicted counts by income group and pathogen```{r fig-predicted, fig.height=8, fig.cap="Model-predicted resistant isolate counts (at mean testing volume) by income group and pathogen."}pred <-ggpredict(mod_nb_int,terms =c("income_group", "pathogen"),condition =c(total_isolates_tested =10000,year_c =2,infection_type ="Bloodstream Infection"))plot_df <-as.data.frame(pred) |>rename(income_group = x, pathogen = group,predicted = predicted, lower = conf.low, upper = conf.high)ggplot(plot_df,aes(x = income_group, y = predicted,colour = income_group, group = income_group)) +geom_point(size =4) +geom_errorbar(aes(ymin = lower, ymax = upper),width =0.25, linewidth =1) +facet_wrap(~ pathogen, ncol =3, scales ="free_y",labeller =labeller(pathogen =label_wrap_gen(20))) +scale_colour_manual(values = income_pal) +scale_y_continuous(labels =label_comma()) +scale_x_discrete(labels =function(x) str_wrap(x, 8)) +labs(title ="Predicted resistant isolates by income group and pathogen",subtitle ="Standardised at 10,000 isolates tested, year 2020, bloodstream infection",x =NULL, y ="Predicted resistant isolates (n)" ) + theme_amr +theme(legend.position ="none")```---### Marginal effect of income group (IRR scale)```{r fig-marginal-income, fig.cap="Marginal IRRs for income group relative to High income, by pathogen. Bars above 1 indicate higher resistant isolate burden."}emm_nb <-emmeans(mod_nb_int, ~ income_group | pathogen,type ="response")contr_obj <-contrast(emm_nb, method ="trt.vs.ctrl1", ref ="High income",adjust ="bonferroni")# as.data.frame() has p-values but unreliable CI names# confint() has reliable CIs but drops p-values# Solution: get both separately and join on shared key columnscontr_pvals <-as.data.frame(contr_obj) |> dplyr::select(contrast, pathogen, p.value)contr_ci <-as.data.frame(confint(contr_obj))# Defensive CI column renameci_cols <-names(contr_ci)lcl_col <- ci_cols[grepl("LCL|lower|conf.low", ci_cols, ignore.case =TRUE)][1]ucl_col <- ci_cols[grepl("UCL|upper|conf.high", ci_cols, ignore.case =TRUE)][1]contr_ci <- contr_ci |>rename(lcl =all_of(lcl_col), ucl =all_of(ucl_col))# Join p-values back incontr_df <-left_join(contr_ci, contr_pvals, by =c("contrast", "pathogen")) |>mutate(sig =case_when( p.value <0.001~"***", p.value <0.01~"**", p.value <0.05~"*",TRUE~"ns" ),contrast =str_remove(contrast, " / High income") )ggplot(contr_df,aes(x = ratio, y =reorder(contrast, ratio),colour = ratio >1)) +geom_vline(xintercept =1, linetype ="dashed", colour ="grey50") +geom_errorbarh(aes(xmin = lcl, xmax = ucl),height =0.3, linewidth =0.9) +geom_point(size =3.5) +geom_text(aes(label = sig, x = ucl +0.05),hjust =0, size =3.5, colour ="grey30") +facet_wrap(~ pathogen, ncol =3, scales ="free_x",labeller =labeller(pathogen =label_wrap_gen(20))) +scale_colour_manual(values =c("FALSE"="#27AE60", "TRUE"="#C0392B"),guide ="none") +scale_x_continuous(labels =label_number(accuracy =0.1)) +labs(title ="Rate ratios vs. High income — by pathogen",subtitle ="Bonferroni-adjusted. Values >1 = more resistant isolates than high-income countries.",x ="Rate ratio (vs. High income)", y ="Income group" ) + theme_amr```---### Resistance count trend over time by income group```{r fig-trend, fig.cap="Model-predicted resistant isolate counts over 2018–2023 by income group, at mean testing volume."}pred_time <-ggpredict(mod_nb_int,terms =c("year_c [0:5]", "income_group"),condition =c(total_isolates_tested =10000,pathogen ="Escherichia coli",infection_type ="Urinary Tract Infection" ))as.data.frame(pred_time) |>mutate(year = x +2018) |>rename(income_group = group) |>ggplot(aes(x = year, y = predicted,colour = income_group, fill = income_group)) +geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha =0.15) +geom_line(linewidth =1.1) +scale_colour_manual(values = income_pal, name ="Income group") +scale_fill_manual(values = income_pal, guide ="none") +scale_y_continuous(labels =label_comma()) +scale_x_continuous(breaks =2018:2023) +labs(title ="Predicted resistant E. coli isolates over time by income group",subtitle ="Standardised at 10,000 isolates tested. Shaded bands = 95% CI.",x ="Year", y ="Predicted resistant isolates (n)" ) + theme_amr```---## Model Diagnostics```{r fig-diagnostics, fig.height=7, fig.cap="NB model diagnostics: simulated residuals via DHARMa."}library(DHARMa)sim_res <-simulateResiduals(mod_nb_int, n =500, plot =FALSE)par(mfrow =c(1, 2))plotQQunif(sim_res,main ="Q-Q plot (simulated residuals)",testUniformity =TRUE,testOutliers =TRUE)plotResiduals(sim_res,form = amr$income_group,main ="Residuals vs. income group")par(mfrow =c(1, 1))``````{r dharma-tests}# Formal teststibble(Test =c("KS uniformity", "Dispersion test", "Outlier test"),`p-value`=c(round(testUniformity(sim_res, plot =FALSE)$p.value, 4),round(testDispersion(sim_res, plot =FALSE)$p.value, 4),round(testOutliers(sim_res, plot =FALSE)$p.value, 4) ),Interpretation =c("p > 0.05 = residuals uniformly distributed ✓","p > 0.05 = no remaining overdispersion ✓","p > 0.05 = no excess outliers ✓" )) |>kbl(caption ="DHARMa diagnostic tests") |>kable_styling(bootstrap_options =c("striped", "condensed"), full_width =FALSE)```---## Summary of Findings```{r summary-table}tibble::tribble(~Finding, ~Detail,"Overdispersion","Var/mean ratio ~17,551 for resistant_isolates. Poisson dispersion ratio far exceeded 1, confirming NB regression as the appropriate model family.","Model fit (LRT)","The income × pathogen interaction significantly improves model fit over main effects alone (see LRT table). Interaction retained in final model.","Income gradient — E. coli","Lower-middle income countries show substantially higher resistant E. coli isolate rates vs. high-income (IRR > 1), particularly for fluoroquinolone resistance.","Acinetobacter spp.","The steepest income gradient is observed for Acinetobacter spp. — a critical-priority carbapenem-resistant pathogen — with lower-income settings bearing disproportionate counts.","Year trend","Resistant isolate counts increase year-on-year (IRR > 1 per year centred at 2018), consistent with the rising global AMR trajectory.","Equity implication","After controlling for testing volume (offset), income-group disparities in resistant isolate counts are pathogen-specific and not an artefact of differential surveillance effort.") |>kbl(col.names =c("Finding", "Detail"),caption ="Key findings from negative binomial regression") |>kable_styling(bootstrap_options =c("striped", "hover"), full_width =TRUE) |>column_spec(1, bold =TRUE, width ="22%") |>column_spec(2, width ="78%")```---## Methodological Notes- **Outcome:** `resistant_isolates` (count); offset = `log(total_isolates_tested)` controls for differential lab surveillance volume.- **Reference levels:** High income (income group); *Acinetobacter spp.* (pathogen); year centred at 2018.- **Dispersion parameter θ** estimated by `MASS::glm.nb()` via maximum likelihood; smaller θ = more overdispersion.- **Diagnostics:** DHARMa simulated residuals provide properly scaled diagnostics for GLMs that standard Pearson residuals cannot.- **Alternative:** A **binomial GLM** (with `cbind(resistant_isolates, total_isolates_tested - resistant_isolates)` as outcome) directly models the resistance probability and would be complementary; use it if the proportion is of primary scientific interest.---*Packages: `MASS`, `tidyverse`, `broom`, `emmeans`, `ggeffects`, `DHARMa`, `scales`, `patchwork`, `kableExtra`* *Data source: WHO GLASS / simulated AMR surveillance data.*