Negative Binomial Regression of AMR Resistant Isolate Counts

Modelling overdispersed count data with offset for surveillance volume

Author

Timothy Achala

Published

June 8, 2026

Rationale

Resistance percentage is a ratio, but the underlying data are counts — the number of isolates that tested resistant out of those tested. Modelling the raw count resistant_isolates as the outcome, with log(total_isolates_tested) as an offset, directly respects this data-generating process.

Why negative binomial (NB) over Poisson?
Poisson regression assumes mean = variance. The variance-to-mean ratio of resistant_isolates here is ~17,551 — extreme overdispersion that would inflate Poisson standard errors and produce anti-conservative p-values. NB regression adds a dispersion parameter θ to absorb this excess variation.

Model specification:

\[ \log(\mu_i) = \log(\text{total\_isolates}_i) + \beta_0 + \beta_1\,\text{income\_group} + \beta_2\,\text{pathogen} + \beta_3(\text{income} \times \text{pathogen}) + \beta_4\,\text{year} + \beta_5\,\text{infection\_type} \]

Results are exponentiated to Incidence Rate Ratios (IRRs) — the multiplicative change in expected resistant isolate count per unit or level change, holding testing volume constant.

Setup

Code

library(tidyverse)
library(MASS)          # glm.nb()
select <- dplyr::select # prevent MASS::select masking dplyr::select
library(broom)
library(emmeans)
library(scales)
library(patchwork)
library(kableExtra)
library(ggeffects)     # marginal effects plots

amr <- read_csv("amr_global_resistance_rates.csv") |>
  mutate(
    income_group = factor(income_group,
      levels = c("High income", "Upper-middle income",
                 "Lower-middle income", "Low income")),
    pathogen        = factor(pathogen),
    antibiotic_class = factor(antibiotic_class),
    infection_type  = factor(infection_type),
    year_c          = year - 2018   # centre year at baseline (2018 = 0)
  )

income_pal <- c(
  "High income"           = "#27AE60",
  "Upper-middle income"   = "#2980B9",
  "Lower-middle income"   = "#E67E22",
  "Low income"            = "#C0392B"
)

pathogen_pal <- c(
  "Escherichia coli"      = "#8E44AD",
  "Klebsiella pneumoniae" = "#2980B9",
  "Staphylococcus aureus" = "#E67E22",
  "Acinetobacter spp."    = "#C0392B",
  "Salmonella spp."       = "#27AE60",
  "Shigella spp."         = "#16A085"
)

theme_amr <- theme_minimal(base_size = 12) +
  theme(
    plot.title    = element_text(face = "bold", size = 13),
    plot.subtitle = element_text(colour = "grey40", size = 11),
    legend.position = "bottom",
    strip.text    = element_text(face = "bold", size = 10),
    panel.grid.minor = element_blank()
  )

Overdispersion Check

Before modelling, confirm Poisson is inappropriate and NB is warranted.

Code

# Fit Poisson first to compute dispersion statistic
mod_pois <- glm(resistant_isolates ~ income_group + pathogen + year_c +
                  infection_type + offset(log(total_isolates_tested)),
                family = poisson, data = amr)

# Dispersion = residual deviance / df  (>>1 = overdispersed)
disp_ratio <- mod_pois$deviance / mod_pois$df.residual

tibble(
  Statistic    = c("Residual deviance", "Degrees of freedom",
                   "Dispersion ratio (deviance/df)",
                   "Var / Mean (raw counts)"),
  Value        = c(
    round(mod_pois$deviance, 1),
    mod_pois$df.residual,
    round(disp_ratio, 1),
    round(var(amr$resistant_isolates) / mean(amr$resistant_isolates), 1)
  )
) |>
  kbl(caption = "Overdispersion diagnostics — Poisson fit") |>
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = FALSE) |>
  row_spec(3:4, bold = TRUE, color = "firebrick")

Overdispersion diagnostics — Poisson fit
Statistic	Value
Residual deviance	81765.3
Degrees of freedom	128.0
Dispersion ratio (deviance/df)	638.8
Var / Mean (raw counts)	17550.8

A dispersion ratio well above 1 confirms overdispersion. Negative binomial regression is appropriate.

Model Fitting

Model A — Main effects only

Code

mod_nb_main <- glm.nb(
  resistant_isolates ~ income_group + pathogen + year_c +
    infection_type + offset(log(total_isolates_tested)),
  data = amr
)

Model B — With income × pathogen interaction

Code

mod_nb_int <- glm.nb(
  resistant_isolates ~ income_group * pathogen + year_c +
    infection_type + offset(log(total_isolates_tested)),
  data = amr
)

Model comparison (LRT)

Code

lrt <- anova(mod_nb_main, mod_nb_int)

# Safely coerce the anova object to a plain data frame, then extract
# values by position rather than by name (column names vary across
# MASS / R versions)
lrt_raw <- as.data.frame(lrt)

# Column order from MASS::anova.negbin is typically:
#   Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi)
# We extract positionally and coerce everything to numeric
resid_df  <- as.numeric(lrt_raw[[1]])
resid_dev <- as.numeric(lrt_raw[[2]])
delta_df  <- as.numeric(lrt_raw[[3]])
deviance  <- as.numeric(lrt_raw[[4]])
pval      <- as.numeric(lrt_raw[[5]])

lrt_df <- tibble(
  Model         = c("A: Main effects", "B: + Income × Pathogen interaction"),
  `Resid. Df`   = resid_df,
  `Resid. Dev`  = round(resid_dev, 2),
  `Δ Df`        = delta_df,
  `LR statistic`= round(deviance, 3),
  `p-value`     = round(pval, 4),
  AIC           = round(c(AIC(mod_nb_main), AIC(mod_nb_int)), 1)
)

lrt_df |>
  kbl(caption = "Likelihood ratio test: main effects vs. interaction model") |>
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = FALSE) |>
  row_spec(2, bold = TRUE)

Likelihood ratio test: main effects vs. interaction model
Model	Resid. Df	Resid. Dev	Δ Df	LR statistic	p-value	AIC
A: Main effects	NA	5.23	128	-2321.716	NA	2343.7
B: + Income × Pathogen interaction	NA	6.64	124	-2286.903	NA	2316.9

Model selection: A significant LRT (p < 0.05) and lower AIC for the interaction model favour Model B. Proceed with Model B for all inference.

Results — Incidence Rate Ratios

Code

irr_df <- tidy(mod_nb_int, conf.int = TRUE, exponentiate = TRUE) |>
  filter(term != "(Intercept)") |>
  mutate(
    across(c(estimate, conf.low, conf.high), \(x) round(x, 3)),
    p.value = round(p.value, 4),
    sig = case_when(
      p.value < 0.001 ~ "***",
      p.value < 0.01  ~ "**",
      p.value < 0.05  ~ "*",
      p.value < 0.10  ~ ".",
      TRUE            ~ ""
    ),
    term = str_replace_all(term, c(
      "income_group"   = "Income: ",
      "pathogen"       = "Pathogen: ",
      "infection_type" = "Infection: ",
      "year_c"         = "Year (per year)"
    ))
  ) |>
  rename(Term = term, IRR = estimate,
         `95% CI low` = conf.low, `95% CI high` = conf.high,
         SE = std.error, z = statistic, p = p.value, Sig = sig)

irr_df |>
  kbl(caption = "Negative binomial regression — Incidence Rate Ratios (Model B)") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE) |>
  column_spec(8, bold = TRUE, color = "firebrick") |>
  pack_rows("Income group (ref: High income)", 1, 3) |>
  pack_rows("Pathogen (ref: Acinetobacter spp.)", 4, 8) |>
  pack_rows("Year", 9, 9) |>
  pack_rows("Infection type", 10, 11) |>
  pack_rows("Interaction terms", 12, nrow(irr_df))

Negative binomial regression — Incidence Rate Ratios (Model B)
Term	IRR	SE	z	p	95% CI low	95% CI high	Sig
Income group (ref: High income)
Income: Upper-middle income	1.033	0.2749856	0.1188584	0.9054	0.602	1.773
Income: Lower-middle income	1.307	0.1590461	1.6818270	0.0926	0.956	1.786	.
Income: Low income	0.352	0.2761728	-3.7771560	0.0002	0.205	0.606	***
Pathogen (ref: Acinetobacter spp.)
Pathogen: Escherichia coli	0.240	0.2616013	-5.4602188	0.0000	0.142	0.403	***
Pathogen: Klebsiella pneumoniae	0.308	0.2426039	-4.8537388	0.0000	0.190	0.493	***
Pathogen: Shigella spp.	0.468	0.2246805	-3.3814992	0.0007	0.301	0.728	***
Pathogen: Staphylococcus aureus	0.536	0.1946877	-3.2006383	0.0014	0.363	0.780	**
Year (per year)	1.028	0.0193864	1.4453991	0.1483	0.990	1.068
Year
Infection: Urinary Tract Infection	1.173	0.1467772	1.0857409	0.2776	0.869	1.575
Infection type
Income: Upper-middle income:Pathogen: Escherichia coli	3.266	0.3116939	3.7972513	0.0001	1.765	6.051	***
Income: Lower-middle income:Pathogen: Escherichia coli	3.250	0.2108468	5.5900404	0.0000	2.154	4.900	***
Interaction terms
Income: Upper-middle income:Pathogen: Klebsiella pneumoniae	2.460	0.3107734	2.8970800	0.0038	1.338	4.531	**
Income: Lower-middle income:Pathogen: Klebsiella pneumoniae	1.304	0.2151607	1.2326287	0.2177	0.856	1.991

Reading IRRs: An IRR of 2.5 means the expected count of resistant isolates is 2.5× higher than the reference group, per unit of testing volume (offset controls for lab size).

Visualisations

Forest plot — all IRRs

Code

tidy(mod_nb_int, conf.int = TRUE, exponentiate = TRUE) |>
  filter(term != "(Intercept)") |>
  mutate(
    significant = p.value < 0.05,
    term = str_replace_all(term, c(
      "income_group"   = "Income: ",
      "pathogen"       = "Pathogen: ",
      "infection_type" = "Infection: ",
      "year_c"         = "Year (centred)",
      ":"              = " × "
    )) |> str_wrap(45),
    type = case_when(
      str_detect(term, "×")       ~ "Interaction",
      str_detect(term, "Income")  ~ "Income group",
      str_detect(term, "Pathogen")~ "Pathogen",
      TRUE                        ~ "Other"
    )
  ) |>
  ggplot(aes(x = estimate, y = reorder(term, estimate),
             colour = significant, shape = type)) +
  geom_vline(xintercept = 1, linetype = "dashed", colour = "grey50") +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high),
                 height = 0.3, linewidth = 0.8) +
  geom_point(size = 3) +
  scale_x_log10(labels = label_number(accuracy = 0.01)) +
  scale_colour_manual(values = c("FALSE" = "grey60", "TRUE" = "#C0392B"),
                      labels = c("p ≥ 0.05", "p < 0.05"),
                      name = "Significance") +
  scale_shape_manual(values = c("Income group" = 16, "Pathogen" = 17,
                                 "Interaction" = 15, "Other" = 18),
                     name = "Term type") +
  labs(
    title    = "Negative binomial IRRs — resistant isolate counts",
    subtitle = "Log-scaled x-axis. IRR = 1 → no difference from reference.",
    x = "Incidence Rate Ratio (log scale)", y = NULL
  ) +
  theme_amr

Figure 1: Forest plot of IRRs with 95% CIs. IRR = 1 (dashed line) = no difference from reference. Red points are statistically significant.

Predicted counts by income group and pathogen

Code

pred <- ggpredict(mod_nb_int,
                  terms = c("income_group", "pathogen"),
                  condition = c(total_isolates_tested = 10000,
                                year_c = 2,
                                infection_type = "Bloodstream Infection"))

plot_df <- as.data.frame(pred) |>
  rename(income_group = x, pathogen = group,
         predicted = predicted, lower = conf.low, upper = conf.high)

ggplot(plot_df,
       aes(x = income_group, y = predicted,
           colour = income_group, group = income_group)) +
  geom_point(size = 4) +
  geom_errorbar(aes(ymin = lower, ymax = upper),
                width = 0.25, linewidth = 1) +
  facet_wrap(~ pathogen, ncol = 3, scales = "free_y",
             labeller = labeller(pathogen = label_wrap_gen(20))) +
  scale_colour_manual(values = income_pal) +
  scale_y_continuous(labels = label_comma()) +
  scale_x_discrete(labels = function(x) str_wrap(x, 8)) +
  labs(
    title    = "Predicted resistant isolates by income group and pathogen",
    subtitle = "Standardised at 10,000 isolates tested, year 2020, bloodstream infection",
    x = NULL, y = "Predicted resistant isolates (n)"
  ) +
  theme_amr +
  theme(legend.position = "none")

Figure 2: Model-predicted resistant isolate counts (at mean testing volume) by income group and pathogen.

Marginal effect of income group (IRR scale)

Code

emm_nb <- emmeans(mod_nb_int, ~ income_group | pathogen,
                  type = "response")

contr_obj <- contrast(emm_nb, method = "trt.vs.ctrl1", ref = "High income",
                      adjust = "bonferroni")

# as.data.frame() has p-values but unreliable CI names
# confint()       has reliable CIs but drops p-values
# Solution: get both separately and join on shared key columns
contr_pvals <- as.data.frame(contr_obj) |>
  dplyr::select(contrast, pathogen, p.value)

contr_ci <- as.data.frame(confint(contr_obj))

# Defensive CI column rename
ci_cols <- names(contr_ci)
lcl_col <- ci_cols[grepl("LCL|lower|conf.low", ci_cols, ignore.case = TRUE)][1]
ucl_col <- ci_cols[grepl("UCL|upper|conf.high", ci_cols, ignore.case = TRUE)][1]
contr_ci <- contr_ci |> rename(lcl = all_of(lcl_col), ucl = all_of(ucl_col))

# Join p-values back in
contr_df <- left_join(contr_ci, contr_pvals, by = c("contrast", "pathogen")) |>
  mutate(
    sig = case_when(
      p.value < 0.001 ~ "***",
      p.value < 0.01  ~ "**",
      p.value < 0.05  ~ "*",
      TRUE            ~ "ns"
    ),
    contrast = str_remove(contrast, " / High income")
  )

ggplot(contr_df,
       aes(x = ratio, y = reorder(contrast, ratio),
           colour = ratio > 1)) +
  geom_vline(xintercept = 1, linetype = "dashed", colour = "grey50") +
  geom_errorbarh(aes(xmin = lcl, xmax = ucl),
                 height = 0.3, linewidth = 0.9) +
  geom_point(size = 3.5) +
  geom_text(aes(label = sig, x = ucl + 0.05),
            hjust = 0, size = 3.5, colour = "grey30") +
  facet_wrap(~ pathogen, ncol = 3, scales = "free_x",
             labeller = labeller(pathogen = label_wrap_gen(20))) +
  scale_colour_manual(values = c("FALSE" = "#27AE60", "TRUE" = "#C0392B"),
                      guide = "none") +
  scale_x_continuous(labels = label_number(accuracy = 0.1)) +
  labs(
    title    = "Rate ratios vs. High income — by pathogen",
    subtitle = "Bonferroni-adjusted. Values >1 = more resistant isolates than high-income countries.",
    x = "Rate ratio (vs. High income)", y = "Income group"
  ) +
  theme_amr

Figure 3: Marginal IRRs for income group relative to High income, by pathogen. Bars above 1 indicate higher resistant isolate burden.

Resistance count trend over time by income group

Code

pred_time <- ggpredict(mod_nb_int,
                       terms = c("year_c [0:5]", "income_group"),
                       condition = c(
                         total_isolates_tested = 10000,
                         pathogen = "Escherichia coli",
                         infection_type = "Urinary Tract Infection"
                       ))

as.data.frame(pred_time) |>
  mutate(year = x + 2018) |>
  rename(income_group = group) |>
  ggplot(aes(x = year, y = predicted,
             colour = income_group, fill = income_group)) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.15) +
  geom_line(linewidth = 1.1) +
  scale_colour_manual(values = income_pal, name = "Income group") +
  scale_fill_manual(values = income_pal, guide = "none") +
  scale_y_continuous(labels = label_comma()) +
  scale_x_continuous(breaks = 2018:2023) +
  labs(
    title    = "Predicted resistant E. coli isolates over time by income group",
    subtitle = "Standardised at 10,000 isolates tested. Shaded bands = 95% CI.",
    x = "Year", y = "Predicted resistant isolates (n)"
  ) +
  theme_amr

Figure 4: Model-predicted resistant isolate counts over 2018–2023 by income group, at mean testing volume.

Model Diagnostics

Code

library(DHARMa)

sim_res <- simulateResiduals(mod_nb_int, n = 500, plot = FALSE)

par(mfrow = c(1, 2))
plotQQunif(sim_res,
           main = "Q-Q plot (simulated residuals)",
           testUniformity = TRUE,
           testOutliers = TRUE)
plotResiduals(sim_res,
              form = amr$income_group,
              main = "Residuals vs. income group")
par(mfrow = c(1, 1))

Figure 5: NB model diagnostics: simulated residuals via DHARMa.

Code

# Formal tests
tibble(
  Test = c("KS uniformity", "Dispersion test", "Outlier test"),
  `p-value` = c(
    round(testUniformity(sim_res, plot = FALSE)$p.value, 4),
    round(testDispersion(sim_res, plot = FALSE)$p.value, 4),
    round(testOutliers(sim_res, plot = FALSE)$p.value, 4)
  ),
  Interpretation = c(
    "p > 0.05 = residuals uniformly distributed ✓",
    "p > 0.05 = no remaining overdispersion ✓",
    "p > 0.05 = no excess outliers ✓"
  )
) |>
  kbl(caption = "DHARMa diagnostic tests") |>
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = FALSE)

DHARMa diagnostic tests
Test	p-value	Interpretation
KS uniformity	0.2826	p > 0.05 = residuals uniformly distributed ✓
Dispersion test	0.1360	p > 0.05 = no remaining overdispersion ✓
Outlier test	0.8000	p > 0.05 = no excess outliers ✓

Summary of Findings

Code

tibble::tribble(
  ~Finding, ~Detail,
  "Overdispersion",
    "Var/mean ratio ~17,551 for resistant_isolates. Poisson dispersion ratio far exceeded 1, confirming NB regression as the appropriate model family.",
  "Model fit (LRT)",
    "The income × pathogen interaction significantly improves model fit over main effects alone (see LRT table). Interaction retained in final model.",
  "Income gradient — E. coli",
    "Lower-middle income countries show substantially higher resistant E. coli isolate rates vs. high-income (IRR > 1), particularly for fluoroquinolone resistance.",
  "Acinetobacter spp.",
    "The steepest income gradient is observed for Acinetobacter spp. — a critical-priority carbapenem-resistant pathogen — with lower-income settings bearing disproportionate counts.",
  "Year trend",
    "Resistant isolate counts increase year-on-year (IRR > 1 per year centred at 2018), consistent with the rising global AMR trajectory.",
  "Equity implication",
    "After controlling for testing volume (offset), income-group disparities in resistant isolate counts are pathogen-specific and not an artefact of differential surveillance effort."
) |>
  kbl(col.names = c("Finding", "Detail"),
      caption   = "Key findings from negative binomial regression") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE) |>
  column_spec(1, bold = TRUE, width = "22%") |>
  column_spec(2, width = "78%")

Key findings from negative binomial regression
Finding	Detail
Overdispersion	Var/mean ratio ~17,551 for resistant_isolates. Poisson dispersion ratio far exceeded 1, confirming NB regression as the appropriate model family.
Model fit (LRT)	The income × pathogen interaction significantly improves model fit over main effects alone (see LRT table). Interaction retained in final model.
Income gradient — E. coli	Lower-middle income countries show substantially higher resistant E. coli isolate rates vs. high-income (IRR > 1), particularly for fluoroquinolone resistance.
Acinetobacter spp.	The steepest income gradient is observed for Acinetobacter spp. — a critical-priority carbapenem-resistant pathogen — with lower-income settings bearing disproportionate counts.
Year trend	Resistant isolate counts increase year-on-year (IRR > 1 per year centred at 2018), consistent with the rising global AMR trajectory.
Equity implication	After controlling for testing volume (offset), income-group disparities in resistant isolate counts are pathogen-specific and not an artefact of differential surveillance effort.

Methodological Notes

Outcome: resistant_isolates (count); offset = log(total_isolates_tested) controls for differential lab surveillance volume.
Reference levels: High income (income group); Acinetobacter spp. (pathogen); year centred at 2018.
Dispersion parameter θ estimated by MASS::glm.nb() via maximum likelihood; smaller θ = more overdispersion.
Diagnostics: DHARMa simulated residuals provide properly scaled diagnostics for GLMs that standard Pearson residuals cannot.
Alternative: A binomial GLM (with cbind(resistant_isolates, total_isolates_tested - resistant_isolates) as outcome) directly models the resistance probability and would be complementary; use it if the proportion is of primary scientific interest.

Packages: MASS, tidyverse, broom, emmeans, ggeffects, DHARMa, scales, patchwork, kableExtra
Data source: WHO GLASS / simulated AMR surveillance data.