Is HLM More Accurate than NB Regression Analyses?
When detecting HCV outbreaks, HLM will alert you if the current quarter’s case count is much higher than the average of previous quarters, without adjusting for changes in population size or time trends. In contrast, NB regression can model expected case counts based on factors such as time, region, population, and seasonality, providing a more accurate expected count along with confidence intervals.
It is generally recommended to use NB regression (especially in the presence of overdispersion or covariates) when accuracy, context, and statistical control are required, while HLM is more suitable for quick anomaly detection in settings without detailed data or modeling resources.
Here are the steps for comparing HLM vs. NB regression in R using dummy HCV data:
1. Install and Load Required Packages.
#install.packages(c("tidyverse","MASS"))
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(MASS) # For negative binomial model
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
2. Create Dummy Quarterly HCV Data (with Outbreak).
set.seed(1980)
years <- 2015:2024
quarters <- rep(1:4, times = length(years))
dates <- seq(as.Date("2015-01-01"), by = "quarter", length.out = length(years) * 4)
# Simulate baseline HCV counts + artificial outbreak in last year
baseline_cases <- rpois(36, lambda = 10)
outbreak_cases <- rpois(4, lambda = 25) # outbreak year
cases <- c(baseline_cases, outbreak_cases)
df <- data.frame(
date = dates,
year = year(dates),
quarter = quarter(dates),
hcv_cases = cases,
time = 1:length(dates)
)
# Returns the first 10 rows
head(df, 10)
## date year quarter hcv_cases time
## 1 2015-01-01 2015 1 11 1
## 2 2015-04-01 2015 2 14 2
## 3 2015-07-01 2015 3 5 3
## 4 2015-10-01 2015 4 11 4
## 5 2016-01-01 2016 1 13 5
## 6 2016-04-01 2016 2 7 6
## 7 2016-07-01 2016 3 16 7
## 8 2016-10-01 2016 4 11 8
## 9 2017-01-01 2017 1 7 9
## 10 2017-04-01 2017 2 8 10
3. Create Historical Limits Method (HLM).
# Define a function called detect_hlm to apply the Historical Limits Method (HLM)
# data = input dataset, baseline_years = number of years to use as reference
detect_hlm <- function(data, baseline_years = 3) {
# Create an empty dataframe to store results
out <- data.frame()
# Loop through all 4 quarters of the year
for (q in 1:4) {
# Loop through each year starting after baseline_years up to the last year
for (y in (min(data$year) + baseline_years):(max(data$year))) {
# Select the current quarter for the given year
current <- data %>% filter(year == y, quarter == q)
# Select historical baseline data: same quarter, but from the past 'baseline_years'
historical <- data %>%
filter(quarter == q, year %in% (y - baseline_years):(y - 1))
# Skip iteration if there are not enough baseline years
if (nrow(historical) < baseline_years) next
# Calculate the mean (expected number of cases) from the baseline
mu <- mean(historical$hcv_cases)
# Calculate the standard deviation from the baseline
sd_ <- sd(historical$hcv_cases)
# Define outbreak threshold as mean + 2 standard deviations
threshold <- mu + 2 * sd_
# Flag as outbreak (1) if cases exceed threshold, otherwise no outbreak (0)
current$hlm_flag <- ifelse(current$hcv_cases > threshold, 1, 0)
# Store the threshold used for comparison
current$threshold <- threshold
# Store the baseline mean for reference
current$baseline_mean <- mu
# Store the baseline standard deviation for reference
current$baseline_sd <- sd_
# Append the current row with results into the output dataframe
out <- bind_rows(out, current)
}
}
# Return the final dataset with outbreak detection results
return(out)
}
# Run the HLM detection on the dataset 'df' and save results
hlm_results <- detect_hlm(df)
4. Create Negative Binomial (NB) Regression
# Fit NB regression with time and quarter as predictors
nb_model <- glm.nb(hcv_cases ~ time + factor(quarter), data = df)
# Predict expected counts and CI
df$nb_expected <- predict(nb_model, type = "response")
se_fit <- predict(nb_model, type = "link", se.fit = TRUE)$se.fit
link_fit <- predict(nb_model, type = "link")
# Transform back to response scale
df$nb_upper <- exp(link_fit + 2 * se_fit) # upper bound (like threshold)
df$nb_flag <- ifelse(df$hcv_cases > df$nb_upper, 1, 0)
5. Combine HLM and NB flags
# Combine flags
combined <- left_join(hlm_results, df, by = "date")
# Rename variables
combined<- combined %>%
rename("year"=year.x, "quarter"=quarter.x,
"hcv_cases"=hcv_cases.x,"time"=time.x)
6. Compare Detection Methods
# Create a ggplot object using the dataset "combined"
ggplot(combined, aes(x = date)) +
# Add a gray line showing the observed HCV cases over time
geom_line(aes(y = hcv_cases), color = "gray30") +
# Add points for HCV cases, colored by whether HLM flagged an outbreak
# hlm_flag = 0 (black, no outbreak), hlm_flag = 1 (red, outbreak)
geom_point(aes(y = hcv_cases, color = factor(hlm_flag)), size = 7, shape = 16) +
# Add points for HCV cases, shaped by whether NB regression flagged an outbreak
# nb_flag = 0 (open circle), nb_flag = 1 (filled triangle)
# Color is fixed as blue to distinguish NB
geom_point(aes(y = hcv_cases, shape = factor(nb_flag)), size = 5, color = "blue") +
# Add HLM threshold line: mean + 2*SD, shown as a dashed red line
geom_line(aes(y = threshold), linetype = "dashed", color = "red") +
# Add NB upper confidence threshold, shown as a dotted blue line
geom_line(aes(y = nb_upper), linetype = "dotted", color = "blue") +
# Define how the HLM flag colors should appear in the legend
# "0" = black (no outbreak), "1" = red (outbreak detected)
scale_color_manual(values = c("0" = "black", "1" = "red"), name = "HLM Flag") +
# Define how the NB flag shapes should appear in the legend
# "0" = open circle, "1" = filled triangle
scale_shape_manual(values = c("0" = 1, "1" = 17), name = "NB Flag") +
# Add plot title, subtitle, and axis labels
labs(
title = "HCV Outbreak Detection: HLM vs Negative Binomial Regression",
subtitle = "Note. Shapes show detection flags and lines show HLM (Red) and NB (Blue) detection thresholds",
y = "HCV Cases",
x = "Quarter"
) +
# Use a minimal theme (clean look)
theme_minimal() +
# Customize text elements: make title bigger and bold, subtitle slightly smaller,
# axis titles and labels larger for readability
theme(
plot.title = element_text(size = 27, face = "bold"),
plot.subtitle = element_text(size = 14),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12)
)
Interpreting the HCV Outbreak Detection Plot (HLM vs. Negative Binomial Regression)
The plot illustrates differences in outbreak detection between the Historical Limits Method (HLM) and Negative Binomial (NB) regression. Red circles indicate outbreaks flagged by HLM, based on deviations from past averages, while blue triangles highlight outbreaks identified by the NB regression model, which incorporates factors such as trends and seasonality. The dashed red line shows the detection threshold set by HLM, whereas the dotted blue line marks the upper bound of the NB model, providing a more statistically informed benchmark for identifying unusual case counts.
7. Print HLM and NB Outbreaks Infomration
outbreaks <- combined %>%
dplyr::filter(hlm_flag == 1 | nb_flag == 1) %>%
dplyr::select(year, quarter, date, hcv_cases, hlm_flag, nb_flag)
print("Detected Outbreaks:")
## [1] "Detected Outbreaks:"
print(outbreaks)
## year quarter date hcv_cases hlm_flag nb_flag
## 1 2023 1 2023-01-01 10 1 0
## 2 2024 1 2024-01-01 28 1 1
## 3 2019 2 2019-04-01 12 1 0
## 4 2024 2 2024-04-01 30 1 1
## 5 2024 3 2024-07-01 22 1 1
## 6 2021 4 2021-10-01 10 1 0
## 7 2024 4 2024-10-01 21 1 1
Conclusion
The Historical Limits Method (HLM) is useful for rapid detection of unusual spikes by comparing case counts to past averages. However, it does not account for changes in population, trends, or other contextual factors. Negative Binomial (NB) regression, by incorporating trends, seasonality, and covariates, provides a more reliable and robust approach for outbreak detection.
Disclaimer:The author of this tutorial, along with any associated organizations, assumes no responsibility for the use or misuse of the code and methods presented. This content is intended for educational purposes only and is not a substitute for professional medical or epidemiological advice. Always consult with qualified public health experts and follow official guidelines when conducting real-world disease surveillance.
A.M.D.G.