Is HLM More Accurate than NB Regression Analyses?

When detecting HCV outbreaks, HLM will alert you if the current quarter’s case count is much higher than the average of previous quarters, without adjusting for changes in population size or time trends. In contrast, NB regression can model expected case counts based on factors such as time, region, population, and seasonality, providing a more accurate expected count along with confidence intervals.

It is generally recommended to use NB regression (especially in the presence of overdispersion or covariates) when accuracy, context, and statistical control are required, while HLM is more suitable for quick anomaly detection in settings without detailed data or modeling resources.

Here are the steps for comparing HLM vs. NB regression in R using dummy HCV data:

1. Install and Load Required Packages.

#install.packages(c("tidyverse","MASS"))
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(MASS) # For negative binomial model
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select

2. Create Dummy Quarterly HCV Data (with Outbreak).

set.seed(1980)

years <- 2015:2024
quarters <- rep(1:4, times = length(years))
dates <- seq(as.Date("2015-01-01"), by = "quarter", length.out = length(years) * 4)

# Simulate baseline HCV counts + artificial outbreak in last year
baseline_cases <- rpois(36, lambda = 10)
outbreak_cases <- rpois(4, lambda = 25)  # outbreak year
cases <- c(baseline_cases, outbreak_cases)

df <- data.frame(
  date = dates,
  year = year(dates),
  quarter = quarter(dates),
  hcv_cases = cases,
  time = 1:length(dates)
)

# Returns the first 10 rows
head(df, 10)
##          date year quarter hcv_cases time
## 1  2015-01-01 2015       1        11    1
## 2  2015-04-01 2015       2        14    2
## 3  2015-07-01 2015       3         5    3
## 4  2015-10-01 2015       4        11    4
## 5  2016-01-01 2016       1        13    5
## 6  2016-04-01 2016       2         7    6
## 7  2016-07-01 2016       3        16    7
## 8  2016-10-01 2016       4        11    8
## 9  2017-01-01 2017       1         7    9
## 10 2017-04-01 2017       2         8   10

3. Create Historical Limits Method (HLM).

# Define a function called detect_hlm to apply the Historical Limits Method (HLM)
# data = input dataset, baseline_years = number of years to use as reference
detect_hlm <- function(data, baseline_years = 3) {
  # Create an empty dataframe to store results
  out <- data.frame()
  # Loop through all 4 quarters of the year
  for (q in 1:4) {
    # Loop through each year starting after baseline_years up to the last year
    for (y in (min(data$year) + baseline_years):(max(data$year))) {
      # Select the current quarter for the given year
      current <- data %>% filter(year == y, quarter == q)
      # Select historical baseline data: same quarter, but from the past 'baseline_years'
      historical <- data %>%
        filter(quarter == q, year %in% (y - baseline_years):(y - 1))
      # Skip iteration if there are not enough baseline years
      if (nrow(historical) < baseline_years) next
      # Calculate the mean (expected number of cases) from the baseline
      mu <- mean(historical$hcv_cases)
      # Calculate the standard deviation from the baseline
      sd_ <- sd(historical$hcv_cases)
      # Define outbreak threshold as mean + 2 standard deviations
      threshold <- mu + 2 * sd_
      # Flag as outbreak (1) if cases exceed threshold, otherwise no outbreak (0)
      current$hlm_flag <- ifelse(current$hcv_cases > threshold, 1, 0)
      # Store the threshold used for comparison
      current$threshold <- threshold
      # Store the baseline mean for reference
      current$baseline_mean <- mu
      # Store the baseline standard deviation for reference
      current$baseline_sd <- sd_
      # Append the current row with results into the output dataframe
      out <- bind_rows(out, current)
    }
  }
  
  # Return the final dataset with outbreak detection results
  return(out)
}

# Run the HLM detection on the dataset 'df' and save results
hlm_results <- detect_hlm(df)

4. Create Negative Binomial (NB) Regression

# Fit NB regression with time and quarter as predictors
nb_model <- glm.nb(hcv_cases ~ time + factor(quarter), data = df)

# Predict expected counts and CI
df$nb_expected <- predict(nb_model, type = "response")
se_fit <- predict(nb_model, type = "link", se.fit = TRUE)$se.fit
link_fit <- predict(nb_model, type = "link")

# Transform back to response scale
df$nb_upper <- exp(link_fit + 2 * se_fit)  # upper bound (like threshold)
df$nb_flag <- ifelse(df$hcv_cases > df$nb_upper, 1, 0)

5. Combine HLM and NB flags

# Combine flags
combined <- left_join(hlm_results, df, by = "date")
# Rename variables
combined<- combined %>% 
          rename("year"=year.x, "quarter"=quarter.x, 
                "hcv_cases"=hcv_cases.x,"time"=time.x)

6. Compare Detection Methods

# Create a ggplot object using the dataset "combined"
ggplot(combined, aes(x = date)) +
  # Add a gray line showing the observed HCV cases over time
  geom_line(aes(y = hcv_cases), color = "gray30") +
  # Add points for HCV cases, colored by whether HLM flagged an outbreak
  # hlm_flag = 0 (black, no outbreak), hlm_flag = 1 (red, outbreak)
  geom_point(aes(y = hcv_cases, color = factor(hlm_flag)), size = 7, shape = 16) +
  # Add points for HCV cases, shaped by whether NB regression flagged an outbreak
  # nb_flag = 0 (open circle), nb_flag = 1 (filled triangle)
  # Color is fixed as blue to distinguish NB
  geom_point(aes(y = hcv_cases, shape = factor(nb_flag)), size = 5, color = "blue") +
  # Add HLM threshold line: mean + 2*SD, shown as a dashed red line
  geom_line(aes(y = threshold), linetype = "dashed", color = "red") +
  # Add NB upper confidence threshold, shown as a dotted blue line
  geom_line(aes(y = nb_upper), linetype = "dotted", color = "blue") +
  # Define how the HLM flag colors should appear in the legend
  # "0" = black (no outbreak), "1" = red (outbreak detected)
  scale_color_manual(values = c("0" = "black", "1" = "red"), name = "HLM Flag") +
  # Define how the NB flag shapes should appear in the legend
  # "0" = open circle, "1" = filled triangle
  scale_shape_manual(values = c("0" = 1, "1" = 17), name = "NB Flag") +
  # Add plot title, subtitle, and axis labels
  labs(
    title = "HCV Outbreak Detection: HLM vs Negative Binomial Regression",
    subtitle = "Note. Shapes show detection flags and lines show HLM (Red) and NB (Blue) detection thresholds",
    y = "HCV Cases",
    x = "Quarter"
  ) +
  # Use a minimal theme (clean look)
  theme_minimal() +
  # Customize text elements: make title bigger and bold, subtitle slightly smaller,
  # axis titles and labels larger for readability
  theme(
    plot.title = element_text(size = 27, face = "bold"),
    plot.subtitle = element_text(size = 14),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12)
  )

Interpreting the HCV Outbreak Detection Plot (HLM vs. Negative Binomial Regression)

The plot illustrates differences in outbreak detection between the Historical Limits Method (HLM) and Negative Binomial (NB) regression. Red circles indicate outbreaks flagged by HLM, based on deviations from past averages, while blue triangles highlight outbreaks identified by the NB regression model, which incorporates factors such as trends and seasonality. The dashed red line shows the detection threshold set by HLM, whereas the dotted blue line marks the upper bound of the NB model, providing a more statistically informed benchmark for identifying unusual case counts.

7. Print HLM and NB Outbreaks Infomration

outbreaks <- combined %>%
  dplyr::filter(hlm_flag == 1 | nb_flag == 1) %>%
  dplyr::select(year, quarter, date, hcv_cases, hlm_flag, nb_flag)

print("Detected Outbreaks:")
## [1] "Detected Outbreaks:"
print(outbreaks)
##   year quarter       date hcv_cases hlm_flag nb_flag
## 1 2023       1 2023-01-01        10        1       0
## 2 2024       1 2024-01-01        28        1       1
## 3 2019       2 2019-04-01        12        1       0
## 4 2024       2 2024-04-01        30        1       1
## 5 2024       3 2024-07-01        22        1       1
## 6 2021       4 2021-10-01        10        1       0
## 7 2024       4 2024-10-01        21        1       1

Conclusion

The Historical Limits Method (HLM) is useful for rapid detection of unusual spikes by comparing case counts to past averages. However, it does not account for changes in population, trends, or other contextual factors. Negative Binomial (NB) regression, by incorporating trends, seasonality, and covariates, provides a more reliable and robust approach for outbreak detection.

Disclaimer:The author of this tutorial, along with any associated organizations, assumes no responsibility for the use or misuse of the code and methods presented. This content is intended for educational purposes only and is not a substitute for professional medical or epidemiological advice. Always consult with qualified public health experts and follow official guidelines when conducting real-world disease surveillance.

A.M.D.G.