A TSAD tutorial
This tutorial provides a practical guide to using the R programming language to detect infectious disease outbreaks through time–space analysis. You’ll learn how to use R’s powerful data-handling capabilities and functions to identify unusual increases in specific diagnoses. This process is crucial for public health surveillance, as it enables the rapid deployment of resources to interrupt transmission.
Here are the steps for the tutorial:
1. Load the {tidyverse} package.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
2. Create Sample Data (i.e., Dummy disease data in R).
Note. In a real scenario, you would load your actual data, e.g., using read.csv() from base R, read_excel() from {readxl}, or dbConnect() from {DBI}.
set.seed(1980) # for reproducibility
disease_data <- tibble(
  patient_id = 1:1000,
  diagnosis_date = sample(
    seq(as.Date("2018-01-01"), as.Date("2024-12-31"), by = "day"),
    1000, replace = TRUE
  ),
  county_fips = sample(
    c("001", "003", "005", "007", "009", "011", "013", "015"),
    1000, replace = TRUE,
    prob = c(0.15, 0.15, 0.1, 0.1, 0.1, 0.1, 0.1, 0.2)
  )
)
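In place of the simulated data, a real line list could be read from a file. The following is a minimal sketch, assuming a CSV file with the same three columns as disease_data; the temporary file and its two rows are made up so that the example runs on its own. Declaring county_fips as character guards against leading zeros being dropped by type guessing (base read.csv(), for example, would read 001 as the integer 1 by default).

```r
library(readr)

# Write a tiny example file so the sketch is self-contained.
# The file contents and column layout are assumptions for illustration.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "patient_id,diagnosis_date,county_fips",
    "1,2024-03-15,001",
    "2,2023-11-02,013"
  ),
  csv_path
)

# Declare column types explicitly instead of relying on type guessing
real_data <- read_csv(
  csv_path,
  col_types = cols(
    patient_id     = col_integer(),
    diagnosis_date = col_date(format = "%Y-%m-%d"),
    county_fips    = col_character()  # keeps "001" as text
  )
)

print(real_data)
```

The resulting tibble has the same column names and types as disease_data, so the later steps should apply to it unchanged.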
3. Introduce a small ‘outbreak’ (50 extra cases) in county ‘001’ during 2024 for demonstration.
outbreak_cases <- tibble(
  patient_id = 1001:1050,
  diagnosis_date = sample(
    seq(as.Date("2024-01-01"), as.Date("2024-12-31"), by = "day"),
    50, replace = TRUE
  ),
  county_fips = rep("001", 50)
)
# Bind disease_data and outbreak_cases data frames by row
disease_data <- bind_rows(disease_data, outbreak_cases)
4. Data Processing and Transformation.
# Extract the year from diagnosis_date; format() returns a character string such as "2024"
disease_data_processed <- disease_data %>%
  mutate(
    diagnosis_year = format(diagnosis_date, "%Y"),
    # Ensure county_fips is character for consistent grouping and joining
    county_fips = as.character(county_fips)
  )
# Define the current year and the historical years for the baseline calculation.
# Both are stored as character to match the type of diagnosis_year.
current_year <- "2024"
# Previous 36 months (3 full calendar years)
historical_years <- c("2021", "2022", "2023")
5. Calculate Current 12-month Counts.
current_counts <- disease_data_processed %>%
  filter(diagnosis_year == current_year) %>%
  group_by(county_fips) %>%
  summarise(current_cases = n(), .groups = 'drop')
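The step above counts cases in calendar year 2024. If the 12-month window should instead end on an arbitrary reporting date, a rolling window is one option. The following is a minimal sketch under stated assumptions: the reference date, the object names (cases, reference_date, window_start, rolling_counts), and the five example rows are all hypothetical and not part of the tutorial's data.

```r
library(dplyr)
library(lubridate)

# Small made-up data set for illustration only
cases <- tibble(
  diagnosis_date = as.Date(c(
    "2023-06-30", "2023-07-01", "2024-01-15", "2024-06-30", "2024-07-01"
  )),
  county_fips = c("001", "001", "001", "003", "003")
)

# Assumed "as of" date for the report
reference_date <- as.Date("2024-06-30")
# %m-% subtracts calendar months without rolling past the end of a month
window_start <- reference_date %m-% months(12)

# Keep cases diagnosed after window_start, up to and including reference_date
rolling_counts <- cases %>%
  filter(diagnosis_date > window_start, diagnosis_date <= reference_date) %>%
  count(county_fips, name = "current_cases")

print(rolling_counts)
```

The same idea could be applied to the baseline by counting cases in the three 12-month windows that precede window_start.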
6. Calculate Average and Standard Deviation for Previous 36 Months.
historical_stats <- disease_data_processed %>%
  filter(diagnosis_year %in% historical_years) %>%
  # First count cases per county and year
  group_by(county_fips, diagnosis_year) %>%
  # 'drop_last' keeps the county_fips grouping, so the next summarise() runs per county
  summarise(cases_per_year = n(), .groups = 'drop_last') %>%
  summarise(
    avg_baseline_cases = mean(cases_per_year, na.rm = TRUE),
    std_baseline_cases = sd(cases_per_year, na.rm = TRUE),
    .groups = 'drop'
  )
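One caveat of counting with n() is that a county with no cases in a baseline year has no row for that year, so the year is left out of the mean and standard deviation rather than counted as zero. The following is a minimal sketch of one way to handle this with tidyr::complete(); the yearly table and the names yearly and baseline are made up for illustration, with county 003 deliberately missing 2022.

```r
library(dplyr)
library(tidyr)

# Made-up yearly counts; county "003" has no row for 2022
yearly <- tibble(
  county_fips    = c("001", "001", "001", "003", "003"),
  diagnosis_year = c("2021", "2022", "2023", "2021", "2023"),
  cases_per_year = c(4L, 6L, 8L, 3L, 9L)
)

baseline <- yearly %>%
  # Add a row for every county-year combination, filling missing counts with 0
  complete(
    county_fips,
    diagnosis_year = c("2021", "2022", "2023"),
    fill = list(cases_per_year = 0L)
  ) %>%
  group_by(county_fips) %>%
  summarise(
    avg_baseline_cases = mean(cases_per_year),
    std_baseline_cases = sd(cases_per_year),
    .groups = "drop"
  )

print(baseline)
```

For county 003, the baseline mean becomes 4 (from 3, 0, and 9) instead of 6 (from 3 and 9 only). In the simulated data of this tutorial, every county is likely to have cases in every year, so this step would probably not change the results there.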
7. Join and Identify Alerts.
alerts <- current_counts %>%
  left_join(historical_stats, by = "county_fips") %>%
  # A county may lack historical data (e.g., a new county or no cases in the baseline years).
  # For simplicity, treat a missing average or standard deviation as 0.
  mutate(
    std_baseline_cases = replace_na(std_baseline_cases, 0),
    avg_baseline_cases = replace_na(avg_baseline_cases, 0)
  ) %>%
  mutate(
    alert_flag = as.numeric(
      # Criterion 1: more than 2 standard deviations above the baseline average
      current_cases > (avg_baseline_cases + (2 * std_baseline_cases)) &
        # Criterion 2: more than 2 cases above the baseline average
        current_cases > (avg_baseline_cases + 2)
    )
  )
8. View Alerts.
# Filter for counties with an alert
outbreak_alerts <- alerts %>%
  filter(alert_flag == 1) %>%
  select(county_fips, current_cases, avg_baseline_cases, std_baseline_cases)
print("Disease Outbreak Alerts by County:")
## [1] "Disease Outbreak Alerts by County:"
print(outbreak_alerts)
## # A tibble: 2 × 4
## county_fips current_cases avg_baseline_cases std_baseline_cases
## <chr> <int> <dbl> <dbl>
## 1 001 76 16 3.46
## 2 005 18 10.7 2.52
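To see why these two counties were flagged, the thresholds can be recomputed by hand from the printed table. The following is a small sketch that copies the rounded values shown above into a data frame (so the results are approximate) and adds three illustrative columns; the names alert_table, sd_threshold, abs_threshold, and z_score are not part of the tutorial's code.

```r
# Values copied from the printed alert table (rounded as displayed)
alert_table <- data.frame(
  county_fips        = c("001", "005"),
  current_cases      = c(76, 18),
  avg_baseline_cases = c(16, 10.7),
  std_baseline_cases = c(3.46, 2.52)
)

# Criterion 1: baseline average plus 2 standard deviations
alert_table$sd_threshold <-
  alert_table$avg_baseline_cases + 2 * alert_table$std_baseline_cases

# Criterion 2: baseline average plus 2 cases
alert_table$abs_threshold <- alert_table$avg_baseline_cases + 2

# Number of standard deviations by which the current count exceeds the baseline
alert_table$z_score <-
  (alert_table$current_cases - alert_table$avg_baseline_cases) /
  alert_table$std_baseline_cases

print(alert_table)
```

County 001 exceeds its standard-deviation threshold of about 22.9 by a wide margin, while county 005 only narrowly exceeds its threshold of about 15.7. Only county 001 received injected cases in step 3, so the alert for county 005 arises from random variation in the simulated data. With only three yearly counts per county, the baseline standard deviation is a noisy estimate, and alerts close to the threshold may deserve a closer look before action is taken.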
9. Visualization of the Counties with an Alert (Counties 001 and 005).
# Plot the yearly cases for county '001', including historical and current years
county_001_data <- disease_data_processed %>%
  filter(county_fips == "001") %>%
  group_by(diagnosis_year) %>%
  summarise(cases = n())
ggplot(county_001_data, aes(x = diagnosis_year, y = cases, group = 1)) +
  # Use linewidth for lines; the size aesthetic for lines was deprecated in ggplot2 3.4.0
  geom_line(color = "blue", linewidth = 1) +
  geom_point(color = "darkblue", size = 2) +
  labs(
    title = "Disease Cases Over Time for County 001 (Alerted)",
    x = "Year",
    y = "Number of Cases"
  ) +
  theme_minimal() +
  scale_x_discrete(guide = guide_axis(n.dodge = 2)) # Adjust label dodging if many years
# Plot the yearly cases for county '005', including historical and current years
county_005_data <- disease_data_processed %>%
  filter(county_fips == "005") %>%
  group_by(diagnosis_year) %>%
  summarise(cases = n())
ggplot(county_005_data, aes(x = diagnosis_year, y = cases, group = 1)) +
  geom_line(color = "blue", linewidth = 1) +
  geom_point(color = "darkblue", size = 2) +
  labs(
    title = "Disease Cases Over Time for County 005 (Alerted)",
    x = "Year",
    y = "Number of Cases"
  ) +
  theme_minimal() +
  scale_x_discrete(guide = guide_axis(n.dodge = 2)) # Adjust label dodging if many years
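Because the two plots differ only in the county code, the plotting code could be wrapped in a small helper. The following is a sketch rather than part of the original program: the function name plot_county_trend, its arguments, the optional dashed threshold line, and the demo data are all assumptions made for illustration. The data passed in is assumed to have county_fips and diagnosis_year columns, as disease_data_processed does.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: yearly case trend for one county,
# with an optional horizontal line marking an alert threshold
plot_county_trend <- function(data, county_code, threshold = NULL) {
  yearly <- data %>%
    filter(county_fips == county_code) %>%
    count(diagnosis_year, name = "cases")

  p <- ggplot(yearly, aes(x = diagnosis_year, y = cases, group = 1)) +
    geom_line(color = "blue", linewidth = 1) +
    geom_point(color = "darkblue", size = 2) +
    labs(
      title = paste0("Disease Cases Over Time for County ", county_code),
      x = "Year",
      y = "Number of Cases"
    ) +
    theme_minimal()

  # Add the threshold line only when a value is supplied
  if (!is.null(threshold)) {
    p <- p + geom_hline(yintercept = threshold, linetype = "dashed", color = "red")
  }
  p
}

# Made-up demo data so the sketch runs on its own
demo <- tibble(
  county_fips    = c("001", "001", "001", "003"),
  diagnosis_year = c("2023", "2024", "2024", "2024")
)
p <- plot_county_trend(demo, "001", threshold = 1.5)
print(p)
```

With the tutorial's objects, a call such as plot_county_trend(disease_data_processed, "001", threshold = 16 + 2 * 3.46) would draw the county 001 trend together with its approximate alert threshold from step 8.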
Conclusion
This R program offers a reproducible way to flag potential infectious disease outbreaks. Using the {tidyverse} packages, you prepared, analyzed, and visualized simulated public health surveillance data, and you flagged counties whose current-year case count exceeded the three-year baseline average by more than two standard deviations and by more than two cases. Turning raw case counts into actionable signals with statistical thresholds of this kind is a cornerstone of modern epidemiology.
As you continue your work with the R programming language, consider how you might refine this program further by integrating geospatial mapping to better pinpoint outbreak locations.
Disclaimer: The author of this tutorial, along with any associated organizations, assumes no responsibility for the use or misuse of the code and methods presented. This content is intended for educational purposes only and is not a substitute for professional medical or epidemiological advice. Always consult with qualified public health experts and follow official guidelines when conducting real-world disease surveillance.
A.M.D.G.