Introduction

Effective surveillance depends not only on the quality of data collected, but also on how quickly and clearly that data can be transformed into actionable insights. Zambia’s electronic IDSR Case‑Based Surveillance system generates rich, nationwide information on priority diseases, yet much of the routine reporting process remains manual, time‑consuming, and vulnerable to delays. As outbreaks evolve rapidly, the Surveillance and Disease Intelligence Unit requires tools that can streamline analysis, standardize outputs, and ensure timely situational awareness.

This project focuses on enhancing surveillance reporting by developing an automated, reproducible R‑based workflow for processing eIDSR cholera data. The workflow cleans, analyzes, and visualizes case‑based records, producing standardized tables, charts, and indicators that can feed directly into weekly surveillance and situation reports. By integrating temporal, spatial, and demographic analyses, the system supports rapid interpretation of who is affected, what is happening, when transmission is occurring, and where outbreaks are concentrated. Ultimately, the project aims to strengthen reporting efficiency, reduce manual workload, and improve the timeliness and consistency of national surveillance outputs.

R Packages Used This project was built on a set of R packages that together provide a complete environment for data cleaning, transformation, visualization, spatial analysis, and reproducible project management. Each package contributes a specific capability, and their combined use creates a coherent, efficient workflow suitable for outbreak analytics and public health reporting.

pacman::p_load(
  tidyverse,
  forcats,
  lubridate,
  sf,
  readxl,
  ggplot2,
  ggspatial,
  here,
  zoo
)

Loading and Cleaning Data

This initial step focused on importing and standardizing the national cholera case-based line list for analysis. The raw Excel file was first read into R, after which all column names were cleaned by replacing spaces with underscores to ensure they were syntactically consistent and easy to reference in code. Key variables were then harmonized: age was coerced to numeric, sex values were trimmed and converted to uppercase, and the date of onset was converted from character format to a proper Date object. From this onset date, epidemiological week and year were derived to align the dataset with routine IDSR reporting structures.

To support age-stratified analysis, cases were grouped into predefined age bands (0–4, 5–14, 15–24, 25–34, 35–44, 45–54, 55–64, and 65+ years), with implausible or missing ages assigned as NA. District names were then standardized by trimming whitespace, converting to uppercase, and recoding known variants (e.g., “SENGA”, “SENGA HILL”) into a consistent label (“Senga Hill”), while all other districts were converted to title case. This ensured that district-level summaries would not be fragmented by spelling or formatting inconsistencies. Finally, the cleaned and harmonized line list was exported as clean_data.csv, creating a reproducible, analysis-ready dataset for subsequent descriptive and time-series analyses.

linelist <- read_excel("Data/Consolidated National Cholera Outbreak Linelist - Aug2025.xlsx")
## Warning: Expecting date in M1108 / R1108C13: got '2,/2/28/2026'
## Warning: Expecting date in M1109 / R1109C13: got '2,/2/28/2026'
linelist <- linelist %>%
  rename_with(~ gsub("\\s+", "_", .x)) %>%  # Replace spaces in column names
  mutate(
    Age_in_yrs = as.numeric(Age_in_yrs),
    Sex = trimws(toupper(Sex)),
    date_onset = as.Date(Date_of_onset, format = "%d/%m/%Y"),  # Adjust format if needed,
    epi_week = epiweek(date_onset),
    epi_year = year(date_onset),
    Age_Group = case_when(
      Age_in_yrs >= 0   & Age_in_yrs <= 4   ~ "0–4",
      Age_in_yrs >= 5   & Age_in_yrs <= 14  ~ "5–14",
      Age_in_yrs >= 15  & Age_in_yrs <= 24  ~ "15–24",
      Age_in_yrs >= 25  & Age_in_yrs <= 34  ~ "25–34",
      Age_in_yrs >= 35  & Age_in_yrs <= 44  ~ "35–44",
      Age_in_yrs >= 45  & Age_in_yrs <= 54  ~ "45–54",
      Age_in_yrs >= 55  & Age_in_yrs <= 64  ~ "55–64",
      Age_in_yrs >= 65                   ~ "65+",
      TRUE                               ~ NA_character_
    )
  )
 
linelist <- linelist %>%
  rename_with(~ gsub("\\s+", "_", .x)) %>%  # Clean column names
  mutate(
    District = trimws(toupper(District)),  # Standardize to uppercase
    District = case_when(
      District == "MBALA"        ~ "Mbala",
      District == "SENGA"        ~ "Senga Hill",
      District == "SENGA HILL"   ~ "Senga Hill",
      TRUE                       ~ str_to_title(District)  # Title-case fallback
    )
  )

rio::export(linelist, "clean_data.csv")

Creating a Summary Table

After cleaning and standardizing the line list, the next step involved generating a descriptive summary of the geographic distribution of cholera cases. The dataset was grouped by Province and District, and the total number of reported cases was calculated for each administrative unit. This produced a structured table showing the burden of disease across Zambia’s subnational levels, allowing for quick identification of high‑incidence districts within each province. The table also supports downstream analyses such as mapping, prioritization of response activities, and comparison of outbreak intensity across regions

# Cases by Province and District
cases_by_province_district <- linelist %>%
  group_by(Province, District) %>%
  summarise(
    cases = n(),
    .groups = "drop"
  ) %>%
  arrange(Province, desc(cases))

cases_by_province_district
## # A tibble: 24 × 3
##    Province   District cases
##    <chr>      <chr>    <int>
##  1 Central    Kabwe        1
##  2 Central    Kapiri       1
##  3 Copperbelt Masaiti     28
##  4 Copperbelt Kitwe        7
##  5 Copperbelt Ndola        2
##  6 Eastern    Nyimba       2
##  7 Lusaka     Lusaka     317
##  8 Lusaka     Chongwe     65
##  9 Lusaka     Chilanga    10
## 10 Lusaka     Rufunsa      4
## # ℹ 14 more rows

Constructing daily epidemic curves and identifying outbreak waves

The cleaned cholera line list was transformed into a daily epidemic curve stratified by district. For each district, cases were aggregated by date of symptom onset to obtain the number of cases reported per day. The resulting dataset was ordered chronologically within districts to preserve the temporal structure required for wave detection. To identify distinct outbreak waves, the number of days since the last reported case in each district was calculated. A new wave was defined whenever there was a gap of at least 28 days between consecutive cases, and these events were flagged as new wave onsets. The analysis was restricted to cases with onset dates from 1 August 2025 onwards, and cumulative wave numbers were assigned within each district, generating labels such as “Wave 1”, “Wave 2”, and so on.

Deriving national daily totals and a 7‑day moving average: To complement district‑level trends with a national perspective, daily case counts across all districts were summed to produce a national epidemic curve. A 7‑day moving average of national daily cases was then calculated to smooth short‑term fluctuations and highlight the underlying transmission pattern. This smoothed national trend was merged back into the district‑level dataset, allowing the national moving average to be visualized alongside district‑specific daily counts on a single epidemic curve.

Summarizing the number of waves by district: The number of outbreak waves experienced by each district during the analysis period was summarized by taking the maximum wave number assigned within that district. This produced a compact table indicating how many distinct waves each district had undergone, and a corresponding label (e.g., “Lusaka (3)”) was created for use in legends and annotations. This summary supports interpretation of heterogeneity in outbreak dynamics across districts, highlighting areas with recurrent or persistent transmission.

Labelling outbreak waves and their contributing districts: To provide interpretable annotations on the epidemic curve, the analysis identified the start date of each wave and the districts contributing to that wave. For every wave number, the first day flagged as a new wave was extracted, and the set of districts with cases at that wave onset was collapsed into a comma‑separated label. These were combined into descriptive wave labels (e.g., “Wave 2 (Lusaka, Ndola)”), which were then joined back to the dataset containing wave start dates. This produced a dedicated object containing the timing, magnitude, and district composition of each wave, suitable for direct use in plotting and narrative description.

Preparing custom legend labels for visualization: Finally, the district‑level wave summary was converted into a named vector of labels, mapping each district to its corresponding “District (number of waves)” string. This structure was designed for use in plot legends, ensuring that visual outputs not only distinguish districts by colour but also communicate how many outbreak waves each district experienced during the study period.

Visualizing the epidemic curve with district‑level bars and a national moving average: The final step involved generating a comprehensive epidemic curve to visualize the temporal progression of cholera transmission across Zambia. Daily case counts for each district were plotted as stacked bars, allowing for clear comparison of outbreak intensity and timing between districts. To improve interpretability over the extended outbreak period, the x‑axis was calibrated in epidemiological weeks using ISO week numbering, consistent with IDSR reporting conventions. This ensured that the visualization aligned with routine surveillance cycles and remained readable even when spanning several months of data. To highlight the broader national transmission trend, a 7‑day moving average of total daily cases was overlaid on the plot as a red, dotted line. This smoothed curve was intentionally drawn behind the district bars to provide contextual trend information without obscuring district‑level detail. The legend was customized to display each district alongside the number of outbreak waves it experienced, enhancing the interpretive value of the visualization. The final figure provides a clear, integrated view of both localized and national cholera dynamics, supporting situational awareness and decision‑making for the IDSR response.

#STEP 1
daily_curve_2 <- linelist %>%
  mutate(date_onset = as.Date(date_onset)) %>%
  
  # 1. Daily counts per district
  group_by(date_onset, District) %>%
  summarise(cases = n(), .groups = "drop") %>%
  
  # 2. Order within district
  arrange(District, date_onset) %>%
  
  # 3. Detect new waves + compute moving average
  group_by(District) %>%
  mutate(
    days_since_last_case = date_onset - lag(date_onset),
    new_wave = if_else(
      is.na(days_since_last_case) | days_since_last_case >= 28,
      TRUE,
      FALSE
    )
  ) %>%
  ungroup() %>% 
  
  # 4. Restrict to August 2025 onward
  filter(date_onset >= as.Date("2025-08-01")) %>%
  
  # 5. Number waves and create wave labels
  group_by(District) %>%
  arrange(date_onset) %>%
  mutate(
    wave_number = cumsum(new_wave),
    wave_label  = paste0("Wave ", wave_number)
  ) %>%
  ungroup()
  

#STEP 2
  # National totals + 7-day MA
national_ma <- daily_curve_2 %>%
  group_by(date_onset) %>%
  summarise(national_cases = sum(cases), .groups = "drop") %>%
  arrange(date_onset) %>%
  mutate(
    national_ma7 = rollmean(national_cases, k = 7, fill = NA, align = "right")
  )

# Merge national MA back into district dataset
daily_curve_2 <- daily_curve_2 %>%
  left_join(national_ma, by = "date_onset")

#STEP 3
wave_counts <- daily_curve_2 %>%
  group_by(District) %>%
  summarise(waves = max(wave_number, na.rm = TRUE)) %>%
  mutate(label = paste0(District, " (", waves, ")")) %>%
  select(District, label)   # <-- IMPORTANT: District first, label second

#STEP 4
wave_starts_labels <- daily_curve_2 %>%
  filter(new_wave == TRUE) %>%
  group_by(wave_number) %>%
  summarise(
    districts = paste(unique(District), collapse = ", "),
    .groups = "drop"
  ) %>%
  mutate(
    wave_label = paste0("Wave ", wave_number),
    district_label = paste0("(", districts, ")")
  ) %>%
  select(wave_number, wave_label, district_label)   # <-- ensures clean names

#Joining
wave_starts <- daily_curve_2 %>%
  filter(new_wave == TRUE) %>%
  select(date_onset, District, cases, wave_number) %>%   # keep only what you need
  left_join(wave_starts_labels, by = "wave_number")

#STEP 5
legend_labels <- deframe(wave_counts)

#STEP 6 
ggplot(daily_curve_2, aes(x = date_onset)) +
  geom_col(aes(y = cases, fill = District), color = "black") +

  geom_line(aes(y = national_ma7), 
          color = "red", 
          linetype = "dotted", 
          linewidth = 1.1)+
  
  scale_fill_discrete(labels = legend_labels) +

  scale_x_date(
    date_breaks = "2 week",
    date_labels = "W%V\n%Y"   # epi-week + year
  ) +

  labs(
    x = "Epi-week",
    y = "Number of Cases",
    fill = "District (Outbreaks)"
  ) +

  theme_minimal(base_size = 14) +
  theme(
    #axis.text.x = element_text(angle = 45, hjust = 1),
    axis.text.x = element_text(size = 8, angle = 45, hjust = 1),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    legend.position = "bottom",
    legend.direction = "horizontal",
    plot.margin = margin(t = 10, r = 20, b = 40, l = 20)
  ) +
  coord_cartesian(clip = "off")
## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_line()`).

Creating an Age-Sex Pyramid

A national‑level age–sex pyramid provides a broader demographic perspective on the cholera outbreak by showing how the burden of disease is distributed across age groups for males and females combined from all districts.

To construct the dataset, the line list is grouped by age category and sex, and the number of cases in each group is counted. Male case counts are multiplied by –1 to create a mirrored layout, enabling both sexes to be displayed on a shared horizontal axis. This structure makes it easy to compare the relative burden between males and females within each age group.

The resulting plot displays age groups along the vertical axis, with bars extending to the left for males and to the right for females. Distinct colours are used to differentiate the sexes, and the x‑axis labels are formatted to show absolute values so that negative male counts appear as positive numbers in the final figure. A minimalist theme keeps the focus on the demographic pattern rather than on gridlines or axis decorations.

This national‑level pyramid highlights which age groups are most affected across Zambia and whether the distribution differs between males and females. Such demographic insights are valuable for tailoring national‑level risk communication, guiding targeted interventions, and understanding which population groups are most involved in transmission during the outbreak.

plot_data <- linelist %>%
  # filter(!is.na(Age_Group), Sex %in% c("M", "F"),
  #        District == "Lusaka") %>%    # <--- Added line for Lusaka only
  group_by(Age_Group, Sex) %>%
  summarise(cases = n(), .groups = "drop") %>%
  mutate(cases = ifelse(Sex == "M", -cases, cases))

ggplot(plot_data, aes(x = cases, y = fct_rev(factor(Age_Group)), fill = Sex)) +
  geom_col(width = 0.7) +
  scale_x_continuous(labels = abs, name = "Number of Cases") +
  scale_fill_manual(values = c("M" = "#1f78b4", "F" = "#e31a1c")) +
  labs(
    # title = "Age Distribution of Cholera Cases by Sex",
    y = "Age Group",
    fill = "Sex"
  ) +
  theme_minimal() +
  theme(
    axis.text.y = element_text(size = 12),
    axis.title.y = element_text(size = 14),
    axis.title.x = element_text(size = 14),
    plot.title = element_text(size = 16, face = "bold")
  )
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_col()`).

Monitoring days since last reported case in active districts

This stage of the analysis focused on assessing recent reporting activity among districts that remained epidemiologically active during the outbreak period. A subset of the national line list was created by filtering for districts with ongoing transmission or recent case detection. For each of these districts, the most recent date on which a case was seen at a health facility was extracted and converted into a proper date format. The number of days since this last reported case was then calculated relative to the current date, providing a simple but operationally important indicator of recent surveillance activity and potential silent transmission.

To support interpretation and response prioritization, districts were classified into three transmission‑status categories based on the number of days since their last reported case: Active (0–10 days), Moderate (11–20 days), and Control (21+ days). These thresholds align with practical decision‑making needs in IDSR, where districts with prolonged periods of zero reporting may be transitioning out of active transmission or may require verification of reporting completeness.

A horizontal bar chart was then generated to visualize these metrics. Districts were ordered by the number of days since their last case, allowing those with the longest silent periods to appear at the top of the plot. Bars were colour‑coded according to transmission status, and the exact number of days since last reporting was displayed directly on each bar to enhance readability. A dashed reference line at 28 days was included to mark the IDSR threshold commonly used to define the end of an outbreak wave. This visualization provides a rapid, intuitive overview of which districts remain active, which are approaching control thresholds, and where follow‑up or verification may be required to confirm the absence of ongoing transmission.

today <- Sys.Date()

# For Active Districts only
 active <- linelist %>%
   filter(District %in% c("Chilanga", "Lusaka", "Mpulungu", "Nakonde", "Chongwe", "Choma", "Nyimba", "Mpika", "Kapiri", "Rufunsa"))
 
 
 district_last_seen <- active %>%
  rename_with(~ gsub("\\s+", "_", .x)) %>%
  mutate(date_seen = as.Date(Date_seen_at_health_facility, format = "%d/%m/%Y")) %>%
  group_by(District) %>%
  summarise(last_report = max(date_seen, na.rm = TRUE), .groups = "drop") %>%
  mutate(days_since_last = as.integer(today - last_report)) %>%
  arrange(desc(days_since_last))


district_last_seen <- district_last_seen %>%
  mutate(
    Transmission_status = case_when(
      days_since_last <= 10  ~ "Active (0–10 days)",
      days_since_last <= 20 ~ "Moderate (11–20 days)",
      TRUE                  ~ "Control (21+ days)"
    )
  )


ggplot(district_last_seen,
       aes(x = reorder(District, days_since_last),
           y = days_since_last,
           fill = Transmission_status)) +
  
  geom_col() +
  geom_text(aes(label = days_since_last), hjust = -0.2, size = 3.5) +
  coord_flip() +
  
  scale_fill_manual(
    values = c(
      "Active (0–10 days)"       = "darkred",
      "Moderate (11–20 days)"   = "orange",   # closest to “amber”
      "Control (21+ days)"        = "darkgreen"
    )
  ) +
  
  geom_hline(yintercept = 28, linetype = "dashed", color = "black", size = 1) +
  
  labs(
    title = "Days of Zero Reporting by District",
    x = "District",
    y = "Days Since Last Case",
    fill = "Transmission_status"
  ) +
  
  theme_minimal() +
  theme(
    axis.text.y = element_text(size = 11),
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size = 13)
  ) +
  
  expand_limits(y = max(district_last_seen$days_since_last, na.rm = TRUE) + 5)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Tracking days since last reported case across all districts This step broadened the earlier analysis by extending the “days since last case” metric to all districts nationwide, not only those with ongoing transmission. The goal was to generate a comprehensive view of recent reporting activity across the entire country, enabling comparison between districts that remain active and those that have not reported cases for extended periods.

The dataset was first standardized by cleaning column names and converting the date of last facility visit into a proper date format. For each district, the most recent date on which a cholera case was seen was identified, and the number of days since that report was calculated relative to the current date. This produced a simple but powerful indicator of how recently each district has detected a case.

Similarly to support operational interpretation, districts were categorized into three transmission‑status groups based on the number of days since their last reported case:

A horizontal bar chart was then generated to visualize these metrics across all districts. Districts were ordered by the number of days since their last case, allowing those with the longest silent periods to appear at the top. Bars were colour‑coded according to transmission status, and the exact number of days since last reporting was displayed directly on each bar. A dashed reference line at 28 days was included to mark the IDSR threshold commonly used to define the end of an outbreak wave. This visualization provides a national overview of reporting recency, highlighting districts that may require follow‑up, verification of zero reporting, or continued monitoring.

 district_last_seen <- linelist %>%
  rename_with(~ gsub("\\s+", "_", .x)) %>%
  mutate(date_seen = as.Date(Date_seen_at_health_facility, format = "%d/%m/%Y")) %>%
  group_by(District) %>%
  summarise(last_report = max(date_seen, na.rm = TRUE), .groups = "drop") %>%
  mutate(days_since_last = as.integer(today - last_report)) %>%
  arrange(desc(days_since_last))


district_last_seen <- district_last_seen %>%
  mutate(
    Transmission_status = case_when(
      days_since_last <= 10  ~ "Active (0–10 days)",
      days_since_last <= 20 ~ "Moderate (11–20 days)",
      TRUE                  ~ "Control (21+ days)"
    )
  )


ggplot(district_last_seen,
       aes(x = reorder(District, days_since_last),
           y = days_since_last,
           fill = Transmission_status)) +
  
  geom_col() +
  geom_text(aes(label = days_since_last), hjust = -0.2, size = 3.5) +
  coord_flip() +
  
  scale_fill_manual(
    values = c(
      "Active (0–10 days)"       = "darkred",
      "Moderate (11–20 days)"   = "orange",   # closest to “amber”
      "Control (21+ days)"        = "darkgreen"
    )
  ) +
  
  geom_hline(yintercept = 28, linetype = "dashed", color = "black", size = 1) +
  
  labs(
    title = "Days of Zero Reporting by District",
    x = "District",
    y = "Days Since Last Case",
    fill = "Transmission_status"
  ) +
  
  theme_minimal() +
  theme(
    axis.text.y = element_text(size = 11),
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size = 13)
  ) +
  
  expand_limits(y = max(district_last_seen$days_since_last, na.rm = TRUE) + 5)

Mapping

This step introduced the spatial component of the analysis by loading Zambia’s official district‑level administrative boundaries. The shapefile was imported using sf::st_read(), which preserves the geographic structure of the polygons and prepares them for integration with the cleaned cholera line list. This spatial dataset provides the geographic framework required to visualize the distribution of cholera cases across districts, enabling the production of choropleth maps and other geospatial outputs.

The shapefile contains district boundaries consistent with national administrative definitions, allowing the analytical dataset to be merged with spatial geometries using the standardized district names to be created next in the workflow. This ensures that case counts, transmission status, or other indicators can be accurately mapped to their corresponding geographic areas. Incorporating spatial data at this stage sets the foundation for generating maps that support situational awareness, highlight hotspots, and guide targeted public health interventions.

zambia_districts <- st_read("C:/Users/USER1/OneDrive/Documents/My Documents/Readings/MSc PH-FETP/FETP Specific/R/Mapping Ashley/maps/shapefiles/Shape Files/Zambia_DistrictShapefiles/Adminbound.shp")
## Reading layer `Adminbound' from data source 
##   `C:\Users\USER1\OneDrive\Documents\My Documents\Readings\MSc PH-FETP\FETP Specific\R\Mapping Ashley\maps\shapefiles\Shape Files\Zambia_DistrictShapefiles\Adminbound.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 115 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 21.98004 ymin: -18.07918 xmax: 33.71244 ymax: -8.271976
## Geodetic CRS:  WGS 84

Creating a vector of all district names This step created a comprehensive vector of all district names in Zambia to support the spatial mapping workflow. Having a standardized list of district names is essential because the administrative boundary shapefile and the cholera line list must match exactly for a successful spatial join.

districts <- c(
  "Chadiza", "Chama", "Chasefu", "Chavuma", "Chembe", "Chibombo", "Chiengi",
  "Chifunabuli", "Chikankanta", "Chilanga", "Chililabombwe", "Chilubi", "Chingola", "Chinsali",
  "Chipangali", "Chipata", "Chipili", "Chirundu", "Chisamba", "Chitambo", "Choma",
  "Chongwe", "Gwembe", "Ikelenge", "Isoka", "Itezhi-tezhi", "Kabompo", "Kabwe",
  "Kafue", "Kalabo", "Kalomo", "Kalulushi", "Kalumbila", "Kanchibiya", "Kaoma",
  "Kapiri Mposhi", "Kaputa", "Kasama", "Kasempa", "Kasenengwa", "Katete", "Kawambwa",
  "Kazungula", "Kitwe", "Lavushimanda", "Limulunga", "Livingstone", "Luampa", "Luangwa",
  "Luano", "Luanshya", "Lufwanyama", "Lukulu", "Lumezi", "Lundazi", "Lunga",
  "Lunte District", "Lupososhi", "Lusaka", "Luwingu", "Mafinga", "Mambwe", "Mansa",
  "Manyinga", "Masaiti", "Mazabuka", "Mbala", "Milengi", "Mitete", "Mkushi",
  "Mongu", "Monze", "Mpika", "Mpongwe", "Mporokoso", "Mpulungu", "Mufulira",
  "Mufumbwe", "Mulobezi", "Mumbwa", "Mungwi", "Mushindano", "Mwandi", "Mwansabombwe",
  "Mwense", "Mwinilunga", "Nakonde", "Nalolo", "Namwala", "Nchelenge", "Ndola",
  "Ngabwe", "Nkeyema", "Nsama", "Nyimba", "Pemba", "Petauke", "Rufunsa",
  "Samfya", "Senanga", "Senga Hill", "Serenje", "Shang'ombo", "Shibuyunji", "Shiwamg'andu",
  "Siavonga", "Sikongo", "Sinazongwe", "Sinda", "Sioma", "Solwezi", "Vubwi",
  "Zambezi", "Zimba"
)

Building a district‑level cholera case dataset for mapping This step prepared a clean, map‑ready dataset by assigning cholera case counts to every district in Zambia. A data frame was first created containing all districts listed in the national administrative structure, each initialized with zero cases. This ensures that the mapping workflow includes every district, even those that have not reported cholera cases during the outbreak period. Including zero‑case districts is essential for producing complete national maps, avoiding gaps or missing polygons when the spatial data are joined.

The dataset was then updated with the most recent cumulative case counts for districts that had reported cholera. Using case_when(), specific districts were assigned their observed case totals (e.g., Mpulungu, Lusaka, Chilanga, Nakonde, Nkeyema, Mbala), while all remaining districts retained a value of zero. This approach ensures that the mapping dataset reflects the true geographic distribution of cases while maintaining a consistent structure across all districts.

By combining a complete district list with updated case counts, this step produced a harmonized dataset that can be seamlessly merged with the Zambia district shapefile. The resulting object, cholera_data, forms the foundation for generating accurate and visually coherent choropleth maps that highlight both affected and unaffected areas across the country.

#For Zero cases
cholera_data <- data.frame(
  district = districts,
  cases = 0
) 

#With Daily Cummulative cases
cholera_data <- cholera_data %>%
  mutate(cases = case_when(
    district == "Mpulungu"   ~ 318,
    # district == "Nsama"      ~ 161,
    # district == "Senga Hill" ~ 5,
    # district == "Kaputa"     ~ 1,
    district == "Mbala"      ~ 1,
    # district == "Mpika" ~ 8,
    # district == "Masaiti" ~ 28,
    # district == "Ndola" ~ 2,
    # district == "Kitwe" ~ 7,
    # district == "Monze" ~ 43,
    # district == "Kabwe" ~ 2,
    district == "Lusaka" ~ 20,
    # district == "Chongwe" ~ 3,
    # district == "Solwezi" ~ 3,
    # district == "Kasempa" ~ 2,
    # district == "Gwembe" ~ 1,
    district == "Chilanga" ~ 5,
    district == "Nakonde" ~ 1,
    district == "Nkeyema" ~ 1,
    TRUE                     ~ 0  # All other districts get 0
  ))

Joining cholera case data with Zambia’s district shapefile

map_data <- zambia_districts%>%
  left_join(cholera_data, by = c("NAME" = "district"))

This step linked the prepared cholera case dataset to Zambia’s district‑level administrative boundaries, creating a unified spatial dataset ready for mapping. Using a left join, each district polygon in the shapefile was matched with its corresponding case count from the cholera_data table. The join was performed using the district name field in the shapefile (NAME) and the standardized district names in the case dataset.

This operation ensured that every district polygon—whether it reported cases or not—received an associated case value. Districts with no reported cholera cases retained a value of zero, allowing them to appear on the map with an appropriate colour scale rather than being omitted. The resulting object, map_data, now contains both the spatial geometry and the epidemiological data needed to generate choropleth maps that accurately reflect the geographic distribution of cholera across Zambia. This merged dataset forms the backbone of all subsequent spatial visualizations, enabling clear identification of hotspots, unaffected districts, and regional patterns in transmission.

OPTION 1 - Mapping cholera case distribution using a continuous colour scale

ggplot(map_data) +
  geom_sf(aes(fill = cases)) +
  scale_fill_gradient(
    low = "white",       # 0 cases
    high = "darkred",    # high case counts
    na.value = "grey90"  # for missing districts, if any
  ) +
  labs(
    # title = "Cholera Cases by District (Aug–Dec 2025) n=623",   #17th Nov
    fill = "Cases"
  ) +
  # Add North arrow
  annotation_north_arrow(
    location = "tl",        # top-left corner
    which_north = "true",
    style = north_arrow_fancy_orienteering
  ) +
  # Add scale bar
  annotation_scale(
    location = "bl",
    width_hint = 0.3
  ) +
  # Clean theme: no gridlines, no lat/long labels
  theme_minimal() +
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.text.x = element_blank(),   # remove longitude labels
    axis.text.y = element_blank(),   # remove latitude labels
    axis.ticks = element_blank()     # remove tick marks
  )

This step produced a choropleth map to visualize the geographic distribution of cholera cases across Zambia’s districts. Using the merged spatial dataset, each district polygon was shaded according to its cumulative case count, allowing areas with higher transmission to stand out clearly. A continuous colour gradient was applied, ranging from white for districts with zero cases to dark red for those with the highest burden. This gradient provides an intuitive visual cue for identifying hotspots and contrasting them with unaffected or low‑incidence districts.

To enhance the map’s interpretability and cartographic quality, a north arrow and scale bar were added, providing orientation and distance context. The map was rendered using a minimalist theme that removes gridlines, axis labels, and tick marks, ensuring that attention remains focused on the spatial patterns rather than extraneous visual elements. This clean presentation is well suited for inclusion in situation reports, presentations, and IDSR bulletins, where clarity and rapid interpretation are essential.

The resulting map offers a clear national overview of cholera distribution, highlighting both the concentration of cases in affected districts and the wide geographic extent of districts reporting zero cases. This spatial perspective complements the earlier epidemic curves and zero‑reporting analyses, providing a comprehensive understanding of the outbreak’s spread and intensity.

OPTION 2: Highlighting affected districts with distinct colours on a national map

#Highlighting all affected districts with own color
highlighted_districts <- c("Mpulungu","Lusaka",
                            "Chilanga", "Nakonde", "Chongwe", "Choma", "Nyimba", "Mpika", "Kapiri Mposhi", "Rufunsa")

# Assign each highlighted district its own label, others grouped as "Other"
map_data2 <- zambia_districts %>%
  mutate(highlight = ifelse(NAME %in% highlighted_districts, NAME, "Other"))

# Plot with distinct colors for each selected district
ggplot(map_data2) +
  geom_sf(aes(fill = highlight), color = "black") +
  scale_fill_manual(
    values = c(
      "Mpulungu"   = "#FF5733",
      # "Nsama"      = "#33FF57",
      # "Senga Hill" = "#3357FF",
      # "Mbala"      = "#FF33A6",
      # "Kaputa"     = "#FFC300",
      "Lusaka"     = "#DAF7A6",
      "Mpika"      = "#581845",
      # "Masaiti"    = "#C70039",
      # "Ndola"      = "#900C3F",
      # "Kitwe"      = "#FF8D1A",
      # "Monze"      = "#2ECC71",
      # "Kabwe"      = "#3498DB",
      "Chongwe"    = "#2E44AD",
      # "Solwezi"    = "#1ABC9C",
      # "Kasempa"    = "#F39C12",
      # "Gwembe"     = "#7D3C98",
      "Chilanga"   = "brown",
      "Nakonde"    = "darkgreen",
      # "Nkeyema" = "darkblue",
      "Choma" = "purple",
      "Nyimba" = "#2ECC71",
      "Kapiri Mposhi" = "#FF8D1A",
      "Rufunsa" = "#900C3F",
      "Other"      = "white"
    ),
    breaks = highlighted_districts   # only show selected districts in legend
  ) +
  labs(
    # title = "Affected districts, Aug - Dec 2025 Cholera Outbreak, Zambia (n=17)",
    fill = "District"
  ) +
  # Add North arrow
  annotation_north_arrow(
    location = "tl",        # top-left corner
    which_north = "true",
    style = north_arrow_fancy_orienteering
  ) +
  # Add scale bar
  annotation_scale(
    location = "bl",
    width_hint = 0.3
  ) +
  # Clean theme: no gridlines, no lat/long labels
  theme_minimal() +
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.text.x = element_blank(),   # remove longitude labels
    axis.text.y = element_blank(),   # remove latitude labels
    axis.ticks = element_blank(),     # remove tick marks
    legend.text = element_text(size = 8),   # reduce legend font size
    legend.title = element_text(size = 9)   # adjust legend title size
  )

This step created a categorical map designed to visually distinguish cholera‑affected districts from the rest of the country. Instead of shading districts by the number of cases, each affected district was assigned its own unique colour, while all unaffected districts were grouped under a single “Other” category and displayed in white. This approach is particularly useful when the goal is to highlight the geographic footprint of the outbreak rather than the intensity of transmission.

A vector of affected districts was first defined, representing areas with recent or ongoing cholera activity. The shapefile was then augmented with a new variable that classified each district as either one of the highlighted districts or “Other.” This classification allowed the map to display each affected district in a distinct colour, making it easy to identify where cholera transmission has occurred and how widely it is distributed across the country.

The map was rendered using geom_sf(), with a manually defined colour palette to ensure that each highlighted district had a visually distinct and easily interpretable colour. A north arrow and scale bar were added to improve cartographic clarity, and a minimalist theme was applied to remove gridlines, axis labels, and tick marks, keeping the focus on the spatial patterns. The legend was simplified to include only the affected districts, reducing clutter and making the map more readable.

Conclusion

The automated workflow developed in this project demonstrates how R can transform Zambia’s eIDSR case‑based data into clear, timely, and actionable surveillance intelligence. Through a reproducible pipeline, the system consistently generates the key outputs needed for weekly reporting—epidemic curves, demographic profiles, zero‑reporting indicators, and district‑level maps—reducing reliance on manual processes and improving turnaround time.

Most importantly, the workflow strengthens the ability of the Surveillance and Disease Intelligence Unit to interpret the core dimensions of an outbreak:

• Who is affected, through age‑sex distributions and population‑level patterns • What is happening, through automated indicators, case counts, and epidemic curves • When transmission is rising, peaking, or declining, through timelines and reporting intervals • Where outbreaks are occurring or spreading, through district‑level spatial analysis

By aligning automation with surveillance needs, the project enhances both efficiency and situational awareness. The resulting tools provide a foundation for faster, more consistent reporting and can be expanded to other diseases within the eIDSR platform, supporting Zambia’s broader goal of strengthening national public health intelligence.