Effective surveillance depends not only on the quality of data collected, but also on how quickly and clearly that data can be transformed into actionable insights. Zambia’s electronic IDSR Case‑Based Surveillance system generates rich, nationwide information on priority diseases, yet much of the routine reporting process remains manual, time‑consuming, and vulnerable to delays. As outbreaks evolve rapidly, the Surveillance and Disease Intelligence Unit requires tools that can streamline analysis, standardize outputs, and ensure timely situational awareness.
This project focuses on enhancing surveillance reporting by developing an automated, reproducible R‑based workflow for processing eIDSR cholera data. The workflow cleans, analyzes, and visualizes case‑based records, producing standardized tables, charts, and indicators that can feed directly into weekly surveillance and situation reports. By integrating temporal, spatial, and demographic analyses, the system supports rapid interpretation of who is affected, what is happening, when transmission is occurring, and where outbreaks are concentrated. Ultimately, the project aims to strengthen reporting efficiency, reduce manual workload, and improve the timeliness and consistency of national surveillance outputs.
R Packages Used This project was built on a set of R packages that together provide a complete environment for data cleaning, transformation, visualization, spatial analysis, and reproducible project management. Each package contributes a specific capability, and their combined use creates a coherent, efficient workflow suitable for outbreak analytics and public health reporting.
pacman::p_load(
tidyverse,
forcats,
lubridate,
sf,
readxl,
ggplot2,
ggspatial,
here,
zoo
)
This initial step focused on importing and standardizing the national
cholera case-based line list for analysis. The raw Excel file was first
read into R, after which all column names were cleaned by replacing
spaces with underscores to ensure they were syntactically consistent and
easy to reference in code. Key variables were then harmonized: age was
coerced to numeric, sex values were trimmed and converted to uppercase,
and the date of onset was converted from character format to a proper
Date object. From this onset date, epidemiological week and
year were derived to align the dataset with routine IDSR reporting
structures.
To support age-stratified analysis, cases were grouped into
predefined age bands (0–4, 5–14, 15–24, 25–34, 35–44, 45–54, 55–64, and
65+ years), with implausible or missing ages assigned as
NA. District names were then standardized by trimming
whitespace, converting to uppercase, and recoding known variants (e.g.,
“SENGA”, “SENGA HILL”) into a consistent label (“Senga Hill”), while all
other districts were converted to title case. This ensured that
district-level summaries would not be fragmented by spelling or
formatting inconsistencies. Finally, the cleaned and harmonized line
list was exported as clean_data.csv, creating a
reproducible, analysis-ready dataset for subsequent descriptive and
time-series analyses.
linelist <- read_excel("Data/Consolidated National Cholera Outbreak Linelist - Aug2025.xlsx")
## Warning: Expecting date in M1108 / R1108C13: got '2,/2/28/2026'
## Warning: Expecting date in M1109 / R1109C13: got '2,/2/28/2026'
linelist <- linelist %>%
rename_with(~ gsub("\\s+", "_", .x)) %>% # Replace spaces in column names
mutate(
Age_in_yrs = as.numeric(Age_in_yrs),
Sex = trimws(toupper(Sex)),
date_onset = as.Date(Date_of_onset, format = "%d/%m/%Y"), # Adjust format if needed,
epi_week = epiweek(date_onset),
epi_year = year(date_onset),
Age_Group = case_when(
Age_in_yrs >= 0 & Age_in_yrs <= 4 ~ "0–4",
Age_in_yrs >= 5 & Age_in_yrs <= 14 ~ "5–14",
Age_in_yrs >= 15 & Age_in_yrs <= 24 ~ "15–24",
Age_in_yrs >= 25 & Age_in_yrs <= 34 ~ "25–34",
Age_in_yrs >= 35 & Age_in_yrs <= 44 ~ "35–44",
Age_in_yrs >= 45 & Age_in_yrs <= 54 ~ "45–54",
Age_in_yrs >= 55 & Age_in_yrs <= 64 ~ "55–64",
Age_in_yrs >= 65 ~ "65+",
TRUE ~ NA_character_
)
)
linelist <- linelist %>%
rename_with(~ gsub("\\s+", "_", .x)) %>% # Clean column names
mutate(
District = trimws(toupper(District)), # Standardize to uppercase
District = case_when(
District == "MBALA" ~ "Mbala",
District == "SENGA" ~ "Senga Hill",
District == "SENGA HILL" ~ "Senga Hill",
TRUE ~ str_to_title(District) # Title-case fallback
)
)
rio::export(linelist, "clean_data.csv")
After cleaning and standardizing the line list, the next step involved generating a descriptive summary of the geographic distribution of cholera cases. The dataset was grouped by Province and District, and the total number of reported cases was calculated for each administrative unit. This produced a structured table showing the burden of disease across Zambia’s subnational levels, allowing for quick identification of high‑incidence districts within each province. The table also supports downstream analyses such as mapping, prioritization of response activities, and comparison of outbreak intensity across regions
# Cases by Province and District
cases_by_province_district <- linelist %>%
group_by(Province, District) %>%
summarise(
cases = n(),
.groups = "drop"
) %>%
arrange(Province, desc(cases))
cases_by_province_district
## # A tibble: 24 × 3
## Province District cases
## <chr> <chr> <int>
## 1 Central Kabwe 1
## 2 Central Kapiri 1
## 3 Copperbelt Masaiti 28
## 4 Copperbelt Kitwe 7
## 5 Copperbelt Ndola 2
## 6 Eastern Nyimba 2
## 7 Lusaka Lusaka 317
## 8 Lusaka Chongwe 65
## 9 Lusaka Chilanga 10
## 10 Lusaka Rufunsa 4
## # ℹ 14 more rows
The cleaned cholera line list was transformed into a daily epidemic curve stratified by district. For each district, cases were aggregated by date of symptom onset to obtain the number of cases reported per day. The resulting dataset was ordered chronologically within districts to preserve the temporal structure required for wave detection. To identify distinct outbreak waves, the number of days since the last reported case in each district was calculated. A new wave was defined whenever there was a gap of at least 28 days between consecutive cases, and these events were flagged as new wave onsets. The analysis was restricted to cases with onset dates from 1 August 2025 onwards, and cumulative wave numbers were assigned within each district, generating labels such as “Wave 1”, “Wave 2”, and so on.
Deriving national daily totals and a 7‑day moving average: To complement district‑level trends with a national perspective, daily case counts across all districts were summed to produce a national epidemic curve. A 7‑day moving average of national daily cases was then calculated to smooth short‑term fluctuations and highlight the underlying transmission pattern. This smoothed national trend was merged back into the district‑level dataset, allowing the national moving average to be visualized alongside district‑specific daily counts on a single epidemic curve.
Summarizing the number of waves by district: The number of outbreak waves experienced by each district during the analysis period was summarized by taking the maximum wave number assigned within that district. This produced a compact table indicating how many distinct waves each district had undergone, and a corresponding label (e.g., “Lusaka (3)”) was created for use in legends and annotations. This summary supports interpretation of heterogeneity in outbreak dynamics across districts, highlighting areas with recurrent or persistent transmission.
Labelling outbreak waves and their contributing districts: To provide interpretable annotations on the epidemic curve, the analysis identified the start date of each wave and the districts contributing to that wave. For every wave number, the first day flagged as a new wave was extracted, and the set of districts with cases at that wave onset was collapsed into a comma‑separated label. These were combined into descriptive wave labels (e.g., “Wave 2 (Lusaka, Ndola)”), which were then joined back to the dataset containing wave start dates. This produced a dedicated object containing the timing, magnitude, and district composition of each wave, suitable for direct use in plotting and narrative description.
Preparing custom legend labels for visualization: Finally, the district‑level wave summary was converted into a named vector of labels, mapping each district to its corresponding “District (number of waves)” string. This structure was designed for use in plot legends, ensuring that visual outputs not only distinguish districts by colour but also communicate how many outbreak waves each district experienced during the study period.
Visualizing the epidemic curve with district‑level bars and a national moving average: The final step involved generating a comprehensive epidemic curve to visualize the temporal progression of cholera transmission across Zambia. Daily case counts for each district were plotted as stacked bars, allowing for clear comparison of outbreak intensity and timing between districts. To improve interpretability over the extended outbreak period, the x‑axis was calibrated in epidemiological weeks using ISO week numbering, consistent with IDSR reporting conventions. This ensured that the visualization aligned with routine surveillance cycles and remained readable even when spanning several months of data. To highlight the broader national transmission trend, a 7‑day moving average of total daily cases was overlaid on the plot as a red, dotted line. This smoothed curve was intentionally drawn behind the district bars to provide contextual trend information without obscuring district‑level detail. The legend was customized to display each district alongside the number of outbreak waves it experienced, enhancing the interpretive value of the visualization. The final figure provides a clear, integrated view of both localized and national cholera dynamics, supporting situational awareness and decision‑making for the IDSR response.
#STEP 1
daily_curve_2 <- linelist %>%
mutate(date_onset = as.Date(date_onset)) %>%
# 1. Daily counts per district
group_by(date_onset, District) %>%
summarise(cases = n(), .groups = "drop") %>%
# 2. Order within district
arrange(District, date_onset) %>%
# 3. Detect new waves + compute moving average
group_by(District) %>%
mutate(
days_since_last_case = date_onset - lag(date_onset),
new_wave = if_else(
is.na(days_since_last_case) | days_since_last_case >= 28,
TRUE,
FALSE
)
) %>%
ungroup() %>%
# 4. Restrict to August 2025 onward
filter(date_onset >= as.Date("2025-08-01")) %>%
# 5. Number waves and create wave labels
group_by(District) %>%
arrange(date_onset) %>%
mutate(
wave_number = cumsum(new_wave),
wave_label = paste0("Wave ", wave_number)
) %>%
ungroup()
#STEP 2
# National totals + 7-day MA
national_ma <- daily_curve_2 %>%
group_by(date_onset) %>%
summarise(national_cases = sum(cases), .groups = "drop") %>%
arrange(date_onset) %>%
mutate(
national_ma7 = rollmean(national_cases, k = 7, fill = NA, align = "right")
)
# Merge national MA back into district dataset
daily_curve_2 <- daily_curve_2 %>%
left_join(national_ma, by = "date_onset")
#STEP 3
wave_counts <- daily_curve_2 %>%
group_by(District) %>%
summarise(waves = max(wave_number, na.rm = TRUE)) %>%
mutate(label = paste0(District, " (", waves, ")")) %>%
select(District, label) # <-- IMPORTANT: District first, label second
#STEP 4
wave_starts_labels <- daily_curve_2 %>%
filter(new_wave == TRUE) %>%
group_by(wave_number) %>%
summarise(
districts = paste(unique(District), collapse = ", "),
.groups = "drop"
) %>%
mutate(
wave_label = paste0("Wave ", wave_number),
district_label = paste0("(", districts, ")")
) %>%
select(wave_number, wave_label, district_label) # <-- ensures clean names
#Joining
wave_starts <- daily_curve_2 %>%
filter(new_wave == TRUE) %>%
select(date_onset, District, cases, wave_number) %>% # keep only what you need
left_join(wave_starts_labels, by = "wave_number")
#STEP 5
legend_labels <- deframe(wave_counts)
#STEP 6
ggplot(daily_curve_2, aes(x = date_onset)) +
geom_col(aes(y = cases, fill = District), color = "black") +
geom_line(aes(y = national_ma7),
color = "red",
linetype = "dotted",
linewidth = 1.1)+
scale_fill_discrete(labels = legend_labels) +
scale_x_date(
date_breaks = "2 week",
date_labels = "W%V\n%Y" # epi-week + year
) +
labs(
x = "Epi-week",
y = "Number of Cases",
fill = "District (Outbreaks)"
) +
theme_minimal(base_size = 14) +
theme(
#axis.text.x = element_text(angle = 45, hjust = 1),
axis.text.x = element_text(size = 8, angle = 45, hjust = 1),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "bottom",
legend.direction = "horizontal",
plot.margin = margin(t = 10, r = 20, b = 40, l = 20)
) +
coord_cartesian(clip = "off")
## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_line()`).
A national‑level age–sex pyramid provides a broader demographic perspective on the cholera outbreak by showing how the burden of disease is distributed across age groups for males and females combined from all districts.
To construct the dataset, the line list is grouped by age category and sex, and the number of cases in each group is counted. Male case counts are multiplied by –1 to create a mirrored layout, enabling both sexes to be displayed on a shared horizontal axis. This structure makes it easy to compare the relative burden between males and females within each age group.
The resulting plot displays age groups along the vertical axis, with bars extending to the left for males and to the right for females. Distinct colours are used to differentiate the sexes, and the x‑axis labels are formatted to show absolute values so that negative male counts appear as positive numbers in the final figure. A minimalist theme keeps the focus on the demographic pattern rather than on gridlines or axis decorations.
This national‑level pyramid highlights which age groups are most affected across Zambia and whether the distribution differs between males and females. Such demographic insights are valuable for tailoring national‑level risk communication, guiding targeted interventions, and understanding which population groups are most involved in transmission during the outbreak.
plot_data <- linelist %>%
# filter(!is.na(Age_Group), Sex %in% c("M", "F"),
# District == "Lusaka") %>% # <--- Added line for Lusaka only
group_by(Age_Group, Sex) %>%
summarise(cases = n(), .groups = "drop") %>%
mutate(cases = ifelse(Sex == "M", -cases, cases))
ggplot(plot_data, aes(x = cases, y = fct_rev(factor(Age_Group)), fill = Sex)) +
geom_col(width = 0.7) +
scale_x_continuous(labels = abs, name = "Number of Cases") +
scale_fill_manual(values = c("M" = "#1f78b4", "F" = "#e31a1c")) +
labs(
# title = "Age Distribution of Cholera Cases by Sex",
y = "Age Group",
fill = "Sex"
) +
theme_minimal() +
theme(
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14),
axis.title.x = element_text(size = 14),
plot.title = element_text(size = 16, face = "bold")
)
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_col()`).
This stage of the analysis focused on assessing recent reporting activity among districts that remained epidemiologically active during the outbreak period. A subset of the national line list was created by filtering for districts with ongoing transmission or recent case detection. For each of these districts, the most recent date on which a case was seen at a health facility was extracted and converted into a proper date format. The number of days since this last reported case was then calculated relative to the current date, providing a simple but operationally important indicator of recent surveillance activity and potential silent transmission.
To support interpretation and response prioritization, districts were classified into three transmission‑status categories based on the number of days since their last reported case: Active (0–10 days), Moderate (11–20 days), and Control (21+ days). These thresholds align with practical decision‑making needs in IDSR, where districts with prolonged periods of zero reporting may be transitioning out of active transmission or may require verification of reporting completeness.
A horizontal bar chart was then generated to visualize these metrics. Districts were ordered by the number of days since their last case, allowing those with the longest silent periods to appear at the top of the plot. Bars were colour‑coded according to transmission status, and the exact number of days since last reporting was displayed directly on each bar to enhance readability. A dashed reference line at 28 days was included to mark the IDSR threshold commonly used to define the end of an outbreak wave. This visualization provides a rapid, intuitive overview of which districts remain active, which are approaching control thresholds, and where follow‑up or verification may be required to confirm the absence of ongoing transmission.
today <- Sys.Date()
# For Active Districts only
active <- linelist %>%
filter(District %in% c("Chilanga", "Lusaka", "Mpulungu", "Nakonde", "Chongwe", "Choma", "Nyimba", "Mpika", "Kapiri", "Rufunsa"))
district_last_seen <- active %>%
rename_with(~ gsub("\\s+", "_", .x)) %>%
mutate(date_seen = as.Date(Date_seen_at_health_facility, format = "%d/%m/%Y")) %>%
group_by(District) %>%
summarise(last_report = max(date_seen, na.rm = TRUE), .groups = "drop") %>%
mutate(days_since_last = as.integer(today - last_report)) %>%
arrange(desc(days_since_last))
district_last_seen <- district_last_seen %>%
mutate(
Transmission_status = case_when(
days_since_last <= 10 ~ "Active (0–10 days)",
days_since_last <= 20 ~ "Moderate (11–20 days)",
TRUE ~ "Control (21+ days)"
)
)
ggplot(district_last_seen,
aes(x = reorder(District, days_since_last),
y = days_since_last,
fill = Transmission_status)) +
geom_col() +
geom_text(aes(label = days_since_last), hjust = -0.2, size = 3.5) +
coord_flip() +
scale_fill_manual(
values = c(
"Active (0–10 days)" = "darkred",
"Moderate (11–20 days)" = "orange", # closest to “amber”
"Control (21+ days)" = "darkgreen"
)
) +
geom_hline(yintercept = 28, linetype = "dashed", color = "black", size = 1) +
labs(
title = "Days of Zero Reporting by District",
x = "District",
y = "Days Since Last Case",
fill = "Transmission_status"
) +
theme_minimal() +
theme(
axis.text.y = element_text(size = 11),
plot.title = element_text(size = 16, face = "bold"),
plot.subtitle = element_text(size = 13)
) +
expand_limits(y = max(district_last_seen$days_since_last, na.rm = TRUE) + 5)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Tracking days since last reported case across all districts This step broadened the earlier analysis by extending the “days since last case” metric to all districts nationwide, not only those with ongoing transmission. The goal was to generate a comprehensive view of recent reporting activity across the entire country, enabling comparison between districts that remain active and those that have not reported cases for extended periods.
The dataset was first standardized by cleaning column names and converting the date of last facility visit into a proper date format. For each district, the most recent date on which a cholera case was seen was identified, and the number of days since that report was calculated relative to the current date. This produced a simple but powerful indicator of how recently each district has detected a case.
Similarly to support operational interpretation, districts were categorized into three transmission‑status groups based on the number of days since their last reported case:
A horizontal bar chart was then generated to visualize these metrics across all districts. Districts were ordered by the number of days since their last case, allowing those with the longest silent periods to appear at the top. Bars were colour‑coded according to transmission status, and the exact number of days since last reporting was displayed directly on each bar. A dashed reference line at 28 days was included to mark the IDSR threshold commonly used to define the end of an outbreak wave. This visualization provides a national overview of reporting recency, highlighting districts that may require follow‑up, verification of zero reporting, or continued monitoring.
district_last_seen <- linelist %>%
rename_with(~ gsub("\\s+", "_", .x)) %>%
mutate(date_seen = as.Date(Date_seen_at_health_facility, format = "%d/%m/%Y")) %>%
group_by(District) %>%
summarise(last_report = max(date_seen, na.rm = TRUE), .groups = "drop") %>%
mutate(days_since_last = as.integer(today - last_report)) %>%
arrange(desc(days_since_last))
district_last_seen <- district_last_seen %>%
mutate(
Transmission_status = case_when(
days_since_last <= 10 ~ "Active (0–10 days)",
days_since_last <= 20 ~ "Moderate (11–20 days)",
TRUE ~ "Control (21+ days)"
)
)
ggplot(district_last_seen,
aes(x = reorder(District, days_since_last),
y = days_since_last,
fill = Transmission_status)) +
geom_col() +
geom_text(aes(label = days_since_last), hjust = -0.2, size = 3.5) +
coord_flip() +
scale_fill_manual(
values = c(
"Active (0–10 days)" = "darkred",
"Moderate (11–20 days)" = "orange", # closest to “amber”
"Control (21+ days)" = "darkgreen"
)
) +
geom_hline(yintercept = 28, linetype = "dashed", color = "black", size = 1) +
labs(
title = "Days of Zero Reporting by District",
x = "District",
y = "Days Since Last Case",
fill = "Transmission_status"
) +
theme_minimal() +
theme(
axis.text.y = element_text(size = 11),
plot.title = element_text(size = 16, face = "bold"),
plot.subtitle = element_text(size = 13)
) +
expand_limits(y = max(district_last_seen$days_since_last, na.rm = TRUE) + 5)
This step introduced the spatial component of the analysis by loading
Zambia’s official district‑level administrative boundaries. The
shapefile was imported using sf::st_read(), which preserves
the geographic structure of the polygons and prepares them for
integration with the cleaned cholera line list. This spatial dataset
provides the geographic framework required to visualize the distribution
of cholera cases across districts, enabling the production of choropleth
maps and other geospatial outputs.
The shapefile contains district boundaries consistent with national administrative definitions, allowing the analytical dataset to be merged with spatial geometries using the standardized district names to be created next in the workflow. This ensures that case counts, transmission status, or other indicators can be accurately mapped to their corresponding geographic areas. Incorporating spatial data at this stage sets the foundation for generating maps that support situational awareness, highlight hotspots, and guide targeted public health interventions.
zambia_districts <- st_read("C:/Users/USER1/OneDrive/Documents/My Documents/Readings/MSc PH-FETP/FETP Specific/R/Mapping Ashley/maps/shapefiles/Shape Files/Zambia_DistrictShapefiles/Adminbound.shp")
## Reading layer `Adminbound' from data source
## `C:\Users\USER1\OneDrive\Documents\My Documents\Readings\MSc PH-FETP\FETP Specific\R\Mapping Ashley\maps\shapefiles\Shape Files\Zambia_DistrictShapefiles\Adminbound.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 115 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 21.98004 ymin: -18.07918 xmax: 33.71244 ymax: -8.271976
## Geodetic CRS: WGS 84
Creating a vector of all district names This step created a comprehensive vector of all district names in Zambia to support the spatial mapping workflow. Having a standardized list of district names is essential because the administrative boundary shapefile and the cholera line list must match exactly for a successful spatial join.
districts <- c(
"Chadiza", "Chama", "Chasefu", "Chavuma", "Chembe", "Chibombo", "Chiengi",
"Chifunabuli", "Chikankanta", "Chilanga", "Chililabombwe", "Chilubi", "Chingola", "Chinsali",
"Chipangali", "Chipata", "Chipili", "Chirundu", "Chisamba", "Chitambo", "Choma",
"Chongwe", "Gwembe", "Ikelenge", "Isoka", "Itezhi-tezhi", "Kabompo", "Kabwe",
"Kafue", "Kalabo", "Kalomo", "Kalulushi", "Kalumbila", "Kanchibiya", "Kaoma",
"Kapiri Mposhi", "Kaputa", "Kasama", "Kasempa", "Kasenengwa", "Katete", "Kawambwa",
"Kazungula", "Kitwe", "Lavushimanda", "Limulunga", "Livingstone", "Luampa", "Luangwa",
"Luano", "Luanshya", "Lufwanyama", "Lukulu", "Lumezi", "Lundazi", "Lunga",
"Lunte District", "Lupososhi", "Lusaka", "Luwingu", "Mafinga", "Mambwe", "Mansa",
"Manyinga", "Masaiti", "Mazabuka", "Mbala", "Milengi", "Mitete", "Mkushi",
"Mongu", "Monze", "Mpika", "Mpongwe", "Mporokoso", "Mpulungu", "Mufulira",
"Mufumbwe", "Mulobezi", "Mumbwa", "Mungwi", "Mushindano", "Mwandi", "Mwansabombwe",
"Mwense", "Mwinilunga", "Nakonde", "Nalolo", "Namwala", "Nchelenge", "Ndola",
"Ngabwe", "Nkeyema", "Nsama", "Nyimba", "Pemba", "Petauke", "Rufunsa",
"Samfya", "Senanga", "Senga Hill", "Serenje", "Shang'ombo", "Shibuyunji", "Shiwamg'andu",
"Siavonga", "Sikongo", "Sinazongwe", "Sinda", "Sioma", "Solwezi", "Vubwi",
"Zambezi", "Zimba"
)
Building a district‑level cholera case dataset for mapping This step prepared a clean, map‑ready dataset by assigning cholera case counts to every district in Zambia. A data frame was first created containing all districts listed in the national administrative structure, each initialized with zero cases. This ensures that the mapping workflow includes every district, even those that have not reported cholera cases during the outbreak period. Including zero‑case districts is essential for producing complete national maps, avoiding gaps or missing polygons when the spatial data are joined.
The dataset was then updated with the most recent cumulative case
counts for districts that had reported cholera. Using
case_when(), specific districts were assigned their
observed case totals (e.g., Mpulungu, Lusaka, Chilanga, Nakonde,
Nkeyema, Mbala), while all remaining districts retained a value of zero.
This approach ensures that the mapping dataset reflects the true
geographic distribution of cases while maintaining a consistent
structure across all districts.
By combining a complete district list with updated case counts, this
step produced a harmonized dataset that can be seamlessly merged with
the Zambia district shapefile. The resulting object,
cholera_data, forms the foundation for generating accurate
and visually coherent choropleth maps that highlight both affected and
unaffected areas across the country.
#For Zero cases
cholera_data <- data.frame(
district = districts,
cases = 0
)
#With Daily Cummulative cases
cholera_data <- cholera_data %>%
mutate(cases = case_when(
district == "Mpulungu" ~ 318,
# district == "Nsama" ~ 161,
# district == "Senga Hill" ~ 5,
# district == "Kaputa" ~ 1,
district == "Mbala" ~ 1,
# district == "Mpika" ~ 8,
# district == "Masaiti" ~ 28,
# district == "Ndola" ~ 2,
# district == "Kitwe" ~ 7,
# district == "Monze" ~ 43,
# district == "Kabwe" ~ 2,
district == "Lusaka" ~ 20,
# district == "Chongwe" ~ 3,
# district == "Solwezi" ~ 3,
# district == "Kasempa" ~ 2,
# district == "Gwembe" ~ 1,
district == "Chilanga" ~ 5,
district == "Nakonde" ~ 1,
district == "Nkeyema" ~ 1,
TRUE ~ 0 # All other districts get 0
))
Joining cholera case data with Zambia’s district shapefile
map_data <- zambia_districts%>%
left_join(cholera_data, by = c("NAME" = "district"))
This step linked the prepared cholera case dataset to Zambia’s district‑level administrative boundaries, creating a unified spatial dataset ready for mapping. Using a left join, each district polygon in the shapefile was matched with its corresponding case count from the cholera_data table. The join was performed using the district name field in the shapefile (NAME) and the standardized district names in the case dataset.
This operation ensured that every district polygon—whether it reported cases or not—received an associated case value. Districts with no reported cholera cases retained a value of zero, allowing them to appear on the map with an appropriate colour scale rather than being omitted. The resulting object, map_data, now contains both the spatial geometry and the epidemiological data needed to generate choropleth maps that accurately reflect the geographic distribution of cholera across Zambia. This merged dataset forms the backbone of all subsequent spatial visualizations, enabling clear identification of hotspots, unaffected districts, and regional patterns in transmission.
OPTION 1 - Mapping cholera case distribution using a continuous colour scale
ggplot(map_data) +
geom_sf(aes(fill = cases)) +
scale_fill_gradient(
low = "white", # 0 cases
high = "darkred", # high case counts
na.value = "grey90" # for missing districts, if any
) +
labs(
# title = "Cholera Cases by District (Aug–Dec 2025) n=623", #17th Nov
fill = "Cases"
) +
# Add North arrow
annotation_north_arrow(
location = "tl", # top-left corner
which_north = "true",
style = north_arrow_fancy_orienteering
) +
# Add scale bar
annotation_scale(
location = "bl",
width_hint = 0.3
) +
# Clean theme: no gridlines, no lat/long labels
theme_minimal() +
theme(
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.text.x = element_blank(), # remove longitude labels
axis.text.y = element_blank(), # remove latitude labels
axis.ticks = element_blank() # remove tick marks
)
This step produced a choropleth map to visualize the geographic
distribution of cholera cases across Zambia’s districts. Using the
merged spatial dataset, each district polygon was shaded according to
its cumulative case count, allowing areas with higher transmission to
stand out clearly. A continuous colour gradient was applied, ranging
from white for districts with zero cases to dark red for those with the
highest burden. This gradient provides an intuitive visual cue for
identifying hotspots and contrasting them with unaffected or
low‑incidence districts.
To enhance the map’s interpretability and cartographic quality, a north arrow and scale bar were added, providing orientation and distance context. The map was rendered using a minimalist theme that removes gridlines, axis labels, and tick marks, ensuring that attention remains focused on the spatial patterns rather than extraneous visual elements. This clean presentation is well suited for inclusion in situation reports, presentations, and IDSR bulletins, where clarity and rapid interpretation are essential.
The resulting map offers a clear national overview of cholera distribution, highlighting both the concentration of cases in affected districts and the wide geographic extent of districts reporting zero cases. This spatial perspective complements the earlier epidemic curves and zero‑reporting analyses, providing a comprehensive understanding of the outbreak’s spread and intensity.
OPTION 2: Highlighting affected districts with distinct colours on a national map
#Highlighting all affected districts with own color
highlighted_districts <- c("Mpulungu","Lusaka",
"Chilanga", "Nakonde", "Chongwe", "Choma", "Nyimba", "Mpika", "Kapiri Mposhi", "Rufunsa")
# Assign each highlighted district its own label, others grouped as "Other"
map_data2 <- zambia_districts %>%
mutate(highlight = ifelse(NAME %in% highlighted_districts, NAME, "Other"))
# Plot with distinct colors for each selected district
ggplot(map_data2) +
geom_sf(aes(fill = highlight), color = "black") +
scale_fill_manual(
values = c(
"Mpulungu" = "#FF5733",
# "Nsama" = "#33FF57",
# "Senga Hill" = "#3357FF",
# "Mbala" = "#FF33A6",
# "Kaputa" = "#FFC300",
"Lusaka" = "#DAF7A6",
"Mpika" = "#581845",
# "Masaiti" = "#C70039",
# "Ndola" = "#900C3F",
# "Kitwe" = "#FF8D1A",
# "Monze" = "#2ECC71",
# "Kabwe" = "#3498DB",
"Chongwe" = "#2E44AD",
# "Solwezi" = "#1ABC9C",
# "Kasempa" = "#F39C12",
# "Gwembe" = "#7D3C98",
"Chilanga" = "brown",
"Nakonde" = "darkgreen",
# "Nkeyema" = "darkblue",
"Choma" = "purple",
"Nyimba" = "#2ECC71",
"Kapiri Mposhi" = "#FF8D1A",
"Rufunsa" = "#900C3F",
"Other" = "white"
),
breaks = highlighted_districts # only show selected districts in legend
) +
labs(
# title = "Affected districts, Aug - Dec 2025 Cholera Outbreak, Zambia (n=17)",
fill = "District"
) +
# Add North arrow
annotation_north_arrow(
location = "tl", # top-left corner
which_north = "true",
style = north_arrow_fancy_orienteering
) +
# Add scale bar
annotation_scale(
location = "bl",
width_hint = 0.3
) +
# Clean theme: no gridlines, no lat/long labels
theme_minimal() +
theme(
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.text.x = element_blank(), # remove longitude labels
axis.text.y = element_blank(), # remove latitude labels
axis.ticks = element_blank(), # remove tick marks
legend.text = element_text(size = 8), # reduce legend font size
legend.title = element_text(size = 9) # adjust legend title size
)
This step created a categorical map designed to visually distinguish
cholera‑affected districts from the rest of the country. Instead of
shading districts by the number of cases, each affected district was
assigned its own unique colour, while all unaffected districts were
grouped under a single “Other” category and displayed in white. This
approach is particularly useful when the goal is to highlight the
geographic footprint of the outbreak rather than the intensity
of transmission.
A vector of affected districts was first defined, representing areas with recent or ongoing cholera activity. The shapefile was then augmented with a new variable that classified each district as either one of the highlighted districts or “Other.” This classification allowed the map to display each affected district in a distinct colour, making it easy to identify where cholera transmission has occurred and how widely it is distributed across the country.
The map was rendered using geom_sf(), with a manually
defined colour palette to ensure that each highlighted district had a
visually distinct and easily interpretable colour. A north arrow and
scale bar were added to improve cartographic clarity, and a minimalist
theme was applied to remove gridlines, axis labels, and tick marks,
keeping the focus on the spatial patterns. The legend was simplified to
include only the affected districts, reducing clutter and making the map
more readable.
The automated workflow developed in this project demonstrates how R can transform Zambia’s eIDSR case‑based data into clear, timely, and actionable surveillance intelligence. Through a reproducible pipeline, the system consistently generates the key outputs needed for weekly reporting—epidemic curves, demographic profiles, zero‑reporting indicators, and district‑level maps—reducing reliance on manual processes and improving turnaround time.
Most importantly, the workflow strengthens the ability of the Surveillance and Disease Intelligence Unit to interpret the core dimensions of an outbreak:
• Who is affected, through age‑sex distributions and population‑level patterns • What is happening, through automated indicators, case counts, and epidemic curves • When transmission is rising, peaking, or declining, through timelines and reporting intervals • Where outbreaks are occurring or spreading, through district‑level spatial analysis
By aligning automation with surveillance needs, the project enhances both efficiency and situational awareness. The resulting tools provide a foundation for faster, more consistent reporting and can be expanded to other diseases within the eIDSR platform, supporting Zambia’s broader goal of strengthening national public health intelligence.