Cholera is an infectious transmissible diarrheal disease caused by the ingestion of contaminated water or food with the bacterium Vibrio cholerae (Awofeso & Aldbak, 2018).
It is undoubtedly among the diseases of public health importance in many developing countries, including Nigeria (Ajayi & Smith, 2019; Ngwa et al., 2021). The disease is predominant in areas lacking safe drinking water, adequate sanitation and hygiene, such as those typically found in Internally Displaced Persons (IDP) camps in Nigeria. (Chansa et al., 2024; Mike-Ogburia et al., 2024; Salako et al., 2021).
The infection is often mild or without symptoms, but sometimes it can be severe. Infected persons may develop severe disease characterized by profuse watery diarrhea, vomiting, leg cramps and if untreated, it leads to rapid dehydration, acidosis, circulatory collapse and death within 12–24 h (Sanchez et al., 1994, WHO, 2008). Despite being easily preventable and treatable, cholera remains a major cause of morbidity and mortality in regions with inadequate water, sanitation, and hygiene (WASH) infrastructure.(Lauren D’Mello-Guyett et al.,2020).
The disease is predominant in areas lacking safe drinking water, adequate sanitation and hygiene, such as those typically found in Internally Displaced Persons (IDP) camps in Nigeria (Chansa et al., 2024; Mike-Ogburia et al., 2024; Salako et al., 2021).
To examine the trend of cholera in Nigeria and evaluate how WASH interventions help in reducing cholera outbreaks.
What has been the trend and pattern of cholera cases in Nigeria?
How have WASH (Water, Sanitation and Hygiene) interventions impacted cholera transmission and case reduction in Nigeria?
To examine the trend and pattern of cholera cases in Nigeria over the study period.
To investigate the effect of WASH interventions and treatment measures on cholera transmission in Nigeria.
(Mike-Ogburia et al., 2024) examined cholera outbreaks in Nigeria from 2010–2024 and found that recurrent epidemics are driven by poor WASH conditions, seasonal flooding, and conflict-related displacement. Northern regions and overcrowded IDP camps were the most affected. They also reported rising cases in young children, multiple outbreak waves, and increasing antibiotic-resistant Vibrio cholerae. Although surveillance, treatment, WASH interventions, and vaccination campaigns have helped, gaps in healthcare capacity and vaccine supply continue to hinder effective control.
(Charnley et al., 2022) reported that Cholera remains endemic in over 50 countries, with Nigeria having the second-highest burden. Using co variate selection and machine-learning models, the authors linked cholera transmission to environmental extremes (floods, droughts), conflict, poverty, and poor sanitation. Their model showed that improved sanitation and reduced poverty significantly protect populations—even during climate shocks. A traffic-light early-warning system was developed to signal high-risk periods. The study highlights that addressing poverty, sanitation, and social instability is essential for reducing cholera outbreaks in Nigeria.
The study was conducted using cholera data aggregated at the national level across all 36 states in Nigeria, including the Federal Capital Territory (FCT). Nigeria is located in West Africa between latitudes 4°N–14°N and longitudes 3°E–15°E. It shares boundaries with Benin Republic to the west, Niger to the north, Chad to the northeast, Cameroon to the east, and the Atlantic Ocean to the south. With an estimated population of over 220 million people, the country experiences diverse climatic conditions that influence cholera transmission patterns. The Nigeria hotspot map used in this study, showing the spatial distribution of cholera cases across the states and highlighting high-risk areas, particularly in the northern regions.
The data used in this study was obtained from the Nigeria Centre for Disease Control (NCDC) and contains monthly records of suspected and confirmed cholera cases in Nigeria from 2018 to 2024. The dataset represents aggregated national reports covering the 36 states and the Federal Capital Territory (FCT). Altogether, the dataset spans 81 months, from May 2018 to December 2024.
During data extraction, inconsistencies were observed in monthly reporting on the NCDC platform. The most notable gap occurred in 2020, where only one monthly record (December) was available, while all other months were missing. Similar missing months appeared in other years as well, but these were fewer compared to the gaps seen in 2020. To maintain a continuous monthly time series, the missing months were assigned a value of 1, rather than zero, to indicate the absence of reported data rather than an actual absence of cases.
In total, the dataset contains 8,590 confirmed cholera cases within the study period, with an approximate monthly average of 195 confirmed cases. Clear outbreak patterns were evident in the data. A major peak occurred in 2021, beginning with an increase in August (316 cases) and reaching the highest recorded value in December (943 cases). Another significant surge was recorded in 2024, with high case counts from July to September, including 660 cases in July and 650 cases in August. These periods represent the most prominent cholera clusters within the dataset.
Required packages
library(tidyr)
library(dplyr)
library(ggplot2)
library(caret)
library(sf)
require(tseries)
require(forecast)
library(corrplot)
library(rnaturalearth)
library(devtools)
library(rnaturalearthdata)
library(ggspatial)
library(Metrics)
cholera_data<-read.csv("~/GLADYS FOLDER .R/Cholera_dataaaa.csv")
#View(cholera_data)
dates <- seq(as.Date("2018-01-01"),as.Date("2024-12-01"),by = "month")
head(dates)
## [1] "2018-01-01" "2018-02-01" "2018-03-01" "2018-04-01" "2018-05-01"
## [6] "2018-06-01"
plot(dates,cholera_data$ConfirmedCases,type = "l",col="red")
adf_result<-adf.test(cholera_data$ConfirmedCases)
## Warning in adf.test(cholera_data$ConfirmedCases): p-value smaller than printed
## p-value
print(adf_result)
##
## Augmented Dickey-Fuller Test
##
## data: cholera_data$ConfirmedCases
## Dickey-Fuller = -4.458, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary
# Extract exogenous variables
rainfall <- cholera_data$Rainfall
humidity <- cholera_data$Humidity
cholera_data<- cholera_data$ConfirmedCases+1
log_cholera_data<-log(cholera_data)
plot(log_cholera_data,type= "l")
ts_cholera_data <- ts( log_cholera_data, start = c(2018,1), frequency = 12)
ts_rain <- ts(rainfall, start = c(2018,1), frequency = 12)
ts_humid <- ts(humidity, start = c(2018,1), frequency = 12)
plot(ts_cholera_data)
decomp<-stl(ts_cholera_data,s.window = "periodic")
plot(decomp,col.range = "red")
n <- length(ts_cholera_data)
n_train <- floor(0.8 * n)
train_cholera_data <- window(ts_cholera_data, end = time(ts_cholera_data)[n_train])
test_cholera_data <- window(ts_cholera_data, start = time(ts_cholera_data)[n_train + 1])
train_rain <- window(ts_rain, end = time(ts_rain)[n_train])
test_rain <- window(ts_rain, start = time(ts_rain)[n_train + 1])
train_humid <- window(ts_humid, end = time(ts_humid)[n_train])
test_humid <- window(ts_humid, start = time(ts_humid)[n_train + 1])
# Combine exogenous variables
train_xreg <- cbind(train_rain, train_humid)
test_xreg <- cbind(test_rain, test_humid)
model_auto <- auto.arima(train_cholera_data, seasonal = TRUE,
ic = "aic",
trace = TRUE)
##
## ARIMA(2,0,2)(1,0,1)[12] with non-zero mean : 236.3737
## ARIMA(0,0,0) with non-zero mean : 283.6322
## ARIMA(1,0,0)(1,0,0)[12] with non-zero mean : 229.9084
## ARIMA(0,0,1)(0,0,1)[12] with non-zero mean : 245.6296
## ARIMA(0,0,0) with zero mean : 342.5922
## ARIMA(1,0,0) with non-zero mean : 227.9085
## ARIMA(1,0,0)(0,0,1)[12] with non-zero mean : 229.9084
## ARIMA(1,0,0)(1,0,1)[12] with non-zero mean : 231.664
## ARIMA(2,0,0) with non-zero mean : 228.6632
## ARIMA(1,0,1) with non-zero mean : 228.6917
## ARIMA(0,0,1) with non-zero mean : 244.2091
## ARIMA(2,0,1) with non-zero mean : 230.6355
## ARIMA(1,0,0) with zero mean : 232.0472
##
## Best model: ARIMA(1,0,0) with non-zero mean
model1 = Arima(train_cholera_data, order = c(1,0,0))
model2 = Arima(train_cholera_data,order = c(1,0,0),seasonal = list(order = c(0,0,1), period = 12))
model3 = Arima(train_cholera_data,order = c(1,0,0),seasonal = list(order = c(1,0,1), period = 12))
model4 = Arima(train_cholera_data,order = c(2,0,2),seasonal = list(order = c(1,0,1), period = 12))
model5 = Arima(train_cholera_data,order = c(1,0,0),seasonal = list(order = c(0,1,1), period = 12))
forecast1 <- forecast(model1,h = length(test_cholera_data))
forecast2 <- forecast(model2,h =length(test_cholera_data ))
forecast3 <- forecast(model3,h = length(test_cholera_data ))
forecast4 <- forecast(model4,h = length(test_cholera_data ))
forecast5 <- forecast(model5,h =length(test_cholera_data ))
rmse1 <- rmse(test_cholera_data, forecast1$mean)
rmse2 <- rmse(test_cholera_data, forecast2$mean)
rmse3 <- rmse(test_cholera_data, forecast3$mean)
rmse4 <- rmse(test_cholera_data, forecast4$mean)
rmse5 <- rmse(test_cholera_data, forecast5$mean)
RMSE= c(rmse1,rmse2, rmse3,rmse4,rmse5)
RMSE
## [1] 2.397805 2.397102 2.343880 2.389193 2.190628
mae1 <- mae(test_cholera_data, forecast1$mean)
mae2 <- mae(test_cholera_data, forecast2$mean)
mae3 <- mae(test_cholera_data, forecast3$mean)
mae4 <- mae(test_cholera_data, forecast4$mean)
mae5 <- mae(test_cholera_data, forecast5$mean)
MAE= c(mae1,mae2,mae3,mae4,mae5)
MAE
## [1] 2.109525 2.108861 2.067711 2.122549 2.028944
model <- Arima(train_cholera_data,order = c(1,0,0),seasonal = list(order = c(0,1,1), period = 12))
fit <- forecast(model,h =length(test_cholera_data ))
plot(fit, main="Train-Test fit",
ylab="log(Cases)", xlab="Time", col="black", lwd=2)
lines(test_cholera_data, col="red", lwd=2, lty=2)
legend("topleft", legend=c("Forecast", "Actual"),
col=c("blue", "red"), lty=c(1,2))
model_actual <- Arima(ts_cholera_data,order = c(1,0,0),seasonal = list(order = c(0,1,1), period = 12))
fit1 <- forecast(model_actual,h =24)
plot(fit1, ylab="log(Cases)", xlab="Time (Months)",lwd=2, main="Cholera Cases")
legend("topleft", c("cases", "Forecast","80% CI",
"95% CI"), cex=0.7, col=c(1, 4,"lightblue",8),lty=c(1, 1, 1, 2))
mean_original <- exp(fit1$mean)
#--------Back-transform the 80% and 95% prediction intervals---------
lower_80 <- exp(fit1$lower[ , "80%"])
upper_80 <- exp(fit1$upper[ , "80%"])
lower_95 <- exp(fit1$lower[ , "95%"])
upper_95 <- exp(fit1$upper[ , "95%"])
#--------Back-transform mean and prediction intervals---------
mean_original <- exp(fit1$mean)
lower_80 <- exp(fit1$lower[ , "80%"])
upper_80 <- exp(fit1$upper[ , "80%"])
lower_95 <- exp(fit1$lower[ , "95%"])
upper_95 <- exp(fit1$upper[ , "95%"])
#-------Create time variable------
time_months <- time(mean_original)
forecast_df <- data.frame(
Time = time_months,
Mean = mean_original,
Lower_80 = lower_80,
Upper_80 = upper_80,
Lower_95 = lower_95,
Upper_95 = upper_95
)
print(forecast_df)
## Time Mean Lower_80 Upper_80 Lower_95 Upper_95
## 1 2025.000 1.925122 0.2593862 14.28794 0.08976893 41.28482
## 2 2025.083 3.715725 0.3096111 44.59342 0.08308223 166.18014
## 3 2025.167 5.509685 0.3669436 82.72831 0.08745475 347.11240
## 4 2025.250 7.959582 0.4732460 133.87318 0.10621501 596.47833
## 5 2025.333 9.936179 0.5567333 177.33382 0.12108907 815.33080
## 6 2025.417 56.298217 3.0566664 1036.91042 0.65383317 4847.55038
## 7 2025.500 55.822463 2.9802986 1045.58228 0.63184844 4931.79570
## 8 2025.583 86.374511 4.5702173 1632.42917 0.96433006 7736.51724
## 9 2025.667 62.839721 3.3092355 1193.27580 0.69650970 5669.45528
## 10 2025.750 9.289034 0.4880119 176.81156 0.10258474 841.12065
## 11 2025.833 9.999756 0.5248451 190.52311 0.11027122 906.81068
## 12 2025.917 3.190932 0.1674910 60.79161 0.03519165 289.33138
## 13 2026.000 4.513788 0.2221455 91.71593 0.04511025 451.65528
## 14 2026.083 6.947633 0.3302333 146.16818 0.06583525 733.18775
## 15 2026.167 8.724386 0.4070427 186.99488 0.08035279 947.25905
## 16 2026.250 11.155391 0.5152934 241.49881 0.10118603 1229.84123
## 17 2026.333 12.731516 0.5849526 277.10191 0.11453907 1415.16339
## 18 2026.417 67.540135 3.0942398 1474.24572 0.60495904 7540.46051
## 19 2026.500 63.808264 2.9188198 1394.91124 0.57020236 7140.43788
## 20 2026.583 95.286498 4.3553774 2084.66815 0.85049095 10675.61811
## 21 2026.667 67.538871 3.0860695 1478.09345 0.60252372 7570.65474
## 22 2026.750 9.794268 0.4475263 214.35091 0.08737441 1097.89214
## 23 2026.833 10.396375 0.4751804 227.46018 0.09278824 1164.85251
## 24 2026.917 3.283399 0.1501815 71.78453 0.02933718 367.47601
checkresiduals(fit1)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,0)(0,1,1)[12]
## Q* = 8.9727, df = 15, p-value = 0.8789
##
## Model df: 2. Total lags used: 17
# Dataset
model_data <- data.frame(
Models = c("ARIMA(1,0,0)",
"ARIMA(1,0,0)(0,0,1)[12]",
"ARIMA(1,0,0)(1,0,1)[12]",
"ARIMA(2,0,2)(1,0,1)[12]",
"ARIMA(1,0,0)(0,1,1)[12]"),
RMSE = c(2.397805, 2.397102, 2.343880, 2.389193, 2.190628 ),
MAE = c(2.109525, 2.108861, 2.067711, 2.122549, 2.028944)
)
# Convert to long format
long_data <- model_data %>%
pivot_longer(cols = RMSE:MAE, names_to = "Metric", values_to = "Value") %>%
mutate(Highlight = ifelse(Models == "ARIMA(1,0,0)(0,1,1)[12]", "Best", "Others"))
##-------Plot---------
ggplot(long_data, aes(x = Models, y = Value, fill = Highlight)) +
geom_col(width = 0.7) +
facet_wrap(~ Metric, scales = "free_y") +
scale_fill_manual(values = c("Best" = "maroon", "Others" = "lightgrey")) +
theme_minimal(base_size = 12) +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5)) +
labs(
title = "Model Comparison (ARIMA Forecasts)",
x = NULL,
y = "Metric Value",
fill = "Model Type"
)
# Load data
cholera_data <- read.csv("~/GLADYS FOLDER .R/Cholera_dataaaa.csv")
# KEEP THE DATA AS A DATAFRAME. DO NOT UNLIST.
# Rename column ConfirmedCases → Cases
names(cholera_data)[names(cholera_data) == "ConfirmedCases"] <- "Cases"
cholera_df <- cholera_data
# --- Decompose ConfirmedCases into seasonal component ---
ts_cases <- ts(cholera_df$Cases, frequency = 12)
decomp <- stl(ts_cases, s.window = "periodic")
# Add seasonal effect to your dataframe
cholera_df$seasonal_effect <- decomp$time.series[, "seasonal"]
# --- Select variables for correlation ---
variables <- cholera_df[, c("Cases", "Rainfall", "Temperature", "Humidity")]
# --- Correlation matrix ---
corre_matrix <- cor(variables, use = "complete.obs")
# --- Plot correlation ---
corrplot(corre_matrix,
method = "color",
type = "upper",
col = colorRampPalette(c("red", "white", "blue"))(200),
addCoef.col = "black",
tl.col = "black",
number.cex = 0.7,
tl.srt = 45)
Human population is divided into:
\(S(t)\) – Susceptible
\(E(t)\) – Pre-infectious (exposed)
\(A(t)\) – Asymptomatic
\(I(t)\) – Symptomatic
\(R(t)\) – Recovered
\(N(t)\) – Total population size: S + E + A + I + R
The environment is represented by :
\(B(t)\) – Reservoir of v.cholerae
\(\Pi_1\) – Recruitment rate of humans into the susceptible population.
\(\Pi_2\) – Introduction of bacteria into the reservoir.
\(\lambda\) – Rate at which the susceptible individuals get infected.
\(\gamma\) – Rate at which asymptomatic class becomes infectious.
\(\omega_1\) – Recovery rate of asymptomatic individuals.
\(\omega_2\) – Recovery rate of symptomatic individuals.
\(\Theta\) – Proportion of exposed that becomes asymptomatic.
\(\tau\) – Natural death of bacteria.
\(\delta\) – Disease induced death.
\(\phi\) – Rate at which the recovered individuals goes back to the susceptible.
\((1-\Theta)\kappa\) – Proportion of exposed indviduals who become infectious.
\(\psi_1\) – Contribution of asymptomatic individuals to the bacteria reservoir.
\(\psi_2\) – Contribution of symptomatic individuals to the bacteria reservoir.
\(\eta\) – efficacy of WASH intervention.
\(\mu\) – Natural death rate of human population.
\[\frac{dS}{dt} = \Pi_1 + \psi R -(1-\eta)\lambda S -\mu S\]
\[\frac{dE}{dt} = (1-\eta) \lambda S - \kappa E - \mu E\]
\[\frac{dA}{dt} = \kappa\Theta E - (\gamma+ \omega _{1}+ \mu )A\]
\[\frac{dI}{dt} = \kappa (1-\Theta) E +\gamma A -(\omega_{2} + \mu +\delta) I\]
\[\frac{dR}{dt} = \omega_{1}A +\omega_{2} I -(\psi + \mu) R\]
\[\frac{dB}{dt} = \Pi_2 + (1-\eta)\phi_1 A + (1-\eta)\phi_2 I - \tau B \]
The time series forecast shows that cholera cases are expected to increase mainly between June and September, which is the rainy season, in both 2025 and 2026. This pattern agrees with previous studies in Nigeria, which report that cholera outbreaks usually become more common during periods of heavy rainfall and flooding. Abaje, I., Abdullahi, N., & Jeje, O. (2016).
During the rainy season, flooding and poor drainage can contaminate water sources, making people more exposed to cholera through unsafe drinking water and poor sanitation.
These findings show that cholera follows a predictable seasonal pattern. Using time series forecasts to prepare ahead of the rainy season—through improved water supply, sanitation, hygiene education, and early public health action—can help reduce future outbreaks (Ahmad & Prasad, 2023).