Aim of the Study

The aim of the research is to generate an early warning system based on malaria trends in Kaduna State for decision-making in malaria control.

Research Questions

Research Objectives

The objectives of this study are to:

Research Methodology

In respect to objective 1

In respect to objective 2


Loading necessary libraries

setwd("C:\\Users\\USER\\Documents\\R")
kaduna_data <- read.csv("Kaduna_Dataa.csv")
summary(kaduna_data)
##   periodname        Confirmed.uncomplicated.Malaria
##  Length:120         Min.   :  6362                 
##  Class :character   1st Qu.: 45606                 
##  Mode  :character   Median :116448                 
##                     Mean   :116347                 
##                     3rd Qu.:168176                 
##                     Max.   :279846                 
##  Persons.with.clinically.diagnosed.Malaria     Pop           
##  Min.   :   11                             Length:120        
##  1st Qu.: 1774                             Class :character  
##  Median : 5377                             Mode  :character  
##  Mean   :15713                                               
##  3rd Qu.:31398                                               
##  Max.   :60096                                               
##     Rainfall           Humidity      Temperature    Persons.with.fever
##  Min.   :   0.000   Min.   :24.60   Min.   :16.96   Min.   : 11442    
##  1st Qu.:   0.165   1st Qu.:49.73   1st Qu.:22.70   1st Qu.: 82649    
##  Median : 141.775   Median :72.15   Median :23.93   Median :158179    
##  Mean   : 224.823   Mean   :66.10   Mean   :23.86   Mean   :159375    
##  3rd Qu.: 293.317   3rd Qu.:85.33   3rd Qu.:25.16   3rd Qu.:219428    
##  Max.   :1552.160   Max.   :91.24   Max.   :28.77   Max.   :350658
head(kaduna_data)
##   periodname Confirmed.uncomplicated.Malaria
## 1     15-Jan                            6362
## 2     15-Feb                           20387
## 3     15-Mar                           24538
## 4     15-Apr                           23037
## 5     15-May                           25563
## 6     15-Jun                           31619
##   Persons.with.clinically.diagnosed.Malaria       Pop Rainfall Humidity
## 1                                      7016 1,046,000     0.00    29.05
## 2                                     22860 1,046,000     0.60    28.34
## 3                                     28344 1,046,000     2.44    32.81
## 4                                     25232 1,046,000     2.21    25.25
## 5                                     28807 1,046,000    56.76    56.19
## 6                                     33598 1,046,000   123.16    75.69
##   Temperature Persons.with.fever
## 1       20.80              11442
## 2       26.94              38728
## 3       28.05              45905
## 4       28.70              42961
## 5       28.77              51961
## 6       25.97              58044
# Convert population values from character to numeric
kaduna_data$Pop <- as.numeric(gsub(",", "", kaduna_data$Pop))

Confirmed Malaria Incidence

# Calculate malaria incidence per 1000 inhabitants
Confirmed_Malaria_Incidence <- 
  (kaduna_data$Confirmed.uncomplicated.Malaria / kaduna_data$Pop) * 1000
kaduna_data <- kaduna_data %>% 
  mutate(Malaria_Incidence = round(Confirmed_Malaria_Incidence, 3))

Plot of monthly malaria incidence in Kaduna State (2015–2024)

Data_1 <- kaduna_data$Malaria_Incidence

#date
dates_data <- seq(as.Date("2015/1/1"), as.Date("2024/12/1"), by = "month")


plot(Data_1~dates_data, data=kaduna_data,col='red', xlab = "Time", type='l', lwd=3)

The plot indicates strong seasonal (temporal) patterns in malaria incidence in Kaduna State, with recurrent peaks typically occurring between May and October each year, corresponding to the rainy season, and lower incidence between November and March. Over time, there is a gradual upward trend, with malaria cases increasing from relatively low levels in 2015–2017, rising more sharply from 2018 onward, and reaching the highest levels between 2022 and 2024. This pattern reflects persistent seasonality alongside a growing malaria burden over the study period

Log Transform the data

#data("Kaduna_routine_2015_2024")
Data_2 <- log(Data_1)  # log to stabilize variance

##
ts_data <- ts((Data_2),start= c(2015,1), frequency = 12)  # specify frequency!
length(ts_data)  # should be >= 24 for monthly data
## [1] 120
plot(ts_data, type= "l")

#To check if the data is stationary
adf.test(ts_data) # To check if the p-value of the data is less than 0.05
## Warning in adf.test(ts_data): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  ts_data
## Dickey-Fuller = -4.5652, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary

STL decomposition

# STL decomposition
decomp <- stl(ts_data, s.window = "periodic")  # s.window = "periodic" for stable seasonality

# Plot decomposition
plot(decomp,col.range = "blue")

Splitting the data into train and test

# Data Split
n=length(ts_data)
n_train=floor(0.7*n)
train=window(ts_data,end=time(ts_data)[n_train])
test=window(ts_data,start=time(ts_data)[n_train+1])

#fit arima model
model=auto.arima(train,ic="aic",trace=TRUE)
## 
##  ARIMA(2,1,2)(1,0,1)[12] with drift         : Inf
##  ARIMA(0,1,0)            with drift         : 1.051391
##  ARIMA(1,1,0)(1,0,0)[12] with drift         : 0.5957612
##  ARIMA(0,1,1)(0,0,1)[12] with drift         : 1.935375
##  ARIMA(0,1,0)                               : 1.150643
##  ARIMA(1,1,0)            with drift         : 2.729574
##  ARIMA(1,1,0)(2,0,0)[12] with drift         : 1.394708
##  ARIMA(1,1,0)(1,0,1)[12] with drift         : 0.2136217
##  ARIMA(1,1,0)(0,0,1)[12] with drift         : 1.658619
##  ARIMA(1,1,0)(2,0,1)[12] with drift         : 2.21345
##  ARIMA(1,1,0)(1,0,2)[12] with drift         : 2.213413
##  ARIMA(1,1,0)(0,0,2)[12] with drift         : 2.751596
##  ARIMA(1,1,0)(2,0,2)[12] with drift         : 4.17356
##  ARIMA(0,1,0)(1,0,1)[12] with drift         : Inf
##  ARIMA(2,1,0)(1,0,1)[12] with drift         : 1.699748
##  ARIMA(1,1,1)(1,0,1)[12] with drift         : 1.341303
##  ARIMA(0,1,1)(1,0,1)[12] with drift         : 0.4571998
##  ARIMA(2,1,1)(1,0,1)[12] with drift         : Inf
##  ARIMA(1,1,0)(1,0,1)[12]                    : -0.68151
##  ARIMA(1,1,0)(0,0,1)[12]                    : 1.720156
##  ARIMA(1,1,0)(1,0,0)[12]                    : 0.3987677
##  ARIMA(1,1,0)(2,0,1)[12]                    : 1.317541
##  ARIMA(1,1,0)(1,0,2)[12]                    : 1.3173
##  ARIMA(1,1,0)                               : 2.998669
##  ARIMA(1,1,0)(0,0,2)[12]                    : 2.620064
##  ARIMA(1,1,0)(2,0,0)[12]                    : 0.8265951
##  ARIMA(1,1,0)(2,0,2)[12]                    : 3.278579
##  ARIMA(0,1,0)(1,0,1)[12]                    : Inf
##  ARIMA(2,1,0)(1,0,1)[12]                    : 0.7277321
##  ARIMA(1,1,1)(1,0,1)[12]                    : 0.4522662
##  ARIMA(0,1,1)(1,0,1)[12]                    : -0.4458432
##  ARIMA(2,1,1)(1,0,1)[12]                    : Inf
## 
##  Best model: ARIMA(1,1,0)(1,0,1)[12]

Model Fitting

#select model with lowest aic value
model1=Arima(train,order=c(1,1,0),seasonal=list(order=c(1,0,1),period=12))
model2=Arima(train,order=c(1,1,0),seasonal=list(order=c(1,0,0),period=12))
model3=Arima(train,order=c(1,1,1),seasonal=list(order=c(1,0,1),period=12))
model4=Arima(train,order=c(0,1,1),seasonal=list(order=c(1,0,1),period=12))


#forecast
forecast1=forecast(model1,h=length(test))
forecast2=forecast(model2,h=length(test))
forecast3=forecast(model3,h=length(test))
forecast4=forecast(model4,h=length(test))

Root Mean Square Error

#check r model metrics
rmse1=rmse(test,forecast1$mean)
rmse2=rmse(test,forecast2$mean)
rmse3=rmse(test,forecast3$mean)
rmse4=rmse(test,forecast4$mean)


RMSE=c(rmse1,rmse2,rmse3,rmse4)
RMSE
## [1] 0.1390960 0.1697500 0.1404222 0.1496580

Mean Absolute Error

mae1=mae(test,forecast1$mean)
mae2=mae(test,forecast2$mean)
mae3=mae(test,forecast3$mean)
mae4=mae(test,forecast4$mean)


MAE=c(mae1,mae2,mae3,mae4)
MAE
## [1] 0.1223946 0.1392379 0.1243776 0.1332541

Mean Absolute Percentage Error

mape1=mape(test,forecast1$mean)
mape2=mape(test,forecast2$mean)
mape3=mape(test,forecast3$mean)
mape4=mape(test,forecast4$mean)


MAPE=c(mape1,mape2,mape3,mape4)
MAPE
## [1] 0.02445229 0.02723795 0.02482584 0.02658660

Model Comparison (ARIMA Forecasts)

# Dataset

model_data <- data.frame(
  Models = c("ARIMA(1,1,0)(1,0,1)[12]",
             "ARIMA(1,1,0)(1,0,0)[12]",
             "ARIMA(1,1,1)(1,0,1)[12]",
             "ARIMA(0,1,1)(1,0,1)[12]"
            ),
  RMSE = c(0.1390960, 0.1697500, 0.1404222, 0.1496580),
  MAE  = c(0.1223946, 0.1392379, 0.1243776, 0.1332541),
  MAPE = c(0.02445229, 0.02723795, 0.02482584, 0.02658660)
)

# Convert to long format
long_data <- model_data %>%
  pivot_longer(cols = RMSE:MAPE, names_to = "Metric", values_to = "Value") %>%
  mutate(Highlight = ifelse(Models == "ARIMA(1,1,0)(1,0,1)[12]", "Best", "Others"))

##-------Plot---------
ggplot(long_data, aes(x = Models, y = Value, fill = Highlight)) +
  geom_col(width = 0.7) +
  facet_wrap(~ Metric, scales = "free_y") +
  scale_fill_manual(values = c("Best" = "blue", "Others" = "lightgrey")) +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5)) +
  labs(
    title = "Model Comparison (ARIMA Forecasts)",
    x = NULL,
    y = "Metric Value",
    fill = "Model Type"
  )

Train-test fit

forecast= forecast(model1, h=length(test))
plot(forecast1,ylab="log(cases)",xlab="Time",
     lwd=2,main="Forecast vs Actual")
lines(test, col="red",lty=2,lwd=2)
legend("topleft",legend=c("forecast","actual"),
       col=c("blue","red"),lty=c(1,2))

The forecast versus actual plot compares observed malaria cases with model-predicted values over time. Historical data show an overall increasing trend with fluctuations. The forecasted values closely track the actual observed cases during the validation period, indicating good model performance. The widening prediction intervals reflect increasing uncertainty further into the future.

Forecast from ARIMA(1,1,0)(1,0,1)[12]

#fitting (For the forecast)
model1= Arima(ts_data,order=c(1,1,0),seasonal=list(order=c(1,0,1),period=12))
fit=forecast(model1,h=24)
plot(forecast(fit,h=24),ylab="log(cases)",
     xlab="Time",lwd=2,main="Forecast from ARIMA(1,1,0)(1,0,1)[12]")
legend("topleft",c("log(cases)","forecast","80% CI","95% CI"),
       cex=0.7,col=c(1,4,"lightblue",8),lty=c(1,1,1,2))

The result from the plot suggest that malaria incidence in Kaduna State is expected to remain persistently high if current transmission patterns continue.

Back-transform the forecast mean

# Back-transform the forecast mean
mean_original <- exp(fit$mean)

# Back-transform the 80% and 90% prediction intervals
lower_80 <- exp(fit$lower[,"80%"])
upper_80 <- exp(fit$upper[,"80%"])
lower_95 <- exp(fit$lower[,"95%"])
upper_95 <- exp(fit$upper[,"95%"])

# Back-transform mean and prediction intervals
mean_original <- exp(fit$mean)
lower_80 <- exp(fit$lower[,"80%"])
upper_80 <- exp(fit$upper[,"80%"])
lower_95 <- exp(fit$lower[,"95%"])
upper_95 <- exp(fit$upper[,"95%"])

Forecast Data Frame

# Create time variable
time_months <- time(mean_original)


# Combine into a data frame
forecast_df <- data.frame(
  Time = time_months,
  Mean = mean_original,
  Lower_80 = lower_80,
  Upper_80 = upper_80,
  Lower_95 = lower_95,
  Upper_95 = upper_95
)

print(forecast_df)
##        Time     Mean  Lower_80 Upper_80  Lower_95  Upper_95
## 1  2025.000 180.2297 140.53755 231.1322 123.19771  263.6636
## 2  2025.083 191.9438 139.08901 264.8837 117.28546  314.1261
## 3  2025.167 191.8755 130.45339 282.2172 106.35362  346.1678
## 4  2025.250 201.2354 129.63217 312.3893 102.70886  394.2766
## 5  2025.333 207.2449 127.23115 337.5780  98.27102  437.0611
## 6  2025.417 241.1382 141.69952 410.3588 106.93959  543.7429
## 7  2025.500 236.1156 133.25034 418.3898  98.43362  566.3775
## 8  2025.583 254.0262 138.04644 467.4465  99.95863  645.5604
## 9  2025.667 237.1959 124.39826 452.2722  88.39674  636.4701
## 10 2025.750 252.1573 127.86346 497.2750  89.25294  712.3944
## 11 2025.833 229.0559 112.48031 466.4515  77.19217  679.6882
## 12 2025.917 228.2558 108.69753 479.3182  73.39326  709.8840
## 13 2026.000 209.1075  95.36061 458.5327  62.92948  694.8406
## 14 2026.083 221.3930  97.15001 504.5275  62.81695  780.2807
## 15 2026.167 221.3696  93.59285 523.5925  59.33692  825.8688
## 16 2026.250 231.3503  94.39324 567.0210  58.72764  911.3758
## 17 2026.333 237.7475  93.73855 602.9950  57.27286  986.9225
## 18 2026.417 273.5756 104.36137 717.1580  62.65841 1194.4700
## 19 2026.500 268.2909  99.12967 726.1196  58.51992 1230.0085
## 20 2026.583 287.0998 102.84826 801.4360  59.72905 1380.0034
## 21 2026.667 269.4282  93.66265 775.0319  53.53673 1355.9203
## 22 2026.750 285.1418  96.27293 844.5348  54.18476 1500.5295
## 23 2026.833 260.8487  85.60202 794.8650  47.45919 1433.6959
## 24 2026.917 260.0042  82.99182 814.5644  45.34170 1490.9494

Residuals from ARIMA(1,1,0)(1,0,1)[12]

# Extract residuals
res <- residuals(model)

# set up plotting layout: 3 plots, one below the other
#par(mfrow = c(3,1), mar = c(4,4,3,2)) # Adjust margins

# ACF plot
acf(res, main = "ACF of Residuals", lwd = 2)

checkresiduals(fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,0)(1,0,1)[12]
## Q* = 22.926, df = 21, p-value = 0.3479
## 
## Model df: 3.   Total lags used: 24

Conclusion

The analysis of malaria incidence in Kaduna State shows a gradual upward trend with clear seasonal fluctuations. Using the ARIMA(1,1,0)(1,0,1)[12] model on log-transformed monthly data, Malaria cases was forecasted from early 2024 through 2026. The forecasted log(cases) range between approximately 5.2 and 5.5, corresponding to 180 to 245 cases per month, with widening 80% and 95% confidence intervals indicating increasing uncertainty over time. These projections reflect the expected trajectory of malaria in Kaduna if no additional interventions or control measures are implemented, preserving the historical seasonal pattern with recurring peaks during months of higher malaria incidence. These results underscore persistent seasonality and highlight the urgent need for continuous monitoring to guide timely interventions and resource allocation.