1 Executive Summary

This analysis evaluates the temporal dynamics of monthly confirmed malaria cases collected from a health structure in Togo between 2022 and 2024.

Using ARIMA, SARIMA, and ARIMAX models, we identified a strong seasonal pattern with annual periodicity. The SARIMA model demonstrated superior performance and was used to generate forecasts for 2025–2026.

The results highlight the importance of incorporating seasonality in epidemiological time series modelling for improved predictive accuracy.

2 Background and Study Objectives

2.1 Data Description

The data used in this analysis originate from the malaria epidemiological surveillance system and cover the period from January 2022 to December 2024. They are derived from real-world data collected within a health structure in Togo.

The observations are organized as monthly time series data, which allows the application of time series analytical methods.

The primary dependent variable is:

Total_confirmed_cases: the total monthly number of biologically confirmed malaria cases.

This variable represents a discrete time series that may exhibit:

Temporal dependence

Annual seasonality

Potential non-stationarity

2.2 Objective of the Time Series Analysis

The primary objective of this time series analysis is to model the temporal dynamics of confirmed malaria cases and identify the most appropriate model for generating forecasts for 2025–2026.

More specifically, the analysis aims to:

  • Examine the temporal structure of the series (trend, seasonality, autocorrelation);

  • Test stationarity and apply necessary transformations;

  • Identify and estimate ARIMA, SARIMA, and ARIMAX models;

  • Compare models using information criteria (AIC, BIC);

  • Select the best-performing forecasting model;

  • Generate monthly forecasts of confirmed malaria cases for 2025–2026.

3 Data Preparation and Initial Exploration

3.1 R Markdown Configuration

# Load required libraries
library(haven)      # Import SPSS data
library(dplyr)      # Data manipulation
library(tidyverse)  # Visualization and tidy tools
library(forecast)   # ARIMA / SARIMA / ARIMAX models + Acf() and Pacf()
library(tseries)    # Augmented Dickey-Fuller (ADF) test

3.2 Data Import and Initial Inspection

# Import the dataset (SPSS format) from the project data folder
data <- read_sav("data/ARIMA.sav")

# Check the object structure and data type
class(data)
## [1] "tbl_df"     "tbl"        "data.frame"
# Preview the first rows of the dataset
head(data)
## # A tibble: 6 × 22
##   Année Mois        Temps Saison       Total_cas_suspects Cas_suspects_moins_5…¹
##   <dbl> <dbl+lbl>   <dbl> <dbl+lbl>                 <dbl>                  <dbl>
## 1  2022 1 [Janvier]     1 1 [Transiti…                681                    131
## 2  2022 2 [Février]     2 1 [Transiti…                645                    229
## 3  2022 3 [Mars]        3 1 [Transiti…                538                    196
## 4  2022 4 [Avril]       4 2 [Forte tr…                777                    323
## 5  2022 5 [Mai]         5 2 [Forte tr…                903                    302
## 6  2022 6 [Juin]        6 2 [Forte tr…                990                    210
## # ℹ abbreviated name: ¹​Cas_suspects_moins_5ans
## # ℹ 16 more variables: Cas_suspects_plus_5ans_sans_FE <dbl>,
## #   Cas_suspects_FE <dbl>, Cas_suspects_plus_5ans <dbl>,
## #   Total_cas_confirmés <dbl>, Cas_confirmés_moins_5ans <dbl>,
## #   Cas_confirmés_plus_5ans_sans_FE <dbl>, Cas_confirmés_FE <dbl>,
## #   Cas_confirmés_plus_5ans <dbl>, Prevalence_total <dbl>,
## #   Prevalence_moins_5ans <dbl>, Prevalence_plus_5ans_sans_FE <dbl>, …

3.2.1 Interpretation

This section initializes the R environment by loading the necessary libraries required for data import, manipulation, visualization, and time series modelling.

  • haven is used to import SPSS data files.

  • dplyr and tidyverse support data wrangling and visualization.

  • forecast provides functions for ARIMA-family modelling and diagnostics.

  • tseries is used to perform the Augmented Dickey–Fuller test for stationarity assessment.

4 Time Variable Construction

4.1 Creation of the Date Variable

In time series analysis, it is essential to define a properly structured temporal variable to preserve the chronological ordering of observations. Since the dataset is organized at a monthly scale, a Date variable is constructed using the available year and month information.

Each monthly observation is assigned to the first day of the corresponding month. This conventional approach ensures proper representation of monthly periodicity and compatibility with visualization and time series analysis tools in R.

The creation of the temporal variable facilitates:

Graphical representation using calendar-based plotting methods

Consistency between the dataset structure and time series modelling techniques

Clear temporal interpretation of results

A verification step is performed to confirm that the new variable is correctly recognized as a Date object.

# Create Date variable from Year and Month
data <- data %>%
  mutate(
    Date = as.Date(paste(Année, Mois, "01", sep = "-"))
  )

# Verify Date class
class(data$Date)
## [1] "Date"
head(data$Date)
## [1] "2022-01-01" "2022-02-01" "2022-03-01" "2022-04-01" "2022-05-01"
## [6] "2022-06-01"

4.1.1 Interpretation

The output confirms that the Date variable is correctly stored as a Date object in R, ensuring compatibility with time-based plotting and chronological data manipulation.

5 Definition of the Monthly Time Series Object

After constructing the temporal variable, the dependent variable Total_confirmed_cases is transformed into a structured time series object (ts).

This step is crucial because ARIMA-family models require data formatted as time series objects.

The series is defined as:

  • Monthly frequency (12 observations per year)

  • Covering January 2022 to December 2024

This specification allows R to incorporate seasonal structure automatically.

# Define monthly time series
ts_confirmed <- ts(
  data$Total_cas_confirmés,
  start = c(2022, 1),
  end   = c(2024, 12),
  frequency = 12
)

ts_confirmed
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 2022 379 205 192 199 419 602 618 695 423 343 373 291
## 2023 392 408 180 290 351 613 583 756 481 476 760 533
## 2024 307 238 439 419 439 522 572 470 479 282 386 449

5.1 Interpretation

The output confirms that the data are now structured as a monthly time series with 36 observations. This format enables:

  • Seasonal decomposition

  • Stationarity testing

  • ARIMA/SARIMA model estimation

  • Forecast generation

6 Exploratory Time Series Analysis

Before formal modelling, an exploratory analysis is conducted to understand the underlying behaviour of the series.

The objectives of the exploratory analysis are to:

  • Detect potential long-term trends

  • Identify recurring seasonal patterns

  • Examine irregular fluctuations or structural breaks

  • Assess variance stability

Two complementary visualization approaches are used:

Direct plotting of the ts object

Calendar-based visualization using the Date variable

6.1 Time Series Plot

plot.ts(
  ts_confirmed,
  main = "Monthly Confirmed Malaria Cases (2022–2024)",
  xlab = "Year (monthly data)",
  ylab = "Number of Confirmed Cases",
  col  = "violetred3",
  lwd  = 3)

6.2 Calendar-Based Visualization

ggplot(data, aes(x = Date, y = Total_cas_confirmés)) +
  geom_line(color = "steelblue", linewidth = 1.2) +
  labs(
    title = "Monthly Confirmed Malaria Cases (2022–2024)",
    x = "Date",
    y = "Number of Confirmed Cases"
  ) +
  theme_minimal()

6.2.1 Interpretation of the Exploratory Analysis

The visual inspection of the monthly confirmed malaria cases from 2022 to 2024 reveals substantial month-to-month variability. Recurrent peaks are observed at similar periods across years, indicating the presence of a strong annual seasonal component.

No clear monotonic linear trend (increasing or decreasing) is observed over the entire study period, suggesting that long-term structural changes are limited. However, the magnitude of seasonal fluctuations appears considerable, justifying the consideration of models that explicitly incorporate seasonal dynamics.

These preliminary findings support the use of SARIMA-type models for subsequent analysis.

7 Time Series Decomposition

Time series decomposition is an essential exploratory step that helps to better understand the internal structure of the observed series. It separates the original time series into distinct components typically interpreted as:

- A trend component, representing long-term evolution;

- A seasonal component, reflecting recurring periodic fluctuations;

- A residual (irregular) component, capturing random variability not explained by trend or seasonality.

This step is particularly important in ARIMA modelling, as visual identification of trend and seasonality helps assess whether the series may violate stationarity assumptions.

7.1 Additive Decomposition

First, an additive decomposition is applied to the monthly confirmed malaria cases.

The additive model assumes that the observed series can be expressed as:

                                            Xt = Tt + St + Rt
           

where:

  • Tt represents the trend component,

  • St represents the seasonal component,

  • Rt represents the residual component.

Additive decomposition is appropriate when the magnitude of seasonal fluctuations remains approximately constant over time.

# Additive decomposition
decomp_additive <- decompose(ts_confirmed, type = "additive")
plot(decomp_additive)

# Residual diagnostics
par(mfrow = c(1,2))
plot.ts(decomp_additive$random, main = "Residuals")
hist(decomp_additive$random, breaks = 25, freq = FALSE, main = "Histogram")

7.1.1 Interpretation of the Additive Decomposition

The additive decomposition highlights three distinct components.

The trend component exhibits moderate fluctuations without a persistent upward or downward trajectory, confirming the absence of a strong long-term structural trend during the study period.

The seasonal component displays a regular and reproducible annual pattern with relatively stable amplitude. This indicates that seasonal effects add a relatively constant magnitude to the series across years.

The residual component appears centered around zero and does not show obvious systematic structure, suggesting that the additive model adequately captures the main temporal features of the series.

7.2 Multiplicative Decomposition

Next, a multiplicative decomposition is performed to evaluate whether seasonal variation changes proportionally with the level of the series.

In the multiplicative framework, the series is assumed to follow:

                                        Xt = Tt x St x Rt
            

Multiplicative decomposition is appropriate when seasonal fluctuations increase or decrease proportionally with the series level, implying heteroscedastic behavior.

# Multiplicative decomposition
decomp_multi <- decompose(ts_confirmed, type = "multiplicative")
plot(decomp_multi)

# Residual diagnostics
par(mfrow = c(1,2))
plot.ts(decomp_multi$random, main="Residuals", ylab="")
hist(decomp_multi$random, breaks = 25, freq = FALSE, main = "Histogram")

7.2.1 Interpretation of the Multiplicative Decomposition

The multiplicative decomposition assumes that seasonal amplitude varies proportionally with the series level. In this case, although seasonality is clearly detectable, the residuals appear more heterogeneous compared to the additive decomposition.

The comparison between the two approaches suggests that the additive decomposition provides a better representation of the data. This indicates that seasonal effects primarily operate in an additive rather than proportional manner.

8 Stationarity and Temporal Dependence

After exploring the structural components of the series through decomposition, the next step of the Box–Jenkins methodology consists of formally assessing whether the stochastic process satisfies the stationarity assumption required for ARIMA-type modelling.

Time series analysis relies on the fundamental assumption of stationarity. A time series is considered stationary when its key statistical properties remain constant over time. In practice, this implies:

  • A constant mean (no systematic trend);

  • A constant variance;

  • A stable temporal dependence structure, without persistent unmodelled autocorrelation.

In epidemiological data, such as monthly confirmed malaria cases, these assumptions are rarely satisfied a priori. Therefore, it is essential to formally assess stationarity and examine temporal dependence before fitting ARIMA-type models.

8.1 Augmented Dickey–Fuller (ADF) Test

The Augmented Dickey–Fuller (ADF) test is used to assess the presence of a unit root, which indicates non-stationarity.

The hypotheses are defined as:

- Null hypothesis (H₀): The series is non-stationary (contains a unit root).

- Alternative hypothesis (H₁): The series is stationary.

Rejection of the null hypothesis at the conventional 5% significance level suggests that the series can be considered stationary in level, or stationary after differencing if required.

# Augmented Dickey-Fuller test (ADF)
adf.test(ts_confirmed)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  ts_confirmed
## Dickey-Fuller = -3.5609, Lag order = 3, p-value = 0.04993
## alternative hypothesis: stationary

8.1.1 Interpretation of the Augmented Dickey–Fuller (ADF) Test

The Augmented Dickey–Fuller test produced a test statistic of −3.56 with a p-value of 0.0499. At the conventional 5% significance level, the null hypothesis of a unit root (non-stationarity) is rejected, indicating that the series can be regarded as stationary in level.

However, this statistical rejection is marginal and must be interpreted alongside the graphical analyses, which revealed a pronounced annual seasonal pattern in malaria incidence. In such contexts, the objective of seasonal differencing is not to correct non-stationarity per se, but to explicitly account for systematic periodic behaviour embedded in the data.

Accordingly, no regular (non-seasonal) differencing was required, whereas seasonal differencing might be incorporated within the SARIMA framework to model the observed 12-month transmission cycle and improve the representation of recurring epidemiological dynamics.

8.2 Autocorrelation and Partial Autocorrelation Analysis

Beyond formal statistical tests of stationarity, examining the temporal dependence structure of the series constitutes a fundamental step in the Box–Jenkins modelling strategy. Once the time series has been rendered stationary (if required through differencing), the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) are analysed to guide the identification of appropriate ARIMA-type models.

8.2.1 Conceptual Role of ACF and PACF

- Autocorrelation Function (ACF)

Measures the correlation between observations and their lagged values across successive time intervals. It reflects both direct and indirect temporal relationships.

- Partial Autocorrelation Function (PACF)

Measures the correlation between observations at a specific lag after removing the influence of intermediate lags. It therefore isolates the direct dependence structure.

The joint interpretation of ACF and PACF enables:

  • Identification of autoregressive (AR) components;

  • Identification of moving average (MA) components;

  • Detection of seasonal dependence;

  • Specification of candidate ARIMA/SARIMA models.

8.2.2 Theoretical Identification Rules (Box–Jenkins Framework)

Table 1. The Box–Jenkins methodology provides well-established diagnostic patterns linking ACF/PACF behaviour to the underlying stochastic process

Pattern ACF Behaviour PACF Behaviour Suggested Model
ACF tails off + PACF cuts off at lag p Gradual decay Sharp cutoff AR(p)
ACF cuts off at lag q + PACF tails off Sharp cutoff Gradual decay MA(q)
Both ACF and PACF tail off Gradual decay in both No clear cutoff ARMA(p,q)

These patterns allow an initial empirical identification before parameter estimation.

8.3 Statistical Rationale for These Patterns

- Autoregressive Process AR(p)

An autoregressive process of order p is defined as:

                               Yt = ϕ1Yt−1 +ϕ2Yt−2 +⋯+ ϕpYt−p + εt

Each observation depends directly on its previous values.

Implications:

  • Information propagates through time, generating correlations at many lags.

  • The ACF declines gradually (tailing-off pattern).

  • The PACF shows a sharp cutoff at lag p because it removes indirect effects.

The PACF therefore identifies the true autoregressive order.

- Moving Average Process MA(q)

A moving average process of order q is expressed as:

                                    Yt = εt+ θ1εt−1 +⋯+ θqεt−q

Implications:

  • Correlation disappears after lag q.

  • The ACF cuts off sharply at lag q.

  • The PACF decays gradually, approximating an autoregressive representation.

The ACF identifies the order of the MA component.

- Mixed ARMA Processes

When both AR and MA components are present:

  • Neither ACF nor PACF exhibits a clean cutoff;

  • Both functions display gradual decay.

This behaviour indicates a combined dependence structure, commonly observed in real-world epidemiological time series where persistence and random shocks coexist.

#ACF & PACF
par(mfrow = c(1,2))
Acf(ts_confirmed, main = "ACF of Original Series")
Pacf(ts_confirmed, main = "PACF of Original Series")

8.3.1 Interpretation of ACF and PACF

The ACF and PACF were examined directly on the observed monthly malaria incidence series as part of the exploratory phase of the Box–Jenkins methodology. No prior transformation was applied, as the variance appeared reasonably stable over time.

The ACF exhibits a gradual decay along with notable spikes at lags corresponding to multiples of 12, indicating a strong annual seasonal dependence typical of epidemiological data with recurring transmission cycles.

The PACF shows a more pronounced contribution at the first few lags, followed by a rapid attenuation, suggesting limited short-term autoregressive structure once the primary dependence is accounted for.

This joint behaviour is consistent with:

The presence of a non-seasonal moving-average component capturing short-term shock propagation;

A seasonal autoregressive structure reflecting persistence of malaria transmission from one year to the next.

These empirical observations motivated the consideration of SARIMA models that explicitly incorporate seasonal dynamics, which were subsequently evaluated using automated selection procedures and diagnostic criteria.

8.4 Lag-1 Scatterplot

Autocorrelation can also be visually explored using a lag-1 scatterplot, where consecutive observations are plotted against each other.

plot(ts_confirmed[1:35], ts_confirmed[2:36])

### Interpretation of the Lag-1 Scatterplot

The scatterplot of consecutive observations (Xt versus Xt+1) does not reveal a clear linear structure or strong systematic alignment along a diagonal axis. The points appear relatively dispersed.

This suggests that immediate linear dependence (lag 1) is not dominant. However, visual inspection alone is insufficient to conclude the absence of autocorrelation. Formal evaluation through ACF and PACF remains the appropriate method for assessing temporal dependence.

9 ARIMA Model Identification and Estimation

9.1 The ARIMA (p, d, q) Model

The ARIMA (AutoRegressive Integrated Moving Average) model is widely used for the analysis and forecasting of univariate time series. It is particularly suitable for series exhibiting temporal dependence and that are stationary, or can be rendered stationary through differencing.

An ARIMA model is defined by three parameters (p,d,q), each representing a specific component of the series dynamics.

- Autoregressive Component: AR(p)

The autoregressive component reflects the dependence of the current value of the series on its own past values.

In an AR(p) model, the observation at time t is expressed as a linear combination of the previous p observations.

Thus, the series exhibits temporal memory, where past values directly influence future values.

The parameter: p = number of autoregressive lags included in the model.

- Integrated Component: I(d)

The integrated component corresponds to the differencing process applied to the series to achieve stationarity.

Differencing removes trends and stabilizes the mean.

The parameter:

d = 0: series is stationary in level

d = 1: first-order differencing

d = 2: second-order differencing

The choice of d is typically guided by graphical inspection and formal stationarity tests such as the Augmented Dickey–Fuller test.

- Moving Average Component: MA(q)

The moving average component models the dependence of the current observation on past error terms (random shocks).

In an MA(q) model, the observation at time t depends on the previous q forecast errors.

The parameter: q = number of moving average lags included.

This component captures short-term random fluctuations not explained by the autoregressive structure.

Model Notation

An ARIMA model is written as:

                                                 ARIMA(p,d,q)

where:

  • p= autoregressive order

  • d = differencing order

  • q = moving average order

Identification of p and q is primarily based on ACF and PACF analysis, while d is determined from stationarity assessment.

Role of ARIMA in This Study

In this analysis, the non-seasonal ARIMA model serves as a baseline reference model, describing the temporal dynamics of confirmed malaria cases without explicitly incorporating seasonality.

It provides a comparison benchmark for more complex models such as SARIMA and ARIMAX.

9.2 Non-Seasonal ARIMA Model (Baseline)

The non-seasonal ARIMA model is estimated automatically using information criteria to determine the optimal combination of autoregressive, differencing, and moving average terms.

Automatic model selection was performed using auto.arima() with stepwise = FALSE and approximation = FALSE to ensure an exhaustive search and exact maximum likelihood estimation, thereby improving the robustness and reproducibility of model identification.

# Non-seasonal ARIMA model
model_auto_no_seasonal <- auto.arima(
  ts_confirmed,
  seasonal = FALSE,
  stepwise = FALSE,
  approximation = FALSE
)

summary(model_auto_no_seasonal)
## Series: ts_confirmed 
## ARIMA(1,0,0) with non-zero mean 
## 
## Coefficients:
##          ar1      mean
##       0.5182  431.2995
## s.e.  0.1382   42.8504
## 
## sigma^2 = 17213:  log likelihood = -225.77
## AIC=457.54   AICc=458.29   BIC=462.29
## 
## Training set error measures:
##                     ME     RMSE      MAE       MPE     MAPE      MASE      ACF1
## Training set 0.9631615 127.5001 101.1711 -10.49598 27.75727 0.7865584 0.1225203

9.2.1 Interpretation of the Non-Seasonal ARIMA Model

The automatically selected model is ARIMA(1,0,0) with a non-zero mean.

The positive and statistically significant autoregressive coefficient indicates moderate dependence between consecutive observations.

However, although the model captures part of the short-term temporal dependence, the information criterion (AIC = 457.5) remains relatively high, suggesting suboptimal fit.

This likely reflects the absence of an explicit seasonal component, despite strong evidence of annual seasonality in the data.

9.3 Diagnostic Analysis of the Non-Seasonal ARIMA Model

After model estimation, residual diagnostics are performed to assess model adequacy.

A well-specified ARIMA model should produce residuals that behave as white noise, meaning:

  • No residual autocorrelation

  • Constant variance

  • Zero mean

  • No systematic pattern

checkresiduals(model_auto_no_seasonal)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,0) with non-zero mean
## Q* = 4.1373, df = 6, p-value = 0.6581
## 
## Model df: 1.   Total lags used: 7
acf(model_auto_no_seasonal$residuals,
    main = "ACF of ARIMA Residuals")

pacf(model_auto_no_seasonal$residuals,
     main = "PACF of ARIMA Residuals")

hist(model_auto_no_seasonal$residuals,
     main = "Histogram of ARIMA Residuals",
     xlab = "Residuals")

9.3.1 Interpretation of Residual Diagnostics

Residual diagnostics indicate no significant residual autocorrelation.

The Ljung–Box test is used to check whether residuals from a fitted time series model are independent (i.e., no remaining autocorrelation). The Ljung–Box test yields a p-value of 0.66, suggesting that the residuals do not significantly deviate from white noise behavior.

Therefore, from a statistical standpoint, the model appears adequately specified.

However, despite statistical validity, the model remains substantively limited because it fails to explicitly model the strong seasonal structure observed in the original series.

This justifies proceeding to seasonal modelling using SARIMA.

10 Seasonal ARIMA (SARIMA) Modelling

Monthly confirmed malaria cases are likely to exhibit annual seasonality. Therefore, a Seasonal AutoRegressive Integrated Moving Average (SARIMA) model is estimated.

The SARIMA model extends the classical ARIMA framework by explicitly incorporating seasonal components, allowing the capture of recurrent periodic fluctuations observed in the time series.

Model identification and estimation follow the Box–Jenkins procedure, which relies on:

  • Stationarity assessment

  • ACF and PACF analysis

  • Information criteria comparison

After estimation, residual diagnostics are examined to verify the overall adequacy of the model before using it for forecasting.

Formulation of the SARIMA Model

A SARIMA model is denoted as:

                                              SARIMA(p,d,q)(P,D,Q)s

where:

  • p,d,q = non-seasonal autoregressive, differencing, and moving average orders

  • P,D,Q = seasonal autoregressive, differencing, and moving average orders

  • s = seasonal periodicity (for monthly data,s=12)

This model simultaneously captures:

  • Short-term (non-seasonal) dynamics

  • Long-term seasonal patterns

10.1 Automatic Identification - SARIMA Model

#SARIMA
model_auto_seasonal <- auto.arima(
  ts_confirmed,
  seasonal = TRUE,
  stepwise = FALSE,
  approximation = FALSE
)

summary(model_auto_seasonal)
## Series: ts_confirmed 
## ARIMA(0,0,1)(1,1,0)[12] 
## 
## Coefficients:
##          ma1     sar1
##       0.3259  -0.7008
## s.e.  0.1785   0.1474
## 
## sigma^2 = 14252:  log likelihood = -151.89
## AIC=309.79   AICc=310.99   BIC=313.32
## 
## Training set error measures:
##                    ME     RMSE      MAE      MPE     MAPE     MASE        ACF1
## Training set 15.77623 93.32365 59.60174 1.735404 14.03168 0.463376 -0.01288609

10.1.1 Interpretation of the Selected SARIMA Model

The automatically selected model is:

                                           SARIMA(0,0,1)(1,1,0)12
        

This specification includes:

  • A non-seasonal moving average component

  • seasonal autoregressive component

  • Seasonal differencing of order 1

The inclusion of seasonal differencing confirms the presence of strong annual seasonality.

The information criteria:

  • AIC = 309.8

  • BIC = 313.3

are substantially lower than those of the non-seasonal ARIMA model (AIC = 457.5), indicating a marked improvement in model fit.

Additionally, performance metrics such as RMSE and MAPE confirm superior predictive performance.

These findings strongly support the relevance of explicitly modelling seasonality.

10.2 Residual Diagnostics of the SARIMA Model

Model adequacy is assessed through residual diagnostics. A well-specified SARIMA model should produce residuals behaving as white noise.

checkresiduals(model_auto_seasonal)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,1)(1,1,0)[12]
## Q* = 0.97072, df = 5, p-value = 0.9649
## 
## Model df: 2.   Total lags used: 7
acf(model_auto_seasonal$residuals,
    main = "ACF of SARIMA Residuals")

pacf(model_auto_seasonal$residuals,
     main = "PACF of SARIMA Residuals")

hist(model_auto_seasonal$residuals,
     main = "Histogram of SARIMA Residuals",
     xlab = "Residuals")

plot(model_auto_seasonal)

10.2.1 Interpretation of Residual Diagnostics

The residual diagnostics indicate no significant remaining autocorrelation. The Ljung–Box test yields a p-value of 0.96, suggesting failure to reject the null hypothesis of independence.

The residual ACF and PACF do not reveal any systematic pattern, and the residual distribution appears approximately symmetric around zero.

These results confirm that the SARIMA model adequately captures both short-term and seasonal dynamics of confirmed malaria cases.

11 ARIMAX Modelling (ARIMA with Exogenous Variables)

To extend the modelling framework beyond purely univariate approaches, an ARIMAX (AutoRegressive Integrated Moving Average with eXogenous variables) model is estimated.

The ARIMAX model extends the ARIMA framework by incorporating external explanatory variables that may influence the temporal dynamics of confirmed malaria cases.

In this study, the exogenous variables correspond to climatic factors:

  • Precipitation

  • Temperature

  • Humidity

The influence of these variables on malaria transmission is well documented in the epidemiological literature.

To ensure numerical stability and facilitate comparison of relative effects, all exogenous variables are standardized prior to estimation.

The primary objective of the ARIMAX approach is to assess whether incorporating external climatic information significantly improves model fit and predictive performance compared to purely univariate models (ARIMA and SARIMA).

Formulation of the ARIMAX Model

The ARIMAX model can be expressed as:

Yt = ARIMA(p,d,q)(P,D,Q)s + βX

where:

  • Yt = confirmed malaria cases

  • Xt = vector of exogenous climatic variables

  • β = vector of regression coefficients

  • The ARIMA structure captures internal temporal dependence

  • Model validity is evaluated using:

  • Residual diagnostics

  • Ljung–Box test

  • Information criteria (AIC, BIC)

  • Comparative performance assessment

11.1 Selection of Exogenous Variables

# Selection of exogenous variables
xreg <- data[, c("Précipitation", "Température", "Humidité")]

summary(xreg)  # Inspection
##  Précipitation     Température       Humidité    
##  Min.   :  0.83   Min.   :25.14   Min.   :68.50  
##  1st Qu.: 31.97   1st Qu.:26.34   1st Qu.:75.40  
##  Median :115.88   Median :27.12   Median :83.14  
##  Mean   :110.91   Mean   :27.47   Mean   :81.35  
##  3rd Qu.:166.38   3rd Qu.:28.50   3rd Qu.:88.17  
##  Max.   :312.93   Max.   :30.28   Max.   :89.43
xreg_scaled <- scale(xreg)  # Standardization

Standardization ensures that the estimated coefficients are comparable in magnitude and prevents scale-related estimation instability.

11.2 Automatic ARIMAX Estimation

#Automatic ARIMAX
model_arimax <- auto.arima(
  ts_confirmed,
  xreg = xreg_scaled,
  seasonal = TRUE,
  stepwise = FALSE,
  approximation = FALSE
)

summary(model_arimax)
## Series: ts_confirmed 
## Regression with ARIMA(0,0,1) errors 
## 
## Coefficients:
##          ma1  intercept  Précipitation  Température  Humidité
##       0.3842   433.9409       -56.8735     -64.0025   74.2652
## s.e.  0.1550    23.3372        25.9497      30.7639   38.2861
## 
## sigma^2 = 12037:  log likelihood = -217.59
## AIC=447.19   AICc=450.08   BIC=456.69
## 
## Training set error measures:
##                      ME    RMSE      MAE       MPE     MAPE      MASE
## Training set -0.4614706 101.809 83.42968 -7.090258 22.49914 0.6486273
##                     ACF1
## Training set 0.007491519

11.2.1 Interpretation of the ARIMAX Model

The ARIMAX model incorporates standardized climatic variables as explanatory covariates.

Although some coefficients display signs consistent with established epidemiological findings, none appear strongly statistically significant.

Furthermore, the information criterion for the ARIMAX model (AIC = 447.2) is substantially higher than that of the SARIMA model (AIC = 309.8), indicating that the inclusion of exogenous variables does not improve overall model fit.

This suggests that seasonal structure alone captures most of the temporal variability in the series.

11.3 Residual Diagnostics of the ARIMAX Model

#ARIMAX Diagnostics
checkresiduals(model_arimax)

## 
##  Ljung-Box test
## 
## data:  Residuals from Regression with ARIMA(0,0,1) errors
## Q* = 1.0628, df = 6, p-value = 0.9831
## 
## Model df: 1.   Total lags used: 7

11.3.1 Interpretation of ARIMAX Diagnostics

Residual diagnostics indicate that the ARIMAX residuals behave as white noise. The Ljung–Box test yields a p-value of 0.98, suggesting no significant remaining autocorrelation.

However, despite satisfactory residual diagnostics, the overall performance of the ARIMAX model remains inferior to that of the SARIMA model, limiting its predictive relevance in this context.

12 Model Comparison: ARIMA vs SARIMA vs ARIMAX

To formally compare the competing models, information criteria are examined:

_ Akaike Information Criterion (AIC)

_ Bayesian Information Criterion (BIC)

Lower values indicate a better balance between goodness-of-fit and model parsimony.

AIC(model_auto_no_seasonal, model_auto_seasonal, model_arimax)
##                        df      AIC
## model_auto_no_seasonal  3 457.5407
## model_auto_seasonal     3 309.7897
## model_arimax            6 447.1864
BIC(model_auto_no_seasonal, model_auto_seasonal, model_arimax)
##                        df      BIC
## model_auto_no_seasonal  3 462.2913
## model_auto_seasonal     3 313.3238
## model_arimax            6 456.6875

12.1 Interpretation of Model Comparison

The comparison clearly demonstrates that the SARIMA model outperforms both the non-seasonal ARIMA and the ARIMAX models.

The SARIMA model provides:

  • The lowest AIC and BIC values

  • A parsimonious structure

  • Adequate residual diagnostics

  • Explicit modelling of seasonal dynamics

Therefore, the SARIMA specification represents the most appropriate model for forecasting confirmed malaria cases in this dataset.

13 Forecasting Confirmed Malaria Cases (SARIMA Model)

Based on the SARIMA model identified as the best-performing specification, monthly forecasts of confirmed malaria cases were generated for the period 2025-2026.

Forecasts are accompanied by 95% confidence intervals, which quantify the uncertainty associated with future projections.

13.1 Forecast Generation

# Generate 24-month forecast (2025–2026)
sarima_forecast <- forecast(
  model_auto_seasonal,
  h = 24,
  level = 95
)

13.2 Extraction of Forecasted Values

To support quantitative interpretation, forecasted values and their confidence intervals were extracted into a structured table.

forecast_table <- data.frame(
  Date = seq(as.Date("2025-01-01"), by = "month", length.out = 24),
  Forecast = as.numeric(sarima_forecast$mean),
  Lower_95 = as.numeric(sarima_forecast$lower[,1]),
  Upper_95 = as.numeric(sarima_forecast$upper[,1])
)

13.3 Tabular Presentation of Forecasts

The forecasted monthly values of confirmed malaria cases for the period 2025–2026 are presented in Table.

This tabular representation provides the point estimates together with their corresponding 95% confidence intervals, allowing a precise numerical assessment of the expected disease burden and the uncertainty surrounding the projections. Such presentation facilitates interpretation, comparison across months, and potential use in public health planning and resource allocation

knitr::kable(
  forecast_table,
  digits = 0,
  caption = "Table 2. presents the forecasted monthly values of confirmed malaria cases for the period 2025-2026, together with their corresponding 95% confidence intervals."
)
Table 2. presents the forecasted monthly values of confirmed malaria cases for the period 2025-2026, together with their corresponding 95% confidence intervals.
Date Forecast Lower_95 Upper_95
2025-01-01 401 167 635
2025-02-01 357 111 603
2025-03-01 257 11 504
2025-04-01 329 83 575
2025-05-01 377 131 623
2025-06-01 586 340 832
2025-07-01 580 334 826
2025-08-01 670 424 917
2025-09-01 480 234 726
2025-10-01 418 172 664
2025-11-01 648 402 894
2025-12-01 508 262 754
2026-01-01 335 79 591
2026-02-01 274 17 531
2026-03-01 385 128 642
2026-04-01 392 135 649
2026-05-01 421 164 677
2026-06-01 541 284 798
2026-07-01 574 317 831
2026-08-01 530 273 787
2026-09-01 479 223 736
2026-10-01 323 66 580
2026-11-01 464 208 721
2026-12-01 467 210 723

13.4 Forecast Visualization

plot(
  sarima_forecast,
  main = "Forecast of Monthly Confirmed Malaria Cases Based on SARIMA Model",
  xlab = "Year",
  ylab = "Number of Confirmed Cases"
)

13.5 Alternatively, using ggplot2

autoplot(sarima_forecast) +
  labs(
    title = "Forecast of Monthly Confirmed Malaria Cases (SARIMA)",
    x = "Year",
    y = "Number of Confirmed Cases"
  ) +
  theme_minimal()

13.6 Export of Forecast Results

The forecasted values were exported as a structured dataset in CSV format to enable external use, further statistical analysis, and integration into reporting or public health decision-support tools. This export ensures that the numerical projections can be easily shared, archived, and reused in other analytical environments while maintaining reproducibility of the forecasting process.

write.csv(
  forecast_table,
  "output/malaria_forecasts_2025_2026.csv",
  row.names = FALSE
)

13.7 Interpretation of Forecast Results

The forecasts for the period 2025–2026 indicate the persistence of a seasonal pattern consistent with that observed during the historical period (2022–2024). Predicted values continue to display recurrent annual peaks, demonstrating that the SARIMA model successfully captures the underlying seasonal transmission dynamics of malaria.

The forecast uncertainty increases progressively with the prediction horizon, as reflected by the widening confidence intervals. This behaviour is expected in time series forecasting and reflects the accumulation of uncertainty over time as predictions extend further from the observed data.

The numerical forecast estimates corroborate the graphical projections, with anticipated mid-year peaks aligning with historically observed malaria transmission cycles.

Overall, the SARIMA model provides coherent, epidemiologically plausible, and seasonally consistent projections of future confirmed malaria cases, supporting its suitability for short-term surveillance and planning purposes.

14 Forecast Accuracy Evaluation

To assess the predictive performance of the selected model, an out-of-sample evaluation is conducted using a training/test split approach.

This procedure consists of:

  • Estimating the model using a training period

  • Generating forecasts for a hold-out test period

  • Comparing predicted values with observed data

This approach allows evaluation of model robustness and generalization capacity.

14.1 Training and Test Split

# Training/test split
train <- window(ts_confirmed, end = c(2023, 12))
test  <- window(ts_confirmed, start = c(2024, 1))

The training sample includes data from January 2022 to December 2023, while the year 2024 is reserved for out-of-sample validation.

14.2 Model Estimation on Training Data

model_train <- auto.arima(train, seasonal = TRUE)

The model is re-estimated using only the training period to ensure that forecasts for 2024 are genuinely out-of-sample.

14.3 Out-of-Sample Forecasting

forecast_test <- forecast(model_train, h = length(test))

Forecasts are generated for the entire test period.

14.4 Forecast Accuracy Metrics

The accuracy() function provides standard forecast performance measures, including:

  • RMSE (Root Mean Squared Error)

  • MAE (Mean Absolute Error)

  • MAPE (Mean Absolute Percentage Error)

  • Theil’s U statistic

accuracy(forecast_test, test)
##                      ME     RMSE       MAE       MPE     MAPE     MASE
## Training set   1.898482 140.0864 112.84764 -12.39882 30.56403 1.030572
## Test set     -34.366241 109.2203  80.74957 -15.82558 24.59275 0.737439
##                    ACF1 Theil's U
## Training set 0.07856212        NA
## Test set     0.43053208  0.841934

14.4.1 Interpretation of Forecast Performance

The out-of-sample evaluation indicates acceptable predictive performance, with moderate forecast errors and a Theil’s U statistic below 1.

A Theil’s U value below 1 suggests that the model outperforms a naïve benchmark model, confirming its added predictive value.

These results demonstrate that the SARIMA model possesses reasonable generalization capability and can be reliably used to support epidemiological surveillance and short-term planning.

15 Conclusion and Public Health Implications

15.1 Conclusion

This study applied a structured time series modelling framework to monthly confirmed malaria cases observed between 2022 and 2024. Using the Box–Jenkins methodology, we systematically evaluated non-seasonal ARIMA, seasonal SARIMA, and ARIMAX models incorporating climatic covariates.

The analysis revealed a pronounced and stable annual seasonal pattern in confirmed malaria cases. While the non-seasonal ARIMA model captured short-term temporal dependence, it failed to account for the strong seasonal structure. The SARIMA model significantly improved model fit, as evidenced by substantially lower AIC and BIC values, adequate residual diagnostics, and superior forecasting performance.

The ARIMAX model, although theoretically motivated by the documented influence of climatic factors on malaria transmission, did not significantly improve predictive performance in this dataset. This suggests that the seasonal structure alone captured most of the temporal variability during the study period.

Out-of-sample validation confirmed that the SARIMA model demonstrates reasonable generalization capacity and outperforms a naïve benchmark model.

Overall, the SARIMA specification represents the most appropriate model for forecasting confirmed malaria cases in this context.

15.2 Public Health Implications

The identification of strong annual seasonality in confirmed malaria cases has direct implications for epidemiological planning and resource allocation.

First, the persistence of predictable seasonal peaks suggests that malaria prevention and control strategies should be intensified in advance of high-transmission periods. Forecast-informed planning may support:

Strategic distribution of insecticide-treated nets (ITNs)

Targeted indoor residual spraying campaigns

Optimization of antimalarial drug stock management

Reinforcement of community awareness campaigns before seasonal peaks

Second, short-term forecasting tools based on SARIMA models can serve as early warning systems, assisting health authorities in anticipating increases in case burden and preventing healthcare system overload.

Third, the limited added predictive value of climatic covariates in this specific dataset highlights the importance of local epidemiological structure. While climate is biologically relevant, its predictive contribution may vary depending on spatial scale, data resolution, and transmission intensity.

Finally, integrating routine surveillance data with time series modelling approaches provides a cost-effective and scalable analytical framework for strengthening malaria monitoring systems in resource-constrained settings.

15.3 Limitations and Future Directions

Several limitations should be acknowledged:

The relatively short time horizon (three years) may limit detection of long-term structural changes.

The analysis is based on aggregated monthly data, which may mask intra-month variability.

Additional covariates such as vector density, intervention coverage, or socio-demographic factors were not incorporated.

Future research could extend this framework by:

  • Applying multilevel or spatial time series models

  • Integrating higher-resolution climatic or entomological data

  • Exploring Bayesian time series approaches

  • Developing real-time early warning dashboards

16 References

Brodersen, K. H., Gallusser, F., Koehler, J., Remy, N., & Scott, S. L. (2015). Inferring causal impact using Bayesian structural time-series models. Annals of Applied Statistics, 9(1), 247–274. https://doi.org/10.1214/14-AOAS788

Gaubatz, K. T. (2014). A Survivor’s Guide to R: An Introduction for the Uninitiated and the Unnerved. SAGE Publications.

Liboschik, T., Fokianos, K., & Fried, R. (2017). tscount: An R package for analysis of count time series following generalized linear models. Journal of Statistical Software, 82(1), 1–51. https://doi.org/10.18637/jss.v082.i05

Righetti, N. (2025). Time Series Analysis With R. Available at: https://nicolarighetti.github.io/Time-Series-Analysis-With-R/

Schaffer, A. L., Dobbins, T. A., & Pearson, S. A. (2021). Interrupted time series analysis using autoregressive integrated moving average (ARIMA) models: A guide for evaluating large-scale health interventions. BMC Medical Research Methodology, 21(1), 1–12. https://doi.org/10.1186/s12874-021-01306-2

Shin, Y. (2017). Time Series Analysis in the Social Sciences: The Fundamentals. University of California Press.

Wells, C., Shah, D. V., Pevehouse, J. C., Foley, J., Lukito, J., Pelled, A., & Yang, J. (2019). The temporal turn in communication research: Time series analyses using computational approaches. International Journal of Communication, 13, 19328036.

Zivot, E., & Wang, J. (2003). Unit root tests. In Modeling Financial Time Series with S-Plus® (pp. 111–139). Springer. https://doi.org/10.1007/978-0-387-21763-5_4

17 Citation

If you use this analysis, please cite:

Tchakondo, S. (2026).
Time Series Analysis of Confirmed Malaria Cases Using ARIMA, SARIMA and ARIMAX Models.
Zenodo. https://doi.org/10.5281/zenodo.18637204

Source code available at:
https://github.com/Samadou-Tchakondo/malaria-time-series-analysis