Predicting Disease Spread: ETS, ARIMA, and NNETAR Models

First Draft

Abstract

This study aims to predict the spread of dengue in two cities, San Juan and Iquitos, using three time series forecasting models: Exponential Smoothing State Space Model (ETS), Autoregressive Integrated Moving Average (ARIMA), and Neural Network Time Series Model (NNETAR). The goal is to forecast the number of dengue cases in these cities from historical data and to compare the models using Root Mean Square Error (RMSE) as the evaluation metric. The NNETAR model outperformed the ETS and ARIMA models on the validation set, providing more accurate predictions of dengue cases.

Introduction

Dengue fever is a significant public health concern, particularly in tropical and subtropical regions. Accurate forecasting of disease spread is critical for timely public health interventions. This study leverages time series forecasting models to predict the weekly number of dengue cases in San Juan, Puerto Rico, and Iquitos, Peru. By employing ETS, ARIMA, and NNETAR models, the research aims to identify the most effective approach for predicting disease spread.

# Load packages quietly without startup messages
suppressPackageStartupMessages({
  library(tidyverse)  # For data import, wrangling, and plotting
})

library(forecast)   # For forecasting models like ETS, ARIMA, and NNAR

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(tseries)    # For time series analysis (e.g., ADF test)

Data Description

The dataset provides various features on a (year, weekofyear) timescale for both San Juan and Iquitos. These features include:

• City and Date Indicators:
  o city: Abbreviations for the cities, with sj for San Juan and iq for Iquitos.
  o week_start_date: The starting date of the week in yyyy-mm-dd format.
• NOAA’s GHCN Daily Climate Data:
  o station_max_temp_c: Maximum temperature in degrees Celsius.
  o station_min_temp_c: Minimum temperature in degrees Celsius.
  o station_avg_temp_c: Average temperature in degrees Celsius.

Literature Review

Many modeling techniques have been considered for disease prediction, especially for vector-borne diseases such as dengue. The literature includes studies that use statistical methods, machine learning methods, or a combination of the two to predict disease prevalence from environmental and climate variables.

Time Series Models

# Read training and test data
train_data <- read_csv("dengue_features_train.csv", show_col_types = FALSE)
test_data <- read_csv("dengue_features_test.csv", show_col_types = FALSE)

# Read the labels data which contains the total_cases column
labels_data <- read_csv("dengue_labels_train.csv", show_col_types = FALSE)

# Merge the total_cases column into train_data
train_data <- train_data %>% 
  left_join(labels_data, by = c("city", "year", "weekofyear"))

# Convert date columns to Date type
train_data$week_start_date <- as.Date(train_data$week_start_date, format="%Y-%m-%d")
test_data$week_start_date <- as.Date(test_data$week_start_date, format="%Y-%m-%d")

head(train_data, 3)

# Split the training data into a training set and a validation set
validation_index <- floor(0.8 * nrow(train_data))
train_set <- train_data[1:validation_index, ]
validation_set <- train_data[(validation_index + 1):nrow(train_data), ]
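The split above takes the first 80% of rows from the stacked file, which holds both cities. As a hedged alternative, the sketch below shows how a chronological 80/20 split could be made within each city. The toy data frame and the helper name split_city are assumptions introduced for illustration; they are not part of the original analysis.

```r
# Toy stand-in for train_data: two cities stacked in one data frame.
# The values are made up for illustration only.
toy <- data.frame(
  city = rep(c("sj", "iq"), times = c(10, 5)),
  week_start_date = as.Date("2000-01-03") + 7 * c(0:9, 0:4),
  total_cases = c(4, 5, 4, 3, 6, 2, 4, 5, 10, 6, 0, 0, 1, 0, 2)
)

# Hypothetical helper: chronological split within a single city
split_city <- function(df, train_frac = 0.8) {
  df <- df[order(df$week_start_date), ]   # make sure rows are in time order
  cut <- floor(train_frac * nrow(df))     # last row of the training part
  list(train = df[seq_len(cut), ],
       validation = df[(cut + 1):nrow(df), ])
}

# Apply the split to each city separately
splits <- lapply(split(toy, toy$city), split_city)

nrow(splits$sj$train)       # 8
nrow(splits$sj$validation)  # 2
nrow(splits$iq$train)       # 4
nrow(splits$iq$validation)  # 1
```

With this layout, each city's validation weeks directly follow that city's training weeks, so a model could be fitted and scored per city.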

Methodology

The study employs three time series forecasting models: ETS, ARIMA, and NNETAR. The models are trained on the first 80% of the rows of the training data and evaluated on the remaining 20%, which serves as a hold-out validation set. Model performance is assessed by RMSE on that validation set.
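For reference, the RMSE used throughout is the standard definition. The symbols are introduced here for illustration: y_t is the observed total_cases in validation week t, \hat{y}_t is the model's forecast for that week, and n is the number of validation rows.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} \left( y_t - \hat{y}_t \right)^2}
```

Because the errors are squared before averaging, a few large misses during an outbreak weigh more heavily than many small ones.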

Exploratory Data Analysis

The Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots were generated to understand the underlying structure of the total_cases time series. These plots provided insights into the seasonality and potential autocorrelation in the data, which informed the selection of model parameters.
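As an illustration of how seasonality shows up in an ACF, the sketch below uses a synthetic weekly series, not the dengue data. It assumes a seasonal period of 52 weeks and declares it through ts(); the series, its parameters, and the variable names are all made up for the example.

```r
set.seed(1)

# Synthetic weekly series: ten years with one smooth annual cycle plus noise
n_weeks <- 520
weeks <- seq_len(n_weeks)
toy_cases <- 20 + 10 * sin(2 * pi * weeks / 52) + rnorm(n_weeks)

# Declare the assumed seasonal period so lags are reported in years
toy_ts <- ts(toy_cases, frequency = 52)

# Compute the ACF out to two years without drawing the plot
acf_out <- acf(toy_ts, lag.max = 104, plot = FALSE)

# Element 1 is lag 0, so element 53 is lag 52 weeks (one full cycle)
acf_at_one_year <- acf_out$acf[53]
# Element 27 is lag 26 weeks (half a cycle)
acf_at_half_year <- acf_out$acf[27]
```

A strongly positive value at one full cycle together with a negative value at half a cycle is the pattern that suggests annual seasonality. Dropping plot = FALSE draws the usual correlogram.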

# ACF and PACF plots
acf(train_set$total_cases, main="ACF of Dengue Cases")

pacf(train_set$total_cases, main="PACF of Dengue Cases")

Model Training

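ETS Model: The ETS model was fitted with ets(), which selects the error, trend, and seasonal components automatically by information criterion. The selected model, ETS(A,Ad,N), has additive errors, an additive damped trend, and no seasonal component.

ARIMA Model: The ARIMA model was fitted with auto.arima(), which selects the order of differencing and the autoregressive and moving average terms automatically. Seasonal terms were permitted, but the selected model, ARIMA(1,1,1), contains none.

NNETAR Model: The NNETAR model is a feed-forward neural network with 13 lagged inputs, 7 hidden nodes, and 1 output node. Its forecasts average 20 networks fitted from different random starting weights, which reduces sensitivity to initialization.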
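The size of the 13-7-1 network can be checked by hand. The sketch below counts the weights of a single-hidden-layer network with one bias term per hidden and output unit; the helper name count_weights is hypothetical and introduced only for this check.

```r
# Hypothetical helper: number of weights in a single-hidden-layer network,
# counting one bias term for every hidden unit and every output unit
count_weights <- function(n_inputs, n_hidden, n_outputs = 1) {
  (n_inputs + 1) * n_hidden + (n_hidden + 1) * n_outputs
}

# 13 lagged inputs, 7 hidden nodes, 1 output node
n_weights <- count_weights(13, 7)
n_weights  # 106
```

The result, (13 + 1) * 7 + (7 + 1) * 1 = 106, agrees with the "106 weights" reported in the printed NNAR(13,7) summary.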

# Fit models on the training set
ets_model <- ets(train_set$total_cases)
arima_model <- auto.arima(train_set$total_cases, seasonal = TRUE)
nnetar_model <- nnetar(train_set$total_cases)

# Print model summaries
print(ets_model)

## ETS(A,Ad,N) 
## 
## Call:
##  ets(y = train_set$total_cases) 
## 
##   Smoothing parameters:
##     alpha = 0.9999 
##     beta  = 0.1262 
##     phi   = 0.8 
## 
##   Initial states:
##     l = 3.6625 
##     b = 0.1095 
## 
##   sigma:  12.2779
## 
##      AIC     AICc      BIC 
## 14062.54 14062.61 14092.89

print(arima_model)

## Series: train_set$total_cases 
## ARIMA(1,1,1) 
## 
## Coefficients:
##          ar1      ma1
##       0.7263  -0.6182
## s.e.  0.0789   0.0890
## 
## sigma^2 = 150.3:  log likelihood = -4564.18
## AIC=9134.36   AICc=9134.38   BIC=9149.54

print(nnetar_model)

## Series: train_set$total_cases 
## Model:  NNAR(13,7) 
## Call:   nnetar(y = train_set$total_cases)
## 
## Average of 20 networks, each of which is
## a 13-7-1 network with 106 weights
## options were - linear output units 
## 
## sigma^2 estimated as 52.77

Results and Discussion

Model Performance

The models were evaluated on the validation set, and their RMSEs were as follows:

• ETS: 40.56
• ARIMA: 38.03
• NNETAR: 20.90

The NNETAR model had the lowest RMSE, indicating better predictive performance on the validation set than the ETS and ARIMA models.

# Forecast on the validation set
ets_forecast <- forecast(ets_model, h = nrow(validation_set))$mean
arima_forecast <- forecast(arima_model, h = nrow(validation_set))$mean
nnetar_forecast <- forecast(nnetar_model, h = nrow(validation_set))$mean

Forecast Error Analysis

The density plot of forecast errors showed that the NNETAR model had a tighter distribution around zero, suggesting more accurate predictions. The residuals plot further confirmed that the NNETAR model had the smallest residuals over time, while ETS and ARIMA exhibited larger and more variable residuals.

# Calculate RMSE for each model
ets_rmse <- sqrt(mean((validation_set$total_cases - ets_forecast)^2))
arima_rmse <- sqrt(mean((validation_set$total_cases - arima_forecast)^2))
nnetar_rmse <- sqrt(mean((validation_set$total_cases - nnetar_forecast)^2))

# Compare RMSEs
model_comparison <- data.frame(Model = c("ETS", "ARIMA", "NNETAR"),
                               RMSE = c(ets_rmse, arima_rmse, nnetar_rmse))

print(model_comparison)

##    Model     RMSE
## 1    ETS 40.56273
## 2  ARIMA 38.02876
## 3 NNETAR 20.90349

# RMSE comparison bar chart
ggplot(model_comparison, aes(x = Model, y = RMSE, fill = Model)) +
    geom_bar(stat = "identity") +
    labs(title = "RMSE Comparison of Models",
         x = "Model", y = "RMSE") +
    scale_fill_manual(values = c("ETS" = "blue", "ARIMA" = "green", "NNETAR" = "red")) +
    theme_minimal()

# Forecast error density plot
forecast_errors <- data.frame(
    Model = rep(c("ETS", "ARIMA", "NNETAR"), each = nrow(validation_set)),
    Error = c(validation_set$total_cases - ets_forecast,
              validation_set$total_cases - arima_forecast,
              validation_set$total_cases - nnetar_forecast)
)

ggplot(forecast_errors, aes(x = Error, fill = Model)) +
    geom_density(alpha = 0.5) +
    labs(title = "Density Plot of Forecast Errors",
         x = "Error", y = "Density") +
    scale_fill_manual(values = c("ETS" = "blue", "ARIMA" = "green", "NNETAR" = "red")) +
    theme_minimal()

Actual vs. Predicted Cases

A comparison of actual vs. predicted dengue cases showed that the NNETAR model followed the actual number of cases more closely, whereas ETS and ARIMA had larger deviations.

# Calculate residuals using forecasted values
residuals_df <- data.frame(
    Date = validation_set$week_start_date,
    ETS = validation_set$total_cases - as.numeric(ets_forecast),
    ARIMA = validation_set$total_cases - as.numeric(arima_forecast),
    NNETAR = validation_set$total_cases - as.numeric(nnetar_forecast)
)

# Melt the dataframe for easier plotting
residuals_melted <- residuals_df %>%
    pivot_longer(cols = c(ETS, ARIMA, NNETAR), names_to = "Model", values_to = "Residual")

# Plotting residuals
ggplot(residuals_melted, aes(x = Date, y = Residual, color = Model)) +
    geom_line() +
    labs(title = "Residuals of the Models",
         x = "Date", y = "Residual") +
    scale_color_manual(values = c("ETS" = "blue", "ARIMA" = "green", "NNETAR" = "red")) +
    theme_minimal()

# Create a dataframe for actual vs predicted values for each model
predictions_df <- data.frame(
    Date = validation_set$week_start_date,
    Actual = validation_set$total_cases,
    ETS = as.numeric(ets_forecast),
    ARIMA = as.numeric(arima_forecast),
    NNETAR = as.numeric(nnetar_forecast)
)

# Melt the dataframe for easier plotting
predictions_melted <- predictions_df %>%
    pivot_longer(cols = c(ETS, ARIMA, NNETAR), names_to = "Model", values_to = "Predicted")

# Plotting actual vs predicted values
ggplot(predictions_melted, aes(x = Date)) +
    geom_line(aes(y = Actual, color = "Actual")) +
    geom_line(aes(y = Predicted, color = Model)) +
    labs(title = "Actual vs Predicted Dengue Cases",
         x = "Date", y = "Total Cases") +
    scale_color_manual(values = c("Actual" = "black", "ETS" = "blue", "ARIMA" = "green", "NNETAR" = "red")) +
    theme_minimal()

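# Select the model with the lowest validation RMSE (NNETAR)
final_model <- nnetar_model

# Apply the fitted network to the full labelled series without re-estimating
# its weights, so that the forecast starts after the last training week
final_model <- nnetar(train_data$total_cases, model = final_model)

# Predict total_cases for the test data and round to integers as required
test_data$total_cases <- round(as.numeric(forecast(final_model, h = nrow(test_data))$mean))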
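Neural network forecasts are not constrained to be non-negative, so rounding alone may leave negative case counts. The sketch below shows one possible post-processing step, assuming that negative predictions should be set to zero; the input values are made up for illustration.

```r
# Made-up forecast values for illustration
raw_forecast <- c(12.4, 0.3, -1.7, 7.6, 30.49)

# Clamp at zero first, then round and store as integers
clean_forecast <- as.integer(round(pmax(raw_forecast, 0)))
clean_forecast  # 12 0 0 8 30
```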

Conclusion

The study concludes that the NNETAR model was the most effective of the three at predicting dengue cases in San Juan and Iquitos, with a validation RMSE of 20.90 compared with 40.56 for ETS and 38.03 for ARIMA. These findings matter for public health planning, as more accurate forecasts can support better allocation of resources and timely interventions in response to dengue outbreaks.

# Prepare the submission file
submission <- test_data %>% select(city, year, weekofyear, total_cases)
# Uncomment the line below to write the submission file
 #write.csv(submission, "submission8.csv", row.names = FALSE)

References

DrivenData. (2016). DengAI: Predicting Disease Spread. Retrieved [Month Day Year] from https://www.drivendata.org/competitions/44/dengai-predicting-disease-spread/.