This project explores different predictive models to best forecast the number of dengue fever cases occurring on a weekly basis in San Juan, Puerto Rico and Iquitos, Peru. Dengue fever is a virus carried by mosquitoes and is prevalent in tropical and subtropical regions of the world. The CDC estimates that half of the world’s population live in areas with a risk of dengue and it is a leading cause of illness in these regions. Recent research shows that climate change may cause distribution shifts to the virus and produce serious global health implications. Given the research behind dengue fever’s prevalence in particular climates, we can examine certain climate variables such as temperature, humidity, and precipitation, to better predict potential dengue outbreaks.
In this paper, I explore dynamic ARIMA and Neural Network models with climate variables as external regressors and compare their forecasting performance to benchmark models such as a NAIVE model and simple ARIMA and Neural Network models. The data for this project was collected from several U.S agencies including the Center for Disease Control and the National Oceanic and Atmospheric Administration. Models were trained on a training set with dengue case information along with environmental data for San Juan and Iquitos and then used to forecast internally missing data on a test set with only environmental data. This forecast was submitted to the DengAI: Predicting Disease Spread competition hosted by DrivenData.
# merge training sets
dengue_train <- merge(dengue_train_prelim, labels_train, by = c("city","year","weekofyear"))
# set up date variable
dengue_train$week_start_date <- as.Date(dengue_train$week_start_date)
dengue_test$week_start_date <- as.Date(dengue_test$week_start_date)
# Creating Tsibble for both data sets (train & test)
dengue_train = dengue_train %>%
mutate(Week = yearweek((week_start_date))) %>%
as_tsibble(index = Week, key = city)
dengue_test = dengue_test %>%
mutate(Week = yearweek((week_start_date))) %>%
as_tsibble(index = Week, key = city)
# split data sets by city
sj_train <- subset(dengue_train, dengue_train$city == "sj")
iq_train <- subset(dengue_train, dengue_train$city == "iq")
sj_test_real <- subset(dengue_test, dengue_test$city == "sj")
iq_test_real <- subset(dengue_test, dengue_test$city == "iq")
# check for missing rows
sj_train <- fill_gaps(sj_train, .full = TRUE)
iq_train<- fill_gaps(iq_train, .full = TRUE)
sj_test_real <- fill_gaps(sj_test_real, .full = TRUE)
iq_test_real <- fill_gaps(iq_test_real, .full = TRUE)
iq_train <- fill(iq_train, c( "ndvi_ne","ndvi_nw","ndvi_se", "ndvi_sw","precipitation_amt_mm","reanalysis_air_temp_k" , "reanalysis_avg_temp_k", "reanalysis_dew_point_temp_k","reanalysis_max_air_temp_k","reanalysis_min_air_temp_k", "reanalysis_precip_amt_kg_per_m2","reanalysis_relative_humidity_percent", "reanalysis_sat_precip_amt_mm", "reanalysis_specific_humidity_g_per_kg", "reanalysis_tdtr_k" , "station_avg_temp_c", "station_diur_temp_rng_c", "station_max_temp_c", "station_min_temp_c", "station_precip_mm", "total_cases"), .direction = 'down')
sj_train <- fill(sj_train, c( "ndvi_ne","ndvi_nw","ndvi_se", "ndvi_sw","precipitation_amt_mm","reanalysis_air_temp_k" , "reanalysis_avg_temp_k", "reanalysis_dew_point_temp_k","reanalysis_max_air_temp_k","reanalysis_min_air_temp_k", "reanalysis_precip_amt_kg_per_m2","reanalysis_relative_humidity_percent", "reanalysis_sat_precip_amt_mm", "reanalysis_specific_humidity_g_per_kg", "reanalysis_tdtr_k" , "station_avg_temp_c", "station_diur_temp_rng_c", "station_max_temp_c", "station_min_temp_c", "station_precip_mm", "total_cases"), .direction = 'down')
sj_test_real <- fill(sj_test_real, c( "ndvi_ne","ndvi_nw","ndvi_se", "ndvi_sw","precipitation_amt_mm","reanalysis_air_temp_k" , "reanalysis_avg_temp_k", "reanalysis_dew_point_temp_k","reanalysis_max_air_temp_k","reanalysis_min_air_temp_k", "reanalysis_precip_amt_kg_per_m2","reanalysis_relative_humidity_percent", "reanalysis_sat_precip_amt_mm", "reanalysis_specific_humidity_g_per_kg", "reanalysis_tdtr_k" , "station_avg_temp_c", "station_diur_temp_rng_c", "station_max_temp_c", "station_min_temp_c", "station_precip_mm"), .direction = 'down')
iq_test_real <- fill(iq_test_real, c( "ndvi_ne","ndvi_nw","ndvi_se", "ndvi_sw","precipitation_amt_mm","reanalysis_air_temp_k" , "reanalysis_avg_temp_k", "reanalysis_dew_point_temp_k","reanalysis_max_air_temp_k","reanalysis_min_air_temp_k", "reanalysis_precip_amt_kg_per_m2","reanalysis_relative_humidity_percent", "reanalysis_sat_precip_amt_mm", "reanalysis_specific_humidity_g_per_kg", "reanalysis_tdtr_k" , "station_avg_temp_c", "station_diur_temp_rng_c", "station_max_temp_c", "station_min_temp_c", "station_precip_mm"), .direction = 'down')
##=== Hold out data to test model
sj_training <- sj_train %>%
filter(year(Week) < 2005 )
sj_testing <- sj_train %>%
filter(year(Week) > 2004 )
iq_training <- iq_train %>%
filter(year(Week) < 2009 )
iq_testing <- iq_train %>%
filter(year(Week) > 2008 )
The training data for San Juan spans from 1990 to 1998, while the Iquitos data begins in 2000 and ends in 2010. Figure 1 shows the two time series plotted together. San Juan visibly has significantly more cases and higher outbreak peaks, which makes sense given the population size of San Juan compared to Iquitos, 2,389,000 versus 386,000 respectively in 2000. Total cases will need to be transformed to a scalable metric in order for the model to be applied across cities. For this project, I trained and tested my models on datasets broken out by city and then combined the results so this transformation was not necessary.
Next, I looked at a decomposition of the data from each city (Figure 2 and Figure 3). Both cities appear to have slight changes in the seasonality over the training time period. While San Juan exhibits a few large outbreaks in the 1990s, it appears that the baseline trend is consistently low. In Iquitos, on the other hand, the overall trend is positive, meaning outbreaks may not be as severe but the number of consistent cases is growing. Lastly, I look at seasonal plots (Figure 4 and Figure 5), which show a higher rate of cases between October and January in both cities.
dengue_train %>%
autoplot(total_cases) +
labs(x = "Week", y = "Total Cases",
title = "Figure 1: Dengue Fever Cases by Week") +
guides(color = guide_legend(title = "City"))
sj_train %>%
model(STL(total_cases + 1)) %>%
components() %>%
autoplot() + labs(title = "Figure 2: STL Decomposition, San Juan")
iq_train %>%
model(STL(total_cases + 1)) %>%
components() %>%
autoplot() + labs(title = "Figure 3: STL Decomposition, Iquitos")
sj_training %>% gg_season(total_cases) +
labs(title = "Figure 4: Seasonality of Dengue Cases, San Juan") +
xlab("Weeks") +
ylab("Total Cases")
iq_training %>% gg_season(total_cases) +
labs(title = "Figure 5: Seasonality of Dengue Cases, Iquitos") +
xlab("Weeks") +
ylab("Total Cases")
To begin, I built a simple seasonal NAIVE model as a benchmark comparison point for my dynamic models. This forecast uses a random walk, in which the forecasted value is equal to the value from the same period in the prior year. The simplicity of the model is valuable given the ease of data collection. However, it only considers the latest year’s observations, so will neglect any trends or cycles seen over multiple years. It also ignores any known information about the causal relationship between dengue fever and climate variables.
Secondly, I explored two ARIMA models. The first is a simple ARIMA(1, 1, 2) model auto calculated with the fable() package utilizing the Hyndmand-Khandakar algorithm in R. This non-seasonal model has 1 autoregressive term, 1 degree of differencing for stationarity, and 2 lagged forecast errors. I compared this against a more complex ARIMA model with external regressors, which included climate variables and a fourier term (k=10). The climate variables were determined by looking at the Pearson’s correlation coefficient for each variable included in the features training set. I selected variables with a coefficient over 0.10, which included: air temperature, average temperature, dew point temperature, min and max air temperature, humidity (g per kg), and station min and max temperature. Although variables had different relative influence on cases in Iquitos than in San Juan, removing less significant variables in Iquitos did not produce a better model so they were left in.
Lastly, I created two Neural Network models. Again, I started with a simple model using the NNETAR() forecasting function in R, which produced NNAR(14,1,8)[52]. I compared this to a neural network model that included the following external regressors: dew point, min and max air temperature, and humidity (g per kg). All of the above models were trained on separate data sets for San Juan and Iquitos and tested on hold out sets from the initial training set provided.
# Correlation matrix shows variables with highest correlation to total cases.
# In San Juan: air temp, avg. temp, dew point temp, max air temp, min air temp, humidity g per kg, station avg. temp, station min and max temp.
# In Iquitos: dew point temp, min air temp, precipitation amount, relative humidity, humidity g per kg, tdtr, station avg. temp, and station min temp.
sj_train_temp <- sj_train[,-c(1:4)]
sj_train_temp <- sj_train_temp[,-c(22)]
corr_matrix_sj <- cor(sj_train_temp, use = 'complete.obs', method = 'pearson')
iq_train_temp <- iq_train[,-c(1:4)]
iq_train_temp <- iq_train_temp[,-c(22)]
corr_matrix_iq <- cor(iq_train_temp, use = 'complete.obs', method = 'pearson')
# San Juan
m2_sj <- sj_training %>% model(ARIMA(total_cases))
fcst_arimabasic <- m2_sj %>% forecast(sj_testing)
m2_sj_accuracy <- accuracy(fcst_arimabasic , sj_testing)
# Iquitos
m2_iq <- iq_training %>% model(ARIMA(total_cases))
fcst_arimabasic <- m2_iq %>% forecast(iq_testing)
m2_iq_accuracy <- accuracy(fcst_arimabasic , iq_testing)
m1_sj <- sj_training %>%
model(ARIMA(total_cases ~ reanalysis_air_temp_k + reanalysis_avg_temp_k + reanalysis_dew_point_temp_k + reanalysis_max_air_temp_k + reanalysis_min_air_temp_k + reanalysis_specific_humidity_g_per_kg + station_avg_temp_c + station_min_temp_c + station_max_temp_c + fourier(K=10)))
ARIMA_fcst <- m1_sj %>% forecast(sj_testing)
m1_sj_accuracy <- accuracy(ARIMA_fcst, sj_testing)
m1_sj %>% gg_tsresiduals(lag_max = 52) + labs(title = "ARIMA - San Juan")
ARIMA_fcst %>% autoplot(sj_testing) +
labs(title = "ARIMA model - San Juan") +
xlab("Week") +
ylab("Total Cases")
m1_iq <- iq_training %>%
model(ARIMA(total_cases ~ reanalysis_air_temp_k + reanalysis_avg_temp_k + reanalysis_dew_point_temp_k + reanalysis_max_air_temp_k + reanalysis_min_air_temp_k + reanalysis_specific_humidity_g_per_kg + station_avg_temp_c + station_min_temp_c + station_max_temp_c + fourier(K=10)))
ARIMA_fcst <- m1_iq %>% forecast(iq_testing)
m1_iq_accuracy <- accuracy(ARIMA_fcst, iq_testing)
m1_iq %>% gg_tsresiduals(lag_max = 52) + labs(title = "ARIMA - Iquitos")
ARIMA_fcst %>% autoplot(iq_testing) +
labs(title = "ARIMA model - Iquitos") +
xlab("Week") +
ylab("Total Cases")
# San Juan
m4_sj <- sj_training %>% model(NNETAR(total_cases))
neuralnet_fcstbasic <- m4_sj %>% forecast(sj_testing, times=10, scale = TRUE)
m4_sj_accuracy <- accuracy(neuralnet_fcstbasic, sj_testing)
# Iquitos
m4_iq <- iq_training %>% model(NNETAR(total_cases))
neuralnet_fcstbasic <- m4_iq %>% forecast(iq_testing, times=10, scale = TRUE)
m4_iq_accuracy <- accuracy(neuralnet_fcstbasic, iq_testing)
m3_sj <- sj_training %>% model(NNETAR(total_cases ~ reanalysis_dew_point_temp_k + reanalysis_min_air_temp_k + reanalysis_max_air_temp_k + reanalysis_specific_humidity_g_per_kg))
neuralnet_fcst <- m3_sj %>% forecast(sj_testing, times=100, scale = TRUE)
m3_sj_accuracy <- accuracy(neuralnet_fcst, sj_testing)
m3_sj %>% gg_tsresiduals(lag_max = 52) + labs(title = "Neural Network - San Juan")
neuralnet_fcst %>% autoplot(sj_testing) +
labs(title = "Neural Net model - San Juan") +
xlab("Week") +
ylab("Total Cases")
m3_iq <- iq_training %>% model(NNETAR(total_cases ~ reanalysis_dew_point_temp_k + reanalysis_min_air_temp_k + reanalysis_max_air_temp_k + reanalysis_specific_humidity_g_per_kg))
neuralnet_fcst <- m3_iq %>% forecast(iq_testing, times=100, scale = TRUE)
m3_iq_accuracy <- accuracy(neuralnet_fcst, iq_testing)
neuralnet_fcst %>% autoplot(iq_testing) +
labs(title = "Neural Net model - Iquitos") +
xlab("Week") +
ylab("Total Cases")
m3_iq %>% gg_tsresiduals(lag_max = 52) + labs(title = "Neural Network - Iquitos")
m5_sj <- sj_training %>% model(SNAIVE(total_cases))
SNAIVE_fcst <- m5_sj %>% forecast(sj_testing)
m5_sj_accuracy <- accuracy(SNAIVE_fcst, sj_testing)
SNAIVE_fcst %>% autoplot(sj_testing) +
labs(title = "Naive Model- San Juan") +
xlab("Week") +
ylab("Total Cases")
m5_iq <- iq_training %>% model(SNAIVE(total_cases))
SNAIVE_fcst <- m5_iq %>% forecast(iq_testing)
m5_iq_accuracy <- accuracy(SNAIVE_fcst, iq_testing)
SNAIVE_fcst %>% autoplot(iq_testing) +
labs(title = "Naive Model - Iquitos") +
xlab("Week") +
ylab("Total Cases")
Models were tested on hold out data from the training set provided by Driven Data.
sj_accuracy_metrics = rbind(m1_sj_accuracy, m2_sj_accuracy, m3_sj_accuracy, m4_sj_accuracy, m5_sj_accuracy)
sj_accuracy_metrics %>% arrange(MAE)
## # A tibble: 5 × 11
## .model city .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 "SNAIVE(total_ca… sj Test 14.2 34.1 18.2 -Inf Inf NaN NaN 0.907
## 2 "ARIMA(total_cas… sj Test -7.78 25.9 18.9 -Inf Inf NaN NaN 0.891
## 3 "ARIMA(total_cas… sj Test 15.9 36.0 19.4 -Inf Inf NaN NaN 0.934
## 4 "NNETAR(total_ca… sj Test -27.5 46.8 38.9 -Inf Inf NaN NaN 0.948
## 5 "NNETAR(total_ca… sj Test -34.2 54.5 40.5 -Inf Inf NaN NaN 0.950
iq_accuracy_metrics = rbind(m1_iq_accuracy, m2_iq_accuracy, m3_iq_accuracy, m4_iq_accuracy,
m5_iq_accuracy)
iq_accuracy_metrics %>% arrange(MAE)
## # A tibble: 5 × 11
## .model city .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 "NNETAR(total_ca… iq Test -0.145 7.79 5.77 -Inf Inf NaN NaN 0.834
## 2 "ARIMA(total_cas… iq Test -3.18 8.11 6.74 -Inf Inf NaN NaN 0.840
## 3 "ARIMA(total_cas… iq Test 5.59 9.10 7.12 NaN Inf NaN NaN 0.748
## 4 "NNETAR(total_ca… iq Test -4.28 10.7 8.20 -Inf Inf NaN NaN 0.877
## 5 "SNAIVE(total_ca… iq Test -7.85 17.3 10.7 -Inf Inf NaN NaN 0.759
ARIMA_sj_model_final <- sj_train %>%
model(ARIMA(total_cases ~ reanalysis_air_temp_k + reanalysis_avg_temp_k + reanalysis_dew_point_temp_k + reanalysis_max_air_temp_k + reanalysis_min_air_temp_k + reanalysis_specific_humidity_g_per_kg + station_avg_temp_c + station_min_temp_c + station_max_temp_c + fourier(K=10)))
ARIMA_iq_model_final <- iq_train %>%
model(ARIMA(total_cases ~ reanalysis_air_temp_k + reanalysis_avg_temp_k + reanalysis_dew_point_temp_k + reanalysis_max_air_temp_k + reanalysis_min_air_temp_k + reanalysis_specific_humidity_g_per_kg + station_avg_temp_c + station_min_temp_c + station_max_temp_c + fourier(K=10)))
ARIMA_sj_fcst <- ARIMA_sj_model_final %>% forecast(sj_test_real)
ARIMA_iq_fcst <- ARIMA_iq_model_final %>% forecast(iq_test_real)
sj_cases <- ARIMA_sj_fcst %>% subset(, c("city","year", "weekofyear", ".mean"))
iq_cases <- ARIMA_iq_fcst %>% subset(, c("city","year", "weekofyear", ".mean"))
submission <- rbind(sj_cases, iq_cases)
submission <- rename(submission, total_cases = .mean)
submission$total_cases <- round(submission$total_cases, 0)
submission <- na.omit(submission)
write.csv(submission, file = 'DengAIsubmissionARIMA1.csv', row.names = FALSE)
ARIMA_sj_fcst %>% autoplot(sj_train) +
labs(title = "ARIMA Model - San Juan") +
xlab("Week") +
ylab("Total Cases")
ARIMA_iq_fcst %>% autoplot(iq_train) +
labs(title = "ARIMA Model - Iquitos") +
xlab("Week") +
ylab("Total Cases")
NN_sj_model_final <- sj_train %>% model(NNETAR(total_cases ~ reanalysis_dew_point_temp_k + reanalysis_min_air_temp_k + reanalysis_max_air_temp_k + reanalysis_specific_humidity_g_per_kg))
NN_iq_model_final<- iq_train %>% model(NNETAR(total_cases ~ reanalysis_dew_point_temp_k + reanalysis_min_air_temp_k + reanalysis_max_air_temp_k + reanalysis_specific_humidity_g_per_kg))
NN_sj_fcst <- NN_sj_model_final %>% forecast(sj_test_real, time = 100)
NN_iq_fcst <- NN_iq_model_final %>% forecast(iq_test_real, time = 100)
sj_cases <- NN_sj_fcst %>% subset(, c("city","year", "weekofyear", ".mean"))
iq_cases <- NN_iq_fcst %>% subset(, c("city","year", "weekofyear", ".mean"))
submission <- rbind(sj_cases, iq_cases)
submission <- rename(submission, total_cases = .mean)
submission$total_cases <- round(submission$total_cases, 0)
submission <- na.omit(submission)
write.csv(submission, file = 'DengAIsubmission_NN1.csv', row.names = FALSE)
NN_sj_fcst %>% autoplot(sj_train) +
labs(title = "Neural Network Model - San Juan") +
xlab("Week") +
ylab("Total Cases")
NN_iq_fcst %>% autoplot(iq_train) +
labs(title = "Neural Network Model - Iquitos") +
xlab("Week") +
ylab("Total Cases")
SNAIVE_sj_model_final <- sj_training %>% model(SNAIVE(total_cases))
SNAIVE_iq_model_final <- iq_training %>% model(SNAIVE(total_cases))
SNAIVE_sj_fcst <- SNAIVE_sj_model_final %>% forecast(sj_test_real)
SNAIVE_iq_fcst <- SNAIVE_iq_model_final %>% forecast(iq_test_real)
sj_cases <- SNAIVE_sj_fcst %>% subset(, c("city","year", "weekofyear", ".mean"))
iq_cases <- SNAIVE_iq_fcst %>% subset(, c("city","year", "weekofyear", ".mean"))
submission <- rbind(sj_cases, iq_cases)
submission <- rename(submission, total_cases = .mean)
submission$total_cases <- round(submission$total_cases, 0)
submission <- na.omit(submission)
write.csv(submission, file = 'DengAIsubmission_SNAIVE1.csv', row.names = FALSE)
SNAIVE_sj_fcst %>% autoplot(sj_train) +
labs(title = "SNAIVE - San Juan") +
xlab("Week") +
ylab("Total Cases")
SNAIVE_iq_fcst %>% autoplot(iq_train) +
labs(title = "SNAIVE- Iquitos") +
xlab("Week") +
ylab("Total Cases")
The final step of this project was to test the models on test data and submit the forecast results to the DengAI: Predicting Disease Spread competition hosted by Driven Data. I submitted forecasts produced by the dynamic ARIMA and Neural Network models as well as my benchmark SNAIVE model. Similar to my internal results, the ARIMA model scored the best with an MAE score of 25.6. Disappointingly, this is higher than the internal MAE scores produced when the model was trained on a subset of the training data and tested on hold out data. Another dynamic ARIMA attempt was made using lagged climate variables as suggested in Kakarla et al. (2022), however this increased the MAE score to 31.9.
The Neural Network model performed similarly to the tested model, with an MAE score of 27.1. Since internally it performed worse on the San Juan, the larger data set, the MAE will be biased in that direction when tested on the full data set. Lastly, the Seasonal Naive model produced an MAE score of 35.7. This was more surprising as it had produced the best results in my internal evaluation on San Juan data.
Results can be found on the DengAI: Predicting Disease Spread competition page of the DrivenData website, under username martharose. My best performing model ranked #1588 as of Aug. 18, 2023.
There are several limitations to the models as they are currently built. The primary concern is the applicability of the model to predict dengue cases in different locations. For the purposes of the project, I was interested in seeing how the model performance compared in different locations, so split the training and test sets into different time series by location and built models for San Juan and Iquitos. I then combined the forecasts into one submission file. If the model was to be used for multiple cities at once, the dependent variable total cases would need to be scaled accordingly, either by creating a case rate (i.e. cases for every 100,000 people) or using a logarithmic transformation to look at net percent change in outcome.
Another limitation of the model was created by the nature of the problem posed by competition. Since it was asking for a forecast of internally missing data, I had access to exogenous climate data for the time period that I was forecasting. In most forecasting scenarios, this would not be the case if you are building a forward looking forecast for a period of time in the future. External regressors would have to be a lagged value already known to us at the time of forecasting. While I did produce an example of this with an ARIMA model, the results did not perform as well as the model trained with current climate variables. Lastly, in future iterations of this project, I would like to explore additional feature engineering to explore climate data with stronger correlation to dengue cases.
Understanding the relationship between climate variables and dengue outbreaks is a valuable tool that can be used to predict future outbreaks and movement of the virus globally. Accurate forecasting will allow for improved resource allocation to combat dengue fever and future research to aid in prevention. Prediction models require data consolidation from multiple sources to get demographic, weather, and environment data as well as information on case count over regular intervals. The modeling process itself should always begin with proper analysis of the time series to understand observed seasonality and trends. Additionally, starting with basic models is beneficial in setting a benchmark to ensure more dynamic and complex models actually improve forecast performance. In this case
Dengue. Centers for Disease Control and Prevention. Aug. 15, 2023. https://www.cdc.gov/dengue/index.html
Kakarla, S.G., Kondeti, P.K., Vavilala, H.P. et al. Weather integrated multiple machine learning models for prediction of dengue prevalence in India. Int J Biometeorol 67, 285–297 (2023). https://doi.org/10.1007/s00484-022-02405-z
Nayak, M.S., Narayan, K.A. Forecasting Dengue Fever Incidence Using ARIMA Analysis. International Journal of Collaborative Research on Internal Medicine and Public Health. Vol. 11, Issue 6. (2019).