Introduction

This project explores different predictive models to best forecast the number of dengue fever cases occurring on a weekly basis in San Juan, Puerto Rico and Iquitos, Peru. Dengue fever is a virus carried by mosquitoes and is prevalent in tropical and subtropical regions of the world. The CDC estimates that half of the world’s population live in areas with a risk of dengue and it is a leading cause of illness in these regions. Recent research shows that climate change may cause distribution shifts to the virus and produce serious global health implications. Given the research behind dengue fever’s prevalence in particular climates, we can examine certain climate variables such as temperature, humidity, and precipitation, to better predict potential dengue outbreaks.

In this paper, I explore dynamic ARIMA and Neural Network models with climate variables as external regressors and compare their forecasting performance to benchmark models such as a NAIVE model and simple ARIMA and Neural Network models. The data for this project was collected from several U.S agencies including the Center for Disease Control and the National Oceanic and Atmospheric Administration. Models were trained on a training set with dengue case information along with environmental data for San Juan and Iquitos and then used to forecast internally missing data on a test set with only environmental data. This forecast was submitted to the DengAI: Predicting Disease Spread competition hosted by DrivenData.

Data Manipulation

# merge training sets
dengue_train <- merge(dengue_train_prelim, labels_train, by = c("city","year","weekofyear"))

# set up date variable
dengue_train$week_start_date <- as.Date(dengue_train$week_start_date)
dengue_test$week_start_date <- as.Date(dengue_test$week_start_date)


# Creating Tsibble for both data sets (train & test)

dengue_train = dengue_train %>% 
  mutate(Week = yearweek((week_start_date))) %>% 
  as_tsibble(index = Week, key = city)

dengue_test = dengue_test %>% 
  mutate(Week = yearweek((week_start_date))) %>% 
  as_tsibble(index = Week, key = city)


# split data sets by city
sj_train <- subset(dengue_train, dengue_train$city == "sj")
iq_train <- subset(dengue_train, dengue_train$city == "iq")

sj_test_real <- subset(dengue_test, dengue_test$city == "sj")
iq_test_real <- subset(dengue_test, dengue_test$city == "iq")

# check for missing rows
 sj_train <- fill_gaps(sj_train, .full = TRUE)
 iq_train<- fill_gaps(iq_train, .full = TRUE)
 sj_test_real <- fill_gaps(sj_test_real, .full = TRUE)
 iq_test_real <- fill_gaps(iq_test_real, .full = TRUE)


iq_train <- fill(iq_train, c( "ndvi_ne","ndvi_nw","ndvi_se", "ndvi_sw","precipitation_amt_mm","reanalysis_air_temp_k" , "reanalysis_avg_temp_k", "reanalysis_dew_point_temp_k","reanalysis_max_air_temp_k","reanalysis_min_air_temp_k", "reanalysis_precip_amt_kg_per_m2","reanalysis_relative_humidity_percent", "reanalysis_sat_precip_amt_mm", "reanalysis_specific_humidity_g_per_kg", "reanalysis_tdtr_k" , "station_avg_temp_c", "station_diur_temp_rng_c", "station_max_temp_c", "station_min_temp_c", "station_precip_mm", "total_cases"), .direction = 'down')

sj_train <- fill(sj_train, c( "ndvi_ne","ndvi_nw","ndvi_se", "ndvi_sw","precipitation_amt_mm","reanalysis_air_temp_k" , "reanalysis_avg_temp_k", "reanalysis_dew_point_temp_k","reanalysis_max_air_temp_k","reanalysis_min_air_temp_k", "reanalysis_precip_amt_kg_per_m2","reanalysis_relative_humidity_percent", "reanalysis_sat_precip_amt_mm", "reanalysis_specific_humidity_g_per_kg", "reanalysis_tdtr_k" , "station_avg_temp_c", "station_diur_temp_rng_c", "station_max_temp_c", "station_min_temp_c", "station_precip_mm", "total_cases"), .direction = 'down')

sj_test_real <- fill(sj_test_real, c( "ndvi_ne","ndvi_nw","ndvi_se", "ndvi_sw","precipitation_amt_mm","reanalysis_air_temp_k" , "reanalysis_avg_temp_k", "reanalysis_dew_point_temp_k","reanalysis_max_air_temp_k","reanalysis_min_air_temp_k", "reanalysis_precip_amt_kg_per_m2","reanalysis_relative_humidity_percent", "reanalysis_sat_precip_amt_mm", "reanalysis_specific_humidity_g_per_kg", "reanalysis_tdtr_k" , "station_avg_temp_c", "station_diur_temp_rng_c", "station_max_temp_c", "station_min_temp_c", "station_precip_mm"), .direction = 'down')

iq_test_real <- fill(iq_test_real, c( "ndvi_ne","ndvi_nw","ndvi_se", "ndvi_sw","precipitation_amt_mm","reanalysis_air_temp_k" , "reanalysis_avg_temp_k", "reanalysis_dew_point_temp_k","reanalysis_max_air_temp_k","reanalysis_min_air_temp_k", "reanalysis_precip_amt_kg_per_m2","reanalysis_relative_humidity_percent", "reanalysis_sat_precip_amt_mm", "reanalysis_specific_humidity_g_per_kg", "reanalysis_tdtr_k" , "station_avg_temp_c", "station_diur_temp_rng_c", "station_max_temp_c", "station_min_temp_c", "station_precip_mm"), .direction = 'down')


##=== Hold out data to test model 

sj_training <- sj_train  %>% 
  filter(year(Week) < 2005 )

sj_testing <- sj_train  %>% 
  filter(year(Week) > 2004 )

iq_training <- iq_train  %>% 
  filter(year(Week) < 2009 )

iq_testing <- iq_train  %>% 
  filter(year(Week) > 2008 )

Exploratory Data Analysis

The training data for San Juan spans from 1990 to 1998, while the Iquitos data begins in 2000 and ends in 2010. Figure 1 shows the two time series plotted together. San Juan visibly has significantly more cases and higher outbreak peaks, which makes sense given the population size of San Juan compared to Iquitos, 2,389,000 versus 386,000 respectively in 2000. Total cases will need to be transformed to a scalable metric in order for the model to be applied across cities. For this project, I trained and tested my models on datasets broken out by city and then combined the results so this transformation was not necessary.

Next, I looked at a decomposition of the data from each city (Figure 2 and Figure 3). Both cities appear to have slight changes in the seasonality over the training time period. While San Juan exhibits a few large outbreaks in the 1990s, it appears that the baseline trend is consistently low. In Iquitos, on the other hand, the overall trend is positive, meaning outbreaks may not be as severe but the number of consistent cases is growing. Lastly, I look at seasonal plots (Figure 4 and Figure 5), which show a higher rate of cases between October and January in both cities.

dengue_train %>% 
  autoplot(total_cases) +
  labs(x = "Week", y = "Total Cases",
       title = "Figure 1: Dengue Fever Cases by Week") +
  guides(color = guide_legend(title = "City"))

sj_train %>% 
  model(STL(total_cases + 1)) %>% 
  components() %>% 
  autoplot() + labs(title = "Figure 2: STL Decomposition, San Juan")

iq_train %>% 
  model(STL(total_cases + 1)) %>% 
  components() %>% 
  autoplot() + labs(title = "Figure 3: STL Decomposition, Iquitos")

sj_training  %>% gg_season(total_cases) + 
  labs(title = "Figure 4: Seasonality of Dengue Cases, San Juan") +
  xlab("Weeks") +
  ylab("Total Cases")

iq_training  %>% gg_season(total_cases) + 
  labs(title = "Figure 5: Seasonality of Dengue Cases, Iquitos") +
  xlab("Weeks") +
  ylab("Total Cases")

Methods

To begin, I built a simple seasonal NAIVE model as a benchmark comparison point for my dynamic models. This forecast uses a random walk, in which the forecasted value is equal to the value from the same period in the prior year. The simplicity of the model is valuable given the ease of data collection. However, it only considers the latest year’s observations, so will neglect any trends or cycles seen over multiple years. It also ignores any known information about the causal relationship between dengue fever and climate variables.

Secondly, I explored two ARIMA models. The first is a simple ARIMA(1, 1, 2) model auto calculated with the fable() package utilizing the Hyndmand-Khandakar algorithm in R. This non-seasonal model has 1 autoregressive term, 1 degree of differencing for stationarity, and 2 lagged forecast errors. I compared this against a more complex ARIMA model with external regressors, which included climate variables and a fourier term (k=10). The climate variables were determined by looking at the Pearson’s correlation coefficient for each variable included in the features training set. I selected variables with a coefficient over 0.10, which included: air temperature, average temperature, dew point temperature, min and max air temperature, humidity (g per kg), and station min and max temperature. Although variables had different relative influence on cases in Iquitos than in San Juan, removing less significant variables in Iquitos did not produce a better model so they were left in.

Lastly, I created two Neural Network models. Again, I started with a simple model using the NNETAR() forecasting function in R, which produced NNAR(14,1,8)[52]. I compared this to a neural network model that included the following external regressors: dew point, min and max air temperature, and humidity (g per kg). All of the above models were trained on separate data sets for San Juan and Iquitos and tested on hold out sets from the initial training set provided.

# Correlation matrix shows variables with highest correlation to total cases.

# In San Juan: air temp, avg. temp, dew point temp, max air temp, min air temp, humidity g per kg, station avg. temp, station min and max temp.

# In Iquitos: dew point temp, min air temp, precipitation amount, relative humidity, humidity g per kg, tdtr, station avg. temp, and station min temp.

sj_train_temp <- sj_train[,-c(1:4)]
sj_train_temp <- sj_train_temp[,-c(22)]
corr_matrix_sj <- cor(sj_train_temp, use = 'complete.obs', method = 'pearson')

iq_train_temp <- iq_train[,-c(1:4)]
iq_train_temp <- iq_train_temp[,-c(22)]
corr_matrix_iq <- cor(iq_train_temp, use = 'complete.obs', method = 'pearson')

Model Formulation

ARIMA Models

Basic ARIMA models

# San Juan
m2_sj <- sj_training %>%  model(ARIMA(total_cases))
fcst_arimabasic <- m2_sj %>% forecast(sj_testing)
m2_sj_accuracy <- accuracy(fcst_arimabasic , sj_testing)

# Iquitos 
m2_iq <- iq_training %>%  model(ARIMA(total_cases))
fcst_arimabasic <- m2_iq %>% forecast(iq_testing)
m2_iq_accuracy <- accuracy(fcst_arimabasic , iq_testing)

ARIMA with external regressors - San Juan

m1_sj <- sj_training %>% 
  model(ARIMA(total_cases ~ reanalysis_air_temp_k + reanalysis_avg_temp_k + reanalysis_dew_point_temp_k + reanalysis_max_air_temp_k + reanalysis_min_air_temp_k + reanalysis_specific_humidity_g_per_kg  + station_avg_temp_c + station_min_temp_c + station_max_temp_c + fourier(K=10)))

ARIMA_fcst <- m1_sj %>% forecast(sj_testing)
m1_sj_accuracy <- accuracy(ARIMA_fcst, sj_testing)

m1_sj %>% gg_tsresiduals(lag_max = 52) + labs(title = "ARIMA - San Juan")

ARIMA_fcst %>% autoplot(sj_testing) + 
  labs(title = "ARIMA model - San Juan") +
  xlab("Week") +
  ylab("Total Cases")

ARIMA with external regressors - Iquitos

m1_iq <- iq_training %>% 
  model(ARIMA(total_cases ~ reanalysis_air_temp_k + reanalysis_avg_temp_k + reanalysis_dew_point_temp_k + reanalysis_max_air_temp_k + reanalysis_min_air_temp_k + reanalysis_specific_humidity_g_per_kg  + station_avg_temp_c + station_min_temp_c + station_max_temp_c + fourier(K=10)))

ARIMA_fcst <- m1_iq %>% forecast(iq_testing)
m1_iq_accuracy <- accuracy(ARIMA_fcst, iq_testing)

m1_iq %>% gg_tsresiduals(lag_max = 52) + labs(title = "ARIMA - Iquitos")

ARIMA_fcst %>% autoplot(iq_testing) + 
  labs(title = "ARIMA model - Iquitos") +
  xlab("Week") +
  ylab("Total Cases")

Neural Net Models

Basic Neural Net Models

# San Juan
m4_sj <- sj_training %>% model(NNETAR(total_cases))
neuralnet_fcstbasic <- m4_sj %>% forecast(sj_testing, times=10, scale = TRUE)
m4_sj_accuracy <- accuracy(neuralnet_fcstbasic, sj_testing)

# Iquitos
m4_iq <- iq_training %>% model(NNETAR(total_cases))
neuralnet_fcstbasic <- m4_iq %>% forecast(iq_testing, times=10, scale = TRUE)
m4_iq_accuracy <- accuracy(neuralnet_fcstbasic, iq_testing)

Neural Net Model with regressors- San Juan

m3_sj <- sj_training %>% model(NNETAR(total_cases ~ reanalysis_dew_point_temp_k + reanalysis_min_air_temp_k + reanalysis_max_air_temp_k + reanalysis_specific_humidity_g_per_kg))

neuralnet_fcst <- m3_sj %>% forecast(sj_testing, times=100, scale = TRUE)
m3_sj_accuracy <- accuracy(neuralnet_fcst, sj_testing)

m3_sj %>% gg_tsresiduals(lag_max = 52) + labs(title = "Neural Network - San Juan")

neuralnet_fcst %>% autoplot(sj_testing) + 
  labs(title = "Neural Net model - San Juan") +
  xlab("Week") +
  ylab("Total Cases")

Neural Net Model - Iquitos

 m3_iq <- iq_training %>% model(NNETAR(total_cases ~ reanalysis_dew_point_temp_k + reanalysis_min_air_temp_k + reanalysis_max_air_temp_k + reanalysis_specific_humidity_g_per_kg))

neuralnet_fcst <- m3_iq %>% forecast(iq_testing, times=100, scale = TRUE)
m3_iq_accuracy <- accuracy(neuralnet_fcst, iq_testing)

neuralnet_fcst %>% autoplot(iq_testing) +
  labs(title = "Neural Net model - Iquitos") +
  xlab("Week") +
  ylab("Total Cases")

m3_iq %>% gg_tsresiduals(lag_max = 52) + labs(title = "Neural Network - Iquitos")

Seasonal NAIVE Models

NAIVE model - San Juan

m5_sj <- sj_training %>%  model(SNAIVE(total_cases))
SNAIVE_fcst <- m5_sj %>% forecast(sj_testing)
m5_sj_accuracy <- accuracy(SNAIVE_fcst, sj_testing)

SNAIVE_fcst %>% autoplot(sj_testing) + 
  labs(title = "Naive Model- San Juan") +
  xlab("Week") +
  ylab("Total Cases")

NAIVE model - Iquitos

m5_iq <- iq_training %>% model(SNAIVE(total_cases))
SNAIVE_fcst <- m5_iq %>% forecast(iq_testing)
m5_iq_accuracy <- accuracy(SNAIVE_fcst, iq_testing)

SNAIVE_fcst %>% autoplot(iq_testing) + 
  labs(title = "Naive Model - Iquitos") +
  xlab("Week") +
  ylab("Total Cases")

Accuracy Metrics

Models were tested on hold out data from the training set provided by Driven Data.

Accuracy Metrics for San Juan Models

sj_accuracy_metrics = rbind(m1_sj_accuracy, m2_sj_accuracy, m3_sj_accuracy, m4_sj_accuracy,  m5_sj_accuracy)

sj_accuracy_metrics %>% arrange(MAE)

## # A tibble: 5 × 11
##   .model            city  .type     ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>             <chr> <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 "SNAIVE(total_ca… sj    Test   14.2   34.1  18.2  -Inf   Inf   NaN   NaN 0.907
## 2 "ARIMA(total_cas… sj    Test   -7.78  25.9  18.9  -Inf   Inf   NaN   NaN 0.891
## 3 "ARIMA(total_cas… sj    Test   15.9   36.0  19.4  -Inf   Inf   NaN   NaN 0.934
## 4 "NNETAR(total_ca… sj    Test  -27.5   46.8  38.9  -Inf   Inf   NaN   NaN 0.948
## 5 "NNETAR(total_ca… sj    Test  -34.2   54.5  40.5  -Inf   Inf   NaN   NaN 0.950

Accuracy Metrics for Iquitos Models

iq_accuracy_metrics = rbind(m1_iq_accuracy, m2_iq_accuracy, m3_iq_accuracy, m4_iq_accuracy,
                            m5_iq_accuracy)
iq_accuracy_metrics %>% arrange(MAE)

## # A tibble: 5 × 11
##   .model            city  .type     ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>             <chr> <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 "NNETAR(total_ca… iq    Test  -0.145  7.79  5.77  -Inf   Inf   NaN   NaN 0.834
## 2 "ARIMA(total_cas… iq    Test  -3.18   8.11  6.74  -Inf   Inf   NaN   NaN 0.840
## 3 "ARIMA(total_cas… iq    Test   5.59   9.10  7.12   NaN   Inf   NaN   NaN 0.748
## 4 "NNETAR(total_ca… iq    Test  -4.28  10.7   8.20  -Inf   Inf   NaN   NaN 0.877
## 5 "SNAIVE(total_ca… iq    Test  -7.85  17.3  10.7   -Inf   Inf   NaN   NaN 0.759

Full model training with competition data

Training and Testing ARIMA Model for submission

ARIMA_sj_model_final <- sj_train %>% 
  model(ARIMA(total_cases ~ reanalysis_air_temp_k + reanalysis_avg_temp_k + reanalysis_dew_point_temp_k + reanalysis_max_air_temp_k + reanalysis_min_air_temp_k + reanalysis_specific_humidity_g_per_kg  + station_avg_temp_c + station_min_temp_c + station_max_temp_c + fourier(K=10)))


ARIMA_iq_model_final <- iq_train %>% 
  model(ARIMA(total_cases ~ reanalysis_air_temp_k + reanalysis_avg_temp_k + reanalysis_dew_point_temp_k + reanalysis_max_air_temp_k + reanalysis_min_air_temp_k + reanalysis_specific_humidity_g_per_kg  + station_avg_temp_c + station_min_temp_c + station_max_temp_c + fourier(K=10)))


ARIMA_sj_fcst <- ARIMA_sj_model_final %>% forecast(sj_test_real)
ARIMA_iq_fcst <- ARIMA_iq_model_final %>% forecast(iq_test_real)

sj_cases <- ARIMA_sj_fcst %>% subset(, c("city","year", "weekofyear", ".mean"))
iq_cases <- ARIMA_iq_fcst %>% subset(, c("city","year", "weekofyear", ".mean"))

submission <- rbind(sj_cases, iq_cases)
submission <- rename(submission, total_cases = .mean)
submission$total_cases <- round(submission$total_cases, 0)

submission <- na.omit(submission)

write.csv(submission, file = 'DengAIsubmissionARIMA1.csv', row.names = FALSE)

ARIMA_sj_fcst %>% autoplot(sj_train) + 
  labs(title = "ARIMA Model - San Juan") +
  xlab("Week") +
  ylab("Total Cases")

ARIMA_iq_fcst %>% autoplot(iq_train) + 
  labs(title = "ARIMA Model - Iquitos") +
  xlab("Week") +
  ylab("Total Cases")

Training and Testing Neural Network Model for submission

NN_sj_model_final <- sj_train %>% model(NNETAR(total_cases ~ reanalysis_dew_point_temp_k + reanalysis_min_air_temp_k + reanalysis_max_air_temp_k + reanalysis_specific_humidity_g_per_kg))


NN_iq_model_final<- iq_train %>% model(NNETAR(total_cases ~ reanalysis_dew_point_temp_k + reanalysis_min_air_temp_k + reanalysis_max_air_temp_k + reanalysis_specific_humidity_g_per_kg))


NN_sj_fcst <- NN_sj_model_final %>% forecast(sj_test_real, time = 100)
NN_iq_fcst <- NN_iq_model_final %>% forecast(iq_test_real, time = 100)

sj_cases <- NN_sj_fcst %>% subset(, c("city","year", "weekofyear", ".mean"))
iq_cases <- NN_iq_fcst %>% subset(, c("city","year", "weekofyear", ".mean"))

submission <- rbind(sj_cases, iq_cases)
submission <- rename(submission, total_cases = .mean)
submission$total_cases <- round(submission$total_cases, 0)

submission <- na.omit(submission)

write.csv(submission, file = 'DengAIsubmission_NN1.csv', row.names = FALSE)

NN_sj_fcst %>% autoplot(sj_train) + 
  labs(title = "Neural Network Model - San Juan") +
  xlab("Week") +
  ylab("Total Cases")

NN_iq_fcst %>% autoplot(iq_train) + 
  labs(title = "Neural Network Model - Iquitos") +
  xlab("Week") +
  ylab("Total Cases")

Training and Testing SNAIVE Model for submission

SNAIVE_sj_model_final <- sj_training %>%  model(SNAIVE(total_cases))
SNAIVE_iq_model_final <- iq_training %>%  model(SNAIVE(total_cases))

SNAIVE_sj_fcst <- SNAIVE_sj_model_final %>% forecast(sj_test_real)
SNAIVE_iq_fcst <- SNAIVE_iq_model_final %>% forecast(iq_test_real)

sj_cases <- SNAIVE_sj_fcst %>% subset(, c("city","year", "weekofyear", ".mean"))
iq_cases <- SNAIVE_iq_fcst %>% subset(, c("city","year", "weekofyear", ".mean"))

submission <- rbind(sj_cases, iq_cases)
submission <- rename(submission, total_cases = .mean)
submission$total_cases <- round(submission$total_cases, 0)

submission <- na.omit(submission)

write.csv(submission, file = 'DengAIsubmission_SNAIVE1.csv', row.names = FALSE)


SNAIVE_sj_fcst %>% autoplot(sj_train) + 
  labs(title = "SNAIVE - San Juan") +
  xlab("Week") +
  ylab("Total Cases")

SNAIVE_iq_fcst %>% autoplot(iq_train) + 
  labs(title = "SNAIVE- Iquitos") +
  xlab("Week") +
  ylab("Total Cases")

Discussion of Model Performance in DrivenData Competition

The final step of this project was to test the models on test data and submit the forecast results to the DengAI: Predicting Disease Spread competition hosted by Driven Data. I submitted forecasts produced by the dynamic ARIMA and Neural Network models as well as my benchmark SNAIVE model. Similar to my internal results, the ARIMA model scored the best with an MAE score of 25.6. Disappointingly, this is higher than the internal MAE scores produced when the model was trained on a subset of the training data and tested on hold out data. Another dynamic ARIMA attempt was made using lagged climate variables as suggested in Kakarla et al. (2022), however this increased the MAE score to 31.9.

The Neural Network model performed similarly to the tested model, with an MAE score of 27.1. Since internally it performed worse on the San Juan, the larger data set, the MAE will be biased in that direction when tested on the full data set. Lastly, the Seasonal Naive model produced an MAE score of 35.7. This was more surprising as it had produced the best results in my internal evaluation on San Juan data.

Results can be found on the DengAI: Predicting Disease Spread competition page of the DrivenData website, under username martharose. My best performing model ranked #1588 as of Aug. 18, 2023.

Limitations to Models and Future Adaptations

There are several limitations to the models as they are currently built. The primary concern is the applicability of the model to predict dengue cases in different locations. For the purposes of the project, I was interested in seeing how the model performance compared in different locations, so split the training and test sets into different time series by location and built models for San Juan and Iquitos. I then combined the forecasts into one submission file. If the model was to be used for multiple cities at once, the dependent variable total cases would need to be scaled accordingly, either by creating a case rate (i.e. cases for every 100,000 people) or using a logarithmic transformation to look at net percent change in outcome.

Another limitation of the model was created by the nature of the problem posed by competition. Since it was asking for a forecast of internally missing data, I had access to exogenous climate data for the time period that I was forecasting. In most forecasting scenarios, this would not be the case if you are building a forward looking forecast for a period of time in the future. External regressors would have to be a lagged value already known to us at the time of forecasting. While I did produce an example of this with an ARIMA model, the results did not perform as well as the model trained with current climate variables. Lastly, in future iterations of this project, I would like to explore additional feature engineering to explore climate data with stronger correlation to dengue cases.

Conclusion

Understanding the relationship between climate variables and dengue outbreaks is a valuable tool that can be used to predict future outbreaks and movement of the virus globally. Accurate forecasting will allow for improved resource allocation to combat dengue fever and future research to aid in prevention. Prediction models require data consolidation from multiple sources to get demographic, weather, and environment data as well as information on case count over regular intervals. The modeling process itself should always begin with proper analysis of the time series to understand observed seasonality and trends. Additionally, starting with basic models is beneficial in setting a benchmark to ensure more dynamic and complex models actually improve forecast performance. In this case

References

Dengue. Centers for Disease Control and Prevention. Aug. 15, 2023. https://www.cdc.gov/dengue/index.html

Kakarla, S.G., Kondeti, P.K., Vavilala, H.P. et al. Weather integrated multiple machine learning models for prediction of dengue prevalence in India. Int J Biometeorol 67, 285–297 (2023). https://doi.org/10.1007/s00484-022-02405-z

Nayak, M.S., Narayan, K.A. Forecasting Dengue Fever Incidence Using ARIMA Analysis. International Journal of Collaborative Research on Internal Medicine and Public Health. Vol. 11, Issue 6. (2019).

Predictive Analytics Final: Dengue Forecasting

Martha Rose Oordt

2023-08-18

Introduction

Data Manipulation

Exploratory Data Analysis

Methods

Model Formulation

ARIMA Models

Basic ARIMA models

ARIMA with external regressors - San Juan

ARIMA with external regressors - Iquitos

Neural Net Models

Basic Neural Net Models

Neural Net Model with regressors- San Juan

Neural Net Model - Iquitos

Seasonal NAIVE Models

NAIVE model - San Juan

NAIVE model - Iquitos

Accuracy Metrics

Accuracy Metrics for San Juan Models

Accuracy Metrics for Iquitos Models

Full model training with competition data

Training and Testing ARIMA Model for submission

Training and Testing Neural Network Model for submission

Training and Testing SNAIVE Model for submission

Discussion of Model Performance in DrivenData Competition

Limitations to Models and Future Adaptations

Conclusion

References

Predictive Analytics Final: Dengue Forecasting

Martha Rose Oordt

2023-08-18

Introduction

Data Manipulation

Exploratory Data Analysis

Related Literature

Methods

Model Formulation

ARIMA Models

Basic ARIMA models

ARIMA with external regressors - San Juan

ARIMA with external regressors - Iquitos

Neural Net Models

Basic Neural Net Models

Neural Net Model with regressors- San Juan

Neural Net Model - Iquitos

Seasonal NAIVE Models

NAIVE model - San Juan

NAIVE model - Iquitos

Accuracy Metrics

Accuracy Metrics for San Juan Models

Accuracy Metrics for Iquitos Models

Full model training with competition data

Training and Testing ARIMA Model for submission

Training and Testing Neural Network Model for submission

Training and Testing SNAIVE Model for submission

Discussion of Model Performance in DrivenData Competition

Limitations to Models and Future Adaptations

Conclusion

References