Introduction


      Insect-borne diseases have plagued (pun not intended) humanity for millennials. While medicine has developed to a point where most outbreaks of even the most severe diseases can be contained, our ability to effectively forecast and stop outbreaks as they happen is crucial.
      Mosquito-carried illnesses are particularly concerning. With global warming increasing average temperatures worldwide, breeding conditions for mosquitoes improve, accelerating the spread of any diseases they may carry. Dengue fever is among the illnesses spread by mosquitoes. Concentrated largely in the Southern Hemisphere, almost half a billion cases are diagnosed annually, only in South America.
      The purpose of this paper is to create an effective way to forecast the number of cases on a weekly basis in two South American towns, San Juan, Puerto Rico, and Iquitos, Peru. As this is a part of a Driven Data competition, we were provided with premade Training and Testing datasets with a wide array of variables. To generate an accurate forecast, I will use an OLS, Neural Network, and a Dynamic Regression models, along with a wide array of evaluation metrics, such as MAE, AIC, BIC, and R2, to find which model performs best. Driven Data also evaluates the results of our forecast, which will be useful when generating benchmarks and seeing the general performance of our model.

Variable Table

Variable Name Variable Definition
city City abbreviations representing San Juan and Iquitos in the dataset.
year The year when the data was recorded.
weekofyear The week of the year when the data was recorded.
week_start_date The start date of the week when the data was recorded.
ndvi_ne Normalized difference vegetation index (NE quadrant).
ndvi_nw Normalized difference vegetation index (NW quadrant).
ndvi_se Normalized difference vegetation index (SE quadrant).
ndvi_sw Normalized difference vegetation index (SW quadrant).
precipitation_amt_mm Total precipitation in millimeters.
reanalysis_air_temp_k Reanalysis air temperature in Kelvin.
reanalysis_avg_temp_k Reanalysis average temperature in Kelvin.
reanalysis_dew_point_temp_k Reanalysis dew point temperature in Kelvin.
reanalysis_max_air_temp_k Reanalysis maximum air temperature in Kelvin.
reanalysis_min_air_temp_k Reanalysis minimum air temperature in Kelvin.
reanalysis_precip_amt_kg_per_m2 Reanalysis precipitation amount in kilograms per square meter.
reanalysis_relative_humidity_percent Reanalysis relative humidity as a percentage.
reanalysis_sat_precip_amt_mm Reanalysis saturation precipitation amount in millimeters.
reanalysis_specific_humidity_g_per_kg Reanalysis specific humidity in grams per kilogram.
reanalysis_tdtr_k Reanalysis temperature diurnal range in Kelvin.
station_avg_temp_c Average temperature at the weather station in Celsius.
station_diur_temp_rng_c Diurnal temperature range at the weather station in Celsius.
station_max_temp_c Maximum temperature at the weather station in Celsius.
station_min_temp_c Minimum temperature at the weather station in Celsius.
station_precip_mm Precipitation at the weather station in millimeters.

Data Description


      To prepare the data for analysis, I will perform some exploratory data analysis. The first is assessing the number of missing observations.

Missing Observations


      The below graph shows that most variables, at some point, are missing some observations. While the ndvi_ne variable might be missing too many observations to be safely used, the remaining variables are all missing less than 5% of observations. I will replace the missing observations with each variable’s mean to deal with this issue. That way, I will not be drastically altering the distribution of the variable and can safely use it in my models.

vis_miss(Train,sort_miss = TRUE, show_perc = TRUE)

# List of variables to be imputed
variables_to_impute <- c("ndvi_ne", "ndvi_nw", "ndvi_se", "ndvi_sw",
                          "precipitation_amt_mm", "reanalysis_air_temp_k",
                          "reanalysis_avg_temp_k", "reanalysis_dew_point_temp_k",
                          "reanalysis_max_air_temp_k", "reanalysis_min_air_temp_k",
                          "reanalysis_precip_amt_kg_per_m2", "reanalysis_relative_humidity_percent",
                          "reanalysis_sat_precip_amt_mm", "reanalysis_specific_humidity_g_per_kg",
                          "reanalysis_tdtr_k", "station_avg_temp_c",
                          "station_diur_temp_rng_c", "station_max_temp_c",
                          "station_min_temp_c", "station_precip_mm")

# Impute missing values with mean for each variable
for (variable in variables_to_impute) {
  Train[[variable]] <- ifelse(is.na(Train[[variable]]), mean(Train[[variable]], na.rm = TRUE), Train[[variable]])
}


      As we can see, upon the completion of the loop, the data has no more missing observations and is ready for analysis.

Summary Statistics

## SummaryStats was converted to a data frame

Data Frame Summary

SummaryStats

Dimensions: 15 x 21
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph
ndvi_ne [numeric]
Mean (sd) : 103.9 (374.9)
min ≤ med ≤ max:
-0.4 ≤ 0.1 ≤ 1456
IQR (CV) : 0.3 (3.6)
14 distinct values
ndvi_nw [numeric]
Mean (sd) : 103.9 (374.9)
min ≤ med ≤ max:
-0.5 ≤ 0.1 ≤ 1456
IQR (CV) : 0.2 (3.6)
15 distinct values
ndvi_se [numeric]
Mean (sd) : 103.9 (374.9)
min ≤ med ≤ max:
0 ≤ 0.2 ≤ 1456
IQR (CV) : 0.5 (3.6)
15 distinct values
ndvi_sw [numeric]
Mean (sd) : 104 (374.9)
min ≤ med ≤ max:
-0.1 ≤ 0.2 ≤ 1456
IQR (CV) : 0.6 (3.6)
15 distinct values
precipitation_amt_mm [numeric]
Mean (sd) : 151.2 (373.7)
min ≤ med ≤ max:
0 ≤ 43.5 ≤ 1456
IQR (CV) : 60.8 (2.5)
15 distinct values
reanalysis_air_temp_k [numeric]
Mean (sd) : 223.5 (369.9)
min ≤ med ≤ max:
-0.7 ≤ 100 ≤ 1456
IQR (CV) : 298 (1.7)
15 distinct values
reanalysis_avg_temp_k [numeric]
Mean (sd) : 223.7 (370)
min ≤ med ≤ max:
-0.5 ≤ 100 ≤ 1456
IQR (CV) : 298.6 (1.7)
15 distinct values
reanalysis_dew_point_temp_k [numeric]
Mean (sd) : 222 (369.6)
min ≤ med ≤ max:
-0.7 ≤ 100 ≤ 1456
IQR (CV) : 294.6 (1.7)
15 distinct values
reanalysis_max_air_temp_k [numeric]
Mean (sd) : 226.1 (370.1)
min ≤ med ≤ max:
-0.2 ≤ 100 ≤ 1456
IQR (CV) : 301.1 (1.6)
15 distinct values
reanalysis_min_air_temp_k [numeric]
Mean (sd) : 222.3 (369.5)
min ≤ med ≤ max:
-0.7 ≤ 100 ≤ 1456
IQR (CV) : 294.6 (1.7)
15 distinct values
reanalysis_precip_amt_kg_per_m2 [numeric]
Mean (sd) : 159.6 (385.8)
min ≤ med ≤ max:
0 ≤ 27.4 ≤ 1456
IQR (CV) : 39.4 (2.4)
15 distinct values
reanalysis_relative_humidity_percent [numeric]
Mean (sd) : 137.3 (367.1)
min ≤ med ≤ max:
-0.4 ≤ 57.8 ≤ 1456
IQR (CV) : 81 (2.7)
15 distinct values
reanalysis_sat_precip_amt_mm [numeric]
Mean (sd) : 151.2 (373.7)
min ≤ med ≤ max:
0 ≤ 43.5 ≤ 1456
IQR (CV) : 60.8 (2.5)
15 distinct values
reanalysis_specific_humidity_g_per_kg [numeric]
Mean (sd) : 110.7 (373)
min ≤ med ≤ max:
-0.5 ≤ 11.7 ≤ 1456
IQR (CV) : 16.7 (3.4)
15 distinct values
reanalysis_tdtr_k [numeric]
Mean (sd) : 106.8 (374.1)
min ≤ med ≤ max:
-0.2 ≤ 2.9 ≤ 1456
IQR (CV) : 5.3 (3.5)
15 distinct values
station_avg_temp_c [numeric]
Mean (sd) : 114.7 (372)
min ≤ med ≤ max:
-0.6 ≤ 21.4 ≤ 1456
IQR (CV) : 27.1 (3.2)
15 distinct values
station_diur_temp_rng_c [numeric]
Mean (sd) : 107.7 (373.8)
min ≤ med ≤ max:
-0.2 ≤ 4.5 ≤ 1456
IQR (CV) : 7.5 (3.5)
15 distinct values
station_max_temp_c [numeric]
Mean (sd) : 117.4 (371.3)
min ≤ med ≤ max:
-0.3 ≤ 26.7 ≤ 1456
IQR (CV) : 32.4 (3.2)
15 distinct values
station_min_temp_c [numeric]
Mean (sd) : 112.7 (372.5)
min ≤ med ≤ max:
-0.3 ≤ 14.7 ≤ 1456
IQR (CV) : 21.8 (3.3)
15 distinct values
station_precip_mm [numeric]
Mean (sd) : 157.6 (383.9)
min ≤ med ≤ max:
0 ≤ 27 ≤ 1456
IQR (CV) : 44.4 (2.4)
15 distinct values
weekofyear [numeric]
Mean (sd) : 118.3 (371)
min ≤ med ≤ max:
-1.2 ≤ 19.3 ≤ 1456
IQR (CV) : 32.2 (3.1)
15 distinct values

Generated by summarytools 1.0.1 (R version 4.2.3)
2024-04-28

Literature Review

Wolda, H. (1988)


      As mosquitoes carry the dengue fever, my first focus was to investigate literature focusing on forecasting the life cycles of said insects.
      A paper published in the Annual Review of Ecology and Systematics (Wolda, H. (1988)) titled “Insect Seasonality: Why?” discusses the external factors affecting the rises and falls in mosquito population in tropic and subtropic countries. According to Wolda, as latitude decreases, the seasons of the mosquito life cycle lengthen and become more frequent. This is particularly aided by the higher humidity levels, percipitation, and temperatures. As the dataset provided by Driven Data includes information on all three of these factors, this paper is valuable in the variable selection process.

Jian, Y., Silvestri, S., Brown, J., Hickman, R., & Marani, M. (2016)


      The Jian et al. paper (Jian, Y., Silvestri, S., Brown, J., Hickman, R., & Marani, M. (2016)) provides us with some insights into the model building with regards to mosquito population growth. While the replication of the SSR model used in the paper is outside of the scope of this paper, its focus on past states is a valuable insight. The paper does it quite differently from an ARIMA model; however, its accuracy makes me believe that the inclusion of lagged terms and auto-regressive terms may be valuable.

Models

OLS


      While in general, OLS models are rarely used due to their inherent endogeneity issues, in our case said issues are an asset. If we used a fixed effects model or some other econometric model with a stronger causality claim, we would eliminate much of the predictive power that an OLS model may have.
      While a zero conditional mean violation is the end of any study, if our variables are strongly correlated with others, it may generate a superior forecast. A limitation of this approach is that we can make no claims about the reason behind the growth or decrease in cases. However, we may be able to forecast outbreaks fairly effectively.
      To increase the predictive power of my models, I generated some variables that might be of interest. Firstly, I created a location binary variable, which takes on the value of one if the location of the observation is San Juan and 0 otherwise. Furthermore, I created lagged variables of the station_max_temp_c and percipitation_amt_mm. Mosquitos breed during higher temperatures and after rain. Creating a one-week lag of these variables will hopefully account for spikes in the mosquito population and the subsequent increases in dengue fever.

## 
## ===========================================================
##                                     Dependent variable:    
##                                 ---------------------------
##                                            Cases           
## -----------------------------------------------------------
## station_max_temp_c                         0.110           
##                                           (0.905)          
##                                                            
## station_max_temp_c_lagged                 -0.724           
##                                           (0.864)          
##                                                            
## precipitation_amt_mm                       0.013           
##                                           (0.020)          
##                                                            
## percipitation_amt_mm_lagged                0.023           
##                                           (0.019)          
##                                                            
## reanalysis_air_temp_k                    12.754***         
##                                           (4.200)          
##                                                            
## reanalysis_precip_amt_kg_per_m2            0.015           
##                                           (0.022)          
##                                                            
## reanalysis_min_air_temp_k                  0.417           
##                                           (0.845)          
##                                                            
## reanalysis_avg_temp_k                    -7.562**          
##                                           (3.690)          
##                                                            
## station_avg_temp_c                       -3.339**          
##                                           (1.475)          
##                                                            
## station_avg_temp_c_lagged                  1.831           
##                                           (1.345)          
##                                                            
## SJ                                        -7.001           
##                                           (6.831)          
##                                                            
## yr1991                                    26.532           
##                                          (19.489)          
##                                                            
## yr1992                                   85.587***         
##                                          (19.496)          
##                                                            
## yr1993                                   53.479***         
##                                          (19.458)          
##                                                            
## yr1994                                    -12.734          
##                                          (19.486)          
##                                                            
## yr1995                                   59.268***         
##                                          (19.459)          
##                                                            
## yr1996                                   44.918**          
##                                          (19.489)          
##                                                            
## yr1997                                   41.254**          
##                                          (19.610)          
##                                                            
## yr1998                                   86.568***         
##                                          (19.465)          
##                                                            
## yr1999                                   64.332***         
##                                          (19.572)          
##                                                            
## yr2000                                   44.012**          
##                                          (19.471)          
##                                                            
## yr2001                                   42.811**          
##                                          (18.730)          
##                                                            
## yr2002                                   51.049***         
##                                          (18.780)          
##                                                            
## yr2003                                   41.440**          
##                                          (18.745)          
##                                                            
## yr2004                                   41.053**          
##                                          (18.791)          
##                                                            
## yr2005                                   46.015**          
##                                          (18.760)          
##                                                            
## yr2006                                   50.203***         
##                                          (18.831)          
##                                                            
## yr2007                                   43.197**          
##                                          (18.719)          
##                                                            
## yr2008                                   50.190***         
##                                          (19.105)          
##                                                            
## yr2009                                   58.054***         
##                                          (19.610)          
##                                                            
## yr2010                                   53.860***         
##                                          (20.295)          
##                                                            
## yr2011                                                     
##                                                            
##                                                            
## yr2012                                                     
##                                                            
##                                                            
## yr2013                                                     
##                                                            
##                                                            
## weekofyear                               1.554***          
##                                           (0.484)          
##                                                            
## yr1991:weekofyear                          0.752           
##                                           (0.548)          
##                                                            
## yr1992:weekofyear                        -1.790***         
##                                           (0.549)          
##                                                            
## yr1993:weekofyear                        -1.442***         
##                                           (0.548)          
##                                                            
## yr1994:weekofyear                        4.957***          
##                                           (0.549)          
##                                                            
## yr1995:weekofyear                        -1.566***         
##                                           (0.548)          
##                                                            
## yr1996:weekofyear                        -1.380**          
##                                           (0.546)          
##                                                            
## yr1997:weekofyear                         -0.852           
##                                           (0.551)          
##                                                            
## yr1998:weekofyear                         -0.377           
##                                           (0.547)          
##                                                            
## yr1999:weekofyear                        -1.654***         
##                                           (0.548)          
##                                                            
## yr2000:weekofyear                        -1.650***         
##                                           (0.535)          
##                                                            
## yr2001:weekofyear                        -1.248**          
##                                           (0.516)          
##                                                            
## yr2002:weekofyear                        -1.693***         
##                                           (0.518)          
##                                                            
## yr2003:weekofyear                        -1.372***         
##                                           (0.516)          
##                                                            
## yr2004:weekofyear                        -1.283**          
##                                           (0.517)          
##                                                            
## yr2005:weekofyear                        -1.235**          
##                                           (0.517)          
##                                                            
## yr2006:weekofyear                        -1.778***         
##                                           (0.518)          
##                                                            
## yr2007:weekofyear                        -1.011**          
##                                           (0.515)          
##                                                            
## yr2008:weekofyear                        -1.385**          
##                                           (0.539)          
##                                                            
## yr2009:weekofyear                        -1.921***         
##                                           (0.549)          
##                                                            
## yr2010:weekofyear                        -2.005***         
##                                           (0.710)          
##                                                            
## yr2011:weekofyear                                          
##                                                            
##                                                            
## yr2012:weekofyear                                          
##                                                            
##                                                            
## yr2013:weekofyear                                          
##                                                            
##                                                            
## Constant                               -1,642.182***       
##                                          (357.665)         
##                                                            
## -----------------------------------------------------------
## Observations                               1,455           
## R2                                         0.617           
## Adjusted R2                                0.603           
## Residual Std. Error                 27.480 (df = 1402)     
## F Statistic                      43.451*** (df = 52; 1402) 
## ===========================================================
## Note:                           *p<0.1; **p<0.05; ***p<0.01


      As mentioned before, the causality claims we can make about the regression coefficients are limited. However, we see that the time variables and the interaction terms, including year and week of the year term, were very efficient in anticipating spikes in the data.
      Unfortunately, the R2 of the model only reached 0.617, and the Driven Data evaluation showed that the model performed quite poorly with an MAE of 32.38. While perhaps the inclusion of other variables may be helpful, I suspect using a different model will be a superior way to improve our forecast.

Dynamic Regression


      A dynamic regression offers the possibility to include both the wide array of variables that a regular regression can and to “squeeze out” the last bits of predictive power that usually come from time series models specifically ARIMA. While writing my DR model, I ran the San Juan and the Iquitos data in separate models and included four variables that would give the model the most predictive power. Specifically, I focused on the weather variables, which have the strongest predictive power regarding the mosquito lifecycle. An argument could be made against the usage of these variables, as some of them were not statistically significant in our OLS model; however, with the high probability of bias in our model, those statements are not necessarily true.

#Dynamic Regression SJ



DRSJ <- auto.arima(TrainSJ$Cases,
                     xreg = cbind(TrainSJ$reanalysis_precip_amt_kg_per_m2,
                                  TrainSJ$reanalysis_avg_temp_k,
                                  TrainSJ$reanalysis_max_air_temp_k,
                                  TrainSJ$reanalysis_relative_humidity_percent))

DRSJforc <- forecast(DRSJ,
                     xreg = cbind(TestSJ$reanalysis_precip_amt_kg_per_m2,
                                  TestSJ$reanalysis_avg_temp_k,
                                  TestSJ$reanalysis_max_air_temp_k,
                                  TestSJ$reanalysis_relative_humidity_percent),
                     h = 260)
## Warning in forecast.forecast_ARIMA(DRSJ, xreg =
## cbind(TestSJ$reanalysis_precip_amt_kg_per_m2, : Upper prediction intervals are
## not finite.
TestSJ$total_cases <- DRSJforc[["mean"]]

# Convert week_start_date column to Date format
TrainSJ$week_start_date <- as.Date(TrainSJ$week_start_date, format = "%Y-%m-%d")

# Plot the time series with forecast
TrainSJ %>%
  ggplot(aes(x = week_start_date, y = Cases)) +
  geom_line(color = "black", size = 1) + 
  geom_line(data = TestSJ, aes(x = as.Date(week_start_date), y = total_cases), color = "red") + # Plot forecast in red dashed line
  labs(y = "Cases of the Dengue Fever in SJ",
       x = "Time",
       title = "Dynamic Regression Forecast of Cases of the Dengue Fever in SJ",
       subtitle = "Source: Google Data") +
  theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

################################################################################
#Dynamic Regression IQ
DRIQ <- auto.arima(TrainIQ$Cases,
                   xreg = cbind(TrainIQ$reanalysis_precip_amt_kg_per_m2,
                                TrainIQ$reanalysis_avg_temp_k,
                                TrainIQ$reanalysis_max_air_temp_k,
                                TrainIQ$reanalysis_relative_humidity_percent))

DRIQforc <- forecast(DRIQ,
                     xreg = cbind(TestIQ$reanalysis_precip_amt_kg_per_m2,
                                  TestIQ$reanalysis_avg_temp_k,
                                  TestIQ$reanalysis_max_air_temp_k,
                                  TestIQ$reanalysis_relative_humidity_percent),
                     h = 156)

TestIQ$total_cases <- DRIQforc[["mean"]]

# Plot the time series with forecast
TrainIQ %>%
  ggplot(aes(x = week_start_date, y = Cases)) +
  geom_line(color = "black", size = 1) + 
  geom_line(data = TestIQ, aes(x = as.Date(week_start_date), y = total_cases), color = "red") + # Plot forecast in red dashed line
  labs(y = "Cases of the Dengue Fever in IQ",
       x = "Time",
       title = "Dynamic Regression Forecast of Cases of the Dengue Fever in IQ",
       subtitle = "Source: Google Data") +
  theme_minimal()


      Unfortunately, the model did not perform as well as I had hoped. With an MAE of 34.1, it performed worse than the standard OLS model, leading me to believe a different approach may be necessary. If I were to improve this model, I would likely include some of the year terms, as those were crucial in the OLS model’s ability to predict peak years.

Neural Network


      The final model I will utilize is a Neural Network model. To create this model, I utilized the NNETAAR command, splitting the training dataset into two, each representing SJ or IQ. The models used were NNAR(5,3) for the IQ part of the dataset and NNAR(10,6) for the SJ portion. These models were exceptionally computationally difficult, using 5 and 10 lagged inputs and 3 and 6 nodes, respectively.


# To save computational power, I only include an image of the code, rather than it run every time (it takes about 30 minutes).
#NNEAR IQ 
TrainIQ$Index <- 1:nrow(TrainIQ) 

TrainIQ <- TrainIQ %>% 
  as_tsibble(index = Index)

NNER <- TrainIQ %>% 
  model(NNETAR(Cases))

NNERForc <- NNER %>% 
  forecast(h=156)

report(NNER) 
accuracy(NNER)

TestIQ$total_cases <- NNERForc$.mean

TrainSJ$Index <- 1:nrow(TrainSJ) 
TrainSJ <- TrainSJ %>% 
  as_tsibble(index = Index)

NNERSJ <- TrainSJ%>% 
  model(NNETAR(Cases))

NNERForcSJ <- NNERSJ %>% 
  forecast(h=260)

report(NNERSJ) 
accuracy(NNERSJ) 
TestSJ$total_cases <- NNERForcSJ$.mean


      The model indeed preformed well. With a MAE of 28.46, the models forecast was an improvement upon the prior models. Graphically, the model also seems to follow the shape of the data more accurately. A potential improvement of the model would be the addition of external regressors. Unfortunately, I was restricted by the computational power of my device.

Conclusion


      The model indeed preformed well. With an MAE of 28.46, the model’s forecast was an improvement from the prior models. Graphically, the model also follows the shape of the data more accurately. A potential improvement of the model would be the addition of external regressors. Unfortunately, I was restricted by the computational power of my device.

Conclusion


     We have found that regression methods are perhaps not optimal for forecasting illness data, and more advanced time series models may be necessary. The dengue fever is expected to remain contained under the model, though some spikes should be anticipated. While the neural network model performed the best, its prediction accuracy could be improved by the addition of external regressors. The model struggled with forecasting the seasonality, or spikes, of the diseases, making it a suboptimal model for performing week-to-week variation forecasts.
      In the future, I would employ AWS or another sort of cloud computing service and use an NNER model with a higher number of variables. Potentially, it might also be beneficial to train the model on a larger number of observations, for example, data on malaria cases, to establish the model’s best form to forecast mosquito-carried illnesses.

Sources

[1] - Wolda, H. (1988). Insect Seasonality: Why? Annual Review of Ecology and Systematics, 19, 1–18. http://www.jstor.org/stable/2097145

[2] - Jian, Y., Silvestri, S., Brown, J., Hickman, R., & Marani, M. (2016). The predictability of mosquito abundance from daily to monthly timescales. Ecological Applications, 26(8), 2611–2622. http://www.jstor.org/stable/44132562