Insect-borne diseases have plagued (pun not intended)
humanity for millennials. While medicine has developed to a point where
most outbreaks of even the most severe diseases can be contained, our
ability to effectively forecast and stop outbreaks as they happen is
crucial.
Mosquito-carried illnesses are particularly
concerning. With global warming increasing average temperatures
worldwide, breeding conditions for mosquitoes improve, accelerating the
spread of any diseases they may carry. Dengue fever is among the
illnesses spread by mosquitoes. Concentrated largely in the Southern
Hemisphere, almost half a billion cases are diagnosed annually, only in
South America.
The purpose of this paper is to create an
effective way to forecast the number of cases on a weekly basis in two
South American towns, San Juan, Puerto Rico, and Iquitos, Peru. As this
is a part of a Driven Data competition, we were provided with premade
Training and Testing datasets with a wide array of variables. To
generate an accurate forecast, I will use an OLS, Neural Network, and a
Dynamic Regression models, along with a wide array of evaluation
metrics, such as MAE, AIC, BIC, and R2, to find which model performs
best. Driven Data also evaluates the results of our forecast, which will
be useful when generating benchmarks and seeing the general performance
of our model.
| Variable Name | Variable Definition |
|---|---|
| city | City abbreviations representing San Juan and Iquitos in the dataset. |
| year | The year when the data was recorded. |
| weekofyear | The week of the year when the data was recorded. |
| week_start_date | The start date of the week when the data was recorded. |
| ndvi_ne | Normalized difference vegetation index (NE quadrant). |
| ndvi_nw | Normalized difference vegetation index (NW quadrant). |
| ndvi_se | Normalized difference vegetation index (SE quadrant). |
| ndvi_sw | Normalized difference vegetation index (SW quadrant). |
| precipitation_amt_mm | Total precipitation in millimeters. |
| reanalysis_air_temp_k | Reanalysis air temperature in Kelvin. |
| reanalysis_avg_temp_k | Reanalysis average temperature in Kelvin. |
| reanalysis_dew_point_temp_k | Reanalysis dew point temperature in Kelvin. |
| reanalysis_max_air_temp_k | Reanalysis maximum air temperature in Kelvin. |
| reanalysis_min_air_temp_k | Reanalysis minimum air temperature in Kelvin. |
| reanalysis_precip_amt_kg_per_m2 | Reanalysis precipitation amount in kilograms per square meter. |
| reanalysis_relative_humidity_percent | Reanalysis relative humidity as a percentage. |
| reanalysis_sat_precip_amt_mm | Reanalysis saturation precipitation amount in millimeters. |
| reanalysis_specific_humidity_g_per_kg | Reanalysis specific humidity in grams per kilogram. |
| reanalysis_tdtr_k | Reanalysis temperature diurnal range in Kelvin. |
| station_avg_temp_c | Average temperature at the weather station in Celsius. |
| station_diur_temp_rng_c | Diurnal temperature range at the weather station in Celsius. |
| station_max_temp_c | Maximum temperature at the weather station in Celsius. |
| station_min_temp_c | Minimum temperature at the weather station in Celsius. |
| station_precip_mm | Precipitation at the weather station in millimeters. |
To prepare the data for analysis, I will perform some
exploratory data analysis. The first is assessing the number of missing
observations.
The below graph shows that most variables, at some point,
are missing some observations. While the ndvi_ne variable might be
missing too many observations to be safely used, the remaining variables
are all missing less than 5% of observations. I will replace the missing
observations with each variable’s mean to deal with this issue. That
way, I will not be drastically altering the distribution of the variable
and can safely use it in my models.
vis_miss(Train,sort_miss = TRUE, show_perc = TRUE)
# List of variables to be imputed
variables_to_impute <- c("ndvi_ne", "ndvi_nw", "ndvi_se", "ndvi_sw",
"precipitation_amt_mm", "reanalysis_air_temp_k",
"reanalysis_avg_temp_k", "reanalysis_dew_point_temp_k",
"reanalysis_max_air_temp_k", "reanalysis_min_air_temp_k",
"reanalysis_precip_amt_kg_per_m2", "reanalysis_relative_humidity_percent",
"reanalysis_sat_precip_amt_mm", "reanalysis_specific_humidity_g_per_kg",
"reanalysis_tdtr_k", "station_avg_temp_c",
"station_diur_temp_rng_c", "station_max_temp_c",
"station_min_temp_c", "station_precip_mm")
# Impute missing values with mean for each variable
for (variable in variables_to_impute) {
Train[[variable]] <- ifelse(is.na(Train[[variable]]), mean(Train[[variable]], na.rm = TRUE), Train[[variable]])
}
As we can see, upon the completion of the loop, the data
has no more missing observations and is ready for analysis.
## SummaryStats was converted to a data frame
| Variable | Stats / Values | Freqs (% of Valid) | Graph | ||||
|---|---|---|---|---|---|---|---|
| ndvi_ne [numeric] |
|
14 distinct values | |||||
| ndvi_nw [numeric] |
|
15 distinct values | |||||
| ndvi_se [numeric] |
|
15 distinct values | |||||
| ndvi_sw [numeric] |
|
15 distinct values | |||||
| precipitation_amt_mm [numeric] |
|
15 distinct values | |||||
| reanalysis_air_temp_k [numeric] |
|
15 distinct values | |||||
| reanalysis_avg_temp_k [numeric] |
|
15 distinct values | |||||
| reanalysis_dew_point_temp_k [numeric] |
|
15 distinct values | |||||
| reanalysis_max_air_temp_k [numeric] |
|
15 distinct values | |||||
| reanalysis_min_air_temp_k [numeric] |
|
15 distinct values | |||||
| reanalysis_precip_amt_kg_per_m2 [numeric] |
|
15 distinct values | |||||
| reanalysis_relative_humidity_percent [numeric] |
|
15 distinct values | |||||
| reanalysis_sat_precip_amt_mm [numeric] |
|
15 distinct values | |||||
| reanalysis_specific_humidity_g_per_kg [numeric] |
|
15 distinct values | |||||
| reanalysis_tdtr_k [numeric] |
|
15 distinct values | |||||
| station_avg_temp_c [numeric] |
|
15 distinct values | |||||
| station_diur_temp_rng_c [numeric] |
|
15 distinct values | |||||
| station_max_temp_c [numeric] |
|
15 distinct values | |||||
| station_min_temp_c [numeric] |
|
15 distinct values | |||||
| station_precip_mm [numeric] |
|
15 distinct values | |||||
| weekofyear [numeric] |
|
15 distinct values |
Generated by summarytools 1.0.1 (R version 4.2.3)
2024-04-28
As mosquitoes carry the dengue fever, my first focus was
to investigate literature focusing on forecasting the life cycles of
said insects.
A paper published in the Annual Review of
Ecology and Systematics (Wolda, H. (1988)) titled “Insect Seasonality:
Why?” discusses the external factors affecting the rises and falls in
mosquito population in tropic and subtropic countries. According to
Wolda, as latitude decreases, the seasons of the mosquito life cycle
lengthen and become more frequent. This is particularly aided by the
higher humidity levels, percipitation, and temperatures. As the dataset
provided by Driven Data includes information on all three of these
factors, this paper is valuable in the variable selection process.
The Jian et al. paper (Jian, Y., Silvestri, S., Brown, J.,
Hickman, R., & Marani, M. (2016)) provides us with some insights
into the model building with regards to mosquito population growth.
While the replication of the SSR model used in the paper is outside of
the scope of this paper, its focus on past states is a valuable insight.
The paper does it quite differently from an ARIMA model; however, its
accuracy makes me believe that the inclusion of lagged terms and
auto-regressive terms may be valuable.
While in general, OLS models are rarely used due to their
inherent endogeneity issues, in our case said issues are an asset. If we
used a fixed effects model or some other econometric model with a
stronger causality claim, we would eliminate much of the predictive
power that an OLS model may have.
While a zero conditional
mean violation is the end of any study, if our variables are strongly
correlated with others, it may generate a superior forecast. A
limitation of this approach is that we can make no claims about the
reason behind the growth or decrease in cases. However, we may be able
to forecast outbreaks fairly effectively.
To increase the
predictive power of my models, I generated some variables that might be
of interest. Firstly, I created a location binary variable, which takes
on the value of one if the location of the observation is San Juan and 0
otherwise. Furthermore, I created lagged variables of the
station_max_temp_c and percipitation_amt_mm. Mosquitos breed during
higher temperatures and after rain. Creating a one-week lag of these
variables will hopefully account for spikes in the mosquito population
and the subsequent increases in dengue fever.
##
## ===========================================================
## Dependent variable:
## ---------------------------
## Cases
## -----------------------------------------------------------
## station_max_temp_c 0.110
## (0.905)
##
## station_max_temp_c_lagged -0.724
## (0.864)
##
## precipitation_amt_mm 0.013
## (0.020)
##
## percipitation_amt_mm_lagged 0.023
## (0.019)
##
## reanalysis_air_temp_k 12.754***
## (4.200)
##
## reanalysis_precip_amt_kg_per_m2 0.015
## (0.022)
##
## reanalysis_min_air_temp_k 0.417
## (0.845)
##
## reanalysis_avg_temp_k -7.562**
## (3.690)
##
## station_avg_temp_c -3.339**
## (1.475)
##
## station_avg_temp_c_lagged 1.831
## (1.345)
##
## SJ -7.001
## (6.831)
##
## yr1991 26.532
## (19.489)
##
## yr1992 85.587***
## (19.496)
##
## yr1993 53.479***
## (19.458)
##
## yr1994 -12.734
## (19.486)
##
## yr1995 59.268***
## (19.459)
##
## yr1996 44.918**
## (19.489)
##
## yr1997 41.254**
## (19.610)
##
## yr1998 86.568***
## (19.465)
##
## yr1999 64.332***
## (19.572)
##
## yr2000 44.012**
## (19.471)
##
## yr2001 42.811**
## (18.730)
##
## yr2002 51.049***
## (18.780)
##
## yr2003 41.440**
## (18.745)
##
## yr2004 41.053**
## (18.791)
##
## yr2005 46.015**
## (18.760)
##
## yr2006 50.203***
## (18.831)
##
## yr2007 43.197**
## (18.719)
##
## yr2008 50.190***
## (19.105)
##
## yr2009 58.054***
## (19.610)
##
## yr2010 53.860***
## (20.295)
##
## yr2011
##
##
## yr2012
##
##
## yr2013
##
##
## weekofyear 1.554***
## (0.484)
##
## yr1991:weekofyear 0.752
## (0.548)
##
## yr1992:weekofyear -1.790***
## (0.549)
##
## yr1993:weekofyear -1.442***
## (0.548)
##
## yr1994:weekofyear 4.957***
## (0.549)
##
## yr1995:weekofyear -1.566***
## (0.548)
##
## yr1996:weekofyear -1.380**
## (0.546)
##
## yr1997:weekofyear -0.852
## (0.551)
##
## yr1998:weekofyear -0.377
## (0.547)
##
## yr1999:weekofyear -1.654***
## (0.548)
##
## yr2000:weekofyear -1.650***
## (0.535)
##
## yr2001:weekofyear -1.248**
## (0.516)
##
## yr2002:weekofyear -1.693***
## (0.518)
##
## yr2003:weekofyear -1.372***
## (0.516)
##
## yr2004:weekofyear -1.283**
## (0.517)
##
## yr2005:weekofyear -1.235**
## (0.517)
##
## yr2006:weekofyear -1.778***
## (0.518)
##
## yr2007:weekofyear -1.011**
## (0.515)
##
## yr2008:weekofyear -1.385**
## (0.539)
##
## yr2009:weekofyear -1.921***
## (0.549)
##
## yr2010:weekofyear -2.005***
## (0.710)
##
## yr2011:weekofyear
##
##
## yr2012:weekofyear
##
##
## yr2013:weekofyear
##
##
## Constant -1,642.182***
## (357.665)
##
## -----------------------------------------------------------
## Observations 1,455
## R2 0.617
## Adjusted R2 0.603
## Residual Std. Error 27.480 (df = 1402)
## F Statistic 43.451*** (df = 52; 1402)
## ===========================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
As mentioned before, the causality claims we can make
about the regression coefficients are limited. However, we see that the
time variables and the interaction terms, including year and week of the
year term, were very efficient in anticipating spikes in the data.
Unfortunately, the R2 of the model only reached 0.617, and the
Driven Data evaluation showed that the model performed quite poorly with
an MAE of 32.38. While perhaps the inclusion of other variables may be
helpful, I suspect using a different model will be a superior way to
improve our forecast.
A dynamic regression offers the possibility to include
both the wide array of variables that a regular regression can and to
“squeeze out” the last bits of predictive power that usually come from
time series models specifically ARIMA. While writing my DR model, I ran
the San Juan and the Iquitos data in separate models and included four
variables that would give the model the most predictive power.
Specifically, I focused on the weather variables, which have the
strongest predictive power regarding the mosquito lifecycle. An argument
could be made against the usage of these variables, as some of them were
not statistically significant in our OLS model; however, with the high
probability of bias in our model, those statements are not necessarily
true.
#Dynamic Regression SJ
DRSJ <- auto.arima(TrainSJ$Cases,
xreg = cbind(TrainSJ$reanalysis_precip_amt_kg_per_m2,
TrainSJ$reanalysis_avg_temp_k,
TrainSJ$reanalysis_max_air_temp_k,
TrainSJ$reanalysis_relative_humidity_percent))
DRSJforc <- forecast(DRSJ,
xreg = cbind(TestSJ$reanalysis_precip_amt_kg_per_m2,
TestSJ$reanalysis_avg_temp_k,
TestSJ$reanalysis_max_air_temp_k,
TestSJ$reanalysis_relative_humidity_percent),
h = 260)
## Warning in forecast.forecast_ARIMA(DRSJ, xreg =
## cbind(TestSJ$reanalysis_precip_amt_kg_per_m2, : Upper prediction intervals are
## not finite.
TestSJ$total_cases <- DRSJforc[["mean"]]
# Convert week_start_date column to Date format
TrainSJ$week_start_date <- as.Date(TrainSJ$week_start_date, format = "%Y-%m-%d")
# Plot the time series with forecast
TrainSJ %>%
ggplot(aes(x = week_start_date, y = Cases)) +
geom_line(color = "black", size = 1) +
geom_line(data = TestSJ, aes(x = as.Date(week_start_date), y = total_cases), color = "red") + # Plot forecast in red dashed line
labs(y = "Cases of the Dengue Fever in SJ",
x = "Time",
title = "Dynamic Regression Forecast of Cases of the Dengue Fever in SJ",
subtitle = "Source: Google Data") +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
################################################################################
#Dynamic Regression IQ
DRIQ <- auto.arima(TrainIQ$Cases,
xreg = cbind(TrainIQ$reanalysis_precip_amt_kg_per_m2,
TrainIQ$reanalysis_avg_temp_k,
TrainIQ$reanalysis_max_air_temp_k,
TrainIQ$reanalysis_relative_humidity_percent))
DRIQforc <- forecast(DRIQ,
xreg = cbind(TestIQ$reanalysis_precip_amt_kg_per_m2,
TestIQ$reanalysis_avg_temp_k,
TestIQ$reanalysis_max_air_temp_k,
TestIQ$reanalysis_relative_humidity_percent),
h = 156)
TestIQ$total_cases <- DRIQforc[["mean"]]
# Plot the time series with forecast
TrainIQ %>%
ggplot(aes(x = week_start_date, y = Cases)) +
geom_line(color = "black", size = 1) +
geom_line(data = TestIQ, aes(x = as.Date(week_start_date), y = total_cases), color = "red") + # Plot forecast in red dashed line
labs(y = "Cases of the Dengue Fever in IQ",
x = "Time",
title = "Dynamic Regression Forecast of Cases of the Dengue Fever in IQ",
subtitle = "Source: Google Data") +
theme_minimal()
Unfortunately, the model did not perform as well as I had
hoped. With an MAE of 34.1, it performed worse than the standard OLS
model, leading me to believe a different approach may be necessary. If I
were to improve this model, I would likely include some of the year
terms, as those were crucial in the OLS model’s ability to predict peak
years.
The final model I will utilize is a Neural Network model.
To create this model, I utilized the NNETAAR command, splitting the
training dataset into two, each representing SJ or IQ. The models used
were NNAR(5,3) for the IQ part of the dataset and NNAR(10,6) for the SJ
portion. These models were exceptionally computationally difficult,
using 5 and 10 lagged inputs and 3 and 6 nodes, respectively.
# To save computational power, I only include an image of the code, rather than it run every time (it takes about 30 minutes).
#NNEAR IQ
TrainIQ$Index <- 1:nrow(TrainIQ)
TrainIQ <- TrainIQ %>%
as_tsibble(index = Index)
NNER <- TrainIQ %>%
model(NNETAR(Cases))
NNERForc <- NNER %>%
forecast(h=156)
report(NNER)
accuracy(NNER)
TestIQ$total_cases <- NNERForc$.mean
TrainSJ$Index <- 1:nrow(TrainSJ)
TrainSJ <- TrainSJ %>%
as_tsibble(index = Index)
NNERSJ <- TrainSJ%>%
model(NNETAR(Cases))
NNERForcSJ <- NNERSJ %>%
forecast(h=260)
report(NNERSJ)
accuracy(NNERSJ)
TestSJ$total_cases <- NNERForcSJ$.mean
The model indeed preformed well. With a MAE of 28.46, the
models forecast was an improvement upon the prior models. Graphically,
the model also seems to follow the shape of the data more accurately. A
potential improvement of the model would be the addition of external
regressors. Unfortunately, I was restricted by the computational power
of my device.
The model indeed preformed well. With an MAE of 28.46, the
model’s forecast was an improvement from the prior models. Graphically,
the model also follows the shape of the data more accurately. A
potential improvement of the model would be the addition of external
regressors. Unfortunately, I was restricted by the computational power
of my device.
We have found that regression methods are perhaps not
optimal for forecasting illness data, and more advanced time series
models may be necessary. The dengue fever is expected to remain
contained under the model, though some spikes should be anticipated.
While the neural network model performed the best, its prediction
accuracy could be improved by the addition of external regressors. The
model struggled with forecasting the seasonality, or spikes, of the
diseases, making it a suboptimal model for performing week-to-week
variation forecasts.
In the future, I would employ AWS or
another sort of cloud computing service and use an NNER model with a
higher number of variables. Potentially, it might also be beneficial to
train the model on a larger number of observations, for example, data on
malaria cases, to establish the model’s best form to forecast
mosquito-carried illnesses.
[1] - Wolda, H. (1988). Insect Seasonality: Why? Annual Review of Ecology and Systematics, 19, 1–18. http://www.jstor.org/stable/2097145
[2] - Jian, Y., Silvestri, S., Brown, J., Hickman, R., & Marani, M. (2016). The predictability of mosquito abundance from daily to monthly timescales. Ecological Applications, 26(8), 2611–2622. http://www.jstor.org/stable/44132562