COVID-19 Forecasting
Introduction
Problem and Significance
This project examines confirmed COVID-19 cases and deaths in Mexico through June 30, 2020. I chose this country because I was born there and still have family living there, including my mother. My father passed away in June 2020, just as the curve of cases started rising; he was especially vulnerable to the virus because he had been fighting severe diabetes. Many countries could have been better prepared by learning from what happened elsewhere and implementing safety protocols, including masks and lockdowns, ahead of time.
I will show some of the trends and forecast the number of cases and deaths into the near future. Much of the modeling relies on univariate approaches such as ARIMA. According to "On the accuracy of ARIMA based prediction of COVID-19 spread" (Results Phys., 27 (2021), Article 104509), ARIMA modeling can be helpful despite its limitations. The second reference describes the application of SARIMA models, which incorporate the seasonality of the observed variable. These can be complemented with ETS models, which also account for seasonality.
Below, we walk through the code and the process step by step.
Data wrangling
Probably the most time-consuming part of the project.
First, I manage the data in two parallel forms: tsibbles and base ts objects. Some functions expect one structure or the other, and code occasionally breaks when handed a vector instead of a data frame, so keeping both lets me move back and forth as needed.
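As a small sketch of that back-and-forth (my own addition; it assumes the tsibble package is attached and that cases_tb has been created as in the chunk below), the two representations can be converted into each other:
library(tsibble)
# my addition: round-trip between tsibble and ts
cases_ts_from_tb <- as.ts(cases_tb)              # tsibble -> ts
cases_tb_from_ts <- as_tsibble(cases_ts_from_tb) # ts -> tsibble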
# Filter Mexico
train <- train_data %>%
filter(Country_Region == "Mexico")
# Setup the dataframes
cases_st <- train[which(train$Target == "ConfirmedCases"),]
deaths_st <- train[which(train$Target == "Fatalities"),]
# Aggregate daily totals for cases and deaths
cases <- aggregate(TargetValue ~ Date, data = cases_st, FUN = sum)
deaths <- aggregate(TargetValue ~ Date, data = deaths_st, FUN = sum)
#convert to dates
cases$Date <- as.Date(cases$Date)
deaths$Date <- as.Date(deaths$Date)
#convert to tsibble
cases_tb <- tsibble(cases, index = Date)
cases_tb <- cases_tb %>%
filter_index("2020-02-23" ~.)
deaths_tb <- tsibble(deaths, index = Date)
deaths_tb <- deaths_tb %>%
filter_index("2020-02-23" ~.)
#convert to ts
cases_ts <- ts(cases, start = 32, end = 140, frequency = 7)
deaths_ts <- ts(deaths, start = 32, end = 140, frequency = 7)
Data Exploration
Data as is
The first step was to plot the observed time series and look for patterns. The second was to plot the autocorrelation (ACF) and partial autocorrelation (PACF) functions.
# Confirmed Cases
cases_plot <-
autoplot(cases_tb,TargetValue) +
labs(title = "Confirmed Cases"
,x = "Date"
,y = "Confirmed Cases") +
theme_ipsum_ps() +
geom_smooth(formula = y ~ x, method = "loess", color = "orange", size = 0.3)
#Deaths
deaths_plot<-
autoplot(deaths_tb,TargetValue) +
labs(title = "Deaths"
,x = "Date"
,y = "Number of Fatalities") +
theme_ipsum_ps() +
geom_smooth(formula = y ~ x, method = "loess", color = "orange", size = 0.3)
grid.arrange(cases_plot, deaths_plot, nrow = 1, ncol = 2)
# ACF Confirmed Cases
# plain ts objects of the TargetValue column (default frequency)
cases_tb2 <- ts(subset(cases_tb, select = "TargetValue"))
deaths_tb2 <- ts(subset(deaths_tb, select = "TargetValue"))
ggtsdisplay(cases_tb2, main = "Confirmed Cases")
ggtsdisplay(deaths_tb2, main = "Deaths")
Stationarity
In this section we assess both series for stationarity and plot them accordingly.
Cases
The plot shows an upward trend, so this series is unlikely to be stationary. We first ran a KPSS test to confirm that it is NOT stationary, then checked how many differences are needed. The initial test rejects stationarity, and ndiffs() suggests a single difference. After one difference, however, the KPSS p-value falls between 0.01 and 0.1, so I took a second difference, which raises the p-value to 0.1.
# first we plot the differences
# Cases
cases_tb %>%
gg_tsdisplay(difference(TargetValue), plot_type = 'partial') +
labs(title = "Difference of 1, Cases")
# Unit Root
cases_tb %>%
features(TargetValue, unitroot_kpss) %>%
kbl()%>%
kable_material(c("striped","hover", full_width = T))
| kpss_stat | kpss_pvalue |
|---|---|
| 2.061807 | 0.01 |
paste("Number of Differences = ", ndiffs(cases_tb$TargetValue)) %>%
kbl()%>%
kable_material(c("striped","hover", full_width = T))
| x |
|---|
| Number of Differences = 1 |
# KPSS p-value still lower than 0.1, but higher than 0.01
order1_cases <-
cases_tb$TargetValue %>%
diff() %>%
unitroot_kpss
order1_cases %>%
kbl()%>%
kable_material(c("striped","hover", full_width = T))
| x | |
|---|---|
| kpss_stat | 0.4885659 |
| kpss_pvalue | 0.0442419 |
# We difference again
order2_cases <-
cases_tb$TargetValue %>%
diff() %>%
diff() %>%
unitroot_kpss
order2_cases %>%
kbl()%>%
kable_material(c("striped","hover", full_width = T))
| x | |
|---|---|
| kpss_stat | 0.0026919 |
| kpss_pvalue | 0.1000000 |
cases_tb2 %>%
diff() %>%
diff() %>%
ggtsdisplay(lag.max = 30, main = "Cases, order 2 Differences")
Deaths
For deaths, one difference was sufficient to make the series stationary. The ACF of the differenced series shows significant spikes at lags 7 and 14, consistent with a weekly pattern, and autocorrelations remain statistically significant out to lag 16. The PACF shows a similar picture, with even more lags remaining statistically significant over the same range.
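As a complementary, quantitative check on that weekly pattern (my own addition; feasts is already in use above via gg_tsdisplay and unitroot_kpss), STL-based strength features can be computed for the deaths series, where a seasonal strength close to 1 indicates a strong weekly cycle:
library(feasts)
# my addition: trend and weekly seasonal strength for deaths
deaths_tb %>%
features(TargetValue, feat_stl)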
# first we plot the differences
# Deaths
deaths_tb %>%
gg_tsdisplay(difference(TargetValue), plot_type = 'partial') +
labs(title = "Difference of 1, Deaths")
# Unit Root
deaths_tb %>%
features(TargetValue, unitroot_kpss) %>%
kbl()%>%
kable_material(c("striped","hover", full_width = T))
| kpss_stat | kpss_pvalue |
|---|---|
| 1.880058 | 0.01 |
paste("Number of Differences = ", ndiffs(deaths_tb$TargetValue)) %>%
kbl()%>%
kable_material(c("striped","hover", full_width = T))
| x |
|---|
| Number of Differences = 1 |
# KPSS p-value at 0.1
order1_deaths <-
deaths_tb$TargetValue %>%
diff() %>%
unitroot_kpss
order1_deaths %>%
kbl()%>%
kable_material(c("striped","hover", full_width = T))
| x | |
|---|---|
| kpss_stat | 0.1096631 |
| kpss_pvalue | 0.1000000 |
# No need to difference again; the initial plot shows one difference is sufficient
Models
As mentioned before, I fit ARIMA and ETS models, since they tend to give relatively good results and are well suited to this short paper.
From the models below, we can see that both ETS and ARIMA specifications provide relatively good results. Their accuracy statistics are close to one another, which lets us compare them directly and decide on the best model. The models were run several times while tuning the specifications.
################
# Cases
holts_cases <- cases_tb
#holts_cases [] = holts_cases [] +1
#2020-05-13
# training window: keep data through 2020-05-30; later dates are held out as a test set
fix_c <- holts_cases %>%
filter_index(. ~ "2020-05-30")
#models
fit_c <-
fix_c %>%
model( "ETS_c_AAM" = ETS(TargetValue ~ error("A") + trend("A") + season("M"))
, "ETS_c_AAN" = ETS(TargetValue ~ error("A") + trend("A") + season("N"))
, "ETS_c_AAA" = ETS(TargetValue ~ error("A") + trend("A") + season("A"))
, "ETS_c_AAdA" = ETS(TargetValue ~ error("A") + trend("Ad") + season("A")) #damped
, "ETS_c_auto" = ETS(TargetValue)
, "ARIMA_c1" = ARIMA(TargetValue ~ pdq(0, 1, 2) + PDQ(0, 1, 1)) #same as in Hyndman book
, "ARIMA_c2" = ARIMA(TargetValue ~ pdq(2, 1, 0) + PDQ(0, 1, 1))
, "ARIMA_c3" = ARIMA(TargetValue ~ pdq(1, 1, 1) + PDQ(0, 1, 1))
, "ARIMA_c_auto" = ARIMA(TargetValue)
)
################
# Deaths
holts_deaths <- deaths_tb
#holts_deaths [] = holts_deaths [] +1
fix_d <- holts_deaths %>%
filter_index(. ~ "2020-05-30")
# models
fit_d <- fix_d %>%
model( "ETS_d_AAM" = ETS(TargetValue ~ error("A") + trend("A") + season("M"))
, "ETS_d_AAN" = ETS(TargetValue ~ error("A") + trend("A") + season("N"))
, "ETS_d_AAA" = ETS(TargetValue ~ error("A") + trend("A") + season("A"))
, "ETS_d_AAdA" = ETS(TargetValue ~ error("A") + trend("Ad") + season("A")) #damped
, "ETS_d_auto" = ETS(TargetValue)
, "ARIMA_d1" = ARIMA(TargetValue ~ pdq(0, 1, 2) + PDQ(0, 1, 1)) #same as in Hyndman book
, "ARIMA_d2" = ARIMA(TargetValue ~ pdq(2, 1, 0) + PDQ(0, 1, 1))
, "ARIMA_d3" = ARIMA(TargetValue ~ pdq(1, 1, 1) + PDQ(0, 1, 1))
, "ARIMA_d_auto" = ARIMA(TargetValue)
)
# forecasts
fc_cases <-
fit_c %>%
forecast(h=28)
fc_death <-
fit_d %>%
forecast(h=28)
models_comp <-
bind_rows(
fit_c %>% accuracy()
,fit_d %>% accuracy()
,fc_cases %>% accuracy(cases_tb)
,fc_death %>% accuracy(deaths_tb)
)
models_comp %>%
select(
.model
,.type
,RMSE
,RMSSE
,MASE
,ACF1
) %>%
kbl()%>%
kable_material(c("striped","hover", full_width = T))
| .model | .type | RMSE | RMSSE | MASE | ACF1 |
|---|---|---|---|---|---|
| ETS_c_AAM | Training | 197.94864 | 0.5223434 | 0.4525789 | 0.2316406 |
| ETS_c_AAN | Training | 194.33885 | 0.5128179 | 0.4783194 | 0.2309706 |
| ETS_c_AAA | Training | 182.22134 | 0.4808425 | 0.5140632 | 0.1669257 |
| ETS_c_AAdA | Training | 184.91679 | 0.4879552 | 0.5339753 | 0.1501927 |
| ETS_c_auto | Training | 194.33885 | 0.5128179 | 0.4783194 | 0.2309706 |
| ARIMA_c1 | Training | 187.99198 | 0.4960699 | 0.4707706 | -0.0792275 |
| ARIMA_c2 | Training | 192.45682 | 0.5078517 | 0.4712215 | -0.0756834 |
| ARIMA_c3 | Training | 189.20177 | 0.4992623 | 0.4738702 | -0.0161504 |
| ARIMA_c_auto | Training | 181.87299 | 0.4799232 | 0.5059969 | -0.0461007 |
| ETS_d_AAM | Training | 34.60012 | 0.5949892 | 0.5705695 | -0.0522078 |
| ETS_d_AAN | Training | 53.58992 | 0.9215410 | 0.9127866 | 0.3075097 |
| ETS_d_AAA | Training | 39.69160 | 0.6825432 | 0.7149753 | 0.0670663 |
| ETS_d_AAdA | Training | 40.40115 | 0.6947448 | 0.7441628 | 0.0771447 |
| ETS_d_auto | Training | 39.69160 | 0.6825432 | 0.7149753 | 0.0670663 |
| ARIMA_d1 | Training | 40.55545 | 0.6973981 | 0.6467786 | -0.0412796 |
| ARIMA_d2 | Training | 46.45180 | 0.7987928 | 0.6967036 | -0.0808972 |
| ARIMA_d3 | Training | 40.56072 | 0.6974888 | 0.6480464 | -0.0360262 |
| ARIMA_d_auto | Training | 31.94064 | 0.5492565 | 0.5066420 | -0.0843061 |
| ARIMA_c_auto | Test | 710.18371 | 1.8740202 | 2.3425481 | 0.2884195 |
| ARIMA_c1 | Test | 497.02829 | 1.3115495 | 1.5991439 | 0.2070650 |
| ARIMA_c2 | Test | 690.51453 | 1.8221176 | 2.2634335 | 0.2087592 |
| ARIMA_c3 | Test | 502.59334 | 1.3262344 | 1.6054057 | 0.2239654 |
| ETS_c_AAA | Test | 506.37019 | 1.3362007 | 1.7334207 | 0.2959944 |
| ETS_c_AAdA | Test | 531.89299 | 1.4035498 | 1.8564574 | 0.2672352 |
| ETS_c_AAM | Test | 325.93178 | 0.8600630 | 1.1268786 | 0.2821467 |
| ETS_c_AAN | Test | 567.75924 | 1.4981931 | 1.9236739 | 0.3214893 |
| ETS_c_auto | Test | 567.75924 | 1.4981931 | 1.9236739 | 0.3214893 |
| ARIMA_d_auto | Test | 191.72872 | 3.2969983 | 3.5786167 | 0.2655432 |
| ARIMA_d1 | Test | 215.58720 | 3.7072726 | 4.2218912 | 0.3336219 |
| ARIMA_d2 | Test | 217.53060 | 3.7406916 | 4.2240468 | 0.3307464 |
| ARIMA_d3 | Test | 215.51063 | 3.7059559 | 4.2185929 | 0.3333758 |
| ETS_d_AAA | Test | 212.91244 | 3.6612770 | 4.3802753 | 0.3468298 |
| ETS_d_AAdA | Test | 217.28795 | 3.7365189 | 4.3111374 | 0.3344338 |
| ETS_d_AAM | Test | 190.92684 | 3.2832090 | 3.5420457 | 0.2286799 |
| ETS_d_AAN | Test | 276.14096 | 4.7485649 | 6.4026640 | 0.4839405 |
| ETS_d_auto | Test | 212.91244 | 3.6612770 | 4.3802753 | 0.3468298 |
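Before picking a winner, a quick way (my own addition, reusing the models_comp table built above) to rank the cases models on the held-out test period is to filter the Test rows and sort by RMSSE; the ETS(A,A,M) specification comes out on top, consistent with the discussion below.
library(dplyr)
# my addition: rank cases models on the test period by RMSSE
models_comp %>%
filter(.type == "Test", grepl("_c", .model)) %>%
arrange(RMSSE) %>%
select(.model, RMSE, RMSSE, MASE)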
Chosen Models
Cases
In training, the best-performing model was the auto ARIMA, which selected an order of (0,1,5) with drift, but it performed poorly on the test data. On the test data, the best-performing model was the ETS(A,A,M). We plot them below, and the ETS model clearly tracks the test period better.
## Plot ARIMA and ETS
#Cases#
fc_cases %>%
autoplot(cases_tb, level = 0, size = 0.5) +
labs(title = "Confirmed Cases"
,y = "Confirmed Cases") +
theme_ipsum_es()
# Model chosen was an ETS for cases
ets_mod_c <-
fix_c %>%
model("ETS_d_AAM" = ETS(TargetValue ~ error("A") + trend("A") + season("M")))
ets_fc_c <-
ets_mod_c %>%
forecast(h = 43)
ets_fc_c %>%
autoplot(cases_tb, level = 30, size = 0.5) +
labs(title = "Confirmed Cases, ETS Forecast (A,A,M)"
,y = "Confirmed Cases") +
theme_ipsum_es()
# Second Model for Cases was the auto ARIMA
arima_mod_c <-
fix_c %>%
model("ARIMA_c_auto" = ARIMA(TargetValue))
arima_fc_c <-
arima_mod_c %>%
forecast(h = 43)
arima_fc_c %>%
autoplot(fix_c, level = 30, size = 0.5) +
labs(title = "Confirmed Cases, ARIMA Forecast (0,1,5) w/ drift)"
,y = "Confirmed Cases") +
theme_ipsum_es()
Deaths
For deaths, the auto ARIMA, a (2,1,2)(0,1,2)[7], had the best training accuracy, while the ETS(A,A,M), as with confirmed cases, also performed well. On the test data, the ETS(A,A,M) performed marginally better than the auto ARIMA.
##################################
#Deaths
fc_death %>%
autoplot(deaths_tb, level = 0, size = 0.5) +
labs(title = "Deaths"
,y = "Deaths") +
theme_ipsum_es()
# Model chosen was an ETS for deaths too
ets_mod_d <-
fix_d %>%
model("ETS_d_AAM" = ETS(TargetValue ~ error("A") + trend("A") + season("M")))
ets_fc_d <-
ets_mod_d %>%
forecast(h = 43)
ets_fc_d %>%
autoplot(deaths_tb, level = 30, size = 0.5) +
labs(title = "Deaths, ETS Forecast"
,y = "Deaths") +
theme_ipsum_es()
# ARIMA for deaths
arima_mod_d <-
fix_d %>%
model("ARIMA_d_auto" = ARIMA(TargetValue))
arima_fc_d <-
arima_mod_d %>%
forecast(h = 43)
arima_fc_d %>%
autoplot(deaths_tb, level = 30, size = 0.5) +
labs(title = "Deaths, ARIMA Forecast (2,1,2)(0,1,2)[7]"
,y = "Deaths") +
theme_ipsum_es()
Limitations
These models have clear limitations. As many statisticians have noted, purely statistical time-series models are not always suitable for infectious-disease forecasting, since the fitted trend can be biased. Moreover, these models are univariate and do not account for exogenous events that could affect the outcome. Unlike analyses of countries where cases and deaths had already peaked and begun to decline, a model fit to this data could keep trending upward indefinitely in the absence of any intervention.
However, reading the forecast as what could happen in the absence of any intervention, say by policymakers, still provides useful information for taking proper action.
Future Work
At this point, much work remains. Other models could be tested, such as STL decomposition models and, eventually, neural networks; one possible starting point is sketched below.
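As a first step in that direction, here is a minimal sketch (my own addition, not run as part of this analysis) of an STL decomposition model and a neural network autoregression, reusing the fable tooling and the cases training window fix_c from above; the exact specifications are illustrative only.
library(fable)
library(feasts)
# my addition: candidate models for future work
future_fit <- fix_c %>%
model( # STL decomposition: fit a non-seasonal ETS to the seasonally adjusted
# series; unspecified seasonal components are forecast with a seasonal naive method
"STL_ETS" = decomposition_model(
STL(TargetValue ~ season(window = "periodic")),
ETS(season_adjust ~ error("A") + trend("A") + season("N")))
, "NNAR" = NNETAR(TargetValue) # neural network autoregression
)
future_fit %>%
forecast(h = 28) %>%
autoplot(cases_tb, level = NULL)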
References
“On the accuracy of ARIMA based prediction of COVID-19 spread” https://doi.org/10.1016/j.rinp.2021.104509
“Forecasting the dynamics of cumulative COVID-19 cases (confirmed, recovered and deaths) for top-16 countries using statistical machine learning models: Auto-Regressive Integrated Moving Average (ARIMA) and Seasonal Auto-Regressive Integrated Moving Average (SARIMA)” https://doi.org/10.1016/j.asoc.2021.107161
Extra - Models v.2
In this section I tried several models using base ts objects instead of the tsibbles. The results were not optimal, and there appears to be autocorrelation left in the residuals, as the Ljung-Box tests below indicate.
Auto ARIMA
The first model is the auto.arima, letting the package choose the best-fitting parameters. From there I tweaked some of the parameters manually, checked the residuals, plotted the charts, and kept the most reasonable model.
#Auto Arima Cases
auto_fit_c <- auto.arima(cases_tb2)
summary(auto_fit_c)
## Series: cases_tb2
## ARIMA(5,1,4) with drift
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ma1 ma2 ma3
## -0.8577 0.0944 -0.4016 -1.0749 -0.3057 0.4825 -0.6893 0.0362
## s.e. 0.2513 0.2054 0.1033 0.1397 0.2069 0.2463 0.2040 0.1024
## ma4 drift
## 0.6097 39.1224
## s.e. 0.1095 8.4292
##
## sigma^2 estimated as 50119: log likelihood=-734.84
## AIC=1491.69 AICc=1494.44 BIC=1521.19
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.2185645 212.2767 155.5616 -Inf Inf 0.9258089 0.007523936
checkresiduals(auto_fit_c)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(5,1,4) with drift
## Q* = 14.274, df = 3, p-value = 0.002555
##
## Model df: 10. Total lags used: 13
#plot
par(mfrow=c(1,1))
plot(cases_tb2, xlim = c(90,109), main = "Confirmed Cases ARIMA (5,1,4) w/ drift")
points(fitted(auto_fit_c), pch = 20, col = "deeppink2")
points(fitted(auto_fit_c), type = "l", col = "deeppink1")
plot(forecast(auto_fit_c, h = 28))
# ARIMA for Deaths (order chosen manually)
auto_fit_d <- arima(deaths_tb2[0:109], order=c(3,2,1))
summary(auto_fit_d)
##
## Call:
## arima(x = deaths_tb2[0:109], order = c(3, 2, 1))
##
## Coefficients:
## ar1 ar2 ar3 ma1
## 0.0379 -0.1350 -0.4934 -1.0000
## s.e. 0.0841 0.0842 0.0856 0.0299
##
## sigma^2 estimated as 7918: log likelihood = -635.35, aic = 1280.7
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 5.021456 88.16287 42.88906 -Inf Inf 0.8440268 -0.1512115
checkresiduals(auto_fit_d)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(3,2,1)
## Q* = 30.428, df = 6, p-value = 3.259e-05
##
## Model df: 4. Total lags used: 10
#plot
par(mfrow=c(1,1))
plot(forecast(auto_fit_d, h=28))