COVID-19 Forecasting

Introduction

Problem and Significance

This project looks at confirmed COVID-19 cases and deaths in Mexico through June 30, 2020. I chose this country because I was born there and still have family living there, including my mother. My father passed away in June 2020, just as the curve of cases was rising; he was a prime candidate for this virus because he had been fighting a severe case of diabetes. Many countries could have been better prepared by learning from what happened elsewhere and implementing safety protocols, including masks and lockdowns, ahead of time.

I will try to show some of the trends and forecast the number of cases and deaths into the future. Much of the modeling relies on univariate approaches, such as ARIMA. The first reference, "On the accuracy of ARIMA based prediction of COVID-19 spread" (Results in Physics, 27 (2021), Article 104509), suggests that ARIMA modeling can be helpful despite its limitations. The second reference discusses SARIMA models, which incorporate the seasonality of the observed variable; these can be complemented with ETS models, which also account for seasonality.

All in all, we will walk through the code and the process step by step.

Data wrangling

Probably the most time-consuming part of the project.

First, I will keep two parallel representations of the data: tsibbles and ts objects. Some functions break when handed vectors or ts objects instead of data frames (and vice versa), so having both lets me move back and forth between them as needed.

# Filter Mexico
train <- train_data %>%
  filter(Country_Region == "Mexico")

# Split into cases and deaths
cases_st <- train[which(train$Target == "ConfirmedCases"),]
deaths_st <- train[which(train$Target == "Fatalities"),]

# Aggregate to national daily totals
cases <- aggregate(TargetValue ~ Date, data = cases_st, FUN = sum)
deaths <- aggregate(TargetValue ~ Date, data = deaths_st, FUN = sum)

#convert to dates
cases$Date <- as.Date(cases$Date)
deaths$Date <- as.Date(deaths$Date)

#convert to tsibble
cases_tb <- tsibble(cases, index = Date)
cases_tb <- cases_tb %>%
  filter_index("2020-02-23" ~.)

deaths_tb <- tsibble(deaths, index = Date)
deaths_tb <- deaths_tb %>%
  filter_index("2020-02-23" ~.)

#convert to ts (TargetValue only, weekly seasonal frequency)
cases_ts <- ts(cases$TargetValue, frequency = 7)
deaths_ts <- ts(deaths$TargetValue, frequency = 7)
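To illustrate the back-and-forth mentioned above, here is a minimal sketch of conversions in the other direction. The object names (cases_df_again, cases_vec, cases_ts_again) are purely illustrative and are not used in the rest of the code.

# tsibble -> plain data frame, ts -> plain vector, tsibble column -> ts
cases_df_again <- as.data.frame(cases_tb)
cases_vec      <- as.numeric(cases_ts)
cases_ts_again <- ts(cases_tb$TargetValue, frequency = 7)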

Data Exploration

Data as is

The first step was to plot the observed time series and look for patterns. The second was to plot the autocorrelation (ACF) and partial autocorrelation (PACF) functions.

# Confirmed Cases
cases_plot <-
autoplot(cases_tb,TargetValue) +
  labs(title = "Confirmed Cases"
       ,x = "Date"
       ,y = "Confirmed Cases") + 
  theme_ipsum_ps() +
  geom_smooth(formula = y ~ x, method = "loess", color = "orange", size = 0.3)

#Deaths
deaths_plot<-
  autoplot(deaths_tb,TargetValue) +
  labs(title = "Deaths"
       ,x = "Date"
       ,y = "Number of Fatalities") + 
  theme_ipsum_ps() +
  geom_smooth(formula = y ~ x, method = "loess", color = "orange", size = 0.3)

grid.arrange(cases_plot,
             deaths_plot, nrow=1, ncol=2)

# ACF Confirmed Cases

# note: frequency belongs to ts(), not subset(); here the series keep the
# default frequency of 1 (a plain daily index), matching the results below
cases_tb2 <- ts(subset(cases_tb, select = "TargetValue"))
deaths_tb2 <- ts(subset(deaths_tb, select = "TargetValue"))

ggtsdisplay(cases_tb2, main = "Confirmed Cases")

ggtsdisplay(deaths_tb2, main = "Deaths")
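Since the ACF hints at a weekly pattern, a seasonal plot can make the day-of-week effect easier to see. This is an optional sketch; gg_season() comes from the feasts package, which the tsibble-based plots and features below also rely on.

# Optional: day-of-week view of confirmed cases (gg_season() is from feasts)
cases_tb %>%
  gg_season(TargetValue, period = "week") +
  labs(title = "Confirmed Cases by Day of Week", y = "Confirmed Cases")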

Stationarity

In this section we assess stationarity for both series and plot the results.

Cases

The plot shows an upward trend, so it is unlikely that this series is stationary. We run a quick KPSS test to confirm that it is NOT stationary and then determine how many differences are needed. The first test rejects stationarity (p-value of 0.01), and ndiffs() suggests one difference. After one difference, however, the KPSS p-value is about 0.04, still between the reported bounds of 0.01 and 0.1, so I took a second difference, after which the p-value reaches 0.1.

# first we plot the differences

# Cases
cases_tb %>%
  gg_tsdisplay(difference(TargetValue), plot_type = 'partial')+
  labs(title = "Difference of 1, Cases")

# Unit Root
cases_tb %>%
  features(TargetValue, unitroot_kpss) %>%
  kbl() %>%
  kable_material(c("striped", "hover"), full_width = T)
## kpss_stat kpss_pvalue
##  2.061807        0.01
paste("Number of Differences = ", ndiffs(cases_tb$TargetValue)) %>%
  kbl() %>%
  kable_material(c("striped", "hover"), full_width = T)
## Number of Differences = 1
# KPSS p-value after one difference is still below 0.1, but above 0.01


order1_cases <-
cases_tb$TargetValue %>% 
  diff() %>% 
  unitroot_kpss

order1_cases %>%
  kbl() %>%
  kable_material(c("striped", "hover"), full_width = T)
## kpss_stat   0.4885659
## kpss_pvalue 0.0442419
# We difference again
order2_cases <-
cases_ts %>% 
  diff() %>% 
  diff() %>% 
  unitroot_kpss

order2_cases %>%
  kbl() %>%
  kable_material(c("striped", "hover"), full_width = T)
## kpss_stat   0.0026919
## kpss_pvalue 0.1000000
cases_tb2 %>% 
  diff() %>% 
  diff() %>% 
  ggtsdisplay(lag.max=30, main = "Cases, order 2 Differences")

Deaths

Deaths required only one difference to become stationary. The ACF of the differenced series shows significant spikes at lags 7 and 14, consistent with a weekly pattern, and remains statistically significant out to roughly lag 16. The PACF shows similar behavior, though with more lags remaining statistically significant over the same range.

# first we plot the differences

# Deaths
deaths_tb %>%
  gg_tsdisplay(difference(TargetValue), plot_type = 'partial')+
  labs(title = "Difference of 1, Deaths")

# Unit Root

deaths_tb %>%
  features(TargetValue, unitroot_kpss) %>%
  kbl() %>%
  kable_material(c("striped", "hover"), full_width = T)
## kpss_stat kpss_pvalue
##  1.880058        0.01
paste("Number of Differences = ", ndiffs(deaths_tb$TargetValue)) %>%
  kbl() %>%
  kable_material(c("striped", "hover"), full_width = T)
## Number of Differences = 1
# KPSS p-value reaches 0.1 after a single difference

order1_deaths <-
deaths_tb$TargetValue %>% 
  diff() %>% 
  unitroot_kpss

order1_deaths %>%
  kbl() %>%
  kable_material(c("striped", "hover"), full_width = T)
## kpss_stat   0.1096631
## kpss_pvalue 0.1000000
# No need to difference again; one difference is sufficient for deaths
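As a cross-check on the weekly signal visible in the ACF plots, STL-based features can quantify trend and seasonal strength for both series. This is a small optional sketch; feat_stl() is part of the feasts package already used above.

# Rough check of trend and weekly seasonal strength for both series
bind_rows(
  cases_tb  %>% features(TargetValue, feat_stl) %>% mutate(series = "Cases"),
  deaths_tb %>% features(TargetValue, feat_stl) %>% mutate(series = "Deaths")
)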

Models

As mentioned before, I fit ARIMA and ETS models, since they appear to give relatively good results and are the approaches most relevant to this short paper.

From the models below, we can see that both ETS and ARIMA specifications provide relatively good results. Their accuracy metrics are close, so we compare them on a held-out test window to decide on the best model. The specifications were refit several times to improve the results.

################
# Cases
holts_cases <- cases_tb
#holts_cases [] = holts_cases [] +1

# train/test split: fit on data through 2020-05-30, hold out the rest

fix_c <- holts_cases %>%
  filter_index(. ~ "2020-05-30")

#models

fit_c <- 
  fix_c %>%
  model( "ETS_c_AAM" = ETS(TargetValue ~ error("A") + trend("A") + season("M"))
        , "ETS_c_AAN" = ETS(TargetValue ~ error("A") + trend("A") + season("N"))
        , "ETS_c_AAA" = ETS(TargetValue ~ error("A") + trend("A") + season("A"))
        , "ETS_c_AAdA" = ETS(TargetValue ~ error("A") + trend("Ad") + season("A")) #damped
        , "ETS_c_auto" = ETS(TargetValue)
        , "ARIMA_c1" = ARIMA(TargetValue ~ pdq(0, 1, 2) + PDQ(0, 1, 1)) #same as in Hyndman book
        , "ARIMA_c2" = ARIMA(TargetValue ~ pdq(2, 1, 0) + PDQ(0, 1, 1))
        , "ARIMA_c3" = ARIMA(TargetValue ~ pdq(1, 1, 1) + PDQ(0, 1, 1))
        , "ARIMA_c_auto" = ARIMA(TargetValue)
        ) 

################
# Deaths

holts_deaths <- deaths_tb
#holts_deaths [] = holts_deaths [] +1


fix_d <- holts_deaths %>%
  filter_index(. ~ "2020-05-30")

# models

fit_d <- fix_d %>%
  model(  "ETS_d_AAM" = ETS(TargetValue ~ error("A") + trend("A") + season("M"))
        , "ETS_d_AAN" = ETS(TargetValue ~ error("A") + trend("A") + season("N"))
        , "ETS_d_AAA" = ETS(TargetValue ~ error("A") + trend("A") + season("A"))
        , "ETS_d_AAdA" = ETS(TargetValue ~ error("A") + trend("Ad") + season("A")) #damped
        , "ETS_d_auto" = ETS(TargetValue)
        , "ARIMA_d1" = ARIMA(TargetValue ~ pdq(0, 1, 2) + PDQ(0, 1, 1)) #same as in Hyndman book
        , "ARIMA_d2" = ARIMA(TargetValue ~ pdq(2, 1, 0) + PDQ(0, 1, 1))
        , "ARIMA_d3" = ARIMA(TargetValue ~ pdq(1, 1, 1) + PDQ(0, 1, 1))
        , "ARIMA_d_auto" = ARIMA(TargetValue)
        )

# forecasts
fc_cases <- 
  fit_c %>%
  forecast(h=28)

fc_death <-
  fit_d %>%
  forecast(h=28)

models_comp <-
bind_rows(
  fit_c %>% accuracy()
  ,fit_d %>% accuracy()
  ,fc_cases %>% accuracy(cases_tb)
  ,fc_death %>% accuracy(deaths_tb)
)


models_comp %>%
  select(
    .model
    ,.type
    ,RMSE
    ,RMSSE
    ,MASE
    ,ACF1
  ) %>%
  kbl() %>%
  kable_material(c("striped", "hover"), full_width = T)
.model .type RMSE RMSSE MASE ACF1
ETS_c_AAM Training 197.94864 0.5223434 0.4525789 0.2316406
ETS_c_AAN Training 194.33885 0.5128179 0.4783194 0.2309706
ETS_c_AAA Training 182.22134 0.4808425 0.5140632 0.1669257
ETS_c_AAdA Training 184.91679 0.4879552 0.5339753 0.1501927
ETS_c_auto Training 194.33885 0.5128179 0.4783194 0.2309706
ARIMA_c1 Training 187.99198 0.4960699 0.4707706 -0.0792275
ARIMA_c2 Training 192.45682 0.5078517 0.4712215 -0.0756834
ARIMA_c3 Training 189.20177 0.4992623 0.4738702 -0.0161504
ARIMA_c_auto Training 181.87299 0.4799232 0.5059969 -0.0461007
ETS_d_AAM Training 34.60012 0.5949892 0.5705695 -0.0522078
ETS_d_AAN Training 53.58992 0.9215410 0.9127866 0.3075097
ETS_d_AAA Training 39.69160 0.6825432 0.7149753 0.0670663
ETS_d_AAdA Training 40.40115 0.6947448 0.7441628 0.0771447
ETS_d_auto Training 39.69160 0.6825432 0.7149753 0.0670663
ARIMA_d1 Training 40.55545 0.6973981 0.6467786 -0.0412796
ARIMA_d2 Training 46.45180 0.7987928 0.6967036 -0.0808972
ARIMA_d3 Training 40.56072 0.6974888 0.6480464 -0.0360262
ARIMA_d_auto Training 31.94064 0.5492565 0.5066420 -0.0843061
ARIMA_c_auto Test 710.18371 1.8740202 2.3425481 0.2884195
ARIMA_c1 Test 497.02829 1.3115495 1.5991439 0.2070650
ARIMA_c2 Test 690.51453 1.8221176 2.2634335 0.2087592
ARIMA_c3 Test 502.59334 1.3262344 1.6054057 0.2239654
ETS_c_AAA Test 506.37019 1.3362007 1.7334207 0.2959944
ETS_c_AAdA Test 531.89299 1.4035498 1.8564574 0.2672352
ETS_c_AAM Test 325.93178 0.8600630 1.1268786 0.2821467
ETS_c_AAN Test 567.75924 1.4981931 1.9236739 0.3214893
ETS_c_auto Test 567.75924 1.4981931 1.9236739 0.3214893
ARIMA_d_auto Test 191.72872 3.2969983 3.5786167 0.2655432
ARIMA_d1 Test 215.58720 3.7072726 4.2218912 0.3336219
ARIMA_d2 Test 217.53060 3.7406916 4.2240468 0.3307464
ARIMA_d3 Test 215.51063 3.7059559 4.2185929 0.3333758
ETS_d_AAA Test 212.91244 3.6612770 4.3802753 0.3468298
ETS_d_AAdA Test 217.28795 3.7365189 4.3111374 0.3344338
ETS_d_AAM Test 190.92684 3.2832090 3.5420457 0.2286799
ETS_d_AAN Test 276.14096 4.7485649 6.4026640 0.4839405
ETS_d_auto Test 212.91244 3.6612770 4.3802753 0.3468298
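To make the table above easier to read, the accuracy results can be filtered to the test window and ranked. This is a small helper sketch using the models_comp object defined above; cases and deaths models are labeled separately since their scales differ.

# Rank candidate models on the held-out window, by series
models_comp %>%
  filter(.type == "Test") %>%
  mutate(series = ifelse(grepl("_c", .model), "Cases", "Deaths")) %>%
  arrange(series, RMSE) %>%
  select(series, .model, RMSE, RMSSE, MASE) %>%
  kbl() %>%
  kable_material(c("striped", "hover"), full_width = T)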

Chosen Models

Cases

In training, the best-performing model for cases was the auto ARIMA, an ARIMA(0,1,5) with drift, but it performed poorly on the test data. On the test data, the best performer was the ETS(A,A,M) model. We plot both below, and the ETS model clearly tracks the test period better.

## Plot ARIMA and ETS

#Cases#

fc_cases %>%
  autoplot(cases_tb, level = 0, size = 0.5) +
  labs(title = "Confirmed Cases"
       ,y = "Confirmed Cases") +
  theme_ipsum_es()

# Model chosen for cases was the ETS(A,A,M)
ets_mod_c <- 
  fix_c %>%
  model("ETS_c_AAM" = ETS(TargetValue ~ error("A") + trend("A") + season("M")))

ets_fc_c <-
  ets_mod_c %>%
  forecast(h = 43)

ets_fc_c %>%
  autoplot(cases_tb, level = 30, size = 0.5) +
  labs(title = "Confirmed Cases, ETS Forecast (A,A,M)"
       ,y = "Confirmed Cases") +
  theme_ipsum_es()

# Second model for cases: the auto ARIMA

arima_mod_c <- 
  fix_c %>%
  model("ARIMA_c_auto" = ARIMA(TargetValue))

arima_fc_c <-
  arima_mod_c %>%
  forecast(h = 43)

arima_fc_c %>%
  autoplot(fix_c, level = 30, size = 0.5) +
  labs(title = "Confirmed Cases, ARIMA Forecast (0,1,5) w/ drift)"
       ,y = "Confirmed Cases") +
  theme_ipsum_es()
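The fable models above were compared only on accuracy, not on residual diagnostics. Here is a quick sketch of how the two retained cases models could be checked, using gg_tsresiduals() and a Ljung-Box test from feasts; lag = 14 is chosen to cover two weekly cycles.

# Residual diagnostics for the retained cases models (sketch)
ets_mod_c %>% gg_tsresiduals()
arima_mod_c %>% gg_tsresiduals()

ets_mod_c %>%
  augment() %>%
  features(.resid, ljung_box, lag = 14)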

Deaths

For the number of deaths, the auto ARIMA, an ARIMA(2,1,2)(0,1,2)[7], performed better than the rest in training, while the ETS(A,A,M), as with confirmed cases, also did well. On the test data, the ETS(A,A,M) performed marginally better than the auto ARIMA.

##################################
#Deaths

fc_death %>%
  autoplot(deaths_tb, level = 0, size = 0.5) +
  labs(title = "Deaths"
       ,y = "Deaths") +
  theme_ipsum_es()

# Model chosen was an ETS for deaths too
ets_mod_d <- 
  fix_d %>%
  model("ETS_d_AAM" = ETS(TargetValue ~ error("A") + trend("A") + season("M")))

ets_fc_d <-
  ets_mod_d %>%
  forecast(h = 43)

ets_fc_d %>%
  autoplot(deaths_tb, level = 30, size = 0.5) +
  labs(title = "Deaths, ETS Forecast"
       ,y = "Deaths") +
  theme_ipsum_es()

# ARIMA for deaths

arima_mod_d <- 
  fix_d %>%
  model("ARIMA_d_auto" = ARIMA(TargetValue))

arima_fc_d <-
  arima_mod_d %>%
  forecast(h = 43)

arima_fc_d %>%
  autoplot(deaths_tb, level = 30, size = 0.5) +
  labs(title = "Deaths, ARIMA Forecast (2,1,2)(0,1,2)[7]"
       ,y = "Deaths") +
  theme_ipsum_es()

Limitations

These models come with limitations. As many statisticians have noted, purely statistical time-series models are not always suitable for infectious-disease forecasting, because the fitted trend can be biased. Moreover, these models are univariate and do not account for exogenous events, such as lockdowns, that could change the trajectory. In other words, unlike an analysis of a country where cases and deaths had already peaked and begun to decline, a model fit to this data can keep trending upward indefinitely in the absence of any intervention.

However, reading the forecast as what could happen in the absence of any intervention, say by policymakers, still provides useful information for deciding when and how to act.
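As a purely illustrative sketch of how exogenous information could enter the model, fable's ARIMA() accepts regressors in the model formula. The mobility variable below is a made-up dummy (an arbitrary cut-off date used as a stand-in for an intervention), not a variable in the competition data.

# Hypothetical dynamic regression: TargetValue explained by an intervention dummy
dynreg_demo <- fix_c %>%
  mutate(mobility = as.numeric(Date >= as.Date("2020-03-23")))  # made-up indicator

fit_dynreg <- dynreg_demo %>%
  model(dynreg = ARIMA(TargetValue ~ mobility + pdq(d = 1)))

# Forecasting this model would require future values of `mobility` supplied via
# new_data(), which is exactly where policy scenarios could be plugged in.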

Future Work

At this point, much work remains. Other models could be tested, such as STL decomposition models and, eventually, neural network approaches.
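As a sketch of what that future work might look like, both an STL decomposition model and a basic neural network autoregression are available directly in fable/feasts. These are untuned illustrations fit on the same training window used above, not results reported in this paper.

# Possible follow-up candidates: STL decomposition + ETS, and NNETAR
fit_future <- fix_c %>%
  model(
    stl_ets = decomposition_model(
      STL(TargetValue ~ season(window = "periodic")),
      ETS(season_adjust ~ season("N"))
    ),
    nnar = NNETAR(TargetValue)
  )

fit_future %>%
  forecast(h = 28) %>%
  autoplot(cases_tb, level = NULL)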

References

“On the accuracy of ARIMA based prediction of COVID-19 spread.” Results in Physics, 27 (2021), Article 104509. https://doi.org/10.1016/j.rinp.2021.104509

“Forecasting the dynamics of cumulative COVID-19 cases (confirmed, recovered and deaths) for top-16 countries using statistical machine learning models: Auto-Regressive Integrated Moving Average (ARIMA) and Seasonal Auto-Regressive Integrated Moving Average (SARIMA).” Applied Soft Computing (2021), Article 107161. https://doi.org/10.1016/j.asoc.2021.107161

Extra - Models v.2

In this section I tried several models using the ts-based commands (the forecast package) rather than the tsibble workflow.

The results were not optimal, and there appears to be autocorrelation left in the residuals of both fits, as the Ljung-Box tests below suggest.

ARIMA, AUTO

The first fit uses auto.arima(), letting the package choose the best-fitting parameters.

From there I tweaked some of the parameters manually, checked the residuals, plotted the charts, and kept the most reasonable model.

#Auto Arima Cases
auto_fit_c <- auto.arima(cases_tb2)
summary(auto_fit_c)
## Series: cases_tb2 
## ARIMA(5,1,4) with drift 
## 
## Coefficients:
##           ar1     ar2      ar3      ar4      ar5     ma1      ma2     ma3
##       -0.8577  0.0944  -0.4016  -1.0749  -0.3057  0.4825  -0.6893  0.0362
## s.e.   0.2513  0.2054   0.1033   0.1397   0.2069  0.2463   0.2040  0.1024
##          ma4    drift
##       0.6097  39.1224
## s.e.  0.1095   8.4292
## 
## sigma^2 estimated as 50119:  log likelihood=-734.84
## AIC=1491.69   AICc=1494.44   BIC=1521.19
## 
## Training set error measures:
##                     ME     RMSE      MAE  MPE MAPE      MASE        ACF1
## Training set 0.2185645 212.2767 155.5616 -Inf  Inf 0.9258089 0.007523936
checkresiduals(auto_fit_c)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(5,1,4) with drift
## Q* = 14.274, df = 3, p-value = 0.002555
## 
## Model df: 10.   Total lags used: 13
#plot
par(mfrow=c(1,1))
plot(cases_tb2, xlim = c(90,109), main = "Confirmed Cases ARIMA (5,1,4) w/ drift")
points(fitted(auto_fit_c), pch = 20, col = "deeppink2")
points(fitted(auto_fit_c), type = "l", col = "deeppink1")

plot(forecast(auto_fit_c, h = 28))

# ARIMA for deaths, order (3,2,1) specified manually after tweaking the auto fit


auto_fit_d <- arima(deaths_tb2[0:109], order=c(3,2,1))
summary(auto_fit_d)
## 
## Call:
## arima(x = deaths_tb2[0:109], order = c(3, 2, 1))
## 
## Coefficients:
##          ar1      ar2      ar3      ma1
##       0.0379  -0.1350  -0.4934  -1.0000
## s.e.  0.0841   0.0842   0.0856   0.0299
## 
## sigma^2 estimated as 7918:  log likelihood = -635.35,  aic = 1280.7
## 
## Training set error measures:
##                    ME     RMSE      MAE  MPE MAPE      MASE       ACF1
## Training set 5.021456 88.16287 42.88906 -Inf  Inf 0.8440268 -0.1512115
checkresiduals(auto_fit_d)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(3,2,1)
## Q* = 30.428, df = 6, p-value = 3.259e-05
## 
## Model df: 4.   Total lags used: 10
#plot
par(mfrow=c(1,1))
plot(forecast(auto_fit_d, h=28))
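To mirror the train/test comparison from the fable section with these ts objects, here is a short sketch of a holdout evaluation using the forecast package: the last 28 days are held out and a refit auto.arima is scored against them. The object names are illustrative only.

# Holdout check for the ts-based workflow
n        <- length(cases_tb2)
train_ts <- window(cases_tb2, end = n - 28)
test_ts  <- as.numeric(window(cases_tb2, start = n - 27))

fit_holdout <- auto.arima(train_ts)
accuracy(forecast(fit_holdout, h = 28), test_ts)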