COVID-19 Forecasting

Introduction

Problem and Significance

This project looks at confirmed COVID-19 cases and deaths in Mexico through June 30, 2020. I chose this country because I was born there and still have family living there, including my mother. My father passed away in June 2020, just as the curve of cases was rising; he was a prime candidate for this virus because he had been fighting a severe case of diabetes. Many countries could have been better prepared by learning from what happened elsewhere and implementing safety protocols, including masks and lockdowns, ahead of time.

I will try to show some of the trends and forecast the number of cases and deaths into the future. Much of the modeling relies on univariate approaches, such as ARIMA. The first reference, "On the accuracy of ARIMA based prediction of COVID-19 spread" (Results in Physics, 27 (2021), Article 104509), suggests that ARIMA modeling can be helpful despite its limitations. The second reference discusses SARIMA models, which incorporate the seasonality of the observed variable; these can be complemented with ETS models, which also account for seasonality.

All in all, we will walk through the code and the process step by step.

Data wrangling

Probably the most time-consuming part of the project.

First, I will keep two parallel representations of the data: tsibbles and ts objects. Some functions break when handed vectors or ts objects instead of data frames (and vice versa), so having both lets me move back and forth between them as needed.

# Filter Mexico
train <- train_data %>%
  filter(Country_Region == "Mexico")

# Split into cases and deaths
cases_st <- train[which(train$Target == "ConfirmedCases"),]
deaths_st <- train[which(train$Target == "Fatalities"),]

# Aggregate to national daily totals
cases <- aggregate(TargetValue ~ Date, data = cases_st, FUN = sum)
deaths <- aggregate(TargetValue ~ Date, data = deaths_st, FUN = sum)

#convert to dates
cases$Date <- as.Date(cases$Date)
deaths$Date <- as.Date(deaths$Date)

#convert to tsibble
cases_tb <- tsibble(cases, index = Date)
cases_tb <- cases_tb %>%
  filter_index("2020-02-23" ~.)

deaths_tb <- tsibble(deaths, index = Date)
deaths_tb <- deaths_tb %>%
  filter_index("2020-02-23" ~.)

#convert to ts (TargetValue only, weekly seasonal frequency)
cases_ts <- ts(cases$TargetValue, frequency = 7)
deaths_ts <- ts(deaths$TargetValue, frequency = 7)
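To illustrate the back-and-forth mentioned above, here is a minimal sketch of conversions in the other direction. The object names (cases_df_again, cases_vec, cases_ts_again) are purely illustrative and are not used in the rest of the code.

# tsibble -> plain data frame, ts -> plain vector, tsibble column -> ts
cases_df_again <- as.data.frame(cases_tb)
cases_vec      <- as.numeric(cases_ts)
cases_ts_again <- ts(cases_tb$TargetValue, frequency = 7)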

Data Exploration

Data as is

The first step was to plot the observed time series and look for patterns. The second was to plot the autocorrelation (ACF) and partial autocorrelation (PACF) functions.

# Confirmed Cases
cases_plot <-
autoplot(cases_tb,TargetValue) +
  labs(title = "Confirmed Cases"
       ,x = "Date"
       ,y = "Confirmed Cases") + 
  theme_ipsum_ps() +
  geom_smooth(formula = y ~ x, method = "loess", color = "orange", size = 0.3)

#Deaths
deaths_plot<-
  autoplot(deaths_tb,TargetValue) +
  labs(title = "Deaths"
       ,x = "Date"
       ,y = "Number of Fatalities") + 
  theme_ipsum_ps() +
  geom_smooth(formula = y ~ x, method = "loess", color = "orange", size = 0.3)

grid.arrange(cases_plot,
             deaths_plot, nrow=1, ncol=2)

# ACF Confirmed Cases

# note: frequency belongs to ts(), not subset(); here the series keep the
# default frequency of 1 (a plain daily index), matching the results below
cases_tb2 <- ts(subset(cases_tb, select = "TargetValue"))
deaths_tb2 <- ts(subset(deaths_tb, select = "TargetValue"))

ggtsdisplay(cases_tb2, main = "Confirmed Cases")

ggtsdisplay(deaths_tb2, main = "Deaths")
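Since the ACF hints at a weekly pattern, a seasonal plot can make the day-of-week effect easier to see. This is an optional sketch; gg_season() comes from the feasts package, which the tsibble-based plots and features below also rely on.

# Optional: day-of-week view of confirmed cases (gg_season() is from feasts)
cases_tb %>%
  gg_season(TargetValue, period = "week") +
  labs(title = "Confirmed Cases by Day of Week", y = "Confirmed Cases")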

Stationarity

In this section we assess stationarity for both series and plot the results.

Cases

The plot shows an upward trend, so it is unlikely that this series is stationary. We run a quick KPSS test to confirm that it is NOT stationary and then determine how many differences are needed. The first test rejects stationarity (p-value of 0.01), and ndiffs() suggests one difference. After one difference, however, the KPSS p-value is about 0.04, still between the reported bounds of 0.01 and 0.1, so I took a second difference, after which the p-value reaches 0.1.

# first we plot the differences

# Cases
cases_tb %>%
  gg_tsdisplay(difference(TargetValue), plot_type = 'partial')+
  labs(title = "Difference of 1, Cases")

# Unit Root
cases_tb %>%
  features(TargetValue, unitroot_kpss) %>%
  kbl() %>%
  kable_material(c("striped", "hover"), full_width = T)
## kpss_stat kpss_pvalue
##  2.061807        0.01
paste("Number of Differences = ", ndiffs(cases_tb$TargetValue)) %>%
  kbl() %>%
  kable_material(c("striped", "hover"), full_width = T)
## Number of Differences = 1
# KPSS p-value after one difference is still below 0.1, but above 0.01


order1_cases <-
cases_tb$TargetValue %>% 
  diff() %>% 
  unitroot_kpss

order1_cases %>%
  kbl() %>%
  kable_material(c("striped", "hover"), full_width = T)
## kpss_stat   0.4885659
## kpss_pvalue 0.0442419
# We difference again
order2_cases <-
cases_ts %>% 
  diff() %>% 
  diff() %>% 
  unitroot_kpss

order2_cases %>%
  kbl() %>%
  kable_material(c("striped", "hover"), full_width = T)
## kpss_stat   0.0026919
## kpss_pvalue 0.1000000
cases_tb2 %>% 
  diff() %>% 
  diff() %>% 
  ggtsdisplay(lag.max=30, main = "Cases, order 2 Differences")

Deaths

Deaths required only one difference to become stationary. The ACF of the differenced series shows significant spikes at lags 7 and 14, consistent with a weekly pattern, and remains statistically significant out to roughly lag 16. The PACF shows similar behavior, though with more lags remaining statistically significant over the same range.

# first we plot the differences

# Deaths
deaths_tb %>%
  gg_tsdisplay(difference(TargetValue), plot_type = 'partial')+
  labs(title = "Difference of 1, Deaths")

# Unit Root

deaths_tb %>%
  features(TargetValue, unitroot_kpss) %>%
  kbl() %>%
  kable_material(c("striped", "hover"), full_width = T)
## kpss_stat kpss_pvalue
##  1.880058        0.01
paste("Number of Differences = ", ndiffs(deaths_tb$TargetValue)) %>%
  kbl() %>%
  kable_material(c("striped", "hover"), full_width = T)
## Number of Differences = 1
# KPSS p-value reaches 0.1 after a single difference

order1_deaths <-
deaths_tb$TargetValue %>% 
  diff() %>% 
  unitroot_kpss

order1_deaths %>%
  kbl() %>%
  kable_material(c("striped", "hover"), full_width = T)
## kpss_stat   0.1096631
## kpss_pvalue 0.1000000
# No need to difference again; one difference is sufficient for deaths
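As a cross-check on the weekly signal visible in the ACF plots, STL-based features can quantify trend and seasonal strength for both series. This is a small optional sketch; feat_stl() is part of the feasts package already used above.

# Rough check of trend and weekly seasonal strength for both series
bind_rows(
  cases_tb  %>% features(TargetValue, feat_stl) %>% mutate(series = "Cases"),
  deaths_tb %>% features(TargetValue, feat_stl) %>% mutate(series = "Deaths")
)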

Models

As mentioned before, I fit ARIMA and ETS models, since they appear to give relatively good results and are the approaches most relevant to this short paper.

From the models below, we can see that both ETS and ARIMA specifications provide relatively good results. Their accuracy metrics are close, so we compare them on a held-out test window to decide on the best model. The specifications were refit several times to improve the results.

################
# Cases
holts_cases <- cases_tb
#holts_cases [] = holts_cases [] +1

# train/test split: fit on data through 2020-05-30, hold out the rest

fix_c <- holts_cases %>%
  filter_index(. ~ "2020-05-30")

#models

fit_c <- 
  fix_c %>%
  model( "ETS_c_AAM" = ETS(TargetValue ~ error("A") + trend("A") + season("M"))
        , "ETS_c_AAN" = ETS(TargetValue ~ error("A") + trend("A") + season("N"))
        , "ETS_c_AAA" = ETS(TargetValue ~ error("A") + trend("A") + season("A"))
        , "ETS_c_AAdA" = ETS(TargetValue ~ error("A") + trend("Ad") + season("A")) #damped
        , "ETS_c_auto" = ETS(TargetValue)
        , "ARIMA_c1" = ARIMA(TargetValue ~ pdq(0, 1, 2) + PDQ(0, 1, 1)) #same as in Hyndman book
        , "ARIMA_c2" = ARIMA(TargetValue ~ pdq(2, 1, 0) + PDQ(0, 1, 1))
        , "ARIMA_c3" = ARIMA(TargetValue ~ pdq(1, 1, 1) + PDQ(0, 1, 1))
        , "ARIMA_c_auto" = ARIMA(TargetValue)
        ) 

################
# Deaths

holts_deaths <- deaths_tb
#holts_deaths [] = holts_deaths [] +1


fix_d <- holts_deaths %>%
  filter_index(. ~ "2020-05-30")

# models

fit_d <- fix_d %>%
  model(  "ETS_d_AAM" = ETS(TargetValue ~ error("A") + trend("A") + season("M"))
        , "ETS_d_AAN" = ETS(TargetValue ~ error("A") + trend("A") + season("N"))
        , "ETS_d_AAA" = ETS(TargetValue ~ error("A") + trend("A") + season("A"))
        , "ETS_d_AAdA" = ETS(TargetValue ~ error("A") + trend("Ad") + season("A")) #damped
        , "ETS_d_auto" = ETS(TargetValue)
        , "ARIMA_d1" = ARIMA(TargetValue ~ pdq(0, 1, 2) + PDQ(0, 1, 1)) #same as in Hyndman book
        , "ARIMA_d2" = ARIMA(TargetValue ~ pdq(2, 1, 0) + PDQ(0, 1, 1))
        , "ARIMA_d3" = ARIMA(TargetValue ~ pdq(1, 1, 1) + PDQ(0, 1, 1))
        , "ARIMA_d_auto" = ARIMA(TargetValue)
        )

# forecasts
fc_cases <- 
  fit_c %>%
  forecast(h=28)

fc_death <-
  fit_d %>%
  forecast(h=28)

models_comp <-
bind_rows(
  fit_c %>% accuracy()
  ,fit_d %>% accuracy()
  ,fc_cases %>% accuracy(cases_tb)
  ,fc_death %>% accuracy(deaths_tb)
)


models_comp %>%
  select(
    .model
    ,.type
    ,RMSE
    ,RMSSE
    ,MASE
    ,ACF1
  ) %>%
  kbl() %>%
  kable_material(c("striped", "hover"), full_width = T)
.model .type RMSE RMSSE MASE ACF1
ETS_c_AAM Training 197.94864 0.5223434 0.4525789 0.2316406
ETS_c_AAN Training 194.33885 0.5128179 0.4783194 0.2309706
ETS_c_AAA Training 182.22134 0.4808425 0.5140632 0.1669257
ETS_c_AAdA Training 184.91679 0.4879552 0.5339753 0.1501927
ETS_c_auto Training 194.33885 0.5128179 0.4783194 0.2309706
ARIMA_c1 Training 187.99198 0.4960699 0.4707706 -0.0792275
ARIMA_c2 Training 192.45682 0.5078517 0.4712215 -0.0756834
ARIMA_c3 Training 189.20177 0.4992623 0.4738702 -0.0161504
ARIMA_c_auto Training 181.87299 0.4799232 0.5059969 -0.0461007
ETS_d_AAM Training 34.60012 0.5949892 0.5705695 -0.0522078
ETS_d_AAN Training 53.58992 0.9215410 0.9127866 0.3075097
ETS_d_AAA Training 39.69160 0.6825432 0.7149753 0.0670663
ETS_d_AAdA Training 40.40115 0.6947448 0.7441628 0.0771447
ETS_d_auto Training 39.69160 0.6825432 0.7149753 0.0670663
ARIMA_d1 Training 40.55545 0.6973981 0.6467786 -0.0412796
ARIMA_d2 Training 46.45180 0.7987928 0.6967036 -0.0808972
ARIMA_d3 Training 40.56072 0.6974888 0.6480464 -0.0360262
ARIMA_d_auto Training 31.94064 0.5492565 0.5066420 -0.0843061
ARIMA_c_auto Test 710.18371 1.8740202 2.3425481 0.2884195
ARIMA_c1 Test 497.02829 1.3115495 1.5991439 0.2070650
ARIMA_c2 Test 690.51453 1.8221176 2.2634335 0.2087592
ARIMA_c3 Test 502.59334 1.3262344 1.6054057 0.2239654
ETS_c_AAA Test 506.37019 1.3362007 1.7334207 0.2959944
ETS_c_AAdA Test 531.89299 1.4035498 1.8564574 0.2672352
ETS_c_AAM Test 325.93178 0.8600630 1.1268786 0.2821467
ETS_c_AAN Test 567.75924 1.4981931 1.9236739 0.3214893
ETS_c_auto Test 567.75924 1.4981931 1.9236739 0.3214893
ARIMA_d_auto Test 191.72872 3.2969983 3.5786167 0.2655432
ARIMA_d1 Test 215.58720 3.7072726 4.2218912 0.3336219
ARIMA_d2 Test 217.53060 3.7406916 4.2240468 0.3307464
ARIMA_d3 Test 215.51063 3.7059559 4.2185929 0.3333758
ETS_d_AAA Test 212.91244 3.6612770 4.3802753 0.3468298
ETS_d_AAdA Test 217.28795 3.7365189 4.3111374 0.3344338
ETS_d_AAM Test 190.92684 3.2832090 3.5420457 0.2286799
ETS_d_AAN Test 276.14096 4.7485649 6.4026640 0.4839405
ETS_d_auto Test 212.91244 3.6612770 4.3802753 0.3468298
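To make the table above easier to read, the accuracy results can be filtered to the test window and ranked. This is a small helper sketch using the models_comp object defined above; cases and deaths models are labeled separately since their scales differ.

# Rank candidate models on the held-out window, by series
models_comp %>%
  filter(.type == "Test") %>%
  mutate(series = ifelse(grepl("_c", .model), "Cases", "Deaths")) %>%
  arrange(series, RMSE) %>%
  select(series, .model, RMSE, RMSSE, MASE) %>%
  kbl() %>%
  kable_material(c("striped", "hover"), full_width = T)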

Chosen Models

Cases

In training, the best-performing model for cases was the auto ARIMA, an ARIMA(0,1,5) with drift, but it performed poorly on the test data. On the test data, the best performer was the ETS(A,A,M) model. We plot both below, and the ETS model clearly tracks the test period better.

## Plot ARIMA and ETS

#Cases#

fc_cases %>%
  autoplot(cases_tb, level = 0, size = 0.5) +
  labs(title = "Confirmed Cases"
       ,y = "Confirmed Cases") +
  theme_ipsum_es()

# Model chosen for cases was the ETS(A,A,M)
ets_mod_c <- 
  fix_c %>%
  model("ETS_c_AAM" = ETS(TargetValue ~ error("A") + trend("A") + season("M")))

ets_fc_c <-
  ets_mod_c %>%
  forecast(h = 43)

ets_fc_c %>%
  autoplot(cases_tb, level = 30, size = 0.5) +
  labs(title = "Confirmed Cases, ETS Forecast (A,A,M)"
       ,y = "Confirmed Cases") +
  theme_ipsum_es()

# Second model for cases: the auto ARIMA

arima_mod_c <- 
  fix_c %>%
  model("ARIMA_c_auto" = ARIMA(TargetValue))

arima_fc_c <-
  arima_mod_c %>%
  forecast(h = 43)

arima_fc_c %>%
  autoplot(fix_c, level = 30, size = 0.5) +
  labs(title = "Confirmed Cases, ARIMA Forecast (0,1,5) w/ drift)"
       ,y = "Confirmed Cases") +
  theme_ipsum_es()
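The fable models above were compared only on accuracy, not on residual diagnostics. Here is a quick sketch of how the two retained cases models could be checked, using gg_tsresiduals() and a Ljung-Box test from feasts; lag = 14 is chosen to cover two weekly cycles.

# Residual diagnostics for the retained cases models (sketch)
ets_mod_c %>% gg_tsresiduals()
arima_mod_c %>% gg_tsresiduals()

ets_mod_c %>%
  augment() %>%
  features(.resid, ljung_box, lag = 14)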

Deaths

For the number of deaths, the auto ARIMA, an ARIMA(2,1,2)(0,1,2)[7], performed better than the rest in training, while the ETS(A,A,M), as with confirmed cases, also did well. On the test data, the ETS(A,A,M) performed marginally better than the auto ARIMA.

##################################
#Deaths

fc_death %>%
  autoplot(deaths_tb, level = 0, size = 0.5) +
  labs(title = "Deaths"
       ,y = "Deaths") +
  theme_ipsum_es()

# Model chosen was an ETS for deaths too
ets_mod_d <- 
  fix_d %>%
  model("ETS_d_AAM" = ETS(TargetValue ~ error("A") + trend("A") + season("M")))

ets_fc_d <-
  ets_mod_d %>%
  forecast(h = 43)

ets_fc_d %>%
  autoplot(deaths_tb, level = 30, size = 0.5) +
  labs(title = "Deaths, ETS Forecast"
       ,y = "Deaths") +
  theme_ipsum_es()

# ARIMA for deaths

arima_mod_d <- 
  fix_d %>%
  model("ARIMA_d_auto" = ARIMA(TargetValue))

arima_fc_d <-
  arima_mod_d %>%
  forecast(h = 43)

arima_fc_d %>%
  autoplot(deaths_tb, level = 30, size = 0.5) +
  labs(title = "Deaths, ARIMA Forecast (2,1,2)(0,1,2)[7]"
       ,y = "Deaths") +
  theme_ipsum_es()

Limitations

These models come with limitations. As many statisticians have noted, purely statistical time-series models are not always suitable for infectious-disease forecasting, because the fitted trend can be biased. Moreover, these models are univariate and do not account for exogenous events, such as lockdowns, that could change the trajectory. In other words, unlike an analysis of a country where cases and deaths had already peaked and begun to decline, a model fit to this data can keep trending upward indefinitely in the absence of any intervention.

However, reading the forecast as what could happen in the absence of any intervention, say by policymakers, still provides useful information for deciding when and how to act.
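As a purely illustrative sketch of how exogenous information could enter the model, fable's ARIMA() accepts regressors in the model formula. The mobility variable below is a made-up dummy (an arbitrary cut-off date used as a stand-in for an intervention), not a variable in the competition data.

# Hypothetical dynamic regression: TargetValue explained by an intervention dummy
dynreg_demo <- fix_c %>%
  mutate(mobility = as.numeric(Date >= as.Date("2020-03-23")))  # made-up indicator

fit_dynreg <- dynreg_demo %>%
  model(dynreg = ARIMA(TargetValue ~ mobility + pdq(d = 1)))

# Forecasting this model would require future values of `mobility` supplied via
# new_data(), which is exactly where policy scenarios could be plugged in.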

Future Work

At this point, much work remains. Other models could be tested, such as STL decomposition models and, eventually, neural network approaches.
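As a sketch of what that future work might look like, both an STL decomposition model and a basic neural network autoregression are available directly in fable/feasts. These are untuned illustrations fit on the same training window used above, not results reported in this paper.

# Possible follow-up candidates: STL decomposition + ETS, and NNETAR
fit_future <- fix_c %>%
  model(
    stl_ets = decomposition_model(
      STL(TargetValue ~ season(window = "periodic")),
      ETS(season_adjust ~ season("N"))
    ),
    nnar = NNETAR(TargetValue)
  )

fit_future %>%
  forecast(h = 28) %>%
  autoplot(cases_tb, level = NULL)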

References

“On the accuracy of ARIMA based prediction of COVID-19 spread.” Results in Physics, 27 (2021), Article 104509. https://doi.org/10.1016/j.rinp.2021.104509

“Forecasting the dynamics of cumulative COVID-19 cases (confirmed, recovered and deaths) for top-16 countries using statistical machine learning models: Auto-Regressive Integrated Moving Average (ARIMA) and Seasonal Auto-Regressive Integrated Moving Average (SARIMA).” Applied Soft Computing (2021), Article 107161. https://doi.org/10.1016/j.asoc.2021.107161

Extra - Models v.2

In this section I tried several models using the ts-based commands (the forecast package) rather than the tsibble workflow.

The results were not optimal, and there appears to be autocorrelation left in the residuals of both fits, as the Ljung-Box tests below suggest.

ARIMA, AUTO

The first fit uses auto.arima(), letting the package choose the best-fitting parameters.

From there I tweaked some of the parameters manually, checked the residuals, plotted the charts, and kept the most reasonable model.

#Auto Arima Cases
auto_fit_c <- auto.arima(cases_tb2)
summary(auto_fit_c)
## Series: cases_tb2 
## ARIMA(5,1,4) with drift 
## 
## Coefficients:
##           ar1     ar2      ar3      ar4      ar5     ma1      ma2     ma3
##       -0.8577  0.0944  -0.4016  -1.0749  -0.3057  0.4825  -0.6893  0.0362
## s.e.   0.2513  0.2054   0.1033   0.1397   0.2069  0.2463   0.2040  0.1024
##          ma4    drift
##       0.6097  39.1224
## s.e.  0.1095   8.4292
## 
## sigma^2 estimated as 50119:  log likelihood=-734.84
## AIC=1491.69   AICc=1494.44   BIC=1521.19
## 
## Training set error measures:
##                     ME     RMSE      MAE  MPE MAPE      MASE        ACF1
## Training set 0.2185645 212.2767 155.5616 -Inf  Inf 0.9258089 0.007523936
checkresiduals(auto_fit_c)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(5,1,4) with drift
## Q* = 14.274, df = 3, p-value = 0.002555
## 
## Model df: 10.   Total lags used: 13
#plot
par(mfrow=c(1,1))
plot(cases_tb2, xlim = c(90,109), main = "Confirmed Cases ARIMA (5,1,4) w/ drift")
points(fitted(auto_fit_c), pch = 20, col = "deeppink2")
points(fitted(auto_fit_c), type = "l", col = "deeppink1")

plot(forecast(auto_fit_c, h = 28))

# ARIMA for deaths, order (3,2,1) specified manually after tweaking the auto fit


auto_fit_d <- arima(deaths_tb2[0:109], order=c(3,2,1))
summary(auto_fit_d)
## 
## Call:
## arima(x = deaths_tb2[0:109], order = c(3, 2, 1))
## 
## Coefficients:
##          ar1      ar2      ar3      ma1
##       0.0379  -0.1350  -0.4934  -1.0000
## s.e.  0.0841   0.0842   0.0856   0.0299
## 
## sigma^2 estimated as 7918:  log likelihood = -635.35,  aic = 1280.7
## 
## Training set error measures:
##                    ME     RMSE      MAE  MPE MAPE      MASE       ACF1
## Training set 5.021456 88.16287 42.88906 -Inf  Inf 0.8440268 -0.1512115
checkresiduals(auto_fit_d)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(3,2,1)
## Q* = 30.428, df = 6, p-value = 3.259e-05
## 
## Model df: 4.   Total lags used: 10
#plot
par(mfrow=c(1,1))
plot(forecast(auto_fit_d, h=28))
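To mirror the train/test comparison from the fable section with these ts objects, here is a short sketch of a holdout evaluation using the forecast package: the last 28 days are held out and a refit auto.arima is scored against them. The object names are illustrative only.

# Holdout check for the ts-based workflow
n        <- length(cases_tb2)
train_ts <- window(cases_tb2, end = n - 28)
test_ts  <- as.numeric(window(cases_tb2, start = n - 27))

fit_holdout <- auto.arima(train_ts)
accuracy(forecast(fit_holdout, h = 28), test_ts)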