The aim of this paper is to build a model that accurately predicts daily COVID-19 cases in Colombia using only past values of the series. Such a model can help us understand how the virus spreads and, better yet, support a wide range of decisions. We will use two types of model, ARIMA and ETS, to specify the coronavirus series.
As mentioned above, a correctly specified model that forecasts COVID-19 cases can be useful in many areas of life. For example, it could help countries decide how many vaccines or tests to purchase, help people plan their vacations, or help investors decide what type of business to invest in.
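library(tidyverse)  # dplyr (filter, %>%), ggplot2 and tibble::view
library(forecast)   # ets(), auto.arima(), forecast(), accuracy(), autoplot()

# `train` is assumed to be already loaded and to contain only the rows for Colombia
view(train)
colombia_casos <- train %>%
  filter(Target == 'ConfirmedCases')
ggplot(data = colombia_casos) +
  geom_line(mapping = aes(x = Date, y = TargetValue)) +
  labs(title = "Colombian Daily Covid cases")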
As we can see, daily COVID-19 cases in Colombia show a clear upward trend, and they also appear to exhibit seasonality.
We are going to use two types of models, ETS and ARIMA, both commonly used to specify time series. ETS models belong to the exponential smoothing family, of which the Holt-Winters method is an extension that adds trend and seasonal components, and they are suited to short-term forecasting. That is why we only forecast 6 days ahead.
On the other hand, we have the ARIMA method. An ARMA model describes a stationary series as a combination of its own past values (the autoregressive part) and past error terms (the moving-average part). ARIMA additionally lets us difference the series in order to make it stationary before the ARMA structure is fitted.
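For reference, the general ARIMA(p, d, q) model can be written with the backshift operator $B$, where $B y_t = y_{t-1}$. The symbols below follow standard textbook notation and are not taken from the papers reviewed here:

```latex
\left(1 - \phi_1 B - \dots - \phi_p B^p\right)(1 - B)^d \, y_t
  = c + \left(1 + \theta_1 B + \dots + \theta_q B^q\right)\varepsilon_t
```

Here $p$ is the autoregressive order, $d$ the number of differences, $q$ the moving-average order, and $\varepsilon_t$ a white-noise error. Setting $d = 0$ recovers the stationary ARMA(p, q) model.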
The first paper that we review is “ARIMA modelling & forecasting of COVID-19 in top five affected countries” by Alok Kumar Sahai, Namita Rath, Vishal Sood and Manvendra Pratap Singh. This paper uses ARIMA models to forecast cases in the five most affected countries using data from 15 February to 30 June 2020. The authors arrive at a different specification of the ARIMA model for each country. Among these results, we find an ARIMA(4,2,4) for Spain and an ARIMA(3,0,0) for Russia. The latter is curious, since it implies that the best model to predict COVID-19 cases in Russia is an AR(3) with no need for an MA or differencing component.
“Comparison of ARIMA, ETS, NNAR, TBATS and hybrid models to forecast the second wave of COVID-19 hospitalizations in Italy” by Perone G. goes a bit further by also testing ETS specifications. The main conclusion was that, although the models manage to predict values accurately in the short term, many other factors, such as quarantines or public policies, must be taken into account, and these cannot be captured or predicted from the data alone.
To predict COVID-19 cases in Colombia, we will use RStudio to select an ETS model and an ARIMA model automatically.
colombia <- ts(colombia_casos$TargetValue, frequency = 365, start = c(2020, 23))
ETS1 <- ets(colombia)
ETS1
## ETS(A,A,N)
##
## Call:
## ets(y = colombia)
##
## Smoothing parameters:
## alpha = 0.0138
## beta = 0.0138
##
## Initial states:
## l = 0.0703
## b = 0.071
##
## sigma: 313.7924
##
## AIC AICc BIC
## 2307.417 2307.864 2322.125
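The small smoothing parameters (alpha = beta = 0.0138) indicate that the level and the trend are updated very slowly: each new observation receives little weight, so the fitted values react only gradually to recent changes in the series.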
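The ETS(A,A,N) model reported above has additive errors, an additive trend and no seasonal component. In the usual state-space notation (the symbols $\ell_t$ for the level and $b_t$ for the slope are an assumption about notation, not part of the R output), it can be sketched as:

```latex
\begin{aligned}
y_t    &= \ell_{t-1} + b_{t-1} + \varepsilon_t \\
\ell_t &= \ell_{t-1} + b_{t-1} + \alpha\,\varepsilon_t \\
b_t    &= b_{t-1} + \beta\,\varepsilon_t \\
\hat{y}_{t+h \mid t} &= \ell_t + h\,b_t
\end{aligned}
```

With $\alpha = \beta = 0.0138$, a one-step error $\varepsilon_t$ shifts the level and the slope by only about 1.4% of its size.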
plot(ETS1)
accuracy(ETS1)
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 22.67025 309.2772 113.4089 NaN Inf NaN -0.3413837
predicETS <- forecast(ETS1, h = 6, level = 0)
# Compares the 6-day forecast with colombia[1:6], i.e. the first six observations
acc <- accuracy(predicETS, colombia[1:6])
plot(predicETS, main="AAN Forecast" , ylab = "Cases", xlab = "Date", bty = "l")
ARIMA1 <- auto.arima(colombia)
ARIMA1
## Series: colombia
## ARIMA(3,1,1) with drift
##
## Coefficients:
## ar1 ar2 ar3 ma1 drift
## -1.6023 -1.5436 -0.6589 0.2884 9.8675
## s.e. 0.1170 0.1340 0.1045 0.1246 5.3756
##
## sigma^2 = 56923: log likelihood = -957.69
## AIC=1927.37 AICc=1928.01 BIC=1944.98
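Written out with the estimates above, and assuming the parameterization used by the forecast package (in which the drift is the mean of the differenced series), the fitted model reads approximately:

```latex
\left(1 + 1.6023\,B + 1.5436\,B^2 + 0.6589\,B^3\right)\left(\Delta y_t - 9.8675\right)
  = \left(1 + 0.2884\,B\right)\varepsilon_t,
\qquad \Delta y_t = y_t - y_{t-1}
```

Under this reading, the drift implies that daily cases rise by about 9.9 per day on average.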
plot(ARIMA1)
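prdicARMIA <- forecast(ARIMA1, 6)
# Caution: colombia[1:6] holds the first six observations of the series rather than
# the six forecast days, so the "Test set" row below is not a true out-of-sample error
acc2 <- accuracy(prdicARMIA, colombia[1:6])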
acc2
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -9.180263e-02 233.4161 105.0572 -Inf Inf 0.6068129 0.03515189
## Test set -1.390909e+03 1635.1993 1390.9087 -Inf Inf 8.0339208 NA
plot(prdicARMIA, main="Auto-ARIMA")
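# Hold-out split: fit on the first 134 observations, evaluate on the remainder.
# As written, observation 134 appears in both sets, and the ETS horizon (h = 12)
# differs from the ARIMA horizon (h = 6) used below.
train <- ts(colombia[1:134], frequency = 365, start = c(2020, 23))
test <- ts(colombia[134:140], frequency = 365, start = c(2020, 159))
ets_train <- ets(train)
forecast_ets <- forecast(ets_train, h = 12)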
autoplot(colombia) +
autolayer(forecast_ets, series = "ETS Model")
arima_train <- auto.arima(train)
forecast_arima <- forecast(arima_train, h=6)
autoplot(colombia) +
autolayer(forecast_arima, series = "ARIMA Model")
ets_acc <- accuracy(forecast_ets, test)
ets_acc
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 12.72076 161.0531 64.17528 NaN Inf NaN -0.2964485
## Test set -447.74414 1294.0967 1093.39270 -Inf Inf NaN -0.3132315
## Theil's U
## Training set NA
## Test set 0
arima_acc <- accuracy(forecast_arima, test)
arima_acc
## ME RMSE MAE MPE MAPE MASE ACF1 Theil's U
## Training set 0.5973379 123.9492 68.88739 -Inf Inf NaN -0.05134434 NA
## Test set 6.1341164 1192.7855 950.47951 -Inf Inf NaN -0.35495502 0
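The hold-out evaluation above can also be arranged so that the training and test windows share no observation. The following is a minimal sketch on simulated data, using base R only. The simulated series, the ARIMA(1,1,0) order with drift, and the names `train_y`, `test_y` and `fc` are illustrative assumptions, not the models fitted in this paper.

```r
# Hypothetical stand-in for the Colombian series: 140 simulated daily counts.
set.seed(123)
y <- ts(cumsum(rpois(140, lambda = 5)), frequency = 365, start = c(2020, 23))

h <- 6            # forecast horizon in days
n <- length(y)

# Non-overlapping split: the last h observations are held out for testing.
train_y <- window(y, end = time(y)[n - h])
test_y  <- window(y, start = time(y)[n - h + 1])

# ARIMA(1,1,0) with drift, fitted with stats::arima. Using a time index as
# regressor together with d = 1 adds a constant drift to the differenced series.
t_train <- seq_len(n - h)
t_test  <- (n - h + 1):n
fit <- arima(train_y, order = c(1, 1, 0), xreg = t_train)
fc  <- predict(fit, n.ahead = h, newxreg = t_test)$pred

# Out-of-sample errors: actual minus forecast over the held-out days.
err  <- as.numeric(test_y) - as.numeric(fc)
rmse <- sqrt(mean(err^2))
mae  <- mean(abs(err))
print(c(RMSE = rmse, MAE = mae))
```

Because the forecasts start exactly one step after the last training observation, each forecast is compared with the actual value of the same day.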
As we can see from the graphs, our predictions are not that close to the real values. Even so, if we had to choose a model looking only at the test-set RMSE, we would choose the ARIMA(3,1,1) model (1192.8 versus 1294.1 for ETS). This suggests that there is other information that we are not taking into account.
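The accuracy tables above report MAPE as Inf, which typically happens when at least one actual value is zero, so RMSE and MAE are the usable measures here. A small sketch with made-up numbers (not taken from the Colombian data) shows how the three measures are computed:

```r
# Made-up actual values and forecasts, for illustration only.
actual    <- c(0, 120, 130)
predicted <- c(10, 115, 140)

err  <- actual - predicted               # forecast errors: -10, 5, -10
rmse <- sqrt(mean(err^2))                # sqrt(75), about 8.66
mae  <- mean(abs(err))                   # 25 / 3, about 8.33
mape <- mean(abs(err / actual)) * 100    # Inf, because the first actual value is 0

print(c(RMSE = rmse, MAE = mae, MAPE = mape))
```

RMSE penalizes large errors more heavily than MAE, which is one reason the two can rank models differently.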
As expected, one of the main limitations is the lack of information available to specify the model. There are variables that we cannot take into account, such as public policies, test availability or even the number of tests carried out by medical personnel. We can hardly expect the residuals to resemble white noise; it is likely that they still contain a lot of information that we cannot control for. Even so, in the short term the series could be predicted relatively accurately.
The first thing that comes to mind is to use a VAR model and perform an impulse-response analysis. Variables such as daily deaths can have an effect in the medium term, since they can dissuade people from certain behaviors that carry a risk of contagion.
Methodologies other than those used in this paper could also be tested. We could even break down the series by city or region, since it is possible that the series has a different shape depending on the customs of the people or geographical conditions.
The challenges in this type of work range from programming to knowledge of the theory. In both cases, a lot was learned, especially when seeing errors in the predictions. What do these errors depend on? Are we using the commands wrong? Are we making theoretical errors? These types of questions are quite useful since the reason for our errors is not always clear and they lead us to review every detail.