Midterm

Juan Esteban Rincón Poveda

The problem

The goal of this paper is to build a model that accurately predicts daily COVID-19 cases in Colombia using only past values of the series. Such a model can help us understand how the virus spreads and, better yet, support many kinds of decisions. We will use two types of models, ARIMA and ETS, to specify the coronavirus series.

Significance

As mentioned above, a correctly specified model that forecasts future COVID-19 cases can be useful in many areas of life. For example, it could help countries decide how many vaccines or tests to purchase, help people plan their vacations, or help investors decide what type of business to invest in.

The data

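library(tidyverse)  # dplyr verbs and ggplot2
library(forecast)   # ets(), auto.arima(), forecast(), accuracy()

view(train)
# Keep only the rows that record confirmed cases
colombia_casos <- train %>% 
  filter(Target == 'ConfirmedCases')

# Plot the daily confirmed cases over time
ggplot(data = colombia_casos) +
  geom_line(mapping = aes(x = Date, y = TargetValue)) + 
  labs(title = "Colombian Daily Covid cases")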

As the plot shows, daily COVID-19 cases in Colombia have a clear upward trend, and they also appear to follow a seasonal pattern.
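One way to probe the apparent seasonality is to look at the autocorrelation of the differenced series at weekly lags. The sketch below is only an illustration: `cases` is a simulated, hypothetical stand-in for `colombia_casos$TargetValue`, and the weekly pattern is an assumption built into the simulation rather than a finding about the Colombian data.

```r
# Hypothetical stand-in for colombia_casos$TargetValue: a trend plus a weekly pattern.
set.seed(1)
n <- 140
weekly <- rep(c(0, 5, 10, 15, 10, -20, -20), length.out = n)
cases <- 2 * seq_len(n) + weekly + rnorm(n, sd = 2)

# First differences remove the trend so the weekly pattern is easier to see.
d_cases <- diff(cases)

# Autocorrelation up to three weeks; spikes at lags 7, 14 and 21 suggest weekly seasonality.
acf_vals <- acf(d_cases, lag.max = 21, plot = FALSE)
lag7 <- acf_vals$acf[8]  # element 1 is lag 0, so element 8 is lag 7
print(round(lag7, 2))
```

Applied to the real series, a lag-7 autocorrelation close to zero would argue against a weekly pattern.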

Types of Models

We are going to use two types of models, ETS and ARIMA, both commonly used to specify time series. Holt's method extends simple exponential smoothing with a trend component, and the Holt-Winters method extends it further with a seasonal component. Exponential smoothing is best suited to short-term forecasting, which is why we forecast only 6 days ahead.
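To make the smoothing idea concrete, here is a minimal sketch of Holt's linear trend method written out by hand. The function name `holt_manual`, its arguments, and the simple initialisation are illustrative assumptions; `ets()` estimates the parameters and initial states by maximum likelihood instead of taking them as inputs.

```r
# Holt's linear trend method (additive trend, no seasonality), textbook component form.
# In the state-space form used by ets(), the reported beta corresponds to alpha * beta_star.
holt_manual <- function(y, alpha, beta_star, h = 6) {
  level <- y[1]          # simple initial level (assumption)
  trend <- y[2] - y[1]   # simple initial trend (assumption)
  for (t in 2:length(y)) {
    prev_level <- level
    # New level: blend of the latest observation and the previous one-step forecast.
    level <- alpha * y[t] + (1 - alpha) * (prev_level + trend)
    # New trend: blend of the latest change in level and the previous trend.
    trend <- beta_star * (level - prev_level) + (1 - beta_star) * trend
  }
  # h-step-ahead forecasts extend the last level along the last trend.
  level + seq_len(h) * trend
}

# For a perfectly linear series the forecasts simply continue the line.
y <- c(10, 12, 14, 16, 18, 20)
print(holt_manual(y, alpha = 0.5, beta_star = 0.5, h = 3))  # 22 24 26
```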

On the other hand, we have the ARIMA method. An ARMA model describes a stationary series as a linear function of its own past values (the autoregressive part) and of past error terms (the moving-average part), where the errors are white noise. ARIMA additionally allows us to difference the series one or more times in order to make it stationary before the ARMA structure is fitted.
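The effect of differencing can be illustrated with a simulated random walk with drift, which trends in levels but is stationary after one difference. The data and the names `drift`, `shocks`, `y` and `dy` are hypothetical and only serve the illustration.

```r
# Simulated random walk with drift: non-stationary in levels.
set.seed(42)
drift <- 10
shocks <- rnorm(200, mean = 0, sd = 5)
y <- cumsum(drift + shocks)

# The "I" in ARIMA with d = 1: model the day-to-day changes instead of the levels.
dy <- diff(y, differences = 1)

# The differenced series fluctuates around the drift instead of trending.
print(mean(dy))
print(length(dy))  # one observation is lost per difference
```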

Literature Review

The first paper we review is “ARIMA modelling & forecasting of COVID-19 in top five affected countries” by Alok Kumar Sahai, Namita Rath, Vishal Sood and Manvendra Pratap Singh. This paper uses ARIMA models to forecast cases in the five most affected countries using data from February 15 to June 30, 2020. The authors arrive at a different ARIMA specification for each country. Among the results are an ARIMA(4,2,4) for Spain and an ARIMA(3,0,0) for Russia. The latter is curious, since it implies that the best model for predicting COVID-19 cases in Russia is an AR(3) with no MA component and no differencing.

“Comparison of ARIMA, ETS, NNAR, TBATS and hybrid models to forecast the second wave of COVID-19 hospitalizations in Italy” by Perone G. goes a step further by also testing ETS specifications. His main conclusion is that, although the models predict short-term values accurately, many other factors matter, such as quarantines or public policies, and these cannot be captured or predicted from the data alone.

Building the Models

To predict COVID-19 cases in Colombia, we will use RStudio to select an ETS model and an ARIMA model automatically.

Holt's Exponential Smoothing for Colombian Covid Cases

# Daily series starting on day 23 of 2020 (January 23)
colombia <- ts(colombia_casos$TargetValue, frequency = 365, start = c(2020, 23))

ETS1<-ets(colombia)
ETS1
## ETS(A,A,N) 
## 
## Call:
##  ets(y = colombia) 
## 
##   Smoothing parameters:
##     alpha = 0.0138 
##     beta  = 0.0138 
## 
##   Initial states:
##     l = 0.0703 
##     b = 0.071 
## 
##   sigma:  313.7924
## 
##      AIC     AICc      BIC 
## 2307.417 2307.864 2322.125

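The small smoothing parameters (alpha = beta = 0.0138) indicate that the level and the trend are updated very slowly: each new observation has little influence on the fitted state, and the weight placed on past observations decays slowly.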
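The role of alpha can be illustrated with the weights of simple exponential smoothing, where the observation j steps back receives weight alpha * (1 - alpha)^j. This is a simplification, since the fitted model also has a trend component, and the helper `ses_weights` and the comparison value 0.8 are hypothetical.

```r
# Weight that simple exponential smoothing puts on the observation j steps back:
# alpha * (1 - alpha)^j, for j = 0, 1, 2, ...
ses_weights <- function(alpha, n_lags) {
  alpha * (1 - alpha)^(0:(n_lags - 1))
}

w_small <- ses_weights(alpha = 0.0138, n_lags = 7)  # alpha reported by ets() above
w_large <- ses_weights(alpha = 0.8, n_lags = 7)     # hypothetical comparison value

print(round(w_small, 4))
print(round(w_large, 4))

# Share of the total weight carried by the most recent week of data.
print(sum(w_small))
print(sum(w_large))
```

With a small alpha the most recent week carries only a small share of the total weight, so the smoothed level reacts slowly to sudden changes in the series.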

plot(ETS1)

accuracy(ETS1)
##                    ME     RMSE      MAE MPE MAPE MASE       ACF1
## Training set 22.67025 309.2772 113.4089 NaN  Inf  NaN -0.3413837
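predicETS <- forecast(ETS1, h = 6, level = 0)
# colombia[1:6] holds the first six observations of the series, not values after the forecast origin
acc <- accuracy(predicETS, colombia[1:6])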

plot(predicETS, main="AAN Forecast" , ylab = "Cases", xlab = "Date", bty = "l")

ARIMA

ARIMA1 <- auto.arima(colombia)
ARIMA1
## Series: colombia 
## ARIMA(3,1,1) with drift 
## 
## Coefficients:
##           ar1      ar2      ar3     ma1   drift
##       -1.6023  -1.5436  -0.6589  0.2884  9.8675
## s.e.   0.1170   0.1340   0.1045  0.1246  5.3756
## 
## sigma^2 = 56923:  log likelihood = -957.69
## AIC=1927.37   AICc=1928.01   BIC=1944.98
plot(ARIMA1)

predicARIMA <- forecast(ARIMA1, 6)
# colombia[1:6] holds the first six observations of the series, so the "Test set" row
# below is not a true out-of-sample error
acc2 <- accuracy(predicARIMA, colombia[1:6])
acc2
##                         ME      RMSE       MAE  MPE MAPE      MASE       ACF1
## Training set -9.180263e-02  233.4161  105.0572 -Inf  Inf 0.6068129 0.03515189
## Test set     -1.390909e+03 1635.1993 1390.9087 -Inf  Inf 8.0339208         NA
plot(predicARIMA, main="Auto-ARIMA")

Test vs Train ETS

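# This reuses the name `train`, replacing the data frame loaded earlier
train <- ts(colombia[1:134], frequency = 365, start = c(2020,23))
# Observation 134 is in both windows; it falls on day 156 of 2020, while this series is labeled from day 159
test <- ts(colombia[134:140], frequency = 365, start= c(2020,159))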

ets_train  <- ets(train)
forecast_ets <- forecast(ets_train, h=12)
autoplot(colombia) +
  autolayer(forecast_ets, series = "ETS Model") 

Test vs Train ARIMA

arima_train <- auto.arima(train)
forecast_arima <- forecast(arima_train, h=6)
autoplot(colombia) +
  autolayer(forecast_arima, series = "ARIMA Model") 

ets_acc <- accuracy(forecast_ets, test)
ets_acc
##                      ME      RMSE        MAE  MPE MAPE MASE       ACF1
## Training set   12.72076  161.0531   64.17528  NaN  Inf  NaN -0.2964485
## Test set     -447.74414 1294.0967 1093.39270 -Inf  Inf  NaN -0.3132315
##              Theil's U
## Training set        NA
## Test set             0
arima_acc <- accuracy(forecast_arima, test)
arima_acc
##                     ME      RMSE       MAE  MPE MAPE MASE        ACF1 Theil's U
## Training set 0.5973379  123.9492  68.88739 -Inf  Inf  NaN -0.05134434        NA
## Test set     6.1341164 1192.7855 950.47951 -Inf  Inf  NaN -0.35495502         0

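As the graphs show, our predictions are not very close to the actual values. Even so, if we had to choose a model based only on the test-set RMSE, we would choose the ARIMA model (about 1193 versus 1294 for ETS). The comparison should be read with caution: the ETS forecast uses h = 12 while the ARIMA forecast uses h = 6, and the test window shares observation 134 with the training window. The size of the errors also suggests that there is other information that we are not taking into account.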
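A hold-out comparison of this kind can be sketched with a strictly non-overlapping split and the same horizon for both models. The series `y` below is simulated and hypothetical, the names `y_train`, `y_test`, `fc_ets` and `fc_arima` are illustrative, and the seasonal components are switched off as a simplifying assumption (`ets()` does not handle seasonal periods above 24).

```r
library(forecast)

# Hypothetical stand-in for the 140 daily observations in `colombia`.
set.seed(123)
n <- 140
h <- 6
y <- pmax(0, 5 * seq_len(n) + rnorm(n, sd = 30))

# Non-overlapping split: the first n - h points for fitting, the last h points held out.
y_train <- ts(y[1:(n - h)], frequency = 365, start = c(2020, 23))
y_test  <- ts(y[(n - h + 1):n], frequency = 365, start = c(2020, 23 + n - h))

# Fit both models on the training window only and forecast exactly h steps.
fc_ets   <- forecast(ets(y_train, model = "ZZN"), h = h)
fc_arima <- forecast(auto.arima(y_train, seasonal = FALSE), h = h)

# accuracy() returns a matrix; the "Test set" row holds the out-of-sample errors.
rmse <- c(
  ETS   = accuracy(fc_ets, y_test)["Test set", "RMSE"],
  ARIMA = accuracy(fc_arima, y_test)["Test set", "RMSE"]
)
print(rmse)
```

Because the test series starts one day after the training series ends, `accuracy()` aligns each forecast with the observation for the same day.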

Limitations

As expected, one of the main limitations is the lack of information available to specify the model. There are variables that we cannot take into account, such as public policies, test availability, or even the number of tests carried out by medical personnel. It is therefore unlikely that the residuals of either model resemble white noise; they probably still contain information that the models do not capture. Even so, in the short term the series could be predicted relatively accurately.
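Whether residuals resemble white noise can be checked with a Ljung-Box test. The sketch below uses two simulated residual series, `resid_white` and `resid_ar`, both hypothetical. With the fitted models one would pass `residuals(ETS1)` or `residuals(ARIMA1)` instead, or use `checkresiduals()` from the forecast package.

```r
# Ljung-Box test: a small p-value suggests remaining autocorrelation,
# i.e. the residuals are not white noise.
set.seed(7)
resid_white <- rnorm(140)                                      # hypothetical well-behaved residuals
resid_ar    <- as.numeric(arima.sim(list(ar = 0.8), n = 140))  # hypothetical autocorrelated residuals

lb_white <- Box.test(resid_white, lag = 10, type = "Ljung-Box")
lb_ar    <- Box.test(resid_ar, lag = 10, type = "Ljung-Box")

print(lb_white$p.value)
print(lb_ar$p.value)
```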

Future Work

The first thing that comes to mind is to use a VAR model and perform an impulse-response analysis. Variables such as daily deaths can have an effect in the medium term, since they can dissuade people from certain behaviors that increase the risk of contagion.

Methodologies other than those used in this paper could also be tested. We could even break down the series by city or region, since it is possible that the series has a different shape depending on the customs of the people or geographical conditions.

What Was Learned

The challenges in this type of work range from programming to knowledge of the theory. In both cases, a lot was learned, especially when seeing errors in the predictions. What do these errors depend on? Are we using the commands wrong? Are we making theoretical errors? These types of questions are quite useful since the reason for our errors is not always clear and they lead us to review every detail.