Forecasting The New Cases of Covid-19 in DKI Jakarta by using Time Series

Introduction

Corona virus disease (Covid-19) in increase rapidly almost in all country, include Indonesia. Recently, there is so many information which inform the new cases in DKI Jakarta was grow rapidly. We’ll try to forecast the number of increasing case in DKI Jakarta. The dataset was provided by Kaggle. The data contains information from March 1st, 2020 to Juli 31st, 2020.

Import Library and Setup

library(dplyr)
library(lubridate)
library(forecast)
library(TTR)
library(ggplot2)
library(tseries)
library(gridExtra)

Data Import

covid <- read.csv("covid_19_indonesia_time_series_all.csv")
rmarkdown::paged_table(covid)

The data has 4,578 obs and 37 columns. But, we’ll only focus on forecasting the New.Cases variable in DKI Jakarta.

Data Preparation and EDA

#subset the location only Jakarta, and column i..Date, and New.Cases and change data type
covid1 <- covid %>% 
  subset(Location == "DKI Jakarta") %>% 
  select(ï..Date, New.Cases) %>% 
  mutate(ï..Date = mdy(ï..Date)) 

#change the column name from i..Date to Date
colnames(covid1)[1] = "Date"

After subset the data, the data to be use in forecasting should have passed :
1. The data can’t contain NA (missing value)
2. The data must be ordered by time
3. The interval of date must be completely and cannot be skipped.

#Check if there is missing column in data
colSums(is.na(covid1))

##      Date New.Cases 
##         0         0

Our data in Maret is so much of 0 value, so we’ll use the data start from April.

covid1 <- covid1 %>% 
  subset(Date >= "2020-04-01")

Then we’ll check when the data start and end.

range(covid1$Date)

## [1] "2020-04-01" "2020-07-31"

Now, our data is from April 1, 2020 to July 31, 2020. Then we’ll convert the data to ts object.

#create object ts 
covid_ts <- ts(data = covid1$New.Cases,
   start = min(covid1$Date),
   frequency = 7) #weekly seasonality 

#visualise object covid_ts
covid_ts %>% autoplot()

From the plot, we can see that our data’s model is multiplicative. We’ll try to see the seasonality, trend and error of data.

covid_dc <- decompose(covid_ts, type = "multiplicative")
covid_dc %>% autoplot()

We can also see an up or down trend using adjusted seasonal which has removing the effects of seasonal data.

autoplot(covid_dc$x - covid_dc$seasonal)

From the plot and information above, we can see that the data’s seasonal pattern is random or no seasonal. The trend patterns of data is increasing.

Cross Validation

Before we do forecasting, we’ll try to split our data to train and test. test data is the data of last 7 days (1 week), and the rest as train data.

test <- tail(covid_ts, 7) #get 7 last days 
train <- head(covid_ts, length(covid_ts) - length(test)) #get the rest data

Time Series Modeling

Before we do the modeling, we’ll check our train data.

train %>% autoplot()

From train data, we see that the plot has no seasonal and no trend, because the trend is increase - decrease - increase. We’ll try modeling using ETS method with auto.

ETS method

ETS is time series model to define Error, Trend, and Seasonal. We can fill the method with 3 letters with the value of Error, Trend and Seasonal in model = “XXX”. We can fill “XXX” with :

A: aditif
M: multiplikatif
N: none
Z: auto

In this session, we’ll try using auto model.

covid_ets <- ets(y = train, model = "ZZZ")
covid_ets

## ETS(M,Ad,N) 
## 
## Call:
##  ets(y = train, model = "ZZZ") 
## 
##   Smoothing parameters:
##     alpha = 0.1329 
##     beta  = 0.0362 
##     phi   = 0.8597 
## 
##   Initial states:
##     l = 26.4274 
##     b = 23.7602 
## 
##   sigma:  0.3107
## 
##      AIC     AICc      BIC 
## 1409.256 1410.034 1425.726

From the auto model, the result was ETS(M,Ad,N) which means that the model have error and trend, but doesn’t have seasonal. We also get the value of AIC is 1409.256.

Exponential Smoothing

After we do modeling with ETS method, we’ll try to make model with exponential smoothing. Exponential smoothing actually has 3 types :

Simple Exponential Smoothing, is using for forecasting model which data have no seasonal and trend (only has error)
Holt’s Exponential Smoothing (Double Exponential Smoothing), is using for forecasting model which data has no seasonal (have error and trend)
Holt’s Winters Exponential (Triple Exponential Smoothing), is using for forecasting model which data has error, trend, and seasonal.

We’ll be using Holt’s Exponential Smoothing (Double Exponential Smoothing) by add parameter gamma = F.

covid_holt <- HoltWinters(x = train, gamma = F)
covid_holt

## Holt-Winters exponential smoothing with trend and without seasonal component.
## 
## Call:
## HoltWinters(x = train, gamma = F)
## 
## Smoothing parameters:
##  alpha: 0.2337292
##  beta : 0.110738
##  gamma: FALSE
## 
## Coefficients:
##         [,1]
## a 388.538734
## b   8.174938

We can set the alpha, and beta value, but in this case we’ll try to use auto so that the system can try to search the optimum number of alpha and beta. We get value of alpha = 0.2337292, and beta = 0.110738.

Non - Seasonal Arima

Before continue, we’ll try to check if our data is stationer.

adf.test(train)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  train
## Dickey-Fuller = -0.9608, Lag order = 4, p-value = 0.9408
## alternative hypothesis: stationary

p-value = 0.2044, which means p-value is < 0.05 (not stationary). We’ll try to make data stationer by doing differencing.

diff(train, lag = 1) %>% adf.test()

## 
##  Augmented Dickey-Fuller Test
## 
## data:  .
## Dickey-Fuller = -6.124, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary

p-value is < 0.05, so the data is stationer and we can go on.

tsdisplay(diff(train))

From the first 5 lags, we can see the p and q from PACF and ACF by count how many line which passed the threshold.
p = 2 (1st lag and 2nd lag passed the threshold)
d = 1 (we only do differencing once to get stationary data)
q = 1 (1st lag on ACF plot passed the threshold)

We’ll try to do ARIMA modeling by combining the (p,d,and q) value. Some combination we can try :

ARIMA (1,1,1)
ARIMA (2,1,1)
ARIMA (3,1,1)

#modeling
covid_arima1 <- Arima(y = train, order = c(1,1,1))
covid_arima2 <- Arima(y = train, order = c(2,1,1))
covid_arima3 <- Arima(y = train, order = c(3,1,1))

After modeling, we’ll try to check the AIC of each arima model

covid_arima1$aic

## [1] 1196.647

covid_arima2$aic

## [1] 1198.53

covid_arima3$aic

## [1] 1200.437

From the AIC value, we can see that model covid_arima1 with p,d,q = (1,1,1) is the best in arima model for this case. We’ll try to check which is the best model for arima method by MAPE (error evaluation).

accuracy(covid_arima1)

##                    ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 7.454104 44.53286 33.38272 -3.004421 23.84321 0.7045795
##                     ACF1
## Training set -0.04075299

accuracy(covid_arima2)

##                    ME     RMSE      MAE      MPE     MAPE      MASE      ACF1
## Training set 7.204015 44.51002 33.44979 -3.12076 23.91231 0.7059951 -0.036073

accuracy(covid_arima3)

##                   ME     RMSE      MAE       MPE     MAPE      MASE        ACF1
## Training set 7.41779 44.49129 33.52492 -3.014416 23.96201 0.7075809 -0.04033545

If we check from MAPE method, we see that covid_arima1 is the best because the error percentage is 23.84% by MAPE. We can also using auto arima model.

covid_arima_auto <- auto.arima(y = train)

Then, we’ll try to compare model arima 1 and auto arima.

#AIC
covid_arima1$aic

## [1] 1196.647

covid_arima_auto$aic

## [1] 1193.265

#MAPE
accuracy(covid_arima1)

##                    ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 7.454104 44.53286 33.38272 -3.004421 23.84321 0.7045795
##                     ACF1
## Training set -0.04075299

accuracy(covid_arima_auto)

##                     ME    RMSE      MAE       MPE     MAPE      MASE
## Training set 0.1998666 43.8558 32.99011 -9.575776 25.13218 0.6962931
##                     ACF1
## Training set -0.04305135

From AIC method, we see that covid_arima_auto is better than covid_arima1 because covid_arima1 has AIC value larger than covid_arima_auto.
From MAPE, we see that covid_arima1 is better than covid_arima_auto because the error percentage based on MAPE is 23.84%.

Forecasting

# forecast
covid_ets_f <- forecast(covid_ets, h = 7)
covid_holt_f <- forecast(covid_holt, h = 7)
covid_arima_f <- forecast(covid_arima1, h = 7)

#forecast result of ets method
covid_ets_f

##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 18369.43       375.3148 225.8589 524.7707 146.7417 603.8879
## 18369.57       379.5637 226.2511 532.8764 145.0922 614.0352
## 18369.71       383.2166 225.5622 540.8711 142.1049 624.3283
## 18369.86       386.3571 223.8896 548.8247 137.8844 634.8298
## 18370.00       389.0570 221.3447 556.7694 132.5632 645.5509
## 18370.14       391.3782 218.0424 564.7141 126.2839 656.4725
## 18370.29       393.3738 214.0934 572.6543 119.1880 667.5596

#forecast result of holt method
covid_holt_f

##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 18369.43       396.7137 339.1114 454.3159 308.6186 484.8087
## 18369.57       404.8886 345.3769 464.4003 313.8733 495.9040
## 18369.71       413.0635 351.3214 474.8057 318.6371 507.4900
## 18369.86       421.2385 356.9439 485.5330 322.9084 519.5685
## 18370.00       429.4134 362.2480 496.5789 326.6927 532.1341
## 18370.14       437.5884 367.2410 507.9357 330.0013 545.1754
## 18370.29       445.7633 371.9331 519.5935 332.8497 558.6769

#forecast result of arima method
covid_arima_f

##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 18369.43       378.8558 321.0254 436.6863 290.4118 467.2999
## 18369.57       368.8388 308.7154 428.9622 276.8880 460.7896
## 18369.71       370.0075 306.2312 433.7838 272.4701 467.5449
## 18369.86       369.8712 302.8163 436.9260 267.3196 472.4227
## 18370.00       369.8871 299.6872 440.0869 262.5257 477.2484
## 18370.14       369.8852 296.6775 443.0929 257.9237 481.8467
## 18370.29       369.8854 293.7885 445.9824 253.5052 486.2657

# visualization
a <- autoplot(covid_ets_f, series = "ETS", fcol = "red") +
  autolayer(covid_ts, series = "Actual", color = "black") + 
  labs(subtitle = "New Case of Covid in DKI Jakarta, Indonesia from April - July 2020",
       y = "New Cases") +
  theme_minimal()

b <- autoplot(covid_holt_f, series = "HOLT", fcol = "green") +
  autolayer(covid_ts, series = "Actual", color = "black") +
  labs(subtitle = "New Case of Covid in DKI Jakarta, Indonesia from April - July 2020",
       y = "New Cases") +
  theme_minimal()

c <- autoplot(covid_arima_f, series = "ARIMA", fcol = "blue") +
  autolayer(covid_ts, series = "Actual", color = "black") +
  labs(subtitle = "New Case of Covid in DKI Jakarta, Indonesia from April - July 2020",
       y = "New Cases") +
  theme_minimal()

grid.arrange(a,b,c)

Evaluation Model

accuracy(covid_ets_f, test)

##                     ME     RMSE      MAE       MPE     MAPE      MASE
## Training set  5.724946 44.00628 32.65048 -5.186556 23.82883 0.6891249
## Test set     58.391235 98.39879 60.19546 10.766757 11.25571 1.2704925
##                    ACF1 Theil's U
## Training set  0.0661572        NA
## Test set     -0.3574688 0.8930929

accuracy(covid_holt_f, test)

##                      ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.6240073 44.75229 34.08774 -8.962154 25.73653 0.7194598
## Test set     22.6186591 81.21541 56.10221  2.589004 11.11248 1.1840997
##                       ACF1 Theil's U
## Training set -0.0005869861        NA
## Test set     -0.3916424619 0.7232941

accuracy(covid_arima_f, test)

##                     ME      RMSE      MAE       MPE     MAPE      MASE
## Training set  7.454104  44.53286 33.38272 -3.004421 23.84321 0.7045795
## Test set     72.824144 109.82510 75.64010 13.978744 14.74187 1.5964687
##                     ACF1 Theil's U
## Training set -0.04075299        NA
## Test set     -0.29181453 0.9934245

There is some information we can get about MAPE :
* Model covid_ets_f is better at forecasting in train dataset with 23.82% of error, but when forecasting in test dataset, model covid_ets_f got error of 11.25%.
* Model covid_holt_f is better at forecasting in test dataset with 11.11% of error, but when forecasting in train dataset, model covid_holt_f got error of 25.73%.

Assumption Check

Normality : Shapiro.test

H0 : residuals are normally distributed H1 : residuals are not normally distributed

# p-value < 0.05; reject H0; accept H1 
shapiro.test(covid_ets_f$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  covid_ets_f$residuals
## W = 0.95686, p-value = 0.0009636

shapiro.test(covid_holt_f$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  covid_holt_f$residuals
## W = 0.97934, p-value = 0.07753

shapiro.test(covid_arima_f$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  covid_arima_f$residuals
## W = 0.9715, p-value = 0.01481

By using shapiro.test to do normality check asumption, only model covid_holt_f was passed the test by p-value > 0.05 (residual distributed normally).

hist(covid_holt_f$residuals, breaks = 20)

Autocorrelation : Box.test - Ljng-Box

H0 : No autocorrelation in the forecast errors H1 : there is an autocorrelation in the forecast errors

Box.test(covid_ets_f$residuals, type = "Ljung-Box")

## 
##  Box-Ljung test
## 
## data:  covid_ets_f$residuals
## X-squared = 0.20115, df = 1, p-value = 0.6538

Box.test(covid_holt_f$residuals, type = "Ljung-Box")

## 
##  Box-Ljung test
## 
## data:  covid_holt_f$residuals
## X-squared = 3.9977e-05, df = 1, p-value = 0.995

Box.test(covid_arima_f$residuals, type = "Ljung-Box")

## 
##  Box-Ljung test
## 
## data:  covid_arima_f$residuals
## X-squared = 0.19602, df = 1, p-value = 0.658

Based on box.test function, all the model : covid_ets_f, covid_holt_f, and covid_arima_f has p-value > 0.05 (no autocorrelation in forecast errors).

Conclusion

We have build several model and forecasting. By using assumption check, only Holt’s Exponential Smoothing model covid_holt_f was passed the normality test and auto-correlation test for this case. Model Holt’s Exponential Smoothing in forecasting dataset was better with error of 11.11248 % in forecasting test dataset (by using MAPE).

Additional

As we know, recently the number of new case in Indonesia was growing. DKI Jakarta was one of the most region which grow rapidly. We’ll try to forecast the number of new cases in DKI Jakarta in August.

covid_ts %>% autoplot()

#forecasting the value of new.cases in August by using Holt method.
aug_f <- forecast(HoltWinters(x = covid_ts, gamma = F), h = 31)

#visualize by table
aug_df <- as.data.frame(aug_f) %>% 
  mutate(date = seq(from = ymd("2020-08-01"), to = ymd("2020-08-31"), by = "day"))
rmarkdown::paged_table(aug_df)

#visualize by plot
covid_ts %>% autoplot() +
  autolayer(aug_f)

The forecasting with Holt’s Exponential Smoothing forecast that the number of new.cases in DKI Jakarta will be increasing.