Forecasting The New Cases of Covid-19 in DKI Jakarta by using Time Series
Introduction
Corona virus disease (Covid-19) in increase rapidly almost in all country, include Indonesia. Recently, there is so many information which inform the new cases in DKI Jakarta was grow rapidly. We’ll try to forecast the number of increasing case in DKI Jakarta. The dataset was provided by Kaggle. The data contains information from March 1st, 2020 to Juli 31st, 2020.
Import Library and Setup
Data Import
The data has 4,578 obs and 37 columns. But, we’ll only focus on forecasting the New.Cases
variable in DKI Jakarta.
Data Preparation and EDA
#subset the location only Jakarta, and column i..Date, and New.Cases and change data type
covid1 <- covid %>%
subset(Location == "DKI Jakarta") %>%
select(ï..Date, New.Cases) %>%
mutate(ï..Date = mdy(ï..Date))
#change the column name from i..Date to Date
colnames(covid1)[1] = "Date"
After subset the data, the data to be use in forecasting should have passed :
1. The data can’t contain NA (missing value)
2. The data must be ordered by time
3. The interval of date must be completely and cannot be skipped.
## Date New.Cases
## 0 0
Our data in Maret is so much of 0 value, so we’ll use the data start from April.
Then we’ll check when the data start and end.
## [1] "2020-04-01" "2020-07-31"
Now, our data is from April 1, 2020 to July 31, 2020. Then we’ll convert the data to ts object.
#create object ts
covid_ts <- ts(data = covid1$New.Cases,
start = min(covid1$Date),
frequency = 7) #weekly seasonality
#visualise object covid_ts
covid_ts %>% autoplot()
From the plot, we can see that our data’s model is multiplicative
. We’ll try to see the seasonality, trend and error of data.
We can also see an up or down trend using adjusted seasonal which has removing the effects of seasonal data.
From the plot and information above, we can see that the data’s seasonal pattern is random or no seasonal. The trend patterns of data is increasing.
Cross Validation
Before we do forecasting, we’ll try to split our data to train
and test
. test
data is the data of last 7 days (1 week), and the rest as train
data.
Time Series Modeling
Before we do the modeling, we’ll check our train data.
From train
data, we see that the plot has no seasonal and no trend, because the trend is increase - decrease - increase. We’ll try modeling using ETS method with auto.
ETS method
ETS is time series model to define Error, Trend, and Seasonal. We can fill the method with 3 letters with the value of Error, Trend and Seasonal in model = “XXX”. We can fill “XXX” with :
A: aditif
M: multiplikatif
N: none
Z: auto
In this session, we’ll try using auto model.
## ETS(M,Ad,N)
##
## Call:
## ets(y = train, model = "ZZZ")
##
## Smoothing parameters:
## alpha = 0.1329
## beta = 0.0362
## phi = 0.8597
##
## Initial states:
## l = 26.4274
## b = 23.7602
##
## sigma: 0.3107
##
## AIC AICc BIC
## 1409.256 1410.034 1425.726
From the auto model, the result was ETS(M,Ad,N) which means that the model have error and trend, but doesn’t have seasonal. We also get the value of AIC is 1409.256.
Exponential Smoothing
After we do modeling with ETS method, we’ll try to make model with exponential smoothing. Exponential smoothing actually has 3 types :
- Simple Exponential Smoothing, is using for forecasting model which data have no seasonal and trend (only has error)
- Holt’s Exponential Smoothing (Double Exponential Smoothing), is using for forecasting model which data has no seasonal (have error and trend)
- Holt’s Winters Exponential (Triple Exponential Smoothing), is using for forecasting model which data has error, trend, and seasonal.
We’ll be using Holt’s Exponential Smoothing (Double Exponential Smoothing) by add parameter gamma = F
.
## Holt-Winters exponential smoothing with trend and without seasonal component.
##
## Call:
## HoltWinters(x = train, gamma = F)
##
## Smoothing parameters:
## alpha: 0.2337292
## beta : 0.110738
## gamma: FALSE
##
## Coefficients:
## [,1]
## a 388.538734
## b 8.174938
We can set the alpha, and beta value, but in this case we’ll try to use auto so that the system can try to search the optimum number of alpha and beta. We get value of alpha = 0.2337292, and beta = 0.110738.
Non - Seasonal Arima
Before continue, we’ll try to check if our data is stationer.
##
## Augmented Dickey-Fuller Test
##
## data: train
## Dickey-Fuller = -0.9608, Lag order = 4, p-value = 0.9408
## alternative hypothesis: stationary
p-value = 0.2044, which means p-value is < 0.05 (not stationary). We’ll try to make data stationer by doing differencing.
##
## Augmented Dickey-Fuller Test
##
## data: .
## Dickey-Fuller = -6.124, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary
p-value is < 0.05, so the data is stationer and we can go on.
From the first 5 lags, we can see the p and q from PACF and ACF by count how many line which passed the threshold.
p = 2 (1st lag and 2nd lag passed the threshold)
d = 1 (we only do differencing once to get stationary data)
q = 1 (1st lag on ACF plot passed the threshold)
We’ll try to do ARIMA modeling by combining the (p,d,and q) value. Some combination we can try :
- ARIMA (1,1,1)
- ARIMA (2,1,1)
- ARIMA (3,1,1)
#modeling
covid_arima1 <- Arima(y = train, order = c(1,1,1))
covid_arima2 <- Arima(y = train, order = c(2,1,1))
covid_arima3 <- Arima(y = train, order = c(3,1,1))
After modeling, we’ll try to check the AIC of each arima model
## [1] 1196.647
## [1] 1198.53
## [1] 1200.437
From the AIC value, we can see that model covid_arima1
with p,d,q = (1,1,1) is the best in arima model for this case. We’ll try to check which is the best model for arima method by MAPE (error evaluation).
## ME RMSE MAE MPE MAPE MASE
## Training set 7.454104 44.53286 33.38272 -3.004421 23.84321 0.7045795
## ACF1
## Training set -0.04075299
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 7.204015 44.51002 33.44979 -3.12076 23.91231 0.7059951 -0.036073
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 7.41779 44.49129 33.52492 -3.014416 23.96201 0.7075809 -0.04033545
If we check from MAPE method, we see that covid_arima1
is the best because the error percentage is 23.84% by MAPE. We can also using auto arima model.
Then, we’ll try to compare model arima 1 and auto arima.
## [1] 1196.647
## [1] 1193.265
## ME RMSE MAE MPE MAPE MASE
## Training set 7.454104 44.53286 33.38272 -3.004421 23.84321 0.7045795
## ACF1
## Training set -0.04075299
## ME RMSE MAE MPE MAPE MASE
## Training set 0.1998666 43.8558 32.99011 -9.575776 25.13218 0.6962931
## ACF1
## Training set -0.04305135
- From AIC method, we see that
covid_arima_auto
is better thancovid_arima1
becausecovid_arima1
has AIC value larger thancovid_arima_auto
. - From MAPE, we see that
covid_arima1
is better thancovid_arima_auto
because the error percentage based on MAPE is 23.84%.
Forecasting
# forecast
covid_ets_f <- forecast(covid_ets, h = 7)
covid_holt_f <- forecast(covid_holt, h = 7)
covid_arima_f <- forecast(covid_arima1, h = 7)
#forecast result of ets method
covid_ets_f
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 18369.43 375.3148 225.8589 524.7707 146.7417 603.8879
## 18369.57 379.5637 226.2511 532.8764 145.0922 614.0352
## 18369.71 383.2166 225.5622 540.8711 142.1049 624.3283
## 18369.86 386.3571 223.8896 548.8247 137.8844 634.8298
## 18370.00 389.0570 221.3447 556.7694 132.5632 645.5509
## 18370.14 391.3782 218.0424 564.7141 126.2839 656.4725
## 18370.29 393.3738 214.0934 572.6543 119.1880 667.5596
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 18369.43 396.7137 339.1114 454.3159 308.6186 484.8087
## 18369.57 404.8886 345.3769 464.4003 313.8733 495.9040
## 18369.71 413.0635 351.3214 474.8057 318.6371 507.4900
## 18369.86 421.2385 356.9439 485.5330 322.9084 519.5685
## 18370.00 429.4134 362.2480 496.5789 326.6927 532.1341
## 18370.14 437.5884 367.2410 507.9357 330.0013 545.1754
## 18370.29 445.7633 371.9331 519.5935 332.8497 558.6769
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 18369.43 378.8558 321.0254 436.6863 290.4118 467.2999
## 18369.57 368.8388 308.7154 428.9622 276.8880 460.7896
## 18369.71 370.0075 306.2312 433.7838 272.4701 467.5449
## 18369.86 369.8712 302.8163 436.9260 267.3196 472.4227
## 18370.00 369.8871 299.6872 440.0869 262.5257 477.2484
## 18370.14 369.8852 296.6775 443.0929 257.9237 481.8467
## 18370.29 369.8854 293.7885 445.9824 253.5052 486.2657
# visualization
a <- autoplot(covid_ets_f, series = "ETS", fcol = "red") +
autolayer(covid_ts, series = "Actual", color = "black") +
labs(subtitle = "New Case of Covid in DKI Jakarta, Indonesia from April - July 2020",
y = "New Cases") +
theme_minimal()
b <- autoplot(covid_holt_f, series = "HOLT", fcol = "green") +
autolayer(covid_ts, series = "Actual", color = "black") +
labs(subtitle = "New Case of Covid in DKI Jakarta, Indonesia from April - July 2020",
y = "New Cases") +
theme_minimal()
c <- autoplot(covid_arima_f, series = "ARIMA", fcol = "blue") +
autolayer(covid_ts, series = "Actual", color = "black") +
labs(subtitle = "New Case of Covid in DKI Jakarta, Indonesia from April - July 2020",
y = "New Cases") +
theme_minimal()
grid.arrange(a,b,c)
Evaluation Model
## ME RMSE MAE MPE MAPE MASE
## Training set 5.724946 44.00628 32.65048 -5.186556 23.82883 0.6891249
## Test set 58.391235 98.39879 60.19546 10.766757 11.25571 1.2704925
## ACF1 Theil's U
## Training set 0.0661572 NA
## Test set -0.3574688 0.8930929
## ME RMSE MAE MPE MAPE MASE
## Training set -0.6240073 44.75229 34.08774 -8.962154 25.73653 0.7194598
## Test set 22.6186591 81.21541 56.10221 2.589004 11.11248 1.1840997
## ACF1 Theil's U
## Training set -0.0005869861 NA
## Test set -0.3916424619 0.7232941
## ME RMSE MAE MPE MAPE MASE
## Training set 7.454104 44.53286 33.38272 -3.004421 23.84321 0.7045795
## Test set 72.824144 109.82510 75.64010 13.978744 14.74187 1.5964687
## ACF1 Theil's U
## Training set -0.04075299 NA
## Test set -0.29181453 0.9934245
There is some information we can get about MAPE :
* Model covid_ets_f
is better at forecasting in train dataset with 23.82% of error, but when forecasting in test dataset, model covid_ets_f
got error of 11.25%.
* Model covid_holt_f
is better at forecasting in test dataset with 11.11% of error, but when forecasting in train dataset, model covid_holt_f
got error of 25.73%.
Assumption Check
- Normality : Shapiro.test
H0 : residuals are normally distributed H1 : residuals are not normally distributed
##
## Shapiro-Wilk normality test
##
## data: covid_ets_f$residuals
## W = 0.95686, p-value = 0.0009636
##
## Shapiro-Wilk normality test
##
## data: covid_holt_f$residuals
## W = 0.97934, p-value = 0.07753
##
## Shapiro-Wilk normality test
##
## data: covid_arima_f$residuals
## W = 0.9715, p-value = 0.01481
By using shapiro.test to do normality check asumption, only model covid_holt_f
was passed the test by p-value > 0.05 (residual distributed normally).
- Autocorrelation : Box.test - Ljng-Box
H0 : No autocorrelation in the forecast errors H1 : there is an autocorrelation in the forecast errors
##
## Box-Ljung test
##
## data: covid_ets_f$residuals
## X-squared = 0.20115, df = 1, p-value = 0.6538
##
## Box-Ljung test
##
## data: covid_holt_f$residuals
## X-squared = 3.9977e-05, df = 1, p-value = 0.995
##
## Box-Ljung test
##
## data: covid_arima_f$residuals
## X-squared = 0.19602, df = 1, p-value = 0.658
Based on box.test
function, all the model : covid_ets_f
, covid_holt_f
, and covid_arima_f
has p-value > 0.05 (no autocorrelation in forecast errors).
Conclusion
We have build several model and forecasting. By using assumption check, only Holt’s Exponential Smoothing model covid_holt_f
was passed the normality test and auto-correlation test for this case. Model Holt’s Exponential Smoothing in forecasting dataset was better with error of 11.11248 % in forecasting test dataset (by using MAPE).
Additional
As we know, recently the number of new case in Indonesia was growing. DKI Jakarta was one of the most region which grow rapidly. We’ll try to forecast the number of new cases in DKI Jakarta in August.
#forecasting the value of new.cases in August by using Holt method.
aug_f <- forecast(HoltWinters(x = covid_ts, gamma = F), h = 31)
#visualize by table
aug_df <- as.data.frame(aug_f) %>%
mutate(date = seq(from = ymd("2020-08-01"), to = ymd("2020-08-31"), by = "day"))
rmarkdown::paged_table(aug_df)
The forecasting with Holt’s Exponential Smoothing forecast that the number of new.cases in DKI Jakarta will be increasing.