Time Series Analysis for New Cases of Covid-19 in DKI Jakarta

Introduction

Until today, Covid-19 still increase rapidly almost in all countries, including Indonesia. All cities in Indonesia have been affected by COVID-19, including DKI Jakarta. We’ll try to forecast the number of increasing case in DKI Jakarta. The dataset was provided by Kaggle . The data contains information from March 1st, 2020 until December 2nd, 2021.

Input Library

Below list of package that we’re using for this analysis.

library(dplyr)
library(forecast)
library(tseries)
library(TTR)
library(lubridate)
library(tidyverse)

Input Data

covid <- read.csv("covid_19_indonesia_time_series_all.csv")

head(covid)

#>    ï..Date Location.ISO.Code    Location New.Cases New.Deaths New.Recovered
#> 1 3/1/2020             ID-JK DKI Jakarta         2          0             0
#> 2 3/2/2020             ID-JK DKI Jakarta         2          0             0
#> 3 3/2/2020               IDN   Indonesia         2          0             0
#> 4 3/2/2020             ID-RI        Riau         1          0             0
#> 5 3/3/2020             ID-JK DKI Jakarta         2          0             0
#> 6 3/3/2020               IDN   Indonesia         0          0             0
#>   New.Active.Cases Total.Cases Total.Deaths Total.Recovered Total.Active.Cases
#> 1                2          39           20              41                -22
#> 2                2          41           20              41                -20
#> 3                2           2            0               0                  2
#> 4                1           2            0               3                 -1
#> 5                2          43           20              41                -18
#> 6                0           2            0               0                  2
#>   Location.Level City.or.Regency    Province   Country Continent   Island
#> 1       Province              NA DKI Jakarta Indonesia      Asia     Jawa
#> 2       Province              NA DKI Jakarta Indonesia      Asia     Jawa
#> 3        Country              NA             Indonesia      Asia         
#> 4       Province              NA        Riau Indonesia      Asia Sumatera
#> 5       Province              NA DKI Jakarta Indonesia      Asia     Jawa
#> 6        Country              NA             Indonesia      Asia         
#>   Time.Zone         Special.Status Total.Regencies Total.Cities Total.Districts
#> 1 UTC+07:00 Daerah Khusus Ibu Kota               1            5              44
#> 2 UTC+07:00 Daerah Khusus Ibu Kota               1            5              44
#> 3                                              416           98            7230
#> 4 UTC+07:00                                     10            2             169
#> 5 UTC+07:00 Daerah Khusus Ibu Kota               1            5              44
#> 6                                              416           98            7230
#>   Total.Urban.Villages Total.Rural.Villages Area..km2. Population
#> 1                  267                   NA        664   10846145
#> 2                  267                   NA        664   10846145
#> 3                 8488                74953    1916907  265185520
#> 4                  268                 1591      87024    6074100
#> 5                  267                   NA        664   10846145
#> 6                 8488                74953    1916907  265185520
#>   Population.Density Longitude   Latitude New.Cases.per.Million
#> 1           16334.31  106.8361 -6.2046990                  0.18
#> 2           16334.31  106.8361 -6.2046990                  0.18
#> 3             138.34  113.9213 -0.7892750                  0.01
#> 4              69.80  101.8051  0.5116479                  0.16
#> 5           16334.31  106.8361 -6.2046990                  0.18
#> 6             138.34  113.9213 -0.7892750                  0.00
#>   Total.Cases.per.Million New.Deaths.per.Million Total.Deaths.per.Million
#> 1                    3.60                      0                     1.84
#> 2                    3.78                      0                     1.84
#> 3                    0.01                      0                     0.00
#> 4                    0.33                      0                     0.00
#> 5                    3.96                      0                     1.84
#> 6                    0.01                      0                     0.00
#>   Total.Deaths.per.100rb Case.Fatality.Rate Case.Recovered.Rate
#> 1                   0.18             51.28%             105.13%
#> 2                   0.18             48.78%             100.00%
#> 3                   0.00              0.00%               0.00%
#> 4                   0.00              0.00%             150.00%
#> 5                   0.18             46.51%              95.35%
#> 6                   0.00              0.00%               0.00%
#>   Growth.Factor.of.New.Cases Growth.Factor.of.New.Deaths
#> 1                         NA                          NA
#> 2                          1                           1
#> 3                         NA                          NA
#> 4                         NA                          NA
#> 5                          1                           1
#> 6                          0                           1

Data Preparation

Data Wrangling

Check structure data of covid

glimpse(covid)

#> Rows: 21,759
#> Columns: 38
#> $ ï..Date                     <chr> "3/1/2020", "3/2/2020", "3/2/2020", "3/2/2~
#> $ Location.ISO.Code           <chr> "ID-JK", "ID-JK", "IDN", "ID-RI", "ID-JK",~
#> $ Location                    <chr> "DKI Jakarta", "DKI Jakarta", "Indonesia",~
#> $ New.Cases                   <int> 2, 2, 2, 1, 2, 0, 1, 0, 2, 0, 1, 0, 0, 0, ~
#> $ New.Deaths                  <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, ~
#> $ New.Recovered               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
#> $ New.Active.Cases            <int> 2, 2, 2, 1, 2, 0, 0, 0, 2, 0, 1, 0, -1, 0,~
#> $ Total.Cases                 <int> 39, 41, 2, 2, 43, 2, 1, 2, 45, 2, 2, 2, 45~
#> $ Total.Deaths                <int> 20, 20, 0, 0, 20, 0, 1, 0, 20, 0, 1, 0, 21~
#> $ Total.Recovered             <int> 41, 41, 0, 3, 41, 0, 8, 3, 41, 0, 8, 3, 41~
#> $ Total.Active.Cases          <int> -22, -20, 2, -1, -18, 2, -8, -1, -16, 2, -~
#> $ Location.Level              <chr> "Province", "Province", "Country", "Provin~
#> $ City.or.Regency             <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
#> $ Province                    <chr> "DKI Jakarta", "DKI Jakarta", "", "Riau", ~
#> $ Country                     <chr> "Indonesia", "Indonesia", "Indonesia", "In~
#> $ Continent                   <chr> "Asia", "Asia", "Asia", "Asia", "Asia", "A~
#> $ Island                      <chr> "Jawa", "Jawa", "", "Sumatera", "Jawa", ""~
#> $ Time.Zone                   <chr> "UTC+07:00", "UTC+07:00", "", "UTC+07:00",~
#> $ Special.Status              <chr> "Daerah Khusus Ibu Kota", "Daerah Khusus I~
#> $ Total.Regencies             <int> 1, 1, 416, 10, 1, 416, 18, 10, 1, 416, 18,~
#> $ Total.Cities                <int> 5, 5, 98, 2, 5, 98, 9, 2, 5, 98, 9, 2, 5, ~
#> $ Total.Districts             <int> 44, 44, 7230, 169, 44, 7230, 627, 169, 44,~
#> $ Total.Urban.Villages        <int> 267, 267, 8488, 268, 267, 8488, 645, 268, ~
#> $ Total.Rural.Villages        <int> NA, NA, 74953, 1591, NA, 74953, 5312, 1591~
#> $ Area..km2.                  <int> 664, 664, 1916907, 87024, 664, 1916907, 35~
#> $ Population                  <int> 10846145, 10846145, 265185520, 6074100, 10~
#> $ Population.Density          <dbl> 16334.31, 16334.31, 138.34, 69.80, 16334.3~
#> $ Longitude                   <dbl> 106.8361, 106.8361, 113.9213, 101.8051, 10~
#> $ Latitude                    <dbl> -6.2046990, -6.2046990, -0.7892750, 0.5116~
#> $ New.Cases.per.Million       <dbl> 0.18, 0.18, 0.01, 0.16, 0.18, 0.00, 0.02, ~
#> $ Total.Cases.per.Million     <dbl> 3.60, 3.78, 0.01, 0.33, 3.96, 0.01, 0.02, ~
#> $ New.Deaths.per.Million      <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.02, ~
#> $ Total.Deaths.per.Million    <dbl> 1.84, 1.84, 0.00, 0.00, 1.84, 0.00, 0.02, ~
#> $ Total.Deaths.per.100rb      <dbl> 0.18, 0.18, 0.00, 0.00, 0.18, 0.00, 0.00, ~
#> $ Case.Fatality.Rate          <chr> "51.28%", "48.78%", "0.00%", "0.00%", "46.~
#> $ Case.Recovered.Rate         <chr> "105.13%", "100.00%", "0.00%", "150.00%", ~
#> $ Growth.Factor.of.New.Cases  <dbl> NA, 1, NA, NA, 1, 0, NA, 0, 1, 1, 1, 1, 0,~
#> $ Growth.Factor.of.New.Deaths <dbl> NA, 1, NA, NA, 1, 1, NA, 1, 1, 1, 0, 1, NA~

Subset column Location, ï..Date, and New.Cases for analysis

#subset the location only DKI Jakarta, and column i..Date, and New.Cases and change data type

covid.jkt <- covid %>% 
  subset(Location == "DKI Jakarta") %>% 
  select(ï..Date, New.Cases) %>% 
  mutate(ï..Date = mdy(ï..Date))

#change the column name from i..Date to Date
colnames(covid.jkt)[1] = "Date"

head(covid.jkt)

#>          Date New.Cases
#> 1  2020-03-01         2
#> 2  2020-03-02         2
#> 5  2020-03-03         2
#> 9  2020-03-04         2
#> 13 2020-03-05         0
#> 18 2020-03-06         0

Check Missing Value

#Check if there is missing column in data

colSums(is.na(covid.jkt))

#>      Date New.Cases 
#>         0         0

For March in our data contains so much of 0 value, so we’ll use the data start from April 2020 until December 2021.

covid.jkt <- covid.jkt %>% 
  subset(Date >= "2020-04-01")

head(covid.jkt)

#>           Date New.Cases
#> 415 2020-04-01        59
#> 448 2020-04-02        70
#> 481 2020-04-03        71
#> 514 2020-04-04        57
#> 547 2020-04-05        87
#> 580 2020-04-06       103

Then we’ll check when the data start until end.

# Check range `Date` the data
range(covid.jkt$Date)

#> [1] "2020-04-01" "2021-12-02"

Now, our data is start from April 1st, 2020 until December 2nd, 2021.

Before we’re doing the analysis, we have to convert the data into time series object

Build time series object

#create object ts 
covid.jkt_ts <- ts(data = covid.jkt$New.Cases,
   start = min(covid.jkt$Date),
   frequency = 7) #weekly seasonality 

#visualise object covid_ts
covid.jkt_ts %>% autoplot()

From the plot, we can see that our data’s model is multiplicative. We’ll try to see the seasonality, trend and error of data.

Exploratory Data Analysis

covid.jkt_dc <- decompose(covid.jkt_ts, type = "multiplicative")

covid.jkt_dc %>% autoplot()

We can also see an up or down trend using adjusted seasonal which has removing the effects of seasonal data.

autoplot(covid.jkt_dc$x - covid.jkt_dc$seasonal)

From the plot and information above, we can see that the data doesn’t has a seasonal. The trend patterns of data is increasing.

Cross Validation

Before we do the forecasting, we’ll try to split our data into train and test. The test data is the data of last 7 days (1 week), and the rest as train data.

test <- covid.jkt_ts %>% tail(7) #get 7 last days 

train <- covid.jkt_ts %>% head(-7) #get the rest data

Time Series Modeling

Before we do the modeling, we’ll check our train data plot.

train %>% autoplot()

From the train data plot , we can see that the plot has only trend. So, we’ll try modeling using Holt’s Exponential Smoothing method and ARIMA method.

Holt’s Exponential Smoothing

Build Model

covid.jkt_holt <- HoltWinters(x = train,
                                gamma = F)
covid.jkt_holt

#> Holt-Winters exponential smoothing with trend and without seasonal component.
#> 
#> Call:
#> HoltWinters(x = train, gamma = F)
#> 
#> Smoothing parameters:
#>  alpha: 0.9286053
#>  beta : 0
#>  gamma: FALSE
#> 
#> Coefficients:
#>       [,1]
#> a 56.76507
#> b 11.00000

We can set the alpha, and beta value, but in this case we’ll try to use auto so that the system can try to search the optimum number of alpha and beta. We get value of alpha = 56.77, and beta = 11.

ARIMA

The first step before build The Arima Model is checking our data had already stationary or not. Because Arima can only used when the data had already stationary.

adf.test(train)

#> 
#>  Augmented Dickey-Fuller Test
#> 
#> data:  train
#> Dickey-Fuller = -3.486, Lag order = 8, p-value = 0.04365
#> alternative hypothesis: stationary

p-value is less than 0,05. So the data train had already stationer

Next, we’ll start using auto.arima() for modeling

covid.jkt_auto <- auto.arima(train)

covid.jkt_auto

#> Series: train 
#> ARIMA(3,1,2)(1,0,0)[7] 
#> 
#> Coefficients:
#>           ar1      ar2      ar3     ma1     ma2    sar1
#>       -1.7436  -1.0851  -0.3076  1.6646  0.7745  0.4702
#> s.e.   0.0656   0.0826   0.0444  0.0599  0.0569  0.0408
#> 
#> sigma^2 estimated as 210813:  log likelihood=-4549.58
#> AIC=9113.16   AICc=9113.35   BIC=9143.97

From model covid.jkt_auto we can see that the system try to search the optimum number p,d,q. We get p = 3, d = 1, and q = 2

Next, we try to try another p,d,q number

covid.jkt_ts %>% 
  diff() %>% 
  tsdisplay()

From the first 5 lags, we can see the p and q from PACF and ACF by count how many line which passed the threshold.

p = 5 (2nd lag and 5st lag passed the threshold) d = 1 (we only do differencing once to get stationary data) q = 5 (2nd lag and 5st on ACF plot passed the threshold)

We’ll try to do ARIMA modeling by combining the (p,d,and q) value. Some combination we can try :

ARIMA (2,1,2) ARIMA (0,1,0) ARIMA (5,1,5)

Goodness of Fit

arima1 <- Arima(train, order = c(2,1,2))
arima2 <- Arima(train, order = c(0,1,0))
arima3 <- Arima(train, order = c(5,1,5))

For selecting the best model, we can see the smallest AIC value of each model

# check AIC
arima1$aic

#> [1] 9250.474

arima2$aic

#> [1] 9263.08

arima3$aic

#> [1] 9129.653

We’ll use model arima3 because the AIC value is smaller than others.

Forecasting

# forecast
forecast_holt <- forecast(covid.jkt_holt, h=7)
forecast_auto <- forecast(covid.jkt_auto, h=7)
forecast_arima3 <- forecast(arima3, h=7)

# visualization

a <- autoplot(forecast_holt, series = "Holt", fcol = "red") +
  autolayer(covid.jkt_ts, series = "Actual", color = "black") + 
  labs(subtitle = "New Case of Covid in DKI Jakarta, Indonesia from April - December 2021",
       y = "New Cases") +
  theme_minimal()

b <- autoplot(forecast_auto, series = "Auto Arima", fcol = "green") +
  autolayer(covid.jkt_ts, series = "Actual", color = "black") +
  labs(subtitle = "New Case of Covid in DKI Jakarta, Indonesia from April - December 2021",
       y = "New Cases") +
  theme_minimal()

c <- autoplot(forecast_arima3, series = "Arima3", fcol = "blue") +
  autolayer(covid.jkt_ts, series = "Actual", color = "black") +
  labs(subtitle = "New Case of Covid in DKI Jakarta, Indonesia from April - December 2021",
       y = "New Cases") +
  theme_minimal()

gridExtra::grid.arrange(a,b,c)

Evaluation Model

We’ll try to evaluate all the models with the data test

accuracy(forecast_holt, test)

#>                     ME      RMSE       MAE       MPE      MAPE      MASE
#> Training set -11.86940 523.00021 261.70054  -11.6534  28.67445 0.5662851
#> Test set     -49.62222  58.86475  50.26077 -124.5099 125.42215 0.1087576
#>                   ACF1 Theil's U
#> Training set 0.0113167        NA
#> Test set     0.2658930  3.236423

accuracy(forecast_auto, test)

#>                       ME      RMSE       MAE       MPE     MAPE       MASE
#> Training set -0.04602697 456.47557 238.40553 -5.391155 24.71554 0.51587780
#> Test set     -0.36093219  11.06588  10.06216 -3.532828 20.53556 0.02177318
#>                       ACF1 Theil's U
#> Training set  0.0008031502        NA
#> Test set     -0.2067190812 0.5645724

accuracy(forecast_arima3, test)

#>                       ME      RMSE       MAE       MPE     MAPE       MASE
#> Training set -0.01579353 459.81411 237.10778 -6.638561 25.94193 0.51306962
#> Test set      0.58796216  15.26343  10.70438 -7.094500 24.70932 0.02316285
#>                     ACF1 Theil's U
#> Training set  0.03054903        NA
#> Test set     -0.20944803 0.4571864

There is some information we can get about MAPE :

Model covid.jkt_holt is worst at forecasting in data train and data test
Model covid.jkt_auto is better at forecasting in test dataset with 20.54% of error, but when forecasting in train dataset, model covid.jkt_auto got error of 24.72%.
Model arima3 is better at forecasting in test dataset with 24.71% of error, but when forecasting in train dataset, model covid.jkt_auto got error of 25.94%.

Assumption Check

1. Normality with Shapiro.test

H0 : residuals are normally distributed
H1 : residuals are not normally distributed

shapiro.test(forecast_auto$residuals)

#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  forecast_auto$residuals
#> W = 0.74821, p-value < 0.00000000000000022

shapiro.test(forecast_holt$residuals)

#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  forecast_holt$residuals
#> W = 0.71777, p-value < 0.00000000000000022

shapiro.test(forecast_arima3$residuals)

#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  forecast_arima3$residuals
#> W = 0.73685, p-value < 0.00000000000000022

From shapiro teset for all the models we can see that the residuals are not normally distributed (accepted H1)

2. Autocorrelation : Box.test - Ljng-Box

H0 : No autocorrelation in the forecast errors
H1 : there is an autocorrelation in the forecast errors

Box.test(forecast_holt$residuals, type = "Ljung-Box")

#> 
#>  Box-Ljung test
#> 
#> data:  forecast_holt$residuals
#> X-squared = 0.077482, df = 1, p-value = 0.7807

Box.test(forecast_auto$residuals, type = "Ljung-Box")

#> 
#>  Box-Ljung test
#> 
#> data:  forecast_auto$residuals
#> X-squared = 0.00039155, df = 1, p-value = 0.9842

Box.test(arima3$residuals, type = "Ljung-Box")

#> 
#>  Box-Ljung test
#> 
#> data:  arima3$residuals
#> X-squared = 0.56648, df = 1, p-value = 0.4517

From L-jung Box test, we can see that model covid.jkt_auto and model covid.jkt_holt has no autocorrelation in the forecast errors (p-value more than 0,05, accepted H0). But, model arima3 has an autocorrelation in the forecast errors (p-value less than 0,05, accepted H1).

Conclusion

We have already built several model time series analysis for this case, such as Holt’s Exponential Smoothing and ARIMA. Based on MAPE, model ARIMA covid.jkt_auto is better than others. Model covid.jkt_auto in forecasting dataset was better with error of 20.54 % in forecasting test dataset (by using MAPE). Based on assumption check, model covid.jkt_auto and covid.jkt_holt has no auto-correlation in forecast errors. But, all the residual models haven’t distributed normally. Next, for the recommendation we need to do normalizing data in data preprocess before modeling and fitting step.