Time Series Analysis for New Cases of Covid-19 in DKI Jakarta
Introduction
Until today, Covid-19 still increase rapidly almost in all countries, including Indonesia. All cities in Indonesia have been affected by COVID-19, including DKI Jakarta. We’ll try to forecast the number of increasing case in DKI Jakarta. The dataset was provided by Kaggle . The data contains information from March 1st, 2020 until December 2nd, 2021.
Input Library
Below list of package that we’re using for this analysis.
library(dplyr)
library(forecast)
library(tseries)
library(TTR)
library(lubridate)
library(tidyverse)Input Data
covid <- read.csv("covid_19_indonesia_time_series_all.csv")
head(covid)#> ï..Date Location.ISO.Code Location New.Cases New.Deaths New.Recovered
#> 1 3/1/2020 ID-JK DKI Jakarta 2 0 0
#> 2 3/2/2020 ID-JK DKI Jakarta 2 0 0
#> 3 3/2/2020 IDN Indonesia 2 0 0
#> 4 3/2/2020 ID-RI Riau 1 0 0
#> 5 3/3/2020 ID-JK DKI Jakarta 2 0 0
#> 6 3/3/2020 IDN Indonesia 0 0 0
#> New.Active.Cases Total.Cases Total.Deaths Total.Recovered Total.Active.Cases
#> 1 2 39 20 41 -22
#> 2 2 41 20 41 -20
#> 3 2 2 0 0 2
#> 4 1 2 0 3 -1
#> 5 2 43 20 41 -18
#> 6 0 2 0 0 2
#> Location.Level City.or.Regency Province Country Continent Island
#> 1 Province NA DKI Jakarta Indonesia Asia Jawa
#> 2 Province NA DKI Jakarta Indonesia Asia Jawa
#> 3 Country NA Indonesia Asia
#> 4 Province NA Riau Indonesia Asia Sumatera
#> 5 Province NA DKI Jakarta Indonesia Asia Jawa
#> 6 Country NA Indonesia Asia
#> Time.Zone Special.Status Total.Regencies Total.Cities Total.Districts
#> 1 UTC+07:00 Daerah Khusus Ibu Kota 1 5 44
#> 2 UTC+07:00 Daerah Khusus Ibu Kota 1 5 44
#> 3 416 98 7230
#> 4 UTC+07:00 10 2 169
#> 5 UTC+07:00 Daerah Khusus Ibu Kota 1 5 44
#> 6 416 98 7230
#> Total.Urban.Villages Total.Rural.Villages Area..km2. Population
#> 1 267 NA 664 10846145
#> 2 267 NA 664 10846145
#> 3 8488 74953 1916907 265185520
#> 4 268 1591 87024 6074100
#> 5 267 NA 664 10846145
#> 6 8488 74953 1916907 265185520
#> Population.Density Longitude Latitude New.Cases.per.Million
#> 1 16334.31 106.8361 -6.2046990 0.18
#> 2 16334.31 106.8361 -6.2046990 0.18
#> 3 138.34 113.9213 -0.7892750 0.01
#> 4 69.80 101.8051 0.5116479 0.16
#> 5 16334.31 106.8361 -6.2046990 0.18
#> 6 138.34 113.9213 -0.7892750 0.00
#> Total.Cases.per.Million New.Deaths.per.Million Total.Deaths.per.Million
#> 1 3.60 0 1.84
#> 2 3.78 0 1.84
#> 3 0.01 0 0.00
#> 4 0.33 0 0.00
#> 5 3.96 0 1.84
#> 6 0.01 0 0.00
#> Total.Deaths.per.100rb Case.Fatality.Rate Case.Recovered.Rate
#> 1 0.18 51.28% 105.13%
#> 2 0.18 48.78% 100.00%
#> 3 0.00 0.00% 0.00%
#> 4 0.00 0.00% 150.00%
#> 5 0.18 46.51% 95.35%
#> 6 0.00 0.00% 0.00%
#> Growth.Factor.of.New.Cases Growth.Factor.of.New.Deaths
#> 1 NA NA
#> 2 1 1
#> 3 NA NA
#> 4 NA NA
#> 5 1 1
#> 6 0 1
Data Preparation
Data Wrangling
Check structure data of covid
glimpse(covid)#> Rows: 21,759
#> Columns: 38
#> $ ï..Date <chr> "3/1/2020", "3/2/2020", "3/2/2020", "3/2/2~
#> $ Location.ISO.Code <chr> "ID-JK", "ID-JK", "IDN", "ID-RI", "ID-JK",~
#> $ Location <chr> "DKI Jakarta", "DKI Jakarta", "Indonesia",~
#> $ New.Cases <int> 2, 2, 2, 1, 2, 0, 1, 0, 2, 0, 1, 0, 0, 0, ~
#> $ New.Deaths <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, ~
#> $ New.Recovered <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
#> $ New.Active.Cases <int> 2, 2, 2, 1, 2, 0, 0, 0, 2, 0, 1, 0, -1, 0,~
#> $ Total.Cases <int> 39, 41, 2, 2, 43, 2, 1, 2, 45, 2, 2, 2, 45~
#> $ Total.Deaths <int> 20, 20, 0, 0, 20, 0, 1, 0, 20, 0, 1, 0, 21~
#> $ Total.Recovered <int> 41, 41, 0, 3, 41, 0, 8, 3, 41, 0, 8, 3, 41~
#> $ Total.Active.Cases <int> -22, -20, 2, -1, -18, 2, -8, -1, -16, 2, -~
#> $ Location.Level <chr> "Province", "Province", "Country", "Provin~
#> $ City.or.Regency <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
#> $ Province <chr> "DKI Jakarta", "DKI Jakarta", "", "Riau", ~
#> $ Country <chr> "Indonesia", "Indonesia", "Indonesia", "In~
#> $ Continent <chr> "Asia", "Asia", "Asia", "Asia", "Asia", "A~
#> $ Island <chr> "Jawa", "Jawa", "", "Sumatera", "Jawa", ""~
#> $ Time.Zone <chr> "UTC+07:00", "UTC+07:00", "", "UTC+07:00",~
#> $ Special.Status <chr> "Daerah Khusus Ibu Kota", "Daerah Khusus I~
#> $ Total.Regencies <int> 1, 1, 416, 10, 1, 416, 18, 10, 1, 416, 18,~
#> $ Total.Cities <int> 5, 5, 98, 2, 5, 98, 9, 2, 5, 98, 9, 2, 5, ~
#> $ Total.Districts <int> 44, 44, 7230, 169, 44, 7230, 627, 169, 44,~
#> $ Total.Urban.Villages <int> 267, 267, 8488, 268, 267, 8488, 645, 268, ~
#> $ Total.Rural.Villages <int> NA, NA, 74953, 1591, NA, 74953, 5312, 1591~
#> $ Area..km2. <int> 664, 664, 1916907, 87024, 664, 1916907, 35~
#> $ Population <int> 10846145, 10846145, 265185520, 6074100, 10~
#> $ Population.Density <dbl> 16334.31, 16334.31, 138.34, 69.80, 16334.3~
#> $ Longitude <dbl> 106.8361, 106.8361, 113.9213, 101.8051, 10~
#> $ Latitude <dbl> -6.2046990, -6.2046990, -0.7892750, 0.5116~
#> $ New.Cases.per.Million <dbl> 0.18, 0.18, 0.01, 0.16, 0.18, 0.00, 0.02, ~
#> $ Total.Cases.per.Million <dbl> 3.60, 3.78, 0.01, 0.33, 3.96, 0.01, 0.02, ~
#> $ New.Deaths.per.Million <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.02, ~
#> $ Total.Deaths.per.Million <dbl> 1.84, 1.84, 0.00, 0.00, 1.84, 0.00, 0.02, ~
#> $ Total.Deaths.per.100rb <dbl> 0.18, 0.18, 0.00, 0.00, 0.18, 0.00, 0.00, ~
#> $ Case.Fatality.Rate <chr> "51.28%", "48.78%", "0.00%", "0.00%", "46.~
#> $ Case.Recovered.Rate <chr> "105.13%", "100.00%", "0.00%", "150.00%", ~
#> $ Growth.Factor.of.New.Cases <dbl> NA, 1, NA, NA, 1, 0, NA, 0, 1, 1, 1, 1, 0,~
#> $ Growth.Factor.of.New.Deaths <dbl> NA, 1, NA, NA, 1, 1, NA, 1, 1, 1, 0, 1, NA~
Subset column Location, ï..Date, and New.Cases for analysis
#subset the location only DKI Jakarta, and column i..Date, and New.Cases and change data type
covid.jkt <- covid %>%
subset(Location == "DKI Jakarta") %>%
select(ï..Date, New.Cases) %>%
mutate(ï..Date = mdy(ï..Date))
#change the column name from i..Date to Date
colnames(covid.jkt)[1] = "Date"
head(covid.jkt)#> Date New.Cases
#> 1 2020-03-01 2
#> 2 2020-03-02 2
#> 5 2020-03-03 2
#> 9 2020-03-04 2
#> 13 2020-03-05 0
#> 18 2020-03-06 0
Check Missing Value
#Check if there is missing column in data
colSums(is.na(covid.jkt))#> Date New.Cases
#> 0 0
For March in our data contains so much of 0 value, so we’ll use the data start from April 2020 until December 2021.
covid.jkt <- covid.jkt %>%
subset(Date >= "2020-04-01")
head(covid.jkt)#> Date New.Cases
#> 415 2020-04-01 59
#> 448 2020-04-02 70
#> 481 2020-04-03 71
#> 514 2020-04-04 57
#> 547 2020-04-05 87
#> 580 2020-04-06 103
Then we’ll check when the data start until end.
# Check range `Date` the data
range(covid.jkt$Date)#> [1] "2020-04-01" "2021-12-02"
Now, our data is start from April 1st, 2020 until December 2nd, 2021.
Before we’re doing the analysis, we have to convert the data into time series object
Build time series object
#create object ts
covid.jkt_ts <- ts(data = covid.jkt$New.Cases,
start = min(covid.jkt$Date),
frequency = 7) #weekly seasonality
#visualise object covid_ts
covid.jkt_ts %>% autoplot()From the plot, we can see that our data’s model is multiplicative. We’ll try to see the seasonality, trend and error of data.
Exploratory Data Analysis
covid.jkt_dc <- decompose(covid.jkt_ts, type = "multiplicative")
covid.jkt_dc %>% autoplot()We can also see an up or down trend using adjusted seasonal which has removing the effects of seasonal data.
autoplot(covid.jkt_dc$x - covid.jkt_dc$seasonal)From the plot and information above, we can see that the data doesn’t has a seasonal. The trend patterns of data is increasing.
Cross Validation
Before we do the forecasting, we’ll try to split our data into train and test. The test data is the data of last 7 days (1 week), and the rest as train data.
test <- covid.jkt_ts %>% tail(7) #get 7 last days
train <- covid.jkt_ts %>% head(-7) #get the rest dataTime Series Modeling
Before we do the modeling, we’ll check our train data plot.
train %>% autoplot()From the train data plot , we can see that the plot has only trend. So, we’ll try modeling using Holt’s Exponential Smoothing method and ARIMA method.
Holt’s Exponential Smoothing
Build Model
covid.jkt_holt <- HoltWinters(x = train,
gamma = F)
covid.jkt_holt#> Holt-Winters exponential smoothing with trend and without seasonal component.
#>
#> Call:
#> HoltWinters(x = train, gamma = F)
#>
#> Smoothing parameters:
#> alpha: 0.9286053
#> beta : 0
#> gamma: FALSE
#>
#> Coefficients:
#> [,1]
#> a 56.76507
#> b 11.00000
We can set the alpha, and beta value, but in this case we’ll try to use auto so that the system can try to search the optimum number of alpha and beta. We get value of alpha = 56.77, and beta = 11.
ARIMA
The first step before build The Arima Model is checking our data had already stationary or not. Because Arima can only used when the data had already stationary.
adf.test(train)#>
#> Augmented Dickey-Fuller Test
#>
#> data: train
#> Dickey-Fuller = -3.486, Lag order = 8, p-value = 0.04365
#> alternative hypothesis: stationary
p-value is less than 0,05. So the data train had already stationer
Next, we’ll start using auto.arima() for modeling
covid.jkt_auto <- auto.arima(train)
covid.jkt_auto#> Series: train
#> ARIMA(3,1,2)(1,0,0)[7]
#>
#> Coefficients:
#> ar1 ar2 ar3 ma1 ma2 sar1
#> -1.7436 -1.0851 -0.3076 1.6646 0.7745 0.4702
#> s.e. 0.0656 0.0826 0.0444 0.0599 0.0569 0.0408
#>
#> sigma^2 estimated as 210813: log likelihood=-4549.58
#> AIC=9113.16 AICc=9113.35 BIC=9143.97
From model covid.jkt_auto we can see that the system try to search the optimum number p,d,q. We get p = 3, d = 1, and q = 2
Next, we try to try another p,d,q number
covid.jkt_ts %>%
diff() %>%
tsdisplay()From the first 5 lags, we can see the p and q from PACF and ACF by count how many line which passed the threshold.
p = 5 (2nd lag and 5st lag passed the threshold) d = 1 (we only do differencing once to get stationary data) q = 5 (2nd lag and 5st on ACF plot passed the threshold)
We’ll try to do ARIMA modeling by combining the (p,d,and q) value. Some combination we can try :
ARIMA (2,1,2) ARIMA (0,1,0) ARIMA (5,1,5)
Goodness of Fit
arima1 <- Arima(train, order = c(2,1,2))
arima2 <- Arima(train, order = c(0,1,0))
arima3 <- Arima(train, order = c(5,1,5))For selecting the best model, we can see the smallest AIC value of each model
# check AIC
arima1$aic#> [1] 9250.474
arima2$aic#> [1] 9263.08
arima3$aic#> [1] 9129.653
We’ll use model arima3 because the AIC value is smaller than others.
Forecasting
# forecast
forecast_holt <- forecast(covid.jkt_holt, h=7)
forecast_auto <- forecast(covid.jkt_auto, h=7)
forecast_arima3 <- forecast(arima3, h=7)# visualization
a <- autoplot(forecast_holt, series = "Holt", fcol = "red") +
autolayer(covid.jkt_ts, series = "Actual", color = "black") +
labs(subtitle = "New Case of Covid in DKI Jakarta, Indonesia from April - December 2021",
y = "New Cases") +
theme_minimal()
b <- autoplot(forecast_auto, series = "Auto Arima", fcol = "green") +
autolayer(covid.jkt_ts, series = "Actual", color = "black") +
labs(subtitle = "New Case of Covid in DKI Jakarta, Indonesia from April - December 2021",
y = "New Cases") +
theme_minimal()
c <- autoplot(forecast_arima3, series = "Arima3", fcol = "blue") +
autolayer(covid.jkt_ts, series = "Actual", color = "black") +
labs(subtitle = "New Case of Covid in DKI Jakarta, Indonesia from April - December 2021",
y = "New Cases") +
theme_minimal()
gridExtra::grid.arrange(a,b,c)Evaluation Model
We’ll try to evaluate all the models with the data test
accuracy(forecast_holt, test)#> ME RMSE MAE MPE MAPE MASE
#> Training set -11.86940 523.00021 261.70054 -11.6534 28.67445 0.5662851
#> Test set -49.62222 58.86475 50.26077 -124.5099 125.42215 0.1087576
#> ACF1 Theil's U
#> Training set 0.0113167 NA
#> Test set 0.2658930 3.236423
accuracy(forecast_auto, test)#> ME RMSE MAE MPE MAPE MASE
#> Training set -0.04602697 456.47557 238.40553 -5.391155 24.71554 0.51587780
#> Test set -0.36093219 11.06588 10.06216 -3.532828 20.53556 0.02177318
#> ACF1 Theil's U
#> Training set 0.0008031502 NA
#> Test set -0.2067190812 0.5645724
accuracy(forecast_arima3, test)#> ME RMSE MAE MPE MAPE MASE
#> Training set -0.01579353 459.81411 237.10778 -6.638561 25.94193 0.51306962
#> Test set 0.58796216 15.26343 10.70438 -7.094500 24.70932 0.02316285
#> ACF1 Theil's U
#> Training set 0.03054903 NA
#> Test set -0.20944803 0.4571864
There is some information we can get about MAPE :
Model
covid.jkt_holtis worst at forecasting in data train and data testModel
covid.jkt_autois better at forecasting in test dataset with 20.54% of error, but when forecasting in train dataset, modelcovid.jkt_autogot error of 24.72%.Model
arima3is better at forecasting in test dataset with 24.71% of error, but when forecasting in train dataset, modelcovid.jkt_autogot error of 25.94%.
Assumption Check
1. Normality with Shapiro.test
H0 : residuals are normally distributed
H1 : residuals are not normally distributed
shapiro.test(forecast_auto$residuals)#>
#> Shapiro-Wilk normality test
#>
#> data: forecast_auto$residuals
#> W = 0.74821, p-value < 0.00000000000000022
shapiro.test(forecast_holt$residuals)#>
#> Shapiro-Wilk normality test
#>
#> data: forecast_holt$residuals
#> W = 0.71777, p-value < 0.00000000000000022
shapiro.test(forecast_arima3$residuals)#>
#> Shapiro-Wilk normality test
#>
#> data: forecast_arima3$residuals
#> W = 0.73685, p-value < 0.00000000000000022
From shapiro teset for all the models we can see that the residuals are not normally distributed (accepted H1)
2. Autocorrelation : Box.test - Ljng-Box
H0 : No autocorrelation in the forecast errors
H1 : there is an autocorrelation in the forecast errors
Box.test(forecast_holt$residuals, type = "Ljung-Box")#>
#> Box-Ljung test
#>
#> data: forecast_holt$residuals
#> X-squared = 0.077482, df = 1, p-value = 0.7807
Box.test(forecast_auto$residuals, type = "Ljung-Box")#>
#> Box-Ljung test
#>
#> data: forecast_auto$residuals
#> X-squared = 0.00039155, df = 1, p-value = 0.9842
Box.test(arima3$residuals, type = "Ljung-Box")#>
#> Box-Ljung test
#>
#> data: arima3$residuals
#> X-squared = 0.56648, df = 1, p-value = 0.4517
From L-jung Box test, we can see that model covid.jkt_auto and model covid.jkt_holt has no autocorrelation in the forecast errors (p-value more than 0,05, accepted H0). But, model arima3 has an autocorrelation in the forecast errors (p-value less than 0,05, accepted H1).
Conclusion
We have already built several model time series analysis for this case, such as Holt’s Exponential Smoothing and ARIMA. Based on MAPE, model ARIMA covid.jkt_auto is better than others. Model covid.jkt_auto in forecasting dataset was better with error of 20.54 % in forecasting test dataset (by using MAPE). Based on assumption check, model covid.jkt_auto and covid.jkt_holt has no auto-correlation in forecast errors. But, all the residual models haven’t distributed normally. Next, for the recommendation we need to do normalizing data in data preprocess before modeling and fitting step.