Description

In this lecture, we are trying to do Time Series Analysis using Indonesian Covid19 data set.

Data Extraction

dataframe <- read.csv("covid_19_indonesia_time_series_all.csv")

Exploratory Data Analysis

summary(dataframe)
##    ï..Date          Location.ISO.Code    Location           New.Cases      
##  Length:3493        Length:3493        Length:3493        Min.   :   0.00  
##  Class :character   Class :character   Class :character   1st Qu.:   0.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :   3.00  
##                                                           Mean   :  32.09  
##                                                           3rd Qu.:  15.00  
##                                                           Max.   :1385.00  
##                                                                            
##    New.Deaths     New.Recovered     New.Active.Cases   Total.Cases   
##  Min.   : 0.000   Min.   :   0.00   Min.   :-309.00   Min.   :    1  
##  1st Qu.: 0.000   1st Qu.:   0.00   1st Qu.:   0.00   1st Qu.:   27  
##  Median : 0.000   Median :   0.00   Median :   1.00   Median :  128  
##  Mean   : 1.634   Mean   :  14.17   Mean   :  16.29   Mean   : 1085  
##  3rd Qu.: 0.000   3rd Qu.:   5.00   3rd Qu.:  10.00   3rd Qu.:  462  
##  Max.   :71.000   Max.   :1027.00   Max.   : 772.00   Max.   :56385  
##                                                                      
##   Total.Deaths    Total.Recovered Total.Active.Cases Location.Level    
##  Min.   :   0.0   Min.   :    0   Min.   :  -44.0    Length:3493       
##  1st Qu.:   1.0   1st Qu.:    4   1st Qu.:   16.0    Class :character  
##  Median :   5.0   Median :   34   Median :   70.0    Mode  :character  
##  Mean   :  66.5   Mean   :  345   Mean   :  673.7                      
##  3rd Qu.:  25.0   3rd Qu.:  151   3rd Qu.:  271.0                      
##  Max.   :2876.0   Max.   :24806   Max.   :28703.0                      
##                                                                        
##  City.or.Regency   Province           Country           Continent        
##  Mode:logical    Length:3493        Length:3493        Length:3493       
##  NA's:3493       Class :character   Class :character   Class :character  
##                  Mode  :character   Mode  :character   Mode  :character  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##     Island           Time.Zone         Special.Status     Total.Regencies 
##  Length:3493        Length:3493        Length:3493        Min.   :  1.00  
##  Class :character   Class :character   Class :character   1st Qu.:  7.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 11.00  
##                                                           Mean   : 26.29  
##                                                           3rd Qu.: 18.00  
##                                                           Max.   :416.00  
##                                                                           
##   Total.Cities    Total.Districts  Total.Urban.Villages Total.Rural.Villages
##  Min.   : 1.000   Min.   :  44.0   Min.   :  35.0       Min.   :  275       
##  1st Qu.: 1.000   1st Qu.: 103.0   1st Qu.:  99.0       1st Qu.:  928       
##  Median : 2.000   Median : 169.0   Median : 175.0       Median : 1591       
##  Mean   : 6.469   Mean   : 460.7   Mean   : 559.3       Mean   : 4939       
##  3rd Qu.: 5.000   3rd Qu.: 289.0   3rd Qu.: 377.0       3rd Qu.: 2853       
##  Max.   :98.000   Max.   :7230.0   Max.   :8488.0       Max.   :74953       
##  NA's   :95                        NA's   :97           NA's   :122         
##    Area..km2.        Population        Population.Density   Longitude     
##  Min.   :    664   Min.   :   648407   Min.   :    8.59   Min.   : 96.91  
##  1st Qu.:  16787   1st Qu.:  2570289   1st Qu.:   47.79   1st Qu.:106.11  
##  Median :  42013   Median :  4340348   Median :  103.84   Median :113.42  
##  Mean   : 120695   Mean   : 17214818   Mean   :  845.88   Mean   :113.61  
##  3rd Qu.:  75468   3rd Qu.:  9426885   3rd Qu.:  262.70   3rd Qu.:120.16  
##  Max.   :1916907   Max.   :265185520   Max.   :16334.31   Max.   :138.70  
##                                                                           
##     Latitude        New.Cases.per.Million Total.Cases.per.Million
##  Min.   :-8.68220   Min.   : 0.000        Min.   :   0.01        
##  1st Qu.:-6.20470   1st Qu.: 0.000        1st Qu.:   6.26        
##  Median :-2.46175   Median : 0.570        Median :  31.16        
##  Mean   :-2.80315   Mean   : 2.316        Mean   :  75.79        
##  3rd Qu.:-0.08647   3rd Qu.: 2.540        3rd Qu.:  90.48        
##  Max.   : 4.22562   Max.   :73.410        Max.   :1008.65        
##                                                                  
##  New.Deaths.per.Million Total.Deaths.per.Million Case.Fatality.Rate
##  Min.   :0.00000        Min.   : 0.000           Length:3493       
##  1st Qu.:0.00000        1st Qu.: 0.330           Class :character  
##  Median :0.00000        Median : 1.280           Mode  :character  
##  Mean   :0.09621        Mean   : 3.684                             
##  3rd Qu.:0.00000        3rd Qu.: 3.320                             
##  Max.   :6.71000        Max.   :56.430                             
##                                                                    
##  Case.Recovered.Rate Growth.Factor.of.New.Cases Growth.Factor.of.New.Deaths
##  Length:3493         Min.   : 0.0000            Min.   : 0.0000            
##  Class :character    1st Qu.: 0.3525            1st Qu.: 1.0000            
##  Mode  :character    Median : 1.0000            Median : 1.0000            
##                      Mean   : 1.3400            Mean   : 0.9535            
##                      3rd Qu.: 1.1375            3rd Qu.: 1.0000            
##                      Max.   :54.0000            Max.   :17.0000            
##                      NA's   :515                NA's   :395
str(dataframe)
## 'data.frame':    3493 obs. of  37 variables:
##  $ ï..Date                    : chr  "3/1/2020" "3/1/2020" "3/2/2020" "3/2/2020" ...
##  $ Location.ISO.Code          : chr  "ID-JK" "ID-JB" "ID-JK" "IDN" ...
##  $ Location                   : chr  "DKI Jakarta" "Jawa Barat" "DKI Jakarta" "Indonesia" ...
##  $ New.Cases                  : int  3 3 2 2 0 2 0 1 2 0 ...
##  $ New.Deaths                 : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ New.Recovered              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ New.Active.Cases           : int  3 3 2 2 0 2 0 0 2 0 ...
##  $ Total.Cases                : int  3 3 5 2 3 7 2 4 9 2 ...
##  $ Total.Deaths               : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ Total.Recovered            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Total.Active.Cases         : int  3 3 5 2 3 7 2 3 9 2 ...
##  $ Location.Level             : chr  "Province" "Province" "Province" "Country" ...
##  $ City.or.Regency            : logi  NA NA NA NA NA NA ...
##  $ Province                   : chr  "DKI Jakarta" "Jawa Barat" "DKI Jakarta" "" ...
##  $ Country                    : chr  "Indonesia" "Indonesia" "Indonesia" "Indonesia" ...
##  $ Continent                  : chr  "Asia" "Asia" "Asia" "Asia" ...
##  $ Island                     : chr  "Jawa" "Jawa" "Jawa" "" ...
##  $ Time.Zone                  : chr  "UTC+07:00" "UTC+07:00" "UTC+07:00" "" ...
##  $ Special.Status             : chr  "Daerah Khusus Ibu Kota" "" "Daerah Khusus Ibu Kota" "" ...
##  $ Total.Regencies            : int  1 18 1 416 18 1 416 18 1 416 ...
##  $ Total.Cities               : int  5 9 5 98 9 5 98 9 5 98 ...
##  $ Total.Districts            : int  44 627 44 7230 627 44 7230 627 44 7230 ...
##  $ Total.Urban.Villages       : int  267 645 267 8488 645 267 8488 645 267 8488 ...
##  $ Total.Rural.Villages       : int  NA 5312 NA 74953 5312 NA 74953 5312 NA 74953 ...
##  $ Area..km2.                 : int  664 35378 664 1916907 35378 664 1916907 35378 664 1916907 ...
##  $ Population                 : int  10846145 45161325 10846145 265185520 45161325 10846145 265185520 45161325 10846145 265185520 ...
##  $ Population.Density         : num  16334 1277 16334 138 1277 ...
##  $ Longitude                  : num  107 108 107 114 108 ...
##  $ Latitude                   : num  -6.205 -6.92 -6.205 -0.789 -6.92 ...
##  $ New.Cases.per.Million      : num  0.28 0.07 0.18 0.01 0 0.18 0 0.02 0.18 0 ...
##  $ Total.Cases.per.Million    : num  0.28 0.07 0.46 0.01 0.07 0.65 0.01 0.09 0.83 0.01 ...
##  $ New.Deaths.per.Million     : num  0 0 0 0 0 0 0 0.02 0 0 ...
##  $ Total.Deaths.per.Million   : num  0 0 0 0 0 0 0 0.02 0 0 ...
##  $ Case.Fatality.Rate         : chr  "0.00%" "0.00%" "0.00%" "0.00%" ...
##  $ Case.Recovered.Rate        : chr  "0.00%" "0.00%" "0.00%" "0.00%" ...
##  $ Growth.Factor.of.New.Cases : num  NA NA 0.67 NA 0 1 0 NA 1 1 ...
##  $ Growth.Factor.of.New.Deaths: num  NA NA 1 NA 1 1 1 NA 1 1 ...

Since we want to do Time Series Analysis, it is important to have ‘Date’ column in our dataset. We already have it but with wrong format, the ‘Date’ should be Date data type, not Chr. We also have to make sure the ‘Date’ column has no NA values because it can affect the analysis process.

colSums(is.na(dataframe))
##                     ï..Date           Location.ISO.Code 
##                           0                           0 
##                    Location                   New.Cases 
##                           0                           0 
##                  New.Deaths               New.Recovered 
##                           0                           0 
##            New.Active.Cases                 Total.Cases 
##                           0                           0 
##                Total.Deaths             Total.Recovered 
##                           0                           0 
##          Total.Active.Cases              Location.Level 
##                           0                           0 
##             City.or.Regency                    Province 
##                        3493                           0 
##                     Country                   Continent 
##                           0                           0 
##                      Island                   Time.Zone 
##                           0                           0 
##              Special.Status             Total.Regencies 
##                           0                           0 
##                Total.Cities             Total.Districts 
##                          95                           0 
##        Total.Urban.Villages        Total.Rural.Villages 
##                          97                         122 
##                  Area..km2.                  Population 
##                           0                           0 
##          Population.Density                   Longitude 
##                           0                           0 
##                    Latitude       New.Cases.per.Million 
##                           0                           0 
##     Total.Cases.per.Million      New.Deaths.per.Million 
##                           0                           0 
##    Total.Deaths.per.Million          Case.Fatality.Rate 
##                           0                           0 
##         Case.Recovered.Rate  Growth.Factor.of.New.Cases 
##                           0                         515 
## Growth.Factor.of.New.Deaths 
##                         395

‘Date’ column has already no NA values. SO we can start the next process.

Data Preparation

We just want to analyze covid19 data in ‘Jawa Timur’ and playing with New.Cases, New.Death, and New.Recovered data so we can ignore the rest columns.

library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.5
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.0.5
colnames(dataframe)[1] = "Date"

dataframe <- dataframe %>%
  subset(Location == "Jawa Timur") %>%
  select(Date, New.Recovered, New.Cases, New.Deaths) %>%
  mutate(Date = mdy(Date))
str(dataframe)
## 'data.frame':    105 obs. of  4 variables:
##  $ Date         : Date, format: "2020-03-18" "2020-03-19" ...
##  $ New.Recovered: int  0 1 0 0 0 0 0 2 2 0 ...
##  $ New.Cases    : int  2 0 0 0 0 0 0 0 8 7 ...
##  $ New.Deaths   : int  0 0 0 0 0 0 1 2 0 0 ...

Now the data set are ready to analyzed. Next, we will plotting the data.

library(ggplot2)

plot1 <- ggplot(dataframe, aes(x=Date, y=New.Cases)) +
  geom_line()

plot2 <- ggplot(dataframe, aes(x=Date, y=New.Deaths)) +
  geom_line()

plot3 <- ggplot(dataframe, aes(x=Date, y=New.Recovered)) +
  geom_line()

Put each plot into one.

library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.0.5
grid.arrange(plot1, plot2, plot3, nrow = 3)

The graph shows us that Covid19 case in Indonesia has up and down pattern.

Modeling

dataframe_cts <- ts(
  data = dataframe$New.Cases,
  start = min(dataframe$Date),
  frequency = 7
)
plot(dataframe_cts, main = "New Cases")

dataframe_dts <- ts(
  data = dataframe$New.Death,
  start = min(dataframe$Date),
  frequency = 7
)
plot(dataframe_dts, main = "New Death Cases")

dataframe_rts <- ts(
  data = dataframe$New.Recovered,
  start = min(dataframe$Date),
  frequency = 7
)
plot(dataframe_rts, main = "New Recovered Cases")

From the plot, we can see that our data’s model is multiplicative. This model assumes that as the data increase, so does the seasonal pattern. In this model, the trend and seasonal components are multiplied and then added to the error component.

Cross Validation

In this step, we will try to split our data into train and test data. We take data from the last 7 days as train data and the rest as test data.

test <- tail(dataframe_cts, 7)
train <- head(dataframe_cts, length(dataframe_cts) - length(test))
plot(train)

Time Series Modeling

library(forecast)
## Warning: package 'forecast' was built under R version 4.0.5
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
train_ets <- ets(y = train, model = "ZZZ")
train_ets
## ETS(A,A,N) 
## 
## Call:
##  ets(y = train, model = "ZZZ") 
## 
##   Smoothing parameters:
##     alpha = 0.0969 
##     beta  = 1e-04 
## 
##   Initial states:
##     l = -2.7356 
##     b = 2.8225 
## 
##   sigma:  74.1701
## 
##      AIC     AICc      BIC 
## 1299.290 1299.942 1312.214

From auto modeling we know that our model have error and trend but don’t have seasonal pattern.

checkresiduals(train_ets)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,A,N)
## Q* = 32.505, df = 6, p-value = 1.305e-05
## 
## Model df: 4.   Total lags used: 10
plot(train_ets)

Forecasting

train_fore <- forecast(train_ets, 7)
train_fore
##     Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
##  99       266.3635 171.3107 361.4163 120.9928 411.7342
## 100       269.1787 173.6795 364.6778 123.1253 415.2320
## 101       271.9938 176.0494 367.9382 125.2595 418.7281
## 102       274.8090 178.4205 371.1975 127.3956 422.2225
## 103       277.6242 180.7928 374.4557 129.5333 425.7151
## 104       280.4394 183.1661 377.7127 131.6727 429.2061
## 105       283.2546 185.5405 380.9686 133.8138 432.6953
plot(train_fore)

accuracy(train_fore, test)
##                      ME     RMSE      MAE    MPE     MAPE      MASE       ACF1
## Training set -0.7370443 72.64064 44.98850   -Inf      Inf 0.6491943 -0.3020417
## Test set     23.3338347 58.31533 48.06015 3.9822 16.78685 0.6935190         NA

Conclusion

The model we made has 16.79 % error. The number of new cases grow rapidly in Jawa Timur and predicted still growing for the next 7 days.

Recommendation