In this lecture, we are trying to do Time Series Analysis using Indonesian Covid19 data set.
dataframe <- read.csv("covid_19_indonesia_time_series_all.csv")
summary(dataframe)
## ï..Date Location.ISO.Code Location New.Cases
## Length:3493 Length:3493 Length:3493 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 0.00
## Mode :character Mode :character Mode :character Median : 3.00
## Mean : 32.09
## 3rd Qu.: 15.00
## Max. :1385.00
##
## New.Deaths New.Recovered New.Active.Cases Total.Cases
## Min. : 0.000 Min. : 0.00 Min. :-309.00 Min. : 1
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 27
## Median : 0.000 Median : 0.00 Median : 1.00 Median : 128
## Mean : 1.634 Mean : 14.17 Mean : 16.29 Mean : 1085
## 3rd Qu.: 0.000 3rd Qu.: 5.00 3rd Qu.: 10.00 3rd Qu.: 462
## Max. :71.000 Max. :1027.00 Max. : 772.00 Max. :56385
##
## Total.Deaths Total.Recovered Total.Active.Cases Location.Level
## Min. : 0.0 Min. : 0 Min. : -44.0 Length:3493
## 1st Qu.: 1.0 1st Qu.: 4 1st Qu.: 16.0 Class :character
## Median : 5.0 Median : 34 Median : 70.0 Mode :character
## Mean : 66.5 Mean : 345 Mean : 673.7
## 3rd Qu.: 25.0 3rd Qu.: 151 3rd Qu.: 271.0
## Max. :2876.0 Max. :24806 Max. :28703.0
##
## City.or.Regency Province Country Continent
## Mode:logical Length:3493 Length:3493 Length:3493
## NA's:3493 Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Island Time.Zone Special.Status Total.Regencies
## Length:3493 Length:3493 Length:3493 Min. : 1.00
## Class :character Class :character Class :character 1st Qu.: 7.00
## Mode :character Mode :character Mode :character Median : 11.00
## Mean : 26.29
## 3rd Qu.: 18.00
## Max. :416.00
##
## Total.Cities Total.Districts Total.Urban.Villages Total.Rural.Villages
## Min. : 1.000 Min. : 44.0 Min. : 35.0 Min. : 275
## 1st Qu.: 1.000 1st Qu.: 103.0 1st Qu.: 99.0 1st Qu.: 928
## Median : 2.000 Median : 169.0 Median : 175.0 Median : 1591
## Mean : 6.469 Mean : 460.7 Mean : 559.3 Mean : 4939
## 3rd Qu.: 5.000 3rd Qu.: 289.0 3rd Qu.: 377.0 3rd Qu.: 2853
## Max. :98.000 Max. :7230.0 Max. :8488.0 Max. :74953
## NA's :95 NA's :97 NA's :122
## Area..km2. Population Population.Density Longitude
## Min. : 664 Min. : 648407 Min. : 8.59 Min. : 96.91
## 1st Qu.: 16787 1st Qu.: 2570289 1st Qu.: 47.79 1st Qu.:106.11
## Median : 42013 Median : 4340348 Median : 103.84 Median :113.42
## Mean : 120695 Mean : 17214818 Mean : 845.88 Mean :113.61
## 3rd Qu.: 75468 3rd Qu.: 9426885 3rd Qu.: 262.70 3rd Qu.:120.16
## Max. :1916907 Max. :265185520 Max. :16334.31 Max. :138.70
##
## Latitude New.Cases.per.Million Total.Cases.per.Million
## Min. :-8.68220 Min. : 0.000 Min. : 0.01
## 1st Qu.:-6.20470 1st Qu.: 0.000 1st Qu.: 6.26
## Median :-2.46175 Median : 0.570 Median : 31.16
## Mean :-2.80315 Mean : 2.316 Mean : 75.79
## 3rd Qu.:-0.08647 3rd Qu.: 2.540 3rd Qu.: 90.48
## Max. : 4.22562 Max. :73.410 Max. :1008.65
##
## New.Deaths.per.Million Total.Deaths.per.Million Case.Fatality.Rate
## Min. :0.00000 Min. : 0.000 Length:3493
## 1st Qu.:0.00000 1st Qu.: 0.330 Class :character
## Median :0.00000 Median : 1.280 Mode :character
## Mean :0.09621 Mean : 3.684
## 3rd Qu.:0.00000 3rd Qu.: 3.320
## Max. :6.71000 Max. :56.430
##
## Case.Recovered.Rate Growth.Factor.of.New.Cases Growth.Factor.of.New.Deaths
## Length:3493 Min. : 0.0000 Min. : 0.0000
## Class :character 1st Qu.: 0.3525 1st Qu.: 1.0000
## Mode :character Median : 1.0000 Median : 1.0000
## Mean : 1.3400 Mean : 0.9535
## 3rd Qu.: 1.1375 3rd Qu.: 1.0000
## Max. :54.0000 Max. :17.0000
## NA's :515 NA's :395
str(dataframe)
## 'data.frame': 3493 obs. of 37 variables:
## $ ï..Date : chr "3/1/2020" "3/1/2020" "3/2/2020" "3/2/2020" ...
## $ Location.ISO.Code : chr "ID-JK" "ID-JB" "ID-JK" "IDN" ...
## $ Location : chr "DKI Jakarta" "Jawa Barat" "DKI Jakarta" "Indonesia" ...
## $ New.Cases : int 3 3 2 2 0 2 0 1 2 0 ...
## $ New.Deaths : int 0 0 0 0 0 0 0 1 0 0 ...
## $ New.Recovered : int 0 0 0 0 0 0 0 0 0 0 ...
## $ New.Active.Cases : int 3 3 2 2 0 2 0 0 2 0 ...
## $ Total.Cases : int 3 3 5 2 3 7 2 4 9 2 ...
## $ Total.Deaths : int 0 0 0 0 0 0 0 1 0 0 ...
## $ Total.Recovered : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Total.Active.Cases : int 3 3 5 2 3 7 2 3 9 2 ...
## $ Location.Level : chr "Province" "Province" "Province" "Country" ...
## $ City.or.Regency : logi NA NA NA NA NA NA ...
## $ Province : chr "DKI Jakarta" "Jawa Barat" "DKI Jakarta" "" ...
## $ Country : chr "Indonesia" "Indonesia" "Indonesia" "Indonesia" ...
## $ Continent : chr "Asia" "Asia" "Asia" "Asia" ...
## $ Island : chr "Jawa" "Jawa" "Jawa" "" ...
## $ Time.Zone : chr "UTC+07:00" "UTC+07:00" "UTC+07:00" "" ...
## $ Special.Status : chr "Daerah Khusus Ibu Kota" "" "Daerah Khusus Ibu Kota" "" ...
## $ Total.Regencies : int 1 18 1 416 18 1 416 18 1 416 ...
## $ Total.Cities : int 5 9 5 98 9 5 98 9 5 98 ...
## $ Total.Districts : int 44 627 44 7230 627 44 7230 627 44 7230 ...
## $ Total.Urban.Villages : int 267 645 267 8488 645 267 8488 645 267 8488 ...
## $ Total.Rural.Villages : int NA 5312 NA 74953 5312 NA 74953 5312 NA 74953 ...
## $ Area..km2. : int 664 35378 664 1916907 35378 664 1916907 35378 664 1916907 ...
## $ Population : int 10846145 45161325 10846145 265185520 45161325 10846145 265185520 45161325 10846145 265185520 ...
## $ Population.Density : num 16334 1277 16334 138 1277 ...
## $ Longitude : num 107 108 107 114 108 ...
## $ Latitude : num -6.205 -6.92 -6.205 -0.789 -6.92 ...
## $ New.Cases.per.Million : num 0.28 0.07 0.18 0.01 0 0.18 0 0.02 0.18 0 ...
## $ Total.Cases.per.Million : num 0.28 0.07 0.46 0.01 0.07 0.65 0.01 0.09 0.83 0.01 ...
## $ New.Deaths.per.Million : num 0 0 0 0 0 0 0 0.02 0 0 ...
## $ Total.Deaths.per.Million : num 0 0 0 0 0 0 0 0.02 0 0 ...
## $ Case.Fatality.Rate : chr "0.00%" "0.00%" "0.00%" "0.00%" ...
## $ Case.Recovered.Rate : chr "0.00%" "0.00%" "0.00%" "0.00%" ...
## $ Growth.Factor.of.New.Cases : num NA NA 0.67 NA 0 1 0 NA 1 1 ...
## $ Growth.Factor.of.New.Deaths: num NA NA 1 NA 1 1 1 NA 1 1 ...
Since we want to do Time Series Analysis, it is important to have ‘Date’ column in our dataset. We already have it but with wrong format, the ‘Date’ should be Date data type, not Chr. We also have to make sure the ‘Date’ column has no NA values because it can affect the analysis process.
colSums(is.na(dataframe))
## ï..Date Location.ISO.Code
## 0 0
## Location New.Cases
## 0 0
## New.Deaths New.Recovered
## 0 0
## New.Active.Cases Total.Cases
## 0 0
## Total.Deaths Total.Recovered
## 0 0
## Total.Active.Cases Location.Level
## 0 0
## City.or.Regency Province
## 3493 0
## Country Continent
## 0 0
## Island Time.Zone
## 0 0
## Special.Status Total.Regencies
## 0 0
## Total.Cities Total.Districts
## 95 0
## Total.Urban.Villages Total.Rural.Villages
## 97 122
## Area..km2. Population
## 0 0
## Population.Density Longitude
## 0 0
## Latitude New.Cases.per.Million
## 0 0
## Total.Cases.per.Million New.Deaths.per.Million
## 0 0
## Total.Deaths.per.Million Case.Fatality.Rate
## 0 0
## Case.Recovered.Rate Growth.Factor.of.New.Cases
## 0 515
## Growth.Factor.of.New.Deaths
## 395
‘Date’ column has already no NA values. SO we can start the next process.
We just want to analyze covid19 data in ‘Jawa Timur’ and playing with New.Cases, New.Death, and New.Recovered data so we can ignore the rest columns.
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.5
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.0.5
colnames(dataframe)[1] = "Date"
dataframe <- dataframe %>%
subset(Location == "Jawa Timur") %>%
select(Date, New.Recovered, New.Cases, New.Deaths) %>%
mutate(Date = mdy(Date))
str(dataframe)
## 'data.frame': 105 obs. of 4 variables:
## $ Date : Date, format: "2020-03-18" "2020-03-19" ...
## $ New.Recovered: int 0 1 0 0 0 0 0 2 2 0 ...
## $ New.Cases : int 2 0 0 0 0 0 0 0 8 7 ...
## $ New.Deaths : int 0 0 0 0 0 0 1 2 0 0 ...
Now the data set are ready to analyzed. Next, we will plotting the data.
library(ggplot2)
plot1 <- ggplot(dataframe, aes(x=Date, y=New.Cases)) +
geom_line()
plot2 <- ggplot(dataframe, aes(x=Date, y=New.Deaths)) +
geom_line()
plot3 <- ggplot(dataframe, aes(x=Date, y=New.Recovered)) +
geom_line()
Put each plot into one.
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.0.5
grid.arrange(plot1, plot2, plot3, nrow = 3)
The graph shows us that Covid19 case in Indonesia has up and down pattern.
dataframe_cts <- ts(
data = dataframe$New.Cases,
start = min(dataframe$Date),
frequency = 7
)
plot(dataframe_cts, main = "New Cases")
dataframe_dts <- ts(
data = dataframe$New.Death,
start = min(dataframe$Date),
frequency = 7
)
plot(dataframe_dts, main = "New Death Cases")
dataframe_rts <- ts(
data = dataframe$New.Recovered,
start = min(dataframe$Date),
frequency = 7
)
plot(dataframe_rts, main = "New Recovered Cases")
From the plot, we can see that our data’s model is multiplicative. This model assumes that as the data increase, so does the seasonal pattern. In this model, the trend and seasonal components are multiplied and then added to the error component.
In this step, we will try to split our data into train and test data. We take data from the last 7 days as train data and the rest as test data.
test <- tail(dataframe_cts, 7)
train <- head(dataframe_cts, length(dataframe_cts) - length(test))
plot(train)
library(forecast)
## Warning: package 'forecast' was built under R version 4.0.5
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
train_ets <- ets(y = train, model = "ZZZ")
train_ets
## ETS(A,A,N)
##
## Call:
## ets(y = train, model = "ZZZ")
##
## Smoothing parameters:
## alpha = 0.0969
## beta = 1e-04
##
## Initial states:
## l = -2.7356
## b = 2.8225
##
## sigma: 74.1701
##
## AIC AICc BIC
## 1299.290 1299.942 1312.214
From auto modeling we know that our model have error and trend but don’t have seasonal pattern.
checkresiduals(train_ets)
##
## Ljung-Box test
##
## data: Residuals from ETS(A,A,N)
## Q* = 32.505, df = 6, p-value = 1.305e-05
##
## Model df: 4. Total lags used: 10
plot(train_ets)
train_fore <- forecast(train_ets, 7)
train_fore
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 99 266.3635 171.3107 361.4163 120.9928 411.7342
## 100 269.1787 173.6795 364.6778 123.1253 415.2320
## 101 271.9938 176.0494 367.9382 125.2595 418.7281
## 102 274.8090 178.4205 371.1975 127.3956 422.2225
## 103 277.6242 180.7928 374.4557 129.5333 425.7151
## 104 280.4394 183.1661 377.7127 131.6727 429.2061
## 105 283.2546 185.5405 380.9686 133.8138 432.6953
plot(train_fore)
accuracy(train_fore, test)
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -0.7370443 72.64064 44.98850 -Inf Inf 0.6491943 -0.3020417
## Test set 23.3338347 58.31533 48.06015 3.9822 16.78685 0.6935190 NA
The model we made has 16.79 % error. The number of new cases grow rapidly in Jawa Timur and predicted still growing for the next 7 days.