Introduction
In this project, I will use two data sets describing daily active COVID-19 cases for each city and province in Poland. The first data set contains observations from the two-month period at the start of the COVID-19 pandemic in Poland.
The second, larger data set contains information about daily infections in Poland from January 2020 to the present day.
The report consists of exploratory data analysis of both data sets, a time series analysis, model building and a forecast.
Data source: https://github.com/dtandev/coronavirus/tree/master/data
Libraries
#install.packages("ggpubr")
library(ggpubr)
#install.packages("mvnormtest")
library(mvnormtest)
#install.packages("agricolae")
library(agricolae)
#install.packages("gplots")
library(gplots)
#install.packages("decoder")
library(decoder)
#install.packages("lubridate")
library(lubridate)
#install.packages("covid19.analytics")
library(covid19.analytics)
#install.packages("xts")
library(xts)
#install.packages("fpp2")
library(fpp2)
library(forecast)
library(tseries)
library(kableExtra)
library(car)
library(classInt)
library(psych)
Download
url <- "https://raw.githubusercontent.com/dtandev/coronavirus/master/data/CoronavirusPL%20-%20Timeseries.csv"
covid_pl <- read.csv(url, header = TRUE, sep = ",", encoding = "UTF-8")
Preparing data
Smaller dataset for infected cases only
infected <- covid_pl[covid_pl$Infection.Death.Recovery=="I",]
infected$Infection.Death.Recovery<- 1
#number of unique cities
cities_number <- length(unique(infected$City))
# how many infections in each province each time stamp
infected_city <- aggregate(Infection.Death.Recovery~Province+Timestamp, data = infected, FUN = sum)
# mean daily infections for each province
infected_city_mean <- aggregate(Infection.Death.Recovery~Province, data = infected_city, FUN = mean)
Preparing large dataset
# This time we extract the data (updated daily) using the covid19.analytics package
data <- covid19.data("ts-confirmed")
d <- growth.rate(data, geo.loc = "Poland", stride = 1)
d2 <- as.data.frame(t(d$Changes), row.names = FALSE)
dat <- tail(d2, -1) # keep all rows except the first
zmiany <- as.numeric(dat$V1)
General data analysis
Time period: 01.04.2020 - 31.05.2020
Aggregated daily infections for each province of Poland are presented in the box plot below.
It is visible that Śląskie has the highest median of daily infections as well as the widest interquartile range. Moreover, the largest daily increase (492 new cases) was observed in this voivodeship.
In every province there are outliers above the upper quartile, which likely reflects the rapid growth of the pandemic. However, the most extreme outliers are observed in Śląskie, Dolnośląskie, Łódzkie, Mazowieckie and Wielkopolskie.
ggboxplot(infected_city, x = "Province", y = "Infection.Death.Recovery",
color = "Province", ylab = "Number of infected", xlab = "Province") + theme(axis.text.x = element_text(angle = -45, hjust = 0)) +
ggtitle("Daily infections in Poland", subtitle = paste(min(infected_city$Timestamp), " -- ", max(infected_city$Timestamp)) ) Daily infections count distribution for each province.
ggpubr::gghistogram(infected_city, x = "Infection.Death.Recovery", add = "mean", rug = FALSE, fill = "Province", add_density = TRUE, bins = 100) +
ggtitle("Daily infections distribution in Poland", subtitle = paste(min(infected_city$Timestamp), " -- ", max(infected_city$Timestamp)) )Time period: 23.01.2020 – today (21.06.2021)
Frequency table
The frequency table for aggregated daily infections in Poland is created with a class interval of 1000 cases.
It can be observed that on almost half of all pandemic days there were between 0 and 1000 infections. However, there are 11 intervals between 1000 and 14000 cases that each cover 10-20 days; summed up, these intervals account for a great part of all infections in Poland. A third group of intervals also represents a large share of infections: on 10 days the daily count fell between 21000 and 22000 cases. There were also 8 days in total on which the number of infections exceeded 30000.
lower <- min(zmiany)
upper <- round(max(zmiany) + 500, digits = -3)
interval <- 1000
classes <- upper/interval
limits<- cut(zmiany,seq(lower,upper,by=interval), dig.lab = 10) #setting limits of classes
table1<-table(limits)
kbl(table1, format = "html", col.names = c("Daily infections range", "Frequency"), caption = "Infections frequency table", format.args = list(scientific = FALSE)) %>% kable_styling(bootstrap_options = c("bordered", "condensed")) %>% row_spec(1, background = "gold") %>% row_spec(c(2,3,4,5,6,7,8,9,10,13,14,22), background = "ivory")

| Daily infections range | Frequency |
|---|---|
| (0,1000] | 227 |
| (1000,2000] | 18 |
| (2000,3000] | 15 |
| (3000,4000] | 12 |
| (4000,5000] | 20 |
| (5000,6000] | 11 |
| (6000,7000] | 19 |
| (7000,8000] | 15 |
| (8000,9000] | 12 |
| (9000,10000] | 13 |
| (10000,11000] | 7 |
| (11000,12000] | 7 |
| (12000,13000] | 13 |
| (13000,14000] | 11 |
| (14000,15000] | 6 |
| (15000,16000] | 8 |
| (16000,17000] | 4 |
| (17000,18000] | 6 |
| (18000,19000] | 2 |
| (19000,20000] | 3 |
| (20000,21000] | 3 |
| (21000,22000] | 10 |
| (22000,23000] | 3 |
| (23000,24000] | 1 |
| (24000,25000] | 5 |
| (25000,26000] | 5 |
| (26000,27000] | 1 |
| (27000,28000] | 5 |
| (28000,29000] | 2 |
| (29000,30000] | 1 |
| (30000,31000] | 2 |
| (31000,32000] | 1 |
| (32000,33000] | 2 |
| (33000,34000] | 0 |
| (34000,35000] | 1 |
| (35000,36000] | 2 |
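The "almost half" observation can be checked directly against the raw daily counts. A quick sketch (note that cut() uses half-open intervals, so days with exactly 0 cases fall outside the first class):

```r
# Share of pandemic days with more than 0 and at most 1000 new cases
mean(zmiany > 0 & zmiany <= 1000)
```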
TAI
The tabular accuracy index (TAI) is highly satisfactory for 36 classes, indicating that the fixed 1000-case intervals represent the underlying data well.
tab2<-classIntervals(zmiany, n = classes, style="fixed", fixedBreaks=seq(lower, upper, by=interval) )
tab2 <-jenks.tests(tab2)
kable(tab2) %>% kable_material_dark() %>% kable_styling(font_size = 24)

| Statistic | Value |
|---|---|
| # classes | 36.0000000 |
| Goodness of fit | 0.9989104 |
| Tabular accuracy | 0.9651643 |
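For reference, the tabular accuracy index can be reproduced by hand. Assuming the standard absolute-deviation definition (an assumption about classInt's internals), it compares deviations from the class means with deviations from the overall mean:

```r
# Sketch: TAI = 1 - sum(|x - class mean|) / sum(|x - overall mean|)
cls <- cut(zmiany, seq(lower, upper, by = interval))
x <- zmiany[!is.na(cls)]                                # values covered by a class
cm <- ave(x, droplevels(cls[!is.na(cls)]), FUN = mean)  # each value's class mean
1 - sum(abs(x - cm)) / sum(abs(x - mean(x)))
```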
The distribution of daily infections in Poland is extremely right-skewed, which is clearly better for the country and its citizens. Below is the summary of daily infections in Poland. Indeed, the skewness is strong and positive, which is reflected in the median lying far below the mean.
Moreover, the distribution is leptokurtic, caused by its long right tail.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 313.0 783.5 5579.1 8617.2 35253.0
| Statistic | Value |
|---|---|
| Skewness | 1.648967 |
| Kurtosis | 2.013898 |
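The summary and moment statistics above were presumably produced along these lines (a sketch using base R and the already-loaded psych package):

```r
summary(zmiany)        # five-number summary plus mean
psych::skew(zmiany)    # sample skewness
psych::kurtosi(zmiany) # sample (excess) kurtosis
```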
Time series analysis
zmiany_ts <- ts(zmiany, start = c(2020, 1, 23), frequency = 365)
Since the first 42 observations indicate 0 daily cases, I have decided to use only the data from the first confirmed case in Poland onwards.
Therefore, the first 42 observations were dropped and the time series will be examined starting from 04.03.2020.
dane42 <- tail(d2, -42)
zmiany42<-as.numeric(dane42$V1)
zmiany_ts42 <- ts(zmiany42,start=c(2020,3,4),frequency=365)
zmiany_ts <- zmiany_ts42
The time plot of daily infections in Poland shows how many cases were confirmed each day. It is visible that the pandemic began slowly, but then two rapid and high 'waves' were recorded.
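The time plot can be drawn with autoplot() (a sketch; the exact title and axis labels are assumptions, since the original plotting call is not shown):

```r
autoplot(zmiany_ts) +
  ggtitle("COVID-19 daily infections in Poland") +
  xlab("Time") + ylab("Daily confirmed cases")
```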
Comparison of 2020 and 2021.
Here it is visible that after the two 'waves' the number of daily infections dropped back to the level observed in 2020.
ggseasonplot(zmiany_ts, year.labels=TRUE, year.labels.left=TRUE, col=c("firebrick1", "blue")) + ggtitle("COVID-19 in 2020 vs. 2021")
2020
In 2020 a general growing trend is observed.
window_20 <- window(zmiany_ts, start=c(2020, 1),end= c(2021,1), frequency=365)
autoplot(window_20) + geom_smooth(method=lm, color = "firebrick") + ggtitle("Covid-19 daily infections in 2020")
2021
Year 2021 started with relatively high numbers, but so far the trend is decreasing. For now (June 2021) the number of daily cases remains quite low.
window_21 <- window(zmiany_ts, start=c(2021, 1), frequency=365)
autoplot(window_21) + geom_smooth(method=lm, color = "chartreuse4") + ggtitle("Covid-19 daily infections in 2021")
Stationarity - transformations
Here we can see the plot of the original time series, together with the ACF and PACF plots of the raw data.
The ndiffs() command reports how many times the data should be differenced to obtain a stationary series.
Initially it suggests a single differencing.
tsdisplay(zmiany_ts)
ndiffs(zmiany_ts)
## [1] 1
Box-Cox transformation
The first transformation will be taking the logarithm of the raw data, because its range is wide. However, this is not directly possible, since the time series contains 0 values and log(0) diverges to negative infinity. For that reason, a value of 1 is added to each observation in the time series.
Then, a Box-Cox transformation with lambda = 0 is applied; this power transformation reduces to log() for lambda = 0.
The time series now looks better for further work, but it still needs to be differenced; at this point double differencing is suggested.
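As a quick sanity check (a sketch), BoxCox() with lambda = 0 should agree exactly with the natural logarithm of the shifted series:

```r
# BoxCox(x, lambda = 0) is defined as log(x)
all.equal(as.numeric(BoxCox(zmiany_ts + 1, lambda = 0)),
          as.numeric(log(zmiany_ts + 1)))
```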
zmiany_ts1 <- zmiany_ts + 1
zmiany_bc <- BoxCox(zmiany_ts1, 0)
ndiffs(zmiany_bc)
## [1] 2
tsdisplay(zmiany_bc)
First differencing
zmiany_bc_diff1 <- diff(zmiany_bc, lag = 1)
tsdisplay(zmiany_bc_diff1)
ndiffs(zmiany_bc_diff1)
## [1] 1
Acf(zmiany_bc_diff1, lag.max = 40)
On the autocorrelation plot it can be observed that the values exceed the confidence interval at lag = 7. For that reason, a second differencing will be performed with a lag of 7 (a weekly, seasonal difference).
Second differencing
zmiany_bc_diff2 <- diff(zmiany_bc_diff1, lag = 7)
tsdisplay(zmiany_bc_diff2)
ndiffs(zmiany_bc_diff2)
## [1] 0
Acf(zmiany_bc_diff2, lag.max = 40)
No further differencing is required.
Testing the stationarity using the unit root test by Kwiatkowski–Phillips–Schmidt–Shin (KPSS).
library(urca)
zmiany_bc_diff1 %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 1.1844
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
zmiany_bc_diff2 %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0388
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
The first test statistic (1.1844) exceeds even the 1% critical value, so the singly differenced series is not stationary. The second test statistic (0.0388) is smaller than all critical values, so the null hypothesis of stationarity is not rejected: the twice-differenced data is stationary.
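As a cross-check on the KPSS result, one could also apply the Augmented Dickey-Fuller test from the already-loaded tseries package to the twice-differenced series (a sketch; note the ADF null hypothesis is non-stationarity, the reverse of KPSS):

```r
# A small ADF p-value supports stationarity
adf.test(zmiany_bc_diff2, alternative = "stationary")
```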
ACF, PACF
According to the autocorrelation and partial autocorrelation plots, models of order 7 will also be considered in further work, because the largest value exceeding the confidence interval is observed at lag 7.
Based on the ACF and PACF analysis, an appropriate ARIMA model is selected. The PACF becomes insignificant after the 2nd lag, which points to an AR(2) component; the ACF also becomes insignificant after the 2nd lag, which points to an MA(2) component. In ARIMA(p,d,q), the parameter d will be set to 2, since the time series was differenced twice.
ARIMA
Model selection
We use the principle of parsimony to decide which model is best: among models that fit comparably well, the one with the fewest parameters is preferred.
For that reason three models will be considered:
- ARIMA(2,2,0) - two parameters
- ARIMA(0,2,2) - two parameters
- ARIMA(2,2,2) - the most complex one, with four parameters
arima0.2.2 <- Arima(zmiany_ts1, order =c(0,2,2), lambda= 0)
summary(arima0.2.2)
## Series: zmiany_ts1
## ARIMA(0,2,2)
## Box Cox transformation: lambda= 0
##
## Coefficients:
## ma1 ma2
## -1.6455 0.6769
## s.e. NaN NaN
##
## sigma^2 estimated as 0.1203: log likelihood=-171.56
## AIC=349.12 AICc=349.17 BIC=361.6
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -69.7191 3065.353 1569.744 -8.937411 28.52469 0.1545703 0.4705795
arima2.2.0 <- Arima(zmiany_ts1, order =c(2,2,0), lambda= 0)
summary(arima2.2.0)
## Series: zmiany_ts1
## ARIMA(2,2,0)
## Box Cox transformation: lambda= 0
##
## Coefficients:
## ar1 ar2
## -0.7533 -0.3526
## s.e. 0.0433 0.0449
##
## sigma^2 estimated as 0.1914: log likelihood=-279.46
## AIC=564.93 AICc=564.98 BIC=577.4
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -385.3305 3774.596 1927.881 -9.543319 33.58852 0.1898355 0.2391352
arima2.2.2 <- Arima(zmiany_ts1, order =c(2,2,2), lambda= 0)
summary(arima2.2.2)
## Series: zmiany_ts1
## ARIMA(2,2,2)
## Box Cox transformation: lambda= 0
##
## Coefficients:
## ar1 ar2 ma1 ma2
## 0.4477 -0.1078 -1.874 0.9070
## s.e. 0.0541 0.0525 0.032 0.0299
##
## sigma^2 estimated as 0.1026: log likelihood=-133.4
## AIC=276.79 AICc=276.92 BIC=297.59
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -22.50683 2471.207 1219.294 -6.797896 24.92259 0.120062 0.313234
A crucial step of the model selection process is to examine information criteria. The table presents AIC, AICc and BIC for each initially considered model. It can be observed that the most complex model, ARIMA(2,2,2), has the lowest information criteria. For that reason I will select it for further examination.
| ARIMA(p,d,q) | AIC | AICc | BIC |
|---|---|---|---|
| (0,2,2) | 339.9 | 340.0 | 352.4 |
| (2,2,0) | 563.2 | 564.0 | 575.7 |
| (2,2,2) | 273.6 | 273.8 | 294.4 |
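The comparison table above was presumably assembled along these lines (a sketch; forecast's Arima objects store all three criteria directly):

```r
ic <- function(m) c(AIC = m$aic, AICc = m$aicc, BIC = m$bic)
round(rbind("(0,2,2)" = ic(arima0.2.2),
            "(2,2,0)" = ic(arima2.2.0),
            "(2,2,2)" = ic(arima2.2.2)), 1)
```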
The summaries of the three models indicate that the most complex model provides the best results so far. However, during the ACF and PACF examination it was concluded that models of order 7 are worth considering. For that reason two additional models are examined.
arima2.2.7 <- Arima(zmiany_ts1, order =c(2,2,7), lambda= 0)
summary(arima2.2.7)
## Series: zmiany_ts1
## ARIMA(2,2,7)
## Box Cox transformation: lambda= 0
##
## Coefficients:
## ar1 ar2 ma1 ma2 ma3 ma4 ma5 ma6
## -0.4417 -0.9839 -0.9295 0.8121 -1.5020 0.5391 -0.2278 0.3311
## s.e. 0.0115 0.0132 0.0509 0.0665 0.0858 0.1190 0.0898 0.0766
## ma7
## 0.1412
## s.e. 0.0574
##
## sigma^2 estimated as 0.08458: log likelihood=-86.42
## AIC=192.84 AICc=193.31 BIC=234.43
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -62.17826 2342.477 1190.141 -5.336486 22.91458 0.1171913 0.3116076
arima7.2.2 <- Arima(zmiany_ts1, order =c(7,2,2), lambda= 0)
summary(arima7.2.2)
## Series: zmiany_ts1
## ARIMA(7,2,2)
## Box Cox transformation: lambda= 0
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ar6 ar7 ma1
## -1.3464 -1.2947 -1.2785 -1.2721 -1.2688 -1.0813 -0.4761 -0.2578
## s.e. 0.1069 0.0995 0.0955 0.0942 0.0943 0.0918 0.0686 0.1113
## ma2
## -0.2011
## s.e. 0.0966
##
## sigma^2 estimated as 0.06886: log likelihood=-37.36
## AIC=94.72 AICc=95.19 BIC=136.31
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -90.94902 2009.586 926.9229 -4.940974 19.86228 0.09127267
## ACF1
## Training set 0.3593049
| ARIMA(p,d,q) | AIC | AICc | BIC |
|---|---|---|---|
| (2,2,2) | 273.6 | 273.8 | 294.4 |
| (2,2,7) | 192.0 | 192.6 | 233.6 |
| (7,2,2) | 94.7 | 95.2 | 136.3 |
All information criteria and error measures indicate that the most appropriate model is ARIMA(7,2,2).
ARIMA(7,2,2) also performs better than the automatically selected ARIMA model: its AIC and BIC are roughly ten times lower and its forecast values are more reasonable.
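The automatic selection referenced above is not shown in this section; it was presumably obtained with auto.arima() on the same series (a sketch, with the lambda = 0 transformation assumed for comparability):

```r
auto_fit <- auto.arima(zmiany_ts1, lambda = 0)
summary(auto_fit)
```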
Residuals
The time plot of the forecast errors shows that the residuals have roughly constant variance around a mean close to zero.
resid7.2.2 <- arima7.2.2$residuals
checkresiduals(resid7.2.2)
The histogram of residuals and the quantile comparison plot show that the errors are approximately normally distributed.
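Note that plotForecastErrors(), used below, is not part of any loaded package; it is the helper function popularized by "A Little Book of R for Time Series" and must be defined beforehand. A minimal sketch of such a function:

```r
# Histogram of forecast errors with an overlaid normal density
# of matching standard deviation (zero mean assumed)
plotForecastErrors <- function(forecasterrors) {
  mysd <- sd(forecasterrors)
  hist(forecasterrors, freq = FALSE, col = "red",
       main = "Histogram of forecast errors", xlab = "Forecast error")
  curve(dnorm(x, mean = 0, sd = mysd), add = TRUE, col = "blue", lwd = 2)
}
```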
par(mfrow = c(1, 2))
plotForecastErrors(resid7.2.2)
qqPlot(resid7.2.2, col.lines = "red")
## [1] 15  5
Residuals diagnostics
Shapiro-Wilk test for residuals normality:
The null hypothesis of this test is that the data are normally distributed. The hypothesis is rejected, since the p-value < 0.05; however, judging from the histogram and the quantile plot, the departure from normality is moderate, so an acceptable level of normality is assumed.
shapiro.test(resid7.2.2)
##
## Shapiro-Wilk normality test
##
## data: resid7.2.2
## W = 0.93844, p-value = 4.11e-13
Ljung-Box test for residuals randomness:
Box.test(resid7.2.2, lag = 1, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: resid7.2.2
## X-squared = 0.36625, df = 1, p-value = 0.5451
Box.test(resid7.2.2, lag = 7, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: resid7.2.2
## X-squared = 13.947, df = 7, p-value = 0.05213
All p-values in the above tests are greater than 0.05, therefore there is no reason to reject the null hypothesis that the errors are independent.
Forecasting
The final forecast for 60 days with a 95% confidence interval shows a rather positive scenario of daily infections converging to 0; however, it is possible that the COVID-19 pandemic will intensify again soon. An accurate prediction should also account for events that change the pandemic's intensity, such as the progress of the vaccination campaign.
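The forecast itself can be produced directly from the fitted model (a sketch; the 60-day horizon and 95% level follow the text, and forecast() inverts the lambda = 0 transformation automatically):

```r
fc <- forecast(arima7.2.2, h = 60, level = 95)
autoplot(fc) + ggtitle("COVID-19 daily infections in Poland: 60-day forecast")
```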