Introduction
In this project, I will use two data sets describing daily active COVID-19 cases for each city and province in Poland. The first data set contains observations from the two-month period at the start of the COVID-19 pandemic in Poland.
The second, larger data set contains information about daily infections in Poland from January 2020 to the present day.
The report consists of exploratory data analysis of both data sets, a time series analysis, model building and a forecast.
Data source: https://github.com/dtandev/coronavirus/tree/master/data
Libraries
#install.packages("ggpubr")
library(ggpubr)
#install.packages("mvnormtest")
library(mvnormtest)
#install.packages("agricolae")
library(agricolae)
#install.packages("gplots")
library(gplots)
#install.packages("decoder")
library(decoder)
#install.packages("lubridate")
library(lubridate)
#install.packages("covid19.analytics")
library(covid19.analytics)
#install.packages("xts")
library(xts)
#install.packages("fpp2")
library(fpp2)
library(forecast)
library(tseries)
library(kableExtra)
library(car)
library(classInt)
library(psych)
Download
url <- "https://raw.githubusercontent.com/dtandev/coronavirus/master/data/CoronavirusPL%20-%20Timeseries.csv"
covid_pl <- read.csv(url, header = TRUE, sep = ",", encoding = "UTF-8")
Preparing data
Smaller dataset for infected cases only
infected <- covid_pl[covid_pl$Infection.Death.Recovery=="I",]
infected$Infection.Death.Recovery<- 1
#number of unique cities
cities_number <- length(unique(infected$City))
# how many infections in each province each time stamp
infected_city <- aggregate(Infection.Death.Recovery~Province+Timestamp, data = infected, FUN = sum)
# mean daily infections for each province
infected_city_mean <- aggregate(Infection.Death.Recovery~Province, data = infected_city, FUN = mean)
Preparing large dataset
# This time we extract the data (updated daily) using the covid19.analytics package
data <- covid19.data("ts-confirmed")
d <- growth.rate(data, geo.loc = "Poland", stride = 1)
d2 <- as.data.frame(t(d$Changes), row.names = FALSE)
dat <- tail(d2, -1) # keep all rows except the first
zmiany <- as.numeric(dat$V1)
General data analysis
Time period: 01.04.2020 - 31.05.2020
Aggregated daily infections for each province of Poland are presented in the box plot below.
It is visible that Śląskie has the highest median of daily infections as well as the widest interquartile range. Moreover, the largest daily increase (492 new cases) was observed in this voivodeship.
In every province there are outliers above the upper quartile, which likely reflects the rapid growth of the pandemic. However, the most extreme outliers are observed in Śląskie, Dolnośląskie, Łódzkie, Mazowieckie and Wielkopolskie.
ggboxplot(infected_city, x = "Province", y = "Infection.Death.Recovery",
color = "Province", ylab = "Number of infected", xlab = "Province") + theme(axis.text.x = element_text(angle = -45, hjust = 0)) +
ggtitle("Daily infections in Poland", subtitle = paste(min(infected_city$Timestamp), " -- ", max(infected_city$Timestamp)) ) Daily infections count distribution for each province.
ggpubr::gghistogram(infected_city, x = "Infection.Death.Recovery", add = "mean", rug = FALSE, fill = "Province", add_density = TRUE, bins = 100) +
ggtitle("Daily infections distribution in Poland", subtitle = paste(min(infected_city$Timestamp), " -- ", max(infected_city$Timestamp)) )Time period: 23.01.2020 – today (21.06.2021)
Frequency table
The frequency table for aggregated daily infections in Poland is created with a class interval of 1000 cases.
It can be observed that on almost half of all pandemic days there were between 0 and 1000 infections. However, there are 11 intervals between 1000 and 14000 cases that each cover 10-20 days; summed up, these intervals account for a great part of all infections in Poland. A third group of intervals also represents a large share of infections: on 10 days the daily count fell between 21000 and 22000 cases. There were also 8 days in total on which the number of infections exceeded 30000.
lower <- min(zmiany)
upper <- round(max(zmiany) + 500, digits = -3)
interval <- 1000
classes <- upper/interval
limits<- cut(zmiany,seq(lower,upper,by=interval), dig.lab = 10) #setting limits of classes
table1<-table(limits)
kbl(table1, format = "html", col.names = c("Daily infections range", "Frequency"), caption = "Infections frequency table", format.args = list(scientific = FALSE)) %>% kable_styling(bootstrap_options = c("bordered", "condensed")) %>% row_spec(1, background = "gold") %>% row_spec(c(2,3,4,5,6,7,8,9,10,13,14,22), background = "ivory")

| Daily infections range | Frequency |
|---|---|
| (0,1000] | 227 |
| (1000,2000] | 18 |
| (2000,3000] | 15 |
| (3000,4000] | 12 |
| (4000,5000] | 20 |
| (5000,6000] | 11 |
| (6000,7000] | 19 |
| (7000,8000] | 15 |
| (8000,9000] | 12 |
| (9000,10000] | 13 |
| (10000,11000] | 7 |
| (11000,12000] | 7 |
| (12000,13000] | 13 |
| (13000,14000] | 11 |
| (14000,15000] | 6 |
| (15000,16000] | 8 |
| (16000,17000] | 4 |
| (17000,18000] | 6 |
| (18000,19000] | 2 |
| (19000,20000] | 3 |
| (20000,21000] | 3 |
| (21000,22000] | 10 |
| (22000,23000] | 3 |
| (23000,24000] | 1 |
| (24000,25000] | 5 |
| (25000,26000] | 5 |
| (26000,27000] | 1 |
| (27000,28000] | 5 |
| (28000,29000] | 2 |
| (29000,30000] | 1 |
| (30000,31000] | 2 |
| (31000,32000] | 1 |
| (32000,33000] | 2 |
| (33000,34000] | 0 |
| (34000,35000] | 1 |
| (35000,36000] | 2 |
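The "almost half" observation can be checked directly against the raw daily counts. A quick sketch (note that cut() uses half-open intervals, so days with exactly 0 cases fall outside the first class):

```r
# Share of pandemic days with more than 0 and at most 1000 new cases
mean(zmiany > 0 & zmiany <= 1000)
```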
TAI
The tabular accuracy index (TAI) is highly satisfactory for 36 classes, indicating that the fixed 1000-case intervals represent the underlying data well.
tab2<-classIntervals(zmiany, n = classes, style="fixed", fixedBreaks=seq(lower, upper, by=interval) )
tab2 <-jenks.tests(tab2)
kable(tab2) %>% kable_material_dark() %>% kable_styling(font_size = 24)

| Statistic | Value |
|---|---|
| # classes | 36.0000000 |
| Goodness of fit | 0.9989104 |
| Tabular accuracy | 0.9651643 |
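For reference, the tabular accuracy index can be reproduced by hand. Assuming the standard absolute-deviation definition (an assumption about classInt's internals), it compares deviations from the class means with deviations from the overall mean:

```r
# Sketch: TAI = 1 - sum(|x - class mean|) / sum(|x - overall mean|)
cls <- cut(zmiany, seq(lower, upper, by = interval))
x <- zmiany[!is.na(cls)]                                # values covered by a class
cm <- ave(x, droplevels(cls[!is.na(cls)]), FUN = mean)  # each value's class mean
1 - sum(abs(x - cm)) / sum(abs(x - mean(x)))
```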
The distribution of daily infections in Poland is extremely right-skewed, which is clearly better for the country and its citizens. Below is the summary of daily infections in Poland. Indeed, the skewness is strong and positive, which is reflected in the median lying far below the mean.
Moreover, the distribution is leptokurtic, caused by its long right tail.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 313.0 783.5 5579.1 8617.2 35253.0
| Statistic | Value |
|---|---|
| Skewness | 1.648967 |
| Kurtosis | 2.013898 |
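The summary and moment statistics above were presumably produced along these lines (a sketch using base R and the already-loaded psych package):

```r
summary(zmiany)        # five-number summary plus mean
psych::skew(zmiany)    # sample skewness
psych::kurtosi(zmiany) # sample (excess) kurtosis
```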
Time series analysis
zmiany_ts <- ts(zmiany, start = c(2020, 1, 23), frequency = 365)
Since the first 42 observations indicate 0 daily cases, I have decided to use only the data from the first confirmed case in Poland onwards.
Therefore, the first 42 observations were dropped and the time series will be examined starting from 04.03.2020.
dane42 <- tail(d2, -42)
zmiany42<-as.numeric(dane42$V1)
zmiany_ts42 <- ts(zmiany42,start=c(2020,3,4),frequency=365)
zmiany_ts <- zmiany_ts42
The time plot of daily infections in Poland shows how many cases were confirmed each day. It is visible that the pandemic began slowly, but then two rapid and high 'waves' were recorded.
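The time plot can be drawn with autoplot() (a sketch; the exact title and axis labels are assumptions, since the original plotting call is not shown):

```r
autoplot(zmiany_ts) +
  ggtitle("COVID-19 daily infections in Poland") +
  xlab("Time") + ylab("Daily confirmed cases")
```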
Comparison of 2020 and 2021.
Here it is visible that after the two 'waves' the number of daily infections dropped back to the level observed in 2020.
ggseasonplot(zmiany_ts, year.labels=TRUE, year.labels.left=TRUE, col=c("firebrick1", "blue")) + ggtitle("COVID-19 in 2020 vs. 2021")
2020
In 2020 a general growing trend is observed.
window_20 <- window(zmiany_ts, start=c(2020, 1),end= c(2021,1), frequency=365)
autoplot(window_20) + geom_smooth(method=lm, color = "firebrick") + ggtitle("Covid-19 daily infections in 2020")
2021
Year 2021 started with relatively high numbers, but so far the trend is decreasing. For now (June 2021) the number of daily cases remains quite low.
window_21 <- window(zmiany_ts, start=c(2021, 1), frequency=365)
autoplot(window_21) + geom_smooth(method=lm, color = "chartreuse4") + ggtitle("Covid-19 daily infections in 2021")
Stationarity - transformations
Here we can see the plot of the original time series, together with the ACF and PACF plots of the raw data.
The ndiffs() command reports how many times the data should be differenced to obtain a stationary series.
Initially it suggests a single differencing.
tsdisplay(zmiany_ts)
ndiffs(zmiany_ts)
## [1] 1
Box-Cox transformation
The first transformation will be taking the logarithm of the raw data, because its range is wide. However, this is not directly possible, since the time series contains 0 values and log(0) diverges to negative infinity. For that reason, a value of 1 is added to each observation in the time series.
Then, a Box-Cox transformation with lambda = 0 is applied; this power transformation reduces to log() for lambda = 0.
The time series now looks better for further work, but it still needs to be differenced; at this point double differencing is suggested.
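As a quick sanity check (a sketch), BoxCox() with lambda = 0 should agree exactly with the natural logarithm of the shifted series:

```r
# BoxCox(x, lambda = 0) is defined as log(x)
all.equal(as.numeric(BoxCox(zmiany_ts + 1, lambda = 0)),
          as.numeric(log(zmiany_ts + 1)))
```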
zmiany_ts1 <- zmiany_ts + 1
zmiany_bc <- BoxCox(zmiany_ts1, 0)
ndiffs(zmiany_bc)
## [1] 2
tsdisplay(zmiany_bc)
First differencing
zmiany_bc_diff1 <- diff(zmiany_bc, lag = 1)
tsdisplay(zmiany_bc_diff1)
ndiffs(zmiany_bc_diff1)
## [1] 1
Acf(zmiany_bc_diff1, lag.max = 40)
On the autocorrelation plot it can be observed that the values exceed the confidence interval at lag = 7. For that reason, a second differencing will be performed with a lag of 7 (a weekly, seasonal difference).
Second differencing
zmiany_bc_diff2 <- diff(zmiany_bc_diff1, lag = 7)
tsdisplay(zmiany_bc_diff2)
ndiffs(zmiany_bc_diff2)
## [1] 0
Acf(zmiany_bc_diff2, lag.max = 40)
No further differencing is required.
Testing the stationarity using the unit root test by Kwiatkowski–Phillips–Schmidt–Shin (KPSS).
library(urca)
zmiany_bc_diff1 %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 1.1844
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
zmiany_bc_diff2 %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0388
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
The first test statistic (1.1844) exceeds even the 1% critical value, so the singly differenced series is not stationary. The second test statistic (0.0388) is smaller than all critical values, so the null hypothesis of stationarity is not rejected: the twice-differenced data is stationary.
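As a cross-check on the KPSS result, one could also apply the Augmented Dickey-Fuller test from the already-loaded tseries package to the twice-differenced series (a sketch; note the ADF null hypothesis is non-stationarity, the reverse of KPSS):

```r
# A small ADF p-value supports stationarity
adf.test(zmiany_bc_diff2, alternative = "stationary")
```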
ACF, PACF
According to the autocorrelation and partial autocorrelation plots, models of order 7 will also be considered in further work, because the largest value exceeding the confidence interval is observed at lag 7.
Based on the ACF and PACF analysis, an appropriate ARIMA model is selected. The PACF becomes insignificant after the 2nd lag, which points to an AR(2) component; the ACF also becomes insignificant after the 2nd lag, which points to an MA(2) component. In ARIMA(p,d,q), the parameter d will be set to 2, since the time series was differenced twice.
ARIMA
Model selection
We use the principle of parsimony to decide which model is best: among models that fit comparably well, the one with the fewest parameters is preferred.
For that reason three models will be considered:
- ARIMA(2,2,0) - two parameters
- ARIMA(0,2,2) - two parameters
- ARIMA(2,2,2) - the most complex one, with four parameters
arima0.2.2 <- Arima(zmiany_ts1, order =c(0,2,2), lambda= 0)
summary(arima0.2.2)
## Series: zmiany_ts1
## ARIMA(0,2,2)
## Box Cox transformation: lambda= 0
##
## Coefficients:
## ma1 ma2
## -1.6455 0.6769
## s.e. NaN NaN
##
## sigma^2 estimated as 0.1203: log likelihood=-171.56
## AIC=349.12 AICc=349.17 BIC=361.6
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -69.7191 3065.353 1569.744 -8.937411 28.52469 0.1545703 0.4705795
arima2.2.0 <- Arima(zmiany_ts1, order =c(2,2,0), lambda= 0)
summary(arima2.2.0)
## Series: zmiany_ts1
## ARIMA(2,2,0)
## Box Cox transformation: lambda= 0
##
## Coefficients:
## ar1 ar2
## -0.7533 -0.3526
## s.e. 0.0433 0.0449
##
## sigma^2 estimated as 0.1914: log likelihood=-279.46
## AIC=564.93 AICc=564.98 BIC=577.4
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -385.3305 3774.596 1927.881 -9.543319 33.58852 0.1898355 0.2391352
arima2.2.2 <- Arima(zmiany_ts1, order =c(2,2,2), lambda= 0)
summary(arima2.2.2)
## Series: zmiany_ts1
## ARIMA(2,2,2)
## Box Cox transformation: lambda= 0
##
## Coefficients:
## ar1 ar2 ma1 ma2
## 0.4477 -0.1078 -1.874 0.9070
## s.e. 0.0541 0.0525 0.032 0.0299
##
## sigma^2 estimated as 0.1026: log likelihood=-133.4
## AIC=276.79 AICc=276.92 BIC=297.59
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -22.50683 2471.207 1219.294 -6.797896 24.92259 0.120062 0.313234
A crucial step of the model selection process is to examine information criteria. The table presents AIC, AICc and BIC for each initially considered model. It can be observed that the most complex model, ARIMA(2,2,2), has the lowest information criteria. For that reason I will select it for further examination.
| ARIMA(p,d,q) | AIC | AICc | BIC |
|---|---|---|---|
| (0,2,2) | 339.9 | 340.0 | 352.4 |
| (2,2,0) | 563.2 | 564.0 | 575.7 |
| (2,2,2) | 273.6 | 273.8 | 294.4 |
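The comparison table above was presumably assembled along these lines (a sketch; forecast's Arima objects store all three criteria directly):

```r
ic <- function(m) c(AIC = m$aic, AICc = m$aicc, BIC = m$bic)
round(rbind("(0,2,2)" = ic(arima0.2.2),
            "(2,2,0)" = ic(arima2.2.0),
            "(2,2,2)" = ic(arima2.2.2)), 1)
```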
The summaries of the three models indicate that the most complex model provides the best results so far. However, during the ACF and PACF examination it was concluded that models of order 7 are worth considering. For that reason two additional models are examined.
arima2.2.7 <- Arima(zmiany_ts1, order =c(2,2,7), lambda= 0)
summary(arima2.2.7)
## Series: zmiany_ts1
## ARIMA(2,2,7)
## Box Cox transformation: lambda= 0
##
## Coefficients:
## ar1 ar2 ma1 ma2 ma3 ma4 ma5 ma6
## -0.4417 -0.9839 -0.9295 0.8121 -1.5020 0.5391 -0.2278 0.3311
## s.e. 0.0115 0.0132 0.0509 0.0665 0.0858 0.1190 0.0898 0.0766
## ma7
## 0.1412
## s.e. 0.0574
##
## sigma^2 estimated as 0.08458: log likelihood=-86.42
## AIC=192.84 AICc=193.31 BIC=234.43
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -62.17826 2342.477 1190.141 -5.336486 22.91458 0.1171913 0.3116076
arima7.2.2 <- Arima(zmiany_ts1, order =c(7,2,2), lambda= 0)
summary(arima7.2.2)
## Series: zmiany_ts1
## ARIMA(7,2,2)
## Box Cox transformation: lambda= 0
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ar6 ar7 ma1
## -1.3464 -1.2947 -1.2785 -1.2721 -1.2688 -1.0813 -0.4761 -0.2578
## s.e. 0.1069 0.0995 0.0955 0.0942 0.0943 0.0918 0.0686 0.1113
## ma2
## -0.2011
## s.e. 0.0966
##
## sigma^2 estimated as 0.06886: log likelihood=-37.36
## AIC=94.72 AICc=95.19 BIC=136.31
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -90.94902 2009.586 926.9229 -4.940974 19.86228 0.09127267
## ACF1
## Training set 0.3593049
| ARIMA(p,d,q) | AIC | AICc | BIC |
|---|---|---|---|
| (2,2,2) | 273.6 | 273.8 | 294.4 |
| (2,2,7) | 192.0 | 192.6 | 233.6 |
| (7,2,2) | 94.7 | 95.2 | 136.3 |
All information criteria and error measures indicate that the most appropriate model is ARIMA(7,2,2).
ARIMA(7,2,2) also performs better than the automatically selected ARIMA model: its AIC and BIC are roughly ten times lower and its forecast values are more reasonable.
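The automatic selection referenced above is not shown in this section; it was presumably obtained with auto.arima() on the same series (a sketch, with the lambda = 0 transformation assumed for comparability):

```r
auto_fit <- auto.arima(zmiany_ts1, lambda = 0)
summary(auto_fit)
```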
Residuals
The time plot of the forecast errors shows that the residuals have roughly constant variance around a mean close to zero.
resid7.2.2 <- arima7.2.2$residuals
checkresiduals(resid7.2.2)
The histogram of residuals and the quantile comparison plot show that the errors are approximately normally distributed.
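Note that plotForecastErrors(), used below, is not part of any loaded package; it is the helper function popularized by "A Little Book of R for Time Series" and must be defined beforehand. A minimal sketch of such a function:

```r
# Histogram of forecast errors with an overlaid normal density
# of matching standard deviation (zero mean assumed)
plotForecastErrors <- function(forecasterrors) {
  mysd <- sd(forecasterrors)
  hist(forecasterrors, freq = FALSE, col = "red",
       main = "Histogram of forecast errors", xlab = "Forecast error")
  curve(dnorm(x, mean = 0, sd = mysd), add = TRUE, col = "blue", lwd = 2)
}
```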
par(mfrow = c(1, 2))
plotForecastErrors(resid7.2.2)
qqPlot(resid7.2.2, col.lines = "red")
## [1] 15  5
Residuals diagnostics
Shapiro-Wilk test for residuals normality:
The null hypothesis of this test is that the data are normally distributed. The hypothesis is rejected, since the p-value < 0.05; however, judging from the histogram and the quantile plot, the departure from normality is moderate, so an acceptable level of normality is assumed.
shapiro.test(resid7.2.2)
##
## Shapiro-Wilk normality test
##
## data: resid7.2.2
## W = 0.93844, p-value = 4.11e-13
Ljung-Box test for residuals randomness:
Box.test(resid7.2.2, lag = 1, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: resid7.2.2
## X-squared = 0.36625, df = 1, p-value = 0.5451
Box.test(resid7.2.2, lag = 7, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: resid7.2.2
## X-squared = 13.947, df = 7, p-value = 0.05213
All p-values in the above tests are greater than 0.05, therefore there is no reason to reject the null hypothesis that the errors are independent.
Forecasting
The final forecast for 60 days with a 95% confidence interval shows a rather positive scenario of daily infections converging to 0; however, it is possible that the COVID-19 pandemic will intensify again soon. An accurate prediction should also account for events that change the pandemic's intensity, such as the progress of the vaccination campaign.
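The forecast itself can be produced directly from the fitted model (a sketch; the 60-day horizon and 95% level follow the text, and forecast() inverts the lambda = 0 transformation automatically):

```r
fc <- forecast(arima7.2.2, h = 60, level = 95)
autoplot(fc) + ggtitle("COVID-19 daily infections in Poland: 60-day forecast")
```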