Let’s install libraries

library(fpp2)
## Warning: package 'fpp2' was built under R version 4.0.5
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## -- Attaching packages ---------------------------------------------- fpp2 2.4 --
## <U+221A> ggplot2   3.3.3     <U+221A> fma       2.4  
## <U+221A> forecast  8.14      <U+221A> expsmooth 2.3
## Warning: package 'forecast' was built under R version 4.0.5
## Warning: package 'fma' was built under R version 4.0.5
## Warning: package 'expsmooth' was built under R version 4.0.5
## 
library(mvnormtest)
library(agricolae)
## Warning: package 'agricolae' was built under R version 4.0.5
library(gplots)
## Warning: package 'gplots' was built under R version 4.0.5
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
library(decoder)
## Warning: package 'decoder' was built under R version 4.0.5
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.0.5
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(xts)
## Warning: package 'xts' was built under R version 4.0.5
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(urca)
## Warning: package 'urca' was built under R version 4.0.5
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.0.5
## 
## Attaching package: 'ggpubr'
## The following object is masked from 'package:forecast':
## 
##     gghistogram
library(covid19.analytics)
## Warning: package 'covid19.analytics' was built under R version 4.0.5
library(urca)
library(tseries)
## Warning: package 'tseries' was built under R version 4.0.5

Data

Data source: https://github.com/dtandev/coronavirus/tree/master/data

Let"s download data from github repository.

url <- "https://raw.githubusercontent.com/dtandev/coronavirus/master/data/CoronavirusPL%20-%20Timeseries.csv"

covid_pl <- read.csv(url,header = T, sep = ",",encoding = "UTF-8")

Preparing dataset for analysis

infected <- covid_pl[covid_pl$Infection.Death.Recovery=="I",]

infected$Infection.Death.Recovery<- 1

infected_city <- aggregate(Infection.Death.Recovery~Province+Timestamp, data = infected, FUN = sum)

infected_city$Infection.Death.Recovery <- as.numeric(infected_city$Infection.Death.Recovery)

infected_city_mean <- aggregate(Infection.Death.Recovery~Province, data = infected_city, FUN = mean)

Drawing box plots

# drawing box plots
# Please use ggboxplot function from ggpubr package, if you already installed and loaded it, there is no need to do it again.  

ggboxplot(infected_city, x = "Province", y = "Infection.Death.Recovery", 
          color = "Province", ylab = "Number of infected", xlab = "Province")

First boxplot allows us to see the reallocation of cases of COVID-19 across Poland. We can see where the number of cases per day was the biggest - here it is clearly ÅšlÄ…skie voivodeship. It not only has the highest average number of cases (114.44) but also biggest range and highest outlier.

Draw histogram

# drawing histograms
# Please use gghistogram function from ggpubr package, if you already installed and loaded it, there is no need to do it again.

gghistogram(infected_city, x = "Infection.Death.Recovery",
   add = "mean", rug = F,
   fill = "Province",
   add_density = TRUE,
   bins = 100,
   xlab = "Number of cases per day", ylab = "")

The histogram shows us that most of the days the number of cases was in range 1-50. We see highest bars cumulating around 10 - rather small number of new infections. Only the blue color, symbolizing ÅšlÄ…skie has a lot of outliers reaching even 500 cases a day.

Preparing data for infected cases only

Now we are using different set of data. zmiany_ts will be the time series we will work on.

#This time we will extract data (daily updated) using covid19.analytics package

data <- covid19.data("ts-confirmed")
## Data being read from JHU/CCSE repository
## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## Reading data from https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv
## Data retrieved on 2021-06-18 20:50:49 || Range of dates on data: 2020-01-22--2021-06-17 | Nbr of records: 278
## --------------------------------------------------------------------------------
d<-growth.rate(data,geo.loc="Poland",stride=1)
## Processing...  POLAND
## Loading required package: pheatmap
## Warning: package 'pheatmap' was built under R version 4.0.5

d2<-as.data.frame(t(d$Changes),row.names=FALSE)
dat <- tail(d2, -1) #to keep all rows except the first
zmiany<-as.numeric(dat$V1)
zmiany_ts <- ts(zmiany, start = c(2020,1,23), frequency = 365.25)

Plots

Now the distribution of cases is over 2 years - 2020 and 2021. We can easily see that the beginning of 2020 was rather smooth - small number of cases. However, the peak starts around summer and hits a point of over 3000 cases per day. It falls again in the late 2020 but still to a significantly higher level than at the begining of the year. Year 2021 starts with an increase of cases again reaching over 3000, but later starts to decrease - this is a trend we can still observe.

autoplot(zmiany_ts) + ylab("Daily infections") + xlab("Day") +
         ggtitle("New-daily infections in Poland")

ggseasonplot(zmiany_ts, year.labels = T) + xlab("Year") + ggtitle("Daily infections in Poland 2020 and 2021")

gglagplot(zmiany_ts) # lag plot

Comparison between 2020 and 2021 - we hope the low line from the beginning of 2020 will be similar at the rest of this year.

Lag plots - we see no pattern in the lags, which may suggest no strong seasonality.

Stationarity - transformations

Because the range of the data is rather small - not even 2 years, it is hard to talk about seasonality. However, the kpss and adf test show us that the data are not stationary - value of test statistic in the first one is 2.7 (over safe 1%) and adf test also rejects the hypothesis of zmiany_ts being stationary data. Hence, we use ndiffs() function to check how many differentiations are needed - we get a result of 1. zmiany_ts1 are first ordered difference - between an observation and its previous value. Now we can see that kpss test shows test statistic at level 0.13 and the plot of changed funtion is more stationary than before. However, even now the plot is not perfect because our data are highly skewed and contain a lot of very small observations.

summary(ur.kpss(zmiany_ts)) 
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 6 lags. 
## 
## Value of test-statistic is: 2.7027 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
adf.test(zmiany_ts) # p-value < 0.05 indicates the TS is stationary
## 
##  Augmented Dickey-Fuller Test
## 
## data:  zmiany_ts
## Dickey-Fuller = -1.287, Lag order = 7, p-value = 0.8801
## alternative hypothesis: stationary
ndiffs(zmiany_ts) # here 1
## [1] 1
zmiany_ts1 <-diff(zmiany_ts)
autoplot(zmiany_ts1)

summary(ur.kpss(zmiany_ts1)) # Now the test statistics is much smaller than 1% -> data is stationary
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 6 lags. 
## 
## Value of test-statistic is: 0.1377 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

ACF, PACF

Checking autocorrelation is made by using ACF and PACF. First function checks correlation between checked and previous variables. PACF tests the residuals.

Here we can see that for both original and differentiated data ACF shows no seasonality in the data. However, for small lags the value is big and positive, that may suggest an increasing trend.

ARIMA

As we know already, order of differentiation is 1. Hence, our auto.arima will show 1 at second position. It also indicates second order of Autoregressive model and third order of Moving Average (MA). Using already differentied data gives us the result of ARIMA(2, 0, 3). Since there is no seasonality, the seasonality ARIMA is not calculated.

## Warning: package 'TTR' was built under R version 4.0.5
## Warning: The chosen seasonal unit root test encountered an error when testing for the first difference.
## From stl(): series is not periodic or has less than two periods
## 0 seasonal differences will be used. Consider using a different unit root test.
## Series: zmiany_ts 
## ARIMA(2,1,3) 
## 
## Coefficients:
##          ar1      ar2      ma1     ma2      ma3
##       1.2376  -0.9743  -1.5740  1.2744  -0.3363
## s.e.  0.0112   0.0111   0.0427  0.0706   0.0522
## 
## sigma^2 estimated as 4088917:  log likelihood=-4613.43
## AIC=9238.85   AICc=9239.02   BIC=9264.27

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,3)
## Q* = 1129.4, df = 97, p-value < 2.2e-16
## 
## Model df: 5.   Total lags used: 102
## Warning: The chosen seasonal unit root test encountered an error when testing for the first difference.
## From stl(): series is not periodic or has less than two periods
## 0 seasonal differences will be used. Consider using a different unit root test.
## Series: zmiany_ts1 
## ARIMA(2,0,3) with zero mean 
## 
## Coefficients:
##          ar1      ar2      ma1     ma2      ma3
##       1.2376  -0.9743  -1.5740  1.2744  -0.3363
## s.e.  0.0112   0.0111   0.0427  0.0706   0.0522
## 
## sigma^2 estimated as 4088917:  log likelihood=-4613.43
## AIC=9238.85   AICc=9239.02   BIC=9264.27

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,0,3) with zero mean
## Q* = 1127.4, df = 97, p-value < 2.2e-16
## 
## Model df: 5.   Total lags used: 102

Forecasting

To forecast future values we can use several methods:

  1. NAIVE METHOD
    future value is equal to last observed value
# 1. NAIVE METHOD

fnaive <- naive(zmiany_ts1)
autoplot(fnaive)

  1. AVERAGE METHOD
    all forecasts are equal to the average of the observed data
# 2. AVERAGE METHOD

fmeanf <- meanf(zmiany_ts1)
autoplot(fmeanf)

  1. HOLT METHOD
    uses weighted moving averages decreasing exponentially
# 3. HOLT

hcast1 <- holt(zmiany_ts1)
autoplot(hcast1) 

We cannot use Holt-Winters method because our data show no seasonality.

  1. ARIMA FORECASTING
fit <- auto.arima(zmiany_ts1)
## Warning: The chosen seasonal unit root test encountered an error when testing for the first difference.
## From stl(): series is not periodic or has less than two periods
## 0 seasonal differences will be used. Consider using a different unit root test.
autoplot(forecast(fit, h=12))