Let’s install libraries

#install.packages("ggpubr")
library(ggpubr)
## Loading required package: ggplot2
#install.packages("mvnormtest")
library(mvnormtest)

#install.packages("agricolae")
library(agricolae)

#install.packages("gplots")
library(gplots)
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
#install.packages("decoder")
library(decoder)

#install.packages("lubridate")
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
#install.packages("covid19.analytics")
library(covid19.analytics)

#install.packages("xts")
library(xts)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

Data

In this lab, we are going to use dataset which is about active Covid-19 cases daily basis per each city and provinces.

Data source: https://github.com/dtandev/coronavirus/tree/master/data

Let"s download data from github repository.

url <- "https://raw.githubusercontent.com/dtandev/coronavirus/master/data/CoronavirusPL%20-%20Timeseries.csv"

covid_pl <- read.csv(url,header = T, sep = ",",encoding = "UTF-8")

Preparing dataset for analysis

infected <- covid_pl[covid_pl$Infection.Death.Recovery=="I",]

infected$Infection.Death.Recovery<- 1

infected_city <- aggregate(Infection.Death.Recovery~Province+Timestamp, data = infected, FUN = sum)

infected_city_mean <- aggregate(Infection.Death.Recovery~Province, data = infected_city, FUN = mean)

Drawing box plots

From the boxplot, we can find Śląskie is infected mostly.The second is Mazowieckie.

# drawing box plots
# Please use ggboxplot function from ggpubr package, if you already installed and loaded it, there is no need to do it again.  


ggboxplot(infected_city, x = "Province", y = "Infection.Death.Recovery", 
          color = "Province", ylab = "Number of infected", xlab = "Province")

Draw histogram

Most of cities are infected by small number of people. But Śląskie is more than others obviously in bigger number.

# drawing histograms
# Please use gghistogram function from ggpubr package, if you already installed and loaded it, there is no need to do it again.  

gghistogram(infected_city, x = "Infection.Death.Recovery", 
   add = "mean", rug = F,
   fill = "Province",
   add_density = TRUE,
   bins = 100)

Preparing data for infected cases only

From graph, we can find the infection is incresed very fast twice in winter which are in the end of 2020 and the beginning of 2021.

#This time we will extract data (daily updated) using covid19.analytics package
data <- covid19.data("ts-confirmed")
## Data being read from JHU/CCSE repository
## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## Reading data from https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv
## Data retrieved on 2021-06-12 16:54:14 || Range of dates on data: 2020-01-22--2021-06-11 | Nbr of records: 276
## --------------------------------------------------------------------------------
d<-growth.rate(data,geo.loc="Poland",stride=1)
## Processing...  POLAND
## Loading required package: pheatmap

d2<-as.data.frame(t(d$Changes),row.names=FALSE)
dat <- tail(d2, -1) #to keep all rows except the first
zmiany<-as.numeric(dat$V1)
zmiany_ts <- ts(zmiany,start=c(2020,1,23),frequency=365)

Plots

From graph, we can find it is not stable, so we need to stablize it.

#install.packages("fpp2")
library(fpp2)

autoplot(zmiany_ts) + ylab("Daily infections") + xlab("Day") +
         ggtitle("New-daily infections in Poland")

Stationarity - transformations

Firstly, we can use ndiffs() to get the number of differences required. It is 1.

After use 1, we find it truly get a better graph.

Then use 2, we can find it doesn’t change too much, so 1 is enough.

ndiffs(zmiany_ts)
## [1] 1
newzmiany<-diff(zmiany_ts)

autoplot(newzmiany) + ylab("Daily infections") + xlab("Day") +
         ggtitle("New-daily infections in Poland")

autoplot(diff(diff(zmiany_ts))) + ylab("Daily infections") + xlab("Day") +
         ggtitle("New-daily infections in Poland")

ACF, PACF

From acf, we can know that we need get p value from pacf.

From pacf, we can find the truncate of p is 14. And q should be 0.

Use lag.max can enlarge the front part to observe easier.

ggAcf(newzmiany)

ggPacf(newzmiany)

ggAcf(newzmiany,lag.max = 175)

ggPacf(newzmiany,lag.max = 20)

ARIMA

From above we can know that p = 14, d = 1, q = 0. We can make a ARIMA model of (14,1,0)

Then we use auto.arima to get a auto result to compare.

Using checkresiduals() to check the residuls, we can find the auto one (2,1,3) has a bad ACF, because it is out of blue line.

It is almost conform to normal distribution, only the middle is too high, I think it is OK.

From Ljung-Box test, we can know Its p-value is smaller 0.05, so it is not white noise.

arimaModel <- Arima(zmiany,order=c(14,1,0))

autoModel <- auto.arima(zmiany)

arimaModel
## Series: zmiany 
## ARIMA(14,1,0) 
## 
## Coefficients:
##           ar1      ar2      ar3      ar4      ar5      ar6     ar7     ar8
##       -0.2292  -0.3026  -0.1574  -0.1942  -0.0868  -0.0462  0.3298  0.2862
## s.e.   0.0429   0.0440   0.0459   0.0461   0.0468   0.0468  0.0448  0.0449
##          ar9    ar10    ar11     ar12    ar13    ar14
##       0.0730  0.0838  0.0796  -0.0539  0.0584  0.2541
## s.e.  0.0467  0.0466  0.0460   0.0455  0.0435  0.0423
## 
## sigma^2 estimated as 2923802:  log likelihood=-4471.24
## AIC=8972.47   AICc=8973.46   BIC=9035.84
autoModel
## Series: zmiany 
## ARIMA(2,1,3) 
## 
## Coefficients:
##          ar1      ar2      ma1     ma2      ma3
##       1.2375  -0.9743  -1.5740  1.2745  -0.3365
## s.e.  0.0112   0.0111   0.0429  0.0710   0.0526
## 
## sigma^2 estimated as 4137940:  log likelihood=-4562.25
## AIC=9136.5   AICc=9136.67   BIC=9161.85
checkresiduals(arimaModel)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(14,1,0)
## Q* = 10.227, df = 3, p-value = 0.01673
## 
## Model df: 14.   Total lags used: 17
checkresiduals(autoModel)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,3)
## Q* = 206.88, df = 5, p-value < 2.2e-16
## 
## Model df: 5.   Total lags used: 10

Forecasting

Use forecast() to forecast, we find the pandemic infection situation will stop.

autoplot(forecast(arimaModel, h = 30))