Let’s install libraries
#install.packages("ggpubr")
library(ggpubr)## Loading required package: ggplot2
#install.packages("mvnormtest")
library(mvnormtest)
#install.packages("agricolae")
library(agricolae)
#install.packages("gplots")
library(gplots)##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
#install.packages("decoder")
library(decoder)
#install.packages("lubridate")
library(lubridate)##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
#install.packages("covid19.analytics")
library(covid19.analytics)
#install.packages("xts")
library(xts)## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
In this lab, we are going to use dataset which is about active Covid-19 cases daily basis per each city and provinces.
Data source: https://github.com/dtandev/coronavirus/tree/master/data
url <- "https://raw.githubusercontent.com/dtandev/coronavirus/master/data/CoronavirusPL%20-%20Timeseries.csv"
covid_pl <- read.csv(url,header = T, sep = ",",encoding = "UTF-8")infected <- covid_pl[covid_pl$Infection.Death.Recovery=="I",]
infected$Infection.Death.Recovery<- 1
infected_city <- aggregate(Infection.Death.Recovery~Province+Timestamp, data = infected, FUN = sum)
infected_city_mean <- aggregate(Infection.Death.Recovery~Province, data = infected_city, FUN = mean)From the boxplot, we can find Śląskie is infected mostly.The second is Mazowieckie.
# drawing box plots
# Please use ggboxplot function from ggpubr package, if you already installed and loaded it, there is no need to do it again.
ggboxplot(infected_city, x = "Province", y = "Infection.Death.Recovery",
color = "Province", ylab = "Number of infected", xlab = "Province")Most of cities are infected by small number of people. But Śląskie is more than others obviously in bigger number.
# drawing histograms
# Please use gghistogram function from ggpubr package, if you already installed and loaded it, there is no need to do it again.
gghistogram(infected_city, x = "Infection.Death.Recovery",
add = "mean", rug = F,
fill = "Province",
add_density = TRUE,
bins = 100)From graph, we can find the infection is incresed very fast twice in winter which are in the end of 2020 and the beginning of 2021.
#This time we will extract data (daily updated) using covid19.analytics package
data <- covid19.data("ts-confirmed")## Data being read from JHU/CCSE repository
## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## Reading data from https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv
## Data retrieved on 2021-06-12 16:54:14 || Range of dates on data: 2020-01-22--2021-06-11 | Nbr of records: 276
## --------------------------------------------------------------------------------
d<-growth.rate(data,geo.loc="Poland",stride=1)## Processing... POLAND
## Loading required package: pheatmap
d2<-as.data.frame(t(d$Changes),row.names=FALSE)
dat <- tail(d2, -1) #to keep all rows except the first
zmiany<-as.numeric(dat$V1)
zmiany_ts <- ts(zmiany,start=c(2020,1,23),frequency=365)From graph, we can find it is not stable, so we need to stablize it.
#install.packages("fpp2")
library(fpp2)
autoplot(zmiany_ts) + ylab("Daily infections") + xlab("Day") +
ggtitle("New-daily infections in Poland")Firstly, we can use ndiffs() to get the number of differences required. It is 1.
After use 1, we find it truly get a better graph.
Then use 2, we can find it doesn’t change too much, so 1 is enough.
ndiffs(zmiany_ts)## [1] 1
newzmiany<-diff(zmiany_ts)
autoplot(newzmiany) + ylab("Daily infections") + xlab("Day") +
ggtitle("New-daily infections in Poland")autoplot(diff(diff(zmiany_ts))) + ylab("Daily infections") + xlab("Day") +
ggtitle("New-daily infections in Poland")From acf, we can know that we need get p value from pacf.
From pacf, we can find the truncate of p is 14. And q should be 0.
Use lag.max can enlarge the front part to observe easier.
ggAcf(newzmiany)ggPacf(newzmiany)ggAcf(newzmiany,lag.max = 175)ggPacf(newzmiany,lag.max = 20)From above we can know that p = 14, d = 1, q = 0. We can make a ARIMA model of (14,1,0)
Then we use auto.arima to get a auto result to compare.
Using checkresiduals() to check the residuls, we can find the auto one (2,1,3) has a bad ACF, because it is out of blue line.
It is almost conform to normal distribution, only the middle is too high, I think it is OK.
From Ljung-Box test, we can know Its p-value is smaller 0.05, so it is not white noise.
arimaModel <- Arima(zmiany,order=c(14,1,0))
autoModel <- auto.arima(zmiany)
arimaModel## Series: zmiany
## ARIMA(14,1,0)
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ar6 ar7 ar8
## -0.2292 -0.3026 -0.1574 -0.1942 -0.0868 -0.0462 0.3298 0.2862
## s.e. 0.0429 0.0440 0.0459 0.0461 0.0468 0.0468 0.0448 0.0449
## ar9 ar10 ar11 ar12 ar13 ar14
## 0.0730 0.0838 0.0796 -0.0539 0.0584 0.2541
## s.e. 0.0467 0.0466 0.0460 0.0455 0.0435 0.0423
##
## sigma^2 estimated as 2923802: log likelihood=-4471.24
## AIC=8972.47 AICc=8973.46 BIC=9035.84
autoModel## Series: zmiany
## ARIMA(2,1,3)
##
## Coefficients:
## ar1 ar2 ma1 ma2 ma3
## 1.2375 -0.9743 -1.5740 1.2745 -0.3365
## s.e. 0.0112 0.0111 0.0429 0.0710 0.0526
##
## sigma^2 estimated as 4137940: log likelihood=-4562.25
## AIC=9136.5 AICc=9136.67 BIC=9161.85
checkresiduals(arimaModel)##
## Ljung-Box test
##
## data: Residuals from ARIMA(14,1,0)
## Q* = 10.227, df = 3, p-value = 0.01673
##
## Model df: 14. Total lags used: 17
checkresiduals(autoModel)##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,3)
## Q* = 206.88, df = 5, p-value < 2.2e-16
##
## Model df: 5. Total lags used: 10
Use forecast() to forecast, we find the pandemic infection situation will stop.
autoplot(forecast(arimaModel, h = 30))