My logbook of learning
Objective: to predict the daily count of COVID-19 positive cases in India
We observe that the daily count of COVID-19 positive cases is increasing rapidly. This is a time series, with the current scenario shown below. Time series data are data points collected sequentially over a period of time, at regular time intervals.
library(readxl)    #read_excel
library(ggplot2)   #ggplot
library(tseries)   #adf.test
library(lmtest)    #coeftest
library(forecast)  #auto.arima, Arima, forecast

State_daily <- read_excel("C:/Users/ramya.emandi/Desktop/External Projs/Interesting Data/State_daily.xlsx", sheet = "ARIMA", col_types = c("date", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric"))
View(State_daily)
#head(State_daily)
ggplot(State_daily, aes(Date, PosCases)) + geom_bar(stat = "identity")
How do we predict the positive cases on a daily basis? We could fit an equation to the trend line of the data points and extrapolate it to predict future values. But this is an ongoing process, and the fit changes every day. Also, in our case it is difficult to obtain explanatory variables to find correlations in the data and then regress on them (for example, COVID is contracted through close contact with other COVID patients, but we do not have the transmission flow chart). How about using past historical values to predict the data? We know that the next day's positive cases depend on the number of people who came in contact with the previous positive cases. This dependence of a series on its own past values is called autocorrelation.
Time series analysis means analysing the available data to find the pattern or trend in it and to predict future values. To carry out time series analysis, our data first has to be stationary. Let's check that -
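As a rough illustration of autocorrelation (my own sketch, using the State_daily data loaded above), the lag-1 correlation between each day's count and the previous day's count can be computed directly:
#sketch: lag-1 autocorrelation between consecutive days' positive cases
n <- nrow(State_daily)
cor(State_daily$PosCases[-1], State_daily$PosCases[-n], use = "pairwise.complete.obs")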
plot(State_daily$PosCases)
# PosCases doesn't look stationary
As the ‘Positive Cases’ series is not stationary, we take the difference of each value from the previous one.
d.PosCases <- diff(State_daily$PosCases)
plot(d.PosCases)
# d.PosCases looks stationary
The differenced values look stationary. Let’s check their summary statistics before we proceed.
summary(State_daily$PosCases)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.0 126.5 813.0 1024.3 1574.0 3656.0
summary(d.PosCases)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -685.00 -38.75 26.00 60.43 161.75 704.00
Although we have examined stationarity visually, we should perform the Dickey-Fuller and augmented Dickey-Fuller tests to confirm it statistically before we proceed.
#DF & ADF tests for stationarity in Y i.e. Positive cases
#k is the number of lags, Dickey Fuller test for stationarity
adf.test(State_daily$PosCases,"stationary", k=0)
##
## Augmented Dickey-Fuller Test
##
## data: State_daily$PosCases
## Dickey-Fuller = -2.3459, Lag order = 0, p-value = 0.4351
## alternative hypothesis: stationary
#adf.test(State_daily$PosCases,"explosive", k=0)
#k is the number of lags, Augmented Dickey Fuller test for stationarity
adf.test(State_daily$PosCases,"stationary")
##
## Augmented Dickey-Fuller Test
##
## data: State_daily$PosCases
## Dickey-Fuller = -1.2038, Lag order = 3, p-value = 0.8954
## alternative hypothesis: stationary
#DF & ADF for diff(positive cases)
adf.test(d.PosCases,"stationary",k=0)
##
## Augmented Dickey-Fuller Test
##
## data: d.PosCases
## Dickey-Fuller = -12.982, Lag order = 0, p-value = 0.01
## alternative hypothesis: stationary
adf.test(d.PosCases,"stationary")
##
## Augmented Dickey-Fuller Test
##
## data: d.PosCases
## Dickey-Fuller = -4.1408, Lag order = 3, p-value = 0.01011
## alternative hypothesis: stationary
We observe that the raw ‘Positive Cases’ series is not stationary (p-values > 0.05). When we take the differenced values of ‘Positive Cases’, we achieve stationarity (p-values < 0.05).
The autocorrelation function (ACF) gives the autocorrelation at all possible lags. The autocorrelation at lag 0 is included by default and always equals 1, as it is the correlation of the series with itself. Together with the partial autocorrelation function (PACF), which measures the correlation at each lag after removing the effect of shorter lags, it helps decide which time series model to use:
1. Autoregressive (AR) model
2. Moving average (MA) model
3. ARMA (AR + MA) model
4. ARIMA (Autoregressive Integrated Moving Average) model
It also gives a rough estimate of the number of lags to include in the model.
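As an optional cross-check (my addition, assuming the forecast package is loaded as above), ndiffs() uses a unit-root test to suggest how many differences are needed to make the series stationary; we would expect it to agree with the ADF results here.
#ndiffs() is from the forecast package
ndiffs(State_daily$PosCases)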
#ACF and PACF graphs for visualising the differenced value of Positive cases
acf(d.PosCases, na.action = na.omit)
pacf(d.PosCases, na.action = na.omit)
As we can infer from the graph above, the autocorrelation decreases as the lag increases, suggesting that there is little linear association between observations separated by larger lags. Also, the autocorrelation oscillates in sign, which points to a negative autoregressive coefficient.
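To see the sign of that coefficient more concretely, one could fit a simple AR(1) to the differenced series and inspect the estimate (a quick sketch, not part of the original model-selection pipeline):
#sketch: sign of the AR(1) coefficient on the differenced series
ar1_check <- arima(d.PosCases, order = c(1, 0, 0))
coef(ar1_check)["ar1"]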
Approximate Model
#ARIMA Model
auto.arima(State_daily$PosCases, trace=TRUE)
##
## ARIMA(2,1,2) with drift : 734.6645
## ARIMA(0,1,0) with drift : 753.8674
## ARIMA(1,1,0) with drift : 741.4222
## ARIMA(0,1,1) with drift : 745.4293
## ARIMA(0,1,0) : 754.7713
## ARIMA(1,1,2) with drift : 743.6531
## ARIMA(2,1,1) with drift : 745.5354
## ARIMA(3,1,2) with drift : 737.2739
## ARIMA(2,1,3) with drift : 737.2736
## ARIMA(1,1,1) with drift : 743.7503
## ARIMA(1,1,3) with drift : 744.8649
## ARIMA(3,1,1) with drift : 744.5957
## ARIMA(3,1,3) with drift : 740.0474
## ARIMA(2,1,2) : 738.3116
##
## Best model: ARIMA(2,1,2) with drift
## Series: State_daily$PosCases
## ARIMA(2,1,2) with drift
##
## Coefficients:
## ar1 ar2 ma1 ma2 drift
## -0.8964 -0.7476 0.5184 0.9528 60.9388
## s.e. 0.1120 0.1089 0.0903 0.1476 23.4787
##
## sigma^2 estimated as 37665: log likelihood=-360.44
## AIC=732.88 AICc=734.66 BIC=744.81
All the candidate models are estimated here. Under the ARIMA(p, d, q) model:
p = number of lags for autoregression (i.e. past values of positive cases)
d = number of times the series is differenced (the 'Integrated' part)
q = number of lags of the residual (i.e. past values of the unexplained error term)
It is observed that all the estimated models have d = 1. As we saw earlier, the positive cases series was not stationary, so the auto.arima function made the series stationary by differencing it once, which matches the differenced series we tested earlier.
The best model suggested is ARIMA(2,1,2) with drift. We choose the model with the lowest AIC/BIC value, while keeping it parsimonious, i.e. with fewer parameters.
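Such a comparison can also be done by hand with Arima() from the forecast package, for instance between the suggested model and a simpler candidate (a minimal sketch, not part of the original pipeline):
#sketch: manual AIC/BIC comparison of two candidate models
fit_212 <- Arima(State_daily$PosCases, order = c(2, 1, 2), include.drift = TRUE)
fit_110 <- Arima(State_daily$PosCases, order = c(1, 1, 0), include.drift = TRUE)
c(AIC(fit_212), AIC(fit_110))  #lower is better, all else equal
c(BIC(fit_212), BIC(fit_110))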
ARIMA(2,1,2) is the best model, as it has the lowest AIC value. Estimating the model below -
#arima() is from base R (stats), so this estimation step works even if you didn't install the forecast package
fitarima <- arima(State_daily$PosCases, order = c(2,1,2))
coeftest(fitarima)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.85654 0.12650 -6.7709 1.280e-11 ***
## ar2 -0.71882 0.12068 -5.9567 2.574e-09 ***
## ma1 0.51709 0.11319 4.5684 4.915e-06 ***
## ma2 0.97423 0.23667 4.1165 3.847e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
All the estimates are significant, with p-values < 0.05. Now we proceed with predicting future values. The prediction for the next 10 days -
#predict for next 10 days
#predict(fitarima,n.ahead = 10)
fitarimadrift <- Arima(State_daily$PosCases, order = c(2,1,2), include.drift = TRUE)
Next10days_drift <- forecast(fitarimadrift, h=10, level = c(80, 95))
#Next10days <- forecast(fitarima, h=10, level = c(80, 95))
#plot(Next10days)
plot(Next10days_drift)
The forecast is shown as a blue line, with 80% and 95% prediction intervals. The darker shaded portion is the 80% interval and the lighter shaded portion is the 95% interval.
round(Next10days_drift$mean)
## Time Series:
## Start = 56
## End = 65
## Frequency = 1
## [1] 3311 3590 3526 3536 3736 3710 3745 3894 3895 3944
#predict(fitarimadrift,n.ahead = 10)
Since COVID cases are so random, I would like to see the upper and lower bounds of the prediction intervals.
round(Next10days_drift$lower)
## Time Series:
## Start = 56
## End = 65
## Frequency = 1
## 80% 95%
## 56 3061 2929
## 57 3297 3141
## 58 3113 2895
## 59 3059 2807
## 60 3225 2955
## 61 3132 2826
## 62 3121 2790
## 63 3239 2891
## 64 3191 2817
## 65 3202 2809
round(Next10days_drift$upper)
## Time Series:
## Start = 56
## End = 65
## Frequency = 1
## 80% 95%
## 56 3560 3692
## 57 3884 4039
## 58 3938 4156
## 59 4012 4265
## 60 4247 4518
## 61 4288 4595
## 62 4369 4699
## 63 4550 4897
## 64 4600 4974
## 65 4686 5079
I prefer to take the upper values of the 95% prediction intervals, as a pessimistic approach.
This is the overall process by which we can analyse time series data and forecast future values from an existing series using an ARIMA model.
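Those pessimistic values can be pulled out directly (a small sketch; column 2 of $upper corresponds to the 95% level because we requested level = c(80, 95)):
#sketch: upper bounds of the 95% prediction intervals only
round(Next10days_drift$upper[, 2])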