In this exercise, we show how to fit an ARIMA model to time series provided by the Ministry of Finance of North Macedonia.

The data set contains 5 time series: personal income tax, corporate income tax, value added tax, excise duty, and CUS.

All series have a daily frequency and span the period from 01/04/2016 to 29/12/2017. In total, the data set has 726 observations.

We will use the tseries, fpp2, and urca R packages and follow the Box-Jenkins methodology to find the best-fitting ARMA or ARIMA model.

library(fpp2)
library(tseries)
library(urca)

We first create a ts object.

y <- ts(INPUT_MATRIX_FINAL_DAILY_DT, start=c(2016,4), frequency=7)

We set the frequency to 7 because we have daily data and expect weekly seasonality.

We will also create two vectors with short and long column names to use when selecting variables and labeling the plots.

var = colnames(INPUT_MATRIX_FINAL_DAILY_DT)
var_long_names = c("Date", #1
                   "Weekdays",#2
                   "week_number",#3
                   "Personal income TAX",#4
                   "Corporate income TAX",#5
                   "Value added TAX",  #6      
                   "Excise duty",#7
                   "CUS"#8
                   )

Before we examine each variable separately, let's plot all series together using the autoplot() function.

autoplot(y[,4:ncol(y)])

autoplot(y[,4:ncol(y)], facets = TRUE)

autoplot(y[,4:ncol(y)], facets = TRUE) +
  geom_smooth() +
  labs(title = "Tax incomes",
       y = "In million eur",
       x = NULL)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Personal Income Tax (PIC)

We set the index i to 4, which is the column of the PIC variable.

i=4 # variable 4 is Personal income TAX - PIC in our data set.

We will first plot the personal income tax series using the autoplot() function.

autoplot(y[,var[i]]) +
  ggtitle(var_long_names[i]) +
  xlab("Day") +
  ylab("Million eur")

We also decompose our series into the trend, the seasonal pattern, and the irregular component.

# decompose time series
y[,var[i]] %>% decompose(type="multiplicative") %>%
  autoplot() + xlab("Year") +
  ggtitle(paste(var_long_names[i]))

We can see that there is a strong seasonal pattern in our series. Therefore, in the following we will plot a set of graphs that will help us gain deeper insight into the seasonal behavior of the series. In particular, we will use the ggseasonplot() and ggsubseriesplot() functions.

ggseasonplot(y[,var[i]], year.labels=T, year.labels.left=F) +
  ylab("Million eur") +
  ggtitle(paste("Seasonal plot: ", var_long_names[i], sep = ""))

ggseasonplot(y[,var[i]], polar=TRUE, year.labels=T, year.labels.left=F) +
  ylab("Million eur") +
  ggtitle(paste("Polar seasonal plot: ", var_long_names[i], sep = ""))

ggsubseriesplot(y[,var[i]]) +
  ylab("Million eur") +
  ggtitle(paste("Seasonal subseries plot: ", var_long_names[i], sep = ""))

Although the last plot may be the most informative, showing the mean values over the weekdays with blue horizontal lines, all graphs indicate a moderate seasonal pattern.

We can also use a more formal way to test whether we should apply any difference transformation to our data due to seasonality or non-stationarity. For this we use two available functions, ndiffs() and nsdiffs().

y[,var[i]]  %>% ndiffs()
## [1] 1
y[,var[i]]  %>% nsdiffs()
## [1] 0

While ndiffs() suggests that our series is not stationary and that it is integrated of order 1 (one difference is needed to achieve stationarity), nsdiffs() suggests that no seasonal difference is needed, hence there is no strong seasonal pattern in our data. Nevertheless, in the following we will use a more formal way to test the stationarity of our series.

Next, we test whether our data are stationary using two tests: the Augmented Dickey-Fuller test and the KPSS unit root test.

# Unit root tests
adf.test(y[,var[i]])
## Warning in adf.test(y[, var[i]]): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  y[, var[i]]
## Dickey-Fuller = -7.8278, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary
# KPSS test: if the test statistic is greater than the 1% critical value,
# the null hypothesis that the data is stationary is rejected, i.e. the data is not stationary.
y[,var[i]] %>%  ur.kpss() %>% summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 6 lags. 
## 
## Value of test-statistic is: 0.8133 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The ADF test indicates that the series is stationary, while the KPSS test suggests that it is not; given this disagreement, we suspect that our data contain a unit root. Therefore, following the test results, we check whether the data are stationary after taking the first difference.

# Unit root tests
y[,var[i]] %>% diff(lag = 1) %>% adf.test()
## Warning in adf.test(.): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  .
## Dickey-Fuller = -11.768, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary
# KPSS test on the differenced series: if the test statistic is below the critical values,
# the null hypothesis that the data is stationary is not rejected.
y[,var[i]] %>% diff(lag = 1)  %>%  ur.kpss() %>% summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 6 lags. 
## 
## Value of test-statistic is: 0.1233 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The results of both tests confirm that the series is integrated of order one, I(1).
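As a quick visual check (a minimal sketch using the plotting tools already loaded), we can also plot the first-differenced series and verify that it fluctuates around a constant mean:

# Plot the first-differenced series; a stationary series should
# fluctuate around a constant mean with roughly constant variance.
y[,var[i]] %>% diff(lag = 1) %>% autoplot() +
  ggtitle(paste("First difference of", var_long_names[i])) +
  xlab("Day") +
  ylab("Million eur")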

We also test for serial correlation in our series using the Ljung-Box test, considering 25 lags.

Box.test((y[,var[i]]), lag=25, type="Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  (y[, var[i]])
## X-squared = 528.44, df = 25, p-value < 2.2e-16

Given the very small p-value, we reject the null hypothesis of no serial correlation. Hence, we conclude that there is correlation among the lagged values of the series.
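As an additional check (not shown in the original output), the same test can be applied to the differenced series; any autocorrelation that remains after differencing is what the ARMA terms of the model will have to capture:

# Ljung-Box test on the first-differenced series.
y[,var[i]] %>% diff(lag = 1) %>% Box.test(lag = 25, type = "Ljung-Box")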

ARIMA model

To construct an ARIMA model we first plot the ACF and PACF functions of our stationary data (after taking the first difference).

y[,var[i]] %>% diff(lag=1) %>% ggtsdisplay()

The decay in the ACF together with the significant correlations at the first three lags suggests an AR(3) process, while the PACF shows significant partial correlations at the first four lags that gradually die out, so we may suspect an AR(3) or AR(4) process. On the other hand, the decay in the PACF points towards an MA component, and the drop in the ACF after the second and third lag may suggest an MA(2) or MA(3) process. Regarding seasonal behavior, the decay over two periods of 7 days in the PACF suggests a seasonal MA(2) process.

Therefore, we will try several models and choose the best one, taking into account the information criteria as well as the Ljung-Box test for serial correlation of the residuals.
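As a minimal sketch of such a comparison (the candidate orders below are purely illustrative, not an exhaustive search), we could loop over a few specifications and collect their AICc values:

# Fit a handful of candidate specifications and compare them by AICc
# (lower is better); the seasonal part is kept fixed at (0,0,2)[7].
candidates <- list(c(3,1,2), c(3,1,3), c(4,1,0), c(4,1,3))
aicc <- sapply(candidates, function(ord)
  Arima(y[,var[i]], order = ord, seasonal = c(0,0,2))$aicc)
data.frame(order = sapply(candidates, paste, collapse = ","), AICc = aicc)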

ARIMA(4,1,3)(0,0,2)[7]

fit <-  y[,var[i]] %>% Arima(order=c(4,1,3), seasonal=c(0,0,2))
fit
## Series: . 
## ARIMA(4,1,3)(0,0,2)[7] 
## 
## Coefficients:
##           ar1      ar2     ar3     ar4     ma1     ma2      ma3    sma1    sma2
##       -1.4406  -0.4419  0.1573  0.0581  0.6855  -0.884  -0.7851  0.3449  0.2794
## s.e.   0.3111   0.2134  0.0716  0.0859  0.3134   0.035   0.2819  0.0467  0.0360
## 
## sigma^2 estimated as 1152:  log likelihood=-3581.92
## AIC=7183.84   AICc=7184.14   BIC=7229.7

In the Arima() function, we specify an AR(4) component, one regular difference, and an MA(3) component. In the seasonal part of the model we do not take a seasonal difference and only include the seasonal MA terms. Note that if we specify seasonal=c(0,1,2), one seasonal difference will be taken, in this case at lag 7. In the following, we inspect whether the residuals of our model are white noise using graphical diagnostics as well as the Ljung-Box test.

fit <-  y[,var[i]] %>% Arima(order=c(4,1,3), seasonal=c(0,0,2))
checkresiduals(fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(4,1,3)(0,0,2)[7]
## Q* = 24.075, df = 5, p-value = 0.00021
## 
## Model df: 9.   Total lags used: 14
fit %>% residuals() %>% ggtsdisplay()

The first figure suggests that the residuals are not normally distributed. In addition, the Ljung-Box test rejects the null hypothesis of no serial correlation, indicating that there is serial correlation in the residuals and that they are not white noise (the iid assumption is violated). Finally, the last graph of the ACF and PACF confirms the results of the Ljung-Box test. In the following, we repeat the exercise with a different configuration of the model and compare the information criteria.
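To complement the visual impression of non-normality, we could also apply a formal normality test to the residuals. A minimal sketch using the Jarque-Bera test from the tseries package (already loaded above) is given below; its output is not reported here:

# Jarque-Bera test: the null hypothesis is that the residuals are
# normally distributed; a small p-value indicates non-normality.
jarque.bera.test(residuals(fit))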

ARIMA(4,1,0)(0,0,2)[7]

fit <-  y[,var[i]] %>% Arima(order=c(4,1,0), seasonal=c(0,0,2))
fit
## Series: . 
## ARIMA(4,1,0)(0,0,2)[7] 
## 
## Coefficients:
##           ar1      ar2      ar3      ar4    sma1    sma2
##       -0.6144  -0.5201  -0.2829  -0.1365  0.2702  0.2990
## s.e.   0.0380   0.0430   0.0432   0.0374  0.0398  0.0302
## 
## sigma^2 estimated as 1330:  log likelihood=-3634.14
## AIC=7282.28   AICc=7282.44   BIC=7314.39
fit <-  y[,var[i]] %>% Arima(order=c(4,1,0), seasonal=c(0,0,2))
checkresiduals(fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(4,1,0)(0,0,2)[7]
## Q* = 68.882, df = 8, p-value = 8.202e-12
## 
## Model df: 6.   Total lags used: 14
fit %>% residuals() %>% ggtsdisplay()

We again fail to obtain white noise residuals, meaning that the second configuration of the model is also not good enough. In addition, according to the AIC, AICc and BIC information criteria given in the tables above, the first configuration is better since its information criteria are lower.

This exercise should be repeated several times if we do not find a satisfactory configuration. For the time being, we will run the auto.arima() function to see what is suggested by the automatic algorithm that is part of the forecast package.

Auto ARIMA estimation

fit <-  y[,var[i]] %>% auto.arima()
fit
## Series: . 
## ARIMA(4,1,0)(0,0,2)[7] with drift 
## 
## Coefficients:
##           ar1      ar2      ar3      ar4    sma1    sma2   drift
##       -0.6148  -0.5205  -0.2831  -0.1366  0.2702  0.2988  0.2837
## s.e.   0.0380   0.0430   0.0432   0.0374  0.0398  0.0302  0.8252
## 
## sigma^2 estimated as 1331:  log likelihood=-3634.08
## AIC=7284.17   AICc=7284.37   BIC=7320.85
fit <-  y[,var[i]] %>% auto.arima()
checkresiduals(fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(4,1,0)(0,0,2)[7] with drift
## Q* = 68.987, df = 7, p-value = 2.366e-12
## 
## Model df: 7.   Total lags used: 14

The auto.arima() function suggested an ARIMA model with drift, which means a time trend is included. However, this new configuration again failed to produce white noise residuals. In addition, according to the AIC, AICc and BIC information criteria, the first configuration is still preferable.
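For reference, the drift term can also be requested explicitly when fitting a model manually through the include.drift argument of Arima(); the call below is only an illustration of the specification chosen by auto.arima():

# Manual specification of the auto.arima() choice: the drift term adds
# a constant to the differenced series, i.e. a linear trend in the levels.
y[,var[i]] %>% Arima(order=c(4,1,0), seasonal=c(0,0,2), include.drift=TRUE)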

In the following, we will check whether our data exhibit a time trend using the non-parametric Cox and Stuart trend test, cs.test(), from the trend package.

library(trend)
y[,var[i]] %>% cs.test()
## 
##  Cox and Stuart Trend test
## 
## data:  .
## z = 0.12856, n = 726, p-value = 0.8977
## alternative hypothesis: monotonic trend

Due to the high p-value (greater than 0.05), the test fails to reject the null hypothesis of no trend, suggesting that our series does not exhibit a trend, which was also evident from the decomposition graph plotted at the beginning of the exercise.

Therefore, we conclude that, although we did not find a configuration that yields white noise residuals, the preferred model is given by the first configuration, ARIMA(4,1,3)(0,0,2)[7].

As the last task, we use the forecast() function to produce a prediction from our best model.

fit <-  y[,var[i]] %>% Arima(order=c(4,1,3), seasonal=c(0,0,2))
autoplot(forecast(fit))
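By default, forecast() picks the forecast horizon from the frequency of the series; the horizon can also be set explicitly. As an illustration (the horizon h = 14, i.e. two weekly cycles, is our own choice, not part of the original analysis):

# Forecast 14 days ahead and plot the last 60 observations
# together with the prediction intervals.
fc <- forecast(fit, h = 14)
autoplot(fc, include = 60) +
  ggtitle(paste("Forecast of", var_long_names[i])) +
  xlab("Day") +
  ylab("Million eur")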