In this tutorial, we explored how to use non-seasonal ARIMA models to forecast a time series and how the generated models can be biased by the data splitting.
Let’s start by loading the required packages and libraries.
# Install pacman if needed, then use it to install and load all required packages
if (!require(pacman)) {
  install.packages("pacman")
  library(pacman)
}
pacman::p_load(forecast,fracdiff,tseries,devtools,ggplot2,ggfortify,vars,xts,astsa,ggpubr) # aTSA
# fpp2 provides the wmurders time series used below
if (!require(fpp2)) {
  devtools::install_github("robjhyndman/fpp2-package")
  library(fpp2)
}
For this tutorial, we selected a time series which is not stationary, but we still used it for illustrative purposes.
We started by defining a custom function to perform an Augmented Dickey-Fuller (ADF) test on a time series and check whether it is stationary.
# Function to determine whether a ts is stationary using the adf.test
station <- function(ts) {
  p_value <- adf.test(ts)$p.value  # run the test once and reuse the p-value
  if (p_value > 0.05) {
    print("The time series is not stationary, the ADF test has not been passed")
  } else {
    print("This time series is stationary, the ADF test has been passed")
  }
  ggAcf(ts)  # plot the ACF in either case
}
For this tutorial, we’re going to explore a very challenging time series, shown in the figure below.
The visual inspection of the time series clearly showed three periods of time with very different trends.
Let’s create different data partitions
# Split 1: train on 1950-1975 (the upward-trend years), test on the rest
ts_train1 <- window(wmurders,start=1950,end=1975)
ts_test1 <- window(wmurders,start=1975)
# Split 2: train on 1975-1995, test on the rest
ts_train2 <- window(wmurders,start=1975,end=1995)
ts_test2 <- window(wmurders,start=1995)
# Split 3: train on the whole 1950-1995 period, test on the rest
ts_train3 <- window(wmurders,start=1950,end=1995)
ts_test3 <- window(wmurders,start=1995)
Let’s plot our data splits.
autoplot(ts_train1,xlab="Year",series="Train set") +
autolayer(ts_test1,series = "Test set1") +
ggtitle("Data split 1")
autoplot(ts_train2,xlab="Year",series="Train set") +
autolayer(ts_test2,series = "Test set2") +
ggtitle("Data split 2")
autoplot(ts_train3,xlab="Year",series="Train set") +
autolayer(ts_test3,series = "Test set3") +
ggtitle("Data split 3")
As can be seen, qualitatively none of the three train sets looks stationary. However, in order to be more quantitative, let’s run a statistical test over them.
To run the tests, we will make use of our custom function to check for stationarity.
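The calls themselves are not echoed in the rendered output; a minimal sketch, simply applying our station() function to each train set, would be:
station(ts_train1)
station(ts_train2)
station(ts_train3)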
## [1] "The time series is not stationary, the ADF test has not been passed"
## [1] "The time series is not stationary, the ADF test has not been passed"
## [1] "The time series is not stationary, the ADF test has not been passed"
As we guessed, none of the train sets were stationary. Be aware that this conclusion was based on the ADF statistical test, not on the ACF function; the ACF was plotted for the sake of completeness.
In order to explore how the different training sets impact the forecasts, we will generate ARIMA models on each training set and evaluate their residuals and forecasts.
This training set has a clear upward trend, therefore differencing is needed in order to remove it.
As a first approach to obtain a stationary time series, we computed successive differences to see whether that solved the problem; a minimal sketch of this step is shown below.
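The differencing chunk is not echoed in the source; a sketch, assuming the differenced series are stored under the hypothetical names d1_train1, d2_train1 and d3_train1, could be:
# First, second and third differences of the first train set
d1_train1 <- diff(ts_train1, differences = 1)
d2_train1 <- diff(ts_train1, differences = 2)
d3_train1 <- diff(ts_train1, differences = 3)
# Plot the three differenced series in one faceted figure
autoplot(cbind(d1_train1, d2_train1, d3_train1), facets = TRUE) +
  ggtitle("Differences of train set 1")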
The differenced time series look more stationary, except for the first difference. Next, we ran the ADF test again in order to be more quantitative.
For this purpose, we used our custom function.
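A sketch of these checks, reusing the hypothetical object names from the previous chunk:
station(d1_train1)
station(d2_train1)
station(d3_train1)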
## [1] "The time series is not stationary, the ADF test has not been passed"
## [1] "This time series is stationary, the ADF test has been passed"
## Warning in adf.test(ts): p-value smaller than printed p-value
## [1] "This time series is stationary, the ADF test has been passed"
We achieved a stationary time series with just two differences. We could use three differences, but two are enough.
Of course, when running the ADF test we did not have many data points, and the results obtained with such a small sample are arguable. However, the ADF test was still carried out in order to illustrate how the general procedure should be conducted.
Therefore, right now we know the tentative value of “d” to use in the ARIMA(p,d,q) model. We will use \(d=2\).
In order to estimate the best possible values for the remaining parameters of an ARIMA(p,d,q) model, we will have a look at the ACF and PACF functions.
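The corresponding chunk is not echoed; a minimal sketch, assuming the twice-differenced series is stored as d2_train1, could be:
# ACF and PACF of the twice-differenced train set 1
ggtsdisplay(d2_train1, main = "Twice-differenced train set 1")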
As can be seen in the last figure, there is one spike above the threshold in both the ACF and the PACF, which strongly suggests trying an ARIMA(1,2,1) model.
However, having both “p” and “q” different from zero is not very desirable. Therefore, we will try another ARIMA model, lowering “p” and “q” by one unit while raising “d”, to end up with an ARIMA(0,3,0).
On the other hand, in order to have an additional model to compare with, we will also use the “auto.arima” function to guess the best parameters for us, and we will compare the results.
We used the “Arima” and “auto.arima” functions to generate the models. In the case of “auto.arima”, specification of the maximum orders (P,D,Q) of the seasonal component was needed. As we were dealing with a non-seasonal time series, the maximum values of those parameters were set to zero.
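The fitting chunk is not echoed in the output; a sketch consistent with the model names used later (fit1 and afit1 appear below; fit2 is a hypothetical name for the ARIMA(0,3,0) model) could be:
fit1 <- Arima(ts_train1, order = c(1, 2, 1))    # manual ARIMA(1,2,1)
fit2 <- Arima(ts_train1, order = c(0, 3, 0))    # manual ARIMA(0,3,0)
# automatic model; maximum seasonal orders (P,D,Q) set to zero
afit1 <- auto.arima(ts_train1, max.P = 0, max.D = 0, max.Q = 0)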
Once the models were obtained, we computed the residuals.
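A sketch of this step, using the residual names that appear in the next chunk:
resfit1 <- residuals(fit1)    # manual ARIMA(1,2,1)
resfit2 <- residuals(fit2)    # manual ARIMA(0,3,0)
reasfit1 <- residuals(afit1)  # automatic ARIMA(0,2,1)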
Next, a visual inspection of the residuals was carried out in order to determine whether the models suffered from some bias.
k1 <- cbind(resfit1,resfit2,reasfit1)
colnames(k1) <- c("Manual ARIMA(1,2,1)","Manual ARIMA(0,3,0)",
"Automatic ARIMA(0,2,1)")
autoplot(k1,xlab="Year",ylab="Residuals") +
ggtitle("Residuals comparison")
As clearly shown in the figure, the ARIMA(0,3,0) model showed the worst performance in terms of residuals. In contrast, the ARIMA(1,2,1) and ARIMA(0,2,1) models performed better and had pretty similar residuals.
The error metrics of the models on train set 1 are gathered in the following table.
| Model | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 |
|---|---|---|---|---|---|---|---|
| ARIMA(1,2,1) | 0.035 | 0.152 | 0.115 | 1.217 | 3.492 | 0.916 | -0.139 |
| ARIMA(0,3,0) | -0.007 | 0.408 | 0.289 | -0.268 | 8.903 | 2.311 | -0.781 |
| ARIMA(0,2,1) | 0.037 | 0.158 | 0.121 | 1.259 | 3.706 | 0.965 | -0.323 |
As shown in the table, blindly relying on the error metrics is bad practice, as they can be misleading. As an example, let’s consider the worst model, the ARIMA(0,3,0). This model had the smallest ME (mean error) just by error cancellation: for some periods of time the model overestimates the time series while for others it underestimates it, which explains why the ME is close to zero. Meanwhile, the RMSE (root mean squared error) and the MAE (mean absolute error) were considerably larger. The effect is even clearer in the percentage errors: the ARIMA(0,3,0) had an MPE (mean percentage error) close to zero, again by error cancellation, whereas the MAPE (mean absolute percentage error) was much bigger. This stresses the importance of exploring the source of the errors further instead of relying only on the error metrics.
In order to decide between the two remaining models, the ACF and PACF functions of the residuals from those models were inspected. The resulting functions can be seen in the following figures.
ggtsdisplay(resfit1,main="Residuals of manual ARIMA(1,2,1) model")
ggtsdisplay(reasfit1,main="Residuals of automatic ARIMA(0,2,1) model")
As shown in the last figures, the ACF and PACF functions of both models showed that the residuals were not significantly correlated. Therefore, it was not possible to select one model over the other based on the residuals’ correlations, and additional tests were needed.
Another desirable property of a model is that its residuals should be normally distributed, as strong deviations from normality can point to a systematic bias and make the prediction intervals unreliable. In a first qualitative approach, the QQ plots were analyzed.
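The plotting chunk is not echoed in the output; a minimal sketch, mirroring the ggqqplot calls shown later for the third train set, could be:
ggqqplot(resfit1, title = "Residuals from manual ARIMA(1,2,1) model")
ggqqplot(reasfit1, title = "Residuals from automatic ARIMA(0,2,1) model")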
As shown in the figures, the residuals of both ARIMA models had quantiles which deviated from normality, but they looked similar. Therefore, a more quantitative approach was needed.
In order to be more formal, we conducted Shapiro-Wilk tests of normality.
# ARIMA(1,2,1) model
shapiro.test(resfit1)
##
## Shapiro-Wilk normality test
##
## data: resfit1
## W = 0.97883, p-value = 0.8482
# ARIMA(0,2,1) model
shapiro.test(reasfit1)
##
## Shapiro-Wilk normality test
##
## data: reasfit1
## W = 0.96934, p-value = 0.6063
Since the null hypothesis of Shapiro’s test is that the data are normally distributed and in both cases the p-value is above 0.05, we can assume that the residuals obtained from both models are normally distributed. Of course, the results of a statistical test run with such a small sample are arguable. However, the test was still carried out for illustrative purposes, to show how to proceed.
According to the results obtained from the test, we were not able to discard one model in favour of the other, although the ARIMA(0,2,1) seemed to have residual quantiles closer to a Gaussian distribution than those obtained with the ARIMA(1,2,1) model. We were then left to judge our models based on their forecasts.
Next, forecasts based on these models were obtained.
fore_fit1 <- forecast(ts_train1,model=fit1) # Manual ARIMA(1,2,1)
fore_afit1 <- forecast(ts_train1,model=afit1) # Automatic ARIMA(0,2,1)
The resulting forecasts are shown in the next figure.
autoplot(ts_train1,xlab ="Year",ylab="Number of murders (K)") +
autolayer(ts_test1) +
autolayer(fore_fit1,series="Manual ARIMA(1,2,1)",PI =FALSE) +
autolayer(fore_afit1,series="Automatic ARIMA(0,2,1)",PI =FALSE) +
ggtitle("Forecast obtained from ARIMA models")
As shown in the last figure, both models would have provided a reasonable forecast if the data had kept following the same upward trend. However, the real time series stopped increasing and showed a totally different pattern that neither model was able to capture. This sudden change in behaviour is the reason why the models didn’t provide a reasonable forecast.
The information criteria (AIC, AICc and BIC) are gathered in the following table:
| Model | Train set 1 AIC | Train set 1 AICc | Train set 1 BIC | Test set 1 AIC | Test set 1 AICc | Test set 1 BIC |
|---|---|---|---|---|---|---|
| ARIMA(1,2,1) | -12.85 | -11.65 | -9.32 | -16.85 | -16.67 | -15.67 |
| ARIMA(0,3,0) | 28.88 | 29.07 | 30.01 | | | |
| ARIMA(0,2,1) | -13.26 | -12.68 | -10.9 | -15.26 | -15.07 | -14.08 |
In order to explore the impact of the training set on the derived models and their forecasts, a different training set was used to generate new ARIMA models.
The new data split is the one shown in the figure below.
autoplot(ts_train2,xlab="Year",series="Train set") +
autolayer(ts_test2,series = "Test set2") +
ggtitle("Data split 2")
We again transformed the training set by taking differences of several orders, as sketched below.
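A sketch of this transformation, assuming the differenced series are stored under the hypothetical names d1_train2, d2_train2 and d3_train2:
d1_train2 <- diff(ts_train2, differences = 1)
d2_train2 <- diff(ts_train2, differences = 2)
d3_train2 <- diff(ts_train2, differences = 3)
autoplot(cbind(d1_train2, d2_train2, d3_train2), facets = TRUE) +
  ggtitle("Differences of train set 2")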
Inspection of the previous figures didn’t allow us to decide which order of differencing was best. Therefore, a more formal approach based on statistical tests was used.
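A sketch of these tests, reusing the hypothetical object names above:
station(d1_train2)
station(d2_train2)
station(d3_train2)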
## [1] "The time series is not stationary, the ADF test has not been passed"
## [1] "The time series is not stationary, the ADF test has not been passed"
## [1] "The time series is not stationary, the ADF test has not been passed"
The results of the ADF tests showed that in no case were the resulting time series stationary. Therefore, another approach was tried.
Given that the differencing approach didn’t make the time series stationary, a logarithmic transformation was tried.
logts_train2 <- log(ts_train2)
autoplot(logts_train2,main = "Log of the train2 time series",xlab = "Year")
The transformed data looked more promising; therefore, formal stationarity tests were tried.
Again, the ADF test was used to check for stationarity.
station(logts_train2)
## [1] "The time series is not stationary, the ADF test has not been passed"
According to the test, this transformation was not able to make the time series stationary either. Therefore, taking differences of the logarithm of the original time series was also tried.
d1logts_train2 <- diff(log(ts_train2),differences = 1)
autoplot(d1logts_train2,main = "First difference of the log of ts_train2",xlab = "Year")
station(d1logts_train2)
## [1] "The time series is not stationary, the ADF test has not been passed"
As shown, even by taking differences of the logarithm of the train set it was not possible to obtain a stationary time series. No additional attempts to make the time series stationary were made, and no model was trained to create forecasts based on this data split.
Finally, we tried using a larger training set, called “train3”. In this case, the time series shows an upward trend at the beginning and later stabilizes. As seen with the first training set, a sudden change in the trend can lead to unreliable forecasts on the test set. Therefore, we wanted to explore whether it was possible to train ARIMA models on a dataset which contains this change in the trend and to assess the forecasts obtained with the resulting models.
Our final dataset looked as shown in the figure:
autoplot(ts_train3,xlab="Year",series="Train set 3") +
autolayer(ts_test3,series = "Test set3") +
ggtitle("Data split 3")
Again, the train set didn’t look stationary. Nevertheless, the formal approach of using the ADF test to check the stationarity of the train set was applied again.
station(ts_train3)
## [1] "The time series is not stationary, the ADF test has not been passed"
The result obtained from the ADF test showed that the time series corresponding to the third train set was not stationary, as could be guessed by visual inspection. Then, some data transformations were explored in order to achieve a stationary time series.
We again tried to transform the time series with up to three orders of differencing in order to make it stationary, as sketched below.
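A sketch of this step, assuming the differenced series are stored under the hypothetical names d1_train3, d2_train3 and d3_train3:
d1_train3 <- diff(ts_train3, differences = 1)
d2_train3 <- diff(ts_train3, differences = 2)
d3_train3 <- diff(ts_train3, differences = 3)
autoplot(cbind(d1_train3, d2_train3, d3_train3), facets = TRUE) +
  ggtitle("Differences of train set 3")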
The visual inspection of the transformed time series didn’t allow us to clearly decide which order of differencing was best. The mean seemed to remain roughly constant, but not so the variance. Therefore, the ADF test was run again in order to have a more formal answer.
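A sketch of the tests, reusing the hypothetical names from the previous chunk:
station(d1_train3)
station(d2_train3)
station(d3_train3)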
## [1] "The time series is not stationary, the ADF test has not been passed"
## [1] "This time series is stationary, the ADF test has been passed"
## [1] "This time series is stationary, the ADF test has been passed"
The results obtained from the ADF tests allowed us to conclude that the last two time series were stationary, in clear contrast with the first one. As we always want to look for the simplest model, we tried an ARIMA model with \(d=2\).
Again, in order to estimate the best possible values for the remaining parameters of the ARIMA(p,d,q) model, we had a look at the ACF and PACF functions.
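The corresponding chunk is not echoed; a minimal sketch, assuming the twice-differenced series is stored as d2_train3, could be:
ggtsdisplay(d2_train3, main = "Twice-differenced train set 3")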
The inspection of the ACF and PACF clearly suggested using \(p=1\) and \(q=3\), as those are the numbers of spikes outside the thresholds in the PACF and ACF, respectively. Moreover, the ACF also showed additional spikes at lags 10 and 11, whereas the PACF showed another spike outside the thresholds at lag 6. However, those spikes were only marginally outside the thresholds and, what is more, no additional spikes were found at multiples of those lags, which could otherwise have suggested some kind of seasonality that the non-seasonal ARIMA models covered in this tutorial cannot handle.
Given the previous findings, we decided to train an ARIMA(1,2,3) model. In addition, as having both \(p\) and \(q\) different from zero is not desirable, we also tried lowering \(p\) and \(q\) by one unit while raising \(d\) by one unit, to end up with an ARIMA(0,3,2) model. The fact that this new ARIMA model had \(d=3\) seemed quite reasonable, as the original time series with three orders of differencing was also found to be stationary according to the ADF test.
Finally, one last ARIMA model was fitted using the auto.arima function in order to have an additional model to compare with.
As in the previous sections, the ARIMA models were generated.
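The fitting chunk is not echoed; a sketch consistent with the model names fit1_3, fit2_3 and afit1_3 used below could be:
fit1_3 <- Arima(ts_train3, order = c(1, 2, 3))    # manual ARIMA(1,2,3)
fit2_3 <- Arima(ts_train3, order = c(0, 3, 2))    # manual ARIMA(0,3,2)
# automatic model; maximum seasonal orders (P,D,Q) set to zero
afit1_3 <- auto.arima(ts_train3, max.P = 0, max.D = 0, max.Q = 0)  # gives ARIMA(0,1,0)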
Once the models were created, we computed the residuals.
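A sketch of this computation, using the residual names that appear in the next chunk:
resfit1_3 <- residuals(fit1_3)    # manual ARIMA(1,2,3)
resfit2_3 <- residuals(fit2_3)    # manual ARIMA(0,3,2)
reasfit1_3 <- residuals(afit1_3)  # automatic ARIMA(0,1,0)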
The resulting residuals are shown in the following picture.
k2 <- cbind(resfit1_3,resfit2_3,reasfit1_3)
colnames(k2) <- c("Manual ARIMA(1,2,3)","Manual ARIMA(0,3,2)","Automatic ARIMA(0,1,0)")
autoplot(k2,xlab="Year",ylab= "Residuals") +
ggtitle("Residuals comparison")
The results made it difficult to opt for one model or another, as every model had acceptable performance in some periods of time and bad performance in others. If only the magnitude of the residuals were taken into account, a bagging approach could help to obtain lower residuals during the period of time under consideration. However, that approach is beyond the scope of this tutorial and will not be considered here.
As this kind of analysis was very qualitative, additional tests were run in order to decide which model was best.
In order to decide between the models, the ACF and PACF functions of their residuals were inspected. The resulting functions can be seen in the next figures.
ggtsdisplay(resfit1_3,main="Residuals of manual ARIMA(1,2,3) model")
ggtsdisplay(resfit2_3,main="Residuals of manual ARIMA(0,3,2) model")
ggtsdisplay(reasfit1_3,main="Residuals of automatic ARIMA(0,1,0) model")
According to the results, all the models showed residuals which didn’t have significant correlations, with the exception of a couple of spikes marginally outside the thresholds at lags 7 and/or 10 in the ACF and PACF. However, no other spikes were found at lags that are multiples of 7 or 10, which would have suggested some kind of seasonality that non-seasonal ARIMA models can’t handle. Therefore, it was difficult to establish which model was the best.
In order to help make a decision, the normality of the residuals was analysed visually with the aid of normal QQ plots.
The resulting QQ plots are shown in the figures below.
ggqqplot(resfit1_3,title = "Residuals from manual ARIMA(1,2,3) model" )
ggqqplot(resfit2_3,title = "Residuals from manual ARIMA(0,3,2) model" )
ggqqplot(reasfit1_3,title = "Residuals from automatic ARIMA(0,1,0) model")
According to the plots, the ARIMA(0,1,0) seemed to have more normally distributed residuals, in clear contrast with the ARIMA(1,2,3) and ARIMA(0,3,2) models, whose residuals deviated strongly from the normal distribution at the lower end.
In order to be more formal, we conducted Shapiro-Wilk tests of normality.
# ARIMA(1,2,3) model
shapiro.test(resfit1_3)
##
## Shapiro-Wilk normality test
##
## data: resfit1_3
## W = 0.9609, p-value = 0.1243
# ARIMA(0,3,2) model
shapiro.test(resfit2_3)
##
## Shapiro-Wilk normality test
##
## data: resfit2_3
## W = 0.96993, p-value = 0.2753
# ARIMA(0,1,0) model
shapiro.test(reasfit1_3)
##
## Shapiro-Wilk normality test
##
## data: reasfit1_3
## W = 0.96163, p-value = 0.1327
According to the results, the p-values of all three models are above 0.05, so the null hypothesis of normality cannot be rejected and the residuals of all the models can be considered approximately normal. On the other hand, the dataset only had 46 observations, which is a rather small sample, so the results obtained from any statistical test are again arguable.
Next, we wanted to explore to what extent forecasts based on these models were able to predict the time series of the test3 set.
fore_fit1_3 <- forecast(ts_train3,model=fit1_3) # Manual ARIMA(1,2,3)
fore_fit2_3 <- forecast(ts_train3,model=fit2_3) # Manual ARIMA(0,3,2)
fore_afit1_3 <- forecast(ts_train3,model=afit1_3) # Automatic ARIMA(0,1,0)
The resulting forecasts are shown in the next figure.
autoplot(ts_train3,xlab ="Year",ylab="Number of murders (K)") +
autolayer(ts_test3) +
autolayer(fore_fit1_3,series="Manual ARIMA(1,2,3)",PI =FALSE) +
autolayer(fore_fit2_3,series="Manual ARIMA(0,3,2)",PI =FALSE) +
autolayer(fore_afit1_3,series = "Automatic ARIMA(0,1,0)",PI=FALSE) +
ggtitle("Forecasts obtained from ARIMA models")
As the figure above clearly shows, none of the models provided a good forecast on the test set 3.
The ARIMA(0,1,0) model obtained with the auto.arima function gave the worst possible prediction, as could be expected, since it includes neither autoregressive (AR) nor moving average (MA) terms, and its forecast showed no variability.
On the other hand, the ARIMA(1,2,3) model showed some variability, but its forecast was also quite poor.
Finally, the ARIMA(0,3,2) model provided the best forecast, although its predictions were still far from the real values.
The information criteria (AIC, AICc and BIC) are summarized in the following table:
| Model | Train set 3 AIC | Train set 3 AICc | Train set 3 BIC | Test set 3 AIC | Test set 3 AICc | Test set 3 BIC |
|---|---|---|---|---|---|---|
| ARIMA(1,2,3) | -12.46 | -10.88 | -3.54 | -20.46 | -20.36 | -18.67 |
| ARIMA(0,3,2) | -1.83 | -1.12 | 3.45 | -5.83 | -5.73 | -4.07 |
| ARIMA(0,1,0) | -18.93 | -18.84 | -17.13 | -18.93 | -18.84 | -17.13 |
The error metrics on train set 3 are collected in this table:
| Model | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 |
|---|---|---|---|---|---|---|---|
| ARIMA(1,2,3) | -0.003 | 0.178 | 0.134 | 0.092 | 3.638 | 0.929 | -0.024 |
| ARIMA(0,3,2) | -0.030 | 0.197 | 0.146 | -0.827 | 3.960 | 1.015 | -0.188 |
| ARIMA(0,1,0) | 0.031 | 0.190 | 0.141 | 0.895 | 3.918 | 0.978 | -0.101 |
In this tutorial we have shown how to use a systematic procedure to determine the best ARIMA model for a given time series. As shown in this study, the answer to this question is not always easy and depends on several factors, such as the time series under analysis, the data splitting, the residuals and the sample size. The interplay between all those factors points out the difficulties that newcomers will have to deal with when approaching the task of fitting models to forecast a time series.