Part 1: ARIMA Models in R

Step 1:

In the first part of this week's seminar you will see how to employ ARIMA models in R. For that purpose, we will use the forecast package, which you have already installed in previous weeks. The only step you need to take is to load the library.

Task: Load library forecast.
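
For example:

library(forecast)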

Step 2:

For the purpose of this seminar, once again, we will use the iPhone sales data. Over the last two weeks we have discussed that the data.frame structure is not a suitable format for time series data; for that reason, we need to use the function ts() to create a time series object.

Task: Import the iPhone sales data and create the time series object. (I have named the object ipts)
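
A minimal sketch, assuming the data are in a CSV file; the file name, column name and start quarter below are placeholders, so use the values from your own file from previous weeks:

iphone <- read.csv("iphone_sales.csv")                        # placeholder file name
ipts <- ts(iphone$Sales, start = c(2007, 3), frequency = 4)   # quarterly data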

Step 3:

During the lecture we discussed that the ACF and PACF plots are a tool we can use to guide the selection of the proper ARIMA model. One command you can use is tsdisplay (or ggtsdisplay). The syntax is quite simple: within tsdisplay(), pass the time series object.

Task: Plot the time series together with its ACF and PACF plots

tsdisplay(ipts)

Step 4:

By running this command, you will receive three different plots. The one on the top is the original time series; below it are the ACF and PACF plots. From previous weeks you may remember that the iPhone sales have trend and seasonality. This is also a signal that we can read from the ACF plot: the values up to lag 8 are above the significance boundaries. That means that the current value can be predicted, to some extent, from the values up to eight periods back. Such persistence in the autocorrelation is found in series with a trend.

Step 5:

To make the series stationary, we can take the first or the second difference. Most of the time, one difference is enough. The command to take the difference is diff().

Task: Plot the differenced time series

plot(diff(ipts))

Now you see a completely different time series. You managed to remove the trend, so it has a constant mean, although we can still observe some seasonality. If you now take the ACF and PACF plots, you will see a different picture. You can also check the second difference by nesting one diff() call inside another (diff(diff(ipts))).
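
For example:

tsdisplay(diff(ipts))
tsdisplay(diff(diff(ipts)))   # second difference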

Step 6:

The command to fit an ARIMA model is the function arima(). The syntax requires as input a time series object and the argument order, which takes a vector of values. The vector c(0,1,0) refers to an ARIMA(0,1,0) model: the first part is the autoregressive component, the second the integrated (differencing) component, and the third the moving average component. So ARIMA(0,1,0) is a model without autoregressive and moving average components, but with a first difference. We can fit the model and then plot the time series and the forecast.

Task: Run an ARIMA (0,1,0) and plot the forecast

arm <- arima(ipts,order=c(0,1,0))
plot(forecast(arm))

Step 7:

There are some model diagnostics we can use to understand whether the model is good. Usually we do this by examining the residuals. A function you can use from the forecast package is checkresiduals(), with the name of the model as input. This produces a time plot of the residuals, their distribution, and their autocorrelation plot, as well as a statistical test, the Ljung-Box test.

Task: Check model residuals

checkresiduals(arm)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,0)
## Q* = 56.38, df = 8, p-value = 2.379e-09
## 
## Model df: 0.   Total lags used: 8

By checking the residuals we want to see whether they have a normal distribution, whether there is autocorrelation between the residuals, and whether the residuals have a zero mean. If all these properties are satisfied, then we can consider the residuals to be white noise. However, in this case we can see values in the autocorrelation plot showing that there is autocorrelation between the residuals. We can also observe that the distribution is not normal. Most importantly, the Ljung-Box test rejects the null hypothesis. The Ljung-Box null hypothesis is that the data are independently distributed (i.e. the correlations in the population from which the sample is taken are 0, so that any observed correlations in the data result from the randomness of the sampling process). In this case, we reject this hypothesis, so there are still parts of the time series that we can model further to make a better forecast.

Step 8:

Let’s try another model. This time we will fit an ARIMA(2,1,0) model. That means there are two autoregressive lags and we use the first difference, but this is still a model without a moving average component.

Task: Run an ARIMA (2,1,0) model and check the residuals

arm <- arima(ipts,order=c(2,1,0))
plot(forecast(arm))
checkresiduals(arm)

So again, the command is the same, but we have a different input vector. Once again, we can plot the forecast. You can see now that the forecast is different from before: it is not a constant value, because we have the autoregressive components. But we still reject the null hypothesis, so the residuals are not white noise. You can also see this from the residuals' autocorrelation plot, as there are values exceeding the significance boundaries.

Step 9:

We discussed in the lecture how to use the ACF and PACF plots to get an understanding of the components. The good thing about these packages is that their authors have developed algorithms that try a set of candidate models and select the best one on the basis of information criteria. One such command in the forecast package is auto.arima(). Again, the syntax is quite simple: the only argument you need is the name of the time series object.

Task: Run auto.arima function

afit <- auto.arima(ipts)

Let’s plot the forecast for different horizons. We can ask for the forecast for the next four periods or the next eight periods. Because the observations are quarterly, 4 periods is one year ahead, 8 periods are two years ahead, and 12 periods are three years ahead. Let’s visualise the forecast and see how it differs from the previous forecasts.

Task: Plot arima forecast for 1,2,3 years ahead

# Plot 1yr ARIMA forecast
plot(forecast(afit,h=4))
# Plot 2yr ARIMA forecast
plot(forecast(afit,h=8))
# Plot 3yr ARIMA forecast
plot(forecast(afit,h=12))

Step 10:

Take the summary of the model to see what auto.arima() has produced:
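
summary(afit)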

## Series: ipts 
## ARIMA(1,0,0)(0,1,0)[4] with drift 
## 
## Coefficients:
##          ar1   drift
##       0.5537  1.3971
## s.e.  0.1441  0.5312
## 
## sigma^2 = 34.32:  log likelihood = -107.5
## AIC=221.01   AICc=221.81   BIC=225.59
## 
## Training set error measures:
##                      ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 0.09726586 5.375886 3.498656 -17.92891 31.49914 0.4840066
##                    ACF1
## Training set 0.02669348

You can see that this model has two sets of parameters. There is a more complex family of ARIMA models, called SARIMA models, which have not only the ARIMA components but also seasonal components. The second vector refers to the seasonal components: the first value in the second set of parentheses is the seasonal autoregressive component, the second is the seasonal differencing component, and the third is the seasonal moving average component. So the suggested model here is a combination of ARIMA and seasonal components: a model with an autoregressive component of one lag and a seasonal difference.
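
If you want to specify such a seasonal model by hand, the Arima() function from the forecast package accepts a seasonal argument. A sketch mirroring the model selected above:

fit <- Arima(ipts, order = c(1,0,0), seasonal = c(0,1,0), include.drift = TRUE)
plot(forecast(fit))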

Step 11:

Now check the residuals of the new model, for example:
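
checkresiduals(afit)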

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,0)(0,1,0)[4] with drift
## Q* = 10.218, df = 7, p-value = 0.1766
## 
## Model df: 1.   Total lags used: 8

You can see that the residuals are much better, and the plots show that the residuals are indeed white noise. This is the only model where the p-value is above the significance level; thus, we cannot reject the null hypothesis.

One thing to know is that even this function comes with some limitations: its authors search only over a set of models and do not consider the whole space of possible combinations. So perhaps there is another model, not tested, which could produce a better outcome. Although there may be cases where the selected model is not the best possible solution, it is still a good one.

Part 2: Rolling Forecasting Origin

The Function tsCV

In this part, we will need the function tsCV() from the forecast package. You can use the help function for a detailed description of the function and the arguments it requires as inputs. The first argument should be the time series and the second should be a forecast function. Then, you will need to define the forecast horizon and, finally, the rolling window.

You need to pay attention to the forecast function argument, because some of the functions we have seen in previous weeks are indeed forecast functions, for example naive() or snaive(). If, for example, you want to produce a seasonal naive forecast for 5 steps ahead with a rolling window of 10 observations, this can be produced with the code below:

e<-tsCV(lynx, snaive, h=5, window=10)
# lynx is a yearly time series available in base R (datasets package)
e[1:20,]
##         h=1   h=2   h=3   h=4   h=5
##  [1,]    NA    NA    NA    NA    NA
##  [2,]    NA    NA    NA    NA    NA
##  [3,]    NA    NA    NA    NA    NA
##  [4,]    NA    NA    NA    NA    NA
##  [5,]    NA    NA    NA    NA    NA
##  [6,]    NA    NA    NA    NA    NA
##  [7,]    NA    NA    NA    NA    NA
##  [8,]    NA    NA    NA    NA    NA
##  [9,]    NA    NA    NA    NA    NA
## [10,] -2054 -2479 -2393 -2298 -2168
## [11,]  -425  -339  -244  -114  1762
## [12,]    86   181   311  2187  2587
## [13,]    95   225  2101  2501  3225
## [14,]   130  2006  2406  3130  1545
## [15,]  1876  2276  3000  1415     0
## [16,]   400  1124  -461 -1876 -2134
## [17,]   724  -861 -2276 -2534 -2640
## [18,] -1585 -3000 -3258 -3364 -3341
## [19,] -1415 -1673 -1779 -1756 -1611
## [20,]  -258  -364  -341  -196   137

The function returns the forecasting errors directly. You can see the errors for the first 20 rows. Because we set the forecasting horizon to h=5, the function produces five different forecasting errors (the 1-, 2-, 3-, 4- and 5-step-ahead forecasts). No errors are produced for the first nine rows, because the rolling window of ten observations means the first forecast origin is observation 10.

However, Arima() is not a forecast function; it estimates the parameters of the model. Therefore, after the model is estimated, the forecast function needs to be used. For that reason, we need to create a function.1

Let’s say that we want to produce forecasts from an AR(2) model with a rolling window of 10 observations for 5 periods ahead; then we can use the following code:

far2 <- function(x, h){forecast(Arima(x, order=c(2,0,0)), h=h)}
e <- tsCV(lynx, far2, h=5, window=10)
e[1:20,]
##               h=1        h=2        h=3        h=4        h=5
##  [1,]          NA         NA         NA         NA         NA
##  [2,]          NA         NA         NA         NA         NA
##  [3,]          NA         NA         NA         NA         NA
##  [4,]          NA         NA         NA         NA         NA
##  [5,]          NA         NA         NA         NA         NA
##  [6,]          NA         NA         NA         NA         NA
##  [7,]          NA         NA         NA         NA         NA
##  [8,]          NA         NA         NA         NA         NA
##  [9,]          NA         NA         NA         NA         NA
## [10,]    12.58503   606.8211   509.0393  -428.9293 -1576.1382
## [11,]   462.73101   132.8941 -1026.0108 -2255.0391 -1222.9907
## [12,]  -938.14731 -2492.4514 -3611.8989 -1963.3442  -823.2878
## [13,] -1139.85752 -2529.9008 -1630.0982 -1280.7775   151.8201
## [14,]  -714.53275   123.6903  -223.1877   275.4427 -1062.6853
## [15,]  1201.59693   803.8014   939.3644  -874.6478 -2189.0756
## [16,] -1050.17812  -635.7308 -1364.4904 -1351.5218  -433.9695
## [17,]   613.17754  -819.9069 -1934.9596 -1886.5608 -1791.1836
## [18,] -1762.12587 -2886.2038 -2602.0926 -2161.4825 -1773.0504
## [19,]  -280.73511  -355.9237  -925.2598 -1442.5176 -1548.1148
## [20,]   123.80763  -354.2750  -911.6625 -1153.7439  -900.9728

You can also check this link on how to perform the forecast without re-estimating the model.

Part 3 (Without Help): Identify the best forecasting model for three new datasets in R

Before starting this part, make sure that you have covered all the previous seminars and have an understanding of exponential smoothing, ARIMA and the forecasting accuracy functions. Now, you will be working with three new datasets of varying format, duration and number of observations:

| Name | Data format | No. of observations | Duration | Source |
|---|---|---|---|---|
| Property sales in England and Wales | Monthly | 263 | Jan 1995 – Nov 2016 | Doogal |
| Annual global carbon emissions from coal, oil and gas (in million tonnes of carbon per year) | Yearly | 56 | 1959 – 2014 | CDIAC |
| Number of licenced (registered/sold) Toyota Prius cars in Great Britain | Quarterly | 29 | 2008 Q3 – 2015 Q3 | GOV.UK |

Step 1:

Download the series from Seminar Data on Minerva and load them into the global environment. Create the three time series. The first time series, named psts, will be the Total_Sales from the property sales. The second time series, named ets, will be the emissions from the second dataset. Finally, a time series pts will be for the Num from the third dataset.
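
One possible sketch: the file names below are placeholders for the files you download from Minerva, the emissions column name is an assumption, and the start dates and frequencies follow the table above.

prop <- read.csv("property_sales.csv")    # placeholder file name
psts <- ts(prop$Total_Sales, start = c(1995, 1), frequency = 12)

emis <- read.csv("emissions.csv")         # placeholder file name
ets <- ts(emis$emissions, start = 1959, frequency = 1)

prius <- read.csv("prius.csv")            # placeholder file name
pts <- ts(prius$Num, start = c(2008, 3), frequency = 4)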

Step 2:

Create the time series training and test datasets. We will be using the majority of the datasets as training data and the last 4 observations of each dataset as testing data to evaluate the accuracy of the forecast models you will create.
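
One way to split, using window() and the dates from the table above (the object names are just suggestions):

# Property sales: monthly, ends Nov 2016; keep Aug-Nov 2016 for testing
psts_train <- window(psts, end = c(2016, 7))
psts_test  <- window(psts, start = c(2016, 8))

# Emissions: yearly, ends 2014; keep 2011-2014 for testing
ets_train <- window(ets, end = 2010)
ets_test  <- window(ets, start = 2011)

# Prius: quarterly, ends 2015 Q3; keep 2014 Q4 - 2015 Q3 for testing
pts_train <- window(pts, end = c(2014, 3))
pts_test  <- window(pts, start = c(2014, 4))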

Step 3:

Decompose psts (property sales) and pts (Prius cars) to help you better understand the datasets. The ets (emissions) series will not decompose, as it is yearly data (frequency = 1) which has no seasonality.
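
For example:

plot(decompose(psts))
plot(decompose(pts))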

Step 4:

Use the simple forecasting techniques to forecast the psts, pts and ets datasets four periods ahead.
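
For example, using the simple methods from previous weeks (seasonal naive for the two seasonal series; naive and drift for the yearly one):

plot(snaive(psts_train, h = 4))
plot(snaive(pts_train, h = 4))
plot(naive(ets_train, h = 4))
plot(rwf(ets_train, drift = TRUE, h = 4))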

Step 5:

Use exponential smoothing to forecast the psts, pts and ets datasets.
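
A sketch using the ets() function (note that the function ets() and our time series object ets are different things; R resolves ets(...) to the function):

psts_es <- ets(psts_train)
plot(forecast(psts_es, h = 4))
ets_es <- ets(ets_train)
plot(forecast(ets_es, h = 4))
pts_es <- ets(pts_train)
plot(forecast(pts_es, h = 4))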

Step 6:

Use ARIMA to forecast the psts, pts and ets datasets. This will involve you selecting what you think is the best ARIMA model (you can also use auto.arima()).
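
For example, letting auto.arima() select the model for each series:

psts_arima <- auto.arima(psts_train)
plot(forecast(psts_arima, h = 4))
ets_arima <- auto.arima(ets_train)
plot(forecast(ets_arima, h = 4))
pts_arima <- auto.arima(pts_train)
plot(forecast(pts_arima, h = 4))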

Step 7:

Evaluate your forecasts using the forecasting accuracy measures and the testing data. Has exponential smoothing, ARIMA or one of the simple forecasting techniques performed the best at forecasting each of the three datasets?
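
A sketch for the property sales series, assuming the objects created in the previous steps (repeat for ets and pts):

accuracy(forecast(psts_arima, h = 4), psts_test)
accuracy(forecast(psts_es, h = 4), psts_test)
accuracy(snaive(psts_train, h = 4), psts_test)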


  1. You can see how to create a function in the optional mini course at the beginning of the lectures.↩︎