Part 1: ARIMA Models in R

Step 1:

In the first part of this week seminar you will see how to employ ARIMA models in R. For that purpose, once again, we will use the forecast package. You have already installed the package from previous weeks. The only step you need to do is to load the library.

Task: Load library forecast.

Step 2:

For the purpose of this seminar, once again, we will use the iPhone sales data. We have discussed the last two weeks that the DataFrame structure is not the format that we can use for time series data and for that reason, we need to use the function ts to create the time series object.

Task: Import the iphone sales data and create the time series object. (I have named the object ipts)

Step 3:

During the lecture we have discussed that a tool that we can use to guide us on the selection of the proper ARIMA model is the ACF and PACF plots. One command you can use is the tsdisplay. The syntax is quite simple: tsdisplay (or ggtsdisplay) and within the tsdisplay (), the time series object.

Task: Plot the timeseries autocorrelation plot

tsdisplay(ipts)

Step 4:

By running this command, you will receive three different plots. The one on the top is the original time series. Then we have the ACF and PACF plots. From previous weeks you may remember that the iPhone sales have trend and seasonality. This is also a signal that we can receive from the ACF plot. You can see the values until lag 8 are above the significance boundaries. That means that the current value can be predicted to some extent from the previous values until eight periods back. Such persistence in the autocorrelation is found on series with trend.

Step 5:

To make the series stationary, we can take the first or the second difference. Most of the times one difference is enough. The command to take the difference is the command diff.

Task: Plot the differenced timeseries plot

plot(diff(ipts))

Now you see a completely different time series. You managed to remove the trend, so it has a constant mean. Although again, we can observe some seasonality. If you take now the ACF and PACF plots you will see a different picture. You can also check the second difference, by using again the diff function but within another diff function (diff(diff(ipts))).

Step 6:

The command to use the ARIMA model is the function arima(). The syntax requires as input a time series object and the argument order. This argument takes a vector of values. The vector 0,1,0 refers to an ARIMA model (0,1,0). The first part is the autoregressive component, the second the integrated component and the third part is the moving average. So the ARIMA 0, 1, 0 is a model without autoregressive and moving average components, but with first difference. So we can model the time series and then we can plot the time series and the forecast.

Task: Run an ARIMA (0,1,0) and plot the forecast

arm <- arima(ipts,order=c(0,1,0))
plot(forecast(arm))

Step 7:

There are some model diagnostics in order to understand if the model is good. Usually we do this by examining the residuals. A function that you can use in the forecast package is the function checkresiduals and as an input the name of the model. This produces graphs about the residuals and also produces the distribution of the residuals and the autocorrelation plot of the residuals, as well as a statistical test, the Ljiung-Box test.

Task: Check model residuals

checkresiduals(arm)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,0)
## Q* = 56.38, df = 8, p-value = 2.379e-09
## 
## Model df: 0.   Total lags used: 8

By checking the residuals we want to see if they have a normal distribution, also if there is autocorrelation between the residuals as well as the residuals to have a zero mean. If all these properties are satisfied, then we can consider that the residuals are white noise. However, in that case, we can see that in the autocorrelation plot, we have values that show that there is a autocorrelation between the residuals. We can also observe that the distribution is not normal. Mainly we have this Ljung- box test which rejects the null hypothesis. Ljung- box test null hypothesis is that data are independently distributed (i.e. the correlations in the population from which the sample is taken are 0, so that any observed correlations in the data result from randomness of the sampling process). So in that case, we reject this hypothesis. So there are still parts in the time series that we can model further to make a better forecast.

Step 8:

Let’s try with another model. This time we will see an ARIMA model (2,1,0). That means that there are two auto-correlation lags and we use the first difference but still this is a model without moving average.

Task: Run an ARIMA (2,1,0) model and check the residuals

arm <- arima(ipts,order=c(2,1,0))
plot(forecast(arm)) checkresiduals(arm)

So again, the command is the same. What changes is the input in the vector. Once again, we can plot the forecast. So you can see now that the forecast is different than previously. It is not now a constant value because we have this auto-regressive parts, autoregressive components. But still if we plot the residuals, we can still see that we still reject the null hypothesis. So still the residuals are not white noise. And you can also see these from the autocorrelation plot for the residuals. We still have values that they extend the significance boundaries.

Step 9:

Although we have discussed the lecture how to use the ACF and PACF plots to have an understanding of the components. Still the good thing with these packages is that they’ have developed algorithms that use a combination of models and select the best model on the basis of Information Criteria. One such commands in the forecast package is the auto.arima. Again, the syntax is quite simple. The only argument that you nee to use, is thename of the time series object.

Task: Run auto.arima function

afit <- auto.arima(ipts)

Let’s plot the forecast for different periods. So we can ask fot the forecast for the next four periods or for the next eight periods. Bcause the reference poins is about quarters then 4 periods is one year ahead, 8 periods is two years ahead, 12 periods is three years ahead. We can plot this forecasts first of all, to have a visual of the forecast. And you can see how this differ from the previous forecasts.

Task: Plot arima forecast for 1,2,3 years aheds

plot(forecast(afit,h=4))
# Plot 2yr ARIMA forecast plot(forecast(afit,h=8))
# Plot 3yr ARIMA forecast plot(forecast(afit,h=12))

Step 10:

Basically you can also type the model to have an understanding of what exactly is the model that has been suggested and produced from the auto ARIMA.

## Series: ipts 
## ARIMA(1,0,0)(0,1,0)[4] with drift 
## 
## Coefficients:
##          ar1   drift
##       0.5537  1.3971
## s.e.  0.1441  0.5312
## 
## sigma^2 = 34.32:  log likelihood = -107.5
## AIC=221.01   AICc=221.81   BIC=225.59
## 
## Training set error measures:
##                      ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 0.09726586 5.375886 3.498656 -17.92891 31.49914 0.4840066
##                    ACF1
## Training set 0.02669348

You can see that this model has two set of parameters. You can see two parenthesis. The reason for that is that there is a more complex family of ARIMA models which are called SARIMA models. These models do not only have the ARIMA components, but they also have some seasonal components. So the second vector that you can see w refer to the seasonal components. So the first part, the first 0 in the second parentheses, is seasonal autoregressive components. The second is seasonal differenced components, and the third is seasonal moving average components. So this suggested model here is a combination of the ARIMA and seasonal components. The model suggests that the best model is a model with an autoregressive component of one lag, but also with the seasonal difference.

Step 11:

Check now the residuals of the new model.

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,0)(0,1,0)[4] with drift
## Q* = 10.218, df = 7, p-value = 0.1766
## 
## Model df: 1.   Total lags used: 8

You can see that the residuals are much better and the plots show that the residuals indeed are white noise, as it is the only model where the p-value is over the significance level. So we cannot reject the null hypothesis.

One thing to know is that even this function comes with some limitation as the authors of the function, use only a set of models and do not consider the whole space of possible combination sto find the best model. So perhaps there is another model which is not being tested and which can produce a better outcome. Although there may be case where the qualified model is not the best solution but still it is a good solution.

Part 2: Identify the best forecasting model for three new datasets in R

Before starting this parts make sure that you have covered all the previous seminars and have an understanading of using the Exponential Smoothing, ARIMA and Forecasting Accuracy Function.Now, you will be working with three new datasets of varying format, duration and numbers of observations:

Name Data format No. of Observations Duration Source
Property sales in England and Wales Monthly 263 Jan 1995 – Nov 2016 Doogal
Annual global carbon emissions from coal, oil and gas (in million tonnes of carbon per year) Yearly 56 1959 - 2014 CDIAC
Number of licenced (registered/sold) Toyota Prius cars in Great Britain Quarterly 29 2008 Q3 – 2015 Q3 GOV.UK

Step 1:

Donwload the series from Seminar Data on Minerva and upload the series in the global evnironment. Create the three time series. The first time series named psts will be the Total_Sales from the property sales. The second time series named ets will be the emissions from the second dataset. Finally, a time series pts will be for the Num from the third dataset.

Step 2:

Create the time series training and test datasets. We will be using the majority of the datasets as training data and the last 4 observations of each dataset as testing data to evaluate the accuracy of the forecast models you will create.

Step 3:

Decompose psts (property sales) and pts (prius cars) to help you better understand the data sets. The ets (emissions) dataset will not decompose as it is yearly data (frequency=1) which has no seasonality.

Step 4:

Use the simple forecasting techniques to forecast four periods ahead the psts, pts and ets datasets.

Step 5:

Use exponential smoothing to forecast the psts, pts and ets datasets.

Step 6:

Use ARIMA to forecast the psts, pts and ets datasets. This will involve you selecting what you think is the best ARIMA model (you can also use auto.arima()).

Step 7:

Evaluate your forecasts using the forecasting accuracy measures and the testing data. Has exponential smoothing, ARIMA or one of the simple forecasting techniques performed the best at forecasting each of the three datasets?