library(tseries)
In this, the last blog of our series, we will look into autocorrelation. To relate it to a practical case, we will think of the problem of trying to predict the price of a certain asset or good. As usual we will use synthetic data to explain the concept, but here we will keep building on our model until we end with a dataset close to what a real one might look like.
The whole premise of our discussion is to think of price as a measure that changes in time. This is what we call a time series: an ordered set of values that change as time passes. We can also have spatial series, which we can treat the same way; a measurement that changes along a distance, for example. So if we are looking at the price of a good, we can think of how that price changes throughout a period of time. This will be given by supply and demand. But how can we predict this supply and demand? The idea here is that both follow cycles, repeating themselves up to a certain extent.
We can start with something very simple, like a sine wave: the price of the good moves up and down following the sine wave. Let's say we have been following the price of a good for several decades, starting in 1948. In R we can build a time series with a sine wave as shown below.
# two full sine cycles, shifted up by 1.5 so the "price" stays positive
series <- ts(sin(seq(0, 4*pi, 0.2)) + 1.5, start = 1948)
plot(series)
Reading this graph is simple: we can see that the price in the early 70’s is the same as in the early 2000’s, so we could predict it would again be the same in the early 2030’s. This gives us an intuition for what we want to do. We can do this because the data has certain properties. We will discover these properties and change our data a bit to see how more realistic data can be analyzed.
We have only one dataset in time and we want to predict from it, so correlation comes to mind as a technique to look at. But since this data is a time series, we do not handle it as a predictor X and a response Y. What we do instead is compare the data against itself and build a regression based on what we find. Comparing data against itself is what we call autocorrelation. In its simplest form, we calculate the correlation of the data with itself. In R we can compute the Pearson correlation, for example.
cor(series,series)
## [1] 1
Not very exciting, and an obvious result: the data is completely correlated with itself. But what is interesting is to compute the correlation with the same data series at a lag. This allows us to discover properties of the series. For example, let’s introduce a lag of one into our data and correlate it with the original. Note: because we are modeling with a sine wave, a one-unit change in time is represented here by a shift of 1/(2*pi) in the argument of the sine wave.
series2<-ts(sin(seq(0+1/(2*pi),4*pi+1/(2*pi),0.2))+1.5,start = 1948)
plot(series)
lines(series2,col="red")
Now we can calculate the correlation against the lagged data series, and we see that it has dropped from one, but is still very close.
cor(series,series2)
## [1] 0.9872941
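Instead of regenerating the sine wave with a shifted argument, we can also compute a lag-k correlation directly by aligning the series against a shifted copy of itself. Below is a minimal sketch; n and k are just local helper variables introduced here, and since one observation corresponds to a step of 0.2 in the sine’s argument, the value will differ slightly from the one above.
n <- length(series)
k <- 1
# correlate the first n-k observations with the last n-k observations
cor(series[1:(n - k)], series[(1 + k):n])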
We can do this for several other lags and plot the resulting correlations.
correlations <- vector()
for (i in 0:10) {
  # shift the sine wave's argument by i/(2*pi) and correlate with the original
  lagged <- ts(sin(seq(0 + i/(2*pi), 4*pi + i/(2*pi), 0.2)) + 1.5, start = 1948)
  correlations <- c(correlations, cor(series, lagged))
}
plot(correlations)
In R we actually have a function that does all of this for us; it’s called acf, and we can call it, for example, with 2000 lags (in practice acf caps the maximum lag at the length of the series, so we simply get every possible lag).
acf(series,2000)
But what are we looking at? We know the data correlated with itself at lag 0 is one; it’s the same data. We also know that the data correlated at a lag of half a cycle is completely negatively correlated. This is the first minimum we see in the ACF. Because our dataset covers only two cycles, there is less and less data to correlate against as we lag it, so our minimum is not a perfect -1 correlation. The ACF then decays to 0 as we run out of data.
One thing to keep in mind here is that the correlation at each lag greater than 1 also carries the effects of all the shorter lags before it. In R we can remove these intermediate effects and plot a partial ACF, or PACF. This will come in handy shortly, once we link this back to the price of our good.
pacf(series,10)
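To make the idea of “partial” concrete, we can reproduce the PACF value at lag 2 by hand: regress out the lag-1 effect from both the series and its lag-2 copy, then correlate the residuals. This is only an illustrative sketch; y0, y1, y2, r0 and r2 are helper variables introduced here.
n <- length(series)
y0 <- series[3:n]        # y_t
y1 <- series[2:(n - 1)]  # y_(t-1)
y2 <- series[1:(n - 2)]  # y_(t-2)
# residuals after removing the lag-1 effect from each side
r0 <- resid(lm(y0 ~ y1))
r2 <- resid(lm(y2 ~ y1))
cor(r0, r2)  # should match the lag-2 value shown by pacf(series, 10)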
We now understand autocorrelation, but how does this help us model the price of our good? Let’s think of the price of our good as a new time series. If it perfectly correlates with our original data, we can use the original to predict the price. But in reality, the price will have components of the original time series at different lags. This is what is called an autoregressive model, or AR:
\(y_t = \beta_1y_{t-1} + \beta_2y_{t-2} + \beta_3y_{t-3} + ... + \beta_py_{t-p} + \varepsilon_t\)
But our price data doesn’t need to follow our original data, we could also have some “error” or difference between the original data and our price. This is what’s called a Moving Average model, or MA:
\(y_t = \varepsilon_{t} + \alpha_1\varepsilon_{t-1} + \alpha_2\varepsilon_{t-2} + ... + \alpha_q\varepsilon_{t-q}\)
We can combine both which results in an ARMA model:
\(y_t = \beta_1y_{t-1} + \beta_2y_{t-2} + \beta_3y_{t-3} + ... + \beta_py_{t-p} + \varepsilon_{t} + \alpha_1\varepsilon_{t-1} + \alpha_2\varepsilon_{t-2} + ... + \alpha_q\varepsilon_{t-q}\)
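To get a feel for what such processes look like, R can simulate them directly with arima.sim. The coefficients below are arbitrary values chosen purely for illustration:
set.seed(42)
# simulate 200 observations of an ARMA(2,1) process
simulated <- arima.sim(model = list(ar = c(0.6, -0.3), ma = 0.4), n = 200)
plot(simulated)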
We have to be careful here about the properties of our price data. There are two characteristics that need to be present for an ARMA model to be applicable.
First, the data needs to be stationary, that is, to have a constant mean. We know this is the case for our data because the sine wave has a constant amplitude. In R we can use the Dickey-Fuller test to make sure this is the case, and we can run it for different lags k.
adf.test(diff(series),alternative="stationary",k=1)
## Warning in adf.test(diff(series), alternative = "stationary", k = 1): p-
## value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: diff(series)
## Dickey-Fuller = -2.6009e+13, Lag order = 1, p-value = 0.01
## alternative hypothesis: stationary
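As mentioned above, we can try different lags k; one quick way (a convenience sketch, suppressing the “p-value smaller than printed p-value” warnings) is to collect the reported p-values in a loop:
# reported p-values of the ADF test for lag orders 1 through 5
suppressWarnings(
  sapply(1:5, function(k) adf.test(diff(series), alternative = "stationary", k = k)$p.value)
)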
Another required characteristic is for the variance to be constant; this allows us to include the MA component in our ARMA model. It is also similar to what we expect in a normal linear regression.
A simple way to determine which model to use is to look at the decay of the ACF and PACF plots: if the PACF cuts off after lag p while the ACF tails off gradually, an AR(p) model is suggested; if the ACF cuts off after lag q while the PACF tails off, an MA(q) model; and if both tail off gradually, an ARMA model. The simulation below illustrates these rules.
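We can verify these rules on simulated data; the coefficient 0.7 below is an arbitrary illustration value. An AR(1) process should show a PACF with a single spike at lag 1, while an MA(1) process should show an ACF with a single spike at lag 1:
set.seed(7)
ar1 <- arima.sim(model = list(ar = 0.7), n = 500)
ma1 <- arima.sim(model = list(ma = 0.7), n = 500)
pacf(ar1)  # cuts off after lag 1 -> AR(1)
acf(ma1)   # cuts off after lag 1 -> MA(1)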
We should expect the price of our good to gradually increase over time, so let’s include an upward trend.
# same sine wave plus a linear upward trend
series <- ts(sin(seq(0, 4*pi, 0.2)) + 1.5 + 0.5*seq(0, 4*pi, 0.2), start = 1948)
plot(series)
This breaks our requirements of constant mean or stationarity, but we will see how we can handle this when we look at some real data below.
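As a quick preview of that idea, taking first differences of the trended series already brings back a series with a roughly constant mean:
# first differences remove the linear trend
plot(diff(series))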
We will use the AirPassengers dataset to explore the concepts presented.
data("AirPassengers")
plot(AirPassengers)
We immediately see an issue here. Our data has an upward trend, as we would expect from our price example above, and the variance is not constant; it keeps growing as well. We solve this by taking the log of the data to eliminate the increasing variance, and we eliminate the upward trend by taking the difference of the series.
AirPassengersStationary=diff(log(AirPassengers))
plot(AirPassengersStationary)
We can use the Dickey-Fuller test to make sure it is in fact stationary.
adf.test(AirPassengersStationary, alternative="stationary", k=0)
## Warning in adf.test(AirPassengersStationary, alternative = "stationary", :
## p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: AirPassengersStationary
## Dickey-Fuller = -9.6003, Lag order = 0, p-value = 0.01
## alternative hypothesis: stationary
Now we look at the ACF and PACF plots.
acf(AirPassengersStationary)
pacf(AirPassengersStationary)
Using these graphs we can determine the orders of the ARIMA model (here a seasonal one, since the monthly data repeats a yearly pattern) and hence build a model to predict passengers… or prices.
(fit <- arima(log(AirPassengers), c(0, 1, 1),seasonal = list(order = c(0, 1, 1), period = 12)))
##
## Call:
## arima(x = log(AirPassengers), order = c(0, 1, 1), seasonal = list(order = c(0,
## 1, 1), period = 12))
##
## Coefficients:
## ma1 sma1
## -0.4018 -0.5569
## s.e. 0.0896 0.0731
##
## sigma^2 estimated as 0.001348: log likelihood = 244.7, aic = -483.4
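Before predicting, it is good practice to check that the residuals of the fit look like white noise. One common check, not part of the original analysis, is the residual ACF together with a Ljung-Box test:
# residuals of a well-specified model should be uncorrelated
acf(resid(fit))
Box.test(resid(fit), lag = 24, type = "Ljung-Box", fitdf = 2)  # fitdf = 2 for the two MA coefficients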
pred <- predict(fit, n.ahead = 10*12)
ts.plot(AirPassengers, exp(pred$pred), log = "y", lty = c(1, 3))  # exp() undoes the log transform exactly
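Since predict also returns standard errors, we can add approximate 95% prediction bands around the forecast (the usual two-standard-error rule applied on the log scale):
upper <- exp(pred$pred + 2 * pred$se)  # approximate 95% upper band
lower <- exp(pred$pred - 2 * pred$se)  # approximate 95% lower band
ts.plot(AirPassengers, exp(pred$pred), upper, lower, log = "y", lty = c(1, 3, 2, 2))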