#install.packages(c("quantmod", "tseries", "timeSeries", "forecast", "xts"))
library(quantmod)
## Warning: package 'quantmod' was built under R version 3.4.4
## Loading required package: xts
## Warning: package 'xts' was built under R version 3.4.4
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.4.4
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: TTR
## Version 0.4-0 included new data defaults. See ?getSymbols.
library(tseries)
## Warning: package 'tseries' was built under R version 3.4.4
library(timeSeries)
## Warning: package 'timeSeries' was built under R version 3.4.4
## Loading required package: timeDate
## Warning: package 'timeDate' was built under R version 3.4.4
##
## Attaching package: 'timeSeries'
## The following object is masked from 'package:zoo':
##
## time<-
library(forecast)
## Warning: package 'forecast' was built under R version 3.4.4
library(xts)
ARIMA models are also called Box-Jenkins models (This is chapters 9-12) ARIMA is autoregressive integrated moving average.
We’ll start with just the AR part: autoregressive models which is just regressing a time series on itself. This means you should let y_t depend on y_(t-1) and depending on the number of time periods, maybe up to y_(t-2),….,y_(t-p).
This is denoted as AR(p) where the order p is the number of time periods you are basing the model off of previous. It can ve written as \[ y_t = {\phi_1} {y_t-1} + {\phi_2} * y_(t-2) + ... + {\phi_p} * y_(t-p) + {\epsilon_t} \] , where y_t is stationary, the error term is iid normally distributed with mean 0 and vairance sigma_2. \(\phi_i\) are constants where \(\phi_p \neq 0\).
the expected value of \(\y_t\) = 0.
example: AR(1) create 2 models: \(\phi_1\) =.9, \(\phi_1\) = -.9 looks like
\(y_t = .9 y_[t_-1] + \epsilon_t\) which implies positive correlation between \(\y_t\) and y_t-1
\(y_t = -.9 y_[t-1] + \epsilon_t\) which implies negative correlation between \(\y_t\) and y_t-1, this also then implies positive correlation with y_t and y_t-2.
We can create a simulation with 200 values with a time series and components phi_1 =.9 and also a phi_1 = -.9. p = 1 which is the first in order=c(1,0,0).
par(mfrow=c(2,1))
# phi = .9
plot(arima.sim(list(order=c(1,0,0),ar=.9) , n=200), ylab = "y", main = (expression(AR(1) ~~~ phi == +.9)))
abline(0,0)
# phi = -.9
plot(arima.sim(list(order=c(1,0,0),ar= -.9) , n=200), ylab = "y", main = (expression(AR(1) ~~~ phi == -.9)))
abline(0,0)
to fit the model, use auto.arima when creating the random data.
set.seed(2115)
fakedata <- arima.sim(list(order=c(1,0,0),ar=.9) , n=200)
plot(fakedata)
mod1 <- auto.arima(fakedata)
mod1
## Series: fakedata
## ARIMA(1,0,0) with zero mean
##
## Coefficients:
## ar1
## 0.9266
## s.e. 0.0285
##
## sigma^2 estimated as 1.004: log likelihood=-284.66
## AIC=573.32 AICc=573.38 BIC=579.92
The coefficient should be close to .9 but not exact.
Now we can forcast. First we can start just with only 1 day into the future (time period 201)
mycast1 <- forecast(mod1, h=1) # includes prediction intervals
mycast1
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 201 -6.557273 -7.841365 -5.273181 -8.521123 -4.593423
We can look at the data a little more by graphing it:
plot(mycast1)
Now to change how many periods, just change h = new value.
plot(forecast(mod1, h=10)) # 10 days into the future
The bright blue line indicates point predictions and the prediction intervals are corresponding to the 80% and 95% confidence intervals. These are wide because predictions are hard and there’s a lot of uncertainty.
Now we can go onto the MA part of ARIMA.
y_t is the linear combination of \(\epsilon_[t-1]\), … , \(\epsilon_[t-q]\).
MA(q) where q is the number of past time periods of epsilon we will depend on. This is written as
\(y_t = \mu + \epsilon_t + \theta_1 * \epsilon_(t-1) + \theta_2* \epsilon_(t-2) + ... + \theta_q * \epsilon_t-q\)
where we assume theta_q neq 0 and the errors are iid with mean 0 and variance sigma^2
\(y_t = {\mu} + {\epsilon_t} + {\theta_1} * {\epsilon_t-1}\)
\(E(y_t) = {\mu} + {\theta_1} E(\epsilon_(t-1)) + E(\epsilon_t) = \mu +0 + \theta_1 * 0 = \mu\)
E(y_t given $_t-1) = + 1 E((t-1)) + E(e_t) = +0 + 1 * (t-1) $
This is why it’s called MOVING average, because the average is depending on \(\epsilon_(t-1)\)
Wecan show this on R. Let’s again use the coefficient = +/- .9 and using a moving average of order 1, let \(\theta_(t-1)\) = +/- .9
par(mfrow = c(2,1))
plot(arima.sim(list(order=c(0,0,1), ma=.9), n = 200) , ylab = "y", main = (expression(MA(1) ~~~ theta == +.9)))
plot(arima.sim(list(order=c(0,0,1), ma=-.9), n = 200) , ylab = "y", main = (expression(MA(1) ~~~ theta == -.9)))
creating the model with theta = .9
set.seed(2.718)
ma.data <- arima.sim(list(order=c(0,0,1), ma=.9), n = 200)
mod2 <- auto.arima(ma.data)
mod2
## Series: ma.data
## ARIMA(1,0,2) with zero mean
##
## Coefficients:
## ar1 ma1 ma2
## -0.8354 1.723 0.7496
## s.e. 0.2317 0.237 0.2105
##
## sigma^2 estimated as 1.143: log likelihood=-296.35
## AIC=600.71 AICc=600.91 BIC=613.9
Here we can see the coefficents of the random sample data.
phi = -.8354
theta = 1.723
forecast 1 day into the future
plot(forecast(mod2, h=1)) # 1 day into the future
forecast(mod2, h=1)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 201 0.3159447 -1.054301 1.686191 -1.779666 2.411555
we can forecast 10 days into the future…
plot(forecast(mod2, h=10))
now we’re going to combine these 2 concepts from above. p and q are same as above.
\(y_t = \phi_1 * y_(t-1) + \phi_2 * y_(t-2) + ... + \phi_p * y_(t-p) + \epsilon_t + \mu + \theta_1 * \epsilon_t-1 + \theta_2 * \epsilon_(t-2) + ... + \theta_q *\epsilon_(t-q)\)
again assume phi_p and theta_q neq 0 and that errors are normally distributed with mean = 0 and variance = sigma^2
example: ARMA(1,1) = \(\phi_1 y_(t-1) + \epsilon_t + \theta_1 \epsilon_t-1 + \mu\)
example: ARMA(1,2) = \(\phi_1 y_(t-1) + \epsilon_t + \theta_1 \epsilon_(t-1) + \theta_2 * \epsilon_(t-2) + \mu\)
all that’s left is about integrate portion of ARIMA.
3 numbers that specify how it is used : p, q, and the number of differences we’re going to take which is d.
It’s important to know why do we use the differences? We want data with 0 mean and constant variance over time. If we don’t see this in our data, we will take differences in consecutive measurements.
Example: try d = 1 which will take first order differences Z_1 = (y_2 - y_1), Z_2 = (y_3 - y_2), Z_3 = (y_4 -y_3), etc
if this does not help our , then we try :
d=2 which will take second order differences (Z_2 - Z_1), then (Z_3 - Z-2), etc
and we can keep going….
test for 0 mean and constant varaince using adf.test(), or use auto.arima (which R will check this for you)
plot(WWWusage)
mod4 <- auto.arima(WWWusage)
mod4
## Series: WWWusage
## ARIMA(1,1,1)
##
## Coefficients:
## ar1 ma1
## 0.6504 0.5256
## s.e. 0.0842 0.0896
##
## sigma^2 estimated as 9.995: log likelihood=-254.15
## AIC=514.3 AICc=514.55 BIC=522.08
mycast <- forecast(mod4,h=20)
plot(mycast)
Though we don’t see constant mean and variance in our data, if we take difference of 1 then we can get those assumptions fulfilled. Thus we can forcast using auto.arima() and forcast 20 periods into the future. R uses p=1, d=1, and q=1.
We can also check out other data of stock prices.
getSymbols('TECHM.NS', from='2012-01-01', to='2015-01-01')
## 'getSymbols' currently uses auto.assign=TRUE by default, but will
## use auto.assign=FALSE in 0.5-0. You will still be able to use
## 'loadSymbols' to automatically load data. getOption("getSymbols.env")
## and getOption("getSymbols.auto.assign") will still be checked for
## alternate defaults.
##
## This message is shown once per session and may be disabled by setting
## options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details.
##
## WARNING: There have been significant changes to Yahoo Finance data.
## Please see the Warning section of '?getSymbols.yahoo' for details.
##
## This message is shown once per session and may be disabled by setting
## options("getSymbols.yahoo.warning"=FALSE).
## Warning: TECHM.NS contains missing values. Some functions will not work if
## objects contain missing values in the middle of the series. Consider using
## na.omit(), na.approx(), na.fill(), etc to remove or replace them.
## [1] "TECHM.NS"
stock_prices <- TECHM.NS[,4]
then create the model ARIMA using the auto.arima command.
mod5 <- auto.arima(stock_prices)
mycast5 <- forecast(mod5, h=10)
plot(mycast5)
We can also try transforming the stock prices still like we would for a normal linear regression model.
mod6 <- auto.arima(log(stock_prices))
mycast6 <- forecast(mod6, h=10)
plot(mycast6)
We plot them to look at the difference.
par(mfrow=c(2,1))
plot(mycast5)
plot(mycast6)
R automatically creates p,d, and q when using this command. The log requires p=2, whereas the normal stock requires p=0.
Now we will try with jj data.
data("AirPassengers")
library(astsa); data(globtemp)
##
## Attaching package: 'astsa'
## The following object is masked from 'package:forecast':
##
## gas
data(jj)
jj
## Qtr1 Qtr2 Qtr3 Qtr4
## 1960 0.710000 0.630000 0.850000 0.440000
## 1961 0.610000 0.690000 0.920000 0.550000
## 1962 0.720000 0.770000 0.920000 0.600000
## 1963 0.830000 0.800000 1.000000 0.770000
## 1964 0.920000 1.000000 1.240000 1.000000
## 1965 1.160000 1.300000 1.450000 1.250000
## 1966 1.260000 1.380000 1.860000 1.560000
## 1967 1.530000 1.590000 1.830000 1.860000
## 1968 1.530000 2.070000 2.340000 2.250000
## 1969 2.160000 2.430000 2.700000 2.250000
## 1970 2.790000 3.420000 3.690000 3.600000
## 1971 3.600000 4.320000 4.320000 4.050000
## 1972 4.860000 5.040000 5.040000 4.410000
## 1973 5.580000 5.850000 6.570000 5.310000
## 1974 6.030000 6.390000 6.930000 5.850000
## 1975 6.930000 7.740000 7.830000 6.120000
## 1976 7.740000 8.910000 8.280000 6.840000
## 1977 9.540000 10.260000 9.540000 8.729999
## 1978 11.880000 12.060000 12.150000 8.910000
## 1979 14.040000 12.960000 14.850000 9.990000
## 1980 16.200000 14.670000 16.020000 11.610000
data(jj)
jjmod <- auto.arima(jj)
mycastjj <- forecast(jjmod, h=100)
plot(mycastjj)
This looks like there might be some seasonal/monthy differences we should consider when plotting another model.