10/29/2018

Outline

Last lecture:

Wold’s decomposition; stationary AR(p) and MA(q) models.

This lecture:

General modeling, ARMA modeling: principles and procedures

Basic ideas in model building

  • The objective: obtain adequate but parsimonious models.

  • Parsimony: the smallest possible number of parameters for adequate representation.

  • For example: consider an MA(q) model \[Y_{t} = \varepsilon_t +\theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q}.\] If the coefficients follow a suitable pattern (e.g. decay geometrically), then even for very large \(q\) the process can be adequately represented by an AR(1) model (recall Wold’s decomposition) \[(1- \phi \mathbb{L}) Y_{t}= \varepsilon_{t}.\]

  • The latter model contains only one parameter, \(\phi\). Fitting an MA(\(q\)) with very large \(q\) (in the limit, the MA(\(\infty\)) of the Wold representation) requires estimating many parameters and leads to unnecessarily imprecise estimates, as illustrated below.
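As a quick illustration (a sketch using base R's ARMAtoMA; the value \(\phi = 0.8\) is arbitrary), the MA(\(\infty\)) weights implied by a single AR(1) coefficient decay only geometrically, so a finite MA(q) approximation would need a large q:

# psi-weights of the MA(infinity) representation of an AR(1) with phi = 0.8
ARMAtoMA(ar = 0.8, lag.max = 10)
# One AR parameter generates all of these weights; fitting them as free MA
# coefficients would waste many parameters.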

Commonly used procedure

  • The commonly used selection procedure is iterative: a process of evolution, adaptation, or trial and error.

ARMA(p,q) Modeling

  1. Assume that the data have a weakly stationary structure (the class of processes considered in this course).

  2. Choose a reasonable pair \((p,q)\).

  3. Estimate the parameters of the ARMA(p,q) model.

  4. Test the estimated model.

  5. Forecast (and control).
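For reference, a hedged sketch of how steps 2–4 can be automated in R with forecast::auto.arima, applied to the oil series loaded in the next section (forcing d = 0 mirrors the stationarity assumption in step 1; the lecture carries out the steps manually below):

library(forecast)
fit = auto.arima(oil, d = 0, seasonal = FALSE, ic = "bic")  # steps 2-3: search (p,q) by BIC and estimate
summary(fit)
checkresiduals(fit)   # step 4: residual plots and a Ljung-Box test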

Data

library(forecast); library(Quandl); library(xts)
# Monthly WTI crude oil prices (FRED series DCOILWTICO) loaded via Quandl as an xts object
oil = Quandl("FRED/DCOILWTICO", type = "xts", collapse = "monthly",  
                      start_date="2006-01-01", end_date="2018-10-19")
head(oil, 10)
##           [,1]
## Jan 2006 67.86
## Feb 2006 61.37
## Mar 2006 66.25
## Apr 2006 71.80
## May 2006 71.42
## Jun 2006 73.94
## Jul 2006 74.56
## Aug 2006 70.38
## Sep 2006 62.90
## Oct 2006 58.72

Data

plot(oil)

Law of Parsimony (Occam’s razor)

  • Akaike information criterion (AIC) and Bayesian information criterion (BIC): \[AIC(p,q)= \ln(\hat{\sigma}^{2})+\frac{2(p+q)}{T}\] \[BIC(p,q)= \ln(\hat{\sigma}^{2})+\frac{\ln(T)(p+q)}{T}\] where \(\hat{\sigma}^{2}\) is an estimate of \(\sigma^{2}\). The term \(\ln(\hat{\sigma}^{2})\) measures overall fit, and the term \((p+q)/T\) penalizes large models.

AIC and BIC

  • Set upper bounds \(P,Q\) for \(p,q\) and choose the pair that minimizes the criterion over the grid: \[\min_{p\leq P,q\leq Q}AIC(p,q)= \min_{p\leq P,q\leq Q}\left\{\ln(\hat{\sigma}^{2})+\frac{2(p+q)}{T}\right\}\] \[\min_{p\leq P,q\leq Q}BIC(p,q)= \min_{p\leq P,q\leq Q}\left\{\ln(\hat{\sigma}^{2})+\frac{\ln(T)(p+q)}{T}\right\}\]
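One way to implement this search directly is sketched below (the bounds P = Q = 3 are arbitrary; R's AIC()/BIC() are computed from the Gaussian log-likelihood rather than the \(\ln\hat{\sigma}^{2}\) form above, but rank models in the same spirit):

# Grid search over p <= P, q <= Q; some (p,q) fits may fail, hence try()
P = 3; Q = 3
grid = expand.grid(p = 0:P, q = 0:Q)
grid$aic = grid$bic = NA
for (i in seq_len(nrow(grid))) {
  fit = try(arima(oil, order = c(grid$p[i], 0, grid$q[i]), method = "ML"), silent = TRUE)
  if (!inherits(fit, "try-error")) {
    grid$aic[i] = AIC(fit)
    grid$bic[i] = AIC(fit, k = log(length(oil)))   # BIC: penalty k = ln(T)
  }
}
grid[which.min(grid$bic), ]   # the (p,q) pair minimizing BIC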

Step 2: Model Selection

# Least-squares fits of an AR(1) (with intercept) and an AR(2) (without intercept) via lm()
ar.1 = lm(oil ~ lag(oil))
ar.2 = lm(oil ~ 0 + lag(oil) + lag(oil, 2))
AIC(ar.1)
## [1] 1024.31
AIC(ar.2)
## [1] 1010.092
BIC(ar.1)
## [1] 1033.401
BIC(ar.2)
## [1] 1019.164

Step 3: Estimation

ar.2.ML = arima(oil,order=c(2,0,0), method="ML",include.mean=FALSE)
coef(ar.2)
##    lag(oil) lag(oil, 2) 
##   1.2614023  -0.2655493
coef(ar.2.ML)
##        ar1        ar2 
##  1.2596994 -0.2649732

Step 3: Estimation

library(lmtest)
coeftest(ar.2.ML)
## 
## z test of coefficients:
## 
##      Estimate Std. Error z value  Pr(>|z|)    
## ar1  1.259699   0.077582 16.2370 < 2.2e-16 ***
## ar2 -0.264973   0.077785 -3.4065 0.0006581 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step 3: Estimation

arma11 = arima(oil,order=c(1,0,1), method="ML",include.mean=FALSE)
coef(arma11)
##       ar1       ma1 
## 0.9941501 0.2099341
coeftest(arma11)
## 
## z test of coefficients:
## 
##      Estimate Std. Error z value  Pr(>|z|)    
## ar1 0.9941501  0.0058697  169.37 < 2.2e-16 ***
## ma1 0.2099341  0.0686059    3.06  0.002213 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Maximum Likelihood Estimate (MLE)

  • For i.i.d. data, each observation has density function \(p(y_{t};\beta)\).

  • The joint density is \[p(\mathbf{y};\beta)=p(y_{1},\dots,y_{T};\beta)=\prod_{t=1}^{T}p(y_{t};\beta)\]

  • The likelihood function is a function of \(\beta\) given the observed data \(\mathbf{y}\): \[L(\beta|\mathbf{y})=L(\beta|y_{1},\dots,y_{T})=\prod_{t=1}^{T}p(y_{t};\beta)\]

Maximum Likelihood Estimate (MLE)

  • Maximize the log-likelihood function over \(\beta\): \[\hat{\beta}= \arg\max_{\beta} \log L(\beta|\mathbf{y})=\arg\max_{\beta}\sum_{t=1}^{T}\ln p(y_{t};\beta)\]

MLE for AR(1)

  • AR(1): \[Y_{t}=c+\phi Y_{t-1}+\varepsilon_{t},\qquad \varepsilon_{t}\sim \mathrm{i.i.d.}\;\mathcal{N}(0,\sigma^{2}),\] with parameters \(\beta=(c,\phi,\sigma^{2})\) and \(|\phi|<1\).

  • Conditional on past information \(\mathcal{I}_{t-1} = y_{t-1}\) \[y_{t}|\mathcal{I}_{t-1}\sim\mathcal{N}(c+\phi y_{t-1},\sigma^{2})\] \(t=2,\dots,T\).

MLE for AR(1)

  • The conditional log-likelihood is \[\sum_{t=2}^{T}\ln p(y_{t}|y_{t-1};\beta)= -\frac{T-1}{2}\ln(2\pi)-\frac{T-1}{2}\ln(\sigma^{2}) -\frac{1}{2\sigma^{2}}\sum_{t=2}^{T}(y_{t}-c-\phi y_{t-1})^{2}.\]
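A minimal sketch of maximizing this conditional log-likelihood numerically (the function name negll and the starting values are illustrative, not part of the lecture code):

# Negative conditional log-likelihood of an AR(1); beta = (c, phi, log(sigma^2))
negll = function(beta, y) {
  c0 = beta[1]; phi = beta[2]; s2 = exp(beta[3])
  e  = y[-1] - c0 - phi * y[-length(y)]    # y_t - c - phi*y_{t-1}, t = 2, ..., T
  0.5 * ((length(y) - 1) * (log(2 * pi) + log(s2)) + sum(e^2) / s2)
}
y   = as.numeric(oil)
est = optim(c(0, 0.9, log(var(y))), negll, y = y)
c(c = est$par[1], phi = est$par[2], sigma2 = exp(est$par[3]))
# Compare with arima(oil, order = c(1, 0, 0), method = "ML")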

MLE for ARMA(p,q)

  • MLE is “easy” to compute if we condition on prior observations \(Y_1, \dots, Y_{t-1}\) and prior errors \(\hat{\varepsilon}_1, \dots, \hat{\varepsilon}_{t-1}\), where \(\hat{\varepsilon}_1 = Y_1\), \(\hat{\varepsilon}_2 = Y_2 - \phi_1 Y_1 - \theta_1 \hat{\varepsilon}_1\), … and in general \[\hat{\varepsilon}_t = Y_t - \sum_{j=1}^{p} \phi_j Y_{t-j} - \sum_{j=1}^{q} \theta_j \hat{\varepsilon}_{t-j},\] with terms whose index is nonpositive set to zero.

  • The likelihood can then be written in terms of \(\hat{\varepsilon}_1, \dots, \hat{\varepsilon}_{t-1}\), which are treated as i.i.d. Gaussian white noise. Estimating \(\theta\) and \(\phi\) then reduces to maximizing the conditional log-likelihood \[\max_{\theta, \phi} \sum_{t=2}^{T}\ln p(\hat{\varepsilon}_t|y_1,\dots, y_{t-1}, \hat{\varepsilon}_1, \dots, \hat{\varepsilon}_{t-1} ;\theta, \phi).\]
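For concreteness, here is a sketch of this residual recursion for an ARMA(1,1) (the parameter values are illustrative, close to the arma11 estimates above; arima() performs this kind of evaluation, with exact filtering, internally):

# Conditional residuals of an ARMA(1,1): eps_hat[1] = y[1], then recurse
arma11.resid = function(y, phi, theta) {
  e = numeric(length(y)); e[1] = y[1]
  for (t in 2:length(y)) e[t] = y[t] - phi * y[t - 1] - theta * e[t - 1]
  e
}
e.hat = arma11.resid(as.numeric(oil), phi = 0.99, theta = 0.21)
# The conditional log-likelihood is the sum of Gaussian log-densities of e.hat[2:T]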

Step 4: Test

Test for autocorrelation in the residuals of a fitted ARMA model.

  • Test statistic (Ljung-Box test): \[Q=T(T+2)\sum_{j=1}^{k}\frac{\hat{\rho}_j^{2}}{T-j}\;\rightarrow\;\chi^{2}(k),\] where \(k\) is the number of autocorrelation lags included and \(\hat{\rho}_j\) is the lag-\(j\) sample autocorrelation of the residuals (a hand computation follows this list).

  • Test hypothesis \[H_{0}:\rho_{1}=\cdots=\rho_{k}=0.\]

  • The hypothesis: the residuals from the ARMA model have no autocorrelation.
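For reference, a sketch of computing the Ljung-Box statistic by hand from the residual autocorrelations (k = 3 lags and fitdf = 2 match the Box.test call on the next slide):

# Hand computation of the Ljung-Box statistic for the ARMA(1,1) residuals
e   = residuals(arma11); Tn = length(e); k = 3
rho = acf(e, lag.max = k, plot = FALSE)$acf[-1]   # rho_hat_1, ..., rho_hat_k
Q   = Tn * (Tn + 2) * sum(rho^2 / (Tn - (1:k)))
1 - pchisq(Q, df = k - 2)                         # p-value with df = k - p - q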

Step 4: Test

Box.test(arma11$resid, lag=3, type="Ljung-Box", fitdf=2)
## 
##  Box-Ljung test
## 
## data:  arma11$resid
## X-squared = 4.5907, df = 1, p-value = 0.03215
Box.test(ar.2.ML$resid, lag=3, type="Ljung-Box", fitdf=2)
## 
##  Box-Ljung test
## 
## data:  ar.2.ML$resid
## X-squared = 2.2142, df = 1, p-value = 0.1367

Step 4: Test

  • lag: the number of residual autocorrelation lags included in the test statistic.

  • fitdf: the number of estimated ARMA parameters, fitdf \(= p+q\); the test requires lag \(>\) fitdf.

  • The degrees of freedom: lag \(-p-q\)

  • For the AR(2) model we cannot reject the hypothesis that the residuals are uncorrelated (p = 0.1367), so it fits the data adequately. For the ARMA(1,1) model the test rejects at the 5% level (p = 0.0322), though not at the 1% level.

Step 5: Forecast

forecast(ar.2.ML,h=3)
##        Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## Nov 13       71.11137 62.68790 79.53483 58.22879 83.99394
## Dec 13       70.54327 56.99526 84.09128 49.82337 91.26317
## Jan 14       70.02071 52.48416 87.55726 43.20087 96.84055
forecast(arma11,h=3)
##        Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## Nov 13       71.08221 62.58951 79.57491 58.09374 84.07068
## Dec 13       70.66639 57.37370 83.95908 50.33697 90.99580
## Jan 14       70.25300 53.51844 86.98755 44.65970 95.84629

Step 5: Forecast

plot(forecast(arma11,h=10))

About Forecast

  • Suppose \(\{Y_{t}\}\) is a weakly stationary process with Wold representation \[Y_{t}=\mu+\sum_{j=0}^{\infty}\psi_{j}\varepsilon_{t-j},\quad\varepsilon_{t}\sim WN(0,\sigma^{2}).\]

  • \(\mathcal{I}_{t}=\{y_{t},y_{t-1},\dots, y_{0} \}\): the information set up to time \(t\).

  • By Wold representation, we know \[\mathbb{E}[Y_{t}]= \mu, \,\, \mbox{Var}(Y_{t})= \sigma^{2}\sum_{j=0}^{\infty}\psi_{j}^{2}\]

  • Goal: predict the future value \(Y_{t+h}\) based on the available information \(\mathcal{I}_{t}\), i.e., on the observed noises.

Forecast Error

  • The expression for \(Y_{t+h}\) is \[Y_{t+h}=\mu+ \varepsilon_{t+h}+\psi_{1}\varepsilon_{t+h-1}+\cdots+\psi_{h-1}\varepsilon_{t+1} +\psi_{h}\varepsilon_{t}+\psi_{h+1}\varepsilon_{t-1}+\cdots\]

  • \(Y_{t+h|t}\) is the forecast of \(Y_{t+h}\) based on the information \(\mathcal{I}_{t}\): \[Y_{t+h|t} =\mathbb{E} \left. \left[ Y_{t+h} \right| \mathcal{I}_{t} \right] = \mu+ \mathbb{E} \left. \left[ \varepsilon_{t+h}+\psi_{1}\varepsilon_{t+h-1}+\cdots +\psi_{h-1}\varepsilon_{t+1} \right| \mathcal{I}_{t} \right] +\psi_{h}\varepsilon_{t}+\psi_{h+1}\varepsilon_{t-1}+\cdots\]

Forecast Error

  • Because future noises are unpredictable, \[\mathbb{E} \left. \left[ \varepsilon_{t+h}+\psi_{1}\varepsilon_{t+h-1}+\cdots +\psi_{h-1}\varepsilon_{t+1} \right| \mathcal{I}_{t} \right]=0.\]

  • The forecast error is \[\varepsilon_{t+h|t}=Y_{t+h}-Y_{t+h|t}\] and the prediction mean square error (MSE) is \[MSE(\varepsilon_{t+h|t})=\mathbb{E}[\varepsilon_{t+h|t}^{2}]=\mathbb{E}\left[(Y_{t+h}-Y_{t+h|t})^{2}\right]\]

Linear Predictor

  • Conditional expectation: \(Y_{t+h|t}=\mathbb{E}\left[Y_{t+h}|\mathcal{I}_{t}\right].\) Alternatively, consider a linear predictor \[\hat{Y}_{t+h|t}= k_{1}Y_{t}+\ldots+k_{t}Y_{1}\]

  • Coefficients \({\bf k}=(k_{1},\ldots,k_{t})^{\prime}\) are selected to minimize the prediction MSE.

  • Because \(\{\varepsilon_{t}\}\) is independent white noise, \(\mathbb{E}[\varepsilon_{t+j}|\mathcal{I}_{t}]=0\) for \(j\geq1\), and \(\mathbb{E}[Y_{t+h}|\mathcal{I}_{t}]\) is a linear function of the past noises: \[ Y_{t+h|t}=\mu+\psi_{h}\varepsilon_{t}+\psi_{h+1}\varepsilon_{t-1}+\dots\]

  • The forecast error is \[\varepsilon_{t+h|t}= Y_{t+h}-Y_{t+h|t} = \varepsilon_{t+h}+\psi_{1}\varepsilon_{t+h-1}+\cdots+\psi_{h-1}\varepsilon_{t+1}\]

Remarks

  • Variance of \(\varepsilon_{t+h|t}\) is \[\sigma^{2}(1+\psi_{1}^{2}+\cdots+\psi_{h-1}^{2})\]

  • If \(\{\varepsilon_{t}\}\) are Gaussian white noises then \[Y_{t+h}|\mathcal{I}_{t}\sim\mathcal{N}(Y_{t+h|t},\sigma^{2}(1+\psi_{1}^{2}+\cdots+\psi_{h-1}^{2})).\] A \(95\%\) prediction interval for the \(h\)-step forecast has the form \[Y_{t+h|t} \pm 1.96\sqrt{\sigma^{2}(1+\psi_{1}^{2}+\cdots+\psi_{h-1}^{2})}\]
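A sketch of this calculation in R for the fitted arma11 model, using ARMAtoMA for the \(\psi\)-weights (h = 3 matches the forecasts above):

# h-step forecast standard errors of the fitted ARMA(1,1) from its psi-weights
h    = 3
psi  = ARMAtoMA(ar = coef(arma11)["ar1"], ma = coef(arma11)["ma1"], lag.max = h - 1)
se.h = sqrt(arma11$sigma2 * cumsum(c(1, psi^2)))  # sd of the 1-, ..., h-step errors
se.h
predict(arma11, n.ahead = h)$se                   # compare with predict()'s standard errors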

Summary

  • Discussion of a general ARMA modeling procedure

  • Implementation of model selection, estimation, testing, and forecasting under the stationarity assumption.