Introduction

This is the first of a series of tutorials based on the book Predictive Modeling Applications in Actuarial Science by Edward W. Frees. The series will cover the following topics:

Bolded topics will be covered and unbolded topics may be covered if time permits.

Feedback, comments, and suggestions are always welcome!

Time Series Introduction

A time series is a stochastic process indexed by \(t\), such as:

\[\vec{y_t}=(y_1,y_2,...,y_n)^T\]

where stochastic means the elements of the set are random variables, and the index orders them in time. Mainly, we are concerned with comparing the set \(y_t\) with \(y_{t-s}\), where \(s\) is a lag and \(s\in\{\pm1,\pm2,...\}\), so that

\[\vec{y_{t-s}}=(y_{1-s},y_{2-s},...,y_{n-s})^T.\]

We would like the processes \(y_t\) to be stationary which means it must satisfy the following properties:

  1. Constant mean with respect to \(t\). That is, \(\mu=\mu_t, \forall t\).
  2. Constant variance with respect to \(t\), defined similarly to the above.
  3. Constant covariance between \(y_t\) and \(y_{t-s}\) that depends only on the lag \(s\). That is, \(Cov(y_t,y_{t-s})=\rho_s\).

Oftentimes we will standardize the series to mean 0 and variance 1 for easier operations. For example, in vector form we get (up to division by the series length) \[\rho_s=\vec{y_t}^T\vec{y_{t-s}}.\]
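As a quick check of this (my own illustration, not from the book), the lag-\(s\) sample autocorrelation of a standardized series is just an inner product divided by the series length, and it agrees with what acf() computes. The series below is simulated, so the numbers are illustrative only.

set.seed(1)
y <- as.numeric(arima.sim(list(ar = 0.7), n = 200))  # simulated AR(1) series
z <- as.numeric(scale(y))                            # standardize: mean 0, variance 1
s <- 2
sum(z[(s + 1):200] * z[1:(200 - s)]) / (200 - 1)     # inner product of y_t and y_{t-s}
acf(y, plot = FALSE)$acf[s + 1]                      # the same quantity from acf()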

Case Study: Johnson & Johnson

What we want from \(y_t\) and \(y_{t-s}\) is to see whether the series has some time-dependent structure. That is, whether \(y_t\) is related to \(y_{t-s}\) for some \(s\), where related means correlated.

The dataset we will use is available in R as the JohnsonJohnson dataset. This dataset contains the quarterly earnings of J&J from 1960 to 1980.

library(tseries)
data(JohnsonJohnson)
(earnings=JohnsonJohnson)
##       Qtr1  Qtr2  Qtr3  Qtr4
## 1960  0.71  0.63  0.85  0.44
## 1961  0.61  0.69  0.92  0.55
## 1962  0.72  0.77  0.92  0.60
## 1963  0.83  0.80  1.00  0.77
## 1964  0.92  1.00  1.24  1.00
## 1965  1.16  1.30  1.45  1.25
## 1966  1.26  1.38  1.86  1.56
## 1967  1.53  1.59  1.83  1.86
## 1968  1.53  2.07  2.34  2.25
## 1969  2.16  2.43  2.70  2.25
## 1970  2.79  3.42  3.69  3.60
## 1971  3.60  4.32  4.32  4.05
## 1972  4.86  5.04  5.04  4.41
## 1973  5.58  5.85  6.57  5.31
## 1974  6.03  6.39  6.93  5.85
## 1975  6.93  7.74  7.83  6.12
## 1976  7.74  8.91  8.28  6.84
## 1977  9.54 10.26  9.54  8.73
## 1978 11.88 12.06 12.15  8.91
## 1979 14.04 12.96 14.85  9.99
## 1980 16.20 14.67 16.02 11.61
plot.ts(earnings)

Notice the increasing variance and the upward trend in the time series. This is not stationary since it violates (1) and (2) above and possibly (3). To stabilize the mean, we can difference the series, which is defined as follows: \[\Delta{y_t}=y_t-y_{t-1}.\]

We can use it on the earnings series as follows

plot(diff(earnings))

and now the mean has stabilized. To stabilize the variance, we can log-transform the series as follows

plot(log(earnings+1))

so that the variance is visually constant. To reiterate, stabilizing means removing the dependence on \(t\). A check for violations of (3) is not mentioned in the book.

The book glosses over differencing, but to gain some intuition about it, consider a function \(f(x_t)=y_t\), where \(f\) maps \(X\) to \(Y\) for \(x_t\) in some interval.

Then we can think of the time series \(y_t\) as the image of some function. This implies that the first difference is the set of slopes at each \(x_t\). Likewise, the second difference is the set of slopes of the slopes.

To make this a bit more concrete, let's consider several different cases:

Case 1: The true function is linear: \(f(x)=x\).

par(mfrow=c(1,3),mar=c(1,1,1,1))
x=1:100
y=x

# true function
plot(y,type='l')

# first difference
plot(diff(y),type='l')

# second difference
plot(diff(diff(y)),type='l')

Case 2: The true function is quadratic: \(f(x)=x^2\).

par(mfrow=c(1,3),mar=c(1,1,1,1))
x=1:100
y=x^2

# true function
plot(y,type='l')

# first difference
plot(diff(y),type='l')

# second difference
plot(diff(diff(y)),type='l')

Case 3: The function is piecewise quadratic then constant: \(f(x)=\left\{ \begin{array}{lr} x^2 &: x<50 \\ 10 &: x\ge 50 \end{array} \right.\)

par(mfrow=c(1,3),mar=c(1,1,1,1))
x=1:100
y=ifelse(x<50,x^2,10)

# true function
plot(y,type='l')

# first difference
plot(diff(y),type='l')

# second difference
plot(diff(diff(y)),type='l')

Case 4: The function is cubic: \(f(x)=x^3\).

par(mfrow=c(1,3),mar=c(1,1,1,1))
x=1:100
y=x^3

# true function
plot(y,type='l')

# first difference
plot(diff(y),type='l')

# second difference
plot(diff(diff(y)),type='l')

Case 5: The function is a random walk: \(y_t=y_{t-1}+\epsilon_t\).

\[y_t=\epsilon_t+\epsilon_{t-1}+\epsilon_{t-2}+...\] \[\iff y_t-y_{t-1}=\epsilon_t\]

So the current value of \(y_t\) is the previous value plus an error term. It is an AR(1) with \(a=1\), and its MA(\(\infty\)) expansion has all weights equal to one, so the variance goes to infinity, which is not sensible. This is shown below: \[Var(y_t)=Var(\epsilon_t+\epsilon_{t-1}+\epsilon_{t-2}+...)\] \[=\sigma^2+\sigma^2+\sigma^2+...\] \[=\infty\]

par(mfrow=c(1,3),mar=c(1,1,1,1))
x=rnorm(100)
y=cumsum(x)

# true function
plot(y,type='l')

# first difference
plot(diff(y),type='l')

# second difference
plot(diff(diff(y)),type='l')

Case 6: The function is an AR(1) function: \(y_t=ay_{t-1}+\epsilon_t\).

The \(a\) parameter controls the weight on the previous term, so a higher \(a\) implies a stronger time-dependent structure.

par(mfrow=c(1,3),mar=c(1,1,1,1))
x=rnorm(100)
y=x
a=0.75

for(i in 2:100) y[i]=a*y[i-1]+x[i]

# true function
plot(y,type='l')

# first difference
plot(diff(y),type='l')

# second difference
plot(diff(diff(y)),type='l')

Case 7: The function is an MA(1) function: \(y_t=b\epsilon_{t-1}+\epsilon_t\).

The \(b\) parameter controls the weight on the previous noise term, so a higher \(b\) implies a stronger time-dependent structure.

par(mfrow=c(1,3),mar=c(1,1,1,1))
x=rnorm(100)
y=x
b=0.75

for(i in 2:100) y[i]=b*x[i-1]+x[i]

# true function
plot(y,type='l')

# first difference
plot(diff(y),type='l')

# second difference
plot(diff(diff(y)),type='l')

Case 8: The function is an AR(1) function with a drift \(\mu\): \(y_t=ay_{t-1}+\epsilon_t+\mu\).

par(mfrow=c(1,3),mar=c(1,1,1,1))
x=rnorm(100)
a=0.75
u=2
y=x+u

for(i in 2:100) y[i]=a*y[i-1]+x[i]+u

# true function
plot(y,type='l')

# first difference
plot(diff(y),type='l')

# second difference
plot(diff(diff(y)),type='l')

In general, the log transformation is applied first and then the differencing. This is because differencing first can lead to negative numbers, for which no log exists. Additionally, there is a nice interpretation of log-differencing; that is,

\[\ln (y_t)-\ln (y_{t-1})=\ln (\frac{y_t}{y_{t-1}})=\ln (1+\frac{y_t-y_{t-1}}{y_{t-1}})\approx\frac{y_t-y_{t-1}}{y_{t-1}}\]

so that the log-difference is approximately equal to the percentage change when the changes are small. Below is a table showing the degree of similarity for the function \(f(x)=x^2\) with \(x\in[0,5]\).

| \(f(x)=x^2\) | diff(log(f(x))) | \(\frac{y_t-y_{t-1}}{y_{t-1}}\) | Difference (pp) |
|---|---|---|---|
| 0.36 | 2.36 | 1.21 | 1.15 |
| 1.21 | 1.11 | 0.75 | 0.37 |
| 12.96 | 0.30 | 0.26 | 0.04 |
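As a quick check in R (my own illustration on the earnings series rather than \(x^2\), so these are not the table's numbers), the log-difference and the percentage change track each other closely when the changes are small:

e        <- as.numeric(earnings)
log_diff <- diff(log(e))                 # ln(y_t) - ln(y_{t-1})
pct      <- diff(e) / e[-length(e)]      # (y_t - y_{t-1}) / y_{t-1}
head(round(cbind(log_diff, pct), 3))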

Additionally, we can compare that visually as follows.

# the correct way
plot(diff(log(earnings+1)))

# the wrong way
plot(log(diff(earnings)+1))
## Warning in log(diff(earnings) + 1): NaNs produced

We earlier defined \(\rho_s\) as the covariance between the two series \(y_t\) and \(y_{t-s}\); for a standardized series, this covariance equals the correlation. We can plot it for all \(s\) using the autocorrelation function (ACF) plot as follows:
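A minimal way to produce this plot, assuming the differenced log series from above, is:

acf(diff(log(earnings + 1)))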

The blue bands in the figure above represent the \(95\%\) confidence intervals. Notice the large spikes at lags 4, 8, and 12, which imply that there are seasonal correlations.

Additionally, if we exponentiate the approximation above, we get the following: \[\begin{align*} \frac{y_t}{y_{t-1}}&\approx\exp\left(\frac{y_t-y_{t-1}}{y_{t-1}}\right) \\ \implies y_t&\approx y_{t-1}\exp\left(\frac{y_t-y_{t-1}}{y_{t-1}}\right) \\ &=y_{t-1}\exp(\mu+\epsilon_t) \end{align*} \]

where \(\frac{y_t-y_{t-1}}{y_{t-1}} \sim N(\mu,\sigma^2)\), so that we have exponentially growing errors.

Time Series Models

Knowing the different relationships between \(y_t\) and \(y_{t-s}\), how can we build models? We will start by going through the following models:

  • Autoregressive (AR) models
  • Moving Average (MA) models
  • Autoregressive Integrated Moving Average (ARIMA) models

AR Models

The AR(1) model is as follows: \[y_t=a_1y_{t-1}+\epsilon_t.\]

Some intuition about the above equation is that the current value \(y_t\) is a weighted version of the previous value \(y_{t-1}\) plus an error term.

Similarly, for AR(2), we get the following: \[y_t=a_1y_{t-1}+a_2y_{t-2}+\epsilon_t\]

so that the present value is a weighted combination of the previous two values plus noise. In general, we have that AR(p) is \[y_t=a_1y_{t-1}+a_2y_{t-2}+...+a_py_{t-p}+\epsilon_t.\]

Notice that for the AR(1) model, we can recursively substitute the \(y_{t-s}\) terms so that we get an infinite series. For example,

\[\begin{align*} y_t&=ay_{t-1}+\epsilon_t \\ &=a(ay_{t-2}+\epsilon_{t-1})+\epsilon_t \\ &=a(a(ay_{t-3}+\epsilon_{t-2})+\epsilon_{t-1})+\epsilon_t \\ &=... \end{align*} \]

and if we keep expanding the terms above we get

\[ \begin{align*} &=a^3y_{t-3}+a^2\epsilon_{t-2}+a\epsilon_{t-1}+\epsilon_t \\ &=a^4y_{t-4}+a^3\epsilon_{t-3}+a^2\epsilon_{t-2}+a\epsilon_{t-1}+\epsilon_t \\ &=\sum_{i=0}^{\infty}a^i\epsilon_{t-i} \end{align*} \]

which is an infinite series. Additionally, we can get its variance as follows:

\[ \begin{align*} Var(\sum_{i=0}^{\infty}a^i\epsilon_{t-i})&=\sum_{i=0}^{\infty}a^{2i}Var(\epsilon_{t-i}) \\ &=\sum_{i=0}^{\infty}a^{2i}\sigma^2 \\ &=\sigma^2\sum_{i=0}^{\infty}a^{2i} \end{align*} \]

where \(Var(\epsilon_t)=\sigma^2\) and \(Cov(\epsilon_t,\epsilon_{t-s})=0\).

Notice that if \(|a|\geq 1\) the series diverges and the variance is infinite, which is not useful.

PROPOSITION: If \(|a|\geq 1\), the series above diverges.

PROOF: To prove this for all \(a\geq 1\), we only need to prove it for \(a=1\), since the terms are at least as large otherwise. We note that for a fixed \(N\), the following holds: \(\sum_{i=0}^{N}1^{2i}=N+1\).

Then we can use induction as follows:

  • When \(N=0\), the sum equals \(1\).
  • Assume that for \(N=n\) the sum equals \(n+1\).
  • For \(N=n+1\), the sum equals \(1^{2(n+1)}+\sum_{i=0}^{n}1^{2i}=1+(n+1)=n+2\).

So the partial sums grow without bound, which implies that our statement holds. For values of \(|a|<1\) we can use the standard argument found in any analysis book:

PROPOSITION: For \(|a|<1\), the series \(\sum_{i=0}^{\infty}a^{2i}=1+a^2+a^4+...\) converges.

PROOF: Let \[ \begin{align*} s&=1+a^2+a^4+... \\ \implies a^2s&=a^2+a^4+a^6+... \\ \implies s-a^2s&=1 \\ \implies s(1-a^2)&=1 \\ \implies s&=\frac{1}{1-a^2} \blacksquare \end{align*} \]
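A quick numeric sanity check of this formula (my own, not from the book): the variance of a simulated AR(1) series should be close to \(\sigma^2/(1-a^2)\).

set.seed(1)
a <- 0.75
y <- arima.sim(list(ar = a), n = 1e5, sd = 1)  # epsilon_t ~ N(0, 1)
var(y)                                         # empirical variance
1 / (1 - a^2)                                  # sigma^2 / (1 - a^2)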

As a result, we only get sensible results when \(|a|<1\). Additionally, notice that for AR(p) models the ACF is never zero, since \(y_t\) and \(y_{t-s}\) have common noise terms, as stated in the book. That is, for the AR(1) case,

\[ \begin{align*} Cov(y_t,y_{t-s})&=Cov(\epsilon_t+a_1\epsilon_{t-1}+...+a_1^s\epsilon_{t-s}+a_1^{s+1}\epsilon_{t-s-1}+...,\ \epsilon_{t-s}+a_1\epsilon_{t-s-1}+...) \\ &=Cov(a_1^s\epsilon_{t-s}+a_1^{s+1}\epsilon_{t-s-1}+...,\ \epsilon_{t-s}+a_1\epsilon_{t-s-1}+...) \\ &=a_1^{s}Var(\epsilon_{t-s})+a_1^{s+1}\cdot a_1 Var(\epsilon_{t-s-1})+... \\ &=a_1^{s}\sigma^2+a_1^{s+2}\sigma^2+...\neq 0 \end{align*} \]

so that the above holds for all \(s\) which implies that the ACF can never be zero.

MA Models

The next type of model is called the moving average model where the MA(1) is defined as follows: \[y_t=\epsilon_t+b\epsilon_{t-1}\]

so that the current value is the current noise plus a weighted version of the previous noise term. Additionally, we can rewrite the above equation recursively, as we did for the AR(1) model, as follows:

Let \[\epsilon_t=y_t-b\epsilon_{t-1}\]

Then \[ \begin{align*} y_t&=\epsilon_t+b\epsilon_{t-1} \\ &=\epsilon_t+b(y_{t-1}-b\epsilon_{t-2}) \\ &=\epsilon_t+b(y_{t-1}-b(y_{t-2}-b\epsilon_{t-3}))\\ &=\epsilon_t+b(y_{t-1}-b(y_{t-2}-b(y_{t-3}-...))) \\ &=\epsilon_t+by_{t-1}-b^2y_{t-2}+b^3(y_{t-3}-...) \\ &=\epsilon_t+\sum_{i=1}^{\infty}(-1)^{i+1}b^iy_{t-i} \end{align*} \]

Then we can compare the MA(1) model using the above notation to the AR(1) model as follows:

MA(1): \[y_t=\epsilon_t+by_{t-1}-b^2y_{t-2}+b^3(y_{t-3}-...)\]

AR(1): \[y_t=\epsilon_t+ay_{t-1}\]

so that we can rewrite the MA(1) model as an AR(\(\infty\)) model. That is, the MA(1) model is a weighted regression on all of the previous values, whereas the AR(1) model involves the previous value only.

This also holds for MA(2) models. To see this, consider only the first lag term of the MA(2), which gives an AR(\(\infty\)). The second lag term is also an AR(\(\infty\)), so their sum is again an AR(\(\infty\)).

In general, any MA(q) can be written as an AR(\(\infty\)). Additionally, note that MA(1) is always stationary since:

  • The mean is \(E[y_t]=E[\epsilon_t+b\epsilon_{t-1}]=0+0\).
  • The variance is \(Var(y_t)=Var[\epsilon_t+b\epsilon_{t-1}]=\sigma^2+b^2\sigma^2\).
  • The covariance is \(\rho_s=\left\{ \begin{array}{lr} Var(\epsilon_t+b\epsilon_{t-1})=\sigma^2+b^2\sigma^2 &: s=0 \\ Cov(\epsilon_t+b\epsilon_{t-1},\epsilon_{t-1}+b\epsilon_{t-2})=Cov(b\epsilon_{t-1},\epsilon_{t-1})=bVar(\epsilon_{t-1})=b\sigma^2 &: s=1 \\ Cov(\epsilon_t+b\epsilon_{t-1},\epsilon_{t-2}+b\epsilon_{t-3})=0&:s>1 \end{array}\right.\)

And notice that the ACF cuts off for \(s>q\) (here \(q=1\)) because of the last case in the bullet point above.

In general,

\[\rho_s=\left\{ \begin{array}{lr} Var(y_t) &: s=0 \\ Cov(y_t,y_{t-s})\neq 0 &: s\leq q \\ 0 &: s>q \end{array} \right.\]

In plain English, for an MA(q) process the ACF is zero for \(s>q\) and nonzero for \(s\leq q\).
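A small simulation illustrating the cutoff (my own sketch, not from the book): for an MA(1), the sample ACF is large at lag 1 and near zero afterwards.

set.seed(1)
y <- arima.sim(list(ma = 0.75), n = 1e4)
round(acf(y, lag.max = 5, plot = FALSE)$acf[-1], 2)  # lag 1 is large; lags 2-5 are near 0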

Consider the other way of comparing the MA(1) model to the AR(1) model below:

MA(1): \[y_t=\epsilon_t+b\epsilon_{t-1}\]

AR(1): \[y_t=\epsilon_t+a\epsilon_{t-1}+\sum_{i=2}^{\infty}a^i\epsilon_{t-i}\]

so that the AR(1) model is an infinitely weighted MA model. Notice the relationship between AR(1) and MA(\(\infty\)) rewritten above.

And in general, we can rewrite any AR(\(p\)) model as an MA(\(\infty\)) model using the same argument as in the previous representations. Additionally, note that AR(1) is stationary (for \(|a|<1\)) since:

  • The mean is \(E[y_t]=E[\epsilon_t+a\epsilon_{t-1}+...]=0+0+0...\).
  • The variance is \(Var(y_t)=Var[\epsilon_t+a\epsilon_{t-1}+a^2\epsilon_{t-2}+...]=\sigma^2+a^2\sigma^2+...=\frac{\sigma^2}{1-a^2}\).
  • The covariance is \(\rho_s=\left\{ \begin{array}{lr} \frac{\sigma^2}{1-a^2} &: s=0 \\ Cov(\sum_{i=0}^{\infty}a^i\epsilon_{t-i},\sum_{j=0}^{\infty}a^j\epsilon_{t-s-j})=a^s\sigma^2+a^{s+2}\sigma^2+...=\sum_{i=0}^{\infty}a^{s+2i}\sigma^2=\frac{a^s\sigma^2}{1-a^2} &: s>0 \end{array}\right.\)

However, the ACF of an AR(1) process is never exactly zero, as the last bullet point above shows; it decays geometrically like \(a^s\).
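Again as a rough check (my own sketch), the sample ACF of a simulated AR(1) decays roughly like \(a^s\) rather than cutting off:

set.seed(1)
a <- 0.75
y <- arima.sim(list(ar = a), n = 1e4)
round(acf(y, lag.max = 5, plot = FALSE)$acf[-1], 2)  # sample ACF at lags 1-5
round(a^(1:5), 2)                                    # theoretical values a^s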

To summarize:

  • AR(\(p\)) can be rewritten as an MA(\(\infty\))
  • MA(\(q\)) can be rewritten as an AR(\(\infty\))

ARIMA Models

ARIMA models are a combination of the AR(\(p\)) and MA(\(q\)) models with an extra I. The I stands for the degree of differencing, which was defined previously, where:

\[ \begin{align*} I_1&=y_t^1=y_t-y_{t-1} \\ I_2&=y_t^2=y_t^1-y_{t-1}^1=(y_t-y_{t-1})-(y_{t-1}-y_{t-2}) \end{align*} \]

which we intuitively interpreted as the first and second derivatives, respectively. We can again write these models as AR(\(\infty\)) and MA(\(\infty\)).

For an ARIMA(1,0,1) we get the following: \[y_t=ay_{t-1}+\epsilon_t+b\epsilon_{t-1}\]

Let \[v_t=\epsilon_t+b\epsilon_{t-1}\]

Then, we can substitute the above to get: \[ \begin{align*} y_t&=ay_{t-1}+v_t \\ &=a(ay_{t-2}+v_{t-1})+v_t \\ &=a(a(ay_{t-3}+v_{t-2})+v_{t-1})+v_t \\ &=a^3y_{t-3}+a^2v_{t-2}+av_{t-1}+v_t \\ &=v_t+av_{t-1}+a^2v_{t-2}+...\\ &=(\epsilon_t+b\epsilon_{t-1})+a(\epsilon_{t-1}+b\epsilon_{t-2})+a^2(\epsilon_{t-2}+b\epsilon_{t-3})...\\ &=\epsilon_t+(a+b)\epsilon_{t-1}+(a^2+ab)\epsilon_{t-2}+...\\ &=\epsilon_t+\sum_{i=1}^{\infty}(a^i+a^{i-1}b)\epsilon_{t-i} \end{align*} \]

so that we have written the ARIMA(1,0,1) as an MA(\(\infty\)). Similarly, we can write it as an AR(\(\infty\)).
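We can verify the MA(\(\infty\)) weights \(a^i+a^{i-1}b\) numerically with the base-R helper ARMAtoMA() (a quick check of my own, not from the book):

a <- 0.5
b <- 0.3
ARMAtoMA(ar = a, ma = b, lag.max = 5)  # psi_1, ..., psi_5 for the ARMA(1,1)
a^(1:5) + a^(0:4) * b                  # matches a^i + a^(i-1) * b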

Fitting ARIMA Models

Before fitting the model, the series is differenced to the degree given by I. Then the coefficients of the ARMA(\(p\),\(q\)) are estimated by maximizing the likelihood.

Our loss function for squared-error is as follows: \[ \begin{align*} L(A,B)&=\sum_{i=1}^N(y_i-\hat{y}_i)^2 \\ &=\sum_{i=1}^N\epsilon_i^2 \end{align*} \] where

  • \(\epsilon_i=(y_i-\hat{y}_i)\) if the true function is ARMA(\(p\),\(q\)).
  • A,B are the set of coefficients \(a\in A\), \(b \in B\).

The actual problem we are solving is: \[ \DeclareMathOperator*{\argmin}{arg\,min}\\ a,b= \argmin_{A,B}{\vec{\epsilon^T}\vec{\epsilon}} \]

We don’t observe \(\epsilon_t\) so they are approximated \[\hat{\epsilon}_t=y_t-a_1y_{t-1}-...-a_py_{t-p}-b_1\hat{\epsilon}_{t-1}-...-b_q\hat{\epsilon}_{t-q}\]

and all previous \(\epsilon_{t-q}\) are also estimated recursively. The function is seeded with initial values for \(\epsilon_{t-q}\) and is solved using numerical procedures. The book does not mention which specific techniques are used nor how it is exactly solved.

I will make a guess at how it is solved. Consider the fit of an ARIMA(2,0,2) model as follows:

To keep improving the model, we would need to alternately minimize two functions. That is, we would like to solve the following:

1. \[\DeclareMathOperator*{\argmin}{arg\,min}\\ a= \argmin_{A,B}{\sum_{i=1}^N(y_i-\hat{y}_i)^2} \]

Which can be solved as follows:

\[ \left[ \begin{array}{c} y_3\\ y_4\\ \vdots\\ y_t \end{array} \right] = \left[ \begin{array}{cccc} y_{2} & y_{1} & \epsilon_{2} & \epsilon_{1}\\ y_{3} & y_{2} & \epsilon_{3} & \epsilon_{2}\\ \vdots & \vdots & \vdots & \vdots\\ y_{t-1} & y_{t-2} & \epsilon_{t-1} & \epsilon_{t-2} \end{array} \right] \left[ \begin{array}{c} a_1\\ a_2\\ b_1\\ b_2\\ \end{array} \right] \]

If we solve the above using the normal equations \(X^TY=X^TX\beta\), then we can get reasonable values for the \(a\)’s and \(b\)’s while holding the \(\epsilon\)’s fixed.

The second function we need to alternate with is the following: 2. \[\DeclareMathOperator*{\argmin}{arg\,min}\\ \vec{\epsilon}= \argmin_{\vec{\epsilon}}{\sum_{i=1}^N(y_i-\hat{y}_i)^2} \]

which we can solve as follows:

\[ \left[ \begin{array}{c} y_3\\ y_4\\ \vdots\\ y_t \end{array} \right] = \left[ \begin{array}{cccccccc} y_{2} & y_{1} & b_1 & b_2 & 0 & \dots & 0 & 0\\ y_{3} & y_{2} & 0 & b_1 & b_2 & \dots & 0 & 0\\ \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\ y_{t-1} & y_{t-2} & 0 & 0 & 0 & \dots & b_1 & b_2 \end{array} \right] \left[ \begin{array}{c} e_1\\ e_2\\ \vdots\\ e_{t-1}\\ \end{array} \right] \]

so we alternate between (1) and (2) until convergence.
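Below is a rough R sketch of this alternating idea for an ARMA(2,2) (my own guess at an implementation, not the book's method and not what arima() actually does; it seeds the errors with residuals from a long AR fit):

fit_arma22 <- function(y, iters = 20) {
  n <- length(y)
  # seed the errors with residuals from a long AR fit
  e <- ar(y, order.max = 10, aic = FALSE)$resid
  e[is.na(e)] <- 0
  for (it in 1:iters) {
    # step 1: regress y_t on y_{t-1}, y_{t-2}, e_{t-1}, e_{t-2}, holding the e's fixed
    X  <- cbind(y[2:(n - 1)], y[1:(n - 2)], e[2:(n - 1)], e[1:(n - 2)])
    yt <- y[3:n]
    coefs <- coef(lm(yt ~ X - 1))
    a <- coefs[1:2]
    b <- coefs[3:4]
    # step 2: recompute the errors recursively, holding the coefficients fixed
    e <- rep(0, n)
    for (t in 3:n)
      e[t] <- y[t] - a[1] * y[t - 1] - a[2] * y[t - 2] - b[1] * e[t - 1] - b[2] * e[t - 2]
  }
  list(a = unname(a), b = unname(b))
}

set.seed(1)
y <- as.numeric(arima.sim(list(ar = c(0.5, -0.2), ma = c(0.4, 0.1)), n = 2000))
fit_arma22(y)  # roughly comparable to arima(y, order = c(2, 0, 2), include.mean = FALSE)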

Lag Operator

We will now introduce new notation for ARMA(\(p\),\(q\)) models. This notation is called the lag-operator notation:

\[a(L)y_t=b(L)\epsilon_t\]

where:

  • \(a(L)=1-a_1L-a_2L^2-...-a_pL^p\)
  • \(b(L)=1+b_1L+b_2L^2+...+b_qL^q\)
  • \(L^ky_t=y_{t-k}\)

so that ARIMA(1,0,1) is \((1-aL)y_t=(1+bL)\epsilon_t\), i.e., \(y_t-ay_{t-1}=\epsilon_t+b\epsilon_{t-1}\implies y_t=ay_{t-1}+\epsilon_t+b\epsilon_{t-1}\).

Page 137 of the book is a bit confusing. Consider a polynomial in L as follows:

  • \(a(L)=(1-z_1L)(1-z_2L)\cdot...\cdot(1-z_pL)\), where we can loosely interpret this as a product of differencing-type operators; that is, each factor acts as \((1-z_iL)y_t=y_t-z_iy_{t-1}\). This follows from the fundamental theorem of algebra, which states that a polynomial of degree n has n roots. This may be something to explore later; a quick numeric check of the roots is sketched below.
  • \((1-z_iL)^{-1}=1+z_iL+(z_iL)^2+...,\)
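Related to this factorization, a standard numeric check (my own aside, not from the book) is that an AR polynomial corresponds to a stationary process when the roots of \(a(x)\) lie outside the unit circle, i.e., all \(|z_i|<1\):

a <- c(0.75, -0.2)        # hypothetical AR(2) coefficients
Mod(polyroot(c(1, -a)))   # moduli of the roots of 1 - a1*x - a2*x^2; both exceed 1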

Model Selection

To determine the \(p\) and \(q\) for an ARIMA model, we need to look at the ACF and PACF plots. ACF plots were defined earlier; in summary, AR(p) models have slowly dying ACF plots, while MA(q) models have a distinct cutoff, with the ACF zero for \(s>q\). Alternatively, we can think of the equivalences AR(1)=MA(\(\infty\)) and MA(1)=AR(\(\infty\)), so that the ACF in the first case is never zero but decreases, and in the second case it cuts off.

| Model | ACF | PACF |
|---|---|---|
| MA(q) | cuts off; zero for \(s>q\) | dies out |
| AR(p) | dies out | cuts off; zero for \(s>p\) |
| ARIMA(p,d,q) | dies out | dies out |

The PACF for \(y_t\) and \(y_{t-s}\) is denoted \(\pi_s\) and is defined as the correlation between the two variables after removing any effects of the intermediate variables \(y_{t-1},y_{t-2},...,y_{t-s+1}\).

Consider the following cases: \[\pi_s=\left\{ \begin{array}{lr} \rho_0/\rho_0&:s=0\\ \rho_1/\rho_0&:s=1\\ Corr(y_t,y_{t-s}\mid y_{t-1},...,y_{t-s+1})&:s>1 \end{array} \right.\]

Removing the effect here means subtracting out the intermediate effects. Consider the AR(1) process:
\[y_t=ay_{t-1}+\epsilon_t\]

For \(s=0\), we get \(\rho_0\), which is just the variance.

For \(s=1\), there are no intermediate terms so we get the covariance \(\rho_1\).

For \(s=2\), there is one intermediate term, namely \(y_{t-1}\), so subtract out \(ay_{t-1}\) to get:

\[y_t-ay_{t-1}=\epsilon_t.\]

Then the PACF of the resulting process is \(\pi_2=0\) since the above is just white noise.

For \(s>2\), the same argument holds since \(a_2=a_3=...=0\), so the adjusted series are all white noise.

The \(a\) terms can be estimated by least squares under the assumption that \(\epsilon_t\sim N(0,\sigma^2)\). More concretely, we can solve the following:

\[ \left[ \begin{array}{c} y_3\\ y_4\\ \vdots\\ y_n \end{array} \right] = \left[ \begin{array}{cc} y_2&y_1\\ y_3&y_2\\ \vdots&\vdots\\ y_{n-1}&y_{n-2} \end{array} \right] \left[ \begin{array}{c} a_1\\ a_2 \end{array} \right] \]

by using the normal equations again, and then subtract out the resulting dependencies. More concretely (an R sketch follows the algorithm below),

Algorithm 1 Solving for the PACF

begin

  1. solve: \(\hat{b}=(X^TX)^{-1}X^Ty\)
  2. get residuals: \(\vec{r}=y-\hat{y}=y-X\hat{b}\)
  3. store PACFs: \(\pi_s\leftarrow Cov(y_t,r_{t-s})\)

end
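A rough R sketch of this regression-based view of the PACF (my own illustration; it uses the fact that the lag-\(s\) PACF equals the coefficient on \(y_{t-s}\) when regressing \(y_t\) on \(y_{t-1},...,y_{t-s}\), and compares against pacf()):

pacf_by_regression <- function(y, max.lag = 5) {
  n <- length(y)
  sapply(seq_len(max.lag), function(s) {
    # lagged predictors y_{t-1}, ..., y_{t-s}
    lags <- sapply(seq_len(s), function(k) y[(s - k + 1):(n - k)])
    yt   <- y[(s + 1):n]
    # the coefficient on y_{t-s} is the lag-s partial autocorrelation
    coef(lm(yt ~ lags))[s + 1]
  })
}

set.seed(1)
y <- as.numeric(arima.sim(list(ar = 0.75), n = 500))
round(pacf_by_regression(y), 3)
round(pacf(y, plot = FALSE)$acf[1:5], 3)  # close to the regression-based values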

Notice that this is the opposite of what happens to an AR(p) process for an ACF plot. That is, for \(s>p\), the PACF for AR(p) is zero.

For an MA(q) process, expressing \(y_t\) in terms of \(y_{t-k}\), \(k<s\), introduces terms where there previously were none. By doing this, we create intermediate correlations, since MA(q)=AR(\(\infty\)), so that the PACF is never zero.

The ARIMA(p,d,q) has both components, so neither the ACF nor the PACF cuts off; both die out gradually.

On page 438, the book mentions that the PACF at lag \(s\) is estimated by correlating the residuals from regressing \(y_t\) and \(y_{t-s}\) on the intermediate values \(y_{t-1},y_{t-2},...,y_{t-s+1}\). This means that \(\vec{r}\) is the residual vector, and when performing the correlations \(Cov(r,y_{t-k})=r^Ty_{t-k}\), the largest correlation amounts to projecting the residual vector onto the vector \(y_{t-k}\) that is closest to it. That is, the vector closest to the residuals explains the most variance, or minimizes the second set of residuals: \((r^Ty_{t-k})^T(r^Ty_{t-k})\).

ARIMA Lag Operators

The ARIMA(p,d,q) can be written in lag-operator form as follows:

\[a(L)(1-L)^d(y_t-\mu)=b(L)\epsilon_t.\]

To make this clearer, we first subtract the mean from \(y_t\) so that the resulting \(y_t'\) has zero mean. Then we apply the differencing \((1-L)^dy_t'\), expanding the operator first and then applying it to \(y_t'\). Finally, we apply the usual AR(p) and MA(q) machinery.

We use the ACF and PACF plots to determine the degree of differencing; that is, if both plots decay slowly, we difference the series until at least one of the plots shows a sharp cutoff. Rarely is \(d > 2\) required.
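For a quick automated check of this (an aside of mine, using the forecast package that is loaded later in this tutorial), ndiffs() and nsdiffs() suggest the number of regular and seasonal differences:

library(forecast)
ndiffs(log(earnings + 1))   # suggested number of regular differences
nsdiffs(log(earnings + 1))  # suggested number of seasonal differences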

Seasonality

Time series sometimes exhibit seasonality effects. That is, there are seasonal or yearly patterns, so that observations at lags \(4s\) or \(12s\) (quarterly or monthly data, respectively) are correlated.

The seasonal lag-operator form is as follows: \[a(L)(1-L)^d\alpha(L^S)y_t=b(L)\beta(L^S)\epsilon_t.\]

For example, we can model \(ARIMA((1,0,0),(1,1,1)_{12})\) which is: \[(1-aL)(1-\alpha L^{12})(1-L^{12})y_t=(1-\beta L^{12})\epsilon_t\]

which states that there is seasonal differencing \((1-L^{12})\) with seasonal \(AR_{12}(1)\) and \(MA_{12}(1)\) parts.

This model captures both short-term time-dependencies, via \(a(L)\) and \(b(L)\), which are linear functions of the previous few values, and long-term time-dependencies, via \(\alpha(L^S)\) and \(\beta(L^S)\), which capture seasonal or yearly dependencies while ignoring the terms in between.

This allows for fewer parameters which is generally better for prediction accuracy.
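For reference (my own example, not from the book), the seasonal model written above would be specified with forecast::Arima like this, shown on the monthly AirPassengers series since our J&J data is quarterly:

library(forecast)
fit <- Arima(AirPassengers, order = c(1, 0, 0),
             seasonal = list(order = c(1, 1, 1), period = 12))
fit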

ARIMA on J&J Dataset

We will proceed to model in the following steps:

  1. Check for and remove nonstationarity.

  2. Identify the AR and MA orders p and q.

  3. Estimate the AR and MA coefficients.

  4. Check the fitted model.

  5. Check the residuals. Are they correlated?

     If yes: return to step 2.

     If no: stop.

We will look at the JohnsonJohnson dataset again using the above steps.

par(mar=c(1,1,1,1))
library(forecast)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Loading required package: timeDate
## This is forecast 5.7
# plot : plot, acf, pacf
tsdisplay(earnings)

# log it
logearn=log(earnings+1)
tsdisplay(logearn)

# 1st diff
tsdisplay(diff(logearn))

# ACF plot: there appears to be a seasonal component
# seasonal difference
tsdisplay(diff(diff(logearn),4))

#split train/test : 1960-1978 / 1979-1980
train.end=time(logearn)[72]
test.start=time(logearn)[73]
train=window(logearn,end=train.end)
test=window(logearn,start=test.start)

# There are big spikes at ACF: 1,2 and PACF: 1,4
# We will try several models and compare their likelihoods
(m1=Arima(train,order=c(1,1,1),seasonal=list(order=c(1,1,0)),method="ML",include.drift=F))
## Series: train 
## ARIMA(1,1,1)(1,1,0)[4]                    
## 
## Coefficients:
##          ar1      ma1     sar1
##       0.1015  -0.6769  -0.3464
## s.e.  0.2122   0.1597   0.1292
## 
## sigma^2 estimated as 0.00351:  log likelihood=93.72
## AIC=-179.43   AICc=-178.79   BIC=-170.61
(m2=Arima(train,order=c(4,1,4),seasonal=list(order=c(0,1,0)),method='ML'))
## Series: train 
## ARIMA(4,1,4)(0,1,0)[4]                    
## 
## Coefficients:
##           ar1      ar2      ar3      ar4      ma1    ma2     ma3      ma4
##       -0.1583  -0.2097  -0.4764  -0.2092  -0.4062  0.120  0.3241  -0.4943
## s.e.   0.2045   0.2021   0.1891   0.1840   0.1867  0.221  0.1977   0.1950
## 
## sigma^2 estimated as 0.003196:  log likelihood=96.53
## AIC=-175.07   AICc=-171.91   BIC=-155.22
(m3=Arima(train,order=c(3,1,2),seasonal=list(order=c(0,1,0)),method='ML'))
## Series: train 
## ARIMA(3,1,2)(0,1,0)[4]                    
## 
## Coefficients:
##           ar1     ar2     ar3      ma1      ma2
##       -0.5041  0.3269  0.1658  -0.0405  -0.7603
## s.e.   0.2132  0.1999  0.1454   0.1813   0.1485
## 
## sigma^2 estimated as 0.003737:  log likelihood=91.72
## AIC=-171.44   AICc=-170.04   BIC=-158.21
(m4=auto.arima(train,stepwise=FALSE))
## Series: train 
## ARIMA(0,1,1)(0,1,1)[4]                    
## 
## Coefficients:
##           ma1     sma1
##       -0.5931  -0.4169
## s.e.   0.1087   0.1162
## 
## sigma^2 estimated as 0.003457:  log likelihood=94.13
## AIC=-182.26   AICc=-181.88   BIC=-175.65
m4$arma
## [1] 0 1 0 1 4 1 1
(m5=auto.arima(train,stepwise=TRUE))
## Series: train 
## ARIMA(0,1,1)(0,1,1)[4]                    
## 
## Coefficients:
##           ma1     sma1
##       -0.5931  -0.4169
## s.e.   0.1087   0.1162
## 
## sigma^2 estimated as 0.003457:  log likelihood=94.13
## AIC=-182.26   AICc=-181.88   BIC=-175.65
m5$arma
## [1] 0 1 0 1 4 1 1
#number of AR, MA, seasonal AR and seasonal MA coefficients, plus the period and the number of non-seasonal and seasonal differences
# best model chosen: ARIMA((0,1,1),(0,1,1))

tsdisplay(residuals(m4))

tsdisplay(residuals(m5))

# FROM : https://www.otexts.org/fpp/8/9
# out-of-time cross validation
getrmse <- function(x,h,...)
{
  train.end <- time(x)[length(x)-h]
  test.start <- time(x)[length(x)-h+1]
  train <- window(x,end=train.end)
  test <- window(x,start=test.start)
  fit <- Arima(train,...)
  fc <- forecast(fit,h=h)
  return(accuracy(fc,test)[2,"RMSE"])
}


getrmse(logearn,h=24,order=c(0,0,0),seasonal=c(0,0,0),lambda=0)
## [1] 1.440143
getrmse(logearn,h=24,order=c(0,1,0),seasonal=c(0,0,0),lambda=0)
## [1] 0.5536669
getrmse(logearn,h=24,order=c(0,0,0),seasonal=c(0,1,0),lambda=0)
## [1] 0.4944645
getrmse(logearn,h=24,order=c(0,1,0),seasonal=c(0,1,0),lambda=0)
## [1] 0.1376351
getrmse(logearn,h=24,order=c(0,1,1),seasonal=c(0,1,0),lambda=0)
## [1] 0.09076893
getrmse(logearn,h=24,order=c(0,1,1),seasonal=c(0,1,1),lambda=0)
## [1] 0.08427583
getrmse(logearn,h=24,order=c(3,1,1),seasonal=c(1,1,0),lambda=0)
## [1] 0.08925738
getrmse(logearn,h=24,order=c(3,1,2),seasonal=c(1,1,0),lambda=0)
## [1] 0.08102855
getrmse(logearn,h=24,order=c(2,1,1),seasonal=c(0,1,0),lambda=0)
## [1] 0.4066872
getrmse(logearn,h=24,order=c(2,1,2),seasonal=c(0,1,0),lambda=0)
## [1] 0.4194572
getrmse(logearn,h=24,order=c(3,1,2),seasonal=c(0,1,0),lambda=0)
## [1] 0.08287011
getrmse(logearn,h=24,order=c(3,1,2),seasonal=c(0,0,0),lambda=0)
## [1] 0.4895547
getrmse(logearn,h=24,order=c(4,1,1),seasonal=c(0,0,0),lambda=0)
## [1] 0.174001
m1=Arima(train,order=c(0,1,1),seasonal=c(0,1,1))
plot(forecast(m1,h=12))

The candidate models are as follows:

| Model | ARIMA | AIC | CV |
|---|---|---|---|
| 1 | (1,1,0) | - | - |
| 2 | (0,1,0) | - | - |
| 3 | (1,1,1) | - | - |

The best model is the \(ARIMA((0,1,1),(0,1,1)_4)\) which is parsimonious and has good predictive accuracy.

Model Drawbacks

There are some potential drawbacks from ARIMA modeling. Namely,

  1. The squared-error criterion. Squared error overpenalizes outliers, so metrics other than squared error could be explored.

  2. The assumption of normal errors. Oftentimes, errors are fat-tailed and thus violate the normality assumption.

Additional Time Series Models

Structural Models

Structural models try to model the nonstationarity directly; that is, there are a level term \(\mu_t\) and a slope term \(\delta_t\) that vary with time.

The model is of the form: \[y_t=\mu_t+\epsilon_t\]

where:

  • \(\mu_t=\mu_{t-1}+\delta_{t-1}+a\eta_{t-1}\).

  • \(\delta_{t-1}=\delta_{t-2}+bv_{t-1}\).

  • \(\epsilon_t\), \(\eta_{t-1}\), and \(v_{t-1}\) are uncorrelated noise terms.

So that \(a\) and \(b\) here tell us the importance of changes in level or slope. We can also add in a seasonality component by adding the terms: \[\gamma_t=-\gamma_{t-1}-\gamma_{t-2}-\gamma_{t-3}+c\psi_{t-1}\]

where \(\psi_{t-1} \sim (0,\sigma^2)\). The benefit of this model over ARIMA is interpretability: the level \(\mu_t\), slope \(\delta_t\), seasonality \(\gamma_t\), and parameters \(a\), \(b\), and \(c\).

The book does not mention how to estimate the parameters but some guessing would imply the following method:

The model is as follows:

\(y_t=\mu_t+\epsilon_t\), where \(\mu_t=\mu_{t-1}+\delta_{t-1}+a\eta_{t-1}\) and \(\delta_{t-1}=\delta_{t-2}+bv_{t-2}\).

  1. Notice that \(E[y_t]=E[\mu_t+\epsilon_t]=\mu_t\). Then let \(\mu_t=y_t\).

  2. Subtract \(E[y_t-y_{t-1}]=\mu_t-\mu_{t-1}=\delta_{t-1}+a\eta_{t-1}=r_1\).

  3. Notice that \(E[\delta_{t-1}+a\eta_{t-1}]=\delta_{t-1}\). Then, \(\delta_{t-1}=\delta_{t-2}+bv_{t-2}\).

  4. Take \(\delta_{t-1}-\delta_{t-2}=bv_{t-2}=r_2\). Then we can estimate the effect of \(b\) from \(r_2\).

Fitting the structural model, we get the following parameter estimates:

StructTS(train)
## 
## Call:
## StructTS(x = train)
## 
## Variances:
##     level      slope       seas    epsilon  
## 2.942e-04  1.553e-05  4.915e-04  9.107e-04

so that we get \(\hat{\sigma}=\) 0.041, \(\hat{a}=\) 0.82, and \(\hat{b}=\) 0.156, and \(\hat{c}=\) 1.189.
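As a quick follow-up (my own addition, not from the book), the fitted structural model can also be used for forecasting via the forecast package:

fit.struct <- StructTS(train, type = "BSM")   # basic structural model: level, slope, seasonal
plot(forecast(fit.struct, h = 8))             # forecast the next 8 quarters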

Stochastic Volatility Models

Suppose that the variance is nonconstant with respect to time; that is, there is time-varying volatility \(\sigma_t\) so that the model is :

\[y_t=\mu+\sigma_t\epsilon_t\]

where:

  • \(\sigma^2_t=a_0+a_1\sigma^2_{t-1}+b_1(y_{t-1}-\mu)^2.\)

Here, \(b_1\) weights the most recent squared deviation and \(a_1\) weights the previous variance term.

garch(train,c(1,1))
## 
##  ***** ESTIMATION WITH ANALYTICAL GRADIENT ***** 
## 
## 
##      I     INITIAL X(I)        D(I)
## 
##      1     3.515062e-01     1.000e+00
##      2     5.000000e-02     1.000e+00
##      3     5.000000e-02     1.000e+00
## 
##     IT   NF      F         RELDF    PRELDF    RELDX   STPPAR   D*STEP   NPRELDF
##      0    1  1.122e+02
##      1    2  5.418e+01  5.17e-01  6.27e+00  9.0e-01  7.0e+02  1.0e+00  2.21e+03
##      2    4  5.300e+01  2.17e-02  1.65e-02  2.1e-02  2.8e+00  5.0e-02  3.51e+00
##      3    6  5.101e+01  3.77e-02  3.76e-02  4.1e-02  2.2e+00  1.0e-01  8.19e-01
##      4    8  5.066e+01  6.76e-03  6.77e-03  7.8e-03  2.1e+01  2.0e-02  6.38e-01
##      5   10  5.002e+01  1.26e-02  1.26e-02  1.6e-02  3.4e+00  4.0e-02  5.47e-01
##      6   12  4.991e+01  2.38e-03  2.38e-03  3.0e-03  4.6e+01  8.0e-03  4.89e-01
##      7   14  4.968e+01  4.61e-03  4.61e-03  6.1e-03  6.4e+00  1.6e-02  4.58e-01
##      8   16  4.925e+01  8.60e-03  8.61e-03  1.2e-02  3.6e+00  3.2e-02  4.37e-01
##      9   19  4.924e+01  1.65e-04  1.65e-04  2.4e-04  4.8e+02  6.4e-04  3.97e-01
##     10   22  4.924e+01  3.29e-06  3.29e-06  4.9e-06  2.4e+04  1.3e-05  3.79e-01
##     11   24  4.924e+01  6.58e-06  6.58e-06  9.8e-06  3.0e+03  2.6e-05  3.79e-01
##     12   26  4.924e+01  1.32e-06  1.32e-06  2.0e-06  5.9e+04  5.1e-06  3.79e-01
##     13   28  4.924e+01  2.63e-06  2.63e-06  3.9e-06  7.4e+03  1.0e-05  3.79e-01
##     14   30  4.924e+01  5.26e-07  5.26e-07  7.8e-07  1.5e+05  2.0e-06  3.79e-01
##     15   33  4.924e+01  4.21e-06  4.21e-06  6.3e-06  4.6e+03  1.6e-05  3.79e-01
##     16   36  4.924e+01  8.42e-08  8.42e-08  1.3e-07  9.2e+05  3.3e-07  3.79e-01
##     17   38  4.924e+01  1.68e-08  1.68e-08  2.5e-08  4.6e+06  6.6e-08  3.79e-01
##     18   40  4.924e+01  3.37e-08  3.37e-08  5.0e-08  5.8e+05  1.3e-07  3.79e-01
##     19   42  4.924e+01  6.74e-09  6.74e-09  1.0e-08  1.2e+07  2.6e-08  3.79e-01
##     20   44  4.924e+01  1.35e-08  1.35e-08  2.0e-08  1.4e+06  5.2e-08  3.79e-01
##     21   47  4.924e+01  2.69e-10  2.69e-10  4.0e-10  2.9e+08  1.0e-09  3.79e-01
##     22   49  4.924e+01  5.39e-10  5.39e-10  8.0e-10  3.6e+07  2.1e-09  3.79e-01
##     23   51  4.924e+01  1.08e-09  1.08e-09  1.6e-09  1.8e+07  4.2e-09  3.79e-01
##     24   53  4.924e+01  2.16e-10  2.16e-10  3.2e-10  3.6e+08  8.4e-10  3.79e-01
##     25   55  4.924e+01  4.31e-11  4.31e-11  6.4e-11  1.4e+00  1.7e-10 -1.31e-01
##     26   57  4.924e+01  8.62e-11  8.62e-11  1.3e-10  1.4e+00  3.4e-10 -1.31e-01
##     27   59  4.924e+01  1.72e-10  1.72e-10  2.6e-10  1.1e+08  6.7e-10  3.79e-01
##     28   62  4.924e+01  3.45e-12  3.45e-12  5.1e-12  1.4e+00  1.3e-11 -1.31e-01
##     29   64  4.924e+01  6.90e-12  6.90e-12  1.0e-11  1.4e+00  2.7e-11 -1.31e-01
##     30   66  4.924e+01  1.38e-12  1.38e-12  2.1e-12  1.4e+00  5.4e-12 -1.31e-01
##     31   68  4.924e+01  2.76e-12  2.76e-12  4.1e-12  1.4e+00  1.1e-11 -1.31e-01
##     32   71  4.924e+01  5.51e-14  5.52e-14  8.2e-14  1.4e+00  2.1e-13 -1.31e-01
##     33   73  4.924e+01  1.11e-13  1.10e-13  1.6e-13  1.4e+00  4.3e-13 -1.31e-01
##     34   75  4.924e+01  2.18e-14  2.21e-14  3.3e-14  1.4e+00  8.6e-14 -1.31e-01
##     35   77  4.924e+01  4.18e-15  4.41e-15  6.6e-15  1.4e+00  1.7e-14 -1.31e-01
##     36   79  4.924e+01  8.66e-15  8.83e-15  1.3e-14  1.4e+00  3.4e-14 -1.31e-01
##     37   81  4.924e+01  1.73e-15  1.77e-15  2.6e-15  1.4e+00  6.9e-15 -1.31e-01
##     38   84  4.924e+01  1.46e-14  1.41e-14  2.1e-14  1.5e+00  5.5e-14 -1.35e-01
##     39   86  4.924e+01 -2.03e+08  2.83e-15  4.2e-15  1.4e+00  1.1e-14 -1.31e-01
## 
##  ***** FALSE CONVERGENCE *****
## 
##  FUNCTION     4.923896e+01   RELDX        4.194e-15
##  FUNC. EVALS      86         GRAD. EVALS      39
##  PRELDF       2.825e-15      NPRELDF     -1.309e-01
## 
##      I      FINAL X(I)        D(I)          G(I)
## 
##      1    4.924573e-01     1.000e+00     8.717e+00
##      2    9.000377e-01     1.000e+00     3.981e+00
##      3    2.049781e-15     1.000e+00     8.260e+00
## 
## Call:
## garch(x = train, order = c(1, 1))
## 
## Coefficient(s):
##        a0         a1         b1  
## 4.925e-01  9.000e-01  2.050e-15
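As an aside of my own, GARCH models are usually fit to returns rather than to levels, and the optimizer trace above can be suppressed via garch.control() from the tseries package; a hedged example:

g <- garch(diff(logearn), order = c(1, 1),
           control = garch.control(trace = FALSE))
summary(g)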

MA Parameter Estimation

For an MA(1) series we have the following (note the sign convention here, with \(-b\) rather than the \(+b\) used earlier): \[y_t=\epsilon_t-b\epsilon_{t-1}\]

so that \(\epsilon_t=y_t+b\epsilon_{t-1}\) which implies that

\[ \begin{align*} \epsilon_1&=y_1+b\epsilon_0\\ \epsilon_2&=y_2+by_1+b^2\epsilon_0\\ \epsilon_3&=y_3+by_2+b^2y_1+b^3\epsilon_0\\ &\vdots\\ \epsilon_t&=y_t+by_{t-1}+...+b^t\epsilon_0 \end{align*} \]

And in matrix form this equals the following:

\[ \left[ \begin{array}{c} \epsilon_1\\ \epsilon_2\\ \epsilon_3\\ \vdots\\ \epsilon_t \end{array} \right] = \left[ \begin{array}{ccccc} 1&0&0&...&0\\ b&1&0&...&0\\ b^2&b&1&...&0\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ b^{t-1}&b^{t-2}&b^{t-3}&...&1 \end{array} \right] \left[ \begin{array}{c} y_1\\ y_2\\ y_3\\ \vdots\\ y_t \end{array} \right] + \left[ \begin{array}{c} b\\ b^2\\ b^3\\ \vdots\\ b^t \end{array} \right] \epsilon_0 \]
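A small numeric check of this matrix form (my own sketch, taking \(\epsilon_0=0\)): the matrix product reproduces the recursion \(\epsilon_t=y_t+b\epsilon_{t-1}\). Note that arima.sim() uses the \(+\) sign convention for the MA coefficient, so ma = -b below corresponds to \(y_t=\epsilon_t-b\epsilon_{t-1}\).

set.seed(1)
b <- 0.6
y <- as.numeric(arima.sim(list(ma = -b), n = 6))
n <- length(y)
M <- outer(1:n, 1:n, function(i, j) ifelse(i >= j, b^(i - j), 0))  # lower-triangular powers of b
eps_matrix    <- as.numeric(M %*% y)                               # matrix form with eps_0 = 0
eps_recursive <- Reduce(function(prev, yt) yt + b * prev, y, accumulate = TRUE)
cbind(eps_matrix, eps_recursive)                                   # identical columns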