I took these notes while working through the ARIMA Modeling with R course at DataCamp.

Time Series Data and Models

Here are some time series datasets with distinctive properties related to modeling.

The AirPassengers dataset from the datasets package contains monthly totals of international airline passengers, 1949 to 1960. AirPassengers has an upward trend. It has seasonality: August is the peak travel month, and November the smallest. It also has heteroscedasticity: as the overall passenger volume grows, small percentage changes produce large absolute changes.

plot(AirPassengers, 
     main = "Monthly International Airline Passengers, 1949 - 1960")
text(AirPassengers, labels = 1:12, col = 1:12)

The djia dataset from the astsa package contains daily DJIA values from April 2006 to April 2016. The data have a generally positive trend, though not uniformly so. There is no seasonal component, and the residuals are homoscedastic. This time series is an example of a random walk.

library(astsa)
library(xts)
plot(djia$Close)

The soi dataset from the astsa package contains the Southern Oscillation Index (SOI) for a period of 453 months ranging over the years 1950-1987. This series has no trend, but it does have seasonality. SOI is a surrogate for sea surface temperature, and you can see that the data are positively correlated with themselves one month apart (\(\rho\) = 0.604), and negatively correlated with themselves six months apart (\(\rho\) = -0.187) (summer vs. winter).

ts.plot(soi,
        main = "Southern Oscillation Index (SOI), 1950 - 1987")

lag.plot(soi, set.lags = c(1, 6))

acf(soi, lag.max = 6, plot = FALSE)
## 
## Autocorrelations of series 'soi', by lag
## 
## 0.0000 0.0833 0.1667 0.2500 0.3333 0.4167 0.5000 
##  1.000  0.604  0.374  0.214  0.050 -0.107 -0.187

In normal regression, \(Y_i = \beta X_i + \epsilon_i\) with the assumption that the residuals are normally distributed, independent, and homoscedastic (i.e., the errors are white noise).

With time series data you can regress the present period value on the prior period value, \(X_t = \phi X_{t-1} + \epsilon_t\). This is called autoregression (AR). Typically, however, the errors are themselves autocorrelated. You can account for this by modeling the errors as a moving average (MA) of white noise, \(\epsilon_t = W_t + \theta W_{t-1}\). The ARMA model combines the two into a single model with autoregression and autocorrelated errors.

\[X_t = \phi X_{t-1} + W_t + \theta W_{t-1}\]

Stationary time series have a constant mean (no trend) and a constant correlation structure; that is, you can estimate a lag correlation by pooling all pairs of observations that far apart, because the series behaves consistently throughout.
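
As a quick illustration (a sketch, not course code, reusing the soi series from above), the lag-1 correlation of a stationary series can be estimated by pooling all pairs of observations one month apart; it agrees closely with the acf() estimate:

library(astsa)
n <- length(soi)
cor(soi[-1], soi[-n])               # pooled pairwise estimate of the lag-1 correlation
acf(soi, lag.max = 1, plot = FALSE) # essentially the same estimate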

A trend stationary time series is stationary around a trend. A simple example is \(Y_t = \alpha + \beta t + X_t\) where \(X_t\) is a stationary data series. A trend stationary series can be coerced into stationarity by differencing the observations, \(Y_t - Y_{t-1}\).
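
A minimal simulation of a trend stationary series (illustrative parameters, not course code):

set.seed(1)
trend <- 0.05 * (1:200)                              # beta = 0.05 per period
noise <- arima.sim(model = list(ar = 0.5), n = 200)  # stationary AR(1) component
y <- ts(2 + trend + noise)                           # alpha = 2
par(mfrow = c(2, 1))
plot(y, main = "Trend stationary series")
plot(diff(y), main = "Differenced series is stationary")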

The random walk model, \(X_t = X_{t-1} + W_t + c\) where \(W_t\) is white noise and c is a drift parameter, can also be coerced into stationarity with differencing.
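
A random walk with drift is just the cumulative sum of white noise plus a constant, so it is easy to simulate directly and to check that differencing recovers white noise (a sketch with an illustrative drift of c = 0.2):

set.seed(1)
rw <- ts(cumsum(rnorm(200) + 0.2))  # X_t = X_{t-1} + W_t + c with c = 0.2
par(mfrow = c(2, 1))
plot(rw, main = "Random walk with drift")
plot(diff(rw), main = "Differenced random walk = white noise plus drift")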

A time series with trend and heteroscedasticity, as in the AirPassengers and djia datasets above, can be coerced into stationarity with homoscedasticity by taking log differences. Such a series follows a model like \(X_t = (1 + p_t)X_{t-1}\) where \(p_t\) is a small percentage change. Taking the log difference unwinds the percentage change, since \(\log X_t - \log X_{t-1} = \log(1 + p_t) \approx p_t\) for small \(p_t\).

Here are the AirPassengers and djia datasets with log differencing. With the AirPassengers dataset, the trend and heteroscedasticity are now removed (but the periodicity remains).

par(mfrow = c(2, 1))
plot(AirPassengers, 
     main = "Monthly International Airline Passengers, 1949 - 1960")
plot(diff(log(AirPassengers)))

With the djia dataset, the trend and heteroscedasticity are removed, leaving a stationary series.

par(mfrow = c(2, 1))
plot(djia$Close)
plot(diff(log(djia$Close)))

Any stationary time series can be written as a linear combination of white noise. ARMA models have this form, so they are a good choice for modeling stationary time series.
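
For example, repeatedly substituting a stationary AR(1), \(X_t = \phi X_{t-1} + W_t\), into itself expresses it as an infinite moving average of white noise:

\[X_t = \phi X_{t-1} + W_t = \phi^2 X_{t-2} + \phi W_{t-1} + W_t = \cdots = \sum_{j=0}^{\infty} \phi^j W_{t-j}, \qquad |\phi| < 1\]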

Use the arima.sim() function to generate data from an ARMA model. Here are some examples of data series generated from the arima.sim() function.

par(mfrow = c(1, 3))

# white noise
WN <- arima.sim(model = list(order = c(0, 0, 0)), n = 200)
plot(WN, main = "WN")

# MA(1) with parameter .9 
MA <- arima.sim(model = list(order = c(0, 0, 1), ma = 0.9), n = 200)
plot(MA, main = "MA(1)")

# AR(2) with parameters 1.5 and -.75
AR <- arima.sim(model = list(order = c(2, 0, 0), ar = c(1.5, -.75)), n = 200)
plot(AR, main = "AR(2)")

AR and MA Models

You cannot identify which model to fit to a time series just by looking at a plot of the data. Below are an AR(1) model and an MA(1) model. Their plots are similar.

par(mfrow = c(1, 2))

ar1 <- arima.sim(model = list(order = c(1, 0, 0), ar = -.7), 
               n = 200)
ma1 <- arima.sim(model = list(order = c(0, 0, 1), ma = -.7), 
               n = 200)
plot(ar1, main = "AR(1)")
plot(ma1, main = "MA(1)")

The tools to identify the model orders are the autocorrelation function (ACF) and the partial autocorrelation function (PACF).

       AR(p)              MA(q)              ARMA(p, q)
ACF    Tails off          Cuts off at lag q  Tails off
PACF   Cuts off at lag p  Tails off          Tails off

Produce these two plots with the acf2() function from the astsa package. The combination of the plots will narrow the model selection to one of the three. Suppose you have an AR(1) model \(X_t = 0.9X_{t-1} + W_t\).

set.seed(123)
x <- arima.sim(model = list(order = c(1, 0, 0), 
                            ar = .9), 
               n = 100) 
plot(x, main = "AR(1)")

The ACF plot below tails off, so it is either an AR model or an ARMA model. The PACF plot cuts off at a lag of 1, so it is an AR(1) model.

acf2(x)

##         ACF  PACF
##  [1,]  0.88  0.88
##  [2,]  0.77 -0.02
##  [3,]  0.67  0.00
##  [4,]  0.55 -0.14
##  [5,]  0.48  0.12
##  [6,]  0.43  0.07
##  [7,]  0.39  0.02
##  [8,]  0.34 -0.12
##  [9,]  0.28 -0.06
## [10,]  0.24  0.05
## [11,]  0.18 -0.06
## [12,]  0.09 -0.17
## [13,]  0.05  0.07
## [14,]  0.01 -0.01
## [15,] -0.03  0.01
## [16,] -0.07 -0.13
## [17,] -0.08  0.07
## [18,] -0.10 -0.02
## [19,] -0.12  0.06
## [20,] -0.11  0.02

Use the sarima() function to fit any ARIMA model to a data series. In this case, fit an AR(1) model. The parameter estimate for ar1 is 0.8781 - close to the 0.9 parameter used in the series generation.

sarima(x, p = 1, d = 0, q = 0)
## initial  value 0.693455 
## iter   2 value -0.060553
## iter   3 value -0.060659
## iter   4 value -0.060698
## iter   5 value -0.060866
## iter   6 value -0.060914
## iter   7 value -0.060927
## iter   8 value -0.060927
## iter   9 value -0.060929
## iter  10 value -0.060931
## iter  11 value -0.060931
## iter  12 value -0.060931
## iter  12 value -0.060931
## iter  12 value -0.060931
## final  value -0.060931 
## converged
## initial  value -0.053164 
## iter   2 value -0.053270
## iter   3 value -0.053886
## iter   4 value -0.053890
## iter   5 value -0.053893
## iter   6 value -0.053893
## iter   7 value -0.053893
## iter   8 value -0.053893
## iter   9 value -0.053893
## iter  10 value -0.053893
## iter  10 value -0.053893
## iter  10 value -0.053893
## final  value -0.053893 
## converged

## $fit
## 
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, 
##     Q), period = S), xreg = xmean, include.mean = FALSE, transform.pars = trans, 
##     fixed = fixed, optim.control = list(trace = trc, REPORT = 1, reltol = tol))
## 
## Coefficients:
##          ar1    xmean
##       0.8781  -0.3569
## s.e.  0.0456   0.7220
## 
## sigma^2 estimated as 0.8847:  log likelihood = -136.5,  aic = 279.01
## 
## $degrees_of_freedom
## [1] 98
## 
## $ttable
##       Estimate     SE t.value p.value
## ar1     0.8781 0.0456 19.2759  0.0000
## xmean  -0.3569 0.7220 -0.4943  0.6222
## 
## $AIC
## [1] 2.79009
## 
## $AICc
## [1] 2.791327
## 
## $BIC
## [1] 2.868245

Try this again with the AR(2) model \(X_t = 1.5X_{t-1} - 0.75X_{t-2} + W_t\). Here is the simulated time series.

set.seed(123)
x <- arima.sim(model = list(order = c(2, 0, 0), 
                            ar = c(1.5, -.75)), 
               n = 200)
plot(x, main = "AR(2)")

The ACF tails off again, so it’s an AR or ARMA. The PACF cuts off at lag 2, so it’s an AR(2).

acf2(x)

##         ACF  PACF
##  [1,]  0.83  0.83
##  [2,]  0.46 -0.73
##  [3,]  0.06 -0.04
##  [4,] -0.27 -0.03
##  [5,] -0.43  0.06
##  [6,] -0.43 -0.08
##  [7,] -0.31  0.04
##  [8,] -0.14 -0.08
##  [9,]  0.01  0.03
## [10,]  0.13  0.07
## [11,]  0.20  0.02
## [12,]  0.22  0.08
## [13,]  0.20 -0.01
## [14,]  0.14 -0.11
## [15,]  0.04 -0.04
## [16,] -0.10 -0.11
## [17,] -0.21  0.10
## [18,] -0.27 -0.09
## [19,] -0.26  0.00
## [20,] -0.18 -0.06
## [21,] -0.08 -0.01
## [22,]  0.01 -0.09
## [23,]  0.07  0.01
## [24,]  0.10  0.10
## [25,]  0.11  0.01

Fit an AR(2) model to the series. The parameter estimate for ar1 equals 1.4328, and for ar2 equals -0.7255, close to the 1.5 and -.75 in the simulated data series.

sarima(x, p = 2, d = 0, q = 0)
## initial  value 0.893841 
## iter   2 value 0.748514
## iter   3 value 0.369992
## iter   4 value 0.187328
## iter   5 value 0.042574
## iter   6 value -0.056822
## iter   7 value -0.069519
## iter   8 value -0.073202
## iter   9 value -0.073471
## iter  10 value -0.073747
## iter  11 value -0.073759
## iter  12 value -0.073760
## iter  13 value -0.073760
## iter  13 value -0.073760
## iter  13 value -0.073760
## final  value -0.073760 
## converged
## initial  value -0.067631 
## iter   2 value -0.067644
## iter   3 value -0.067656
## iter   4 value -0.067657
## iter   5 value -0.067657
## iter   6 value -0.067657
## iter   7 value -0.067657
## iter   7 value -0.067657
## iter   7 value -0.067657
## final  value -0.067657 
## converged

## $fit
## 
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, 
##     Q), period = S), xreg = xmean, include.mean = FALSE, transform.pars = trans, 
##     fixed = fixed, optim.control = list(trace = trc, REPORT = 1, reltol = tol))
## 
## Coefficients:
##          ar1      ar2    xmean
##       1.4328  -0.7255  -0.0565
## s.e.  0.0479   0.0479   0.2244
## 
## sigma^2 estimated as 0.8619:  log likelihood = -270.26,  aic = 548.51
## 
## $degrees_of_freedom
## [1] 197
## 
## $ttable
##       Estimate     SE  t.value p.value
## ar1     1.4328 0.0479  29.9166  0.0000
## ar2    -0.7255 0.0479 -15.1555  0.0000
## xmean  -0.0565 0.2244  -0.2518  0.8014
## 
## $AIC
## [1] 2.742563
## 
## $AICc
## [1] 2.743175
## 
## $BIC
## [1] 2.808529

Repeat this exercise once more, this time with the MA(1) model \(X_t = W_t + 0.8W_{t-1}\). Here is the simulated time series.

set.seed(123)
x <- arima.sim(model = list(order = c(0, 0, 1), 
                            ma = 0.8),
               n = 200)
plot(x)

The ACF cuts off at lag 1 while the PACF tails off, so it is an MA(1).

acf2(x)

##         ACF  PACF
##  [1,]  0.39  0.39
##  [2,] -0.07 -0.27
##  [3,]  0.02  0.20
##  [4,] -0.02 -0.18
##  [5,] -0.02  0.13
##  [6,] -0.02 -0.14
##  [7,]  0.00  0.13
##  [8,]  0.02 -0.09
##  [9,] -0.01  0.06
## [10,]  0.02 -0.02
## [11,] -0.06 -0.09
## [12,] -0.16 -0.10
## [13,] -0.07  0.04
## [14,]  0.05  0.03
## [15,]  0.02 -0.03
## [16,] -0.05 -0.04
## [17,] -0.01  0.03
## [18,]  0.01 -0.04
## [19,] -0.11 -0.12
## [20,] -0.16 -0.06
## [21,] -0.01  0.07
## [22,]  0.06  0.00
## [23,] -0.02 -0.06
## [24,] -0.03 -0.01
## [25,]  0.01  0.02

Fit an MA(1) model to the time series. The parameter estimate for ma1 equals 0.9231, close to the 0.8 in the generated time series.

sarima(x, p = 0, d = 0, q = 1)
## initial  value 0.155021 
## iter   2 value 0.034060
## iter   3 value 0.007456
## iter   4 value -0.026302
## iter   5 value -0.049695
## iter   6 value -0.049854
## iter   7 value -0.049881
## iter   8 value -0.049924
## iter   9 value -0.049952
## iter  10 value -0.049960
## iter  11 value -0.049960
## iter  12 value -0.049960
## iter  13 value -0.049960
## iter  13 value -0.049960
## iter  13 value -0.049960
## final  value -0.049960 
## converged
## initial  value -0.054644 
## iter   2 value -0.055618
## iter   3 value -0.056055
## iter   4 value -0.056369
## iter   5 value -0.056426
## iter   6 value -0.056453
## iter   7 value -0.056455
## iter   8 value -0.056461
## iter   9 value -0.056466
## iter  10 value -0.056466
## iter  11 value -0.056466
## iter  11 value -0.056466
## final  value -0.056466 
## converged

## $fit
## 
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, 
##     Q), period = S), xreg = xmean, include.mean = FALSE, transform.pars = trans, 
##     fixed = fixed, optim.control = list(trace = trc, REPORT = 1, reltol = tol))
## 
## Coefficients:
##          ma1   xmean
##       0.9231  0.0048
## s.e.  0.0402  0.1276
## 
## sigma^2 estimated as 0.8847:  log likelihood = -272.49,  aic = 550.99
## 
## $degrees_of_freedom
## [1] 198
## 
## $ttable
##       Estimate     SE t.value p.value
## ma1     0.9231 0.0402 22.9571    0.00
## xmean   0.0048 0.1276  0.0376    0.97
## 
## $AIC
## [1] 2.754945
## 
## $AICc
## [1] 2.75525
## 
## $BIC
## [1] 2.80442

An ARMA model is an autoregression with correlated errors.

\[X_t = \phi X_{t-1} + W_t + \theta W_{t-1}\]

You can identify an ARMA model from the ACF and PACF plots. Both will decay to zero. However, the PACF will not tell you the p and q orders. The best practice is to start with an ARMA(1, 1) and then add parameters as needed.

Here is simulated data from an ARMA(2,1) process with \(\phi = (1, -.9)\) and \(\theta = .8\).

set.seed(123)
x <- arima.sim(model = list(order = c(2, 0, 1), 
                            ar = c(1, -.9), 
                            ma = .8),
               n = 200)
plot(x)

Both the ACF and PACF decay to zero, so this is an ARMA time series.

acf2(x)

##         ACF  PACF
##  [1,]  0.57  0.57
##  [2,] -0.27 -0.88
##  [3,] -0.78  0.09
##  [4,] -0.61 -0.18
##  [5,]  0.00 -0.09
##  [6,]  0.50 -0.02
##  [7,]  0.54  0.02
##  [8,]  0.17 -0.07
##  [9,] -0.27 -0.01
## [10,] -0.45 -0.02
## [11,] -0.26 -0.02
## [12,]  0.11  0.06
## [13,]  0.35  0.01
## [14,]  0.30 -0.03
## [15,]  0.01  0.02
## [16,] -0.26 -0.03
## [17,] -0.30  0.00
## [18,] -0.09  0.01
## [19,]  0.16 -0.05
## [20,]  0.25  0.02
## [21,]  0.14  0.02
## [22,] -0.07 -0.10
## [23,] -0.21  0.02
## [24,] -0.16  0.07
## [25,]  0.04  0.03

At this point you know you have an ARMA time series, but you don’t know what orders to assign. The right answer, of course, is ARMA(2, 1) because that is the series you generated. The AIC and BIC measures help you decide. Both add a penalty for additional parameters to \(-2\log L\), but the BIC is a little more strict: AIC adds \(2m\) while BIC adds \(m \log(n)\), where \(m\) is the number of estimated parameters. (sarima() reports these values divided by the sample size, which does not change the ranking.) The model with the smallest AIC or BIC is usually best, and the two criteria often agree with each other. Below are the \(3 \times 3 - 1 = 8\) combinations of p and q up to order 2, excluding p = q = 0. The lowest BIC is ARMA(2, 1) (2.961724) - success! The AIC criterion recommends ARMA(2, 2).

x.ar1 <- sarima(x, p = 1, d = 0, q = 0, details = FALSE)
x.ma1 <- sarima(x, p = 0, d = 0, q = 1, details = FALSE)
x.ar2 <- sarima(x, p = 2, d = 0, q = 0, details = FALSE)
x.ma2 <- sarima(x, p = 0, d = 0, q = 2, details = FALSE)
x.arma11 <- sarima(x, p = 1, d = 0, q = 1, details = FALSE)
x.arma21 <- sarima(x, p = 2, d = 0, q = 1, details = FALSE)
x.arma12 <- sarima(x, p = 1, d = 0, q = 2, details = FALSE)
x.arma22 <- sarima(x, p = 2, d = 0, q = 2, details = FALSE)
cat("AR(1) (AIC, BIC): ", x.ar1$AIC, ", ", x.ar1$BIC, "\n")
## AR(1) (AIC, BIC):  5.185697 ,  5.235171
cat("MA(1) (AIC, BIC): ", x.ma1$AIC, ", ", x.ma1$BIC, "\n")
## MA(1) (AIC, BIC):  4.521101 ,  4.570576
cat("AR(2) (AIC, BIC): ", x.ar2$AIC, ", ", x.ar2$BIC, "\n")
## AR(2) (AIC, BIC):  3.257452 ,  3.323418
cat("MA(2) (AIC, BIC): ", x.ma2$AIC, ", ", x.ma2$BIC, "\n")
## MA(2) (AIC, BIC):  3.797553 ,  3.863519
cat("ARMA(1, 1) (AIC, BIC): ", x.arma11$AIC, ", ", x.arma11$BIC, "\n")
## ARMA(1, 1) (AIC, BIC):  4.198092 ,  4.264059
cat("ARMA(2, 1) (AIC, BIC): ", x.arma21$AIC, ", ", x.arma21$BIC, "\n")
## ARMA(2, 1) (AIC, BIC):  2.879266 ,  2.961724
cat("ARMA(1, 2) (AIC, BIC): ", x.arma12$AIC, ", ", x.arma12$BIC, "\n")
## ARMA(1, 2) (AIC, BIC):  3.686392 ,  3.76885
cat("ARMA(2, 2) (AIC, BIC): ", x.arma22$AIC, ", ", x.arma22$BIC, "\n")
## ARMA(2, 2) (AIC, BIC):  2.870311 ,  2.96926
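
A more compact way to build the same comparison (a sketch, not course code, using the simulated series x from above):

orders <- expand.grid(p = 0:2, q = 0:2)[-1, ]  # all (p, q) pairs except (0, 0)
ic <- t(apply(orders, 1, function(o) {
  fit <- sarima(x, p = o["p"], d = 0, q = o["q"], details = FALSE)
  c(o, AIC = fit$AIC, BIC = fit$BIC)
}))
ic[order(ic[, "BIC"]), ]  # rank the candidate models by BIC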

Perform a residual analysis to make sure the residuals are white Gaussian noise.

  • The standardized residuals plot should behave as a white noise sequence with mean zero and variance one. The plot may not convince you that the noise is white, but it will at least alert you if the noise is obviously not white. Look for patterns.
  • The sample ACF of the residuals plot should also look like white noise: 95% of the residuals should be within the dashed bands.
  • The normal Q-Q plot assesses normality. Don’t be put off if the extreme values stray from the line. As long as the pattern follows the line, especially in the middle, the normality assumption is reasonable.
  • The p-values corresponding to the Ljung-Box Q-statistic again test for white noise. As long as most of the points are above the dashed line, it’s safe to assume the noise is white. Otherwise, there is still some correlation left in the residuals and you should fit another model or add a parameter. (A manual version of these checks is sketched below.)
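
Here is a rough base-R equivalent of the checks in sarima()’s diagnostic panel (a sketch, assuming the ARMA(2, 1) series x from above; lag = 20 and fitdf = p + q = 3 are illustrative choices):

fit <- arima(x, order = c(2, 0, 1))
r <- residuals(fit) / sqrt(fit$sigma2)    # standardized residuals
plot(r, main = "Standardized residuals")  # look for patterns
acf(r, main = "ACF of residuals")         # should look like white noise
qqnorm(r); qqline(r)                      # assess normality
Box.test(r, lag = 20, type = "Ljung-Box", fitdf = 3)  # test for remaining correlation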

Here are the diagnostic plots for the ARMA(2, 1) model.

sarima(x, p = 2, d = 0, q = 1)
## initial  value 1.358558 
## iter   2 value 0.596753
## iter   3 value 0.398714
## iter   4 value 0.265589
## iter   5 value 0.196824
## iter   6 value 0.141164
## iter   7 value -0.005829
## iter   8 value -0.010162
## iter   9 value -0.010553
## iter  10 value -0.010570
## iter  11 value -0.010576
## iter  12 value -0.010578
## iter  13 value -0.010578
## iter  14 value -0.010578
## iter  15 value -0.010578
## iter  15 value -0.010578
## iter  15 value -0.010578
## final  value -0.010578 
## converged
## initial  value -0.002361 
## iter   2 value -0.003395
## iter   3 value -0.003693
## iter   4 value -0.004103
## iter   5 value -0.004198
## iter   6 value -0.004260
## iter   7 value -0.004303
## iter   8 value -0.004305
## iter   9 value -0.004306
## iter  10 value -0.004306
## iter  10 value -0.004306
## final  value -0.004306 
## converged

## $fit
## 
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, 
##     Q), period = S), xreg = xmean, include.mean = FALSE, transform.pars = trans, 
##     fixed = fixed, optim.control = list(trace = trc, REPORT = 1, reltol = tol))
## 
## Coefficients:
##          ar1      ar2     ma1   xmean
##       1.0409  -0.8807  0.8417  0.0517
## s.e.  0.0351   0.0342  0.0434  0.1523
## 
## sigma^2 estimated as 0.9601:  log likelihood = -282.93,  aic = 575.85
## 
## $degrees_of_freedom
## [1] 196
## 
## $ttable
##       Estimate     SE  t.value p.value
## ar1     1.0409 0.0351  29.6831  0.0000
## ar2    -0.8807 0.0342 -25.7678  0.0000
## ma1     0.8417 0.0434  19.3974  0.0000
## xmean   0.0517 0.1523   0.3397  0.7344
## 
## $AIC
## [1] 2.879266
## 
## $AICc
## [1] 2.880292
## 
## $BIC
## [1] 2.961724

ARIMA

A time series is ARIMA(p,d,q) if the differenced series (of order d) is ARMA(p, q). Consider the integrated model \(Y_t = 0.9Y_{t-1} + W_t\) where \(Y_t = X_t - X_{t-1}\). The integrated series is non-stationary, but differencing makes it stationary.

x <- arima.sim(model = list(order = c(1, 1, 0), ar = 0.9),
               n = 100)
par(mfrow = c(2, 1))
plot(x,
     main = "Simulated ARIMA(1, 1, 0)")
plot(diff(x),
     main = "Differenced ARIMA(1, 1, 0) = ARMA(1, 0, 0)")

Here is an ARIMA(2, 1, 0) model with drift.

\[Y_t = 1 + 1.5Y_{t-1} - 0.75Y_{t-2} + W_t\] where \(Y_t = X_t - X_{t-1}\).

set.seed(123)
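# Note: arima.sim() ignores a `mean` component inside the `model` list, so this
# call does not actually add the intended drift of 1; the fitted constant below
# is correspondingly near zero and not significant.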
x <- arima.sim(model = list(order = c(2, 1, 0),
                            ar = c(1.5, -0.75),
                            mean = 1),
               n = 250)
plot(x)

Plot the P/ACF functions of the differenced data to determine the model.

acf2(diff(x))

##         ACF  PACF
##  [1,]  0.82  0.82
##  [2,]  0.44 -0.75
##  [3,]  0.01  0.01
##  [4,] -0.31 -0.03
##  [5,] -0.46  0.07
##  [6,] -0.43 -0.06
##  [7,] -0.27  0.04
##  [8,] -0.07 -0.08
##  [9,]  0.09  0.03
## [10,]  0.19  0.10
## [11,]  0.23  0.03
## [12,]  0.22  0.04
## [13,]  0.16 -0.03
## [14,]  0.08 -0.09
## [15,] -0.03 -0.02
## [16,] -0.15 -0.07
## [17,] -0.22  0.09
## [18,] -0.23 -0.08
## [19,] -0.18 -0.01
## [20,] -0.09 -0.04
## [21,]  0.00 -0.01
## [22,]  0.07 -0.04
## [23,]  0.10  0.03
## [24,]  0.10  0.06
## [25,]  0.08 -0.02
## [26,]  0.05 -0.05

The ACF tails off, indicating an AR process. The PACF cuts off at lag 2. Fit an ARIMA(2, 1, 0) model to the series. The estimated parameters for ar1 (1.4524) and ar2 (-0.7598) are close to the simulated series parameters.

sarima(x, p = 2, d = 1, q = 0)
## initial  value 0.925157 
## iter   2 value 0.760964
## iter   3 value 0.395097
## iter   4 value 0.245884
## iter   5 value -0.049048
## iter   6 value -0.052712
## iter   7 value -0.078554
## iter   8 value -0.079542
## iter   9 value -0.080615
## iter  10 value -0.081164
## iter  11 value -0.081194
## iter  12 value -0.081195
## iter  13 value -0.081195
## iter  14 value -0.081195
## iter  15 value -0.081195
## iter  15 value -0.081195
## final  value -0.081195 
## converged
## initial  value -0.076425 
## iter   2 value -0.076433
## iter   3 value -0.076438
## iter   4 value -0.076439
## iter   5 value -0.076439
## iter   6 value -0.076439
## iter   7 value -0.076439
## iter   7 value -0.076439
## iter   7 value -0.076439
## final  value -0.076439 
## converged

## $fit
## 
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, 
##     Q), period = S), xreg = constant, transform.pars = trans, fixed = fixed, 
##     optim.control = list(trace = trc, REPORT = 1, reltol = tol))
## 
## Coefficients:
##          ar1      ar2  constant
##       1.4524  -0.7598    0.0931
## s.e.  0.0408   0.0409    0.1897
## 
## sigma^2 estimated as 0.8485:  log likelihood = -335.62,  aic = 679.25
## 
## $degrees_of_freedom
## [1] 247
## 
## $ttable
##          Estimate     SE  t.value p.value
## ar1        1.4524 0.0408  35.6208  0.0000
## ar2       -0.7598 0.0409 -18.5675  0.0000
## constant   0.0931 0.1897   0.4909  0.6239
## 
## $AIC
## [1] 2.716999
## 
## $AICc
## [1] 2.717389
## 
## $BIC
## [1] 2.773342

Here is some actual data. The globtemp dataset from the astsa package contains annual global temperature deviations through 2015. A plot of the data shows random walk behavior, so you should work with the differenced data.

plot(globtemp)

Plot the P/ACF of the differenced data.

acf2(diff(globtemp))

##         ACF  PACF
##  [1,] -0.24 -0.24
##  [2,] -0.19 -0.26
##  [3,] -0.08 -0.23
##  [4,]  0.20  0.06
##  [5,] -0.15 -0.16
##  [6,] -0.03 -0.09
##  [7,]  0.03 -0.05
##  [8,]  0.14  0.07
##  [9,] -0.16 -0.09
## [10,]  0.11  0.11
## [11,] -0.05 -0.03
## [12,]  0.00 -0.02
## [13,] -0.13 -0.10
## [14,]  0.14  0.02
## [15,] -0.01  0.00
## [16,] -0.08 -0.09
## [17,]  0.00  0.00
## [18,]  0.19  0.11
## [19,] -0.07  0.04
## [20,]  0.02  0.13
## [21,] -0.02  0.09
## [22,]  0.08  0.08

You could say both the ACF and PACF tail off, implying an ARIMA(1, 1, 1). Or you could say the ACF cuts off at lag two while the PACF tails off, implying an ARIMA(0, 1, 2). Or you could say the ACF is tailing off while the PACF cuts off at lag 3, implying an ARIMA(3, 1, 0) model. Try all three.

globtemp.arima111 <- sarima(globtemp, p = 1, d = 1, q = 1,
                            details = FALSE)
globtemp.arima012 <- sarima(globtemp, p = 0, d = 1, q = 2,
                            details = FALSE)
globtemp.arima310 <- sarima(globtemp, p = 3, d = 1, q = 0,
                            details = FALSE)
cat("ARIMA(1,1,1) (AIC, BIC): ", 
    globtemp.arima111$AIC, ", ", 
    globtemp.arima111$BIC, "\n")
## ARIMA(1,1,1) (AIC, BIC):  -1.716773 ,  -1.630691
cat("ARIMA(0,1,2) (AIC, BIC): ", 
    globtemp.arima012$AIC, ", ", 
    globtemp.arima012$BIC, "\n")
## ARIMA(0,1,2) (AIC, BIC):  -1.723268 ,  -1.637185
cat("ARIMA(3,1,0) (AIC, BIC): ", 
    globtemp.arima310$AIC, ", ", 
    globtemp.arima310$BIC, "\n")
## ARIMA(3,1,0) (AIC, BIC):  -1.717413 ,  -1.60981

The ARIMA(0, 1, 2) model has the lowest AIC and BIC.

One way to check an analysis is to overfit the model by adding an extra parameter to see if it makes a difference in the results. If the results change drastically, rethink the model. If the results do not change much, you can be confident that your fit is correct.

Here is an ARIMA(0,1,1) model with MA parameter .9, \(Y_t = Y_{t-1} + W_t + 0.9W_{t-1}\).

set.seed(123)
x <- arima.sim(model = list(order = c(0, 1, 1),
                            ma = 0.9),
               n = 250)
plot(x)

Assess the P/ACF to choose a candidate model for the process.

acf2(diff(x))

##         ACF  PACF
##  [1,]  0.41  0.41
##  [2,] -0.11 -0.33
##  [3,] -0.04  0.20
##  [4,] -0.07 -0.24
##  [5,] -0.06  0.15
##  [6,] -0.03 -0.17
##  [7,]  0.04  0.21
##  [8,]  0.07 -0.15
##  [9,] -0.01  0.10
## [10,] -0.04 -0.12
## [11,] -0.09 -0.02
## [12,] -0.12 -0.09
## [13,] -0.03  0.08
## [14,]  0.08  0.03
## [15,]  0.07 -0.02
## [16,] -0.04 -0.06
## [17,] -0.07 -0.01
## [18,] -0.09 -0.10
## [19,] -0.12 -0.02
## [20,] -0.07 -0.05
## [21,]  0.05  0.12
## [22,]  0.10 -0.06
## [23,] -0.01 -0.03
## [24,] -0.07 -0.05
## [25,] -0.04  0.00
## [26,] -0.06 -0.05

The ACF of the differenced data cuts off at lag 1 while the PACF tails off - clearly an ARIMA(0, 1, 1) model. Fit the model to the data using the standard method.

sarima(x, p = 0, d = 1, q = 1)
## initial  value 0.204719 
## iter   2 value 0.067142
## iter   3 value 0.020355
## iter   4 value 0.008626
## iter   5 value -0.057128
## iter   6 value -0.057335
## iter   7 value -0.059199
## iter   8 value -0.059277
## iter   9 value -0.059432
## iter  10 value -0.059432
## iter  11 value -0.059432
## iter  11 value -0.059432
## final  value -0.059432 
## converged
## initial  value -0.065584 
## iter   2 value -0.066183
## iter   3 value -0.066758
## iter   4 value -0.066899
## iter   5 value -0.067003
## iter   6 value -0.067040
## iter   7 value -0.067044
## iter   8 value -0.067049
## iter   9 value -0.067053
## iter  10 value -0.067055
## iter  11 value -0.067055
## iter  12 value -0.067055
## iter  13 value -0.067055
## iter  13 value -0.067055
## iter  13 value -0.067055
## final  value -0.067055 
## converged

## $fit
## 
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, 
##     Q), period = S), xreg = constant, transform.pars = trans, fixed = fixed, 
##     optim.control = list(trace = trc, REPORT = 1, reltol = tol))
## 
## Coefficients:
##          ma1  constant
##       0.9562   -0.0192
## s.e.  0.0185    0.1149
## 
## sigma^2 estimated as 0.8659:  log likelihood = -337.97,  aic = 681.94
## 
## $degrees_of_freedom
## [1] 248
## 
## $ttable
##          Estimate     SE t.value p.value
## ma1        0.9562 0.0185 51.7312  0.0000
## constant  -0.0192 0.1149 -0.1675  0.8671
## 
## $AIC
## [1] 2.727767
## 
## $AICc
## [1] 2.727961
## 
## $BIC
## [1] 2.770025

The diagnostic plots look good. The t-value for ma1 is 51.7312. AIC is 2.727767 and BIC is 2.770025. Now check the model by overfitting: add a second MA parameter and see whether it makes a difference.

sarima(x, p = 0, d = 1, q = 2)
## initial  value 0.204719 
## iter   2 value 0.045506
## iter   3 value 0.010553
## iter   4 value -0.016998
## iter   5 value -0.053664
## iter   6 value -0.055528
## iter   7 value -0.062292
## iter   8 value -0.064508
## iter   9 value -0.064955
## iter  10 value -0.065426
## iter  11 value -0.065569
## iter  12 value -0.065591
## iter  13 value -0.065596
## iter  14 value -0.065599
## iter  15 value -0.065599
## iter  16 value -0.065599
## iter  17 value -0.065599
## iter  17 value -0.065599
## iter  17 value -0.065599
## final  value -0.065599 
## converged
## initial  value -0.071348 
## iter   2 value -0.072129
## iter   3 value -0.072381
## iter   4 value -0.072487
## iter   5 value -0.072545
## iter   6 value -0.072550
## iter   7 value -0.072553
## iter   8 value -0.072553
## iter   8 value -0.072553
## final  value -0.072553 
## converged

## $fit
## 
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, 
##     Q), period = S), xreg = constant, transform.pars = trans, fixed = fixed, 
##     optim.control = list(trace = trc, REPORT = 1, reltol = tol))
## 
## Coefficients:
##          ma1      ma2  constant
##       0.8481  -0.1111   -0.0182
## s.e.  0.0668   0.0665    0.1015
## 
## sigma^2 estimated as 0.8565:  log likelihood = -336.6,  aic = 681.19
## 
## $degrees_of_freedom
## [1] 247
## 
## $ttable
##          Estimate     SE t.value p.value
## ma1        0.8481 0.0668 12.6980  0.0000
## ma2       -0.1111 0.0665 -1.6716  0.0959
## constant  -0.0182 0.1015 -0.1793  0.8579
## 
## $AIC
## [1] 2.724771
## 
## $AICc
## [1] 2.725161
## 
## $BIC
## [1] 2.781114

The t-value for ma1 is still significant (t = 12.698, p < 0.0001), but the t-value for ma2 is not (t = -1.6716, p = 0.0959). The AIC decreased slightly (AIC = 2.724771), but the BIC increased (BIC = 2.781114). The extra parameter does not change the results much, so keep the simpler ARIMA(0, 1, 1) fit.

Forecast a data series with the sarima.for() function from the astsa package. Forecast the following ARIMA(1,1,0) model with AR parameter 0.9, \(Y_t = 0.9Y_{t-1} + W_t\) where \(Y_t = X_t - X_{t-1}\). Here is the whole data series, 120 points.

set.seed(123)
x <- arima.sim(model = list(order = c(1, 1, 0),
                            ar = 0.9),
               n = 120)
plot(x)

Propose a model from the P/ACF plots of the differenced data using the first 100 data points.

acf2(diff(window(x, end = 100)))

##         ACF  PACF
##  [1,]  0.88  0.88
##  [2,]  0.77 -0.03
##  [3,]  0.67  0.00
##  [4,]  0.55 -0.14
##  [5,]  0.48  0.13
##  [6,]  0.43  0.06
##  [7,]  0.39  0.02
##  [8,]  0.34 -0.13
##  [9,]  0.28 -0.04
## [10,]  0.23  0.03
## [11,]  0.17 -0.06
## [12,]  0.09 -0.16
## [13,]  0.04  0.07
## [14,]  0.00 -0.01
## [15,] -0.03  0.01
## [16,] -0.07 -0.13
## [17,] -0.08  0.08
## [18,] -0.11 -0.04
## [19,] -0.12  0.08
## [20,] -0.11 -0.02

The ACF plot tails off while the PACF plot cuts off at lag 1, an ARIMA(1,1,0) model.

sarima(window(x, end = 100), p = 1, d = 1, q = 0)
## initial  value 0.698011 
## iter   2 value -0.056080
## iter   3 value -0.056204
## iter   4 value -0.056235
## iter   5 value -0.056355
## iter   6 value -0.056400
## iter   7 value -0.056413
## iter   8 value -0.056414
## iter   9 value -0.056416
## iter  10 value -0.056418
## iter  11 value -0.056418
## iter  12 value -0.056418
## iter  12 value -0.056418
## iter  12 value -0.056418
## final  value -0.056418 
## converged
## initial  value -0.048759 
## iter   2 value -0.048866
## iter   3 value -0.049476
## iter   4 value -0.049477
## iter   5 value -0.049479
## iter   6 value -0.049479
## iter   6 value -0.049479
## final  value -0.049479 
## converged

## $fit
## 
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, 
##     Q), period = S), xreg = constant, transform.pars = trans, fixed = fixed, 
##     optim.control = list(trace = trc, REPORT = 1, reltol = tol))
## 
## Coefficients:
##          ar1  constant
##       0.8777   -0.3320
## s.e.  0.0458    0.7266
## 
## sigma^2 estimated as 0.8924:  log likelihood = -135.58,  aic = 277.15
## 
## $degrees_of_freedom
## [1] 97
## 
## $ttable
##          Estimate     SE t.value p.value
## ar1        0.8777 0.0458 19.1841  0.0000
## constant  -0.3320 0.7266 -0.4569  0.6487
## 
## $AIC
## [1] 2.799526
## 
## $AICc
## [1] 2.800788
## 
## $BIC
## [1] 2.878166

The diagnostic plots look good, and the parameter estimation for ar1 is significant (ar1 = 0.8777, t = 19.1841, p < 0.0001). Forecast the next 20 points in the series.

sarima.for(window(x, end = 100), 
           n.ahead = 20, 
           p = 1, d = 1, q = 0) 
## $pred
## Time Series:
## Start = 101 
## End = 120 
## Frequency = 1 
##  [1] -42.77585 -43.44865 -44.07820 -44.67005 -45.22901 -45.75930 -46.26459
##  [8] -46.74808 -47.21257 -47.66049 -48.09396 -48.51483 -48.92471 -49.32502
## [15] -49.71697 -50.10165 -50.47997 -50.85275 -51.22071 -51.58446
## 
## $se
## Time Series:
## Start = 101 
## End = 120 
## Frequency = 1 
##  [1]  0.9444305  2.0260603  3.2427648  4.5394601  5.8809516  7.2436292
##  [7]  8.6111796  9.9721691 11.3185611 12.6447490 13.9468964 15.2224718
## [13] 16.4699118 17.6883743 18.8775531 20.0375397 21.1687167 22.2716777
## [19] 23.3471653 24.3960239
lines(x)  

Here is the same process with real-life data, the globtemp dataset. Forecast 35 years ahead with the ARIMA(0,1,2) model.

sarima.for(globtemp, n.ahead = 35, p = 0, d = 1, q = 2)

## $pred
## Time Series:
## Start = 2016 
## End = 2050 
## Frequency = 1 
##  [1] 0.7995567 0.7745381 0.7816919 0.7888457 0.7959996 0.8031534 0.8103072
##  [8] 0.8174611 0.8246149 0.8317688 0.8389226 0.8460764 0.8532303 0.8603841
## [15] 0.8675379 0.8746918 0.8818456 0.8889995 0.8961533 0.9033071 0.9104610
## [22] 0.9176148 0.9247687 0.9319225 0.9390763 0.9462302 0.9533840 0.9605378
## [29] 0.9676917 0.9748455 0.9819994 0.9891532 0.9963070 1.0034609 1.0106147
## 
## $se
## Time Series:
## Start = 2016 
## End = 2050 
## Frequency = 1 
##  [1] 0.09909556 0.11564576 0.12175580 0.12757353 0.13313729 0.13847769
##  [7] 0.14361964 0.14858376 0.15338730 0.15804492 0.16256915 0.16697084
## [13] 0.17125943 0.17544322 0.17952954 0.18352490 0.18743511 0.19126540
## [19] 0.19502047 0.19870459 0.20232164 0.20587515 0.20936836 0.21280424
## [25] 0.21618551 0.21951471 0.22279416 0.22602604 0.22921235 0.23235497
## [31] 0.23545565 0.23851603 0.24153763 0.24452190 0.24747019

Seasonal ARIMA

A pure seasonal ARMA time series is correlated at the seasonal lags only. The ACF and PACF behave just as their nonseasonal counterparts, but at the seasonal lags, 1S, 2S, …, where S is the seasonal period (S = 12 for monthly data). Denote a seasonal AR model with a 12-month seasonal period with the notation \(SAR(1)_{12}\). Here is a 12-month pure seasonal model, a \(SARMA(P = 1, Q = 1)_{S = 12}\).

\[Y_t = 0.9Y_{t-12} + W_t + 0.5W_{t-12}\]
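
arima.sim() has no seasonal arguments, but you can simulate this pure seasonal model by embedding the lag-12 coefficients in sparse AR and MA vectors (a sketch, not course code):

set.seed(123)
# SARMA(1, 1) with S = 12: only the lag-12 AR and MA coefficients are nonzero
sx <- arima.sim(model = list(order = c(12, 0, 12),
                             ar = c(rep(0, 11), 0.9),
                             ma = c(rep(0, 11), 0.5)),
                n = 360)
acf2(sx, max.lag = 48)  # correlation appears at the seasonal lags 12, 24, 36, 48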

Mixed seasonal models are more common. Here is a \(SARIMA(0,0,1)\times(1,0,0)_{12}\) model.

\[X_t = \Phi X_{t-12} + W_t + \theta W_{t-1}\]

The SAR(1) part is the relation to last year’s value \(X_{t-12}\). The MA(1) part is the relation to last month’s shock \(W_{t-1}\).

Here is how you would apply this to the AirPassengers series. Recall that AirPassengers contains monthly international air passenger totals from 1949 to 1960.

plot(AirPassengers, 
     main = "International Air Passengers, 1949 - 1960")

The variance is increasing, so log transform the data.

plot(log(AirPassengers), 
     main = "International Air Passengers, 1949 - 1960")

There is an obvious trend, so difference the data as well.

plot(diff(log(AirPassengers)), 
     main = "International Air Passengers, 1949 - 1960")

The log-diffed series still shows seasonal persistence, with one cycle per year, so take a seasonal difference as well.

plot(diff(diff(log(AirPassengers)), lag = 12), 
     main = "International Air Passengers, 1949 - 1960")

At this point we have d = 1, D = 1, S = 12. Now look at the P/ACF plots.

ddlx <- diff(diff(log(AirPassengers)), lag = 12)
acf2(ddlx)

##         ACF  PACF
##  [1,] -0.34 -0.34
##  [2,]  0.11 -0.01
##  [3,] -0.20 -0.19
##  [4,]  0.02 -0.13
##  [5,]  0.06  0.03
##  [6,]  0.03  0.03
##  [7,] -0.06 -0.06
##  [8,]  0.00 -0.02
##  [9,]  0.18  0.23
## [10,] -0.08  0.04
## [11,]  0.06  0.05
## [12,] -0.39 -0.34
## [13,]  0.15 -0.11
## [14,] -0.06 -0.08
## [15,]  0.15 -0.02
## [16,] -0.14 -0.14
## [17,]  0.07  0.03
## [18,]  0.02  0.11
## [19,] -0.01 -0.01
## [20,] -0.12 -0.17
## [21,]  0.04  0.13
## [22,] -0.09 -0.07
## [23,]  0.22  0.14
## [24,] -0.02 -0.07
## [25,] -0.10 -0.10
## [26,]  0.05 -0.01
## [27,] -0.03  0.04
## [28,]  0.05 -0.09
## [29,] -0.02  0.05
## [30,] -0.05  0.00
## [31,] -0.05 -0.10
## [32,]  0.20 -0.02
## [33,] -0.12  0.01
## [34,]  0.08 -0.02
## [35,] -0.15  0.02
## [36,] -0.01 -0.16
## [37,]  0.05 -0.03
## [38,]  0.03  0.01
## [39,] -0.02  0.05
## [40,] -0.03 -0.08
## [41,] -0.07 -0.17
## [42,]  0.10  0.07
## [43,] -0.09 -0.10
## [44,]  0.03 -0.06
## [45,] -0.04 -0.03
## [46,] -0.04 -0.12
## [47,]  0.11 -0.01
## [48,] -0.05 -0.05

The seasonal ACF cuts off at lag 1S (S = 12). The PACF tails off at each of the lags, 1S, 2S, 3S, etc. This suggests a seasonal MA(1), so P = 0, Q = 1. Next, the non-seasonal lags tail off in both the ACF and PACF, suggesting an ARMA(1,1), so p = 1, and q = 1. Putting this together, we have a \(SARIMA(1,1,1)\times(0,1,1)_{12}\) model.

AirPassengers.fit1 <- sarima(log(AirPassengers), 
                             p = 1, d = 1, q = 1,
                             P = 0, D = 1, Q = 1,
                             S = 12)
## initial  value -3.085211 
## iter   2 value -3.225399
## iter   3 value -3.276697
## iter   4 value -3.276902
## iter   5 value -3.282134
## iter   6 value -3.282524
## iter   7 value -3.282990
## iter   8 value -3.286319
## iter   9 value -3.286413
## iter  10 value -3.288141
## iter  11 value -3.288262
## iter  12 value -3.288394
## iter  13 value -3.288768
## iter  14 value -3.288969
## iter  15 value -3.289089
## iter  16 value -3.289094
## iter  17 value -3.289100
## iter  17 value -3.289100
## iter  17 value -3.289100
## final  value -3.289100 
## converged
## initial  value -3.288388 
## iter   2 value -3.288459
## iter   3 value -3.288530
## iter   4 value -3.288649
## iter   5 value -3.288753
## iter   6 value -3.288781
## iter   7 value -3.288784
## iter   7 value -3.288784
## iter   7 value -3.288784
## final  value -3.288784 
## converged

AirPassengers.fit1$ttable
##      Estimate     SE t.value p.value
## ar1    0.1960 0.2475  0.7921  0.4298
## ma1   -0.5784 0.2132 -2.7127  0.0076
## sma1  -0.5643 0.0747 -7.5544  0.0000

The model fit shows the non-seasonal AR component (ar1) is not significant, so drop it.

AirPassengers.fit1 <- sarima(log(AirPassengers), 
                             p = 0, d = 1, q = 1,
                             P = 0, D = 1, Q = 1,
                             S = 12)
## initial  value -3.086228 
## iter   2 value -3.267980
## iter   3 value -3.279950
## iter   4 value -3.285996
## iter   5 value -3.289332
## iter   6 value -3.289665
## iter   7 value -3.289672
## iter   8 value -3.289676
## iter   8 value -3.289676
## iter   8 value -3.289676
## final  value -3.289676 
## converged
## initial  value -3.286464 
## iter   2 value -3.286855
## iter   3 value -3.286872
## iter   4 value -3.286874
## iter   4 value -3.286874
## iter   4 value -3.286874
## final  value -3.286874 
## converged

AirPassengers.fit1$ttable
##      Estimate     SE t.value p.value
## ma1   -0.4018 0.0896 -4.4825       0
## sma1  -0.5569 0.0731 -7.6190       0

Now forecast two years beyond the end of the data. The shading shows the ±1 and ±2 root mean square prediction error bounds.

sarima.for(log(AirPassengers), 
           p = 0, d = 1, q = 1,
           P = 0, D = 1, Q = 1,
           S = 12,
           n.ahead = 24)

## $pred
##           Jan      Feb      Mar      Apr      May      Jun      Jul
## 1961 6.110186 6.053775 6.171715 6.199300 6.232556 6.368779 6.507294
## 1962 6.206435 6.150025 6.267964 6.295550 6.328805 6.465028 6.603543
##           Aug      Sep      Oct      Nov      Dec
## 1961 6.502906 6.324698 6.209008 6.063487 6.168025
## 1962 6.599156 6.420947 6.305257 6.159737 6.264274
## 
## $se
##             Jan        Feb        Mar        Apr        May        Jun
## 1961 0.03671562 0.04278291 0.04809072 0.05286830 0.05724856 0.06131670
## 1962 0.09008475 0.09549708 0.10061869 0.10549195 0.11014981 0.11461854
##             Jul        Aug        Sep        Oct        Nov        Dec
## 1961 0.06513124 0.06873441 0.07215787 0.07542612 0.07855851 0.08157070
## 1962 0.11891946 0.12307018 0.12708540 0.13097758 0.13475740 0.13843405

Try this again with the unemp dataset from the astsa package.

plot(unemp,
     main = "Monthly U.S. Unemployment, 1948-1978")

The dataset exhibits a trend, so remove it with differencing. The P/ACF plots of the differenced data indicate a persistent 12-month seasonal pattern.

unemp.d <- diff(unemp)
acf2(unemp.d)

##         ACF  PACF
##  [1,]  0.03  0.03
##  [2,] -0.15 -0.15
##  [3,] -0.25 -0.25
##  [4,] -0.08 -0.10
##  [5,]  0.27  0.21
##  [6,] -0.12 -0.23
##  [7,]  0.24  0.32
##  [8,] -0.11 -0.11
##  [9,] -0.27 -0.30
## [10,] -0.20 -0.19
## [11,]  0.00  0.05
## [12,]  0.76  0.64
## [13,] -0.02 -0.15
## [14,] -0.21 -0.22
## [15,] -0.28 -0.06
## [16,] -0.12 -0.16
## [17,]  0.23 -0.09
## [18,] -0.14 -0.09
## [19,]  0.23  0.08
## [20,] -0.12 -0.04
## [21,] -0.24  0.00
## [22,] -0.18  0.02
## [23,] -0.01 -0.09
## [24,]  0.72  0.21
## [25,] -0.01 -0.01
## [26,] -0.19 -0.09
## [27,] -0.26 -0.01
## [28,] -0.10 -0.02
## [29,]  0.23 -0.06
## [30,] -0.13 -0.07
## [31,]  0.22 -0.05
## [32,] -0.12 -0.06
## [33,] -0.22 -0.01
## [34,] -0.16  0.04
## [35,] -0.03 -0.08
## [36,]  0.69  0.13
## [37,] -0.02 -0.04
## [38,] -0.15  0.03
## [39,] -0.24  0.02
## [40,] -0.09 -0.06
## [41,]  0.19 -0.16
## [42,] -0.12  0.02
## [43,]  0.19 -0.08
## [44,] -0.11 -0.03
## [45,] -0.23 -0.07
## [46,] -0.15  0.01
## [47,] -0.01  0.05
## [48,]  0.66  0.10

Remove the seasonal component with a 12-month lag difference. Now examine the P/ACF plots to fit an SARIMA model.

unemp.dd <- diff(unemp.d, lag = 12)
acf2(unemp.dd)

##         ACF  PACF
##  [1,]  0.21  0.21
##  [2,]  0.33  0.29
##  [3,]  0.15  0.05
##  [4,]  0.17  0.05
##  [5,]  0.10  0.01
##  [6,]  0.06 -0.02
##  [7,] -0.06 -0.12
##  [8,] -0.02 -0.03
##  [9,] -0.09 -0.05
## [10,] -0.17 -0.15
## [11,] -0.08  0.02
## [12,] -0.48 -0.43
## [13,] -0.18 -0.02
## [14,] -0.16  0.15
## [15,] -0.11  0.03
## [16,] -0.15 -0.04
## [17,] -0.09 -0.01
## [18,] -0.09  0.00
## [19,]  0.03  0.01
## [20,] -0.01  0.01
## [21,]  0.02 -0.01
## [22,] -0.02 -0.16
## [23,]  0.01  0.01
## [24,] -0.02 -0.27
## [25,]  0.09  0.05
## [26,] -0.05 -0.01
## [27,] -0.01 -0.05
## [28,]  0.03  0.05
## [29,]  0.08  0.09
## [30,]  0.01 -0.04
## [31,]  0.03  0.02
## [32,] -0.05 -0.07
## [33,]  0.01 -0.01
## [34,]  0.02 -0.08
## [35,] -0.06 -0.08
## [36,] -0.02 -0.23
## [37,] -0.12 -0.08
## [38,]  0.01  0.06
## [39,] -0.03 -0.07
## [40,] -0.03 -0.01
## [41,] -0.10  0.03
## [42,] -0.02 -0.03
## [43,] -0.13 -0.11
## [44,]  0.00 -0.04
## [45,] -0.06  0.01
## [46,]  0.01  0.00
## [47,]  0.02 -0.03
## [48,]  0.11 -0.04

The seasonal component cuts off at lag 1S in the ACF while the PACF tails off at lags 1S, 2S, 3S, etc., so the seasonal part is an MA(1) and the seasonal parameters will be (0, 1, 1). Looking at the first 12 lags, the ACF tails off and the PACF cuts off at lag 2, so the nonseasonal part is an AR(2) and the nonseasonal parameters will be (2, 1, 0). The SARIMA model is \(SARIMA(2,1,0)\times(0,1,1)_{12}\).

Fit the \(SARIMA(2,1,0)\times(0,1,1)_{12}\) model. The ACF of the residuals is within the bands, the normal Q-Q plot is linear, and the p-values for the Ljung-Box statistic are above the dashed line. This is a good fit.

sarima(unemp, p = 2, d = 1, q = 0, P = 0, D = 1, Q = 1, S = 12)
## initial  value 3.340809 
## iter   2 value 3.105512
## iter   3 value 3.086631
## iter   4 value 3.079778
## iter   5 value 3.069447
## iter   6 value 3.067659
## iter   7 value 3.067426
## iter   8 value 3.067418
## iter   8 value 3.067418
## final  value 3.067418 
## converged
## initial  value 3.065481 
## iter   2 value 3.065478
## iter   3 value 3.065477
## iter   3 value 3.065477
## iter   3 value 3.065477
## final  value 3.065477 
## converged

## $fit
## 
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, 
##     Q), period = S), include.mean = !no.constant, transform.pars = trans, fixed = fixed, 
##     optim.control = list(trace = trc, REPORT = 1, reltol = tol))
## 
## Coefficients:
##          ar1     ar2     sma1
##       0.1351  0.2464  -0.6953
## s.e.  0.0513  0.0515   0.0381
## 
## sigma^2 estimated as 449.6:  log likelihood = -1609.91,  aic = 3227.81
## 
## $degrees_of_freedom
## [1] 356
## 
## $ttable
##      Estimate     SE  t.value p.value
## ar1    0.1351 0.0513   2.6326  0.0088
## ar2    0.2464 0.0515   4.7795  0.0000
## sma1  -0.6953 0.0381 -18.2362  0.0000
## 
## $AIC
## [1] 8.723811
## 
## $AICc
## [1] 8.723988
## 
## $BIC
## [1] 8.765793

Now forecast the series for three years.

sarima.for(unemp, p = 2, d = 1, q = 0, P = 0, D = 1, Q = 1, S = 12, n.ahead = 36)

## $pred
##           Jan      Feb      Mar      Apr      May      Jun      Jul
## 1979 676.4664 685.1172 653.2388 585.6939 553.8813 664.4072 647.0657
## 1980 683.3045 687.7649 654.8658 586.1507 553.9285 664.1108 646.6220
## 1981 682.6406 687.0977 654.1968 585.4806 553.2579 663.4398 645.9508
##           Aug      Sep      Oct      Nov      Dec
## 1979 611.0828 594.6414 569.3997 587.5801 581.1833
## 1980 610.5345 594.0427 568.7684 586.9320 580.5249
## 1981 609.8632 593.3713 568.0970 586.2606 579.8535
## 
## $se
##            Jan       Feb       Mar       Apr       May       Jun       Jul
## 1979  21.20465  32.07710  43.70167  53.66329  62.85364  71.12881  78.73590
## 1980 116.99599 124.17344 131.51281 138.60466 145.49706 152.12863 158.52302
## 1981 194.25167 201.10648 208.17066 215.11503 221.96039 228.64285 235.16874
##            Aug       Sep       Oct       Nov       Dec
## 1979  85.75096  92.28663  98.41329 104.19488 109.67935
## 1980 164.68623 170.63839 176.39520 181.97333 187.38718
## 1981 241.53258 247.74268 253.80549 259.72970 265.52323

Try again with the chicken dataset from the astsa package.

plot(chicken,
     main = "Monthly Whole bird spot price, 8/2001 - 7/2016",
     ylab = "US cents per pound")

First remove the trend component.

chicken.d <- diff(chicken)
plot(chicken.d,
     main = "Monthly Whole bird spot price, 8/2001 - 7/2016",
     ylab = "diff(US cents per pound)")

Examine the P/ACF plots. The nonseasonal component tails off in the ACF and cuts off at lag 2 in the PACF, an AR(2). The seasonal component tails off in the ACF and cuts off at lag 1S in the PACF, a seasonal AR(1). The SARIMA model is \(SARIMA(2,1,0)\times(1,0,0)_{12}\).

acf2(chicken.d)

##         ACF  PACF
##  [1,]  0.72  0.72
##  [2,]  0.39 -0.29
##  [3,]  0.09 -0.14
##  [4,] -0.07  0.03
##  [5,] -0.16 -0.10
##  [6,] -0.20 -0.06
##  [7,] -0.27 -0.19
##  [8,] -0.23  0.12
##  [9,] -0.11  0.10
## [10,]  0.09  0.16
## [11,]  0.26  0.09
## [12,]  0.33  0.00
## [13,]  0.20 -0.22
## [14,]  0.07  0.03
## [15,] -0.03  0.03
## [16,] -0.10 -0.11
## [17,] -0.19 -0.09
## [18,] -0.25  0.01
## [19,] -0.29 -0.03
## [20,] -0.20  0.07
## [21,] -0.08 -0.04
## [22,]  0.08  0.06
## [23,]  0.16 -0.05
## [24,]  0.18  0.02
## [25,]  0.08 -0.14
## [26,] -0.06 -0.19
## [27,] -0.21 -0.13
## [28,] -0.31 -0.06
## [29,] -0.40 -0.08
## [30,] -0.40 -0.05
## [31,] -0.33  0.01
## [32,] -0.18  0.03
## [33,]  0.02  0.10
## [34,]  0.20  0.02
## [35,]  0.30 -0.01
## [36,]  0.35  0.09
## [37,]  0.26 -0.12
## [38,]  0.13  0.01
## [39,] -0.02 -0.01
## [40,] -0.14 -0.05
## [41,] -0.23  0.02
## [42,] -0.21  0.12
## [43,] -0.18 -0.05
## [44,] -0.11 -0.13
## [45,] -0.03 -0.07
## [46,]  0.08  0.01
## [47,]  0.21  0.14
## [48,]  0.33  0.05

Fit the \(SARIMA(2,1,0)\times(1,0,0)_{12}\) model.

sarima(chicken, p = 2, d = 1, q = 0, P = 1, D = 0, Q = 0, S = 12)
## initial  value 0.015039 
## iter   2 value -0.226398
## iter   3 value -0.412955
## iter   4 value -0.460882
## iter   5 value -0.470787
## iter   6 value -0.471082
## iter   7 value -0.471088
## iter   8 value -0.471090
## iter   9 value -0.471092
## iter  10 value -0.471095
## iter  11 value -0.471095
## iter  12 value -0.471096
## iter  13 value -0.471096
## iter  14 value -0.471096
## iter  15 value -0.471097
## iter  16 value -0.471097
## iter  16 value -0.471097
## iter  16 value -0.471097
## final  value -0.471097 
## converged
## initial  value -0.473585 
## iter   2 value -0.473664
## iter   3 value -0.473721
## iter   4 value -0.473823
## iter   5 value -0.473871
## iter   6 value -0.473885
## iter   7 value -0.473886
## iter   8 value -0.473886
## iter   8 value -0.473886
## iter   8 value -0.473886
## final  value -0.473886 
## converged

## $fit
## 
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, 
##     Q), period = S), xreg = constant, transform.pars = trans, fixed = fixed, 
##     optim.control = list(trace = trc, REPORT = 1, reltol = tol))
## 
## Coefficients:
##          ar1      ar2    sar1  constant
##       0.9154  -0.2494  0.3237    0.2353
## s.e.  0.0733   0.0739  0.0715    0.1973
## 
## sigma^2 estimated as 0.3828:  log likelihood = -169.16,  aic = 348.33
## 
## $degrees_of_freedom
## [1] 175
## 
## $ttable
##          Estimate     SE t.value p.value
## ar1        0.9154 0.0733 12.4955  0.0000
## ar2       -0.2494 0.0739 -3.3728  0.0009
## sar1       0.3237 0.0715  4.5238  0.0000
## constant   0.2353 0.1973  1.1923  0.2347
## 
## $AIC
## [1] 1.945971
## 
## $AICc
## [1] 1.947256
## 
## $BIC
## [1] 2.035005

Now forecast 5 years.

sarima.for(chicken, p = 2, d = 1, q = 0, P = 1, D = 0, Q = 0, S = 12, n.ahead = 60)

## $pred
##           Jan      Feb      Mar      Apr      May      Jun      Jul
## 2016                                                               
## 2017 110.5358 110.5612 110.5480 110.7055 111.0047 111.1189 111.1552
## 2018 111.8108 111.9782 112.1330 112.3431 112.5991 112.7952 112.9661
## 2019 114.1331 114.3464 114.5556 114.7827 115.0247 115.2473 115.4617
## 2020 116.7942 117.0224 117.2492 117.4819 117.7193 117.9505 118.1790
## 2021 119.5651 119.7980 120.0306 120.2650 120.5010 120.7350 120.9681
##           Aug      Sep      Oct      Nov      Dec
## 2016 111.0907 110.8740 110.6853 110.5045 110.5527
## 2017 111.1948 111.2838 111.3819 111.4825 111.6572
## 2018 113.1380 113.3260 113.5168 113.7085 113.9242
## 2019 115.6765 115.8965 116.1174 116.3386 116.5675
## 2020 118.4077 118.6380 118.8686 119.0993 119.3326
## 2021                                             
## 
## $se
##             Jan        Feb        Mar        Apr        May        Jun
## 2016                                                                  
## 2017  3.7414959  4.1793190  4.5747009  4.9373266  5.2742129  5.5903499
## 2018  8.2010253  8.5605811  8.9054714  9.2372195  9.5572539  9.8667955
## 2019 12.0038164 12.2921541 12.5738417 12.8492868 13.1188976 13.3830477
## 2020 15.1557253 15.3959082 15.6323906 15.8653300 16.0948844 16.3212022
## 2021 17.8397890 18.0473081 18.2524651 18.4553364 18.6559977 18.8545213
##             Jul        Aug        Sep        Oct        Nov        Dec
## 2016             0.6187194  1.3368594  2.0462419  2.6867986  3.2486625
## 2017  5.8893133  6.2367345  6.6253573  7.0309771  7.4344077  7.8255932
## 2018 10.1668604 10.4736807 10.7857727 11.0980056 11.4063211 11.7085266
## 2019 13.6420693 13.9002670 14.1573839 14.4122197 14.6638269 14.9117124
## 2020 16.5444204 16.7657634 16.9852163 17.2025022 17.4174076 17.6298379
## 2021 19.0509752

Here is one more. The birth dataset from the astsa package contains monthly live births in thousands for the United States, 1948-1979.

plot(birth, 
     main = "Monthly Live Births in U.S., 1948 - 1979")

Remove the trend by differencing.

birth.d <- diff(birth)
plot(birth.d)

Examine the P/ACF. Notice the seasonal persistence (the spikes at the seasonal lags 12, 24, 36, and 48, i.e., years 1 through 4 on the plot).

acf2(birth.d)

##         ACF  PACF
##  [1,] -0.32 -0.32
##  [2,]  0.16  0.06
##  [3,] -0.08 -0.01
##  [4,] -0.19 -0.25
##  [5,]  0.09 -0.03
##  [6,] -0.28 -0.26
##  [7,]  0.06 -0.17
##  [8,] -0.19 -0.29
##  [9,] -0.05 -0.35
## [10,]  0.17 -0.16
## [11,] -0.26 -0.59
## [12,]  0.82  0.57
## [13,] -0.28  0.13
## [14,]  0.17  0.11
## [15,] -0.07  0.13
## [16,] -0.18  0.09
## [17,]  0.08  0.00
## [18,] -0.28  0.00
## [19,]  0.07  0.05
## [20,] -0.18  0.04
## [21,] -0.05 -0.07
## [22,]  0.16 -0.10
## [23,] -0.24 -0.20
## [24,]  0.78  0.19
## [25,] -0.27  0.01
## [26,]  0.19  0.05
## [27,] -0.08  0.07
## [28,] -0.17  0.07
## [29,]  0.07 -0.02
## [30,] -0.29 -0.06
## [31,]  0.07 -0.02
## [32,] -0.15  0.09
## [33,] -0.04  0.03
## [34,]  0.14 -0.06
## [35,] -0.24 -0.16
## [36,]  0.75  0.03
## [37,] -0.23  0.08
## [38,]  0.16 -0.10
## [39,] -0.08 -0.03
## [40,] -0.15  0.07
## [41,]  0.05 -0.04
## [42,] -0.25  0.06
## [43,]  0.06  0.04
## [44,] -0.18 -0.07
## [45,] -0.03 -0.06
## [46,]  0.15  0.02
## [47,] -0.22 -0.04
## [48,]  0.72  0.10

Remove the seasonal persistence with a 12-month lag difference. The nonseasonal ACF cuts off at lag 1 and the PACF tails off, so MA(1). The seasonal ACF cuts off at lag 1S and the PACF tails off, so seasonal MA(1). The model is \(SARIMA(0,1,1)\times(0,1,1)_{12}\).

birth.dd <- diff(birth.d, lag = 12)
acf2(birth.dd, max.lag = 60)

##         ACF  PACF
##  [1,] -0.30 -0.30
##  [2,] -0.09 -0.20
##  [3,] -0.09 -0.21
##  [4,]  0.00 -0.14
##  [5,]  0.07 -0.03
##  [6,]  0.03  0.02
##  [7,] -0.07 -0.06
##  [8,] -0.04 -0.08
##  [9,]  0.11  0.06
## [10,]  0.04  0.08
## [11,]  0.13  0.23
## [12,] -0.43 -0.32
## [13,]  0.14 -0.06
## [14,] -0.01 -0.13
## [15,]  0.03 -0.13
## [16,]  0.01 -0.11
## [17,]  0.02  0.02
## [18,]  0.00  0.06
## [19,]  0.03  0.04
## [20,] -0.07 -0.10
## [21,] -0.01  0.02
## [22,]  0.00  0.00
## [23,]  0.06  0.17
## [24,] -0.01 -0.13
## [25,] -0.12 -0.14
## [26,]  0.17  0.07
## [27,] -0.04 -0.04
## [28,]  0.03 -0.02
## [29,] -0.05  0.02
## [30,] -0.09 -0.06
## [31,] -0.01 -0.07
## [32,]  0.19  0.05
## [33,] -0.03  0.07
## [34,] -0.09 -0.06
## [35,] -0.02  0.05
## [36,] -0.04 -0.16
## [37,]  0.17 -0.01
## [38,] -0.14 -0.04
## [39,]  0.03 -0.01
## [40,] -0.05 -0.03
## [41,]  0.03 -0.01
## [42,]  0.10  0.01
## [43,]  0.00  0.00
## [44,] -0.10  0.03
## [45,] -0.03 -0.02
## [46,]  0.06 -0.07
## [47,]  0.02  0.05
## [48,]  0.01 -0.11
## [49,] -0.01  0.05
## [50,]  0.06  0.06
## [51,] -0.08 -0.03
## [52,]  0.03 -0.03
## [53,]  0.01  0.04
## [54,] -0.02  0.02
## [55,] -0.01 -0.04
## [56,]  0.00 -0.01
## [57,] -0.07 -0.13
## [58,]  0.17  0.07
## [59,] -0.04  0.07
## [60,] -0.01 -0.05

Fit a \(SARIMA(0,1,1)\times(0,1,1)_{12}\) model. The p-values for the Ljung-Box statistic are all below the dashed line, i.e., the residuals are not white noise. There is additional correlation to account for.

sarima(birth, p = 0, d = 1, q = 1, P = 0, D = 1, Q = 1, S = 12)
## initial  value 2.219164 
## iter   2 value 2.013310
## iter   3 value 1.988107
## iter   4 value 1.980026
## iter   5 value 1.967594
## iter   6 value 1.965384
## iter   7 value 1.965049
## iter   8 value 1.964993
## iter   9 value 1.964992
## iter   9 value 1.964992
## iter   9 value 1.964992
## final  value 1.964992 
## converged
## initial  value 1.951264 
## iter   2 value 1.945867
## iter   3 value 1.945729
## iter   4 value 1.945723
## iter   5 value 1.945723
## iter   5 value 1.945723
## iter   5 value 1.945723
## final  value 1.945723 
## converged

## $fit
## 
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, 
##     Q), period = S), include.mean = !no.constant, transform.pars = trans, fixed = fixed, 
##     optim.control = list(trace = trc, REPORT = 1, reltol = tol))
## 
## Coefficients:
##           ma1     sma1
##       -0.4734  -0.7861
## s.e.   0.0598   0.0451
## 
## sigma^2 estimated as 47.4:  log likelihood = -1211.28,  aic = 2428.56
## 
## $degrees_of_freedom
## [1] 358
## 
## $ttable
##      Estimate     SE  t.value p.value
## ma1   -0.4734 0.0598  -7.9097       0
## sma1  -0.7861 0.0451 -17.4227       0
## 
## $AIC
## [1] 6.545975
## 
## $AICc
## [1] 6.546062
## 
## $BIC
## [1] 6.577399

Try adding a nonseasonal AR parameter to the model to account for the additional correlation. Much better!

sarima(birth, p = 1, d = 1, q = 1, P = 0, D = 1, Q = 1, S = 12)
## initial  value 2.218186 
## iter   2 value 2.032584
## iter   3 value 1.982464
## iter   4 value 1.975643
## iter   5 value 1.971721
## iter   6 value 1.967284
## iter   7 value 1.963840
## iter   8 value 1.961106
## iter   9 value 1.960849
## iter  10 value 1.960692
## iter  11 value 1.960683
## iter  12 value 1.960675
## iter  13 value 1.960672
## iter  13 value 1.960672
## iter  13 value 1.960672
## final  value 1.960672 
## converged
## initial  value 1.940459 
## iter   2 value 1.934425
## iter   3 value 1.932752
## iter   4 value 1.931750
## iter   5 value 1.931074
## iter   6 value 1.930882
## iter   7 value 1.930860
## iter   8 value 1.930859
## iter   8 value 1.930859
## final  value 1.930859 
## converged

## $fit
## 
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, 
##     Q), period = S), include.mean = !no.constant, transform.pars = trans, fixed = fixed, 
##     optim.control = list(trace = trc, REPORT = 1, reltol = tol))
## 
## Coefficients:
##          ar1      ma1     sma1
##       0.3038  -0.7006  -0.8000
## s.e.  0.0865   0.0604   0.0441
## 
## sigma^2 estimated as 45.91:  log likelihood = -1205.93,  aic = 2419.85
## 
## $degrees_of_freedom
## [1] 357
## 
## $ttable
##      Estimate     SE  t.value p.value
## ar1    0.3038 0.0865   3.5104   5e-04
## ma1   -0.7006 0.0604 -11.5984   0e+00
## sma1  -0.8000 0.0441 -18.1302   0e+00
## 
## $AIC
## [1] 6.522519
## 
## $AICc
## [1] 6.522695
## 
## $BIC
## [1] 6.564418