Overview

The file dataF13b.txt contains a monthly time series collected between July 2000 and June 2011. The data are managed and analyzed in the statistical software package R; the goal is to fit an ARIMA model to the series and produce forecasts from the fitted model.

dataF13b <- read.csv(paste0("https://raw.githubusercontent.com/",
              "jzuniga123/SPS/master/dataF13b.txt"), header = T)

Exploratory Data Analysis

Step 1: Assess the extent to which the series might be forecast from its past values by examining the lagged scatterplot matrix.

lag.plot(dataF13b, 12)

There is a distinct indication that multiple lags are correlated with the current value.

Step 2: Split the data into a training (model-building) set containing the first 120 values and a validation set, used for checking the forecasts, containing the remaining 12 values.

Training <- ts(dataF13b[1:120, ])
Validation <- ts(dataF13b[121:132, ])
summary(Training)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.201   2.961   4.022   4.470   5.665   9.525

Step 3: Examine a histogram and a time plot of the (otherwise unspecified) time series data \(Y_{t}\).

Y <- ts(Training, start = c(2000, 7), frequency = 12)
par(mfrow = c(1, 2))
hist(Y, xlab = "dataF13b", main = "Histogram of dataF13b")
plot(Y, type = "o", pch = 22, lty = 1, pty = 2, xlab = "Year", 
     ylab = "dataF13b", main = "Plot of dataF13b")

The data show an upward trend as well as seasonality and changing variability.
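
As an optional visual check (not part of the original steps), a classical decomposition separates the series into the trend, seasonal, and remainder components that the plot above suggests.

# Optional: classical additive decomposition of the training series
# into trend, seasonal, and remainder components.
plot(decompose(Y))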

Step 4: Remove the heteroskedasticity by transforming the variable to its logarithm, \(Y^{'}_{t}=\log(Y_{t})\), and then examine the histogram and plot of the transformed time series \(Y^{'}_{t}\).

par(mfrow = c(1, 2))
Yprime <- log(Y)
hist(Yprime, xlab = "Log(dataF13b)", main = "Histogram of Log(dataF13b)")
plot(Yprime, type="o", pch = 22, lty = 1, pty = 2, xlab = "Year",
     ylab="Log(dataF13b)", main = "Plot of Log(dataF13b)")

The data still show an upward trend and seasonality but the heteroskedasticity has been mitigated.
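
As an optional cross-check (not part of the original analysis, and assuming the recommended MASS package is available), the Box-Cox profile likelihood for a simple trend model can be examined; a value of \(\lambda\) near zero supports the log transformation used here.

library(MASS)
# Profile likelihood of the Box-Cox parameter for a simple linear trend model;
# a lambda near 0 corresponds to the log transform.
boxcox(lm(Y ~ time(Y)), lambda = seq(-1, 1, 0.1))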

Model Parameter Estimation

Step 5: Conduct an augmented Dickey-Fuller unit root test to determine whether the data follow a trend-stationary process (deterministic trend) or a difference-stationary process (stochastic trend).

library(tseries)
adf.test(Yprime)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  Yprime
## Dickey-Fuller = -4.9895, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary

The \(t_{DF}\) statistic is far below zero, so the null hypothesis that the series has a unit root is rejected. Regression will therefore be used to remove the trend in lieu of differencing.
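
As an optional complement (not part of the original analysis), the KPSS test from the same tseries package reverses the hypotheses and takes trend stationarity as the null.

# Optional complement to the ADF test: KPSS with trend stationarity as the null.
kpss.test(Yprime, null = "Trend")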

Step 6: Given the seasonal waveform visible in the data, proceed with harmonic regression to determine whether the observed data could have been generated by a fixed signal with consistent periodic variation (a waveform) obscured by a zero-mean random noise process. The na.action argument of lm() can be used to retain the time series attributes of the residuals and fitted values (a sketch of this variant appears after the regression output below).

k <- length(Yprime)
t <- seq(1, k)
lambda.1 <- 2 * pi / 12
lambda.2 <- 2 * pi / 6
cos.1 <- cos(lambda.1 * t)
cos.2 <- cos(lambda.2 * t)
sin.1 <- sin(lambda.1 * t)
sin.2 <- sin(lambda.2 * t)
Z <- data.frame(Yprime, cos.1, cos.2, sin.1, sin.2)
fit <- lm(Yprime ~ time(Yprime) + cos.1 + cos.2 + sin.1 + sin.2, data = Z)
summary(fit)
## 
## Call:
## lm(formula = Yprime ~ time(Yprime) + cos.1 + cos.2 + sin.1 + 
##     sin.2, data = Z)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.39428 -0.10464 -0.01319  0.12211  0.38029 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.585e+02  1.019e+01 -25.358  < 2e-16 ***
## time(Yprime)  1.296e-01  5.082e-03  25.495  < 2e-16 ***
## cos.1         1.829e-01  2.067e-02   8.848 1.32e-14 ***
## cos.2         7.058e-03  2.067e-02   0.341    0.733    
## sin.1         2.215e-01  2.073e-02  10.687  < 2e-16 ***
## sin.2         3.343e-02  2.068e-02   1.617    0.109    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1601 on 114 degrees of freedom
## Multiple R-squared:  0.8774, Adjusted R-squared:  0.872 
## F-statistic: 163.2 on 5 and 114 DF,  p-value: < 2.2e-16
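
The na.action argument mentioned in Step 6 was not actually supplied in the call above; a minimal sketch of that variant (fitting on the ts objects directly rather than through the data frame), shown for illustration only:

# Sketch only (not used for the results shown): supply na.action = NULL and the
# series directly so that the residuals and fitted values can keep their ts attributes.
fit.ts <- lm(Yprime ~ time(Yprime) + cos.1 + cos.2 + sin.1 + sin.2, na.action = NULL)
is.ts(resid(fit.ts))  # check whether the time series attributes were retained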

Remove the insignificant parameters (cos.2 and sin.2) and refit.

Z <- data.frame(Yprime, cos.1, sin.1)
fit <- lm(Yprime ~ time(Yprime) + cos.1 + sin.1, data = Z)
summary(fit)
## 
## Call:
## lm(formula = Yprime ~ time(Yprime) + cos.1 + sin.1, data = Z)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.37004 -0.10992 -0.01708  0.11284  0.35426 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.579e+02  1.022e+01 -25.249  < 2e-16 ***
## time(Yprime)  1.293e-01  5.094e-03  25.386  < 2e-16 ***
## cos.1         1.829e-01  2.073e-02   8.821 1.36e-14 ***
## sin.1         2.214e-01  2.079e-02  10.650  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1606 on 116 degrees of freedom
## Multiple R-squared:  0.8745, Adjusted R-squared:  0.8712 
## F-statistic: 269.4 on 3 and 116 DF,  p-value: < 2.2e-16

Plot the log-transformed data with the fitted curve overlaid in red.

plot(Yprime, main = "Log(dataF13b)")
lines(seq(2000 + 6 / 12, by = 1 / 12, length = k), fitted(fit), lty = 1, col = "red")

\[{ Y }_{ t }^{ ' }={ \hat { \mu } }_{ t }+{ X }_{ t },\quad { \hat { \mu } }_{ t }=-{ 257.939 }_{ \left( 10.22 \right) }+{ 0.129 }_{ \left( 0.005 \right) }t+{ 0.183 }_{ \left( 0.021 \right) }\cos { \left( \frac { 2\pi t }{ 12 } \right) } +{ 0.221 }_{ \left( 0.021 \right) }\sin { \left( \frac { 2\pi t }{ 12 } \right) }\]

The fitted additive signal model above is an appropriate sinusoidal fit for the data.
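
As a small optional computation (not part of the original write-up), the cos.1 and sin.1 coefficients can be combined into a single sinusoid with amplitude \(\sqrt{0.183^{2}+0.221^{2}}\approx 0.287\) and a phase shift.

# Optional: express the two harmonic terms as amplitude * cos(2*pi*t/12 - phase).
b <- coef(fit)
amplitude <- sqrt(b["cos.1"]^2 + b["sin.1"]^2)
phase <- atan2(b["sin.1"], b["cos.1"])
c(amplitude = unname(amplitude), phase = unname(phase))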

Step 7: Examine the residuals to determine whether an AR, MA, or ARMA model would be an appropriate fit for the random noise process of the fitted residuals, \(Y_{ t }^{ ' }-{ \hat { Y } }_{ t }^{ ' }=X_{t}\). First, examine the residual plot for any obvious pattern.

a <- 0.05
lag.k <- 40
x <- fit$residuals
plot(x, type = "o", pch = 22, lty = 1, pty = 2, 
     xlab = "Year", ylab = "x", main = "Residuals")

The residual plot is centered around zero with no obvious trends or patterns, aside from autocorrelation that appears as persistent runs of positive and negative values. Next, examine the ACF and PACF plots of the residuals.

acf(x, lag.max = lag.k, type = "correlation", plot = T, main = "Residuals") -> x.acf

acf(x, lag.max = lag.k, type = "partial", plot = T, main = "Residuals") -> x.pacf

The ACF decays roughly geometrically, and after PACF lags 8 and 13, over 95% of the subsequent lags fall inside \(\pm \frac { 1.96 }{ \sqrt { n } }\). To confirm, examine each lag falling outside the \(\pm \frac { 1.96 }{ \sqrt { n } }\) boundaries together with the percentage of later lags that fall inside \(\pm \frac { 1.96 }{ \sqrt { n } }\).

# Lags at which the sample ACF / PACF fall outside the approximate 95% bounds
acf.lags <- x.acf$lag[abs(x.acf$acf) > qt(1 - a / 2, k - 1) / sqrt(x.acf$n.used)]
pacf.lags <- x.pacf$lag[abs(x.pacf$acf) > qt(1 - a / 2, k - 1) / sqrt(x.pacf$n.used)]
n.acf <- length(acf.lags)
n.pacf <- length(pacf.lags)
# For each candidate cutoff lag, the proportion of later lags that fall inside the bounds
ma.order <- (lag.k - acf.lags - (n.acf - (1:n.acf))) / (lag.k - acf.lags)
ar.order <- (lag.k - pacf.lags - (n.pacf - (1:n.pacf))) / (lag.k - pacf.lags)
ma.models <- matrix(c(acf.lags, round(ma.order, 4)), nrow = n.acf, 
       dimnames = list(c(), c("MA Lags","  'iid' h > p")))
ar.models <- matrix(c(pacf.lags, round(ar.order, 4)), nrow = n.pacf, 
       dimnames = list(c(), c("AR Lags","  'iid' h > q")))
ma.models[ma.models[, 1] > 0 & ma.models[, 2] < 1 
          & ma.models[, 2] > 0 & !is.na(ma.models[, 2]), ]
##       MA Lags   'iid' h > p
##  [1,]       1        0.5128
##  [2,]       2        0.5263
##  [3,]       3        0.5405
##  [4,]       4        0.5556
##  [5,]       9        0.5161
##  [6,]      10        0.5333
##  [7,]      11        0.5517
##  [8,]      12        0.5714
##  [9,]      13        0.5926
## [10,]      26        0.2857
## [11,]      27        0.3077
## [12,]      28        0.3333
ar.models[ar.models[, 1] > 0 & ar.models[, 2] < 1 
          & ar.models[, 2] > 0 & !is.na(ar.models[, 2]), ]
##      AR Lags   'iid' h > q
## [1,]       1        0.7949
## [2,]       2        0.8158
## [3,]       3        0.8378
## [4,]       4        0.8611
## [5,]       6        0.8824
## [6,]       8        0.9062
## [7,]      13        0.9259
## [8,]      24        0.9375

The above supports the conclusion that modeling the time series residuals with an AR(8) or AR(13) would be a reasonable choice.

Step 8: Compare the AIC of the AR(8), AR(13), and other potential residual models.

# Fit ARMA(i, j) residual models for i = 7, ..., 13 and j = 0, 1,
# storing each model's AIC (index i holds the MA(0) fit, index i + p the MA(1) fit).
p <- 13
q <- 1
m <- p * (q + 1)
AIC <- AR <- MA <- numeric(m)
for (i in 7:p) { 
  for (j in 0:q) { 
    AR[i + (p * j)] <- i
    MA[i + (p * j)] <- j
    AIC[i + (p * j)] <- arima(x, order = c(i, 0, j), method = "ML")$aic
  }
}
models <- matrix(c(AR, MA, AIC), nrow = m, dimnames = list(c(), c("AR", "MA", "AIC")))
best <- models[models[, 3] != 0, ]
best[order(best[, 3]), ]
##       AR MA       AIC
##  [1,] 13  0 -347.6846
##  [2,]  8  0 -347.1162
##  [3,]  8  1 -346.0576
##  [4,] 13  1 -345.6847
##  [5,]  9  1 -345.2504
##  [6,]  9  0 -345.1416
##  [7,]  7  1 -344.3289
##  [8,]  7  0 -344.0114
##  [9,] 10  0 -343.7421
## [10,] 10  1 -343.7222
## [11,] 11  0 -341.7488
## [12,] 11  1 -341.7427
## [13,] 12  1 -340.8070
## [14,] 12  0 -339.7691

By the principle of parsimony and the AIC ranking, the AR(13), AR(8), ARMA(7,1), and ARMA(12,1) residual models are chosen for further diagnostics.

arima.13.0.0 <- arima(x, order = c(13, 0, 0), method = "ML")
arima.08.0.0 <- arima(x, order = c(8, 0, 0), method = "ML")
arima.07.0.1 <- arima(x, order = c(7, 0, 1), method = "ML")
arima.12.0.1 <- arima(x, order = c(12, 0, 1), method = "ML")
ar.13 <- ar(x, aic=T, method = "mle", order.max = 13)
ar.08 <- ar(x, aic=T, method = "mle", order.max = 8)

Model Diagnostics

Step 9: Run diagnostics on the chosen model orders to check for randomness (absence of correlation, i.e., independence) and normality, starting with a visual analysis of randomness using ACF and McLeod-Li plots.

library(TSA)
par(mfrow = c(2, 4))
arma.13.0 <- ar.13$resid[-(1:13)]
acf(arma.13.0, lag.max = lag.k, main = "AR(13) Residuals")
McLeod.Li.test(y = arma.13.0, gof.lag = lag.k / 2, main = "AR(13) Residuals")
arma.08.0 <- ar.08$resid[-(1:8)]
acf(arma.08.0, lag.max = lag.k, main = "AR(8) Residuals")
McLeod.Li.test(y = arma.08.0, gof.lag = lag.k / 2, main = "AR(8) Residuals")
arma.07.1 <- arima.07.0.1$residuals[-1]
acf(arma.07.1, lag.max = lag.k, main = "ARMA(7,1) Residuals")
McLeod.Li.test(y = arma.07.1, gof.lag = lag.k / 2, main = "ARMA(7,1) Residuals")
arma.12.1 <- arima.12.0.1$residuals[-1]
acf(arma.12.1, lag.max = lag.k, main = "ARMA(12,1) Residuals")
McLeod.Li.test(y = arma.12.1, gof.lag = lag.k / 2, main = "ARMA(12,1) Residuals")

Visual inspection of the ACF plots alone suggests that roughly 95% of the residual autocorrelations lie within \(\pm { 1.96 }/{\sqrt{ n }}\), which would lead one to believe that all the fitted residual series have been reduced to random white noise. The McLeod-Li[i] plots, however, show some \(p\)-values near the boundaries for the fitted ARMA(p,q) models. This indicates that although the fitted ARMA(p,q) residuals may be uncorrelated, they may not be independent. Continue with a quantitative analysis of randomness in the form of Portmanteau (Box-Pierce) and Ljung-Box tests on the remaining fitted AR(p) models.

test <- c("Box-Pierce", "Ljung-Box")
Box.test(arma.08.0, lag.k / 2, test[1]); Box.test(arma.08.0, lag.k / 2, type = test[2])
## 
##  Box-Pierce test
## 
## data:  arma.08.0
## X-squared = 14.309, df = 20, p-value = 0.8145
## 
##  Box-Ljung test
## 
## data:  arma.08.0
## X-squared = 16.49, df = 20, p-value = 0.6858
Box.test(arma.13.0, lag.k / 2, test[1]); Box.test(arma.13.0, lag.k / 2, type = test[2])
## 
##  Box-Pierce test
## 
## data:  arma.13.0
## X-squared = 3.4382, df = 20, p-value = 1
## 
##  Box-Ljung test
## 
## data:  arma.13.0
## X-squared = 4.175, df = 20, p-value = 0.9999

The Box-Pierce and Ljung-Box[ii] tests both support the randomness of the AR(p) residuals. Note, however, that the Ljung-Box test is the more appropriate of the two for a sample size of 120. Next, the normality of the fitted AR(p) residuals is examined with quantile-quantile plots.

par(mfrow=c(1,2))
qqnorm(arma.08.0, main = "Fitted AR(8) Normal Q-Q Plot")
qqline(arma.08.0)
qqnorm(arma.13.0, main = "Fitted AR(13) Normal Q-Q Plot")
qqline(arma.13.0)

The data appear normally distributed except in the tails. Proceed with Shapiro-Wilk and Jarque-Bera tests on the normality of the residuals.

shapiro.test(arma.08.0); jarque.bera.test(arma.08.0)
## 
##  Shapiro-Wilk normality test
## 
## data:  arma.08.0
## W = 0.98731, p-value = 0.376
## 
##  Jarque Bera Test
## 
## data:  arma.08.0
## X-squared = 1.9644, df = 2, p-value = 0.3745
shapiro.test(arma.13.0); jarque.bera.test(arma.13.0)
## 
##  Shapiro-Wilk normality test
## 
## data:  arma.13.0
## W = 0.99234, p-value = 0.8142
## 
##  Jarque Bera Test
## 
## data:  arma.13.0
## X-squared = 0.66555, df = 2, p-value = 0.7169

Neither test rejects the null hypothesis of normality: both AR(p) models have approximately normal residuals.

Step 10: List the order coefficients (B), standard errors (se), \(t\)-distribution \(p\)-values (p), lower bounds (LB), and upper bounds (UB) for each respective model. Row \(j\) corresponds to the lag-\(j\) AR coefficient.

round(matrix(c(ar.08$ar,
  sqrt(diag(ar.08$asy.var.coef)),
  2 * pt(abs(ar.08$ar / sqrt(diag(ar.08$asy.var.coef))), df = k - 1, lower.tail = F),
  ar.08$ar - qt(1 - a / 2, k - 1) * sqrt(diag(ar.08$asy.var.coef)),
  ar.08$ar + qt(1 - a / 2, k - 1) * sqrt(diag(ar.08$asy.var.coef))),
  nrow = 8, dimnames = list(c(), c("B","se","p","LB","UB"))), 3)
##           B    se     p     LB     UB
## [1,]  1.812 0.088 0.000  1.637  1.987
## [2,] -1.753 0.185 0.000 -2.118 -1.387
## [3,]  1.330 0.241 0.000  0.852  1.807
## [4,] -0.922 0.264 0.001 -1.445 -0.398
## [5,]  0.647 0.264 0.016  0.124  1.170
## [6,] -0.427 0.241 0.079 -0.905  0.050
## [7,]  0.309 0.185 0.098 -0.057  0.674
## [8,] -0.212 0.088 0.018 -0.387 -0.037
round(matrix(c(ar.13$ar, 
  sqrt(diag(ar.13$asy.var.coef)),
  2 * pt(abs(ar.13$ar / sqrt(diag(ar.13$asy.var.coef))), df = k - 1, lower.tail = F),
  ar.13$ar - qt(1 - a / 2, k - 1) * sqrt(diag(ar.13$asy.var.coef)),
  ar.13$ar + qt(1 - a / 2, k - 1) * sqrt(diag(ar.13$asy.var.coef))),
  nrow = 13, dimnames = list(c(), c("B","se","p","LB","UB"))), 3)
##            B    se     p     LB     UB
##  [1,]  1.820 0.086 0.000  1.650  1.991
##  [2,] -1.786 0.180 0.000 -2.142 -1.430
##  [3,]  1.403 0.238 0.000  0.933  1.874
##  [4,] -1.031 0.267 0.000 -1.560 -0.502
##  [5,]  0.834 0.281 0.004  0.278  1.391
##  [6,] -0.663 0.287 0.022 -1.230 -0.095
##  [7,]  0.600 0.288 0.039  0.030  1.170
##  [8,] -0.589 0.287 0.042 -1.157 -0.022
##  [9,]  0.463 0.281 0.102 -0.093  1.020
## [10,] -0.507 0.267 0.060 -1.036  0.021
## [11,]  0.540 0.238 0.025  0.069  1.010
## [12,] -0.535 0.180 0.004 -0.891 -0.179
## [13,]  0.289 0.086 0.001  0.119  0.460

From the above, the AR(8) model coefficients at lags 1 to 5 and 8 are significantly different from zero. For the AR(13) model, the coefficients at lags 1 to 8 and 11 to 13 are significantly different from zero. Based on the principle of parsimony, the AR(8) model below is the more appropriate model.

\[{ x }_{ t }={ 1.812 }_{ \left( 0.088 \right) }{ x }_{ t-1 }-{ 1.753 }_{ \left( 0.185 \right) }{ x }_{ t-2 }+{ 1.330 }_{ \left( 0.241 \right) }{ x }_{ t-3 }-{ 0.922 }_{ \left( 0.264 \right) }{ x }_{ t-4 }+{ 0.647 }_{ \left( 0.264 \right) }{ x }_{ t-5 }-{ 0.212 }_{ \left( 0.088 \right) }{ x }_{ t-8 }+{ w }_{ t }\]\[{ w }_{ t }\sim WN\left( 0,0.003 \right)\]
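
The displayed equation omits the insignificant lag-6 and lag-7 terms, although the fitted AR(8) model above estimates all eight coefficients. If the constrained version were actually desired, the fixed argument of arima() could be used; a sketch only, not used for the results in this report:

# Sketch: AR(8) with the lag-6 and lag-7 coefficients constrained to zero.
# The fixed vector covers ar1-ar8 plus the intercept; NA means estimate freely.
arima(x, order = c(8, 0, 0), fixed = c(NA, NA, NA, NA, NA, 0, 0, NA, NA),
      transform.pars = FALSE, method = "ML")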

Time Series Forecasting

Step 11: Examine the mean of the residuals to see if it should be included in the AR model.

mu <- mean(x)
se <- sd(x)
round(matrix(c(mu, se, 2 * pt(mu / se, df = k - 1, lower.tail = F), 
  c(mu - qt(1 - a / 2, k - 1) * se, mu + qt(1 - a / 2, k - 1) * se)), nrow = 1, 
  dimnames = list(c(), c("mu","se","p","LB","UB"))), 3)
##      mu    se p     LB    UB
## [1,]  0 0.159 1 -0.314 0.314

The mean is not significantly different from zero, therefore it need not be included in the model.

Step 12: Work backwards: add the trend and seasonal signal back, then transform the data back by exponentiation.

b0 <- fit$coefficients[1]; b1 <- fit$coefficients[2]
b2 <- fit$coefficients[3]; b3 <- fit$coefficients[4]
Yhat <- b0 + b1 * time(Y) + b2 * cos.1 + b3 * sin.1
Xtilde <- ts(ar.08$resid, start = c(2000, 7), frequency = 12)
Ytilde <- exp(Yhat) + Xtilde + mean(x)  # the mean is zero and has no impact

Step 13: Having obtained \(\tilde{ Y }_{ t }^{'}=\hat{ Y }_{ t }^{'}+\hat { x } _{ t }+\bar { x }\) where \(\bar { x }=0\), predict 12 months forward, and plot the prediction.

par(mfrow = c(1, 1))
Y.pred <- predict(arima(Ytilde, order = c(8, 0, 0), method = "ML"), n.ahead = 12)
ts.plot(Y, Y.pred$pred, col = 1:2)
lines(Y.pred$pred + Y.pred$se, lty = "dashed", col = 4)
lines(Y.pred$pred - Y.pred$se, lty = "dashed", col = 4)

Next check the accuracy of the prediction against the validation set (V).

plot(ts(dataF13b, start = c(2000, 7), frequency = 12))
lines(Y.pred$pred + Y.pred$se, lty = "dashed", col = 4)
lines(Y.pred$pred - Y.pred$se, lty = "dashed", col = 4)
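
Before examining the intervals numerically, optional summary accuracy measures against the held-out values can be computed; a minimal sketch, assuming Y.pred and Validation as defined above:

# Optional accuracy summary: compare the 12 point forecasts with the validation set.
err <- as.numeric(Validation) - as.numeric(Y.pred$pred)
c(RMSE = sqrt(mean(err^2)), MAE = mean(abs(err)),
  MAPE = 100 * mean(abs(err / as.numeric(Validation))))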

Examine the 95% prediction intervals against the validation values (V).

round(matrix(c(Y.pred$pred, Y.pred$se, 2 * pt(abs(Y.pred$pred / Y.pred$se), 
  df = 12 - 1, lower.tail = F), Y.pred$pred - qt(1 - a / 2, k - 1) * Y.pred$se, 
  Y.pred$pred + qt(1 - a / 2, k - 1) * Y.pred$se, Validation), nrow = 12, 
  dimnames = list(c(), c("pred","se","p","LB","UB","V"))), 3)
##         pred    se p    LB     UB     V
##  [1,] 10.055 0.134 0 9.789 10.322 9.759
##  [2,] 10.210 0.254 0 9.707 10.712 9.350
##  [3,]  9.694 0.381 0 8.939 10.449 9.150
##  [4,]  8.695 0.480 0 7.744  9.646 8.737
##  [5,]  7.504 0.539 0 6.436  8.572 7.544
##  [6,]  6.386 0.570 0 5.257  7.516 6.399
##  [7,]  5.641 0.580 0 4.492  6.790 5.630
##  [8,]  5.464 0.581 0 4.314  6.614 5.724
##  [9,]  5.899 0.581 0 4.748  7.050 6.473
## [10,]  6.836 0.583 0 5.682  7.991 7.470
## [11,]  8.010 0.583 0 6.856  9.165 8.355
## [12,]  9.105 0.585 0 7.947 10.264 8.425

The predictions (pred) capture the validation data (V) within the upper (UB) and lower (LB) bounds reasonably well. In conclusion, the AR(8) residual model, which is equivalent to an ARIMA(8,0,0) model for the residuals, provides the best fit for the data.

Endnotes

[i] Evaluates the whiteness of the residuals by computing the Ljung-Box test with the squared residuals.

[ii] Incorporates the Box-Pierce test and considers the whiteness of the residuals as a group.
