Box-Jenkins Methodology using R Programming

Textbook Examples

CPI Example

Stage 1: Model Identification Refer to page 264. This example is based on CPI dataset.

CPI_Annual <- read.csv(file.choose())

or any relevant method to read the dataset into R. For below code to works, save the data into your directory.

CPI_Annual <- read_excel("CPI Annual.xls")

Check the structure of the dataset to ensure R read our dataset correctly.

Note that our interest variable’s name is CPI. To call the value, try CPI_Annual$CPI.

str(CPI_Annual)

## tibble [38 x 2] (S3: tbl_df/tbl/data.frame)
##  $ Year: num [1:38] 1960 1961 1962 1963 1964 ...
##  $ CPI : num [1:38] 31.3 31.3 31.3 32.2 32.1 32.9 33.4 34.8 34.7 34.6 ...

Now, to works with R Programming, we need to define the variable as time series object by using ts () function. Note that, in ts () function, we will specify the start = , end = , and frequency.

This is specification for the dataset from 1960 to 1997, frequency = 1 represent yearly datasets.

CPI_Annual.ts <- ts(CPI_Annual$CPI, start = 1960, end = 1997, 
                    frequency = 1)

str (CPI_Annual.ts)

##  Time-Series [1:38] from 1960 to 1997: 31.3 31.3 31.3 32.2 32.1 32.9 33.4 34.8 34.7 34.6 ...

Next, we will plot the original series.

plot(CPI_Annual.ts, col = "blue", ylab = "CPI Base 1994 = 100", 
      main = "CPI Data Example", xlab = "Year", type ="o", pch = 20,
      panel.first = grid (),  xaxt = "n")
axis(1, at = seq (from = 1960, to = 1997, by = 2)) # additional graph customization

Explanation: Based on the CPI Data plot, there is significant upward trend from 1972 until 1997. This is a sign that the series is not stationary since there is a trend component in the series.

Next, we plot the ACF and PACF for the original series.

par(mfrow = c(2,1)) #2 graphs in a single device
acf(CPI_Annual.ts, 16, main = "ACF for CPI", panel.first = grid ())
pacf(CPI_Annual.ts, 16, main = "PACF for CPI", panel.first = grid ())

Explanation: Based on Autocorrelation function (ACF) for the original series, the ACF is slowly declining as the number of lags increase suggesting that the original series is not stationary. The term slowly decaying can also be used to explain this pattern.

In addition, based on Partial Autocorrelation function (PACF) of the original series, there is one significant spike (exceeding the standard error (SE) band). This is a signal that we can perform the first differencing to turn the original series into stationary series.

par(mfrow = c(2,1)) #2 graphs in a single device
acf(diff (CPI_Annual.ts), 16, main = "ACF for First Difference CPI", 
     panel.first = grid ())
pacf(diff (CPI_Annual.ts), 16, main = "PACF for First Difference CPI", 
      panel.first = grid ())

Explanation: Figure show the series after we perform the first difference to the original data. Based on the ACF of the first difference, the rate of decay is much faster which the changes from positive autocorrelation at lag 5.

Meanwhile, for PACF, it shows significant spike at lag one and lag eight (barely touched the SE bands).

Stage 2: Model Validation and Estimation

Based on the ACF and PACF after differencing the original series, we may decide the order of our ARIMA model. For this model, d=1, since we perform the first difference to transform the original series into stationary series.

Following the guideline in page 271, since our ACF shows a declining pattern (after performing differencing) and several significant spikes especially at lag 1 (after performing differencing), the order for p=1 (based on PACF), while q=0 (based on declining pattern of ACF).

Thus, it can be identified that the suitable model based on ACF and PACF is ARIMA (1, 1, 0).

To obtain more conclusive evidence, several alternatives will be considered.

For this example, we choose ARIMA (3, 1, 3), ARIMA (3, 1, 0) and ARIMA (1, 1, 1) as competing alternatives (based on textbook examples).

# Model 1: ARIMA (3, 1, 3)
model1 <- arima(CPI_Annual.ts, order = c(3, 1, 3))

## Warning in arima(CPI_Annual.ts, order = c(3, 1, 3)): possible convergence
## problem: optim gave code = 1

summary(model1)

## 
## Call:
## arima(x = CPI_Annual.ts, order = c(3, 1, 3))
## 
## Coefficients:
##          ar1      ar2    ar3     ma1     ma2     ma3
##       0.5839  -0.7052  0.837  0.3387  0.9160  0.1522
## s.e.  0.1651   0.1947  0.154  0.2342  0.3455  0.2004
## 
## sigma^2 estimated as 1.483:  log likelihood = -61.24,  aic = 136.48
## 
## Training set error measures:
##                     ME     RMSE       MAE       MPE     MAPE      MASE
## Training set 0.3159759 1.201683 0.8896298 0.5722527 1.685048 0.4156099
##                     ACF1
## Training set -0.06451308

#Model 2: ARIMA (3, 1, 0)
model2 <- arima(CPI_Annual.ts, order = c(3, 1, 0))
summary(model2)

## 
## Call:
## arima(x = CPI_Annual.ts, order = c(3, 1, 0))
## 
## Coefficients:
##          ar1      ar2     ar3
##       0.8724  -0.2184  0.2350
## s.e.  0.1570   0.2105  0.1574
## 
## sigma^2 estimated as 1.746:  log likelihood = -63.56,  aic = 135.11
## 
## Training set error measures:
##                     ME     RMSE       MAE       MPE     MAPE      MASE
## Training set 0.3201719 1.303726 0.8908956 0.5955232 1.702693 0.4162012
##                     ACF1
## Training set -0.03576427

# Model 3: ARIMA (1, 1, 1)
model3 <- arima(CPI_Annual.ts, order = c(1, 1, 1))
summary(model3)

## 
## Call:
## arima(x = CPI_Annual.ts, order = c(1, 1, 1))
## 
## Coefficients:
##          ar1     ma1
##       0.8475  0.0364
## s.e.  0.1111  0.2873
## 
## sigma^2 estimated as 1.858:  log likelihood = -64.62,  aic = 135.25
## 
## Training set error measures:
##                     ME     RMSE       MAE       MPE     MAPE      MASE
## Training set 0.3674566 1.344919 0.9478941 0.6542321 1.788465 0.4428293
##                     ACF1
## Training set -0.06580553

# Model 4: ARIMA (1, 1, 0)
model4 <- arima(CPI_Annual.ts, order = c(1, 1, 0))
summary(model4)

## 
## Call:
## arima(x = CPI_Annual.ts, order = c(1, 1, 0))
## 
## Coefficients:
##          ar1
##       0.8571
## s.e.  0.0773
## 
## sigma^2 estimated as 1.859:  log likelihood = -64.63,  aic = 133.26
## 
## Training set error measures:
##                     ME     RMSE       MAE       MPE     MAPE      MASE
## Training set 0.3618908 1.345254 0.9474519 0.6475377 1.784458 0.4426227
##                     ACF1
## Training set -0.04314883

Explanation: Based on the lowest AIC value, we choose model 4, ARIMA (1, 1, 0) as the best model. In addition, based on Ljung-Box test, we accepted the null hypothesis indicated that the residuals are white noise.

Box.test(model4$residuals, type = "Ljung-Box")

## 
##  Box-Ljung test
## 
## data:  model4$residuals
## X-squared = 0.076486, df = 1, p-value = 0.7821

Next, we proceed with the application of ARIMA (1, 1, 0) to generate 5 years ahead (1998-2002) forecast values.

Stage 3: Model Application

# forecasting CPI data by using ARIMA (1, 1, 0) model
model4.for <- predict(model4, n.ahead = 5)
ts.plot(CPI_Annual.ts, model4.for$pred, ylim = c(0, 150), 
           ylab = "CPI", col = "blue")
grid ()
# calculating upper and lower limit
# additional plotting
U = model4.for$pred + model4.for$se
L = model4.for$pred - model4.for$se
xx = c(time (U), rev (time (U)))
yy = c(L, rev(U))
polygon(xx, yy, border = 8, col = gray (0.6, alpha=0.2))
lines(model4.for$pred, type = "p", col = "red")

model4.for

## $pred
## Time Series:
## Start = 1998 
## End = 2002 
## Frequency = 1 
## [1] 112.3855 114.5157 116.3414 117.9061 119.2472
## 
## $se
## Time Series:
## Start = 1998 
## End = 2002 
## Frequency = 1 
## [1] 1.363302 2.875459 4.555371 6.327398 8.143844

# plotting forecast value for 1998-2002
plot(model4.for$pred, type = "o", ylim = c(100, 140), ylab ="CPI",
      main = "Forecast Value for 1998-2002")
grid()
polygon(xx, yy, border = 8, col = gray (0.6, alpha=0.2))
text(model4.for$pred~time(model4.for$pred), labels = round (pred, 2), 
      data=model4.for, cex=0.7, font=2, pos=1)

Explanation: The forecast values for 1998-2002 are 112.39, 114.52, 116.34, 117.91 and 119.25.

Monthly Road Accidents in Accidents Example

Based on page 267 dataset, we will investigate the stationarity and non-stationarity characteristics in Accident Casualties data series.

Accident_Casualties <- read.csv (file.choose())

# or any alternatives methods to read the data into R.
Accident_Casualties <- read_excel("Accident Casualties.xls")

# check the structure of the dataset.
str(Accident_Casualties)

## tibble [348 x 3] (S3: tbl_df/tbl/data.frame)
##  $ Year      : num [1:348] 1975 NA NA NA NA ...
##  $ Month     : chr [1:348] "Jan" "Feb" "Mar" "Apr" ...
##  $ Casualties: num [1:348] 3419 3933 2830 3887 4125 ...

Note that, for monthly data, for datasets start from january 1975, we specify start = c(1975, 1). 1 is for January and the frequency = 12 since it is a monthly data series.

acc_casualty.ts <- ts(Accident_Casualties$Casualties, 
                       start = c(1975, 1), end = c(2003, 8), frequency = 12)

Step 1, we will plot the original series of monthly accident casualties.

plot(acc_casualty.ts, ylab = "Number of Casualties",
        main = "Monthly Road Accidents", col = "blue",
        panel.first = grid ())
  fit.lm <- lm(acc_casualty.ts~time(acc_casualty.ts), na.action = NULL)
  abline(fit.lm, col = "red") # add line based on linear model fit.lm

Explanation: Original series registered a general upward trend over period of 1990-2003.

Next, we performing the first difference toward the original series to detrend the trend component in the original series.

plot(diff (acc_casualty.ts), ylab = "(1st Difference)",
     main = "Monthly Road Accidents (1st Difference)", col = "blue",
     panel.first = grid ())
  fit.lm.diff <- lm(diff(acc_casualty.ts)~time(diff(acc_casualty.ts)), na.action = NULL)
  abline(fit.lm.diff, col = "red") # add line based on linear model fit.lm.diff

par(mfrow = c(2,1)) #2 graphs in a single device
acf(acc_casualty.ts, 16, main = "ACF for Monthly Casualties", 
    panel.first = grid ())
pacf(acc_casualty.ts, 16, main = "PACF for Monthly Casualties", 
     panel.first = grid ())

Explanation: ACF plot for the original series indicates that the ACF decline rather slowly as the number of lags increase, therefore the series is not stationary.

In addition, for the PACF, there is a significant spike at lag 1 suggesting that the series can be stationary after performing the first difference on the original series.

par(mfrow = c(2,1)) #2 graphs in a single device
acf(diff(acc_casualty.ts), 16, main = "ACF for Monthly Casualties (1st Difference)", 
    panel.first = grid ())
pacf(diff(acc_casualty.ts), 16, main = "PACF for Monthly Casualties (1st Difference)", 
     panel.first = grid ())

Explanation: It seems that the series is stationary after performing the first difference. For the ACF, there are significant spikes at a certain lag periods while for the PACF, there is decaying pattern observed at lags 1 to 3.

Statistical tests can be used to check for stationarity is found to be more accurate and reliable test. One of them is Augmented Dickey Fuller (ADF) test. We can perform the ADF test for both original series and difference series.

# Performing ADF test for Monthly Casualties Dataset
adf.test(acc_casualty.ts)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  acc_casualty.ts
## Dickey-Fuller = -0.7401, Lag order = 6, p-value = 0.9664
## alternative hypothesis: stationary

# Performing ADF test for first difference of Monthly Casuality Dataset
adf.test(diff(acc_casualty.ts))

## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff(acc_casualty.ts)
## Dickey-Fuller = -8.126, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary

Note that, if the p-value is lower than the significance level, the null hypothesis of the series is not stationary will be rejected. Therefore, for this example, the series is stationary after we perform the first difference.

Palm Oil Production Example

Stage 1: Model Identification The following example demonstrate the procedure to fit the data series with seasonal component. We will use monthly production of palm oil as example discussed on page 288 of the textbook.

palm_oil <- read.csv(file.choose())

or any relevant method to read the dataset into R. For below code to works, save the data into your directory.

palm_oil <- read_excel("Palm Oil Production.xls")

# check the structure of the dataset.
str(palm_oil)

## tibble [120 x 3] (S3: tbl_df/tbl/data.frame)
##  $ Year  : num [1:120] 1989 NA NA NA NA ...
##  $ Month : chr [1:120] "J" "F" "M" "A" ...
##  $ Tonnes: num [1:120] 364080 341574 368990 392065 444756 ...

palm_oil.ts <- ts(palm_oil$Tonnes, start = c(1989, 1), 
                   end = c(1998, 12), frequency = 12)

options(scipen = 2) #Suppress the scientific value for ylab
plot(palm_oil.ts, type ="o", pch = 16, 
     ylab = "Palm Oil Production", panel.first = grid ())

Explanation: The data series is investigated by plotting the Palm Oil Production against time. The chart show that the series has regular peaks and trough is not stationary since there is seasonal component in the series. From original series, the regular peaks and through indicated seasonal pattern in the series.

par(mfrow = c(2, 1)) # 2 graphs in a single device
acf(palm_oil.ts, 36, main = "ACF for Palm Oil Production", 
      panel.first = grid (), xaxt = "n")
axis(1, at = 0:36/12, labels=0:36) # xaxis re-label
pacf(palm_oil.ts, 36, main = "PACF for Palm Oil Production", 
       panel.first = grid (), xaxt = "n")
axis(1, at = 0:36/12, labels=0:36) # xaxis re-label

Explanation: Based on ACF plot, there is wave like pattern which confirmed the seasonal component in the palm oil series and a slightly extent in the PACF plot. These pattern characterise the series with seasonal effect.

Since there is the presence of seasonal component on the series, the first order seasonal differencing is performed. Note that we add additional argument in diff () function, lag = 12 to perform seasonal differencing.

palm_oil_sea.ts <- diff(palm_oil.ts, lag = 12)

par(mfrow = c(1, 1)) # 1 graph in a single device

plot(palm_oil_sea.ts, type ="o", pch = 16, 
     ylab = "Palm Oil Production", 
     main = "Seasonal Differencing", panel.first = grid ())

par (mfrow = c(2, 1)) # 2 graphs in a single device
acf(palm_oil_sea.ts, 36, 
    main = "ACF for Palm Oil Production (Seasonal Diff)", 
    panel.first = grid (), xaxt = "n")
axis(1, at = 0:36/12, labels=0:36)

pacf(palm_oil_sea.ts, 36, main = "PACF for Palm Oil Production (Seasonal Diff)", panel.first = grid (), xaxt = "n") 
axis(1, at = 0:36/12, labels=0:36)

Explanation: After performing the first seasonal differencing, the ACF and PACF plots show that decaying and undulating characteristics, indicating that the series is still not stationary. The ACF plot shows a decaying pattern and it still shows a somewhat wave pattern (but not so obvious as compared to the original data) while the PACF produce several significant spikes especially at first lag.

This indicates that we need to perform first non-seasonal differencing for the series.

palm_oil_nonsea.ts <- diff(palm_oil_sea.ts, differences = 1)

par (mfrow = c(1, 1)) # 1 graph in a single device
plot (palm_oil_nonsea.ts, type ="o", pch = 16, 
      ylab = "Palm Oil Production", 
      main = "Non-Seasonal Differencing",
      panel.first = grid ())

par(mfrow = c(2, 1)) # 2 graphs in a single device
acf(palm_oil_nonsea.ts, 36, 
    main = "ACF for Palm Oil Production (Non-Seasonal Diff)",
    panel.first = grid (), xaxt = "n")
axis(1, at = 0:36/12, labels=0:36)

pacf(palm_oil_nonsea.ts, 36, main = "PACF for Palm Oil Production (Non-Seasonal Diff)", panel.first = grid (), xaxt = "n")
axis(1, at = 0:36/12, labels=0:36)

Explanation: After performing the non-seasonal differencing, the time plot does not show the presence of trend, both ACF and PACF shows several significant spikes without any obvious pattern.

This may conclude that finally, our series is now stationary after first seasonal and non seasonal differencing, thus D=1 and d=1. To determine the order of AR(p), MA(q), SAR(P), SMA(Q), spikes at certain lags period will be observed.

To identify the seasonal part, one need to observe the spikes at lag 12 or 24 or 36 (for monthly data) and to identify the non-seasonal part is by observing the significant spikes at periods other than that.

Based on ACF, two significant spikes are detected at lag 1 and lag 8. (Note that, we exclude ACF at lag = 0 as the part of significant spike calculation) These two spikes can be used to specify the non-seasonal MA part. Another significant spike is observed at lag 12 to suggest seasonal SMA part of the model. Thus, MA (2) and SMA (1) are suggested based on ACF plot.

Next, to identify the order of p and P, we observed the PACF. There are two possible significant spikes which are at lag 1 and at lag 8, suggesting AR(2) for non-seasonal AR and there is a significant spike at lag 12, indicating SAR(1) for the seasonal AR.

In order to ensure that a well-specified model is not missed out, several model formulations will be identified and estimated.

The following models have been identified: SARIMA(2, 1, 2)(1, 1, 1), SARIMA(1, 1, 2)(0, 1, 1), SARIMA(1, 1, 2)(1, 1, 1) and SARIMA(2, 1, 3)(1, 1, 1).

Stage 2: Model Estimation and Validation

# Model 1: SARIMA (2, 1, 2)(1, 1, 1)_12
model1_sarima <- arima (palm_oil.ts, order = c(2, 1, 2),
                        seas = list(order = c(1, 1, 1), 12))
summary (model1_sarima)

## 
## Call:
## arima(x = palm_oil.ts, order = c(2, 1, 2), seasonal = list(order = c(1, 1, 1), 
##     12))
## 
## Coefficients:
##          ar1     ar2      ma1      ma2    sar1     sma1
##       0.3719  0.3016  -0.8400  -0.1600  0.2754  -1.0000
## s.e.  0.3738  0.2461   0.3878   0.3839  0.1151   0.1158
## 
## sigma^2 estimated as 1672182656:  log likelihood = -1301.23,  aic = 2616.45
## 
## Training set error measures:
##                     ME     RMSE      MAE        MPE     MAPE      MASE
## Training set -226.5567 38614.12 27735.88 -0.3474768 4.738181 0.4744402
##                      ACF1
## Training set -0.001012883

# Model 2: SARIMA (1, 1, 2)(0, 1, 1)_12
model2_sarima <- arima (palm_oil.ts, order = c(1, 1, 2),
                        seas = list(order = c(0, 1, 1), 12))
summary (model2_sarima)

## 
## Call:
## arima(x = palm_oil.ts, order = c(1, 1, 2), seasonal = list(order = c(0, 1, 1), 
##     12))
## 
## Coefficients:
##          ar1      ma1     ma2     sma1
##       0.7773  -1.2402  0.2402  -0.9999
## s.e.  0.0966   0.1505  0.1323   0.2465
## 
## sigma^2 estimated as 1678053335:  log likelihood = -1304.56,  aic = 2619.13
## 
## Training set error measures:
##                    ME     RMSE     MAE        MPE     MAPE      MASE
## Training set 737.5054 38681.84 28061.3 -0.2309915 4.774227 0.4800067
##                     ACF1
## Training set -0.01735445

# Model 3: SARIMA (1, 1, 2)(1, 1, 1)_12
model3_sarima <- arima (palm_oil.ts, order = c(1, 1, 2),
                        seas = list(order = c(1, 1, 1), 12))
summary (model3_sarima)

## 
## Call:
## arima(x = palm_oil.ts, order = c(1, 1, 2), seasonal = list(order = c(1, 1, 1), 
##     12))
## 
## Coefficients:
##          ar1      ma1     ma2    sar1    sma1
##       0.7988  -1.2492  0.2493  0.2679  -1.000
## s.e.  0.0882   0.1350  0.1246  0.1139   0.114
## 
## sigma^2 estimated as 1685415456:  log likelihood = -1301.67,  aic = 2615.34
## 
## Training set error measures:
##                     ME    RMSE      MAE       MPE     MAPE      MASE
## Training set -420.6308 38766.6 27718.66 -0.389585 4.733551 0.4741456
##                     ACF1
## Training set -0.02075458

# Model 4: SARIMA (2, 1, 3)(1, 1, 1)_12
model4_sarima <- arima (palm_oil.ts, order = c(2, 1, 3),
                        seas = list(order = c(1, 1, 1), 12))
summary (model4_sarima)

## 
## Call:
## arima(x = palm_oil.ts, order = c(2, 1, 3), seasonal = list(order = c(1, 1, 1), 
##     12))
## 
## Coefficients:
##          ar1     ar2      ma1      ma2      ma3    sar1     sma1
##       0.4460  0.2210  -0.9105  -0.0359  -0.0533  0.2744  -0.9999
## s.e.  0.4923  0.4024   0.4950   0.6162   0.2000  0.1154   0.1162
## 
## sigma^2 estimated as 1670122121:  log likelihood = -1301.19,  aic = 2618.39
## 
## Training set error measures:
##                     ME     RMSE      MAE        MPE     MAPE      MASE
## Training set -114.9221 38590.32 27723.41 -0.3263134 4.731755 0.4742269
##                      ACF1
## Training set -0.003992601

Explanation: Based on the lowest AIC value, model 3, SARIMA(1, 1, 2)(1, 1, 1) is chosen. In addition, based on Ljung-Box test, the residuals for this model is white noise since we accepted the null hypothesis of the residuals are white noise.

Box.test(model3_sarima$residuals, lag = 1, type = "Ljung-Box")

## 
##  Box-Ljung test
## 
## data:  model3_sarima$residuals
## X-squared = 0.052993, df = 1, p-value = 0.8179

Stage 3: Model Application

# forecast the value for 12-step ahead (1999)
model3_sarima.for <- predict (model3_sarima, n.ahead = 12)

# plotting forecast value for oil palm production based on model3_sarima
par (mfrow = c(1,1)) # 1 graph per device
ts.plot (palm_oil.ts, model3_sarima.for$pred, 
           ylab = "Palm Oil Production Forecast", col = c("blue", "red"))
grid ()
U.for = model3_sarima.for$pred + model3_sarima.for$se
L.for = model3_sarima.for$pred - model3_sarima.for$se
xx.for = c(time (U.for), rev (time (U.for)))
yy.for = c(L.for, rev(U.for))
polygon (xx.for, yy.for, border = 8, col = gray (0.6, alpha =0.2))
lines (model3_sarima.for$pred, type = "p", col = "red")

# plotting forecast value for 1999
month.name.for <- c("Jan", "Feb", "Mac", "Apr", "May", "Jun", "Jul", 
                    "Aug", "Sep", "Oct", "Nov", "Dec", "Jan")

plot (model3_sarima.for$pred, type = "o", ylab ="Palm Oil Production", 
      ylim = c(min (L.for)-10000, max(U.for+10000)),
      main = "Forecast Value for 1999", xaxt = "n")
axis (1, at = seq (from = 1999, to = 2000, by= 1/12), labels = month.name.for)
polygon (xx.for, yy.for, border = 8, col = gray (0.6, alpha=0.2))
text (model3_sarima.for$pred~time(model3_sarima.for$pred), labels = round (pred, 0), 
      data=model3_sarima.for, cex=0.6, font=2, pos=1)
grid()

Figure show the forecast value by using model3_sarima model. Students should be able to comment on the forecast values obtained.

Box-Jenkins Methodology using R Programming

STA570/STA572: Time Series Analysis and Forecasting

Updated Jan 2021

Data Preparation (Textbook Example)

Textbook Examples

CPI Example

Monthly Road Accidents in Accidents Example

Palm Oil Production Example