Autoregressive Integrated Moving Average (ARIMA)

1. Intro

Autoregressive Integrated Moving Average (ARIMA) is often also called the Box-Jenkins time series method. ARIMA has very good accuracy for short-term forecasting, while for long-term forecasting its accuracy is not good. Usually it will tend to be flat (flat/constant) for quite a long period. (Wei W. W., 1990)

The Autoregressive Integrated Moving Average (ARIMA) model is a model that completely ignores independent variables in making forecasts and a model that assumes that the input data must be stationary (Wei W. W., 1990). If the input data is not stationary, adjustments need to be made to produce stationary data. ARIMA uses the past and present values of the dependent variable to produce accurate short-term forecasts. ARIMA is suitable if observations from a time series are statistically related to each other (dependent).

The data on the number of passengers of Argo Jati Railway that served the Cirebon-Gambir route in 2013 to 2018 were used. The selection of Argo Jati Train in this study was due to Ago Jati being one of the executive trains of the Operational Area 3 Cirebon with a pattern of passenger numbers from 2013 to 2018 it moves irregularly or not constant which indicates that there are no seasonal factors in it.

2. Input Data

volume=read.delim("clipboard")
volume

The following is a time series plot form of data on the number of passengers for the period January 2013 to December 2018.

penumpang<-ts(volume$VOLUME,start=c(2013,1),freq=12)
penumpang

##        Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
## 2013 10621 10447 11453 11450 12213 13103 12009 13001 11422 12427  6370  6933
## 2014  4664  7118  8952 10308  9309  9261  6177 10857  6804  6984  7452  7044
## 2015 12223 12484 15294 14946 15902 14365 16746 10278 10630  9670  9736 11483
## 2016  9990  9435  9594 10227 13648  9397 17785 10582 12166 11589  9669 12437
## 2017 12946  9169 11046  8410  9451  9560 14312  8510 11041  8344  8404 10585
## 2018  9298  8838 10438 14185 12313 18939 18268 16467 14368 14430 18894 18107

ts.plot(penumpang,col="red",main="Time Series Plot")

From the data plot above, there are two variables, namely the x axis and the y axis, where the x axis is the year while the y axis is the number of Argo Jati Railway passengers. The number of train passengers moving from year to year tends to fluctuate which is irregular or not constant which indicates that there are no seasonal factors in it. Overall, it appears that the data pattern is horizontal.

The data pattern above shows that there has been an increase in value over time, but has experienced some less significant increases in several years.

3. Autoregressive Integrated Moving Average (ARIMA) Method

3.1 Stationary Data

In the ARIMA method, data must meet stationary assumptions in the mean and variance. Data instability can be proven by ACF and PACF graphs. Next is the ACF and PACF graph output from the series of passenger numbers:

library(tseries)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(forecast)

#stationary test
adf.test(penumpang)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  penumpang
## Dickey-Fuller = -2.3824, Lag order = 4, p-value = 0.4196
## alternative hypothesis: stationary

#acf and pacf graph
par (mfrow=c(1,2)) 
Acf(penumpang, lag.max=24) 
Pacf(penumpang,lag.max = 24)

Based on the graph above, it can be concluded that data is not stationary in variance. This can be seen from the distribution of data that is not around a straight line or a constant average. The autocorrelation coefficient test for data on the number of Argo Jati Railway passengers also shows the presence of traits average instability. Because the shape of the ACF chart which tends to decrease slowly approaches zero, the data for the number of train passengers in the last 6 years is not stationary.

ACF and PACF graph analysis have weaknesses because decisions are taken subjectively, so that differences in decision making are possible. For this reason, data instability can be proven using a formal test, the ADF test. Based on the results of the ADF test using the 0.05 level of significance obtained p-value of 0.4196, which means that the data is not stationary.

In the analysis of time series cases like this, if the data is not stationary, it can be overcome by making adjustments to produce stationary data. One method that is commonly used is the method of differencing. Differencing processes can be carried out for several orders until stationary data. The following researchers conduct the first differentiation process:

#first differentiation
penumpang.diff1=diff(penumpang,differences = 1)
ts.plot(penumpang.diff1, main="Difference Data Plot 1")

adf.test(penumpang.diff1)

## Warning in adf.test(penumpang.diff1): p-value smaller than printed p-value

## 
##  Augmented Dickey-Fuller Test
## 
## data:  penumpang.diff1
## Dickey-Fuller = -5.0385, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary

After doing differencing, the p-value obtained from the ADF test is 0.01, which means stationary data. Based on the plot above, railway passenger data is stationary both in the mean and variance. The data has been stationary in the mean because the data looks to have fluctuated or moved around the average or in other words the data runs parallel to the x axis. Besides that the data is also stationary in variance because based on the plot above the distance of data plots between one another is relatively homogeneous or constant.

3.2 Identification of the ARIMA Model

After doing differencing order 1 and the data has been stationary, then the identification of the model is then done to find out the model that is most suitable for the ARIMA forecasting method. The identification of the ARIMA model can be seen using ACF and PACF plots from first-order differential data. The following is a plot of ACF and PACF plots of transformed data.

#acf and pacf graph of first differentiation
par (mfrow=c(1,2))
Acf(penumpang.diff1, lag.max=24)
Pacf(penumpang.diff1,lag.max = 24)

The output lag used on ACF and PACF charts is only the first 4 lags. The PACF plot is used to model AR. Based on the plot, the PACF stem comes out until the 1st lag so that it is obtained p = 1. While the ACF plot is used to model the MA. In the plot, the ACF bar exits until the second lag so that q = 2. As well as the previous difference until the first order is done, so that d = 1.

Based on this identification, the main ARIMA model, ARIMA (1,1,2) is obtained. Whereas other models can be obtained with lower orders or order combinations on the main model, so that ARIMA (1,1,1), ARIMA (1,1,0), ARIMA (0,1,2), and ARIMA models are obtained (0,1,1).

3.3 Parameter Estimation of the ARIMA Model

Then performs parameter estimates for the five models to see the significance of the model coefficients.

After several ARIMA temporary models have been identified, then the best or most efficient estimation for the parameters in the model is then carried out. The estimation stage is done to see the significance of the coefficients of each model that will be used in forecasting the number of passengers of the Argo Jati Railway in the Cirebon-Gambir class.

#model utama
model1=Arima(penumpang, order=c(1,1,2)) 

#model 2
model2=Arima(penumpang, order=c(1,1,1)) 

#model 3
model3=Arima(penumpang, order=c(1,1,0)) 

#model 4
model4=Arima(penumpang, order=c(0,1,2))

#model 5
model5=Arima(penumpang, order=c(0,1,1)) 


pvalue <- function (x, digits = 4,se=T,...){ 
  if (length(x$coef) > 0) {
    cat("\nCoefficients:\n")
    coef <- round(x$coef, digits = digits)
    if (se && nrow(x$var.coef)) {
      ses <- rep(0, length(coef))
      ses[x$mask] <- round(sqrt(diag(x$var.coef)), digits = digits)
      coef <- matrix(coef, 1, dimnames = list(NULL, names(coef)))
      coef <- rbind(coef, s.e. = ses)
      statt <- coef[1,]/ses
      pval  <- 2*pt(abs(statt), df=length(x$residuals)-1, lower.tail = F)
      coef <- rbind(coef, t=round(statt,digits=digits),sign.=round(pval,digits=digits))
      coef <- t(coef)
    }
    print.default(coef, print.gap = 2)
  }
}

pvalue(model1)

## 
## Coefficients:
##                 s.e.        t   sign.
## ar1  -0.4544  0.2169  -2.0950  0.0397
## ma1   0.0323  0.2004   0.1612  0.8724
## ma2   0.1897  0.2176   0.8718  0.3863

pvalue(model2)

## 
## Coefficients:
##                 s.e.        t   sign.
## ar1  -0.5848  0.1700  -3.4400  0.0010
## ma1   0.1183  0.2006   0.5897  0.5572

pvalue(model3)

## 
## Coefficients:
##                 s.e.       t  sign.
## ar1  -0.4956  0.1016  -4.878      0

pvalue(model4)

## 
## Coefficients:
##                 s.e.        t   sign.
## ma1  -0.3933  0.1331  -2.9549  0.0042
## ma2   0.2184  0.1387   1.5746  0.1198

pvalue(model5)

## 
## Coefficients:
##                 s.e.        t  sign.
## ma1  -0.4322  0.1044  -4.1398  1e-04

The following is a parameter estimate on the model obtained:

Based on the above analysis, it is known that there are 2 models that are feasible to use in forecasting the number of passengers, namely the third model (ARIMA (1,1,0)) and the fifth model (ARIMA (0,1,1)). This is because by using a 95% confidence level both parameters of the model are significant. So that both models are feasible to use in forecasting the number of passengers.

3.4 Diagnostic Testing

From the parameter estimation above, there is a significant coefficient which then the researcher will check for residual analysis conducted by testing the assumptions needed in the time series analysis, namely diagnostic checking. The assumption needed in time series analysis is that there is no residual autocorrelation, the model is homoschedastic (constant residual variable), and the residual is normally distributed.

Then the researcher performs a diagnostic test to see whether there is autocorrelation or not. This test is carried out on only significant models. The following is the diagnostic test output of the ARIMA (1,1,0) model:

tsdiag(model3)

Based on the above output, the residual ARIMA (1,1,0) model is white noise (the residuals of the model have met identical and independent assumptions) because there is no lag to ≥ 1 that exits the interval. In addition, the p-value in L-jung Box is also above the 5% limit, which means that the residual does not contain a correlation. Because the residuals in the ARIMA (1,1,0) model are white noise (WN) and do not contain correlations, the diagnostic test on this model is fulfilled.

Next, a diagnostic test will be carried out on the ARIMA (0,1,1) model.

tsdiag(model5)

Based on the output above, the residual ARIMA (0,1,1) model is not white noise (WN) because there is a lag to ≥ 1, namely the 3rd lag that exits the interval limit. However, the residuals in this model do not contain a correlation because there is no p-value in L-jung Box that is above the 5% limit. That way, it means that the diagnostic test of the ARIMA (0,1,1) model is not fulfilled because the assumption of white noise is not fulfilled.

Based on the results of the above test, it can be concluded that the ARIMA (1,1,0) model is the most appropriate model used for forecasting the number of Argo Jati Railway passengers for the executive class with the Cirebon-Gambir route because for the test the coefficient is significant, all assumptions for testing residual (diagnostic checking) fulfilled. With the selection of the best ARIMA (1,1,0) model, it can be interpreted that this model contains Autoregressive (AR) and does not contain Moving Average (MA). Autoregressive (AR) has the assumption that the current period data is influenced by data in the previous period. While the absence of the Moving Average (MA) in this model has the assumption that the current period data is not affected by errors from the data period.

3.5 Forecasting

The number of train passengers will be predicted for the next 3 periods by 2019 using the ARIMA (1,1,0) model. Based on the model, the forecasting results obtained from January to December 2019 are as follows:

#forecast
pred.penumpang=predict(model3,n.ahead = 3)
pred.penumpang

## $pred
##           Jan      Feb      Mar
## 2019 18497.06 18303.74 18399.56
## 
## $se
##           Jan      Feb      Mar
## 2019 2456.736 2751.527 3311.506

Based on the above output, it is known that by using the ARIMA (1,1,0) model, the number of Argo Jati Train passengers for the executive class with the Cirebon-Gambir route from January to May 2019 tends to move up and down. Next is the plot from the results of forecasting the number of train passengers Argo Jati from January 2019 to May 2019.

#predict
penumpang.fitted=fitted(model3)
penumpang.fitted

##            Jan       Feb       Mar       Apr       May       Jun       Jul
## 2013 10610.379 10598.118 10533.240 10954.391 11451.487 11834.830 12661.885
## 2014  6653.957  5788.596  5901.712  8043.005  9635.919  9804.139  9284.790
## 2015  7246.219  9656.106 12354.639 13901.266 15118.481 15428.173 15126.791
## 2016 10617.126 10729.983  9710.077  9515.194  9913.263 11952.433 11503.944
## 2017 11065.082 12693.722 11041.013 10115.693  9716.494  8935.044  9505.976
## 2018  9504.020  9935.882  9065.992  9644.984 12327.856 13240.829 15654.923
##            Aug       Sep       Oct       Nov       Dec
## 2013 12551.225 12509.330 12204.608 11928.887  9372.061
## 2014  7705.538  8537.428  8812.809  6894.786  7220.043
## 2015 15565.893 13483.767 10455.536 10145.810  9703.288
## 2016 13627.614 14152.059 11380.914 11874.981 10620.619
## 2017 11956.743 11385.674  9786.548  9680.728  8374.262
## 2018 18600.571 17359.639 15408.338 14399.271 16681.485

#grafik
ts.plot(penumpang, col="red",lwd=1,xlab="Year", ylab="Number of Passengers", main="Plot Prediction of Number of Passengers")
lines(penumpang.fitted, col="purple",lwd=1)
legend("bottomright",c("Actual Data","Prediction Data"),bty="n",lwd=2,col=c("red","purple"))

From the above output, it can be seen that the prediction data is close to the original data and also tends not to form a constant seasonal data pattern.

So it can be said that predictive data has high accuracy. This is reinforced by the MAPE value or Mean Absolute Precentage Error. To find out the forecasting method is good or not, it can be seen from the MAPE value. If the value is getting smaller, the forecasting is getting better, because the error rate is smaller.