Autoregressive Integrated Moving Average (ARIMA) is often also called the Box-Jenkins time series method. ARIMA has very good accuracy for short-term forecasting, while for long-term forecasting its accuracy is not good. Usually it will tend to be flat (flat/constant) for quite a long period. (Wei W. W., 1990)
The Autoregressive Integrated Moving Average (ARIMA) model is a model that completely ignores independent variables in making forecasts and a model that assumes that the input data must be stationary (Wei W. W., 1990). If the input data is not stationary, adjustments need to be made to produce stationary data. ARIMA uses the past and present values of the dependent variable to produce accurate short-term forecasts. ARIMA is suitable if observations from a time series are statistically related to each other (dependent).
The data on the number of passengers of Argo Jati Railway that served the Cirebon-Gambir route in 2013 to 2018 were used. The selection of Argo Jati Train in this study was due to Ago Jati being one of the executive trains of the Operational Area 3 Cirebon with a pattern of passenger numbers from 2013 to 2018 it moves irregularly or not constant which indicates that there are no seasonal factors in it.
The following is a time series plot form of data on the number of passengers for the period January 2013 to December 2018.
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 2013 10621 10447 11453 11450 12213 13103 12009 13001 11422 12427 6370 6933
## 2014 4664 7118 8952 10308 9309 9261 6177 10857 6804 6984 7452 7044
## 2015 12223 12484 15294 14946 15902 14365 16746 10278 10630 9670 9736 11483
## 2016 9990 9435 9594 10227 13648 9397 17785 10582 12166 11589 9669 12437
## 2017 12946 9169 11046 8410 9451 9560 14312 8510 11041 8344 8404 10585
## 2018 9298 8838 10438 14185 12313 18939 18268 16467 14368 14430 18894 18107
From the data plot above, there are two variables, namely the x axis and the y axis, where the x axis is the year while the y axis is the number of Argo Jati Railway passengers. The number of train passengers moving from year to year tends to fluctuate which is irregular or not constant which indicates that there are no seasonal factors in it. Overall, it appears that the data pattern is horizontal.
The data pattern above shows that there has been an increase in value over time, but has experienced some less significant increases in several years.
In the ARIMA method, data must meet stationary assumptions in the mean and variance. Data instability can be proven by ACF and PACF graphs. Next is the ACF and PACF graph output from the series of passenger numbers:
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
##
## Augmented Dickey-Fuller Test
##
## data: penumpang
## Dickey-Fuller = -2.3824, Lag order = 4, p-value = 0.4196
## alternative hypothesis: stationary
Based on the graph above, it can be concluded that data is not stationary in variance. This can be seen from the distribution of data that is not around a straight line or a constant average. The autocorrelation coefficient test for data on the number of Argo Jati Railway passengers also shows the presence of traits average instability. Because the shape of the ACF chart which tends to decrease slowly approaches zero, the data for the number of train passengers in the last 6 years is not stationary.
ACF and PACF graph analysis have weaknesses because decisions are
taken subjectively, so that differences in decision making are possible.
For this reason, data instability can be proven using a formal test, the
ADF test. Based on the results of the ADF test using the
0.05 level of significance obtained p-value of
0.4196, which means that the data is not stationary.
In the analysis of time series cases like this, if the data is not stationary, it can be overcome by making adjustments to produce stationary data. One method that is commonly used is the method of differencing. Differencing processes can be carried out for several orders until stationary data. The following researchers conduct the first differentiation process:
#first differentiation
penumpang.diff1=diff(penumpang,differences = 1)
ts.plot(penumpang.diff1, main="Difference Data Plot 1")## Warning in adf.test(penumpang.diff1): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: penumpang.diff1
## Dickey-Fuller = -5.0385, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary
After doing differencing, the p-value obtained from the ADF test is
0.01, which means stationary data. Based on the plot above,
railway passenger data is stationary both in the mean and variance. The
data has been stationary in the mean because the data looks to have
fluctuated or moved around the average or in other words the data runs
parallel to the x axis. Besides that the data is also stationary in
variance because based on the plot above the distance of data plots
between one another is relatively homogeneous or constant.
After doing differencing order 1 and the data has been stationary, then the identification of the model is then done to find out the model that is most suitable for the ARIMA forecasting method. The identification of the ARIMA model can be seen using ACF and PACF plots from first-order differential data. The following is a plot of ACF and PACF plots of transformed data.
#acf and pacf graph of first differentiation
par (mfrow=c(1,2))
Acf(penumpang.diff1, lag.max=24)
Pacf(penumpang.diff1,lag.max = 24)The output lag used on ACF and PACF charts is only the first 4 lags.
The PACF plot is used to model AR. Based on the plot, the PACF stem
comes out until the 1st lag so that it is obtained p = 1.
While the ACF plot is used to model the MA. In the plot, the ACF bar
exits until the second lag so that q = 2. As well as the
previous difference until the first order is done, so that
d = 1.
Based on this identification, the main ARIMA model,
ARIMA (1,1,2) is obtained. Whereas other models can be
obtained with lower orders or order combinations on the main model, so
that ARIMA (1,1,1), ARIMA (1,1,0),
ARIMA (0,1,2), and ARIMA models are obtained
(0,1,1).
Then performs parameter estimates for the five models to see the significance of the model coefficients.
After several ARIMA temporary models have been identified, then the best or most efficient estimation for the parameters in the model is then carried out. The estimation stage is done to see the significance of the coefficients of each model that will be used in forecasting the number of passengers of the Argo Jati Railway in the Cirebon-Gambir class.
#model utama
model1=Arima(penumpang, order=c(1,1,2))
#model 2
model2=Arima(penumpang, order=c(1,1,1))
#model 3
model3=Arima(penumpang, order=c(1,1,0))
#model 4
model4=Arima(penumpang, order=c(0,1,2))
#model 5
model5=Arima(penumpang, order=c(0,1,1))
pvalue <- function (x, digits = 4,se=T,...){
if (length(x$coef) > 0) {
cat("\nCoefficients:\n")
coef <- round(x$coef, digits = digits)
if (se && nrow(x$var.coef)) {
ses <- rep(0, length(coef))
ses[x$mask] <- round(sqrt(diag(x$var.coef)), digits = digits)
coef <- matrix(coef, 1, dimnames = list(NULL, names(coef)))
coef <- rbind(coef, s.e. = ses)
statt <- coef[1,]/ses
pval <- 2*pt(abs(statt), df=length(x$residuals)-1, lower.tail = F)
coef <- rbind(coef, t=round(statt,digits=digits),sign.=round(pval,digits=digits))
coef <- t(coef)
}
print.default(coef, print.gap = 2)
}
}
pvalue(model1) ##
## Coefficients:
## s.e. t sign.
## ar1 -0.4544 0.2169 -2.0950 0.0397
## ma1 0.0323 0.2004 0.1612 0.8724
## ma2 0.1897 0.2176 0.8718 0.3863
##
## Coefficients:
## s.e. t sign.
## ar1 -0.5848 0.1700 -3.4400 0.0010
## ma1 0.1183 0.2006 0.5897 0.5572
##
## Coefficients:
## s.e. t sign.
## ar1 -0.4956 0.1016 -4.878 0
##
## Coefficients:
## s.e. t sign.
## ma1 -0.3933 0.1331 -2.9549 0.0042
## ma2 0.2184 0.1387 1.5746 0.1198
##
## Coefficients:
## s.e. t sign.
## ma1 -0.4322 0.1044 -4.1398 1e-04
The following is a parameter estimate on the model obtained:
Based on the above analysis, it is known that there are 2 models that
are feasible to use in forecasting the number of passengers, namely the
third model (ARIMA (1,1,0)) and the fifth model
(ARIMA (0,1,1)). This is because by using a 95% confidence
level both parameters of the model are significant. So that both models
are feasible to use in forecasting the number of passengers.
From the parameter estimation above, there is a significant coefficient which then the researcher will check for residual analysis conducted by testing the assumptions needed in the time series analysis, namely diagnostic checking. The assumption needed in time series analysis is that there is no residual autocorrelation, the model is homoschedastic (constant residual variable), and the residual is normally distributed.
Then the researcher performs a diagnostic test to see whether there
is autocorrelation or not. This test is carried out on only significant
models. The following is the diagnostic test output of the
ARIMA (1,1,0) model:
Based on the above output, the residual ARIMA (1,1,0)
model is white noise (the residuals of the model have met identical and
independent assumptions) because there is no lag to ≥ 1 that exits the
interval. In addition, the p-value in L-jung Box is also above the 5%
limit, which means that the residual does not contain a correlation.
Because the residuals in the ARIMA (1,1,0) model are white
noise (WN) and do not contain correlations, the diagnostic test on this
model is fulfilled.
Next, a diagnostic test will be carried out on the
ARIMA (0,1,1) model.
Based on the output above, the residual ARIMA (0,1,1)
model is not white noise (WN) because there is a lag to ≥ 1, namely the
3rd lag that exits the interval limit. However, the residuals in this
model do not contain a correlation because there is no p-value in L-jung
Box that is above the 5% limit. That way, it means that the diagnostic
test of the ARIMA (0,1,1) model is not fulfilled because
the assumption of white noise is not fulfilled.
Based on the results of the above test, it can be concluded that the
ARIMA (1,1,0) model is the most appropriate model used for
forecasting the number of Argo Jati Railway passengers for the executive
class with the Cirebon-Gambir route because for the test the coefficient
is significant, all assumptions for testing residual (diagnostic
checking) fulfilled. With the selection of the best
ARIMA (1,1,0) model, it can be interpreted that this model
contains Autoregressive (AR) and does not contain Moving Average (MA).
Autoregressive (AR) has the assumption that the current period data is
influenced by data in the previous period. While the absence of the
Moving Average (MA) in this model has the assumption that the current
period data is not affected by errors from the data period.
The number of train passengers will be predicted for the next 3
periods by 2019 using the ARIMA (1,1,0) model. Based on the
model, the forecasting results obtained from January to December 2019
are as follows:
## $pred
## Jan Feb Mar
## 2019 18497.06 18303.74 18399.56
##
## $se
## Jan Feb Mar
## 2019 2456.736 2751.527 3311.506
Based on the above output, it is known that by using the
ARIMA (1,1,0) model, the number of Argo Jati Train
passengers for the executive class with the Cirebon-Gambir route from
January to May 2019 tends to move up and down. Next is the plot from the
results of forecasting the number of train passengers Argo Jati from
January 2019 to May 2019.
## Jan Feb Mar Apr May Jun Jul
## 2013 10610.379 10598.118 10533.240 10954.391 11451.487 11834.830 12661.885
## 2014 6653.957 5788.596 5901.712 8043.005 9635.919 9804.139 9284.790
## 2015 7246.219 9656.106 12354.639 13901.266 15118.481 15428.173 15126.791
## 2016 10617.126 10729.983 9710.077 9515.194 9913.263 11952.433 11503.944
## 2017 11065.082 12693.722 11041.013 10115.693 9716.494 8935.044 9505.976
## 2018 9504.020 9935.882 9065.992 9644.984 12327.856 13240.829 15654.923
## Aug Sep Oct Nov Dec
## 2013 12551.225 12509.330 12204.608 11928.887 9372.061
## 2014 7705.538 8537.428 8812.809 6894.786 7220.043
## 2015 15565.893 13483.767 10455.536 10145.810 9703.288
## 2016 13627.614 14152.059 11380.914 11874.981 10620.619
## 2017 11956.743 11385.674 9786.548 9680.728 8374.262
## 2018 18600.571 17359.639 15408.338 14399.271 16681.485
#grafik
ts.plot(penumpang, col="red",lwd=1,xlab="Year", ylab="Number of Passengers", main="Plot Prediction of Number of Passengers")
lines(penumpang.fitted, col="purple",lwd=1)
legend("bottomright",c("Actual Data","Prediction Data"),bty="n",lwd=2,col=c("red","purple"))From the above output, it can be seen that the prediction data is close to the original data and also tends not to form a constant seasonal data pattern.
So it can be said that predictive data has high accuracy. This is reinforced by the MAPE value or Mean Absolute Precentage Error. To find out the forecasting method is good or not, it can be seen from the MAPE value. If the value is getting smaller, the forecasting is getting better, because the error rate is smaller.