A. INTRODUCTION

       Autoregressive Integrated Moving Average (ARIMA) is one of the forecasting methods in Time Series. ARIMA is commonly used for non-stationary time series data. Furthermore, the non-stationary itself is being indicated by inconstant mean and variance. Non-stationary caused the time series data to become unpredictable. However, non-stationarity within the data could be turned into stationary by utilizing the I within ARIMA or in its integrated part. ARIMA model equation can be written as follow.
Figure 1. General ARIMA Model Equation
Figure 1. General ARIMA Model Equation
       Need to be noted, ARIMA model (p,d,q) is constructed with p as the Autoregressive (AR), d as differencing or integrated (I), and q as Moving Average (MA) parameter. One of the goals in this analysis is to determine best ARIMA model using real case data before getting into random variable simulation.
       Not only ARIMA model is required to generate random variables through ARIMA, parameters’ coefficients have to be identified. Both best ARIMA model and its parameters are needed to determine the model ability to be generalized for other time series data or at least in the topic of designated real case data.
       Desired result can be achieved through identification of similarities between generated random variables and real case data. Best ARIMA model and its parameters’ coefficient for both data are expected to be similar or quiet identical.

B. REAL CASE DATA

       Before generating random variables through ARIMA, the best ARIMA model and its coefficients for each components needs to be identified. Those can be found through ARIMA analysis using real case data.
       The data for example down below is inflation in Bogor City, Indonesia. The data could be access on Central Bureau of Statistics Indonesia or simply click here to download the cleaned version.
       Note: You may also insert your own data and change several codes. The syntax which could be substituted will be notified in note part.
       As has been mentioned in introduction, we want to answer a question as follow.
       “Has the inflation in Bogor City data used optimum ARIMA model? If I am simulating random variables which follow the same ARIMA model, will the model pass the statistical property tests as if the model found in real case data?”

Loading Libraries

       The following are the packages needed in ARIMA analysis. Each library can be deep learnt by easily going to ‘Help’ on RStudio and search the package name.
# Library
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.4.3
library(readxl)
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(tseries)
library(lmtest)
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.4.3
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.4.3
## Warning: package 'readr' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.2.0
## ✔ forcats   1.0.1     ✔ readr     2.1.6
## ✔ ggplot2   4.0.1     ✔ stringr   1.6.0
## ✔ lubridate 1.9.4     ✔ tibble    3.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(patchwork)
library(FitAR)
## Loading required package: lattice
## Loading required package: leaps
## Loading required package: ltsa
## Loading required package: bestglm
## 
## Attaching package: 'FitAR'
## 
## The following object is masked from 'package:forecast':
## 
##     BoxCox
# Install the remotes package if you don't have it
# install.packages("remotes")

# Install FitAR from the CRAN archive mirror on GitHub
# remotes::install_github("cran/FitAR")

1. Descriptive Statistics

       Import the data by using the location of the file or using ‘Import Dataset’ on ‘File’ menu. Data which has been imported can be searched its descriptive statistics by using summary function. Descriptive statistics as usual is providing basic statistic information such as minimum as well as maximum values, mean, median, and whatnot.
       Note: You can easily change inflation with key word which compatible with your data. You also have to adjust the data$Inflation (%)` with your data’s numerical column name if you are using different time series data. For example, my data happened to have numerical or inflation values in column named ‘Inflation (%)’.
# -- 1. Descriptive Statistics
data <- read_excel("Bogor_Inflation_ARIMA_Syntax.xlsx")
Inflation = data$`Inflation (%)`
# View(data)
summary(data)
##       Date                     Inflation (%)    
##  Min.   :2020-01-01 00:00:00   Min.   :-0.4500  
##  1st Qu.:2021-03-24 06:00:00   1st Qu.: 0.0600  
##  Median :2022-06-16 00:00:00   Median : 0.1950  
##  Mean   :2022-06-16 15:12:00   Mean   : 0.2482  
##  3rd Qu.:2023-09-08 12:00:00   3rd Qu.: 0.4425  
##  Max.   :2024-12-01 00:00:00   Max.   : 1.1800
# Plot Time Series
plot.ts(Inflation, 
          main= "Plot Time Series Inflation in Bogor",
          xlab= "Period")

2. Split Data

       Second step is separating data into training and testing. Training as for identifying best ARIMA model, testing for validating the model and its ability to make prediction. People commonly divided data training and testing following 2:1 ratio. Need to be noted, splitting data is optional unless there will be forecasting part which not contained in this analysis. However, the data will only be used the training (48 data) in case of the additional prediction section.
       Note: You could adjust the periods for training and testing data. Also, do not forget to change Inflation (%) with your numerical column name.
# -- 2. Split Data
train_data <- data[1:48, ]  # 2020-2023
test_data <- data[49:60, ]  # 2024

# Convert to time series objects
tr_ts <- ts(train_data$`Inflation (%)`, 
            start = c(2020, 1), 
            frequency = 12)
te_ts <- ts(test_data$`Inflation (%)`, 
            start = c(2024, 1), 
            frequency = 12)

tr_data <- train_data$`Inflation (%)`

tr_ts
##        Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
## 2020  0.78  0.25  0.04 -0.02  0.01  0.27 -0.01 -0.16  0.11  0.13  0.32  0.44
## 2021  0.19  0.24  0.06  0.24  0.40 -0.17  0.07  0.08 -0.01  0.08  0.26  0.56
## 2022  0.53  0.13  0.97  0.68  0.55  0.75  0.55 -0.45  1.18  0.10  0.20  0.49
## 2023  0.61  0.36  0.15  0.32  0.22  0.18  0.24 -0.08  0.16  0.12  0.81  0.22
te_ts
##        Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
## 2024  0.06  0.60  0.45  0.10 -0.02 -0.11 -0.07  0.04 -0.14  0.08  0.26  0.49
tr_data
##  [1]  0.78  0.25  0.04 -0.02  0.01  0.27 -0.01 -0.16  0.11  0.13  0.32  0.44
## [13]  0.19  0.24  0.06  0.24  0.40 -0.17  0.07  0.08 -0.01  0.08  0.26  0.56
## [25]  0.53  0.13  0.97  0.68  0.55  0.75  0.55 -0.45  1.18  0.10  0.20  0.49
## [37]  0.61  0.36  0.15  0.32  0.22  0.18  0.24 -0.08  0.16  0.12  0.81  0.22

3. Stationary Test

       In ARIMA analysis we assumed there is trend within the time series data which called non-stationarity. Data considered as non-stationary when its mean and variance are inconstant. Non-stationary needs to be converted to stationary, otherwise the data will be difficult to predict. Fortunately, integrated part within ARIMA enables the alteration of non-stationary to stationary through differencing.
       Nonetheless, stationarity test within time series data before differencing needs to be done by using Augmented Dickey Fuller (ADF) Test. The ADF test will determine if differencing necessary. The desirable outcome would be p-value less than 0.05.
       Note: If you found your data is not stationary (p-value >= 0.05), go to step 4. Otherwise, please check the Partial Autocorrelation Function (PACF) and Autocorrelation Function (ACF) plots. PACF can be used to check available components for the p parameter by looking at the number of lag (stick) outside the confidence interval (blue dot line). ACF plot is likewise but for the q parameter instead.
# -- 3. Stationary Test
adf.test(tr_data)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  tr_data
## Dickey-Fuller = -2.2397, Lag order = 3, p-value = 0.4784
## alternative hypothesis: stationary
acf(tr_data, main = "ACF Plot")

pacf(tr_data, main = "PACF Plot")

4. Differencing

       Please pay attention on how many times the data needs to be differenced since it will be used to determine how many d on ARIMA model (p,d,q). Only differencing until the data is stationary or reject null hypothesis (p-value < 0.05) based on ADF test.
       For instance, since inflation data in Bogor City is stationary at first differencing, therefore, d = 1. Meanwhile, based on PACF and ACF plot, available component combinations for ARIMA model are including 0, 1, and 2 as the p parameter, while q parameter is 0 and 1. The combinations from parameters will be tested against each other in order to find the best ARIMA model.
       Note: You can delete the second differencing if it is not needed or your data is already stationary.
# -- 4. Differencing
diff1 = diff(tr_data)
adf.test(diff1)
## Warning in adf.test(diff1): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff1
## Dickey-Fuller = -4.3662, Lag order = 3, p-value = 0.01
## alternative hypothesis: stationary
acf(diff1, main = "ACF Differencing 1")

pacf(diff1, main = "PACF Differencing 1")

diff2 = diff(diff1) # <--- Only use it when the data yet stationary
adf.test(diff2)
## Warning in adf.test(diff2): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff2
## Dickey-Fuller = -6.7663, Lag order = 3, p-value = 0.01
## alternative hypothesis: stationary
acf(diff2, main = "ACF Differencing 2")

pacf(diff2, main = "PACF Differencing 2")

5. Finding the Best Model

       Best ARIMA model for inflation data could be identified through the value of Akaike’s Information Criterion (AIC). Even though AIC cannot tell particular model explanation power within data, it compares all available ARIMA model and determine which one of it has better explanatory ability. The model with the lowest AIC value is desirable.
       Note: Please add more m label if you have additional model combination and change the best_model with the model which has the lowest AIC value.
# -- 5. Model Determination
m1 <- arima(tr_ts, order = c(0,1,0))
m2 <- arima(tr_ts, order = c(0,1,1))
m3 <- arima(tr_ts, order = c(1,1,0))
m4 <- arima(tr_ts, order = c(1,1,1))
m5 <- arima(tr_ts, order = c(2,1,0))
m6 <- arima(tr_ts, order = c(2,1,1))
aic <- data.frame(Name = c("m1", "m2", "m3", "m4", "m5", "m6"),
                  Model = c("ARIMA (0,1,0)",
                            "ARIMA (0,1,1)",
                            "ARIMA (1,1,0)",
                            "ARIMA (1,1,1)",
                            "ARIMA (2,1,0)",
                            "ARIMA (2,1,1)"),
                  AIC = c(m1$aic,
                          m2$aic,
                          m3$aic,
                          m4$aic,
                          m5$aic,
                          m6$aic))
aic
##   Name         Model      AIC
## 1   m1 ARIMA (0,1,0) 54.52956
## 2   m2 ARIMA (0,1,1) 29.91778
## 3   m3 ARIMA (1,1,0) 39.20062
## 4   m4 ARIMA (1,1,1) 31.55783
## 5   m5 ARIMA (2,1,0) 32.08784
## 6   m6 ARIMA (2,1,1) 32.98561
best_model <- m2

6. Statistical Properties for Model Validation

       Post identifying the best ARIMA model for inflation data, examination of several statistical properties are required in order to ensure its power to predict future values. The following is the tests used to validate the ARIMA model.
  1. Z-Test: Z-test tells the difference between using ARIMA model and not utilizing it whatsoever to explain the variables within data. It judges if significant difference exist. In addition, parameters’ coefficient for the ARIMA model could be found. Since inflation data is using (0,1,1) for the ARIMA model, p parameter or AR will not be displayed. Nonetheless, ARIMA model equation will look as follow. (Desirable outcome, p-value < 0.05)
Figure 2. ARIMA Model (0,1,1) Equation for Inflation in Bogor 2020-2023.
Figure 2. ARIMA Model (0,1,1) Equation for Inflation in Bogor 2020-2023.
       Note: If you have ARIMA Model (2,1,1) you may see 2 AR values and 1 MA value.
  1. Ljung-Box Test: In time series, majority of residual within the model has had to be captured, leaving only pure randomness in residuals. In statistics, the residuals is called white noise. In spite of that, PACF and ACF plot for the residuals are expected to stay in confidence interval excluding lag 0. (Desirable outcome, p-value > 0.05)
  1. Box-Ljung Test : The errors within ARIMA model are expected to have no autocorrelation. (Desirable outcome, p-value > 0.05)
  1. Kolmogorov-Smirnov : The ARIMA model for the data needs to have its residuals normally distributed. (Desirable outcome, p-value > 0.05)
# -- 6. Model Significancy (z-test)
coeftest(best_model)
## 
## z test of coefficients:
## 
##     Estimate Std. Error z value  Pr(>|z|)    
## ma1 -0.87308    0.11830 -7.3804 1.578e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# -- 7. White Noise Residual Test
res1 = best_model$residuals
acf(res1)

pacf(res1)

checkresiduals(best_model)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1)
## Q* = 7.6004, df = 9, p-value = 0.5749
## 
## Model df: 1.   Total lags used: 10
# -- 8. Autocorrelation Test
Box.test(res1, lag = 23, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  res1
## X-squared = 18.084, df = 23, p-value = 0.7529
# -- 9. Normality Test (Kolmogorov-Smirnov)
ks.test(res1, "pnorm", mean(res1), sd(res1))
## 
##  Exact one-sample Kolmogorov-Smirnov test
## 
## data:  res1
## D = 0.096765, p-value = 0.7233
## alternative hypothesis: two-sided

C. GENERATING RANDOM DATA

       After discovering the optimum ARIMA model which is (0,1,1) and estimated coefficients for MA(1) is -0.87308 for inflation data in Bogor City, ARIMA simulation through random variables can be conducted.
       Question: “Are the random variables which are deliberately customized to use ARIMA model (0,1,1) will pass statistical validations, have similar results and quiet identical parameter’s coefficient value?”.
       Note: Change model = list(order = c(0,1,1), ma = -0.87308) with your best ARIMA model and estimated coefficient. If you have AR values, please input (order = c(), ar = , ma = ). If you have more than 2 values within one parameter, then use (order = c(2,1,1), ar = c(0.6, 0.8) , ma = c(0.7, 0.2) for instance. Beside that, adjust the n value based on total sample used in real case data.

1. Libraries and Generating Random Variables

# ----- ARIMA Model Simulation ----- #

# -- 1. Library and set.seed
set.seed(123)

# Load library
library(tseries)
library(forecast) 

# -- 2. Generating Data through Time Series
set.seed(123)
n <- 60   
  arima_data <- arima.sim(
    n = n,
    model = list(order = c(0,1,1), ma = -0.87308)
  )

ts.plot(arima_data, main = "ARIMA(0,1,1) Simulation")

2. Stationarity Test and Differencing

       The random data which has been generated and followed ARIMA model (0,1,1) is also being tested regarding its stationarity by using ADF test. Since inflation data has 1 time differencing from previous ARIMA analysis, the generated random data is expected to have d = 1 as well.
       Notes: Similar from the previous analysis, differencing could be modified based on generated random variables ADF test result.
# 3. Stationary Test
# ARIMA Model (0,1,1)
adf.test(arima_data)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  arima_data
## Dickey-Fuller = -3.244, Lag order = 3, p-value = 0.08892
## alternative hypothesis: stationary
# 4. Differencing:
  diff_arima <- diff(arima_data)
adf.test(diff_arima)
## Warning in adf.test(diff_arima): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff_arima
## Dickey-Fuller = -6.436, Lag order = 3, p-value = 0.01
## alternative hypothesis: stationary

3. PACF and ACF Plots

       Identical procedure as if ARIMA analysis with real case data, PACF and ACF plots are still becoming the baseline to determine how many possible component combinations exist for the random generated variables.
       Note: please change diff_arima based on stationarity if the generated data is-for example-already stationary without differencing, or second differencing happened.
# 4. Identifying Model Components (PACF and ACF Plots)
# ARIMA (0,1,1)
acf(diff_arima, main = "ACF Diff ARIMA")

pacf(diff_arima, main = "PACF Diff ARIMA")
       Similar results on both PACF and ACF plots as if inflation data, random variables also have possible AR components from lag 0, 1, and 2. Meanwhile, MA has 0 and 1 components based on ACF’s lag.

4. Best ARIMA Model and Estimating Parameter

       Move on to indentifying best ARIMA model and estimating parameter. ARIMA model which expected to have the lowest AIC value is (0,1,1). Similar conclusion can be achieved since the generated random data has been deliberately made to follow ARIMA model (0,1,1). Meanwhile, parameter’s coefficient is also likely to be identical with inflation ARIMA analysis.
# 5. Estimating Parameter
model_arima1 <- arima(arima_data, order = c(0,1,0))
summary(model_arima1)
## 
## Call:
## arima(x = arima_data, order = c(0, 1, 0))
## 
## 
## sigma^2 estimated as 1.518:  log likelihood = -97.66,  aic = 197.32
## 
## Training set error measures:
##                      ME     RMSE       MAE      MPE     MAPE      MASE
## Training set 0.02360334 1.221995 0.9116226 61.56302 148.5532 0.9836066
##                    ACF1
## Training set -0.4870249
model_arima2 <- arima(arima_data, order = c(0,1,1))
summary(model_arima2)
## 
## Call:
## arima(x = arima_data, order = c(0, 1, 1))
## 
## Coefficients:
##           ma1
##       -0.9653
## s.e.   0.1270
## 
## sigma^2 estimated as 0.791:  log likelihood = -79.44,  aic = 162.88
## 
## Training set error measures:
##                      ME      RMSE       MAE      MPE     MAPE     MASE
## Training set 0.05891256 0.8820697 0.6960251 63.19404 150.3946 0.750985
##                     ACF1
## Training set 0.006669305
model_arima3 <- arima(arima_data, order = c(1,1,0))
summary(model_arima3)
## 
## Call:
## arima(x = arima_data, order = c(1, 1, 0))
## 
## Coefficients:
##           ar1
##       -0.4791
## s.e.   0.1114
## 
## sigma^2 estimated as 1.159:  log likelihood = -89.68,  aic = 183.36
## 
## Training set error measures:
##                     ME    RMSE       MAE      MPE    MAPE      MASE       ACF1
## Training set 0.0328923 1.06748 0.7711757 65.41967 148.156 0.8320697 -0.2248655
model_arima4 <- arima(arima_data, order = c(1,1,1))
summary(model_arima4)
## 
## Call:
## arima(x = arima_data, order = c(1, 1, 1))
## 
## Coefficients:
##          ar1      ma1
##       0.0167  -0.9738
## s.e.  0.1492   0.1606
## 
## sigma^2 estimated as 0.788:  log likelihood = -79.43,  aic = 164.86
## 
## Training set error measures:
##                      ME      RMSE     MAE      MPE     MAPE      MASE
## Training set 0.05768855 0.8803638 0.69387 63.02036 149.5045 0.7486597
##                      ACF1
## Training set -0.005307469
model_arima5 <- arima(arima_data, order = c(2,1,0))
summary(model_arima5)
## 
## Call:
## arima(x = arima_data, order = c(2, 1, 0))
## 
## Coefficients:
##           ar1      ar2
##       -0.7102  -0.4620
## s.e.   0.1145   0.1145
## 
## sigma^2 estimated as 0.9078:  log likelihood = -82.61,  aic = 171.22
## 
## Training set error measures:
##                      ME     RMSE       MAE      MPE     MAPE      MASE
## Training set 0.04137652 0.944959 0.7124278 22.17673 132.9919 0.7686829
##                     ACF1
## Training set -0.06435778
model_arima6 <- arima(arima_data, order = c(2,1,1))
summary(model_arima6)
## 
## Call:
## arima(x = arima_data, order = c(2, 1, 1))
## 
## Coefficients:
##          ar1      ar2      ma1
##       0.0082  -0.0212  -0.9592
## s.e.  0.1800   0.1723   0.1982
## 
## sigma^2 estimated as 0.7923:  log likelihood = -79.42,  aic = 166.85
## 
## Training set error measures:
##                      ME      RMSE       MAE      MPE     MAPE      MASE
## Training set 0.05896945 0.8827781 0.6978108 63.04679 150.1319 0.7529117
##                      ACF1
## Training set -0.001585722
       Based on the result, ARIMA model (0,1,1) turns out to have lowest ACI values. It is compatible with the inflation data ARIMA analysis’ conclusion as well. Beside that, its coefficient of MA(1) from generated random variables indicates moderate similarity (-0.9653) with the estimated coefficient from inflation data (-0.87308).
       Unfortunately, since the data is being used particularly about inflation, ARIMA model (0,1,1) may not be suitable for majority of time series data. However, ARIMA model (0,1,1) could be tried to predict future values for other inflation-related time series data, but separate ARIMA analysis still required nor recommended.

5. Residual Diagnostic Test

       Final step to validate the ARIMA model (0,1,1) assumptions from generated random variables is through residuals diagnostic. The residuals in ACF plot are expected to stay within the confidence interval except the 0 lag of course. In spite of that, autocorrelation test is also being conducted for confirming the residuals are not autocorrelated using Box-Ljung test. Desirable outcome is p-value more than 0.05.
       Note: You can change model_arima2 with previous best ARIMA model from the syntethic data which has the lowest AIC value.
# -- 6. Residual Diagnostic Test
# Residual Plot
ts.plot(residuals(model_arima2), main = "ARIMA Residuals Plot")

# Residual ACF
acf(residuals(model_arima2), main = "ARIMA Residuals in ACF Plot")

# Autocorrelation Test
Box.test(
  residuals(model_arima2),
  lag = 2,
  type = "Ljung-Box"
)
## 
##  Box-Ljung test
## 
## data:  residuals(model_arima2)
## X-squared = 0.011364, df = 2, p-value = 0.9943

CONCLUSION

       Based on ARIMA analysis using Inflation data in Bogor City as well as generated random variables, both displayed similarities such as the use of ARIMA model (0,1,1) as their best ARIMA model. On the other hand, estimated parameter coefficient for both simulation and inflation data shows moderate closeness. Nonetheless, other similar data particularly inflation-related could potentially use same ARIMA (0,1,1) model. However, outside of inflation topic, best ARIMA model identification is required nor recommended since ARIMA model (0,1,1) is specifically coming from and compatible for inflation data.
       Need to be noted, in some cases, inflation data may not be found suitable using ARIMA (0,1,1) as their model. Consequently, separate ARIMA analysis for different data is suggested since the another ARIMA model could have better ability to explain time series data more than ARIMA model (0,1,1). Therefore, generalizing the ARIMA model cannot be fully achieved even for other inflation data, let alone majority time series data.