B. REAL CASE DATA
Before generating random variables
through ARIMA, the best ARIMA model and its coefficients for each
component need to be identified. These can be found through ARIMA
analysis of real case data.
Note: You may also use your own data
and adjust parts of the code. Syntax that can be substituted is
pointed out in the Note parts.
As mentioned in the introduction, we
want to answer the following question.
“Does the Bogor City inflation data
use the optimal ARIMA model? If I simulate random variables that
follow the same ARIMA model, will that model pass the statistical
property tests in the same way as the model found in the real case data?”
Loading Libraries
The following packages are needed for the
ARIMA analysis. Each library can be explored further by going to
‘Help’ in RStudio and searching for the package name
(e.g. help(package = "forecast")).
# Library
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.4.3
library(readxl)
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(tseries)
library(lmtest)
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.4.3
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.4.3
## Warning: package 'readr' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ purrr 1.2.0
## ✔ forcats 1.0.1 ✔ readr 2.1.6
## ✔ ggplot2 4.0.1 ✔ stringr 1.6.0
## ✔ lubridate 1.9.4 ✔ tibble 3.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(patchwork)
library(FitAR)
## Loading required package: lattice
## Loading required package: leaps
## Loading required package: ltsa
## Loading required package: bestglm
##
## Attaching package: 'FitAR'
##
## The following object is masked from 'package:forecast':
##
## BoxCox
# Install the remotes package if you don't have it
# install.packages("remotes")
# Install FitAR from the CRAN archive mirror on GitHub
# remotes::install_github("cran/FitAR")
1. Descriptive Statistics
Import the data by using the file’s
location or the ‘Import Dataset’ option in the ‘File’ menu. The
descriptive statistics of the imported data can be obtained with the
summary function. As usual, descriptive statistics provide basic
information such as the minimum and maximum values, mean, median, and
so on.
Note: You can replace inflation
with a keyword that matches your data. You also have to adjust
data$`Inflation (%)` to your data’s numerical column
name if you are using a different time series. For example, my data
happens to hold the inflation values in a column named
‘Inflation (%)’.
# -- 1. Descriptive Statistics
data <- read_excel("Bogor_Inflation_ARIMA_Syntax.xlsx")
Inflation = data$`Inflation (%)`
# View(data)
summary(data)
## Date Inflation (%)
## Min. :2020-01-01 00:00:00 Min. :-0.4500
## 1st Qu.:2021-03-24 06:00:00 1st Qu.: 0.0600
## Median :2022-06-16 00:00:00 Median : 0.1950
## Mean :2022-06-16 15:12:00 Mean : 0.2482
## 3rd Qu.:2023-09-08 12:00:00 3rd Qu.: 0.4425
## Max. :2024-12-01 00:00:00 Max. : 1.1800
# Plot Time Series
plot.ts(Inflation,
main= "Plot Time Series Inflation in Bogor",
xlab= "Period")

2. Split Data
The second step is splitting the data into
training and testing sets. The training set is used to identify the best
ARIMA model, while the testing set is used to validate the model and its
ability to make predictions. Training and testing data are commonly
split in a 2:1 ratio. Note that splitting the data is optional unless
there is a forecasting part, which is not included in this analysis.
Here, only the training data (48 observations, 2020–2023) will be used;
the testing data is kept in case a prediction section is added later.
Note: You can adjust the periods for the
training and testing data. Also, do not forget to replace
Inflation (%) with your numerical column name.
# -- 2. Split Data
train_data <- data[1:48, ] # 2020-2023
test_data <- data[49:60, ] # 2024
# Convert to time series objects
tr_ts <- ts(train_data$`Inflation (%)`,
start = c(2020, 1),
frequency = 12)
te_ts <- ts(test_data$`Inflation (%)`,
start = c(2024, 1),
frequency = 12)
tr_data <- train_data$`Inflation (%)`
tr_ts
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 2020 0.78 0.25 0.04 -0.02 0.01 0.27 -0.01 -0.16 0.11 0.13 0.32 0.44
## 2021 0.19 0.24 0.06 0.24 0.40 -0.17 0.07 0.08 -0.01 0.08 0.26 0.56
## 2022 0.53 0.13 0.97 0.68 0.55 0.75 0.55 -0.45 1.18 0.10 0.20 0.49
## 2023 0.61 0.36 0.15 0.32 0.22 0.18 0.24 -0.08 0.16 0.12 0.81 0.22
te_ts
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 2024 0.06 0.60 0.45 0.10 -0.02 -0.11 -0.07 0.04 -0.14 0.08 0.26 0.49
tr_data
## [1] 0.78 0.25 0.04 -0.02 0.01 0.27 -0.01 -0.16 0.11 0.13 0.32 0.44
## [13] 0.19 0.24 0.06 0.24 0.40 -0.17 0.07 0.08 -0.01 0.08 0.26 0.56
## [25] 0.53 0.13 0.97 0.68 0.55 0.75 0.55 -0.45 1.18 0.10 0.20 0.49
## [37] 0.61 0.36 0.15 0.32 0.22 0.18 0.24 -0.08 0.16 0.12 0.81 0.22
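The test set is not used further in this walkthrough, but the short
sketch below shows how it could be used to validate forecasts once a
model has been chosen. It assumes the ARIMA(0,1,1) model identified
later in step 5; Arima(), forecast(), and accuracy() come from the
forecast package.
# -- Optional: forecast validation on the test set (a sketch, not an original step)
fit_tr <- Arima(tr_ts, order = c(0, 1, 1))     # best model identified in step 5
fc_te <- forecast(fit_tr, h = length(te_ts))   # 12-month-ahead forecast for 2024
accuracy(fc_te, te_ts)                         # RMSE, MAE, MAPE, etc. on the test set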
3. Stationary Test
In ARIMA analysis we assume there may be a
trend within the time series, which is called non-stationarity. Data
are considered non-stationary when their mean and variance are not
constant over time. Non-stationary data need to be converted to
stationary data, otherwise they will be difficult to predict.
Fortunately, the integrated (I) part of ARIMA handles this conversion
through differencing.
Before differencing, the stationarity of
the time series needs to be tested using the Augmented Dickey-Fuller
(ADF) test. The ADF test determines whether differencing is necessary.
The desirable outcome is a p-value below 0.05, indicating stationarity.
Note: If your data are not
stationary (p-value >= 0.05), go to step 4. Otherwise, check
the Partial Autocorrelation Function (PACF) and Autocorrelation Function
(ACF) plots. The PACF can be used to identify candidate values for the p
parameter by counting the lags (spikes) that fall outside the confidence
interval (blue dashed lines). The ACF plot works the same way but for
the q parameter.
# -- 3. Stationary Test
adf.test(tr_data)
##
## Augmented Dickey-Fuller Test
##
## data: tr_data
## Dickey-Fuller = -2.2397, Lag order = 3, p-value = 0.4784
## alternative hypothesis: stationary
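# Optional cross-check (a sketch, not part of the original analysis): the KPSS
# test from the tseries package reverses the hypotheses (its null hypothesis is
# stationarity), so a small p-value here would likewise indicate non-stationarity.
# kpss.test(tr_data)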
acf(tr_data, main = "ACF Plot")

pacf(tr_data, main = "PACF Plot")

4. Differencing
Pay attention to how many times
the data need to be differenced, since this determines the d in the
ARIMA(p,d,q) model. Only difference until the data are stationary,
i.e. until the ADF test rejects the null hypothesis
(p-value < 0.05).
For instance, since the Bogor City
inflation data are stationary after the first differencing, d = 1.
Meanwhile, based on the PACF and ACF plots, the candidate values for
the p parameter are 0, 1, and 2, while the q parameter can be 0 or 1.
These parameter combinations will be tested against each other to find
the best ARIMA model.
Note: You can delete the second
differencing if it is not needed or your data are already
stationary.
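As an optional cross-check on the number of differences, the forecast
package provides ndiffs(), which estimates how many differences a
series needs (based on a KPSS unit-root test by default). For this
series it should agree with d = 1. A minimal sketch, to run before or
alongside the manual differencing below:
# Optional sketch: estimate how many differences are needed
ndiffs(tr_data)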
# -- 4. Differencing
diff1 = diff(tr_data)
adf.test(diff1)
## Warning in adf.test(diff1): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: diff1
## Dickey-Fuller = -4.3662, Lag order = 3, p-value = 0.01
## alternative hypothesis: stationary
acf(diff1, main = "ACF Differencing 1")

pacf(diff1, main = "PACF Differencing 1")

diff2 = diff(diff1) # <--- Only use this when the data are not yet stationary
adf.test(diff2)
## Warning in adf.test(diff2): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: diff2
## Dickey-Fuller = -6.7663, Lag order = 3, p-value = 0.01
## alternative hypothesis: stationary
acf(diff2, main = "ACF Differencing 2")

pacf(diff2, main = "PACF Differencing 2")

5. Finding the Best Model
The best ARIMA model for the inflation data
can be identified using Akaike’s Information Criterion (AIC). Although
the AIC does not measure how well a particular model explains the data
in absolute terms, it compares the candidate ARIMA models and indicates
which one has relatively better explanatory ability. The model with the
lowest AIC value is preferred.
Note: Please add more m
labels if you have additional model combinations, and set the
best_model to the model with the lowest AIC
value.
# -- 5. Model Determination
m1 <- arima(tr_ts, order = c(0,1,0))
m2 <- arima(tr_ts, order = c(0,1,1))
m3 <- arima(tr_ts, order = c(1,1,0))
m4 <- arima(tr_ts, order = c(1,1,1))
m5 <- arima(tr_ts, order = c(2,1,0))
m6 <- arima(tr_ts, order = c(2,1,1))
aic <- data.frame(Name = c("m1", "m2", "m3", "m4", "m5", "m6"),
Model = c("ARIMA (0,1,0)",
"ARIMA (0,1,1)",
"ARIMA (1,1,0)",
"ARIMA (1,1,1)",
"ARIMA (2,1,0)",
"ARIMA (2,1,1)"),
AIC = c(m1$aic,
m2$aic,
m3$aic,
m4$aic,
m5$aic,
m6$aic))
aic
## Name Model AIC
## 1 m1 ARIMA (0,1,0) 54.52956
## 2 m2 ARIMA (0,1,1) 29.91778
## 3 m3 ARIMA (1,1,0) 39.20062
## 4 m4 ARIMA (1,1,1) 31.55783
## 5 m5 ARIMA (2,1,0) 32.08784
## 6 m6 ARIMA (2,1,1) 32.98561
best_model <- m2
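As an optional cross-check on the manual selection above, auto.arima()
from the forecast package searches candidate orders automatically using
information criteria; for this series it should point to a similar
model. A minimal sketch:
# Optional sketch: automatic order selection as a sanity check
auto_fit <- auto.arima(tr_ts, seasonal = FALSE)
auto_fit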
6. Statistical Properties for Model Validation
After identifying the best ARIMA model for
the inflation data, several statistical properties need to be examined
to ensure its power to predict future values. The following tests are
used to validate the ARIMA model.
- Z-Test: The z-test checks whether each estimated ARIMA
coefficient differs significantly from zero, i.e. whether that term
contributes to explaining the data. It also provides the estimated
coefficients of the ARIMA parameters. Since the inflation data use an
ARIMA(0,1,1) model, no AR (p) coefficient is displayed. With the MA(1)
estimate shown below, the model can be written as
Y_t = Y_(t-1) + e_t - 0.873 e_(t-1), where e_t is white noise.
(Desirable outcome: p-value < 0.05)
Figure 2. ARIMA(0,1,1) equation for
inflation in Bogor, 2020-2023.
Note: If you have an ARIMA(2,1,1) model you
may see 2 AR values and 1 MA value.
- Ljung-Box Test: In time series modelling, most of the
structure in the data should be captured by the model, leaving only
pure randomness in the residuals; in statistics, such residuals are
called white noise. Accordingly, the PACF and ACF plots of the
residuals are expected to stay within the confidence interval,
excluding lag 0. (Desirable outcome: p-value > 0.05)
- Box-Ljung Test: The errors of the ARIMA model are
expected to have no autocorrelation. (Desirable outcome: p-value >
0.05)
- Kolmogorov-Smirnov Test: The residuals of the ARIMA model
need to be normally distributed. (Desirable outcome:
p-value > 0.05)
# -- 6. Model Significancy (z-test)
coeftest(best_model)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 -0.87308 0.11830 -7.3804 1.578e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# -- 7. White Noise Residual Test
res1 = best_model$residuals
acf(res1)

pacf(res1)

checkresiduals(best_model)

##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1)
## Q* = 7.6004, df = 9, p-value = 0.5749
##
## Model df: 1. Total lags used: 10
# -- 8. Autocorrelation Test
Box.test(res1, lag = 23, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: res1
## X-squared = 18.084, df = 23, p-value = 0.7529
# -- 9. Normality Test (Kolmogorov-Smirnov)
ks.test(res1, "pnorm", mean(res1), sd(res1))
##
## Exact one-sample Kolmogorov-Smirnov test
##
## data: res1
## D = 0.096765, p-value = 0.7233
## alternative hypothesis: two-sided
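A Shapiro-Wilk test (shapiro.test() from base R) is another common
normality check and could be run on the same residuals; a p-value above
0.05 would again support normally distributed residuals. A minimal
sketch:
# Optional sketch: alternative normality check on the same residuals
shapiro.test(res1)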
C. GENERATING RANDOM DATA
After finding the optimal ARIMA model,
(0,1,1), and the estimated MA(1) coefficient, -0.87308, for the Bogor
City inflation data, an ARIMA simulation with random variables can be
conducted.
Question: “Will random variables that
are deliberately generated from the ARIMA(0,1,1) model pass the
statistical validations, produce similar results, and yield a nearly
identical coefficient estimate?”
Note: Replace
model = list(order = c(0,1,1), ma = -0.87308) with your
best ARIMA model and its estimated coefficients. If you have AR values,
use (order = c(), ar = , ma = ). If a parameter has more
than one coefficient, use for instance
model = list(order = c(2,1,1), ar = c(0.6, 0.2), ma = 0.7).
Besides that, adjust the n value to the total sample size
used in the real case data (note that with d = 1, arima.sim() returns a
series of length n + 1, since the integration step adds an initial
value).
1. Libraries and Generating Random Variables
# ----- ARIMA Model Simulation ----- #
# -- 1. Library and set.seed
set.seed(123)
# Load library
library(tseries)
library(forecast)
# -- 2. Generating Data through Time Series
set.seed(123)
n <- 60
arima_data <- arima.sim(
n = n,
model = list(order = c(0,1,1), ma = -0.87308)
)
ts.plot(arima_data, main = "ARIMA(0,1,1) Simulation")

2. Stationarity Test and Differencing
The random data generated from the
ARIMA(0,1,1) model are also tested for stationarity using the ADF test.
Since the inflation data required one differencing in the previous
ARIMA analysis, the generated random data are expected to need d = 1 as
well.
Note: As in the previous analysis, the
differencing can be adjusted based on the ADF test result for the
generated random variables.
# 3. Stationary Test
# ARIMA Model (0,1,1)
adf.test(arima_data)
##
## Augmented Dickey-Fuller Test
##
## data: arima_data
## Dickey-Fuller = -3.244, Lag order = 3, p-value = 0.08892
## alternative hypothesis: stationary
# 4. Differencing:
diff_arima <- diff(arima_data)
adf.test(diff_arima)
## Warning in adf.test(diff_arima): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: diff_arima
## Dickey-Fuller = -6.436, Lag order = 3, p-value = 0.01
## alternative hypothesis: stationary
3. PACF and ACF Plots
Following the same procedure as the ARIMA
analysis of the real case data, the PACF and ACF plots remain the
baseline for determining the possible component combinations for the
randomly generated variables.
Note: Please change
diff_arima according to stationarity, for example if the
generated data are already stationary without differencing, or if a
second differencing was needed.
# 4. Identifying Model Components (PACF and ACF Plots)
# ARIMA (0,1,1)
acf(diff_arima, main = "ACF Diff ARIMA")

pacf(diff_arima, main = "PACF Diff ARIMA")
The PACF and ACF plots show results
similar to the inflation data: the random variables also have possible
AR components of 0, 1, and 2, while the MA component can be 0 or 1
based on the ACF lags.
4. Best ARIMA Model and Estimating Parameter
We move on to identifying the best ARIMA
model and estimating its parameters. The ARIMA model expected to have
the lowest AIC value is (0,1,1), since the random data were
deliberately generated to follow the ARIMA(0,1,1) model. The estimated
coefficient is also expected to be close to the one from the inflation
ARIMA analysis.
# 5. Estimating Parameter
model_arima1 <- arima(arima_data, order = c(0,1,0))
summary(model_arima1)
##
## Call:
## arima(x = arima_data, order = c(0, 1, 0))
##
##
## sigma^2 estimated as 1.518: log likelihood = -97.66, aic = 197.32
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.02360334 1.221995 0.9116226 61.56302 148.5532 0.9836066
## ACF1
## Training set -0.4870249
model_arima2 <- arima(arima_data, order = c(0,1,1))
summary(model_arima2)
##
## Call:
## arima(x = arima_data, order = c(0, 1, 1))
##
## Coefficients:
## ma1
## -0.9653
## s.e. 0.1270
##
## sigma^2 estimated as 0.791: log likelihood = -79.44, aic = 162.88
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.05891256 0.8820697 0.6960251 63.19404 150.3946 0.750985
## ACF1
## Training set 0.006669305
model_arima3 <- arima(arima_data, order = c(1,1,0))
summary(model_arima3)
##
## Call:
## arima(x = arima_data, order = c(1, 1, 0))
##
## Coefficients:
## ar1
## -0.4791
## s.e. 0.1114
##
## sigma^2 estimated as 1.159: log likelihood = -89.68, aic = 183.36
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.0328923 1.06748 0.7711757 65.41967 148.156 0.8320697 -0.2248655
model_arima4 <- arima(arima_data, order = c(1,1,1))
summary(model_arima4)
##
## Call:
## arima(x = arima_data, order = c(1, 1, 1))
##
## Coefficients:
## ar1 ma1
## 0.0167 -0.9738
## s.e. 0.1492 0.1606
##
## sigma^2 estimated as 0.788: log likelihood = -79.43, aic = 164.86
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.05768855 0.8803638 0.69387 63.02036 149.5045 0.7486597
## ACF1
## Training set -0.005307469
model_arima5 <- arima(arima_data, order = c(2,1,0))
summary(model_arima5)
##
## Call:
## arima(x = arima_data, order = c(2, 1, 0))
##
## Coefficients:
## ar1 ar2
## -0.7102 -0.4620
## s.e. 0.1145 0.1145
##
## sigma^2 estimated as 0.9078: log likelihood = -82.61, aic = 171.22
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.04137652 0.944959 0.7124278 22.17673 132.9919 0.7686829
## ACF1
## Training set -0.06435778
model_arima6 <- arima(arima_data, order = c(2,1,1))
summary(model_arima6)
##
## Call:
## arima(x = arima_data, order = c(2, 1, 1))
##
## Coefficients:
## ar1 ar2 ma1
## 0.0082 -0.0212 -0.9592
## s.e. 0.1800 0.1723 0.1982
##
## sigma^2 estimated as 0.7923: log likelihood = -79.42, aic = 166.85
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.05896945 0.8827781 0.6978108 63.04679 150.1319 0.7529117
## ACF1
## Training set -0.001585722
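For readability, the AIC values of the six candidate models fitted to
the simulated series can be gathered into one table, mirroring step 5
of the real-data analysis. A minimal sketch:
# Optional sketch: AIC comparison table for the simulated data
aic_sim <- data.frame(Model = c("ARIMA (0,1,0)", "ARIMA (0,1,1)", "ARIMA (1,1,0)",
                                "ARIMA (1,1,1)", "ARIMA (2,1,0)", "ARIMA (2,1,1)"),
                      AIC = c(model_arima1$aic, model_arima2$aic, model_arima3$aic,
                              model_arima4$aic, model_arima5$aic, model_arima6$aic))
aic_sim[order(aic_sim$AIC), ] # ARIMA (0,1,1) should appear first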
Based on the results, the ARIMA(0,1,1)
model turns out to have the lowest AIC value, which matches the
conclusion of the inflation data ARIMA analysis. Besides that, the
MA(1) coefficient estimated from the generated random variables
(-0.9653) is moderately similar to the estimated coefficient from the
inflation data (-0.87308).
Unfortunately, since the data used here
are specifically about inflation, the ARIMA(0,1,1) model may not be
suitable for the majority of time series data. The ARIMA(0,1,1) model
could still be tried to predict future values for other
inflation-related time series, but a separate ARIMA analysis is still
required and recommended.
5. Residual Diagnostic Test
The final step to validate the ARIMA(0,1,1)
model fitted to the generated random variables is residual diagnostics.
The residuals in the ACF plot are expected to stay within the
confidence interval, except at lag 0 of course. In addition, a
Box-Ljung test is conducted to confirm that the residuals are not
autocorrelated. The desirable outcome is a p-value greater than 0.05.
Note: You can replace
model_arima2 with the best ARIMA model for your synthetic
data, i.e. the one with the lowest AIC value.
# -- 6. Residual Diagnostic Test
# Residual Plot
ts.plot(residuals(model_arima2), main = "ARIMA Residuals Plot")

# Residual ACF
acf(residuals(model_arima2), main = "ARIMA Residuals in ACF Plot")

# Autocorrelation Test
Box.test(
residuals(model_arima2),
lag = 2,
type = "Ljung-Box"
)
##
## Box-Ljung test
##
## data: residuals(model_arima2)
## X-squared = 0.011364, df = 2, p-value = 0.9943
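To mirror step 9 of the real-data validation, a normality check of the
simulated model's residuals could also be added. A minimal sketch using
the Kolmogorov-Smirnov test:
# Optional sketch: normality test on the simulated model's residuals
res_sim <- residuals(model_arima2)
ks.test(res_sim, "pnorm", mean(res_sim), sd(res_sim))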