In the era of big data, we often need predictive analysis to help us make decisions. One of the most important things to predict is the future, based on our past and present data. This kind of prediction is usually called forecasting.
Forecasting is required in many situations: deciding whether to build another power generation plant in the next five years requires forecasts of future demand; scheduling staff in a call centre next week requires forecasts of call volumes; stocking an inventory requires forecasts of stock requirements. The predictability of an event or a quantity depends on several factors including [1]:
The Auto Regressive Integrated Moving Average model, ARIMA(p,d,q), is an extension of the Auto Regressive (AR), Moving Average (MA), and Auto Regressive Moving Average (ARMA) models [2]. ARIMA models are applied to time series problems. ARIMA binds three types of modeling processes into one modeling framework [3]:
ARIMAX, or regression ARIMA, is an extension of the ARIMA model. In forecasting, this method also involves independent variables [4]. The ARIMAX model represents the output time series as a composition of the following parts: the autoregressive (AR) part, the moving average (MA) part, the integrated (I) component, and the part that belongs to the exogenous inputs (X) [5]. The exogenous part (X) reflects the additional incorporation of the present values \(u_i(t)\) and past values \(u_i(t-j)\) of exogenous inputs (dynamic factors in our case) into the ARIMAX model [6].
The multiple regression model has the form:
\(Y = \beta_0 + \beta_1*x_1+...+\beta_i*x_i+\varepsilon\)
Where \(Y\) is the dependent variable, the \(x_i\) are the predictor variables, and \(\varepsilon\) is usually assumed to be an uncorrelated error term (i.e., it is white noise). Tests such as the Durbin-Watson test can be used to assess whether \(\varepsilon\) is significantly autocorrelated. We now replace \(\varepsilon\) with \(\eta_t\) in the equation. The error series \(\eta_t\) is assumed to follow an ARIMA model. For example, if \(\eta_t\) follows an ARIMA(1,1,1) model, we can write
\(Y = \beta_0 + \beta_1x_1+\beta_2x_2+...+\beta_ix_i+\eta_t\)
\((1-\phi_1B)(1-B)\eta_t = (1+\theta_1B)\varepsilon_t\)
Where \(B\) is the backshift operator (\(B\eta_t = \eta_{t-1}\)) and \(\varepsilon_t\) is a white noise series. The ARIMAX model has two error terms here: the error from the regression model, which we denote by \(\eta_t\), and the error from the ARIMA model, which we denote by \(\varepsilon_t\). Only the ARIMA model errors are assumed to be white noise.
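To make the operator notation concrete, the ARIMA(1,1,1) error equation can be expanded into an ordinary difference equation. This is a worked sketch that assumes \(B\) is the backshift operator (\(B\eta_t = \eta_{t-1}\)), \(\phi_1\) is the AR coefficient, and \(\theta_1\) is the MA coefficient. Multiplying out the left-hand side gives
\((1-\phi_1B)(1-B)\eta_t = \eta_t - (1+\phi_1)\eta_{t-1} + \phi_1\eta_{t-2}\)
and setting this equal to \(\varepsilon_t + \theta_1\varepsilon_{t-1}\) and solving for \(\eta_t\) gives
\(\eta_t = (1+\phi_1)\eta_{t-1} - \phi_1\eta_{t-2} + \varepsilon_t + \theta_1\varepsilon_{t-1}\)
So the regression error at time \(t\) depends on its own two previous values and on the current and previous white noise terms.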
One case study that can be solved using ARIMAX is forecasting the quarterly changes in US consumption based on time and personal income.
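# Assumes the fpp2 package, which provides uschange and loads forecast and ggplot2
library(fpp2)
# Plot consumption and income (columns 1 and 2) in separate panels
autoplot(uschange[,1:2], facets=TRUE) +
  xlab("Year") + ylab("") +
  ggtitle("Quarterly changes in US consumption\nand personal income")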
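As a minimal sketch of how such a model could be fitted to these data, the block below uses `auto.arima()` from the forecast package with income as the external regressor. It assumes the fpp2 package is installed. The variable names and the choice to hold future income at its historical mean for 8 quarters are illustrative assumptions, not part of the original analysis.

```r
# Assumes the fpp2 package is installed; it provides uschange and loads forecast.
library(fpp2)

y <- uschange[, "Consumption"]   # response: quarterly % change in consumption
x <- uschange[, "Income"]        # exogenous predictor: quarterly % change in income

# Regression on income with ARIMA errors; the orders (p,d,q) are chosen automatically.
fit <- auto.arima(y, xreg = x)
print(fit)

# The residuals of the ARIMA part should look like white noise.
checkresiduals(fit)

# Forecasting needs future income values. Holding income at its historical
# mean for 8 quarters is an illustrative assumption, not a real scenario.
future_income <- rep(mean(x), 8)
fcast <- forecast(fit, xreg = future_income)
autoplot(fcast) + xlab("Year") + ylab("Percentage change")
```

Note that the forecasts are conditional on the assumed future income values, so a different income scenario would give different consumption forecasts.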
The potential uses of ARIMAX models are wide. The main thing to remember is that the data must be observed sequentially over time. Beyond that, we need to know that the changes over time are influenced by other factors. Hence, if we have sequential data and predictors that influence it, we can use ARIMAX. Here are several use cases that have applied ARIMAX:
Are there other algorithms we could use to predict US consumption without ARIMAX?
We often predict consumption from income using a regression model. The multiple regression model has the form:
\(Y = \beta_0 + \beta_1*x_1 + ... + \beta_i*x_i+\varepsilon\)
Where \(Y\) is the dependent variable, the \(x_i\) are the predictor variables, and \(\varepsilon\) is usually assumed to be an uncorrelated error term (i.e., it is white noise).
This method ignores the fact that the data are observed sequentially over time.
With ARIMA, we forecast the future based only on the series' own past values. It ignores other factors that might influence the changes in US consumption. ARIMA is explained above.
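The two alternatives can be sketched on the same data as follows. This assumes the fpp2 package is installed and uses `tslm()` and `auto.arima()` from the forecast package. The object names are illustrative, and the training-set RMSE is an in-sample measure only, so it should be read as a rough indication rather than a proper comparison.

```r
# Assumes the fpp2 package is installed; it provides uschange and loads forecast.
library(fpp2)

# Regression only: uses income but treats the observations as unordered.
fit_reg <- tslm(Consumption ~ Income, data = uschange)

# ARIMA only: uses the consumption history but ignores income.
fit_arima <- auto.arima(uschange[, "Consumption"])

# Training-set RMSE for each fit (in-sample only, so read it with caution).
rmse <- c(
  regression = accuracy(fit_reg)["Training set", "RMSE"],
  arima      = accuracy(fit_arima)["Training set", "RMSE"]
)
print(rmse)
```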
ARIMAX models have several advantages and disadvantages, which are explained below.
The advantage of using ARIMAX is that we can combine the regression part and the time series part in one model. This model can reduce our error compared with a regression model or an ARIMA model alone.
One disadvantage is that the covariate coefficient is hard to interpret. The value of the slope is not the effect on \(Y_t\) when \(x_t\) is increased by one (as it is in regression). The presence of lagged values of the response variable on the right hand side of the equation means that the slope \(\beta\) can only be interpreted conditional on the previous values of the response variable, which is hardly intuitive [11].
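A small worked case may make this concrete. As a simplified sketch, assume a single covariate and one lagged response term with \(|\phi_1| < 1\) (this particular form is an illustration, not a model fitted in this article):
\(Y_t = \beta x_t + \phi_1 Y_{t-1} + \varepsilon_t\)
If \(x_t\) rises by one unit and stays at the new level, \(Y_t\) rises by \(\beta\) in the first period, by \(\beta(1+\phi_1)\) in the second, and by \(\beta/(1-\phi_1)\) in the long run. The slope \(\beta\) on its own therefore does not give the total effect of \(x_t\) on \(Y_t\).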
Epidemiology and ARIMA model of positive-rate of influenza viruses among children in Wuhan, China: A nine-year retrospective study
Comparison of Prediction Accuracy of Multiple Linear Regression, ARIMA and ARIMAX Model for Pest Incidence of Cotton with Weather Factors
Container Throughput Forecasting Using Dynamic Factor Analysis and ARIMAX Model
Comparison of Prediction Accuracy of Multiple Linear Regression, ARIMA and ARIMAX Model for Pest Incidence of Cotton with Weather Factors
Container Throughput Forecasting Using Dynamic Factor Analysis and ARIMAX Model