Fitting ARIMA model to Time Series Data

Time series data are prevalent across different industries, ranging from financial to airline industries. Time series analysis provides the tools for companies to generate valuable insights and predict future events. In this document, a brief explanation of ARIMA models will be presented and an example of fitting data with the Box-Jenkins method on carbon dioxide recorded in Mauna Loa.

ARIMA models

ARIMA model, short for Auto-Regressive Integrated Moving Average is fitted to time series data to understand the data or to predict future observations. ARMA models forms the basis on building more advanced models, such as GARCH and SARIMA which are used to model financial data and seasonal data respectively.

ARIMA(p,d,q) model is seperated into
1. AR(p)
2. I(d)
3. MA(q)

p and q represents the lags of ARMA model, while d represent the number of times differencing is needed to make the time series stationary.

A stationary time series is that the observation is not dependent on the time. White noise series is an example of a stationary process.

WN <- arima.sim(model = list(order = c(0, 0, 0)), n = 200)
plot(WN,col=4, main="White Noise Series")

In simple terms:
AR model is a linear combination of p past observations.
MA model is a linear combination of q past forecast errors.

Notations:
\(Z_t\) represents values of the time series at time t
\(z_t\) represents values of the time series at time t after the time series is stationary
\(a_t\) represents the random error at time t \(c\) is a constant

ARIMA (p,d,q) can be expressed as ARIMA (p+d,q) model:

\(Z_t - \varphi_1 Z_{t-1} - ... - \varphi_{p+d} Z_{t-(p+d)} = c + a_t - \theta_1 a_{t-1} - ... - \theta_q a_{t-q}\)

The left hand side of the equation represents the AR form, the right hand side of the equation represents the MA form.

Steps to Fitting ARIMA model

The steps below follows the Box and Jenkins iterative procedure for building time series models.

Search for stationary: Transform the data into a stationary time series if needed
1.1 Removing trends
1.2 Removing seasonal components
1.3 Differentiating successively
Determine the order of ARMA, i.e. ARMA (p,q)
2.1 Fit an AR(p) model to the data and compute the residuals
2.2 Fit an MA(q) model to the residuals of the AR model fitted in the previous step
2.3 Analyze the residuals and test for white noise
2.4 After choosing orders(p,q), use ARIMA to re-estimate the parameters

Load these two libraries

library(timeSeries)

## Loading required package: timeDate

library(timeDate)

Illustration of fitting ARIMA model to CO2 concentration data:
Retreived data from https://gml.noaa.gov/ccgg/trends/data.html , Mauna Loa CO2 monthly mean data.

First thing to do is to read the time series into R. “timeSequence” function helps to set the time span of the data.

co2_csv <-read.csv("co2.csv")
co2 <- co2_csv$average
date_co2 <- timeSequence(from="1958-03-01",to="2021-07-30", by="month")

STEP 1

After having the data ready, the next step is to store the data as a Time Series object so the “timeSeries” package can be utilized to perform time series functions.

co2.ts <- timeSeries(data=co2,date_co2, units="co2")
plot(co2.ts, main = "Time Series Plot of co2")

This shows the monthly concentrations of CO2 at Mauna Loa, Hawaii from March 1958 to July 2021.

From the plot, it can be seen that the time series is not stationary. It exhibits 2 features:
1. Increasing trend
2. Seasonality

To futher corroborate the above observations. The stl function decomposes a time series into trend, seasonal, and irregular componenent utilizing local regression technique.
s.window is time span for seasonal component
s.degree is the degree of locally-fitted polynomial in seasonal extraction, default is 0.
t.window is the time span for trend extraction
t.degree is the degree of locally-fitted polynomial in trend extraction, default is 1.

First degree of polynomial fits each subsset of data into locally linear, to simplify, we would instead set both degrees to 0, which turns the fit into a weighted moving average. It doesn’t matter whether the degree is 0 or 1 because this is only for a simple illustration that the trend, seasonal and remainder component exist.

length(as.ts(co2.ts)) is the frequency (number of data points), * 0.75 because the data starts from March.

co2.stl<-stl(as.ts(co2.ts) ,s.window = 12,t.window=0.75*length(as.ts(co2.ts)),t.degree=0)

Plot to observe the trend, seasonal and irregular component.

plot(co2.stl$time.series[,"trend"], main = "Trend")

Trend component clearly exists.

plot(co2.stl$time.series[,"seasonal"], main = "Seasonal")

Seasonal component clearly exists.

plot(co2.stl$time.series[,"remainder"], main = "Remainder")

Even after excluding the trend and seasonal component, the remainder is not stationary. Therefore, differencing is needed.

co2d <- diff(co2.stl$time.series[,"remainder"])
plot(co2d)

After differencing once, the graph looks like white noise so we can proceed to step 2.

STEP 2

2.1 FIT AR model

Note: Need to take caution in choosing order of AR model

Compute ACF (Autocorrelation function), it measures the correlation between two observations k periods apart, k = 1,2,…

The following table gives a rough idea on how to identify ARMA. (Abraham Jackson et al., 2018)

acf(co2d, main="ACF Plot")

Determine the order with Partial auto-correlation function (PACF) and Akaike information criterion (AIC) plot.

PACF, k = 1,2,…, is the correlation between \(z_t\) and \(z_{t-k}\) condition on \(z_{t-1}\),…,\(z_{t-k-1}\).

AIC is a good model selector as it quantifies goodness of fit and penalize model complexity, it is intended to identenify models that achieve optimal balance between goodness offit and model complexity. In practice, we select the model with smallest AIC.

pacf(as.vector(co2d),main="PACF Plot")

For the PACF plot, there is a sharp drop after the first lag. There are a few bars sticking out of the confidence bands afterwards.

After observing the pacf plot,

co2dar<-ar(co2d)
ts.plot(co2dar$aic,xlab=" ",main="AIC Plot")

AIC plot suggest us choosing 15 or 16 th order.

AR oder

co2dar$order

## [1] 15

AIC suggests that we fit AR(15), however, AIC is a universal criterion used to determine the order of a model.(Carmona, 2013, p. 390) Note that it may be more parsimonious to work with AR(1), especially given the relative small volume of data. But for the sake of simplicity, AR(15) will be directly fitted.

Step 2.2, 2.3

Then a model diagnostic has to be performed to check:
1. Only lag 0 on the ACF plot for residuals are statistically significant
1.1 if not, MA model needs to be fitted
2. Standardized residuals have a mean 0.
3. All lags in Ljung-Box statistics have a p-value above 0.05, implying that the residuals are uncorelated
4. Residuals are normally distributed

fit<-arima(co2d,order=c(15,0,0))
tsdiag(fit)

qqnorm(fit$residuals)

From the ACF of the residuals, it can be seen that there is no extra lags other than lag 0 so no MA fitting is needed. Then, we proceed to standardized residuals and Ljung-Box statistics, the mean is 0 and all the lags have p-value over 0.05 respectively, satisfying the conditions. According to the Normal Q-Q plot, it is approximately normally distributed.

Step 2.4

Estimating ARIMA parameters

fit

## 
## Call:
## arima(x = co2d, order = c(15, 0, 0))
## 
## Coefficients:
##           ar1      ar2      ar3      ar4     ar5     ar6     ar7     ar8
##       -0.2989  -0.0930  -0.0569  -0.0225  0.0300  0.0511  0.0477  0.0382
## s.e.   0.0362   0.0376   0.0378   0.0376  0.0376  0.0375  0.0373  0.0374
##          ar9    ar10    ar11    ar12    ar13    ar14    ar15  intercept
##       0.1007  0.0614  0.0964  0.1251  0.0879  0.0974  0.0862     0.0633
## s.e.  0.0374  0.0375  0.0377  0.0391  0.0394  0.0394  0.0377     0.0167
## 
## sigma^2 estimated as 0.09255:  log likelihood = -174.31,  aic = 382.62

The coefficients do not need to be reestimated because no MA model was fitted to the residuals.

Conclusion

The above shows an example of fitting CO2 data into the ARIMA model, the final model is ARIMA (15,1,0).

References:

Dr. Pieter Tans, NOAA/GML (gml.noaa.gov/ccgg/trends/) and Dr. Ralph Keeling, Scripps Institution of Oceanography (scrippsco2.ucsd.edu/)
Jenkins, G. M. (2015). Time series analysis: Forecasting and control. Wiley.
Stl: Seasonal decomposition of time series by loess. RDocumentation. (n.d.). https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/stl. 4.Abraham Jackson, E., Tamuke, E., & Sillah, A. (2018). Modelling Monthly Headline Consumer Price Index (HCPI) through Seasonal Box-Jenkins Methodology. International Journal of Sciences, 4(01), 51–56. https://doi.org/10.18483/ijsci.1507
Carmona, R. (2013). Statistical Analysis of Financial Data in R (Springer Texts in Statistics) (2nd ed. 2014 ed.). Springer.