Building ARIMA Models for Forecasting

Build different ARIMA models in R and forecast future observations.

Steps in Model Building
  1. Examine your data
  • Plot the data and examine its patterns and irregularities
  • Clean up any outliers or missing values (you can also impute the missing values)
  • tsclean() is a convenient method for outlier removal and imputing missing values
  • Take a logarithm of a series to help stabilize a strong growth trend
  2. Decompose your data
  • Does the series appear to have trends or seasonality?
  • Use decompose() or stl() to examine and possibly remove components of the series
  3. Stationarity
  • Is the series stationary?
  • Use adf.test(), ACF, and PACF plots to determine the order of differencing needed
  4. Autocorrelations and choosing model order
  • Choose the order of the ARIMA model by examining ACF and PACF plots
  5. Fit an ARIMA model
  6. Evaluate and iterate
  • Check residuals, which should have no patterns and be normally distributed
  • If there are visible patterns or bias, plot ACF/PACF. Are any additional order parameters needed?
  • Refit the model if needed. Compare model errors and fit criteria such as AIC or BIC.
  • Calculate the forecast using the chosen model
A Short Introduction to ARIMA

ARIMA stands for auto-regressive integrated moving average and is specified by these three order parameters: (p, d, q). The process of fitting an ARIMA model is sometimes referred to as the Box-Jenkins method.

An autoregressive (AR(p)) component refers to the use of past values in the regression equation for the series Y. The autoregressive parameter p specifies the number of lags used in the model. For example, AR(2), or equivalently ARIMA(2,0,0), is represented as:

\[Y_t = \mu + \phi_1y_{t-1} + \phi_2 y_{t-2}+ e_t\] The general model AR(p) is represented as:

\[Y_t = \mu + \sum\limits_{i=1}^{p} \phi_{i}y_{t-i} +e_{t}\]

where \(p\) is the order of the AR model and \(\phi_{1}, \phi_{2},\cdots,\phi_{p}\) are the \(p\) autoregressive parameters of the AR(p) model. An AR(p) model has only \(p\) partial autocorrelations that are statistically different from zero, so the PACF cuts off after lag \(p\). The autocorrelation coefficient of the AR(p) model trails off to zero and is restricted to lie between \(-1\) and \(1\).
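As a quick illustration (a simulation sketch, not part of the vehicle data analysed later), we can generate an AR(2) series in R and confirm that its PACF cuts off after lag 2 while its ACF tails off gradually; the coefficients are arbitrary illustrative values, and the forecast package is assumed to be installed:

# Simulate an AR(2) process: Y_t = 0.6*Y_{t-1} + 0.3*Y_{t-2} + e_t
set.seed(123)
ar2_sim <- arima.sim(model = list(ar = c(0.6, 0.3)), n = 500)

forecast::ggPacf(ar2_sim)   # roughly two significant spikes, then a cut-off
forecast::ggAcf(ar2_sim)    # decays slowly toward zero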

The d represents the degree of differencing in the integrated \((I(d))\) component. Differencing a series involves simply subtracting its current and previous values d times. Often, differencing is used to stabilize the series when the stationarity assumption is not met, which we will discuss below.

A moving average (MA(q)) component represents the error of the model as a linear combination of previous error terms \(e_{t}\). The order \(q\) determines the number of terms to include in the model:

\[Y_t = \mu + \theta_1 e_{t-1} + \theta_2 e_{t-2} +...+ \theta_q e_{t-q}+ e_t\]

The moving average is a linear combination of past forecast errors. The general MA(q) model is expressed as:

\[Y_t = \mu + \sum\limits_{i=1}^{q}\theta_{i}e_{t-i} +e_{t}\] where \(q\) is the order of the model and \(\theta_{1},\theta_{2},\cdots,\theta_{q}\) are parameters of the model. The \(e_{t},e_{t-1},\cdots,e_{t-q}\) are white noise error terms. This can be equivalently written in terms of the backshift operator \(B\) as:

\[Y_t = \mu + \left(1 + \sum\limits_{i=1}^{q}\theta_{i}B^{i}\right)e_{t}\]
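Analogously, a simulated MA(2) series (again purely illustrative coefficients, not drawn from the data below) has an ACF that cuts off after lag \(q = 2\) while its PACF tails off:

# Simulate an MA(2) process: Y_t = e_t + 0.7*e_{t-1} + 0.4*e_{t-2}
set.seed(123)
ma2_sim <- arima.sim(model = list(ma = c(0.7, 0.4)), n = 500)

forecast::ggAcf(ma2_sim)    # significant spikes only at lags 1-2
forecast::ggPacf(ma2_sim)   # tails off gradually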

The invertibility condition of an MA(q) process allows the possibility of inverting an MA model to obtain an AR model of infinite order.

Differencing, autoregressive, and moving average components make up a non-seasonal ARIMA model which can be written as a linear equation:

\[ Y_t = \mu +\sum\limits_{i=1}^{p} \phi_{i}y_{d_{t-i}} +\sum\limits_{i=1}^{q}\theta_{i}e_{t-i} + e_t\]

where \(y_{d}\) is \(Y\) differenced \(d\) times and \(\mu\) is a constant mean.

Note that the model above assumes non-seasonal series, which means you might need to de-seasonalize the series before modeling. We will show how this can be done in an example below.

ARIMA models can be also specified through a seasonal structure. In this case, the model is specified by two sets of order parameters: \((p, d, q)\) as described above and \((P, D, Q)_{m}\) parameters describing the seasonal component of m periods.
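As a hypothetical illustration of this notation, a seasonal ARIMA(1,1,1)(1,1,1) with m = 12 monthly periods can be specified with forecast::Arima(); the orders and the built-in AirPassengers dataset below are placeholders for illustration, not recommendations for the vehicle data:

# Hypothetical seasonal ARIMA(p,d,q)(P,D,Q)[m] specification, for illustration only
fit_seasonal <- forecast::Arima(AirPassengers,
                                order    = c(1, 1, 1),                              # (p, d, q)
                                seasonal = list(order = c(1, 1, 1), period = 12))   # (P, D, Q)_m
summary(fit_seasonal)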

ARIMA methodology does have its limitations. These models directly rely on past values, and therefore work best on long and stable series. Also note that ARIMA simply approximates historical patterns and therefore does not aim to explain the structure of the underlying data mechanism.

ARIMA(p,d,q) forecasting equation: ARIMA models are, in theory, the most general class of models for forecasting a time series which can be made to be “stationary” by differencing (if necessary), perhaps in conjunction with nonlinear transformations such as logging or deflating (if necessary). A random variable that is a time series is stationary if its statistical properties are all constant over time. A stationary series has no trend, its variations around its mean have a constant amplitude, and it wiggles in a consistent fashion, i.e., its short-term random time patterns always look the same in a statistical sense. The latter condition means that its autocorrelations (correlations with its own prior deviations from the mean) remain constant over time, or equivalently, that its power spectrum remains constant over time. A random variable of this form can be viewed (as usual) as a combination of signal and noise, and the signal (if one is apparent) could be a pattern of fast or slow mean reversion, or sinusoidal oscillation, or rapid alternation in sign, and it could also have a seasonal component. An ARIMA model can be viewed as a “filter” that tries to separate the signal from the noise, and the signal is then extrapolated into the future to obtain forecasts.

The ARIMA forecasting equation for a stationary time series is a linear (i.e., regression-type) equation in which the predictors consist of lags of the dependent variable and/or lags of the forecast errors. That is:

Predicted value of Y = a constant and/or a weighted sum of one or more recent values of Y and/or a weighted sum of one or more recent values of the errors.

If the predictors consist only of lagged values of Y, it is a pure autoregressive (“self-regressed”) model, which is just a special case of a regression model and which could be fitted with standard regression software. For example, a first-order autoregressive (“AR(1)”) model for Y is a simple regression model in which the independent variable is just Y lagged by one period (LAG(Y,1) in Statgraphics or Y_LAG1 in RegressIt). If some of the predictors are lags of the errors, an ARIMA model is NOT a linear regression model, because there is no way to specify “last period’s error” as an independent variable: the errors must be computed on a period-to-period basis when the model is fitted to the data. From a technical standpoint, the problem with using lagged errors as predictors is that the model’s predictions are not linear functions of the coefficients, even though they are linear functions of the past data. So, coefficients in ARIMA models that include lagged errors must be estimated by nonlinear optimization methods (“hill-climbing”) rather than by just solving a system of equations.

The acronym ARIMA stands for Auto-Regressive Integrated Moving Average. Lags of the stationarized series in the forecasting equation are called “autoregressive” terms, lags of the forecast errors are called “moving average” terms, and a time series which needs to be differenced to be made stationary is said to be an “integrated” version of a stationary series. Random-walk and random-trend models, autoregressive models, and exponential smoothing models are all special cases of ARIMA models.

A nonseasonal ARIMA model is classified as an “ARIMA(p,d,q)” model, where:

p is the number of autoregressive terms, d is the number of nonseasonal differences needed for stationarity, and q is the number of lagged forecast errors in the prediction equation. The forecasting equation is constructed as follows. First, let y denote the dth difference of Y, which means:

If \(d=0: y_{t} = Y_{t}\)

If \(d=1: y_{t} = Y_{t} - Y_{t-1}\)

If \(d=2: y_{t} = (Y_{t} - Y_{t-1}) - (Y_{t-1} - Y_{t-2}) = Y_{t} - 2Y_{t-1} + Y_{t-2}\)

Note that the second difference of Y (the d=2 case) is not the difference from 2 periods ago. Rather, it is the first-difference-of-the-first difference, which is the discrete analog of a second derivative, i.e., the local acceleration of the series rather than its local trend.
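A quick check in R (a toy vector, purely illustrative) confirms that the second difference is the first difference of the first difference, not the lag-2 difference:

Y <- c(10, 12, 15, 19, 24)      # toy series for illustration

diff(Y, differences = 2)        # first difference of the first difference: 1 1 1
diff(diff(Y))                   # identical result
Y[3:5] - 2*Y[2:4] + Y[1:3]      # Y_t - 2*Y_{t-1} + Y_{t-2}, same again
diff(Y, lag = 2)                # NOT the same: this is Y_t - Y_{t-2}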

In terms of y, the general forecasting equation is:

\(ŷ_{t} = μ + ϕ_{1} y_{t-1} +…+ ϕ_{p} y_{t-p} - θ_{1}e_{t-1} -…- θ_{q}e_{t-q}\) Here the moving average parameters (θ’s) are defined so that their signs are negative in the equation, following the convention introduced by Box and Jenkins. Some authors and software (including the R programming language) define them so that they have plus signs instead. When actual numbers are plugged into the equation, there is no ambiguity, but it’s important to know which convention your software uses when you are reading the output. Often the parameters are denoted there by AR(1), AR(2), …, and MA(1), MA(2), … etc.

To identify the appropriate ARIMA model for Y, you begin by determining the order of differencing (d) needed to stationarize the series and remove the gross features of seasonality, perhaps in conjunction with a variance-stabilizing transformation such as logging or deflating. If you stop at this point and predict that the differenced series is constant, you have merely fitted a random walk or random trend model. However, the stationarized series may still have autocorrelated errors, suggesting that some number of AR terms (p ≥ 1) and/or some number of MA terms (q ≥ 1) are also needed in the forecasting equation.

The process of determining the values of p, d, and q that are best for a given time series will be discussed in later sections, but a preview of some of the types of nonseasonal ARIMA models that are commonly encountered is given below.

ARIMA(1,0,0) = first-order autoregressive model: if the series is stationary and autocorrelated, perhaps it can be predicted as a multiple of its own previous value, plus a constant. The forecasting equation in this case is

\(Ŷ_{t} = μ + ϕ_{1}Y_{t-1}\)

…which is Y regressed on itself lagged by one period. This is an “ARIMA(1,0,0)+constant” model. If the mean of Y is zero, then the constant term would not be included.

If the slope coefficient \(ϕ_{1}\) is positive and less than 1 in magnitude (it must be less than 1 in magnitude if Y is stationary), the model describes mean-reverting behavior in which next period’s value should be predicted to be \(ϕ_{1}\) times as far away from the mean as this period’s value. If \(ϕ_{1}\) is negative, it predicts mean-reverting behavior with alternation of signs, i.e., it predicts that Y will be below the mean next period if it is above the mean this period.
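A small simulation (illustrative coefficients only, assuming the forecast and ggplot2 packages loaded in Step 1 below) makes this behavior visible: with ϕ1 = 0.8 the series drifts back toward its mean gradually, while with ϕ1 = -0.8 it alternates around the mean:

set.seed(42)
ar_pos <- arima.sim(model = list(ar =  0.8), n = 200)   # slow, smooth mean reversion
ar_neg <- arima.sim(model = list(ar = -0.8), n = 200)   # mean reversion with alternating signs

autoplot(cbind(`phi = 0.8` = ar_pos, `phi = -0.8` = ar_neg), facets = TRUE)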

In a second-order autoregressive model (ARIMA(2,0,0)), there would be a \(Y_{t-2}\) term on the right as well, and so on. Depending on the signs and magnitudes of the coefficients, an ARIMA(2,0,0) model could describe a system whose mean reversion takes place in a sinusoidally oscillating fashion, like the motion of a mass on a spring that is subjected to random shocks.

ARIMA(0,1,0) = random walk: If the series Y is not stationary, the simplest possible model for it is a random walk model, which can be considered as a limiting case of an AR(1) model in which the autoregressive coefficient is equal to 1, i.e., a series with infinitely slow mean reversion. The prediction equation for this model can be written as:

\(Ŷ_{t} - Y_{t-1} = μ\)

or equivalently

\(Ŷ_{t} = μ + Y_{t-1}\)

…where the constant term is the average period-to-period change (i.e. the long-term drift) in Y. This model could be fitted as a no-intercept regression model in which the first difference of Y is the dependent variable. Since it includes (only) a nonseasonal difference and a constant term, it is classified as an “ARIMA(0,1,0) model with constant.” The random-walk-without-drift model would be an ARIMA(0,1,0) model without constant.

ARIMA(1,1,0) = differenced first-order autoregressive model: If the errors of a random walk model are autocorrelated, perhaps the problem can be fixed by adding one lag of the dependent variable to the prediction equation–i.e., by regressing the first difference of Y on itself lagged by one period. This would yield the following prediction equation:

\(Ŷ_{t} - Y_{t-1} = μ + ϕ_{1}(Y_{t-1} - Y_{t-2})\)

which can be rearranged to

\(Ŷ_{t} = μ + Y_{t-1} + ϕ_{1} (Y_{t-1} - Y_{t-2})\)

This is a first-order autoregressive model with one order of nonseasonal differencing and a constant term–i.e., an ARIMA(1,1,0) model.

ARIMA(0,1,1) without constant = simple exponential smoothing: Another strategy for correcting autocorrelated errors in a random walk model is suggested by the simple exponential smoothing model. Recall that for some nonstationary time series (e.g., ones that exhibit noisy fluctuations around a slowly-varying mean), the random walk model does not perform as well as a moving average of past values. In other words, rather than taking the most recent observation as the forecast of the next observation, it is better to use an average of the last few observations in order to filter out the noise and more accurately estimate the local mean. The simple exponential smoothing model uses an exponentially weighted moving average of past values to achieve this effect. The prediction equation for the simple exponential smoothing model can be written in a number of mathematically equivalent forms, one of which is the so-called “error correction” form, in which the previous forecast is adjusted in the direction of the error it made:

\(Ŷ_{t} = Ŷ_{t-1} + αe_{t-1}\)

Because \(e_{t-1} = Y_{t-1} - Ŷ_{t-1}\) by definition, this can be rewritten as:

\(Ŷ_{t} = Y_{t-1} - (1-α)e_{t-1} = Y_{t-1} - θ_{1}e_{t-1}\)

which is an ARIMA(0,1,1)-without-constant forecasting equation with θ1 = 1-α. This means that you can fit a simple exponential smoothing by specifying it as an ARIMA(0,1,1) model without constant, and the estimated MA(1) coefficient corresponds to 1-minus-alpha in the SES formula. Recall that in the SES model, the average age of the data in the 1-period-ahead forecasts is 1/α, meaning that they will tend to lag behind trends or turning points by about 1/α periods. It follows that the average age of the data in the 1-period-ahead forecasts of an ARIMA(0,1,1)-without-constant model is 1/(1-θ1). So, for example, if θ1 = 0.8, the average age is 5. As θ1 approaches 1, the ARIMA(0,1,1)-without-constant model becomes a very-long-term moving average, and as θ1 approaches 0 it becomes a random-walk-without-drift model.
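This equivalence can be checked numerically in R; the sketch below uses the built-in WWWusage series purely for illustration and assumes the forecast package is installed. Note that R reports the MA coefficient with a plus sign, so the fitted ma1 corresponds to -(1 - α), i.e. α is approximately 1 + ma1:

# SES and ARIMA(0,1,1) without constant should give nearly identical fits
ses_fit   <- forecast::ses(WWWusage)
arima_fit <- forecast::Arima(WWWusage, order = c(0, 1, 1), include.constant = FALSE)

ses_fit$model$par["alpha"]     # smoothing parameter alpha from SES
1 + coef(arima_fit)["ma1"]     # approximately equal to alpha (R's plus-sign convention)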

What’s the best way to correct for autocorrelation: adding AR terms or adding MA terms? In the previous two models discussed above, the problem of autocorrelated errors in a random walk model was fixed in two different ways: by adding a lagged value of the differenced series to the equation or adding a lagged value of the forecast error. Which approach is best? A rule-of-thumb for this situation, which will be discussed in more detail later on, is that positive autocorrelation is usually best treated by adding an AR term to the model and negative autocorrelation is usually best treated by adding an MA term. In business and economic time series, negative autocorrelation often arises as an artifact of differencing. (In general, differencing reduces positive autocorrelation and may even cause a switch from positive to negative autocorrelation.) So, the ARIMA(0,1,1) model, in which differencing is accompanied by an MA term, is more often used than an ARIMA(1,1,0) model.

ARIMA(0,1,1) with constant = simple exponential smoothing with growth: By implementing the SES model as an ARIMA model, you actually gain some flexibility. First of all, the estimated MA(1) coefficient is allowed to be negative: this corresponds to a smoothing factor larger than 1 in an SES model, which is usually not allowed by the SES model-fitting procedure. Second, you have the option of including a constant term in the ARIMA model if you wish, in order to estimate an average non-zero trend. The ARIMA(0,1,1) model with constant has the prediction equation:

\(Ŷ_{t} = μ + Y_{t-1} - θ_{1}e_{t-1}\)

The one-period-ahead forecasts from this model are qualitatively similar to those of the SES model, except that the trajectory of the long-term forecasts is typically a sloping line (whose slope is equal to mu) rather than a horizontal line.

ARIMA(0,2,1) or (0,2,2) without constant = linear exponential smoothing: Linear exponential smoothing models are ARIMA models which use two nonseasonal differences in conjunction with MA terms. The second difference of a series Y is not simply the difference between Y and itself lagged by two periods, but rather it is the first difference of the first difference–i.e., the change-in-the-change of Y at period t. Thus, the second difference of Y at period t is equal to \((Y_{t} - Y_{t-1}) - (Y_{t-1} - Y_{t-2}) = Y_{t} - 2Y_{t-1} + Y_{t-2}\). A second difference of a discrete function is analogous to a second derivative of a continuous function: it measures the “acceleration” or “curvature” in the function at a given point in time.

The ARIMA(0,2,2) model without constant predicts that the second difference of the series equals a linear function of the last two forecast errors:

\(Ŷ_{t} - 2Y_{t-1} + Y_{t-2} = - θ_{1}e_{t-1} - θ_{2}e_{t-2}\)

which can be rearranged as:

\(Ŷ_{t} = 2 Y_{t-1} - Y_{t-2} - θ_{1}e_{t-1} - θ_{2}e_{t-2}\)

where \(θ_{1}\) and \(θ_{2}\) are the MA(1) and MA(2) coefficients. This is a general linear exponential smoothing model, essentially the same as Holt’s model, and Brown’s model is a special case. It uses exponentially weighted moving averages to estimate both a local level and a local trend in the series. The long-term forecasts from this model converge to a straight line whose slope depends on the average trend observed toward the end of the series.

ARIMA(1,1,2) without constant = damped-trend linear exponential smoothing:

\(Ŷ_{t} = Y_{t-1} + ϕ_{1} (Y_{t-1} - Y_{t-2}) - θ_{1}e_{t-1} - θ_{2}e_{t-2}\)

This model is illustrated in the accompanying slides on ARIMA models. It extrapolates the local trend at the end of the series but flattens it out at longer forecast horizons to introduce a note of conservatism, a practice that has empirical support. See the article on “Why the Damped Trend works” by Gardner and McKenzie and the “Golden Rule” article by Armstrong et al. for details.

It is generally advisable to stick to models in which at least one of p and q is no larger than 1, i.e., do not try to fit a model such as ARIMA(2,1,2), as this is likely to lead to overfitting and “common-factor” issues that are discussed in more detail in the notes on the mathematical structure of ARIMA models.

  • Step 1: Load R Packages
library(tseries)
library(lubridate)
library(tidyverse)
library(timetk)
library(readxl)
library(tidyquant)
library(scales)
library(forecast)   #  forecasting pkg
library(sweep)   # Broom tidiers for forecast pkg
library(timekit)
library(broom)
library(tibble)
library(stringr)
library(ggthemes)
modeldat=read_excel("/Users/nanaakwasiabayieboateng/Documents/memphisclassesbooks/Time series/Manpower/actvsfcdatajuly311.xlsx",sheet = "Sheet1")%>%na.omit()%>%dplyr::rename(Date=Week)
modeldat%>%head()
## # A tibble: 6 x 2
##         Date Actual
##       <dttm>  <dbl>
## 1 2016-01-11     26
## 2 2016-01-18     27
## 3 2016-01-25     28
## 4 2016-02-01     22
## 5 2016-02-08     27
## 6 2016-02-15     31
  • Step 2: Examine Your Data

Plotting the series helps to visually examine it for any outliers, volatility, or irregularities. The weekly data shows some fluctuations from one week to another. However, even with this volatility present, we already see some patterns emerge: more durability vehicles are tested in the summer and fewer in the winter months.

modeldat%>%ggplot(aes(Date,Actual))+
  geom_line()+
  geom_point(alpha = 0.5, color = palette_light()[[1]], shape=20, size=2) +
  labs(title = "Durability Vehicles Forecasting", x = "Time", y = "Number of \n Durability Vehicles",
       subtitle = "data from 2016 & 2017") +
  theme_tq()

Outliers could bias the model by skewing statistical summaries. The tsclean() function in the forecast package provides a convenient mechanism to remove time series outliers. tsclean() identifies and replaces outliers using series smoothing and decomposition. This method is also capable of imputing missing values in the series if there are any.

Note that we are using the ts() command to create a time series object to pass to tsclean():

read_excel("/Users/nanaakwasiabayieboateng/Documents/memphisclassesbooks/Time series/Manpower/actvsfcdatajuly311.xlsx",sheet = "Sheet1")%>%is.na()%>%sum()
## [1] 4

The code above sums the number of missing data points, which is found to be 4.

tsclean() or na.interp() can be used to impute the missing values in time series data.

#Data has to be numeric for the tsclean function to work
dat=read_excel("/Users/nanaakwasiabayieboateng/Documents/memphisclassesbooks/Time series/Manpower/actvsfcdatajuly311.xlsx",sheet = "Sheet1")%>%dplyr::rename(Date=Week)%>%na.omit()




y=modeldat["Actual"]

modeldat%>%mutate(clean=tsclean(as.matrix.data.frame(y)))
## # A tibble: 82 x 3
##          Date Actual clean
##        <dttm>  <dbl> <dbl>
##  1 2016-01-11     26    26
##  2 2016-01-18     27    27
##  3 2016-01-25     28    28
##  4 2016-02-01     22    22
##  5 2016-02-08     27    27
##  6 2016-02-15     31    31
##  7 2016-02-22     26    26
##  8 2016-02-29     27    27
##  9 2016-03-07     21    21
## 10 2016-03-14     23    23
## # ... with 72 more rows
y=ts(dat["Actual"])


na.interp(ts(y), lambda = NULL)
## Time Series:
## Start = 1 
## End = 82 
## Frequency = 1 
##       Actual
##  [1,]     26
##  [2,]     27
##  [3,]     28
##  [4,]     22
##  [5,]     27
##  [6,]     31
##  [7,]     26
##  [8,]     27
##  [9,]     21
## [10,]     23
## [11,]     25
## [12,]     23
## [13,]     23
## [14,]     23
## [15,]     23
## [16,]     27
## [17,]     31
## [18,]     30
## [19,]     33
## [20,]     29
## [21,]     29
## [22,]     32
## [23,]     31
## [24,]     36
## [25,]     37
## [26,]     38
## [27,]     39
## [28,]     40
## [29,]     35
## [30,]     38
## [31,]     32
## [32,]     33
## [33,]     29
## [34,]     29
## [35,]     27
## [36,]     24
## [37,]     23
## [38,]     31
## [39,]     32
## [40,]     33
## [41,]     30
## [42,]     32
## [43,]     33
## [44,]     35
## [45,]     32
## [46,]     30
## [47,]     29
## [48,]     27
## [49,]     27
## [50,]     26
## [51,]     26
## [52,]     25
## [53,]     25
## [54,]     22
## [55,]     25
## [56,]     26
## [57,]     23
## [58,]     25
## [59,]     28
## [60,]     28
## [61,]     29
## [62,]     30
## [63,]     32
## [64,]     36
## [65,]     32
## [66,]     27
## [67,]     32
## [68,]     28
## [69,]     28
## [70,]     30
## [71,]     34
## [72,]     32
## [73,]     28
## [74,]     34
## [75,]     32
## [76,]     35
## [77,]     32
## [78,]     30
## [79,]     33
## [80,]     33
## [81,]     32
## [82,]     30
tsclean(ts(y))%>%tk_tbl()%>%head()
## Warning in tk_tbl.zoo(zoo::as.zoo(data), preserve_index, rename_index,
## timekit_idx, : Warning: No index to preserve. Object otherwise converted to
## tibble successfully.
## # A tibble: 6 x 1
##   Actual
##    <dbl>
## 1     26
## 2     27
## 3     28
## 4     22
## 5     27
## 6     31
dat_ts = ts(dat["Actual"])

dat$clean = tsclean(dat_ts)

dat$clean1=na.interp(dat_ts, lambda = NULL)

dat$clean1%>%tk_tbl()%>%head()
## Warning in tk_tbl.zoo(zoo::as.zoo(data), preserve_index, rename_index,
## timekit_idx, : Warning: No index to preserve. Object otherwise converted to
## tibble successfully.
## # A tibble: 6 x 1
##   Actual
##    <dbl>
## 1     26
## 2     27
## 3     28
## 4     22
## 5     27
## 6     31
y=modeldat["Actual"]

modeldat=modeldat%>%mutate(clean=tsclean(as.matrix.data.frame(y)))


modeldat%>%mutate(clean=na.interp(as.matrix.data.frame(y)))
## # A tibble: 82 x 3
##          Date Actual clean
##        <dttm>  <dbl> <dbl>
##  1 2016-01-11     26    26
##  2 2016-01-18     27    27
##  3 2016-01-25     28    28
##  4 2016-02-01     22    22
##  5 2016-02-08     27    27
##  6 2016-02-15     31    31
##  7 2016-02-22     26    26
##  8 2016-02-29     27    27
##  9 2016-03-07     21    21
## 10 2016-03-14     23    23
## # ... with 72 more rows
modeldat%>%head
## # A tibble: 6 x 3
##         Date Actual clean
##       <dttm>  <dbl> <dbl>
## 1 2016-01-11     26    26
## 2 2016-01-18     27    27
## 3 2016-01-25     28    28
## 4 2016-02-01     22    22
## 5 2016-02-08     27    27
## 6 2016-02-15     31    31

Even after removing outliers, the weekly data is still pretty volatile. Visually, we could draw a line through the series tracing its bigger troughs and peaks while smoothing out noisy fluctuations. This line can be described by one of the simplest, but also very useful, concepts in time series analysis known as a moving average. It is an intuitive concept that averages points across several time periods, thereby smoothing the observed data into a more stable, predictable series.

Formally, a moving average of order \(m\) can be calculated by taking an average of the series Y over \(k\) periods around each point:

\[ MA_t = \frac{1}{m} \sum^k_{j=-k} y_{t+j} \]

where \(m = 2k + 1\). The above quantity is also called a symmetric moving average because data on each side of a point is involved in the calculation.

Note that the moving average in this context is distinct from the MA(q) component in the above ARIMA definition. Moving average MA(q) as part of the ARIMA framework refers to error lags and combinations, whereas the summary statistic of moving average refers to a data smoothing technique.

The wider the window of the moving average, the smoother the original series becomes.

#modeldat$ma4 = ma(modeldat$clean, order=4) # using the clean count with no outliers
#modeldat$ma1 = ma(modeldat$clean, order=1)





modeldat=modeldat%>%mutate(clean=as.numeric(modeldat$clean),
                           ma4 = as.numeric(ma(modeldat$clean, order=4)),
                           ma8 = as.numeric(ma(modeldat$clean, order=8)))

str(modeldat )
## Classes 'tbl_df', 'tbl' and 'data.frame':    82 obs. of  5 variables:
##  $ Date  : POSIXct, format: "2016-01-11" "2016-01-18" ...
##  $ Actual: num  26 27 28 22 27 31 26 27 21 23 ...
##  $ clean : num  26 27 28 22 27 31 26 27 21 23 ...
##  $ ma4   : num  NA NA 25.9 26.5 26.8 ...
##  $ ma8   : num  NA NA NA NA 26.4 ...
modeldat%>%tail()
## # A tibble: 6 x 5
##         Date Actual clean    ma4    ma8
##       <dttm>  <dbl> <dbl>  <dbl>  <dbl>
## 1 2017-07-17     32    32 32.375 32.375
## 2 2017-07-24     30    30 32.250 32.375
## 3 2017-07-31     33    33 32.000     NA
## 4 2017-08-07     33    33 32.000     NA
## 5 2017-08-14     32    32     NA     NA
## 6 2017-08-21     30    30     NA     NA
ggplot() +
  geom_line(data = modeldat, aes(x = Date, y = clean, colour = "Actual")) +
  geom_line(data = modeldat, aes(x = Date, y = ma8,   colour = "8-Week")) +
  geom_line(data = modeldat, aes(x = Date, y = ma4,   colour = "4-Week")) +
  theme_economist_white() +
  theme(legend.position = "bottom") +
  labs(title = "Moving Average", y = "Vehicle Count", colour = "MA")

#ggplotly()

In addition to volatility, modeling daily level data might require specifying multiple seasonality levels, such as day of the week, week of the year, month of the year, holidays, etc. For the sake of simplicity, we will model the smoothed series of the weekly moving average (as shown by the blue line above).

  • Step 3: Decompose Your Data

The building blocks of a time series analysis are seasonality, trend, and cycle. These intuitive components capture the historical patterns in the series. Not every series will have all three (or any) of these components, but if they are present, deconstructing the series can help you understand its behavior and prepare a foundation for building a forecasting model.

Seasonal component refers to fluctuations in the data related to calendar cycles. For example, more vehicle testing occurs in the summer and during warm weather, and less during colder months. Usually, seasonality is fixed at some number; for instance, quarter or month of the year.

Trend component is the overall pattern of the series. It answers whether the data is increasing, decreasing, or constant over time.

Cycle component consists of decreasing or increasing patterns that are not seasonal. Usually, trend and cycle components are grouped together. Trend-cycle component is estimated using moving averages.

Finally, part of the series that can’t be attributed to seasonal, cycle, or trend components is referred to as residual or error. These components can be extracted by decomposition.

Formally, if Y is the number of vehicles tested, we can decompose the series in two ways: by using either an additive or multiplicative model,

\[Y_t = S_t + T_t + E_t\] \[Y_t = S_t \times T_t \times E_t\]

where \(S_{t}\) is the seasonal component, \(T_{t}\) is the trend and cycle component, and \(E_{t}\) is the remaining error.

An additive model is usually more appropriate when the seasonal or trend component is not proportional to the level of the series, as we can just overlay (i.e. add) components together to reconstruct the series. On the other hand, if the seasonality component changes with the level or trend of the series, a simple “overlay,” or addition of components, won’t be sufficient to reconstruct the series. In that case, a multiplicative model might be more appropriate.
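A common practical trick, sketched below with the built-in AirPassengers series (illustrative only, assuming the packages from Step 1 are loaded), is that taking logarithms turns a multiplicative structure into an additive one, so an additive decomposition of log(Y) plays the role of a multiplicative decomposition of Y:

# AirPassengers has seasonal swings that grow with the level (multiplicative structure)
decomp_mult    <- decompose(AirPassengers, type = "multiplicative")
decomp_add_log <- stl(log(AirPassengers), s.window = "periodic")   # additive on the log scale

autoplot(decomp_mult)
autoplot(decomp_add_log)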

As mentioned above, ARIMA models can be fitted to both seasonal and non-seasonal data. Seasonal ARIMA requires a more complicated specification of the model structure, although the process of determining (P, D, Q) is similar to that of choosing non-seasonal order parameters. Therefore, we will explore how to de-seasonalize the series and use a “vanilla” non-seasonal ARIMA model.

First, we calculate seasonal component of the data using stl(). STL is a flexible function for decomposing and forecasting the series. It calculates the seasonal component of the series using smoothing, and adjusts the original series by subtracting seasonality in two simple lines:

ma4_ts = ts(na.omit(modeldat$ma4), frequency=4)


#Seasonal Decomposition of Time Series by Loess
#Decompose a time series into seasonal, trend and irregular components using #loess, acronym STL.

decomp = stl(ma4_ts, s.window="periodic")
 
deseasonal_ts <- seasadj(decomp)
 
autoplot(decomp)

sweep::sw_tidy_decomp(decomp)%>%head()
## # A tibble: 6 x 6
##           index observed       season    trend  remainder  seasadj
##   <S3: yearqtr>    <dbl>        <dbl>    <dbl>      <dbl>    <dbl>
## 1          1 Q1   25.875  0.019121827 26.11277 -0.2568879 25.85588
## 2          1 Q2   26.500  0.003109354 26.37775  0.1191396 26.49689
## 3          1 Q3   26.750 -0.020664134 26.59650  0.1741640 26.77066
## 4          1 Q4   27.125 -0.001567046 26.74690  0.3796717 27.12657
## 5          2 Q1   27.000  0.019121827 26.26895  0.7119241 26.98088
## 6          2 Q2   25.250  0.003109354 25.42256 -0.1756734 25.24689
autoplot(deseasonal_ts)

The de-seasoned data looks stationary now, with no obvious jumps or troughs. Note that stl() by default assumes an additive model structure. Use allow.multiplicative.trend=TRUE to incorporate a multiplicative model.

In the case of additive model structure, the same task of decomposing the series and removing the seasonality can be accomplished by simply subtracting the seasonal component from the original series. seasadj() is a convenient method inside the forecast package.

As for the frequency parameter in the ts() object, we are specifying the periodicity of the data, i.e., the number of observations per period. Since we are using the 4-week moving average of the weekly data, we set the frequency to 4 observations per period.
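For reference, the frequency argument simply declares how many observations make up one seasonal period; the values below are common conventions shown with placeholder data, not requirements:

ts(rnorm(120), frequency = 12)   # monthly data with a yearly cycle
ts(rnorm(104), frequency = 52)   # weekly data with a yearly cycle
ts(rnorm(82),  frequency = 4)    # the convention used here: four weeks per period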

We now have a de-seasonalized series and can proceed to the next step.

  • Step 4: Stationarity

Fitting an ARIMA model requires the series to be stationary. A series is said to be stationary when its mean, variance, and autocovariance are time invariant. This assumption makes intuitive sense: Since ARIMA uses previous lags of series to model its behavior, modeling stable series with consistent properties involves less uncertainty. The left panel below shows an example of a stationary series, where data values oscillate with a steady variance around the mean of 1. The panel on the right shows a non-stationary series; mean of this series will differ across different time windows.

knitr::include_graphics('/Users/nanaakwasiabayieboateng/Documents/memphisclassesbooks/Time series/Stationary.png')

The augmented Dickey-Fuller (ADF) test is a formal statistical test for stationarity. The null hypothesis assumes that the series is non-stationary. The ADF procedure tests whether the change in Y can be explained by its lagged value and a linear trend. If the contribution of the lagged value to the change in Y is non-significant and there is a presence of a trend component, the series is non-stationary and the null hypothesis will not be rejected.

  • Augmented Dickey–Fuller Test

Computes the Augmented Dickey-Fuller test for the null that x has a unit root.

m=as_vector(modeldat["Actual"])



# adf.test(ma4_ts , alternative = "stationary")
 
 #tseries::adf.test(diff(m), alternative="stationary")

sweep::sw_glance(tseries::adf.test(diff(m), alternative="stationary"))
## # A tibble: 1 x 5
##   statistic    p.value parameter                       method alternative
##       <dbl>      <dbl>     <dbl>                       <fctr>      <fctr>
## 1 -3.816363 0.02231461         4 Augmented Dickey-Fuller Test  stationary
sweep::sw_glance(tseries::adf.test(m, alternative="stationary", k=0))
## # A tibble: 1 x 5
##   statistic    p.value parameter                       method alternative
##       <dbl>      <dbl>     <dbl>                       <fctr>      <fctr>
## 1 -3.344676 0.07020866         0 Augmented Dickey-Fuller Test  stationary
#sweep::sw_tidy_decomp(dat_ts)%>%head()

sweep::sw_tidy(tseries::adf.test(ma4_ts, alternative="stationary"))
## # A tibble: 1 x 5
##   statistic   p.value parameter                       method alternative
##       <dbl>     <dbl>     <dbl>                       <fctr>      <fctr>
## 1 -2.193784 0.4964101         4 Augmented Dickey-Fuller Test  stationary

The vehicle data is non-stationary (p-value = 0.0702) by the ADF test, whereas the differenced form of the data is stationary (p-value = 0.0223).

Usually, non-stationary series can be corrected by a simple transformation such as differencing. Differencing the series can help in removing its trend or cycles. The idea behind differencing is that, if the original data series does not have constant properties over time, then the change from one period to another might. The difference is calculated by subtracting one period’s values from the previous period’s values:

\[Y_{d_t} = Y_t - Y_{t-1}\]

Higher order differences are calculated in a similar fashion. For example, second order differencing (d = 2) is simply expanded to include second lag of the series:

\[Y_{d2_t} = Y_{d_t} - Y_{d_t-1} = (Y_t - Y_{t-1}) - (Y_{t-1} - Y_{t-2})\]

Similarly, differencing can be used if there is a seasonal pattern at specific lags. In such a case, subtracting a value for the “season” from a previous period represents the change from one period to another, as well as from one season to another:

\[Y_{d_t} = (Y_t - Y_{t-s}) - (Y_{t-1} - Y_{t-s-1} )\]

The number of differences performed is represented by the d component of ARIMA. Now, we move on to diagnostics that can help determine the order of differencing.
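The forecast package also provides unit-root-based helpers that suggest values of d (and the seasonal D); they were not used in the original analysis, but a quick sketch on the smoothed series looks like this:

forecast::ndiffs(ma4_ts)    # suggested number of regular differences d
forecast::nsdiffs(ma4_ts)   # suggested number of seasonal differences D (needs frequency > 1)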


  • Step 5: Autocorrelations and Choosing Model Order

Autocorrelation plots (also known as ACF or the autocorrelation function) are a useful visual tool for determining whether a series is stationary. These plots can also help to choose the order parameters for the ARIMA model. If the series is correlated with its lags, then generally there are some trend or seasonal components, and therefore its statistical properties are not constant over time.

ACF plots display the correlation between a series and its lags. In addition to suggesting the order of differencing, ACF plots can help in determining the order of the MA(q) model. Partial autocorrelation plots (PACF), as the name suggests, display the correlation between a variable and its lags that is not explained by previous lags. PACF plots are useful when determining the order of the AR(p) model.

R plots 95% significance boundaries as blue dotted lines. There is significant (positive) autocorrelation at lag 1 in the vehicle data, as shown by the ACF plot below. There are also significant autocorrelations (positive and negative) at lags up to 5. This suggests fitting an AR model with order between 1 and 5.

ggAcf(dat_ts,main="non differenced data",lag=40)

ggPacf(dat_ts,main="non differenced data",lag=40)

ggAcf(diff(dat_ts),main="differenced data",lag=40)

ggPacf(diff(dat_ts),main="differenced data",lag=40)

ggAcf(ma4_ts,main="non differenced data")

ggPacf(ma4_ts,main="non differenced data")

ggAcf(deseasonal_ts,main="deseasonal data",lag=40)

ggPacf(deseasonal_ts,main="deseasonal data",lag=40)

We can start with the order of d = 1 and re-evaluate whether further differencing is needed.

The augmented Dickey-Fuller test on the differenced raw series (above) rejected the null hypothesis of non-stationarity. Plotting the differenced de-seasonalized series, we see an oscillating pattern around 0 with no visible strong trend. This suggests that differencing of order 1 is sufficient and should be included in the model.

diff_data = diff(deseasonal_ts , differences = 1)

autoplot(diff_data)+ylab("Differenced Data")

adf.test(diff_data, alternative = "stationary")
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff_data
## Dickey-Fuller = -2.9132, Lag order = 4, p-value = 0.2026
## alternative hypothesis: stationary

Next, spikes at particular lags of the differenced series can help inform the choice of p or q for our model:

ggAcf(diff_data, main='ACF for Differenced Series',lag=20)

ggPacf(diff_data, main='PACF for Differenced Series',lag=20)

  • Step 6: Fitting an ARIMA model

Now let’s fit a model. The forecast package allows the user to explicitly specify the order of the model using the arima() function, or automatically generate a set of optimal (p, d, q) using auto.arima(). This function searches through combinations of order parameters and picks the set that optimizes model fit criteria.

There exist a number of such criteria for comparing quality of fit across multiple models. Two of the most widely used are the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). These criteria are closely related and can be interpreted as an estimate of how much information would be lost if a given model is chosen. When comparing models, one wants to minimize AIC and BIC.

While auto.arima() can be very useful, it is still important to complete steps 1-5 in order to understand the series and interpret model results. Note that auto.arima() also allows the user to specify maximum order for (p, d, q), which is set to 5 by default.
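If a more exhaustive search is wanted, auto.arima()'s stepwise shortcut and approximation can be switched off and the maximum orders capped explicitly; this is a slower, more thorough sketch rather than what was run below:

# Exhaustive (non-stepwise) search with explicit order caps; slower but more thorough
fit_search <- forecast::auto.arima(deseasonal_ts,
                                   seasonal      = FALSE,
                                   stepwise      = FALSE,
                                   approximation = FALSE,
                                   max.p = 5, max.q = 5, max.d = 2,
                                   ic = "aic")
summary(fit_search)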

We can specify a non-seasonal ARIMA structure and fit the model to the de-seasonalized data. The parameters (2,0,2) suggested by the automated procedure are in line with our expectations based on the steps above: the model uses no differencing (d = 0), autoregressive terms at the first and second lags, and moving average terms at the first and second lags.

  • Fitting non-differenced data
sweep::sw_tidy(auto.arima(dat_ts, seasonal=TRUE))
## # A tibble: 2 x 2
##        term   estimate
##       <chr>      <dbl>
## 1       ar1  0.7566933
## 2 intercept 29.3606866
sweep::sw_glance(auto.arima(dat_ts, seasonal=TRUE)) 
## # A tibble: 1 x 12
##                        model.desc    sigma    logLik      AIC      BIC
##                             <chr>    <dbl>     <dbl>    <dbl>    <dbl>
## 1 ARIMA(1,0,0) with non-zero mean 2.814355 -200.6137 407.2273 414.4475
## # ... with 7 more variables: ME <dbl>, RMSE <dbl>, MAE <dbl>, MPE <dbl>,
## #   MAPE <dbl>, MASE <dbl>, ACF1 <dbl>
sweep::sw_augment(auto.arima(dat_ts, seasonal=TRUE) )%>%head()
## # A tibble: 6 x 4
##   index .actual  .fitted     .resid
##   <int>   <dbl>    <dbl>      <dbl>
## 1     1      26 28.19712 -2.1971163
## 2     2      27 26.81768  0.1823223
## 3     3      28 27.57437  0.4256290
## 4     4      22 28.33106 -6.3310642
## 5     5      27 23.79090  3.2090953
## 6     6      31 27.57437  3.4256290
  • Fitting Differenced Data
sweep::sw_tidy(auto.arima(diff(dat_ts), seasonal=TRUE))
## # A tibble: 1 x 2
##    term   estimate
##   <chr>      <dbl>
## 1   ma1 -0.2817896
sweep::sw_glance(auto.arima(diff(dat_ts), seasonal=TRUE)) 
## # A tibble: 1 x 12
##                    model.desc    sigma    logLik      AIC      BIC
##                         <chr>    <dbl>     <dbl>    <dbl>    <dbl>
## 1 ARIMA(0,0,1) with zero mean 2.881785 -200.2035 404.4069 409.1958
## # ... with 7 more variables: ME <dbl>, RMSE <dbl>, MAE <dbl>, MPE <dbl>,
## #   MAPE <dbl>, MASE <dbl>, ACF1 <dbl>
sweep::sw_augment(auto.arima(diff(dat_ts), seasonal=TRUE) )%>%head()
## # A tibble: 6 x 4
##   index .actual     .fitted     .resid
##   <dbl>   <dbl>       <dbl>      <dbl>
## 1     2       1  0.03748455  0.9625154
## 2     3       1 -0.25739294  1.2573929
## 3     4      -6 -0.35459146 -5.6454085
## 4     5       5  1.59051328  3.4094867
## 5     6       4 -0.96073319  4.9607332
## 6     7      -5 -1.39788153 -3.6021185
auto.arima(diff(dat_ts), seasonal=TRUE)%>%forecast::forecast()%>%tk_tbl()
## # A tibble: 10 x 6
##    index `Point Forecast`   `Lo 80`  `Hi 80`   `Lo 95`  `Hi 95`
##    <int>            <dbl>     <dbl>    <dbl>     <dbl>    <dbl>
##  1    83        0.6287268 -3.064429 4.321883 -5.019467 6.276921
##  2    84        0.0000000 -3.836983 3.836983 -5.868160 5.868160
##  3    85        0.0000000 -3.836983 3.836983 -5.868160 5.868160
##  4    86        0.0000000 -3.836983 3.836983 -5.868160 5.868160
##  5    87        0.0000000 -3.836983 3.836983 -5.868160 5.868160
##  6    88        0.0000000 -3.836983 3.836983 -5.868160 5.868160
##  7    89        0.0000000 -3.836983 3.836983 -5.868160 5.868160
##  8    90        0.0000000 -3.836983 3.836983 -5.868160 5.868160
##  9    91        0.0000000 -3.836983 3.836983 -5.868160 5.868160
## 10    92        0.0000000 -3.836983 3.836983 -5.868160 5.868160
sweep::sw_tidy(auto.arima(deseasonal_ts, seasonal=FALSE))
## # A tibble: 5 x 2
##        term   estimate
##       <chr>      <dbl>
## 1       ar1  1.7097140
## 2       ar2 -0.7779049
## 3       ma1  0.6745576
## 4       ma2 -0.2911242
## 5 intercept 29.4055740
sweep::sw_glance(auto.arima(deseasonal_ts, seasonal=FALSE)) 
## # A tibble: 1 x 12
##                        model.desc     sigma    logLik      AIC      BIC
##                             <chr>     <dbl>     <dbl>    <dbl>    <dbl>
## 1 ARIMA(2,0,2) with non-zero mean 0.4669992 -52.82671 117.6534 131.7937
## # ... with 7 more variables: ME <dbl>, RMSE <dbl>, MAE <dbl>, MPE <dbl>,
## #   MAPE <dbl>, MASE <dbl>, ACF1 <dbl>
sweep::sw_augment(auto.arima(deseasonal_ts, seasonal=FALSE) )%>%head()
## # A tibble: 6 x 4
##           index  .actual  .fitted     .resid
##   <S3: yearqtr>    <dbl>    <dbl>      <dbl>
## 1          1 Q1 25.85588 26.29345 -0.4375730
## 2          1 Q2 26.49689 26.27075  0.2261406
## 3          1 Q3 26.77066 27.14749 -0.3768261
## 4          1 Q4 27.12657 27.00378  0.1227841
## 5          2 Q1 26.98088 27.63272 -0.6518461
## 6          2 Q2 25.24689 26.53556 -1.2886669
auto.arima(deseasonal_ts, seasonal=FALSE)%>%forecast::forecast()%>%tk_tbl()
## # A tibble: 8 x 6
##   index `Point Forecast`  `Lo 80`  `Hi 80`  `Lo 95`  `Hi 95`
##   <chr>            <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1 20 Q3         31.83125 31.23252 32.42998 30.91557 32.74693
## 2 20 Q4         31.49825 29.95069 33.04582 29.13146 33.86505
## 3 21 Q1         31.09651 28.72275 33.47026 27.46616 34.72685
## 4 21 Q2         30.66868 27.58570 33.75167 25.95366 35.38371
## 5 21 Q3         30.24974 26.59471 33.90478 24.65986 35.83963
## 6 21 Q4         29.86629 25.78038 33.95219 23.61743 36.11514
## 7 22 Q1         29.53657 25.15031 33.92284 22.82836 36.24478
## 8 22 Q2         29.27116 24.69441 33.84791 22.27162 36.27069

The ARIMA(2,0,2) model fitted to the de-seasonalized data has an AIC of 117, lower than the ARIMA(1,0,0) model fitted to the raw data with an AIC of 407. This suggests it is a better model.

Using the ARIMA notation introduced above, the fitted model can be written as

\[ \hat Y_{d_t} =29.4055740+ 1.7097140 Y_{t-1}-0.7779049Y_{t-2} +0.6745576e_{t-1}-0.2911242e_{t-2} + E\]

where E is some error term. Note that no further differencing was applied here (d = 0), since the model was fitted to the already de-seasonalized, smoothed series.

The AR(1) coefficient \(ϕ_{1}\) = 1.7097140 tells us that the next value in the series is taken as a multiple of the previous value, adjusted by the second lag through \(ϕ_{2}\) = -0.7779049, and also depends on the previous error lags.

  • Step 7: Evaluate and Iterate

So now we have fitted a model that can produce a forecast, but does it make sense? Can we trust this model? We can start by examining ACF and PACF plots for model residuals. If model order parameters and structure are correctly specified, we would expect no significant autocorrelations present.

fit<-auto.arima(deseasonal_ts, seasonal=FALSE)


sweep::sw_tidy(fit)
## # A tibble: 5 x 2
##        term   estimate
##       <chr>      <dbl>
## 1       ar1  1.7097140
## 2       ar2 -0.7779049
## 3       ma1  0.6745576
## 4       ma2 -0.2911242
## 5 intercept 29.4055740
sweep::sw_augment(fit)%>%head()
## # A tibble: 6 x 4
##           index  .actual  .fitted     .resid
##   <S3: yearqtr>    <dbl>    <dbl>      <dbl>
## 1          1 Q1 25.85588 26.29345 -0.4375730
## 2          1 Q2 26.49689 26.27075  0.2261406
## 3          1 Q3 26.77066 27.14749 -0.3768261
## 4          1 Q4 27.12657 27.00378  0.1227841
## 5          2 Q1 26.98088 27.63272 -0.6518461
## 6          2 Q2 25.24689 26.53556 -1.2886669
sweep::sw_glance(fit)
## # A tibble: 1 x 12
##                        model.desc     sigma    logLik      AIC      BIC
##                             <chr>     <dbl>     <dbl>    <dbl>    <dbl>
## 1 ARIMA(2,0,2) with non-zero mean 0.4669992 -52.82671 117.6534 131.7937
## # ... with 7 more variables: ME <dbl>, RMSE <dbl>, MAE <dbl>, MPE <dbl>,
## #   MAPE <dbl>, MASE <dbl>, ACF1 <dbl>
forecast::ggtsdisplay(residuals(fit), lag.max=45, main='(2,0,2) Model Residuals')

autoplot(fit)

f1=forecast::forecast(fit)%>%tk_tbl()
f1
## # A tibble: 8 x 6
##   index `Point Forecast`  `Lo 80`  `Hi 80`  `Lo 95`  `Hi 95`
##   <chr>            <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1 20 Q3         31.83125 31.23252 32.42998 30.91557 32.74693
## 2 20 Q4         31.49825 29.95069 33.04582 29.13146 33.86505
## 3 21 Q1         31.09651 28.72275 33.47026 27.46616 34.72685
## 4 21 Q2         30.66868 27.58570 33.75167 25.95366 35.38371
## 5 21 Q3         30.24974 26.59471 33.90478 24.65986 35.83963
## 6 21 Q4         29.86629 25.78038 33.95219 23.61743 36.11514
## 7 22 Q1         29.53657 25.15031 33.92284 22.82836 36.24478
## 8 22 Q2         29.27116 24.69441 33.84791 22.27162 36.27069
fit
## Series: deseasonal_ts 
## ARIMA(2,0,2)           with non-zero mean 
## 
## Coefficients:
##          ar1      ar2     ma1      ma2     mean
##       1.7097  -0.7779  0.6746  -0.2911  29.4056
## s.e.  0.0939   0.0908  0.1388   0.1368   1.0086
## 
## sigma^2 estimated as 0.2181:  log likelihood=-52.83
## AIC=117.65   AICc=118.84   BIC=131.79

The ACF is significant at lag 4 and diminishes afterwards. There is a clear pattern in the ACF/PACF and model residual plots repeating at lag 4. This suggests that our model may be better off with a different specification, such as p = 4 or q = 4.

We can repeat the fitting process allowing for the MA(4) component and examine diagnostic plots again. This time, there are no significant autocorrelations present. If the model is not correctly specified, that will usually be reflected in the residuals in the form of trends, skewness, or other patterns not captured by the model. Ideally, residuals should look like white noise, meaning they are normally distributed. The convenience function ggtsdisplay() can be used to plot these model diagnostics. The residual plots show a smaller error range, more or less centered around 0. We can also observe that the AIC is smaller for the (4,0,4) structure: the ARIMA(4,0,4) model has a smaller AIC (91) than the automatically selected model (118), and all of its partial autocorrelations and autocorrelations are non-significant and close to zero. The automatic model, by contrast, has a significant ACF at lag 4.

fit2 = arima(deseasonal_ts, order=c(4,0,4))


sweep::sw_tidy(fit2)
## # A tibble: 9 x 2
##        term    estimate
##       <chr>       <dbl>
## 1       ar1  0.59875229
## 2       ar2  0.30753098
## 3       ar3  0.02012705
## 4       ar4 -0.15135380
## 5       ma1  1.87930847
## 6       ma2  1.89917571
## 7       ma3  1.87843539
## 8       ma4  0.88579919
## 9 intercept 29.48064809
sweep::sw_glance(fit2)
## # A tibble: 1 x 12
##                        model.desc     sigma    logLik      AIC      BIC
##                             <chr>     <dbl>     <dbl>    <dbl>    <dbl>
## 1 ARIMA(4,0,4) with non-zero mean 0.3440867 -35.70411 91.40821 114.9753
## # ... with 7 more variables: ME <dbl>, RMSE <dbl>, MAE <dbl>, MPE <dbl>,
## #   MAPE <dbl>, MASE <dbl>, ACF1 <dbl>
forecast::ggtsdisplay(residuals(fit2), lag.max=40, main='de-Seasonal Model Residuals ARIMA(4,0,4)')

forecast::accuracy(fit2) #accuracy estimates of model
##                       ME      RMSE       MAE         MPE      MAPE
## Training set 0.007840418 0.3440867 0.2677631 0.008896333 0.9302244
##                   MASE       ACF1
## Training set 0.3249028 0.00933182
forecast::tsoutliers(m)  #checks for outliers
## $index
## named integer(0)
## 
## $replacements
## named numeric(0)
forecast::arimaorder(fit2)  #gets order of arima model
## [1] 4 0 4
forecast::checkresiduals(fit) # plots the residuals, their ACF and histogram, and runs a Ljung-Box test

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,0,2) with non-zero mean
## Q* = 15.152, df = 3, p-value = 0.001691
## 
## Model df: 5.   Total lags used: 8
lambda <- BoxCox.lambda(lynx)  # lambda estimated from the built-in lynx series, for illustration

forecast::BoxCox(m,lambda ) #box cox transformation
##  Actual1  Actual2  Actual3  Actual4  Actual5  Actual6  Actual7  Actual8 
## 4.217669 4.279811 4.340033 3.946845 4.279811 4.510357 4.217669 4.279811 
##  Actual9 Actual10 Actual11 Actual12 Actual13 Actual14 Actual15 Actual16 
## 3.872646 4.018238 4.153465 4.018238 4.018238 4.018238 4.018238 4.279811 
## Actual17 Actual18 Actual19 Actual20 Actual21 Actual22 Actual23 Actual24 
## 4.510357 4.455198 4.616295 4.398457 4.398457 4.564028 4.510357 4.765420 
## Actual25 Actual26 Actual27 Actual28 Actual29 Actual30 Actual31 Actual32 
## 4.812788 4.859083 4.904356 4.948656 4.716923 4.859083 4.564028 4.616295 
## Actual33 Actual34 Actual35 Actual36 Actual37 Actual38 Actual39 Actual40 
## 4.398457 4.398457 4.279811 4.087046 4.018238 4.510357 4.564028 4.616295 
## Actual41 Actual42 Actual43 Actual44 Actual45 Actual46 Actual47 Actual48 
## 4.455198 4.564028 4.616295 4.716923 4.564028 4.455198 4.398457 4.279811 
## Actual49 Actual50 Actual51 Actual52 Actual53 Actual54 Actual55 Actual56 
## 4.279811 4.217669 4.217669 4.153465 4.153465 3.946845 4.153465 4.217669 
## Actual57 Actual58 Actual59 Actual60 Actual61 Actual62 Actual63 Actual64 
## 4.018238 4.153465 4.340033 4.340033 4.398457 4.455198 4.564028 4.765420 
## Actual65 Actual66 Actual67 Actual68 Actual69 Actual70 Actual71 Actual72 
## 4.564028 4.279811 4.564028 4.340033 4.340033 4.455198 4.667236 4.564028 
## Actual73 Actual74 Actual75 Actual76 Actual77 Actual78 Actual79 Actual80 
## 4.340033 4.667236 4.564028 4.716923 4.564028 4.455198 4.616295 4.616295 
## Actual81 Actual82 
## 4.564028 4.455198
f3=f1=forecast::forecast(fit)%>%tk_tbl()
f1%>%tk_tbl()
## Warning in tk_tbl.data.frame(.): Warning: No index to preserve. Object
## otherwise converted to tibble successfully.
## # A tibble: 8 x 6
##   index `Point Forecast`  `Lo 80`  `Hi 80`  `Lo 95`  `Hi 95`
##   <chr>            <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1 20 Q3         31.83125 31.23252 32.42998 30.91557 32.74693
## 2 20 Q4         31.49825 29.95069 33.04582 29.13146 33.86505
## 3 21 Q1         31.09651 28.72275 33.47026 27.46616 34.72685
## 4 21 Q2         30.66868 27.58570 33.75167 25.95366 35.38371
## 5 21 Q3         30.24974 26.59471 33.90478 24.65986 35.83963
## 6 21 Q4         29.86629 25.78038 33.95219 23.61743 36.11514
## 7 22 Q1         29.53657 25.15031 33.92284 22.82836 36.24478
## 8 22 Q2         29.27116 24.69441 33.84791 22.27162 36.27069
f3
## # A tibble: 8 x 6
##   index `Point Forecast`  `Lo 80`  `Hi 80`  `Lo 95`  `Hi 95`
##   <chr>            <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1 20 Q3         31.83125 31.23252 32.42998 30.91557 32.74693
## 2 20 Q4         31.49825 29.95069 33.04582 29.13146 33.86505
## 3 21 Q1         31.09651 28.72275 33.47026 27.46616 34.72685
## 4 21 Q2         30.66868 27.58570 33.75167 25.95366 35.38371
## 5 21 Q3         30.24974 26.59471 33.90478 24.65986 35.83963
## 6 21 Q4         29.86629 25.78038 33.95219 23.61743 36.11514
## 7 22 Q1         29.53657 25.15031 33.92284 22.82836 36.24478
## 8 22 Q2         29.27116 24.69441 33.84791 22.27162 36.27069
sweep::sw_tidy(forecast::forecast(fit)) #similar to step above
## Warning in tidy.default(x, ...): No method for tidying an S3 object of
## class forecast , using as.data.frame
## # A tibble: 8 x 5
##   `Point Forecast`  `Lo 80`  `Hi 80`  `Lo 95`  `Hi 95`
## *            <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1         31.83125 31.23252 32.42998 30.91557 32.74693
## 2         31.49825 29.95069 33.04582 29.13146 33.86505
## 3         31.09651 28.72275 33.47026 27.46616 34.72685
## 4         30.66868 27.58570 33.75167 25.95366 35.38371
## 5         30.24974 26.59471 33.90478 24.65986 35.83963
## 6         29.86629 25.78038 33.95219 23.61743 36.11514
## 7         29.53657 25.15031 33.92284 22.82836 36.24478
## 8         29.27116 24.69441 33.84791 22.27162 36.27069
#Forecasts for intermittent demand using Croston's method
f2=forecast::croston(m,h=12)
f2
##    Point Forecast
## 83       31.15001
## 84       31.15001
## 85       31.15001
## 86       31.15001
## 87       31.15001
## 88       31.15001
## 89       31.15001
## 90       31.15001
## 91       31.15001
## 92       31.15001
## 93       31.15001
## 94       31.15001
#Test accuracy of two models
dm.test(residuals(fit),residuals(fit2),h=12)
## 
##  Diebold-Mariano Test
## 
## data:  residuals(fit)residuals(fit2)
## DM = 3.4832, Forecast horizon = 12, Loss function power = 2,
## p-value = 0.0008204
## alternative hypothesis: two.sided
#find frequency of series
forecast::findfrequency(dat)
## [1] 3
#dshw(taylor)

taylor%>%tk_tbl()%>%head()
## # A tibble: 6 x 2
##      index value
##      <dbl> <int>
## 1 1.000000 22262
## 2 1.002976 21756
## 3 1.005952 22247
## 4 1.008929 22759
## 5 1.011905 22549
## 6 1.014881 22313

Forecasting using a fitted model is straightforward in R. We can specify forecast horizon h periods ahead for predictions to be made, and use the fitted model to generate those predictions:

fcast <- forecast(fit2, h=12)
autoplot(fcast)

fcast%>%timetk::tk_tbl()
## # A tibble: 12 x 6
##    index `Point Forecast`  `Lo 80`  `Hi 80`  `Lo 95`  `Hi 95`
##    <chr>            <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
##  1 20 Q3         31.82830 31.38182 32.27477 31.14547 32.51112
##  2 20 Q4         31.68145 30.48838 32.87452 29.85681 33.50609
##  3 21 Q1         31.40017 29.37489 33.42546 28.30277 34.49758
##  4 21 Q2         30.94811 27.99651 33.89970 26.43403 35.46218
##  5 21 Q3         30.63858 26.99909 34.27806 25.07247 36.20468
##  6 21 Q4         30.33078 26.27020 34.39137 24.12065 36.54091
##  7 22 Q1         30.08477 25.74603 34.42352 23.44923 36.72032
##  8 22 Q2         29.90501 25.40686 34.40316 23.02569 36.78434
##  9 22 Q3         29.76238 25.17612 34.34864 22.74830 36.77646
## 10 22 Q4         29.66333 25.02998 34.29668 22.57723 36.74943
## 11 23 Q1         29.59377 24.93743 34.25012 22.47251 36.71504
## 12 23 Q2         29.54600 24.87883 34.21318 22.40818 36.68383

The light blue line above shows the fit provided by the model, but what if we wanted a sense of how the model will perform in the future? One method is to reserve a portion of the data as a “hold-out” set, fit the model to the remaining observations, and then compare the forecast to the actual observed values:

# reserve observations 62 onwards as a hold-out set
hold <- window(ts(deseasonal_ts), start=62)

# fit the model on the data excluding the hold-out window
fit_no_holdout = arima(ts(deseasonal_ts[-c(62:82)]), order=c(4,0,4))

fcast_no_holdout <- forecast(fit_no_holdout,h=20)

# plot the forecast and overlay the full observed series
plot(fcast_no_holdout, main=" ")
lines(ts(deseasonal_ts))

sweep::sw_tidy(fcast_no_holdout)
## Warning in tidy.default(x, ...): No method for tidying an S3 object of
## class forecast , using as.data.frame
## # A tibble: 20 x 5
##    `Point Forecast`  `Lo 80`  `Hi 80`  `Lo 95`  `Hi 95`
##  *            <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
##  1         32.87171 32.43642 33.30700 32.20599 33.53743
##  2         33.64538 32.42853 34.86224 31.78436 35.50640
##  3         33.72108 31.62513 35.81702 30.51560 36.92655
##  4         33.24859 30.24215 36.25503 28.65063 37.84655
##  5         32.89047 29.17398 36.60697 27.20659 38.57436
##  6         32.06539 27.82344 36.30735 25.57788 38.55290
##  7         31.24645 26.61129 35.88162 24.15758 38.33533
##  8         30.66001 25.80282 35.51719 23.23158 38.08843
##  9         29.84077 24.86045 34.82109 22.22403 37.45752
## 10         29.33303 24.29381 34.37225 21.62621 37.03985
## 11         28.98588 23.93364 34.03812 21.25914 36.71262
## 12         28.55589 23.50200 33.60978 20.82663 36.28514
## 13         28.48571 23.43094 33.54049 20.75510 36.21633
## 14         28.41776 23.35441 33.48111 20.67403 36.16149
## 15         28.35032 23.27585 33.42478 20.58959 36.11104
## 16         28.53892 23.44815 33.62968 20.75326 36.32457
## 17         28.60030 23.49243 33.70818 20.78848 36.41212
## 18         28.71406 23.59470 33.83343 20.88467 36.54345
## 19         28.93847 23.80865 34.06830 21.09308 36.78387
## 20         28.99202 23.85622 34.12783 21.13749 36.84656

However, the blue line representing the forecast looks very naive: it quickly approaches a straight line, which seems unlikely given the past behavior of the series. Recall that the model was fit to the de-seasonalized series and assumes no remaining seasonality. In other words, the plotted predictions assume there will be no further seasonal fluctuations and that the change in the series from one period to the next is roughly constant in mean and variance. This forecast may be naive, but it illustrates the process of choosing an ARIMA model and can also serve as a benchmark to grade more complex models against.
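
As a point of comparison, a simple benchmark can be fit on the same training window; a minimal sketch using a naive (last-value) forecast, assuming deseasonal_ts and fcast_no_holdout from above:

# naive benchmark: every future value equals the last observed value
bench <- forecast::naive(ts(deseasonal_ts[-c(62:82)]), h = 20)
forecast::accuracy(bench)             # training-set errors for the benchmark
forecast::accuracy(fcast_no_holdout)  # training-set errors for the ARIMA(4,0,4) fit
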

How can we improve the forecast and iterate on this model? One simple change is to add back the seasonal component we extracted earlier. Another approach is to allow seasonal (P, D, Q) components in the model, which auto.arima() does by default (here we keep seasonal = FALSE, since the series has already been de-seasonalized). Re-fitting the model on the same data, auto.arima() selects an ARIMA(2,0,2) model with a non-zero mean:

fit_w_seasonality=auto.arima(deseasonal_ts, seasonal=FALSE)

fit_w_seasonality
## Series: deseasonal_ts 
## ARIMA(2,0,2)           with non-zero mean 
## 
## Coefficients:
##          ar1      ar2     ma1      ma2     mean
##       1.7097  -0.7779  0.6746  -0.2911  29.4056
## s.e.  0.0939   0.0908  0.1388   0.1368   1.0086
## 
## sigma^2 estimated as 0.2181:  log likelihood=-52.83
## AIC=117.65   AICc=118.84   BIC=131.79
seas_fcast<-forecast(fit_w_seasonality, h=12)

seas_fcast
##       Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 20 Q3       31.83125 31.23252 32.42998 30.91557 32.74693
## 20 Q4       31.49825 29.95069 33.04582 29.13146 33.86505
## 21 Q1       31.09651 28.72275 33.47026 27.46616 34.72685
## 21 Q2       30.66868 27.58570 33.75167 25.95366 35.38371
## 21 Q3       30.24974 26.59471 33.90478 24.65986 35.83963
## 21 Q4       29.86629 25.78038 33.95219 23.61743 36.11514
## 22 Q1       29.53657 25.15031 33.92284 22.82836 36.24478
## 22 Q2       29.27116 24.69441 33.84791 22.27162 36.27069
## 22 Q3       29.07385 24.39060 33.75711 21.91143 36.23628
## 22 Q4       28.94299 24.21042 33.67557 21.70514 36.18084
## 23 Q1       28.87274 24.12391 33.62156 21.61003 36.13544
## 23 Q2       28.85442 24.10350 33.60534 21.58851 36.12033
autoplot(seas_fcast)

fit3=auto.arima(m)

forecast(fit3)
##    Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 83       29.84445 26.23771 33.45119 24.32842 35.36048
## 84       29.72675 25.20380 34.24970 22.80949 36.64400
## 85       29.63768 24.66556 34.60980 22.03348 37.24189
## 86       29.57029 24.35838 34.78220 21.59936 37.54122
## 87       29.51929 24.17492 34.86366 21.34578 37.69280
## 88       29.48070 24.06195 34.89946 21.19343 37.76797
## 89       29.45150 23.99061 34.91239 21.09979 37.80321
## 90       29.42940 23.94453 34.91428 21.04102 37.81779
## 91       29.41269 23.91413 34.91124 21.00337 37.82200
## 92       29.40003 23.89366 34.90641 20.97876 37.82131

Note that the (p, d, q) parameters may change when the model is re-fit in this way. We can go through the same process of evaluating model residuals and ACF/PACF plots and adjusting the structure if necessary. For example, if the residual autocorrelations still show significant spikes at higher lags, a higher-order component might be needed, as in the sketch below:
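
A minimal sketch of that evaluation loop, assuming fit_w_seasonality and deseasonal_ts from above; the manual re-specification via Arima() is purely illustrative, not a recommendation:

# residual diagnostics: Ljung-Box test, residual ACF and histogram
forecast::checkresiduals(fit_w_seasonality)

# hypothetical refit adding a seasonal AR(1) term; compare AICc with the automatic fit
fit_manual <- forecast::Arima(deseasonal_ts, order = c(2, 0, 2),
                              seasonal = list(order = c(1, 0, 0)))
fit_manual$aicc
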

Both forecasts above are shown with prediction bounds: the 80% limits are shaded in darker blue and the 95% limits in lighter blue. Longer-term forecasts usually carry more uncertainty, since the model regresses future values of Y on previously predicted values; this is reflected in the bounds, which widen with increasing horizon. A rapidly widening band can signal the need for a more stable model. It is important to look at the prediction bounds and keep in mind the expected error around the point estimates.
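
Prediction intervals at other coverage levels can be requested directly through the level argument; for example:

# 50% and 90% intervals instead of the default 80% and 95%
forecast(fit_w_seasonality, h = 12, level = c(50, 90))
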

What’s Next?

After an initial naive model is built, it’s natural to wonder how to improve on it. Other forecasting techniques, such as exponential smoothing, could improve accuracy by using a weighted combination of seasonality, trend, and historical values to make predictions. In addition, a demand series like this one probably depends heavily on other factors, such as weather, holidays, time of day, etc. One could try fitting time series models that allow the inclusion of other predictors, using methods such as ARMAX or dynamic regression. These more complex models allow for control of other factors in predicting the time series.
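
A minimal sketch of such a fit using the xreg argument of auto.arima(); temp here is a made-up stand-in for a real covariate such as temperature, so the result is illustrative only:

set.seed(1)
# hypothetical external regressor aligned with the series
temp <- matrix(rnorm(length(dat_ts)), ncol = 1, dimnames = list(NULL, "temp"))
fit_xreg <- auto.arima(dat_ts, xreg = temp)
# forecasting requires future values of the regressor (12 made-up values here)
temp_future <- matrix(rnorm(12), ncol = 1, dimnames = list(NULL, "temp"))
forecast(fit_xreg, xreg = temp_future)
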

To summarize, the procedure outlined in this tutorial is an introduction to ARIMA modeling. The material here can be used as a general guideline for examining your series, using ACF and PACF plots to choose the model order, and fitting the model in R.

# Using Fourier series for a "ts" object
# K is chosen to minimize the AICc
deaths.model<-auto.arima(USAccDeaths, xreg=fourier(USAccDeaths,K=5), seasonal=FALSE)
deaths.fcast <- forecast(deaths.model, xreg=fourier(USAccDeaths, K=5, h=36))
autoplot(deaths.fcast) + xlab("Year")
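
The comment above refers to choosing K so that the AICc is minimized; a minimal sketch of that selection loop (K can range from 1 to 6 for monthly data):

best.aicc <- Inf
for (K in 1:6) {
  fit.k <- auto.arima(USAccDeaths, xreg = fourier(USAccDeaths, K = K), seasonal = FALSE)
  if (fit.k$aicc < best.aicc) {
    best.aicc <- fit.k$aicc
    best.K <- K
  }
}
best.K  # number of Fourier pairs giving the lowest AICc
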

#modelf<-auto.arima(m, xreg=fourier(m,K=1), seasonal=FALSE)
#modelf

autoplot(fit) +forecast::geom_forecast()

#plots the data with forecast

autoplot(dat_ts) +forecast::geom_forecast()+ylab("Number of Durability \n Vehicles")

Neural Network Time Series Forecasts

Feed-forward neural networks with a single hidden layer and lagged inputs for forecasting univariate time series.

# lag plots: the series plotted against lagged versions of itself
forecast::gglagplot(m,lags=4)

fit3=forecast::nnetar(dat_ts,size=8)

fcast3 <-forecast(fit3,h=12)%>%tk_tbl()
fcast3
## # A tibble: 12 x 2
##    index `Point Forecast`
##    <int>            <dbl>
##  1    83         31.76918
##  2    84         31.08046
##  3    85         31.28886
##  4    86         31.19301
##  5    87         31.23316
##  6    88         31.21557
##  7    89         31.22314
##  8    90         31.21986
##  9    91         31.22127
## 10    92         31.22066
## 11    93         31.22093
## 12    94         31.22081
autoplot(forecast(fit3))
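
nnetar() produces point forecasts only by default; prediction intervals can be simulated with the PI argument (this can be slow):

forecast(fit3, h = 12, PI = TRUE)  # simulate future sample paths to obtain intervals
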

  • Number of differences required to make a series stationary: ndiffs() for regular differencing and nsdiffs() for seasonal differencing
forecast::nsdiffs(deseasonal_ts) # 0 seasonal differences needed: the series is already de-seasonalized
## [1] 0
nsdiffs(log(AirPassengers))
## [1] 1
ndiffs(diff(log(AirPassengers),12))
## [1] 1
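
Putting the two results together, one seasonal difference followed by one regular difference should be enough; a minimal sketch:

lp <- log(AirPassengers)
lp_diff <- diff(diff(lp, lag = 12))  # one seasonal difference, then one regular difference
forecast::ndiffs(lp_diff)            # should indicate no further differencing is needed
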
#Returns forecasts and prediction intervals for a theta method forecast.

forecast::thetaf(m,h=12, level = c(90, 95))%>%sweep::sw_tidy()
## Warning in tidy.default(x, ...): No method for tidying an S3 object of
## class forecast , using as.data.frame
## # A tibble: 12 x 5
##    `Point Forecast`  `Lo 90`  `Hi 90`  `Lo 95`  `Hi 95`
##  *            <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
##  1         30.66647 25.98455 35.34839 25.08762 36.24532
##  2         30.68723 24.93286 36.44159 23.83047 37.54398
##  3         30.70798 24.05177 37.36419 22.77662 38.63934
##  4         30.72874 23.27907 38.17840 21.85191 39.60556
##  5         30.74949 22.58311 38.91588 21.01864 40.48034
##  6         30.77025 21.94516 39.59534 20.25450 41.28599
##  7         30.79100 21.35307 40.22893 19.54501 42.03699
##  8         30.81176 20.79842 40.82509 18.88013 42.74338
##  9         30.83251 20.27508 41.38994 18.25256 43.41246
## 10         30.85327 19.77844 41.92809 17.65680 44.04973
## 11         30.87402 19.30492 42.44312 17.08859 44.65945
## 12         30.89478 18.85167 42.93788 16.54453 45.24502
# augment the data with calendar-based time series signature features
dat_tk=tk_augment_timeseries_signature(dat)

#dat%>%tk_xts()%>%timetk::tk_tbl()%>%tk_xts


# model the seasonally adjusted series with stlm(), then re-seasonalize the forecasts
tsmod <- stlm(USAccDeaths, modelfunction=ar)

plot(forecast(tsmod, h=36))

# forecast directly from an STL decomposition
decomp <- stl(USAccDeaths,s.window="periodic")
plot(forecast(decomp))

# stlf() does the STL decomposition and forecasting in one step; lambda=0 applies a log transform
plot(stlf(AirPassengers, lambda=0))

x <- rnorm(100)  # no unit-root
adf.test(x)
## Warning in adf.test(x): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  x
## Dickey-Fuller = -5.8754, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary
y <- diffinv(x)   # contains a unit-root
adf.test(y)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  y
## Dickey-Fuller = -1.511, Lag order = 4, p-value = 0.7794
## alternative hypothesis: stationary
#cbind(x,y)
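
Differencing the unit-root series once should restore stationarity; a minimal sketch:

forecast::ndiffs(y)  # suggested number of differences for y
adf.test(diff(y))    # the differenced series should reject the unit-root null
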
Naive and Random Walk Forecasts