Forecasting with Time Series Modeling

‘Time’ is an important factor in the success of a business, which makes methods of prediction and forecasting extremely useful. One such family of methods for time-based data is Time Series Models. These models are particularly useful when dealing with serially correlated data. They work on time-indexed data (years, days, hours, minutes) to derive hidden insights for informed decision making. Examples of analyses using time series modeling include predicting sales numbers for the next year, analyzing website traffic, assessing competitive position, etc.

This project is a guide to working with time-based data using time series modeling and its related techniques.

Content

  1. Loading and Exploring the time series data
  2. Checking if time series data are stationary
  3. ARMA Models
  4. Exploiting the ACF and PACF to find optimal Parameters
  5. Building the ARIMA model and Predicting future Values

Loading and Exploring the time series data

The data used for this project come from R's built-in “AirPassengers” dataset, which contains monthly totals of international airline passengers from 1949 to 1960.

#loading the dataset
data("AirPassengers")

#examining start and end date of data
start(AirPassengers)
## [1] 1949    1
end(AirPassengers)
## [1] 1960   12
#check the cycle of the time series
frequency(AirPassengers) #so the cycle for this time series is 12 months in a year
## [1] 12
#print the cycle across years
cycle(AirPassengers)
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949   1   2   3   4   5   6   7   8   9  10  11  12
## 1950   1   2   3   4   5   6   7   8   9  10  11  12
## 1951   1   2   3   4   5   6   7   8   9  10  11  12
## 1952   1   2   3   4   5   6   7   8   9  10  11  12
## 1953   1   2   3   4   5   6   7   8   9  10  11  12
## 1954   1   2   3   4   5   6   7   8   9  10  11  12
## 1955   1   2   3   4   5   6   7   8   9  10  11  12
## 1956   1   2   3   4   5   6   7   8   9  10  11  12
## 1957   1   2   3   4   5   6   7   8   9  10  11  12
## 1958   1   2   3   4   5   6   7   8   9  10  11  12
## 1959   1   2   3   4   5   6   7   8   9  10  11  12
## 1960   1   2   3   4   5   6   7   8   9  10  11  12
#check overall data
summary(AirPassengers)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   104.0   180.0   265.5   280.3   360.5   622.0
#the passenger counts span a wide range, from 104 to 622

More Detailed Metrics

#plotting the time series
plot(AirPassengers)
#number of passengers increases each year; higher variance as year increases

#fitting a line to time series
abline(reg=lm(AirPassengers ~ time(AirPassengers)))

#Aggregate cycles and show a year to year trend
plot(aggregate(AirPassengers, FUN = mean), ylab = "AirPassengers (mean)")

#we see that passengers tend to increase year by year

#Using a box plot we will try to get a sense for a possible seasonal effect
boxplot(AirPassengers~cycle(AirPassengers), xlab = "Month", ylab = "AirPassengers", main = "Average Passengers per Month", names = month.abb, col = "green")

#higher numbers for July and August
#within-month variances differ but stay small --> strong seasonal effect with a cycle of 12 months or less

Important Inferences

  1. The number of passengers tends to increase year by year (an upward trend).
  2. The variance grows as the years progress, so a variance-stabilizing transformation (such as the log) will be needed.
  3. There is a strong seasonal effect with a 12-month cycle: July and August consistently show the highest passenger counts.

Checking if time series data are stationary

There are three basic criteria for a series to be classified as a stationary series:
  1. The mean of the series should not be a function of time; rather, it should be a constant.
  2. The variance of the series should not be a function of time. This property is known as homoscedasticity.
  3. The covariance of the i-th term and the (i + m)-th term should not be a function of time.

Unless our time series is stationary, we cannot build a time series model. In cases where the stationarity criteria are violated, we first need to stationarize the time series and then try stochastic models for prediction. There are multiple ways of achieving stationarity, such as detrending and differencing.

Before testing for stationarity in our time series data, we need to address two issues:

  1. We need to remove the unequal variances. This can be done by taking the log of the values.
  2. We need to address the trend component. This can be done by taking the difference of the series (a quick visual check follows this list).
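
As a quick visual check (a minimal sketch using only base R, not part of the original analysis), we can plot the log-differenced series; both the trend and the growing variance are largely gone:

#log stabilizes the variance; diff removes the trend
plot(diff(log(AirPassengers)), ylab = "diff(log(AirPassengers))")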

We perform the Dickey-Fuller test to check whether our data are stationary, using the difference of the logged values.

library(tseries)

#Dickey Fuller test with the difference of logged values
adf.test(diff(log(AirPassengers)), alternative="stationary", k=0)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff(log(AirPassengers))
## Dickey-Fuller = -9.6003, Lag order = 0, p-value = 0.01
## alternative hypothesis: stationary

Our time series data appear to be stationary (the p-value of 0.01 rejects the null hypothesis of non-stationarity), so we can now proceed with time series modeling.

ARMA Models

ARMA models are commonly used in time series modeling. In an ARMA model, AR stands for auto-regression and MA stands for moving average.

AR models are calculated using the AR(1) formulation: \[x(t) = \alpha \cdot x(t-1) + \mathrm{error}(t)\]

The numeral (1) in AR(1) denotes that the next value depends solely on the immediately previous value. Alpha is a coefficient chosen to minimize the error function. Note that x(t-1) is in turn linked to x(t-2), so any shock to x(t) fades away only gradually.

On the other hand, the MA model is calculated by: \[x(t) = \beta \cdot \mathrm{error}(t-1) + \mathrm{error}(t)\]

Compared to the MA model, the AR model has a much longer-lasting effect from a shock; in the MA model the noise vanishes quickly with time.

The primary difference between an AR and an MA model lies in the correlation between time series values at different time points. For an MA model of order n, the correlation between x(t) and x(t-n) is always zero for lags beyond n, which follows directly from the fact that their covariance is zero. In the AR model, by contrast, the correlation between x(t) and x(t-n) declines only gradually as n grows. This difference is exploited to identify whether we are dealing with an AR or an MA model, and the correlation plot gives us the order of the MA model.
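
To make this difference concrete, here is a minimal simulation sketch (the arima.sim() call and the coefficient 0.7 are illustrative assumptions, not part of the original analysis): the ACF of the simulated MA(1) series cuts off sharply after lag 1, while the ACF of the AR(1) series decays gradually.

set.seed(123)
#simulate an AR(1) and an MA(1) series of the same length
ar1 <- arima.sim(model = list(ar = 0.7), n = 200)
ma1 <- arima.sim(model = list(ma = 0.7), n = 200)
acf(ar1)  #gradual decay
acf(ma1)  #sharp cut-off after lag 1
pacf(ar1) #sharp cut-off after lag 1
pacf(ma1) #gradual decay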

Exploiting the ACF and PACF to find optimal Parameters

We have already shown that our time series data are stationary (for non-stationary data we would first transform the values using the appropriate methods to make them stationary).

Next, we need to answer two important questions:

  1. Is it an AR or MA process?

  2. What order of AR or MA process do we need to use?

The first question can be answered using the Total Correlation Chart, also known as the Autocorrelation Function (ACF). The ACF is a plot of the total correlation between a series and its lagged values. For instance, in our problem, the number of passengers at time point t is x(t), and we are interested in the correlation of x(t) with x(t-1), x(t-2), and so on.

In a moving average series of order n, we will not get any correlation between x(t) and x(t-n-1). Hence, the total correlation chart cuts off at the nth lag, which gives us the order of the MA series.

For an AR series this correlation declines gradually without any cut-off, so we use the partial ACF instead: the partial correlation of each lag cuts off after the order of the AR series. For instance, if we have an AR(1) series, then once we exclude the effect of the 1st lag (x(t-1)), the 2nd lag (x(t-2)) is independent of x(t). Hence, the partial autocorrelation function (PACF) drops sharply after the 1st lag.

To sum up, using ACF and PACF we identify the type of stationary series.

In order to build our ARIMA model we are going to work with three parameters:

p = the order of the AR (auto-regressive) term
d = the order of differencing (the I term)
q = the order of the MA (moving average) term

The additional “I” in AR“I”MA is the integration part, referring to the differencing technique used to remove non-stationarity from the time series.

The parameters p, d, q can be found using the ACF and PACF plots. If both the ACF and PACF decrease gradually, this indicates non-stationarity, and we need to make the time series stationary by introducing a value for “d”. We already know that the “d” component is 1: we differenced the series once and saw that the trend was removed. Had the trend still been there, we would have differenced the series again; since this series did not need to be differenced more than once, d = 1.
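
The choice d = 1 can also be cross-checked programmatically. As a hedged sketch (assuming the forecast package, loaded later in this guide, is installed), ndiffs() estimates the number of differences required for stationarity:

library(forecast)
#estimate the number of differences required to make the series stationary
ndiffs(log(AirPassengers)) #expected to return 1, matching our choice of d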

We find the remaining parameters using the correlation plots. Following is the ACF plot for the series:

#ACF Plots
acf(log(AirPassengers))

Clearly, the decay of the ACF chart is very slow, which means that the series is not stationary.

We now examine the difference of the logs rather than the log directly. Let's see the ACF and PACF curves after differencing.

acf(diff(log(AirPassengers)))

pacf(diff(log(AirPassengers)))

Clearly, the ACF plot cuts off after the first lag. Since it is the ACF (and not the PACF) that shows a cut-off, the series behaves like an MA process: p should be 0, while q should be 1 or 2.

Using the auto.arima() function from the “forecast” package, we find the optimal parameters.

library(forecast)
auto.arima(AirPassengers)
## Series: AirPassengers 
## ARIMA(0,1,1)(0,1,0)[12]                    
## 
## Coefficients:
##           ma1
##       -0.3184
## s.e.   0.0877
## 
## sigma^2 estimated as 138.3:  log likelihood=-508.32
## AIC=1020.64   AICc=1020.73   BIC=1026.39

This confirms (0,1,1) as the (p,d,q) combination with the least AIC and BIC.
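
As an additional hedged check on q (a sketch, not part of the original output), we can fit both candidate MA orders suggested by the ACF and compare their AIC values directly:

#compare the two candidate MA orders by AIC (lower is better)
for (q in 1:2) {
  fit_q <- arima(log(AirPassengers), order = c(0, 1, q),
                 seasonal = list(order = c(0, 1, 1), period = 12))
  cat("q =", q, " AIC =", round(AIC(fit_q), 2), "\n")
}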

Building the ARIMA model and Predicting future Values

Finally, we fit an ARIMA model to our time series and predict the next 10 years. In addition, we include a seasonal component in the ARIMA formulation.

(fit <- arima(log(AirPassengers), c(0, 1, 1),seasonal = list(order = c(0, 1, 1), period = 12)))
## 
## Call:
## arima(x = log(AirPassengers), order = c(0, 1, 1), seasonal = list(order = c(0, 
##     1, 1), period = 12))
## 
## Coefficients:
##           ma1     sma1
##       -0.4018  -0.5569
## s.e.   0.0896   0.0731
## 
## sigma^2 estimated as 0.001348:  log likelihood = 244.7,  aic = -483.4
APforecast <- predict(fit, n.ahead=10*12)

We now visualize the prediction along with the training data.

ts.plot(AirPassengers, exp(APforecast$pred), log = "y", lty = c(1, 3))

#explanations for the ts.plot arguments:
#exp(APforecast$pred): the model was fit on log(AirPassengers), so the predictions are on the log scale
#and we undo the log by exponentiating: log(forecast) = APforecast$pred, hence forecast = e ^ APforecast$pred
#(exp() is more precise than the 2.718^x approximation)
#log = "y" plots on a logarithmic y-axis
#lty = c(1, 3) sets the line type to 1 (solid) for the original time series and 3 (dotted) for the predicted time series
#print(APforecast$pred) would give the log of the predicted values; print(exp(APforecast$pred)) would give the actual predicted values.
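
Finally, as an optional sanity check (a sketch under the assumption that the last two years are representative of the rest of the series), we can refit the model on a training window and measure the forecast error on the held-out period with accuracy() from the forecast package:

#hold out the last 2 years (1959-1960) as a test set
train <- window(AirPassengers, end = c(1958, 12))
test <- window(AirPassengers, start = c(1959, 1))
fit_train <- arima(log(train), c(0, 1, 1), seasonal = list(order = c(0, 1, 1), period = 12))
pred_test <- exp(predict(fit_train, n.ahead = 24)$pred)
accuracy(pred_test, test) #error metrics (ME, RMSE, MAE, MAPE, ...) on the holdout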