Many of the forecasting packages in R require a time series that is covariance stationary. If you are not familiar
with this term, there is an excellent online textbook by Hyndman and Athanasopoulos, Forecasting: Principles and Practice,
which has a chapter on stationarity.
The first step in building a predictive or analytic procedure, therefore, is to analyze the time series to see whether it
is already covariance stationary. More often than not, a transformation is required to convert a non-stationary time
series into a stationary one.
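To see what "stationary after transformation" means in practice, here is a minimal sketch on simulated data (a hypothetical random walk, not the Powerball series): a random walk fails the stationarity test until it is differenced.
library(tseries)
set.seed(42)
rw <- cumsum(rnorm(500))    # a random walk is non-stationary
adf.test(rw)$p.value        # large p-value: cannot reject non-stationarity
adf.test(diff(rw))$p.value  # small p-value: differencing restores stationarity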
Examples of time series include stock prices, raw material prices, and in fact any data ordered at a given interval of
date and time. To keep this article easy to follow, I will use a time series of Powerball, a lottery game drawn once a
week in Australia.
Fortunately, the historical data is easy to obtain from the Powerball website as a CSV file. For this example, I will use
only one column to keep things simple; however, you can extend this example to the rest of the columns.
library(forecast)
## This is forecast 4.00
library(tseries)
library(zoo)
# (1-1) Load the data into a data.frame
hist.df <- read.csv("../R-nonsource/powerball.csv", colClasses = "character")
# (1-2) Coerce the column into a numeric vector and create a zoo object
n1.zoo <- zoo(matrix(as.numeric(hist.df$Number_1), ncol = 1),
              as.Date(hist.df$Draw_date, "%Y/%m/%d"))
names(n1.zoo) <- "n1"
# (1-3) Note: For the zoo object, the latest result is at the LAST row
tail(n1.zoo)
## n1
## 2012-10-18 31
## 2012-10-25 42
## 2012-11-01 25
## 2012-11-08 2
## 2012-11-15 37
## 2012-11-22 6
First, we load the data into a data.frame, then coerce a single column into a numeric vector and create a zoo object.
In the zoo object, the last row corresponds to the latest result, the penultimate row to the previous week's result,
and so on.
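If you have not used zoo before, the following toy example (hypothetical dates, not the Powerball data) shows why this holds: a zoo object is always sorted by its index, so the last row is the most recent observation no matter what order the rows arrive in.
library(zoo)
toy <- zoo(c(10, 30, 20), as.Date(c("2012-11-22", "2012-11-08", "2012-11-15")))
toy  # printed in date order: 2012-11-08, 2012-11-15, 2012-11-22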
# --- Layout
layout(matrix(1:6, 2, 3, byrow = TRUE))
# (2-1) Shapiro test (p > 0.05 is normal) of data, diff(data), and
# diff(log(data)). Requires a vector as input.
n1.vtr <- as.numeric(n1.zoo)
p1 <- shapiro.test(n1.vtr)$p.value
p2 <- shapiro.test(diff(n1.vtr))$p.value
p3 <- shapiro.test(diff(log(n1.vtr)))$p.value
# (2-2) Plot histogram of data, diff(data), and diff(log(data))
hist(n1.vtr, prob = TRUE,
     main = c("n1", paste("Shapiro p=", prettyNum(p1, digits = 2))),
     xlab = "n1", xlim = c(1, 45))
lines(density(n1.vtr))
hist(diff(n1.vtr), prob = TRUE,
     main = c("diff(n1)", paste("Shapiro p=", prettyNum(p2, digits = 2))),
     xlab = "diff(n1)")
lines(density(diff(n1.vtr)))
hist(diff(log(n1.vtr)), prob = TRUE,
     main = c("diff(log(n1))", paste("Shapiro p=", prettyNum(p3, digits = 2))),
     xlab = "diff(log(n1))")
lines(density(diff(log(n1.vtr))))
# (2-3) QQ plot of the above
qqnorm(n1.zoo)
qqnorm(diff(n1.zoo))
qqnorm(diff(log(n1.zoo)))
The above code tests the column for a normal distribution. The histograms (and QQ plots) suggest that diff(data) and
diff(log(data)) are approximately normal, but the raw data is not. However, the Shapiro-Wilk test indicates that none of
the three distributions is normal (every p-value is less than 0.05).
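As a sanity check on reading the Shapiro-Wilk p-value, consider this sketch on simulated data (assumed for illustration, not part of the analysis): a genuinely normal sample usually gives p > 0.05, while a clearly non-normal one does not.
set.seed(1)
shapiro.test(rnorm(300))$p.value   # normal sample: p typically well above 0.05
shapiro.test(runif(300))$p.value   # uniform sample: p far below 0.05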
# --- Layout
layout(matrix(1:3, 1, 3, byrow = TRUE))
# (3-1) Acf plot
acf(n1.zoo, main = "n1")
acf(diff(n1.zoo), main = "diff(n1)")
acf(diff(log(n1.zoo)), main = "diff(log(n1))")
The ACF plot has two horizontal blue lines marking the significance bounds. If the vertical line at any lag crosses the
blue lines, the autocorrelation at that lag is statistically significant. The ACF plots show that both diff(data) and
diff(log(data)) have a significant autocorrelation at lag 1.
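The blue lines drawn by acf() are the approximate 95% significance bounds at plus or minus 1.96/sqrt(N). You can reproduce them, and the lag-1 autocorrelation itself, by hand (a sketch; the variable names are mine):
N <- length(diff(n1.vtr))
c(-1.96, 1.96) / sqrt(N)                # the two blue lines
acf(diff(n1.vtr), plot = FALSE)$acf[2]  # autocorrelation at lag 1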
# (4-1) Augmented Dickey-Fuller (ADF) test. The null hypothesis of the ADF
# test is that the data is non-stationary; therefore, p > 0.05 means
# non-stationary.
adf.test(n1.zoo, alternative = "stationary")
## Warning: p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: n1.zoo
## Dickey-Fuller = -9.203, Lag order = 9, p-value = 0.01
## alternative hypothesis: stationary
adf.test(diff(n1.zoo), alternative = "stationary")
## Warning: p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: diff(n1.zoo)
## Dickey-Fuller = -16.43, Lag order = 9, p-value = 0.01
## alternative hypothesis: stationary
adf.test(diff(log(n1.zoo)), alternative = "stationary")
## Warning: p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: diff(log(n1.zoo))
## Dickey-Fuller = -16.89, Lag order = 9, p-value = 0.01
## alternative hypothesis: stationary
Using the ADF test, we show that the column is a stationary time series in each of its forms, i.e. data, diff(data),
and diff(log(data)).
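The forecast package also provides ndiffs(), which estimates the number of (non-seasonal) differences needed to reach stationarity and is a convenient cross-check on the ADF results (a sketch, using the same numeric vector):
ndiffs(n1.vtr)  # 0 means no differencing is needed for stationarity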
# (5-1) Seasonal differencing test. If ns > 0, then seasonal differencing
# is required.
nsdiffs(n1.zoo)
## Warning: I can't handle data with frequency less than 1. Seasonality will
## be ignored.
## [1] 0
nsdiffs(diff(n1.zoo))
## Warning: I can't handle data with frequency less than 1. Seasonality will
## be ignored.
## [1] 0
nsdiffs(diff(log(n1.zoo)))
## Warning: I can't handle data with frequency less than 1. Seasonality will
## be ignored.
## [1] 0
Using the nsdiffs() function, we show that the column does not require seasonal differencing in any of its forms.
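The warnings appear because a zoo object indexed by dates carries no frequency attribute, so nsdiffs() cannot test for seasonality at all. If you wanted to test for a yearly cycle, one option (a sketch, under the assumption of roughly 52 weekly draws per year) is to re-wrap the column as a ts with an explicit frequency:
n1.ts <- ts(n1.vtr, frequency = 52)  # assume ~52 weekly draws per year
nsdiffs(n1.ts)                       # 0 would confirm no seasonal differencing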
# (6-1) AIC test
auto.arima(n1.zoo)
## Warning: p-value greater than printed p-value
## Series: n1.zoo
## ARIMA(1,0,1) with non-zero mean
##
## Coefficients:
## ar1 ma1 intercept
## -0.806 0.858 22.896
## s.e. 0.093 0.080 0.468
##
## sigma^2 estimated as 179: log likelihood=-3458
## AIC=6924 AICc=6924 BIC=6943
auto.arima(diff(n1.zoo))
## Warning: p-value greater than printed p-value
## Series: diff(n1.zoo)
## ARIMA(0,0,2) with zero mean
##
## Coefficients:
## ma1 ma2
## -0.940 -0.055
## s.e. 0.035 0.035
##
## sigma^2 estimated as 180: log likelihood=-3460
## AIC=6926 AICc=6926 BIC=6940
auto.arima(diff(log(n1.zoo)))
## Warning: p-value greater than printed p-value
## Warning: Unable to fit final model using maximum likelihood. AIC value
## approximated
## Series: diff(log(n1.zoo))
## ARIMA(1,0,2) with zero mean
##
## Coefficients:
## ar1 ma1 ma2
## -0.781 -0.182 -0.79
## s.e. 0.121 0.119 0.12
##
## sigma^2 estimated as 0.795: log likelihood=-1123
## AIC=2254 AICc=2254 BIC=2273
Using the auto.arima() function from the forecast package, we find that the best-fitting model (the one with the lowest
AIC) is obtained with diff(log(data)): its AIC of 2254 is roughly a third of the 6924 for the untransformed data.
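To compare the candidates programmatically instead of reading the three printed summaries, you can collect the AIC values directly (a sketch; the fits are simply re-estimated here):
fits <- list(data    = auto.arima(n1.zoo),
             diff    = auto.arima(diff(n1.zoo)),
             difflog = auto.arima(diff(log(n1.zoo))))
sapply(fits, function(f) f$aic)  # difflog has by far the lowest AIC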
# (7-1) Obtain a one-period-ahead forecast and its confidence interval
f3 <- forecast(auto.arima(diff(log(n1.zoo))), h = 1)
## Warning: p-value greater than printed p-value
## Warning: Unable to fit final model using maximum likelihood. AIC value
## approximated
## Warning: Incompatible methods ("Ops.zoo", "Ops.ts") for "-"
f3
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 15673 0.8434 -0.299 1.986 -0.9037 2.59
# (7-2) Transform diff(log(data)) back into data. Select the 95% confidence
# interval. Powerball numbers must be between 1 and 45.
u <- exp(log(n1.zoo[NROW(n1.zoo), 1]) + max(f3$upper))
p <- exp(log(n1.zoo[NROW(n1.zoo), 1]) + f3$mean)
## Warning: Incompatible methods ("Ops.zoo", "Ops.ts") for "+"
l <- exp(log(n1.zoo[NROW(n1.zoo), 1]) + min(f3$lower))
u <- min(45, trunc(u))
p <- round(p)
l <- max(1, round(l))
data.frame(lower = l, predict = p, upper = u)
## lower predict upper
## 1 2 14 45
Based on the above AIC values, and considering all of the previous tests, we select diff(log(data)) as our time series
input and predict one period ahead. We also obtain a confidence interval and transform the predicted value (with its
interval) back to the original scale.
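The back-transformation in step (7-2) follows from the definition of diff(log(data)): if d[t] = log(x[t]) - log(x[t-1]), then x[t] = exp(log(x[t-1]) + d[t]) = x[t-1] * exp(d[t]). A small helper makes the inversion explicit (a sketch; the function name is mine):
# Hypothetical helper: invert a diff(log()) forecast back to the original scale
invert.difflog <- function(last.value, d) exp(log(last.value) + d)
invert.difflog(6, 0.8434)  # last draw was 6; point forecast of the diff(log) is 0.8434
# equivalently 6 * exp(0.8434), which rounds to 14 as shown above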
In conclusion, I have shown you how to analyze one column of the Powerball time series to determine whether it requires
transformation before fitting a forecasting model. The transformations diff(data) and diff(log(data)) are commonly used
for many kinds of time series; however, you are not limited to these two. In fact, you are encouraged to explore other
transformations that may lead to a better forecasting model than mine.
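For example, recent versions of the forecast package include the Box-Cox family, which generalizes the log transformation; here is a sketch of how you might try it on the same column (the lambda parameter is estimated from the data):
lambda <- BoxCox.lambda(n1.vtr)  # estimate the transformation parameter
n1.bc <- BoxCox(n1.vtr, lambda)  # apply the Box-Cox transformation
auto.arima(diff(n1.bc))          # refit on the differenced, transformed series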