To Transform or Not To Transform

Many of the forecasting packages in R require a time series that is covariance stationary. For those who are not familiar
with this term, there is an excellent online textbook by Hyndman and Athanasopoulos, Forecasting: Principles and Practice,
which has a chapter devoted to stationarity and differencing.

The first step in building a predictive or analytic procedure, therefore, is to examine the time series and determine
whether it is already covariance stationary. More often than not, a transformation is required to convert a non-stationary
time series into a stationary one.
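
To see what a transformation does, here is a minimal sketch using simulated data (NOT the Powerball series): a random walk
is non-stationary, but its first difference is white noise and therefore stationary.

# A minimal sketch with simulated data: a random walk is non-stationary,
# but its first difference (white noise) is stationary
set.seed(42)
rw <- cumsum(rnorm(500))  # random walk: y[t] = y[t-1] + e[t]
layout(matrix(1:2, 1, 2))
plot.ts(rw, main = "Random walk (non-stationary)")
plot.ts(diff(rw), main = "diff(random walk) (stationary)")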

Examples of time series include stock prices, raw material prices, and in general ANY data ordered by a regular date or
time interval. To keep this article easy to follow, I will use a time series of Powerball results; Powerball is a lottery
game played once a week in Australia.

Fortunately, the historical data is easy to obtain from the Powerball website as a CSV file. For this example, I will use
only ONE (1) column to keep things easy to follow. However, you can extend this example to the rest of the columns.

(1) Load the data into a data frame

library(forecast)
## This is forecast 4.00
library(tseries)
library(zoo)

# (1-1) Load the data into a data.frame
hist.df <- read.csv("../R-nonsource/powerball.csv", colClasses = "character")

# (1-2) Coerce the column into a numeric vector and create a zoo object
n1.zoo <- zoo(matrix(as.numeric(hist.df$Number_1), ncol = 1), as.Date(hist.df$Draw_date, 
    "%Y/%m/%d"))
names(n1.zoo) <- c("n1")

# (1-3) Note: For the zoo object, the latest result is at the LAST row
tail(n1.zoo)
##            n1
## 2012-10-18 31
## 2012-10-25 42
## 2012-11-01 25
## 2012-11-08  2
## 2012-11-15 37
## 2012-11-22  6

First, we load the data into a data.frame, then we coerce ONE (1) column into a numeric vector and create a zoo object.
In the zoo object, the LAST row corresponds to the latest result, the penultimate row corresponds to the previous week's
result, and so on.
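
Note that zoo() orders observations by the index you supply, so the series ends at the latest draw date even if the CSV
listed the draws newest-first. A small sketch with three of the dates from above:

# A small sketch: zoo() sorts observations by their index, so the last
# row is always the most recent date, whatever the input order
library(zoo)
d <- as.Date(c("2012-11-22", "2012-11-08", "2012-11-15"))  # deliberately unordered
z <- zoo(c(6, 2, 37), d)
z  # prints in date order: 2012-11-08, 2012-11-15, 2012-11-22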

(2) Test the column for normality

# --- Layout
layout(matrix(1:6, 2, 3, byrow = TRUE))

# (2-1) Shapiro test (p > 0.05 suggests normality) of data, diff(data),
# and diff(log(data)). Requires a vector as input
n1.vtr <- as.numeric(n1.zoo)
p1 <- shapiro.test(n1.vtr)$p.value
p2 <- shapiro.test(diff(n1.vtr))$p.value
p3 <- shapiro.test(diff(log(n1.vtr)))$p.value

# (2-2) Plot histogram of data, diff(data), and diff(log(data))
hist(n1.vtr, prob = T, main = c("n1", paste("Shapiro p=", prettyNum(p1, digits = 2))), 
    xlab = "n1", xlim = c(1, 45))
lines(density(n1.vtr))
hist(diff(n1.vtr), prob = T, main = c("diff(n1)", paste("Shapiro p=", prettyNum(p2, 
    digits = 2))), xlab = "diff(n1)")
lines(density(diff(n1.vtr)))
hist(diff(log(n1.vtr)), prob = T, main = c("diff(log(n1))", paste("Shapiro p=", 
    prettyNum(p3, digits = 2))), xlab = "diff(log(n1))")
lines(density(diff(log(n1.vtr))))

# (2-3) QQ plot of the above
qqnorm(n1.zoo)
qqnorm(diff(n1.zoo))
qqnorm(diff(log(n1.zoo)))

(Figure: histograms with density overlays, and normal QQ plots, for n1, diff(n1), and diff(log(n1)).)

The above code tests the column for a normal distribution. It can be seen from the histograms (and the QQ plots) that
diff(data) and diff(log(data)) are approximately normal, but the raw data is NOT. However, the Shapiro test indicates that
NONE of the distributions is normal (every p-value is LESS THAN 0.05).
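
As an optional cross-check (NOT part of the original workflow), the tseries package also provides the Jarque-Bera test,
whose null hypothesis is likewise normality:

# Optional cross-check: Jarque-Bera test (null hypothesis: normality)
# applied to the same three forms
jarque.bera.test(n1.vtr)
jarque.bera.test(diff(n1.vtr))
jarque.bera.test(diff(log(n1.vtr)))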

(3) Test the column for autocorrelation

# --- Layout
layout(matrix(1:3, 1, 3, byrow = TRUE))

# (3-1) Acf plot
acf(n1.zoo, main = "n1")
acf(diff(n1.zoo), main = "diff(n1)")
acf(diff(log(n1.zoo)), main = "diff(log(n1))")

(Figure: ACF plots for n1, diff(n1), and diff(log(n1)).)

The ACF plot has two horizontal blue lines marking the significance threshold. If the vertical line at any lag crosses the
blue lines, then there is significant autocorrelation at that lag. It can be seen in the ACF plots that both diff(data)
and diff(log(data)) have significant autocorrelation at lag 1.
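
If you prefer a numerical check to reading the plot, the Ljung-Box test in base R tests the null hypothesis of NO
autocorrelation up to a chosen lag; a small sketch (the choice of lag = 10 is arbitrary):

# Numerical complement to the ACF plot: Ljung-Box test (null hypothesis:
# no autocorrelation up to the given lag), so p < 0.05 indicates autocorrelation
Box.test(n1.vtr, lag = 10, type = "Ljung-Box")
Box.test(diff(n1.vtr), lag = 10, type = "Ljung-Box")
Box.test(diff(log(n1.vtr)), lag = 10, type = "Ljung-Box")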

(4) Test the column for stationarity

# (4-1) Augmented Dickey-Fuller (ADF) test. The null hypothesis for an
# ADF test is that the data is non-stationary. Therefore, p > 0.05
# suggests non-stationarity
adf.test(n1.zoo, alternative = "stationary")
## Warning: p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  n1.zoo 
## Dickey-Fuller = -9.203, Lag order = 9, p-value = 0.01
## alternative hypothesis: stationary
adf.test(diff(n1.zoo), alternative = "stationary")
## Warning: p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff(n1.zoo) 
## Dickey-Fuller = -16.43, Lag order = 9, p-value = 0.01
## alternative hypothesis: stationary
adf.test(diff(log(n1.zoo)), alternative = "stationary")
## Warning: p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff(log(n1.zoo)) 
## Dickey-Fuller = -16.89, Lag order = 9, p-value = 0.01
## alternative hypothesis: stationary

Using the ADF test, we show that the column is a stationary time series in EACH of its forms, i.e. data, diff(data), and
diff(log(data)).
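
As a complementary check (an optional extra), the tseries package also offers the KPSS test, whose null hypothesis is the
REVERSE of the ADF test's, i.e. that the series IS stationary; agreement between the two tests is reassuring.

# Complementary check: KPSS test (null hypothesis: stationary),
# so here p > 0.05 supports stationarity
kpss.test(as.numeric(n1.zoo))
kpss.test(diff(as.numeric(n1.zoo)))
kpss.test(diff(log(as.numeric(n1.zoo))))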

(5) Test the column for seasonality

# (5-1) Seasonal differencing test. If the result (ns) is greater than
# 0, then seasonal differencing is required
nsdiffs(n1.zoo)
## Warning: I can't handle data with frequency less than 1. Seasonality will
## be ignored.
## [1] 0
nsdiffs(diff(n1.zoo))
## Warning: I can't handle data with frequency less than 1. Seasonality will
## be ignored.
## [1] 0
nsdiffs(diff(log(n1.zoo)))
## Warning: I can't handle data with frequency less than 1. Seasonality will
## be ignored.
## [1] 0

Using the nsdiffs() function, we show that the column does NOT require seasonal differencing in ANY of its forms. Note the
warning, though: the zoo series carries no frequency information, so nsdiffs() ignores seasonality; to test for
seasonality properly, the series would need to be converted to a ts object with an appropriate frequency.
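
The forecast package has an analogous function for ORDINARY differencing, ndiffs(), which estimates how many first
differences are required to reach stationarity; a quick sketch:

# Analogous check for ordinary differencing: ndiffs() estimates how many
# first differences are needed (0 means none are required)
ndiffs(as.numeric(n1.zoo))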

(6) Auto-fit the column using ARIMA

# (6-1) Automatic ARIMA model selection (compare AIC across fits)
auto.arima(n1.zoo)
## Warning: p-value greater than printed p-value
## Series: n1.zoo 
## ARIMA(1,0,1) with non-zero mean 
## 
## Coefficients:
##          ar1    ma1  intercept
##       -0.806  0.858     22.896
## s.e.   0.093  0.080      0.468
## 
## sigma^2 estimated as 179:  log likelihood=-3458
## AIC=6924   AICc=6924   BIC=6943
auto.arima(diff(n1.zoo))
## Warning: p-value greater than printed p-value
## Series: diff(n1.zoo) 
## ARIMA(0,0,2) with zero mean     
## 
## Coefficients:
##          ma1     ma2
##       -0.940  -0.055
## s.e.   0.035   0.035
## 
## sigma^2 estimated as 180:  log likelihood=-3460
## AIC=6926   AICc=6926   BIC=6940
auto.arima(diff(log(n1.zoo)))
## Warning: p-value greater than printed p-value
## Warning: Unable to fit final model using maximum likelihood. AIC value
## approximated
## Series: diff(log(n1.zoo)) 
## ARIMA(1,0,2) with zero mean     
## 
## Coefficients:
##          ar1     ma1    ma2
##       -0.781  -0.182  -0.79
## s.e.   0.121   0.119   0.12
## 
## sigma^2 estimated as 0.795:  log likelihood=-1123
## AIC=2254   AICc=2254   BIC=2273

Using the auto.arima() function from the forecast package, we find that the column has the best fitted model (with the
lowest AIC) when using diff(log(data)); its AIC is roughly ONE (1) third of the AIC of the model fitted to the
untransformed data. (Strictly speaking, AIC values are only comparable between models fitted to the same data, and the log
transformation changes the scale of the likelihood, so treat this comparison as indicative rather than exact.)
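
To compare the three fits side by side (a small convenience sketch, not part of the original output), you can collect the
AIC of each fitted model into one table:

# Convenience sketch: collect the AIC of each candidate fit
fits <- list(raw      = auto.arima(n1.zoo),
             diff     = auto.arima(diff(n1.zoo)),
             diff.log = auto.arima(diff(log(n1.zoo))))
sapply(fits, function(f) f$aic)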

(7) Predict ONE(1)-period lookahead

# (7-1) Obtain ONE(1)-period lookahead forecast and confidence interval
f3 <- forecast(auto.arima(diff(log(n1.zoo))), h = 1)
## Warning: p-value greater than printed p-value
## Warning: Unable to fit final model using maximum likelihood. AIC value
## approximated
## Warning: Incompatible methods ("Ops.zoo", "Ops.ts") for "-"
f3
##       Point Forecast  Lo 80 Hi 80   Lo 95 Hi 95
## 15673         0.8434 -0.299 1.986 -0.9037  2.59

# (7-2) Transform diff(log(data)) back into data. Select the 95%
# confidence interval. Powerball numbers MUST be between ONE (1) and
# FORTY-FIVE (45)
u <- exp(log(n1.zoo[NROW(n1.zoo), 1]) + max(f3$upper))
p <- exp(log(n1.zoo[NROW(n1.zoo), 1]) + f3$mean)
## Warning: Incompatible methods ("Ops.zoo", "Ops.ts") for "+"
l <- exp(log(n1.zoo[NROW(n1.zoo), 1]) + min(f3$lower))
u <- min(45, trunc(u))
p <- round(p)
l <- max(1, round(l))
data.frame(lower = l, predict = p, upper = u)
##   lower predict upper
## 1     2      14    45

Based on the above AIC, and considering ALL of the previous tests, we select diff(log(data)) as our time series input, and
then predict a ONE(1)-period lookahead for this time series. We also obtain a confidence interval, and then transform the
predicted value (with its interval) back to the original scale.
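
To make the back-transformation reusable, here is a hypothetical helper (NOT from the original post) that wraps the
exp(log(last) + forecast) logic and clamps the result to the valid Powerball range:

# Hypothetical helper: back-transform a diff(log(x)) forecast to the
# original scale, clamped to the range lo..hi (1..45 for Powerball)
back.transform <- function(last.value, fc, lo = 1, hi = 45) {
    pred  <- exp(log(last.value) + as.numeric(fc$mean))
    upper <- exp(log(last.value) + max(fc$upper))
    lower <- exp(log(last.value) + min(fc$lower))
    data.frame(lower = max(lo, round(lower)), predict = round(pred),
        upper = min(hi, trunc(upper)))
}
back.transform(as.numeric(n1.zoo[NROW(n1.zoo), 1]), f3)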

In conclusion, I have shown you how to analyze ONE (1) column of the Powerball time series to determine whether it
requires transformation prior to fitting a forecasting model. The transformations diff(data) and diff(log(data)) are
commonly used for time series in general; however, you are NOT limited to these two types. In fact, you are encouraged to
explore other transformations that may lead you to a better forecasting model than mine.
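
For example, here is a sketch of ONE (1) such alternative: a Box-Cox transformation, with the parameter lambda estimated
from the data by the forecast package.

# Sketch of an alternative transformation: Box-Cox, with lambda
# estimated from the data
lambda <- BoxCox.lambda(as.numeric(n1.zoo))
n1.bc <- BoxCox(as.numeric(n1.zoo), lambda)
auto.arima(n1.bc)  # fit on the Box-Cox transformed series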