For the following series, find an appropriate Box-Cox transformation in order to stabilize the variance.
usnetelec
For usnetelec
, the BoxCox.lambda() function
returns an optimal value of 0.517. By simply rounding down to 0.5, the transformation becomes much more interpretable, as a square root transformation. Upon comparing the original and transformed time series, we see modest improvements in terms of constant seasonal variance. The benefits and costs of making the transformation must be considered. Marginal gains in stabilizing variance come at the expense of more complicated explanations and the need for back calculations and the possibility of bias-adjustments.
head(usnetelec)
## Time Series:
## Start = 1949
## End = 1954
## Frequency = 1
## [1] 296.1 334.1 375.3 403.8 447.0 476.3
autoplot(usnetelec) +
ggtitle("Original net electric generation in US time series")
paste("optimal lambda: ", BoxCox.lambda(usnetelec) %>% round(3))
## [1] "optimal lambda: 0.517"
autoplot(BoxCox(usnetelec, 0.5)) +
ggtitle("Transformed net electric generation in US time series, lambda=0.5")
usgdp
A value of lambda of 0.33 does a good job at both stabilizing the variance and linearizing the time series, and will likely have benefits to the modeling effort. This should also be considered a simple and relatively easily understood as a cube root transformation. The benefits of the transformation are more clear in this time series than in the usnetelec
time series.
# plot the original time series
autoplot(usgdp) +
ggtitle("Orignal US GDP time series ")
# find optimal value of lambda
paste("optimal lambda: ", BoxCox.lambda(usgdp) %>% round(3))
## [1] "optimal lambda: 0.366"
autoplot(BoxCox(usgdp,0.33)) +
ggtitle("Transformed US GDP time series, lambda=0.33 ")
mcopper
Value of lambda is 0 which means the transformation is simply the natural log. The benefits of the transformation are evident in comparison to the original time series.
# plot original time series
autoplot(mcopper) +
ggtitle("Orignal copper price time series ")
# find optimal value of lambda
paste("optimal lambda: ", BoxCox.lambda(mcopper) %>% round(3))
## [1] "optimal lambda: 0.192"
autoplot(BoxCox(mcopper, 0)) +
ggtitle("Transformed copper price time series, lambda=0 ")
Why is a Box-Cox transformation unhelpful for the cangas
data?
Box-Cox transformations for the cangas
time series are ineffective due to the nature of the seasonal variance and rate of change over time. We can see seasonal component that is moderate from 1960 to about 1975. Then there is an increase in seasonal variation throughout the 80’s in to the early 90’s where we begin to see small seasonal fluctuations. The season plot demonstrates changing variation over the duration of the time series. As a result there is no value of lambda that will stabilize variance across the entire time series. For this reason there is no real benefit to the transformation.
# plot original time series
autoplot(cangas) +
ggtitle("Orignal Canadian gas production time series ")
ggseasonplot(cangas, polar = T)
# find optimal value of lambda
paste("optimal lambda: ", BoxCox.lambda(cangas) %>% round(3))
## [1] "optimal lambda: 0.577"
autoplot(BoxCox(cangas, 0.577)) +
ggtitle("Transformed Canadian gas production time series, lambda=0.577 ")
What Box-Cox transformation would you select for your retail data (from Exercise 3 in Section 2.10)? A Box-Cox lambda value of 0.0 (natural log transformation) is very helpful for this data, liquor retail sales. Seasonal variance is constant throughout the time series and linear trend is noticeable. This will simplify the modeling effort. A lambda of -0.1 also works well but is less interpretable.
# read in retail data from excel spreadsheet
retaildata <- readxl::read_excel("retail.xlsx", skip=1)
# create time series object
myts <- ts(retaildata[,"A3349627V"],
frequency=12, start=c(1982,4))
# plot original time series
autoplot(myts) +
ggtitle("Orignal liquor sales time series ")
# find optimal value of lambda
paste("otimal lambda: ", BoxCox.lambda(myts) %>% round(3))
## [1] "otimal lambda: -0.058"
autoplot(BoxCox(myts, 0.0)) +
ggtitle("Transformed liquor sales time series, lambda=0 ")
For your retail time series (from Exercise 3 in Section 2.10):
Split the data into two parts using; The time series is split so that the model will be trained using all data prior to 2011. The remaining will be test the model.
# split data
myts.train <- window(myts, end=c(2010,12))
myts.test <- window(myts, start=2011)
Check that your data have been split appropriately by producing the following plot.
Data has been split appropriately.
autoplot(myts) +
autolayer(myts.train, series="Training") +
autolayer(myts.test, series="Test")
Calculate forecasts using snaive applied to myts.train.
A forecast is made using a seasonal random walk. This forecast method uses the last observation from the previous corresponding season. A prediction interval is shaded in blue.
fc <- snaive(myts.train)
autoplot(fc)
Compare the accuracy of your forecasts against the actual values stored in myts.test.
Training set data clearly has better accuracy since this is the data that the model has been fit to. The test set gives a much better indication of the capabilities/limitations of the model. The table below produced from the accuracy()
function provides metrics of model performance and can be used to evaluate different models in the process of model selection. In this example, evaluation is between the test and training sets, therefore scale dependent errors can be evaluated; mean error (ME), and root mean squared error (RMSE). The ME shows the model bias, an unbiased model would have an ME of 0 in the training set. Comparing RMSE, we see the test set is higher by a factor of almost 2.5. Percentage errors (MPE, MAPE) have some general disadvantages so for evaluating datasets on the same scale should probably be avoided.
accuracy(fc,myts.test)
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 6.870871 12.27525 8.893093 5.476112 7.780981 1.00000 0.6617306
## Test set 28.400000 29.39091 28.400000 11.015822 11.015822 3.19349 0.5697915
## Theil's U
## Training set NA
## Test set 0.7493485
Check the residuals. Do the residuals appear to be uncorrelated and normally distributed?
There is some serial correlation in the residuals evident in the ACF plot and confirmed by the results of the Ljung-Box test which would suggest that information is left behind and the model forecast could be improved. Residuals do not exhibit much deviation from a normal distribution, but there are some instances of atypically large values around 1989/1990 and 2005/2006 that are biasing the model; the model mean is 6.87. Even if these periods were to be disregarded, there would still be some skewness in the histogram, suggesting that calculating prediction intervals assuming normality may be misleading. For a majority of the data, this model works fairly well, but the atypical periods need to be handled more effectively.
checkresiduals(fc)
##
## Ljung-Box test
##
## data: Residuals from Seasonal naive method
## Q* = 591.71, df = 24, p-value < 2.2e-16
##
## Model df: 0. Total lags used: 24