The TSstudio package provides a set of tools for time series analysis and forecasting applications, using (mainly) models from the forecast package and visualization tools based on the plotly package. The following tutorial demonstrates the use of the package to forecast the monthly consumption of natural gas in the US over the next 5 years (or 60 months).
Install from CRAN:
install.packages("TSstudio")
Or from GitHub:
devtools::install_github("RamiKrispin/TSstudio")
As of December 2018, the most up-to-date version is 0.1.3.
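You can confirm the installed version with the packageVersion function from base R:
packageVersion("TSstudio")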
library(TSstudio)
The USgas dataset, one of the package datasets, represents the monthly consumption (in Billion Cubic Feet) of natural gas in the US since January 2000 [1]:
# Load the series
data("USgas")
# Get the series info
ts_info(USgas)
## The USgas series is a ts object with 1 variable and 227 observations
## Frequency: 12
## Start time: 2000 1
## End time: 2018 11
# Plot the series
ts_plot(USgas,
title = "US Monthly Natural Gas Consumption",
Ytitle = "Billion Cubic Feet",
Xtitle = "Source: Federal Reserve Bank of St. Louis",
slider = TRUE)
The goal of the exploratory analysis is to identify the key characteristics of the series with the use of descriptive analysis methods. Throughout this process, we mainly want to detect the series trend, its seasonal patterns, and the level of correlation between the series and its lags.
Those insights provide us with a better understanding of the past and can be utilized to forecast the future.
We will start by decomposing the series into its three components - the trend, seasonal, and random components. The ts_decompose function provides an interactive interface for the decompose function from the stats package:
ts_decompose(USgas)
We can observe that the trend of the series is fairly flat up to 2010 and starts to increase afterward. It also seems from the trend plot that the trend is not linear.
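To inspect the trend on its own, here is a quick sketch that reuses the decompose output (note that the classical decomposition leaves missing values at both ends of the trend component):
# Extract and plot the trend component of the series
ts_plot(decompose(USgas)$trend,
title = "USgas - Trend Component",
Ytitle = "Billion Cubic Feet")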
You can note from both the series and decompose plots that the series has a strong seasonal pattern along with a non-linear trend. We will use the ts_seasonal, ts_heatmap, and ts_surface functions to explore the seasonal pattern of the series:
ts_seasonal(USgas, type = "all")
The ts_seasonal function provides three different views of the seasonality of the series (when the type argument is equal to all):
Split and plot the series by the full frequency cycle of the series, which in the case of a monthly series is a full year. This view allows you to observe and compare the variation of each frequency unit from year to year. The plot's color scale is set by the chronological order of the lines (i.e., dark colors for the early years and bright colors for the latest years).
Plot each frequency unit over time; in the case of a monthly series, each line represents a month of consumption over time. This allows us to observe whether the seasonal pattern remains the same over time.
Last but not least, a box plot representation of each frequency unit, which allows us to compare the distribution of each frequency unit.
The main observation from this set of plots is that the series has a strong and repetitive monthly seasonal pattern, with consumption peaking during the winter months; at the same time, the growth of the series makes it harder to compare the seasonal pattern across years.
To get a clearer view of the seasonal pattern of the series, you may want to remove the series growth (or detrend it) and replot:
ts_seasonal(USgas - decompose(USgas)$trend,
type = "all",
title = "Seasonal Plot - USgas (Detrend)")
Alternatively, you can use the ts_heatmap or ts_surface functions to view and explore the seasonal pattern of the series:
ts_heatmap(USgas)
ts_surface(USgas)
The next step is to identify the level of correlation between the series and its lags, using the ts_acf function (an interactive interface for the acf function):
ts_acf(USgas, lag.max = 36)
As expected, you can notice that the series has a high correlation with its seasonal lags. A more intuitive way to identify the relationship between the series and its lags is with the ts_lags function, which plots the series against its lags:
ts_lags(USgas)
As observed before with the ts_acf function, you can notice that the seasonal lag (or lag 12) of the series has a strong linear relationship with the series. Similarly, we can zoom in on the seasonal lags of the series using the lags argument:
ts_lags(USgas, lags = c(12, 24, 36, 48))
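For a quick numeric confirmation of what the lag plots show, you can compute the lag-12 correlation directly in base R:
# Correlation between the series and its seasonal lag (lag 12)
n <- length(USgas)
cor(USgas[13:n], USgas[1:(n - 12)])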
Here are the key insights we learned from the exploratory analysis process: the series has a strong monthly seasonal pattern; the trend is non-linear, fairly flat up to 2010 and increasing afterward (a structural break); and the series is highly correlated with its seasonal lags (in particular lag 12).
Using the information we learned from the exploratory analysis, we will conduct a “horse race” between several models and select the one that performs best on the testing sets, using two training approaches:
Traditional approach - splitting the series into training and testing (sample-out) partitions, training each model on the training set and evaluating its performance on the testing set. We will use three models - auto.arima, ets, and tslm - from the forecast package.
Backtesting approach - using an expanding window to train and test each model on multiple training and testing sets. We will utilize the ts_backtesting function to train multiple models from the forecast, forecastHybrid, and bsts packages.
To handle the structural break of the series, we will use a binary flag with a value of 0 for any observation before 2010 and 1 afterward:
USgas_df <- ts_to_prophet(USgas) # converting the series to df format
head(USgas_df)
## ds y
## 1 2000-01-01 2510.5
## 2 2000-02-01 2330.7
## 3 2000-03-01 2050.6
## 4 2000-04-01 1783.3
## 5 2000-05-01 1632.9
## 6 2000-06-01 1513.1
library(lubridate)
USgas_df$flag <- ifelse(year(USgas_df$ds) >= 2010, 1, 0)
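A quick sanity check confirms that the flag splits the observations as intended:
# Cross-tabulate the flag against the year - all pre-2010 rows should be flagged 0
table(flag = USgas_df$flag, from_2010 = year(USgas_df$ds) >= 2010)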
We will use the following three forecasting approaches: the auto.arima, ets, and tslm functions. When running a “horse race” between forecasting models, it is recommended to diversify your modeling approach, as the performance of the models may change according to the data structure and the tuning parameters.
We will start by splitting the series into training and testing partitions:
# Set the sample out and forecast horizon
h1 <- 12 # the length of the testing partition
h2 <- 60 # forecast horizon
# Splitting the time series object to training and testing partitions
USgas_split <- ts_split(USgas, sample.out = h1)
train <- USgas_split$train
test <- USgas_split$test
ts_info(train)
## The train series is a ts object with 1 variable and 215 observations
## Frequency: 12
## Start time: 2000 1
## End time: 2017 11
ts_info(test)
## The test series is a ts object with 1 variable and 12 observations
## Frequency: 12
## Start time: 2017 12
## End time: 2018 11
# Splitting the data.frame object to training and testing partitions
train_df <- USgas_df[1:(nrow(USgas_df) - h1), ]
test_df <- USgas_df[(nrow(USgas_df) - h1 + 1):nrow(USgas_df), ]
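It is worth verifying that the data.frame partitions line up with the ts partitions above:
# The training partition should end in Nov 2017 and the testing partition in Nov 2018
range(train_df$ds)
range(test_df$ds)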
set.seed(1234)
library(forecast)
library(plotly)
# auto.arima
md1 <- auto.arima(train,
stepwise = FALSE,
approximation = FALSE,
D = 1)
fc1 <- forecast(md1, h = h1)
accuracy(fc1, test)
## ME RMSE MAE MPE MAPE MASE
## Training set -1.586018 98.37274 72.07709 -0.2748132 3.497391 0.6718145
## Test set 198.515290 216.81325 200.60105 7.9783005 8.055568 1.8697577
## ACF1 Theil's U
## Training set 0.02180888 NA
## Test set -0.02214037 0.815383
test_forecast(forecast.obj = fc1, actual = USgas, test = test) %>%
layout(legend = list(x = 0.1, y = 0.95))
It seems like the ARIMA model captures the change in the trend well but fails to capture the seasonal peaks. In addition, the error rate on the testing set is more than twice the one on the training set, which may be an indication of overfitting.
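Before moving on, you may also want to review this model's residuals with the check_res function (the same diagnostic we will use later for the regression model):
# Residuals diagnostic - check for leftover autocorrelation in the ARIMA residuals
check_res(md1)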
# ETS
md2 <- ets(train, opt.crit = "mse")
fc2 <- forecast(md2, h = h1)
accuracy(fc2, test)
## ME RMSE MAE MPE MAPE MASE
## Training set 4.530822 99.61901 73.11756 0.07081195 3.561811 0.6815125
## Test set 114.734134 173.88352 155.67014 4.94191827 6.458404 1.4509667
## ACF1 Theil's U
## Training set 0.06787102 NA
## Test set 0.41614139 0.6901814
test_forecast(forecast.obj = fc2, actual = USgas, test = test) %>%
layout(legend = list(x = 0.1, y = 0.95))
The ETS model does a better job of capturing both the trend and the peaks compared to the ARIMA model above, with an overall error of 6.5% on the testing set and 3.6% on the training set.
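For reference, you can check which ETS specification the function selected by printing the method component of the fitted object:
# The selected ETS model specification (error, trend, and seasonal types)
md2$method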
# Time series linear regression
md3 <- tslm(train ~ season + trend)
fc3 <- forecast(md3, h = h1)
accuracy(fc3, test)
##                         ME     RMSE       MAE        MPE
## Training set -4.239567e-15 117.1926  90.05241 -0.3040729
## Test set      2.247071e+02 247.8366 229.88827  9.0005014
## MAPE MASE ACF1 Theil's U
## Training set 4.501471 0.8393584 0.57441653 NA
## Test set 9.192438 2.1427373 0.01462551 0.9237391
test_forecast(forecast.obj = fc3, actual = USgas, test = test) %>%
layout(legend = list(x = 0.1, y = 0.95))
Not surprisingly, the output of the linear model looks very linear and therefore fails to capture the peaks and the trend structure. Far behind the ARIMA and ETS models, the TSLM scores an error rate of 9.2% on the testing set.
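Since tslm returns a standard linear model object, you can inspect the fitted seasonal and trend coefficients with the usual summary method:
# Review the estimated coefficients of the seasonal dummies and the trend
summary(md3)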
We can try to account for the structural break by adding the flag variable we prepared before:
# Time series linear regression
md3a <- tslm(train ~ season + trend + flag, data = train_df)
fc3a <- forecast(md3a, h = h1, newdata = test_df)
accuracy(fc3a, test)
##                         ME     RMSE       MAE        MPE
## Training set  2.747886e-14 111.2361  85.85517 -0.2677946
## Test set      2.575731e+02 277.6615 257.57310 10.3525724
## MAPE MASE ACF1 Theil's U
## Training set 4.233153 0.8002369 0.524934967 NA
## Test set 10.352572 2.4007814 0.006969844 1.043482
test_forecast(forecast.obj = fc3a, actual = USgas, test = test) %>%
layout(legend = list(x = 0.1, y = 0.95))
That's even worse! Let's try polynomial regression by adding the square of the trend:
# Time series linear regression
md3b <- tslm(train ~ season + trend + I(trend ^ 2))
fc3b <- forecast(md3b, h = h1)
accuracy(fc3b, test)
##                         ME     RMSE       MAE        MPE
## Training set -1.585288e-14 105.1812  76.25964 -0.2349141
## Test set      8.896024e+01 134.7567 118.67616  3.4109552
## MAPE MASE ACF1 Theil's U
## Training set 3.706897 0.7107991 0.47320514 NA
## Test set 4.559134 1.1061541 -0.03942927 0.4659541
test_forecast(forecast.obj = fc3b, actual = USgas, test = test) %>%
layout(legend = list(x = 0.1, y = 0.95))
That’s look much better! Although we still of on the peaks, this model is doing a better job on capturing the overall trend of the series. With an error rate of 4.5% on the testing set and 3.7% on the training set. If the model selection criteria are MAPE or RMSE, we should select this model. Before we retrain the model, let’s review this model residuals:
check_res(md3b)
We can see that the ACF plot indicates that the residuals are correlated, which is a sign that the model missed or did not fully capture some of the patterns of the series. In this case, you can try to add additional features to the model, such as a higher-degree polynomial term or other regressors (weather indicators, energy prices, etc.). For simplicity, based on the error performance, we will move forward with this model and forecast the next 5 years:
# Time series linear regression
md_final <- tslm(USgas ~ season + trend + I(trend ^ 2))
fc_final <- forecast(md_final, h = h2)
plot_forecast(fc_final) %>%
layout(legend = list(x = 0.1, y = 0.95))
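Before moving to the backtesting approach, here is a small convenience snippet (assuming the forecast objects fc1, fc2, fc3, fc3a, and fc3b from the chunks above are still in the environment) that collects the testing partition MAPE scores side by side:
# Collect the test-set MAPE of each model trained above
data.frame(model = c("auto.arima", "ets", "tslm", "tslm + flag", "tslm + trend^2"),
MAPE = c(accuracy(fc1, test)["Test set", "MAPE"],
accuracy(fc2, test)["Test set", "MAPE"],
accuracy(fc3, test)["Test set", "MAPE"],
accuracy(fc3a, test)["Test set", "MAPE"],
accuracy(fc3b, test)["Test set", "MAPE"]))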
We will use the ts_backtesting function to train and test multiple models (auto.arima, ets, HoltWinters, nnetar, and tbats from the forecast package, hybridModel from the forecastHybrid package, and bsts from the bsts package) over six periods of time (periods = 6). As in the traditional approach above, we will set the testing partition to 12 months (window_size = h1) and the forecast horizon to 60 months (h = h2):
md5 <- ts_backtesting(ts.obj = USgas,
periods = 6,
error = "MAPE",
window_size = h1,
h = h2,
a.arg = list(stepwise = FALSE,
approximation = FALSE,
D = 1),
e.arg = list(opt.crit = "mse"),
n.arg = list(P = 2,
p = 1,
repeats = 100),
h.arg = list(errorMethod = "RMSE",
verbose = FALSE))
## Model_Name avgMAPE sdMAPE avgRMSE sdRMSE
## 1 bsts 6.370000 0.6272798 183.9350 19.428653
## 2 ets 6.721667 0.5984786 176.9167 12.839003
## 3 hybrid 6.873333 0.6701542 189.5117 6.532257
## 4 auto.arima 7.008333 0.8395098 199.2150 13.176969
## 5 nnetar 7.185000 0.6678548 219.5000 10.152292
## 6 HoltWinters 7.308333 0.3264608 204.8283 8.812086
## 7 tbats 7.818333 0.8971380 213.9417 19.818474
md5$summary_plot
The main advantage of the backtesting approach over the traditional approach is that it provides an overview of the performance of each model over time. This allows you to identify, in addition to accuracy, the stability of each model's performance over time. Looking at the summary plot above, you can notice that:
The nnetar model is the most stable model, as the range of its RMSE is fairly small.
The auto.arima model achieved the lowest MAPE on the testing partitions (since we defined the model selection criterion as MAPE, the function selected this forecasting approach).
It may be worth keeping an eye on the bsts and ets models, as their error is consistently dropping over time.
[1] U.S. Bureau of Transportation Statistics, Natural Gas Consumption [NATURALGAS], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/NATURALGAS, January 7, 2018.