A simple pipeline for time series fitting and prediction.

Introduction

Good day to everyone,

This markdown serves as an exercise to demonstrate the capabilities of the forecast library to both fit and predict time series data. Moreover, given the historical data provided in the package, we can also simulate how our pipeline would behave in production and, given those insights, how to improve the overall performance of the modelling.

Why (and when) to use time series forecasting over (usually statistical) Machine Learning

One of the main issues I’ve experienced on data science teams is the tendency of certain data scientists to stick to one tool and use it as much as they can. Even if certain advanced techniques like Long Short-Term Memory neural networks may get a better result metric-wise, there are other considerations we need to weigh before choosing the technical stack of the solution:

  • Does the client need to give an explanation of the prediction? If so, we may prefer simpler, more transparent models.
  • What are the time constraints on the project? This question should be answered on two time scales: the time to complete the project and the time to produce a prediction once the information becomes available.
  • What is the client’s technical stack? This is by far the least considered question. A tailored, clockwork-precision, GPU-trained model is useless if the client has neither the knowledge nor the architecture to support it.

Traditional time series models like the ones presented here usually respond pretty well to those questions. ARIMA models can be interpreted by extracting the coefficients of the model; they are fast and light to both train and predict and, as they aren’t very CPU intensive, implementable in pretty much any environment. Another advantage they have is that, as they are univariate, the data pipeline is much simpler to maintain.
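As a quick illustration of that interpretability, here is a minimal sketch using the built-in AirPassengers dataset (not the data modelled below):

# Fit an ARIMA model automatically and inspect its coefficients
fit = forecast::auto.arima(AirPassengers)
coef(fit)     # AR/MA/seasonal coefficients, directly interpretable
summary(fit)  # full specification, information criteria and accuracy measures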

An example of what to expect

Below is a pipeline illustrating what to expect when time series models do well enough, along with some visualizations of how we expect the model to behave in the long run.

The packages used (and their respective versions)

  • forecast (8.19): The most important one; it provides fitting and prediction capabilities for state-of-the-art time series models like TBATS, BATS, etc.
  • dplyr (1.0.10): The go-to package to manipulate dataframes. Part of the tidyverse.
  • ggplot2 (3.4.0): The go-to package to produce some nice plots. Part of the tidyverse.
  • tidyr: To reshape dataframes and unnest list-columns. Part of the tidyverse.
  • purrr (1.0.0): For functional programming. It is essential for working with an LCW (list-column workflow), i.e. nested dataframes with objects as elements.
  • tictoc (1.1): To crudely measure the time taken by every part of the pipeline. There are better packages for timing, like microbenchmark, but this is good enough to give us an idea.

This markdown was run on a GCP cloud compute VM with only 4 GB of RAM, on R version 3.6.3.
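If you want to verify the versions installed on your own machine before running the pipeline, a quick base R check:

# List the installed versions of the packages used in this markdown
pkgs = c("forecast", "dplyr", "ggplot2", "tidyr", "purrr", "tictoc")
sapply(pkgs, function(p) as.character(packageVersion(p)))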

library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(purrr)
library(tidyr)
library(tictoc)

The time series in question

The time series used is the taylor dataset from the forecast package, describing the half-hourly electricity demand in England and Wales from Monday 5 June 2000 to Sunday 27 August 2000.

time_series = forecast::taylor
time_series = forecast::tsclean(time_series) %>% as.numeric()
forecast::tsdisplay(time_series, lag.max = 48*7)

Note that the time series was stripped of its temporal context (via as.numeric()). This is to show that no matter how raw the data comes, we can still achieve our modelling and prediction goals.
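For reference, if you would rather keep that seasonal context attached to the data, the forecast package offers multi-seasonal time series objects; a minimal sketch:

# Alternative: keep both seasonal periods (daily and weekly) on the series itself
ts_msts = forecast::msts(forecast::taylor, seasonal.periods = c(48, 336))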

The autocorrelation function (ACF) and partial autocorrelation function (PACF) plots confirm the weekly periodic nature of the data and the high correlation among contiguous observations. If the plots showed something closer to white noise (no significant autocorrelation or partial autocorrelation), this technique would not work for predicting future values. For more information, I strongly recommend reading this article.
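A quick complementary check that we are not dealing with white noise is a Ljung-Box test from base R’s stats package; a tiny p-value means there is autocorrelation structure worth modelling:

# Null hypothesis: autocorrelations up to the given lag are jointly zero
Box.test(time_series, lag = 48*7, type = "Ljung-Box")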

The one-off simple training

Just from looking at the plot of the data, we may assume there are 2 seasonal periods: daily and weekly. Given that we have exactly 12 weeks of data, we are going to do the following:

  • Use the first 4 weeks to train a TBATS model with the given seasonal periods (48 and 336, i.e. one day and one week of half-hourly observations)
  • Forecast 1 week ahead
  • Compare the forecast with the actual data and compute some metrics from it

It takes roughly 1 minute to train a model. Let’s see how well it fits the data by binding together the data used for training, the test data, and the forecast, and comparing them with the actual measurements.
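The setup chunk is not echoed in this markdown; a minimal sketch of the objects the code below assumes (window bounds, the train/test split and the fitted model), reconstructed from the text above:

# Half-hourly data: 48 observations per day, 48*7 = 336 per week
seasonal_periods = c(48, 48*7)
length_train = 48*7*4   # 4 weeks of training data
length_test = 48*7      # forecast horizon: 1 week

past_win = 1
present = length_train
future_win = present + length_test

ts_train = time_series[past_win:present]
ts_test = time_series[(present + 1):future_win]

tictoc::tic()
forecast_model = tbats(ts_train,
                       use.box.cox = T,
                       seasonal.periods = seasonal_periods)
tictoc::toc()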

# We "tidify" the results by using a dataframe


fc_df = forecast_model %>%
  forecast(h = length_test,
           biasadj = T,
           level = c(50, 80, 95)) %>%
  as.data.frame()

colnames(fc_df) = colnames(fc_df) %>% tolower() %>% base::gsub(" ", "_", .) # column names to snake_case

# Adding discretized time and real values
fc_df$time = (present + 1):future_win
fc_df$value = ts_test


# To compare, let's bind it with the real values used to model the behaviour

present_df = data.frame(value = ts_train)
present_df$time = past_win:present


df_all = bind_rows(present_df, fc_df)

df_all %>%
  filter(time >= max(time) - 48*2*7) %>%  # we just want to show the last 2 weeks
  ggplot(aes(x = time)) +
  geom_ribbon(aes(ymin = lo_80, ymax = hi_80), fill = "red", alpha = 0.1) +
  geom_ribbon(aes(ymin = lo_50, ymax = hi_50), fill = "red", alpha = 0.2) +
  geom_line(aes(y=point_forecast), linetype = 1, color = "red") +
  geom_line(aes(y=value)) +
  theme_minimal() +
  labs(x = "Discretized Time", y = "Electricity demand (MW)")

A huge advantage of time series forecasting is that the model itself allows us to set expectations through confidence intervals (here 50% and 80%). It looks like we did quite well. Let’s calculate the absolute and relative error distributions across the days.

# dday and dhour are discretized analogues of day and hour, given the half-hourly nature of the data
fc_df = fc_df %>%
  mutate(dday = floor((time - min(time))/48),
         dhour = floor((time - min(time))/2) %% 24)

fc_df %>% 
  mutate(absolute_error = abs(value - point_forecast),
         relative_error = absolute_error / value) %>%
  select(dday, absolute_error, relative_error) %>% 
  gather(key, value, -dday) %>%
  ggplot(aes(x = factor(dday + 1), y = value, fill = factor(dday + 1))) +
  geom_boxplot(show.legend = F) +
  scale_fill_viridis_d() +
  facet_wrap(~key, scales = "free_y") +
  theme_minimal() +
  labs(x = "Day of the week")

Even if the absolute error oscillates around 800 megawatts, it still represents only 2-5% of the total demand. So far so good. Moreover, nearly 100 percent of the actual values fall within the 80% confidence interval.
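We can make that coverage claim precise with a one-liner over the columns already present in fc_df:

# Empirical coverage: fraction of actual values inside the 80% interval
mean(fc_df$value >= fc_df$lo_80 & fc_df$value <= fc_df$hi_80)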

How the model may behave in the long run: A moving window perspective

We can scale this thought process to the whole 12 weeks of historical data. That allows us to fit 8 models across time, depending on when we consider “the present” to be. In other words, we take the beginning of each of the last 8 weeks, train a model on the 4 weeks before it, predict 1 week ahead, rinse and repeat. To make the process more understandable from a coding point of view, we will use a list-column workflow, mapping functions over lists of elements using functional programming.

# simple function to slice the time series given 2 indexes
extract_data = function(x,y){
  time_series[x:y]
}

# simple function to train a TBATS model given a time series
train_model = function(x){
  tbats(
    x,
    use.box.cox = T,
    use.trend = T,
    use.arma.errors = T,
    biasadj = T,
    seasonal.periods = seasonal_periods)
}


# This dataframe contains the indexes for the present and the past/future window bounds we will use for each model

df_all_predictions = tibble(i = as.integer(0:length(time_series)))

df_all_predictions = df_all_predictions %>% 
  mutate(
    present = length_train + i * length_test,
    past_win = present - length_train + 1,
    future_win = present + length_test) %>% 
  filter(future_win <= length(time_series))



df_all_predictions = df_all_predictions %>% 
  mutate(
    train_ts = map2(.x = past_win, .y = present, .f = extract_data),
    test_ts = map2(present + 1, future_win, extract_data)
  )


# We train all the models sequentially. 
tictoc::tic()

df_all_predictions = df_all_predictions %>% 
  mutate(
   models = map(train_ts, train_model))


tictoc::toc()
## 679.246 sec elapsed

The whole fitting process takes about 10 minutes for all the relevant points in the data, i.e. roughly 85 seconds per model on average.
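Since the 8 fits are completely independent, one way to cut down that wall-clock time (a sketch, not run in this markdown, assuming the furrr package is available) is to parallelize the map:

# Hypothetical parallel variant of the training step
library(furrr)
plan(multisession)  # one background R session per available core

df_all_predictions = df_all_predictions %>%
  mutate(models = future_map(train_ts, train_model))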

# simple function to create a data frame with a time column from x to y
create_time = function(x, y){
  data.frame(time = x:y)
}

# Manual tidyfication of a forecast model 
forecast_and_tidy = function(mdl, times){
  
  fc_df = mdl %>%
    forecast(h = length_test) %>% as.data.frame()

  colnames(fc_df) = colnames(fc_df) %>% tolower() %>% base::gsub(" ", "_", .)

  bind_cols(fc_df, times)
}

# adds on a tidy forecast dataframe test values on it
add_real_values = function(fc_dataframe, test_values){
  
  fc_dataframe$value = test_values
  fc_dataframe
  
}


df_all_predictions = df_all_predictions %>% 
  mutate(
    time_past = map2(past_win, present, create_time),
    time_future = map2(present + 1, future_win, create_time),
    forecast_tidy = map2(models, time_future, forecast_and_tidy),
    forecast_tidy = map2(forecast_tidy, test_ts, add_real_values)
  )

# Finally, we unnest the dataframes
df_tidy = df_all_predictions %>%
  select(i, forecast_tidy) %>% 
  tidyr::unnest(cols = c(forecast_tidy))

Let us see how the models behave in reality, comparing the forecast values of each model iteration with the actual values.

df_tidy %>% 
  ggplot(aes(x = time, fill = factor(i+1))) +
  geom_ribbon(aes(ymin = lo_80, ymax = hi_80), alpha = 0.8, show.legend = T) +
  geom_line(aes(y = value), linetype = 1, show.legend = F) +
  scale_fill_brewer(palette ="Spectral") +
  theme_minimal() +
  labs(x = "Discretized Time", y = "Electricity demand (MW)", fill = "model\niteration")

For the 5th and 7th weeks there seem to be some issues, so let’s open up the absolute error distribution for every model. One plot I do not see often, but which may help us take data-driven decisions on how frequently to refit the model, is how the relative error spreads through time.

df_tidy %>%
  group_by(i) %>%
  mutate(time = time - min(time)) %>%
  mutate(relative_error = abs(value - point_forecast)/value) %>%
  ungroup()  %>%
  ggplot(aes(time, relative_error)) +
  geom_point(aes(color = factor(i+1)), alpha = 0.15)+
  scale_color_brewer(palette = "Spectral") +
  geom_smooth(method = "loess", span = 1/14, color = "red") +
  theme_minimal() +
  scale_y_continuous(labels = scales::label_percent()) +
  labs(x = "Discretized time passed from fitting", y = "Relative error", col = "model\niteration")

In practice, the relative error mostly stays below or around 5%, but by the end of the week it can jump up to 20% in some cases. It is advisable to discuss with your client the maximum acceptable error they can work with and, from that feedback, work out how often you need to train a new model.
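To turn that feedback into a refit schedule, a minimal sketch (the 5% threshold here is a hypothetical value, to be agreed with the client) can look for the first time-from-fitting at which the median relative error crosses it:

max_acceptable_error = 0.05  # assumed threshold, not from the original analysis

df_tidy %>%
  group_by(i) %>%
  mutate(time = time - min(time),
         relative_error = abs(value - point_forecast)/value) %>%
  group_by(time) %>%
  summarise(median_error = median(relative_error)) %>%
  filter(median_error > max_acceptable_error) %>%
  slice(1)  # first offending time step, a candidate refitting horizon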

df_tidy %>%
  mutate(error = abs(value - point_forecast)) %>% 
  ggplot(aes(x = factor(i+1), y = error, fill = factor(i+1))) +
  geom_boxplot() +
  scale_fill_brewer(palette ="Spectral") +
  theme_minimal() +
  labs(x = "Model iteration", fill = "model\niteration", y = "Absolute Error (MW)")

Unfortunately the model cannot capture the monthly seasonality in the electricity demand. This is somewhat obvious, given that we explicitly told the model that the seasonality is at most weekly. These are the solutions proposed to alleviate this issue:

  • Using more weeks to fit the model and making the monthly seasonality explicit. This may be the easiest solution, but at the cost of more training time and a bigger infrastructure to put the pipeline into production.
  • Feature engineering: we may decouple the fitting problem into 2 smaller ones:
    • Estimate the mean for each week using extrapolation
    • Train the model on the differences from those means
    • Add up the 2 results

This procedure is left as an exercise to the reader ;)
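That said, if you want a starting point, here is a hypothetical skeleton of that decoupling (the linear extrapolation and the model settings are placeholder choices, not a definitive implementation):

# 1. Weekly means of the training window
obs_per_week = 48*7
weekly_means = tapply(ts_train, (seq_along(ts_train) - 1) %/% obs_per_week, mean)

# 2. Extrapolate next week's mean, naively, with a linear trend
mean_df = data.frame(w = seq_along(weekly_means), m = as.numeric(weekly_means))
next_mean = predict(lm(m ~ w, mean_df), newdata = data.frame(w = nrow(mean_df) + 1))

# 3. Model the demeaned series, then add the two parts back together
demeaned = ts_train - rep(weekly_means, each = obs_per_week)
demeaned_model = tbats(demeaned, seasonal.periods = seasonal_periods)
point_fc = as.numeric(forecast(demeaned_model, h = obs_per_week)$mean) + next_mean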

Some conclusions and words from the Author

Even if time series forecasting may not enjoy the popularity of machine learning models in today’s data-driven culture, I believe it can still be considered at least a baseline benchmark to compare against. Its relative ease of training and deployment can be a great asset when the overall goal of the project is to shorten the time-to-market as much as possible, especially when there are no variables other than the response itself to work with.

Rather than expecting a total change of culture among the people who took the time to read this article, I hope this helps you consider simpler yet still useful methods like time series forecasting as an alternative for your production pipelines.