Time Series Forecasting describes a set of research problems where observations are collected at regular time intervals and where we can assume correlations among successive observations. The main idea is to predict the future based on previous data; the process itself is called forecasting. A time series is a sequence of observations of a single quantity over time.
Before we work with the time series, we load the required libraries first:
library(lubridate) #for working with dates
library(tidyquant) #for some ggplot aesthetics; theme_tq
library(forecast) #for time series modeling and forecasting; autoplot
library(tidyverse) #for data wrangling
library(padr) #pad (complete the hours/time)
library(recipes) #for data preprocess
library(yardstick) #for measuring forecast performance; mae_vec
library(timetk) #a toolkit for time series
library(ggplot2) #for plotting our model
library(plotly) #for interactive plots
library(magrittr) #for the %<>% assignment pipe used later
Carbikemovers - India
Scotty is a transportation company in India. At the end of 2017, they needed to know and be ready for the year-end demand. Unfortunately, Scotty was not old enough to have the previous year's data for December, so they could not look back at past demand to prepare a forecast for December.
Fortunately, we already know that time series analysis is more than enough to help us forecast. As an investment in the business' future, we will develop an automated forecasting framework.
Based on the "scotty-ts-auto/data/data-train.csv" data (up to Saturday, December 2nd 2017), we will make a report on our automated forecasting framework for hourly demand, which will be evaluated on the next 7 days (Sunday, December 3rd 2017 to Saturday, December 9th 2017).
Let’s start by importing the dataset:
Data Train
## Parsed with column specification:
## cols(
## id = col_character(),
## trip_id = col_character(),
## driver_id = col_character(),
## rider_id = col_character(),
## start_time = col_datetime(format = ""),
## src_lat = col_double(),
## src_lon = col_double(),
## src_area = col_character(),
## src_sub_area = col_character(),
## dest_lat = col_double(),
## dest_lon = col_double(),
## dest_area = col_character(),
## dest_sub_area = col_character(),
## distance = col_double(),
## status = col_character(),
## confirmed_time_sec = col_double()
## )
## # A tibble: 10 x 16
## id trip_id driver_id rider_id start_time src_lat src_lon
## <chr> <chr> <chr> <chr> <dttm> <dbl> <dbl>
## 1 59d0~ 59d005~ 59a892c5~ 59ad2d6~ 2017-10-01 00:00:17 41.1 29.0
## 2 59d0~ <NA> <NA> 59cd704~ 2017-10-01 00:02:34 41.1 29.0
## 3 59d0~ 59d006~ 599dc0df~ 59bd62c~ 2017-10-01 00:03:29 41.0 29.0
## 4 59d0~ 59d007~ 59a58565~ 59c17cd~ 2017-10-01 00:04:24 41.1 29.0
## 5 59d0~ <NA> <NA> 596f47a~ 2017-10-01 00:06:06 41.1 29.0
## 6 59d0~ 59d007~ 599dc0df~ 59bd62c~ 2017-10-01 00:08:19 41.0 29.0
## 7 59d0~ 59d008~ 599dc0df~ 59bd62c~ 2017-10-01 00:09:01 41.0 29.0
## 8 59d0~ 59d008~ 599dc0df~ 59bd62c~ 2017-10-01 00:09:45 41.0 29.0
## 9 59d0~ <NA> <NA> 59bd62c~ 2017-10-01 00:11:06 41.0 29.0
## 10 59d0~ 59d008~ 599dc0df~ 596eab7~ 2017-10-01 00:12:24 41.0 29.0
## # ... with 9 more variables: src_area <chr>, src_sub_area <chr>,
## # dest_lat <dbl>, dest_lon <dbl>, dest_area <chr>, dest_sub_area <chr>,
## # distance <dbl>, status <chr>, confirmed_time_sec <dbl>
As per the information above, we would like to create an automated forecast for the week after the training data ends.
Let's look at the format of the evaluation data:
## Parsed with column specification:
## cols(
## src_sub_area = col_character(),
## datetime = col_datetime(format = ""),
## demand = col_logical()
## )
## # A tibble: 10 x 3
## src_sub_area datetime demand
## <chr> <dttm> <lgl>
## 1 sxk97 2017-12-03 00:00:00 NA
## 2 sxk97 2017-12-03 01:00:00 NA
## 3 sxk97 2017-12-03 02:00:00 NA
## 4 sxk97 2017-12-03 03:00:00 NA
## 5 sxk97 2017-12-03 04:00:00 NA
## 6 sxk97 2017-12-03 05:00:00 NA
## 7 sxk97 2017-12-03 06:00:00 NA
## 8 sxk97 2017-12-03 07:00:00 NA
## 9 sxk97 2017-12-03 08:00:00 NA
## 10 sxk97 2017-12-03 09:00:00 NA
We only need the sub-area, datetime, and demand columns, so we reshape the training data into the same format.
We aggregate the time using the floor_date() function and count Scotty's demand in each hour.
# aggregate the time data
subarea_train <- data_train %>%
mutate(
datetime = floor_date(start_time, "hour")) %>%
group_by(src_sub_area, datetime) %>%
summarise(demand = n()) %>%
ungroup()
# quick check
subarea_train
## # A tibble: 4,225 x 3
## src_sub_area datetime demand
## <chr> <dttm> <int>
## 1 sxk97 2017-10-01 00:00:00 6
## 2 sxk97 2017-10-01 01:00:00 4
## 3 sxk97 2017-10-01 02:00:00 9
## 4 sxk97 2017-10-01 03:00:00 2
## 5 sxk97 2017-10-01 04:00:00 5
## 6 sxk97 2017-10-01 05:00:00 4
## 7 sxk97 2017-10-01 06:00:00 1
## 8 sxk97 2017-10-01 08:00:00 3
## 9 sxk97 2017-10-01 09:00:00 3
## 10 sxk97 2017-10-01 10:00:00 3
## # ... with 4,215 more rows
From the quick check, we can see that the time series is not complete yet: hours with zero demand are simply missing.
So, we will complete it by doing time series padding.
As mentioned before, we will complete the time series data, e.g. we will fill 2017-10-01 07:00:00 with demand = 0:
sxk97 2017-10-01 06:00:00 1
sxk97 2017-10-01 08:00:00 3
subarea_train <- subarea_train %>%
group_by(src_sub_area) %>%
pad(
start_val = ymd_hms("2017-10-01 00:00:00"),
end_val = ymd_hms("2017-12-02 23:00:00")) %>% #detect internally: "hour"
ungroup()
## pad applied on the interval: hour
## # A tibble: 4,536 x 3
## src_sub_area datetime demand
## <chr> <dttm> <dbl>
## 1 sxk97 2017-10-01 00:00:00 6
## 2 sxk97 2017-10-01 01:00:00 4
## 3 sxk97 2017-10-01 02:00:00 9
## 4 sxk97 2017-10-01 03:00:00 2
## 5 sxk97 2017-10-01 04:00:00 5
## 6 sxk97 2017-10-01 05:00:00 4
## 7 sxk97 2017-10-01 06:00:00 1
## 8 sxk97 2017-10-01 07:00:00 0
## 9 sxk97 2017-10-01 08:00:00 3
## 10 sxk97 2017-10-01 09:00:00 3
## # ... with 4,526 more rows
Now the data is complete. It contains the src_sub_area, datetime, and demand columns. The dates range from:
## [1] "2017-10-01 00:00:00 UTC" "2017-12-02 23:00:00 UTC"
Here's a quick plot of the last four weeks of data:
# quick plot
subarea_train %>%
filter(datetime >= max(datetime) - hours(24 * 7 * 4)) %>%
ggplot(aes(x = datetime, y = demand)) +
geom_line() +
labs(x = NULL, y = NULL) +
facet_wrap(~ src_sub_area, scale = "free", ncol = 1) +
tidyquant::theme_tq()
To help us benchmark multiple models, we need to split off some portion of our data for validation. The strategy here is to cut the last 1 week as our test dataset. Then we cut a (bigger) portion before it as the train dataset, say, 1 week times 8, approximately 2 months. This strategy is just a simplified version of rolling forecasting origin, with only one pair of train and test samples.
So the first step here is to get the start and end dates of the train and test samples. The most straightforward way is to define the train and test sizes, then recursively derive the start and end index for each:
# train-val-test size
test_size <- 24 * 7
train_size <- 24 * 7 * 8
# get the min-max of the time index for each sample
test_end <- max(subarea_train$datetime)
test_start <- test_end - hours(test_size) + hours(1)
train_end <- test_start - hours(1)
train_start <- train_end - hours(train_size) + hours(1)
We combine the start and end indices into date intervals:
# get the interval of each samples
intrain <- interval(train_start, train_end)
intest <- interval(test_start, test_end)
intrain
## [1] 2017-10-01 UTC--2017-11-25 23:00:00 UTC
intest
## [1] 2017-11-26 UTC--2017-12-02 23:00:00 UTC
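As an aside to the rolling forecasting origin note above: with more history, one could generate several train/test pairs instead of one. A minimal sketch of such a helper (hypothetical, not used in the rest of this report):
# sketch: k rolling origins, each with a 1-week test window and an 8-week train window
rolling_origins <- function(last_time, k = 3,
                            test_hours = 24 * 7, train_hours = 24 * 7 * 8) {
  map_df(seq_len(k) - 1, function(i) {
    test_end <- last_time - hours(i * test_hours)
    test_start <- test_end - hours(test_hours) + hours(1)
    train_end <- test_start - hours(1)
    tibble(
      fold = i + 1,
      train_start = train_end - hours(train_hours) + hours(1),
      train_end = train_end,
      test_start = test_start,
      test_end = test_end
    )
  })
}
rolling_origins(max(subarea_train$datetime))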
Here's the data after the splitting process, coloured by train and test sample:
# plot the train and test
p <- subarea_train %>%
mutate(sample = case_when(
datetime %within% intrain ~ "train",
datetime %within% intest ~ "test"
)) %>%
drop_na() %>%
mutate(sample = factor(sample, levels = c("train", "test"))) %>%
ggplot(aes(x = datetime, y = demand, colour = sample)) +
geom_line() +
labs(x = NULL, y = NULL, colour = NULL) +
facet_wrap(~ src_sub_area, scale = "free", ncol = 1) +
tidyquant::theme_tq() +
tidyquant::scale_colour_tq()
ggplotly(p)
Data preprocessing is a crucial step in time series model fitting. In this Capstone project, we will use the recipes package for data preprocessing and for comparing multiple preprocessing specifications.
Since the recipes package works columnwise, we need to convert our data into a wide format first:
# convert to wide format
subarea_wide <- subarea_train %>%
spread(src_sub_area, demand)
subarea_wide
## # A tibble: 1,512 x 4
## datetime sxk97 sxk9e sxk9s
## <dttm> <dbl> <dbl> <dbl>
## 1 2017-10-01 00:00:00 6 8 10
## 2 2017-10-01 01:00:00 4 2 3
## 3 2017-10-01 02:00:00 9 3 1
## 4 2017-10-01 03:00:00 2 2 0
## 5 2017-10-01 04:00:00 5 1 0
## 6 2017-10-01 05:00:00 4 2 0
## 7 2017-10-01 06:00:00 1 1 0
## 8 2017-10-01 07:00:00 0 0 1
## 9 2017-10-01 08:00:00 3 2 5
## 10 2017-10-01 09:00:00 3 11 5
## # ... with 1,502 more rows
Then we can define the preprocessing recipe() and bake() our data based on the defined recipe:
# recipe is a set of steps, obj: make the data additive
# we only do this with data train
rec <- recipe(~ ., filter(subarea_wide, datetime %within% intrain)) %>%
step_sqrt(all_numeric()) %>%
step_center(all_numeric()) %>%
step_scale(all_numeric()) %>%
prep()
# preview the bake results
subarea_wide <- bake(rec, subarea_wide)
subarea_wide
## # A tibble: 1,512 x 4
## datetime sxk97 sxk9e sxk9s
## <dttm> <dbl> <dbl> <dbl>
## 1 2017-10-01 00:00:00 -0.594 -0.697 -0.111
## 2 2017-10-01 01:00:00 -0.841 -1.28 -0.780
## 3 2017-10-01 02:00:00 -0.291 -1.15 -1.12
## 4 2017-10-01 03:00:00 -1.16 -1.28 -1.59
## 5 2017-10-01 04:00:00 -0.711 -1.44 -1.59
## 6 2017-10-01 05:00:00 -0.841 -1.28 -1.59
## 7 2017-10-01 06:00:00 -1.39 -1.44 -1.59
## 8 2017-10-01 07:00:00 -1.94 -1.85 -1.12
## 9 2017-10-01 08:00:00 -0.989 -1.28 -0.544
## 10 2017-10-01 09:00:00 -0.989 -0.497 -0.544
## # ... with 1,502 more rows
Since we used the recipes package, the next step is to create a revert-back function:
# revert back function
rec_revert <- function(vector, rec, varname) {
# store recipe values
rec_center <- rec$steps[[2]]$means[varname]
rec_scale <- rec$steps[[3]]$sds[varname]
# convert back based on the recipe
results <- (vector * rec_scale + rec_center) ^ 2
# add additional adjustment if necessary
results <- round(results)
# return the results
results
}
This revert-back function converts the data back to its original scale: it undoes the centering and scaling, squares the result to undo the square root, and rounds to whole demand counts.
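As a quick sanity check (illustrative), reverting the first few transformed values of one sub-area should reproduce the original demand counts:
# illustrative check: revert the first five transformed values of sxk97
rec_revert(subarea_wide$sxk97[1:5], rec, "sxk97")
# expected: the original counts 6, 4, 9, 2, 5 (see the pre-bake preview above)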
Now we can convert our data into the long format again:
# convert back to long format
subarea_long <- subarea_wide %>%
gather(src_sub_area, demand, -datetime)
subarea_long
## # A tibble: 4,536 x 3
## datetime src_sub_area demand
## <dttm> <chr> <dbl>
## 1 2017-10-01 00:00:00 sxk97 -0.594
## 2 2017-10-01 01:00:00 sxk97 -0.841
## 3 2017-10-01 02:00:00 sxk97 -0.291
## 4 2017-10-01 03:00:00 sxk97 -1.16
## 5 2017-10-01 04:00:00 sxk97 -0.711
## 6 2017-10-01 05:00:00 sxk97 -0.841
## 7 2017-10-01 06:00:00 sxk97 -1.39
## 8 2017-10-01 07:00:00 sxk97 -1.94
## 9 2017-10-01 08:00:00 sxk97 -0.989
## 10 2017-10-01 09:00:00 sxk97 -0.989
## # ... with 4,526 more rows
For functional programming with purrr, we need to convert our tbl into a nested tbl. Nested data is like a table inside a table, controlled through key columns; in other words, we can have a tbl for each sub-area and sample. Using this format, the fitting and forecasting process becomes very versatile, and we can still unnest the results as long as we keep a proper key such as src_sub_area.
Let’s start by converting our tbl into a nested tbl. First, we need to add sample indicator so it could be recognized as a key when we nest() the data:
# adjust by sample
subarea_nest <- subarea_long %>%
mutate(sample = case_when(
datetime %within% intrain ~ "train",
datetime %within% intest ~ "test"
)) %>%
drop_na()
# nest the data train
subarea_train_nest <- subarea_nest %>%
group_by(src_sub_area, sample) %>%
nest(.key = "data") %>%
spread(sample, data)
subarea_train_nest
## # A tibble: 3 x 3
## src_sub_area test train
## <chr> <list> <list>
## 1 sxk97 <tibble [168 x 2]> <tibble [1,344 x 2]>
## 2 sxk9e <tibble [168 x 2]> <tibble [1,344 x 2]>
## 3 sxk9s <tibble [168 x 2]> <tibble [1,344 x 2]>
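We nest the submission data in the same way, so we can attach forecasts to it later. Below is a hypothetical reconstruction of that step; submit_nest is the name used again in the final forecast section, and data_submission is an assumed name for the evaluation data loaded earlier:
# nest the submission data (reconstruction; output shown below)
submit_nest <- data_submission %>%
  group_by(src_sub_area, demand) %>%
  nest(.key = "data") %>%
  ungroup()
submit_nest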
## # A tibble: 3 x 3
## src_sub_area demand data
## <chr> <lgl> <list>
## 1 sxk97 NA <tibble [168 x 1]>
## 2 sxk9e NA <tibble [168 x 1]>
## 3 sxk9s NA <tibble [168 x 1]>
For the data and model combinations, we can start by defining some options for the data representation. Recall that our series have a relatively high frequency, so we consider two options of data representation here:
- a ts object with daily seasonality (frequency 24), and
- an msts object with daily and weekly seasonality (24 and 24 * 7).
To incorporate them into our nested data, we will create another nested data frame containing a conversion function for each representation, then apply those functions to convert each series into the specified object. In this way, we can compare multiple seasonality specifications.
Let’s start with making a named list containing the transformation functions:
# function: data funs list
data_funs <- list(
ts = function(x) ts(x$demand, frequency = 24),
msts = function(x) msts(x$demand, seasonal.periods = c(24, 24 * 7))
)
data_funs
## $ts
## function (x)
## ts(x$demand, frequency = 24)
##
## $msts
## function (x)
## msts(x$demand, seasonal.periods = c(24, 24 * 7))
Then we convert the list into a tbl using enframe().
Note that we should also attach a key - src_sub_area in our case - so we can use left_join() later.
We replicate the function list for each sub-area using rep():
# convert to nested
data_funs <- data_funs %>%
rep(length(unique(subarea_nest$src_sub_area))) %>%
enframe("data_fun_name", "data_fun") %>%
mutate(src_sub_area =
sort(rep(unique(subarea_nest$src_sub_area), length(unique(.$data_fun_name))))
)
data_funs
## # A tibble: 6 x 3
## data_fun_name data_fun src_sub_area
## <chr> <list> <chr>
## 1 ts <fn> sxk97
## 2 msts <fn> sxk97
## 3 ts <fn> sxk9e
## 4 msts <fn> sxk9e
## 5 ts <fn> sxk9s
## 6 msts <fn> sxk9s
Then the last step here is to join the nested functions with our nested data for model fitting:
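The join itself is presumably done as follows (a reconstruction; subarea_nest_fit is the name used in the model fitting step below):
# join the nested train/test samples with the data representation functions
subarea_nest_fit <- subarea_train_nest %>%
  left_join(data_funs)
subarea_nest_fit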
## Joining, by = "src_sub_area"
## # A tibble: 6 x 5
## src_sub_area test train data_fun_name data_fun
## <chr> <list> <list> <chr> <list>
## 1 sxk97 <tibble [168 x 2]> <tibble [1,344 x ~ ts <fn>
## 2 sxk97 <tibble [168 x 2]> <tibble [1,344 x ~ msts <fn>
## 3 sxk9e <tibble [168 x 2]> <tibble [1,344 x ~ ts <fn>
## 4 sxk9e <tibble [168 x 2]> <tibble [1,344 x ~ msts <fn>
## 5 sxk9s <tibble [168 x 2]> <tibble [1,344 x ~ ts <fn>
## 6 sxk9s <tibble [168 x 2]> <tibble [1,344 x ~ msts <fn>
Similar to when we created the data representation (daily and weekly) list, we also store some time series models as a nested list.
Again, let's start by making a list of models; we would like to compare multiple model specifications. For this Capstone project, we will use auto.arima(), ets(), stlm(), and tbats(). We wrap each of these models in a function and store them inside a list:
# time series models list
models <- list(
auto.arima = function(x) auto.arima(x),
ets = function(x) ets(x),
stlm = function(x) stlm(x),
tbats = function(x) tbats(x, use.box.cox = FALSE)
)
models
## $auto.arima
## function (x)
## auto.arima(x)
##
## $ets
## function (x)
## ets(x)
##
## $stlm
## function (x)
## stlm(x)
##
## $tbats
## function (x)
## tbats(x, use.box.cox = FALSE)
Then we can convert it into a nested format as in the previous example:
# convert to nested
models <- models %>%
rep(length(unique(subarea_nest$src_sub_area))) %>%
enframe("model_name", "model") %>%
mutate(src_sub_area =
sort(rep(unique(subarea_nest$src_sub_area), length(unique(.$model_name))))
)
models
## # A tibble: 12 x 3
## model_name model src_sub_area
## <chr> <list> <chr>
## 1 auto.arima <fn> sxk97
## 2 ets <fn> sxk97
## 3 stlm <fn> sxk97
## 4 tbats <fn> sxk97
## 5 auto.arima <fn> sxk9e
## 6 ets <fn> sxk9e
## 7 stlm <fn> sxk9e
## 8 tbats <fn> sxk9e
## 9 auto.arima <fn> sxk9s
## 10 ets <fn> sxk9s
## 11 stlm <fn> sxk9s
## 12 tbats <fn> sxk9s
And finally, we can join the result into our nested data. We also apply a rule here: since ets() and auto.arima() are not suitable for time series with multiple seasonalities, we filter those combinations out:
# combine with models
subarea_nest_model <- subarea_nest_fit %>%
left_join(models) %>%
filter(
!(model_name == "ets" & data_fun_name == "msts"),
!(model_name == "auto.arima" & data_fun_name == "msts")
)
## Joining, by = "src_sub_area"
## # A tibble: 18 x 7
## src_sub_area test train data_fun_name data_fun model_name model
## <chr> <list> <list> <chr> <list> <chr> <lis>
## 1 sxk97 <tibble ~ <tibble ~ ts <fn> auto.arima <fn>
## 2 sxk97 <tibble ~ <tibble ~ ts <fn> ets <fn>
## 3 sxk97 <tibble ~ <tibble ~ ts <fn> stlm <fn>
## 4 sxk97 <tibble ~ <tibble ~ ts <fn> tbats <fn>
## 5 sxk97 <tibble ~ <tibble ~ msts <fn> stlm <fn>
## 6 sxk97 <tibble ~ <tibble ~ msts <fn> tbats <fn>
## 7 sxk9e <tibble ~ <tibble ~ ts <fn> auto.arima <fn>
## 8 sxk9e <tibble ~ <tibble ~ ts <fn> ets <fn>
## 9 sxk9e <tibble ~ <tibble ~ ts <fn> stlm <fn>
## 10 sxk9e <tibble ~ <tibble ~ ts <fn> tbats <fn>
## 11 sxk9e <tibble ~ <tibble ~ msts <fn> stlm <fn>
## 12 sxk9e <tibble ~ <tibble ~ msts <fn> tbats <fn>
## 13 sxk9s <tibble ~ <tibble ~ ts <fn> auto.arima <fn>
## 14 sxk9s <tibble ~ <tibble ~ ts <fn> ets <fn>
## 15 sxk9s <tibble ~ <tibble ~ ts <fn> stlm <fn>
## 16 sxk9s <tibble ~ <tibble ~ ts <fn> tbats <fn>
## 17 sxk9s <tibble ~ <tibble ~ msts <fn> stlm <fn>
## 18 sxk9s <tibble ~ <tibble ~ msts <fn> tbats <fn>
To execute the model fitting, we wrap the needed arguments into a list using the map() function, then call each function with its arguments using invoke_map().
We do this first for the data transformation, using the function stored in data_fun, then continue to fit the model with the same process using the function stored in model.
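As a toy illustration of invoke_map() (not part of the pipeline): each function in the first list is called with the matching argument list from the second:
# invoke_map(): call each function with its own list of arguments
invoke_map(list(sum, prod), list(list(1:3), list(1:3)))
# returns a list containing 6 (the sum) and 6 (the product)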
Note: the fitting takes some time, so we saved the result to an RDS file; re-running the report only needs readRDS() to get the result back quickly.
# invoke nested fitting
# subarea_nest_model <- subarea_nest_model %>%
# mutate(
# params = map(train, ~ list(x = .x)),
# data = invoke_map(data_fun, params),
# params = map(data, ~ list(x = .x)),
# fitted = invoke_map(model, params)
# # error = map_dbl(fitted, mae_vec(.x$demand, .x$estimate))
# ) %>%
# select(-data, -params)
#
# saveRDS(subarea_nest_model, "subarea_nest_model.RDS")
subarea_nest_model <- readRDS("subarea_nest_model.RDS")
subarea_nest_model
## # A tibble: 18 x 8
## src_sub_area test train data_fun_name data_fun model_name model fitted
## <chr> <list> <lis> <chr> <list> <chr> <lis> <list>
## 1 sxk97 <tibb~ <tib~ ts <fn> auto.arima <fn> <ARIM~
## 2 sxk97 <tibb~ <tib~ ts <fn> ets <fn> <ets>
## 3 sxk97 <tibb~ <tib~ ts <fn> stlm <fn> <stlm>
## 4 sxk97 <tibb~ <tib~ ts <fn> tbats <fn> <tbat~
## 5 sxk97 <tibb~ <tib~ msts <fn> stlm <fn> <stlm>
## 6 sxk97 <tibb~ <tib~ msts <fn> tbats <fn> <tbat~
## 7 sxk9e <tibb~ <tib~ ts <fn> auto.arima <fn> <ARIM~
## 8 sxk9e <tibb~ <tib~ ts <fn> ets <fn> <ets>
## 9 sxk9e <tibb~ <tib~ ts <fn> stlm <fn> <stlm>
## 10 sxk9e <tibb~ <tib~ ts <fn> tbats <fn> <tbat~
## 11 sxk9e <tibb~ <tib~ msts <fn> stlm <fn> <stlm>
## 12 sxk9e <tibb~ <tib~ msts <fn> tbats <fn> <tbat~
## 13 sxk9s <tibb~ <tib~ ts <fn> auto.arima <fn> <ARIM~
## 14 sxk9s <tibb~ <tib~ ts <fn> ets <fn> <ets>
## 15 sxk9s <tibb~ <tib~ ts <fn> stlm <fn> <stlm>
## 16 sxk9s <tibb~ <tib~ ts <fn> tbats <fn> <tbat~
## 17 sxk9s <tibb~ <tib~ msts <fn> stlm <fn> <stlm>
## 18 sxk9s <tibb~ <tib~ msts <fn> tbats <fn> <tbat~
# calculate test errors
subarea_error <- subarea_nest_model %>%
mutate(error =
map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
map2_dbl(test, ~ mae_vec(truth = .y$demand, estimate = .x$mean))
) %>%
arrange(src_sub_area, error) #arrange based on smallest error for each area
subarea_error
## # A tibble: 18 x 9
## src_sub_area test train data_fun_name data_fun model_name model fitted
## <chr> <lis> <lis> <chr> <list> <chr> <lis> <list>
## 1 sxk97 <tib~ <tib~ msts <fn> tbats <fn> <tbat~
## 2 sxk97 <tib~ <tib~ msts <fn> stlm <fn> <stlm>
## 3 sxk97 <tib~ <tib~ ts <fn> tbats <fn> <tbat~
## 4 sxk97 <tib~ <tib~ ts <fn> ets <fn> <ets>
## 5 sxk97 <tib~ <tib~ ts <fn> stlm <fn> <stlm>
## 6 sxk97 <tib~ <tib~ ts <fn> auto.arima <fn> <ARIM~
## 7 sxk9e <tib~ <tib~ msts <fn> tbats <fn> <tbat~
## 8 sxk9e <tib~ <tib~ msts <fn> stlm <fn> <stlm>
## 9 sxk9e <tib~ <tib~ ts <fn> ets <fn> <ets>
## 10 sxk9e <tib~ <tib~ ts <fn> tbats <fn> <tbat~
## 11 sxk9e <tib~ <tib~ ts <fn> stlm <fn> <stlm>
## 12 sxk9e <tib~ <tib~ ts <fn> auto.arima <fn> <ARIM~
## 13 sxk9s <tib~ <tib~ msts <fn> tbats <fn> <tbat~
## 14 sxk9s <tib~ <tib~ msts <fn> stlm <fn> <stlm>
## 15 sxk9s <tib~ <tib~ ts <fn> tbats <fn> <tbat~
## 16 sxk9s <tib~ <tib~ ts <fn> auto.arima <fn> <ARIM~
## 17 sxk9s <tib~ <tib~ ts <fn> stlm <fn> <stlm>
## 18 sxk9s <tib~ <tib~ ts <fn> ets <fn> <ets>
## # ... with 1 more variable: error <dbl>
# show the area, data fun name, model & error
subarea_error %>%
select(src_sub_area, ends_with("_name"), error)
## # A tibble: 18 x 4
## src_sub_area data_fun_name model_name error
## <chr> <chr> <chr> <dbl>
## 1 sxk97 msts tbats 0.488
## 2 sxk97 msts stlm 0.543
## 3 sxk97 ts tbats 0.573
## 4 sxk97 ts ets 0.574
## 5 sxk97 ts stlm 0.600
## 6 sxk97 ts auto.arima 0.702
## 7 sxk9e msts tbats 0.359
## 8 sxk9e msts stlm 0.366
## 9 sxk9e ts ets 0.419
## 10 sxk9e ts tbats 0.422
## 11 sxk9e ts stlm 0.438
## 12 sxk9e ts auto.arima 0.502
## 13 sxk9s msts tbats 0.421
## 14 sxk9s msts stlm 0.428
## 15 sxk9s ts tbats 0.431
## 16 sxk9s ts auto.arima 0.503
## 17 sxk9s ts stlm 0.540
## 18 sxk9s ts ets 0.550
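Note that these errors are on the transformed (square-root, centered, scaled) scale, since we computed the MAE on the recipe-baked series. For intuition, mae_vec() simply averages the absolute differences between truth and estimate; a toy illustration:
# toy illustration of yardstick::mae_vec()
mae_vec(truth = c(1, 2, 3), estimate = c(2, 2, 5))
# (|1 - 2| + |2 - 2| + |3 - 5|) / 3 = 1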
Get the best model by choosing the smallest error:
subarea_best_model <- subarea_error %>%
group_by(src_sub_area) %>%
arrange(error) %>%
slice(1) %>% #choose the smallest error & best model for each area directly
ungroup() %>%
select(src_sub_area, ends_with("_name"), error)
Then we do a left join back to subarea_error to get the whole information about each best model:
subarea_best_model <- subarea_best_model %>%
select(-error) %>%
left_join(subarea_error)%>%
select(-error)
## Joining, by = c("src_sub_area", "data_fun_name", "model_name")
## # A tibble: 3 x 8
## src_sub_area data_fun_name model_name test train data_fun model fitted
## <chr> <chr> <chr> <list> <list> <list> <lis> <list>
## 1 sxk97 msts tbats <tibb~ <tibb~ <fn> <fn> <tbat~
## 2 sxk9e msts tbats <tibb~ <tibb~ <fn> <fn> <tbat~
## 3 sxk9s msts tbats <tibb~ <tibb~ <fn> <fn> <tbat~
Besides measuring the error, we can also compare the forecast results to the real test series through graphical analysis. To do that, we need to make a tbl containing our forecasts on the test dataset, then do some spread-gather tricks to make a key that is unique for each combination of data representation and model, plus one for the actual series. Once we reach that format, we can conveniently unnest() the data into a proper long format.
Let’s start with creating the forecast:
subarea_test <- subarea_error %>%
mutate(
forecast =
map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
map2(test, ~ tibble(
datetime = .y$datetime,
demand = as.vector(.x$mean)
)),
key = paste(data_fun_name, model_name, sep = "-")
)
subarea_test
## # A tibble: 18 x 11
## src_sub_area test train data_fun_name data_fun model_name model fitted
## <chr> <lis> <lis> <chr> <list> <chr> <lis> <list>
## 1 sxk97 <tib~ <tib~ msts <fn> tbats <fn> <tbat~
## 2 sxk97 <tib~ <tib~ msts <fn> stlm <fn> <stlm>
## 3 sxk97 <tib~ <tib~ ts <fn> tbats <fn> <tbat~
## 4 sxk97 <tib~ <tib~ ts <fn> ets <fn> <ets>
## 5 sxk97 <tib~ <tib~ ts <fn> stlm <fn> <stlm>
## 6 sxk97 <tib~ <tib~ ts <fn> auto.arima <fn> <ARIM~
## 7 sxk9e <tib~ <tib~ msts <fn> tbats <fn> <tbat~
## 8 sxk9e <tib~ <tib~ msts <fn> stlm <fn> <stlm>
## 9 sxk9e <tib~ <tib~ ts <fn> ets <fn> <ets>
## 10 sxk9e <tib~ <tib~ ts <fn> tbats <fn> <tbat~
## 11 sxk9e <tib~ <tib~ ts <fn> stlm <fn> <stlm>
## 12 sxk9e <tib~ <tib~ ts <fn> auto.arima <fn> <ARIM~
## 13 sxk9s <tib~ <tib~ msts <fn> tbats <fn> <tbat~
## 14 sxk9s <tib~ <tib~ msts <fn> stlm <fn> <stlm>
## 15 sxk9s <tib~ <tib~ ts <fn> tbats <fn> <tbat~
## 16 sxk9s <tib~ <tib~ ts <fn> auto.arima <fn> <ARIM~
## 17 sxk9s <tib~ <tib~ ts <fn> stlm <fn> <stlm>
## 18 sxk9s <tib~ <tib~ ts <fn> ets <fn> <ets>
## # ... with 3 more variables: error <dbl>, forecast <list>, key <chr>
Then do some spread-gather to create a proper key:
subarea_test_key <- subarea_test %>%
select(src_sub_area, key, actual = test, forecast) %>%
spread(key, forecast) %>%
gather(key, value, -src_sub_area)
subarea_test_key
## # A tibble: 21 x 3
## src_sub_area key value
## <chr> <chr> <list>
## 1 sxk97 actual <tibble [168 x 2]>
## 2 sxk9e actual <tibble [168 x 2]>
## 3 sxk9s actual <tibble [168 x 2]>
## 4 sxk97 msts-stlm <tibble [168 x 2]>
## 5 sxk9e msts-stlm <tibble [168 x 2]>
## 6 sxk9s msts-stlm <tibble [168 x 2]>
## 7 sxk97 msts-tbats <tibble [168 x 2]>
## 8 sxk9e msts-tbats <tibble [168 x 2]>
## 9 sxk9s msts-tbats <tibble [168 x 2]>
## 10 sxk97 ts-auto.arima <tibble [168 x 2]>
## # ... with 11 more rows
Last but not least, unnest() the series data and apply the revert-back function:
subarea_test_unnest <- subarea_test_key %>%
unnest(value) %>%
mutate(demand = rec_revert(demand, rec, src_sub_area))
subarea_test_unnest
## # A tibble: 3,528 x 4
## src_sub_area key datetime demand
## <chr> <chr> <dttm> <dbl>
## 1 sxk97 actual 2017-11-26 00:00:00 38
## 2 sxk97 actual 2017-11-26 01:00:00 21
## 3 sxk97 actual 2017-11-26 02:00:00 10
## 4 sxk97 actual 2017-11-26 03:00:00 10
## 5 sxk97 actual 2017-11-26 04:00:00 5
## 6 sxk97 actual 2017-11-26 05:00:00 2
## 7 sxk97 actual 2017-11-26 06:00:00 0
## 8 sxk97 actual 2017-11-26 07:00:00 11
## 9 sxk97 actual 2017-11-26 08:00:00 4
## 10 sxk97 actual 2017-11-26 09:00:00 4
## # ... with 3,518 more rows
With the resulting tbl, we can compare the forecast and the actual data on the test sample graphically, as sketched below.
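A sketch of that comparison plot, assumed to follow the same plotting pattern used earlier in this report:
# plot actual vs forecast on the test sample
p_test <- subarea_test_unnest %>%
  ggplot(aes(x = datetime, y = demand, colour = key)) +
  geom_line() +
  labs(x = NULL, y = NULL, colour = NULL) +
  facet_wrap(~ src_sub_area, scale = "free", ncol = 1) +
  tidyquant::theme_tq() +
  tidyquant::scale_colour_tq()
ggplotly(p_test)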
AUTOMATE THE MODEL SELECTION
As you can see from the plot results, it is hard to decide which model to use based on graphical comparison alone. The most straightforward solution is to use the model with the least error.
Model selection itself is very simple: we only need some basic dplyr grammar to filter() the model with the lowest error:
# filter by lowest test error
subarea_min_error <- subarea_error %>%
select(-fitted) %>% # remove unused
group_by(src_sub_area) %>%
filter(error == min(error)) %>%
ungroup()
subarea_min_error #same selection as subarea_best_model
## # A tibble: 3 x 8
## src_sub_area test train data_fun_name data_fun model_name model error
## <chr> <list> <list> <chr> <list> <chr> <lis> <dbl>
## 1 sxk97 <tibbl~ <tibb~ msts <fn> tbats <fn> 0.488
## 2 sxk9e <tibbl~ <tibb~ msts <fn> tbats <fn> 0.359
## 3 sxk9s <tibbl~ <tibb~ msts <fn> tbats <fn> 0.421
Based on the previous process, we conclude that the automated best model is tbats for every sub-area. BATS and TBATS [1] are two interesting time series forecasting methods that are capable of modeling time series with multiple seasonalities.
The names are acronyms for key features of the models: Trigonometric seasonality, Box-Cox transformation, ARMA errors, Trend and Seasonal components.
The TBATS model takes its roots in exponential smoothing methods. Each seasonality is modeled by a trigonometric representation based on Fourier series. One major advantage of this approach is that it requires only 2 seed states regardless of the length of the period. Another advantage is the ability to model seasonal effects of non-integer lengths.
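As a quick illustration, printing one of the winning fitted objects shows the chosen TBATS specification, including the seasonal periods it kept from the msts representation (24 and 168):
# inspect the best fitted model for the first sub-area (illustrative)
subarea_best_model$fitted[[1]]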
PERFORM THE FINAL FORECAST
After we have the final model, we can finally proceed to the final forecast. We follow the same process as in model fitting, but this time we use the combined train and test data as our new "full data".
Now let's start by recombining the train and test datasets:
# recombine samples
dtfull <- subarea_min_error %>%
mutate(fulldata = map2(train, test, ~ bind_rows(.x, .y))) %>%
select(src_sub_area, fulldata, everything(), -train, -test)
dtfull
## # A tibble: 3 x 7
## src_sub_area fulldata data_fun_name data_fun model_name model error
## <chr> <list> <chr> <list> <chr> <lis> <dbl>
## 1 sxk97 <tibble [1,51~ msts <fn> tbats <fn> 0.488
## 2 sxk9e <tibble [1,51~ msts <fn> tbats <fn> 0.359
## 3 sxk9s <tibble [1,51~ msts <fn> tbats <fn> 0.421
Then we do the same nested fitting as in the previous example:
# dtfull_nest <- dtfull %>%
# mutate(
# params = map(fulldata, ~ list(x = .x)),
# data = invoke_map(data_fun, params),
# params = map(data, ~ list(x = .x)),
# fitted = invoke_map(model, params)
# ) %>%
# select(-data, -params)
#
# saveRDS(dtfull_nest, "dtfull_nest.RDS")
dtfull_nest <- readRDS("dtfull_nest.RDS")
dtfull_nest
## # A tibble: 3 x 8
## src_sub_area fulldata data_fun_name model_name data_fun model error
## <chr> <list> <chr> <chr> <list> <lis> <dbl>
## 1 sxk9e <tibble~ msts tbats <fn> <fn> 0.359
## 2 sxk9s <tibble~ msts tbats <fn> <fn> 0.421
## 3 sxk97 <tibble~ msts tbats <fn> <fn> 0.488
## # ... with 1 more variable: fitted <list>
Next, let’s make a tbl containing each of our forecast results, and convert our nested data to a proper long format:
# get forecast
dtfull_nest <- dtfull_nest %>%
mutate(forecast =
map(fitted, ~ forecast(.x, h = 24 * 7 * 4)) %>%
map2(fulldata, ~ tibble(
datetime = timetk::tk_make_future_timeseries(.y$datetime, 24 * 7 * 4),
demand = as.vector(.x$mean)
))
)
dtfull_nest
## # A tibble: 3 x 9
## src_sub_area fulldata data_fun_name model_name data_fun model error
## <chr> <list> <chr> <chr> <list> <lis> <dbl>
## 1 sxk9e <tibble~ msts tbats <fn> <fn> 0.359
## 2 sxk9s <tibble~ msts tbats <fn> <fn> 0.421
## 3 sxk97 <tibble~ msts tbats <fn> <fn> 0.488
## # ... with 2 more variables: fitted <list>, forecast <list>
Finally, we can unnest() the data to get the result:
# unnest actual and forecast
dtfull_unnest <- dtfull_nest %>%
select(src_sub_area, actual = fulldata, forecast) %>%
gather(key, value, -src_sub_area) %>%
unnest(value) %>%
mutate(demand = rec_revert(demand, rec, src_sub_area))
dtfull_unnest
## # A tibble: 6,552 x 4
## src_sub_area key datetime demand
## <chr> <chr> <dttm> <dbl>
## 1 sxk9e actual 2017-10-01 00:00:00 8
## 2 sxk9e actual 2017-10-01 01:00:00 2
## 3 sxk9e actual 2017-10-01 02:00:00 3
## 4 sxk9e actual 2017-10-01 03:00:00 2
## 5 sxk9e actual 2017-10-01 04:00:00 1
## 6 sxk9e actual 2017-10-01 05:00:00 2
## 7 sxk9e actual 2017-10-01 06:00:00 1
## 8 sxk9e actual 2017-10-01 07:00:00 0
## 9 sxk9e actual 2017-10-01 08:00:00 2
## 10 sxk9e actual 2017-10-01 09:00:00 11
## # ... with 6,542 more rows
We plot the forecast results for each sub-area in a proper long format; the combined actual and forecast data now spans:
## [1] "2017-10-01 00:00:00 UTC" "2017-12-30 23:00:00 UTC"
d <- dtfull_unnest %>%
ggplot(aes(x = datetime, y = demand, colour = key)) +
geom_line() +
labs(x = NULL, y = NULL, colour = NULL) +
facet_wrap(~ src_sub_area, scale = "free", ncol = 1) +
tidyquant::theme_tq() +
tidyquant::scale_colour_tq()
ggplotly(d)
We can also pull out the test-period slice from the unnested data:
data_test <- dtfull_unnest %>%
filter(datetime %within% intest) %>%
select(src_sub_area, datetime, demand)
data_test
## # A tibble: 504 x 3
## src_sub_area datetime demand
## <chr> <dttm> <dbl>
## 1 sxk9e 2017-11-26 00:00:00 21
## 2 sxk9e 2017-11-26 01:00:00 15
## 3 sxk9e 2017-11-26 02:00:00 12
## 4 sxk9e 2017-11-26 03:00:00 7
## 5 sxk9e 2017-11-26 04:00:00 0
## 6 sxk9e 2017-11-26 05:00:00 0
## 7 sxk9e 2017-11-26 06:00:00 1
## 8 sxk9e 2017-11-26 07:00:00 1
## 9 sxk9e 2017-11-26 08:00:00 14
## 10 sxk9e 2017-11-26 09:00:00 13
## # ... with 494 more rows
Mean Absolute Error (MAE) result for the test data:
mae_test_result <- subarea_error %>%
group_by(src_sub_area) %>%
select(src_sub_area, error) %>%
arrange(error) %>%
slice(1) %>%
ungroup()
# target: reach MAE < 11 for all sub-areas in (our own) test dataset
# we used tbats on the transformed scale, so the MAE scale here is a little bit different
mae_test_result %<>%
summarise(src_sub_area = "all sub-area",
error = mean(error)) %>%
bind_rows(mae_test_result, .)
mae_test_result
## # A tibble: 4 x 2
## src_sub_area error
## <chr> <dbl>
## 1 sxk97 0.488
## 2 sxk9e 0.359
## 3 sxk9s 0.421
## 4 all sub-area 0.423
In the plot above, the data covers the interval 2017-10-01 00:00:00 to 2017-12-30 23:00:00. For submission, we only need to prepare the 1-week forecast after the training data: 2017-12-03 00:00:00 to 2017-12-09 23:00:00.
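For reference, the template printed below is just the evaluation data loaded at the beginning; assuming it was stored as data_submission (a hypothetical name), the preview and its datetime range come from:
# preview the submission template and its datetime range (data_submission is an assumed name)
data_submission
range(data_submission$datetime)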
## # A tibble: 504 x 3
## src_sub_area datetime demand
## <chr> <dttm> <lgl>
## 1 sxk97 2017-12-03 00:00:00 NA
## 2 sxk97 2017-12-03 01:00:00 NA
## 3 sxk97 2017-12-03 02:00:00 NA
## 4 sxk97 2017-12-03 03:00:00 NA
## 5 sxk97 2017-12-03 04:00:00 NA
## 6 sxk97 2017-12-03 05:00:00 NA
## 7 sxk97 2017-12-03 06:00:00 NA
## 8 sxk97 2017-12-03 07:00:00 NA
## 9 sxk97 2017-12-03 08:00:00 NA
## 10 sxk97 2017-12-03 09:00:00 NA
## # ... with 494 more rows
## [1] "2017-12-03 00:00:00 UTC" "2017-12-09 23:00:00 UTC"
Forecast the submission data using our best models:
submission <- subarea_best_model %>% #use the best model
mutate(
forecast =
map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
map2(submit_nest$data, ~ tibble( #map into our data submission nest
datetime = .y$datetime,
demand = as.vector(.x$mean)
)),
key = paste(data_fun_name, model_name, sep = "-")
)
Select only the required columns, then unnest and revert the demand back to its original scale:
submission <- submission %>%
select(src_sub_area, forecast) %>%
unnest(forecast) %>%
mutate(demand = rec_revert(demand, rec, src_sub_area))
submission
## # A tibble: 504 x 3
## src_sub_area datetime demand
## <chr> <dttm> <dbl>
## 1 sxk97 2017-12-03 00:00:00 18
## 2 sxk97 2017-12-03 01:00:00 15
## 3 sxk97 2017-12-03 02:00:00 11
## 4 sxk97 2017-12-03 03:00:00 7
## 5 sxk97 2017-12-03 04:00:00 5
## 6 sxk97 2017-12-03 05:00:00 3
## 7 sxk97 2017-12-03 06:00:00 2
## 8 sxk97 2017-12-03 07:00:00 6
## 9 sxk97 2017-12-03 08:00:00 12
## 10 sxk97 2017-12-03 09:00:00 12
## # ... with 494 more rows
Checking the MAE result for the submission data:
Because we don't have the actual (truth) data for our submission, we check the MAE on the Algoritma shiny leaderboard: https://algoritma.shinyapps.io/leaderboard_capsml/
And the result:
Submission’s Model Performance
Plot the submission forecast only:
Key takeaways:
- pad() is one of the functions that helps a lot in completing the data.
- ets() and auto.arima() are not suitable for multiple-seasonality time series data.
- Evaluate models with an error measure such as MAE, not only with visual checking via plots.
Note:
Special thanks to @teamalgoritma, who helped a lot in learning, preparing, and completing the Machine Learning course - especially this Capstone Project.
(c) Meilinie - June 2019