Time Series Forecasting describes a set of research problems where our observations are collected at regular time intervals and where we can assume correlations among successive observations. The main idea is to forecast the future based on previous data. The process itself is called forecasting. Time Series is an observation for single data.

Before we work with the time series, we load the library first:

1 Scotty: “Bring me the crystal ball!”

*Carbikemovers - India

*Carbikemovers - India

Scotty is one of the transportation company in India, in the end of 2017, they need know and be ready for the end year’s demands. Unfortunately, Scotty was not old enough to have last year data for December, so they can not look back at past demands to prepare forecast for December’s demands.

Fortunately, we have already know that time series analysis is more than enough to help us to forecast. As an investment for the business’ future, we will develop an automated forecasting framework.

Based on “scotty-ts-auto/data/data-train.csv” data (up to Saturday, December 2nd 2017), we will make a report of our automated forecasting framework for hourly demands, and that would be evaluated on the next 7 days (Sunday, December 3rd 2017 to Monday, December 9th 2017).

2 Data Preprocess

Let’s start by importing the dataset:

Data Train

## Parsed with column specification:
## cols(
##   id = col_character(),
##   trip_id = col_character(),
##   driver_id = col_character(),
##   rider_id = col_character(),
##   start_time = col_datetime(format = ""),
##   src_lat = col_double(),
##   src_lon = col_double(),
##   src_area = col_character(),
##   src_sub_area = col_character(),
##   dest_lat = col_double(),
##   dest_lon = col_double(),
##   dest_area = col_character(),
##   dest_sub_area = col_character(),
##   distance = col_double(),
##   status = col_character(),
##   confirmed_time_sec = col_double()
## )
## # A tibble: 10 x 16
##    id    trip_id driver_id rider_id start_time          src_lat src_lon
##    <chr> <chr>   <chr>     <chr>    <dttm>                <dbl>   <dbl>
##  1 59d0~ 59d005~ 59a892c5~ 59ad2d6~ 2017-10-01 00:00:17    41.1    29.0
##  2 59d0~ <NA>    <NA>      59cd704~ 2017-10-01 00:02:34    41.1    29.0
##  3 59d0~ 59d006~ 599dc0df~ 59bd62c~ 2017-10-01 00:03:29    41.0    29.0
##  4 59d0~ 59d007~ 59a58565~ 59c17cd~ 2017-10-01 00:04:24    41.1    29.0
##  5 59d0~ <NA>    <NA>      596f47a~ 2017-10-01 00:06:06    41.1    29.0
##  6 59d0~ 59d007~ 599dc0df~ 59bd62c~ 2017-10-01 00:08:19    41.0    29.0
##  7 59d0~ 59d008~ 599dc0df~ 59bd62c~ 2017-10-01 00:09:01    41.0    29.0
##  8 59d0~ 59d008~ 599dc0df~ 59bd62c~ 2017-10-01 00:09:45    41.0    29.0
##  9 59d0~ <NA>    <NA>      59bd62c~ 2017-10-01 00:11:06    41.0    29.0
## 10 59d0~ 59d008~ 599dc0df~ 596eab7~ 2017-10-01 00:12:24    41.0    29.0
## # ... with 9 more variables: src_area <chr>, src_sub_area <chr>,
## #   dest_lat <dbl>, dest_lon <dbl>, dest_area <chr>, dest_sub_area <chr>,
## #   distance <dbl>, status <chr>, confirmed_time_sec <dbl>

as per information above, we would like to create automated forecasting one week after data train,
Let’s show the format of data evaluation :

## Parsed with column specification:
## cols(
##   src_sub_area = col_character(),
##   datetime = col_datetime(format = ""),
##   demand = col_logical()
## )
## # A tibble: 10 x 3
##    src_sub_area datetime            demand
##    <chr>        <dttm>              <lgl> 
##  1 sxk97        2017-12-03 00:00:00 NA    
##  2 sxk97        2017-12-03 01:00:00 NA    
##  3 sxk97        2017-12-03 02:00:00 NA    
##  4 sxk97        2017-12-03 03:00:00 NA    
##  5 sxk97        2017-12-03 04:00:00 NA    
##  6 sxk97        2017-12-03 05:00:00 NA    
##  7 sxk97        2017-12-03 06:00:00 NA    
##  8 sxk97        2017-12-03 07:00:00 NA    
##  9 sxk97        2017-12-03 08:00:00 NA    
## 10 sxk97        2017-12-03 09:00:00 NA

2.1 Data Aggregation

We only need area, datetime and demand columns, so, we make the same format to data Train.
we aggregate the time using floor_date function and we count the demand of the Scotty at that hour.

## # A tibble: 4,225 x 3
##    src_sub_area datetime            demand
##    <chr>        <dttm>               <int>
##  1 sxk97        2017-10-01 00:00:00      6
##  2 sxk97        2017-10-01 01:00:00      4
##  3 sxk97        2017-10-01 02:00:00      9
##  4 sxk97        2017-10-01 03:00:00      2
##  5 sxk97        2017-10-01 04:00:00      5
##  6 sxk97        2017-10-01 05:00:00      4
##  7 sxk97        2017-10-01 06:00:00      1
##  8 sxk97        2017-10-01 08:00:00      3
##  9 sxk97        2017-10-01 09:00:00      3
## 10 sxk97        2017-10-01 10:00:00      3
## # ... with 4,215 more rows

From the quick check, we knew that the time series data has not completed yet.
So, we would like to make it complete by doing time series padding.

2.2 Time Series Padding

As mentioned before, we will complete time series data. E.g. we will fill data 2017-10-01 07:00:00 with demand = 0.

sxk97 2017-10-01 06:00:00 1
sxk97 2017-10-01 08:00:00 3

## pad applied on the interval: hour
## # A tibble: 4,536 x 3
##    src_sub_area datetime            demand
##    <chr>        <dttm>               <dbl>
##  1 sxk97        2017-10-01 00:00:00      6
##  2 sxk97        2017-10-01 01:00:00      4
##  3 sxk97        2017-10-01 02:00:00      9
##  4 sxk97        2017-10-01 03:00:00      2
##  5 sxk97        2017-10-01 04:00:00      5
##  6 sxk97        2017-10-01 05:00:00      4
##  7 sxk97        2017-10-01 06:00:00      1
##  8 sxk97        2017-10-01 07:00:00      0
##  9 sxk97        2017-10-01 08:00:00      3
## 10 sxk97        2017-10-01 09:00:00      3
## # ... with 4,526 more rows

3 Cross-Validation

Now, the data has already complete. It contains src_sub_area, datetime, and demand columns. The dates are ranging from:

## [1] "2017-10-01 00:00:00 UTC" "2017-12-02 23:00:00 UTC"

Here’s some example data from last month:

3.1 Prepare Data

3.1.1 THE SCENARIO

To help us benchmark multiple models, we will need to split some portion of our data for validation. The strategy here is to cut the last 1 week as our test dataset. Then, we will cut again some (bigger) portion as the train dataset, say, 1 weeks times 8–approximately 2 months. This strategy is just the simplified version of Rolling Forecasting Origin, with only having one pair of train and test sample.

So the first step here is to get the start and end date of the train and test sample. The most straighforward way is to define the train and test size, then recursively get the start and end index for each:

We combine the start and end indices into a date interval:

## [1] 2017-10-01 UTC--2017-11-25 23:00:00 UTC
## [1] 2017-11-26 UTC--2017-12-02 23:00:00 UTC

Here’s the data after splitting process: data train & test

3.1.2 DATA PREPROCESS USING RECIPES

Data preprocessing is a very crucial step in time series model fitting. In this Capstone project, we will use recipes package for data preprocessing and comparing multiple preprocess specifications.

Since recipe package work columnwise, we need to convert our data into a wide format first:

## # A tibble: 1,512 x 4
##    datetime            sxk97 sxk9e sxk9s
##    <dttm>              <dbl> <dbl> <dbl>
##  1 2017-10-01 00:00:00     6     8    10
##  2 2017-10-01 01:00:00     4     2     3
##  3 2017-10-01 02:00:00     9     3     1
##  4 2017-10-01 03:00:00     2     2     0
##  5 2017-10-01 04:00:00     5     1     0
##  6 2017-10-01 05:00:00     4     2     0
##  7 2017-10-01 06:00:00     1     1     0
##  8 2017-10-01 07:00:00     0     0     1
##  9 2017-10-01 08:00:00     3     2     5
## 10 2017-10-01 09:00:00     3    11     5
## # ... with 1,502 more rows

Then we could start to define the preprocess recipe(), and bake() our data based on the defined recipe:

## # A tibble: 1,512 x 4
##    datetime             sxk97  sxk9e  sxk9s
##    <dttm>               <dbl>  <dbl>  <dbl>
##  1 2017-10-01 00:00:00 -0.594 -0.697 -0.111
##  2 2017-10-01 01:00:00 -0.841 -1.28  -0.780
##  3 2017-10-01 02:00:00 -0.291 -1.15  -1.12 
##  4 2017-10-01 03:00:00 -1.16  -1.28  -1.59 
##  5 2017-10-01 04:00:00 -0.711 -1.44  -1.59 
##  6 2017-10-01 05:00:00 -0.841 -1.28  -1.59 
##  7 2017-10-01 06:00:00 -1.39  -1.44  -1.59 
##  8 2017-10-01 07:00:00 -1.94  -1.85  -1.12 
##  9 2017-10-01 08:00:00 -0.989 -1.28  -0.544
## 10 2017-10-01 09:00:00 -0.989 -0.497 -0.544
## # ... with 1,502 more rows

If we use recipes package, the next steps is to create a revert back function:

This revert back function will convert back the data to its original form.

Now we can convert our data into the long format again:

## # A tibble: 4,536 x 3
##    datetime            src_sub_area demand
##    <dttm>              <chr>         <dbl>
##  1 2017-10-01 00:00:00 sxk97        -0.594
##  2 2017-10-01 01:00:00 sxk97        -0.841
##  3 2017-10-01 02:00:00 sxk97        -0.291
##  4 2017-10-01 03:00:00 sxk97        -1.16 
##  5 2017-10-01 04:00:00 sxk97        -0.711
##  6 2017-10-01 05:00:00 sxk97        -0.841
##  7 2017-10-01 06:00:00 sxk97        -1.39 
##  8 2017-10-01 07:00:00 sxk97        -1.94 
##  9 2017-10-01 08:00:00 sxk97        -0.989
## 10 2017-10-01 09:00:00 sxk97        -0.989
## # ... with 4,526 more rows

3.1.3 NESTED MODEL FITTING AND FORECASTING

In functional programming using purrr, we need to convert our tbl into a nested tbl. Nested data is like a table inside a table, which could be controlled using a key indicator; in other words, we can have tbl for each subarea and samples. Using this format, the fitting and forecasting process would be very versatile, yet we can still convert the results as long as we have a proper key like provider.

Let’s start by converting our tbl into a nested tbl. First, we need to add sample indicator so it could be recognized as a key when we nest() the data:

## # A tibble: 3 x 3
##   src_sub_area test               train               
##   <chr>        <list>             <list>              
## 1 sxk97        <tibble [168 x 2]> <tibble [1,344 x 2]>
## 2 sxk9e        <tibble [168 x 2]> <tibble [1,344 x 2]>
## 3 sxk9s        <tibble [168 x 2]> <tibble [1,344 x 2]>
## # A tibble: 3 x 3
##   src_sub_area demand data              
##   <chr>        <lgl>  <list>            
## 1 sxk97        NA     <tibble [168 x 1]>
## 2 sxk9e        NA     <tibble [168 x 1]>
## 3 sxk9s        NA     <tibble [168 x 1]>

3.1.4 PREPARING THE DATA MODEL LIST

For data and model combination, we could start by defining some options for data representation. Recall that our series have a relatively high frequency, so we could consider two option of data representation here:
- a ts object with daily seasonality (24 hours), and
- an msts with daily and weekly seasonality (24 * 7).

To incorporate them into our nested data, we need to create another nested data frame containing the data daily and weekly, after that we will do a function for converting the data into the specified data. In this case, we would like to compare multiple seasonality specifications.

Let’s start with making a named list containing the transformation functions:

## $ts
## function (x) 
## ts(x$demand, frequency = 24)
## 
## $msts
## function (x) 
## msts(x$demand, seasonal.periods = c(24, 24 * 7))

Then we convert the list into a tbl using enframe().
Note that we should also give a key – which is the src_sub_area in our case – so we could use left_join() later.

Use rep() function:

## # A tibble: 6 x 3
##   data_fun_name data_fun src_sub_area
##   <chr>         <list>   <chr>       
## 1 ts            <fn>     sxk97       
## 2 msts          <fn>     sxk97       
## 3 ts            <fn>     sxk9e       
## 4 msts          <fn>     sxk9e       
## 5 ts            <fn>     sxk9s       
## 6 msts          <fn>     sxk9s

Then the last steps here is to join the nested function with our Nested Model Fitting:

## Joining, by = "src_sub_area"
## # A tibble: 6 x 5
##   src_sub_area test               train              data_fun_name data_fun
##   <chr>        <list>             <list>             <chr>         <list>  
## 1 sxk97        <tibble [168 x 2]> <tibble [1,344 x ~ ts            <fn>    
## 2 sxk97        <tibble [168 x 2]> <tibble [1,344 x ~ msts          <fn>    
## 3 sxk9e        <tibble [168 x 2]> <tibble [1,344 x ~ ts            <fn>    
## 4 sxk9e        <tibble [168 x 2]> <tibble [1,344 x ~ msts          <fn>    
## 5 sxk9s        <tibble [168 x 2]> <tibble [1,344 x ~ ts            <fn>    
## 6 sxk9s        <tibble [168 x 2]> <tibble [1,344 x ~ msts          <fn>

3.1.5 PREPARING THE TIME SERIES MODEL LIST

Similar to when we create the data representation (daily and weekly) list, we also make some time series models as a nested list.

Again, let’s start by making a list of models. We would like to compare multiple model specifications. For this Capstone Project, we will use auto.arima(), ets(), stlm(), and tbats(). We need to make some functions to call those model, and store them inside a list:

## $auto.arima
## function (x) 
## auto.arima(x)
## 
## $ets
## function (x) 
## ets(x)
## 
## $stlm
## function (x) 
## stlm(x)
## 
## $tbats
## function (x) 
## tbats(x, use.box.cox = FALSE)

Then we can convert it into a nested format like previous example:

## # A tibble: 12 x 3
##    model_name model  src_sub_area
##    <chr>      <list> <chr>       
##  1 auto.arima <fn>   sxk97       
##  2 ets        <fn>   sxk97       
##  3 stlm       <fn>   sxk97       
##  4 tbats      <fn>   sxk97       
##  5 auto.arima <fn>   sxk9e       
##  6 ets        <fn>   sxk9e       
##  7 stlm       <fn>   sxk9e       
##  8 tbats      <fn>   sxk9e       
##  9 auto.arima <fn>   sxk9s       
## 10 ets        <fn>   sxk9s       
## 11 stlm       <fn>   sxk9s       
## 12 tbats      <fn>   sxk9s

And finally, we can join the result into our nested data. We also apply some rule here:
since ets() and auto.arima() are not suitable for data with multiple seasonality time series, we use filter to remove them out:

## Joining, by = "src_sub_area"
## # A tibble: 18 x 7
##    src_sub_area test      train     data_fun_name data_fun model_name model
##    <chr>        <list>    <list>    <chr>         <list>   <chr>      <lis>
##  1 sxk97        <tibble ~ <tibble ~ ts            <fn>     auto.arima <fn> 
##  2 sxk97        <tibble ~ <tibble ~ ts            <fn>     ets        <fn> 
##  3 sxk97        <tibble ~ <tibble ~ ts            <fn>     stlm       <fn> 
##  4 sxk97        <tibble ~ <tibble ~ ts            <fn>     tbats      <fn> 
##  5 sxk97        <tibble ~ <tibble ~ msts          <fn>     stlm       <fn> 
##  6 sxk97        <tibble ~ <tibble ~ msts          <fn>     tbats      <fn> 
##  7 sxk9e        <tibble ~ <tibble ~ ts            <fn>     auto.arima <fn> 
##  8 sxk9e        <tibble ~ <tibble ~ ts            <fn>     ets        <fn> 
##  9 sxk9e        <tibble ~ <tibble ~ ts            <fn>     stlm       <fn> 
## 10 sxk9e        <tibble ~ <tibble ~ ts            <fn>     tbats      <fn> 
## 11 sxk9e        <tibble ~ <tibble ~ msts          <fn>     stlm       <fn> 
## 12 sxk9e        <tibble ~ <tibble ~ msts          <fn>     tbats      <fn> 
## 13 sxk9s        <tibble ~ <tibble ~ ts            <fn>     auto.arima <fn> 
## 14 sxk9s        <tibble ~ <tibble ~ ts            <fn>     ets        <fn> 
## 15 sxk9s        <tibble ~ <tibble ~ ts            <fn>     stlm       <fn> 
## 16 sxk9s        <tibble ~ <tibble ~ ts            <fn>     tbats      <fn> 
## 17 sxk9s        <tibble ~ <tibble ~ msts          <fn>     stlm       <fn> 
## 18 sxk9s        <tibble ~ <tibble ~ msts          <fn>     tbats      <fn>

3.1.6 EXECUTE THE NESTED FITTING

To execute the model fitting, we need to wrap up the needed arguments as a list using map() function. Then, we could call the function using invoke_map().

We need to do this for data transformation using the function inside data_fun, then continue to fit the model with the same process using the function inside model.

Note: this function will take some time, so we already save it into RDS file, we can only readRDS and get the result faster.

## # A tibble: 18 x 8
##    src_sub_area test   train data_fun_name data_fun model_name model fitted
##    <chr>        <list> <lis> <chr>         <list>   <chr>      <lis> <list>
##  1 sxk97        <tibb~ <tib~ ts            <fn>     auto.arima <fn>  <ARIM~
##  2 sxk97        <tibb~ <tib~ ts            <fn>     ets        <fn>  <ets> 
##  3 sxk97        <tibb~ <tib~ ts            <fn>     stlm       <fn>  <stlm>
##  4 sxk97        <tibb~ <tib~ ts            <fn>     tbats      <fn>  <tbat~
##  5 sxk97        <tibb~ <tib~ msts          <fn>     stlm       <fn>  <stlm>
##  6 sxk97        <tibb~ <tib~ msts          <fn>     tbats      <fn>  <tbat~
##  7 sxk9e        <tibb~ <tib~ ts            <fn>     auto.arima <fn>  <ARIM~
##  8 sxk9e        <tibb~ <tib~ ts            <fn>     ets        <fn>  <ets> 
##  9 sxk9e        <tibb~ <tib~ ts            <fn>     stlm       <fn>  <stlm>
## 10 sxk9e        <tibb~ <tib~ ts            <fn>     tbats      <fn>  <tbat~
## 11 sxk9e        <tibb~ <tib~ msts          <fn>     stlm       <fn>  <stlm>
## 12 sxk9e        <tibb~ <tib~ msts          <fn>     tbats      <fn>  <tbat~
## 13 sxk9s        <tibb~ <tib~ ts            <fn>     auto.arima <fn>  <ARIM~
## 14 sxk9s        <tibb~ <tib~ ts            <fn>     ets        <fn>  <ets> 
## 15 sxk9s        <tibb~ <tib~ ts            <fn>     stlm       <fn>  <stlm>
## 16 sxk9s        <tibb~ <tib~ ts            <fn>     tbats      <fn>  <tbat~
## 17 sxk9s        <tibb~ <tib~ msts          <fn>     stlm       <fn>  <stlm>
## 18 sxk9s        <tibb~ <tib~ msts          <fn>     tbats      <fn>  <tbat~
## # A tibble: 18 x 9
##    src_sub_area test  train data_fun_name data_fun model_name model fitted
##    <chr>        <lis> <lis> <chr>         <list>   <chr>      <lis> <list>
##  1 sxk97        <tib~ <tib~ msts          <fn>     tbats      <fn>  <tbat~
##  2 sxk97        <tib~ <tib~ msts          <fn>     stlm       <fn>  <stlm>
##  3 sxk97        <tib~ <tib~ ts            <fn>     tbats      <fn>  <tbat~
##  4 sxk97        <tib~ <tib~ ts            <fn>     ets        <fn>  <ets> 
##  5 sxk97        <tib~ <tib~ ts            <fn>     stlm       <fn>  <stlm>
##  6 sxk97        <tib~ <tib~ ts            <fn>     auto.arima <fn>  <ARIM~
##  7 sxk9e        <tib~ <tib~ msts          <fn>     tbats      <fn>  <tbat~
##  8 sxk9e        <tib~ <tib~ msts          <fn>     stlm       <fn>  <stlm>
##  9 sxk9e        <tib~ <tib~ ts            <fn>     ets        <fn>  <ets> 
## 10 sxk9e        <tib~ <tib~ ts            <fn>     tbats      <fn>  <tbat~
## 11 sxk9e        <tib~ <tib~ ts            <fn>     stlm       <fn>  <stlm>
## 12 sxk9e        <tib~ <tib~ ts            <fn>     auto.arima <fn>  <ARIM~
## 13 sxk9s        <tib~ <tib~ msts          <fn>     tbats      <fn>  <tbat~
## 14 sxk9s        <tib~ <tib~ msts          <fn>     stlm       <fn>  <stlm>
## 15 sxk9s        <tib~ <tib~ ts            <fn>     tbats      <fn>  <tbat~
## 16 sxk9s        <tib~ <tib~ ts            <fn>     auto.arima <fn>  <ARIM~
## 17 sxk9s        <tib~ <tib~ ts            <fn>     stlm       <fn>  <stlm>
## 18 sxk9s        <tib~ <tib~ ts            <fn>     ets        <fn>  <ets> 
## # ... with 1 more variable: error <dbl>
## # A tibble: 18 x 4
##    src_sub_area data_fun_name model_name error
##    <chr>        <chr>         <chr>      <dbl>
##  1 sxk97        msts          tbats      0.488
##  2 sxk97        msts          stlm       0.543
##  3 sxk97        ts            tbats      0.573
##  4 sxk97        ts            ets        0.574
##  5 sxk97        ts            stlm       0.600
##  6 sxk97        ts            auto.arima 0.702
##  7 sxk9e        msts          tbats      0.359
##  8 sxk9e        msts          stlm       0.366
##  9 sxk9e        ts            ets        0.419
## 10 sxk9e        ts            tbats      0.422
## 11 sxk9e        ts            stlm       0.438
## 12 sxk9e        ts            auto.arima 0.502
## 13 sxk9s        msts          tbats      0.421
## 14 sxk9s        msts          stlm       0.428
## 15 sxk9s        ts            tbats      0.431
## 16 sxk9s        ts            auto.arima 0.503
## 17 sxk9s        ts            stlm       0.540
## 18 sxk9s        ts            ets        0.550

Get the best model by choosing the smallest error:

Do left join in order to get the whole information regarding the Best model:

## Joining, by = c("src_sub_area", "data_fun_name", "model_name")
## # A tibble: 3 x 8
##   src_sub_area data_fun_name model_name test   train  data_fun model fitted
##   <chr>        <chr>         <chr>      <list> <list> <list>   <lis> <list>
## 1 sxk97        msts          tbats      <tibb~ <tibb~ <fn>     <fn>  <tbat~
## 2 sxk9e        msts          tbats      <tibb~ <tibb~ <fn>     <fn>  <tbat~
## 3 sxk9s        msts          tbats      <tibb~ <tibb~ <fn>     <fn>  <tbat~

3.1.7 UNNESTING THE RESULT

Beside measuring the error, we can also compare the forecast results to the real test series through graphical analysis. But to do that, we need to make a tbl containing our forecast to the test dataset, then do some spread-gather tricks to make a set of keys that unique for each data representations, models, and also one for the forecast itself. If we get to that format, we could conveniently unnest() the data into a proper long format

Let’s start with creating the forecast:

## # A tibble: 18 x 11
##    src_sub_area test  train data_fun_name data_fun model_name model fitted
##    <chr>        <lis> <lis> <chr>         <list>   <chr>      <lis> <list>
##  1 sxk97        <tib~ <tib~ msts          <fn>     tbats      <fn>  <tbat~
##  2 sxk97        <tib~ <tib~ msts          <fn>     stlm       <fn>  <stlm>
##  3 sxk97        <tib~ <tib~ ts            <fn>     tbats      <fn>  <tbat~
##  4 sxk97        <tib~ <tib~ ts            <fn>     ets        <fn>  <ets> 
##  5 sxk97        <tib~ <tib~ ts            <fn>     stlm       <fn>  <stlm>
##  6 sxk97        <tib~ <tib~ ts            <fn>     auto.arima <fn>  <ARIM~
##  7 sxk9e        <tib~ <tib~ msts          <fn>     tbats      <fn>  <tbat~
##  8 sxk9e        <tib~ <tib~ msts          <fn>     stlm       <fn>  <stlm>
##  9 sxk9e        <tib~ <tib~ ts            <fn>     ets        <fn>  <ets> 
## 10 sxk9e        <tib~ <tib~ ts            <fn>     tbats      <fn>  <tbat~
## 11 sxk9e        <tib~ <tib~ ts            <fn>     stlm       <fn>  <stlm>
## 12 sxk9e        <tib~ <tib~ ts            <fn>     auto.arima <fn>  <ARIM~
## 13 sxk9s        <tib~ <tib~ msts          <fn>     tbats      <fn>  <tbat~
## 14 sxk9s        <tib~ <tib~ msts          <fn>     stlm       <fn>  <stlm>
## 15 sxk9s        <tib~ <tib~ ts            <fn>     tbats      <fn>  <tbat~
## 16 sxk9s        <tib~ <tib~ ts            <fn>     auto.arima <fn>  <ARIM~
## 17 sxk9s        <tib~ <tib~ ts            <fn>     stlm       <fn>  <stlm>
## 18 sxk9s        <tib~ <tib~ ts            <fn>     ets        <fn>  <ets> 
## # ... with 3 more variables: error <dbl>, forecast <list>, key <chr>

Then do some spread-gather to create a proper key:

## # A tibble: 21 x 3
##    src_sub_area key           value             
##    <chr>        <chr>         <list>            
##  1 sxk97        actual        <tibble [168 x 2]>
##  2 sxk9e        actual        <tibble [168 x 2]>
##  3 sxk9s        actual        <tibble [168 x 2]>
##  4 sxk97        msts-stlm     <tibble [168 x 2]>
##  5 sxk9e        msts-stlm     <tibble [168 x 2]>
##  6 sxk9s        msts-stlm     <tibble [168 x 2]>
##  7 sxk97        msts-tbats    <tibble [168 x 2]>
##  8 sxk9e        msts-tbats    <tibble [168 x 2]>
##  9 sxk9s        msts-tbats    <tibble [168 x 2]>
## 10 sxk97        ts-auto.arima <tibble [168 x 2]>
## # ... with 11 more rows

The last but not least, unnest() the series data, and apply the revert back function:

## # A tibble: 3,528 x 4
##    src_sub_area key    datetime            demand
##    <chr>        <chr>  <dttm>               <dbl>
##  1 sxk97        actual 2017-11-26 00:00:00     38
##  2 sxk97        actual 2017-11-26 01:00:00     21
##  3 sxk97        actual 2017-11-26 02:00:00     10
##  4 sxk97        actual 2017-11-26 03:00:00     10
##  5 sxk97        actual 2017-11-26 04:00:00      5
##  6 sxk97        actual 2017-11-26 05:00:00      2
##  7 sxk97        actual 2017-11-26 06:00:00      0
##  8 sxk97        actual 2017-11-26 07:00:00     11
##  9 sxk97        actual 2017-11-26 08:00:00      4
## 10 sxk97        actual 2017-11-26 09:00:00      4
## # ... with 3,518 more rows

With the resulting tbl, we can compare the forecast and actual data on test like this:

4 Automated Model Selection

AUTOMATE THE MODEL SELECTION

As you can see from the plot results, it is hard to decide which model we want to use based on graphical comparation. Then the most straighforward solution is to use the model with the least error.

It is very simple to do the model selection, we only need some basic dplyr grammars to filter() the model with lowest error:

## # A tibble: 3 x 8
##   src_sub_area test    train  data_fun_name data_fun model_name model error
##   <chr>        <list>  <list> <chr>         <list>   <chr>      <lis> <dbl>
## 1 sxk97        <tibbl~ <tibb~ msts          <fn>     tbats      <fn>  0.488
## 2 sxk9e        <tibbl~ <tibb~ msts          <fn>     tbats      <fn>  0.359
## 3 sxk9s        <tibbl~ <tibb~ msts          <fn>     tbats      <fn>  0.421

Based on previous process, we conclude that the Automated best model is using tbats. There are two interesting time series forecasting methods called BATS and TBATS [1] that are capable of modeling time series with multiple seasonalities.

The names are acronyms for key features of the models: Trigonometric seasonality, Box-Cox transformation, ARMA errors, Trend and Seasonal components.

TBATS model takes it roots in exponential smoothing methods. Each seasonality is modeled by a trigonometric representation based on Fourier series. One major advantage of this approach is that it requires only 2 seed states regardless of the length of period. Another advantage is the ability to model seasonal effects of non-integer lengths.

5 Prediction Performance

PERFORM THE FINAL FORECAST

After we have the final model, then finally we can proceed to the final forecast. For the final forecast, we can do the same process as in model fitting, but this time we will use train and test data as our new “full data”.

Now let’s start by recombine the train and test dataset:

## # A tibble: 3 x 7
##   src_sub_area fulldata       data_fun_name data_fun model_name model error
##   <chr>        <list>         <chr>         <list>   <chr>      <lis> <dbl>
## 1 sxk97        <tibble [1,51~ msts          <fn>     tbats      <fn>  0.488
## 2 sxk9e        <tibble [1,51~ msts          <fn>     tbats      <fn>  0.359
## 3 sxk9s        <tibble [1,51~ msts          <fn>     tbats      <fn>  0.421

Then do the same nested fitting as in previous example:

## # A tibble: 3 x 8
##   src_sub_area fulldata data_fun_name model_name data_fun model error
##   <chr>        <list>   <chr>         <chr>      <list>   <lis> <dbl>
## 1 sxk9e        <tibble~ msts          tbats      <fn>     <fn>  0.359
## 2 sxk9s        <tibble~ msts          tbats      <fn>     <fn>  0.421
## 3 sxk97        <tibble~ msts          tbats      <fn>     <fn>  0.488
## # ... with 1 more variable: fitted <list>

Next, let’s make a tbl containing each of our forecast results, and convert our nested data to a proper long format:

## # A tibble: 3 x 9
##   src_sub_area fulldata data_fun_name model_name data_fun model error
##   <chr>        <list>   <chr>         <chr>      <list>   <lis> <dbl>
## 1 sxk9e        <tibble~ msts          tbats      <fn>     <fn>  0.359
## 2 sxk9s        <tibble~ msts          tbats      <fn>     <fn>  0.421
## 3 sxk97        <tibble~ msts          tbats      <fn>     <fn>  0.488
## # ... with 2 more variables: fitted <list>, forecast <list>

Finally, we can unnest() the data to get the result:

## # A tibble: 6,552 x 4
##    src_sub_area key    datetime            demand
##    <chr>        <chr>  <dttm>               <dbl>
##  1 sxk9e        actual 2017-10-01 00:00:00      8
##  2 sxk9e        actual 2017-10-01 01:00:00      2
##  3 sxk9e        actual 2017-10-01 02:00:00      3
##  4 sxk9e        actual 2017-10-01 03:00:00      2
##  5 sxk9e        actual 2017-10-01 04:00:00      1
##  6 sxk9e        actual 2017-10-01 05:00:00      2
##  7 sxk9e        actual 2017-10-01 06:00:00      1
##  8 sxk9e        actual 2017-10-01 07:00:00      0
##  9 sxk9e        actual 2017-10-01 08:00:00      2
## 10 sxk9e        actual 2017-10-01 09:00:00     11
## # ... with 6,542 more rows

We show the plot of forecast results for each provider in a proper long format:

## [1] "2017-10-01 00:00:00 UTC" "2017-12-30 23:00:00 UTC"
## # A tibble: 504 x 3
##    src_sub_area datetime            demand
##    <chr>        <dttm>               <dbl>
##  1 sxk9e        2017-11-26 00:00:00     21
##  2 sxk9e        2017-11-26 01:00:00     15
##  3 sxk9e        2017-11-26 02:00:00     12
##  4 sxk9e        2017-11-26 03:00:00      7
##  5 sxk9e        2017-11-26 04:00:00      0
##  6 sxk9e        2017-11-26 05:00:00      0
##  7 sxk9e        2017-11-26 06:00:00      1
##  8 sxk9e        2017-11-26 07:00:00      1
##  9 sxk9e        2017-11-26 08:00:00     14
## 10 sxk9e        2017-11-26 09:00:00     13
## # ... with 494 more rows

Mean Absolute Error (MAE) Result for data test

## # A tibble: 4 x 2
##   src_sub_area error
##   <chr>        <dbl>
## 1 sxk97        0.488
## 2 sxk9e        0.359
## 3 sxk9s        0.421
## 4 all sub-area 0.423

5.1 Data Submission

At the plot above, data shown in the interval: 2017-10-01 00:00:00 - 2017-12-30 23:00:00. We will just prepare 1 week forecast after data train: 2017-12-03 00:00:00 - 2017-12-09 23:00:00

## # A tibble: 504 x 3
##    src_sub_area datetime            demand
##    <chr>        <dttm>              <lgl> 
##  1 sxk97        2017-12-03 00:00:00 NA    
##  2 sxk97        2017-12-03 01:00:00 NA    
##  3 sxk97        2017-12-03 02:00:00 NA    
##  4 sxk97        2017-12-03 03:00:00 NA    
##  5 sxk97        2017-12-03 04:00:00 NA    
##  6 sxk97        2017-12-03 05:00:00 NA    
##  7 sxk97        2017-12-03 06:00:00 NA    
##  8 sxk97        2017-12-03 07:00:00 NA    
##  9 sxk97        2017-12-03 08:00:00 NA    
## 10 sxk97        2017-12-03 09:00:00 NA    
## # ... with 494 more rows
## [1] "2017-12-03 00:00:00 UTC" "2017-12-09 23:00:00 UTC"

Forecast our data submission using our best model:

Select only the required colums:

## # A tibble: 504 x 3
##    src_sub_area datetime            demand
##    <chr>        <dttm>               <dbl>
##  1 sxk97        2017-12-03 00:00:00     18
##  2 sxk97        2017-12-03 01:00:00     15
##  3 sxk97        2017-12-03 02:00:00     11
##  4 sxk97        2017-12-03 03:00:00      7
##  5 sxk97        2017-12-03 04:00:00      5
##  6 sxk97        2017-12-03 05:00:00      3
##  7 sxk97        2017-12-03 06:00:00      2
##  8 sxk97        2017-12-03 07:00:00      6
##  9 sxk97        2017-12-03 08:00:00     12
## 10 sxk97        2017-12-03 09:00:00     12
## # ... with 494 more rows

Checking the MAE Result for data submission

because we don’t have actual data (truth) from our data submission, we check the MAE in Algoritma shiny: https://algoritma.shinyapps.io/leaderboard_capsml/
And the result:

Submission’s Model Performance

Submission’s Model Performance

Plot the submission forecast - only :

6 Conclusion

  1. Working with time series, please be prepared with the data preprocess. First of all, we have to check and make data complete, pad() is one of the function that helps us a lot to complete the data.
  2. Checking the data including setting the seasonality that we want to achieve
  3. By using automated model selection, we get the model easier and make us forecast the result faster.
  4. ets() and auto.arima() are not suitable for multiple seasonality time series data.
  5. Demand in each / all subarea tend to follow the multi seasonal model that we’ve created. Because the data is multiple seasonality, we can check our model performance using MAE function - not only visual checking via plot.

Note:

Special thanks to @teamalgoritma who helps a lot in learning, preparing and completing Machine Learning course - especially in this Capstone Project.

(c) Meilinie - June 2019