Scotty is a ride-sharing business operating in several big cities in Turkey. The company provides a motorcycle ride-sharing service for Turkish citizens and puts a premium on traveling efficiently through traffic; the app even references Star Trek's "beam me up" in its order buttons.
Scotty provided us with a real-time transaction dataset. With this dataset, we are going to help them solve their problems and improve their business processes.
Scotty: "Bring me the crystal ball!" It is almost the end of 2017, and we need to prepare a forecast model to help Scotty get ready for the end-of-year demand. Unfortunately, Scotty is not old enough to have last December's data, so we cannot look back at past demand to prepare a forecast for December. Fortunately, time series analysis is more than enough to help us forecast! And, as an investment in the business's future, we will develop an automated forecasting framework so we no longer have to meddle with forecast model selection in the future.
Build an automated forecasting model for hourly demand, to be evaluated on the next 7 days (Sunday, December 3rd 2017 to Saturday, December 9th 2017)!
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.1.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(padr)
library(tidymodels)
## Registered S3 method overwritten by 'tune':
## method from
## required_pkgs.model_spec parsnip
## -- Attaching packages -------------------------------------- tidymodels 0.1.4 --
## v broom 0.7.10 v rsample 0.1.1
## v dials 0.0.10 v tune 0.1.6
## v infer 1.0.0 v workflows 0.2.4
## v modeldata 0.1.1 v workflowsets 0.1.0
## v parsnip 0.1.7 v yardstick 0.0.9
## v recipes 0.1.17
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x scales::discard() masks purrr::discard()
## x dplyr::filter() masks stats::filter()
## x recipes::fixed() masks stringr::fixed()
## x dplyr::lag() masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step() masks stats::step()
## * Search for functions across packages at https://www.tidymodels.org/find/
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
##
## Attaching package: 'forecast'
## The following object is masked from 'package:yardstick':
##
## accuracy
library(readr)
library(DT)
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(dplyr)
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(tidyr)
library(recipes)
library(tibble)
library(purrr)
library(png)
library(grid)
library(tidyquant)
## Loading required package: PerformanceAnalytics
## Loading required package: xts
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
##
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
##
## legend
## Loading required package: quantmod
## Loading required package: TTR
##
## Attaching package: 'TTR'
## The following object is masked from 'package:dials':
##
## momentum
## == Need to Learn tidyquant? ====================================================
## Business Science offers a 1-hour course - Learning Lab #9: Performance Analysis & Portfolio Optimization with tidyquant!
## </> Learn more at: https://university.business-science.io/p/learning-labs-pro </>
scotty <- read.csv("data/data-train.csv")
head(scotty)
## id trip_id driver_id
## 1 59d005e1ffcfa261708ce9cd 59d005e9cb564761a8fe5d3e 59a892c5568be44b2734f276
## 2 59d0066affcfa261708ceb11 <NA> <NA>
## 3 59d006a1ffcfa261708ceba4 59d006c131e39c618969343d 599dc0dfa5b4fd5471ad8453
## 4 59d006d8cb564761a8fe5efd 59d007193d32b861760d4a77 59a5856573d56e7103b5d51d
## 5 59d0073ecb564761a8fe5fdc <NA> <NA>
## 6 59d007c3ffcfa261708ceee1 59d007de31e39c618969354f 599dc0dfa5b4fd5471ad8453
## rider_id start_time src_lat src_lon src_area
## 1 59ad2d6efba75a581666b506 2017-10-01T00:00:17Z 41.07047 29.01945 sxk9
## 2 59cd704bcf482f6ce2fadfdb 2017-10-01T00:02:34Z 41.07487 28.99528 sxk9
## 3 59bd62cc7e3c3b663924ba86 2017-10-01T00:03:29Z 41.04995 29.03107 sxk9
## 4 59c17cd1f6da2d274e16d0a7 2017-10-01T00:04:24Z 41.05287 28.99522 sxk9
## 5 596f47a8a294a82ea3e90e77 2017-10-01T00:06:06Z 41.06760 28.98827 sxk9
## 6 59bd62cc7e3c3b663924ba86 2017-10-01T00:08:19Z 41.04995 29.03107 sxk9
## src_sub_area dest_lat dest_lon dest_area dest_sub_area distance status
## 1 sxk9s 41.11716 29.03650 sxk9 sxk9u 5.379250 confirmed
## 2 sxk9e 41.08351 29.00228 sxk9 sxk9e 1.126098 nodrivers
## 3 sxk9s 41.04495 28.98192 sxk9 sxk9e 4.169492 confirmed
## 4 sxk9e 41.08140 28.98197 sxk9 sxk9e 3.358296 confirmed
## 5 sxk9e 41.02125 29.11316 sxk9 sxk9q 11.693573 nodrivers
## 6 sxk9s 41.04495 28.98192 sxk9 sxk9e 4.169492 confirmed
## confirmed_time_sec
## 1 8
## 2 0
## 3 32
## 4 65
## 5 0
## 6 27
The dataset includes information about each booking: identifiers (id, trip_id, driver_id, rider_id), the request timestamp (start_time), the source and destination coordinates and (sub-)areas, the trip distance, the order status (e.g. confirmed, nodrivers), and the confirmation time in seconds. For demand forecasting we only need the request time and the source sub-area:
scotty <- scotty %>%
select(start_time, src_sub_area)
Determine the steps to be performed in data preprocessing:
* Floor the start_time timestamps to the hour
train_date <- scotty %>%
mutate(datetime = ymd_hms(start_time),
datetime = floor_date(datetime, unit = "hour"))
tail(train_date)
## start_time src_sub_area datetime
## 90108 2017-12-02T23:57:29Z sxk97 2017-12-02 23:00:00
## 90109 2017-12-02T23:57:45Z sxk9s 2017-12-02 23:00:00
## 90110 2017-12-02T23:58:52Z sxk97 2017-12-02 23:00:00
## 90111 2017-12-02T23:58:55Z sxk9e 2017-12-02 23:00:00
## 90112 2017-12-02T23:58:59Z sxk9s 2017-12-02 23:00:00
## 90113 2017-12-02T23:59:20Z sxk9e 2017-12-02 23:00:00
* Aggregate the data by src_sub_area and datetime to count hourly demand
scotty_sum <- train_date %>%
group_by(datetime, src_sub_area) %>%
summarise(
demand = n()
)
## `summarise()` has grouped output by 'datetime'. You can override using the `.groups` argument.
scotty_sum
## # A tibble: 4,225 x 3
## # Groups: datetime [1,490]
## datetime src_sub_area demand
## <dttm> <chr> <int>
## 1 2017-10-01 00:00:00 sxk97 6
## 2 2017-10-01 00:00:00 sxk9e 8
## 3 2017-10-01 00:00:00 sxk9s 10
## 4 2017-10-01 01:00:00 sxk97 4
## 5 2017-10-01 01:00:00 sxk9e 2
## 6 2017-10-01 01:00:00 sxk9s 3
## 7 2017-10-01 02:00:00 sxk97 9
## 8 2017-10-01 02:00:00 sxk9e 3
## 9 2017-10-01 02:00:00 sxk9s 1
## 10 2017-10-01 03:00:00 sxk97 2
## # ... with 4,215 more rows
scotty_sum %>%
group_by(src_sub_area) %>%
summarise(
demand = n()
)
## # A tibble: 3 x 2
## src_sub_area demand
## <chr> <int>
## 1 sxk97 1439
## 2 sxk9e 1419
## 3 sxk9s 1367
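All three sub-areas have fewer rows than the number of hourly slots in the observation window, so some hours have no recorded demand at all; this is why we pad each series to a regular hourly grid below. As a quick check, the expected series length (the window runs from 2017-10-01 00:00 to 2017-12-02 23:00):
# hours between the first and last timestamp, inclusive
as.numeric(difftime(max(scotty_sum$datetime), min(scotty_sum$datetime),
                    units = "hours")) + 1 # 1512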
scotty_sum <- scotty_sum %>%
pad('hour',
start_val = as.POSIXct(min(scotty_sum$datetime)),
end_val = as.POSIXct(max(scotty_sum$datetime)),
group = "src_sub_area") %>%
mutate(
src_sub_area = as.factor(src_sub_area)
)
## Warning: group argument and dplyr::groups are both present and differ,
## dplyr::groups are ignored
scotty_sum %>%
group_by(src_sub_area) %>%
summarise(
demand = n()
)
## # A tibble: 3 x 2
## src_sub_area demand
## <fct> <int>
## 1 sxk97 1512
## 2 sxk9e 1512
## 3 sxk9s 1512
scotty_sum <- scotty_sum %>%
arrange(src_sub_area, datetime)
scotty_sum
## # A tibble: 4,536 x 3
## # Groups: datetime [1,512]
## datetime src_sub_area demand
## <dttm> <fct> <int>
## 1 2017-10-01 00:00:00 sxk97 6
## 2 2017-10-01 01:00:00 sxk97 4
## 3 2017-10-01 02:00:00 sxk97 9
## 4 2017-10-01 03:00:00 sxk97 2
## 5 2017-10-01 04:00:00 sxk97 5
## 6 2017-10-01 05:00:00 sxk97 4
## 7 2017-10-01 06:00:00 sxk97 1
## 8 2017-10-01 07:00:00 sxk97 NA
## 9 2017-10-01 08:00:00 sxk97 3
## 10 2017-10-01 09:00:00 sxk97 3
## # ... with 4,526 more rows
* Fill the NA demand counts on the newly padded time intervals with 0 (impute missing values)
# alternative: mutate(demand = replace_na(demand, 0))
scotty_sum %>%
  filter(is.na(demand)) %>%
  ungroup() %>%
  summarise(
    na_rows = n()
  )
## # A tibble: 1 x 1
## na_rows
## <int>
## 1 311
scotty_sum <- na.fill0(scotty_sum, fill = 0)
scotty_sum %>%
  filter(is.na(demand)) %>%
  ungroup() %>%
  summarise(
    na_rows = n()
  )
## # A tibble: 1 x 1
## na_rows
## <int>
## 1 0
scotty_sum
## # A tibble: 4,536 x 3
## # Groups: datetime [1,512]
## datetime src_sub_area demand
## <dttm> <fct> <int>
## 1 2017-10-01 00:00:00 sxk97 6
## 2 2017-10-01 01:00:00 sxk97 4
## 3 2017-10-01 02:00:00 sxk97 9
## 4 2017-10-01 03:00:00 sxk97 2
## 5 2017-10-01 04:00:00 sxk97 5
## 6 2017-10-01 05:00:00 sxk97 4
## 7 2017-10-01 06:00:00 sxk97 1
## 8 2017-10-01 07:00:00 sxk97 0
## 9 2017-10-01 08:00:00 sxk97 3
## 10 2017-10-01 09:00:00 sxk97 3
## # ... with 4,526 more rows
# test data: 504 obs -> 7 days * 24 hours * 3 sub-areas; train data: 4032 obs
scotty_sum %>%
filter(datetime > (max(scotty_sum$datetime) - hours(24 * 7 * 4))) %>%
ggplot(aes(x = datetime, y = demand)) +
geom_line() +
labs(x = NULL, y = NULL) +
facet_wrap(~ src_sub_area, scale = "free", ncol = 1) +
theme_light()
# sizes of the test and train sets for each sub-area (1512 - 168 = 1344 training hours)
test_size <- 24 * 7
train_size <- nrow(scotty_sum)/3 - test_size
# start and end timestamps for each sample
test_end <- max(scotty_sum$datetime)
test_start <- test_end - hours(test_size) + hours(1)
train_end <- test_start - hours(1)
train_start <- train_end - hours(train_size) + hours(1)
# interval for each sample
interval_train <- interval(train_start, train_end)
interval_test <- interval(test_start, test_end)
interval_train
## [1] 2017-10-01 UTC--2017-11-25 23:00:00 UTC
interval_test
## [1] 2017-11-26 UTC--2017-12-02 23:00:00 UTC
t <- scotty_sum %>%
mutate(sample = case_when(
datetime %within% interval_train ~ "train",
datetime %within% interval_test ~ "test"
)) %>%
mutate(sample = factor(sample, levels = c("train", "test"))) %>%
ggplot(aes(x = datetime, y = demand, colour = sample)) +
geom_line() +
labs(x = NULL, y = NULL, colour = NULL) +
facet_wrap(~ src_sub_area, scale = "free", ncol = 1) +
theme_light()
t
* Reshape to wide format (one column per sub-area), preprocess with a recipe, then build a nested data frame with the nest() function
scotty_sum_wide <- scotty_sum %>%
spread(src_sub_area, demand)
scotty_sum_wide
## # A tibble: 1,512 x 4
## # Groups: datetime [1,512]
## datetime sxk97 sxk9e sxk9s
## <dttm> <int> <int> <int>
## 1 2017-10-01 00:00:00 6 8 10
## 2 2017-10-01 01:00:00 4 2 3
## 3 2017-10-01 02:00:00 9 3 1
## 4 2017-10-01 03:00:00 2 2 0
## 5 2017-10-01 04:00:00 5 1 0
## 6 2017-10-01 05:00:00 4 2 0
## 7 2017-10-01 06:00:00 1 1 0
## 8 2017-10-01 07:00:00 0 0 1
## 9 2017-10-01 08:00:00 3 2 5
## 10 2017-10-01 09:00:00 3 11 5
## # ... with 1,502 more rows
scotty_recipe <- recipe(~.,
filter(scotty_sum_wide, datetime %within% interval_train)) %>%
step_sqrt(all_numeric()) %>%
step_center(all_numeric()) %>%
step_scale(all_numeric()) %>%
prep()
scotty_recipe
## Recipe
##
## Inputs:
##
## role #variables
## predictor 4
##
## Training data contained 1344 data points and no missing data.
##
## Operations:
##
## Square root transformation on sxk97, sxk9e, sxk9s [trained]
## Centering for sxk97, sxk9e, sxk9s [trained]
## Scaling for sxk97, sxk9e, sxk9s [trained]
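The trained centering means and scaling standard deviations live inside the prepped recipe, and we will read them back out of rec$steps when reverting forecasts to the original scale (see the rec_revert() helper below). For reference:
# trained parameters behind step_center and step_scale (steps 2 and 3)
scotty_recipe$steps[[2]]$means
scotty_recipe$steps[[3]]$sds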
# apply (bake) the trained recipe to the full wide data
scotty_sum_wide <- bake(scotty_recipe, scotty_sum_wide)
scotty_sum_wide
## # A tibble: 1,512 x 4
## datetime sxk97 sxk9e sxk9s
## <dttm> <dbl> <dbl> <dbl>
## 1 2017-10-01 00:00:00 -0.594 -0.697 -0.111
## 2 2017-10-01 01:00:00 -0.841 -1.28 -0.780
## 3 2017-10-01 02:00:00 -0.291 -1.15 -1.12
## 4 2017-10-01 03:00:00 -1.16 -1.28 -1.59
## 5 2017-10-01 04:00:00 -0.711 -1.44 -1.59
## 6 2017-10-01 05:00:00 -0.841 -1.28 -1.59
## 7 2017-10-01 06:00:00 -1.39 -1.44 -1.59
## 8 2017-10-01 07:00:00 -1.94 -1.85 -1.12
## 9 2017-10-01 08:00:00 -0.989 -1.28 -0.544
## 10 2017-10-01 09:00:00 -0.989 -0.497 -0.544
## # ... with 1,502 more rows
# helper to revert the recipe transformations (unscale -> uncenter -> square)
rec_revert <- function(vector, rec, varname) {
  # pull the trained values out of the recipe
  rec_center <- rec$steps[[2]]$means[varname]
  rec_scale <- rec$steps[[3]]$sds[varname]
  # invert the scale, center, and square-root steps
  results <- (vector * rec_scale + rec_center) ^ 2
  # round back to integer demand counts
  results <- round(results)
  results
}
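As a quick sanity check (illustrative, not part of the pipeline), reverting the baked values for one sub-area should recover the original integer demands:
# should reproduce the raw counts 6, 4, 9, 2, 5, 4 from the first rows
head(rec_revert(scotty_sum_wide$sxk97, scotty_recipe, "sxk97"))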
# convert back to long format
scotty_sum_long <- scotty_sum_wide %>%
gather(src_sub_area, demand, -datetime)
scotty_sum_long
## # A tibble: 4,536 x 3
## datetime src_sub_area demand
## <dttm> <chr> <dbl>
## 1 2017-10-01 00:00:00 sxk97 -0.594
## 2 2017-10-01 01:00:00 sxk97 -0.841
## 3 2017-10-01 02:00:00 sxk97 -0.291
## 4 2017-10-01 03:00:00 sxk97 -1.16
## 5 2017-10-01 04:00:00 sxk97 -0.711
## 6 2017-10-01 05:00:00 sxk97 -0.841
## 7 2017-10-01 06:00:00 sxk97 -1.39
## 8 2017-10-01 07:00:00 sxk97 -1.94
## 9 2017-10-01 08:00:00 sxk97 -0.989
## 10 2017-10-01 09:00:00 sxk97 -0.989
## # ... with 4,526 more rows
# label each observation as train or test
scotty_sum_nest <- scotty_sum_long %>%
mutate(sample = case_when(
datetime %within% interval_train ~ "train",
datetime %within% interval_test ~ "test"
)) %>%
drop_na()
# nest the data by sub-area and sample
scotty_sum_train_nest <- scotty_sum_nest %>%
group_by(src_sub_area, sample) %>%
nest(.key = "data") %>%
spread(sample, data)
## Warning: `.key` is deprecated
scotty_sum_train_nest
## # A tibble: 3 x 3
## # Groups: src_sub_area [3]
## src_sub_area test train
## <chr> <list> <list>
## 1 sxk97 <tibble [168 x 2]> <tibble [1,344 x 2]>
## 2 sxk9e <tibble [168 x 2]> <tibble [1,344 x 2]>
## 3 sxk9s <tibble [168 x 2]> <tibble [1,344 x 2]>
# list of functions that will be used to form a time series
ts_function_list <- list(
ts = function(x) ts(x$demand, frequency = 24),
msts = function(x) msts(x$demand, seasonal.periods = c(24, 24 * 7))
)
ts_function_list
## $ts
## function(x) ts(x$demand, frequency = 24)
##
## $msts
## function(x) msts(x$demand, seasonal.periods = c(24, 24 * 7))
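The two constructors differ only in how seasonality is declared: ts() carries a single daily period, while msts() records both the daily and weekly periods. A quick illustration on a hypothetical two-week series (x here is made up just to inspect the attributes):
# hypothetical series, only to inspect the seasonal attributes
x <- tibble(demand = rnorm(24 * 14))
frequency(ts_function_list$ts(x))      # 24
attr(ts_function_list$msts(x), "msts") # 24 168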
ts_function_list <- ts_function_list %>%
rep(length(unique(scotty_sum_nest$src_sub_area))) %>%
enframe("func_name", "func") %>%
mutate(
src_sub_area = sort(rep(unique(scotty_sum_nest$src_sub_area), length(unique(.$func_name))))
)
ts_function_list
## # A tibble: 6 x 3
## func_name func src_sub_area
## <chr> <list> <chr>
## 1 ts <fn> sxk97
## 2 msts <fn> sxk97
## 3 ts <fn> sxk9e
## 4 msts <fn> sxk9e
## 5 ts <fn> sxk9s
## 6 msts <fn> sxk9s
scotty_fit <- scotty_sum_train_nest %>%
left_join(ts_function_list)
## Joining, by = "src_sub_area"
# time series model list
model_function_list <- list(
ets = function(x) ets(x),
auto.arima = function(x) auto.arima(x),
stlm = function(x) stlm(x),
tbats = function(x) tbats(x, use.box.cox = FALSE)
)
model_function_list
## $ets
## function(x) ets(x)
##
## $auto.arima
## function(x) auto.arima(x)
##
## $stlm
## function(x) stlm(x)
##
## $tbats
## function(x) tbats(x, use.box.cox = FALSE)
# replicate the model list for each sub-area and combine into nested format
model_function_list <- model_function_list %>%
rep(length(unique(scotty_sum_nest$src_sub_area))) %>%
enframe("model_name", "model") %>%
mutate(src_sub_area =
sort(rep(unique(scotty_sum_nest$src_sub_area), length(unique(.$model_name))))
)
model_function_list
## # A tibble: 12 x 3
## model_name model src_sub_area
## <chr> <list> <chr>
## 1 ets <fn> sxk97
## 2 auto.arima <fn> sxk97
## 3 stlm <fn> sxk97
## 4 tbats <fn> sxk97
## 5 ets <fn> sxk9e
## 6 auto.arima <fn> sxk9e
## 7 stlm <fn> sxk9e
## 8 tbats <fn> sxk9e
## 9 ets <fn> sxk9s
## 10 auto.arima <fn> sxk9s
## 11 stlm <fn> sxk9s
## 12 tbats <fn> sxk9s
scotty_fit <- scotty_fit %>%
  left_join(model_function_list) %>%
  filter(
    # ets() and auto.arima() only handle a single seasonal period,
    # so drop their msts combinations
    !(model_name == "ets" & func_name == "msts"),
    !(model_name == "auto.arima" & func_name == "msts")
  )
## Joining, by = "src_sub_area"
scotty_fit
## # A tibble: 18 x 7
## # Groups: src_sub_area [3]
## src_sub_area test train func_name func model_name model
## <chr> <list> <list> <chr> <lis> <chr> <lis>
## 1 sxk97 <tibble [168 x 2]> <tibble [1,~ ts <fn> ets <fn>
## 2 sxk97 <tibble [168 x 2]> <tibble [1,~ ts <fn> auto.arima <fn>
## 3 sxk97 <tibble [168 x 2]> <tibble [1,~ ts <fn> stlm <fn>
## 4 sxk97 <tibble [168 x 2]> <tibble [1,~ ts <fn> tbats <fn>
## 5 sxk97 <tibble [168 x 2]> <tibble [1,~ msts <fn> stlm <fn>
## 6 sxk97 <tibble [168 x 2]> <tibble [1,~ msts <fn> tbats <fn>
## 7 sxk9e <tibble [168 x 2]> <tibble [1,~ ts <fn> ets <fn>
## 8 sxk9e <tibble [168 x 2]> <tibble [1,~ ts <fn> auto.arima <fn>
## 9 sxk9e <tibble [168 x 2]> <tibble [1,~ ts <fn> stlm <fn>
## 10 sxk9e <tibble [168 x 2]> <tibble [1,~ ts <fn> tbats <fn>
## 11 sxk9e <tibble [168 x 2]> <tibble [1,~ msts <fn> stlm <fn>
## 12 sxk9e <tibble [168 x 2]> <tibble [1,~ msts <fn> tbats <fn>
## 13 sxk9s <tibble [168 x 2]> <tibble [1,~ ts <fn> ets <fn>
## 14 sxk9s <tibble [168 x 2]> <tibble [1,~ ts <fn> auto.arima <fn>
## 15 sxk9s <tibble [168 x 2]> <tibble [1,~ ts <fn> stlm <fn>
## 16 sxk9s <tibble [168 x 2]> <tibble [1,~ ts <fn> tbats <fn>
## 17 sxk9s <tibble [168 x 2]> <tibble [1,~ msts <fn> stlm <fn>
## 18 sxk9s <tibble [168 x 2]> <tibble [1,~ msts <fn> tbats <fn>
scotty_fit <- scotty_fit %>%
mutate(
params = map(train, ~ list(x = .x)),
data = invoke_map(func, params),
params = map(data, ~ list(x = .x)),
fitted = invoke_map(model, params)
) %>%
select(-data, -params)
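A note on invoke_map(): it has since been retired in purrr. An equivalent formulation with map2() (a sketch, not run here) would be:
# scotty_fit <- scotty_fit %>%
#   mutate(
#     data   = map2(func, train, ~ .x(.y)),  # build ts/msts objects from train
#     fitted = map2(model, data, ~ .x(.y))   # fit each model function
#   ) %>%
#   select(-data)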
# calculate test errors; note the MAE is on the sqrt-centered-scaled series,
# not on raw demand counts
scotty_model_error <- scotty_fit %>%
  mutate(error =
    map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
      map2_dbl(test, ~ mae_vec(truth = .y$demand, estimate = .x$mean))
  ) %>%
  arrange(src_sub_area, error) # arrange by smallest error within each area
scotty_model_error
## # A tibble: 18 x 9
## # Groups: src_sub_area [3]
## src_sub_area test train func_name func model_name model fitted error
## <chr> <list> <list> <chr> <lis> <chr> <lis> <list> <dbl>
## 1 sxk97 <tibble~ <tibble ~ msts <fn> tbats <fn> <tbat~ 0.488
## 2 sxk97 <tibble~ <tibble ~ msts <fn> stlm <fn> <stlm> 0.540
## 3 sxk97 <tibble~ <tibble ~ ts <fn> tbats <fn> <tbat~ 0.573
## 4 sxk97 <tibble~ <tibble ~ ts <fn> ets <fn> <ets> 0.574
## 5 sxk97 <tibble~ <tibble ~ ts <fn> stlm <fn> <stlm> 0.598
## 6 sxk97 <tibble~ <tibble ~ ts <fn> auto.arima <fn> <fr_A~ 0.702
## 7 sxk9e <tibble~ <tibble ~ msts <fn> tbats <fn> <tbat~ 0.359
## 8 sxk9e <tibble~ <tibble ~ msts <fn> stlm <fn> <stlm> 0.362
## 9 sxk9e <tibble~ <tibble ~ ts <fn> ets <fn> <ets> 0.419
## 10 sxk9e <tibble~ <tibble ~ ts <fn> tbats <fn> <tbat~ 0.422
## 11 sxk9e <tibble~ <tibble ~ ts <fn> stlm <fn> <stlm> 0.435
## 12 sxk9e <tibble~ <tibble ~ ts <fn> auto.arima <fn> <fr_A~ 0.502
## 13 sxk9s <tibble~ <tibble ~ msts <fn> stlm <fn> <stlm> 0.417
## 14 sxk9s <tibble~ <tibble ~ msts <fn> tbats <fn> <tbat~ 0.421
## 15 sxk9s <tibble~ <tibble ~ ts <fn> tbats <fn> <tbat~ 0.431
## 16 sxk9s <tibble~ <tibble ~ ts <fn> auto.arima <fn> <fr_A~ 0.503
## 17 sxk9s <tibble~ <tibble ~ ts <fn> stlm <fn> <stlm> 0.516
## 18 sxk9s <tibble~ <tibble ~ ts <fn> ets <fn> <ets> 0.550
scotty_best_model <- scotty_model_error %>%
group_by(src_sub_area) %>%
arrange(error) %>%
slice(1) %>% #choose the smallest error & best model for each area directly
ungroup() %>%
select(src_sub_area, ends_with("_name"),error)
scotty_best_model <- scotty_best_model %>%
select(-error) %>%
left_join(scotty_model_error)%>%
select(-error)
## Joining, by = c("src_sub_area", "func_name", "model_name")
scotty_best_model
## # A tibble: 3 x 8
## src_sub_area func_name model_name test train func model fitted
## <chr> <chr> <chr> <list> <list> <lis> <lis> <list>
## 1 sxk97 msts tbats <tibble [168 x 2]> <tibb~ <fn> <fn> <tbat~
## 2 sxk9e msts tbats <tibble [168 x 2]> <tibb~ <fn> <fn> <tbat~
## 3 sxk9s msts stlm <tibble [168 x 2]> <tibb~ <fn> <fn> <stlm>
scotty_test <- scotty_model_error %>%
mutate(
forecast =
map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
map2(test, ~ tibble(
datetime = .y$datetime,
demand = as.vector(.x$mean)
)),
key = paste(func_name, model_name, sep = "-")
)
scotty_test
## # A tibble: 18 x 11
## # Groups: src_sub_area [3]
## src_sub_area test train func_name func model_name model fitted error
## <chr> <list> <list> <chr> <lis> <chr> <lis> <list> <dbl>
## 1 sxk97 <tibble~ <tibble ~ msts <fn> tbats <fn> <tbat~ 0.488
## 2 sxk97 <tibble~ <tibble ~ msts <fn> stlm <fn> <stlm> 0.540
## 3 sxk97 <tibble~ <tibble ~ ts <fn> tbats <fn> <tbat~ 0.573
## 4 sxk97 <tibble~ <tibble ~ ts <fn> ets <fn> <ets> 0.574
## 5 sxk97 <tibble~ <tibble ~ ts <fn> stlm <fn> <stlm> 0.598
## 6 sxk97 <tibble~ <tibble ~ ts <fn> auto.arima <fn> <fr_A~ 0.702
## 7 sxk9e <tibble~ <tibble ~ msts <fn> tbats <fn> <tbat~ 0.359
## 8 sxk9e <tibble~ <tibble ~ msts <fn> stlm <fn> <stlm> 0.362
## 9 sxk9e <tibble~ <tibble ~ ts <fn> ets <fn> <ets> 0.419
## 10 sxk9e <tibble~ <tibble ~ ts <fn> tbats <fn> <tbat~ 0.422
## 11 sxk9e <tibble~ <tibble ~ ts <fn> stlm <fn> <stlm> 0.435
## 12 sxk9e <tibble~ <tibble ~ ts <fn> auto.arima <fn> <fr_A~ 0.502
## 13 sxk9s <tibble~ <tibble ~ msts <fn> stlm <fn> <stlm> 0.417
## 14 sxk9s <tibble~ <tibble ~ msts <fn> tbats <fn> <tbat~ 0.421
## 15 sxk9s <tibble~ <tibble ~ ts <fn> tbats <fn> <tbat~ 0.431
## 16 sxk9s <tibble~ <tibble ~ ts <fn> auto.arima <fn> <fr_A~ 0.503
## 17 sxk9s <tibble~ <tibble ~ ts <fn> stlm <fn> <stlm> 0.516
## 18 sxk9s <tibble~ <tibble ~ ts <fn> ets <fn> <ets> 0.550
## # ... with 2 more variables: forecast <list>, key <chr>
scotty_test_key <- scotty_test %>%
select(src_sub_area, key, actual = test, forecast) %>%
spread(key, forecast) %>%
gather(key, value, -src_sub_area)
scotty_test_key
## # A tibble: 21 x 3
## # Groups: src_sub_area [3]
## src_sub_area key value
## <chr> <chr> <list>
## 1 sxk97 actual <tibble [168 x 2]>
## 2 sxk9e actual <tibble [168 x 2]>
## 3 sxk9s actual <tibble [168 x 2]>
## 4 sxk97 msts-stlm <tibble [168 x 2]>
## 5 sxk9e msts-stlm <tibble [168 x 2]>
## 6 sxk9s msts-stlm <tibble [168 x 2]>
## 7 sxk97 msts-tbats <tibble [168 x 2]>
## 8 sxk9e msts-tbats <tibble [168 x 2]>
## 9 sxk9s msts-tbats <tibble [168 x 2]>
## 10 sxk97 ts-auto.arima <tibble [168 x 2]>
## # ... with 11 more rows
scotty_test_unnest <- scotty_test_key %>%
unnest(value) %>%
mutate(demand = rec_revert(demand, scotty_recipe, src_sub_area))
scotty_test_unnest
## # A tibble: 3,528 x 4
## # Groups: src_sub_area [3]
## src_sub_area key datetime demand
## <chr> <chr> <dttm> <dbl>
## 1 sxk97 actual 2017-11-26 00:00:00 38
## 2 sxk97 actual 2017-11-26 01:00:00 21
## 3 sxk97 actual 2017-11-26 02:00:00 10
## 4 sxk97 actual 2017-11-26 03:00:00 10
## 5 sxk97 actual 2017-11-26 04:00:00 5
## 6 sxk97 actual 2017-11-26 05:00:00 2
## 7 sxk97 actual 2017-11-26 06:00:00 0
## 8 sxk97 actual 2017-11-26 07:00:00 11
## 9 sxk97 actual 2017-11-26 08:00:00 4
## 10 sxk97 actual 2017-11-26 09:00:00 4
## # ... with 3,518 more rows
t2 <- scotty_test_unnest %>%
  ggplot(aes(x = datetime, y = demand, colour = key)) +
  geom_line(data = scotty_test_unnest %>%
              filter(key == "actual"), aes(y = demand), alpha = 0.3, size = 0.8) +
  # `frame` is a plotly aesthetic: ggplot warns about it, but passing the plot
  # through plotly::ggplotly() would animate over the model keys
  geom_line(data = scotty_test_unnest %>%
              filter(key != "actual"), aes(frame = key, col = key)) +
  labs(x = "", y = "Demand", title = "Model Prediction Comparison", frame = "Model") +
  facet_wrap(~ src_sub_area, scale = "free_y", ncol = 1) +
  tidyquant::theme_tq() +
  theme(legend.position = "none", axis.title.y = element_text(size = 12),
        axis.text.y = element_text(size = 7))
## Warning: Ignoring unknown aesthetics: frame
t2
## Automated Model Selection
# filter by lowest test error
scotty_min_error <- scotty_model_error %>%
select(-fitted) %>% # remove unused
group_by(src_sub_area) %>%
filter(error == min(error)) %>%
ungroup()
scotty_min_error # same selection as scotty_best_model above
## # A tibble: 3 x 8
## src_sub_area test train func_name func model_name model error
## <chr> <list> <list> <chr> <lis> <chr> <lis> <dbl>
## 1 sxk97 <tibble [168 x 2]> <tibbl~ msts <fn> tbats <fn> 0.488
## 2 sxk9e <tibble [168 x 2]> <tibbl~ msts <fn> tbats <fn> 0.359
## 3 sxk9s <tibble [168 x 2]> <tibbl~ msts <fn> stlm <fn> 0.417
scotty_combine <- scotty_min_error %>%
mutate(fulldata = map2(train, test, ~ bind_rows(.x, .y))) %>%
select(src_sub_area, fulldata, everything(), -train, -test)
scotty_combine
## # A tibble: 3 x 7
## src_sub_area fulldata func_name func model_name model error
## <chr> <list> <chr> <list> <chr> <list> <dbl>
## 1 sxk97 <tibble [1,512 x 2]> msts <fn> tbats <fn> 0.488
## 2 sxk9e <tibble [1,512 x 2]> msts <fn> tbats <fn> 0.359
## 3 sxk9s <tibble [1,512 x 2]> msts <fn> stlm <fn> 0.417
scotty_combine_nest <- scotty_combine %>%
mutate(
params = map(fulldata, ~ list(x = .x)),
data = invoke_map(func, params),
params = map(data, ~ list(x = .x)),
fitted = invoke_map(model, params)
) %>%
select(-data, -params)
scotty_combine_nest
## # A tibble: 3 x 8
## src_sub_area fulldata func_name func model_name model error fitted
## <chr> <list> <chr> <list> <chr> <list> <dbl> <list>
## 1 sxk97 <tibble [1,512 x~ msts <fn> tbats <fn> 0.488 <tbat~
## 2 sxk9e <tibble [1,512 x~ msts <fn> tbats <fn> 0.359 <tbat~
## 3 sxk9s <tibble [1,512 x~ msts <fn> stlm <fn> 0.417 <stlm>
# forecast the next 4 weeks ahead of the full data
scotty_combine_forecast <- scotty_combine_nest %>%
mutate(forecast =
map(fitted, ~ forecast(.x, h = 24 * 7 * 4)) %>%
map2(fulldata, ~ tibble(
datetime = timetk::tk_make_future_timeseries(.y$datetime, 24 * 7 * 4),
demand = as.vector(.x$mean)
))
)
scotty_combine_forecast
## # A tibble: 3 x 9
## src_sub_area fulldata func_name func model_name model error fitted forecast
## <chr> <list> <chr> <list> <chr> <lis> <dbl> <list> <list>
## 1 sxk97 <tibble ~ msts <fn> tbats <fn> 0.488 <tbat~ <tibble~
## 2 sxk9e <tibble ~ msts <fn> tbats <fn> 0.359 <tbat~ <tibble~
## 3 sxk9s <tibble ~ msts <fn> stlm <fn> 0.417 <stlm> <tibble~
scotty_combine_forecast_unnest <- scotty_combine_forecast %>%
select(src_sub_area, actual = fulldata, forecast) %>%
gather(key, value, -src_sub_area) %>%
unnest(value) %>%
mutate(demand = rec_revert(demand, scotty_recipe, src_sub_area))
scotty_combine_forecast_unnest
## # A tibble: 6,552 x 4
## src_sub_area key datetime demand
## <chr> <chr> <dttm> <dbl>
## 1 sxk97 actual 2017-10-01 00:00:00 6
## 2 sxk97 actual 2017-10-01 01:00:00 4
## 3 sxk97 actual 2017-10-01 02:00:00 9
## 4 sxk97 actual 2017-10-01 03:00:00 2
## 5 sxk97 actual 2017-10-01 04:00:00 5
## 6 sxk97 actual 2017-10-01 05:00:00 4
## 7 sxk97 actual 2017-10-01 06:00:00 1
## 8 sxk97 actual 2017-10-01 07:00:00 0
## 9 sxk97 actual 2017-10-01 08:00:00 3
## 10 sxk97 actual 2017-10-01 09:00:00 3
## # ... with 6,542 more rows
t3 <- scotty_combine_forecast_unnest %>%
ggplot(aes(x = datetime, y = demand, colour = key)) +
geom_line() +
labs(x = NULL, y = NULL, colour = NULL) +
facet_wrap(~ src_sub_area, scale = "free", ncol = 1) +
theme_light() +
tidyquant::scale_colour_tq()
t3
data_test <- scotty_combine_forecast_unnest %>%
filter(datetime %within% interval_test) %>%
select(src_sub_area, datetime, demand)
data_test
## # A tibble: 504 x 3
## src_sub_area datetime demand
## <chr> <dttm> <dbl>
## 1 sxk97 2017-11-26 00:00:00 38
## 2 sxk97 2017-11-26 01:00:00 21
## 3 sxk97 2017-11-26 02:00:00 10
## 4 sxk97 2017-11-26 03:00:00 10
## 5 sxk97 2017-11-26 04:00:00 5
## 6 sxk97 2017-11-26 05:00:00 2
## 7 sxk97 2017-11-26 06:00:00 0
## 8 sxk97 2017-11-26 07:00:00 11
## 9 sxk97 2017-11-26 08:00:00 4
## 10 sxk97 2017-11-26 09:00:00 4
## # ... with 494 more rows
# target: reach MAE < 11 for all sub-areas on (our own) test dataset;
# slice(2) takes the second-lowest error per area
mae_test_result <- scotty_model_error %>%
  group_by(src_sub_area) %>%
  select(src_sub_area, error) %>%
  arrange(error) %>%
  slice(2) %>%
  ungroup()
# we mostly ended up with tbats; the MAE scale reads a little differently here
# because the errors are computed on the transformed series
mae_test_result <- mae_test_result %>%
  summarise(src_sub_area = "all sub-area",
            error = mean(error)) %>%
  bind_rows(mae_test_result, .)
mae_test_result
## # A tibble: 4 x 2
## src_sub_area error
## <chr> <dbl>
## 1 sxk97 0.540
## 2 sxk9e 0.362
## 3 sxk9s 0.421
## 4 all sub-area 0.441
submission <- read_csv("data/data-test.csv")
## Rows: 504 Columns: 3
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): src_sub_area
## lgl (1): demand
## dttm (1): datetime
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
submission
## # A tibble: 504 x 3
## src_sub_area datetime demand
## <chr> <dttm> <lgl>
## 1 sxk97 2017-12-03 00:00:00 NA
## 2 sxk97 2017-12-03 01:00:00 NA
## 3 sxk97 2017-12-03 02:00:00 NA
## 4 sxk97 2017-12-03 03:00:00 NA
## 5 sxk97 2017-12-03 04:00:00 NA
## 6 sxk97 2017-12-03 05:00:00 NA
## 7 sxk97 2017-12-03 06:00:00 NA
## 8 sxk97 2017-12-03 07:00:00 NA
## 9 sxk97 2017-12-03 08:00:00 NA
## 10 sxk97 2017-12-03 09:00:00 NA
## # ... with 494 more rows
range(submission$datetime)
## [1] "2017-12-03 00:00:00 UTC" "2017-12-09 23:00:00 UTC"
# nest the submission data by sub-area
submit_nest <- submission %>% nest(datetime)
## Warning: All elements of `...` must be named.
## Did you want `data = c(datetime)`?
submit_nest
## # A tibble: 3 x 3
## src_sub_area demand data
## <chr> <lgl> <list>
## 1 sxk97 NA <tibble [168 x 1]>
## 2 sxk9e NA <tibble [168 x 1]>
## 3 sxk9s NA <tibble [168 x 1]>
submission <- scotty_best_model %>%
  mutate(
    forecast =
      map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
      map2(submit_nest$data, ~ tibble( # map onto the nested submission timestamps
        datetime = .y$datetime,
        demand = as.vector(.x$mean)
      )),
    key = paste(func_name, model_name, sep = "-")
  )
submission <- submission %>%
select(src_sub_area, forecast) %>%
unnest(forecast) %>%
mutate(demand = rec_revert(demand, scotty_recipe, src_sub_area))
submission
## # A tibble: 504 x 3
## src_sub_area datetime demand
## <chr> <dttm> <dbl>
## 1 sxk97 2017-12-03 00:00:00 18
## 2 sxk97 2017-12-03 01:00:00 15
## 3 sxk97 2017-12-03 02:00:00 11
## 4 sxk97 2017-12-03 03:00:00 7
## 5 sxk97 2017-12-03 04:00:00 5
## 6 sxk97 2017-12-03 05:00:00 3
## 7 sxk97 2017-12-03 06:00:00 2
## 8 sxk97 2017-12-03 07:00:00 6
## 9 sxk97 2017-12-03 08:00:00 12
## 10 sxk97 2017-12-03 09:00:00 12
## # ... with 494 more rows
write.csv(submission, "taniaarsya_submission_scottyts.csv",row.names = F)
knitr::include_graphics("leaderboard scotty ts.png")
## Conclusion
This project succeeded in building an automated forecasting model for hourly demand, evaluated on Sunday, December 3, 2017 through Saturday, December 9, 2017.
To solve this problem, machine learning algorithms were implemented for time series analysis. Two series representations were used: ts() for a single seasonality and msts() for multiple/complex seasonalities. The framework fits ets(), auto.arima(), stlm(), and tbats() and picks whichever yields the smallest error; ets() and auto.arima() are excluded for multi-seasonal series because they only handle a single seasonal period. According to the automated selection, demand in every sub-area tends to follow the multi-seasonal models.
The automated selection mostly settled on the TBATS model (for sxk97 and sxk9e, with STLM winning for sxk9s) because it had the lowest test error. TBATS can deal with complex seasonality (e.g., non-integer seasonal periods, non-nested seasonalities, and seasons with large periods) without seasonal constraints, making detailed, long-term forecasts possible.
I believe this project has many business applications. Forecasting can play a major role in a company's success or failure: it helps keep prices low by optimizing business operations, cash flow, production, staffing, and financial management. It reduces uncertainty, helps us anticipate changes in the market, and improves internal communication as well as communication between the business and its customers. The automation and functional-programming patterns used in this project are just as useful elsewhere in a business, where they can significantly increase efficiency.