Scotty is a ride-sharing business that operates in several big cities in Turkey. The company provides a motorcycle ride-sharing service for Turkish citizens and really values efficiency in traveling through traffic; the app even references Star Trek's "beam me up" in its order buttons.
Scotty provided us with a real-time transaction dataset. With this dataset, we are going to help them solve some forecasting and classification problems to improve their business processes.
It's almost the end of 2017, and we need to prepare a forecast model to help Scotty get ready for the end-of-year demand. Unfortunately, Scotty is not old enough to have last year's data for December, so we cannot look back at past demand to prepare a forecast for December. Fortunately, you already know that time series analysis is more than enough to help us forecast! But, as an investment in the business's future, we also need to develop an automated forecasting framework so we don't have to meddle with forecast model selection in the future!
This time, I am going to show you the process of automated time-series forecasting for multiple models. To do it, I will use a package called purrr.
Let's get into the problem! Enjoy the code!
Since I am using the R programming language, I first need to load the packages that will help me work through the problem.
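The setup chunk is not shown in the original post; below is a sketch of the packages it appears to rely on, inferred from the functions called throughout.
# setup sketch: package list inferred from the functions used in this post
library(tidyverse) # dplyr, tidyr, purrr, ggplot2, readr
library(lubridate) # ymd_hms(), floor_date(), interval(), %within%
library(padr) # pad() for filling gaps in a time series
library(recipes) # recipe(), prep(), bake() preprocessing
library(forecast) # ts/msts models: auto.arima(), ets(), stlm(), tbats()
library(tseries) # adf.test() stationarity test
library(yardstick) # mae_vec() error metric
library(plotly) # ggplotly() interactive charts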
I will load my data into the scotty object.
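The loading chunk itself is not shown; assuming the transactions come as a CSV file (the path below is hypothetical), it would look something like this:
# hypothetical file path; adjust to wherever the Scotty data lives
scotty <- read_csv("data/scotty.csv")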
Next, observe scotty with the glimpse() function:
## Observations: 90,113
## Variables: 16
## $ id <chr> "59d005e1ffcfa261708ce9cd", "59d0066affcfa261708...
## $ trip_id <chr> "59d005e9cb564761a8fe5d3e", NA, "59d006c131e39c6...
## $ driver_id <chr> "59a892c5568be44b2734f276", NA, "599dc0dfa5b4fd5...
## $ rider_id <chr> "59ad2d6efba75a581666b506", "59cd704bcf482f6ce2f...
## $ start_time <dttm> 2017-10-01 00:00:17, 2017-10-01 00:02:34, 2017-...
## $ src_lat <dbl> 41.07047, 41.07487, 41.04995, 41.05287, 41.06760...
## $ src_lon <dbl> 29.01945, 28.99528, 29.03107, 28.99522, 28.98827...
## $ src_area <chr> "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", ...
## $ src_sub_area <chr> "sxk9s", "sxk9e", "sxk9s", "sxk9e", "sxk9e", "sx...
## $ dest_lat <dbl> 41.11716, 41.08351, 41.04495, 41.08140, 41.02125...
## $ dest_lon <dbl> 29.03650, 29.00228, 28.98192, 28.98197, 29.11316...
## $ dest_area <chr> "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", ...
## $ dest_sub_area <chr> "sxk9u", "sxk9e", "sxk9e", "sxk9e", "sxk9q", "sx...
## $ distance <dbl> 5.379250, 1.126098, 4.169492, 3.358296, 11.69357...
## $ status <chr> "confirmed", "nodrivers", "confirmed", "confirme...
## $ confirmed_time_sec <dbl> 8, 0, 32, 65, 0, 27, 32, 30, 0, 24, 9, 0, 118, 1...
Now, look at the top six rows using the head() function:
## # A tibble: 6 x 16
## id trip_id driver_id rider_id start_time src_lat src_lon src_area
## <chr> <chr> <chr> <chr> <dttm> <dbl> <dbl> <chr>
## 1 59d0~ 59d005~ 59a892c5~ 59ad2d6~ 2017-10-01 00:00:17 41.1 29.0 sxk9
## 2 59d0~ <NA> <NA> 59cd704~ 2017-10-01 00:02:34 41.1 29.0 sxk9
## 3 59d0~ 59d006~ 599dc0df~ 59bd62c~ 2017-10-01 00:03:29 41.0 29.0 sxk9
## 4 59d0~ 59d007~ 59a58565~ 59c17cd~ 2017-10-01 00:04:24 41.1 29.0 sxk9
## 5 59d0~ <NA> <NA> 596f47a~ 2017-10-01 00:06:06 41.1 29.0 sxk9
## 6 59d0~ 59d007~ 599dc0df~ 59bd62c~ 2017-10-01 00:08:19 41.0 29.0 sxk9
## # ... with 8 more variables: src_sub_area <chr>, dest_lat <dbl>,
## # dest_lon <dbl>, dest_area <chr>, dest_sub_area <chr>, distance <dbl>,
## # status <chr>, confirmed_time_sec <dbl>
Since I am going to build a time series model, I only need the start_time and src_sub_area variables. Therefore, I will subset my data frame and store it in a new object, scotty_new.
scotty_new <- scotty %>%
mutate(src_sub_area = as.factor(src_sub_area)) %>%
select(start_time, src_sub_area)
Next, I want to count the orders per hour, so I will floor start_time to the hour. I will store the result in scotty_agg.
scotty_agg <- scotty_new %>%
mutate(start_time = ymd_hms(start_time) %>% floor_date(unit = "hour")) %>%
group_by(start_time, src_sub_area) %>%
summarise(demand = n()) %>%
ungroup()
Great! Now I have the order count for each hour and area.
This data has three levels of src_sub_area, so I will reshape it from a long data frame into a wide one using the spread() function from the tidyr package.
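The reshaping chunk is hidden in the original; judging from the wide columns (sxk97, sxk9e, sxk9s) that appear later, a sketch would be:
# widen: one hourly-demand column per sub-area
scotty_agg <- scotty_agg %>%
spread(src_sub_area, demand)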
Next, I am going to pad the time series to make sure every hour is present. Before that, I have to find the minimum and maximum values of the time index.
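The chunk is hidden; a quick sketch to check the range of the time index:
min(scotty_agg$start_time)
max(scotty_agg$start_time)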
## [1] "2017-10-01 UTC"
## [1] "2017-12-02 23:00:00 UTC"
Padding time!
scotty_pad <- scotty_agg %>%
pad(start_val = ymd_hms("2017-10-01 00:00:00 UTC"), end_val = ymd_hms("2017-12-02 23:00:00 UTC"))
Since this data is hourly order demand, I am going to replace the NA values with 0 (these missing values mean there were no orders at that time).
scotty_pad <- scotty_pad %>%
mutate(sxk97 = replace_na(sxk97, 0)) %>%
mutate(sxk9e = replace_na(sxk9e, 0)) %>%
mutate(sxk9s = replace_na(sxk9s, 0))
Voila! The data is much cleaner now.
For a clearer view, I will visualize the data with a ggplot() graph.
scotty_pad %>%
pivot_longer(cols = c("sxk97","sxk9e","sxk9s"), names_to = "src_sub_area", values_to = "demand") %>%
filter(start_time >= max(start_time) - hours(24 * 7 * 4)) %>%
ggplot(mapping = aes(x = start_time, y = demand)) +
geom_line() +
facet_wrap(~src_sub_area, scale = "free", ncol = 1) +
tidyquant::theme_tq()
As we can see from the three time-series graphs, they share similarities: the order count drops to 0 at certain times and peaks at the same times across areas. This pattern repeats regularly each day and week, so we can call it a seasonality pattern.
Since the data covers only about two months (nine full weeks of hourly observations), I will use a simple train/test split for validation: the training set covers the first eight weeks (24 hours * 7 days * 4 weeks * 2) and the test set covers the final week.
test_size <- 24 * 7 # test size is 7 days, since I want to predict the next 7 days
train_size <- 24 * 7 * 4 * 2 # train size is 2 months (8 weeks)
# get the min-max of the time index for each sample
test_end <- max(scotty_pad$start_time)
test_start <- test_end - hours(test_size) + hours(1)
train_end <- test_start - hours(1)
train_start <- train_end - hours(train_size) + hours(1)
To make things easier, I can combine the start and end points into date intervals using the interval() function.
intrain <- interval(start = train_start, end = train_end)
intest <- interval(start = test_start, end = test_end)
intrain
## [1] 2017-10-01 UTC--2017-11-25 23:00:00 UTC
intest
## [1] 2017-11-26 UTC--2017-12-02 23:00:00 UTC
It is cool, right?
Now, I am going to preprocess the data using the recipe() and bake() functions from the recipes package.
rec <- recipe(~., filter(scotty_pad, start_time %within% intrain)) %>%
step_sqrt(all_numeric()) %>%
step_center(all_numeric()) %>%
step_scale(all_numeric()) %>%
prep()
bake <- bake(rec, scotty_pad)
bake
## # A tibble: 1,512 x 4
## start_time sxk97 sxk9e sxk9s
## <dttm> <dbl> <dbl> <dbl>
## 1 2017-10-01 00:00:00 -0.594 -0.697 -0.111
## 2 2017-10-01 01:00:00 -0.841 -1.28 -0.780
## 3 2017-10-01 02:00:00 -0.291 -1.15 -1.12
## 4 2017-10-01 03:00:00 -1.16 -1.28 -1.59
## 5 2017-10-01 04:00:00 -0.711 -1.44 -1.59
## 6 2017-10-01 05:00:00 -0.841 -1.28 -1.59
## 7 2017-10-01 06:00:00 -1.39 -1.44 -1.59
## 8 2017-10-01 07:00:00 -1.94 -1.85 -1.12
## 9 2017-10-01 08:00:00 -0.989 -1.28 -0.544
## 10 2017-10-01 09:00:00 -0.989 -0.497 -0.544
## # ... with 1,502 more rows
Since the recipe() transformation applied a square root, centering, and scaling, we have to create a function to revert the results back to the original scale.
# revert back function
rec_revert <- function(vector, rec, varname) {
# store recipe values
rec_center <- rec$steps[[2]]$means[varname]
rec_scale <- rec$steps[[3]]$sds[varname]
# convert back based on the recipe
results <- (vector * rec_scale + rec_center) ^ 2
# add additional adjustment if necessary
results <- round(results)
# return the results
results
}
Next, turn the data frame back into a long data frame.
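The chunk that produces scotty_clean is not shown in the original; a sketch, assuming it gathers the baked (transformed) wide table back into long format:
# sketch: gather the transformed wide table back into long format
scotty_clean <- bake %>%
gather(src_sub_area, demand, -start_time)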
When programming with the purrr package, we need to convert our tbl into a nested tbl; it is just like having a tbl inside a tbl. Now, let's start by converting the tbl into a nested tbl.
scotty_clean<-scotty_clean %>%
mutate(sample = case_when(
start_time %within% intrain ~ "train",
start_time %within% intest ~ "test"
))
After that, we can nest the tbl using the nest() function by sample and src_sub_area, then spread() or pivot_wider() the tbl by sample (this time I will use pivot_wider() instead of spread()).
nest <- scotty_clean %>%
group_by(src_sub_area, sample) %>%
nest(.key = "data") %>%
pivot_wider(names_from = sample, values_from = data)
nest
## # A tibble: 3 x 3
## # Groups: src_sub_area [3]
## src_sub_area train test
## <chr> <list<df[,2]>> <list<df[,2]>>
## 1 sxk97 [1,344 x 2] [168 x 2]
## 2 sxk9e [1,344 x 2] [168 x 2]
## 3 sxk9s [1,344 x 2] [168 x 2]
Since the data has high-frequency seasonality, I will consider two data representations: daily seasonality only (ts) and multiple seasonality, daily plus weekly (msts).
We need to create data-conversion functions to apply to the nested data.
data_funs <- list(
ts = function(x) ts(x$demand, frequency = 24),
msts = function(x) msts(x$demand, seasonal.periods = c(24, 24 * 7))
)
data_funs
## $ts
## function(x) ts(x$demand, frequency = 24)
##
## $msts
## function(x) msts(x$demand, seasonal.periods = c(24, 24 * 7))
We can convert the list above into a tbl using the enframe() function.
data_funs <- data_funs %>%
rep(length(unique(nest$src_sub_area))) %>%
enframe("data_fun_name", "data_fun") %>%
mutate(src_sub_area =
sort(rep(unique(nest$src_sub_area), length(unique(.$data_fun_name))))
)
data_funs
## # A tibble: 6 x 3
## data_fun_name data_fun src_sub_area
## <chr> <list> <chr>
## 1 ts <fn> sxk97
## 2 msts <fn> sxk97
## 3 ts <fn> sxk9e
## 4 msts <fn> sxk9e
## 5 ts <fn> sxk9s
## 6 msts <fn> sxk9s
The next step is joining data_funs with the nest data frame using left_join().
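That chunk is hidden in the original; a sketch producing the join object used below:
# sketch: attach the data-conversion functions to each sub-area's samples
join <- nest %>%
left_join(data_funs, by = "src_sub_area")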
To look at the trend and seasonality patterns of each area, I will visualize them using the ggseasonplot() function.
#area sxk97
areasxk97 <- scotty_pad %>%
gather(src_sub_area, demand, -start_time) %>%
filter(src_sub_area == "sxk97") %>%
.$demand
Since the data covers only two months, it is not enough to reveal a monthly pattern. Therefore, I will only examine the daily and weekly patterns.
Create the plot
plotdaily <- ggseasonplot(ts(areasxk97, frequency = 24), polar = TRUE) +
labs(title = "Daily Pattern",x = "Hour") +
scale_y_sqrt()+
theme_minimal()+
theme(legend.position = "none")
plotweekly <- ggseasonplot(ts(areasxk97, frequency = 24 * 7), polar = TRUE) +
labs(title = "Weekly Pattern",x = "Week") +
scale_y_sqrt()+
theme_minimal()+
theme(legend.position = "none")
Now, display the plots.
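The display chunk is not shown; one way to put the two seasonal plots side by side (using gridExtra here is my assumption):
# arrange the daily and weekly seasonal plots in one row
gridExtra::grid.arrange(plotdaily, plotweekly, ncol = 2)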
Based on the plots above, we can conclude that:
1. In the daily pattern, the highest demand occurs around 5-7 pm
2. In the weekly pattern, the highest demand occurs on Fridays and Saturdays (the 10 o'clock direction)
Next, let's find the trend and seasonality through decomposition.
decompose <- scotty_pad %>%
gather(src_sub_area, demand, -start_time) %>%
filter(src_sub_area == "sxk97") %>%
.$demand %>%
msts(.,seasonal.periods = c(24,24*7)) #24 for daily, 24*7 for weekly
autoplot(mstl(decompose))+
labs(title = "Decomposition by Daily and Weekly Pattern for Area sxk97")+
theme_minimal()
After a quick inspection, we can see that the trend line is smooth. This indicates that the time series has both a trend and a seasonality pattern.
Since the data has both trend and seasonality, I am going to use five different time series models: auto.arima, ets, stlm (with ARIMA), tbats, and Holt-Winters.
# models list
models <- list(
auto.arima = function (x) {
while(adf.test(x)$p.value > 0.05) {
x = diff(x)
}
auto.arima(x, seasonal = TRUE)},
ets = function(x) ets(x),
stlm = function(x) stlm(x, method = "arima"), #stlm arima
tbats = function(x) tbats(x, use.box.cox = FALSE,
use.trend = TRUE,
use.damped.trend = NULL,
use.arma.errors = TRUE,
use.parallel = TRUE,
num.cores = 6),
holt.winter = function (x) HoltWinters(x, seasonal = "additive")
)
I will collect all of the models into one nested tbl for quicker processing.
Now convert the list into a tbl, like the previous function:
models <- models %>%
rep(length(unique(join$src_sub_area))) %>%
enframe("model_name", "model") %>%
mutate(src_sub_area =
sort(rep(unique(join$src_sub_area), length(unique(.$model_name))))
)
Great! We can now join join and models. However, we do not want to pair auto.arima(), stlm(), and ets() with the msts data, since they are not compatible with multi-seasonal time series. Hence, we have to remove those combinations.
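That chunk is hidden in the original; a sketch of the join plus the filter that drops the incompatible pairs:
# sketch: join the models to the nested data, then drop combinations of
# msts data with models that cannot handle multiple seasonal periods
scot <- join %>%
left_join(models, by = "src_sub_area") %>%
filter(!(data_fun_name == "msts" &
model_name %in% c("auto.arima", "ets", "stlm")))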
To execute the model fitting, we need to wrap the required arguments into a list using the map() function; then we can call each function using invoke_map(). We do this first for the data transformation, using the functions stored in data_fun, and then fit the models with the same process using the functions stored in model.
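The fitting chunk is also hidden; a sketch that mirrors the full-data refit shown near the end of this post:
# sketch: wrap each training set as an argument list, convert it with the
# matching data function, then fit the matching model via invoke_map()
scot <- scot %>%
mutate(
params = map(train, ~ list(x = .x)),
data = invoke_map(data_fun, params), # ts or msts conversion
params = map(data, ~ list(x = .x)),
fitted = invoke_map(model, params) # model fitting
) %>%
select(-data, -params)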
Next, we forecast with the forecast() function and measure the error using the yardstick package. I will use MAE (mean absolute error) as the error metric. The error values are split into train_error (on the training data) and test_error (on the test data).
scot <- scot %>%
mutate(train_error =
map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
map2_dbl(train, ~ mae_vec(truth = rec_revert(.y$demand,rec,src_sub_area), estimate = rec_revert(.x$fitted,rec,src_sub_area))),
test_error=
map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
map2_dbl(test, ~ mae_vec(truth = rec_revert(.y$demand,rec,src_sub_area), estimate = rec_revert(.x$mean,rec,src_sub_area)))
) %>%
arrange(src_sub_area, train_error, test_error)
scotty_error <- scot
scot %>%
select(src_sub_area, ends_with("_name"), train_error, test_error)
## # A tibble: 21 x 5
## # Groups: src_sub_area [3]
## src_sub_area data_fun_name model_name train_error test_error
## <chr> <chr> <chr> <dbl> <dbl>
## 1 sxk97 ts stlm 4.55 9.39
## 2 sxk97 msts tbats 4.69 7.42
## 3 sxk97 ts tbats 4.91 9.02
## 4 sxk97 ts ets 5.05 9
## 5 sxk97 msts holt.winter 5.07 8.70
## 6 sxk97 ts holt.winter 5.54 9.12
## 7 sxk97 ts auto.arima 5.62 11.3
## 8 sxk9e ts stlm 6.34 11.2
## 9 sxk9e msts tbats 6.48 9.76
## 10 sxk9e ts tbats 6.85 10.9
## # ... with 11 more rows
We can compare the forecast results with the real test series through graphical analysis. Before that, we have to produce a forecast from every fitted model. Then, we will use some spread-gather tricks to make a set of keys that is unique for each combination of data representation and model, plus one key for the actual series.
First, build the forecasts.
scot_test <- scot %>%
mutate(
forecast =
map(fitted, ~ forecast(.x, h = 24 * 7 )) %>%
map2(test, ~ tibble(
start_time = .y$start_time,
demand = as.vector(.x$mean)
)),
key = paste(data_fun_name, model_name, sep = "-")
)
scot_test
## # A tibble: 21 x 12
## # Groups: src_sub_area [3]
## src_sub_area train test data_fun_name data_fun model_name model
## <chr> <list<df[,> <list<df> <chr> <list> <chr> <lis>
## 1 sxk97 [1,344 x 2] [168 x 2] ts <fn> stlm <fn>
## 2 sxk97 [1,344 x 2] [168 x 2] msts <fn> tbats <fn>
## 3 sxk97 [1,344 x 2] [168 x 2] ts <fn> tbats <fn>
## 4 sxk97 [1,344 x 2] [168 x 2] ts <fn> ets <fn>
## 5 sxk97 [1,344 x 2] [168 x 2] msts <fn> holt.wint~ <fn>
## 6 sxk97 [1,344 x 2] [168 x 2] ts <fn> holt.wint~ <fn>
## 7 sxk97 [1,344 x 2] [168 x 2] ts <fn> auto.arima <fn>
## 8 sxk9e [1,344 x 2] [168 x 2] ts <fn> stlm <fn>
## 9 sxk9e [1,344 x 2] [168 x 2] msts <fn> tbats <fn>
## 10 sxk9e [1,344 x 2] [168 x 2] ts <fn> tbats <fn>
## # ... with 11 more rows, and 5 more variables: fitted <list>,
## # train_error <dbl>, test_error <dbl>, forecast <list>, key <chr>
Next, do the spread-gather:
scot_test <- scot_test %>%
select(src_sub_area, key, actual = test, forecast) %>%
spread(key, forecast) %>%
gather(key, value, -src_sub_area)
scot_test
## # A tibble: 24 x 3
## # Groups: src_sub_area [3]
## src_sub_area key value
## <chr> <chr> <list>
## 1 sxk97 actual <tibble [168 x 2]>
## 2 sxk9e actual <tibble [168 x 2]>
## 3 sxk9s actual <tibble [168 x 2]>
## 4 sxk97 msts-holt.winter <tibble [168 x 2]>
## 5 sxk9e msts-holt.winter <tibble [168 x 2]>
## 6 sxk9s msts-holt.winter <tibble [168 x 2]>
## 7 sxk97 msts-tbats <tibble [168 x 2]>
## 8 sxk9e msts-tbats <tibble [168 x 2]>
## 9 sxk9s msts-tbats <tibble [168 x 2]>
## 10 sxk97 ts-auto.arima <tibble [168 x 2]>
## # ... with 14 more rows
Finally, let's unnest the data using the unnest() function.
scot_test <- scot_test %>%
unnest(value) %>%
mutate(demand = rec_revert(demand, rec, src_sub_area))
scot_test
## # A tibble: 4,032 x 4
## # Groups: src_sub_area [3]
## src_sub_area key start_time demand
## <chr> <chr> <dttm> <dbl>
## 1 sxk97 actual 2017-11-26 00:00:00 38
## 2 sxk97 actual 2017-11-26 01:00:00 21
## 3 sxk97 actual 2017-11-26 02:00:00 10
## 4 sxk97 actual 2017-11-26 03:00:00 10
## 5 sxk97 actual 2017-11-26 04:00:00 5
## 6 sxk97 actual 2017-11-26 05:00:00 2
## 7 sxk97 actual 2017-11-26 06:00:00 0
## 8 sxk97 actual 2017-11-26 07:00:00 11
## 9 sxk97 actual 2017-11-26 08:00:00 4
## 10 sxk97 actual 2017-11-26 09:00:00 4
## # ... with 4,022 more rows
Great! Now, let's visualize the data with a ggplot() graph for a better view.
p1 <- scot_test %>%
ggplot(aes(x = start_time, y = demand, colour = key)) +
geom_line(data = scot_test %>%
filter(key == "actual"),aes(y = demand),alpha = 0.3,size = 0.8)+
geom_line(data = scot_test %>%
filter(key != "actual"),aes(frame = key,col = key)) +
labs(x = "", y = "Demand",title = "Model Prediction Comparison", frame = "Model") +
facet_wrap(~ src_sub_area, scale = "free_y", ncol = 1) +
tidyquant::theme_tq() +
theme(legend.position = "none",axis.title.y = element_text(size = 12), axis.text.y = element_text(size = 7))
ggplotly(p1)
Cool stuff, isn't it? We can compare the models by sliding the control to the desired model. The transparent line in the back is the test data we want to forecast, while the coloured line is the corresponding forecast.
We can see that each model produces a somewhat different line pattern. However, all of them capture the trend and seasonality of each area, which indicates that the time-series models produce pretty good results.
It is quite hard to compare the results using graphs alone. We can perform model selection automatically by using the filter() function to find the lowest error.
scot_best <- scot %>%
select(-c(fitted, train, test, data_fun, model)) %>%
group_by(src_sub_area) %>%
filter(test_error == min(test_error)) %>%
ungroup()
scot_best
## # A tibble: 3 x 5
## src_sub_area data_fun_name model_name train_error test_error
## <chr> <chr> <chr> <dbl> <dbl>
## 1 sxk97 msts tbats 4.69 7.42
## 2 sxk9e msts tbats 6.48 9.76
## 3 sxk9s msts tbats 4.81 6.94
As the data frame above shows, the best model, the one with the lowest MAE on the test data (the data we want to forecast), is the msts-tbats model, with test errors of 7.42, 9.76, and 6.94 for sxk97, sxk9e, and sxk9s respectively.
As a reminder, these results were produced from the split training data. To forecast the real future demand, I have to combine the training and test data back into the full dataset. This step is explained in the next section.
Now, the final step is to build the last forecast model. It is the same as the previous steps, but we use the combined train and test data as our new dataset.
scot <- scot %>%
mutate(fulldata = map2(train, test, ~ bind_rows(.x, .y))) %>%
select(src_sub_area, fulldata, everything(), -train, -test)
scot
## # A tibble: 21 x 9
## # Groups: src_sub_area [3]
## src_sub_area fulldata data_fun_name data_fun model_name model fitted
## <chr> <list> <chr> <list> <chr> <lis> <list>
## 1 sxk97 <tibble~ ts <fn> stlm <fn> <stlm>
## 2 sxk97 <tibble~ msts <fn> tbats <fn> <tbat~
## 3 sxk97 <tibble~ ts <fn> tbats <fn> <tbat~
## 4 sxk97 <tibble~ ts <fn> ets <fn> <ets>
## 5 sxk97 <tibble~ msts <fn> holt.wint~ <fn> <HltW~
## 6 sxk97 <tibble~ ts <fn> holt.wint~ <fn> <HltW~
## 7 sxk97 <tibble~ ts <fn> auto.arima <fn> <fr_A~
## 8 sxk9e <tibble~ ts <fn> stlm <fn> <stlm>
## 9 sxk9e <tibble~ msts <fn> tbats <fn> <tbat~
## 10 sxk9e <tibble~ ts <fn> tbats <fn> <tbat~
## # ... with 11 more rows, and 2 more variables: train_error <dbl>,
## # test_error <dbl>
Do the nested fitting
scot <- scot %>%
mutate(
params = map(fulldata, ~ list(x = .x)),
data = invoke_map(data_fun, params),
params = map(data, ~ list(x = .x)),
fitted = invoke_map(model, params)
) %>%
select(-data, -params)
After that, we forecast the next 7 days.
scot <- scot %>%
mutate(forecast =
map(fitted, ~ forecast(.x, h = 24 * 7 )) %>%
map2(fulldata, ~ tibble(
start_time = timetk::tk_make_future_timeseries(.y$start_time, 24 * 7 ),
demand = as.vector(.x$mean)
))
)
# save the data for evaluation
scot_test_full <- scot
scot
## # A tibble: 21 x 10
## # Groups: src_sub_area [3]
## src_sub_area fulldata data_fun_name data_fun model_name model fitted
## <chr> <list> <chr> <list> <chr> <lis> <list>
## 1 sxk97 <tibble~ ts <fn> stlm <fn> <stlm>
## 2 sxk97 <tibble~ msts <fn> tbats <fn> <tbat~
## 3 sxk97 <tibble~ ts <fn> tbats <fn> <tbat~
## 4 sxk97 <tibble~ ts <fn> ets <fn> <ets>
## 5 sxk97 <tibble~ msts <fn> holt.wint~ <fn> <HltW~
## 6 sxk97 <tibble~ ts <fn> holt.wint~ <fn> <HltW~
## 7 sxk97 <tibble~ ts <fn> auto.arima <fn> <fr_A~
## 8 sxk9e <tibble~ ts <fn> stlm <fn> <stlm>
## 9 sxk9e <tibble~ msts <fn> tbats <fn> <tbat~
## 10 sxk9e <tibble~ ts <fn> tbats <fn> <tbat~
## # ... with 11 more rows, and 3 more variables: train_error <dbl>,
## # test_error <dbl>, forecast <list>
To obtain the best model, filter() again for the lowest error.
scot_best_full <- scot %>%
select(-fitted) %>%
group_by(src_sub_area) %>%
filter(test_error == min(test_error)) %>%
ungroup()
scot_best_full
## # A tibble: 3 x 9
## src_sub_area fulldata data_fun_name data_fun model_name model train_error
## <chr> <list> <chr> <list> <chr> <lis> <dbl>
## 1 sxk97 <tibble~ msts <fn> tbats <fn> 4.69
## 2 sxk9e <tibble~ msts <fn> tbats <fn> 6.48
## 3 sxk9s <tibble~ msts <fn> tbats <fn> 4.81
## # ... with 2 more variables: test_error <dbl>, forecast <list>
Finally, we have to unnest the results using the unnest() function.
scot <- scot %>%
select(src_sub_area, actual = fulldata, forecast) %>%
gather(key, value, -src_sub_area) %>%
unnest(value) %>%
mutate(demand = rec_revert(demand, rec, src_sub_area))
scot
## # A tibble: 35,280 x 4
## # Groups: src_sub_area [3]
## src_sub_area key start_time demand
## <chr> <chr> <dttm> <dbl>
## 1 sxk97 actual 2017-10-01 00:00:00 6
## 2 sxk97 actual 2017-10-01 01:00:00 4
## 3 sxk97 actual 2017-10-01 02:00:00 9
## 4 sxk97 actual 2017-10-01 03:00:00 2
## 5 sxk97 actual 2017-10-01 04:00:00 5
## 6 sxk97 actual 2017-10-01 05:00:00 4
## 7 sxk97 actual 2017-10-01 06:00:00 1
## 8 sxk97 actual 2017-10-01 07:00:00 0
## 9 sxk97 actual 2017-10-01 08:00:00 3
## 10 sxk97 actual 2017-10-01 09:00:00 3
## # ... with 35,270 more rows
Now, to get a better view, we can visualize the data in a graph.
p2 <- scot %>%
ggplot(aes(x = start_time, y = demand, colour = key)) +
geom_line() +
labs(x = NULL, y = NULL, colour = NULL) +
facet_wrap(~ src_sub_area, scale = "free", ncol = 1) +
labs(x = "", y = "Demand",title = "Final Model Prediction Comparison")+
tidyquant::theme_tq() +
tidyquant::scale_colour_tq()
ggplotly(p2)
To conclude the time-series process, I will filter the data for the forecast key.
scot %>%
filter(key == "forecast") %>%
select(-key) %>%
select(src_sub_area, datetime = start_time, demand) %>%
ggplot(mapping = aes(x = datetime, y= demand)) +
geom_line()+
labs(x = NULL, y = NULL, colour = NULL) +
facet_wrap(~ src_sub_area, scale = "free", ncol = 1) +
labs(x = "", y = "Demand",title = "Demand Forecast for December 3rd - December 9th 2017")+
tidyquant::theme_tq() +
tidyquant::scale_colour_tq()
To check the quality of the forecast results, we can use MAE (mean absolute error) and evaluate the error for each src_sub_area.
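The evaluation chunk is not shown; a sketch of how the per-model MAE on the test week could be computed from scot_test, using yardstick's mae_vec():
# sketch: compare each forecast key against the actual series, then
# summarise the mean absolute error per sub-area and model
scot_test %>%
filter(key != "actual") %>%
left_join(
scot_test %>%
ungroup() %>%
filter(key == "actual") %>%
select(src_sub_area, start_time, actual = demand),
by = c("src_sub_area", "start_time")
) %>%
group_by(src_sub_area, key) %>%
summarise(mae = mae_vec(truth = actual, estimate = demand)) %>%
arrange(src_sub_area, mae)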
Here are the test-period forecast results for each model:
## # A tibble: 3,528 x 4
## # Groups: src_sub_area [3]
## src_sub_area key start_time demand
## <chr> <chr> <dttm> <dbl>
## 1 sxk97 msts-holt.winter 2017-11-26 00:00:00 24
## 2 sxk97 msts-holt.winter 2017-11-26 01:00:00 27
## 3 sxk97 msts-holt.winter 2017-11-26 02:00:00 20
## 4 sxk97 msts-holt.winter 2017-11-26 03:00:00 14
## 5 sxk97 msts-holt.winter 2017-11-26 04:00:00 9
## 6 sxk97 msts-holt.winter 2017-11-26 05:00:00 6
## 7 sxk97 msts-holt.winter 2017-11-26 06:00:00 1
## 8 sxk97 msts-holt.winter 2017-11-26 07:00:00 3
## 9 sxk97 msts-holt.winter 2017-11-26 08:00:00 5
## 10 sxk97 msts-holt.winter 2017-11-26 09:00:00 5
## # ... with 3,518 more rows
I can filter() for the best error.
## # A tibble: 3 x 5
## src_sub_area data_fun_name model_name train_error test_error
## <chr> <chr> <chr> <dbl> <dbl>
## 1 sxk97 msts tbats 4.69 7.42
## 2 sxk9e msts tbats 6.48 9.76
## 3 sxk9s msts tbats 4.81 6.94
Results:
msts-tbats produced the lowest error among all the models.
Here are the final demand forecast results:
## # A tibble: 3,528 x 4
## # Groups: src_sub_area [3]
## src_sub_area key start_time demand
## <chr> <chr> <dttm> <dbl>
## 1 sxk97 forecast 2017-12-03 00:00:00 43
## 2 sxk97 forecast 2017-12-03 01:00:00 28
## 3 sxk97 forecast 2017-12-03 02:00:00 23
## 4 sxk97 forecast 2017-12-03 03:00:00 15
## 5 sxk97 forecast 2017-12-03 04:00:00 9
## 6 sxk97 forecast 2017-12-03 05:00:00 5
## 7 sxk97 forecast 2017-12-03 06:00:00 5
## 8 sxk97 forecast 2017-12-03 07:00:00 10
## 9 sxk97 forecast 2017-12-03 08:00:00 20
## 10 sxk97 forecast 2017-12-03 09:00:00 29
## # ... with 3,518 more rows
Filter for the lowest error:
## # A tibble: 3 x 5
## src_sub_area data_fun_name model_name train_error test_error
## <chr> <chr> <chr> <dbl> <dbl>
## 1 sxk97 msts tbats 4.69 7.42
## 2 sxk9e msts tbats 6.48 9.76
## 3 sxk9s msts tbats 4.81 6.94
Results:
msts-tbats also produces the lowest error for the next 7 days. All three areas have train_error and test_error MAE values below 10, which is good news.
Last but not least, I am going to check two assumptions of a good time-series forecasting model: no autocorrelation in the residuals and normally distributed residuals.
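The test chunk is hidden; the Ljung-Box output below is presumably produced with stats::Box.test(), e.g.:
# Ljung-Box test on the residuals of the first fitted model
Box.test(residuals(scot_test_full$fitted[[1]]), type = "Ljung-Box")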
##
## Box-Ljung test
##
## data: residuals(scot_test_full$fitted[[1]])
## X-squared = 0.0095585, df = 1, p-value = 0.9221
Based on the hypothesis-testing criteria:
H0: no autocorrelation (p-value > 0.05)
H1: autocorrelation (p-value < 0.05)
Since the p-value of 0.9221 > 0.05, the model shows no autocorrelation: the residuals are not correlated with each other.
##
## Box-Ljung test
##
## data: residuals(scot_test_full$fitted[[2]])
## X-squared = 0.005713, df = 1, p-value = 0.9397
Based on the hypothesis-testing criteria:
H0: no autocorrelation (p-value > 0.05)
H1: autocorrelation (p-value < 0.05)
Since the p-value of 0.94 > 0.05, the model shows no autocorrelation: the residuals are not correlated with each other. The model captures the available information for this sub-area well, so it can be called a good model.
##
## Box-Ljung test
##
## data: residuals(scot_test_full$fitted[[3]])
## X-squared = 0.27352, df = 1, p-value = 0.601
Based on the hypothesis-testing criteria:
H0: no autocorrelation (p-value > 0.05)
H1: autocorrelation (p-value < 0.05)
Since the p-value of 0.60 > 0.05, the model shows no autocorrelation: the residuals are not correlated with each other.
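The conclusion below also refers to residual histograms; a minimal sketch for eyeballing the normality assumption:
# sketch: histograms of the residuals for the three models tested above
par(mfrow = c(3, 1))
for (i in 1:3) {
hist(as.numeric(residuals(scot_test_full$fitted[[i]])),
main = paste("Residuals of fitted model", i), xlab = "residual")
}
par(mfrow = c(1, 1))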
Having completed the whole time-series forecasting process, we can say that the best model, the one with the lowest error for this dataset, is msts-tbats, with MAE < 10. Based on the assumption checks, the residuals for areas sxk97, sxk9e, and sxk9s show no autocorrelation. This is great news for a time-series model, since it means the available information is used well; hence, the model is effectively built. Moreover, the three areas' residuals look normally distributed based on their histograms. Overall, we can say that automated time-series forecasting is a pretty good option for this dataset.
I will read the submission data and store it in the submission object.
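The reading chunk is not shown; a sketch, assuming a CSV template (the file name is hypothetical):
# hypothetical file name for the submission template
submission <- read_csv("data/scotty_submission.csv")
head(submission, 5)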
## # A tibble: 5 x 3
## src_sub_area datetime demand
## <chr> <dttm> <lgl>
## 1 sxk97 2017-12-03 00:00:00 NA
## 2 sxk97 2017-12-03 01:00:00 NA
## 3 sxk97 2017-12-03 02:00:00 NA
## 4 sxk97 2017-12-03 03:00:00 NA
## 5 sxk97 2017-12-03 04:00:00 NA
Next, the following code assembles the submission.
Obtain best_model from the scotty_error object created earlier. These are the errors of msts-tbats.
best_model <- scotty_error %>%
group_by(src_sub_area) %>%
arrange(test_error) %>%
slice(1) %>%
ungroup() %>%
select(src_sub_area, ends_with("_name"),test_error)
best_model
## # A tibble: 3 x 4
## src_sub_area data_fun_name model_name test_error
## <chr> <chr> <chr> <dbl>
## 1 sxk97 msts tbats 7.42
## 2 sxk9e msts tbats 9.76
## 3 sxk9s msts tbats 6.94
Join with scotty_error once more to retrieve the fitted models:
best_model2 <- best_model %>%
select(-test_error) %>%
left_join(scotty_error) %>%
select(-c(test_error,train_error))
best_model2
## # A tibble: 3 x 8
## src_sub_area data_fun_name model_name train test data_fun model
## <chr> <chr> <chr> <list<df[,> <list<df> <list> <lis>
## 1 sxk97 msts tbats [1,344 x 2] [168 x 2] <fn> <fn>
## 2 sxk9e msts tbats [1,344 x 2] [168 x 2] <fn> <fn>
## 3 sxk9s msts tbats [1,344 x 2] [168 x 2] <fn> <fn>
## # ... with 1 more variable: fitted <list>
Another small adjustment: attach each forecast to the submission template's datetime index.
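The chunk creating submission_nest is not shown either; a sketch, nesting the submission template per sub-area so its datetime index can be paired with each forecast:
# sketch: nest the submission template by sub-area
submission_nest <- submission %>%
group_by(src_sub_area) %>%
nest()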
sub_forecast <- best_model2 %>%
mutate(
forecast =
map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
map2(submission_nest$data,
~ tibble(
datetime = .y$datetime,
demand = as.vector(.x$mean)
)),
key = paste(data_fun_name, model_name, sep = "-")
)
sub_forecast
## # A tibble: 3 x 10
## src_sub_area data_fun_name model_name train test data_fun model
## <chr> <chr> <chr> <list<df[,> <list<df> <list> <lis>
## 1 sxk97 msts tbats [1,344 x 2] [168 x 2] <fn> <fn>
## 2 sxk9e msts tbats [1,344 x 2] [168 x 2] <fn> <fn>
## 3 sxk9s msts tbats [1,344 x 2] [168 x 2] <fn> <fn>
## # ... with 3 more variables: fitted <list>, forecast <list>, key <chr>
Now, let’s build the final submission!
final_submission <- sub_forecast %>%
select(src_sub_area, forecast) %>%
unnest(forecast) %>%
mutate(demand = rec_revert(demand, rec, src_sub_area))
final_submission
## # A tibble: 504 x 3
## src_sub_area datetime demand
## <chr> <dttm> <dbl>
## 1 sxk97 2017-12-03 00:00:00 20
## 2 sxk97 2017-12-03 01:00:00 17
## 3 sxk97 2017-12-03 02:00:00 12
## 4 sxk97 2017-12-03 03:00:00 9
## 5 sxk97 2017-12-03 04:00:00 7
## 6 sxk97 2017-12-03 05:00:00 4
## 7 sxk97 2017-12-03 06:00:00 3
## 8 sxk97 2017-12-03 07:00:00 7
## 9 sxk97 2017-12-03 08:00:00 13
## 10 sxk97 2017-12-03 09:00:00 13
## # ... with 494 more rows
I will export the data to a csv file.
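A sketch of the export (the file name is an assumption):
# write the final forecast to disk
write_csv(final_submission, "scotty_final_submission.csv")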
So, that's all for the process of automated time-series forecasting using the purrr package in R. I hope this page helps you understand the time-series problem and the solution behind it.
See you in the other page!
Author,
Alfado Sembiring
Notes:
In case you want to look up my profile, click the link below:
Jump To My Profile