Automated Multiple Time Series Analysis on Scotty - Turkish Online Transportation Platform
Brief Introduction
Scotty is a ride-sharing business operating in several big cities in Turkey. The company provides a motorcycle ride-sharing service for Turkish citizens and places a high value on travelling efficiently through traffic.
In this project we will help Scotty run a forecast analysis to predict demand for the end of the year. We will use an automated time-series forecasting workflow that fits multiple models.
Goals & Objective
Build an automated forecasting model for hourly demand, to be evaluated on the next 7 days (Sunday, December 3rd 2017 to Saturday, December 9th 2017).
Data Preprocessing
The first step is to load every package we need in this project.
Load Libraries
library(forecast)
library(lubridate)
library(padr)
library(tidymodels)
library(tidyverse)
library(plotly)
library(zoo)
library(tidyr)
library(recipes)
library(tibble)
library(purrr)
library(png)
library(grid)
library(tidyquant)
Load Data
Load the data into the scotty object:
setwd("C:/Users/Asus/Documents/DataQuest/Algoritma Data Science/R/Machine Learning Capstone Project")
scotty <- read_csv("data-train.csv")
## Parsed with column specification:
## cols(
## id = col_character(),
## trip_id = col_character(),
## driver_id = col_character(),
## rider_id = col_character(),
## start_time = col_datetime(format = ""),
## src_lat = col_double(),
## src_lon = col_double(),
## src_area = col_character(),
## src_sub_area = col_character(),
## dest_lat = col_double(),
## dest_lon = col_double(),
## dest_area = col_character(),
## dest_sub_area = col_character(),
## distance = col_double(),
## status = col_character(),
## confirmed_time_sec = col_double()
## )
Explore Data
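The inspection code itself is not shown; the output below looks like the result of dplyr's glimpse(), roughly:

# inspect the structure of the data (a sketch)
glimpse(scotty)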
## Observations: 90,113
## Variables: 16
## $ id <chr> "59d005e1ffcfa261708ce9cd", "59d0066affcfa261708...
## $ trip_id <chr> "59d005e9cb564761a8fe5d3e", NA, "59d006c131e39c6...
## $ driver_id <chr> "59a892c5568be44b2734f276", NA, "599dc0dfa5b4fd5...
## $ rider_id <chr> "59ad2d6efba75a581666b506", "59cd704bcf482f6ce2f...
## $ start_time <dttm> 2017-10-01 00:00:17, 2017-10-01 00:02:34, 2017-...
## $ src_lat <dbl> 41.07047, 41.07487, 41.04995, 41.05287, 41.06760...
## $ src_lon <dbl> 29.01945, 28.99528, 29.03107, 28.99522, 28.98827...
## $ src_area <chr> "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", ...
## $ src_sub_area <chr> "sxk9s", "sxk9e", "sxk9s", "sxk9e", "sxk9e", "sx...
## $ dest_lat <dbl> 41.11716, 41.08351, 41.04495, 41.08140, 41.02125...
## $ dest_lon <dbl> 29.03650, 29.00228, 28.98192, 28.98197, 29.11316...
## $ dest_area <chr> "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", ...
## $ dest_sub_area <chr> "sxk9u", "sxk9e", "sxk9e", "sxk9e", "sxk9q", "sx...
## $ distance <dbl> 5.379250, 1.126098, 4.169492, 3.358296, 11.69357...
## $ status <chr> "confirmed", "nodrivers", "confirmed", "confirme...
## $ confirmed_time_sec <dbl> 8, 0, 32, 65, 0, 27, 32, 30, 0, 24, 9, 0, 118, 1...
Since the forecast is purely time-based, we only need two columns: the order timestamp (start_time) and the source sub-area (src_sub_area).
Because we will work at an hourly level, we need to floor the timestamps down to the hour.
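The aggregation code is not shown in the rendered output; a minimal sketch of this step, assuming the hourly summary is stored in an object called scotty_sum (the name used later in this analysis):

# floor each order's start_time to the hour, then count orders per hour and sub-area
scotty_sum <- scotty %>%
  mutate(datetime = floor_date(start_time, unit = "hour")) %>%
  group_by(datetime, src_sub_area) %>%
  summarise(demand = n())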
## # A tibble: 4,225 x 3
## # Groups: datetime [1,490]
## datetime src_sub_area demand
## <dttm> <chr> <int>
## 1 2017-10-01 00:00:00 sxk97 6
## 2 2017-10-01 00:00:00 sxk9e 8
## 3 2017-10-01 00:00:00 sxk9s 10
## 4 2017-10-01 01:00:00 sxk97 4
## 5 2017-10-01 01:00:00 sxk9e 2
## 6 2017-10-01 01:00:00 sxk9s 3
## 7 2017-10-01 02:00:00 sxk97 9
## 8 2017-10-01 02:00:00 sxk9e 3
## 9 2017-10-01 02:00:00 sxk9s 1
## 10 2017-10-01 03:00:00 sxk97 2
## # ... with 4,215 more rows
This table shows the total number of orders for each sub-area in each hour.
Data Validation Check
Since we are working with time series data, we need to check several assumptions to validate the data.
1. There must be no missing time periods
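The counting code is not shown; a sketch that would produce the table below:

# count the number of hourly observations recorded for each sub-area
scotty_sum %>%
  group_by(src_sub_area) %>%
  summarise(demand = n())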
## # A tibble: 3 x 2
## src_sub_area demand
## <chr> <int>
## 1 sxk97 1439
## 2 sxk9e 1419
## 3 sxk9s 1367
As we can see, the number of observations in each src_sub_area is not the same, which means some time periods are missing. Therefore we need to fill in the incomplete time periods on the datetime variable using the pad() function from the padr library.
scotty_sum <- scotty_sum %>%
pad('hour',
start_val = as.POSIXct(min(scotty_sum$datetime)),
end_val = as.POSIXct(max(scotty_sum$datetime)),
group = "src_sub_area") %>%
mutate(
src_sub_area = as.factor(src_sub_area)
  )
## Warning: group argument and dplyr::groups are both present and differ,
## dplyr::groups are ignored
## # A tibble: 3 x 2
## src_sub_area demand
## <fct> <int>
## 1 sxk97 1512
## 2 sxk9e 1512
## 3 sxk9s 1512
After pad() has been applied, the number of observations per src_sub_area is the same: 1,512 rows per sub-area.
2. Data must be ordered by their time period
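No explicit check is shown for this assumption; one possible way to verify it:

# check that timestamps are non-decreasing within each sub-area
scotty_sum %>%
  group_by(src_sub_area) %>%
  summarise(ordered = !is.unsorted(datetime))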
3. No NA
## # A tibble: 1 x 1
## na_rows
## <int>
## 1 311
Since we found 311 NA rows, we need to fill them with 0, meaning there were no orders (zero demand) in those particular hours.
scotty_sum <- na.fill0(scotty_sum, fill = 0)
scotty_sum %>%
filter(is.na(demand) == 1) %>%
group_by() %>%
summarise(
na_rows = n()
  )
## # A tibble: 1 x 1
## na_rows
## <int>
## 1 0
Exploratory Data Analysis
Forming Data in Time Series Format
scotty_sum %>%
filter(datetime > (max(scotty_sum$datetime) - hours(24 * 7 * 4))) %>%
ggplot(aes(x = datetime, y = demand)) +
geom_line() +
labs(x = NULL, y = NULL) +
facet_wrap(~ src_sub_area, scale = "free", ncol = 1) +
  theme_light()
Cross Validation
Before we build a model, we must split the data into a training set and a test set. The training set is used to fit the models, while the test set serves as a comparison to check whether the models overfit or underfit when predicting new data outside the training data. We will use the last 7 days of the data as the test set and the rest as the training set.
# number of test data and data train
test_size <- 24 * 7
train_size <- nrow(scotty_sum)/3 - test_size
# initial range and final range for each data
test_end <- max(scotty_sum$datetime)
test_start <- test_end - hours(test_size) + hours(1)
train_end <- test_start - hours(1)
train_start <- train_end - hours(train_size) + hours(1)
# interval for each sample
interval_train <- interval(train_start, train_end)
interval_test <- interval(test_start, test_end)
## [1] 2017-10-01 UTC--2017-11-25 23:00:00 UTC
## [1] 2017-11-26 UTC--2017-12-02 23:00:00 UTC
Here is the data after the splitting process (train & test). Let's plot the train and test sets:
e <- scotty_sum %>%
mutate(sample = case_when(
datetime %within% interval_train ~ "train",
datetime %within% interval_test ~ "test"
)) %>%
mutate(sample = factor(sample, levels = c("train", "test"))) %>%
ggplot(aes(x = datetime, y = demand, colour = sample)) +
geom_line() +
labs(x = NULL, y = NULL, colour = NULL) +
facet_wrap(~ src_sub_area, scale = "free", ncol = 1) +
theme_light()
ggplotly(e)
Time Series Data Preparation
Before we form the time series objects, we first preprocess the data using the recipes library.
Because recipes works column-wise, we reshape the data into wide format, with one column per src_sub_area, using the spread() function from the tidyr library:
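A sketch of this reshaping, assuming the wide table is stored in scotty_sum_wide (the name used below):

# one column of hourly demand per sub-area
scotty_sum_wide <- scotty_sum %>%
  spread(src_sub_area, demand)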
## # A tibble: 1,512 x 4
## # Groups: datetime [1,512]
## datetime sxk97 sxk9e sxk9s
## <dttm> <dbl> <dbl> <dbl>
## 1 2017-10-01 00:00:00 6 8 10
## 2 2017-10-01 01:00:00 4 2 3
## 3 2017-10-01 02:00:00 9 3 1
## 4 2017-10-01 03:00:00 2 2 0
## 5 2017-10-01 04:00:00 5 1 0
## 6 2017-10-01 05:00:00 4 2 0
## 7 2017-10-01 06:00:00 1 1 0
## 8 2017-10-01 07:00:00 0 0 1
## 9 2017-10-01 08:00:00 3 2 5
## 10 2017-10-01 09:00:00 3 11 5
## # ... with 1,502 more rows
Next we transform the data so that the modelling becomes more robust and less sensitive to outliers: a square-root transform followed by centering and scaling, specified with the recipe() function from the recipes library:
scotty_recipe <- recipe(~.,
filter(scotty_sum_wide, datetime %within% interval_train)) %>%
step_sqrt(all_numeric()) %>%
step_center(all_numeric()) %>%
step_scale(all_numeric()) %>%
prep()
# preview the bake results
scotty_sum_wide <- bake(scotty_recipe, scotty_sum_wide)
scotty_sum_wide
## # A tibble: 1,512 x 4
## datetime sxk97 sxk9e sxk9s
## <dttm> <dbl> <dbl> <dbl>
## 1 2017-10-01 00:00:00 -0.594 -0.697 -0.111
## 2 2017-10-01 01:00:00 -0.841 -1.28 -0.780
## 3 2017-10-01 02:00:00 -0.291 -1.15 -1.12
## 4 2017-10-01 03:00:00 -1.16 -1.28 -1.59
## 5 2017-10-01 04:00:00 -0.711 -1.44 -1.59
## 6 2017-10-01 05:00:00 -0.841 -1.28 -1.59
## 7 2017-10-01 06:00:00 -1.39 -1.44 -1.59
## 8 2017-10-01 07:00:00 -1.94 -1.85 -1.12
## 9 2017-10-01 08:00:00 -0.989 -1.28 -0.544
## 10 2017-10-01 09:00:00 -0.989 -0.497 -0.544
## # ... with 1,502 more rows
Because we have transformed the data, we also need a function that converts values back to the original scale; we will call it rec_revert():
# revert back function
rec_revert <- function(vector, rec, varname) {
# store recipe values
rec_center <- rec$steps[[2]]$means[varname]
rec_scale <- rec$steps[[3]]$sds[varname]
# convert back based on the recipe
results <- (vector * rec_scale + rec_center) ^ 2
# add additional adjustment if necessary
results <- round(results)
# return the results
results
}
This revert-back function converts the transformed values back to their original scale.
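As a quick sanity check (using the first transformed sxk97 value shown above), the revert should approximately recover the raw demand:

# -0.594 is the first transformed sxk97 value; this should return roughly the original demand of 6
rec_revert(-0.594, scotty_recipe, "sxk97")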
Now we can convert our data into the long format again:
# convert back to long format
scotty_sum_long <- scotty_sum_wide %>%
gather(src_sub_area, demand, -datetime)
scotty_sum_long
## # A tibble: 4,536 x 3
## datetime src_sub_area demand
## <dttm> <chr> <dbl>
## 1 2017-10-01 00:00:00 sxk97 -0.594
## 2 2017-10-01 01:00:00 sxk97 -0.841
## 3 2017-10-01 02:00:00 sxk97 -0.291
## 4 2017-10-01 03:00:00 sxk97 -1.16
## 5 2017-10-01 04:00:00 sxk97 -0.711
## 6 2017-10-01 05:00:00 sxk97 -0.841
## 7 2017-10-01 06:00:00 sxk97 -1.39
## 8 2017-10-01 07:00:00 sxk97 -1.94
## 9 2017-10-01 08:00:00 sxk97 -0.989
## 10 2017-10-01 09:00:00 sxk97 -0.989
## # ... with 4,526 more rows
Modelling & Forecasting
For functional programming with purrr, we need to convert our tbl into a nested tbl. Nested data is like a table inside a table that can be controlled through a key indicator; in other words, we can keep a separate tbl for each sub-area and sample. In this format the fitting and forecasting process becomes very flexible, and we can still recombine the results as long as we keep a proper key such as src_sub_area.
Let's start by converting our tbl into a nested tbl. First, we add a sample indicator so it can be used as a key when we nest() the data:
# adjust by sample
scotty_sum_nest <- scotty_sum_long %>%
mutate(sample = case_when(
datetime %within% interval_train ~ "train",
datetime %within% interval_test ~ "test"
)) %>%
drop_na()
# nest the data train
scotty_sum_train_nest <- scotty_sum_nest %>%
group_by(src_sub_area, sample) %>%
nest(.key = "data") %>%
  spread(sample, data)
## Warning: `.key` is deprecated
## # A tibble: 3 x 3
## # Groups: src_sub_area [3]
## src_sub_area test train
## <chr> <list> <list>
## 1 sxk97 <tibble [168 x 2]> <tibble [1,344 x 2]>
## 2 sxk9e <tibble [168 x 2]> <tibble [1,344 x 2]>
## 3 sxk9s <tibble [168 x 2]> <tibble [1,344 x 2]>
Time Series Functions
To turn the data into time series objects we need the ts() or msts() functions. We therefore create a list of functions named ts_function_list containing both, with the following seasonalities:
- ts: used for single seasonality; here, daily seasonality (frequency = 24)
- msts: used for multiple/complex seasonality; here, daily (frequency = 24) and weekly (frequency = 24 * 7) seasonality
# list of functions that will be used to form a time series
ts_function_list <- list(
ts = function(x) ts(x$demand, frequency = 24),
msts = function(x) msts(x$demand, seasonal.periods = c(24, 24 * 7))
)
ts_function_list
## $ts
## function(x) ts(x$demand, frequency = 24)
##
## $msts
## function(x) msts(x$demand, seasonal.periods = c(24, 24 * 7))
Then we convert the list into a tbl using enframe(). Note that we also add a key – src_sub_area in our case – so we can use left_join() later.
We use rep() to replicate the function list for each sub-area:
ts_function_list <- ts_function_list %>%
rep(length(unique(scotty_sum_nest$src_sub_area))) %>%
enframe("func_name", "func") %>%
mutate(
src_sub_area = sort(rep(unique(scotty_sum_nest$src_sub_area), length(unique(.$func_name))))
)
ts_function_list
## # A tibble: 6 x 3
## func_name func src_sub_area
## <chr> <list> <chr>
## 1 ts <fn> sxk97
## 2 msts <fn> sxk97
## 3 ts <fn> sxk9e
## 4 msts <fn> sxk9e
## 5 ts <fn> sxk9s
## 6 msts <fn> sxk9s
Then we join the nested functions with our nested data in preparation for model fitting:
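The join itself is not printed as code; a sketch, assuming the result is stored in scotty_fit (the object used below):

# attach the ts/msts constructor functions to the nested train/test data
scotty_fit <- scotty_sum_train_nest %>%
  left_join(ts_function_list)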
## Joining, by = "src_sub_area"
## # A tibble: 6 x 5
## # Groups: src_sub_area [3]
## src_sub_area test train func_name func
## <chr> <list> <list> <chr> <list>
## 1 sxk97 <tibble [168 x 2]> <tibble [1,344 x 2]> ts <fn>
## 2 sxk97 <tibble [168 x 2]> <tibble [1,344 x 2]> msts <fn>
## 3 sxk9e <tibble [168 x 2]> <tibble [1,344 x 2]> ts <fn>
## 4 sxk9e <tibble [168 x 2]> <tibble [1,344 x 2]> msts <fn>
## 5 sxk9s <tibble [168 x 2]> <tibble [1,344 x 2]> ts <fn>
## 6 sxk9s <tibble [168 x 2]> <tibble [1,344 x 2]> msts <fn>
Make a time series model list
We will use four different models:
- Exponential smoothing state space model (ets)
- Autoregressive integrated moving average (Auto arima)
- Seasonal and Trend decomposition using Loess (stlm)
- Trigonometric seasonality, Box-Cox transformation, ARMA errors, Trend and Seasonal components (tbats)
# time series model list
model_function_list <- list(
ets = function(x) ets(x),
auto.arima = function(x) auto.arima(x),
stlm = function(x) stlm(x),
tbats = function(x) tbats(x, use.box.cox = FALSE)
)
model_function_list
## $ets
## function(x) ets(x)
##
## $auto.arima
## function(x) auto.arima(x)
##
## $stlm
## function(x) stlm(x)
##
## $tbats
## function(x) tbats(x, use.box.cox = FALSE)
# combining into nested format
model_function_list <- model_function_list %>%
rep(length(unique(scotty_sum_nest$src_sub_area))) %>%
enframe("model_name", "model") %>%
mutate(src_sub_area =
sort(rep(unique(scotty_sum_nest$src_sub_area), length(unique(.$model_name))))
)
model_function_list
## # A tibble: 12 x 3
## model_name model src_sub_area
## <chr> <list> <chr>
## 1 ets <fn> sxk97
## 2 auto.arima <fn> sxk97
## 3 stlm <fn> sxk97
## 4 tbats <fn> sxk97
## 5 ets <fn> sxk9e
## 6 auto.arima <fn> sxk9e
## 7 stlm <fn> sxk9e
## 8 tbats <fn> sxk9e
## 9 ets <fn> sxk9s
## 10 auto.arima <fn> sxk9s
## 11 stlm <fn> sxk9s
## 12 tbats <fn> sxk9s
The next step is to merge model_function_list into the nested data (scotty_fit). Because the ets and auto.arima models cannot handle multi-seasonal time series, we do not combine these two models with the msts representation:
scotty_fit <- scotty_fit %>%
left_join(model_function_list) %>%
filter(
!(model_name == "ets" & func_name == "msts"),
!(model_name == "auto.arima" & func_name == "msts")
  )
## Joining, by = "src_sub_area"
## # A tibble: 18 x 7
## # Groups: src_sub_area [3]
## src_sub_area test train func_name func model_name model
## <chr> <list> <list> <chr> <list> <chr> <lis>
## 1 sxk97 <tibble [168 ~ <tibble [1,344~ ts <fn> ets <fn>
## 2 sxk97 <tibble [168 ~ <tibble [1,344~ ts <fn> auto.arima <fn>
## 3 sxk97 <tibble [168 ~ <tibble [1,344~ ts <fn> stlm <fn>
## 4 sxk97 <tibble [168 ~ <tibble [1,344~ ts <fn> tbats <fn>
## 5 sxk97 <tibble [168 ~ <tibble [1,344~ msts <fn> stlm <fn>
## 6 sxk97 <tibble [168 ~ <tibble [1,344~ msts <fn> tbats <fn>
## 7 sxk9e <tibble [168 ~ <tibble [1,344~ ts <fn> ets <fn>
## 8 sxk9e <tibble [168 ~ <tibble [1,344~ ts <fn> auto.arima <fn>
## 9 sxk9e <tibble [168 ~ <tibble [1,344~ ts <fn> stlm <fn>
## 10 sxk9e <tibble [168 ~ <tibble [1,344~ ts <fn> tbats <fn>
## 11 sxk9e <tibble [168 ~ <tibble [1,344~ msts <fn> stlm <fn>
## 12 sxk9e <tibble [168 ~ <tibble [1,344~ msts <fn> tbats <fn>
## 13 sxk9s <tibble [168 ~ <tibble [1,344~ ts <fn> ets <fn>
## 14 sxk9s <tibble [168 ~ <tibble [1,344~ ts <fn> auto.arima <fn>
## 15 sxk9s <tibble [168 ~ <tibble [1,344~ ts <fn> stlm <fn>
## 16 sxk9s <tibble [168 ~ <tibble [1,344~ ts <fn> tbats <fn>
## 17 sxk9s <tibble [168 ~ <tibble [1,344~ msts <fn> stlm <fn>
## 18 sxk9s <tibble [168 ~ <tibble [1,344~ msts <fn> tbats <fn>
Create the model
The first step is to run the function stored in the func column to build the time series object, then fit the model stored in the model column; the results are stored in a fitted column.
Since fitting is time consuming, we save the fitted models to "scotty_model.RDS" and reload them for further analysis.
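The fitting code is not shown here (only its saved result is reloaded below); a sketch of what it could look like, mirroring the commented-out pattern used later for the full data:

# build the ts/msts object from each training set, then fit the corresponding model
scotty_model <- scotty_fit %>%
  mutate(
    params = map(train, ~ list(x = .x)),   # wrap the training data as function arguments
    data = invoke_map(func, params),       # apply ts() / msts()
    params = map(data, ~ list(x = .x)),
    fitted = invoke_map(model, params)     # fit ets / auto.arima / stlm / tbats
  ) %>%
  select(-data, -params)

saveRDS(scotty_model, "scotty_model.RDS")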
Comparing with test Data
Calculate errors
scotty_model <- readRDS("scotty_model.RDS")
# calculate test errors
scotty_model_error <- scotty_model %>%
mutate(error =
map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
map2_dbl(test, ~ mae_vec(truth = .y$demand, estimate = .x$mean))
) %>%
arrange(src_sub_area, error) #arrange based on smallest error for each area
scotty_model_error
## # A tibble: 18 x 9
## # Groups: src_sub_area [3]
## src_sub_area test train func_name func model_name model fitted error
## <chr> <list> <list> <chr> <lis> <chr> <lis> <list> <dbl>
## 1 sxk97 <tibble~ <tibble ~ msts <fn> tbats <fn> <tbat~ 0.488
## 2 sxk97 <tibble~ <tibble ~ msts <fn> stlm <fn> <stlm> 0.543
## 3 sxk97 <tibble~ <tibble ~ ts <fn> tbats <fn> <tbat~ 0.573
## 4 sxk97 <tibble~ <tibble ~ ts <fn> ets <fn> <ets> 0.574
## 5 sxk97 <tibble~ <tibble ~ ts <fn> stlm <fn> <stlm> 0.600
## 6 sxk97 <tibble~ <tibble ~ ts <fn> auto.arima <fn> <fr_A~ 0.702
## 7 sxk9e <tibble~ <tibble ~ msts <fn> tbats <fn> <tbat~ 0.359
## 8 sxk9e <tibble~ <tibble ~ msts <fn> stlm <fn> <stlm> 0.366
## 9 sxk9e <tibble~ <tibble ~ ts <fn> ets <fn> <ets> 0.419
## 10 sxk9e <tibble~ <tibble ~ ts <fn> tbats <fn> <tbat~ 0.422
## 11 sxk9e <tibble~ <tibble ~ ts <fn> stlm <fn> <stlm> 0.438
## 12 sxk9e <tibble~ <tibble ~ ts <fn> auto.arima <fn> <fr_A~ 0.502
## 13 sxk9s <tibble~ <tibble ~ msts <fn> tbats <fn> <tbat~ 0.421
## 14 sxk9s <tibble~ <tibble ~ msts <fn> stlm <fn> <stlm> 0.428
## 15 sxk9s <tibble~ <tibble ~ ts <fn> tbats <fn> <tbat~ 0.431
## 16 sxk9s <tibble~ <tibble ~ ts <fn> auto.arima <fn> <fr_A~ 0.503
## 17 sxk9s <tibble~ <tibble ~ ts <fn> stlm <fn> <stlm> 0.540
## 18 sxk9s <tibble~ <tibble ~ ts <fn> ets <fn> <ets> 0.550
Get the best model by choosing the smallest error:
scotty_best_model <- scotty_model_error %>%
group_by(src_sub_area) %>%
arrange(error) %>%
slice(1) %>% #choose the smallest error & best model for each area directly
ungroup() %>%
  select(src_sub_area, ends_with("_name"), error)
scotty_best_model <- scotty_best_model %>%
select(-error) %>%
left_join(scotty_model_error)%>%
  select(-error)
## Joining, by = c("src_sub_area", "func_name", "model_name")
## # A tibble: 3 x 8
## src_sub_area func_name model_name test train func model fitted
## <chr> <chr> <chr> <list> <list> <list> <lis> <list>
## 1 sxk97 msts tbats <tibble [1~ <tibble [1,~ <fn> <fn> <tbat~
## 2 sxk9e msts tbats <tibble [1~ <tibble [1,~ <fn> <fn> <tbat~
## 3 sxk9s msts tbats <tibble [1~ <tibble [1,~ <fn> <fn> <tbat~
Unnesting the Result
We need to make a tbl containing our forecasts on the test dataset, then do some spread-gather tricks to build a key that is unique for each combination of data representation and model, plus one for the actual test data itself. Once we have that format, we can unnest() the data into a proper long format.
Let's start by creating the forecasts:
scotty_test <- scotty_model_error %>%
mutate(
forecast =
map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
map2(test, ~ tibble(
datetime = .y$datetime,
demand = as.vector(.x$mean)
)),
key = paste(func_name, model_name, sep = "-")
)
scotty_test
## # A tibble: 18 x 11
## # Groups: src_sub_area [3]
## src_sub_area test train func_name func model_name model fitted error
## <chr> <lis> <lis> <chr> <lis> <chr> <lis> <list> <dbl>
## 1 sxk97 <tib~ <tib~ msts <fn> tbats <fn> <tbat~ 0.488
## 2 sxk97 <tib~ <tib~ msts <fn> stlm <fn> <stlm> 0.543
## 3 sxk97 <tib~ <tib~ ts <fn> tbats <fn> <tbat~ 0.573
## 4 sxk97 <tib~ <tib~ ts <fn> ets <fn> <ets> 0.574
## 5 sxk97 <tib~ <tib~ ts <fn> stlm <fn> <stlm> 0.600
## 6 sxk97 <tib~ <tib~ ts <fn> auto.arima <fn> <fr_A~ 0.702
## 7 sxk9e <tib~ <tib~ msts <fn> tbats <fn> <tbat~ 0.359
## 8 sxk9e <tib~ <tib~ msts <fn> stlm <fn> <stlm> 0.366
## 9 sxk9e <tib~ <tib~ ts <fn> ets <fn> <ets> 0.419
## 10 sxk9e <tib~ <tib~ ts <fn> tbats <fn> <tbat~ 0.422
## 11 sxk9e <tib~ <tib~ ts <fn> stlm <fn> <stlm> 0.438
## 12 sxk9e <tib~ <tib~ ts <fn> auto.arima <fn> <fr_A~ 0.502
## 13 sxk9s <tib~ <tib~ msts <fn> tbats <fn> <tbat~ 0.421
## 14 sxk9s <tib~ <tib~ msts <fn> stlm <fn> <stlm> 0.428
## 15 sxk9s <tib~ <tib~ ts <fn> tbats <fn> <tbat~ 0.431
## 16 sxk9s <tib~ <tib~ ts <fn> auto.arima <fn> <fr_A~ 0.503
## 17 sxk9s <tib~ <tib~ ts <fn> stlm <fn> <stlm> 0.540
## 18 sxk9s <tib~ <tib~ ts <fn> ets <fn> <ets> 0.550
## # ... with 2 more variables: forecast <list>, key <chr>
Then do some spread-gather to create a proper key:
scotty_test_key <- scotty_test %>%
select(src_sub_area, key, actual = test, forecast) %>%
spread(key, forecast) %>%
gather(key, value, -src_sub_area)
scotty_test_key
## # A tibble: 21 x 3
## # Groups: src_sub_area [3]
## src_sub_area key value
## <chr> <chr> <list>
## 1 sxk97 actual <tibble [168 x 2]>
## 2 sxk9e actual <tibble [168 x 2]>
## 3 sxk9s actual <tibble [168 x 2]>
## 4 sxk97 msts-stlm <tibble [168 x 2]>
## 5 sxk9e msts-stlm <tibble [168 x 2]>
## 6 sxk9s msts-stlm <tibble [168 x 2]>
## 7 sxk97 msts-tbats <tibble [168 x 2]>
## 8 sxk9e msts-tbats <tibble [168 x 2]>
## 9 sxk9s msts-tbats <tibble [168 x 2]>
## 10 sxk97 ts-auto.arima <tibble [168 x 2]>
## # ... with 11 more rows
Because our forecasts are still on the transformed scale defined by scotty_recipe, we convert them back to the original values using the rec_revert() function:
scotty_test_unnest <- scotty_test_key %>%
unnest(value) %>%
mutate(demand = rec_revert(demand, scotty_recipe, src_sub_area))
scotty_test_unnest
## # A tibble: 3,528 x 4
## # Groups: src_sub_area [3]
## src_sub_area key datetime demand
## <chr> <chr> <dttm> <dbl>
## 1 sxk97 actual 2017-11-26 00:00:00 38
## 2 sxk97 actual 2017-11-26 01:00:00 21
## 3 sxk97 actual 2017-11-26 02:00:00 10
## 4 sxk97 actual 2017-11-26 03:00:00 10
## 5 sxk97 actual 2017-11-26 04:00:00 5
## 6 sxk97 actual 2017-11-26 05:00:00 2
## 7 sxk97 actual 2017-11-26 06:00:00 0
## 8 sxk97 actual 2017-11-26 07:00:00 11
## 9 sxk97 actual 2017-11-26 08:00:00 4
## 10 sxk97 actual 2017-11-26 09:00:00 4
## # ... with 3,518 more rows
With the resulting tbl, we can compare the forecasts against the actual test data:
p2 <- scotty_test_unnest %>%
ggplot(aes(x = datetime, y = demand, colour = key)) +
geom_line(data = scotty_test_unnest %>%
filter(key == "actual"),aes(y = demand),alpha = 0.3,size = 0.8)+
geom_line(data = scotty_test_unnest %>%
filter(key != "actual"),aes(frame = key,col = key)) +
labs(x = "", y = "Demand",title = "Model Prediction Comparison", frame = "Model") +
facet_wrap(~ src_sub_area, scale = "free_y", ncol = 1) +
tidyquant::theme_tq() +
  theme(legend.position = "none", axis.title.y = element_text(size = 12), axis.text.y = element_text(size = 7))
## Warning: Ignoring unknown aesthetics: frame
## Warning in p$x$data[firstFrame] <- p$x$frames[[1]]$data: number of items to
## replace is not a multiple of replacement length
We can visually determine which model is best: the transparent line in the background is the actual demand being forecast, while the coloured lines are the forecasts. However, there is a more systematic way to evaluate which model is best.
Automated Model Selection
We can automatically select the model with the smallest error for each sub-area using the filter() function:
# filter by lowest test error
scotty_min_error <- scotty_model_error %>%
select(-fitted) %>% # remove unused
group_by(src_sub_area) %>%
filter(error == min(error)) %>%
ungroup()
scotty_min_error # same selection as scotty_best_model
## # A tibble: 3 x 8
## src_sub_area test train func_name func model_name model error
## <chr> <list> <list> <chr> <lis> <chr> <lis> <dbl>
## 1 sxk97 <tibble [16~ <tibble [1,3~ msts <fn> tbats <fn> 0.488
## 2 sxk9e <tibble [16~ <tibble [1,3~ msts <fn> tbats <fn> 0.359
## 3 sxk9s <tibble [16~ <tibble [1,3~ msts <fn> tbats <fn> 0.421
We can see from the automated modelling that demand tends to follow a multi-seasonal pattern, with the tbats model giving the lowest error in every sub-area.
Performing Final Forecast
Here we repeat the same process as in model fitting, but this time we use the combined train and test data as our "full data".
First, let's combine the train and test datasets:
scotty_combine <- scotty_min_error %>%
mutate(fulldata = map2(train, test, ~ bind_rows(.x, .y))) %>%
select(src_sub_area, fulldata, everything(), -train, -test)
scotty_combine
## # A tibble: 3 x 7
## src_sub_area fulldata func_name func model_name model error
## <chr> <list> <chr> <list> <chr> <list> <dbl>
## 1 sxk97 <tibble [1,512 x 2]> msts <fn> tbats <fn> 0.488
## 2 sxk9e <tibble [1,512 x 2]> msts <fn> tbats <fn> 0.359
## 3 sxk9s <tibble [1,512 x 2]> msts <fn> tbats <fn> 0.421
Then we create the model as in the previous example (the fitting code below is commented out because it is time consuming; the saved result is loaded from an RDS file instead):
# scotty_combine_nest <- scotty_combine %>%
# mutate(
# params = map(fulldata, ~ list(x = .x)),
# data = invoke_map(func, params),
# params = map(data, ~ list(x = .x)),
# fitted = invoke_map(model, params)
# ) %>%
# select(-data, -params)
#
# saveRDS(scotty_combine_nest, "scottycb1_nest.RDS")
scotty_combine_nest <- readRDS("scottycb1_nest.RDS")
scotty_combine_nest
## # A tibble: 3 x 8
## src_sub_area fulldata func_name func model_name model error fitted
## <chr> <list> <chr> <list> <chr> <list> <dbl> <list>
## 1 sxk97 <tibble [1,512 x~ msts <fn> tbats <fn> 0.488 <tbat~
## 2 sxk9e <tibble [1,512 x~ msts <fn> tbats <fn> 0.359 <tbat~
## 3 sxk9s <tibble [1,512 x~ msts <fn> tbats <fn> 0.421 <tbat~
Then, let’s make a tbl containing the forecast results, and convert it to a long format.
# get forecast
scotty_combine_forecast <- scotty_combine_nest %>%
mutate(forecast =
map(fitted, ~ forecast(.x, h = 24 * 7 * 4)) %>%
map2(fulldata, ~ tibble(
datetime = timetk::tk_make_future_timeseries(.y$datetime, 24 * 7 * 4),
demand = as.vector(.x$mean)
))
)
scotty_combine_forecast
## # A tibble: 3 x 9
## src_sub_area fulldata func_name func model_name model error fitted forecast
## <chr> <list> <chr> <list> <chr> <lis> <dbl> <list> <list>
## 1 sxk97 <tibble ~ msts <fn> tbats <fn> 0.488 <tbat~ <tibble~
## 2 sxk9e <tibble ~ msts <fn> tbats <fn> 0.359 <tbat~ <tibble~
## 3 sxk9s <tibble ~ msts <fn> tbats <fn> 0.421 <tbat~ <tibble~
Unnesting the data
scotty_combine_forecast_unnest <- scotty_combine_forecast %>%
select(src_sub_area, actual = fulldata, forecast) %>%
gather(key, value, -src_sub_area) %>%
unnest(value) %>%
mutate(demand = rec_revert(demand, scotty_recipe, src_sub_area))
scotty_combine_forecast_unnest
## # A tibble: 6,552 x 4
## src_sub_area key datetime demand
## <chr> <chr> <dttm> <dbl>
## 1 sxk97 actual 2017-10-01 00:00:00 6
## 2 sxk97 actual 2017-10-01 01:00:00 4
## 3 sxk97 actual 2017-10-01 02:00:00 9
## 4 sxk97 actual 2017-10-01 03:00:00 2
## 5 sxk97 actual 2017-10-01 04:00:00 5
## 6 sxk97 actual 2017-10-01 05:00:00 4
## 7 sxk97 actual 2017-10-01 06:00:00 1
## 8 sxk97 actual 2017-10-01 07:00:00 0
## 9 sxk97 actual 2017-10-01 08:00:00 3
## 10 sxk97 actual 2017-10-01 09:00:00 3
## # ... with 6,542 more rows
d <- scotty_combine_forecast_unnest %>%
ggplot(aes(x = datetime, y = demand, colour = key)) +
geom_line() +
labs(x = NULL, y = NULL, colour = NULL) +
facet_wrap(~ src_sub_area, scale = "free", ncol = 1) +
theme_light() +
tidyquant::scale_colour_tq()
ggplotly(d)

data_test <- scotty_combine_forecast_unnest %>%
filter(datetime %within% interval_test) %>%
select(src_sub_area, datetime, demand)
data_test
## # A tibble: 504 x 3
## src_sub_area datetime demand
## <chr> <dttm> <dbl>
## 1 sxk97 2017-11-26 00:00:00 38
## 2 sxk97 2017-11-26 01:00:00 21
## 3 sxk97 2017-11-26 02:00:00 10
## 4 sxk97 2017-11-26 03:00:00 10
## 5 sxk97 2017-11-26 04:00:00 5
## 6 sxk97 2017-11-26 05:00:00 2
## 7 sxk97 2017-11-26 06:00:00 0
## 8 sxk97 2017-11-26 07:00:00 11
## 9 sxk97 2017-11-26 08:00:00 4
## 10 sxk97 2017-11-26 09:00:00 4
## # ... with 494 more rows
MAE Result
mae_test_result <- scotty_model_error %>%
group_by(src_sub_area) %>%
select(src_sub_area, error) %>%
arrange(error) %>%
slice(1) %>%
ungroup()
# Reached MAE < 11 for all sub-areas on the test dataset
# We used tbats - note that the MAE here is computed on the transformed (sqrt + scaled) demand, so the scale is different
mae_test_result %<>%
summarise(src_sub_area = "all sub-area",
error = mean(error)) %>%
bind_rows(mae_test_result, .)
mae_test_result
## # A tibble: 4 x 2
## src_sub_area error
## <chr> <dbl>
## 1 sxk97 0.488
## 2 sxk9e 0.359
## 3 sxk9s 0.421
## 4 all sub-area 0.423
Data Submission
Here we prepare the one-week forecast for the period following the available data: 2017-12-03 00:00:00 to 2017-12-09 23:00:00.
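The code that loads the submission template is not shown; a sketch of what it could look like (the file name is a placeholder, and submit_nest is the nested object referenced below):

# read the submission template and nest its datetime column per sub-area
submission_template <- read_csv("data-submission.csv")  # placeholder file name
range(submission_template$datetime)

submit_nest <- submission_template %>%
  group_by(src_sub_area, demand) %>%
  nest()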
## Parsed with column specification:
## cols(
## src_sub_area = col_character(),
## datetime = col_datetime(format = ""),
## demand = col_logical()
## )
## # A tibble: 504 x 3
## src_sub_area datetime demand
## <chr> <dttm> <lgl>
## 1 sxk97 2017-12-03 00:00:00 NA
## 2 sxk97 2017-12-03 01:00:00 NA
## 3 sxk97 2017-12-03 02:00:00 NA
## 4 sxk97 2017-12-03 03:00:00 NA
## 5 sxk97 2017-12-03 04:00:00 NA
## 6 sxk97 2017-12-03 05:00:00 NA
## 7 sxk97 2017-12-03 06:00:00 NA
## 8 sxk97 2017-12-03 07:00:00 NA
## 9 sxk97 2017-12-03 08:00:00 NA
## 10 sxk97 2017-12-03 09:00:00 NA
## # ... with 494 more rows
## [1] "2017-12-03 00:00:00 UTC" "2017-12-09 23:00:00 UTC"
## Warning: All elements of `...` must be named.
## Did you want `data = c(datetime)`?
## # A tibble: 3 x 3
## src_sub_area demand data
## <chr> <lgl> <list>
## 1 sxk97 NA <tibble [168 x 1]>
## 2 sxk9e NA <tibble [168 x 1]>
## 3 sxk9s NA <tibble [168 x 1]>
Forecasting our data with the best model:
submission <- scotty_best_model %>%
mutate(
forecast =
map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
map2(submit_nest$data, ~ tibble( #map into our data submission nest
datetime = .y$datetime,
demand = as.vector(.x$mean)
)),
    key = paste(func_name, model_name, sep = "-")
  )
Select only the required columns:
submission <- submission %>%
select(src_sub_area, forecast) %>%
unnest(forecast) %>%
mutate(demand = rec_revert(demand, scotty_recipe, src_sub_area))
submission
## # A tibble: 504 x 3
## src_sub_area datetime demand
## <chr> <dttm> <dbl>
## 1 sxk97 2017-12-03 00:00:00 18
## 2 sxk97 2017-12-03 01:00:00 15
## 3 sxk97 2017-12-03 02:00:00 11
## 4 sxk97 2017-12-03 03:00:00 7
## 5 sxk97 2017-12-03 04:00:00 5
## 6 sxk97 2017-12-03 05:00:00 3
## 7 sxk97 2017-12-03 06:00:00 2
## 8 sxk97 2017-12-03 07:00:00 6
## 9 sxk97 2017-12-03 08:00:00 12
## 10 sxk97 2017-12-03 09:00:00 12
## # ... with 494 more rows
Checking the MAE Result for the Data Submission
Plot the submission forecast:
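The plotting code is not shown; a sketch following the same pattern as the earlier demand plots:

# plot the one-week submission forecast per sub-area
s <- submission %>%
  ggplot(aes(x = datetime, y = demand)) +
  geom_line() +
  labs(x = NULL, y = NULL) +
  facet_wrap(~ src_sub_area, scale = "free", ncol = 1) +
  theme_light()
ggplotly(s)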
Conclusion
In this project we have successfully built an automated forecasting model for hourly demand, evaluated on Sunday, December 3rd 2017 to Saturday, December 9th 2017.
To solve the problem we implemented a time-series workflow with two data representations: ts for single seasonality and msts for multiple/complex seasonality. We fitted ets(), auto.arima(), stlm(), and tbats() models and chose the one with the smallest error for each sub-area; ets() and auto.arima() are not suitable for multi-seasonal time series data.
According to our automated model selection, demand in every sub-area tends to follow the multi-seasonal representation we created.
We chose the TBATS model since it has the lowest error. TBATS can handle complex seasonalities (e.g. non-integer, non-nested, and large-period seasonality) without seasonality constraints, which makes detailed, long-term forecasts possible.
I believe this project has many business applications. Forecasting can play a major role in driving a company's success or failure: it helps keep prices low by optimising operations, cash flow, production, staffing, and financial management. It reduces uncertainty, helps anticipate changes in the market, and improves both internal communication and communication between a business and its customers. The automation and functional-programming techniques used in this project are also very useful in business, where they can greatly improve efficiency.