Scotty is a ride-sharing business operating in several big cities in Turkey. The company provides a motorcycle ride-sharing service for Turkish citizens and really values efficiency in traveling through traffic; the app even references Star Trek's "beam me up" in its order buttons.
Scotty provided us with a real-time transaction dataset. With this dataset, we are going to help them solve some forecasting and classification problems in order to improve their business processes.
It's almost the end of 2017 and we need to prepare a forecast model to help Scotty get ready for the end-of-year demand. Unfortunately, Scotty is not old enough to have last year's data for December, so we cannot look back at past demand to prepare a forecast for December. Fortunately, we already know that time series analysis is more than enough to help us forecast! But, as an investment in the business's future, we also need to develop an automated forecasting framework so we don't have to meddle with forecast model selection in the future.
## Rows: 90,113
## Columns: 16
## $ id <fct> 59d005e1ffcfa261708ce9cd, 59d0066affcfa261708ceb11…
## $ trip_id <fct> 59d005e9cb564761a8fe5d3e, NA, 59d006c131e39c618969…
## $ driver_id <fct> 59a892c5568be44b2734f276, NA, 599dc0dfa5b4fd5471ad…
## $ rider_id <fct> 59ad2d6efba75a581666b506, 59cd704bcf482f6ce2fadfdb…
## $ start_time <fct> 2017-10-01T00:00:17Z, 2017-10-01T00:02:34Z, 2017-1…
## $ src_lat <dbl> 41.07047, 41.07487, 41.04995, 41.05287, 41.06760, …
## $ src_lon <dbl> 29.01945, 28.99528, 29.03107, 28.99522, 28.98827, …
## $ src_area <fct> sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sx…
## $ src_sub_area <fct> sxk9s, sxk9e, sxk9s, sxk9e, sxk9e, sxk9s, sxk9s, s…
## $ dest_lat <dbl> 41.11716, 41.08351, 41.04495, 41.08140, 41.02125, …
## $ dest_lon <dbl> 29.03650, 29.00228, 28.98192, 28.98197, 29.11316, …
## $ dest_area <fct> sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sx…
## $ dest_sub_area <fct> sxk9u, sxk9e, sxk9e, sxk9e, sxk9q, sxk9e, sxk9e, s…
## $ distance <dbl> 5.379250, 1.126098, 4.169492, 3.358296, 11.693573,…
## $ status <fct> confirmed, nodrivers, confirmed, confirmed, nodriv…
## $ confirmed_time_sec <int> 8, 0, 32, 65, 0, 27, 32, 30, 0, 24, 9, 0, 118, 106…
We will only use two variables: the order time and the pickup sub-area.
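The selection step itself is not shown here; below is a minimal sketch, assuming the raw transactions have already been read into ride_df and that dplyr and magrittr are attached:
# Hedged sketch: keep only the order timestamp and the pickup sub-area
ride_df %<>%
select(start_time, src_sub_area)
glimpse(ride_df)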
## Rows: 90,113
## Columns: 2
## $ start_time <fct> 2017-10-01T00:00:17Z, 2017-10-01T00:02:34Z, 2017-10-01T0…
## $ src_sub_area <fct> sxk9s, sxk9e, sxk9s, sxk9e, sxk9e, sxk9s, sxk9s, sxk9s, …
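The quick check that lists the sub-area levels below is not shown; it was presumably a simple call such as:
unique(ride_df$src_sub_area)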
## [1] sxk9s sxk9e sxk97
## Levels: sxk97 sxk9e sxk9s
For this time series case, we need to floor the timestamps to the hour.
ride_df$start_time <- as.POSIXct(ride_df$start_time, format="%Y-%m-%dT%H:%M:%SZ")
ride_df$start_time <- floor_date(ride_df$start_time, unit="hour")
Next, we need to count the demand (number of orders) for each hour and sub-area.
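The counting call is not shown either; a minimal sketch, assuming dplyr is attached:
# Hedged sketch: count the number of orders per sub-area and hour
ride_df %<>%
group_by(src_sub_area, start_time) %>%
summarise(demand = n()) %>%
ungroup()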
For time series modeling we are not permitted to have gaps in the time index, so we need to pad the missing timestamps and replace the demand in those missing hours with a zero value.
min_date <- min(ride_df$start_time)
start_val <- make_datetime(year = year(min_date), month=month(min_date), day=day(min_date), hour = 0)
max_date <- max(ride_df$start_time)
end_val <- make_datetime(year = year(max_date), month=month(max_date), day=day(max_date), hour = 23)
ride_df %<>%
group_by(src_sub_area) %>%
pad(start_val = start_val, end_val = end_val) %>%
mutate(demand = replace_na(demand, 0)) %>%
ungroup() %>%
distinct()
## # A tibble: 6 x 3
## # Groups: src_sub_area [1]
## start_time src_sub_area demand
## <dttm> <fct> <dbl>
## 1 2017-10-01 00:00:00 sxk97 6
## 2 2017-10-01 01:00:00 sxk97 4
## 3 2017-10-01 02:00:00 sxk97 9
## 4 2017-10-01 03:00:00 sxk97 2
## 5 2017-10-01 04:00:00 sxk97 5
## 6 2017-10-01 05:00:00 sxk97 4
Visualize the demand data per sub area.
ggplotly(ggplot(ride_df,aes(x = start_time, y = demand)) +
geom_line(aes(col = src_sub_area)) +
labs(x = "", y = "Order Request",title = "Order Demand by Sub Area") +
facet_wrap(~ src_sub_area, scale = "free_y", ncol = 1) +
theme_tq() +
scale_colour_tq() +
theme(legend.position = "none")
)
The trend resulting from the decomposition with daily seasonality only is not smooth, which might be caused by uncaptured extra seasonality, so the data can be considered multi-seasonal. Therefore, we try another option: creating a multiple-seasonal time series object, msts, with daily and weekly seasonality:
daily_weekly <- ride_df %>% filter(src_sub_area == "sxk97") %>% .$demand %>% msts(.,seasonal.periods = c(24,24*7))
autoplot(mstl(daily_weekly)) + labs(title = "Decomposition on Daily and Weekly Basis") + theme(legend.position = "none", plot.title = element_text(hjust = 0.5))
I use a square-root scale (scale_y_sqrt) to clarify the pattern. Create seasonal plots for each sub-area to identify the emerging seasonal patterns.
sxk97 <- ride_df %>% filter(src_sub_area == "sxk97") %>% .$demand
sxk97_daily <- ggseasonplot(ts(sxk97,frequency = 24),polar = T) +
theme(legend.position = "none") +
labs(title = "SXK97 Daily",x = "Hour") +
scale_y_sqrt()
sxk97_w <- ride_df %>% filter(src_sub_area == "sxk97") %>%
mutate(date = format(start_time,"%Y/%m/%d")) %>%
group_by(date) %>%
mutate(demand = sum(demand)) %>%
select(c(date,demand)) %>%
distinct()
sxk97_weekly <- ggseasonplot(ts(sxk97_w$demand,frequency = 7),polar = T) +
theme(legend.position = "none") +
labs(title = "SXK97 Weekly",x =NULL) +
scale_y_sqrt()
ggarrange(sxk97_daily, sxk97_weekly, ncol=2, nrow=1)
sxk9e <- ride_df %>% filter(src_sub_area == "sxk9e") %>% .$demand
sxk9e_daily <- ggseasonplot(ts(sxk9e,frequency = 24),polar = T) +
theme(legend.position = "none") +
labs(title = "SXK9E Daily",x = "Hour") +
scale_y_sqrt()
sxk9e_w <- ride_df %>% filter(src_sub_area == "sxk9e") %>%
mutate(date = format(start_time,"%Y/%m/%d")) %>%
group_by(date) %>%
mutate(demand = sum(demand)) %>%
select(c(date,demand)) %>%
distinct()
sxk9e_weekly <- ggseasonplot(ts(sxk9e_w$demand,frequency = 7),polar = T) +
theme(legend.position = "none") +
labs(title = "SXK9E Weekly",x =NULL) +
scale_y_sqrt()
ggarrange(sxk9e_daily, sxk9e_weekly, ncol=2, nrow=1)
sxk9s <- ride_df %>% filter(src_sub_area == "sxk9s") %>% .$demand
sxk9s_daily <- ggseasonplot(ts(sxk9s,frequency = 24),polar = T) +
theme(legend.position = "none") +
labs(title = "SXK9S Daily",x = "Hour") +
scale_y_sqrt()
sxk9s_w <- ride_df %>% filter(src_sub_area == "sxk9s") %>%
mutate(date = format(start_time,"%Y/%m/%d")) %>%
group_by(date) %>%
mutate(demand = sum(demand)) %>%
select(c(date,demand)) %>%
distinct()
sxk9s_weekly <- ggseasonplot(ts(sxk9s_w$demand,frequency = 7),polar = T) +
theme(legend.position = "none") +
labs(title = "SXK9S Weekly",x =NULL) +
scale_y_sqrt()
ggarrange(sxk9s_daily, sxk9s_weekly, ncol=2, nrow=1)
On the daily pattern plots, we can see that the highest demand occurs between 17:00 and 19:00 in the evening, while the lowest demand occurs around 5:00-6:00 in the morning. On the weekly pattern plots (starting from Sunday at 12 a.m.), we can see that the highest demand appears on Friday or Saturday.
I decided to use only the daily and weekly patterns in this modeling, because a monthly pattern is not very visible and the dataset only covers two months of observations.
Next, perform cross-validation by dividing the data into train data (to train the model) and test data (to evaluate the model). The test data consists of the observations from the last week; the rest goes into the train data.
Determine the beginning and end of the train and test data.
test_size <- 24*7
test_end <- max(ride_df$start_time)
test_start <- test_end - hours(test_size) + hours(1)
train_end <- test_start - hours(1)
train_start <- min(ride_df$start_time)
intrain <- interval(train_start, train_end)
intest <- interval(test_start, test_end)
Then, we label each start_time according to whether it falls in the train or the test dataset.
ride_df %<>%
mutate(sample = case_when(
start_time %within% intrain ~ "train",
start_time %within% intest ~ "test"
)) %>%
drop_na() %>%
mutate(sample = factor(sample, levels = c("train", "test")))
head(ride_df)
## # A tibble: 6 x 4
## # Groups: src_sub_area [1]
## start_time src_sub_area demand sample
## <dttm> <fct> <dbl> <fct>
## 1 2017-10-01 00:00:00 sxk97 6 train
## 2 2017-10-01 01:00:00 sxk97 4 train
## 3 2017-10-01 02:00:00 sxk97 9 train
## 4 2017-10-01 03:00:00 sxk97 2 train
## 5 2017-10-01 04:00:00 sxk97 5 train
## 6 2017-10-01 05:00:00 sxk97 4 train
Use the recipes package to scale the data and dampen the effect of outliers on our model. We need to reshape the data into wide format because the recipes package expects each variable as a column.
ride_df %<>%
spread(src_sub_area, demand)
recipe <- recipe(~., filter(ride_df, start_time %within% intrain)) %>%
step_sqrt(all_numeric()) %>%
step_center(all_numeric()) %>%
step_scale(all_numeric()) %>%
prep()
# Apply the trained recipe to the full dataset
ride_df <- bake(recipe, ride_df)
# Convert the data back to long format
ride_df %<>%
gather(src_sub_area, demand, -start_time,-sample)
head(ride_df)
## # A tibble: 6 x 4
## start_time sample src_sub_area demand
## <dttm> <fct> <chr> <dbl>
## 1 2017-10-01 00:00:00 train sxk97 -0.594
## 2 2017-10-01 01:00:00 train sxk97 -0.841
## 3 2017-10-01 02:00:00 train sxk97 -0.291
## 4 2017-10-01 03:00:00 train sxk97 -1.16
## 5 2017-10-01 04:00:00 train sxk97 -0.711
## 6 2017-10-01 05:00:00 train sxk97 -0.841
After scaling, don't forget to create a function to revert the data back to its original scale.
recipe_revert <- function(vector, rec, varname) {
rec_center <- rec$steps[[2]]$means[varname]
rec_scale <- rec$steps[[3]]$sds[varname]
results <- (vector * rec_scale + rec_center) ^ 2
results <- round(results)
results
}
Next, nest the Training data and Test data to simplify the model selection process.
ride_df %<>%
group_by(src_sub_area, sample) %>%
nest(data = c(start_time, demand)) %>%
pivot_wider(names_from = sample, values_from = data)
head(ride_df)
## # A tibble: 3 x 3
## # Groups: src_sub_area [3]
## src_sub_area train test
## <chr> <list> <list>
## 1 sxk97 <tibble [1,344 × 2]> <tibble [168 × 2]>
## 2 sxk9e <tibble [1,344 × 2]> <tibble [168 × 2]>
## 3 sxk9s <tibble [1,344 × 2]> <tibble [168 × 2]>
As noted before, there are two options for representing the data: a ts object with daily seasonality and an msts object with daily and weekly seasonality. To apply them to our data, we need to build a data frame that pairs each sub-area with the corresponding object-creation function.
# Make a list of object functions
data_funs <- list(
ts = function(x) ts(x$demand, frequency = 24),
msts = function(x) msts(x$demand, seasonal.periods = c(24, 24 * 7))
)
# Change the form of the function into a data frame
# that is ready to be combined with the dataset
data_funs %<>%
rep(length(unique(ride_df$src_sub_area))) %>%
enframe("data_fun_name", "data_fun") %>%
mutate(src_sub_area =
sort(rep(unique(ride_df$src_sub_area), length(unique(.$data_fun_name))))
)
data_funs
## # A tibble: 6 x 3
## data_fun_name data_fun src_sub_area
## <chr> <list> <chr>
## 1 ts <fn> sxk97
## 2 msts <fn> sxk97
## 3 ts <fn> sxk9e
## 4 msts <fn> sxk9e
## 5 ts <fn> sxk9s
## 6 msts <fn> sxk9s
Combine them into one dataframe:
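The joining call itself is not shown; a minimal sketch, assuming a left join on src_sub_area (the same pattern used when joining the models later):
# Hedged sketch: attach each sub-area's object-creation functions to its nested data
ride_df %<>%
left_join(data_funs, by = "src_sub_area")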
## # A tibble: 6 x 5
## # Groups: src_sub_area [3]
## src_sub_area train test data_fun_name data_fun
## <chr> <list> <list> <chr> <list>
## 1 sxk97 <tibble [1,344 × 2]> <tibble [168 × 2]> ts <fn>
## 2 sxk97 <tibble [1,344 × 2]> <tibble [168 × 2]> msts <fn>
## 3 sxk9e <tibble [1,344 × 2]> <tibble [168 × 2]> ts <fn>
## 4 sxk9e <tibble [1,344 × 2]> <tibble [168 × 2]> msts <fn>
## 5 sxk9s <tibble [1,344 × 2]> <tibble [168 × 2]> ts <fn>
## 6 sxk9s <tibble [1,344 × 2]> <tibble [168 × 2]> msts <fn>
Make a list of model algorithms to apply to the dataset. I used these models: ets, stlm, tbats, auto.arima, and Holt-Winters. I'll try all of them, then see which model gives the smallest error on its predictions. Note that the ets and auto.arima models are not applied to the multiple-seasonal (msts) objects because they are not compatible with that kind of object.
models <- list(
stlm = function(x) stlm(x),
tbats = function(x) tbats(x, use.box.cox = FALSE,
use.trend = TRUE,
use.damped.trend = TRUE,
use.parallel = FALSE),
holt.winter = function(x) HoltWinters(x,seasonal = "additive"),
auto.arima = function(x) auto.arima(x),
ets = function(x) ets(x)
)
models %<>%
rep(length(unique(ride_df$src_sub_area))) %>%
enframe("model_name", "model") %>%
mutate(src_sub_area =
sort(rep(unique(ride_df$src_sub_area), length(unique(.$model_name))))
)
head(models)
## # A tibble: 6 x 3
## model_name model src_sub_area
## <chr> <list> <chr>
## 1 stlm <fn> sxk97
## 2 tbats <fn> sxk97
## 3 holt.winter <fn> sxk97
## 4 auto.arima <fn> sxk97
## 5 ets <fn> sxk97
## 6 stlm <fn> sxk9e
Then combine models with data:
ride_df %<>%
left_join(models) %>%
filter(!(model_name == "ets" & data_fun_name == "msts"),
!(model_name == "auto.arima" & data_fun_name == "msts"))
head(ride_df)
## # A tibble: 6 x 7
## # Groups: src_sub_area [1]
## src_sub_area train test data_fun_name data_fun model_name model
## <chr> <list> <list> <chr> <list> <chr> <lis>
## 1 sxk97 <tibble [1,… <tibble [16… ts <fn> stlm <fn>
## 2 sxk97 <tibble [1,… <tibble [16… ts <fn> tbats <fn>
## 3 sxk97 <tibble [1,… <tibble [16… ts <fn> holt.wint… <fn>
## 4 sxk97 <tibble [1,… <tibble [16… ts <fn> auto.arima <fn>
## 5 sxk97 <tibble [1,… <tibble [16… ts <fn> ets <fn>
## 6 sxk97 <tibble [1,… <tibble [16… msts <fn> stlm <fn>
Apply the time series object creation functions and the modeling algorithms to the dataset.
# ride_df %<>%
# mutate(
# params = map(train, ~ list(x = .x)),
# data = invoke_map(data_fun, params),
# params = map(data, ~ list(x = .x)),
# fitted = invoke_map(model, params)
# ) %>%
# select(-data, -params)
#
# ride_model <- saveRDS(ride_df, "ride_model.RDS")
ride_df <- readRDS("ride_model.RDS")
head(ride_df)
## # A tibble: 6 x 8
## # Groups: src_sub_area [1]
## src_sub_area train test data_fun_name data_fun model_name model fitted
## <chr> <list> <list> <chr> <list> <chr> <lis> <list>
## 1 sxk97 <tibble … <tibble… ts <fn> stlm <fn> <stlm>
## 2 sxk97 <tibble … <tibble… ts <fn> tbats <fn> <tbat…
## 3 sxk97 <tibble … <tibble… ts <fn> holt.wint… <fn> <HltW…
## 4 sxk97 <tibble … <tibble… ts <fn> auto.arima <fn> <fr_A…
## 5 sxk97 <tibble … <tibble… ts <fn> ets <fn> <ets>
## 6 sxk97 <tibble … <tibble… msts <fn> stlm <fn> <stlm>
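A side note on the commented-out fitting code above: invoke_map() has been retired in newer purrr releases. A minimal sketch of an equivalent fitting step using map2(), assuming the same nested train, data_fun, and model columns:
# Hedged sketch, not the original code: build each ts/msts object from its train tibble,
# then fit the paired model to that object
ride_df %<>%
mutate(
data = map2(data_fun, train, ~ .x(.y)),
fitted = map2(model, data, ~ .x(.y))
) %>%
select(-data)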
Use mae_vec from the yardstick package to measure the train and test error.
ride_df %<>%
mutate(MAE_test =
map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
map2_dbl(test, ~ mae_vec(truth = recipe_revert(.y$demand,recipe,src_sub_area), estimate = recipe_revert(.x$mean,recipe,src_sub_area)))) %>%
arrange(src_sub_area, MAE_test)
ride_df %<>%
mutate(MAE_train =
map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
map2_dbl(train, ~ mae_vec(truth = recipe_revert(.y$demand,recipe,src_sub_area), estimate = recipe_revert(.x$fitted,recipe,src_sub_area)))) %>%
arrange(src_sub_area, MAE_test)
## # A tibble: 24 x 5
## # Groups: src_sub_area [3]
## src_sub_area data_fun_name model_name MAE_test MAE_train
## <chr> <chr> <chr> <dbl> <dbl>
## 1 sxk97 msts tbats 7.42 4.69
## 2 sxk97 msts stlm 8.67 3.63
## 3 sxk97 msts holt.winter 8.70 5.07
## 4 sxk97 ts ets 9 5.05
## 5 sxk97 ts tbats 9.02 4.91
## 6 sxk97 ts holt.winter 9.12 5.54
## 7 sxk97 ts stlm 9.34 4.67
## 8 sxk97 ts auto.arima 11.3 5.62
## 9 sxk9e msts tbats 9.76 6.48
## 10 sxk9e msts stlm 10.1 5.00
## # … with 14 more rows
Create a visualization that shows how each model's predictions differ from the real demand data.
ride_df_test <- ride_df %>%
mutate(
forecast =
map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
map2(test, ~ tibble(
start_time = .y$start_time,
demand = as.vector(.x$mean)
)),
key = paste(data_fun_name, model_name, sep = "-")
)
Reshape the data into long format before the visualization step.
ride_df_test %<>%
select(src_sub_area, key, actual = test, forecast) %>%
spread(key, forecast) %>%
gather(key, value, -src_sub_area) %>%
unnest(value) %>%
mutate(demand = recipe_revert(demand,recipe,src_sub_area))
# Visualization step
ggplotly(ggplot(ride_df_test,aes(x = start_time, y = demand)) +
geom_line(data = ride_df_test %>% filter(key == "actual"),aes(y = demand),alpha = 0.2,size = 0.8) +
geom_line(data = ride_df_test %>% filter(key != "actual"),aes(frame = key,col = key)) +
labs(x = "", y = "Order)",title = "Comparison Of Model Prediction Results", frame = "Models") + facet_wrap(~ src_sub_area, scale = "free_y", ncol = 1) +
theme_tq() +
scale_colour_tq()+
theme(legend.position = "none"))
## # A tibble: 6 x 4
## # Groups: src_sub_area [1]
## src_sub_area key start_time demand
## <chr> <chr> <dttm> <dbl>
## 1 sxk97 actual 2017-11-26 00:00:00 38
## 2 sxk97 actual 2017-11-26 01:00:00 21
## 3 sxk97 actual 2017-11-26 02:00:00 10
## 4 sxk97 actual 2017-11-26 03:00:00 10
## 5 sxk97 actual 2017-11-26 04:00:00 5
## 6 sxk97 actual 2017-11-26 05:00:00 2
Choose the model that has the smallest prediction error for each sub-area, then combine the test and train data.
# Smallest error selection
ride_df %<>%
select(-fitted) %>%
group_by(src_sub_area) %>%
filter(MAE_test == min(MAE_test)) %>%
ungroup()
# Combining test and train data
ride_df %<>%
mutate(fulldata = map2(train, test, ~ bind_rows(.x, .y))) %>%
select(src_sub_area, fulldata, everything(), -train, -test)
ride_df
## # A tibble: 3 x 8
## src_sub_area fulldata data_fun_name data_fun model_name model MAE_test
## <chr> <list> <chr> <list> <chr> <lis> <dbl>
## 1 sxk97 <tibble… msts <fn> tbats <fn> 7.42
## 2 sxk9e <tibble… msts <fn> tbats <fn> 9.76
## 3 sxk9s <tibble… msts <fn> tbats <fn> 6.94
## # … with 1 more variable: MAE_train <dbl>
Then we do the nested fitting. The tbats model has the smallest prediction error for each sub-area. Next, combine the test data and train data to create the final model: fit the tbats model on the combined dataset, then predict demand for the next 7 days.
# Running model
# ride_df %<>%
# mutate(
# params = map(fulldata, ~ list(x = .x)),
# data = invoke_map(data_fun, params),
# params = map(data, ~ list(x = .x)),
# fitted = invoke_map(model, params)
# ) %>%
# select(-data, -params)
#
# ride_bestmodel <- saveRDS(ride_df, "ride_bestmodel.RDS")
ride_df <- readRDS("ride_bestmodel.RDS")
# Make Prediction
ride_df %<>%
mutate(forecast =
map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
map2(fulldata, ~ tibble(
start_time = tk_make_future_timeseries(.y$start_time, 24 * 7),
demand = as.vector(.x$mean)
))
)
ride_df
## # A tibble: 3 x 10
## src_sub_area fulldata data_fun_name data_fun model_name model MAE_error_train
## <chr> <list> <chr> <list> <chr> <lis> <dbl>
## 1 sxk97 <tibble… msts <fn> tbats <fn> 4.69
## 2 sxk9e <tibble… msts <fn> tbats <fn> 6.48
## 3 sxk9s <tibble… msts <fn> tbats <fn> 4.81
## # … with 3 more variables: MAE_error_test <dbl>, fitted <list>, forecast <list>
Unnest the data and create visualizations of the prediction results.
ride_df %<>%
select(src_sub_area, actual = fulldata, forecast) %>%
gather(key, value, -src_sub_area) %>%
unnest(value) %>%
mutate(demand = recipe_revert(demand,recipe,src_sub_area))
# Visualization
ggplotly(ggplot(ride_df,aes(x = start_time, y = demand, colour = key)) +
geom_line() +
labs(y = NULL, x = NULL, colour = NULL, title = "Model Prediction Results") +
facet_wrap(~ src_sub_area, scale = "free_y", ncol = 1) +
theme_tq())
Get the final forecast result.
ride_actual <- ride_df %>%
filter(key == "actual")
ride_forecast <- ride_df %>%
filter(key == "forecast") %>%
filter(start_time >= "2017-12-03 00:00:00")
ride_final <- rbind(ride_actual, ride_forecast)
data_submit <- ride_final %>%
filter(key == "forecast") %>%
rename(datetime = start_time) %>%
select(- key)
#write.csv(data_submit, "data-submission.csv")
head(data_submit)
## # A tibble: 6 x 3
## src_sub_area datetime demand
## <chr> <dttm> <dbl>
## 1 sxk97 2017-12-03 00:00:00 40
## 2 sxk97 2017-12-03 01:00:00 30
## 3 sxk97 2017-12-03 02:00:00 21
## 4 sxk97 2017-12-03 03:00:00 15
## 5 sxk97 2017-12-03 04:00:00 11
## 6 sxk97 2017-12-03 05:00:00 7
The forecasts from the tbats models show the best performance for every sub-area. This online transportation case has two types of seasonality, daily and weekly, which is why we used stlm, tbats, and Holt-Winters, methods that can accommodate that kind of seasonality.