Scotty: Bring me the crystal ball!

Reynaldi Gevin

2022-08-08

Introduction

Scotty is a ride-sharing business operating in several big cities in Turkey. The company provides a motorcycle ride-sharing service for Turkey's citizens and really values efficiency in traveling through traffic; the app even references Star Trek's "beam me up" in its order buttons.

Scotty provided us with a real-time transaction dataset. With this dataset, we are going to help them solve their problems and improve their business processes.

Objective

It’s almost the end of 2017 and we need to prepare a forecast model to help Scotty get ready for the end-of-year demand. Unfortunately, Scotty is not old enough to have last year’s data for December, so we cannot look back at past demand to prepare a forecast for December. Fortunately, we already know that time series analysis is more than enough to help us forecast! But, as an investment in the business’ future, we need to develop an automated forecasting framework so we don’t have to meddle with forecast model selection in the future.

Build an automated forecasting model for hourly demand that will be evaluated on the next 7 days (Sunday, December 3rd 2017 to Saturday, December 9th 2017)!

Load Libraries

library(dplyr)       # data manipulation
library(lubridate)   # restructuring a certain type of datetime
library(tidyverse)   # data wrangling
library(forecast)    # time series modelling and forecasting
library(yardstick)   # measuring forecasting performance
library(recipes)     # data preprocessing
library(magrittr)    # additional pipe operators such as %<>%
library(timetk)      # additional toolkit needed for time series models
library(purrr)       # nested model fitting
library(tidyquant)   # creating better ggplot aesthetics
library(padr)        # adjusting time stamp
library(ggpubr)      # data visualization
library(plotly)      # interactive plotting
library(ggplot2)     # data visualization
library(tidyr)       # data manipulation

About Dataset

Import data

The train dataset contains detailed transaction records from October 1st 2017 to December 2nd 2017:

df <- read.csv("Capstone ML Data/5. scotty-ts/data/data-train.csv")
rmarkdown::paged_table(df)

Meta data

The dataset includes information about:

  • id: Transaction id
  • trip_id: Trip id
  • driver_id: Driver id
  • rider_id: Rider id
  • start_time: Start time of request
  • src_lat: Request source latitude
  • src_lon: Request source longitude
  • src_area: Request source area
  • src_sub_area: Request source sub-area
  • dest_lat: Requested destination latitude
  • dest_lon: Requested destination longitude
  • dest_area: Requested destination area
  • dest_sub_area: Requested destination sub-area
  • distance: Trip distance (in KM)
  • status: Trip status (every status is counted as demand)
  • confirmed_time_sec: Time difference from request to confirmation (in seconds)

Data Wrangling

Datatype check

glimpse(df)
## Rows: 90,113
## Columns: 16
## $ id                 <chr> "59d005e1ffcfa261708ce9cd", "59d0066affcfa261708ceb…
## $ trip_id            <chr> "59d005e9cb564761a8fe5d3e", NA, "59d006c131e39c6189…
## $ driver_id          <chr> "59a892c5568be44b2734f276", NA, "599dc0dfa5b4fd5471…
## $ rider_id           <chr> "59ad2d6efba75a581666b506", "59cd704bcf482f6ce2fadf…
## $ start_time         <chr> "2017-10-01T00:00:17Z", "2017-10-01T00:02:34Z", "20…
## $ src_lat            <dbl> 41.07047, 41.07487, 41.04995, 41.05287, 41.06760, 4…
## $ src_lon            <dbl> 29.01945, 28.99528, 29.03107, 28.99522, 28.98827, 2…
## $ src_area           <chr> "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", "sx…
## $ src_sub_area       <chr> "sxk9s", "sxk9e", "sxk9s", "sxk9e", "sxk9e", "sxk9s…
## $ dest_lat           <dbl> 41.11716, 41.08351, 41.04495, 41.08140, 41.02125, 4…
## $ dest_lon           <dbl> 29.03650, 29.00228, 28.98192, 28.98197, 29.11316, 2…
## $ dest_area          <chr> "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", "sx…
## $ dest_sub_area      <chr> "sxk9u", "sxk9e", "sxk9e", "sxk9e", "sxk9q", "sxk9e…
## $ distance           <dbl> 5.379250, 1.126098, 4.169492, 3.358296, 11.693573, …
## $ status             <chr> "confirmed", "nodrivers", "confirmed", "confirmed",…
## $ confirmed_time_sec <int> 8, 0, 32, 65, 0, 27, 32, 30, 0, 24, 9, 0, 118, 106,…

There are some columns that have incorrect data types

# Count unique values in every column
sapply(df, n_distinct)
##                 id            trip_id          driver_id           rider_id 
##              90113              85438               1516              16010 
##         start_time            src_lat            src_lon           src_area 
##              88729              65749              57103                  1 
##       src_sub_area           dest_lat           dest_lon          dest_area 
##                  3              37496              30055                 53 
##      dest_sub_area           distance             status confirmed_time_sec 
##                164              76240                  2                397

We will change some data types that are not correct, such as:

  • start_time to date-time
  • driver_id, rider_id, src_sub_area, dest_area, dest_sub_area, status to factor

Change datatypes

df <-  df %>% 
  mutate_at(vars(driver_id,
                 rider_id,
                 src_sub_area,
                 dest_area,
                 dest_sub_area,
                 status),
            as.factor) %>%
  mutate(start_time = ymd_hms(start_time)) 
glimpse(df)
## Rows: 90,113
## Columns: 16
## $ id                 <chr> "59d005e1ffcfa261708ce9cd", "59d0066affcfa261708ceb…
## $ trip_id            <chr> "59d005e9cb564761a8fe5d3e", NA, "59d006c131e39c6189…
## $ driver_id          <fct> 59a892c5568be44b2734f276, NA, 599dc0dfa5b4fd5471ad8…
## $ rider_id           <fct> 59ad2d6efba75a581666b506, 59cd704bcf482f6ce2fadfdb,…
## $ start_time         <dttm> 2017-10-01 00:00:17, 2017-10-01 00:02:34, 2017-10-…
## $ src_lat            <dbl> 41.07047, 41.07487, 41.04995, 41.05287, 41.06760, 4…
## $ src_lon            <dbl> 29.01945, 28.99528, 29.03107, 28.99522, 28.98827, 2…
## $ src_area           <chr> "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", "sx…
## $ src_sub_area       <fct> sxk9s, sxk9e, sxk9s, sxk9e, sxk9e, sxk9s, sxk9s, sx…
## $ dest_lat           <dbl> 41.11716, 41.08351, 41.04495, 41.08140, 41.02125, 4…
## $ dest_lon           <dbl> 29.03650, 29.00228, 28.98192, 28.98197, 29.11316, 2…
## $ dest_area          <fct> sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk…
## $ dest_sub_area      <fct> sxk9u, sxk9e, sxk9e, sxk9e, sxk9q, sxk9e, sxk9e, sx…
## $ distance           <dbl> 5.379250, 1.126098, 4.169492, 3.358296, 11.693573, …
## $ status             <fct> confirmed, nodrivers, confirmed, confirmed, nodrive…
## $ confirmed_time_sec <int> 8, 0, 32, 65, 0, 27, 32, 30, 0, 24, 9, 0, 118, 106,…

Data Preparation

Floor start_time

df <- df %>%
  mutate(datetime = floor_date(start_time, unit = "hour"))

Calculate the demand for each area in every hour

# take 3 sub areas separately <data aggregation>
sxk97 <- df %>% 
  filter(src_sub_area == 'sxk97') %>%
  group_by(src_sub_area, datetime) %>% 
  summarise(demand = n(), .groups = "drop")

sxk9e <- df %>% 
  filter(src_sub_area == 'sxk9e') %>% 
  group_by(src_sub_area, datetime) %>% 
  summarise(demand = n(), .groups = "drop")

sxk9s <- df %>% 
  filter(src_sub_area == 'sxk9s') %>% 
  group_by(src_sub_area, datetime) %>% 
  summarise(demand = n(), .groups = "drop")

Characteristics of Time Series Data

Time series data: data that is indexed by time and has a fixed time interval.

💡 The main requirements for time series data:

  1. The data must be sorted by time period, from the oldest data to the newest
  2. The time intervals must all be the same
  3. There must be no missing data in any interval
  4. There must be no empty periods

Check the datetime range in every sub-area

range(sxk9e$datetime)
## [1] "2017-10-01 00:00:00 UTC" "2017-12-02 23:00:00 UTC"
range(sxk97$datetime)
## [1] "2017-10-01 00:00:00 UTC" "2017-12-02 23:00:00 UTC"
range(sxk9s$datetime)
## [1] "2017-10-01 00:00:00 UTC" "2017-12-02 23:00:00 UTC"

Each area has the same time range

Inspect missing periods

all(seq(
     from=as.POSIXct("2017-10-01 00:00:00", tz="UTC"),
     to=as.POSIXct("2017-12-02 23:00:00", tz="UTC"),
     by="hour"
   ) == sxk9e$datetime)
## [1] FALSE

The data above is not ready for time series analysis because there are several missing periods in the 1-hour intervals
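
To see exactly which hourly timestamps are absent, a small base-R check such as the following (a sketch on the sxk9e aggregate; the other sub-areas can be checked the same way) lists them:

# quick sketch: list the hourly timestamps missing from the sxk9e aggregate
full_hours <- seq(
  from = min(sxk9e$datetime),
  to   = max(sxk9e$datetime),
  by   = "hour"
)
missing_hours <- full_hours[!full_hours %in% sxk9e$datetime]
length(missing_hours)   # number of hours with no recorded demand
head(missing_hours)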

Padding

Padding aims to make the data intervals complete and equal: it patches the time period (hours) based on the date-time column and fills the newly created rows with missing values (NA).
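
The chunks below pad each sub-area separately. As a more compact alternative (a sketch only, not what this analysis runs), padr::pad() can pad all sub-areas in one pass through its group argument:

# equivalent padding of all three sub-areas in one pass (illustrative only;
# the per-area chunks below are what this analysis actually uses)
scotty_padded <- bind_rows(sxk97, sxk9e, sxk9s) %>%
  arrange(src_sub_area, datetime) %>%
  pad(group = "src_sub_area") %>%   # hourly padding within each sub-area
  replace_na(list(demand = 0L))     # a missing hour means zero demand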

sxk97_use <- sxk97 %>% 
  arrange(datetime) %>%  # sort time
  pad(start_val = min(sxk97$datetime), # padding process
      end_val = max(sxk97$datetime)) %>% 
  mutate(demand = ifelse(is.na(demand), 0, demand)) %>% # fill `NA` in demand with 0
  replace(is.na(.), "sxk97") # fill `NA` in sub_area with sub_area name
## pad applied on the interval: hour
sxk9e_use <- sxk9e %>% 
  arrange(datetime) %>%  
  pad(start_val = min(sxk9e$datetime),
      end_val = max(sxk9e$datetime)) %>% 
  mutate(demand = ifelse(is.na(demand), 0, demand)) %>% 
  replace(is.na(.), "sxk9e")
## pad applied on the interval: hour
sxk9s_use <- sxk9s %>% 
  arrange(datetime) %>%
  pad(start_val = min(sxk9s$datetime),
      end_val = max(sxk9s$datetime)) %>% 
  mutate(demand = ifelse(is.na(demand), 0, demand)) %>% 
  replace(is.na(.), "sxk9s")
## pad applied on the interval: hour
# missing value final check 
anyNA(sxk9s_use)
## [1] FALSE
anyNA(sxk97_use)
## [1] FALSE
anyNA(sxk9e_use)
## [1] FALSE

A demand equal to 0 means that there was no incoming demand at that hour

The data is now ready to be used for time series analysis

Gather each sub area into one dataframe

scotty <- rbind(sxk97_use,sxk9e_use,sxk9s_use)
rmarkdown::paged_table(scotty)

Exploratory Data Analysis

We would like to see the demand by sub-area throughout the day:

ggplotly(ggplot(scotty,aes(x = datetime,
                      y = demand)) +
           geom_line(aes(col = src_sub_area)) +
           labs(y = "Order Request",
                x = "Date",
                title = "General Order Demand") +
           facet_wrap(~ src_sub_area,
                      scale = "free_y",
                      ncol = 1) +
           theme_bw() +
           theme(legend.position = "none",
                 plot.title = element_text(hjust = 0.5)))

Based on the graph, it can be seen that on November 3rd, 2017 at 6 pm there was very high demand in the SXK97 (207) and SXK9S (217) areas, visible as a sharp spike at that time. Meanwhile, for the SXK9E area, the highest demand occurred on November 24th, 2017 at 6 pm, with 200 requests

We can see that there is a seasonal pattern, although the trend is not clear yet. To investigate further, we will make daily and weekly polar plots

Daily and Weekly Demand by Sub-Area

sxk97

sxk97_plot <- scotty %>%
  filter(src_sub_area == "sxk97") %>%
  .$demand

sxk97_daily <- ggseasonplot(ts(sxk97_plot,
                               frequency = 24),
                            polar = T)  +
  theme_light() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5)) +
  labs(title = "SXK97 Daily",
       x = "Hour") +
  scale_y_sqrt()
 
sxk97_w <- scotty %>%
  filter(src_sub_area == "sxk97") %>%
  mutate(date = format(datetime,"%Y/%m/%d")) %>%
  group_by(date) %>%
  mutate(demand = sum(demand)) %>%
  select(c(date,demand)) %>%
  distinct()
 
sxk97_weekly <- ggseasonplot(ts(sxk97_w$demand, frequency = 7),
                             polar = T)  +
  theme_light() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5)) +
  labs(title = "SXK97 Weekly",
       x = "Day") +
  scale_y_sqrt()

ggarrange(sxk97_daily, sxk97_weekly)

In SXK97, the peak hour is 6 pm and the peak day is Friday. The off-peak hours are 5-6 am, and Sunday is the day with the least demand.

sxk9e

sxk9e_plot<- scotty %>% 
  filter(src_sub_area == "sxk9e") %>% 
  .$demand

sxk9e_daily <- ggseasonplot(ts(sxk9e_plot,frequency = 24),
                            polar = T)  +
  theme_light() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5)) +
  labs(title = "SXK9E Daily",
       x = "Hour") +
  scale_y_sqrt()

sxk9e_w <- scotty %>% 
  filter(src_sub_area == "sxk9e") %>% 
  mutate(date = format(datetime,"%Y/%m/%d")) %>% 
  group_by(date) %>% 
  mutate(demand = sum(demand)) %>% 
  select(c(date,demand)) %>% 
  distinct()

sxk9e_weekly <- ggseasonplot(ts(sxk9e_w$demand,frequency = 7),
                             polar = T)  +
  theme_light() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5)) +
  labs(title = "SXK9E Weekly",
       x = "Day") +
  scale_y_sqrt()

ggarrange(sxk9e_daily, sxk9e_weekly)

In SXK9E, the peak hours are 6-7 pm and the peak day is Friday. The off-peak hours are 5-6 am, and Sunday is the day with the least demand.

sxk9s

sxk9s_plot<- scotty %>% 
  filter(src_sub_area == "sxk9s") %>% 
  .$demand

sxk9s_daily <- ggseasonplot(ts(sxk9s_plot,frequency = 24),
                            polar = T)  +
  theme_light() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5)) +
  labs(title = "SXK9S Daily",
       x = "Hour") +
  scale_y_sqrt()

sxk9s_w <- scotty %>% 
  filter(src_sub_area == "sxk9s") %>% 
  mutate(date = format(datetime,"%Y/%m/%d")) %>% 
  group_by(date) %>% 
  mutate(demand = sum(demand)) %>% 
  select(c(date,demand)) %>% 
  distinct()

sxk9s_weekly <- ggseasonplot(ts(sxk9s_w$demand,frequency = 7),
                             polar = T)  +
  theme_light() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5)) +
  labs(title = "SXK9S Weekly",
       x = "Day") +
  scale_y_sqrt()

ggarrange(sxk9s_daily, sxk9s_weekly)

In SXK9S, the peak hour is 6 pm and the peak day is Friday. The off-peak hours are 5-6 am, and Sunday is the day with the least demand.

Decomposition

Decomposition is a stage in time series analysis that is used to break the time series down into its components.

💡 Components in the time series:

  • Trend: the general pattern of the data, tending to increase or decrease. If the extracted trend still shows a repeating pattern, it means some pattern has not been decomposed properly.
  • Seasonal: seasonal patterns that repeat over a fixed period of time
  • Error/Remainder/Random: patterns that cannot be captured by the trend and seasonal components

Before doing any forecasting modeling, we need to observe the time series object through its decomposition. The main idea of decomposition is to break the ts object into its three components (trend, seasonal, residual).

Daily

area <- scotty %>% 
  filter(src_sub_area == "sxk97")

ts_day <- ts(data = area$demand,
         frequency = 24)
ts_day %>% 
  decompose() %>% 
  autoplot() +
  labs(title = "Decomposition on Daily Time Series") +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.3))

Weekly

ts_week <- ts(data = area$demand,
         frequency = 24*7)

ts_week %>% 
  decompose() %>% 
  autoplot() +
  labs(title = "Decomposition on Weekly Time Series") +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.3))

Unfortunately, the trend resulting from the daily and weekly decompositions is not smooth enough. That might be caused by uncaptured extra seasonality, so the data can be considered multi-seasonal.

Therefore, we need to try another option by creating a multiple-seasonality time series object, msts, with daily and weekly seasonality.

Daily-Weekly (Multi Seasonality)

daily_weekly <- msts(data = area$demand,
                     seasonal.periods = c(24, 24*7))

daily_weekly %>%
  mstl() %>% 
  autoplot() +
  labs(title = "Decomposition on Daily Weekly Time Series") +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.3))

This time the trend has become smoother, which indicates that our data has multiple seasonality

The conclusions obtained from the EDA process will help us to choose the best time series model


Data preprocessing

Cross Validation

Before modeling, we have to separate our data into two parts: a train and a test dataset. The test dataset will be the last week of the data, while the remainder is our train dataset.

# size of train data and test data
test_size <- 24 * 7
train_size <- nrow(sxk97_use) - test_size
# initial range and final range for each data
test_end <- max(scotty$datetime)
test_start <- test_end - hours(test_size) + hours(1)

train_end <- test_start - hours(1)
train_start <- train_end - hours(train_size) + hours(1)
# interval for each data
intrain <- interval(train_start, train_end)
intest <- interval(test_start, test_end)

Plot train and test

train_test_plot <- scotty %>%
  mutate(sample = case_when(
    datetime %within% intrain ~ "train",
    datetime %within% intest ~ "test"
  )) %>%
  mutate(sample = factor(sample, levels = c("train", "test"))) %>%
  ggplot(aes(x = datetime, y = demand, colour = sample)) +
    geom_line() +
    labs(x = NULL, y = NULL, colour = NULL) +
    facet_wrap(~ src_sub_area, scale = "free", ncol = 1) +
    theme_light()

ggplotly(train_test_plot)

Date time series

Before we create the time series objects, we will first preprocess the data using the recipes library. recipes works column-wise, so we will reshape our data into a wide format based on src_sub_area using the spread() function from the tidyr library

scotty <- scotty %>% 
  spread(src_sub_area, demand)
rmarkdown::paged_table(scotty)

We will do some data transformation, which aims to make our model more robust and less sensitive to outliers

Scaling will be performed using the recipe() function from the recipes library:

rec <- recipe(~ ., filter(scotty, datetime %within% intrain)) %>% 
  step_sqrt(all_numeric()) %>% 
  step_center(all_numeric()) %>% 
  step_scale(all_numeric()) %>% 
  prep()
# preview bake result
scotty <- bake(rec, scotty)
rmarkdown::paged_table(scotty)

Because we have transformed the data, we must create a function to return our data to its original values. So we create a new function named rec_revert

rec_revert <- function(vector, rec, varname) {

  # store recipe values
  rec_center <- rec$steps[[2]]$means[varname]
  rec_scale <- rec$steps[[3]]$sds[varname]

  # convert back based on the recipe
  results <- (vector * rec_scale + rec_center) ^ 2

  # add additional adjustment if necessary
  results <- round(results)

  # return the results
  results
  
}

This revert function converts the data back to its original form.
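
As a quick sanity check (a sketch; at this point scotty is still in its wide, transformed form), reverting one column should give back counts on the original demand scale:

# sanity check (sketch): revert the transformed sxk97 column back to
# the original demand scale; values should be non-negative counts
head(rec_revert(scotty$sxk97, rec, "sxk97"))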

Now we can convert our data into the long format again:

scotty <- scotty %>% 
  gather(src_sub_area, demand, -datetime)

rmarkdown::paged_table(scotty) 

Modelling

In functional programming using purrr, we need to convert our tbl into a nested tbl. Nested data is like a table inside a table, which can be controlled using a key indicator; in other words, we can have a tbl for each sub-area and sample. Using this format, the fitting and forecasting process becomes very versatile, and we can still unnest the results as long as we have a proper key such as src_sub_area.

Let’s start by converting our tbl into a nested tbl. First, we need to add a sample indicator so it can be recognized as a key when we nest() the data:

# label data to train and test
scotty <- scotty %>% 
  mutate(sample = case_when(
    datetime %within% intrain ~ "train",
    datetime %within% intest ~ "test"
  )) %>%
  drop_na()
# nest train data
scotty <- scotty %>% 
  group_by(src_sub_area, sample) %>% 
  nest(.key = "data") %>% 
  spread(sample, data)
## Warning: `.key` is deprecated
rmarkdown::paged_table(scotty) 

Time series format

To turn our data into time series form we need the ts() or msts() function. So we will create a list of functions named data_funs that contains both, using the following seasonalities:

data_funs <- list(
  ts = function(x) ts(x$demand, frequency = 24),
  msts = function(x) msts(x$demand, seasonal.periods = c(24, 24*7))
)

data_funs
## $ts
## function(x) ts(x$demand, frequency = 24)
## 
## $msts
## function(x) msts(x$demand, seasonal.periods = c(24, 24*7))

ts is used for single seasonality and will use daily seasonality (frequency = 24). msts is used for multiple seasonality and will use daily (frequency = 24) and weekly (frequency = 24 * 7) seasonality.
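
As a quick check (a sketch, assuming the first nested row is sxk97 as in the preview above), applying the stored functions to a training tibble returns objects with the intended seasonal structure:

# quick check (sketch): apply the stored functions to one training tibble
sxk97_train <- scotty$train[[1]]       # assumes the first nested row is sxk97
frequency(data_funs$ts(sxk97_train))   # daily seasonality: 24
class(data_funs$msts(sxk97_train))     # "msts" "ts"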

Then we convert the list into a tbl using enframe(). Note that we should also give a key, which is src_sub_area in our case, replicated with the rep() function:

# list to tbl
data_funs <-  data_funs %>% 
  rep(length(unique(scotty$src_sub_area))) %>% 
  enframe("data_fun_name", "data_fun") %>% 
  mutate(src_sub_area = 
           sort(rep(unique(scotty$src_sub_area), length(unique(.$data_fun_name))))
         )

data_funs
## # A tibble: 6 × 3
##   data_fun_name data_fun src_sub_area
##   <chr>         <list>   <chr>       
## 1 ts            <fn>     sxk97       
## 2 msts          <fn>     sxk97       
## 3 ts            <fn>     sxk9e       
## 4 msts          <fn>     sxk9e       
## 5 ts            <fn>     sxk9s       
## 6 msts          <fn>     sxk9s

Then we join the nested functions with our nested data for model fitting:

# join nested function
scotty <- scotty %>%
  left_join(data_funs)
## Joining, by = "src_sub_area"
scotty
## # A tibble: 6 × 5
## # Groups:   src_sub_area [3]
##   src_sub_area test               train                data_fun_name data_fun
##   <chr>        <list>             <list>               <chr>         <list>  
## 1 sxk97        <tibble [168 × 2]> <tibble [1,344 × 2]> ts            <fn>    
## 2 sxk97        <tibble [168 × 2]> <tibble [1,344 × 2]> msts          <fn>    
## 3 sxk9e        <tibble [168 × 2]> <tibble [1,344 × 2]> ts            <fn>    
## 4 sxk9e        <tibble [168 × 2]> <tibble [1,344 × 2]> msts          <fn>    
## 5 sxk9s        <tibble [168 × 2]> <tibble [1,344 × 2]> ts            <fn>    
## 6 sxk9s        <tibble [168 × 2]> <tibble [1,344 × 2]> msts          <fn>

Time series model list

We will use five different models

  • Exponential smoothing state space model (ets)
  • Autoregressive integrated moving average (Auto arima)
  • Seasonal and Trend decomposition using Loess (stlm)
  • Holt-Winters exponential smoothing (HoltWinters)
  • Trigonometric seasonality, Box-Cox transformation, ARMA errors, Trend and Seasonal components (tbats)
# models list
models <- list(
  auto.arima = function(x) auto.arima(x),
  ets = function(x) ets(x),
  holt.winter = function(x) HoltWinters(x, seasonal = "additive"),
  stlm = function(x) stlm(x),
  tbats = function(x) tbats(x, use.box.cox = FALSE)
)

models
## $auto.arima
## function(x) auto.arima(x)
## 
## $ets
## function(x) ets(x)
## 
## $holt.winter
## function(x) HoltWinters(x, seasonal = "additive")
## 
## $stlm
## function(x) stlm(x)
## 
## $tbats
## function(x) tbats(x, use.box.cox = FALSE)
# combine into nested format: our previous nested model
models <- models %>% 
  rep(length(unique(scotty$src_sub_area))) %>% 
  enframe("model_name", "model") %>% 
  mutate(src_sub_area = 
           sort(rep(unique(scotty$src_sub_area), length(unique(.$model_name))))
         )
models
## # A tibble: 15 × 3
##    model_name  model  src_sub_area
##    <chr>       <list> <chr>       
##  1 auto.arima  <fn>   sxk97       
##  2 ets         <fn>   sxk97       
##  3 holt.winter <fn>   sxk97       
##  4 stlm        <fn>   sxk97       
##  5 tbats       <fn>   sxk97       
##  6 auto.arima  <fn>   sxk9e       
##  7 ets         <fn>   sxk9e       
##  8 holt.winter <fn>   sxk9e       
##  9 stlm        <fn>   sxk9e       
## 10 tbats       <fn>   sxk9e       
## 11 auto.arima  <fn>   sxk9s       
## 12 ets         <fn>   sxk9s       
## 13 holt.winter <fn>   sxk9s       
## 14 stlm        <fn>   sxk9s       
## 15 tbats       <fn>   sxk9s

The next step is to join the models list to the nested scotty data. Because the ets and auto.arima models cannot handle multi-seasonal time series, we will not combine those two models with the msts time series type:

# drop model that we don't need
scotty <- scotty %>% 
  left_join(models) %>% # join by "src_sub_area"
  filter(
    !(model_name == "ets" & data_fun_name == "msts"),
    !(model_name == "auto.arima" & data_fun_name == "msts")
  )
## Joining, by = "src_sub_area"
scotty
## # A tibble: 24 × 7
## # Groups:   src_sub_area [3]
##    src_sub_area test     train    data_fun_name data_fun model_name  model 
##    <chr>        <list>   <list>   <chr>         <list>   <chr>       <list>
##  1 sxk97        <tibble> <tibble> ts            <fn>     auto.arima  <fn>  
##  2 sxk97        <tibble> <tibble> ts            <fn>     ets         <fn>  
##  3 sxk97        <tibble> <tibble> ts            <fn>     holt.winter <fn>  
##  4 sxk97        <tibble> <tibble> ts            <fn>     stlm        <fn>  
##  5 sxk97        <tibble> <tibble> ts            <fn>     tbats       <fn>  
##  6 sxk97        <tibble> <tibble> msts          <fn>     holt.winter <fn>  
##  7 sxk97        <tibble> <tibble> msts          <fn>     stlm        <fn>  
##  8 sxk97        <tibble> <tibble> msts          <fn>     tbats       <fn>  
##  9 sxk9e        <tibble> <tibble> ts            <fn>     auto.arima  <fn>  
## 10 sxk9e        <tibble> <tibble> ts            <fn>     ets         <fn>  
## # … with 14 more rows

Create the model

Run the functions to build the time series objects and fit the models; the results are stored in the fitted column.

Since this step is time consuming, we save the fitted models in “scotty_model.RDS” for further analysis.

# scotty <- scotty %>%
#   mutate(
#     params = map(train, ~ list(x = .x)),
#     data = invoke_map(data_fun, params),
#     params = map(data, ~ list(x = .x)),
#     fitted = invoke_map(model, params)
#   ) %>%
#   select(-data, -params)
#
#
# saveRDS(scotty, "scotty_model.RDS")
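
To make the commented pipeline concrete, this is roughly what it does for a single row of the nested tbl (a sketch, assuming the first row is the sxk97 / ts / auto.arima combination shown in the preview above):

# what the pipeline does for one row: build the series, then fit the model
one_series <- scotty$data_fun[[1]](scotty$train[[1]])  # ts(demand, frequency = 24)
one_fit <- scotty$model[[1]](one_series)               # auto.arima() fit; this can take a while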

Calculate test and train errors

# read model
scotty <- readRDS("scotty_model.RDS")
# calculate errors
scotty_model_error <- scotty %>% 
  mutate(MAE_error_test = map(fitted, ~ forecast(.x, h = test_size)) %>%
    map2_dbl(test, ~ mae_vec(truth = rec_revert(.y$demand,rec,src_sub_area),
                             estimate = rec_revert(.x$mean,rec,src_sub_area)))) %>% 
  arrange(src_sub_area, MAE_error_test) %>% 
  mutate(MAE_error_train = map(fitted, ~ forecast(.x, h = test_size)) %>%
    map2_dbl(train, ~ mae_vec(truth = rec_revert(.y$demand,rec,src_sub_area),
                             estimate = rec_revert(.x$fitted,rec,src_sub_area)))) %>% 
  arrange(src_sub_area, MAE_error_train) 

scotty_model_error
## # A tibble: 24 × 10
## # Groups:   src_sub_area [3]
##    src_sub_area test     train    data_fun_name data_fun model_name  model 
##    <chr>        <list>   <list>   <chr>         <list>   <chr>       <list>
##  1 sxk97        <tibble> <tibble> msts          <fn>     stlm        <fn>  
##  2 sxk97        <tibble> <tibble> ts            <fn>     stlm        <fn>  
##  3 sxk97        <tibble> <tibble> msts          <fn>     tbats       <fn>  
##  4 sxk97        <tibble> <tibble> ts            <fn>     tbats       <fn>  
##  5 sxk97        <tibble> <tibble> ts            <fn>     ets         <fn>  
##  6 sxk97        <tibble> <tibble> msts          <fn>     holt.winter <fn>  
##  7 sxk97        <tibble> <tibble> ts            <fn>     holt.winter <fn>  
##  8 sxk97        <tibble> <tibble> ts            <fn>     auto.arima  <fn>  
##  9 sxk9e        <tibble> <tibble> msts          <fn>     stlm        <fn>  
## 10 sxk9e        <tibble> <tibble> ts            <fn>     stlm        <fn>  
## # … with 14 more rows, and 3 more variables: fitted <list>,
## #   MAE_error_test <dbl>, MAE_error_train <dbl>

Choose the best model: the model with smallest MAE_error_test

scotty_best_model <- scotty_model_error %>%
  group_by(src_sub_area) %>% 
  arrange(MAE_error_test) %>% 
  slice(1) %>%  #choose the smallest error & best model for each area directly
  ungroup() %>% 
  select(src_sub_area, ends_with("_name"),MAE_error_test)

scotty_best_model
## # A tibble: 3 × 4
##   src_sub_area data_fun_name model_name MAE_error_test
##   <chr>        <chr>         <chr>               <dbl>
## 1 sxk97        msts          tbats                7.60
## 2 sxk9e        msts          tbats                9.26
## 3 sxk9s        msts          tbats                7
scotty_best_model <- scotty_best_model %>% 
  select(-MAE_error_test) %>% 
  left_join(scotty_model_error)%>% 
  select(-MAE_error_test,-MAE_error_train)
## Joining, by = c("src_sub_area", "data_fun_name", "model_name")
scotty_best_model
## # A tibble: 3 × 8
##   src_sub_area data_fun_name model_name test     train    data_fun model fitted 
##   <chr>        <chr>         <chr>      <list>   <list>   <list>   <lis> <list> 
## 1 sxk97        msts          tbats      <tibble> <tibble> <fn>     <fn>  <tbats>
## 2 sxk9e        msts          tbats      <tibble> <tibble> <fn>     <fn>  <tbats>
## 3 sxk9s        msts          tbats      <tibble> <tibble> <fn>     <fn>  <tbats>

Forecasting

Forecasting is carried out on the test data using the fitted models. We will forecast the next one week, or 168 hours

Unnesting the result

We need to make a tbl containing our forecasts on the test data, then do some spread-gather tricks to build a set of keys that is unique for each combination of data representation and model, plus one key for the actual test data. Once we get to that format, we can unnest() the data into a proper long format. So let’s start with creating the forecasts first

scotty_test <- scotty_model_error %>% 
  mutate(
    forecast =
      map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
      map2(test, ~ tibble(
        datetime = .y$datetime,
        demand = as.vector(.x$mean)
      )),
    key = paste(data_fun_name, model_name, sep = "-")
  )
  
scotty_test
## # A tibble: 24 × 12
## # Groups:   src_sub_area [3]
##    src_sub_area test     train    data_fun_name data_fun model_name  model 
##    <chr>        <list>   <list>   <chr>         <list>   <chr>       <list>
##  1 sxk97        <tibble> <tibble> msts          <fn>     stlm        <fn>  
##  2 sxk97        <tibble> <tibble> ts            <fn>     stlm        <fn>  
##  3 sxk97        <tibble> <tibble> msts          <fn>     tbats       <fn>  
##  4 sxk97        <tibble> <tibble> ts            <fn>     tbats       <fn>  
##  5 sxk97        <tibble> <tibble> ts            <fn>     ets         <fn>  
##  6 sxk97        <tibble> <tibble> msts          <fn>     holt.winter <fn>  
##  7 sxk97        <tibble> <tibble> ts            <fn>     holt.winter <fn>  
##  8 sxk97        <tibble> <tibble> ts            <fn>     auto.arima  <fn>  
##  9 sxk9e        <tibble> <tibble> msts          <fn>     stlm        <fn>  
## 10 sxk9e        <tibble> <tibble> ts            <fn>     stlm        <fn>  
## # … with 14 more rows, and 5 more variables: fitted <list>,
## #   MAE_error_test <dbl>, MAE_error_train <dbl>, forecast <list>, key <chr>
# Create a proper key
scotty_test_key <- scotty_test %>%
  select(src_sub_area, key, actual = test, forecast) %>%
  spread(key, forecast) %>%
  gather(key, value, -src_sub_area)

scotty_test_key
## # A tibble: 27 × 3
## # Groups:   src_sub_area [3]
##    src_sub_area key              value             
##    <chr>        <chr>            <list>            
##  1 sxk97        actual           <tibble [168 × 2]>
##  2 sxk9e        actual           <tibble [168 × 2]>
##  3 sxk9s        actual           <tibble [168 × 2]>
##  4 sxk97        msts-holt.winter <tibble [168 × 2]>
##  5 sxk9e        msts-holt.winter <tibble [168 × 2]>
##  6 sxk9s        msts-holt.winter <tibble [168 × 2]>
##  7 sxk97        msts-stlm        <tibble [168 × 2]>
##  8 sxk9e        msts-stlm        <tibble [168 × 2]>
##  9 sxk9s        msts-stlm        <tibble [168 × 2]>
## 10 sxk97        msts-tbats       <tibble [168 × 2]>
## # … with 17 more rows

Because our forecast values are still on the transformed scale from the recipe, we should return them to the original scale using the rec_revert function:

# unnest the result
scotty_test_unnest <- scotty_test_key %>%
  unnest(value) %>%
  mutate(demand = rec_revert(demand, rec, src_sub_area))
  
scotty_test_unnest
## # A tibble: 4,536 × 4
## # Groups:   src_sub_area [3]
##    src_sub_area key    datetime            demand
##    <chr>        <chr>  <dttm>               <dbl>
##  1 sxk97        actual 2017-11-26 00:00:00     38
##  2 sxk97        actual 2017-11-26 01:00:00     21
##  3 sxk97        actual 2017-11-26 02:00:00     10
##  4 sxk97        actual 2017-11-26 03:00:00     10
##  5 sxk97        actual 2017-11-26 04:00:00      5
##  6 sxk97        actual 2017-11-26 05:00:00      2
##  7 sxk97        actual 2017-11-26 06:00:00      0
##  8 sxk97        actual 2017-11-26 07:00:00     11
##  9 sxk97        actual 2017-11-26 08:00:00      4
## 10 sxk97        actual 2017-11-26 09:00:00      4
## # … with 4,526 more rows

Plot the result

With this result, we can compare the forecasts and the actual test data through some interactive plotting

# interactive plotting forecast and original data test
fplot <- scotty_test_unnest %>%
  ggplot(aes(x = datetime, 
             y = demand, 
             colour = key)) +
  geom_line(data = scotty_test_unnest %>%
              filter(key == "actual"),
            aes(y = demand),
            alpha = 0.3,size = 0.8)+
   geom_line(data = scotty_test_unnest %>%
               filter(key != "actual"),
             aes(frame = key,
                 col = key)) +
  labs(x = "", y = "Demand",
       title = "Model Prediction Comparison", frame = "Model") +
   facet_wrap(~ src_sub_area, scale = "free_y", ncol = 1) +
    tidyquant::theme_tq() +
    theme(legend.position = "none",axis.title.y = element_text(size = 12), axis.text.y = element_text(size = 7))

ggplotly(fplot)

Automated model selection

We can automatically select the model with the smallest error by using filter from dplyr

scotty_min_error <- scotty_model_error %>%
  select(-fitted) %>% # remove unused
  group_by(src_sub_area) %>%
  filter(MAE_error_test == min(MAE_error_test)) %>%
  ungroup()

scotty_min_error    
## # A tibble: 3 × 9
##   src_sub_area test     train    data_fun_name data_fun model_name model 
##   <chr>        <list>   <list>   <chr>         <list>   <chr>      <list>
## 1 sxk97        <tibble> <tibble> msts          <fn>     tbats      <fn>  
## 2 sxk9e        <tibble> <tibble> msts          <fn>     tbats      <fn>  
## 3 sxk9s        <tibble> <tibble> msts          <fn>     tbats      <fn>  
## # … with 2 more variables: MAE_error_test <dbl>, MAE_error_train <dbl>

From the EDA results, we already know that our dataset has multiple seasonality. The automated selection above confirms this: the models with the smallest MAE follow the multi-seasonal (msts) specification, and tbats is the model with the lowest MAE.


Performing final forecast

Here we will repeat the model fitting process, but this time we will use the combined train and test data as our “full data”

Combining train and test data

scotty_combine <- scotty_min_error %>%
  mutate(fulldata = map2(train, test, ~ bind_rows(.x, .y))) %>%
  select(src_sub_area, fulldata, everything(), -train, -test)

scotty_combine
## # A tibble: 3 × 8
##   src_sub_area fulldata data_fun_name data_fun model_name model  MAE_error_test
##   <chr>        <list>   <chr>         <list>   <chr>      <list>          <dbl>
## 1 sxk97        <tibble> msts          <fn>     tbats      <fn>             7.60
## 2 sxk9e        <tibble> msts          <fn>     tbats      <fn>             9.26
## 3 sxk9s        <tibble> msts          <fn>     tbats      <fn>             7   
## # … with 1 more variable: MAE_error_train <dbl>

Create model

# scotty_combine_nest <- scotty_combine %>%
#   mutate(
#     params = map(fulldata, ~ list(x = .x)),
#     data = invoke_map(data_fun, params),
#     params = map(data, ~ list(x = .x)),
#     fitted = invoke_map(model, params)
#   ) %>%
#   select(-data, -params)
# 
# saveRDS(scotty_combine_nest, "scottycb1_nest.RDS")

Model result

scotty_combine_nest <- readRDS("scottycb1_nest.RDS")
scotty_combine_nest
## # A tibble: 3 × 9
##   src_sub_area fulldata data_fun_name data_fun model_name model  MAE_error_test
##   <chr>        <list>   <chr>         <list>   <chr>      <list>          <dbl>
## 1 sxk97        <tibble> msts          <fn>     tbats      <fn>             7.60
## 2 sxk9e        <tibble> msts          <fn>     tbats      <fn>             9.26
## 3 sxk9s        <tibble> msts          <fn>     tbats      <fn>             7   
## # … with 2 more variables: MAE_error_train <dbl>, fitted <list>

Forecasting

Then, let’s make a tbl containing the forecast results, and convert it to a long format

scotty_combine_forecast <- scotty_combine_nest %>%
  select(-MAE_error_train) %>% 
  mutate(forecast =
    map(fitted, ~ forecast(.x, h = 24 * 7 * 4)) %>%
    map2(fulldata, ~ tibble(
      datetime = timetk::tk_make_future_timeseries(.y$datetime, 24 * 7 * 4),
      demand = as.vector(.x$mean)
    ))
  )

scotty_combine_forecast
## # A tibble: 3 × 9
##   src_sub_area fulldata data_fun_name data_fun model_name model  MAE_error_test
##   <chr>        <list>   <chr>         <list>   <chr>      <list>          <dbl>
## 1 sxk97        <tibble> msts          <fn>     tbats      <fn>             7.60
## 2 sxk9e        <tibble> msts          <fn>     tbats      <fn>             9.26
## 3 sxk9s        <tibble> msts          <fn>     tbats      <fn>             7   
## # … with 2 more variables: fitted <list>, forecast <list>

Unnesting forecast result

scotty_combine_forecast_unnest <- scotty_combine_forecast %>% 
  select(src_sub_area, actual = fulldata, forecast) %>%
  gather(key, value, -src_sub_area) %>%
  unnest(value) %>%
  mutate(demand = rec_revert(demand, rec, src_sub_area))

scotty_combine_forecast_unnest
## # A tibble: 6,552 × 4
##    src_sub_area key    datetime            demand
##    <chr>        <chr>  <dttm>               <dbl>
##  1 sxk97        actual 2017-10-01 00:00:00      6
##  2 sxk97        actual 2017-10-01 01:00:00      4
##  3 sxk97        actual 2017-10-01 02:00:00      9
##  4 sxk97        actual 2017-10-01 03:00:00      2
##  5 sxk97        actual 2017-10-01 04:00:00      5
##  6 sxk97        actual 2017-10-01 05:00:00      4
##  7 sxk97        actual 2017-10-01 06:00:00      1
##  8 sxk97        actual 2017-10-01 07:00:00      0
##  9 sxk97        actual 2017-10-01 08:00:00      3
## 10 sxk97        actual 2017-10-01 09:00:00      3
## # … with 6,542 more rows

Plot forecast

fplot2 <- scotty_combine_forecast_unnest %>%
  ggplot(aes(x = datetime, y = demand, colour = key)) +
    geom_line() +
    labs(x = NULL, y = NULL, colour = NULL) +
    facet_wrap(~ src_sub_area, scale = "free", ncol = 1) +
    theme_light() +
    scale_colour_tq()

ggplotly(fplot2)

Final data test result

data_test <- scotty_combine_forecast_unnest %>%
  filter(datetime %within% intest) %>%
  select(src_sub_area, datetime, demand)

data_test
## # A tibble: 504 × 3
##    src_sub_area datetime            demand
##    <chr>        <dttm>               <dbl>
##  1 sxk97        2017-11-26 00:00:00     38
##  2 sxk97        2017-11-26 01:00:00     21
##  3 sxk97        2017-11-26 02:00:00     10
##  4 sxk97        2017-11-26 03:00:00     10
##  5 sxk97        2017-11-26 04:00:00      5
##  6 sxk97        2017-11-26 05:00:00      2
##  7 sxk97        2017-11-26 06:00:00      0
##  8 sxk97        2017-11-26 07:00:00     11
##  9 sxk97        2017-11-26 08:00:00      4
## 10 sxk97        2017-11-26 09:00:00      4
## # … with 494 more rows

MAE Result

mae_test_result <- scotty_model_error %>%
  group_by(src_sub_area) %>% 
  select(src_sub_area, MAE_error_test) %>% 
  arrange(MAE_error_test) %>% 
  slice(1) %>% 
  ungroup()

# Reached MAE < 10 for all sub-area in test dataset


mae_test_result %<>%
  summarise(src_sub_area = "all sub-area", # also count MAE for all sub-area
            MAE_error_test = mean(MAE_error_test)) %>%
  bind_rows(mae_test_result, .)

mae_test_result
## # A tibble: 4 × 2
##   src_sub_area MAE_error_test
##   <chr>                 <dbl>
## 1 sxk97                  7.60
## 2 sxk9e                  9.26
## 3 sxk9s                  7   
## 4 all sub-area           7.95

Data submission

Read the data submission that has been given as a challenge in this project

submission <- read_csv("Capstone ML Data/5. scotty-ts/data/data-test.csv", show_col_types = F) %>% 
  mutate(datetime = ymd_hms(datetime))

submission
## # A tibble: 504 × 3
##    src_sub_area datetime            demand
##    <chr>        <dttm>              <lgl> 
##  1 sxk97        2017-12-03 00:00:00 NA    
##  2 sxk97        2017-12-03 01:00:00 NA    
##  3 sxk97        2017-12-03 02:00:00 NA    
##  4 sxk97        2017-12-03 03:00:00 NA    
##  5 sxk97        2017-12-03 04:00:00 NA    
##  6 sxk97        2017-12-03 05:00:00 NA    
##  7 sxk97        2017-12-03 06:00:00 NA    
##  8 sxk97        2017-12-03 07:00:00 NA    
##  9 sxk97        2017-12-03 08:00:00 NA    
## 10 sxk97        2017-12-03 09:00:00 NA    
## # … with 494 more rows
# Range data check
range(submission$datetime)
## [1] "2017-12-03 00:00:00 UTC" "2017-12-09 23:00:00 UTC"

Our data submission has a datetime range from 2017-12-03 00:00:00 UTC to 2017-12-09 23:00:00 UTC, equal to 1 week or 168 hours

Nest data submission

submit_nest <- submission %>% nest(datetime)
submit_nest
## # A tibble: 3 × 3
##   src_sub_area demand data              
##   <chr>        <lgl>  <list>            
## 1 sxk97        NA     <tibble [168 × 1]>
## 2 sxk9e        NA     <tibble [168 × 1]>
## 3 sxk9s        NA     <tibble [168 × 1]>

Forecast data submission with the best model

submission <- scotty_best_model %>%  
  mutate(
    forecast =
      map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
      map2(submit_nest$data, ~ tibble( #map into our data submission nest
        datetime = .y$datetime,
        demand = as.vector(.x$mean)
      )),
    key = paste(data_fun_name, data_fun, sep = "-")
  )

submission
## # A tibble: 3 × 10
##   src_sub_area data_fun_name model_name test     train    data_fun model fitted 
##   <chr>        <chr>         <chr>      <list>   <list>   <list>   <lis> <list> 
## 1 sxk97        msts          tbats      <tibble> <tibble> <fn>     <fn>  <tbats>
## 2 sxk9e        msts          tbats      <tibble> <tibble> <fn>     <fn>  <tbats>
## 3 sxk9s        msts          tbats      <tibble> <tibble> <fn>     <fn>  <tbats>
## # … with 2 more variables: forecast <list>, key <chr>

Look at the result

submission <- submission %>%
  select(src_sub_area, forecast) %>%
  unnest(forecast) %>%
  mutate(demand = rec_revert(demand, rec, src_sub_area))

submission
## # A tibble: 504 × 3
##    src_sub_area datetime            demand
##    <chr>        <dttm>               <dbl>
##  1 sxk97        2017-12-03 00:00:00     18
##  2 sxk97        2017-12-03 01:00:00     15
##  3 sxk97        2017-12-03 02:00:00     11
##  4 sxk97        2017-12-03 03:00:00      7
##  5 sxk97        2017-12-03 04:00:00      5
##  6 sxk97        2017-12-03 05:00:00      3
##  7 sxk97        2017-12-03 06:00:00      2
##  8 sxk97        2017-12-03 07:00:00      6
##  9 sxk97        2017-12-03 08:00:00     12
## 10 sxk97        2017-12-03 09:00:00     12
## # … with 494 more rows

Write CSV for data submission

write_csv(submission, "submit_scotty.csv")

Plot submission forecast

subforcast <- submission %>%
  ggplot(aes(x = datetime, y = demand)) +
    geom_line() +
    labs(x = NULL, y = NULL, colour = NULL) +
    facet_wrap(~ src_sub_area, scale = "free", ncol = 1) +
    theme_linedraw() +
    scale_colour_tq()

ggplotly(subforcast)

Assumption Checking

Assumptions in the time series are tested to measure whether the residuals obtained from the modeling are good enough to describe and capture the information in the data. Why use the residuals? Because through the residuals we get information from both the actual data and the model's predictions. A good forecasting method produces residuals with the following properties (a one-call shortcut for checking both is sketched right after this list):

  1. Uncorrelated residuals. If the residuals are correlated, it means there is still information left that should have been used to compute the forecasts.

  2. Residuals with a mean of 0, normally distributed
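
The one-call shortcut mentioned above is forecast::checkresiduals(), which bundles the residual time plot, the ACF plot, and the Ljung-Box test (a sketch, assuming it accepts the tbats fits; the sections below run the same checks step by step):

# residual time plot, ACF plot, and Ljung-Box test in a single call,
# here for the sxk97 best model
checkresiduals(scotty_best_model$fitted[[1]])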

No-autocorrelation

To check the presence or absence of autocorrelation in the residuals of the time series forecast, we can visualize the residuals with an ACF plot or use the Ljung-Box test

\(H_0\): residuals have no autocorrelation

\(H_1\): residuals have autocorrelation

sxk97

# Using acf residual plot visualization with function `acf(residual model)`
acf(residuals(scotty_best_model$fitted[[1]]))

# perform a Ljung-box test using the function `Box.test`
Box.test(residuals(scotty_best_model$fitted[[1]]), type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  residuals(scotty_best_model$fitted[[1]])
## X-squared = 0.0041596, df = 1, p-value = 0.9486

sxk9e

acf(residuals(scotty_best_model$fitted[[2]]))

Box.test(residuals(scotty_best_model$fitted[[2]]), type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  residuals(scotty_best_model$fitted[[2]])
## X-squared = 0.0012231, df = 1, p-value = 0.9721

sxk9s

acf(residuals(scotty_best_model$fitted[[3]]))

Box.test(residuals(scotty_best_model$fitted[[3]]), type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  residuals(scotty_best_model$fitted[[3]])
## X-squared = 0.0010346, df = 1, p-value = 0.9743

Based on the ACF plots, very few residual lags cross the threshold, and the p-values of the Ljung-Box test are above alpha (0.05) for all sub-areas, so we fail to reject \(H_0\)

It can be concluded that the residuals in all sub-areas have no autocorrelation

Normality of residual

To check the normality of the residuals of the forecasting results, we can perform a normality test (Shapiro-Wilk) using the shapiro.test(residual model) function, or simply plot a histogram to see the distribution of the residuals

\(H_0\): residuals are normally distributed

\(H_1\): residuals are not normally distributed

sxk97

# plot histogram
hist(residuals(scotty_best_model$fitted[[1]]), breaks = 30, 
              col = "chocolate", border = "red", main = "sxk97", xlab = NULL)

shapiro.test(residuals(scotty_best_model$fitted[[1]]))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(scotty_best_model$fitted[[1]])
## W = 0.9851, p-value = 1.6e-10

sxk9e

hist(residuals(scotty_best_model$fitted[[2]]), breaks = 30, 
              col = "chocolate", border = "red", main = "sxk9e", xlab = NULL)

shapiro.test(residuals(scotty_best_model$fitted[[2]]))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(scotty_best_model$fitted[[2]])
## W = 0.99493, p-value = 0.0001655

sxk9s

hist(residuals(scotty_best_model$fitted[[3]]), breaks = 30, 
              col = "chocolate", border = "red", main = "sxk9s", xlab = NULL)

shapiro.test(residuals(scotty_best_model$fitted[[3]]))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(scotty_best_model$fitted[[3]])
## W = 0.99045, p-value = 1.144e-07

Based on the histogram plots, the residuals look roughly symmetric with the highest frequency around 0. However, the p-values of the normality test (Shapiro-Wilk) for all sub-areas are smaller than alpha (0.05), so \(H_0\) is rejected

Strictly speaking, the residuals of the model are therefore not normally distributed in any sub-area, even though their distribution is centered around 0

Rubrics-Scotty: “Bring me the crystal ball!”

Data Preprocess

  • Demonstrate and explain how to properly do data aggregation.

    • Should you floor the date to specific time level (minutes or hours or days)?

    I floored the timestamps to the hour level. Since the dataset records minutes and seconds, rounding to the hour makes the time series analysis easier and keeps the time values from being too varied. Flooring (rounding down) was chosen because even if an order is recorded at 05:59 it should still be categorized into hour 05, not hour 06.

    • How do we group the data for aggregation/summarise?

    Grouping is done by selecting the columns we want to group by and then computing the summary statistics with the group_by() function. In this case we group by sub-area and by the datetime that has already been floored to the date-hour level.

  • Demonstrate how to properly do time series padding.

    • Should you do time series padding?

    Yes, we have to do time series padding, because the data does not necessarily cover a complete set of time intervals, and the quantity to be forecast is not guaranteed to exist in every interval

    • Do you need to round the datetime into hour or minutes?

    I rounded the time to hours, to reduce the very large variety of minute values when looking at daily or weekly patterns; rounding to the hour is therefore much more appropriate.

    • When is the start and the end of the time interval for time series padding?

    Start time = 2017-10-01 00:00:00 UTC

    End time = 2017-12-02 23:00:00 UTC

Cross-Validation Scheme

  • Demonstrate and explain how to prepare cross-validation data for automated model selection.

    • How to cross-validate data for time series?

    There are several components that must be prepared before doing cross-validation: the test and train sizes, the start and end time indices for the test and train data, and their intervals. In this case, because we are asked to forecast one week at hourly granularity, the test size is 24*7 and the train data is the remainder. We then find the datetime ranges where the train and test data start and end, and finally derive the train and test intervals from those ranges.

    • Do you need to group the data by the source area?

    Of course we need to group the data by sub-area, so that each sub-area has its own train and test data, because forecasting will later be done for every specified sub-area

    • Do you need to make nested dataframe?

    A nested dataframe is simply a table inside a table; we need a nested dataframe so that each sub-area can store its own train and test data

    • How many observations you will use as the testing dataset?

    The number of observations used is 168, because the objective is to forecast one week at hourly granularity, i.e. 24*7 hours

  • (2 Points) Demonstrate and explain how to prepare cross-validation data for “best” model evaluation.

    • Do you need to further split the data train into training set and validation set?

    In this case we are given two datasets: a train dataset containing demand values and a test dataset without demand values (the object to be forecast), called submission in this markdown. For modeling, the train dataset must first be split into two parts: a train set and a validation set, which in the markdown above is called the test data. The validation set is used to evaluate the models trained on the training data, so that once the best model is obtained it can be used to forecast the test data provided earlier.

    • How do you split them? Should you use rolling origin method?

    Yes, that method is used. To help compare several models, the train data is split into a train set and a validation set. The strategy is to take the last week as the validation data and the larger remaining portion as the train data. This strategy is a simplified version of the rolling origin method, using only a single train/validation pair (a sketch of a fuller rolling-origin setup is shown at the end of this subsection).

    • How much of the data will be used as the validation set?

    The amount of data used for validation is 168 observations, because we want to predict demand for one week at hourly granularity

    • Do we forecast with windowed data or expanding data?

    We forecast with windowed data, because from the beginning (in the cross-validation stage) the size of the data to be forecast was fixed at 24 hours * 7 days and does not grow.
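
    As referenced above, a fuller rolling-origin scheme could be set up with the rsample package (not used in this project, and assumed to be installed); a minimal sketch on the padded sxk97 series:

    library(rsample)  # assumed to be installed; not part of this project's library list

    # rolling-origin resamples on the padded sxk97 series: six weeks of training
    # data, one week of assessment data, a fixed-size (windowed) training set,
    # and the origin moved forward one week at a time
    ro_splits <- rolling_origin(
      sxk97_use,
      initial    = 24 * 7 * 6,
      assess     = 24 * 7,
      cumulative = FALSE,
      skip       = 24 * 7 - 1
    )
    ro_splits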

Automated Model Selection

  • Compare multiple preprocess specifications.

    • Is different preprocess will have diffrerent results?

    Of course different preprocessing will produce different results, because in the preprocessing stage we manipulate the data, for example by scaling it, taking square roots, or even taking logarithms. Those values enter the modeling stage and will certainly give different results for each preprocessing method.

    • How many kind of preprocess spesification you will prepare?

    In this case I applied 3 preprocessing steps:

    1. Square root transformation using the step_sqrt function
    2. Centering (subtracting the mean) using the step_center function
    3. Scaling using the step_scale function
    • Will you choose 2 different speficiation: log transformation and square root transformation specification? Will you create another preprocess approach?

    For preprocessing I used a single combined approach: normalizing the data by taking its square root, then centering it, and then scaling it (a hypothetical log-based alternative is sketched below).
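
    For illustration, a hypothetical log-based recipe could be specified in the same way at the wide-data stage; scotty_wide below is a placeholder for the spread data that the original recipe was built on, so this is a sketch rather than something run in this analysis:

    # hypothetical alternative preprocessing: a log transform instead of sqrt;
    # scotty_wide is a placeholder for the spread (wide) data used for the
    # original recipe, and offset = 1 guards against log(0) on zero-demand hours
    rec_log <- recipe(~ ., filter(scotty_wide, datetime %within% intrain)) %>%
      step_log(all_numeric(), offset = 1) %>%
      step_center(all_numeric()) %>%
      step_scale(all_numeric()) %>%
      prep()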

  • Compare multiple seasonality specifications.

    • How many seasonality specification you will create?

    In the modeling I created 2 seasonality specifications: single seasonality (daily) and multiple seasonality (daily and weekly)

    • Will you create model with daily sesasonality only?

    I did not only build models with single seasonality, but also with multiple (daily and weekly) seasonality. This is because during EDA I saw that the data is indicated to have more than one seasonality, so the modeling also includes the multiple-seasonality specification.

    • Will you create multiple seasonality (daily and weekly)?

    Yes, I created multiple seasonality with daily and weekly periods, because during EDA the trend obtained from decomposing with multiple seasonality was smoother than the one obtained using only single daily or weekly seasonality

  • Compare multiple model specifications.

    • How many forecasting model will you use?

    I use 5 forecasting models:

    • Exponential smoothing state space model (ets)

    • Autoregressive integrated moving average (auto.arima)

    • Seasonal and Trend decomposition using Loess (stlm)

    • Holt-Winters exponential smoothing (HoltWinters)

    • Trigonometric seasonality, Box-Cox transformation, ARMA errors, Trend and Seasonal components (tbats)

    • Will you use exponential smoothing? Will you use ARIMA?

    I use exponential smoothing models such as ets and Holt-Winters

    I also use an ARIMA model, or more precisely auto SARIMA, because the data has seasonality and the model automatically searches for the combination of (p,d,q)(P,D,Q) values

  • Automate best specifications selection.

    • Since we use multiple preprocess, seasonality, and models, can you make an automated script to summarise the result?

    The answer to this question is already covered in the Modelling section, more precisely in the part that calculates the test and train errors, whose output serves as the automated summary of the results

  • How do you measure the model performance?

The best model is selected using the Mean Absolute Error (MAE) as the reference. The MAE is the average of the absolute errors and can be interpreted as how far the predictions deviate from the actual values. The lower the MAE, the better the model's forecasts.
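
For concreteness, MAE is simply the mean of the absolute errors, which yardstick::mae_vec() computes directly; a tiny illustration with made-up numbers:

# MAE = mean(|actual - predicted|); illustrative values only
mae_vec(truth = c(10, 12, 8), estimate = c(9, 15, 7))
# (|10 - 9| + |12 - 15| + |8 - 7|) / 3 = 5/3, roughly 1.67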

  • Which model and specifications has the best performance?

The model used is tbats with the specification:

  • TBATS(1, {2,2}, -, {<24,7>, <168,6>})

Interpretation: in TBATS(omega, {p,q}, phi, <m1,k1>,…,<mJ,kJ>), omega is the Box-Cox parameter, phi is the damping parameter, the errors are modeled as an ARMA(p,q) process, m1,…,mJ list the seasonal periods used in the model, and k1,…,kJ are the corresponding numbers of Fourier terms used for each seasonality. So:

- omega = 1 means no Box-Cox transformation is applied
- p,q = (2,2) means the errors are modeled with AR order 2 and MA order 2
- The model has 2 different seasonal periods, of length 24 (daily) and 168 (weekly)
- The tbats model fits the daily seasonality with 7 Fourier terms and the weekly seasonality with 6 Fourier terms
  • Parameters
    • Alpha: 0.02163328
    • Gamma-1 Values: -0.0007510731 -0.001416854
    • Gamma-2 Values: 0.001190823 0.0002403854
    • AR coefficients: -0.245483 0.528058
    • MA coefficients: 0.595991 -0.241668
  • Sigma: 0.4635748
  • AIC: 7694.876
  • Note: The numbers above are only representative of the sxk97 sub-area, which is used as an example. The model parameters for the other sub-areas can be accessed in the same way (see the sketch below)
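
These figures can be read directly from the fitted object (a sketch, assuming the usual components of a forecast::tbats fit):

# inspect the fitted TBATS model for sxk97 (first row of scotty_best_model)
fit_sxk97 <- scotty_best_model$fitted[[1]]
fit_sxk97                   # prints the TBATS(omega, {p,q}, phi, {<m,k>, ...}) specification
fit_sxk97$alpha             # level smoothing parameter
fit_sxk97$ar.coefficients   # AR coefficients
fit_sxk97$ma.coefficients   # MA coefficients
fit_sxk97$AIC               # model AIC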

Prediction Performance

  • Reached MAE < 12 for sub-area sxk97 in (your own) evaluation dataset.

  • Reached MAE < 11 for sub-area sxk9e in (your own) evaluation dataset.

  • Reached MAE < 10 for sub-area sxk9s in (your own) evaluation dataset.

  • Reached MAE < 11 for all sub-area in (your own) evaluation dataset.

  • Reached MAE < 12 for sub-area sxk97 in test dataset.

  • Reached MAE < 11 for sub-area sxk9e in test dataset.

  • Reached MAE < 10 for sub-area sxk9s in test dataset.

  • Reached MAE < 11 for all sub-area in test dataset.

Conclusion

  1. In this case, forecasting with multiple seasonality produces better predictions.

  2. For all three locations, forecasting with the TBATS model produces the best predictions.

  3. The model residuals satisfy the no-autocorrelation assumption, although the Shapiro-Wilk test indicates they are not strictly normally distributed.

  4. Beyond the Scotty case, this automated multiple time series forecasting approach can also be applied to sales, the stock market, or even natural phenomena such as rainfall and sea level rise.

Assumption Checking

  • Does the model meet the autocorrelation assumption? What about the normality of residuals?

The model meets the no-autocorrelation assumption; the normality assumption, however, is not strictly met.

Through the Ljung-Box test, the p-values are greater than alpha 0.05 in every area, which means the forecast residuals are not autocorrelated.

Through the Shapiro-Wilk test, the p-values are smaller than alpha 0.05 in every area, which means that, although the residuals are centered around 0, they are not strictly normally distributed.

These results are also supported by the plots provided above.

  • If the assumptions are not met, what is the cause? How do we handle that? Based on the seasonality, when does the highest demand occur?

The time series assumptions are tested to measure whether the residuals obtained from the modeling are good enough at describing and capturing the information in the data. Why use the residuals? Because through the residuals we get information from both the actual data and the model's predictions.

So if an assumption is not met, it is because the model cannot capture the information or patterns in the data we use. The way to handle it is to tune the model, for example:

1. Choose the right type of model for the characteristics of the data (additive or multiplicative)
2. Choose the right seasonality
3. Apply differencing
4. Apply a transformation such as log()
5. Choose the right configuration of the model parameters (p,d,q and P,D,Q)

Daily -> 6 pm

Weekly -> Friday

Model Performance Leaderboard