Automated Multiple Time Series Analysis on Scotty - Turkish Online Transportation Platform
Brief Introduction
Scotty is a ride-sharing business operating in several big cities in Turkey. The company provides a motorcycle ride-sharing service for Turkish citizens and places a high value on travelling efficiently through traffic.
In this project we will help Scotty run a forecast analysis to predict demand for the end of the year. We will use an automated time-series forecasting workflow that fits multiple models.
Goals & Objective
Build an automated forecasting model for hourly demand, to be evaluated on the next 7 days (Sunday, December 3rd 2017 to Saturday, December 9th 2017).
Data Preprocessing
The first step is to load every package we need in this project.
Load Libraries
library(forecast)
library(lubridate)
library(padr)
library(tidymodels)
library(tidyverse)
library(plotly)
library(zoo)
library(tidyr)
library(recipes)
library(tibble)
library(purrr)
library(png)
library(grid)
library(tidyquant)
Load Data
Load the data into the scotty object:
setwd("C:/Users/Asus/Documents/DataQuest/Algoritma Data Science/R/Machine Learning Capstone Project")
scotty <- read_csv("data-train.csv")
## Parsed with column specification:
## cols(
## id = col_character(),
## trip_id = col_character(),
## driver_id = col_character(),
## rider_id = col_character(),
## start_time = col_datetime(format = ""),
## src_lat = col_double(),
## src_lon = col_double(),
## src_area = col_character(),
## src_sub_area = col_character(),
## dest_lat = col_double(),
## dest_lon = col_double(),
## dest_area = col_character(),
## dest_sub_area = col_character(),
## distance = col_double(),
## status = col_character(),
## confirmed_time_sec = col_double()
## )
Explore Data
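The inspection code itself is not shown; the output below looks like the result of dplyr's glimpse(), roughly:

# inspect the structure of the data (a sketch)
glimpse(scotty)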
## Observations: 90,113
## Variables: 16
## $ id <chr> "59d005e1ffcfa261708ce9cd", "59d0066affcfa261708...
## $ trip_id <chr> "59d005e9cb564761a8fe5d3e", NA, "59d006c131e39c6...
## $ driver_id <chr> "59a892c5568be44b2734f276", NA, "599dc0dfa5b4fd5...
## $ rider_id <chr> "59ad2d6efba75a581666b506", "59cd704bcf482f6ce2f...
## $ start_time <dttm> 2017-10-01 00:00:17, 2017-10-01 00:02:34, 2017-...
## $ src_lat <dbl> 41.07047, 41.07487, 41.04995, 41.05287, 41.06760...
## $ src_lon <dbl> 29.01945, 28.99528, 29.03107, 28.99522, 28.98827...
## $ src_area <chr> "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", ...
## $ src_sub_area <chr> "sxk9s", "sxk9e", "sxk9s", "sxk9e", "sxk9e", "sx...
## $ dest_lat <dbl> 41.11716, 41.08351, 41.04495, 41.08140, 41.02125...
## $ dest_lon <dbl> 29.03650, 29.00228, 28.98192, 28.98197, 29.11316...
## $ dest_area <chr> "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", ...
## $ dest_sub_area <chr> "sxk9u", "sxk9e", "sxk9e", "sxk9e", "sxk9q", "sx...
## $ distance <dbl> 5.379250, 1.126098, 4.169492, 3.358296, 11.69357...
## $ status <chr> "confirmed", "nodrivers", "confirmed", "confirme...
## $ confirmed_time_sec <dbl> 8, 0, 32, 65, 0, 27, 32, 30, 0, 24, 9, 0, 118, 1...
Since the forecast is purely time-based, we only need two columns: the order timestamp (start_time) and the source sub-area (src_sub_area).
Because we will work at an hourly level, we need to floor the timestamps down to the hour.
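The aggregation code is not shown in the rendered output; a minimal sketch of this step, assuming the hourly summary is stored in an object called scotty_sum (the name used later in this analysis):

# floor each order's start_time to the hour, then count orders per hour and sub-area
scotty_sum <- scotty %>%
  mutate(datetime = floor_date(start_time, unit = "hour")) %>%
  group_by(datetime, src_sub_area) %>%
  summarise(demand = n())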
## # A tibble: 4,225 x 3
## # Groups: datetime [1,490]
## datetime src_sub_area demand
## <dttm> <chr> <int>
## 1 2017-10-01 00:00:00 sxk97 6
## 2 2017-10-01 00:00:00 sxk9e 8
## 3 2017-10-01 00:00:00 sxk9s 10
## 4 2017-10-01 01:00:00 sxk97 4
## 5 2017-10-01 01:00:00 sxk9e 2
## 6 2017-10-01 01:00:00 sxk9s 3
## 7 2017-10-01 02:00:00 sxk97 9
## 8 2017-10-01 02:00:00 sxk9e 3
## 9 2017-10-01 02:00:00 sxk9s 1
## 10 2017-10-01 03:00:00 sxk97 2
## # ... with 4,215 more rows
This table shows the total number of orders for each sub-area in each hour.
Data Validation Check
Since we are working with time series data, we need to check several assumptions to validate the data.
1. There must be no missing time periods
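The counting code is not shown; a sketch that would produce the table below:

# count the number of hourly observations recorded for each sub-area
scotty_sum %>%
  group_by(src_sub_area) %>%
  summarise(demand = n())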
## # A tibble: 3 x 2
## src_sub_area demand
## <chr> <int>
## 1 sxk97 1439
## 2 sxk9e 1419
## 3 sxk9s 1367
As we can see, the number of observations in each src_sub_area is not the same, which means some time periods are missing. Therefore we need to fill in the incomplete time periods on the datetime variable using the pad() function from the padr library.
scotty_sum <- scotty_sum %>%
pad('hour',
start_val = as.POSIXct(min(scotty_sum$datetime)),
end_val = as.POSIXct(max(scotty_sum$datetime)),
group = "src_sub_area") %>%
mutate(
src_sub_area = as.factor(src_sub_area)
  )
## Warning: group argument and dplyr::groups are both present and differ,
## dplyr::groups are ignored
## # A tibble: 3 x 2
## src_sub_area demand
## <fct> <int>
## 1 sxk97 1512
## 2 sxk9e 1512
## 3 sxk9s 1512
After pad() has been applied, the number of observations per src_sub_area is the same: 1,512 rows per sub-area.
2. Data must be ordered by their time period
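No explicit check is shown for this assumption; one possible way to verify it:

# check that timestamps are non-decreasing within each sub-area
scotty_sum %>%
  group_by(src_sub_area) %>%
  summarise(ordered = !is.unsorted(datetime))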
3. No NA
## # A tibble: 1 x 1
## na_rows
## <int>
## 1 311
Since we found 311 NA rows, we need to fill them with 0, meaning there were no orders (zero demand) in those particular hours.
scotty_sum <- na.fill0(scotty_sum, fill = 0)
scotty_sum %>%
filter(is.na(demand) == 1) %>%
group_by() %>%
summarise(
na_rows = n()
  )
## # A tibble: 1 x 1
## na_rows
## <int>
## 1 0
Exploratory Data Analysis
Forming Data in Time Series Format
scotty_sum %>%
filter(datetime > (max(scotty_sum$datetime) - hours(24 * 7 * 4))) %>%
ggplot(aes(x = datetime, y = demand)) +
geom_line() +
labs(x = NULL, y = NULL) +
facet_wrap(~ src_sub_area, scale = "free", ncol = 1) +
  theme_light()
Cross Validation
Before we build a model, we must split the data into a training set and a test set. The training set is used to fit the models, while the test set serves as a comparison to check whether the models overfit or underfit when predicting new data outside the training data. We will use the last 7 days of the data as the test set and the rest as the training set.
# number of test data and data train
test_size <- 24 * 7
train_size <- nrow(scotty_sum)/3 - test_size
# initial range and final range for each data
test_end <- max(scotty_sum$datetime)
test_start <- test_end - hours(test_size) + hours(1)
train_end <- test_start - hours(1)
train_start <- train_end - hours(train_size) + hours(1)
# interval for each sample
interval_train <- interval(train_start, train_end)
interval_test <- interval(test_start, test_end)
## [1] 2017-10-01 UTC--2017-11-25 23:00:00 UTC
## [1] 2017-11-26 UTC--2017-12-02 23:00:00 UTC
Here is the data after the splitting process (train & test). Let's plot the train and test sets:
e <- scotty_sum %>%
mutate(sample = case_when(
datetime %within% interval_train ~ "train",
datetime %within% interval_test ~ "test"
)) %>%
mutate(sample = factor(sample, levels = c("train", "test"))) %>%
ggplot(aes(x = datetime, y = demand, colour = sample)) +
geom_line() +
labs(x = NULL, y = NULL, colour = NULL) +
facet_wrap(~ src_sub_area, scale = "free", ncol = 1) +
theme_light()
ggplotly(e)
Time Series Data Preparation
Before we form the time series objects, we first preprocess the data using the recipes library.
Because recipes works column-wise, we reshape the data into wide format, with one column per src_sub_area, using the spread() function from the tidyr library:
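A sketch of this reshaping, assuming the wide table is stored in scotty_sum_wide (the name used below):

# one column of hourly demand per sub-area
scotty_sum_wide <- scotty_sum %>%
  spread(src_sub_area, demand)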
## # A tibble: 1,512 x 4
## # Groups: datetime [1,512]
## datetime sxk97 sxk9e sxk9s
## <dttm> <dbl> <dbl> <dbl>
## 1 2017-10-01 00:00:00 6 8 10
## 2 2017-10-01 01:00:00 4 2 3
## 3 2017-10-01 02:00:00 9 3 1
## 4 2017-10-01 03:00:00 2 2 0
## 5 2017-10-01 04:00:00 5 1 0
## 6 2017-10-01 05:00:00 4 2 0
## 7 2017-10-01 06:00:00 1 1 0
## 8 2017-10-01 07:00:00 0 0 1
## 9 2017-10-01 08:00:00 3 2 5
## 10 2017-10-01 09:00:00 3 11 5
## # ... with 1,502 more rows
Next we transform the data so that the modelling becomes more robust and less sensitive to outliers: a square-root transform followed by centering and scaling, specified with the recipe() function from the recipes library:
scotty_recipe <- recipe(~.,
filter(scotty_sum_wide, datetime %within% interval_train)) %>%
step_sqrt(all_numeric()) %>%
step_center(all_numeric()) %>%
step_scale(all_numeric()) %>%
prep()
# preview the bake results
scotty_sum_wide <- bake(scotty_recipe, scotty_sum_wide)
scotty_sum_wide
## # A tibble: 1,512 x 4
## datetime sxk97 sxk9e sxk9s
## <dttm> <dbl> <dbl> <dbl>
## 1 2017-10-01 00:00:00 -0.594 -0.697 -0.111
## 2 2017-10-01 01:00:00 -0.841 -1.28 -0.780
## 3 2017-10-01 02:00:00 -0.291 -1.15 -1.12
## 4 2017-10-01 03:00:00 -1.16 -1.28 -1.59
## 5 2017-10-01 04:00:00 -0.711 -1.44 -1.59
## 6 2017-10-01 05:00:00 -0.841 -1.28 -1.59
## 7 2017-10-01 06:00:00 -1.39 -1.44 -1.59
## 8 2017-10-01 07:00:00 -1.94 -1.85 -1.12
## 9 2017-10-01 08:00:00 -0.989 -1.28 -0.544
## 10 2017-10-01 09:00:00 -0.989 -0.497 -0.544
## # ... with 1,502 more rows
Because we have transformed the data, we also need a function that converts values back to the original scale; we will call it rec_revert():
# revert back function
rec_revert <- function(vector, rec, varname) {
# store recipe values
rec_center <- rec$steps[[2]]$means[varname]
rec_scale <- rec$steps[[3]]$sds[varname]
# convert back based on the recipe
results <- (vector * rec_scale + rec_center) ^ 2
# add additional adjustment if necessary
results <- round(results)
# return the results
results
}
This revert-back function converts the transformed values back to their original scale.
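As a quick sanity check (using the first transformed sxk97 value shown above), the revert should approximately recover the raw demand:

# -0.594 is the first transformed sxk97 value; this should return roughly the original demand of 6
rec_revert(-0.594, scotty_recipe, "sxk97")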
Now we can convert our data into the long format again:
# convert back to long format
scotty_sum_long <- scotty_sum_wide %>%
gather(src_sub_area, demand, -datetime)
scotty_sum_long
## # A tibble: 4,536 x 3
## datetime src_sub_area demand
## <dttm> <chr> <dbl>
## 1 2017-10-01 00:00:00 sxk97 -0.594
## 2 2017-10-01 01:00:00 sxk97 -0.841
## 3 2017-10-01 02:00:00 sxk97 -0.291
## 4 2017-10-01 03:00:00 sxk97 -1.16
## 5 2017-10-01 04:00:00 sxk97 -0.711
## 6 2017-10-01 05:00:00 sxk97 -0.841
## 7 2017-10-01 06:00:00 sxk97 -1.39
## 8 2017-10-01 07:00:00 sxk97 -1.94
## 9 2017-10-01 08:00:00 sxk97 -0.989
## 10 2017-10-01 09:00:00 sxk97 -0.989
## # ... with 4,526 more rows
Modelling & Forecasting
For functional programming with purrr, we need to convert our tbl into a nested tbl. Nested data is like a table inside a table that can be controlled through a key indicator; in other words, we can keep a separate tbl for each sub-area and sample. In this format the fitting and forecasting process becomes very flexible, and we can still recombine the results as long as we keep a proper key such as src_sub_area.
Let's start by converting our tbl into a nested tbl. First, we add a sample indicator so it can be used as a key when we nest() the data:
# adjust by sample
scotty_sum_nest <- scotty_sum_long %>%
mutate(sample = case_when(
datetime %within% interval_train ~ "train",
datetime %within% interval_test ~ "test"
)) %>%
drop_na()
# nest the data train
scotty_sum_train_nest <- scotty_sum_nest %>%
group_by(src_sub_area, sample) %>%
nest(.key = "data") %>%
  spread(sample, data)
## Warning: `.key` is deprecated
## # A tibble: 3 x 3
## # Groups: src_sub_area [3]
## src_sub_area test train
## <chr> <list> <list>
## 1 sxk97 <tibble [168 x 2]> <tibble [1,344 x 2]>
## 2 sxk9e <tibble [168 x 2]> <tibble [1,344 x 2]>
## 3 sxk9s <tibble [168 x 2]> <tibble [1,344 x 2]>
Time Series Functions
To turn the data into time series objects we need the ts() or msts() functions. We therefore create a list of functions named ts_function_list containing both, with the following seasonalities:
- ts: used for single seasonality; here, daily seasonality (frequency = 24)
- msts: used for multiple/complex seasonality; here, daily (frequency = 24) and weekly (frequency = 24 * 7) seasonality
# list of functions that will be used to form a time series
ts_function_list <- list(
ts = function(x) ts(x$demand, frequency = 24),
msts = function(x) msts(x$demand, seasonal.periods = c(24, 24 * 7))
)
ts_function_list
## $ts
## function(x) ts(x$demand, frequency = 24)
##
## $msts
## function(x) msts(x$demand, seasonal.periods = c(24, 24 * 7))
Then we convert the list into a tbl using enframe(). Note that we also add a key – src_sub_area in our case – so we can use left_join() later.
We use rep() to replicate the function list for each sub-area:
ts_function_list <- ts_function_list %>%
rep(length(unique(scotty_sum_nest$src_sub_area))) %>%
enframe("func_name", "func") %>%
mutate(
src_sub_area = sort(rep(unique(scotty_sum_nest$src_sub_area), length(unique(.$func_name))))
)
ts_function_list
## # A tibble: 6 x 3
## func_name func src_sub_area
## <chr> <list> <chr>
## 1 ts <fn> sxk97
## 2 msts <fn> sxk97
## 3 ts <fn> sxk9e
## 4 msts <fn> sxk9e
## 5 ts <fn> sxk9s
## 6 msts <fn> sxk9s
Then we join the nested functions with our nested data in preparation for model fitting:
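The join itself is not printed as code; a sketch, assuming the result is stored in scotty_fit (the object used below):

# attach the ts/msts constructor functions to the nested train/test data
scotty_fit <- scotty_sum_train_nest %>%
  left_join(ts_function_list)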
## Joining, by = "src_sub_area"
## # A tibble: 6 x 5
## # Groups: src_sub_area [3]
## src_sub_area test train func_name func
## <chr> <list> <list> <chr> <list>
## 1 sxk97 <tibble [168 x 2]> <tibble [1,344 x 2]> ts <fn>
## 2 sxk97 <tibble [168 x 2]> <tibble [1,344 x 2]> msts <fn>
## 3 sxk9e <tibble [168 x 2]> <tibble [1,344 x 2]> ts <fn>
## 4 sxk9e <tibble [168 x 2]> <tibble [1,344 x 2]> msts <fn>
## 5 sxk9s <tibble [168 x 2]> <tibble [1,344 x 2]> ts <fn>
## 6 sxk9s <tibble [168 x 2]> <tibble [1,344 x 2]> msts <fn>
Make a time series model list
We will use four different models:
- Exponential smoothing state space model (ets)
- Autoregressive integrated moving average (Auto arima)
- Seasonal and Trend decomposition using Loess (stlm)
- Trigonometric seasonality, Box-Cox transformation, ARMA errors, Trend and Seasonal components (tbats)
# time series model list
model_function_list <- list(
ets = function(x) ets(x),
auto.arima = function(x) auto.arima(x),
stlm = function(x) stlm(x),
tbats = function(x) tbats(x, use.box.cox = FALSE)
)
model_function_list
## $ets
## function(x) ets(x)
##
## $auto.arima
## function(x) auto.arima(x)
##
## $stlm
## function(x) stlm(x)
##
## $tbats
## function(x) tbats(x, use.box.cox = FALSE)
# combining into nested format
model_function_list <- model_function_list %>%
rep(length(unique(scotty_sum_nest$src_sub_area))) %>%
enframe("model_name", "model") %>%
mutate(src_sub_area =
sort(rep(unique(scotty_sum_nest$src_sub_area), length(unique(.$model_name))))
)
model_function_list
## # A tibble: 12 x 3
## model_name model src_sub_area
## <chr> <list> <chr>
## 1 ets <fn> sxk97
## 2 auto.arima <fn> sxk97
## 3 stlm <fn> sxk97
## 4 tbats <fn> sxk97
## 5 ets <fn> sxk9e
## 6 auto.arima <fn> sxk9e
## 7 stlm <fn> sxk9e
## 8 tbats <fn> sxk9e
## 9 ets <fn> sxk9s
## 10 auto.arima <fn> sxk9s
## 11 stlm <fn> sxk9s
## 12 tbats <fn> sxk9s
The next step is to merge model_function_list into the nested data (scotty_fit). Because the ets and auto.arima models cannot handle multi-seasonal time series, we do not combine these two models with the msts representation:
scotty_fit <- scotty_fit %>%
left_join(model_function_list) %>%
filter(
!(model_name == "ets" & func_name == "msts"),
!(model_name == "auto.arima" & func_name == "msts")
  )
## Joining, by = "src_sub_area"
## # A tibble: 18 x 7
## # Groups: src_sub_area [3]
## src_sub_area test train func_name func model_name model
## <chr> <list> <list> <chr> <list> <chr> <lis>
## 1 sxk97 <tibble [168 ~ <tibble [1,344~ ts <fn> ets <fn>
## 2 sxk97 <tibble [168 ~ <tibble [1,344~ ts <fn> auto.arima <fn>
## 3 sxk97 <tibble [168 ~ <tibble [1,344~ ts <fn> stlm <fn>
## 4 sxk97 <tibble [168 ~ <tibble [1,344~ ts <fn> tbats <fn>
## 5 sxk97 <tibble [168 ~ <tibble [1,344~ msts <fn> stlm <fn>
## 6 sxk97 <tibble [168 ~ <tibble [1,344~ msts <fn> tbats <fn>
## 7 sxk9e <tibble [168 ~ <tibble [1,344~ ts <fn> ets <fn>
## 8 sxk9e <tibble [168 ~ <tibble [1,344~ ts <fn> auto.arima <fn>
## 9 sxk9e <tibble [168 ~ <tibble [1,344~ ts <fn> stlm <fn>
## 10 sxk9e <tibble [168 ~ <tibble [1,344~ ts <fn> tbats <fn>
## 11 sxk9e <tibble [168 ~ <tibble [1,344~ msts <fn> stlm <fn>
## 12 sxk9e <tibble [168 ~ <tibble [1,344~ msts <fn> tbats <fn>
## 13 sxk9s <tibble [168 ~ <tibble [1,344~ ts <fn> ets <fn>
## 14 sxk9s <tibble [168 ~ <tibble [1,344~ ts <fn> auto.arima <fn>
## 15 sxk9s <tibble [168 ~ <tibble [1,344~ ts <fn> stlm <fn>
## 16 sxk9s <tibble [168 ~ <tibble [1,344~ ts <fn> tbats <fn>
## 17 sxk9s <tibble [168 ~ <tibble [1,344~ msts <fn> stlm <fn>
## 18 sxk9s <tibble [168 ~ <tibble [1,344~ msts <fn> tbats <fn>
Create the model
The first step is to run the function stored in the func column to build the time series object, then fit the model stored in the model column; the results are stored in a fitted column.
Since fitting is time consuming, we save the fitted models to "scotty_model.RDS" and reload them for further analysis.
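The fitting code is not shown here (only its saved result is reloaded below); a sketch of what it could look like, mirroring the commented-out pattern used later for the full data:

# build the ts/msts object from each training set, then fit the corresponding model
scotty_model <- scotty_fit %>%
  mutate(
    params = map(train, ~ list(x = .x)),   # wrap the training data as function arguments
    data = invoke_map(func, params),       # apply ts() / msts()
    params = map(data, ~ list(x = .x)),
    fitted = invoke_map(model, params)     # fit ets / auto.arima / stlm / tbats
  ) %>%
  select(-data, -params)

saveRDS(scotty_model, "scotty_model.RDS")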
Comparing with test Data
Calculate errors
scotty_model <- readRDS("scotty_model.RDS")
# calculate test errors
scotty_model_error <- scotty_model %>%
mutate(error =
map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
map2_dbl(test, ~ mae_vec(truth = .y$demand, estimate = .x$mean))
) %>%
arrange(src_sub_area, error) #arrange based on smallest error for each area
scotty_model_error
## # A tibble: 18 x 9
## # Groups: src_sub_area [3]
## src_sub_area test train func_name func model_name model fitted error
## <chr> <list> <list> <chr> <lis> <chr> <lis> <list> <dbl>
## 1 sxk97 <tibble~ <tibble ~ msts <fn> tbats <fn> <tbat~ 0.488
## 2 sxk97 <tibble~ <tibble ~ msts <fn> stlm <fn> <stlm> 0.543
## 3 sxk97 <tibble~ <tibble ~ ts <fn> tbats <fn> <tbat~ 0.573
## 4 sxk97 <tibble~ <tibble ~ ts <fn> ets <fn> <ets> 0.574
## 5 sxk97 <tibble~ <tibble ~ ts <fn> stlm <fn> <stlm> 0.600
## 6 sxk97 <tibble~ <tibble ~ ts <fn> auto.arima <fn> <fr_A~ 0.702
## 7 sxk9e <tibble~ <tibble ~ msts <fn> tbats <fn> <tbat~ 0.359
## 8 sxk9e <tibble~ <tibble ~ msts <fn> stlm <fn> <stlm> 0.366
## 9 sxk9e <tibble~ <tibble ~ ts <fn> ets <fn> <ets> 0.419
## 10 sxk9e <tibble~ <tibble ~ ts <fn> tbats <fn> <tbat~ 0.422
## 11 sxk9e <tibble~ <tibble ~ ts <fn> stlm <fn> <stlm> 0.438
## 12 sxk9e <tibble~ <tibble ~ ts <fn> auto.arima <fn> <fr_A~ 0.502
## 13 sxk9s <tibble~ <tibble ~ msts <fn> tbats <fn> <tbat~ 0.421
## 14 sxk9s <tibble~ <tibble ~ msts <fn> stlm <fn> <stlm> 0.428
## 15 sxk9s <tibble~ <tibble ~ ts <fn> tbats <fn> <tbat~ 0.431
## 16 sxk9s <tibble~ <tibble ~ ts <fn> auto.arima <fn> <fr_A~ 0.503
## 17 sxk9s <tibble~ <tibble ~ ts <fn> stlm <fn> <stlm> 0.540
## 18 sxk9s <tibble~ <tibble ~ ts <fn> ets <fn> <ets> 0.550
Get the best model by choosing the smallest error:
scotty_best_model <- scotty_model_error %>%
group_by(src_sub_area) %>%
arrange(error) %>%
slice(1) %>% #choose the smallest error & best model for each area directly
ungroup() %>%
  select(src_sub_area, ends_with("_name"), error)
scotty_best_model <- scotty_best_model %>%
select(-error) %>%
left_join(scotty_model_error)%>%
  select(-error)
## Joining, by = c("src_sub_area", "func_name", "model_name")
## # A tibble: 3 x 8
## src_sub_area func_name model_name test train func model fitted
## <chr> <chr> <chr> <list> <list> <list> <lis> <list>
## 1 sxk97 msts tbats <tibble [1~ <tibble [1,~ <fn> <fn> <tbat~
## 2 sxk9e msts tbats <tibble [1~ <tibble [1,~ <fn> <fn> <tbat~
## 3 sxk9s msts tbats <tibble [1~ <tibble [1,~ <fn> <fn> <tbat~
Unnesting the Result
We need to make a tbl containing our forecasts on the test dataset, then do some spread-gather tricks to build a key that is unique for each combination of data representation and model, plus one for the actual test data itself. Once we have that format, we can unnest() the data into a proper long format.
Let's start by creating the forecasts:
scotty_test <- scotty_model_error %>%
mutate(
forecast =
map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
map2(test, ~ tibble(
datetime = .y$datetime,
demand = as.vector(.x$mean)
)),
key = paste(func_name, model_name, sep = "-")
)
scotty_test
## # A tibble: 18 x 11
## # Groups: src_sub_area [3]
## src_sub_area test train func_name func model_name model fitted error
## <chr> <lis> <lis> <chr> <lis> <chr> <lis> <list> <dbl>
## 1 sxk97 <tib~ <tib~ msts <fn> tbats <fn> <tbat~ 0.488
## 2 sxk97 <tib~ <tib~ msts <fn> stlm <fn> <stlm> 0.543
## 3 sxk97 <tib~ <tib~ ts <fn> tbats <fn> <tbat~ 0.573
## 4 sxk97 <tib~ <tib~ ts <fn> ets <fn> <ets> 0.574
## 5 sxk97 <tib~ <tib~ ts <fn> stlm <fn> <stlm> 0.600
## 6 sxk97 <tib~ <tib~ ts <fn> auto.arima <fn> <fr_A~ 0.702
## 7 sxk9e <tib~ <tib~ msts <fn> tbats <fn> <tbat~ 0.359
## 8 sxk9e <tib~ <tib~ msts <fn> stlm <fn> <stlm> 0.366
## 9 sxk9e <tib~ <tib~ ts <fn> ets <fn> <ets> 0.419
## 10 sxk9e <tib~ <tib~ ts <fn> tbats <fn> <tbat~ 0.422
## 11 sxk9e <tib~ <tib~ ts <fn> stlm <fn> <stlm> 0.438
## 12 sxk9e <tib~ <tib~ ts <fn> auto.arima <fn> <fr_A~ 0.502
## 13 sxk9s <tib~ <tib~ msts <fn> tbats <fn> <tbat~ 0.421
## 14 sxk9s <tib~ <tib~ msts <fn> stlm <fn> <stlm> 0.428
## 15 sxk9s <tib~ <tib~ ts <fn> tbats <fn> <tbat~ 0.431
## 16 sxk9s <tib~ <tib~ ts <fn> auto.arima <fn> <fr_A~ 0.503
## 17 sxk9s <tib~ <tib~ ts <fn> stlm <fn> <stlm> 0.540
## 18 sxk9s <tib~ <tib~ ts <fn> ets <fn> <ets> 0.550
## # ... with 2 more variables: forecast <list>, key <chr>
Then do some spread-gather to create a proper key:
scotty_test_key <- scotty_test %>%
select(src_sub_area, key, actual = test, forecast) %>%
spread(key, forecast) %>%
gather(key, value, -src_sub_area)
scotty_test_key
## # A tibble: 21 x 3
## # Groups: src_sub_area [3]
## src_sub_area key value
## <chr> <chr> <list>
## 1 sxk97 actual <tibble [168 x 2]>
## 2 sxk9e actual <tibble [168 x 2]>
## 3 sxk9s actual <tibble [168 x 2]>
## 4 sxk97 msts-stlm <tibble [168 x 2]>
## 5 sxk9e msts-stlm <tibble [168 x 2]>
## 6 sxk9s msts-stlm <tibble [168 x 2]>
## 7 sxk97 msts-tbats <tibble [168 x 2]>
## 8 sxk9e msts-tbats <tibble [168 x 2]>
## 9 sxk9s msts-tbats <tibble [168 x 2]>
## 10 sxk97 ts-auto.arima <tibble [168 x 2]>
## # ... with 11 more rows
Because our forecasts are still on the transformed scale defined by scotty_recipe, we convert them back to the original values using the rec_revert() function:
scotty_test_unnest <- scotty_test_key %>%
unnest(value) %>%
mutate(demand = rec_revert(demand, scotty_recipe, src_sub_area))
scotty_test_unnest
## # A tibble: 3,528 x 4
## # Groups: src_sub_area [3]
## src_sub_area key datetime demand
## <chr> <chr> <dttm> <dbl>
## 1 sxk97 actual 2017-11-26 00:00:00 38
## 2 sxk97 actual 2017-11-26 01:00:00 21
## 3 sxk97 actual 2017-11-26 02:00:00 10
## 4 sxk97 actual 2017-11-26 03:00:00 10
## 5 sxk97 actual 2017-11-26 04:00:00 5
## 6 sxk97 actual 2017-11-26 05:00:00 2
## 7 sxk97 actual 2017-11-26 06:00:00 0
## 8 sxk97 actual 2017-11-26 07:00:00 11
## 9 sxk97 actual 2017-11-26 08:00:00 4
## 10 sxk97 actual 2017-11-26 09:00:00 4
## # ... with 3,518 more rows
With the resulting tbl, we can compare the forecasts against the actual test data:
p2 <- scotty_test_unnest %>%
ggplot(aes(x = datetime, y = demand, colour = key)) +
geom_line(data = scotty_test_unnest %>%
filter(key == "actual"),aes(y = demand),alpha = 0.3,size = 0.8)+
geom_line(data = scotty_test_unnest %>%
filter(key != "actual"),aes(frame = key,col = key)) +
labs(x = "", y = "Demand",title = "Model Prediction Comparison", frame = "Model") +
facet_wrap(~ src_sub_area, scale = "free_y", ncol = 1) +
tidyquant::theme_tq() +
  theme(legend.position = "none", axis.title.y = element_text(size = 12), axis.text.y = element_text(size = 7))
## Warning: Ignoring unknown aesthetics: frame
## Warning in p$x$data[firstFrame] <- p$x$frames[[1]]$data: number of items to
## replace is not a multiple of replacement length
We can visually determine which model is best: the transparent line in the background is the actual demand being forecast, while the coloured lines are the forecasts. However, there is a more systematic way to evaluate which model is best.
Automated Model Selection
We can automatically select the model with the smallest error for each sub-area using the filter() function:
# filter by lowest test error
scotty_min_error <- scotty_model_error %>%
select(-fitted) %>% # remove unused
group_by(src_sub_area) %>%
filter(error == min(error)) %>%
ungroup()
scotty_min_error # same selection as scotty_best_model
## # A tibble: 3 x 8
## src_sub_area test train func_name func model_name model error
## <chr> <list> <list> <chr> <lis> <chr> <lis> <dbl>
## 1 sxk97 <tibble [16~ <tibble [1,3~ msts <fn> tbats <fn> 0.488
## 2 sxk9e <tibble [16~ <tibble [1,3~ msts <fn> tbats <fn> 0.359
## 3 sxk9s <tibble [16~ <tibble [1,3~ msts <fn> tbats <fn> 0.421
We can see from the automated modelling that demand tends to follow a multi-seasonal pattern, with the tbats model giving the lowest error in every sub-area.
Performing Final Forecast
Here we repeat the same process as in model fitting, but this time we use the combined train and test data as our "full data".
First, let's combine the train and test datasets:
scotty_combine <- scotty_min_error %>%
mutate(fulldata = map2(train, test, ~ bind_rows(.x, .y))) %>%
select(src_sub_area, fulldata, everything(), -train, -test)
scotty_combine
## # A tibble: 3 x 7
## src_sub_area fulldata func_name func model_name model error
## <chr> <list> <chr> <list> <chr> <list> <dbl>
## 1 sxk97 <tibble [1,512 x 2]> msts <fn> tbats <fn> 0.488
## 2 sxk9e <tibble [1,512 x 2]> msts <fn> tbats <fn> 0.359
## 3 sxk9s <tibble [1,512 x 2]> msts <fn> tbats <fn> 0.421
Then we create the model as in the previous example (the fitting code below is commented out because it is time consuming; the saved result is loaded from an RDS file instead):
# scotty_combine_nest <- scotty_combine %>%
# mutate(
# params = map(fulldata, ~ list(x = .x)),
# data = invoke_map(func, params),
# params = map(data, ~ list(x = .x)),
# fitted = invoke_map(model, params)
# ) %>%
# select(-data, -params)
#
# saveRDS(scotty_combine_nest, "scottycb1_nest.RDS")
scotty_combine_nest <- readRDS("scottycb1_nest.RDS")
scotty_combine_nest
## # A tibble: 3 x 8
## src_sub_area fulldata func_name func model_name model error fitted
## <chr> <list> <chr> <list> <chr> <list> <dbl> <list>
## 1 sxk97 <tibble [1,512 x~ msts <fn> tbats <fn> 0.488 <tbat~
## 2 sxk9e <tibble [1,512 x~ msts <fn> tbats <fn> 0.359 <tbat~
## 3 sxk9s <tibble [1,512 x~ msts <fn> tbats <fn> 0.421 <tbat~
Then, let’s make a tbl containing the forecast results, and convert it to a long format.
# get forecast
scotty_combine_forecast <- scotty_combine_nest %>%
mutate(forecast =
map(fitted, ~ forecast(.x, h = 24 * 7 * 4)) %>%
map2(fulldata, ~ tibble(
datetime = timetk::tk_make_future_timeseries(.y$datetime, 24 * 7 * 4),
demand = as.vector(.x$mean)
))
)
scotty_combine_forecast
## # A tibble: 3 x 9
## src_sub_area fulldata func_name func model_name model error fitted forecast
## <chr> <list> <chr> <list> <chr> <lis> <dbl> <list> <list>
## 1 sxk97 <tibble ~ msts <fn> tbats <fn> 0.488 <tbat~ <tibble~
## 2 sxk9e <tibble ~ msts <fn> tbats <fn> 0.359 <tbat~ <tibble~
## 3 sxk9s <tibble ~ msts <fn> tbats <fn> 0.421 <tbat~ <tibble~
Unnesting the data
scotty_combine_forecast_unnest <- scotty_combine_forecast %>%
select(src_sub_area, actual = fulldata, forecast) %>%
gather(key, value, -src_sub_area) %>%
unnest(value) %>%
mutate(demand = rec_revert(demand, scotty_recipe, src_sub_area))
scotty_combine_forecast_unnest
## # A tibble: 6,552 x 4
## src_sub_area key datetime demand
## <chr> <chr> <dttm> <dbl>
## 1 sxk97 actual 2017-10-01 00:00:00 6
## 2 sxk97 actual 2017-10-01 01:00:00 4
## 3 sxk97 actual 2017-10-01 02:00:00 9
## 4 sxk97 actual 2017-10-01 03:00:00 2
## 5 sxk97 actual 2017-10-01 04:00:00 5
## 6 sxk97 actual 2017-10-01 05:00:00 4
## 7 sxk97 actual 2017-10-01 06:00:00 1
## 8 sxk97 actual 2017-10-01 07:00:00 0
## 9 sxk97 actual 2017-10-01 08:00:00 3
## 10 sxk97 actual 2017-10-01 09:00:00 3
## # ... with 6,542 more rows
d <- scotty_combine_forecast_unnest %>%
ggplot(aes(x = datetime, y = demand, colour = key)) +
geom_line() +
labs(x = NULL, y = NULL, colour = NULL) +
facet_wrap(~ src_sub_area, scale = "free", ncol = 1) +
theme_light() +
tidyquant::scale_colour_tq()
ggplotly(d)

data_test <- scotty_combine_forecast_unnest %>%
filter(datetime %within% interval_test) %>%
select(src_sub_area, datetime, demand)
data_test
## # A tibble: 504 x 3
## src_sub_area datetime demand
## <chr> <dttm> <dbl>
## 1 sxk97 2017-11-26 00:00:00 38
## 2 sxk97 2017-11-26 01:00:00 21
## 3 sxk97 2017-11-26 02:00:00 10
## 4 sxk97 2017-11-26 03:00:00 10
## 5 sxk97 2017-11-26 04:00:00 5
## 6 sxk97 2017-11-26 05:00:00 2
## 7 sxk97 2017-11-26 06:00:00 0
## 8 sxk97 2017-11-26 07:00:00 11
## 9 sxk97 2017-11-26 08:00:00 4
## 10 sxk97 2017-11-26 09:00:00 4
## # ... with 494 more rows
MAE Result
mae_test_result <- scotty_model_error %>%
group_by(src_sub_area) %>%
select(src_sub_area, error) %>%
arrange(error) %>%
slice(1) %>%
ungroup()
# Reached MAE < 11 for all sub-areas on the test dataset
# We used tbats - note that the MAE here is computed on the transformed (sqrt + scaled) demand, so the scale is different
mae_test_result %<>%
summarise(src_sub_area = "all sub-area",
error = mean(error)) %>%
bind_rows(mae_test_result, .)
mae_test_result
## # A tibble: 4 x 2
## src_sub_area error
## <chr> <dbl>
## 1 sxk97 0.488
## 2 sxk9e 0.359
## 3 sxk9s 0.421
## 4 all sub-area 0.423
Data Submission
Here we prepare the one-week forecast for the period following the available data: 2017-12-03 00:00:00 to 2017-12-09 23:00:00.
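The code that loads the submission template is not shown; a sketch of what it could look like (the file name is a placeholder, and submit_nest is the nested object referenced below):

# read the submission template and nest its datetime column per sub-area
submission_template <- read_csv("data-submission.csv")  # placeholder file name
range(submission_template$datetime)

submit_nest <- submission_template %>%
  group_by(src_sub_area, demand) %>%
  nest()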
## Parsed with column specification:
## cols(
## src_sub_area = col_character(),
## datetime = col_datetime(format = ""),
## demand = col_logical()
## )
## # A tibble: 504 x 3
## src_sub_area datetime demand
## <chr> <dttm> <lgl>
## 1 sxk97 2017-12-03 00:00:00 NA
## 2 sxk97 2017-12-03 01:00:00 NA
## 3 sxk97 2017-12-03 02:00:00 NA
## 4 sxk97 2017-12-03 03:00:00 NA
## 5 sxk97 2017-12-03 04:00:00 NA
## 6 sxk97 2017-12-03 05:00:00 NA
## 7 sxk97 2017-12-03 06:00:00 NA
## 8 sxk97 2017-12-03 07:00:00 NA
## 9 sxk97 2017-12-03 08:00:00 NA
## 10 sxk97 2017-12-03 09:00:00 NA
## # ... with 494 more rows
## [1] "2017-12-03 00:00:00 UTC" "2017-12-09 23:00:00 UTC"
## Warning: All elements of `...` must be named.
## Did you want `data = c(datetime)`?
## # A tibble: 3 x 3
## src_sub_area demand data
## <chr> <lgl> <list>
## 1 sxk97 NA <tibble [168 x 1]>
## 2 sxk9e NA <tibble [168 x 1]>
## 3 sxk9s NA <tibble [168 x 1]>
Forecasting our data with the best model:
submission <- scotty_best_model %>%
mutate(
forecast =
map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
map2(submit_nest$data, ~ tibble( #map into our data submission nest
datetime = .y$datetime,
demand = as.vector(.x$mean)
)),
    key = paste(func_name, model_name, sep = "-")
  )
Select only the required columns:
submission <- submission %>%
select(src_sub_area, forecast) %>%
unnest(forecast) %>%
mutate(demand = rec_revert(demand, scotty_recipe, src_sub_area))
submission
## # A tibble: 504 x 3
## src_sub_area datetime demand
## <chr> <dttm> <dbl>
## 1 sxk97 2017-12-03 00:00:00 18
## 2 sxk97 2017-12-03 01:00:00 15
## 3 sxk97 2017-12-03 02:00:00 11
## 4 sxk97 2017-12-03 03:00:00 7
## 5 sxk97 2017-12-03 04:00:00 5
## 6 sxk97 2017-12-03 05:00:00 3
## 7 sxk97 2017-12-03 06:00:00 2
## 8 sxk97 2017-12-03 07:00:00 6
## 9 sxk97 2017-12-03 08:00:00 12
## 10 sxk97 2017-12-03 09:00:00 12
## # ... with 494 more rows
Checking the MAE Result for the Data Submission
Plot the submission forecast:
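The plotting code is not shown; a sketch following the same pattern as the earlier demand plots:

# plot the one-week submission forecast per sub-area
s <- submission %>%
  ggplot(aes(x = datetime, y = demand)) +
  geom_line() +
  labs(x = NULL, y = NULL) +
  facet_wrap(~ src_sub_area, scale = "free", ncol = 1) +
  theme_light()
ggplotly(s)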
Conclusion
In this project we have successfully built an automated forecasting model for hourly demand, evaluated on Sunday, December 3rd 2017 to Saturday, December 9th 2017.
To solve the problem we implemented a time-series workflow with two data representations: ts for single seasonality and msts for multiple/complex seasonality. We fitted ets(), auto.arima(), stlm(), and tbats() models and chose the one with the smallest error for each sub-area; ets() and auto.arima() are not suitable for multi-seasonal time series data.
According to our automated model selection, demand in every sub-area tends to follow the multi-seasonal representation we created.
We chose the TBATS model since it has the lowest error. TBATS can handle complex seasonalities (e.g. non-integer, non-nested, and large-period seasonality) without seasonality constraints, which makes detailed, long-term forecasts possible.
I believe this project has many business applications. Forecasting can play a major role in driving a company's success or failure: it helps keep prices low by optimising operations, cash flow, production, staffing, and financial management. It reduces uncertainty, helps anticipate changes in the market, and improves both internal communication and communication between a business and its customers. The automation and functional-programming techniques used in this project are also very useful in business, where they can greatly improve efficiency.