Scotty is a ride-sharing business operating in several big cities in Turkey. The company provides a motorcycle ride-sharing service for Turkish citizens and puts a premium on traveling efficiently through traffic; the app even references Star Trek's "beam me up" in its order buttons.
Scotty provided us with a real-time transaction dataset. With this dataset, we are going to help them solve their problems and improve their business processes.
Scotty: "Bring me the crystal ball!" It is almost the end of 2017, and we need to prepare a forecast model to help Scotty get ready for the end-of-year demand. Unfortunately, Scotty is not old enough to have last December's data, so we cannot look back at past demand to prepare a forecast for December. Fortunately, time series analysis is more than enough to help us forecast! And, as an investment in the business's future, we will develop an automated forecasting framework so we no longer have to meddle with forecast model selection in the future.
Build an automated forecasting model for hourly demand, to be evaluated on the next 7 days (Sunday, December 3rd 2017 to Saturday, December 9th 2017)!
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.1.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(padr)
library(tidymodels)
## Registered S3 method overwritten by 'tune':
## method from
## required_pkgs.model_spec parsnip
## -- Attaching packages -------------------------------------- tidymodels 0.1.4 --
## v broom 0.7.10 v rsample 0.1.1
## v dials 0.0.10 v tune 0.1.6
## v infer 1.0.0 v workflows 0.2.4
## v modeldata 0.1.1 v workflowsets 0.1.0
## v parsnip 0.1.7 v yardstick 0.0.9
## v recipes 0.1.17
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x scales::discard() masks purrr::discard()
## x dplyr::filter() masks stats::filter()
## x recipes::fixed() masks stringr::fixed()
## x dplyr::lag() masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step() masks stats::step()
## * Search for functions across packages at https://www.tidymodels.org/find/
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
##
## Attaching package: 'forecast'
## The following object is masked from 'package:yardstick':
##
## accuracy
library(readr)
library(DT)
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(dplyr)
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(tidyr)
library(recipes)
library(tibble)
library(purrr)
library(png)
library(grid)
library(tidyquant)
## Loading required package: PerformanceAnalytics
## Loading required package: xts
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
##
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
##
## legend
## Loading required package: quantmod
## Loading required package: TTR
##
## Attaching package: 'TTR'
## The following object is masked from 'package:dials':
##
## momentum
## == Need to Learn tidyquant? ====================================================
## Business Science offers a 1-hour course - Learning Lab #9: Performance Analysis & Portfolio Optimization with tidyquant!
## </> Learn more at: https://university.business-science.io/p/learning-labs-pro </>
scotty <- read.csv("data/data-train.csv")
head(scotty)
## id trip_id driver_id
## 1 59d005e1ffcfa261708ce9cd 59d005e9cb564761a8fe5d3e 59a892c5568be44b2734f276
## 2 59d0066affcfa261708ceb11 <NA> <NA>
## 3 59d006a1ffcfa261708ceba4 59d006c131e39c618969343d 599dc0dfa5b4fd5471ad8453
## 4 59d006d8cb564761a8fe5efd 59d007193d32b861760d4a77 59a5856573d56e7103b5d51d
## 5 59d0073ecb564761a8fe5fdc <NA> <NA>
## 6 59d007c3ffcfa261708ceee1 59d007de31e39c618969354f 599dc0dfa5b4fd5471ad8453
## rider_id start_time src_lat src_lon src_area
## 1 59ad2d6efba75a581666b506 2017-10-01T00:00:17Z 41.07047 29.01945 sxk9
## 2 59cd704bcf482f6ce2fadfdb 2017-10-01T00:02:34Z 41.07487 28.99528 sxk9
## 3 59bd62cc7e3c3b663924ba86 2017-10-01T00:03:29Z 41.04995 29.03107 sxk9
## 4 59c17cd1f6da2d274e16d0a7 2017-10-01T00:04:24Z 41.05287 28.99522 sxk9
## 5 596f47a8a294a82ea3e90e77 2017-10-01T00:06:06Z 41.06760 28.98827 sxk9
## 6 59bd62cc7e3c3b663924ba86 2017-10-01T00:08:19Z 41.04995 29.03107 sxk9
## src_sub_area dest_lat dest_lon dest_area dest_sub_area distance status
## 1 sxk9s 41.11716 29.03650 sxk9 sxk9u 5.379250 confirmed
## 2 sxk9e 41.08351 29.00228 sxk9 sxk9e 1.126098 nodrivers
## 3 sxk9s 41.04495 28.98192 sxk9 sxk9e 4.169492 confirmed
## 4 sxk9e 41.08140 28.98197 sxk9 sxk9e 3.358296 confirmed
## 5 sxk9e 41.02125 29.11316 sxk9 sxk9q 11.693573 nodrivers
## 6 sxk9s 41.04495 28.98192 sxk9 sxk9e 4.169492 confirmed
## confirmed_time_sec
## 1 8
## 2 0
## 3 32
## 4 65
## 5 0
## 6 27
The dataset includes information about each booking: identifiers (id, trip_id, driver_id, rider_id), the request timestamp (start_time), the source and destination coordinates and (sub-)areas, the trip distance, the order status (e.g. confirmed, nodrivers), and the confirmation time in seconds. For demand forecasting we only need the request time and the source sub-area:
scotty <- scotty %>%
select(start_time, src_sub_area)
Determine the steps to be performed in data preprocessing:
* Floor the start_time timestamps to the hour
train_date <- scotty %>%
mutate(datetime = ymd_hms(start_time),
datetime = floor_date(datetime, unit = "hour"))
tail(train_date)
## start_time src_sub_area datetime
## 90108 2017-12-02T23:57:29Z sxk97 2017-12-02 23:00:00
## 90109 2017-12-02T23:57:45Z sxk9s 2017-12-02 23:00:00
## 90110 2017-12-02T23:58:52Z sxk97 2017-12-02 23:00:00
## 90111 2017-12-02T23:58:55Z sxk9e 2017-12-02 23:00:00
## 90112 2017-12-02T23:58:59Z sxk9s 2017-12-02 23:00:00
## 90113 2017-12-02T23:59:20Z sxk9e 2017-12-02 23:00:00
* Aggregate the data by src_sub_area and datetime to count hourly demand
scotty_sum <- train_date %>%
group_by(datetime, src_sub_area) %>%
summarise(
demand = n()
)
## `summarise()` has grouped output by 'datetime'. You can override using the `.groups` argument.
scotty_sum
## # A tibble: 4,225 x 3
## # Groups: datetime [1,490]
## datetime src_sub_area demand
## <dttm> <chr> <int>
## 1 2017-10-01 00:00:00 sxk97 6
## 2 2017-10-01 00:00:00 sxk9e 8
## 3 2017-10-01 00:00:00 sxk9s 10
## 4 2017-10-01 01:00:00 sxk97 4
## 5 2017-10-01 01:00:00 sxk9e 2
## 6 2017-10-01 01:00:00 sxk9s 3
## 7 2017-10-01 02:00:00 sxk97 9
## 8 2017-10-01 02:00:00 sxk9e 3
## 9 2017-10-01 02:00:00 sxk9s 1
## 10 2017-10-01 03:00:00 sxk97 2
## # ... with 4,215 more rows
scotty_sum %>%
group_by(src_sub_area) %>%
summarise(
demand = n()
)
## # A tibble: 3 x 2
## src_sub_area demand
## <chr> <int>
## 1 sxk97 1439
## 2 sxk9e 1419
## 3 sxk9s 1367
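All three sub-areas have fewer rows than the number of hourly slots in the observation window, so some hours have no recorded demand at all; this is why we pad each series to a regular hourly grid below. As a quick check, the expected series length (the window runs from 2017-10-01 00:00 to 2017-12-02 23:00):
# hours between the first and last timestamp, inclusive
as.numeric(difftime(max(scotty_sum$datetime), min(scotty_sum$datetime),
                    units = "hours")) + 1 # 1512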
scotty_sum <- scotty_sum %>%
pad('hour',
start_val = as.POSIXct(min(scotty_sum$datetime)),
end_val = as.POSIXct(max(scotty_sum$datetime)),
group = "src_sub_area") %>%
mutate(
src_sub_area = as.factor(src_sub_area)
)
## Warning: group argument and dplyr::groups are both present and differ,
## dplyr::groups are ignored
scotty_sum %>%
group_by(src_sub_area) %>%
summarise(
demand = n()
)
## # A tibble: 3 x 2
## src_sub_area demand
## <fct> <int>
## 1 sxk97 1512
## 2 sxk9e 1512
## 3 sxk9s 1512
scotty_sum <- scotty_sum %>%
arrange(src_sub_area, datetime)
scotty_sum
## # A tibble: 4,536 x 3
## # Groups: datetime [1,512]
## datetime src_sub_area demand
## <dttm> <fct> <int>
## 1 2017-10-01 00:00:00 sxk97 6
## 2 2017-10-01 01:00:00 sxk97 4
## 3 2017-10-01 02:00:00 sxk97 9
## 4 2017-10-01 03:00:00 sxk97 2
## 5 2017-10-01 04:00:00 sxk97 5
## 6 2017-10-01 05:00:00 sxk97 4
## 7 2017-10-01 06:00:00 sxk97 1
## 8 2017-10-01 07:00:00 sxk97 NA
## 9 2017-10-01 08:00:00 sxk97 3
## 10 2017-10-01 09:00:00 sxk97 3
## # ... with 4,526 more rows
* Fill the NA demand counts on the newly padded time intervals with 0 (impute missing values)
# alternative: mutate(demand = replace_na(demand, 0))
scotty_sum %>%
  filter(is.na(demand)) %>%
  ungroup() %>%
  summarise(
    na_rows = n()
  )
## # A tibble: 1 x 1
## na_rows
## <int>
## 1 311
scotty_sum <- na.fill0(scotty_sum, fill = 0)
scotty_sum %>%
  filter(is.na(demand)) %>%
  ungroup() %>%
  summarise(
    na_rows = n()
  )
## # A tibble: 1 x 1
## na_rows
## <int>
## 1 0
scotty_sum
## # A tibble: 4,536 x 3
## # Groups: datetime [1,512]
## datetime src_sub_area demand
## <dttm> <fct> <int>
## 1 2017-10-01 00:00:00 sxk97 6
## 2 2017-10-01 01:00:00 sxk97 4
## 3 2017-10-01 02:00:00 sxk97 9
## 4 2017-10-01 03:00:00 sxk97 2
## 5 2017-10-01 04:00:00 sxk97 5
## 6 2017-10-01 05:00:00 sxk97 4
## 7 2017-10-01 06:00:00 sxk97 1
## 8 2017-10-01 07:00:00 sxk97 0
## 9 2017-10-01 08:00:00 sxk97 3
## 10 2017-10-01 09:00:00 sxk97 3
## # ... with 4,526 more rows
# test data: 504 obs -> 7 days * 24 hours * 3 sub-areas; train data: 4032 obs
scotty_sum %>%
filter(datetime > (max(scotty_sum$datetime) - hours(24 * 7 * 4))) %>%
ggplot(aes(x = datetime, y = demand)) +
geom_line() +
labs(x = NULL, y = NULL) +
facet_wrap(~ src_sub_area, scale = "free", ncol = 1) +
theme_light()
# sizes of the test and train sets for each sub-area (1512 - 168 = 1344 training hours)
test_size <- 24 * 7
train_size <- nrow(scotty_sum)/3 - test_size
# start and end timestamps for each sample
test_end <- max(scotty_sum$datetime)
test_start <- test_end - hours(test_size) + hours(1)
train_end <- test_start - hours(1)
train_start <- train_end - hours(train_size) + hours(1)
# interval for each sample
interval_train <- interval(train_start, train_end)
interval_test <- interval(test_start, test_end)
interval_train
## [1] 2017-10-01 UTC--2017-11-25 23:00:00 UTC
interval_test
## [1] 2017-11-26 UTC--2017-12-02 23:00:00 UTC
t <- scotty_sum %>%
mutate(sample = case_when(
datetime %within% interval_train ~ "train",
datetime %within% interval_test ~ "test"
)) %>%
mutate(sample = factor(sample, levels = c("train", "test"))) %>%
ggplot(aes(x = datetime, y = demand, colour = sample)) +
geom_line() +
labs(x = NULL, y = NULL, colour = NULL) +
facet_wrap(~ src_sub_area, scale = "free", ncol = 1) +
theme_light()
t
* Reshape to wide format (one column per sub-area), preprocess with a recipe, then build a nested data frame with the nest() function
scotty_sum_wide <- scotty_sum %>%
spread(src_sub_area, demand)
scotty_sum_wide
## # A tibble: 1,512 x 4
## # Groups: datetime [1,512]
## datetime sxk97 sxk9e sxk9s
## <dttm> <int> <int> <int>
## 1 2017-10-01 00:00:00 6 8 10
## 2 2017-10-01 01:00:00 4 2 3
## 3 2017-10-01 02:00:00 9 3 1
## 4 2017-10-01 03:00:00 2 2 0
## 5 2017-10-01 04:00:00 5 1 0
## 6 2017-10-01 05:00:00 4 2 0
## 7 2017-10-01 06:00:00 1 1 0
## 8 2017-10-01 07:00:00 0 0 1
## 9 2017-10-01 08:00:00 3 2 5
## 10 2017-10-01 09:00:00 3 11 5
## # ... with 1,502 more rows
scotty_recipe <- recipe(~.,
filter(scotty_sum_wide, datetime %within% interval_train)) %>%
step_sqrt(all_numeric()) %>%
step_center(all_numeric()) %>%
step_scale(all_numeric()) %>%
prep()
scotty_recipe
## Recipe
##
## Inputs:
##
## role #variables
## predictor 4
##
## Training data contained 1344 data points and no missing data.
##
## Operations:
##
## Square root transformation on sxk97, sxk9e, sxk9s [trained]
## Centering for sxk97, sxk9e, sxk9s [trained]
## Scaling for sxk97, sxk9e, sxk9s [trained]
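The trained centering means and scaling standard deviations live inside the prepped recipe, and we will read them back out of rec$steps when reverting forecasts to the original scale (see the rec_revert() helper below). For reference:
# trained parameters behind step_center and step_scale (steps 2 and 3)
scotty_recipe$steps[[2]]$means
scotty_recipe$steps[[3]]$sds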
# apply (bake) the trained recipe to the full wide data
scotty_sum_wide <- bake(scotty_recipe, scotty_sum_wide)
scotty_sum_wide
## # A tibble: 1,512 x 4
## datetime sxk97 sxk9e sxk9s
## <dttm> <dbl> <dbl> <dbl>
## 1 2017-10-01 00:00:00 -0.594 -0.697 -0.111
## 2 2017-10-01 01:00:00 -0.841 -1.28 -0.780
## 3 2017-10-01 02:00:00 -0.291 -1.15 -1.12
## 4 2017-10-01 03:00:00 -1.16 -1.28 -1.59
## 5 2017-10-01 04:00:00 -0.711 -1.44 -1.59
## 6 2017-10-01 05:00:00 -0.841 -1.28 -1.59
## 7 2017-10-01 06:00:00 -1.39 -1.44 -1.59
## 8 2017-10-01 07:00:00 -1.94 -1.85 -1.12
## 9 2017-10-01 08:00:00 -0.989 -1.28 -0.544
## 10 2017-10-01 09:00:00 -0.989 -0.497 -0.544
## # ... with 1,502 more rows
# helper to revert the recipe transformations (unscale -> uncenter -> square)
rec_revert <- function(vector, rec, varname) {
  # pull the trained values out of the recipe
  rec_center <- rec$steps[[2]]$means[varname]
  rec_scale <- rec$steps[[3]]$sds[varname]
  # invert the scale, center, and square-root steps
  results <- (vector * rec_scale + rec_center) ^ 2
  # round back to integer demand counts
  results <- round(results)
  results
}
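As a quick sanity check (illustrative, not part of the pipeline), reverting the baked values for one sub-area should recover the original integer demands:
# should reproduce the raw counts 6, 4, 9, 2, 5, 4 from the first rows
head(rec_revert(scotty_sum_wide$sxk97, scotty_recipe, "sxk97"))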
# convert back to long format
scotty_sum_long <- scotty_sum_wide %>%
gather(src_sub_area, demand, -datetime)
scotty_sum_long
## # A tibble: 4,536 x 3
## datetime src_sub_area demand
## <dttm> <chr> <dbl>
## 1 2017-10-01 00:00:00 sxk97 -0.594
## 2 2017-10-01 01:00:00 sxk97 -0.841
## 3 2017-10-01 02:00:00 sxk97 -0.291
## 4 2017-10-01 03:00:00 sxk97 -1.16
## 5 2017-10-01 04:00:00 sxk97 -0.711
## 6 2017-10-01 05:00:00 sxk97 -0.841
## 7 2017-10-01 06:00:00 sxk97 -1.39
## 8 2017-10-01 07:00:00 sxk97 -1.94
## 9 2017-10-01 08:00:00 sxk97 -0.989
## 10 2017-10-01 09:00:00 sxk97 -0.989
## # ... with 4,526 more rows
# label each observation as train or test
scotty_sum_nest <- scotty_sum_long %>%
mutate(sample = case_when(
datetime %within% interval_train ~ "train",
datetime %within% interval_test ~ "test"
)) %>%
drop_na()
# nest the data by sub-area and sample
scotty_sum_train_nest <- scotty_sum_nest %>%
group_by(src_sub_area, sample) %>%
nest(.key = "data") %>%
spread(sample, data)
## Warning: `.key` is deprecated
scotty_sum_train_nest
## # A tibble: 3 x 3
## # Groups: src_sub_area [3]
## src_sub_area test train
## <chr> <list> <list>
## 1 sxk97 <tibble [168 x 2]> <tibble [1,344 x 2]>
## 2 sxk9e <tibble [168 x 2]> <tibble [1,344 x 2]>
## 3 sxk9s <tibble [168 x 2]> <tibble [1,344 x 2]>
# list of functions that will be used to form a time series
ts_function_list <- list(
ts = function(x) ts(x$demand, frequency = 24),
msts = function(x) msts(x$demand, seasonal.periods = c(24, 24 * 7))
)
ts_function_list
## $ts
## function(x) ts(x$demand, frequency = 24)
##
## $msts
## function(x) msts(x$demand, seasonal.periods = c(24, 24 * 7))
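The two constructors differ only in how seasonality is declared: ts() carries a single daily period, while msts() records both the daily and weekly periods. A quick illustration on a hypothetical two-week series (x here is made up just to inspect the attributes):
# hypothetical series, only to inspect the seasonal attributes
x <- tibble(demand = rnorm(24 * 14))
frequency(ts_function_list$ts(x))      # 24
attr(ts_function_list$msts(x), "msts") # 24 168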
ts_function_list <- ts_function_list %>%
rep(length(unique(scotty_sum_nest$src_sub_area))) %>%
enframe("func_name", "func") %>%
mutate(
src_sub_area = sort(rep(unique(scotty_sum_nest$src_sub_area), length(unique(.$func_name))))
)
ts_function_list
## # A tibble: 6 x 3
## func_name func src_sub_area
## <chr> <list> <chr>
## 1 ts <fn> sxk97
## 2 msts <fn> sxk97
## 3 ts <fn> sxk9e
## 4 msts <fn> sxk9e
## 5 ts <fn> sxk9s
## 6 msts <fn> sxk9s
scotty_fit <- scotty_sum_train_nest %>%
left_join(ts_function_list)
## Joining, by = "src_sub_area"
# time series model list
model_function_list <- list(
ets = function(x) ets(x),
auto.arima = function(x) auto.arima(x),
stlm = function(x) stlm(x),
tbats = function(x) tbats(x, use.box.cox = FALSE)
)
model_function_list
## $ets
## function(x) ets(x)
##
## $auto.arima
## function(x) auto.arima(x)
##
## $stlm
## function(x) stlm(x)
##
## $tbats
## function(x) tbats(x, use.box.cox = FALSE)
# replicate the model list for each sub-area and combine into nested format
model_function_list <- model_function_list %>%
rep(length(unique(scotty_sum_nest$src_sub_area))) %>%
enframe("model_name", "model") %>%
mutate(src_sub_area =
sort(rep(unique(scotty_sum_nest$src_sub_area), length(unique(.$model_name))))
)
model_function_list
## # A tibble: 12 x 3
## model_name model src_sub_area
## <chr> <list> <chr>
## 1 ets <fn> sxk97
## 2 auto.arima <fn> sxk97
## 3 stlm <fn> sxk97
## 4 tbats <fn> sxk97
## 5 ets <fn> sxk9e
## 6 auto.arima <fn> sxk9e
## 7 stlm <fn> sxk9e
## 8 tbats <fn> sxk9e
## 9 ets <fn> sxk9s
## 10 auto.arima <fn> sxk9s
## 11 stlm <fn> sxk9s
## 12 tbats <fn> sxk9s
scotty_fit <- scotty_fit %>%
  left_join(model_function_list) %>%
  filter(
    # ets() and auto.arima() only handle a single seasonal period,
    # so drop their msts combinations
    !(model_name == "ets" & func_name == "msts"),
    !(model_name == "auto.arima" & func_name == "msts")
  )
## Joining, by = "src_sub_area"
scotty_fit
## # A tibble: 18 x 7
## # Groups: src_sub_area [3]
## src_sub_area test train func_name func model_name model
## <chr> <list> <list> <chr> <lis> <chr> <lis>
## 1 sxk97 <tibble [168 x 2]> <tibble [1,~ ts <fn> ets <fn>
## 2 sxk97 <tibble [168 x 2]> <tibble [1,~ ts <fn> auto.arima <fn>
## 3 sxk97 <tibble [168 x 2]> <tibble [1,~ ts <fn> stlm <fn>
## 4 sxk97 <tibble [168 x 2]> <tibble [1,~ ts <fn> tbats <fn>
## 5 sxk97 <tibble [168 x 2]> <tibble [1,~ msts <fn> stlm <fn>
## 6 sxk97 <tibble [168 x 2]> <tibble [1,~ msts <fn> tbats <fn>
## 7 sxk9e <tibble [168 x 2]> <tibble [1,~ ts <fn> ets <fn>
## 8 sxk9e <tibble [168 x 2]> <tibble [1,~ ts <fn> auto.arima <fn>
## 9 sxk9e <tibble [168 x 2]> <tibble [1,~ ts <fn> stlm <fn>
## 10 sxk9e <tibble [168 x 2]> <tibble [1,~ ts <fn> tbats <fn>
## 11 sxk9e <tibble [168 x 2]> <tibble [1,~ msts <fn> stlm <fn>
## 12 sxk9e <tibble [168 x 2]> <tibble [1,~ msts <fn> tbats <fn>
## 13 sxk9s <tibble [168 x 2]> <tibble [1,~ ts <fn> ets <fn>
## 14 sxk9s <tibble [168 x 2]> <tibble [1,~ ts <fn> auto.arima <fn>
## 15 sxk9s <tibble [168 x 2]> <tibble [1,~ ts <fn> stlm <fn>
## 16 sxk9s <tibble [168 x 2]> <tibble [1,~ ts <fn> tbats <fn>
## 17 sxk9s <tibble [168 x 2]> <tibble [1,~ msts <fn> stlm <fn>
## 18 sxk9s <tibble [168 x 2]> <tibble [1,~ msts <fn> tbats <fn>
scotty_fit <- scotty_fit %>%
mutate(
params = map(train, ~ list(x = .x)),
data = invoke_map(func, params),
params = map(data, ~ list(x = .x)),
fitted = invoke_map(model, params)
) %>%
select(-data, -params)
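A note on invoke_map(): it has since been retired in purrr. An equivalent formulation with map2() (a sketch, not run here) would be:
# scotty_fit <- scotty_fit %>%
#   mutate(
#     data   = map2(func, train, ~ .x(.y)),  # build ts/msts objects from train
#     fitted = map2(model, data, ~ .x(.y))   # fit each model function
#   ) %>%
#   select(-data)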
# calculate test errors; note the MAE is on the sqrt-centered-scaled series,
# not on raw demand counts
scotty_model_error <- scotty_fit %>%
  mutate(error =
    map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
      map2_dbl(test, ~ mae_vec(truth = .y$demand, estimate = .x$mean))
  ) %>%
  arrange(src_sub_area, error) # arrange by smallest error within each area
scotty_model_error
## # A tibble: 18 x 9
## # Groups: src_sub_area [3]
## src_sub_area test train func_name func model_name model fitted error
## <chr> <list> <list> <chr> <lis> <chr> <lis> <list> <dbl>
## 1 sxk97 <tibble~ <tibble ~ msts <fn> tbats <fn> <tbat~ 0.488
## 2 sxk97 <tibble~ <tibble ~ msts <fn> stlm <fn> <stlm> 0.540
## 3 sxk97 <tibble~ <tibble ~ ts <fn> tbats <fn> <tbat~ 0.573
## 4 sxk97 <tibble~ <tibble ~ ts <fn> ets <fn> <ets> 0.574
## 5 sxk97 <tibble~ <tibble ~ ts <fn> stlm <fn> <stlm> 0.598
## 6 sxk97 <tibble~ <tibble ~ ts <fn> auto.arima <fn> <fr_A~ 0.702
## 7 sxk9e <tibble~ <tibble ~ msts <fn> tbats <fn> <tbat~ 0.359
## 8 sxk9e <tibble~ <tibble ~ msts <fn> stlm <fn> <stlm> 0.362
## 9 sxk9e <tibble~ <tibble ~ ts <fn> ets <fn> <ets> 0.419
## 10 sxk9e <tibble~ <tibble ~ ts <fn> tbats <fn> <tbat~ 0.422
## 11 sxk9e <tibble~ <tibble ~ ts <fn> stlm <fn> <stlm> 0.435
## 12 sxk9e <tibble~ <tibble ~ ts <fn> auto.arima <fn> <fr_A~ 0.502
## 13 sxk9s <tibble~ <tibble ~ msts <fn> stlm <fn> <stlm> 0.417
## 14 sxk9s <tibble~ <tibble ~ msts <fn> tbats <fn> <tbat~ 0.421
## 15 sxk9s <tibble~ <tibble ~ ts <fn> tbats <fn> <tbat~ 0.431
## 16 sxk9s <tibble~ <tibble ~ ts <fn> auto.arima <fn> <fr_A~ 0.503
## 17 sxk9s <tibble~ <tibble ~ ts <fn> stlm <fn> <stlm> 0.516
## 18 sxk9s <tibble~ <tibble ~ ts <fn> ets <fn> <ets> 0.550
scotty_best_model <- scotty_model_error %>%
group_by(src_sub_area) %>%
arrange(error) %>%
slice(1) %>% #choose the smallest error & best model for each area directly
ungroup() %>%
select(src_sub_area, ends_with("_name"),error)
scotty_best_model <- scotty_best_model %>%
select(-error) %>%
left_join(scotty_model_error)%>%
select(-error)
## Joining, by = c("src_sub_area", "func_name", "model_name")
scotty_best_model
## # A tibble: 3 x 8
## src_sub_area func_name model_name test train func model fitted
## <chr> <chr> <chr> <list> <list> <lis> <lis> <list>
## 1 sxk97 msts tbats <tibble [168 x 2]> <tibb~ <fn> <fn> <tbat~
## 2 sxk9e msts tbats <tibble [168 x 2]> <tibb~ <fn> <fn> <tbat~
## 3 sxk9s msts stlm <tibble [168 x 2]> <tibb~ <fn> <fn> <stlm>
scotty_test <- scotty_model_error %>%
mutate(
forecast =
map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
map2(test, ~ tibble(
datetime = .y$datetime,
demand = as.vector(.x$mean)
)),
key = paste(func_name, model_name, sep = "-")
)
scotty_test
## # A tibble: 18 x 11
## # Groups: src_sub_area [3]
## src_sub_area test train func_name func model_name model fitted error
## <chr> <list> <list> <chr> <lis> <chr> <lis> <list> <dbl>
## 1 sxk97 <tibble~ <tibble ~ msts <fn> tbats <fn> <tbat~ 0.488
## 2 sxk97 <tibble~ <tibble ~ msts <fn> stlm <fn> <stlm> 0.540
## 3 sxk97 <tibble~ <tibble ~ ts <fn> tbats <fn> <tbat~ 0.573
## 4 sxk97 <tibble~ <tibble ~ ts <fn> ets <fn> <ets> 0.574
## 5 sxk97 <tibble~ <tibble ~ ts <fn> stlm <fn> <stlm> 0.598
## 6 sxk97 <tibble~ <tibble ~ ts <fn> auto.arima <fn> <fr_A~ 0.702
## 7 sxk9e <tibble~ <tibble ~ msts <fn> tbats <fn> <tbat~ 0.359
## 8 sxk9e <tibble~ <tibble ~ msts <fn> stlm <fn> <stlm> 0.362
## 9 sxk9e <tibble~ <tibble ~ ts <fn> ets <fn> <ets> 0.419
## 10 sxk9e <tibble~ <tibble ~ ts <fn> tbats <fn> <tbat~ 0.422
## 11 sxk9e <tibble~ <tibble ~ ts <fn> stlm <fn> <stlm> 0.435
## 12 sxk9e <tibble~ <tibble ~ ts <fn> auto.arima <fn> <fr_A~ 0.502
## 13 sxk9s <tibble~ <tibble ~ msts <fn> stlm <fn> <stlm> 0.417
## 14 sxk9s <tibble~ <tibble ~ msts <fn> tbats <fn> <tbat~ 0.421
## 15 sxk9s <tibble~ <tibble ~ ts <fn> tbats <fn> <tbat~ 0.431
## 16 sxk9s <tibble~ <tibble ~ ts <fn> auto.arima <fn> <fr_A~ 0.503
## 17 sxk9s <tibble~ <tibble ~ ts <fn> stlm <fn> <stlm> 0.516
## 18 sxk9s <tibble~ <tibble ~ ts <fn> ets <fn> <ets> 0.550
## # ... with 2 more variables: forecast <list>, key <chr>
scotty_test_key <- scotty_test %>%
select(src_sub_area, key, actual = test, forecast) %>%
spread(key, forecast) %>%
gather(key, value, -src_sub_area)
scotty_test_key
## # A tibble: 21 x 3
## # Groups: src_sub_area [3]
## src_sub_area key value
## <chr> <chr> <list>
## 1 sxk97 actual <tibble [168 x 2]>
## 2 sxk9e actual <tibble [168 x 2]>
## 3 sxk9s actual <tibble [168 x 2]>
## 4 sxk97 msts-stlm <tibble [168 x 2]>
## 5 sxk9e msts-stlm <tibble [168 x 2]>
## 6 sxk9s msts-stlm <tibble [168 x 2]>
## 7 sxk97 msts-tbats <tibble [168 x 2]>
## 8 sxk9e msts-tbats <tibble [168 x 2]>
## 9 sxk9s msts-tbats <tibble [168 x 2]>
## 10 sxk97 ts-auto.arima <tibble [168 x 2]>
## # ... with 11 more rows
scotty_test_unnest <- scotty_test_key %>%
unnest(value) %>%
mutate(demand = rec_revert(demand, scotty_recipe, src_sub_area))
scotty_test_unnest
## # A tibble: 3,528 x 4
## # Groups: src_sub_area [3]
## src_sub_area key datetime demand
## <chr> <chr> <dttm> <dbl>
## 1 sxk97 actual 2017-11-26 00:00:00 38
## 2 sxk97 actual 2017-11-26 01:00:00 21
## 3 sxk97 actual 2017-11-26 02:00:00 10
## 4 sxk97 actual 2017-11-26 03:00:00 10
## 5 sxk97 actual 2017-11-26 04:00:00 5
## 6 sxk97 actual 2017-11-26 05:00:00 2
## 7 sxk97 actual 2017-11-26 06:00:00 0
## 8 sxk97 actual 2017-11-26 07:00:00 11
## 9 sxk97 actual 2017-11-26 08:00:00 4
## 10 sxk97 actual 2017-11-26 09:00:00 4
## # ... with 3,518 more rows
t2 <- scotty_test_unnest %>%
  ggplot(aes(x = datetime, y = demand, colour = key)) +
  geom_line(data = scotty_test_unnest %>%
              filter(key == "actual"), aes(y = demand), alpha = 0.3, size = 0.8) +
  # `frame` is a plotly aesthetic: ggplot warns about it, but passing the plot
  # through plotly::ggplotly() would animate over the model keys
  geom_line(data = scotty_test_unnest %>%
              filter(key != "actual"), aes(frame = key, col = key)) +
  labs(x = "", y = "Demand", title = "Model Prediction Comparison", frame = "Model") +
  facet_wrap(~ src_sub_area, scale = "free_y", ncol = 1) +
  tidyquant::theme_tq() +
  theme(legend.position = "none", axis.title.y = element_text(size = 12),
        axis.text.y = element_text(size = 7))
## Warning: Ignoring unknown aesthetics: frame
t2
## Automated Model Selection
# filter by lowest test error
scotty_min_error <- scotty_model_error %>%
select(-fitted) %>% # remove unused
group_by(src_sub_area) %>%
filter(error == min(error)) %>%
ungroup()
scotty_min_error # same selection as scotty_best_model above
## # A tibble: 3 x 8
## src_sub_area test train func_name func model_name model error
## <chr> <list> <list> <chr> <lis> <chr> <lis> <dbl>
## 1 sxk97 <tibble [168 x 2]> <tibbl~ msts <fn> tbats <fn> 0.488
## 2 sxk9e <tibble [168 x 2]> <tibbl~ msts <fn> tbats <fn> 0.359
## 3 sxk9s <tibble [168 x 2]> <tibbl~ msts <fn> stlm <fn> 0.417
scotty_combine <- scotty_min_error %>%
mutate(fulldata = map2(train, test, ~ bind_rows(.x, .y))) %>%
select(src_sub_area, fulldata, everything(), -train, -test)
scotty_combine
## # A tibble: 3 x 7
## src_sub_area fulldata func_name func model_name model error
## <chr> <list> <chr> <list> <chr> <list> <dbl>
## 1 sxk97 <tibble [1,512 x 2]> msts <fn> tbats <fn> 0.488
## 2 sxk9e <tibble [1,512 x 2]> msts <fn> tbats <fn> 0.359
## 3 sxk9s <tibble [1,512 x 2]> msts <fn> stlm <fn> 0.417
scotty_combine_nest <- scotty_combine %>%
mutate(
params = map(fulldata, ~ list(x = .x)),
data = invoke_map(func, params),
params = map(data, ~ list(x = .x)),
fitted = invoke_map(model, params)
) %>%
select(-data, -params)
scotty_combine_nest
## # A tibble: 3 x 8
## src_sub_area fulldata func_name func model_name model error fitted
## <chr> <list> <chr> <list> <chr> <list> <dbl> <list>
## 1 sxk97 <tibble [1,512 x~ msts <fn> tbats <fn> 0.488 <tbat~
## 2 sxk9e <tibble [1,512 x~ msts <fn> tbats <fn> 0.359 <tbat~
## 3 sxk9s <tibble [1,512 x~ msts <fn> stlm <fn> 0.417 <stlm>
# forecast the next 4 weeks ahead of the full data
scotty_combine_forecast <- scotty_combine_nest %>%
mutate(forecast =
map(fitted, ~ forecast(.x, h = 24 * 7 * 4)) %>%
map2(fulldata, ~ tibble(
datetime = timetk::tk_make_future_timeseries(.y$datetime, 24 * 7 * 4),
demand = as.vector(.x$mean)
))
)
scotty_combine_forecast
## # A tibble: 3 x 9
## src_sub_area fulldata func_name func model_name model error fitted forecast
## <chr> <list> <chr> <list> <chr> <lis> <dbl> <list> <list>
## 1 sxk97 <tibble ~ msts <fn> tbats <fn> 0.488 <tbat~ <tibble~
## 2 sxk9e <tibble ~ msts <fn> tbats <fn> 0.359 <tbat~ <tibble~
## 3 sxk9s <tibble ~ msts <fn> stlm <fn> 0.417 <stlm> <tibble~
scotty_combine_forecast_unnest <- scotty_combine_forecast %>%
select(src_sub_area, actual = fulldata, forecast) %>%
gather(key, value, -src_sub_area) %>%
unnest(value) %>%
mutate(demand = rec_revert(demand, scotty_recipe, src_sub_area))
scotty_combine_forecast_unnest
## # A tibble: 6,552 x 4
## src_sub_area key datetime demand
## <chr> <chr> <dttm> <dbl>
## 1 sxk97 actual 2017-10-01 00:00:00 6
## 2 sxk97 actual 2017-10-01 01:00:00 4
## 3 sxk97 actual 2017-10-01 02:00:00 9
## 4 sxk97 actual 2017-10-01 03:00:00 2
## 5 sxk97 actual 2017-10-01 04:00:00 5
## 6 sxk97 actual 2017-10-01 05:00:00 4
## 7 sxk97 actual 2017-10-01 06:00:00 1
## 8 sxk97 actual 2017-10-01 07:00:00 0
## 9 sxk97 actual 2017-10-01 08:00:00 3
## 10 sxk97 actual 2017-10-01 09:00:00 3
## # ... with 6,542 more rows
t3 <- scotty_combine_forecast_unnest %>%
ggplot(aes(x = datetime, y = demand, colour = key)) +
geom_line() +
labs(x = NULL, y = NULL, colour = NULL) +
facet_wrap(~ src_sub_area, scale = "free", ncol = 1) +
theme_light() +
tidyquant::scale_colour_tq()
t3
data_test <- scotty_combine_forecast_unnest %>%
filter(datetime %within% interval_test) %>%
select(src_sub_area, datetime, demand)
data_test
## # A tibble: 504 x 3
## src_sub_area datetime demand
## <chr> <dttm> <dbl>
## 1 sxk97 2017-11-26 00:00:00 38
## 2 sxk97 2017-11-26 01:00:00 21
## 3 sxk97 2017-11-26 02:00:00 10
## 4 sxk97 2017-11-26 03:00:00 10
## 5 sxk97 2017-11-26 04:00:00 5
## 6 sxk97 2017-11-26 05:00:00 2
## 7 sxk97 2017-11-26 06:00:00 0
## 8 sxk97 2017-11-26 07:00:00 11
## 9 sxk97 2017-11-26 08:00:00 4
## 10 sxk97 2017-11-26 09:00:00 4
## # ... with 494 more rows
# target: reach MAE < 11 for all sub-areas on (our own) test dataset;
# slice(2) takes the second-lowest error per area
mae_test_result <- scotty_model_error %>%
  group_by(src_sub_area) %>%
  select(src_sub_area, error) %>%
  arrange(error) %>%
  slice(2) %>%
  ungroup()
# we mostly ended up with tbats; the MAE scale reads a little differently here
# because the errors are computed on the transformed series
mae_test_result <- mae_test_result %>%
  summarise(src_sub_area = "all sub-area",
            error = mean(error)) %>%
  bind_rows(mae_test_result, .)
mae_test_result
## # A tibble: 4 x 2
## src_sub_area error
## <chr> <dbl>
## 1 sxk97 0.540
## 2 sxk9e 0.362
## 3 sxk9s 0.421
## 4 all sub-area 0.441
submission <- read_csv("data/data-test.csv")
## Rows: 504 Columns: 3
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): src_sub_area
## lgl (1): demand
## dttm (1): datetime
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
submission
## # A tibble: 504 x 3
## src_sub_area datetime demand
## <chr> <dttm> <lgl>
## 1 sxk97 2017-12-03 00:00:00 NA
## 2 sxk97 2017-12-03 01:00:00 NA
## 3 sxk97 2017-12-03 02:00:00 NA
## 4 sxk97 2017-12-03 03:00:00 NA
## 5 sxk97 2017-12-03 04:00:00 NA
## 6 sxk97 2017-12-03 05:00:00 NA
## 7 sxk97 2017-12-03 06:00:00 NA
## 8 sxk97 2017-12-03 07:00:00 NA
## 9 sxk97 2017-12-03 08:00:00 NA
## 10 sxk97 2017-12-03 09:00:00 NA
## # ... with 494 more rows
range(submission$datetime)
## [1] "2017-12-03 00:00:00 UTC" "2017-12-09 23:00:00 UTC"
# nest the submission data by sub-area
submit_nest <- submission %>% nest(datetime)
## Warning: All elements of `...` must be named.
## Did you want `data = c(datetime)`?
submit_nest
## # A tibble: 3 x 3
## src_sub_area demand data
## <chr> <lgl> <list>
## 1 sxk97 NA <tibble [168 x 1]>
## 2 sxk9e NA <tibble [168 x 1]>
## 3 sxk9s NA <tibble [168 x 1]>
submission <- scotty_best_model %>%
  mutate(
    forecast =
      map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
      map2(submit_nest$data, ~ tibble( # map onto the nested submission timestamps
        datetime = .y$datetime,
        demand = as.vector(.x$mean)
      )),
    key = paste(func_name, model_name, sep = "-")
  )
submission <- submission %>%
select(src_sub_area, forecast) %>%
unnest(forecast) %>%
mutate(demand = rec_revert(demand, scotty_recipe, src_sub_area))
submission
## # A tibble: 504 x 3
## src_sub_area datetime demand
## <chr> <dttm> <dbl>
## 1 sxk97 2017-12-03 00:00:00 18
## 2 sxk97 2017-12-03 01:00:00 15
## 3 sxk97 2017-12-03 02:00:00 11
## 4 sxk97 2017-12-03 03:00:00 7
## 5 sxk97 2017-12-03 04:00:00 5
## 6 sxk97 2017-12-03 05:00:00 3
## 7 sxk97 2017-12-03 06:00:00 2
## 8 sxk97 2017-12-03 07:00:00 6
## 9 sxk97 2017-12-03 08:00:00 12
## 10 sxk97 2017-12-03 09:00:00 12
## # ... with 494 more rows
write.csv(submission, "taniaarsya_submission_scottyts.csv",row.names = F)
knitr::include_graphics("leaderboard scotty ts.png")
## Conclusion
This project succeeded in building an automated forecasting model for hourly demand, evaluated on Sunday, December 3, 2017 through Saturday, December 9, 2017.
To solve this problem, machine learning algorithms were implemented for time series analysis. Two series representations were used: ts() for a single seasonality and msts() for multiple/complex seasonalities. The framework fits ets(), auto.arima(), stlm(), and tbats() and picks whichever yields the smallest error; ets() and auto.arima() are excluded for multi-seasonal series because they only handle a single seasonal period. According to the automated selection, demand in every sub-area tends to follow the multi-seasonal models.
The automated selection mostly settled on the TBATS model (for sxk97 and sxk9e, with STLM winning for sxk9s) because it had the lowest test error. TBATS can deal with complex seasonality (e.g., non-integer seasonal periods, non-nested seasonalities, and seasons with large periods) without seasonal constraints, making detailed, long-term forecasts possible.
I believe this project has many business applications. Forecasting can play a major role in a company's success or failure: it helps keep prices low by optimizing business operations, cash flow, production, staffing, and financial management. It reduces uncertainty, helps us anticipate changes in the market, and improves internal communication as well as communication between the business and its customers. The automation and functional-programming patterns used in this project are just as useful elsewhere in a business, where they can significantly increase efficiency.