Time Series Forecasting describes a set of research problems where observations are collected at regular time intervals and where we can assume correlations among successive observations. The main idea is to predict the future based on previous data; the process itself is called forecasting. A time series is a sequence of observations of a single quantity over time.
Before we work with the time series, we load the required libraries first:
library(lubridate) #for working with dates
library(tidyquant) #for some ggplot aesthetics; theme_tq
library(forecast) #for time series modeling and forecasting; autoplot
library(tidyverse) #for data wrangling
library(padr) #pad (complete the hours/time)
library(recipes) #for data preprocess
library(yardstick) #for measuring forecast performance; mae_vec
library(timetk) #a toolkit for time series
library(ggplot2) #for plotting our model
library(plotly) #for interactive plots
library(magrittr) #for the %<>% assignment pipe used later
Carbikemovers - India
Scotty is a transportation company in India. At the end of 2017, they needed to know and be ready for the year-end demand. Unfortunately, Scotty was not old enough to have the previous year's data for December, so they could not look back at past demand to prepare a forecast for December.
Fortunately, we already know that time series analysis is more than enough to help us forecast. As an investment in the business' future, we will develop an automated forecasting framework.
Based on the "scotty-ts-auto/data/data-train.csv" data (up to Saturday, December 2nd 2017), we will make a report on our automated forecasting framework for hourly demand, which will be evaluated on the next 7 days (Sunday, December 3rd 2017 to Saturday, December 9th 2017).
Let’s start by importing the dataset:
Data Train
## Parsed with column specification:
## cols(
## id = col_character(),
## trip_id = col_character(),
## driver_id = col_character(),
## rider_id = col_character(),
## start_time = col_datetime(format = ""),
## src_lat = col_double(),
## src_lon = col_double(),
## src_area = col_character(),
## src_sub_area = col_character(),
## dest_lat = col_double(),
## dest_lon = col_double(),
## dest_area = col_character(),
## dest_sub_area = col_character(),
## distance = col_double(),
## status = col_character(),
## confirmed_time_sec = col_double()
## )
## # A tibble: 10 x 16
## id trip_id driver_id rider_id start_time src_lat src_lon
## <chr> <chr> <chr> <chr> <dttm> <dbl> <dbl>
## 1 59d0~ 59d005~ 59a892c5~ 59ad2d6~ 2017-10-01 00:00:17 41.1 29.0
## 2 59d0~ <NA> <NA> 59cd704~ 2017-10-01 00:02:34 41.1 29.0
## 3 59d0~ 59d006~ 599dc0df~ 59bd62c~ 2017-10-01 00:03:29 41.0 29.0
## 4 59d0~ 59d007~ 59a58565~ 59c17cd~ 2017-10-01 00:04:24 41.1 29.0
## 5 59d0~ <NA> <NA> 596f47a~ 2017-10-01 00:06:06 41.1 29.0
## 6 59d0~ 59d007~ 599dc0df~ 59bd62c~ 2017-10-01 00:08:19 41.0 29.0
## 7 59d0~ 59d008~ 599dc0df~ 59bd62c~ 2017-10-01 00:09:01 41.0 29.0
## 8 59d0~ 59d008~ 599dc0df~ 59bd62c~ 2017-10-01 00:09:45 41.0 29.0
## 9 59d0~ <NA> <NA> 59bd62c~ 2017-10-01 00:11:06 41.0 29.0
## 10 59d0~ 59d008~ 599dc0df~ 596eab7~ 2017-10-01 00:12:24 41.0 29.0
## # ... with 9 more variables: src_area <chr>, src_sub_area <chr>,
## # dest_lat <dbl>, dest_lon <dbl>, dest_area <chr>, dest_sub_area <chr>,
## # distance <dbl>, status <chr>, confirmed_time_sec <dbl>
As per the information above, we would like to create an automated forecast for the week after the training data ends.
Let's look at the format of the evaluation data:
## Parsed with column specification:
## cols(
## src_sub_area = col_character(),
## datetime = col_datetime(format = ""),
## demand = col_logical()
## )
## # A tibble: 10 x 3
## src_sub_area datetime demand
## <chr> <dttm> <lgl>
## 1 sxk97 2017-12-03 00:00:00 NA
## 2 sxk97 2017-12-03 01:00:00 NA
## 3 sxk97 2017-12-03 02:00:00 NA
## 4 sxk97 2017-12-03 03:00:00 NA
## 5 sxk97 2017-12-03 04:00:00 NA
## 6 sxk97 2017-12-03 05:00:00 NA
## 7 sxk97 2017-12-03 06:00:00 NA
## 8 sxk97 2017-12-03 07:00:00 NA
## 9 sxk97 2017-12-03 08:00:00 NA
## 10 sxk97 2017-12-03 09:00:00 NA
We only need the sub-area, datetime, and demand columns, so we reshape the training data into the same format.
We aggregate the time using the floor_date() function and count Scotty's demand in each hour.
# aggregate the time data
subarea_train <- data_train %>%
mutate(
datetime = floor_date(start_time, "hour")) %>%
group_by(src_sub_area, datetime) %>%
summarise(demand = n()) %>%
ungroup()
# quick check
subarea_train
## # A tibble: 4,225 x 3
## src_sub_area datetime demand
## <chr> <dttm> <int>
## 1 sxk97 2017-10-01 00:00:00 6
## 2 sxk97 2017-10-01 01:00:00 4
## 3 sxk97 2017-10-01 02:00:00 9
## 4 sxk97 2017-10-01 03:00:00 2
## 5 sxk97 2017-10-01 04:00:00 5
## 6 sxk97 2017-10-01 05:00:00 4
## 7 sxk97 2017-10-01 06:00:00 1
## 8 sxk97 2017-10-01 08:00:00 3
## 9 sxk97 2017-10-01 09:00:00 3
## 10 sxk97 2017-10-01 10:00:00 3
## # ... with 4,215 more rows
From the quick check, we can see that the time series is not complete yet: hours with zero demand are simply missing.
So, we will complete it by doing time series padding.
As mentioned before, we will complete the time series data, e.g. we will fill 2017-10-01 07:00:00 with demand = 0:
sxk97 2017-10-01 06:00:00 1
sxk97 2017-10-01 08:00:00 3
subarea_train <- subarea_train %>%
group_by(src_sub_area) %>%
pad(
start_val = ymd_hms("2017-10-01 00:00:00"),
end_val = ymd_hms("2017-12-02 23:00:00")) %>% #detect internally: "hour"
ungroup()
## pad applied on the interval: hour
## # A tibble: 4,536 x 3
## src_sub_area datetime demand
## <chr> <dttm> <dbl>
## 1 sxk97 2017-10-01 00:00:00 6
## 2 sxk97 2017-10-01 01:00:00 4
## 3 sxk97 2017-10-01 02:00:00 9
## 4 sxk97 2017-10-01 03:00:00 2
## 5 sxk97 2017-10-01 04:00:00 5
## 6 sxk97 2017-10-01 05:00:00 4
## 7 sxk97 2017-10-01 06:00:00 1
## 8 sxk97 2017-10-01 07:00:00 0
## 9 sxk97 2017-10-01 08:00:00 3
## 10 sxk97 2017-10-01 09:00:00 3
## # ... with 4,526 more rows
Now the data is complete. It contains the src_sub_area, datetime, and demand columns. The dates range from:
## [1] "2017-10-01 00:00:00 UTC" "2017-12-02 23:00:00 UTC"
Here's a quick plot of the last four weeks of data:
# quick plot
subarea_train %>%
filter(datetime >= max(datetime) - hours(24 * 7 * 4)) %>%
ggplot(aes(x = datetime, y = demand)) +
geom_line() +
labs(x = NULL, y = NULL) +
facet_wrap(~ src_sub_area, scale = "free", ncol = 1) +
tidyquant::theme_tq()
To help us benchmark multiple models, we need to split off some portion of our data for validation. The strategy here is to cut the last 1 week as our test dataset. Then we cut a (bigger) portion before it as the train dataset, say, 1 week times 8, approximately 2 months. This strategy is just a simplified version of rolling forecasting origin, with only one pair of train and test samples.
So the first step here is to get the start and end dates of the train and test samples. The most straightforward way is to define the train and test sizes, then recursively derive the start and end index for each:
# train-val-test size
test_size <- 24 * 7
train_size <- 24 * 7 * 8
# get the min-max of the time index for each sample
test_end <- max(subarea_train$datetime)
test_start <- test_end - hours(test_size) + hours(1)
train_end <- test_start - hours(1)
train_start <- train_end - hours(train_size) + hours(1)
We combine the start and end indices into date intervals:
# get the interval of each samples
intrain <- interval(train_start, train_end)
intest <- interval(test_start, test_end)
intrain
## [1] 2017-10-01 UTC--2017-11-25 23:00:00 UTC
intest
## [1] 2017-11-26 UTC--2017-12-02 23:00:00 UTC
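As an aside to the rolling forecasting origin note above: with more history, one could generate several train/test pairs instead of one. A minimal sketch of such a helper (hypothetical, not used in the rest of this report):
# sketch: k rolling origins, each with a 1-week test window and an 8-week train window
rolling_origins <- function(last_time, k = 3,
                            test_hours = 24 * 7, train_hours = 24 * 7 * 8) {
  map_df(seq_len(k) - 1, function(i) {
    test_end <- last_time - hours(i * test_hours)
    test_start <- test_end - hours(test_hours) + hours(1)
    train_end <- test_start - hours(1)
    tibble(
      fold = i + 1,
      train_start = train_end - hours(train_hours) + hours(1),
      train_end = train_end,
      test_start = test_start,
      test_end = test_end
    )
  })
}
rolling_origins(max(subarea_train$datetime))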
Here's the data after the splitting process, coloured by train and test sample:
# plot the train and test
p <- subarea_train %>%
mutate(sample = case_when(
datetime %within% intrain ~ "train",
datetime %within% intest ~ "test"
)) %>%
drop_na() %>%
mutate(sample = factor(sample, levels = c("train", "test"))) %>%
ggplot(aes(x = datetime, y = demand, colour = sample)) +
geom_line() +
labs(x = NULL, y = NULL, colour = NULL) +
facet_wrap(~ src_sub_area, scale = "free", ncol = 1) +
tidyquant::theme_tq() +
tidyquant::scale_colour_tq()
ggplotly(p)
Data preprocessing is a crucial step in time series model fitting. In this Capstone project, we will use the recipes package for data preprocessing and for comparing multiple preprocessing specifications.
Since the recipes package works columnwise, we need to convert our data into a wide format first:
# convert to wide format
subarea_wide <- subarea_train %>%
spread(src_sub_area, demand)
subarea_wide
## # A tibble: 1,512 x 4
## datetime sxk97 sxk9e sxk9s
## <dttm> <dbl> <dbl> <dbl>
## 1 2017-10-01 00:00:00 6 8 10
## 2 2017-10-01 01:00:00 4 2 3
## 3 2017-10-01 02:00:00 9 3 1
## 4 2017-10-01 03:00:00 2 2 0
## 5 2017-10-01 04:00:00 5 1 0
## 6 2017-10-01 05:00:00 4 2 0
## 7 2017-10-01 06:00:00 1 1 0
## 8 2017-10-01 07:00:00 0 0 1
## 9 2017-10-01 08:00:00 3 2 5
## 10 2017-10-01 09:00:00 3 11 5
## # ... with 1,502 more rows
Then we can define the preprocessing recipe() and bake() our data based on the defined recipe:
# recipe is a set of steps, obj: make the data additive
# we only do this with data train
rec <- recipe(~ ., filter(subarea_wide, datetime %within% intrain)) %>%
step_sqrt(all_numeric()) %>%
step_center(all_numeric()) %>%
step_scale(all_numeric()) %>%
prep()
# preview the bake results
subarea_wide <- bake(rec, subarea_wide)
subarea_wide
## # A tibble: 1,512 x 4
## datetime sxk97 sxk9e sxk9s
## <dttm> <dbl> <dbl> <dbl>
## 1 2017-10-01 00:00:00 -0.594 -0.697 -0.111
## 2 2017-10-01 01:00:00 -0.841 -1.28 -0.780
## 3 2017-10-01 02:00:00 -0.291 -1.15 -1.12
## 4 2017-10-01 03:00:00 -1.16 -1.28 -1.59
## 5 2017-10-01 04:00:00 -0.711 -1.44 -1.59
## 6 2017-10-01 05:00:00 -0.841 -1.28 -1.59
## 7 2017-10-01 06:00:00 -1.39 -1.44 -1.59
## 8 2017-10-01 07:00:00 -1.94 -1.85 -1.12
## 9 2017-10-01 08:00:00 -0.989 -1.28 -0.544
## 10 2017-10-01 09:00:00 -0.989 -0.497 -0.544
## # ... with 1,502 more rows
Since we used the recipes package, the next step is to create a revert-back function:
# revert back function
rec_revert <- function(vector, rec, varname) {
# store recipe values
rec_center <- rec$steps[[2]]$means[varname]
rec_scale <- rec$steps[[3]]$sds[varname]
# convert back based on the recipe
results <- (vector * rec_scale + rec_center) ^ 2
# add additional adjustment if necessary
results <- round(results)
# return the results
results
}
This revert-back function converts the data back to its original scale: it undoes the centering and scaling, squares the result to undo the square root, and rounds to whole demand counts.
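As a quick sanity check (illustrative), reverting the first few transformed values of one sub-area should reproduce the original demand counts:
# illustrative check: revert the first five transformed values of sxk97
rec_revert(subarea_wide$sxk97[1:5], rec, "sxk97")
# expected: the original counts 6, 4, 9, 2, 5 (see the pre-bake preview above)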
Now we can convert our data into the long format again:
# convert back to long format
subarea_long <- subarea_wide %>%
gather(src_sub_area, demand, -datetime)
subarea_long
## # A tibble: 4,536 x 3
## datetime src_sub_area demand
## <dttm> <chr> <dbl>
## 1 2017-10-01 00:00:00 sxk97 -0.594
## 2 2017-10-01 01:00:00 sxk97 -0.841
## 3 2017-10-01 02:00:00 sxk97 -0.291
## 4 2017-10-01 03:00:00 sxk97 -1.16
## 5 2017-10-01 04:00:00 sxk97 -0.711
## 6 2017-10-01 05:00:00 sxk97 -0.841
## 7 2017-10-01 06:00:00 sxk97 -1.39
## 8 2017-10-01 07:00:00 sxk97 -1.94
## 9 2017-10-01 08:00:00 sxk97 -0.989
## 10 2017-10-01 09:00:00 sxk97 -0.989
## # ... with 4,526 more rows
For functional programming with purrr, we need to convert our tbl into a nested tbl. Nested data is like a table inside a table, controlled through key columns; in other words, we can have a tbl for each sub-area and sample. Using this format, the fitting and forecasting process becomes very versatile, and we can still unnest the results as long as we keep a proper key such as src_sub_area.
Let’s start by converting our tbl into a nested tbl. First, we need to add sample indicator so it could be recognized as a key when we nest() the data:
# adjust by sample
subarea_nest <- subarea_long %>%
mutate(sample = case_when(
datetime %within% intrain ~ "train",
datetime %within% intest ~ "test"
)) %>%
drop_na()
# nest the data train
subarea_train_nest <- subarea_nest %>%
group_by(src_sub_area, sample) %>%
nest(.key = "data") %>%
spread(sample, data)
subarea_train_nest
## # A tibble: 3 x 3
## src_sub_area test train
## <chr> <list> <list>
## 1 sxk97 <tibble [168 x 2]> <tibble [1,344 x 2]>
## 2 sxk9e <tibble [168 x 2]> <tibble [1,344 x 2]>
## 3 sxk9s <tibble [168 x 2]> <tibble [1,344 x 2]>
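We nest the submission data in the same way, so we can attach forecasts to it later. Below is a hypothetical reconstruction of that step; submit_nest is the name used again in the final forecast section, and data_submission is an assumed name for the evaluation data loaded earlier:
# nest the submission data (reconstruction; output shown below)
submit_nest <- data_submission %>%
  group_by(src_sub_area, demand) %>%
  nest(.key = "data") %>%
  ungroup()
submit_nest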
## # A tibble: 3 x 3
## src_sub_area demand data
## <chr> <lgl> <list>
## 1 sxk97 NA <tibble [168 x 1]>
## 2 sxk9e NA <tibble [168 x 1]>
## 3 sxk9s NA <tibble [168 x 1]>
For the data and model combinations, we can start by defining some options for the data representation. Recall that our series have a relatively high frequency, so we consider two options of data representation here:
- a ts object with daily seasonality (frequency 24), and
- an msts object with daily and weekly seasonality (24 and 24 * 7).
To incorporate them into our nested data, we will create another nested data frame containing a conversion function for each representation, then apply those functions to convert each series into the specified object. In this way, we can compare multiple seasonality specifications.
Let’s start with making a named list containing the transformation functions:
# function: data funs list
data_funs <- list(
ts = function(x) ts(x$demand, frequency = 24),
msts = function(x) msts(x$demand, seasonal.periods = c(24, 24 * 7))
)
data_funs
## $ts
## function (x)
## ts(x$demand, frequency = 24)
##
## $msts
## function (x)
## msts(x$demand, seasonal.periods = c(24, 24 * 7))
Then we convert the list into a tbl using enframe().
Note that we should also attach a key - src_sub_area in our case - so we can use left_join() later.
We replicate the function list for each sub-area using rep():
# convert to nested
data_funs <- data_funs %>%
rep(length(unique(subarea_nest$src_sub_area))) %>%
enframe("data_fun_name", "data_fun") %>%
mutate(src_sub_area =
sort(rep(unique(subarea_nest$src_sub_area), length(unique(.$data_fun_name))))
)
data_funs
## # A tibble: 6 x 3
## data_fun_name data_fun src_sub_area
## <chr> <list> <chr>
## 1 ts <fn> sxk97
## 2 msts <fn> sxk97
## 3 ts <fn> sxk9e
## 4 msts <fn> sxk9e
## 5 ts <fn> sxk9s
## 6 msts <fn> sxk9s
Then the last step here is to join the nested functions with our nested data for model fitting:
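The join itself is presumably done as follows (a reconstruction; subarea_nest_fit is the name used in the model fitting step below):
# join the nested train/test samples with the data representation functions
subarea_nest_fit <- subarea_train_nest %>%
  left_join(data_funs)
subarea_nest_fit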
## Joining, by = "src_sub_area"
## # A tibble: 6 x 5
## src_sub_area test train data_fun_name data_fun
## <chr> <list> <list> <chr> <list>
## 1 sxk97 <tibble [168 x 2]> <tibble [1,344 x ~ ts <fn>
## 2 sxk97 <tibble [168 x 2]> <tibble [1,344 x ~ msts <fn>
## 3 sxk9e <tibble [168 x 2]> <tibble [1,344 x ~ ts <fn>
## 4 sxk9e <tibble [168 x 2]> <tibble [1,344 x ~ msts <fn>
## 5 sxk9s <tibble [168 x 2]> <tibble [1,344 x ~ ts <fn>
## 6 sxk9s <tibble [168 x 2]> <tibble [1,344 x ~ msts <fn>
Similar to when we created the data representation (daily and weekly) list, we also store some time series models as a nested list.
Again, let's start by making a list of models; we would like to compare multiple model specifications. For this Capstone project, we will use auto.arima(), ets(), stlm(), and tbats(). We wrap each of these models in a function and store them inside a list:
# time series models list
models <- list(
auto.arima = function(x) auto.arima(x),
ets = function(x) ets(x),
stlm = function(x) stlm(x),
tbats = function(x) tbats(x, use.box.cox = FALSE)
)
models
## $auto.arima
## function (x)
## auto.arima(x)
##
## $ets
## function (x)
## ets(x)
##
## $stlm
## function (x)
## stlm(x)
##
## $tbats
## function (x)
## tbats(x, use.box.cox = FALSE)
Then we can convert it into a nested format as in the previous example:
# convert to nested
models <- models %>%
rep(length(unique(subarea_nest$src_sub_area))) %>%
enframe("model_name", "model") %>%
mutate(src_sub_area =
sort(rep(unique(subarea_nest$src_sub_area), length(unique(.$model_name))))
)
models
## # A tibble: 12 x 3
## model_name model src_sub_area
## <chr> <list> <chr>
## 1 auto.arima <fn> sxk97
## 2 ets <fn> sxk97
## 3 stlm <fn> sxk97
## 4 tbats <fn> sxk97
## 5 auto.arima <fn> sxk9e
## 6 ets <fn> sxk9e
## 7 stlm <fn> sxk9e
## 8 tbats <fn> sxk9e
## 9 auto.arima <fn> sxk9s
## 10 ets <fn> sxk9s
## 11 stlm <fn> sxk9s
## 12 tbats <fn> sxk9s
And finally, we can join the result into our nested data. We also apply a rule here: since ets() and auto.arima() are not suitable for time series with multiple seasonalities, we filter those combinations out:
# combine with models
subarea_nest_model <- subarea_nest_fit %>%
left_join(models) %>%
filter(
!(model_name == "ets" & data_fun_name == "msts"),
!(model_name == "auto.arima" & data_fun_name == "msts")
)
## Joining, by = "src_sub_area"
## # A tibble: 18 x 7
## src_sub_area test train data_fun_name data_fun model_name model
## <chr> <list> <list> <chr> <list> <chr> <lis>
## 1 sxk97 <tibble ~ <tibble ~ ts <fn> auto.arima <fn>
## 2 sxk97 <tibble ~ <tibble ~ ts <fn> ets <fn>
## 3 sxk97 <tibble ~ <tibble ~ ts <fn> stlm <fn>
## 4 sxk97 <tibble ~ <tibble ~ ts <fn> tbats <fn>
## 5 sxk97 <tibble ~ <tibble ~ msts <fn> stlm <fn>
## 6 sxk97 <tibble ~ <tibble ~ msts <fn> tbats <fn>
## 7 sxk9e <tibble ~ <tibble ~ ts <fn> auto.arima <fn>
## 8 sxk9e <tibble ~ <tibble ~ ts <fn> ets <fn>
## 9 sxk9e <tibble ~ <tibble ~ ts <fn> stlm <fn>
## 10 sxk9e <tibble ~ <tibble ~ ts <fn> tbats <fn>
## 11 sxk9e <tibble ~ <tibble ~ msts <fn> stlm <fn>
## 12 sxk9e <tibble ~ <tibble ~ msts <fn> tbats <fn>
## 13 sxk9s <tibble ~ <tibble ~ ts <fn> auto.arima <fn>
## 14 sxk9s <tibble ~ <tibble ~ ts <fn> ets <fn>
## 15 sxk9s <tibble ~ <tibble ~ ts <fn> stlm <fn>
## 16 sxk9s <tibble ~ <tibble ~ ts <fn> tbats <fn>
## 17 sxk9s <tibble ~ <tibble ~ msts <fn> stlm <fn>
## 18 sxk9s <tibble ~ <tibble ~ msts <fn> tbats <fn>
To execute the model fitting, we wrap the needed arguments into a list using the map() function, then call each function with its arguments using invoke_map().
We do this first for the data transformation, using the function stored in data_fun, then continue to fit the model with the same process using the function stored in model.
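As a toy illustration of invoke_map() (not part of the pipeline): each function in the first list is called with the matching argument list from the second:
# invoke_map(): call each function with its own list of arguments
invoke_map(list(sum, prod), list(list(1:3), list(1:3)))
# returns a list containing 6 (the sum) and 6 (the product)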
Note: the fitting takes some time, so we saved the result to an RDS file; re-running the report only needs readRDS() to get the result back quickly.
# invoke nested fitting
# subarea_nest_model <- subarea_nest_model %>%
# mutate(
# params = map(train, ~ list(x = .x)),
# data = invoke_map(data_fun, params),
# params = map(data, ~ list(x = .x)),
# fitted = invoke_map(model, params)
# # error = map_dbl(fitted, mae_vec(.x$demand, .x$estimate))
# ) %>%
# select(-data, -params)
#
# saveRDS(subarea_nest_model, "subarea_nest_model.RDS")
subarea_nest_model <- readRDS("subarea_nest_model.RDS")
subarea_nest_model
## # A tibble: 18 x 8
## src_sub_area test train data_fun_name data_fun model_name model fitted
## <chr> <list> <lis> <chr> <list> <chr> <lis> <list>
## 1 sxk97 <tibb~ <tib~ ts <fn> auto.arima <fn> <ARIM~
## 2 sxk97 <tibb~ <tib~ ts <fn> ets <fn> <ets>
## 3 sxk97 <tibb~ <tib~ ts <fn> stlm <fn> <stlm>
## 4 sxk97 <tibb~ <tib~ ts <fn> tbats <fn> <tbat~
## 5 sxk97 <tibb~ <tib~ msts <fn> stlm <fn> <stlm>
## 6 sxk97 <tibb~ <tib~ msts <fn> tbats <fn> <tbat~
## 7 sxk9e <tibb~ <tib~ ts <fn> auto.arima <fn> <ARIM~
## 8 sxk9e <tibb~ <tib~ ts <fn> ets <fn> <ets>
## 9 sxk9e <tibb~ <tib~ ts <fn> stlm <fn> <stlm>
## 10 sxk9e <tibb~ <tib~ ts <fn> tbats <fn> <tbat~
## 11 sxk9e <tibb~ <tib~ msts <fn> stlm <fn> <stlm>
## 12 sxk9e <tibb~ <tib~ msts <fn> tbats <fn> <tbat~
## 13 sxk9s <tibb~ <tib~ ts <fn> auto.arima <fn> <ARIM~
## 14 sxk9s <tibb~ <tib~ ts <fn> ets <fn> <ets>
## 15 sxk9s <tibb~ <tib~ ts <fn> stlm <fn> <stlm>
## 16 sxk9s <tibb~ <tib~ ts <fn> tbats <fn> <tbat~
## 17 sxk9s <tibb~ <tib~ msts <fn> stlm <fn> <stlm>
## 18 sxk9s <tibb~ <tib~ msts <fn> tbats <fn> <tbat~
# calculate test errors
subarea_error <- subarea_nest_model %>%
mutate(error =
map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
map2_dbl(test, ~ mae_vec(truth = .y$demand, estimate = .x$mean))
) %>%
arrange(src_sub_area, error) #arrange based on smallest error for each area
subarea_error
## # A tibble: 18 x 9
## src_sub_area test train data_fun_name data_fun model_name model fitted
## <chr> <lis> <lis> <chr> <list> <chr> <lis> <list>
## 1 sxk97 <tib~ <tib~ msts <fn> tbats <fn> <tbat~
## 2 sxk97 <tib~ <tib~ msts <fn> stlm <fn> <stlm>
## 3 sxk97 <tib~ <tib~ ts <fn> tbats <fn> <tbat~
## 4 sxk97 <tib~ <tib~ ts <fn> ets <fn> <ets>
## 5 sxk97 <tib~ <tib~ ts <fn> stlm <fn> <stlm>
## 6 sxk97 <tib~ <tib~ ts <fn> auto.arima <fn> <ARIM~
## 7 sxk9e <tib~ <tib~ msts <fn> tbats <fn> <tbat~
## 8 sxk9e <tib~ <tib~ msts <fn> stlm <fn> <stlm>
## 9 sxk9e <tib~ <tib~ ts <fn> ets <fn> <ets>
## 10 sxk9e <tib~ <tib~ ts <fn> tbats <fn> <tbat~
## 11 sxk9e <tib~ <tib~ ts <fn> stlm <fn> <stlm>
## 12 sxk9e <tib~ <tib~ ts <fn> auto.arima <fn> <ARIM~
## 13 sxk9s <tib~ <tib~ msts <fn> tbats <fn> <tbat~
## 14 sxk9s <tib~ <tib~ msts <fn> stlm <fn> <stlm>
## 15 sxk9s <tib~ <tib~ ts <fn> tbats <fn> <tbat~
## 16 sxk9s <tib~ <tib~ ts <fn> auto.arima <fn> <ARIM~
## 17 sxk9s <tib~ <tib~ ts <fn> stlm <fn> <stlm>
## 18 sxk9s <tib~ <tib~ ts <fn> ets <fn> <ets>
## # ... with 1 more variable: error <dbl>
# show the area, data fun name, model & error
subarea_error %>%
select(src_sub_area, ends_with("_name"), error)
## # A tibble: 18 x 4
## src_sub_area data_fun_name model_name error
## <chr> <chr> <chr> <dbl>
## 1 sxk97 msts tbats 0.488
## 2 sxk97 msts stlm 0.543
## 3 sxk97 ts tbats 0.573
## 4 sxk97 ts ets 0.574
## 5 sxk97 ts stlm 0.600
## 6 sxk97 ts auto.arima 0.702
## 7 sxk9e msts tbats 0.359
## 8 sxk9e msts stlm 0.366
## 9 sxk9e ts ets 0.419
## 10 sxk9e ts tbats 0.422
## 11 sxk9e ts stlm 0.438
## 12 sxk9e ts auto.arima 0.502
## 13 sxk9s msts tbats 0.421
## 14 sxk9s msts stlm 0.428
## 15 sxk9s ts tbats 0.431
## 16 sxk9s ts auto.arima 0.503
## 17 sxk9s ts stlm 0.540
## 18 sxk9s ts ets 0.550
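Note that these errors are on the transformed (square-root, centered, scaled) scale, since we computed the MAE on the recipe-baked series. For intuition, mae_vec() simply averages the absolute differences between truth and estimate; a toy illustration:
# toy illustration of yardstick::mae_vec()
mae_vec(truth = c(1, 2, 3), estimate = c(2, 2, 5))
# (|1 - 2| + |2 - 2| + |3 - 5|) / 3 = 1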
Get the best model by choosing the smallest error:
subarea_best_model <- subarea_error %>%
group_by(src_sub_area) %>%
arrange(error) %>%
slice(1) %>% #choose the smallest error & best model for each area directly
ungroup() %>%
select(src_sub_area, ends_with("_name"), error)
Then we do a left join back to subarea_error to get the whole information about each best model:
subarea_best_model <- subarea_best_model %>%
select(-error) %>%
left_join(subarea_error)%>%
select(-error)
## Joining, by = c("src_sub_area", "data_fun_name", "model_name")
## # A tibble: 3 x 8
## src_sub_area data_fun_name model_name test train data_fun model fitted
## <chr> <chr> <chr> <list> <list> <list> <lis> <list>
## 1 sxk97 msts tbats <tibb~ <tibb~ <fn> <fn> <tbat~
## 2 sxk9e msts tbats <tibb~ <tibb~ <fn> <fn> <tbat~
## 3 sxk9s msts tbats <tibb~ <tibb~ <fn> <fn> <tbat~
Besides measuring the error, we can also compare the forecast results to the real test series through graphical analysis. To do that, we need to make a tbl containing our forecasts on the test dataset, then do some spread-gather tricks to make a key that is unique for each combination of data representation and model, plus one for the actual series. Once we reach that format, we can conveniently unnest() the data into a proper long format.
Let’s start with creating the forecast:
subarea_test <- subarea_error %>%
mutate(
forecast =
map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
map2(test, ~ tibble(
datetime = .y$datetime,
demand = as.vector(.x$mean)
)),
key = paste(data_fun_name, model_name, sep = "-")
)
subarea_test
## # A tibble: 18 x 11
## src_sub_area test train data_fun_name data_fun model_name model fitted
## <chr> <lis> <lis> <chr> <list> <chr> <lis> <list>
## 1 sxk97 <tib~ <tib~ msts <fn> tbats <fn> <tbat~
## 2 sxk97 <tib~ <tib~ msts <fn> stlm <fn> <stlm>
## 3 sxk97 <tib~ <tib~ ts <fn> tbats <fn> <tbat~
## 4 sxk97 <tib~ <tib~ ts <fn> ets <fn> <ets>
## 5 sxk97 <tib~ <tib~ ts <fn> stlm <fn> <stlm>
## 6 sxk97 <tib~ <tib~ ts <fn> auto.arima <fn> <ARIM~
## 7 sxk9e <tib~ <tib~ msts <fn> tbats <fn> <tbat~
## 8 sxk9e <tib~ <tib~ msts <fn> stlm <fn> <stlm>
## 9 sxk9e <tib~ <tib~ ts <fn> ets <fn> <ets>
## 10 sxk9e <tib~ <tib~ ts <fn> tbats <fn> <tbat~
## 11 sxk9e <tib~ <tib~ ts <fn> stlm <fn> <stlm>
## 12 sxk9e <tib~ <tib~ ts <fn> auto.arima <fn> <ARIM~
## 13 sxk9s <tib~ <tib~ msts <fn> tbats <fn> <tbat~
## 14 sxk9s <tib~ <tib~ msts <fn> stlm <fn> <stlm>
## 15 sxk9s <tib~ <tib~ ts <fn> tbats <fn> <tbat~
## 16 sxk9s <tib~ <tib~ ts <fn> auto.arima <fn> <ARIM~
## 17 sxk9s <tib~ <tib~ ts <fn> stlm <fn> <stlm>
## 18 sxk9s <tib~ <tib~ ts <fn> ets <fn> <ets>
## # ... with 3 more variables: error <dbl>, forecast <list>, key <chr>
Then do some spread-gather to create a proper key:
subarea_test_key <- subarea_test %>%
select(src_sub_area, key, actual = test, forecast) %>%
spread(key, forecast) %>%
gather(key, value, -src_sub_area)
subarea_test_key
## # A tibble: 21 x 3
## src_sub_area key value
## <chr> <chr> <list>
## 1 sxk97 actual <tibble [168 x 2]>
## 2 sxk9e actual <tibble [168 x 2]>
## 3 sxk9s actual <tibble [168 x 2]>
## 4 sxk97 msts-stlm <tibble [168 x 2]>
## 5 sxk9e msts-stlm <tibble [168 x 2]>
## 6 sxk9s msts-stlm <tibble [168 x 2]>
## 7 sxk97 msts-tbats <tibble [168 x 2]>
## 8 sxk9e msts-tbats <tibble [168 x 2]>
## 9 sxk9s msts-tbats <tibble [168 x 2]>
## 10 sxk97 ts-auto.arima <tibble [168 x 2]>
## # ... with 11 more rows
Last but not least, unnest() the series data and apply the revert-back function:
subarea_test_unnest <- subarea_test_key %>%
unnest(value) %>%
mutate(demand = rec_revert(demand, rec, src_sub_area))
subarea_test_unnest
## # A tibble: 3,528 x 4
## src_sub_area key datetime demand
## <chr> <chr> <dttm> <dbl>
## 1 sxk97 actual 2017-11-26 00:00:00 38
## 2 sxk97 actual 2017-11-26 01:00:00 21
## 3 sxk97 actual 2017-11-26 02:00:00 10
## 4 sxk97 actual 2017-11-26 03:00:00 10
## 5 sxk97 actual 2017-11-26 04:00:00 5
## 6 sxk97 actual 2017-11-26 05:00:00 2
## 7 sxk97 actual 2017-11-26 06:00:00 0
## 8 sxk97 actual 2017-11-26 07:00:00 11
## 9 sxk97 actual 2017-11-26 08:00:00 4
## 10 sxk97 actual 2017-11-26 09:00:00 4
## # ... with 3,518 more rows
With the resulting tbl, we can compare the forecast and the actual data on the test sample graphically, as sketched below.
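A sketch of that comparison plot, assumed to follow the same plotting pattern used earlier in this report:
# plot actual vs forecast on the test sample
p_test <- subarea_test_unnest %>%
  ggplot(aes(x = datetime, y = demand, colour = key)) +
  geom_line() +
  labs(x = NULL, y = NULL, colour = NULL) +
  facet_wrap(~ src_sub_area, scale = "free", ncol = 1) +
  tidyquant::theme_tq() +
  tidyquant::scale_colour_tq()
ggplotly(p_test)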
AUTOMATE THE MODEL SELECTION
As you can see from the plot results, it is hard to decide which model to use based on graphical comparison alone. The most straightforward solution is to use the model with the least error.
Model selection itself is very simple: we only need some basic dplyr grammar to filter() the model with the lowest error:
# filter by lowest test error
subarea_min_error <- subarea_error %>%
select(-fitted) %>% # remove unused
group_by(src_sub_area) %>%
filter(error == min(error)) %>%
ungroup()
subarea_min_error #same selection as subarea_best_model
## # A tibble: 3 x 8
## src_sub_area test train data_fun_name data_fun model_name model error
## <chr> <list> <list> <chr> <list> <chr> <lis> <dbl>
## 1 sxk97 <tibbl~ <tibb~ msts <fn> tbats <fn> 0.488
## 2 sxk9e <tibbl~ <tibb~ msts <fn> tbats <fn> 0.359
## 3 sxk9s <tibbl~ <tibb~ msts <fn> tbats <fn> 0.421
Based on the previous process, we conclude that the automated best model is tbats for every sub-area. BATS and TBATS [1] are two interesting time series forecasting methods that are capable of modeling time series with multiple seasonalities.
The names are acronyms for key features of the models: Trigonometric seasonality, Box-Cox transformation, ARMA errors, Trend and Seasonal components.
The TBATS model takes its roots in exponential smoothing methods. Each seasonality is modeled by a trigonometric representation based on Fourier series. One major advantage of this approach is that it requires only 2 seed states regardless of the length of the period. Another advantage is the ability to model seasonal effects of non-integer lengths.
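As a quick illustration, printing one of the winning fitted objects shows the chosen TBATS specification, including the seasonal periods it kept from the msts representation (24 and 168):
# inspect the best fitted model for the first sub-area (illustrative)
subarea_best_model$fitted[[1]]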
PERFORM THE FINAL FORECAST
After we have the final model, we can finally proceed to the final forecast. We follow the same process as in model fitting, but this time we use the combined train and test data as our new "full data".
Now let's start by recombining the train and test datasets:
# recombine samples
dtfull <- subarea_min_error %>%
mutate(fulldata = map2(train, test, ~ bind_rows(.x, .y))) %>%
select(src_sub_area, fulldata, everything(), -train, -test)
dtfull
## # A tibble: 3 x 7
## src_sub_area fulldata data_fun_name data_fun model_name model error
## <chr> <list> <chr> <list> <chr> <lis> <dbl>
## 1 sxk97 <tibble [1,51~ msts <fn> tbats <fn> 0.488
## 2 sxk9e <tibble [1,51~ msts <fn> tbats <fn> 0.359
## 3 sxk9s <tibble [1,51~ msts <fn> tbats <fn> 0.421
Then we do the same nested fitting as in the previous example:
# dtfull_nest <- dtfull %>%
# mutate(
# params = map(fulldata, ~ list(x = .x)),
# data = invoke_map(data_fun, params),
# params = map(data, ~ list(x = .x)),
# fitted = invoke_map(model, params)
# ) %>%
# select(-data, -params)
#
# saveRDS(dtfull_nest, "dtfull_nest.RDS")
dtfull_nest <- readRDS("dtfull_nest.RDS")
dtfull_nest
## # A tibble: 3 x 8
## src_sub_area fulldata data_fun_name model_name data_fun model error
## <chr> <list> <chr> <chr> <list> <lis> <dbl>
## 1 sxk9e <tibble~ msts tbats <fn> <fn> 0.359
## 2 sxk9s <tibble~ msts tbats <fn> <fn> 0.421
## 3 sxk97 <tibble~ msts tbats <fn> <fn> 0.488
## # ... with 1 more variable: fitted <list>
Next, let’s make a tbl containing each of our forecast results, and convert our nested data to a proper long format:
# get forecast
dtfull_nest <- dtfull_nest %>%
mutate(forecast =
map(fitted, ~ forecast(.x, h = 24 * 7 * 4)) %>%
map2(fulldata, ~ tibble(
datetime = timetk::tk_make_future_timeseries(.y$datetime, 24 * 7 * 4),
demand = as.vector(.x$mean)
))
)
dtfull_nest
## # A tibble: 3 x 9
## src_sub_area fulldata data_fun_name model_name data_fun model error
## <chr> <list> <chr> <chr> <list> <lis> <dbl>
## 1 sxk9e <tibble~ msts tbats <fn> <fn> 0.359
## 2 sxk9s <tibble~ msts tbats <fn> <fn> 0.421
## 3 sxk97 <tibble~ msts tbats <fn> <fn> 0.488
## # ... with 2 more variables: fitted <list>, forecast <list>
Finally, we can unnest() the data to get the result:
# unnest actual and forecast
dtfull_unnest <- dtfull_nest %>%
select(src_sub_area, actual = fulldata, forecast) %>%
gather(key, value, -src_sub_area) %>%
unnest(value) %>%
mutate(demand = rec_revert(demand, rec, src_sub_area))
dtfull_unnest
## # A tibble: 6,552 x 4
## src_sub_area key datetime demand
## <chr> <chr> <dttm> <dbl>
## 1 sxk9e actual 2017-10-01 00:00:00 8
## 2 sxk9e actual 2017-10-01 01:00:00 2
## 3 sxk9e actual 2017-10-01 02:00:00 3
## 4 sxk9e actual 2017-10-01 03:00:00 2
## 5 sxk9e actual 2017-10-01 04:00:00 1
## 6 sxk9e actual 2017-10-01 05:00:00 2
## 7 sxk9e actual 2017-10-01 06:00:00 1
## 8 sxk9e actual 2017-10-01 07:00:00 0
## 9 sxk9e actual 2017-10-01 08:00:00 2
## 10 sxk9e actual 2017-10-01 09:00:00 11
## # ... with 6,542 more rows
We plot the forecast results for each sub-area in a proper long format; the combined actual and forecast data now spans:
## [1] "2017-10-01 00:00:00 UTC" "2017-12-30 23:00:00 UTC"
d <- dtfull_unnest %>%
ggplot(aes(x = datetime, y = demand, colour = key)) +
geom_line() +
labs(x = NULL, y = NULL, colour = NULL) +
facet_wrap(~ src_sub_area, scale = "free", ncol = 1) +
tidyquant::theme_tq() +
tidyquant::scale_colour_tq()
ggplotly(d)
We can also pull out the test-period slice from the unnested data:
data_test <- dtfull_unnest %>%
filter(datetime %within% intest) %>%
select(src_sub_area, datetime, demand)
data_test
## # A tibble: 504 x 3
## src_sub_area datetime demand
## <chr> <dttm> <dbl>
## 1 sxk9e 2017-11-26 00:00:00 21
## 2 sxk9e 2017-11-26 01:00:00 15
## 3 sxk9e 2017-11-26 02:00:00 12
## 4 sxk9e 2017-11-26 03:00:00 7
## 5 sxk9e 2017-11-26 04:00:00 0
## 6 sxk9e 2017-11-26 05:00:00 0
## 7 sxk9e 2017-11-26 06:00:00 1
## 8 sxk9e 2017-11-26 07:00:00 1
## 9 sxk9e 2017-11-26 08:00:00 14
## 10 sxk9e 2017-11-26 09:00:00 13
## # ... with 494 more rows
Mean Absolute Error (MAE) result for the test data:
mae_test_result <- subarea_error %>%
group_by(src_sub_area) %>%
select(src_sub_area, error) %>%
arrange(error) %>%
slice(1) %>%
ungroup()
# target: reach MAE < 11 for all sub-areas in (our own) test dataset
# we used tbats on the transformed scale, so the MAE scale here is a little bit different
mae_test_result %<>%
summarise(src_sub_area = "all sub-area",
error = mean(error)) %>%
bind_rows(mae_test_result, .)
mae_test_result
## # A tibble: 4 x 2
## src_sub_area error
## <chr> <dbl>
## 1 sxk97 0.488
## 2 sxk9e 0.359
## 3 sxk9s 0.421
## 4 all sub-area 0.423
In the plot above, the data covers the interval 2017-10-01 00:00:00 to 2017-12-30 23:00:00. For submission, we only need to prepare the 1-week forecast after the training data: 2017-12-03 00:00:00 to 2017-12-09 23:00:00.
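For reference, the template printed below is just the evaluation data loaded at the beginning; assuming it was stored as data_submission (a hypothetical name), the preview and its datetime range come from:
# preview the submission template and its datetime range (data_submission is an assumed name)
data_submission
range(data_submission$datetime)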
## # A tibble: 504 x 3
## src_sub_area datetime demand
## <chr> <dttm> <lgl>
## 1 sxk97 2017-12-03 00:00:00 NA
## 2 sxk97 2017-12-03 01:00:00 NA
## 3 sxk97 2017-12-03 02:00:00 NA
## 4 sxk97 2017-12-03 03:00:00 NA
## 5 sxk97 2017-12-03 04:00:00 NA
## 6 sxk97 2017-12-03 05:00:00 NA
## 7 sxk97 2017-12-03 06:00:00 NA
## 8 sxk97 2017-12-03 07:00:00 NA
## 9 sxk97 2017-12-03 08:00:00 NA
## 10 sxk97 2017-12-03 09:00:00 NA
## # ... with 494 more rows
## [1] "2017-12-03 00:00:00 UTC" "2017-12-09 23:00:00 UTC"
Forecast the submission data using our best models:
submission <- subarea_best_model %>% #use the best model
mutate(
forecast =
map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
map2(submit_nest$data, ~ tibble( #map into our data submission nest
datetime = .y$datetime,
demand = as.vector(.x$mean)
)),
key = paste(data_fun_name, model_name, sep = "-")
)
Select only the required columns, then unnest and revert the demand back to its original scale:
submission <- submission %>%
select(src_sub_area, forecast) %>%
unnest(forecast) %>%
mutate(demand = rec_revert(demand, rec, src_sub_area))
submission
## # A tibble: 504 x 3
## src_sub_area datetime demand
## <chr> <dttm> <dbl>
## 1 sxk97 2017-12-03 00:00:00 18
## 2 sxk97 2017-12-03 01:00:00 15
## 3 sxk97 2017-12-03 02:00:00 11
## 4 sxk97 2017-12-03 03:00:00 7
## 5 sxk97 2017-12-03 04:00:00 5
## 6 sxk97 2017-12-03 05:00:00 3
## 7 sxk97 2017-12-03 06:00:00 2
## 8 sxk97 2017-12-03 07:00:00 6
## 9 sxk97 2017-12-03 08:00:00 12
## 10 sxk97 2017-12-03 09:00:00 12
## # ... with 494 more rows
Checking the MAE result for the submission data:
Because we don't have the actual (truth) data for our submission, we check the MAE on the Algoritma shiny leaderboard: https://algoritma.shinyapps.io/leaderboard_capsml/
And the result:
Submission’s Model Performance
Plot the submission forecast only:
Key takeaways:
- pad() is one of the functions that helps a lot in completing the data.
- ets() and auto.arima() are not suitable for multiple-seasonality time series data.
- Evaluate models with an error measure such as MAE, not only with visual checking via plots.
Note:
Special thanks to @teamalgoritma, who helped a lot in learning, preparing, and completing the Machine Learning course - especially this Capstone Project.
(c) Meilinie - June 2019