Multiple Time Series Analysis on Online Transportation Dataset
Introduction
Have you ever wondered how online transportation companies decide their prices? Are prices spread equally across time, or are distinct areas priced differently?
If we follow basic economics, we would all agree that price is mainly influenced by order demand (besides driver availability). Higher demand means higher prices. And of course, demand is not the same across time and across different places.
In this case, we are provided with a real-time transaction dataset from a motorcycle ride-sharing service by the Algoritma Data Science team. With this dataset, we will help them solve some forecasting problems in order to improve their business processes, including the pricing system and driver availability.
It is almost the end of 2017, and we need to prepare a forecasting model to help the company get ready for end-of-year demand. Unfortunately, the company is not old enough to have last year's data for December, so we cannot look back at past demand to prepare a forecast for December.
This project aims to build an automated forecasting model for hourly demand, to be evaluated on the next 7 days (Sunday, December 3rd 2017 to Saturday, December 9th 2017).
Libraries Used
As this project is a time series case, we will use time series packages such as forecast, yardstick, and timetk. For data wrangling, we will use dplyr, purrr, and recipes.
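A sketch of the library calls this analysis assumes; lubridate, padr, tidyr, magrittr, ggplot2, and plotly are not named above but appear in later chunks:
library(dplyr)      # data wrangling
library(purrr)      # functional programming (map, invoke_map)
library(recipes)    # scaling and transformation pipeline
library(forecast)   # ts/msts objects and forecasting models
library(yardstick)  # error metrics (mae_vec)
library(timetk)     # future timestamp generation
library(lubridate)  # date-time helpers (floor_date, interval)
library(padr)       # padding missing hours
library(tidyr)      # reshaping (spread, gather, pivot_*)
library(magrittr)   # the %<>% assignment pipe
library(ggplot2)    # plotting
library(plotly)     # interactive plots (ggplotly)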
Online Transportation Dataset
We will start by importing the train dataset:
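The import chunk itself is not shown; a minimal sketch, where the file path is only a placeholder and strings are read as factors to match the glimpse() output below:
# Placeholder path; point this at the actual training file
data <- read.csv("data/train.csv", stringsAsFactors = TRUE)
glimpse(data)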
## Observations: 90,113
## Variables: 16
## $ id <fct> 59d005e1ffcfa261708ce9cd, 59d0066affcfa261708ceb11…
## $ trip_id <fct> 59d005e9cb564761a8fe5d3e, NA, 59d006c131e39c618969…
## $ driver_id <fct> 59a892c5568be44b2734f276, NA, 599dc0dfa5b4fd5471ad…
## $ rider_id <fct> 59ad2d6efba75a581666b506, 59cd704bcf482f6ce2fadfdb…
## $ start_time <fct> 2017-10-01T00:00:17Z, 2017-10-01T00:02:34Z, 2017-1…
## $ src_lat <dbl> 41.07047, 41.07487, 41.04995, 41.05287, 41.06760, …
## $ src_lon <dbl> 29.01945, 28.99528, 29.03107, 28.99522, 28.98827, …
## $ src_area <fct> sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sx…
## $ src_sub_area <fct> sxk9s, sxk9e, sxk9s, sxk9e, sxk9e, sxk9s, sxk9s, s…
## $ dest_lat <dbl> 41.11716, 41.08351, 41.04495, 41.08140, 41.02125, …
## $ dest_lon <dbl> 29.03650, 29.00228, 28.98192, 28.98197, 29.11316, …
## $ dest_area <fct> sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sx…
## $ dest_sub_area <fct> sxk9u, sxk9e, sxk9e, sxk9e, sxk9q, sxk9e, sxk9e, s…
## $ distance <dbl> 5.379250, 1.126098, 4.169492, 3.358296, 11.693573,…
## $ status <fct> confirmed, nodrivers, confirmed, confirmed, nodriv…
## $ confirmed_time_sec <int> 8, 0, 32, 65, 0, 27, 32, 30, 0, 24, 9, 0, 118, 106…
The data contain 16 variables; however, as our case is a time series problem, we will only use two of them: the time (start_time) and the sub-area (src_sub_area). The dataset provides three different sub-areas.
## Observations: 90,113
## Variables: 2
## $ start_time <fct> 2017-10-01T00:00:17Z, 2017-10-01T00:02:34Z, 2017-10-01T0…
## $ src_sub_area <fct> sxk9s, sxk9e, sxk9s, sxk9e, sxk9e, sxk9s, sxk9s, sxk9s, …
## [1] sxk9s sxk9e sxk97
## Levels: sxk97 sxk9e sxk9s
For the time series case, we need to round the time down to the hour. We will use the floor_date() function from lubridate.
data$start_time <- as.POSIXct(data$start_time, format="%Y-%m-%dT%H:%M:%SZ")
data$start_time <- floor_date(data$start_time, unit="hour")
Next, we need to count how many orders (demand) occurred in each specific hour and sub-area.
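The counting chunk itself is not shown; a plausible version with dplyr, where the resulting column name demand matches the output shown after padding:
# Count the number of orders per sub-area and hour
data <- data %>%
  group_by(src_sub_area, start_time) %>%
  summarise(demand = n()) %>%
  ungroup()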
Unfortunately, there are missing hours in the dataset, which means there was no demand in those specific hours. As we cannot have missing timestamps for time series modeling, we need to pad the series and replace the demand at those missing times with zero.
min_date <- min(data$start_time)
start_val <- make_datetime(year = year(min_date), month=month(min_date), day=day(min_date), hour = 0)
max_date <- max(data$start_time)
end_val <- make_datetime(year = year(max_date), month=month(max_date), day=day(max_date), hour = 23)
data %<>%
group_by(src_sub_area) %>%
padr::pad(start_val = start_val, end_val = end_val) %>%
ungroup() %>%
distinct()
## Warning: coercing time zone from UTC to
## Warning: coercing time zone from UTC to
## pad applied on the interval: hour
## # A tibble: 6 x 3
## # Groups: src_sub_area [1]
## start_time src_sub_area demand
## <dttm> <fct> <dbl>
## 1 2017-10-01 00:00:00 sxk97 6
## 2 2017-10-01 01:00:00 sxk97 4
## 3 2017-10-01 02:00:00 sxk97 9
## 4 2017-10-01 03:00:00 sxk97 2
## 5 2017-10-01 04:00:00 sxk97 5
## 6 2017-10-01 05:00:00 sxk97 4
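The rows created by padding come back with NA demand; the zero replacement described above is not shown in the chunk, but a minimal sketch with tidyr would be:
# Replace the NA demand introduced by padding with zero
data %<>%
  replace_na(list(demand = 0))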
Exploratory Data Analysis
Demand By Sub-Area
We would like to see the demand in each sub-area over the observation period:
ggplotly(ggplot(data, aes(x = start_time, y = demand)) +
geom_line(aes(col = src_sub_area)) +
labs(y = "Order Request", x = NULL, title = "Order Demand") +
facet_wrap(~ src_sub_area, scale = "free_y", ncol = 1) +
theme_bw() +
theme(legend.position = "none", plot.title = element_text(hjust = 0.5)))
We can see that there is a seasonal pattern, although the trend is not clear yet. To investigate further, we will make daily and weekly polar (season) plots:
sxk97 <- data %>% filter(src_sub_area == "sxk97") %>% .$demand
sxk97_daily <- ggseasonplot(ts(sxk97,frequency = 24),polar = T) +
theme_bw() +
theme(legend.position = "none",plot.title = element_text(hjust = 0.5)) +
labs(title = "SXK97 Daily",x = "Hour") +
scale_y_sqrt()
sxk97_w <- data %>% filter(src_sub_area == "sxk97") %>%
mutate(date = format(start_time,"%Y/%m/%d")) %>%
dplyr::group_by(date) %>%
mutate(demand = sum(demand)) %>%
dplyr::select(c(date,demand)) %>%
distinct()
sxk97_weekly <- ggseasonplot(ts(sxk97_w$demand, frequency = 7),polar = T) +
theme_bw() +
theme(legend.position = "none",plot.title = element_text(hjust = 0.5)) +
labs(title = "SXK97 Weekly",x = "Day") +
scale_y_sqrt()
sxk9e <- data %>% filter(src_sub_area == "sxk9e") %>% .$demand
sxk9e_daily <- ggseasonplot(ts(sxk9e,frequency = 24),polar = T) +
theme_bw() +
theme(legend.position = "none",plot.title = element_text(hjust = 0.5)) +
labs(title = "SXK9E Daily",x = "Hour") +
scale_y_sqrt()
sxk9e_w <- data %>% filter(src_sub_area == "sxk9e") %>%
mutate(date = format(start_time,"%Y/%m/%d")) %>%
dplyr::group_by(date) %>%
mutate(demand = sum(demand)) %>%
dplyr::select(c(date,demand)) %>%
distinct()
sxk9e_weekly <- ggseasonplot(ts(sxk9e_w$demand,frequency = 7),polar = T) +
theme_bw() +
theme(legend.position = "none",plot.title = element_text(hjust = 0.5)) +
labs(title = "SXK9E Weekly",x = "Day") +
scale_y_sqrt()
sxk9s <- data %>% filter(src_sub_area == "sxk9s") %>% .$demand
sxk9s_daily <- ggseasonplot(ts(sxk9s,frequency = 24),polar = T) +
theme_bw() +
theme(legend.position = "none",plot.title = element_text(hjust = 0.5)) +
labs(title = "SXK9S Daily",x = "Hour") +
scale_y_sqrt()
sxk9s_w <- data %>% filter(src_sub_area == "sxk9s") %>%
mutate(date = format(start_time,"%Y/%m/%d")) %>%
dplyr::group_by(date) %>%
mutate(demand = sum(demand)) %>%
dplyr::select(c(date,demand)) %>%
distinct()
sxk9s_weekly <- ggseasonplot(ts(sxk9s_w$demand,frequency = 7),polar = T) +
theme_bw() +
theme(legend.position = "none",plot.title = element_text(hjust = 0.5)) +
labs(title = "SXK9S Weekly",x = "Day") +
scale_y_sqrt()
Generally, in all three sub-areas, demand peaks at 6 to 7 PM and is lowest at 5 to 6 AM. Weekly, demand is high on Friday and Saturday and at its lowest on Sunday.
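The polar plot objects above are assigned but never printed. One plausible way to display them together is gridExtra (an assumption; the package is not used elsewhere in this post):
# Arrange the daily and weekly polar plots for each sub-area in a grid
gridExtra::grid.arrange(sxk97_daily, sxk97_weekly,
                        sxk9e_daily, sxk9e_weekly,
                        sxk9s_daily, sxk9s_weekly,
                        ncol = 2)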
After looking at the patterns, we will use the daily and weekly seasonality for time series modeling. A monthly pattern would be impossible to use because there is not enough data (the series covers only about two months).
Decomposition
We would like to decompose the data to inspect the seasonal, trend, and remainder components, first using ts objects with daily and then weekly frequency:
daily <- data %>% filter(src_sub_area == "sxk97") %>% .$demand %>% ts(frequency = 24)
autoplot(decompose(daily)) + labs(title = "Decomposition on Daily Basis") + theme(legend.position = "none",plot.title = element_text(hjust = 0.5))
weekly <- data %>% filter(src_sub_area == "sxk97") %>% .$demand %>% ts(frequency = 24*7)
autoplot(decompose(weekly)) + labs(title = "Decomposition on Weekly Basis") + theme(legend.position = "none",plot.title = element_text(hjust = 0.5))
Unfortunately, the trend that results from the decomposition is not smooth enough, which might be caused by uncaptured extra seasonality, so the data can be considered multi-seasonal. Therefore, we try another option and create a multiple-seasonality time series object, msts, with daily and weekly seasonality:
daily_weekly <- data %>% filter(src_sub_area == "sxk97") %>% .$demand %>% msts(.,seasonal.periods = c(24,24*7))
autoplot(mstl(daily_weekly)) + labs(title = "Decomposition on Daily and Weekly Basis") + theme(legend.position = "none",plot.title = element_text(hjust = 0.5))
This time, the trend is smoother, which indicates that we have captured the seasonality patterns correctly.
Data Preprocessing
Cross Validation
Before modeling, we have to separate our data into two sets: train and test. The test dataset will be the last week of the data, while the remainder is our train dataset.
# Getting the test size
test.size <- 24*7
test.end <- max(data$start_time)
test.start <- test.end - hours(test.size) + hours(1)
train.end <- test.start - hours(1)
train.start <- min(data$start_time)
Then, we label each start_time as belonging to the train or test dataset in a new data_sample column:
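The interval objects intrain and intest used in the next chunk are not defined in the visible code; a sketch of how they are presumably built with lubridate::interval():
# Assumed construction of the train and test intervals used with %within%
intrain <- interval(train.start, train.end)
intest  <- interval(test.start, test.end)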
data %<>%
mutate(data_sample = case_when(
start_time %within% intrain ~ "train",
start_time %within% intest ~ "test")) %>%
drop_na() %>%
mutate(data_sample = factor(data_sample, levels = c("train", "test")))
head(data)
## # A tibble: 6 x 4
## # Groups: src_sub_area [3]
## start_time src_sub_area demand data_sample
## <dttm> <fct> <dbl> <fct>
## 1 2017-10-01 00:00:00 sxk97 6 train
## 2 2017-10-01 01:00:00 sxk97 4 train
## 3 2017-10-01 02:00:00 sxk97 9 train
## 4 2017-10-01 03:00:00 sxk97 2 train
## 5 2017-10-01 04:00:00 sxk97 5 train
## 6 2017-10-01 05:00:00 sxk97 4 train
Data Scaling
To prevent outliers from having a big influence on our model, we will scale the data using the recipes package. Since recipes works column-wise, we need to reshape our data into wide format so that each sub-area gets its own column:
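The reshaping chunk is not shown; a plausible version with tidyr that produces the wide table below:
# Spread each sub-area into its own column (wide format)
data %<>%
  pivot_wider(names_from = src_sub_area, values_from = demand)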
## # A tibble: 6 x 5
## start_time data_sample sxk97 sxk9e sxk9s
## <dttm> <fct> <dbl> <dbl> <dbl>
## 1 2017-10-01 00:00:00 train 6 8 10
## 2 2017-10-01 01:00:00 train 4 2 3
## 3 2017-10-01 02:00:00 train 9 3 1
## 4 2017-10-01 03:00:00 train 2 2 0
## 5 2017-10-01 04:00:00 train 5 1 0
## 6 2017-10-01 05:00:00 train 4 2 0
Besides centering and scaling, we also apply a square root transformation:
recipe <- recipe(~., filter(data, start_time %within% intrain)) %>%
step_sqrt(all_numeric()) %>%
step_center(all_numeric()) %>%
step_scale(all_numeric()) %>%
prep()
data <- bake(recipe, data)
Then, we change the data back into long format:
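Again the reshaping chunk is not shown; a plausible reverse step:
# Gather the sub-area columns back into long format
data %<>%
  pivot_longer(cols = c(sxk97, sxk9e, sxk9s),
               names_to = "src_sub_area", values_to = "demand")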
## # A tibble: 6 x 4
## start_time data_sample src_sub_area demand
## <dttm> <fct> <chr> <dbl>
## 1 2017-10-01 00:00:00 train sxk97 -0.594
## 2 2017-10-01 01:00:00 train sxk97 -0.841
## 3 2017-10-01 02:00:00 train sxk97 -0.291
## 4 2017-10-01 03:00:00 train sxk97 -1.16
## 5 2017-10-01 04:00:00 train sxk97 -0.711
## 6 2017-10-01 05:00:00 train sxk97 -0.841
To bring the data back to its original scale (before these transformations), we create a revert-back function that will be used later after modeling.
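The rec_revert() helper used in the rest of the post is never shown. A minimal sketch of what it might look like, under the assumption that the prepped recipe keeps the centering means in its second step and the scaling standard deviations in its third step (an internal detail of the recipes package), and that results are rounded because demand is a count:
# Sketch only: revert scaling, centering, and the square-root transform.
# rec$steps[[2]]$means and rec$steps[[3]]$sds are assumed to hold the values
# learned by step_center() and step_scale() for each sub-area column.
rec_revert <- function(vector, rec, varname) {
  rec_center <- rec$steps[[2]]$means[varname]
  rec_scale  <- rec$steps[[3]]$sds[varname]
  results <- (vector * rec_scale + rec_center)^2
  round(results)  # demand is a count, so round to whole numbers
}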
Nested Model Fitting
As we will use functional programming with purrr later, we have to convert our data into a nested tbl, which creates tables inside our table. We nest() the start_time and demand columns for each combination of src_sub_area and data_sample, then spread the data_sample values into train and test columns.
data %<>%
group_by(src_sub_area, data_sample) %>%
nest(data = c(start_time, demand)) %>%
pivot_wider(names_from = data_sample, values_from = data)
head(data)
## # A tibble: 3 x 3
## # Groups: src_sub_area [3]
## src_sub_area train test
## <chr> <list<df[,2]>> <list<df[,2]>>
## 1 sxk97 [1,344 × 2] [168 × 2]
## 2 sxk9e [1,344 × 2] [168 × 2]
## 3 sxk9s [1,344 × 2] [168 × 2]
Data Model List
As we saw before, there are two options for representing the data: a ts object with daily seasonality and an msts object with daily and weekly seasonality. To apply them to our data, we make a data frame containing the function name and the function itself, then combine it with our data using dplyr.
data_funs <- list(
ts = function(x) ts(x$demand, frequency = 24),
msts = function(x) msts(x$demand, seasonal.periods = c(24, 24 * 7))
)
data_funs %<>%
rep(length(unique(data$src_sub_area))) %>%
enframe("data_fun_name", "data_fun") %>%
mutate(src_sub_area =
sort(rep(unique(data$src_sub_area), length(unique(.$data_fun_name))))
)
data_funs
## # A tibble: 6 x 3
## data_fun_name data_fun src_sub_area
## <chr> <list> <chr>
## 1 ts <fn> sxk97
## 2 msts <fn> sxk97
## 3 ts <fn> sxk9e
## 4 msts <fn> sxk9e
## 5 ts <fn> sxk9s
## 6 msts <fn> sxk9s
Using dplyr, we combine them into one data frame:
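The join itself is not shown; given the "Joining, by = src_sub_area" message below, it is presumably a plain left_join:
# Attach the data-representation functions to each sub-area
data %<>%
  left_join(data_funs)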
## Joining, by = "src_sub_area"
## # A tibble: 6 x 5
## # Groups: src_sub_area [3]
## src_sub_area train test data_fun_name data_fun
## <chr> <list<df[,2]>> <list<df[,2]>> <chr> <list>
## 1 sxk97 [1,344 × 2] [168 × 2] ts <fn>
## 2 sxk97 [1,344 × 2] [168 × 2] msts <fn>
## 3 sxk9e [1,344 × 2] [168 × 2] ts <fn>
## 4 sxk9e [1,344 × 2] [168 × 2] msts <fn>
## 5 sxk9s [1,344 × 2] [168 × 2] ts <fn>
## 6 sxk9s [1,344 × 2] [168 × 2] msts <fn>
Time Series Model List
Just like when we created the data-representation list, we can make a list of time series models as a nested list. We use stlm(), tbats(), and HoltWinters(); ets() and auto.arima() are included only for the single-seasonality ts data, since they are not suitable for multiple-seasonality time series. dshw() could not be used because there are zero values in our data.
models <- list(
stlm = function(x) stlm(x),
tbats = function(x) tbats(x, use.box.cox = FALSE,
use.trend = TRUE,
use.damped.trend = TRUE,
use.parallel = FALSE),
holt.winter = function(x) HoltWinters(x,seasonal = "additive"),
auto.arima = function(x) auto.arima(x),
ets = function(x) ets(x)
)
models %<>%
rep(length(unique(data$src_sub_area))) %>%
enframe("model_name", "model") %>%
mutate(src_sub_area =
sort(rep(unique(data$src_sub_area), length(unique(.$model_name))))
)
models
## # A tibble: 15 x 3
## model_name model src_sub_area
## <chr> <list> <chr>
## 1 stlm <fn> sxk97
## 2 tbats <fn> sxk97
## 3 holt.winter <fn> sxk97
## 4 auto.arima <fn> sxk97
## 5 ets <fn> sxk97
## 6 stlm <fn> sxk9e
## 7 tbats <fn> sxk9e
## 8 holt.winter <fn> sxk9e
## 9 auto.arima <fn> sxk9e
## 10 ets <fn> sxk9e
## 11 stlm <fn> sxk9s
## 12 tbats <fn> sxk9s
## 13 holt.winter <fn> sxk9s
## 14 auto.arima <fn> sxk9s
## 15 ets <fn> sxk9s
Then we combine the models with our data, dropping the combinations that are not applicable (ets and auto.arima on msts):
data %<>%
left_join(models) %>%
filter(!(model_name == "ets" & data_fun_name == "msts"),
!(model_name == "auto.arima" & data_fun_name == "msts"))
## Joining, by = "src_sub_area"
## # A tibble: 6 x 7
## # Groups: src_sub_area [1]
## src_sub_area train test data_fun_name data_fun model_name model
## <chr> <list<df[,2]> <list<df[,> <chr> <list> <chr> <lis>
## 1 sxk97 [1,344 × 2] [168 × 2] ts <fn> stlm <fn>
## 2 sxk97 [1,344 × 2] [168 × 2] ts <fn> tbats <fn>
## 3 sxk97 [1,344 × 2] [168 × 2] ts <fn> holt.wint… <fn>
## 4 sxk97 [1,344 × 2] [168 × 2] ts <fn> auto.arima <fn>
## 5 sxk97 [1,344 × 2] [168 × 2] ts <fn> ets <fn>
## 6 sxk97 [1,344 × 2] [168 × 2] msts <fn> stlm <fn>
Model
Finally, we can execute the model fitting using the map() and invoke_map() functions from the purrr package. First, we build a parameter list with map(), then we call the functions stored in data_fun and model with invoke_map().
# The fitting itself is commented out and its cached result is reloaded from RDS:
# data %<>%
#   mutate(
#     params = map(train, ~ list(x = .x)),
#     data   = invoke_map(data_fun, params),
#     params = map(data, ~ list(x = .x)),
#     fitted = invoke_map(model, params)
#   ) %>%
#   select(-data, -params)
# saveRDS(data, "data_model.RDS")
data <- readRDS("data_model.RDS")
head(data)
## # A tibble: 6 x 8
## # Groups: src_sub_area [1]
## src_sub_area train test data_fun_name data_fun model_name model
## <chr> <list<df[,> <list<df> <chr> <list> <chr> <lis>
## 1 sxk97 [1,344 × 2] [168 × 2] ts <fn> stlm <fn>
## 2 sxk97 [1,344 × 2] [168 × 2] ts <fn> tbats <fn>
## 3 sxk97 [1,344 × 2] [168 × 2] ts <fn> holt.wint… <fn>
## 4 sxk97 [1,344 × 2] [168 × 2] ts <fn> auto.arima <fn>
## 5 sxk97 [1,344 × 2] [168 × 2] ts <fn> ets <fn>
## 6 sxk97 [1,344 × 2] [168 × 2] msts <fn> stlm <fn>
## # … with 1 more variable: fitted <list>
Evaluation
After fitting the models, we need to measure the train and test errors. We apply forecast() over the test horizon, then measure the error using mae_vec() from the yardstick package.
data %<>%
mutate(MAE_error_test =
map(fitted, ~ forecast(.x, h = test.size)) %>%
map2_dbl(test, ~ mae_vec(truth = rec_revert(.y$demand,recipe,src_sub_area), estimate = rec_revert(.x$mean,recipe,src_sub_area)))) %>%
arrange(src_sub_area, MAE_error_test)
data %<>%
mutate(MAE_error_train =
map(fitted, ~ forecast(.x, h = test.size)) %>%
map2_dbl(train, ~ mae_vec(truth = rec_revert(.y$demand,recipe,src_sub_area), estimate = rec_revert(.x$fitted,recipe,src_sub_area)))) %>%
arrange(src_sub_area, MAE_error_train)
## # A tibble: 24 x 5
## # Groups: src_sub_area [3]
## src_sub_area data_fun_name model_name MAE_error_test MAE_error_train
## <chr> <chr> <chr> <dbl> <dbl>
## 1 sxk97 msts stlm 8.67 3.63
## 2 sxk97 ts stlm 9.34 4.67
## 3 sxk97 msts tbats 7.42 4.69
## 4 sxk97 ts tbats 9.02 4.91
## 5 sxk97 ts ets 9 5.05
## 6 sxk97 msts holt.winter 8.70 5.07
## 7 sxk97 ts holt.winter 9.12 5.54
## 8 sxk97 ts auto.arima 11.3 5.62
## 9 sxk9e msts stlm 10.1 5.00
## 10 sxk9e msts tbats 9.76 6.48
## # … with 14 more rows
Forecast and Actual Data Comparison
After getting the errors, we would like to compare the forecast results with the actual test data. First, we make a tbl containing our forecasts, then use spread() and gather() to put the actual and forecast results in one long table:
data_test <- data %>%
mutate(
forecast =
map(fitted, ~ forecast(.x, h = test.size)) %>%
map2(test, ~ tibble(
start_time = .y$start_time,
demand = as.vector(.x$mean)
)),
key = paste(data_fun_name, model_name, sep = "-")
)
data_test %<>%
select(src_sub_area, key, actual = test, forecast) %>%
spread(key, forecast) %>%
gather(key, value, -src_sub_area) %>%
unnest(value) %>%
mutate(demand = rec_revert(demand,recipe,src_sub_area))
## Warning: attributes are not identical across measure variables;
## they will be dropped
## # A tibble: 6 x 4
## # Groups: src_sub_area [1]
## src_sub_area key start_time demand
## <chr> <chr> <dttm> <dbl>
## 1 sxk97 actual 2017-11-26 00:00:00 38
## 2 sxk97 actual 2017-11-26 01:00:00 21
## 3 sxk97 actual 2017-11-26 02:00:00 10
## 4 sxk97 actual 2017-11-26 03:00:00 10
## 5 sxk97 actual 2017-11-26 04:00:00 5
## 6 sxk97 actual 2017-11-26 05:00:00 2
ggplotly(ggplot(data_test,aes(x = start_time, y = demand)) +
geom_line(data = data_test %>% filter(key == "actual"),aes(y = demand),alpha = 0.2,size = 0.8)+
geom_line(data = data_test %>% filter(key != "actual"),aes(frame = key,col = key)) +
labs(x = "", y = "Demand (orders)", title = "FORECAST VS ACTUAL", frame = "Models") +
facet_wrap(~ src_sub_area, scale = "free_y", ncol = 1) +
tidyquant::theme_tq() +
tidyquant::scale_colour_tq()+
theme(legend.position = "none",plot.title = element_text(hjust = 0.5)))
## Warning: Ignoring unknown aesthetics: frame
## Warning in p$x$data[firstFrame] <- p$x$frames[[1]]$data: number of items to
## replace is not a multiple of replacement length
Automated Model Selection
In this section, we would like to predict the unseen data, which is the next 168 hours after our test data. The unseen data range from Sunday, December 3rd 2017 to Saturday, December 9th 2017. However, we first need to choose the best model, so we make the model selection automatic. Since it would be hard to choose the best model from graphical analysis alone, we pick the model with the smallest test error:
data %<>%
select(-fitted) %>%
group_by(src_sub_area) %>%
filter(MAE_error_test == min(MAE_error_test)) %>%
ungroup()
data
## # A tibble: 3 x 9
## src_sub_area train test data_fun_name data_fun model_name model
## <chr> <list<df[,> <list<df> <chr> <list> <chr> <lis>
## 1 sxk97 [1,344 × 2] [168 × 2] msts <fn> tbats <fn>
## 2 sxk9e [1,344 × 2] [168 × 2] msts <fn> tbats <fn>
## 3 sxk9s [1,344 × 2] [168 × 2] msts <fn> tbats <fn>
## # … with 2 more variables: MAE_error_test <dbl>, MAE_error_train <dbl>
Unlike before, for the final forecast we will use all of the data, which means we have to combine the train and test datasets:
data %<>%
mutate(fulldata = map2(train, test, ~ bind_rows(.x, .y))) %>%
select(src_sub_area, fulldata, everything(), -train, -test)
head(data)
## # A tibble: 3 x 8
## src_sub_area fulldata data_fun_name data_fun model_name model MAE_error_test
## <chr> <list> <chr> <list> <chr> <lis> <dbl>
## 1 sxk97 <tibble… msts <fn> tbats <fn> 7.42
## 2 sxk9e <tibble… msts <fn> tbats <fn> 9.76
## 3 sxk9s <tibble… msts <fn> tbats <fn> 6.94
## # … with 1 more variable: MAE_error_train <dbl>
Then we redo the nested model fitting, this time on the full data:
# As before, the refitting is commented out and its cached result is reloaded:
# data %<>%
#   mutate(
#     params = map(fulldata, ~ list(x = .x)),
#     data   = invoke_map(data_fun, params),
#     params = map(data, ~ list(x = .x)),
#     fitted = invoke_map(model, params)
#   ) %>%
#   select(-data, -params)
# saveRDS(data, "data_bestmodel.RDS")
data <- readRDS("data_bestmodel.RDS")
data_for <- data %>%
mutate(forecast =
map(fitted, ~ forecast(.x, h = (24 * 7) + 7)) %>%
map2(fulldata, ~ tibble(
start_time = timetk::tk_make_future_timeseries(.y$start_time, (24 * 7) + 7),
demand = as.vector(.x$mean)
)
)
)
Then, we use unnest() to get the final forecast result:
data_for %<>%
select(src_sub_area, actual = fulldata, forecast) %>%
gather(key, value, -src_sub_area) %>%
unnest(value) %>%
mutate(demand = rec_revert(demand,recipe,src_sub_area))
tail(data)
## # A tibble: 3 x 9
## src_sub_area fulldata data_fun_name data_fun model_name model MAE_error_test
## <chr> <list> <chr> <list> <chr> <lis> <dbl>
## 1 sxk97 <tibble… msts <fn> tbats <fn> 7.42
## 2 sxk9e <tibble… msts <fn> tbats <fn> 9.76
## 3 sxk9s <tibble… msts <fn> tbats <fn> 6.94
## # … with 2 more variables: MAE_error_train <dbl>, fitted <list>
Finally, we obtain our final forecast result:
# Shift a series back by 7 observations (equivalent to seven nested lag() calls)
lag_7 <- function(x){
  dplyr::lag(x, n = 7)
}
data_actual <- data_for %>%
filter(key == "actual")
data_forecast <- data_for %>%
filter(key == "forecast") %>%
mutate(demand = lag_7(demand)) %>%
filter(start_time >= "2017-12-03 00:00:00")
data_final <- rbind(data_actual, data_forecast)
data_ex <- data_final %>%
filter(key == "forecast") %>%
dplyr::rename(datetime = start_time) %>%
select(- key)
#write.csv(data_ex, "data-submissiontop.csv")
data_forecast
## # A tibble: 504 x 4
## src_sub_area key start_time demand
## <chr> <chr> <dttm> <dbl>
## 1 sxk97 forecast 2017-12-03 00:00:00 40
## 2 sxk97 forecast 2017-12-03 01:00:00 30
## 3 sxk97 forecast 2017-12-03 02:00:00 21
## 4 sxk97 forecast 2017-12-03 03:00:00 15
## 5 sxk97 forecast 2017-12-03 04:00:00 11
## 6 sxk97 forecast 2017-12-03 05:00:00 7
## 7 sxk97 forecast 2017-12-03 06:00:00 6
## 8 sxk97 forecast 2017-12-03 07:00:00 11
## 9 sxk97 forecast 2017-12-03 08:00:00 19
## 10 sxk97 forecast 2017-12-03 09:00:00 19
## # … with 494 more rows
Now, we present the actual data and our forecast results in a graph:
ggplotly(ggplot(data_final,aes(x = start_time, y = demand, colour = key)) +
geom_line() +
labs(y = "Order Request", x = NULL, title = "Sub-Areas Model Prediction") +
facet_wrap(~ src_sub_area, scale = "free_y", ncol = 1) +
scale_color_brewer(palette = "Pastel1") +
tidyquant::theme_tq() +
theme(legend.position = "none",plot.title = element_text(hjust = 0.5)))
Conclusion
This online transportation case has two types of seasonality: daily and weekly. We used STLM, TBATS, and Holt-Winters for the multi-seasonal data. The forecasts from the TBATS models showed the best performance for each and every sub-area.