Scotty is a ride-sharing business operating in several big cities in Turkey. The company provides a motorcycle ride-sharing service for Turkish citizens and really values efficiency in traveling through traffic; the app even references Star Trek's "beam me up" in its order buttons.
Scotty provided us with a real-time transaction dataset. With this dataset, we are going to help them solve some forecasting and classification problems in order to improve their business processes.
It's almost the end of 2017 and we need to prepare a forecast model to help Scotty get ready for the end-of-year demand. Unfortunately, Scotty is not old enough to have last year's data for December, so we cannot look back at past demand to prepare a forecast for December. Fortunately, we already know that time series analysis is more than enough to help us forecast! But, as an investment in the business's future, we also need to develop an automated forecasting framework so we don't have to meddle with forecast model selection in the future.
## Rows: 90,113
## Columns: 16
## $ id <fct> 59d005e1ffcfa261708ce9cd, 59d0066affcfa261708ceb11…
## $ trip_id <fct> 59d005e9cb564761a8fe5d3e, NA, 59d006c131e39c618969…
## $ driver_id <fct> 59a892c5568be44b2734f276, NA, 599dc0dfa5b4fd5471ad…
## $ rider_id <fct> 59ad2d6efba75a581666b506, 59cd704bcf482f6ce2fadfdb…
## $ start_time <fct> 2017-10-01T00:00:17Z, 2017-10-01T00:02:34Z, 2017-1…
## $ src_lat <dbl> 41.07047, 41.07487, 41.04995, 41.05287, 41.06760, …
## $ src_lon <dbl> 29.01945, 28.99528, 29.03107, 28.99522, 28.98827, …
## $ src_area <fct> sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sx…
## $ src_sub_area <fct> sxk9s, sxk9e, sxk9s, sxk9e, sxk9e, sxk9s, sxk9s, s…
## $ dest_lat <dbl> 41.11716, 41.08351, 41.04495, 41.08140, 41.02125, …
## $ dest_lon <dbl> 29.03650, 29.00228, 28.98192, 28.98197, 29.11316, …
## $ dest_area <fct> sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sx…
## $ dest_sub_area <fct> sxk9u, sxk9e, sxk9e, sxk9e, sxk9q, sxk9e, sxk9e, s…
## $ distance <dbl> 5.379250, 1.126098, 4.169492, 3.358296, 11.693573,…
## $ status <fct> confirmed, nodrivers, confirmed, confirmed, nodriv…
## $ confirmed_time_sec <int> 8, 0, 32, 65, 0, 27, 32, 30, 0, 24, 9, 0, 118, 106…
We will only use two variables: the order time and the pickup sub-area.
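The selection step itself is not shown here; below is a minimal sketch, assuming the raw transactions have already been read into ride_df and that dplyr and magrittr are attached:
# Hedged sketch: keep only the order timestamp and the pickup sub-area
ride_df %<>%
select(start_time, src_sub_area)
glimpse(ride_df)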
## Rows: 90,113
## Columns: 2
## $ start_time <fct> 2017-10-01T00:00:17Z, 2017-10-01T00:02:34Z, 2017-10-01T0…
## $ src_sub_area <fct> sxk9s, sxk9e, sxk9s, sxk9e, sxk9e, sxk9s, sxk9s, sxk9s, …
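The quick check that lists the sub-area levels below is not shown; it was presumably a simple call such as:
unique(ride_df$src_sub_area)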
## [1] sxk9s sxk9e sxk97
## Levels: sxk97 sxk9e sxk9s
For this time series case, we need to floor the timestamps to the hour.
ride_df$start_time <- as.POSIXct(ride_df$start_time, format="%Y-%m-%dT%H:%M:%SZ")
ride_df$start_time <- floor_date(ride_df$start_time, unit="hour")
Next, we need to count the demand (number of orders) for each hour and sub-area.
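The counting call is not shown either; a minimal sketch, assuming dplyr is attached:
# Hedged sketch: count the number of orders per sub-area and hour
ride_df %<>%
group_by(src_sub_area, start_time) %>%
summarise(demand = n()) %>%
ungroup()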
For time series modeling we are not permitted to have gaps in the time index, so we need to pad the missing timestamps and replace the demand in those missing hours with a zero value.
min_date <- min(ride_df$start_time)
start_val <- make_datetime(year = year(min_date), month=month(min_date), day=day(min_date), hour = 0)
max_date <- max(ride_df$start_time)
end_val <- make_datetime(year = year(max_date), month=month(max_date), day=day(max_date), hour = 23)
ride_df %<>%
group_by(src_sub_area) %>%
pad(start_val = start_val, end_val = end_val) %>%
mutate(demand = replace_na(demand, 0)) %>%
ungroup() %>%
distinct()
## # A tibble: 6 x 3
## # Groups: src_sub_area [1]
## start_time src_sub_area demand
## <dttm> <fct> <dbl>
## 1 2017-10-01 00:00:00 sxk97 6
## 2 2017-10-01 01:00:00 sxk97 4
## 3 2017-10-01 02:00:00 sxk97 9
## 4 2017-10-01 03:00:00 sxk97 2
## 5 2017-10-01 04:00:00 sxk97 5
## 6 2017-10-01 05:00:00 sxk97 4
Visualize the demand data per sub area.
ggplotly(ggplot(ride_df,aes(x = start_time, y = demand)) +
geom_line(aes(col = src_sub_area)) +
labs(x = "", y = "Order Request",title = "Order Demand by Sub Area") +
facet_wrap(~ src_sub_area, scale = "free_y", ncol = 1) +
theme_tq() +
scale_colour_tq() +
theme(legend.position = "none")
)
The trend resulting from the decomposition with daily seasonality only is not smooth, which might be caused by uncaptured extra seasonality, so the data can be considered multi-seasonal. Therefore, we try another option: creating a multiple-seasonal time series object, msts, with daily and weekly seasonality:
daily_weekly <- ride_df %>% filter(src_sub_area == "sxk97") %>% .$demand %>% msts(.,seasonal.periods = c(24,24*7))
autoplot(mstl(daily_weekly)) + labs(title = "Decomposition on Daily and Weekly Basis") + theme(legend.position = "none", plot.title = element_text(hjust = 0.5))
I use a square-root scale (scale_y_sqrt) to clarify the pattern. Create seasonal plots for each sub-area to identify the emerging seasonal patterns.
sxk97 <- ride_df %>% filter(src_sub_area == "sxk97") %>% .$demand
sxk97_daily <- ggseasonplot(ts(sxk97,frequency = 24),polar = T) +
theme(legend.position = "none") +
labs(title = "SXK97 Daily",x = "Hour") +
scale_y_sqrt()
sxk97_w <- ride_df %>% filter(src_sub_area == "sxk97") %>%
mutate(date = format(start_time,"%Y/%m/%d")) %>%
group_by(date) %>%
mutate(demand = sum(demand)) %>%
select(c(date,demand)) %>%
distinct()
sxk97_weekly <- ggseasonplot(ts(sxk97_w$demand,frequency = 7),polar = T) +
theme(legend.position = "none") +
labs(title = "SXK97 Weekly",x =NULL) +
scale_y_sqrt()
ggarrange(sxk97_daily, sxk97_weekly, ncol=2, nrow=1)
sxk9e <- ride_df %>% filter(src_sub_area == "sxk9e") %>% .$demand
sxk9e_daily <- ggseasonplot(ts(sxk9e,frequency = 24),polar = T) +
theme(legend.position = "none") +
labs(title = "SXK9E Daily",x = "Hour") +
scale_y_sqrt()
sxk9e_w <- ride_df %>% filter(src_sub_area == "sxk9e") %>%
mutate(date = format(start_time,"%Y/%m/%d")) %>%
group_by(date) %>%
mutate(demand = sum(demand)) %>%
select(c(date,demand)) %>%
distinct()
sxk9e_weekly <- ggseasonplot(ts(sxk9e_w$demand,frequency = 7),polar = T) +
theme(legend.position = "none") +
labs(title = "SXK9E Weekly",x =NULL) +
scale_y_sqrt()
ggarrange(sxk9e_daily, sxk9e_weekly, ncol=2, nrow=1)
sxk9s <- ride_df %>% filter(src_sub_area == "sxk9s") %>% .$demand
sxk9s_daily <- ggseasonplot(ts(sxk9s,frequency = 24),polar = T) +
theme(legend.position = "none") +
labs(title = "SXK9S Daily",x = "Hour") +
scale_y_sqrt()
sxk9s_w <- ride_df %>% filter(src_sub_area == "sxk9s") %>%
mutate(date = format(start_time,"%Y/%m/%d")) %>%
group_by(date) %>%
mutate(demand = sum(demand)) %>%
select(c(date,demand)) %>%
distinct()
sxk9s_weekly <- ggseasonplot(ts(sxk9s_w$demand,frequency = 7),polar = T) +
theme(legend.position = "none") +
labs(title = "SXK9S Weekly",x =NULL) +
scale_y_sqrt()
ggarrange(sxk9s_daily, sxk9s_weekly, ncol=2, nrow=1)
On the daily pattern plots, we can see that the highest demand occurs between 17:00 and 19:00 in the evening, while the lowest demand occurs around 5:00-6:00 in the morning. On the weekly pattern plots (starting from Sunday at 12 a.m.), we can see that the highest demand appears on Friday or Saturday.
I decided to use only the daily and weekly patterns in this modeling, because a monthly pattern is not very visible and the dataset only covers two months of observations.
Next, perform cross-validation by dividing the data into train data (to train the model) and test data (to evaluate the model). The test data consists of the observations from the last week; the rest goes into the train data.
Determine the beginning and end of the train and test data.
test_size <- 24*7
test_end <- max(ride_df$start_time)
test_start <- test_end - hours(test_size) + hours(1)
train_end <- test_start - hours(1)
train_start <- min(ride_df$start_time)
intrain <- interval(train_start, train_end)
intest <- interval(test_start, test_end)
Then, we label each start_time according to whether it falls in the train or the test dataset.
ride_df %<>%
mutate(sample = case_when(
start_time %within% intrain ~ "train",
start_time %within% intest ~ "test"
)) %>%
drop_na() %>%
mutate(sample = factor(sample, levels = c("train", "test")))
head(ride_df)
## # A tibble: 6 x 4
## # Groups: src_sub_area [1]
## start_time src_sub_area demand sample
## <dttm> <fct> <dbl> <fct>
## 1 2017-10-01 00:00:00 sxk97 6 train
## 2 2017-10-01 01:00:00 sxk97 4 train
## 3 2017-10-01 02:00:00 sxk97 9 train
## 4 2017-10-01 03:00:00 sxk97 2 train
## 5 2017-10-01 04:00:00 sxk97 5 train
## 6 2017-10-01 05:00:00 sxk97 4 train
Use the recipes package to scale the data and dampen the effect of outliers on our model. We need to reshape the data into wide format because the recipes package expects each variable as a column.
ride_df %<>%
spread(src_sub_area, demand)
recipe <- recipe(~., filter(ride_df, start_time %within% intrain)) %>%
step_sqrt(all_numeric()) %>%
step_center(all_numeric()) %>%
step_scale(all_numeric()) %>%
prep()
# Apply the trained recipe to the full dataset
ride_df <- bake(recipe, ride_df)
# Convert the data back to long format
ride_df %<>%
gather(src_sub_area, demand, -start_time,-sample)
head(ride_df)
## # A tibble: 6 x 4
## start_time sample src_sub_area demand
## <dttm> <fct> <chr> <dbl>
## 1 2017-10-01 00:00:00 train sxk97 -0.594
## 2 2017-10-01 01:00:00 train sxk97 -0.841
## 3 2017-10-01 02:00:00 train sxk97 -0.291
## 4 2017-10-01 03:00:00 train sxk97 -1.16
## 5 2017-10-01 04:00:00 train sxk97 -0.711
## 6 2017-10-01 05:00:00 train sxk97 -0.841
After scaling, don't forget to create a function to revert the data back to its original scale.
recipe_revert <- function(vector, rec, varname) {
rec_center <- rec$steps[[2]]$means[varname]
rec_scale <- rec$steps[[3]]$sds[varname]
results <- (vector * rec_scale + rec_center) ^ 2
results <- round(results)
results
}
Next, nest the Training data and Test data to simplify the model selection process.
ride_df %<>%
group_by(src_sub_area, sample) %>%
nest(data = c(start_time, demand)) %>%
pivot_wider(names_from = sample, values_from = data)
head(ride_df)
## # A tibble: 3 x 3
## # Groups: src_sub_area [3]
## src_sub_area train test
## <chr> <list> <list>
## 1 sxk97 <tibble [1,344 × 2]> <tibble [168 × 2]>
## 2 sxk9e <tibble [1,344 × 2]> <tibble [168 × 2]>
## 3 sxk9s <tibble [1,344 × 2]> <tibble [168 × 2]>
As noted before, there are two options for representing the data: a ts object with daily seasonality and an msts object with daily and weekly seasonality. To apply them to our data, we need to build a data frame that pairs each sub-area with the corresponding object-creation function.
# Make a list of object functions
data_funs <- list(
ts = function(x) ts(x$demand, frequency = 24),
msts = function(x) msts(x$demand, seasonal.periods = c(24, 24 * 7))
)
# Change the form of the function into a data frame
# that is ready to be combined with the dataset
data_funs %<>%
rep(length(unique(ride_df$src_sub_area))) %>%
enframe("data_fun_name", "data_fun") %>%
mutate(src_sub_area =
sort(rep(unique(ride_df$src_sub_area), length(unique(.$data_fun_name))))
)
data_funs
## # A tibble: 6 x 3
## data_fun_name data_fun src_sub_area
## <chr> <list> <chr>
## 1 ts <fn> sxk97
## 2 msts <fn> sxk97
## 3 ts <fn> sxk9e
## 4 msts <fn> sxk9e
## 5 ts <fn> sxk9s
## 6 msts <fn> sxk9s
Combine them into one dataframe:
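The joining call itself is not shown; a minimal sketch, assuming a left join on src_sub_area (the same pattern used when joining the models later):
# Hedged sketch: attach each sub-area's object-creation functions to its nested data
ride_df %<>%
left_join(data_funs, by = "src_sub_area")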
## # A tibble: 6 x 5
## # Groups: src_sub_area [3]
## src_sub_area train test data_fun_name data_fun
## <chr> <list> <list> <chr> <list>
## 1 sxk97 <tibble [1,344 × 2]> <tibble [168 × 2]> ts <fn>
## 2 sxk97 <tibble [1,344 × 2]> <tibble [168 × 2]> msts <fn>
## 3 sxk9e <tibble [1,344 × 2]> <tibble [168 × 2]> ts <fn>
## 4 sxk9e <tibble [1,344 × 2]> <tibble [168 × 2]> msts <fn>
## 5 sxk9s <tibble [1,344 × 2]> <tibble [168 × 2]> ts <fn>
## 6 sxk9s <tibble [1,344 × 2]> <tibble [168 × 2]> msts <fn>
Make a list of model algorithms to apply to the dataset. I used these models: ets, stlm, tbats, auto.arima, and Holt-Winters. I'll try all of them, then see which model gives the smallest error on its predictions. Note that the ets and auto.arima models are not applied to the multiple-seasonal (msts) objects because they are not compatible with that kind of object.
models <- list(
stlm = function(x) stlm(x),
tbats = function(x) tbats(x, use.box.cox = FALSE,
use.trend = TRUE,
use.damped.trend = TRUE,
use.parallel = FALSE),
holt.winter = function(x) HoltWinters(x,seasonal = "additive"),
auto.arima = function(x) auto.arima(x),
ets = function(x) ets(x)
)
models %<>%
rep(length(unique(ride_df$src_sub_area))) %>%
enframe("model_name", "model") %>%
mutate(src_sub_area =
sort(rep(unique(ride_df$src_sub_area), length(unique(.$model_name))))
)
head(models)
## # A tibble: 6 x 3
## model_name model src_sub_area
## <chr> <list> <chr>
## 1 stlm <fn> sxk97
## 2 tbats <fn> sxk97
## 3 holt.winter <fn> sxk97
## 4 auto.arima <fn> sxk97
## 5 ets <fn> sxk97
## 6 stlm <fn> sxk9e
Then combine models with data:
ride_df %<>%
left_join(models) %>%
filter(!(model_name == "ets" & data_fun_name == "msts"),
!(model_name == "auto.arima" & data_fun_name == "msts"))
head(ride_df)
## # A tibble: 6 x 7
## # Groups: src_sub_area [1]
## src_sub_area train test data_fun_name data_fun model_name model
## <chr> <list> <list> <chr> <list> <chr> <lis>
## 1 sxk97 <tibble [1,… <tibble [16… ts <fn> stlm <fn>
## 2 sxk97 <tibble [1,… <tibble [16… ts <fn> tbats <fn>
## 3 sxk97 <tibble [1,… <tibble [16… ts <fn> holt.wint… <fn>
## 4 sxk97 <tibble [1,… <tibble [16… ts <fn> auto.arima <fn>
## 5 sxk97 <tibble [1,… <tibble [16… ts <fn> ets <fn>
## 6 sxk97 <tibble [1,… <tibble [16… msts <fn> stlm <fn>
Apply the time series object creation functions and the modeling algorithms to the dataset.
# ride_df %<>%
# mutate(
# params = map(train, ~ list(x = .x)),
# data = invoke_map(data_fun, params),
# params = map(data, ~ list(x = .x)),
# fitted = invoke_map(model, params)
# ) %>%
# select(-data, -params)
#
# ride_model <- saveRDS(ride_df, "ride_model.RDS")
ride_df <- readRDS("ride_model.RDS")
head(ride_df)
## # A tibble: 6 x 8
## # Groups: src_sub_area [1]
## src_sub_area train test data_fun_name data_fun model_name model fitted
## <chr> <list> <list> <chr> <list> <chr> <lis> <list>
## 1 sxk97 <tibble … <tibble… ts <fn> stlm <fn> <stlm>
## 2 sxk97 <tibble … <tibble… ts <fn> tbats <fn> <tbat…
## 3 sxk97 <tibble … <tibble… ts <fn> holt.wint… <fn> <HltW…
## 4 sxk97 <tibble … <tibble… ts <fn> auto.arima <fn> <fr_A…
## 5 sxk97 <tibble … <tibble… ts <fn> ets <fn> <ets>
## 6 sxk97 <tibble … <tibble… msts <fn> stlm <fn> <stlm>
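A side note on the commented-out fitting code above: invoke_map() has been retired in newer purrr releases. A minimal sketch of an equivalent fitting step using map2(), assuming the same nested train, data_fun, and model columns:
# Hedged sketch, not the original code: build each ts/msts object from its train tibble,
# then fit the paired model to that object
ride_df %<>%
mutate(
data = map2(data_fun, train, ~ .x(.y)),
fitted = map2(model, data, ~ .x(.y))
) %>%
select(-data)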
Use mae_vec from the yardstick package to measure the train and test error.
ride_df %<>%
mutate(MAE_test =
map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
map2_dbl(test, ~ mae_vec(truth = recipe_revert(.y$demand,recipe,src_sub_area), estimate = recipe_revert(.x$mean,recipe,src_sub_area)))) %>%
arrange(src_sub_area, MAE_test)
ride_df %<>%
mutate(MAE_train =
map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
map2_dbl(train, ~ mae_vec(truth = recipe_revert(.y$demand,recipe,src_sub_area), estimate = recipe_revert(.x$fitted,recipe,src_sub_area)))) %>%
arrange(src_sub_area, MAE_test)
## # A tibble: 24 x 5
## # Groups: src_sub_area [3]
## src_sub_area data_fun_name model_name MAE_test MAE_train
## <chr> <chr> <chr> <dbl> <dbl>
## 1 sxk97 msts tbats 7.42 4.69
## 2 sxk97 msts stlm 8.67 3.63
## 3 sxk97 msts holt.winter 8.70 5.07
## 4 sxk97 ts ets 9 5.05
## 5 sxk97 ts tbats 9.02 4.91
## 6 sxk97 ts holt.winter 9.12 5.54
## 7 sxk97 ts stlm 9.34 4.67
## 8 sxk97 ts auto.arima 11.3 5.62
## 9 sxk9e msts tbats 9.76 6.48
## 10 sxk9e msts stlm 10.1 5.00
## # … with 14 more rows
Create a visualization that shows how each model's predictions differ from the real demand data.
ride_df_test <- ride_df %>%
mutate(
forecast =
map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
map2(test, ~ tibble(
start_time = .y$start_time,
demand = as.vector(.x$mean)
)),
key = paste(data_fun_name, model_name, sep = "-")
)
Reshape the data into long format before the visualization step.
ride_df_test %<>%
select(src_sub_area, key, actual = test, forecast) %>%
spread(key, forecast) %>%
gather(key, value, -src_sub_area) %>%
unnest(value) %>%
mutate(demand = recipe_revert(demand,recipe,src_sub_area))
# Visualization step
ggplotly(ggplot(ride_df_test,aes(x = start_time, y = demand)) +
geom_line(data = ride_df_test %>% filter(key == "actual"),aes(y = demand),alpha = 0.2,size = 0.8) +
geom_line(data = ride_df_test %>% filter(key != "actual"),aes(frame = key,col = key)) +
labs(x = "", y = "Order)",title = "Comparison Of Model Prediction Results", frame = "Models") + facet_wrap(~ src_sub_area, scale = "free_y", ncol = 1) +
theme_tq() +
scale_colour_tq()+
theme(legend.position = "none"))
## # A tibble: 6 x 4
## # Groups: src_sub_area [1]
## src_sub_area key start_time demand
## <chr> <chr> <dttm> <dbl>
## 1 sxk97 actual 2017-11-26 00:00:00 38
## 2 sxk97 actual 2017-11-26 01:00:00 21
## 3 sxk97 actual 2017-11-26 02:00:00 10
## 4 sxk97 actual 2017-11-26 03:00:00 10
## 5 sxk97 actual 2017-11-26 04:00:00 5
## 6 sxk97 actual 2017-11-26 05:00:00 2
Choose the model that has the smallest prediction error for each sub-area, then combine the test and train data.
# Smallest error selection
ride_df %<>%
select(-fitted) %>%
group_by(src_sub_area) %>%
filter(MAE_test == min(MAE_test)) %>%
ungroup()
# Combining test and train data
ride_df %<>%
mutate(fulldata = map2(train, test, ~ bind_rows(.x, .y))) %>%
select(src_sub_area, fulldata, everything(), -train, -test)
ride_df
## # A tibble: 3 x 8
## src_sub_area fulldata data_fun_name data_fun model_name model MAE_test
## <chr> <list> <chr> <list> <chr> <lis> <dbl>
## 1 sxk97 <tibble… msts <fn> tbats <fn> 7.42
## 2 sxk9e <tibble… msts <fn> tbats <fn> 9.76
## 3 sxk9s <tibble… msts <fn> tbats <fn> 6.94
## # … with 1 more variable: MAE_train <dbl>
Then we do the nested fitting. The tbats model has the smallest prediction error for each sub-area. Next, combine the test data and train data to create the final model: fit the tbats model on the combined dataset, then predict demand for the next 7 days.
# Running model
# ride_df %<>%
# mutate(
# params = map(fulldata, ~ list(x = .x)),
# data = invoke_map(data_fun, params),
# params = map(data, ~ list(x = .x)),
# fitted = invoke_map(model, params)
# ) %>%
# select(-data, -params)
#
# ride_bestmodel <- saveRDS(ride_df, "ride_bestmodel.RDS")
ride_df <- readRDS("ride_bestmodel.RDS")
# Make Prediction
ride_df %<>%
mutate(forecast =
map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
map2(fulldata, ~ tibble(
start_time = tk_make_future_timeseries(.y$start_time, 24 * 7),
demand = as.vector(.x$mean)
))
)
ride_df
## # A tibble: 3 x 10
## src_sub_area fulldata data_fun_name data_fun model_name model MAE_error_train
## <chr> <list> <chr> <list> <chr> <lis> <dbl>
## 1 sxk97 <tibble… msts <fn> tbats <fn> 4.69
## 2 sxk9e <tibble… msts <fn> tbats <fn> 6.48
## 3 sxk9s <tibble… msts <fn> tbats <fn> 4.81
## # … with 3 more variables: MAE_error_test <dbl>, fitted <list>, forecast <list>
Unnest the data and create visualizations of the prediction results.
ride_df %<>%
select(src_sub_area, actual = fulldata, forecast) %>%
gather(key, value, -src_sub_area) %>%
unnest(value) %>%
mutate(demand = recipe_revert(demand,recipe,src_sub_area))
# Visualization
ggplotly(ggplot(ride_df,aes(x = start_time, y = demand, colour = key)) +
geom_line() +
labs(y = NULL, x = NULL, colour = NULL, title = "Model Prediction Results") +
facet_wrap(~ src_sub_area, scale = "free_y", ncol = 1) +
theme_tq())
Get the final forecast result.
ride_actual <- ride_df %>%
filter(key == "actual")
ride_forecast <- ride_df %>%
filter(key == "forecast") %>%
filter(start_time >= "2017-12-03 00:00:00")
ride_final <- rbind(ride_actual, ride_forecast)
data_submit <- ride_final %>%
filter(key == "forecast") %>%
rename(datetime = start_time) %>%
select(- key)
#write.csv(data_submit, "data-submission.csv")
head(data_submit)
## # A tibble: 6 x 3
## src_sub_area datetime demand
## <chr> <dttm> <dbl>
## 1 sxk97 2017-12-03 00:00:00 40
## 2 sxk97 2017-12-03 01:00:00 30
## 3 sxk97 2017-12-03 02:00:00 21
## 4 sxk97 2017-12-03 03:00:00 15
## 5 sxk97 2017-12-03 04:00:00 11
## 6 sxk97 2017-12-03 05:00:00 7
The forecasts from the tbats models show the best performance for every sub-area. This online transportation case has two types of seasonality, daily and weekly, which is why we used stlm, tbats, and Holt-Winters, methods that can accommodate that kind of seasonality.