Scotty is a ride-sharing business that operates in several big cities in Turkey. The company provides a motorcycle ride-sharing service for Turkish citizens and really values efficiency in traveling through traffic; the app even references Star Trek's "beam me up" in its order buttons.
Scotty provided us with a real-time transaction dataset. With this dataset, we are going to help them solve some forecasting and classification problems to improve their business processes.
It's almost the end of 2017, and we need to prepare a forecast model to help Scotty get ready for the end-of-year demand. Unfortunately, Scotty is not old enough to have last year's data for December, so we cannot look back at past demand to prepare a forecast for December. Fortunately, you already know that time series analysis is more than enough to help us forecast! But, as an investment in the business's future, we also need to develop an automated forecasting framework so we don't have to meddle with forecast model selection in the future!
This time, I am going to show you the process of automated time-series forecasting for multiple models. To do it, I will use a package called purrr.
Let's get into the problem! Enjoy the code!
Since I am using the R programming language, I first need to load the packages that will help me work through the problem.
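The setup chunk is not shown in the original post; below is a sketch of the packages it appears to rely on, inferred from the functions called throughout.
# setup sketch: package list inferred from the functions used in this post
library(tidyverse) # dplyr, tidyr, purrr, ggplot2, readr
library(lubridate) # ymd_hms(), floor_date(), interval(), %within%
library(padr) # pad() for filling gaps in a time series
library(recipes) # recipe(), prep(), bake() preprocessing
library(forecast) # ts/msts models: auto.arima(), ets(), stlm(), tbats()
library(tseries) # adf.test() stationarity test
library(yardstick) # mae_vec() error metric
library(plotly) # ggplotly() interactive charts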
I will load my data into the scotty object.
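The loading chunk itself is not shown; assuming the transactions come as a CSV file (the path below is hypothetical), it would look something like this:
# hypothetical file path; adjust to wherever the Scotty data lives
scotty <- read_csv("data/scotty.csv")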
Next, observe scotty with the glimpse() function:
## Observations: 90,113
## Variables: 16
## $ id <chr> "59d005e1ffcfa261708ce9cd", "59d0066affcfa261708...
## $ trip_id <chr> "59d005e9cb564761a8fe5d3e", NA, "59d006c131e39c6...
## $ driver_id <chr> "59a892c5568be44b2734f276", NA, "599dc0dfa5b4fd5...
## $ rider_id <chr> "59ad2d6efba75a581666b506", "59cd704bcf482f6ce2f...
## $ start_time <dttm> 2017-10-01 00:00:17, 2017-10-01 00:02:34, 2017-...
## $ src_lat <dbl> 41.07047, 41.07487, 41.04995, 41.05287, 41.06760...
## $ src_lon <dbl> 29.01945, 28.99528, 29.03107, 28.99522, 28.98827...
## $ src_area <chr> "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", ...
## $ src_sub_area <chr> "sxk9s", "sxk9e", "sxk9s", "sxk9e", "sxk9e", "sx...
## $ dest_lat <dbl> 41.11716, 41.08351, 41.04495, 41.08140, 41.02125...
## $ dest_lon <dbl> 29.03650, 29.00228, 28.98192, 28.98197, 29.11316...
## $ dest_area <chr> "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", ...
## $ dest_sub_area <chr> "sxk9u", "sxk9e", "sxk9e", "sxk9e", "sxk9q", "sx...
## $ distance <dbl> 5.379250, 1.126098, 4.169492, 3.358296, 11.69357...
## $ status <chr> "confirmed", "nodrivers", "confirmed", "confirme...
## $ confirmed_time_sec <dbl> 8, 0, 32, 65, 0, 27, 32, 30, 0, 24, 9, 0, 118, 1...
Now, look at the top six rows using the head() function:
## # A tibble: 6 x 16
## id trip_id driver_id rider_id start_time src_lat src_lon src_area
## <chr> <chr> <chr> <chr> <dttm> <dbl> <dbl> <chr>
## 1 59d0~ 59d005~ 59a892c5~ 59ad2d6~ 2017-10-01 00:00:17 41.1 29.0 sxk9
## 2 59d0~ <NA> <NA> 59cd704~ 2017-10-01 00:02:34 41.1 29.0 sxk9
## 3 59d0~ 59d006~ 599dc0df~ 59bd62c~ 2017-10-01 00:03:29 41.0 29.0 sxk9
## 4 59d0~ 59d007~ 59a58565~ 59c17cd~ 2017-10-01 00:04:24 41.1 29.0 sxk9
## 5 59d0~ <NA> <NA> 596f47a~ 2017-10-01 00:06:06 41.1 29.0 sxk9
## 6 59d0~ 59d007~ 599dc0df~ 59bd62c~ 2017-10-01 00:08:19 41.0 29.0 sxk9
## # ... with 8 more variables: src_sub_area <chr>, dest_lat <dbl>,
## # dest_lon <dbl>, dest_area <chr>, dest_sub_area <chr>, distance <dbl>,
## # status <chr>, confirmed_time_sec <dbl>
Since I am going to build a time series model, I only need the start_time and src_sub_area variables. Therefore, I will subset my data frame and store it in a new object, scotty_new.
scotty_new <- scotty %>%
mutate(src_sub_area = as.factor(src_sub_area)) %>%
select(start_time, src_sub_area)
Next, I want to count the orders per hour, so I will floor start_time to the hour. I will store the result in scotty_agg.
scotty_agg <- scotty_new %>%
mutate(start_time = ymd_hms(start_time) %>% floor_date(unit = "hour")) %>%
group_by(start_time, src_sub_area) %>%
summarise(demand = n()) %>%
ungroup()
Great! Now I have the order count for each hour and area.
This data has three levels of src_sub_area, so I will reshape it from a long data frame into a wide one using the spread() function from the tidyr package.
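The reshaping chunk is hidden in the original; judging from the wide columns (sxk97, sxk9e, sxk9s) that appear later, a sketch would be:
# widen: one hourly-demand column per sub-area
scotty_agg <- scotty_agg %>%
spread(src_sub_area, demand)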
Next, I am going to pad the time series to make sure every hour is present. Before that, I have to find the minimum and maximum values of the time index.
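The chunk is hidden; a quick sketch to check the range of the time index:
min(scotty_agg$start_time)
max(scotty_agg$start_time)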
## [1] "2017-10-01 UTC"
## [1] "2017-12-02 23:00:00 UTC"
Padding time!
scotty_pad <- scotty_agg %>%
pad(start_val = ymd_hms("2017-10-01 00:00:00 UTC"), end_val = ymd_hms("2017-12-02 23:00:00 UTC"))
Since this data is hourly order demand, I am going to replace the NA values with 0 (these missing values mean there were no orders at that time).
scotty_pad <- scotty_pad %>%
mutate(sxk97 = replace_na(sxk97, 0)) %>%
mutate(sxk9e = replace_na(sxk9e, 0)) %>%
mutate(sxk9s = replace_na(sxk9s, 0))
Voila! The data is much cleaner now.
For a clearer view, I will visualize the data with a ggplot() graph.
scotty_pad %>%
pivot_longer(cols = c("sxk97","sxk9e","sxk9s"), names_to = "src_sub_area", values_to = "demand") %>%
filter(start_time >= max(start_time) - hours(24 * 7 * 4)) %>%
ggplot(mapping = aes(x = start_time, y = demand)) +
geom_line() +
facet_wrap(~src_sub_area, scale = "free", ncol = 1) +
tidyquant::theme_tq()
As we can see from the three time-series graphs, they share similarities: the order count drops to 0 at certain times and peaks at the same times across areas. This pattern repeats regularly each day and week, so we can call it a seasonality pattern.
Since the data covers only about two months (nine full weeks of hourly observations), I will use a simple train/test split for validation: the training set covers the first eight weeks (24 hours * 7 days * 4 weeks * 2) and the test set covers the final week.
test_size <- 24 * 7 # test size is 7 days, since I want to predict the next 7 days
train_size <- 24 * 7 * 4 * 2 # train size is 2 months (8 weeks)
# get the min-max of the time index for each sample
test_end <- max(scotty_pad$start_time)
test_start <- test_end - hours(test_size) + hours(1)
train_end <- test_start - hours(1)
train_start <- train_end - hours(train_size) + hours(1)
To make things easier, I can combine the start and end points into date intervals using the interval() function.
intrain <- interval(start = train_start, end = train_end)
intest <- interval(start = test_start, end = test_end)
intrain
## [1] 2017-10-01 UTC--2017-11-25 23:00:00 UTC
intest
## [1] 2017-11-26 UTC--2017-12-02 23:00:00 UTC
It is cool, right?
Now, I am going to preprocess the data using the recipe() and bake() functions from the recipes package.
rec <- recipe(~., filter(scotty_pad, start_time %within% intrain)) %>%
step_sqrt(all_numeric()) %>%
step_center(all_numeric()) %>%
step_scale(all_numeric()) %>%
prep()
bake <- bake(rec, scotty_pad)
bake
## # A tibble: 1,512 x 4
## start_time sxk97 sxk9e sxk9s
## <dttm> <dbl> <dbl> <dbl>
## 1 2017-10-01 00:00:00 -0.594 -0.697 -0.111
## 2 2017-10-01 01:00:00 -0.841 -1.28 -0.780
## 3 2017-10-01 02:00:00 -0.291 -1.15 -1.12
## 4 2017-10-01 03:00:00 -1.16 -1.28 -1.59
## 5 2017-10-01 04:00:00 -0.711 -1.44 -1.59
## 6 2017-10-01 05:00:00 -0.841 -1.28 -1.59
## 7 2017-10-01 06:00:00 -1.39 -1.44 -1.59
## 8 2017-10-01 07:00:00 -1.94 -1.85 -1.12
## 9 2017-10-01 08:00:00 -0.989 -1.28 -0.544
## 10 2017-10-01 09:00:00 -0.989 -0.497 -0.544
## # ... with 1,502 more rows
Since the recipe() transformation applied a square root, centering, and scaling, we have to create a function to revert the results back to the original scale.
# revert back function
rec_revert <- function(vector, rec, varname) {
# store recipe values
rec_center <- rec$steps[[2]]$means[varname]
rec_scale <- rec$steps[[3]]$sds[varname]
# convert back based on the recipe
results <- (vector * rec_scale + rec_center) ^ 2
# add additional adjustment if necessary
results <- round(results)
# return the results
results
}
Next, turn the data frame back into a long data frame.
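The chunk that produces scotty_clean is not shown in the original; a sketch, assuming it gathers the baked (transformed) wide table back into long format:
# sketch: gather the transformed wide table back into long format
scotty_clean <- bake %>%
gather(src_sub_area, demand, -start_time)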
When programming with the purrr package, we need to convert our tbl into a nested tbl; it is just like having a tbl inside a tbl. Now, let's start by converting the tbl into a nested tbl.
scotty_clean<-scotty_clean %>%
mutate(sample = case_when(
start_time %within% intrain ~ "train",
start_time %within% intest ~ "test"
))
After that, we can nest the tbl using the nest() function by sample and src_sub_area, then spread() or pivot_wider() the tbl by sample (this time I will use pivot_wider() instead of spread()).
nest <- scotty_clean %>%
group_by(src_sub_area, sample) %>%
nest(.key = "data") %>%
pivot_wider(names_from = sample, values_from = data)
nest
## # A tibble: 3 x 3
## # Groups: src_sub_area [3]
## src_sub_area train test
## <chr> <list<df[,2]>> <list<df[,2]>>
## 1 sxk97 [1,344 x 2] [168 x 2]
## 2 sxk9e [1,344 x 2] [168 x 2]
## 3 sxk9s [1,344 x 2] [168 x 2]
Since the data has high-frequency seasonality, I will consider two data representations: daily seasonality only (ts) and multiple seasonality, daily plus weekly (msts).
We need to create data-conversion functions to apply to the nested data.
data_funs <- list(
ts = function(x) ts(x$demand, frequency = 24),
msts = function(x) msts(x$demand, seasonal.periods = c(24, 24 * 7))
)
data_funs
## $ts
## function(x) ts(x$demand, frequency = 24)
##
## $msts
## function(x) msts(x$demand, seasonal.periods = c(24, 24 * 7))
We can convert the list above into a tbl using the enframe() function.
data_funs <- data_funs %>%
rep(length(unique(nest$src_sub_area))) %>%
enframe("data_fun_name", "data_fun") %>%
mutate(src_sub_area =
sort(rep(unique(nest$src_sub_area), length(unique(.$data_fun_name))))
)
data_funs
## # A tibble: 6 x 3
## data_fun_name data_fun src_sub_area
## <chr> <list> <chr>
## 1 ts <fn> sxk97
## 2 msts <fn> sxk97
## 3 ts <fn> sxk9e
## 4 msts <fn> sxk9e
## 5 ts <fn> sxk9s
## 6 msts <fn> sxk9s
The next step is joining data_funs with the nest data frame using left_join().
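That chunk is hidden in the original; a sketch producing the join object used below:
# sketch: attach the data-conversion functions to each sub-area's samples
join <- nest %>%
left_join(data_funs, by = "src_sub_area")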
To look at the trend and seasonality patterns of each area, I will visualize them using the ggseasonplot() function.
#area sxk97
areasxk97 <- scotty_pad %>%
gather(src_sub_area, demand, -start_time) %>%
filter(src_sub_area == "sxk97") %>%
.$demand
Since the data covers only two months, it is not enough to reveal a monthly pattern. Therefore, I will only examine the daily and weekly patterns.
Create the plot
plotdaily <- ggseasonplot(ts(areasxk97, frequency = 24), polar = TRUE) +
labs(title = "Daily Pattern",x = "Hour") +
scale_y_sqrt()+
theme_minimal()+
theme(legend.position = "none")
plotweekly <- ggseasonplot(ts(areasxk97, frequency = 24 * 7), polar = TRUE) +
labs(title = "Weekly Pattern",x = "Week") +
scale_y_sqrt()+
theme_minimal()+
theme(legend.position = "none")
Now, display the plots.
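The display chunk is not shown; one way to put the two seasonal plots side by side (using gridExtra here is my assumption):
# arrange the daily and weekly seasonal plots in one row
gridExtra::grid.arrange(plotdaily, plotweekly, ncol = 2)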
Based on the plots above, we can conclude that:
1. In the daily pattern, the highest demand occurs around 5-7 pm
2. In the weekly pattern, the highest demand occurs on Fridays and Saturdays (the 10 o'clock direction)
Next, let's find the trend and seasonality through decomposition.
decompose <- scotty_pad %>%
gather(src_sub_area, demand, -start_time) %>%
filter(src_sub_area == "sxk97") %>%
.$demand %>%
msts(.,seasonal.periods = c(24,24*7)) #24 for daily, 24*7 for weekly
autoplot(mstl(decompose))+
labs(title = "Decomposition by Daily and Weekly Pattern for Area sxk97")+
theme_minimal()
After a quick inspection, we can see that the trend line is smooth. This indicates that the time series has both a trend and a seasonality pattern.
Since the data has both trend and seasonality, I am going to use five different time series models: auto.arima, ets, stlm (with ARIMA), tbats, and Holt-Winters.
# models list
models <- list(
auto.arima = function (x) {
while(adf.test(x)$p.value > 0.05) {
x = diff(x)
}
auto.arima(x, seasonal = TRUE)},
ets = function(x) ets(x),
stlm = function(x) stlm(x, method = "arima"), #stlm arima
tbats = function(x) tbats(x, use.box.cox = FALSE,
use.trend = TRUE,
use.damped.trend = NULL,
use.arma.errors = TRUE,
use.parallel = TRUE,
num.cores = 6),
holt.winter = function (x) HoltWinters(x, seasonal = "additive")
)
I will collect all of the models into one nested tbl for quicker processing.
Now convert the list into a tbl, like the previous function:
models <- models %>%
rep(length(unique(join$src_sub_area))) %>%
enframe("model_name", "model") %>%
mutate(src_sub_area =
sort(rep(unique(join$src_sub_area), length(unique(.$model_name))))
)
Great! We can now join join and models. However, we do not want to pair auto.arima(), stlm(), and ets() with the msts data, since they are not compatible with multi-seasonal time series. Hence, we have to remove those combinations.
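That chunk is hidden in the original; a sketch of the join plus the filter that drops the incompatible pairs:
# sketch: join the models to the nested data, then drop combinations of
# msts data with models that cannot handle multiple seasonal periods
scot <- join %>%
left_join(models, by = "src_sub_area") %>%
filter(!(data_fun_name == "msts" &
model_name %in% c("auto.arima", "ets", "stlm")))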
To execute the model fitting, we need to wrap the required arguments into a list using the map() function; then we can call each function using invoke_map(). We do this first for the data transformation, using the functions stored in data_fun, and then fit the models with the same process using the functions stored in model.
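The fitting chunk is also hidden; a sketch that mirrors the full-data refit shown near the end of this post:
# sketch: wrap each training set as an argument list, convert it with the
# matching data function, then fit the matching model via invoke_map()
scot <- scot %>%
mutate(
params = map(train, ~ list(x = .x)),
data = invoke_map(data_fun, params), # ts or msts conversion
params = map(data, ~ list(x = .x)),
fitted = invoke_map(model, params) # model fitting
) %>%
select(-data, -params)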
Next, we forecast with the forecast() function and measure the error using the yardstick package. I will use MAE (mean absolute error) as the error metric. The error values are split into train_error (on the training data) and test_error (on the test data).
scot <- scot %>%
mutate(train_error =
map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
map2_dbl(train, ~ mae_vec(truth = rec_revert(.y$demand,rec,src_sub_area), estimate = rec_revert(.x$fitted,rec,src_sub_area))),
test_error=
map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
map2_dbl(test, ~ mae_vec(truth = rec_revert(.y$demand,rec,src_sub_area), estimate = rec_revert(.x$mean,rec,src_sub_area)))
) %>%
arrange(src_sub_area, train_error, test_error)
scotty_error <- scot
scot %>%
select(src_sub_area, ends_with("_name"), train_error, test_error)
## # A tibble: 21 x 5
## # Groups: src_sub_area [3]
## src_sub_area data_fun_name model_name train_error test_error
## <chr> <chr> <chr> <dbl> <dbl>
## 1 sxk97 ts stlm 4.55 9.39
## 2 sxk97 msts tbats 4.69 7.42
## 3 sxk97 ts tbats 4.91 9.02
## 4 sxk97 ts ets 5.05 9
## 5 sxk97 msts holt.winter 5.07 8.70
## 6 sxk97 ts holt.winter 5.54 9.12
## 7 sxk97 ts auto.arima 5.62 11.3
## 8 sxk9e ts stlm 6.34 11.2
## 9 sxk9e msts tbats 6.48 9.76
## 10 sxk9e ts tbats 6.85 10.9
## # ... with 11 more rows
We can compare the forecast results with the real test series through graphical analysis. Before that, we have to produce a forecast from every fitted model. Then, we will use some spread-gather tricks to make a set of keys that is unique for each combination of data representation and model, plus one key for the actual series.
First, build the forecasts.
scot_test <- scot %>%
mutate(
forecast =
map(fitted, ~ forecast(.x, h = 24 * 7 )) %>%
map2(test, ~ tibble(
start_time = .y$start_time,
demand = as.vector(.x$mean)
)),
key = paste(data_fun_name, model_name, sep = "-")
)
scot_test
## # A tibble: 21 x 12
## # Groups: src_sub_area [3]
## src_sub_area train test data_fun_name data_fun model_name model
## <chr> <list<df[,> <list<df> <chr> <list> <chr> <lis>
## 1 sxk97 [1,344 x 2] [168 x 2] ts <fn> stlm <fn>
## 2 sxk97 [1,344 x 2] [168 x 2] msts <fn> tbats <fn>
## 3 sxk97 [1,344 x 2] [168 x 2] ts <fn> tbats <fn>
## 4 sxk97 [1,344 x 2] [168 x 2] ts <fn> ets <fn>
## 5 sxk97 [1,344 x 2] [168 x 2] msts <fn> holt.wint~ <fn>
## 6 sxk97 [1,344 x 2] [168 x 2] ts <fn> holt.wint~ <fn>
## 7 sxk97 [1,344 x 2] [168 x 2] ts <fn> auto.arima <fn>
## 8 sxk9e [1,344 x 2] [168 x 2] ts <fn> stlm <fn>
## 9 sxk9e [1,344 x 2] [168 x 2] msts <fn> tbats <fn>
## 10 sxk9e [1,344 x 2] [168 x 2] ts <fn> tbats <fn>
## # ... with 11 more rows, and 5 more variables: fitted <list>,
## # train_error <dbl>, test_error <dbl>, forecast <list>, key <chr>
Next, do the spread-gather:
scot_test <- scot_test %>%
select(src_sub_area, key, actual = test, forecast) %>%
spread(key, forecast) %>%
gather(key, value, -src_sub_area)
scot_test
## # A tibble: 24 x 3
## # Groups: src_sub_area [3]
## src_sub_area key value
## <chr> <chr> <list>
## 1 sxk97 actual <tibble [168 x 2]>
## 2 sxk9e actual <tibble [168 x 2]>
## 3 sxk9s actual <tibble [168 x 2]>
## 4 sxk97 msts-holt.winter <tibble [168 x 2]>
## 5 sxk9e msts-holt.winter <tibble [168 x 2]>
## 6 sxk9s msts-holt.winter <tibble [168 x 2]>
## 7 sxk97 msts-tbats <tibble [168 x 2]>
## 8 sxk9e msts-tbats <tibble [168 x 2]>
## 9 sxk9s msts-tbats <tibble [168 x 2]>
## 10 sxk97 ts-auto.arima <tibble [168 x 2]>
## # ... with 14 more rows
Finally, let's unnest the data using the unnest() function.
scot_test <- scot_test %>%
unnest(value) %>%
mutate(demand = rec_revert(demand, rec, src_sub_area))
scot_test
## # A tibble: 4,032 x 4
## # Groups: src_sub_area [3]
## src_sub_area key start_time demand
## <chr> <chr> <dttm> <dbl>
## 1 sxk97 actual 2017-11-26 00:00:00 38
## 2 sxk97 actual 2017-11-26 01:00:00 21
## 3 sxk97 actual 2017-11-26 02:00:00 10
## 4 sxk97 actual 2017-11-26 03:00:00 10
## 5 sxk97 actual 2017-11-26 04:00:00 5
## 6 sxk97 actual 2017-11-26 05:00:00 2
## 7 sxk97 actual 2017-11-26 06:00:00 0
## 8 sxk97 actual 2017-11-26 07:00:00 11
## 9 sxk97 actual 2017-11-26 08:00:00 4
## 10 sxk97 actual 2017-11-26 09:00:00 4
## # ... with 4,022 more rows
Great! Now, let's visualize the data with a ggplot() graph for a better view.
p1 <- scot_test %>%
ggplot(aes(x = start_time, y = demand, colour = key)) +
geom_line(data = scot_test %>%
filter(key == "actual"),aes(y = demand),alpha = 0.3,size = 0.8)+
geom_line(data = scot_test %>%
filter(key != "actual"),aes(frame = key,col = key)) +
labs(x = "", y = "Demand",title = "Model Prediction Comparison", frame = "Model") +
facet_wrap(~ src_sub_area, scale = "free_y", ncol = 1) +
tidyquant::theme_tq() +
theme(legend.position = "none",axis.title.y = element_text(size = 12), axis.text.y = element_text(size = 7))
ggplotly(p1)
Cool stuff, isn't it? We can compare the models by sliding the control to the desired model. The transparent line in the back is the test data we want to forecast, while the coloured line is the corresponding forecast.
We can see that each model produces a somewhat different line pattern. However, all of them capture the trend and seasonality of each area, which indicates that the time-series models produce pretty good results.
It is quite hard to compare the results using graphs alone. We can perform model selection automatically by using the filter() function to find the lowest error.
scot_best <- scot %>%
select(-c(fitted, train, test, data_fun, model)) %>%
group_by(src_sub_area) %>%
filter(test_error == min(test_error)) %>%
ungroup()
scot_best
## # A tibble: 3 x 5
## src_sub_area data_fun_name model_name train_error test_error
## <chr> <chr> <chr> <dbl> <dbl>
## 1 sxk97 msts tbats 4.69 7.42
## 2 sxk9e msts tbats 6.48 9.76
## 3 sxk9s msts tbats 4.81 6.94
As the data frame above shows, the best model, the one with the lowest MAE on the test data (the data we want to forecast), is the msts-tbats model, with test errors of 7.42, 9.76, and 6.94 for sxk97, sxk9e, and sxk9s respectively.
As a reminder, these results were produced from the split training data. To forecast the real future demand, I have to combine the training and test data back into the full dataset. This step is explained in the next section.
Now, the final step is to build the last forecast model. It is the same as the previous steps, but we use the combined train and test data as our new dataset.
scot <- scot %>%
mutate(fulldata = map2(train, test, ~ bind_rows(.x, .y))) %>%
select(src_sub_area, fulldata, everything(), -train, -test)
scot
## # A tibble: 21 x 9
## # Groups: src_sub_area [3]
## src_sub_area fulldata data_fun_name data_fun model_name model fitted
## <chr> <list> <chr> <list> <chr> <lis> <list>
## 1 sxk97 <tibble~ ts <fn> stlm <fn> <stlm>
## 2 sxk97 <tibble~ msts <fn> tbats <fn> <tbat~
## 3 sxk97 <tibble~ ts <fn> tbats <fn> <tbat~
## 4 sxk97 <tibble~ ts <fn> ets <fn> <ets>
## 5 sxk97 <tibble~ msts <fn> holt.wint~ <fn> <HltW~
## 6 sxk97 <tibble~ ts <fn> holt.wint~ <fn> <HltW~
## 7 sxk97 <tibble~ ts <fn> auto.arima <fn> <fr_A~
## 8 sxk9e <tibble~ ts <fn> stlm <fn> <stlm>
## 9 sxk9e <tibble~ msts <fn> tbats <fn> <tbat~
## 10 sxk9e <tibble~ ts <fn> tbats <fn> <tbat~
## # ... with 11 more rows, and 2 more variables: train_error <dbl>,
## # test_error <dbl>
Do the nested fitting
scot <- scot %>%
mutate(
params = map(fulldata, ~ list(x = .x)),
data = invoke_map(data_fun, params),
params = map(data, ~ list(x = .x)),
fitted = invoke_map(model, params)
) %>%
select(-data, -params)
After that, we forecast the next 7 days.
scot <- scot %>%
mutate(forecast =
map(fitted, ~ forecast(.x, h = 24 * 7 )) %>%
map2(fulldata, ~ tibble(
start_time = timetk::tk_make_future_timeseries(.y$start_time, 24 * 7 ),
demand = as.vector(.x$mean)
))
)
# save the data for evaluation
scot_test_full <- scot
scot
## # A tibble: 21 x 10
## # Groups: src_sub_area [3]
## src_sub_area fulldata data_fun_name data_fun model_name model fitted
## <chr> <list> <chr> <list> <chr> <lis> <list>
## 1 sxk97 <tibble~ ts <fn> stlm <fn> <stlm>
## 2 sxk97 <tibble~ msts <fn> tbats <fn> <tbat~
## 3 sxk97 <tibble~ ts <fn> tbats <fn> <tbat~
## 4 sxk97 <tibble~ ts <fn> ets <fn> <ets>
## 5 sxk97 <tibble~ msts <fn> holt.wint~ <fn> <HltW~
## 6 sxk97 <tibble~ ts <fn> holt.wint~ <fn> <HltW~
## 7 sxk97 <tibble~ ts <fn> auto.arima <fn> <fr_A~
## 8 sxk9e <tibble~ ts <fn> stlm <fn> <stlm>
## 9 sxk9e <tibble~ msts <fn> tbats <fn> <tbat~
## 10 sxk9e <tibble~ ts <fn> tbats <fn> <tbat~
## # ... with 11 more rows, and 3 more variables: train_error <dbl>,
## # test_error <dbl>, forecast <list>
To obtain the best model, filter() again for the lowest error.
scot_best_full <- scot %>%
select(-fitted) %>%
group_by(src_sub_area) %>%
filter(test_error == min(test_error)) %>%
ungroup()
scot_best_full
## # A tibble: 3 x 9
## src_sub_area fulldata data_fun_name data_fun model_name model train_error
## <chr> <list> <chr> <list> <chr> <lis> <dbl>
## 1 sxk97 <tibble~ msts <fn> tbats <fn> 4.69
## 2 sxk9e <tibble~ msts <fn> tbats <fn> 6.48
## 3 sxk9s <tibble~ msts <fn> tbats <fn> 4.81
## # ... with 2 more variables: test_error <dbl>, forecast <list>
Finally, we have to unnest the results using the unnest() function.
scot <- scot %>%
select(src_sub_area, actual = fulldata, forecast) %>%
gather(key, value, -src_sub_area) %>%
unnest(value) %>%
mutate(demand = rec_revert(demand, rec, src_sub_area))
scot
## # A tibble: 35,280 x 4
## # Groups: src_sub_area [3]
## src_sub_area key start_time demand
## <chr> <chr> <dttm> <dbl>
## 1 sxk97 actual 2017-10-01 00:00:00 6
## 2 sxk97 actual 2017-10-01 01:00:00 4
## 3 sxk97 actual 2017-10-01 02:00:00 9
## 4 sxk97 actual 2017-10-01 03:00:00 2
## 5 sxk97 actual 2017-10-01 04:00:00 5
## 6 sxk97 actual 2017-10-01 05:00:00 4
## 7 sxk97 actual 2017-10-01 06:00:00 1
## 8 sxk97 actual 2017-10-01 07:00:00 0
## 9 sxk97 actual 2017-10-01 08:00:00 3
## 10 sxk97 actual 2017-10-01 09:00:00 3
## # ... with 35,270 more rows
Now, to get a better view, we can visualize the data in a graph.
p2 <- scot %>%
ggplot(aes(x = start_time, y = demand, colour = key)) +
geom_line() +
labs(x = NULL, y = NULL, colour = NULL) +
facet_wrap(~ src_sub_area, scale = "free", ncol = 1) +
labs(x = "", y = "Demand",title = "Final Model Prediction Comparison")+
tidyquant::theme_tq() +
tidyquant::scale_colour_tq()
ggplotly(p2)
To conclude the time-series process, I will filter the data for the forecast key.
scot %>%
filter(key == "forecast") %>%
select(-key) %>%
select(src_sub_area, datetime = start_time, demand) %>%
ggplot(mapping = aes(x = datetime, y= demand)) +
geom_line()+
labs(x = NULL, y = NULL, colour = NULL) +
facet_wrap(~ src_sub_area, scale = "free", ncol = 1) +
labs(x = "", y = "Demand",title = "Demand Forecast for December 3rd - December 9th 2017")+
tidyquant::theme_tq() +
tidyquant::scale_colour_tq()
To check the quality of the forecast results, we can use MAE (mean absolute error) and evaluate the error for each src_sub_area.
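The evaluation chunk is not shown; a sketch of how the per-model MAE on the test week could be computed from scot_test, using yardstick's mae_vec():
# sketch: compare each forecast key against the actual series, then
# summarise the mean absolute error per sub-area and model
scot_test %>%
filter(key != "actual") %>%
left_join(
scot_test %>%
ungroup() %>%
filter(key == "actual") %>%
select(src_sub_area, start_time, actual = demand),
by = c("src_sub_area", "start_time")
) %>%
group_by(src_sub_area, key) %>%
summarise(mae = mae_vec(truth = actual, estimate = demand)) %>%
arrange(src_sub_area, mae)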
Here are the test-period forecast results for each model:
## # A tibble: 3,528 x 4
## # Groups: src_sub_area [3]
## src_sub_area key start_time demand
## <chr> <chr> <dttm> <dbl>
## 1 sxk97 msts-holt.winter 2017-11-26 00:00:00 24
## 2 sxk97 msts-holt.winter 2017-11-26 01:00:00 27
## 3 sxk97 msts-holt.winter 2017-11-26 02:00:00 20
## 4 sxk97 msts-holt.winter 2017-11-26 03:00:00 14
## 5 sxk97 msts-holt.winter 2017-11-26 04:00:00 9
## 6 sxk97 msts-holt.winter 2017-11-26 05:00:00 6
## 7 sxk97 msts-holt.winter 2017-11-26 06:00:00 1
## 8 sxk97 msts-holt.winter 2017-11-26 07:00:00 3
## 9 sxk97 msts-holt.winter 2017-11-26 08:00:00 5
## 10 sxk97 msts-holt.winter 2017-11-26 09:00:00 5
## # ... with 3,518 more rows
I can filter() for the best error.
## # A tibble: 3 x 5
## src_sub_area data_fun_name model_name train_error test_error
## <chr> <chr> <chr> <dbl> <dbl>
## 1 sxk97 msts tbats 4.69 7.42
## 2 sxk9e msts tbats 6.48 9.76
## 3 sxk9s msts tbats 4.81 6.94
Results:
msts-tbats produced the lowest error among all the models.
Here are the final demand forecast results:
## # A tibble: 3,528 x 4
## # Groups: src_sub_area [3]
## src_sub_area key start_time demand
## <chr> <chr> <dttm> <dbl>
## 1 sxk97 forecast 2017-12-03 00:00:00 43
## 2 sxk97 forecast 2017-12-03 01:00:00 28
## 3 sxk97 forecast 2017-12-03 02:00:00 23
## 4 sxk97 forecast 2017-12-03 03:00:00 15
## 5 sxk97 forecast 2017-12-03 04:00:00 9
## 6 sxk97 forecast 2017-12-03 05:00:00 5
## 7 sxk97 forecast 2017-12-03 06:00:00 5
## 8 sxk97 forecast 2017-12-03 07:00:00 10
## 9 sxk97 forecast 2017-12-03 08:00:00 20
## 10 sxk97 forecast 2017-12-03 09:00:00 29
## # ... with 3,518 more rows
Filter for the lowest error:
## # A tibble: 3 x 5
## src_sub_area data_fun_name model_name train_error test_error
## <chr> <chr> <chr> <dbl> <dbl>
## 1 sxk97 msts tbats 4.69 7.42
## 2 sxk9e msts tbats 6.48 9.76
## 3 sxk9s msts tbats 4.81 6.94
Results:
msts-tbats also produces the lowest error for the next 7 days. All three areas have train_error and test_error MAE values below 10, which is good news.
Last but not least, I am going to check two assumptions of a good time-series forecasting model: no autocorrelation in the residuals and normally distributed residuals.
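The test chunk is hidden; the Ljung-Box output below is presumably produced with stats::Box.test(), e.g.:
# Ljung-Box test on the residuals of the first fitted model
Box.test(residuals(scot_test_full$fitted[[1]]), type = "Ljung-Box")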
##
## Box-Ljung test
##
## data: residuals(scot_test_full$fitted[[1]])
## X-squared = 0.0095585, df = 1, p-value = 0.9221
Based on the hypothesis-testing criteria:
H0: no autocorrelation (p-value > 0.05)
H1: autocorrelation (p-value < 0.05)
Since the p-value of 0.9221 > 0.05, the model shows no autocorrelation: the residuals are not correlated with each other.
##
## Box-Ljung test
##
## data: residuals(scot_test_full$fitted[[2]])
## X-squared = 0.005713, df = 1, p-value = 0.9397
Based on the hypothesis-testing criteria:
H0: no autocorrelation (p-value > 0.05)
H1: autocorrelation (p-value < 0.05)
Since the p-value of 0.94 > 0.05, the model shows no autocorrelation: the residuals are not correlated with each other. The model captures the available information for this sub-area well, so it can be called a good model.
##
## Box-Ljung test
##
## data: residuals(scot_test_full$fitted[[3]])
## X-squared = 0.27352, df = 1, p-value = 0.601
Based on the hypothesis-testing criteria:
H0: no autocorrelation (p-value > 0.05)
H1: autocorrelation (p-value < 0.05)
Since the p-value of 0.60 > 0.05, the model shows no autocorrelation: the residuals are not correlated with each other.
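The conclusion below also refers to residual histograms; a minimal sketch for eyeballing the normality assumption:
# sketch: histograms of the residuals for the three models tested above
par(mfrow = c(3, 1))
for (i in 1:3) {
hist(as.numeric(residuals(scot_test_full$fitted[[i]])),
main = paste("Residuals of fitted model", i), xlab = "residual")
}
par(mfrow = c(1, 1))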
Having completed the whole time-series forecasting process, we can say that the best model, the one with the lowest error for this dataset, is msts-tbats, with MAE < 10. Based on the assumption checks, the residuals for areas sxk97, sxk9e, and sxk9s show no autocorrelation. This is great news for a time-series model, since it means the available information is used well; hence, the model is effectively built. Moreover, the three areas' residuals look normally distributed based on their histograms. Overall, we can say that automated time-series forecasting is a pretty good option for this dataset.
I will read the submission data and store it in the submission object.
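The reading chunk is not shown; a sketch, assuming a CSV template (the file name is hypothetical):
# hypothetical file name for the submission template
submission <- read_csv("data/scotty_submission.csv")
head(submission, 5)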
## # A tibble: 5 x 3
## src_sub_area datetime demand
## <chr> <dttm> <lgl>
## 1 sxk97 2017-12-03 00:00:00 NA
## 2 sxk97 2017-12-03 01:00:00 NA
## 3 sxk97 2017-12-03 02:00:00 NA
## 4 sxk97 2017-12-03 03:00:00 NA
## 5 sxk97 2017-12-03 04:00:00 NA
Next, the following code assembles the submission.
Obtain best_model from the scotty_error object created earlier. These are the errors of msts-tbats.
best_model <- scotty_error %>%
group_by(src_sub_area) %>%
arrange(test_error) %>%
slice(1) %>%
ungroup() %>%
select(src_sub_area, ends_with("_name"),test_error)
best_model
## # A tibble: 3 x 4
## src_sub_area data_fun_name model_name test_error
## <chr> <chr> <chr> <dbl>
## 1 sxk97 msts tbats 7.42
## 2 sxk9e msts tbats 9.76
## 3 sxk9s msts tbats 6.94
Join with scotty_error once more to retrieve the fitted models:
best_model2 <- best_model %>%
select(-test_error) %>%
left_join(scotty_error) %>%
select(-c(test_error,train_error))
best_model2
## # A tibble: 3 x 8
## src_sub_area data_fun_name model_name train test data_fun model
## <chr> <chr> <chr> <list<df[,> <list<df> <list> <lis>
## 1 sxk97 msts tbats [1,344 x 2] [168 x 2] <fn> <fn>
## 2 sxk9e msts tbats [1,344 x 2] [168 x 2] <fn> <fn>
## 3 sxk9s msts tbats [1,344 x 2] [168 x 2] <fn> <fn>
## # ... with 1 more variable: fitted <list>
Another small adjustment: attach each forecast to the submission template's datetime index.
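The chunk creating submission_nest is not shown either; a sketch, nesting the submission template per sub-area so its datetime index can be paired with each forecast:
# sketch: nest the submission template by sub-area
submission_nest <- submission %>%
group_by(src_sub_area) %>%
nest()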
sub_forecast <- best_model2 %>%
mutate(
forecast =
map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
map2(submission_nest$data,
~ tibble(
datetime = .y$datetime,
demand = as.vector(.x$mean)
)),
key = paste(data_fun_name, model_name, sep = "-")
)
sub_forecast
## # A tibble: 3 x 10
## src_sub_area data_fun_name model_name train test data_fun model
## <chr> <chr> <chr> <list<df[,> <list<df> <list> <lis>
## 1 sxk97 msts tbats [1,344 x 2] [168 x 2] <fn> <fn>
## 2 sxk9e msts tbats [1,344 x 2] [168 x 2] <fn> <fn>
## 3 sxk9s msts tbats [1,344 x 2] [168 x 2] <fn> <fn>
## # ... with 3 more variables: fitted <list>, forecast <list>, key <chr>
Now, let’s build the final submission!
final_submission <- sub_forecast %>%
select(src_sub_area, forecast) %>%
unnest(forecast) %>%
mutate(demand = rec_revert(demand, rec, src_sub_area))
final_submission
## # A tibble: 504 x 3
## src_sub_area datetime demand
## <chr> <dttm> <dbl>
## 1 sxk97 2017-12-03 00:00:00 20
## 2 sxk97 2017-12-03 01:00:00 17
## 3 sxk97 2017-12-03 02:00:00 12
## 4 sxk97 2017-12-03 03:00:00 9
## 5 sxk97 2017-12-03 04:00:00 7
## 6 sxk97 2017-12-03 05:00:00 4
## 7 sxk97 2017-12-03 06:00:00 3
## 8 sxk97 2017-12-03 07:00:00 7
## 9 sxk97 2017-12-03 08:00:00 13
## 10 sxk97 2017-12-03 09:00:00 13
## # ... with 494 more rows
I will export the data to a csv file.
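A sketch of the export (the file name is an assumption):
# write the final forecast to disk
write_csv(final_submission, "scotty_final_submission.csv")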
So, that's all for the process of automated time-series forecasting using the purrr package in R. I hope this page helps you understand the time-series problem and the solution behind it.
See you in the other page!
Author,
Alfado Sembiring
Notes:
In case you want to look up my profile, click the link below:
Jump To My Profile