1 Merhaba, Benim Adım Scotty

1.1 Uber Fast

Ride hailing apps have been the pinnacle of the new millennium milestone. Take Uber, for example, which has emerged as one of the highest valued pioneers of digital transportation¹. It went from a small ride hailing idea in San Francisco, USA to a successful unicorn with USD16.5 billion worth of gross bookings in Q3 2019². That success was quickly recognized by business players who wanted in on their fortune. Competing ride hailing apps began to pop up locally while the global market adjusted the idea with their biggest road problems.

In Turkey, for example, Uber’s convenience does not offer much in compatison with the country’s biggest transportation problem³. Former Uber fleet partner, Tarkan Altan, recognized how much time the drivers lost during shift transitions because of the city’s immense traffic. He, then, took action by switching to motorcycles in order to save time, which eventually led to the launch of Scotty.

1.2 Turkish Delight

What Uber doesn’t give is time. It gives you luxury, a better performance, a better ride, but it doesn’t change if you are stuck in a Mercedes or a taxi — you’re stuck in traffic.

Scotty started in 2017 as a mobile application that allows motorcyclists to take passengers to a destination of their choosing⁴. The app, named after the Star Trek character Montgomery Scott, was originally programmed to solve Istanbul’s traffic problem through motorcycle rides, but then expanded to food and package deliveries. Its competitive advantage lies within the relatively affordable price. With the Scotty app, you can go from Beşiktaş to Bebek, in which the estimated distance is 5 km, for only TL10 - or equivalent to USD1.72. The price is almost one-third of the average taxi fare in Turkey.

The app quickly rose to prominence - gaining 3 million trips within its first 18 months of operation, but is now seemingly focus on profitability. In a Glocal Podcast, Webrazzi reported that Tarkan Altan was discussing about how Scotty would focus 50% of his time on fundraising, 30% on recruitment, and the remaining 20% on strategy in the near future⁵. It is an understandable move to bluntly state his focus on financial profitability, but its lack of focus in strategy could damage the company’s ability to attract investors. After all, the future prospect of a company’s operational value is the most tangible forecast there is.

1.3 Future Enterprise

In order to balance its focus on fundraising and strategy, Scotty needs to construct an automated forecasting framework that could help its operation - namely, predicting its demand. The demand can be predicted using the following dataset, with only the first 100 observations being shown.

id: Transaction ID
trip_id: Trip ID
driver_id: Driver ID
rider_id: Rider or the consumer’s ID
start_time: Pick up time calculated when the driver presses “go”
src_lat: Request source latitude
src_lon: Request source longitude
src_area: Request source area
src_sub_area: Request source sub-area
dest_lat: Requested destination latitude
dest_lon: Requested destination longitude
dest_area: Requested destination area
dest_sub_area: Requested destination sub-area
distance: Trip distance (in KM)
status: Trip status which will all be considered as demand
confirmed_time_sec: Time difference from the request to confirmation status by the driver (in seconds)

The data is saved as an object named data_train and consists of 90,113 time stamps as well as 16 variables. These variables do not include direct hourly demand, which can be difficult for us to build the forecasting framework. However, through various coding mechanisms, we can construct an accurate number of hourly demand based on the existing time series data. To boldly go where no man has gone before, we will start small by forecasting the next 7 days of the existing time frame.

2 Discovery

This section explains and elaborates the data preprocessing.

2.1 Package for Mr. Scott

After calling the data and saving as an object named data_train in subsection 1.3, the next logical move would be to set up the packages. These packages will help us in creating a more profound and efficient way of coding. A set of said packages consists of the following, with the # symbol explaining its use.

library(forecast)   #for time series modeling and forecasting; autoplot
library(yardstick)  #for measuring forecasting performances; mae_vec
library(recipes)    #for data preprocessing
library(purrr)      #for nested model fitting
library(dplyr)      #for better data manipulation
library(lubridate)  #for restructuring a certain type of date
library(magrittr)   #an extension of dplyr for piping syntax
library(timetk)     #additional toolkit needed for time series models
library(tidyquant)  #for creating better ggplot aesthetics; theme_tq
library(padr)       #for adjusting the time stamp
library(tidyverse)  #for data wrangling
library(tidymodels) #for supporting an easier model fitting process
library(ggplot2)    #for creating a plot based on the desired model
library(plotly)     #for generating an interactive visualization
library(MLmetrics)  #for machine learning evaluation metrics

The object data_train, as explained in subsection 1.3, has 16 variables that may need some redacting. If anything, keeping all these variables might result in overfitting. It is better to tweak the dataset and solely take the important variables for forecasting.

scotty_train <- data_train[,c(1,5,9)]

head(scotty_train)

## # A tibble: 6 x 3
##   id                       start_time          src_sub_area
##   <chr>                    <dttm>              <chr>       
## 1 59d005e1ffcfa261708ce9cd 2017-10-01 00:00:17 sxk9s       
## 2 59d0066affcfa261708ceb11 2017-10-01 00:02:34 sxk9e       
## 3 59d006a1ffcfa261708ceba4 2017-10-01 00:03:29 sxk9s       
## 4 59d006d8cb564761a8fe5efd 2017-10-01 00:04:24 sxk9e       
## 5 59d0073ecb564761a8fe5fdc 2017-10-01 00:06:06 sxk9e       
## 6 59d007c3ffcfa261708ceee1 2017-10-01 00:08:19 sxk9s

Reasons as to why the remaining three variables stay are summarized into the following.

id: to keep track on the transactions.
start_time: to reconstruct the time frame so one can extract the hourly demand.
src_sub_area: to differentiate demand based on locations.

2.2 Scotty’s Aggregation

head(scotty_train$start_time)

## [1] "2017-10-01 00:00:17 UTC" "2017-10-01 00:02:34 UTC"
## [3] "2017-10-01 00:03:29 UTC" "2017-10-01 00:04:24 UTC"
## [5] "2017-10-01 00:06:06 UTC" "2017-10-01 00:08:19 UTC"

tail(scotty_train$start_time)

## [1] "2017-12-02 23:57:29 UTC" "2017-12-02 23:57:45 UTC"
## [3] "2017-12-02 23:58:52 UTC" "2017-12-02 23:58:55 UTC"
## [5] "2017-12-02 23:58:59 UTC" "2017-12-02 23:59:20 UTC"

If we take a peek at start_time, the time frame presented is in a complete date and time format up to the very last second. This may be pointless to include as one specific value as the probability of the exact same second appearing at least twice within the data is close to zero. Logically, what one could do to fulfill the time series requirement is by grouping each observation based on the hour. With this logical verity, one would have more than enough repeated data of each hour from 12AM to 11PM.

In order to group each time stamp, floor_date() function can be used from the lubridate package. The syntax will resemble the following.

scotty_train$datetime <- floor_date(scotty_train$start_time,
                                      unit = "hour")
scotty_train1 <- scotty_train[,-2]
head(scotty_train1)

## # A tibble: 6 x 3
##   id                       src_sub_area datetime           
##   <chr>                    <chr>        <dttm>             
## 1 59d005e1ffcfa261708ce9cd sxk9s        2017-10-01 00:00:00
## 2 59d0066affcfa261708ceb11 sxk9e        2017-10-01 00:00:00
## 3 59d006a1ffcfa261708ceba4 sxk9s        2017-10-01 00:00:00
## 4 59d006d8cb564761a8fe5efd sxk9e        2017-10-01 00:00:00
## 5 59d0073ecb564761a8fe5fdc sxk9e        2017-10-01 00:00:00
## 6 59d007c3ffcfa261708ceee1 sxk9s        2017-10-01 00:00:00

From the table above, we can see that the start_time column has been omitted. This was a personal choice as it would be easier to use datetime as a representative variable for explaining the values of both date and time.

If we continue by looking at the first six results, we can see that the datetime variable only displays the date. This could be a misdirect for the first few observations as each time stamp for said occasion has been changed to all zeros. In reality, the data frame has understood that what they are keeping is a set of variables clustered to the 12 AM slot.

To prove my point that this was not a coding error, let’s take a peek at the last six observations.

tail(scotty_train1)

## # A tibble: 6 x 3
##   id                       src_sub_area datetime           
##   <chr>                    <chr>        <dttm>             
## 1 5a2313b99af1344c65bdcd23 sxk97        2017-12-02 23:00:00
## 2 5a2313c9ecb4a04c66871562 sxk9s        2017-12-02 23:00:00
## 3 5a23140c9af1344c65bdcfcc sxk97        2017-12-02 23:00:00
## 4 5a23140f490a204c2d2e4f64 sxk9e        2017-12-02 23:00:00
## 5 5a231413ecb4a04c668717d6 sxk9s        2017-12-02 23:00:00
## 6 5a2314289af1344c65bdd0b2 sxk9e        2017-12-02 23:00:00

From the last six observations, we can clearly see that the data frame has stored and displayed the clustered hours. This gives us confirmation that the hypothesis regarding the misdircet is true. Therefore, we should carry on and start building demand.

When we talk about forecasting ride hailing demand, transaction IDs are not always important. Basic time series predictions usually require one variable name as both the predictor(s) and the target variable. Since transaction IDs are useless in our scenario of forecasting future demand, it is better to omit the variable entirely.

scotty_train2 <- scotty_train1[,-1]
scotty_train2

## # A tibble: 90,113 x 2
##    src_sub_area datetime           
##    <chr>        <dttm>             
##  1 sxk9s        2017-10-01 00:00:00
##  2 sxk9e        2017-10-01 00:00:00
##  3 sxk9s        2017-10-01 00:00:00
##  4 sxk9e        2017-10-01 00:00:00
##  5 sxk9e        2017-10-01 00:00:00
##  6 sxk9s        2017-10-01 00:00:00
##  7 sxk9s        2017-10-01 00:00:00
##  8 sxk9s        2017-10-01 00:00:00
##  9 sxk9s        2017-10-01 00:00:00
## 10 sxk9s        2017-10-01 00:00:00
## # … with 90,103 more rows

Surprisingly, from the new data frame of scotty_train2, the time stamp of 12 AM begins to appear.

Now that we have extracted the important variables or columns, we shall proceed with the development of demand. This mechanism is done through piping each syntax using the dplyr package.

scotty_train3 <- scotty_train2 %>% 
  group_by(src_sub_area, datetime) %>%
  mutate(demand = n()) %>%
  ungroup()

scotty_train3

## # A tibble: 90,113 x 3
##    src_sub_area datetime            demand
##    <chr>        <dttm>               <int>
##  1 sxk9s        2017-10-01 00:00:00     10
##  2 sxk9e        2017-10-01 00:00:00      8
##  3 sxk9s        2017-10-01 00:00:00     10
##  4 sxk9e        2017-10-01 00:00:00      8
##  5 sxk9e        2017-10-01 00:00:00      8
##  6 sxk9s        2017-10-01 00:00:00     10
##  7 sxk9s        2017-10-01 00:00:00     10
##  8 sxk9s        2017-10-01 00:00:00     10
##  9 sxk9s        2017-10-01 00:00:00     10
## 10 sxk9s        2017-10-01 00:00:00     10
## # … with 90,103 more rows

Demand is now stored as a new variable through the mutate function. The variable is a construction done by grouping src_sub_area and calculating the unique values of each area, subjected by the hour.

2.3 Time Padding

Since we are developing a time series model, it is best to identify the time interval first.

scotty_min <- min(scotty_train3$datetime)
scotty_min

## [1] "2017-10-01 UTC"

scotty_max <- max(scotty_train3$datetime)
scotty_max

## [1] "2017-12-02 23:00:00 UTC"

Based on the result of the syntax, we can see that the time interval ranges from January 1st, 2017 at 12 AM to December 2nd, 2017 at 11 PM. However, it appears that the time stamp has disappeared once again for the 12 AM mark, unlike the one presented in scotty_train3.

start_val <- make_datetime(year = year(scotty_min),
                           month = month(scotty_min),
                           day = day(scotty_min),
                           hour = 0)
start_val

## [1] "2017-10-01 UTC"

end_val <- make_datetime(year = year(scotty_max),
                         month = month(scotty_max),
                         day = day(scotty_max),
                         hour = 23)
end_val

## [1] "2017-12-02 23:00:00 UTC"

If we create two new objects that would define each meaning on the datetime variable, we could clearly see that the 12 AM mark remains invisible. It is incredibly frustrating to solve this, but at least we get two new objects that could be useful later on.

When analyzing a time series dataset, one would often find themselves working with a lot of gaps⁶. This means that the dataset we encounter does not have regular intervals, in which the value of each defined time period is numerically available. If there is no value for a certain time period, the observation should still exist with the value written as 0. However, this is not the case for scotty_train3.

Time gaps can be identified using the time padding function of pad() from the padr package. Aside from identifying time gaps, it also puts the missing time frame in as its own row of observation.

scotty_train4 <- scotty_train3 %>% 
  group_by(src_sub_area) %>%
  pad(start_val = start_val,
      end_val = end_val) %>%
  ungroup() %>% 
  distinct()
  
scotty_train4

## # A tibble: 4,536 x 3
##    src_sub_area datetime            demand
##    <chr>        <dttm>               <int>
##  1 sxk97        2017-10-01 00:00:00      6
##  2 sxk97        2017-10-01 01:00:00      4
##  3 sxk97        2017-10-01 02:00:00      9
##  4 sxk97        2017-10-01 03:00:00      2
##  5 sxk97        2017-10-01 04:00:00      5
##  6 sxk97        2017-10-01 05:00:00      4
##  7 sxk97        2017-10-01 06:00:00      1
##  8 sxk97        2017-10-01 07:00:00     NA
##  9 sxk97        2017-10-01 08:00:00      3
## 10 sxk97        2017-10-01 09:00:00      3
## # … with 4,526 more rows

As one can see from the newly-reformed scotty_train4, time gaps exist. This argument is drawn from the appearance of NA in row 8 and so on. What this implies, then, is that at a certain hour, there is no demand or a single booking request, subjected by the area.

The value of NA gives strong implications, but is useless in a series of numerical values. Thus, in order to proceed to forecasting, we shall transform it to the numerical value of zero.

scotty_train5 <- scotty_train4 %>% 
  mutate(demand = replace_na(demand, 0))

scotty_train5

## # A tibble: 4,536 x 3
##    src_sub_area datetime            demand
##    <chr>        <dttm>               <dbl>
##  1 sxk97        2017-10-01 00:00:00      6
##  2 sxk97        2017-10-01 01:00:00      4
##  3 sxk97        2017-10-01 02:00:00      9
##  4 sxk97        2017-10-01 03:00:00      2
##  5 sxk97        2017-10-01 04:00:00      5
##  6 sxk97        2017-10-01 05:00:00      4
##  7 sxk97        2017-10-01 06:00:00      1
##  8 sxk97        2017-10-01 07:00:00      0
##  9 sxk97        2017-10-01 08:00:00      3
## 10 sxk97        2017-10-01 09:00:00      3
## # … with 4,526 more rows

The value of NA has now been manipulated into a useable value of 0.

3 First Contact

This section unravels the cross-validation scheme.

3.1 Clocking In

In subsection 2.3, the hourly demand has been successfully constructed. If we proceed to visualize, we can obtain some insights regarding the pattern of said demand like the following.

plot3_1 <- ggplot(scotty_train5,
                  aes(x = datetime, y = demand)) +
  geom_line(aes(col = src_sub_area)) +
  labs(x = NULL, y = "Number of Hourly Demand",
       title = "Exhibit 3.1 Hourly Demand in Each Sub-Area") +
  facet_wrap(~ src_sub_area, scale = "free_y",
             ncol = 1) +
  scale_color_brewer(palette = "Accent") +
  theme(legend.position = "none")

ggplotly(plot3_1)

From Exhibit 3.1, there are two striking spikes that clearly stood out; one from November 3 and the other from November 24. The reason for this cannot be easily predicted, though one could hypothesize that the shock came from Friday night events.

According to various Google Search results, Friday, November 3 and Friday, November 24 was an active date for night owls. With many music and clubbing events, the increase in demand could come from said activities. However, there is no legitimate reason to accept this trivial observation.

Since it is difficult to confirm a trivial observation on a demand shock, one could stick to what they know best; statistical visualization. This can be done by plotting a daily demand pattern by the hour like the following.

sxk97_subset <- scotty_train5 %>%
  filter(src_sub_area == "sxk97") %>%
  .$demand

plot3_2 <- ggseasonplot(ts(sxk97_subset, frequency = 24),
             polar = T) +
  theme(legend.position = "none") +
  labs(title = "Exhibit 3.2 Daily Demand Pattern",
       x = NULL,
       caption = "Numbers Around the Circle = Hour") +
  scale_y_sqrt()

plot3_2

Exhibit 3.2 uses the sub-area of sxk97 to sample the pattern and to keep the analysis short and precise. From the exhibit, which shapes like an actual clock, it is inferred that the overall demand is statistically high around 6 PM. This is highly logical if you cross-check the result with the average working hour that ends around 5:30 PM in Turkey⁷. It can also be inferred that the overall demand is relatively low around 5 AM. However, both analysis is pretty short-sighted as it only groups the data based on the hour.

In reality, the general public has a different daily habit when it comes to transportation consumption. For instance, the working age are more likely to travel by public transportations or other non-private vehicles during working days. They are also more likely to go out on a Friday night, which would explain the two spikes in Exhibit 3.1. These examples, then, provide us with a strong instinct that daily patterns can deviate from one another, but may show signs of repetition in the coming weeks. Therefore, weekly patterns should not be forgotten.

In order to plot the weekly pattern, the frequency must be adjusted by multiplying the total of hours with 7 days. The product of this multiplication will result into a sequential number that arranges the hour by day. As a result, numbers 1 to 24 represents 12 AM to 11 PM on a Sunday like the following.

plot3_3 <- ggseasonplot(ts(sxk97_subset, frequency = 24*7),
                        polar = T) +
  theme(legend.position = "none") +
  labs(title = "Exhibit 3.3 Weekly Demand Pattern",
       x = NULL,
       caption = "Numbers Around the Circle = Date and Hour Identification") +
  scale_y_sqrt()

plot3_3

Before comprehending the complicated number reading, it is best to define the numbers circling around.

1-24: Sunday, 12 AM - 11 PM
25-48: Monday, 12 AM - 11 PM
49-72: Tuesday, 12 AM - 11 PM
73-96 : Wednesday, 12 AM - 11 PM
97-120: Thursday, 12 AM - 11 PM
121-144: Friday, 12 AM - 11 PM
145-168: Saturday, 12 AM - 11 PM

From Exhibit 3.3, we could clearly see that the spikes are leaning towards the number 139, with the density grouping mostly to its surrounding numbers. Further validation with the stated definition implies that these spikes refer to a massive amount of demand on Fridays at 6 PM. Thus, we can conclude that the highest order is statistically demanded on Fridays around 6 PM.

Moreover, there is something incomplete about stopping our analysis at the weekly level. Going back to the range, the data spans over 2 months which gives us more seasonality to observe. We might as well continue the visualization by observing the monthly pattern like the following.

plot3_4 <- ggseasonplot(ts(sxk97_subset,frequency = 24*7*4),
             polar = T) +
  theme(legend.position = "none") +
  labs(title = "Exhibit 3.4 Monthly Demand Pattern",
       x = NULL,
       caption = "Numbers Around the Circle = Week, Date, and Hour Identification") +
  scale_y_sqrt()

plot3_4

Exhibit 3.4 offers a more complicated reading of the visualization. This is due to the fact that the data is trying to fit as many values as possible. The more information it receives, the more complex it gets. While this may not be a problem in some cases - heck, even a blessing, it presents a pretty difficult barrier to tackle. Therefore, we can stop the visualization analysis in Exhibit 3.3.

Another reason as to why it might be preeminent to stop at Exhibit 3.3 is the lack of month included. The data generously includes values from October to December, but it lacks the necessary repetition. When one is concerned with repeated patterns, the observation of monthly patterns becomes obsolete.

3.2 Tis The Season

In subsection 3.1, it is implied that the best frequency or seasonality to use is the weekly one. Although arguments were made as to why this is the logical choice, it is better to compare the decomposition of each seasonality like the following.

dcm_scotty <- scotty_train5 %>%
  filter(src_sub_area == "sxk97") %>%
  .$demand %>%
  msts(.,seasonal.periods = c(24,24*7))

plot_dcm <- autoplot(mstl(dcm_scotty)) +
  labs(title = "Exhibit 3.5 Data Decomposition by the Daily, Weekly, and Monthly Pattern",
       x = "Respective Time Frame")

plot_dcm

Exhibit 3.5 showcases 3 important things.

Trend: Confirming whether the data is visually smooth.
Seasonal24: The daily demand pattern.
Seasonal168: The weekly demand pattern.

The first thing that one needs to evaluate is the trend. This part is pretty tricky because visualization is highly subjective. However, for the purpose of simplifying our analysis, we can conclude that the overall trend is pretty smooth with a noticeable increasing pattern. Still, we should note that the smoothness of a trend line is debatable.

The second thing that one needs to evaluate is the seasonality comparison. In response to the smoothness, it is obvious that the weekly demand pattern is relatively smoother. This implies that, statistically, the weekly pattern is a better seasonal fit for modeling this time series data.

3.3 Before Chekov Takes Off

When it comes to developing multiple models, the best practice is to split some portion of the data into two separate objects. The argument here is that the models needs to learn from what the human can present and what one can present is a sequence of data. For humans, it is easier to learn because history provides us with past problems and solutions, but models do not have that luxury of a guidance. It needs the help of an operator to provide what the “past” model looks like through sampling.

The strategy here is to split two datas; one for the model to learn and the other to forecast. Logically, if a model wants to be the fittest, it needs as many supplements as possible. Therefore, the data for learning should include as many weeks as possible since we are using the weekly seasonality.

When subjected to a total of 9 weeks within scotty_train5, one could ration the data 8:1. The logical reasoning behind this is that an eight week worth of data is the highest possible sample for the model to obtain. Thus, the data for learning - often referred as train, should be constructed from the first eight weeks. Meanwhile, the data for forecasting - often referred as test, will be constructed from the last week of the dataset.

In order to split the data into two separate entities of train and test, we shall define each by recursively getting the start and the end index, respectively, like the following.

# Step 1: Defining the size for each of the split data
test_size <- 24 * 7 * 1
train_size <- 24 * 7 * 8

# Step 2: Getting the minimal and the maximum value of the time index for each of the split data
test_end <- max(scotty_train5$datetime)
test_start <- test_end - hours(test_size) + hours(1)

train_end <- test_start - hours(1)
train_start <- train_end - hours(train_size) + hours(1)

train_end

## [1] "2017-11-25 23:00:00 UTC"

train_start

## [1] "2017-10-01 UTC"

# Step 3: Getting the time interval from each of the split data
train_int <- interval(train_start, train_end)
test_int <- interval(test_start, test_end)

train_int

## [1] 2017-10-01 UTC--2017-11-25 23:00:00 UTC

test_int

## [1] 2017-11-26 UTC--2017-12-02 23:00:00 UTC

The data has now been successfully split, with train_int being stored as a reference for the future plotted model to learn.

To visualize the presence of the recent scotty_train5 data, we can plot the following code.

plot3_6 <- scotty_train5 %>%
  mutate(sample = case_when(
    datetime %within% train_int ~ "train",
    datetime %within% test_int ~ "test"
  )) %>%
  drop_na() %>%
  mutate(sample = factor(sample,
                         levels = c("train", "test"))) %>%
  ggplot(aes(x = datetime, y = demand, color = sample)) +
    geom_line() +
    labs(title = "Exhibit 3.6 Visualization of the Data Train and Test by Sub-Areas",
         x = "",
         y = NULL,
         color = NULL) +
    facet_wrap(~ src_sub_area, scale = "free", ncol = 1)

ggplotly(plot3_6)

3.4 Recipe for Success

One of the most anxious parts of constructing a model is making sure that the data used for learning is as close as to whatever the parameter for accuracy is. Often times, one would neglect seeking the arbitrary roles that certain variables hold and thus, deeming the values as just another passing number. In reality, however, the set of numbers have meaning and each meaning ranges from a different scale.

In order to address the potential scaling issue, one could use the recipes package. This package allows us to design matrices for modeling, as well as to conduct variable preprocessing ⁸.

Referring to this case, what matters at hand is to use the recipes package to scale the demand for each area, subjected to the interval. Since the package works column-wise, we need to widen the data by spreading the sub-areas like the following.

scotty_wide <- scotty_train5 %>% 
  spread(src_sub_area, demand)

scotty_wide

## # A tibble: 1,512 x 4
##    datetime            sxk97 sxk9e sxk9s
##    <dttm>              <dbl> <dbl> <dbl>
##  1 2017-10-01 00:00:00     6     8    10
##  2 2017-10-01 01:00:00     4     2     3
##  3 2017-10-01 02:00:00     9     3     1
##  4 2017-10-01 03:00:00     2     2     0
##  5 2017-10-01 04:00:00     5     1     0
##  6 2017-10-01 05:00:00     4     2     0
##  7 2017-10-01 06:00:00     1     1     0
##  8 2017-10-01 07:00:00     0     0     1
##  9 2017-10-01 08:00:00     3     2     5
## 10 2017-10-01 09:00:00     3    11     5
## # … with 1,502 more rows

Afterwards, we shall heat up the stove and bake the recipe. It should be noted that the objective is to focus on the train data since it is the only one used for learning. Then, each cooking step will be referred to within the # symbol and the overall process will look like the following.

# Step 1: Add new actions to scale the demand based on extracting 
#         each square root and centering the data to have zero mean
rec <- recipe(~ ., filter(scotty_wide,
                          datetime %within% train_int)) %>%
  step_sqrt(all_numeric()) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric()) %>%
  prep()

# Step 2: Put it in the oven 
scotty_wide1 <- bake(rec, scotty_wide)

# Step 3: Take a glimpse of what it looks like
scotty_wide1

## # A tibble: 1,512 x 4
##    datetime             sxk97  sxk9e  sxk9s
##    <dttm>               <dbl>  <dbl>  <dbl>
##  1 2017-10-01 00:00:00 -0.594 -0.697 -0.111
##  2 2017-10-01 01:00:00 -0.841 -1.28  -0.780
##  3 2017-10-01 02:00:00 -0.291 -1.15  -1.12 
##  4 2017-10-01 03:00:00 -1.16  -1.28  -1.59 
##  5 2017-10-01 04:00:00 -0.711 -1.44  -1.59 
##  6 2017-10-01 05:00:00 -0.841 -1.28  -1.59 
##  7 2017-10-01 06:00:00 -1.39  -1.44  -1.59 
##  8 2017-10-01 07:00:00 -1.94  -1.85  -1.12 
##  9 2017-10-01 08:00:00 -0.989 -1.28  -0.544
## 10 2017-10-01 09:00:00 -0.989 -0.497 -0.544
## # … with 1,502 more rows

# Step 4: While waiting until it's baked, create a conversion
#         function to revert the format later on 
rec_revert <- function(vector, rec, varname) {

  # Step 4a: Store the current recipe values
  rec_center <- rec$steps[[2]]$means[varname]
  rec_scale <- rec$steps[[3]]$sds[varname]

  # Step 4b: Convert it back based on the piped recipe from Step 1
  results <- (vector * rec_scale + rec_center) ^ 2

  # Step 4c: Additional seasonings are welcomed if necessary
  results <- round(results)

  # Step 4d: Return the results
  results
}

# Step 5: Take it out of the oven and unwrap it by converting
#         back to its longer format
scotty_long <- scotty_wide1 %>% 
  gather(src_sub_area, demand, -datetime)
  
scotty_long

## # A tibble: 4,536 x 3
##    datetime            src_sub_area demand
##    <dttm>              <chr>         <dbl>
##  1 2017-10-01 00:00:00 sxk97        -0.594
##  2 2017-10-01 01:00:00 sxk97        -0.841
##  3 2017-10-01 02:00:00 sxk97        -0.291
##  4 2017-10-01 03:00:00 sxk97        -1.16 
##  5 2017-10-01 04:00:00 sxk97        -0.711
##  6 2017-10-01 05:00:00 sxk97        -0.841
##  7 2017-10-01 06:00:00 sxk97        -1.39 
##  8 2017-10-01 07:00:00 sxk97        -1.94 
##  9 2017-10-01 08:00:00 sxk97        -0.989
## 10 2017-10-01 09:00:00 sxk97        -0.989
## # … with 4,526 more rows

The data has been adjusted and is now ready for its modeling gig. It is currently stored under the object of scotty_long.

4 The Final Frontier

This section develops the automated model selection process.

4.1 Into the Nest

In constructing an automated model selection, the data frame submitted is often nested. This concept involves the creation of a list-column of data frames that would simplify the summarization of the content or information⁹. The process, commonly known as nesting, operates as if the data frame can be grouped again into a larger dataset. This would create many possibilities for the user to utilize the data for different modeling scenarios solely by defining the key and the provider for the data frame to “open”.

To jump start the nesting process, one should add a sample indicator. The purpose of this is to build a key for the data frame to recognize, with each coding process defined within the # symbol like the following.

# Step 1: Define the sample indicator
scotty_nest <- scotty_long %>%
  mutate(sample = case_when(
    datetime %within% train_int ~ "train",
    datetime %within% test_int ~ "test"
  )) %>%
  drop_na()

scotty_nest

## # A tibble: 4,536 x 4
##    datetime            src_sub_area demand sample
##    <dttm>              <chr>         <dbl> <chr> 
##  1 2017-10-01 00:00:00 sxk97        -0.594 train 
##  2 2017-10-01 01:00:00 sxk97        -0.841 train 
##  3 2017-10-01 02:00:00 sxk97        -0.291 train 
##  4 2017-10-01 03:00:00 sxk97        -1.16  train 
##  5 2017-10-01 04:00:00 sxk97        -0.711 train 
##  6 2017-10-01 05:00:00 sxk97        -0.841 train 
##  7 2017-10-01 06:00:00 sxk97        -1.39  train 
##  8 2017-10-01 07:00:00 sxk97        -1.94  train 
##  9 2017-10-01 08:00:00 sxk97        -0.989 train 
## 10 2017-10-01 09:00:00 sxk97        -0.989 train 
## # … with 4,526 more rows

# Step 2: Nest the data by defining the provider and including the sample
scotty_train_nest <- scotty_nest %>% 
  group_by(src_sub_area, sample) %>%
  nest()

scotty_train_nest

## # A tibble: 6 x 3
## # Groups:   src_sub_area, sample [6]
##   src_sub_area sample           data
##   <chr>        <chr>  <list<df[,2]>>
## 1 sxk97        train     [1,344 × 2]
## 2 sxk97        test        [168 × 2]
## 3 sxk9e        train     [1,344 × 2]
## 4 sxk9e        test        [168 × 2]
## 5 sxk9s        train     [1,344 × 2]
## 6 sxk9s        test        [168 × 2]

The next step should be addressed coherently as the utilization of the nest() function has been recently updated. According to Tidyverse, the .key parameter has been deprecated¹⁰. This implies that the spread() function may no longer be used to define how certain values within the nested data frame can later be unnested. Since .key is a huge part of the function, one should find a replacement.

Thankfully, a new function has been developed to solve this potential problem. The function pivot_wider() can now be used as a replacement for spread() like the following.

scotty_train_nest1 <- scotty_train_nest %>% 
  pivot_wider(names_from = sample, values_from = data)

scotty_train_nest1

## # A tibble: 3 x 3
## # Groups:   src_sub_area [3]
##   src_sub_area          train           test
##   <chr>        <list<df[,2]>> <list<df[,2]>>
## 1 sxk97           [1,344 × 2]      [168 × 2]
## 2 sxk9e           [1,344 × 2]      [168 × 2]
## 3 sxk9s           [1,344 × 2]      [168 × 2]

Now that the data frame has been nested, it is time to define the types of models that would be compatible for said data frame. Previously, it has been established that the weekly seasonality, categorized as part of multiple seasonality, is the best frquency to build our model. However, as we progress a more complex development, it is best to include all the possible seasonality types. Recall that the other type of usable seasonality is the daily one. Therefore, it will be incorporated within the nested model.

To incorporate the two model types into our nested data frame, an additional list should be constructed. This nest will define daily seasonality as a function of ts and multiple seasonality as msts. The syntax will resemble the following.

data_func <- list(
  ts = function(x) ts(x$demand, frequency = 24),
  msts = function(x) msts(x$demand,
                          seasonal.periods = c(24, 24 * 7))
)

data_func

## $ts
## function(x) ts(x$demand, frequency = 24)
## 
## $msts
## function(x) msts(x$demand,
##                           seasonal.periods = c(24, 24 * 7))

After the list has been created, it should be converted to a tbl format. The purpose of this is to enable the modeling functions later on and gives them the ability to read the object as a data frame. This can be done by utilizing the enframe() function. Additionally, it should be noted that defining the key is still part of the process. In this case, the key is src_sub_area and will be stored as part of the data frame.

All the aforementioned steps can be processed by piping the conversion with the key definition, in which the latter will utilize the rep() function like the following.

data_func1 <- data_func %<>%
  rep(length(unique(scotty_nest$src_sub_area))) %>%
  enframe("data_fun_name", "data_fun") %>%
  mutate(src_sub_area =
    sort(rep(unique(scotty_nest$src_sub_area),
             length(unique(.$data_fun_name))))
  )

data_func1

## # A tibble: 6 x 3
##   data_fun_name data_fun src_sub_area
##   <chr>         <list>   <chr>       
## 1 ts            <fn>     sxk97       
## 2 msts          <fn>     sxk97       
## 3 ts            <fn>     sxk9e       
## 4 msts          <fn>     sxk9e       
## 5 ts            <fn>     sxk9s       
## 6 msts          <fn>     sxk9s

The piping has been successfully conducted and it is time to combine the first nested data frame of scotty_train_nest1 with the recent one. In order to execute this, the left_join() function is used like the following.

scotty_nest_fit <- scotty_train_nest1 %>% 
  left_join(data_func1)

scotty_nest_fit

## # A tibble: 6 x 5
## # Groups:   src_sub_area [3]
##   src_sub_area          train           test data_fun_name data_fun
##   <chr>        <list<df[,2]>> <list<df[,2]>> <chr>         <list>  
## 1 sxk97           [1,344 × 2]      [168 × 2] ts            <fn>    
## 2 sxk97           [1,344 × 2]      [168 × 2] msts          <fn>    
## 3 sxk9e           [1,344 × 2]      [168 × 2] ts            <fn>    
## 4 sxk9e           [1,344 × 2]      [168 × 2] msts          <fn>    
## 5 sxk9s           [1,344 × 2]      [168 × 2] ts            <fn>    
## 6 sxk9s           [1,344 × 2]      [168 × 2] msts          <fn>

scotty_nest_fit is now set to be the final dataset for our forecast model.

4.2 The Search for Models

Since the end goal involves creating an automated forecasting framework to predict demand, one should include every possible model for time series. Discussions regarding the logic behind every model would take time so it would be best to summarize the usable ones, with each model being linked to the footer, into the following list.

Exponential Smoothing State Space Model (ets)¹¹
Seasonal and Trend Decomposition using Loess (stlm)¹²
Trigonometric Seasonality, Box-Cox Transformation, ARMA Errors, Trend and Seasonal Components (tbats)¹³
Autoregressive Integrated Moving Average (arima)¹⁴
Holt-Winters (hw)¹⁵

The list can be defined and stored into an object like the following.

models <- list(
  ets = function(x) ets(x),
  stlm = function(x) stlm(x),
  tbats = function(x) tbats(x, use.box.cox = FALSE),
  auto.arima = function(x) auto.arima(x),
  holt.winter = function(x) HoltWinters(x,seasonal = "additive")
  )
models

## $ets
## function(x) ets(x)
## 
## $stlm
## function(x) stlm(x)
## 
## $tbats
## function(x) tbats(x, use.box.cox = FALSE)
## 
## $auto.arima
## function(x) auto.arima(x)
## 
## $holt.winter
## function(x) HoltWinters(x,seasonal = "additive")

As mentioned in subsection 4.2, a list is arguably meaningless if it is not stored as a data frame. Thus, it is best to convert the aforementioned list like the following.

models1 <- models %<>%
  rep(length(unique(scotty_nest_fit$src_sub_area))) %>%
  enframe("model_name", "model") %>%
  mutate(src_sub_area =
    sort(rep(unique(scotty_nest_fit$src_sub_area),
             length(unique(.$model_name))))
    )

models1

## # A tibble: 15 x 3
##    model_name  model  src_sub_area
##    <chr>       <list> <chr>       
##  1 ets         <fn>   sxk97       
##  2 stlm        <fn>   sxk97       
##  3 tbats       <fn>   sxk97       
##  4 auto.arima  <fn>   sxk97       
##  5 holt.winter <fn>   sxk97       
##  6 ets         <fn>   sxk9e       
##  7 stlm        <fn>   sxk9e       
##  8 tbats       <fn>   sxk9e       
##  9 auto.arima  <fn>   sxk9e       
## 10 holt.winter <fn>   sxk9e       
## 11 ets         <fn>   sxk9s       
## 12 stlm        <fn>   sxk9s       
## 13 tbats       <fn>   sxk9s       
## 14 auto.arima  <fn>   sxk9s       
## 15 holt.winter <fn>   sxk9s

It seems that the data frame is now perfect for model fitting, but one should remain cautious. Models like ets and arima are not suitable for multiple seasonality. They can only predict certain values for one specific seasonality. As a consequence, the two functions need to be removed from the msts grouping or classification.

In order to save time, the removal of ets and arima from msts can be done simultaneously with the merger of scotty_nest_fit. By utilizing the piping feature, the process will resemble the following.

scotty_nest_model <- scotty_nest_fit %<>% 
  left_join(models1) %>% 
  filter(!(model_name == "ets" & data_fun_name == "msts"),
         !(model_name == "auto.arima" & data_fun_name == "msts"))

head(scotty_nest_model)

## # A tibble: 6 x 7
## # Groups:   src_sub_area [1]
##   src_sub_area       train      test data_fun_name data_fun model_name
##   <chr>        <list<df[,> <list<df> <chr>         <list>   <chr>     
## 1 sxk97        [1,344 × 2] [168 × 2] ts            <fn>     ets       
## 2 sxk97        [1,344 × 2] [168 × 2] ts            <fn>     stlm      
## 3 sxk97        [1,344 × 2] [168 × 2] ts            <fn>     tbats     
## 4 sxk97        [1,344 × 2] [168 × 2] ts            <fn>     auto.arima
## 5 sxk97        [1,344 × 2] [168 × 2] ts            <fn>     holt.wint…
## 6 sxk97        [1,344 × 2] [168 × 2] msts          <fn>     stlm      
## # … with 1 more variable: model <list>

scotty_nest_model is now ready for model fitting through a rigorous piping process. Within the purrr package lies an invoke_map() function which will make it easier for combining functions with a list of parameters. Originally, the function was supposed to be retired¹⁶, but it still operates beautifully like the following.

scotty_nest_model1 <- scotty_nest_model %<>%
  mutate(
    params = map(train, ~ list(x = .x)),
    data = invoke_map(data_fun, params),
    params = map(data, ~ list(x = .x)),
    fitted = invoke_map(model, params)
    ) %>%
  select(-data, -params)

scotty_nest_model1

## # A tibble: 24 x 8
## # Groups:   src_sub_area [3]
##    src_sub_area       train      test data_fun_name data_fun model_name
##    <chr>        <list<df[,> <list<df> <chr>         <list>   <chr>     
##  1 sxk97        [1,344 × 2] [168 × 2] ts            <fn>     ets       
##  2 sxk97        [1,344 × 2] [168 × 2] ts            <fn>     stlm      
##  3 sxk97        [1,344 × 2] [168 × 2] ts            <fn>     tbats     
##  4 sxk97        [1,344 × 2] [168 × 2] ts            <fn>     auto.arima
##  5 sxk97        [1,344 × 2] [168 × 2] ts            <fn>     holt.wint…
##  6 sxk97        [1,344 × 2] [168 × 2] msts          <fn>     stlm      
##  7 sxk97        [1,344 × 2] [168 × 2] msts          <fn>     tbats     
##  8 sxk97        [1,344 × 2] [168 × 2] msts          <fn>     holt.wint…
##  9 sxk9e        [1,344 × 2] [168 × 2] ts            <fn>     ets       
## 10 sxk9e        [1,344 × 2] [168 × 2] ts            <fn>     stlm      
## # … with 14 more rows, and 2 more variables: model <list>, fitted <list>

Running the syntax for model fitting takes a pretty long time since it is practically the “main event.” It is best, then, to save the result after the first run has been completed like the following.

saveRDS(scotty_nest_model1, "scotty_nest_model1.RDS")

The saved result can be called again for future use in case something happens when developing the next syntax. In order to do that, a safety net can be created like the following.

scotty_nest_model2 <- readRDS("scotty_nest_model1.RDS")
scotty_nest_model2

## # A tibble: 24 x 8
## # Groups:   src_sub_area [3]
##    src_sub_area       train      test data_fun_name data_fun model_name
##    <chr>        <list<df[,> <list<df> <chr>         <list>   <chr>     
##  1 sxk97        [1,344 × 2] [168 × 2] ts            <fn>     ets       
##  2 sxk97        [1,344 × 2] [168 × 2] ts            <fn>     stlm      
##  3 sxk97        [1,344 × 2] [168 × 2] ts            <fn>     tbats     
##  4 sxk97        [1,344 × 2] [168 × 2] ts            <fn>     auto.arima
##  5 sxk97        [1,344 × 2] [168 × 2] ts            <fn>     holt.wint…
##  6 sxk97        [1,344 × 2] [168 × 2] msts          <fn>     stlm      
##  7 sxk97        [1,344 × 2] [168 × 2] msts          <fn>     tbats     
##  8 sxk97        [1,344 × 2] [168 × 2] msts          <fn>     holt.wint…
##  9 sxk9e        [1,344 × 2] [168 × 2] ts            <fn>     ets       
## 10 sxk9e        [1,344 × 2] [168 × 2] ts            <fn>     stlm      
## # … with 14 more rows, and 2 more variables: model <list>, fitted <list>

Moving on to the statistical aspect of model fitting, then the presence of error should not be forgotten. Since what the model will do is basically forecast something that has not happened yet, the predicted results must have several errors that deviate from the actual ones. These errors, then, need to be calculated in value. In this case, the error needs to be calculated for every model involved against the test data like the following.

scotty_error <- scotty_nest_model2 %<>%
  mutate(error =
    map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
    map2_dbl(test, ~ mae_vec(truth = rec_revert
                             (.y$demand, rec, src_sub_area),
                             estimate = rec_revert
                             (.x$mean, rec, src_sub_area)))) %>%
  arrange(src_sub_area, error) 

scotty_error

## # A tibble: 24 x 9
## # Groups:   src_sub_area [3]
##    src_sub_area       train      test data_fun_name data_fun model_name
##    <chr>        <list<df[,> <list<df> <chr>         <list>   <chr>     
##  1 sxk97        [1,344 × 2] [168 × 2] msts          <fn>     tbats     
##  2 sxk97        [1,344 × 2] [168 × 2] msts          <fn>     stlm      
##  3 sxk97        [1,344 × 2] [168 × 2] msts          <fn>     holt.wint…
##  4 sxk97        [1,344 × 2] [168 × 2] ts            <fn>     ets       
##  5 sxk97        [1,344 × 2] [168 × 2] ts            <fn>     tbats     
##  6 sxk97        [1,344 × 2] [168 × 2] ts            <fn>     holt.wint…
##  7 sxk97        [1,344 × 2] [168 × 2] ts            <fn>     stlm      
##  8 sxk97        [1,344 × 2] [168 × 2] ts            <fn>     auto.arima
##  9 sxk9e        [1,344 × 2] [168 × 2] msts          <fn>     tbats     
## 10 sxk9e        [1,344 × 2] [168 × 2] msts          <fn>     stlm      
## # … with 14 more rows, and 3 more variables: model <list>, fitted <list>,
## #   error <dbl>

To take a glimpse into which model has the biggest and smallest error, a subset can be created for scotty_error like the following.

scotty_error1 <- scotty_error %>%
  select(src_sub_area, ends_with("_name"), error)

scotty_error1

## # A tibble: 24 x 4
## # Groups:   src_sub_area [3]
##    src_sub_area data_fun_name model_name  error
##    <chr>        <chr>         <chr>       <dbl>
##  1 sxk97        msts          tbats        7.60
##  2 sxk97        msts          stlm         8.67
##  3 sxk97        msts          holt.winter  8.70
##  4 sxk97        ts            ets          9   
##  5 sxk97        ts            tbats        9.02
##  6 sxk97        ts            holt.winter  9.12
##  7 sxk97        ts            stlm         9.34
##  8 sxk97        ts            auto.arima  11.3 
##  9 sxk9e        msts          tbats        9.26
## 10 sxk9e        msts          stlm        10.1 
## # … with 14 more rows

Now that the potential error has been identified, the forecast can be conducted. In order to process every model at once, the code can be written like the following chunk.

scotty_test <- scotty_error %>%
  mutate(
    forecast =
      map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
      map2(test, ~ tibble(
        datetime = .y$datetime,
        demand = as.vector(.x$mean)
      )),
    key = paste(data_fun_name, model_name, sep = "-")
  )

scotty_test

## # A tibble: 24 x 11
## # Groups:   src_sub_area [3]
##    src_sub_area       train      test data_fun_name data_fun model_name
##    <chr>        <list<df[,> <list<df> <chr>         <list>   <chr>     
##  1 sxk97        [1,344 × 2] [168 × 2] msts          <fn>     tbats     
##  2 sxk97        [1,344 × 2] [168 × 2] msts          <fn>     stlm      
##  3 sxk97        [1,344 × 2] [168 × 2] msts          <fn>     holt.wint…
##  4 sxk97        [1,344 × 2] [168 × 2] ts            <fn>     ets       
##  5 sxk97        [1,344 × 2] [168 × 2] ts            <fn>     tbats     
##  6 sxk97        [1,344 × 2] [168 × 2] ts            <fn>     holt.wint…
##  7 sxk97        [1,344 × 2] [168 × 2] ts            <fn>     stlm      
##  8 sxk97        [1,344 × 2] [168 × 2] ts            <fn>     auto.arima
##  9 sxk9e        [1,344 × 2] [168 × 2] msts          <fn>     tbats     
## 10 sxk9e        [1,344 × 2] [168 × 2] msts          <fn>     stlm      
## # … with 14 more rows, and 5 more variables: model <list>, fitted <list>,
## #   error <dbl>, forecast <list>, key <chr>

From the previous chunk, it can be seen that scotty_error is used for piping instead of the latest scotty_error1. The reason behind this is that scotty_error1 will create an error within the process. “Error: .x must be a vector, not a function” will appear on the console.

For those who are a sucker for visualization, the result of the forecast can be compiled and compared by constructing the following code.

scotty_test1 <- scotty_test %<>%
  select(src_sub_area, key, actual = test, forecast) %>%
  spread(key, forecast) %>%
  gather(key, value, -src_sub_area) %>% 
  unnest(value) %>% 
  mutate(demand = rec_revert(demand,rec,src_sub_area))

## Warning: attributes are not identical across measure variables;
## they will be dropped

The previous code enables one to set up the visualization of the forecast result collectively. By unnesting the values, you can proceed to plot the visuals like the following.

plot_forecast <- ggplot(scotty_test1,
                        aes(x = datetime, y = demand)) +
  geom_line(data = scotty_test1 %>%
              filter(key == "actual"),
            aes(y = demand), 
            alpha = 0.2,
            size = 0.8) +
  geom_line(data = scotty_test1 %>%
              filter(key != "actual"),
            aes(frame = key,col = key)) +
  labs(x = "",
       y = "Number of Hourly Demand", 
       title = "Exhibit 4.1 Compilation for Comparing The Time Series Models",
       frame = "Models") +
  facet_wrap(~ src_sub_area, scale = "free_y", ncol = 1) +
  scale_color_manual(values = c("orangered3", "orangered3", "orangered3",
                                "orangered3", "orangered3", "orangered3",
                                "orangered3", "orangered3", "orangered3")) +
  theme(legend.position = "none")

## Warning: Ignoring unknown aesthetics: frame

ggplotly(plot_forecast)

## Warning in p$x$data[firstFrame] <- p$x$frames[[1]]$data: number of items to
## replace is not a multiple of replacement length

Don’t forget to press Play!

4.3 The Undiscovered Best

An additional step that will extend the analysis is uncovering the best model. Although this may not be the best fit for the automated forecasting framework, this will give us an indication of which model has the lowest value of error. Granted, the purpose of adding this process is to identify and store the best model like the following.

scotty_best_model <- scotty_error %>%
  group_by(src_sub_area) %>% 
  arrange(error) %>% 
  slice(1) %>%
  ungroup() %>% 
  select(src_sub_area, ends_with("_name"),error)

scotty_best_model

## # A tibble: 3 x 4
##   src_sub_area data_fun_name model_name error
##   <chr>        <chr>         <chr>      <dbl>
## 1 sxk97        msts          tbats       7.60
## 2 sxk9e        msts          tbats       9.26
## 3 sxk9s        msts          tbats       7

One could settle for scotty_best_model, but the information displayed is insufficient. The information needs to be completed in order for it to be usable for examining error later on. To finalize the data frame stored, one could code the following syntax.

scotty_best_model1 <- scotty_best_model %>% 
  select(-error) %>% 
  left_join(scotty_error)%>% 
  select(-error)

scotty_best_model1

## # A tibble: 3 x 8
##   src_sub_area data_fun_name model_name       train      test data_fun
##   <chr>        <chr>         <chr>      <list<df[,> <list<df> <list>  
## 1 sxk97        msts          tbats      [1,344 × 2] [168 × 2] <fn>    
## 2 sxk9e        msts          tbats      [1,344 × 2] [168 × 2] <fn>    
## 3 sxk9s        msts          tbats      [1,344 × 2] [168 × 2] <fn>    
## # … with 2 more variables: model <list>, fitted <list>

4.4 The Final Selection

Recall that in subsection 4.2, the predicted result of a forecasting model tend to deviate from the actual one in the form of an error. With that logic, one could only hope for the smallest value of error. The lower the error, the closer the predicted result is to the actual one. Therefore, in selecting the automated model, the smallest error needs to be identified and selected through the following syntax.

scotty_small_error <- scotty_error %<>%
  select(-fitted) %>%
  group_by(src_sub_area) %>%
  filter(error == min(error)) %>%
  ungroup()

scotty_small_error

## # A tibble: 3 x 8
##   src_sub_area       train      test data_fun_name data_fun model_name
##   <chr>        <list<df[,> <list<df> <chr>         <list>   <chr>     
## 1 sxk97        [1,344 × 2] [168 × 2] msts          <fn>     tbats     
## 2 sxk9e        [1,344 × 2] [168 × 2] msts          <fn>     tbats     
## 3 sxk9s        [1,344 × 2] [168 × 2] msts          <fn>     tbats     
## # … with 2 more variables: model <list>, error <dbl>

Based on the result above, it can be implied that the tbats model is the best one to put within the automated forecasting framework. However, it should be noted that the error might come solely from a dataset consisting of test. To get a sense of the actual model with the smallest error values, one should combine both test and train. The result of scotty_small_error can be piped and mapped out to get the full data like the following.

scotty_combined <- scotty_small_error %<>%
  mutate(fulldata = map2(train,
                         test, ~ bind_rows(.x, .y))) %>%
  select(src_sub_area, fulldata,
         everything(), -train, -test)

scotty_combined

## # A tibble: 3 x 7
##   src_sub_area fulldata       data_fun_name data_fun model_name model error
##   <chr>        <list>         <chr>         <list>   <chr>      <lis> <dbl>
## 1 sxk97        <tibble [1,51… msts          <fn>     tbats      <fn>   7.60
## 2 sxk9e        <tibble [1,51… msts          <fn>     tbats      <fn>   9.26
## 3 sxk9s        <tibble [1,51… msts          <fn>     tbats      <fn>   7

It has now been confirmed that tbats is the best practice and the final model for the automated forecasting framework.

With tbats being confirmed, the forecasting framework should be up and running for making predictions. To start predicting demand for the next 7 days, the syntax should resemble the following code for model fitting.

scotty_tbats <- scotty_combined %<>%
  mutate(
    params = map(fulldata, ~ list(x = .x)),
    data = invoke_map(data_fun, params),
    params = map(data, ~ list(x = .x)),
    fitted = invoke_map(model, params)
  ) %>%
  select(-data, -params)

scotty_tbats

## # A tibble: 3 x 8
##   src_sub_area fulldata data_fun_name data_fun model_name model error
##   <chr>        <list>   <chr>         <list>   <chr>      <lis> <dbl>
## 1 sxk97        <tibble… msts          <fn>     tbats      <fn>   7.60
## 2 sxk9e        <tibble… msts          <fn>     tbats      <fn>   9.26
## 3 sxk9s        <tibble… msts          <fn>     tbats      <fn>   7   
## # … with 1 more variable: fitted <list>

The process takes a while to load so it is best to save the result in an RDS format. As mentioned in subsection 4.2, this could give a helpful benefit later on.

saveRDS(scotty_tbats, "scotty_tbats.RDS")

scotty_tbats1 <- readRDS("scotty_tbats.RDS")
scotty_tbats1

## # A tibble: 3 x 8
##   src_sub_area fulldata data_fun_name data_fun model_name model error
##   <chr>        <list>   <chr>         <list>   <chr>      <lis> <dbl>
## 1 sxk97        <tibble… msts          <fn>     tbats      <fn>   7.60
## 2 sxk9e        <tibble… msts          <fn>     tbats      <fn>   9.26
## 3 sxk9s        <tibble… msts          <fn>     tbats      <fn>   7   
## # … with 1 more variable: fitted <list>

Finalizing the framework requires one to construct the forecasting function. In order to achieve this, the following syntax must be created to commence the initialization.

scotty_pred <- scotty_tbats1 %<>%
  mutate(forecast =
    map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
    map2(fulldata, ~ tibble(
      datetime = timetk::tk_make_future_timeseries
      (.y$datetime, 24 * 7),
      demand = as.vector(.x$mean)
    ))
  )

scotty_pred

## # A tibble: 3 x 9
##   src_sub_area fulldata data_fun_name data_fun model_name model error
##   <chr>        <list>   <chr>         <list>   <chr>      <lis> <dbl>
## 1 sxk97        <tibble… msts          <fn>     tbats      <fn>   7.60
## 2 sxk9e        <tibble… msts          <fn>     tbats      <fn>   9.26
## 3 sxk9s        <tibble… msts          <fn>     tbats      <fn>   7   
## # … with 2 more variables: fitted <list>, forecast <list>

Upon seeing scotty_pred, it can be inferred that this is not the final result since there is no numerical value to be found within the prediction. In fact, the contents of the object resembles a nested data frame. Therefore, the object needs to be unnested like the following.

scotty_forecast <- scotty_pred %<>%
  select(src_sub_area, actual = fulldata, forecast) %>%
  gather(key, value, -src_sub_area) %>%
  unnest(value) %>% 
  mutate(demand = rec_revert(demand,rec,src_sub_area))

scotty_forecast

## # A tibble: 5,040 x 4
##    src_sub_area key    datetime            demand
##    <chr>        <chr>  <dttm>               <dbl>
##  1 sxk97        actual 2017-10-01 00:00:00      6
##  2 sxk97        actual 2017-10-01 01:00:00      4
##  3 sxk97        actual 2017-10-01 02:00:00      9
##  4 sxk97        actual 2017-10-01 03:00:00      2
##  5 sxk97        actual 2017-10-01 04:00:00      5
##  6 sxk97        actual 2017-10-01 05:00:00      4
##  7 sxk97        actual 2017-10-01 06:00:00      1
##  8 sxk97        actual 2017-10-01 07:00:00      0
##  9 sxk97        actual 2017-10-01 08:00:00      3
## 10 sxk97        actual 2017-10-01 09:00:00      3
## # … with 5,030 more rows

The data frame is now complete and we can see the number of demand per hour by each sub-area on the right column. Looking at the first 10 observations, it is implied that the highest demand for Scotty motorcycle ride is around 2 AM - a number concluded from the Sunday, October 1 projection. However, this is part of the train dataset, which already has an actual value, and not what one should look for in presenting the result of a forecast.

Inspecting the last rows of the forecast result should come in handy, but is arguably not the best practice. Other than being kept in a tabular format, the forecast result can be presented in a form of visualization. The most efficient way to visualize forecasting results is by utilizing the geom_line() plot to observe the point movements of what’s to come. The plot should also separate the actual demand with the predicted ones like the following.

plot_final <- ggplot(scotty_forecast,
       aes(x = datetime,
           y = demand, color = key)) +
  geom_line() +
  labs(y = "Number of Hourly Demand",
       x = NULL,
       title = "Exhibit 4.2 Forecasting Result of the Predicted Demand for Scotty") +
  facet_wrap(~ src_sub_area, scale = "free_y", ncol = 1) +
  scale_color_brewer(palette = "Set2") +
  theme(legend.position = "none")

ggplotly(plot_final)

Exhibit 4.2 visualizes the forecast result of the predicted hourly demand for Scotty ride hailing app. The plot includes both the actual demand from Sunday, October 1, 2017 until Saturday, December 2, 2017, as well as the predicted demand for Sunday, December 3, 2017 until Saturday, December 9, 2017. The green pastel line covers the actual values while the orange pastel one points out the predicted demand.

Based on the plot, we can see that the app will be flooded by ride requests on Friday, December 8, 2017 around 6 PM. This seems to apply to all sub-areas and is consistent with the previous surged pattern on 6 PM ride requests on Fridays; most notably on Friday, November 3, 2017.

The lowest demand, however, seems to display a consistent daily pattern of plummeting around 5 AM. Regardless of the day, the hourly demand will be at its lowest before the morning sun even rises. This analysis applies to all sub-areas.

The prediction result of the demand is the most important value for Scotty, and thus, should be saved in a format that could be easily read by all parties involved. This can be done by storing the data into a CSV format and generating the following code.

write_csv(scotty_forecast, "data/auliya_devaldi_wiratama_scotty-ts-auto-test.csv")

5 The Voyage Home

This section wraps everything up from the prediction performance to the conclusion of this publication.

5.1 Starfleet Submission

As previously mentioned in subsection 4.3, forecast analysis would be incomplete without looking at the error of the predicted values. The values of the predicition itself have been obtained and stored within the data frame of scotty_forecast in subsection 4.4. To open up the data and unnest the content, we could use a submission format that, in my case, has been provided by Algoritma Data Science School. Importing the submission format will look like the following.

data_submission <- read_csv("data/data-submission.csv")
head(data_submission, 10)

## # A tibble: 10 x 3
##    src_sub_area datetime            demand
##    <chr>        <dttm>              <lgl> 
##  1 sxk97        2017-12-03 00:00:00 NA    
##  2 sxk97        2017-12-03 01:00:00 NA    
##  3 sxk97        2017-12-03 02:00:00 NA    
##  4 sxk97        2017-12-03 03:00:00 NA    
##  5 sxk97        2017-12-03 04:00:00 NA    
##  6 sxk97        2017-12-03 05:00:00 NA    
##  7 sxk97        2017-12-03 06:00:00 NA    
##  8 sxk97        2017-12-03 07:00:00 NA    
##  9 sxk97        2017-12-03 08:00:00 NA    
## 10 sxk97        2017-12-03 09:00:00 NA

The content of data_submission should have no forecast values with a defined range of date and time. To check on said range, a quick glimpse of the following code will do the trick.

range(data_submission$datetime)

## [1] "2017-12-03 00:00:00 UTC" "2017-12-09 23:00:00 UTC"

From the information above, it is implied that data_submission is an object resembling the dates and hours of test. This indicates that the only time period that matters for further performance review is the predicted values from Sunday, December 3, 2017 to Saturday, December 9, 2017.

Moreover, it should be well known by now that any involvement regarding multiple seasonality forecast will require one to nest the data frame. Said requirement applies here, in which the syntax will look like the following.

submission_nest <- data_submission %>%
  nest(datetime)

## Warning: All elements of `...` must be named.
## Did you want `data = c(datetime)`?

submission_nest

## # A tibble: 3 x 3
##   src_sub_area demand           data
##   <chr>        <lgl>  <list<df[,1]>>
## 1 sxk97        NA          [168 × 1]
## 2 sxk9e        NA          [168 × 1]
## 3 sxk9s        NA          [168 × 1]

After nesting the data, the submission is ready for automatic forecasting with a minor adjustment. The adjustment in question involves the model used. In the previous forecast, scotty_tbats1 was utilized due to its relatively low error when compared to other time series models. However, there is another model which has been identified as the undiscovered best model in subsection 4.3. This model was stored as an object of scotty_best_model1 and should be used to forecast the following.

submission_forecast <- scotty_best_model1 %>% 
  mutate(
    forecast =
      map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
      map2(submission_nest$data,
           ~ tibble(
        datetime = .y$datetime,
        demand = as.vector(.x$mean)
      )),
    key = paste(data_fun_name, model_name, sep = "-")
  )

submission_forecast

## # A tibble: 3 x 10
##   src_sub_area data_fun_name model_name       train      test data_fun
##   <chr>        <chr>         <chr>      <list<df[,> <list<df> <list>  
## 1 sxk97        msts          tbats      [1,344 × 2] [168 × 2] <fn>    
## 2 sxk9e        msts          tbats      [1,344 × 2] [168 × 2] <fn>    
## 3 sxk9s        msts          tbats      [1,344 × 2] [168 × 2] <fn>    
## # … with 4 more variables: model <list>, fitted <list>, forecast <list>,
## #   key <chr>

submission_forecast has now been stored as an object capable of forecasting and matching the current best model with submission_nest.

By this point, the next step is self explanatory as it is just another unnesting process. The code resembles the following syntax.

scotty_submission <- submission_forecast %>%
  select(src_sub_area, forecast) %>%
  unnest(forecast) %>%
  mutate(demand = rec_revert(demand, rec, src_sub_area))

scotty_submission

## # A tibble: 504 x 3
##    src_sub_area datetime            demand
##    <chr>        <dttm>               <dbl>
##  1 sxk97        2017-12-03 00:00:00     18
##  2 sxk97        2017-12-03 01:00:00     15
##  3 sxk97        2017-12-03 02:00:00     11
##  4 sxk97        2017-12-03 03:00:00      7
##  5 sxk97        2017-12-03 04:00:00      5
##  6 sxk97        2017-12-03 05:00:00      3
##  7 sxk97        2017-12-03 06:00:00      2
##  8 sxk97        2017-12-03 07:00:00      6
##  9 sxk97        2017-12-03 08:00:00     12
## 10 sxk97        2017-12-03 09:00:00     12
## # … with 494 more rows

This next chunk will come in handy when the time comes to evaluate the error.

write_csv(scotty_submission, "data/auliya_devaldi_wiratama_scotty-ts-auto-submission.csv")

If one wishes to visualize scotty_submission, the syntax, as well as the final plot, will resemble the following.

plot_submission <- ggplot(scotty_submission,
                          aes(x = datetime, y = demand)) +
  geom_line() +
  labs(title = "Exhibit 5.1 Predicted Demand from December 3 to December 9, 2017",
       x = NULL,
       y = "Number of Hourly Demand",
       color = NULL) +
  facet_wrap(~ src_sub_area, scale = "free", ncol = 1)

ggplotly(plot_submission)

Exhibit 5.1 displays only the predicted demand for Scotty ride hailing app from Sunday, December 3 to Saturday, December 9, 2017. The plot depicts a closer look on the pattern, as well as the values, which confirms the same analysis as the first and final one in subsection 4.4.

5.2 The Wrath of Errors

Part of the reason why subsection 5.1 was needed is to transport us to this destination; prediction performance by evaluating the error. There are countless mechanisms to evaluate error terms, but the most common one is arguably MAE. This shortened moniker is a stand-in for Mean Absolute Error, which measures the difference between two continuous variables¹⁷. In this forecasting case, MAE calculates the average of the absolute errors between the predicted and the actual values.

In practice, calculating the values of MAE requires its own syntax, but this case is pretty different. Under the circumstances that this publication is a part of Algoritma Data Science School’s capstone project, coding the syntax is deemed unnecessary. Instead, the party involved developed an automatic MAE generator as a shiny dashboard which can only be accessed by few. As a response, the MAE of scotty_submission is captured and displayed in the following picture.

According to the automated MAE calculator, the error in each sub-area is relatively low. Not to mention, the sub-areas combined also has a value below 11. Despite it being numerically low, this result should be taken with a grain of salt as it is highly subjective to set a specific numerical threshold for MAE.

5.3 Live Long and Prosper

Overall, it is neither impossible nor is it difficult to build an automated forecasting framework for a ride hailing app. The key is to start simple by modeling a demand prediction based on the previous seasonal patterns. Identifying the seasonality might require one to code a complex syntax - same thing applies to the latter process of deciphering the data, but it is a moderately attainable process to complete.

From a technical point of view, developing an automated forecasting framework for predicting demand requires patience and knowledge. The time series data may encounter several problems, which are solvable as long as the correct packages are in store, namely padr and purrr. In addition, one must be extremely careful when setting up the model as it is easier said than done. Anyone can process theoretical logic, but not every user can transform said logic into a working syntax.

From a business perspective, the framework that this publication chooses, which is tbats, confirms the descriptive hypothesis that demand on the predicted week will resemble the previous pattern. Based on Exhibit 4.2, it is implied that consumers will be flooding the Scotty app on Friday, December 8, 2017 around 6 PM. However, they will not give the same daily sentiment at approximately 5 AM for the entire first week of December 2017.

The joyous result of the automated forecasting framework extends to the prediction performance as well. With MAE being used, it is implied that the forecasting model within the framework is good enough to use. Therefore, the model can be utilized as an operational tool to predict demand and to enhance its operation. Should Scotty chooses to focus more on its strategy by adopting the aforementioned operational tool, the ride hailing app will get to live long and prosper.

6 Beyond

6.1 Disclaimer

This publication is part of Algoritma Data Science School’s capstone project on machine learning. All credit, including the data, goes to the company.

If you wish to get in touch with the author, you can message him through the following sites.

LinkedIn: Valdy Wiratama
Twitter: valdywiratama

6.2 References

Hartmans, A. & Leskin, P. (2019, May 19). The History of How Uber Went From the Most Feared Startup in the World to Its Massive IPO. Retrieved from Business Insider ↩
Business Wire. (2019, November 4). Uber Announces Results for Third Quarter 2019. Retrieved from Business Wire ↩
Locke, A. (2017, November 22). Uber-like App for Motorcycles Eases Traffic Woes in One of World’s Most Congested Cities. Retrieved from USA Today ↩
Gunes, M. (2017, August 9). Istanbul Gets Motorized: New App Beats City’s Dreadful Traffic. Retrieved from Daily Sabah ↩
Hulli, E. (2019, September 13). Scotty: You Only Lose When You Give Up. Retrieved from Webrazzi ↩
Ros, I. (2012, June 6). Padding Time Series in R. Retrieved from Bocoup ↩
International Trade Administration. (2019, June 24). Turkey - Business Travel. Retrieved from export.gov ↩
Kuhn, M. (2019, September 15). Recipes: A Package For Computing And Preprocessing Design Matrices. Retrieved from: RDocumentation ↩
Wickham, H. (2019, September 10). Nest and Unnest. Retrieved from tidyr ↩
Wickham, H. (2019, September 13). Tidyr 1.0.0. Retrieved from Tidyverse ↩
De Livera, A. M., Hyndman, R. J., & Snyder, R. D. (2011). Forecasting Time Series with Complex Seasonal Patterns using Exponential Smoothing. J American Statistical Association, 106(496), 15131527. Retrieved from https://doi.org/10.1198/jasa.2011.tm09771 ↩
Cleveland, R. B., Cleveland, W. S., McRae, J. E., & Terpenning, I. J. (1990). STL: A seasonal-trend decomposition procedure based on loess. Journal of Official Statistics, 6(1), 3–33. Retrieved from SCB ↩
Prastiwi, A. (2019, January 8). Forecasting Time Series with Multiple Seasonal. Retrieved from Algoritma Technical Blog ↩
Chatterjee, S. (2018, January 30). Time Series Analysis Using ARIMA Model In R. Retrieved from datascience+↩
Mehlem, J. (2019, April 23). Exponential Smoothing for Time Series Forecasting in R. Retrieved from Johannes Mehlem ↩
Henry, L. (2019, February 6). Purrr 0.3.0. Retrieved from Tidyverse ↩
Willmott, Cort J., Matsuura, K. (2005). Advantages of the Mean Absolute Error (MAE) Over the Root Mean Square Error (RMSE) in Assessing Average Model Performance. Climate Research. 30: 79–82. Retrieved from https://doi:10.3354/cr030079 ↩

Beam Me Up, Scotty!

A Forecast of The Turkish Ride Hailing App’s Demand

Driver: valdywiratama

Trip Taken on December 16, 2019