Multiple Time Series Analysis on Online Transportation Dataset

Introduction

source: http://www.genmuda.com/wp-content/uploads/2018/02/ojek-online-tuyul-online.jpg

Have you ever wondered how online transportation companies set their prices? Are prices spread equally across time? Or are distinct areas priced differently?

If we follow basic economics, we would all agree that the price is mainly influenced by order demand (besides driver availability). Higher demand means higher prices. And for sure, demand is not equal across time and across places.

In this case, we are provided with a real-time transaction dataset from a motorcycle ride-sharing service by the Algoritma Data Science team. With this dataset, we are going to help them solve some forecasting problems in order to improve their business processes, including the pricing system and driver availability.

It’s almost the end of 2017, and we need to prepare a forecast model to help the company get ready for the end-of-year demand. Unfortunately, the company is not old enough to have last year’s data for December, so we cannot look back at past demand to prepare a forecast for December.

This project aims to build an automated forecasting model for hourly demand, evaluated on the next 7 days (Sunday, December 3rd, 2017 to Saturday, December 9th, 2017).

Libraries Used

As this project is a time series case, we will use time series packages such as forecast, yardstick, and timetk. For data wrangling, we will use dplyr, purrr, and recipes.
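
Loading everything up front keeps the rest of the notebook self-contained. A typical setup might look like this; lubridate is used in the preprocessing step, while padr and tidyr are assumptions based on the padding and reshaping steps shown later:

library(forecast)   # time series models: stlm(), tbats(), ets(), auto.arima()
library(yardstick)  # error metrics: mae_vec()
library(timetk)     # time series utilities
library(dplyr)      # data wrangling
library(purrr)      # functional programming: map(), invoke_map()
library(recipes)    # preprocessing: scaling & transformation
library(lubridate)  # date-time helpers: ymd_hms(), floor_date()
library(tidyr)      # reshaping: spread(), gather(), nest()
library(padr)       # padding missing time steps: pad()
library(ggplot2)    # plotting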

Online Transportation Dataset

We will start by importing the train dataset:
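
A minimal sketch of the import, assuming the raw file is a CSV named data-train.csv (the actual file name is not shown here):

data <- read.csv("data-train.csv")  # hypothetical file name
glimpse(data)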

## Observations: 90,113
## Variables: 16
## $ id                 <fct> 59d005e1ffcfa261708ce9cd, 59d0066affcfa261708ceb11…
## $ trip_id            <fct> 59d005e9cb564761a8fe5d3e, NA, 59d006c131e39c618969…
## $ driver_id          <fct> 59a892c5568be44b2734f276, NA, 599dc0dfa5b4fd5471ad…
## $ rider_id           <fct> 59ad2d6efba75a581666b506, 59cd704bcf482f6ce2fadfdb…
## $ start_time         <fct> 2017-10-01T00:00:17Z, 2017-10-01T00:02:34Z, 2017-1…
## $ src_lat            <dbl> 41.07047, 41.07487, 41.04995, 41.05287, 41.06760, …
## $ src_lon            <dbl> 29.01945, 28.99528, 29.03107, 28.99522, 28.98827, …
## $ src_area           <fct> sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sx…
## $ src_sub_area       <fct> sxk9s, sxk9e, sxk9s, sxk9e, sxk9e, sxk9s, sxk9s, s…
## $ dest_lat           <dbl> 41.11716, 41.08351, 41.04495, 41.08140, 41.02125, …
## $ dest_lon           <dbl> 29.03650, 29.00228, 28.98192, 28.98197, 29.11316, …
## $ dest_area          <fct> sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sx…
## $ dest_sub_area      <fct> sxk9u, sxk9e, sxk9e, sxk9e, sxk9q, sxk9e, sxk9e, s…
## $ distance           <dbl> 5.379250, 1.126098, 4.169492, 3.358296, 11.693573,…
## $ status             <fct> confirmed, nodrivers, confirmed, confirmed, nodriv…
## $ confirmed_time_sec <int> 8, 0, 32, 65, 0, 27, 32, 30, 0, 24, 9, 0, 118, 106…

The data contains 16 variables; however, since our case is a time series problem, we will only use two of them: the time (start_time) and the sub-area (src_sub_area). The dataset provides three different sub-areas, as shown below.
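
A minimal sketch of this selection:

data <- data %>%
  select(start_time, src_sub_area)

glimpse(data)
unique(data$src_sub_area)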

## Observations: 90,113
## Variables: 2
## $ start_time   <fct> 2017-10-01T00:00:17Z, 2017-10-01T00:02:34Z, 2017-10-01T0…
## $ src_sub_area <fct> sxk9s, sxk9e, sxk9s, sxk9e, sxk9e, sxk9s, sxk9s, sxk9s, …
## [1] sxk9s sxk9e sxk97
## Levels: sxk97 sxk9e sxk9s

For a time series case, we need to round the timestamps down to hourly intervals. We can use the floor_date() function from lubridate.

Next, we need to count how many demands occur in each specific hour and area.

Unfortunately, there are missing hours in the dataset, meaning hours with no demand at all. As we are not permitted to have missing time steps in time series modeling, we need to pad the series and fill the demand at those missing times with zero.
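
A sketch of these three steps, assuming the padr package for the padding (its messages appear below):

data <- data %>%
  mutate(start_time = ymd_hms(start_time) %>% floor_date(unit = "hour")) %>%
  group_by(src_sub_area, start_time) %>%
  summarise(demand = n()) %>%                          # count demand per hour & sub-area
  ungroup() %>%
  pad(interval = "hour", group = "src_sub_area") %>%   # insert the missing hours
  fill_by_value(demand, value = 0) %>%                 # zero demand for the padded hours
  group_by(src_sub_area)

head(data)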

## Warning: coercing time zone from UTC to

## Warning: coercing time zone from UTC to
## pad applied on the interval: hour
## # A tibble: 6 x 3
## # Groups:   src_sub_area [1]
##   start_time          src_sub_area demand
##   <dttm>              <fct>         <dbl>
## 1 2017-10-01 00:00:00 sxk97             6
## 2 2017-10-01 01:00:00 sxk97             4
## 3 2017-10-01 02:00:00 sxk97             9
## 4 2017-10-01 03:00:00 sxk97             2
## 5 2017-10-01 04:00:00 sxk97             5
## 6 2017-10-01 05:00:00 sxk97             4

Exploratory Data Analysis

Demand By Sub-Area

We would like to see the demand by sub-area throughout the day:

We can see that there is a seasonal pattern, although the trend is not clear yet. To investigate further, we make daily and weekly polar plots:

sxk97 <- data %>% filter(src_sub_area == "sxk97") %>% .$demand

sxk97_daily <- ggseasonplot(ts(sxk97, frequency = 24), polar = TRUE) +
  theme_bw() +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) +
  labs(title = "SXK97 Daily", x = "Hour") +
  scale_y_sqrt()

sxk97_w <- data %>%
  filter(src_sub_area == "sxk97") %>%
  mutate(date = format(start_time, "%Y/%m/%d")) %>%
  dplyr::group_by(date) %>%
  dplyr::summarise(demand = sum(demand))

sxk97_weekly <- ggseasonplot(ts(sxk97_w$demand, frequency = 7), polar = TRUE) +
  theme_bw() +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) +
  labs(title = "SXK97 Weekly", x = "Day") +
  scale_y_sqrt()

sxk9e <- data %>% filter(src_sub_area == "sxk9e") %>% .$demand

sxk9e_daily <- ggseasonplot(ts(sxk9e, frequency = 24), polar = TRUE) +
  theme_bw() +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) +
  labs(title = "SXK9E Daily", x = "Hour") +
  scale_y_sqrt()

sxk9e_w <- data %>%
  filter(src_sub_area == "sxk9e") %>%
  mutate(date = format(start_time, "%Y/%m/%d")) %>%
  dplyr::group_by(date) %>%
  dplyr::summarise(demand = sum(demand))

sxk9e_weekly <- ggseasonplot(ts(sxk9e_w$demand, frequency = 7), polar = TRUE) +
  theme_bw() +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) +
  labs(title = "SXK9E Weekly", x = "Day") +
  scale_y_sqrt()

sxk9s <- data %>% filter(src_sub_area == "sxk9s") %>% .$demand

sxk9s_daily <- ggseasonplot(ts(sxk9s, frequency = 24), polar = TRUE) +
  theme_bw() +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) +
  labs(title = "SXK9S Daily", x = "Hour") +
  scale_y_sqrt()

sxk9s_w <- data %>%
  filter(src_sub_area == "sxk9s") %>%
  mutate(date = format(start_time, "%Y/%m/%d")) %>%
  dplyr::group_by(date) %>%
  dplyr::summarise(demand = sum(demand))

sxk9s_weekly <- ggseasonplot(ts(sxk9s_w$demand, frequency = 7), polar = TRUE) +
  theme_bw() +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) +
  labs(title = "SXK9S Weekly", x = "Day") +
  scale_y_sqrt()

Generally, in all three sub-areas, the demand peaks at 6 to 7 PM and is lowest at 5 to 6 AM. Weekly, the demand is high on Friday and Saturday, and at its lowest on Sunday.

After looking at the patterns, we will use the daily and weekly patterns for time series modeling. A monthly pattern would be impossible to use because there is not enough data.

Decomposition

We would like to decompose the data to inspect its seasonal, trend, and error components, starting from a ts object with daily seasonality:

Unfortunately, the trend resulting from this decomposition is not smooth, which might be caused by uncaptured extra seasonality, so the data can be considered multi-seasonal. Therefore, we try another option: a multiple-seasonality time series object, msts, with daily and weekly seasonality:
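
A sketch for one sub-area (sxk97), comparing the two representations:

sxk97 <- data %>% filter(src_sub_area == "sxk97") %>% .$demand

# single seasonality: hourly data with a daily (24) cycle
ts(sxk97, frequency = 24) %>% decompose() %>% autoplot()

# multiple seasonality: daily (24) and weekly (24 * 7) cycles
msts(sxk97, seasonal.periods = c(24, 24 * 7)) %>% mstl() %>% autoplot()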

This time, the trend is smoother, which indicates that the seasonality patterns are specified correctly.

Data Preprocessing

Cross Validation

Before modeling, we have to separate our data into two parts: a train and a test dataset. The test dataset will be the last one week of the data, while the remainder is our train dataset.

Then, we label each start_time as train or test in a data_sample column:
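
A sketch of the split, assuming the test set is the last 7 x 24 hours:

test_size <- 24 * 7
test_start <- max(data$start_time) - hours(test_size - 1)

data <- data %>%
  mutate(data_sample = if_else(start_time >= test_start, "test", "train") %>%
           factor(levels = c("train", "test")))

head(data)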

## # A tibble: 6 x 4
## # Groups:   src_sub_area [3]
##   start_time          src_sub_area demand data_sample
##   <dttm>              <fct>         <dbl> <fct>      
## 1 2017-10-01 00:00:00 sxk97             6 train      
## 2 2017-10-01 01:00:00 sxk97             4 train      
## 3 2017-10-01 02:00:00 sxk97             9 train      
## 4 2017-10-01 03:00:00 sxk97             2 train      
## 5 2017-10-01 04:00:00 sxk97             5 train      
## 6 2017-10-01 05:00:00 sxk97             4 train

Data Scaling

To prevent outliers from having a big influence on our model, we will scale the data using the recipes package. Since recipes only accepts a column-wise format, we need to reshape our data into wide format first:
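
A sketch of the reshaping, putting each sub-area in its own column:

data <- data %>%
  ungroup() %>%
  spread(src_sub_area, demand)

head(data)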

## # A tibble: 6 x 5
##   start_time          data_sample sxk97 sxk9e sxk9s
##   <dttm>              <fct>       <dbl> <dbl> <dbl>
## 1 2017-10-01 00:00:00 train           6     8    10
## 2 2017-10-01 01:00:00 train           4     2     3
## 3 2017-10-01 02:00:00 train           9     3     1
## 4 2017-10-01 03:00:00 train           2     2     0
## 5 2017-10-01 04:00:00 train           5     1     0
## 6 2017-10-01 05:00:00 train           4     2     0

Besides scaling, we also apply a square root transformation:
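
A sketch of the recipe, estimated on the train rows only, with the square root applied before centering and scaling:

rec <- recipe(~ ., data = data %>% filter(data_sample == "train")) %>%
  step_sqrt(all_numeric()) %>%     # square root transformation
  step_center(all_numeric()) %>%   # subtract the mean
  step_scale(all_numeric()) %>%    # divide by the standard deviation
  prep()

data <- bake(rec, data)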

Then, we convert the data back into long format:
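
Using gather() to stack the sub-area columns back into rows:

data <- data %>%
  gather(src_sub_area, demand, sxk97, sxk9e, sxk9s)

head(data)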

## # A tibble: 6 x 4
##   start_time          data_sample src_sub_area demand
##   <dttm>              <fct>       <chr>         <dbl>
## 1 2017-10-01 00:00:00 train       sxk97        -0.594
## 2 2017-10-01 01:00:00 train       sxk97        -0.841
## 3 2017-10-01 02:00:00 train       sxk97        -0.291
## 4 2017-10-01 03:00:00 train       sxk97        -1.16 
## 5 2017-10-01 04:00:00 train       sxk97        -0.711
## 6 2017-10-01 05:00:00 train       sxk97        -0.841

To transform the data back into its original form (before scaling), we create a revert function that will be used later, after modeling.
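
A sketch of the revert function; it assumes the recipe steps are ordered as above (sqrt, center, scale), so the means live in the second step and the standard deviations in the third:

rec_revert <- function(vector, rec, varname) {
  rec_center <- rec$steps[[2]]$means[varname]       # mean used by step_center
  rec_scale  <- rec$steps[[3]]$sds[varname]         # sd used by step_scale
  results <- (vector * rec_scale + rec_center) ^ 2  # unscale, then undo the sqrt
  round(results)                                    # demand is a whole number
}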

Nested Model Fitting

As we will use functional programming with purrr later, we have to convert our data into a nested tbl, which creates tables inside our table. We nest() the start_time and demand columns for each combination of src_sub_area and data_sample.
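
A sketch of the nesting, spreading the train and test samples into their own list columns:

data <- data %>%
  group_by(src_sub_area, data_sample) %>%
  nest() %>%
  pivot_wider(names_from = data_sample, values_from = data) %>%
  select(src_sub_area, train, test)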

## # A tibble: 3 x 3
## # Groups:   src_sub_area [3]
##   src_sub_area          train           test
##   <chr>        <list<df[,2]>> <list<df[,2]>>
## 1 sxk97           [1,344 × 2]      [168 × 2]
## 2 sxk9e           [1,344 × 2]      [168 × 2]
## 3 sxk9s           [1,344 × 2]      [168 × 2]

Data Model List

As we saw before, there are two options for data representation: a ts object with daily seasonality and an msts object with daily and weekly seasonality. To apply them to our data, we make a data frame pairing each sub-area with both representation functions:
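
A sketch of this pairing; each function takes a demand vector and returns the corresponding time series object:

data_funs <- tibble(
  src_sub_area = rep(unique(data$src_sub_area), each = 2),
  data_fun_name = rep(c("ts", "msts"), times = 3),
  data_fun = rep(list(
    function(x) ts(x, frequency = 24),
    function(x) msts(x, seasonal.periods = c(24, 24 * 7))
  ), times = 3)
)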

## # A tibble: 6 x 3
##   data_fun_name data_fun src_sub_area
##   <chr>         <list>   <chr>       
## 1 ts            <fn>     sxk97       
## 2 msts          <fn>     sxk97       
## 3 ts            <fn>     sxk9e       
## 4 msts          <fn>     sxk9e       
## 5 ts            <fn>     sxk9s       
## 6 msts          <fn>     sxk9s

Using dplyr, we join them into one data frame:
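
The join itself is a one-liner:

data <- data %>%
  left_join(data_funs)   # joins by src_sub_area, as the message below confirms

head(data)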

## Joining, by = "src_sub_area"
## # A tibble: 6 x 5
## # Groups:   src_sub_area [3]
##   src_sub_area          train           test data_fun_name data_fun
##   <chr>        <list<df[,2]>> <list<df[,2]>> <chr>         <list>  
## 1 sxk97           [1,344 × 2]      [168 × 2] ts            <fn>    
## 2 sxk97           [1,344 × 2]      [168 × 2] msts          <fn>    
## 3 sxk9e           [1,344 × 2]      [168 × 2] ts            <fn>    
## 4 sxk9e           [1,344 × 2]      [168 × 2] msts          <fn>    
## 5 sxk9s           [1,344 × 2]      [168 × 2] ts            <fn>    
## 6 sxk9s           [1,344 × 2]      [168 × 2] msts          <fn>

Time Series Model List

Just as with the data representations, we can list the time series models as a nested list. We choose stlm(), tbats(), and HoltWinters() for the multi-seasonal data; ets() and auto.arima() will only be paired with the single-seasonality ts data, as they are not suitable for multiple seasonality. dshw() could not be used at all because there are zero values in our data.
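
A sketch of the model list, with each model wrapped in a function so it can be invoked uniformly later:

models <- tibble(
  src_sub_area = rep(unique(data$src_sub_area), each = 5),
  model_name = rep(c("stlm", "tbats", "holt.winter", "auto.arima", "ets"), times = 3),
  model = rep(list(
    function(x) stlm(x),
    function(x) tbats(x),
    function(x) HoltWinters(x),
    function(x) auto.arima(x),
    function(x) ets(x)
  ), times = 3)
)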

## # A tibble: 15 x 3
##    model_name  model  src_sub_area
##    <chr>       <list> <chr>       
##  1 stlm        <fn>   sxk97       
##  2 tbats       <fn>   sxk97       
##  3 holt.winter <fn>   sxk97       
##  4 auto.arima  <fn>   sxk97       
##  5 ets         <fn>   sxk97       
##  6 stlm        <fn>   sxk9e       
##  7 tbats       <fn>   sxk9e       
##  8 holt.winter <fn>   sxk9e       
##  9 auto.arima  <fn>   sxk9e       
## 10 ets         <fn>   sxk9e       
## 11 stlm        <fn>   sxk9s       
## 12 tbats       <fn>   sxk9s       
## 13 holt.winter <fn>   sxk9s       
## 14 auto.arima  <fn>   sxk9s       
## 15 ets         <fn>   sxk9s

Then we combine the models with our data, dropping the combinations that pair ets() or auto.arima() with msts data:
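
A sketch of the join, removing the pairs that are not applicable:

data <- data %>%
  left_join(models) %>%
  # ets() and auto.arima() cannot handle multiple seasonality, so drop their msts pairs
  filter(!(model_name %in% c("ets", "auto.arima") & data_fun_name == "msts"))

head(data)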

## Joining, by = "src_sub_area"
## # A tibble: 6 x 7
## # Groups:   src_sub_area [1]
##   src_sub_area         train        test data_fun_name data_fun model_name model
##   <chr>        <list<df[,2]> <list<df[,> <chr>         <list>   <chr>      <lis>
## 1 sxk97          [1,344 × 2]   [168 × 2] ts            <fn>     stlm       <fn> 
## 2 sxk97          [1,344 × 2]   [168 × 2] ts            <fn>     tbats      <fn> 
## 3 sxk97          [1,344 × 2]   [168 × 2] ts            <fn>     holt.wint… <fn> 
## 4 sxk97          [1,344 × 2]   [168 × 2] ts            <fn>     auto.arima <fn> 
## 5 sxk97          [1,344 × 2]   [168 × 2] ts            <fn>     ets        <fn> 
## 6 sxk97          [1,344 × 2]   [168 × 2] msts          <fn>     stlm       <fn>

Model

Finally, we can execute the model fitting using the map() and invoke_map() functions from purrr. First, we build the argument lists using map(), then we call the functions inside data_fun and model using invoke_map():
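
A sketch of the nested fitting:

data <- data %>%
  mutate(
    params = map(train, ~ list(x = .x$demand)),  # argument list for ts()/msts()
    data_ts = invoke_map(data_fun, params),      # build the time series objects
    params = map(data_ts, ~ list(x = .x)),       # argument list for the models
    fitted = invoke_map(model, params)           # fit every model to its series
  ) %>%
  select(-data_ts, -params)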

## # A tibble: 6 x 8
## # Groups:   src_sub_area [1]
##   src_sub_area       train      test data_fun_name data_fun model_name model
##   <chr>        <list<df[,> <list<df> <chr>         <list>   <chr>      <lis>
## 1 sxk97        [1,344 × 2] [168 × 2] ts            <fn>     stlm       <fn> 
## 2 sxk97        [1,344 × 2] [168 × 2] ts            <fn>     tbats      <fn> 
## 3 sxk97        [1,344 × 2] [168 × 2] ts            <fn>     holt.wint… <fn> 
## 4 sxk97        [1,344 × 2] [168 × 2] ts            <fn>     auto.arima <fn> 
## 5 sxk97        [1,344 × 2] [168 × 2] ts            <fn>     ets        <fn> 
## 6 sxk97        [1,344 × 2] [168 × 2] msts          <fn>     stlm       <fn> 
## # … with 1 more variable: fitted <list>

Evaluation

After building the models, we need to measure the train and test errors. We apply forecast() over the test horizon, then measure the error using mae_vec() from the yardstick package:
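
A sketch of the test error measurement, reusing rec_revert() so the error is on the original demand scale; the train error can be measured analogously from the models' fitted values:

data <- data %>%
  mutate(
    MAE_error_test = pmap_dbl(
      list(fitted, test, src_sub_area),
      function(fit, test, area) {
        fc <- forecast(fit, h = nrow(test))  # forecast the 168-hour test window
        mae_vec(
          truth = rec_revert(test$demand, rec, area),
          estimate = rec_revert(as.numeric(fc$mean), rec, area)
        )
      }
    )
  )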

## # A tibble: 24 x 5
## # Groups:   src_sub_area [3]
##    src_sub_area data_fun_name model_name  MAE_error_test MAE_error_train
##    <chr>        <chr>         <chr>                <dbl>           <dbl>
##  1 sxk97        msts          stlm                  8.67            3.63
##  2 sxk97        ts            stlm                  9.34            4.67
##  3 sxk97        msts          tbats                 7.42            4.69
##  4 sxk97        ts            tbats                 9.02            4.91
##  5 sxk97        ts            ets                   9               5.05
##  6 sxk97        msts          holt.winter           8.70            5.07
##  7 sxk97        ts            holt.winter           9.12            5.54
##  8 sxk97        ts            auto.arima           11.3             5.62
##  9 sxk9e        msts          stlm                 10.1             5.00
## 10 sxk9e        msts          tbats                 9.76            6.48
## # … with 14 more rows

Forecast and Actual Data Comparison

After getting the errors, we would like to compare the forecast results with the actual test data. First, we make a tbl containing our forecasts, then use spread() and gather() to distinguish the actual and forecast series:
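
One possible sketch, aligning each model's test-window forecast with the actual series:

data_test <- data %>%
  mutate(
    start_time = map(test, ~ .x$start_time),
    actual = map(test, ~ .x$demand),
    forecast = map2(fitted, test, ~ as.numeric(forecast(.x, h = nrow(.y))$mean))
  ) %>%
  select(src_sub_area, data_fun_name, model_name, start_time, actual, forecast) %>%
  ungroup() %>%
  unnest(c(start_time, actual, forecast)) %>%
  gather(key, demand, actual, forecast) %>%   # long format: actual vs forecast rows
  group_by(src_sub_area) %>%
  mutate(demand = rec_revert(demand, rec, unique(src_sub_area)))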

## Warning: attributes are not identical across measure variables;
## they will be dropped
## # A tibble: 6 x 4
## # Groups:   src_sub_area [1]
##   src_sub_area key    start_time          demand
##   <chr>        <chr>  <dttm>               <dbl>
## 1 sxk97        actual 2017-11-26 00:00:00     38
## 2 sxk97        actual 2017-11-26 01:00:00     21
## 3 sxk97        actual 2017-11-26 02:00:00     10
## 4 sxk97        actual 2017-11-26 03:00:00     10
## 5 sxk97        actual 2017-11-26 04:00:00      5
## 6 sxk97        actual 2017-11-26 05:00:00      2

Automated Model Selection

In this section, we would like to predict the unseen data: the next 168 hours after our test data, ranging from Sunday, December 3rd, 2017 to Saturday, December 9th, 2017. However, we need to choose the best model first, so we build an automated model selection. Since it would be hard to pick the best model through graphical analysis alone, we choose the model with the smallest error:
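
A sketch of the selection, keeping the row with the smallest test MAE per sub-area:

data_best <- data %>%
  group_by(src_sub_area) %>%
  filter(MAE_error_test == min(MAE_error_test)) %>%
  ungroup()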

## # A tibble: 3 x 9
##   src_sub_area       train      test data_fun_name data_fun model_name model
##   <chr>        <list<df[,> <list<df> <chr>         <list>   <chr>      <lis>
## 1 sxk97        [1,344 × 2] [168 × 2] msts          <fn>     tbats      <fn> 
## 2 sxk9e        [1,344 × 2] [168 × 2] msts          <fn>     tbats      <fn> 
## 3 sxk9s        [1,344 × 2] [168 × 2] msts          <fn>     tbats      <fn> 
## # … with 2 more variables: MAE_error_test <dbl>, MAE_error_train <dbl>

Unlike before, for the final forecast we use all of the data, which means we have to combine the train and test datasets:
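
A sketch of the recombination:

data_best <- data_best %>%
  mutate(fulldata = map2(train, test, ~ bind_rows(.x, .y))) %>%  # stitch train + test
  select(-train, -test) %>%
  select(src_sub_area, fulldata, everything())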

## # A tibble: 3 x 8
##   src_sub_area fulldata data_fun_name data_fun model_name model MAE_error_test
##   <chr>        <list>   <chr>         <list>   <chr>      <lis>          <dbl>
## 1 sxk97        <tibble… msts          <fn>     tbats      <fn>            7.42
## 2 sxk9e        <tibble… msts          <fn>     tbats      <fn>            9.76
## 3 sxk9s        <tibble… msts          <fn>     tbats      <fn>            6.94
## # … with 1 more variable: MAE_error_train <dbl>

Then we repeat the nested model fitting on the full data:
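
This mirrors the earlier nested fitting, only on fulldata:

data_best <- data_best %>%
  mutate(
    params = map(fulldata, ~ list(x = .x$demand)),
    data_ts = invoke_map(data_fun, params),
    params = map(data_ts, ~ list(x = .x)),
    fitted = invoke_map(model, params)   # refit the chosen model on the full series
  ) %>%
  select(-data_ts, -params)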

After the fitting, the table gains a fitted column:

## # A tibble: 3 x 9
##   src_sub_area fulldata data_fun_name data_fun model_name model MAE_error_test
##   <chr>        <list>   <chr>         <list>   <chr>      <lis>          <dbl>
## 1 sxk97        <tibble… msts          <fn>     tbats      <fn>            7.42
## 2 sxk9e        <tibble… msts          <fn>     tbats      <fn>            9.76
## 3 sxk9s        <tibble… msts          <fn>     tbats      <fn>            6.94
## # … with 2 more variables: MAE_error_train <dbl>, fitted <list>

Finally, we forecast the next 168 hours, revert the scaling, and unnest() the result to get our final forecast:
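
A sketch of the final step, assuming the forecast horizon starts right after the data ends (2017-12-03 00:00:00 UTC):

data_forecast <- data_best %>%
  mutate(
    forecast = map2(fitted, src_sub_area, function(fit, area) {
      tibble(
        start_time = seq(ymd_hms("2017-12-03 00:00:00"), by = "hour",
                         length.out = 24 * 7),
        demand = rec_revert(as.numeric(forecast(fit, h = 24 * 7)$mean), rec, area)
      )
    }),
    key = "forecast"
  ) %>%
  select(src_sub_area, key, forecast) %>%
  unnest(forecast)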

## # A tibble: 504 x 4
##    src_sub_area key      start_time          demand
##    <chr>        <chr>    <dttm>               <dbl>
##  1 sxk97        forecast 2017-12-03 00:00:00     40
##  2 sxk97        forecast 2017-12-03 01:00:00     30
##  3 sxk97        forecast 2017-12-03 02:00:00     21
##  4 sxk97        forecast 2017-12-03 03:00:00     15
##  5 sxk97        forecast 2017-12-03 04:00:00     11
##  6 sxk97        forecast 2017-12-03 05:00:00      7
##  7 sxk97        forecast 2017-12-03 06:00:00      6
##  8 sxk97        forecast 2017-12-03 07:00:00     11
##  9 sxk97        forecast 2017-12-03 08:00:00     19
## 10 sxk97        forecast 2017-12-03 09:00:00     19
## # … with 494 more rows

Now, we present the actual data alongside our forecast results on a graph:

Conclusion

This online transportation case has two types of seasonality: daily and weekly. We used STLM, TBATS, and Holt-Winters models for the multi-seasonal data. The forecasts from the TBATS models showed the best performance overall and for each sub-area.

Dhaneswara Mandrasa T.