Introduction

Scotty is a ride-sharing business that operating in several big cities in Turkey. The company provides motorcycles ride-sharing service for Turkey’s citizen, and really value the efficiency in traveling through the traffic–the apps even give some reference to Star Trek “beam me up” in their order buttons.

Scotty provided us with real-time transaction dataset. With this dataset, we are going to help them in solving some forecasting and classification problems in order to improve their business processes.

It’s almost the end of 2017 and we need to prepare a forecast model to helps Scotty ready for the end year’s demands. Unfortunately, Scotty is not old enough to have last year data for December, so we can not look back at past demands to prepare forecast for December’s demands. Fortunately, you already know that time series analysis is more than enough to help us to forecast! But, as an investment for the business’ future, we need to develop an automated forecasting framework so we don’t have to meddling with forecast model selection anymore in the future!

This time, I am going to show you the process of using automated time-series forecasting for multiple model. To do it, I will use package called purrr().

Let’s get into the problem! Enjoy the codes!

Data Wrangling

I will load my data into scotty object

Next, observe the scotty with glimpse() function

## Observations: 90,113
## Variables: 16
## $ id                 <chr> "59d005e1ffcfa261708ce9cd", "59d0066affcfa261708...
## $ trip_id            <chr> "59d005e9cb564761a8fe5d3e", NA, "59d006c131e39c6...
## $ driver_id          <chr> "59a892c5568be44b2734f276", NA, "599dc0dfa5b4fd5...
## $ rider_id           <chr> "59ad2d6efba75a581666b506", "59cd704bcf482f6ce2f...
## $ start_time         <dttm> 2017-10-01 00:00:17, 2017-10-01 00:02:34, 2017-...
## $ src_lat            <dbl> 41.07047, 41.07487, 41.04995, 41.05287, 41.06760...
## $ src_lon            <dbl> 29.01945, 28.99528, 29.03107, 28.99522, 28.98827...
## $ src_area           <chr> "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", ...
## $ src_sub_area       <chr> "sxk9s", "sxk9e", "sxk9s", "sxk9e", "sxk9e", "sx...
## $ dest_lat           <dbl> 41.11716, 41.08351, 41.04495, 41.08140, 41.02125...
## $ dest_lon           <dbl> 29.03650, 29.00228, 28.98192, 28.98197, 29.11316...
## $ dest_area          <chr> "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", ...
## $ dest_sub_area      <chr> "sxk9u", "sxk9e", "sxk9e", "sxk9e", "sxk9q", "sx...
## $ distance           <dbl> 5.379250, 1.126098, 4.169492, 3.358296, 11.69357...
## $ status             <chr> "confirmed", "nodrivers", "confirmed", "confirme...
## $ confirmed_time_sec <dbl> 8, 0, 32, 65, 0, 27, 32, 30, 0, 24, 9, 0, 118, 1...

Now, look at the top 6 data using head() function

## # A tibble: 6 x 16
##   id    trip_id driver_id rider_id start_time          src_lat src_lon src_area
##   <chr> <chr>   <chr>     <chr>    <dttm>                <dbl>   <dbl> <chr>   
## 1 59d0~ 59d005~ 59a892c5~ 59ad2d6~ 2017-10-01 00:00:17    41.1    29.0 sxk9    
## 2 59d0~ <NA>    <NA>      59cd704~ 2017-10-01 00:02:34    41.1    29.0 sxk9    
## 3 59d0~ 59d006~ 599dc0df~ 59bd62c~ 2017-10-01 00:03:29    41.0    29.0 sxk9    
## 4 59d0~ 59d007~ 59a58565~ 59c17cd~ 2017-10-01 00:04:24    41.1    29.0 sxk9    
## 5 59d0~ <NA>    <NA>      596f47a~ 2017-10-01 00:06:06    41.1    29.0 sxk9    
## 6 59d0~ 59d007~ 599dc0df~ 59bd62c~ 2017-10-01 00:08:19    41.0    29.0 sxk9    
## # ... with 8 more variables: src_sub_area <chr>, dest_lat <dbl>,
## #   dest_lon <dbl>, dest_area <chr>, dest_sub_area <chr>, distance <dbl>,
## #   status <chr>, confirmed_time_sec <dbl>

Since I am going to make timeseries model, I only need the start_time and src_sub_area variables. Therefore, I will subset my dataframe and store it into new object scotty_new

Next, I want to count the order based on the time (hourly), hence I will make my start_time into hours. I will put in to scotty_agg

Great! Now I have the order number for each hour and area.

This data has 3 levels or factors of src_sub_area, so I will make the data from long data frame into wide data frame using spread() function inside tidyr package

Next step, I am going to pad the timeseries object to make sure that the timeseries data is full. Before that, I have to find the minimum and maximum value of the time

## [1] "2017-10-01 UTC"
## [1] "2017-12-02 23:00:00 UTC"

Padding time!

Since this data is a demand order for each hour, so I am going to replace the NA values with 0 (This missing values mean that there are no orders at that time)

Voila! The data is much cleaner now.

Exploratory Data Analysis

To make a clearer view, I will visualize the data using graph of ggplot()

As we see from the three time-series graphs, they have similarities among them. The order number is 0 in certain date of month and has peaks at the same time too. This pattern is repeating regularly each day and week, hence we can call it seasonality pattern.

Cross Validation

Since the data consits of 3 months only, I will split the data into 3 months for cross validation. I am going to split to data test and data train into 4 weeks for 3 months per hour a day, so it becomes 24 hours * 7 days * 4 weeks * 3 months.

To make it easier, I can combine the start and end interval into date interval using interval() function

## [1] 2017-10-01 UTC--2017-11-25 23:00:00 UTC
## [1] 2017-11-26 UTC--2017-12-02 23:00:00 UTC

It is cool, right?

Now, I am going to process the data using recipe() and bake() function

## # A tibble: 1,512 x 4
##    start_time           sxk97  sxk9e  sxk9s
##    <dttm>               <dbl>  <dbl>  <dbl>
##  1 2017-10-01 00:00:00 -0.594 -0.697 -0.111
##  2 2017-10-01 01:00:00 -0.841 -1.28  -0.780
##  3 2017-10-01 02:00:00 -0.291 -1.15  -1.12 
##  4 2017-10-01 03:00:00 -1.16  -1.28  -1.59 
##  5 2017-10-01 04:00:00 -0.711 -1.44  -1.59 
##  6 2017-10-01 05:00:00 -0.841 -1.28  -1.59 
##  7 2017-10-01 06:00:00 -1.39  -1.44  -1.59 
##  8 2017-10-01 07:00:00 -1.94  -1.85  -1.12 
##  9 2017-10-01 08:00:00 -0.989 -1.28  -0.544
## 10 2017-10-01 09:00:00 -0.989 -0.497 -0.544
## # ... with 1,502 more rows

If we use recipe() function, we have to create revert back function

Revert the data frame into long data frame again

Nested Model Fitting and Forecasting

In programming using purrr package, we have to need to convert our tbl into nested tbl. It is just like we have tbl inside tbl. Now, let’s start by converting tbl into nested tbl.

After that, we can start to nest the tbl using nest() function based on sample and src_sub_area then spread() or pivot_wider() again the tbl based on sample (This time I will try to use pivot_wider() instead of spread())

## # A tibble: 3 x 3
## # Groups:   src_sub_area [3]
##   src_sub_area          train           test
##   <chr>        <list<df[,2]>> <list<df[,2]>>
## 1 sxk97           [1,344 x 2]      [168 x 2]
## 2 sxk9e           [1,344 x 2]      [168 x 2]
## 3 sxk9s           [1,344 x 2]      [168 x 2]

Representation Model Lists

Since the data is in high frequency, I can consider two options of model, daily seasonality(ts) and weekly seasonality (msts)

We need to create functions for another nested data

## $ts
## function(x) ts(x$demand, frequency = 24)
## 
## $msts
## function(x) msts(x$demand, seasonal.periods = c(24, 24 * 7))

We can convert the list above into tbl using enframe() function.

## # A tibble: 6 x 3
##   data_fun_name data_fun src_sub_area
##   <chr>         <list>   <chr>       
## 1 ts            <fn>     sxk97       
## 2 msts          <fn>     sxk97       
## 3 ts            <fn>     sxk9e       
## 4 msts          <fn>     sxk9e       
## 5 ts            <fn>     sxk9s       
## 6 msts          <fn>     sxk9s

The next step is joining the data_funs with nest data.frame using left_join()

Time Series Model Lists

In order to look at the trend and seasonality pattern of all area, I will try to visualize it using ggseasonplot() function

Since the data only consists 2 months of period, it is not relevant to be used to find monthly pattern. Therefore, I will only observe the daily and weekly pattern.
Create the plot

Plot to visualize

Based on the plots above, it can be concluded that:
1. On daily pattern, the highest demand can be found at nearly 5pm-7pm
2. On weekly pattern, the highest demand can be found on Fridays and Saturdays (10 o’clock direction)

Next, let’s find the trend and seasonality by its decomposition

After a quick screening, we can see that the trend line made smoothly. This indicates that this time series data has trend and also seasonality pattern.

At this time, I am going to use five different time series models, which are:
  • Auto arima
  • Exponential smoothing state space model (ets)
  • Seasonal and Trend decomposition using Loess (stlm)
  • Trigonometric seasonality, Box-Cox transformation, ARMA errors, Trend and Seasonal components (tbats)
  • Autoregressive integrated moving average (arima)
  • Holt Winter

since the data has trend and seasonality.

I will aggregate all of the models into one nested rbl in order to a quicker processing.
Now convert the list into tbl like previous function

Great! We can now join join and models. However, we do not want to have the auto.arima(), stlm() and ets() with data msts, since they are not compatible for multi seasonality time-series. Hence, we have to remove it.

Nested Fitting Execution / Modelling

To execute the model fitting, we need to wrap up the needed arguments as a list using map() function. Then, we could call the function using invoke_map(). We need to do this for data transformation using the function inside data_fun, then continue to fit the model with the same process using the function inside model.

Model Evaluation

Next, we have to forecast the data using forecast() and try to observe the error using function in yardstick() package. I will use MAE (mean absolute error) as the error function. The error values are splitted into train_error (for evaluation dataset) and test_error (for data test dataset).

## # A tibble: 21 x 5
## # Groups:   src_sub_area [3]
##    src_sub_area data_fun_name model_name  train_error test_error
##    <chr>        <chr>         <chr>             <dbl>      <dbl>
##  1 sxk97        ts            stlm               4.55       9.39
##  2 sxk97        msts          tbats              4.69       7.42
##  3 sxk97        ts            tbats              4.91       9.02
##  4 sxk97        ts            ets                5.05       9   
##  5 sxk97        msts          holt.winter        5.07       8.70
##  6 sxk97        ts            holt.winter        5.54       9.12
##  7 sxk97        ts            auto.arima         5.62      11.3 
##  8 sxk9e        ts            stlm               6.34      11.2 
##  9 sxk9e        msts          tbats              6.48       9.76
## 10 sxk9e        ts            tbats              6.85      10.9 
## # ... with 11 more rows

Unnesting The Result

We can compare the forecast results to the real test series using graphical analysis. Before that, we have to build a forecast model in order to forecast the data. Next, we will try to do some spread-gather tricks to make a set of keys that unique for each data representations, models, and also one for the forecast itself.

First, build the forecast model.

## # A tibble: 21 x 12
## # Groups:   src_sub_area [3]
##    src_sub_area       train      test data_fun_name data_fun model_name model
##    <chr>        <list<df[,> <list<df> <chr>         <list>   <chr>      <lis>
##  1 sxk97        [1,344 x 2] [168 x 2] ts            <fn>     stlm       <fn> 
##  2 sxk97        [1,344 x 2] [168 x 2] msts          <fn>     tbats      <fn> 
##  3 sxk97        [1,344 x 2] [168 x 2] ts            <fn>     tbats      <fn> 
##  4 sxk97        [1,344 x 2] [168 x 2] ts            <fn>     ets        <fn> 
##  5 sxk97        [1,344 x 2] [168 x 2] msts          <fn>     holt.wint~ <fn> 
##  6 sxk97        [1,344 x 2] [168 x 2] ts            <fn>     holt.wint~ <fn> 
##  7 sxk97        [1,344 x 2] [168 x 2] ts            <fn>     auto.arima <fn> 
##  8 sxk9e        [1,344 x 2] [168 x 2] ts            <fn>     stlm       <fn> 
##  9 sxk9e        [1,344 x 2] [168 x 2] msts          <fn>     tbats      <fn> 
## 10 sxk9e        [1,344 x 2] [168 x 2] ts            <fn>     tbats      <fn> 
## # ... with 11 more rows, and 5 more variables: fitted <list>,
## #   train_error <dbl>, test_error <dbl>, forecast <list>, key <chr>

Next, do spread-gather

## # A tibble: 24 x 3
## # Groups:   src_sub_area [3]
##    src_sub_area key              value             
##    <chr>        <chr>            <list>            
##  1 sxk97        actual           <tibble [168 x 2]>
##  2 sxk9e        actual           <tibble [168 x 2]>
##  3 sxk9s        actual           <tibble [168 x 2]>
##  4 sxk97        msts-holt.winter <tibble [168 x 2]>
##  5 sxk9e        msts-holt.winter <tibble [168 x 2]>
##  6 sxk9s        msts-holt.winter <tibble [168 x 2]>
##  7 sxk97        msts-tbats       <tibble [168 x 2]>
##  8 sxk9e        msts-tbats       <tibble [168 x 2]>
##  9 sxk9s        msts-tbats       <tibble [168 x 2]>
## 10 sxk97        ts-auto.arima    <tibble [168 x 2]>
## # ... with 14 more rows

Finally, let’s unnest the data using unnest() function

## # A tibble: 4,032 x 4
## # Groups:   src_sub_area [3]
##    src_sub_area key    start_time          demand
##    <chr>        <chr>  <dttm>               <dbl>
##  1 sxk97        actual 2017-11-26 00:00:00     38
##  2 sxk97        actual 2017-11-26 01:00:00     21
##  3 sxk97        actual 2017-11-26 02:00:00     10
##  4 sxk97        actual 2017-11-26 03:00:00     10
##  5 sxk97        actual 2017-11-26 04:00:00      5
##  6 sxk97        actual 2017-11-26 05:00:00      2
##  7 sxk97        actual 2017-11-26 06:00:00      0
##  8 sxk97        actual 2017-11-26 07:00:00     11
##  9 sxk97        actual 2017-11-26 08:00:00      4
## 10 sxk97        actual 2017-11-26 09:00:00      4
## # ... with 4,022 more rows

Great! Now, let’s try to visualize the data using ggplot() graph for a better view

Graph Visualization

It’s a cool stuff, isn’t it? We can adjust to compare each model by just slide the button to desired model. The transparent line at the back is the test data which will be forecasted, while the coloured line is the forecast representation.
We can see that all models has such different pattern of line. However, all of them has trend and seasonality pattern for each area. This indicates that the time-series model produce a pretty good results.

Automatic Model Selection

It is quite hard to compare the results by only using graphs. We can do model selection automatically by using filter() function to find the lowest error.

## # A tibble: 3 x 5
##   src_sub_area data_fun_name model_name train_error test_error
##   <chr>        <chr>         <chr>            <dbl>      <dbl>
## 1 sxk97        msts          tbats             4.69       7.42
## 2 sxk9e        msts          tbats             6.48       9.76
## 3 sxk9s        msts          tbats             4.81       6.94

It can be seen from the data frame above that the best model which has the lowest error (based on MAE) for the test data (data which want to be forecasted) is tbats model with error :

  • 7.42 for sxk97
  • 9.76 for sxk9e
  • 6.94 for sxk9s
and for train data (evaluation data set)
  • 4.68 for sxk97
  • 6.47 for sxk9e
  • 4.80 for sxk9s

For reminder, these results are produced from the splitted train data . In order to forecast the real data, I have to combine the train data and test data into full data. This step will be explained on the next section

Final Forecast Model

Now for the final step is to make the last forecast model. Same as previous steps, but we use train and test combination as our new data.

## # A tibble: 21 x 9
## # Groups:   src_sub_area [3]
##    src_sub_area fulldata data_fun_name data_fun model_name model fitted
##    <chr>        <list>   <chr>         <list>   <chr>      <lis> <list>
##  1 sxk97        <tibble~ ts            <fn>     stlm       <fn>  <stlm>
##  2 sxk97        <tibble~ msts          <fn>     tbats      <fn>  <tbat~
##  3 sxk97        <tibble~ ts            <fn>     tbats      <fn>  <tbat~
##  4 sxk97        <tibble~ ts            <fn>     ets        <fn>  <ets> 
##  5 sxk97        <tibble~ msts          <fn>     holt.wint~ <fn>  <HltW~
##  6 sxk97        <tibble~ ts            <fn>     holt.wint~ <fn>  <HltW~
##  7 sxk97        <tibble~ ts            <fn>     auto.arima <fn>  <fr_A~
##  8 sxk9e        <tibble~ ts            <fn>     stlm       <fn>  <stlm>
##  9 sxk9e        <tibble~ msts          <fn>     tbats      <fn>  <tbat~
## 10 sxk9e        <tibble~ ts            <fn>     tbats      <fn>  <tbat~
## # ... with 11 more rows, and 2 more variables: train_error <dbl>,
## #   test_error <dbl>

Do the nested fitting

After that, we make model to predict for the next 7 days.

## # A tibble: 21 x 10
## # Groups:   src_sub_area [3]
##    src_sub_area fulldata data_fun_name data_fun model_name model fitted
##    <chr>        <list>   <chr>         <list>   <chr>      <lis> <list>
##  1 sxk97        <tibble~ ts            <fn>     stlm       <fn>  <stlm>
##  2 sxk97        <tibble~ msts          <fn>     tbats      <fn>  <tbat~
##  3 sxk97        <tibble~ ts            <fn>     tbats      <fn>  <tbat~
##  4 sxk97        <tibble~ ts            <fn>     ets        <fn>  <ets> 
##  5 sxk97        <tibble~ msts          <fn>     holt.wint~ <fn>  <HltW~
##  6 sxk97        <tibble~ ts            <fn>     holt.wint~ <fn>  <HltW~
##  7 sxk97        <tibble~ ts            <fn>     auto.arima <fn>  <fr_A~
##  8 sxk9e        <tibble~ ts            <fn>     stlm       <fn>  <stlm>
##  9 sxk9e        <tibble~ msts          <fn>     tbats      <fn>  <tbat~
## 10 sxk9e        <tibble~ ts            <fn>     tbats      <fn>  <tbat~
## # ... with 11 more rows, and 3 more variables: train_error <dbl>,
## #   test_error <dbl>, forecast <list>

To obtain the best model, filter() again the lowest error

## # A tibble: 3 x 9
##   src_sub_area fulldata data_fun_name data_fun model_name model train_error
##   <chr>        <list>   <chr>         <list>   <chr>      <lis>       <dbl>
## 1 sxk97        <tibble~ msts          <fn>     tbats      <fn>         4.69
## 2 sxk9e        <tibble~ msts          <fn>     tbats      <fn>         6.48
## 3 sxk9s        <tibble~ msts          <fn>     tbats      <fn>         4.81
## # ... with 2 more variables: test_error <dbl>, forecast <list>

Finally, we have to unnest the model to retrieve the results

## # A tibble: 35,280 x 4
## # Groups:   src_sub_area [3]
##    src_sub_area key    start_time          demand
##    <chr>        <chr>  <dttm>               <dbl>
##  1 sxk97        actual 2017-10-01 00:00:00      6
##  2 sxk97        actual 2017-10-01 01:00:00      4
##  3 sxk97        actual 2017-10-01 02:00:00      9
##  4 sxk97        actual 2017-10-01 03:00:00      2
##  5 sxk97        actual 2017-10-01 04:00:00      5
##  6 sxk97        actual 2017-10-01 05:00:00      4
##  7 sxk97        actual 2017-10-01 06:00:00      1
##  8 sxk97        actual 2017-10-01 07:00:00      0
##  9 sxk97        actual 2017-10-01 08:00:00      3
## 10 sxk97        actual 2017-10-01 09:00:00      3
## # ... with 35,270 more rows

Now, to get a better view, we can visualize the data using graph

Conlusion

Data Summary

To conclude the time-series process, I will filter the data for forecast key.

To check the quality of the forecast result, we can use MAE (mean absolute error) and evaluate the error for each src_sub_area :

  1. For November 26th 2017 - December 2nd 2017 demand prediction (using own evaluation dataset)

Here is the final demand results:

## # A tibble: 3,528 x 4
## # Groups:   src_sub_area [3]
##    src_sub_area key              start_time          demand
##    <chr>        <chr>            <dttm>               <dbl>
##  1 sxk97        msts-holt.winter 2017-11-26 00:00:00     24
##  2 sxk97        msts-holt.winter 2017-11-26 01:00:00     27
##  3 sxk97        msts-holt.winter 2017-11-26 02:00:00     20
##  4 sxk97        msts-holt.winter 2017-11-26 03:00:00     14
##  5 sxk97        msts-holt.winter 2017-11-26 04:00:00      9
##  6 sxk97        msts-holt.winter 2017-11-26 05:00:00      6
##  7 sxk97        msts-holt.winter 2017-11-26 06:00:00      1
##  8 sxk97        msts-holt.winter 2017-11-26 07:00:00      3
##  9 sxk97        msts-holt.winter 2017-11-26 08:00:00      5
## 10 sxk97        msts-holt.winter 2017-11-26 09:00:00      5
## # ... with 3,518 more rows

I can filter() for the best error.

## # A tibble: 3 x 5
##   src_sub_area data_fun_name model_name train_error test_error
##   <chr>        <chr>         <chr>            <dbl>      <dbl>
## 1 sxk97        msts          tbats             4.69       7.42
## 2 sxk9e        msts          tbats             6.48       9.76
## 3 sxk9s        msts          tbats             4.81       6.94

Results :
msts-tbats produced the lowest error among the other models.

  1. For December 3rd 2017 - December 9th 2017 demand prediction (using “data-test” set)

Here is the final demand results:

## # A tibble: 3,528 x 4
## # Groups:   src_sub_area [3]
##    src_sub_area key      start_time          demand
##    <chr>        <chr>    <dttm>               <dbl>
##  1 sxk97        forecast 2017-12-03 00:00:00     43
##  2 sxk97        forecast 2017-12-03 01:00:00     28
##  3 sxk97        forecast 2017-12-03 02:00:00     23
##  4 sxk97        forecast 2017-12-03 03:00:00     15
##  5 sxk97        forecast 2017-12-03 04:00:00      9
##  6 sxk97        forecast 2017-12-03 05:00:00      5
##  7 sxk97        forecast 2017-12-03 06:00:00      5
##  8 sxk97        forecast 2017-12-03 07:00:00     10
##  9 sxk97        forecast 2017-12-03 08:00:00     20
## 10 sxk97        forecast 2017-12-03 09:00:00     29
## # ... with 3,518 more rows

Filter the lowest error

## # A tibble: 3 x 5
##   src_sub_area data_fun_name model_name train_error test_error
##   <chr>        <chr>         <chr>            <dbl>      <dbl>
## 1 sxk97        msts          tbats             4.69       7.42
## 2 sxk9e        msts          tbats             6.48       9.76
## 3 sxk9s        msts          tbats             4.81       6.94

Results:
msts-tbats also produces the lowest error for the next 7 days. All of the three areas have MAE for train_error and test_error value less than 10. This is a good news.

Assumption Checking

Last but not least, I am going to check two assumptions of a good time-series forecasting model, which are no Auto-correlation and Normality Residual.

Sub Area sxk97

  1. No Auto-correlation Check
    I will use Ljung-Box test to find whether the model has auto-correlation or not.
## 
##  Box-Ljung test
## 
## data:  residuals(scot_test_full$fitted[[1]])
## X-squared = 0.0095585, df = 1, p-value = 0.9221

Based on Hypothesis Testing criterion :
H0: no-autocorrelation <- p-value > 0.05
H1: autocorrelation <- p-value < 0.05

Since p-value is 0.9221 > 0.05 , the model has no- autocorrelation. This means that the residual has no correlation each other.

  1. Normality Residual Check
    The last assumption is normality residual. I will check for the histogram of residuals distibution.

From the chart above, we can see that the residuals are normally distributed. This indicates that most of the residuals (error) are approaching 0 (zero).

Sub Area sxk9e

  1. No Auto-correlation Check
    I will use Ljung-Box test to find whether the model has auto-correlation or not.
## 
##  Box-Ljung test
## 
## data:  residuals(scot_test_full$fitted[[2]])
## X-squared = 0.005713, df = 1, p-value = 0.9397

Based on Hypothesis Testing criterion :
H0: no-autocorrelation <- p-value > 0.05
H1: autocorrelation <- p-value < 0.05

Since p-value is 0.93 > 0.05 , the model has no-autocorrelation. This means that the residual has no correlation each other. All information are collected and used at this sub area. It can be called a good model.

  1. Normality Residual Check
    The last assumption is normality residual. I will check for the histogram of residuals distibution.

Just like area sxk97, the residuals are normally distributed. This indicates that most of the residuals (error) are approaching 0 (zero).

Sub Area sxk9s

  1. No Auto-correlation Check
    I will use Ljung-Box test to find whether the model has auto-correlation or not.
## 
##  Box-Ljung test
## 
## data:  residuals(scot_test_full$fitted[[3]])
## X-squared = 0.27352, df = 1, p-value = 0.601

Based on Hypothesis Testing criterion :
H0: no-autocorrelation <- p-value > 0.05
H1: autocorrelation <- p-value < 0.05

Since p-value is 0.60 > 0.05 , the model has no-autocorrelation. This means that the residual has no correlation each other.

  1. Normality Residual Check
    The last assumption is normality residual. I will check for the histogram of residuals distibution.

Residuals of sub area sxk9s also normally distributed. This indicates that most of the residuals (error) are approaching 0 (zero).

Results Summary

After creating all of time-series forecasting process, we have can say that the best model with lowest error for this kind of data set is the msts-tbats with MAE < 10. Based on the assumption check, area sxk97, area sxk9e, and sxk9s have no auto-correlation between the residuals. This sounds great for a time-series modelling since all information are used well on the model. Hence, the model is effectively build. Moreover, the three areas’ residuals are normally distributed based on the histogram. Overall, we can say that using automated time-series forecasting model for this dataset is a pretty good option.

Data Exporting and Submission

I will read the data submission and stored in in submission object

## # A tibble: 5 x 3
##   src_sub_area datetime            demand
##   <chr>        <dttm>              <lgl> 
## 1 sxk97        2017-12-03 00:00:00 NA    
## 2 sxk97        2017-12-03 01:00:00 NA    
## 3 sxk97        2017-12-03 02:00:00 NA    
## 4 sxk97        2017-12-03 03:00:00 NA    
## 5 sxk97        2017-12-03 04:00:00 NA

Next, the following codes are written for submission collection

Obtain the best_model from scotty_error that has been made. This is the error of msts_tbats.

## # A tibble: 3 x 4
##   src_sub_area data_fun_name model_name test_error
##   <chr>        <chr>         <chr>           <dbl>
## 1 sxk97        msts          tbats            7.42
## 2 sxk9e        msts          tbats            9.76
## 3 sxk9s        msts          tbats            6.94

Join the error once more

## # A tibble: 3 x 8
##   src_sub_area data_fun_name model_name       train      test data_fun model
##   <chr>        <chr>         <chr>      <list<df[,> <list<df> <list>   <lis>
## 1 sxk97        msts          tbats      [1,344 x 2] [168 x 2] <fn>     <fn> 
## 2 sxk9e        msts          tbats      [1,344 x 2] [168 x 2] <fn>     <fn> 
## 3 sxk9s        msts          tbats      [1,344 x 2] [168 x 2] <fn>     <fn> 
## # ... with 1 more variable: fitted <list>

Another small adjusment

## # A tibble: 3 x 10
##   src_sub_area data_fun_name model_name       train      test data_fun model
##   <chr>        <chr>         <chr>      <list<df[,> <list<df> <list>   <lis>
## 1 sxk97        msts          tbats      [1,344 x 2] [168 x 2] <fn>     <fn> 
## 2 sxk9e        msts          tbats      [1,344 x 2] [168 x 2] <fn>     <fn> 
## 3 sxk9s        msts          tbats      [1,344 x 2] [168 x 2] <fn>     <fn> 
## # ... with 3 more variables: fitted <list>, forecast <list>, key <chr>

Now, let’s build the final submission!

## # A tibble: 504 x 3
##    src_sub_area datetime            demand
##    <chr>        <dttm>               <dbl>
##  1 sxk97        2017-12-03 00:00:00     20
##  2 sxk97        2017-12-03 01:00:00     17
##  3 sxk97        2017-12-03 02:00:00     12
##  4 sxk97        2017-12-03 03:00:00      9
##  5 sxk97        2017-12-03 04:00:00      7
##  6 sxk97        2017-12-03 05:00:00      4
##  7 sxk97        2017-12-03 06:00:00      3
##  8 sxk97        2017-12-03 07:00:00      7
##  9 sxk97        2017-12-03 08:00:00     13
## 10 sxk97        2017-12-03 09:00:00     13
## # ... with 494 more rows

I will export the data to csv file

Ending

So, that’s all for the process of automated time-series forecasting using purrr package in R programming language.I hope this page can help you understand time-series problem and the solution behind it.

See you in the other page!

Author,
Alfado Sembiring

Notes :
In case you want to look up my profile, click the link below :
Jump To My Profile (open link in a new tab)