1 Pre-processing

First, let’s set the environment and load pacman package for loading libraries easily.

In this project, we are trying to forecast the visitor of a business for the next 7 days. The data is gathered from a data science bootcamp. From the data we’ll get the receipts with many branches. Let’s take a glimpse of the data we have.

## Observations: 7,063,969
## Variables: 11
## $ datetime         <chr> "2017-12-01 00:47:29", "2017-12-01 00:47:29",...
## $ outlet           <chr> "E_46", "E_46", "E_46", "E_46", "E_46", "E_46...
## $ receipt          <chr> "A0017765", "A0017765", "A0017765", "A0017765...
## $ item             <int> 10100101, 10200029, 10400016, 10500028, 10500...
## $ item_group       <chr> "NOODLE_DISH", "RICE_DISH", "SIDE_DISH", "DRI...
## $ item_major_group <chr> "FOOD", "FOOD", "FOOD", "BEVERAGES", "BEVERAG...
## $ qty              <int> 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, ...
## $ price            <dbl> 3.50, 6.96, 9.92, 4.74, 3.01, 5.48, 3.01, 4.7...
## $ total            <dbl> 3.50, 6.96, 9.92, 4.74, 6.01, 5.48, 3.01, 4.7...
## $ payment          <chr> "CASH", "CASH", "CASH", "CASH", "CASH", "CASH...
## $ sales_type       <chr> "DINE_IN", "DINE_IN", "DINE_IN", "DINE_IN", "...

Here, we’ve known that we have 7,063,969 observations and 11 variables. Each observation contains an item which has been ordered.

##  [1] "E_46" "E_11" "E_34" "E_30" "E_48" "E_44" "E_26" "E_63" "E_22" "C_05"
## [11] "E_42" "E_37" "A_05" "E_43" "A_21" "C_03" "A_10" "C_02" "A_12" "A_19"
## [21] "E_56" "A_08" "E_55" "A_22" "E_02" "D_01" "E_35" "A_02" "E_49" "A_07"
## [31] "E_16" "D_02" "E_08" "A_11" "B_01" "A_18" "E_60" "A_13" "A_20" "E_50"
## [41] "A_01" "E_21" "E_06" "E_45" "E_18" "E_09" "C_01" "A_09" "A_23" "E_27"
## [51] "E_62" "E_20" "E_04" "A_14" "E_32" "E_36" "A_06" "D_03" "E_51" "E_10"
## [61] "A_17" "E_40" "E_33" "E_47" "E_54" "E_14" "E_19" "E_24" "E_61" "E_05"
## [71] "E_57" "E_38" "E_28" "E_13" "E_41" "E_58" "E_07" "E_25" "E_12" "E_53"
## [81] "A_15" "E_17" "E_01" "E_52" "E_31" "E_23" "C_04" "E_29" "E_15" "E_39"
## [91] "A_16" "A_04" "E_03" "A_03" "E_59"

There are 95 unique outlets in observation, but here we only want to analyze the trends by the region of the outlets (the first character) which is A, B, C, D and E. There is also no variable of amount of visitor which we want to predict, but we can do some feature engineering to get visitor variable from the unique variable available, which is from the unique of the receipt code.

To predict the visitor in each outlet region and each hour, let’s just proceed the datetime, visitor, and receipt variables, and ignore the others. The datetime is still in character class, and contain the minutes, and seconds. The outlet also contains specific number details. Here, we would convert the class of datetime, round the datetime into the floor time, and get first character of the outlet.

##   floor_date outlet_group  receipt
## 1 2017-12-01            E A0017765
## 2 2017-12-01            E A0017766
## 3 2017-12-01            E A0017767
## 4 2017-12-01            E A0017768
## 5 2017-12-01            E A0017769
## 6 2017-12-01            E A0017770

Before we proceed futher, let’s check if we have any NA in the data.

##   floor_date outlet_group      receipt 
##            0            0            0

And then, we use count function to get the visitor variable for each hour of time and outlet.

## # A tibble: 6 x 3
##   datetime            outlet visitor
##   <dttm>              <chr>    <int>
## 1 2017-12-01 00:00:00 E            7
## 2 2017-12-01 01:00:00 E           20
## 3 2017-12-01 02:00:00 E           13
## 4 2017-12-01 03:00:00 E           13
## 5 2017-12-01 04:00:00 E           16
## 6 2017-12-01 05:00:00 E            5

Now, we have a far less amount of rows. Let’s visualize our data to get the good view of how the time series look like.

It’s clearly showed that outlet E has the most visitors among all outlets, followed by outlet A. In outlet A and E, we also see there are trends and seasonalities, which is probably high on weekend, that happen especially after Christmas & New Year Season. For the reason, we will only use the data from second week of January. After seeing the trend and seasonality, we can confirm that there are corellations among successive observations, so we can proceed futher the time series analysis and forecasting.

Then, after seeing the plot of overall data, let’s zoom out to smaller range.

In zoom-out view as the plot above, we see there are also seasonalities at each hour, but looks differently in different outlets. Because we find there are more than one seasonality in our data (weekly and hourly), it’s better to use stlm or tbats for our methods for forecasting. We will see the better understanding of the trends and seasonalities after creating the time series objects of each outlet.

Before creating time series objects, from the last table above, it is showed not every hour outlets have visitors. Here, we have to create a data frame which contains every hour for each outlet, merge with the data frame we already have, and fill with 0, if there are no visitors recorded.

Then, after binding every outlet together, in order to only use the data with date after second week of January (8 January 2018), we need to find the starting index.

## [1] 4561 4562 4563 4564 4565
##        datetime visitor outlet
## 4561 2018-01-08       5      A
## 4562 2018-01-08       0      B
## 4563 2018-01-08       0      C
## 4564 2018-01-08       0      D
## 4565 2018-01-08     193      E

The plots above look like a multiplicative time series, so let’s normalize the data. Here, I use two times root of square of the visitor column due to two seasonalities.

##               datetime  visitor outlet
## 1  2018-01-08 00:00:00 1.495349      A
## 2  2018-01-08 00:00:00 0.000000      B
## 3  2018-01-08 00:00:00 0.000000      C
## 4  2018-01-08 00:00:00 0.000000      D
## 5  2018-01-08 00:00:00 3.727257      E
## 6  2018-01-08 01:00:00 0.000000      A
## 7  2018-01-08 01:00:00 0.000000      B
## 8  2018-01-08 01:00:00 0.000000      C
## 9  2018-01-08 01:00:00 0.000000      D
## 10 2018-01-08 01:00:00 2.932972      E

For testing the model we will make, we split the data by last 7 days.

For multiple output, we can use sweep package to help us perform a forecast on each outlet altogether, without running one by one.

First, we have to group the data by outlet and bundle them into list-column using nest function from tidyrpackage .

## # A tibble: 5 x 2
##   outlet data.tbl          
##   <chr>  <list>            
## 1 A      <tibble [912 x 2]>
## 2 B      <tibble [912 x 2]>
## 3 C      <tibble [912 x 2]>
## 4 D      <tibble [912 x 2]>
## 5 E      <tibble [912 x 2]>

After that, we create time series with aid of map function. Due to the data is in hourly period, we use frequency of 24.

## # A tibble: 5 x 3
##   outlet data.tbl           data.ts 
##   <chr>  <list>             <list>  
## 1 A      <tibble [912 x 2]> <S3: ts>
## 2 B      <tibble [912 x 2]> <S3: ts>
## 3 C      <tibble [912 x 2]> <S3: ts>
## 4 D      <tibble [912 x 2]> <S3: ts>
## 5 E      <tibble [912 x 2]> <S3: ts>

Then, for multiseasonal time series, we have to convert the ts with msts function with two periods 24 (hourly) and 24*7=168 (weekly).

## # A tibble: 5 x 4
##   outlet data.tbl           data.ts  data.msts 
##   <chr>  <list>             <list>   <list>    
## 1 A      <tibble [912 x 2]> <S3: ts> <S3: msts>
## 2 B      <tibble [912 x 2]> <S3: ts> <S3: msts>
## 3 C      <tibble [912 x 2]> <S3: ts> <S3: msts>
## 4 D      <tibble [912 x 2]> <S3: ts> <S3: msts>
## 5 E      <tibble [912 x 2]> <S3: ts> <S3: msts>

2 Training the models

Here, I want to compare two models using stlm and tbats.

2.1 Using stlm

First, let’s create stlm model.

## # A tibble: 5 x 5
##   outlet data.tbl           data.ts  data.msts  fit.stlm  
##   <chr>  <list>             <list>   <list>     <list>    
## 1 A      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: stlm>
## 2 B      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: stlm>
## 3 C      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: stlm>
## 4 D      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: stlm>
## 5 E      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: stlm>

Then, forecast the next week as the periods that we need to predict.

## # A tibble: 5 x 6
##   outlet data.tbl           data.ts  data.msts  fit.stlm   fcast.stlm    
##   <chr>  <list>             <list>   <list>     <list>     <list>        
## 1 A      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: stlm> <S3: forecast>
## 2 B      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: stlm> <S3: forecast>
## 3 C      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: stlm> <S3: forecast>
## 4 D      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: stlm> <S3: forecast>
## 5 E      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: stlm> <S3: forecast>

We can apply sw_sweep to get the forecast in a nice “tidy” data frame. Here, we need to convert back the value of visitor that we have normalized.

## Warning in sw_sweep.forecast(.x[[i]], ...): Object has no timetk index.
## Using default index.

## Warning in sw_sweep.forecast(.x[[i]], ...): Object has no timetk index.
## Using default index.

## Warning in sw_sweep.forecast(.x[[i]], ...): Object has no timetk index.
## Using default index.

## Warning in sw_sweep.forecast(.x[[i]], ...): Object has no timetk index.
## Using default index.

## Warning in sw_sweep.forecast(.x[[i]], ...): Object has no timetk index.
## Using default index.
## # A tibble: 6 x 9
##   outlet index key         value lo.80 lo.95 hi.80        hi.95 visitor
##   <chr>  <dbl> <chr>       <dbl> <dbl> <dbl> <dbl>        <dbl>   <dbl>
## 1 E       7.42 forecast 5.59       808   729  1165 282429536481     974
## 2 A       7.42 forecast 2.39        23    19    45       130321      33
## 3 B       7.42 forecast 0.000731     0     0     0            0       0
## 4 C       7.42 forecast 0.0626       0     0     0            0       0
## 5 D       7.42 forecast 1.43         3     2     6           16       4
## 6 E       7.42 forecast 4.74       404   358   623  16426010896     505

Then, let’s compare the actual and the forecast values.

## # A tibble: 840 x 4
##    outlet index visitor_actual visitor_forecast
##    <chr>  <dbl>          <dbl>            <dbl>
##  1 A       6.43             26                6
##  2 B       6.43              0                0
##  3 C       6.43              0                0
##  4 D       6.43              0                0
##  5 E       6.43            228              211
##  6 A       6.43              4                0
##  7 B       6.43              0                0
##  8 C       6.43              0                0
##  9 D       6.43              0                0
## 10 E       6.43             80               86
## # ... with 830 more rows

Here, we get the mean summary of the model.

## # A tibble: 5 x 3
##   outlet `mean(visitor_actual)` `mean(visitor_forecast)`
##   <chr>                   <dbl>                    <dbl>
## 1 A                       194.                     188. 
## 2 B                        14.7                     14.9
## 3 C                        31.2                     32.3
## 4 D                        21.6                     22.5
## 5 E                       649                      652.

Let’s calculate the rmse of the model.

## For binary classification, the first factor level is assumed to be the event.
## Set the global option `yardstick.event_first` to `FALSE` to change this.
## 
## Attaching package: 'yardstick'
## The following object is masked from 'package:forecast':
## 
##     accuracy
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        43.3

2.2 Using tbats

With the same process as lstm model, we proceed the forecast with tbats.

## # A tibble: 5 x 5
##   outlet data.tbl           data.ts  data.msts  fit.tbats  
##   <chr>  <list>             <list>   <list>     <list>     
## 1 A      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: tbats>
## 2 B      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: tbats>
## 3 C      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: tbats>
## 4 D      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: tbats>
## 5 E      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: tbats>
## # A tibble: 5 x 6
##   outlet data.tbl           data.ts  data.msts  fit.tbats   fcast.tbats   
##   <chr>  <list>             <list>   <list>     <list>      <list>        
## 1 A      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: tbats> <S3: forecast>
## 2 B      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: tbats> <S3: forecast>
## 3 C      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: tbats> <S3: forecast>
## 4 D      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: tbats> <S3: forecast>
## 5 E      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: tbats> <S3: forecast>
## # A tibble: 6 x 8
##   outlet index key      visitor lo.80 lo.95 hi.80        hi.95
##   <chr>  <dbl> <chr>      <dbl> <dbl> <dbl> <dbl>        <dbl>
## 1 E       7.42 forecast     978   789   701  1200 241474942801
## 2 A       7.42 forecast      40    25    19    60       130321
## 3 B       7.42 forecast       0     0     0     0            0
## 4 C       7.42 forecast       0     0     0     0            0
## 5 D       7.42 forecast       3     1     1     5            1
## 6 E       7.42 forecast     499   386   335   634  12594450625
## # A tibble: 840 x 4
##    outlet index visitor_actual visitor_forecast
##    <chr>  <dbl>          <dbl>            <dbl>
##  1 A       6.43             26                8
##  2 B       6.43              0                0
##  3 C       6.43              0                0
##  4 D       6.43              0                0
##  5 E       6.43            228              244
##  6 A       6.43              4                0
##  7 B       6.43              0                0
##  8 C       6.43              0                0
##  9 D       6.43              0                0
## 10 E       6.43             80               87
## # ... with 830 more rows
## # A tibble: 5 x 3
##   outlet `mean(visitor_actual)` `mean(visitor_forecast)`
##   <chr>                   <dbl>                    <dbl>
## 1 A                       194.                     184. 
## 2 B                        14.7                     14.5
## 3 C                        31.2                     32.3
## 4 D                        21.6                     22.8
## 5 E                       649                      635.
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        57.5

3 Getting the Forecast

From rmse result, we know that lstm (43.3) is better than tbats model (57.6). So, let’s use lstm for forecasting the visitors 7 days after 22 February 2018. Here, we use the same process to get the forecast with combining train and test data we split previously.

## # A tibble: 840 x 3
##    datetime outlet visitor
##       <dbl> <chr>    <dbl>
##  1     7.43 A           10
##  2     7.43 B            0
##  3     7.43 C            0
##  4     7.43 D            0
##  5     7.43 E          222
##  6     7.43 A            0
##  7     7.43 B            0
##  8     7.43 C            0
##  9     7.43 D            0
## 10     7.43 E           77
## # ... with 830 more rows

Let’s input the visitor values to the submission files.

##                 datetime outlet visitor
##   1: 2018-02-22 00:00:00      A      10
##   2: 2018-02-22 00:00:00      B       0
##   3: 2018-02-22 00:00:00      C       0
##   4: 2018-02-22 00:00:00      D       0
##   5: 2018-02-22 00:00:00      E     222
##  ---                                   
## 836: 2018-02-28 23:00:00      A      35
## 837: 2018-02-28 23:00:00      B       0
## 838: 2018-02-28 23:00:00      C       0
## 839: 2018-02-28 23:00:00      D       1
## 840: 2018-02-28 23:00:00      E     475

Finally, let’s write the results for the submission.