1 Pre-processing

First, let’s set the environment and load pacman package for loading libraries easily.

knitr::opts_chunk$set(echo = TRUE)
library(pacman)

In this project, we are trying to forecast the visitor of a business for the next 7 days. The data is gathered from a data science bootcamp. From the data we’ll get the receipts with many branches. Let’s take a glimpse of the data we have.

p_load(data.table) # for faster and easily processing of large data
p_load(dplyr) # for data manipulation
cs <- fread("data/cs_train.csv")
cs %>% glimpse()

## Observations: 7,063,969
## Variables: 11
## $ datetime         <chr> "2017-12-01 00:47:29", "2017-12-01 00:47:29",...
## $ outlet           <chr> "E_46", "E_46", "E_46", "E_46", "E_46", "E_46...
## $ receipt          <chr> "A0017765", "A0017765", "A0017765", "A0017765...
## $ item             <int> 10100101, 10200029, 10400016, 10500028, 10500...
## $ item_group       <chr> "NOODLE_DISH", "RICE_DISH", "SIDE_DISH", "DRI...
## $ item_major_group <chr> "FOOD", "FOOD", "FOOD", "BEVERAGES", "BEVERAG...
## $ qty              <int> 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, ...
## $ price            <dbl> 3.50, 6.96, 9.92, 4.74, 3.01, 5.48, 3.01, 4.7...
## $ total            <dbl> 3.50, 6.96, 9.92, 4.74, 6.01, 5.48, 3.01, 4.7...
## $ payment          <chr> "CASH", "CASH", "CASH", "CASH", "CASH", "CASH...
## $ sales_type       <chr> "DINE_IN", "DINE_IN", "DINE_IN", "DINE_IN", "...

Here, we’ve known that we have 7,063,969 observations and 11 variables. Each observation contains an item which has been ordered.

cs %>% distinct(outlet) %>% pull(outlet)

##  [1] "E_46" "E_11" "E_34" "E_30" "E_48" "E_44" "E_26" "E_63" "E_22" "C_05"
## [11] "E_42" "E_37" "A_05" "E_43" "A_21" "C_03" "A_10" "C_02" "A_12" "A_19"
## [21] "E_56" "A_08" "E_55" "A_22" "E_02" "D_01" "E_35" "A_02" "E_49" "A_07"
## [31] "E_16" "D_02" "E_08" "A_11" "B_01" "A_18" "E_60" "A_13" "A_20" "E_50"
## [41] "A_01" "E_21" "E_06" "E_45" "E_18" "E_09" "C_01" "A_09" "A_23" "E_27"
## [51] "E_62" "E_20" "E_04" "A_14" "E_32" "E_36" "A_06" "D_03" "E_51" "E_10"
## [61] "A_17" "E_40" "E_33" "E_47" "E_54" "E_14" "E_19" "E_24" "E_61" "E_05"
## [71] "E_57" "E_38" "E_28" "E_13" "E_41" "E_58" "E_07" "E_25" "E_12" "E_53"
## [81] "A_15" "E_17" "E_01" "E_52" "E_31" "E_23" "C_04" "E_29" "E_15" "E_39"
## [91] "A_16" "A_04" "E_03" "A_03" "E_59"

There are 95 unique outlets in observation, but here we only want to analyze the trends by the region of the outlets (the first character) which is A, B, C, D and E. There is also no variable of amount of visitor which we want to predict, but we can do some feature engineering to get visitor variable from the unique variable available, which is from the unique of the receipt code.

To predict the visitor in each outlet region and each hour, let’s just proceed the datetime, visitor, and receipt variables, and ignore the others. The datetime is still in character class, and contain the minutes, and seconds. The outlet also contains specific number details. Here, we would convert the class of datetime, round the datetime into the floor time, and get first character of the outlet.

p_load(lubridate) # for tranforming datetime
cs_dor <- cs %>%
  distinct(datetime, outlet, receipt) %>%
  mutate(datetime = ymd_hms(datetime), floor_date = floor_date(datetime, "hour"), outlet_group = substr(outlet, 1, 1)) %>%
  select(floor_date, outlet_group, receipt)
cs_dor %>% head()

##   floor_date outlet_group  receipt
## 1 2017-12-01            E A0017765
## 2 2017-12-01            E A0017766
## 3 2017-12-01            E A0017767
## 4 2017-12-01            E A0017768
## 5 2017-12-01            E A0017769
## 6 2017-12-01            E A0017770

Before we proceed futher, let’s check if we have any NA in the data.

colSums(is.na(cs_dor))

##   floor_date outlet_group      receipt 
##            0            0            0

And then, we use count function to get the visitor variable for each hour of time and outlet.

cs_vis <- cs_dor %>%
  count(floor_date, outlet_group) %>%
  select(datetime = floor_date, outlet = outlet_group, visitor = n)
cs_vis %>% head()

## # A tibble: 6 x 3
##   datetime            outlet visitor
##   <dttm>              <chr>    <int>
## 1 2017-12-01 00:00:00 E            7
## 2 2017-12-01 01:00:00 E           20
## 3 2017-12-01 02:00:00 E           13
## 4 2017-12-01 03:00:00 E           13
## 5 2017-12-01 04:00:00 E           16
## 6 2017-12-01 05:00:00 E            5

Now, we have a far less amount of rows. Let’s visualize our data to get the good view of how the time series look like.

p_load(ggplot2)
p_load(viridis) # for better visualization

cs_vis %>%
  arrange(datetime) %>%
  group_by(outlet) %>%
  ggplot(aes(datetime, visitor, col = outlet)) +
  geom_line() +
  scale_color_viridis(discrete = TRUE) +
  theme_classic(base_size = 12, base_family = "mono")

It’s clearly showed that outlet E has the most visitors among all outlets, followed by outlet A. In outlet A and E, we also see there are trends and seasonalities, which is probably high on weekend, that happen especially after Christmas & New Year Season. For the reason, we will only use the data from second week of January. After seeing the trend and seasonality, we can confirm that there are corellations among successive observations, so we can proceed futher the time series analysis and forecasting.

Then, after seeing the plot of overall data, let’s zoom out to smaller range.

cs_vis[4000:4250, ] %>%
  arrange(datetime) %>%
  group_by(outlet) %>%
  ggplot(aes(datetime, visitor, col = outlet)) +
  geom_line() +
  scale_color_viridis(discrete = TRUE) +
  theme_classic(base_size = 12, base_family = "mono")

In zoom-out view as the plot above, we see there are also seasonalities at each hour, but looks differently in different outlets. Because we find there are more than one seasonality in our data (weekly and hourly), it’s better to use stlm or tbats for our methods for forecasting. We will see the better understanding of the trends and seasonalities after creating the time series objects of each outlet.

Before creating time series objects, from the last table above, it is showed not every hour outlets have visitors. Here, we have to create a data frame which contains every hour for each outlet, merge with the data frame we already have, and fill with 0, if there are no visitors recorded.

# filter by outlet
cs_vis_a <- cs_vis %>%
  filter(outlet == "A")
cs_vis_b <- cs_vis %>%
  filter(outlet == "B")
cs_vis_c <- cs_vis %>%
  filter(outlet == "C")
cs_vis_d <- cs_vis %>%
  filter(outlet == "D")
cs_vis_e <- cs_vis %>%
  filter(outlet == "E")

# create a data frame contains every hour from the interval of the data
time_int <- data.frame(datetime = seq(
  from = range(cs_vis$datetime)[1],
  to = range(cs_vis$datetime)[2],
  by = "hour"
))

# merge it with each outlets
cs_vis_a_m <- merge(x = time_int, y = cs_vis_a, by = "datetime", all = TRUE) %>%
  select(datetime, visitor) %>%
  replace(is.na(.), 0)
cs_vis_b_m <- merge(x = time_int, y = cs_vis_b, by = "datetime", all = TRUE) %>%
  select(datetime, visitor) %>%
  replace(is.na(.), 0)
cs_vis_c_m <- merge(x = time_int, y = cs_vis_c, by = "datetime", all = TRUE) %>%
  select(datetime, visitor) %>%
  replace(is.na(.), 0)
cs_vis_d_m <- merge(x = time_int, y = cs_vis_d, by = "datetime", all = TRUE) %>%
  select(datetime, visitor) %>%
  replace(is.na(.), 0)
cs_vis_e_m <- merge(x = time_int, y = cs_vis_e, by = "datetime", all = TRUE) %>%
  select(datetime, visitor) %>%
  replace(is.na(.), 0)

Then, after binding every outlet together, in order to only use the data with date after second week of January (8 January 2018), we need to find the starting index.

# create column of outlet names
cs_vis_a_m <- cs_vis_a_m %>% mutate(outlet = "A")
cs_vis_b_m <- cs_vis_b_m %>% mutate(outlet = "B")
cs_vis_c_m <- cs_vis_c_m %>% mutate(outlet = "C")
cs_vis_d_m <- cs_vis_d_m %>% mutate(outlet = "D")
cs_vis_e_m <- cs_vis_e_m %>% mutate(outlet = "E")

# bind all outlets
cs_bind <-
  rbind(cs_vis_a_m, cs_vis_b_m, cs_vis_c_m, cs_vis_d_m, cs_vis_e_m) %>%
  arrange(datetime, outlet)

# find index with datetime of 2018-01-08 00:00:00
which((cs_bind$datetime == "2018-01-08 00:00:00 UTC")) + 7 * 5 # need to adjust the time for different time zone

## [1] 4561 4562 4563 4564 4565

cs_bind[4561:4565, ]

##        datetime visitor outlet
## 4561 2018-01-08       5      A
## 4562 2018-01-08       0      B
## 4563 2018-01-08       0      C
## 4564 2018-01-08       0      D
## 4565 2018-01-08     193      E

The plots above look like a multiplicative time series, so let’s normalize the data. Here, I use two times root of square of the visitor column due to two seasonalities.

# filter by the datetime (after 8 January 2018)
cs_bind_slc <- cs_bind %>%
  slice(4561:n())

# normalize data
cs_bind_scale <- cs_bind_slc %>%
  mutate(visitor = sqrt(sqrt(visitor)))
cs_bind_scale %>% head(., 10)

##               datetime  visitor outlet
## 1  2018-01-08 00:00:00 1.495349      A
## 2  2018-01-08 00:00:00 0.000000      B
## 3  2018-01-08 00:00:00 0.000000      C
## 4  2018-01-08 00:00:00 0.000000      D
## 5  2018-01-08 00:00:00 3.727257      E
## 6  2018-01-08 01:00:00 0.000000      A
## 7  2018-01-08 01:00:00 0.000000      B
## 8  2018-01-08 01:00:00 0.000000      C
## 9  2018-01-08 01:00:00 0.000000      D
## 10 2018-01-08 01:00:00 2.932972      E

For testing the model we will make, we split the data by last 7 days.

train_idx <- c(1:(nrow(cs_bind_slc) - (5 * 24 * 7)))

cs_train <- cs_bind_scale[train_idx, ]
cs_test <- cs_bind_scale[-train_idx, ]

For multiple output, we can use sweep package to help us perform a forecast on each outlet altogether, without running one by one.

First, we have to group the data by outlet and bundle them into list-column using nest function from tidyrpackage .

p_load(tidyr) # for nesting
cs_nest <- cs_train %>%
  group_by(outlet) %>%
  nest(.key = "data.tbl")
cs_nest

## # A tibble: 5 x 2
##   outlet data.tbl          
##   <chr>  <list>            
## 1 A      <tibble [912 x 2]>
## 2 B      <tibble [912 x 2]>
## 3 C      <tibble [912 x 2]>
## 4 D      <tibble [912 x 2]>
## 5 E      <tibble [912 x 2]>

After that, we create time series with aid of map function. Due to the data is in hourly period, we use frequency of 24.

# create ts data
p_load(purrr) # for mapping
p_load(timetk) # for accessing tk_ts fuction
cs_ts <- cs_nest %>%
  mutate(data.ts = map(
    .x = data.tbl,
    .f = tk_ts,
    select = -datetime,
    start = 1,
    freq = 24
  ))

cs_ts

## # A tibble: 5 x 3
##   outlet data.tbl           data.ts 
##   <chr>  <list>             <list>  
## 1 A      <tibble [912 x 2]> <S3: ts>
## 2 B      <tibble [912 x 2]> <S3: ts>
## 3 C      <tibble [912 x 2]> <S3: ts>
## 4 D      <tibble [912 x 2]> <S3: ts>
## 5 E      <tibble [912 x 2]> <S3: ts>

Then, for multiseasonal time series, we have to convert the ts with msts function with two periods 24 (hourly) and 24*7=168 (weekly).

# create multiseasonal ts data
p_load(forecast)
cs_msts <- cs_ts %>%
  mutate(data.msts = map(data.ts,
    msts,
    seasonal.periods = c(24, 24 * 7)
  ))
cs_msts

## # A tibble: 5 x 4
##   outlet data.tbl           data.ts  data.msts 
##   <chr>  <list>             <list>   <list>    
## 1 A      <tibble [912 x 2]> <S3: ts> <S3: msts>
## 2 B      <tibble [912 x 2]> <S3: ts> <S3: msts>
## 3 C      <tibble [912 x 2]> <S3: ts> <S3: msts>
## 4 D      <tibble [912 x 2]> <S3: ts> <S3: msts>
## 5 E      <tibble [912 x 2]> <S3: ts> <S3: msts>

2 Training the models

Here, I want to compare two models using stlm and tbats.

2.1 Using stlm

First, let’s create stlm model.

# forecast with stlm
cs_stlm <- cs_msts %>%
  mutate(fit.stlm = map(
    data.msts,
    stlm
  ))
cs_stlm

## # A tibble: 5 x 5
##   outlet data.tbl           data.ts  data.msts  fit.stlm  
##   <chr>  <list>             <list>   <list>     <list>    
## 1 A      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: stlm>
## 2 B      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: stlm>
## 3 C      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: stlm>
## 4 D      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: stlm>
## 5 E      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: stlm>

Then, forecast the next week as the periods that we need to predict.

# forecast the next 7 x 24 hours
cs_fcast_stlm <- cs_stlm %>%
  mutate(fcast.stlm = map(fit.stlm, forecast, h = 24 * 7))
cs_fcast_stlm

## # A tibble: 5 x 6
##   outlet data.tbl           data.ts  data.msts  fit.stlm   fcast.stlm    
##   <chr>  <list>             <list>   <list>     <list>     <list>        
## 1 A      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: stlm> <S3: forecast>
## 2 B      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: stlm> <S3: forecast>
## 3 C      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: stlm> <S3: forecast>
## 4 D      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: stlm> <S3: forecast>
## 5 E      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: stlm> <S3: forecast>

We can apply sw_sweep to get the forecast in a nice “tidy” data frame. Here, we need to convert back the value of visitor that we have normalized.

# get the forecast result of the visitor for each outlet and datetime
p_load(sweep)
cs_fcast_tidy <- cs_fcast_stlm %>%
  mutate(sweep = map(fcast.stlm, sw_sweep, fitted = FALSE, timetk_idx = TRUE)) %>%
  unnest(sweep) %>%
  mutate(
    visitor = round(value^2^2),
    lo.80 = round(lo.80^2^2),
    lo.95 = round(lo.95^2^2),
    hi.80 = round(hi.80^2^2),
    hi.95 = round(lo.95^2^2)
  ) %>%
  arrange(index, outlet)

## Warning in sw_sweep.forecast(.x[[i]], ...): Object has no timetk index.
## Using default index.

## Warning in sw_sweep.forecast(.x[[i]], ...): Object has no timetk index.
## Using default index.

## Warning in sw_sweep.forecast(.x[[i]], ...): Object has no timetk index.
## Using default index.

## Warning in sw_sweep.forecast(.x[[i]], ...): Object has no timetk index.
## Using default index.

## Warning in sw_sweep.forecast(.x[[i]], ...): Object has no timetk index.
## Using default index.

cs_fcast_tidy %>% tail()

## # A tibble: 6 x 9
##   outlet index key         value lo.80 lo.95 hi.80        hi.95 visitor
##   <chr>  <dbl> <chr>       <dbl> <dbl> <dbl> <dbl>        <dbl>   <dbl>
## 1 E       7.42 forecast 5.59       808   729  1165 282429536481     974
## 2 A       7.42 forecast 2.39        23    19    45       130321      33
## 3 B       7.42 forecast 0.000731     0     0     0            0       0
## 4 C       7.42 forecast 0.0626       0     0     0            0       0
## 5 D       7.42 forecast 1.43         3     2     6           16       4
## 6 E       7.42 forecast 4.74       404   358   623  16426010896     505

Then, let’s compare the actual and the forecast values.

# create table to compare actual and forecast visitor
cs_fcast_tidy$visitor_actual <- cs_bind_slc$visitor
cs_fcast_vs_actual <- cs_fcast_tidy %>%
  filter(key == "forecast") %>%
  mutate(visitor_forecast = visitor) %>%
  select(c(outlet, index, visitor_actual, visitor_forecast))
cs_fcast_vs_actual

## # A tibble: 840 x 4
##    outlet index visitor_actual visitor_forecast
##    <chr>  <dbl>          <dbl>            <dbl>
##  1 A       6.43             26                6
##  2 B       6.43              0                0
##  3 C       6.43              0                0
##  4 D       6.43              0                0
##  5 E       6.43            228              211
##  6 A       6.43              4                0
##  7 B       6.43              0                0
##  8 C       6.43              0                0
##  9 D       6.43              0                0
## 10 E       6.43             80               86
## # ... with 830 more rows

Here, we get the mean summary of the model.

# summarise the actual and forecast visitor for each outlet
cs_fcast_vs_actual %>%
  group_by(outlet) %>%
  summarise(mean(visitor_actual), mean(visitor_forecast))

## # A tibble: 5 x 3
##   outlet `mean(visitor_actual)` `mean(visitor_forecast)`
##   <chr>                   <dbl>                    <dbl>
## 1 A                       194.                     188. 
## 2 B                        14.7                     14.9
## 3 C                        31.2                     32.3
## 4 D                        21.6                     22.5
## 5 E                       649                      652.

Let’s calculate the rmse of the model.

# calculate the rmse of the model
library(yardstick)

## For binary classification, the first factor level is assumed to be the event.
## Set the global option `yardstick.event_first` to `FALSE` to change this.

## 
## Attaching package: 'yardstick'

## The following object is masked from 'package:forecast':
## 
##     accuracy

cs_fcast_rmse <- cs_fcast_vs_actual %>%
  # group_by(outlet) %>%
  rmse(truth = visitor_actual, estimate = visitor_forecast)
cs_fcast_rmse

## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        43.3

2.2 Using tbats

With the same process as lstm model, we proceed the forecast with tbats.

# forecast with tbats
cs_tbats <- cs_msts %>%
  mutate(fit.tbats = map(data.msts,
    tbats,
    use.box.cox = FALSE
  ))
cs_tbats

## # A tibble: 5 x 5
##   outlet data.tbl           data.ts  data.msts  fit.tbats  
##   <chr>  <list>             <list>   <list>     <list>     
## 1 A      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: tbats>
## 2 B      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: tbats>
## 3 C      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: tbats>
## 4 D      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: tbats>
## 5 E      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: tbats>

# forecast the next 7 x 24 hours
cs_fcast_tbats <- cs_tbats %>%
  mutate(fcast.tbats = map(fit.tbats, forecast, h = 24 * 7))
cs_fcast_tbats

## # A tibble: 5 x 6
##   outlet data.tbl           data.ts  data.msts  fit.tbats   fcast.tbats   
##   <chr>  <list>             <list>   <list>     <list>      <list>        
## 1 A      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: tbats> <S3: forecast>
## 2 B      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: tbats> <S3: forecast>
## 3 C      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: tbats> <S3: forecast>
## 4 D      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: tbats> <S3: forecast>
## 5 E      <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: tbats> <S3: forecast>

# get the forecast result of the visitor for each outlet and datetime
p_load(sweep)
cs_fcast_tidy <- cs_fcast_tbats %>%
  mutate(sweep = map(fcast.tbats, sw_sweep, fitted = FALSE)) %>%
  unnest(sweep) %>%
  arrange(index, outlet) %>%
  mutate(
    visitor = round(visitor^2^2),
    lo.80 = round(lo.80^2^2),
    lo.95 = round(lo.95^2^2),
    hi.80 = round(hi.80^2^2),
    hi.95 = round(lo.95^2^2)
  )

cs_fcast_tidy %>% tail()

## # A tibble: 6 x 8
##   outlet index key      visitor lo.80 lo.95 hi.80        hi.95
##   <chr>  <dbl> <chr>      <dbl> <dbl> <dbl> <dbl>        <dbl>
## 1 E       7.42 forecast     978   789   701  1200 241474942801
## 2 A       7.42 forecast      40    25    19    60       130321
## 3 B       7.42 forecast       0     0     0     0            0
## 4 C       7.42 forecast       0     0     0     0            0
## 5 D       7.42 forecast       3     1     1     5            1
## 6 E       7.42 forecast     499   386   335   634  12594450625

# create table to compare actual and forecast visitor
cs_fcast_tidy$visitor_actual <- cs_bind_slc$visitor
cs_fcast_vs_actual <- cs_fcast_tidy %>%
  filter(key == "forecast") %>%
  mutate(visitor_forecast = visitor) %>%
  select(c(outlet, index, visitor_actual, visitor_forecast))
cs_fcast_vs_actual

## # A tibble: 840 x 4
##    outlet index visitor_actual visitor_forecast
##    <chr>  <dbl>          <dbl>            <dbl>
##  1 A       6.43             26                8
##  2 B       6.43              0                0
##  3 C       6.43              0                0
##  4 D       6.43              0                0
##  5 E       6.43            228              244
##  6 A       6.43              4                0
##  7 B       6.43              0                0
##  8 C       6.43              0                0
##  9 D       6.43              0                0
## 10 E       6.43             80               87
## # ... with 830 more rows

# summarise the actual and forecast visitor for each outlet
cs_fcast_vs_actual %>%
  group_by(outlet) %>%
  summarise(mean(visitor_actual), mean(visitor_forecast))

## # A tibble: 5 x 3
##   outlet `mean(visitor_actual)` `mean(visitor_forecast)`
##   <chr>                   <dbl>                    <dbl>
## 1 A                       194.                     184. 
## 2 B                        14.7                     14.5
## 3 C                        31.2                     32.3
## 4 D                        21.6                     22.8
## 5 E                       649                      635.

# calculate the rmse of the model for each outlet
library(yardstick)
cs_fcast_rmse <- cs_fcast_vs_actual %>%
  # group_by(outlet) %>%
  rmse(truth = visitor_actual, estimate = visitor_forecast)
cs_fcast_rmse

## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        57.5

3 Getting the Forecast

From rmse result, we know that lstm (43.3) is better than tbats model (57.6). So, let’s use lstm for forecasting the visitors 7 days after 22 February 2018. Here, we use the same process to get the forecast with combining train and test data we split previously.

# to forecast the visitor after 22 February 2018
cs_nest2 <- cs_bind_scale %>%
  group_by(outlet) %>%
  nest(.key = "data.tbl") %>%
  mutate(data.ts = map(
    .x = data.tbl,
    .f = tk_ts,
    select = -datetime,
    start = 1,
    freq = 24
  )) %>%
  mutate(data.msts = map(data.ts,
    msts,
    seasonal.periods = c(24, 24 * 7)
  )) %>%
  mutate(fit.stlm = map(
    data.msts,
    stlm
  )) %>%
  mutate(fcast.stlm = map(fit.stlm, forecast, h = 24 * 7)) %>%
  mutate(sweep = map(fcast.stlm, sw_sweep, fitted = FALSE, timetk_idx = F)) %>%
  unnest(sweep) %>%
  arrange(index, outlet) %>%
  mutate(
    visitor = round(value^2^2),
    lo.80 = round(lo.80^2^2),
    lo.95 = round(lo.95^2^2),
    hi.80 = round(hi.80^2^2),
    hi.95 = round(lo.95^2^2)
  ) %>%
  filter(key == "forecast") %>%
  mutate(datetime = index) %>%
  select(datetime, outlet, visitor)
cs_nest2

## # A tibble: 840 x 3
##    datetime outlet visitor
##       <dbl> <chr>    <dbl>
##  1     7.43 A           10
##  2     7.43 B            0
##  3     7.43 C            0
##  4     7.43 D            0
##  5     7.43 E          222
##  6     7.43 A            0
##  7     7.43 B            0
##  8     7.43 C            0
##  9     7.43 D            0
## 10     7.43 E           77
## # ... with 830 more rows

Let’s input the visitor values to the submission files.

# load the data for submission
cs_sms <- fread("data/cs_submission.csv")

# input the result of visitor from nest2
cs_sms$visitor <- cs_nest2$visitor
cs_sms

##                 datetime outlet visitor
##   1: 2018-02-22 00:00:00      A      10
##   2: 2018-02-22 00:00:00      B       0
##   3: 2018-02-22 00:00:00      C       0
##   4: 2018-02-22 00:00:00      D       0
##   5: 2018-02-22 00:00:00      E     222
##  ---                                   
## 836: 2018-02-28 23:00:00      A      35
## 837: 2018-02-28 23:00:00      B       0
## 838: 2018-02-28 23:00:00      C       0
## 839: 2018-02-28 23:00:00      D       1
## 840: 2018-02-28 23:00:00      E     475

Finally, let’s write the results for the submission.

# write into the submission file
write.csv(cs_sms, "cs_submission - Claudy Sonly Daniel.csv")

Forecasting Visitors of a Business with Various Branches with Weekly and Hourly Seasonality

Claudy Daniel P.

Updated: January 18, 2019

1 Pre-processing

2 Training the models

2.1 Using stlm

2.2 Using tbats

3 Getting the Forecast