Restaurant Visitor Forecasting

library(dplyr)
library(caret)
library(lubridate)
library(zoo)
library(tidyr)
library(padr)
library(forecast)
library(tseries)
library(TTR)
library(ggplot2) # needed for ggplot() and aes() used in the seasonal analysis plots
library(rmarkdown)
library(DT)

Import Data

This data contains both a training set and a test set. The training set will be used to fit the model, while the test set serves as unseen (new) data. I’ll perform feature engineering to get the restaurant’s total visitors per hour (see the preprocessing section), then use the model to predict total visitors for the test set.

data_train <- read.csv("data_input/data-train.csv")

datatable(data_train, options = list(scrollX = T)) 

Columns description:

  • Transaction_date : The timestamp of a transaction
  • Receipt_number : The ID of a transaction
  • Item_id : The ID of an item in a transaction
  • Item_group : The group ID of an item in a transaction
  • Item_major_group : The major-group ID of an item in a transaction
  • Quantity : The quantity of the purchased item
  • Price_usd : The price of the purchased item
  • Total_usd : The total price of the purchased item
  • Payment_type : The payment method
  • Sales_type : The sales method

data_test <- read.csv("data_input/data-test.csv")

datatable(data_test, options = list(scrollX = T)) 

Data preprocessing

Feature Engineering

First, I will create a new column that contains only the date and hour. This is important because we need to arrange our data chronologically before forecasting.

data_train <- data_train %>% 
  mutate(transaction_date = ymd_hms(transaction_date, tz = "UTC")) # parse the timestamp

data_train <- data_train %>% 
  mutate(only_date = format(transaction_date, "%Y-%m-%d")) %>% 
  mutate(only_hour = format(transaction_date, "%H")) %>% 
  unite(date_hour, only_date, only_hour, sep = " ") %>% # combine date and hour into one column
  mutate(date_hour = ymd_h(date_hour)) %>% # parse back to a datetime floored to the hour
  arrange(date_hour)
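
For reference, the same hour-level truncation can be done in one step with lubridate’s floor_date(); a minimal equivalent sketch, not the pipeline actually used above:

data_train %>% 
  mutate(date_hour = floor_date(transaction_date, unit = "hour")) %>% 
  arrange(date_hour)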

data_test <- data_test %>% 
  mutate(datetime = ymd_hms(datetime))

Second, to get the total number of customers, I’ll group the data by the date_hour column and count the number of distinct receipt_number values in each group, since each unique receipt represents one transaction.

data_train_agg <- data_train %>% 
  group_by(date_hour) %>% 
  summarise(total_visitor = n_distinct(receipt_number)) %>% 
  ungroup()
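
Equivalently, the distinct-then-count logic described above can be written out explicitly; this quick sketch should agree with data_train_agg:

data_train %>% 
  distinct(date_hour, receipt_number) %>% 
  count(date_hour, name = "total_visitor")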

datatable(data_train_agg, options = list(scrollX = T)) 
summary(data_train_agg)
#>    date_hour                      total_visitor  
#>  Min.   :2017-12-01 13:00:00.00   Min.   : 1.00  
#>  1st Qu.:2017-12-21 00:45:00.00   1st Qu.:16.00  
#>  Median :2018-01-09 23:30:00.00   Median :26.00  
#>  Mean   :2018-01-10 03:38:38.00   Mean   :28.55  
#>  3rd Qu.:2018-01-29 23:15:00.00   3rd Qu.:40.00  
#>  Max.   :2018-02-18 23:00:00.00   Max.   :83.00

Third, forecasting models do not allow missing values, so I’ll pad the series to make sure no timestamps are missing and replace each NA with 0, since those gaps are hours when the restaurant was either closed or had no visitors at all. I’ll then filter the data to the operating hours (10:00 to 22:00).

set.seed(300)
data_train_pad <- data_train_agg %>% 
  pad(start_val = ymd_hms("2017-12-01 13:00:00"),
      end_val = ymd_hms("2018-02-18 23:00:00")) %>% # insert rows for the missing hours
  mutate(total_visitor = replace_na(total_visitor, 0)) %>% # closed hours mean zero visitors
  mutate(only_hour = hour(date_hour)) %>% 
  filter(only_hour %in% c(10:22)) # keep only the operating hours (10:00-22:00)

data_train_pad <- data_train_pad %>% 
  select(date_hour, total_visitor)

datatable(data_train_pad, options = list(scrollX = T))
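
As a quick sanity check (a sketch), every full day should now contribute exactly 13 hourly rows (10:00 to 22:00); only the first day is partial, since the series starts at 13:00:

data_train_pad %>% 
  mutate(day = as.Date(date_hour)) %>% 
  count(day) %>% 
  distinct(n)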

Checking Missing Values

colSums(is.na(data_train_pad))
#>     date_hour total_visitor 
#>             0             0

No missing values are detected, so we can proceed to forecasting.

Forecasting

We have now fulfilled all the requirements for forecasting. From a preliminary inspection, I know this data contains multiple seasonalities, so for the first model I’ll use STLM, which handles multi-seasonal patterns, with both the ARIMA and ETS methods, then compare the two models by evaluation to pick the better one. At the end of this section, I’ll also try other models to make sure STLM is really the best choice for this multi-seasonal pattern.

Creating a time series object using the ts function

First I’ll create a ts object so we can inspect the patterns in our data.

train_ts <- ts(data_train_pad$total_visitor,
               frequency = 13) # 13 observations per day (10:00-22:00)

Plotting TS object

Plotting the ts object to see the data, trend, seasonal, and remainder components. From this plot we can decide which model to use.

train_decom <- decompose(train_ts)

train_decom %>% autoplot()

As I mentioned before, the trend plot still shows a repeating pattern, so this data contains more than one seasonality and a single-frequency ts object is not enough to decompose it.

Creating a TS object using the msts function

model_msts <- data_train_pad$total_visitor %>% 
  msts(seasonal.periods = c(13, 13*7)) # daily (13 hours) and weekly (91 hours) seasonality

msts_decom <- model_msts %>% mstl()
msts_decom %>% autoplot()

In the plot above, no seasonal pattern leaks into the trend, indicating we succeeded in building a time series object that captures both seasonalities.

Cross Validation

I’ll split the data into a training set and a test set. The test set contains the last week of data (13 × 7 = 91 hourly observations), while the rest becomes the training set.

msts_test <- tail(model_msts, 13*7)   # the last week (91 observations)
msts_train <- head(model_msts, -13*7) # everything before the last week
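
A small sketch to confirm the split sizes line up:

length(msts_test) # 91 = 13 hours x 7 days
length(msts_train) + length(msts_test) == length(model_msts) # should be TRUE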

Modeling the data using stlm (ARIMA and ETS)

Building the models

stlm_model <- stlm(msts_train, s.window = 13, method = "arima") # STL decomposition + ARIMA on the seasonally adjusted series
ets_model <- stlm(msts_train, s.window = 13, method = "ets")    # STL decomposition + ETS on the seasonally adjusted series

Forecasting the next 7 days

After building the models, I’ll forecast the next 7 days (one week).

forcast_arima <- forecast(stlm_model, h = 13*7) # h = 91 steps = one week
forcast_ets <- forecast(ets_model, h = 13*7)

Plotting the forecasting result from ARIMA

forcast_arima %>% 
  autoplot()

Plotting the forecasting result from ETS

forcast_ets %>% 
  autoplot()

Checking the residuals for autocorrelation

Plotting the residual PACF for both methods; from these plots I can make an initial guess about whether the residuals are autocorrelated.

pacf(forcast_arima$residuals)

pacf(forcast_ets$residuals)

For a more reliable result, I’ll run the Ljung-Box test to detect autocorrelation. Our target is a model without residual autocorrelation, i.e. a p-value greater than 0.05.

Box.test(forcast_arima$residuals, type = "Ljung-Box")
#> 
#>  Box-Ljung test
#> 
#> data:  forcast_arima$residuals
#> X-squared = 0.0027996, df = 1, p-value = 0.9578
Box.test(forcast_ets$residuals, type = "Ljung-Box")
#> 
#>  Box-Ljung test
#> 
#> data:  forcast_ets$residuals
#> X-squared = 17.109, df = 1, p-value = 0.00003529

From both tests, I conclude that the ARIMA model performs better: its residuals show no significant autocorrelation (p = 0.96), while the ETS residuals do (p < 0.001).
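
As a complementary check, forecast’s checkresiduals() bundles the residual plot, the ACF, and a Ljung-Box test into a single call; a quick sketch:

checkresiduals(forcast_arima)
checkresiduals(forcast_ets)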

Comparing both models’ forecasts with the actual visitor counts

msts_test %>% 
  autoplot() +
  autolayer(forcast_arima$mean) +
  autolayer(forcast_ets$mean)

Model evaluation

Model evaluation is crucial for choosing the best model. Looking at the accuracy metrics, the best model is the one with the lowest RMSE and MAE. Note that MPE and MAPE come out as NaN/Inf here because the padded series contains zero-visitor hours and percentage errors divide by the actual value.

accuracy(forcast_arima, msts_test)
#>                       ME     RMSE      MAE MPE MAPE      MASE        ACF1
#> Training set -0.03408915 5.197467 3.936139 NaN  Inf 0.6093426 0.001717579
#> Test set     -2.12716247 7.332962 5.656791 NaN  Inf 0.8757118 0.344412851
#>              Theil's U
#> Training set        NA
#> Test set             0
accuracy(forcast_ets, msts_test)
#>                       ME     RMSE      MAE  MPE MAPE     MASE      ACF1
#> Training set -0.03470287 5.334872 4.062086  NaN  Inf 0.628840 0.1342714
#> Test set     -6.08259918 9.330179 7.678216 -Inf  Inf 1.188643 0.3548303
#>              Theil's U
#> Training set        NA
#> Test set             0

From the evaluation above, the ARIMA model performs better at predicting visitors over the next 7 days: it has the lowest RMSE and MAE on the test set.
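
For a compact side-by-side view, the test-set errors can be pulled into one table; a sketch reusing the same accuracy() calls:

rbind(
  ARIMA = accuracy(forcast_arima, msts_test)["Test set", c("RMSE", "MAE")],
  ETS = accuracy(forcast_ets, msts_test)["Test set", c("RMSE", "MAE")]
)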

Seasonal (hour) analysis

Seasonal analysis is used to see the pattern within each day. From this analysis I can see which hours customers most like to visit the restaurant.

# data preparation
day_seasonal_analysis <- data_train_pad %>% 
  mutate(
    seasonal = msts_decom[, "Seasonal13"], # store the daily seasonal component from the decomposition
    hour = hour(date_hour) # date_hour is already POSIXct, so no re-parsing is needed
  ) %>%  
  distinct(hour, seasonal)

day_seasonal_analysis
#> # A tibble: 1,037 × 2
#>     hour seasonal
#>    <int>    <dbl>
#>  1    13   -12.3 
#>  2    14   -11.2 
#>  3    15    -7.46
#>  4    16    -5.97
#>  5    17     1.32
#>  6    18     7.28
#>  7    19    26.6 
#>  8    20    23.7 
#>  9    21    24.8 
#> 10    22    16.3 
#> # ℹ 1,027 more rows
day_seasonal_analysis %>% 
  ggplot(mapping = aes(x = hour, y = seasonal)) +
  geom_col()
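
Because distinct() still leaves many seasonal values per hour (1,037 rows), the bars above stack them; averaging the seasonal component per hour gives a cleaner view. A sketch:

day_seasonal_analysis %>% 
  group_by(hour) %>% 
  summarise(seasonal = mean(seasonal)) %>% 
  ggplot(aes(x = hour, y = seasonal)) +
  geom_col()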

As the plot above shows, visitors after 4 PM are well above the average. The seasonal peak comes around 8 PM and the lowest point is around 10 AM (right after the restaurant opens): people prefer visiting the restaurant for dinner rather than lunch.

Seasonal (week) analysis

Just like the hourly seasonal analysis, the weekly analysis is used to see in which weeks customers most like to visit the restaurant.

# data preparation
week_seasonal_analysis <- data_train_pad %>% 
  mutate(
    seasonal91 = msts_decom[,"Seasonal91"],
    month = month(date_hour, label = T),
    week = week(date_hour)
  ) %>%  
  arrange(date_hour) %>% 
  distinct(month, week, seasonal91)

week_seasonal_analysis <- week_seasonal_analysis %>% 
  group_by(month, week) %>% 
  summarise(seasonal91 = mean(seasonal91)) %>% 
  ungroup() %>% 
  mutate(month = factor(month, levels = c("Dec", "Jan", "Feb"))) %>% 
  arrange(month) %>% 
  unite(month_week,month, week, sep = " ")

week_seasonal_analysis
#> # A tibble: 14 × 2
#>    month_week seasonal91
#>    <chr>           <dbl>
#>  1 Dec 48        3.31   
#>  2 Dec 49       -0.00344
#>  3 Dec 50       -0.00344
#>  4 Dec 51       -0.00943
#>  5 Dec 52       -0.0111 
#>  6 Dec 53        4.11   
#>  7 Jan 1         0.0655 
#>  8 Jan 2         0.0579 
#>  9 Jan 3         0.0497 
#> 10 Jan 4         0.0550 
#> 11 Jan 5        -2.42   
#> 12 Feb 5         1.86   
#> 13 Feb 6         0.0281 
#> 14 Feb 7         0.0233
week_seasonal_analysis %>% 
  ggplot(mapping = aes(x = factor(month_week, levels = month_week), y = seasonal91)) +
  geom_col()

The highest number of customers occurs in the last week of December 2017, when people celebrate New Year’s Eve. Also, except for the last week of January, the restaurant gets more customers than average in the first and last weeks of each month, possibly a payday effect.

Final modeling

fnb_stlm <- stlm(model_msts, method = "arima") # refit STLM-ARIMA on the full series
fnb_forecast <- forecast(fnb_stlm, h = 13*7)   # forecast the next week (91 hours)
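
A quick sketch to confirm the forecast horizon matches the submission rows, assuming data_test covers the same one-week, 13-hour-per-day grid:

length(fnb_forecast$mean) == nrow(data_test) # expected TRUE: 91 hourly predictions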

Creating submission file

# Create the submission data
#submission <- data_test %>% 
#  mutate(visitor = fnb_forecast$mean)

# save the data
#write.csv(submission, "submission_faqih.csv", row.names = F)

# check the first 3 rows
#head(submission, 3)
knitr::include_graphics("final_score.png")

Trying other methods

  1. Forecasting using SMA
fnb_sma <- SMA(msts_train, 13*7) # simple moving average over a one-week (91-hour) window
accuracy(fnb_sma, msts_test)
#>          ME RMSE MAE      MPE     MAPE ACF1 Theil's U
#> Test set  9    9   9 21.95122 21.95122   NA         0
  2. Forecasting using stlf
fnb_ses <- stlf(msts_train)
forecast_stlf <- forecast(fnb_ses, h = 13*7)
accuracy(forecast_stlf, msts_test)
#>                       ME     RMSE      MAE  MPE MAPE      MASE      ACF1
#> Training set -0.03231565 5.285451 4.034626  NaN  Inf 0.6245891 0.1332997
#> Test set     -6.28882907 9.459711 7.826815 -Inf  Inf 1.2116472 0.3536323
#>              Theil's U
#> Training set        NA
#> Test set             0
  3. Forecasting using Holt-Winters
fnb_hw <- HoltWinters(msts_train)
forecast_hw <- forecast(fnb_hw, h = 13*7)
accuracy(forecast_hw$mean, msts_test)
#>                 ME     RMSE      MAE  MPE MAPE     ACF1 Theil's U
#> Test set -2.898399 8.154527 6.451498 -Inf  Inf 0.468013         0

None of the alternative models above outperforms our previous STLM-ARIMA model.