Introduction

Understanding visitor trends and transaction patterns is essential for food and beverage businesses seeking to make informed decisions and optimize their operations. With consumer preferences evolving rapidly and diners increasingly choosing among dine-in, delivery, and take-away options, forecasting visitor numbers has become a crucial part of business planning. In this analysis, we examine a dataset of visitor transactions from a food and beverage establishment spanning all three sales types. The objective is to forecast the hourly number of visitors and to explain the underlying seasonality. The forecast and seasonality analysis will be evaluated on the next 7 days, from Monday, February 19th, 2018, to Sunday, February 25th, 2018. By applying forecasting techniques and seasonality analysis, we aim to provide insights that help the owner optimize resource allocation, improve service delivery, and drive business growth.

Library

library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.3.3
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(tidyr)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.2
library(TSstudio) 
## Warning: package 'TSstudio' was built under R version 4.3.3
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(readr)
library(padr)
library(zoo)
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(ggsci)
library(xts)
## 
## ######################### Warning from 'xts' package ##########################
## #                                                                             #
## # The dplyr lag() function breaks how base R's lag() function is supposed to  #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or       #
## # source() into this session won't work correctly.                            #
## #                                                                             #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop           #
## # dplyr from breaking base R's lag() function.                                #
## #                                                                             #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
## #                                                                             #
## ###############################################################################
## 
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
library(tseries)
library(Metrics)
## Warning: package 'Metrics' was built under R version 4.3.3
## 
## Attaching package: 'Metrics'
## The following object is masked from 'package:forecast':
## 
##     accuracy

Data Preprocessing

Train dataset

FnB <- read.csv("data_input/FnB/data/data-train.csv")

head(FnB)
##       transaction_date receipt_number   item_id  item_group item_major_group
## 1 2017-12-01T13:32:46Z       A0026694 I10100139 noodle_dish             food
## 2 2017-12-01T13:32:46Z       A0026694 I10500037      drinks        beverages
## 3 2017-12-01T13:33:39Z       A0026695 I10500044      drinks        beverages
## 4 2017-12-01T13:33:39Z       A0026695 I10400009   side_dish             food
## 5 2017-12-01T13:33:39Z       A0026695 I10500046      drinks        beverages
## 6 2017-12-01T13:35:59Z       A0026696 I10300002      pastry             food
##   quantity price_usd total_usd payment_type sales_type
## 1        1      7.33      7.33         cash    dine_in
## 2        1      4.12      4.12         cash    dine_in
## 3        1      2.02      2.02         cash    dine_in
## 4        1      5.60      5.60         cash    dine_in
## 5        1      3.01      3.01         cash    dine_in
## 6        1      4.86      4.86         cash    dine_in

Missing Value Check

anyNA(FnB)
## [1] FALSE
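
anyNA() only reports a single TRUE/FALSE for the whole data frame. As an optional sketch, a per-column count can localize any missing values:

# Count missing values per column (all zeros expected here)
colSums(is.na(FnB))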

Summary of data

glimpse(FnB)
## Rows: 137,748
## Columns: 10
## $ transaction_date <chr> "2017-12-01T13:32:46Z", "2017-12-01T13:32:46Z", "2017…
## $ receipt_number   <chr> "A0026694", "A0026694", "A0026695", "A0026695", "A002…
## $ item_id          <chr> "I10100139", "I10500037", "I10500044", "I10400009", "…
## $ item_group       <chr> "noodle_dish", "drinks", "drinks", "side_dish", "drin…
## $ item_major_group <chr> "food", "beverages", "beverages", "food", "beverages"…
## $ quantity         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1,…
## $ price_usd        <dbl> 7.33, 4.12, 2.02, 5.60, 3.01, 4.86, 6.34, 7.58, 4.12,…
## $ total_usd        <dbl> 7.33, 4.12, 2.02, 5.60, 3.01, 4.86, 6.34, 7.58, 4.12,…
## $ payment_type     <chr> "cash", "cash", "cash", "cash", "cash", "cash", "cash…
## $ sales_type       <chr> "dine_in", "dine_in", "dine_in", "dine_in", "dine_in"…

From the summary, we can see that transaction_date is stored as a character, which is not suitable for time series analysis. We need to convert it to a datetime format and then floor it to the hour to obtain the hourly transaction time.

FnB_1 <- FnB %>% 
  mutate(transaction_date = ymd_hms(transaction_date) %>% 
           floor_date(unit = "hour")) %>% 
  group_by(transaction_date) %>% 
  summarise(visitor_number = n_distinct(receipt_number)) %>% 
  arrange(transaction_date) %>% 
  ungroup()
FnB_1 <- FnB_1 %>% 
  pad(start_val = ymd("2017-12-01"), end_val = ymd("2018-02-18"))
## pad applied on the interval: hour
FnB_1 <- FnB_1 %>%
  mutate(hour = hour(transaction_date)) %>% 
  filter(hour %in% c(10:22)) %>% 
  select(transaction_date, visitor_number)

Since the restaurant operates from 10 AM to 10 PM, we keep only the hours 10:00 to 22:00 and fill the NA values in visitor_number (hours with no transactions) with 0.

FnB_1 <- FnB_1 %>% 
  mutate(visitor_number = na.fill(visitor_number, 0))

Recheck that there are no missing values and inspect the time range of the prepared data.

min_date <- min(FnB_1$transaction_date)
max_date <- max(FnB_1$transaction_date)

min_date
## [1] "2017-12-01 10:00:00 UTC"
max_date
## [1] "2018-02-17 22:00:00 UTC"

Seasonality Analysis

Convert the data into a time series object

We use a frequency of 13, which corresponds to the 13 hours the outlet is open every day (10 AM to 10 PM).

FnB_ts <- ts(FnB_1$visitor_number, start = c(1, 4), frequency = 13)

Visualize the series to inspect its overall pattern

autoplot(FnB_ts) + theme_minimal()

Insight: the series exhibits trend, seasonality, and error (remainder) components.

Decomposing

To observe the trend more clearly, we decompose the series into trend, seasonal, and remainder components. Upon inspection, the trend component is not smooth: it still shows a repeating pattern, revealing seasonality that was not captured by the single daily frequency of FnB_ts. This suggests that our data contain multiple seasonal patterns.

FnB1_deco <- decompose(x = FnB_ts)
autoplot(FnB1_deco)

Insight: the plot above shows a distinct repeating pattern in the estimated trend component, hinting at additional seasonality that the decomposition overlooked. This implies the dataset contains multiple seasonal variations. To handle this multi-seasonal structure, we convert the data into a multiple seasonality time series, which accommodates more than one frequency.

Time Series with Multiple Seasonality

We generate a multiple seasonality time series with the msts() function, specifying the seasonal periods as daily (13) and weekly (13*7 = 91), so that both daily and weekly seasonal patterns are captured. We then decompose the series and visualize the results.

FnB_msts <- msts(FnB_1$visitor_number, seasonal.periods = c(13, 13*7))

autoplot(FnB_msts)

FnB_msts_deco  <- mstl(FnB_msts)

FnB_msts %>% tail(4*7*13) %>% stl(s.window = "periodic") %>% autoplot()

Insight: the estimated trend from the multiple seasonality time series is noticeably smoother and clearer, and the daily and weekly seasonal components are now cleanly separated, which makes the series easier to analyze and interpret.
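
To put rough numbers on the improvement, we can compare the variance of each mstl() component with the variance of the original series (a heuristic sketch; the shares need not sum to one):

# Approximate variance share of each decomposition component
comp <- as.data.frame(FnB_msts_deco)
round(sapply(comp[c("Trend", "Seasonal13", "Seasonal91", "Remainder")], var) / var(FnB_msts), 2)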

Seasonality analysis daily by hour

FnB_1 %>% 
  mutate(seasonal = FnB1_deco$seasonal,
         hour = hour(transaction_date)) %>% 
  distinct(hour, seasonal) %>% 
  ggplot(mapping = aes(x = hour, y = seasonal, fill = as.factor(hour))) +
  geom_col()+
  theme_minimal()+
  scale_x_continuous(breaks = seq(10,22,1)) +
  labs(
    title = "Single Seasonality Analysis",
    subtitle = "Daily"
  )

Insight

- Most visitors patronize the restaurant between 7:00 PM and 10:00 PM, with 10:00 PM being the busiest hour.
- The restaurant experiences its lowest visitor count at opening time, from 10:00 AM.

Seasonality analysis weekly by hour

as.data.frame(FnB_msts_deco) %>%
  mutate(tr_date = FnB_1$transaction_date) %>%
  mutate(
    Day_of_Week = wday(tr_date, label = TRUE, abbr = FALSE),
    Hour = as.factor(hour(tr_date))
  ) %>%
  group_by(Day_of_Week, Hour) %>%
  summarise(Seasonal = sum(Seasonal13 + Seasonal91)) %>%
  ggplot() +
  geom_col(aes(x = Hour, y = Seasonal, fill = Day_of_Week), position = "stack", width = 0.7) +
  # wday() labels follow the session locale; the names below assume an English
  # locale (the original analysis used Indonesian day names).
  scale_fill_manual(values = c("Monday" = "blue", "Tuesday" = "green", "Wednesday" = "red", "Thursday" = "purple", "Friday" = "orange", "Saturday" = "yellow", "Sunday" = "cyan")) +
  labs(
    title = "Multi Seasonality Analysis",
    subtitle = "Weekly"
  )
## `summarise()` has grouped output by 'Day_of_Week'. You can override using the
## `.groups` argument.

Insight

- Peak hours for the outlet fall within the 19:00 to 22:00 timeframe daily.
- Sundays see the highest visitor influx, notably at 20:00.
- The least busy period occurs uniformly at 10:00, when the outlet opens.

Model Fitting and Evaluation

Cross Validation

A time series cross-validation scheme must not be randomly sampled; the data are split sequentially. In this scenario, we divide the data into a test set (1 week = 7 x 13 hours) and a train set (all remaining data).

FnBtest_msts <- tail(FnB_msts, n = 7*13)
FnBtrain_msts <- head(FnB_msts, n = length(FnB_msts) - length(FnBtest_msts))
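
A quick sanity check (a sketch) that the split is sequential and exhaustive:

# Train and test lengths should add up to the full series
length(FnBtrain_msts) + length(FnBtest_msts) == length(FnB_msts)  # expect TRUE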

Build Model

Based on the previous data exploration, the dataset exhibits both trend and seasonal patterns. Appropriate methods are therefore Triple Exponential Smoothing (Holt-Winters) and ARIMA.

Triple Exponential Smoothing Model
#model
FnB_TESM_test <- HoltWinters(FnBtrain_msts)
FnB_TESM_test
## Holt-Winters exponential smoothing with trend and additive seasonal component.
## 
## Call:
## HoltWinters(x = FnBtrain_msts)
## 
## Smoothing parameters:
##  alpha: 0.02059033
##  beta : 0.004279326
##  gamma: 0.3800661
## 
## Coefficients:
##              [,1]
## a    28.369622821
## b     0.001528001
## s1  -20.474638787
## s2  -11.082897293
## s3   -5.011391454
## s4    1.548814355
## s5    4.158496257
## s6    7.333703816
## s7   10.493680079
## s8    8.717766502
## s9   11.036673055
## s10  25.602690209
## s11  36.526600516
## s12  28.411245963
## s13  15.447509896
## s14 -21.120471785
## s15 -20.583872852
## s16 -11.543344873
## s17 -14.335345073
## s18 -12.100975342
## s19  -8.913709939
## s20 -13.592510972
## s21  -6.112779839
## s22   1.032456914
## s23   7.236459105
## s24  16.304430332
## s25  16.102872821
## s26   8.665769647
## s27 -22.101851889
## s28 -17.738321058
## s29 -12.540679150
## s30 -11.642409372
## s31 -12.329560350
## s32  -8.628433719
## s33  -8.775276205
## s34  -2.836904624
## s35  -0.008611163
## s36  12.786797732
## s37  13.848327034
## s38  17.958000831
## s39  14.454443317
## s40 -18.874334303
## s41 -12.532328022
## s42  -9.314493479
## s43 -13.785421373
## s44  -8.473836004
## s45  -9.426220042
## s46  -3.826022409
## s47  -0.988007035
## s48   1.367908539
## s49  13.526248085
## s50  22.601052203
## s51  15.375501037
## s52  13.370384613
## s53 -18.987345711
## s54 -14.613340995
## s55  -8.398258226
## s56 -12.106760872
## s57 -10.253528129
## s58  -6.306668081
## s59  -7.044260274
## s60  -7.673654287
## s61   0.485797800
## s62  20.840823309
## s63  28.229425113
## s64  27.888256725
## s65  14.989522530
## s66 -27.716755573
## s67 -27.709714789
## s68 -27.704368251
## s69 -22.956974829
## s70 -10.775526978
## s71  -7.769615517
## s72 -10.756083913
## s73  -7.144881724
## s74   7.524893395
## s75  22.030571714
## s76  35.092421011
## s77  30.182932032
## s78  21.900713124
## s79 -18.693466384
## s80  -8.757682181
## s81  -1.850892036
## s82  -2.808011845
## s83   0.891846241
## s84   2.256859790
## s85   1.174339457
## s86   7.002404348
## s87  11.725322785
## s88  24.789991547
## s89  42.226271061
## s90  37.836195470
## s91  25.239083370
# forecast
Forecast_TESM_test <- forecast(FnB_TESM_test, h = 7*13)
fitted_values <- fitted(Forecast_TESM_test)

#plot
FnBtrain_msts %>% 
  autoplot(series = "Actual") + 
  autolayer(FnBtest_msts, series = "Actual test") + 
  autolayer(fitted_values, series = "Fitted values (additive)") +
  autolayer(Forecast_TESM_test$mean, series = "Predict data test (additive)")
## Warning: Removed 91 rows containing missing values (`geom_line()`).

test_forecast(actual = FnB_msts,
             forecast.obj = Forecast_TESM_test,
             train = FnBtrain_msts,
             test = FnBtest_msts)

Model Evaluation

#Evaluation
mae(actual = FnBtest_msts, predicted = Forecast_TESM_test$mean)
## [1] 6.28736
ARIMA Modelling
#Model
Arima_ts <- stlm(FnBtrain_msts, method = "arima")
#Forecast
Forecast_Arima <- forecast(Arima_ts, h = 13*7)

Model Evaluation

#Evaluation
mae(actual = FnBtest_msts, predicted = Forecast_Arima$mean)
## [1] 5.557077
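Since the Metrics package is already loaded, we can also compare the two models side by side on additional error measures (a sketch reusing the existing forecast objects):

# Compare both models on the held-out week (MAE and RMSE)
data.frame(
  model = c("Holt-Winters", "STL + ARIMA"),
  MAE   = c(mae(actual = FnBtest_msts, predicted = Forecast_TESM_test$mean),
            mae(actual = FnBtest_msts, predicted = Forecast_Arima$mean)),
  RMSE  = c(rmse(actual = FnBtest_msts, predicted = Forecast_TESM_test$mean),
            rmse(actual = FnBtest_msts, predicted = Forecast_Arima$mean))
)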
#plot
FnBtrain_msts %>% 
  autoplot(series = "Actual") + 
  autolayer(FnBtest_msts, series = "Actual test") + 
  autolayer(Forecast_Arima$mean, series = "Predict ARIMA")

test_forecast(actual = FnB_msts,
             forecast.obj = Forecast_Arima,
             train = FnBtrain_msts,
             test = FnBtest_msts)

Visualization Actual vs Estimated

accuracyData <- data.frame(
  tdate = FnB_1$transaction_date %>% tail(13*7),
  actual = as.vector(FnBtest_msts),
  TESM_forecast = as.vector(Forecast_TESM_test$mean),
  arima_forecast = as.vector(Forecast_Arima$mean)
)
Visualization of Actual vs Best Model (ARIMA)
accuracyData %>% 
  ggplot() +
  geom_line(aes(x = tdate, y = actual, colour = "Actual"), linewidth = 1) +
  geom_line(aes(x = tdate, y = arima_forecast, colour = "Arima Model (Best Model)"), linewidth = 1) +
  labs(
    title = "Actual vs Arima Model",
    subtitle = "Hourly visitor", 
    x = "Date", 
    y = "Visitor", 
    colour = ""
    ) +
  scale_color_manual(values = c("Actual" = "tan4", "Arima Model (Best Model)" = "orange"))

Visualization of Actual vs All Models

accuracyData %>% 
  ggplot() +
  geom_line(aes(x = tdate, y = actual, colour = "Actual"), linewidth = 0.5) +
  geom_line(aes(x = tdate, y = TESM_forecast, colour = "TESM Model"), linewidth = 0.1) + 
  geom_line(aes(x = tdate, y = arima_forecast, colour = "Arima Model"), linewidth = 0.1) +
  labs(
    title = "Actual vs All Models", 
    subtitle = "Hourly visitor", 
    x = "Date", 
    y = "Visitor", 
    colour = "") +
  scale_color_manual(values = c("yellow", "darkgreen", "maroon"))

Prediction Performance

As it stands, the STL + ARIMA model exhibits the lowest MAE (5.56, versus 6.29 for Holt-Winters). Consequently, we will use the ARIMA model to forecast the test dataset.

FnB_test <- read.csv("data_input/FnB/data/data-test.csv")

Convert the datetime column to the proper data type

FnB_test <- FnB_test %>% 
  mutate(datetime = ymd_hms(datetime) %>% 
           floor_date(unit = "hour"))
# Refit the STL + ARIMA model on the full training series, then forecast the next week
Arima_test <- stlm(FnB_ts, method = "arima")
Forecast_Arima_test <- forecast(Arima_test, h = 13*7)
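
Before inserting the forecasts, a small sanity check (a sketch) that the forecast horizon matches the submission file:

# The test file should cover exactly 7 days x 13 operating hours
nrow(FnB_test) == length(Forecast_Arima_test$mean)  # expect TRUE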

Insert the forecast values into the table

Insert <- FnB_test %>% 
  mutate(visitor = Forecast_Arima_test$mean)

Save Data

write.csv(Insert, file = "FnB_Forecast.csv")

Check the first 26 rows (the first two days of forecasts)

head(Insert, 26)
##               datetime   visitor
## 1  2018-02-19 10:00:00 15.295440
## 2  2018-02-19 11:00:00 17.119389
## 3  2018-02-19 12:00:00 19.554802
## 4  2018-02-19 13:00:00 22.619172
## 5  2018-02-19 14:00:00 22.505153
## 6  2018-02-19 15:00:00 28.679438
## 7  2018-02-19 16:00:00 28.801170
## 8  2018-02-19 17:00:00 28.642075
## 9  2018-02-19 18:00:00 32.163747
## 10 2018-02-19 19:00:00 49.515288
## 11 2018-02-19 20:00:00 58.640082
## 12 2018-02-19 21:00:00 57.878295
## 13 2018-02-19 22:00:00 52.387666
## 14 2018-02-20 10:00:00  7.998891
## 15 2018-02-20 11:00:00 12.099635
## 16 2018-02-20 12:00:00 15.039004
## 17 2018-02-20 13:00:00 18.556734
## 18 2018-02-20 14:00:00 18.850561
## 19 2018-02-20 15:00:00 25.391746
## 20 2018-02-20 16:00:00 25.843544
## 21 2018-02-20 17:00:00 25.981378
## 22 2018-02-20 18:00:00 29.770169
## 23 2018-02-20 19:00:00 47.362012
## 24 2018-02-20 20:00:00 56.702983
## 25 2018-02-20 21:00:00 56.135670
## 26 2018-02-20 22:00:00 50.819991

Conclusion

In order to ensure accurate forecasting, it is essential to validate several assumptions:

  1. Testing for Autocorrelation in Residuals: to assess autocorrelation, we use the Box.test() function. A p-value greater than 0.05 means we fail to reject the null hypothesis (H0) that there is no autocorrelation in the forecast errors.
Box.test(x = Arima_test$residuals)
## 
##  Box-Pierce test
## 
## data:  Arima_test$residuals
## X-squared = 0.023826, df = 1, p-value = 0.8773

Insight: with a p-value of 0.877, well above 0.05, we fail to reject the null hypothesis; there is no evidence of autocorrelation in the residuals.

  2. Assessing Normality of Residuals: the shapiro.test() function tests the null hypothesis (H0) that the residuals are normally distributed; we fail to reject H0 when the p-value exceeds 0.05.
shapiro.test(Arima_test$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  Arima_test$residuals
## W = 0.99506, p-value = 0.002019

Insight: given a p-value below 0.05 (0.002), the residuals do not follow a normal distribution. Note that the Shapiro-Wilk test only evaluates how far the residual distribution deviates from normality; it does not directly address forecast performance, which may nonetheless deteriorate over longer horizons. To forecast further ahead, additional data should be incorporated into the model.
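
Because the Shapiro-Wilk test only flags the deviation, a visual check can help gauge how severe it is. A minimal base-R sketch using the model residuals:

# Histogram and normal Q-Q plot of the STL + ARIMA residuals
hist(Arima_test$residuals, breaks = 30,
     main = "ARIMA residuals", xlab = "Residual")
qqnorm(Arima_test$residuals)
qqline(Arima_test$residuals)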

In conclusion, based on the seasonality analysis, it is evident that Sunday at 8:00 PM experiences the highest influx of visitors.