In the food and beverage business, understanding visitor trends and transaction patterns is essential for making informed decisions and optimizing operations. With rapidly evolving consumer preferences and the growing reliance on diverse dining options such as dine-in, delivery, and take-away, forecasting visitor numbers has become a crucial part of business planning. In this analysis, we work with visitor data from a food and beverage outlet covering dine-in, delivery, and take-away transactions. The primary objective is to forecast the hourly number of visitors and explain the underlying seasonality. The forecast and seasonality explanation will be evaluated on the following 7 days, from Monday, February 19th, 2018, to Sunday, February 25th, 2018. By applying forecasting techniques and seasonality analysis, we aim to uncover insights that help the owner optimize resource allocation, improve service delivery, and ultimately grow the business.
Library
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.3.3
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(tidyr)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.2
library(TSstudio)
## Warning: package 'TSstudio' was built under R version 4.3.3
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(readr)
library(padr)
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(ggsci)
library(xts)
##
## ######################### Warning from 'xts' package ##########################
## # #
## # The dplyr lag() function breaks how base R's lag() function is supposed to #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or #
## # source() into this session won't work correctly. #
## # #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop #
## # dplyr from breaking base R's lag() function. #
## # #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning. #
## # #
## ###############################################################################
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
library(tseries)
library(Metrics)
## Warning: package 'Metrics' was built under R version 4.3.3
##
## Attaching package: 'Metrics'
## The following object is masked from 'package:forecast':
##
## accuracy
Train dataset
FnB <- read.csv("data_input/FnB/data/data-train.csv")
head(FnB)
## transaction_date receipt_number item_id item_group item_major_group
## 1 2017-12-01T13:32:46Z A0026694 I10100139 noodle_dish food
## 2 2017-12-01T13:32:46Z A0026694 I10500037 drinks beverages
## 3 2017-12-01T13:33:39Z A0026695 I10500044 drinks beverages
## 4 2017-12-01T13:33:39Z A0026695 I10400009 side_dish food
## 5 2017-12-01T13:33:39Z A0026695 I10500046 drinks beverages
## 6 2017-12-01T13:35:59Z A0026696 I10300002 pastry food
## quantity price_usd total_usd payment_type sales_type
## 1 1 7.33 7.33 cash dine_in
## 2 1 4.12 4.12 cash dine_in
## 3 1 2.02 2.02 cash dine_in
## 4 1 5.60 5.60 cash dine_in
## 5 1 3.01 3.01 cash dine_in
## 6 1 4.86 4.86 cash dine_in
Missing Value Check
anyNA(FnB)
## [1] FALSE
Summary of data
glimpse(FnB)
## Rows: 137,748
## Columns: 10
## $ transaction_date <chr> "2017-12-01T13:32:46Z", "2017-12-01T13:32:46Z", "2017…
## $ receipt_number <chr> "A0026694", "A0026694", "A0026695", "A0026695", "A002…
## $ item_id <chr> "I10100139", "I10500037", "I10500044", "I10400009", "…
## $ item_group <chr> "noodle_dish", "drinks", "drinks", "side_dish", "drin…
## $ item_major_group <chr> "food", "beverages", "beverages", "food", "beverages"…
## $ quantity <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1,…
## $ price_usd <dbl> 7.33, 4.12, 2.02, 5.60, 3.01, 4.86, 6.34, 7.58, 4.12,…
## $ total_usd <dbl> 7.33, 4.12, 2.02, 5.60, 3.01, 4.86, 6.34, 7.58, 4.12,…
## $ payment_type <chr> "cash", "cash", "cash", "cash", "cash", "cash", "cash…
## $ sales_type <chr> "dine_in", "dine_in", "dine_in", "dine_in", "dine_in"…
From the summary, we can see that transaction_date is not yet in a suitable data type. We need to convert it to a datetime format and then floor it to the hour to obtain the hourly transaction time.
FnB_1 <- FnB %>%
mutate(transaction_date = ymd_hms(transaction_date) %>%
floor_date(unit = "hour")) %>%
group_by(transaction_date) %>%
summarise(visitor_number = n_distinct(receipt_number)) %>%
arrange(transaction_date) %>%
ungroup()
FnB_1 <- FnB_1 %>%
pad(start_val = ymd("2017-12-01"), end_val = ymd("2018-02-18"))
## pad applied on the interval: hour
FnB_1 <- FnB_1 %>%
mutate(hour = hour(transaction_date)) %>%
filter(hour %in% c(10:22)) %>%
select(transaction_date, visitor_number)
Since the restaurant operates from 10 AM to 10 PM, hours within operating time that had no transactions appear as NA after padding. We fill these NA values in visitor_number with 0.
FnB_1 <- FnB_1 %>%
mutate(visitor_number = na.fill(visitor_number, 0))
Recheck the data: inspect the date range it covers and confirm that no missing values remain.
min_date <- min(FnB_1$transaction_date)
max_date <- max(FnB_1$transaction_date)
min_date
## [1] "2017-12-01 10:00:00 UTC"
max_date
## [1] "2018-02-17 22:00:00 UTC"
Convert the data into time series Object
We use a frequency of 13, which corresponds to the 13 operating hours per day (10 AM to 10 PM inclusive).
FnB_ts <- ts(FnB_1$visitor_number,start=c(1,4),frequency = 13)
Visualize the data to see its overall pattern
autoplot(FnB_ts) + theme_minimal()
Insight: the series exhibits trend, seasonality, and error (remainder) components.
To observe the trend more clearly, we can decompose the data. Upon inspection, the trend column of this decomposition is not smooth: it still shows a repeating pattern, i.e. seasonality that is not captured by the single frequency used in FnB_ts. This suggests that our data contains multiple seasonal patterns.
FnB1_deco<-decompose(x=FnB_ts)
autoplot(FnB1_deco)
Insight: The trend component in the plot above still shows a distinct repeating pattern, hinting at additional seasonality that the single-frequency decomposition does not capture. This implies that our dataset contains multiple seasonal variations. To handle this multi-seasonal structure, we convert the data into a multiple-seasonality time series object, which can accommodate more than one frequency.
Generate a multiple-seasonality time series with the msts() function, specifying the seasonal periods as daily (13) and weekly (13*7) so that both the daily and weekly seasonal patterns are captured. Then decompose the series and visualize the results.
FnB_msts<-msts(FnB_1$visitor_number, seasonal.periods = c(13,13*7))
autoplot(FnB_msts)
FnB_msts_deco <- mstl(FnB_msts)
FnB_msts %>% tail(4*7*13) %>% stl(s.window = "periodic") %>% autoplot()
Insight
Reviewing the graph above, the estimated trend from the multiple-seasonality decomposition is noticeably smoother and clearer. The daily and weekly seasonal components are also separated more distinctly, which makes the series easier to analyse and interpret.
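To make this comparison concrete, the two trend estimates can be plotted on the same axes. This is an added sketch (not part of the original output); note that the single-seasonal trend from decompose() has NA values at both ends, so ggplot will warn about removed rows:
# compare the trend estimated with one seasonal period vs two seasonal periods
data.frame(
  tr_date = FnB_1$transaction_date,
  single_seasonal = as.numeric(FnB1_deco$trend),
  multi_seasonal = as.numeric(FnB_msts_deco[, "Trend"])
) %>%
  pivot_longer(-tr_date, names_to = "decomposition", values_to = "trend") %>%
  ggplot(aes(x = tr_date, y = trend, colour = decomposition)) +
  geom_line() +
  theme_minimal() +
  labs(title = "Estimated trend: single vs multiple seasonality")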
Daily seasonality analysis by hour
FnB_1 %>%
mutate(seasonal = FnB1_deco$seasonal,
hour = hour(transaction_date)) %>%
distinct(hour, seasonal) %>%
ggplot(mapping = aes(x = hour, y = seasonal, fill = as.factor(hour))) +
geom_col()+
theme_minimal()+
scale_x_continuous(breaks = seq(10,22,1)) +
labs(
title = "Single Seasonality Analysis",
subtitle = "Daily"
)
Insight
- The majority of visitors patronize the restaurant between 7:00 PM and 10:00 PM, with 10:00 PM being the single busiest hour.
- The restaurant experiences its lowest visitor count at its opening hour, 10:00 AM.
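To read the exact ordering of hours rather than judging it from the bars, the same seasonal values can be ranked. This is a small added sketch (not part of the original output), reusing the objects defined above:
# rank operating hours by their daily seasonal component (largest = busiest relative to the trend)
FnB_1 %>%
  mutate(seasonal = FnB1_deco$seasonal,
         hour = hour(transaction_date)) %>%
  distinct(hour, seasonal) %>%
  arrange(desc(seasonal))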
Weekly seasonality analysis by hour
as.data.frame(FnB_msts_deco) %>%
mutate(tr_date = FnB_1$transaction_date) %>%
mutate(
Day_of_Week = wday(tr_date, label = T, abbr = F),
Hour = as.factor(hour(tr_date))
) %>%
group_by(Day_of_Week, Hour) %>%
summarise(Seasonal = sum(Seasonal13 + Seasonal91)) %>%
ggplot() +
geom_bar(aes(x = Hour, y = Seasonal, fill = Day_of_Week), stat ="identity", position = "stack", width = 0.7) +
# day-name keys assume an English locale for wday(); adjust if your locale differs
scale_fill_manual(values = c("Monday" = "blue", "Tuesday" = "green", "Wednesday" = "red", "Thursday" = "purple", "Friday" = "orange", "Saturday" = "yellow", "Sunday" = "cyan")) +
labs(
title = "Multi Seasonality Analysis",
subtitle = "Weekly"
)
## `summarise()` has grouped output by 'Day_of_Week'. You can override using the
## `.groups` argument.
Insight
- Peak hours for the outlet fall within the 19:00 to 22:00 timeframe daily.
- Sundays witness the highest visitor influx, notably at 20:00.
- The least busy period occurs uniformly at 10:00 daily, corresponding to the outlet's opening hours.
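To pin down the busiest day/hour combination numerically rather than reading it off the stacked bars, the same summary can be sorted. This is a small added sketch (not part of the original output) reusing the pipeline above:
# top 5 day/hour combinations by combined daily + weekly seasonal effect
as.data.frame(FnB_msts_deco) %>%
  mutate(tr_date = FnB_1$transaction_date,
         Day_of_Week = wday(tr_date, label = TRUE, abbr = FALSE),
         Hour = hour(tr_date)) %>%
  group_by(Day_of_Week, Hour) %>%
  summarise(Seasonal = sum(Seasonal13 + Seasonal91), .groups = "drop") %>%
  slice_max(Seasonal, n = 5)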
The time series cross-validation scheme should not be randomly sampled but split sequentially. In this scenario, we will divide the data into Test (1 Week = 7 x 13 Hours) and Train (Total Data excluding Test Data).
FnBtest_msts <- tail(FnB_msts, n = 7*13)
FnBtrain_msts <- head(FnB_msts, n= length(FnB_ts)-length(FnBtest_msts))
Based on the exploratory analysis above, the dataset exhibits both trend and seasonal patterns. Therefore, appropriate methods are Triple Exponential Smoothing (Holt-Winters) and ARIMA.
#model
FnB_TESM_test<- HoltWinters(FnBtrain_msts)
FnB_TESM_test
## Holt-Winters exponential smoothing with trend and additive seasonal component.
##
## Call:
## HoltWinters(x = FnBtrain_msts)
##
## Smoothing parameters:
## alpha: 0.02059033
## beta : 0.004279326
## gamma: 0.3800661
##
## Coefficients:
## [,1]
## a 28.369622821
## b 0.001528001
## s1 -20.474638787
## s2 -11.082897293
## s3 -5.011391454
## s4 1.548814355
## s5 4.158496257
## s6 7.333703816
## s7 10.493680079
## s8 8.717766502
## s9 11.036673055
## s10 25.602690209
## s11 36.526600516
## s12 28.411245963
## s13 15.447509896
## s14 -21.120471785
## s15 -20.583872852
## s16 -11.543344873
## s17 -14.335345073
## s18 -12.100975342
## s19 -8.913709939
## s20 -13.592510972
## s21 -6.112779839
## s22 1.032456914
## s23 7.236459105
## s24 16.304430332
## s25 16.102872821
## s26 8.665769647
## s27 -22.101851889
## s28 -17.738321058
## s29 -12.540679150
## s30 -11.642409372
## s31 -12.329560350
## s32 -8.628433719
## s33 -8.775276205
## s34 -2.836904624
## s35 -0.008611163
## s36 12.786797732
## s37 13.848327034
## s38 17.958000831
## s39 14.454443317
## s40 -18.874334303
## s41 -12.532328022
## s42 -9.314493479
## s43 -13.785421373
## s44 -8.473836004
## s45 -9.426220042
## s46 -3.826022409
## s47 -0.988007035
## s48 1.367908539
## s49 13.526248085
## s50 22.601052203
## s51 15.375501037
## s52 13.370384613
## s53 -18.987345711
## s54 -14.613340995
## s55 -8.398258226
## s56 -12.106760872
## s57 -10.253528129
## s58 -6.306668081
## s59 -7.044260274
## s60 -7.673654287
## s61 0.485797800
## s62 20.840823309
## s63 28.229425113
## s64 27.888256725
## s65 14.989522530
## s66 -27.716755573
## s67 -27.709714789
## s68 -27.704368251
## s69 -22.956974829
## s70 -10.775526978
## s71 -7.769615517
## s72 -10.756083913
## s73 -7.144881724
## s74 7.524893395
## s75 22.030571714
## s76 35.092421011
## s77 30.182932032
## s78 21.900713124
## s79 -18.693466384
## s80 -8.757682181
## s81 -1.850892036
## s82 -2.808011845
## s83 0.891846241
## s84 2.256859790
## s85 1.174339457
## s86 7.002404348
## s87 11.725322785
## s88 24.789991547
## s89 42.226271061
## s90 37.836195470
## s91 25.239083370
# forecast
Forecast_TESM_test <- forecast(FnB_TESM_test, h = 7*13)
fitted_values <- fitted(Forecast_TESM_test)
#plot
FnBtrain_msts %>%
autoplot(series = "Actual") +
autolayer(FnBtest_msts, series = "Actual test") +
autolayer(fitted_values, series = "Fitted values (additive)") +
autolayer(Forecast_TESM_test$mean, series = "Predict data test (additive)")
## Warning: Removed 91 rows containing missing values (`geom_line()`).
test_forecast(actual = FnB_msts,
forecast.obj = Forecast_TESM_test,
train = FnBtrain_msts,
test = FnBtest_msts)
Model Evaluation
#Evaluation
mae(Forecast_TESM_test$mean, FnBtest_msts)
## [1] 6.28736
#Model
Arima_ts <- stlm(FnBtrain_msts, method = "arima")
#Forecast
Forecast_Arima <- forecast(Arima_ts, h = 13*7)
Model Evaluation
#Evaluation
mae(Forecast_Arima$mean, FnBtest_msts)
## [1] 5.557077
#plot
FnBtrain_msts %>%
autoplot(series = "Actual") +
autolayer(FnBtest_msts, series = "Actual test") +
autolayer(Forecast_Arima$mean, series = "Predict ARIMA")
test_forecast(actual = FnB_msts,
forecast.obj = Forecast_Arima,
train = FnBtrain_msts,
test = FnBtest_msts)
accuracyData <- data.frame(
tdate = FnB_1$transaction_date %>% tail(13*7),
actual = as.vector(FnBtest_msts),
TESM_forecast = as.vector(Forecast_TESM_test$mean),
arima_forecast = as.vector(Forecast_Arima$mean)
)
accuracyData %>%
ggplot() +
geom_line(aes(x = tdate, y = actual, colour = "Actual"), size = 1) +
geom_line(aes(x = tdate, y = arima_forecast, colour = "Arima Model (Best Model)"), size = 1)+
labs(
title = "Actual vs Arima Model",
subtitle = "Hourly visitor",
x = "Date",
y = "Visitor",
colour = ""
) +
scale_color_manual(values = c("Actual" = "tan4", "Arima Model (Best Model)" = "orange"))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Visualization of Actual vs All Models
accuracyData %>%
ggplot() +
geom_line(aes(x = tdate, y = actual, colour = "Actual"), size = 0.5) +
geom_line(aes(x = tdate, y = TESM_forecast, colour = "TESM Model"), size = 0.1) +
geom_line(aes(x = tdate, y = arima_forecast, colour = "Arima Model"), size = 0.1) +
labs(
title = "Actual vs All Models",
subtitle = "Hourly visitor",
x = "Date",
y = "Visitor",
colour = "") +
scale_color_manual(values = c("yellow", "darkgreen", "maroon"))
As it stands, the ARIMA model exhibits the lowest MAE (about 5.56, versus 6.29 for the Holt-Winters model). Consequently, we will use the ARIMA model to forecast on the test dataset.
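For a compact side-by-side comparison of the two models, here is a small added sketch using the objects and metric already computed above:
# MAE of both models on the same hold-out week (lower is better)
data.frame(
  model = c("Holt-Winters (TESM)", "STLM + ARIMA"),
  MAE = c(mae(FnBtest_msts, Forecast_TESM_test$mean),
          mae(FnBtest_msts, Forecast_Arima$mean))
)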
FnB_test <- read.csv("data_input/FnB/data/data-test.csv")
Adjust the data type of the datetime column
FnB_test <- FnB_test %>%
mutate(datetime = ymd_hms(datetime) %>%
floor_date(unit = "hour"))
Arima_test <- stlm(FnB_ts, method = "arima")
Forecast_Arima_test <-forecast(Arima_test, h=13*7)
Insert the forecast values into the table
Insert <- FnB_test %>%
mutate(visitor = Forecast_Arima_test$mean)
Save Data
write.csv(Insert, file = "FnB_Forecast.csv")
Check the first 26 rows (the first two forecast days)
head(Insert, 26)
## datetime visitor
## 1 2018-02-19 10:00:00 15.295440
## 2 2018-02-19 11:00:00 17.119389
## 3 2018-02-19 12:00:00 19.554802
## 4 2018-02-19 13:00:00 22.619172
## 5 2018-02-19 14:00:00 22.505153
## 6 2018-02-19 15:00:00 28.679438
## 7 2018-02-19 16:00:00 28.801170
## 8 2018-02-19 17:00:00 28.642075
## 9 2018-02-19 18:00:00 32.163747
## 10 2018-02-19 19:00:00 49.515288
## 11 2018-02-19 20:00:00 58.640082
## 12 2018-02-19 21:00:00 57.878295
## 13 2018-02-19 22:00:00 52.387666
## 14 2018-02-20 10:00:00 7.998891
## 15 2018-02-20 11:00:00 12.099635
## 16 2018-02-20 12:00:00 15.039004
## 17 2018-02-20 13:00:00 18.556734
## 18 2018-02-20 14:00:00 18.850561
## 19 2018-02-20 15:00:00 25.391746
## 20 2018-02-20 16:00:00 25.843544
## 21 2018-02-20 17:00:00 25.981378
## 22 2018-02-20 18:00:00 29.770169
## 23 2018-02-20 19:00:00 47.362012
## 24 2018-02-20 20:00:00 56.702983
## 25 2018-02-20 21:00:00 56.135670
## 26 2018-02-20 22:00:00 50.819991
To ensure reliable forecasting, it is essential to validate several assumptions about the model residuals: no autocorrelation and, ideally, normality.
Box.test(x= Arima_test$residuals)
##
## Box-Pierce test
##
## data: Arima_test$residuals
## X-squared = 0.023826, df = 1, p-value = 0.8773
Insight: With a p-value above 0.05, we fail to reject the null hypothesis of no autocorrelation, so we conclude that the residuals exhibit no significant autocorrelation.
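As a slightly stricter check, one could also run the Ljung-Box variant over a full daily cycle of lags. This is an added sketch, not part of the original analysis:
# Ljung-Box test across the first 13 lags (one operating day)
Box.test(Arima_test$residuals, lag = 13, type = "Ljung-Box")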
shapiro.test(Arima_test$residuals)
##
## Shapiro-Wilk normality test
##
## data: Arima_test$residuals
## W = 0.99506, p-value = 0.002019
Insight: With a p-value below 0.05, we reject the null hypothesis of normality: the residuals are not normally distributed. Note that the Shapiro-Wilk test only evaluates how far the residual distribution deviates from normality; it says nothing about forecast performance, which may deteriorate over longer forecast horizons. To forecast further into the future, additional data should be incorporated into the model.
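A quick visual check of the residual distribution supports this conclusion. This is an added sketch, not part of the original output:
# histogram and normal Q-Q plot of the ARIMA model residuals
par(mfrow = c(1, 2))
hist(Arima_test$residuals, breaks = 30, main = "Residual histogram", xlab = "Residual")
qqnorm(Arima_test$residuals)
qqline(Arima_test$residuals)
par(mfrow = c(1, 1))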
In conclusion, based on the seasonality analysis, Sunday at 8:00 PM experiences the highest influx of visitors.