1 Intro

1.1 Objective

The Food and Beverage dataset, provided by Dattabot, contains detailed transactions from multiple food and beverage outlets. Using this dataset, we are challenged to perform forecasting and time series analysis to help the outlet's owner make better business decisions.

1.2 Setup

# load library
library(dplyr)
library(lubridate) 
library(tidyr)
library(ggplot2)
library(TSstudio) 
library(forecast)
library(readr)
library(padr)
library(zoo)
library(ggsci)

theme_set(theme_minimal())

2 Get the Data

fnb <- read.csv("data/data-train.csv")

head(fnb)
#>       transaction_date receipt_number   item_id  item_group item_major_group
#> 1 2017-12-01T13:32:46Z       A0026694 I10100139 noodle_dish             food
#> 2 2017-12-01T13:32:46Z       A0026694 I10500037      drinks        beverages
#> 3 2017-12-01T13:33:39Z       A0026695 I10500044      drinks        beverages
#> 4 2017-12-01T13:33:39Z       A0026695 I10400009   side_dish             food
#> 5 2017-12-01T13:33:39Z       A0026695 I10500046      drinks        beverages
#> 6 2017-12-01T13:35:59Z       A0026696 I10300002      pastry             food
#>   quantity price_usd total_usd payment_type sales_type
#> 1        1      7.33      7.33         cash    dine_in
#> 2        1      4.12      4.12         cash    dine_in
#> 3        1      2.02      2.02         cash    dine_in
#> 4        1      5.60      5.60         cash    dine_in
#> 5        1      3.01      3.01         cash    dine_in
#> 6        1      4.86      4.86         cash    dine_in

Column explanation:

  • transaction_date: The timestamp of a transaction
  • receipt_number: The ID of a transaction
  • item_id: The ID of an item in a transaction
  • item_group: The group ID of an item in a transaction
  • item_major_group: The major-group ID of an item in a transaction
  • quantity: The quantity of the purchased item
  • price_usd: The price of the purchased item in USD
  • total_usd: The total price of the purchased item in USD
  • payment_type: The payment method
  • sales_type: The sales method

3 Data Preprocess

Inspect the structure of the data and make sure each column has an appropriate type.

glimpse(fnb)
#> Rows: 137,748
#> Columns: 10
#> $ transaction_date <chr> "2017-12-01T13:32:46Z", "2017-12-01T13:32:46Z", "2017…
#> $ receipt_number   <chr> "A0026694", "A0026694", "A0026695", "A0026695", "A002…
#> $ item_id          <chr> "I10100139", "I10500037", "I10500044", "I10400009", "…
#> $ item_group       <chr> "noodle_dish", "drinks", "drinks", "side_dish", "drin…
#> $ item_major_group <chr> "food", "beverages", "beverages", "food", "beverages"…
#> $ quantity         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1,…
#> $ price_usd        <dbl> 7.33, 4.12, 2.02, 5.60, 3.01, 4.86, 6.34, 7.58, 4.12,…
#> $ total_usd        <dbl> 7.33, 4.12, 2.02, 5.60, 3.01, 4.86, 6.34, 7.58, 4.12,…
#> $ payment_type     <chr> "cash", "cash", "cash", "cash", "cash", "cash", "cash…
#> $ sales_type       <chr> "dine_in", "dine_in", "dine_in", "dine_in", "dine_in"…

We have to convert transaction_date to a datetime type.

fnb$transaction_date <- ymd_hms(fnb$transaction_date)

Round the datetime down to the hour using the floor_date() function.

fnb$transaction_date <- floor_date(fnb$transaction_date, "hour")

Group the data by transaction_date and count the unique receipt_number values to get the number of visitors per hour.

fnb_clean <- fnb %>% 
  group_by(transaction_date) %>% 
  summarise(visitor = length(unique(receipt_number)))

fnb_clean
#> # A tibble: 1,244 × 2
#>    transaction_date    visitor
#>    <dttm>                <int>
#>  1 2017-12-01 13:00:00      16
#>  2 2017-12-01 14:00:00      38
#>  3 2017-12-01 15:00:00      27
#>  4 2017-12-01 16:00:00      29
#>  5 2017-12-01 17:00:00      44
#>  6 2017-12-01 18:00:00      50
#>  7 2017-12-01 19:00:00      66
#>  8 2017-12-01 20:00:00      70
#>  9 2017-12-01 21:00:00      63
#> 10 2017-12-01 22:00:00      63
#> # ℹ 1,234 more rows

Pad the time series to fill in missing hourly intervals.

fnb_clean <- fnb_clean %>% 
  pad(interval = "hour",
      start_val = ymd_hms("2017-12-01 10:00:00 UTC"),
      end_val = ymd_hms("2018-02-18 22:00:00 UTC"))

fnb_clean
#> # A tibble: 1,909 × 2
#>    transaction_date    visitor
#>    <dttm>                <int>
#>  1 2017-12-01 10:00:00      NA
#>  2 2017-12-01 11:00:00      NA
#>  3 2017-12-01 12:00:00      NA
#>  4 2017-12-01 13:00:00      16
#>  5 2017-12-01 14:00:00      38
#>  6 2017-12-01 15:00:00      27
#>  7 2017-12-01 16:00:00      29
#>  8 2017-12-01 17:00:00      44
#>  9 2017-12-01 18:00:00      50
#> 10 2017-12-01 19:00:00      66
#> # ℹ 1,899 more rows

Filter the hours to the outlet's opening times, from 10 am to 10 pm.

fnb_clean <- fnb_clean %>% 
  filter(hour(transaction_date) >= 10 & hour(transaction_date) <= 22)

fnb_clean
#> # A tibble: 1,040 × 2
#>    transaction_date    visitor
#>    <dttm>                <int>
#>  1 2017-12-01 10:00:00      NA
#>  2 2017-12-01 11:00:00      NA
#>  3 2017-12-01 12:00:00      NA
#>  4 2017-12-01 13:00:00      16
#>  5 2017-12-01 14:00:00      38
#>  6 2017-12-01 15:00:00      27
#>  7 2017-12-01 16:00:00      29
#>  8 2017-12-01 17:00:00      44
#>  9 2017-12-01 18:00:00      50
#> 10 2017-12-01 19:00:00      66
#> # ℹ 1,030 more rows

Replace NA values

We replace the missing values with zero, indicating that there were no visitors in that hour.

library(zoo)

fnb_clean <- fnb_clean %>% 
  mutate(visitor = na.fill(visitor, 0))

fnb_clean
#> # A tibble: 1,040 × 2
#>    transaction_date    visitor
#>    <dttm>                <int>
#>  1 2017-12-01 10:00:00       0
#>  2 2017-12-01 11:00:00       0
#>  3 2017-12-01 12:00:00       0
#>  4 2017-12-01 13:00:00      16
#>  5 2017-12-01 14:00:00      38
#>  6 2017-12-01 15:00:00      27
#>  7 2017-12-01 16:00:00      29
#>  8 2017-12-01 17:00:00      44
#>  9 2017-12-01 18:00:00      50
#> 10 2017-12-01 19:00:00      66
#> # ℹ 1,030 more rows
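
For reference, an equivalent way to do the same fill while staying in the tidyverse (tidyr is already loaded in the setup) is replace_na(); a minimal sketch:

# alternative to zoo::na.fill(): replace NA visitor counts with 0 using tidyr
fnb_clean <- fnb_clean %>% 
  mutate(visitor = replace_na(visitor, 0L))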

Check the date range of transaction_date after padding and filtering.

range(fnb_clean$transaction_date)
#> [1] "2017-12-01 10:00:00 UTC" "2018-02-18 22:00:00 UTC"

4 Seasonality Analysis

4.1 Decomposition

Create a time series object using the ts() function with frequency 13, where 13 is the number of hours the outlet is open each day, from 10 am to 10 pm.

fnb_ts <- ts(data = fnb_clean$visitor,
             freq = 13)
fnb_dc <- decompose(fnb_ts)

plot(fnb_dc)

Explanation:

From the plot above, the trend component still shows a pattern and is not smooth. This indicates that there are multiple seasonalities in the data, so we need to convert the data into a multiple-seasonality time series object, which accepts multiple frequency settings.

Create a multiple-seasonality time series object using the msts() function with frequencies 13 (daily seasonality across the opening hours) and 13*7 (weekly seasonality). This captures both the daily and weekly patterns. Then decompose and plot it.

fnb_ts_2 <- msts(data = fnb_clean$visitor,
                 seasonal.periods = c(13, 13*7))
fnb_dc_2 <- mstl(fnb_ts_2)

fnb_dc_2 %>% 
  autoplot()

Explanation:

Based on the plot above, the trend is now smoother and clearer. The daily and weekly seasonal components are also clearer.

4.2 Seasonality Visualization

Create a visualization of hourly and weekly seasonality.

fnb_msts_dc_fr <- data.frame(fnb_dc_2)

fnb_msts_dc_fr %>%
  mutate(Day_of_Week = wday(fnb_clean$transaction_date, 
                            label = TRUE, 
                            abbr = FALSE), 
         Hour = hour(fnb_clean$transaction_date)) %>% 
  group_by(Day_of_Week, Hour) %>%
  summarise(Seasonal = sum(Seasonal13 + Seasonal91)) %>%
  ggplot() +
  geom_bar(aes(x = Hour, y = Seasonal, fill = Day_of_Week),
           stat = "identity", position = "stack", width = 0.7) +
  scale_x_continuous(breaks = seq(10, 22, 1)) +
  labs(title = "Multi-Seasonality Analysis - Weekly & Hourly")

Explanation:

Based on the plot above, Saturday at 20.00 is when the most visitors come. The fewest visitors come at 10.00, on every day of the week, as the outlet has just opened.
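
As a quick cross-check of the plot, the same ranking can be read straight from the aggregated data; a minimal sketch that averages visitors per weekday-hour combination and lists the busiest five:

# sanity check: average visitors per weekday-hour combination, busiest first
fnb_clean %>% 
  mutate(Day_of_Week = wday(transaction_date, label = TRUE, abbr = FALSE),
         Hour = hour(transaction_date)) %>% 
  group_by(Day_of_Week, Hour) %>% 
  summarise(avg_visitor = mean(visitor), .groups = "drop") %>% 
  arrange(desc(avg_visitor)) %>% 
  head(5)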

5 Model Fitting and Evaluation

5.1 Cross Validation

Split the data into a validation set and a training set (all data excluding the validation set). The validation set is the last week of observations, i.e. 13 × 7 hours.

fnb_val <- tail(fnb_ts_2, 13*7)

fnb_train <- head(fnb_ts_2, length(fnb_ts_2)-length(fnb_val))

5.2 Model Fitting

Since the time series has error, trend, and seasonal components, two models will be built:

1. Holt-Winters exponential smoothing (triple exponential smoothing)
2. ARIMA (via STL decomposition with stlm())

model_hw <- HoltWinters(fnb_train)

model_arima <- stlm(fnb_train, method = "arima")

Forecast one week ahead (13 × 7 hours) with both models.

forecast_hw <- forecast(model_hw, h = 13*7)
forecast_arima <- forecast(model_arima, h=13*7)

5.3 Visualize Each Model

  1. Holt-Winters - Actual vs Estimated Number of Visitors
fnb_ts_2 %>% autoplot(series = "Actual") +
  autolayer(forecast_hw$mean, series = "Predict HW")

  2. ARIMA - Actual vs Estimated Number of Visitors
fnb_ts_2 %>% autoplot(series = "Actual") +
  autolayer(forecast_arima$mean, series = "Predict Arima")

5.4 Summary Models

Here is the accuracy summary of both models on the validation set of the multi-seasonal time series.

acc_hw <- accuracy(object = forecast_hw$mean, fnb_val)
acc_arima <- accuracy(object = forecast_arima$mean, fnb_val)

result <- rbind(acc_hw, acc_arima)
rownames(result) <- c("Holt Winter", "Arima")
result
#>                    ME     RMSE      MAE  MPE MAPE      ACF1 Theil's U
#> Holt Winter -2.944367 8.168453 6.463898 -Inf  Inf 0.4683447         0
#> Arima       -2.092000 7.317929 5.688691  NaN  Inf 0.3436265         0

Based on the summary, the ARIMA model is more accurate than Holt-Winters, with an MAE of 5.688691 (< 6). This model will be used for prediction.
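
For reference, the MAE reported by accuracy() is simply the mean absolute difference between the forecast and the validation series; a minimal sketch recomputing it by hand for the ARIMA forecast:

# MAE = mean absolute error over the 13*7-hour validation window
mean(abs(as.vector(fnb_val) - as.vector(forecast_arima$mean)))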

5.5 Visualization Actual vs Estimated (Best Model)

accuracyData <- data.frame(trx_date = fnb_clean$transaction_date %>% tail(13*7),
                           actual = as.vector(fnb_val),
                           HW_Forecast = as.vector(forecast_hw$mean),
                           Arima_Forecast = as.vector(forecast_arima$mean))

accuracyData %>% 
  ggplot() +
  geom_line(aes(x = trx_date, y = actual, colour = "Actual"), size = 1) +
  geom_line(aes(x = trx_date, y = Arima_Forecast, colour = "Arima Model (Best Model)"), size = 1) +
  labs(title = "Hourly Visitor - Actual Vs Arima Model",
       x = "Date",
       y = "Visitor",
       colour = NULL)

5.6 Visualization Actual vs Estimated (All Models)

accuracyData %>% 
  ggplot() +
  geom_line(aes(x = trx_date, y = actual, colour = "Actual"), size = 0.5) +
  geom_line(aes(x = trx_date, y = HW_Forecast, colour = "Holt Winter Model"), size = 0.1) +
  geom_line(aes(x = trx_date, y = Arima_Forecast, colour = "Arima Model (Best Model)"), size = 0.5) +
  labs(title = "Hourly Visitor - Actual Vs All Models",
       x = "Date",
       y = "Visitor",
       colour = "")

6 Prediction Performance

6.1 MAE < 6 in validation dataset.

The best model is the ARIMA model, with an MAE of 5.688691 on the validation set.

accuracy(object = forecast_arima$mean, fnb_val)
#>              ME     RMSE      MAE MPE MAPE      ACF1 Theil's U
#> Test set -2.092 7.317929 5.688691 NaN  Inf 0.3436265         0

6.2 MAE < 6 in test dataset

Predict using the multi-seasonal ARIMA model and save the predictions to the CSV file "submission_arima.csv".

Read the Test Data

fnb_test <- read.csv("data/data-test.csv")
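
Before forecasting, it is worth checking that the number of rows in the test set matches the one-week horizon (13 × 7 = 91 hours) used below; a quick sanity check, assuming data-test.csv has one row per opening hour:

# quick sanity check: the forecast horizon below (13*7 = 91) should equal the test rows
nrow(fnb_test)
13 * 7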

Fit the ARIMA model on the full multi-seasonal time series, then forecast one week ahead.

model_arima_test <- stlm(fnb_ts_2, method = "arima")

forecast_arima_test <- forecast(model_arima_test, h=13*7)

fnb_test$visitor <- forecast_arima_test$mean

See the prediction results.

head(fnb_test, 10)
#>                datetime   visitor
#> 1  2018-02-19T10:00:00Z  4.173315
#> 2  2018-02-19T11:00:00Z  6.880140
#> 3  2018-02-19T12:00:00Z 12.955462
#> 4  2018-02-19T13:00:00Z 15.673966
#> 5  2018-02-19T14:00:00Z 11.139184
#> 6  2018-02-19T15:00:00Z 19.988398
#> 7  2018-02-19T16:00:00Z 21.812542
#> 8  2018-02-19T17:00:00Z 20.382694
#> 9  2018-02-19T18:00:00Z 24.183307
#> 10 2018-02-19T19:00:00Z 40.806998

Save the test data with its predictions to the CSV file "submission_arima.csv".

write.csv(fnb_test, file = "submission_arima.csv", row.names=FALSE)


7 Conclusion

7.1 Autocorrelation Checking

Box.test(model_arima$residuals, type = "Ljung-Box")
#> 
#>  Box-Ljung test
#> 
#> data:  model_arima$residuals
#> X-squared = 0.00021456, df = 1, p-value = 0.9883

The result above gives a p-value of 0.9883. Because the p-value > alpha (0.05), H0 is not rejected, so we can conclude that the residuals have no autocorrelation.
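
A visual check alongside the Ljung-Box test, e.g. the sample autocorrelation function of the residuals, can confirm this; a minimal sketch using base R:

# ACF of the ARIMA residuals; spikes within the bounds support no autocorrelation
acf(model_arima$residuals, main = "ACF of ARIMA residuals")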

7.2 Normality of residuals

The normality of the residuals is tested using the shapiro.test() function.

shapiro.test(x = model_arima$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_arima$residuals
#> W = 0.99053, p-value = 8.6e-06

The result above gives a p-value of 0.0000086. Because the p-value < alpha (0.05), H0 is rejected, so we can conclude that the residuals are not normally distributed.
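
A visual check alongside the Shapiro-Wilk test, e.g. a normal Q-Q plot of the residuals, shows how far they deviate from normality; a minimal sketch using base R:

# Q-Q plot of the ARIMA residuals against the theoretical normal quantiles
qqnorm(model_arima$residuals)
qqline(model_arima$residuals, col = "red")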

How to handle it?
The residuals are not normally distributed. This could happen because our dataset is not large enough, but it does not mean the forecast performance is poor. We need to collect more data to build and feed the models, so that the residuals move closer to a normal distribution and the model evaluation and forecast performance improve.

8 Summary

  • The highest number of visitors occurs on Saturday at 20.00

  • The fewest visitors occur at 10.00, on every day of the week, when the outlet has just opened