Introduction

Background

This is my final project as an Algoritma trainee. After being trained for 2.5 months and provided with a lot of learning material, we were given a capstone project to practice our skills. I decided to go with a forecasting model.

Dataset

The Food and Beverage dataset is provided by Dattabot and contains detailed transactions from multiple food and beverage outlets. Using this dataset, I am going to do some forecasting and time series analysis to help the outlet’s owner make better business decisions.
Customer behaviour, especially in the food and beverage industry, is highly related to seasonality patterns. The owner wants to analyze the number of visitors so he can make better judgements in 2018.

  • Challenge: Forecast the hourly number of visitors and explain its seasonality. The forecast will be evaluated on the next 7 days (Monday, February 19th 2018 to Sunday, February 25th 2018).

Library Setup

library(forecast)    # time series models and forecasting
library(tidyverse)   # data wrangling and plotting (includes dplyr and ggplot2)
library(fpp)
library(TTR)
library(MLmetrics)   # error metrics such as MAE
library(tseries)
library(lubridate)   # date-time parsing and rounding
library(TSstudio)    # ts_info() and plot_forecast()
library(xts)
library(skimr)       # quick data summaries
library(padr)        # time series padding

Data Preprocessing

The train dataset contains detailed transaction records from December 1st 2017 to February 18th 2018:

fnb <- read.csv("Data Capstone/3. FnB-ts-visitor/data/data-train.csv")

Let’s check for missing values and take a look at the information the dataset includes:

colSums(is.na(fnb))
## transaction_date   receipt_number          item_id       item_group 
##                0                0                0                0 
## item_major_group         quantity        price_usd        total_usd 
##                0                0                0                0 
##     payment_type       sales_type 
##                0                0
head(fnb)
##       transaction_date receipt_number   item_id  item_group item_major_group
## 1 2017-12-01T13:32:46Z       A0026694 I10100139 noodle_dish             food
## 2 2017-12-01T13:32:46Z       A0026694 I10500037      drinks        beverages
## 3 2017-12-01T13:33:39Z       A0026695 I10500044      drinks        beverages
## 4 2017-12-01T13:33:39Z       A0026695 I10400009   side_dish             food
## 5 2017-12-01T13:33:39Z       A0026695 I10500046      drinks        beverages
## 6 2017-12-01T13:35:59Z       A0026696 I10300002      pastry             food
##   quantity price_usd total_usd payment_type sales_type
## 1        1      7.33      7.33         cash    dine_in
## 2        1      4.12      4.12         cash    dine_in
## 3        1      2.02      2.02         cash    dine_in
## 4        1      5.60      5.60         cash    dine_in
## 5        1      3.01      3.01         cash    dine_in
## 6        1      4.86      4.86         cash    dine_in
skim(fnb)
Data summary

| | Values |
|---|---|
| Name | fnb |
| Number of rows | 137748 |
| Number of columns | 10 |
| Column type frequency: factor | 7 |
| Column type frequency: numeric | 3 |
| Group variables | None |

Variable type: factor

| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| transaction_date | 0 | 1 | FALSE | 35477 | 201: 58, 201: 50, 201: 48, 201: 46 |
| receipt_number | 0 | 1 | FALSE | 35522 | A00: 58, A00: 50, A00: 48, A00: 46 |
| item_id | 0 | 1 | FALSE | 264 | I10: 8360, I10: 6290, I10: 5310, I10: 3099 |
| item_group | 0 | 1 | FALSE | 14 | dri: 40687, noo: 23832, ric: 19597, sid: 11085 |
| item_major_group | 0 | 1 | FALSE | 4 | foo: 74026, bev: 58114, top: 5158, pac: 450 |
| payment_type | 0 | 1 | FALSE | 2 | cas: 112512, non: 25236 |
| sales_type | 0 | 1 | FALSE | 3 | din: 127257, del: 9550, tak: 941 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| quantity | 0 | 1 | 1.16 | 0.65 | 1.00 | 1.00 | 1.00 | 1.00 | 36.00 | ▇▁▁▁▁ |
| price_usd | 0 | 1 | 4.88 | 1.93 | 1.03 | 3.13 | 4.61 | 6.22 | 18.82 | ▆▇▁▁▁ |
| total_usd | 0 | 1 | 5.53 | 3.65 | 1.03 | 4.12 | 4.86 | 6.34 | 326.15 | ▇▁▁▁▁ |

After importing the data, we need to parse `transaction_date` into a date-time format. To get hourly visitor counts, we floor each timestamp to the hour, count the unique receipt numbers per hour, and then pad the series so that every hour in the observation range is present.

fnb_agg <- fnb %>% 
  # floor_date() rounds each timestamp down to the start of its hour
  mutate(transaction_date = ymd_hms(transaction_date) %>% floor_date(unit = "hour")) %>% 
  group_by(transaction_date) %>% 
  # n_distinct() counts every unique receipt number as one visitor
  summarise(visitor_number = n_distinct(receipt_number)) %>%
  arrange(transaction_date) %>% 
  ungroup()

# pad the series so that every hour in the observation range has a row
fnb_agg <- fnb_agg %>% 
  pad(start_val = ymd("2017-12-01"), end_val = ymd("2018-02-18"))

Since the restaurant is open from 10 AM to 10 PM, we keep only the hours from 10 to 22 by extracting the hour of each timestamp.

fnb_agg <- fnb_agg %>%
  mutate(hour = hour(transaction_date)) %>% 
  filter(hour %in% c(10:22)) %>% 
  select(transaction_date, visitor_number)

The hours added by padding have no transactions, so their visitor_number is NA; we replace those NA values with 0. We also drop the first three rows (10.00 to 12.00 on December 1st), since the recorded transactions on that day only start at 13.00 (see the check below).

fnb_agg <- fnb_agg %>% 
  mutate(visitor_number = replace_na(visitor_number, 0))

# drop the first three padded rows (10.00-12.00 on December 1st) that precede the first transaction
fnb_agg <- fnb_agg[-c(1:3), ]

To make sure the padding and filtering worked as expected, let’s check the start and end of the series.

min_date <- min(fnb_agg$transaction_date)
max_date <- max(fnb_agg$transaction_date)

min_date
## [1] "2017-12-01 13:00:00 UTC"
max_date
## [1] "2018-02-17 22:00:00 UTC"

Seasonality

The first step is to convert the data into a time series object. Since the restaurant is open for 13 hours a day (10 AM to 10 PM), we use a frequency of 13, so that one cycle corresponds to one day.

# frequency = 13 opening hours per day; start = c(1, 4) because the series begins at the 4th opening hour (13.00) of day 1
vis_ts <- ts(fnb_agg$visitor_number, start = c(1, 4), frequency = 13)

Let’s visualize the data to see its overall pattern.

autoplot(vis_ts) + theme_minimal()

An early assumption we can make from this plot:
1. The data contains trend, seasonality, and error components.

Decomposing

To see the trend of the data, we decompose it using the daily period (13 hours), since each day has its own seasonality.
We find:
- The trend component is not smooth, which indicates there is seasonality that has not been captured in vis_ts. That means our data most likely has multiple seasonality.

fnb_deco <- decompose(x = vis_ts)
autoplot(fnb_deco)

Multiple Time Series

Let’s re-create our time series data as a multiple-seasonality (msts) object, with a daily period of 13 hours and a weekly period of 13 × 7 = 91 hours.

fnb_msts <- msts(fnb_agg$visitor_number, seasonal.periods = c(13, 13*7))
autoplot(fnb_msts)

# decompose the multiple-seasonality series with STL
fnb_msts_decompose <- mstl(fnb_msts)
autoplot(fnb_msts_decompose)

As we can see, the trend component is now smooth, indicating that our data set indeed has multiple seasonality.

Seasonality Analysis

Single Seasonality

fnb_agg %>% 
  mutate(
    seasonal = fnb_deco$seasonal,
    hour = hour(transaction_date)
  ) %>% 
  distinct(hour, seasonal) %>% 
  ggplot(mapping = aes(x = hour, y = seasonal)) +
  geom_col() +
  theme_minimal() +
  scale_x_continuous(breaks = seq(10,22,1)) +
  labs(
    title = "Single Seasonality Analysis",
    subtitle = "Daily"
  )

We find:
- Most visitors come to the restaurant between 7 PM and 10 PM (19.00-22.00).
- The opening hour (10 AM) has the fewest visitors.

Besides the single (daily) seasonality pattern, the weekly pattern shows (a sketch of one way to visualize it follows this list):
- During the lunch period (11 AM to 1 PM), Friday is the day with the fewest visitors.
- On weekends (Saturday & Sunday), the restaurant starts getting busy at 3 PM.
- Friday, Saturday & Sunday have the highest traffic from 7 PM to 10 PM.
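The statements above can be reproduced with a plot along the lines of the following sketch, which averages the hourly visitor count by day of week using `fnb_agg`; this is an illustrative approach, not necessarily the exact plot from the original analysis.

fnb_agg %>% 
  mutate(
    # day of week and hour of each observation
    day = wday(transaction_date, label = TRUE, abbr = FALSE, week_start = 1),
    hour = hour(transaction_date)
  ) %>% 
  group_by(day, hour) %>% 
  summarise(mean_visitor = mean(visitor_number), .groups = "drop") %>% 
  ggplot(aes(x = hour, y = mean_visitor)) +
  geom_col() +
  facet_wrap(~day, ncol = 7) +
  scale_x_continuous(breaks = seq(10, 22, 3)) +
  theme_minimal() +
  labs(
    title = "Weekly Seasonality Analysis",
    subtitle = "Average hourly visitors by day of week",
    y = "average visitors"
  )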

Model Fitting and Evaluation

Single Seasonality

Cross Validation

Let’s split the data into two sets, training and validation. For the validation set, we will use the last week (7 days × 13 hours = 91 observations), matching the 7-day evaluation window.

ts_info(vis_ts)
##  The vis_ts series is a ts object with 1 variable and 1024 observations
##  Frequency: 13 
##  Start time: 1 4 
##  End time: 79 13

The series ends at period 13, which means the last day in the data is complete (covering 10.00-22.00).
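We can double-check this with `cycle()`, which returns each observation’s position within the seasonal period; a quick sketch:

# the last observation should sit at position 13 of the 13-hour daily cycle (the 22.00 slot)
tail(cycle(vis_ts), n = 1)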

# data splitting: hold out the last week (13 * 7 hourly observations) for validation
fnbtrain <- head(vis_ts, n = length(vis_ts) - 13*7)
fnbtest <- tail(vis_ts, n = 13*7)

Triple Exponential Smoothing Model

Since our data has error, trend, and seasonality components, we can use the triple exponential smoothing (Holt-Winters) method as one of our models.

# fit a Holt-Winters model on the training set
model_tes_ts <- HoltWinters(fnbtrain)

# forecast
forecast_tes_ts <- forecast(model_tes_ts, h = 13*7)

# model evaluation
MAE(y_pred = forecast_tes_ts$mean, y_true = fnbtest)
## [1] 14.91141

Let’s visualize the forecast against the actual values.

plot_forecast(forecast_tes_ts)

As we can see from the visualization, most of the forecast data points (green) are not close to the actual data (blue), which means this model produces large errors.
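Besides MAE, the `accuracy()` function from the forecast package reports several error measures at once; a minimal sketch as a supplementary check:

# compare the Holt-Winters forecast against the validation set on multiple metrics (ME, RMSE, MAE, MAPE, ...)
accuracy(forecast_tes_ts, fnbtest)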

ARIMA Model

# fit an ARIMA model on the seasonally adjusted series (STL + ARIMA)
model_arima_ts <- stlm(fnbtrain, method = "arima")

# forecast
forecast_arima_ts <- forecast(model_arima_ts, h=13*7)

# model evaluation
MAE(y_pred = forecast_arima_ts$mean, y_true = fnbtest)
## [1] 7.077565

Let’s see how this model performs in a visualization.

plot_forecast(forecast_arima_ts)

As we can see from the visualization, this model is better than the previous one: the forecast data points are closer to the actual data and its MAE is smaller.

Multiple Seasonality

Cross Validation

# same split as before, now on the multiple-seasonality object
fnbtrain_msts <- head(fnb_msts, n = length(fnb_msts) - 13*7)
fnbtest_msts <- tail(fnb_msts, n = 13*7)

Triple Exponential Smoothing Model

model_tes_msts <- HoltWinters(fnbtrain_msts)

# forecast
forecast_tes_msts <- forecast(model_tes_msts, h = 13*7)

# model evaluation
MAE(y_pred = forecast_tes_msts$mean, y_true = fnbtest_msts)
## [1] 6.277796

Let’s see how this model’s forecast looks.

plot_forecast(forecast_tes_msts)

ARIMA Model

model_arima_msts <- stlm(fnbtrain_msts, method = "arima")

# forecast
forecast_arima_msts <- forecast(model_arima_msts, h=13*7)

# model evaluation
MAE(y_pred = forecast_arima_msts$mean, y_true = fnbtest_msts)
## [1] 5.579281

This model has the smallest MAE compared to our previous models. Let’s see how the visualization looks.

plot_forecast(forecast_arima_msts)

The result is good: most of the forecast data points are close to the actual data points, so we can conclude that this model is better than the others. A sketch that puts the four MAE scores side by side follows.
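To make the comparison explicit, here is a short sketch that collects the four MAE scores into one table; it assumes the forecast objects from the previous chunks are still in the environment.

model_comparison <- data.frame(
  model = c("Holt-Winters (single seasonality)",
            "STL + ARIMA (single seasonality)",
            "Holt-Winters (multiple seasonality)",
            "STL + ARIMA (multiple seasonality)"),
  MAE = c(MAE(y_pred = forecast_tes_ts$mean, y_true = fnbtest),
          MAE(y_pred = forecast_arima_ts$mean, y_true = fnbtest),
          MAE(y_pred = forecast_tes_msts$mean, y_true = fnbtest_msts),
          MAE(y_pred = forecast_arima_msts$mean, y_true = fnbtest_msts))
)
# order from best (lowest MAE) to worst
model_comparison[order(model_comparison$MAE), ]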

Prediction Performance

Since the smallest MAE (about 5.58) comes from the multiple-seasonality ARIMA model, we will use this model to forecast the test data.

test_data <- read.csv("Data Capstone/3. FnB-ts-visitor/data/data-test.csv")

# refit the STL + ARIMA model on the full series before forecasting the test period
model_arima_test <- stlm(fnb_msts, method = "arima")

# forecast the next week (7 days x 13 hours)
forecast_arima_test <- forecast(model_arima_test, h = 13*7)

# insert the forecast into the submission table
test_data$visitor <- forecast_arima_test$mean
write.csv(test_data, file = "forecast_fnb5.csv")

After submitting to the leaderboard, the forecast passes the MAE requirement.

Conclusion

To make sure the forecast is reliable, we have to test several residual assumptions:
1. No autocorrelation in the residuals
2. Normality of the residuals

Box.test(x= model_arima_test$residuals)
## 
##  Box-Pierce test
## 
## data:  model_arima_test$residuals
## X-squared = 0.0074198, df = 1, p-value = 0.9314

Since the p-value is greater than 0.05, we fail to reject the null hypothesis and conclude that the residuals have no autocorrelation.
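As a cross-check, the Ljung-Box variant of the same test can be run via the `type` argument of `Box.test()`; a quick sketch:

# Ljung-Box test for residual autocorrelation
Box.test(model_arima_test$residuals, type = "Ljung-Box")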

shapiro.test(model_arima_test$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_arima_test$residuals
## W = 0.99123, p-value = 8.823e-06

Since the p-value is less than 0.05, the residuals are not normally distributed. Note that the Shapiro-Wilk test only measures how far the residual distribution deviates from normality, not the forecast performance, which worsens over longer horizons. If we want to forecast further ahead, we will need to feed more data into our model.
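A quick visual check of the residual distribution can complement the Shapiro-Wilk test; a minimal sketch using base R graphics:

# histogram of the residuals of the final STL + ARIMA model
hist(model_arima_test$residuals, breaks = 30,
     main = "Residuals of the STL + ARIMA model", xlab = "residual")
# normal Q-Q plot: points far from the line indicate departure from normality
qqnorm(model_arima_test$residuals)
qqline(model_arima_test$residuals)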

Finally, from the Seasonality Analysis we can conclude that Saturday at 20.00 (8 PM) has the highest number of visitors.