1 Introduction

1.1 Background

This will be my final project as Algoritma Trainee. After being trained for 2,5 months and provided a lof of compatible material for us, we are given project to train our skill. I decided to go with Forecasting Model.

1.2 Dataset

The Food and Beverage dataset is provided by Dattabot, which contains detailed transaction of multiple food and beverage outlets. Using this dataset, I am going to do some forecasting and time series analysis to help the outlet’s owner making a better business decision.
Customer behaviour, especially in food and beverage industry is highly related to seasonality patterns. The owner wants to analyze the number visitor so he could make better judgement in 2018.

Challenge: Forecast result and seasonality explanation for hourly number of visitors, that would be evaluated on the next 7 days (Monday, February 19th 2018 to Sunday, February 25th 2018)!

1.3 Library Setup

library(lubridate)
library(tidyverse)
library(padr)
library(TSstudio)
library(ggfortify)
library(forecast)
library(MLmetrics)
library(tseries)

2 Data Preprocessing

The train dataset contains detailed transaction details from December 1st 2017 to February 18th 2018:

resto <- read.csv("data_input/data/data-train.csv")
resto

The dataset includes information about:

transaction_date: The timestamp of a transaction
receipt_number: The ID of a transaction
item_id: The ID of an item in a transaction
item_group: The group ID of an item in a transaction
item_major_group: The major-group ID of an item in a transaction
quantity: The quantity of purchased item
price_usd: The price of purchased item
total_usd: The total price of purchased item
payment_type: The payment method
sales_type: The sales method

After imported data, we need to format transaction_date into date format.

resto$transaction_date <- ymd_hms(resto$transaction_date)

In order to get transaction time hourly, we need to round the transaction_date.

resto <- resto %>% 
  mutate(datetime= floor_date(transaction_date, unit ="hour"))

For a customer can order more than one item in one time, let’s summarise the data by receipt_number to see how many orders created hourly.

resto <- resto %>% 
  group_by(datetime) %>% 
  summarise(freq = n_distinct(receipt_number))

To make sure that there is no missing time in our data, let’s do time series padding.

min_date <- min(resto$datetime)
max_date <- max(resto$datetime)

resto <- resto %>% 
  pad(start_val = make_datetime(year = year(min_date),month = month(min_date), day= day(min_date), hour = 0), end_val =make_datetime(year = year(max_date),month = month(max_date), day= day(max_date), hour = 23) )

## pad applied on the interval: hour

Since the restaurant is opened at 10 AM to 10 PM, we need to filter the datetime from 10 AM to 10 PM and replace na value with 0.

resto_fin <- resto %>% 
  mutate(freq = replace_na(freq,0)) %>% 
  filter(hour(datetime)>=10& hour(datetime)<=22)
resto_fin <- resto_fin[-c(1:3),]

3 Seasonality

First step we need to do is convert the data into time series object.

resto_ts <- ts(data=resto_fin$freq,start = c(1,4), frequency = 13)

Visualize the data to see the distribution of the data

resto_ts %>% 
  autoplot()+
  theme_minimal()

Early assumption we can get: 1. The data has trend, seasonanality, error.

3.1 Decomposing

If we want to see the trend of the data, we only need to decompose our data with one month periods since every month has its own seasonality.

resto_ts %>% 
  tail(13*7*4) %>% 
  stl(s.window = "periodic") %>% 
  autoplot()

We find:
- Since graphic in trend column haven’t been created smoothly, find out that there is seasonality that haven’t been caught in resto_ts. That means our data are most likely having multiple seasonality.

#Assigning the decomposed single seasonality model
resto_single_decompose <- decompose(resto_ts)

3.2 Multiple Time Series

Let’s re-create our time series data.

resto_msts <- msts(data = resto_fin$freq,seasonal.periods = c(13,13*7))

resto_msts %>% 
  tail(13*7*4) %>% 
  stl(s.window = "periodic") %>% 
  autoplot()

As we can see, the trend is created smoothly indicating that our data set has multiple seasonality.

# Assigning decomposed multiple seasonality model
resto_double_decompose <- mstl(resto_msts)

3.3 Seasonality Analysis

# Single seasonality

resto_fin %>% 
  mutate(
    seasonal = resto_single_decompose$seasonal,
    hour = hour(datetime)
  ) %>% 
  distinct(hour, seasonal) %>% 
  ggplot(mapping = aes(x = hour, y = seasonal)) +
  geom_col() +
  theme_minimal() +
  scale_x_continuous(breaks = seq(10,22,1)) +
  labs(
    title = "Single Seasonality Analysis",
    subtitle = "Daily"
  )

We find:
- Most of visitors come to restaurant at 19.00 - 22.00
- Opening hour (10 A.M) is the least of visitors of restaurant.

# Multiple Seasonality

as.data.frame(resto_double_decompose) %>% 
  mutate(datetime = resto_fin$datetime) %>% 
  mutate(
    dow = wday(datetime, label = TRUE, abbr = FALSE),
    hour = as.factor(hour(datetime))
  ) %>% 
  group_by(dow, hour) %>% 
  summarise(seasonal = sum(Seasonal13 + Seasonal91)) %>% 
  ggplot(mapping = aes(x = hour, y = seasonal)) +
  geom_col(aes(fill = dow)) +
  scale_fill_viridis_d(option = "plasma") +
  theme_minimal() +
  labs(
    title = "Multiple Seasonality Analysis",
    subtitle = "Daily & Weekly"
  )

Besides from Single Seasonality Pattern, we find:
- For lunch time period (11 A.M to 1 P.M), Friday is the day with least visitors.
- Weekend (Saturday & Sunday), the restaurant will start busy at 3 P.M.
- Friday, Saturday & Sunday are the highest traffic at 7 P.M - 10 P.M.

4 Model Fitting and Evaluation

4.1 Single Seasonality

4.1.1 Cross Validation

Let’s split data into 2 set of data, train and validation data. For validation data, we will use last 2 weeks.

#checking the tail of the data
ts_info(resto_ts)

##  The resto_ts series is a ts object with 1 variable and 1037 observations
##  Frequency: 13 
##  Start time: 1 4 
##  End time: 80 13

The end of our data is “13” which means we can consider our the it’s full day data (10.00-22.00).

# data splitting
train_resto_ts <- head(resto_ts, n = length(resto_ts)-13*7)
test_resto_ts <- tail(resto_ts, n = 13*7)

4.1.2 Triple Exponential Smoothing Model

Since our data has error, trend and seasonality, we can use triple exponential smoothing method as one of our modeling

# modeling
model_tes_ts <- HoltWinters(train_resto_ts)

# forecast
forecast_tes_ts <- forecast(model_tes_ts, h = 13*7)

# model evaluation
MAE(y_pred = forecast_tes_ts$mean, y_true = test_resto_ts)

## [1] 11.61221

Let’s visualize the forecast with our actual value that we get.

test_forecast(actual = resto_ts ,
              forecast.obj = forecast_tes_ts,
              train = train_resto_ts,
              test = test_resto_ts)

As we can see from visualization, most of forecast data point (green color) are not close to our actual data (blue color) which mean there is a lot of error from our model.

4.1.3 ARIMA modeling

model_arima_ts <- stlm(train_resto_ts, method = "arima")

# forecast
forecast_arima_ts <- forecast(model_arima_ts, h=13*7)

# model evaluation
MAE(y_pred = forecast_arima_ts$mean, y_true = test_resto_ts)

## [1] 7.461238

Let’s how this model works in visualization.

test_forecast(actual = resto_ts ,
              forecast.obj = forecast_arima_ts,
              train = train_resto_ts,
              test = test_resto_ts)

As we can see from the visualization, this model is better than the previous one. The forecast data point is closer to actual data and the score of MAE of this model is smaller than the previous one.

4.2 Multiple Seasonality

4.2.1 Cross Validation

train_resto_msts <- head(resto_msts, n= length(resto_msts)-(13*7)) 
test_resto_msts <- tail(resto_msts, n=13*7)

4.2.2 Triple Exponential Smooting Model

model_tes_msts <- HoltWinters(train_resto_msts)

## Warning in HoltWinters(train_resto_msts): optimization difficulties: ERROR:
## ABNORMAL_TERMINATION_IN_LNSRCH

# forecast
forecast_tes_msts <- forecast(model_tes_msts, h = 13*7)

# model evaluation
MAE(y_pred = forecast_tes_msts$mean, y_true = test_resto_msts)

## [1] 6.451498

Let’s see how the visualization of this model is

test_forecast(actual = resto_msts ,
              forecast.obj = forecast_tes_msts,
              train = train_resto_msts,
              test = test_resto_msts)

4.2.3 ARIMA Model

model_arima_msts <- stlm(train_resto_msts, method = "arima")

# forecast
forecast_arima_msts <- forecast(model_arima_msts, h=13*7)

# model evaluation
MAE(y_pred = forecast_arima_msts$mean, y_true = test_resto_msts)

## [1] 5.656791

This model has the smallest MAE comparing to our previous models. Let’s see how the visualization is.

test_forecast(actual = resto_msts ,
              forecast.obj = forecast_arima_msts,
              train = train_resto_msts,
              test = test_resto_msts)

The result is good. Most of our forecast data point is closer to the actual data point. We can assume that this model is better than the others.

5 Prediction Performance

Since the smallest MAE that we get is from ARIMA model of Multiple Seasonality (5.66) in our validation data, we will use this model to forecast in our test data.

test_data <- read.csv("data_input/data/data-test.csv")

model_arima_test <- stlm(resto_msts, method = "arima")

# forecast
forecast_arima_test <- forecast(model_arima_test, h=13*7)

# insert the data into table
test_data$visitor <- forecast_arima_test$mean
write.csv(test_data,file = "forecast_fnb.csv")

As result of submitting to leaderboard, we pass the standard requirement of MAE.

6 Conclusion

To make a good forecasting, we have to test several assumptions:
1. No-Autocorrelation for residuals
2. Normality of residuals

# No-Autocorrelation test
Box.test(x=model_arima_test$residuals)

## 
##  Box-Pierce test
## 
## data:  model_arima_test$residuals
## X-squared = 0.0081299, df = 1, p-value = 0.9282

Due to p-value is bigger than 0.05, we can conclude that the residuals has no-autocorrelation.

# normality test
shapiro.test(model_arima_test$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  model_arima_test$residuals
## W = 0.99114, p-value = 6.817e-06

Since the p-value is less than 0.05, so the residuals are not distributed normally. Note that Shapiro test tests only deviation of residual distribution from the normal and not the forecast performance, which worsens for longer forecast. If we want to forecast longer data, we will need to add more data to feed in our model.

As for the last, we can see from the Seasonality Analysis, we conclude that Saturday at 20.00 (8 P.M) is the highest visitors.

Forecasting - Food & Beverage

RPD

3/23/2020