1 Introduction

1.1 Background

This is my final project as an Algoritma trainee. After 2.5 months of training and a wealth of supporting material, we were given a capstone project to apply our skills. I decided to build a forecasting model.

1.2 Dataset

The Food and Beverage dataset is provided by Dattabot and contains detailed transactions from multiple food and beverage outlets. Using this dataset, I will perform time series analysis and forecasting to help the outlet's owner make better business decisions.
Customer behaviour, especially in the food and beverage industry, is strongly tied to seasonality patterns. The owner wants to analyze the hourly number of visitors so he can make better judgements in 2018.

  • Challenge: Produce a forecast and a seasonality explanation for the hourly number of visitors, evaluated on the next 7 days (Monday, February 19th 2018 to Sunday, February 25th 2018).

2 Data Preprocessing

The train dataset contains detailed transaction records from December 1st 2017 to February 18th 2018.

The dataset includes information about:

  • transaction_date: The timestamp of a transaction
  • receipt_number: The ID of a transaction
  • item_id: The ID of an item in a transaction
  • item_group: The group ID of an item in a transaction
  • item_major_group: The major-group ID of an item in a transaction
  • quantity: The quantity of the purchased item
  • price_usd: The price of the purchased item
  • total_usd: The total price of the purchased items
  • payment_type: The payment method
  • sales_type: The sales method

After importing the data, we need to parse transaction_date into a date-time format.

To aggregate transactions hourly, we need to round transaction_date down to the hour.

Since a customer can order more than one item at a time, let's summarise the data by receipt_number to count how many distinct orders were created each hour.
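
A minimal sketch of these preprocessing steps, assuming dplyr and lubridate and a hypothetical file path; the object names (resto, resto_hourly, datetime, visitor) are illustrative, not necessarily those used in the original report:

```r
library(dplyr)
library(lubridate)

# Import the raw transactions (the file path is an assumption)
resto <- read.csv("data-train.csv")

resto_hourly <- resto %>%
  # Parse transaction_date into a date-time, then round it down to the hour
  mutate(transaction_date = ymd_hms(transaction_date),
         datetime = floor_date(transaction_date, unit = "hour")) %>%
  # One receipt_number represents one order/visitor: count distinct receipts per hour
  group_by(datetime) %>%
  summarise(visitor = n_distinct(receipt_number)) %>%
  ungroup()
```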

To make sure there are no missing timestamps in our data, let's apply time series padding.

## pad applied on the interval: hour

Since the restaurant is open from 10 AM to 10 PM, we need to filter the datetime to that range and replace NA values with 0.
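
A sketch of the padding and business-hour filtering, assuming the padr and tidyr packages and the hourly data frame from the previous step:

```r
library(padr)
library(tidyr)
library(lubridate)
library(dplyr)

resto_hourly <- resto_hourly %>%
  # Insert rows for any hours missing between the first and last timestamp
  pad(interval = "hour") %>%
  # Keep only business hours (10.00-22.00)
  filter(hour(datetime) >= 10, hour(datetime) <= 22) %>%
  # Hours without any transaction become 0 visitors instead of NA
  mutate(visitor = replace_na(visitor, 0))
```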

3 Seasonality

The first step is to convert the data into a time series object.

Next, visualize the data to inspect its distribution.
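
A sketch of the time series object and a quick plot, assuming 13 business hours per day as the seasonal frequency:

```r
library(forecast)

# Daily seasonality: 13 hourly observations per business day (10.00-22.00)
resto_ts <- ts(resto_hourly$visitor, frequency = 13)

# Quick look at the series
autoplot(resto_ts)
```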

An early assumption we can draw from the plot: the data contains trend, seasonality, and error components.

3.1 Decomposing

If we want to see the trend of the data, we can decompose it over one-month periods, since every month has its own seasonality.
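
A sketch of the decomposition of the single-seasonality series into trend, seasonal, and remainder components:

```r
library(forecast)

# Classical decomposition of the daily-seasonality series
resto_dc <- decompose(resto_ts)
autoplot(resto_dc)
```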

We find that the trend component is not smooth, which suggests there is seasonality that has not been captured in resto_ts. In other words, our data most likely has multiple seasonalities.

3.2 Multiple Time Series

Let's re-create our time series object, this time with multiple seasonal periods, as sketched below.
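
A sketch of the multiple-seasonality object, assuming daily (13 hours) and weekly (13 x 7 hours) seasonal periods via forecast::msts, with an STL-based decomposition to inspect the trend:

```r
library(forecast)

# Daily (13) and weekly (13 * 7 = 91) seasonal periods
resto_msts <- msts(resto_hourly$visitor, seasonal.periods = c(13, 13 * 7))

# mstl() decomposes a series with multiple seasonal periods
autoplot(mstl(resto_msts))
```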

As we can see, the trend component is now smooth, confirming that our data set has multiple seasonalities.

4 Model Fitting and Evaluation

4.1 Single Seasonality

4.1.1 Cross Validation

Let's split the data into two sets: train and validation. For the validation data, we will use the last 2 weeks.

##  The resto_ts series is a ts object with 1 variable and 1037 observations
##  Frequency: 13 
##  Start time: 1 4 
##  End time: 80 13

The end time of our data is "13", which means the last day contains a full set of business hours (10.00-22.00).
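
A sketch of the split, holding out the last 2 weeks (13 hours x 14 days = 182 observations) as validation data; the object names are illustrative:

```r
# Hold out the last 2 weeks (13 * 14 = 182 hourly observations)
val_size <- 13 * 14
n_total  <- length(resto_ts)

train_resto_ts      <- window(resto_ts, end   = time(resto_ts)[n_total - val_size])
validation_resto_ts <- window(resto_ts, start = time(resto_ts)[n_total - val_size + 1])
```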

4.1.2 Triple Exponential Smoothing Model

Since our data has error, trend, and seasonality components, we can use the triple exponential smoothing (Holt-Winters) method as one of our models.

## [1] 11.61221
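
The value printed above is the MAE on the validation data. A sketch of how this model and its error might be computed, assuming the forecast package and the split from the previous step:

```r
library(forecast)

# Triple exponential smoothing: level, trend, and seasonal components
model_hw <- HoltWinters(train_resto_ts)

# Forecast the held-out horizon and compute the mean absolute error
forecast_hw <- forecast(model_hw, h = length(validation_resto_ts))
mean(abs(forecast_hw$mean - validation_resto_ts))
```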

The MAE of this model on the validation data is about 11.61. Let's visualize the forecast against the actual values.

As we can see from the visualization, most of the forecast data points (green) are not close to the actual data (blue), which means the model has a large error.

4.1.3 ARIMA modeling

## [1] 7.461238
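
The value printed above is this model's validation MAE. One way to fit an ARIMA-based model on the seasonal series is via stlm(), which seasonally adjusts the data with STL and fits an ARIMA to the remainder; the exact specification used in the report may differ:

```r
library(forecast)

# STL decomposition + ARIMA on the seasonally adjusted series
model_arima <- stlm(train_resto_ts, method = "arima")

forecast_arima <- forecast(model_arima, h = length(validation_resto_ts))
mean(abs(forecast_arima$mean - validation_resto_ts))
```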

Let's see how this model performs in the visualization.

As we can see from the visualization, this model is better than the previous one: the forecast data points are closer to the actual data, and its MAE (7.46) is smaller than that of the Holt-Winters model.

4.2 Multiple Seasonality

4.2.2 Triple Exponential Smoothing Model

## Warning in HoltWinters(train_resto_msts): optimization difficulties: ERROR:
## ABNORMAL_TERMINATION_IN_LNSRCH
## [1] 6.451498
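
The warning above comes from the optimizer inside HoltWinters(), and 6.45 is the resulting validation MAE. A sketch of the multiple-seasonality split and fit, assuming val_size and n_total from the earlier split; train_resto_msts matches the object name in the warning, the other names are illustrative:

```r
library(forecast)

# Rebuild the msts object on the training portion only (same 2-week holdout)
train_resto_msts   <- msts(head(resto_hourly$visitor, n_total - val_size),
                           seasonal.periods = c(13, 13 * 7))
validation_visitor <- tail(resto_hourly$visitor, val_size)

# Holt-Winters fitted on the multiple-seasonality series
model_hw_msts <- HoltWinters(train_resto_msts)

forecast_hw_msts <- forecast(model_hw_msts, h = val_size)
mean(abs(forecast_hw_msts$mean - validation_visitor))
```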

Despite the optimization warning, the MAE improves to about 6.45. Let's see how this model looks in the visualization.

4.2.3 ARIMA Model

## [1] 5.656791
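
A sketch of an ARIMA fit on the multiple-seasonality series, again through stlm(); the exact specification used in the report may differ:

```r
library(forecast)

# STL handles both seasonal periods; ARIMA models the seasonally adjusted remainder
model_arima_msts <- stlm(train_resto_msts, method = "arima")

forecast_arima_msts <- forecast(model_arima_msts, h = val_size)
mean(abs(forecast_arima_msts$mean - validation_visitor))
```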

This model has the smallest MAE (5.66) compared to our previous models. Let's see how it looks in the visualization.

The result is good: most of the forecast data points are close to the actual data points, so we can conclude that this model is better than the others.

5 Prediction Performance

Since the smallest MAE on the validation data comes from the multiple-seasonality ARIMA model (5.66), we will use this model to forecast the test data.
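
A sketch of refitting the chosen model on the full history and forecasting the next 7 business days (13 x 7 = 91 hourly points); model_arima_test mirrors the object name that appears in the residual checks below, but the exact code is an assumption:

```r
library(forecast)

# Refit the multiple-seasonality ARIMA on the full data
resto_msts_full  <- msts(resto_hourly$visitor, seasonal.periods = c(13, 13 * 7))
model_arima_test <- stlm(resto_msts_full, method = "arima")

# Forecast February 19th-25th 2018: 13 business hours x 7 days
forecast_test <- forecast(model_arima_test, h = 13 * 7)
```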

After submitting to the leaderboard, the forecast passes the required MAE threshold.

6 Conclusion

To make a good forecast, we have to test several assumptions about the residuals (a sketch of both checks follows the list):
1. No autocorrelation of the residuals
2. Normality of the residuals
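
A sketch of the two checks, applied to the residuals of the final model; the outputs below are from the author's run:

```r
# Assumption 1: no autocorrelation in the residuals
Box.test(model_arima_test$residuals, type = "Box-Pierce")

# Assumption 2: normality of the residuals
shapiro.test(model_arima_test$residuals)
```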

## 
##  Box-Pierce test
## 
## data:  model_arima_test$residuals
## X-squared = 0.0081299, df = 1, p-value = 0.9282

Since the p-value is greater than 0.05, we fail to reject the null hypothesis and conclude that the residuals show no autocorrelation.

## 
##  Shapiro-Wilk normality test
## 
## data:  model_arima_test$residuals
## W = 0.99114, p-value = 6.817e-06

Since the p-value is less than 0.05, the residuals are not normally distributed. Note that the Shapiro-Wilk test only measures the deviation of the residual distribution from normality, not the forecast performance, which worsens for longer horizons. If we want to forecast a longer horizon, we will need to feed more data into our model.

Finally, from the seasonality analysis we conclude that Saturday at 20.00 (8 PM) has the highest number of visitors.