1 Intro

1.1 Dataset

The Food and Beverage dataset is provided by Dattabot and contains detailed transactions from multiple food and beverage outlets. Using this dataset, we are challenged to do some forecasting and time series analysis to help the outlet’s owner make better business decisions.

Food & Beverage: “It’s Friday night!” Customer behaviour, especially in the food and beverage industry, is highly related to seasonality patterns. The owner wants to analyze the number of visitors so he can make better judgements in 2018.

The train dataset contains detailed transactions from December 1st 2017 to February 18th 2018 (the range reported when the data is loaded). The dataset includes the following columns:

transaction_date: The timestamp of a transaction
receipt_number: The ID of a transaction
item_id: The ID of an item in a transaction
item_group: The group ID of an item in a transaction
item_major_group: The major-group ID of an item in a transaction
quantity: The quantity of purchased item
price_usd: The price of purchased item
total_usd: The total price of purchased item
payment_type: The payment method
sales_type: The sales method

Challenge: Please make a report of your forecasting result and seasonality explanation for the hourly number of visitors, which will be evaluated on the next 7 days (Monday, February 19th 2018 to Sunday, February 25th 2018)!

2 Data Pre-processing

Import Data into R

## Parsed with column specification:
## cols(
##   transaction_date = col_datetime(format = ""),
##   receipt_number = col_character(),
##   item_id = col_character(),
##   item_group = col_character(),
##   item_major_group = col_character(),
##   quantity = col_double(),
##   price_usd = col_double(),
##   total_usd = col_double(),
##   payment_type = col_character(),
##   sales_type = col_character()
## )

Aggregate to get the number of visitors per hour.
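The aggregation step can be sketched in base R as follows. This is a minimal illustration on made-up rows, not the report's original chunk; it assumes that one distinct receipt_number counts as one visitor, and the object names trx and hourly are hypothetical.

```r
# Toy transactions; column names follow the dataset description.
trx <- data.frame(
  transaction_date = as.POSIXct(
    c("2017-12-01 13:05:00", "2017-12-01 13:05:00",
      "2017-12-01 13:40:00", "2017-12-01 15:10:00"),
    tz = "UTC"
  ),
  receipt_number = c("A001", "A001", "A002", "A003"),
  stringsAsFactors = FALSE
)

# Floor each timestamp to the start of its hour.
trx$hour <- as.POSIXct(trunc(trx$transaction_date, units = "hours"))

# Count distinct receipts per hour
# (assumption: one receipt corresponds to one visitor).
hourly <- aggregate(
  receipt_number ~ hour,
  data = trx,
  FUN = function(x) length(unique(x))
)
names(hourly) <- c("transaction_date", "visitor")
print(hourly)
```

Receipt A001 appears on two item rows but is counted once, which is why the count uses unique receipts rather than the number of rows.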

Check the completeness and ordering of the transaction dates, using pad() from the padr package to fill in any missing hours.

## [1] "2017-12-01 13:00:00 UTC"
## [1] "2018-02-18 23:00:00 UTC"
## pad applied on the interval: hour
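To make explicit what padding does, here is a base-R stand-in for the same idea: build the full hourly grid between the first and last timestamp and left-join the observed counts onto it. It is a sketch on toy data and deliberately avoids padr, so it is not the call used in the report; grid and padded are hypothetical names.

```r
# Toy hourly counts with a gap at 15:00 and 16:00.
hourly <- data.frame(
  transaction_date = as.POSIXct(
    c("2017-12-01 13:00:00", "2017-12-01 14:00:00", "2017-12-01 17:00:00"),
    tz = "UTC"
  ),
  visitor = c(5L, 8L, 3L)
)

# Complete hourly grid from the first to the last observed timestamp.
grid <- data.frame(
  transaction_date = seq(min(hourly$transaction_date),
                         max(hourly$transaction_date),
                         by = "hour")
)

# Left join: hours with no transactions get visitor = NA.
padded <- merge(grid, hourly, by = "transaction_date", all.x = TRUE)
print(padded)
```

The rows inserted for the missing hours carry NA, which is why a replacement step is needed before modelling.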

Filter on the transaction date, because the store opens at 10:00 and closes at 22:00.

If there are missing values, replace them with zero.
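A minimal sketch of the opening-hours filter, assuming both the 10:00 and the 22:00 buckets are kept, which gives 13 hourly slots per day. The data frame below is synthetic and the names padded and open_hours are hypothetical.

```r
# Two full days of hourly timestamps (48 rows).
padded <- data.frame(
  transaction_date = seq(as.POSIXct("2017-12-01 00:00:00", tz = "UTC"),
                         by = "hour", length.out = 48)
)

# Extract the hour of day as an integer (0-23).
hr <- as.integer(format(padded$transaction_date, "%H"))

# Keep only the hours when the store is open
# (assumption: 10:00 through 22:00 inclusive).
open_hours <- padded[hr >= 10 & hr <= 22, , drop = FALSE]
nrow(open_hours)
```

With 13 slots per day, a seasonal period of 13 corresponds to one trading day and 13 * 7 = 91 to one week.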

## [1] TRUE

For TBATS, NA values need to be replaced with 1 rather than zero.
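The two replacement variants can be sketched like this. The vector is made up, and the names visitor0 and visitor1 are illustrative (the report itself refers to the 1-filled data as visitor1).

```r
# Toy visitor counts after padding; NA marks hours without transactions.
visitor <- c(12, NA, 7, NA, 20)

# Variant used for most models: missing hours become 0.
visitor0 <- replace(visitor, is.na(visitor), 0)

# Variant for TBATS: missing hours become 1 so that log() stays finite.
visitor1 <- replace(visitor, is.na(visitor), 1)

anyNA(visitor0)   # FALSE: no missing values remain
anyNA(visitor1)   # FALSE
```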

4 Model Fitting and Evaluation

4.3 TBATS Model

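The TBATS model cannot accept a series in which visitor = 0. When log() is applied to zero the result is -Inf, which raises an error (Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, …) : NA/NaN/Inf in ‘y’). So for the TBATS model, the visitor1 data, in which missing values are replaced with 1, will be used.

When the forecast is evaluated against msts_test, which contains zeros, MPE is -Inf and MAPE is Inf (first table). The second table has finite values, which is consistent with evaluating against a test series in which zeros are replaced by 1.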
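The failure mode described above can be reproduced without TBATS. The sketch below uses toy data and a plain lm() call as a stand-in for the regression TBATS runs internally; it only illustrates why a zero in a log-transformed response breaks the fit.

```r
# Toy series with zeros.
visitor <- c(4, 0, 9, 15, 0, 7)

y <- log(visitor)      # log(0) is -Inf
x <- seq_along(y)
print(y)

# A linear fit on a response containing -Inf fails inside lm.fit().
msg <- tryCatch(
  {
    lm(y ~ x)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# Replacing zeros with 1 keeps the log finite (log(1) is 0).
visitor1 <- replace(visitor, visitor == 0, 1)
all(is.finite(log(visitor1)))
```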

##                 ME     RMSE      MAE  MPE MAPE      ACF1 Theil's U
## Test set -6.082599 9.330179 7.678216 -Inf  Inf 0.3548303         0
##                 ME     RMSE      MAE       MPE     MAPE      ACF1 Theil's U
## Test set -6.049632 9.314613 7.645249 -60.65058 64.90921 0.3556996  1.376096
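The infinite values in the tables above follow directly from the usual definitions of MPE and MAPE, which divide each error by the actual value. The sketch below computes both by hand on made-up numbers; it is not the code that produced the tables.

```r
actual <- c(10, 0, 20, 5)
fc     <- c(12, 3, 18, 5)

# Percentage errors: a zero actual gives a division by zero.
pe <- 100 * (actual - fc) / actual
mpe_zero  <- mean(pe)        # -Inf
mape_zero <- mean(abs(pe))   # Inf

# Same forecast, but zeros in the actuals replaced by 1.
actual1 <- replace(actual, actual == 0, 1)
pe1 <- 100 * (actual1 - fc) / actual1
mpe_one  <- mean(pe1)        # finite
mape_one <- mean(abs(pe1))   # finite

# If forecasts fall on both sides of zero actuals, +Inf and -Inf
# cancel into NaN for MPE while MAPE stays Inf.
pe_mixed <- 100 * (c(0, 0) - c(3, -2)) / c(0, 0)
mpe_mixed  <- mean(pe_mixed)       # NaN
mape_mixed <- mean(abs(pe_mixed))  # Inf

print(c(mpe_zero, mape_zero, mpe_one, mape_one, mpe_mixed, mape_mixed))
```

ME, RMSE and MAE do not divide by the actuals, so they remain usable for comparing models when the test set contains zeros.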

As before, evaluating against msts_test, which contains zeros, gives -Inf for MPE and Inf for MAPE.

##                 ME     RMSE      MAE  MPE MAPE      ACF1 Theil's U
## Test set -3.298899 8.727963 6.903395 -Inf  Inf 0.5524017         0
##                 ME     RMSE      MAE       MPE     MAPE      ACF1 Theil's U
## Test set -3.265932 8.717369 6.870428 -33.82967 46.53715 0.5567218 0.9446494

4.4 Holt-Winters Model

Fitting Model

Forecasting

Visualize actual data with forecasting

Model Evaluation

##                 ME    RMSE      MAE  MPE MAPE      ACF1 Theil's U
## Test set -2.931366 8.16131 6.484465 -Inf  Inf 0.4675556         0
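The fit, forecast and evaluate steps listed above can be sketched with stats::HoltWinters() on synthetic data. Everything here is an assumption for illustration: the daily pattern, the noise level, the 21-day/7-day split, the single seasonal period of 13 hours, and the choice to switch off the trend term (beta = FALSE). The report's own series and settings may differ.

```r
set.seed(1)

period  <- 13   # assumed: 13 opening hours per day
pattern <- c(5, 8, 14, 20, 16, 10, 9, 12, 22, 30, 26, 15, 6)

# 28 synthetic days: a fixed daily shape plus noise.
visitor <- rep(pattern, 28) + rnorm(period * 28, sd = 2)

# First 21 days for training, last 7 days (91 hours) for testing.
train <- ts(head(visitor, period * 21), frequency = period)
test  <- tail(visitor, period * 7)

# Additive Holt-Winters without a trend component.
fit <- HoltWinters(train, beta = FALSE)

# Forecast the test horizon and drop the ts attributes.
fc <- as.numeric(predict(fit, n.ahead = length(test)))

# Scale-dependent error measures; these stay finite even with zero actuals.
mae  <- mean(abs(test - fc))
rmse <- sqrt(mean((test - fc)^2))
print(c(MAE = mae, RMSE = rmse))
```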

4.5 ARIMA Model

Create ARIMA object

Forecast

Visualization

Model Evaluation

##                 ME     RMSE      MAE MPE MAPE      ACF1 Theil's U
## Test set -2.127162 7.332962 5.656791 NaN  Inf 0.3444129         0
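For orientation, a seasonal ARIMA workflow can be sketched with stats::arima() on synthetic data. The orders (1,0,0)(1,1,0) with period 13 are arbitrary placeholders, not the orders used in the report, and the data, split and object names are made up.

```r
set.seed(2)

period  <- 13   # assumed: 13 opening hours per day
pattern <- c(5, 8, 14, 20, 16, 10, 9, 12, 22, 30, 26, 15, 6)
visitor <- rep(pattern, 28) + rnorm(period * 28, sd = 2)

train <- ts(head(visitor, period * 21), frequency = period)
test  <- tail(visitor, period * 7)

# Seasonal ARIMA with one seasonal difference (orders are illustrative only).
fit <- arima(
  train,
  order    = c(1, 0, 0),
  seasonal = list(order = c(1, 1, 0), period = period)
)

# Point forecasts for the 91-hour test horizon.
fc <- as.numeric(predict(fit, n.ahead = length(test))$pred)

mae <- mean(abs(test - fc))
print(mae)

# Residuals of the fitted model, as used in diagnostic tests.
res <- residuals(fit)
```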

6 Prediction Data-Test.csv

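Because the ARIMA model shows the smallest MAE on the test set (5.66, compared with 6.48 for Holt-Winters and about 6.9 at best for TBATS), it is chosen to generate the predictions for Data-Test.csv.

When the predictions are submitted to the leaderboard, the MAE remains below the required threshold (6.0).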

7 Conclusion

Several assumptions should be checked to judge whether the forecast is sound:

  1. No autocorrelation in the residuals

## 
##  Box-Pierce test
## 
## data:  model_test_arima$residuals
## X-squared = 0.0081299, df = 1, p-value = 0.9282

Conclusion: p-value > 0.05, so we fail to reject the null hypothesis of no autocorrelation; the residuals show no significant autocorrelation at lag 1.
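The test above matches the defaults of stats::Box.test(), which runs a Box-Pierce test at lag 1. The sketch below applies it to two synthetic series to show how the p-value is read; neither series is the model's actual residuals.

```r
set.seed(3)

# Independent noise: the null hypothesis of no autocorrelation holds.
white <- rnorm(200)
bt_white <- Box.test(white)   # defaults: type = "Box-Pierce", lag = 1
print(bt_white)

# Strongly autocorrelated AR(1) series for contrast.
ar1 <- as.numeric(arima.sim(list(ar = 0.8), n = 200))
bt_ar1 <- Box.test(ar1)
print(bt_ar1$p.value)   # very small: reject "no autocorrelation"
```

With lag = 1 only the first autocorrelation is tested. For seasonal data it may be worth repeating the test with a larger lag, for example one covering a full seasonal period.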

  2. Normality of residuals

## 
##  Shapiro-Wilk normality test
## 
## data:  model_test_arima$residuals
## W = 0.99114, p-value = 6.817e-06

Conclusion: p-value < 0.05, so we reject the null hypothesis of normality; the residuals are not normally distributed.
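How the Shapiro-Wilk result is read can be illustrated with stats::shapiro.test() on synthetic data. The two samples below are made up and only show the direction of the decision rule.

```r
set.seed(4)

# Clearly non-normal (right-skewed) sample.
skewed <- rexp(500)
st_skewed <- shapiro.test(skewed)
print(st_skewed)

# Decision at the 5% level: TRUE means normality is rejected.
reject_skewed <- st_skewed$p.value < 0.05

# A normal sample for comparison; W should be close to 1.
normal <- rnorm(500)
st_normal <- shapiro.test(normal)
print(st_normal$statistic)
```

A W statistic close to 1 combined with a very small p-value, as in the output above, is typical of a large sample with a modest departure from normality. Non-normal residuals mainly affect the reliability of prediction intervals rather than the point forecasts.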