The Food and Beverage dataset, provided by Dattabot, contains detailed transactions from multiple food and beverage outlets. Using this dataset, we are challenged to perform forecasting and time series analysis to help the outlets' owner make better business decisions.
Food & Beverage: "It's Friday night!" Customer behaviour, especially in the food and beverage industry, is highly related to seasonality patterns. The owner wants to analyze the number of visitors so he can make better business judgements in 2018.
The train dataset contains detailed transaction records from December 1st, 2017 to February 18th, 2018. The dataset includes the following fields:
- transaction_date: the timestamp of a transaction
- receipt_number: the ID of a transaction
- item_id: the ID of an item in a transaction
- item_group: the group ID of an item in a transaction
- item_major_group: the major-group ID of an item in a transaction
- quantity: the quantity of the purchased item
- price_usd: the price of the purchased item
- total_usd: the total price of the purchased item
- payment_type: the payment method
- sales_type: the sales method
Challenges: Please make a report of your forecasting results and a seasonality explanation for the hourly number of visitors, to be evaluated on the next 7 days (Monday, February 19th 2018 to Sunday, February 25th 2018)!
Import Data into R
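The import code itself is not shown in this excerpt. A minimal sketch, assuming the transactions are read with readr::read_csv() (the column specification below is read_csv output) and that the file name data-train.csv is hypothetical:
library(readr)

# Assumption: the file name is hypothetical; adjust to the actual dataset path
fnb <- read_csv("data-train.csv")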
## Parsed with column specification:
## cols(
## transaction_date = col_datetime(format = ""),
## receipt_number = col_character(),
## item_id = col_character(),
## item_group = col_character(),
## item_major_group = col_character(),
## quantity = col_double(),
## price_usd = col_double(),
## total_usd = col_double(),
## payment_type = col_character(),
## sales_type = col_character()
## )
Aggregate to get the number of visitors per hour
fnb_clean <- fnb %>%
  mutate(transaction_date = floor_date(transaction_date, unit = "hour")) %>%
  group_by(transaction_date) %>% # per hour
  summarise(visitor = n_distinct(receipt_number)) # one receipt = one visitor
Check the completeness and order of the transaction dates. Use pad() from the padr package.
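min_date and max_date (used in the pad() call below) are not defined in this excerpt; presumably they are the range of the aggregated timestamps, whose printed values follow:
# Assumption: min_date/max_date are the range of the hourly timestamps
min_date <- min(fnb_clean$transaction_date)
max_date <- max(fnb_clean$transaction_date)
min_date
max_date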
## [1] "2017-12-01 13:00:00 UTC"
## [1] "2018-02-18 23:00:00 UTC"
fnb_clean <- fnb_clean %>%
  pad(
    start_val = make_datetime(year = year(min_date), month = month(min_date), day = day(min_date), hour = 0),
    end_val   = make_datetime(year = year(max_date), month = month(max_date), day = day(max_date), hour = 23)
  )
## pad applied on the interval: hour
Filter on transaction date, because the store opens at 10 AM and closes at 10 PM.
fnb_clean <- fnb_clean %>%
  mutate(hour = hour(transaction_date)) %>%
  filter(hour %in% c(10:22)) %>% # keep opening hours only
  mutate(visitor1 = visitor)
If there is missing data (NA introduced by padding), replace it with zero.
## [1] TRUE
fnb_clean <- fnb_clean %>%
  mutate(visitor = replace_na(visitor, 0)) %>%
  mutate(visitor1 = replace_na(visitor1, 1))
For TBATS, NA must be replaced with 1 (not zero), because the series will be log-transformed and log(0) is -Inf.
ggplot(data = fnb_clean, mapping = aes(x = transaction_date, y = visitor)) +
  geom_point() +
  geom_line() +
  theme_minimal()
HOURLY
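The decomposition objects dc_fnbHH and dc_fnbHT used below are not defined in this excerpt. A plausible reconstruction, assuming a classical decomposition of the first and last two days (2 x 13 opening hours), treating the 13 opening hours as one seasonal period:
# Assumption: head/tail decompositions over two days, period = 13 opening hours
dc_fnbHH <- fnb_clean$visitor %>%
  head(13 * 2) %>%
  ts(frequency = 13) %>%
  decompose()
dc_fnbHT <- fnb_clean$visitor %>%
  tail(13 * 2) %>%
  ts(frequency = 13) %>%
  decompose()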
fnb_clean %>%
  head(13 * 2) %>%
  mutate(seasonalHH = dc_fnbHH$seasonal) %>%
  ggplot(aes(hour, seasonalHH)) +
  geom_col()
fnb_clean %>%
  tail(13 * 2) %>%
  mutate(seasonalHT = dc_fnbHT$seasonal) %>%
  ggplot(aes(hour, seasonalHT)) +
  geom_col()
From this hourly chart, we find that visitor traffic peaks at 8 PM (20:00) and is lowest at 10 AM.
WEEKLY
msts_fnb <- fnb_clean$visitor %>% msts(seasonal.periods = c(13, 13 * 7))
msts_fnb %>%
  tail(13 * 7 * 4) %>%
  stl(s.window = "periodic") %>%
  autoplot()
msts_fnb_dc <- mstl(msts_fnb)
as.data.frame(msts_fnb_dc) %>%
  mutate(tdate = fnb_clean$transaction_date) %>%
  mutate(
    days1 = wday(tdate, label = TRUE, abbr = FALSE),
    hour = as.factor(hour(tdate))
  ) %>%
  group_by(days1, hour) %>%
  summarise(seasonal = sum(Seasonal13 + Seasonal91)) %>%
  ggplot(mapping = aes(x = hour, y = seasonal)) +
  geom_col(aes(fill = days1))
From this weekly chart, we find that the peak occurs on Saturday at 8 PM (20:00), and the low point on Friday at 10 AM.
Using fnb_clean$visitor:
msts_vs <- fnb_clean$visitor %>% msts(seasonal.periods = c(13, 13 * 7))
msts_vs %>%
  head(13 * 7 * 4) %>%
  mstl() %>%
  autoplot()
Using fnb_clean$visitor1:
msts_vs1 <- fnb_clean$visitor1 %>% msts(seasonal.periods = c(13, 13 * 7))
msts_vs1 %>%
  head(13 * 7 * 4) %>%
  mstl() %>%
  autoplot()
The TBATS model cannot accept visitor = 0: when we apply log(), zero becomes -Inf, and fitting fails with the error (Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : NA/NaN/Inf in 'y'). So for the TBATS model, visitor1 will be used.
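The train/test objects (msts_train1, msts_test, msts_test1) are not constructed in this excerpt. A plausible split, assuming the last 7 days (13 x 7 hourly observations) are held out for evaluation:
# Assumption: hold out the final week as the test set
h <- 13 * 7
msts_train  <- msts_vs  %>% head(length(msts_vs)  - h) %>% msts(seasonal.periods = c(13, 13 * 7))
msts_train1 <- msts_vs1 %>% head(length(msts_vs1) - h) %>% msts(seasonal.periods = c(13, 13 * 7))
msts_test   <- msts_vs  %>% tail(h) %>% as.vector()
msts_test1  <- msts_vs1 %>% tail(h) %>% as.vector()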
tbats_mod <- msts_train1 %>%
  log() %>%
  tbats(
    use.box.cox = NULL,
    use.trend = NULL,
    use.damped.trend = NULL
  )
Using msts_test, which contains zeros, MPE shows -Inf and MAPE shows Inf.
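Neither stlm_model nor tbats_model (the forecast objects evaluated below) is shown in this excerpt. A plausible reconstruction, assuming 7-day-ahead forecasts (h = 13 * 7) and STL + ETS for the STLM model:
# Assumption: stlm() with default ETS errors on the raw training series,
# and a forecast of the TBATS model fitted above
stlm_model  <- msts_train %>% stlm() %>% forecast(h = 13 * 7)
tbats_model <- tbats_mod %>% forecast(h = 13 * 7)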
stlmForecast <- as.vector(stlm_model$mean)
accuracy_stlm <- accuracy(stlmForecast, msts_test)
accuracy_stlm
##                 ME     RMSE      MAE  MPE MAPE      ACF1 Theil's U
## Test set -6.082599 9.330179 7.678216 -Inf  Inf 0.3548303         0
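The second table appears to compare the same forecast against msts_test1 (zeros replaced by 1); presumably something like:
# Assumption: same forecast, evaluated against the zero-free test series
accuracy_stlm1 <- accuracy(stlmForecast, msts_test1)
accuracy_stlm1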
##                 ME     RMSE      MAE       MPE     MAPE      ACF1 Theil's U
## Test set -6.049632 9.314613 7.645249 -60.65058 64.90921 0.3556996  1.376096
Again, using msts_test, which contains zeros, MPE shows -Inf and MAPE shows Inf.
tbatsForecast <- as.vector(exp(tbats_model$mean))
accuracy_tbats <- accuracy(tbatsForecast, msts_test)
accuracy_tbats
##                 ME     RMSE      MAE  MPE MAPE      ACF1 Theil's U
## Test set -3.298899 8.727963 6.903395 -Inf  Inf 0.5524017         0
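As above, the second table presumably evaluates against msts_test1:
# Assumption: same forecast, evaluated against the zero-free test series
accuracy_tbats1 <- accuracy(tbatsForecast, msts_test1)
accuracy_tbats1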
##                 ME     RMSE      MAE       MPE     MAPE      ACF1 Theil's U
## Test set -3.265932 8.717369 6.870428 -33.82967 46.53715 0.5567218 0.9446494
Fitting Model
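The fitting code is not shown here. A minimal sketch, assuming a Holt-Winters model fit with stats::HoltWinters() on the raw training series:
# Assumption: additive Holt-Winters on the training series
fnb_hw <- HoltWinters(msts_train)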
Forecasting
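A sketch of the forecast step, matching the object name used in the evaluation below:
# Assumption: forecast horizon of one week (13 * 7 hours)
fnbhw_forecast <- forecast(fnb_hw, h = 13 * 7)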
Visualize the actual data with the forecast
Model Evaluation
HWForecast <- as.vector(fnbhw_forecast$mean)
accuracyHW <- accuracy(HWForecast, msts_test)
accuracyHW
##                 ME    RMSE      MAE  MPE MAPE      ACF1 Theil's U
## Test set -2.931366 8.16131 6.484465 -Inf  Inf 0.4675556         0
Create ARIMA object
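The construction is not shown here. One plausible approach for a multi-seasonal series (an assumption, not confirmed by the source) is forecast::stlm() with ARIMA errors:
# Assumption: STL decomposition with an auto-selected ARIMA on the seasonally adjusted series
fnb_arima <- stlm(msts_train, method = "arima")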
Forecast
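A sketch of the forecast step, matching the object name used below:
# Assumption: forecast horizon of one week (13 * 7 hours)
fnbar_forecast <- forecast(fnb_arima, h = 13 * 7)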
Visualization
Model Evaluation
arimaForecast <- as.vector(fnbar_forecast$mean)
accuracyArima <- accuracy(arimaForecast, msts_test)
accuracyArima
##                 ME     RMSE      MAE MPE MAPE      ACF1 Theil's U
## Test set -2.127162 7.332962 5.656791 NaN  Inf 0.3444129         0
result <- rbind(accuracy_stlm1, accuracy_tbats1, accuracyHW, accuracyArima)
rownames(result) <- c("stlm model", "tbats model", "holt winters model", "arima model")
result
##                            ME     RMSE      MAE       MPE     MAPE      ACF1 Theil's U
## stlm model         -6.049632 9.314613 7.645249 -60.65058 64.90921 0.3556996 1.3760963
## tbats model        -3.265932 8.717369 6.870428 -33.82967 46.53715 0.5567218 0.9446494
## holt winters model -2.931366 8.161310 6.484465      -Inf      Inf 0.4675556 0.0000000
## arima model        -2.127162 7.332962 5.656791       NaN      Inf 0.3444129 0.0000000
accuracyData <- data.frame(
  visitor = fnb_clean$transaction_date %>% tail(13 * 7),
  actual = as.vector(msts_test1),
  stlmForecast, tbatsForecast,
  fnbhw_forecast = as.vector(fnbhw_forecast$mean),
  fnbar_forecast = as.vector(fnbar_forecast$mean)
)
accuracyData %>%
  ggplot() +
  geom_line(aes(x = fnb_clean$transaction_date %>% tail(13 * 7),
                y = fnb_clean$visitor1 %>% tail(13 * 7),
                colour = "actual")) +
  geom_line(aes(x = fnb_clean$transaction_date %>% tail(13 * 7),
                y = stlm_model$mean,
                colour = "stlm")) +
  geom_line(aes(x = fnb_clean$transaction_date %>% tail(13 * 7),
                y = exp(tbats_model$mean),
                colour = "tbats")) +
  geom_line(aes(x = fnb_clean$transaction_date %>% tail(13 * 7),
                y = fnbhw_forecast$mean,
                colour = "holt winters")) +
  geom_line(aes(x = fnb_clean$transaction_date %>% tail(13 * 7),
                y = fnbar_forecast$mean,
                colour = "arima")) +
  scale_y_continuous() +
  labs(
    title = "Hourly visitor in F&B",
    x = "Date",
    y = "",
    colour = ""
  )
Because the ARIMA model shows the smallest MAE, it is chosen to predict on Data-Test.csv.
When submitted to the leaderboard, the MAE is still under the required metric (6.0).
To obtain a good forecast, several residual assumptions must be checked: 1. No autocorrelation in the residuals
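The test call is not shown; presumably stats::Box.test() (Box-Pierce, default lag = 1, matching df = 1 in the output) on the residuals of the final model:
# Assumption: model_test_arima is the final ARIMA model refit on the full data
Box.test(model_test_arima$residuals, type = "Box-Pierce")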
##
## Box-Pierce test
##
## data: model_test_arima$residuals
## X-squared = 0.0081299, df = 1, p-value = 0.9282
Conclusion: p-value > 0.05, so we can assume there is no autocorrelation in the residuals. 2. Normality of residuals
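Presumably stats::shapiro.test() on the same residuals:
shapiro.test(model_test_arima$residuals)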
##
## Shapiro-Wilk normality test
##
## data: model_test_arima$residuals
## W = 0.99114, p-value = 6.817e-06
Conclusion: p-value < 0.05, so the residuals are not normally distributed.