Introduction

The Food and Beverage dataset, provided by Dattabot, contains detailed transactions from multiple food and beverage outlets. Using this dataset, we are challenged to do forecasting and time series analysis to help the outlets’ owner make better business decisions.

Customer behaviour, especially in the food and beverage industry, is strongly tied to seasonality patterns. The owner wants to analyze the number of visitors (including dine-in, delivery, and takeaway transactions) so he can make better judgments in 2018. Fortunately, time series analysis is enough to provide a good forecast and an explanation of the seasonality.

This project aims to forecast the hourly number of visitors and explain its seasonality. The forecast will be evaluated on the next 7 days (Monday, February 19th 2018 to Sunday, February 25th 2018).

Data Preprocessing

# data wrangling
library(tidyverse) # dplyr, ggplot2, and friends
library(lubridate) # date manipulation
library(zoo) # ordered/irregular time series objects

# time series 
library(forecast) # time series modelling and forecasting
library(TTR) # simple moving average function
library(MLmetrics) # error metrics
library(tseries) # adf.test
library(TSstudio) # nicer time series visualisations
library(padr) # pad time series to a regular interval
library(imputeTS) # impute missing values in time series

train <- read.csv('data_input/data-train.csv')
test <- read.csv('data_input/data-test.csv')

The train dataset contains detailed transaction records from December 1st 2017 to February 18th 2018. The shop opens from 10 am to 10 pm every day:

str(train)
## 'data.frame':    137748 obs. of  10 variables:
##  $ transaction_date: chr  "2017-12-01T13:32:46Z" "2017-12-01T13:32:46Z" "2017-12-01T13:33:39Z" "2017-12-01T13:33:39Z" ...
##  $ receipt_number  : chr  "A0026694" "A0026694" "A0026695" "A0026695" ...
##  $ item_id         : chr  "I10100139" "I10500037" "I10500044" "I10400009" ...
##  $ item_group      : chr  "noodle_dish" "drinks" "drinks" "side_dish" ...
##  $ item_major_group: chr  "food" "beverages" "beverages" "food" ...
##  $ quantity        : int  1 1 1 1 1 1 1 1 1 2 ...
##  $ price_usd       : num  7.33 4.12 2.02 5.6 3.01 4.86 6.34 7.58 4.12 1.77 ...
##  $ total_usd       : num  7.33 4.12 2.02 5.6 3.01 4.86 6.34 7.58 4.12 3.54 ...
##  $ payment_type    : chr  "cash" "cash" "cash" "cash" ...
##  $ sales_type      : chr  "dine_in" "dine_in" "dine_in" "dine_in" ...

Dataset Description

  • transaction_date: The timestamp of a transaction
  • receipt_number: The ID of a transaction
  • item_id: The ID of an item in a transaction
  • item_group: The group ID of an item in a transaction
  • item_major_group: The major-group ID of an item in a transaction
  • quantity: The quantity of the purchased item
  • price_usd: The unit price of the purchased item
  • total_usd: The total price of the purchased item (quantity × price)
  • payment_type: The payment method
  • sales_type: The sales method

Using the lubridate library, we convert the transaction_date column to a date-time data type, then create a new variable called datetime by rounding the date-time down to the nearest unit of time with floor_date(). In this case, the unit is hourly.

train <- train %>%
  mutate(transaction_date = ymd_hms(transaction_date)) %>%
  mutate(datetime = floor_date(transaction_date, unit = "hour"))

Then, we check for missing values. It seems that the data doesn’t contain any.

colSums(is.na(train))
## transaction_date   receipt_number          item_id       item_group 
##                0                0                0                0 
## item_major_group         quantity        price_usd        total_usd 
##                0                0                0                0 
##     payment_type       sales_type         datetime 
##                0                0                0

Before further preprocessing, we take a look at the summary statistics.

summary(train)
##  transaction_date                 receipt_number       item_id         
##  Min.   :2017-12-01 13:32:46.00   Length:137748      Length:137748     
##  1st Qu.:2017-12-20 19:42:26.00   Class :character   Class :character  
##  Median :2018-01-08 18:21:20.00   Mode  :character   Mode  :character  
##  Mean   :2018-01-09 18:37:52.10                                        
##  3rd Qu.:2018-01-29 22:47:17.00                                        
##  Max.   :2018-02-18 23:58:32.00                                        
##   item_group        item_major_group      quantity        price_usd    
##  Length:137748      Length:137748      Min.   : 1.000   Min.   : 1.03  
##  Class :character   Class :character   1st Qu.: 1.000   1st Qu.: 3.13  
##  Mode  :character   Mode  :character   Median : 1.000   Median : 4.61  
##                                        Mean   : 1.163   Mean   : 4.88  
##                                        3rd Qu.: 1.000   3rd Qu.: 6.22  
##                                        Max.   :36.000   Max.   :18.82  
##    total_usd       payment_type        sales_type       
##  Min.   :  1.030   Length:137748      Length:137748     
##  1st Qu.:  4.120   Class :character   Class :character  
##  Median :  4.860   Mode  :character   Mode  :character  
##  Mean   :  5.534                                        
##  3rd Qu.:  6.340                                        
##  Max.   :326.150                                        
##     datetime                     
##  Min.   :2017-12-01 13:00:00.00  
##  1st Qu.:2017-12-20 19:00:00.00  
##  Median :2018-01-08 18:00:00.00  
##  Mean   :2018-01-09 18:07:55.51  
##  3rd Qu.:2018-01-29 22:00:00.00  
##  Max.   :2018-02-18 23:00:00.00

The FnB dataset contains 137,748 rows of individual sales history from 2017-12-01 to 2018-02-18, where each receipt number identifies one transaction. Since the objective of this project is to forecast the hourly number of visitors, we can aggregate the unique receipts within each hour of the day as a proxy for visitors.

Using dplyr, we group the dataset by datetime and count the unique receipts of each group.

train <- train %>% 
  group_by(datetime) %>% 
  summarise(visitor = n_distinct(receipt_number))

Now, our dataset contains the number of visitors on an hourly basis for each day.

head(train)
range(train$visitor)
## [1]  1 83

The number of visitors per hour ranges from 1 to 83 (it will also include 0 once the missing hours are padded). This will be useful later during model evaluation.

To do a time series analysis, we need to make sure that the data is collected at a regular interval without gaps. Padding achieves this by taking the minimum and maximum timestamps in the series, generating a complete hourly sequence between them, and inserting rows for any missing hours.

range(train$datetime)
## [1] "2017-12-01 13:00:00 UTC" "2018-02-18 23:00:00 UTC"
min_date <- min(train$datetime)
max_date <- max(train$datetime)
train <- train %>%
  pad(start_val = ymd_hms(min_date),
      end_val = ymd_hms(max_date))
## pad applied on the interval: hour

By padding on an hourly interval, we add the time gaps where there are no visitors as rows with NA values. pad() assumes a 24-hour day, which is not relevant for a business that only opens from 10 am to 10 pm. Hence, we need to filter out the hours outside the operating hours.

train$transaction_hour <- hour(train$datetime)

train <- train %>% 
  mutate(visitor = replace_na(visitor,0)) %>% 
  filter(transaction_hour %in% (10:22))
# alternative to padr::pad() using tidyr, which should produce the same hourly grid
# train <- tidyr::complete(train, datetime = seq(min(datetime), max(datetime), by = "1 hour"))

After padding and imputing the missing values with 0, there are no missing values left.

colSums(is.na(train))
##         datetime          visitor transaction_hour 
##                0                0                0

For preliminary data exploration, we can use a simple line plot to observe the mean number of visitors per hour over the period. The number of visitors tends to increase as the day goes on, peaking around 8 pm, and then tapers off until closing time.

train %>%
  group_by(transaction_hour) %>%
  summarise(mean_visitor = mean(visitor)) %>%
  plot(type = "o")

Using a line chart, we can also map the trend over the period. The trend in the number of visitors looks fairly flat, with a strong seasonality pattern. Further time series decomposition is required before modelling and forecasting.

ggplot(data = train, mapping = aes(x = datetime, y = visitor)) +
  geom_point()+
  geom_line()

Seasonality analysis

First, we create a time series object with a frequency of 13, as the daily operating time is 13 hours.

train_ts <- ts(data = train$visitor,
               frequency = 13)

Then, we estimate the trend, seasonal, and irregular components of a time series using decompose().

train_dc <- train_ts %>% 
  decompose(type = "additive")
train_ts %>%
  tail(13*7*4) %>% 
  decompose() %>% 
  autoplot()

From the decomposition above, we have a clear seasonality. However, the estimated trend shows a pattern of up- and down-swings. This pattern in the trend might come from extra seasonality at a higher natural period that has not been captured, so the data can be considered multi-seasonal.

To handle this complex seasonality, we need to convert the data into an msts() object, which accepts multiple frequency settings. msts is a class for multi-seasonal time series objects, intended to be used with models that support multiple seasonal periods.

train_msts <- msts(data = train$visitor,seasonal.periods = c(13,13*7))

We can decompose one month of the data to capture the multiple seasonality using mstl(), which estimates the seasonal components iteratively using STL. Seasonal13 is the daily seasonality, while Seasonal91 is the weekly seasonality.

train_msts %>% 
  head(13*7*4) %>% 
  mstl() %>% 
  autoplot() +
    labs(title = "Multiple seasonal decomposition")

Using the library timetk, we can take a closer look at the daily and weekly seasonality.

library(timetk)
train %>%
    plot_seasonal_diagnostics(datetime, visitor, .interactive = FALSE)

Daily seasonality:

  • The opening hour, 10 am, has the fewest visitors of the day.
  • The number of visitors increases throughout the day from the opening hour until it peaks at 8 pm, then tapers off until the closing hour at 10 pm.

Weekly seasonality (a quick numeric check follows below):

  • The store is busiest during the weekend, on Saturday and Sunday.
  • Weekdays are mostly business as usual, but on average the number of visitors slumps on Thursday.
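
To double-check that weekly pattern, we can summarise the mean hourly visitors per day of week directly from the aggregated train data (a minimal sketch; wday() comes from lubridate, which is already loaded):

# mean hourly visitors per day of week, sorted from busiest to quietest
train %>%
  mutate(day_of_week = wday(datetime, label = TRUE, week_start = 1)) %>%
  group_by(day_of_week) %>%
  summarise(mean_visitor = mean(visitor)) %>%
  arrange(desc(mean_visitor))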

Model Fitting and Evaluation

Single Seasonality

Cross Validation

Cross-validation is primarily used to estimate a model’s performance on unseen data. The technique involves reserving a particular sample of the dataset on which the model is not trained, and testing the model on that sample before finalizing it. In time series, we use the latest data as the validation set.

We will further split the train dataset into two sets: training data and validation data. For the validation data, we use the last week.

ts_info(train_ts)
##  The train_ts series is a ts object with 1 variable and 1037 observations
##  Frequency: 13 
##  Start time: 1 1 
##  End time: 80 10
t_train <- head(train_ts, n = length(train_ts) - 13*7)
v_train <- tail(train_ts, n = 13*7)

Model Fitting

  1. Triple Exponential Smoothing Model

Since our data has error, trend, and seasonality components, a triple exponential smoothing (Holt-Winters) model is a natural candidate.

# Modelling
model_hw <- HoltWinters(t_train)

# Forecasting
forecast_hw <- forecast(model_hw, h = 13*7)

Let’s visualize our forecast from the Holt-Winters model.

train_ts %>% 
  autoplot(series = "actual") +
  autolayer(forecast_hw$fitted, series = "predict train") +
  autolayer(forecast_hw$mean, series = "predict validation") +
  labs(title = "Actual vs Forecasted Number of Visitors using Holtwinters Model")
## Warning: Removed 13 row(s) containing missing values (geom_path).

Visually, the triple exponential smoothing model isn’t doing a good job of modelling the actual data, and it does an even worse job of predicting the validation data.

hw_accuracy <- accuracy(forecast_hw, v_train)
hw_accuracy
##                      ME      RMSE      MAE  MPE MAPE     MASE      ACF1
## Training set  0.3094772  7.382416 5.717301  NaN  Inf 0.685461 0.1356699
## Test set     -5.7374391 10.420920 8.724644 -Inf  Inf 1.046019 0.6099355
##              Theil's U
## Training set        NA
## Test set             0

The metric used to evaluate this model is MAE. We cannot use MAPE since the data contains zeros, which makes MAPE infinite. On average, the forecast is off by 5.7 visitors on the training set and 8.7 on the validation set.
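
To make the MAE vs MAPE argument concrete, we can compute both metrics by hand on the validation window using the already loaded MLmetrics functions (a minimal sketch; as.numeric() simply drops the ts attributes):

# MAE averages the absolute forecast errors, so zero-visitor hours are fine;
# MAPE divides each error by the actual value, so any zero in v_train makes it infinite
MAE(y_pred = as.numeric(forecast_hw$mean), y_true = as.numeric(v_train))
MAPE(y_pred = as.numeric(forecast_hw$mean), y_true = as.numeric(v_train))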

  2. SARIMA (Seasonal ARIMA) Model

Instead of ARIMA, which doesn’t support seasonal components, the next model we are going to use is SARIMA. It is an extension to ARIMA that supports the direct modeling of the seasonal component of the series.

Using auto.arima(), we don’t have to determine the SARIMA orders manually.

# Create the Model
model_sarima <- auto.arima(t_train, seasonal = T)

# Forecast n periods of the data
forecast_sarima <- forecast(model_sarima, h=13*7)
train_ts %>% 
  autoplot(series = "actual") +
  autolayer(forecast_sarima$fitted, series = "predict train") +
  autolayer(forecast_sarima$mean, series = "predict validation") +
  labs(title = "Actual vs Forecasted Number of Visitors using SARIMA Model")

sarima_accuracy <- accuracy(forecast_sarima, v_train)
sarima_accuracy
##                       ME     RMSE       MAE  MPE MAPE      MASE        ACF1
## Training set -0.01654614  7.90352  6.088449 -Inf  Inf 0.7299587 0.002808819
## Test set     -9.66625683 13.56822 11.259677 -Inf  Inf 1.3499498 0.492584068
##              Theil's U
## Training set        NA
## Test set             0

SARIMA performs even worse than Holt-Winters on this dataset. On average, the forecast is off by 6.1 visitors on the training set and 11.3 on the validation set. The model is quite overfitted.

  3. ETS (Exponential Smoothing State Space Model)

The ETS (Error, Trend, Seasonal) model focuses on the trend and seasonal components. The flexibility of the ETS framework lies in its ability to combine trend and seasonal components of different types. We are going to combine this method with STL (Seasonal and Trend decomposition using Loess), a more advanced technique for extracting seasonality in the sense that it allows the seasonal component to vary over time, which is not the case with decompose().

# Create the model (STL decomposition + ETS on the seasonally adjusted series)
model_ets <- stlm(t_train)

# Forecast n periods ahead using the fitted model
forecast_ets <- forecast(model_ets, h = 13*7)
train_ts %>% 
  autoplot(series = "actual") +
  autolayer(forecast_ets$fitted, series = "predict train") +
  autolayer(forecast_ets$mean, series = "predict validation") +
  labs(title = "Actual vs Forecasted Number of Visitors using ETS Model")

# Model Evaluation
ets_accuracy <- accuracy(forecast_ets, v_train)
ets_accuracy
##                       ME      RMSE      MAE  MPE MAPE      MASE      ACF1
## Training set -0.01374468  6.365181 4.959802  NaN  Inf 0.5946425 0.1168178
## Test set     -5.59783572 10.105443 8.416542 -Inf  Inf 1.0090795 0.5812847
##              Theil's U
## Training set        NA
## Test set             0

Compared to Holt-Winters and SARIMA, ETS does have a lower MAE on the training set. However, the error on the validation set is still quite high. The model is overfitted.

Multiple Seasonality

Cross Validation

For modelling with multiple seasonality, we have to use the msts object. First, we split train_msts into training and validation datasets.

train_msts_t <- head(train_msts, n = length(train_msts) - 13*7)
train_msts_v <- tail(train_msts, n = 13*7)

Model Fitting

  1. TBATS Model

TBATS stands for:

  • T: Trigonometric seasonality
  • B: Box-Cox transformation
  • A: ARIMA errors
  • T: Trend
  • S: Seasonal components

TBATS is a forecasting method for modelling time series data. Its main aim is to forecast time series with multiple seasonal patterns using exponential smoothing. One advantage of the TBATS model is that the seasonality is allowed to change slowly over time.

# Modelling
model_tbats <- train_msts_t %>%
            tbats(use.box.cox = FALSE, 
                  use.trend = TRUE, 
                  use.damped.trend = TRUE)
# Forecasting
forecast_tbats <-  forecast(model_tbats,h=13*7) 
train_msts %>% 
  autoplot(series = "actual") +
  autolayer(forecast_tbats$fitted, series = "predict train") +
  autolayer(forecast_tbats$mean, series = "predict validation") +
  labs(title = "Actual vs Forecasted Number of Visitors using TBATS Model")

tbats_accuracy <- accuracy(forecast_tbats, train_msts_v)
tbats_accuracy
##                       ME     RMSE      MAE  MPE MAPE     MASE         ACF1
## Training set -0.07770965 6.360268 4.875763  NaN  Inf 0.754803 -0.005977595
## Test set     -3.78367568 7.953826 6.533778 -Inf  Inf 1.011476  0.487505608
##              Theil's U
## Training set        NA
## Test set             0

So far, the TBATS model has the best MAE on both the training and validation sets compared to the single-seasonality models.

  2. STL + ARIMA Model

We are going to fit an STLM model using the ARIMA method.

# Model Fitting
model_arima <- stlm(train_msts_t, method = "arima")

# Forecasting
forecast_arima <- forecast(model_arima, h=13*7)
train_msts %>% 
  autoplot(series = "actual") +
  autolayer(forecast_arima$fitted, series = "predict train") +
  autolayer(forecast_arima$mean, series = "predict validation") +
  labs(title = "Actual vs Forecasted Number of Visitors using ARIMA Multiple Seasonality Model")

arima_accuracy <- accuracy(forecast_arima, train_msts_v)
arima_accuracy
##                       ME     RMSE      MAE MPE MAPE      MASE          ACF1
## Training set -0.02985275 5.140970 3.895996 NaN  Inf 0.6031282 -0.0002734328
## Test set     -2.11595779 7.323714 5.694023 NaN  Inf 0.8814756  0.3436755889
##              Theil's U
## Training set        NA
## Test set             0

The ARIMA model with multiple seasonality and STL decomposition yields the lowest MAE on both the training and validation sets among all models.

Here’s a compilation of the accuracy of all five fitted models. STL + ARIMA comes out on top.

rnames <- c("Holtwinters - train", "Holtwinters - val", "SARIMA - train", "SARIMA - val", "ETS - train", "ETS - val", "TBATS - train", "TBATS - val", "ARIMA STLM - train", "ARIMA STLM - val")
compile <- rbind(hw_accuracy, sarima_accuracy, ets_accuracy, tbats_accuracy, arima_accuracy)
rownames(compile) <- rnames

compile
##                              ME      RMSE       MAE  MPE MAPE      MASE
## Holtwinters - train  0.30947718  7.382416  5.717301  NaN  Inf 0.6854610
## Holtwinters - val   -5.73743908 10.420920  8.724644 -Inf  Inf 1.0460185
## SARIMA - train      -0.01654614  7.903520  6.088449 -Inf  Inf 0.7299587
## SARIMA - val        -9.66625683 13.568216 11.259677 -Inf  Inf 1.3499498
## ETS - train         -0.01374468  6.365181  4.959802  NaN  Inf 0.5946425
## ETS - val           -5.59783572 10.105443  8.416542 -Inf  Inf 1.0090795
## TBATS - train       -0.07770965  6.360268  4.875763  NaN  Inf 0.7548030
## TBATS - val         -3.78367568  7.953826  6.533778 -Inf  Inf 1.0114756
## ARIMA STLM - train  -0.02985275  5.140970  3.895996  NaN  Inf 0.6031282
## ARIMA STLM - val    -2.11595779  7.323714  5.694023  NaN  Inf 0.8814756
##                              ACF1 Theil's U
## Holtwinters - train  0.1356698572        NA
## Holtwinters - val    0.6099355369         0
## SARIMA - train       0.0028088186        NA
## SARIMA - val         0.4925840681         0
## ETS - train          0.1168178319        NA
## ETS - val            0.5812847125         0
## TBATS - train       -0.0059775948        NA
## TBATS - val          0.4875056084         0
## ARIMA STLM - train  -0.0002734328        NA
## ARIMA STLM - val     0.3436755889         0
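
Since MAE is the metric we care about, pulling out just that column makes the ranking easier to scan (a small convenience step; compile is a plain matrix built by rbind(), so it can be subset directly):

# MAE of every model/split, sorted from best to worst
sort(round(compile[, "MAE"], 2))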

Prediction Performance

Since the ARIMA model with multiple seasonality has the smallest MAE of all 5 models, we will use it to forecast on our test data.

# Fitting model using all the train dataset
model_arima_final <- stlm(train_msts, method = "arima")

# Forecasting
forecast_arima_final <- forecast(model_arima_final, h=13*7)

# Insert the forecast into `test` dataframe
test$visitor <- forecast_arima_final$mean

# save as .csv
write.csv(test, file = "submission_dyana.csv")
head(test)

The model passed the MAE threshold of 6 on the evaluation dataset.

Conclusion

A good time series model satisfies several assumptions about its residuals:

  1. No autocorrelation in the residuals
  2. Normality of the residuals

No-Autocorrelation Assumption

Ljung-Box Test

Box.test(model_arima_final$residuals, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  model_arima_final$residuals
## X-squared = 0.00010908, df = 1, p-value = 0.9917

\(H_0\): residuals have no autocorrelation

\(H_1\): residuals have autocorrelation

Since the p-value is higher than alpha (0.9917 > 0.05), we fail to reject H0: the residuals show no autocorrelation.
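
As a visual complement to the Ljung-Box test, we can also inspect the autocorrelation of the residuals directly (a quick sketch using forecast’s ggAcf()):

# ACF of the residuals; spikes staying inside the confidence band
# support the no-autocorrelation conclusion above
ggAcf(model_arima_final$residuals) +
  labs(title = "ACF of STLM + ARIMA residuals")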

Normality of Residuals

Shapiro-Wilk Test

shapiro.test(model_arima_final$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_arima_final$residuals
## W = 0.99139, p-value = 9.359e-06

\(H_0\): residuals are normally distributed

\(H_1\): residuals are not normally distributed

Since the p-value is less than alpha (9.4e-06 < 0.05), we reject H0: the residuals are not normally distributed. Note that the Shapiro-Wilk test only measures the deviation of the residual distribution from normality, not the forecast performance, which worsens over longer forecast horizons. If we want to forecast further ahead, we will need to feed more data to our model.
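
A quick visual check of the same normality assumption (a sketch with base R graphics; nothing beyond the fitted model is needed):

# Q-Q plot of the residuals; points drifting from the reference line in the
# tails are consistent with the Shapiro-Wilk rejection above
qqnorm(as.numeric(model_arima_final$residuals))
qqline(as.numeric(model_arima_final$residuals))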