The Food and Beverage dataset, provided by Dattabot, contains detailed transactions from multiple food and beverage outlets. Using this dataset, we are challenged to do forecasting and time series analysis to help the outlet’s owner make better business decisions.
Customer behaviour, especially in the food and beverage industry, is highly related to seasonality patterns. The owner wants to analyze the number of visitors (including dine-in, delivery, and takeaway transactions) so he can make better judgments in 2018. Fortunately, time series analysis is enough to provide a good forecast and an explanation of the seasonality.
This project aims to forecast the hourly number of visitors and explain its seasonality; the forecast will be evaluated on the next 7 days (Monday, February 19th 2018 to Sunday, February 25th 2018).
# data wrangling
library(tidyverse)
library(lubridate) # date manipulation
library(zoo)
# time series
library(forecast) # time series library
library(TTR) # for Simple moving average function
library(MLmetrics) # calculate error
library(tseries) # adf.test
library(TSstudio) # nicer time series visualizations
library(padr) # pad time series to regular intervals
library(imputeTS) # impute missing values in time series
The train dataset contains detailed transaction records from December 1st 2017 to February 18th 2018. The shop opens from 10 am to 10 pm every day:
## 'data.frame': 137748 obs. of 10 variables:
## $ transaction_date: chr "2017-12-01T13:32:46Z" "2017-12-01T13:32:46Z" "2017-12-01T13:33:39Z" "2017-12-01T13:33:39Z" ...
## $ receipt_number : chr "A0026694" "A0026694" "A0026695" "A0026695" ...
## $ item_id : chr "I10100139" "I10500037" "I10500044" "I10400009" ...
## $ item_group : chr "noodle_dish" "drinks" "drinks" "side_dish" ...
## $ item_major_group: chr "food" "beverages" "beverages" "food" ...
## $ quantity : int 1 1 1 1 1 1 1 1 1 2 ...
## $ price_usd : num 7.33 4.12 2.02 5.6 3.01 4.86 6.34 7.58 4.12 1.77 ...
## $ total_usd : num 7.33 4.12 2.02 5.6 3.01 4.86 6.34 7.58 4.12 3.54 ...
## $ payment_type : chr "cash" "cash" "cash" "cash" ...
## $ sales_type : chr "dine_in" "dine_in" "dine_in" "dine_in" ...
Dataset Description
transaction_date : The timestamp of a transaction
receipt_number : The ID of a transaction
item_id : The ID of an item in a transaction
item_group : The group ID of an item in a transaction
item_major_group : The major-group ID of an item in a transaction
quantity : The quantity of the purchased item
price_usd : The price of the purchased item
total_usd : The total price of the purchased item
payment_type : The payment method
sales_type : The sales method
Using the lubridate library, we change the transaction_date column to the date-time data type, then create a new variable called datetime, which rounds the date-time object down to the nearest value of the specified time unit. It’s hourly, in this case.
train <- train %>%
mutate(transaction_date = ymd_hms(transaction_date)) %>%
mutate(datetime = floor_date(transaction_date, unit = "hour"))
Then, we check for missing values. It seems that the data doesn’t contain any.
## transaction_date receipt_number item_id item_group
## 0 0 0 0
## item_major_group quantity price_usd total_usd
## 0 0 0 0
## payment_type sales_type datetime
## 0 0 0
Before further preprocessing, we take a look at the summary statistics.
## transaction_date receipt_number item_id
## Min. :2017-12-01 13:32:46.00 Length:137748 Length:137748
## 1st Qu.:2017-12-20 19:42:26.00 Class :character Class :character
## Median :2018-01-08 18:21:20.00 Mode :character Mode :character
## Mean :2018-01-09 18:37:52.10
## 3rd Qu.:2018-01-29 22:47:17.00
## Max. :2018-02-18 23:58:32.00
## item_group item_major_group quantity price_usd
## Length:137748 Length:137748 Min. : 1.000 Min. : 1.03
## Class :character Class :character 1st Qu.: 1.000 1st Qu.: 3.13
## Mode :character Mode :character Median : 1.000 Median : 4.61
## Mean : 1.163 Mean : 4.88
## 3rd Qu.: 1.000 3rd Qu.: 6.22
## Max. :36.000 Max. :18.82
## total_usd payment_type sales_type
## Min. : 1.030 Length:137748 Length:137748
## 1st Qu.: 4.120 Class :character Class :character
## Median : 4.860 Mode :character Mode :character
## Mean : 5.534
## 3rd Qu.: 6.340
## Max. :326.150
## datetime
## Min. :2017-12-01 13:00:00.00
## 1st Qu.:2017-12-20 19:00:00.00
## Median :2018-01-08 18:00:00.00
## Mean :2018-01-09 18:07:55.51
## 3rd Qu.:2018-01-29 22:00:00.00
## Max. :2018-02-18 23:00:00.00
The FnB dataset contains 137,748 line-item observations of individual sales from 2017-12-01 to 2018-02-18. Since the objective of this project is to forecast the hourly number of visitors, we can aggregate the receipts within each hour of the day as a proxy for the number of visitors.
Using dplyr, we group the dataset by datetime and count the unique receipts in each group. Now, our dataset contains the number of visitors for each hour of each day.
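The aggregation code itself is not shown; a minimal dplyr sketch, assuming the result is assigned back to train (later code references train$visitor), could look like this:
# count distinct receipts per hour as a proxy for the number of visitors
train <- train %>%
  group_by(datetime) %>%
  summarise(visitor = n_distinct(receipt_number)) %>%
  ungroup()
# range of hourly visitor counts
range(train$visitor)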
## [1] 1 83
The number of visitors per hour ranges from 1 to 83. This range will be useful later during model evaluation.
To do a time series analysis, we need to make sure the data is collected at regular intervals without gaps. Padding handles this: it takes the minimum and maximum timestamps of the series and generates a complete, regular date-time sequence between them, inserting rows for the hours that have no transactions.
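The padding call itself is not shown; a minimal sketch with padr, assuming the aggregated data frame from above (datetime, visitor):
# check the span of the series, then pad it to a regular hourly sequence
range(train$datetime)
train <- train %>%
  pad(interval = "hour")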
## [1] "2017-12-01 13:00:00 UTC" "2018-02-18 23:00:00 UTC"
## pad applied on the interval: hour
By padding on an hourly interval, we add rows with NA visitors for the hours in which there were no transactions. Padding assumes a 24-hour day, which does not match a business that is only open from 10 am to 10 pm, so we also need to filter out the hours outside operating hours.
train$transaction_hour <- hour(train$datetime)
train <- train %>%
  mutate(visitor = replace_na(visitor, 0)) %>% # impute the padded hours with 0 visitors
  filter(transaction_hour %in% 10:22) # keep operating hours only
# alternative to padr::pad(); not verified to yield an identical result:
# train <- tidyr::complete(train, datetime = seq(min(datetime), max(datetime), by = "1 hour"))
After padding and imputing the missing values with 0, there are no missing values left.
## datetime visitor transaction_hour
## 0 0 0
For preliminary data exploration, we can use a simple line plot to observe the mean number of visitors per hour over the period. The number of visitors tends to increase as the day goes on, with a peak around 8 pm, and then tapers off until closing time.
train %>%
group_by(transaction_hour) %>%
summarise(mean_visitor = mean(visitor)) %>%
plot(type = "o")
Also using a line chart, we can map the trend over the period. The trend in the number of visitors looks fairly flat, with a strong seasonality pattern. A further time series decomposition is required before modelling and forecasting.
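The chart itself is not shown; a minimal ggplot2 sketch of the hourly series over the whole period, assuming the aggregated datetime and visitor columns:
# plot the full hourly series to eyeball trend and seasonality
train %>%
  ggplot(aes(x = datetime, y = visitor)) +
  geom_line() +
  labs(title = "Hourly number of visitors", x = "Date", y = "Visitors")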
First, we create a time series object with a frequency of 13, since the daily operating time is 13 hours (10 am to 10 pm).
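The construction of the ts object is not shown; a sketch under that assumption:
# hourly visitor counts as a ts object with a daily period of 13 operating hours
train_ts <- ts(train$visitor, frequency = 13)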
Then, we estimate the trend, seasonal, and irregular components of the time series using decompose().
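A sketch of the classical decomposition plot (the exact call is not shown):
# classical additive decomposition of the daily-frequency series
train_ts %>%
  decompose() %>%
  autoplot()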
From the decomposition above, we have a clear seasonality. However, the estimated trend shows a pattern of repeated uptrends and downtrends. This pattern likely comes from an extra, uncaptured seasonality at a higher natural period, so the data can be considered multi-seasonal.
To handle this complex seasonality, we need to convert the data into an msts() object, which accepts multiple frequency settings. msts is a class for multi-seasonal time series objects, intended to be used with models that support multiple seasonal periods.
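The msts construction is not shown; a sketch assuming a daily period of 13 and a weekly period of 13 * 7 = 91, matching the seasonal components identified below:
# multi-seasonal time series with daily (13) and weekly (91) periods
train_msts <- msts(train$visitor, seasonal.periods = c(13, 13 * 7))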
We can decompose one month of the data to capture the multiple seasonalities using mstl(), which estimates the seasonal components iteratively using STL. It turns out that Seasonal13 is the daily seasonality, while Seasonal91 is the weekly seasonality.
train_msts %>%
head(13*7*4) %>%
mstl() %>%
autoplot() +
labs(title = "Multiple seasonal decomposition")
Using the timetk library, we can take a closer look at the daily and weekly seasonality.
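The exact timetk call is not shown; plot_seasonal_diagnostics() is one way to inspect the hourly and weekday patterns, sketched here under the assumption that the aggregated data frame still holds the datetime and visitor columns:
library(timetk) # assumed to be attached for this plot
train %>%
  plot_seasonal_diagnostics(.date_var = datetime,
                            .value = visitor,
                            .interactive = FALSE)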
Daily seasonality:
- The opening hour, 10 am, has the fewest visitors of the day.
- The number of visitors increases throughout the day from the opening hour until it peaks at 8 pm, then tapers off until the closing hour at 10 pm.
Weekly seasonality:
- The store is busiest during the weekend, on Saturday and Sunday.
- During weekdays it is mostly business as usual, although the number of visitors slumps on Thursdays, on average.
Cross-validation is primarily used in modelling to estimate performance on unseen data. The technique involves holding out a sample of the dataset that the model is not trained on; the model is then tested on this sample before it is finalized. In time series, we use the most recent data as the test set.
We will further split the train dataset into two sets: training data and validation data. For the validation data, we will use the last week.
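The split itself is not shown; a sketch using TSstudio::ts_split(), holding out the last week (13 * 7 = 91 observations) and producing the t_train object used in the models below (val_ts is a hypothetical name for the validation series):
split_ts <- ts_split(train_ts, sample.out = 13 * 7)
t_train <- split_ts$train # used for model fitting
val_ts <- split_ts$test # last-week hold-out for validation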
## The train_ts series is a ts object with 1 variable and 1037 observations
## Frequency: 13
## Start time: 1 1
## End time: 80 10
Since our data has error, trend, and seasonality components, a triple exponential smoothing (Holt-Winters) model is a natural candidate.
# Modelling
model_hw <- HoltWinters(t_train)
# Forecasting
forecast_hw <- forecast(model_hw, h = 13*7)
Let’s visualize our forecast using the Holt-Winters model.
train_ts %>%
autoplot(series = "actual") +
autolayer(forecast_hw$fitted, series = "predict train") +
autolayer(forecast_hw$mean, series = "predict validation") +
labs(title = "Actual vs Forecasted Number of Visitors using Holtwinters Model")
## Warning: Removed 13 row(s) containing missing values (geom_path).
Visually, the triple exponential smoothing model isn’t doing a good job of fitting the actual data, and does an even worse job of predicting the validation data.
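The accuracy computation is not shown; presumably it is forecast::accuracy() against the held-out week (hw_accuracy matches the object name used in the model comparison later; val_ts is the hypothetical validation series from the split sketch above):
hw_accuracy <- accuracy(forecast_hw, val_ts) # training vs test errors
hw_accuracy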
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.3094772 7.382416 5.717301 NaN Inf 0.685461 0.1356699
## Test set -5.7374391 10.420920 8.724644 -Inf Inf 1.046019 0.6099355
## Theil's U
## Training set NA
## Test set 0
The metric used to evaluate this model is MAE. We cannot use MAPE since the data contains zeros, which makes MAPE infinite. On average, the forecast is off by 5.7 visitors on the training set and 8.7 on the validation set.
Instead of plain ARIMA, which doesn’t support seasonal components, the next model we are going to use is SARIMA, an extension of ARIMA that directly models the seasonal component of the series.
Using auto.arima(), we don’t have to determine the SARIMA orders manually.
# Create the Model
model_sarima <- auto.arima(t_train, seasonal = T)
# Forecast n periods of the data
forecast_sarima <- forecast(model_sarima, h=13*7)
train_ts %>%
autoplot(series = "actual") +
autolayer(forecast_sarima$fitted, series = "predict train") +
autolayer(forecast_sarima$mean, series = "predict validation") +
labs(title = "Actual vs Forecasted Number of Visitors using SARIMA Model")
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -0.01654614 7.90352 6.088449 -Inf Inf 0.7299587 0.002808819
## Test set -9.66625683 13.56822 11.259677 -Inf Inf 1.3499498 0.492584068
## Theil's U
## Training set NA
## Test set 0
SARIMA performs even worse than Holt-Winters on this dataset. On average, the forecast is off by 6.1 visitors on the training set and 11.3 on the validation set. The model is quite overfitted.
The ETS (Error, Trend, Seasonal) model focuses on the trend and seasonal components; its flexibility lies in allowing the error, trend, and seasonal components to take different forms. We combine this method with STL (Seasonal and Trend decomposition using Loess), a more advanced technique for extracting seasonality in that it allows the seasonal component to vary over time, which is not the case with decompose().
# Create the Model
model_ets <- stlm(t_train)
# Forecast n periods ahead
forecast_ets <- forecast(model_ets, h = 13*7)
train_ts %>%
autoplot(series = "actual") +
autolayer(forecast_ets$fitted, series = "predict train") +
autolayer(forecast_ets$mean, series = "predict validation") +
labs(title = "Actual vs Forecasted Number of Visitors using ETS Model")
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -0.01374468 6.365181 4.959802 NaN Inf 0.5946425 0.1168178
## Test set -5.59783572 10.105443 8.416542 -Inf Inf 1.0090795 0.5812847
## Theil's U
## Training set NA
## Test set 0
Compared to Holt-Winters and SARIMA, ETS does have a lower MAE on the training set. However, the error on the validation set is still quite high; the model is overfitted.
For modelling with multiple seasonality, we have to use the msts object. First, we split train_msts into training and validation datasets.
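The split is not shown; a sketch mirroring the earlier split, producing the train_msts_t object used in the TBATS and STLM code below (val_msts is a hypothetical name for the validation series):
split_msts <- ts_split(train_msts, sample.out = 13 * 7)
train_msts_t <- split_msts$train # used for model fitting
val_msts <- split_msts$test # last-week hold-out for validation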
TBATS stands for Trigonometric seasonality, Box-Cox transformation, ARMA errors, Trend, and Seasonal components. It is a forecasting method designed for time series with multiple seasonal patterns, built on exponential smoothing. One advantage of the TBATS model is that the seasonality is allowed to change slowly over time.
# Modelling
model_tbats <- train_msts_t %>%
tbats(use.box.cox = FALSE,
use.trend = TRUE,
use.damped.trend = TRUE)
# Forecasting
forecast_tbats <- forecast(model_tbats,h=13*7)
train_msts %>%
autoplot(series = "actual") +
autolayer(forecast_tbats$fitted, series = "predict train") +
autolayer(forecast_tbats$mean, series = "predict validation") +
labs(title = "Actual vs Forecasted Number of Visitors using TBATS Model")
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -0.07770965 6.360268 4.875763 NaN Inf 0.754803 -0.005977595
## Test set -3.78367568 7.953826 6.533778 -Inf Inf 1.011476 0.487505608
## Theil's U
## Training set NA
## Test set 0
So far, the TBATS model has the best MAE on both the training and validation sets compared to the single-seasonality models.
Next, we fit an STLM model using the ARIMA method: the series is decomposed with STL and an ARIMA model is fitted to the seasonally adjusted data.
# Model Fitting
model_arima <- stlm(train_msts_t, method = "arima")
# Forecasting
forecast_arima <- forecast(model_arima, h=13*7)
train_msts %>%
autoplot(series = "actual") +
autolayer(forecast_arima$fitted, series = "predict train") +
autolayer(forecast_arima$mean, series = "predict validation") +
labs(title = "Actual vs Forecasted Number of Visitors using ARIMA Multiple Seasonality Model")
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -0.02985275 5.140970 3.895996 NaN Inf 0.6031282 -0.0002734328
## Test set -2.11595779 7.323714 5.694023 NaN Inf 0.8814756 0.3436755889
## Theil's U
## Training set NA
## Test set 0
ARIMA with multiple seasonality and STL decomposition yields the lowest MAE on both the training and validation sets among all models.
Here’s a compilation of the accuracy of all five fitted models. STL + ARIMA comes out on top.
rnames <- c("Holtwinters - train", "Holtwinters - val", "SARIMA - train", "SARIMA - val", "ETS - train", "ETS - test", "TBATS - train", "TBATS - test", "ARIMA STLM - Train", "ARIMA STLM - Test")
compile <- rbind(hw_accuracy, sarima_accuracy, ets_accuracy, tbats_accuracy, arima_accuracy)
rownames(compile) <- rnames
compile
## ME RMSE MAE MPE MAPE MASE
## Holtwinters - train 0.30947718 7.382416 5.717301 NaN Inf 0.6854610
## Holtwinters - val -5.73743908 10.420920 8.724644 -Inf Inf 1.0460185
## SARIMA - train -0.01654614 7.903520 6.088449 -Inf Inf 0.7299587
## SARIMA - val -9.66625683 13.568216 11.259677 -Inf Inf 1.3499498
## ETS - train -0.01374468 6.365181 4.959802 NaN Inf 0.5946425
## ETS - test -5.59783572 10.105443 8.416542 -Inf Inf 1.0090795
## TBATS - train -0.07770965 6.360268 4.875763 NaN Inf 0.7548030
## TBATS - test -3.78367568 7.953826 6.533778 -Inf Inf 1.0114756
## ARIMA STLM - Train -0.02985275 5.140970 3.895996 NaN Inf 0.6031282
## ARIMA STLM - Test -2.11595779 7.323714 5.694023 NaN Inf 0.8814756
## ACF1 Theil's U
## Holtwinters - train 0.1356698572 NA
## Holtwinters - val 0.6099355369 0
## SARIMA - train 0.0028088186 NA
## SARIMA - val 0.4925840681 0
## ETS - train 0.1168178319 NA
## ETS - test 0.5812847125 0
## TBATS - train -0.0059775948 NA
## TBATS - test 0.4875056084 0
## ARIMA STLM - Train -0.0002734328 NA
## ARIMA STLM - Test 0.3436755889 0
Since the smallest MAE of all five models comes from the multiple-seasonality ARIMA (STL + ARIMA) model, we will use it to forecast on our test data.
# Fitting model using all the train dataset
model_arima_final <- stlm(train_msts, method = "arima")
# Forecasting
forecast_arima_final <- forecast(model_arima_final, h=13*7)
# Insert the forecast into `test` dataframe
test$visitor <- forecast_arima_final$mean
# save as .csv
write.csv(test, file = "submission_dyana.csv")
The model passed the MAE threshold of 6 on the evaluation dataset.
A good time series model satisfies several assumptions: (1) no autocorrelation in the residuals, and (2) normality of the residuals.
Ljung-Box Test
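The test call is not shown; presumably the standard stats::Box.test() applied to the final model’s residuals:
# Ljung-Box test for residual autocorrelation at lag 1 (the default)
Box.test(model_arima_final$residuals, type = "Ljung-Box")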
##
## Box-Ljung test
##
## data: model_arima_final$residuals
## X-squared = 0.00010908, df = 1, p-value = 0.9917
\(H_0\): residuals have no autocorrelation
\(H_1\): residuals have autocorrelation
Since the p-value is higher than alpha (0.9917 > 0.05), we fail to reject H0: the residuals show no autocorrelation.
Shapiro-Wilk Test
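The call is not shown; presumably stats::shapiro.test() on the residuals:
# Shapiro-Wilk test for normality of the residuals
shapiro.test(model_arima_final$residuals)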
##
## Shapiro-Wilk normality test
##
## data: model_arima_final$residuals
## W = 0.99139, p-value = 9.359e-06
\(H_0\): residuals are normally distributed
\(H_1\): residuals are not normally distributed
Since the p-value is less than alpha (9.4e-06 < 0.05), we reject H0: the residuals are not normally distributed. Note that the Shapiro-Wilk test only assesses the deviation of the residual distribution from normality, not forecast performance, which worsens over longer horizons. If we want to forecast further ahead, we will need to feed the model more data.