This is my final project as an Algoritma trainee. After 2.5 months of training and plenty of supporting material, we were given a capstone project to put our skills into practice. I decided to go with a forecasting model.
The Food and Beverage dataset is provided by Dattabot and contains detailed transactions from multiple food and beverage outlets. Using this dataset, I am going to do some forecasting and time series analysis to help the outlet's owner make better business decisions.
Customer behaviour, especially in the food and beverage industry, is strongly related to seasonality patterns. The owner wants to analyze the number of visitors so he can make better decisions in 2018.
library(forecast)
library(tidyverse)
library(fpp)
library(TTR)
library(MLmetrics)
library(tseries)
library(lubridate)
library(dplyr)
library(TSstudio)
library(xts)
library(ggplot2)
library(skimr)
library(padr)
The training dataset contains detailed transactions from December 1st, 2017 to February 18th, 2018:
fnb <- read.csv("Data Capstone/3. FnB-ts-visitor/data/data-train.csv")
The dataset includes information about:
- `transaction_date`: The timestamp of a transaction
- `receipt_number`: The ID of a transaction
- `item_id`: The ID of an item in a transaction
- `item_group`: The group ID of an item in a transaction
- `item_major_group`: The major-group ID of an item in a transaction
- `quantity`: The quantity of the purchased item
- `price_usd`: The price of the purchased item
- `total_usd`: The total price of the purchased item
- `payment_type`: The payment method
- `sales_type`: The sales method

Let's check for missing values:
colSums(is.na(fnb))
## transaction_date receipt_number item_id item_group
## 0 0 0 0
## item_major_group quantity price_usd total_usd
## 0 0 0 0
## payment_type sales_type
## 0 0
head(fnb)
## transaction_date receipt_number item_id item_group item_major_group
## 1 2017-12-01T13:32:46Z A0026694 I10100139 noodle_dish food
## 2 2017-12-01T13:32:46Z A0026694 I10500037 drinks beverages
## 3 2017-12-01T13:33:39Z A0026695 I10500044 drinks beverages
## 4 2017-12-01T13:33:39Z A0026695 I10400009 side_dish food
## 5 2017-12-01T13:33:39Z A0026695 I10500046 drinks beverages
## 6 2017-12-01T13:35:59Z A0026696 I10300002 pastry food
## quantity price_usd total_usd payment_type sales_type
## 1 1 7.33 7.33 cash dine_in
## 2 1 4.12 4.12 cash dine_in
## 3 1 2.02 2.02 cash dine_in
## 4 1 5.60 5.60 cash dine_in
## 5 1 3.01 3.01 cash dine_in
## 6 1 4.86 4.86 cash dine_in
skim(fnb)
| Data summary | |
|---|---|
| Name | fnb |
| Number of rows | 137748 |
| Number of columns | 10 |
| Column type frequency: | |
| factor | 7 |
| numeric | 3 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| transaction_date | 0 | 1 | FALSE | 35477 | 201: 58, 201: 50, 201: 48, 201: 46 |
| receipt_number | 0 | 1 | FALSE | 35522 | A00: 58, A00: 50, A00: 48, A00: 46 |
| item_id | 0 | 1 | FALSE | 264 | I10: 8360, I10: 6290, I10: 5310, I10: 3099 |
| item_group | 0 | 1 | FALSE | 14 | dri: 40687, noo: 23832, ric: 19597, sid: 11085 |
| item_major_group | 0 | 1 | FALSE | 4 | foo: 74026, bev: 58114, top: 5158, pac: 450 |
| payment_type | 0 | 1 | FALSE | 2 | cas: 112512, non: 25236 |
| sales_type | 0 | 1 | FALSE | 3 | din: 127257, del: 9550, tak: 941 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| quantity | 0 | 1 | 1.16 | 0.65 | 1.00 | 1.00 | 1.00 | 1.00 | 36.00 | ▇▁▁▁▁ |
| price_usd | 0 | 1 | 4.88 | 1.93 | 1.03 | 3.13 | 4.61 | 6.22 | 18.82 | ▆▇▁▁▁ |
| total_usd | 0 | 1 | 5.53 | 3.65 | 1.03 | 4.12 | 4.86 | 6.34 | 326.15 | ▇▁▁▁▁ |
After importing the data, we need to format `transaction_date` into a date format.
fnb_agg <- fnb %>%
# floor_date() rounds each timestamp down to the hour
mutate(transaction_date = ymd_hms(transaction_date) %>% floor_date(unit = "hour")) %>%
group_by(transaction_date) %>%
# n_distinct() counts the unique receipt numbers (one receipt = one visitor) per hour
summarise(visitor_number = n_distinct(receipt_number)) %>%
arrange(transaction_date) %>%
ungroup()
fnb_agg <- fnb_agg %>%
pad(start_val = ymd("2017-12-01"), end_val = ymd("2018-02-18"))
In the aggregation above, we rounded `transaction_date` down to the hour so that we can count the number of visitors per hour.
fnb_agg <- fnb_agg %>%
mutate(hour = hour(transaction_date)) %>%
filter(hour %in% c(10:22)) %>%
select(transaction_date, visitor_number)
Since the restaurant is open from 10 AM to 10 PM, we filtered the timestamps to keep only those hours. Padding also introduces `NA` values for hours with no transactions, so we replace them with 0.
fnb_agg<-fnb_agg %>%
mutate(visitor_number = replace_na(visitor_number, 0))
# drop the first three padded hours (10:00-12:00 on 2017-12-01), which come before the first recorded transaction
fnb_agg <- fnb_agg[-c(1:3),]
The `pad()` call earlier ensures there is no missing hour in our data. Let's check the time range of the series:
min_date <- min(fnb_agg$transaction_date)
max_date <- max(fnb_agg$transaction_date)
min_date
## [1] "2017-12-01 13:00:00 UTC"
max_date
## [1] "2018-02-17 22:00:00 UTC"
The first step is to convert the data into a time series object. The restaurant's day covers 13 operating hours (10:00-22:00), so we use a frequency of 13; since the series starts at 13:00, the start is the 4th hour of day 1.
# 13 observations per daily cycle; the first observation (13:00) is the 4th hour of day 1
vis_ts <- ts(fnb_agg$visitor_number, start = c(1,4), frequency = 13)
Visualize the data to see its overall pattern:
autoplot(vis_ts) + theme_minimal()
Early assumptions we can make:
1. The data has trend, seasonality, and error components.
To see the trend of the data, we can decompose the series using its daily seasonal period.
We find:
- The trend component in the decomposition below is not smooth, which suggests there is seasonality that `vis_ts` has not captured. In other words, our data most likely has multiple seasonalities.
fnb_deco<-decompose(x=vis_ts)
autoplot(fnb_deco)
## Multiple Time Series
Let's re-create our time series data.
# daily (13 hours) and weekly (13*7 hours) seasonal periods
fnb_msts <- msts(fnb_agg$visitor_number, seasonal.periods = c(13, 13*7))
autoplot(fnb_msts)
fnb_msts_decompose<-mstl(fnb_msts)
autoplot(fnb_msts_decompose)
As we can see, the trend component is now smooth, which indicates that our data set indeed has multiple seasonalities.
## Seasonality Analysis
# Single seasonality
fnb_agg %>%
mutate(
seasonal = fnb_deco$seasonal,
hour = hour(transaction_date)
) %>%
distinct(hour, seasonal) %>%
ggplot(mapping = aes(x = hour, y = seasonal)) +
geom_col() +
theme_minimal() +
scale_x_continuous(breaks = seq(10,22,1)) +
labs(
title = "Single Seasonality Analysis",
subtitle = "Daily"
)
We find:
- Most visitors come to the restaurant between 19:00 and 22:00.
- The opening hour (10 AM) has the fewest visitors; see the quick check after this list.
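We can corroborate this by computing the average number of visitors per hour directly from the aggregated data; a quick sketch (not part of the original code):
# average hourly visitors across the whole training period
fnb_agg %>%
  mutate(hour = hour(transaction_date)) %>%
  group_by(hour) %>%
  summarise(avg_visitors = mean(visitor_number)) %>%
  arrange(desc(avg_visitors))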
Besides the single (daily) seasonality pattern, the weekly pattern shows (one way to plot it is sketched after this list):
- During the lunch period (11 AM to 1 PM), Friday is the day with the fewest visitors.
- On weekends (Saturday and Sunday), the restaurant starts to get busy at 3 PM.
- Friday, Saturday, and Sunday have the highest traffic between 7 PM and 10 PM.
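The weekly seasonal pattern can be inspected by plotting the weekly component extracted by mstl(). Below is a minimal sketch of one way to do this (not part of the original code; the column name Seasonal91 is assumed from mstl()'s naming convention for a period of 13*7 = 91, and values are averaged because the MSTL seasonal component varies slowly over time):
# Multiple seasonality: average weekly seasonal effect per day of week and hour
fnb_agg %>%
  mutate(
    seasonal_week = as.numeric(fnb_msts_decompose[, "Seasonal91"]),
    day = wday(transaction_date, label = TRUE),
    hour = hour(transaction_date)
  ) %>%
  group_by(day, hour) %>%
  summarise(seasonal_week = mean(seasonal_week)) %>%
  ungroup() %>%
  ggplot(mapping = aes(x = hour, y = seasonal_week)) +
  geom_col() +
  facet_wrap(~day, nrow = 1) +
  theme_minimal() +
  scale_x_continuous(breaks = seq(10, 22, 4)) +
  labs(
    title = "Multiple Seasonality Analysis",
    subtitle = "Weekly"
  )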
Let's split the data into two sets: training data and validation data. For validation, we will use the last week of observations (7 days × 13 operating hours).
ts_info(vis_ts)
## The vis_ts series is a ts object with 1 variable and 1024 observations
## Frequency: 13
## Start time: 1 4
## End time: 79 13
The end of our data falls on position “13” of the daily cycle, which means the last day is a full day of data (10:00-22:00).
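As a quick check (not in the original code), we can look at the positions of the last few observations within the daily cycle, where 1 corresponds to 10:00 and 13 to 22:00:
# daily-cycle positions of the last three observations
tail(cycle(vis_ts), 3)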
#data splitting
fnbtrain <- head(vis_ts, n= length(vis_ts)-13*7)
fnbtest <- tail(vis_ts, n = 13*7)
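The same split can also be obtained with TSstudio's ts_split() helper; a hedged alternative to the head()/tail() approach above:
# alternative split: hold out the last 13*7 observations as the validation set
split_obj <- ts_split(vis_ts, sample.out = 13*7)
fnbtrain_alt <- split_obj$train
fnbtest_alt <- split_obj$test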
Since our data has error, trend, and seasonality components, we can use the triple exponential smoothing (Holt-Winters) method as one of our models.
model_tes_ts <- HoltWinters(fnbtrain)
# forecast
forecast_tes_ts <- forecast(model_tes_ts, h = 13*7)
# model evaluation
MAE(y_pred = forecast_tes_ts$mean, y_true = fnbtest)
## [1] 14.91141
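For reference, MLmetrics::MAE() is simply the mean of the absolute differences between the forecast and the actual values; the equivalent base-R computation:
# mean absolute error computed by hand
mean(abs(forecast_tes_ts$mean - fnbtest))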
Let's visualize the forecast against the actual values.
plot_forecast(forecast_tes_ts)
As we can see from the visualization, most of the forecast points (green) are not close to the actual data (blue), which means the model still has a lot of error. Let's try an STL decomposition with an ARIMA model next.
model_arima_ts <- stlm(fnbtrain, method = "arima")
# forecast
forecast_arima_ts <- forecast(model_arima_ts, h=13*7)
# model evaluation
MAE(y_pred = forecast_arima_ts$mean, y_true = fnbtest)
## [1] 7.077565
Let's see how this model performs in a visualization.
plot_forecast(forecast_arima_ts)
As we can see from the visualization, this model is better than the previous one: the forecast points are closer to the actual data, and its MAE is smaller. Next, let's apply the same two models to the multiple-seasonality series `fnb_msts`.
fnbtrain_msts <- head(fnb_msts, n= length(vis_ts)-13*7)
fnbtest_msts <- tail(fnb_msts, n = 13*7)
model_tes_msts <- HoltWinters(fnbtrain_msts)
# forecast
forecast_tes_msts <- forecast(model_tes_msts, h = 13*7)
# model evaluation
MAE(y_pred = forecast_tes_msts$mean, y_true = fnbtest_msts)
## [1] 6.277796
This is already lower than both single-seasonality models. Let's see how this model looks in a visualization.
plot_forecast(forecast_tes_msts)
### ARIMA Model
model_arima_msts <- stlm(fnbtrain_msts, method = "arima")
# forecast
forecast_arima_msts <- forecast(model_arima_msts, h=13*7)
# model evaluation
MAE(y_pred = forecast_arima_msts$mean, y_true = fnbtest_msts)
## [1] 5.579281
This model has the smallest MAE compared to our previous models. Let's see the visualization.
plot_forecast(forecast_arima_msts)
The result is good: most of the forecast points are close to the actual data points, so we can conclude that this model is better than the others.
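For a closer side-by-side look, here is a minimal sketch (not part of the original workflow) that lines up the forecast mean against the actual validation values by observation index:
# compare the best model's forecast with the actual validation week
comparison <- data.frame(
  index = seq_along(fnbtest_msts),
  actual = as.numeric(fnbtest_msts),
  forecast = as.numeric(forecast_arima_msts$mean)
)
ggplot(comparison, aes(x = index)) +
  geom_line(aes(y = actual, colour = "actual")) +
  geom_line(aes(y = forecast, colour = "forecast")) +
  theme_minimal() +
  labs(
    title = "STL + ARIMA forecast vs actual (validation week)",
    x = "hour index within the validation week",
    y = "number of visitors",
    colour = NULL
  )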
Since the smallest MAE we get is from the multiple-seasonality ARIMA model (5.58), we will use this model to forecast the test data.
test_data <- read.csv("Data Capstone/3. FnB-ts-visitor/data/data-test.csv")
model_arima_test <- stlm(fnb_msts, method = "arima")
forecast_arima_test <- forecast(model_arima_test, h=13*7)
# insert the data into table
test_data$visitor <- forecast_arima_test$mean
write.csv(test_data,file = "forecast_fnb5.csv")
After submitting to the leaderboard, the forecast passes the required MAE threshold.
To make a good forecast, we have to check several assumptions about the model's residuals:
1. No autocorrelation in the residuals
2. Normality of the residuals
Box.test(x= model_arima_test$residuals)
##
## Box-Pierce test
##
## data: model_arima_test$residuals
## X-squared = 0.0074198, df = 1, p-value = 0.9314
Since the p-value is greater than 0.05, we fail to reject the null hypothesis and conclude that the residuals have no autocorrelation.
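As an additional check (not in the original), the Ljung-Box variant of the test evaluates autocorrelation jointly over a longer lag window, for example two full operating days:
# Ljung-Box test over 26 lags (two 13-hour days)
Box.test(x = model_arima_test$residuals, lag = 13*2, type = "Ljung-Box")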
shapiro.test(model_arima_test$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_arima_test$residuals
## W = 0.99123, p-value = 8.823e-06
Since the p-value is less than 0.05, the residuals are not normally distributed. Note that the Shapiro-Wilk test only measures how far the residual distribution deviates from normality; it says nothing about forecast performance, which worsens for longer horizons. If we want to forecast further ahead, we will need to feed more data into our model.
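To see how far the residuals deviate from normality, a quick visual check (a hedged addition, not in the original code):
# histogram of the model residuals
hist(as.numeric(model_arima_test$residuals), breaks = 50,
  main = "Residuals of the STL + ARIMA model", xlab = "residual")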
Finally, from the Seasonality Analysis we can conclude that Saturday at 20:00 (8 PM) has the highest number of visitors.
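This conclusion can be double-checked directly from the aggregated data; a quick sketch (not part of the original code):
# day-of-week / hour combinations with the highest average number of visitors
fnb_agg %>%
  mutate(
    day = wday(transaction_date, label = TRUE),
    hour = hour(transaction_date)
  ) %>%
  group_by(day, hour) %>%
  summarise(avg_visitors = mean(visitor_number)) %>%
  ungroup() %>%
  arrange(desc(avg_visitors)) %>%
  head(5)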