The Food and Beverage dataset, provided by Dattabot, contains detailed transactions from multiple food and beverage outlets. Using this dataset, we are challenged to perform forecasting and time series analysis to help the outlets' owner make better business decisions.
Food & Beverage: "It's Friday night!" Customer behaviour, especially in the food and beverage industry, is highly related to seasonality patterns. The owner wants to analyze the number of visitors so he can make better business judgements in 2018.
The train dataset contains detailed transaction records from December 1st, 2017 to February 18th, 2018. The dataset includes the following fields:
- transaction_date: the timestamp of a transaction
- receipt_number: the ID of a transaction
- item_id: the ID of an item in a transaction
- item_group: the group ID of an item in a transaction
- item_major_group: the major-group ID of an item in a transaction
- quantity: the quantity of the purchased item
- price_usd: the price of the purchased item
- total_usd: the total price of the purchased item
- payment_type: the payment method
- sales_type: the sales method
Challenges: Please make a report of your forecasting results and a seasonality explanation for the hourly number of visitors, to be evaluated on the next 7 days (Monday, February 19th 2018 to Sunday, February 25th 2018)!
Import Data into R
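The import code itself is not shown in this excerpt. A minimal sketch, assuming the transactions are read with readr::read_csv() (the column specification below is read_csv output) and that the file name data-train.csv is hypothetical:
library(readr)

# Assumption: the file name is hypothetical; adjust to the actual dataset path
fnb <- read_csv("data-train.csv")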
## Parsed with column specification:
## cols(
## transaction_date = col_datetime(format = ""),
## receipt_number = col_character(),
## item_id = col_character(),
## item_group = col_character(),
## item_major_group = col_character(),
## quantity = col_double(),
## price_usd = col_double(),
## total_usd = col_double(),
## payment_type = col_character(),
## sales_type = col_character()
## )
Aggregate to get the number of visitors per hour
fnb_clean <- fnb %>%
  mutate(transaction_date = floor_date(transaction_date, unit = "hour")) %>%
  group_by(transaction_date) %>% # per hour
  summarise(visitor = n_distinct(receipt_number)) # one receipt = one visitor
Check the completeness and order of the transaction dates. Use pad() from the padr package.
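min_date and max_date (used in the pad() call below) are not defined in this excerpt; presumably they are the range of the aggregated timestamps, whose printed values follow:
# Assumption: min_date/max_date are the range of the hourly timestamps
min_date <- min(fnb_clean$transaction_date)
max_date <- max(fnb_clean$transaction_date)
min_date
max_date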
## [1] "2017-12-01 13:00:00 UTC"
## [1] "2018-02-18 23:00:00 UTC"
fnb_clean <- fnb_clean %>%
  pad(
    start_val = make_datetime(year = year(min_date), month = month(min_date), day = day(min_date), hour = 0),
    end_val   = make_datetime(year = year(max_date), month = month(max_date), day = day(max_date), hour = 23)
  )
## pad applied on the interval: hour
Filter on transaction date, because the store opens at 10 AM and closes at 10 PM.
fnb_clean <- fnb_clean %>%
  mutate(hour = hour(transaction_date)) %>%
  filter(hour %in% c(10:22)) %>% # keep opening hours only
  mutate(visitor1 = visitor)
If there is missing data (NA introduced by padding), replace it with zero.
## [1] TRUE
fnb_clean <- fnb_clean %>%
  mutate(visitor = replace_na(visitor, 0)) %>%
  mutate(visitor1 = replace_na(visitor1, 1))
For TBATS, NA must be replaced with 1 (not zero), because the series will be log-transformed and log(0) is -Inf.
ggplot(data = fnb_clean, mapping = aes(x = transaction_date, y = visitor)) +
  geom_point() +
  geom_line() +
  theme_minimal()
HOURLY
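The decomposition objects dc_fnbHH and dc_fnbHT used below are not defined in this excerpt. A plausible reconstruction, assuming a classical decomposition of the first and last two days (2 x 13 opening hours), treating the 13 opening hours as one seasonal period:
# Assumption: head/tail decompositions over two days, period = 13 opening hours
dc_fnbHH <- fnb_clean$visitor %>%
  head(13 * 2) %>%
  ts(frequency = 13) %>%
  decompose()
dc_fnbHT <- fnb_clean$visitor %>%
  tail(13 * 2) %>%
  ts(frequency = 13) %>%
  decompose()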
fnb_clean %>%
  head(13 * 2) %>%
  mutate(seasonalHH = dc_fnbHH$seasonal) %>%
  ggplot(aes(hour, seasonalHH)) +
  geom_col()
fnb_clean %>%
  tail(13 * 2) %>%
  mutate(seasonalHT = dc_fnbHT$seasonal) %>%
  ggplot(aes(hour, seasonalHT)) +
  geom_col()
From this hourly chart, we find that visitor traffic peaks at 8 PM (20:00) and is lowest at 10 AM.
WEEKLY
msts_fnb <- fnb_clean$visitor %>% msts(seasonal.periods = c(13, 13 * 7))
msts_fnb %>%
  tail(13 * 7 * 4) %>%
  stl(s.window = "periodic") %>%
  autoplot()
msts_fnb_dc <- mstl(msts_fnb)
as.data.frame(msts_fnb_dc) %>%
  mutate(tdate = fnb_clean$transaction_date) %>%
  mutate(
    days1 = wday(tdate, label = TRUE, abbr = FALSE),
    hour = as.factor(hour(tdate))
  ) %>%
  group_by(days1, hour) %>%
  summarise(seasonal = sum(Seasonal13 + Seasonal91)) %>%
  ggplot(mapping = aes(x = hour, y = seasonal)) +
  geom_col(aes(fill = days1))
From this weekly chart, we find that the peak occurs on Saturday at 8 PM (20:00), and the low point on Friday at 10 AM.
Using fnb_clean$visitor:
msts_vs <- fnb_clean$visitor %>% msts(seasonal.periods = c(13, 13 * 7))
msts_vs %>%
  head(13 * 7 * 4) %>%
  mstl() %>%
  autoplot()
Using fnb_clean$visitor1:
msts_vs1 <- fnb_clean$visitor1 %>% msts(seasonal.periods = c(13, 13 * 7))
msts_vs1 %>%
  head(13 * 7 * 4) %>%
  mstl() %>%
  autoplot()
The TBATS model cannot accept visitor = 0: when we apply log(), zero becomes -Inf, and fitting fails with the error (Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : NA/NaN/Inf in 'y'). So for the TBATS model, visitor1 will be used.
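The train/test objects (msts_train1, msts_test, msts_test1) are not constructed in this excerpt. A plausible split, assuming the last 7 days (13 x 7 hourly observations) are held out for evaluation:
# Assumption: hold out the final week as the test set
h <- 13 * 7
msts_train  <- msts_vs  %>% head(length(msts_vs)  - h) %>% msts(seasonal.periods = c(13, 13 * 7))
msts_train1 <- msts_vs1 %>% head(length(msts_vs1) - h) %>% msts(seasonal.periods = c(13, 13 * 7))
msts_test   <- msts_vs  %>% tail(h) %>% as.vector()
msts_test1  <- msts_vs1 %>% tail(h) %>% as.vector()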
tbats_mod <- msts_train1 %>%
  log() %>%
  tbats(
    use.box.cox = NULL,
    use.trend = NULL,
    use.damped.trend = NULL
  )
Using msts_test, which contains zeros, MPE shows -Inf and MAPE shows Inf.
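Neither stlm_model nor tbats_model (the forecast objects evaluated below) is shown in this excerpt. A plausible reconstruction, assuming 7-day-ahead forecasts (h = 13 * 7) and STL + ETS for the STLM model:
# Assumption: stlm() with default ETS errors on the raw training series,
# and a forecast of the TBATS model fitted above
stlm_model  <- msts_train %>% stlm() %>% forecast(h = 13 * 7)
tbats_model <- tbats_mod %>% forecast(h = 13 * 7)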
stlmForecast <- as.vector(stlm_model$mean)
accuracy_stlm <- accuracy(stlmForecast, msts_test)
accuracy_stlm
##                 ME     RMSE      MAE  MPE MAPE      ACF1 Theil's U
## Test set -6.082599 9.330179 7.678216 -Inf  Inf 0.3548303         0
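The second table appears to compare the same forecast against msts_test1 (zeros replaced by 1); presumably something like:
# Assumption: same forecast, evaluated against the zero-free test series
accuracy_stlm1 <- accuracy(stlmForecast, msts_test1)
accuracy_stlm1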
##                 ME     RMSE      MAE       MPE     MAPE      ACF1 Theil's U
## Test set -6.049632 9.314613 7.645249 -60.65058 64.90921 0.3556996  1.376096
Again, using msts_test, which contains zeros, MPE shows -Inf and MAPE shows Inf.
tbatsForecast <- as.vector(exp(tbats_model$mean))
accuracy_tbats <- accuracy(tbatsForecast, msts_test)
accuracy_tbats
##                 ME     RMSE      MAE  MPE MAPE      ACF1 Theil's U
## Test set -3.298899 8.727963 6.903395 -Inf  Inf 0.5524017         0
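As above, the second table presumably evaluates against msts_test1:
# Assumption: same forecast, evaluated against the zero-free test series
accuracy_tbats1 <- accuracy(tbatsForecast, msts_test1)
accuracy_tbats1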
##                 ME     RMSE      MAE       MPE     MAPE      ACF1 Theil's U
## Test set -3.265932 8.717369 6.870428 -33.82967 46.53715 0.5567218 0.9446494
Fitting Model
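The fitting code is not shown here. A minimal sketch, assuming a Holt-Winters model fit with stats::HoltWinters() on the raw training series:
# Assumption: additive Holt-Winters on the training series
fnb_hw <- HoltWinters(msts_train)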
Forecasting
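A sketch of the forecast step, matching the object name used in the evaluation below:
# Assumption: forecast horizon of one week (13 * 7 hours)
fnbhw_forecast <- forecast(fnb_hw, h = 13 * 7)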
Visualize the actual data with the forecast
Model Evaluation
HWForecast <- as.vector(fnbhw_forecast$mean)
accuracyHW <- accuracy(HWForecast, msts_test)
accuracyHW
##                 ME    RMSE      MAE  MPE MAPE      ACF1 Theil's U
## Test set -2.931366 8.16131 6.484465 -Inf  Inf 0.4675556         0
Create ARIMA object
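The construction is not shown here. One plausible approach for a multi-seasonal series (an assumption, not confirmed by the source) is forecast::stlm() with ARIMA errors:
# Assumption: STL decomposition with an auto-selected ARIMA on the seasonally adjusted series
fnb_arima <- stlm(msts_train, method = "arima")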
Forecast
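A sketch of the forecast step, matching the object name used below:
# Assumption: forecast horizon of one week (13 * 7 hours)
fnbar_forecast <- forecast(fnb_arima, h = 13 * 7)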
Visualization
Model Evaluation
arimaForecast <- as.vector(fnbar_forecast$mean)
accuracyArima <- accuracy(arimaForecast, msts_test)
accuracyArima
##                 ME     RMSE      MAE MPE MAPE      ACF1 Theil's U
## Test set -2.127162 7.332962 5.656791 NaN  Inf 0.3444129         0
result <- rbind(accuracy_stlm1, accuracy_tbats1, accuracyHW, accuracyArima)
rownames(result) <- c("stlm model", "tbats model", "holt winters model", "arima model")
result
##                            ME     RMSE      MAE       MPE     MAPE      ACF1 Theil's U
## stlm model         -6.049632 9.314613 7.645249 -60.65058 64.90921 0.3556996 1.3760963
## tbats model        -3.265932 8.717369 6.870428 -33.82967 46.53715 0.5567218 0.9446494
## holt winters model -2.931366 8.161310 6.484465      -Inf      Inf 0.4675556 0.0000000
## arima model        -2.127162 7.332962 5.656791       NaN      Inf 0.3444129 0.0000000
accuracyData <- data.frame(
  visitor = fnb_clean$transaction_date %>% tail(13 * 7),
  actual = as.vector(msts_test1),
  stlmForecast, tbatsForecast,
  fnbhw_forecast = as.vector(fnbhw_forecast$mean),
  fnbar_forecast = as.vector(fnbar_forecast$mean)
)
accuracyData %>%
  ggplot() +
  geom_line(aes(x = fnb_clean$transaction_date %>% tail(13 * 7),
                y = fnb_clean$visitor1 %>% tail(13 * 7),
                colour = "actual")) +
  geom_line(aes(x = fnb_clean$transaction_date %>% tail(13 * 7),
                y = stlm_model$mean,
                colour = "stlm")) +
  geom_line(aes(x = fnb_clean$transaction_date %>% tail(13 * 7),
                y = exp(tbats_model$mean),
                colour = "tbats")) +
  geom_line(aes(x = fnb_clean$transaction_date %>% tail(13 * 7),
                y = fnbhw_forecast$mean,
                colour = "holt winters")) +
  geom_line(aes(x = fnb_clean$transaction_date %>% tail(13 * 7),
                y = fnbar_forecast$mean,
                colour = "arima")) +
  scale_y_continuous() +
  labs(
    title = "Hourly visitor in F&B",
    x = "Date",
    y = "",
    colour = ""
  )
Because the ARIMA model shows the smallest MAE, it is chosen to predict on Data-Test.csv.
When submitted to the leaderboard, the MAE is still under the required metric (6.0).
To obtain a good forecast, several residual assumptions must be checked: 1. No autocorrelation in the residuals
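The test call is not shown; presumably stats::Box.test() (Box-Pierce, default lag = 1, matching df = 1 in the output) on the residuals of the final model:
# Assumption: model_test_arima is the final ARIMA model refit on the full data
Box.test(model_test_arima$residuals, type = "Box-Pierce")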
##
## Box-Pierce test
##
## data: model_test_arima$residuals
## X-squared = 0.0081299, df = 1, p-value = 0.9282
Conclusion: p-value > 0.05, so we can assume there is no autocorrelation in the residuals. 2. Normality of residuals
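Presumably stats::shapiro.test() on the same residuals:
shapiro.test(model_test_arima$residuals)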
##
## Shapiro-Wilk normality test
##
## data: model_test_arima$residuals
## W = 0.99114, p-value = 6.817e-06
Conclusion: p-value < 0.05, so the residuals are not normally distributed.