This will be my final project as an Algoritma trainee. After being trained for 2.5 months and given a lot of material, we were each assigned a project to sharpen our skills. I decided to go with a forecasting model.
The Food and Beverage dataset is provided by Dattabot and contains detailed transactions from multiple food and beverage outlets. Using this dataset, I am going to do some forecasting and time series analysis to help the outlet's owner make better business decisions.
Customer behaviour, especially in the food and beverage industry, is highly related to seasonality patterns. The owner wants to analyze the number of visitors so he can make better judgements in 2018.
The train dataset contains detailed transactions from December 1st 2017 to February 18th 2018.
The dataset includes information about:
- transaction_date: The timestamp of a transaction
- receipt_number: The ID of a transaction
- item_id: The ID of an item in a transaction
- item_group: The group ID of an item in a transaction
- item_major_group: The major-group ID of an item in a transaction
- quantity: The quantity of the purchased item
- price_usd: The price of the purchased item
- total_usd: The total price of the purchased item
- payment_type: The payment method
- sales_type: The sales method

After importing the data, we need to convert transaction_date into a proper datetime format.
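As a minimal sketch of the import and parsing step (the training file path and timestamp format are assumptions, mirroring the test file path used at the end of this analysis):

library(dplyr)
library(lubridate)

# read the raw training transactions (assumed path)
resto <- read.csv("data_input/data/data-train.csv")

# parse transaction_date from character into a datetime (assumed "YYYY-MM-DD HH:MM:SS" format)
resto <- resto %>%
  mutate(transaction_date = ymd_hms(transaction_date))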
In order to get hourly transaction times, we need to round transaction_date down to the nearest hour.
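A sketch of the rounding step, assuming the result is stored in a column named datetime (the name used in the plotting code further below):

# round each transaction down to the start of its hour
resto <- resto %>%
  mutate(datetime = floor_date(transaction_date, unit = "hour"))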
Since a customer can order more than one item at a time, let's summarise the data by receipt_number to see how many orders are created hourly.
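A sketch of the hourly aggregation, assuming the hourly visitor count is stored in a column named visitor (the same name assigned to the test data at the end):

# count distinct receipts (orders) per hour
resto <- resto %>%
  group_by(datetime) %>%
  summarise(visitor = n_distinct(receipt_number)) %>%
  ungroup()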
To make sure that there are no missing timestamps in our data, let's do time series padding.
resto <- resto %>%
  pad(
    start_val = make_datetime(year = year(min_date), month = month(min_date), day = day(min_date), hour = 0),
    end_val   = make_datetime(year = year(max_date), month = month(max_date), day = day(max_date), hour = 23)
  )

## pad applied on the interval: hour
Since the restaurant is open from 10 AM to 10 PM, we need to filter the datetime to that range and replace NA values with 0.
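A sketch of this step, assuming the object and column names introduced above:

library(tidyr)

# keep only opening hours (10.00-22.00) and treat padded hours without sales as zero visitors
resto_fin <- resto %>%
  filter(hour(datetime) >= 10, hour(datetime) <= 22) %>%
  mutate(visitor = replace_na(visitor, 0))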
The first step we need to do is convert the data into a time series object.
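A minimal sketch of the conversion, assuming 13 opening hours per day as the seasonal frequency (this matches the frequency of resto_ts reported later):

# hourly visitors as a ts object with a daily seasonal period of 13 opening hours (10.00-22.00)
resto_ts <- ts(resto_fin$visitor, frequency = 13)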
Let's visualize the data to see its distribution.
An early assumption we can make: the data has trend, seasonality, and error components.
If we want to see the trend of the data, we only need to decompose it over one-month periods, since every month has its own seasonality.
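As a sketch, the decomposition can be done with the classical decompose() function on resto_ts; the object name resto_single_decompose is taken from the plotting code further below:

# decompose the single-seasonality series into trend, seasonal and remainder components
resto_single_decompose <- decompose(resto_ts)
plot(resto_single_decompose)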
We find:
- Since the graph in the trend column is not smooth, we can see that there is seasonality that has not been captured in resto_ts. That means our data most likely has multiple seasonality.
Let’s re-create our time series data.
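A sketch of the multiple-seasonality object, assuming daily (13 hours) and weekly (13 x 7 = 91 hours) seasonal periods, which match the Seasonal13 and Seasonal91 components used below:

library(forecast)

# multiple-seasonality time series: daily (13) and weekly (91) periods
resto_msts <- msts(resto_fin$visitor, seasonal.periods = c(13, 13 * 7))

# decompose with mstl(), which returns Trend, Seasonal13, Seasonal91 and Remainder columns
resto_double_decompose <- mstl(resto_msts)
plot(resto_double_decompose)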
As we can see, the trend is now smooth, indicating that our dataset indeed has multiple seasonality.
# Single seasonality
resto_fin %>%
mutate(
seasonal = resto_single_decompose$seasonal,
hour = hour(datetime)
) %>%
distinct(hour, seasonal) %>%
ggplot(mapping = aes(x = hour, y = seasonal)) +
geom_col() +
theme_minimal() +
scale_x_continuous(breaks = seq(10,22,1)) +
labs(
title = "Single Seasonality Analysis",
subtitle = "Daily"
)

We find:
- Most visitors come to the restaurant between 19.00 and 22.00.
- The opening hour (10 A.M.) has the fewest visitors.
# Multiple Seasonality
as.data.frame(resto_double_decompose) %>%
mutate(datetime = resto_fin$datetime) %>%
mutate(
dow = wday(datetime, label = TRUE, abbr = FALSE),
hour = as.factor(hour(datetime))
) %>%
group_by(dow, hour) %>%
summarise(seasonal = sum(Seasonal13 + Seasonal91)) %>%
ggplot(mapping = aes(x = hour, y = seasonal)) +
geom_col(aes(fill = dow)) +
scale_fill_viridis_d(option = "plasma") +
theme_minimal() +
labs(
title = "Multiple Seasonality Analysis",
subtitle = "Daily & Weekly"
)

Besides the single seasonality pattern, we find:
- During the lunch period (11 A.M. to 1 P.M.), Friday is the day with the fewest visitors.
- On weekends (Saturday & Sunday), the restaurant starts getting busy at 3 P.M.
- Friday, Saturday & Sunday have the highest traffic from 7 P.M. to 10 P.M.
Let's split the data into two sets: training and validation. For the validation data, we will use the last 2 weeks.
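A sketch of the split, assuming a validation window of the last 2 weeks (13 opening hours x 14 days); the train and test objects are rebuilt here directly from the hourly counts for simplicity:

# validation window: the last 2 weeks of opening hours
val_size <- 13 * 7 * 2
n_obs    <- nrow(resto_fin)

# single-seasonality split
train_resto_ts <- ts(head(resto_fin$visitor, n_obs - val_size), frequency = 13)
test_resto_ts  <- ts(tail(resto_fin$visitor, val_size), frequency = 13)

# multiple-seasonality split
train_resto_msts <- msts(head(resto_fin$visitor, n_obs - val_size), seasonal.periods = c(13, 13 * 7))
test_resto_msts  <- msts(tail(resto_fin$visitor, val_size), seasonal.periods = c(13, 13 * 7))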
## The resto_ts series is a ts object with 1 variable and 1037 observations
## Frequency: 13
## Start time: 1 4
## End time: 80 13
The end of our data is “13”, which means we can consider it full-day data (10.00-22.00).
Since our data has error, trend and seasonality components, we can use the triple exponential smoothing (Holt-Winters) method as one of our models.
# modeling
model_tes_ts <- HoltWinters(train_resto_ts)
# forecast
forecast_tes_ts <- forecast(model_tes_ts, h = 13*7)
# model evaluation
MAE(y_pred = forecast_tes_ts$mean, y_true = test_resto_ts)

## [1] 11.61221
Let's visualize the forecast against the actual values.
As we can see from the visualization, most of the forecast data points (green) are not close to the actual data (blue), which means there is a lot of error in our model. Next, let's try an STL decomposition with an ARIMA model on the single-seasonality series.

# modeling
model_arima_ts <- stlm(train_resto_ts, method = "arima")
# forecast
forecast_arima_ts <- forecast(model_arima_ts, h=13*7)
# model evaluation
MAE(y_pred = forecast_arima_ts$mean, y_true = test_resto_ts)

## [1] 7.461238
Let's see how this model performs in the visualization.
As we can see from the visualization, this model is better than the previous one. The forecast data points are closer to the actual data and this model's MAE is smaller than the previous one's. Next, let's fit the triple exponential smoothing model on the multiple-seasonality object.

# modeling
model_tes_msts <- HoltWinters(train_resto_msts)

## Warning in HoltWinters(train_resto_msts): optimization difficulties: ERROR:
## ABNORMAL_TERMINATION_IN_LNSRCH
# forecast
forecast_tes_msts <- forecast(model_tes_msts, h = 13*7)
# model evaluation
MAE(y_pred = forecast_tes_msts$mean, y_true = test_resto_msts)

## [1] 6.451498
Let's see what the visualization of this model looks like. Finally, let's fit the STL + ARIMA model on the multiple-seasonality object.
model_arima_msts <- stlm(train_resto_msts, method = "arima")
# forecast
forecast_arima_msts <- forecast(model_arima_msts, h=13*7)
# model evaluation
MAE(y_pred = forecast_arima_msts$mean, y_true = test_resto_msts)

## [1] 5.656791
This model has the smallest MAE compared to our previous models. Let's see how the visualization looks.
test_forecast(actual = resto_msts,
              forecast.obj = forecast_arima_msts,
              train = train_resto_msts,
              test = test_resto_msts)

The result is good. Most of our forecast data points are close to the actual data points. We can assume that this model is better than the others.
Since the smallest MAE on our validation data comes from the ARIMA model with multiple seasonality (5.66), we will use this model to forecast the test data.
test_data <- read.csv("data_input/data/data-test.csv")
model_arima_test <- stlm(resto_msts, method = "arima")
# forecast
forecast_arima_test <- forecast(model_arima_test, h=13*7)
# insert the data into table
test_data$visitor <- forecast_arima_test$mean
write.csv(test_data, file = "forecast_fnb.csv")

After submitting to the leaderboard, we pass the standard MAE requirement.
To make a good forecast, we have to check several assumptions about the residuals, as tested below:
1. No autocorrelation in the residuals
2. Normality of the residuals
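The autocorrelation assumption can be checked with the Box-Pierce test on the residuals of the final model; a sketch of the call that produces the output below:

# Box-Pierce test (the default type of Box.test) for autocorrelation in the residuals
Box.test(model_arima_test$residuals)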
##
## Box-Pierce test
##
## data: model_arima_test$residuals
## X-squared = 0.0081299, df = 1, p-value = 0.9282
Since the p-value is bigger than 0.05, we can conclude that the residuals have no autocorrelation.
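The normality assumption can be checked with the Shapiro-Wilk test on the same residuals; a sketch of the call that produces the output below:

# Shapiro-Wilk normality test on the residuals of the final model
shapiro.test(model_arima_test$residuals)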
##
## Shapiro-Wilk normality test
##
## data: model_arima_test$residuals
## W = 0.99114, p-value = 6.817e-06
Since the p-value is less than 0.05, the residuals are not normally distributed. Note that the Shapiro-Wilk test only measures the deviation of the residual distribution from normality, not the forecast performance, which worsens for longer forecast horizons. If we want to forecast further ahead, we will need to feed more data into our model.
Finally, from the seasonality analysis we can conclude that Saturday at 20.00 (8 P.M.) has the highest number of visitors.