The Food and Beverage dataset is provided by Dattabot and contains detailed transactions from multiple food and beverage outlets. Using this dataset, we are challenged to do forecasting and time series analysis to help the outlet's owner make better business decisions.
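Before exploring the data, we load the packages and read the raw transactions. This is a minimal sketch; the file path "data/fnb_transaction.csv" is an assumption, as the original path is not shown.
library(dplyr)      # data wrangling (mutate, group_by, summarise, glimpse)
library(lubridate)  # date-time helpers (ymd_hms, floor_date, hour, wday)
library(padr)       # time series padding with pad()
library(forecast)   # msts, stlm, forecast, accuracy
library(ggplot2)    # plotting

fnb <- read.csv("data/fnb_transaction.csv")  # assumed file path
head(fnb)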
#> transaction_date receipt_number item_id item_group item_major_group
#> 1 2017-12-01T13:32:46Z A0026694 I10100139 noodle_dish food
#> 2 2017-12-01T13:32:46Z A0026694 I10500037 drinks beverages
#> 3 2017-12-01T13:33:39Z A0026695 I10500044 drinks beverages
#> 4 2017-12-01T13:33:39Z A0026695 I10400009 side_dish food
#> 5 2017-12-01T13:33:39Z A0026695 I10500046 drinks beverages
#> 6 2017-12-01T13:35:59Z A0026696 I10300002 pastry food
#> quantity price_usd total_usd payment_type sales_type
#> 1 1 7.33 7.33 cash dine_in
#> 2 1 4.12 4.12 cash dine_in
#> 3 1 2.02 2.02 cash dine_in
#> 4 1 5.60 5.60 cash dine_in
#> 5 1 3.01 3.01 cash dine_in
#> 6 1 4.86 4.86 cash dine_in
Column explanation:
- transaction_date: the timestamp of a transaction
- receipt_number: the ID of a transaction
- item_id: the ID of an item in a transaction
- item_group: the group ID of an item in a transaction
- item_major_group: the major-group ID of an item in a transaction
- quantity: the quantity of the purchased item
- price_usd: the price of the purchased item
- total_usd: the total price of the purchased item
- payment_type: the payment method
- sales_type: the sales method

Make sure that the data format is appropriate.
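A quick structure check with dplyr's glimpse() produces the output below (a sketch; the original call is not shown).
glimpse(fnb)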
#> Rows: 137,748
#> Columns: 10
#> $ transaction_date <chr> "2017-12-01T13:32:46Z", "2017-12-01T13:32:46Z", "2017…
#> $ receipt_number <chr> "A0026694", "A0026694", "A0026695", "A0026695", "A002…
#> $ item_id <chr> "I10100139", "I10500037", "I10500044", "I10400009", "…
#> $ item_group <chr> "noodle_dish", "drinks", "drinks", "side_dish", "drin…
#> $ item_major_group <chr> "food", "beverages", "beverages", "food", "beverages"…
#> $ quantity <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1,…
#> $ price_usd <dbl> 7.33, 4.12, 2.02, 5.60, 3.01, 4.86, 6.34, 7.58, 4.12,…
#> $ total_usd <dbl> 7.33, 4.12, 2.02, 5.60, 3.01, 4.86, 6.34, 7.58, 4.12,…
#> $ payment_type <chr> "cash", "cash", "cash", "cash", "cash", "cash", "cash…
#> $ sales_type <chr> "dine_in", "dine_in", "dine_in", "dine_in", "dine_in"…
We need to change the data type of transaction_date into datetime, round it down to the hour using the floor_date() function, then aggregate by transaction_date, counting the distinct receipt_number values to get the number of visitors per hour.
fnb_clean <- fnb %>%
  # convert to datetime, then round down to the hour
  mutate(transaction_date = floor_date(ymd_hms(transaction_date), unit = "hour")) %>%
  group_by(transaction_date) %>%
  summarise(visitor = length(unique(receipt_number)))
fnb_clean
#> # A tibble: 1,244 × 2
#> transaction_date visitor
#> <dttm> <int>
#> 1 2017-12-01 13:00:00 16
#> 2 2017-12-01 14:00:00 38
#> 3 2017-12-01 15:00:00 27
#> 4 2017-12-01 16:00:00 29
#> 5 2017-12-01 17:00:00 44
#> 6 2017-12-01 18:00:00 50
#> 7 2017-12-01 19:00:00 66
#> 8 2017-12-01 20:00:00 70
#> 9 2017-12-01 21:00:00 63
#> 10 2017-12-01 22:00:00 63
#> # ℹ 1,234 more rows
Do time series padding to fill in the missing points of the hourly series, using pad() from the padr package.
fnb_clean <- fnb_clean %>%
pad(interval = "hour",
start_val = ymd_hms("2017-12-01 10:00:00 UTC"),
end_val = ymd_hms("2018-02-18 22:00:00 UTC"))
fnb_clean
#> # A tibble: 1,909 × 2
#> transaction_date visitor
#> <dttm> <int>
#> 1 2017-12-01 10:00:00 NA
#> 2 2017-12-01 11:00:00 NA
#> 3 2017-12-01 12:00:00 NA
#> 4 2017-12-01 13:00:00 16
#> 5 2017-12-01 14:00:00 38
#> 6 2017-12-01 15:00:00 27
#> 7 2017-12-01 16:00:00 29
#> 8 2017-12-01 17:00:00 44
#> 9 2017-12-01 18:00:00 50
#> 10 2017-12-01 19:00:00 66
#> # ℹ 1,899 more rows
Filter the hourly datetimes to capture the outlet's opening hours, from 10 am to 10 pm.
fnb_clean <- fnb_clean %>%
filter(hour(transaction_date) >= 10 & hour(transaction_date) <= 22)
fnb_clean
#> # A tibble: 1,040 × 2
#> transaction_date visitor
#> <dttm> <int>
#> 1 2017-12-01 10:00:00 NA
#> 2 2017-12-01 11:00:00 NA
#> 3 2017-12-01 12:00:00 NA
#> 4 2017-12-01 13:00:00 16
#> 5 2017-12-01 14:00:00 38
#> 6 2017-12-01 15:00:00 27
#> 7 2017-12-01 16:00:00 29
#> 8 2017-12-01 17:00:00 44
#> 9 2017-12-01 18:00:00 50
#> 10 2017-12-01 19:00:00 66
#> # ℹ 1,030 more rows
Replace NA Values
We replace the missing values with zero, indicating that there were no visitors during that hour.
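A minimal sketch of this step, using tidyr's replace_na(); the original code is not shown.
fnb_clean <- fnb_clean %>%
  mutate(visitor = tidyr::replace_na(visitor, 0))  # NA -> 0 visitors
fnb_clean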
#> # A tibble: 1,040 × 2
#> transaction_date visitor
#> <dttm> <int>
#> 1 2017-12-01 10:00:00 0
#> 2 2017-12-01 11:00:00 0
#> 3 2017-12-01 12:00:00 0
#> 4 2017-12-01 13:00:00 16
#> 5 2017-12-01 14:00:00 38
#> 6 2017-12-01 15:00:00 27
#> 7 2017-12-01 16:00:00 29
#> 8 2017-12-01 17:00:00 44
#> 9 2017-12-01 18:00:00 50
#> 10 2017-12-01 19:00:00 66
#> # ℹ 1,030 more rows
Check the time range of transaction_date after padding and filtering.
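A sketch of the check that produces the output below (the original call is not shown):
range(fnb_clean$transaction_date)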
#> [1] "2017-12-01 10:00:00 UTC" "2018-02-18 22:00:00 UTC"
Create a time series object using the ts() function with frequency 13, where 13 is the number of hours the outlet is open each day, from 10 am to 10 pm.
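A minimal sketch of this step; the object names fnb_ts and fnb_dc are assumptions, since the original code is not shown.
fnb_ts <- ts(fnb_clean$visitor, frequency = 13)  # 13 opening hours per day
fnb_dc <- decompose(fnb_ts)                      # classical decomposition
plot(fnb_dc)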
Explanation :
From the plot above, the Trend component still shows a pattern and is not a smooth line. This indicates that there are multiple seasonalities in the data, so we need to convert the data into a "Multiple Seasonality Time Series", which accepts multiple frequency settings.
Create a "Multi-seasonality Time Series" object using the msts() function with frequencies Hourly (13) and Weekly (13*7); this captures the hourly and weekly seasonality. Then decompose and plot it.
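A sketch of this step; fnb_dc_2 is reused in the seasonality plot below, while the object name fnb_msts and the use of mstl()/autoplot() are assumptions.
fnb_msts <- msts(fnb_clean$visitor, seasonal.periods = c(13, 13 * 7))  # daily and weekly seasonality
fnb_dc_2 <- mstl(fnb_msts)   # multi-seasonal decomposition (forecast package)
autoplot(fnb_dc_2)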
Explanation :
Based on the plot above, the Trend is now smoother and clearer. The daily and weekly seasonal components are also clearer.
Create a visualization of hourly and weekly seasonality.
fnb_msts_dc_fr <- data.frame(fnb_dc_2)
fnb_msts_dc_fr %>%
mutate(Day_of_Week = wday(fnb_clean$transaction_date,
label = TRUE,
abbr = FALSE),
Hour = (hour(fnb_clean$transaction_date))) %>%
group_by(Day_of_Week, Hour) %>%
summarise(Seasonal = sum(Seasonal13 + Seasonal91)) %>%
ggplot() +
geom_bar(aes(x = Hour, y = Seasonal, fill = Day_of_Week), stat = "identity", position = "stack", width = 0.7) +
scale_x_continuous(breaks = seq(10,22,1)) +
labs(title = "Multi-Seasonality Analysis - Weekly & Hourly")
Explanation :
Based on the plot above, we can see that Saturday at 20.00 is when the most visitors come, while 10.00, on every day of the week, is when the fewest visitors come, as the outlet has just opened.
Split the data into train and validation sets (train = total data excluding the validation set). The validation set is the last week of data (13 × 7 = 91 hourly observations), matching the forecast horizon.
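A sketch of the split, assuming the multi-seasonal object fnb_msts from above; forecast::subset() is used here because it keeps the series attributes on the training set, and the name fnb_train is an assumption.
n_val     <- 13 * 7                                            # one week of opening hours
fnb_val   <- tail(fnb_msts, n_val)                             # last week = validation
fnb_train <- subset(fnb_msts, end = length(fnb_msts) - n_val)  # everything before the last week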
Since the time series contains Error, Trend, and Seasonal components, the models will be built using:
1) Holt-Winters Exponential Smoothing (triple exponential smoothing)
2) ARIMA
Forecast both models 1 week ahead.
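A sketch of the two models on the training series; the exact model calls are not shown in the original, so HoltWinters() and stlm(method = "arima") are assumptions, while the object names match those used in the accuracy summary below.
# Holt-Winters (triple exponential smoothing)
model_hw    <- HoltWinters(fnb_train)
forecast_hw <- forecast(model_hw, h = 13 * 7)

# STL decomposition with an ARIMA model on the seasonally adjusted series
model_arima    <- stlm(fnb_train, method = "arima")
forecast_arima <- forecast(model_arima, h = 13 * 7)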
Here is the accuracy summary of the models on the multi-seasonal time series.
acc_hw <- accuracy(object = forecast_hw$mean, fnb_val)
acc_arima <- accuracy(object = forecast_arima$mean, fnb_val)
result <- rbind(acc_hw, acc_arima)
rownames(result) <- c("Holt Winter", "Arima")
result
#> ME RMSE MAE MPE MAPE ACF1 Theil's U
#> Holt Winter -2.944367 8.168453 6.463898 -Inf Inf 0.4683447 0
#> Arima -2.092000 7.317929 5.688691 NaN Inf 0.3436265 0
Based on the summary, the accuracy of the ARIMA model is better than Holt-Winters, with an MAE of 5.688691 (< 6). This model will be chosen for the prediction.
accuracyData <- data.frame(trx_date= fnb_clean$transaction_date %>% tail(13*7),
actual = as.vector(fnb_val) ,
HW_Forecast = as.vector(forecast_hw$mean) ,
Arima_Forecast = as.vector(forecast_arima$mean))
accuracyData %>%
ggplot() +
geom_line(aes(x = trx_date, y = actual, colour = "Actual"),size=1)+
geom_line(aes(x = trx_date, y = Arima_Forecast, colour = "Arima Model (Best Model)"),size=1)+
labs(title = "Hourly Visitor - Actual Vs Arima Model",
x = "Date",
y = "Visitor",
colour = NULL)

accuracyData %>%
ggplot() +
geom_line(aes(x = trx_date, y = actual, colour = "Actual"),size=0.5)+
geom_line(aes(x = trx_date, y = HW_Forecast, colour = "Holt Winter Model"),size=0.1)+
geom_line(aes(x = trx_date, y = Arima_Forecast, colour = "Arima Model (Best Model)"),size=0.5)+
labs(title = "Hourly Visitor - Actual Vs All Models",x = "Date",y = "Visitor",colour = "")The Best Model is “Arima Model” as it has MAE Accuracy 5.688691.
#> ME RMSE MAE MPE MAPE ACF1 Theil's U
#> Test set -2.092 7.317929 5.688691 NaN Inf 0.3436265 0
Predict using the ARIMA model on the multi-seasonal series and save the prediction into the CSV file "fnb_test_arima.csv".
Read the Test Data
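A sketch of reading the test data; the file name "fnb_test.csv" is a hypothetical placeholder, as the original path is not shown.
fnb_test <- read.csv("fnb_test.csv")  # contains a datetime column; visitor will be filled with predictions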
Create an ARIMA model with the multi-seasonal time series data, then forecast with it.
model_arima_test <- stlm(fnb_ts_2, method = "arima")
forecast_arima_test <- forecast(model_arima_test, h=13*7)
fnb_test$visitor <- forecast_arima_test$mean

See the result of the prediction.
#> datetime visitor
#> 1 2018-02-19T10:00:00Z 4.173315
#> 2 2018-02-19T11:00:00Z 6.880140
#> 3 2018-02-19T12:00:00Z 12.955462
#> 4 2018-02-19T13:00:00Z 15.673966
#> 5 2018-02-19T14:00:00Z 11.139184
#> 6 2018-02-19T15:00:00Z 19.988398
#> 7 2018-02-19T16:00:00Z 21.812542
#> 8 2018-02-19T17:00:00Z 20.382694
#> 9 2018-02-19T18:00:00Z 24.183307
#> 10 2018-02-19T19:00:00Z 40.806998
Save the Test Data into a CSV File “submission_arima.csv”.
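A sketch of the export step (the original call is not shown):
write.csv(fnb_test, "submission_arima.csv", row.names = FALSE)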
Model Performance
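No-autocorrelation of the residuals is tested with the Ljung-Box test. A sketch of the call that matches the output below, assuming Box.test() with its default lag of 1:
Box.test(model_arima$residuals, type = "Ljung-Box")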
#>
#> Box-Ljung test
#>
#> data: model_arima$residuals
#> X-squared = 0.00021456, df = 1, p-value = 0.9883
From the results above, we obtain a p-value of 0.9883. Because the p-value > alpha (0.05), H0 is not rejected, so it can be concluded that the residuals have no autocorrelation.
Normality of the residuals, tested using the shapiro.test function.
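A sketch of the call that matches the output below:
shapiro.test(model_arima$residuals)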
#>
#> Shapiro-Wilk normality test
#>
#> data: model_arima$residuals
#> W = 0.99053, p-value = 8.6e-06
From the results above, we obtain a p-value of 0.0000086. Because the p-value < alpha (0.05), H0 is rejected, so it can be concluded that the residuals are not normally distributed.
How to handle it?
The residuals are not normally distributed. This could happen because our data is not large enough, but it does not mean the forecast performance is poor. We need to collect more data to build and feed into the models, so that the residuals become closer to normally distributed and we get better results in model evaluation and forecast performance.
The highest number of visitors occurs on Saturday at 20.00.
The fewest visitors come at 10.00 on every day of the week, when the outlet has just opened.