FnB Time Series
Introduction
As the final step of the machine learning course, each student must complete one case study as a machine learning capstone project. In this article I will walk through how I solved my capstone problem; for this project I chose a Food and Beverage case.
Customer behaviour, especially in the food and beverage industry, is highly related to seasonality patterns. The outlet owner wants to analyze the number of visitors (including dine-in, delivery, and takeaway transactions) so that he can make better judgments in 2018. Fortunately, time series analysis is enough to provide a good forecast and an explanation of the seasonality.
Dataset
The Food and Beverage dataset is provided by Dattabot and contains detailed transactions of multiple food and beverage outlets. Using this dataset, we are challenged to do forecasting and time series analysis to help the outlet’s owner make better business decisions.
The train dataset contains detailed transaction records from December 1st, 2017 to February 18th, 2018. The shop opens from 10 am to 10 pm every day.
The dataset includes information about:
- transaction_date: The timestamp of a transaction
- receipt_number: The ID of a transaction
- item_id: The ID of an item in a transaction
- item_group: The group ID of an item in a transaction
- item_major_group: The major-group ID of an item in a transaction
- quantity: The quantity of purchased item
- price_usd: The price of purchased item
- total_usd: The total price of purchased item
- payment_type: The payment method
- sales_type: The sales method
Library
library(dplyr)
library(lubridate)
library(tidyr)
library(ggplot2)
library(TSstudio)
library(forecast)
library(readr)
library(padr)
library(zoo)
library(ggsci)
library(xts)
library(tseries)
Data Preprocess
Rubrics
(2 Points) Demonstrate and explain how to properly do data aggregation.
- Do you need to aggregate/summarise the number of visitors before doing time series padding?
- Do you need to filter the time to certain hours after doing time series padding?
- Do you need to replace NA value?
(2 Points) Demonstrate how to properly do time series padding.
- Should you do time series padding? Yes
- Do you need to round the datetime into hour or minutes? hour
- When is the start and the end of the time interval for time series padding? “2017-12-01 10:00:00 UTC” “2018-02-17 22:00:00 UTC”
Pre-processing
- Read the source data file:
fnb <- read.csv("3. FnB-ts-visitor/data/data-train.csv")
head(fnb)
#> transaction_date receipt_number item_id item_group item_major_group
#> 1 2017-12-01T13:32:46Z A0026694 I10100139 noodle_dish food
#> 2 2017-12-01T13:32:46Z A0026694 I10500037 drinks beverages
#> 3 2017-12-01T13:33:39Z A0026695 I10500044 drinks beverages
#> 4 2017-12-01T13:33:39Z A0026695 I10400009 side_dish food
#> 5 2017-12-01T13:33:39Z A0026695 I10500046 drinks beverages
#> 6 2017-12-01T13:35:59Z A0026696 I10300002 pastry food
#> quantity price_usd total_usd payment_type sales_type
#> 1 1 7.33 7.33 cash dine_in
#> 2 1 4.12 4.12 cash dine_in
#> 3 1 2.02 2.02 cash dine_in
#> 4 1 5.60 5.60 cash dine_in
#> 5 1 3.01 3.01 cash dine_in
#> 6 1 4.86 4.86 cash dine_in
- Summary of data
glimpse(fnb)
#> Rows: 137,748
#> Columns: 10
#> $ transaction_date <chr> "2017-12-01T13:32:46Z", "2017-12-01T13:32:46Z", "2017…
#> $ receipt_number <chr> "A0026694", "A0026694", "A0026695", "A0026695", "A002…
#> $ item_id <chr> "I10100139", "I10500037", "I10500044", "I10400009", "…
#> $ item_group <chr> "noodle_dish", "drinks", "drinks", "side_dish", "drin…
#> $ item_major_group <chr> "food", "beverages", "beverages", "food", "beverages"…
#> $ quantity <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1,…
#> $ price_usd <dbl> 7.33, 4.12, 2.02, 5.60, 3.01, 4.86, 6.34, 7.58, 4.12,…
#> $ total_usd <dbl> 7.33, 4.12, 2.02, 5.60, 3.01, 4.86, 6.34, 7.58, 4.12,…
#> $ payment_type <chr> "cash", "cash", "cash", "cash", "cash", "cash", "cash…
#> $ sales_type <chr> "dine_in", "dine_in", "dine_in", "dine_in", "dine_in"…
- Change the data type of transaction_date into datetime, round it down to the hour using floor_date(), and save the result into a new column called td_hour.
fnb_data <- fnb %>%
mutate(transaction_date = ymd_hms(transaction_date),
td_hour = floor_date(transaction_date, unit = "hours")) %>%
select(transaction_date, receipt_number, td_hour)
# count distinct receipts per hour, pad missing hours, fill NAs with 0, keep opening hours
fnb_data <- fnb_data %>%
group_by(td_hour) %>%
summarise(visitor = n_distinct(receipt_number))
fnb_data <- fnb_data %>% pad()
fnb_data <- fnb_data %>%
mutate(visitor = na.fill(visitor, 0))
anyNA(fnb_data)
#> [1] FALSE
fnb_data <- fnb_data %>% filter(hour(td_hour) >= 10 & hour(td_hour) <= 22)
range(fnb_data$td_hour)
#> [1] "2017-12-01 13:00:00 UTC" "2018-02-18 22:00:00 UTC"
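# illustrative check (not in the original workflow): how many hourly rows does
# each date have after padding and filtering?
fnb_data %>%
mutate(date = as.Date(td_hour)) %>%
count(date) %>%
head()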
tail(fnb_data, 26)
#> # A tibble: 26 × 2
#> td_hour visitor
#> <dttm> <int>
#> 1 2018-02-17 10:00:00 2
#> 2 2018-02-17 11:00:00 11
#> 3 2018-02-17 12:00:00 10
#> 4 2018-02-17 13:00:00 24
#> 5 2018-02-17 14:00:00 13
#> 6 2018-02-17 15:00:00 33
#> 7 2018-02-17 16:00:00 31
#> 8 2018-02-17 17:00:00 23
#> 9 2018-02-17 18:00:00 32
#> 10 2018-02-17 19:00:00 55
#> # … with 16 more rows
Seasonality Analysis
Rubrics
(2 Points) Compare multiple time series decomposition approaches.
- Can you decompose the time series into the observed data, trend, hourly seasonality, weekly seasonality, and the residuals?
(2 Points) Reported interpretable hourly and weekly seasonality.
- Can you create a better visualization of hourly and weekly seasonality?
- How do you interpret the seasonality? Describe the interpretation.
Observation & Decomposition
Create a time series object using the ts() function with frequency 13, where 13 is the number of hours the outlet is open every day, from 10.00 to 22.00.
fnb_ts <- ts(data = fnb_data$visitor, frequency = 13)
Visualize the data to see its distribution:
autoplot(fnb_ts) + theme_minimal()
To see the trend of the data we decompose it; for readability, the plot below shows the decomposition of only the last month (13*7*4 hourly observations, i.e. four weeks).
fnb_decom <- decompose(x = fnb_ts)
fnb_ts %>% tail(13*7*4) %>% decompose() %>% autoplot()
In the plot above, the estimated trend component still shows a clear pattern, which suggests it contains un-captured extra seasonality; the data can therefore be considered multi-seasonal. To handle the multi-seasonality, we convert the data into a “Multiple Seasonality Time Series”, which accepts multiple frequency settings.
Time Series with Multiple Seasonality
Create a “Multi-seasonality Time Series” object using the msts() function with two seasonal periods, daily (13) and weekly (13*7); this captures both daily and weekly seasonality. Then decompose and plot.
fnb_msts <- msts(fnb_data$visitor, seasonal.periods = c(13, 13*7))
autoplot(fnb_msts)
fnb_msts_decom <- mstl(fnb_msts)
fnb_msts %>% tail(4*7*13) %>% stl(s.window = "periodic") %>% autoplot()
Based on the plot above, the estimated trend of the “Multiple Seasonality Time Series” is smoother and clearer, and the daily and weekly seasonal components are also clearer and easier to interpret for further analysis.
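For reference, the mstl() result stored in fnb_msts_decom is reused in the weekly seasonality plot below; a quick, purely illustrative peek at its columns explains the Seasonal13 and Seasonal91 names referenced there.
# the decomposition contains one seasonal column per period (Seasonal13, Seasonal91)
head(as.data.frame(fnb_msts_decom))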
Daily by Hour
fnb_data %>%
mutate(seasonal = fnb_decom$seasonal,
hour = hour(td_hour)) %>%
distinct(hour, seasonal) %>%
ggplot(mapping = aes(x = hour, y = seasonal)) +
geom_col() +
theme_minimal() +
scale_x_continuous(breaks = seq(10, 22, 1)) +
labs(
title = "Single Seasonality Analysis",
subtitle = "Daily"
)
From this daily chart, the peak time of the outlet is between 19.00 and 22.00, with 20.00 being the hour when most visitors come. The fewest visitors come at 10.00, when the outlet has just opened.
Weekly by Hour
as.data.frame(fnb_msts_decom) %>%
mutate(tdate = fnb_data$td_hour) %>%
mutate(
Day_of_Week = wday(tdate, label = T, abbr = F),
Hour = as.factor(hour(tdate))
) %>%
group_by(Day_of_Week, Hour) %>%
summarise(Seasonal = sum(Seasonal13 + Seasonal91)) %>%
ggplot() +
geom_bar(aes(x = Hour, y = Seasonal, fill = Day_of_Week), stat ="identity", position = "stack", width = 0.7) +
scale_fill_locuszoom() +
labs(
title = "Multi Seasonality Analysis",
subtitle = "Weekly"
)
From this weekly chart, the peak time of the outlet is between 19.00 and 22.00 on every day of the week, and Saturday at 20.00 is the hour when most visitors come. On every day of the week, 10.00 is the hour with the fewest visitors, as the outlet has just opened.
Model Fitting and Evaluation
Model fitting and evaluation will use the “Multi Seasonality Time Series”, as it captures the multi-seasonal structure and should provide better model and forecast performance.
Rubrics
(4 Points) Demonstrate and explain how to prepare cross-validation data for this case.
- Do you need to do cross validation before doing time series analysis? Yes
- How do you split the data into training and testing dataset?
(4 Points) Demonstrate and explain how to properly do model fitting and evaluation.
- What data preprocessing you used before fitting the model?
- What time series model did you use?
- Can you visualize the actual vs estimated number of visitors?
- How to evaluate the model performance?
(4 Points) Compare multiple model specifications.
- How many forecasting models will you use?
- Will you use exponential smoothing? Will you use ARIMA?
- How to evaluate the model performance?
- Can you visualize the actual vs estimated number of visitors?
Cross-Validation
The cross-validation scheme for time series should not be sampled randomly but split sequentially. In this case we will split the data into a test set (1 week = 7 x 13 hours) and a train set (all remaining data).
test_fnb <- tail(fnb_msts, 7*13)
train_fnb <- head(fnb_msts, length(fnb_msts) - length(test_fnb))
Build Model
Based on our earlier exploratory data analysis, we know that the data have a trend and seasonality. The methods we will use in this case are Holt-Winters (Triple Exponential Smoothing) and ARIMA.
Triple Exponential Smoothing
This method extends double exponential smoothing by adding a seasonal smoothing factor. We know that our seasonal pattern is additive.
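As a small illustrative sketch (not part of the original code), the additive seasonal component can also be requested explicitly; this call is equivalent to the default used in the model below.
# illustrative: explicitly requesting the additive seasonal component (same as the default)
fnb_holtw_add <- HoltWinters(x = train_fnb, seasonal = "additive")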
Model
fnb_holtw <- HoltWinters(x = train_fnb)
fnb_holtw
#> Holt-Winters exponential smoothing with trend and additive seasonal component.
#>
#> Call:
#> HoltWinters(x = train_fnb)
#>
#> Smoothing parameters:
#> alpha: 0.0191209
#> beta : 0.004309985
#> gamma: 0.3818075
#>
#> Coefficients:
#> [,1]
#> a 28.742874555
#> b 0.003078526
#> s1 -21.060291253
#> s2 -20.536169771
#> s3 -11.490760138
#> s4 -14.309376753
#> s5 -12.051714988
#> s6 -8.869880294
#> s7 -13.590677807
#> s8 -6.209037821
#> s9 1.185451522
#> s10 7.155888720
#> s11 16.277672145
#> s12 16.088114917
#> s13 8.643299764
#> s14 -22.118531380
#> s15 -17.763132319
#> s16 -12.553281087
#> s17 -11.656781541
#> s18 -12.371434004
#> s19 -8.661168586
#> s20 -8.817953402
#> s21 -2.879736919
#> s22 -0.056961969
#> s23 12.710839700
#> s24 13.777433791
#> s25 17.906803201
#> s26 14.406581401
#> s27 -18.922446615
#> s28 -12.565014946
#> s29 -9.356182712
#> s30 -13.839585571
#> s31 -8.542446012
#> s32 -9.490214843
#> s33 -3.866236183
#> s34 -1.026496476
#> s35 1.319236358
#> s36 13.456542623
#> s37 22.548377622
#> s38 15.300154075
#> s39 13.329446117
#> s40 -19.035228290
#> s41 -14.638255462
#> s42 -8.437600318
#> s43 -12.156153819
#> s44 -10.298551447
#> s45 -6.347554091
#> s46 -7.056881557
#> s47 -7.720502224
#> s48 0.434764095
#> s49 20.786816411
#> s50 28.207612442
#> s51 27.873888077
#> s52 14.983282452
#> s53 -27.737659196
#> s54 -27.729972674
#> s55 -27.724824158
#> s56 -23.003337888
#> s57 -10.813697302
#> s58 -7.804865668
#> s59 -10.803056362
#> s60 -7.191090968
#> s61 7.484375853
#> s62 22.008565686
#> s63 35.071541032
#> s64 30.188928999
#> s65 21.888818528
#> s66 -18.715994900
#> s67 -8.756871352
#> s68 -1.866696104
#> s69 -2.836979178
#> s70 0.875665978
#> s71 2.267971701
#> s72 1.154065726
#> s73 6.994975901
#> s74 11.725920689
#> s75 24.791812514
#> s76 42.229997336
#> s77 37.857742447
#> s78 25.251908553
#> s79 -21.185174031
#> s80 -9.669197974
#> s81 -2.906681297
#> s82 0.040076532
#> s83 2.839796721
#> s84 6.351161672
#> s85 5.365133700
#> s86 11.824289721
#> s87 13.989226363
#> s88 27.915899182
#> s89 38.474663936
#> s90 28.471598245
#> s91 18.852043126
Forecast
# Forecast
forecast_holtw <- forecast(object = fnb_holtw, h = 13*7)
Plot
# Plot
train_fnb %>%
autoplot(series = "Actual") +
autolayer(test_fnb, series = "Actual test") +
autolayer(fnb_holtw$fitted[,1], series = "Fitted values (additive)") +
autolayer(forecast_holtw$mean, series = "Predict data test (additive)")
test_forecast(actual = fnb_msts,
forecast.obj = forecast_holtw,
train = train_fnb,
test = test_fnb)
ARIMA
ARIMA is an abbreviation of Autoregressive Integrated Moving Average. While exponential smoothing methods do not make any assumptions about correlations between successive values of the time series, in some cases a better predictive model can be built by taking those correlations into account. ARIMA models include an explicit statistical model for the irregular component of a time series, which allows for non-zero autocorrelation in the irregular component.
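As an illustration of the correlation structure that ARIMA can exploit (not part of the original analysis), the autocorrelation and partial autocorrelation of the training series can be inspected with the forecast package:
# illustrative: inspect (partial) autocorrelation of the training series
ggAcf(train_fnb, lag.max = 13 * 7) + theme_minimal()
ggPacf(train_fnb, lag.max = 13 * 7) + theme_minimal()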
Differencing Time Series
ARIMA models are defined for stationary time series. Therefore, if we start off with a non-stationary time series, we will need to ‘difference’ it until we obtain a stationary one.
We check whether our time series data is stationary:
- H0: Data is not stationary
- H1: Data is stationary
If the p-value is > 0.05, we fail to reject H0 and treat the data as not stationary.
dts <- adf.test(x = fnb_msts)
dts
#>
#> Augmented Dickey-Fuller Test
#>
#> data: fnb_msts
#> Dickey-Fuller = -9.0527, Lag order = 10, p-value = 0.01
#> alternative hypothesis: stationary
We got a p-value of 0.01, which is less than 0.05, so our data is considered stationary.
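Had the test indicated non-stationarity, the series could have been differenced before modelling. The sketch below is illustrative only, since no differencing is needed here:
# illustrative: differencing a non-stationary series and re-checking stationarity
n_diff <- ndiffs(fnb_msts)                        # suggested number of differences
fnb_diff <- diff(fnb_msts, differences = max(1, n_diff))
adf.test(fnb_diff)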
Model
We build the ARIMA model using the auto.arima() function, which automatically searches for an optimal ARIMA specification.
fnb_auto_arima <- auto.arima(train_fnb, seasonal = T)
Forecast
# Forecast
forecast_auto_arima <- forecast(fnb_auto_arima, h = 13*7)
Plot
# Plot
train_fnb %>%
autoplot(series = "Actual") +
autolayer(test_fnb, series = "Actual Test") +
autolayer(forecast_auto_arima$mean, series = "Predict Auto ARIMA")
test_forecast(actual = fnb_msts,
forecast.obj = forecast_auto_arima,
train = train_fnb,
test = test_fnb)
STLM (ARIMA)
STLM is an abbreviation of Seasonal and Trend decomposition using Loess Modelling. The objective of STL is to extract a seasonal component, a trend component, and a remainder component from a time series; it does this by repeatedly applying Loess smoothing to the series. stlm() then fits a forecasting model (here ARIMA) to the seasonally adjusted data and re-seasonalizes the forecasts.
Model
fnb_stlm_arima <- stlm(train_fnb, method = "arima")
Forecast
# Forecast
forecast_stlm_arima <- forecast(fnb_stlm_arima, h = 13*7)
Plot
# Plot
train_fnb %>%
autoplot(series = "Actual") +
autolayer(test_fnb, series = "Actual Test") +
autolayer(forecast_stlm_arima$mean, series = "Predict STLM ARIMA")
test_forecast(actual = fnb_msts,
forecast.obj = forecast_stlm_arima,
train = train_fnb,
test = test_fnb)
Model Evaluation
Evaluate All Models Performance
Triple Exponential Smoothing
# evaluation data train (Additive)
accuracy(fnb_holtw$fitted[,1], train_fnb)
#> ME RMSE MAE MPE MAPE ACF1 Theil's U
#> Test set 0.3537315 7.113313 5.286998 NaN Inf 0.2490194 0
# evaluation data test (Additive)
acc_holtw <- accuracy(forecast_holtw$mean, test_fnb)
Auto ARIMA
# evaluation data train (Auto ARIMA)
accuracy(fnb_auto_arima$fitted, train_fnb)
#> ME RMSE MAE MPE MAPE ACF1 Theil's U
#> Test set -0.1706769 8.008848 5.720017 NaN Inf -0.0007779329 0
# evaluation data test (Auto ARIMA)
acc_auto_arima <- accuracy(forecast_auto_arima$mean, test_fnb)
STLM (ARIMA)
# evaluation data train (STLM ARIMA)
accuracy(fnb_stlm_arima$fitted, train_fnb)
#> ME RMSE MAE MPE MAPE ACF1 Theil's U
#> Test set -0.02985275 5.14097 3.895996 NaN Inf -0.0002734328 0
# evaluation data test (STLM ARIMA)
acc_stlm_arima <- accuracy(forecast_stlm_arima$mean, test_fnb)
result <- rbind(acc_holtw, acc_auto_arima, acc_stlm_arima)
rownames(result) <- c("Triple Exponential Smoothing Model", "Auto ARIMA Model", "STLM ARIMA Model")
result
#> ME RMSE MAE MPE MAPE
#> Triple Exponential Smoothing Model -2.898399 8.154527 6.451498 -Inf Inf
#> Auto ARIMA Model -1.616111 9.412900 7.412500 Inf Inf
#> STLM ARIMA Model -2.115958 7.323714 5.694023 NaN Inf
#> ACF1 Theil's U
#> Triple Exponential Smoothing Model 0.4680130 0
#> Auto ARIMA Model 0.4217281 0
#> STLM ARIMA Model 0.3436756 0
Based on the accuracy summary above, the “STLM ARIMA Model” has the best accuracy, with an MAE of 5.694023 (lower than 6), so this model will be used for the prediction.
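As an illustrative sanity check (not part of the original analysis), the MAE of the chosen model can also be computed directly from the forecast and the test set; it should match the accuracy() output above.
# illustrative: manual MAE for the STLM (ARIMA) forecast on the test set
mean(abs(as.vector(test_fnb) - as.vector(forecast_stlm_arima$mean)))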
Visualization Actual vs Estimated
Data Preparation for the Plot :
accuracyData <- data.frame(
tdate = fnb_data$td_hour %>% tail(13*7),
actual = as.vector(test_fnb),
holtw_forecast = as.vector(forecast_holtw$mean),
auto_arima_forecast = as.vector(forecast_auto_arima$mean),
stlm_arima_forecast = as.vector(forecast_stlm_arima$mean)
)
Visualization of Actual vs Estimated number of visitors (Best Model)
accuracyData %>%
ggplot() +
geom_line(aes(x = tdate, y = actual, colour = "Actual"), size = 1) +
geom_line(aes(x = tdate, y = stlm_arima_forecast, colour = "STLM Arima Model (Best Model)"), size = 1)+
labs(
title = "Actual vs Arima Model (Best models)",
subtitle = "Hourly visitor",
x = "Date",
y = "Visitor",
colour = ""
)
Visualization of Actual vs Estimated number of visitors (All Models)
accuracyData %>%
ggplot() +
geom_line(aes(x = tdate, y = actual, colour = "Actual"), size = 0.5) +
geom_line(aes(x = tdate, y = holtw_forecast, colour = "Triple Exponential Smoothing Model"), size = 0.1) +
geom_line(aes(x = tdate, y = auto_arima_forecast, colour = "Auto ARIMA Model"), size = 0.1) +
geom_line(aes(x = tdate, y = stlm_arima_forecast, colour = "STLM ARIMA Model (Best Model)"), size = 0.5)+
labs(
title = "Actual vs All Models",
subtitle = "Hourly visitor",
x = "Date",
y = "Visitor",
colour = "")Prediction Performance
Rubrics
- (6 Points) Reached MAE < 6 in (your own) validation dataset.
- (6 Points) Reached MAE < 6 in test dataset.
Based on Cross-Validation dataset
The best model is the “STLM ARIMA Model”, as it has an MAE of 5.694023, the smallest MAE among the models built.
accuracy(forecast_stlm_arima$mean, test_fnb)
#> ME RMSE MAE MPE MAPE ACF1 Theil's U
#> Test set -2.115958 7.323714 5.694023 NaN Inf 0.3436756 0
Based on Test dataset
- Read the given test data:
data_test <- read_csv("3. FnB-ts-visitor/data/data-test.csv")
head(data_test, 26)
#> # A tibble: 26 × 2
#> datetime visitor
#> <dttm> <lgl>
#> 1 2018-02-19 10:00:00 NA
#> 2 2018-02-19 11:00:00 NA
#> 3 2018-02-19 12:00:00 NA
#> 4 2018-02-19 13:00:00 NA
#> 5 2018-02-19 14:00:00 NA
#> 6 2018-02-19 15:00:00 NA
#> 7 2018-02-19 16:00:00 NA
#> 8 2018-02-19 17:00:00 NA
#> 9 2018-02-19 18:00:00 NA
#> 10 2018-02-19 19:00:00 NA
#> # … with 16 more rows
- Change data type
data_test <- data_test %>%
mutate(datetime = ymd_hms(datetime) %>%
floor_date(unit = "hour"))
- Model used for Forecasting
model_stlm_arima <- stlm(fnb_msts, method = "arima")
- Submission Template
# Forecast the target using your model
forecast_mod <- forecast(model_stlm_arima, h = 13*7)
# Create submission data
submission <- data_test %>%
mutate(visitor = forecast_mod$mean)
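# illustrative sanity checks (not in the original code): the submission should
# cover the full forecast horizon of 7 x 13 = 91 hourly rows with no NA values
nrow(submission) == 13 * 7
anyNA(submission$visitor)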
# save data
write.csv(submission, "submission-aisah.csv", row.names = F)
# check the first 26 rows
head(submission, 26)
#> # A tibble: 26 × 2
#> datetime visitor
#> <dttm> <dbl>
#> 1 2018-02-19 10:00:00 4.27
#> 2 2018-02-19 11:00:00 6.95
#> 3 2018-02-19 12:00:00 12.8
#> 4 2018-02-19 13:00:00 15.7
#> 5 2018-02-19 14:00:00 11.1
#> 6 2018-02-19 15:00:00 20.0
#> 7 2018-02-19 16:00:00 21.8
#> 8 2018-02-19 17:00:00 20.4
#> 9 2018-02-19 18:00:00 24.2
#> 10 2018-02-19 19:00:00 40.8
#> # … with 16 more rows
- Upload the submission file into the “Scoring Dashboard Portal”; it achieves the MAE target (less than 6), with an MAE of 4.82.
Scoring Dashboard Portal
Conclusion
In order to validate the forecast, several model assumptions need to be checked:
Rubrics
(4 Points) Assumption Checking
- Does the model meet the no-autocorrelation assumption?
- What about the normality of the residuals?
- If the assumptions are not met, what is the cause? How do we handle that?
- Based on the seasonality, when are visitor numbers highest?
No-Autocorrelation
To check for no-autocorrelation we use the Box.test() function. We expect the p-value from our model to be > 0.05 so that H0 is not rejected.
- H0 : No autocorrelation in the forecast errors
- H1 : there is an autocorrelation in the forecast errors
Box.test(x = model_stlm_arima$residuals, type = "Ljung-Box")
#>
#> Box-Ljung test
#>
#> data: model_stlm_arima$residuals
#> X-squared = 0.00010908, df = 1, p-value = 0.9917
p-value = 0.9917 > 0.05 (alpha), so we conclude that the residuals have no autocorrelation.
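For a combined visual and statistical residual check, the forecast package’s checkresiduals() function could also be used; this is an illustrative addition, not part of the original analysis.
# illustrative: time plot, ACF and Ljung-Box test of the residuals in one call
checkresiduals(model_stlm_arima$residuals)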
Normality Residuals
The function we use for this check is shapiro.test(), and we expect a p-value > 0.05 so that H0 is not rejected.
- H0 : residuals are normally distributed
- H1 : residuals are not normally distributed
shapiro.test(model_stlm_arima$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: model_stlm_arima$residuals
#> W = 0.99139, p-value = 0.000009359
p-value = 0.000009359 < 0.05 (alpha), so we conclude that the residuals are not normally distributed.
Root cause and how to handle it: the residuals are not normally distributed, which could be because our dataset is not large enough; it does not necessarily mean the forecast performance is poor. Collecting more data to build and feed into the models should bring the residuals closer to a normal distribution and improve the model evaluation and forecast performance.
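To support this discussion visually (an illustrative addition, not in the original analysis), the residual distribution can be inspected with a histogram and a normal Q-Q plot:
# illustrative: visual check of the residual distribution
hist(model_stlm_arima$residuals, breaks = 30,
main = "STLM (ARIMA) residuals", xlab = "Residual")
qqnorm(model_stlm_arima$residuals)
qqline(model_stlm_arima$residuals)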
Summary
- The peak time of the outlet is between 19.00 and 22.00 on every day of the week.
- Saturday at 20.00 is the hour when most visitors come.
- On every day of the week, 10.00 is the hour with the fewest visitors, as the outlet has just opened.