FnB Time Series

Introduction

As the final step of the machine learning course, each student has to complete one case as a machine learning capstone project. In this article, I will walk through how I solved the problem I chose for my capstone project: a Food and Beverage (FnB) case.

Customer behaviour, especially in the food and beverage industry, is strongly related to seasonal patterns. The outlet's owner wants to analyze the number of visitors (including dine-in, delivery, and takeaway transactions) so he can make better judgments in 2018. Fortunately, time series analysis is enough to provide a good forecast and an explanation of the seasonality.

Dataset

The Food and Beverage dataset is provided by Dattabot and contains detailed transactions from multiple food and beverage outlets. Using this dataset, we are challenged to do forecasting and time series analysis to help the outlet's owner make better business decisions.

The train dataset contains detailed transactions from December 1st, 2017 to February 18th, 2018. The shop opens from 10 am to 10 pm every day.

The dataset includes information about:

  • transaction_date: The timestamp of a transaction
  • receipt_number: The ID of a transaction
  • item_id: The ID of an item in a transaction
  • item_group: The group ID of an item in a transaction
  • item_major_group: The major-group ID of an item in a transaction
  • quantity: The quantity of the purchased item
  • price_usd: The price of the purchased item (in USD)
  • total_usd: The total price of the purchased item (in USD)
  • payment_type: The payment method
  • sales_type: The sales method

Library

library(dplyr)
library(lubridate)
library(tidyr)
library(ggplot2)
library(TSstudio) 
library(forecast)
library(readr)
library(padr)
library(zoo)
library(ggsci)
library(xts)
library(tseries)

Data Preprocess

Rubrics

  • (2 Points) Demonstrate and explain how to properly do data aggregation.

    • Do you need to aggregate/summarise the number of visitors before doing time series padding?
    • Do you need to filter the time to certain hours after doing time series padding?
    • Do you need to replace NA value?
  • (2 Points) Demonstrate how to properly do time series padding.

    • Should you do time series padding? Yes
    • Do you need to round the datetime into hour or minutes? hour
    • When is the start and the end of the time interval for time series padding? “2017-12-01 10:00:00 UTC” “2018-02-17 22:00:00 UTC”

Pre-processing

  • Read the source data file :
fnb <- read.csv("3. FnB-ts-visitor/data/data-train.csv")
head(fnb)
#>       transaction_date receipt_number   item_id  item_group item_major_group
#> 1 2017-12-01T13:32:46Z       A0026694 I10100139 noodle_dish             food
#> 2 2017-12-01T13:32:46Z       A0026694 I10500037      drinks        beverages
#> 3 2017-12-01T13:33:39Z       A0026695 I10500044      drinks        beverages
#> 4 2017-12-01T13:33:39Z       A0026695 I10400009   side_dish             food
#> 5 2017-12-01T13:33:39Z       A0026695 I10500046      drinks        beverages
#> 6 2017-12-01T13:35:59Z       A0026696 I10300002      pastry             food
#>   quantity price_usd total_usd payment_type sales_type
#> 1        1      7.33      7.33         cash    dine_in
#> 2        1      4.12      4.12         cash    dine_in
#> 3        1      2.02      2.02         cash    dine_in
#> 4        1      5.60      5.60         cash    dine_in
#> 5        1      3.01      3.01         cash    dine_in
#> 6        1      4.86      4.86         cash    dine_in
  • Summary of data
glimpse(fnb)
#> Rows: 137,748
#> Columns: 10
#> $ transaction_date <chr> "2017-12-01T13:32:46Z", "2017-12-01T13:32:46Z", "2017…
#> $ receipt_number   <chr> "A0026694", "A0026694", "A0026695", "A0026695", "A002…
#> $ item_id          <chr> "I10100139", "I10500037", "I10500044", "I10400009", "…
#> $ item_group       <chr> "noodle_dish", "drinks", "drinks", "side_dish", "drin…
#> $ item_major_group <chr> "food", "beverages", "beverages", "food", "beverages"…
#> $ quantity         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1,…
#> $ price_usd        <dbl> 7.33, 4.12, 2.02, 5.60, 3.01, 4.86, 6.34, 7.58, 4.12,…
#> $ total_usd        <dbl> 7.33, 4.12, 2.02, 5.60, 3.01, 4.86, 6.34, 7.58, 4.12,…
#> $ payment_type     <chr> "cash", "cash", "cash", "cash", "cash", "cash", "cash…
#> $ sales_type       <chr> "dine_in", "dine_in", "dine_in", "dine_in", "dine_in"…
  • Change the data type of transaction_date to datetime, round it down to the hour using floor_date(), and save the result in a new column called td_hour.
# parse the timestamp and round it down to the hour
fnb_data <- fnb %>% 
  mutate(transaction_date = ymd_hms(transaction_date),
         td_hour = floor_date(transaction_date, unit = "hours")) %>%
  select(transaction_date, receipt_number, td_hour)

# aggregate: the number of visitors per hour = the number of distinct receipts
fnb_data <- fnb_data %>% 
  group_by(td_hour) %>% 
  summarise(visitor = n_distinct(receipt_number)) 

# pad the series so that every hour in the observed range is present
fnb_data <- fnb_data %>% pad()

# replace the NA values introduced by padding with 0 visitors
fnb_data <- fnb_data %>% 
  mutate(visitor = na.fill(visitor, 0))
anyNA(fnb_data)
#> [1] FALSE

# keep only the opening hours (10.00 - 22.00)
fnb_data <- fnb_data %>% filter(hour(td_hour) >= 10 & hour(td_hour) <= 22) 
range(fnb_data$td_hour)
#> [1] "2017-12-01 13:00:00 UTC" "2018-02-18 22:00:00 UTC"
tail(fnb_data, 26)
#> # A tibble: 26 × 2
#>    td_hour             visitor
#>    <dttm>                <int>
#>  1 2018-02-17 10:00:00       2
#>  2 2018-02-17 11:00:00      11
#>  3 2018-02-17 12:00:00      10
#>  4 2018-02-17 13:00:00      24
#>  5 2018-02-17 14:00:00      13
#>  6 2018-02-17 15:00:00      33
#>  7 2018-02-17 16:00:00      31
#>  8 2018-02-17 17:00:00      23
#>  9 2018-02-17 18:00:00      32
#> 10 2018-02-17 19:00:00      55
#> # … with 16 more rows
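
Note that pad() above only fills gaps between the first and last observed transaction, so the padded series starts at 13.00 on December 1st rather than at opening time. If the padding should start exactly at the rubric's opening hour, padr::pad() accepts an explicit start_val; a minimal sketch of that variant (fnb_padded is a hypothetical name and the start value is taken from the rubric):

# hypothetical variant: extend the padding back to the opening hour of day one
fnb_padded <- fnb %>% 
  mutate(td_hour = floor_date(ymd_hms(transaction_date), unit = "hours")) %>% 
  group_by(td_hour) %>% 
  summarise(visitor = n_distinct(receipt_number)) %>% 
  pad(start_val = ymd_hms("2017-12-01 10:00:00")) %>% 
  mutate(visitor = na.fill(visitor, 0)) %>% 
  filter(hour(td_hour) >= 10, hour(td_hour) <= 22)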

Seasonality Analysis

Rubrics

  • (2 Points) Compare multiple time series decomposition approach.

    • Can you decompose the time series into the observed data, trend, hourly seasonality, weekly seasonality, and the residuals?
  • (2 Points) Reported interpretable hourly and weekly seasonality.

    • Can you create a better visualization of hourly and weekly seasonality?
    • How do you interpret the seasonality? Describe the interpretation.

Observation & Decomposition

Create a time series object using the ts() function with frequency 13, where 13 is the number of hours the outlet is open every day (from 10.00 to 22.00).

fnb_ts <- ts(data = fnb_data$visitor, frequency = 13)

Visualize the data to see its overall pattern:

autoplot(fnb_ts) + theme_minimal()

To see the trend of the data, it is enough to decompose the series and plot roughly one month of it (the last four weeks), since that window already contains the repeating seasonal pattern.

fnb_decom <- decompose(x = fnb_ts)

fnb_ts %>% tail(13*7*4) %>% decompose() %>% autoplot()

From the plot above, the estimated trend component still shows a repeating pattern, which suggests un-captured extra seasonality; the data can therefore be considered multi-seasonal. To handle the multi-seasonality, we convert the data into a multiple seasonality time series, which accepts multiple frequency settings.

Time Series with Multiple Seasonality

Create a multiple seasonality time series object using the msts() function with daily (13) and weekly (13*7 = 91) frequencies, so that both daily and weekly seasonality are captured. Then decompose and plot it.

fnb_msts <- msts(fnb_data$visitor, seasonal.periods = c(13, 13*7))

autoplot(fnb_msts)

fnb_msts_decom  <- mstl(fnb_msts)

fnb_msts %>% tail(4*7*13) %>% stl(s.window = "periodic") %>% autoplot()

Based on the plot above, the estimated trend from the multiple seasonality time series is now smoother and clearer. The daily and weekly seasonal components are also clearer and easier to interpret for further analysis.

Daily by Hour

fnb_data %>% 
  mutate(seasonal = fnb_decom$seasonal,
         hour = hour(td_hour)) %>%
  distinct(hour, seasonal) %>%
  ggplot(mapping = aes(x = hour, y = seasonal)) +
  geom_col() +
  theme_minimal() +
  scale_x_continuous(breaks = seq(10, 22, 1)) +
  labs(
    title = "Single Seasonality Analysis",
    subtitle = "Daily"
  )

From this daily chart, the peak time of the outlet is between 19.00 and 22.00, with 20.00 being the hour when the most visitors come. The fewest visitors come at 10.00, when the outlet has just opened.

Weekly by Hour

as.data.frame(fnb_msts_decom) %>%
  mutate(tdate = fnb_data$td_hour) %>%
  mutate(
    Day_of_Week = wday(tdate, label = T, abbr = F),
    Hour = as.factor(hour(tdate))
  ) %>%
  group_by(Day_of_Week, Hour) %>%
  summarise(Seasonal = sum(Seasonal13 + Seasonal91)) %>%
  ggplot() +
  geom_bar(aes(x = Hour, y = Seasonal, fill = Day_of_Week), stat ="identity", position = "stack", width = 0.7) +
  scale_fill_locuszoom() +
  labs(
    title = "Multi Seasonality Analysis",
    subtitle = "Weekly"
  )

From this weekly chart, the peak time of the outlet is between 19.00 and 22.00 on every day of the week, and Saturday at 20.00 is when the most visitors come. The fewest visitors come at 10.00 on every day of the week, when the outlet has just opened.
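
This reading of the seasonal components can be cross-checked against the raw data; a quick sketch (an extra check, not part of the original analysis) that simply averages the hourly visitor counts per day of week:

# average visitors per day of week and hour, sorted from busiest to quietest
fnb_data %>% 
  mutate(day = wday(td_hour, label = TRUE), hour = hour(td_hour)) %>% 
  group_by(day, hour) %>% 
  summarise(avg_visitor = mean(visitor), .groups = "drop") %>% 
  arrange(desc(avg_visitor)) %>% 
  head(10)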

Model Fitting and Evaluation

Model fitting and evaluation will use the multiple seasonality time series, since it captures the multi-seasonal structure and should give better model and forecast performance.

Rubrics

  • (4 Points) Demonstrate and explain how to prepare cross-validation data for this case.

    • Do you need to do cross validation before doing time series analysis? Yes
    • How do you split the data into training and testing dataset?
  • (4 Points) Demonstrate and explain how to properly do model fitting and evaluation.

    • What data preprocessing you used before fitting the model?
    • What time series model did you use?
    • Can you visualize the actual vs estimated number of visitors?
    • How do you evaluate the model performance?
  • (4 Points) Compare multiple model specifications.

    • How many forecasting models will you use?
    • Will you use exponential smoothing? Will you use ARIMA?
    • How do you evaluate the model performance?
    • Can you visualize the actual vs estimated number of visitors?

Cross-Validation

The cross-validation scheme for a time series should not be sampled randomly; the data must be split sequentially. In this case we split the data into a test set (1 week = 7 x 13 hours) and a training set (all remaining data).

test_fnb <- tail(fnb_msts, 7*13)
train_fnb <- head(fnb_msts, length(fnb_msts)-length(test_fnb))
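
The same sequential split can also be produced with TSstudio (already loaded above); a small equivalent sketch, assuming ts_split() from that package:

# equivalent split: hold out the last 7 x 13 hours as the test set
split_fnb <- ts_split(fnb_msts, sample.out = 7*13)
# split_fnb$train and split_fnb$test match train_fnb and test_fnb above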

Build Model

Based on our exploratory data analysis, we know that the data have both trend and seasonality. The methods we will use in this case are Holt-Winters (triple exponential smoothing) and ARIMA.

Triple Exponential Smoothing

This method extends double exponential smoothing by adding a seasonal smoothing factor. We know that our seasonal pattern is additive.
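
An additive seasonal component means the seasonal swings stay roughly constant in size rather than scaling with the level of the series. HoltWinters() uses an additive seasonal component by default, but it can also be requested explicitly; a small sketch, equivalent to the model fitted below:

# additive seasonal component (also the default behaviour of HoltWinters)
fnb_holtw_add <- HoltWinters(x = train_fnb, seasonal = "additive")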

Model

fnb_holtw <- HoltWinters(x = train_fnb)

fnb_holtw
#> Holt-Winters exponential smoothing with trend and additive seasonal component.
#> 
#> Call:
#> HoltWinters(x = train_fnb)
#> 
#> Smoothing parameters:
#>  alpha: 0.0191209
#>  beta : 0.004309985
#>  gamma: 0.3818075
#> 
#> Coefficients:
#>              [,1]
#> a    28.742874555
#> b     0.003078526
#> s1  -21.060291253
#> s2  -20.536169771
#> s3  -11.490760138
#> s4  -14.309376753
#> s5  -12.051714988
#> s6   -8.869880294
#> s7  -13.590677807
#> s8   -6.209037821
#> s9    1.185451522
#> s10   7.155888720
#> s11  16.277672145
#> s12  16.088114917
#> s13   8.643299764
#> s14 -22.118531380
#> s15 -17.763132319
#> s16 -12.553281087
#> s17 -11.656781541
#> s18 -12.371434004
#> s19  -8.661168586
#> s20  -8.817953402
#> s21  -2.879736919
#> s22  -0.056961969
#> s23  12.710839700
#> s24  13.777433791
#> s25  17.906803201
#> s26  14.406581401
#> s27 -18.922446615
#> s28 -12.565014946
#> s29  -9.356182712
#> s30 -13.839585571
#> s31  -8.542446012
#> s32  -9.490214843
#> s33  -3.866236183
#> s34  -1.026496476
#> s35   1.319236358
#> s36  13.456542623
#> s37  22.548377622
#> s38  15.300154075
#> s39  13.329446117
#> s40 -19.035228290
#> s41 -14.638255462
#> s42  -8.437600318
#> s43 -12.156153819
#> s44 -10.298551447
#> s45  -6.347554091
#> s46  -7.056881557
#> s47  -7.720502224
#> s48   0.434764095
#> s49  20.786816411
#> s50  28.207612442
#> s51  27.873888077
#> s52  14.983282452
#> s53 -27.737659196
#> s54 -27.729972674
#> s55 -27.724824158
#> s56 -23.003337888
#> s57 -10.813697302
#> s58  -7.804865668
#> s59 -10.803056362
#> s60  -7.191090968
#> s61   7.484375853
#> s62  22.008565686
#> s63  35.071541032
#> s64  30.188928999
#> s65  21.888818528
#> s66 -18.715994900
#> s67  -8.756871352
#> s68  -1.866696104
#> s69  -2.836979178
#> s70   0.875665978
#> s71   2.267971701
#> s72   1.154065726
#> s73   6.994975901
#> s74  11.725920689
#> s75  24.791812514
#> s76  42.229997336
#> s77  37.857742447
#> s78  25.251908553
#> s79 -21.185174031
#> s80  -9.669197974
#> s81  -2.906681297
#> s82   0.040076532
#> s83   2.839796721
#> s84   6.351161672
#> s85   5.365133700
#> s86  11.824289721
#> s87  13.989226363
#> s88  27.915899182
#> s89  38.474663936
#> s90  28.471598245
#> s91  18.852043126

Forecast

# Forecast
forecast_holtw <- forecast(object = fnb_holtw, h = 13*7)

Plot

# Plot
train_fnb %>% 
  autoplot(series = "Actual") + 
  autolayer(test_fnb, series = "Actual test") + 
  autolayer(fnb_holtw$fitted[,1], series = "Fitted values (additive)") +
  autolayer(forecast_holtw$mean, series = "Predict data test (additive)")

test_forecast(actual = fnb_msts,
             forecast.obj = forecast_holtw,
             train = train_fnb,
             test = test_fnb)

ARIMA

ARIMA is an abbreviation of Autoregressive Integrated Moving Average. While exponential smoothing methods do not make any assumptions about correlations between successive values of the time series, in some cases you can build a better predictive model by taking those correlations into account. ARIMA models include an explicit statistical model for the irregular component of a time series, one that allows for non-zero autocorrelation in the irregular component.
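
These correlations can be inspected directly before fitting; a short exploratory sketch (not part of the original workflow) using the forecast package's ACF/PACF plots:

# autocorrelation and partial autocorrelation of the training series
ggAcf(train_fnb) + theme_minimal()
ggPacf(train_fnb) + theme_minimal()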

Differencing Time Series

ARIMA models are defined for stationary time series. Therefore, if we start off with a non-stationary time series, we will need to "difference" it until we obtain a stationary series.

We check whether our time series data are stationary using the Augmented Dickey-Fuller test:

  • H0: Data not stationary
  • H1: Data stationary

If the p-value is > 0.05, we fail to reject H0: the data are not stationary.

dts <- adf.test(x = fnb_msts)
dts
#> 
#>  Augmented Dickey-Fuller Test
#> 
#> data:  fnb_msts
#> Dickey-Fuller = -9.0527, Lag order = 10, p-value = 0.01
#> alternative hypothesis: stationary

We get a p-value of 0.01, which is less than 0.05, so the data are considered stationary.
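
Had the test suggested non-stationarity, the series could have been differenced before modelling; a minimal sketch of that fallback (assuming one round of differencing would suffice):

# estimate how many differences would be needed, difference once, and re-test
ndiffs(fnb_msts)
fnb_diff <- diff(fnb_msts, lag = 1)
adf.test(fnb_diff)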

Model

We build the ARIMA model using the auto.arima() function, which automatically searches for a suitable ARIMA specification for us.

fnb_auto_arima <- auto.arima(train_fnb, seasonal = T)

Forecast

# Forecast
forecast_auto_arima <- forecast(fnb_auto_arima, h = 13*7)

Plot

# Plot
train_fnb %>% 
  autoplot(series = "Actual") +
  autolayer(test_fnb, series = "Actual Test") + 
  autolayer(forecast_auto_arima$mean, series = "Predict Auto ARIMA")

test_forecast(actual = fnb_msts,
             forecast.obj = forecast_auto_arima,
             train = train_fnb,
             test = test_fnb)

STLM (ARIMA)

STLM is short for STL (Seasonal and Trend decomposition using Loess) plus a forecasting model. STL extracts a seasonal component, a trend component, and a remainder component from a time series by applying loess repeatedly; stlm() then models the seasonally adjusted data (here with ARIMA) and adds the seasonal component back when forecasting.
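
An analogous seasonally adjusted series can be inspected from the mstl() decomposition computed earlier (on the full series, not the training split); a small sketch, assuming the usual mstl() column names (Data, Seasonal13, Seasonal91):

# seasonally adjusted series = observed data minus both seasonal components
fnb_seasadj <- fnb_msts_decom[, "Data"] -
  fnb_msts_decom[, "Seasonal13"] - fnb_msts_decom[, "Seasonal91"]
autoplot(fnb_seasadj) + theme_minimal()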

Model

fnb_stlm_arima <- stlm(train_fnb, method = "arima")

Forecast

# Forecast
forecast_stlm_arima <- forecast(fnb_stlm_arima, h = 13*7)

Plot

# Plot
train_fnb %>% 
  autoplot(series = "Actual") +
  autolayer(test_fnb, series = "Actual Test") + 
  autolayer(forecast_stlm_arima$mean, series = "Predict STLM ARIMA")

test_forecast(actual = fnb_msts,
             forecast.obj = forecast_stlm_arima,
             train = train_fnb,
             test = test_fnb)

Model Evaluation

Evaluate All Models Performance

Triple Exponential Smoothing

# evaluation data train (Additive)
accuracy(fnb_holtw$fitted[,1], train_fnb)
#>                 ME     RMSE      MAE MPE MAPE      ACF1 Theil's U
#> Test set 0.3537315 7.113313 5.286998 NaN  Inf 0.2490194         0
# evaluation data test (Additive)
acc_holtw <- accuracy(forecast_holtw$mean, test_fnb)

Auto ARIMA

# evaluation data train (Auto ARIMA)
accuracy(fnb_auto_arima$fitted, train_fnb)
#>                  ME     RMSE      MAE MPE MAPE          ACF1 Theil's U
#> Test set -0.1706769 8.008848 5.720017 NaN  Inf -0.0007779329         0
# evaluation data test (Auto ARIMA)
acc_auto_arima <- accuracy(forecast_auto_arima$mean, test_fnb)

STLM (ARIMA)

# evaluation data train (STLM ARIMA)
accuracy(fnb_stlm_arima$fitted, train_fnb)
#>                   ME    RMSE      MAE MPE MAPE          ACF1 Theil's U
#> Test set -0.02985275 5.14097 3.895996 NaN  Inf -0.0002734328         0
# evaluation data test (STLM ARIMA)
acc_stlm_arima <- accuracy(forecast_stlm_arima$mean, test_fnb)
result <- rbind(acc_holtw, acc_auto_arima, acc_stlm_arima)
rownames(result) <- c("Triple Exponential Smoothing Model", "Auto ARIMA Model", "STLM ARIMA Model")
result
#>                                           ME     RMSE      MAE  MPE MAPE
#> Triple Exponential Smoothing Model -2.898399 8.154527 6.451498 -Inf  Inf
#> Auto ARIMA Model                   -1.616111 9.412900 7.412500  Inf  Inf
#> STLM ARIMA Model                   -2.115958 7.323714 5.694023  NaN  Inf
#>                                         ACF1 Theil's U
#> Triple Exponential Smoothing Model 0.4680130         0
#> Auto ARIMA Model                   0.4217281         0
#> STLM ARIMA Model                   0.3436756         0

Based on the accuracy summary above, the “STLM ARIMA Model” performs best, with an MAE of 5.694023 (below the target of 6), so this model is chosen for prediction.
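
As a quick sanity check, the MAE of the chosen model can also be computed by hand as the mean absolute difference between the actual and forecast values; a one-line sketch:

# MAE on the test set, computed manually (should match the accuracy() output)
mean(abs(as.vector(test_fnb) - as.vector(forecast_stlm_arima$mean)))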

Visualization Actual vs Estimated

Data Preparation for the Plot :

accuracyData <- data.frame(
  tdate = fnb_data$td_hour %>% tail(13*7),
  actual = as.vector(test_fnb),
  holtw_forecast = as.vector(forecast_holtw$mean),
  auto_arima_forecast = as.vector(forecast_auto_arima$mean),
  stlm_arima_forecast = as.vector(forecast_stlm_arima$mean)
)

Visualization of Actual vs Estimated number of visitors (Best Model)

accuracyData %>% 
  ggplot() +
  geom_line(aes(x = tdate, y = actual, colour = "Actual"), size = 1) +
  geom_line(aes(x = tdate, y = stlm_arima_forecast, colour = "STLM Arima Model (Best Model)"), size = 1)+
  labs(
    title = "Actual vs Arima Model (Best models)",
    subtitle = "Hourly visitor", 
    x = "Date", 
    y = "Visitor", 
    colour = ""
    )

Visualization of Actual vs Estimated number of visitors (All Models)

accuracyData %>% 
  ggplot() +
  geom_line(aes(x = tdate, y = actual, colour = "Actual"), size = 0.5) +
  geom_line(aes(x = tdate, y = holtw_forecast, colour = "Triple Exponential Smoothing Model"), size = 0.1) + 
  geom_line(aes(x = tdate, y = auto_arima_forecast, colour = "Auto ARIMA Model"), size = 0.1) + 
  geom_line(aes(x = tdate, y = stlm_arima_forecast, colour = "STLM ARIMA Model (Best Model)"), size = 0.5)+
  labs(
    title = "Actual vs All Models", 
    subtitle = "Hourly visitor", 
    x = "Date", 
    y = "Visitor", 
    colour = "")

Prediction Performance

Rubrics

  • (6 Points) Reached MAE < 6 in (your own) validation dataset.
  • (6 Points) Reached MAE < 6 in test dataset.

Based on Cross-Validation dataset

The best model is the “STLM ARIMA Model”: it has an MAE of 5.694023, the smallest MAE among the models built.

accuracy(forecast_stlm_arima$mean, test_fnb)
#>                 ME     RMSE      MAE MPE MAPE      ACF1 Theil's U
#> Test set -2.115958 7.323714 5.694023 NaN  Inf 0.3436756         0

Based on Test dataset

  • Read Test Data given
data_test <- read_csv("3. FnB-ts-visitor/data/data-test.csv")
head(data_test, 26)
#> # A tibble: 26 × 2
#>    datetime            visitor
#>    <dttm>              <lgl>  
#>  1 2018-02-19 10:00:00 NA     
#>  2 2018-02-19 11:00:00 NA     
#>  3 2018-02-19 12:00:00 NA     
#>  4 2018-02-19 13:00:00 NA     
#>  5 2018-02-19 14:00:00 NA     
#>  6 2018-02-19 15:00:00 NA     
#>  7 2018-02-19 16:00:00 NA     
#>  8 2018-02-19 17:00:00 NA     
#>  9 2018-02-19 18:00:00 NA     
#> 10 2018-02-19 19:00:00 NA     
#> # … with 16 more rows
  • Change data type
data_test <- data_test %>% 
  mutate(datetime = ymd_hms(datetime) %>% 
           floor_date(unit = "hour"))
  • Model used for Forecasting
model_stlm_arima <- stlm(fnb_msts, method = "arima")
  • Submission Template
# Forecast the target using your model
forecast_mod <- forecast(model_stlm_arima, h = 13*7)

# Create submission data
submission <- data_test %>% 
  mutate(visitor = forecast_mod$mean)

# save data
write.csv(submission, "submission-aisah.csv", row.names = F)

# check the first 26 rows of the submission
head(submission, 26)
#> # A tibble: 26 × 2
#>    datetime            visitor
#>    <dttm>                <dbl>
#>  1 2018-02-19 10:00:00    4.27
#>  2 2018-02-19 11:00:00    6.95
#>  3 2018-02-19 12:00:00   12.8 
#>  4 2018-02-19 13:00:00   15.7 
#>  5 2018-02-19 14:00:00   11.1 
#>  6 2018-02-19 15:00:00   20.0 
#>  7 2018-02-19 16:00:00   21.8 
#>  8 2018-02-19 17:00:00   20.4 
#>  9 2018-02-19 18:00:00   24.2 
#> 10 2018-02-19 19:00:00   40.8 
#> # … with 16 more rows
  • Upload “fnb_stlm_arima_test.csv” to the “Scoring Dashboard Portal”; it achieves the target metric (MAE less than 6) with an MAE of 4.82.

Scoring Dashboard Portal

Conclusion

Several assumptions need to be checked in order to confirm that the forecast is sound:

Rubrics

  • (4 Point) Assumption Checking

    • Does the model meet the autocorrelation assumption?
    • What about the normality of residuals?
    • If the assumptions are not met, what is the cause and how do you handle it?
    • Based on the seasonality, when are visitor numbers highest?

No-Autocorrelation

To check for no autocorrelation, we use the Box.test() function. We expect the p-value from our model to be > 0.05, so that we fail to reject H0.

  • H0 : No autocorrelation in the forecast errors
  • H1 : there is autocorrelation in the forecast errors
Box.test(x = model_stlm_arima$residuals, type = "Ljung-Box")
#> 
#>  Box-Ljung test
#> 
#> data:  model_stlm_arima$residuals
#> X-squared = 0.00010908, df = 1, p-value = 0.9917

The p-value = 0.9917 > 0.05 (alpha), so we conclude that the residuals show no autocorrelation.

Normality Residuals

The function we use for this check is shapiro.test(), and we expect a p-value > 0.05 so that we fail to reject H0.

  • H0 : residuals are normally distributed
  • H1 : residuals are not normally distributed
shapiro.test(model_stlm_arima$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_stlm_arima$residuals
#> W = 0.99139, p-value = 0.000009359

The p-value = 0.000009359 < 0.05 (alpha), so we conclude that the residuals are not normally distributed.

Root cause and how to handle it: the residuals are not normally distributed. This can happen because our dataset is not large enough, but it does not necessarily mean the forecast performance is poor. Collecting more data to feed into the models should bring the residuals closer to normality and give better results in model evaluation and forecast performance.
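
Before collecting more data, the shape of the residual distribution can also be inspected visually; a small diagnostic sketch using base R plots (not part of the original workflow):

# visual diagnostics for the residuals of the final model
hist(model_stlm_arima$residuals, breaks = 30,
     main = "Residuals of STLM ARIMA model", xlab = "Residual")
qqnorm(model_stlm_arima$residuals)
qqline(model_stlm_arima$residuals)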

Summary

  • The outlet's peak time is between 19.00 and 22.00 on every day of the week.
  • Saturday at 20.00 is when the most visitors come.
  • 10.00, on every day of the week, is when the fewest visitors come, as the outlet has just opened.