Background

In this exercise we will build a time series model using a sales dataset from Kaggle. The dataset is stored in the data_input/dive-deeper folder in a file called train.csv and consists of 4 years of store-item sales data. Throughout this dive deeper exercise we will perform time series analysis and forecasting using the Prophet model.

Data Preparation

In this section we will convert the raw dataset into a proper time series format. Read in the train.csv dataset using the read.csv() function and perform a simple inspection of the data:
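A minimal sketch of the inspection step, assuming the dplyr package is installed (for glimpse()). The helper name inspect_sales is hypothetical, and stringsAsFactors = TRUE is an assumption made so that the date column comes back as a factor, as in the output that follows.

```r
library(dplyr)  # assumed available; provides glimpse()

# Hypothetical helper: read the CSV and print the same kinds of summaries
# that appear in the output of this section.
inspect_sales <- function(path) {
  # stringsAsFactors = TRUE reproduces the <fct> date column
  # (R >= 4.0 reads strings as character by default).
  sales <- read.csv(path, stringsAsFactors = TRUE)

  glimpse(sales)            # column types and a preview of each column
  print(head(sales))        # first six rows
  print(tail(sales))        # last six rows
  print(class(sales$date))  # class of the date column

  invisible(sales)
}

# Path taken from the text above; adjust it if your folder layout differs.
path <- "data_input/dive-deeper/train.csv"
if (file.exists(path)) {
  sales <- inspect_sales(path)
}
```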

## Observations: 730,500
## Variables: 4
## $ date  <fct> 2013-01-01, 2013-01-02, 2013-01-03, 2013-01-04, 2013-01-05, 201…
## $ store <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ item  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ sales <int> 13, 11, 14, 13, 10, 12, 10, 9, 12, 9, 9, 7, 10, 12, 5, 7, 16, 7…
##         date store item sales
## 1 2013-01-01     1    1    13
## 2 2013-01-02     1    1    11
## 3 2013-01-03     1    1    14
## 4 2013-01-04     1    1    13
## 5 2013-01-05     1    1    10
## 6 2013-01-06     1    1    12
##              date store item sales
## 730495 2016-12-26    10   50    61
## 730496 2016-12-27    10   50    60
## 730497 2016-12-28    10   50    43
## 730498 2016-12-29    10   50    68
## 730499 2016-12-30    10   50    63
## 730500 2016-12-31    10   50    64
## [1] "factor"

There are 730,500 observations of 4 variables. Each row records the date, the store, the item, and the number of units sold. The date column is stored as a factor, so it needs to be converted to a date type. Summing sales across all stores and items for each date then gives the daily demand series:
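The conversion and daily aggregation might look like the sketch below. It assumes dplyr and uses a tiny made-up data frame (named train here) in place of the real file, so the numbers are illustrative only; the demand column name follows the table in this section.

```r
library(dplyr)  # assumed available

# Toy stand-in for the real data: 2 days x 2 stores, a single item.
train <- data.frame(
  date  = factor(c("2013-01-01", "2013-01-01", "2013-01-02", "2013-01-02")),
  store = c(1L, 2L, 1L, 2L),
  item  = c(1L, 1L, 1L, 1L),
  sales = c(13L, 12L, 11L, 9L)
)

daily_demand <- train %>%
  mutate(date = as.Date(as.character(date))) %>%  # factor -> Date
  group_by(date) %>%
  summarise(demand = sum(sales)) %>%              # total units sold per day
  ungroup()

print(daily_demand)
```

Going through as.character() before as.Date() makes the factor-to-date conversion explicit, so the result does not depend on how the factor levels are encoded.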

## # A tibble: 1,461 x 2
##    date       demand
##    <date>      <int>
##  1 2013-01-01  13696
##  2 2013-01-02  13678
##  3 2013-01-03  14488
##  4 2013-01-04  15677
##  5 2013-01-05  16237
##  6 2013-01-06  17291
##  7 2013-01-07  11769
##  8 2013-01-08  13560
##  9 2013-01-09  13878
## 10 2013-01-10  14642
## # … with 1,451 more rows

Baseline Prophet Model

The weekly seasonality shows that demand is lowest on Monday, while the yearly seasonality peaks in July. This is a bit puzzling, because from the earlier information we know that demand is high around November to December. The initial model couldn't capture this seasonality. We should tune the model to see whether additional seasonality (for instance monthly or quarterly) helps it capture the pattern better.
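For reference, a baseline model and its component plots could be produced roughly as follows. This is a sketch under assumptions, not necessarily the code behind the results discussed here: fit_baseline is a hypothetical helper, daily_demand is assumed to hold date and demand columns, and the 365-day horizon is an assumption based on the one-year evaluation set.

```r
# Hypothetical helper: fit Prophet with default settings and forecast ahead.
# Assumes `daily_demand` has a Date column `date` and a numeric column `demand`.
fit_baseline <- function(daily_demand, horizon = 365) {
  # Prophet expects the columns to be named ds (date) and y (value).
  df <- data.frame(ds = daily_demand$date, y = daily_demand$demand)

  model    <- prophet::prophet(df)
  future   <- prophet::make_future_dataframe(model, periods = horizon)
  forecast <- predict(model, future)

  list(model = model, forecast = forecast)
}

# Runs only when the prophet package and the aggregated data are available.
if (exists("daily_demand") && requireNamespace("prophet", quietly = TRUE)) {
  baseline <- fit_baseline(daily_demand)
  # Trend, weekly, and yearly components, one panel each.
  prophet::prophet_plot_components(baseline$model, baseline$forecast)
}
```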

Model Fine Tuning

  • Trend (Linear)
    • changepoint.prior.scale
    • changepoints =
  • Seasonality
    • Daily, Weekly, Yearly (TRUE/FALSE)
    • yearly.seasonality = FALSE
    • daily.seasonality = FALSE -> applies to hourly data
    • weekly.seasonality = FALSE
    • Add non-regular seasonality
      • period
      • fourier.order (3-10)
  • Holiday
##      holiday         ds lower_window upper_window
## 1 newyeareve 2013-12-31           -5            0
## 2 newyeareve 2014-12-31           -5            0
## 3 newyeareve 2015-12-31           -5            0
## 4 newyeareve 2016-12-31           -5            0
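The options listed above could be combined as in the following sketch. The holiday table reproduces the one shown; everything else is an assumption for illustration: fit_tuned is a hypothetical helper, changepoint.prior.scale is left at Prophet's default of 0.05, and the monthly seasonality uses period = 30.5 with fourier.order = 5 (inside the 3-10 range noted in the list).

```r
# Holiday table matching the output above: New Year's Eve plus the five
# days leading up to it (lower_window = -5, upper_window = 0).
holidays <- data.frame(
  holiday      = "newyeareve",
  ds           = as.Date(c("2013-12-31", "2014-12-31", "2015-12-31", "2016-12-31")),
  lower_window = -5,
  upper_window = 0
)

# Hypothetical helper: build a tuned model, then fit it.
# `df` must have the columns ds (date) and y (value).
fit_tuned <- function(df, holidays) {
  model <- prophet::prophet(
    changepoint.prior.scale = 0.05,  # assumed value: Prophet's default
    yearly.seasonality = TRUE,
    weekly.seasonality = TRUE,
    daily.seasonality  = FALSE,      # only relevant for sub-daily data
    holidays = holidays
  )
  # Non-regular seasonality: roughly monthly, values assumed for illustration.
  model <- prophet::add_seasonality(model, name = "monthly",
                                    period = 30.5, fourier.order = 5)
  prophet::fit.prophet(model, df)
}

print(holidays)
```

Calling prophet() without a data frame returns an unfitted model, which is what allows add_seasonality() to be applied before fit.prophet(). Note that Prophet's documentation asks for holiday dates covering the forecast period as well; the table above stops at 2016, so a 2017 entry would be needed for the holiday effect to appear in a 2017 forecast.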

Model Evaluation

## Observations: 182,500
## Variables: 4
## $ date  <fct> 2017-01-01, 2017-01-02, 2017-01-03, 2017-01-04, 2017-01-05, 201…
## $ store <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ item  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ sales <int> 19, 15, 10, 16, 14, 24, 14, 20, 18, 11, 14, 17, 7, 16, 29, 15, …
## # A tibble: 365 x 5
##    ds                      y   yhat yhat_upper yhat_lower
##    <dttm>              <int>  <dbl>      <dbl>      <dbl>
##  1 2017-01-01 00:00:00 23709 26755.     28089.     25334.
##  2 2017-01-02 00:00:00 15772 16667.     18023.     15236.
##  3 2017-01-03 00:00:00 18650 19998.     21405.     18716.
##  4 2017-01-04 00:00:00 18510 20017.     21442.     18627.
##  5 2017-01-05 00:00:00 19895 21598.     23003.     20139.
##  6 2017-01-06 00:00:00 20994 23282.     24511.     21790.
##  7 2017-01-07 00:00:00 22591 24913.     26331.     23511.
##  8 2017-01-08 00:00:00 23700 26610.     27901.     25092.
##  9 2017-01-09 00:00:00 15797 16508.     17898.     15023.
## 10 2017-01-10 00:00:00 18608 19834.     21142.     18423.
## # … with 355 more rows
## # A tibble: 365 x 5
##    ds                      y   yhat yhat_upper yhat_lower
##    <dttm>              <int>  <dbl>      <dbl>      <dbl>
##  1 2017-01-01 00:00:00 23709 26974.     28498.     25411.
##  2 2017-01-02 00:00:00 15772 16911.     18426.     15405.
##  3 2017-01-03 00:00:00 18650 20290.     21944.     18715.
##  4 2017-01-04 00:00:00 18510 20351.     21910.     18761.
##  5 2017-01-05 00:00:00 19895 21971.     23539.     20386.
##  6 2017-01-06 00:00:00 20994 23670.     25157.     22137.
##  7 2017-01-07 00:00:00 22591 25295.     26833.     23718.
##  8 2017-01-08 00:00:00 23700 26975.     28600.     25423.
##  9 2017-01-09 00:00:00 15797 16861.     18321.     15366.
## 10 2017-01-10 00:00:00 18608 20211.     21696.     18700.
## # … with 355 more rows
## [1] 0.05254495
## [1] 0.05550715

The MAPE of forecast_new is 0.05254495, slightly better than the 0.05550715 of forecast_prophet. This suggests that adding monthly seasonality and the holidays argument helps the model perform better than the untuned baseline, although the gain is small (about 0.3 percentage points).
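MAPE is the mean of the absolute errors expressed as a fraction of the actual values. A minimal base-R sketch is shown below; the function name mape is hypothetical and may differ from the helper used to produce the figures above.

```r
# Mean absolute percentage error, returned as a fraction (0.05 means 5%).
mape <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  stopifnot(all(actual != 0))  # MAPE is undefined when an actual value is zero
  mean(abs((actual - predicted) / actual))
}

# Illustrative values only: both forecasts are off by 10% of the actual value.
mape(actual = c(100, 200), predicted = c(110, 180))  # 0.1
```

Applied to the evaluation tables, the call would take the y column as actual and the yhat column as predicted.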

The monthly seasonality added to model_new shows a larger MAPE value in January, a slightly larger one in July, and an increase during the last quarter of the year. This hints that sales may contain outliers when demand is high. With this insight, we can advise the company to be well prepared with stock for the increased demand in January, February, July, November, and December each year.

It is interesting that a model can recognize seasonality and, even better, is flexible enough to be tuned through its seasonality, changepoint, and holiday settings, since behaviour always changes over time. The Prophet package can be a good tool for helping business people understand what seasonality does to our data and to our target (forecasted sales). It can also help a business run more effectively, because the forecast is valuable information for preparing the warehouse and the workers to keep up with demand.