Retail Sales

1. Introduction

Business Introduction

PT.A is a company that supplies various products to multiple urban areas, but it has been struggling to forecast demand accurately in each area. As a result, it experiences stockouts or overstocking, which negatively affects business performance. To overcome this challenge, PT.A needs the expertise of a data science professional who can help it develop and implement effective predictive models. By utilizing the company’s historical sales data and applying algorithms and techniques such as time series analysis, regression, and machine learning, a predictive model can be developed to forecast demand accurately. Implementing this model can enable PT.A to optimize inventory management, reduce stockouts, and improve customer satisfaction.

Goals

This study investigates time series analysis of product demand in a supply chain context. Specifically, it focuses on exploring the patterns and trends in product demand over time, identifying the factors that influence demand, and developing forecasting models to predict future demand. The study utilizes historical sales data and applies statistical techniques, in particular time series predictive models chosen to suit the data set, to analyze and predict demand patterns. The findings can provide insights for supply chain managers to optimize inventory management, production planning, and distribution strategies.

About the data

The full data set can be accessed here: Predict Demand

  • We use PT.A’s historical sales data set from 2012 - 2018
  • We focus on only one retail store and one product

Concept of Time Series

Time Series Analysis describes a set of research problems where observations are collected at regular time intervals and where we can assume correlations among successive observations. The principal idea is to learn from these past observations any inherent structures or patterns within the data, with the objective of generating future values for the series. The activity of generating future values is called “forecasting”.

💡 Main requirements for time series data:

  1. Data must be sorted by time period from the oldest to the newest observation
  2. The time interval between observations must stay the same
  3. No time period may be skipped within the interval
  4. No values may be missing (a small checking sketch follows this list)
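
A minimal sketch of how these requirements can be checked in R; the data frame df and its columns below are hypothetical placeholders, not the project data.

library(dplyr)

# Hypothetical toy monthly data standing in for a real series
df <- data.frame(
  date  = seq(as.Date("2023-01-01"), as.Date("2023-06-01"), by = "month"),
  value = c(10, 12, 11, 13, 14, 12)
)

df <- df %>% arrange(date)                             # 1. sort from oldest to newest
expected <- seq(min(df$date), max(df$date), by = "month")
identical(df$date, expected)                           # 2. & 3. one row per month, none skipped
anyNA(df)                                              # 4. no missing values (should be FALSE)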

Time Series Analysis Workflow

  • Read data + data understanding
  • Data Wrangling
  • Data Pre-processing
  • Exploratory Data Analysis (Decompose)
  • Cross Validation
  • Build Model
  • Predict
  • Evaluation
  • Check Assumption (for ARIMA)
  • Tuning

EDA in time series

EDA for time series centers on decomposition, which is used to determine whether the series is additive or multiplicative. Components of time series data:

  • Trend : the general pattern of the data, tending to increase or decrease. If a pattern remains in the trend component, the components have not been decomposed properly.
  • Seasonal : a pattern that repeats itself over a fixed period of time.
  • Error/Remainder/Random : patterns that cannot be captured by the trend and seasonal components.

There are 2 types of models in time series data, namely:

  1. Additive model: a time series whose variance stays constant as it follows the trend and seasonality

\[Y_t = T_t + S_t + E_t\]

Data = Trend + Seasonal + Error

  2. Multiplicative model: a time series whose variance grows or shrinks with the level as it follows the existing trend and seasonality

\[Y_t = T_t * S_t * E_t\]

From the decomposition results, check whether the trend component still contains a repeating pattern; if it does, the frequency needs to be adjusted, or the series may be multi-seasonal.
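
As a toy illustration (using R’s built-in AirPassengers series rather than the project data), the two decomposition types can be compared directly; for a series whose seasonal swings grow with the level, the multiplicative decomposition typically leaves more stable residuals.

library(forecast)

# AirPassengers: built-in monthly airline passenger counts, 1949-1960
add_dec  <- decompose(AirPassengers, type = "additive")
mult_dec <- decompose(AirPassengers, type = "multiplicative")

# Compare the remainder components: the version with less leftover structure
# points to the more appropriate model type for this series
autoplot(add_dec)
autoplot(mult_dec)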

2. Preparation

Import Library

library(dplyr) # data wrangling
library(lubridate) # date manipulation
library(padr) # complete data frame
library(zoo) # Missing value imputation
library(forecast) # time series library
library(TTR) # for Simple moving average function
library(MLmetrics) # calculate error
library(tseries) # adf.test
library(TSstudio) # enhanced time series visualization
library(ggplot2) 
library(plotly)
library(tidyr)

Read Data

sales <- read.csv("data_input/demand.csv")
sales <- sales %>% 
  select(-id) %>% # Remove ID
  select(-lat) %>% # Remove Latitude
  select(-long) %>% # Remove Longitude
  select(-pop) # Remove population of area
glimpse(sales)
#> Rows: 7,560
#> Columns: 8
#> $ date      <chr> "31/01/12", "31/01/12", "31/01/12", "31/01/12", "31/01/12", …
#> $ city      <chr> "Athens", "Athens", "Athens", "Athens", "Athens", "Athens", …
#> $ shop      <chr> "shop_1", "shop_1", "shop_1", "shop_1", "shop_1", "shop_1", …
#> $ brand     <chr> "kinder-cola", "kinder-cola", "kinder-cola", "adult-cola", "…
#> $ container <chr> "glass", "plastic", "can", "glass", "can", "glass", "can", "…
#> $ capacity  <chr> "500ml", "1.5lt", "330ml", "500ml", "330ml", "500ml", "330ml…
#> $ price     <dbl> 0.96, 2.86, 0.87, 1.00, 0.39, 1.00, 0.43, 0.49, 0.70, 2.21, …
#> $ quantity  <int> 13280, 6727, 9848, 20050, 25696, 15041, 34578, 44734, 18623,…
colSums(is.na(sales))
#>      date      city      shop     brand container  capacity     price  quantity 
#>         0         0         0         0         0         0      1080      1080

Insight :

  • Several columns should be converted to factors: city, shop, brand, container, capacity
  • The date column should be converted to a date type using the lubridate library
  • The dates look like sales closing (month-end) dates, which would make checking the time interval difficult later
  • Rows with missing values should be removed

Manipulate Data

sales <- sales %>% 
  mutate_at(vars(city,shop,brand,container,capacity), as.factor) %>% 
  mutate(date=dmy(date)) %>% 
  mutate(date = floor_date(date,unit = "month")) %>% 
  drop_na()  #delete blank rows
glimpse(sales)
#> Rows: 6,480
#> Columns: 8
#> $ date      <date> 2012-01-01, 2012-01-01, 2012-01-01, 2012-01-01, 2012-01-01,…
#> $ city      <fct> Athens, Athens, Athens, Athens, Athens, Athens, Athens, Athe…
#> $ shop      <fct> shop_1, shop_1, shop_1, shop_1, shop_1, shop_1, shop_1, shop…
#> $ brand     <fct> kinder-cola, kinder-cola, kinder-cola, adult-cola, adult-col…
#> $ container <fct> glass, plastic, can, glass, can, glass, can, glass, glass, p…
#> $ capacity  <fct> 500ml, 1.5lt, 330ml, 500ml, 330ml, 500ml, 330ml, 500ml, 500m…
#> $ price     <dbl> 0.96, 2.86, 0.87, 1.00, 0.39, 1.00, 0.43, 0.49, 0.70, 2.21, …
#> $ quantity  <int> 13280, 6727, 9848, 20050, 25696, 15041, 34578, 44734, 18623,…

Insight :

  • The dates have been moved to the beginning of each month
  • No missing values remain

Analyzed Product

As supply chain staff, we need to know which product should be prepared before focusing on the prediction step.

Which city has the highest demand?

sales_city<- sales %>% 
  group_by(city) %>% 
  count() %>% 
  rename(total = n)
  
sales_city <- ggplot(sales_city)+
  geom_col(mapping = aes(x= city,y=total),fill = 'navy')
ggplotly(sales_city)

Insight :

  • Athens has the highest demand; next, we want to know which product sells in the largest volume there.

The Highest-Demand Product

sales %>% 
  group_by(city,brand,container,capacity) %>% 
  filter(city == "Athens") %>% 
  summarise(total_qty = sum(quantity)) %>% 
  arrange(-total_qty)

Insight :

  • We will use gazoza 330ml as the sample product for forecasting because it has the highest demand

Which department store / shop should be used as our sample?

sales %>% 
  group_by(city,shop) %>% 
  count() %>% 
  rename(total = n) %>% 
  filter(city=="Athens")

3. Athens Shop 1, Product Gazoza 330ml

sales1 <- sales %>% 
  filter(city=="Athens") %>% 
  filter(shop == "shop_1") %>% 
  filter(brand == "gazoza") %>% 
  filter(capacity == "330ml")
sales1

4. Check Time Periods

unique(sales1$date)
#>  [1] "2012-01-01" "2012-02-01" "2012-03-01" "2012-04-01" "2012-05-01"
#>  [6] "2012-06-01" "2012-07-01" "2012-08-01" "2012-09-01" "2012-10-01"
#> [11] "2012-11-01" "2012-12-01" "2013-01-01" "2013-02-01" "2013-03-01"
#> [16] "2013-04-01" "2013-05-01" "2013-06-01" "2013-07-01" "2013-08-01"
#> [21] "2013-09-01" "2013-10-01" "2013-11-01" "2013-12-01" "2014-01-01"
#> [26] "2014-02-01" "2014-03-01" "2014-04-01" "2014-05-01" "2014-06-01"
#> [31] "2014-07-01" "2014-08-01" "2014-09-01" "2014-10-01" "2014-11-01"
#> [36] "2014-12-01" "2015-01-01" "2015-02-01" "2015-03-01" "2015-04-01"
#> [41] "2015-05-01" "2015-06-01" "2015-07-01" "2015-08-01" "2015-09-01"
#> [46] "2015-10-01" "2015-11-01" "2015-12-01" "2016-01-01" "2016-02-01"
#> [51] "2016-03-01" "2016-04-01" "2016-05-01" "2016-06-01" "2016-07-01"
#> [56] "2016-08-01" "2016-09-01" "2016-10-01" "2016-11-01" "2016-12-01"
#> [61] "2017-01-01" "2017-02-01" "2017-03-01" "2017-04-01" "2017-05-01"
#> [66] "2017-06-01" "2017-07-01" "2017-08-01" "2017-09-01" "2017-10-01"
#> [71] "2017-11-01" "2017-12-01"
interval <- seq.Date(from= as.Date(range(sales1$date)[1]),
                     to = as.Date(range(sales1$date)[2]),
                     by = "month")

length(interval)
#> [1] 72
all(interval == sales1$date)
#> [1] TRUE

Insight :

  • The time interval is constant (monthly) throughout the data set
  • The data is sorted by time period from the oldest to the newest observation
💡 The main requirements for time series data have been fulfilled

5. Visualization

# add monthly column
sales1.1 <- sales1 %>%
  mutate(monthly = month(date, label = T))

# visualize
sales1.1 <- sales1.1 %>% 
  ggplot(aes(x = date, y =quantity))+
    geom_point(data = sales1.1, col = "firebrick")+
  geom_line()
ggplotly(sales1.1)

Insight :

  • The seasonal swings are roughly constant in size, indicating an additive series (see the seasonal plot sketch below)
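
As an added sketch (not part of the original workflow), a seasonal plot overlays each year on the same January-to-December axis, which makes the repeating pattern and its roughly constant amplitude easier to judge.

# Seasonal plot of the monthly demand series
sales_season <- ts(sales1$quantity, start = c(2012, 1), frequency = 12)
ggseasonplot(sales_season, year.labels = TRUE) +
  labs(title = "Monthly demand by year", x = "Month", y = "Quantity")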

6. EDA: Decompose, TS Object

sales_ts <- ts(data =sales1$quantity,start = c(2012,1),frequency = 12)
sales_dec <- sales_ts %>% decompose() %>%  autoplot()
sales_dec

Insight :

  • Looking at the chart, the trend component still shows repeated waves and does not look smooth, even though a frequency of 12 already matches the monthly data.
  • When the trend from a decomposition still forms a pattern, it suggests that some seasonality has not been captured; a proper trend should increase or decrease smoothly. We therefore treat the data set as potentially multi-seasonal and decompose it with mstl().
sales1$quantity %>% 
  msts(seasonal.periods = c(12), start = c(2012,1)) %>% # multiseasonal ts (monthly)
  mstl() %>% # multiseasonal ts decomposition
  autoplot() 

Insight :

  • The trend now decreases smoothly, which means the components have been decomposed properly.
# assign final ts object
sales_msts <- sales1$quantity %>% 
  msts(seasonal.periods = c(12) , start = c(2012,1))

# check for stationary
adf.test(sales_msts)
#> 
#>  Augmented Dickey-Fuller Test
#> 
#> data:  sales_msts
#> Dickey-Fuller = -4.6232, Lag order = 4, p-value = 0.01
#> alternative hypothesis: stationary

Insight :

  • The series is stationary according to the ADF test (p-value = 0.01 < 0.05); a hedged note on the non-stationary case follows
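
A hedged note (an addition, not part of the original analysis): if the ADF test had not rejected non-stationarity, a common remedy is to difference the series once and test again before fitting ARIMA-family models.

# First difference of the series, then re-test for stationarity
sales_diff <- diff(sales_msts, lag = 1)
adf.test(sales_diff)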

7. Cross Validation

We split the data into training and test sets, holding out the last 24 months (2 years) as the test window.

sales_train <- sales_msts %>% 
  head(-2*12)

sales_test <- sales_msts %>% 
  tail(2*12)
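
A quick sanity check on the split (a small addition): the series contains 72 monthly observations, and the last 24 months are held out for testing.

length(sales_train)                                      # expected: 48 observations
length(sales_test)                                       # expected: 24 observations
tail(time(sales_train), 1) < head(time(sales_test), 1)   # training window ends before the test window starts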

8. Forecast Model

Because the data contain both seasonality and trend, we build predictive models using Holt-Winters and SARIMA.

Holt-Winters Exponential Smoothing

Model

sales_hw <- HoltWinters(sales_train)
sales_hw$fitted
#>               xhat    level     trend      season
#> Jan 2013  23133.83 59552.69 -526.3466 -35892.5174
#> Feb 2013  47437.04 59109.20 -526.3466 -11145.8090
#> Mar 2013  41238.30 58593.79 -526.3466 -16829.1424
#> Apr 2013  51819.62 58052.48 -526.3466  -5706.5174
#> May 2013  61212.49 57520.73 -526.3466   4218.1076
#> Jun 2013 103446.08 56925.24 -526.3466  47047.1910
#> Jul 2013  81358.70 56213.89 -526.3466  25671.1493
#> Aug 2013  49752.14 53810.17 -526.3466  -3531.6840
#> Sep 2013  46562.39 53619.63 -526.3466  -6530.8924
#> Oct 2013  49751.34 52363.54 -526.3466  -2085.8507
#> Nov 2013  62615.79 52931.28 -526.3466  10210.8576
#> Dec 2013  43922.20 49873.44 -526.3466  -5424.8924
#> Jan 2014  13368.01 49324.97 -526.3466 -35430.6181
#> Feb 2014  39641.24 51252.43 -526.3466 -11084.8394
#> Mar 2014  32372.54 49811.43 -526.3466 -16912.5491
#> Apr 2014  43346.47 49609.46 -526.3466  -5736.6404
#> May 2014  52892.50 49586.23 -526.3466   3832.6232
#> Jun 2014  94988.87 49499.39 -526.3466  46015.8245
#> Jul 2014  59983.17 45304.65 -526.3466  15204.8649
#> Aug 2014  41006.38 43192.34 -526.3466  -1659.6175
#> Sep 2014  31196.71 42322.23 -526.3466 -10599.1768
#> Oct 2014  45259.61 41772.29 -526.3466   4013.6650
#> Nov 2014  36327.32 40755.81 -526.3466  -3902.1397
#> Dec 2014  33219.87 39294.43 -526.3466  -5548.2146
#> Jan 2015  17364.73 39641.84 -526.3466 -21750.7628
#> Feb 2015  22922.98 39633.29 -526.3466 -16183.9729
#> Mar 2015  23088.05 38718.55 -526.3466 -15104.1597
#> Apr 2015  35582.90 39041.05 -526.3466  -2931.8047
#> May 2015  43484.79 37728.26 -526.3466   6282.8737
#> Jun 2015  61866.85 36828.55 -526.3466  25564.6514
#> Jul 2015  41739.87 35903.03 -526.3466   6363.1918
#> Aug 2015  33202.15 37304.56 -526.3466  -3576.0664
#> Sep 2015  26309.71 37566.76 -526.3466 -10730.7054
#> Oct 2015  39453.76 38698.97 -526.3466   1281.1369
#> Nov 2015  27915.03 37556.24 -526.3466  -9114.8701
#> Dec 2015  37077.04 38280.49 -526.3466   -677.0983

Forecast

sales_hw_fc <- forecast(object = sales_hw, h = 24)
# using autoplot and autolayer
sales_hw_vis <- sales_ts %>% 
  autoplot() +
  autolayer(sales_hw$fitted[, 1], series = "train") +
  autolayer(sales_hw_fc$mean, series = "forecast")

ggplotly(sales_hw_vis)
accuracy(sales_hw$fitted[,1], sales_train)
#>                 ME     RMSE      MAE       MPE     MAPE        ACF1 Theil's U
#> Test set -1370.346 15471.57 11024.39 -7.410821 28.13871 -0.06714306 0.7668653
accuracy(sales_hw_fc$mean, sales_test)
#>                ME     RMSE      MAE      MPE     MAPE      ACF1 Theil's U
#> Test set 11199.02 16080.92 13304.33 25.15117 32.26329 0.1549821  1.172273

SARIMA

Model

sales_auto <- auto.arima(sales_msts)
sales_auto
#> Series: sales_msts 
#> ARIMA(1,1,1)(1,0,0)[12] 
#> 
#> Coefficients:
#>          ar1      ma1    sar1
#>       0.2420  -0.9284  0.2944
#> s.e.  0.1602   0.0546  0.1442
#> 
#> sigma^2 = 214265849:  log likelihood = -781.38
#> AIC=1570.76   AICc=1571.37   BIC=1579.81

Manual Tuning Model

sales_msts %>% 
  diff(lag = 12) %>% #seasonal
  diff(lag = 1) %>% #trend
  tsdisplay(lag.max = 36)

Insight :

From the ACF/PACF plots of the differenced series, two candidate models are tuned manually: SARIMA(1,1,1)(0,0,0)[12] and SARIMA(4,1,1)(0,0,0)[12].

SARIMA1 <- Arima(sales_msts, order = c(1,1,1), seasonal = c(0,0,0))
SARIMA2 <- Arima(sales_msts, order = c(4,1,1), seasonal = c(0,0,0))

Check Goodness of Fit Using the AIC Value

sales_auto$aic
#> [1] 1570.76
SARIMA1$aic
#> [1] 1572.676
SARIMA2$aic
#> [1] 1583.208

Insight :

  • Tip: as a best practice in R, auto.arima() automatically searches for the model with the lowest AIC, and here it achieves a lower AIC than both manually tuned models. A hedged sketch of refitting it on the training window only follows.
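
As a hedged sketch (an addition, not part of the original run), auto.arima() can also be refitted on the training window only, so that the 24-month test window stays unseen when models are compared later.

# Refit on the training window only and score on the held-out test window
sales_auto_train <- auto.arima(sales_train)
fc_auto_train <- forecast(sales_auto_train, h = length(sales_test))
MLmetrics::MAPE(fc_auto_train$mean, sales_test) * 100   # test-set MAPE in percent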

Forecast

sales_train_fc <- forecast(object = sales_auto, h = 12)
sales_train_fc
#>          Point Forecast    Lo 80    Hi 80     Lo 95    Hi 95
#> Jan 2018       31750.37 12991.24 50509.49  3060.757 60439.98
#> Feb 2018       33530.74 13871.02 53190.47  3463.791 63597.70
#> Mar 2018       39462.19 19608.81 59315.57  9099.063 69825.31
#> Apr 2018       41782.85 21827.75 61737.94 11264.165 72301.53
#> May 2018       34810.15 14771.34 54848.96  4163.440 65456.86
#> Jun 2018       39640.83 19522.62 59759.04  8872.685 70408.97
#> Jul 2018       43231.71 23035.35 63428.06 12344.048 74119.36
#> Aug 2018       43192.79 22918.82 63466.76 12186.423 74199.15
#> Sep 2018       38591.80 18240.57 58943.04  7467.270 69716.34
#> Oct 2018       41018.25 20590.04 61446.45  9776.006 72260.48
#> Nov 2018       35825.70 15320.83 56330.57  4466.205 67185.20
#> Dec 2018       32714.24 12132.98 53295.49  1237.919 64190.55
sales_arima <- sales_ts %>% 
  autoplot() +
  autolayer(sales_auto$fitted, series = "train") +
  autolayer(sales_train_fc$mean, series = "forecast")

ggplotly(sales_arima)

Seasonal and Trend Decomposition Using Loess (STLM)

Decompose

Because the data set is additive, we can use STLM directly without applying a log() transformation.

sales_train %>% stl(s.window = 12) %>% autoplot()
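
A hedged alternative (assumed here, not used in this analysis): if the series showed multiplicative behaviour instead, a Box-Cox/log transformation could be applied through the lambda argument of stlm(); lambda = 0 corresponds to a log transform.

# Hypothetical multiplicative variant: log-transform via Box-Cox lambda = 0
model_stlm_log <- stlm(y = sales_train, s.window = 12, method = "arima", lambda = 0)
forecast(model_stlm_log, h = 24)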

Model

model_stlm <- stlm(y = sales_train, s.window = 12, method = "arima")

Forecast

forecast_stlm <- forecast(model_stlm, h = 24)
accuracy(forecast_stlm)
#>                     ME     RMSE      MAE       MPE     MAPE      MASE
#> Training set -437.9875 10920.94 8662.202 -5.644549 21.24465 0.5642026
#>                    ACF1
#> Training set 0.06659016
sales_stlm <- sales_ts %>% 
  autoplot() +
  autolayer(model_stlm$fitted, series = "train") +
  autolayer(forecast_stlm$mean, series = "forecast")

ggplotly(sales_stlm)

9. Model Evaluation

accuracy(sales_hw$fitted[,1], sales_train) #Holt Winter for data train
#>                 ME     RMSE      MAE       MPE     MAPE        ACF1 Theil's U
#> Test set -1370.346 15471.57 11024.39 -7.410821 28.13871 -0.06714306 0.7668653
accuracy(sales_hw_fc$mean,sales_test) # Holt Winter for data test
#>                ME     RMSE      MAE      MPE     MAPE      ACF1 Theil's U
#> Test set 11199.02 16080.92 13304.33 25.15117 32.26329 0.1549821  1.172273
accuracy(sales_auto$fitted,sales_train) # Sarima for data train
#>                 ME     RMSE      MAE       MPE    MAPE       ACF1 Theil's U
#> Test set -537.7388 15379.81 12059.13 -11.41317 29.5855 0.03458137 0.7700517
accuracy(sales_train_fc$fitted, sales_test) # Sarima for data test
#>                 ME     RMSE      MAE       MPE     MAPE       ACF1 Theil's U
#> Test set -350.9228 11576.23 9961.559 -11.08725 29.41554 0.01393134 0.7017803
accuracy(forecast_stlm$fitted,sales_train) # STLM for data train
#>                 ME     RMSE      MAE       MPE     MAPE       ACF1 Theil's U
#> Test set -437.9875 10920.94 8662.202 -5.644549 21.24465 0.06659016 0.5764131
accuracy(forecast_stlm$mean,sales_test) # STLM for data test
#>                ME     RMSE      MAE       MPE     MAPE       ACF1 Theil's U
#> Test set 1260.026 11078.65 9184.929 -3.482281 26.44826 0.08199727 0.8033282

Insight :

  • Of the models tested so far, STLM performs best with a test MAPE of about 26.4%, but it still seems to overfit because its test MAPE is higher than its training MAPE (21.2%).
  • All models have a MAPE above 25%, which means none of them is optimal for this data set. For further experiments, we could try a recurrent neural network (RNN), VAR, DFM, or SARIMAX, targeting a MAPE below 10%. A compact MAPE comparison sketch follows.
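
A compact comparison sketch (an addition) using MLmetrics::MAPE() for the two models that forecast the full 24-month test window; the values should match the MAPE column of the accuracy output above.

# Test-set MAPE of Holt-Winters and STLM, in percent
data.frame(
  model     = c("Holt-Winters", "STLM"),
  MAPE_test = c(
    MLmetrics::MAPE(sales_hw_fc$mean, sales_test),
    MLmetrics::MAPE(forecast_stlm$mean, sales_test)
  ) * 100
)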

10. Assumption Check for SARIMA

# using the Ljung-Box test
Box.test(sales_auto$residuals, type = "Ljung-Box")
#> 
#>  Box-Ljung test
#> 
#> data:  sales_auto$residuals
#> X-squared = 0.12586, df = 1, p-value = 0.7228

Insight :

  • The p-value > 0.05, which means the residuals of the time series model show no autocorrelation
shapiro.test(sales_auto$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  sales_auto$residuals
#> W = 0.98908, p-value = 0.7927

Insight :

  • The p-value > 0.05, which means the residuals are normally distributed; a compact residual-diagnostics sketch follows
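
As a compact alternative (a sketch using the forecast package, which is already loaded), checkresiduals() combines the Ljung-Box test with residual diagnostic plots in a single call.

# Ljung-Box test plus residual plots for the auto.arima model
checkresiduals(sales_auto)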

11. Conclusion

  • After trial and error with our models, we can see that all of them overfit: they perform well on the training data but not on the test data.
  • For further experiments, we should try a recurrent neural network (RNN), VAR, DFM, or SARIMAX, targeting a MAPE below 10%. We use MAPE as the benchmark because it is easy to interpret as a measure of model performance.
  • As a limitation of this experiment, the series contains both seasonality and trend, which is why we do not try SES or Holt’s exponential smoothing.
  • The time series assumptions are tested to check whether the residuals from the modeling adequately describe and capture the information in the data.

Considering all of the conclusions above, our goal has not yet been achieved, because the current models overfit and are therefore not reliable enough to serve as a benchmark.