Retail Sales
The full data set can be accessed here: Predict Demand.
Time series analysis describes a set of research problems where observations are collected at regular time intervals and where we can assume correlations among successive observations. The principal idea is to learn the inherent structures and patterns in these past observations in order to generate future values for the series. The activity of generating future values is called “forecasting”.
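As a minimal illustration (with made-up numbers, separate from the retail data analyzed below), a monthly series in R is stored as a ts object that records the start period and the frequency:

# a hypothetical two-year monthly series; frequency = 12 observations per year
toy <- ts(c(120, 131, 145, 140, 128, 150, 165, 168, 152, 133, 118, 139,
            125, 138, 154, 147, 136, 160, 177, 181, 164, 141, 124, 147),
          start = c(2020, 1), frequency = 12)
toy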
💡 Main requirements for time series data:
- the observations are ordered by time;
- the time intervals are regular, with no gaps in the sequence;
- there are no missing values.
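If raw data violates these requirements, the padr and zoo packages (both loaded below) can repair it. A minimal sketch, assuming a hypothetical data frame df with a date column that has gaps and a missing value:

library(dplyr)
library(padr)
library(zoo)

# hypothetical daily data with a gap (Jan 3-4 missing) and an NA value
df <- data.frame(date = as.Date(c("2021-01-01", "2021-01-02", "2021-01-05")),
                 qty  = c(10, NA, 12))

df_clean <- df %>%
  pad(interval = "day") %>%    # insert the missing dates as NA rows
  mutate(qty = na.locf(qty))   # fill NAs with the last observed value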
EDA for a time series starts with decomposition, which tells us whether the series is additive or multiplicative. Decomposition splits the series into its components: trend, seasonality, and error (remainder).
There are 2 types of models in time series data, namely:

Additive (Data = Trend + Seasonal + Error):

\[Y_t = T_t + S_t + E_t\]

Multiplicative (Data = Trend × Seasonal × Error):

\[Y_t = T_t \times S_t \times E_t\]
From the decomposition results, check the trend component: if it still contains a repeating pattern, either the frequency needs to be changed or there is an indication of multi-seasonality.
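As a minimal sketch of the two decomposition types, using R's built-in AirPassengers data rather than the retail data loaded below:

library(forecast) # provides autoplot() methods for decomposed series
library(ggplot2)

# additive: the size of the seasonal swings stays roughly constant
autoplot(decompose(AirPassengers, type = "additive"))

# multiplicative: the seasonal swings grow with the level of the series
autoplot(decompose(AirPassengers, type = "multiplicative"))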
library(dplyr) # data wrangling
library(lubridate) # date manipulation
library(padr) # complete data frame
library(zoo) # Missing value imputation
library(forecast) # time series library
library(TTR) # for Simple moving average function
library(MLmetrics) # calculate error
library(tseries) # adf.test
library(TSstudio) # nicer time series visualization
library(ggplot2)
library(plotly)
library(tidyr)

sales <- read.csv("data_input/demand.csv")
sales <- sales %>%
select(-id) %>% # Remove ID
select(-lat) %>% # Remove Latitude
select(-long) %>% # Remove Longitude
select(-pop) # Remove population of area
glimpse(sales)

#> Rows: 7,560
#> Columns: 8
#> $ date <chr> "31/01/12", "31/01/12", "31/01/12", "31/01/12", "31/01/12", …
#> $ city <chr> "Athens", "Athens", "Athens", "Athens", "Athens", "Athens", …
#> $ shop <chr> "shop_1", "shop_1", "shop_1", "shop_1", "shop_1", "shop_1", …
#> $ brand <chr> "kinder-cola", "kinder-cola", "kinder-cola", "adult-cola", "…
#> $ container <chr> "glass", "plastic", "can", "glass", "can", "glass", "can", "…
#> $ capacity <chr> "500ml", "1.5lt", "330ml", "500ml", "330ml", "500ml", "330ml…
#> $ price <dbl> 0.96, 2.86, 0.87, 1.00, 0.39, 1.00, 0.43, 0.49, 0.70, 2.21, …
#> $ quantity <int> 13280, 6727, 9848, 20050, 25696, 15041, 34578, 44734, 18623,…
colSums(is.na(sales))

#>      date      city      shop     brand container  capacity     price  quantity
#>         0         0         0         0         0         0      1080      1080
Insight :

city, shop, brand, container, and capacity should be converted to factors, and date should be parsed as a proper date using the lubridate library.

sales <- sales %>%
mutate_at(vars(city,shop,brand,container,capacity), as.factor) %>%
mutate(date=dmy(date)) %>%
mutate(date = floor_date(date,unit = "month")) %>%
drop_na() # drop rows with missing values
glimpse(sales)

#> Rows: 6,480
#> Columns: 8
#> $ date <date> 2012-01-01, 2012-01-01, 2012-01-01, 2012-01-01, 2012-01-01,…
#> $ city <fct> Athens, Athens, Athens, Athens, Athens, Athens, Athens, Athe…
#> $ shop <fct> shop_1, shop_1, shop_1, shop_1, shop_1, shop_1, shop_1, shop…
#> $ brand <fct> kinder-cola, kinder-cola, kinder-cola, adult-cola, adult-col…
#> $ container <fct> glass, plastic, can, glass, can, glass, can, glass, glass, p…
#> $ capacity <fct> 500ml, 1.5lt, 330ml, 500ml, 330ml, 500ml, 330ml, 500ml, 500m…
#> $ price <dbl> 0.96, 2.86, 0.87, 1.00, 0.39, 1.00, 0.43, 0.49, 0.70, 2.21, …
#> $ quantity <int> 13280, 6727, 9848, 20050, 25696, 15041, 34578, 44734, 18623,…
Insight :

As supply chain staff, we need to know which products should be prepared before moving on to the prediction step.
sales_city<- sales %>%
group_by(city) %>%
count() %>%
rename(total = n)
sales_city <- ggplot(sales_city)+
geom_col(mapping = aes(x= city,y=total),fill = 'navy')
ggplotly(sales_city)

Insight :
sales %>%
group_by(city,brand,container,capacity) %>%
filter(city == "Athens") %>%
summarise(total_qty = sum(quantity)) %>%
arrange(-total_qty)

Insight :
sales %>%
group_by(city,shop) %>%
count() %>%
rename(total = n) %>%
filter(city=="Athens")sales1 <- sales %>%
filter(city=="Athens") %>%
filter(shop == "shop_1") %>%
filter(brand == "gazoza") %>%
filter(capacity == "330ml")
sales1

unique(sales1$date)

#> [1] "2012-01-01" "2012-02-01" "2012-03-01" "2012-04-01" "2012-05-01"
#> [6] "2012-06-01" "2012-07-01" "2012-08-01" "2012-09-01" "2012-10-01"
#> [11] "2012-11-01" "2012-12-01" "2013-01-01" "2013-02-01" "2013-03-01"
#> [16] "2013-04-01" "2013-05-01" "2013-06-01" "2013-07-01" "2013-08-01"
#> [21] "2013-09-01" "2013-10-01" "2013-11-01" "2013-12-01" "2014-01-01"
#> [26] "2014-02-01" "2014-03-01" "2014-04-01" "2014-05-01" "2014-06-01"
#> [31] "2014-07-01" "2014-08-01" "2014-09-01" "2014-10-01" "2014-11-01"
#> [36] "2014-12-01" "2015-01-01" "2015-02-01" "2015-03-01" "2015-04-01"
#> [41] "2015-05-01" "2015-06-01" "2015-07-01" "2015-08-01" "2015-09-01"
#> [46] "2015-10-01" "2015-11-01" "2015-12-01" "2016-01-01" "2016-02-01"
#> [51] "2016-03-01" "2016-04-01" "2016-05-01" "2016-06-01" "2016-07-01"
#> [56] "2016-08-01" "2016-09-01" "2016-10-01" "2016-11-01" "2016-12-01"
#> [61] "2017-01-01" "2017-02-01" "2017-03-01" "2017-04-01" "2017-05-01"
#> [66] "2017-06-01" "2017-07-01" "2017-08-01" "2017-09-01" "2017-10-01"
#> [71] "2017-11-01" "2017-12-01"
interval <- seq.Date(from= as.Date(range(sales1$date)[1]),
to = as.Date(range(sales1$date)[2]),
by = "month")
length(interval)

#> [1] 72

all(interval == sales1$date)

#> [1] TRUE
Insight :

The series is complete: 72 consecutive monthly observations from January 2012 to December 2017, with no gaps, so no padding is needed.
# add monthly column
sales1.1 <- sales1 %>%
mutate(monthly = month(date, label = T))
# visualize
sales1.1 <- sales1.1 %>%
ggplot(aes(x = date, y =quantity))+
geom_point(data = sales1.1, col = "firebrick")+
geom_line()
ggplotly(sales1.1)

Insight :
sales_ts <- ts(data = sales1$quantity, start = c(2012, 1), frequency = 12)
sales_dec <- sales_ts %>% decompose() %>% autoplot()
sales_dec
Insight :

The decomposition separates a trend and a yearly seasonal pattern; next, the trend is checked for any remaining repeating pattern using a multi-seasonal decomposition.
sales1$quantity %>%
msts(seasonal.periods = c(12), start = c(2012,1)) %>% # multiseasonal ts (monthly)
mstl() %>% # multiseasonal ts decomposition
autoplot()
Insight :
# assign final ts object
sales_msts <- sales1$quantity %>%
msts(seasonal.periods = c(12) , start = c(2012,1))
# check for stationarity
adf.test(sales_msts)

#>
#> Augmented Dickey-Fuller Test
#>
#> data: sales_msts
#> Dickey-Fuller = -4.6232, Lag order = 4, p-value = 0.01
#> alternative hypothesis: stationary
Insight :

The p-value (0.01) is below 0.05, so we reject H0: the series is stationary and ready for ARIMA modelling.
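Had the test failed to reject, the series would need differencing before ARIMA modelling; a minimal sketch (not required here, since the series is already stationary):

library(forecast)
library(tseries)

# estimate how many differences would be needed (0 here)
ndiffs(sales_msts)  # regular (trend) differences
nsdiffs(sales_msts) # seasonal differences

# if differencing were needed: difference once, then re-run the ADF test
adf.test(diff(sales_msts, lag = 1))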
Split the data into train and test sets, holding out the last 2 years (24 of the 72 months) for testing.
sales_train <- sales_msts %>%
head(-2*12)
sales_test <- sales_msts %>%
tail(2*12)

Because the data contains both seasonality and trend, I build predictive models using Holt-Winters and SARIMA.
sales_hw <- HoltWinters(sales_train)
sales_hw$fitted

#>               xhat    level     trend      season
#> Jan 2013 23133.83 59552.69 -526.3466 -35892.5174
#> Feb 2013 47437.04 59109.20 -526.3466 -11145.8090
#> Mar 2013 41238.30 58593.79 -526.3466 -16829.1424
#> Apr 2013 51819.62 58052.48 -526.3466 -5706.5174
#> May 2013 61212.49 57520.73 -526.3466 4218.1076
#> Jun 2013 103446.08 56925.24 -526.3466 47047.1910
#> Jul 2013 81358.70 56213.89 -526.3466 25671.1493
#> Aug 2013 49752.14 53810.17 -526.3466 -3531.6840
#> Sep 2013 46562.39 53619.63 -526.3466 -6530.8924
#> Oct 2013 49751.34 52363.54 -526.3466 -2085.8507
#> Nov 2013 62615.79 52931.28 -526.3466 10210.8576
#> Dec 2013 43922.20 49873.44 -526.3466 -5424.8924
#> Jan 2014 13368.01 49324.97 -526.3466 -35430.6181
#> Feb 2014 39641.24 51252.43 -526.3466 -11084.8394
#> Mar 2014 32372.54 49811.43 -526.3466 -16912.5491
#> Apr 2014 43346.47 49609.46 -526.3466 -5736.6404
#> May 2014 52892.50 49586.23 -526.3466 3832.6232
#> Jun 2014 94988.87 49499.39 -526.3466 46015.8245
#> Jul 2014 59983.17 45304.65 -526.3466 15204.8649
#> Aug 2014 41006.38 43192.34 -526.3466 -1659.6175
#> Sep 2014 31196.71 42322.23 -526.3466 -10599.1768
#> Oct 2014 45259.61 41772.29 -526.3466 4013.6650
#> Nov 2014 36327.32 40755.81 -526.3466 -3902.1397
#> Dec 2014 33219.87 39294.43 -526.3466 -5548.2146
#> Jan 2015 17364.73 39641.84 -526.3466 -21750.7628
#> Feb 2015 22922.98 39633.29 -526.3466 -16183.9729
#> Mar 2015 23088.05 38718.55 -526.3466 -15104.1597
#> Apr 2015 35582.90 39041.05 -526.3466 -2931.8047
#> May 2015 43484.79 37728.26 -526.3466 6282.8737
#> Jun 2015 61866.85 36828.55 -526.3466 25564.6514
#> Jul 2015 41739.87 35903.03 -526.3466 6363.1918
#> Aug 2015 33202.15 37304.56 -526.3466 -3576.0664
#> Sep 2015 26309.71 37566.76 -526.3466 -10730.7054
#> Oct 2015 39453.76 38698.97 -526.3466 1281.1369
#> Nov 2015 27915.03 37556.24 -526.3466 -9114.8701
#> Dec 2015 37077.04 38280.49 -526.3466 -677.0983
sales_hw_fc <- forecast(object = sales_hw, h = 24)

# visualize using autoplot and autolayer
sales_hw_vis <- sales_ts %>%
autoplot() +
autolayer(sales_hw$fitted[, 1], series = "train") +
autolayer(sales_hw_fc$mean, series = "forecast")
ggplotly(sales_hw_vis)

accuracy(sales_hw$fitted[,1], sales_train)

#>                 ME     RMSE      MAE       MPE     MAPE        ACF1 Theil's U
#> Test set -1370.346 15471.57 11024.39 -7.410821 28.13871 -0.06714306 0.7668653
accuracy(sales_hw_fc$mean, sales_test)

#>                ME     RMSE      MAE      MPE     MAPE      ACF1 Theil's U
#> Test set 11199.02 16080.92 13304.33 25.15117 32.26329 0.1549821 1.172273
sales_auto <- auto.arima(sales_msts)
sales_auto

#> Series: sales_msts
#> ARIMA(1,1,1)(1,0,0)[12]
#>
#> Coefficients:
#> ar1 ma1 sar1
#> 0.2420 -0.9284 0.2944
#> s.e. 0.1602 0.0546 0.1442
#>
#> sigma^2 = 214265849: log likelihood = -781.38
#> AIC=1570.76 AICc=1571.37 BIC=1579.81
sales_msts %>%
diff(lag = 12) %>% # seasonal differencing
diff(lag = 1) %>% # regular (trend) differencing
tsdisplay(lag.max = 36)
Insight :

The ACF/PACF of the differenced series suggests the candidate orders tried below.
For manual tuning, based on the ACF/PACF of the differenced series we try two candidate models: SARIMA(1,1,1)(0,0,0)[12] and SARIMA(4,1,1)(0,0,0)[12].
SARIMA1 <- Arima(sales_msts, order = c(1,1,1), seasonal = c(0,0,0))
SARIMA2 <- Arima(sales_msts, order = c(4,1,1), seasonal = c(0,0,0))

Check goodness of fit using the AIC value:
sales_auto$aic

#> [1] 1570.76

SARIMA1$aic

#> [1] 1572.676

SARIMA2$aic

#> [1] 1583.208
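To inspect the search that auto.arima() performs, its trace argument prints each candidate model with its information criterion; a minimal sketch:

library(forecast)

# print every candidate model and its AICc during the stepwise search
auto.arima(sales_msts, trace = TRUE)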
Insight :

auto.arima() automatically searches for the model with the lowest AIC, and its result (AIC = 1570.76) beats both manually tuned models.

sales_train_fc <- forecast(object = sales_auto, h = 12)
sales_train_fc

#>          Point Forecast    Lo 80    Hi 80     Lo 95    Hi 95
#> Jan 2018 31750.37 12991.24 50509.49 3060.757 60439.98
#> Feb 2018 33530.74 13871.02 53190.47 3463.791 63597.70
#> Mar 2018 39462.19 19608.81 59315.57 9099.063 69825.31
#> Apr 2018 41782.85 21827.75 61737.94 11264.165 72301.53
#> May 2018 34810.15 14771.34 54848.96 4163.440 65456.86
#> Jun 2018 39640.83 19522.62 59759.04 8872.685 70408.97
#> Jul 2018 43231.71 23035.35 63428.06 12344.048 74119.36
#> Aug 2018 43192.79 22918.82 63466.76 12186.423 74199.15
#> Sep 2018 38591.80 18240.57 58943.04 7467.270 69716.34
#> Oct 2018 41018.25 20590.04 61446.45 9776.006 72260.48
#> Nov 2018 35825.70 15320.83 56330.57 4466.205 67185.20
#> Dec 2018 32714.24 12132.98 53295.49 1237.919 64190.55
sales_arima <- sales_ts %>%
autoplot() +
autolayer(sales_auto$fitted, series = "train") +
autolayer(sales_train_fc$mean, series = "forecast")
ggplotly(sales_arima)

Because the decomposition is additive, we can use STLM without applying a log() transformation.
sales_train %>% stl(s.window = 12) %>% autoplot()
### Model
model_stlm <- stlm(y = sales_train, s.window = 12, method = "arima")

forecast_stlm <- forecast(model_stlm, h = 24)

accuracy(forecast_stlm)

#>                     ME     RMSE      MAE       MPE     MAPE      MASE
#> Training set -437.9875 10920.94 8662.202 -5.644549 21.24465 0.5642026
#> ACF1
#> Training set 0.06659016
sales_stlm <- sales_ts %>%
autoplot() +
autolayer(model_stlm$fitted, series = "train") +
autolayer(forecast_stlm$mean, series = "forecast")
ggplotly(sales_stlm)

accuracy(sales_hw$fitted[,1], sales_train) # Holt-Winters on train

#>                 ME     RMSE      MAE       MPE     MAPE        ACF1 Theil's U
#> Test set -1370.346 15471.57 11024.39 -7.410821 28.13871 -0.06714306 0.7668653
accuracy(sales_hw_fc$mean, sales_test) # Holt-Winters on test

#>                ME     RMSE      MAE      MPE     MAPE      ACF1 Theil's U
#> Test set 11199.02 16080.92 13304.33 25.15117 32.26329 0.1549821 1.172273
accuracy(sales_auto$fitted, sales_train) # SARIMA on train

#>                 ME     RMSE      MAE       MPE    MAPE       ACF1 Theil's U
#> Test set -537.7388 15379.81 12059.13 -11.41317 29.5855 0.03458137 0.7700517
accuracy(sales_train_fc$fitted, sales_test) # SARIMA on test

#>                 ME     RMSE      MAE       MPE     MAPE       ACF1 Theil's U
#> Test set -350.9228 11576.23 9961.559 -11.08725 29.41554 0.01393134 0.7017803
accuracy(forecast_stlm$fitted, sales_train) # STLM on train

#>                 ME     RMSE      MAE       MPE     MAPE       ACF1 Theil's U
#> Test set -437.9875 10920.94 8662.202 -5.644549 21.24465 0.06659016 0.5764131
accuracy(forecast_stlm$mean, sales_test) # STLM on test

#>               ME     RMSE      MAE       MPE     MAPE       ACF1 Theil's U
#> Test set 1260.026 11078.65 9184.929 -3.482281 26.44826 0.08199727 0.8033282
Insight :

On the test set, STLM gives the lowest error (MAPE ≈ 26.4%, RMSE ≈ 11,079), followed by SARIMA and then Holt-Winters, so STLM generalizes best among the three models.
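As a compact cross-check of the test errors, a minimal sketch using MLmetrics (loaded above); note that MLmetrics::MAPE returns a fraction rather than a percentage:

library(MLmetrics)

# test-set MAPE for the two 24-step-ahead forecasts (multiply by 100 for %)
data.frame(
  model = c("Holt-Winters", "STLM"),
  MAPE  = c(MAPE(y_pred = sales_hw_fc$mean,   y_true = sales_test),
            MAPE(y_pred = forecast_stlm$mean, y_true = sales_test))
)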
# using the Ljung-Box test
Box.test(sales_auto$residuals, type = "Ljung-Box")

#>
#> Box-Ljung test
#>
#> data: sales_auto$residuals
#> X-squared = 0.12586, df = 1, p-value = 0.7228
Insight :

The p-value (0.7228) is above 0.05, so we fail to reject H0: the residuals show no significant autocorrelation.
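The forecast package also bundles this test with residual diagnostic plots; a minimal sketch:

library(forecast)

# Ljung-Box test plus time plot, ACF, and histogram of the residuals
checkresiduals(sales_auto)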
shapiro.test(sales_auto$residuals)

#>
#> Shapiro-Wilk normality test
#>
#> data: sales_auto$residuals
#> W = 0.98908, p-value = 0.7927
Insight :

The p-value (0.7927) is above 0.05, so we fail to reject H0: the residuals are normally distributed.
Considering all the results above, our goal is not yet achieved: although the SARIMA residuals pass the autocorrelation and normality checks, the models still appear to overfit, with test-set MAPE above 25% for every model, so none of them is good enough to serve as our benchmark.