1 Case

A large Indian retail chain has stores across 3 states in India: Maharashtra, Telangana and Kerala. These stores stock products across various categories such as FMCG (fast moving consumer goods), eatables / perishables and others. Managing the inventory is crucial for the revenue stream of the retail chain. Meeting demand is important to avoid losing potential revenue, while at the same time stocking excessive product can lead to losses.

In this case I will build a time series model to forecast one month of product sales for the FMCG category.

2 Libraries

library(dplyr)     # data wrangling
library(lubridate) # working with dates
library(padr)      # padding missing dates
library(forecast)  # time series modelling and forecasting
library(tseries)   # adf.test for stationarity
library(MLmetrics) # error metrics
library(zoo)       # missing-value imputation (na.locf)

3 Read Data

The data was obtained from Kaggle and contains daily product sales for each store in India.

fmcg <- read.csv("data.csv")
glimpse(fmcg)
#> Rows: 395,000
#> Columns: 7
#> $ date                  <chr> "2012-01-01", "2012-01-01", "2012-01-01", "2012-~
#> $ product_identifier    <int> 74, 337, 423, 432, 581, 611, 631, 659, 743, 797,~
#> $ department_identifier <int> 11, 11, 12, 12, 21, 21, 21, 21, 21, 21, 21, 21, ~
#> $ category_of_product   <chr> "others", "others", "others", "others", "fast_mo~
#> $ outlet                <int> 111, 111, 111, 111, 111, 111, 111, 111, 111, 111~
#> $ state                 <chr> "Maharashtra", "Maharashtra", "Maharashtra", "Ma~
#> $ sales                 <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 3, ~

Data Description:
date : date of the sales observation
product_identifier : the id of a product
department_identifier : the id of a specific department in a store
category_of_product : the category to which a product belongs
outlet : the id of a store
state : the name of the state
sales : the number of units sold

4 Data Wrangling

fmcg <- fmcg %>% 
  mutate(category_of_product = as.factor(category_of_product),
         state = as.factor(state),
         date = ymd(date))
head(fmcg)

4.1 Data Aggregation

From the data set we are only concerned with forecasting FMCG sales, so we keep just the date and sales columns for the FMCG product category.

fmcg_data <- fmcg %>% 
             filter(category_of_product == "fast_moving_consumer_goods") %>% 
             select(date,sales)
daily_fmcg <- fmcg_data %>% 
  group_by(date) %>% 
  summarise(sales = sum(sales)) %>% 
  arrange(date)
daily_fmcg

A good time series is sequential and has no missing values; pad() checks the sequence of dates and inserts a row for any date that is absent.

# days with zero recorded sales are treated as missing observations
daily_fmcg %>% filter(sales == 0)
daily_fmcg <- daily_fmcg[daily_fmcg$sales > 0, ]
daily_fmcg %>% filter(sales == 0) # confirm no zero-sales rows remain

Replace each removed zero-sales day with the previous day's sales: pad() reinserts the missing dates as NA rows, and na.locf() carries the last observation forward.

daily_fmcg <- daily_fmcg %>% 
  arrange(date) %>% 
  pad() %>%     # reinsert missing dates as NA rows
  na.locf()     # fill NA sales with the previous day's value
range(daily_fmcg$date)
#> [1] "2012-01-01" "2014-02-28"
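
As a quick sanity check (a minimal sketch using the objects above), we can confirm that no calendar gaps or missing values remain after padding:

all(diff(daily_fmcg$date) == 1) # every consecutive pair of dates differs by one day
anyNA(daily_fmcg$sales)         # FALSE: no missing sales remain after na.locf()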

5 Time Series Object & EDA

fmcg_ts <- ts(data = daily_fmcg$sales,
              start = range(daily_fmcg$date)[[1]], # first observation date (coerced to numeric)
              frequency = 7) # weekly seasonality

In this step I will analyze whether the data exhibit trend and seasonality, and whether there is a single seasonal pattern or more than one.

fmcg_decom <- decompose(fmcg_ts)
autoplot(fmcg_decom)

From the plot above, it can be said that the data are multi-seasonal: the trend component still shows a repeating seasonal pattern rather than a smooth line.

# second ts object: multiseasonal ts (weekly, monthly)
daily_fmcg$sales %>% 
  msts(seasonal.periods = c(7, 7*4)) %>% 
  mstl() %>% # multiseasonal decomposition
  autoplot()

# third ts object: multiseasonal ts (quarterly, yearly)
daily_fmcg$sales %>% 
  msts(seasonal.periods = c(7*4*3, 7*4*12)) %>% 
  mstl() %>% # multiseasonal decomposition
  autoplot()

Three time series objects were tried; the last one fits best, as its decomposition captures both seasonal patterns as well as the downward trend.

#assign final ts object
fmcg_msts <- daily_fmcg$sales %>% 
  msts(seasonal.periods = c(7*4*3, 7*4*12))

# check for stationary
adf.test(fmcg_msts)
#> 
#>  Augmented Dickey-Fuller Test
#> 
#> data:  fmcg_msts
#> Dickey-Fuller = -4.2625, Lag order = 9, p-value = 0.01
#> alternative hypothesis: stationary

Based on the Augmented Dickey-Fuller test (adf.test), the p-value is below alpha (0.05), so the series is already stationary. Therefore, for model building with SARIMA, we do not need to difference the data first.
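
Had the test suggested non-stationarity, the forecast package's ndiffs() gives a quick estimate of how many differences would be needed (a minimal sketch):

ndiffs(fmcg_msts) # estimated number of first differences needed; 0 for a stationary series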

6 Cross Validation

For cross validation, the last month (four weeks) of the series is held out as the test set.

fmcg_train <- function(x){
  head(x, length(x) - 7*4) # everything except the last 4 weeks
}

fmcg_test <- function(x){
  tail(x, 7*4) # the last 4 weeks
}

train_1 <- fmcg_train(fmcg_msts)
test_1 <- fmcg_test(fmcg_msts) 
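
As a quick check (a sketch using the objects defined above), the two pieces should partition the full series, with the test set covering exactly four weeks:

length(test_1)                                        # 28 days (four weeks)
length(train_1) + length(test_1) == length(fmcg_msts) # TRUE: a clean split, no overlap or gap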

7 Model Building

For modelling we will compare ETS (Holt-Winters) with seasonal ARIMA (SARIMA), because the data have both seasonality and trend.

One way to decompose the data while retaining the information from all of it is STL (Seasonal and Trend decomposition using Loess). Conceptually, STL smooths each observation using its neighbouring data points, giving heavier weight to the data closest to the observation.

To model the STL output, we can apply the exponential smoothing (ETS) or ARIMA methods. STLM therefore offers an alternative way to capture seasonality that the plain ETS and ARIMA methods cannot handle.
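
Under the hood, stlm() fits the chosen model to the seasonally adjusted series and adds the seasonal components back when forecasting. A minimal sketch of that seasonal adjustment step, assuming forecast's seasadj() (illustrative only):

train_1 %>% 
  mstl() %>%    # decompose into trend, seasonal components and remainder
  seasadj() %>% # strip out the seasonal components
  autoplot()    # this seasonally adjusted series is what ETS / ARIMA are fitted on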

# ETS (Holt-Winters) on the STL-decomposed series
fmcg_ets <- stlm(train_1, method = "ets", lambda = 0) # lambda = 0 applies a log (Box-Cox) transformation

# SARIMA on the STL-decomposed series
fmcg_arima <- stlm(train_1, method = "arima", lambda = 0)

8 Forecast & Evaluation

# compare accuracy metrics of several forecast objects against the same test set
compare_forecast <- function(fc_list, test){
  lapply(fc_list, function(fc) forecast::accuracy(fc, test)["Test set", ]) %>%
    lapply(function(metrics) metrics %>% t() %>% as.data.frame()) %>% 
    bind_rows() %>% 
    mutate(model = names(fc_list)) %>%
    select(model, everything())
}
#Forecast
forecast_ets <- forecast(object = fmcg_ets, h = length(test_1))
forecast_arima <- forecast(object = fmcg_arima, h = length(test_1))
#Evaluation
forecast_list <- list(
  "ETS" = forecast_ets,
  "Arima" = forecast_arima
)
compare_forecast(forecast_list, test_1)

From the comparison above, ARIMA produces a smaller MAPE than ETS.
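
Since MLmetrics is already loaded, the MAPE values can also be verified directly against the point forecasts (a quick check; note that MLmetrics returns a proportion rather than a percentage):

# multiply by 100 to compare with the MAPE column from accuracy()
MAPE(y_pred = as.numeric(forecast_arima$mean), y_true = as.numeric(test_1))
MAPE(y_pred = as.numeric(forecast_ets$mean),   y_true = as.numeric(test_1))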

fmcg_msts %>% 
  autoplot(series = "Actual") +
  autolayer(forecast_ets$mean, series = "ETS") + 
  autolayer(forecast_arima$mean, series = "ARIMA")

9 Shapiro Test

To check whether the residuals of the selected ARIMA model are normally distributed, we apply the Shapiro-Wilk test.

shapiro.test(forecast_arima$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  forecast_arima$residuals
#> W = 0.99535, p-value = 0.02145
hist(forecast_arima$residuals, breaks = 50)
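
With a p-value of 0.02145 (< 0.05), the Shapiro-Wilk test indicates that the residuals deviate somewhat from normality. As a complementary residual check (an addition beyond the original workflow), the Ljung-Box test from base R looks for leftover autocorrelation:

Box.test(forecast_arima$residuals, type = "Ljung-Box") # H0: the residuals are independently distributed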

TSstudio::test_forecast(actual = fmcg_msts, 
              forecast.obj = forecast_arima, 
              train = train_1, 
              test = test_1)

10 Predicting

Using the selected modelling approach, we refit on the full series:

model_fmcg <- fmcg_msts # full data set (train + test) for the final refit

Prediction data

The test data covers one month; we keep only the FMCG category and reduce it to one row per date to define the forecast horizon.

predict <- read.csv("test_data.csv")
predict <- predict %>% 
  filter(category_of_product == "fast_moving_consumer_goods") %>% 
  select(-id)
predict_clean <- predict %>% mutate(date = ymd(date))
predict_forecast <- predict_clean[!duplicated(predict_clean$date),] %>% 
  select(date)
predict_forecast

11 Model

Forecast using the ARIMA model:

# refit the SARIMA (STLM) model on the full series
arima_model <- stlm(model_fmcg, method = "arima", lambda = 0)

# forecast
arima_f <- forecast(arima_model, h = 31) # 31 days in the prediction period
model_fmcg %>% 
  autoplot(series = "Actual") +
  autolayer(arima_f$mean, series = "Forecast")

# data frame of dates and their forecasts

predict_forecast %>% 
  cbind(arima_f) %>% 
  as.data.frame()
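
If the forecasts need to be shared or submitted, the table can be exported; a minimal sketch, where "submission.csv" is a placeholder file name:

# write the date / point-forecast table to disk
predict_forecast %>% 
  cbind(sales = as.numeric(arima_f$mean)) %>% 
  write.csv("submission.csv", row.names = FALSE)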

12 Conclusions

From the analysis above, the data exhibit a multi-seasonal pattern, namely quarterly and yearly, and the ARIMA model is the best model with a MAPE of about 20%, so the model's forecasting accuracy can be said to be roughly 80%.