A large Indian retail chain has stores across 3 states in India: Maharashtra, Telangana and Kerala. These stores stock products across various categories such as FMCG (fast moving consumer goods), eatables / perishables and others. Managing the inventory is crucial for the revenue stream of the retail chain. Meeting the demand is important to not lose potential revenue, while at the same time stocking excessive products could lead to losses.
In this case study I will build a time series model to forecast one month of product sales for the FMCG category.
library(dplyr) # for data wrangling
library(lubridate) # to deal with dates
library(padr) # for padding missing dates
library(forecast) # time series library
library(tseries) # for adf.test
library(MLmetrics) # to calculate error metrics
library(zoo) # to impute missing values (na.locf)
The data is obtained from Kaggle and contains daily product sales from each store of the retail chain in India.
fmcg <- read.csv("data.csv")
glimpse(fmcg)
#> Rows: 395,000
#> Columns: 7
#> $ date <chr> "2012-01-01", "2012-01-01", "2012-01-01", "2012-~
#> $ product_identifier <int> 74, 337, 423, 432, 581, 611, 631, 659, 743, 797,~
#> $ department_identifier <int> 11, 11, 12, 12, 21, 21, 21, 21, 21, 21, 21, 21, ~
#> $ category_of_product <chr> "others", "others", "others", "others", "fast_mo~
#> $ outlet <int> 111, 111, 111, 111, 111, 111, 111, 111, 111, 111~
#> $ state <chr> "Maharashtra", "Maharashtra", "Maharashtra", "Ma~
#> $ sales <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 3, ~
Data Description:
date : date of the sales observation
product_identifier : the id of a product
department_identifier : the id of a specific department in a store
category_of_product : the category to which a product belongs
outlet : the id of a store
state : the name of the state
sales : the number of units sold
fmcg <- fmcg %>%
mutate(category_of_product = as.factor(category_of_product),
state = as.factor(state),
date = ymd(date))
head(fmcg)
From this data set we are only concerned with forecasting FMCG sales, so we take the date and sales columns for products in the FMCG category.
fmcg_data <- fmcg %>%
filter(category_of_product == "fast_moving_consumer_goods") %>%
select(date,sales)
daily_fmcg <- fmcg_data %>%
group_by(date) %>%
summarise(sales = sum(sales)) %>%
arrange(date)
daily_fmcg
A good time series is sequential and has no empty values; by using pad(), every date in the range is checked and any missing days are filled in.
daily_fmcg %>% filter(sales == 0)
daily_fmcg <- daily_fmcg[daily_fmcg$sales > 0, ]
daily_fmcg %>% filter(sales == 0 )
Zero-sales days are removed above; pad() then re-inserts those dates as NA, and na.locf() replaces them with the sales value of the previous day:
daily_fmcg <- daily_fmcg %>%
arrange(date) %>%
pad() %>% # re-insert the removed dates as NA
na.locf() # fill NA sales with the previous day's value
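As a quick sanity check (an illustrative extra step, not part of the original workflow), we can confirm that the padded series now has one row per calendar day and no remaining missing values:
# illustrative check: one row per day and no NA sales left after padding
nrow(daily_fmcg) == length(seq(min(daily_fmcg$date), max(daily_fmcg$date), by = "day"))
sum(is.na(daily_fmcg$sales))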
range(daily_fmcg$date)
#> [1] "2012-01-01" "2014-02-28"
fmcg_ts <- ts(data = daily_fmcg$sales,
start = range(daily_fmcg$date)[[1]],
frequency = 7) #weekly seasonality
In this step I analyse whether the data has a trend and seasonality, and whether there is one seasonal pattern or more than one.
fmcg_decom <- decompose(fmcg_ts)
autoplot(fmcg_decom)
From the plot above, the data can be said to be multiseasonal: the extracted trend component still shows another, longer seasonal pattern.
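To make that leftover pattern easier to see, a simple illustrative check is to plot the trend component of the decomposition on its own:
# illustrative: the trend extracted with a weekly frequency still oscillates,
# hinting at a longer seasonal period
autoplot(fmcg_decom$trend)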
# making the 2nd ts object
daily_fmcg$sales %>%
msts(seasonal.periods = c(7, 7*4)) %>% # multiseasonal ts (weekly, monthly)
mstl() %>% # multiseasonal ts decomposition
autoplot()
# making the 3rd ts object
daily_fmcg$sales %>%
msts(seasonal.periods = c(7*4*3, 7*4*12)) %>% # multiseasonal ts (quarterly, yearly)
mstl() %>% # multiseasonal ts decomposition
autoplot()
The objects above are three seasonality experiments; the last one, with quarterly and yearly seasonal periods, fits the two seasonal patterns and the downward trend best, so it is used as the final time series object.
#assign final ts object
fmcg_msts <- daily_fmcg$sales %>%
msts(seasonal.periods = c(7*4*3, 7*4*12))
# check for stationarity
adf.test(fmcg_msts)
#>
#> Augmented Dickey-Fuller Test
#>
#> data: fmcg_msts
#> Dickey-Fuller = -4.2625, Lag order = 9, p-value = 0.01
#> alternative hypothesis: stationary
Based on the Augmented Dickey-Fuller test (adf.test), the p-value is below alpha (0.05), so the data is already stationary. Therefore, when building a model with seasonal ARIMA, we do not need to difference the data first.
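As a cross-check (illustrative, not part of the original workflow), forecast::ndiffs() estimates how many differences would be needed to make the series stationary:
# illustrative: estimate the required number of differences using the ADF criterion;
# a result of 0 would agree with the test above
ndiffs(fmcg_msts, test = "adf")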
For cross-validation I hold out the last month (four weeks) of data as the test set for the model.
fmcg_train <- function(x){
head(x, length(x) - 7*4) # everything except the last 4 weeks
}
fmcg_test <- function(x){
tail(x, 7*4) # the last 4 weeks
}
train_1 <- fmcg_train(fmcg_msts)
test_1 <- fmcg_test(fmcg_msts)
For modelling we will compare ETS (Holt-Winters) with seasonal ARIMA, because the data has both seasonality and trend.
One way to decompose the data while still keeping the information from all of it is STL (Seasonal and Trend decomposition using Loess). Conceptually, STL smooths each observation using its neighbours, giving heavier weight to the data points closest to the observed point.
We can then model the STL result with exponential smoothing (ETS) or ARIMA. In this way, STLM can be used as an alternative to capture seasonality that the usual ETS and ARIMA methods cannot capture.
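Roughly speaking, stlm() decomposes the series, fits a model to the seasonally adjusted part, and re-adds the seasonal component when forecasting. A minimal sketch of that idea (illustrative only; it omits the log/Box-Cox transformation applied below via lambda = 0):
# illustrative sketch of what stlm(method = "arima") does internally
dec <- mstl(train_1) # multiseasonal STL decomposition
adj <- dec[, "Trend"] + dec[, "Remainder"] # seasonally adjusted series
fit_adj <- auto.arima(adj) # model the seasonally adjusted part
# forecasting fit_adj and adding the seasonal component back is what
# forecast() on an stlm object does for us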
# ETS (Holt-Winters)
fmcg_ets <- stlm(train_1, method = "ets", lambda = 0) # lambda = 0 applies a log (Box-Cox) transformation
# SARIMA
fmcg_arima <- stlm(train_1, method = "arima", lambda = 0)
compare_forecast <- function(x, test){
lapply(x, function(x) forecast::accuracy(x, test)["Test set", ]) %>%
lapply(., function(x) x %>% t() %>% as.data.frame) %>%
bind_rows() %>%
mutate(model = names(x)) %>%
select(model, everything())
}
#Forecast
forecast_ets <- forecast(object = fmcg_ets, h = length(test_1))
forecast_arima <- forecast(object = fmcg_arima, h = length(test_1))
#Evaluation
forecast_list <- list(
"ETS" = forecast_ets,
"Arima" = forecast_arima
)
compare_forecast(forecast_list,test_1 )
From the comparison above, ARIMA produces a smaller MAPE than ETS.
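The same MAPE can be cross-checked with the MLmetrics package loaded earlier (illustrative; note that MLmetrics::MAPE() returns a fraction, while forecast::accuracy() reports a percentage):
# illustrative cross-check of the MAPE values
MLmetrics::MAPE(y_pred = as.numeric(forecast_arima$mean), y_true = as.numeric(test_1))
MLmetrics::MAPE(y_pred = as.numeric(forecast_ets$mean), y_true = as.numeric(test_1))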
fmcg_msts %>%
autoplot(series = "Actual") +
autolayer(forecast_ets$mean, series = "ETS") +
autolayer(forecast_arima$mean, series = "ARIMA")
shapiro.test(forecast_arima$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: forecast_arima$residuals
#> W = 0.99535, p-value = 0.02145
With a p-value of 0.02145 (< 0.05), the Shapiro-Wilk test suggests the residuals deviate somewhat from normality; the histogram below gives a visual check of their distribution.
hist(forecast_arima$residuals, breaks = 50)
TSstudio::test_forecast(actual = fmcg_msts,
forecast.obj = forecast_arima,
train = train_1,
test = test_1)
Using the selected model: since ARIMA performed better, the full series is now used to refit it for the final forecast.
model_fmcg <- fmcg_msts
Prediction data
The data to predict covers one month; we keep only the FMCG category and take the distinct dates across all stores.
predict <- read.csv("test_data.csv")
predict <- predict %>%
filter(category_of_product == "fast_moving_consumer_goods") %>%
select(-id)
predict_clean <- predict %>% mutate(date = ymd(date))
predict_forecast <- predict_clean[!duplicated(predict_clean$date),] %>%
select(date)
predict_forecast
Forecast using the ARIMA model:
arima_model <- stlm(model_fmcg, method = "arima", lambda = 0)
#forecast
arima_f <- forecast(arima_model, h = 31) # the number of days in the prediction period
model_fmcg %>%
autoplot(series = "Actual") +
autolayer(arima_f$mean, series = "Forecast")
# data frame combining the prediction dates with the forecast
predict_forecast %>%
cbind(arima_f) %>% as.data.frame()
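Optionally (an illustrative extra step; the object and file names below are assumptions), the combined forecast could be saved for later use, for example as a submission file:
# illustrative: store the combined forecast and write it to disk (file name assumed)
fmcg_forecast <- predict_forecast %>%
cbind(arima_f) %>% as.data.frame()
write.csv(fmcg_forecast, "fmcg_forecast.csv", row.names = FALSE)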
From the results of the analysis above, the data has a multiseasonal pattern (quarterly and yearly), and ARIMA is the best model with a MAPE of about 20%, so the forecasting accuracy of the model can be said to be about 80%.